Excluding Content

There are several ways to exclude your content from the search engine.

1. Robots.txt

Robots.txt is by far the most reliable and easiest way to exclude parts of your site from the search engine index. In brief, robots.txt tells search engine crawlers not to fetch content at specific URLs. If a file named robots.txt is present in the top-level (root) folder of your site, search engines will read it to learn which paths to exclude from their crawl. An excellent resource for this technique is robotstxt.org.

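To exclude your content from Northwestern's Google Programmable Search instance, you will have to remove it from Google.com, using either the robots.txt rule below or Google's Search Console. As written, the rule blocks Googlebot from the entire site; replace / with a narrower path to exclude only part of it.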

user-agent: googlebot
disallow: /
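
To check what a rule like the one above actually blocks, a small script can help. The following is a minimal sketch using Python's standard urllib.robotparser module; the is_allowed helper and the example.com URL are hypothetical names chosen for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rule from this section, as it would appear in robots.txt.
ROBOTS_TXT = """\
user-agent: googlebot
disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules.

    Hypothetical helper: parses the robots.txt text in memory instead
    of downloading it from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder host, not a real site configuration.
page = "https://example.com/any/page.html"
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", page)
other_bot_ok = is_allowed(ROBOTS_TXT, "bingbot", page)
print("Googlebot allowed:", googlebot_ok)
print("Other crawler allowed:", other_bot_ok)
```

Because the rule names only googlebot, crawlers that identify themselves differently are not affected by it.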

2. Meta Directives

Several meta tag directives exist to communicate to search engines that they shouldn't index web content. The most commonly used one is the ROBOTS directive. The following snippet in the <head> portion of your HTML documents will instruct engines not to index your content but still to follow its links (use nofollow in place of follow if the links should not be followed either):

<meta name="robots" content="noindex,follow">
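
To confirm that a page really carries the directive, the tag can be read back out of the HTML. The sketch below uses Python's standard html.parser module; the RobotsMetaParser class, the robots_directives function, and the sample page are assumptions made for illustration, not part of any search engine's tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without "=value".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical sample page containing the snippet from this section.
SAMPLE_PAGE = """<html><head>
<title>Example</title>
<meta name="robots" content="noindex,follow">
</head><body><p>Hello</p></body></html>"""

found = robots_directives(SAMPLE_PAGE)
print(sorted(found))
```

A page with no robots meta tag yields an empty set, which search engines generally treat as permission to index and follow.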

3. Web Server Headers

In some cases, it may not be possible to include the above <meta> tag in your HTML templates, or you may wish to exclude non-HTML content. Instead of using the <meta> directive, you can modify your server configuration to send the same information in an HTTP header like so:

X-Robots-Tag: noindex
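
How the header gets added depends on your web server, and most servers do it with a configuration directive. For sites served by a Python application, the idea can be sketched as a small WSGI wrapper that appends the header to every response. The add_x_robots_tag and demo_app names below are hypothetical, and this is an illustration under that assumption rather than a recommended production setup.

```python
def add_x_robots_tag(app, value="noindex"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical app serving non-HTML content, where a <meta> tag cannot be used."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF-1.4 placeholder"]


app = add_x_robots_tag(demo_app)

# Call the wrapped app directly, capturing what it would send to the server.
captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers


body = b"".join(app({}, fake_start_response))
print(captured["headers"])
```

The wrapped app could then be served by any WSGI server, for example with wsgiref.simple_server.make_server from the standard library during local testing.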