Excluding Content

There are several ways to exclude your content from the search engine index.

1. Robots.txt

Robots.txt is the easiest and most reliable way to exclude parts of your site from the search engine index. In brief, robots.txt tells search engines not to index content at specific URLs. If a file named robots.txt is present at the root of your site, search engines will read it to determine which paths to exclude from their crawl. An excellent resource for this technique is robotstxt.org.

If you're interested in keeping just the Northwestern search engine out of your content, you can target the user agent "Northwestern-Search", e.g.


User-agent: Northwestern-Search
Disallow: /
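The example above blocks the entire site. If you only need to block part of it, you can list each path in its own Disallow line instead. As an illustrative sketch (the directory names below are placeholders, not paths from your site):

User-agent: Northwestern-Search
Disallow: /private/
Disallow: /drafts/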

2. Meta Directives

Several meta tag directives exist to communicate to search engines that they shouldn't index web content. The most commonly used one is the ROBOTS directive. The following snippet, placed in the <head> portion of your HTML documents, instructs engines not to index the page while still allowing them to follow its links:


<meta name="robots" content="noindex,follow">
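If you also want crawlers to ignore the links on the page, the standard robots meta values can be combined like so:

<meta name="robots" content="noindex,nofollow">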

3. Web Server Headers

In some cases, it may not be possible to include the above <meta> tag in your HTML templates, or you may wish to exclude non-HTML content. Instead of using the <meta> directive, you can send the same information in an HTTP response header like so:


X-Robots-Tag: noindex
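How you set this header depends on your web server. As one sketch, on an Apache server with mod_headers enabled, a .htaccess rule like the following would add the header to all PDF files (the file pattern is illustrative; adjust it to match the content you want excluded):

<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex"
</FilesMatch>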

4. Collection disallow path

If you manage your own SearchBlox collections, you may add URL patterns you don't want indexed to the Disallow Paths field. Note that if the content has already been indexed, you will need to clear the collection or remove the documents one by one using your collection's Documents tab or the SearchBlox API. 

If you don't manage your own collection, you may use this content removal form. Please specify the URL pattern you'd like blocked when making your request and be as specific as possible.