
Crawling and Indexing

It's important to distinguish between crawling and indexing. Crawling is the process of discovering new URLs by following links in pages. Indexing is the process of reading and processing the content of pages already known to exist so that they can be searched.

Crawling

The SearchBlox crawler starts from the Root URLs of each collection. Most collections are set to refresh weekly. If you require content to be refreshed more quickly, please contact search-help@northwestern.edu.

If your site seems to be running slowly during a SearchBlox crawl, a spider delay (a pause between requests, set in milliseconds) can be configured to reduce server load. Please contact Web Communications if you require assistance.
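To illustrate how link-following discovery and a spider delay fit together, here is a minimal sketch in Python. It is not SearchBlox's implementation. The `crawl` function, its `fetch` and `delay_ms` parameters, and the `example.edu` pages are all hypothetical, and an in-memory dictionary stands in for real HTTP requests.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(root_urls, fetch, delay_ms=0):
    """Breadth-first crawl from root_urls; returns URLs in discovery order.

    fetch is a hypothetical callable mapping a URL to HTML text (or None).
    delay_ms is the pause between requests, meant to reduce server load.
    """
    seen = list(dict.fromkeys(root_urls))  # de-duplicate, keep order
    visited = set(seen)
    queue = deque(seen)
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in visited:
                visited.add(absolute)
                seen.append(absolute)
                queue.append(absolute)
        if delay_ms and queue:
            time.sleep(delay_ms / 1000)  # the "spider delay"
    return seen


# Hypothetical in-memory site standing in for real HTTP requests.
SITE = {
    "https://example.edu/": '<a href="/about">About</a> <a href="/news">News</a>',
    "https://example.edu/about": '<a href="/">Home</a>',
    "https://example.edu/news": '<a href="/news/1">Story</a>',
    "https://example.edu/news/1": "",
}

discovered = crawl(["https://example.edu/"], SITE.get, delay_ms=10)
```

A larger `delay_ms` makes the crawl slower but spreads requests out, which is the trade-off a spider delay is meant to control. A production crawler would also restrict itself to the collection's own hosts, which this sketch does not do.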

If-Modified-Since header

The crawler sends an HTTP request to the server that includes an If-Modified-Since header. We recommend that servers be configured to honor this header and to report the date and time each page was last updated (in the Last-Modified response header). The request header looks like this:


If-Modified-Since: Wed, 19 Oct 2005 10:50:00 GMT

If the requested page has not been updated since the date supplied, the server should respond with status code 304 (Not Modified). Otherwise, it should respond with 200 (OK) and send the page. Properly configuring these responses will help search engines determine how often to re-index your site's content.
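The decision described above can be sketched in a few lines of Python. This is only an illustration of the 304 versus 200 logic, not a server configuration. The function name `conditional_response` and its parameters are hypothetical, and most web servers already implement this behavior for static files.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) for a GET with an optional If-Modified-Since value.

    last_modified must be a timezone-aware datetime of the page's last update.
    """
    # HTTP dates are in GMT with one-second resolution.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: ignore it and send the page
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # Not Modified: no body needs to be sent
    return 200, headers  # OK: send the full page


# Page last updated the day before the date in the crawler's header.
page_updated = datetime(2005, 10, 18, 9, 0, 0, tzinfo=timezone.utc)
status, headers = conditional_response("Wed, 19 Oct 2005 10:50:00 GMT", page_updated)
```

In this example `status` is 304 because the page has not changed since the date the crawler supplied. A page updated after that date, or a request with no If-Modified-Since header, would get 200.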

How can I submit my page for crawling?

Schools and some departments manage their own collections. If you have a manager account in SearchBlox, add your new site to your collection's Root URLs field, then trigger a refresh. If you don't have a manager account, please use this form to submit your northwestern.edu site.

Indexing

Indexing is the process by which the content on your page is examined and ingested into the "index" -- a database of keywords and associated pages (URLs) containing them.
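As a rough illustration of what "a database of keywords and associated pages" means, the following Python sketch builds a tiny inverted index. It is a simplified assumption about the general technique, not a description of how SearchBlox stores its index. The functions `build_index` and `search` and the `example.edu` pages are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


# Hypothetical page text keyed by URL.
pages = {
    "https://example.edu/admissions": "Undergraduate admissions and financial aid",
    "https://example.edu/aid": "Financial aid deadlines",
}
index = build_index(pages)
results = search(index, "financial aid")
```

Here `results` contains both URLs, while a search for "admissions" would return only the first. A real search engine also scores and ranks the matches, which this sketch leaves out.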

What content is indexed?

The entire page is stored in the index. The search algorithm uses specific components of the page to determine relevance for keyword searches. The easiest way to get your content to rank highly for a given term is to include that term in your content. The indexing algorithm ranks pages based on multiple components of the page:

For more information on how to improve each of these areas, see writing for search engines.

What about meta tagging and keywords?

The most recognized meta tagging schemes are the Dublin Core Metadata Element Set and the Open Graph protocol. SearchBlox automatically extracts meta tags for use in results.
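To show what extracting these tags from a page can look like, here is a small Python sketch using the standard library HTML parser. It is an assumption-laden illustration, not SearchBlox's extraction logic. The `MetaTagParser` class and the sample markup are hypothetical. It relies on the common conventions that Dublin Core tags use a `name` beginning with `DC.` or `DCTERMS.` and that Open Graph tags use a `property` beginning with `og:`.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects Dublin Core (name="DC.*") and Open Graph (property="og:*") meta tags."""

    PREFIXES = ("og:", "dc.", "dcterms.")

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses "property"; Dublin Core uses "name".
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None and key.lower().startswith(self.PREFIXES):
            self.meta[key] = content


# Hypothetical page head for illustration.
SAMPLE_HTML = """
<head>
  <meta name="DC.title" content="Crawling and Indexing">
  <meta property="og:title" content="Crawling and Indexing" />
  <meta name="viewport" content="width=device-width">
</head>
"""

parser = MetaTagParser()
parser.feed(SAMPLE_HTML)
extracted = parser.meta
```

After parsing, `extracted` holds the `DC.title` and `og:title` values and skips the unrelated `viewport` tag.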