Crawling and Indexing
It's important to distinguish between crawling and indexing. Crawling is the process of discovering new URLs by following the links in pages. Indexing is the process of reading the content of pages already known to exist and processing it for search.
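To make the crawling half of that distinction concrete, here is a minimal Python sketch of link discovery for a single page. It is an illustration only, not SearchBlox's implementation; the names LinkCollector and discover_urls and the example.edu URLs are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def discover_urls(base_url, html):
    """Return the URLs linked from one page, in document order, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return list(dict.fromkeys(collector.links))
```

A crawler would repeat this step for each newly discovered URL, starting from a known set of pages. Indexing only begins once a page's URL is known.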
The SearchBlox crawler starts from the Root URLs of each collection. Most collections are set to refresh weekly. If you require content to be refreshed more quickly, please contact firstname.lastname@example.org.
If your site seems to be running slowly during a SearchBlox crawl, a spider delay can be set in milliseconds to reduce server load. Please contact Web Communications if you require assistance.
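As a rough picture of what a spider delay does, the Python sketch below pauses for a fixed number of milliseconds between requests. It is not SearchBlox code; crawl_with_delay, its fetch argument, and the delay_ms parameter are hypothetical names chosen for illustration.

```python
import time


def crawl_with_delay(urls, fetch, delay_ms, sleep=time.sleep):
    """Fetch each URL in turn, pausing delay_ms milliseconds between requests.

    fetch is any callable that takes a URL and returns a result;
    sleep is injectable so the pause can be observed or skipped in tests.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Pause before every request except the first to spread out server load.
            sleep(delay_ms / 1000.0)
        results.append(fetch(url))
    return results
```

A larger delay lowers the request rate your server sees, at the cost of a longer crawl.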
The crawler sends an HTTP request to the server that includes an If-Modified-Since header. We recommend that servers be configured to report the date and time each page was last updated (in the Last-Modified response header) and to honor the crawler's If-Modified-Since header, which looks like this:
If-Modified-Since: Wed, 19 Oct 2005 10:50:00 GMT
If the requested page has not been updated since the date supplied, the server should respond with status code 304 (Not Modified) and no page content. Otherwise, it should respond with 200 (OK) and send the page. Configuring these responses properly helps search engines determine how often to re-index your site's content.
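The decision described above can be sketched in Python as follows. This is a simplified illustration of the rule, not the configuration of any particular web server; the function name conditional_status and its arguments are assumptions, and in practice most web servers handle this for static files automatically.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the supplied date, else 200.

    if_modified_since: raw header value sent by the crawler, or None.
    last_modified: timezone-aware datetime of the page's last update.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # An unparseable date is ignored and the full page is sent.
        return 200
    if since.tzinfo is None:
        # Treat a date without zone information as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200
```

For example, with the header value shown earlier, a page last updated on 1 October 2005 yields 304, while a page updated in 2006 yields 200.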
How can I submit my page for crawling?
Schools and some departments manage their own collections. If you have a manager account in SearchBlox, add your new site to your collection's Root URLs field, then trigger a refresh. If you don't have a manager account, please use this form to submit your northwestern.edu site.
Indexing is the process by which the content on your page is examined and ingested into the "index" -- a database of keywords and associated pages (URLs) containing them.
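A toy version of such an index can be written in a few lines of Python. The sketch below only illustrates the idea of mapping keywords to URLs; a real search index stores much more. The names build_index and lookup and the example.edu URLs are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase keyword to the set of URLs whose text contains it.

    pages: dict of URL -> plain text of the page.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def lookup(index, keyword):
    """Return the URLs containing the keyword, sorted for stable output."""
    return sorted(index.get(keyword.lower(), set()))
```

Looking up a keyword is then a single dictionary access, which is why searching an index is much faster than scanning every page.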
What content is indexed?
The entire page is stored in the index. The search algorithm uses specific components of the page to determine its relevance to a keyword search. The easiest way to get your content to rank highly for a given term is to include that term in your content! The algorithm ranks pages using several components of each page:
- The page URL
- The text inside the <title> element
- The text in the <meta> description field, e.g.
<META NAME="Description" CONTENT="Your descriptive sentence or two goes here.">
- The document's body contents
For more information on how to improve each of these areas, see writing for search engines.
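To check what a search engine is likely to see in each of the components listed above, a short Python sketch like the following can pull them out of a page. It is a rough approximation for self-checking your own pages, not the indexer's actual logic; PageFields and extract_fields are hypothetical names, and the body text here naively includes any script or style content.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Pulls out the title, meta description, and body text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._open = []  # <title>/<body> tags we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body"):
            self._open.append(tag)
        elif tag == "meta":
            attributes = dict(attrs)
            # HTMLParser lowercases tag and attribute names; values keep their case.
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        if "title" in self._open:
            self.title += data
        elif "body" in self._open:
            self.body_parts.append(data)


def extract_fields(url, html):
    """Return the four components as a dict, with whitespace in the body collapsed."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    body = " ".join(" ".join(parser.body_parts).split())
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.description,
        "body": body,
    }
```

If the title or description comes back empty for one of your pages, that is a good first thing to fix.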