
Spider


Spider: A spider, also known as a web crawler or search engine spider, is an internet bot that systematically browses the web for the purpose of web indexing. Search engines use spiders to update their own web content or their indexes of other sites' content. A spider copies the pages it visits so that a search engine can later process and index the downloaded pages, allowing users to search them more quickly. A spider discovers new pages by harvesting the links on every page it finds and then following those links to other web pages.
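The link-harvesting loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine implements it: the names LinkHarvester, crawl, and FAKE_WEB are made up for this example, and the "web" is a small in-memory dictionary so that the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkHarvester(HTMLParser):
    """Collects the href of every <a> tag found on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page copied.

    `fetch` is any callable that maps a URL to its HTML, or None on failure.
    """
    frontier = deque([seed])   # pages discovered but not yet visited
    seen = {seed}              # every URL ever queued, to avoid repeats
    copies = {}
    while frontier and len(copies) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        copies[url] = html     # keep a copy for later indexing
        harvester = LinkHarvester(url)
        harvester.feed(html)
        for link in harvester.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return copies


# Hypothetical three-page site held in memory (an assumption for this demo).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="b.html#top">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b.html": "<p>no links</p>",
}

pages = crawl("http://example.test/", FAKE_WEB.get)
```

A real spider would replace FAKE_WEB.get with an HTTP fetch and would also apply the policies listed below before requesting each page.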


The spider's behavior is determined by a number of policies:

  • Selection policy: The selection policy states which pages to download. Because a spider can only fetch a fraction of the web, it prioritizes pages using a metric of importance based on their intrinsic quality, their popularity in terms of links or visits, and even their URLs.

  • Re-visit policy: The re-visit policy states when to check for changes to the pages. The search engine uses a page's freshness and age to determine whether or not it needs to be re-visited.

  • Politeness policy: The politeness policy states how to avoid overloading websites. Spiders can use a great deal of bandwidth, which can affect a site's overall performance if its servers become overloaded.

  • Parallelization policy: The parallelization policy states how to coordinate distributed spiders. A parallel spider runs multiple crawling processes at once in order to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page.
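To make some of these policies concrete, here is a rough Python sketch of a URL queue that applies three of them: selection (higher-scored pages come out first), politeness (a minimum delay between requests to the same host), and avoiding repeated downloads. The Frontier class, its score argument, and the min_delay value are all assumptions made for illustration. Real crawlers use far more elaborate importance metrics and typically also honor each site's robots.txt, and the re-visit policy is not covered here.

```python
import heapq
from urllib.parse import urlsplit


class Frontier:
    """URL queue that applies selection, politeness, and de-duplication."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay   # assumed seconds between hits to one host
        self._heap = []              # entries: (-score, insertion order, url)
        self._seen = set()           # URLs already queued, never queued twice
        self._next_allowed = {}      # host -> earliest time of the next request
        self._counter = 0

    def add(self, url, score):
        """Queue a URL once; a higher score means it is fetched sooner."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop_ready(self, now):
        """Return the best URL whose host may be contacted at `now`, else None."""
        deferred = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                chosen = entry[2]
                self._next_allowed[host] = now + self.min_delay
                break
            deferred.append(entry)   # host contacted too recently, try later
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen


frontier = Frontier(min_delay=2.0)
frontier.add("http://a.test/popular", score=10)
frontier.add("http://a.test/minor", score=1)
frontier.add("http://b.test/index", score=5)
frontier.add("http://a.test/popular", score=10)  # duplicate, ignored

# Times are passed in explicitly so the example is deterministic.
order = [
    frontier.pop_ready(now=0.0),  # highest score overall
    frontier.pop_ready(now=0.0),  # next best on a host that is not rate-limited
    frontier.pop_ready(now=0.0),  # only a.test is left and it must wait: None
    frontier.pop_ready(now=2.0),  # delay has elapsed, a.test is allowed again
]
```

In a distributed setup, each crawling process would need to share or partition the set of seen URLs so that two processes do not download the same page; this single-process sketch sidesteps that coordination problem.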


The major search engines run their own spider bots. Google's spider is called Googlebot, Bing's is Bingbot (which replaced Msnbot), and Yahoo! Slurp was the name of Yahoo!'s web crawler until Yahoo! contracted with Microsoft to use Bingbot.