Chapter 8: How do search engines work?
  

- Internet search engines are programs that search and retrieve information on the web. Most of them use a crawler-indexer architecture and depend on their crawler modules. Crawlers, also referred to as spiders, are small programs that browse the web.
- Crawlers are given an initial set of URLs whose pages they retrieve. They extract the URLs that appear on those pages and pass them to the crawl control module, which decides which pages to visit next and hands their URLs back to the crawlers (see the first sketch after this list).
- The pages covered by different search engines vary with the algorithms they use. Some search engines are programmed to crawl sites on a particular topic, while the crawlers of others visit as many sites as possible.
- The crawl control module may use the link graph of a previous crawl, or observed usage patterns, to guide its crawling strategy (a simple link-graph strategy is sketched below).
- The indexer module extracts the words from each page it visits and records their URLs. The result is a large lookup table that maps each word to a list of URLs of pages where that word occurs; the table covers the pages visited during crawling (see the inverted-index sketch below).
- A collection analysis module is another important part of the search engine architecture. It creates a utility index, which may, for example, provide access to pages of a given length or pages containing a certain number of pictures (also sketched below).
- During crawling and indexing, a search engine stores the pages it retrieves in a temporary page repository. Search engines also maintain a cache of the pages they visit, so that already-visited pages can be retrieved quickly.
- The query module of a search engine receives search requests from users in the form of keywords, and the ranking module sorts the results (a minimal query-and-rank sketch appears below).
- The crawler-indexer architecture has many variants; one is the distributed architecture, which consists of gatherers and brokers. Gatherers collect indexing information from web servers, while brokers provide the indexing mechanism and the query interface. Brokers update their indices on the basis of information received from gatherers and other brokers, and they can filter information. Many of today's search engines use this type of architecture (a sketch of the broker's merge step closes the examples below).
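
The sketches below illustrate the modules above in Python; they are minimal illustrations under stated assumptions, not production implementations. First, the crawler/crawl-control loop: a frontier of URLs feeds the crawlers, and each fetched page contributes newly discovered URLs back to the frontier. The dictionary of fetched pages doubles as the temporary page repository. Only the standard library is used; the `max_pages` limit is an illustrative assumption, and a real crawler adds politeness rules (robots.txt, rate limits) and robust error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # URLs the control module hands to crawlers
    seen = set(seed_urls)
    pages = {}                    # URL -> raw HTML: the page repository
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue              # skip unreachable or malformed URLs
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:   # report discovered URLs back to control
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```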
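One concrete crawl-control strategy from the list: use the link graph of a previous crawl to prioritize candidate URLs by how many known pages point to them. The shape of `link_graph` (each URL mapped to the set of URLs its page links to) is an assumption made for this illustration.

```python
def prioritize(candidates, link_graph):
    """Order candidate URLs so heavily linked-to pages are crawled first.

    link_graph maps each URL to the set of URLs that its page links to.
    """
    in_links = {}
    for source, targets in link_graph.items():
        for target in targets:
            in_links[target] = in_links.get(target, 0) + 1
    return sorted(candidates, key=lambda u: in_links.get(u, 0), reverse=True)
```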
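The indexer's lookup table is an inverted index: each word maps to the set of URLs of pages where it occurs. This sketch tokenizes the raw page text naively; a real indexer would strip HTML markup and normalize terms.

```python
import re
from collections import defaultdict

def build_index(pages):
    """pages maps URL -> page text, e.g. the output of crawl() above."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Lowercase alphanumeric tokens: a stand-in for real tokenization.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index
```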
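A utility index, by contrast, keys pages on structural properties rather than words. Here is a rough sketch for the two properties mentioned above, page length and picture count, using crude proxies for both.

```python
from collections import defaultdict

def build_utility_index(pages):
    """Index pages by length and by a rough count of embedded images."""
    by_length = defaultdict(set)
    by_image_count = defaultdict(set)
    for url, html in pages.items():
        by_length[len(html)].add(url)
        # Counting "<img" occurrences approximates the pictures on a page.
        by_image_count[html.lower().count("<img")].add(url)
    return by_length, by_image_count
```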
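Next, a minimal query-and-ranking pair over the inverted index: keep only the pages that contain every keyword, then rank by total keyword frequency. Real ranking modules combine far richer signals (link analysis, freshness, and so on); frequency here is just a placeholder.

```python
def search(query, index, pages):
    """Return matching URLs, best-ranked first, for a keyword query."""
    words = query.lower().split()
    if not words:
        return []
    # AND semantics: a page must contain every query word.
    matches = set.intersection(*(set(index.get(w, set())) for w in words))
    # Rank by total keyword frequency in the stored page text.
    def score(url):
        text = pages[url].lower()
        return sum(text.count(w) for w in words)
    return sorted(matches, key=score, reverse=True)
```

Tying the sketches together: `pages = crawl(["https://example.com"])`, then `index = build_index(pages)`, then `search("web crawler", index, pages)`; the seed URL and query are purely illustrative.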
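Finally, in the distributed (gatherer/broker) variant, each gatherer produces a partial index and a broker merges what it receives from gatherers and other brokers. A minimal sketch of that merge step, assuming partial indices shaped like the inverted index above:

```python
from collections import defaultdict

def merge_indices(partial_indices):
    """Merge partial inverted indices received from gatherers and brokers."""
    merged = defaultdict(set)
    for partial in partial_indices:
        for word, urls in partial.items():
            merged[word] |= urls
    return merged
```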
Examples of five search engines on the Internet