Telecom Tutorials by Samir Amberkar: Search Engine Concepts and Design


Search Engine Concepts and Design

Though discussed in the context of a Web search engine, the concepts apply to other search tools as well.

The aim of a search engine is to provide relevant results in minimal time for the words or phrases searched.

Relevancy

Relevancy of results is a trial-and-error process and should improve over time, provided a team is in place to analyse what is searched, what is typically expected, and what the search engine returns. For example, if a user searches for pizza, it is more likely that the user meant to search for nearby pizza restaurants than for a pizza recipe, the origins of pizza, or the meaning of the word pizza. A good search engine will provide the most relevant results, but it may also offer alternate results in a side bar or by some other means.

Though relevancy is a trial-and-error process and there may be no fixed rules of thumb, certain assumptions can be made. For example, if a user searches for 3G mobile, even without quotes around the phrase, it makes sense to prefer results in which the words 3G and mobile are next to each other or as near as possible. Another good assumption is that documents in which the word is found at the beginning, or in capitals, are more relevant. This is based on the observation that documents typically have a title (and maybe an abstract too) at the beginning describing the content of the page, which makes the document more relevant for those words.

Relevancy also depends on the type of document (such as an HTML page, a Word/PDF document, or a presentation) and the type of website (for example, whether it is knowledge related, a commercial site, a blog, etc.).
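The two assumptions above (query words near each other, and words found early in the document) can be sketched as a small scoring function. This is only an illustration: the function names, the tokenizer, and the way the two factors are added together are my own assumptions, not a standard formula, and the "in capitals" hint is not modelled because the tokenizer lowercases everything.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_gap(positions_a, positions_b):
    """Smallest distance between any occurrence of word A and any occurrence of word B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)

def relevancy_score(document, query):
    """Hypothetical score: higher when query words are close together and appear early."""
    tokens = tokenize(document)
    # Remove repeated query words while keeping their order.
    words = list(dict.fromkeys(tokenize(query)))
    if not words:
        return 0.0
    positions = {w: [i for i, t in enumerate(tokens) if t == w] for w in words}
    if any(not p for p in positions.values()):
        return 0.0  # at least one query word is missing from the document
    # Proximity: average gap between consecutive query words (1 means adjacent).
    gaps = [min_gap(positions[a], positions[b]) for a, b in zip(words, words[1:])]
    proximity = 1.0 / (sum(gaps) / len(gaps)) if gaps else 1.0
    # Position: bonus when a query word appears near the start (title area).
    first = min(p[0] for p in positions.values())
    position_bonus = 1.0 / (1 + first)
    return proximity + position_bonus

score_adjacent = relevancy_score("3G mobile networks explained", "3G mobile")
score_far = relevancy_score("Mobile phones today: why 3G still matters", "3G mobile")
print(score_adjacent > score_far)
```

In a real engine the weights of the two factors would themselves be tuned by the trial-and-error process described above.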

Minimal time

To minimise search time, rather than scanning the documents each time a word is searched, we can create a sort of index or database that can be searched much faster. This is similar to arranging your music CD collection: you arrange it in the order most suitable to you so that later you can easily find the CD you want. A good search engine does the same: it arranges data in a certain way, taking care of relevancy partially or as much as possible.
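One common way to build such an index is an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal in-memory version; the document ids, sample texts, and function names are made up for illustration.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def lookup(index, query):
    """Return ids of documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents.
documents = {
    "doc1": "3G mobile networks explained",
    "doc2": "Pizza restaurants near you",
    "doc3": "Order pizza from your mobile",
}
index = build_index(documents)
print(sorted(lookup(index, "mobile")))
print(sorted(lookup(index, "pizza mobile")))
```

The documents are read once, when the index is built; each later search is a few dictionary lookups and set intersections rather than a scan of every document.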

Design

The following diagram shows a simple search engine design:

A search engine consists of three major components: Search Index, Index Generator, and Searcher.

The Index Generator works off-line and generates a transient index, which needs to be synchronised with the permanent index. The major components of the Index Generator are the Crawler and the Parser. The Crawler fetches documents; in the case of a web search engine, it goes around the Internet looking for documents and downloading them. The Crawler also needs to check for updates to documents already downloaded and/or present in the Search Index. The Parser parses these documents and creates search index entries, possibly taking relevancy into account partially or as much as possible.

The search index entries are stored in the transient part of the Search Index, which is later synchronised with the permanent part. Synchronisation is a major process, as it often leads to rearrangement of the Search Index: the relevancy of a document keeps changing with the addition of new documents, modification of earlier documents, changes to the relevancy logic, etc. This is why the Search Index has at least two parts, as mentioned.
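The Crawler and Parser roles can be sketched without touching a real network. In the example below, FAKE_WEB is a hypothetical stand-in for the Internet, and the update check uses a content hash, which is just one possible way to detect changed documents. All names are illustrative assumptions.

```python
import hashlib
import re
from collections import deque

# Hypothetical stand-in for the Internet: url -> (text, outgoing links).
FAKE_WEB = {
    "a": ("3G mobile basics", ["b", "c"]),
    "b": ("Pizza near you", ["a"]),
    "c": ("Mobile pizza ordering", []),
}

def fetch(url):
    """Pretend download; returns (text, links), or None if the page does not exist."""
    return FAKE_WEB.get(url)

def crawl(seed, known_hashes):
    """Breadth-first crawl from seed; returns only new or changed documents."""
    changed = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if known_hashes.get(url) != digest:
            known_hashes[url] = digest  # remember the version we have seen
            changed[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return changed

def parse(url, text):
    """Turn one document into index entries of the form (word, url, position)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(word, url, position) for position, word in enumerate(words)]

hashes = {}
first_pass = crawl("a", hashes)    # everything is new on the first pass
second_pass = crawl("a", hashes)   # nothing has changed since
entries = [e for url, text in first_pass.items() for e in parse(url, text)]
```

Keeping the word position in each entry is what later allows relevancy hints such as "found at the beginning" or "words next to each other" to be applied.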
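The two-part arrangement can be sketched as follows. This is a minimal in-memory illustration under my own assumptions: the class and method names are hypothetical, relevancy is reduced to a single numeric score per word and document, and synchronisation simply re-sorts the affected words.

```python
class SearchIndex:
    """Hypothetical two-part index: a transient part for writes, a permanent part for reads."""

    def __init__(self):
        self.permanent = {}  # word -> list of doc ids, most relevant first
        self.transient = {}  # word -> {doc_id: score}, entries waiting to be merged
        self._scores = {}    # word -> {doc_id: score}, scores behind the permanent part

    def add_entry(self, word, doc_id, score):
        """Called by the Index Generator; does not disturb the permanent part."""
        self.transient.setdefault(word, {})[doc_id] = score

    def synchronise(self):
        """Merge pending entries and rearrange the affected words by relevancy."""
        for word, pending in self.transient.items():
            merged = self._scores.setdefault(word, {})
            merged.update(pending)  # new documents and updated scores
            # Highest score first; ties are broken by document id.
            self.permanent[word] = sorted(merged, key=lambda d: (-merged[d], d))
        self.transient = {}

idx = SearchIndex()
idx.add_entry("pizza", "d1", 0.5)
idx.synchronise()
idx.add_entry("pizza", "d2", 0.9)  # a new, more relevant document arrives
idx.synchronise()
print(idx.permanent["pizza"])
```

Because searches read only the permanent part, indexing can continue in the background and the rearrangement happens at a time of the engine's choosing.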

The Searcher is in charge of user interaction and is relatively simple compared to the other two components. It may consist of an Interface part and a Logic part. The Interface part takes care of the user interface (simple line-by-line results or categorised results, fonts/colours, etc.) and gets the results data from the Logic part. The Logic part searches the permanent part of the Search Index. Its job is simpler if relevancy is properly taken care of during indexing. The Logic part can, in fact, fine-tune the relevancy.

This completes the brief on Search Engine Concepts and Design.

The above is based on my experience with a search engine application that I developed (for a company intranet) out of my own curiosity.
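A minimal sketch of the two Searcher parts is given below. It assumes the permanent index is a plain dictionary from word to a list of document ids already ordered by relevancy; that layout, the function names, and the fine-tuning rule (summing each document's rank position across the query words) are illustrative assumptions only.

```python
def search(permanent_index, query, limit=10):
    """Logic part: documents containing every query word, best combined rank first."""
    words = query.lower().split()
    if not words:
        return []
    ranked_lists = [permanent_index.get(word, []) for word in words]
    common = set(ranked_lists[0]).intersection(*ranked_lists[1:])
    # Fine-tuning: a lower summed rank position means more relevant overall.
    ranked = sorted(
        common,
        key=lambda doc: (sum(lst.index(doc) for lst in ranked_lists), doc),
    )
    return ranked[:limit]

def render(results):
    """Interface part: simple line-by-line presentation of the results."""
    if not results:
        return "No results found."
    return "\n".join(f"{n}. {doc}" for n, doc in enumerate(results, start=1))

# Hypothetical permanent index, already ordered by relevancy per word.
permanent_index = {
    "3g": ["d1", "d2", "d3"],
    "mobile": ["d2", "d1"],
}
print(render(search(permanent_index, "3G mobile")))
```

Since the per-word ordering was done at indexing time, the Logic part only has to combine lists, which is why the Searcher stays simple.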