How Search Algorithms Work (in Information Retrieval)
When you type a query into a search engine (e.g., “machine learning applications in healthcare”), the system must:
Index documents
- Before searching, all documents are processed into an inverted index (a mapping from terms → list of documents where they appear).
- Example:
- “data” → {doc1, doc3, doc7}
- “learning” → {doc2, doc3, doc4}
Retrieve candidate documents
- Using the inverted index, the system finds documents containing at least one of the query terms.
Rank the documents
- The system applies a ranking function (scoring model) to order documents by relevance to the query.
- Early methods: Boolean matching (exact matches).
- More advanced: Vector Space Models (TF-IDF, BM25 (Best Match 25)), Neural embeddings, Transformers (BERT).
Return top results
- Usually top-10 or top-100 documents are returned.
In short:
- Search algorithms: retrieve and rank documents.
- BM25: a probabilistic ranking function that balances term frequency, term rarity, and document length.