BM25 (Best Match 25)

BM25 is a ranking function widely used in search engines (e.g., Elasticsearch, Whoosh, Lucene). It’s an extension of TF-IDF with improvements for handling term frequency and document length.

Formula:

For a document $D$ and query $Q$, the BM25 score is:

Where:

  • $f(t,D)$ = term frequency of term $t$ in document $D$

  • $|D|$ = length of document $D$ (number of words)

  • $\text{avgdl}$ = average document length across all docs

  • $k_1$ (usually ~1.2–2.0) controls term frequency saturation (diminishing returns after many repetitions).

  • $b$ (usually ~0.75) controls length normalization (longer documents don’t get unfair advantage).

  • $IDF(t)$ = inverse document frequency:

    where $N$ is total number of docs, $n_t$ is number of docs containing $t$.

Intuition:

  • Term Frequency: More occurrences of a query term increase relevance, but with diminishing returns.
  • Inverse Document Frequency: Rare terms are more discriminative, so they’re weighted higher.
  • Document Length Normalization: A term in a shorter document is more meaningful than in a very long one.

Example (simplified):

Suppose query = “data science”.

  • Doc1: “data science is growing fast”
  • Doc2: “big data in business and industry”

BM25 will:

  • Give high score to Doc1 since it contains both “data” and “science”.
  • Score Doc2 lower because it lacks “science”.
  • Adjust for length: if Doc2 were extremely long, the contribution of “data” would be dampened.