TF-IDF is a statistical technique used in text analysis to determine the importance of a word in a document relative to a collection of documents (corpus). It balances two ideas:

  • Term Frequency (TF): Captures how often a term occurs in a document.
  • Inverse Document Frequency (IDF): Discounts terms that appear in many documents.

High TF-IDF scores indicate terms that are frequent in a document but rare in the corpus, making them useful for distinguishing between documents in tasks such as information retrieval, document classification, and recommendation.

TF-IDF combines local and global term statistics:

  • TF gives high scores to frequent terms in a document
  • IDF reduces the weight of common terms across documents
  • TF-IDF identifies terms that are both frequent and distinctive

It can be used to give an initial snapshot of a note's themes and topics.

Equations

Term Frequency

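TF measures how often a term $t$ appears in a document $d$, normalized by the total number of terms in $d$:

$$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

Where:

  • $f_{t,d}$ is the raw count of term $t$ in document $d$
  • $\sum_{t' \in d} f_{t',d}$ is the total number of terms in $d$ (i.e. the document length)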
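
As a rough illustration of the term-frequency definition, the sketch below computes TF for a single document. It assumes the document is already tokenized by whitespace; the function name `term_frequency` and the sample sentence are made up for this example.

```python
from collections import Counter


def term_frequency(tokens):
    """Return TF(t, d) for every term t in one tokenized document d."""
    counts = Counter(tokens)        # raw counts f_{t,d}
    total = sum(counts.values())    # document length (total number of terms)
    if total == 0:
        return {}                   # empty document: no terms to score
    return {term: count / total for term, count in counts.items()}


# Hypothetical sample document, tokenized naively on whitespace
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
```

Here "the" occurs twice among six tokens, so its TF is 2/6, and the TF values of a document sum to 1.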

Inverse Document Frequency

IDF assigns lower weights to terms that appear in many documents:

$$\mathrm{IDF}(t, D) = \log\left(\frac{N}{1 + n_t}\right)$$

Where:

  • $N$ is the number of documents in the corpus $D$
  • $n_t$ is the number of documents containing term $t$
  • Adding 1 to the denominator avoids division by zero
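
The IDF definition can be sketched in the same way. This is a minimal version that assumes the natural logarithm and the add-one smoothing described above; the function name `inverse_document_frequency` and the four-document corpus are invented for illustration.

```python
import math


def inverse_document_frequency(corpus):
    """Return log(N / (1 + n_t)) for every term in a tokenized corpus."""
    n_docs = len(corpus)                # N: number of documents
    doc_freq = {}                       # n_t: documents containing each term
    for tokens in corpus:
        for term in set(tokens):        # count each document once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log(n_docs / (1 + n)) for t, n in doc_freq.items()}


# Hypothetical corpus of four short documents
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a quiet morning".split(),
]
idf = inverse_document_frequency(corpus)
```

In this corpus "dog" appears in one document and gets log(4/2), while "the" appears in three of four and gets log(4/4) = 0. One consequence of this particular smoothing is that a term present in every document receives a slightly negative weight, log(N/(N+1)).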

TF-IDF Score

The final score is the product of the two:

$$\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)$$
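
Putting the pieces together, the sketch below scores every term in every document as TF multiplied by IDF and then picks the highest-scoring terms of one document, which is one way to get the quick "snapshot" of a note's themes mentioned earlier. The function name `tf_idf`, the whitespace tokenization, and the sample corpus are assumptions made for this example, not part of the original note.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    # n_t: number of documents that contain each term
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    scores = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = sum(counts.values())
        scores.append({
            # TF(t, d) * IDF(t, D), with add-one smoothing in the denominator
            term: (count / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return scores


# Hypothetical corpus of four short documents
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a quiet morning".split(),
]
scores = tf_idf(corpus)

# Two highest-scoring terms of the second document
top_terms = sorted(scores[1], key=scores[1].get, reverse=True)[:2]
```

For the second document, "the" is the most frequent term but scores 0 because it appears in three of the four documents, so the rarer terms "dog" and "chased" rank above it.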

Exploratory Ideas

  • Can track TF-IDF over time (e.g., note evolution)
  • Can the documents be clustered or classified using TF-IDF?
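
As a starting point for the clustering idea, one common building block is cosine similarity between TF-IDF vectors. This is an assumption about how the idea might be explored, not a method the note commits to. The sketch stores each vector as a `{term: weight}` dict; the three notes and their weights are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                  # an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical TF-IDF vectors for three notes
note_a = {"gradient": 0.30, "descent": 0.25, "loss": 0.10}
note_b = {"gradient": 0.20, "loss": 0.15, "optimizer": 0.30}
note_c = {"sourdough": 0.40, "starter": 0.35}

sim_ab = cosine_similarity(note_a, note_b)
sim_ac = cosine_similarity(note_a, note_c)
```

Notes that share weighted terms, like `note_a` and `note_b`, get a positive similarity, while notes with no terms in common, like `note_a` and `note_c`, get 0. Pairwise similarities like these could then feed a clustering or nearest-neighbour step. The same vectors, recomputed on successive versions of a note, could be compared to track how it changes over time.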