TF-IDF is a statistical technique used in text analysis to determine the importance of a word in a document relative to a collection of documents (corpus). It balances two ideas:
- Term Frequency (TF): Captures how often a term occurs in a document.
- Inverse Document Frequency (IDF): Discounts terms that appear in many documents.
High TF-IDF scores indicate terms that are frequent in a document but rare in the corpus, making them useful for distinguishing between documents in tasks such as information retrieval, document classification, and recommendation.
TF-IDF combines local and global term statistics:
- TF gives high scores to frequent terms in a document
- IDF reduces the weight of common terms across documents
- TF-IDF identifies terms that are both frequent and distinctive
Equations
Term Frequency
TF measures how often a term $t$ appears in a document $d$, normalized by the total number of terms in $d$:

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{N_d}$$

Where:
- $f_{t,d}$ is the raw count of term $t$ in document $d$
- $N_d$ is the total number of terms in $d$ (i.e. the document length)
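For example, with made-up counts: a term occurring 3 times in a 100-term document gets

$$\mathrm{tf}(t, d) = \frac{3}{100} = 0.03$$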
Inverse Document Frequency
IDF assigns lower weights to terms that appear in many documents:

$$\mathrm{idf}(t) = \log\left(\frac{N}{1 + n_t}\right)$$

Where:
- $N$ is the number of documents in the corpus
- $n_t$ is the number of documents containing term $t$
- Adding 1 to the denominator avoids division by zero when a term appears in no documents
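Continuing with made-up numbers: in a corpus of $N = 1000$ documents, a term found in 9 of them gets $\mathrm{idf}(t) = \log\frac{1000}{1 + 9} = \log 100 \approx 4.61$ (natural log, as in the implementations below), while a term found in 999 of them gets $\log\frac{1000}{1000} = 0$.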
TF-IDF Score
The final score is the product of the two:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
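Putting the two made-up examples together: the term with $\mathrm{tf} = 0.03$ and $\mathrm{idf} \approx 4.61$ scores $\mathrm{tfidf}(t, d) \approx 0.03 \times 4.61 \approx 0.14$, while a term that appears in nearly every document scores close to 0 no matter how frequent it is locally.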
Exploratory Ideas
- Track TF-IDF scores over time (e.g., to watch how a note evolves)
- Cluster or classify documents using their TF-IDF vectors? (See the sketch below.)
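A minimal sketch of the clustering idea, using scikit-learn's TfidfVectorizer and KMeans; the corpus and cluster count here are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical corpus; in practice this would be the notes themselves
corpus = [
    "tf-idf weights terms by frequency and rarity",
    "gradient descent minimizes a loss function",
    "stochastic gradient descent uses mini-batches",
    "bm25 is a ranking function related to tf-idf",
]

# Vectorize the documents, then cluster in TF-IDF space
X = TfidfVectorizer().fit_transform(corpus)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, label in zip(corpus, labels):
    print(label, doc)
```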
Implementations
Python Script (scikit-learn version)
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Example corpus and normalizer (assumed here; substitute your own)
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "cats and dogs make good pets",
]

def normalize_document(text):
    return text.lower().split()

# Step 1: Tokenize and vectorize using Bag of Words
bow = CountVectorizer(tokenizer=normalize_document)
X_counts = bow.fit_transform(corpus)

# Step 2: Apply TF-IDF transformation
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

# Optional: View TF-IDF scores per document
for doc_id in range(len(corpus)):
    print(f"Document {doc_id}: {corpus[doc_id]}")
    print("TF-IDF values:")
    tfidf_vector = X_tfidf[doc_id].toarray().flatten()
    for term, score in zip(bow.get_feature_names_out(), tfidf_vector):
        if score > 0:
            print(f"{term.rjust(10)} : {score:.4f}")
```
Python Script (custom TF-IDF implementation)
```python
import math

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams

# Stopword list requires a one-time nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

def tokenize(text):
    # Lowercase, drop stopwords and tokens of <= 2 chars, then add bigrams/trigrams
    tokens = tokenizer.tokenize(text.lower())
    tokens = [t for t in tokens if len(t) > 2 and t not in stop_words]
    return tokens + [' '.join(b) for b in bigrams(tokens)] + [' '.join(t) for t in trigrams(tokens)]

def tf(term, doc_tokens):
    # Raw count of the term, normalized by document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs_tokens):
    # log(N / (1 + n_t)); the +1 avoids division by zero
    doc_count = sum(1 for doc in docs_tokens if term in doc)
    return math.log(len(docs_tokens) / (1 + doc_count))

def compute_tfidf(docs):
    # Returns one {term: tf-idf score} dict per document
    docs_tokens = [tokenize(doc) for doc in docs]
    all_terms = set(term for doc in docs_tokens for term in doc)
    tfidf_scores = []
    for tokens in docs_tokens:
        tfidf = {}
        for term in all_terms:
            if term in tokens:
                tfidf[term] = tf(term, tokens) * idf(term, docs_tokens)
        tfidf_scores.append(tfidf)
    return tfidf_scores
```
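A quick usage sketch of compute_tfidf (the documents are made up), printing each document's three highest-scoring terms:

```python
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day long",
    "foxes are quick and clever animals",
]

for doc_id, scores in enumerate(compute_tfidf(docs)):
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(f"Document {doc_id}:", [(term, round(score, 4)) for term, score in top])
```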