Evaluate Embedding Methods

A good embedding should place semantically similar sentences or words closer together in the embedding space. You can test this using:

a) Cosine Similarity

Compute the cosine similarity between embedding vectors.
Higher similarity between semantically related pairs (e.g., “Paris is the capital of France” vs “France’s capital is Paris”) indicates better embedding quality.

Compare scores across methods:

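from sklearn.metrics.pairwise import cosine_similarity
# vec1 and vec2 are 1-D embedding vectors produced by the method under test.
# cosine_similarity expects 2-D inputs and returns a matrix, so take the single entry.
similarity = cosine_similarity([vec1], [vec2])[0, 0]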
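To compare methods rather than single pairs, one option is to check how much higher each method scores related pairs than unrelated ones. The sketch below is only an illustration: the helper `similarity_gap`, the sentence IDs, and the two toy lookup tables standing in for embedding methods are all assumptions, not part of any library.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_gap(embed, related_pairs, unrelated_pairs):
    """Mean similarity of related pairs minus mean similarity of unrelated pairs."""
    related = np.mean([cosine(embed(a), embed(b)) for a, b in related_pairs])
    unrelated = np.mean([cosine(embed(a), embed(b)) for a, b in unrelated_pairs])
    return float(related - unrelated)

# Hypothetical toy "methods": lookup tables standing in for real embedding models.
toy_vectors_a = {"s1": [1.0, 0.0], "s2": [0.9, 0.1], "s3": [0.0, 1.0]}
toy_vectors_b = {"s1": [1.0, 0.0], "s2": [0.5, 0.5], "s3": [0.6, 0.4]}
methods = {
    "method_a": toy_vectors_a.__getitem__,
    "method_b": toy_vectors_b.__getitem__,
}

related_pairs = [("s1", "s2")]    # e.g. two paraphrases
unrelated_pairs = [("s1", "s3")]  # e.g. two unrelated sentences

# A larger gap means the method separates related from unrelated pairs more clearly.
gaps = {
    name: similarity_gap(embed, related_pairs, unrelated_pairs)
    for name, embed in methods.items()
}
```

With real models, `embed` would be whatever function maps a sentence to its vector, and the pair lists would come from a labeled paraphrase set.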

b) Analogy or Word Arithmetic

Test whether embeddings support compositional reasoning.
Example: $\text{Embedding}(\text{king}) - \text{Embedding}(\text{man}) + \text{Embedding}(\text{woman}) \approx \text{Embedding}(\text{queen})$
This shows whether semantic and syntactic relationships are meaningfully encoded in the embedding space.
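The analogy test can be sketched as a nearest-neighbour search around the vector a - b + c. The two-dimensional vectors below are hand-made toy values chosen so the analogy holds exactly, and `solve_analogy` is a hypothetical helper; real embeddings would only satisfy the relation approximately.

```python
import numpy as np

# Hypothetical toy embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender".
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def solve_analogy(a, b, c, vectors):
    """Return the word most cosine-similar to a - b + c, excluding a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are conventionally excluded
        score = target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

result = solve_analogy("king", "man", "woman", vectors)
```

Over a benchmark of analogy questions, the fraction answered correctly gives one accuracy number per embedding method.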

c) Clustering Consistency

Cluster the embeddings (e.g. via k-means) and evaluate whether related texts group together.
Measure cluster cohesion and separation (e.g. using Silhouette Score).
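A minimal sketch of this check with scikit-learn, using synthetic vectors in place of real text embeddings. The two "topics", their dimensionality, and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for text embeddings: two well-separated topics in 8 dimensions.
topic_a = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
topic_b = rng.normal(loc=3.0, scale=0.1, size=(20, 8))
embeddings = np.vstack([topic_a, topic_b])

# Cluster the embeddings; n_clusters is assumed known here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Silhouette score lies in [-1, 1]; values near 1 mean tight, well-separated clusters.
score = silhouette_score(embeddings, labels)
```

Running the same procedure on embeddings from different methods, with the same texts and the same number of clusters, makes the silhouette scores comparable.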

a) Use TF-IDF as Baseline

TF-IDF scores highlight the words that are frequent in a given text but rare across the rest of the corpus.
Evaluate how well dense embeddings retain the importance structure identified by TF-IDF.
For example, check whether high TF-IDF words receive higher attention in models like BERT (via attention weights) or influence sentence embedding directions.

Train simple classifiers (e.g. logistic regression) on the embeddings to predict task labels.

Better embeddings typically yield better accuracy/F1 on such tasks.
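Such a probing setup might look like the sketch below. The embeddings and binary labels are synthetic stand-ins generated with NumPy; in practice `X` would hold the vectors from the method being evaluated and `y` the labels of the chosen task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for sentence embeddings (16 dimensions) with binary labels.
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(100, 16)),
    rng.normal(1.0, 1.0, size=(100, 16)),
])
y = np.array([0] * 100 + [1] * 100)

# Hold out a test split so the score reflects generalization, not memorization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The classifier is kept deliberately simple so the score reflects the embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred)
```

Keeping the classifier, split, and labels fixed while swapping only the embedding method makes the resulting accuracy and F1 values directly comparable.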

Do the embeddings distinguish fine-grained semantic shifts (e.g., “bank” as a financial institution vs riverbank)?
Are word or sentence embeddings stable across paraphrased or reordered text?
Do similar sentences result in embeddings with high cosine similarity?
How well do embeddings handle out-of-vocabulary (OOV) words or rare terms?
