Overview

Pretraining and Transfer Learning

BERT is pre-trained on large corpora using two main objectives:

  • Masked Language Modeling (MLM): Predict tokens that have been randomly masked out, using their surrounding context (illustrated in the sketch below).
  • Next Sentence Prediction (NSP): Predict whether one sentence follows another in the original text.

Pre-training on these two objectives enables transfer learning: the pre-trained model is then adapted to downstream tasks through task-specific fine-tuning.
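
As a quick illustration of the MLM objective, a pre-trained BERT checkpoint can fill in a masked token. The sketch below uses the Hugging Face transformers fill-mask pipeline; the library and checkpoint name are assumptions made here for illustration, not part of the original notes.

```python
# Minimal MLM sketch using the Hugging Face `transformers` fill-mask pipeline
# (library and checkpoint are assumed for illustration).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is the literal string "[MASK]".
for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```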

Input Embeddings

BERT builds the input representation for each token as the sum of three learned embeddings: a WordPiece token embedding, a segment embedding (distinguishing sentence A from sentence B), and a position embedding. This summed vector is what the first Transformer layer receives.
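
A minimal sketch of how these three embedding tables combine, using the Hugging Face transformers implementation of BERT (the library, checkpoint, and example sentence are assumptions for illustration):

```python
# Inspect BERT's three input embedding tables and their sum
# (sketch only; library and checkpoint are assumed for illustration).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("The bank raised interest rates.", return_tensors="pt")

emb = model.embeddings
token_emb = emb.word_embeddings(enc["input_ids"])               # WordPiece tokens
segment_emb = emb.token_type_embeddings(enc["token_type_ids"])  # sentence A / B
position_ids = torch.arange(enc["input_ids"].size(1)).unsqueeze(0)
position_emb = emb.position_embeddings(position_ids)            # token positions

# BERT sums the three (LayerNorm and dropout are applied internally afterwards).
combined = token_emb + segment_emb + position_emb
print(combined.shape)  # (1, sequence_length, 768) for BERT-base
```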

Applications of BERT

  1. Text Classification – Sentiment analysis, topic classification.
  2. NER – Extraction of entities like names, places, etc.
  3. Question Answering – Find the answer to a question within a given passage (see the example after this list).
  4. Text Summarisation – Create concise summaries of documents.
  5. Language Translation – Assist with machine translation.
  6. Sentence Similarity – Evaluate semantic similarity between sentences.
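
For instance, the question-answering use case can be sketched with the transformers pipeline API. The fine-tuned checkpoint named below is an assumed example; any BERT model fine-tuned on SQuAD-style data would work the same way.

```python
# Extractive question answering with a BERT model fine-tuned on SQuAD 2.0
# (the checkpoint name is an assumed example, not taken from these notes).
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")

passage = (
    "BERT was introduced by researchers at Google in 2018. "
    "It is pre-trained on large corpora and then fine-tuned for downstream tasks."
)
result = qa(question="Who introduced BERT?", context=passage)
print(result["answer"], result["score"])
```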

Limitations of BERT with Large Datasets

BERT generates a contextual embedding for each token in a sentence, and these token embeddings are typically pooled (for example with mean pooling) into a single sentence embedding (see Sentence Transformers). However, such pooling treats all tokens equally, regardless of their importance to the sentence’s overall meaning, which limits how well the resulting vector captures fine-grained semantic relationships between sentences.
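
A minimal sketch of the mean-pooling step described above, using raw BERT outputs from the transformers library (pooling details vary between implementations; this is one common recipe, not the only one):

```python
# Mean-pool BERT token embeddings into a single sentence embedding
# (sketch only; masking and pooling details vary across implementations).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("BERT pools token embeddings into a sentence vector.",
                return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**enc).last_hidden_state  # (1, seq_len, 768)

# Average over real tokens only (padding is excluded via the attention mask);
# note that every remaining token contributes equally, whatever its importance.
mask = enc["attention_mask"].unsqueeze(-1).float()      # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)                         # (1, 768)
```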

While fine-tuning BERT on sentence pairs can help produce representations that better reflect relational meaning, the pair-based setup is computationally intensive: every sentence pair must be fed through the full network together, so comparing n sentences requires on the order of n² joint forward passes, which does not scale well to large datasets.
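
A back-of-the-envelope sketch of that scaling problem (the corpus size is an arbitrary example): scoring every pair jointly grows quadratically with the number of sentences, while encoding each sentence once and comparing cached embeddings grows linearly.

```python
# Illustrative arithmetic only; the corpus size is an assumed example.
n = 10_000                            # number of sentences to compare
pairwise_passes = n * (n - 1) // 2    # pair-based scoring: one forward pass per pair
single_passes = n                     # embedding-based: encode each sentence once,
                                      # then compare cheap cached vectors

print(f"{pairwise_passes:,} pairwise forward passes")   # 49,995,000
print(f"{single_passes:,} single-sentence encodings")   # 10,000
```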

Resources

Exploratory Questions

Variants

  • BERT-base: 12 layers, 110M parameters.
  • BERT-large: 24 layers, 340M parameters.
  • Optimized alternatives for specific needs, such as DistilBERT (smaller and faster), RoBERTa (more robust pretraining), and ALBERT (parameter sharing for a smaller memory footprint).
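
The layer and parameter counts listed above can be checked directly with the transformers library (a sketch; it downloads both checkpoints, and the exact totals vary slightly depending on which heads are counted):

```python
# Verify layer and parameter counts for the two standard BERT sizes
# (sketch only; totals vary slightly depending on which heads are included).
from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {model.config.num_hidden_layers} layers, "
          f"{n_params / 1e6:.0f}M parameters")
```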