Key Components of Attention and the Attention Formula

  1. Query (Q): Represents the current word or position that requires attention.
  2. Key (K): Represents each word in the input sequence.
  3. Value (V): Represents the actual content or information carried by each position in the input sequence.
  4. Attention Scores: The attention mechanism computes the relevance between the query and each key using a similarity score (typically a dot product, though other scoring functions exist).
  5. Softmax: These scores are then passed through a softmax function to form a probability distribution, which yields the attention weights.
  6. Context Vector: A weighted sum of the values (V), using the attention weights, is computed. This context vector is what the model uses to generate the output token.
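
As a numeric illustration of steps 4–6 (the scores here are made up, not from the original text): given raw scores of $2.0$, $1.0$, and $0.1$ for three keys, the softmax weights are $e^{2.0}/Z \approx 0.66$, $e^{1.0}/Z \approx 0.24$, and $e^{0.1}/Z \approx 0.10$, where $Z = e^{2.0} + e^{1.0} + e^{0.1}$, so the context vector is $0.66\,v_1 + 0.24\,v_2 + 0.10\,v_3$.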

Given a query matrix $Q$, key matrix $K$, and value matrix $V$, attention is calculated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:

  • $Q$, $K$, and $V$ are the matrices of query, key, and value vectors.
  • $d_k$ is the dimension of the keys.
  • The softmax is applied row-wise to produce attention weights.
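
To make the formula and the numbered steps above concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, toy shapes, and random inputs are illustrative assumptions, not part of the original text:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k) queries; K: (seq_len_k, d_k) keys; V: (seq_len_k, d_v) values.
    Returns the context vectors and the attention weights.
    """
    d_k = K.shape[-1]
    # Step 4: similarity scores between each query and every key
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len_q, seq_len_k)
    # Step 5: row-wise softmax turns scores into attention weights
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Step 6: context vectors are weighted sums of the values
    context = weights @ V                                 # (seq_len_q, d_v)
    return context, weights

# Toy usage: 3 tokens with d_k = d_v = 4 (hypothetical sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
context, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # each row sums to 1: a probability distribution
```

Because the softmax is taken over each row of the score matrix, every query's weights form a probability distribution over the keys, which is exactly what the row-wise softmax in the formula above specifies.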