Key Components of Attention and Formula
- Query ($Q$): Represents the current word or position that requires attention.
- Key ($K$): Represents each word in the input sequence.
- Value ($V$): Represents the actual content or information in the input sequence.
- Attention Scores: The attention mechanism computes the relevance between the query and each key as a similarity score (such as a dot product or another scoring method).
- Softmax: These scores are then passed through a softmax function to form a probability distribution, which gives us the attention weights.
- Context Vector: A weighted sum of the values ($V$), using the attention weights, is computed. This context vector is what the model uses to generate the output token (a step-by-step sketch follows this list).
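To make these steps concrete, here is a minimal NumPy sketch that walks a single query vector through scoring, softmax, and the weighted sum. The array values and dimensions are illustrative, not taken from the source:

```python
import numpy as np

# Toy setup: 4 input positions, key/value dimension 3 (illustrative values).
query = np.array([1.0, 0.5, -0.5])   # the current position's query vector
keys = np.random.randn(4, 3)          # one key vector per input position
values = np.random.randn(4, 3)        # one value vector per input position

# 1. Attention scores: dot-product similarity between the query and each key.
scores = keys @ query                 # shape: (4,)

# 2. Softmax: turn the scores into a probability distribution (attention weights).
weights = np.exp(scores - scores.max())
weights /= weights.sum()              # shape: (4,), sums to 1

# 3. Context vector: weighted sum of the values using the attention weights.
context = weights @ values            # shape: (3,)
print(context)
```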
Given a query matrix $Q$, key matrix $K$, and value matrix $V$, attention is calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Where:
- $Q$, $K$, and $V$ are the matrices of query, key, and value vectors.
- $d_k$ is the dimension of the keys.
- The softmax is applied row-wise to produce attention weights.
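Translating the formula directly into code, a minimal sketch of scaled dot-product attention over full $Q$, $K$, and $V$ matrices might look like the following. The function name and the matrix shapes are illustrative assumptions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with a row-wise softmax."""
    d_k = K.shape[-1]                             # dimension of the keys
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    # Row-wise softmax: each query's scores become a probability distribution.
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n_queries, d_v)

# Illustrative shapes: 5 queries, 7 keys/values, d_k = d_v = 8.
Q = np.random.randn(5, 8)
K = np.random.randn(7, 8)
V = np.random.randn(7, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8)
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with vanishingly small gradients.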