Key Components of Attention and the Attention Formula

  1. Query (Q): Represents the current word or position that requires attention.
  2. Key (K): Represents each word in the input sequence.
  3. Value (V): Represents the actual content or information carried by each position in the input sequence.
  4. Attention Scores: The attention mechanism computes the relevance between the query and each key using a similarity score (typically a dot product, though other scoring functions exist).
  5. Softmax: These scores are then passed through a softmax function to form a probability distribution, which yields the attention weights.
  6. Context Vector: A weighted sum of the values (V), using the attention weights, is computed. This context vector is what the model uses to generate the output token.
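
As a numeric illustration of steps 4–6 (the scores here are made up, not from the original text): given raw scores of $2.0$, $1.0$, and $0.1$ for three keys, the softmax weights are $e^{2.0}/Z \approx 0.66$, $e^{1.0}/Z \approx 0.24$, and $e^{0.1}/Z \approx 0.10$, where $Z = e^{2.0} + e^{1.0} + e^{0.1}$, so the context vector is $0.66\,v_1 + 0.24\,v_2 + 0.10\,v_3$.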

Given a query matrix $Q$, key matrix $K$, and value matrix $V$, attention is calculated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Where:

  • $Q$, $K$, and $V$ are the matrices of query, key, and value vectors.
  • $d_k$ is the dimension of the keys.
  • The softmax is applied row-wise to produce attention weights.
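
To make the formula and the numbered steps above concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, toy shapes, and random inputs are illustrative assumptions, not part of the original text:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k) queries; K: (seq_len_k, d_k) keys; V: (seq_len_k, d_v) values.
    Returns the context vectors and the attention weights.
    """
    d_k = K.shape[-1]
    # Step 4: similarity scores between each query and every key
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len_q, seq_len_k)
    # Step 5: row-wise softmax turns scores into attention weights
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Step 6: context vectors are weighted sums of the values
    context = weights @ V                                 # (seq_len_q, d_v)
    return context, weights

# Toy usage: 3 tokens with d_k = d_v = 4 (hypothetical sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
context, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # each row sums to 1: a probability distribution
```

Because the softmax is taken over each row of the score matrix, every query's weights form a probability distribution over the keys, which is exactly what the row-wise softmax in the formula above specifies.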