Multi-head attention extends the standard attention mechanism by enabling the model to attend to different parts of an input sequence simultaneously, capturing diverse relationships, both local and global.
Why Use Multi-Head Attention?
- Multiple Focus Areas: Each head attends to different parts of the sequence: some capture short-range (syntactic) relationships, while others capture long-range (semantic) dependencies.
- Diverse Representations: Each head operates in a distinct learned subspace, allowing the model to represent the same input in multiple ways.
- Richer Contextual Understanding: By aggregating these views, the model gains a more expressive and nuanced understanding of the input.
How It Works (Simplified Steps)
- Linear Projections: Input tokens are projected into queries (Q), keys (K), and values (V) separately for each head.
- Independent Attention: Each head computes attention scores and outputs a context vector.
- Concatenation: Outputs from all heads are concatenated.
- Final Projection: A linear transformation combines the concatenated multi-head output into a single vector (see the code sketch after this list).
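The four steps map directly onto a few lines of code. Below is a minimal NumPy sketch, not a production implementation: the dimensions, random weights, and the `multi_head_attention` helper are illustrative assumptions, and each head uses the standard scaled dot-product attention, softmax(QKᵀ / √d_head) · V.

```python
# Minimal sketch of multi-head attention in NumPy.
# d_model = 8 and num_heads = 2 are illustrative assumptions,
# not values from any particular model.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # 1. Linear projections into queries, keys, and values.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    # Split each projection into per-head subspaces:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # 2. Independent scaled dot-product attention per head:
    #    softmax(Q K^T / sqrt(d_head)) V
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    context = softmax(scores) @ Vh  # (num_heads, seq_len, d_head)

    # 3. Concatenate the head outputs back into one matrix.
    concat = context.transpose(1, 0, 2).reshape(seq_len, d_model)

    # 4. Final linear projection mixes information across heads.
    return concat @ W_o

# Tiny usage example with random inputs and weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (4, 8)
```

Framework implementations, such as PyTorch's `torch.nn.MultiheadAttention`, build the same core computation but add batching, masking, and dropout.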
Example Applications
In language translation, heads might focus on:
- Aligning subject-verb structures
- Resolving pronoun references
- Handling grammatical reordering between source and target languages
In semantic tasks, they can disambiguate words (e.g., “bank” as riverbank or financial institution) by attending to context.