https://www.youtube.com/shorts/Muvjex0nkes
Self-attention: every word pays attention to every other word to capture context (see the sketch after this list):
- take the input word vectors,
- project each word into Q, K, V vectors,
- compute the attention matrix,
- generate the final word vectors.
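A minimal NumPy sketch of these steps, assuming scaled dot-product attention; the names, shapes, and projection matrices here are illustrative, not from the video:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors.

    X:          (seq_len, d_model) input word vectors
    Wq, Wk, Wv: (d_model, d_k) projection matrices (assumed learned)
    """
    Q = X @ Wq                                       # queries
    K = X @ Wk                                       # keys
    V = X @ Wv                                       # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # attention matrix (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # final context-aware word vectors

# Example: 4 words, model width 8 (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # (4, 8)
```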
Multi-head attention: perform self-attention several times in parallel (sketched below):
- take the word vectors,
- project each word into Q, K, V vectors,
- split each Q, K, V vector into as many parts as there are heads,
- compute the attention matrix for each head,
- generate the final word vectors for each head,
- combine the heads back together.
Attending with several heads in parallel gives the model a better understanding of context.
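A sketch of the same computation split across heads. The output projection `Wo` and the head count are assumptions on my part (the notes only say "combine back together"); `self_attention`'s softmax is reused per head:

```python
def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Split Q, K, V into heads, attend per head, then recombine."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads                    # per-head width

    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Attention matrix for each head: (num_heads, seq_len, seq_len)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    heads = weights @ Vh                             # per-head word vectors

    # Combine back together: concatenate heads, then project
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Reusing X, Wq, Wk, Wv from above with 2 heads and a hypothetical Wo
Wo = rng.normal(size=(8, 8))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=2)   # (4, 8)
```

Because each head works on its own slice of the Q, K, V vectors, different heads can pick up different relationships between words in parallel.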