https://www.youtube.com/shorts/Muvjex0nkes

Self-attention: every word pays attention to every other word to capture context, in four steps (see the code sketch after this list):

  1. take the input word vectors,
  2. project each word into Q, K, V vectors,
  3. compute the attention matrix,
  4. generate the final word vectors.
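
A minimal NumPy sketch of these four steps. The weight matrices `Wq`, `Wk`, `Wv` and the dimensions (4 words, model size 8) are illustrative assumptions, not values from the video; this is standard scaled dot-product self-attention, not necessarily the exact formulation shown there.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) input word vectors (step 1)."""
    Q = X @ Wq                          # step 2: project into queries
    K = X @ Wk                          #         keys
    V = X @ Wv                          #         values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # step 3: attention matrix (scaled dot products)
    A = softmax(scores, axis=-1)        # each row sums to 1
    return A @ V                        # step 4: final context-aware word vectors

# Usage: 4 "words", model dimension 8 (assumed sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated vector per word
```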

Multi-head attention: perform self-attention several times in parallel (see the sketch after this list):

  1. take the input word vectors,
  2. project each word into Q, K, V vectors,
    1. split each Q, K, V vector into as many parts as there are heads,
  3. compute the attention matrix for each head,
  4. generate final word vectors for each head,
  5. combine the heads' outputs back together.
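
A self-contained NumPy sketch of the multi-head version. Splitting into heads is done by reshaping the projected Q, K, V; the final output projection `Wo` and the sizes (`d_model=8`, `n_heads=2`) are assumptions for illustration, following the standard Transformer recipe rather than anything stated in the video.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # step 2: project, then split each Q, K, V into n_heads parts
    def project_and_split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = project_and_split(Wq)   # (n_heads, seq_len, d_head)
    K = project_and_split(Wk)
    V = project_and_split(Wv)

    # step 3: attention matrix per head (softmax over the last axis)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)

    # step 4: final word vectors per head
    heads = A @ V                                  # (n_heads, seq_len, d_head)

    # step 5: concatenate heads and mix with an output projection (assumed)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Usage: same 4 words, split across 2 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2)
print(out.shape)  # (4, 8): heads recombined into one vector per word
```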

With multiple heads, each head can attend to a different kind of relationship between words, giving the model a better understanding of context.