In self-attention, the attention mechanism is applied to the input sequence itself: each token attends to every other token in the sequence, including itself. Self-attention is the core operation that models like the Transformer use to capture dependencies within a sentence.
- Each token in a sequence considers all other tokens when computing its own representation (see the sketch after this list).
- This enables rich, context-aware embeddings, even for long inputs.
- Unlike Recurrent Neural Networks, which process tokens one step at a time, Transformers process the whole sequence in parallel, making them more efficient and scalable to train.
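
As a concrete illustration, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The sequence length, model dimension, random inputs, and the `self_attention` helper are illustrative assumptions, not part of any particular library or model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x has shape (seq_len, d_model); every token attends to every token, itself included."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # pairwise similarities, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)           # each row sums to 1: how strongly a token attends to the others
    return weights @ v                           # context-aware representation for each token

# Hypothetical sizes chosen only for the example.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                                 # (4, 8): one contextual vector per token
```

Because the attention scores form a full seq_len x seq_len matrix computed in one batched operation, all token pairs are handled at once rather than sequentially, which is the parallelism advantage noted above.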