Attention Mechanism
The attention mechanism revolutionized sequence modeling by allowing models to dynamically focus on relevant parts of the input when generating each output, eliminating the fixed-bottleneck limitation of encoder-decoder architectures.
Core Intuition
Traditional sequence-to-sequence (seq2seq):
- Encoder compresses entire input into single fixed-size vector (final hidden state)
- Decoder generates output based only on this compressed representation
- Problem: Single vector becomes information bottleneck for long sequences
Attention solution:
- When generating each output, attend to (focus on) relevant parts of input
- Different output positions can focus on different input positions
- Context is dynamically adapted for each output
Analogy: When translating a sentence, you don’t memorize the entire source and then translate. Instead, you look back at specific words in the source as you translate each target word.
The Attention Process
Step 1: Compute Alignment Scores
For each output position, compute how well it aligns with all input positions:
$$e_{ij} = \text{align}(s_i, h_j)$$
where:
- $s_i$: Decoder state at output position $i$
- $h_j$: Encoder state at input position $j$
- $\text{align}$: Alignment function (dot product, learned network, etc.)
Intuition: How well does encoder position $j$ match with decoder position $i$?
Step 2: Normalize to Attention Weights
Apply softmax to convert the scores into a probability distribution:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}$$
Properties:
- All weights sum to 1: $\sum_{j} \alpha_{ij} = 1$
- Higher scores get higher weights
- Weights represent “how much to focus on each input position”
Step 3: Compute Context Vector
Create a weighted sum of encoder states using the attention weights:
$$c_i = \sum_{j} \alpha_{ij} h_j$$
The context vector $c_i$ is a dynamic representation of the input, customized for each output position.
Step 4: Generate Output
Use the context vector (along with the decoder state) to generate the output:
$$y_i = g(s_i, c_i)$$
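To make the steps concrete, here is a small worked example with illustrative (made-up) scores for a three-position input:
$$
\begin{aligned}
e_{i\cdot} &= [2.0,\ 1.0,\ 0.1] \\
\alpha_{i\cdot} &= \mathrm{softmax}(e_{i\cdot}) \approx [0.66,\ 0.24,\ 0.10] \\
c_i &\approx 0.66\, h_1 + 0.24\, h_2 + 0.10\, h_3
\end{aligned}
$$
Most of the context vector comes from $h_1$, the encoder state the decoder is currently focusing on.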
Implementation Example
```python
import numpy as np

def attention(decoder_state, encoder_states):
    """
    Basic dot-product attention mechanism.
    Args:
        decoder_state: (d_model,)
        encoder_states: (T, d_model)
    Returns:
        context: (d_model,)
        weights: (T,)
    """
    # 1. Compute alignment scores: dot product of the decoder state
    #    with every encoder state
    scores = encoder_states @ decoder_state        # (T,)

    # 2. Normalize to attention weights (numerically stable softmax)
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()        # (T,)

    # 3. Compute context vector: weighted sum of encoder states
    context = weights @ encoder_states             # (d_model,)

    return context, weights
```
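A quick usage sketch with random placeholder states (the shapes are what matter here, not the values):

```python
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # T=5 input positions, d_model=8
decoder_state = rng.normal(size=(8,))      # current decoder state

context, weights = attention(decoder_state, encoder_states)
print(weights.shape, round(weights.sum(), 3))   # (5,) 1.0
print(context.shape)                            # (8,)
```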
Types of Alignment Functions
The alignment function can take different forms:
Dot Product (Multiplicative Attention)
$$\text{align}(s_i, h_j) = s_i^\top h_j$$
Advantages: Fast, no parameters
Disadvantages: Requires same dimensionality for decoder and encoder states
Additive Attention (Bahdanau Attention)
$$\text{align}(s_i, h_j) = v_a^\top \tanh(W_a s_i + U_a h_j)$$
Advantages: More expressive, handles different dimensions
Disadvantages: More parameters, slower computation
Scaled Dot Product (Transformer)
$$\text{align}(s_i, h_j) = \frac{s_i^\top h_j}{\sqrt{d_k}}$$
Advantages: Fast, stable for large dimensions
Used in: Transformers (see Scaled Dot-Product Attention)
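A minimal sketch of the three score functions in NumPy. In the additive form, `W_a`, `U_a`, and `v_a` stand in for learned parameters; the names are placeholders for illustration:

```python
import numpy as np

def dot_product_score(s, h):
    # Multiplicative attention: requires s and h to have the same dimension
    return s @ h

def scaled_dot_product_score(s, h):
    # Transformer-style scaling by sqrt(d_k) keeps scores well-behaved as dimension grows
    return (s @ h) / np.sqrt(h.shape[-1])

def additive_score(s, h, W_a, U_a, v_a):
    # Bahdanau-style additive attention; works even if s and h have different dimensions
    return v_a @ np.tanh(W_a @ s + U_a @ h)
```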
Attention Visualization
Attention weights form a matrix showing which input positions the model focuses on for each output:
```
                 Input positions (j)
              the   cat   sat   on    mat
Output  The   0.7   0.2   0.0   0.0   0.1
(i)     cat   0.1   0.8   0.0   0.0   0.1
        sat   0.0   0.1   0.7   0.1   0.1
        on    0.0   0.0   0.1   0.8   0.1
        mat   0.0   0.0   0.0   0.2   0.8
```
Each row shows the attention distribution for one output word. Higher values indicate stronger focus.
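One way to assemble such a matrix is to call the `attention` function defined above once per decoder state; the states below are random placeholders rather than outputs of a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))    # one state per input word
decoder_states = rng.normal(size=(5, 8))    # one state per output step

# Row i holds the attention distribution used when generating output i
attn_matrix = np.stack([attention(s, encoder_states)[1] for s in decoder_states])
print(attn_matrix.shape)          # (5, 5)
print(attn_matrix.sum(axis=1))    # each row sums to 1
```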
Key Advantages
- Direct connections: Every output can directly access every input position
- No information bottleneck: Context vector adapts dynamically for each output
- Better gradient flow: Direct paths for gradients between distant positions
- Interpretability: Visualize attention weights to understand model behavior
- Variable-length handling: Works for any input/output length combination
From Cross-Attention to Self-Attention
The attention mechanism described above is encoder-decoder attention (also called cross-attention):
- Query comes from decoder
- Keys and values come from encoder
- Connects two different sequences
Self-attention (used in transformers) is when a sequence attends to itself:
- Query, key, and value all come from the same sequence
- Each position attends to all positions (including itself)
- Enables rich contextual representations
See Scaled Dot-Product Attention for the self-attention formulation used in transformers.
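As a rough sketch of the self-attention case (single head, and without the learned query/key/value projection matrices a real transformer layer would use):

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention where queries, keys, and values all come from X: (T, d)."""
    d = X.shape[-1]
    scores = (X @ X.T) / np.sqrt(d)                   # (T, T): every position scored against every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax over key positions
    return weights @ X                                # (T, d): contextualized representation per position

X = np.random.default_rng(0).normal(size=(4, 8))      # toy sequence of 4 positions
print(self_attention(X).shape)                        # (4, 8)
```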
Key Takeaways
- Attention allows dynamic focus on relevant input parts for each output
- Compute alignment scores → normalize to weights → weighted sum of encoder states
- Context vector is dynamically adapted for each output position
- Attention weights provide interpretability
- Solves RNN’s information bottleneck problem
- Enables direct connections between any two positions in $O(1)$ operations
Related Concepts
- RNN Limitations - The problem attention solves
- Scaled Dot-Product Attention - The transformer attention formulation
- Multi-Head Attention - Running multiple attention mechanisms in parallel
References
Key Papers:
- Bahdanau et al. (2014): Neural Machine Translation by Jointly Learning to Align and Translate - Original attention mechanism
- Luong et al. (2015): Effective Approaches to Attention-based Neural Machine Translation - Different attention variants
Learning Resources:
- Jay Alammar: Visualizing A Neural Machine Translation Model
- Lilian Weng: Attention? Attention!
- Distill: Attention and Augmented Recurrent Neural Networks