
Attention Mechanism

The attention mechanism revolutionized sequence modeling by allowing models to dynamically focus on relevant parts of the input when generating each output, eliminating the fixed-bottleneck limitation of encoder-decoder architectures.

Core Intuition

Traditional sequence-to-sequence (seq2seq):

  • Encoder compresses entire input into single fixed-size vector (final hidden state)
  • Decoder generates output based only on this compressed representation
  • Problem: Single vector becomes information bottleneck for long sequences

Attention solution:

  • When generating each output, attend to (focus on) relevant parts of input
  • Different output positions can focus on different input positions
  • Context is dynamically adapted for each output

Analogy: When translating a sentence, you don’t memorize the entire source and then translate. Instead, you look back at specific words in the source as you translate each target word.

The Attention Process

Step 1: Compute Alignment Scores

For each output position, compute how well it aligns with all input positions:

e_{ij} = a(s_{i-1}, h_j)

where:

  • s_{i-1}: Decoder state at position i-1
  • h_j: Encoder state at position j
  • a(\cdot): Alignment function (dot product, learned network, etc.)

Intuition: How well does encoder position j match with decoder position i?
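
A minimal NumPy sketch of this step, assuming dot-product alignment and illustrative shapes (T input positions, hidden size d_model):

import numpy as np

T, d_model = 5, 8
encoder_states = np.random.randn(T, d_model)   # h_1 ... h_T
decoder_state = np.random.randn(d_model)       # s_{i-1}

# e_{ij} = a(s_{i-1}, h_j) with dot-product alignment
alignment_scores = encoder_states @ decoder_state   # shape (T,)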

Step 2: Normalize to Attention Weights

Apply softmax to convert the scores into a probability distribution:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}

Properties:

  • All weights sum to 1: \sum_{j=1}^{T} \alpha_{ij} = 1
  • Higher scores get higher weights
  • Weights represent “how much to focus on each input position”
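
Continuing the NumPy sketch from Step 1, a numerically stable softmax (one possible implementation, not the only one):

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

attention_weights = softmax(alignment_scores)   # shape (T,), sums to 1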

Step 3: Compute Context Vector

Create weighted sum of encoder states using attention weights:

c_i = \sum_{j=1}^{T} \alpha_{ij} h_j

The context vector c_i is a dynamic representation of the input, customized for each output position.
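
In the same sketch, the context vector is a single matrix-vector product:

# c_i = sum_j alpha_{ij} * h_j  (weighted sum over input positions)
context_vector = attention_weights @ encoder_states   # shape (d_model,)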

Step 4: Generate Output

Use context vector (along with decoder state) to generate output:

y_i = f(s_i, c_i)
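
The exact form of f depends on the architecture. One common choice, shown here as an assumption rather than a fixed rule, is to concatenate the decoder state with the context vector and project to the output vocabulary (reusing the softmax and variables from the sketch above):

vocab_size = 1000                                  # illustrative
decoder_state_i = np.random.randn(d_model)         # s_i, current decoder state
W_out = np.random.randn(vocab_size, 2 * d_model)   # learned in practice

logits = W_out @ np.concatenate([decoder_state_i, context_vector])
y_i = softmax(logits)                              # distribution over output tokens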

Implementation Example

import numpy as np

def attention(decoder_state, encoder_states):
    """
    Basic attention mechanism

    Args:
        decoder_state: (d_model,)
        encoder_states: (T, d_model)

    Returns:
        context: (d_model,)
        weights: (T,)
    """
    # 1. Compute alignment scores (dot-product alignment)
    scores = []
    for h_j in encoder_states:
        score = np.dot(decoder_state, h_j)
        scores.append(score)
    scores = np.array(scores)

    # 2. Normalize to attention weights (softmax)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # (T,)

    # 3. Compute context vector
    context = sum(weight * h_j for weight, h_j in zip(weights, encoder_states))

    return context, weights
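
For example, with random inputs (shapes chosen only for illustration):

T, d_model = 5, 8
encoder_states = np.random.randn(T, d_model)
decoder_state = np.random.randn(d_model)

context, weights = attention(decoder_state, encoder_states)
print(context.shape, weights.shape, weights.sum())   # (8,) (5,) ~1.0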

Types of Alignment Functions

The alignment function a(s, h) can take different forms:

Dot Product (Multiplicative Attention)

a(s, h) = s^T h

Advantages: Fast, no parameters
Disadvantages: Requires s and h to have the same dimensionality

Additive Attention (Bahdanau Attention)

a(s, h) = v^T \tanh(W_1 s + W_2 h)

Advantages: More expressive, handles different dimensions
Disadvantages: More parameters, slower computation

Scaled Dot Product (Transformer)

a(s, h) = \frac{s^T h}{\sqrt{d_k}}

Advantages: Fast, stable for large dimensions
Used in: Transformers (see Scaled Dot-Product Attention)
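
As a rough NumPy sketch of how the three alignment functions differ (W1, W2, and v stand in for learned parameters; shapes are assumptions):

import numpy as np

def dot_product_score(s, h):
    # Requires s and h to have the same dimensionality
    return s @ h

def additive_score(s, h, W1, W2, v):
    # Bahdanau-style: project both states, combine with tanh, score with v
    return v @ np.tanh(W1 @ s + W2 @ h)

def scaled_dot_product_score(s, h):
    # Dot product scaled by sqrt(d_k) to keep scores in a stable range
    d_k = h.shape[-1]
    return (s @ h) / np.sqrt(d_k)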

Attention Visualization

Attention weights form a matrix showing which input positions the model focuses on for each output:

                        Input positions (j)
                   the    cat    sat    on     mat
Output (i)   The   0.7    0.2    0.0    0.0    0.1
             cat   0.1    0.8    0.0    0.0    0.1
             sat   0.0    0.1    0.7    0.1    0.1
             on    0.0    0.0    0.1    0.8    0.1
             mat   0.0    0.0    0.0    0.2    0.8

Each row shows the attention distribution for one output word. Higher values indicate stronger focus.
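
One way to render such a matrix as a heatmap (a sketch assuming NumPy and matplotlib; the values are the ones from the table above):

import numpy as np
import matplotlib.pyplot as plt

input_tokens = ["the", "cat", "sat", "on", "mat"]
output_tokens = ["The", "cat", "sat", "on", "mat"]
attn = np.array([
    [0.7, 0.2, 0.0, 0.0, 0.1],
    [0.1, 0.8, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.7, 0.1, 0.1],
    [0.0, 0.0, 0.1, 0.8, 0.1],
    [0.0, 0.0, 0.0, 0.2, 0.8],
])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(input_tokens)), input_tokens)
plt.yticks(range(len(output_tokens)), output_tokens)
plt.xlabel("Input position (j)")
plt.ylabel("Output position (i)")
plt.colorbar(label="Attention weight")
plt.show()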

Key Advantages

  1. Direct connections: Every output can directly access every input position
  2. No information bottleneck: Context vector adapts dynamically for each output
  3. Better gradient flow: Direct paths for gradients between distant positions
  4. Interpretability: Visualize attention weights to understand model behavior
  5. Variable-length handling: Works for any input/output length combination

From Cross-Attention to Self-Attention

The attention mechanism described above is encoder-decoder attention (also called cross-attention):

  • Query comes from decoder
  • Keys and values come from encoder
  • Connects two different sequences

Self-attention (used in transformers) is when a sequence attends to itself:

  • Query, key, and value all come from the same sequence
  • Each position attends to all positions (including itself)
  • Enables rich contextual representations

See Scaled Dot-Product Attention for the self-attention formulation used in transformers.
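
For orientation, here is a compact single-head self-attention sketch using scaled dot-product scores (NumPy; Wq, Wk, and Wv stand in for learned projection matrices):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d_model); every position attends to every position, including itself
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # all derived from the same sequence
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T, d_v) contextual representations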

Key Takeaways

  • Attention allows dynamic focus on relevant input parts for each output
  • Compute alignment scores → normalize to weights → weighted sum of encoder states
  • Context vector is dynamically adapted for each output position
  • Attention weights provide interpretability
  • Solves RNN’s information bottleneck problem
  • Enables direct connections between any two positions in O(1) operations
