
Attention Mechanism

The attention mechanism revolutionized sequence modeling by allowing models to dynamically focus on relevant parts of the input when generating each output, eliminating the fixed-bottleneck limitation of encoder-decoder architectures.

Core Intuition

Traditional sequence-to-sequence (seq2seq):

  • Encoder compresses entire input into single fixed-size vector (final hidden state)
  • Decoder generates output based only on this compressed representation
  • Problem: Single vector becomes information bottleneck for long sequences

Attention solution:

  • When generating each output, attend to (focus on) relevant parts of input
  • Different output positions can focus on different input positions
  • Context is dynamically adapted for each output

Analogy: When translating a sentence, you don’t memorize the entire source and then translate. Instead, you look back at specific words in the source as you translate each target word.

The Attention Process

Step 1: Compute Alignment Scores

For each output position, compute how well it aligns with all input positions:

e_{ij} = a(s_{i-1}, h_j)

where:

  • s_{i-1}: Decoder state at position i-1
  • h_j: Encoder state at position j
  • a(\cdot): Alignment function (dot product, learned network, etc.)

Intuition: How well does encoder position j match with decoder position i?
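
A minimal NumPy sketch of this step, assuming dot-product alignment and illustrative shapes (T input positions, hidden size d_model):

import numpy as np

T, d_model = 5, 8
encoder_states = np.random.randn(T, d_model)   # h_1 ... h_T
decoder_state = np.random.randn(d_model)       # s_{i-1}

# e_{ij} = a(s_{i-1}, h_j) with dot-product alignment
alignment_scores = encoder_states @ decoder_state   # shape (T,)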

Step 2: Normalize to Attention Weights

Apply softmax to convert the scores into a probability distribution:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}

Properties:

  • All weights sum to 1: \sum_{j=1}^{T} \alpha_{ij} = 1
  • Higher scores get higher weights
  • Weights represent “how much to focus on each input position”
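
Continuing the NumPy sketch from Step 1, a numerically stable softmax (one possible implementation, not the only one):

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

attention_weights = softmax(alignment_scores)   # shape (T,), sums to 1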

Step 3: Compute Context Vector

Create weighted sum of encoder states using attention weights:

c_i = \sum_{j=1}^{T} \alpha_{ij} h_j

The context vector c_i is a dynamic representation of the input, customized for each output position.
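
In the same sketch, the context vector is a single matrix-vector product:

# c_i = sum_j alpha_{ij} * h_j  (weighted sum over input positions)
context_vector = attention_weights @ encoder_states   # shape (d_model,)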

Step 4: Generate Output

Use context vector (along with decoder state) to generate output:

y_i = f(s_i, c_i)
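
The exact form of f depends on the architecture. One common choice, shown here as an assumption rather than a fixed rule, is to concatenate the decoder state with the context vector and project to the output vocabulary (reusing the softmax and variables from the sketch above):

vocab_size = 1000                                  # illustrative
decoder_state_i = np.random.randn(d_model)         # s_i, current decoder state
W_out = np.random.randn(vocab_size, 2 * d_model)   # learned in practice

logits = W_out @ np.concatenate([decoder_state_i, context_vector])
y_i = softmax(logits)                              # distribution over output tokens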

Implementation Example

import numpy as np

def attention(decoder_state, encoder_states):
    """
    Basic attention mechanism

    Args:
        decoder_state: (d_model,)
        encoder_states: (T, d_model)

    Returns:
        context: (d_model,)
        weights: (T,)
    """
    # 1. Compute alignment scores (dot-product alignment)
    scores = []
    for h_j in encoder_states:
        score = np.dot(decoder_state, h_j)
        scores.append(score)
    scores = np.array(scores)

    # 2. Normalize to attention weights (softmax)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # (T,)

    # 3. Compute context vector
    context = sum(weight * h_j for weight, h_j in zip(weights, encoder_states))

    return context, weights
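
For example, with random inputs (shapes chosen only for illustration):

T, d_model = 5, 8
encoder_states = np.random.randn(T, d_model)
decoder_state = np.random.randn(d_model)

context, weights = attention(decoder_state, encoder_states)
print(context.shape, weights.shape, weights.sum())   # (8,) (5,) ~1.0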

Types of Alignment Functions

The alignment function a(s, h) can take different forms:

Dot Product (Multiplicative Attention)

a(s, h) = s^T h

Advantages: Fast, no parameters
Disadvantages: Requires s and h to have the same dimensionality

Additive Attention (Bahdanau Attention)

a(s, h) = v^T \tanh(W_1 s + W_2 h)

Advantages: More expressive, handles different dimensions
Disadvantages: More parameters, slower computation

Scaled Dot Product (Transformer)

a(s, h) = \frac{s^T h}{\sqrt{d_k}}

Advantages: Fast, stable for large dimensions
Used in: Transformers (see Scaled Dot-Product Attention)
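
As a rough NumPy sketch of how the three alignment functions differ (W1, W2, and v stand in for learned parameters; shapes are assumptions):

import numpy as np

def dot_product_score(s, h):
    # Requires s and h to have the same dimensionality
    return s @ h

def additive_score(s, h, W1, W2, v):
    # Bahdanau-style: project both states, combine with tanh, score with v
    return v @ np.tanh(W1 @ s + W2 @ h)

def scaled_dot_product_score(s, h):
    # Dot product scaled by sqrt(d_k) to keep scores in a stable range
    d_k = h.shape[-1]
    return (s @ h) / np.sqrt(d_k)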

Attention Visualization

Attention weights form a matrix showing which input positions the model focuses on for each output:

                        Input positions (j)
                   the    cat    sat    on     mat
Output (i)   The   0.7    0.2    0.0    0.0    0.1
             cat   0.1    0.8    0.0    0.0    0.1
             sat   0.0    0.1    0.7    0.1    0.1
             on    0.0    0.0    0.1    0.8    0.1
             mat   0.0    0.0    0.0    0.2    0.8

Each row shows the attention distribution for one output word. Higher values indicate stronger focus.
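
One way to render such a matrix as a heatmap (a sketch assuming NumPy and matplotlib; the values are the ones from the table above):

import numpy as np
import matplotlib.pyplot as plt

input_tokens = ["the", "cat", "sat", "on", "mat"]
output_tokens = ["The", "cat", "sat", "on", "mat"]
attn = np.array([
    [0.7, 0.2, 0.0, 0.0, 0.1],
    [0.1, 0.8, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.7, 0.1, 0.1],
    [0.0, 0.0, 0.1, 0.8, 0.1],
    [0.0, 0.0, 0.0, 0.2, 0.8],
])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(input_tokens)), input_tokens)
plt.yticks(range(len(output_tokens)), output_tokens)
plt.xlabel("Input position (j)")
plt.ylabel("Output position (i)")
plt.colorbar(label="Attention weight")
plt.show()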

Key Advantages

  1. Direct connections: Every output can directly access every input position
  2. No information bottleneck: Context vector adapts dynamically for each output
  3. Better gradient flow: Direct paths for gradients between distant positions
  4. Interpretability: Visualize attention weights to understand model behavior
  5. Variable-length handling: Works for any input/output length combination

From Cross-Attention to Self-Attention

The attention mechanism described above is encoder-decoder attention (also called cross-attention):

  • Query comes from decoder
  • Keys and values come from encoder
  • Connects two different sequences

Self-attention (used in transformers) is when a sequence attends to itself:

  • Query, key, and value all come from the same sequence
  • Each position attends to all positions (including itself)
  • Enables rich contextual representations

See Scaled Dot-Product Attention for the self-attention formulation used in transformers.
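
For orientation, here is a compact single-head self-attention sketch using scaled dot-product scores (NumPy; Wq, Wk, and Wv stand in for learned projection matrices):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d_model); every position attends to every position, including itself
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # all derived from the same sequence
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T, d_v) contextual representations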

Key Takeaways

  • Attention allows dynamic focus on relevant input parts for each output
  • Compute alignment scores → normalize to weights → weighted sum of encoder states
  • Context vector is dynamically adapted for each output position
  • Attention weights provide interpretability
  • Solves RNN’s information bottleneck problem
  • Enables direct connections between any two positions in O(1) operations
