
Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Published: NeurIPS 2017
Paper: arxiv.org/abs/1706.03762

The paper that introduced the Transformer architecture and revolutionized sequence modeling by showing that attention mechanisms alone, without recurrence or convolution, can achieve state-of-the-art results on machine translation.

Core Innovation

Key insight: Replace recurrent and convolutional layers entirely with attention mechanisms.

Previous paradigm:

  • Sequence models used RNNs/LSTMs (sequential processing)
  • CNNs for parallelization but limited context
  • Attention as add-on to recurrence

Transformer paradigm:

  • Pure attention architecture (no recurrence, no convolution)
  • Full parallelization during training
  • Direct connections between all positions
  • Scales better to long sequences

Architecture Overview

The Transformer is an encoder-decoder architecture built from stacks of identical layers:

Encoder

Each encoder layer has two sub-layers:

  1. Multi-head self-attention: Each position attends to all positions
  2. Position-wise feed-forward network: Two linear transformations with ReLU

Both sub-layers use:

  • Residual connections: $\text{LayerNorm}(x + \text{Sublayer}(x))$
  • Layer normalization

6 encoder layers in the base model.
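
A minimal sketch of this post-norm residual pattern in NumPy (the learned gain/bias of layer normalization and dropout are omitted for brevity; the function names are my own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last dimension to zero mean and unit variance
    (learned gain/bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))
```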

Decoder

Each decoder layer has three sub-layers:

  1. Masked multi-head self-attention: Prevents attending to future positions
  2. Multi-head cross-attention: Attends to encoder output
  3. Position-wise feed-forward network

All sub-layers use residual connections and layer normalization.

6 decoder layers in the base model.

Key Components

1. Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Why scaling? For large $d_k$, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients. Scaling by $\frac{1}{\sqrt{d_k}}$ counteracts this.

See: Scaled Dot-Product Attention
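
A minimal NumPy sketch of this formula (function and argument names are my own; a boolean mask argument is included because the decoder needs one):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v).
    Returns the attended values and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)
    return weights @ V, weights
```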

2. Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is computed with separate learned projections:

$$\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

Benefit: Allows the model to jointly attend to information from different representation subspaces at different positions.

See: Multi-Head Attention
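
A sketch of multi-head attention built on the scaled_dot_product_attention sketch above (projection shapes follow the base configuration with $d_k = d_v = d_{\text{model}}/h$; the explicit loop over heads favors clarity over speed):

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: (h, d_model, d_k) per-head projections; W_o: (h * d_k, d_model).
    Uses scaled_dot_product_attention from the previous sketch."""
    heads = []
    for i in range(W_q.shape[0]):                        # one attention head at a time
        out, _ = scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_o          # concat heads, then project

# Example shapes for the base model: d_model=512, h=8, d_k=d_v=64
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 512))                       # 10 positions, d_model=512
W_q = W_k = W_v = rng.standard_normal((8, 512, 64)) * 0.05
W_o = rng.standard_normal((8 * 64, 512)) * 0.05
y = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)    # self-attention: (10, 512)
```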

3. Positional Encoding

Since transformers have no recurrence or convolution, they have no inherent notion of position. The paper uses sinusoidal positional encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Properties:

  • Each dimension has different frequency
  • Allows model to learn relative positions
  • Can extrapolate to longer sequences than seen during training
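
A NumPy sketch of these sinusoids (assuming an even $d_{\text{model}}$; the function name is my own):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe                                            # added to the token embeddings
```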

4. Position-wise Feed-Forward Networks

Applied to each position separately and identically:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Two linear transformations with ReLU activation in between.

Dimensions: Input and output are $d_{\text{model}} = 512$; the inner layer is $d_{\text{ff}} = 2048$.
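
A sketch of the position-wise FFN, applied independently to each row (position) of $x$:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2.
    x: (n, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2        # ReLU between two linear maps
```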

Model Configurations

Transformer Base

  • $d_{\text{model}} = 512$ (model dimension)
  • $d_{\text{ff}} = 2048$ (feed-forward dimension)
  • $h = 8$ (attention heads)
  • $d_k = d_v = 64$ (dimension per head)
  • 6 encoder layers, 6 decoder layers
  • Parameters: ~65M
  • Training: 12 hours on 8 P100 GPUs

Transformer Large

  • $d_{\text{model}} = 1024$
  • $d_{\text{ff}} = 4096$
  • $h = 16$
  • Parameters: ~213M
  • Training: 3.5 days on 8 P100 GPUs
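
The two configurations differ only in width; the depth and the relation $d_k = d_v = d_{\text{model}}/h$ stay the same. A small sketch capturing this (class and field names are my own):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    d_model: int
    d_ff: int
    heads: int
    layers: int = 6                          # 6 encoder and 6 decoder layers in both models

    @property
    def d_head(self) -> int:
        return self.d_model // self.heads    # d_k = d_v = d_model / h

base = TransformerConfig(d_model=512, d_ff=2048, heads=8)      # d_head = 64
large = TransformerConfig(d_model=1024, d_ff=4096, heads=16)   # d_head = 64
```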

Three Types of Attention

The transformer uses attention in three different ways:

  1. Encoder self-attention: Encoder attends to all positions in input (bidirectional)
  2. Decoder masked self-attention: Decoder attends to earlier positions only (causal/autoregressive)
  3. Encoder-decoder attention: Decoder attends to encoder output (cross-attention)
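
Mechanically, the three differ only in which positions are masked. A sketch of the causal mask used by decoder self-attention, in the boolean format expected by the scaled_dot_product_attention sketch above (True = may attend):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((n, n), dtype=bool))

# Encoder self-attention and cross-attention pass mask=None (all positions visible);
# decoder self-attention passes causal_mask(n) to hide future positions.
```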

Training Details

Task: Machine translation (English-to-German, English-to-French)

Optimization:

  • Adam optimizer with a custom learning rate schedule (sketched below):
    • Warmup: linear increase for the first 4000 steps
    • Decay: $\text{lr} = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup}^{-1.5})$
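
A sketch of this schedule as a plain function (step counting starts at 1):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warmup for `warmup` steps, then decay proportional to step**-0.5."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# At step == warmup the two terms meet: both equal warmup**-0.5, scaled by d_model**-0.5.
```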

Regularization:

  • Dropout rate 0.1 applied to:
    • The output of each sub-layer, before it is added to the sub-layer input and normalized
    • The sums of the embeddings and the positional encodings
  • Label smoothing with $\epsilon_{ls} = 0.1$ (sketched after this list)
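
A sketch of label smoothing as it is commonly implemented: each one-hot target keeps probability $1 - \epsilon$ on the true token and spreads $\epsilon$ uniformly over the rest of the vocabulary (function name and the exact smoothing split are my own simplification):

```python
import numpy as np

def smooth_labels(targets, vocab_size, eps=0.1):
    """targets: (n,) integer token ids. Returns (n, vocab_size) smoothed distributions."""
    dist = np.full((len(targets), vocab_size), eps / (vocab_size - 1))
    dist[np.arange(len(targets)), targets] = 1.0 - eps
    return dist                              # used as the soft target for cross-entropy
```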

Data:

  • WMT 2014 English-German: 4.5M sentence pairs
  • WMT 2014 English-French: 36M sentence pairs
  • Byte-pair encoding with a shared ~37K-token vocabulary (EN-DE); a 32K word-piece vocabulary for EN-FR

Results

Machine Translation Performance

English-to-German (WMT 2014):

  • Transformer (base): 27.3 BLEU (previous SOTA: 26.4)
  • Transformer (large): 28.4 BLEU (new SOTA)

English-to-French (WMT 2014):

  • Transformer (large): 41.0 BLEU (new SOTA, previous: 40.4)

Training cost:

  • 3.5 days on 8 GPUs vs weeks for previous models
  • More parallelizable than RNN/LSTM-based models

Computational Efficiency

| Model | Params | Training Cost | Test BLEU (EN-DE) |
|-------|--------|---------------|-------------------|
| ByteNet | - | - | 23.75 |
| GNMT + RL | 278M | - | 24.60 |
| ConvS2S | 252M | - | 25.16 |
| Transformer (base) | 65M | 0.4 days | 27.3 |
| Transformer (large) | 213M | 3.5 days | 28.4 |

Transformer achieved better results with less training time than previous architectures.

Advantages Over RNNs/CNNs

Computational Complexity Per Layer

| Operation | Complexity | Sequential Ops | Max Path Length |
|-----------|------------|----------------|-----------------|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |

Key advantages:

  1. Parallelization: $O(1)$ sequential operations vs $O(n)$ for RNNs
  2. Direct connections: $O(1)$ path length between any two positions vs $O(n)$ for RNNs
  3. Interpretability: Attention weights show what the model focuses on

Attention Visualizations

The paper shows attention patterns learned by different heads:

  • Some heads focus on syntactic relationships (subject-verb, verb-object)
  • Others attend to local context (nearby words)
  • Some capture long-range dependencies
  • Different heads specialize in different linguistic phenomena

This multi-head design allows the model to capture diverse patterns simultaneously.

Impact and Legacy

This paper fundamentally changed deep learning:

Immediate impact:

  • State-of-the-art on machine translation
  • Much faster training than RNN-based models
  • Better handling of long-range dependencies

Long-term impact:

  • Foundation for BERT (2018), GPT series (2018-present)
  • Extended to computer vision (ViT, 2020)
  • Now the dominant architecture in NLP
  • Enabled large language models (100B+ parameters)
  • Sparked research into efficient attention variants

One of the most cited AI papers of the past decade, with well over 100,000 citations.

Limitations and Future Work

Quadratic complexity: The $O(n^2)$ attention cost in sequence length limits application to very long sequences

  • Led to efficient variants and implementations: Longformer, Reformer, Performer, FlashAttention

Fixed positional encoding: Learned or relative position representations might be more flexible

  • Many later transformers (e.g., BERT, GPT) use learned positional embeddings; more recent models often favor relative or rotary encodings

Limited inductive biases: Unlike CNNs (translation invariance) or RNNs (sequential bias)

  • Requires more data, but is more flexible

Key Takeaways

  • Transformers use pure attention with no recurrence or convolution
  • Three types of attention: encoder self-attention, decoder masked self-attention, encoder-decoder cross-attention
  • Multi-head attention captures diverse patterns in parallel
  • Positional encoding injects position information
  • Parallel processing enables faster training than RNNs
  • Direct connections between all positions shorten gradient paths and ease learning of long-range dependencies
  • Achieved state-of-the-art on machine translation with less training time
  • Foundation of modern NLP: BERT, GPT, T5, and beyond

References

Original Paper:

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762