Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Published: NeurIPS 2017
Paper: arxiv.org/abs/1706.03762
The paper that introduced the Transformer architecture and revolutionized sequence modeling by showing that attention mechanisms alone, without recurrence or convolution, can achieve state-of-the-art results on machine translation.
Core Innovation
Key insight: Replace recurrent and convolutional layers entirely with attention mechanisms.
Previous paradigm:
- Sequence models used RNNs/LSTMs (sequential processing)
- CNNs for parallelization but limited context
- Attention as add-on to recurrence
Transformer paradigm:
- Pure attention architecture (no recurrence, no convolution)
- Full parallelization during training
- Direct connections between all positions
- Scales better to long sequences
Architecture Overview
The transformer consists of an encoder-decoder structure with stacked layers:
Encoder
Each encoder layer has two sub-layers:
- Multi-head self-attention: Each position attends to all positions
- Position-wise feed-forward network: Two linear transformations with ReLU
Both sub-layers use:
- Residual connections
- Layer normalization
6 encoder layers in the base model.
Decoder
Each decoder layer has three sub-layers:
- Masked multi-head self-attention: Prevents attending to future positions
- Multi-head cross-attention: Attends to encoder output
- Position-wise feed-forward network
All sub-layers use residual connections and layer normalization.
6 decoder layers in the base model.
Key Components
1. Scaled Dot-Product Attention
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Why scaling? For large $d_k$, dot products grow large in magnitude, pushing the softmax into regions with tiny gradients. Scaling by $\frac{1}{\sqrt{d_k}}$ prevents this.
See: Scaled Dot-Product Attention
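A minimal NumPy sketch of the formula above (the function name, shape conventions, and optional `mask` argument are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity scores, scaled by 1/sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of values
```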
2. Multi-Head Attention
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$
where each head is computed with separate learned projections:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Benefit: Allows model to jointly attend to information from different representation subspaces.
See: Multi-Head Attention
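A sketch of the multi-head computation at the base-model sizes ($d_{\text{model}} = 512$, $h = 8$, $d_k = 64$), reusing `scaled_dot_product_attention` from the sketch above; the random matrices stand in for the learned projections $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$:

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h  # 64 per head in the base model

rng = np.random.default_rng(0)
# Learned in the paper; random placeholders here.
W_q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_k = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_v = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_o = rng.standard_normal((h * d_k, d_model)) / np.sqrt(h * d_k)

def multi_head_attention(Q, K, V):
    """Q, K, V: (seq_len, d_model) -> (seq_len, d_model)."""
    heads = []
    for i in range(h):
        # Project into head i's subspace, then run scaled dot-product attention there.
        heads.append(scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]))
    # Concatenate the h heads (back to d_model columns) and mix them with W_o.
    return np.concatenate(heads, axis=-1) @ W_o
```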
3. Positional Encoding
Since transformers have no recurrence or convolution, they have no inherent notion of position. The paper uses sinusoidal positional encodings:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Properties:
- Each dimension has different frequency
- Allows model to learn relative positions
- Can extrapolate to longer sequences than seen during training
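A NumPy sketch of the sinusoidal encoding above (assumes an even $d_{\text{model}}$; the function name is my own):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)  # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

# Added to the token embeddings before the first encoder/decoder layer.
pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
```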
4. Position-wise Feed-Forward Networks
Applied to each position separately and identically:
$$\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$
Two linear transformations with ReLU activation in between.
Dimensions: Input and output are $d_{\text{model}} = 512$, inner layer is $d_{ff} = 2048$.
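A sketch of the position-wise FFN at the base-model sizes; the random weights stand in for the learned $W_1, b_1, W_2, b_2$:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = 0.02 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """x: (seq_len, d_model). The same two linear maps are applied at every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU between the two projections
```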
Model Configurations
Transformer Base
- $d_{\text{model}} = 512$ (model dimension)
- $d_{ff} = 2048$ (feed-forward dimension)
- $h = 8$ (attention heads)
- $d_k = d_v = 64$ (dimension per head)
- 6 encoder layers, 6 decoder layers
- Parameters: ~65M
- Training: 12 hours on 8 P100 GPUs
Transformer Large
- Parameters: ~213M
- Training: 3.5 days on 8 P100 GPUs
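The two configurations as a plain-Python summary (the dict layout is my own; the big-model values $d_{\text{model}} = 1024$, $d_{ff} = 4096$, $h = 16$ are taken from the paper's Table 3):

```python
# Hyperparameters reported in the paper; structure is illustrative only.
TRANSFORMER_CONFIGS = {
    "base": {
        "num_layers": 6,    # encoder layers (the decoder also has 6)
        "d_model": 512,
        "d_ff": 2048,
        "num_heads": 8,
        "d_k": 64,          # d_model / num_heads
        "dropout": 0.1,
        "params": "~65M",
    },
    "large": {              # called "big" in the paper
        "num_layers": 6,
        "d_model": 1024,
        "d_ff": 4096,
        "num_heads": 16,
        "d_k": 64,
        "dropout": 0.3,     # 0.1 was used for the EN-FR big model
        "params": "~213M",
    },
}
```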
Three Types of Attention
The transformer uses attention in three different ways:
- Encoder self-attention: Encoder attends to all positions in input (bidirectional)
- Decoder masked self-attention: Decoder attends to earlier positions only (causal/autoregressive)
- Encoder-decoder attention: Decoder attends to encoder output (cross-attention)
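The causal mask used by the decoder's masked self-attention can be written as a lower-triangular boolean matrix; it plugs into the `mask` argument of the attention sketch above (True = may attend):

```python
import numpy as np

def causal_mask(seq_len):
    """Position i may attend to positions 0..i and nothing later."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```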
Training Details
Task: Machine translation (English-to-German, English-to-French)
Optimization:
- Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) with a custom learning rate schedule: $lrate = d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup^{-1.5})$
- Warmup: Linear increase for the first 4000 steps
- Decay: Proportional to the inverse square root of the step number thereafter (a sketch of the schedule follows this list)
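The schedule transcribed as a Python function (argument names are my own):

```python
def transformer_lr(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step_num = max(step_num, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# Rises linearly for the first 4000 steps, peaks, then decays as 1/sqrt(step).
peak_lr = transformer_lr(4000)  # ~7e-4 for d_model = 512
```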
Regularization:
- Dropout rate 0.1 applied to:
  - Output of each sub-layer, before the residual connection
  - Sums of embeddings and positional encodings
- Label smoothing with $\epsilon_{ls} = 0.1$ (sketched below)
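A minimal sketch of label smoothing, assuming the common convention of spreading $\epsilon_{ls}$ uniformly over the vocabulary (function name and exact convention are assumptions on my part):

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """(1 - eps) on the true token, eps spread uniformly over all vocab entries."""
    one_hot = np.eye(vocab_size)[target_ids]        # (batch, vocab_size) one-hot targets
    return (1.0 - eps) * one_hot + eps / vocab_size

# Example: two target tokens from a toy 5-word vocabulary.
print(smooth_labels(np.array([2, 0]), vocab_size=5))
```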
Data:
- WMT 2014 English-German: 4.5M sentence pairs
- WMT 2014 English-French: 36M sentence pairs
- Byte-pair encoding with a shared ~37K-token vocabulary (EN-DE); a 32K word-piece vocabulary for EN-FR
Results
Machine Translation Performance
English-to-German (WMT 2014):
- Transformer (base): 27.3 BLEU (previous SOTA: 26.4)
- Transformer (large): 28.4 BLEU (new SOTA)
English-to-French (WMT 2014):
- Transformer (large): 41.0 BLEU (new SOTA, previous: 40.4)
Training cost:
- 3.5 days on 8 GPUs vs weeks for previous models
- More parallelizable than RNN/LSTM-based models
Computational Efficiency
| Model | Params | Training Cost | Test BLEU (EN-DE) |
|---|---|---|---|
| ByteNet | - | - | 23.75 |
| GNMT + RL | 278M | - | 24.60 |
| ConvS2S | 252M | - | 25.16 |
| Transformer (base) | 65M | 0.4 days | 27.3 |
| Transformer (large) | 213M | 3.5 days | 28.4 |
Transformer achieved better results with less training time than previous architectures.
Advantages Over RNNs/CNNs
Computational Complexity Per Layer
| Operation | Complexity | Sequential Ops | Max Path Length |
|---|---|---|---|
| Self-Attention | |||
| Recurrent | |||
| Convolutional |
Key advantages:
- Parallelization: $O(1)$ sequential operations vs $O(n)$ for RNNs
- Direct connections: $O(1)$ path length between any two positions vs $O(n)$ for RNNs
- Interpretability: Attention weights show what model focuses on
Attention Visualizations
The paper shows attention patterns learned by different heads:
- Some heads focus on syntactic relationships (subject-verb, verb-object)
- Others attend to local context (nearby words)
- Some capture long-range dependencies
- Different heads specialize in different linguistic phenomena
This multi-head design allows the model to capture diverse patterns simultaneously.
Impact and Legacy
This paper fundamentally changed deep learning:
Immediate impact:
- State-of-the-art on machine translation
- Much faster training than RNN-based models
- Better handling of long-range dependencies
Long-term impact:
- Foundation for BERT (2018), GPT series (2018-present)
- Extended to computer vision (ViT, 2020)
- Now the dominant architecture in NLP
- Enabled large language models (100B+ parameters)
- Sparked research into efficient attention variants
One of the most cited papers in machine learning, with well over 100,000 citations.
Limitations and Future Work
Quadratic complexity: $O(n^2)$ attention cost in sequence length limits application to very long sequences
- Led to efficient variants: Longformer, Reformer, Performer, Flash Attention
Fixed positional encoding: Learned embeddings might be more flexible
- Many later transformers (e.g., BERT, GPT-2) adopted learned positional embeddings
Limited inductive biases: Unlike CNNs (translation invariance) or RNNs (sequential bias)
- Requires more training data, but is more flexible
Key Takeaways
- Transformers use pure attention with no recurrence or convolution
- Three types of attention: encoder self-attention, decoder masked self-attention, encoder-decoder cross-attention
- Multi-head attention captures diverse patterns in parallel
- Positional encoding injects position information
- Parallel processing enables faster training than RNNs
- Direct connections between all positions avoid the vanishing-gradient problems of long recurrent paths
- Achieved state-of-the-art on machine translation with less training time
- Foundation of modern NLP: BERT, GPT, T5, and beyond
Related Concepts
- Attention Mechanism - General attention formulation
- Scaled Dot-Product Attention - The attention formula
- Multi-Head Attention - Multiple parallel attention heads
- RNN Limitations - Problems transformers solve
References
Original Paper:
- Vaswani et al., "Attention Is All You Need", NeurIPS 2017 - arxiv.org/abs/1706.03762
Learning Resources:
- Jay Alammar: The Illustrated Transformer
- Harvard NLP: The Annotated Transformer - Line-by-line implementation
- 3Blue1Brown: Visualizing Attention
Implementations:
Follow-up Papers:
- BERT: Pre-training of Deep Bidirectional Transformers
- GPT-2: Language Models are Unsupervised Multitask Learners
- Vision Transformer: An Image is Worth 16x16 Words