
Attention Is All You Need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Published: NeurIPS 2017
Paper: arxiv.org/abs/1706.03762

The paper that introduced the Transformer architecture and revolutionized sequence modeling by showing that attention mechanisms alone, without recurrence or convolution, can achieve state-of-the-art results on machine translation.

Core Innovation

Key insight: Replace recurrent and convolutional layers entirely with attention mechanisms.

Previous paradigm:

  • Sequence models used RNNs/LSTMs (sequential processing)
  • CNNs for parallelization but limited context
  • Attention as add-on to recurrence

Transformer paradigm:

  • Pure attention architecture (no recurrence, no convolution)
  • Full parallelization during training
  • Direct connections between all positions
  • Scales better to long sequences

Architecture Overview

The Transformer is an encoder-decoder architecture built from stacks of identical layers:

Encoder

Each encoder layer has two sub-layers:

  1. Multi-head self-attention: Each position attends to all positions
  2. Position-wise feed-forward network: Two linear transformations with ReLU

Both sub-layers use:

  • Residual connections: $\text{LayerNorm}(x + \text{Sublayer}(x))$
  • Layer normalization

6 encoder layers in the base model.
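
A minimal sketch of this post-norm residual pattern in NumPy (the learned gain/bias of layer normalization and dropout are omitted for brevity; the function names are my own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last dimension to zero mean and unit variance
    (learned gain/bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))
```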

Decoder

Each decoder layer has three sub-layers:

  1. Masked multi-head self-attention: Prevents attending to future positions
  2. Multi-head cross-attention: Attends to encoder output
  3. Position-wise feed-forward network

All sub-layers use residual connections and layer normalization.

6 decoder layers in the base model.

Key Components

1. Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Why scaling? For large $d_k$, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients. Scaling by $\frac{1}{\sqrt{d_k}}$ counteracts this.

See: Scaled Dot-Product Attention
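
A minimal NumPy sketch of this formula (function and argument names are my own; a boolean mask argument is included because the decoder needs one):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v).
    Returns the attended values and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)
    return weights @ V, weights
```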

2. Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is computed with separate learned projections:

$$\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

Benefit: Allows the model to jointly attend to information from different representation subspaces at different positions.

See: Multi-Head Attention
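
A sketch of multi-head attention built on the scaled_dot_product_attention sketch above (projection shapes follow the base configuration with $d_k = d_v = d_{\text{model}}/h$; the explicit loop over heads favors clarity over speed):

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: (h, d_model, d_k) per-head projections; W_o: (h * d_k, d_model).
    Uses scaled_dot_product_attention from the previous sketch."""
    heads = []
    for i in range(W_q.shape[0]):                        # one attention head at a time
        out, _ = scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_o          # concat heads, then project

# Example shapes for the base model: d_model=512, h=8, d_k=d_v=64
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 512))                       # 10 positions, d_model=512
W_q = W_k = W_v = rng.standard_normal((8, 512, 64)) * 0.05
W_o = rng.standard_normal((8 * 64, 512)) * 0.05
y = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)    # self-attention: (10, 512)
```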

3. Positional Encoding

Since transformers have no recurrence or convolution, they have no inherent notion of position. The paper uses sinusoidal positional encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Properties:

  • Each dimension has different frequency
  • Allows model to learn relative positions
  • Can extrapolate to longer sequences than seen during training
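
A NumPy sketch of these sinusoids (assuming an even $d_{\text{model}}$; the function name is my own):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe                                            # added to the token embeddings
```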

4. Position-wise Feed-Forward Networks

Applied to each position separately and identically:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Two linear transformations with ReLU activation in between.

Dimensions: Input and output are $d_{\text{model}} = 512$; the inner layer is $d_{\text{ff}} = 2048$.
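
A sketch of the position-wise FFN, applied independently to each row (position) of $x$:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2.
    x: (n, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2        # ReLU between two linear maps
```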

Model Configurations

Transformer Base

  • $d_{\text{model}} = 512$ (model dimension)
  • $d_{\text{ff}} = 2048$ (feed-forward dimension)
  • $h = 8$ (attention heads)
  • $d_k = d_v = 64$ (dimension per head)
  • 6 encoder layers, 6 decoder layers
  • Parameters: ~65M
  • Training: 12 hours on 8 P100 GPUs

Transformer Large

  • $d_{\text{model}} = 1024$
  • $d_{\text{ff}} = 4096$
  • $h = 16$
  • Parameters: ~213M
  • Training: 3.5 days on 8 P100 GPUs
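
The two configurations differ only in width; the depth and the relation $d_k = d_v = d_{\text{model}}/h$ stay the same. A small sketch capturing this (class and field names are my own):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    d_model: int
    d_ff: int
    heads: int
    layers: int = 6                          # 6 encoder and 6 decoder layers in both models

    @property
    def d_head(self) -> int:
        return self.d_model // self.heads    # d_k = d_v = d_model / h

base = TransformerConfig(d_model=512, d_ff=2048, heads=8)      # d_head = 64
large = TransformerConfig(d_model=1024, d_ff=4096, heads=16)   # d_head = 64
```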

Three Types of Attention

The transformer uses attention in three different ways:

  1. Encoder self-attention: Encoder attends to all positions in input (bidirectional)
  2. Decoder masked self-attention: Decoder attends to earlier positions only (causal/autoregressive)
  3. Encoder-decoder attention: Decoder attends to encoder output (cross-attention)
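
Mechanically, the three differ only in which positions are masked. A sketch of the causal mask used by decoder self-attention, in the boolean format expected by the scaled_dot_product_attention sketch above (True = may attend):

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((n, n), dtype=bool))

# Encoder self-attention and cross-attention pass mask=None (all positions visible);
# decoder self-attention passes causal_mask(n) to hide future positions.
```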

Training Details

Task: Machine translation (English-to-German, English-to-French)

Optimization:

  • Adam optimizer with a custom learning rate schedule (sketched below):
    • Warmup: linear increase for the first 4000 steps
    • Decay: $\text{lr} = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup}^{-1.5})$
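
A sketch of this schedule as a plain function (step counting starts at 1):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warmup for `warmup` steps, then decay proportional to step**-0.5."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# At step == warmup the two terms meet: both equal warmup**-0.5, scaled by d_model**-0.5.
```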

Regularization:

  • Dropout rate 0.1 applied to:
    • The output of each sub-layer, before it is added to the sub-layer input and normalized
    • The sums of the embeddings and the positional encodings
  • Label smoothing with $\epsilon_{ls} = 0.1$ (sketched after this list)
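
A sketch of label smoothing as it is commonly implemented: each one-hot target keeps probability $1 - \epsilon$ on the true token and spreads $\epsilon$ uniformly over the rest of the vocabulary (function name and the exact smoothing split are my own simplification):

```python
import numpy as np

def smooth_labels(targets, vocab_size, eps=0.1):
    """targets: (n,) integer token ids. Returns (n, vocab_size) smoothed distributions."""
    dist = np.full((len(targets), vocab_size), eps / (vocab_size - 1))
    dist[np.arange(len(targets)), targets] = 1.0 - eps
    return dist                              # used as the soft target for cross-entropy
```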

Data:

  • WMT 2014 English-German: 4.5M sentence pairs
  • WMT 2014 English-French: 36M sentence pairs
  • Byte-pair encoding with a shared ~37K-token vocabulary (EN-DE); a 32K word-piece vocabulary for EN-FR

Results

Machine Translation Performance

English-to-German (WMT 2014):

  • Transformer (base): 27.3 BLEU (previous SOTA: 26.4)
  • Transformer (large): 28.4 BLEU (new SOTA)

English-to-French (WMT 2014):

  • Transformer (large): 41.0 BLEU (new SOTA, previous: 40.4)

Training cost:

  • 3.5 days on 8 GPUs vs weeks for previous models
  • More parallelizable than RNN/LSTM-based models

Computational Efficiency

| Model | Params | Training Cost | Test BLEU (EN-DE) |
|-------|--------|---------------|-------------------|
| ByteNet | - | - | 23.75 |
| GNMT + RL | 278M | - | 24.60 |
| ConvS2S | 252M | - | 25.16 |
| Transformer (base) | 65M | 0.4 days | 27.3 |
| Transformer (large) | 213M | 3.5 days | 28.4 |

Transformer achieved better results with less training time than previous architectures.

Advantages Over RNNs/CNNs

Computational Complexity Per Layer

| Operation | Complexity | Sequential Ops | Max Path Length |
|-----------|------------|----------------|-----------------|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |

Key advantages:

  1. Parallelization: $O(1)$ sequential operations vs $O(n)$ for RNNs
  2. Direct connections: $O(1)$ path length between any two positions vs $O(n)$ for RNNs
  3. Interpretability: Attention weights show what the model focuses on

Attention Visualizations

The paper shows attention patterns learned by different heads:

  • Some heads focus on syntactic relationships (subject-verb, verb-object)
  • Others attend to local context (nearby words)
  • Some capture long-range dependencies
  • Different heads specialize in different linguistic phenomena

This multi-head design allows the model to capture diverse patterns simultaneously.

Impact and Legacy

This paper fundamentally changed deep learning:

Immediate impact:

  • State-of-the-art on machine translation
  • Much faster training than RNN-based models
  • Better handling of long-range dependencies

Long-term impact:

  • Foundation for BERT (2018), GPT series (2018-present)
  • Extended to computer vision (ViT, 2020)
  • Now the dominant architecture in NLP
  • Enabled large language models (100B+ parameters)
  • Sparked research into efficient attention variants

One of the most cited AI papers of the past decade, with well over 100,000 citations.

Limitations and Future Work

Quadratic complexity: The $O(n^2)$ attention cost in sequence length limits application to very long sequences

  • Led to efficient variants and implementations: Longformer, Reformer, Performer, FlashAttention

Fixed positional encoding: Learned or relative position representations might be more flexible

  • Many later transformers (e.g., BERT, GPT) use learned positional embeddings; more recent models often favor relative or rotary encodings

Limited inductive biases: Unlike CNNs (translation invariance) or RNNs (sequential bias)

  • Requires more data, but is more flexible

Key Takeaways

  • Transformers use pure attention with no recurrence or convolution
  • Three types of attention: encoder self-attention, decoder masked self-attention, encoder-decoder cross-attention
  • Multi-head attention captures diverse patterns in parallel
  • Positional encoding injects position information
  • Parallel processing enables faster training than RNNs
  • Direct connections between all positions shorten gradient paths and ease learning of long-range dependencies
  • Achieved state-of-the-art on machine translation with less training time
  • Foundation of modern NLP: BERT, GPT, T5, and beyond

References

Original Paper:

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. "Attention Is All You Need." NeurIPS 2017. arxiv.org/abs/1706.03762