GPT Architecture
The Generative Pre-trained Transformer (GPT) uses a decoder-only architecture that stacks multiple transformer blocks with causal self-attention. Unlike the original transformer’s encoder-decoder design, GPT simplifies to a pure autoregressive language model.
Core Components
GPT Block Structure
Each GPT block contains two main sub-layers with residual connections:
- Causal Self-Attention: Multi-head attention with causal masking
- Feed-Forward Network: Two-layer MLP with non-linear activation
- Layer Normalization: Applied before each sub-layer (pre-norm architecture)
- Residual Connections: Enable gradient flow through deep networks
```python
import torch
import torch.nn as nn


class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        # Causal multi-head self-attention (see the Causal Attention note)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Attention with residual
        x = x + self.attn(self.ln1(x))
        # MLP with residual
        x = x + self.mlp(self.ln2(x))
        return x
```

Complete GPT Model
The full architecture combines embeddings, stacked blocks, and an output projection:
```python
class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_layer, d_model, n_heads):
        super().__init__()
        self.block_size = block_size
        # Token and position embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(block_size, d_model)
        # Transformer blocks
        self.blocks = nn.Sequential(*[
            GPTBlock(d_model, n_heads, 4 * d_model)
            for _ in range(n_layer)
        ])
        # Final layer norm and output projection
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        assert T <= self.block_size, "sequence length exceeds block size"
        # Get embeddings
        tok_emb = self.token_embedding(idx)  # (B, T, d_model)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, d_model)
        # Add position to token embeddings (broadcasts over the batch dimension)
        x = tok_emb + pos_emb
        # Apply transformer blocks
        x = self.blocks(x)
        # Final layer norm and project to vocabulary
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        return logits
```
Architecture Flow

```
Input Token IDs [B, T]
        ↓
Token Embedding [B, T, d_model]
        +
Position Embedding [T, d_model]
        ↓
┌─────────────────────────────┐
│     GPT Block × n_layer     │
│  ┌───────────────────────┐  │
│  │ LayerNorm             │  │
│  │ Causal Self-Attention │  │
│  │ Residual Connection   │  │
│  ├───────────────────────┤  │
│  │ LayerNorm             │  │
│  │ MLP (Feed-Forward)    │  │
│  │ Residual Connection   │  │
│  └───────────────────────┘  │
└─────────────────────────────┘
        ↓
Final LayerNorm
        ↓
LM Head [B, T, vocab_size]
        ↓
Logits (predictions)
```

Model Configurations
GPT-2 Small (124M parameters)
- Layers: 12
- Model dimension: 768
- Attention heads: 12
- Feed-forward dimension: 3072 (4 × d_model)
- Vocabulary size: 50,257
- Context length: 1024 tokens
NanoGPT (~10M parameters)
- Layers: 6
- Model dimension: 384
- Attention heads: 6
- Feed-forward dimension: 1536 (4 × d_model)
- Vocabulary size: 65 (character-level Shakespeare vocabulary)
- Context length: 256 tokens
GPT-3 175B (175B parameters)
- Layers: 96
- Model dimension: 12,288
- Attention heads: 96
- Feed-forward dimension: 49,152
- Vocabulary size: 50,257
- Context length: 2048 tokens
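These hyperparameters map directly onto the constructor arguments of the GPT class above. A sketch of how the three configurations might be collected in code (the GPTConfig dataclass and the dictionary are illustrative, not from any released implementation):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Field names mirror the GPT constructor above; the dataclass itself is illustrative.
    n_layer: int
    d_model: int
    n_heads: int
    vocab_size: int
    block_size: int

CONFIGS = {
    "gpt2-small": GPTConfig(n_layer=12, d_model=768,   n_heads=12, vocab_size=50257, block_size=1024),
    "nanogpt":    GPTConfig(n_layer=6,  d_model=384,   n_heads=6,  vocab_size=65,    block_size=256),
    "gpt3-175b":  GPTConfig(n_layer=96, d_model=12288, n_heads=96, vocab_size=50257, block_size=2048),
}

cfg = CONFIGS["gpt2-small"]
model = GPT(cfg.vocab_size, cfg.block_size, cfg.n_layer, cfg.d_model, cfg.n_heads)
```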
Design Decisions
Pre-Norm vs Post-Norm
Pre-Norm (used in GPT):
```python
x = x + attn(ln(x))
x = x + mlp(ln(x))
```
- More stable training for deep networks
- Better gradient flow
- Standard in modern LLMs
Post-Norm (original transformer):
```python
x = ln(x + attn(x))
x = ln(x + mlp(x))
```
- Slightly better final performance (sometimes)
- Harder to train very deep networks
GELU Activation
GPT uses GELU (Gaussian Error Linear Unit) instead of ReLU:

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
Benefits:
- Smoother activation function
- Empirically better for language models
- Used in GPT, BERT, and most modern transformers
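A quick numerical check of this definition against PyTorch's built-in GELU (a sketch; torch.special.erf evaluates the normal CDF, and the approximate='tanh' argument needs a reasonably recent PyTorch):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)

# GELU(x) = x * Phi(x), with Phi the standard normal CDF
phi = 0.5 * (1.0 + torch.special.erf(x / 2 ** 0.5))
gelu_manual = x * phi

print(torch.allclose(gelu_manual, F.gelu(x)))                        # True (exact, erf-based GELU)
print((F.gelu(x) - F.gelu(x, approximate='tanh')).abs().max())       # tiny: GPT-2 used the tanh approximation
```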
Weight Tying
GPT often ties weights between token embedding and output projection:
```python
# Share weights
self.lm_head.weight = self.token_embedding.weight
```

Benefits:
- Reduces parameters by ~38M for GPT-2 small (the 50,257 × 768 embedding matrix), and more for larger variants
- Enforces symmetry: similar tokens have similar output distributions
- Slight performance improvement in practice
- Reduces memory footprint
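A sketch of the effect on parameter count, assuming the GPT class above (and its CausalSelfAttention dependency) is in scope. nn.Module.parameters() counts a shared tensor only once, so tying shows up directly in the total:

```python
def count_params(m):
    # nn.Module.parameters() already de-duplicates shared (tied) tensors
    return sum(p.numel() for p in m.parameters())

model = GPT(vocab_size=50257, block_size=1024, n_layer=12, d_model=768, n_heads=12)
untied = count_params(model)

model.lm_head.weight = model.token_embedding.weight   # tie embedding and output projection
tied = count_params(model)

print(untied - tied)  # 38,597,376 (~38M) fewer parameters for GPT-2 small
```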
Parameter Calculation
For a GPT model with $L$ layers, model dimension $d$, vocabulary size $V$, and context length $T$:
Embeddings:
- Token embedding: $V \times d$
- Position embedding: $T \times d$
Per Transformer Block:
- Attention Q, K, V projections: $3d^2$
- Attention output projection: $d^2$
- MLP first layer: $d \times 4d = 4d^2$
- MLP second layer: $4d \times d = 4d^2$
- Layer norms: negligible ($4d$ per block)
Total per block: $\approx 12d^2$ parameters
Total model: $\approx 12Ld^2$ parameters (excluding embeddings)
Note: The majority of parameters ($8d^2$ of $12d^2$ per block) are in the MLP layers, not attention.
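A sketch that turns this estimate into code and checks it against the GPT-2 small configuration above (biases and layer-norm parameters are ignored, so the total is only approximate):

```python
def approx_gpt_params(n_layer, d_model, vocab_size, block_size):
    # Embeddings: token table (V x d) + learned positions (T x d)
    embeddings = vocab_size * d_model + block_size * d_model
    attn = 3 * d_model**2 + d_model**2        # Q, K, V projections + output projection
    mlp = 4 * d_model**2 + 4 * d_model**2     # d -> 4d and 4d -> d linear layers
    per_block = attn + mlp                    # ~12 * d_model^2
    return embeddings + n_layer * per_block

# GPT-2 small: 12 layers, d_model = 768, V = 50,257, T = 1024
print(approx_gpt_params(12, 768, 50257, 1024))  # ~124.3M, matching the ~124M figure
```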
Key Insights
- Depth vs Width: More layers capture longer-range dependencies; a larger model dimension $d$ increases per-layer capacity
- Attention Heads: Multiple heads allow attending to different representation subspaces
- Context Length: Block size limits historical context available to the model
- Decoder-Only: Simpler than encoder-decoder for pure generation tasks
- Pre-Norm: Essential for stable training of very deep networks (GPT-3: 96 layers)
Common Variations
Flash Attention
Optimized attention implementation that reduces memory and computation:
- Reduces attention memory from $O(T^2)$ to $O(T)$ in sequence length
- 2-4× faster than standard attention
- Used in modern large-scale training
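PyTorch 2.x exposes fused attention kernels of this kind through torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention backend when one is available. A minimal causal call looks like this (sketch; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

B, n_heads, T, head_dim = 2, 12, 1024, 64
q = torch.randn(B, n_heads, T, head_dim)
k = torch.randn(B, n_heads, T, head_dim)
v = torch.randn(B, n_heads, T, head_dim)

# is_causal=True applies the causal mask; fused backends avoid materializing the full (T, T) matrix
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 12, 1024, 64])
```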
Grouped Query Attention (GQA)
Shares key and value across multiple query heads:
- Reduces KV cache memory during inference
- Used in Llama 2 and later models
- Minimal quality impact with significant efficiency gains
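The core trick can be sketched in a few lines: project to fewer key/value heads than query heads, then repeat each KV head across its group of query heads before the attention call (illustrative shapes and names, not any particular model's code):

```python
import torch
import torch.nn.functional as F

B, T, n_q_heads, n_kv_heads, head_dim = 2, 128, 12, 4, 64
group = n_q_heads // n_kv_heads               # 3 query heads share each KV head

q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)   # smaller KV cache: 4 heads instead of 12
v = torch.randn(B, n_kv_heads, T, head_dim)

# Expand KV heads to match the query heads (each repeated `group` times)
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 12, 128, 64])
```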
Rotary Position Embeddings (RoPE)
Alternative to learned position embeddings:
- Encodes position through rotation matrices
- Better length extrapolation
- Used in GPT-J, PaLM, Llama
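A minimal sketch of the idea using the "rotate-half" formulation (the function name and base frequency are illustrative): pairs of channels in each query/key head are rotated by an angle proportional to the token position, so relative position falls out of the dot product.

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (B, n_heads, T, head_dim); rotate channel pairs by position-dependent angles
    B, H, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,) per-pair frequencies
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half) position * frequency
    cos, sin = angles.cos()[None, None], angles.sin()[None, None]            # broadcast to (1, 1, T, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 6, 128, 64)
print(apply_rope(q).shape)  # torch.Size([2, 6, 128, 64]) -- positions now encoded in q itself
```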
Related Concepts
- Causal Attention - The masked attention mechanism used in GPT blocks
- Multi-Head Attention - Parallel attention heads foundation
- Tokenization - Converting text to token IDs for GPT input
- Language Model Training - How to train GPT models
- Text Generation - Inference strategies for GPT
Learning Resources
Core Papers
- Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) - Original GPT paper
- Language Models are Unsupervised Multitask Learners (Radford et al., 2019) - GPT-2
- Language Models are Few-Shot Learners (Brown et al., 2020) - GPT-3
Implementation Guides
- Andrej Karpathy’s nanoGPT - Minimal, clean GPT implementation
- Let’s Build GPT from Scratch - Karpathy’s video walkthrough
- The Annotated GPT-2 - Line-by-line code explanation
Conceptual Explanations
- The Illustrated GPT-2 - Jay Alammar’s visual guide
- GPT in 60 Lines of NumPy - Minimal implementation for understanding