
GPT Architecture

The Generative Pre-trained Transformer (GPT) uses a decoder-only architecture that stacks multiple transformer blocks with causal self-attention. Unlike the original transformer’s encoder-decoder design, GPT simplifies to a pure autoregressive language model.

Core Components

GPT Block Structure

Each GPT block contains two main sub-layers, each preceded by layer normalization and wrapped in a residual connection:

  1. Causal Self-Attention: Multi-head attention with causal masking
  2. Feed-Forward Network: Two-layer MLP with non-linear activation
  3. Layer Normalization: Applied before each sub-layer (pre-norm architecture)
  4. Residual Connections: Enable gradient flow through deep networks
class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # Attention with residual
        x = x + self.attn(self.ln1(x))
        # MLP with residual
        x = x + self.mlp(self.ln2(x))
        return x
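
The GPTBlock above references a CausalSelfAttention module that is not defined on this page. The sketch below is one minimal way to implement it, shown for completeness; it is illustrative and not necessarily the exact implementation used elsewhere in the library.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal multi-head attention with a causal mask (illustrative sketch)."""
    def __init__(self, d_model, n_heads, max_len=1024, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask: position t may only attend to positions <= t
        # (max_len must be at least the model's block size)
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, d_model = x.shape
        q, k, v = self.qkv(x).split(d_model, dim=2)
        # Reshape to (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention with the causal mask applied before softmax
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                   # (B, n_heads, T, d_head)
        y = y.transpose(1, 2).contiguous().view(B, T, d_model)
        return self.dropout(self.proj(y))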

Complete GPT Model

The full architecture combines embeddings, stacked blocks, and an output projection:

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_layer, d_model, n_heads):
        super().__init__()
        self.block_size = block_size
        # Token and position embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(block_size, d_model)
        # Transformer blocks
        self.blocks = nn.Sequential(*[
            GPTBlock(d_model, n_heads, 4*d_model) for _ in range(n_layer)
        ])
        # Final layer norm and output projection
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        # Get embeddings
        tok_emb = self.token_embedding(idx)  # (B, T, d_model)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, d_model)
        # Add position to token embeddings
        x = tok_emb + pos_emb
        # Apply transformer blocks
        x = self.blocks(x)
        # Final layer norm and project to vocabulary
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        return logits
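
Because the model is a pure autoregressive language model, text generation is just a loop that feeds the model's own predictions back in as context. The helper below is a minimal sampling sketch written against the GPT class above; generate is a standalone function introduced here for illustration, not a method of the class.

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0):
    """Sample max_new_tokens continuations of idx (B, T), one token at a time."""
    model.eval()
    for _ in range(max_new_tokens):
        # Crop the running context to the model's block size
        idx_cond = idx[:, -model.block_size:]
        logits = model(idx_cond)                     # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature      # keep only the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # one sampled token per sequence
        idx = torch.cat([idx, next_id], dim=1)       # append and continue
    return idx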

Architecture Flow

Input Token IDs [B, T]
        ↓
Token Embedding [B, T, d_model] + Position Embedding [T, d_model]
        ↓
┌─────────────────────────────┐
│      GPT Block × n_layer    │
│  ┌───────────────────────┐  │
│  │ LayerNorm             │  │
│  │ Causal Self-Attention │  │
│  │ Residual Connection   │  │
│  ├───────────────────────┤  │
│  │ LayerNorm             │  │
│  │ MLP (Feed-Forward)    │  │
│  │ Residual Connection   │  │
│  └───────────────────────┘  │
└─────────────────────────────┘
        ↓
Final LayerNorm
        ↓
LM Head
        ↓
[B, T, vocab_size] Logits (predictions)

Model Configurations

GPT-2 Small (124M parameters)

  • Layers: 12
  • Model dimension: 768
  • Attention heads: 12
  • Feed-forward dimension: 3072 (4 × d_model)
  • Vocabulary size: 50,257
  • Context length: 1024 tokens

NanoGPT (~10M parameters)

  • Layers: 6
  • Model dimension: 384
  • Attention heads: 6
  • Feed-forward dimension: 1536 (4 × d_model)
  • Vocabulary size: 50,257
  • Context length: 256 tokens

GPT-3 (175B parameters)

  • Layers: 96
  • Model dimension: 12,288
  • Attention heads: 96
  • Feed-forward dimension: 49,152
  • Vocabulary size: 50,257
  • Context length: 2048 tokens
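
As a rough sanity check of these configurations, the GPT class above can be instantiated directly. The snippet below uses the NanoGPT-scale settings listed above; note that the transformer blocks alone account for roughly 10M parameters (12 × n_layer × d_model²), while the untied 50,257-entry embedding and output tables add considerably more.

import torch

# NanoGPT-scale configuration from the list above
model = GPT(vocab_size=50257, block_size=256, n_layer=6, d_model=384, n_heads=6)

total = sum(p.numel() for p in model.parameters())
blocks = 12 * 6 * 384**2                     # per-block estimate from the section below
print(f"total: {total/1e6:.1f}M, blocks only: ~{blocks/1e6:.1f}M")

# Forward pass on dummy token IDs
idx = torch.randint(0, 50257, (2, 128))      # (B=2, T=128)
logits = model(idx)                          # (2, 128, 50257)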

Design Decisions

Pre-Norm vs Post-Norm

Pre-Norm (used in GPT):

x = x + attn(ln(x))
x = x + mlp(ln(x))

  • More stable training for deep networks
  • Better gradient flow
  • Standard in modern LLMs

Post-Norm (original transformer):

x = ln(x + attn(x))
x = ln(x + mlp(x))

  • Slightly better final performance (sometimes)
  • Harder to train very deep networks
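
For contrast with the pre-norm GPTBlock shown earlier, a post-norm block in the original-transformer style would look roughly like the sketch below (illustrative only; it reuses the CausalSelfAttention sketch from above and is not used by GPT).

import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original-transformer ordering: sub-layer first, then LayerNorm (illustrative)."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))   # normalize after the residual add
        x = self.ln2(x + self.mlp(x))
        return x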

GELU Activation

GPT uses GELU (Gaussian Error Linear Unit) instead of ReLU:

GELU(x) = x · Φ(x)

where Φ(x) is the cumulative distribution function of the standard normal distribution.
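
The exact form can be checked numerically against PyTorch's built-in GELU, which uses the erf-based formulation by default (a small illustrative check):

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

# GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
phi = 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))
gelu_manual = x * phi

print(torch.allclose(gelu_manual, F.gelu(x), atol=1e-6))  # True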

Benefits:

  • Smoother activation function
  • Empirically better for language models
  • Used in GPT, BERT, and most modern transformers

Weight Tying

GPT often ties weights between token embedding and output projection:

# Share weights
self.lm_head.weight = self.token_embedding.weight

Benefits:

  • Reduces parameters by ~38M-50M (for GPT-2)
  • Enforces symmetry: similar tokens have similar output distributions
  • Slight performance improvement in practice
  • Reduces memory footprint
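
The ~38M figure for GPT-2 small follows directly from the size of the output projection that no longer needs its own matrix:

vocab_size, d_model = 50257, 768       # GPT-2 small
saved = vocab_size * d_model           # lm_head weight now shared with the token embedding
print(f"{saved/1e6:.1f}M")             # ~38.6M parameters saved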

Parameter Calculation

For a GPT model with L layers, model dimension d, vocabulary size V, and context length C:

Embeddings:

  • Token embedding: V × d
  • Position embedding: C × d

Per Transformer Block:

  • Attention Q, K, V projections: 3 × d × d
  • Attention output projection: d × d
  • MLP first layer: d × 4d
  • MLP second layer: 4d × d
  • Layer norms: negligible (4d per block)

Total per block: ≈ 12d² parameters

Total model: ≈ 12 × L × d² parameters (excluding embeddings)

Note: The majority of per-block parameters (8d² of the ≈ 12d²) are in the MLP layers, not attention.
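
Plugging the GPT-2 Small configuration from above into these formulas recovers the published ~124M total (a back-of-the-envelope sketch that ignores biases and the final layer norm, and assumes the lm_head is tied to the token embedding):

L, d, V, C = 12, 768, 50257, 1024      # GPT-2 small

blocks = 12 * L * d**2                 # attention (4d^2) + MLP (8d^2) per block, times L
embeddings = V * d + C * d             # token + position embeddings
total = blocks + embeddings            # lm_head adds nothing extra when weights are tied

print(f"blocks: {blocks/1e6:.1f}M, embeddings: {embeddings/1e6:.1f}M, total: {total/1e6:.1f}M")
# blocks: 84.9M, embeddings: 39.4M, total: 124.3M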

Key Insights

  1. Depth vs Width: More layers capture longer-range dependencies; a larger d increases per-layer capacity
  2. Attention Heads: Multiple heads allow attending to different representation subspaces
  3. Context Length: Block size limits historical context available to the model
  4. Decoder-Only: Simpler than encoder-decoder for pure generation tasks
  5. Pre-Norm: Essential for stable training of very deep networks (GPT-3: 96 layers)

Common Variations

Flash Attention

An optimized, IO-aware attention implementation that reduces memory use and wall-clock time while leaving the attention output mathematically unchanged:

  • Reduces memory from O(n²) to O(n)
  • 2-4× faster than standard attention
  • Used in modern large-scale training
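
In practice, recent PyTorch versions expose a fused kernel through torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention implementation when the hardware and dtypes allow it. A minimal illustration of the call (shapes only, not tied to this library's code):

import torch
import torch.nn.functional as F

B, n_heads, T, d_head = 2, 12, 128, 64
q = torch.randn(B, n_heads, T, d_head)
k = torch.randn(B, n_heads, T, d_head)
v = torch.randn(B, n_heads, T, d_head)

# With is_causal=True the causal mask is applied inside the kernel,
# so the full (T, T) attention matrix is never materialized.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([2, 12, 128, 64])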

Grouped Query Attention (GQA)

Shares key and value across multiple query heads:

  • Reduces KV cache memory during inference
  • Used in Llama 2 and later models
  • Minimal quality impact with significant efficiency gains
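
The core idea can be sketched in a few lines: keep fewer K/V heads than query heads and repeat each K/V head across its group of query heads before attention (an illustrative sketch, not Llama 2's actual implementation):

import torch
import torch.nn.functional as F

B, T, d_head = 2, 128, 64
n_q_heads, n_kv_heads = 12, 4              # each K/V head serves a group of 3 query heads

q = torch.randn(B, n_q_heads, T, d_head)
k = torch.randn(B, n_kv_heads, T, d_head)  # fewer K/V heads -> smaller KV cache
v = torch.randn(B, n_kv_heads, T, d_head)

# Repeat each K/V head so the shapes line up with the query heads
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)      # (B, n_q_heads, T, d_head)
v = v.repeat_interleave(group, dim=1)

y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([2, 12, 128, 64])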

Rotary Position Embeddings (RoPE)

Alternative to learned position embeddings:

  • Encodes position through rotation matrices
  • Better length extrapolation
  • Used in GPT-J, PaLM, Llama
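
A compact sketch of one common RoPE formulation, rotating channel pairs of the query/key vectors by a position-dependent angle (details vary between implementations):

import torch

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x (B, n_heads, T, d_head) by position-dependent angles."""
    B, n_heads, T, d_head = x.shape
    half = d_head // 2
    # One frequency per channel pair, evaluated at every position 0..T-1
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied pair-wise; no learned parameters involved
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 12, 128, 64)
print(apply_rope(q).shape)  # torch.Size([2, 12, 128, 64])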

Learning Resources

Core Papers

  • Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) - Original GPT paper
  • Language Models are Unsupervised Multitask Learners (Radford et al., 2019) - GPT-2
  • Language Models are Few-Shot Learners (Brown et al., 2020) - GPT-3
