
GPT Architecture

The Generative Pre-trained Transformer (GPT) uses a decoder-only architecture that stacks multiple transformer blocks with causal self-attention. Unlike the original transformer’s encoder-decoder design, GPT simplifies to a pure autoregressive language model.

Core Components

GPT Block Structure

Each GPT block contains two main sub-layers, each preceded by layer normalization and wrapped in a residual connection:

  1. Causal Self-Attention: Multi-head attention with causal masking
  2. Feed-Forward Network: Two-layer MLP with non-linear activation
  3. Layer Normalization: Applied before each sub-layer (pre-norm architecture)
  4. Residual Connections: Enable gradient flow through deep networks
class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        # Attention with residual
        x = x + self.attn(self.ln1(x))
        # MLP with residual
        x = x + self.mlp(self.ln2(x))
        return x
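
The GPTBlock above references a CausalSelfAttention module that is not defined on this page. The sketch below is one minimal way to implement it, shown for completeness; it is illustrative and not necessarily the exact implementation used elsewhere in the library.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal multi-head attention with a causal mask (illustrative sketch)."""
    def __init__(self, d_model, n_heads, max_len=1024, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask: position t may only attend to positions <= t
        # (max_len must be at least the model's block size)
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, d_model = x.shape
        q, k, v = self.qkv(x).split(d_model, dim=2)
        # Reshape to (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention with the causal mask applied before softmax
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                   # (B, n_heads, T, d_head)
        y = y.transpose(1, 2).contiguous().view(B, T, d_model)
        return self.dropout(self.proj(y))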

Complete GPT Model

The full architecture combines embeddings, stacked blocks, and an output projection:

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_layer, d_model, n_heads):
        super().__init__()
        self.block_size = block_size
        # Token and position embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(block_size, d_model)
        # Transformer blocks
        self.blocks = nn.Sequential(*[
            GPTBlock(d_model, n_heads, 4*d_model) for _ in range(n_layer)
        ])
        # Final layer norm and output projection
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        # Get embeddings
        tok_emb = self.token_embedding(idx)  # (B, T, d_model)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, d_model)
        # Add position to token embeddings
        x = tok_emb + pos_emb
        # Apply transformer blocks
        x = self.blocks(x)
        # Final layer norm and project to vocabulary
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        return logits
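
Because the model is a pure autoregressive language model, text generation is just a loop that feeds the model's own predictions back in as context. The helper below is a minimal sampling sketch written against the GPT class above; generate is a standalone function introduced here for illustration, not a method of the class.

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0):
    """Sample max_new_tokens continuations of idx (B, T), one token at a time."""
    model.eval()
    for _ in range(max_new_tokens):
        # Crop the running context to the model's block size
        idx_cond = idx[:, -model.block_size:]
        logits = model(idx_cond)                     # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature      # keep only the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # one sampled token per sequence
        idx = torch.cat([idx, next_id], dim=1)       # append and continue
    return idx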

Architecture Flow

Input Token IDs [B, T]
        ↓
Token Embedding [B, T, d_model] + Position Embedding [T, d_model]
        ↓
┌─────────────────────────────┐
│      GPT Block × n_layer    │
│  ┌───────────────────────┐  │
│  │ LayerNorm             │  │
│  │ Causal Self-Attention │  │
│  │ Residual Connection   │  │
│  ├───────────────────────┤  │
│  │ LayerNorm             │  │
│  │ MLP (Feed-Forward)    │  │
│  │ Residual Connection   │  │
│  └───────────────────────┘  │
└─────────────────────────────┘
        ↓
Final LayerNorm
        ↓
LM Head
        ↓
[B, T, vocab_size] Logits (predictions)

Model Configurations

GPT-2 Small (124M parameters)

  • Layers: 12
  • Model dimension: 768
  • Attention heads: 12
  • Feed-forward dimension: 3072 (4 × d_model)
  • Vocabulary size: 50,257
  • Context length: 1024 tokens

NanoGPT (~10M parameters)

  • Layers: 6
  • Model dimension: 384
  • Attention heads: 6
  • Feed-forward dimension: 1536 (4 × d_model)
  • Vocabulary size: 50,257
  • Context length: 256 tokens

GPT-3 (175B parameters)

  • Layers: 96
  • Model dimension: 12,288
  • Attention heads: 96
  • Feed-forward dimension: 49,152
  • Vocabulary size: 50,257
  • Context length: 2048 tokens
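
As a rough sanity check of these configurations, the GPT class above can be instantiated directly. The snippet below uses the NanoGPT-scale settings listed above; note that the transformer blocks alone account for roughly 10M parameters (12 × n_layer × d_model²), while the untied 50,257-entry embedding and output tables add considerably more.

import torch

# NanoGPT-scale configuration from the list above
model = GPT(vocab_size=50257, block_size=256, n_layer=6, d_model=384, n_heads=6)

total = sum(p.numel() for p in model.parameters())
blocks = 12 * 6 * 384**2                     # per-block estimate from the section below
print(f"total: {total/1e6:.1f}M, blocks only: ~{blocks/1e6:.1f}M")

# Forward pass on dummy token IDs
idx = torch.randint(0, 50257, (2, 128))      # (B=2, T=128)
logits = model(idx)                          # (2, 128, 50257)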

Design Decisions

Pre-Norm vs Post-Norm

Pre-Norm (used in GPT):

x = x + attn(ln(x))
x = x + mlp(ln(x))

  • More stable training for deep networks
  • Better gradient flow
  • Standard in modern LLMs

Post-Norm (original transformer):

x = ln(x + attn(x))
x = ln(x + mlp(x))

  • Slightly better final performance (sometimes)
  • Harder to train very deep networks
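
For contrast with the pre-norm GPTBlock shown earlier, a post-norm block in the original-transformer style would look roughly like the sketch below (illustrative only; it reuses the CausalSelfAttention sketch from above and is not used by GPT).

import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original-transformer ordering: sub-layer first, then LayerNorm (illustrative)."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))   # normalize after the residual add
        x = self.ln2(x + self.mlp(x))
        return x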

GELU Activation

GPT uses GELU (Gaussian Error Linear Unit) instead of ReLU:

GELU(x) = x · Φ(x)

where Φ(x) is the cumulative distribution function of the standard normal distribution.
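
The exact form can be checked numerically against PyTorch's built-in GELU, which uses the erf-based formulation by default (a small illustrative check):

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

# GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf
phi = 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))
gelu_manual = x * phi

print(torch.allclose(gelu_manual, F.gelu(x), atol=1e-6))  # True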

Benefits:

  • Smoother activation function
  • Empirically better for language models
  • Used in GPT, BERT, and most modern transformers

Weight Tying

GPT often ties weights between token embedding and output projection:

# Share weights
self.lm_head.weight = self.token_embedding.weight

Benefits:

  • Reduces parameters by ~38M-50M (for GPT-2)
  • Enforces symmetry: similar tokens have similar output distributions
  • Slight performance improvement in practice
  • Reduces memory footprint
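
The ~38M figure for GPT-2 small follows directly from the size of the output projection that no longer needs its own matrix:

vocab_size, d_model = 50257, 768       # GPT-2 small
saved = vocab_size * d_model           # lm_head weight now shared with the token embedding
print(f"{saved/1e6:.1f}M")             # ~38.6M parameters saved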

Parameter Calculation

For a GPT model with L layers, model dimension d, vocabulary size V, and context length C:

Embeddings:

  • Token embedding: V × d
  • Position embedding: C × d

Per Transformer Block:

  • Attention Q, K, V projections: 3 × d × d
  • Attention output projection: d × d
  • MLP first layer: d × 4d
  • MLP second layer: 4d × d
  • Layer norms: negligible (4d per block)

Total per block: ≈ 12d² parameters

Total model: ≈ 12 × L × d² parameters (excluding embeddings)

Note: The majority of per-block parameters (8d² of the ≈ 12d²) are in the MLP layers, not attention.
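
Plugging the GPT-2 Small configuration from above into these formulas recovers the published ~124M total (a back-of-the-envelope sketch that ignores biases and the final layer norm, and assumes the lm_head is tied to the token embedding):

L, d, V, C = 12, 768, 50257, 1024      # GPT-2 small

blocks = 12 * L * d**2                 # attention (4d^2) + MLP (8d^2) per block, times L
embeddings = V * d + C * d             # token + position embeddings
total = blocks + embeddings            # lm_head adds nothing extra when weights are tied

print(f"blocks: {blocks/1e6:.1f}M, embeddings: {embeddings/1e6:.1f}M, total: {total/1e6:.1f}M")
# blocks: 84.9M, embeddings: 39.4M, total: 124.3M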

Key Insights

  1. Depth vs Width: More layers capture longer-range dependencies; a larger d increases per-layer capacity
  2. Attention Heads: Multiple heads allow attending to different representation subspaces
  3. Context Length: Block size limits historical context available to the model
  4. Decoder-Only: Simpler than encoder-decoder for pure generation tasks
  5. Pre-Norm: Essential for stable training of very deep networks (GPT-3: 96 layers)

Common Variations

Flash Attention

An optimized, IO-aware attention implementation that reduces memory use and wall-clock time while leaving the attention output mathematically unchanged:

  • Reduces memory from O(n²) to O(n)
  • 2-4× faster than standard attention
  • Used in modern large-scale training
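
In practice, recent PyTorch versions expose a fused kernel through torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention implementation when the hardware and dtypes allow it. A minimal illustration of the call (shapes only, not tied to this library's code):

import torch
import torch.nn.functional as F

B, n_heads, T, d_head = 2, 12, 128, 64
q = torch.randn(B, n_heads, T, d_head)
k = torch.randn(B, n_heads, T, d_head)
v = torch.randn(B, n_heads, T, d_head)

# With is_causal=True the causal mask is applied inside the kernel,
# so the full (T, T) attention matrix is never materialized.
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([2, 12, 128, 64])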

Grouped Query Attention (GQA)

Shares key and value across multiple query heads:

  • Reduces KV cache memory during inference
  • Used in Llama 2 and later models
  • Minimal quality impact with significant efficiency gains
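
The core idea can be sketched in a few lines: keep fewer K/V heads than query heads and repeat each K/V head across its group of query heads before attention (an illustrative sketch, not Llama 2's actual implementation):

import torch
import torch.nn.functional as F

B, T, d_head = 2, 128, 64
n_q_heads, n_kv_heads = 12, 4              # each K/V head serves a group of 3 query heads

q = torch.randn(B, n_q_heads, T, d_head)
k = torch.randn(B, n_kv_heads, T, d_head)  # fewer K/V heads -> smaller KV cache
v = torch.randn(B, n_kv_heads, T, d_head)

# Repeat each K/V head so the shapes line up with the query heads
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)      # (B, n_q_heads, T, d_head)
v = v.repeat_interleave(group, dim=1)

y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(y.shape)  # torch.Size([2, 12, 128, 64])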

Rotary Position Embeddings (RoPE)

Alternative to learned position embeddings:

  • Encodes position through rotation matrices
  • Better length extrapolation
  • Used in GPT-J, PaLM, Llama
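
A compact sketch of one common RoPE formulation, rotating channel pairs of the query/key vectors by a position-dependent angle (details vary between implementations):

import torch

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x (B, n_heads, T, d_head) by position-dependent angles."""
    B, n_heads, T, d_head = x.shape
    half = d_head // 2
    # One frequency per channel pair, evaluated at every position 0..T-1
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied pair-wise; no learned parameters involved
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 12, 128, 64)
print(apply_rope(q).shape)  # torch.Size([2, 12, 128, 64])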

Learning Resources

Core Papers

  • Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) - Original GPT paper
  • Language Models are Unsupervised Multitask Learners (Radford et al., 2019) - GPT-2
  • Language Models are Few-Shot Learners (Brown et al., 2020) - GPT-3
