Building an LLM from Scratch - NanoGPT
This module provides complete, code-level understanding of how GPT-style language models are built and trained. By working through Andrej Karpathy’s NanoGPT implementation, you’ll solidify your transformer knowledge and learn the practical details of training large models.
Why This Module Matters
This hands-on experience is critical for modern AI applications:
- Complete Understanding: Move from theory to practice by implementing every component
- Decoder-Only Transformers: Master the architecture powering GPT-3, GPT-4, LLaMA, and most modern LLMs
- Sequence Modeling Skills: Learn to model any sequential data (text, code, medical events) as a language modeling problem
- Healthcare Application: Apply these principles to tokenize and model patient event sequences
Learning Objectives
After completing this module, you will:
- Complete GPT Understanding: Gain line-by-line understanding of how GPT works, from tokenization to generation
- Decoder-Only Transformers: Master causal self-attention and autoregressive generation, the core of modern LLMs
- Training Large Models: Learn practical training techniques - gradient accumulation, learning rate schedules, checkpointing
- Healthcare Sequence Foundation: Prepare to tokenize and model EHR event sequences as language modeling problems
Prerequisites
Before starting this module, ensure you have:
- Deep understanding of attention and transformers
- PyTorch: Comfortable with nn.Module, training loops, optimizers
- Time: Block out substantial time - this module requires focused, uninterrupted work (coding along with Karpathy’s 2-hour video tutorial)
Primary Resource
The primary resource for this module is Andrej Karpathy’s work:
- Video: “Let’s Build GPT: From Scratch, in Code, Spelled Out” (2 hours)
- You must code along - passive watching won’t work
- Pause frequently to understand each line
- Code: NanoGPT Repository
- Clean, educational implementation
- ~300 lines of core model code
- Reproduces GPT-2 results
Remember: There’s no substitute for building things from scratch. The best way to understand GPT is to implement it yourself, one line at a time.
Week 1: From Tokenization to Architecture
Day 1-2: Tokenization
Core Concept:
Tokenization
What You’ll Learn:
- Byte Pair Encoding (BPE): The standard tokenization for LLMs
- Subword Units: Balance between character and word-level tokenization
- Vocabulary Construction: Building the token vocabulary from data
- Special Tokens: <PAD>, <BOS>, <EOS>, <UNK>
- Encoding/Decoding: Text → tokens → text
Why Tokenization Matters:
- Character-level: Produces very long sequences that are harder to model
- Word-level: Huge vocabulary, can’t handle rare words
- Subword (BPE): Best of both worlds
BPE Algorithm (Simplified):
- Start with character vocabulary
- Find most frequent character pair
- Merge pair into single token
- Repeat until the desired vocabulary size is reached (sketched in code below)
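To make the merge loop concrete, here is a toy BPE trainer in plain Python. It is a minimal sketch of the four steps above, not GPT-2’s actual byte-level BPE (which works on raw bytes and uses regex pre-tokenization); the names and the tiny corpus are illustrative.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: corpus is a list of words; returns the list of learned merges."""
    # 1. Start with a character-level vocabulary: each word is a tuple of symbols
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # 2. Count adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # 3. Merge the most frequent pair into a single new token everywhere it occurs
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    # 4. The loop repeats until the requested number of merges (vocabulary size) is reached
    return merges

# A frequent pair such as ('l', 'o') gets merged early in this toy corpus
print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
```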
For Healthcare:
- Tokenize medical event codes (ICD-10, ATC)
- Hierarchical tokenization (code → category → broad category)
- Temporal tokens (time intervals between events)
Learning Resources:
- Papers:
- “Neural Machine Translation of Rare Words with Subword Units” (Sennrich et al., 2016)
- GPT-2 paper discusses BPE
- Code:
- SentencePiece library
- tiktoken (OpenAI’s tokenizer)
- Reading: Hugging Face tokenizers documentation
Exercises:
- Implement simple BPE algorithm
- Tokenize sample text with different vocabulary sizes
- Analyze vocabulary distribution (common vs rare tokens)
- Understand token-to-string mapping
Checkpoint: Can you explain why BPE is better than word-level or character-level tokenization?
Day 3-4: Causal Attention
Core Concept:
Causal Attention
What You’ll Learn:
- Autoregressive Generation: Predicting next token given previous tokens
- Causal Masking: Preventing attention to future positions
- Difference from Encoder: Unidirectional vs bidirectional attention
- Implementation: Creating and applying causal mask
The Causal Mask:
# Lower triangular matrix - can only attend to past
mask = torch.tril(torch.ones(seq_len, seq_len))
# Shape: [seq_len, seq_len], 1 = attend, 0 = mask
Autoregressive Formulation:
P(x_1, ..., x_T) = ∏_t P(x_t | x_1, ..., x_{t-1})
Each token is predicted from the tokens before it, which is exactly what the causal mask enforces.
Why Causal?
- Training: Model learns to predict next token from previous tokens
- Inference: Generate token-by-token, each token only sees previous ones
- Prevents information leakage from future
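As a rough sketch of how the mask is used inside attention (a single head, simplified relative to nanoGPT’s multi-head implementation), note the masked_fill call that sets future positions to -inf before the softmax:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal single-head causal self-attention (illustrative, not nanoGPT's exact code)."""
    def __init__(self, d_model, max_len):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Lower-triangular mask, built once and stored as a buffer
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)             # each [B, T, C]
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C)     # [B, T, T] attention scores
        # Positions where mask == 0 (the future) get -inf, so softmax gives them zero weight
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        return self.proj(att @ v)                          # [B, T, C]

x = torch.randn(2, 8, 32)                                    # batch of 2, 8 tokens, d_model = 32
print(CausalSelfAttention(d_model=32, max_len=16)(x).shape)  # torch.Size([2, 8, 32])
```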
Encoder vs Decoder Attention:
- Encoder (BERT): Bidirectional, see full sequence
- Decoder (GPT): Causal, see only previous tokens
- Use case: Classification vs generation
Learning Resources:
- Videos: Karpathy’s video explains causal attention clearly
- Code: Implement causal attention from scratch
- Reading: GPT-2 paper (Section 2)
Exercises:
- Implement causal masking in PyTorch
- Visualize causal attention patterns
- Compare bidirectional vs causal attention on same task
- Understand masked_fill operation
Checkpoint: Can you implement causal self-attention from scratch?
Day 5-7: Complete GPT Architecture
Core Concept:
GPT Architecture
What You’ll Learn:
- Decoder-Only Transformer: Stack of transformer decoder blocks
- GPT Block Components: Causal self-attention + MLP
- Architectural Choices: Pre-norm vs post-norm, GELU activation
- Model Sizes: GPT-2 configurations (117M to 1.5B parameters)
GPT Block Structure:
GPT Block (×N):
1. Layer Norm
2. Causal Multi-Head Self-Attention
3. Residual Connection
4. Layer Norm
5. MLP (Feed-Forward: Linear → GELU → Linear)
6. Residual Connection
Complete GPT Architecture:
Input:
Token Embeddings + Positional Embeddings
↓
GPT Blocks (×N)
↓
Layer Norm
↓
Linear (project to vocabulary)
↓
Output: Logits over vocabulary
Architectural Details:
Pre-Norm vs Post-Norm:
- Pre-norm (GPT-2): LayerNorm before attention/MLP - more stable training
- Post-norm (Original Transformer): LayerNorm after - harder to train deep models
Activation Functions:
- GELU (Gaussian Error Linear Unit): Smooth ReLU alternative
- Used in BERT and GPT-2
- Better than ReLU for language modeling
Weight Tying:
- Share weights between token embeddings and output projection
- Reduces parameters, improves performance
- The output projection reuses the token embedding matrix (the linear head multiplies hidden states by the transposed embedding matrix), as in the sketch below
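A compact pre-norm block and GPT skeleton in the spirit of nanoGPT (a sketch, not the repository’s exact code; the multi-head attention here relies on PyTorch 2.x’s scaled_dot_product_attention with is_causal=True). The weight-tying line is the single assignment at the end of the GPT constructor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention (requires PyTorch 2.x)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        heads = lambda t: t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        q, k, v = (heads(t) for t in self.qkv(x).chunk(3, dim=-1))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask applied internally
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    """Pre-norm GPT block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-norm: LayerNorm before each sublayer
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight    # weight tying: one shared matrix

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))            # [B, T, vocab_size] logits
```

GPT(vocab_size=50257, d_model=768, n_heads=12, n_layers=12, max_len=1024) roughly corresponds to the GPT-2 Small row in the table below.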
GPT-2 Model Sizes:
| Model | Layers | d_model | Heads | Parameters |
|---|---|---|---|---|
| GPT-2 Small | 12 | 768 | 12 | 117M |
| GPT-2 Medium | 24 | 1024 | 16 | 345M |
| GPT-2 Large | 36 | 1280 | 20 | 762M |
| GPT-2 XL | 48 | 1600 | 25 | 1.5B |
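As a sanity check on the table (and on the parameter-count exercise later in this section), the count for a configuration can be estimated directly from its shapes. This rough estimate gives about 124M for the small configuration; the 117M figure comes from the GPT-2 paper, while counting every weight of the released small checkpoint lands near 124M.

```python
def gpt2_param_count(n_layers=12, d_model=768, vocab_size=50257, max_len=1024):
    """Rough GPT-2 parameter count (weights + biases, output head tied to the embedding)."""
    embeddings = vocab_size * d_model + max_len * d_model        # token + positional embeddings
    attn = 4 * d_model * d_model + 4 * d_model                   # qkv + output projections
    mlp = 2 * (d_model * 4 * d_model) + 4 * d_model + d_model    # two linear layers of the MLP
    layernorms = 2 * 2 * d_model                                 # two LayerNorms per block
    per_block = attn + mlp + layernorms
    return embeddings + n_layers * per_block + 2 * d_model       # + final LayerNorm

print(f"{gpt2_param_count() / 1e6:.1f}M parameters")  # ≈ 124.4M for the small configuration
```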
Learning Resources:
- Papers:
- “Language Models are Unsupervised Multitask Learners” (GPT-2 paper)
- “Improving Language Understanding by Generative Pre-Training” (GPT-1 paper)
- Videos: Karpathy’s GPT video (main resource)
- Code: NanoGPT implementation walkthrough
Exercises:
- Implement complete GPT model in PyTorch
- Calculate parameter count for different model sizes
- Understand shape transformations through the model
- Trace forward pass with example input
Checkpoint: Can you implement a GPT model from scratch and explain every design choice?
Week 2: Training and Generation
Day 8-10: Training Language Models
Core Concept:
Language Model Training
What You’ll Learn:
- Next-Token Prediction: The core training objective
- Cross-Entropy Loss: Measuring prediction accuracy
- Gradient Accumulation: Training with effective large batch sizes
- Learning Rate Schedules: Warmup + cosine decay
- Training Loop: Complete training pipeline
Training Objective: Maximize the likelihood of each next token given the previous tokens, i.e. maximize Σ_t log P(x_t | x_1, ..., x_{t-1}), which is equivalent to minimizing the average cross-entropy of the model’s next-token predictions.
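In code, this objective is ordinary cross-entropy between the logits at each position and the token that actually comes next. A minimal sketch, assuming a model that maps [B, T] token ids to [B, T, vocab_size] logits:

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """tokens: [B, T+1] integer ids; the model predicts tokens[:, 1:] from tokens[:, :-1]."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                                       # [B, T, vocab_size]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),  # flatten positions
                           targets.reshape(-1))                  # average over all positions
```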
Practical Training Techniques:
1. Gradient Accumulation:
- Problem: Can’t fit large batch on single GPU
- Solution: Accumulate gradients over multiple forward passes
for i, batch in enumerate(dataloader):
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = model(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
2. Learning Rate Schedule:
- Warmup: Linear increase from 0 to max_lr (first 2000 steps typical)
- Cosine Decay: Smooth decrease to min_lr
- Why? Transformers are sensitive to the learning rate; warmup stabilizes the early phase of training (a schedule sketch follows this list)
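A sketch of that schedule as a plain function of the step number (the constants mirror the typical values above; nanoGPT’s training script implements a function of the same shape):

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps             # linear warmup from ~0
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))        # goes 1 -> 0 over the decay phase
    return min_lr + coeff * (max_lr - min_lr)

# Applied each training step, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```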
3. Optimizer:
- AdamW: Adam with decoupled weight decay
- Typical hyperparameters:
- lr = 6e-4 for small models
- betas = (0.9, 0.95)
- weight_decay = 0.1
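Those values plug straight into PyTorch’s built-in AdamW (a minimal setup; nanoGPT goes further and applies weight decay only to the weight matrices, not to biases or LayerNorm parameters):

```python
import torch

# model: any nn.Module, e.g. the GPT sketched earlier in this module
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                 # max learning rate for a small model
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```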
4. Checkpointing:
- Save model periodically during training
- Resume from checkpoints if training interrupted
- Keep best checkpoints based on validation loss
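A minimal save/resume pattern with plain torch.save/torch.load (the file path and dictionary keys are illustrative):

```python
import torch

def save_checkpoint(path, model, optimizer, step, best_val_loss):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "best_val_loss": best_val_loss,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"], ckpt["best_val_loss"]   # resume step and best validation loss so far
```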
Training Best Practices:
- Monitor train and validation loss
- Watch for overfitting (train loss << validation loss)
- Use gradient clipping (torch.nn.utils.clip_grad_norm_)
- Log to TensorBoard or wandb
- Validate every N steps
Learning Resources:
- Videos: Karpathy’s video covers training in detail
- Code: NanoGPT training script
- Papers: GPT-2 training details
Exercises:
- Implement complete training loop
- Train small GPT on Shakespeare text
- Implement gradient accumulation
- Add learning rate scheduling
- Plot training curves
Checkpoint: Can you train a small GPT model and get it to generate coherent text?
Day 11-12: Text Generation Strategies
Core Concept:
Text Generation
What You’ll Learn:
- Greedy Decoding: Always pick highest probability token
- Sampling: Sample from probability distribution
- Temperature: Controlling randomness
- Top-k Sampling: Sample from k most likely tokens
- Top-p (Nucleus) Sampling: Sample from cumulative probability p
- Beam Search: Keep track of multiple sequences
Generation Strategies Comparison:
1. Greedy Decoding:
next_token = logits.argmax(dim=-1)
- Pros: Deterministic, fast
- Cons: Repetitive, boring text
- Use case: Factual tasks, short generation
2. Temperature Sampling:
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, 1)
- temperature < 1: More confident (peaked distribution)
- temperature > 1: More random (flatter distribution)
- temperature = 1: Sample from model’s distribution
- Use case: Creative text generation
3. Top-k Sampling:
top_k_probs, top_k_indices = torch.topk(probs, k)
next_token = top_k_indices[torch.multinomial(top_k_probs, 1)]
- Only sample from the k most likely tokens
- k = 40 is common default
- Prevents sampling very low probability tokens
- Use case: Balanced coherence and creativity
4. Top-p (Nucleus) Sampling:
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
# Remove tokens with cumulative probability > p
mask = cumulative_probs > p
- Dynamically adjusts the number of candidate tokens
- p = 0.9 is common
- Better than top-k for varying distributions
- Use case: High-quality generation (GPT-3 default)
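Putting the pieces above together, a complete top-p sampling step for a single position could look like the following sketch (unbatched for clarity; the subtraction trick keeps at least the most likely token even when it alone exceeds p):

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits, p=0.9, temperature=1.0):
    """logits: [vocab_size] for the next position; returns one sampled token id."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token only if the cumulative mass *before* it already exceeds p,
    # so the nucleus always contains at least the top token
    drop = cumulative_probs - sorted_probs > p
    sorted_probs[drop] = 0.0
    sorted_probs /= sorted_probs.sum()                # renormalize the nucleus
    idx = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[idx]
```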
5. Beam Search:
- Keep track of top-k sequences at each step
- Expand each sequence with possible next tokens
- Keep top-k overall sequences
- Return sequence with highest score
- Use case: Machine translation, summarization
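A deliberately simplified beam search sketch (no length normalization or end-of-sequence handling; it assumes a model that maps [1, T] token ids to [1, T, vocab_size] logits):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, prompt_ids, beam_width=4, max_new_tokens=20):
    """prompt_ids: [T] tensor of token ids; returns the highest-scoring sequence."""
    beams = [(prompt_ids, 0.0)]                        # (sequence, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            logits = model(seq.unsqueeze(0))[0, -1]    # logits for the next position
            log_probs = F.log_softmax(logits, dim=-1)
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top_lp, top_ids):
                candidates.append((torch.cat([seq, tok.view(1)]), score + lp.item()))
        # Keep only the best beam_width sequences overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```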
Generation Strategy Selection:
- Factual tasks (QA, classification): Greedy or low temperature
- Creative tasks (stories, poetry): Higher temperature, top-p
- Code generation: Top-p with temperature ~0.8
- Chat: Top-p with temperature ~0.7-0.9
KV Cache Optimization:
- Problem: Recomputing attention for all previous tokens is wasteful
- Solution: Cache key and value tensors from previous steps
- Speedup: ~10x faster generation
- Trade-off: Memory vs compute
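The idea in miniature (single head, batch of one, purely illustrative): cache the keys and values seen so far and only project the newest token at each decoding step.

```python
import math
import torch
import torch.nn.functional as F

class KVCacheAttention(torch.nn.Module):
    """Single-head attention with a KV cache for incremental decoding (illustrative sketch)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.cache_k, self.cache_v = None, None

    def step(self, x_new):
        """x_new: [1, 1, d_model], the hidden state of the newest token only."""
        q, k, v = self.qkv(x_new).chunk(3, dim=-1)
        # Append the new key/value instead of recomputing projections for all past positions
        self.cache_k = k if self.cache_k is None else torch.cat([self.cache_k, k], dim=1)
        self.cache_v = v if self.cache_v is None else torch.cat([self.cache_v, v], dim=1)
        att = (q @ self.cache_k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        att = F.softmax(att, dim=-1)   # no causal mask needed: the cache holds only the past
        return att @ self.cache_v      # [1, 1, d_model]
```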
Learning Resources:
- Papers:
- “The Curious Case of Neural Text Degeneration” (top-p paper)
- “Hierarchical Neural Story Generation” (introduces top-k sampling)
- Code: Implement all sampling strategies
- Reading: Hugging Face generation documentation
Exercises:
- Implement all 5 generation strategies
- Compare outputs from different strategies
- Experiment with temperature values
- Implement KV cache for faster generation
- Generate text with different prompts
Checkpoint: Can you implement top-p sampling and explain when to use each generation strategy?
Day 13-14: Hands-On Project - Build and Train NanoGPT
Complete Implementation Project:
Requirements:
- Implementation: Build NanoGPT from scratch (following Karpathy’s video)
- Token embeddings + positional embeddings
- GPT blocks with causal self-attention
- Layer norms and residuals
- Output projection to vocabulary
- Training: Train on a small dataset
- Shakespeare text (Karpathy’s default)
- Or choose your own: code, medical text, etc.
- Implement gradient accumulation
- Add learning rate scheduling
- Log training metrics
- Generation: Generate text with multiple strategies
- Implement greedy, temperature, top-k, top-p
- Compare generation quality
- Experiment with different hyperparameters
- Analysis: Analyze your trained model
- Plot training and validation loss
- Visualize attention patterns
- Generate samples at different temperatures
- Analyze what the model learned
Deliverables:
- Complete working GPT implementation (~300 lines)
- Trained model checkpoint
- Training curves and metrics
- Generated text samples (various strategies)
- Analysis of attention patterns
- README documenting design choices
Success Criteria:
- Model trains without errors
- Validation loss decreases smoothly
- Generated text is coherent (for Shakespeare: captures style)
- Can generate with different strategies
- Understand every line of code
Time Estimate: 10-15 hours
Module Completion Criteria
You have completed this module when you can:
- ✅ Understand and implement BPE tokenization
- ✅ Implement causal self-attention with masking
- ✅ Build complete GPT model from scratch
- ✅ Train GPT on a dataset with gradient accumulation and LR scheduling
- ✅ Implement all major generation strategies (greedy, temperature, top-k, top-p, beam search)
- ✅ Understand the difference between encoder-only, decoder-only, and encoder-decoder transformers
- ✅ Explain when to use each architecture for different tasks
- ✅ Generate coherent text with your trained model
Key Resources
Essential Video (Must Watch and Code Along)
Andrej Karpathy: “Let’s Build GPT: From Scratch, in Code, Spelled Out”
- 2-hour tutorial
- Code along - don’t just watch
- Pause frequently to understand
- Best resource for understanding GPT implementation
Essential Code Repository
NanoGPT (github.com/karpathy/nanoGPT)
- Clean, educational implementation
- ~300 lines of core model code
- Reproduces GPT-2 results
- Read every line, understand every choice
Essential Papers
- “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018)
- GPT-1 paper, introduces the architecture
- “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019)
- GPT-2 paper, demonstrates scale and zero-shot capabilities
- “The Curious Case of Neural Text Degeneration” (Holtzman et al., 2019)
- Top-p (nucleus) sampling paper
Additional Resources
- OpenAI GPT-3 paper (optional, for scale insights)
- GPT-4 technical report (optional, for latest developments)
- Stanford CS224N lectures on language models
Common Pitfalls
1. Not Coding Along
Problem: Watching Karpathy’s video passively
Solution: Pause every few minutes, implement it yourself, debug
2. Wrong Attention Mask
Problem: Using bidirectional attention instead of causal
Solution: Always use a lower-triangular mask for GPT
3. Forgetting Positional Encodings
Problem: Model has no sense of token order
Solution: Add positional embeddings to token embeddings
4. Too High Learning Rate
Problem: Training diverges immediately
Solution: Start with lr=6e-4, use warmup
5. Insufficient Training Data
Problem: Model doesn’t generate coherent text
Solution: Need sufficient data (the Shakespeare corpus is ~1MB, sufficient for a small model)
6. Wrong Generation Temperature
Problem: Repetitive or incoherent text
Solution: Try temperature=0.8-1.0, or use top-p sampling
Success Tips
- Code Along with Karpathy
- Pause video frequently
- Type every line yourself
- Debug every error
- Understand before proceeding
- Start Small
- Begin with tiny model (4 layers, 128 dim)
- Train on small dataset
- Verify training works
- Then scale up
- Monitor Everything
- Plot training and validation loss
- Log learning rate schedule
- Visualize attention patterns
- Generate samples during training
- Experiment
- Try different model sizes
- Test generation strategies
- Vary temperature
- Use different datasets
- Understand Every Line
- Don’t copy-paste without understanding
- Trace shapes through the network
- Understand gradient flow
- Read PyTorch docs for unfamiliar functions
Connection to Healthcare AI
GPT-style models are powerful for healthcare applications:
EHR Sequence Modeling:
- Treat patient event sequences as “language”
- Predict next medical event given history
- ETHOS (thesis baseline) uses a similar architecture
Clinical Note Generation:
- Generate clinical notes from structured data
- Summarize long medical histories
- Draft discharge summaries
Medical Code Prediction:
- Predict ICD codes from clinical notes
- Autoregressive generation of diagnosis codes
Patient Trajectory Modeling:
- Predict future medical events
- Risk stratification
- Treatment recommendation
Next Steps
After completing this module, you have several directions:
- Advanced Language Models:
- BERT (encoder-only) for classification
- T5 (encoder-decoder) for sequence-to-sequence
- Scaling laws and efficient training
- Vision-Language Models:
- Vision Transformers (ViT)
- CLIP for vision-language alignment
- Healthcare Applications:
- Apply to EHR analysis
- Clinical language models
- Multimodal patient modeling
- Advanced Topics:
- Efficient transformers (FlashAttention)
- Instruction tuning and RLHF
- Prompt engineering
Time Investment
Total estimated time: 15-25 hours over 2 weeks
- Karpathy video + coding along: 4-6 hours
- NanoGPT implementation: 6-10 hours
- Training and experiments: 3-5 hours
- Reading papers: 2-4 hours
Block out focused time - this module requires deep concentration.
Key Takeaway
“There’s no substitute for building it yourself.”
You can read about GPT, watch videos about GPT, but you don’t truly understand GPT until you’ve built it from scratch, trained it, and generated text with it. The debugging process - figuring out why attention masks aren’t working, why loss isn’t decreasing, why generation is repetitive - is where the real learning happens.
Code along with Karpathy’s video. Implement NanoGPT yourself. Train it. Break it. Fix it. That’s how you master language models.
Ready to begin? Start with Tokenization and then watch Karpathy’s “Let’s Build GPT” video while coding along.