Building an LLM from Scratch - NanoGPT
This module provides complete, code-level understanding of how GPT-style language models are built and trained. By working through Andrej Karpathy’s NanoGPT implementation, you’ll solidify your transformer knowledge and learn the practical details of training large models.
Why This Module Matters
This hands-on experience is critical for modern AI applications:
- Complete Understanding: Move from theory to practice by implementing every component
- Decoder-Only Transformers: Master the architecture powering GPT-3, GPT-4, LLaMA, and most modern LLMs
- Sequence Modeling Skills: Learn to model any sequential data (text, code, medical events) as a language modeling problem
- Healthcare Application: Apply these principles to tokenize and model patient event sequences
Learning Objectives
After completing this module, you will:
- Complete GPT Understanding: Gain line-by-line understanding of how GPT works, from tokenization to generation
- Decoder-Only Transformers: Master causal self-attention and autoregressive generation, the core of modern LLMs
- Training Large Models: Learn practical training techniques - gradient accumulation, learning rate schedules, checkpointing
- Healthcare Sequence Foundation: Prepare to tokenize and model EHR event sequences as language modeling problems
Prerequisites
Before starting this module, ensure you have:
- Deep understanding of attention and transformers
- PyTorch: Comfortable with nn.Module, training loops, optimizers
- Time: Block out substantial time - this module requires focused, uninterrupted work (coding along with Karpathy’s 2-hour video tutorial)
Primary Resource
The primary resource for this module is Andrej Karpathy’s work:
- Video: “Let’s Build GPT: From Scratch, in Code, Spelled Out” (2 hours)
- You must code along - passive watching won’t work
- Pause frequently to understand each line
- Code: NanoGPT Repository
- Clean, educational implementation
- ~300 lines of core model code
- Reproduces GPT-2 results
Remember: There’s no substitute for building things from scratch. The best way to understand GPT is to implement it yourself, one line at a time.
Week 1: From Tokenization to Architecture
Day 1-2: Tokenization
Core Concept:
Tokenization
What You’ll Learn:
- Byte Pair Encoding (BPE): The standard tokenization for LLMs
- Subword Units: Balance between character and word-level tokenization
- Vocabulary Construction: Building the token vocabulary from data
- Special Tokens: <PAD>, <BOS>, <EOS>, <UNK>
- Encoding/Decoding: Text → tokens → text
Why Tokenization Matters:
- Character-level: Produces very long sequences that are harder to model
- Word-level: Huge vocabulary, can’t handle rare words
- Subword (BPE): Best of both worlds
BPE Algorithm (Simplified):
- Start with character vocabulary
- Find most frequent character pair
- Merge pair into single token
- Repeat until the desired vocabulary size is reached (sketched in code below)
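To make the merge loop concrete, here is a toy BPE trainer in plain Python. It is a minimal sketch of the four steps above, not GPT-2’s actual byte-level BPE (which works on raw bytes and uses regex pre-tokenization); the names and the tiny corpus are illustrative.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: corpus is a list of words; returns the list of learned merges."""
    # 1. Start with a character-level vocabulary: each word is a tuple of symbols
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # 2. Count adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # 3. Merge the most frequent pair into a single new token everywhere it occurs
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    # 4. The loop repeats until the requested number of merges (vocabulary size) is reached
    return merges

# A frequent pair such as ('l', 'o') gets merged early in this toy corpus
print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
```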
For Healthcare:
- Tokenize medical event codes (ICD-10, ATC)
- Hierarchical tokenization (code → category → broad category)
- Temporal tokens (time intervals between events)
Learning Resources:
- Papers:
- “Neural Machine Translation of Rare Words with Subword Units” (Sennrich et al., 2016)
- GPT-2 paper discusses BPE
- Code:
- SentencePiece library
- tiktoken (OpenAI’s tokenizer)
- Reading: Hugging Face tokenizers documentation
Exercises:
- Implement simple BPE algorithm
- Tokenize sample text with different vocabulary sizes
- Analyze vocabulary distribution (common vs rare tokens)
- Understand token-to-string mapping
Checkpoint: Can you explain why BPE is better than word-level or character-level tokenization?
Day 3-4: Causal Attention
Core Concept:
Causal Attention
What You’ll Learn:
- Autoregressive Generation: Predicting next token given previous tokens
- Causal Masking: Preventing attention to future positions
- Difference from Encoder: Unidirectional vs bidirectional attention
- Implementation: Creating and applying causal mask
The Causal Mask:
# Lower triangular matrix - can only attend to past
mask = torch.tril(torch.ones(seq_len, seq_len))
# Shape: [seq_len, seq_len], 1 = attend, 0 = mask
Autoregressive Formulation:
P(x_1, ..., x_T) = ∏_t P(x_t | x_1, ..., x_{t-1})
Each token is predicted from the tokens before it, which is exactly what the causal mask enforces.
Why Causal?
- Training: Model learns to predict next token from previous tokens
- Inference: Generate token-by-token, each token only sees previous ones
- Prevents information leakage from future
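As a rough sketch of how the mask is used inside attention (a single head, simplified relative to nanoGPT’s multi-head implementation), note the masked_fill call that sets future positions to -inf before the softmax:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Minimal single-head causal self-attention (illustrative, not nanoGPT's exact code)."""
    def __init__(self, d_model, max_len):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Lower-triangular mask, built once and stored as a buffer
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)             # each [B, T, C]
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C)     # [B, T, T] attention scores
        # Positions where mask == 0 (the future) get -inf, so softmax gives them zero weight
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        return self.proj(att @ v)                          # [B, T, C]

x = torch.randn(2, 8, 32)                                    # batch of 2, 8 tokens, d_model = 32
print(CausalSelfAttention(d_model=32, max_len=16)(x).shape)  # torch.Size([2, 8, 32])
```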
Encoder vs Decoder Attention:
- Encoder (BERT): Bidirectional, see full sequence
- Decoder (GPT): Causal, see only previous tokens
- Use case: Classification vs generation
Learning Resources:
- Videos: Karpathy’s video explains causal attention clearly
- Code: Implement causal attention from scratch
- Reading: GPT-2 paper (Section 2)
Exercises:
- Implement causal masking in PyTorch
- Visualize causal attention patterns
- Compare bidirectional vs causal attention on same task
- Understand masked_fill operation
Checkpoint: Can you implement causal self-attention from scratch?
Day 5-7: Complete GPT Architecture
Core Concept:
GPT Architecture
What You’ll Learn:
- Decoder-Only Transformer: Stack of transformer decoder blocks
- GPT Block Components: Causal self-attention + MLP
- Architectural Choices: Pre-norm vs post-norm, GELU activation
- Model Sizes: GPT-2 configurations (117M to 1.5B parameters)
GPT Block Structure:
GPT Block (×N):
1. Layer Norm
2. Causal Multi-Head Self-Attention
3. Residual Connection
4. Layer Norm
5. MLP (Feed-Forward: Linear → GELU → Linear)
6. Residual Connection
Complete GPT Architecture:
Input:
Token Embeddings + Positional Embeddings
↓
GPT Blocks (×N)
↓
Layer Norm
↓
Linear (project to vocabulary)
↓
Output: Logits over vocabulary
Architectural Details:
Pre-Norm vs Post-Norm:
- Pre-norm (GPT-2): LayerNorm before attention/MLP - more stable training
- Post-norm (Original Transformer): LayerNorm after - harder to train deep models
Activation Functions:
- GELU (Gaussian Error Linear Unit): Smooth ReLU alternative
- Used in BERT and GPT-2
- Better than ReLU for language modeling
Weight Tying:
- Share weights between token embeddings and output projection
- Reduces parameters, improves performance
- The output projection reuses the token embedding matrix (the linear head multiplies hidden states by the transposed embedding matrix), as in the sketch below
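A compact pre-norm block and GPT skeleton in the spirit of nanoGPT (a sketch, not the repository’s exact code; the multi-head attention here relies on PyTorch 2.x’s scaled_dot_product_attention with is_causal=True). The weight-tying line is the single assignment at the end of the GPT constructor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention (requires PyTorch 2.x)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        heads = lambda t: t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        q, k, v = (heads(t) for t in self.qkv(x).chunk(3, dim=-1))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask applied internally
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    """Pre-norm GPT block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # pre-norm: LayerNorm before each sublayer
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight    # weight tying: one shared matrix

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))            # [B, T, vocab_size] logits
```

GPT(vocab_size=50257, d_model=768, n_heads=12, n_layers=12, max_len=1024) roughly corresponds to the GPT-2 Small row in the table below.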
GPT-2 Model Sizes:
| Model | Layers | d_model | Heads | Parameters |
|---|---|---|---|---|
| GPT-2 Small | 12 | 768 | 12 | 117M |
| GPT-2 Medium | 24 | 1024 | 16 | 345M |
| GPT-2 Large | 36 | 1280 | 20 | 762M |
| GPT-2 XL | 48 | 1600 | 25 | 1.5B |
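As a sanity check on the table (and on the parameter-count exercise later in this section), the count for a configuration can be estimated directly from its shapes. This rough estimate gives about 124M for the small configuration; the 117M figure comes from the GPT-2 paper, while counting every weight of the released small checkpoint lands near 124M.

```python
def gpt2_param_count(n_layers=12, d_model=768, vocab_size=50257, max_len=1024):
    """Rough GPT-2 parameter count (weights + biases, output head tied to the embedding)."""
    embeddings = vocab_size * d_model + max_len * d_model        # token + positional embeddings
    attn = 4 * d_model * d_model + 4 * d_model                   # qkv + output projections
    mlp = 2 * (d_model * 4 * d_model) + 4 * d_model + d_model    # two linear layers of the MLP
    layernorms = 2 * 2 * d_model                                 # two LayerNorms per block
    per_block = attn + mlp + layernorms
    return embeddings + n_layers * per_block + 2 * d_model       # + final LayerNorm

print(f"{gpt2_param_count() / 1e6:.1f}M parameters")  # ≈ 124.4M for the small configuration
```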
Learning Resources:
- Papers:
- “Language Models are Unsupervised Multitask Learners” (GPT-2 paper)
- “Improving Language Understanding by Generative Pre-Training” (GPT-1 paper)
- Videos: Karpathy’s GPT video (main resource)
- Code: NanoGPT implementation walkthrough
Exercises:
- Implement complete GPT model in PyTorch
- Calculate parameter count for different model sizes
- Understand shape transformations through the model
- Trace forward pass with example input
Checkpoint: Can you implement a GPT model from scratch and explain every design choice?
Week 2: Training and Generation
Day 8-10: Training Language Models
Core Concept:
Language Model Training
What You’ll Learn:
- Next-Token Prediction: The core training objective
- Cross-Entropy Loss: Measuring prediction accuracy
- Gradient Accumulation: Training with effective large batch sizes
- Learning Rate Schedules: Warmup + cosine decay
- Training Loop: Complete training pipeline
Training Objective: Maximize the likelihood of each next token given the previous tokens, i.e. maximize Σ_t log P(x_t | x_1, ..., x_{t-1}), which is equivalent to minimizing the average cross-entropy of the model’s next-token predictions.
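In code, this objective is ordinary cross-entropy between the logits at each position and the token that actually comes next. A minimal sketch, assuming a model that maps [B, T] token ids to [B, T, vocab_size] logits:

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """tokens: [B, T+1] integer ids; the model predicts tokens[:, 1:] from tokens[:, :-1]."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                                       # [B, T, vocab_size]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),  # flatten positions
                           targets.reshape(-1))                  # average over all positions
```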
Practical Training Techniques:
1. Gradient Accumulation:
- Problem: Can’t fit large batch on single GPU
- Solution: Accumulate gradients over multiple forward passes
for i, batch in enumerate(dataloader):
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = model(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
2. Learning Rate Schedule:
- Warmup: Linear increase from 0 to max_lr (first 2000 steps typical)
- Cosine Decay: Smooth decrease to min_lr
- Why? Transformers are sensitive to the learning rate; warmup stabilizes the early phase of training (a schedule sketch follows this list)
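A sketch of that schedule as a plain function of the step number (the constants mirror the typical values above; nanoGPT’s training script implements a function of the same shape):

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps             # linear warmup from ~0
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))        # goes 1 -> 0 over the decay phase
    return min_lr + coeff * (max_lr - min_lr)

# Applied each training step, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```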
3. Optimizer:
- AdamW: Adam with decoupled weight decay
- Typical hyperparameters:
- lr = 6e-4 for small models
- betas = (0.9, 0.95)
- weight_decay = 0.1
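Those values plug straight into PyTorch’s built-in AdamW (a minimal setup; nanoGPT goes further and applies weight decay only to the weight matrices, not to biases or LayerNorm parameters):

```python
import torch

# model: any nn.Module, e.g. the GPT sketched earlier in this module
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                 # max learning rate for a small model
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```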
4. Checkpointing:
- Save model periodically during training
- Resume from checkpoints if training interrupted
- Keep best checkpoints based on validation loss
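A minimal save/resume pattern with plain torch.save/torch.load (the file path and dictionary keys are illustrative):

```python
import torch

def save_checkpoint(path, model, optimizer, step, best_val_loss):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
        "best_val_loss": best_val_loss,
    }, path)

def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"], ckpt["best_val_loss"]   # resume step and best validation loss so far
```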
Training Best Practices:
- Monitor train and validation loss
- Watch for overfitting (train loss << validation loss)
- Use gradient clipping (torch.nn.utils.clip_grad_norm_)
- Log to TensorBoard or wandb
- Validate every N steps
Learning Resources:
- Videos: Karpathy’s video covers training in detail
- Code: NanoGPT training script
- Papers: GPT-2 training details
Exercises:
- Implement complete training loop
- Train small GPT on Shakespeare text
- Implement gradient accumulation
- Add learning rate scheduling
- Plot training curves
Checkpoint: Can you train a small GPT model and get it to generate coherent text?
Day 11-12: Text Generation Strategies
Core Concept:
Text Generation
What You’ll Learn:
- Greedy Decoding: Always pick highest probability token
- Sampling: Sample from probability distribution
- Temperature: Controlling randomness
- Top-k Sampling: Sample from k most likely tokens
- Top-p (Nucleus) Sampling: Sample from cumulative probability p
- Beam Search: Keep track of multiple sequences
Generation Strategies Comparison:
1. Greedy Decoding:
next_token = logits.argmax(dim=-1)
- Pros: Deterministic, fast
- Cons: Repetitive, boring text
- Use case: Factual tasks, short generation
2. Temperature Sampling:
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, 1)
- temperature < 1: More confident (peaked distribution)
- temperature > 1: More random (flatter distribution)
- temperature = 1: Sample from model’s distribution
- Use case: Creative text generation
3. Top-k Sampling:
top_k_probs, top_k_indices = torch.topk(probs, k)
next_token = top_k_indices[torch.multinomial(top_k_probs, 1)]
- Only sample from the k most likely tokens
- k = 40 is common default
- Prevents sampling very low probability tokens
- Use case: Balanced coherence and creativity
4. Top-p (Nucleus) Sampling:
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
# Remove tokens with cumulative probability > p
mask = cumulative_probs > p
- Dynamically adjusts the number of candidate tokens
- p = 0.9 is common
- Better than top-k for varying distributions
- Use case: High-quality generation (GPT-3 default)
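Putting the pieces above together, a complete top-p sampling step for a single position could look like the following sketch (unbatched for clarity; the subtraction trick keeps at least the most likely token even when it alone exceeds p):

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits, p=0.9, temperature=1.0):
    """logits: [vocab_size] for the next position; returns one sampled token id."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token only if the cumulative mass *before* it already exceeds p,
    # so the nucleus always contains at least the top token
    drop = cumulative_probs - sorted_probs > p
    sorted_probs[drop] = 0.0
    sorted_probs /= sorted_probs.sum()                # renormalize the nucleus
    idx = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[idx]
```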
5. Beam Search:
- Keep track of top-k sequences at each step
- Expand each sequence with possible next tokens
- Keep top-k overall sequences
- Return sequence with highest score
- Use case: Machine translation, summarization
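A deliberately simplified beam search sketch (no length normalization or end-of-sequence handling; it assumes a model that maps [1, T] token ids to [1, T, vocab_size] logits):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, prompt_ids, beam_width=4, max_new_tokens=20):
    """prompt_ids: [T] tensor of token ids; returns the highest-scoring sequence."""
    beams = [(prompt_ids, 0.0)]                        # (sequence, cumulative log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            logits = model(seq.unsqueeze(0))[0, -1]    # logits for the next position
            log_probs = F.log_softmax(logits, dim=-1)
            top_lp, top_ids = torch.topk(log_probs, beam_width)
            for lp, tok in zip(top_lp, top_ids):
                candidates.append((torch.cat([seq, tok.view(1)]), score + lp.item()))
        # Keep only the best beam_width sequences overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```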
Generation Strategy Selection:
- Factual tasks (QA, classification): Greedy or low temperature
- Creative tasks (stories, poetry): Higher temperature, top-p
- Code generation: Top-p with temperature ~0.8
- Chat: Top-p with temperature ~0.7-0.9
KV Cache Optimization:
- Problem: Recomputing attention for all previous tokens is wasteful
- Solution: Cache key and value tensors from previous steps
- Speedup: ~10x faster generation
- Trade-off: Memory vs compute
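The idea in miniature (single head, batch of one, purely illustrative): cache the keys and values seen so far and only project the newest token at each decoding step.

```python
import math
import torch
import torch.nn.functional as F

class KVCacheAttention(torch.nn.Module):
    """Single-head attention with a KV cache for incremental decoding (illustrative sketch)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.cache_k, self.cache_v = None, None

    def step(self, x_new):
        """x_new: [1, 1, d_model], the hidden state of the newest token only."""
        q, k, v = self.qkv(x_new).chunk(3, dim=-1)
        # Append the new key/value instead of recomputing projections for all past positions
        self.cache_k = k if self.cache_k is None else torch.cat([self.cache_k, k], dim=1)
        self.cache_v = v if self.cache_v is None else torch.cat([self.cache_v, v], dim=1)
        att = (q @ self.cache_k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        att = F.softmax(att, dim=-1)   # no causal mask needed: the cache holds only the past
        return att @ self.cache_v      # [1, 1, d_model]
```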
Learning Resources:
- Papers:
- “The Curious Case of Neural Text Degeneration” (top-p paper)
- “Hierarchical Neural Story Generation” (introduces top-k sampling)
- Code: Implement all sampling strategies
- Reading: Hugging Face generation documentation
Exercises:
- Implement all 5 generation strategies
- Compare outputs from different strategies
- Experiment with temperature values
- Implement KV cache for faster generation
- Generate text with different prompts
Checkpoint: Can you implement top-p sampling and explain when to use each generation strategy?
Day 13-14: Hands-On Project - Build and Train NanoGPT
Complete Implementation Project:
Requirements:
- Implementation: Build NanoGPT from scratch (following Karpathy’s video)
- Token embeddings + positional embeddings
- GPT blocks with causal self-attention
- Layer norms and residuals
- Output projection to vocabulary
- Training: Train on a small dataset
- Shakespeare text (Karpathy’s default)
- Or choose your own: code, medical text, etc.
- Implement gradient accumulation
- Add learning rate scheduling
- Log training metrics
- Generation: Generate text with multiple strategies
- Implement greedy, temperature, top-k, top-p
- Compare generation quality
- Experiment with different hyperparameters
- Analysis: Analyze your trained model
- Plot training and validation loss
- Visualize attention patterns
- Generate samples at different temperatures
- Analyze what the model learned
Deliverables:
- Complete working GPT implementation (~300 lines)
- Trained model checkpoint
- Training curves and metrics
- Generated text samples (various strategies)
- Analysis of attention patterns
- README documenting design choices
Success Criteria:
- Model trains without errors
- Validation loss decreases smoothly
- Generated text is coherent (for Shakespeare: captures style)
- Can generate with different strategies
- Understand every line of code
Time Estimate: 10-15 hours
Module Completion Criteria
You have completed this module when you can:
- ✅ Understand and implement BPE tokenization
- ✅ Implement causal self-attention with masking
- ✅ Build complete GPT model from scratch
- ✅ Train GPT on a dataset with gradient accumulation and LR scheduling
- ✅ Implement all major generation strategies (greedy, temperature, top-k, top-p, beam search)
- ✅ Understand the difference between encoder-only, decoder-only, and encoder-decoder transformers
- ✅ Explain when to use each architecture for different tasks
- ✅ Generate coherent text with your trained model
Key Resources
Essential Video (Must Watch and Code Along)
Andrej Karpathy: “Let’s Build GPT: From Scratch, in Code, Spelled Out”
- 2-hour tutorial
- Code along - don’t just watch
- Pause frequently to understand
- Best resource for understanding GPT implementation
Essential Code Repository
NanoGPT (github.com/karpathy/nanoGPT)
- Clean, educational implementation
- ~300 lines of core model code
- Reproduces GPT-2 results
- Read every line, understand every choice
Essential Papers
- “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018)
- GPT-1 paper, introduces the architecture
- “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019)
- GPT-2 paper, demonstrates scale and zero-shot capabilities
- “The Curious Case of Neural Text Degeneration” (Holtzman et al., 2019)
- Top-p (nucleus) sampling paper
Additional Resources
- OpenAI GPT-3 paper (optional, for scale insights)
- GPT-4 technical report (optional, for latest developments)
- Stanford CS224N lectures on language models
Common Pitfalls
1. Not Coding Along
Problem: Watching Karpathy’s video passively
Solution: Pause every few minutes, implement it yourself, debug
2. Wrong Attention Mask
Problem: Using bidirectional attention instead of causal
Solution: Always use a lower-triangular mask for GPT
3. Forgetting Positional Encodings
Problem: Model has no sense of token order
Solution: Add positional embeddings to token embeddings
4. Too High Learning Rate
Problem: Training diverges immediately
Solution: Start with lr=6e-4, use warmup
5. Insufficient Training Data
Problem: Model doesn’t generate coherent text
Solution: Need sufficient data (the Shakespeare corpus is ~1MB, sufficient for a small model)
6. Wrong Generation Temperature
Problem: Repetitive or incoherent text
Solution: Try temperature=0.8-1.0, or use top-p sampling
Success Tips
- Code Along with Karpathy
- Pause video frequently
- Type every line yourself
- Debug every error
- Understand before proceeding
- Start Small
- Begin with tiny model (4 layers, 128 dim)
- Train on small dataset
- Verify training works
- Then scale up
- Monitor Everything
- Plot training and validation loss
- Log learning rate schedule
- Visualize attention patterns
- Generate samples during training
- Experiment
- Try different model sizes
- Test generation strategies
- Vary temperature
- Use different datasets
- Understand Every Line
- Don’t copy-paste without understanding
- Trace shapes through the network
- Understand gradient flow
- Read PyTorch docs for unfamiliar functions
Connection to Healthcare AI
GPT-style models are powerful for healthcare applications:
EHR Sequence Modeling:
- Treat patient event sequences as “language”
- Predict next medical event given history
- ETHOS (thesis baseline) uses a similar architecture
Clinical Note Generation:
- Generate clinical notes from structured data
- Summarize long medical histories
- Draft discharge summaries
Medical Code Prediction:
- Predict ICD codes from clinical notes
- Autoregressive generation of diagnosis codes
Patient Trajectory Modeling:
- Predict future medical events
- Risk stratification
- Treatment recommendation
Next Steps
After completing this module, you have several directions:
- Advanced Language Models:
- BERT (encoder-only) for classification
- T5 (encoder-decoder) for sequence-to-sequence
- Scaling laws and efficient training
- Vision-Language Models:
- Vision Transformers (ViT)
- CLIP for vision-language alignment
- Healthcare Applications:
- Apply to EHR analysis
- Clinical language models
- Multimodal patient modeling
- Advanced Topics:
- Efficient transformers (FlashAttention)
- Instruction tuning and RLHF
- Prompt engineering
Time Investment
Total estimated time: 15-25 hours over 2 weeks
- Karpathy video + coding along: 4-6 hours
- NanoGPT implementation: 6-10 hours
- Training and experiments: 3-5 hours
- Reading papers: 2-4 hours
Block out focused time - this module requires deep concentration.
Key Takeaway
“There’s no substitute for building it yourself.”
You can read about GPT, watch videos about GPT, but you don’t truly understand GPT until you’ve built it from scratch, trained it, and generated text with it. The debugging process - figuring out why attention masks aren’t working, why loss isn’t decreasing, why generation is repetitive - is where the real learning happens.
Code along with Karpathy’s video. Implement NanoGPT yourself. Train it. Break it. Fix it. That’s how you master language models.
Ready to begin? Start with Tokenization and then watch Karpathy’s “Let’s Build GPT” video while coding along.