
Building an LLM from Scratch - NanoGPT

This module provides complete, code-level understanding of how GPT-style language models are built and trained. By working through Andrej Karpathy’s NanoGPT implementation, you’ll solidify your transformer knowledge and learn the practical details of training large models.

Why This Module Matters

This hands-on experience is critical for modern AI applications:

  • Complete Understanding: Move from theory to practice by implementing every component
  • Decoder-Only Transformers: Master the architecture powering GPT-3, GPT-4, LLaMA, and most modern LLMs
  • Sequence Modeling Skills: Learn to model any sequential data (text, code, medical events) as a language modeling problem
  • Healthcare Application: Apply these principles to tokenize and model patient event sequences

Learning Objectives

After completing this module, you will:

  • Complete GPT Understanding: Gain line-by-line understanding of how GPT works, from tokenization to generation
  • Decoder-Only Transformers: Master causal self-attention and autoregressive generation, the core of modern LLMs
  • Training Large Models: Learn practical training techniques - gradient accumulation, learning rate schedules, checkpointing
  • Healthcare Sequence Foundation: Prepare to tokenize and model EHR event sequences as language modeling problems

Prerequisites

Before starting this module, ensure you have:

  • Deep understanding of attention and transformers
  • PyTorch: Comfortable with nn.Module, training loops, optimizers
  • Time: Block out substantial time - this module requires focused, uninterrupted work (coding along with Karpathy’s 2-hour video tutorial)

Primary Resource

The primary resource for this module is Andrej Karpathy's "Let's Build GPT: From Scratch, in Code, Spelled Out" video and his NanoGPT repository (see Key Resources below).

Remember: There’s no substitute for building things from scratch. The best way to understand GPT is to implement it yourself, one line at a time.

Week 1: From Tokenization to Architecture

Day 1-2: Tokenization

Core Concept:

Tokenization

What You’ll Learn:

  • Byte Pair Encoding (BPE): The standard tokenization for LLMs
  • Subword Units: Balance between character and word-level tokenization
  • Vocabulary Construction: Building the token vocabulary from data
  • Special Tokens: <PAD>, <BOS>, <EOS>, <UNK>
  • Encoding/Decoding: Text → tokens → text

Why Tokenization Matters:

  • Character-level: Sequences become very long and hard to learn from
  • Word-level: Huge vocabulary, can’t handle rare words
  • Subword (BPE): Best of both worlds

BPE Algorithm (Simplified):

  1. Start with character vocabulary
  2. Find most frequent character pair
  3. Merge pair into single token
  4. Repeat until desired vocabulary size
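
To make the merge loop concrete, here is a minimal, illustrative BPE sketch on a toy corpus (not the tiktoken or SentencePiece implementation); the word list and merge count are arbitrary:

# Minimal, illustrative BPE: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

corpus = ["low", "lower", "lowest", "newest", "widest", "low"]
words = Counter(tuple(w) for w in corpus)        # start from characters
for _ in range(10):                              # 10 merges = tiny vocabulary
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    words = merge_pair(words, best)
    print("merged:", best)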

For Healthcare:

  • Tokenize medical event codes (ICD-10, ATC)
  • Hierarchical tokenization (code → category → broad category)
  • Temporal tokens (time intervals between events)

Learning Resources:

  • Papers:
    • “Neural Machine Translation of Rare Words with Subword Units” (Sennrich et al., 2016)
    • GPT-2 paper discusses BPE
  • Code:
    • SentencePiece library
    • tiktoken (OpenAI’s tokenizer)
  • Reading: Hugging Face tokenizers documentation

Exercises:

  • Implement simple BPE algorithm
  • Tokenize sample text with different vocabulary sizes
  • Analyze vocabulary distribution (common vs rare tokens)
  • Understand token-to-string mapping

Checkpoint: Can you explain why BPE is better than word-level or character-level tokenization?

Day 3-4: Causal Attention

Core Concept:

Causal Attention

What You’ll Learn:

  • Autoregressive Generation: Predicting next token given previous tokens
  • Causal Masking: Preventing attention to future positions
  • Difference from Encoder: Unidirectional vs bidirectional attention
  • Implementation: Creating and applying causal mask

The Causal Mask:

# Lower triangular matrix - can only attend to past
mask = torch.tril(torch.ones(seq_len, seq_len))
# Shape: [seq_len, seq_len], 1 = attend, 0 = mask
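
To see the mask in action, here is a minimal single-head causal self-attention sketch (names and shapes are illustrative, not NanoGPT's exact code); it uses masked_fill to hide future positions before the softmax:

# Single-head causal self-attention sketch (illustrative names and shapes).
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: [batch, seq_len, d_model]; w_q, w_k, w_v: [d_model, d_head]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # [B, T, T]
    mask = torch.tril(torch.ones(x.size(1), x.size(1), dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))          # hide the future
    weights = F.softmax(scores, dim=-1)                        # rows sum to 1
    return weights @ v                                         # [B, T, d_head]

B, T, d_model, d_head = 2, 8, 32, 16
x = torch.randn(B, T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)           # [2, 8, 16]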

Autoregressive Formulation:

P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})

Why Causal?

  • Training: Model learns to predict next token from previous tokens
  • Inference: Generate token-by-token, each token only sees previous ones
  • Prevents information leakage from future

Encoder vs Decoder Attention:

  • Encoder (BERT): Bidirectional, see full sequence
  • Decoder (GPT): Causal, see only previous tokens
  • Use case: Classification vs generation

Learning Resources:

  • Videos: Karpathy’s video explains causal attention clearly
  • Code: Implement causal attention from scratch
  • Reading: GPT-2 paper (Section 2)

Exercises:

  • Implement causal masking in PyTorch
  • Visualize causal attention patterns
  • Compare bidirectional vs causal attention on same task
  • Understand masked_fill operation

Checkpoint: Can you implement causal self-attention from scratch?

Day 5-7: Complete GPT Architecture

Core Concept:

GPT Architecture

What You’ll Learn:

  • Decoder-Only Transformer: Stack of transformer decoder blocks
  • GPT Block Components: Causal self-attention + MLP
  • Architectural Choices: Pre-norm vs post-norm, GELU activation
  • Model Sizes: GPT-2 configurations (117M to 1.5B parameters)

GPT Block Structure:

GPT Block (×N):
  1. Layer Norm
  2. Causal Multi-Head Self-Attention
  3. Residual Connection
  4. Layer Norm
  5. MLP (Feed-Forward: Linear → GELU → Linear)
  6. Residual Connection

Complete GPT Architecture:

Input: Token Embeddings + Positional Embeddings
GPT Blocks (×N)
Layer Norm
Linear (project to vocabulary)
Output: Logits over vocabulary
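
The sketch below assembles these pieces into a toy decoder-only model; module names and hyperparameters are illustrative, not NanoGPT verbatim:

# Minimal decoder-only stack: pre-norm blocks, learned positional embeddings,
# final LayerNorm, and a projection to the vocabulary.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)   # pre-norm attention
        x = x + a                                     # residual connection
        x = x + self.mlp(self.ln2(x))                 # pre-norm MLP + residual
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, block_size, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                           # idx: [batch, seq_len]
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)     # token + position
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))             # logits over vocabulary

logits = MiniGPT(vocab_size=100, block_size=64)(torch.randint(0, 100, (2, 16)))
print(logits.shape)                                   # torch.Size([2, 16, 100])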

Architectural Details:

Pre-Norm vs Post-Norm:

  • Pre-norm (GPT-2): LayerNorm before attention/MLP - more stable training
  • Post-norm (Original Transformer): LayerNorm after - harder to train deep models

Activation Functions:

  • GELU (Gaussian Error Linear Unit): Smooth ReLU alternative
  • Used in BERT and GPT-2
  • Better than ReLU for language modeling

Weight Tying:

  • Share weights between token embeddings and output projection
  • Reduces parameters, improves performance
  • Token embedding matrix = output projection matrix transposed
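
A minimal weight-tying sketch in PyTorch (variable names are illustrative); both modules end up sharing one [vocab_size, d_model] tensor:

import torch.nn as nn

vocab_size, d_model = 50257, 768
tok_emb = nn.Embedding(vocab_size, d_model)            # weight: [vocab, d_model]
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: [vocab, d_model]
lm_head.weight = tok_emb.weight                        # one shared parameter tensor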

GPT-2 Model Sizes:

Model        | Layers | d_model | Heads | Parameters
-------------|--------|---------|-------|-----------
GPT-2 Small  | 12     | 768     | 12    | 117M
GPT-2 Medium | 24     | 1024    | 16    | 345M
GPT-2 Large  | 36     | 1280    | 20    | 762M
GPT-2 XL     | 48     | 1600    | 25    | 1.5B
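
As a sanity check on the table, here is a rough parameter estimate from embeddings plus per-block attention and MLP weights (layer norms and biases are ignored, so the totals are approximate):

# Rough parameter count: embeddings + per-block attention and MLP weights.
def gpt2_param_estimate(n_layers, d_model, vocab_size=50257, block_size=1024):
    emb = vocab_size * d_model + block_size * d_model   # token + position
    attn = 4 * d_model * d_model                        # q, k, v, output proj
    mlp = 2 * 4 * d_model * d_model                     # two 4x-wide linears
    return emb + n_layers * (attn + mlp)

print(f"{gpt2_param_estimate(12, 768) / 1e6:.0f}M")     # ~124M, GPT-2 Small ballpark
print(f"{gpt2_param_estimate(48, 1600) / 1e9:.2f}B")    # ~1.56B, GPT-2 XL ballpark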

Learning Resources:

  • Papers:
    • “Language Models are Unsupervised Multitask Learners” (GPT-2 paper)
    • “Improving Language Understanding by Generative Pre-Training” (GPT-1 paper)
  • Videos: Karpathy’s GPT video (main resource)
  • Code: NanoGPT implementation walkthrough

Exercises:

  • Implement complete GPT model in PyTorch
  • Calculate parameter count for different model sizes
  • Understand shape transformations through the model
  • Trace forward pass with example input

Checkpoint: Can you implement a GPT model from scratch and explain every design choice?

Week 2: Training and Generation

Day 8-10: Training Language Models

Core Concept:

Language Model Training

What You’ll Learn:

  • Next-Token Prediction: The core training objective
  • Cross-Entropy Loss: Measuring prediction accuracy
  • Gradient Accumulation: Training with effective large batch sizes
  • Learning Rate Schedules: Warmup + cosine decay
  • Training Loop: Complete training pipeline

Training Objective: Maximize likelihood of next token given previous tokens:

\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})
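
In code, this objective is cross-entropy between the logits at each position and the tokens shifted one step to the left; a shapes-only sketch (random logits stand in for the model output):

import torch
import torch.nn.functional as F

B, T, V = 4, 128, 50257                      # batch, sequence length, vocab size
tokens = torch.randint(0, V, (B, T + 1))     # one extra token for the shift
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens <= t
logits = torch.randn(B, T, V)                # stand-in for model(inputs)
loss = F.cross_entropy(logits.reshape(B * T, V), targets.reshape(B * T))
print(loss.item())                           # ~log(V) ≈ 10.8 for random logits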

Practical Training Techniques:

1. Gradient Accumulation:

  • Problem: Can’t fit large batch on single GPU
  • Solution: Accumulate gradients over multiple forward passes
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

2. Learning Rate Schedule:

  • Warmup: Linear increase from 0 to max_lr (first 2000 steps typical)
  • Cosine Decay: Smooth decrease to min_lr
  • Why? Transformers sensitive to learning rate, warmup stabilizes training
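
A warmup-plus-cosine schedule can be written as a small function of the training step; a sketch with illustrative constants, in the same spirit as NanoGPT's scheduler:

import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, max_steps=600000):
    if step < warmup_steps:                              # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:                                 # after decay: hold the floor
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

for step in (0, 1000, 2000, 300000, 600000):
    print(step, f"{get_lr(step):.2e}")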

3. Optimizer:

  • AdamW: Adam with decoupled weight decay
  • Typical hyperparameters:
    • lr = 6e-4 for small models
    • betas = (0.9, 0.95)
    • weight_decay = 0.1
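
Wiring those hyperparameters into torch.optim.AdamW looks like this (the Linear module is a stand-in for the full GPT model):

import torch
import torch.nn as nn

model = nn.Linear(8, 8)     # placeholder model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)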

4. Checkpointing:

  • Save model periodically during training
  • Resume from checkpoints if training interrupted
  • Keep best checkpoints based on validation loss
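
A minimal checkpointing sketch (file name and dictionary keys are illustrative): save the model and optimizer state dicts, then restore both to resume:

import torch
import torch.nn as nn

model = nn.Linear(8, 8)     # placeholder for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": 1000}, "ckpt.pt")

state = torch.load("ckpt.pt")                # later: resume from the checkpoint
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])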

Training Best Practices:

  • Monitor train and validation loss
  • Watch for overfitting (training loss much lower than validation loss)
  • Use gradient clipping (clip_grad_norm)
  • Log to TensorBoard or wandb
  • Validate every N steps

Learning Resources:

  • Videos: Karpathy’s video covers training in detail
  • Code: NanoGPT training script
  • Papers: GPT-2 training details

Exercises:

  • Implement complete training loop
  • Train small GPT on Shakespeare text
  • Implement gradient accumulation
  • Add learning rate scheduling
  • Plot training curves

Checkpoint: Can you train a small GPT model and get it to generate coherent text?

Day 11-12: Text Generation Strategies

Core Concept:

Text Generation

What You’ll Learn:

  • Greedy Decoding: Always pick highest probability token
  • Sampling: Sample from probability distribution
  • Temperature: Controlling randomness
  • Top-k Sampling: Sample from k most likely tokens
  • Top-p (Nucleus) Sampling: Sample from cumulative probability p
  • Beam Search: Keep track of multiple sequences

Generation Strategies Comparison:

1. Greedy Decoding:

next_token = logits.argmax(dim=-1)
  • Pros: Deterministic, fast
  • Cons: Repetitive, boring text
  • Use case: Factual tasks, short generation

2. Temperature Sampling:

logits = logits / temperature
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, 1)
  • temperature < 1: More confident (peaked distribution)
  • temperature > 1: More random (flatter distribution)
  • temperature = 1: Sample from model’s distribution
  • Use case: Creative text generation

3. Top-k Sampling:

top_k_probs, top_k_indices = torch.topk(probs, k)
next_token = top_k_indices[torch.multinomial(top_k_probs, 1)]
  • Only sample from k most likely tokens
  • k = 40 is common default
  • Prevents sampling very low probability tokens
  • Use case: Balanced coherence and creativity

4. Top-p (Nucleus) Sampling:

sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
# Remove tokens whose preceding cumulative probability already exceeds p
# (the top token is always kept)
mask = cumulative_probs - sorted_probs > p
sorted_probs = sorted_probs.masked_fill(mask, 0.0)
next_token = sorted_indices[torch.multinomial(sorted_probs, 1)]
  • Dynamically adjust number of candidates
  • p = 0.9 is common
  • Better than top-k for varying distributions
  • Use case: High-quality generation (GPT-3 default)

5. Beam Search:

  • Keep track of top-k sequences at each step
  • Expand each sequence with possible next tokens
  • Keep top-k overall sequences
  • Return sequence with highest score
  • Use case: Machine translation, summarization
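
A tiny beam search sketch over per-step log-probabilities; dummy_logprobs is a deterministic placeholder for a real language model, and the beam width and step count are arbitrary:

import torch

def dummy_logprobs(seq, vocab_size=10):
    torch.manual_seed(sum(seq))                     # deterministic stand-in for a model
    return torch.log_softmax(torch.randn(vocab_size), dim=-1)

def beam_search(start, steps=4, beam_width=3):
    beams = [(start, 0.0)]                          # (sequence, total log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            logp = dummy_logprobs(seq)
            for tok in range(logp.size(0)):         # expand with every possible token
                candidates.append((seq + [tok], score + logp[tok].item()))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]             # keep top-k overall sequences
    return beams[0]                                 # highest-scoring sequence

print(beam_search([1]))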

Generation Strategy Selection:

  • Factual tasks (QA, classification): Greedy or low temperature
  • Creative tasks (stories, poetry): Higher temperature, top-p
  • Code generation: Top-p with temperature ~0.8
  • Chat: Top-p with temperature ~0.7-0.9

KV Cache Optimization:

  • Problem: Recomputing attention for all previous tokens is wasteful
  • Solution: Cache key and value tensors from previous steps
  • Speedup: ~10x faster generation
  • Trade-off: Memory vs compute
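
A single-head KV-cache sketch (illustrative, not NanoGPT's generate loop): each step computes keys and values only for the newest token and appends them to the cache:

import math
import torch
import torch.nn.functional as F

def step_with_cache(x_new, w_q, w_k, w_v, cache):
    # x_new: [batch, 1, d_model] -- only the most recent token
    q = x_new @ w_q
    k_new, v_new = x_new @ w_k, x_new @ w_v
    k = torch.cat([cache["k"], k_new], dim=1) if cache else k_new
    v = torch.cat([cache["v"], v_new], dim=1) if cache else v_new
    att = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(k.size(-1)), dim=-1)
    return att @ v, {"k": k, "v": v}        # output for new token + updated cache

B, d_model, d_head = 1, 32, 16
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
cache = {}
for t in range(5):                          # generate 5 steps, growing the cache
    out, cache = step_with_cache(torch.randn(B, 1, d_model), w_q, w_k, w_v, cache)
print(cache["k"].shape)                     # torch.Size([1, 5, 16])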

Learning Resources:

  • Papers:
    • “The Curious Case of Neural Text Degeneration” (top-p paper)
    • “Hierarchical Neural Story Generation” (sampling strategies)
  • Code: Implement all sampling strategies
  • Reading: Hugging Face generation documentation

Exercises:

  • Implement all 5 generation strategies
  • Compare outputs from different strategies
  • Experiment with temperature values
  • Implement KV cache for faster generation
  • Generate text with different prompts

Checkpoint: Can you implement top-p sampling and explain when to use each generation strategy?

Day 13-14: Hands-On Project - Build and Train NanoGPT

Complete Implementation Project:

Requirements:

  1. Implementation: Build NanoGPT from scratch (following Karpathy’s video)

    • Token embeddings + positional embeddings
    • GPT blocks with causal self-attention
    • Layer norms and residuals
    • Output projection to vocabulary
  2. Training: Train on a small dataset

    • Shakespeare text (Karpathy’s default)
    • Or choose your own: code, medical text, etc.
    • Implement gradient accumulation
    • Add learning rate scheduling
    • Log training metrics
  3. Generation: Generate text with multiple strategies

    • Implement greedy, temperature, top-k, top-p
    • Compare generation quality
    • Experiment with different hyperparameters
  4. Analysis: Analyze your trained model

    • Plot training and validation loss
    • Visualize attention patterns
    • Generate samples at different temperatures
    • Analyze what the model learned

Deliverables:

  • Complete working GPT implementation (~300 lines)
  • Trained model checkpoint
  • Training curves and metrics
  • Generated text samples (various strategies)
  • Analysis of attention patterns
  • README documenting design choices

Success Criteria:

  • Model trains without errors
  • Validation loss decreases smoothly
  • Generated text is coherent (for Shakespeare: captures style)
  • Can generate with different strategies
  • Understand every line of code

Time Estimate: 10-15 hours

Module Completion Criteria

You have completed this module when you can:

  • ✅ Understand and implement BPE tokenization
  • ✅ Implement causal self-attention with masking
  • ✅ Build complete GPT model from scratch
  • ✅ Train GPT on a dataset with gradient accumulation and LR scheduling
  • ✅ Implement all major generation strategies (greedy, temperature, top-k, top-p, beam search)
  • ✅ Understand the difference between encoder-only, decoder-only, and encoder-decoder transformers
  • ✅ Explain when to use each architecture for different tasks
  • ✅ Generate coherent text with your trained model

Key Resources

Essential Video (Must Watch and Code Along)

Andrej Karpathy: “Let’s Build GPT: From Scratch, in Code, Spelled Out”

  • 2-hour tutorial
  • Code along - don’t just watch
  • Pause frequently to understand
  • Best resource for understanding GPT implementation

Essential Code Repository

NanoGPT (github.com/karpathy/nanoGPT)

  • Clean, educational implementation
  • ~300 lines of core model code
  • Reproduces GPT-2 results
  • Read every line, understand every choice

Essential Papers

  1. “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018)

    • GPT-1 paper, introduces the architecture
  2. “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019)

    • GPT-2 paper, demonstrates scale and zero-shot capabilities
  3. “The Curious Case of Neural Text Degeneration” (Holtzman et al., 2019)

    • Top-p (nucleus) sampling paper

Additional Resources

  • OpenAI GPT-3 paper (optional, for scale insights)
  • GPT-4 technical report (optional, for latest developments)
  • Stanford CS224N lectures on language models

Common Pitfalls

1. Not Coding Along

Problem: Watching Karpathy's video passively.
Solution: Pause every few minutes, implement yourself, debug.

2. Wrong Attention Mask

Problem: Using bidirectional attention instead of causal attention.
Solution: Always use a lower triangular mask for GPT.

3. Forgetting Positional Encodings

Problem: Model has no sense of token order.
Solution: Add positional embeddings to the token embeddings.

4. Too High Learning Rate

Problem: Training diverges immediately.
Solution: Start with lr=6e-4 and use warmup.

5. Insufficient Training Data

Problem: Model doesn't generate coherent text.
Solution: Use enough data (the Shakespeare corpus is ~1MB, sufficient for a small model).

6. Wrong Generation Temperature

Problem: Repetitive or incoherent text.
Solution: Try temperature=0.8-1.0, or use top-p sampling.

Success Tips

  1. Code Along with Karpathy

    • Pause video frequently
    • Type every line yourself
    • Debug every error
    • Understand before proceeding
  2. Start Small

    • Begin with tiny model (4 layers, 128 dim)
    • Train on small dataset
    • Verify training works
    • Then scale up
  3. Monitor Everything

    • Plot training and validation loss
    • Log learning rate schedule
    • Visualize attention patterns
    • Generate samples during training
  4. Experiment

    • Try different model sizes
    • Test generation strategies
    • Vary temperature
    • Use different datasets
  5. Understand Every Line

    • Don’t copy-paste without understanding
    • Trace shapes through the network
    • Understand gradient flow
    • Read PyTorch docs for unfamiliar functions

Connection to Healthcare AI

GPT-style models are powerful for healthcare applications:

EHR Sequence Modeling:

  • Treat patient event sequences as “language”
  • Predict next medical event given history
  • ETHOS (thesis baseline) uses similar architecture

Clinical Note Generation:

  • Generate clinical notes from structured data
  • Summarize long medical histories
  • Draft discharge summaries

Medical Code Prediction:

  • Predict ICD codes from clinical notes
  • Autoregressive generation of diagnosis codes

Patient Trajectory Modeling:

  • Predict future medical events
  • Risk stratification
  • Treatment recommendation

Next Steps

After completing this module, you have several directions:

  1. Advanced Language Models:

    • BERT (encoder-only) for classification
    • T5 (encoder-decoder) for sequence-to-sequence
    • Scaling laws and efficient training
  2. Vision-Language Models:

  3. Healthcare Applications:

    • Apply to EHR analysis
    • Clinical language models
    • Multimodal patient modeling
  4. Advanced Topics:

    • Efficient transformers (FlashAttention)
    • Instruction tuning (RLHF)
    • Prompt engineering

Time Investment

Total estimated time: 15-25 hours over 2 weeks

  • Karpathy video + coding along: 4-6 hours
  • NanoGPT implementation: 6-10 hours
  • Training and experiments: 3-5 hours
  • Reading papers: 2-4 hours

Block out focused time - this module requires deep concentration.

Key Takeaway

“There’s no substitute for building it yourself.”

You can read about GPT, watch videos about GPT, but you don’t truly understand GPT until you’ve built it from scratch, trained it, and generated text with it. The debugging process - figuring out why attention masks aren’t working, why loss isn’t decreasing, why generation is repetitive - is where the real learning happens.

Code along with Karpathy’s video. Implement NanoGPT yourself. Train it. Break it. Fix it. That’s how you master language models.


Ready to begin? Start with tokenization, then watch Karpathy's "Let's Build GPT" video while coding along.