
Language Model Scaling Laws

Understanding how model performance scales with size, data, and compute is crucial for making optimal training decisions. This guide covers the empirical power laws discovered by OpenAI, the compute-optimal Chinchilla result, and practical implications for training language models.

Model Size Calculation

The number of parameters in a transformer scales approximately as:

$$\text{Parameters} \approx 12 \times n_{\text{layer}} \times d_{\text{model}}^2$$

This comes from:

  • Attention layers: $4 d_{\text{model}}^2$ parameters per layer (Q, K, V, and output projections)
  • MLP layers: $8 d_{\text{model}}^2$ parameters per layer (two linear layers with a 4× hidden-dimension expansion)

GPT Model Sizes

| Model | Parameters | Layers | d_model | Heads | Context |
|---|---|---|---|---|---|
| NanoGPT | ~10M | 6 | 384 | 6 | 256 |
| GPT-2 Small | 124M | 12 | 768 | 12 | 1024 |
| GPT-2 Medium | 355M | 24 | 1024 | 16 | 1024 |
| GPT-2 Large | 774M | 36 | 1280 | 20 | 1024 |
| GPT-2 XL | 1.5B | 48 | 1600 | 25 | 1024 |
| GPT-3 | 175B | 96 | 12288 | 96 | 2048 |
| GPT-4 | ~1.8T (est.) | ? | ? | ? | 32K+ |
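
As a quick sanity check, the sketch below applies the 12 × n_layer × d_model² approximation to a few rows of the table. It counts only the transformer blocks, so small models such as GPT-2 Small come out low because token and position embeddings are a large fraction of their total parameter count.

```python
def approx_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count:
    attention (4*d^2) + MLP (8*d^2) per layer."""
    return 12 * n_layer * d_model ** 2

# Compare against the table above (embeddings excluded)
for name, n_layer, d_model in [
    ("GPT-2 Small", 12, 768),     # table: 124M total (incl. ~39M embedding params)
    ("GPT-2 XL",    48, 1600),    # table: 1.5B
    ("GPT-3",       96, 12288),   # table: 175B
]:
    print(f"{name}: ~{approx_params(n_layer, d_model):,} parameters")
```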

Scaling Laws: The Basics

Research from OpenAI (Kaplan et al., 2020) revealed power laws governing language model performance across multiple orders of magnitude:

Loss vs Parameters

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Where:

  • $L$: Cross-entropy loss
  • $N$: Number of (non-embedding) parameters
  • $N_c$: Constant ($\approx 8.8 \times 10^{13}$)
  • $\alpha_N$: Exponent ($\approx 0.076$)

Key insight: Doubling model size improves loss by a constant factor. This relationship holds from 10M to 100B+ parameters.
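
To make the constant-factor claim concrete, compare the predicted loss at $2N$ and $N$ parameters:

$$\frac{L(2N)}{L(N)} = \frac{(N_c / 2N)^{\alpha_N}}{(N_c / N)^{\alpha_N}} = 2^{-\alpha_N} \approx 2^{-0.076} \approx 0.95$$

Each doubling of model size cuts the loss by roughly 5%, regardless of where on the curve you start.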

Loss vs Data

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$

Where:

  • $D$: Number of training tokens
  • $D_c$: Constant ($\approx 5.4 \times 10^{13}$)
  • $\alpha_D$: Exponent ($\approx 0.095$)

Key insight: More data consistently improves performance. No saturation has been observed—models continue improving with more training data.

Loss vs Compute

$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

Where:

  • $C$: Total compute budget (in FLOPs)
  • $C_c$: Constant
  • $\alpha_C$: Exponent ($\approx 0.050$)

Key insight: Compute budget determines achievable performance. Given a fixed compute budget, you must decide how to allocate it between model size and training duration.
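
As a minimal sketch, the snippet below evaluates the parameter and data laws using the approximate Kaplan constants quoted above (the compute law is omitted because its constant depends on the units used). These are the separate single-variable fits, valid only when the other factor is not the bottleneck.

```python
# Approximate Kaplan et al. (2020) single-variable power-law fits
N_C, ALPHA_N = 8.8e13, 0.076   # parameter law
D_C, ALPHA_D = 5.4e13, 0.095   # data law

def loss_vs_params(n: float) -> float:
    """Predicted loss when model size is the only bottleneck."""
    return (N_C / n) ** ALPHA_N

def loss_vs_data(d: float) -> float:
    """Predicted loss when training tokens are the only bottleneck."""
    return (D_C / d) ** ALPHA_D

# GPT-2-XL-scale vs GPT-3-scale models, and 10B vs 300B training tokens
print(loss_vs_params(1.5e9), loss_vs_params(175e9))  # ~2.30 vs ~1.60 nats
print(loss_vs_data(10e9), loss_vs_data(300e9))       # ~2.26 vs ~1.64 nats
```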

The Chinchilla Result

The Chinchilla paper (Hoffmann et al., 2022) revised scaling laws, showing that most large language models were undertrained:

Original Approach (GPT-3 Era)

The prevailing wisdom was to train large models on relatively few tokens:

  • GPT-3: 175B parameters, 300B tokens
  • Gopher: 280B parameters, 300B tokens
  • Megatron-Turing NLG: 530B parameters, 270B tokens

Compute-Optimal Approach (Chinchilla)

DeepMind’s analysis showed that compute should be balanced between model size and training data:

  • Chinchilla: 70B parameters, 1.4T tokens
  • Result: Used the same training compute as Gopher but, with 4× fewer parameters and 4× more data, outperformed Gopher, GPT-3, and Megatron-Turing NLG on most benchmarks
  • Conclusion: Previous models were too large and trained on too little data

The Compute-Optimal Rule

For optimal compute efficiency:

For a fixed compute budget, total training compute is approximately $C \approx 6 \, N_{\text{params}} \times n_{\text{tokens}}$ (see Training Time below), i.e.

$$N_{\text{params}} \approx \frac{C}{6 \times n_{\text{tokens}}}$$

and the Chinchilla fits put the optimal ratio at roughly 20 training tokens per parameter ($n_{\text{tokens}} \approx 20 \, N_{\text{params}}$). Combining the two gives $N_{\text{params}} \approx \sqrt{C / 120}$.

Example:

Compute budget: 1e20 FLOPs
Optimal split:
  - Model size: ~0.9B parameters (√(1e20 / 120) ≈ 9 × 10⁸)
  - Training tokens: ~18B tokens (20 × 0.9B)

Practical implication: If you have limited compute, it’s better to train a smaller model on more data than a larger model on less data. This is especially relevant for domain-specific models with limited data.
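
A small sketch of the allocation arithmetic, assuming the $C \approx 6ND$ cost model and the ~20 tokens-per-parameter ratio (both are rules of thumb, not exact fits):

```python
import math

def chinchilla_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) using C ~ 6*N*D and D ~ 20*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_split(1e20)
print(f"~{n / 1e9:.1f}B parameters, ~{d / 1e9:.0f}B tokens")  # ~0.9B params, ~18B tokens
```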

Emergent Abilities

As models scale, they develop emergent abilities—capabilities that appear suddenly at certain scales rather than gradually improving:

Examples of Emergent Abilities

Few-shot learning (appears at ~1B parameters):

  • Learn from just a few examples in context
  • No gradient updates or fine-tuning needed
  • Example: GPT-3 can learn new tasks from 5-10 examples

Chain-of-thought reasoning (appears at ~100B parameters):

  • Solve multi-step problems by breaking them down
  • Show intermediate reasoning steps
  • Example: “Let’s think step by step…” prompting

Instruction following (strengthens with scale):

  • Follow complex natural language instructions
  • Generalize to unseen task types
  • Example: InstructGPT, ChatGPT capabilities

The Scaling Hypothesis

“Scaling up language models improves performance across nearly all NLP tasks, with no signs of saturation.”

This hypothesis has held remarkably well from 10M to 1T+ parameters, suggesting that even larger models will continue to improve.

Practical Scaling Considerations

1. Memory Requirements

Model memory (parameters in FP32): $\text{Memory} = 4N$ bytes

Example: GPT-2 (124M params) = 496 MB

Training memory (including gradients and Adam optimizer states): $\text{Memory} \approx 16N$ bytes

  • 4 bytes: model parameters (FP32)
  • 4 bytes: gradients
  • 8 bytes: optimizer states (Adam: momentum + variance)

Example: GPT-2 training ≈ 2 GB

With mixed precision (FP16 weights and gradients, FP32 Adam states), this drops to roughly $12N$ bytes; keeping an additional FP32 master copy of the weights brings it back up to about $16N$ bytes.
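
The back-of-the-envelope estimator below applies these bytes-per-parameter figures; activations and framework buffers, which often dominate in practice, are not included.

```python
def memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Rough parameter/gradient/optimizer memory, ignoring activations."""
    return n_params * bytes_per_param / 1e9

n = 124_000_000  # GPT-2 Small
print(f"inference (FP32 weights):  {memory_gb(n, 4):.2f} GB")   # ~0.50 GB
print(f"training (FP32 + Adam):    {memory_gb(n, 16):.2f} GB")  # ~1.98 GB
print(f"mixed precision + Adam:    {memory_gb(n, 12):.2f} GB")  # ~1.49 GB
```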

2. Training Time

FLOPs per token (forward + backward pass): $C \approx 6N$ FLOPs

Where $N$ is the number of parameters.

Total training FLOPs: $C_{\text{total}} = 6ND$

Where $D$ is the number of training tokens.

Example: GPT-3

  • Parameters: 175B
  • Training tokens: 300B
  • Total FLOPs: $6 \times 175\text{B} \times 300\text{B} = 3.15 \times 10^{23}$ FLOPs
  • On an A100 GPU (312 TFLOPS peak): ~32 GPU-years at full utilization
  • Cost: ~$5-10M at cloud GPU rates
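
The same arithmetic as a sketch, assuming a 312 TFLOPS A100 running at 100% utilization (real training runs typically achieve 30-50%, so wall-clock estimates should be scaled up accordingly):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
A100_PEAK_FLOPS = 312e12  # dense BF16/FP16 tensor-core peak

def training_cost(n_params: float, n_tokens: float) -> tuple[float, float]:
    """Return (total training FLOPs, A100 GPU-years at peak throughput)."""
    total_flops = 6 * n_params * n_tokens
    gpu_years = total_flops / A100_PEAK_FLOPS / SECONDS_PER_YEAR
    return total_flops, gpu_years

flops, years = training_cost(175e9, 300e9)
print(f"{flops:.2e} FLOPs, ~{years:.0f} A100-years")  # 3.15e+23 FLOPs, ~32 A100-years
```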

3. Inference Cost

Inference is much cheaper than training:

  • Linear in sequence length: Each token requires one forward pass
  • Constant per token with KV caching: Reuse attention keys/values
  • Batch multiple requests: Amortize overhead across batch

Example:

  • GPT-3 inference: ~1-2 ms per token on A100
  • Cost: ~$0.0001-0.0002 per 1K tokens
  • Much more affordable than training
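
The ~1-2 ms/token figure can be sanity-checked with the rough rule that a forward pass costs about $2N$ FLOPs per token (one multiply and one add per parameter). The sketch below is compute-bound only; at small batch sizes memory bandwidth usually dominates, and the 50% utilization figure is an assumption.

```python
A100_PEAK_FLOPS = 312e12  # dense BF16/FP16 tensor-core peak

def ms_per_token(n_params: float, utilization: float = 0.5) -> float:
    """Compute-bound latency estimate: forward pass ~ 2*N FLOPs per token."""
    return 2 * n_params / (A100_PEAK_FLOPS * utilization) * 1e3

print(f"~{ms_per_token(175e9):.1f} ms/token")  # ~2.2 ms at 50% utilization
```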

Scaling for Healthcare

Healthcare presents unique scaling challenges that differ from web-scale LM training:

Challenges

Limited data:

  • Medical data is orders of magnitude scarcer than web text
  • Privacy regulations (HIPAA, GDPR) limit data sharing and aggregation
  • Domain-specific: can’t easily transfer from general web data
  • Example: MIMIC-III contains ~2M clinical notes and MIMIC-IV adds 269,573 ED notes; this is still tiny compared to the roughly 500B tokens behind GPT-3's training corpus

Smaller model scale:

  • Healthcare LMs are typically 10-100M parameters
  • ClinicalBERT: 110M parameters (BERT-base scale)
  • BioClinicalBERT: 110M parameters
  • Med-PaLM: 540B parameters (exception, not typical)

Different scaling dynamics:

  • Performance plateaus faster due to limited data
  • Overfitting risk is higher
  • Domain vocabulary is more important than model size

Compute-Optimal Strategy for Healthcare

Given limited healthcare data, the Chinchilla principle applies even more:

  1. Train smaller models on more passes of data

    • Better: 50M parameters trained for 100 epochs
    • Worse: 500M parameters trained for 10 epochs
  2. Use domain-specific tokenization

    • Medical vocabulary: ICD codes, SNOMED-CT, RxNorm
    • Reduces effective sequence length
    • More efficient parameter usage
  3. Focus on data quality over quantity

    • Clean, well-structured data matters more
    • Expert annotations are valuable
    • Demographic diversity is critical
  4. Transfer learning from general models

    • Start with pre-trained BERT/GPT
    • Fine-tune on medical data
    • Achieves good performance with less data
  5. Data augmentation

    • Synonym replacement with medical thesaurus
    • Back-translation
    • Synthetic data generation
    • Masked prediction for pre-training

Example: Clinical note prediction

```python
# Compute-optimal sizing for a healthcare LM
model_size = 50_000_000            # 50M parameters
available_tokens = 10_000_000_000  # 10B tokens (MIMIC + other sources)

# Chinchilla rule of thumb: ~20 training tokens per parameter
optimal_tokens = 20 * model_size   # 1B tokens for a 50M-parameter model

# The corpus covers the compute-optimal token budget ten times over,
# leaving headroom for extra epochs (or a somewhat larger model)
budget_multiple = available_tokens // optimal_tokens  # 10

# A 50M-parameter model is a compute-optimal starting point at this scale
```

Key Scaling Insights

  1. Bigger is usually better: Larger models consistently outperform smaller ones when trained on sufficient data

  2. Data matters more than previously thought: The Chinchilla result shows most models are undertrained. Training longer often beats scaling up.

  3. Power laws are robust: Predictable performance improvements across 5+ orders of magnitude

  4. Emergent abilities exist: Qualitative changes happen at certain scales (few-shot learning, reasoning)

  5. Compute is the bottleneck: Both model size and training duration are limited by compute budget

  6. Healthcare is different: Limited data means smaller models trained longer, with focus on domain knowledge

Extrapolating Future Capabilities

Using scaling laws, we can predict future model capabilities:

Near-term (2024-2025):

  • Models approaching 10T parameters
  • Trained on 100-500T tokens
  • Continued improvement on reasoning tasks
  • More consistent few-shot learning

Economic limits:

  • Training cost is increasing roughly exponentially with each model generation
  • GPT-4 scale: ~$100M training cost (estimated)
  • Future models: $1B+ training cost possible
  • Eventually hits economic feasibility limits

Caveats:

  • Scaling laws may break at some point
  • Architectural innovations can shift the curves (better loss at the same compute)
  • Efficient training methods can improve constants
  • Data quality and curation becoming more important

Practical Recommendations

For your projects:

  1. Start small: Begin with 10-50M parameters

    • Iterate quickly
    • Validate approach before scaling
    • Easier to debug
  2. Maximize data usage: Use all available training data

    • Multiple epochs are okay (Chinchilla!)
    • Data augmentation helps
    • Don’t leave data on the table
  3. Scale compute-optimally: Balance model size and tokens using Chinchilla rule

    • 20 tokens per parameter
    • Prefer longer training over larger models
    • Monitor loss curves
  4. Monitor scaling efficiency: Track loss vs compute to ensure you’re on the power law

    • If efficiency drops, investigate
    • May indicate bugs or suboptimal hyperparameters
  5. Consider inference cost: Sometimes a smaller, well-trained model is better

    • 10× smaller = 10× faster inference
    • Distillation can compress models
    • Edge deployment favors smaller models

Further Reading