Language Model Scaling Laws
Understanding how model performance scales with size, data, and compute is crucial for making optimal training decisions. This guide covers the empirical power laws discovered by OpenAI, the compute-optimal Chinchilla result, and practical implications for training language models.
Model Size Calculation
The number of parameters in a transformer scales approximately as:

N ≈ 12 × n_layers × d_model²

This comes from:
- Attention layers: 4 × d_model² per layer (Q, K, V, output projections)
- MLP layers: 8 × d_model² per layer (two linear layers with 4× hidden dimension expansion)
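As a rough check of this rule, the sketch below estimates GPT-2 Small's parameter count from the configuration listed in the table below, adding token and position embeddings on top of the 12 × n_layers × d_model² transformer-block term (layer norms and biases are ignored):

```python
# Rough transformer parameter count: 12 * n_layers * d_model^2 for the blocks,
# plus token and position embeddings.
def estimate_params(n_layers, d_model, vocab_size, context_len):
    attention = 4 * d_model**2          # Q, K, V, output projections
    mlp = 8 * d_model**2                # two linear layers with 4x expansion
    blocks = n_layers * (attention + mlp)
    embeddings = vocab_size * d_model + context_len * d_model
    return blocks + embeddings

# GPT-2 Small: 12 layers, d_model=768, vocab 50257, context 1024
print(f"{estimate_params(12, 768, 50257, 1024) / 1e6:.0f}M")  # ~124M parameters
```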
GPT Model Sizes
| Model | Parameters | Layers | d_model | Heads | Context |
|---|---|---|---|---|---|
| NanoGPT | ~10M | 6 | 384 | 6 | 256 |
| GPT-2 Small | 124M | 12 | 768 | 12 | 1024 |
| GPT-2 Medium | 355M | 24 | 1024 | 16 | 1024 |
| GPT-2 Large | 774M | 36 | 1280 | 20 | 1024 |
| GPT-2 XL | 1.5B | 48 | 1600 | 25 | 1024 |
| GPT-3 | 175B | 96 | 12288 | 96 | 2048 |
| GPT-4 | ~1.8T (est) | ? | ? | ? | 32K+ |
Scaling Laws: The Basics
Research from OpenAI (Kaplan et al., 2020) revealed power laws governing language model performance across multiple orders of magnitude:
Loss vs Parameters

L(N) = (N_c / N)^α_N

Where:
- L: Cross-entropy loss (nats per token)
- N: Number of parameters (non-embedding, in Kaplan et al.)
- N_c: Constant (~8.8 × 10^13)
- α_N: Exponent (~0.076)

Key insight: Every doubling of model size reduces loss by a constant multiplicative factor (2^−0.076 ≈ 0.95, roughly a 5% reduction). This relationship holds from ~10M to 100B+ parameters.
Loss vs Data

L(D) = (D_c / D)^α_D

Where:
- D: Number of training tokens
- D_c: Constant (~5.4 × 10^13)
- α_D: Exponent (~0.095)

Key insight: More data consistently improves performance. No saturation has been observed; models continue improving with more training data.
Loss vs Compute

L(C) = (C_c / C)^α_C

Where:
- C: Total compute budget (in FLOPs), allocated compute-optimally between model size and data
- C_c: Constant (~3.1 × 10^8 in Kaplan et al.'s units of PF-days)
- α_C: Exponent (~0.050)
Key insight: Compute budget determines achievable performance. Given a fixed compute budget, you must decide how to allocate it between model size and training duration.
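These curves are easy to play with numerically. A small sketch evaluating the parameter and data laws with the approximate constants quoted above (illustrative only; real losses depend on tokenizer, data, and architecture):

```python
# Evaluate the Kaplan et al. (2020) power laws with the approximate constants above.
ALPHA_N, N_C = 0.076, 8.8e13   # loss vs (non-embedding) parameters
ALPHA_D, D_C = 0.095, 5.4e13   # loss vs training tokens

def loss_vs_params(n):
    return (N_C / n) ** ALPHA_N

def loss_vs_data(d):
    return (D_C / d) ** ALPHA_D

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}: L(N) ~ {loss_vs_params(n):.2f} nats/token")

# Doubling the model size multiplies loss by 2**-0.076 ~ 0.949 (a ~5% reduction)
print(f"loss ratio per doubling: {2 ** -ALPHA_N:.3f}")
```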
The Chinchilla Result
The Chinchilla paper (Hoffmann et al., 2022) revised scaling laws, showing that most large language models were undertrained:
Original Approach (GPT-3 Era)
The prevailing wisdom was to train large models on relatively few tokens:
- GPT-3: 175B parameters, 300B tokens
- Gopher: 280B parameters, 300B tokens
- Megatron-Turing NLG: 530B parameters, 300B tokens
Compute-Optimal Approach (Chinchilla)
DeepMind’s analysis showed that compute should be balanced between model size and training data:
- Chinchilla: 70B parameters, 1.4T tokens
- Result: Outperformed Gopher (and GPT-3) across a wide range of benchmarks, despite having 4× fewer parameters than Gopher and using the same training compute
- Conclusion: Previous models were too large and trained on too little data
The Compute-Optimal Rule

For a compute budget C, Chinchilla's fits give roughly:

N_opt ∝ C^0.5, D_opt ∝ C^0.5, with D_opt ≈ 20 × N_opt

Or equivalently: Train on ~20 tokens per parameter.

Example:

Compute budget: ~4 × 10^21 FLOPs (using C ≈ 6 × N × D)

Optimal split:
- Model size: ~6B parameters
- Training tokens: ~120B tokens (20 × 6B)

Practical implication: If you have limited compute, it's better to train a smaller model on more data than a larger model on less data. This is especially relevant for domain-specific models with limited data.
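A minimal sketch of this allocation, assuming the common C ≈ 6·N·D approximation and the 20-tokens-per-parameter rule of thumb (the helper name is just for illustration):

```python
import math

def chinchilla_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters and tokens,
    assuming C ~ 6 * N * D and D ~ tokens_per_param * N."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_split(4e21)
print(f"~{n / 1e9:.1f}B parameters, ~{d / 1e9:.0f}B tokens")  # ~5.8B parameters, ~115B tokens
```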
Emergent Abilities
As models scale, they develop emergent abilities—capabilities that appear suddenly at certain scales rather than gradually improving:
Examples of Emergent Abilities
Few-shot learning (appears ~1B parameters):
- Learn from just a few examples in context
- No gradient updates or fine-tuning needed
- Example: GPT-3 can learn new tasks from 5-10 examples
Chain-of-thought reasoning (appears ~100B parameters):
- Solve multi-step problems by breaking them down
- Show intermediate reasoning steps
- Example: “Let’s think step by step…” prompting
Instruction following (strengthens with scale):
- Follow complex natural language instructions
- Generalize to unseen task types
- Example: InstructGPT, ChatGPT capabilities
The Scaling Hypothesis
“Scaling up language models improves performance across nearly all NLP tasks, with no signs of saturation.”
This hypothesis has held remarkably well from 10M to 1T+ parameters, suggesting that even larger models will continue to improve.
Practical Scaling Considerations
1. Memory Requirements
Model memory (parameters in FP32):

Memory ≈ 4 bytes × N

Example: GPT-2 (124M params) × 4 bytes ≈ 496 MB

Training memory (parameters, gradients, and Adam optimizer states) ≈ 16 bytes × N:
- 4 bytes: model parameters (FP32)
- 4 bytes: gradients
- 8 bytes: optimizer states (Adam: momentum + variance)

Example: GPT-2 training ≈ 124M × 16 bytes ≈ 2 GB (excluding activations)
With mixed precision (FP16), you can reduce this to ~12N bytes.
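A quick sketch of these estimates (FP32 training with Adam; activation memory, which depends on batch size and sequence length, is not included):

```python
def model_memory_gb(n_params, bytes_per_param=4):
    """Parameter memory only (FP32 by default)."""
    return n_params * bytes_per_param / 1e9

def adam_training_memory_gb(n_params):
    """FP32 params (4) + gradients (4) + Adam momentum and variance (8) = 16 bytes/param."""
    return n_params * 16 / 1e9

gpt2_small = 124e6
print(f"GPT-2 weights:  ~{model_memory_gb(gpt2_small):.2f} GB")       # ~0.50 GB
print(f"GPT-2 training: ~{adam_training_memory_gb(gpt2_small):.1f} GB")  # ~2.0 GB
```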
2. Training Time
FLOPs per token (forward + backward pass):

FLOPs/token ≈ 6 × N

Where N is the number of parameters.

Total training FLOPs:

C ≈ 6 × N × D

Where D is the number of training tokens.
Example: GPT-3
- Parameters: 175B
- Training tokens: 300B
- Total FLOPs: 6 × 175e9 × 300e9 ≈ 3.15 × 10^23 FLOPs
- On an A100 GPU (312 TFLOPS peak): ~32 GPU-years at theoretical peak throughput (more in practice)
- Cost: ~$5-10M at cloud GPU rates
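The same arithmetic as a short sketch; GPU-years at theoretical peak are a lower bound, since real utilization is well below peak:

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def gpu_years(total_flops, peak_flops_per_gpu=312e12):
    """GPU-years at theoretical peak throughput (e.g. A100 BF16 ~312 TFLOPS)."""
    seconds = total_flops / peak_flops_per_gpu
    return seconds / (365 * 24 * 3600)

c = training_flops(175e9, 300e9)
print(f"GPT-3: ~{c:.2e} FLOPs, ~{gpu_years(c):.0f} A100-years at peak")  # ~3.15e+23 FLOPs, ~32 years
```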
3. Inference Cost
Inference is much cheaper than training:
- Linear in sequence length: Each token requires one forward pass
- Constant per token with KV caching: Reuse attention keys/values
- Batch multiple requests: Amortize overhead across batch
Example:
- GPT-3-scale inference: on the order of tens of milliseconds per generated token on a multi-GPU A100 server
- Cost per 1K generated tokens: fractions of a cent to a few cents, depending on hardware, batching, and utilization
- Much more affordable than training
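To make the KV-caching point concrete, here is a rough sketch of KV-cache size for a GPT-2-Small-sized model (FP16 cache assumed; layer count and d_model taken from the table above):

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch_size=1, bytes_per_value=2):
    """KV cache: keys and values (2 tensors) of shape [seq_len, d_model] per layer, per sequence."""
    return 2 * n_layers * d_model * seq_len * batch_size * bytes_per_value

# GPT-2 Small (12 layers, d_model=768) at its full 1024-token context, FP16
print(f"~{kv_cache_bytes(12, 768, 1024) / 1e6:.0f} MB per sequence")  # ~38 MB
```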
Scaling for Healthcare
Healthcare presents unique scaling challenges that differ from web-scale LM training:
Challenges
Limited data:
- Medical data is orders of magnitude scarcer than web text
- Privacy regulations (HIPAA, GDPR) limit data sharing and aggregation
- Domain-specific: can’t easily transfer from general web data
- Example: MIMIC-III has 2M clinical notes; MIMIC-IV adds 269,573 ED notes, still tiny compared to GPT-3’s 500B tokens
Smaller model scale:
- Healthcare LMs are typically 10-100M parameters
- ClinicalBERT: 110M parameters (BERT-base scale)
- BioClinicalBERT: 110M parameters
- Med-PaLM: 540B parameters (exception, not typical)
Different scaling dynamics:
- Performance plateaus faster due to limited data
- Overfitting risk is higher
- Domain vocabulary is more important than model size
Compute-Optimal Strategy for Healthcare
Given limited healthcare data, the Chinchilla principle applies even more:
1. Train smaller models on more passes of data
   - Better to have 50M parameters trained for 100 epochs
   - Than 500M parameters trained for 10 epochs
2. Use domain-specific tokenization
   - Medical vocabulary: ICD codes, SNOMED-CT, RxNorm
   - Reduces effective sequence length
   - More efficient parameter usage
3. Focus on data quality over quantity
   - Clean, well-structured data matters more
   - Expert annotations are valuable
   - Demographic diversity is critical
4. Transfer learning from general models
   - Start with pre-trained BERT/GPT
   - Fine-tune on medical data
   - Achieves good performance with less data
5. Data augmentation
   - Synonym replacement with a medical thesaurus
   - Back-translation
   - Synthetic data generation
   - Masked prediction for pre-training
Example: Clinical note prediction

```python
# Compute-optimal healthcare LM (illustrative sizing)
model_size = 50_000_000            # 50M parameters
unique_tokens = 1_000_000_000      # ~1B unique clinical tokens in the corpus
token_budget = 10_000_000_000      # 10B tokens of training budget (MIMIC + other sources)

# Chinchilla rule of thumb: ~20 training tokens per parameter
optimal_tokens = 20 * model_size   # 1B tokens, i.e. at least one full pass over the corpus

# Spending the full token budget means multiple passes over the limited data
num_epochs = token_budget // unique_tokens  # 10 epochs
```

Keeping the model small means repeated epochs over a limited corpus, rather than a single pass over web-scale data, can supply the recommended token count at healthcare scale.

Key Scaling Insights
1. Bigger is usually better: Larger models consistently outperform smaller ones when trained on sufficient data
2. Data matters more than thought: The Chinchilla result shows most models are undertrained. Training longer often beats scaling up.
3. Power laws are robust: Predictable performance improvements across 5+ orders of magnitude
4. Emergent abilities exist: Qualitative changes happen at certain scales (few-shot learning, reasoning)
5. Compute is the bottleneck: Both model size and training duration are limited by compute budget
6. Healthcare is different: Limited data means smaller models trained longer, with a focus on domain knowledge
Extrapolating Future Capabilities
Using scaling laws, we can predict future model capabilities:
Near-term (2024-2025):
- Models approaching 10T parameters
- Trained on 100-500T tokens
- Continued improvement on reasoning tasks
- More consistent few-shot learning
Economic limits:
- Training cost is exponentially increasing
- GPT-4 scale: ~$100M training cost (estimated)
- Future models: $1B+ training cost possible
- Eventually hits economic feasibility limits
Caveats:
- Scaling laws may break at some point
- Architectural innovations can shift the curves favorably (lower loss for the same compute)
- Efficient training methods can improve the constants
- Data quality and curation are becoming more important
Practical Recommendations
For your projects:
1. Start small: Begin with 10-50M parameters
   - Iterate quickly
   - Validate approach before scaling
   - Easier to debug
2. Maximize data usage: Use all available training data
   - Multiple epochs are okay (Chinchilla!)
   - Data augmentation helps
   - Don't leave data on the table
3. Scale compute-optimally: Balance model size and tokens using the Chinchilla rule
   - 20 tokens per parameter
   - Prefer longer training over larger models
   - Monitor loss curves
4. Monitor scaling efficiency: Track loss vs compute to ensure you're on the power law (see the sketch after this list)
   - If efficiency drops, investigate
   - May indicate bugs or suboptimal hyperparameters
5. Consider inference cost: Sometimes a smaller, well-trained model is better
   - 10× smaller = 10× faster inference
   - Distillation can compress models
   - Edge deployment favors smaller models
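For recommendation 4, one way to monitor scaling efficiency is to fit a line to loss versus compute in log-log space and watch the residuals; the sketch below uses made-up placeholder numbers, not real measurements:

```python
import numpy as np

# Placeholder (compute in FLOPs, final loss) pairs from hypothetical training runs
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([4.2, 3.7, 3.3, 2.9])

# A power law L = a * C^(-alpha) is a straight line in log-log space
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope

predicted = np.exp(intercept) * compute ** slope
residuals = (loss - predicted) / predicted  # large deviations suggest bugs or bad hyperparameters

print(f"fitted exponent alpha ~ {alpha:.3f}")
print("relative deviation per run:", np.round(residuals, 3))
```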
Related Concepts
- Training Dynamics - Double descent and overparameterization theory
- Practical Training Techniques - Techniques for stable and efficient training
- Language Model Training - Training autoregressive language models
- Text Generation - Inference and generation strategies
- Healthcare Foundation Models - Pre-training strategies for medical data
Further Reading
- Scaling Laws for Neural Language Models - Original Kaplan et al. paper
- Training Compute-Optimal Large Language Models (Chinchilla) - DeepMind’s revised scaling laws
- Emergent Abilities of Large Language Models - Survey of emergent capabilities
- EleutherAI Scaling Laws Calculator - Interactive tool for scaling predictions