
Language Model Scaling Laws

Understanding how model performance scales with size, data, and compute is crucial for making optimal training decisions. This guide covers the empirical power laws discovered by OpenAI, the compute-optimal Chinchilla result, and practical implications for training language models.

Model Size Calculation

The number of parameters in a transformer scales approximately as:

$$\text{Parameters} \approx 12 \times n_{\text{layer}} \times d_{\text{model}}^2$$

This comes from:

  • Attention layers: $4 d_{\text{model}}^2$ parameters per layer (Q, K, V, and output projections)
  • MLP layers: $8 d_{\text{model}}^2$ parameters per layer (two linear layers with a 4× hidden-dimension expansion)

GPT Model Sizes

| Model | Parameters | Layers | d_model | Heads | Context |
|---|---|---|---|---|---|
| NanoGPT | ~10M | 6 | 384 | 6 | 256 |
| GPT-2 Small | 124M | 12 | 768 | 12 | 1024 |
| GPT-2 Medium | 355M | 24 | 1024 | 16 | 1024 |
| GPT-2 Large | 774M | 36 | 1280 | 20 | 1024 |
| GPT-2 XL | 1.5B | 48 | 1600 | 25 | 1024 |
| GPT-3 | 175B | 96 | 12288 | 96 | 2048 |
| GPT-4 | ~1.8T (est.) | ? | ? | ? | 32K+ |
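
As a quick sanity check, the sketch below applies the 12 × n_layer × d_model² approximation to a few rows of the table. It counts only the transformer blocks, so small models such as GPT-2 Small come out low because token and position embeddings are a large fraction of their total parameter count.

```python
def approx_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count:
    attention (4*d^2) + MLP (8*d^2) per layer."""
    return 12 * n_layer * d_model ** 2

# Compare against the table above (embeddings excluded)
for name, n_layer, d_model in [
    ("GPT-2 Small", 12, 768),     # table: 124M total (incl. ~39M embedding params)
    ("GPT-2 XL",    48, 1600),    # table: 1.5B
    ("GPT-3",       96, 12288),   # table: 175B
]:
    print(f"{name}: ~{approx_params(n_layer, d_model):,} parameters")
```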

Scaling Laws: The Basics

Research from OpenAI (Kaplan et al., 2020) revealed power laws governing language model performance across multiple orders of magnitude:

Loss vs Parameters

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Where:

  • $L$: Cross-entropy loss
  • $N$: Number of (non-embedding) parameters
  • $N_c$: Constant ($\approx 8.8 \times 10^{13}$)
  • $\alpha_N$: Exponent ($\approx 0.076$)

Key insight: Doubling model size improves loss by a constant factor. This relationship holds from 10M to 100B+ parameters.
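
To make the constant-factor claim concrete, compare the predicted loss at $2N$ and $N$ parameters:

$$\frac{L(2N)}{L(N)} = \frac{(N_c / 2N)^{\alpha_N}}{(N_c / N)^{\alpha_N}} = 2^{-\alpha_N} \approx 2^{-0.076} \approx 0.95$$

Each doubling of model size cuts the loss by roughly 5%, regardless of where on the curve you start.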

Loss vs Data

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$

Where:

  • $D$: Number of training tokens
  • $D_c$: Constant ($\approx 5.4 \times 10^{13}$)
  • $\alpha_D$: Exponent ($\approx 0.095$)

Key insight: More data consistently improves performance. No saturation has been observed—models continue improving with more training data.

Loss vs Compute

$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

Where:

  • $C$: Total compute budget (in FLOPs)
  • $C_c$: Constant
  • $\alpha_C$: Exponent ($\approx 0.050$)

Key insight: Compute budget determines achievable performance. Given a fixed compute budget, you must decide how to allocate it between model size and training duration.
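
As a minimal sketch, the snippet below evaluates the parameter and data laws using the approximate Kaplan constants quoted above (the compute law is omitted because its constant depends on the units used). These are the separate single-variable fits, valid only when the other factor is not the bottleneck.

```python
# Approximate Kaplan et al. (2020) single-variable power-law fits
N_C, ALPHA_N = 8.8e13, 0.076   # parameter law
D_C, ALPHA_D = 5.4e13, 0.095   # data law

def loss_vs_params(n: float) -> float:
    """Predicted loss when model size is the only bottleneck."""
    return (N_C / n) ** ALPHA_N

def loss_vs_data(d: float) -> float:
    """Predicted loss when training tokens are the only bottleneck."""
    return (D_C / d) ** ALPHA_D

# GPT-2-XL-scale vs GPT-3-scale models, and 10B vs 300B training tokens
print(loss_vs_params(1.5e9), loss_vs_params(175e9))  # ~2.30 vs ~1.60 nats
print(loss_vs_data(10e9), loss_vs_data(300e9))       # ~2.26 vs ~1.64 nats
```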

The Chinchilla Result

The Chinchilla paper (Hoffmann et al., 2022) revised scaling laws, showing that most large language models were undertrained:

Original Approach (GPT-3 Era)

The prevailing wisdom was to train large models on relatively few tokens:

  • GPT-3: 175B parameters, 300B tokens
  • Gopher: 280B parameters, 300B tokens
  • Megatron-Turing NLG: 530B parameters, 270B tokens

Compute-Optimal Approach (Chinchilla)

DeepMind’s analysis showed that compute should be balanced between model size and training data:

  • Chinchilla: 70B parameters, 1.4T tokens
  • Result: Used the same training compute as Gopher but, with 4× fewer parameters and 4× more data, outperformed Gopher, GPT-3, and Megatron-Turing NLG on most benchmarks
  • Conclusion: Previous models were too large and trained on too little data

The Compute-Optimal Rule

For optimal compute efficiency:

For a fixed compute budget, total training compute is approximately $C \approx 6 \, N_{\text{params}} \times n_{\text{tokens}}$ (see Training Time below), i.e.

$$N_{\text{params}} \approx \frac{C}{6 \times n_{\text{tokens}}}$$

and the Chinchilla fits put the optimal ratio at roughly 20 training tokens per parameter ($n_{\text{tokens}} \approx 20 \, N_{\text{params}}$). Combining the two gives $N_{\text{params}} \approx \sqrt{C / 120}$.

Example:

Compute budget: 1e20 FLOPs
Optimal split:
  - Model size: ~0.9B parameters (√(1e20 / 120) ≈ 9 × 10⁸)
  - Training tokens: ~18B tokens (20 × 0.9B)

Practical implication: If you have limited compute, it’s better to train a smaller model on more data than a larger model on less data. This is especially relevant for domain-specific models with limited data.
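
A small sketch of the allocation arithmetic, assuming the $C \approx 6ND$ cost model and the ~20 tokens-per-parameter ratio (both are rules of thumb, not exact fits):

```python
import math

def chinchilla_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) using C ~ 6*N*D and D ~ 20*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_split(1e20)
print(f"~{n / 1e9:.1f}B parameters, ~{d / 1e9:.0f}B tokens")  # ~0.9B params, ~18B tokens
```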

Emergent Abilities

As models scale, they develop emergent abilities—capabilities that appear suddenly at certain scales rather than gradually improving:

Examples of Emergent Abilities

Few-shot learning (appears at ~1B parameters):

  • Learn from just a few examples in context
  • No gradient updates or fine-tuning needed
  • Example: GPT-3 can learn new tasks from 5-10 examples

Chain-of-thought reasoning (appears at ~100B parameters):

  • Solve multi-step problems by breaking them down
  • Show intermediate reasoning steps
  • Example: “Let’s think step by step…” prompting

Instruction following (strengthens with scale):

  • Follow complex natural language instructions
  • Generalize to unseen task types
  • Example: InstructGPT, ChatGPT capabilities

The Scaling Hypothesis

“Scaling up language models improves performance across nearly all NLP tasks, with no signs of saturation.”

This hypothesis has held remarkably well from 10M to 1T+ parameters, suggesting that even larger models will continue to improve.

Practical Scaling Considerations

1. Memory Requirements

Model memory (parameters in FP32): $\text{Memory} = 4N$ bytes

Example: GPT-2 (124M params) = 496 MB

Training memory (including gradients and Adam optimizer states): $\text{Memory} \approx 16N$ bytes

  • 4 bytes: model parameters (FP32)
  • 4 bytes: gradients
  • 8 bytes: optimizer states (Adam: momentum + variance)

Example: GPT-2 training ≈ 2 GB

With mixed precision (FP16 weights and gradients, FP32 Adam states), this drops to roughly $12N$ bytes; keeping an additional FP32 master copy of the weights brings it back up to about $16N$ bytes.
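
The back-of-the-envelope estimator below applies these bytes-per-parameter figures; activations and framework buffers, which often dominate in practice, are not included.

```python
def memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Rough parameter/gradient/optimizer memory, ignoring activations."""
    return n_params * bytes_per_param / 1e9

n = 124_000_000  # GPT-2 Small
print(f"inference (FP32 weights):  {memory_gb(n, 4):.2f} GB")   # ~0.50 GB
print(f"training (FP32 + Adam):    {memory_gb(n, 16):.2f} GB")  # ~1.98 GB
print(f"mixed precision + Adam:    {memory_gb(n, 12):.2f} GB")  # ~1.49 GB
```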

2. Training Time

FLOPs per token (forward + backward pass): $C \approx 6N$ FLOPs

Where $N$ is the number of parameters.

Total training FLOPs: $C_{\text{total}} = 6ND$

Where $D$ is the number of training tokens.

Example: GPT-3

  • Parameters: 175B
  • Training tokens: 300B
  • Total FLOPs: $6 \times 175\text{B} \times 300\text{B} = 3.15 \times 10^{23}$ FLOPs
  • On an A100 GPU (312 TFLOPS peak): ~32 GPU-years at full utilization
  • Cost: ~$5-10M at cloud GPU rates
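
The same arithmetic as a sketch, assuming a 312 TFLOPS A100 running at 100% utilization (real training runs typically achieve 30-50%, so wall-clock estimates should be scaled up accordingly):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
A100_PEAK_FLOPS = 312e12  # dense BF16/FP16 tensor-core peak

def training_cost(n_params: float, n_tokens: float) -> tuple[float, float]:
    """Return (total training FLOPs, A100 GPU-years at peak throughput)."""
    total_flops = 6 * n_params * n_tokens
    gpu_years = total_flops / A100_PEAK_FLOPS / SECONDS_PER_YEAR
    return total_flops, gpu_years

flops, years = training_cost(175e9, 300e9)
print(f"{flops:.2e} FLOPs, ~{years:.0f} A100-years")  # 3.15e+23 FLOPs, ~32 A100-years
```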

3. Inference Cost

Inference is much cheaper than training:

  • Linear in sequence length: Each token requires one forward pass
  • Constant per token with KV caching: Reuse attention keys/values
  • Batch multiple requests: Amortize overhead across batch

Example:

  • GPT-3 inference: ~1-2 ms per token on A100
  • Cost: ~$0.0001-0.0002 per 1K tokens
  • Much more affordable than training
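
The ~1-2 ms/token figure can be sanity-checked with the rough rule that a forward pass costs about $2N$ FLOPs per token (one multiply and one add per parameter). The sketch below is compute-bound only; at small batch sizes memory bandwidth usually dominates, and the 50% utilization figure is an assumption.

```python
A100_PEAK_FLOPS = 312e12  # dense BF16/FP16 tensor-core peak

def ms_per_token(n_params: float, utilization: float = 0.5) -> float:
    """Compute-bound latency estimate: forward pass ~ 2*N FLOPs per token."""
    return 2 * n_params / (A100_PEAK_FLOPS * utilization) * 1e3

print(f"~{ms_per_token(175e9):.1f} ms/token")  # ~2.2 ms at 50% utilization
```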

Scaling for Healthcare

Healthcare presents unique scaling challenges that differ from web-scale LM training:

Challenges

Limited data:

  • Medical data is orders of magnitude scarcer than web text
  • Privacy regulations (HIPAA, GDPR) limit data sharing and aggregation
  • Domain-specific: can’t easily transfer from general web data
  • Example: MIMIC-III contains ~2M clinical notes and MIMIC-IV adds 269,573 ED notes; this is still tiny compared to the roughly 500B tokens behind GPT-3's training corpus

Smaller model scale:

  • Healthcare LMs are typically 10-100M parameters
  • ClinicalBERT: 110M parameters (BERT-base scale)
  • BioClinicalBERT: 110M parameters
  • Med-PaLM: 540B parameters (exception, not typical)

Different scaling dynamics:

  • Performance plateaus faster due to limited data
  • Overfitting risk is higher
  • Domain vocabulary is more important than model size

Compute-Optimal Strategy for Healthcare

Given limited healthcare data, the Chinchilla principle applies even more:

  1. Train smaller models on more passes of data

    • Better: 50M parameters trained for 100 epochs
    • Worse: 500M parameters trained for 10 epochs
  2. Use domain-specific tokenization

    • Medical vocabulary: ICD codes, SNOMED-CT, RxNorm
    • Reduces effective sequence length
    • More efficient parameter usage
  3. Focus on data quality over quantity

    • Clean, well-structured data matters more
    • Expert annotations are valuable
    • Demographic diversity is critical
  4. Transfer learning from general models

    • Start with pre-trained BERT/GPT
    • Fine-tune on medical data
    • Achieves good performance with less data
  5. Data augmentation

    • Synonym replacement with medical thesaurus
    • Back-translation
    • Synthetic data generation
    • Masked prediction for pre-training

Example: Clinical note prediction

```python
# Compute-optimal sizing for a healthcare LM
model_size = 50_000_000            # 50M parameters
available_tokens = 10_000_000_000  # 10B tokens (MIMIC + other sources)

# Chinchilla rule of thumb: ~20 training tokens per parameter
optimal_tokens = 20 * model_size   # 1B tokens for a 50M-parameter model

# The corpus covers the compute-optimal token budget ten times over,
# leaving headroom for extra epochs (or a somewhat larger model)
budget_multiple = available_tokens // optimal_tokens  # 10

# A 50M-parameter model is a compute-optimal starting point at this scale
```

Key Scaling Insights

  1. Bigger is usually better: Larger models consistently outperform smaller ones when trained on sufficient data

  2. Data matters more than previously thought: The Chinchilla result shows most models are undertrained. Training longer often beats scaling up.

  3. Power laws are robust: Predictable performance improvements across 5+ orders of magnitude

  4. Emergent abilities exist: Qualitative changes happen at certain scales (few-shot learning, reasoning)

  5. Compute is the bottleneck: Both model size and training duration are limited by compute budget

  6. Healthcare is different: Limited data means smaller models trained longer, with focus on domain knowledge

Extrapolating Future Capabilities

Using scaling laws, we can predict future model capabilities:

Near-term (2024-2025):

  • Models approaching 10T parameters
  • Trained on 100-500T tokens
  • Continued improvement on reasoning tasks
  • More consistent few-shot learning

Economic limits:

  • Training cost is increasing roughly exponentially with each model generation
  • GPT-4 scale: ~$100M training cost (estimated)
  • Future models: $1B+ training cost possible
  • Eventually hits economic feasibility limits

Caveats:

  • Scaling laws may break at some point
  • Architectural innovations can shift the curves (better loss at the same compute)
  • Efficient training methods can improve constants
  • Data quality and curation becoming more important

Practical Recommendations

For your projects:

  1. Start small: Begin with 10-50M parameters

    • Iterate quickly
    • Validate approach before scaling
    • Easier to debug
  2. Maximize data usage: Use all available training data

    • Multiple epochs are okay (Chinchilla!)
    • Data augmentation helps
    • Don’t leave data on the table
  3. Scale compute-optimally: Balance model size and tokens using Chinchilla rule

    • 20 tokens per parameter
    • Prefer longer training over larger models
    • Monitor loss curves
  4. Monitor scaling efficiency: Track loss vs compute to ensure you’re on the power law

    • If efficiency drops, investigate
    • May indicate bugs or suboptimal hyperparameters
  5. Consider inference cost: Sometimes a smaller, well-trained model is better

    • 10× smaller = 10× faster inference
    • Distillation can compress models
    • Edge deployment favors smaller models

Further Reading