Scaling Laws
Large Language Models
Part 3: Advanced Topics (35 slides)
The Scaling Laws Paradox: Why is AI progress so predictable? Performance follows power laws, so the capabilities of larger models can be forecast before they are ever trained.
Prerequisites
- Week 5: Basic transformer architecture
- Week 6: Pre-trained models (BERT, GPT)
- Understanding of model parameters and computational complexity
Overview
Scaling transformers to billions of parameters: scaling laws, emergent abilities, and modern LLMs.
Learning Objectives
- Explain scaling laws and why AI progress is predictable
- Compare the three paths: bigger (GPT-3), smarter (MoE), efficient (Reformer)
- Understand emergent abilities that appear at scale
- Analyze the compute-optimal training (Chinchilla scaling)
- Evaluate trade-offs between model size, data, and compute
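The "predictable progress" claim in the first objective can be made concrete with the power-law form from Kaplan et al. (2020), L(N) = (N_c / N)^α_N. The constants below are that paper's approximate fitted values for the parameter-count law; treat the whole sketch as illustrative rather than exact.

```python
# Illustrative sketch of a parameter-count scaling law, after Kaplan et al. (2020):
# L(N) = (N_c / N) ** alpha_N, where N is the (non-embedding) parameter count.
# N_C and ALPHA_N are the paper's approximate fitted constants, quoted from memory.

N_C = 8.8e13      # fitted scale constant (parameters)
ALPHA_N = 0.076   # fitted power-law exponent

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The curve is smooth and monotone: each 10x in parameters shaves a roughly constant multiplicative factor off the loss, which is exactly why extrapolating to larger models works.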
Key Topics
Scaling laws
Emergent abilities
In-context learning
Chain-of-thought
Key Concepts
Scaling laws: Performance follows power laws in compute, data, and parameters
Emergent abilities: Capabilities that appear suddenly at certain scales
Mixture of Experts (MoE): 1.6T total parameters but only ~10B active per token
Sparse attention: Linear-complexity alternatives (Reformer, Linformer)
Chinchilla scaling: Optimal ratio of model size to training data
Compute-optimal training: Balancing FLOPs between parameters and tokens
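The Chinchilla and compute-optimal concepts above can be tied together with two widely cited rules of thumb: training compute C ≈ 6·N·D FLOPs, and a compute-optimal recipe of roughly 20 training tokens per parameter. Under those assumptions (the coefficients are fitted values and differ between analyses), a compute budget determines both model size and token count:

```python
import math

# Sketch of Chinchilla-style compute-optimal allocation.
# Assumptions (rules of thumb, not exact): training compute C ~ 6 * N * D FLOPs,
# and the compute-optimal recipe uses about 20 tokens per parameter (D = 20 * N).
TOKENS_PER_PARAM = 20

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) that spend flops_budget compute-optimally."""
    # C = 6 * N * (20 * N) = 120 * N**2  =>  N = sqrt(C / 120)
    n_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Chinchilla itself: ~5.9e23 FLOPs -> roughly 70B parameters and 1.4T tokens
n, d = compute_optimal(5.88e23)
print(f"params: {n:.2e}, tokens: {d:.2e}")
```

Note that both N and D grow as the square root of C: a doubled compute budget should be split between a bigger model and more data, not spent on model size alone.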
Key Visualizations
Scaling Laws
Model Scale Timeline
Emergent Abilities Chart
GPT-3 Capabilities