Scaling Laws

Large Language Models

Part 3: Advanced Topics (35 slides)

The Scaling Laws Paradox: why is AI progress so predictable? Performance follows power laws, so we can forecast the capabilities of larger models before we train them.
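The "predictable progress" claim can be made concrete with a tiny sketch. The form below, L(N) = (N_c / N)^alpha, is the parameter-count scaling law from Kaplan et al. (2020); the constants are roughly the values reported there, used here only for illustration, not a definitive fit:

```python
# Sketch: test loss as a power law in parameter count, L(N) = (N_c / N)**alpha.
# Constants are approximately the Kaplan et al. (2020) fit; treat them as
# illustrative, not authoritative.
N_C = 8.8e13   # critical parameter count (approximate fitted value)
ALPHA = 0.076  # power-law exponent (approximate fitted value)

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

# Extrapolation in action: the same fitted curve covers models spanning
# three orders of magnitude in size.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The point of the paradox: a curve fitted on small, cheap models predicts the loss of models hundreds of times larger, which is why labs could budget for GPT-3-scale training runs with some confidence.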

Prerequisites

  • Week 5: Basic transformer architecture
  • Week 6: Pre-trained models (BERT, GPT)
  • Understanding of model parameters and computational complexity

Overview

Scaling to billions of parameters: scaling laws, emergent abilities, and the modern LLM landscape.

Learning Objectives

  • Explain scaling laws and why AI progress is predictable
  • Compare the three paths: bigger (GPT-3), smarter (MoE), efficient (Reformer)
  • Understand emergent abilities that appear at scale
  • Analyze compute-optimal training (Chinchilla scaling)
  • Evaluate trade-offs between model size, data, and compute

Key Topics

Scaling laws
Emergent abilities
In-context learning
Chain-of-thought

Key Concepts

Scaling laws: Performance follows power laws in compute, data, and parameter count
Emergent abilities: Capabilities that appear suddenly at certain scales
Mixture of Experts (MoE): 1.6T total parameters but only 10B active
Sparse attention: Linear-complexity alternatives (Reformer, Linformer)
Chinchilla scaling: Optimal ratio of model size to training data
Compute-optimal training: Balancing FLOPs across parameters and tokens
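The Chinchilla rule of thumb can be sketched in a few lines. The identities below, training FLOPs C ≈ 6·N·D and the compute-optimal ratio D ≈ 20·N tokens per parameter, are the commonly quoted approximations of Hoffmann et al. (2022); the factor 20 is an empirical fit, not an exact constant:

```python
import math

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) spending flops_budget compute-optimally,
    assuming C = 6 * N * D with the Chinchilla ratio D = 20 * N.
    Both the 6*N*D cost model and the factor 20 are approximations."""
    n_params = math.sqrt(flops_budget / (6 * 20))
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Sanity check against the paper's own model: Chinchilla's roughly
# 5.76e23 FLOP budget should yield about 70B parameters and 1.4T tokens.
n, d = compute_optimal(5.76e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")
```

The same sketch shows why GPT-3 (175B parameters, ~300B tokens) is considered under-trained by Chinchilla's standard: at that parameter count the rule calls for roughly 3.5T tokens.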

Key Visualizations

Scaling Laws
Model Scale Timeline
Emergent Abilities Chart
GPT-3 Capabilities

Resources