Scaling Laws

Large Language Models

Part 3: Advanced Topics (35 slides)

The Scaling Laws Paradox: why is AI progress so predictable? Performance follows power laws, so we can forecast the capabilities of larger models before we train them.
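The "predictable progress" claim can be made concrete with a tiny sketch. The form below, L(N) = (N_c / N)^alpha, is the parameter-count scaling law from Kaplan et al. (2020); the constants are roughly the values reported there, used here only for illustration, not a definitive fit:

```python
# Sketch: test loss as a power law in parameter count, L(N) = (N_c / N)**alpha.
# Constants are approximately the Kaplan et al. (2020) fit; treat them as
# illustrative, not authoritative.
N_C = 8.8e13   # critical parameter count (approximate fitted value)
ALPHA = 0.076  # power-law exponent (approximate fitted value)

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA

# Extrapolation in action: the same fitted curve covers models spanning
# three orders of magnitude in size.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

The point of the paradox: a curve fitted on small, cheap models predicts the loss of models hundreds of times larger, which is why labs could budget for GPT-3-scale training runs with some confidence.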

Prerequisites

  • Week 5: Basic transformer architecture
  • Week 6: Pre-trained models (BERT, GPT)
  • Understanding of model parameters and computational complexity

Overview

Scaling to billions of parameters: scaling laws, emergent abilities, and the modern LLM landscape.

Learning Objectives

  • Explain scaling laws and why AI progress is predictable
  • Compare the three paths: bigger (GPT-3), smarter (MoE), efficient (Reformer)
  • Understand emergent abilities that appear at scale
  • Analyze compute-optimal training (Chinchilla scaling)
  • Evaluate trade-offs between model size, data, and compute

Key Topics

Scaling laws
Emergent abilities
In-context learning
Chain-of-thought

Key Concepts

Scaling laws: Performance follows power laws in compute, data, and parameter count
Emergent abilities: Capabilities that appear suddenly at certain scales
Mixture of Experts (MoE): 1.6T total parameters but only 10B active
Sparse attention: Linear-complexity alternatives (Reformer, Linformer)
Chinchilla scaling: Optimal ratio of model size to training data
Compute-optimal training: Balancing FLOPs across parameters and tokens
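The Chinchilla rule of thumb can be sketched in a few lines. The identities below, training FLOPs C ≈ 6·N·D and the compute-optimal ratio D ≈ 20·N tokens per parameter, are the commonly quoted approximations of Hoffmann et al. (2022); the factor 20 is an empirical fit, not an exact constant:

```python
import math

def compute_optimal(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) spending flops_budget compute-optimally,
    assuming C = 6 * N * D with the Chinchilla ratio D = 20 * N.
    Both the 6*N*D cost model and the factor 20 are approximations."""
    n_params = math.sqrt(flops_budget / (6 * 20))
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Sanity check against the paper's own model: Chinchilla's roughly
# 5.76e23 FLOP budget should yield about 70B parameters and 1.4T tokens.
n, d = compute_optimal(5.76e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")
```

The same sketch shows why GPT-3 (175B parameters, ~300B tokens) is considered under-trained by Chinchilla's standard: at that parameter count the rule calls for roughly 3.5T tokens.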

Key Visualizations

Scaling Laws
Model Scale Timeline
Emergent Abilities Chart
GPT-3 Capabilities

Resources