Scaling Laws and Emergence
Prerequisites
Summary
Chapter 11 presents the quantitative science of scale in language modeling: the empirical discovery that pre-training loss follows precise power laws in model size, dataset size, and compute budget. Kaplan et al. (2020) established the initial scaling laws, and Hoffmann et al. (2022) revised them with the Chinchilla finding that, for a fixed compute budget, parameters and training tokens should be scaled in roughly equal proportion -- implying that earlier large models were substantially undertrained. Beyond smooth scaling, the chapter examines emergent abilities -- capabilities like arithmetic and chain-of-thought reasoning that appear abruptly above certain scale thresholds (Wei et al., 2022b) -- and the controversy over whether these are genuine phase transitions or measurement artifacts (Schaeffer et al., 2023). Mixture-of-Experts (MoE) architectures provide a practical mechanism for scaling parameters without proportionally scaling compute, enabling models like Switch Transformer and Mixtral. The chapter concludes with the engineering of large-scale training (parallelism strategies, mixed precision, ZeRO) and the compute frontier.
Learning Objectives
- State and interpret the Kaplan and Chinchilla scaling laws, derive the compute-optimal relationship between model size and dataset size, and use these laws to estimate the training loss for a given compute budget.
- Define emergent abilities in large language models, provide concrete examples (e.g., arithmetic, chain-of-thought reasoning), and critically evaluate the debate over whether emergence is a genuine phase transition or a measurement artifact.
- Explain the Mixture-of-Experts (MoE) architecture, including sparse gating, top-k routing, and load balancing, and articulate why MoE enables scaling parameter count without proportionally increasing compute.
- Describe key techniques for efficient large-scale training -- data parallelism, tensor parallelism, pipeline parallelism, mixed-precision training, and gradient checkpointing -- and reason about their trade-offs.
Section Outline
11.1 Scaling Laws (~5pp)
The empirical discovery that pre-training loss follows power laws in model size $N$, dataset size $D$, and compute budget $C$. The Kaplan et al. (2020) power-law relationships and the Chinchilla revision (Hoffmann et al., 2022) showing compute-optimal training requires scaling data and parameters in roughly equal proportion. Implications for training budget allocation and infrastructure investment.
- 11.1.1 Power Laws in Language Modeling
- 11.1.2 Kaplan et al.: Scaling Laws for Neural Language Models
- 11.1.3 Chinchilla: Compute-Optimal Training
- 11.1.4 Using Scaling Laws for Planning
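The compute-optimal sizing rule in 11.1.3--11.1.4 can be sketched in a few lines. This is a minimal calculation assuming the standard approximations $C \approx 6ND$ and $D/N \approx 20$; the function name and the example budget (roughly Chinchilla's reported $\sim 5.8 \times 10^{23}$ FLOPs) are illustrative.

```python
import math

def chinchilla_optimal(C, tokens_per_param=20.0):
    """Compute-optimal parameter count N and token count D for a FLOP budget C,
    under the approximations C = 6*N*D and D/N = tokens_per_param (~20)."""
    # C = 6 * N * D = 6 * tokens_per_param * N^2  =>  N = sqrt(C / (6 * tokens_per_param))
    N = math.sqrt(C / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Illustrative budget close to Chinchilla's (~5.8e23 FLOPs):
N, D = chinchilla_optimal(5.8e23)
print(f"N ~ {N/1e9:.0f}B parameters, D ~ {D/1e12:.2f}T tokens")
```

With these constants the sketch recovers roughly a 70B-parameter model trained on about 1.4T tokens, matching the Chinchilla configuration.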
11.2 Emergent Abilities (~4pp)
Capabilities that appear abruptly once models cross a certain scale threshold. Examples: multi-step arithmetic, chain-of-thought reasoning, instruction following, code generation. The debate: genuine phase transitions vs. metric artifacts (Schaeffer et al., 2023).
- 11.2.1 What Are Emergent Abilities?
- 11.2.2 Concrete Examples Across Benchmarks
- 11.2.3 The Phase Transition Debate
11.3 Mixture of Experts (MoE) (~4pp)
Scaling parameters without proportionally scaling compute by activating only a subset of experts per token. The gating function, top-k routing, load-balancing losses, and concrete architectures: Switch Transformer, Mixtral.
- 11.3.1 Sparse Gating and Top-K Routing
- 11.3.2 Load Balancing and Capacity Factor
- 11.3.3 MoE Architectures in Practice
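The routing mechanism of 11.3.1 can be sketched without any framework: softmax over the top-$k$ router logits, zero weight for everyone else. This is a simplified sketch of Mixtral-style top-2 routing that ignores the capacity factor and load-balancing loss covered in 11.3.2; the function name and the example logits are illustrative.

```python
import math

def top_k_gating(logits, k=2):
    """Pick the k experts with the largest router logits and softmax over
    only those k; all other experts receive weight 0 and are never run.
    Returns (expert_indices, mixing_weights)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                      # for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

# One token routed among 8 experts: only 2 expert FFNs execute, so active
# compute scales with k, not with the total expert count.
idx, w = top_k_gating([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2], k=2)
print(idx, [f"{x:.3f}" for x in w])
```

This is exactly why an 8-expert top-2 model pays roughly the FLOPs of 2 experts per token while storing the parameters of all 8.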
11.4 Efficient Training (~4pp)
Practical techniques that make large-scale training feasible: data parallelism, tensor parallelism, pipeline parallelism, mixed-precision training, gradient checkpointing, and ZeRO optimization stages.
- 11.4.1 Data, Tensor, and Pipeline Parallelism
- 11.4.2 Mixed Precision and Gradient Checkpointing
- 11.4.3 ZeRO and Memory Optimization
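The ZeRO stages of 11.4.3 reduce to simple byte accounting. The sketch below uses the common 2+2+12 bytes-per-parameter breakdown for mixed-precision Adam (fp16 params, fp16 grads, fp32 master params plus two Adam moments) and partitions each component across GPUs according to the stage; activations and communication buffers are deliberately excluded, and the 7B/64-GPU example is illustrative.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Approximate per-GPU model-state memory (GB) under ZeRO, assuming
    mixed-precision Adam: 2 B params + 2 B grads + 12 B optimizer states
    per parameter. Stage 1 shards optimizer states, stage 2 also shards
    gradients, stage 3 also shards parameters. Activations not included."""
    p, g, o = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        o /= n_gpus
    if stage >= 2:
        g /= n_gpus
    if stage >= 3:
        p /= n_gpus
    return n_params * (p + g + o) / 1e9

for s in range(4):
    print(f"ZeRO-{s}: {zero_memory_gb(7e9, 64, s):7.2f} GB/GPU for a 7B model on 64 GPUs")
```

Under this accounting a 7B model needs ~112 GB of model state per GPU with no sharding, which no single accelerator holds, but under 2 GB at stage 3 across 64 GPUs.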
11.5 The Compute Frontier (~3pp)
Historical trends in training compute, cost estimates for frontier models, the role of hardware, and the open-weight movement as a democratizing force.
- 11.5.1 Compute Trends and Cost Estimates
- 11.5.2 The Open-Weight Movement
- 11.5.3 Implications for Researchers
Key Equations
Key Figures
Exercises
Theory
- Compute-Optimal Sizing (Basic). Given a compute budget of $C = 10^{22}$ FLOPs, use the Chinchilla scaling law ($N_{\text{opt}} \propto C^{0.5}$, with $6ND = C$ and $D/N \approx 20$) to compute $N_{\text{opt}}$ and $D_{\text{opt}}$. Is a 10B-parameter model trained on 100B tokens compute-optimal for this budget?
- Power-Law Interpretation (Intermediate). The Kaplan scaling law gives $\alpha_N \approx 0.076$. Compute the factor by which loss decreases when model size increases from 1B to 10B parameters, then from 10B to 100B. Explain why the multiplicative improvement factor is the same in both cases.
- MoE vs. Dense (Intermediate). A dense Transformer with 70B parameters requires ~420B FLOPs per token. A Mixtral-style MoE with 8 experts of 7B each and top-2 routing has 56B total parameters. What is its approximate FLOPs per token?
- Emergence Experiment Design (Intermediate). Schaeffer et al. (2023) argue that emergent abilities are metric artifacts. Design an experiment to test this claim for multi-digit addition, specifying both a discrete and a continuous evaluation metric.
Programming
- Scaling Law Visualization (Basic). Plot the predicted loss $L(N) = (N_c/N)^{\alpha_N}$ for models from 10M to 1T parameters on a log-log scale. Annotate GPT-2, GPT-3, and Chinchilla.
- MoE Layer Implementation (Intermediate). Implement a simple MoE FFN layer in PyTorch with $N_E = 8$ experts and top-$K = 2$ routing, including a load-balancing loss term.
- Scaling Law Fitting (Intermediate). Fit a power law $L(N) = a \cdot N^{-b}$ using least-squares regression on log-transformed data. Report the fitted exponent $b$ and compare with Kaplan's $\alpha_N = 0.076$.
- Emergence Simulation (Advanced). Simulate the emergence phenomenon with a synthetic task: a model must get $K$ independent sub-steps correct. Plot exact-match accuracy $p(N)^K$ vs. $\log N$ for $K = 1, 4, 8$ and show that higher $K$ produces sharper apparent "emergence."
Cross-References
This chapter references:
- Ch 1 (Sections 1.1--1.2): The prediction paradigm and history. Chapter 11 quantifies this arc: scaling laws describe exactly how prediction improves with scale.
- Ch 9 (Sections 9.1--9.3): Pre-training paradigms. The pre-training objectives (MLM, CLM) whose loss is the dependent variable in the scaling laws. The GPT progression from 117M to 175B parameters is the empirical backdrop.
This chapter is referenced by:
- Ch 12 (soft): Understanding that models improve predictably with scale helps explain why alignment becomes necessary -- more capable models require more careful steering.
Key Papers
- Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. [Section 11.1]
- Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems (NeurIPS). [Sections 11.1.3--11.1.4]
- Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. [Section 11.2]
- Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? Advances in Neural Information Processing Systems (NeurIPS). [Section 11.2.3]
- Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 23(120), 1--39. [Section 11.3.3]
- Jiang, A. Q., Sablayrolles, A., Roux, A., et al. (2024). Mixtral of Experts. arXiv:2401.04088. [Section 11.3.3]
- Shazeer, N., Mirhoseini, A., Maziarz, K., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Proceedings of ICLR. [Section 11.3.1]
- Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971. [Section 11.5.2]