Part III · Chapter 11

Scaling Laws and Emergence

Part III: The Transformer Revolution · Difficulty: Moderate · Length: ~20 pp · Phase 2
Why this chapter matters. Scaling laws are the quantitative science of how prediction improves with scale. They tell us, with remarkable regularity, how much better a language model's next-word predictions become as we invest more parameters, data, and compute -- transforming language modeling from an art into an engineering discipline with predictable returns on investment.

Prerequisites

  • Ch 9: Pre-training Paradigms → Ch 11: Scaling Laws and Emergence

Summary

Chapter 11 presents the quantitative science of scale in language modeling: the empirical discovery that pre-training loss follows precise power laws in model size, dataset size, and compute budget. Kaplan et al. (2020) established the initial scaling laws, and Hoffmann et al. (2022) revised them with the Chinchilla finding that models should scale parameters and data equally with compute -- demonstrating that earlier models were severely undertrained. Beyond smooth scaling, the chapter examines emergent abilities -- capabilities like arithmetic and chain-of-thought reasoning that appear abruptly above certain scale thresholds (Wei et al., 2022b) -- and the controversy over whether these are genuine phase transitions or measurement artifacts (Schaeffer et al., 2023). Mixture-of-Experts (MoE) architectures provide a practical mechanism for scaling parameters without proportionally scaling compute, enabling models like Switch Transformer and Mixtral. The chapter concludes with the engineering of large-scale training (parallelism strategies, mixed precision, ZeRO) and the compute frontier.

Learning Objectives

  1. State and interpret the Kaplan and Chinchilla scaling laws, derive the compute-optimal relationship between model size and dataset size, and use these laws to estimate the training loss for a given compute budget.
  2. Define emergent abilities in large language models, provide concrete examples (e.g., arithmetic, chain-of-thought reasoning), and critically evaluate the debate over whether emergence is a genuine phase transition or a measurement artifact.
  3. Explain the Mixture-of-Experts (MoE) architecture, including sparse gating, top-k routing, and load balancing, and articulate why MoE enables scaling parameter count without proportionally increasing compute.
  4. Describe key techniques for efficient large-scale training -- data parallelism, tensor parallelism, pipeline parallelism, mixed-precision training, and gradient checkpointing -- and reason about their trade-offs.

Section Outline

11.1 Scaling Laws (~5pp)

The empirical discovery that pre-training loss follows power laws in model size $N$, dataset size $D$, and compute budget $C$. The Kaplan et al. (2020) power-law relationships and the Chinchilla revision (Hoffmann et al., 2022) showing compute-optimal training requires scaling data and parameters in roughly equal proportion. Implications for training budget allocation and infrastructure investment.

  • 11.1.1 Power Laws in Language Modeling
  • 11.1.2 Kaplan et al.: Scaling Laws for Neural Language Models
  • 11.1.3 Chinchilla: Compute-Optimal Training
  • 11.1.4 Using Scaling Laws for Planning
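The planning use described in 11.1.4 can be sketched with back-of-envelope arithmetic. The sketch below assumes the two standard Chinchilla approximations, $C \approx 6ND$ and $D/N \approx 20$ tokens per parameter; the function name is illustrative, not from any library.

```python
# Sketch: compute-optimal sizing under the Chinchilla rules of thumb.
# Assumes C ≈ 6*N*D (training FLOPs) and D ≈ 20*N (tokens per parameter).

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return (N_opt, D_opt) for a compute budget C, using C = 6*N*D and D = r*N."""
    n_opt = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

n, d = chinchilla_optimal(1e22)
print(f"N_opt ≈ {n / 1e9:.1f}B params, D_opt ≈ {d / 1e9:.0f}B tokens")
# ≈ 9.1B params on ≈ 183B tokens
```

Note that both outputs scale as $C^{0.5}$, exactly as Equation (11.3) prescribes: doubling compute should buy roughly $\sqrt{2}\times$ more parameters and $\sqrt{2}\times$ more tokens.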

11.2 Emergent Abilities (~4pp)

Capabilities that appear abruptly once models cross a certain scale threshold. Examples: multi-step arithmetic, chain-of-thought reasoning, instruction following, code generation. The debate: genuine phase transitions vs. metric artifacts (Schaeffer et al., 2023).

  • 11.2.1 What Are Emergent Abilities?
  • 11.2.2 Concrete Examples Across Benchmarks
  • 11.2.3 The Phase Transition Debate
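The metric-artifact argument of 11.2.3 can be illustrated with a toy simulation. In the sketch below, the per-step accuracy curve is hypothetical (a smooth power law, not fitted to any real model); the point is that a discrete all-or-nothing metric over $K$ sub-steps turns smooth underlying improvement into an apparently abrupt jump.

```python
# Toy illustration of the Schaeffer et al. (2023) argument: smooth per-step
# improvement looks "emergent" under a strict exact-match metric.

def per_step_accuracy(n_params):
    # Hypothetical smooth power-law improvement with scale (illustrative only).
    return 1.0 - 0.5 * (n_params / 1e7) ** -0.1

def exact_match(n_params, k):
    # Discrete metric: all K independent sub-steps must be correct.
    return per_step_accuracy(n_params) ** k

for n in [1e7, 1e9, 1e11]:
    print(f"N={n:.0e}: per-step={per_step_accuracy(n):.3f}, "
          f"exact-match(K=8)={exact_match(n, 8):.4f}")
```

The per-step curve rises gently across four orders of magnitude, but the $K=8$ exact-match curve stays near zero and then climbs steeply -- the signature that Programming Exercise 4 asks you to reproduce.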

11.3 Mixture of Experts (MoE) (~4pp)

Scaling parameters without proportionally scaling compute by activating only a subset of experts per token. The gating function, top-k routing, load-balancing losses, and concrete architectures: Switch Transformer, Mixtral.

  • 11.3.1 Sparse Gating and Top-K Routing
  • 11.3.2 Load Balancing and Capacity Factor
  • 11.3.3 MoE Architectures in Practice
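The routing mechanism of 11.3.1 can be sketched in a few lines of NumPy. This is a minimal single-token illustration, not a production layer: the "experts" are random linear maps standing in for full FFN blocks, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, w_gate, experts, k=2):
    """Route x to the k highest-scoring experts and mix their outputs (Eq. 11.4)."""
    scores = softmax(w_gate @ x)               # gating probabilities over experts
    top = np.argsort(scores)[-k:]              # indices of the top-k experts
    weights = scores[top] / scores[top].sum()  # renormalize over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d_model, n_experts = 4, 8
w_gate = rng.standard_normal((n_experts, d_model))
# Toy experts: random linear maps standing in for FFN blocks.
mats = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
experts = [lambda x, m=m: m @ x for m in mats]

y = moe_forward(rng.standard_normal(d_model), w_gate, experts, k=2)
print(y.shape)  # (4,)
```

Only $k$ of the $N_E$ expert functions are ever evaluated per token -- which is exactly why a Mixtral-style model pays the FLOPs of ~2 experts while holding the parameters of 8.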

11.4 Efficient Training (~4pp)

Practical techniques that make large-scale training feasible: data parallelism, tensor parallelism, pipeline parallelism, mixed-precision training, gradient checkpointing, and ZeRO optimization stages.

  • 11.4.1 Data, Tensor, and Pipeline Parallelism
  • 11.4.2 Mixed Precision and Gradient Checkpointing
  • 11.4.3 ZeRO and Memory Optimization
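The savings from the ZeRO stages in 11.4.3 follow from simple accounting. The sketch below uses the commonly cited mixed-precision-Adam breakdown of 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights and Adam moments) per parameter; treat the exact byte counts as an assumption of this illustration.

```python
def zero_memory_gb(n_params, n_gpus, stage=0):
    """Approximate per-GPU model-state memory (GB) for mixed-precision Adam.
    Stage 1 shards optimizer states, stage 2 adds gradients, stage 3 adds weights."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return n_params * (weights + grads + optim) / 1e9

for s in range(4):
    print(f"ZeRO stage {s}: {zero_memory_gb(7e9, 64, s):.1f} GB per GPU")
# stage 0: 112.0, stage 1: 29.3, stage 2: 15.5, stage 3: 1.8
```

The arithmetic makes the headline point concrete: a 7B-parameter model that cannot fit on a single 80 GB GPU under plain data parallelism becomes trivially shardable at stage 3.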

11.5 The Compute Frontier (~3pp)

Historical trends in training compute, cost estimates for frontier models, the role of hardware, and the open-weight movement as a democratizing force.

  • 11.5.1 Compute Trends and Cost Estimates
  • 11.5.2 The Open-Weight Movement
  • 11.5.3 Implications for Researchers
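The compute estimates behind plots like Figure 11.5 typically come from the same $C \approx 6ND$ approximation used throughout this chapter. A one-line sketch, using GPT-3's published figures (175B parameters, ~300B training tokens):

```python
def training_flops(n_params, n_tokens):
    """Standard back-of-envelope estimate: C ≈ 6 * N * D training FLOPs."""
    return 6.0 * n_params * n_tokens

# GPT-3: 175B parameters trained on ~300B tokens (Brown et al., 2020).
c = training_flops(175e9, 300e9)
print(f"{c:.2e} FLOPs")  # 3.15e+23
```

Dividing such an estimate by a cluster's sustained FLOP/s (and multiplying by a cost per GPU-hour) yields the dollar figures that frontier-model cost estimates are built on.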

Key Equations

(11.1)
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$
Kaplan Scaling Law -- where $L$ is the cross-entropy loss, $N$ is non-embedding parameters, $D$ is dataset size in tokens, $\alpha_N \approx 0.076$, and $\alpha_D \approx 0.095$.
(11.2)
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
Joint Scaling Law -- the Kaplan et al. (2020) parametric fit describing loss as a function of both model size and data. It reduces to $L(N) = (N_c/N)^{\alpha_N}$ as $D \to \infty$ and to $L(D) = (D_c/D)^{\alpha_D}$ as $N \to \infty$, consistent with Equation (11.1).
(11.3)
$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$
Chinchilla Optimal Allocation -- for a given compute budget $C$ (in FLOPs), optimal model size and dataset size both scale as the square root of compute.
(11.4)
$$g(\mathbf{x}) = \text{TopK}\!\left(\text{softmax}(\mathbf{W}_g \mathbf{x})\right), \quad \text{MoE}(\mathbf{x}) = \sum_{i \in \text{TopK}} g_i(\mathbf{x}) \cdot E_i(\mathbf{x})$$
MoE Routing -- $\mathbf{W}_g$ is a learned gating matrix, $\text{TopK}$ selects the $K$ experts with highest gating scores, and $E_i(\mathbf{x})$ is expert $i$'s output.
(11.5)
$$\mathcal{L}_{\text{balance}} = N_E \sum_{i=1}^{N_E} f_i \cdot p_i$$
Load Balancing Loss -- where $f_i$ is the fraction of tokens routed to expert $i$ and $p_i$ is the mean gating probability for expert $i$, encouraging uniform expert utilization.
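Equation (11.5) can be checked numerically: under perfectly uniform routing ($f_i = p_i = 1/N_E$) the loss attains its minimum of 1, while fully collapsed routing (all tokens to one expert) drives it up to $N_E$. A minimal sketch:

```python
def balance_loss(token_fractions, mean_probs):
    """L_balance = N_E * sum_i f_i * p_i (Eq. 11.5)."""
    n_e = len(token_fractions)
    return n_e * sum(f * p for f, p in zip(token_fractions, mean_probs))

n_e = 8
uniform = [1.0 / n_e] * n_e
print(balance_loss(uniform, uniform))    # 1.0 -- the minimum, at uniform routing

collapsed = [1.0] + [0.0] * (n_e - 1)
print(balance_loss(collapsed, collapsed))  # 8.0 -- routing collapse onto one expert
```

Minimizing this term alongside the language-modeling loss therefore pushes the router away from expert collapse and toward uniform utilization.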

Key Figures

Figure 11.1 · Line Plots (3 panels) · Matplotlib
Scaling Law Curves
Log-log plots showing pre-training loss vs. model parameters, dataset size, and compute budget, reproducing the key results from Kaplan et al. (2020). Each panel shows the power-law relationship as a straight line on log-log axes.
Figure 11.2 · Scatter/Line Plot · Matplotlib
Chinchilla Optimal Frontier
For each compute budget, the optimal $(N, D)$ pair that minimizes loss. Models with more parameters than the frontier prescribes are over-parameterized and undertrained; models with fewer spend their compute over-training a model that is too small. Annotated with GPT-3, Chinchilla, and LLaMA positions.
Figure 11.3 · Line Plot with Threshold Annotation · Matplotlib
Emergent Abilities Plot
Benchmark accuracy vs. model scale (log parameters) for several tasks, showing flat performance that jumps sharply at a threshold -- the signature of emergent abilities.
Figure 11.4 · Architecture Diagram · TikZ
MoE Architecture Diagram
The gating network routing tokens to a subset of expert FFN blocks, with the routing decision and weighted combination of expert outputs visualized.
Figure 11.5 · Scatter/Line Plot with Annotations · Matplotlib
Compute Cost Trends
Historical plot of estimated training compute (FLOPs) for notable models from 2017 to 2026, with cost estimates overlaid. Tracks the exponential growth from the original Transformer through GPT-3 to frontier models.
Figure 11.6 · Architecture Diagram · TikZ
Distributed Training Diagram
Illustration of data parallelism, tensor parallelism, and pipeline parallelism, showing how a model and its data are partitioned across multiple GPUs in 3D parallelism.

Exercises

Theory

  1. Compute-Optimal Sizing (Basic). Given a compute budget of $C = 10^{22}$ FLOPs, use the Chinchilla scaling law ($N_{\text{opt}} \propto C^{0.5}$, with $6ND = C$ and $D/N \approx 20$) to compute $N_{\text{opt}}$ and $D_{\text{opt}}$. Is a 10B-parameter model trained on 100B tokens compute-optimal for this budget?
  2. Power-Law Interpretation (Intermediate). The Kaplan scaling law gives $\alpha_N \approx 0.076$. Compute the factor by which loss decreases when model size increases from 1B to 10B parameters, then from 10B to 100B. Explain why the multiplicative improvement factor is the same in both cases.
  3. MoE vs. Dense (Intermediate). A dense Transformer with 70B parameters requires ~420B FLOPs per token. A Mixtral-style MoE with 8 experts of 7B each and top-2 routing has 56B total parameters. What is its approximate FLOPs per token?
  4. Emergence Experiment Design (Intermediate). Schaeffer et al. (2023) argue that emergent abilities are metric artifacts. Design an experiment to test this claim for multi-digit addition, specifying both a discrete and a continuous evaluation metric.

Programming

  1. Scaling Law Visualization (Basic). Plot the predicted loss $L(N) = (N_c/N)^{\alpha_N}$ for models from 10M to 1T parameters on a log-log scale. Annotate GPT-2, GPT-3, and Chinchilla.
  2. MoE Layer Implementation (Intermediate). Implement a simple MoE FFN layer in PyTorch with $N_E = 8$ experts and top-$K = 2$ routing, including a load-balancing loss term.
  3. Scaling Law Fitting (Intermediate). Fit a power law $L(N) = a \cdot N^{-b}$ using least-squares regression on log-transformed data. Report the fitted exponent $b$ and compare with Kaplan's $\alpha_N = 0.076$.
  4. Emergence Simulation (Advanced). Simulate the emergence phenomenon with a synthetic task: a model must get $K$ independent sub-steps correct. Plot exact-match accuracy $p(N)^K$ vs. $\log N$ for $K = 1, 4, 8$ and show that higher $K$ produces sharper apparent "emergence."

Cross-References

This chapter references:

  • Ch 1 (Sections 1.1--1.2): The prediction paradigm and history. Chapter 11 quantifies this arc: scaling laws describe exactly how prediction improves with scale.
  • Ch 9 (Sections 9.1--9.3): Pre-training paradigms. The pre-training objectives (MLM, CLM) whose loss is the dependent variable in the scaling laws. The GPT progression from 117M to 175B parameters is the empirical backdrop.

This chapter is referenced by:

  • Ch 12 (soft): Understanding that models improve predictably with scale helps explain why alignment becomes necessary -- more capable models require more careful steering.

Key Papers

  • Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. [Section 11.1]
  • Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems. [Sections 11.1.3--11.1.4]
  • Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. [Section 11.2]
  • Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? Advances in Neural Information Processing Systems. [Section 11.2.3]
  • Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 23(120), 1--39. [Section 11.3.3]
  • Jiang, A. Q., Sablayrolles, A., Roux, A., et al. (2024). Mixtral of Experts. arXiv:2401.04088. [Section 11.3.3]
  • Shazeer, N., Mirhoseini, A., Maziarz, K., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Proceedings of ICLR. [Section 11.3.1]
  • Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971. [Section 11.5.2]