Part III · Chapter 11

Scaling Laws and Emergence

Part III: The Transformer Revolution · Difficulty: Moderate · Length: ~20 pp · Phase 2
Why this chapter matters. Scaling laws are the quantitative science of how prediction improves with scale. They tell us, with remarkable regularity, how much better a language model's next-word predictions become as we invest more parameters, data, and compute -- transforming language modeling from an art into an engineering discipline with predictable returns on investment.

Prerequisites

  • Ch 9: Pre-training Paradigms → Ch 11: Scaling Laws and Emergence

Summary

Chapter 11 presents the quantitative science of scale in language modeling: the empirical discovery that pre-training loss follows precise power laws in model size, dataset size, and compute budget. Kaplan et al. (2020) established the initial scaling laws, and Hoffmann et al. (2022) revised them with the Chinchilla finding that models should scale parameters and data equally with compute -- demonstrating that earlier models were severely undertrained. Beyond smooth scaling, the chapter examines emergent abilities -- capabilities like arithmetic and chain-of-thought reasoning that appear abruptly above certain scale thresholds (Wei et al., 2022b) -- and the controversy over whether these are genuine phase transitions or measurement artifacts (Schaeffer et al., 2023). Mixture-of-Experts (MoE) architectures provide a practical mechanism for scaling parameters without proportionally scaling compute, enabling models like Switch Transformer and Mixtral. The chapter concludes with the engineering of large-scale training (parallelism strategies, mixed precision, ZeRO) and the compute frontier.

Learning Objectives

  1. State and interpret the Kaplan and Chinchilla scaling laws, derive the compute-optimal relationship between model size and dataset size, and use these laws to estimate the training loss for a given compute budget.
  2. Define emergent abilities in large language models, provide concrete examples (e.g., arithmetic, chain-of-thought reasoning), and critically evaluate the debate over whether emergence is a genuine phase transition or a measurement artifact.
  3. Explain the Mixture-of-Experts (MoE) architecture, including sparse gating, top-k routing, and load balancing, and articulate why MoE enables scaling parameter count without proportionally increasing compute.
  4. Describe key techniques for efficient large-scale training -- data parallelism, tensor parallelism, pipeline parallelism, mixed-precision training, and gradient checkpointing -- and reason about their trade-offs.

Section Outline

11.1 Scaling Laws (~5pp)

The empirical discovery that pre-training loss follows power laws in model size $N$, dataset size $D$, and compute budget $C$. The Kaplan et al. (2020) power-law relationships and the Chinchilla revision (Hoffmann et al., 2022) showing compute-optimal training requires scaling data and parameters in roughly equal proportion. Implications for training budget allocation and infrastructure investment.

  • 11.1.1 Power Laws in Language Modeling
  • 11.1.2 Kaplan et al.: Scaling Laws for Neural Language Models
  • 11.1.3 Chinchilla: Compute-Optimal Training
  • 11.1.4 Using Scaling Laws for Planning
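The planning use described in 11.1.4 can be sketched with back-of-envelope arithmetic. The sketch below assumes the two standard Chinchilla approximations, $C \approx 6ND$ and $D/N \approx 20$ tokens per parameter; the function name is illustrative, not from any library.

```python
# Sketch: compute-optimal sizing under the Chinchilla rules of thumb.
# Assumes C ≈ 6*N*D (training FLOPs) and D ≈ 20*N (tokens per parameter).

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return (N_opt, D_opt) for a compute budget C, using C = 6*N*D and D = r*N."""
    n_opt = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

n, d = chinchilla_optimal(1e22)
print(f"N_opt ≈ {n / 1e9:.1f}B params, D_opt ≈ {d / 1e9:.0f}B tokens")
# ≈ 9.1B params on ≈ 183B tokens
```

Note that both outputs scale as $C^{0.5}$, exactly as Equation (11.3) prescribes: doubling compute should buy roughly $\sqrt{2}\times$ more parameters and $\sqrt{2}\times$ more tokens.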

11.2 Emergent Abilities (~4pp)

Capabilities that appear abruptly once models cross a certain scale threshold. Examples: multi-step arithmetic, chain-of-thought reasoning, instruction following, code generation. The debate: genuine phase transitions vs. metric artifacts (Schaeffer et al., 2023).

  • 11.2.1 What Are Emergent Abilities?
  • 11.2.2 Concrete Examples Across Benchmarks
  • 11.2.3 The Phase Transition Debate
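The metric-artifact argument of 11.2.3 can be illustrated with a toy simulation. In the sketch below, the per-step accuracy curve is hypothetical (a smooth power law, not fitted to any real model); the point is that a discrete all-or-nothing metric over $K$ sub-steps turns smooth underlying improvement into an apparently abrupt jump.

```python
# Toy illustration of the Schaeffer et al. (2023) argument: smooth per-step
# improvement looks "emergent" under a strict exact-match metric.

def per_step_accuracy(n_params):
    # Hypothetical smooth power-law improvement with scale (illustrative only).
    return 1.0 - 0.5 * (n_params / 1e7) ** -0.1

def exact_match(n_params, k):
    # Discrete metric: all K independent sub-steps must be correct.
    return per_step_accuracy(n_params) ** k

for n in [1e7, 1e9, 1e11]:
    print(f"N={n:.0e}: per-step={per_step_accuracy(n):.3f}, "
          f"exact-match(K=8)={exact_match(n, 8):.4f}")
```

The per-step curve rises gently across four orders of magnitude, but the $K=8$ exact-match curve stays near zero and then climbs steeply -- the signature that Programming Exercise 4 asks you to reproduce.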

11.3 Mixture of Experts (MoE) (~4pp)

Scaling parameters without proportionally scaling compute by activating only a subset of experts per token. The gating function, top-k routing, load-balancing losses, and concrete architectures: Switch Transformer, Mixtral.

  • 11.3.1 Sparse Gating and Top-K Routing
  • 11.3.2 Load Balancing and Capacity Factor
  • 11.3.3 MoE Architectures in Practice
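The routing mechanism of 11.3.1 can be sketched in a few lines of NumPy. This is a minimal single-token illustration, not a production layer: the "experts" are random linear maps standing in for full FFN blocks, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, w_gate, experts, k=2):
    """Route x to the k highest-scoring experts and mix their outputs (Eq. 11.4)."""
    scores = softmax(w_gate @ x)               # gating probabilities over experts
    top = np.argsort(scores)[-k:]              # indices of the top-k experts
    weights = scores[top] / scores[top].sum()  # renormalize over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d_model, n_experts = 4, 8
w_gate = rng.standard_normal((n_experts, d_model))
# Toy experts: random linear maps standing in for FFN blocks.
mats = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
experts = [lambda x, m=m: m @ x for m in mats]

y = moe_forward(rng.standard_normal(d_model), w_gate, experts, k=2)
print(y.shape)  # (4,)
```

Only $k$ of the $N_E$ expert functions are ever evaluated per token -- which is exactly why a Mixtral-style model pays the FLOPs of ~2 experts while holding the parameters of 8.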

11.4 Efficient Training (~4pp)

Practical techniques that make large-scale training feasible: data parallelism, tensor parallelism, pipeline parallelism, mixed-precision training, gradient checkpointing, and ZeRO optimization stages.

  • 11.4.1 Data, Tensor, and Pipeline Parallelism
  • 11.4.2 Mixed Precision and Gradient Checkpointing
  • 11.4.3 ZeRO and Memory Optimization
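The savings from the ZeRO stages in 11.4.3 follow from simple accounting. The sketch below uses the commonly cited mixed-precision-Adam breakdown of 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights and Adam moments) per parameter; treat the exact byte counts as an assumption of this illustration.

```python
def zero_memory_gb(n_params, n_gpus, stage=0):
    """Approximate per-GPU model-state memory (GB) for mixed-precision Adam.
    Stage 1 shards optimizer states, stage 2 adds gradients, stage 3 adds weights."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return n_params * (weights + grads + optim) / 1e9

for s in range(4):
    print(f"ZeRO stage {s}: {zero_memory_gb(7e9, 64, s):.1f} GB per GPU")
# stage 0: 112.0, stage 1: 29.3, stage 2: 15.5, stage 3: 1.8
```

The arithmetic makes the headline point concrete: a 7B-parameter model that cannot fit on a single 80 GB GPU under plain data parallelism becomes trivially shardable at stage 3.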

11.5 The Compute Frontier (~3pp)

Historical trends in training compute, cost estimates for frontier models, the role of hardware, and the open-weight movement as a democratizing force.

  • 11.5.1 Compute Trends and Cost Estimates
  • 11.5.2 The Open-Weight Movement
  • 11.5.3 Implications for Researchers
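The compute estimates behind plots like Figure 11.5 typically come from the same $C \approx 6ND$ approximation used throughout this chapter. A one-line sketch, using GPT-3's published figures (175B parameters, ~300B training tokens):

```python
def training_flops(n_params, n_tokens):
    """Standard back-of-envelope estimate: C ≈ 6 * N * D training FLOPs."""
    return 6.0 * n_params * n_tokens

# GPT-3: 175B parameters trained on ~300B tokens (Brown et al., 2020).
c = training_flops(175e9, 300e9)
print(f"{c:.2e} FLOPs")  # 3.15e+23
```

Dividing such an estimate by a cluster's sustained FLOP/s (and multiplying by a cost per GPU-hour) yields the dollar figures that frontier-model cost estimates are built on.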

Key Equations

(11.1)
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$
Kaplan Scaling Law -- where $L$ is the cross-entropy loss, $N$ is non-embedding parameters, $D$ is dataset size in tokens, $\alpha_N \approx 0.076$, and $\alpha_D \approx 0.095$.
(11.2)
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
Joint Scaling Law -- the Kaplan et al. (2020) parametric fit describing loss as a function of both model size and data. It reduces to $L(N) = (N_c/N)^{\alpha_N}$ as $D \to \infty$ and to $L(D) = (D_c/D)^{\alpha_D}$ as $N \to \infty$, consistent with Equation (11.1).
(11.3)
$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$
Chinchilla Optimal Allocation -- for a given compute budget $C$ (in FLOPs), optimal model size and dataset size both scale as the square root of compute.
(11.4)
$$g(\mathbf{x}) = \text{TopK}\!\left(\text{softmax}(\mathbf{W}_g \mathbf{x})\right), \quad \text{MoE}(\mathbf{x}) = \sum_{i \in \text{TopK}} g_i(\mathbf{x}) \cdot E_i(\mathbf{x})$$
MoE Routing -- $\mathbf{W}_g$ is a learned gating matrix, $\text{TopK}$ selects the $K$ experts with highest gating scores, and $E_i(\mathbf{x})$ is expert $i$'s output.
(11.5)
$$\mathcal{L}_{\text{balance}} = N_E \sum_{i=1}^{N_E} f_i \cdot p_i$$
Load Balancing Loss -- where $f_i$ is the fraction of tokens routed to expert $i$ and $p_i$ is the mean gating probability for expert $i$, encouraging uniform expert utilization.
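Equation (11.5) can be checked numerically: under perfectly uniform routing ($f_i = p_i = 1/N_E$) the loss attains its minimum of 1, while fully collapsed routing (all tokens to one expert) drives it up to $N_E$. A minimal sketch:

```python
def balance_loss(token_fractions, mean_probs):
    """L_balance = N_E * sum_i f_i * p_i (Eq. 11.5)."""
    n_e = len(token_fractions)
    return n_e * sum(f * p for f, p in zip(token_fractions, mean_probs))

n_e = 8
uniform = [1.0 / n_e] * n_e
print(balance_loss(uniform, uniform))    # 1.0 -- the minimum, at uniform routing

collapsed = [1.0] + [0.0] * (n_e - 1)
print(balance_loss(collapsed, collapsed))  # 8.0 -- routing collapse onto one expert
```

Minimizing this term alongside the language-modeling loss therefore pushes the router away from expert collapse and toward uniform utilization.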

Key Figures

Figure 11.1 · Line Plots (3 panels) · Matplotlib
Scaling Law Curves
Log-log plots showing pre-training loss vs. model parameters, dataset size, and compute budget, reproducing the key results from Kaplan et al. (2020). Each panel shows the power-law relationship as a straight line on log-log axes.
Figure 11.2 · Scatter/Line Plot · Matplotlib
Chinchilla Optimal Frontier
For each compute budget, the optimal $(N, D)$ pair that minimizes loss. Models with more parameters than the frontier prescribes are over-parameterized and undertrained; models with fewer spend their compute over-training a model that is too small. Annotated with GPT-3, Chinchilla, and LLaMA positions.
Figure 11.3 · Line Plot with Threshold Annotation · Matplotlib
Emergent Abilities Plot
Benchmark accuracy vs. model scale (log parameters) for several tasks, showing flat performance that jumps sharply at a threshold -- the signature of emergent abilities.
Figure 11.4 · Architecture Diagram · TikZ
MoE Architecture Diagram
The gating network routing tokens to a subset of expert FFN blocks, with the routing decision and weighted combination of expert outputs visualized.
Figure 11.5 · Scatter/Line Plot with Annotations · Matplotlib
Compute Cost Trends
Historical plot of estimated training compute (FLOPs) for notable models from 2017 to 2026, with cost estimates overlaid. Tracks the exponential growth from the original Transformer through GPT-3 to frontier models.
Figure 11.6 · Architecture Diagram · TikZ
Distributed Training Diagram
Illustration of data parallelism, tensor parallelism, and pipeline parallelism, showing how a model and its data are partitioned across multiple GPUs in 3D parallelism.

Exercises

Theory

  1. Compute-Optimal Sizing (Basic). Given a compute budget of $C = 10^{22}$ FLOPs, use the Chinchilla scaling law ($N_{\text{opt}} \propto C^{0.5}$, with $6ND = C$ and $D/N \approx 20$) to compute $N_{\text{opt}}$ and $D_{\text{opt}}$. Is a 10B-parameter model trained on 100B tokens compute-optimal for this budget?
  2. Power-Law Interpretation (Intermediate). The Kaplan scaling law gives $\alpha_N \approx 0.076$. Compute the factor by which loss decreases when model size increases from 1B to 10B parameters, then from 10B to 100B. Explain why the multiplicative improvement factor is the same in both cases.
  3. MoE vs. Dense (Intermediate). A dense Transformer with 70B parameters requires ~420B FLOPs per token. A Mixtral-style MoE with 8 experts of 7B each and top-2 routing has 56B total parameters. What is its approximate FLOPs per token?
  4. Emergence Experiment Design (Intermediate). Schaeffer et al. (2023) argue that emergent abilities are metric artifacts. Design an experiment to test this claim for multi-digit addition, specifying both a discrete and a continuous evaluation metric.

Programming

  1. Scaling Law Visualization (Basic). Plot the predicted loss $L(N) = (N_c/N)^{\alpha_N}$ for models from 10M to 1T parameters on a log-log scale. Annotate GPT-2, GPT-3, and Chinchilla.
  2. MoE Layer Implementation (Intermediate). Implement a simple MoE FFN layer in PyTorch with $N_E = 8$ experts and top-$K = 2$ routing, including a load-balancing loss term.
  3. Scaling Law Fitting (Intermediate). Fit a power law $L(N) = a \cdot N^{-b}$ using least-squares regression on log-transformed data. Report the fitted exponent $b$ and compare with Kaplan's $\alpha_N = 0.076$.
  4. Emergence Simulation (Advanced). Simulate the emergence phenomenon with a synthetic task: a model must get $K$ independent sub-steps correct. Plot exact-match accuracy $p(N)^K$ vs. $\log N$ for $K = 1, 4, 8$ and show that higher $K$ produces sharper apparent "emergence."

Cross-References

This chapter references:

  • Ch 1 (Sections 1.1--1.2): The prediction paradigm and history. Chapter 11 quantifies this arc: scaling laws describe exactly how prediction improves with scale.
  • Ch 9 (Sections 9.1--9.3): Pre-training paradigms. The pre-training objectives (MLM, CLM) whose loss is the dependent variable in the scaling laws. The GPT progression from 117M to 175B parameters is the empirical backdrop.

This chapter is referenced by:

  • Ch 12 (soft): Understanding that models improve predictably with scale helps explain why alignment becomes necessary -- more capable models require more careful steering.

Key Papers

  • Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. [Section 11.1]
  • Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems. [Sections 11.1.3--11.1.4]
  • Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research. [Section 11.2]
  • Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? Advances in Neural Information Processing Systems. [Section 11.2.3]
  • Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR, 23(120), 1--39. [Section 11.3.3]
  • Jiang, A. Q., Sablayrolles, A., Roux, A., et al. (2024). Mixtral of Experts. arXiv:2401.04088. [Section 11.3.3]
  • Shazeer, N., Mirhoseini, A., Maziarz, K., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Proceedings of ICLR. [Section 11.3.1]
  • Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971. [Section 11.5.2]