Part I · Chapter 2

Mathematical Foundations

Part I: Foundations · Difficulty: Moderate · Length: ~25 pages · Phase 2

Prerequisites

Ch 1: Introduction → Ch 2: Mathematical Foundations

Chapter Summary

Chapter 2 builds the mathematical toolkit that every subsequent chapter depends on: probability theory applied to word sequences, maximum likelihood estimation, information theory (entropy, cross-entropy, KL divergence), perplexity, and gradient-based optimization. Every concept is developed through the prediction lens established in Chapter 1: probability is the formalism for predicting the next word, entropy measures how uncertain those predictions are, cross-entropy measures how far our model's predictions diverge from reality, perplexity exponentiates that gap into an interpretable number, and optimization is how we improve our predictions.

Why this chapter matters: These concepts are not just background -- they ARE the language of language modeling. The chain rule decomposition is the mathematical heart of every language model; cross-entropy is the loss function; perplexity is the evaluation metric. Every chapter from 3 through 15 depends directly on this toolkit.

Learning Objectives

  1. Apply the chain rule of probability to decompose the joint probability of a word sequence into a product of conditional probabilities
  2. Derive the maximum likelihood estimate for language model parameters and explain its connection to cross-entropy minimization
  3. Compute entropy, cross-entropy, and KL divergence for discrete distributions and interpret their meaning in the context of language modeling
  4. Calculate perplexity from cross-entropy and explain why lower perplexity indicates a better language model

Section Outline

2.1 Probability and Conditional Probability (~5 pages)

Reviews the probability foundations needed for language modeling: sample spaces over vocabularies, joint probability of word sequences, marginal and conditional distributions, and the chain rule decomposition.

  • 2.1.1 Probability over vocabularies
  • 2.1.2 Joint and conditional distributions
  • 2.1.3 The chain rule for language
  • 2.1.4 Independence assumptions and the Markov property
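The chain rule decomposition in Section 2.1.3 can be sketched numerically. Below is a minimal illustration with a hand-specified bigram model; the vocabulary and all probability values are invented for this example, not taken from the chapter.

```python
# Toy illustration of the chain rule (Eq. 2.1):
# P(w1, ..., wn) = prod_t P(w_t | w_<t), here with a first-order
# Markov (bigram) independence assumption (Sec. 2.1.4).
# All probabilities below are invented for illustration.

# P(w_t | w_{t-1}); "<s>" marks the sentence start.
bigram = {
    ("<s>", "the"): 0.6,
    ("the", "cat"): 0.3,
    ("cat", "sat"): 0.5,
}

def sequence_prob(words):
    """Multiply conditional probabilities left to right."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        prob *= bigram[(prev, w)]
        prev = w
    return prob

p = sequence_prob(["the", "cat", "sat"])  # 0.6 * 0.3 * 0.5 = 0.09
```

Under the Markov assumption each factor conditions only on the previous word, which is exactly the simplification that makes n-gram models (previewed in Section 2.2.3) tractable.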
2.2 Maximum Likelihood Estimation (~5 pages)

Derives the MLE for a simple language model (counting frequencies), shows the equivalence between maximizing log-likelihood and minimizing cross-entropy loss, and discusses the bias-variance tradeoff.

  • 2.2.1 The likelihood function for language models
  • 2.2.2 Log-likelihood and its properties
  • 2.2.3 MLE as counting (n-gram preview)
  • 2.2.4 Overfitting and the need for smoothing
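The "MLE as counting" result of Section 2.2.3 is short enough to sketch directly: the maximum likelihood estimate for a bigram model is just relative frequency (Eq. 2.5). The toy corpus below is invented for illustration.

```python
# MLE for a bigram model reduces to counting (Eq. 2.5):
# P_hat(w | prev) = C(prev, w) / C(prev).
# The corpus is invented for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])  # counts of each bigram history

def mle_bigram(prev, word):
    """Relative-frequency (maximum likelihood) estimate."""
    if history_counts[prev] == 0:
        # Unseen history: the MLE is undefined; this is the
        # overfitting problem that motivates smoothing (Sec. 2.2.4).
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]
```

Note that any bigram absent from the corpus gets probability exactly zero, which is the failure mode that Section 2.2.4 addresses with smoothing.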
2.3 Information Theory Essentials (~7 pages)

Shannon entropy as the theoretical lower bound on compression, cross-entropy as the loss function, KL divergence as the gap between model and true distribution, and perplexity as the standard evaluation metric.

  • 2.3.1 Entropy of a language
  • 2.3.2 Cross-entropy between distributions
  • 2.3.3 KL divergence and its asymmetry
  • 2.3.4 Perplexity: the language modeler's metric
  • 2.3.5 Bits-per-character and bits-per-token
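The three information-theoretic quantities of Section 2.3 are a few lines each for discrete distributions. The distributions P and Q below are invented toy values standing in for a "true" next-word distribution and a model's prediction.

```python
# Entropy (Eq. 2.6), cross-entropy (Eq. 2.2), and KL divergence
# (Eq. 2.3) for discrete distributions, measured in bits.
# P and Q are invented toy distributions.
import math

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P) >= 0."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.25, 0.25]   # "true" distribution; H(P) = 1.5 bits
Q = [0.7, 0.2, 0.1]     # model's distribution

# KL is asymmetric (Sec. 2.3.3): D_KL(P||Q) != D_KL(Q||P) in general.
forward, reverse = kl(P, Q), kl(Q, P)
```

The asymmetry in the last two lines is the point of Section 2.3.3: the penalty for putting too little mass where P is large differs from the penalty for putting mass where P is small.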
2.4 Evaluation Metrics (~4 pages)

How language models are evaluated in practice: intrinsic metrics (perplexity on held-out data), limitations of perplexity, and preview of downstream evaluation.

  • 2.4.1 Held-out perplexity
  • 2.4.2 Interpreting perplexity values
  • 2.4.3 When perplexity fails: downstream evaluation
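Held-out perplexity (Section 2.4.1) follows directly from Eq. 2.4: average the per-token log probabilities and exponentiate. The per-token probabilities below are invented for illustration.

```python
# Perplexity from per-token log probabilities (Eq. 2.4):
# PP = 2^(-(1/N) * sum_i log2 Q(w_i | w_<i)).
# The per-token probabilities are invented for illustration.
import math

def perplexity(log2_probs):
    """Exponentiated average negative log2 probability per token."""
    n = len(log2_probs)
    return 2 ** (-sum(log2_probs) / n)

# Model probabilities assigned to a 4-token held-out sequence.
token_probs = [0.25, 0.5, 0.125, 0.25]
pp = perplexity([math.log2(p) for p in token_probs])  # = 4.0
```

A useful sanity check for Section 2.4.2: a model that is uniformly uncertain over a vocabulary of size |V| assigns every token probability 1/|V| and therefore has perplexity exactly |V|, the "random" endpoint of Figure 2.4.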
2.5 Optimization Basics (~4 pages)

A concise review of gradient-based optimization: stochastic gradient descent, mini-batches, momentum, the Adam optimizer, and learning rate schedules.

  • 2.5.1 Gradient descent and SGD
  • 2.5.2 Adam and adaptive methods
  • 2.5.3 Learning rate schedules
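The core update of Section 2.5.1 can be sketched on a one-dimensional loss. This is a minimal gradient-descent example with an invented quadratic loss and learning rate; SGD replaces the exact gradient with a mini-batch estimate, and Adam (Sec. 2.5.2) rescales it with running moment estimates.

```python
# Minimal gradient-descent sketch on L(theta) = (theta - 3)^2.
# The loss, learning rate, and step count are invented for illustration.

def grad(theta):
    """dL/dtheta for L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)

theta = 0.0
lr = 0.1
for step in range(100):
    theta -= lr * grad(theta)  # theta <- theta - eta * dL/dtheta

# theta has converged to the minimizer theta* = 3
```

Each step contracts the error by a constant factor (here 1 - 2 * lr = 0.8), which is why a fixed learning rate suffices on this toy problem; the schedules of Section 2.5.3 matter when the loss surface is less forgiving.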

Key Equations

$$P(w_1, w_2, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1})$$ (2.1)
Chain rule of probability -- Section 2.1
$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$ (2.2)
Cross-entropy loss -- Section 2.3
$$D_{\mathrm{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$ (2.3)
KL divergence -- Section 2.3
$$\mathrm{PP}(W) = 2^{H(P, Q)} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 Q(w_i \mid w_{<i})}$$ (2.4)
Perplexity -- Section 2.3
$$\hat{P}(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \frac{C(w_{t-n+1}, \ldots, w_t)}{C(w_{t-n+1}, \ldots, w_{t-1})}$$ (2.5)
MLE for n-grams -- Section 2.2
$$H(P) = -\sum_{x} P(x) \log P(x)$$ (2.6)
Shannon entropy -- Section 2.3
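Equations 2.2, 2.3, and 2.6 are linked by the identity H(P, Q) = H(P) + D_KL(P || Q): cross-entropy is the true entropy plus the model's divergence from the truth. A quick numeric check, with invented distributions:

```python
# Numeric check of H(P, Q) = H(P) + D_KL(P || Q),
# relating Eqs. 2.2, 2.3, and 2.6. P and Q are invented.
import math

P = [0.5, 0.25, 0.25]
Q = [0.7, 0.2, 0.1]

H_P  = -sum(p * math.log2(p) for p in P)                 # Eq. 2.6
H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))      # Eq. 2.2
D_KL =  sum(p * math.log2(p / q) for p, q in zip(P, Q))  # Eq. 2.3

assert abs(H_PQ - (H_P + D_KL)) < 1e-12
```

This identity is why minimizing cross-entropy loss (Eq. 2.2) is equivalent to minimizing KL divergence: H(P) is a constant of the data, so only the D_KL term depends on the model.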

Key Figures

  • Figure 2.1 (Matplotlib): Probability Distributions over Vocabulary. Bar chart showing P(w | "The cat sat on the") for a small vocabulary, comparing uniform vs. trained model distributions.
  • Figure 2.2 (Matplotlib): Entropy Visualization. Diagram showing entropy as the expected number of bits needed to encode samples from a distribution; compares high-entropy (uniform) vs. low-entropy (peaked) distributions.
  • Figure 2.3 (Matplotlib): Cross-Entropy Comparison. Side-by-side visualization of P (true) vs. Q (model) distributions, with cross-entropy H(P,Q) annotated as the average number of bits under Q when the true distribution is P.
  • Figure 2.4 (TikZ): Perplexity Interpretation. An annotated number line showing perplexity values: PP=|V| (random), PP from typical n-gram models, PP from neural LMs, PP from transformers, PP=1 (perfect). Contextualizes what perplexity values mean in practice.

Exercises

10 exercises (4 theory, 6 programming)

Cross-References

This chapter builds on: Ch 1 (Introduction), which established the prediction lens used throughout.

This chapter is needed for: Chapters 3--15, all of which rely on the chain rule decomposition, cross-entropy loss, perplexity, and optimization toolkit developed here.

Key Papers

  • Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379--423.
  • Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), 50--64.
  • Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of ICLR.
  • Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79--86.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.