Part I · Chapter 2

Mathematical Foundations

Part I: Foundations · Difficulty: Moderate · Length: ~25 pages · Phase 2

Prerequisites

Ch 1: Introduction → Ch 2: Mathematical Foundations

Chapter Summary

Chapter 2 builds the mathematical toolkit that every subsequent chapter depends on: probability theory applied to word sequences, maximum likelihood estimation, information theory (entropy, cross-entropy, KL divergence), perplexity, and gradient-based optimization. Every concept is developed through the prediction lens established in Chapter 1: probability is the formalism for predicting the next word, entropy measures how uncertain those predictions are, cross-entropy measures how far our model's predictions diverge from reality, perplexity exponentiates that gap into an interpretable number, and optimization is how we improve our predictions.

Why this chapter matters: These concepts are not just background -- they ARE the language of language modeling. The chain rule decomposition is the mathematical heart of every language model; cross-entropy is the loss function; perplexity is the evaluation metric. Every chapter from 3 through 15 depends directly on this toolkit.

Learning Objectives

  1. Apply the chain rule of probability to decompose the joint probability of a word sequence into a product of conditional probabilities
  2. Derive the maximum likelihood estimate for language model parameters and explain its connection to cross-entropy minimization
  3. Compute entropy, cross-entropy, and KL divergence for discrete distributions and interpret their meaning in the context of language modeling
  4. Calculate perplexity from cross-entropy and explain why lower perplexity indicates a better language model

Section Outline

2.1 Probability and Conditional Probability (~5 pages)

Reviews the probability foundations needed for language modeling: sample spaces over vocabularies, joint probability of word sequences, marginal and conditional distributions, and the chain rule decomposition.

  • 2.1.1 Probability over vocabularies
  • 2.1.2 Joint and conditional distributions
  • 2.1.3 The chain rule for language
  • 2.1.4 Independence assumptions and the Markov property
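The chain rule decomposition in Section 2.1.3 can be sketched numerically. Below is a minimal illustration with a hand-specified bigram model; the vocabulary and all probability values are invented for this example, not taken from the chapter.

```python
# Toy illustration of the chain rule (Eq. 2.1):
# P(w1, ..., wn) = prod_t P(w_t | w_<t), here with a first-order
# Markov (bigram) independence assumption (Sec. 2.1.4).
# All probabilities below are invented for illustration.

# P(w_t | w_{t-1}); "<s>" marks the sentence start.
bigram = {
    ("<s>", "the"): 0.6,
    ("the", "cat"): 0.3,
    ("cat", "sat"): 0.5,
}

def sequence_prob(words):
    """Multiply conditional probabilities left to right."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        prob *= bigram[(prev, w)]
        prev = w
    return prob

p = sequence_prob(["the", "cat", "sat"])  # 0.6 * 0.3 * 0.5 = 0.09
```

Under the Markov assumption each factor conditions only on the previous word, which is exactly the simplification that makes n-gram models (previewed in Section 2.2.3) tractable.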
2.2 Maximum Likelihood Estimation (~5 pages)

Derives the MLE for a simple language model (counting frequencies), shows the equivalence between maximizing log-likelihood and minimizing cross-entropy loss, and discusses the bias-variance tradeoff.

  • 2.2.1 The likelihood function for language models
  • 2.2.2 Log-likelihood and its properties
  • 2.2.3 MLE as counting (n-gram preview)
  • 2.2.4 Overfitting and the need for smoothing
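The "MLE as counting" result of Section 2.2.3 is short enough to sketch directly: the maximum likelihood estimate for a bigram model is just relative frequency (Eq. 2.5). The toy corpus below is invented for illustration.

```python
# MLE for a bigram model reduces to counting (Eq. 2.5):
# P_hat(w | prev) = C(prev, w) / C(prev).
# The corpus is invented for illustration.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
history_counts = Counter(corpus[:-1])  # counts of each bigram history

def mle_bigram(prev, word):
    """Relative-frequency (maximum likelihood) estimate."""
    if history_counts[prev] == 0:
        # Unseen history: the MLE is undefined; this is the
        # overfitting problem that motivates smoothing (Sec. 2.2.4).
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]
```

Note that any bigram absent from the corpus gets probability exactly zero, which is the failure mode that Section 2.2.4 addresses with smoothing.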
2.3 Information Theory Essentials (~7 pages)

Shannon entropy as the theoretical lower bound on compression, cross-entropy as the loss function, KL divergence as the gap between model and true distribution, and perplexity as the standard evaluation metric.

  • 2.3.1 Entropy of a language
  • 2.3.2 Cross-entropy between distributions
  • 2.3.3 KL divergence and its asymmetry
  • 2.3.4 Perplexity: the language modeler's metric
  • 2.3.5 Bits-per-character and bits-per-token
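The three information-theoretic quantities of Section 2.3 are a few lines each for discrete distributions. The distributions P and Q below are invented toy values standing in for a "true" next-word distribution and a model's prediction.

```python
# Entropy (Eq. 2.6), cross-entropy (Eq. 2.2), and KL divergence
# (Eq. 2.3) for discrete distributions, measured in bits.
# P and Q are invented toy distributions.
import math

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P) >= 0."""
    return cross_entropy(p, q) - entropy(p)

P = [0.5, 0.25, 0.25]   # "true" distribution; H(P) = 1.5 bits
Q = [0.7, 0.2, 0.1]     # model's distribution

# KL is asymmetric (Sec. 2.3.3): D_KL(P||Q) != D_KL(Q||P) in general.
forward, reverse = kl(P, Q), kl(Q, P)
```

The asymmetry in the last two lines is the point of Section 2.3.3: the penalty for putting too little mass where P is large differs from the penalty for putting mass where P is small.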
2.4 Evaluation Metrics (~4 pages)

How language models are evaluated in practice: intrinsic metrics (perplexity on held-out data), limitations of perplexity, and preview of downstream evaluation.

  • 2.4.1 Held-out perplexity
  • 2.4.2 Interpreting perplexity values
  • 2.4.3 When perplexity fails: downstream evaluation
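Held-out perplexity (Section 2.4.1) follows directly from Eq. 2.4: average the per-token log probabilities and exponentiate. The per-token probabilities below are invented for illustration.

```python
# Perplexity from per-token log probabilities (Eq. 2.4):
# PP = 2^(-(1/N) * sum_i log2 Q(w_i | w_<i)).
# The per-token probabilities are invented for illustration.
import math

def perplexity(log2_probs):
    """Exponentiated average negative log2 probability per token."""
    n = len(log2_probs)
    return 2 ** (-sum(log2_probs) / n)

# Model probabilities assigned to a 4-token held-out sequence.
token_probs = [0.25, 0.5, 0.125, 0.25]
pp = perplexity([math.log2(p) for p in token_probs])  # = 4.0
```

A useful sanity check for Section 2.4.2: a model that is uniformly uncertain over a vocabulary of size |V| assigns every token probability 1/|V| and therefore has perplexity exactly |V|, the "random" endpoint of Figure 2.4.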
2.5 Optimization Basics (~4 pages)

A concise review of gradient-based optimization: stochastic gradient descent, mini-batches, momentum, the Adam optimizer, and learning rate schedules.

  • 2.5.1 Gradient descent and SGD
  • 2.5.2 Adam and adaptive methods
  • 2.5.3 Learning rate schedules
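The core update of Section 2.5.1 can be sketched on a one-dimensional loss. This is a minimal gradient-descent example with an invented quadratic loss and learning rate; SGD replaces the exact gradient with a mini-batch estimate, and Adam (Sec. 2.5.2) rescales it with running moment estimates.

```python
# Minimal gradient-descent sketch on L(theta) = (theta - 3)^2.
# The loss, learning rate, and step count are invented for illustration.

def grad(theta):
    """dL/dtheta for L(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)

theta = 0.0
lr = 0.1
for step in range(100):
    theta -= lr * grad(theta)  # theta <- theta - eta * dL/dtheta

# theta has converged to the minimizer theta* = 3
```

Each step contracts the error by a constant factor (here 1 - 2 * lr = 0.8), which is why a fixed learning rate suffices on this toy problem; the schedules of Section 2.5.3 matter when the loss surface is less forgiving.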

Key Equations

$$P(w_1, w_2, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1})$$ (2.1)
Chain rule of probability -- Section 2.1
$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$ (2.2)
Cross-entropy loss -- Section 2.3
$$D_{\mathrm{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$ (2.3)
KL divergence -- Section 2.3
$$\mathrm{PP}(W) = 2^{H(P, Q)} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 Q(w_i \mid w_{<i})}$$ (2.4)
Perplexity -- Section 2.3
$$\hat{P}(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \frac{C(w_{t-n+1}, \ldots, w_t)}{C(w_{t-n+1}, \ldots, w_{t-1})}$$ (2.5)
MLE for n-grams -- Section 2.2
$$H(P) = -\sum_{x} P(x) \log P(x)$$ (2.6)
Shannon entropy -- Section 2.3
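Equations 2.2, 2.3, and 2.6 are linked by the identity H(P, Q) = H(P) + D_KL(P || Q): cross-entropy is the true entropy plus the model's divergence from the truth. A quick numeric check, with invented distributions:

```python
# Numeric check of H(P, Q) = H(P) + D_KL(P || Q),
# relating Eqs. 2.2, 2.3, and 2.6. P and Q are invented.
import math

P = [0.5, 0.25, 0.25]
Q = [0.7, 0.2, 0.1]

H_P  = -sum(p * math.log2(p) for p in P)                 # Eq. 2.6
H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))      # Eq. 2.2
D_KL =  sum(p * math.log2(p / q) for p, q in zip(P, Q))  # Eq. 2.3

assert abs(H_PQ - (H_P + D_KL)) < 1e-12
```

This identity is why minimizing cross-entropy loss (Eq. 2.2) is equivalent to minimizing KL divergence: H(P) is a constant of the data, so only the D_KL term depends on the model.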

Key Figures

  • Figure 2.1 (Matplotlib): Probability Distributions over Vocabulary. Bar chart showing P(w | "The cat sat on the") for a small vocabulary, comparing uniform vs. trained model distributions.
  • Figure 2.2 (Matplotlib): Entropy Visualization. Diagram showing entropy as the expected number of bits needed to encode samples from a distribution; compares high-entropy (uniform) vs. low-entropy (peaked) distributions.
  • Figure 2.3 (Matplotlib): Cross-Entropy Comparison. Side-by-side visualization of P (true) vs. Q (model) distributions, with cross-entropy H(P,Q) annotated as the average number of bits under Q when the true distribution is P.
  • Figure 2.4 (TikZ): Perplexity Interpretation. An annotated number line showing perplexity values: PP=|V| (random), PP from typical n-gram models, PP from neural LMs, PP from transformers, PP=1 (perfect). Contextualizes what perplexity values mean in practice.

Exercises

10 exercises (4 theory, 6 programming)

Cross-References

This chapter builds on: Ch 1 (Introduction), which established the prediction lens used throughout.

This chapter is needed for: Chapters 3--15, all of which rely on the chain rule decomposition, cross-entropy loss, perplexity, and optimization toolkit developed here.

Key Papers

  • Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379--423.
  • Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), 50--64.
  • Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of ICLR.
  • Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79--86.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.