Mathematical Foundations
Prerequisites
Chapter Summary
Chapter 2 builds the mathematical toolkit that every subsequent chapter depends on: probability theory applied to word sequences, maximum likelihood estimation, information theory (entropy, cross-entropy, KL divergence), perplexity, and gradient-based optimization. Every concept is developed through the prediction lens established in Chapter 1: probability is the formalism for predicting the next word; entropy measures how uncertain those predictions are; cross-entropy measures how far our model's predictions diverge from reality; perplexity exponentiates that gap into an interpretable number; and optimization is how we improve our predictions.
Learning Objectives
- Apply the chain rule of probability to decompose the joint probability of a word sequence into a product of conditional probabilities
- Derive the maximum likelihood estimate for language model parameters and explain its connection to cross-entropy minimization
- Compute entropy, cross-entropy, and KL divergence for discrete distributions and interpret their meaning in the context of language modeling
- Calculate perplexity from cross-entropy and explain why lower perplexity indicates a better language model
Section Outline
2.1 Probability and Conditional Probability (~5 pages)
Reviews the probability foundations needed for language modeling: sample spaces over vocabularies, joint probability of word sequences, marginal and conditional distributions, and the chain rule decomposition.
- 2.1.1 Probability over vocabularies
- 2.1.2 Joint and conditional distributions
- 2.1.3 The chain rule for language
- 2.1.4 Independence assumptions and the Markov property
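The chain rule decomposition of 2.1.3 can be previewed numerically. A minimal sketch, assuming made-up conditional probabilities for a three-word sentence (these are toy values, not estimates from any corpus):

```python
import math

# Hypothetical conditionals for "the cat sat" (illustrative values only):
# P(the), P(cat | the), P(sat | the, cat)
cond_probs = [0.20, 0.05, 0.10]

# Chain rule: P(w_1 .. w_n) = prod_t P(w_t | w_1 .. w_{t-1})
joint = math.prod(cond_probs)
print(joint)  # 0.001
```

Under a first-order Markov assumption (2.1.4), each conditional would depend only on the immediately preceding word, shrinking the histories but leaving the product structure unchanged.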
2.2 Maximum Likelihood Estimation (~5 pages)
Derives the MLE for a simple language model (relative-frequency counting), shows the equivalence between maximizing log-likelihood and minimizing cross-entropy loss, and discusses the bias-variance tradeoff.
- 2.2.1 The likelihood function for language models
- 2.2.2 Log-likelihood and its properties
- 2.2.3 MLE as counting (n-gram preview)
- 2.2.4 Overfitting and the need for smoothing
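The "MLE as counting" result of 2.2.3 can be sketched with a toy bigram model; the corpus below is illustrative only:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# MLE for a bigram model is relative frequency:
# P(w2 | w1) = count(w1, w2) / count(w1)
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_mle(w2, w1):
    return bigram[(w1, w2)] / unigram[w1]

print(p_mle("cat", "the"))  # 2/3: "the" appears 3 times, followed by "cat" twice
```

Note that any bigram unseen in the corpus receives probability zero under this estimator, which is exactly the overfitting problem that motivates the smoothing of 2.2.4.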
2.3 Information Theory Essentials (~7 pages)
Introduces Shannon entropy as the theoretical lower bound on compression, cross-entropy as the loss function, KL divergence as the gap between the model and the true distribution, and perplexity as the standard evaluation metric.
- 2.3.1 Entropy of a language
- 2.3.2 Cross-entropy between distributions
- 2.3.3 KL divergence and its asymmetry
- 2.3.4 Perplexity: the language modeler's metric
- 2.3.5 Bits-per-character and bits-per-token
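For discrete distributions, the four quantities of 2.3.1-2.3.4 reduce to a few lines of arithmetic. A sketch with toy distributions p (true) and q (model) over a four-word vocabulary:

```python
import math

p = [0.5, 0.25, 0.125, 0.125]  # "true" distribution (toy values)
q = [0.25, 0.25, 0.25, 0.25]   # uniform model distribution

# H(p): entropy in bits
entropy = -sum(pi * math.log2(pi) for pi in p)
# H(p, q): cross-entropy of the model against the truth
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
# D_KL(p || q) = H(p, q) - H(p): the model's excess bits
kl = cross_entropy - entropy
# Perplexity exponentiates the cross-entropy (base must match the log)
perplexity = 2 ** cross_entropy

print(entropy, cross_entropy, kl, perplexity)  # 1.75 2.0 0.25 4.0
```

The identity D_KL(p || q) = H(p, q) - H(p) makes the chapter's central point concrete: since H(p) is fixed by the data, minimizing cross-entropy is the same as minimizing the KL gap to the true distribution.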
2.4 Evaluation Metrics (~4 pages)
How language models are evaluated in practice: intrinsic metrics (perplexity on held-out data), the limitations of perplexity, and a preview of downstream evaluation.
- 2.4.1 Held-out perplexity
- 2.4.2 Interpreting perplexity values
- 2.4.3 When perplexity fails: downstream evaluation
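Held-out perplexity (2.4.1) is the exponential of the average negative log-probability a model assigns to unseen tokens. A minimal sketch, assuming hypothetical per-token probabilities from some model:

```python
import math

# Probabilities a hypothetical model assigns to each held-out token
token_probs = [0.1, 0.2, 0.05, 0.1]

# Average negative log-likelihood per token (nats), then exponentiate
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(avg_nll)
print(ppl)  # 10.0: the inverse geometric mean of the token probabilities
```

The result reads as "the model is, on average, as uncertain as if it were choosing uniformly among 10 words at each step," which is the interpretation 2.4.2 develops.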
2.5 Optimization Basics (~4 pages)
A concise review of gradient-based optimization: stochastic gradient descent, mini-batches, momentum, the Adam optimizer, and learning rate schedules.
- 2.5.1 Gradient descent and SGD
- 2.5.2 Adam and adaptive methods
- 2.5.3 Learning rate schedules
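The update rules of 2.5.1 and 2.5.2 can be sketched on a one-dimensional objective; all hyperparameter values below are illustrative, not recommendations:

```python
import math

# Minimize f(w) = (w - 3)^2; its gradient is 2(w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

# Plain gradient descent: w <- w - lr * grad(w)
w_gd, lr = 0.0, 0.1
for _ in range(100):
    w_gd -= lr * grad(w_gd)

# Adam (Kingma & Ba, 2015): exponential moving averages of the
# gradient (m) and squared gradient (v), with bias correction
w_adam, m, v = 0.0, 0.0, 0.0
b1, b2, eps = 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    w_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(round(w_gd, 4), round(w_adam, 4))  # both approach the minimum at 3.0
```

Adam's per-parameter scaling by the second-moment estimate is what makes it robust to the learning rate choice, a theme that 2.5.3's schedules revisit for the raw step size.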
Key Equations
Key Figures
Exercises
10 exercises (4 theory, 6 programming)
Cross-References
This chapter builds on:
- Ch 1: Introduction -- the prediction paradigm and notation conventions
This chapter is needed for:
- Ch 3: Classical Language Models -- uses the chain rule, MLE, cross-entropy, and perplexity directly for n-gram models
- Ch 4: Word Representations -- uses optimization basics (SGD, Adam) for training word embeddings
Key Papers
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379--423.
- Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), 50--64.
- Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79--86.
- Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Proceedings of ICLR.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.