Part I · Chapter 3

Classical Language Models

Part I: Foundations · Moderate · ~20 pages · Phase 2

Prerequisites

  • Ch 2: Mathematical Foundations → Ch 3: Classical Language Models

Chapter Summary

Chapter 3 is where the prediction paradigm first becomes a working system. The reader has the mathematical toolkit from Chapter 2 (chain rule, MLE, perplexity); now we apply it to build the simplest concrete language model: the n-gram. The chapter establishes a clear performance baseline -- perplexity of unigram through 5-gram models, with and without smoothing -- that every neural approach in subsequent chapters must beat. Equally important, the chapter carefully diagnoses why count-based models fail: the curse of dimensionality, the inability to generalize across semantically similar words, and the fixed context window that ignores long-range dependencies.

Why this chapter matters: The three limitations of n-grams directly motivate Chapter 4 (dense word representations for generalization), Chapter 5 (RNNs for unbounded context), and ultimately the Transformer (Chapter 8) as the comprehensive solution. Smoothing -- the art of assigning probability to unseen events -- remains conceptually relevant even in the neural era through dropout, label smoothing, and softmax temperature.

Learning Objectives

  1. Build an n-gram language model from a text corpus by computing conditional probabilities from count statistics
  2. Apply smoothing techniques (add-k, Kneser-Ney, interpolation) to handle unseen n-grams and explain the tradeoffs between them
  3. Evaluate a language model using held-out perplexity and interpret the results in terms of model quality
  4. Articulate the fundamental limitations of count-based models -- sparsity, fixed context, and inability to generalize -- that motivate neural approaches

Section Outline

3.1 N-gram Language Models (~6 pages)

Applies the chain rule with the Markov assumption. Shows how to estimate probabilities by counting. Walks through a complete bigram model on a small corpus step by step.

  • 3.1.1 The Markov assumption
  • 3.1.2 Unigram, bigram, and trigram models
  • 3.1.3 Counting and probability estimation
  • 3.1.4 A worked example: building a bigram model
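The counting recipe of Sections 3.1.3-3.1.4 can be sketched in a few lines of Python (a minimal illustration of Eq. 3.1 for the bigram case; whitespace tokenization and the `<s>`/`</s>` boundary markers are choices of this sketch, not fixed by the chapter):

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram probabilities P(w_t | w_{t-1}) by counting (Eq. 3.1)."""
    unigram = Counter()   # denominators C(w_{t-1})
    bigram = Counter()    # numerators   C(w_{t-1}, w_t)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigram.update(tokens[:-1])             # every token serves as a context once
        bigram.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

corpus = ["the cat sat", "the dog sat", "the cat ran"]
probs = train_bigram(corpus)
print(probs[("the", "cat")])  # C(the, cat) / C(the) = 2/3
```

Note that the returned dictionary contains only *observed* bigrams; every unseen pair implicitly has probability zero, which is exactly the problem Section 3.2 addresses.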
3.2 Smoothing Techniques (~6 pages)

The zero-probability problem and its solutions: add-k (Laplace) smoothing, Good-Turing estimation, Kneser-Ney smoothing with continuation counts, and interpolation/backoff strategies.

  • 3.2.1 The zero-frequency problem
  • 3.2.2 Additive (Laplace) smoothing
  • 3.2.3 Good-Turing discounting
  • 3.2.4 Kneser-Ney smoothing
  • 3.2.5 Interpolation and backoff
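The simplest of these fixes, add-k smoothing (Eq. 3.2), can be sketched directly; the toy counts below are invented for illustration. The key behavior to observe is that the unseen bigram gets nonzero mass while the distribution still sums to one:

```python
from collections import Counter

def add_k_prob(bigram, unigram, vocab, w_prev, w, k=1.0):
    """Add-k smoothed bigram probability (Eq. 3.2): (C + k) / (C(prev) + k|V|)."""
    return (bigram[(w_prev, w)] + k) / (unigram[w_prev] + k * len(vocab))

# Toy counts: "the cat" seen twice, "the dog" once, "the fish" never.
unigram = Counter({"the": 3, "cat": 2, "dog": 1})
bigram = Counter({("the", "cat"): 2, ("the", "dog"): 1})
vocab = {"the", "cat", "dog", "fish"}

print(add_k_prob(bigram, unigram, vocab, "the", "fish"))  # 1/7: unseen, but > 0
total = sum(add_k_prob(bigram, unigram, vocab, "the", w) for w in vocab)
print(total)  # 1.0: smoothing redistributes mass, it does not create any
```

With k=1 and |V|=4, the seen bigram "the cat" drops from 2/3 to 3/7; that lost mass is what funds the 1/7 now assigned to "the fish".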
3.3 Language Model Evaluation (~4 pages)

Applies the perplexity metric from Ch 2 to evaluate n-gram models. Shows how perplexity decreases as n increases but saturates due to data sparsity.

  • 3.3.1 Computing perplexity in practice
  • 3.3.2 The effect of n on perplexity
  • 3.3.3 Domain effects and out-of-vocabulary words
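Computing perplexity (Section 3.3.1) is best done in log space to avoid floating-point underflow on long texts. A minimal sketch for a bigram model, where `fallback` is a placeholder for unseen bigrams that a smoothed model from Section 3.2 would replace:

```python
import math

def perplexity(test_tokens, bigram_probs, fallback=1e-10):
    """Held-out perplexity (Eq. 3.5) via the average negative log-probability."""
    log_sum = 0.0
    for w_prev, w in zip(test_tokens, test_tokens[1:]):
        p = bigram_probs.get((w_prev, w), fallback)  # smoothing replaces this hack
        log_sum += math.log(p)
    n = len(test_tokens) - 1                         # number of predicted tokens
    return math.exp(-log_sum / n)

# Hand-built probabilities for a three-step test sequence.
bigram_probs = {("<s>", "the"): 1.0, ("the", "cat"): 0.5, ("cat", "</s>"): 1.0}
print(perplexity(["<s>", "the", "cat", "</s>"], bigram_probs))  # 2^(1/3) ≈ 1.26
```

The geometric-mean form of Eq. 3.5 and the exponentiated average log form computed here are algebraically identical; only the latter is numerically safe.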
3.4 Limitations of Count-Based Models (~4 pages)

Why n-gram models cannot scale: exponential growth of possible n-grams, no notion of word similarity, and fixed context windows. Directly motivates Chapters 4-5.

  • 3.4.1 The curse of dimensionality in language
  • 3.4.2 No generalization across similar words
  • 3.4.3 Fixed context windows
  • 3.4.4 Looking ahead: neural language models
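The sparsity problem of Section 3.4.1 is easy to demonstrate empirically: even for a tiny vocabulary, almost none of the |V|² possible bigram cells are ever observed (a toy sketch; the twelve-token corpus is invented for illustration):

```python
def coverage(train_tokens):
    """Fraction of the |V|^2 possible bigrams actually observed in training."""
    vocab = set(train_tokens)
    seen = set(zip(train_tokens, train_tokens[1:]))
    return len(seen) / len(vocab) ** 2

tokens = "the cat sat on the mat the dog sat on the rug".split()
print(coverage(tokens))  # 9/49 ≈ 0.18: most of the grid is already empty
```

At |V| = 7 the table is already over 80% zeros; at a realistic |V| = 50,000 a trigram table has 50,000³ ≈ 1.25 × 10¹⁴ cells, which no corpus can populate. This is the quantitative face of Figure 3.4.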

Key Equations

$$P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \frac{C(w_{t-n+1}, \ldots, w_t)}{C(w_{t-n+1}, \ldots, w_{t-1})}$$ (3.1)
N-gram probability -- Section 3.1
$$P_{\text{add-k}}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t) + k}{C(w_{t-1}) + k|V|}$$ (3.2)
Add-k smoothing -- Section 3.2
$$P_{\text{KN}}(w_t \mid w_{t-1}) = \frac{\max(C(w_{t-1}, w_t) - d, 0)}{C(w_{t-1})} + \lambda(w_{t-1}) P_{\text{continuation}}(w_t)$$ (3.3a)
Kneser-Ney smoothing -- Section 3.2
$$P_{\text{continuation}}(w_t) = \frac{|\{w' : C(w', w_t) > 0\}|}{|\{(w', w'') : C(w', w'') > 0\}|}$$ (3.3b)
Continuation probability -- Section 3.2
$$P_{\text{interp}}(w_t \mid w_{t-2}, w_{t-1}) = \lambda_3 P_3(w_t \mid w_{t-2}, w_{t-1}) + \lambda_2 P_2(w_t \mid w_{t-1}) + \lambda_1 P_1(w_t) \quad \text{with } \sum \lambda_i = 1$$ (3.4)
Interpolation -- Section 3.2
$$\mathrm{PP}(W) = \left(\prod_{t=1}^{N} \frac{1}{P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})}\right)^{1/N}$$ (3.5)
Perplexity for n-gram models -- Section 3.3
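Equations 3.3a-3.3b translate almost line-for-line into code. In this sketch λ(w_{t-1}) is set to d times the number of distinct continuations of the context divided by its count, which is the standard choice that makes the discounted mass and the continuation mass sum to one (the chapter's equations leave λ unspecified):

```python
from collections import Counter

def kneser_ney(bigram, unigram, d=0.75):
    """Interpolated Kneser-Ney bigram model (Eqs. 3.3a-3.3b)."""
    cont = Counter(w for (_, w) in bigram)  # |{w' : C(w', w) > 0}| per word w
    n_bigram_types = len(bigram)            # |{(w', w'') : C(w', w'') > 0}|

    def prob(w_prev, w):
        discounted = max(bigram[(w_prev, w)] - d, 0) / unigram[w_prev]
        n_follow = len({w2 for (w1, w2) in bigram if w1 == w_prev})
        lam = d * n_follow / unigram[w_prev]   # mass freed by discounting
        return discounted + lam * cont[w] / n_bigram_types

    return prob

tokens = "the cat the dog the cat".split()
unigram = Counter(tokens[:-1])               # context counts C(w_{t-1})
bigram = Counter(zip(tokens, tokens[1:]))
p = kneser_ney(bigram, unigram)
print(sum(p("the", w) for w in set(tokens)))  # 1.0: a proper distribution
```

Note the signature move: P_continuation rewards words that follow *many different* contexts, not merely frequent words, which is why Kneser-Ney outperforms plain absolute discounting.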

Key Figures

TikZ
Figure 3.1: N-gram Counting Example
A small corpus (3-4 sentences) with a bigram count table and the resulting probability matrix.
Matplotlib
Figure 3.2: Smoothing Comparison Plot
Probability distribution over vocabulary for a given context under (a) MLE, (b) add-1, (c) Kneser-Ney. Shows how smoothing redistributes mass to unseen events.
Matplotlib
Figure 3.3: Perplexity vs n
Curve showing perplexity on held-out data as n increases from 1 to 5, showing diminishing returns and eventual degradation due to sparsity.
Matplotlib
Figure 3.4: Sparsity Illustration
Heatmap of a bigram count matrix for a moderate vocabulary (~50 words) showing that most cells are zero. Visually demonstrates the curse of dimensionality.
TikZ
Figure 3.5: LM Generation Example
Side-by-side text generated from unigram, bigram, trigram, and 5-gram models, showing quality increase with context length.

Exercises

8 exercises (3 theory, 5 programming)

Cross-References

This chapter builds on:

  • Ch 2: Mathematical Foundations -- supplies the chain rule, MLE, and perplexity applied throughout (Sec 3.1, 3.3)

This chapter is needed for:

  • Ch 5: Sequence Models -- neural language models are motivated by n-gram limitations (Sec 3.4) and compared against n-gram baselines (Sec 3.3)

Key Papers

  • Chen, S. F. & Goodman, J. (1999). An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech & Language, 13(4), 359--394.
  • Kneser, R. & Ney, H. (1995). Improved Backing-Off for M-Gram Language Modeling. Proceedings of ICASSP, Vol. 1, 181--184.
  • Jelinek, F. & Mercer, R. L. (1980). Interpolated Estimation of Markov Source Parameters from Sparse Data. Proceedings of the Workshop on Pattern Recognition in Practice, 381--397.
  • Good, I. J. (1953). The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika, 40(3/4), 237--264.
  • Bengio, Y. et al. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137--1155.
  • Jelinek, F. (1976). Continuous Speech Recognition by Statistical Methods. Proceedings of the IEEE, 64(4), 532--556.