Classical Language Models
Chapter Summary
Chapter 3 is where the prediction paradigm first becomes a working system. The reader has the mathematical toolkit from Chapter 2 (chain rule, MLE, perplexity); now we apply it to build the simplest concrete language model: the n-gram. The chapter establishes a clear performance baseline -- perplexity of unigram through 5-gram models, with and without smoothing -- that every neural approach in subsequent chapters must beat. Equally important, the chapter carefully diagnoses why count-based models fail: the curse of dimensionality, the inability to generalize across semantically similar words, and the fixed context window that ignores long-range dependencies.
Learning Objectives
- Build an n-gram language model from a text corpus by computing conditional probabilities from count statistics
- Apply smoothing techniques (add-k, Kneser-Ney, interpolation) to handle unseen n-grams and explain the tradeoffs between them
- Evaluate a language model using held-out perplexity and interpret the results in terms of model quality
- Articulate the fundamental limitations of count-based models -- sparsity, fixed context, and inability to generalize -- that motivate neural approaches
Section Outline
3.1 N-gram Language Models (~6 pages)
Applies the chain rule with the Markov assumption. Shows how to estimate probabilities by counting. Walks through a complete bigram model on a small corpus step by step.
- 3.1.1 The Markov assumption
- 3.1.2 Unigram, bigram, and trigram models
- 3.1.3 Counting and probability estimation
- 3.1.4 A worked example: building a bigram model
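The counting recipe in 3.1.3 and the worked example in 3.1.4 can be sketched in a few lines. This is a minimal illustration, not the chapter's reference implementation; the toy corpus and the `train_bigram` name are ours.

```python
from collections import defaultdict

def train_bigram(tokens):
    """MLE bigram estimates: P(curr | prev) = count(prev, curr) / count(prev)."""
    context_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for prev, curr in zip(tokens, tokens[1:]):
        context_counts[prev] += 1
        pair_counts[(prev, curr)] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}

# Toy corpus with sentence markers; the cross-sentence pair (</s>, <s>)
# is counted too, which is harmless for this illustration.
corpus = "<s> the cat sat </s> <s> the dog sat </s>".split()
probs = train_bigram(corpus)
# "the" is followed once by "cat" and once by "dog", so P(cat | the) = 1/2
```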
3.2 Smoothing Techniques (~6 pages)
The zero-probability problem and its solutions: add-k (Laplace) smoothing, Good-Turing estimation, Kneser-Ney smoothing with continuation counts, and interpolation/backoff strategies.
- 3.2.1 The zero-frequency problem
- 3.2.2 Additive (Laplace) smoothing
- 3.2.3 Good-Turing discounting
- 3.2.4 Kneser-Ney smoothing
- 3.2.5 Interpolation and backoff
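The simplest of these fixes, additive smoothing (3.2.2), amounts to one change to the MLE formula: add k to every count and renormalize by k times the vocabulary size. A hedged sketch, with illustrative names and a toy corpus of our own:

```python
from collections import Counter

def addk_bigram_prob(prev, curr, pair_counts, context_counts, vocab_size, k=1.0):
    """Add-k smoothed P(curr | prev) = (count + k) / (context count + k * V)."""
    return (pair_counts[(prev, curr)] + k) / (context_counts[prev] + k * vocab_size)

tokens = "<s> the cat sat </s>".split()
pair_counts = Counter(zip(tokens, tokens[1:]))  # Counter returns 0 for unseen pairs
context_counts = Counter(tokens[:-1])           # each token's count as a history
V = len(set(tokens))

p_seen = addk_bigram_prob("the", "cat", pair_counts, context_counts, V)
p_unseen = addk_bigram_prob("the", "dog", pair_counts, context_counts, V)
# the unseen bigram ("the", "dog") now gets nonzero mass instead of probability 0
```

The cost, which the section's comparison of techniques makes precise, is that add-k takes the same amount of mass from every context; Kneser-Ney's continuation counts redistribute it far more carefully.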
3.3 Language Model Evaluation (~4 pages)
Applies the perplexity metric from Ch 2 to evaluate n-gram models. Shows how perplexity decreases as n increases but saturates due to data sparsity.
- 3.3.1 Computing perplexity in practice
- 3.3.2 The effect of n on perplexity
- 3.3.3 Domain effects and out-of-vocabulary words
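The "in practice" part of 3.3.1 mostly means working in log space so products of many small probabilities do not underflow. A minimal sketch of the held-out computation, assuming per-token log-probabilities have already been obtained from some model:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-average log-probability per held-out token)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Sanity check: a model that assigns every token probability 1/4
# should have perplexity exactly 4.
logps = [math.log(0.25)] * 10
pp = perplexity(logps)
```

Note that any zero-probability token makes the log-probability negative infinity and the perplexity infinite, which is exactly why evaluation on held-out data forces the smoothing question of Section 3.2.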
3.4 Limitations of Count-Based Models (~4 pages)
Why n-gram models cannot scale: exponential growth of possible n-grams, no notion of word similarity, and fixed context windows. Directly motivates Chapters 4-5.
- 3.4.1 The curse of dimensionality in language
- 3.4.2 No generalization across similar words
- 3.4.3 Fixed context windows
- 3.4.4 Looking ahead: neural language models
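The exponential growth in 3.4.1 is easy to make concrete: with vocabulary size V there are V**n distinct n-grams, so the count table outgrows any corpus. The vocabulary size below is illustrative, not a figure from the chapter:

```python
# With vocabulary size V, there are V**n possible n-grams.
V = 50_000  # a modest vocabulary (illustrative)
table_sizes = {n: V ** n for n in (1, 2, 3, 5)}
# Already at n = 3 there are 1.25e14 possible trigrams -- vastly more
# than the tokens in any training corpus, so almost all counts are zero.
```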
Exercises
8 exercises (3 theory, 5 programming)
Cross-References
This chapter builds on:
- Ch 1: Introduction -- the prediction paradigm framework
- Ch 2: Mathematical Foundations -- chain rule decomposition (Sec 2.1), MLE counting formula (Sec 2.2), perplexity (Sec 2.3)
This chapter is needed for:
- Ch 5: Sequence Models -- neural language models are motivated by n-gram limitations (Sec 3.4) and compared against n-gram baselines (Sec 3.3)
Key Papers
- Chen, S. F. & Goodman, J. (1999). An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech & Language, 13(4), 359--394.
- Kneser, R. & Ney, H. (1995). Improved Backing-Off for M-Gram Language Modeling. Proceedings of ICASSP, Vol. 1, 181--184.
- Jelinek, F. & Mercer, R. L. (1980). Interpolated Estimation of Markov Source Parameters from Sparse Data. Proceedings of the Workshop on Pattern Recognition in Practice, 381--397.
- Good, I. J. (1953). The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika, 40(3/4), 237--264.
- Bengio, Y. et al. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137--1155.
- Jelinek, F. (1976). Continuous Speech Recognition by Statistical Methods. Proceedings of the IEEE, 64(4), 532--556.