Introduction: What Does It Mean to Predict the Next Word?
Chapter Summary
Chapter 1 establishes the book's central thesis: that next-word prediction is the single unifying principle connecting seventy-five years of language technology, from Shannon's information-theoretic experiments in 1948 through statistical n-gram models, neural language models, and the modern large language model era. The chapter frames every subsequent chapter as a variation on the same fundamental question -- given what we have seen so far, what comes next? By grounding the entire book in $P(w_{\text{next}} \mid \text{context})$, the chapter gives readers a conceptual scaffold that transforms a sprawling field into a coherent narrative. It also provides the historical arc, the four-part book roadmap, and the mathematical and programming prerequisites necessary for what follows.
Learning Objectives
- Explain why next-word prediction is the unifying principle that connects classical NLP, neural language models, and modern large language models
- Trace the historical development of language modeling from Shannon's information-theoretic experiments through statistical models to transformer-based LLMs
- Describe the four-part structure of the book and map each part to a phase in the evolution of language modeling
- Identify the mathematical, programming, and machine learning prerequisites needed for the remaining chapters
Section Outline
1.1 The Prediction Paradigm (~5 pages)
How "predicting the next word" unifies NLP from Shannon to GPT. Introduces $P(w_{\text{next}} \mid \text{context})$ as the core idea underlying text generation, machine translation, speech recognition, and modern chatbots.
- 1.1.1 What is a language model?
- 1.1.2 Prediction as the common thread
- 1.1.3 From probabilities to applications
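The prediction paradigm above can be made concrete with a minimal sketch. The snippet below is illustrative only (the toy corpus and the helper `p_next` are assumptions, not material from the chapter): it estimates $P(w_{\text{next}} \mid \text{context})$ by relative frequency over adjacent word pairs, the simplest possible language model.

```python
# Minimal sketch: next-word prediction from bigram counts.
# The corpus and function names here are illustrative assumptions.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each one-word context.
follow = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow[prev][nxt] += 1

def p_next(context_word):
    """Estimate P(w_next | context_word) by relative frequency."""
    counts = follow[context_word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(p_next("the"))  # e.g. {'cat': 0.666..., 'mat': 0.333...}
```

Everything that follows in the book -- n-grams, neural networks, transformers -- can be read as progressively better ways of estimating this same conditional distribution.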
1.2 A Brief History of Language Modeling (~6 pages)
From Markov chains and Shannon's noisy channel to n-gram models, neural language models, and the transformer revolution. Key milestones from Shannon (1948) through frontier LLMs (2023-2026).
- 1.2.1 Shannon and information theory
- 1.2.2 Statistical language models (n-grams)
- 1.2.3 The neural turn (embeddings, RNNs)
- 1.2.4 Attention and the Transformer
- 1.2.5 The large language model era
1.3 How This Book Is Organized (~5 pages)
The four-part structure: Foundations, Neural Language Models, The Transformer Revolution, and Frontiers. Explains the conceptual dependency ordering and suggests reading paths for different audiences.
- 1.3.1 Part I: Foundations
- 1.3.2 Part II: Neural Language Models
- 1.3.3 Part III: The Transformer Revolution
- 1.3.4 Part IV: Frontiers
- 1.3.5 Suggested reading paths
1.4 Prerequisites and Notation (~4 pages)
Mathematical prerequisites (linear algebra, probability, calculus), programming prerequisites (Python, PyTorch), and the notation conventions used throughout the book.
- 1.4.1 Mathematical prerequisites
- 1.4.2 Programming prerequisites
- 1.4.3 Notation conventions
- 1.4.4 The companion repository
Key Equations
This is a conceptual chapter with no formal derivations. The only mathematical expression introduced is the informal notation $P(w_{\text{next}} \mid w_1, w_2, \ldots, w_{t-1})$, which foreshadows the language modeling objective developed formally in later chapters.
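As a rough illustration of what this notation denotes in practice (not material from the chapter itself), a model typically produces a score for every word in its vocabulary, and a softmax turns those scores into the distribution $P(w_{\text{next}} \mid w_1, \ldots, w_{t-1})$. The vocabulary and scores below are made up for the sketch:

```python
# Hedged sketch: converting hypothetical model scores (logits) into
# a probability distribution over the next word. Values are invented.
import math

vocab = ["mat", "dog", "moon"]
logits = [2.0, 0.5, -1.0]  # assumed scores, one per vocabulary word

exps = [math.exp(z) for z in logits]
total = sum(exps)
dist = {w: e / total for w, e in zip(vocab, exps)}

# The result is a valid probability distribution over the vocabulary.
assert abs(sum(dist.values()) - 1.0) < 1e-9
```

The softmax itself is derived properly in the mathematical foundations chapter; this sketch only shows the shape of the object the notation refers to.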
Key Figures
Exercises
8 exercises (3 theory, 5 programming)
Cross-References
This chapter builds on:
- No prior chapters (this is the first chapter)
This chapter is needed for:
- Ch 2: Mathematical Foundations -- provides the prediction paradigm and notation conventions
- All subsequent chapters assume the reader understands the "prediction as unifying theme" framing, the historical progression, and the book's four-part structure
Key Papers
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379--423.
- Shannon, C. E. (1951). Prediction and Entropy of Printed English. Bell System Technical Journal, 30(1), 50--64.
- Bengio, Y. et al. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137--1155.
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS, 30, 5998--6008.
- Brown, T. B. et al. (2020). Language Models Are Few-Shot Learners. NeurIPS, 33, 1877--1901.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.