Part I · Chapter 1

Introduction: What Does It Mean to Predict the Next Word?

Part I: Foundations · Moderate · ~20 pages · Phase 2

Prerequisites

Ch 1: Introduction (No prerequisites -- this is the first chapter)

Chapter Summary

Chapter 1 establishes the book's central thesis: that next-word prediction is the single unifying principle connecting seventy-five years of language technology, from Shannon's information-theoretic experiments in 1948 through statistical n-gram models, neural language models, and the modern large language model era. The chapter frames every subsequent chapter as a variation on the same fundamental question -- given what we have seen so far, what comes next? By grounding the entire book in $P(w_{\text{next}} \mid \text{context})$, the chapter gives readers a conceptual scaffold that transforms a sprawling field into a coherent narrative. It also provides the historical arc, the four-part book roadmap, and the mathematical and programming prerequisites necessary for what follows.

Why this chapter matters: This chapter establishes that predicting the next word is not merely one task among many -- it is THE fundamental operation that connects Shannon's information theory to modern chatbots. Every chapter that follows is motivated as a better way to solve this single prediction problem.
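The thesis above can be made concrete in a few lines of code. The following sketch (illustrative only, not taken from the book's companion material) estimates $P(w_{\text{next}} \mid \text{context})$ from a toy corpus using bigram counts -- the simplest possible answer to "given what we have seen so far, what comes next?":

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model would use billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigram successors: counts[w][v] = how often v follows w.
counts = defaultdict(Counter)
for w, v in zip(corpus, corpus[1:]):
    counts[w][v] += 1

def p_next(word):
    """Estimate P(w_next | word) by relative frequency."""
    followers = counts[word]
    total = sum(followers.values())
    return {v: c / total for v, c in followers.items()}

print(p_next("the"))  # 'cat' is the most likely word after 'the' in this corpus
```

Every model family covered later in the book -- n-grams, RNNs, Transformers -- refines how this distribution is computed, but the question it answers never changes.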

Learning Objectives

  1. Explain why next-word prediction is the unifying principle that connects classical NLP, neural language models, and modern large language models
  2. Trace the historical development of language modeling from Shannon's information-theoretic experiments through statistical models to transformer-based LLMs
  3. Describe the four-part structure of the book and map each part to a phase in the evolution of language modeling
  4. Identify the mathematical, programming, and machine learning prerequisites needed for the remaining chapters

Section Outline

1.1 The Prediction Paradigm (~5 pages)

How "predicting the next word" unifies NLP from Shannon to GPT. Introduces $P(w_{\text{next}} \mid \text{context})$ as the core idea underlying text generation, machine translation, speech recognition, and modern chatbots.

  • 1.1.1 What is a language model?
  • 1.1.2 Prediction as the common thread
  • 1.1.3 From probabilities to applications

1.2 A Brief History of Language Modeling (~6 pages)

From Markov chains and Shannon's noisy channel to n-gram models, neural language models, and the transformer revolution. Key milestones from Shannon (1948) through frontier LLMs (2023-2026).

  • 1.2.1 Shannon and information theory
  • 1.2.2 Statistical language models (n-grams)
  • 1.2.3 The neural turn (embeddings, RNNs)
  • 1.2.4 Attention and the Transformer
  • 1.2.5 The large language model era

1.3 How This Book Is Organized (~5 pages)

The four-part structure: Foundations, Neural Language Models, The Transformer Revolution, and Frontiers. Explains the conceptual dependency ordering and suggests reading paths for different audiences.

  • 1.3.1 Part I: Foundations
  • 1.3.2 Part II: Neural Language Models
  • 1.3.3 Part III: The Transformer Revolution
  • 1.3.4 Part IV: Frontiers
  • 1.3.5 Suggested reading paths

1.4 Prerequisites and Notation (~4 pages)

Mathematical prerequisites (linear algebra, probability, calculus), programming prerequisites (Python, PyTorch), and the notation conventions used throughout the book.

  • 1.4.1 Mathematical prerequisites
  • 1.4.2 Programming prerequisites
  • 1.4.3 Notation conventions
  • 1.4.4 The companion repository

Key Equations

This is a conceptual chapter with no formal equations. The only mathematical expression introduced is the informal notation $P(w_{\text{next}} \mid w_1, w_2, \ldots, w_{t-1})$ to foreshadow the language modeling objective, but no formal derivation occurs here.
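Although the notation $P(w_{\text{next}} \mid w_1, w_2, \ldots, w_{t-1})$ is introduced only informally, it maps directly onto code: a language model is a function from a context to a distribution over the next word, and text generation is repeated sampling from that function. A minimal hypothetical sketch (the hard-coded table is a stand-in for any real model, not anything defined in the book):

```python
import random

# A trivial stand-in model: a fixed table of next-word distributions.
# Any real model (n-gram, RNN, Transformer) exposes this same interface.
MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def p_next(context):
    """Return P(w_next | context); this toy model looks only at the last word."""
    return MODEL[context[-1]]

def generate(max_len=10, seed=0):
    """Autoregressive generation: sample a word, append it, repeat."""
    rng = random.Random(seed)
    context = ["<s>"]
    while len(context) < max_len:
        dist = p_next(context)
        words, probs = zip(*dist.items())
        word = rng.choices(words, weights=probs)[0]
        if word == "</s>":
            break
        context.append(word)
    return context[1:]

print(generate())
```

The interface here -- context in, distribution out, sample, repeat -- is exactly the loop that later chapters implement with progressively more powerful models.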

Key Figures

Figure 1.1 (TikZ): Book Roadmap Diagram
A visual map showing the four parts (15 chapters) with arrows indicating the conceptual flow from foundations through neural models to transformers to frontiers.

Figure 1.2 (TikZ): Prediction Paradigm Illustration
A schematic showing the same core task (predict the next word) being solved by different model families: n-gram, RNN, Transformer. The input context and prediction target are identical; only the model changes.

Figure 1.3 (TikZ / Matplotlib): History Timeline
A horizontal timeline from 1948 (Shannon) to 2026 (frontier LLMs) with key milestones annotated: Shannon, n-grams, Bengio's neural LM, Word2Vec, attention, the Transformer, GPT/BERT, ChatGPT.

Exercises

8 exercises (3 theory, 5 programming)

Cross-References

This chapter builds on:

  • No prior chapters (this is the first chapter)

This chapter is needed for:

  • Ch 2: Mathematical Foundations -- provides the prediction paradigm and notation conventions
  • All subsequent chapters assume the reader understands the "prediction as unifying theme" framing, the historical progression, and the book's four-part structure

Key Papers