Part I · Chapter 1

Introduction: What Does It Mean to Predict the Next Word?

Part I: Foundations · Moderate · ~20 pages · Phase 2

Prerequisites

Ch 1: Introduction (No prerequisites -- this is the first chapter)

Chapter Summary

Chapter 1 establishes the book's central thesis: that next-word prediction is the single unifying principle connecting seventy-five years of language technology, from Shannon's information-theoretic experiments in 1948 through statistical n-gram models, neural language models, and the modern large language model era. The chapter frames every subsequent chapter as a variation on the same fundamental question -- given what we have seen so far, what comes next? By grounding the entire book in $P(w_{\text{next}} \mid \text{context})$, the chapter gives readers a conceptual scaffold that transforms a sprawling field into a coherent narrative. It also provides the historical arc, the four-part book roadmap, and the mathematical and programming prerequisites necessary for what follows.

Why this chapter matters: This chapter establishes that predicting the next word is not merely one task among many -- it is THE fundamental operation that connects Shannon's information theory to modern chatbots. Every chapter that follows is motivated as a better way to solve this single prediction problem.
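The thesis above can be made concrete in a few lines of code. The following sketch (illustrative only, not taken from the book's companion material) estimates $P(w_{\text{next}} \mid \text{context})$ from a toy corpus using bigram counts -- the simplest possible answer to "given what we have seen so far, what comes next?":

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model would use billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigram successors: counts[w][v] = how often v follows w.
counts = defaultdict(Counter)
for w, v in zip(corpus, corpus[1:]):
    counts[w][v] += 1

def p_next(word):
    """Estimate P(w_next | word) by relative frequency."""
    followers = counts[word]
    total = sum(followers.values())
    return {v: c / total for v, c in followers.items()}

print(p_next("the"))  # 'cat' is the most likely word after 'the' in this corpus
```

Every model family covered later in the book -- n-grams, RNNs, Transformers -- refines how this distribution is computed, but the question it answers never changes.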

Learning Objectives

  1. Explain why next-word prediction is the unifying principle that connects classical NLP, neural language models, and modern large language models
  2. Trace the historical development of language modeling from Shannon's information-theoretic experiments through statistical models to transformer-based LLMs
  3. Describe the four-part structure of the book and map each part to a phase in the evolution of language modeling
  4. Identify the mathematical, programming, and machine learning prerequisites needed for the remaining chapters

Section Outline

1.1 The Prediction Paradigm (~5 pages)

How "predicting the next word" unifies NLP from Shannon to GPT. Introduces $P(w_{\text{next}} \mid \text{context})$ as the core idea underlying text generation, machine translation, speech recognition, and modern chatbots.

  • 1.1.1 What is a language model?
  • 1.1.2 Prediction as the common thread
  • 1.1.3 From probabilities to applications

1.2 A Brief History of Language Modeling (~6 pages)

From Markov chains and Shannon's noisy channel to n-gram models, neural language models, and the transformer revolution. Key milestones from Shannon (1948) through frontier LLMs (2023-2026).

  • 1.2.1 Shannon and information theory
  • 1.2.2 Statistical language models (n-grams)
  • 1.2.3 The neural turn (embeddings, RNNs)
  • 1.2.4 Attention and the Transformer
  • 1.2.5 The large language model era

1.3 How This Book Is Organized (~5 pages)

The four-part structure: Foundations, Neural Language Models, The Transformer Revolution, and Frontiers. Explains the conceptual dependency ordering and suggests reading paths for different audiences.

  • 1.3.1 Part I: Foundations
  • 1.3.2 Part II: Neural Language Models
  • 1.3.3 Part III: The Transformer Revolution
  • 1.3.4 Part IV: Frontiers
  • 1.3.5 Suggested reading paths

1.4 Prerequisites and Notation (~4 pages)

Mathematical prerequisites (linear algebra, probability, calculus), programming prerequisites (Python, PyTorch), and the notation conventions used throughout the book.

  • 1.4.1 Mathematical prerequisites
  • 1.4.2 Programming prerequisites
  • 1.4.3 Notation conventions
  • 1.4.4 The companion repository

Key Equations

This is a conceptual chapter with no formal equations. The only mathematical expression introduced is the informal notation $P(w_{\text{next}} \mid w_1, w_2, \ldots, w_{t-1})$ to foreshadow the language modeling objective, but no formal derivation occurs here.
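Although the notation $P(w_{\text{next}} \mid w_1, w_2, \ldots, w_{t-1})$ is introduced only informally, it maps directly onto code: a language model is a function from a context to a distribution over the next word, and text generation is repeated sampling from that function. A minimal hypothetical sketch (the hard-coded table is a stand-in for any real model, not anything defined in the book):

```python
import random

# A trivial stand-in model: a fixed table of next-word distributions.
# Any real model (n-gram, RNN, Transformer) exposes this same interface.
MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def p_next(context):
    """Return P(w_next | context); this toy model looks only at the last word."""
    return MODEL[context[-1]]

def generate(max_len=10, seed=0):
    """Autoregressive generation: sample a word, append it, repeat."""
    rng = random.Random(seed)
    context = ["<s>"]
    while len(context) < max_len:
        dist = p_next(context)
        words, probs = zip(*dist.items())
        word = rng.choices(words, weights=probs)[0]
        if word == "</s>":
            break
        context.append(word)
    return context[1:]

print(generate())
```

The interface here -- context in, distribution out, sample, repeat -- is exactly the loop that later chapters implement with progressively more powerful models.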

Key Figures

Figure 1.1 (TikZ): Book Roadmap Diagram
A visual map showing the four parts (15 chapters) with arrows indicating the conceptual flow from foundations through neural models to transformers to frontiers.

Figure 1.2 (TikZ): Prediction Paradigm Illustration
A schematic showing the same core task (predict the next word) being solved by different model families: n-gram, RNN, Transformer. The input context and prediction target are identical; only the model changes.

Figure 1.3 (TikZ / Matplotlib): History Timeline
A horizontal timeline from 1948 (Shannon) to 2026 (frontier LLMs) with key milestones annotated: Shannon, n-grams, Bengio's neural LM, Word2Vec, attention, the Transformer, GPT/BERT, ChatGPT.

Exercises

8 exercises (3 theory, 5 programming)

Cross-References

This chapter builds on:

  • No prior chapters (this is the first chapter)

This chapter is needed for:

  • Ch 2: Mathematical Foundations -- provides the prediction paradigm and notation conventions
  • All subsequent chapters assume the reader understands the "prediction as unifying theme" framing, the historical progression, and the book's four-part structure

Key Papers