Part II · Chapter 5

Sequence Models: RNNs, LSTMs, and GRUs

Part II: Neural Language Models · Moderate · ~25 pages · Phase 2

Prerequisites

Chapter 3 (n-gram language models and perplexity) and Chapter 4 (word embeddings).

Chapter Summary

Chapter 5 introduces the first architecture that can condition on unbounded context when predicting the next word. Where n-gram models are limited to a fixed window of $n-1$ words (Chapter 3), recurrent neural networks maintain a hidden state $\mathbf{h}_t$ that is updated at every time step, in principle enabling them to capture dependencies of arbitrary length. The chapter advances the prediction story in three ways: (1) it moves from count-based to neural language models, showing dramatic perplexity improvements; (2) it introduces the vanishing gradient problem as the central technical challenge of sequence modeling; and (3) it presents LSTMs and GRUs as architectures that solve the vanishing gradient problem through gating mechanisms.

Why this chapter matters: The chapter's culminating section builds a complete neural language model (embedding layer from Chapter 4, LSTM, softmax) and compares its perplexity against the n-gram baselines from Chapter 3, demonstrating the advantage of neural approaches. The bottleneck of compressing an entire sequence into a single fixed-size hidden state $\mathbf{h}_T$ motivates the attention mechanism of Chapter 6.

Learning Objectives

  1. Describe the forward pass of a vanilla RNN and explain how it processes sequences of variable length through shared parameters and recurrent hidden states
  2. Diagnose the vanishing gradient problem mathematically by analyzing the gradient flow through time and explain why it prevents learning long-range dependencies
  3. Trace the information flow through an LSTM cell, identifying how the forget, input, and output gates regulate the cell state to preserve gradients over long sequences
  4. Train an RNN-based neural language model and compare its perplexity against n-gram baselines on the same corpus

Section Outline

5.1 Vanilla RNNs (~5 pages)

The recurrent neural network as a sequence model that maintains a hidden state summarizing all previous inputs. Forward equations, parameter sharing, and using RNNs for language modeling.

  • 5.1.1 The recurrence relation
  • 5.1.2 Parameter sharing across time steps
  • 5.1.3 RNNs as language models
  • 5.1.4 Unrolling the computation graph
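
The forward pass of Section 5.1 can be sketched directly from Eqs. (5.1)–(5.2). Below is a minimal NumPy sketch; the function name `rnn_forward` and the toy dimensions are illustrative, not part of the chapter:

```python
import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, b_h, W_hy, b_y):
    """Run a vanilla RNN over a sequence.

    x_seq has shape (T, input_dim); returns the hidden states and the
    per-step softmax distributions over the vocabulary.
    """
    h = np.zeros(W_hh.shape[0])              # h_0 = 0
    hs, ys = [], []
    for x_t in x_seq:                        # the SAME weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)     # Eq. (5.1)
        logits = W_hy @ h + b_y
        p = np.exp(logits - logits.max())            # numerically stable softmax
        ys.append(p / p.sum())                       # Eq. (5.2)
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# toy dimensions: hidden=4, input=3, vocab=5, sequence length=6
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4)) * 0.1
W_xh = rng.normal(size=(4, 3)) * 0.1
b_h = np.zeros(4)
W_hy = rng.normal(size=(5, 4)) * 0.1
b_y = np.zeros(5)
hs, ys = rnn_forward(rng.normal(size=(6, 3)), W_hh, W_xh, b_h, W_hy, b_y)
```

Note that the loop body references only one set of parameters, which is the parameter sharing of 5.1.2; unrolling the loop over `x_seq` is exactly the computation graph of 5.1.4.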
5.2 The Vanishing Gradient Problem (~5 pages)

Why vanilla RNNs fail on long sequences. Derives the gradient via BPTT, analyzes the Jacobian product and its spectral radius, and presents gradient clipping as a partial fix that addresses exploding but not vanishing gradients.

  • 5.2.1 Backpropagation through time (BPTT)
  • 5.2.2 The Jacobian product and spectral radius
  • 5.2.3 Vanishing gradients: a numerical demonstration
  • 5.2.4 Gradient clipping for exploding gradients
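
The numerical demonstration of 5.2.3 and the clipping fix of 5.2.4 can both be sketched in a few lines of NumPy. The script below accumulates the Jacobian product from Eq. (5.3) over 50 steps and records its spectral norm, which decays roughly geometrically when the recurrent weights are small; variable names and the clipping threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
W_hh = rng.normal(size=(n, n)) * 0.1   # small recurrent weights: spectral norm < 1
h = np.zeros(n)
grad = np.eye(n)                        # running product of Jacobians
norms = []
for t in range(50):
    h = np.tanh(W_hh @ h + rng.normal(size=n))
    J = np.diag(1 - h**2) @ W_hh        # the Jacobian factor appearing in Eq. (5.3)
    grad = J @ grad
    norms.append(np.linalg.norm(grad, 2))
# norms shrinks toward zero: the long-range gradient signal vanishes

def clip_gradient(g, max_norm=5.0):
    """Rescale g if its norm exceeds max_norm (exploding-gradient fix, 5.2.4)."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g
```

Plotting `norms` against the time gap would reproduce the qualitative shape of Figure 5.2. Clipping caps the norm of *large* gradients but cannot restore a gradient that has already decayed to zero, which is why it is only a partial fix.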
5.3 Long Short-Term Memory (LSTM) (~6 pages)

The LSTM architecture solves the vanishing gradient problem by introducing a cell state $\mathbf{c}_t$ with additive updates. Four equations govern the cell: the forget, input, and output gates, plus the candidate cell state.

  • 5.3.1 The cell state: a conveyor belt for information
  • 5.3.2 The forget gate
  • 5.3.3 The input gate and candidate values
  • 5.3.4 The output gate
  • 5.3.5 Why LSTMs solve the vanishing gradient problem
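
A single LSTM step is a direct transcription of Eqs. (5.4)–(5.8). A minimal NumPy sketch follows; the function name `lstm_step` and the argument ordering are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step. Each W_* acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)            # forget gate,      Eq. (5.4)
    i = sigmoid(W_i @ z + b_i)            # input gate,       Eq. (5.5)
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell,   Eq. (5.6)
    c = f * c_prev + i * c_tilde          # ADDITIVE update,  Eq. (5.7)
    o = sigmoid(W_o @ z + b_o)            # output gate,      Eq. (5.8)
    h = o * np.tanh(c)
    return h, c
```

The conveyor-belt intuition of 5.3.1 is visible in the code: when the forget gate saturates at 1 and the input gate at 0, the line computing `c` copies `c_prev` unchanged, so gradients flow through the cell state without repeated multiplication by a weight matrix.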
5.4 Gated Recurrent Units (GRUs) (~4 pages)

The GRU as a simpler alternative: merges forget and input gates into a single update gate, eliminates the separate cell state. Empirical comparison with LSTMs.

  • 5.4.1 Update and reset gates
  • 5.4.2 The GRU forward pass
  • 5.4.3 LSTM vs GRU: when to use which
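
The GRU forward pass of 5.4.2 is compact enough to show in full, transcribing Eqs. (5.9)–(5.11); the function name `gru_step` is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. W_z and W_r act on [h_{t-1}, x_t]; W_h on [r*h_{t-1}, x_t]."""
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ z_in + b_z)          # update gate, Eq. (5.9)
    r = sigmoid(W_r @ z_in + b_r)          # reset gate,  Eq. (5.10)
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1 - z) * h_prev + z * h_tilde  # interpolation, Eq. (5.11)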
5.5 Neural Language Models (~5 pages)

Embedding layer (Ch 4) feeds into an LSTM/GRU, which predicts the next token via softmax. Training with truncated BPTT. Perplexity comparison showing LSTM LMs significantly outperform n-gram baselines.

  • 5.5.1 The neural language model architecture
  • 5.5.2 Training with truncated BPTT
  • 5.5.3 Perplexity comparison: n-grams vs neural LMs
  • 5.5.4 Practical considerations (dropout, batching, GPU training)
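
The perplexity numbers compared in 5.5.3 are all computed the same way, as the exponential of the average negative log-likelihood per held-out token. A small sketch, with the function name `perplexity` and the sanity-check values chosen for illustration:

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return float(np.exp(-np.mean(token_log_probs)))

# sanity check: a uniform model over a vocabulary of size V has perplexity V
V = 10_000
uniform_lp = np.full(200, np.log(1.0 / V))  # log p(w_t) = log(1/V) for every token
pp = perplexity(uniform_lp)                 # approximately V
```

Because every model in Figure 5.6 is scored with this one formula on the same held-out set, the bar chart is a like-for-like comparison of n-gram and neural language models.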

Key Equations

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$ (5.1)
RNN forward pass -- Section 5.1
$$\mathbf{y}_t = \text{softmax}(\mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y)$$ (5.2)
RNN output -- Section 5.1
$$\frac{\partial L}{\partial \mathbf{h}_k} = \frac{\partial L}{\partial \mathbf{h}_T} \prod_{t=k+1}^{T} \text{diag}(1 - \mathbf{h}_t^2)\mathbf{W}_{hh}$$ (5.3)
BPTT gradient -- Section 5.2
$$\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$$ (5.4)
LSTM forget gate -- Section 5.3
$$\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$$ (5.5)
LSTM input gate -- Section 5.3
$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)$$ (5.6)
LSTM candidate cell -- Section 5.3
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$ (5.7)
LSTM cell state update -- Section 5.3
$$\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o), \quad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$ (5.8)
LSTM output gate and hidden state -- Section 5.3
$$\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)$$ (5.9)
GRU update gate -- Section 5.4
$$\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)$$ (5.10)
GRU reset gate -- Section 5.4
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tanh(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h)$$ (5.11)
GRU hidden state update -- Section 5.4

Key Figures

TikZ
Figure 5.1: RNN Unrolled Diagram
The standard diagram showing an RNN cell replicated across time steps with shared weights, inputs x_t at the bottom, hidden states h_t flowing horizontally, and outputs y_t at the top.
Matplotlib
Figure 5.2: Vanishing Gradient Visualization
A plot showing the magnitude of gradients as a function of the time gap (T - t), demonstrating exponential decay in vanilla RNNs.
TikZ
Figure 5.3: LSTM Cell Diagram
Detailed diagram of a single LSTM cell showing the cell state c_t as a horizontal line (the "highway"), the three gates (forget, input, output) as sigmoid nodes, and the tanh nodes for the candidate cell state and output.
TikZ
Figure 5.4: GRU Cell Diagram
Analogous to the LSTM diagram but with only two gates (update, reset) and no separate cell state.
Matplotlib
Figure 5.5: Training Curves
Loss curves over training epochs for vanilla RNN vs LSTM vs GRU on the same language modeling task, showing that LSTM/GRU converge faster and to lower loss.
Matplotlib
Figure 5.6: Perplexity Comparison
Bar chart comparing perplexity of n-gram (unigram, bigram, trigram), vanilla RNN, LSTM, and GRU language models on the same held-out set.

Exercises

10 exercises (4 theory, 6 programming)

Cross-References

This chapter builds on:

  • Ch 3: n-gram language models -- provides the perplexity metric and the count-based baselines compared against in Section 5.5
  • Ch 4: word embeddings -- provides the embedding layer that feeds the recurrent network

This chapter is needed for:

  • Ch 6: Attention Mechanisms -- the bottleneck of compressing an entire sequence into a single fixed-size hidden state motivates the invention of attention

Key Papers

  • Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735--1780.
  • Cho, K. et al. (2014). Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of EMNLP, 1724--1734.
  • Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179--211.
  • Mikolov, T. et al. (2010). Recurrent Neural Network Based Language Model. Proceedings of Interspeech, 1045--1048.
  • Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. Proceedings of ICML, 1310--1318.
  • Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv:1409.2329.
  • Bengio, Y. et al. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137--1155.