Part II · Chapter 5

Sequence Models: RNNs, LSTMs, and GRUs

Part II: Neural Language Models · Moderate · ~25 pages · Phase 2

Prerequisites

Chapter 3 (n-gram language models and perplexity) and Chapter 4 (word embeddings).

Chapter Summary

Chapter 5 introduces the first architecture that can condition on unbounded context when predicting the next word. Where n-gram models are limited to a fixed window of $n-1$ words (Chapter 3), recurrent neural networks maintain a hidden state $\mathbf{h}_t$ that is updated at every time step, in principle enabling them to capture dependencies of arbitrary length. The chapter advances the prediction story in three ways: (1) it moves from count-based to neural language models, showing dramatic perplexity improvements; (2) it introduces the vanishing gradient problem as the central technical challenge of sequence modeling; and (3) it presents LSTMs and GRUs as architectures that solve the vanishing gradient problem through gating mechanisms.

Why this chapter matters: The chapter's culminating section builds a complete neural language model (embedding layer from Chapter 4, LSTM, softmax) and compares its perplexity against the n-gram baselines from Chapter 3, demonstrating the advantage of neural approaches. The bottleneck of compressing an entire sequence into a single fixed-size hidden state $\mathbf{h}_T$ motivates the attention mechanism of Chapter 6.

Learning Objectives

  1. Describe the forward pass of a vanilla RNN and explain how it processes sequences of variable length through shared parameters and recurrent hidden states
  2. Diagnose the vanishing gradient problem mathematically by analyzing the gradient flow through time and explain why it prevents learning long-range dependencies
  3. Trace the information flow through an LSTM cell, identifying how the forget, input, and output gates regulate the cell state to preserve gradients over long sequences
  4. Train an RNN-based neural language model and compare its perplexity against n-gram baselines on the same corpus

Section Outline

5.1 Vanilla RNNs (~5 pages)

The recurrent neural network as a sequence model that maintains a hidden state summarizing all previous inputs. Forward equations, parameter sharing, and using RNNs for language modeling.

  • 5.1.1 The recurrence relation
  • 5.1.2 Parameter sharing across time steps
  • 5.1.3 RNNs as language models
  • 5.1.4 Unrolling the computation graph
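
The forward pass of Section 5.1 can be sketched directly from Eqs. (5.1)–(5.2). Below is a minimal NumPy sketch; the function name `rnn_forward` and the toy dimensions are illustrative, not part of the chapter:

```python
import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, b_h, W_hy, b_y):
    """Run a vanilla RNN over a sequence.

    x_seq has shape (T, input_dim); returns the hidden states and the
    per-step softmax distributions over the vocabulary.
    """
    h = np.zeros(W_hh.shape[0])              # h_0 = 0
    hs, ys = [], []
    for x_t in x_seq:                        # the SAME weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)     # Eq. (5.1)
        logits = W_hy @ h + b_y
        p = np.exp(logits - logits.max())            # numerically stable softmax
        ys.append(p / p.sum())                       # Eq. (5.2)
        hs.append(h)
    return np.stack(hs), np.stack(ys)

# toy dimensions: hidden=4, input=3, vocab=5, sequence length=6
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4)) * 0.1
W_xh = rng.normal(size=(4, 3)) * 0.1
b_h = np.zeros(4)
W_hy = rng.normal(size=(5, 4)) * 0.1
b_y = np.zeros(5)
hs, ys = rnn_forward(rng.normal(size=(6, 3)), W_hh, W_xh, b_h, W_hy, b_y)
```

Note that the loop body references only one set of parameters, which is the parameter sharing of 5.1.2; unrolling the loop over `x_seq` is exactly the computation graph of 5.1.4.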
5.2 The Vanishing Gradient Problem (~5 pages)

Why vanilla RNNs fail on long sequences. Derives the gradient via BPTT, analyzes the Jacobian product and its spectral radius, and presents gradient clipping as a partial fix that addresses exploding but not vanishing gradients.

  • 5.2.1 Backpropagation through time (BPTT)
  • 5.2.2 The Jacobian product and spectral radius
  • 5.2.3 Vanishing gradients: a numerical demonstration
  • 5.2.4 Gradient clipping for exploding gradients
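
The numerical demonstration of 5.2.3 and the clipping fix of 5.2.4 can both be sketched in a few lines of NumPy. The script below accumulates the Jacobian product from Eq. (5.3) over 50 steps and records its spectral norm, which decays roughly geometrically when the recurrent weights are small; variable names and the clipping threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
W_hh = rng.normal(size=(n, n)) * 0.1   # small recurrent weights: spectral norm < 1
h = np.zeros(n)
grad = np.eye(n)                        # running product of Jacobians
norms = []
for t in range(50):
    h = np.tanh(W_hh @ h + rng.normal(size=n))
    J = np.diag(1 - h**2) @ W_hh        # the Jacobian factor appearing in Eq. (5.3)
    grad = J @ grad
    norms.append(np.linalg.norm(grad, 2))
# norms shrinks toward zero: the long-range gradient signal vanishes

def clip_gradient(g, max_norm=5.0):
    """Rescale g if its norm exceeds max_norm (exploding-gradient fix, 5.2.4)."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g
```

Plotting `norms` against the time gap would reproduce the qualitative shape of Figure 5.2. Clipping caps the norm of *large* gradients but cannot restore a gradient that has already decayed to zero, which is why it is only a partial fix.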
5.3 Long Short-Term Memory (LSTM) (~6 pages)

The LSTM architecture solves the vanishing gradient problem by introducing a cell state $\mathbf{c}_t$ with additive updates. Four equations govern the cell: the forget, input, and output gates, plus the candidate cell state.

  • 5.3.1 The cell state: a conveyor belt for information
  • 5.3.2 The forget gate
  • 5.3.3 The input gate and candidate values
  • 5.3.4 The output gate
  • 5.3.5 Why LSTMs solve the vanishing gradient problem
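
A single LSTM step is a direct transcription of Eqs. (5.4)–(5.8). A minimal NumPy sketch follows; the function name `lstm_step` and the argument ordering are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step. Each W_* acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)            # forget gate,      Eq. (5.4)
    i = sigmoid(W_i @ z + b_i)            # input gate,       Eq. (5.5)
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell,   Eq. (5.6)
    c = f * c_prev + i * c_tilde          # ADDITIVE update,  Eq. (5.7)
    o = sigmoid(W_o @ z + b_o)            # output gate,      Eq. (5.8)
    h = o * np.tanh(c)
    return h, c
```

The conveyor-belt intuition of 5.3.1 is visible in the code: when the forget gate saturates at 1 and the input gate at 0, the line computing `c` copies `c_prev` unchanged, so gradients flow through the cell state without repeated multiplication by a weight matrix.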
5.4 Gated Recurrent Units (GRUs) (~4 pages)

The GRU as a simpler alternative: merges forget and input gates into a single update gate, eliminates the separate cell state. Empirical comparison with LSTMs.

  • 5.4.1 Update and reset gates
  • 5.4.2 The GRU forward pass
  • 5.4.3 LSTM vs GRU: when to use which
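
The GRU forward pass of 5.4.2 is compact enough to show in full, transcribing Eqs. (5.9)–(5.11); the function name `gru_step` is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. W_z and W_r act on [h_{t-1}, x_t]; W_h on [r*h_{t-1}, x_t]."""
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ z_in + b_z)          # update gate, Eq. (5.9)
    r = sigmoid(W_r @ z_in + b_r)          # reset gate,  Eq. (5.10)
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1 - z) * h_prev + z * h_tilde  # interpolation, Eq. (5.11)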
5.5 Neural Language Models (~5 pages)

Embedding layer (Ch 4) feeds into an LSTM/GRU, which predicts the next token via softmax. Training with truncated BPTT. Perplexity comparison showing LSTM LMs significantly outperform n-gram baselines.

  • 5.5.1 The neural language model architecture
  • 5.5.2 Training with truncated BPTT
  • 5.5.3 Perplexity comparison: n-grams vs neural LMs
  • 5.5.4 Practical considerations (dropout, batching, GPU training)
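
The perplexity numbers compared in 5.5.3 are all computed the same way, as the exponential of the average negative log-likelihood per held-out token. A small sketch, with the function name `perplexity` and the sanity-check values chosen for illustration:

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return float(np.exp(-np.mean(token_log_probs)))

# sanity check: a uniform model over a vocabulary of size V has perplexity V
V = 10_000
uniform_lp = np.full(200, np.log(1.0 / V))  # log p(w_t) = log(1/V) for every token
pp = perplexity(uniform_lp)                 # approximately V
```

Because every model in Figure 5.6 is scored with this one formula on the same held-out set, the bar chart is a like-for-like comparison of n-gram and neural language models.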

Key Equations

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$ (5.1)
RNN forward pass -- Section 5.1
$$\mathbf{y}_t = \text{softmax}(\mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y)$$ (5.2)
RNN output -- Section 5.1
$$\frac{\partial L}{\partial \mathbf{h}_k} = \frac{\partial L}{\partial \mathbf{h}_T} \prod_{t=k+1}^{T} \text{diag}(1 - \mathbf{h}_t^2)\mathbf{W}_{hh}$$ (5.3)
BPTT gradient -- Section 5.2
$$\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$$ (5.4)
LSTM forget gate -- Section 5.3
$$\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$$ (5.5)
LSTM input gate -- Section 5.3
$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)$$ (5.6)
LSTM candidate cell -- Section 5.3
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$ (5.7)
LSTM cell state update -- Section 5.3
$$\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o), \quad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$ (5.8)
LSTM output gate and hidden state -- Section 5.3
$$\mathbf{z}_t = \sigma(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)$$ (5.9)
GRU update gate -- Section 5.4
$$\mathbf{r}_t = \sigma(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)$$ (5.10)
GRU reset gate -- Section 5.4
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tanh(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h)$$ (5.11)
GRU hidden state update -- Section 5.4

Key Figures

TikZ
Figure 5.1: RNN Unrolled Diagram
The standard diagram showing an RNN cell replicated across time steps with shared weights, inputs x_t at the bottom, hidden states h_t flowing horizontally, and outputs y_t at the top.
Matplotlib
Figure 5.2: Vanishing Gradient Visualization
A plot showing the magnitude of gradients as a function of the time gap (T - t), demonstrating exponential decay in vanilla RNNs.
TikZ
Figure 5.3: LSTM Cell Diagram
Detailed diagram of a single LSTM cell showing the cell state c_t as a horizontal line (the "highway"), the three gates (forget, input, output) as sigmoid nodes, and the tanh nodes for the candidate cell state and output.
TikZ
Figure 5.4: GRU Cell Diagram
Analogous to the LSTM diagram but with only two gates (update, reset) and no separate cell state.
Matplotlib
Figure 5.5: Training Curves
Loss curves over training epochs for vanilla RNN vs LSTM vs GRU on the same language modeling task, showing that LSTM/GRU converge faster and to lower loss.
Matplotlib
Figure 5.6: Perplexity Comparison
Bar chart comparing perplexity of n-gram (unigram, bigram, trigram), vanilla RNN, LSTM, and GRU language models on the same held-out set.

Exercises

10 exercises (4 theory, 6 programming)

Cross-References

This chapter builds on:

  • Ch 3: n-gram language models -- provides the perplexity metric and the count-based baselines compared against in Section 5.5
  • Ch 4: word embeddings -- provides the embedding layer that feeds the recurrent network

This chapter is needed for:

  • Ch 6: Attention Mechanisms -- the bottleneck of compressing an entire sequence into a single fixed-size hidden state motivates the invention of attention

Key Papers

  • Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735--1780.
  • Cho, K. et al. (2014). Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of EMNLP, 1724--1734.
  • Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179--211.
  • Mikolov, T. et al. (2010). Recurrent Neural Network Based Language Model. Proceedings of Interspeech, 1045--1048.
  • Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. Proceedings of ICML, 1310--1318.
  • Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv:1409.2329.
  • Bengio, Y. et al. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137--1155.