Sequence Models: RNNs, LSTMs, and GRUs
Prerequisites
Chapter Summary
Chapter 5 introduces the first architecture that can condition on unbounded context when predicting the next word. Where n-gram models are limited to a fixed window of n-1 words (Chapter 3), recurrent neural networks maintain a hidden state h_t that is updated at every time step, in principle enabling them to capture dependencies of arbitrary length. The chapter makes three advances in the prediction story: (1) it moves from count-based to neural language models, showing dramatic perplexity improvements; (2) it introduces the vanishing gradient problem as the central technical challenge of sequence modeling; and (3) it presents LSTMs and GRUs as architectures that mitigate the vanishing gradient problem through gating mechanisms.
Learning Objectives
- Describe the forward pass of a vanilla RNN and explain how it processes sequences of variable length through shared parameters and recurrent hidden states
- Diagnose the vanishing gradient problem mathematically by analyzing the gradient flow through time and explain why it prevents learning long-range dependencies
- Trace the information flow through an LSTM cell, identifying how the forget, input, and output gates regulate the cell state to preserve gradients over long sequences
- Train an RNN-based neural language model and compare its perplexity against n-gram baselines on the same corpus
Section Outline
5.1 Vanilla RNNs (~5 pages)
The recurrent neural network as a sequence model that maintains a hidden state summarizing all previous inputs. Forward equations, parameter sharing, and using RNNs for language modeling.
- 5.1.1 The recurrence relation
- 5.1.2 Parameter sharing across time steps
- 5.1.3 RNNs as language models
- 5.1.4 Unrolling the computation graph
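The recurrence of 5.1.1 and the parameter sharing of 5.1.2 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the chapter's reference implementation; the function name `rnn_forward` and the chosen dimensions are our own. Note that the same weights W_xh and W_hh are applied at every time step, which is exactly what lets one parameter set handle sequences of any length.

```python
import numpy as np

def rnn_forward(x_seq, h0, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence, sharing the same
    parameters (W_xh, W_hh, b_h) at every time step."""
    h = h0
    hs = []
    for x_t in x_seq:
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 5
W_xh = rng.normal(scale=0.5, size=(d_h, d_in))
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
b_h = np.zeros(d_h)
x_seq = [rng.normal(size=d_in) for _ in range(T)]
hs = rnn_forward(x_seq, np.zeros(d_h), W_xh, W_hh, b_h)
print(len(hs), hs[-1].shape)  # 5 (4,)
```

The final hidden state summarizes the whole prefix, which is what a language-modeling head would condition on.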
5.2 The Vanishing Gradient Problem (~5 pages)
Why vanilla RNNs fail on long sequences. Derives the gradient via BPTT, analyzes the Jacobian product and its spectral radius, and presents gradient clipping as a partial fix (for exploding, not vanishing, gradients).
- 5.2.1 Backpropagation through time (BPTT)
- 5.2.2 The Jacobian product and spectral radius
- 5.2.3 Vanishing gradients: a numerical demonstration
- 5.2.4 Gradient clipping for exploding gradients
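The numerical demonstration of 5.2.3 can be sketched as follows (our own construction, with an assumed spectral radius of 0.9; the matrix is made symmetric purely so the decay is clean). Repeatedly multiplying a gradient vector by W_hh^T, as BPTT does when the tanh Jacobian terms are near the identity, shrinks its norm geometrically; the clipping helper follows the recipe of Pascanu et al. (2013) for the opposite, exploding case.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
A = rng.normal(scale=0.3, size=(d, d))
W_hh = (A + A.T) / 2  # symmetric, so spectral norm = spectral radius
# Rescale so the spectral radius is 0.9 (< 1): gradients vanish.
W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))

# In BPTT the gradient is multiplied by J_t^T = W_hh^T diag(1 - h_t^2)
# at every step; taking diag(...) = I isolates the recurrent weights.
g = np.ones(d)          # gradient arriving at the last time step
norms = []
for _ in range(50):     # backpropagate 50 steps
    g = W_hh.T @ g
    norms.append(np.linalg.norm(g))
# norms shrink roughly like 0.9**t -- the vanishing gradient

def clip_gradient(g, max_norm=5.0):
    """Rescale g to max_norm if its norm exceeds it (Pascanu et al., 2013)."""
    n = np.linalg.norm(g)
    return g * (max_norm / n) if n > max_norm else g
```

With a spectral radius above 1 the same loop would explode instead, which is the case clipping addresses; clipping does nothing to rescue a vanished gradient.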
5.3 Long Short-Term Memory (LSTM) (~6 pages)
The LSTM architecture addresses the vanishing gradient problem by introducing a cell state c_t with additive updates. Four core equations: the forget gate, input gate, candidate cell state, and output gate.
- 5.3.1 The cell state: a conveyor belt for information
- 5.3.2 The forget gate
- 5.3.3 The input gate and candidate values
- 5.3.4 The output gate
- 5.3.5 Why LSTMs mitigate the vanishing gradient problem
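One way to make the gate equations of 5.3 concrete is a single-step sketch in NumPy (the parameter names and initialization below are our own, not the chapter's). The additive update `c = f * c_prev + i * g` is the "conveyor belt" of 5.3.1: when the forget gate f is near 1, gradients flow back through c_t largely unattenuated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step. P holds a (W, U, b) triple per equation."""
    f = sigmoid(P["Wf"] @ x + P["Uf"] @ h_prev + P["bf"])  # forget gate
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_prev + P["bi"])  # input gate
    g = np.tanh(P["Wg"] @ x + P["Ug"] @ h_prev + P["bg"])  # candidate
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_prev + P["bo"])  # output gate
    c = f * c_prev + i * g   # additive cell-state update
    h = o * np.tanh(c)       # exposed hidden state
    return h, c

rng = np.random.default_rng(2)
d_in, d_h = 3, 4
P = {}
for gate in "figo":
    P[f"W{gate}"] = rng.normal(scale=0.5, size=(d_h, d_in))
    P[f"U{gate}"] = rng.normal(scale=0.5, size=(d_h, d_h))
    P[f"b{gate}"] = np.zeros(d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(6):
    h, c = lstm_step(rng.normal(size=d_in), h, c, P)
```

All four gates see the same (x, h_prev) pair; only their learned weights differ, which is why libraries typically fuse them into one matrix multiply.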
5.4 Gated Recurrent Units (GRUs) (~4 pages)
The GRU as a simpler alternative: it merges the forget and input gates into a single update gate and eliminates the separate cell state. Empirical comparison with LSTMs.
- 5.4.1 Update and reset gates
- 5.4.2 The GRU forward pass
- 5.4.3 LSTM vs GRU: when to use which
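The GRU forward pass of 5.4.2 is compact enough to sketch in full (again an illustrative sketch with our own names). Gate-sign conventions vary across references; this sketch uses the common form in which the update gate z interpolates between the previous state and the candidate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, P):
    """One GRU step; P holds a (W, U, b) triple per equation."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev + P["bz"])  # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev + P["br"])  # reset gate
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h_prev) + P["bh"])
    # A single update gate interpolates old state and candidate,
    # playing the role of the LSTM's separate forget and input gates.
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(3)
d_in, d_h = 3, 4
P = {}
for name in ("z", "r", "h"):
    P[f"W{name}"] = rng.normal(scale=0.5, size=(d_h, d_in))
    P[f"U{name}"] = rng.normal(scale=0.5, size=(d_h, d_h))
    P[f"b{name}"] = np.zeros(d_h)
h = np.zeros(d_h)
for _ in range(6):
    h = gru_step(rng.normal(size=d_in), h, P)
```

With three weight triples instead of four and no separate cell state, the GRU has roughly three-quarters of an LSTM's recurrent parameters at the same hidden size.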
5.5 Neural Language Models (~5 pages)
An embedding layer (Ch 4) feeds into an LSTM or GRU, which predicts the next token via a softmax over the vocabulary. Training with truncated BPTT. Perplexity comparison showing that LSTM LMs significantly outperform n-gram baselines.
- 5.5.1 The neural language model architecture
- 5.5.2 Training with truncated BPTT
- 5.5.3 Perplexity comparison: n-grams vs neural LMs
- 5.5.4 Practical considerations (dropout, batching, GPU training)
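Perplexity, the metric behind the comparison in 5.5.3, is just the exponentiated average negative log-probability per token. A minimal sketch (the helper name is ours):

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability
    assigned to each token (natural log)."""
    return float(np.exp(-np.mean(token_log_probs)))

# Sanity check: a uniform model over a vocabulary of size V assigns
# every token probability 1/V, so its perplexity is exactly V.
V = 10_000
uniform_lps = np.full(50, np.log(1.0 / V))
print(perplexity(uniform_lps))  # ≈ 10000
```

A uniform baseline's perplexity equals the vocabulary size, which makes perplexity interpretable as the model's effective branching factor when predicting the next token.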
Key Equations
Key Figures
Exercises
10 exercises (4 theory, 6 programming)
Cross-References
This chapter builds on:
- Ch 1: Introduction -- the prediction paradigm
- Ch 3: Classical Language Models -- limitations of n-gram models (sparsity, no generalization) that motivate neural LMs
- Ch 4: Word Representations -- word embeddings as the input representation for RNNs
This chapter is needed for:
- Ch 6: Attention Mechanisms -- the bottleneck of compressing an entire sequence into a single fixed-size hidden state motivates the invention of attention
Key Papers
- Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735--1780.
- Cho, K. et al. (2014). Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of EMNLP, 1724--1734.
- Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179--211.
- Mikolov, T. et al. (2010). Recurrent Neural Network Based Language Model. Proceedings of Interspeech, 1045--1048.
- Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. Proceedings of ICML, 1310--1318.
- Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv:1409.2329.
- Bengio, Y. et al. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137--1155.