The Attention Revolution
Prerequisites
- Chapter 5: Sequence Models (RNNs, LSTMs, and GRUs) — supplies the encoder-decoder RNNs that attention augments, and the fixed-size hidden state $\mathbf{h}_T$ that creates the bottleneck
Summary
Chapter 6 introduces the attention mechanism — the single most consequential idea on the path from recurrent neural networks to transformers. The chapter opens by diagnosing the information bottleneck problem: when an encoder RNN compresses an arbitrarily long input sequence into a single fixed-size hidden state $\mathbf{h}_T$, information is inevitably lost, and this loss worsens with sequence length. Bahdanau attention solves the bottleneck by allowing the decoder to dynamically attend to all encoder hidden states at every generation step, computing a weighted combination (the context vector) that focuses on the most relevant source positions. The chapter then presents Luong's simpler multiplicative variants, makes the conceptual leap from cross-attention (decoder attending to encoder) to self-attention (a sequence attending to itself), and culminates in the general query-key-value formulation that unifies all attention variants. This chapter is the critical bridge between Part II (Neural Language Models, RNN-based) and Part III (The Transformer Revolution), establishing the mechanism that Chapter 8 will scale into the full Transformer architecture.
Learning Objectives
- Explain why compressing a variable-length input sequence into a single fixed-size vector creates an information bottleneck that degrades performance on long sequences
- Derive the Bahdanau (additive) attention mechanism, computing alignment scores, attention weights via softmax, and the context vector as a weighted sum of encoder hidden states
- Contrast Bahdanau (additive) and Luong (multiplicative) attention, identifying when each variant is preferred and their computational tradeoffs
- Describe self-attention as a generalization where a sequence attends to itself, and explain why this formulation is the foundation of the Transformer architecture
Section Outline
6.1 Motivation: The Bottleneck Problem (~4pp)
Why fixed-size vectors bottleneck sequence models. Reviews the encoder-decoder setup from Ch 5: an encoder RNN compresses the entire input into a single final hidden state $\mathbf{h}_T$, which the decoder must use to generate the full output. Shows empirically that performance degrades as input length increases.
- 6.1.1 The fixed-size encoding problem
- 6.1.2 Performance degradation on long sequences
- 6.1.3 The idea: let the decoder look back at the input
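The bottleneck can be made concrete with a toy recurrent encoder (a minimal sketch with an illustrative tanh update and made-up dimensions, not a reference implementation): however long the input, the decoder receives only one fixed-size vector.

```python
import math

def rnn_encode(tokens, d=4):
    """Toy tanh-RNN encoder: compresses an input of any length
    into a single d-dimensional final hidden state h_T."""
    h = [0.0] * d
    for x in tokens:
        # Illustrative recurrence: each token nudges every state coordinate.
        h = [math.tanh(0.5 * h[i] + 0.1 * x) for i in range(d)]
    return h

short = rnn_encode([1, 2, 3])
long = rnn_encode(list(range(100)))
# Both encodings have the same fixed size, regardless of input length:
assert len(short) == len(long) == 4
```

The 100-token input must squeeze through the same 4 numbers as the 3-token input; attention removes this constraint by keeping all intermediate states accessible.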
6.2 Bahdanau (Additive) Attention (~6pp)
Score function, alignment, context vector. Introduced in the breakthrough paper by Bahdanau, Cho, and Bengio (2015). At each step, the decoder computes an alignment score between its state and each encoder hidden state, normalizes the scores via softmax, and forms the context vector as a weighted sum of the encoder hidden states.
- 6.2.1 Alignment scores
- 6.2.2 Softmax normalization and attention weights
- 6.2.3 The context vector
- 6.2.4 Incorporating context into the decoder
- 6.2.5 A complete worked example
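The three-stage pipeline of Eqs 6.1–6.3 (alignment score, softmax, context vector) can be sketched in plain Python. The dimensions, weight matrices, and the `matvec` helper below are illustrative placeholders, not the book's reference implementation.

```python
import math

def matvec(W, x):
    """Matrix-vector product for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def bahdanau_step(s_prev, H, W_s, W_h, v):
    """One decoder step of additive attention (Eqs 6.1-6.3)."""
    Ws_s = matvec(W_s, s_prev)
    # Eq 6.1: e_j = v^T tanh(W_s s_{i-1} + W_h h_j)
    scores = []
    for h in H:
        z = [a + b for a, b in zip(Ws_s, matvec(W_h, h))]
        scores.append(sum(vi * math.tanh(zi) for vi, zi in zip(v, z)))
    # Eq 6.2: softmax over source positions j = 1..T_x
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Eq 6.3: context vector c_i = sum_j alpha_j h_j
    d = len(H[0])
    c = [sum(a * h[k] for a, h in zip(alphas, H)) for k in range(d)]
    return alphas, c

# Tiny example: three 2-dim encoder states, identity projections.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
alphas, c = bahdanau_step([0.5, -0.5], H, I2, I2, [1.0, 1.0])
assert abs(sum(alphas) - 1.0) < 1e-9  # attention weights form a distribution
```

With identity projections the score reduces to summing the tanh of the coordinates of $\mathbf{s}_{i-1} + \mathbf{h}_j$; in the real mechanism $\mathbf{W}_s$, $\mathbf{W}_h$, and $\mathbf{v}$ are learned jointly with the rest of the network.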
6.3 Luong (Multiplicative) Attention (~4pp)
Dot-product and general variants. Luong et al. (2015) propose simpler alternatives to Bahdanau's additive score: dot-product, general (bilinear), and concat variants. Discusses global vs. local attention and computational advantages.
- 6.3.1 Dot-product attention
- 6.3.2 General (bilinear) attention
- 6.3.3 Global vs. local attention
- 6.3.4 Computational comparison with Bahdanau
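A quick sketch of the two multiplicative scores (Eqs 6.4 and 6.5), on toy 2-dim vectors, also makes the point of Exercise-style analysis concrete: the dot-product score is the general (bilinear) score with $\mathbf{W}_a$ fixed to the identity.

```python
def dot_score(s, h):
    # Eq 6.4: score(s_t, h_j) = s_t^T h_j
    return sum(a * b for a, b in zip(s, h))

def general_score(s, h, W_a):
    # Eq 6.5: score(s_t, h_j) = s_t^T W_a h_j
    Wh = [sum(w * hj for w, hj in zip(row, h)) for row in W_a]
    return dot_score(s, Wh)

s = [1.0, 2.0]
h = [0.5, -1.0]
I2 = [[1.0, 0.0], [0.0, 1.0]]
# With W_a = I, the general score reduces to the dot-product score.
assert abs(general_score(s, h, I2) - dot_score(s, h)) < 1e-12
```

The dot-product variant needs no parameters but requires $\mathbf{s}_t$ and $\mathbf{h}_j$ to share a dimension; the learned $\mathbf{W}_a$ lifts that restriction at the cost of one matrix multiply per score.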
6.4 Self-Attention (~5pp)
Attending within a single sequence. The key conceptual leap: instead of the decoder attending to encoder states, a sequence can attend to itself. Each position computes attention weights over all positions, allowing every word to directly interact with every other word.
- 6.4.1 From cross-attention to self-attention
- 6.4.2 The query-key-value intuition
- 6.4.3 Why self-attention changes everything
- 6.4.4 Parallelization advantages
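The conceptual leap from cross-attention to self-attention can be sketched as a single function over one sequence: every position scores every other position, and each output is a similarity-weighted mixture of the whole sequence. This is the bare form (unscaled, no learned projections); the toy embeddings are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Unscaled self-attention without learned projections:
    position i mixes all positions j, weighted by dot-product similarity."""
    d = len(X[0])
    out = []
    for x_i in X:
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in X]
        w = softmax(scores)
        out.append([sum(wj * x_j[k] for wj, x_j in zip(w, X)) for k in range(d)])
    return out

# Three toy 2-dim "word embeddings"; the first two are nearly identical.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
Y = self_attention(X)
assert len(Y) == len(X)  # same sequence length, every position updated
```

Note that all rows of the score matrix can be computed independently, which is the parallelization advantage of Section 6.4.4: unlike an RNN, no position waits on any other.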
6.5 Attention as a General Mechanism (~6pp)
Unifying view, attention patterns, visualization. Presents attention as a general differentiable memory access mechanism: given a query, compute similarity to stored keys, and retrieve a weighted combination of values.
- 6.5.1 The general attention formulation
- 6.5.2 Attention as soft dictionary lookup
- 6.5.3 Visualizing attention weights
- 6.5.4 What attention patterns reveal
- 6.5.5 From attention to Transformers (preview)
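The soft-dictionary-lookup view of Eq 6.6 can be sketched for a single query (toy keys and values for illustration; a real implementation stacks many queries into a matrix and batches the whole computation):

```python
import math

def attention(q, K, V):
    """Eq 6.6 for one query: soft dictionary lookup.
    Similarity of q to each key -> softmax weights -> weighted sum of values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w = [e / total for e in exps]
    d = len(V[0])
    return [sum(wi * v[k] for wi, v in zip(w, V)) for k in range(d)]

# A query very close to the first key retrieves (almost exactly) the first value.
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention([10.0, 0.0], K, V)
assert out[0] > 0.99
```

Unlike a hard dictionary, the lookup is differentiable: a query between two keys retrieves a blend of both values, which is what lets gradients flow through the retrieval step during training.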
Key Equations
Eq 6.1 — Bahdanau alignment score (Section 6.2)
$$e_{ij} = \mathbf{v}^\top \tanh(\mathbf{W}_s \mathbf{s}_{i-1} + \mathbf{W}_h \mathbf{h}_j)$$
Eq 6.2 — Attention weights via softmax (Section 6.2)
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$
Eq 6.3 — Context vector (Section 6.2)
$$\mathbf{c}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j$$
Eq 6.4 — Luong dot-product score (Section 6.3)
$$\text{score}(\mathbf{s}_t, \mathbf{h}_j) = \mathbf{s}_t^\top \mathbf{h}_j$$
Eq 6.5 — Luong general score (Section 6.3)
$$\text{score}(\mathbf{s}_t, \mathbf{h}_j) = \mathbf{s}_t^\top \mathbf{W}_a \mathbf{h}_j$$
Eq 6.6 — General attention formulation (Section 6.5)
$$\text{Attention}(\mathbf{q}, \mathbf{K}, \mathbf{V}) = \text{softmax}(\text{score}(\mathbf{q}, \mathbf{K})) \mathbf{V}$$
Key Figures
- Bottleneck problem diagram — Encoder RNN compressing a long sequence into a single vector $\mathbf{h}_T$, with a "squeeze" visual metaphor. Contrasts with the attention solution where the decoder can access all hidden states. (Architecture diagram, TikZ)
- Bahdanau attention diagram — Detailed architecture showing encoder hidden states $\mathbf{h}_1, \ldots, \mathbf{h}_T$, decoder state $\mathbf{s}_{i-1}$, alignment scores, softmax producing $\alpha_{ij}$, and weighted sum producing $\mathbf{c}_i$. (Architecture diagram, TikZ)
- Attention weight heatmap — French-to-English translation with attention weights visualized as a heatmap (source words on one axis, target words on the other). Shows approximate diagonal pattern with deviations at word reordering. (Heatmap, Matplotlib)
- Luong attention variants — Side-by-side comparison of the three Luong score functions (dot, general, concat) showing their computational graphs. (Architecture diagram, TikZ)
- Self-attention illustration — A single sentence where each word attends to all other words, with arrows showing attention from "it" to its antecedent. (Annotated sentence diagram, TikZ)
- Attention patterns visualization — 2x2 grid showing different learned attention patterns: diagonal (local), block (phrasal), long-range (coreference), uniform (global context). (Heatmap grid, Matplotlib)
Exercises
Theory (4 exercises)
- [Basic] Show that when attention weights are uniform ($\alpha_{ij} = 1/T_x$ for all $j$), the context vector $\mathbf{c}_i$ is simply the arithmetic mean of the encoder hidden states. Explain when this uniform distribution would occur and why it is suboptimal.
- [Intermediate] Prove that Luong's dot-product attention $\text{score}(\mathbf{s}_t, \mathbf{h}_j) = \mathbf{s}_t^\top \mathbf{h}_j$ is a special case of the general bilinear attention $\text{score}(\mathbf{s}_t, \mathbf{h}_j) = \mathbf{s}_t^\top \mathbf{W}_a \mathbf{h}_j$ by specifying $\mathbf{W}_a$. What constraint on $\mathbf{s}_t$ and $\mathbf{h}_j$ is required?
- [Intermediate] Analyze the computational complexity of computing attention for a single decoder step as a function of $T_x$ and $d$. Compare additive (Bahdanau) and multiplicative (Luong dot-product) attention, stating time and space complexity.
- [Advanced] In self-attention over a sequence of length $T$ with $d_{\text{model}} = 512$, the attention score matrix has $T^2$ entries. Show that for $T = 1000$, this matrix requires 4 MB of memory (assuming float32). Discuss why this quadratic scaling motivates efficient attention variants.
Programming (6 exercises)
- [Basic] Implement Bahdanau attention from scratch in PyTorch. Given 6 encoder hidden states of dimension 16 and a decoder state of dimension 16, compute alignment scores, attention weights, and context vector.
- [Intermediate] Implement Luong dot-product attention. Compare computation time of Bahdanau vs. Luong for source sequence lengths $T_x = 10, 50, 100, 500, 1000$. Plot computation time vs. sequence length.
- [Intermediate] Visualize attention weights for a pre-trained HuggingFace translation model (e.g., Helsinki-NLP/opus-mt-en-fr). Translate 3 English sentences, extract attention, and plot heatmaps.
- [Intermediate] Implement self-attention (without scaling or learned projections) on 8 word embeddings. Visualize the 8x8 attention matrix. Compare cosine similarity matrices before and after self-attention.
- [Advanced] Build a minimal encoder-decoder translation model with and without Bahdanau attention. Train both on a toy parallel corpus. Compare BLEU scores and attention heatmaps.
- [Advanced] Experiment with attention temperature: modify softmax to use $\text{softmax}(e / \tau)$ with $\tau \in \{0.1, 0.5, 1.0, 2.0, 5.0\}$. Visualize how attention distributions change and plot entropy vs. $\tau$.
Cross-References
This chapter references:
- Chapter 1 (Section 1.1) — The prediction paradigm
- Chapter 5 (Sections 5.1, 5.3) — RNNs and LSTMs as the encoder/decoder that attention augments; the fixed-size hidden state $\mathbf{h}_T$ that creates the bottleneck
This chapter is referenced by:
- Chapter 7 (Section 7.1) — The encoder-decoder architecture with attention forms the basis of sequence-to-sequence models
- Chapter 8 (Sections 8.1, 8.2) — The Transformer replaces recurrence entirely with self-attention; scaled dot-product attention descends directly from Luong dot-product attention
Key Papers
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of ICLR.
- Luong, T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. Proceedings of EMNLP, 1412–1421.
- Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. Proceedings of SSST-8, 103–111.
- Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 5998–6008.
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems, 3104–3112.
- Xu, K., Ba, J., Kiros, R., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of ICML, 2048–2057.
- Cho, K., van Merrienboer, B., Gulcehre, C., et al. (2014). Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of EMNLP, 1724–1734.
- Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT's Attention. Proceedings of ACL Workshop on BlackboxNLP, 276–286.