Part II · Chapter 7

Sequence-to-Sequence and Decoding

Part II: Neural Language Models · Depth: Moderate · Length: ~20 pp · Phase 2
Why this chapter matters for prediction: A language model assigns probabilities to words, but the decoding strategy determines what text is actually produced. This chapter addresses "the other half of prediction" — how to turn probability distributions into sequences. Whether a model produces boring, repetitive text or diverse, creative output often depends as much on how we decode from the model as on the model itself.

Prerequisites

  • Chapter 6 (Sections 6.2–6.3) — the Bahdanau and Luong attention mechanisms, used by the decoder throughout
  • Chapter 1 (Section 1.1) — the prediction paradigm, applied here to conditional generation

Summary

Chapter 7 assembles the attention mechanism from Chapter 6 into a complete encoder-decoder system for conditional text generation and then addresses the two critical questions that any generation system must answer: how to train (teacher forcing and its pitfalls) and how to decode (greedy, beam search, sampling). The chapter treats decoding as "the other half of prediction" — the language model assigns probabilities, but the decoding strategy determines what text is actually produced. Machine translation serves as the running case study, connecting the historical development of attention and seq2seq with empirical evaluation via BLEU. The chapter is designated MODERATE depth, serving as the practical engineering bridge between the attention mechanism (Ch 6) and the Transformer architecture (Ch 8), where the encoder-decoder framework will be fully generalized.

Learning Objectives

  1. Describe the encoder-decoder architecture for conditional text generation and explain how the encoder produces representations that the decoder conditions on via attention
  2. Explain the training-inference discrepancy caused by teacher forcing, identify exposure bias as its consequence, and describe scheduled sampling as a mitigation strategy
  3. Implement and compare decoding strategies — greedy search, beam search, top-$k$ sampling, and nucleus (top-$p$) sampling — and explain the quality-diversity tradeoff each embodies
  4. Compute BLEU score for a generated translation and critically evaluate the strengths and weaknesses of automated evaluation metrics for text generation

Section Outline

7.1 Encoder-Decoder Architecture (~5pp)

Framework for conditional generation. Formalizes the encoder-decoder pattern: an encoder reads an input sequence and produces a set of representations, and a decoder generates an output sequence one token at a time, conditioned on the encoder representations via attention.

  • 7.1.1 The encoder: from input to representations
  • 7.1.2 The decoder: autoregressive generation
  • 7.1.3 Conditioning via attention
  • 7.1.4 The general seq2seq framework
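The conditioning step of 7.1.3 can be sketched in a few lines. The toy below uses unparameterized dot-product attention over fixed encoder vectors; the chapter's models use learned Bahdanau/Luong attention, and all vector values here are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, encoder_states):
    """Dot-product attention: weight each encoder state by its
    similarity to the decoder query, then take the weighted average."""
    weights = softmax([dot(query, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy "encoder output": one vector per source token.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]  # decoder state at the current step
context, weights = attend(query, encoder_states)
```

At each decoder step, `context` is concatenated with the decoder state before predicting the next token; the `weights` are what Figure 5's alignment heatmap visualizes.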

7.2 Teacher Forcing and Exposure Bias (~4pp)

Training vs. inference discrepancy. During training, the decoder receives ground-truth previous tokens (teacher forcing). During inference, it receives its own predictions. This mismatch causes exposure bias.

  • 7.2.1 Teacher forcing: fast training, hidden cost
  • 7.2.2 Exposure bias: the train-test gap
  • 7.2.3 Scheduled sampling
  • 7.2.4 Other mitigation strategies
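The input-selection logic that distinguishes teacher forcing from scheduled sampling fits in a short sketch. The `model` dictionary and token names below are hypothetical stand-ins for a real decoder:

```python
import random

def decode_step(prev_token, model):
    """Stand-in for one decoder step: a toy model that maps the
    previous token to its prediction."""
    return model.get(prev_token, "<unk>")

def train_inputs(targets, model, sampling_prob, rng):
    """Choose the decoder's input tokens for one training sequence.

    sampling_prob = 0.0 -> pure teacher forcing (always ground truth)
    sampling_prob = 1.0 -> always the model's own prediction
    Scheduled sampling anneals sampling_prob upward during training."""
    inputs, prev = [], "<bos>"
    for gold in targets:
        inputs.append(prev)
        pred = decode_step(prev, model)
        prev = pred if rng.random() < sampling_prob else gold
    return inputs

model = {"<bos>": "a", "a": "b", "b": "c"}
rng = random.Random(0)
tf = train_inputs(["a", "b", "c"], model, sampling_prob=0.0, rng=rng)
# With teacher forcing, the inputs are <bos> plus the gold prefix,
# regardless of what the model predicts.
```

The exposure-bias problem is visible in the code: with `sampling_prob = 0.0` the model never sees its own (possibly wrong) predictions as inputs during training, yet at inference time it sees nothing else.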

7.3 Decoding Strategies (~5pp)

Greedy, beam search, top-$k$, top-$p$ (nucleus) sampling. How to turn a probability distribution over the vocabulary into actual text, and the quality-diversity tradeoff inherent in each strategy.

  • 7.3.1 Greedy decoding
  • 7.3.2 Beam search
  • 7.3.3 Top-$k$ sampling
  • 7.3.4 Nucleus (top-$p$) sampling
  • 7.3.5 Temperature and the quality-diversity tradeoff
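The filtering step behind top-$k$ and nucleus sampling can be sketched directly; the sampling itself (drawing from the renormalized distribution) is omitted, and the toy distribution is hand-picked for illustration:

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# A peaked toy distribution over a 5-token vocabulary.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]
greedy = max(range(len(probs)), key=lambda i: probs[i])  # always token 0
```

Note the key difference: top-$k$ keeps a fixed number of tokens whether the distribution is peaked or flat, while the nucleus adapts its size to the distribution's shape — the insight behind Holtzman et al. (2020).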

7.4 Evaluation of Generated Text (~3pp)

BLEU, ROUGE, METEOR, human evaluation. How to measure the quality of generated text, the limitations of all automated metrics, and the gold standard of human evaluation.

  • 7.4.1 BLEU score
  • 7.4.2 ROUGE and METEOR
  • 7.4.3 Limitations of automated metrics
  • 7.4.4 Human evaluation and LLM-as-judge

7.5 Machine Translation as a Case Study (~3pp)

The task that drove seq2seq development. Machine translation as the historical proving ground: from phrase-based SMT to neural MT, with a complete example and attention visualization.

  • 7.5.1 From phrase-based to neural MT
  • 7.5.2 A complete translation example
  • 7.5.3 Attention alignment in translation
  • 7.5.4 The transition to Transformers

Key Equations

Eq 7.1 — Beam search score (Section 7.3)

$$\text{score}(\mathbf{y} \mid \mathbf{x}) = \sum_{t=1}^{S} \log P(y_t \mid y_{<t}, \mathbf{x})$$

Eq 7.2 — Length-normalized beam score (Section 7.3)

$$\text{score}_{\text{norm}}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{S^\alpha} \sum_{t=1}^{S} \log P(y_t \mid y_{<t}, \mathbf{x})$$
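A quick numerical check of why the normalization in Eq 7.2 matters: without it, every added token can only lower the sum of log-probabilities, so beam search systematically favors shorter hypotheses. The per-token probabilities below are illustrative; $\alpha \approx 0.6$–$0.7$ is a common setting (e.g. in Wu et al., 2016).

```python
import math

def beam_score(log_probs, alpha=0.0):
    """Sum of token log-probabilities divided by length^alpha.
    alpha = 0 gives the raw (unnormalized) beam score of Eq 7.1."""
    return sum(log_probs) / (len(log_probs) ** alpha)

short = [math.log(0.6)] * 4  # 4 fairly confident tokens
long = [math.log(0.7)] * 8   # 8 slightly more confident tokens
```

Unnormalized, the short hypothesis wins despite lower per-token confidence; with length normalization the ranking flips.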

Eq 7.3 — BLEU precision (Section 7.4)

$$p_n = \frac{\sum_{\mathbf{y} \in \hat{Y}} \sum_{\text{n-gram} \in \mathbf{y}} \min(C_{\text{pred}}(\text{n-gram}), C_{\text{ref}}(\text{n-gram}))}{\sum_{\mathbf{y} \in \hat{Y}} \sum_{\text{n-gram} \in \mathbf{y}} C_{\text{pred}}(\text{n-gram})}$$

Eq 7.4 — BLEU score (Section 7.4)

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \quad \text{where } \text{BP} = \min\left(1, \exp\left(1 - \frac{r}{c}\right)\right)$$
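Eqs 7.3–7.4 can be computed directly with a `Counter`. This is a simplified single-sentence, single-reference version without smoothing, for illustration only; standard BLEU is corpus-level (a library such as sacreBLEU is the practical choice).

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(pred, ref, max_n=4):
    """Sentence-level BLEU with uniform weights and brevity penalty."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        p_counts, r_counts = ngrams(pred, n), ngrams(ref, n)
        clipped = sum(min(c, r_counts[g]) for g, c in p_counts.items())
        total = sum(p_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_p += math.log(clipped / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(pred)))
    return bp * math.exp(log_p)

ref = "the cat sat on the mat".split()
good = "the cat sat on the mat".split()
```

The clipping (`min` over counts) prevents a candidate from earning credit for repeating a reference word more often than it appears; the early return on a zero precision is why sentence-level BLEU is brittle for short outputs.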

Eq 7.5 — Top-$p$ (nucleus) sampling (Section 7.3)

$$V_p = \arg\min_{V' \subseteq V} \left\{ |V'| : \sum_{w \in V'} P(w \mid \text{context}) \geq p \right\}$$

Eq 7.6 — Temperature scaling (Section 7.3)

$$P_\tau(w \mid \text{context}) = \frac{\exp(z_w / \tau)}{\sum_{w'} \exp(z_{w'} / \tau)}$$
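Eq 7.6 in code, showing the two limiting behaviors derived in Theory Exercise 2 (the logit values are illustrative):

```python
import math

def temperature_softmax(logits, tau):
    """Softmax over logits scaled by temperature tau (Eq 7.6)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
cold = temperature_softmax(logits, 0.1)   # tau -> 0: approaches argmax
warm = temperature_softmax(logits, 1.0)   # tau = 1: unchanged softmax
hot = temperature_softmax(logits, 10.0)   # tau -> inf: approaches uniform
```

Low temperature sharpens the distribution toward greedy decoding; high temperature flattens it toward uniform random sampling, trading quality for diversity.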

Key Figures

  1. Encoder-decoder architecture — Full architecture diagram showing the encoder (RNN/LSTM), attention connections, and the decoder (autoregressive generation with attention context). (Architecture diagram, TikZ)
  2. Teacher forcing diagram — Side-by-side: training mode (ground-truth tokens) vs. inference mode (model's own predictions). Highlights the discrepancy. (Dual architecture diagram, TikZ)
  3. Beam search tree — Tree diagram with $B=3$ over 3 time steps, showing how candidates are expanded and pruned with log-probabilities annotated. (Tree diagram, TikZ)
  4. Decoding strategies comparison — Example outputs from the same model using greedy, beam ($B=5$), top-$k$ ($k=50$), and nucleus ($p=0.9$) decoding. (Text comparison, TikZ)
  5. MT alignment example — French-to-English translation showing source, target, attention heatmap, and how attention captures word reordering. (Heatmap, Matplotlib)

Exercises

Theory (3 exercises)

  1. [Basic] Prove that greedy decoding is a special case of beam search with beam width $B=1$. Formally define both algorithms and show their equivalence.
  2. [Intermediate] Show that as temperature $\tau \to 0$, sampling from $P_\tau(w) = \text{softmax}(z / \tau)$ converges to greedy decoding (argmax). What happens as $\tau \to \infty$? Derive both limiting cases.
  3. [Intermediate] Analyze the time complexity of beam search as a function of beam width $B$, vocabulary size $|V|$, and target sequence length $S$. Compare with greedy decoding.

Programming (5 exercises)

  1. [Basic] Implement greedy decoding and beam search ($B=5$) for a pre-trained HuggingFace translation model. Translate 5 English sentences into German with both methods and compare BLEU scores.
  2. [Intermediate] Implement top-$k$ sampling and nucleus sampling from scratch. Generate 10 continuations of "The meaning of life is" using greedy, beam ($B=5$), top-$k$ ($k=50$), and nucleus ($p=0.9$). Rate each for quality and diversity.
  3. [Intermediate] Compute BLEU scores for 20 translation pairs. Plot BLEU vs. sentence length. Identify cases where BLEU disagrees with your own judgment and compute correlation with human ratings.
  4. [Advanced] Build a minimal seq2seq model with attention for a toy task: reversing digit sequences. Train with teacher forcing. Visualize attention weights and verify the model learns to attend in reverse order.
  5. [Advanced] Compare teacher forcing with scheduled sampling on the sequence reversal task. Compare training loss curves, test accuracy on lengths 10 and 20, and robustness to injected errors.

Cross-References

This chapter references:

  • Chapter 1 (Section 1.1) — The prediction paradigm applied to conditional generation
  • Chapter 6 (Sections 6.2–6.3) — Bahdanau and Luong attention mechanisms used in the decoder

This chapter is referenced by:

  • Chapter 8 (Section 8.1) — The encoder-decoder framework is generalized by the Transformer, which replaces the RNN components with self-attention

Key Papers

  • Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems (NeurIPS), 3104–3112.
  • Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. Proceedings of ICLR.
  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of ACL, 311–318.
  • Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. Advances in Neural Information Processing Systems (NeurIPS), 1171–1179.
  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of ICLR.
  • Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, 74–81.
  • Banerjee, S. & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of ACL Workshop, 65–72.
  • Wu, Y., Schuster, M., Chen, Z., et al. (2016). Google's Neural Machine Translation System. arXiv:1609.08144.
  • Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the Role of BLEU in Machine Translation Research. Proceedings of EACL, 249–256.
  • Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical Neural Story Generation. Proceedings of ACL, 889–898.