The Transformer Architecture
Prerequisites
- Chapter 6: The Attention Revolution — Bahdanau/Luong attention, self-attention introduction, and the query-key-value abstraction
- Chapter 7: Seq-to-Seq and Decoding — The encoder-decoder framework that the Transformer generalizes
Summary
Chapter 8 is the centerpiece of the entire book. It presents the Transformer architecture — the model that replaced recurrence with self-attention and enabled the modern era of large language models. The chapter derives scaled dot-product attention from first principles (including the critical $1/\sqrt{d_k}$ scaling factor), introduces multi-head attention as a mechanism for capturing diverse relationship types in parallel, addresses the permutation-equivariance problem through four positional encoding strategies (sinusoidal, learned, RoPE, ALiBi), assembles the complete Transformer block (self-attention, residual connections, layer normalization, feed-forward network), and maps the three architectural variants (encoder-only, decoder-only, encoder-decoder) to their respective model families (BERT, GPT, T5). Everything before this chapter builds toward the Transformer. Everything after builds upon it. This is where the "Predicting the Next Words" narrative converges into its modern form.
Learning Objectives
- Derive scaled dot-product attention from first principles, explaining why the $1/\sqrt{d_k}$ scaling factor is necessary to prevent softmax saturation for large key dimensions
- Explain multi-head attention: how splitting queries, keys, and values into $h$ heads enables the model to attend to information from different representation subspaces simultaneously
- Compare positional encoding strategies — sinusoidal, learned, rotary (RoPE), and ALiBi — and articulate the tradeoffs between absolute, relative, and extrapolatable position representations
- Trace the forward pass through a complete Transformer block (self-attention, add-and-norm, feed-forward, add-and-norm) and explain the roles of residual connections, layer normalization, and the feed-forward network
Section Outline
8.1 From Recurrence to Attention (~4pp)
Why remove recurrence entirely. RNNs process tokens sequentially ($O(T)$ serial steps), while self-attention relates all pairs simultaneously ($O(1)$ serial depth). The key insight of Vaswani et al. (2017): if self-attention can relate any two positions directly, why keep the RNN at all?
- 8.1.1 The parallelization problem with RNNs
- 8.1.2 Self-attention as a replacement for recurrence
- 8.1.3 The Transformer hypothesis: attention is all you need
8.2 Scaled Dot-Product Attention (~6pp)
Query-key-value framework, scaling factor. Formalizes attention in matrix form: $\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})\mathbf{V}$. Derives why the scaling factor is needed: without it, the variance of the dot products grows linearly with $d_k$ (so their typical magnitude grows as $\sqrt{d_k}$), pushing the softmax into saturation.
- 8.2.1 Queries, keys, and values as linear projections
- 8.2.2 The dot-product attention formula
- 8.2.3 Why scale by $\sqrt{d_k}$?
- 8.2.4 Masking for autoregressive models
- 8.2.5 Numerical walkthrough
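The numerical walkthrough in 8.2.5 can be anchored by a minimal NumPy sketch of Eq 8.1 (illustrative only; shapes and the `-1e9` masking constant are conventions chosen here, not the chapter's reference implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Eq 8.1: softmax(Q K^T / sqrt(d_k)) V, with an optional boolean mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (T, T) similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # masked entries -> ~0 weight
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

# Tiny example: T = 4 positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 8)
assert np.allclose(w.sum(axis=-1), 1.0)
```

Note that the output is a convex combination of the rows of V: each output row mixes value vectors according to one row of attention weights.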
8.3 Multi-Head Attention (~5pp)
Parallel attention heads, concatenation, projection. Project $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ into $h$ subspaces, perform attention in parallel, concatenate, and project back. Each head can learn to attend to different types of relationships.
- 8.3.1 The motivation for multiple heads
- 8.3.2 Projection into subspaces
- 8.3.3 Parallel attention and concatenation
- 8.3.4 What different heads learn
- 8.3.5 Parameter count analysis
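The split-attend-concat-project pipeline of 8.3.2–8.3.3 can be sketched in a few lines of NumPy (a didactic sketch, assuming $d_k = d_{\text{model}}/h$ and omitting biases and masking; the weight names $W_q, W_k, W_v, W_o$ are local conventions):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """Eq 8.2 sketch: split projections into h heads, attend in parallel,
    concatenate the head outputs, and apply the output projection."""
    T, d_model = x.shape
    d_k = d_model // h
    def split(z):  # (T, d_model) -> (h, T, d_k)
        return z.reshape(T, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, T, T)
    scores -= scores.max(axis=-1, keepdims=True)          # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                   # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model) # Concat(head_1..head_h)
    return concat @ Wo

rng = np.random.default_rng(1)
d_model, h, T = 16, 4, 6
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
x = rng.standard_normal((T, d_model))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, h)
assert y.shape == (T, d_model)
```

This also makes the parameter count of 8.3.5 concrete: the four projection matrices contribute $4 d_{\text{model}}^2$ weights per attention layer, independent of $h$.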
8.4 Position Encodings (~5pp)
Self-attention is permutation-equivariant — it has no notion of word order. Four solutions: sinusoidal (Vaswani et al., 2017), learned absolute (BERT, GPT), rotary (RoPE, Su et al., 2021), and ALiBi (Press et al., 2022).
- 8.4.1 Why position matters
- 8.4.2 Sinusoidal position encodings
- 8.4.3 Learned position embeddings
- 8.4.4 Rotary Position Embeddings (RoPE)
- 8.4.5 ALiBi and relative position biases
- 8.4.6 Comparison and extrapolation
8.5 The Full Transformer Block (~6pp)
Multi-head self-attention, residual connections, layer normalization, and the position-wise feed-forward network. Assembles the complete block and discusses Pre-LN vs. Post-LN configurations.
- 8.5.1 The attention sublayer
- 8.5.2 Residual connections and their role
- 8.5.3 Layer normalization (Pre-LN vs. Post-LN)
- 8.5.4 The feed-forward sublayer
- 8.5.5 Putting it all together: the Transformer block
- 8.5.6 Stacking blocks: depth and capacity
8.6 Encoder, Decoder, and Encoder-Decoder Variants (~4pp)
Three Transformer variants: encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5). The causal mask, cross-attention, and choosing an architecture.
- 8.6.1 Encoder-only Transformers
- 8.6.2 Decoder-only Transformers
- 8.6.3 Encoder-decoder Transformers
- 8.6.4 The causal mask
- 8.6.5 Choosing an architecture (preview of Ch 9)
Key Equations
Eq 8.1 — Scaled dot-product attention (Section 8.2)
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$
Eq 8.2 — Multi-head attention (Section 8.3)
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$
where $\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$
Eq 8.3 — Sinusoidal positional encoding (Section 8.4)
$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad \text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Eq 8.4 — Rotary Position Embedding (RoPE) (Section 8.4)
$$f_q(\mathbf{x}_m, m) = \mathbf{R}_\Theta^m \mathbf{W}_q \mathbf{x}_m$$
where $\mathbf{R}_\Theta^m$ is a rotation matrix encoding position $m$
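The relative-position property that makes Eq 8.4 attractive can be checked in a single 2D frequency pair: because $(\mathbf{R}_\Theta^m \mathbf{q})^\top (\mathbf{R}_\Theta^n \mathbf{k}) = \mathbf{q}^\top \mathbf{R}_\Theta^{n-m} \mathbf{k}$, the score depends only on the offset $n - m$. A sketch (one rotation pair with an arbitrary frequency $\theta = 1$, omitting the $\mathbf{W}_q$ projection):

```python
import numpy as np

def rope_rotate(x, pos, theta=1.0):
    """Rotate a 2D vector by angle pos * theta (one RoPE frequency pair)."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    R = np.array([[c, -s], [s, c]])
    return R @ x

q, k = np.array([1.0, 0.5]), np.array([0.3, -0.7])
# Relative-position property: the score depends only on the offset between positions.
s1 = rope_rotate(q, 3) @ rope_rotate(k, 1)   # positions (3, 1), offset -2
s2 = rope_rotate(q, 8) @ rope_rotate(k, 6)   # positions (8, 6), same offset -2
assert np.isclose(s1, s2)
```

This is exactly the invariance the RoPE programming exercise asks readers to verify for a full multi-head module.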
Eq 8.5 — Feed-forward sublayer (Section 8.5)
$$\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$
Eq 8.6 — Residual connection + layer norm (Section 8.5)
Pre-LN: $\quad \mathbf{x} + \text{Sublayer}(\text{LayerNorm}(\mathbf{x}))$
Post-LN: $\quad \text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x}))$
Eq 8.7 — Layer normalization (Section 8.5)
$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \quad \text{where } \mu = \frac{1}{d}\sum_i x_i, \quad \sigma^2 = \frac{1}{d}\sum_i (x_i - \mu)^2$$
Key Figures
- Full Transformer architecture — The complete Vaswani et al. (2017) architecture showing both encoder and decoder stacks, with self-attention, cross-attention, feed-forward layers, residual connections, and normalization. (Architecture diagram, TikZ)
- Scaled dot-product attention — Matrix computation diagram: Q, K, V matrices, the $\mathbf{Q}\mathbf{K}^\top$ matmul, scaling by $1/\sqrt{d_k}$, softmax, and multiplication by V. (Computation flow, TikZ)
- Multi-head attention — Input projected into $h$ parallel heads, each computing independent attention, concatenation, and final projection. (Architecture diagram, TikZ)
- Sinusoidal positional encoding — Heatmap visualization showing PE values for positions 0–127 and dimensions 0–63, revealing sinusoidal wave patterns at different frequencies. (Heatmap, Matplotlib)
- RoPE visualization — Query and key vectors rotated in 2D subspaces, with rotation angle proportional to position. (Geometric diagram, TikZ/Matplotlib)
- Encoder block — Single encoder block: input → multi-head self-attention → add & norm → FFN → add & norm → output. (Block diagram, TikZ)
- Decoder block with masking — Single decoder block: masked self-attention → add & norm → cross-attention → add & norm → FFN → add & norm. Shows the causal mask as a triangular matrix. (Block diagram, TikZ)
- Residual connections — Gradient flow through a stack of layers with residual connections vs. a plain network. Shows short-circuit paths for gradients. (Flow diagram, TikZ)
Exercises
Theory (4 exercises)
- [Basic] Prove that without the $1/\sqrt{d_k}$ scaling factor, the variance of the dot product $\mathbf{q}^\top \mathbf{k}$ grows linearly with $d_k$. Assume $q_i$ and $k_i$ are independent with mean 0 and variance 1.
- [Intermediate] Compute the total parameter count of a Transformer with $d_{\text{model}} = 512$, $h = 8$, $d_{\text{ff}} = 2048$, $N = 6$ layers, vocabulary $|V| = 32000$. Break down by component. Assume weight tying.
- [Intermediate] Show that the sinusoidal positional encoding allows representing $\text{PE}(pos + k)$ as a linear transformation of $\text{PE}(pos)$ for any fixed offset $k$. Derive the $2 \times 2$ rotation matrix.
- [Advanced] Analyze the computational complexity of self-attention: show time is $O(T^2 \cdot d)$ and space is $O(T^2 + T \cdot d)$. For $T = 4096$, $d = 4096$, compute FLOPs and memory in GB (float16).
Programming (6 exercises)
- [Basic] Implement scaled dot-product attention from scratch in PyTorch. Test on input of shape (batch=2, T=8, $d_{\text{model}}$=16). Verify each row of attention weights sums to 1.
- [Intermediate] Implement multi-head attention with $h=4$ heads. Apply to a sentence encoded as pre-trained word embeddings. Visualize attention patterns for each head and identify local vs. long-range patterns.
- [Intermediate] Compute sinusoidal positional encodings for 256 positions and $d_{\text{model}}=128$. Visualize: (1) PE matrix as heatmap, (2) cosine similarity between PE vectors, (3) similarity vs. distance $|i-j|$.
- [Intermediate] Build a complete Transformer block (Pre-LN). Stack 4 blocks. Verify output shape matches input shape and all parameter gradients are non-zero after backprop.
- [Advanced] Train a small Transformer language model (4 layers, $d_{\text{model}}=128$, $h=4$) on WikiText-2. Report perplexity. Generate sample text. Compare with an LSTM baseline.
- [Advanced] Implement RoPE from scratch for a 2-head attention module. Verify the relative position property: attention scores between $(i, j)$ and $(i+5, j+5)$ should be equal.
Cross-References
This chapter references:
- Chapter 1 (Section 1.1) — The prediction paradigm
- Chapter 6 (Sections 6.2–6.5) — Bahdanau and Luong attention, self-attention introduction, the query-key-value abstraction
- Chapter 7 (Section 7.1) — The encoder-decoder framework that the Transformer generalizes
This chapter is referenced by:
- Chapter 9 (Sections 9.1–9.5) — Pre-training paradigms build directly on the Transformer: BERT uses the encoder, GPT uses the decoder, T5 uses the full encoder-decoder
- Chapter 10 (Sections 10.1–10.4) — Tokenization is a critical pre-processing step for Transformer-based models
Key Papers
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS), 5998–6008.
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
- Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. Proceedings of ICLR.
- Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT's Attention. Proceedings of ACL Workshop on BlackboxNLP, 276–286.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of CVPR, 770–778.