Part III · Chapter 10

Tokenization and Data at Scale

Part III: The Transformer Revolution · Moderate · ~20pp · Phase 2
Why this chapter matters for prediction: Tokenization defines what the model predicts — whether it predicts words, subwords, or bytes. This choice cascades through vocabulary size, sequence length, computational cost, and multilingual fairness. Data quality determines what patterns are available to learn. These often-overlooked infrastructure decisions shape every downstream capability of a language model.

Prerequisites

  • Chapter 8: The Transformer Architecture — Tokenization feeds directly into the Transformer's embedding layer; sequence length after tokenization determines the computational cost of self-attention (quadratic in sequence length)

Summary

Chapter 10 addresses the often-overlooked infrastructure that determines what a language model can learn: how text is split into tokens and how training data is curated at scale. Tokenization defines the unit of prediction — whether the model predicts words, subwords, or bytes — and this choice cascades through vocabulary size, sequence length, computational cost, and multilingual fairness. The chapter presents Byte-Pair Encoding (BPE) as the dominant algorithm, introduces SentencePiece and Unigram as alternatives, analyzes the fertility problem that penalizes non-English languages, and covers the data curation pipeline (web crawling, deduplication, quality filtering, data mixing) that produces corpora like C4, The Pile, and RefinedWeb. This is a unique differentiator for the book: no competing textbook dedicates a full chapter to linking tokenization choices and data quality to language model performance.

Learning Objectives

  1. Implement the Byte-Pair Encoding (BPE) algorithm from scratch, trace its merge operations on a sample corpus, and explain why subword tokenization solves the open-vocabulary problem that word-level models cannot handle
  2. Compare BPE, WordPiece, Unigram, and SentencePiece tokenization methods in terms of their algorithmic approach, vocabulary construction, and impact on downstream model performance
  3. Analyze how tokenization choices affect model behavior across languages, domains, and efficiency metrics, including fertility (tokens per word), compression ratio, and coverage of rare words
  4. Describe the key stages of a modern data curation pipeline — web crawling, deduplication, quality filtering, data mixing — and explain how each stage affects pre-training quality

Section Outline

10.1 From Words to Subwords (~3pp)

Why word-level tokenization fails at scale: the open-vocabulary problem, the long tail of rare words, morphological productivity, and out-of-vocabulary (OOV) words. Character-level as the other extreme. Subword tokenization as the principled middle ground.

  • 10.1.1 The open-vocabulary problem
  • 10.1.2 Word-level vs. character-level trade-offs
  • 10.1.3 The subword hypothesis
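The open-vocabulary failure mode in 10.1.1 can be shown in a few lines. This is a toy sketch (the corpus, vocabulary, and `encode_word_level` helper are illustrative, not from any real tokenizer): a word-level vocabulary built from training text collapses every unseen word to a single `<unk>` id, even when the unseen words are transparent variants of known ones.

```python
# Toy illustration of the open-vocabulary problem: a word-level
# vocabulary built from a small corpus cannot represent unseen words,
# so they all collapse to one <unk> token.
train_corpus = "the cat sat on the mat".split()
vocab = {"<unk>": 0}
for word in train_corpus:
    vocab.setdefault(word, len(vocab))

def encode_word_level(text):
    """Map each whitespace word to its vocabulary id, or <unk> if absent."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

# "cats" and "mats" are morphological variants of known words,
# yet word-level encoding loses them entirely.
print(encode_word_level("the cats sat on mats"))  # [1, 0, 3, 4, 0]
```

A subword tokenizer would instead decompose "cats" into pieces like "cat" + "s", preserving the shared stem.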

10.2 Byte-Pair Encoding (BPE) (~5pp)

The BPE algorithm (Sennrich et al., 2016) in detail: character-level initialization, iterative merging of the most frequent pair, vocabulary size as a hyperparameter. GPT-2's byte-level BPE variant. Practical considerations.

  • 10.2.1 The BPE algorithm step by step
  • 10.2.2 Byte-level BPE (GPT-2 variant)
  • 10.2.3 Implementation and practical considerations
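The merge loop of 10.2.1 can be sketched compactly. This is a minimal training-side implementation (the toy corpus and function names are illustrative; a production tokenizer like GPT-2's also handles byte-level initialization, pre-tokenization, and tie-breaking rules): each iteration counts adjacent symbol pairs weighted by word frequency, merges the most frequent pair per Eq. 10.1, and records it.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a dict of {word: frequency}.
    Words start as tuples of characters; each step merges the
    most frequent adjacent symbol pair (Eq. 10.1)."""
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # arg max count(a, b)
        merges.append(best)
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

# First merge is ('u', 'g') (count 20), then ('h', 'ug') (count 15).
print(learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2))
```

Applying the learned merges in order to new text yields the encoder side of the algorithm; vocabulary size is controlled simply by `num_merges`.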

10.3 SentencePiece and Unigram (~4pp)

SentencePiece: language-independent tokenization operating on raw text. The Unigram language model approach: start with a large vocabulary, iteratively prune tokens that contribute least to corpus likelihood. Subword regularization.

  • 10.3.1 SentencePiece: language-independent preprocessing
  • 10.3.2 The Unigram language model tokenizer
  • 10.3.3 Subword regularization and its benefits

10.4 The Impact of Tokenization (~4pp)

Vocabulary size vs. sequence length trade-off, fertility across languages (English-centric tokenizers penalize non-Latin scripts), tokenization of code and math, and emerging tokenizer-free approaches.

  • 10.4.1 Vocabulary size vs. sequence length
  • 10.4.2 Multilingual tokenization and the fertility problem
  • 10.4.3 Toward tokenizer-free models

10.5 Data Curation at Scale (~4pp)

Common Crawl, The Pile, RedPajama, RefinedWeb. Key pipeline stages: web scraping, deduplication, quality filtering, content filtering, and data mixing. How data quality and diversity affect pre-training outcomes.

  • 10.5.1 Data sources: web crawls, books, code
  • 10.5.2 Deduplication and quality filtering
  • 10.5.3 Data mixing and domain balancing
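Two of the pipeline stages in 10.5.2 can be sketched in miniature. This is a toy version (the documents, thresholds, and function names are illustrative; real pipelines such as RefinedWeb's add fuzzy deduplication, perplexity scoring, and many more heuristics): exact deduplication via hashing of whitespace-normalized text, followed by a crude quality filter on length and alphabetic-character fraction.

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicates by hashing whitespace-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def quality_filter(docs, min_words=5):
    """Toy quality filter: require a minimum word count and a
    majority of alphabetic characters."""
    kept = []
    for doc in docs:
        alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
        if len(doc.split()) >= min_words and alpha > 0.5:
            kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown fox jumps over the lazy dog.",  # duplicate up to whitespace
    "buy now!!! $$$ 1234 5678",                       # low alphabetic fraction
]
clean = quality_filter(dedup_exact(docs))
print(len(clean))  # 1
```

Measuring how many documents each stage removes, as in Programming Exercise 4, is the standard way to audit such a pipeline.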

Key Equations

Eq 10.1 — BPE merge operation (Section 10.2)

$$\text{pair}^* = \arg\max_{(a,b) \in \text{pairs}(V)} \text{count}(a, b)$$

At each step, merge the most frequent adjacent pair $(a, b)$ into a new symbol $ab$ and add $ab$ to vocabulary $V$.

Eq 10.2 — Unigram language model probability (Section 10.3)

$$P(\mathbf{x}) = \prod_{i=1}^{|\mathbf{x}|} p(x_i)$$

The optimal segmentation maximizes:

$$\mathbf{x}^* = \arg\max_{\mathbf{x} \in S(w)} \sum_{i=1}^{|\mathbf{x}|} \log p(x_i)$$

where $S(w)$ is the set of all valid segmentations of word $w$.
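The arg max over $S(w)$ need not enumerate all segmentations: a Viterbi-style dynamic program finds the best one in $O(|w|^2)$ piece lookups. This sketch assumes a given table of unigram log-probabilities (the table values and function name are illustrative; training those probabilities via EM and pruning is what 10.3.2 covers).

```python
import math

def best_segmentation(word, logp):
    """Viterbi search for the segmentation of `word` maximizing the
    sum of unigram log-probabilities (Eq. 10.2)."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i]: start of the last piece
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    tokens, i = [], n             # recover the best path
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical unigram log-probabilities.
logp = {"un": -2.0, "related": -3.0, "relate": -3.5, "d": -1.5,
        "u": -4.0, "n": -4.0, "r": -5.0}
print(best_segmentation("unrelated", logp))  # ['un', 'related']
```

Here "un" + "related" scores $-5.0$, beating "un" + "relate" + "d" at $-7.0$; subword regularization (10.3.3) would instead sample among such candidate segmentations rather than always taking the best.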

Eq 10.3 — Fertility (Section 10.4)

$$\text{fertility}(t) = \frac{\text{number of subword tokens in } t}{\text{number of whitespace-delimited words in } t}$$

A simple metric for comparing tokenizers' encoding efficiency on a text $t$: the average number of tokens produced per word. Lower fertility means more efficient encoding.
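Computing fertility for any tokenizer is a one-liner once the tokenizer is callable. The stand-in tokenizer below is purely illustrative (it splits long words in half to mimic how a subword tokenizer fragments rare words); in practice one would plug in a real tokenizer such as GPT-2's, as in Programming Exercise 3.

```python
def fertility(tokenize, text):
    """Tokens per whitespace-delimited word (Eq. 10.3)."""
    return len(tokenize(text)) / len(text.split())

def toy_tokenize(text):
    """Stand-in subword tokenizer: words longer than 4 characters
    are split in half; short words stay whole."""
    out = []
    for w in text.split():
        if len(w) > 4:
            out.extend([w[:len(w) // 2], w[len(w) // 2:]])
        else:
            out.append(w)
    return out

print(fertility(toy_tokenize, "the cat sat"))           # 1.0
print(fertility(toy_tokenize, "tokenization matters"))  # 2.0
```

Running the same comparison with one tokenizer across parallel sentences in several languages exposes the multilingual fertility gap discussed in 10.4.2.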

Key Figures

  1. BPE step-by-step — Worked example showing 5–6 merge iterations on a small corpus, with frequency tables updated at each step and vocabulary growth. (Stepped diagram, TikZ)
  2. Tokenization comparison — The same English sentence tokenized by word-level, BPE, Unigram, and character-level methods, side by side. (Annotated text diagram, TikZ)
  3. Vocabulary size vs. performance — Downstream task performance vs. vocabulary size for a fixed model, demonstrating the sweet spot and diminishing returns. (Line plot, Matplotlib)
  4. Data pipeline diagram — Full data curation pipeline: Web Crawl → URL Filtering → Language ID → Deduplication → Quality Filtering → Content Filtering → Data Mixing → Training Corpus. (Flowchart, TikZ)
  5. Data quality filtering — Illustrated examples of filtered vs. retained documents, with quality scores and filtering criteria. (Annotated examples, TikZ)

Exercises

Theory (3 exercises)

  1. [Basic] Given the corpus {"low": 5, "lower": 2, "new": 6, "newer": 3, "wider": 2}, compute the first 5 BPE merges by hand. Show the frequency table and vocabulary after each merge.
  2. [Intermediate] Compare the compression ratio of BPE vs. Unigram on a small corpus. Which produces shorter token sequences for the same vocabulary size? Explain why.
  3. [Intermediate] Analyze the trade-off between vocabulary size and sequence length. For a Transformer with quadratic attention cost, derive how doubling vocabulary size affects total training FLOPs (considering both shorter sequences and larger embedding matrices).

Programming (4 exercises)

  1. [Basic] Implement BPE from scratch in Python (~30 lines). Train on a small corpus and show the merge table. Tokenize new text using the learned merges.
  2. [Intermediate] Train a SentencePiece model on a text corpus. Compare segmentations from BPE and Unigram mode for 10 English words and 10 German words. Discuss differences.
  3. [Intermediate] Measure fertility across 5 languages using GPT-2's tokenizer. For equivalent sentences in English, German, Chinese (pinyin), Korean (romanized), and Arabic (romanized), compute and compare tokens per word.
  4. [Advanced] Build a simple quality-filtering pipeline for web text: implement perplexity-based filtering using a small reference LM, deduplication using hash-based exact matching, and minimum-length filtering. Measure how much data is removed at each stage.

Cross-References

This chapter references:

  • Chapter 1 (Section 1.1) — The prediction paradigm: tokenization determines the unit of prediction
  • Chapter 8 (Sections 8.1, 8.4) — Transformer architecture and positional encodings; sequence length after tokenization determines attention cost

This chapter is referenced by:

  • Chapter 14 (Section 14.4) — Long-context models must cope with tokenization-induced sequence lengths; efficient tokenization directly impacts feasibility of extended context windows

Key Papers

  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL, 1715–1725.
  • Kudo, T. & Richardson, J. (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. Proceedings of EMNLP (System Demonstrations), 66–71.
  • Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of ACL, 66–75.
  • Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67.
  • Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2022). Deduplicating Training Data Makes Language Models Better. Proceedings of ACL, 8424–8445.