Tokenization and Data at Scale
Prerequisites
- Chapter 8: The Transformer Architecture — Tokenization feeds directly into the Transformer's embedding layer; sequence length after tokenization determines the computational cost of self-attention (quadratic in sequence length)
Summary
Chapter 10 addresses the often-overlooked infrastructure that determines what a language model can learn: how text is split into tokens and how training data is curated at scale. Tokenization defines the unit of prediction — whether the model predicts words, subwords, or bytes — and this choice cascades through vocabulary size, sequence length, computational cost, and multilingual fairness. The chapter presents Byte-Pair Encoding (BPE) as the dominant algorithm, introduces SentencePiece and Unigram as alternatives, analyzes the fertility problem that penalizes non-English languages, and covers the data curation pipeline (web crawling, deduplication, quality filtering, data mixing) that produces corpora like C4, The Pile, and RefinedWeb. This is a unique differentiator for the book: no competing textbook dedicates a full chapter to linking tokenization choices and data quality to language model performance.
Learning Objectives
- Implement the Byte-Pair Encoding (BPE) algorithm from scratch, trace its merge operations on a sample corpus, and explain why subword tokenization solves the open-vocabulary problem that word-level models cannot handle
- Compare BPE, WordPiece, Unigram, and SentencePiece tokenization methods in terms of their algorithmic approach, vocabulary construction, and impact on downstream model performance
- Analyze how tokenization choices affect model behavior across languages, domains, and efficiency metrics, including fertility (tokens per word), compression ratio, and coverage of rare words
- Describe the key stages of a modern data curation pipeline — web crawling, deduplication, quality filtering, data mixing — and explain how each stage affects pre-training quality
Section Outline
10.1 From Words to Subwords (~3pp)
Why word-level tokenization fails at scale: the open-vocabulary problem, the long tail of rare words, morphological productivity, and OOV tokens. Character-level as the other extreme. Subword tokenization as the principled middle ground.
- 10.1.1 The open-vocabulary problem
- 10.1.2 Word-level vs. character-level trade-offs
- 10.1.3 The subword hypothesis
10.2 Byte-Pair Encoding (BPE) (~5pp)
The BPE algorithm (Sennrich et al., 2016) in detail: character-level initialization, iterative merging of the most frequent pair, vocabulary size as a hyperparameter. GPT-2's byte-level BPE variant. Practical considerations.
- 10.2.1 The BPE algorithm step by step
- 10.2.2 Byte-level BPE (GPT-2 variant)
- 10.2.3 Implementation and practical considerations
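The iterative merge loop described above (Eq 10.1: repeatedly merge the most frequent adjacent pair) can be sketched in a few lines of Python. This is a minimal illustration of the training procedure, not the full Sennrich et al. (2016) implementation; the demo corpus matches the one used in the theory exercise below.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` BPE merges from a {word: count} corpus."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # Eq 10.1: most frequent pair
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges, vocab

corpus = {"low": 5, "lower": 2, "new": 6, "newer": 3, "wider": 2}
merges, final_vocab = train_bpe(corpus, num_merges=5)
```

Tie-breaking among equally frequent pairs is left to dictionary iteration order here; production implementations fix a deterministic rule.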
10.3 SentencePiece and Unigram (~4pp)
SentencePiece: language-independent tokenization operating on raw text. The Unigram language model approach: start with a large vocabulary, iteratively prune tokens that contribute least to corpus likelihood. Subword regularization.
- 10.3.1 SentencePiece: language-independent preprocessing
- 10.3.2 The Unigram language model tokenizer
- 10.3.3 Subword regularization and its benefits
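Given trained token probabilities, the optimal Unigram segmentation (Eq 10.2) is found by dynamic programming. A minimal Viterbi sketch follows, using a toy vocabulary with made-up log-probabilities for illustration; a real run would use probabilities estimated by SentencePiece's EM training.

```python
import math

def viterbi_segment(word, vocab_logprobs):
    """Best segmentation of `word` under a unigram model (Eq 10.2):
    maximize the sum of log-probabilities of the chosen subwords."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i] = best score for word[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i] = start index of last token
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in vocab_logprobs and best[j] + vocab_logprobs[piece] > best[i]:
                best[i] = best[j] + vocab_logprobs[piece]
                back[i] = j
    # Recover the token sequence by walking the back-pointers.
    tokens, i = [], n
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return list(reversed(tokens)), best[n]

# Toy vocabulary with illustrative (hand-assigned) log-probabilities.
logp = {"un": -2.0, "break": -3.0, "able": -2.5, "u": -5.0, "n": -5.0,
        "b": -5.0, "r": -5.0, "e": -4.0, "a": -4.5, "k": -5.5, "l": -5.0}
tokens, score = viterbi_segment("unbreakable", logp)
```

Subword regularization (Section 10.3.3) replaces this argmax with sampling from the n-best segmentations.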
10.4 The Impact of Tokenization (~4pp)
Vocabulary size vs. sequence length trade-off, fertility across languages (English-centric tokenizers penalize non-Latin scripts), tokenization of code and math, and emerging tokenizer-free approaches.
- 10.4.1 Vocabulary size vs. sequence length
- 10.4.2 Multilingual tokenization and the fertility problem
- 10.4.3 Toward tokenizer-free models
10.5 Data Curation at Scale (~4pp)
Common Crawl, The Pile, RedPajama, RefinedWeb. Key pipeline stages: web scraping, deduplication, quality filtering, content filtering, and data mixing. How data quality and diversity affect pre-training outcomes.
- 10.5.1 Data sources: web crawls, books, code
- 10.5.2 Deduplication and quality filtering
- 10.5.3 Data mixing and domain balancing
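Two of the pipeline stages above, exact deduplication and minimum-length filtering, can be sketched directly (perplexity-based quality filtering needs a reference LM and is omitted here). Collapsing whitespace before hashing is an illustrative normalization choice, not a prescribed one.

```python
import hashlib

def exact_dedup(docs):
    """Drop documents whose whitespace-normalized text has been seen."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def length_filter(docs, min_words=5):
    """Keep documents with at least `min_words` whitespace words."""
    return [d for d in docs if len(d.split()) >= min_words]

def curate(docs, min_words=5):
    """Minimal two-stage pipeline: exact dedup, then length filtering."""
    return length_filter(exact_dedup(docs), min_words)
```

Production pipelines (C4, RefinedWeb) use near-duplicate methods such as MinHash rather than exact hashing, since web duplicates rarely match byte-for-byte.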
Key Equations
Eq 10.1 — BPE merge operation (Section 10.2)
$$\text{pair}^* = \arg\max_{(a,b) \in \text{pairs}(V)} \text{count}(a, b)$$
At each step, merge the most frequent adjacent pair $(a, b)$ into a new symbol $ab$ and add $ab$ to vocabulary $V$.
Eq 10.2 — Unigram language model probability (Section 10.3)
$$P(\mathbf{x}) = \prod_{i=1}^{|\mathbf{x}|} p(x_i)$$
where $\mathbf{x} = (x_1, \ldots, x_{|\mathbf{x}|})$ is a candidate segmentation into subword tokens, each assumed independent. The optimal segmentation maximizes:
$$\mathbf{x}^* = \arg\max_{\mathbf{x} \in S(w)} \sum_{i=1}^{|\mathbf{x}|} \log p(x_i)$$
where $S(w)$ is the set of all valid segmentations of word $w$.
Eq 10.3 — Fertility (Section 10.4)
$$\text{fertility}(s) = \frac{\text{number of subword tokens in } s}{\text{number of whitespace-delimited words in } s}$$
A simple metric of encoding efficiency, comparable across languages and tokenizers. Lower fertility means a more efficient encoding.
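Eq 10.3 is straightforward to compute for any tokenizer exposed as a callable. The sketch below uses a stand-in tokenizer that splits words into fixed-size character chunks, purely for illustration; the programming exercise below substitutes GPT-2's real BPE tokenizer.

```python
def fertility(sentence, tokenize):
    """Tokens per whitespace-delimited word (Eq 10.3).
    `tokenize` is any callable mapping a string to a token list."""
    words = sentence.split()
    tokens = tokenize(sentence)
    return len(tokens) / len(words)

# Stand-in tokenizer: splits each word into fixed-size character chunks.
def chunk_tokenizer(text, size=3):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]

rate = fertility("the quick brown fox", chunk_tokenizer)  # 6 tokens / 4 words
```

A fertility near 1.0 means most words survive as single tokens; values well above 2.0 are typical when English-centric tokenizers encode non-Latin scripts.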
Key Figures
- BPE step-by-step — Worked example showing 5–6 merge iterations on a small corpus, with frequency tables updated at each step and vocabulary growth. (Stepped diagram, TikZ)
- Tokenization comparison — The same English sentence tokenized by word-level, BPE, Unigram, and character-level methods, side by side. (Annotated text diagram, TikZ)
- Vocabulary size vs. performance — Downstream task performance vs. vocabulary size for a fixed model, demonstrating the sweet spot and diminishing returns. (Line plot, Matplotlib)
- Data pipeline diagram — Full data curation pipeline: Web Crawl → URL Filtering → Language ID → Deduplication → Quality Filtering → Content Filtering → Data Mixing → Training Corpus. (Flowchart, TikZ)
- Data quality filtering — Illustrated examples of filtered vs. retained documents, with quality scores and filtering criteria. (Annotated examples, TikZ)
Exercises
Theory (3 exercises)
- [Basic] Given the corpus {"low": 5, "lower": 2, "new": 6, "newer": 3, "wider": 2}, compute the first 5 BPE merges by hand. Show the frequency table and vocabulary after each merge.
- [Intermediate] Compare the compression ratio of BPE vs. Unigram on a small corpus. Which produces shorter token sequences for the same vocabulary size? Explain why.
- [Intermediate] Analyze the trade-off between vocabulary size and sequence length. For a Transformer with quadratic attention cost, derive how doubling vocabulary size affects total training FLOPs (considering both shorter sequences and larger embedding matrices).
Programming (4 exercises)
- [Basic] Implement BPE from scratch in Python (~30 lines). Train on a small corpus and show the merge table. Tokenize new text using the learned merges.
- [Intermediate] Train a SentencePiece model on a text corpus. Compare segmentations from BPE and Unigram mode for 10 English words and 10 German words. Discuss differences.
- [Intermediate] Measure fertility across 5 languages using GPT-2's tokenizer. For equivalent sentences in English, German, Chinese (pinyin), Korean (romanized), and Arabic (romanized), compute and compare tokens per word.
- [Advanced] Build a simple quality-filtering pipeline for web text: implement perplexity-based filtering using a small reference LM, deduplication using hash-based exact matching, and minimum-length filtering. Measure how much data is removed at each stage.
Cross-References
This chapter references:
- Chapter 1 (Section 1.1) — The prediction paradigm: tokenization determines the unit of prediction
- Chapter 8 (Sections 8.1, 8.4) — Transformer architecture and positional encodings; sequence length after tokenization determines attention cost
This chapter is referenced by:
- Chapter 14 (Section 14.4) — Long-context models must cope with tokenization-induced sequence lengths; efficient tokenization directly impacts feasibility of extended context windows
Key Papers
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of ACL, 1715–1725.
- Kudo, T. & Richardson, J. (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. Proceedings of EMNLP (System Demonstrations), 66–71.
- Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of ACL, 66–75.
- Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67.
- Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2022). Deduplicating Training Data Makes Language Models Better. Proceedings of ACL, 8424–8445.