Tokenization
BPE, WordPiece & SentencePiece
Part 3: Advanced Topics (35 slides)
The Vocabulary Explosion Problem: a 100K-word vocabulary with 300-dimensional embeddings requires 30M embedding parameters. English alone has roughly 170K words in current use, and all languages combined have millions. We need a better approach.
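The arithmetic behind that claim is straightforward; the sizes below are illustrative assumptions (a 100K-word vocabulary and a 300-dimensional embedding table), not fixed constants:

```python
# Illustrative sizes for the vocabulary explosion arithmetic (assumptions).
vocab_size = 100_000   # word-level vocabulary
embedding_dim = 300    # a common embedding width
params = vocab_size * embedding_dim
print(params)  # 30000000 embedding parameters
```

Every word added to the vocabulary costs another full embedding row, which is why word-level vocabularies scale poorly.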
Prerequisites
- Basic understanding of text representation
- Week 2: Word embeddings and vocabulary concepts
- Familiarity with frequency-based methods
Overview
How text is broken into tokens, and the subword algorithms that power modern language models.
Learning Objectives
- Explain the vocabulary explosion problem (e.g., 100K words × 300 dimensions = 30M parameters)
- Compare character-level, word-level, and subword tokenization
- Implement Byte-Pair Encoding (BPE) algorithm from scratch
- Understand WordPiece and Unigram tokenization methods
- Analyze how tokenization affects model performance and efficiency
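The "implement BPE from scratch" objective can be sketched in a few functions. This is a minimal illustration in the style of Sennrich et al.'s original algorithm, using a toy corpus chosen here for demonstration (word frequencies are encoded by repetition):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every whole-symbol occurrence of the pair as one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    joined = "".join(pair)
    return {pattern.sub(joined, word): freq for word, freq in vocab.items()}

def learn_bpe(corpus_words, num_merges):
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = Counter(" ".join(w) + " </w>" for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy corpus (assumed for illustration): low x5, lower x2, newest x6, widest x3
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, 5)
print(merges)
```

Each iteration merges the single most frequent adjacent pair, so frequent character sequences like "est" gradually become vocabulary items of their own.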
Key Topics
BPE algorithm
WordPiece
SentencePiece
Vocabulary optimization
Key Concepts
Vocabulary explosion: why word-level tokenization doesn't scale
Subwords: the Goldilocks zone between characters and words
BPE (Byte-Pair Encoding): bottom-up merging of frequent pairs
WordPiece: likelihood-based subword segmentation (used by BERT)
Unigram: probabilistic tokenization with subword probabilities
OOV handling: how subwords solve out-of-vocabulary problems
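The OOV-handling idea can be shown concretely: a trained BPE tokenizer segments an unseen word by replaying its learned merges in order. The merge rules below are hand-picked for illustration (the kind a small BPE run over {low, lower, newest, widest} might learn), not from any real tokenizer:

```python
def apply_bpe(word, merges):
    """Segment a word by replaying learned BPE merges in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)  # apply this merge rule
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Illustrative merge rules (assumed, see lead-in).
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

# "lowest" never appeared in training, yet it decomposes into known
# subwords instead of collapsing to an <UNK> token.
tokens = apply_bpe("lowest", merges)
print(tokens)  # ['low', 'est</w>']
```

This is why subword vocabularies sidestep the out-of-vocabulary problem: any new word bottoms out, at worst, in individual characters.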
Key Visualizations
BPE progression visual
Tokenization comparison visual
Rare-word handling visual
Vocabulary size vs. OOV rate visual