Tokenization

BPE, WordPiece & SentencePiece

Part 3: Advanced Topics (35 slides)

The Vocabulary Explosion Problem: a 100K-word vocabulary means roughly 30M embedding parameters. English alone has about 170K words in current use, and all languages combined have millions. We need a better approach.
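The arithmetic behind that figure is worth making explicit. A sketch, assuming 300-dimensional embeddings (a common width for word vectors; the exact dimension is an assumption, not stated above):

```python
# Embedding table size grows linearly with vocabulary size.
# Assumption: 300-dimensional embeddings, as in word2vec/GloVe-style models.
embedding_dim = 300
vocab_size = 100_000

# One vector per vocabulary entry: vocab_size x embedding_dim parameters.
params = vocab_size * embedding_dim
print(params)  # 30000000 -- 30M parameters for the embedding table alone
```

Doubling the vocabulary doubles this cost, which is why capping vocabulary size (and handling the leftover words some other way) matters.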

Prerequisites

  • Basic understanding of text representation
  • Week 2: Word embeddings and vocabulary concepts
  • Familiarity with frequency-based methods

Overview

How text is broken into tokens, and the subword algorithms that power modern language models.

Learning Objectives

  • Explain the vocabulary explosion problem (100K words = 30M parameters)
  • Compare character-level, word-level, and subword tokenization
  • Implement Byte-Pair Encoding (BPE) algorithm from scratch
  • Understand WordPiece and Unigram tokenization methods
  • Analyze how tokenization affects model performance and efficiency
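The "BPE from scratch" objective above can be sketched in a few lines. This is a minimal training loop on a toy corpus (the corpus and merge count are illustrative choices, not from the source); it repeatedly counts adjacent symbol pairs and merges the most frequent one:

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds keep the match aligned to whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in words.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
words = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(5):  # vocabulary budget: 5 merge operations
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    words = merge_pair(best, words)
    merges.append(best)

print(merges)  # early merges capture the frequent "est" suffix
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls vocabulary size.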

Key Topics

BPE algorithm
WordPiece
SentencePiece
Vocabulary optimization

Key Concepts

Vocabulary explosion: Why word-level tokenization doesn't scale
Subwords: The Goldilocks zone between characters and words
BPE (Byte-Pair Encoding): Bottom-up merging of frequent pairs
WordPiece: Likelihood-based subword segmentation (used by BERT)
Unigram: Probabilistic tokenization with subword probabilities
OOV handling: How subwords solve out-of-vocabulary problems
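OOV handling is the payoff of the concepts above: a word never seen during training can still be encoded by applying the learned merges greedily. A sketch with a hypothetical merge list (the merges below are illustrative, not a real trained tokenizer):

```python
def encode(word, merges):
    """Segment an unseen word by applying merges in learned order."""
    symbols = list(word) + ["</w>"]  # start from characters + end marker
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair in place
            else:
                i += 1
    return symbols

# Hypothetical merges learned from a corpus containing "low" and "newest":
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(encode("lowest", merges))  # ['low', 'est</w>']
```

Even though "lowest" was never in the training data, it decomposes into known subwords, so no `<UNK>` token is needed.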

Key Visualizations

BPE Progression Visual
Tokenization Comparison Visual
Rare Word Handling Visual
Vocabulary Size vs. OOV Visual

Resources