Tokenization

BPE, WordPiece & SentencePiece

Part 3: Advanced Topics (35 slides)

The Vocabulary Explosion Problem: a 100K-word vocabulary means roughly 30M embedding parameters. English alone has about 170K words in current use, and all languages combined have millions. We need a better approach.
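The arithmetic behind that figure is worth making explicit. A sketch, assuming 300-dimensional embeddings (a common width for word vectors; the exact dimension is an assumption, not stated above):

```python
# Embedding table size grows linearly with vocabulary size.
# Assumption: 300-dimensional embeddings, as in word2vec/GloVe-style models.
embedding_dim = 300
vocab_size = 100_000

# One vector per vocabulary entry: vocab_size x embedding_dim parameters.
params = vocab_size * embedding_dim
print(params)  # 30000000 -- 30M parameters for the embedding table alone
```

Doubling the vocabulary doubles this cost, which is why capping vocabulary size (and handling the leftover words some other way) matters.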

Prerequisites

  • Basic understanding of text representation
  • Week 2: Word embeddings and vocabulary concepts
  • Familiarity with frequency-based methods

Overview

How text is broken into tokens, and the subword algorithms that power modern language models.

Learning Objectives

  • Explain the vocabulary explosion problem (100K words = 30M parameters)
  • Compare character-level, word-level, and subword tokenization
  • Implement Byte-Pair Encoding (BPE) algorithm from scratch
  • Understand WordPiece and Unigram tokenization methods
  • Analyze how tokenization affects model performance and efficiency
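The "BPE from scratch" objective above can be sketched in a few lines. This is a minimal training loop on a toy corpus (the corpus and merge count are illustrative choices, not from the source); it repeatedly counts adjacent symbol pairs and merges the most frequent one:

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds keep the match aligned to whole symbols, not substrings.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in words.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
words = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(5):  # vocabulary budget: 5 merge operations
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    words = merge_pair(best, words)
    merges.append(best)

print(merges)  # early merges capture the frequent "est" suffix
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls vocabulary size.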

Key Topics

BPE algorithm
WordPiece
SentencePiece
Vocabulary optimization

Key Concepts

Vocabulary explosion: Why word-level tokenization doesn't scale
Subwords: The Goldilocks zone between characters and words
BPE (Byte-Pair Encoding): Bottom-up merging of frequent pairs
WordPiece: Likelihood-based subword segmentation (used by BERT)
Unigram: Probabilistic tokenization with subword probabilities
OOV handling: How subwords solve out-of-vocabulary problems
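OOV handling is the payoff of the concepts above: a word never seen during training can still be encoded by applying the learned merges greedily. A sketch with a hypothetical merge list (the merges below are illustrative, not a real trained tokenizer):

```python
def encode(word, merges):
    """Segment an unseen word by applying merges in learned order."""
    symbols = list(word) + ["</w>"]  # start from characters + end marker
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair in place
            else:
                i += 1
    return symbols

# Hypothetical merges learned from a corpus containing "low" and "newest":
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(encode("lowest", merges))  # ['low', 'est</w>']
```

Even though "lowest" was never in the training data, it decomposes into known subwords, so no `<UNK>` token is needed.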

Key Visualizations

BPE Progression Visual
Tokenization Comparison Visual
Rare Word Handling Visual
Vocabulary Size vs. OOV Visual

Resources