Part II · Chapter 4

Word Representations

Part II: Neural Language Models · Moderate · ~20 pages · Phase 2

Prerequisites

  • Ch 2: Mathematical Foundations

Chapter Summary

Chapter 4 is the bridge between classical count-based NLP and neural language modeling. It addresses the most damaging limitation identified in Chapter 3: n-gram models treat every word as an atomic, unrelated symbol, so observing "the dog ran" teaches the model nothing about "the puppy ran." The solution is to represent words as dense vectors in a continuous space where semantically similar words occupy nearby points. The chapter traces this idea from its theoretical roots (Firth's distributional hypothesis), through classical dimensionality reduction (SVD on co-occurrence matrices), to the neural embedding revolution (Word2Vec, GloVe, FastText). The connecting thread to the prediction paradigm is that Word2Vec's Skip-gram objective is itself a prediction task -- predicting context words from a center word -- and the learned embeddings are a byproduct of optimizing this prediction.

Why this chapter matters: By chapter's end, the reader has the input representation that all neural language models (Chapters 5-8) require: a learned mapping from discrete tokens to continuous vectors that encodes semantic similarity. The prediction paradigm is preserved -- Word2Vec learns by predicting context words.

Learning Objectives

  1. Contrast sparse (one-hot, TF-IDF) and dense (embedding) representations of words and explain why dense representations enable generalization across semantically similar words
  2. Derive the Word2Vec Skip-gram objective with negative sampling and explain how it learns to predict context words from a target word
  3. Compare the training objectives of Word2Vec (local context windows), GloVe (global co-occurrence statistics), and FastText (subword information) and identify scenarios where each excels
  4. Evaluate word embeddings using intrinsic methods (analogy tasks, similarity benchmarks) and discuss their limitations

Section Outline

4.1 Sparse Representations (~3 pages)

One-hot encoding and TF-IDF. The fundamental limitation: orthogonality implies no generalization.

  • 4.1.1 One-hot encoding
  • 4.1.2 TF-IDF weighting
  • 4.1.3 The orthogonality problem
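The orthogonality problem in 4.1.3 can be made concrete in a few lines. The sketch below uses an invented three-word vocabulary and hand-picked 4-d dense vectors (the numeric values are purely illustrative, not trained):

```python
import numpy as np

# Toy vocabulary; the index assignment is arbitrary.
vocab = {"dog": 0, "puppy": 1, "car": 2}
V = len(vocab)

def one_hot(word):
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal: the representation
# carries no signal that "dog" and "puppy" are related.
print(one_hot("dog") @ one_hot("puppy"))  # 0.0
print(one_hot("dog") @ one_hot("car"))    # 0.0

# Hypothetical dense embeddings (values invented for illustration).
emb = {
    "dog":   np.array([0.9, 0.8, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.9, 0.2, 0.1]),
    "car":   np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["dog"], emb["puppy"]))  # near 1: semantically close
print(cosine(emb["dog"], emb["car"]))    # near 0: unrelated
```

The dense vectors make generalization possible: whatever a model learns about the "dog" region of the space transfers to "puppy" because the two inputs are numerically similar.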
4.2 Distributional Semantics (~3 pages)

"You shall know a word by the company it keeps" (Firth, 1957). Building word-context co-occurrence matrices and dimensionality reduction via truncated SVD.

  • 4.2.1 The distributional hypothesis
  • 4.2.2 Word-context co-occurrence matrices
  • 4.2.3 Dimensionality reduction with SVD
4.3 Word2Vec (~6 pages)

The neural embedding revolution: CBOW and Skip-gram architectures, the softmax bottleneck, negative sampling, and training procedures.

  • 4.3.1 Continuous Bag of Words (CBOW)
  • 4.3.2 Skip-gram
  • 4.3.3 The softmax bottleneck
  • 4.3.4 Negative sampling
  • 4.3.5 Training and hyperparameters
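The Skip-gram-with-negative-sampling loop of 4.3.2-4.3.4 fits in a short numpy sketch. This is a deliberately simplified trainer on an invented repetitive corpus: negatives are drawn uniformly rather than from the unigram^0.75 noise distribution, there is no subsampling of frequent words, and all hyperparameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token stream (sentence boundaries ignored for simplicity).
corpus = ("the dog ran the puppy ran the car stopped the truck stopped " * 20).split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, d, window, K, lr = len(vocab), 10, 1, 3, 0.05

# Two tables, as in Word2Vec: input (center) and output (context) vectors.
W_in = rng.normal(0, 0.1, (V, d))
W_out = rng.normal(0, 0.1, (V, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(10):
    for t, w in enumerate(corpus):
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j == t:
                continue
            c, o = idx[w], idx[corpus[j]]
            # One positive (center, context) pair plus K uniform negatives.
            targets = [o] + list(rng.integers(0, V, K))
            labels = [1.0] + [0.0] * K
            for tgt, y in zip(targets, labels):
                g = sigmoid(W_in[c] @ W_out[tgt]) - y  # logistic-loss gradient
                grad_in = g * W_out[tgt]               # save before updating W_out
                W_out[tgt] -= lr * g * W_in[c]
                W_in[c] -= lr * grad_in

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words with identical context distributions should end up nearby.
print(cos_sim(W_in[idx["dog"]], W_in[idx["puppy"]]))
```

Each update touches only 1 + K output vectors instead of all |V|, which is exactly the efficiency gain negative sampling buys over the full softmax of Equation 4.5.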
4.4 GloVe and FastText (~4 pages)

GloVe combines global statistics with local context window efficiency. FastText extends Word2Vec to subword units, handling morphology and OOV words.

  • 4.4.1 GloVe: global vectors for word representation
  • 4.4.2 FastText: enriching embeddings with subword information
  • 4.4.3 Comparing Word2Vec, GloVe, and FastText
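The subword decomposition of 4.4.2 (and Figure 4.6) is simple to reproduce. The sketch below follows the FastText convention of adding boundary markers < and > and extracting character n-grams of length 3 to 6, plus the marked word itself:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams with boundary markers < and >."""
    marked = f"<{word}>"
    grams = {marked}  # the whole marked word is also kept as a unit
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

# "unhappiness" decomposes into overlapping subword units; the word's
# embedding is the SUM of the embeddings of these n-grams, so rare or
# unseen words still get vectors via shared morphemes like "<un" or "ess>".
grams = char_ngrams("unhappiness")
print(sorted(g for g in grams if len(g) == 3))
```

Because "unhappy" and "happiness" share many of these n-grams with "unhappiness", their vectors overlap by construction -- the mechanism behind FastText's robustness to morphology and OOV words.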
4.5 Evaluating Embeddings (~4 pages)

Intrinsic evaluation (analogy tasks, similarity benchmarks), extrinsic evaluation, and limitations of static embeddings (polysemy, bias).

  • 4.5.1 Word analogy tasks
  • 4.5.2 Word similarity benchmarks
  • 4.5.3 Intrinsic vs extrinsic evaluation
  • 4.5.4 Limitations and the road to contextual embeddings
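The analogy evaluation of 4.5.1 reduces to vector arithmetic plus nearest-neighbour search (the 3CosAdd method). The sketch below uses hand-invented 3-d vectors chosen to mimic the famous linear structure; real evaluations use trained embeddings and benchmark analogy sets:

```python
import numpy as np

# Hypothetical embeddings (values invented for illustration).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([-0.8, 0.2, 0.1]),
}

def normalize(v):
    return v / np.linalg.norm(v)

def analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' via b - a + c, excluding the query words."""
    target = normalize(normalize(emb[b]) - normalize(emb[a]) + normalize(emb[c]))
    best, best_sim = None, -2.0
    for w, v in emb.items():
        if w in (a, b, c):  # standard protocol: query words are excluded
            continue
        sim = normalize(v) @ target
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman", emb))  # "queen"
```

Note the exclusion of the query words: without it, the nearest neighbour of b − a + c is very often b or c itself, a detail that inflates reported analogy accuracy and is one of the known limitations discussed in 4.5.4.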

Key Equations

$$J_{\text{CBOW}} = -\frac{1}{T}\sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$$ (4.1)
Word2Vec CBOW objective -- Section 4.3
$$J_{\text{SG}} = -\frac{1}{T}\sum_{t=1}^{T} \sum_{\substack{-c \leq j \leq c \\ j \neq 0}} \log P(w_{t+j} \mid w_t)$$ (4.2)
Word2Vec Skip-gram objective -- Section 4.3
$$\log \sigma(\mathbf{v}_{w_O}^\top \mathbf{v}_{w_I}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)}[\log \sigma(-\mathbf{v}_{w_k}^\top \mathbf{v}_{w_I})]$$ (4.3)
Negative sampling objective, maximized for each center--context pair (the training loss is its negation) -- Section 4.3
$$J_{\text{GloVe}} = \sum_{i,j=1}^{|V|} f(X_{ij})\left(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$ (4.4)
GloVe objective -- Section 4.4
$$P(w_O \mid w_I) = \frac{\exp(\mathbf{v}_{w_O}^\top \mathbf{v}_{w_I})}{\sum_{w=1}^{|V|} \exp(\mathbf{v}_w^\top \mathbf{v}_{w_I})}$$ (4.5)
Softmax over vocabulary -- Section 4.3
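Equations 4.3 and 4.5 differ most in cost: the softmax denominator of (4.5) sums over the entire vocabulary, while (4.3) evaluates only 1 + K terms. A minimal sketch (vocabulary size and dimension are arbitrary round numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 20_000, 300                 # illustrative vocabulary size and dimension
v_in = rng.normal(size=d)          # input vector of the center word
W_out = rng.normal(size=(V, d))    # output vectors for every vocabulary word

# Full softmax (Eq. 4.5): one probability needs scores for ALL |V| words.
scores = W_out @ v_in              # |V| dot products
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.shape)                 # (20000,)

# Negative sampling (Eq. 4.3): only 1 + K dot products per training pair.
K = 5
sampled = rng.integers(0, V, size=K + 1)
logits = W_out[sampled] @ v_in     # K + 1 dot products
print(logits.shape)                # (6,)
```

With K = 5 the per-pair cost drops from |V| scores to 6 -- the softmax bottleneck of 4.3.3 and the reason negative sampling made training on billions of tokens practical.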

Key Figures

TikZ
Figure 4.1: One-hot vs Dense Vectors
Side-by-side visualization: a sparse one-hot vector vs a dense embedding vector. Shows that "dog" and "puppy" are orthogonal in one-hot space but nearby in embedding space.
TikZ
Figure 4.2: Word2Vec Architecture
Diagram showing both CBOW (context words as input, center word as target) and Skip-gram (center word as input, context words as targets) architectures with the embedding layer highlighted.
Matplotlib
Figure 4.3: Embedding Space Visualization
2D PCA/t-SNE projection of trained embeddings showing semantic clusters (animals, colors, countries) and the linear structure (king − man + woman ≈ queen).
Matplotlib
Figure 4.4: Word Analogy Diagram
Vector arithmetic visualization: the parallelogram formed by king, queen, man, woman in 2D projected space.
TikZ
Figure 4.5: GloVe Training
Diagram showing the co-occurrence matrix X_ij, the weighting function f(X_ij), and how GloVe combines global statistics with local embedding optimization.
TikZ
Figure 4.6: FastText Subword Decomposition
Example showing how "unhappiness" is decomposed into character n-grams and how the word embedding is the sum of subword embeddings.

Exercises

10 exercises (4 theory, 6 programming)

Cross-References

This chapter builds on:

  • Ch 2: Mathematical Foundations -- vectors, matrices, and gradient-based optimization underpin embedding spaces and training objectives
  • Ch 3 -- the treatment of words as atomic, unrelated symbols in n-gram models is the limitation that dense representations resolve

This chapter is needed for:

  • Ch 5: Sequence Models -- RNNs use word embeddings as input representations; the embedding layer is the first component of a neural language model

Key Papers

  • Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of ICLR Workshop.
  • Mikolov, T. et al. (2013). Distributed Representations of Words and Phrases and Their Compositionality. NeurIPS, 26, 3111--3119.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of EMNLP, 1532--1543.
  • Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135--146.
  • Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS, 27, 2177--2185.
  • Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930--1955. Studies in Linguistic Analysis, 1--32.