Word Representations
Chapter Summary
Chapter 4 is the bridge between classical count-based NLP and neural language modeling. It addresses the most damaging limitation identified in Chapter 3: n-gram models treat every word as an atomic, unrelated symbol, so observing "the dog ran" teaches the model nothing about "the puppy ran." The solution is to represent words as dense vectors in a continuous space, where semantically similar words occupy nearby points. The chapter traces this idea from its theoretical roots (the distributional hypothesis, associated with Firth), through classical dimensionality reduction (truncated SVD on co-occurrence matrices), to the neural embedding revolution (Word2Vec, GloVe, FastText). The connecting thread to the prediction paradigm is that Word2Vec's Skip-gram objective is itself a prediction task -- predicting context words from a center word -- and the learned embeddings are a byproduct of optimizing this prediction.
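To make the "Skip-gram is a prediction task" framing concrete, here is a minimal sketch of how a corpus becomes (center, context) prediction pairs. The tokenization and window size are illustrative choices, not prescriptions from the chapter.

```python
# Sketch: Skip-gram turns running text into (center, context) pairs.
# Each pair is one training example: predict the context word given
# the center word.

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the dog ran across the park".split()
pairs = skipgram_pairs(sentence, window=1)
# With window=1, "dog" yields the pairs ("dog", "the") and ("dog", "ran").
```

Because "dog" and "puppy" appear in similar contexts, they receive similar prediction targets, which is what pulls their learned vectors together.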
Learning Objectives
- Contrast sparse (one-hot, TF-IDF) and dense (embedding) representations of words and explain why dense representations enable generalization across semantically similar words
- Derive the Word2Vec Skip-gram objective with negative sampling and explain how it learns to predict context words from a target word
- Compare the training objectives of Word2Vec (local context windows), GloVe (global co-occurrence statistics), and FastText (subword information) and identify scenarios where each excels
- Evaluate word embeddings using intrinsic methods (analogy tasks, similarity benchmarks) and discuss their limitations
Section Outline
4.1 Sparse Representations (~3 pages)
One-hot encoding and TF-IDF. The fundamental limitation: orthogonality implies no generalization.
- 4.1.1 One-hot encoding
- 4.1.2 TF-IDF weighting
- 4.1.3 The orthogonality problem
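The orthogonality problem in 4.1.3 can be demonstrated in a few lines: every pair of distinct one-hot vectors has cosine similarity exactly zero, so the representation carries no notion of relatedness. The toy vocabulary below is invented for illustration.

```python
import numpy as np

# Sketch of the orthogonality problem: one-hot vectors for distinct
# words are mutually orthogonal, so similarity between any two
# different words is always zero -- "dog" is as unrelated to "puppy"
# as it is to "car".
vocab = ["dog", "puppy", "car"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(one_hot["dog"], one_hot["puppy"])  # 0.0 by construction
```

This is why nothing learned about "dog" transfers to "puppy" under sparse representations: generalization requires geometry, and one-hot geometry is degenerate.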
4.2 Distributional Semantics (~3 pages)
"You shall know a word by the company it keeps" (Firth, 1957). Building word-context co-occurrence matrices and dimensionality reduction via truncated SVD.
- 4.2.1 The distributional hypothesis
- 4.2.2 Word-context co-occurrence matrices
- 4.2.3 Dimensionality reduction with SVD
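A minimal sketch of 4.2.3: truncated SVD of a word-context co-occurrence matrix yields dense vectors in which words sharing contexts end up close together. The counts below are hand-built toy data, not corpus statistics.

```python
import numpy as np

# Sketch: dense word vectors via truncated SVD of a co-occurrence
# matrix C (rows = words, columns = context words). Counts are toy data.
words = ["dog", "puppy", "car"]          # rows
contexts = ["ran", "barked", "drove"]    # columns
C = np.array([[5.0, 4.0, 0.0],   # dog
              [4.0, 5.0, 0.0],   # puppy
              [0.0, 0.0, 6.0]])  # car

U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
embeddings = U[:, :k] * S[:k]    # rank-k word vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# dog and puppy share contexts, so their reduced vectors align;
# car occupies a different direction.
sim_dog_puppy = cosine(embeddings[0], embeddings[1])
sim_dog_car = cosine(embeddings[0], embeddings[2])
```

In practice the matrix is first reweighted (e.g. with PPMI) before factorization, which Section 4.2 can motivate via the Levy & Goldberg result listed under Key Papers.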
4.3 Word2Vec (~6 pages)
The neural embedding revolution: CBOW and Skip-gram architectures, the softmax bottleneck, negative sampling, and training procedures.
- 4.3.1 Continuous Bag of Words (CBOW)
- 4.3.2 Skip-gram
- 4.3.3 The softmax bottleneck
- 4.3.4 Negative sampling
- 4.3.5 Training and hyperparameters
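A sketch of the negative-sampling loss from 4.3.4 for one (center, context) pair: maximize the sigmoid score of the true pair while minimizing it for k sampled negatives. The dimensions and random vectors are illustrative toy data.

```python
import numpy as np

# Sketch of the skip-gram negative-sampling (SGNS) loss for a single
# (center, context) pair, replacing the full softmax over the vocabulary
# with k binary discriminations. All vectors here are random toy data.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_center, u_context, u_negatives):
    """Negative log-likelihood: pull the true context vector toward
    the center vector, push the k negative samples away."""
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.sum(np.log(sigmoid(-(u_negatives @ v_center))))
    return -(pos + neg)

d, k = 8, 5
v_w = rng.normal(size=d)           # center ("input") vector
u_c = rng.normal(size=d)           # true context ("output") vector
u_neg = rng.normal(size=(k, d))    # k sampled negative context vectors
loss = sgns_loss(v_w, u_c, u_neg)
```

The key point for 4.3.3 is the cost: this loss touches k + 1 output vectors per update instead of the entire vocabulary.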
4.4 GloVe and FastText (~4 pages)
GloVe combines global co-occurrence statistics with the efficiency of local context-window methods. FastText extends Word2Vec to subword units, handling morphology and out-of-vocabulary (OOV) words.
- 4.4.1 GloVe: global vectors for word representation
- 4.4.2 FastText: enriching embeddings with subword information
- 4.4.3 Comparing Word2Vec, GloVe, and FastText
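The subword idea behind FastText (4.4.2) can be sketched as character n-gram extraction: a word is represented by its boundary-marked n-grams plus the word itself, and its vector is the sum of the n-gram vectors. The n-gram range below follows the common 3-to-6 convention, used here illustratively.

```python
# Sketch: FastText represents a word by the set of its character
# n-grams (plus the whole word), using "<" and ">" as boundary marks.
# A word vector is the sum of its n-gram vectors, so an unseen (OOV)
# word still gets a representation from subwords it shares with
# training words.

def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    grams = {marked}                       # the whole word is one unit
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

shared = char_ngrams("running") & char_ngrams("runner")
# Morphologically related words overlap in subwords such as "<run".
```

This overlap is exactly why FastText generalizes across inflections where Word2Vec, which treats "running" and "runner" as unrelated atoms, cannot.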
4.5 Evaluating Embeddings (~4 pages)
Intrinsic evaluation (analogy tasks, similarity benchmarks), extrinsic evaluation, and limitations of static embeddings (polysemy, bias).
- 4.5.1 Word analogy tasks
- 4.5.2 Word similarity benchmarks
- 4.5.3 Intrinsic vs extrinsic evaluation
- 4.5.4 Limitations and the road to contextual embeddings
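The analogy task of 4.5.1 is usually scored with the 3CosAdd method: answer "a is to b as c is to ?" by taking the word nearest to b − a + c, excluding the query words. The tiny hand-built vectors below are illustrative, not trained embeddings.

```python
import numpy as np

# Sketch of 3CosAdd analogy evaluation on toy 2-d vectors, where one
# axis loosely encodes "royalty" and the other "gender". These vectors
# are hand-built for illustration only.
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2, 1.0]),
    "woman": np.array([0.2, -1.0]),
    "car":   np.array([-1.0, 0.0]),
}

def analogy(a, b, c, emb):
    """Answer 'a : b :: c : ?' via nearest neighbor to b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):             # standard: exclude query words
            continue
        sim = target @ v / (np.linalg.norm(target) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

answer = analogy("man", "woman", "king", emb)  # "queen" on these toys
```

Excluding the query words matters: without it, the nearest neighbor of b − a + c is very often b or c itself, a known pitfall the evaluation section can flag.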
Exercises
10 exercises (4 theory, 6 programming)
Cross-References
This chapter builds on:
- Ch 1: Introduction -- the prediction paradigm; word representations serve the prediction task
- Ch 2: Mathematical Foundations -- MLE and log-likelihood (Sec 2.2), SGD and Adam (Sec 2.5)
This chapter is needed for:
- Ch 5: Sequence Models -- RNNs use word embeddings as input representations; the embedding layer is the first component of a neural language model
Key Papers
- Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of ICLR Workshop.
- Mikolov, T. et al. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS, 26, 3111--3119.
- Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of EMNLP, 1532--1543.
- Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135--146.
- Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS, 27, 2177--2185.
- Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930--1955. Studies in Linguistic Analysis, 1--32.