Pre-trained Models
BERT and Beyond
Part 2: Core Architectures (52 slides)
The $1 Million Problem: Pre-training BERT from scratch can cost upwards of $1M in compute, while fine-tuning the released weights costs roughly $50-500. This economics changed how NLP systems are built.
Prerequisites
- Week 5: Transformer architecture (attention, encoder-decoder)
- Understanding of neural network training and transfer learning concepts
- Familiarity with word embeddings (Week 2)
Overview
The pre-training paradigm that changed NLP: BERT, GPT, and transfer learning at scale.
Learning Objectives
- Explain the paradigm shift from task-specific to pre-trained models
- Compare BERT (bidirectional) vs GPT (autoregressive) architectures
- Understand the economics: $1M pre-training vs $50-500 fine-tuning
- Apply masked language modeling (MLM) and next sentence prediction (NSP)
- Describe how fine-tuning adapts pre-trained models to downstream tasks
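The MLM objective listed above can be made concrete as a masking procedure: BERT selects ~15% of input tokens as prediction targets, and of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. A minimal sketch of that corruption step (the toy vocabulary and function name are illustrative, not from any library):

```python
import random

# Tiny illustrative vocabulary; the last entry is the mask symbol.
VOCAB = ["the", "cat", "sat", "on", "mat", "[MASK]"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: of the selected positions, 80% become
    [MASK], 10% a random real token, 10% stay unchanged.
    Returns (corrupted_tokens, labels); labels hold the original
    token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB[:-1]))  # random real token
            else:
                corrupted.append(tok)  # kept unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

The loss is computed only at positions where the label is not `None`, which is why MLM trains the model to use bidirectional context around each gap.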
Key Topics
BERT architecture
Masked LM
Next sentence prediction
Fine-tuning
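For the next sentence prediction topic above: NSP builds training pairs from a corpus where, 50% of the time, sentence B actually follows sentence A (label `IsNext`), and otherwise B is a randomly drawn sentence (`NotNext`). A hedged sketch of that pair construction (the function name and corpus handling are illustrative):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (A, B, is_next) examples: for each adjacent pair,
    flip a coin; keep the true next sentence (is_next=True) or
    swap in a random non-adjacent one (is_next=False)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            j = rng.randrange(len(sentences))
            while j == i + 1:  # avoid picking the true next sentence
                j = rng.randrange(len(sentences))
            pairs.append((sentences[i], sentences[j], False))
    return pairs
```

BERT feeds each pair as `[CLS] A [SEP] B [SEP]` and predicts the binary label from the `[CLS]` representation.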
Key Concepts
Pre-training paradigm: Train once on massive data, fine-tune for any task
BERT: Bidirectional Encoder Representations from Transformers
GPT: Generative Pre-trained Transformer (autoregressive)
Masked Language Modeling (MLM): Predict masked tokens using context
Transfer learning: Knowledge from pre-training transfers to new tasks
Fine-tuning: Adapt pre-trained weights with task-specific data
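Why fine-tuning is so much cheaper than pre-training: most of the knowledge lives in the pre-trained encoder, so adaptation can amount to training a small task head on top of (mostly or fully) frozen features. A toy sketch in plain Python, with a frozen "encoder" stub standing in for BERT and a one-parameter-per-feature logistic head trained by gradient descent (all names, features, and numbers are illustrative):

```python
import math
import random

def frozen_encoder(text):
    """Stand-in for a pre-trained encoder: maps text to a fixed
    feature vector (here, two crude surface features). In real
    fine-tuning this would be BERT's [CLS] representation."""
    return [len(text) / 10.0, float(text.count("!"))]

def train_head(examples, lr=0.5, epochs=200, seed=0):
    """Train ONLY the classification head (w, b) on the frozen
    features -- the cheap part of transfer learning."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(2)]
    b = 0.0
    for _ in range(epochs):
        for text, y in examples:
            x = frozen_encoder(text)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the logistic loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, text):
    x = frozen_encoder(text)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0
```

The encoder's weights never change here; only the tiny head is optimized, which is the mechanism behind the $50-500 fine-tuning cost cited above (in practice the encoder is often also updated with a very small learning rate).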
Key Visualizations
BERT architecture
BERT fine-tuning process
BERT vs. GPT architecture
BERT results on GLUE