Pre-trained Models

BERT and Beyond

Part 2: Core Architectures (52 slides)

The $1 Million Problem: Pre-training BERT from scratch costs $1M+, but fine-tuning a pre-trained checkpoint costs only $50-500. This economic asymmetry changed how NLP systems are built.

Prerequisites

  • Week 5: Transformer architecture (attention, encoder-decoder)
  • Understanding of neural network training and transfer learning concepts
  • Familiarity with word embeddings (Week 2)

Overview

Pre-training paradigms that changed NLP. BERT, GPT, and transfer learning at scale.

Learning Objectives

  • Explain the paradigm shift from task-specific to pre-trained models
  • Compare BERT (bidirectional) vs GPT (autoregressive) architectures
  • Understand the economics: $1M pre-training vs $50-500 fine-tuning
  • Apply masked language modeling (MLM) and next sentence prediction (NSP)
  • Describe how fine-tuning adapts pre-trained models to downstream tasks

Key Topics

BERT architecture
Masked LM
Next sentence prediction
Fine-tuning

Key Concepts

Pre-training paradigm: Train once on massive data, fine-tune for any task
BERT: Bidirectional Encoder Representations from Transformers
GPT: Generative Pre-trained Transformer (autoregressive)
Masked Language Modeling (MLM): Predict masked tokens using surrounding context
Transfer learning: Knowledge from pre-training transfers to new tasks
Fine-tuning: Adapt pre-trained weights with task-specific data
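To make the MLM concept concrete, here is a minimal sketch of BERT-style input corruption: roughly 15% of token positions are selected, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged; the model must predict the original token at every selected position. The function name and the toy vocabulary are illustrative, not from any library.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=None, seed=0):
    """BERT-style MLM corruption (sketch). Returns the corrupted
    sequence plus labels: the original token at each selected
    position, None elsewhere (no loss computed there)."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "mat"]  # toy vocab
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:       # select ~15% of positions
            labels[i] = tok                # prediction target
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"    # 80%: replace with mask token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, labels
```

The 10% keep-as-is case matters: because any position might carry a loss, the model cannot assume that unmasked tokens are always correct, which forces it to build contextual representations for every position.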

Key Visualizations

BERT architecture
BERT fine-tuning process
BERT vs GPT architecture
BERT results on GLUE

Resources