Pre-trained Models

BERT and Beyond

Part 2: Core Architectures (52 slides)

The $1 Million Problem: Pre-training BERT from scratch costs $1M+, but fine-tuning a pre-trained checkpoint costs only $50-500. This economic asymmetry changed how NLP systems are built.

Prerequisites

  • Week 5: Transformer architecture (attention, encoder-decoder)
  • Understanding of neural network training and transfer learning concepts
  • Familiarity with word embeddings (Week 2)

Overview

Pre-training paradigms that changed NLP. BERT, GPT, and transfer learning at scale.

Learning Objectives

  • Explain the paradigm shift from task-specific to pre-trained models
  • Compare BERT (bidirectional) vs GPT (autoregressive) architectures
  • Understand the economics: $1M pre-training vs $50-500 fine-tuning
  • Apply masked language modeling (MLM) and next sentence prediction (NSP)
  • Describe how fine-tuning adapts pre-trained models to downstream tasks

Key Topics

BERT architecture
Masked LM
Next sentence prediction
Fine-tuning

Key Concepts

Pre-training paradigm: Train once on massive data, fine-tune for any task
BERT: Bidirectional Encoder Representations from Transformers
GPT: Generative Pre-trained Transformer (autoregressive)
Masked Language Modeling (MLM): Predict masked tokens using surrounding context
Transfer learning: Knowledge from pre-training transfers to new tasks
Fine-tuning: Adapt pre-trained weights with task-specific data
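To make the MLM concept concrete, here is a minimal sketch of BERT-style input corruption: roughly 15% of token positions are selected, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged; the model must predict the original token at every selected position. The function name and the toy vocabulary are illustrative, not from any library.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=None, seed=0):
    """BERT-style MLM corruption (sketch). Returns the corrupted
    sequence plus labels: the original token at each selected
    position, None elsewhere (no loss computed there)."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "mat"]  # toy vocab
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:       # select ~15% of positions
            labels[i] = tok                # prediction target
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"    # 80%: replace with mask token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, labels
```

The 10% keep-as-is case matters: because any position might carry a loss, the model cannot assume that unmasked tokens are always correct, which forces it to build contextual representations for every position.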

Key Visualizations

BERT architecture
BERT fine-tuning process
BERT vs GPT architecture
BERT results on GLUE

Resources