Transformer Architecture

Attention Is All You Need

Part 2: Core Architectures (45 slides)

From Sequential to Parallel: RNNs process words one by one, like reading a book. Transformers process all words simultaneously, like seeing a photograph. This parallelization is what made GPT and BERT possible: training that took months now takes days.

Prerequisites

  • Week 4: Attention mechanism fundamentals
  • Matrix operations (multiplication, softmax)
  • Understanding of parallel vs sequential processing

Overview

The architecture that revolutionized NLP: self-attention, multi-head attention, and positional encoding.

Learning Objectives

  • Explain why transformers replaced RNNs (parallelization advantage)
  • Calculate self-attention scores using Query, Key, Value
  • Understand multi-head attention and its benefits
  • Describe positional encoding and why it's necessary
  • Draw the complete transformer encoder-decoder architecture
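To make the second objective concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name and toy inputs are illustrative, not from the lecture; in a real transformer, Q, K, and V come from learned linear projections of the input embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of value vectors

# toy example: 3 positions, d_k = 4; self-attention sets Q = K = V = X
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (3, 4): one output vector per position
```

The division by sqrt(d_k) keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.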

Key Topics

Self-attention
Multi-head attention
Positional encoding
Feed-forward layers

Key Concepts

Self-attention: each position attends to all positions in the sequence
Query, Key, Value (QKV): the three projections used to compute attention
Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
Multi-head attention: multiple attention patterns computed in parallel
Positional encoding: injects position information (there is no recurrence)
Feed-forward network: position-wise fully connected layers
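Since the model has no recurrence, position must be injected explicitly. A short sketch of the sinusoidal positional encoding from the paper follows; the function name and the shapes chosen for the demo are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2) even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)             # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)  # (50, 16): added element-wise to the input embeddings
```

Each dimension oscillates at a different wavelength, so every position gets a unique pattern and relative offsets correspond to fixed linear transformations of the encoding.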

Key Visualizations

Sr 16 Transformer Architecture Annotated
3D Multihead Attention
Attention Heatmap
Positional Encoding 3D

Resources

Moodle Resources (HS25)

Lecture Slides