Part IV · Chapter 14

Retrieval, Agents, and Multimodal Models

Part IV: Frontiers · Difficulty: Moderate · Length: ~20pp · Phase 3
Why this chapter matters. The prediction paradigm reaches its most general form here: the model still predicts the next token, but now conditioned on retrieved documents it never saw during training, on tool results from external systems, and on visual features from images. Every frontier extension -- RAG, agents, multimodal models, efficient inference -- preserves the core prediction mechanism while dramatically expanding what the model can accomplish.

Prerequisites

Ch 13: In-Context Learning → Ch 14: Retrieval, Agents, Multimodal

Summary

Chapter 14 extends the language model beyond single-turn, text-in/text-out prediction in three frontier directions: retrieval-augmented generation (RAG), which grounds the model in external documents to reduce hallucination and enable knowledge-intensive tasks; agentic systems (ReAct), which chain reasoning and tool-use steps into autonomous multi-step problem solving; and multimodal models (CLIP, LLaVA), which extend the prediction paradigm to images alongside text. The chapter also covers practical deployment: long-context extensions (RoPE scaling, sparse attention), parameter-efficient fine-tuning (LoRA), quantization (INT8/INT4), and speculative decoding. The unifying theme is that every extension preserves the core next-token prediction mechanism while enriching the context or reducing the computational cost.

Learning Objectives

  1. Design and implement a Retrieval-Augmented Generation (RAG) pipeline -- including document chunking, embedding-based retrieval, and grounded generation -- and evaluate it against a closed-book baseline on a knowledge-intensive QA task.
  2. Explain the ReAct agent framework (Reason + Act), implement a multi-step agent loop that interleaves reasoning traces with tool calls, and identify the challenges of planning, error recovery, and termination.
  3. Describe how vision-language models (CLIP, LLaVA, GPT-4V) bridge visual and textual modalities through contrastive pre-training and visual instruction tuning, and explain their connection to the text-prediction paradigm.
  4. Apply parameter-efficient fine-tuning (LoRA) and inference-time optimization (quantization, KV-cache, speculative decoding) to deploy large language models under real-world resource constraints.

Section Outline

14.1 Retrieval-Augmented Generation (RAG) (~5pp)

The RAG architecture (Lewis et al., 2020): augmenting generation with retrieved documents to reduce hallucination and ground responses in external knowledge. The full pipeline: document indexing, dense retrieval, and conditioned generation. Evaluation: faithfulness, relevance, correctness.

  • 14.1.1 Why Retrieval? Grounding and Hallucination Reduction
  • 14.1.2 The RAG Pipeline: Index, Retrieve, Generate
  • 14.1.3 Evaluation and Advanced Retrieval Strategies
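The index-retrieve-generate pipeline can be sketched end to end. This is a minimal illustration, not a production retriever: the `embed` function below is a toy deterministic bag-of-words embedding standing in for a learned encoder (e.g. sentence-transformers), the document strings are invented, and the assembled prompt would be passed to an LLM for the generate step.

```python
import numpy as np

def embed(text, dim=64):
    """Toy bag-of-words hash embedding; a stand-in for a learned encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[sum(ord(c) for c in tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def retrieve(query, docs, k=2):
    """Rank documents by cosine similarity to the query (Eq. 14.2)."""
    q = embed(query)
    scored = sorted(((float(q @ embed(d)), d) for d in docs), reverse=True)
    return scored[:k]

def build_prompt(query, retrieved):
    """Assemble the grounded prompt passed to the generator LLM."""
    context = "\n".join(f"[{i + 1}] {doc}" for i, (_, doc) in enumerate(retrieved))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "The Eiffel Tower is located in Paris, France.",
    "Photosynthesis converts light energy into chemical energy.",
    "Paris is the capital and largest city of France.",
]
top = retrieve("Where is the Eiffel Tower", docs)
print(build_prompt("Where is the Eiffel Tower", top))
```

Because query and document embeddings are computed independently (a bi-encoder), document vectors can be precomputed and searched with approximate nearest neighbors at scale.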

14.2 Agents and Planning (~4pp)

The ReAct framework (Yao et al., 2023): interleaving Thought, Action, and Observation. Planning, task decomposition, memory (short-term and long-term), error recovery, and safety in autonomous systems.

  • 14.2.1 The ReAct Framework: Thought-Action-Observation
  • 14.2.2 Planning and Task Decomposition
  • 14.2.3 Memory, Error Recovery, and Safety
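The Thought-Action-Observation loop can be sketched with the LLM replaced by a scripted stub; the `calculator` tool, the `Action: tool[arg]` syntax, and the step budget are illustrative assumptions, not the ReAct paper's exact format. A real agent would send the growing transcript to a model at each step.

```python
import re

def calculator(expr: str) -> str:
    """Tool: evaluate a simple arithmetic expression (whitelisted characters only)."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        return "error: unsupported expression"
    return str(eval(expr))  # safe here because of the whitelist above

TOOLS = {"calculator": calculator}

def scripted_llm(transcript: str) -> str:
    """Stand-in for the LLM policy: returns the next Thought/Action text."""
    if "Observation: 56" in transcript:
        return "Thought: I now know the product.\nFinal Answer: 56"
    return "Thought: I should multiply 7 by 8.\nAction: calculator[7 * 8]"

def react(question: str, max_steps: int = 5):
    """Thought-Action-Observation loop with a step budget for termination."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = scripted_llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return transcript, step.split("Final Answer:")[-1].strip()
        match = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if match:
            tool, arg = match.groups()
            transcript += f"Observation: {TOOLS[tool](arg)}\n"
    return transcript, None  # budget exhausted without a final answer

trace, answer = react("What is 7 times 8?")
print(trace)
```

The `max_steps` cap is the simplest termination guarantee; error recovery in practice means feeding tool failures back as observations so the model can revise its plan.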

14.3 Vision-Language Models (~4pp)

CLIP (Radford et al., 2021): contrastive pre-training of image and text encoders. LLaVA (Liu et al., 2023): visual instruction tuning. GPT-4V and Gemini as frontier multimodal models.

  • 14.3.1 CLIP: Contrastive Image-Text Pre-training
  • 14.3.2 Visual Instruction Tuning (LLaVA)
  • 14.3.3 Frontier Multimodal Models

14.4 Long-Context Models (~3pp)

Extending context windows: RoPE scaling, ALiBi, sub-quadratic attention alternatives, and the "lost in the middle" phenomenon.

  • 14.4.1 Position Encoding Extrapolation
  • 14.4.2 Sub-Quadratic Attention Alternatives
  • 14.4.3 Practical Applications of Long Context

14.5 Efficient Inference (~4pp)

Quantization (GPTQ, AWQ, GGUF), LoRA and parameter-efficient fine-tuning, KV-cache optimization, speculative decoding, and distillation.

  • 14.5.1 Quantization: INT8, INT4, and Beyond
  • 14.5.2 LoRA and Parameter-Efficient Fine-Tuning
  • 14.5.3 KV-Cache, Speculative Decoding, and Distillation
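The storage/accuracy trade-off behind quantization can be seen in a minimal sketch of symmetric round-to-nearest INT8 quantization. GPTQ and AWQ refine this with calibration-aware rounding and scaling; the version below is the naive baseline they improve on.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # a stand-in weight matrix
q, scale = quantize_int8(w)
err = float(np.abs(w - dequantize(q, scale)).max())
# INT8 uses 1 byte per value vs 4 for FP32; error is bounded by scale / 2.
print(f"size ratio: {q.nbytes / w.nbytes:.2f}, max abs error: {err:.4f}")
```

Per-channel scales (one per output row) tighten the error bound further, which is why most deployed INT8 schemes use them instead of a single per-tensor scale.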

Key Equations

(14.1)
$$P(y \mid x) = \sum_{d \in \mathcal{D}_k} P(y \mid x, d; \theta) \, P(d \mid x; \phi)$$
RAG Generation Probability -- where $\mathcal{D}_k$ is the set of top-$k$ retrieved documents, $P(d|x;\phi)$ is the retrieval score, and $P(y|x,d;\theta)$ is the generator conditioned on both the query and the retrieved document.
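A toy numeric instance of this marginalization, with invented probabilities for two retrieved documents:

```python
import numpy as np

p_d = np.array([0.7, 0.3])          # retrieval probabilities P(d | x; phi)
p_y_given_d = np.array([0.9, 0.2])  # generator probabilities P(y | x, d; theta)
p_y = float(p_d @ p_y_given_d)      # Eq. 14.1: marginalize over documents
print(f"P(y | x) = {p_y:.2f}")      # → P(y | x) = 0.69
```

Note that a confidently retrieved but unhelpful document (high P(d|x), low P(y|x,d)) drags the whole sum down, which is the quantitative face of the retrieval-precision trade-off explored in the exercises.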
(14.2)
$$\text{sim}(x, d) = \frac{\mathbf{e}_q(x)^\top \mathbf{e}_d(d)}{\|\mathbf{e}_q(x)\| \, \|\mathbf{e}_d(d)\|}$$
Dense Retrieval Score -- where $\mathbf{e}_q$ and $\mathbf{e}_d$ are learned query and document encoders (bi-encoder architecture).
(14.3)
$$\mathbf{W}' = \mathbf{W}_0 + \mathbf{B}\mathbf{A}, \quad \mathbf{B} \in \mathbb{R}^{d \times r}, \; \mathbf{A} \in \mathbb{R}^{r \times k}$$
LoRA Decomposition -- where $\mathbf{W}_0$ is the frozen pre-trained weight matrix, $r \ll \min(d, k)$ is the LoRA rank, and only $\mathbf{B}$ and $\mathbf{A}$ are trainable. Trainable parameters: $r(d + k)$ instead of $dk$.
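A numpy sketch of this decomposition at the dimensions used in Theory Exercise 1 ($d = k = 4096$, $r = 16$). One assumption to flag: training initializes $\mathbf{B}$ to zero so that $\mathbf{W}' = \mathbf{W}_0$ at the start; it is given small random values here purely so the equivalence check is non-trivial.

```python
import numpy as np

d, k, r = 4096, 4096, 16
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k)).astype(np.float32)        # frozen pre-trained weight
B = rng.normal(scale=0.01, size=(d, r)).astype(np.float32)
A = rng.normal(scale=0.01, size=(r, k)).astype(np.float32)

def lora_forward(x):
    """Compute W'x = (W0 + BA)x without materializing the d x k update:
    the low-rank path costs r(d + k) multiplies instead of dk."""
    return W0 @ x + B @ (A @ x)

full_params = d * k
lora_params = r * (d + k)
print(f"trainable: {lora_params:,} of {full_params:,} ({lora_params / full_params:.2%})")
# → trainable: 131,072 of 16,777,216 (0.78%)
```

The same factorization also explains why LoRA adds no inference latency: after training, $\mathbf{B}\mathbf{A}$ can be merged into $\mathbf{W}_0$ once.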
(14.4)
$$\mathcal{L}_{\text{CLIP}} = -\frac{1}{N} \sum_{i=1}^{N} \left[\log \frac{\exp(\text{sim}(\mathbf{z}_i^I, \mathbf{z}_i^T) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(\mathbf{z}_i^I, \mathbf{z}_j^T) / \tau)}\right]$$
CLIP Contrastive Loss -- where $\mathbf{z}^I$ and $\mathbf{z}^T$ are image and text embeddings, $\tau$ is a learned temperature, and $N$ is the batch size.
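The image-to-text direction of this loss can be computed directly in numpy; a minimal sketch with random embeddings (batch size and dimension chosen arbitrarily). The full CLIP objective averages this term with its symmetric text-to-image counterpart.

```python
import numpy as np

def clip_loss(z_img, z_txt, tau=0.07):
    """Image-to-text contrastive loss of Eq. 14.4 over a batch of N pairs."""
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = (z_img @ z_txt.T) / tau              # N x N cosine similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))    # matched pairs lie on the diagonal

rng = np.random.default_rng(0)
N, d = 8, 32
loss_random = clip_loss(rng.normal(size=(N, d)), rng.normal(size=(N, d)))
z = rng.normal(size=(N, d))
loss_aligned = clip_loss(z, z)  # perfectly matched pairs
print(loss_random, loss_aligned)  # aligned pairs give a much lower loss
```

This is a cross-entropy over the batch, so each image effectively treats the other $N-1$ texts as in-batch negatives, which is why CLIP benefits from very large batch sizes.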

Key Figures

Figure 14.1 · Architecture Flowchart · TikZ
RAG Architecture Diagram
End-to-end pipeline: Query → Retriever (embedding + ANN search over document index) → Top-k Documents → Generator (LLM conditioned on query + documents) → Answer.
Figure 14.2 · Loop/Cycle Diagram · TikZ
ReAct Loop
The Thought-Action-Observation cycle: the LLM generates a reasoning trace, decides on a tool call, the environment returns an observation, and the loop continues until a final answer.
Figure 14.3 · Architecture Diagram · TikZ
CLIP Architecture
Dual-encoder architecture: image encoder (ViT) and text encoder (Transformer) producing embeddings in a shared space, trained with contrastive loss. Shows the cosine similarity matrix across a batch.
Figure 14.4 · Architecture Diagram · TikZ
Vision-Language Pipeline
LLaVA-style architecture: image → visual encoder → projection layer → LLM input embeddings (interleaved with text tokens) → autoregressive generation.
Figure 14.5 · Matrix Diagram · TikZ
LoRA Diagram
Frozen weight matrix $\mathbf{W}_0$ plus low-rank update $\mathbf{BA}$, showing the dimensional reduction from $d \times k$ to $r(d+k)$ trainable parameters.
Figure 14.6 · Grouped Bar Chart · Matplotlib
Quantization Comparison
Model size (GB), inference speed (tokens/sec), and quality (perplexity) compared across FP16, INT8, and INT4 quantization levels.

Exercises

Theory

  1. LoRA Parameter Savings (Basic). Derive the LoRA parameter savings for a Transformer with $L=32$ layers, each having 4 weight matrices of dimension $d=4096$, with rank $r=16$. What fraction of total parameters are trainable?
  2. RAG Precision Trade-off (Intermediate). If a retriever returns $k$ documents with precision $p$, how does the expected number of relevant documents scale? What happens to generation quality when most retrieved documents are irrelevant?
  3. Speculative Decoding Correctness (Intermediate). Explain why speculative decoding preserves the output distribution of the target model exactly. Given an acceptance rate $\alpha$, estimate the expected speedup.

Programming

  1. RAG Pipeline (Basic). Build a RAG pipeline using sentence-transformers for retrieval. Index 100 Wikipedia paragraphs, retrieve top-3 for 20 questions, and compare with a closed-book baseline.
  2. ReAct Agent (Intermediate). Implement a ReAct agent with calculator and search tools. Test on 30 multi-step questions. Report steps per question, tool-use frequency, and accuracy.
  3. LoRA Fine-Tuning (Intermediate). Apply LoRA fine-tuning ($r=16$) to a 7B model on 1000 instruction examples. Compare trainable parameters, GPU memory, and task performance with full fine-tuning.
  4. Quantization Benchmark (Intermediate). Quantize a model to INT4 using GPTQ. Compare FP16, INT8, INT4 on perplexity, inference speed, and model size.
  5. Vision-Language Pipeline (Advanced). Use CLIP to encode images, project embeddings into a small LLM's input space, and generate captions. Evaluate on 50 COCO validation images.

Cross-References

This chapter references:

  • Ch 1 (Section 1.1): The prediction paradigm. Chapter 14 extends this to multimodal prediction, retrieval-grounded prediction, and agent-mediated prediction.
  • Ch 10 (Sections 10.4--10.5, soft): Tokenization and data scale. Efficient tokenization impacts context length and retrieval granularity.
  • Ch 13 (Sections 13.2--13.4, soft): Prompting and tool use. Chapter 14 builds on these with retrieval-augmented prompts (RAG) and multi-step tool-using agents (ReAct).

This chapter is referenced by:

  • No later chapters directly depend on this chapter.

Key Papers

  • Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS). [Section 14.1]
  • Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. Proceedings of ICLR. [Section 14.2]
  • Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models from Natural Language Supervision. Proceedings of ICML. [Section 14.3.1]
  • Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. Advances in Neural Information Processing Systems (NeurIPS). [Section 14.3.2]
  • Hu, E. J., Shen, Y., Wallis, P., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of ICLR. [Section 14.5.2]
  • Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of EMNLP. [Section 14.1.2]
  • Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of ICML. [Section 14.5.3]
  • Frantar, E., et al. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Proceedings of ICLR. [Section 14.5.1]
  • Liu, N. F., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL. [Section 14.4.3]