Retrieval, Agents, and Multimodal Models
Prerequisites
Summary
Chapter 14 extends the language model beyond single-turn text-in/text-out prediction into three frontier directions: retrieval-augmented generation (RAG), which grounds the model in external documents to reduce hallucination and enable knowledge-intensive tasks; agentic systems (ReAct), which chain multiple reasoning and tool-use steps into autonomous multi-step problem solving; and multimodal models (CLIP, LLaVA), which extend the prediction paradigm to images alongside text. The chapter also covers practical deployment: long-context extensions (RoPE scaling, sparse attention), parameter-efficient fine-tuning (LoRA), quantization (INT8/INT4), and speculative decoding. The unifying theme is that every extension preserves the core next-token prediction mechanism while enriching the context or reducing the computational cost.
Learning Objectives
- Design and implement a Retrieval-Augmented Generation (RAG) pipeline -- including document chunking, embedding-based retrieval, and grounded generation -- and evaluate it against a closed-book baseline on a knowledge-intensive QA task.
- Explain the ReAct agent framework (Reason + Act), implement a multi-step agent loop that interleaves reasoning traces with tool calls, and identify the challenges of planning, error recovery, and termination.
- Describe how vision-language models (CLIP, LLaVA, GPT-4V) bridge visual and textual modalities through contrastive pre-training and visual instruction tuning, and explain their connection to the text-prediction paradigm.
- Apply parameter-efficient fine-tuning (LoRA) and inference-time optimization (quantization, KV-cache, speculative decoding) to deploy large language models under real-world resource constraints.
Section Outline
14.1 Retrieval-Augmented Generation (RAG) (~5pp)
The RAG architecture (Lewis et al., 2020): augmenting generation with retrieved documents to reduce hallucination and ground responses in external knowledge. The full pipeline: document indexing, dense retrieval, and conditioned generation. Evaluation: faithfulness, relevance, correctness.
- 14.1.1 Why Retrieval? Grounding and Hallucination Reduction
- 14.1.2 The RAG Pipeline: Index, Retrieve, Generate
- 14.1.3 Evaluation and Advanced Retrieval Strategies
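The index-retrieve-generate pipeline above can be sketched in miniature. This is a toy illustration, not the chapter's implementation: the bag-of-words `embed` is a stand-in for a dense encoder such as sentence-transformers, and the final generation step is left to an LLM.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would use a dense
    # sentence encoder here instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, k=2):
    # Step 2: rank indexed chunks by similarity to the query.
    q = embed(query)
    return sorted(index, key=lambda d: cosine(q, d["vec"]), reverse=True)[:k]

# Step 1: chunk and index the document collection.
docs = ["RAG grounds generation in retrieved documents.",
        "CLIP aligns images and text contrastively.",
        "Speculative decoding accelerates inference."]
index = [{"text": t, "vec": embed(t)} for t in docs]

# Step 3: build a grounded prompt; generation itself is delegated to an LLM.
question = "How does RAG ground generation?"
top = retrieve(question, index, k=1)
prompt = f"Context:\n{top[0]['text']}\n\nQuestion: {question}"
print(top[0]["text"])
```

The same three-step structure scales up directly: swap in a dense encoder, an approximate-nearest-neighbor index, and a real generator.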
14.2 Agents and Planning (~4pp)
The ReAct framework (Yao et al., 2023): interleaving Thought, Action, and Observation. Planning, task decomposition, memory (short-term and long-term), error recovery, and safety in autonomous systems.
- 14.2.1 The ReAct Framework: Thought-Action-Observation
- 14.2.2 Planning and Task Decomposition
- 14.2.3 Memory, Error Recovery, and Safety
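The Thought-Action-Observation loop can be made concrete with a minimal sketch. The scripted policy below stands in for the LLM (a real agent would prompt the model with the running history), and the single `calculator` tool and trace are illustrative assumptions, not the chapter's agent.

```python
def calculator(expr):
    # Toy tool; evaluating arbitrary strings is unsafe outside a demo.
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def scripted_policy(history):
    # Stand-in for the LLM: emit one tool call, then finish once an
    # observation is available.
    if not any(line.startswith("Observation:") for line in history):
        return ("Thought: I need to compute 17 * 23.", "calculator", "17 * 23")
    return ("Thought: I have the result.", "finish", history[-1].split(": ")[1])

def react(question, max_steps=5):
    history = [f"Question: {question}"]
    for _ in range(max_steps):  # bounded loop: termination safeguard
        thought, action, arg = scripted_policy(history)
        history.append(thought)
        if action == "finish":
            return arg
        obs = TOOLS[action](arg)          # Act
        history.append(f"Action: {action}[{arg}]")
        history.append(f"Observation: {obs}")  # Observe, then loop back to Reason
    return None  # failed to terminate within the step budget

print(react("What is 17 * 23?"))
```

Note the two failure modes the chapter discusses appear even here: the `max_steps` cap handles non-termination, and a malformed tool call would need explicit error recovery.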
14.3 Vision-Language Models (~4pp)
CLIP (Radford et al., 2021): contrastive pre-training of image and text encoders. LLaVA (Liu et al., 2023): visual instruction tuning. GPT-4V and Gemini as frontier multimodal models.
- 14.3.1 CLIP: Contrastive Image-Text Pre-training
- 14.3.2 Visual Instruction Tuning (LLaVA)
- 14.3.3 Frontier Multimodal Models
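CLIP's symmetric contrastive objective can be illustrated on toy embeddings. A sketch under simplifying assumptions: two-dimensional hand-picked vectors replace the image and text encoders, and the temperature value is just the commonly cited initialization.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def logits(img_embs, txt_embs, temperature=0.07):
    # Cosine-similarity logits between every image and every text in the
    # batch, scaled by a temperature as in CLIP.
    return [[sum(i * t for i, t in zip(im, tx)) / temperature
             for tx in txt_embs] for im in img_embs]

def cross_entropy(row, target):
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return log_z - row[target]

def clip_loss(img_embs, txt_embs):
    # Symmetric loss: image i should match text i in both directions.
    L = logits(img_embs, txt_embs)
    n = len(L)
    i2t = sum(cross_entropy(L[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy([L[j][i] for j in range(n)], i) for i in range(n)) / n
    return (i2t + t2i) / 2

imgs = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]
txts = [normalize([0.9, 0.0]), normalize([0.0, 0.9])]
print(clip_loss(imgs, txts))        # near zero: pairs are aligned
print(clip_loss(imgs, txts[::-1]))  # large: pairs are swapped
```

The key property is visible in the two printed values: aligned image-text pairs drive the loss toward zero, mismatched pairings drive it up, which is exactly what pushes the two encoders into a shared embedding space.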
14.4 Long-Context Models (~3pp)
Extending context windows: RoPE scaling, ALiBi, sub-quadratic attention alternatives, and the "lost in the middle" phenomenon.
- 14.4.1 Position Encoding Extrapolation
- 14.4.2 Sub-Quadratic Attention Alternatives
- 14.4.3 Practical Applications of Long Context
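RoPE-style position extrapolation can be sketched with the frequency formula alone. The snippet below assumes standard RoPE frequencies ($\theta_i = \text{pos} \cdot b^{-2i/d}$) and shows linear position interpolation: a `scale` factor compresses out-of-range positions back into the trained window, at the cost of finer angular resolution between neighboring positions.

```python
import math

def rope_angles(pos, dim=8, base=10000.0, scale=1.0):
    # Rotation angles for each frequency pair at a given position.
    # scale > 1 implements linear position interpolation.
    return [(pos / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]

def rotate_pair(x, y, theta):
    # RoPE applies a 2D rotation to each consecutive pair of dimensions.
    c, s = math.cos(theta), math.sin(theta)
    return (x * c - y * s, x * s + y * c)

# A position beyond a (hypothetical) 4096-token trained window is mapped
# back inside it with scale = 2:
orig = rope_angles(6000)
scaled = rope_angles(6000, scale=2.0)
print(scaled[0], "==", rope_angles(3000)[0])
```

With `scale=2`, position 6000 is encoded identically to position 3000 at the original scale, which is why interpolated models need only light fine-tuning rather than retraining from scratch.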
14.5 Efficient Inference (~4pp)
Quantization (GPTQ, AWQ, GGUF), LoRA and parameter-efficient fine-tuning, KV-cache optimization, speculative decoding, and distillation.
- 14.5.1 Quantization: INT8, INT4, and Beyond
- 14.5.2 LoRA and Parameter-Efficient Fine-Tuning
- 14.5.3 KV-Cache, Speculative Decoding, and Distillation
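The LoRA update $y = xW + \frac{\alpha}{r} xAB$ can be sketched with plain list-based matrices. A minimal illustration with made-up tiny dimensions: $W$ stays frozen while only the low-rank factors $A$ ($d \times r$) and $B$ ($r \times d$) are trained.

```python
def matmul(X, W):
    # X: (n x k), W: (k x m), both as lists of lists.
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)]
            for row in X]

def lora_forward(x, W, A, B, alpha=2.0, r=2):
    # y = x W + (alpha / r) * x A B: frozen base weight plus a scaled
    # low-rank update. Only A and B receive gradients during fine-tuning.
    base = matmul(x, W)
    low = matmul(matmul(x, A), B)
    s = alpha / r
    return [[b + s * l for b, l in zip(br, lr)] for br, lr in zip(base, low)]

d, r = 4, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.1] * r for _ in range(d)]  # trained, d x r
B = [[0.1] * d for _ in range(r)]  # trained, r x d
x = [[1.0, 2.0, 3.0, 4.0]]
print(lora_forward(x, W, A, B))
# Trainable parameters per adapted matrix: 2*d*r versus d*d for full tuning.
print(2 * d * r, "vs", d * d)
```

The parameter count at the end is the whole point: $2dr$ grows linearly in $d$ rather than quadratically, which is what makes 7B-scale fine-tuning feasible on a single GPU.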
Key Equations
Key Figures
Exercises
Theory
- LoRA Parameter Savings (Basic). Derive the LoRA parameter savings for a Transformer with $L=32$ layers, each having 4 weight matrices of dimension $d=4096$, with rank $r=16$. What fraction of total parameters are trainable?
- RAG Precision Trade-off (Intermediate). If a retriever returns $k$ documents with precision $p$, how does the expected number of relevant documents scale? What happens to generation quality when most retrieved documents are irrelevant?
- Speculative Decoding Correctness (Intermediate). Explain why speculative decoding preserves the output distribution of the target model exactly. Given an acceptance rate $\alpha$, estimate the expected speedup.
Programming
- RAG Pipeline (Basic). Build a RAG pipeline using sentence-transformers for retrieval. Index 100 Wikipedia paragraphs, retrieve top-3 for 20 questions, and compare with a closed-book baseline.
- ReAct Agent (Intermediate). Implement a ReAct agent with calculator and search tools. Test on 30 multi-step questions. Report steps per question, tool-use frequency, and accuracy.
- LoRA Fine-Tuning (Intermediate). Apply LoRA fine-tuning ($r=16$) to a 7B model on 1000 instruction examples. Compare trainable parameters, GPU memory, and task performance with full fine-tuning.
- Quantization Benchmark (Intermediate). Quantize a model to INT4 using GPTQ. Compare FP16, INT8, INT4 on perplexity, inference speed, and model size.
- Vision-Language Pipeline (Advanced). Use CLIP to encode images, project embeddings into a small LLM's input space, and generate captions. Evaluate on 50 COCO validation images.
Cross-References
This chapter references:
- Ch 1 (Section 1.1): The prediction paradigm. Chapter 14 extends this to multimodal prediction, retrieval-grounded prediction, and agent-mediated prediction.
- Ch 10 (Sections 10.4--10.5, soft): Tokenization and data scale. Efficient tokenization impacts context length and retrieval granularity.
- Ch 13 (Sections 13.2--13.4, soft): Prompting and tool use. Chapter 14 builds on these with retrieval-augmented prompts (RAG) and multi-step tool-using agents (ReAct).
This chapter is referenced by:
- No later chapters directly depend on this chapter.
Key Papers
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS). [Section 14.1]
- Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. Proceedings of ICLR. [Section 14.2]
- Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models from Natural Language Supervision. Proceedings of ICML. [Section 14.3.1]
- Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. Advances in Neural Information Processing Systems (NeurIPS). [Section 14.3.2]
- Hu, E. J., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of ICLR. [Section 14.5.2]
- Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of EMNLP. [Section 14.1.2]
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of ICML. [Section 14.5.3]
- Frantar, E., et al. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. Proceedings of ICLR. [Section 14.5.1]
- Liu, N. F., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL. [Section 14.4.3]