In-Context Learning, Prompting, and Reasoning
Prerequisites
Summary
Chapter 13 explores one of the most surprising discoveries in modern NLP: large language models can "learn" new tasks at inference time from a handful of demonstrations in the prompt, without updating a single parameter. This phenomenon -- in-context learning (ICL) -- was first demonstrated at scale by Brown et al. (2020) with GPT-3 and challenges the conventional understanding of learning as weight updates. The chapter covers three layers: the phenomenon itself (ICL and its theoretical explanations), the engineering practice (prompt engineering and system prompts), and the reasoning frontier (chain-of-thought prompting, self-consistency, and tree-of-thought). It then extends to tool use and function calling as a bridge to agentic systems (Ch 14), and closes with an honest assessment of prompting's limits and the decision framework for choosing between prompting, fine-tuning, and RAG.
Learning Objectives
- Define in-context learning (ICL) and explain how a pre-trained autoregressive model can perform new tasks at inference time without any gradient updates, using only demonstrations provided in the prompt.
- Apply systematic prompt engineering principles -- including role specification, few-shot exemplar selection, output formatting, and system prompts -- to measurably improve model performance on classification, extraction, and generation tasks.
- Implement chain-of-thought (CoT) prompting and its variants (zero-shot CoT, self-consistency, tree-of-thought), and explain why eliciting intermediate reasoning steps improves performance on multi-step tasks.
- Describe how tool use and function calling extend language model capabilities beyond text generation, and identify the practical boundaries where prompting fails and fine-tuning becomes necessary.
Section Outline
13.1 In-Context Learning (~5pp)
The surprising discovery that sufficiently large pre-trained language models can learn tasks from a few demonstrations in the prompt (Brown et al., 2020). Zero-shot, one-shot, and few-shot learning. Theoretical perspectives: implicit Bayesian inference, implicit gradient descent, or pattern matching? Sensitivity to prompt formatting and exemplar ordering.
- 13.1.1 Zero-Shot, One-Shot, Few-Shot
- 13.1.2 How Does ICL Work? Theoretical Perspectives
- 13.1.3 Sensitivity to Prompt Format and Exemplar Choice
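The k-shot settings of 13.1.1 are, mechanically, just prompt assembly. A minimal sketch (the template wording, field names, and labels here are illustrative, not a fixed format):

```python
# Hedged sketch of few-shot prompt assembly; the "Review:" / "Sentiment:"
# template and the example texts are invented for illustration.

def build_prompt(demos, query,
                 instruction="Classify the sentiment as Positive or Negative."):
    """Assemble a k-shot prompt: instruction, k demonstrations, then the query."""
    blocks = [instruction]
    for text, label in demos:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    blocks.append(f"Review: {query}\nSentiment:")  # the model completes from here
    return "\n\n".join(blocks)

demos = [
    ("A delightful film from start to finish.", "Positive"),
    ("Two hours of my life I will never get back.", "Negative"),
]
prompt = build_prompt(demos, "The plot was thin but the acting carried it.")
```

Passing zero, one, or several `(text, label)` pairs gives exactly the zero-, one-, and few-shot settings; no parameter of the model changes between them.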
13.2 Prompt Engineering (~6pp)
Practical principles for designing effective prompts. The anatomy of a prompt: system instruction, task description, exemplars, input, output format specification. Few-shot exemplar selection, system prompts, and automated prompt optimization.
- 13.2.1 The Anatomy of an Effective Prompt
- 13.2.2 Few-Shot Exemplar Selection and Ordering
- 13.2.3 System Prompts and Output Formatting
- 13.2.4 Automated Prompt Optimization
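The five-part anatomy of 13.2.1 maps naturally onto the role/content message schema most chat APIs accept. A hedged sketch, with all example strings invented for illustration:

```python
# Sketch: system instruction -> system message; exemplars -> user/assistant
# pairs; task description, input, and format spec -> the final user turn.

def make_messages(system, task, exemplars, user_input, output_format):
    """Assemble the five prompt components into a chat-style message list."""
    msgs = [{"role": "system", "content": system}]
    for ex_input, ex_output in exemplars:
        msgs.append({"role": "user", "content": f"{task}\n\nInput: {ex_input}"})
        msgs.append({"role": "assistant", "content": ex_output})
    msgs.append({"role": "user",
                 "content": f"{task}\n\nInput: {user_input}\n\n{output_format}"})
    return msgs

msgs = make_messages(
    system="You are a careful annotator.",
    task="Extract the city mentioned in the sentence.",
    exemplars=[("I flew to Paris yesterday.", "Paris")],
    user_input="We landed in Tokyo at dawn.",
    output_format="Answer with the city name only.",
)
```

Rendering exemplars as prior user/assistant turns (rather than inlining them in one string) lets the same structure serve both few-shot selection (13.2.2) and output formatting (13.2.3).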
13.3 Chain-of-Thought Reasoning (~6pp)
Prompting the model with exemplars that spell out intermediate reasoning steps dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks (Wei et al., 2022). Zero-shot CoT's "Let's think step by step" trigger (Kojima et al., 2022), self-consistency via majority voting, tree-of-thought, and the faithfulness debate.
- 13.3.1 Chain-of-Thought Prompting
- 13.3.2 Zero-Shot CoT and Self-Consistency
- 13.3.3 Tree-of-Thought and Structured Reasoning
- 13.3.4 Faithfulness and Limitations of Stated Reasoning
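Self-consistency (13.3.2) reduces to sampling several reasoning chains and voting only on their final answers. A toy sketch, with a stubbed noisy sampler standing in for the LLM:

```python
import random
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across sampled chains."""
    return Counter(answers).most_common(1)[0][0]

def self_consistency(sample_chain, question, k=10, seed=0):
    """Sample k chains, discard the reasoning, vote on the answers
    (Wang et al., 2023). sample_chain(question, rng) must return a
    (reasoning, final_answer) pair."""
    rng = random.Random(seed)
    answers = [sample_chain(question, rng)[1] for _ in range(k)]
    return majority_vote(answers)

# Stub for an LLM sampler: right 70% of the time, otherwise a scattered
# wrong answer. Voting works because errors spread while truth concentrates.
def noisy_chain(question, rng):
    return ("step 1 ... step n",
            "42" if rng.random() < 0.7 else str(rng.randint(0, 9)))

answer = self_consistency(noisy_chain, "What is 6 * 7?", k=25)
```

The stub also illustrates the condition probed in the exercises: voting helps only when individual chains are right more often than they agree on any single wrong answer.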
13.4 Tool Use and Function Calling (~4pp)
Extending LLMs beyond text-in/text-out by connecting them to external tools: calculators, web search, code interpreters, databases, and APIs. The function-calling interface and Toolformer. Bridge from prompting to agentic systems (Ch 14).
- 13.4.1 Why LLMs Need External Tools
- 13.4.2 Function Calling: Architecture and Interface
- 13.4.3 Toolformer and Self-Taught Tool Use
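The function-calling interface of 13.4.2 boils down to a loop: the model emits a structured tool call, the runtime parses and executes it, and the result is fed back as the next turn. A minimal sketch with a stubbed model output and a single calculator tool (the JSON shape here is illustrative; real APIs differ in detail):

```python
import json
import re

def calculator(expression):
    """Tool: evaluate a basic arithmetic expression. The character
    whitelist below makes eval safe for this restricted grammar."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return eval(expression)

TOOLS = {"calculator": calculator}

def run_tool_call(model_output):
    """Parse the model's JSON tool call, dispatch to the named tool,
    and return the result as text for the next conversation turn."""
    call = json.loads(model_output)
    result = TOOLS[call["tool"]](**call["arguments"])
    return str(result)

# Stubbed model turn: a real LLM would emit this JSON when it decides the
# question needs arithmetic it cannot do reliably on its own.
model_output = '{"tool": "calculator", "arguments": {"expression": "3 * (4 + 5)"}}'
result = run_tool_call(model_output)  # "27", appended to the conversation
```

The division of labor is the point: the model decides *when* and *with what arguments* to call a tool; the runtime, not the model, performs the computation.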
13.5 The Limits of Prompting (~4pp)
When ICL and prompting are insufficient: domain-specific knowledge, consistent formatting at scale, latency constraints, and the prompt vs. fine-tune vs. RAG decision framework.
- 13.5.1 Where Prompting Fails
- 13.5.2 The Prompt vs. Fine-Tune vs. RAG Decision
- 13.5.3 The Cost-Quality-Latency Triangle
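The decision framework of 13.5.2 can be caricatured as a few first-order rules. This toy function is a deliberate oversimplification (real decisions weigh the cost-quality-latency triangle of 13.5.3 jointly), but it captures the usual starting logic:

```python
def choose_adaptation(needs_fresh_knowledge, needs_strict_format_at_scale,
                      latency_critical, has_labeled_data):
    """Toy first-order rules for the prompt / fine-tune / RAG choice;
    the predicates and priorities here are illustrative, not prescriptive."""
    if needs_fresh_knowledge:
        return "RAG"        # knowledge that lives outside the weights
    if (needs_strict_format_at_scale or latency_critical) and has_labeled_data:
        return "fine-tune"  # bake the behavior into the weights
    return "prompt"         # cheapest to iterate on; no training required
```

Prompting is the default because it is the cheapest to change; the other two branches trade iteration speed for capability (RAG) or for consistency and latency (fine-tuning).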
Key Equations
Key Figures
Exercises
Theory
- ICL vs. Fine-Tuning (Basic). Explain why in-context learning does NOT update the model's parameters. Distinguish functional learning (behavior changes) from parametric learning (weight changes).
- Exemplar Ordering (Intermediate). Analyze why exemplar ordering affects ICL performance. Design an ordering strategy for a 5-class classification task with 3 exemplars per class, accounting for recency bias.
- Self-Consistency Conditions (Intermediate). Explain under what conditions self-consistency improves over single-sample CoT. What is the minimum accuracy of individual chains needed for majority voting to help?
- Information Capacity (Basic). Compare the information-theoretic capacity of zero-shot vs. five-shot prompts. If each exemplar adds ~100 tokens, how does this compare to the model's billions of parameters?
Programming
- Few-Shot Classifier (Basic). Build a few-shot sentiment classifier using only prompting with 0, 1, 3, and 5 exemplars. Plot accuracy vs. number of exemplars on 100 test examples.
- CoT with Self-Consistency (Intermediate). Implement chain-of-thought prompting with self-consistency ($K=10$) on a math reasoning dataset. Plot accuracy as a function of $K$.
- Function-Calling Loop (Intermediate). Implement a function-calling loop with a calculator tool for 50 arithmetic word problems. Compare accuracy with direct prompting.
- Prompt vs. Fine-Tune (Intermediate). Compare prompt-only vs. fine-tuned performance on a classification task. Report accuracy, consistency, and inference time.
- ICL as Gradient Descent (Advanced). Critically evaluate the "ICL as implicit gradient descent" hypothesis (Akyurek et al., 2023). What evidence supports this? What are the limitations?
Cross-References
This chapter references:
- Ch 1 (Section 1.1): The prediction paradigm. ICL reveals that prediction, when scaled sufficiently, can perform arbitrary tasks without any architectural change -- the task specification is encoded in the context.
- Ch 9 (Sections 9.3, 9.5): GPT and autoregressive language modeling. The CLM objective from Chapter 9 is the mechanism through which ICL operates: the model conditions on the prompt to predict the output tokens.
This chapter is referenced by:
- Ch 14 (Sections 14.1, 14.2): RAG and agents extend the prompting capabilities introduced here. RAG augments prompts with retrieved documents; agents chain multiple tool-augmented prompting steps.
Key Papers
- Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models Are Few-Shot Learners. Advances in NeurIPS. [Section 13.1]
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in NeurIPS. [Sections 13.3.1--13.3.2]
- Wang, X., Wei, J., Schuurmans, D., et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. Proceedings of ICLR. [Section 13.3.2]
- Akyurek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. (2023). What Learning Algorithm Is In-Context Learning? Investigations with Linear Models. Proceedings of ICLR. [Section 13.1.2]
- Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. Advances in NeurIPS. [Section 13.3.2]
- Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in NeurIPS. [Section 13.3.3]
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in NeurIPS. [Section 13.4.3]
- Khattab, O., et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714. [Section 13.2.4]
- Turpin, M., et al. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in NeurIPS. [Section 13.3.4]