In-Context Learning, Prompting, and Reasoning
Prerequisites
Summary
Chapter 13 explores one of the most surprising discoveries in modern NLP: large language models can "learn" new tasks at inference time from a handful of demonstrations in the prompt, without updating a single parameter. This phenomenon -- in-context learning (ICL) -- was first demonstrated at scale by Brown et al. (2020) with GPT-3 and challenges the conventional understanding of learning as weight updates. The chapter covers three layers: the phenomenon itself (ICL and its theoretical explanations), the engineering practice (prompt engineering and system prompts), and the reasoning frontier (chain-of-thought prompting, self-consistency, and tree-of-thought). It then extends to tool use and function calling as a bridge to agentic systems (Ch 14), and closes with an honest assessment of prompting's limits and the decision framework for choosing between prompting, fine-tuning, and RAG.
Learning Objectives
- Define in-context learning (ICL) and explain how a pre-trained autoregressive model can perform new tasks at inference time without any gradient updates, using only demonstrations provided in the prompt.
- Apply systematic prompt engineering principles -- including role specification, few-shot exemplar selection, output formatting, and system prompts -- to measurably improve model performance on classification, extraction, and generation tasks.
- Implement chain-of-thought (CoT) prompting and its variants (zero-shot CoT, self-consistency, tree-of-thought), and explain why eliciting intermediate reasoning steps improves performance on multi-step tasks.
- Describe how tool use and function calling extend language model capabilities beyond text generation, and identify the practical boundaries where prompting fails and fine-tuning becomes necessary.
Section Outline
13.1 In-Context Learning (~5pp)
The surprising discovery that sufficiently large pre-trained language models can learn tasks from a few demonstrations in the prompt (Brown et al., 2020). Zero-shot, one-shot, and few-shot learning. Theoretical perspectives: implicit Bayesian inference, implicit gradient descent, or pattern matching? Sensitivity to prompt formatting and exemplar ordering.
- 13.1.1 Zero-Shot, One-Shot, Few-Shot
- 13.1.2 How Does ICL Work? Theoretical Perspectives
- 13.1.3 Sensitivity to Prompt Format and Exemplar Choice
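The k-shot settings of 13.1.1 are, mechanically, just prompt assembly. A minimal sketch (the template wording, field names, and labels here are illustrative, not a fixed format):

```python
# Hedged sketch of few-shot prompt assembly; the "Review:" / "Sentiment:"
# template and the example texts are invented for illustration.

def build_prompt(demos, query,
                 instruction="Classify the sentiment as Positive or Negative."):
    """Assemble a k-shot prompt: instruction, k demonstrations, then the query."""
    blocks = [instruction]
    for text, label in demos:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    blocks.append(f"Review: {query}\nSentiment:")  # the model completes from here
    return "\n\n".join(blocks)

demos = [
    ("A delightful film from start to finish.", "Positive"),
    ("Two hours of my life I will never get back.", "Negative"),
]
prompt = build_prompt(demos, "The plot was thin but the acting carried it.")
```

Passing zero, one, or several `(text, label)` pairs gives exactly the zero-, one-, and few-shot settings; no parameter of the model changes between them.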
13.2 Prompt Engineering (~6pp)
Practical principles for designing effective prompts. The anatomy of a prompt: system instruction, task description, exemplars, input, output format specification. Few-shot exemplar selection, system prompts, and automated prompt optimization.
- 13.2.1 The Anatomy of an Effective Prompt
- 13.2.2 Few-Shot Exemplar Selection and Ordering
- 13.2.3 System Prompts and Output Formatting
- 13.2.4 Automated Prompt Optimization
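The five-part anatomy of 13.2.1 maps naturally onto the role/content message schema most chat APIs accept. A hedged sketch, with all example strings invented for illustration:

```python
# Sketch: system instruction -> system message; exemplars -> user/assistant
# pairs; task description, input, and format spec -> the final user turn.

def make_messages(system, task, exemplars, user_input, output_format):
    """Assemble the five prompt components into a chat-style message list."""
    msgs = [{"role": "system", "content": system}]
    for ex_input, ex_output in exemplars:
        msgs.append({"role": "user", "content": f"{task}\n\nInput: {ex_input}"})
        msgs.append({"role": "assistant", "content": ex_output})
    msgs.append({"role": "user",
                 "content": f"{task}\n\nInput: {user_input}\n\n{output_format}"})
    return msgs

msgs = make_messages(
    system="You are a careful annotator.",
    task="Extract the city mentioned in the sentence.",
    exemplars=[("I flew to Paris yesterday.", "Paris")],
    user_input="We landed in Tokyo at dawn.",
    output_format="Answer with the city name only.",
)
```

Rendering exemplars as prior user/assistant turns (rather than inlining them in one string) lets the same structure serve both few-shot selection (13.2.2) and output formatting (13.2.3).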
13.3 Chain-of-Thought Reasoning (~6pp)
Prompting the model with exemplars that spell out intermediate reasoning steps dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks (Wei et al., 2022). Zero-shot CoT's "Let's think step by step" trigger (Kojima et al., 2022), self-consistency via majority voting, tree-of-thought, and the faithfulness debate.
- 13.3.1 Chain-of-Thought Prompting
- 13.3.2 Zero-Shot CoT and Self-Consistency
- 13.3.3 Tree-of-Thought and Structured Reasoning
- 13.3.4 Faithfulness and Limitations of Stated Reasoning
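Self-consistency (13.3.2) reduces to sampling several reasoning chains and voting only on their final answers. A toy sketch, with a stubbed noisy sampler standing in for the LLM:

```python
import random
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across sampled chains."""
    return Counter(answers).most_common(1)[0][0]

def self_consistency(sample_chain, question, k=10, seed=0):
    """Sample k chains, discard the reasoning, vote on the answers
    (Wang et al., 2023). sample_chain(question, rng) must return a
    (reasoning, final_answer) pair."""
    rng = random.Random(seed)
    answers = [sample_chain(question, rng)[1] for _ in range(k)]
    return majority_vote(answers)

# Stub for an LLM sampler: right 70% of the time, otherwise a scattered
# wrong answer. Voting works because errors spread while truth concentrates.
def noisy_chain(question, rng):
    return ("step 1 ... step n",
            "42" if rng.random() < 0.7 else str(rng.randint(0, 9)))

answer = self_consistency(noisy_chain, "What is 6 * 7?", k=25)
```

The stub also illustrates the condition probed in the exercises: voting helps only when individual chains are right more often than they agree on any single wrong answer.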
13.4 Tool Use and Function Calling (~4pp)
Extending LLMs beyond text-in/text-out by connecting them to external tools: calculators, web search, code interpreters, databases, and APIs. The function-calling interface and Toolformer. Bridge from prompting to agentic systems (Ch 14).
- 13.4.1 Why LLMs Need External Tools
- 13.4.2 Function Calling: Architecture and Interface
- 13.4.3 Toolformer and Self-Taught Tool Use
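The function-calling interface of 13.4.2 boils down to a loop: the model emits a structured tool call, the runtime parses and executes it, and the result is fed back as the next turn. A minimal sketch with a stubbed model output and a single calculator tool (the JSON shape here is illustrative; real APIs differ in detail):

```python
import json
import re

def calculator(expression):
    """Tool: evaluate a basic arithmetic expression. The character
    whitelist below makes eval safe for this restricted grammar."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return eval(expression)

TOOLS = {"calculator": calculator}

def run_tool_call(model_output):
    """Parse the model's JSON tool call, dispatch to the named tool,
    and return the result as text for the next conversation turn."""
    call = json.loads(model_output)
    result = TOOLS[call["tool"]](**call["arguments"])
    return str(result)

# Stubbed model turn: a real LLM would emit this JSON when it decides the
# question needs arithmetic it cannot do reliably on its own.
model_output = '{"tool": "calculator", "arguments": {"expression": "3 * (4 + 5)"}}'
result = run_tool_call(model_output)  # "27", appended to the conversation
```

The division of labor is the point: the model decides *when* and *with what arguments* to call a tool; the runtime, not the model, performs the computation.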
13.5 The Limits of Prompting (~4pp)
When ICL and prompting are insufficient: domain-specific knowledge, consistent formatting at scale, latency constraints, and the prompt vs. fine-tune vs. RAG decision framework.
- 13.5.1 Where Prompting Fails
- 13.5.2 The Prompt vs. Fine-Tune vs. RAG Decision
- 13.5.3 The Cost-Quality-Latency Triangle
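The decision framework of 13.5.2 can be caricatured as a few first-order rules. This toy function is a deliberate oversimplification (real decisions weigh the cost-quality-latency triangle of 13.5.3 jointly), but it captures the usual starting logic:

```python
def choose_adaptation(needs_fresh_knowledge, needs_strict_format_at_scale,
                      latency_critical, has_labeled_data):
    """Toy first-order rules for the prompt / fine-tune / RAG choice;
    the predicates and priorities here are illustrative, not prescriptive."""
    if needs_fresh_knowledge:
        return "RAG"        # knowledge that lives outside the weights
    if (needs_strict_format_at_scale or latency_critical) and has_labeled_data:
        return "fine-tune"  # bake the behavior into the weights
    return "prompt"         # cheapest to iterate on; no training required
```

Prompting is the default because it is the cheapest to change; the other two branches trade iteration speed for capability (RAG) or for consistency and latency (fine-tuning).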
Key Equations
Key Figures
Exercises
Theory
- ICL vs. Fine-Tuning (Basic). Explain why in-context learning does NOT update the model's parameters. Distinguish functional learning (behavior changes) from parametric learning (weight changes).
- Exemplar Ordering (Intermediate). Analyze why exemplar ordering affects ICL performance. Design an ordering strategy for a 5-class classification task with 3 exemplars per class, accounting for recency bias.
- Self-Consistency Conditions (Intermediate). Explain under what conditions self-consistency improves over single-sample CoT. What is the minimum accuracy of individual chains needed for majority voting to help?
- Information Capacity (Basic). Compare the information-theoretic capacity of zero-shot vs. five-shot prompts. If each exemplar adds ~100 tokens, how does this compare to the model's billions of parameters?
Programming
- Few-Shot Classifier (Basic). Build a few-shot sentiment classifier using only prompting with 0, 1, 3, and 5 exemplars. Plot accuracy vs. number of exemplars on 100 test examples.
- CoT with Self-Consistency (Intermediate). Implement chain-of-thought prompting with self-consistency ($K=10$) on a math reasoning dataset. Plot accuracy as a function of $K$.
- Function-Calling Loop (Intermediate). Implement a function-calling loop with a calculator tool for 50 arithmetic word problems. Compare accuracy with direct prompting.
- Prompt vs. Fine-Tune (Intermediate). Compare prompt-only vs. fine-tuned performance on a classification task. Report accuracy, consistency, and inference time.
- ICL as Gradient Descent (Advanced). Critically evaluate the "ICL as implicit gradient descent" hypothesis (Akyurek et al., 2023). What evidence supports this? What are the limitations?
Cross-References
This chapter references:
- Ch 1 (Section 1.1): The prediction paradigm. ICL reveals that prediction, when scaled sufficiently, can perform arbitrary tasks without any architectural change -- the task specification is encoded in the context.
- Ch 9 (Sections 9.3, 9.5): GPT and autoregressive language modeling. The CLM objective from Chapter 9 is the mechanism through which ICL operates: the model conditions on the prompt to predict the output tokens.
This chapter is referenced by:
- Ch 14 (Sections 14.1, 14.2): RAG and agents extend the prompting capabilities introduced here. RAG augments prompts with retrieved documents; agents chain multiple tool-augmented prompting steps.
Key Papers
- Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models Are Few-Shot Learners. Advances in NeurIPS. [Section 13.1]
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in NeurIPS. [Sections 13.3.1--13.3.2]
- Wang, X., Wei, J., Schuurmans, D., et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. Proceedings of ICLR. [Section 13.3.2]
- Akyurek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. (2023). What Learning Algorithm Is In-Context Learning? Investigations with Linear Models. Proceedings of ICLR. [Section 13.1.2]
- Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. Advances in NeurIPS. [Section 13.3.2]
- Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in NeurIPS. [Section 13.3.3]
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in NeurIPS. [Section 13.4.3]
- Khattab, O., et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714. [Section 13.2.4]
- Turpin, M., et al. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in NeurIPS. [Section 13.3.4]