Week 10: Agent Evaluation
Benchmarks, metrics, and assessment for rigorous agent testing
Week 10 of 12
Learning Objectives
- Define AgentBench, SWE-bench, GAIA, and LLM-as-Judge
- Explain why agent evaluation differs from LLM evaluation
- Run agents against standard benchmarks and interpret results
- Compare agent performance across different dimensions
- Assess reliability and validity of different evaluation methods
- Design custom evaluation protocols for novel applications
Topics Covered
- Why agent evaluation differs from LLM evaluation
- Major benchmarks (AgentBench, SWE-bench, WebArena, GAIA)
- Evaluation dimensions (success, efficiency, safety, cost)
- LLM-as-Judge methodology and limitations
- Designing custom evaluation protocols
Resources
Jupyter Notebooks
Required Readings
| Paper | Authors | Year | Link |
|---|---|---|---|
| AgentBench: Evaluating LLMs as Agents | Liu et al. | 2023 | arXiv |
| WebArena: A Realistic Web Environment for Building Autonomous Agents | Zhou et al. | 2023 | arXiv |
| GAIA: A Benchmark for General AI Assistants | Mialon et al. | 2023 | arXiv |
Reading Guide: Agent Evaluation and Benchmarking
Analysis of AgentBench, WebArena, GAIA, and SWE-bench
Primary Paper
Secondary Papers
Exercise: Agent Evaluation
Design and implement agent evaluation frameworks
Learning Objectives
- Create: Design evaluation frameworks for a chosen agent type
- Apply: Implement automated assessment with LLM-as-Judge
- Analyze: Interpret agent performance results and their limitations
Tasks
| Task | Points | Description |
|---|---|---|
| Benchmark Design | 35 | Design custom evaluation tasks |
| LLM-as-Judge | 35 | Implement automated evaluation |
| Analysis Report | 30 | Analyze results and limitations |
Key Concepts
Agent Evaluation Challenges:
- Trajectory dependence: Many valid paths to same goal
- Partial credit: How to score incomplete solutions?
- Environment variance: Results depend on environment state
- Cost: Each evaluation run costs time and API calls
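The trajectory-dependence and partial-credit challenges above can be handled together with milestone-based scoring: instead of a binary pass/fail, award credit for each goal-relevant milestone a trajectory reaches, using order-independent checks so that different valid paths score the same. A minimal sketch (the `score_trajectory` helper and the milestone predicates are hypothetical, for illustration only):

```python
# Hypothetical sketch: milestone-based partial credit for agent trajectories.
# Order-independent predicates over the whole trajectory mean two different
# but equally valid action sequences earn the same score.

def score_trajectory(trajectory, milestones):
    """Return (partial_score, success) for a list of (action, observation) steps.

    `milestones` maps a name to a predicate over the full trajectory.
    """
    hit = {name for name, check in milestones.items() if check(trajectory)}
    partial = len(hit) / len(milestones)
    return partial, partial == 1.0

# Toy file-editing task with two milestones (illustrative names).
milestones = {
    "opened_file": lambda t: any(a == "open" for a, _ in t),
    "saved_edit":  lambda t: any(a == "save" for a, _ in t),
}

traj = [("open", "ok"), ("edit", "ok")]  # incomplete: the agent never saved
print(score_trajectory(traj, milestones))  # -> (0.5, False)
```

Milestone checks do not remove environment variance or cost, but they make incomplete runs comparable across agents, which binary success rates cannot.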
Major Benchmarks:
- AgentBench: 8 environments spanning OS, database, knowledge graph, games, and web tasks
- SWE-bench: Real GitHub issues from Python repositories
- WebArena: Realistic web environments (shopping, forums)
- GAIA: General-assistant tasks requiring reasoning, multi-modality, and tool use
LLM-as-Judge: Using an LLM to automatically evaluate agent outputs. Flexible and scalable, but subject to known biases: position bias (favoring whichever answer is shown first), self-preference (favoring outputs from the same model family), and verbosity bias (favoring longer answers).
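One standard mitigation for position bias is to run each pairwise comparison in both orders and accept a verdict only when the two orders agree. A hedged sketch, where `call_judge` stands in for a real LLM call and is purely illustrative:

```python
# Sketch of pairwise LLM-as-Judge with a position-bias check: judge each
# pair in both presentation orders; a verdict that flips with order is
# treated as unreliable and recorded as a tie.

def judge_pair(call_judge, rubric, output_a, output_b):
    """Return 'A', 'B', or 'tie'. call_judge(rubric, first, second) -> 'first'|'second'."""
    v1 = call_judge(rubric, output_a, output_b)   # A shown first
    v2 = call_judge(rubric, output_b, output_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # verdict depended on position: don't trust it

# Toy judge that always prefers the longer answer (a verbosity bias).
biased = lambda rubric, first, second: "first" if len(first) >= len(second) else "second"
print(judge_pair(biased, "prefer correct answers", "short", "much longer answer"))  # -> 'B'
```

Note what the toy judge shows: order-swapping catches position bias (an always-"first" judge yields a tie), but a consistent verbosity bias survives the swap, which is why structured rubrics are still needed on top.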
Exercise
Design an evaluation framework for a specific agent type:
- Define clear success criteria (binary or graded scoring)
- Create a representative task suite covering difficulty range
- Implement LLM-as-Judge evaluation with structured rubrics
- Establish baselines (random, human, prior models)
- Compare multiple agent architectures with statistical significance
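For the last step, a paired bootstrap over per-task outcomes is a simple way to check whether a success-rate gap between two agents is statistically meaningful on a shared task suite. A minimal sketch with toy, made-up outcome data (the function name and numbers are illustrative):

```python
import random

# Paired bootstrap comparison of two agents evaluated on the same tasks.
# Each list holds per-task 0/1 success outcomes; resampling tasks jointly
# preserves the pairing between agents.

def bootstrap_win_rate(a, b, iters=10_000, seed=0):
    """Return the fraction of resampled task suites where agent A beats agent B."""
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / iters

agent_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # 70% success (toy numbers)
agent_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]  # 40% success
print(bootstrap_win_rate(agent_a, agent_b))
```

A win rate near 1.0 suggests the gap is robust to which tasks happened to be in the suite; values near 0.5 mean the suite is too small or the agents too close to call. In practice the suite should be far larger than ten tasks.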
Discussion Questions
- How do you prevent benchmark overfitting when developing agents?
- When is LLM-as-Judge reliable? What are its failure modes?
- What makes a good human baseline for agent evaluation?
- How should we weigh success rate against efficiency and cost?
- Are current benchmarks representative of real-world agent use cases?
Additional Resources