Week 10: Agent Evaluation
Benchmarks, metrics, and assessment for rigorous agent testing
Week 10 of 12
Learning Objectives
- Define AgentBench, SWE-bench, GAIA, and LLM-as-Judge
- Explain why agent evaluation differs from LLM evaluation
- Run agents against standard benchmarks and interpret results
- Compare agent performance across different dimensions
- Assess reliability and validity of different evaluation methods
- Design custom evaluation protocols for novel applications
Topics Covered
- Why agent evaluation differs from LLM evaluation
- Major benchmarks (AgentBench, SWE-bench, WebArena, GAIA)
- Evaluation dimensions (success, efficiency, safety, cost)
- LLM-as-Judge methodology and limitations
- Designing custom evaluation protocols
Resources
Jupyter Notebooks
Required Readings
| Paper | Authors | Year | Link |
|---|---|---|---|
| AgentBench: Evaluating LLMs as Agents | Liu et al. | 2023 | arXiv |
| WebArena: A Realistic Web Environment for Building Autonomous Agents | Zhou et al. | 2023 | arXiv |
| GAIA: A Benchmark for General AI Assistants | Mialon et al. | 2023 | arXiv |
Reading Guide: Agent Evaluation and Benchmarking
Analysis of AgentBench, WebArena, GAIA, and SWE-bench
Primary Paper
Secondary Papers
Exercise: Agent Evaluation
Design and implement agent evaluation frameworks
Learning Objectives
- Create: Design evaluation frameworks for a chosen agent type
- Apply: Implement automated assessment with LLM-as-Judge
- Analyze: Interpret agent performance results and their limitations
Tasks
| Task | Points | Description |
|---|---|---|
| Benchmark Design | 35 | Design custom evaluation tasks |
| LLM-as-Judge | 35 | Implement automated evaluation |
| Analysis Report | 30 | Analyze results and limitations |
Key Concepts
Agent Evaluation Challenges:
- Trajectory dependence: Many valid paths to same goal
- Partial credit: How to score incomplete solutions?
- Environment variance: Results depend on environment state
- Cost: Each evaluation run costs time and API calls
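The trajectory-dependence and partial-credit challenges above can be handled together with milestone-based scoring: instead of a binary pass/fail, award credit for each goal-relevant milestone a trajectory reaches, using order-independent checks so that different valid paths score the same. A minimal sketch (the `score_trajectory` helper and the milestone predicates are hypothetical, for illustration only):

```python
# Hypothetical sketch: milestone-based partial credit for agent trajectories.
# Order-independent predicates over the whole trajectory mean two different
# but equally valid action sequences earn the same score.

def score_trajectory(trajectory, milestones):
    """Return (partial_score, success) for a list of (action, observation) steps.

    `milestones` maps a name to a predicate over the full trajectory.
    """
    hit = {name for name, check in milestones.items() if check(trajectory)}
    partial = len(hit) / len(milestones)
    return partial, partial == 1.0

# Toy file-editing task with two milestones (illustrative names).
milestones = {
    "opened_file": lambda t: any(a == "open" for a, _ in t),
    "saved_edit":  lambda t: any(a == "save" for a, _ in t),
}

traj = [("open", "ok"), ("edit", "ok")]  # incomplete: the agent never saved
print(score_trajectory(traj, milestones))  # -> (0.5, False)
```

Milestone checks do not remove environment variance or cost, but they make incomplete runs comparable across agents, which binary success rates cannot.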
Major Benchmarks:
- AgentBench: 8 environments spanning OS, database, knowledge graph, games, and web tasks
- SWE-bench: Real GitHub issues from Python repositories
- WebArena: Realistic web environments (shopping, forums)
- GAIA: General-assistant tasks requiring reasoning, multi-modality, and tool use
LLM-as-Judge: Using an LLM to automatically evaluate agent outputs. Flexible and scalable, but subject to known biases: position bias (favoring whichever answer is shown first), self-preference (favoring outputs from the same model family), and verbosity bias (favoring longer answers).
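One standard mitigation for position bias is to run each pairwise comparison in both orders and accept a verdict only when the two orders agree. A hedged sketch, where `call_judge` stands in for a real LLM call and is purely illustrative:

```python
# Sketch of pairwise LLM-as-Judge with a position-bias check: judge each
# pair in both presentation orders; a verdict that flips with order is
# treated as unreliable and recorded as a tie.

def judge_pair(call_judge, rubric, output_a, output_b):
    """Return 'A', 'B', or 'tie'. call_judge(rubric, first, second) -> 'first'|'second'."""
    v1 = call_judge(rubric, output_a, output_b)   # A shown first
    v2 = call_judge(rubric, output_b, output_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # verdict depended on position: don't trust it

# Toy judge that always prefers the longer answer (a verbosity bias).
biased = lambda rubric, first, second: "first" if len(first) >= len(second) else "second"
print(judge_pair(biased, "prefer correct answers", "short", "much longer answer"))  # -> 'B'
```

Note what the toy judge shows: order-swapping catches position bias (an always-"first" judge yields a tie), but a consistent verbosity bias survives the swap, which is why structured rubrics are still needed on top.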
Exercise
Design an evaluation framework for a specific agent type:
- Define clear success criteria (binary or graded scoring)
- Create a representative task suite covering difficulty range
- Implement LLM-as-Judge evaluation with structured rubrics
- Establish baselines (random, human, prior models)
- Compare multiple agent architectures with statistical significance
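For the last step, a paired bootstrap over per-task outcomes is a simple way to check whether a success-rate gap between two agents is statistically meaningful on a shared task suite. A minimal sketch with toy, made-up outcome data (the function name and numbers are illustrative):

```python
import random

# Paired bootstrap comparison of two agents evaluated on the same tasks.
# Each list holds per-task 0/1 success outcomes; resampling tasks jointly
# preserves the pairing between agents.

def bootstrap_win_rate(a, b, iters=10_000, seed=0):
    """Return the fraction of resampled task suites where agent A beats agent B."""
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in idx) > sum(b[i] for i in idx):
            wins += 1
    return wins / iters

agent_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # 70% success (toy numbers)
agent_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]  # 40% success
print(bootstrap_win_rate(agent_a, agent_b))
```

A win rate near 1.0 suggests the gap is robust to which tasks happened to be in the suite; values near 0.5 mean the suite is too small or the agents too close to call. In practice the suite should be far larger than ten tasks.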
Discussion Questions
- How do you prevent benchmark overfitting when developing agents?
- When is LLM-as-Judge reliable? What are its failure modes?
- What makes a good human baseline for agent evaluation?
- How should we weigh success rate against efficiency and cost?
- Are current benchmarks representative of real-world agent use cases?
Additional Resources