Week 10: Agent Evaluation

Benchmarks, metrics, and assessment for rigorous agent testing

Week 10 of 12

Learning Objectives

  • Define AgentBench, SWE-bench, GAIA, and LLM-as-Judge
  • Explain why agent evaluation differs from LLM evaluation
  • Run agents against standard benchmarks and interpret results
  • Compare agent performance across different dimensions
  • Assess reliability and validity of different evaluation methods
  • Design custom evaluation protocols for novel applications

Topics Covered

  • Why agent evaluation differs from LLM evaluation
  • Major benchmarks (AgentBench, SWE-bench, WebArena, GAIA)
  • Evaluation dimensions (success, efficiency, safety, cost)
  • LLM-as-Judge methodology and limitations
  • Designing custom evaluation protocols

Resources

Jupyter Notebooks

Benchmarking Suite (open in Colab)

Required Readings

Paper                                        Authors        Year  Link
AgentBench: Evaluating LLMs as Agents        Liu et al.     2023  arXiv
WebArena: A Realistic Web Environment        Zhou et al.    2024  arXiv
GAIA: A Benchmark for General AI Assistants  Mialon et al.  2024  arXiv

Reading Guide: Agent Evaluation and Benchmarking

Estimated time: 3-4 hours. Covers AgentBench, SWE-bench, and evaluation metrics.

Analysis of AgentBench, WebArena, GAIA, and SWE-bench

Primary Paper

AgentBench: Evaluating LLMs as Agents
Liu, X., Yu, H., Zhang, H., et al. (2023)
arXiv

Secondary Papers

  • WebArena: A Realistic Web Environment for Building Autonomous Agents - Zhou, S., Xu, F. F., et al. (2024) arXiv
  • GAIA: A Benchmark for General AI Assistants - Mialon, G., et al. (2024) arXiv
  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues? - Jimenez, C. E., et al. (2024) arXiv

Exercise: Agent Evaluation

100 points · 5-7 hours · Advanced

Design and implement agent evaluation frameworks

Learning Objectives

  • Create: Design evaluation frameworks
  • Apply: Implement automated assessment
  • Analyze: Analyze agent performance

Tasks

Task              Points  Description
Benchmark Design  35      Design custom evaluation tasks
LLM-as-Judge      35      Implement automated evaluation
Analysis Report   30      Analyze results and limitations

Key Concepts

Agent Evaluation Challenges:

  • Trajectory dependence: Many valid paths to same goal
  • Partial credit: How to score incomplete solutions?
  • Environment variance: Results depend on environment state
  • Cost: Each evaluation run costs time and API calls
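The partial-credit and efficiency challenges above can be sketched in code. This is a minimal illustration, not any benchmark's actual scoring rule: the `TaskResult` schema, the milestone-fraction scoring, and the step-budget discount are all hypothetical choices made for this example.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one agent run on one task (hypothetical schema)."""
    success: bool           # did the agent reach the goal?
    milestones_hit: int     # subgoals completed along the way
    milestones_total: int
    steps_taken: int        # actions used
    api_cost_usd: float

def partial_credit(result: TaskResult) -> float:
    """Graded score: full credit for success, else fraction of milestones hit."""
    if result.success:
        return 1.0
    if result.milestones_total == 0:
        return 0.0
    return result.milestones_hit / result.milestones_total

def efficiency_weighted(result: TaskResult, step_budget: int = 30) -> float:
    """Discount the score when the agent exceeds a step budget."""
    base = partial_credit(result)
    overage = max(0, result.steps_taken - step_budget)
    return base / (1 + overage / step_budget)
```

For example, an agent that succeeds but takes double the step budget would keep only half its score under this weighting, making the success-vs-efficiency trade-off explicit in a single number.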

Major Benchmarks:

  • AgentBench: 8 environments (OS, database, web, games)
  • SWE-bench: Real GitHub issues from Python repositories
  • WebArena: Realistic web environments (shopping, forums)
  • GAIA: General AI Assistant multi-modal tasks

LLM-as-Judge: Using an LLM to automatically evaluate agent outputs. Flexible but has biases (position, self-preference, verbosity).
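One common mitigation for position bias is to judge each pair twice with the answer order swapped and only accept a consistent verdict. The sketch below assumes a hypothetical `call_llm` callable and prompt format; any real judge would need a concrete model API and a tested rubric.

```python
import re
from typing import Callable

JUDGE_PROMPT = """\
You are grading an agent's answer against a rubric.
Task: {task}
Rubric: {rubric}
Answer A: {a}
Answer B: {b}
Reply with exactly one line: "Winner: A" or "Winner: B".
"""

def judge_pair(task: str, rubric: str, ans1: str, ans2: str,
               call_llm: Callable[[str], str]) -> str:
    """Compare two answers in both orders to reduce position bias.

    Returns "1", "2", or "tie" (when the two orderings disagree).
    """
    def one_round(a: str, b: str):
        reply = call_llm(JUDGE_PROMPT.format(task=task, rubric=rubric, a=a, b=b))
        m = re.search(r"Winner:\s*([AB])", reply)
        return m.group(1) if m else None

    first = one_round(ans1, ans2)    # ans1 in position A
    second = one_round(ans2, ans1)   # swapped: ans2 in position A
    # Map positional verdicts back to the original answers
    v1 = {"A": "1", "B": "2"}.get(first)
    v2 = {"A": "2", "B": "1"}.get(second)
    if v1 is not None and v1 == v2:
        return v1                    # consistent verdict
    return "tie"                     # inconsistent => position-sensitive
```

A judge that always prefers position A will return "tie" under this scheme, which turns a silent bias into a detectable signal; it does not, however, address self-preference or verbosity bias.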

Exercise

Design an evaluation framework for a specific agent type:

  1. Define clear success criteria (binary or graded scoring)
  2. Create a representative task suite covering difficulty range
  3. Implement LLM-as-Judge evaluation with structured rubrics
  4. Establish baselines (random, human, prior models)
  5. Compare multiple agent architectures with statistical significance
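Step 5 needs some notion of statistical significance, since per-task success is noisy. One lightweight option, shown here as a sketch (bootstrap resampling over hypothetical per-task scores, not a prescribed method), is a confidence interval on the difference in mean success between two agents:

```python
import random

def bootstrap_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """95% bootstrap CI for mean(scores_a) - mean(scores_b).

    scores_a / scores_b: per-task scores (e.g. 0/1 success) for two agents.
    If the interval excludes 0, the difference is unlikely to be noise.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(scores_a) for _ in scores_a]   # resample with replacement
        rb = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

With agent evaluation suites often containing only tens of tasks, intervals like this tend to be wide, which is itself a useful finding: a 5-point gap between two agents on a 20-task suite is rarely conclusive.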

Discussion Questions

  1. How do you prevent benchmark overfitting when developing agents?
  2. When is LLM-as-Judge reliable? What are its failure modes?
  3. What makes a good human baseline for agent evaluation?
  4. How should we balance success rate vs efficiency vs cost?
  5. Are current benchmarks representative of real-world agent use cases?
