AI Agent Evaluation: How to Build Targeted Evals for Superior Performance

In the rapidly evolving landscape of AI, ensuring the reliability and accuracy of AI agents is paramount. Thoughtful evaluation strategies are key to shaping agent behavior and achieving desired outcomes. This article explores a focused approach to building effective evaluations (evals) for deep agents, emphasizing quality over quantity.
The Power of Targeted Evals
Every evaluation acts as a guiding force, influencing the behavior of your AI system. For instance, if an evaluation designed to test efficient file reading fails, adjustments to the system prompt or tool descriptions can be made to improve performance. Over time, these targeted evaluations exert a cumulative effect on the overall system.
However, blindly adding numerous tests can create a false sense of progress. It's more effective to concentrate on targeted evaluations that directly reflect the behaviors you want to see in a production environment. Remember, more evals don't necessarily equate to better agents. Prioritize quality and relevance.
Curating Data for Effective Evals
The process of curating data for evaluations involves several key methods:
- Leveraging Internal Feedback: Utilize insights gained from internal use of your agents.
- Adapting External Benchmarks: Select and modify evaluations from established benchmarks to suit your specific agent.
- Crafting Custom Evals: Develop unique unit tests tailored to behaviors that are crucial for your agent's performance.
Dogfooding your agents and carefully analyzing traces are invaluable for identifying areas where evaluations are needed. By tracing every interaction, mistakes become opportunities to create new evaluations and refine agent definitions. For example, teams using open-source coding agents often encounter diverse codebases, which can lead to errors. Tracing these interactions allows for the creation of evals that prevent recurrence.
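The trace-to-eval loop above can be sketched in a few lines. This is a minimal illustration, not a real harness: the trace record, its fields, and the tag names are all hypothetical stand-ins for whatever your logging actually captures.

```python
# Hypothetical trace record captured while dogfooding an agent. The field
# names here are illustrative, not from any particular tracing library.
failing_trace = {
    "task": "Summarize src/legacy/parser.py",
    "error": "agent read the file in many small chunks instead of once",
    "expected_tool_calls": 1,
}

def trace_to_eval(trace):
    """Turn a logged failure into a regression eval: replay the same task,
    with an efficiency bound derived from the mistake we observed."""
    return {
        "prompt": trace["task"],
        "max_tool_calls": trace["expected_tool_calls"],
        "tags": ["file_operations", "regression"],
    }

new_eval = trace_to_eval(failing_trace)
```

Each production mistake becomes one more case in the suite, so the same failure cannot silently reappear after a prompt or tool-description change.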
Grouping Evals by Function
Organizing evaluations by what they test, rather than their origin, provides a clearer understanding of agent performance. This approach allows for a more nuanced view, avoiding reliance on a single overall score. For example, tasks from different sources can be grouped based on whether they measure retrieval or tool use. This categorization provides a more insightful perspective on agent capabilities.
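Grouping by capability rather than origin is easy to implement once each eval result carries a group label. A minimal sketch, assuming each result is a small dict (the tasks and group names below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical eval results: each record carries the capability it tests
# ("retrieval", "tool_use", ...) regardless of which benchmark it came from.
results = [
    {"task": "find_doc", "group": "retrieval", "passed": True},
    {"task": "cite_source", "group": "retrieval", "passed": False},
    {"task": "read_file", "group": "tool_use", "passed": True},
    {"task": "call_api", "group": "tool_use", "passed": True},
]

def scores_by_group(results):
    """Aggregate pass rates per capability group instead of one overall score."""
    totals = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for r in results:
        totals[r["group"]][0] += int(r["passed"])
        totals[r["group"]][1] += 1
    return {group: passed / total for group, (passed, total) in totals.items()}

print(scores_by_group(results))  # -> {'retrieval': 0.5, 'tool_use': 1.0}
```

A single 75% overall score would hide the fact that this agent handles tool use well but struggles with retrieval; the per-group view makes that gap visible.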
Defining Meaningful Metrics
When selecting a model for your agent, correctness should be the primary consideration. If a model cannot reliably complete the required tasks, other factors become irrelevant. Run multiple models on your evaluations and refine the harness to address any issues that arise. Once several models meet the correctness threshold, focus on efficiency.
Efficiency encompasses factors such as the number of turns taken, unnecessary tool calls, and overall task completion speed. These differences can significantly impact latency, cost, and user experience in a production environment.
Key Metrics for Evaluation
- Solve Rate: Measures how often an agent completes a task correctly.
- Efficiency Score: Measures how quickly an agent solves a task, normalized by the expected number of steps.
These metrics provide a straightforward way to compare models:
- Assess correctness: Identify models that are sufficiently accurate for your specific tasks.
- Compare efficiency: Among the accurate models, determine which offers the best balance of correctness, latency, and cost.
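This two-stage comparison (correctness first, efficiency among the survivors) can be sketched as a simple filter-then-rank. The model names, scores, and threshold below are illustrative assumptions, not real benchmark numbers:

```python
# Hypothetical per-model eval summaries; names and numbers are invented.
model_results = [
    {"model": "model-a", "solve_rate": 0.92, "avg_steps": 6.0, "cost_per_task": 0.04},
    {"model": "model-b", "solve_rate": 0.95, "avg_steps": 4.5, "cost_per_task": 0.07},
    {"model": "model-c", "solve_rate": 0.71, "avg_steps": 3.8, "cost_per_task": 0.02},
]

def pick_model(results, min_solve_rate=0.9):
    """Stage 1: keep only models above the correctness bar.
    Stage 2: among those, prefer the fewest steps as an efficiency proxy."""
    correct_enough = [r for r in results if r["solve_rate"] >= min_solve_rate]
    if not correct_enough:
        return None
    return min(correct_enough, key=lambda r: r["avg_steps"])["model"]
```

Note that model-c is the cheapest and fastest here, but it never enters the efficiency comparison because it fails the correctness gate, which is exactly the ordering the metrics are meant to enforce.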
Ideal Trajectories: A Reference Point
To gain deeper insights into model performance, establish an ideal trajectory – a sequence of steps that achieves a correct outcome with minimal unnecessary actions. For simple tasks, the optimal path is often evident. For more complex tasks, approximate a trajectory using the best-performing model and refine it as models and harnesses improve. This approach helps refine your understanding of ideal agent behavior.
For example, consider the request: "What is the current time and weather where I live?" An ideal trajectory might involve resolving the user's location, fetching the time and weather, and providing the final answer without unnecessary intermediate steps. This could involve 4 steps, 4 tool calls, and ~8 seconds. A less efficient trajectory might involve 6 steps, 5 tool calls, and ~14 seconds. Both are correct, but the second run increases latency and cost.
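Once an ideal trajectory is written down, each correct run can be scored against it. Here is a minimal sketch using the numbers from the example above; the `Run` record is a hypothetical shape for whatever your harness logs:

```python
from dataclasses import dataclass

@dataclass
class Run:
    steps: int
    tool_calls: int
    latency_s: float
    correct: bool

# Ideal trajectory for "What is the current time and weather where I live?"
ideal = Run(steps=4, tool_calls=4, latency_s=8.0, correct=True)

def trajectory_overhead(run, ideal):
    """Report how much a correct run exceeds the ideal trajectory.
    Returns None for incorrect runs, which fail the correctness gate anyway."""
    if not run.correct:
        return None
    return {
        "extra_steps": run.steps - ideal.steps,
        "extra_tool_calls": run.tool_calls - ideal.tool_calls,
        "latency_ratio": run.latency_s / ideal.latency_s,
    }

# The less efficient run from the example: 6 steps, 5 tool calls, ~14 s.
slow_run = Run(steps=6, tool_calls=5, latency_s=14.0, correct=True)
overhead = trajectory_overhead(slow_run, ideal)
```

Both runs pass a pure correctness check, but the overhead report quantifies the gap: two extra steps, one extra tool call, and 1.75x the latency.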
Running Evals Effectively
Utilize tools like pytest with GitHub Actions to run evaluations in a clean, reproducible environment. Each evaluation should create an agent instance, provide a task, and compute correctness and efficiency metrics. To save costs and focus on specific experiments, run subsets of evaluations using tags. For example, if you're building an agent that heavily relies on local file processing, concentrate on evaluations tagged with "file_operations" and "tool_use".
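With pytest, tags map naturally onto markers. The sketch below uses stub `make_agent` and `run_task` helpers so it stands alone; in a real suite those would be replaced by your own agent factory and harness:

```python
# test_agent_evals.py -- a minimal sketch of tag-based eval selection.
# `make_agent` and `run_task` are stubs standing in for your real harness.
import pytest
from types import SimpleNamespace

def make_agent():
    # Stub: replace with your real agent constructor.
    return object()

def run_task(agent, prompt):
    # Stub: replace with a call into your harness; here it fakes a result.
    return SimpleNamespace(correct=True, tool_calls=2, latency_s=1.0)

@pytest.mark.file_operations
@pytest.mark.tool_use
def test_reads_local_file_efficiently():
    result = run_task(make_agent(), "Read settings.toml and report the log level")
    assert result.correct
    assert result.tool_calls <= 3  # efficiency budget for this task

@pytest.mark.retrieval
def test_finds_relevant_document():
    result = run_task(make_agent(), "Which doc defines the retry policy?")
    assert result.correct
```

A focused run then selects by tag, e.g. `pytest -m "file_operations or tool_use"`, so a file-heavy agent's suite skips unrelated (and costly) retrieval evals. Registering the markers in `pytest.ini` keeps pytest from warning about unknown marks.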
Conclusion
By focusing on targeted evaluations, defining meaningful metrics, and establishing ideal trajectories, you can significantly improve the performance and reliability of your AI agents. Remember, quality over quantity is key to building effective evaluations that drive desired behaviors and optimize your AI systems.