
Eval datasets: stop testing your agents on the happy path

If your eval set is the demos you showed the client, you are testing the wrong thing. How we build evals from production failures and the minimum viable suite to ship.

Jigar Joshi, Agentic AI Architect and Consultant

Most agent eval sets I audit are the demo scenarios that worked when the system was first built. They prove the agent can handle what it was designed for. They do not prove it can handle what users actually send, which is the only thing that matters in production. A green eval run against the happy path is a comfortable lie.

Building evals from production failures

Every production failure is a free eval case. Capture the user input, the agent's incorrect output, and the correct output, which is the part you have to write. Add it to the eval set and run the suite on every deploy. The set grows, and growing is the point. Your evals should reflect what users actually do, not what you imagined they would do. This is the natural feed for the eval suite in the agent observability stack we ship: the traces show you the failures, and each failure becomes a permanent test.
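
A minimal sketch of that capture step, assuming the eval set lives in a JSONL file. The `EvalCase` fields, the `add_case` helper, and the file layout are illustrative names, not the format of any particular observability tool:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class EvalCase:
    case_id: str           # hypothetical: e.g. the trace ID of the failure
    category: str          # e.g. "production_failure"
    user_input: str        # what the user actually sent
    bad_output: str        # what the agent produced in production
    expected_output: str   # the correct answer, written by a human


def add_case(path: Path, case: EvalCase) -> bool:
    """Append the case to the JSONL file unless its case_id is already there."""
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["case_id"] for line in f if line.strip()}
    if case.case_id in existing:
        return False
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")
    return True


def load_cases(path: Path) -> list[EvalCase]:
    """Read every stored case back, for the suite that runs on each deploy."""
    with path.open(encoding="utf-8") as f:
        return [EvalCase(**json.loads(line)) for line in f if line.strip()]
```

Keying on a stable ID keeps the same failure from being added twice when it shows up in several traces. The file only ever grows, which matches the intent above.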

The minimum viable eval suite

  • 20+ ambiguous requests, to see what the model does at the edge of categories.
  • 10+ requests that should be refused or escalated to a human rather than answered.
  • 10+ adversarial and prompt-injection attempts.
  • 20+ "correct on the first run last week" cases for regression detection.
  • 20+ drawn from real production failures, which are your specific weak spots.

Notice how much of this is the unhappy path. The ambiguous and refuse-or-escalate cases connect directly to the routing trade-offs in "Haiku 4.5 made our router 5x cheaper", and the budget-exhaustion behaviour from "Why your agent keeps failing after 3 steps" deserves its own eval cases too.
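
The category minimums in the list above can be checked mechanically before a suite is treated as shippable. This is a sketch only: the category keys are labels I made up for illustration, and the counts are the ones from the list:

```python
from collections import Counter

# Minimum case counts per category, taken from the list above.
# The key names are hypothetical labels, not a standard taxonomy.
MINIMUMS = {
    "ambiguous": 20,
    "refuse_or_escalate": 10,
    "adversarial": 10,
    "regression": 20,
    "production_failure": 20,
}


def coverage_gaps(categories: list[str]) -> dict[str, int]:
    """Return how many cases each category is short of its minimum."""
    counts = Counter(categories)
    return {
        name: minimum - counts[name]
        for name, minimum in MINIMUMS.items()
        if counts[name] < minimum
    }
```

An empty result means every category meets its floor. Anything else names the category to fill next and how many cases it still needs.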

Why happy-path evals mislead

| Eval source         | What it proves              | What it misses                 |
|---------------------|-----------------------------|--------------------------------|
| Launch demos        | The designed path works     | Everything users actually send |
| Ambiguous cases     | Behaviour at category edges | Nothing, this is the point     |
| Refuse / escalate   | It knows its limits         | Overconfident wrong answers    |
| Adversarial         | Resistance to injection     | Silent compliance with attacks |
| Production failures | Real weak spots             | Only what has happened so far  |

Score per category, not in aggregate

Track pass and fail per category, never as a single number. A 90% overall score that is 100% on easy and 60% on hard is worse than a flat 80%, because the aggregate hides exactly the failures that hurt. Regression by category is where you spot a specific change degrading a specific behaviour, for example a prompt edit that quietly broke refusals while improving the easy cases.
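
To make the arithmetic concrete, here is a small sketch of per-category scoring. It assumes each result is a `(category, passed)` pair, and the 75 easy / 25 hard split is an assumption chosen so the numbers match the 90% example above:

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per category from (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.0) -> dict[str, tuple[float, float]]:
    """Categories whose pass rate dropped by more than `tolerance`."""
    return {
        c: (before[c], after[c])
        for c in before
        if c in after and before[c] - after[c] > tolerance
    }


# 75 easy cases all pass; 25 hard cases pass 15 times.
results = [("easy", True)] * 75 + [("hard", True)] * 15 + [("hard", False)] * 10
rates = pass_rates(results)
overall = sum(passed for _, passed in results) / len(results)
```

Here `overall` comes out at 0.9 while `rates["hard"]` is 0.6, which is exactly the gap a single number hides. Comparing `pass_rates` from the previous deploy against the current one with `regressions` flags the category that slipped even when the aggregate went up.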

Common mistakes

  • Freezing the eval set at launch, so it never learns from what production teaches you.
  • Reporting one aggregate score that lets strong easy-case performance mask weak hard-case performance.
  • Skipping adversarial and refusal cases, which are precisely the ones with real-world cost when they fail.
  • Writing the "correct output" loosely, so the eval cannot actually tell pass from fail.
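
On the last point, one way to keep the "correct output" from being loose is to write it as explicit, checkable constraints rather than a prose description. This is a sketch under assumptions: the `Expectation` fields and the `action` labels are hypothetical, and substring checks stand in for whatever grader you actually use:

```python
from dataclasses import dataclass, field


@dataclass
class Expectation:
    action: str                                   # e.g. "answer", "refuse", "escalate"
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def passes(exp: Expectation, action: str, text: str) -> bool:
    """Pass only if the action matches and every text constraint holds."""
    lowered = text.lower()
    return (
        action == exp.action
        and all(s.lower() in lowered for s in exp.must_contain)
        and not any(s.lower() in lowered for s in exp.must_not_contain)
    )
```

For example, `Expectation(action="escalate", must_contain=["human agent"], must_not_contain=["refund issued"])` fails a response that escalates but also claims a refund went through. A vague expected output such as "handles it appropriately" would not catch that.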

An eval set is a living asset, not a launch checkbox. The teams whose agents stay reliable are the ones whose eval suite grows every week from real failures. Standing up that loop is part of how I run consulting engagements and what we practise in training.
