What is wrong with testing agents on the happy path?

Happy-path evals only prove the agent handles the scenarios it was designed for, which it almost always does. They say nothing about ambiguous, adversarial, or out-of-scope requests, which is where production agents actually fail. A green happy-path run gives a false sense of safety.

How do I build evals from production failures?

For every failure, capture three things: the user input, the agent's incorrect output, and the correct output you write by hand. Add that case to the suite and run it on every deploy. Over time the set comes to represent what users really send.

What should a minimum viable eval suite contain?

A deliberate mix of unhappy paths: 20+ ambiguous requests, 10+ that should be refused or escalated, 10+ adversarial or prompt-injection attempts, 20+ known-good cases for regression detection, and 20+ drawn from real production failures.

Why score per category instead of one overall number?

Because an aggregate hides the failures that matter. A 90% score can be 100% on easy cases and 60% on hard ones, which is worse than a flat 80%. Per-category scoring shows you exactly which behaviour a change degraded.

Should the eval set ever stop growing?

No. A frozen eval set stops reflecting reality the moment users do something new. The suite should grow continuously from production failures; that growth is what keeps an agent reliable over time.

Stop Testing Your AI Agents on the Happy Path

Most agent eval sets I audit are the demo scenarios that worked when the system was first built. They prove the agent can handle what it was designed for. They do not prove it can handle what users actually send, which is the only thing that matters in production. A green eval run against the happy path is a comfortable lie.

Building evals from production failures

Every production failure is a free eval case. Capture the user input, the agent's incorrect output, and the correct output, which is the part you have to write. Add it to the eval set and run the suite on every deploy. The set grows, and growing is the point. Your evals should reflect what users actually do, not what you imagined they would do. This is the natural feed for the eval suite in the agent observability stack we ship: the traces show you the failures, and each failure becomes a permanent test.
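A minimal sketch of that capture step, assuming the suite lives in a JSON Lines file. The EvalCase fields mirror the three things named above; the class name, the category field, and the file layout are my assumptions for illustration, not a description of any particular tool.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class EvalCase:
    """One captured production failure (hypothetical schema)."""
    category: str          # assumed label, e.g. "production_failure"
    user_input: str        # what the user actually sent
    incorrect_output: str  # what the agent produced
    correct_output: str    # the expected answer, written by hand


def append_case(path: Path, case: EvalCase) -> None:
    """Append one case as a JSON line, so the suite only ever grows."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case)) + "\n")


def load_cases(path: Path) -> list[EvalCase]:
    """Read every stored case back for the deploy-time eval run."""
    with path.open(encoding="utf-8") as f:
        return [EvalCase(**json.loads(line)) for line in f if line.strip()]
```

Appending rather than rewriting the file keeps old cases in place, so each failure stays in the suite as a permanent test.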

The minimum viable eval suite

20+ ambiguous requests, to see what the model does at the edge of categories.
10+ requests that should be refused or escalated to a human rather than answered.
10+ adversarial and prompt-injection attempts.
20+ "correct on the first run last week" cases for regression detection.
20+ drawn from real production failures, which are your specific weak spots.

Notice how much of this is the unhappy path. The ambiguous and refuse-or-escalate cases connect directly to the routing trade-offs in "Haiku 4.5 made our router 5x cheaper", and the budget-exhaustion behaviour from "Why your agent keeps failing after 3 steps" deserves its own eval cases too.
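The minimums in the list above are easy to enforce mechanically. Here is a rough sketch of a coverage check, assuming each case carries a category label; the label strings are hypothetical names I picked for the five groups, and only the counts come from the list.

```python
from collections import Counter

# Minimum case counts per category, taken from the list above.
# The category keys are hypothetical labels, not a standard.
MINIMUMS = {
    "ambiguous": 20,
    "refuse_or_escalate": 10,
    "adversarial": 10,
    "regression": 20,
    "production_failure": 20,
}


def coverage_gaps(categories: list[str]) -> dict[str, int]:
    """Return how many more cases each under-filled category needs."""
    counts = Counter(categories)
    return {
        name: minimum - counts[name]
        for name, minimum in MINIMUMS.items()
        if counts[name] < minimum
    }
```

An empty result means every category meets its floor. A suite with only 4 adversarial cases would report {"adversarial": 6}.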

Why happy-path evals mislead

Eval source	What it proves	What it misses
Launch demos	The designed path works	Everything users actually send
Ambiguous cases	Behaviour at category edges	Nothing, this is the point
Refuse / escalate	It knows its limits	Overconfident wrong answers
Adversarial	Resistance to injection	Silent compliance with attacks
Production failures	Real weak spots	Only what has happened so far

Score per category, not in aggregate

Track pass and fail per category, never as a single number. A 90% overall score that is 100% on easy cases and 60% on hard ones is worse than a flat 80%, because the aggregate hides exactly the failures that hurt. Regression by category is where you spot a specific change degrading a specific behaviour, for example a prompt edit that quietly broke refusals while improving the easy cases.
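To make the arithmetic concrete, here is a small sketch of per-category scoring. It assumes each eval result is a (category, passed) pair; the function names and the result shape are illustrative assumptions, not an existing API.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per category from (category, passed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def overall_rate(results: list[tuple[str, bool]]) -> float:
    """The single aggregate number that hides per-category failures."""
    return sum(passed for _, passed in results) / len(results)


def regressed(before: dict[str, float], after: dict[str, float]) -> list[str]:
    """Categories whose pass rate dropped between two runs."""
    return sorted(c for c in before if c in after and after[c] < before[c])
```

Take a suite of 75 easy cases that all pass and 25 hard cases of which 15 pass (the 75/25 split is my assumption, chosen to match the numbers above). overall_rate reports 90%, while pass_rates reports 100% on easy and 60% on hard. Comparing two runs with regressed is what surfaces a prompt edit that lowered refusals while the aggregate held steady.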

Common mistakes

Freezing the eval set at launch, so it never learns from what production teaches you.
Reporting one aggregate score that lets strong easy-case performance mask weak hard-case performance.
Skipping adversarial and refusal cases, which are precisely the ones with real-world cost when they fail.
Writing the "correct output" loosely, so the eval cannot actually tell pass from fail.

An eval set is a living asset, not a launch checkbox. The teams whose agents stay reliable are the ones whose eval suite grows every week from real failures. Standing up that loop is part of how I run consulting engagements and what we practise in training.
