Production Published 8 min

The agent observability stack we ship to every client

Traces, spans, evals, cost-per-completed-task, and the one dashboard panel that catches 80% of regressions. Vendor-agnostic; covers Langfuse, Honeycomb, and rolling your own.

Jigar Joshi, Agentic AI Architect and Consultant

Every client engagement we run ships an observability layer with the agent. Skipping this is the single most common reason I see production agents quietly degrade over time. Nobody can see them degrade, so a tool that started failing last Tuesday keeps failing until a customer complains. Observability is what turns that silent decay into a dashboard line you can act on.

The stack

  • Per-step traces with input, output, model, tool calls, and timing, so any single request can be replayed.
  • Cost attribution per request and per route, the foundation for the savings work in the cost posts.
  • A cache-hit rate panel, kept separate from usage metrics so a broken cache cannot hide behind healthy token counts.
  • An eval suite that runs on every deploy plus a nightly canary, so regressions surface before and after release.
  • A completion-rate gauge: of N user requests today, what fraction reached a terminal state?
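
Most of these signals can be derived from one per-step record, as long as it is captured on every request. The following is a minimal sketch rather than any vendor's schema: the `StepTrace` and `RequestTrace` names, their fields, and the convention that a missing `terminal_state` means the request never finished are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepTrace:
    """One agent step. Field names are hypothetical, not a vendor schema."""
    request_id: str
    step_index: int
    model: str
    input_text: str
    output_text: str
    tool_calls: list[str] = field(default_factory=list)
    duration_ms: float = 0.0
    cached_input_tokens: int = 0
    total_input_tokens: int = 0


@dataclass
class RequestTrace:
    """All steps for one user request, plus how it ended."""
    request_id: str
    route: str
    steps: list[StepTrace] = field(default_factory=list)
    # Assumption: None means the request never reached a terminal state.
    terminal_state: Optional[str] = None


def completion_rate(requests: list[RequestTrace]) -> float:
    """Fraction of requests that reached a terminal state."""
    if not requests:
        return 0.0
    done = sum(1 for r in requests if r.terminal_state is not None)
    return done / len(requests)


def cache_hit_rate(requests: list[RequestTrace]) -> float:
    """Cached input tokens as a share of all input tokens."""
    steps = [s for r in requests for s in r.steps]
    total = sum(s.total_input_tokens for s in steps)
    cached = sum(s.cached_input_tokens for s in steps)
    return cached / total if total else 0.0
```

Keeping `cache_hit_rate` as its own function mirrors the point in the list: it is computed and displayed separately, so a cache that stops hitting shows up even when total token counts look normal.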

The one panel that catches most regressions

Step-count distribution: a histogram of how many agent steps each request took. When this distribution shifts right or grows a long tail, something is wrong, usually a tool starting to fail, a prompt change degrading first-pass accuracy, or a model update that needs re-tuning. I have caught more regressions on this single panel than on every other piece of telemetry combined.

It works because step count is a leading indicator. Before accuracy visibly drops, before costs spike, the agent starts taking more steps to reach the same answer, retrying tools and re-planning. The long tail of that histogram is where the agents that fail after three steps live, and watching the distribution lets you find them as a population rather than one support ticket at a time.
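
One way to turn "shifts right or grows a long tail" into an alert is to compare today's step counts against a baseline window. This is a sketch under stated assumptions: the function names, the choice of the baseline 95th percentile as the tail threshold, and the 5-point `max_tail_growth` default are illustrative, not tuned values from any client stack.

```python
import math
from collections import Counter


def step_histogram(step_counts: list[int]) -> dict[int, int]:
    """Map each step count to the number of requests that took that many steps."""
    return dict(sorted(Counter(step_counts).items()))


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_fraction(values: list[int], threshold: int) -> float:
    """Share of requests that took more than `threshold` steps."""
    return sum(1 for v in values if v > threshold) / len(values)


def has_shifted(
    baseline: list[int],
    current: list[int],
    max_tail_growth: float = 0.05,  # hypothetical default; tune per route
) -> bool:
    """True if the median moved right or the long tail grew noticeably."""
    threshold = percentile(baseline, 95)
    median_moved = percentile(current, 50) > percentile(baseline, 50)
    tail_growth = tail_fraction(current, threshold) - tail_fraction(baseline, threshold)
    return median_moved or tail_growth > max_tail_growth
```

Checking the median and the tail separately matters because the two failure shapes differ: a degraded prompt tends to move the whole distribution, while a flaky tool may only fatten the tail for the requests that call it.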

Five signals and what each one catches
Signal               | Catches                           | Why it earns its place
Per-step traces      | Where a single request went wrong | Replay beats guessing
Cost per route       | Where the spend concentrates      | Targets optimisation work
Cache-hit panel      | Silently broken caching           | Usage metrics hide it
Eval suite + canary  | Regressions at and after deploy   | Pre and post release coverage
Step-count histogram | Most early degradation            | Leading indicator of trouble

Measure cost per completed task, not per token

Cost attribution is only useful if you divide by outcomes. A route can look cheap per token while being expensive per completed task because it retries constantly. Pairing cost-per-route with the completion-rate gauge gives you cost per completed task, the metric I argue for in "Haiku 4.5 made our router 5x cheaper" and "the cheapest LLM call is the one you do not make". The cache-hit panel ties straight back to "prompt caching is not optional anymore".
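
The arithmetic is simple enough to sketch. The `RouteStats` type and every figure below are invented for illustration (they are not measurements from any engagement); the point is only that the per-token ranking and the per-completed-task ranking can disagree.

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    """Aggregated spend and outcomes for one route (hypothetical shape)."""
    route: str
    total_tokens: int
    total_cost_usd: float
    completed_tasks: int


def cost_per_million_tokens(stats: RouteStats) -> float:
    return stats.total_cost_usd / stats.total_tokens * 1_000_000


def cost_per_completed_task(stats: RouteStats) -> float:
    """Total spend divided by outcomes; infinite if nothing completed."""
    if stats.completed_tasks == 0:
        return float("inf")
    return stats.total_cost_usd / stats.completed_tasks


# Made-up numbers: the retry-heavy route burns more tokens per finished task.
retry_heavy = RouteStats("retry-heavy", 10_000_000, 10.0, 400)
first_pass = RouteStats("first-pass", 3_000_000, 9.0, 900)

for stats in (retry_heavy, first_pass):
    print(
        stats.route,
        f"${cost_per_million_tokens(stats):.2f}/M tokens",
        f"${cost_per_completed_task(stats):.4f}/task",
    )
```

In this made-up case the retry-heavy route is three times cheaper per token and two and a half times more expensive per completed task, which is the inversion a per-token dashboard would hide.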

Vendor-agnostic

Langfuse if you want a managed product. Honeycomb if you already have it for your services. OpenTelemetry directly if you want to roll your own and keep options open. The stack matters less than the discipline of looking at it. Pick the tool your team will actually open every morning.

Whatever you choose, ship it on day one rather than bolting it on after the first incident. Standing this up is part of every consulting engagement I run and a module in our training, because an agent you cannot observe is an agent you cannot keep healthy.
