Topic Pillar

AI Observability. You cannot fix what you cannot see.

Observability is the unsexy half of production AI that decides whether you keep your job. Per-step traces, score-bearing evals, cache-hit telemetry, tool-selection scores — these are the primitives that let you debug agents instead of guessing at them.

12 cluster pages · 7 posts · 5 updates

The four telemetry layers

Input/output traces per step (Langfuse-shaped). Eval scoring on representative inputs, not just happy paths. Cache-hit rate per route — the new throughput metric. Tool-selection score deltas from the model boundary. Without all four, multi-agent debugging is folklore.
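The four layers can be carried on a single per-step record. A minimal sketch — every field name here is illustrative, not a real Langfuse schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StepRecord:
    """One agent step carrying all four telemetry layers (illustrative schema)."""
    step_id: str
    input_text: str                               # layer 1: input/output trace
    output_text: str
    eval_score: Optional[float] = None            # layer 2: score on a representative eval
    cache_hit: bool = False                       # layer 3: served from cache?
    tool_selection_score: Optional[float] = None  # layer 4: model-boundary selection score

def cache_hit_rate(steps: list[StepRecord]) -> float:
    """Layer-3 rollup: fraction of steps served from cache."""
    return sum(s.cache_hit for s in steps) / len(steps) if steps else 0.0

steps = [
    StepRecord("s1", "q1", "a1", eval_score=0.9, cache_hit=True, tool_selection_score=0.8),
    StepRecord("s2", "q2", "a2", eval_score=0.7, cache_hit=False, tool_selection_score=0.6),
]
print(cache_hit_rate(steps))  # 0.5
```

If a layer is missing on a step, you want the field present and `None`, not absent — the gaps themselves are a signal.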

Eval datasets that surface real bugs

A 100-case eval set that scores 95% means nothing if all 100 cases are the happy path. The cases that matter are the ambiguous ones, the conflicting ones, the malformed ones. Build the eval set adversarially — ask "what would a real user actually paste in?" — and most agents drop 15–25 points.
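One way to surface that gap is to score happy-path and adversarial cases as separate buckets rather than one blended number. A sketch with invented results:

```python
from collections import defaultdict

def score_by_bucket(results):
    """results: list of (bucket, passed) pairs; returns pass rate per bucket."""
    totals, passes = defaultdict(int), defaultdict(int)
    for bucket, passed in results:
        totals[bucket] += 1
        passes[bucket] += passed
    return {b: passes[b] / totals[b] for b in totals}

# Invented outcomes: the agent aces clean inputs, struggles on adversarial ones.
results = (
    [("happy", True)] * 19 + [("happy", False)] * 1 +
    [("adversarial", True)] * 7 + [("adversarial", False)] * 3
)
rates = score_by_bucket(results)
print(rates)  # {'happy': 0.95, 'adversarial': 0.7}
```

A blended 86% would hide that 25-point spread; the per-bucket view is what tells you where to add cases.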

Per-agent cost attribution

Tools like Langfuse now expose per-agent cost attribution and step-level cache-hit telemetry. You will discover 30–50% of your agent traffic is repeat queries you should be caching. This is the report nobody wants to read but everyone needs to.
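You can estimate the repeat-query fraction from your own logs before buying anything. A crude sketch — the normalization is deliberately naive and the queries are made up:

```python
from collections import Counter

def repeat_fraction(queries):
    """Fraction of queries that repeat an earlier (normalized) query."""
    norm = [" ".join(q.lower().split()) for q in queries]
    counts = Counter(norm)
    repeats = sum(c - 1 for c in counts.values())  # every copy after the first
    return repeats / len(norm) if norm else 0.0

log = ["reset my password", "Reset my password ", "export Q1 report",
       "reset my password", "export q1 report"]
print(repeat_fraction(log))  # 0.6
```

Anything this coarse measure catches is a lower bound — semantically-equivalent rephrasings push the real cacheable fraction higher.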

7 blog posts

Deep dives on AI Observability

Tool Design

Tool descriptions are prompts. Fix the registry, not the agent.

When an agent picks the wrong tool, the registry is broken — not the agent. Three rules I now apply before debugging anything in a multi-tool system: precise names, "when to use" triggers, and a curated load list. Anthropic's new tool-selection telemetry finally puts numbers on what changes accuracy.

May 13, 2026 · 5 min
Read the post
Production

The cheapest LLM call is the one you do not make — GitHub's 19–62% token cut, decoded

GitHub published an instrumented analysis of their agentic CI workflows and reported 19–62% token-cost reductions. The savings are the headline. The technique — pre-agentic data fetching and tool-registry hygiene — is the story most teams will miss.

May 11, 2026 · 5 min
Read the post
Architecture

Claude Opus 4.7's 1M context: when to RAG and when to just stuff it

A reliable million-token context window is real now, but it does not retire RAG — it changes the calculus. Cost, latency, recency, and the prompt-cache angle nobody is talking about.

May 6, 2026 · 6 min
Read the post
Production

Prompt caching is not optional anymore — measuring a 47% cost drop

A walkthrough from a client engagement: identifying stable prefixes, restructuring the system prompt for cacheability, and the telemetry that proved caching was actually working.

Apr 19, 2026 · 4 min
Read the post
Production

The agent observability stack we ship to every client

Traces, spans, evals, cost-per-completed-task, and the one dashboard panel that catches 80% of regressions. Vendor-agnostic — covers Langfuse, Honeycomb, and rolling your own.

Mar 28, 2026 · 7 min
Read the post
Production

Eval datasets: stop testing your agents on the happy path

If your eval set is the demos you showed the client, you are testing the wrong thing. How we build evals from production failures and the minimum viable suite to ship.

Jan 19, 2026 · 6 min
Read the post
Architecture

RAG vs CAG: how to actually decide

A decision framework from real implementations. RAG retrieves at query time; CAG (cache-augmented generation) preloads knowledge into the context cache. Knowing which to use — and when to combine both — determines whether your agent finds the right answer at the right cost.

Sep 21, 2025 · 5 min
Read the post
5 ship-news updates

Latest in AI Observability

Claude

Anthropic ships tool-use telemetry — every selection is scored and logged at the model boundary

May 13, 2026 · via Anthropic
Claude

Claude Opus 4.7 ships with 1M-token context window in production

May 7, 2026 · via Anthropic
Tools

Cursor 1.0 stabilises background agents and ships a review-and-merge workflow

Apr 18, 2026 · via Cursor
OpenAI

OpenAI Agent Builder GA — pricing finally competitive for enterprise tool use

Apr 12, 2026 · via OpenAI
Tools

Langfuse adds per-agent cost attribution and step-level cache-hit telemetry

Apr 10, 2026 · via Langfuse
Frequently asked

AI Observability — the questions teams actually ask