AI Observability
You cannot fix what you cannot see.
Observability is the unsexy half of production AI that decides whether you keep your job. Per-step traces, score-bearing evals, cache-hit telemetry, tool-selection scores — these are the primitives that let you debug agents instead of guessing at them.
The four telemetry layers
Input/output traces per step (Langfuse-shaped). Eval scoring on representative inputs, not just happy paths. Cache-hit rate per route — the new throughput metric. Tool-selection score deltas from the model boundary. Without all four, multi-agent debugging is folklore.
Eval datasets that surface real bugs
A 100-case eval set that scores 95% means nothing if all 100 cases are the happy path. The cases that matter are the ambiguous ones, the conflicting ones, the malformed ones. Build the eval set adversarially — ask "what would a real user actually paste in?" — and most agents drop 15–25 points.
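A per-category scorer makes that drop visible: tag each case as happy, ambiguous, conflicting, or malformed, and report accuracy per tag so happy-path wins cannot mask adversarial losses. A minimal sketch — `agent_fn` and the case shape are stand-ins, not a real eval framework:

```python
def run_eval(agent_fn, cases):
    """Score an eval set per category.

    Each case is {"category": ..., "input": ..., "expected": ...};
    returns {category: accuracy} so adversarial categories report separately.
    """
    results = {}  # category -> (passes, total)
    for case in cases:
        cat = case["category"]
        passed = agent_fn(case["input"]) == case["expected"]
        hits, total = results.get(cat, (0, 0))
        results[cat] = (hits + passed, total + 1)
    return {cat: hits / total for cat, (hits, total) in results.items()}
```

An aggregate score over all categories would hide exactly the regression this is built to catch; keep the breakdown.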
Per-agent cost attribution
Tools like Langfuse now expose per-agent cost attribution and step-level cache-hit telemetry. You will discover 30–50% of your agent traffic is repeat queries you should be caching. This is the report nobody wants to read but everyone needs to.
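You can estimate that repeat-traffic share from your own trace logs before buying anything. A rough sketch, assuming exact-match normalization (real query logs usually need fuzzier dedup, so treat the result as an upper bound on what caching could absorb):

```python
from collections import Counter

def cacheable_share(queries: list[str]) -> float:
    """Fraction of traffic that repeats an earlier (normalized) query —
    i.e., requests a response cache could have served."""
    counts = Counter(q.strip().lower() for q in queries)
    repeats = sum(n - 1 for n in counts.values())  # all but the first of each
    return repeats / len(queries) if queries else 0.0
```

If this number lands anywhere near the 30–50% range, the cache pays for itself before the first dashboard exists.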
Deep dives on AI Observability
Tool descriptions are prompts. Fix the registry, not the agent.
When an agent picks the wrong tool, the registry is broken — not the agent. Three rules I now apply before debugging anything in a multi-tool system: precise names, "when to use" triggers, and a curated load list. Anthropic's new tool-selection telemetry finally puts numbers on what changes accuracy.
The cheapest LLM call is the one you do not make — GitHub's 19–62% token cut, decoded
GitHub published an instrumented analysis of their agentic CI workflows and reported 19–62% token-cost reductions. The savings are the headline. The technique — pre-agentic data fetching and tool-registry hygiene — is the story most teams will miss.
Claude Opus 4.7's 1M context: when to RAG and when to just stuff it
A million tokens reliably is real now, but it does not retire RAG — it changes the calculus. Cost, latency, recency, and the prompt-cache angle nobody is talking about.
Prompt caching is not optional anymore — measuring a 47% cost drop
A walkthrough from a client engagement: identifying stable prefixes, restructuring the system prompt for cacheability, and the telemetry that proved caching was actually working.
The agent observability stack we ship to every client
Traces, spans, evals, cost-per-completed-task, and the one dashboard panel that catches 80% of regressions. Vendor-agnostic — covers Langfuse, Honeycomb, and rolling your own.
Eval datasets: stop testing your agents on the happy path
If your eval set is the demos you showed the client, you are testing the wrong thing. How we build evals from production failures and the minimum viable suite to ship.
RAG vs CAG: how to actually decide
A decision framework from real implementations. RAG retrieves at query time; CAG preloads knowledge into the prompt cache. Knowing which to use — and when to combine both — determines whether your agent finds the right answer at the right cost.
Latest in AI Observability
Claude Opus 4.7 ships with 1M-token context window in production
Cursor 1.0 stabilises background agents and ships a review-and-merge workflow
OpenAI Agent Builder GA — pricing finally competitive for enterprise tool use
Langfuse adds per-agent cost attribution and step-level cache-hit telemetry
How AI Observability ships in our engagements
The pages below are the buyer-focused, conversion-grade versions of this topic — deliverables, methodology, ROI, security considerations, and CTAs to scope a real engagement.
Agentic AI Consulting
Designed, built, and handed off — production agentic systems for enterprise teams.
Explore the Agentic AI Consulting solution
AI Guardrails
Multi-layer safety, policy, and audit controls for agents in regulated environments.
Explore the AI Guardrails solution
AI Systems Engineering Training
Eight-day corporate training programs that take dev teams from AI-assisted coding to production agentic systems.
Explore the AI Systems Engineering Training solution
Enterprise AI Architecture
Reference architectures for organisations standing up an AI platform — not one agent, but the foundation for many.
Explore the Enterprise AI Architecture solution
AI Observability
Tracing, eval, cache-hit telemetry, and cost attribution for production agents.
Explore the AI Observability solution
Multi-Agent Workflows
Supervisor + handoff orchestration for portfolios of agents that need to cooperate without arguing.
Explore the Multi-Agent Workflows solution
AI Observability — the questions teams actually ask
Train your team on AI Observability
Two tracks — one for developers who build agents, one for business teams who use them. Customised to your stack, hands-on from session 1.
See AI Observability training tracks
Ship your first AI Observability system
Architecture design, production implementation on Claude API and MCP, full observability, and a real handoff. Working agents, not slides.
Explore AI Observability consulting