AI Engineering
Turning AI prototypes into production systems.
AI engineering is the discipline that sits between "the model can do this" and "this runs in production for paying users." It is mostly the unglamorous parts — tool registries, exit conditions, retries, eval suites, observability — that decide whether your agent works on day 30, not just day one.
The patterns I broke in 2025 and what replaced them
Supervisor patterns gave way to handoffs. Inline tool descriptions gave way to externalised, per-tool selection criteria. "Use this when…" beat "this function returns…" every time. JSON mode replaced bespoke parsers. RAG-by-default got replaced by a measure-then-decide framework. These are the moves that separate engineering from improv.
Tool design as a first-class concern
When an agent picks the wrong tool, the registry is broken — not the agent. Three rules: name tools precisely, describe when to use each one, load only what the task needs. Anthropic's tool-use telemetry finally puts numbers on what changes accuracy.
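In code, the three rules look roughly like this. A minimal sketch, assuming an Anthropic-style tools array; the tool names, the trigger wording, and the select_tools_for_task helper are illustrative, not any particular library's API.

```python
# Hypothetical tool registry illustrating the three rules: precise names,
# "when to use" triggers, and loading only what the task needs.

BILLING_TOOLS = [
    {
        "name": "get_invoice_by_id",          # precise: one noun, one action
        "description": (
            "Use this when the user references a specific invoice number and "
            "needs its line items, status, or due date. Do NOT use it for "
            "aggregate revenue questions."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
    {
        "name": "summarise_monthly_revenue",
        "description": (
            "Use this when the user asks about revenue totals or trends for a "
            "calendar month. Not for questions about a single invoice."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"month": {"type": "string", "description": "YYYY-MM"}},
            "required": ["month"],
        },
    },
]

def select_tools_for_task(task_tags: set[str], registry: dict[str, list[dict]]) -> list[dict]:
    """Load only the tool groups a task actually needs, instead of the whole registry."""
    tools = []
    for tag, group in registry.items():
        if tag in task_tags:
            tools.extend(group)
    return tools

# A billing ticket gets billing tools only, not the full 40-tool registry.
tools = select_tools_for_task({"billing"}, {"billing": BILLING_TOOLS, "crm": []})
```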
Cost discipline is design, not optimisation
The cheapest LLM call is the one you do not make. Pre-agentic data fetching, prompt caching, and registry pruning compound into 30–60% cost reductions on real workflows — before you touch a prompt. Cost is not a post-launch concern; it is a design constraint from day one.
Deep dives on AI Engineering
Tool descriptions are prompts. Fix the registry, not the agent.
When an agent picks the wrong tool, the registry is broken — not the agent. Three rules I now apply before debugging anything in a multi-tool system: precise names, "when to use" triggers, and a curated load list. Anthropic's new tool-selection telemetry finally puts numbers on what changes accuracy.
The cheapest LLM call is the one you do not make — GitHub's 19-62% token cut, decoded
GitHub published an instrumented analysis of their agentic CI workflows and reported 19-62% token-cost reductions. The savings are the headline. The technique — pre-agentic data fetching and tool-registry hygiene — is the story most teams will miss.
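The core move is simple to sketch: fetch the deterministic data with plain code before the model is ever invoked, and inject it as context instead of handing the agent a fetch tool it has to decide to call. A hedged illustration using the anthropic Python SDK; the get_failing_jobs helper and the model id are stand-ins for your own CI API and model choice.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def get_failing_jobs(pipeline_id: str) -> str:
    """Placeholder for a deterministic CI API call -- no LLM needed to fetch this."""
    return "job: integration-tests, exit code 1, last 50 log lines: ..."

def triage_pipeline(pipeline_id: str) -> str:
    # Pre-agentic fetch: data the agent will definitely need is gathered up front
    # with plain code, so the model spends no tokens deciding whether to fetch it.
    failing_jobs = get_failing_jobs(pipeline_id)

    response = client.messages.create(
        model="claude-sonnet-4-5",          # illustrative model id
        max_tokens=1024,
        system="You triage CI failures and propose a fix or a retry.",
        messages=[{
            "role": "user",
            "content": f"Pipeline {pipeline_id} is failing.\n\n{failing_jobs}",
        }],
    )
    return response.content[0].text
```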
Claude Opus 4.7's 1M context: when to RAG and when to just stuff it
A million tokens reliably is real now, but it does not retire RAG — it changes the calculus. Cost, latency, recency, and the prompt-cache angle nobody is talking about.
Why I am replacing supervisor patterns with handoffs
Supervisors looked clean on paper and shipped slow in production. Handoffs read messier in the code but recover better when an agent loses the plot. Two real systems and where supervisors still earn their keep.
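A stripped-down sketch of the handoff shape, independent of any framework: each agent either finishes the task or names a peer to hand off to, and a thin router with a hop budget moves the task along. Agent names and return types are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    target: str       # name of the agent to hand off to
    reason: str       # carried forward so the next agent has context

@dataclass
class Done:
    answer: str

def billing_agent(task: str, context: list[str]):
    if "refund" in task.lower():
        return Handoff(target="refunds", reason="Refund requests need the refunds policy agent.")
    return Done(answer=f"Billing answer for: {task}")

def refunds_agent(task: str, context: list[str]):
    return Done(answer=f"Refund processed per policy for: {task}")

AGENTS = {"billing": billing_agent, "refunds": refunds_agent}

def run(task: str, start: str = "billing", max_hops: int = 3) -> str:
    context: list[str] = []
    current = start
    for _ in range(max_hops):                 # hop budget: handoffs also need an exit condition
        result = AGENTS[current](task, context)
        if isinstance(result, Done):
            return result.answer
        context.append(result.reason)
        current = result.target
    return "Escalated to a human: handoff budget exhausted."

print(run("Customer wants a refund for invoice 4821"))
```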
Prompt caching is not optional anymore — measuring a 47% cost drop
A walkthrough from a client engagement: identifying stable prefixes, restructuring the system prompt for cacheability, and the telemetry that proved caching was actually working.
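Roughly what the restructuring looks like with the Anthropic SDK's cache_control breakpoints: the stable prefix is marked cacheable, the per-request content stays outside it, and the usage fields tell you whether the cache is actually being hit. Prompt text and model id are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM_PROMPT = (
    "You are a support triage agent. Follow the escalation policy below...\n"
    # ...several thousand tokens of policy, examples, and schema that do not
    # change between requests -- this is the cacheable prefix.
)

def triage(ticket_text: str):
    response = client.messages.create(
        model="claude-sonnet-4-5",   # illustrative model id
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                # The prefix up to and including this block is eligible for caching.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Volatile, per-ticket content stays outside the cached prefix.
        messages=[{"role": "user", "content": ticket_text}],
    )
    # The telemetry that proves caching is working: reads should dominate
    # writes once the prefix is warm.
    usage = response.usage
    print("cache write tokens:", getattr(usage, "cache_creation_input_tokens", 0))
    print("cache read tokens:", getattr(usage, "cache_read_input_tokens", 0))
    return response
```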
Tool descriptions are prompts. Stop treating them like docstrings
A docstring tells a developer what a function does. A tool description tells a model when to call it. Different audience, different writing. Six concrete edits that lifted tool-call accuracy.
The agent observability stack we ship to every client
Traces, spans, evals, cost-per-completed-task, and the one dashboard panel that catches 80% of regressions. Vendor-agnostic — covers Langfuse, Honeycomb, and rolling your own.
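A vendor-agnostic sketch of that panel's underlying metric: cost per completed task, not cost per call. The pricing constants and the in-memory RUNS sink are illustrative; in practice the spans feed Langfuse, Honeycomb, or your own store.

```python
import time
from contextlib import contextmanager

PRICE_PER_INPUT_TOKEN = 3e-6      # illustrative pricing, USD
PRICE_PER_OUTPUT_TOKEN = 1.5e-5

RUNS: list[dict] = []             # stand-in for a telemetry backend

@contextmanager
def task_span(task_id: str):
    span = {"task_id": task_id, "start": time.time(),
            "input_tokens": 0, "output_tokens": 0, "completed": False}
    try:
        yield span                # agent code updates token counts and sets completed=True
    finally:
        span["duration_s"] = time.time() - span["start"]
        span["cost_usd"] = (span["input_tokens"] * PRICE_PER_INPUT_TOKEN
                            + span["output_tokens"] * PRICE_PER_OUTPUT_TOKEN)
        RUNS.append(span)

def cost_per_completed_task() -> float:
    completed = [r for r in RUNS if r["completed"]]
    total_cost = sum(r["cost_usd"] for r in RUNS)   # failed runs still cost money
    return total_cost / len(completed) if completed else float("inf")
```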
Three patterns I broke in 2025 — and what I do instead now
Self-correction loops without budgets, single-agent solutions to multi-domain problems, and using JSON mode to force structure I should have built into the schema. An honest review.
Eval datasets: stop testing your agents on the happy path
If your eval set is the demos you showed the client, you are testing the wrong thing. How we build evals from production failures and the minimum viable suite to ship.
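A minimum viable harness, sketched: each case is a captured production failure with an assertion on the behaviour that originally went wrong. The agent() callable and the case fields are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    name: str
    prompt: str
    expected_tool: str | None = None        # which tool the agent should call first
    forbidden_phrases: list[str] = field(default_factory=list)

CASES = [
    EvalCase(
        name="prod-wrong-tool",
        prompt="What was our total revenue in February?",
        expected_tool="summarise_monthly_revenue",   # originally called get_invoice_by_id
    ),
    EvalCase(
        name="prod-hallucinated-refund",
        prompt="Did invoice 9912 get refunded?",
        forbidden_phrases=["yes, it was refunded"],  # agent must check, not guess
    ),
]

def run_evals(agent, cases=CASES) -> float:
    passed = 0
    for case in cases:
        # agent is your system, instrumented to report its first tool call and final answer
        first_tool, answer = agent(case.prompt)
        ok = True
        if case.expected_tool and first_tool != case.expected_tool:
            ok = False
        if any(p.lower() in answer.lower() for p in case.forbidden_phrases):
            ok = False
        print(f"{'PASS' if ok else 'FAIL'}  {case.name}")
        passed += ok
    return passed / len(cases)
```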
I was wrong about JSON mode. Here is what changed my mind
For two years I told teams to avoid forced JSON outputs and use structured tool calls. That was right then and partially wrong now — schema enforcement got better, latency penalties got smaller.
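The pattern I now accept, sketched with Pydantic for the schema enforcement: force JSON, validate it against a strict schema, and retry once with the validation error instead of trusting the string. The schema and the call_model() stub are illustrative.

```python
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):
    category: str            # e.g. "billing", "bug", "feature_request"
    severity: int            # 1 (low) to 4 (critical)
    needs_human: bool

def call_model(prompt: str) -> str:
    """Stand-in for an LLM call configured to emit JSON only."""
    return '{"category": "billing", "severity": 2, "needs_human": false}'

def triage(ticket: str, max_attempts: int = 2) -> TicketTriage:
    prompt = f"Classify this ticket as JSON matching the TicketTriage schema:\n{ticket}"
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return TicketTriage.model_validate_json(raw)   # schema enforcement, not string trust
        except ValidationError as err:
            last_error = err
            prompt += f"\nYour previous output failed validation: {err}. Return corrected JSON only."
    raise ValueError(f"Model never produced schema-valid JSON: {last_error}")
```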
Why your agent keeps failing after 3 steps
The exit condition problem nobody talks about. Most agents are built for the happy path — where every tool call succeeds and the task completes cleanly. Real production agents are different.
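A sketch of what explicit exit conditions look like: a step budget, a cap on consecutive tool failures, and a clean escalation branch. The step() function and its return shape are stand-ins for your own agent runtime.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    done: bool
    tool_failed: bool
    output: str

def step(task: str, history: list[str]) -> StepResult:
    """Placeholder for one model turn plus tool execution; replace with a real agent call."""
    return StepResult(done=len(history) >= 2, tool_failed=False, output=f"step {len(history)}")

def run_agent(task: str, max_steps: int = 8, max_consecutive_failures: int = 2) -> str:
    history: list[str] = []
    failures = 0
    for _ in range(max_steps):
        result = step(task, history)
        history.append(result.output)
        if result.done:
            return result.output                  # happy path: task completed
        if result.tool_failed:
            failures += 1
            if failures >= max_consecutive_failures:
                return "ESCALATE: repeated tool failures, handing off to a human."
        else:
            failures = 0                          # reset on any successful step
    return "ESCALATE: step budget exhausted without completion."
```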
The one rule for designing agent tools that actually work
One tool, one purpose. Every tool that does two things will fail you on the third call. I have watched this pattern fail in every team I have trained — and the fix is the same refactor.
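The refactor in miniature, with illustrative names and schemas: one dual-purpose tool split into two single-purpose tools, each with its own "use this when" trigger.

```python
# Before: one tool, two jobs -- the model has to guess which mode you meant.
SEARCH_OR_CREATE_TICKET = {
    "name": "search_or_create_ticket",
    "description": "Searches for a ticket, or creates one if none exists.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "create_if_missing": {"type": "boolean"},
        },
        "required": ["query"],
    },
}

# After: two tools, each with one purpose and an explicit trigger.
SEARCH_TICKETS = {
    "name": "search_tickets",
    "description": "Use this when the user asks whether a ticket already exists or wants its status.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
}

CREATE_TICKET = {
    "name": "create_ticket",
    "description": "Use this only after search_tickets returned no match and the user wants a new ticket.",
    "input_schema": {"type": "object", "properties": {"title": {"type": "string"}}, "required": ["title"]},
}
```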
RAG vs CAG: how to actually decide
A decision framework from real implementations. RAG retrieves. CAG stores in cache. Knowing which to use — and when to combine both — determines whether your agent finds the right answer at the right cost.
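The framework reduces to a small decision function. The thresholds below are illustrative, not universal, and assume you can estimate the corpus size in tokens and how often it changes.

```python
def choose_strategy(corpus_tokens: int, updates_per_day: float,
                    context_window: int = 1_000_000) -> str:
    if corpus_tokens <= 0.5 * context_window and updates_per_day < 1:
        # Smallish, stable corpus: load it once into a cached context -- CAG.
        return "CAG: stuff the corpus into a cached prompt prefix"
    if corpus_tokens > context_window:
        # Corpus cannot fit regardless of cost: retrieval is mandatory.
        return "RAG: retrieve per query"
    # Fits, but changes often or caching would not amortise: combine them,
    # caching the stable core and retrieving the volatile slice.
    return "Hybrid: cache stable docs, retrieve fresh ones"

print(choose_strategy(corpus_tokens=180_000, updates_per_day=0.1))   # -> CAG
print(choose_strategy(corpus_tokens=4_000_000, updates_per_day=5))   # -> RAG
```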
Visual breakdowns on AI Engineering
Latest in AI Engineering
MCP remote-server registry crosses 500 listed servers — a curated production-ready tier emerges
GitHub cuts agentic CI workflow costs 19-62% by pruning tools and moving data-fetch outside the LLM loop
Claude Opus 4.7 ships with 1M-token context window in production
Claude Code adds project memory — persistent context that survives across CLI sessions
Anthropic publishes "Effective Tool Design" — official guidance for production agents
Sonnet 4.6 update: cheaper tokens, sharper tool calls, fewer retry loops
Haiku 4.5 in production — small-model speed, surprising tool-use chops
Cursor 1.0 stabilises background agents and ships a review-and-merge workflow
How AI Engineering ships in our engagements
The pages below are the buyer-focused, conversion-grade versions of this topic — deliverables, methodology, ROI, security considerations, and CTAs to scope a real engagement.
Agentic AI Consulting
Designed, built, and handed off — production agentic systems for enterprise teams.
Explore the Agentic AI Consulting solution
MCP Integration
Custom Model Context Protocol servers that turn your systems into agent tools.
Explore the MCP Integration solution
AI Guardrails
Multi-layer safety, policy, and audit controls for agents in regulated environments.
Explore the AI Guardrails solution
AI Systems Engineering Training
Eight-day corporate training programs that take dev teams from AI-assisted coding to production agentic systems.
Explore the AI Systems Engineering Training solution
Enterprise AI Architecture
Reference architectures for organisations standing up an AI platform — not one agent, but the foundation for many.
Explore the Enterprise AI Architecture solution
AI Observability
Tracing, eval, cache-hit telemetry, and cost attribution for production agents.
Explore the AI Observability solution
Multi-Agent Workflows
Supervisor + handoff orchestration for portfolios of agents that need to cooperate without arguing.
Explore the Multi-Agent Workflows solution
AI Automation for Enterprises
Operational agents that replace manual workflows — triage, support, ERP integration, content pipelines.
Explore the AI Automation for Enterprises solution
AI Engineering — the questions teams actually ask
Train your team on AI Engineering
Two tracks — one for developers who build agents, one for business teams who use them. Customised to your stack, hands-on from session 1.
See AI Engineering training tracks
Ship your first AI Engineering system
Architecture design, production implementation on Claude API and MCP, full observability, and a real handoff. Working agents, not slides.
Explore AI Engineering consulting
Adjacent topics to read next
Agentic AI
Designing, building, and shipping production agents.
Model Context Protocol (MCP)
The open protocol that gives agents tools.
Multi-Agent Systems
Orchestrating many agents without losing the plot.
Claude API
Building production agents on Anthropic's Claude.
AI Observability
Tracing, eval, and telemetry for production agents.
