Prompt caching is not optional anymore. Measuring a 47% cost drop
A walkthrough from a client engagement: identifying stable prefixes, restructuring the system prompt for cacheability, and the telemetry that proved caching was actually working.
In this post (5 sections)
Numbers from a recent engagement: 47% reduction in monthly model spend, no change in output quality, two days of engineering work. Prompt caching is not optional once your traffic stabilises. It is the same theme as the cheapest LLM call is the one you do not make, applied to the calls you cannot avoid: make the tokens you resend cost a fraction of full price.
How caching actually saves money
A cache hit means the provider already has the prefix of your prompt in a fast store and bills it at a steep discount instead of full input price. The catch is that caching keys on an exact, contiguous prefix. The moment something dynamic appears in that prefix, everything after it is uncacheable. So the whole game is arranging your prompt so the stable part comes first and stays byte-for-byte identical across requests.
Step 1: find your stable prefixes
Audit your system prompts. Anything that does not change per request is a cache candidate: tool definitions, role and style instructions, retrieval indexes for closed corpora, few-shot examples. The user message and any per-request context go after the cacheable section, never threaded through it.
Step 2: restructure for cache hits
Move all stable content into one contiguous block at the start of the prompt and place the cache breakpoint after that block. The mistake I see most often is interleaving cacheable and non-cacheable content, for example dropping a timestamp or a per-user greeting into the middle of the system prompt, which forfeits the cache benefit even though "caching is enabled" in the dashboard.
Step 3: instrument it
You need cache-hit telemetry, not just usage telemetry. Most providers return cache-read counts in response metadata. Log them. If you cannot see your cache-hit rate per route, you cannot optimise it, and you cannot tell the difference between caching that works and caching that is silently defeated by an interleaved dynamic token. Aim for 70%+ on the routes where you have stable prefixes; this is one of the core panels in the agent observability stack we ship.
Common mistakes
- Enabling caching but interleaving a timestamp or per-user string into the stable block, which silently forfeits every hit after it.
- Measuring total token usage and assuming caching works, without ever logging cache-read counts.
- Caching a prefix that changes more often than you think, such as a tool list rebuilt in a non-deterministic order each request.
The boring engineering wins. Caching is not glamorous, but it is the single biggest ROI optimisation I do for clients in 2026, and it pairs naturally with the registry and pre-agentic work from the cost post above. If you want this done as a focused engagement, it is a two-day consulting project for most stabilised systems.
Agentic AI patterns, delivered Thursdays
What I am shipping, watching, and pruning out of client stacks each week. One email. No fluff.