Architecture Published 7 min

RAG vs CAG: how to actually decide

A decision framework from real implementations. RAG retrieves. CAG stores in cache. Knowing which to use, and when to combine both, determines whether your agent finds the right answer at the right cost.

Jigar Joshi, Agentic AI Architect and Consultant

RAG retrieves at request time. CAG, short for cache-augmented (or context-augmented) generation, stores frequently needed content in the prompt cache so it is reused across calls. They solve overlapping problems at different costs, and the choice is not ideological: it is a function of how your corpus behaves. This is the framework version of the question I raised in "Opus 4.7's 1M context: RAG or just stuff it".

When to RAG

  • The corpus is too large to fit in context, or it grows unboundedly.
  • Content changes frequently, such as product catalogues or ticket queues.
  • Per-user access controls apply, where different users must see different subsets.
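The access-control case is the one that most clearly forces retrieval, so here is a minimal sketch of it. Everything in it is an assumption made for illustration: the `Doc` type, the `allowed_groups` field, and the naive term-overlap ranking stand in for whatever store and scorer you actually use. The point it shows is only the ordering: filter by permission first, rank second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical ACL field: groups allowed to read this doc


def retrieve(query: str, docs: list[Doc], user_groups: set[str], k: int = 3) -> list[Doc]:
    """Return up to k docs the user may see, ranked by naive term overlap."""
    terms = set(query.lower().split())
    # Filter by permission BEFORE ranking, so restricted text never reaches the prompt.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    scored = [(len(terms & set(d.text.lower().split())), d) for d in visible]
    # Keep only docs that share at least one term with the query.
    matches = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(matches, key=lambda pair: pair[0], reverse=True)
    return [d for _, d in ranked[:k]]
```

Because the filter depends on who is asking, the result cannot live in a prefix that every user shares, which is why this kind of content belongs on the retrieval side.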

When to CAG

  • Stable reference material everyone needs, such as style guides, schemas, or framework docs.
  • A high repeat-query rate against the same corpus, so the cached prefix is reused constantly.
  • Latency-sensitive paths where a retrieval round-trip costs more than reading the cached prefix.

RAG and CAG on the dimensions that decide it

| Dimension        | RAG (retrieve)              | CAG (cache)                    |
|------------------|-----------------------------|--------------------------------|
| Corpus size      | Large or unbounded          | Fits in the prompt budget      |
| Volatility       | Changes often               | Stable for hours or days       |
| Access control   | Per-user subsets            | Shared by everyone             |
| Latency          | Adds a retrieval round-trip | No round-trip after first call |
| Best repeat rate | Low or one-off              | High                           |
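
The first three dimensions can be checked mechanically per slice of content, so a rough sketch of that check may help. The `ContentSlice` type, the 200,000-token budget, and the function name are assumptions for illustration, not fixed values; the 24-hour stability threshold follows the rule of thumb used in this post. Latency and repeat rate are left out because they need measurement rather than a static property of the content.

```python
from dataclasses import dataclass


@dataclass
class ContentSlice:
    name: str
    tokens: int                   # approximate size of the slice
    hours_between_changes: float  # how long the content typically stays unchanged
    per_user: bool                # True if different users must see different subsets


def choose_strategy(
    s: ContentSlice,
    prompt_budget_tokens: int = 200_000,  # assumed budget, tune for your model
    stable_hours: float = 24.0,           # rule-of-thumb stability threshold
) -> str:
    """Return "RAG" or "CAG" for one slice of content."""
    if s.per_user:
        return "RAG"  # per-user subsets must not live in a shared cached prefix
    if s.tokens > prompt_budget_tokens:
        return "RAG"  # too large to fit in the prompt budget
    if s.hours_between_changes < stable_hours:
        return "RAG"  # changes too often for a cached copy to stay fresh
    return "CAG"
```

Running it per slice, rather than once for the whole system, is the design choice that matters: a style guide and a ticket queue in the same product will usually land on different sides.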

When to combine

Most production systems end up doing both. CAG the stable reference material, RAG the volatile or per-user content. The practical rule: cache anything that does not change in 24 hours, retrieve everything else. Then measure cache-hit rate, which tells you whether you got the split right. The instrumentation for that is covered in "Prompt caching is not optional anymore" and in the agent observability stack we ship.
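
A minimal sketch of what the split looks like when the prompt is assembled, assuming the provider's prompt cache matches on an exact prompt prefix. The function names here are hypothetical and no real provider API is called; the sketch only shows that stable material goes first and stays byte-identical, while retrieved and per-request content goes after it.

```python
import hashlib


def build_prompt(stable_docs: list[str], retrieved_chunks: list[str], question: str) -> tuple[str, str]:
    """Return (cacheable_prefix, volatile_suffix) for one call."""
    # Sort the stable docs so the prefix is byte-identical regardless of input order.
    prefix = "\n\n".join(sorted(stable_docs))
    # Retrieved chunks and the question change per call, so they go after the prefix.
    suffix = "\n\n".join(retrieved_chunks + [f"Question: {question}"])
    return prefix, suffix


def prefix_key(prefix: str) -> str:
    """Fingerprint of the prefix, useful for logging when it changes unexpectedly."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

Logging `prefix_key` per call is a cheap way to notice that something volatile has leaked into the stable section: if the key changes between calls that should share a prefix, the cache is being invalidated.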

Common mistakes

  • Caching per-user content into a shared prefix, which is both a cache-pollution problem and a data-leak risk.
  • Retrieving stable reference material on every call, paying a round-trip for content that never changes.
  • Choosing one approach for the whole system instead of splitting by how each slice of content behaves.
  • Picking a split and never measuring cache-hit rate to confirm it.
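
For the last mistake, a rough sketch of the measurement, assuming each call is logged with its total input tokens and the number of those served from cache. The field names `input_tokens` and `cached_tokens` are hypothetical; map them to whatever usage fields your provider or logging layer actually reports.

```python
def cache_hit_rate(calls: list[dict]) -> float:
    """Fraction of input tokens served from cache across the logged calls."""
    cached = sum(c["cached_tokens"] for c in calls)
    total = sum(c["input_tokens"] for c in calls)
    # No traffic means no evidence either way; report 0.0 rather than divide by zero.
    return cached / total if total else 0.0


# Example log: the first call warms the cache, the next two reuse the prefix.
calls = [
    {"input_tokens": 1000, "cached_tokens": 0},
    {"input_tokens": 1000, "cached_tokens": 900},
    {"input_tokens": 1000, "cached_tokens": 900},
]
rate = cache_hit_rate(calls)  # 1800 cached out of 3000 input tokens
```

A token-weighted rate is used here rather than a per-call count, on the assumption that cost tracks tokens; a per-call hit count would be a reasonable alternative if latency is the main concern.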

RAG and CAG are not rivals; they are two tools for two kinds of content. Get the split right and your agent finds the right answer at the right cost. Get it wrong and you pay in latency, spend, or stale answers. Designing that split for a specific workload is a common consulting starting point.
