Claude Opus 4.7's 1M context: when to RAG and when to just stuff it
A reliable million-token context is real now, but it does not retire RAG. It changes the calculus: cost, latency, recency, and the prompt-cache angle nobody is talking about.
For two years the answer to "how do I give the model my whole codebase" was RAG. With Opus 4.7 reliably handling 1M tokens in production, that default is broken, but only for some shapes of problem. RAG is not retiring; it is moving down the decision tree. The cleaner framing of that tree is in "RAG vs CAG: how to actually decide"; this post is about what specifically changed when a million tokens became reliable.
When to just stuff the context
Single-shot tasks where the relevant material fits in the budget, where you control all of it, and where the cost of one big call is acceptable: paste it. The cognitive load on the developer drops, the implementation cost drops, and prompt caching makes repeat calls cheap.
- Whole-repo code review for repos under roughly 500K tokens.
- Long-document analysis where every part might matter and retrieval would risk dropping the relevant slice.
- Multi-day debugging sessions with stable codebases, where you cache the prefix once and reuse it.
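Before pasting a repo, it helps to check whether it plausibly fits under the rough 500K-token ceiling from the list above. The sketch below is a minimal estimate, not a real tokenizer: the 4-characters-per-token ratio, the file extensions, and the function names are all assumptions for illustration. For a real decision, count tokens with your provider's own tooling.

```python
import os

# Assumption: ~4 characters per token is a coarse heuristic for English text
# and code. It is NOT the model's tokenizer; treat results as a ballpark.
CHARS_PER_TOKEN = 4
STUFF_BUDGET_TOKENS = 500_000  # the rough ceiling mentioned in the list above


def estimate_tokens(text: str) -> int:
    """Ballpark token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def estimate_repo_tokens(root: str, extensions=(".py", ".md", ".ts")) -> int:
    """Walk a directory and sum the estimated tokens of matching files."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total += estimate_tokens(f.read())
    return total


def fits_stuffing_budget(token_count: int, budget: int = STUFF_BUDGET_TOKENS) -> bool:
    """True when the estimate is at or under the stuffing budget."""
    return token_count <= budget
```

Calling `fits_stuffing_budget(estimate_repo_tokens("path/to/repo"))` gives a quick yes or no; anything near the boundary deserves a real token count.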
When you still want RAG
Reach for RAG when the content is recency-sensitive (docs that change daily), when the corpus is genuinely large (millions of pages), and wherever users have different access levels to subsets of the content. RAG also wins when latency to first token matters: a 5K-token retrieved chunk arrives faster than 800K tokens of stuffed context, even with caching. Access control is the one people underrate; stuffing a shared context across users with different permissions is a data leak waiting to happen.
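The access-control point is easiest to see in code. This is a minimal sketch, assuming each document carries a set of groups allowed to read it; the `Doc` type, the `allowed_groups` field, and both function names are hypothetical. The important property is that filtering happens per user, before any text reaches the prompt, which a single shared stuffed context cannot do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical document record; field names are illustrative only.
    doc_id: str
    text: str
    allowed_groups: frozenset


def visible_docs(docs, user_groups):
    """Keep only documents that share at least one group with the user."""
    user_groups = frozenset(user_groups)
    return [d for d in docs if d.allowed_groups & user_groups]


def build_context(docs, user_groups):
    """Assemble prompt context from the permitted documents only."""
    return "\n\n".join(d.text for d in visible_docs(docs, user_groups))
```

In a real retrieval pipeline the same filter would typically be pushed into the index query as a metadata constraint rather than applied in application code afterward.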
The prompt-cache angle
This is the part most write-ups miss. With cache-aware design, stuffing can be cheaper than retrieving for repeated queries against the same corpus, because the large prefix is billed at the cached discount after the first call. Cache-hit rate is the new throughput metric. Instrument it before you decide, using the method in "prompt caching is not optional anymore". The decision is not RAG versus long context in the abstract; it is RAG versus long-context-plus-caching, which is a different and often closer contest.
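A rough way to run that comparison is a back-of-envelope cost model. Everything here is an assumption to be replaced with your own numbers: the price per million tokens, the cache write and read multipliers (placeholders, not quoted pricing), and the usage field names, which are modeled on the kind of per-request usage counters providers report and should be checked against your provider's documentation. The model covers input tokens only and ignores output tokens, embedding and vector-store costs, and cache expiry.

```python
def cache_hit_rate(usages):
    """Share of input tokens served from cache across a list of usage dicts.

    Assumed (hypothetical) keys per request:
    cache_read_input_tokens, cache_creation_input_tokens, input_tokens.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


def stuffing_cost(prefix_tokens, query_tokens, n_calls, price_per_mtok,
                  write_mult=1.25, read_mult=0.10, hit_rate=1.0):
    """Input cost of n_calls that each send the same large prefix.

    The first call writes the cache. Each later call reads the prefix at the
    discounted rate with probability hit_rate, else pays the write rate again.
    write_mult and read_mult are placeholder multipliers on the base price.
    """
    if n_calls <= 0:
        return 0.0
    per_token = price_per_mtok / 1_000_000
    first = prefix_tokens * write_mult
    blended = hit_rate * read_mult + (1 - hit_rate) * write_mult
    later = (n_calls - 1) * prefix_tokens * blended
    queries = n_calls * query_tokens
    return (first + later + queries) * per_token


def rag_cost(retrieved_tokens, query_tokens, n_calls, price_per_mtok):
    """Input cost of n_calls that each send retrieved chunks plus the query."""
    per_token = price_per_mtok / 1_000_000
    return n_calls * (retrieved_tokens + query_tokens) * per_token
```

Plugging in a measured `hit_rate` rather than the optimistic default of 1.0 is the point of the exercise: the stuffed path is only cheap when later calls actually hit the cache, and how it compares with retrieval depends on how many tokens your RAG path sends per call.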
Long context handles what the model needs to read this session. It is not memory across sessions; that is a separate layer I cover in persistent memory for coding agents and the three paradigms of LLM memory. Conflating "big context" with "memory" is one of the more expensive mistakes I see.
The rule of thumb
If the corpus fits and is stable, stuff it and cache. If it is large or volatile, retrieve. If it is medium and stable, prototype both and measure. The answer is rarely obvious before measurement, which is exactly the kind of architecture call I help teams make in consulting.
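The rule of thumb can be written down as a small function, which makes the thresholds explicit and easy to argue about. This is a sketch under stated assumptions: the post does not define "medium", so the 500K comfortable limit (borrowed from the code-review bullet) and the 1M context limit are illustrative cutoffs, and the `per_user_access` flag is added to reflect the access-control caveat from the RAG section.

```python
def choose_strategy(corpus_tokens, volatile, per_user_access=False,
                    comfortable_limit=500_000, context_limit=1_000_000):
    """Return 'stuff_and_cache', 'retrieve', or 'prototype_both'.

    Thresholds are hypothetical defaults, not values stated as rules
    in the post; tune them to your model and budget.
    """
    # Differing permissions or frequently changing content favor retrieval.
    if per_user_access or volatile:
        return "retrieve"
    # Too large to fit in the context window at all.
    if corpus_tokens > context_limit:
        return "retrieve"
    # Fits comfortably and is stable: stuff it and cache the prefix.
    if corpus_tokens <= comfortable_limit:
        return "stuff_and_cache"
    # Medium and stable: the answer is not obvious, so measure both.
    return "prototype_both"
```

The function only encodes the starting position; the measurement step it points to in the middle case is where the real decision gets made.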