Does a 1M-token context window make RAG obsolete?

No. It changes the calculus for some workloads but not others. RAG moves down the decision tree rather than disappearing: it still wins for huge or fast-changing corpora, per-user access control, and low-latency paths. Long context wins when the material fits, you control it, and you can cache it.

When should I just stuff everything into the context?

When the relevant material fits in the budget, you control all of it, the cost of one large call is acceptable, and queries repeat enough that caching the prefix pays off. Whole-repo review, long-document analysis, and stable multi-day debugging sessions are the typical fits.

When is RAG still the right choice?

For content that changes daily, corpora in the millions of pages, situations where different users may see only their own subset of content, and latency-sensitive paths where a small retrieved chunk returns faster than a huge stuffed context.

How does prompt caching change the decision?

For repeated queries against a stable corpus, the large stuffed prefix is billed at the cached discount after the first call, so stuffing-plus-caching can beat retrieval on cost. That is why the real comparison is RAG versus long-context-with-caching, not RAG versus raw long context.

Is a long context the same as giving the model memory?

No. Long context is what the model reads within a session; it resets when the session ends. Persistent memory across sessions is a separate architectural layer. Treating a big context window as memory leads to expensive surprises when the window clears.

Opus 4.7 1M Context: RAG or Just Stuff It?

For two years the answer to "how do I give the model my whole codebase" was RAG. With Opus 4.7 reliably handling 1M tokens in production, that default no longer holds, but only for some shapes of problem. RAG is not retiring; it is moving down the decision tree. The cleaner framing of that tree is in "RAG vs CAG: how to actually decide"; this post is about what specifically changed when a million tokens became reliable.

When to just stuff the context

For single-shot tasks where the relevant material fits in the budget, you control all of it, and the cost of one big call is acceptable, paste it in. The cognitive load on the developer drops, the implementation cost drops, and prompt caching makes repeat calls cheap. Typical fits:

Whole-repo code review for repos under roughly 500K tokens.
Long-document analysis where every part might matter and retrieval would risk dropping the relevant slice.
Multi-day debugging sessions with stable codebases, where you cache the prefix once and reuse it.
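
Before pasting, it helps to check that the material really fits. The sketch below is a rough pre-flight check, not a real tokenizer: the 4-characters-per-token ratio, the 100K-token reserve, and the function names are all assumptions made for illustration, so swap in your provider's token counter for real numbers.

```python
# Assumption: ~4 characters per token is a crude heuristic for English text and code.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate; replace with a real tokenizer before trusting it."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_budget(texts: list[str],
                   context_budget: int = 1_000_000,
                   reserve: int = 100_000) -> bool:
    """True if the combined material leaves `reserve` tokens free
    for instructions, the question, and the model's output."""
    total = sum(estimate_tokens(t) for t in texts)
    return total <= context_budget - reserve
```

The reserve matters because a prompt that fills the whole window leaves no room for the question or the answer.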

When you still want RAG

RAG still fits recency-sensitive content such as docs that change daily, genuinely large corpora in the millions of pages, and anywhere users have different access levels to subsets of content. RAG also wins when latency to first token matters: a 5K-token retrieved chunk arrives faster than 800K tokens of stuffed context, even with caching. Access control is the one people underrate; stuffing a shared context across users with different permissions is a data leak waiting to happen.
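
To make the access-control point concrete, here is a minimal sketch of filtering documents by permission before any of them reach the prompt. The `Doc` type, the group names, and both helper functions are hypothetical. A real system would enforce this in the retrieval layer or the data store rather than in application code alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this document


def visible_docs(docs, user_groups):
    """Keep only documents that share at least one group with the user."""
    return [d for d in docs if d.allowed_groups & user_groups]


def build_context(docs, user_groups):
    """Assemble prompt context from permitted documents only."""
    return "\n\n".join(d.text for d in visible_docs(docs, user_groups))


docs = [
    Doc("handbook", "Engineering handbook", frozenset({"eng", "hr"})),
    Doc("salaries", "Salary bands", frozenset({"hr"})),
]
context_for_engineer = build_context(docs, frozenset({"eng"}))
```

A single shared stuffed prefix cannot do this per user, which is why per-user subsets push the decision toward retrieval.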

Stuff the context versus retrieve it

Factor	Favours stuffing	Favours RAG
Corpus size	Fits in the budget	Millions of pages
Recency	Stable for hours or days	Changes by the minute
Access control	Everyone sees the same content	Per-user subsets
Latency to first token	Acceptable	Must be low
Repeat queries	High, so caching pays off	Low or one-off

The prompt-cache angle

This is the part most write-ups miss. With cache-aware design, stuffing can be cheaper than retrieving for repeated queries against the same corpus, because the large prefix is billed at the cached discount after the first call. Cache-hit rate is the new throughput metric. Instrument it before you decide, using the method in "Prompt caching is not optional anymore". The decision is not RAG versus long context in the abstract; it is RAG versus long-context-plus-caching, which is a different and often closer contest.
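
A back-of-envelope model makes the comparison easier to reason about. Everything numeric here is an assumption for illustration: the price per million tokens, the cache write and read multipliers, and the per-query retrieval overhead are placeholders, and only input-token cost is counted. Plug in your provider's actual rates and your measured hit rate.

```python
def stuffed_cost(prefix_tokens, query_tokens, calls, hit_rate,
                 price_per_mtok, write_mult=1.25, read_mult=0.10):
    """Expected input cost of `calls` requests that resend the same prefix.

    Cache hits bill the prefix at `read_mult`, misses at `write_mult`
    (both illustrative multipliers of the base input price).
    """
    blended = hit_rate * read_mult + (1.0 - hit_rate) * write_mult
    per_call = (prefix_tokens * blended + query_tokens) * price_per_mtok / 1e6
    return calls * per_call


def rag_cost(retrieved_tokens, query_tokens, calls, price_per_mtok,
             retrieval_overhead=0.0):
    """Input cost of `calls` requests carrying only retrieved chunks, plus a
    hypothetical per-query cost for embedding and vector search."""
    per_call = (retrieved_tokens + query_tokens) * price_per_mtok / 1e6
    return calls * (per_call + retrieval_overhead)


def cache_hit_rate(cache_read_tokens, cache_write_tokens):
    """Share of cacheable prefix tokens that were served from cache."""
    total = cache_read_tokens + cache_write_tokens
    return cache_read_tokens / total if total else 0.0


PRICE = 5.0  # hypothetical dollars per million input tokens
raw = stuffed_cost(500_000, 500, 100, hit_rate=0.0, price_per_mtok=PRICE, write_mult=1.0)
cached = stuffed_cost(500_000, 500, 100, hit_rate=0.99, price_per_mtok=PRICE)
rag = rag_cost(5_000, 500, 100, price_per_mtok=PRICE)
```

Caching closes most of the gap between raw stuffing and retrieval. Which side wins from there depends on how many tokens retrieval actually sends, what the retrieval stack costs per query, and the hit rate you measure in production.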

Long context handles what the model needs to read this session. It is not memory across sessions; that is a separate layer I cover in persistent memory for coding agents and the three paradigms of LLM memory. Conflating "big context" with "memory" is one of the more expensive mistakes I see.
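
The distinction can be sketched in a few lines. In this hypothetical `Session` class, `context` stands in for the context window and disappears with the object, while notes written through `remember` go to a JSON file and survive into the next session. The class, its methods, and the file format are illustrative assumptions, not a description of any particular memory system.

```python
import json
from pathlib import Path


class Session:
    """One working session: `context` is in-memory only, notes persist on disk."""

    def __init__(self, memory_path: Path):
        self.context: list[str] = []      # resets with every new Session
        self.memory_path = memory_path    # survives across sessions

    def load_memory(self) -> list[str]:
        """Read notes saved by earlier sessions, if any."""
        if self.memory_path.exists():
            return json.loads(self.memory_path.read_text())
        return []

    def remember(self, note: str) -> None:
        """Append a note to the persistent store."""
        notes = self.load_memory()
        notes.append(note)
        self.memory_path.write_text(json.dumps(notes))
```

A new `Session` pointed at the same file starts with an empty `context` but can still call `load_memory()`; that separation is the layer a bigger window does not provide.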

The rule of thumb

If the corpus fits and is stable, stuff it and cache. If it is large or volatile, retrieve. If it is medium and stable, prototype both and measure. The answer is rarely obvious before measurement, which is exactly the kind of architecture call I help teams make in consulting.
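
The rule of thumb, together with the factors in the comparison table, can be written as a small decision helper. This is a sketch under stated assumptions: the function name and return strings are made up for illustration, and "medium" is taken to mean more than half the budget, a threshold the post does not define.

```python
def choose_strategy(corpus_tokens: int, budget_tokens: int, stable: bool,
                    per_user_access: bool = False,
                    latency_sensitive: bool = False) -> str:
    """Return a starting recommendation, not a substitute for measurement."""
    # Per-user permissions and tight latency both favour retrieval.
    if per_user_access or latency_sensitive:
        return "retrieve"
    # Too large to fit, or too volatile to cache: retrieve.
    if corpus_tokens > budget_tokens or not stable:
        return "retrieve"
    # Assumption: "medium" means more than half the budget.
    if corpus_tokens > budget_tokens // 2:
        return "prototype both and measure"
    return "stuff and cache"
```

For example, `choose_strategy(300_000, 1_000_000, stable=True)` returns `"stuff and cache"`, while the same corpus with `per_user_access=True` returns `"retrieve"`.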

