What is the difference between RAG and agent memory?

RAG is explicit memory: a read-mostly external store that grounds answers in shared world knowledge. Agent memory is the per-session and per-user state the agent carries about this specific conversation: what the user said, what was decided, what to remember next time. Different layers, different storage shapes. Using only one is the most common gap I see in production reviews.

Where does the implicit / explicit / agentic split come from?

From BigAI-NLCO's 2026 memory survey (published in TMLR, with the companion repository Awesome-AI-Memory on GitHub). The authors borrow neuroscience anchors loosely: weights as neocortex, retrieval as hippocampus, agent state as prefrontal cortex. You do not need the analogy to use the split, but it is a fine mnemonic.

Do I really need a separate database for agent memory?

In almost every production case I have seen, yes. Your RAG corpus is shared and read-mostly. Agentic memory is per-user, write-heavy, and lifecycle-aware: consolidation, summarisation, expiry. Bolting user state into the same vector store that holds your product docs leaks data across users and tangles two lifecycles that should be separate.
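As a minimal sketch of that separation, assuming SQLite as a stand-in for whatever database you actually run: the class name `UserMemoryStore`, the table layout, and the `ttl_seconds` parameter are all hypothetical. It covers only two of the lifecycle concerns, per-user keying and expiry, and leaves consolidation out.

```python
from __future__ import annotations

import sqlite3
import time


class UserMemoryStore:
    """Hypothetical per-user memory table, kept apart from any shared RAG corpus."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memories ("
            "user_id TEXT NOT NULL, key TEXT NOT NULL, value TEXT NOT NULL, "
            "expires_at REAL, PRIMARY KEY (user_id, key))"
        )

    def remember(self, user_id: str, key: str, value: str,
                 ttl_seconds: float | None = None, now: float | None = None) -> None:
        """Write (or overwrite) one fact for one user, optionally with an expiry."""
        now = time.time() if now is None else now
        expires_at = None if ttl_seconds is None else now + ttl_seconds
        self.db.execute(
            "INSERT OR REPLACE INTO memories VALUES (?, ?, ?, ?)",
            (user_id, key, value, expires_at),
        )
        self.db.commit()

    def recall(self, user_id: str, now: float | None = None) -> dict[str, str]:
        """Return only this user's unexpired facts; every read is scoped by user_id."""
        now = time.time() if now is None else now
        rows = self.db.execute(
            "SELECT key, value FROM memories "
            "WHERE user_id = ? AND (expires_at IS NULL OR expires_at > ?)",
            (user_id, now),
        ).fetchall()
        return dict(rows)

    def expire(self, now: float | None = None) -> int:
        """Delete expired rows and return how many were removed."""
        now = time.time() if now is None else now
        cur = self.db.execute(
            "DELETE FROM memories WHERE expires_at IS NOT NULL AND expires_at <= ?",
            (now,),
        )
        self.db.commit()
        return cur.rowcount
```

The point of the sketch is the shape, not the engine: every read and write carries a `user_id`, and expiry is a first-class operation. Neither is something a shared document index is usually designed around.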

When should I fine-tune facts in instead of using memory layers?

Rarely. Fine-tuning is the most expensive layer and the hardest to update when the fact changes next month. Exhaust the cheaper options first (knowledge editing, retrieval-augmented updates, agentic memory writes). Fine-tune when what you are changing is stylistic or structural, not when it is a list of facts.

Does long-context (1M tokens) replace any of this?

It expands the working scratchpad inside one call, which is genuinely useful for stuffing whole-document context, a trade-off I work through in [Opus 4.7's 1M context: RAG or just stuff it](/blog/claude-opus-4-7-1m-context-rag-or-stuff/). It does not replace agentic long-term memory. A 1M window still resets between calls. Per-user state across sessions still needs its own store, and the RAG-versus-cache split for the explicit layer is in [RAG vs CAG: how to actually decide](/blog/rag-vs-cag-decision-framework/).

The Vector Store Is Not Your Agent's Memory

Last month I reviewed an agent built on a 4 million chunk vector store. The team called the vector store "memory". They could not work out why the agent kept losing track of what the user had said three turns earlier.

The vector store was fine. The problem was that retrieval and memory are not the same thing, and the team had bolted on a retrieval pipeline while assuming it would also give them an agent that remembered conversations. It does not. There is now a survey from BigAI-NLCO that does a clean job of naming why, and the taxonomy in it is the most useful thing I have read on this in a while.

Three layers, not one

The survey (Awesome-AI-Memory on GitHub, paper in TMLR) splits memory into implicit, explicit, and agentic. The authors borrow the names loosely from neuroscience, but you do not have to care about the analogy to use the split. The split itself is what matters: at a glance you can tell which layer a given paper is about, and which layer your own system has not built yet.

The three memory layers

Implicit memory: what the weights already know

This is everything the model learned in pretraining and fine-tuning. The research the survey points at is interesting (transformer feed-forward layers behave a lot like key-value stores, knowledge neurons can be localised, memorisation scales differently from reasoning) but the practical takeaway is short: you cannot change this layer at runtime. What it does decide is how much of the next two layers you actually need. A model that already knows your domain needs less retrieval. A model that does not will need more.

Explicit memory: the stuff you bolt on outside

RAG. Vector stores. Knowledge graphs. Long-context stuffing. This is the layer almost everyone calls "memory" in production, which is where the trouble starts. It is read-mostly, queried per request, and shared across users. It is good at grounding answers in current facts. It is not an agent's memory of a conversation. It has no concept of who asked what, when, or what got decided. The build-time trade-offs inside this layer, retrieval versus caching, are the subject of RAG vs CAG: how to actually decide.

A rule that has saved me from a lot of design arguments: use this layer when the question is "what does the world know about X?". Do not use it when the question is "what did this specific user just tell me?". Those are different problems with different storage shapes.

Agentic memory: the layer most teams skip

This is the one that goes missing. The survey covers it in some depth: single-agent vs multi-agent, short-term vs long-term, the architecture of ingestion → storage → retrieval → invocation, and how you evaluate it. The simplest framing I have found that survives a code review:

A short-term scratchpad that lives inside the session. What the user said, which tools got called, what was decided. Most teams approximate this with raw prompt history, and a bigger window does not fix it (see Opus 4.7's 1M context). Fine for a prototype. In production you want it instrumented so you can see what fell out of the window and why.
A long-term store keyed per user or per workspace. What this user prefers, what they tried last week, what they explicitly asked you to remember. This is its own database. It is not your RAG corpus and it should not share infrastructure with it.
A consolidation step. Something that, every so often, promotes things from short-term to long-term and summarises old turns out of the live context. Without it, sessions grow unbounded and the agent ends up "forgetting" things it actually saw, just because they aged off the prefix.
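The three pieces above can be sketched in a few dozen lines. This is an illustration under assumptions, not a reference design: `Scratchpad`, `LongTermStore`, `consolidate`, and the `is_durable` / `summarise` callbacks are hypothetical names, the long-term store is an in-memory dict standing in for a real database, and summarisation is whatever function you pass in (in practice probably a model call).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scratchpad:
    """Short-term, per-session state: the live turns plus a running summary."""
    turns: list[str] = field(default_factory=list)
    summary: str = ""
    # Instrumentation: every turn that left the live window, and why.
    dropped: list[tuple[str, str]] = field(default_factory=list)

    def add(self, turn: str) -> None:
        self.turns.append(turn)


class LongTermStore:
    """Long-term facts keyed per user (a dict here; a real database in production)."""

    def __init__(self) -> None:
        self._facts: dict[str, list[str]] = {}

    def write(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)

    def read(self, user_id: str) -> list[str]:
        return list(self._facts.get(user_id, []))


def consolidate(
    user_id: str,
    pad: Scratchpad,
    store: LongTermStore,
    keep_last: int,
    is_durable: Callable[[str], bool],
    summarise: Callable[[str, list[str]], str],
) -> int:
    """Move all but the last `keep_last` turns out of the live window.

    Durable turns are promoted to the long-term store; every removed turn is
    folded into the running summary and logged. Returns the number promoted.
    """
    cut = len(pad.turns) - keep_last
    if cut <= 0:
        return 0  # nothing is old enough to consolidate
    old, pad.turns = pad.turns[:cut], pad.turns[cut:]
    promoted = 0
    for turn in old:
        if is_durable(turn):
            store.write(user_id, turn)
            pad.dropped.append((turn, "promoted"))
            promoted += 1
        else:
            pad.dropped.append((turn, "summarised"))
    pad.summary = summarise(pad.summary, old)
    return promoted
```

You would call `consolidate` on some schedule (every N turns, or at session end) with your own rule for what counts as durable, for example turns where the user explicitly asked for something to be remembered. The `dropped` log is the instrumentation part: it records what left the window and whether it was promoted or only summarised.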

I have seen teams build any one of these and call the problem solved. The leverage is in building all three. For a working implementation of the agentic layer made durable for a coding agent, see persistent memory for coding agents.

Why the taxonomy is worth holding in your head

Most of the "the agent feels dumb" reports I get from clients are layer-assignment problems in disguise. Agent does not know a product fact? Implicit plus explicit. Fine-tune or retrieve. Agent answers correctly but cannot remember what the user just said? Short-term agentic. Your context-window management is leaking. Agent treats every session as if it is meeting the user for the first time? Long-term agentic. There is no per-user store, and adding more RAG will not produce one no matter how much money you throw at it.
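The triage above is essentially a lookup from symptom to layer. A small sketch makes that explicit; the `Layer` enum, the `triage` function, and its three boolean inputs are hypothetical names chosen for illustration, and real diagnosis will be messier than three flags.

```python
from __future__ import annotations

from enum import Enum


class Layer(Enum):
    """The layer to look at first, with the usual fix as the value."""
    IMPLICIT_OR_EXPLICIT = "fine-tune or retrieve"
    SHORT_TERM_AGENTIC = "fix context-window management"
    LONG_TERM_AGENTIC = "add a per-user store"


def triage(
    knows_product_facts: bool,
    recalls_current_session: bool,
    recalls_past_sessions: bool,
) -> list[Layer]:
    """Map observed symptoms to the memory layers that are likely missing."""
    suspects: list[Layer] = []
    if not knows_product_facts:
        suspects.append(Layer.IMPLICIT_OR_EXPLICIT)
    if not recalls_current_session:
        suspects.append(Layer.SHORT_TERM_AGENTIC)
    if not recalls_past_sessions:
        suspects.append(Layer.LONG_TERM_AGENTIC)
    return suspects
```

For the agent from the opening anecdote (facts fine, losing track three turns back), `triage(True, False, True)` points at the short-term agentic layer rather than at the vector store.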

A related thing worth pulling out: the survey's coverage of memory editing and unlearning is the right thing to read before any team commits to fine-tuning facts into the weights. Fine-tuning is the most expensive memory layer you can pick and the hardest to roll back. The survey walks the gradient of cheaper alternatives (knowledge editing, retrieval-augmented updates, agentic memory writes), which is usually where the answer lives.

Where to start reading

If you have thirty minutes, read three things, in this order. First, the repo README itself, for the taxonomy and the neuroscience anchors. Second, the "Transformer Feed-Forward Layers Are Key-Value Memories" paper (arXiv 2012.14913), which will recalibrate how you think about weights. Third, the agent-memory survey at arXiv 2404.13501, which reads as a practitioner companion to the BigAI synthesis and is the one I keep open on the side when I am whiteboarding the layer-three design.

The reason most production agents in 2026 feel shallow is not that the models are weak. It is that engineering teams imported the RAG pattern from 2023, called it memory, and stopped there. The BigAI survey gives the missing layer a name and a literature. Open your system architecture diagram and look for the box that is not there yet. That is usually the work.

References: Awesome-AI-Memory repository on GitHub; TMLR paper on OpenReview (openreview.net forum); arXiv paper.
