Memory in AI Systems Issue 1/7

Memory Is Belief State, Not Storage

Most agents treat memory as a storage problem. It is a state management problem — and the distinction produces completely different architectures.

May 12, 2026 · 17 min read · Sentient Zero Labs

Two years ago, a user told their AI assistant they were vegetarian. The assistant stored it. Every subsequent conversation included that fact in context. Restaurant recommendations, meal planning, recipe suggestions — all vegetarian.

Except the user isn’t vegetarian anymore. They changed about eight months ago. They never told the assistant. Why would they? People don’t announce dietary changes to software.

The assistant kept serving salads.

This isn’t a retrieval bug. The memory system was working exactly as designed: it stored a fact, and it surfaced that fact when relevant. The problem is the design itself. The system treated memory as storage. It needed to treat memory as belief state: a continuously maintained model of what is currently true about this user, with mechanisms to detect when stored facts become wrong.

Storage is solved. Belief state management is not.

That distinction is what this series is about.

The Reframe

Ask most engineers what “adding memory” to an agent means, and you get the same answer: vector store, embed the conversation, retrieve on query. Maybe a few tables for user preferences. Ship it.

This architecture works in demos. In production, it degrades.

Not because the technology is wrong — vector stores are fine for what they do. Because the problem model is wrong. The engineer is solving a storage problem when the actual problem is something harder:

Memory is the agent’s belief state about the user. It is the agent’s internal model of who this person is, what they want, what they have done, and — critically — what is currently true about them that was perhaps not true six months ago.

Managing that model accurately over time is not a storage problem. It is a state management problem. And state management has different requirements: not just write and read, but maintain — detect contradictions, decay stale facts, resolve conflicts when new information arrives.

The expert systems of the 1980s ran into exactly this problem. Those systems stored facts and rules in hand-coded knowledge bases. They were brittle not because the knowledge was wrong at the time of encoding; it was often carefully crafted by domain experts. They were brittle because the world changed and the knowledge base had no mechanism to detect it. The knowledge bases accumulated stale facts silently. MYCIN could recommend antibiotic therapy with confidence. It could not tell you that a treatment rule encoded in 1977 had been partially superseded by 1984.

The modern memory system fails the same way. The user is still getting salad recommendations. The knowledge base still says vegetarian. Nobody told the system, and the system has no way to ask.

The fix is not to ask the user more often. The fix is to design a memory system that knows it has a model of the user — and that models can be wrong.

The field is converging on this view. Mem0 reached 186 million API calls in Q3 2025 — a 5x increase in six months — which indicates that production teams are adopting dedicated memory infrastructure rather than bolting retrieval onto a vector store. Zep pivoted its entire architecture to a temporal knowledge graph (Graphiti) specifically because vector embeddings cannot express which version of a fact is currently valid. The convergence point: selective retrieval, temporal tracking, and explicit management phases. The vegetarian salad failure is not an edge case. It is the design flaw at the center of how most teams build memory today.

Memory vs. RAG: Two Different Problems

Before anything else, the distinction that will save you weeks of wrong architecture.

RAG — Retrieval-Augmented Generation — is how you give an agent access to a knowledge base it was not trained on. Product documentation, company policies, research papers, inventory data. The documents exist independently of any particular user. You embed them, you retrieve relevant chunks, you inject them into context. When the documents change, you re-index.

Memory is different in every dimension that matters for system design:

Dimension	Memory	RAG
About	User-specific belief state	Shared domain knowledge
Content	Preferences, history, personal facts, stated constraints	Documents, products, manuals, policies
Scope	Per-user — must be strictly isolated	Shared across all users
Updates	Continuously, per conversation	When documents change (periodic re-index)
Privacy	Must be isolated and erasable per user	Usually public or shared

Notice the scope row. RAG is shared. Memory is per-user. These require different storage architectures (multi-tenant isolation from day one), different retrieval semantics (scoped always by user_id, never a global search), and different privacy obligations (GDPR right to erasure applies to memory, not to a shared product catalog).
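A minimal sketch of what "scoped always by user_id" and "erasable per user" can look like in practice, using Python's standard sqlite3 module. The table name, columns, and function names here are hypothetical illustrations, not the schema of any particular product.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create a tiny per-user memory table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS memories (
               id INTEGER PRIMARY KEY,
               user_id TEXT NOT NULL,
               fact TEXT NOT NULL
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_mem_user ON memories(user_id)")
    return conn

def remember(conn, user_id, fact):
    """Store one fact for one user."""
    conn.execute("INSERT INTO memories (user_id, fact) VALUES (?, ?)", (user_id, fact))
    conn.commit()

def recall(conn, user_id, keyword=""):
    """Every read is scoped by user_id; this API has no global search path."""
    if not user_id:
        raise ValueError("user_id is required for every memory read")
    rows = conn.execute(
        "SELECT fact FROM memories WHERE user_id = ? AND fact LIKE ? ORDER BY id",
        (user_id, f"%{keyword}%"),
    )
    return [row[0] for row in rows]

def erase_user(conn, user_id):
    """Erasure path: delete everything held for one user, return rows removed."""
    cur = conn.execute("DELETE FROM memories WHERE user_id = ?", (user_id,))
    conn.commit()
    return cur.rowcount

conn = open_store()
remember(conn, "alice", "prefers vegetarian restaurants")
remember(conn, "bob", "allergic to peanuts")
print(recall(conn, "alice"))      # only alice's facts
print(erase_user(conn, "alice"))  # number of rows deleted for alice
print(recall(conn, "bob"))        # bob's facts are untouched
```

The design choice worth noting is that the isolation lives in the function signature: there is simply no read call that omits the user. A shared RAG index would not need that constraint.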

Notice the updates row. A product catalog changes when you update the catalog. User memory changes with every conversation. A RAG system that goes one week without re-indexing is probably fine. A memory system that goes one week without processing new information is losing signal continuously.

A production system needs both. RAG for domain knowledge. Memory for user state. The mistake is building one to do the other’s job — which produces systems that are simultaneously over-engineered (a vector store for 20 user preference facts) and under-built (no temporal tracking for a user preference that will change).

The Zep team put it directly: stop using RAG for agent memory. The root failure is that vector embeddings have no concept of temporal ordering or fact validity windows. When the user changes a preference, the old preference and the new preference both live in the embedding space. The system retrieves whichever is more semantically similar to the query. There is no concept of “which of these is currently true.”
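To make "which of these is currently true" concrete, here is a small sketch of a fact record with a validity window. The field names valid_at and invalid_at follow the naming used in the Zep paper listed under Resources; the BeliefStore class and its methods are hypothetical, and this is a simplified single-timeline version, not a full bi-temporal model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    user_id: str
    key: str                                # e.g. "diet"
    value: str                              # e.g. "vegetarian"
    valid_at: datetime                      # when the fact became true
    invalid_at: Optional[datetime] = None   # None means "still believed true"

class BeliefStore:
    """Hypothetical in-memory store that supersedes facts instead of piling them up."""

    def __init__(self):
        self._facts = []

    def assert_fact(self, user_id, key, value, now):
        """Record a new value and close the validity window of the old one."""
        for f in self._facts:
            if f.user_id == user_id and f.key == key and f.invalid_at is None:
                if f.value == value:
                    return f           # same belief restated, nothing to change
                f.invalid_at = now     # supersede the old value, do not delete it
        fact = Fact(user_id, key, value, valid_at=now)
        self._facts.append(fact)
        return fact

    def current(self, user_id, key):
        """Return the one value currently believed true, or None."""
        for f in self._facts:
            if f.user_id == user_id and f.key == key and f.invalid_at is None:
                return f.value
        return None

    def history(self, user_id, key):
        """All values ever held for this key, in insertion order."""
        return [(f.value, f.valid_at, f.invalid_at)
                for f in self._facts
                if f.user_id == user_id and f.key == key]

store = BeliefStore()
store.assert_fact("u1", "diet", "vegetarian", datetime(2024, 5, 1))
store.assert_fact("u1", "diet", "omnivore", datetime(2025, 9, 1))
print(store.current("u1", "diet"))   # omnivore
print(len(store.history("u1", "diet")))  # 2
```

A similarity search over raw embeddings would return both the old and the new preference. Here the old one is still available for audit, but only one value answers the question "what is true now."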

Why 1M Context Windows Don’t Solve This

The counter-argument you will encounter: “Just use a 1M-token context window. Put everything in context. Done.”

This position misunderstands the problem on two levels: cost and quality.

Cost. The math is not subtle:

CONTEXT WINDOW COST COMPARISON (Q1 2026 pricing, approximate)
─────────────────────────────────────────────────────────────
Full 128k context on every query:
  Input cost:       $5 / million tokens (GPT-4o uncached)
  Per-query cost:   $0.64
  At 10k queries/day: $6,400/day

Smart memory retrieval (~1k tokens injected):
  Input cost:       $5 / million tokens
  Per-query cost:   $0.005
  At 10k queries/day: $50/day

Ratio: ~128x more expensive with full-context stuffing.

At any current frontier model pricing, full-context injection at 128k tokens costs 100–130x more per query than selective memory injection. The ratio holds regardless of which model you use, because it is a function of the token count difference, not the per-token price. Fill a full 1M-token window and the same arithmetic gives roughly 1,000x. At any meaningful scale, full-context stuffing is not a strategy. It is a billing problem.
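The arithmetic in the comparison above can be reproduced in a few lines. This sketch counts input tokens only and ignores output tokens and prompt caching; the function name is hypothetical and the price is the approximate figure quoted above, not a current rate card.

```python
def daily_input_cost(tokens_per_query, queries_per_day, usd_per_million_tokens):
    """Input-token cost per day in USD (ignores output tokens and caching)."""
    return tokens_per_query * queries_per_day * usd_per_million_tokens / 1_000_000

PRICE = 5.0        # USD per million input tokens, the figure used above
QUERIES = 10_000   # queries per day

full = daily_input_cost(128_000, QUERIES, PRICE)       # full 128k context
selective = daily_input_cost(1_000, QUERIES, PRICE)    # ~1k tokens of memory
print(full, selective, full / selective)  # 6400.0 50.0 128.0

# The ratio depends only on the token counts, not on the price per token.
for price in (1.0, 5.0, 15.0):
    ratio = (daily_input_cost(128_000, QUERIES, price)
             / daily_input_cost(1_000, QUERIES, price))
    print(price, round(ratio))

# The 1M-token case from the section title, same price and volume.
print(daily_input_cost(1_000_000, QUERIES, PRICE))  # 50000.0
```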

Quality. Even if cost were not the constraint, dumping everything into context does not mean the model uses it well. The “lost in the middle” problem, documented by Liu et al. (2023), showed that language models have a recency bias and a primacy bias: information at the beginning and end of the context is recalled significantly better than information in the middle. Longer context does not improve retrieval of middle-position information — it buries it deeper.

A 1M-token context window filled with user history does not produce a model that knows the user well. It produces a model that remembers the first thing it ever learned about the user, and the last thing, and has a degraded signal on everything in between.

Smart memory retrieval (surfacing the roughly 1k most relevant tokens for this specific query) beats stuffed context on cost and latency while giving up only a little accuracy. The LOCOMO benchmark puts numbers on the trade: full-context achieves 72.9% accuracy at 9.87 seconds latency. Selective memory retrieval (Mem0’s approach) achieves 66.9% accuracy at 0.71 seconds latency. That is six percentage points of accuracy in exchange for roughly 93% lower latency (0.71 s versus 9.87 s) and about 90% fewer tokens. For most production workloads, that is the right trade.

The winning architecture is not bigger context. It is smaller, better-chosen context.

Four Memory Types, Four Engineering Decisions

Cognitive science has a clean taxonomy for memory that maps directly to AI system design choices. This is not academic decoration — each type requires a completely different engineering approach.

Type	What It Is	Engineering Implementation
Working	Active thoughts right now — what the model can see this instant	Context window management. Controlled directly via prompt construction and context budget allocation.
Episodic	Past events and experiences — what happened, when, in what sequence	Conversation history storage and retrieval. The hard part: knowing what to retrieve from thousands of past turns. Needs vector search + recency scoring.
Semantic	General world knowledge — facts, concepts, how things work in general	Model weights (pretraining) + RAG layer (domain knowledge). This is the RAG problem, not the memory problem.
Procedural	How to do things — learned behaviors, skills, patterns	The hardest to engineer. Cannot be injected at runtime; requires fine-tuning or dense few-shot examples. Most teams substitute with explicit system-prompt instructions.

The practical sequencing for most production agents:

Working memory first — get your context window management right. Context ordering (primacy and recency matter), context budget allocation, compression strategy. Most agents underperform not because they lack long-term memory but because they mismanage short-term context. Fix this before adding anything else.
Episodic memory second — add cross-session storage when conversations need to build on each other. This is where the vegetarian salad bug lives. Extract meaningful facts from conversations, persist them, retrieve them on subsequent sessions.
Semantic memory via RAG — already solved by most teams. Keep it separate from personal memory. Different store, different retrieval logic, different privacy model.
Procedural memory last — only if you have feedback mechanisms to refine behavior and a clear idea of what behaviors need to be learned. Do not build this upfront.
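The working-memory step above (context ordering plus context budget allocation) can be sketched as two small functions. This is one possible approach under stated assumptions: the priorities are supplied by the caller, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and the function names are hypothetical.

```python
def estimate_tokens(text):
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return max(1, len(text) // 4)

def fit_budget(items, budget):
    """Greedily keep high-priority items within a token budget.

    items: list of (priority, text); a higher priority means more important.
    An item that does not fit is skipped, and smaller items after it may still fit.
    """
    kept, used = [], 0
    for priority, text in sorted(items, key=lambda it: it[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append((priority, text))
            used += cost
    return kept

def order_for_edges(items):
    """Put the most important items at the start and end, the least in the middle."""
    ranked = sorted(items, key=lambda it: it[0], reverse=True)
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return [text for _, text in front + back[::-1]]

items = [(5, "A"), (4, "B"), (3, "C"), (2, "D"), (1, "E")]
print(order_for_edges(items))  # ['A', 'C', 'E', 'D', 'B']
```

The top-ranked item lands first, the second-ranked item lands last, and the lowest-ranked item ends up in the middle, which is where the primacy and recency effects suggest recall is weakest.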

Notice the sequencing. Most teams skip working memory management, bolt on a vector store for episodic memory without any extraction or curation logic, call it “memory,” and wonder why quality degrades over time.

The taxonomy tells you what to store. The loop tells you how to maintain it.

The WRITE → MANAGE → READ Loop

Every memory system is fundamentally three phases. Most implementations get two of them.

┌─────────────────────────────────────────────────────────────────────────┐
│                    THE MEMORY LOOP                                      │
│                                                                         │
│                                                                         │
│   New conversation  ──►  WRITE                                          │
│                          │                                              │
│                          │  Extract what is worth keeping.             │
│                          │  Assign importance. Record source.          │
│                          │  Store to persistent layer.                 │
│                          │                                              │
│                          ▼                                              │
│                       MANAGE  ◄── (runs asynchronously, continuously)  │
│                          │                                              │
│                          │  Detect contradictions.                     │
│                          │  Decay stale facts.                         │
│                          │  Consolidate duplicates.                    │
│                          │  Prune low-value memories.                  │
│                          │                                              │
│                          ▼                                              │
│   Incoming query   ──►  READ                                            │
│                          │                                              │
│                          │  Retrieve relevant memories.                │
│                          │  Score by relevance + recency +             │
│                          │  importance. Enforce token budget.          │
│                          │  Inject into context.                       │
│                          │                                              │
│                          ▼                                              │
│                    Agent response (memory-grounded)                     │
└─────────────────────────────────────────────────────────────────────────┘

WRITE and READ are visible. Engineers implement them because the system obviously does not work without them.

MANAGE is invisible — until the system has been running for three months and the memory store has become a noise pile. Contradictory facts sitting unresolved. Old preferences surfacing on every query because they have high similarity scores but low current relevance. Duplicate memories filling the context window. The agent response quality degrades in a way that looks like a model problem. The investigation goes in the wrong direction.

The anti-pattern: accumulate everything during WRITE, dump it all during READ, never maintain anything during MANAGE. This produces the junk drawer problem — technically a memory system, practically a noise generator.
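For contrast with the junk drawer, here is a deliberately small sketch of all three phases in one class. Everything in it is an illustrative assumption: the thresholds, the exact-match duplicate check, the keyword-overlap relevance score, and the word count used in place of tokens. A real system would use semantic similarity and a proper tokenizer.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    importance: float                                  # 0..1, assigned at write time
    created: float = field(default_factory=time.time)  # Unix timestamp

class MemoryLoop:
    """Hypothetical toy implementation of WRITE, MANAGE, READ."""

    def __init__(self):
        self.items = []

    def write(self, text, importance, now=None):
        """WRITE: store only what clears a minimal importance bar."""
        if importance < 0.2:
            return None
        m = Memory(text, importance, now if now is not None else time.time())
        self.items.append(m)
        return m

    def manage(self, now=None, max_age_days=365):
        """MANAGE: consolidate exact duplicates, then prune old low-value items."""
        now = now if now is not None else time.time()
        newest = {}
        for m in self.items:
            key = m.text.strip().lower()
            if key not in newest or m.created > newest[key].created:
                newest[key] = m                        # keep the most recent copy
        cutoff = now - max_age_days * 86400
        self.items = [m for m in newest.values()
                      if m.created >= cutoff or m.importance >= 0.8]

    def read(self, query, token_budget=50):
        """READ: rank by keyword overlap times importance, enforce a budget."""
        q = set(query.lower().split())
        scored = []
        for m in self.items:
            overlap = len(q & set(m.text.lower().split()))
            if overlap:
                scored.append((overlap * m.importance, m))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        out, used = [], 0
        for _, m in scored:
            cost = len(m.text.split())                 # word count stands in for tokens
            if used + cost > token_budget:
                break
            out.append(m.text)
            used += cost
        return out
```

Deleting the manage method leaves a system that still passes a demo. That is the point of the timeline that follows: the missing phase only shows up after months of accumulation.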

Here is the timeline of degradation in a system with no MANAGE phase:

Week 1: Memory works well. Small store, high signal.
Month 1: Duplicates accumulate. Some contradictions. Quality drifts slightly.
Month 3: Old facts crowd out recent ones. Contradiction frequency increases. Response quality noticeably worse than a system with no memory at all.
Month 6: The memory store has become a liability. Engineers consider turning it off.

This is not hypothetical. It is the modal outcome for “we added memory” without the MANAGE phase.

The remaining issues in this series build each phase in order:

Issue 02 — WRITE: what to extract, how to score importance, what deserves to be stored
Issue 03 — MANAGE: contradiction detection, decay functions, consolidation, pruning
Issue 04 — READ: retrieval as a hyperparameter, threshold tuning, hybrid search, context budget

Failure Mode: The Stale Belief

Every issue in this series names one failure mode. Issue 01’s failure mode is the one the opening story illustrates.

Name: Stale Belief

What happens: The agent holds a fact that was true when stored but is no longer true. It acts on this fact with confidence because the memory system has no mechanism to distinguish “was true once” from “is true now.” The user experiences this as the agent being wrong about them in a way that feels worse than the agent knowing nothing — it signals that the agent remembered, but failed to update.

Why it is worse than no memory: A stateless agent makes a mistake once. A memory-equipped agent can turn a stale belief into a recurring pattern by retrieving it as evidence on every subsequent interaction. The vegetarian example is low-stakes. A stale belief about a user’s budget, risk tolerance, medical status, or business context is not.

How to detect it: Look for queries where retrieved memories are older than 60 days and have not been accessed recently. Look for retrieval patterns where the same old memory surfaces repeatedly without any confirming signal in recent conversations. Look for user corrections (“actually, I don’t do that anymore”) — these are explicit signals that a stale belief has surfaced.
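The detection signals above can be turned into a simple audit pass. This is a sketch under assumptions: the record fields, the retrieval-count threshold, and the correction phrases are all hypothetical, and the phrase match is a crude heuristic that a production system would likely replace with a classifier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class MemoryRecord:
    fact: str
    stored_at: datetime
    last_confirmed_at: Optional[datetime]  # last time the user restated or confirmed it
    retrieval_count: int                   # how often READ has surfaced it

def stale_candidates(records, now, max_age_days=60, min_retrievals=3):
    """Flag memories that are old, surface repeatedly, and lack a recent confirming signal."""
    threshold = now - timedelta(days=max_age_days)
    flagged = []
    for r in records:
        last_signal = r.last_confirmed_at or r.stored_at
        if last_signal < threshold and r.retrieval_count >= min_retrievals:
            flagged.append(r)
    return flagged

# Hypothetical phrases that often accompany a user correction.
CORRECTION_PHRASES = ("anymore", "no longer", "used to")

def looks_like_correction(user_message):
    """Crude substring check for an explicit correction in a user message."""
    msg = user_message.lower()
    return any(phrase in msg for phrase in CORRECTION_PHRASES)

now = datetime(2026, 5, 12)
records = [
    MemoryRecord("is vegetarian", datetime(2024, 5, 1), None, 12),
    MemoryRecord("likes jazz", datetime(2026, 5, 1), None, 12),
]
print([r.fact for r in stale_candidates(records, now)])  # ['is vegetarian']
```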

How to prevent it: Three mechanisms, applied together:

Decay scoring — old memories accumulate a time penalty. A preference stated once, two years ago, scores lower than a preference stated last month. The vegetarian memory does not get deleted; it gets deprioritized until it stops surfacing.
Periodic revalidation — for high-importance, high-stakes facts (dietary restrictions, medical conditions, business roles), surface them occasionally for explicit confirmation. “I recall you mentioned you’re vegetarian — is that still accurate?” Annoying if overused; valuable when calibrated to facts that actually change.
Explicit update mechanisms — give users a path to correct the record. “I’m not vegetarian anymore” should trigger a contradiction resolution flow that supersedes the old fact, not adds a new one alongside it.
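The first two mechanisms above can be sketched with a few lines of arithmetic. The exponential half-life form, the 90-day half-life, and the 180-day revalidation interval are illustrative assumptions chosen for this example; they are not values taken from Recall or from any paper cited here.

```python
import math

def decayed_score(importance, age_days, half_life_days=90.0):
    """Exponential decay: the score halves every half_life_days."""
    return importance * math.pow(0.5, age_days / half_life_days)

def needs_revalidation(age_days, high_stakes, revalidate_after_days=180):
    """Only high-stakes facts are surfaced for explicit confirmation, and only when old."""
    return high_stakes and age_days >= revalidate_after_days

# A preference stated once, two years ago, versus one stated last month.
old_pref = decayed_score(importance=0.9, age_days=730)
new_pref = decayed_score(importance=0.6, age_days=30)
print(old_pref < new_pref)  # True

print(needs_revalidation(age_days=730, high_stakes=True))   # True
print(needs_revalidation(age_days=730, high_stakes=False))  # False
```

Nothing is deleted here. The two-year-old fact simply scores far below the recent one, so it stops winning retrieval, while the revalidation check gives a high-stakes fact a chance to be confirmed or superseded.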

Issue 03 (the MANAGE phase) builds these mechanisms in full. For now, the diagnostic question: do you have any of these three in your current system? If the answer is no, you have a Stale Belief problem — you just have not hit a high-stakes case yet.

When Memory Matters — and When It Doesn’t

Not every agent needs this level of memory architecture.

Build it when:

The agent serves the same user across multiple sessions
User preferences, history, or stated facts should change the agent’s behavior
The system is meant to personalize over time (the value proposition is “the more you use it, the better it gets”)
Incorrect recall is worse than no recall — high-stakes domains (health, finance, professional context)

Skip it (for now) when:

Single-turn or single-session interactions only
All users should get the same responses regardless of history
The agent is a reasoning engine over provided documents, not a personal assistant
You have fewer than a few hundred active users — the signal is too sparse to justify the architecture

The practical rule: if your agent would benefit from knowing the user said something last month, you need memory. If your agent would be confused by knowing what the user said last month, you do not.

One more useful question before building: what should this agent be allowed to forget? The product decision about what is ephemeral versus what is worth persisting should precede every technical decision. Teams that skip this question end up storing everything — and the junk drawer problem is guaranteed.

Anchor Project: Recall

Starting from Issue 02, this series ships code from Recall — an open-source persistent memory layer built in public alongside these issues.

The vegetarian salad failure is exactly the failure Recall was built to make detectable. Recall is a persistent memory layer that runs as an MCP server — it gives AI agents durable, structured memory across sessions, backed by SQLite, with no external services required. The architecture is deliberately inspectable: one database file, one process, and a schema you can query directly to see exactly what the agent currently believes about a user.

Issue 02 covers the extraction pipeline — how facts are pulled from conversation turns, scored for importance, and written to the memory store. That is where the code ships. The technical details (hybrid search, 4-component scoring) belong there, not here.

Everything is at: github.com/Sentient-Zero-Labs/szl-recall

Production Checklist

Before claiming your agent has memory, check these six properties. Binary yes or no.

	Item	Score
	Memory is stored per-user, isolated — a query for User A cannot return memories from User B under any circumstances.
	The system distinguishes memory from RAG — user-specific belief state lives in a different store, with different retrieval logic, from shared domain knowledge.
	The MANAGE phase exists — there is some mechanism (even a minimal one) for handling stale facts, contradictions, or outdated memories.
	Memory has an expiry or decay mechanism — old memories lose weight over time; there is no permanent fact that can never be overridden.
	There is a deletion path — a user can request that their memories be erased, and the system can fulfill that request completely.
	The retrieval result is token-budgeted — injected memories cannot grow unboundedly; there is a hard cap on how much memory enters the context on any single query.

If any of these is no, that is the next thing to fix — not the seventh item.

Resources

Lost in the Middle: How Language Models Use Long Contexts

Liu et al. 2023 — arXiv:2307.03172

The paper that named the primacy/recency bias in long-context retrieval. Essential reading if you are making decisions about context window size. The key finding: models perform significantly worse on information in the middle of long contexts vs. the beginning or end. Position matters — not just length.

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0 Research Team — arXiv:2504.19413 (ECAI 2025)

The production paper behind the LOCOMO benchmark results cited in this issue. Validates the selective pipeline over full-context injection with latency and accuracy numbers. Directly relevant if you are choosing between architectures.

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep — arXiv:2501.13956 (2025)

The paper that makes the strongest case for why vector embeddings are insufficient for evolving memory. Introduces the bi-temporal tracking model (valid_at, invalid_at). Read this before designing a memory schema for facts that can change.

Memory in the Age of AI Agents

arXiv:2512.13564 (2025)

The most comprehensive survey of the field. Proposes a three-dimensional taxonomy (forms, functions, dynamics) that deliberately moves beyond the 'long-term vs. short-term' framing. Useful as a map of the research landscape if you want to go deeper than this series.

Recall: open-source persistent memory layer

github.com/Sentient-Zero-Labs/szl-recall

The implementation anchor for this series. SQLite-based, local-first, MCP-compatible. Start here if you want to run code rather than read about it. The extraction pipeline and scoring architecture ship with Issue 02.

Memory in AI Systems is a seven-issue series from Sentient Zero Labs. Issue 02 covers the WRITE phase — what to extract from conversations, how to score importance, and what the memory unit schema looks like. The Recall extraction pipeline ships with Issue 02.

Until next issue,

Sentient Zero Labs