Memory in AI Systems Issue 5/7

Four Design Patterns in Order

Most teams jump to Pattern 3 (vector store) or Pattern 4 (episodic log). Most should start with Pattern 2. A decision matrix for which memory architecture your use case actually requires.

May 12, 2026 · 15 min read · Sentient Zero Labs


A team at a mid-size SaaS company spent three weeks building a vector memory system for their AI assistant. They set up an embedding pipeline, provisioned a vector database, implemented HNSW indexing, built a semantic search layer, and added a reranking step. They wrote tests. They tuned similarity thresholds. Three weeks of careful engineering work.

The agent needed to remember five facts about each user: name, timezone, language preference, notification settings, and plan tier.

A Postgres table with five columns and one row per user would have solved the problem. Two hours of work. Zero ongoing maintenance. The vector infrastructure they built solved a categorically different problem — one they did not have.

This is not a memory problem. It is a pattern selection problem. The team had cargo-culted a popular architecture without asking the prior question: can you enumerate every fact this agent needs to remember? Yes, they could — five of them, clearly defined. If you can enumerate the facts, you do not need semantic search over a vector index. You need a lookup table.

The question is not how to implement memory. It is which implementation your use case actually requires.

Issues 2, 3, and 4 covered the WRITE, MANAGE, and READ phases of the memory loop — how to extract facts, resolve contradictions, and retrieve the right memories at query time. This issue steps back to ask which of the four patterns those phases apply to, and when.

The Four Patterns

Four patterns cover the full range of memory architectures used in production agents. They are ordered by complexity, not preference. The right pattern depends on three questions: How many facts does your agent need to remember? Do they persist across sessions? Does the agent need to answer open-ended questions about its own history?

Pattern 1 — In-Session Compression: No persistent storage. The context window is the only memory. Rolling summaries handle context limits. The agent forgets everything when the session ends.

Pattern 2 — KV Fact Store: Structured storage of typed facts. Entity/attribute/value triples in a relational database. Direct lookup by key. Cross-session persistence. No embeddings required.

Pattern 3 — Vector Semantic Recall: Embeddings stored alongside structured facts. Semantic search over memory when the query does not map to a known key. Layered on top of Pattern 2, not a replacement for it.

Pattern 4 — Episodic Structured Log: Full conversation history stored as timestamped entries with semantic retrieval. The agent can reason about its own past. The canonical example is Generative Agents (Park et al. 2023). Required only for long-horizon agents with continuity requirements over months of interaction.

Each pattern costs more to build and maintain than the one before it. Each upgrade is only justified when the previous pattern’s limits are actually hit.

MEMORY ARCHITECTURE COMPLEXITY AXIS
═══════════════════════════════════════════════════════════════════════

                Pattern 1      Pattern 2        Pattern 3        Pattern 4
                In-Session     KV Fact Store    Vector Semantic  Episodic Log
                Compression                     Recall

──────────────────────────────────────────────────────────────────────────────►
                                                              Complexity / Cost

Storage:        None           Relational DB    Relational DB    Relational DB
                               (5–7 cols)       + Embeddings     + Full history

Cross-session:  No             Yes              Yes              Yes

Semantic
search:         No             No               Yes              Yes

Facts/user:     N/A            1–100            100–10,000       Unbounded

Contradiction
detection:      N/A            Deterministic    Deterministic    Semantic (slow)
                               (entity+attr)    (entity+attr)

Relative
cost:           $              $$               $$$              $$$$

Pattern 1: In-Session Compression

Pattern 1 is what every agent does by default, whether it chooses to or not. The context window is the memory system. Conversation history accumulates until it approaches the limit, at which point rolling summarization compresses the oldest turns — a cheap LLM call that replaces N turns with a 3-5 sentence summary.
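A minimal sketch of that rolling compression step, assuming a turn budget instead of a token budget to keep it short. The summarize callable is a hypothetical stand-in for the cheap LLM call, and the names compress_history, max_turns, and keep_recent are illustrative, not taken from any library.

```python
from typing import Callable


def compress_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    max_turns: int = 20,
    keep_recent: int = 10,
) -> list[str]:
    """Replace the oldest turns with one summary once history exceeds max_turns.

    `summarize` stands in for a cheap LLM call (hypothetical). A real system
    would usually budget by tokens rather than by turn count.
    """
    if len(turns) <= max_turns:
        return turns  # still within budget, nothing to compress
    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    summary = "[summary] " + summarize(old)
    return [summary] + recent


def fake_summarize(old_turns: list[str]) -> str:
    """Placeholder summarizer so the sketch runs without a model."""
    return f"{len(old_turns)} earlier turns condensed"


history = [f"turn {i}" for i in range(1, 26)]  # 25 turns
compressed = compress_history(history, fake_summarize)
print(len(compressed))  # 11: one summary plus the 10 most recent turns
print(compressed[0])    # [summary] 15 earlier turns condensed
```

Note that on the next pass the summary entry is itself among the oldest turns, so it gets summarized again. That repeated compression is where detail is lost.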

This is the right starting point for a reason: it works, requires no infrastructure, and is sufficient for a large class of use cases. Customer service agents resolving a single ticket. One-shot coding assistants. Prototypes where cross-session continuity is not a requirement. Any agent where the conversation has a defined start and end, and the user does not expect the agent to remember who they are next time.

Two risks are worth naming explicitly.

Lost-in-the-middle (Liu et al. 2023, arXiv:2307.03172): Language models do not retrieve information uniformly across a long context. Performance is highest for content at the beginning and end of the context window, and degrades significantly for content in the middle — even for explicitly long-context models. A preference stated in turn 3 of a 200-turn conversation is in a risky position.

Summarization drift: Every summarization pass is a lossy compression. A preference that was concrete in turn 3 (“I always use Python for backend work, Go for CLIs”) becomes “User prefers Python” after one compression, then “User has language preferences” after two. If your agent’s responses start feeling generic despite a long shared history, drift is a likely cause. Issue 6 (Failure Modes) covers drift in detail.

The upgrade trigger from Pattern 1 to Pattern 2 is simple: the user starts a new session and the agent does not know who they are.

Pattern 2: KV Fact Store

The principle behind Pattern 2 dates to the mid-1950s: if you know what you are looking for, compute where it should be rather than searching for it. The credit is contested — variously attributed to Hans Peter Luhn’s 1953 IBM memo or to H. A. M. Dumey’s 1956 paper in Computers and Automation — but the principle is not. Structure beats brute search for known-key lookups.

Seventy-three years later, the correct approach for storing a user’s timezone preference is still a lookup table. SELECT value FROM memories WHERE entity='user' AND attribute='timezone'. Microseconds. Via a B-tree index. Returns the right answer, not a semantically similar one.

Pattern 2 is a KV fact store. Each memory is a typed triple: who (entity), what property (attribute), and what value (value). The type is one of four: preference, fact, decision, or procedure. The taxonomy matters because different types have different contradiction behavior — a decision can be superseded by a later decision without touching a preference; two procedures can coexist even if they conflict in detail.

This is exactly how Recall implements the core storage layer. The memories table has entity, attribute, and value columns. When a new memory arrives for the same (user_id, entity, attribute), the old memory’s valid_until is set to now, which excludes that memory from all future queries. Contradiction detection happens deterministically, without embedding lookup.

Recall’s MemoryType enum defines the four types:

class MemoryType(str, Enum):
    PREFERENCE = "preference"    # stated likes/dislikes, settings
    FACT = "fact"                # verifiable information about the user
    DECISION = "decision"        # choices the user has made
    PROCEDURE = "procedure"      # steps or workflows the user follows

The case for Pattern 2 over Pattern 3 for structured facts: if you can name the attributes in advance, direct lookup is faster, more deterministic, and gives you contradiction detection as a free side effect. Semantic search over an embedding of “what is the user’s timezone preference” is slower, probabilistic, and will occasionally return the second-best match instead of the definitive one.

When to use Pattern 2: fewer than 100 facts per user, mostly typed preferences and decisions, cross-session personalization without open-ended history queries. This covers the vast majority of production personalization agents.

Pattern 3: Vector Semantic Recall

Pattern 3 is the layer you add when Pattern 2’s structured lookup is not sufficient — not the layer you start with.

The upgrade trigger is a specific query type: “tell me about my past projects,” “what do you know about my work on the auth service,” “remind me what we decided about the API design.” These queries do not map to a known entity/attribute pair. The user is asking the agent to search over memory by semantic relevance, not to retrieve a specific fact by key.

Recall adds Pattern 3 via a single install variant:

pip install "szl-recall[embeddings]"

This loads BAAI/bge-small-en-v1.5 (~500MB on first run) and enables dense vector retrieval. Embeddings are stored in the embedding column of the memories table — a BLOB of L2-normalized float32 values. BM25Plus keyword ranking is fused with dense cosine similarity via Reciprocal Rank Fusion (RRF, k=60). Without the embeddings package installed, Recall falls back to BM25-only — Pattern 2 continues to work fully.

This graceful degradation is the architecture point: Pattern 3 is a layer on top of Pattern 2, not a replacement. Structured facts still use direct lookup. Semantic search applies to the broader memory set when a query cannot be resolved by key.
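For readers who have not met Reciprocal Rank Fusion, here is a generic sketch of the technique with k=60. It is not Recall's code: the function name and the memory ids are made up for illustration. Each ranked list contributes 1 / (k + rank) per item, so an item that ranks well in both lists beats one that tops only a single list.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of memory ids into one ranking.

    Each list contributes 1 / (k + rank) for every id it contains, with rank
    starting at 1. Ids missing from a list contribute nothing for that list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical memory ids, best match first in each list.
bm25_ranking = ["m3", "m1", "m7"]   # keyword ranking
dense_ranking = ["m1", "m9", "m3"]  # cosine-similarity ranking

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print([memory_id for memory_id, _ in fused])  # ['m1', 'm3', 'm9', 'm7']
```

Passing a single list returns it in its original order, which is one way to picture the BM25-only fallback: with no dense ranking to fuse, the keyword ranking stands on its own.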

What you take on when you add Pattern 3:

Embedding latency: 5-20ms per query on CPU to encode the query vector
Storage overhead: ~1.5KB per memory entry for a 384-dimension dense vector
Re-embedding on update: when a memory’s text is revised, its embedding must be regenerated
Model pinning: changing embedding models requires regenerating all stored embeddings
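The storage figure is easy to check: 384 float32 values at 4 bytes each is 1,536 bytes. The sketch below packs an L2-normalized vector into a BLOB using only the standard library. The little-endian byte order and the helper names are assumptions for illustration; this issue does not specify Recall's exact byte layout.

```python
import math
import struct

DIM = 384  # embedding dimension quoted above


def to_blob(vector: list[float]) -> bytes:
    """L2-normalize a non-zero vector and pack it as little-endian float32 bytes."""
    norm = math.sqrt(sum(x * x for x in vector))
    unit = [x / norm for x in vector]
    return struct.pack(f"<{len(unit)}f", *unit)


def from_blob(blob: bytes) -> list[float]:
    """Unpack float32 bytes back into a list of Python floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(blob_a: bytes, blob_b: bytes) -> float:
    """For unit-length vectors, cosine similarity reduces to the dot product."""
    return sum(a * b for a, b in zip(from_blob(blob_a), from_blob(blob_b)))


blob = to_blob([float(i + 1) for i in range(DIM)])
print(len(blob))                     # 1536 bytes = 384 * 4, about 1.5 KB
print(round(cosine(blob, blob), 4))  # 1.0
```

Normalizing at write time is what makes the similarity a plain dot product at query time. At the upper end of Pattern 3's range, 10,000 memories per user, the vectors alone come to roughly 15 MB per user.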

	Pattern 2 only	Pattern 2 + Pattern 3
Install	pip install szl-recall	pip install "szl-recall[embeddings]"
Schema	entity + attribute + value + valid_until	Same + embedding BLOB column populated
Search path	BM25Plus keyword ranking only	BM25Plus + dense vectors, fused via RRF (k=60)
Contradiction detection	Deterministic: same entity+attribute → supersede	Same (unaffected by embeddings)
Query latency	<5ms (B-tree index)	5–20ms (embedding encode + ANN search)
When to use	≤100 facts/user, structured preferences	Large fact sets, open-ended history queries

Pattern 4: Episodic Structured Log

Pattern 4 is the full episodic memory architecture — full conversation history, stored as timestamped diary entries, retrievable by semantic similarity. The canonical example is Generative Agents (Park et al. 2023, arXiv:2304.03442).

In that architecture, each agent maintains a memory stream: a chronological log of every observation in natural language with timestamps. Retrieval scores each memory by three equal-weight components — recency (exponential decay over hours), importance (a 1-10 poignancy score assigned at write time by an LLM), and relevance (cosine similarity to the query). The agent periodically runs a reflection pass: when cumulative importance of recent observations crosses a threshold, it generates a higher-order insight and stores it as a new memory entry.
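A compact sketch of that three-component score, following the description in Park et al.: each component is min-max scaled to [0, 1] and the three are summed with equal weight, using the paper's reported decay factor of 0.995 per hour. The MemoryEntry class, the toy two-dimensional embeddings, and the sample entries are invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    text: str
    hours_since_access: float  # hours since the entry was last retrieved
    importance: int            # 1-10, assigned by an LLM at write time
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def min_max(values: list[float]) -> list[float]:
    """Rescale to [0, 1] so the three components are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread: component cannot affect ranking
    return [(v - lo) / (hi - lo) for v in values]


def retrieve(memories, query_embedding, top_k=3, decay=0.995):
    """Rank memories by recency + importance + relevance, equal weights."""
    recency = min_max([decay ** m.hours_since_access for m in memories])
    importance = min_max([float(m.importance) for m in memories])
    relevance = min_max([cosine(m.embedding, query_embedding) for m in memories])
    scored = [(r + i + v, m) for r, i, v, m in zip(recency, importance, relevance, memories)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


stream = [
    MemoryEntry("Ate breakfast", 1, 1, [1.0, 0.0]),
    MemoryEntry("User is leaning toward FastAPI", 48, 7, [0.0, 1.0]),
    MemoryEntry("Discussed weekend plans", 2, 2, [0.8, 0.6]),
]
top = retrieve(stream, query_embedding=[0.0, 1.0], top_k=2)
print([m.text for m in top])  # FastAPI entry first, then weekend plans
```

The two-day-old FastAPI entry wins despite having the worst recency, because importance and relevance together outweigh it. That trade-off is the point of scoring on three axes instead of similarity alone.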

This architecture is necessary for a specific class of agent: one that needs to reason about its own history. “Last time we talked about this, you were leaning toward FastAPI — have you made a decision?” That question requires that the prior session’s content be retrievable as a specific moment. A rolling summary would not preserve the detail. A KV fact store would not have captured it as a structured fact.

The cost is real: storage grows with every session. For a daily-use agent over 6 months, that is potentially thousands of entries — each needing an embedding. A production deployment at scale requires persistent vector storage, ANN index rebuild cycles, a reflection scheduler, and separate storage for raw observations versus synthesized reflections — none of which is present in the original research prototype.

Most agents do not need Pattern 4. The question is whether your agent’s users have continuity requirements over months of interaction and need the agent to reason about history as a narrative. Research assistants, long-running project agents: yes. Customer service bots, code assistants, scheduling agents: no.

The Decision Matrix

Pattern	Facts / User	Cross-session	Open-ended History	Relative Cost	Upgrade Trigger
1 — In-session compression	None (stateless)	No	No	$	User returns next session and agent does not know them
2 — KV fact store	1–100 (enumerable)	Yes	No	$$	User asks "tell me about my projects" — lookup cannot answer
3 — Vector semantic recall	100–10,000	Yes	Yes	$$$	Agent must reason over its own session history as narrative
4 — Episodic structured log	Unbounded	Yes	Yes (full)	$$$$	You have hit Pattern 4 — consider cost controls instead

The upgrade path is linear: 1 → 2 → 3 → 4. Each step adds complexity only when the prior step’s limits are actually reached. Skipping steps is not an efficiency gain — it is premature optimization.
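The matrix can be read as a short chain of questions. The function below is one hedged way to encode it; the name choose_pattern and its three boolean parameters are illustrative simplifications, and the facts-per-user column is left out because the upgrade triggers in the matrix are query-driven, not count-driven.

```python
def choose_pattern(
    cross_session: bool,
    open_ended_history: bool,
    narrative_over_months: bool,
) -> int:
    """Return the lowest-numbered pattern that satisfies the stated needs."""
    if not cross_session:
        return 1  # in-session compression is enough
    if not open_ended_history:
        return 2  # enumerable facts, direct lookup
    if not narrative_over_months:
        return 3  # semantic recall layered on the KV store
    return 4      # full episodic log


# The team from the opening story: five known facts, cross-session, no history queries.
print(choose_pattern(cross_session=True, open_ended_history=False, narrative_over_months=False))  # 2
```

The early returns mirror the linear upgrade path: each later pattern is only reachable after the earlier question has been answered yes.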

Building Pattern 2: A Complete Implementation

The minimal viable KV memory store is small enough to show completely.

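-- Written to run on both SQLite and Postgres: the application supplies ids
-- and timestamps, so no database-specific defaults are needed.
CREATE TABLE memories (
    id          TEXT PRIMARY KEY,      -- UUID string generated by the caller
    user_id     TEXT NOT NULL,
    entity      TEXT NOT NULL,
    attribute   TEXT NOT NULL,
    value       TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL,  -- ISO-8601 UTC timestamp
    valid_until TIMESTAMPTZ            -- NULL while the fact is active
);

-- Partial index: only active facts are indexed for lookup.
CREATE INDEX idx_memories_lookup
    ON memories(user_id, entity, attribute)
    WHERE valid_until IS NULL;

import sqlite3
import uuid
from datetime import datetime, timezone

# Placeholders use sqlite3's "?" style; Postgres drivers such as psycopg use "%s".

def upsert_fact(conn: sqlite3.Connection, user_id: str, entity: str, attribute: str, value: str) -> None:
    """Store a fact. Supersedes any existing active fact for this entity+attribute."""
    now = datetime.now(timezone.utc).isoformat()
    # "with conn" commits both statements together, or rolls back if either fails.
    with conn:
        # Retire the currently active fact for this key, if there is one.
        conn.execute(
            "UPDATE memories SET valid_until = ? "
            "WHERE user_id = ? AND entity = ? AND attribute = ? AND valid_until IS NULL",
            (now, user_id, entity, attribute),
        )
        # Insert the new value as the active fact.
        conn.execute(
            "INSERT INTO memories (id, user_id, entity, attribute, value, created_at) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), user_id, entity, attribute, value, now),
        )

def get_facts(conn: sqlite3.Connection, user_id: str, entity: str) -> dict[str, str]:
    """Return all active facts for an entity as {attribute: value}."""
    rows = conn.execute(
        "SELECT attribute, value FROM memories "
        "WHERE user_id = ? AND entity = ? AND valid_until IS NULL",
        (user_id, entity),
    ).fetchall()
    return {row[0]: row[1] for row in rows}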

Seven columns. One index. Two short Python functions. The same design works with SQLite for local agents and Postgres for production. Recall is the production version of this, with typed extraction, contradiction detection, scoring, decay, and hybrid search, but the foundation is what you see above.

If you are building a new agent and you need cross-session memory, start here. Add Pattern 3 when a user asks an open-ended history question that this cannot answer.
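To see what the valid_until column buys you, here is a self-contained run against in-memory SQLite with a trimmed version of the schema above. The supersede and fact_history helpers and the fixed timestamps are assumptions made for the demonstration. The point is that superseded facts are retired, not deleted, so the full history of a key stays queryable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    " user_id TEXT, entity TEXT, attribute TEXT, value TEXT,"
    " created_at TEXT, valid_until TEXT)"
)


def supersede(conn, user_id, entity, attribute, value, now):
    """Retire the active fact for this key, then insert the new one."""
    with conn:  # both statements commit together
        conn.execute(
            "UPDATE memories SET valid_until = ? "
            "WHERE user_id = ? AND entity = ? AND attribute = ? AND valid_until IS NULL",
            (now, user_id, entity, attribute),
        )
        conn.execute(
            "INSERT INTO memories (user_id, entity, attribute, value, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (user_id, entity, attribute, value, now),
        )


def fact_history(conn, user_id, entity, attribute):
    """Every value this key has held, oldest first, with its validity window."""
    return conn.execute(
        "SELECT value, created_at, valid_until FROM memories "
        "WHERE user_id = ? AND entity = ? AND attribute = ? ORDER BY created_at",
        (user_id, entity, attribute),
    ).fetchall()


supersede(conn, "u1", "user", "timezone", "Europe/Berlin", "2026-01-10T09:00:00+00:00")
supersede(conn, "u1", "user", "timezone", "America/New_York", "2026-04-02T14:30:00+00:00")

active = conn.execute(
    "SELECT value FROM memories WHERE user_id = 'u1' AND entity = 'user' "
    "AND attribute = 'timezone' AND valid_until IS NULL"
).fetchall()
print(active)  # [('America/New_York',)]
for row in fact_history(conn, "u1", "user", "timezone"):
    print(row)
```

Only one row per key is ever active, which is why the contradiction check needs no similarity threshold: the key either matches or it does not.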

Failure Mode: Premature Optimization

The 3-week vector store story from the opening is not unusual. It is the default path for teams that discover “agents need memory” and reach directly for the tooling they have seen in demos and benchmarks.

The diagnostic question: Can you enumerate all the things your agent needs to remember? Write them down. If the list has fewer than 100 items and they all have clear attribute names — name, timezone, language preference, plan tier, recent project name — Pattern 2 is correct.

Two detection signals that you have over-built:

The embedding model is your largest dependency but you never run queries that could not be answered by a key lookup.
Contradiction detection is complex or absent because your architecture has no structured fields to compare — just semantic similarity between free-text memory entries.

The fix is not to rebuild from scratch. Add entity, attribute, and value columns to your existing memory table, populate them for structured facts, and route key lookups through direct SQL. Semantic search can coexist — keep it for the queries it actually helps.
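One way to picture that routing, as a sketch under stated assumptions: the caller passes an (entity, attribute) key when it can name one, direct lookup is tried first, and semantic search is the fallback. The recall function, the stand-in back ends, and the sample data are all hypothetical.

```python
from typing import Callable, Optional


def recall(
    query: str,
    key: Optional[tuple[str, str]],
    kv_lookup: Callable[[str, str], Optional[str]],
    semantic_search: Callable[[str], list[str]],
) -> tuple[str, list[str]]:
    """Try direct lookup first; fall back to semantic search.

    `key` is an (entity, attribute) pair when the caller can name one, else None.
    Returns which path answered, plus the matching memory values.
    """
    if key is not None:
        value = kv_lookup(*key)
        if value is not None:
            return "kv", [value]
    return "semantic", semantic_search(query)


# Stand-ins for the two back ends (hypothetical data).
facts = {("user", "timezone"): "Europe/Berlin"}
notes = ["Migrated the auth service to OAuth2 in March"]


def kv_lookup(entity: str, attribute: str) -> Optional[str]:
    return facts.get((entity, attribute))


def semantic_search(query: str) -> list[str]:
    # Crude word overlap; a real system would rank with BM25 and/or embeddings.
    words = query.lower().split()
    return [n for n in notes if any(w in n.lower() for w in words)]


print(recall("what is my timezone", ("user", "timezone"), kv_lookup, semantic_search))
print(recall("tell me about the auth service", None, kv_lookup, semantic_search))
```

The first call is answered by key and never touches the search path. Only the second, which has no key, pays for semantic search.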

Earning Pattern 3 or 4 means hitting Pattern 2’s limits first. The limit is when a user asks an open-ended question about their history that direct lookup cannot answer. Not before.

Production Checklist

	Item	Score
	You can enumerate the facts your agent needs to remember — or have consciously accepted that you cannot and chosen Pattern 3 accordingly.
	You have assigned a pattern (1, 2, 3, or 4) based on the decision matrix before writing implementation code.
	If Pattern 2: schema has at minimum (user_id, entity, attribute, value, valid_until). Contradiction detection is a single UPDATE + INSERT, not a semantic similarity check.
	If Pattern 3: you have a plan for embedding maintenance (re-embedding on text changes, index rebuild schedule) and a fallback to direct lookup for structured facts.
	If Pattern 4: you have justified full episodic logging over Pattern 3 by identifying a specific query type that requires reasoning over session history.
	You have a written upgrade trigger: the exact condition under which you will move from your current pattern to the next one.


Resources

Generative Agents: Interactive Simulacra of Human Behavior

Park et al. 2023 — arXiv:2304.03442

The canonical Pattern 4 architecture. Introduces the memory stream, reflection passes, and three-component retrieval scoring (recency + importance + relevance). The intellectual lineage for importance-scored retrieval in both Mem0 and Recall.

Lost in the Middle: How Language Models Use Long Contexts

Liu et al. 2023 — arXiv:2307.03172

Documents the U-shaped retrieval performance curve in long-context LLMs: performance degrades significantly when relevant content is in the middle of the context window. The empirical basis for Pattern 1's lost-in-the-middle risk.

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0 Research Team — arXiv:2504.19413

Evaluates Mem0 on the LOCOMO benchmark. Documents Mem0's hybrid KV+vector+graph architecture and the case for separate storage layers for structured versus semantic memory. Reports a 26% relative improvement over the OpenAI baseline on the LLM-as-a-Judge metric.

Indexing for Rapid Random-Access Memory Systems

H. A. M. Dumey — Computers and Automation, 1956

One of the two contested primary sources for hash tables — the other being Luhn's 1953 IBM memo. The attribution is contested; the principle is not: structure beats brute search for known-key lookups. The history anchor for Pattern 2.

Recall: Persistent Memory Layer for AI Agents

szl-recall on PyPI — github.com/Sentient-Zero-Labs/szl-recall

The Pattern 2 + optional Pattern 3 implementation used throughout this series. pip install szl-recall for structured KV; pip install szl-recall with the embeddings extra for hybrid BM25+dense search. The production version of the minimal implementation shown above.

Memory in AI Systems is a seven-issue series from Sentient Zero Labs. Issue 6 covers failure modes: what goes wrong when you get the pattern right but the implementation wrong, including summarization drift, embedding staleness, and contradiction storms.

Until next issue,

Sentient Zero Labs