← Technical Series
Memory in AI Systems Issue 7/7

Building a Memory System End-to-End

Schema to governance — the complete implementation. Every component from the series assembled, every decision justified, and the migration path from SQLite to Postgres+pgvector.

May 12, 2026 · 19 min read · Sentient Zero Labs
In this issue (13 sections)

Early in Recall’s development, store_memory was synchronous. Every agent call that triggered a memory write blocked the caller for 800ms–2000ms — extraction plus database write. In production, this manifested as a conversation assistant that stuttered on any turn where the user shared a meaningful preference. A new coding preference mentioned in passing: 1.4 seconds of silence. A schedule change: 1.8 seconds. Those are exactly the turns where the agent should be most responsive — the turns where something important is being said.

The fix was the async-acknowledge pattern: store_memory returns a job_id immediately, extraction happens in a background worker, the caller never waits. This is why Recall’s store_memory returns {status: "queued", job_id: "..."} rather than the stored memory itself.

Every system design decision in this final issue has a story like this behind it. The valid_until column instead of DELETE — that came from losing contradiction history in the first schema. The ContextVar injection — that came from a user_id isolation bug in the first multi-tenant test. The tool_description_validator at startup — that came from a test that proved tool description poisoning worked.

This is what the complete system looks like when those decisions are baked in.


What We Built Across Seven Issues

Issue 1 established the reframe: memory is not storage. It is belief state management. Issue 2 designed the extraction pipeline and the MemoryUnit schema. Issue 3 built contradiction detection and the DecayWorker — the phase everyone skips. Issue 4 designed hybrid retrieval with a 4-component scoring model. Issue 5 mapped the four design patterns and argued most teams should start with Pattern 2. Issue 6 named the failure modes: context rot, summarization drift, confirmation loops, injection attacks, and the right to erasure.

What we shipped: a local-first persistent memory layer with async extraction, decay-adjusted retrieval, contradiction detection, bearer auth with SHA-256 hashed tokens, a 30-second timeout middleware, and a 59-test suite. Six MCP tools. One A2A endpoint. One SQLite file.

What each decision cost, and what it bought — that is what this issue is about.


The Complete Data Flow

One memory. From store_memory call to search_memories return. Every component named.


Agent calls store_memory(text, topic, idempotency_key)
       │
       ▼
BearerAuthMiddleware
hash(token) → lookup api_tokens → set user_id_ctx ContextVar
       │
       ▼
store_memory handler
INSERT OR IGNORE INTO operations (idempotency_key, user_id, status='queued')
rowcount == 0 → return {cached: true}  [idempotent duplicate]
rowcount == 1 → enqueue job to asyncio.Queue
return {queued: true, job_id: "..."} in <10ms
       │
       ▼ (async — background worker)
ExtractionWorker._process(job)
UPDATE operations SET status='processing'
POST to claude-haiku: _EXTRACTION_PROMPT.format(topic, text)
Parse JSON array → list of {text, type, importance, confidence, entity, attribute, value}
On failure → _extract_stub_fallback (raw text as single fact)
       │
       ▼
_handle_contradiction(db, memory, user_id)
if entity+attribute set:
  SELECT id FROM memories WHERE user_id=? AND entity=? AND attribute=?
                            AND valid_until IS NULL AND value!=?
  UPDATE memories SET valid_until=NOW(), superseded_by=new_id
       │
       ▼
INSERT INTO memories (id, user_id, text, type, topic, importance, confidence,
                    entity, attribute, value, valid_from, valid_until=NULL, ...)
UPDATE operations SET status='complete'
       │
       ▼ (later — when agent needs context)
search_memories(query, limit=20, recency_weight=0.3)
       │
       ▼
_hybrid_search(user_id, query, limit, recency_weight)
SELECT ... FROM memories WHERE user_id=? AND valid_until IS NULL LIMIT limit*5
BM25Plus ranks + Dense cosine ranks (optional) → RRF fusion (k=60)
4-component score: w_rrf·RRF + w_recency·exp(-Δt/(1+ac))
                 + w_import·importance·decay + w_strength·log(1+ac)
Sort descending → return top N
UPDATE access_count, last_accessed for returned memories
       │
       ▼
return {results: [...], total: N}

This is the complete lifecycle. WRITE (extraction and storage) → MANAGE (contradiction detection, decay scoring, consolidation) → READ (hybrid retrieval and 4-component scoring). Every issue in this series maps to one segment of this flow.

The full system architecture, showing the middleware stack, routing, and storage layer:


                    Agent / Claude Desktop
                           │
                  Bearer <token>
                           │
                           ▼
           ┌───────────────────────────────┐
           │       BearerAuthMiddleware     │  hash(token) → api_tokens
           │       TimeoutMiddleware        │  asyncio.wait_for(30s)
           │       LoggingMiddleware        │  ToolCallRecord per call
           └───────────────┬───────────────┘
                           │
             ┌─────────────┴──────────────┐
             │                            │
             ▼                            ▼
      POST /mcp                       POST /a2a
      FastMCP tools                   A2A task router
             │                            │
             │  store_memory              │  consolidate_memories
             │    ├── INSERT operations   │    ├── worker.extract() [sync]
             │    ├── enqueue job ──────► │    ├── detect contradictions
             │    └── return <10ms        │    └── bulk_store / input-required
             │                            │
             │  search_memories           GET /a2a/<id>   poll
             │    └── _hybrid_search      POST /a2a/<id>/resume
             │         ├── BM25Plus ranks
             │         ├── dense ranks (optional)
             │         ├── RRF fusion
             │         └── 4-component score
             │
             ▼
       ┌──────────────┐     asyncio.Queue
       │   SQLite DB  │ ◄──────────────────── ExtractionWorker
       │   (WAL mode) │                          │
       └──────────────┘                   Claude Haiku API
                                          extract → structured facts
                                          → _handle_contradiction
                                          → INSERT memories

Three layers: middleware stack, routing (MCP and A2A), storage with background workers. The middleware stack is order-dependent — LoggingMiddleware is outermost (wraps the full request duration), BearerAuthMiddleware runs next (sets user_id before any tool handler), TimeoutMiddleware runs innermost (enforces the 30-second limit on the actual tool work). Starlette applies middleware in reverse registration order — which is why they are registered Logging → Timeout → Bearer in create_app().


Schema Design Decisions

Five tables in recall.db. Each one earns its place.

TablePurposeKey Columns
`memories`Core store — all memory units. Active rows have `valid_until IS NULL`; superseded rows retained for audit and temporal reconstruction.`user_id`, `text`, `type`, `entity`, `attribute`, `value`, `importance`, `confidence`, `decay_score`, `valid_until`, `superseded_by`, `created_at`, `last_accessed`, `access_count`, `embedding`
`operations`Idempotency and async extraction job lifecycle. Tracks `queued → processing → complete/failed`.`idempotency_key` (UNIQUE), `user_id`, `status`
`api_tokens`Bearer token auth. Stores SHA-256 hash of token — raw token never stored.`token_hash` (UNIQUE), `user_id`, `revoked`
`tool_call_records`Structured audit log per tool call. PII-safe: inputs stored as SHA-256 hash only.`tool_name`, `user_id`, `inputs_hash`, `status`, `duration_ms`, `llm_tokens_in`, `cost_usd`
`schema_version`Migration tracking. Single row with version integer.`version`, `applied_at`, `description`

The memories table DDL (key columns):

CREATE TABLE IF NOT EXISTS memories (
    id              TEXT PRIMARY KEY,
    user_id         TEXT NOT NULL,           -- partition key: every query filters on this
    text            TEXT NOT NULL,           -- extracted natural-language memory
    type            TEXT NOT NULL,           -- preference|fact|decision|procedure
    topic           TEXT,
    importance      REAL DEFAULT 0.5,        -- 0.0–1.0, LLM-assigned at extraction
    confidence      REAL DEFAULT 0.8,        -- 0.0–1.0, LLM-assigned at extraction
    decay_score     REAL,                    -- updated by DecayWorker on schedule
    access_count    INTEGER DEFAULT 0,       -- incremented on retrieval
    entity          TEXT,                    -- structured fact: subject ("user", "project-x")
    attribute       TEXT,                    -- structured fact: property ("preferred_language")
    value           TEXT,                    -- structured fact: value ("Python")
    valid_from      TEXT,                    -- when this fact became true
    valid_until     TEXT,                    -- NULL = active; set on contradiction detection
    superseded_by   TEXT,                    -- FK to memories.id that replaced this one
    embedding       BLOB,                    -- L2-normalized float32 vector (SQLite: BLOB)
    created_at      TEXT NOT NULL
);

CREATE INDEX IF NOT EXISTS idx_memories_user ON memories(user_id);
CREATE INDEX IF NOT EXISTS idx_memories_entity_attr
    ON memories(user_id, entity, attribute) WHERE valid_until IS NULL;

Why valid_until instead of DELETE for contradiction resolution: superseded memories are retained in the database — they are excluded from all active queries (WHERE valid_until IS NULL) but remain for audit. This means you can reconstruct belief state at any prior point in time. It also means you know you changed your mind, which is information in itself.

Why entity/attribute/value exists alongside text: these are not redundant. text is the natural-language memory used for BM25Plus and dense retrieval. entity/attribute/value is the structured fact used for deterministic contradiction detection. When the extraction LLM returns {entity: "user", attribute: "employer", value: "Acme Corp"}, the contradiction check queries WHERE entity='user' AND attribute='employer' AND valid_until IS NULL AND value != 'Acme Corp'. That query is exact-match deterministic — no semantic similarity, no threshold. Two complementary indexes: one for retrieval, one for consistency.

The history anchor here is E.F. Codd. Codd’s 1970 relational model paper introduced the principle that data should be organized into relations with clearly identified keys, queryable independently of physical layout. The user_id in every table is a partition key — the foundational Codd principle applied to multi-tenancy. Multi-tenant memory isolation is not a new problem. It is the old problem, applied to a new storage medium. The relational model gave us the tools 56 years ago.


The Extraction Pipeline

store_memory returns in under 10 milliseconds. This is a design constraint, not an implementation detail. MCP tool calls are synchronous from the caller’s perspective — the agent is waiting for a response. If extraction blocks the response, you have injected Claude Haiku’s ~1-3 second latency into every conversation turn. At 50 turns per session, that is minutes of accumulated wait time.

The async-acknowledge pattern: store_memory inserts an operations row, enqueues a job to an asyncio.Queue(maxsize=1000), and returns immediately. The ExtractionWorker — a separate asyncio.Task in the same event loop — dequeues and processes jobs one at a time.

The extraction loop in pseudocode:

# store_memory → enqueue → extract → contradiction check → INSERT

async def store_memory(text, topic, idempotency_key, session_id="", agent_id=""):
    user_id = user_id_ctx.get()  # from BearerAuthMiddleware, never a tool arg

    job_id = str(uuid.uuid4())
    cursor = await db.execute(
        "INSERT OR IGNORE INTO operations (id, idempotency_key, user_id, status) "
        "VALUES (?,?,?,'queued')", (job_id, idempotency_key, user_id)
    )
    if cursor.rowcount == 0:
        return {"status": "ok", "data": {"queued": False, "cached": True}}

    await extraction_worker.enqueue({
        "job_id": job_id, "user_id": user_id,
        "text": text, "topic": topic,
    })
    return {"status": "ok", "data": {"queued": True, "job_id": job_id}}

# Background worker — runs as asyncio.Task
async def _process(job):
    await db.execute("UPDATE operations SET status='processing' WHERE id=?", [job["job_id"]])

    try:
        memories = await _extract_with_llm(job["user_id"], job["text"], job["topic"])
    except Exception:
        memories = _extract_stub_fallback(job["user_id"], job["text"], job["topic"])

    for memory in memories:
        await _handle_contradiction(db, memory, user_id=job["user_id"])
        await db.execute(
            "INSERT OR IGNORE INTO memories (id, user_id, text, type, topic, "
            "importance, confidence, entity, attribute, value, valid_from, created_at) "
            "VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", [...]
        )

    await db.execute("UPDATE operations SET status='complete' WHERE id=?", [job["job_id"]])
    await db.commit()

On any API failure or JSON parse error, _extract_stub_fallback produces a single fact memory with the raw text truncated to 500 characters. Extraction never drops data; it degrades gracefully.

At 1,000 stores per day: the queue processes ~12 jobs per minute. At Haiku’s ~300ms per call, throughput capacity is ~200 jobs/minute — 15x headroom. The limiting factor at scale is not extraction speed but the Anthropic API rate limit.


The MANAGE Pipeline

Three processes keep belief state accurate between extractions.

Contradiction detection runs inside every extraction job, immediately after the LLM returns. _handle_contradiction queries for existing active memories with the same entity and attribute but a different value. If found, it sets valid_until on the old memory and superseded_by to the new memory’s ID. This is deterministic — no LLM call, no similarity threshold. It runs on every extraction, silently, without the agent doing anything.

DecayWorker runs on a schedule (default: every 3600 seconds):

raw_decay    = exp(-λ · age_since_last_access_days)    [λ=0.02, ~35-day half-life]
access_boost = min(1.0, log(1+ac) / log(1+BOOST_CAP))  [ac=access_count, BOOST_CAP=10]
decay_score  = raw_decay + (1 - raw_decay) * access_boost

A memory you access regularly stays near decay_score = 1.0. A memory you never access decays toward zero over ~35 days. The decay_score is then multiplied into the importance component of the retrieval scoring formula. Memories that are not relevant die quietly; memories that keep getting retrieved stay strong.

consolidate_memories is on-demand or scheduleable. It embeds all active memories for a topic, clusters by cosine similarity (threshold 0.85), and sends each cluster to Claude Haiku for a merge pass. N similar memories become 1 canonical memory. The originals are superseded. This is the memory compression operation — without it, every conversation adds memories and the store grows unboundedly.

Three processes. All three are required for a memory store that stays accurate over months of use.


Retrieval and Context Budget

search_memories is where all three phases converge:

  1. Fetch the most recent limit * 5 active memories for the user (pre-filter by recency)
  2. Compute BM25Plus ranks over all fetched texts
  3. Compute dense cosine similarity ranks (optional — requires szl-recall[embeddings])
  4. Fuse via Reciprocal Rank Fusion (k=60)
  5. Score each result with the 4-component formula
  6. Return top limit

The 4-component scoring formula:

Weights shift linearly with recency_weight (0.0–1.0):
  recency_weight=0 → (w_rrf=0.70, w_recency=0.00, w_import=0.20, w_strength=0.10)
  recency_weight=1 → (w_rrf=0.40, w_recency=0.40, w_import=0.10, w_strength=0.10)

score = w_rrf      · RRF(BM25, cosine)
      + w_recency  · exp(-age_days / (1 + access_count))
      + w_import   · importance * decay_score
      + w_strength · log(1 + ac) / log(1 + max_ac)

The recency_weight parameter is the context budget lever: for tasks where freshness matters (current project state, today’s meeting), increase it; for tasks where long-term preferences matter (coding style, communication tone), leave it at the default 0.3.

The limit parameter is the hard context budget. limit=5 for tight context windows; limit=20 for agents with more headroom. Memories with zero RRF score (no BM25 or dense signal) are suppressed entirely — no filler retrieved. After retrieval, access_count and last_accessed are updated for all returned memories. This is how the DecayWorker learns which memories matter.


Security Architecture

Four layers. Each one addresses a specific failure mode.

Bearer token → SHA-256 → api_tokens table: the raw token is generated at recall create-token time and shown once. The SHA-256 hash is stored. If the database file is exfiltrated, the attacker has hashes — not reversible to valid tokens. Defense against database file compromise.

user_id via ContextVar — never via tool argument: BearerAuthMiddleware validates the token, resolves the user_id, and sets a ContextVar on the request. Every tool handler reads user_id_ctx.get(). There is no user_id parameter on any MCP tool. Defense against prompt injection: a user cannot pass a different user_id to access another user’s memories, because the user_id is never under their control.

Tool description validation at startup: validate_tool_descriptions runs before the server accepts requests. It checks every tool docstring against six poisoning patterns: URLs, conditional behavior instructions, chained call instructions, classic prompt injection phrases, exfiltration instructions, and secrecy instructions. Supply chain protection: the validator runs before the server accepts connections — a malicious dependency that overwrites tool descriptions with prompt injection content will be caught before it can run.

30-second timeout: TimeoutMiddleware wraps every request in asyncio.wait_for(30.0). A runaway tool call returns 504 TOOL_TIMEOUT rather than blocking the agent indefinitely. Defense against resource exhaustion.

Every security decision is a specific tradeoff. The ContextVar injection adds complexity. The startup validation adds a few milliseconds at boot. The timeout rejects valid-but-slow calls. Each cost is accepted because the failure mode it prevents is worse.


SQLite → Postgres: The Migration Path

Recall v0.3 uses SQLite in WAL mode. This is correct for its design constraints: single-process deployment, asyncio event loop serializes all writes, no native vector index needed at moderate scale, zero operational overhead.

CapabilitySQLite + RecallPostgres + pgvector
Setup`pip install szl-recall`Postgres + pgvector extension
Vector searchJSON-encoded, Python-side ANNNative HNSW via pgvector
Concurrent writesSingle-writerFull MVCC
Row-level securityApplication-layer (ContextVar)Database-layer (RLS policies)
Scale ceiling~100K memories / userUnbounded
Migration effortZero (start here)Schema migration + pgvector install
When to migrateConcurrent write contention, >100K memories/user, compliance RLS requirement

Where SQLite breaks: concurrent writes across processes produce database is locked errors. Dense vector search requires fetching all embeddings into Python memory — expensive at 50,000+ memories per user. No row-level security means isolation depends entirely on application code correctness.

The migration requires three changes:

1. Swap the driver

# SQLite (current)
import aiosqlite
async with aiosqlite.connect(get_db_path()) as db: ...

# Postgres
import asyncpg
pool = await asyncpg.create_pool(dsn=DATABASE_URL, min_size=5, max_size=20)
async with pool.acquire() as conn: ...

2. Add pgvector and the HNSW index

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE memories ADD COLUMN embedding vector(384);

CREATE INDEX CONCURRENTLY ON memories
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

With this index, the dense retrieval step moves from Python (fetch all, compute in memory) to a single indexed SQL query — and runs in <10ms even at millions of rows.

3. Add row-level security

ALTER TABLE memories ENABLE ROW LEVEL SECURITY;

CREATE POLICY memories_user_isolation ON memories
    AS PERMISSIVE FOR ALL
    USING (user_id = current_setting('app.current_user_id', true));

-- At connection open time, before any query:
-- SET app.current_user_id = 'user-abc-123';

The schema itself — column names, types, relationships — is unchanged. SQL queries are largely identical (parameter placeholders differ: ?$1). The migration is a dependency swap and an extension installation, not a schema redesign.


Multi-Tenant Isolation

Every query in Recall includes WHERE user_id = ?. This is not optional.

The data leakage scenario without it: a query that omits the user_id filter returns memories from all users. In a semantic search context, user B’s query about “morning meeting preferences” retrieves user A’s stored fact about their work schedule. Silent. No error. A data breach disguised as a helpful response.

The ContextVar injection pattern prevents this at the application layer:

user_id = user_id_ctx.get()  # set by BearerAuthMiddleware from validated token
rows = await db.execute_fetchall(
    "SELECT ... FROM memories WHERE user_id = ? AND valid_until IS NULL ...",
    (user_id, ...)
)

The user never controls the user_id value — it comes from auth. The ContextVar is the defense against incorrect manual wiring. If you add a new tool and forget to filter by user_id, the ContextVar is already there — you just need to call .get().

For Postgres in production, row-level security adds a database-layer enforcement on top of the application-layer filter:

-- Even a query that omits WHERE user_id returns only the current user's rows
SET app.current_user_id = 'user-abc-123';
SELECT * FROM memories WHERE valid_until IS NULL;  -- safe: RLS enforces isolation

Two independent layers that would both need to fail for a data leakage bug to reach production. Row-level security enforces Codd’s principle at the correct layer — the database, not the application code.


GDPR: Complete Erasure

Article 17 of GDPR requires data controllers to delete personal data without undue delay on request. The EDPB’s February 2026 coordinated enforcement report on Article 17 — 32 data protection authorities, 2025 action — confirmed this requirement is actively scrutinized.

For Recall, complete erasure is a single transaction:

BEGIN;

DELETE FROM memories WHERE user_id = $1;
DELETE FROM operations WHERE user_id = $1;
DELETE FROM api_tokens WHERE user_id = $1;

-- Anonymize audit log: preserve operational metrics, remove PII
UPDATE tool_call_records
SET user_id = 'erased-' || encode(sha256(user_id::bytea), 'hex')
WHERE user_id = $1;

COMMIT;

The embedding column — BLOB in SQLite, vector(384) in Postgres+pgvector — is part of the memories row. Deleting the row deletes the embedding. No separate vector store means no secondary deletion step. The same DELETE FROM memories WHERE user_id = $1 statement removes both the relational data and the vector.

The tool_call_records table contains user_id. The inputs_hash is a SHA-256 hash of the inputs — not the inputs themselves. This was a deliberate schema decision: audit trail without PII accumulation. Whether user_id itself requires erasure or anonymization depends on whether it maps to a real identity. Safe default: anonymize.

The complete erasure story for Recall is simpler than most production systems precisely because it is a single-file SQLite database. One file, one transaction, done. A distributed deployment with multiple shards requires coordinated deletion across all nodes. That complexity is the production tax.


When to Use Recall

SystemBest ForMemory ModelDeploymentOpen Source
Recall (szl-recall)Production agents needing structured + semantic memory, local-first, MCP-compatibleEntity/attribute/value + BM25+vector hybridpip install, SQLite or PostgresYes (MIT)
Mem0Managed memory-as-a-service, fastest path to selective memoryGraph + vector + KV hybridAPI or self-hostedPartial
Letta / MemGPTLong-horizon agents with complex memory schemas, OS-level agentsIn-context memory blocks + archival storageSelf-hosted serverYes
ZepProduction-scale temporal KG, bi-temporal tracking, enterpriseTemporal knowledge graphSelf-hosted or managedPartial

Use Recall if: you want a local-first SQLite memory layer with zero operational overhead; you are building on MCP and want six tools that work out of the box; you want to understand how production memory systems work at the source code level; your scale is single-agent, single-process, under ~50,000 memories per user.

Do not use Recall if: you need >100,000 memories per user (Postgres+pgvector — on the Recall roadmap); you need a managed SaaS memory layer (Mem0 — 66.9% LOCOMO accuracy, 1.44s p95 latency, 19 vector store backends); you need graph-based temporal knowledge graph with relationship tracking (Zep); you need fine-grained programmatic control over what is in context at each turn (Letta/MemGPT).

Recall v0.3 is a local-first tool. It is not trying to be Mem0. It is trying to be the system you build when you want to understand what Mem0 is doing — and the system that is appropriate when local-first SQLite is sufficient.


Production Checklist

Item Score
Every DB query scoped by user_id — injected from auth, never from tool arguments
Contradiction detection runs automatically on every extraction — not a manual operation
Decay scoring is scheduled — memories degrade over time without manual intervention
GDPR erasure tested: DELETE FROM memories WHERE user_id covers text and embeddings
Tool description validation runs at startup — server refuses to start on poisoning patterns
store_memory returns in &lt;10ms — extraction is async and never blocks the calling agent
0 of 6

These are not nice-to-haves. Each one corresponds to a documented failure mode in the systems that skipped it.


Resources


Closing

Seven issues. One memory system, built in public.

We started with the reframe: memory is not storage, it is belief state management. We built the extraction pipeline, the contradiction detector, the decay worker, the hybrid retrieval function, the four design patterns, and the failure mode taxonomy. We shipped code in every issue. We anchored on Recall — a working, tested, deployable system.

This is where it ends. The WRITE → MANAGE → READ loop is complete. The schema is documented. The migration path to Postgres is clear. The GDPR erasure is a single SQL call.

If you build something with Recall, or build your own — the architecture decisions are the same. WRITE, MANAGE, READ. All three phases. Every time.

The agent that does not implement all three phases does not have memory. It has a cache. There is a difference, and it matters more than it looks like it should.

Until next issue,

Sentient Zero Labs