Memory in AI Systems Issue 2/7

The WRITE Phase — What to Remember and How

Not every turn deserves a memory. How you design the extraction step and score importance and confidence at write time determines whether your memory store improves or degrades.

May 12, 2026 · 18 min read · Sentient Zero Labs


A user is chatting with an AI assistant mid-conversation. They are discussing a hypothetical medical scenario and say, offhandedly: “I’m actually a doctor, so walk me through the mechanism more technically.”

They are not a doctor. They were using a rhetorical framing to request more depth. Six turns later, responding to an unrelated question about interpreting lab results, the agent volunteers: “Since you have a medical background, you’ll know that…”

The memory persisted. The agent had stored it as fact. The extraction pipeline read “I’m actually a doctor” and produced {text: "User is a doctor", type: "fact", importance: 0.8, confidence: 0.8}. Confidence: 0.8. Same default as everything else.

This is a constructed scenario, not a documented production incident, but it represents a real and documented vulnerability class. The MINJA attack study (arXiv:2601.05504, 2025) demonstrated that Gemini-2.0-Flash accepted 54 malicious memory injections with trust scores of 1.0, treating adversarially crafted content as ground truth. The memory security survey (arXiv:2604.16548) explicitly names roleplay framing and hypothetical statements as the primary provenance failure vectors in production memory systems. The mechanism is identical whether the bad input comes from a malicious actor or an offhand rhetorical framing.

The root cause is not the extraction model. It is the absence of a schema that can carry epistemic metadata at write time. The doctor gets confidence: 0.8 because everything gets confidence: 0.8. The fix is not prompt engineering on the extraction side — it is a schema that forces the extraction step to be honest about certainty.

In Issue 1, we established the architectural reframe: memory is belief state management, not storage. The WRITE phase is where that belief state gets populated — and the decisions made at write time determine whether the state is accurate or not. A memory unit is a typed, scored data structure. Getting the WRITE phase right determines whether your memory store improves or degrades over time.

CONVERSATION TURN
      │
      ▼
[EXTRACTION LLM (Haiku)]
      │  JSON array of MemoryUnit objects
      ▼
[IMPORTANCE + CONFIDENCE SCORER]
      │  scored, typed, timestamped
      ▼
memories table (SQLite)

What Is a Memory Unit

A memory is not a string. It is a typed, scored, timestamped data structure — and every field in that structure earns its place by solving a specific problem that a bare string cannot.

Here is Recall’s MemoryUnit dataclass (fields and type definitions):

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class MemoryType(str, Enum):
    PREFERENCE = "preference"    # stated likes/dislikes, settings
    FACT = "fact"                # verifiable information about the user
    DECISION = "decision"        # choices the user has made
    PROCEDURE = "procedure"      # steps or workflows the user follows


@dataclass
class MemoryUnit:
    id: str
    user_id: str
    text: str                           # extracted natural-language fact
    type: MemoryType
    topic: str
    importance: float = field(default=0.5)
    confidence: float = field(default=0.8)
    source_session: str = field(default="")
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime | None = field(default=None)
    access_count: int = field(default=0)
    decay_score: float | None = field(default=None)
    superseded_by: str | None = field(default=None)
    embedding: list[float] | None = field(default=None)

Each field maps to a specific retrieval or maintenance operation — the rationale for each is worth stating explicitly.

text is the natural-language fact — one to two sentences, self-contained enough to be read without the original conversation. “User prefers Python for backend work” is a valid text. “Python backend” is not — it needs surrounding context to be usable. At retrieval time, this string goes into the agent’s context window directly, so it has to be readable prose.

type carries the four-value enum: “preference”, “fact”, “decision”, “procedure”. This is not decoration. The type determines what can be contradicted and how. A decision can be superseded by a later decision without touching a preference. A procedure can coexist with a conflicting preference — they describe different things. Issue 3 builds contradiction detection on top of type, but the schema needs the field from day one.

topic is the namespace for retrieval scoping. When the agent is answering a question about code architecture, it should search memories within a “tech” topic — not across “personal”, “financial”, and “health” simultaneously. Topic is a broad category, not a tag. “tech”, “work”, “personal” are valid topics. “user prefers dark mode in VSCode” is not a topic — it is a text value.

importance is a 0.0–1.0 score set at extraction time, not at retrieval time. The distinction matters: a score computed at retrieval time has only the stored text to work from, because the original conversation is no longer available. An extraction-time score is set when the information is fresh and the model has full context on what was said and how. The scoring model and its research lineage are covered in the Importance Scoring section.

confidence is 0.0–1.0 and it is the mechanism by which the doctor scenario gets fixed. This field is conceptually different from importance. Importance asks: how much does this matter long-term? Confidence asks: how certain are we that this is actually true based on what was said? “I’m actually a doctor” said in a rhetorical framing warrants confidence: 0.3. “I have a peanut allergy” stated directly warrants confidence: 0.9. The field exists to carry that distinction from extraction to retrieval.

source_session is the audit trail. It records which conversation session produced this memory. When a memory surfaces at retrieval time and turns out to be wrong, source_session is how you trace it back to the originating exchange.

created_at, last_accessed, and access_count provide the temporal metadata that decay scoring needs. A preference expressed six sessions ago, never retrieved, should score differently than one expressed last week and retrieved three times since. Issue 3 builds the DecayWorker on top of these fields.
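To make the role of these three fields concrete, here is a minimal sketch of a retention score that consumes them. It is not Recall's DecayWorker (Issue 3 covers that). It borrows the forgetting-curve form R = e^(-t/S) from MemoryBank. The function name `retention`, the strength term S = 1 + access_count, and the use of days as the time unit are all assumptions made for illustration.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Optional


def retention(created_at: datetime, last_accessed: Optional[datetime],
              access_count: int, now: datetime) -> float:
    """Forgetting-curve retention R = exp(-t / S).

    t is the number of days since the memory was last touched (last access,
    or creation if it was never retrieved). S = 1 + access_count is an
    assumed strength term: each retrieval slows the decay.
    """
    reference = last_accessed or created_at
    t_days = max((now - reference).total_seconds() / 86400.0, 0.0)
    strength = 1.0 + access_count
    return math.exp(-t_days / strength)


now = datetime(2026, 5, 12, tzinfo=timezone.utc)

# Stated 42 days ago, never retrieved.
stale = retention(now - timedelta(days=42), None, 0, now)

# Stated a week ago, retrieved three times, most recently yesterday.
fresh = retention(now - timedelta(days=7), now - timedelta(days=1), 3, now)

print(f"stale={stale:.4f} fresh={fresh:.4f}")
```

The only point of the sketch is that the two memories separate cleanly once age and access history are both recorded. Without last_accessed and access_count there is nothing for a decay function to act on.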

decay_score, superseded_by, and embedding are populated later in the memory lifecycle — by the decay worker, by contradiction resolution, and by the embedding pipeline respectively. They are None at write time. The schema carries them from day one because adding columns to a production SQLite table later requires migration.

One distinction to flag explicitly: the MemoryUnit dataclass does not have entity, attribute, or value fields. These exist in the memories table in the SQLite schema — they are populated by the extraction worker at write time and used by contradiction detection in Issue 3. But they are schema-level fields, not Python dataclass fields. That distinction matters when you are instantiating objects in code versus writing queries against the database.

The Extraction Step

The extraction step is a classification problem. The question it is answering is: “Is this utterance worth storing as a permanent memory?”

Get the classification wrong in either direction and memory quality degrades faster than any retrieval bug can account for.

Over-extraction fills the store with noise: pleasantries, hypotheticals, filler acknowledgements, in-flight task state that reverses two turns later. Over-extracted memory looks fine in week one. By month three, the store is a junk drawer and retrieved context is a noise source. Every query surfaces “User said they like Python” alongside “User seemed to be having a good day” alongside “User mentioned they were tired” — all weighted similarly, all inserted indiscriminately.

Under-extraction misses the signal. Stated preferences get dropped. Factual corrections disappear. Recurring behavior patterns that would have personalized responses never make it into persistent state. The agent appears not to listen.

Input Type	Example	Decision	Reason
Stated preference	"I always use dark mode"	STORE	Explicit, user-specific, stable
Factual self-disclosure	"I'm a backend engineer"	STORE	Fact about user, useful across sessions
Decision recorded	"I chose PostgreSQL for this project"	STORE	Architectural decision, high importance
Explicit correction	"No, I use pytest not unittest"	STORE	Overrides prior belief
Filler / pleasantry	"Thanks, that's helpful!"	SKIP	No persistent signal
Transient state	"I'm tired today"	SKIP	Momentary, not stable
Hypothetical / roleplay	"What if I were a doctor?"	STORE (low confidence ≤0.3)	May be useful but marked uncertain
Sarcasm / joke	"Oh great, another meeting"	SKIP	No sincere signal

The extraction prompt is where this classification logic lives. Here is Recall’s actual _EXTRACTION_PROMPT:

_EXTRACTION_PROMPT = """\
Extract important memories from this conversation text. Return ONLY a JSON array.

Each memory object must have:
- "text": the memory content (1-2 sentences, specific and self-contained)
- "type": one of "preference", "fact", "decision", "procedure"
- "importance": float 0.0-1.0 (how important is this to remember long-term?)
- "confidence": float 0.0-1.0 (how certain are you this is accurate from the text?)
- "entity": the subject of the fact — e.g. "user", "project-alpha", "tool-x" (null if not applicable)
- "attribute": the property — e.g. "preferred_language", "works_at", "uses_framework" (null if not applicable)
- "value": the value — e.g. "Python", "Acme Corp", "FastAPI" (null if not applicable)

Rules:
- For preferences and facts: always try to extract entity/attribute/value.
- For decisions and procedures: entity/attribute/value are usually null.
- If entity+attribute+value are set, they must be consistent with the text field.
- Ignore small talk, pleasantries, and purely transient information.
- Extract 0-5 memories. Return [] if nothing is worth remembering.

Topic context: {topic}

Text:
{text}"""

Several design choices in this prompt are worth examining.

The entity/attribute/value triple is not redundant with text. Structured triples are the raw material for contradiction detection: when a new fact arrives with entity="user", attribute="works_at", value="Acme Corp", the system can query for existing memories where entity="user" and attribute="works_at" and value != "Acme Corp". That contradiction check is impossible on unstructured text.
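The lookup described above can be sketched against an in-memory SQLite database. The table here is a trimmed subset of the memories schema, `find_conflicts` is a hypothetical helper name, and the "Initech" row is made-up sample data. Recall's actual contradiction detection is the subject of Issue 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id TEXT PRIMARY KEY, user_id TEXT NOT NULL, text TEXT NOT NULL,
           entity TEXT, attribute TEXT, value TEXT, valid_until TEXT)"""
)
conn.execute(
    "INSERT INTO memories VALUES (?, ?, ?, ?, ?, ?, NULL)",
    ("m1", "u1", "User works at Initech.", "user", "works_at", "Initech"),
)


def find_conflicts(conn, user_id, entity, attribute, value):
    """Return active memories with the same entity and attribute but a different value."""
    rows = conn.execute(
        """SELECT id, value FROM memories
           WHERE user_id = ? AND entity = ? AND attribute = ?
             AND value != ? AND valid_until IS NULL""",
        (user_id, entity, attribute, value),
    )
    return rows.fetchall()


# A new fact arrives: entity="user", attribute="works_at", value="Acme Corp".
conflicts = find_conflicts(conn, "u1", "user", "works_at", "Acme Corp")
print(conflicts)  # [('m1', 'Initech')]
```

The same check against free text would need fuzzy matching or another LLM call. With the triple it is a single indexed query.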

“Extract 0–5 memories. Return [] if nothing is worth remembering.” The upper bound prevents runaway extraction from a single verbose turn. More importantly, the explicit permission to return an empty array is load-bearing. A prompt that says “extract memories from this conversation” creates implicit pressure to extract something. Zero extraction is a valid and correct output.
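On the consuming side, the 0–5 bound and the empty-array path are worth enforcing in code as well as in the prompt. The following is a minimal sketch of a response validator, not Recall's parser. The names `parse_extraction` and `_clamp` are hypothetical, and the fallback values for missing scores simply mirror the schema defaults.

```python
import json
import math

ALLOWED_TYPES = {"preference", "fact", "decision", "procedure"}
MAX_MEMORIES = 5


def _clamp(x, default: float) -> float:
    """Coerce to a float in [0.0, 1.0]; use the default if the value is unusable."""
    try:
        value = float(x)
    except (TypeError, ValueError):
        return default
    if math.isnan(value):
        return default
    return min(max(value, 0.0), 1.0)


def parse_extraction(raw: str) -> list:
    """Validate the model's JSON array. An empty list is a valid result."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    memories = []
    for item in items[:MAX_MEMORIES]:          # enforce the upper bound
        if not isinstance(item, dict):
            continue
        text, mtype = item.get("text"), item.get("type")
        if not isinstance(text, str) or not text.strip():
            continue
        if not isinstance(mtype, str) or mtype not in ALLOWED_TYPES:
            continue
        memories.append({
            "text": text.strip(),
            "type": mtype,
            "importance": _clamp(item.get("importance"), 0.5),
            "confidence": _clamp(item.get("confidence"), 0.8),
            "entity": item.get("entity"),
            "attribute": item.get("attribute"),
            "value": item.get("value"),
        })
    return memories


print(parse_extraction("[]"))  # []
```

Treating malformed output the same way as [] means a bad response stores nothing rather than storing something wrong.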

The confidence float is explicitly separate from importance. Confidence = how certain are we this is accurate based on what was said. Importance = how much does this matter long-term. These two dimensions are orthogonal. “The user is considering switching to Rust” might be importance: 0.7 and confidence: 0.5 (they said “considering,” not “I’ve decided”). A hypothetical framing should produce low confidence regardless of the content’s importance.

This extraction runs on claude-haiku-4-5-20251001. Fast, cheap, appropriate for extraction-class classification tasks. Extraction happens on every turn, so the per-call cost matters.

On the fallback: if the Anthropic API call fails, _extract_stub_fallback() activates. It stores the raw input text (truncated to 500 characters) as a single type="fact" with importance=0.5, confidence=0.8. This is a safety valve, not a design. The default confidence=0.8 on the fallback path is a known limitation — raw text stored via fallback does not go through the confidence scoring logic.
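The behavior described above can be reconstructed as a short sketch. The real _extract_stub_fallback() is not shown in this issue, so the function name, signature, and dict return type below are assumptions. Only the truncation length and the three fixed values come from the description.

```python
import uuid
from datetime import datetime, timezone

FALLBACK_MAX_CHARS = 500


def extract_stub_fallback_sketch(text: str, user_id: str, topic: str) -> dict:
    """Store the raw input as a single 'fact' when the extraction call fails."""
    return {
        "id": str(uuid.uuid4()),
        "user_id": user_id,
        "text": text[:FALLBACK_MAX_CHARS],   # truncated raw input, no extraction
        "type": "fact",
        "topic": topic,
        "importance": 0.5,
        "confidence": 0.8,                   # known limitation: never scored by the model
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


memory = extract_stub_fallback_sketch("I'm actually a doctor. " * 40, "u1", "health")
print(len(memory["text"]), memory["confidence"])
```

Note that the doctor statement from the opening scenario goes straight through this path at confidence 0.8, which is why the fallback is a safety valve and not something to rely on.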

Importance Scoring

Importance is a write-time decision, not a retrieval-time one.

Score Range	Signal	Example Memory
0.8–1.0	User explicitly states a preference; corrects the agent; repeats a fact across sessions	"User is allergic to peanuts — stated directly and emphasized"
0.5–0.8	Factual information about the user offered in context; single-session preference	"User mentioned they work at a fintech startup"
0.2–0.5	Mentioned once in passing; low predictive value for future interactions	"User mentioned in passing that they once tried Go on a side project"
skip (< 0.2)	Filler, pleasantries, within-session transient state (hypotheticals are handled through confidence, not importance)	"User said 'sounds good' after receiving a code suggestion"

Park et al. 2023 is the direct intellectual ancestor of this design. In Generative Agents, at the moment each observation enters the memory stream, the LLM is asked to rate it on a scale of 1 to 10 for poignancy: “where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a breakup, college acceptance).” That integer seeds the retrieval ranking. Combined with recency and relevance at query time, importance is one of three equal-weight components in the final retrieval score. That design decision is now in Recall’s extraction prompt — transposed to a 0.0–1.0 float, embedded directly in the extraction call rather than requiring a separate scoring pass.
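A small sketch makes the transposition and the equal-weight combination concrete. The linear mapping from the 1–10 integer to a 0.0–1.0 float is an assumption (the text above only says the scale was transposed), and both function names are hypothetical. The recency and relevance inputs are taken as already normalized to [0, 1].

```python
def poignancy_to_importance(rating: int) -> float:
    """Map a 1-10 integer rating onto 0.0-1.0 (the linear mapping is an assumption)."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    return (rating - 1) / 9


def retrieval_score(recency: float, importance: float, relevance: float) -> float:
    """Equal-weight sum of three components, each already scaled to [0, 1]."""
    return recency + importance + relevance


mundane = poignancy_to_importance(1)     # e.g. brushing teeth
poignant = poignancy_to_importance(10)   # e.g. a breakup

# A recent but mundane memory versus an older but poignant one, same relevance.
recent_mundane = retrieval_score(recency=0.9, importance=mundane, relevance=0.4)
old_poignant = retrieval_score(recency=0.2, importance=poignant, relevance=0.4)
print(recent_mundane < old_poignant)
```

With equal weights, a high write-time importance can outrank a large recency advantage, which is the behavior the importance field is meant to buy.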

The general insight that not all knowledge claims deserve equal trust has older roots. MYCIN, the 1976 expert system for medical diagnosis, used certainty factors — numeric values assigned to each inference rule to express how confidently that rule applied. The mechanism was hand-coded weights. The problem it was solving is structurally similar: a system that treats all facts as equally certain will make errors that a probability-weighted system avoids. Park et al. gave that insight a learned, LLM-native form.

The practical implication: widely used memory systems such as Mem0 and Letta do not expose an explicit importance field. In Mem0, importance is implicit: whatever the extraction LLM chooses to store is treated as equally weighted. In Letta, the agent itself decides when to call core_memory_append, making importance a judgment call embedded in agent behavior rather than a queryable field.

Recall externalizes importance as a stored, sortable, queryable float. That is a design choice, not an obvious default. The benefit: at decay time, low-importance memories can be scored down or pruned without touching high-importance ones. At audit time, SELECT * FROM memories WHERE importance > 0.8 tells you what the system considers the user’s core persistent facts.

The Schema

The schema expresses these design decisions as queryable columns — here is how importance, confidence, and the write-time fields translate into the memories table.

CREATE TABLE IF NOT EXISTS memories (
    id              TEXT PRIMARY KEY,
    user_id         TEXT NOT NULL,
    text            TEXT NOT NULL,
    type            TEXT NOT NULL,       -- preference|fact|decision|procedure
    topic           TEXT,
    importance      REAL DEFAULT 0.5,
    confidence      REAL DEFAULT 0.8,
    source_session  TEXT,
    created_at      TEXT NOT NULL,
    last_accessed   TEXT,
    access_count    INTEGER DEFAULT 0,
    decay_score     REAL,
    superseded_by   TEXT,               -- FK to memories.id that superseded this
    embedding       BLOB,               -- L2-normalized float32 vector
    entity          TEXT,               -- subject: "user", "project-x"
    attribute       TEXT,               -- property: "preferred_language"
    value           TEXT,               -- value: "Python"
    valid_until     TEXT                -- NULL = still active; set on contradiction
);

Four columns are not used until Issue 3: valid_until (set when a contradiction supersedes this memory), superseded_by (FK to the replacement memory), decay_score (computed by DecayWorker on a periodic schedule), and embedding (stored when sentence-transformers is available, NULL otherwise). The first three are always NULL at write time. They exist in the schema now for one reason: adding them later, after contradiction detection or decay has shipped, requires a migration. The valid_until and superseded_by pair is particularly load-bearing: querying WHERE valid_until IS NULL is how you get the current active belief state. That query pattern needs to be consistent from day one.
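The valid_until / superseded_by pattern can be sketched in a few lines. The table is a trimmed subset of the schema, and `supersede` and `active_memories` are hypothetical helper names. How Recall decides that one memory supersedes another is Issue 3 material and is not modeled here.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id TEXT PRIMARY KEY, user_id TEXT NOT NULL, text TEXT NOT NULL,
           superseded_by TEXT, valid_until TEXT)"""
)
conn.executemany(
    "INSERT INTO memories (id, user_id, text) VALUES (?, ?, ?)",
    [("m1", "u1", "User uses unittest."), ("m2", "u1", "User uses pytest.")],
)


def supersede(conn, old_id: str, new_id: str) -> None:
    """Close out old_id and point it at its replacement."""
    conn.execute(
        "UPDATE memories SET valid_until = ?, superseded_by = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), new_id, old_id),
    )


def active_memories(conn, user_id: str) -> list:
    """Current belief state: rows whose valid_until is still NULL."""
    rows = conn.execute(
        "SELECT id FROM memories WHERE user_id = ? AND valid_until IS NULL ORDER BY id",
        (user_id,),
    )
    return [row[0] for row in rows]


before = active_memories(conn, "u1")
supersede(conn, "m1", "m2")
after = active_memories(conn, "u1")
print(before, after)  # ['m1', 'm2'] ['m2']
```

The superseded row is kept, not deleted, so the audit trail survives: m1 still records what was once believed and which memory replaced it.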

Failure Mode: Provenance Failure

Named failure mode for Issue 2: Provenance Failure — the memory store contains facts that were never true, extracted from hypotheticals, roleplay framing, or sarcasm.

How it happens: extraction prompts that do not account for speech act type. The model reads “I’m actually a doctor” and the propositional form looks identical to “I have a peanut allergy.” Both are first-person declarative statements. The LLM extractor cannot distinguish assertoric intent from rhetorical framing without explicit guidance on confidence scoring.

The confidence field is the mechanism, not the cure. An extraction prompt that returns confidence: 0.3 for hypothetical and conditional phrasings — “what if I were,” “imagine I worked at,” “pretending to be,” “I’m basically a [X]” — narrows the blast radius significantly. At retrieval time, memories below a confidence threshold can be filtered out or down-weighted, preventing a low-confidence hypothetical from surfacing as established fact.
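One way to backstop the prompt is a deterministic guard that caps confidence when the source turn contains one of the phrasings listed above. This is an illustrative sketch, not part of Recall: the function name, the marker list, and the choice to cap (and not drop) are all assumptions.

```python
import re

# Markers taken from the phrasings listed above; the list is illustrative only.
_HYPOTHETICAL_MARKERS = re.compile(
    r"\b(what if i were|imagine i worked at|pretending to be|i['’]m basically an?)\b",
    re.IGNORECASE,
)
HYPOTHETICAL_CAP = 0.3


def cap_confidence(source_text: str, memories: list) -> list:
    """Cap confidence at 0.3 when the source turn contains a hypothetical marker."""
    if not _HYPOTHETICAL_MARKERS.search(source_text):
        return memories
    return [
        {**m, "confidence": min(m.get("confidence", 0.8), HYPOTHETICAL_CAP)}
        for m in memories
    ]


extracted = [{"text": "User is a doctor.", "type": "fact", "confidence": 0.8}]

capped = cap_confidence("What if I were a doctor?", extracted)
missed = cap_confidence("I'm actually a doctor, so walk me through it.", extracted)
print(capped[0]["confidence"], missed[0]["confidence"])  # 0.3 0.8
```

The second call is the important one. The phrasing from the opening scenario contains no marker, so a pattern list does not catch it. A guard like this only covers the obvious cases, and the extraction model's own confidence judgment still has to carry the rest.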

Retrieval-time filtering by confidence looks like:

SELECT * FROM memories
WHERE user_id = ?
  AND valid_until IS NULL
  AND confidence >= 0.4
ORDER BY importance DESC, last_accessed DESC
LIMIT 10;

The doctor fact never reaches the agent’s context if it was extracted with confidence: 0.3 and the retrieval threshold is confidence >= 0.4.
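The effect of the threshold is easy to check end to end with an in-memory database. The table below is a trimmed subset of the schema, and `retrieve` and `MIN_CONFIDENCE` are hypothetical names. The two rows are the doctor and peanut-allergy examples used earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id TEXT PRIMARY KEY, user_id TEXT NOT NULL, text TEXT NOT NULL,
           importance REAL DEFAULT 0.5, confidence REAL DEFAULT 0.8,
           last_accessed TEXT, valid_until TEXT)"""
)
conn.executemany(
    "INSERT INTO memories (id, user_id, text, importance, confidence) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("m1", "u1", "User is a doctor.", 0.8, 0.3),
        ("m2", "u1", "User has a peanut allergy.", 0.9, 0.9),
    ],
)

MIN_CONFIDENCE = 0.4


def retrieve(conn, user_id, min_confidence=MIN_CONFIDENCE, limit=10):
    """Active memories at or above the confidence threshold, most important first."""
    rows = conn.execute(
        """SELECT text FROM memories
           WHERE user_id = ? AND valid_until IS NULL AND confidence >= ?
           ORDER BY importance DESC, last_accessed DESC
           LIMIT ?""",
        (user_id, min_confidence, limit),
    )
    return [row[0] for row in rows]


filtered = retrieve(conn, "u1")
unfiltered = retrieve(conn, "u1", min_confidence=0.0)
print(filtered)    # ['User has a peanut allergy.']
print(unfiltered)  # ['User has a peanut allergy.', 'User is a doctor.']
```

Keeping the threshold as a parameter leaves room for down-weighting instead of hard filtering, for example by falling back to a lower threshold when the strict query returns nothing.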

This is a partial solution, and worth naming as one. The survey (arXiv:2603.07670) explicitly identifies filtering hypotheticals and sarcasm as an open problem. LLM extraction is not reliably accurate at distinguishing ground truth from roleplay when the surface form of the utterance is declarative. The confidence field narrows the blast radius: a confidence: 0.3 memory will not dominate retrieval unless nothing else is available. It does not eliminate the underlying problem, which requires pragmatic interpretation of utterance intent that current models do not do reliably without specialized training.

The WRITE Decision

When to use async extraction versus sync store, and when not to call store at all.

Production agents: use store_memory (the MCP tool). This is async — it enqueues a job and returns a job_id immediately. The agent does not wait. The extraction worker processes the job in the background via claude-haiku-4-5-20251001, producing typed and scored MemoryUnit objects that are written to the memories table. The agent’s response latency is not affected.

Tests and data migration: use MemoryClient.store() directly. This is synchronous. It inserts a MemoryUnit directly — no extraction, no background job, no LLM call. You control every field. This is the right path when you need deterministic output (tests), when you are seeding a memory store with known facts (migration), or when you are writing a specific memory from a test fixture. Do not use it in production agent paths — it bypasses extraction entirely.
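The difference between the two paths can be sketched with a queue and a background thread. Everything here is a stand-in: `store_memory` and `store_direct` only mimic the shape of the MCP tool and MemoryClient.store(), `fake_extract` replaces the LLM call, and a Python list replaces the memories table.

```python
import queue
import threading
import uuid

jobs: queue.Queue = queue.Queue()
store: list = []                      # stands in for the memories table


def fake_extract(text: str) -> list:
    """Placeholder for the background LLM extraction call."""
    return [{"text": text, "type": "fact", "importance": 0.5, "confidence": 0.8}]


def store_memory(text: str) -> str:
    """Async path: enqueue a job and return its job_id without waiting."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, text))
    return job_id


def store_direct(memory: dict) -> dict:
    """Sync path: insert exactly what the caller supplies, with no extraction."""
    store.append(memory)
    return memory


def worker() -> None:
    """Background extraction worker: drains the queue and writes results."""
    while True:
        job_id, text = jobs.get()
        try:
            for memory in fake_extract(text):
                store.append({**memory, "job_id": job_id})
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()

job_id = store_memory("I always use dark mode")        # returns immediately
store_direct({"text": "User uses pytest.", "type": "preference",
              "importance": 0.7, "confidence": 1.0})   # deterministic, caller-controlled
jobs.join()                                            # demo only; an agent would not wait
print(len(store))
```

The sync path is what makes tests deterministic: every field is set by the caller, so assertions do not depend on model output.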

Path	Mode	When to Use	Extraction?	Returns
store_memory MCP tool	Async	Production agent paths	Yes, via claude-haiku-4-5-20251001 in background	job_id immediately
MemoryClient.store()	Sync	Tests, data migration, seeding	No, direct insert	MemoryUnit
_extract_stub_fallback()	Sync	API failure only (safety valve)	No, raw text stored (truncated to 500 characters)	MemoryUnit with confidence: 0.8, importance: 0.5

When to set the topic: Topic is a namespace, not a tag. Use broad categories that scope retrieval without over-partitioning. “tech”, “work”, “personal”, “health” are valid topics. “user prefers Python for backend APIs over Go” is not a topic — it is a text value.

When not to call store at all: Within-session state does not belong in the memory store. “We decided to use Python for this task” — if that decision is scoped to the current conversation and has no relevance to future sessions, it is working memory, not cross-session belief state. Storing transient task state in persistent memory creates noise that will surface in future, unrelated sessions. As framed in Issue 1: what should this agent be allowed to forget?

Before your WRITE phase goes to production, verify the six conditions in the checklist below.

Production Checklist

	Item	Score
	Memory unit schema has at minimum: text, type, topic, importance, confidence, source_session, created_at. These are not optional fields — each solves a specific problem at retrieval or maintenance time.
	Extraction prompt asks for both importance AND confidence as separate floats — not a single score. Confidence = accuracy certainty. Importance = long-term relevance. They are not the same field.
	Extraction prompt has an explicit [] return path. Tested: send a one-line pleasantry. Expected output: []. If output is non-empty, the prompt over-extracts.
	Hypotheticals, roleplay framing, and sarcasm produce confidence ≤ 0.3, not the default 0.8. Tested: send "What if I were a doctor?" Expected confidence: ≤ 0.3.
	Production agent paths use async store_memory (MCP tool). Returns job_id immediately; extraction happens in background. MemoryClient.store() is used only in tests and migration.
	valid_until and superseded_by columns exist in schema, even if unused today. Both are NULL for all active memories. No migration needed when Issue 3 ships.


Resources

Generative Agents: Interactive Simulacra of Human Behavior

Park et al. 2023 — arXiv:2304.03442

The foundational paper for importance scoring in agentic memory systems. First to ask the LLM to rate each observation on a 1–10 poignancy scale at write time, then combine it with recency and relevance at retrieval. The intellectual lineage of Recall's importance field traces directly here.

MemoryBank: Enhancing Large Language Models with Long-Term Memory

Zhong et al. 2024 — arXiv:2305.10250 (AAAI 2024)

Introduced the Ebbinghaus forgetting curve as a memory strength model: R = e^(-t/S), where S increases with each successful retrieval. Frames memory strength as a function of access frequency rather than just write-time scoring. The formula the DecayWorker in Issue 3 draws from.

Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0 Research Team — arXiv:2504.19413 (ECAI 2025)

Reports results on the LOCOMO benchmark. Documents selective memory at 66.9% accuracy and 0.71s median latency versus full-context at 72.9% and 9.87s. Six percentage points of accuracy traded for ~93% lower latency: the production tradeoff in concrete numbers.

Memory Security in Large Language Model Agents: A Survey

arXiv:2604.16548

The provenance failure taxonomy. Names roleplay, hypotheticals, and sarcasm as primary attack vectors in memory systems. Grounds the confidence scoring discussion with documented examples rather than constructed scenarios.

Memory for Autonomous LLM Agents: A Survey

arXiv:2603.07670

The most current comprehensive survey. Explicitly cites filtering hypotheticals and sarcasm as 'an open challenge in principled memory consolidation.' The citation to reach for when acknowledging the limits of confidence-based filtering.

Memory in AI Systems is a seven-issue series from Sentient Zero Labs. Issue 3 covers the MANAGE phase: contradiction detection, decay scoring, and consolidation. The valid_until and superseded_by columns you added to the schema today get used there.

Until next issue,

Sentient Zero Labs