A user is chatting with an AI assistant mid-conversation. They are discussing a hypothetical medical scenario and say, offhandedly: “I’m actually a doctor, so walk me through the mechanism more technically.”
They are not a doctor. They were using a rhetorical framing to request more depth. Six turns later, responding to an unrelated question about interpreting lab results, the agent volunteers: “Since you have a medical background, you’ll know that…”
The memory persisted. The agent had stored it as fact. The extraction pipeline read “I’m actually a doctor” and produced {text: "User is a doctor", type: "fact", importance: 0.8, confidence: 0.8}. Confidence: 0.8. Same default as everything else.
This is a constructed scenario, not a documented production incident, but it represents a real and documented vulnerability class. The MINJA attack study (arXiv:2601.05504, 2025) demonstrated that Gemini-2.0-Flash accepted 54 malicious memory injections with trust scores of 1.0, treating adversarially crafted content as ground truth. The memory security survey (arXiv:2604.16548) explicitly names roleplay framing and hypothetical statements as the primary provenance failure vectors in production memory systems. The mechanism is identical whether the bad input comes from a malicious actor or an offhand rhetorical framing.
The root cause is not the extraction model. It is the absence of a schema that can carry epistemic metadata at write time. The doctor gets confidence: 0.8 because everything gets confidence: 0.8. The fix is not prompt engineering on the extraction side — it is a schema that forces the extraction step to be honest about certainty.
In Issue 1, we established the architectural reframe: memory is belief state management, not storage. The WRITE phase is where that belief state gets populated — and the decisions made at write time determine whether the state is accurate or not. A memory unit is a typed, scored data structure. Getting the WRITE phase right determines whether your memory store improves or degrades over time.
CONVERSATION TURN
        │
        ▼
[EXTRACTION LLM (Haiku)]
        ↓ JSON array of MemoryUnit objects
[IMPORTANCE + CONFIDENCE SCORER]
        ↓ scored, typed, timestamped
        │
        ▼
memories table (SQLite)

What Is a Memory Unit
A memory is not a string. It is a typed, scored, timestamped data structure — and every field in that structure earns its place by solving a specific problem that a bare string cannot.
Here is Recall’s MemoryUnit dataclass (fields and type definitions):
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class MemoryType(str, Enum):
    PREFERENCE = "preference"  # stated likes/dislikes, settings
    FACT = "fact"              # verifiable information about the user
    DECISION = "decision"      # choices the user has made
    PROCEDURE = "procedure"    # steps or workflows the user follows


@dataclass
class MemoryUnit:
    id: str
    user_id: str
    text: str  # extracted natural-language fact
    type: MemoryType
    topic: str
    importance: float = field(default=0.5)
    confidence: float = field(default=0.8)
    source_session: str = field(default="")
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime | None = field(default=None)
    access_count: int = field(default=0)
    decay_score: float | None = field(default=None)
    superseded_by: str | None = field(default=None)
    embedding: list[float] | None = field(default=None)
Each field maps to a specific retrieval or maintenance operation — the rationale for each is worth stating explicitly.
text is the natural-language fact — one to two sentences, self-contained enough to be read without the original conversation. “User prefers Python for backend work” is a valid text. “Python backend” is not — it needs surrounding context to be usable. At retrieval time, this string goes into the agent’s context window directly, so it has to be readable prose.
type carries the four-value enum: “preference”, “fact”, “decision”, “procedure”. This is not decoration. The type determines what can be contradicted and how. A decision can be superseded by a later decision without touching a preference. A procedure can coexist with a conflicting preference — they describe different things. Issue 3 builds contradiction detection on top of type, but the schema needs the field from day one.
topic is the namespace for retrieval scoping. When the agent is answering a question about code architecture, it should search memories within a “tech” topic — not across “personal”, “financial”, and “health” simultaneously. Topic is a broad category, not a tag. “tech”, “work”, “personal” are valid topics. “user prefers dark mode in VSCode” is not a topic — it is a text value.
importance is a 0.0–1.0 score set at extraction time, not at retrieval time. The distinction matters: a retrieval-time score requires knowing in advance what queries will look like. An extraction-time score is set when the information is fresh and the model has full context on what was said and how. The scoring model and its research lineage are covered in the Importance Scoring section.
confidence is 0.0–1.0 and it is the mechanism by which the doctor scenario gets fixed. This field is conceptually different from importance. Importance asks: how much does this matter long-term? Confidence asks: how certain are we that this is actually true based on what was said? “I’m actually a doctor” said in a rhetorical framing warrants confidence: 0.3. “I have a peanut allergy” stated directly warrants confidence: 0.9. The field exists to carry that distinction from extraction to retrieval.
source_session is the audit trail. It records which conversation session produced this memory. When a memory surfaces at retrieval time and turns out to be wrong, source_session is how you trace it back to the originating exchange.
created_at, last_accessed, and access_count provide the temporal metadata that decay scoring needs. A preference expressed six sessions ago, never retrieved, should score differently than one expressed last week and retrieved three times since. Issue 3 builds the DecayWorker on top of these fields.
decay_score, superseded_by, and embedding are populated later in the memory lifecycle — by the decay worker, by contradiction resolution, and by the embedding pipeline respectively. They are None at write time. The schema carries them from day one because adding columns to a production SQLite table later requires migration.
One distinction to flag explicitly: the MemoryUnit dataclass does not have entity, attribute, or value fields. These exist in the memories table in the SQLite schema — they are populated by the extraction worker at write time and used by contradiction detection in Issue 3. But they are schema-level fields, not Python dataclass fields. That distinction matters when you are instantiating objects in code versus writing queries against the database.
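To make the split between dataclass fields and schema-level fields concrete, here is a minimal sketch of how the two might be merged at insert time. `MiniMemoryUnit` is a trimmed, hypothetical stand-in for `MemoryUnit`, and `to_row` is an assumed helper for illustration, not part of Recall's API.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class MiniMemoryUnit:
    """Trimmed, hypothetical stand-in for MemoryUnit (a few fields only)."""
    id: str
    user_id: str
    text: str
    type: str
    topic: str
    importance: float = 0.5
    confidence: float = 0.8


def to_row(unit, entity=None, attribute=None, value=None):
    """Merge dataclass fields with the schema-only triple into one row dict."""
    row = asdict(unit)
    row["created_at"] = datetime.now(timezone.utc).isoformat()
    # entity/attribute/value exist only as table columns, not on the dataclass.
    row.update(entity=entity, attribute=attribute, value=value)
    return row


allergy = MiniMemoryUnit(
    id="m1", user_id="u1", text="User has a peanut allergy.",
    type="fact", topic="health", importance=0.9, confidence=0.9,
)
row = to_row(allergy, entity="user", attribute="allergy", value="peanuts")
print(sorted(row))
```

The object you instantiate in code carries no triple. The row you write to the database does.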
The Extraction Step
The extraction step is a classification problem. The question it is answering is: “Is this utterance worth storing as a permanent memory?”
Get the classification wrong in either direction and memory quality degrades faster than any retrieval bug can account for.
Over-extraction fills the store with noise: pleasantries, hypotheticals, filler acknowledgements, in-flight task state that reverses two turns later. Over-extracted memory looks fine in week one. By month three, the store is a junk drawer and retrieved context is a noise source. Every query surfaces “User said they like Python” alongside “User seemed to be having a good day” alongside “User mentioned they were tired” — all weighted similarly, all inserted indiscriminately.
Under-extraction misses the signal. Stated preferences get dropped. Factual corrections disappear. Recurring behavior patterns that would have personalized responses never make it into persistent state. The agent appears not to listen.
| Input Type | Example | Decision | Reason |
|---|---|---|---|
| Stated preference | "I always use dark mode" | STORE | Explicit, user-specific, stable |
| Factual self-disclosure | "I'm a backend engineer" | STORE | Fact about user, useful across sessions |
| Decision recorded | "I chose PostgreSQL for this project" | STORE | Architectural decision, high importance |
| Explicit correction | "No, I use pytest not unittest" | STORE | Overrides prior belief |
| Filler / pleasantry | "Thanks, that's helpful!" | SKIP | No persistent signal |
| Transient state | "I'm tired today" | SKIP | Momentary, not stable |
| Hypothetical / roleplay | "What if I were a doctor?" | STORE (low confidence ≤0.3) | May be useful but marked uncertain |
| Sarcasm / joke | "Oh great, another meeting" | SKIP | No sincere signal |
The extraction prompt is where this classification logic lives. Here is Recall’s actual _EXTRACTION_PROMPT:
_EXTRACTION_PROMPT = """\
Extract important memories from this conversation text. Return ONLY a JSON array.
Each memory object must have:
- "text": the memory content (1-2 sentences, specific and self-contained)
- "type": one of "preference", "fact", "decision", "procedure"
- "importance": float 0.0-1.0 (how important is this to remember long-term?)
- "confidence": float 0.0-1.0 (how certain are you this is accurate from the text?)
- "entity": the subject of the fact — e.g. "user", "project-alpha", "tool-x" (null if not applicable)
- "attribute": the property — e.g. "preferred_language", "works_at", "uses_framework" (null if not applicable)
- "value": the value — e.g. "Python", "Acme Corp", "FastAPI" (null if not applicable)
Rules:
- For preferences and facts: always try to extract entity/attribute/value.
- For decisions and procedures: entity/attribute/value are usually null.
- If entity+attribute+value are set, they must be consistent with the text field.
- Ignore small talk, pleasantries, and purely transient information.
- Extract 0-5 memories. Return [] if nothing is worth remembering.
Topic context: {topic}
Text:
{text}"""
Several design choices in this prompt are worth examining.
The entity/attribute/value triple is not redundant with text. Structured triples are the raw material for contradiction detection: when a new fact arrives with entity="user", attribute="works_at", value="Acme Corp", the system can query for existing memories where entity="user" and attribute="works_at" and value != "Acme Corp". That contradiction check is impossible on unstructured text.
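The contradiction lookup described above can be sketched against an in-memory SQLite table. This is an illustration under assumptions: the table is cut down to the columns the query needs, `find_conflicts` is a hypothetical helper name, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id TEXT PRIMARY KEY, user_id TEXT NOT NULL, text TEXT NOT NULL,
           entity TEXT, attribute TEXT, value TEXT, valid_until TEXT)"""
)
conn.executemany(
    "INSERT INTO memories VALUES (?, ?, ?, ?, ?, ?, NULL)",
    [
        ("m1", "u1", "User works at Initech.", "user", "works_at", "Initech"),
        ("m2", "u1", "User prefers Python.", "user", "preferred_language", "Python"),
    ],
)


def find_conflicts(conn, user_id, entity, attribute, value):
    """Ids of active memories with the same entity+attribute but another value."""
    rows = conn.execute(
        """SELECT id FROM memories
           WHERE user_id = ? AND entity = ? AND attribute = ?
             AND value != ? AND valid_until IS NULL""",
        (user_id, entity, attribute, value),
    ).fetchall()
    return [r[0] for r in rows]


# A new fact arrives: the user now works at Acme Corp.
conflicts = find_conflicts(conn, "u1", "user", "works_at", "Acme Corp")
print(conflicts)
```

Only the `works_at` row comes back as a conflict. The `preferred_language` row is left alone because its attribute differs.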
“Extract 0–5 memories. Return [] if nothing is worth remembering.” The upper bound prevents runaway extraction from a single verbose turn. More importantly, the explicit permission to return an empty array is load-bearing. A prompt that says “extract memories from this conversation” creates implicit pressure to extract something. Zero extraction is a valid and correct output.
The confidence float is explicitly separate from importance. Confidence = how certain are we this is accurate based on what was said. Importance = how much does this matter long-term. These two dimensions are orthogonal. “The user is considering switching to Rust” might be importance: 0.7 and confidence: 0.5 (they said “considering,” not “I’ve decided”). A hypothetical framing should produce low confidence regardless of the content’s importance.
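Whatever the prompt asks for, the model's reply still has to be validated before it becomes a row. The following is a minimal sketch of one way to do that, not Recall's actual parser: `parse_extraction` and `_clamp` are hypothetical names, and falling back to a low confidence (0.3) when the value is unparseable is an assumption of this sketch.

```python
import json

VALID_TYPES = {"preference", "fact", "decision", "procedure"}
MAX_MEMORIES = 5  # mirrors the "Extract 0-5 memories" rule in the prompt


def _clamp(x, default):
    """Coerce to a float in [0.0, 1.0]; use default if x is not numeric."""
    try:
        return min(1.0, max(0.0, float(x)))
    except (TypeError, ValueError):
        return default


def parse_extraction(raw):
    """Parse the model's JSON array, drop malformed items, cap the count."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    memories = []
    for item in items:
        if not isinstance(item, dict):
            continue
        text = item.get("text")
        mtype = item.get("type")
        if not isinstance(text, str) or not text.strip():
            continue
        if not isinstance(mtype, str) or mtype not in VALID_TYPES:
            continue
        memories.append({
            "text": text.strip(),
            "type": mtype,
            "importance": _clamp(item.get("importance"), 0.5),
            # Assumption: a missing or bad confidence falls back low, not to 0.8.
            "confidence": _clamp(item.get("confidence"), 0.3),
            "entity": item.get("entity"),
            "attribute": item.get("attribute"),
            "value": item.get("value"),
        })
    return memories[:MAX_MEMORIES]


print(parse_extraction("[]"))
```

An empty array passes straight through as an empty list, which keeps zero extraction a first-class outcome rather than an error.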
This extraction runs on claude-haiku-4-5-20251001. Fast, cheap, appropriate for extraction-class classification tasks. Extraction happens on every turn, so the per-call cost matters.
On the fallback: if the Anthropic API call fails, _extract_stub_fallback() activates. It stores the raw input text (truncated to 500 characters) as a single type="fact" with importance=0.5, confidence=0.8. This is a safety valve, not a design. The default confidence=0.8 on the fallback path is a known limitation — raw text stored via fallback does not go through the confidence scoring logic.
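The fallback behavior described above is small enough to sketch directly. The code below mirrors only what the prose states (truncate to 500 characters, one `fact`, default scores). The function names and the wrapper are hypothetical, and the real `_extract_stub_fallback()` may differ.

```python
MAX_FALLBACK_CHARS = 500


def extract_stub_fallback_sketch(raw_text):
    """Return one unscored 'fact' built from the truncated raw text."""
    return [{
        "text": raw_text[:MAX_FALLBACK_CHARS],
        "type": "fact",
        "importance": 0.5,
        "confidence": 0.8,  # known limitation: a default, not a real score
    }]


def extract_with_fallback(extract, raw_text):
    """Try the real extractor; use the stub if the call raises."""
    try:
        return extract(raw_text)
    except Exception:
        return extract_stub_fallback_sketch(raw_text)


def failing_extractor(_text):
    """Stand-in for an API call that fails."""
    raise ConnectionError("simulated API failure")


result = extract_with_fallback(failing_extractor, "x" * 800)
print(len(result[0]["text"]), result[0]["confidence"])
```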
Importance Scoring
Importance is a write-time decision, not a retrieval-time one.
| Score Range | Signal | Example Memory |
|---|---|---|
| 0.8–1.0 | User explicitly states a preference; corrects the agent; repeats a fact across sessions | "User is allergic to peanuts — stated directly and emphasized" |
| 0.5–0.8 | Factual information about the user offered in context; single-session preference | "User mentioned they work at a fintech startup" |
| 0.2–0.5 | Mentioned once in passing; low predictive value for future interactions | "User said they were tired last Tuesday" |
| skip (< 0.2) | Filler, pleasantries, hypotheticals, within-session transient state | "User said 'sounds good' after receiving a code suggestion" |
Park et al. 2023 is the direct intellectual ancestor of this design. In Generative Agents, at the moment each observation enters the memory stream, the LLM is asked to rate it on a scale of 1 to 10 for poignancy: “where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a breakup, college acceptance).” That integer seeds the retrieval ranking. Combined with recency and relevance at query time, importance is one of three equal-weight components in the final retrieval score. That design decision is now in Recall’s extraction prompt — transposed to a 0.0–1.0 float, embedded directly in the extraction call rather than requiring a separate scoring pass.
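As a rough illustration of the equal-weight combination, the sketch below ranks two candidate memories by the sum of recency, importance, and relevance. It assumes each component has already been scaled to the 0 to 1 range, and the component values are made up for the example.

```python
def retrieval_score(recency, importance, relevance):
    """Equal-weight sum; each input is assumed already normalized to [0, 1]."""
    return 1.0 * recency + 1.0 * importance + 1.0 * relevance


# Made-up component values for two candidate memories.
candidates = {
    "recent but trivial": {"recency": 0.9, "importance": 0.2, "relevance": 0.5},
    "older but important": {"recency": 0.3, "importance": 0.9, "relevance": 0.6},
}

ranked = sorted(
    candidates,
    key=lambda name: retrieval_score(**candidates[name]),
    reverse=True,
)
print(ranked)
```

With these numbers the older, more important memory outranks the recent trivial one, which is the effect a write-time importance score is meant to have.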
The general insight that not all knowledge claims deserve equal trust has older roots. MYCIN, the 1976 expert system for medical diagnosis, used certainty factors — numeric values assigned to each inference rule to express how confidently that rule applied. The mechanism was hand-coded weights. The problem it was solving is structurally similar: a system that treats all facts as equally certain will make errors that a probability-weighted system avoids. Park et al. gave that insight a learned, LLM-native form.
The practical implication: most production systems (Mem0, Letta) do not have an explicit importance field. In Mem0, importance is implicit — what the extraction LLM chooses to store is treated as equally weighted. In Letta, the agent itself decides when to call core_memory_append, making importance a judgment call embedded in agent behavior rather than a queryable field.
Recall externalizes importance as a stored, sortable, queryable float. That is a design choice, not an obvious default. The benefit: at decay time, low-importance memories can be scored down or pruned without touching high-importance ones. At audit time, SELECT * FROM memories WHERE importance > 0.8 tells you what the system considers the user’s core persistent facts.
The Schema
The schema expresses these design decisions as queryable columns — here is how importance, confidence, and the write-time fields translate into the memories table.
CREATE TABLE IF NOT EXISTS memories (
    id              TEXT PRIMARY KEY,
    user_id         TEXT NOT NULL,
    text            TEXT NOT NULL,
    type            TEXT NOT NULL,        -- preference|fact|decision|procedure
    topic           TEXT,
    importance      REAL DEFAULT 0.5,
    confidence      REAL DEFAULT 0.8,
    source_session  TEXT,
    created_at      TEXT NOT NULL,
    last_accessed   TEXT,
    access_count    INTEGER DEFAULT 0,
    decay_score     REAL,
    superseded_by   TEXT,                 -- FK to memories.id that superseded this
    embedding       BLOB,                 -- L2-normalized float32 vector
    entity          TEXT,                 -- subject: "user", "project-x"
    attribute       TEXT,                 -- property: "preferred_language"
    value           TEXT,                 -- value: "Python"
    valid_until     TEXT                  -- NULL = still active; set on contradiction
);
Four columns are unused until Issue 3 and are normally NULL at write time: valid_until (set when a contradiction supersedes this memory), superseded_by (FK to the replacement memory), decay_score (computed by DecayWorker on a periodic schedule), and embedding (stored when sentence-transformers is available, NULL otherwise). They exist in the schema now for one reason: adding them later, after contradiction detection or decay has shipped, requires a migration. The valid_until and superseded_by pair is particularly load-bearing: querying WHERE valid_until IS NULL is how you get the current active belief state. That query pattern needs to be consistent from day one.
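To show why the `valid_until IS NULL` pattern matters, here is a sketch of how the two columns might be set when one memory replaces another. It is an assumption about usage, not the Issue 3 implementation: `supersede` is a hypothetical helper, the table is reduced to four columns, and the rows are sample data.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id TEXT PRIMARY KEY, text TEXT NOT NULL,
           superseded_by TEXT, valid_until TEXT)"""
)
conn.execute("INSERT INTO memories (id, text) VALUES ('m1', 'User works at Initech.')")


def supersede(conn, old_id, new_id, new_text):
    """Insert the replacement and close out the old memory in one transaction."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("INSERT INTO memories (id, text) VALUES (?, ?)", (new_id, new_text))
        conn.execute(
            "UPDATE memories SET valid_until = ?, superseded_by = ? WHERE id = ?",
            (now, new_id, old_id),
        )


supersede(conn, "m1", "m2", "User works at Acme Corp.")
active = [r[0] for r in conn.execute("SELECT id FROM memories WHERE valid_until IS NULL")]
print(active)
```

The old row is kept for audit, but it drops out of the active belief state because its `valid_until` is no longer NULL.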
Failure Mode: Provenance Failure
Named failure mode for Issue 2: Provenance Failure — the memory store contains facts that were never true, extracted from hypotheticals, roleplay framing, or sarcasm.
How it happens: extraction prompts that do not account for speech act type. The model reads “I’m actually a doctor” and the propositional form looks identical to “I have a peanut allergy.” Both are first-person declarative statements. The LLM extractor cannot distinguish assertoric intent from rhetorical framing without explicit guidance on confidence scoring.
The confidence field is the mechanism, not the cure. An extraction prompt that returns confidence: 0.3 for hypothetical and conditional phrasings — “what if I were,” “imagine I worked at,” “pretending to be,” “I’m basically a [X]” — narrows the blast radius significantly. At retrieval time, memories below a confidence threshold can be filtered out or down-weighted, preventing a low-confidence hypothetical from surfacing as established fact.
Retrieval-time filtering by confidence looks like:
SELECT * FROM memories
WHERE user_id = ?
AND valid_until IS NULL
AND confidence >= 0.4
ORDER BY importance DESC, last_accessed DESC
LIMIT 10;
The doctor fact never reaches the agent’s context if it was extracted with confidence: 0.3 and the retrieval threshold is confidence >= 0.4.
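The threshold behavior can be checked end to end with an in-memory SQLite table. This is a sketch under assumptions: `retrieve` is a hypothetical wrapper around the query above, the table carries only the columns the query touches, and the two rows are sample data matching the examples in this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id TEXT PRIMARY KEY, user_id TEXT NOT NULL, text TEXT NOT NULL,
           importance REAL DEFAULT 0.5, confidence REAL DEFAULT 0.8,
           last_accessed TEXT, valid_until TEXT)"""
)
conn.executemany(
    "INSERT INTO memories (id, user_id, text, importance, confidence) VALUES (?, ?, ?, ?, ?)",
    [
        ("m1", "u1", "User is a doctor.", 0.8, 0.3),           # rhetorical framing
        ("m2", "u1", "User has a peanut allergy.", 0.9, 0.9),  # stated directly
    ],
)


def retrieve(conn, user_id, min_confidence=0.4, limit=10):
    """Return active memory texts at or above the confidence threshold."""
    rows = conn.execute(
        """SELECT text FROM memories
           WHERE user_id = ? AND valid_until IS NULL AND confidence >= ?
           ORDER BY importance DESC, last_accessed DESC
           LIMIT ?""",
        (user_id, min_confidence, limit),
    ).fetchall()
    return [r[0] for r in rows]


print(retrieve(conn, "u1"))
```

Lowering `min_confidence` to 0.0 brings the doctor row back, which is a quick way to see what the filter is holding out of context.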
This is a partial solution — and worth naming as one. The survey (arXiv:2603.07670) explicitly calls filtering hypotheticals and sarcasm “an open research problem.” LLM extraction is not reliably accurate at distinguishing ground truth from roleplay when the surface form of the utterance is declarative. The confidence field narrows the blast radius — a confidence: 0.3 memory will not dominate retrieval unless nothing else is available. It does not eliminate the underlying problem, which requires pragmatic interpretation of utterance intent that current models do not do reliably without specialized training.
The WRITE Decision
When to use async extraction versus sync store, and when not to call store at all.
Production agents: use store_memory (the MCP tool). This is async — it enqueues a job and returns a job_id immediately. The agent does not wait. The extraction worker processes the job in the background via claude-haiku-4-5-20251001, producing typed and scored MemoryUnit objects that are written to the memories table. The agent’s response latency is not affected.
Tests and data migration: use MemoryClient.store() directly. This is synchronous. It inserts a MemoryUnit directly — no extraction, no background job, no LLM call. You control every field. This is the right path when you need deterministic output (tests), when you are seeding a memory store with known facts (migration), or when you are writing a specific memory from a test fixture. Do not use it in production agent paths — it bypasses extraction entirely.
| Path | API | When to Use | Extraction? | Returns |
|---|---|---|---|---|
| store_memory MCP tool | Async | Production agent paths | Yes — via claude-haiku-4-5-20251001 in background | job_id immediately |
| MemoryClient.store() | Sync | Tests, data migration, seeding | No — direct insert | MemoryUnit |
| _extract_stub_fallback() | Sync | API failure only (safety valve) | No — raw text stored as-is | MemoryUnit with confidence: 0.8, importance: 0.5 |
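The async path in the table can be pictured as an enqueue-and-return pattern. The sketch below uses a thread and a queue purely for illustration. `store_memory_sketch`, `worker`, and `fake_extract` are hypothetical names, and nothing here reflects how the actual MCP tool or extraction worker is built.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}


def store_memory_sketch(text, topic):
    """Enqueue an extraction job and return its id without waiting."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, text, topic))
    return job_id


def worker(extract):
    """Drain the queue; `extract` stands in for the background LLM call."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: stop the worker
            jobs.task_done()
            break
        job_id, text, topic = job
        results[job_id] = extract(text, topic)
        jobs.task_done()


def fake_extract(text, topic):
    """Placeholder extractor that returns one memory dict."""
    return [{"text": text, "type": "decision", "topic": topic}]


t = threading.Thread(target=worker, args=(fake_extract,), daemon=True)
t.start()
job_id = store_memory_sketch("User chose PostgreSQL for this project.", "tech")
jobs.join()     # demo only: a real agent would not block on the queue
jobs.put(None)  # tell the worker to exit
t.join()
print(job_id in results)
```

The caller holds a `job_id` as soon as the job is queued. The extraction result shows up later, which is why response latency stays unaffected.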
When to set the topic: Topic is a namespace, not a tag. Use broad categories that scope retrieval without over-partitioning. “tech”, “work”, “personal”, “health” are valid topics. “user prefers Python for backend APIs over Go” is not a topic — it is a text value.
When not to call store at all: Within-session state does not belong in the memory store. “We decided to use Python for this task” — if that decision is scoped to the current conversation and has no relevance to future sessions, it is working memory, not cross-session belief state. Storing transient task state in persistent memory creates noise that will surface in future, unrelated sessions. As framed in Issue 1: what should this agent be allowed to forget?
Before your WRITE phase goes to production, verify the six conditions in the checklist below.
Production Checklist
| Item | Score |
|---|---|
| Memory unit schema has at minimum: text, type, topic, importance, confidence, source_session, created_at. These are not optional fields; each solves a specific problem at retrieval or maintenance time. | |
| Extraction prompt asks for both importance AND confidence as separate floats, not a single score. Confidence = accuracy certainty. Importance = long-term relevance. They are not the same field. | |
| Extraction prompt has an explicit [] return path. Tested: send a one-line pleasantry. Expected output: []. If output is non-empty, the prompt over-extracts. | |
| Hypotheticals, roleplay framing, and sarcasm produce confidence ≤ 0.3, not the default 0.8. Tested: send "What if I were a doctor?" Expected confidence: ≤ 0.3. | |
| Production agent paths use async store_memory (MCP tool). Returns job_id immediately; extraction happens in background. MemoryClient.store() is used only in tests and migration. | |
| valid_until and superseded_by columns exist in schema, even if unused today. Both are NULL for all active memories. No migration needed when Issue 3 ships. | |
Resources
Memory in AI Systems is a seven-issue series from Sentient Zero Labs. Issue 3 covers the MANAGE phase: contradiction detection, decay scoring, and consolidation. The valid_until and superseded_by columns you added to the schema today get used there.
Until next issue,
Sentient Zero Labs