A production chatbot had a memory layer that stored preferences across sessions. One early tester flagged a quirk: the assistant kept volunteering “you mentioned you like Italian food” — even when the user asked about train schedules, meeting agendas, or anything else.
The similarity threshold was set to 0.5. At 0.5, nearly every query matched some memory in the store. The agent didn’t become more personalized. It became a broken record.
The team’s first instinct was to improve the extraction pipeline: write better prompts, store fewer memories, add more filters at write time. None of that helped, because the memories were not the problem. Retrieval was. With the threshold that low, every query matched something, and the memory layer had become noise.
Two testers stopped using the assistant after the second session. The team assumed the use case wasn’t compelling enough. It was the memory layer.
The fix took under an hour: raise the threshold to 0.72 and add monitoring for injection rate, the fraction of queries that actually surface memories above threshold. Injection rate dropped from ~90% to ~45%. Response quality improved immediately.
The memory layer, once invisible because it was always on, started providing signal.
This failure mode has a name: retrieval over-injection. The cause is simple: treating the similarity threshold as a fixed constant instead of a hyperparameter.
Retrieval has a threshold. That threshold is a hyperparameter. Most teams never tune it.
Why Retrieval Is Not Solved by Storage
A common mistake in memory system design: treating the WRITE phase as the hard problem and the READ phase as plumbing. Store the memories correctly, and retrieval will follow.
It doesn’t work that way. Having memories in the database doesn’t mean they’ll reach the model in the right form, at the right time, with the right framing. The READ phase is its own engineering problem — and it has three distinct variables to tune.
Threshold: At what similarity score does a memory get included? Too low, and you inject noise. Too high, and the agent behaves as if it has no memory at all.
Ranking function: How do you order candidates before applying the limit? A keyword query for “Python version” should surface memories about Python, not memories that are merely recent or high-importance. A semantic query for “my coding preferences” should surface memories that express those preferences in different vocabulary. No single ranking function handles both.
Injection rate: What fraction of queries should retrieve memories at all? Not every query warrants memory retrieval. A query asking to summarize a pasted document doesn’t need the agent’s stored knowledge about the user’s preferred response format — it just needs to execute the task. Over-injecting dilutes the context with low-relevance content.
These three variables interact. Lowering the threshold raises the injection rate. Changing the ranking function changes which memories compete for the top slots. The context budget constrains how many results can be injected regardless of scoring.
Getting the READ phase right means treating these as variables to be measured and adjusted, not defaults to be accepted.
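The link between threshold and injection rate can be made concrete with a small sweep. The sketch below assumes you have logged, for each query, the best composite score among its candidates. The `top_scores` list is made-up illustrative data, not a measurement, and `injection_rate_at` is a hypothetical helper rather than part of Recall.

```python
def injection_rate_at(top_scores: list[float], threshold: float) -> float:
    """Fraction of queries whose best-scoring memory clears the threshold."""
    if not top_scores:
        return 0.0
    hits = sum(1 for score in top_scores if score >= threshold)
    return hits / len(top_scores)


# Hypothetical per-query best scores (illustrative values only).
top_scores = [0.91, 0.55, 0.74, 0.62, 0.48, 0.80, 0.68, 0.52, 0.77, 0.59]

for threshold in (0.5, 0.6, 0.7, 0.8, 0.9):
    rate = injection_rate_at(top_scores, threshold)
    print(f"threshold={threshold:.1f}  injection_rate={rate:.0%}")
```

With this toy data the rate falls from 90% at 0.5 to 40% at 0.7 and 10% at 0.9. Running the same sweep over real logs shows where a store actually sits before any threshold is changed.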
READ PHASE PIPELINE
            User Query
                 │
                 ▼
┌─────────────────────────────────┐
│ Candidate Fetch                 │
│ (SQL: top limit*5 memories      │
│  by created_at DESC)            │
└────────────────┬────────────────┘
                 │
         ┌───────┴───────┐
         │               │
         ▼               ▼
    ┌─────────┐    ┌───────────┐
    │ BM25+   │    │ Dense     │
    │ Rank    │    │ Cosine    │
    │ (1..N)  │    │ Rank      │
    └────┬────┘    └─────┬─────┘
         │               │
         └───────┬───────┘
                 ▼
     ┌───────────────────────┐
     │ RRF Fusion            │
     │ score = Σ 1/(k+r)     │
     │ k = 60                │
     └───────────┬───────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│ 4-Component Scoring                      │
│                                          │
│ w_rrf      · RRF(BM25, cosine)           │
│ w_recency  · exp(-age/(1+access_count))  │
│ w_import   · importance * decay_score    │
│ w_strength · log(1+ac)/log(1+max_ac)     │
│                                          │
│ weights interpolate with recency_weight  │
└────────────────┬─────────────────────────┘
                 │
                 ▼
    ┌─────────────────────────┐
    │ Sort by composite score │
    │ Take top limit          │
    └────────────┬────────────┘
                 │
                 ▼
    ┌─────────────────────────┐
    │ Threshold Gate          │
    │ below threshold → drop  │
    └────────────┬────────────┘
                 │
                 ▼
    ┌─────────────────────────┐
    │ Context Injection       │
    │ (memory text → prompt)  │
    └─────────────────────────┘
Threshold Tuning
The similarity threshold is a gate. Before any memory reaches the context window, it has to clear a minimum relevance score. The threshold determines how strict that gate is.
The precision-recall tradeoff plays out clearly at the extremes:
| Threshold | Typical Injection Rate | Precision/Recall Tradeoff | Recommended Use |
|---|---|---|---|
| 0.5 | 80–95% | Low precision, high recall. Irrelevant memories flood context. Italian food appears on every query. | Avoid as default. Only viable for deliberate broad recall with very small memory stores. |
| 0.7 | 40–55% | Balanced. Semantically related memories pass; loosely associated ones don't. | Default starting point. Adjust by ±0.05 based on observed injection rate. |
| 0.9 | 5–15% | High precision, low recall. Near-identical matches only. Agents appear amnesiac. | Use only when false positives are costly (medical, legal context injection). |
At threshold = 0.7, you’re operating in the productive zone. The injection rate for a well-maintained memory store typically falls in the 40–55% range, meaning roughly half of all queries surface at least one relevant memory. This is the target.
The 40–55% injection rate is an empirical target, not a guarantee. It depends on memory store density, user query patterns, and embedding model quality. An injection rate above 80% suggests the threshold is too low. A rate below 20% suggests the threshold is too high, the store is sparse, or the ranking function is failing.
Starting point: Begin at 0.7. Observe injection rate over 200+ queries. Adjust in increments of 0.05. Recheck.
# Threshold gate: only inject memories with composite score above minimum
INJECTION_THRESHOLD = 0.72  # tune based on observed injection rate

def filter_for_injection(scored_memories: list[dict], threshold: float) -> list[dict]:
    """Only pass memories whose composite score clears the threshold."""
    return [m for m in scored_memories if m.get("_score", 0.0) >= threshold]

# Monitor injection rate:
def compute_injection_rate(search_calls: list[dict]) -> float:
    """Fraction of search calls that returned at least one memory above threshold."""
    calls_with_results = sum(1 for c in search_calls if c["total"] > 0)
    return calls_with_results / len(search_calls) if search_calls else 0.0
Hybrid BM25 + Dense Retrieval
In 1972, Karen Spärck Jones published a 10-page paper in the Journal of Documentation arguing that the relevance of a term in a document should be weighted by its specificity across the collection — and that specificity is an inverse function of how many documents contain the term. Rare terms carry more discriminating power than common ones. That weighting principle became Inverse Document Frequency (IDF).
Robertson and colleagues formalized this into the BM25 ranking function in 1994, using it in the Okapi information retrieval system at City University London. BM25 became the default relevance algorithm in Elasticsearch and Lucene in 2016. It still underpins a large share of production search.
The reason BM25 survives in an era of billion-parameter embedding models is simple: dense retrieval fails silently on exact identifiers.
When a user queries ERR_SSL_VERSION_OR_CIPHER_MISMATCH, a dense embedding model averages that token sequence with surrounding context into a vector representing “document about SSL errors.” The exact error code identity is lost. BM25 doesn’t do averaging. It scores directly on lexical overlap. The exact string is found.
This matters for memory systems. When a user says “what did we decide about the Python version”, dense retrieval will surface memories about Python preferences. BM25 will surface memories that contain the exact word “Python”. Both are useful. Neither alone is robust.
| Query Type | BM25 Result | Dense Result | Winner |
|---|---|---|---|
| `ModuleNotFoundError: No module named 'recall'` | High — exact token match on recall, ModuleNotFoundError | Moderate — 'document about import errors' | BM25 |
| "what do I think about coding style" | Low — no exact match for 'coding style' | High — surfaces 'prefers clean code', 'dislikes verbose syntax' | Dense |
| "GPT-4o" (preferred model) | High — exact token match | Moderate — may conflate with other model memories | BM25 |
| "how I prefer to work" | Low — generic phrase | High — async style, communication preferences, tool choices | Dense |
The solution is fusion. Compute both rankings, then combine them with Reciprocal Rank Fusion (RRF), introduced by Cormack, Clarke, and Buettcher at SIGIR 2009.
The RRF formula: for each ranker r (BM25 and dense), a document gets a score of 1 / (k + its_rank). The scores are summed. Higher is better. The constant k=60 is the standard — it smooths the contribution of top-ranked vs. lower-ranked documents, preventing a single rank-1 result from dominating when the two lists disagree.
def _rrf_fuse(
    bm25_ranks: list[int] | None,
    dense_ranks: list[int] | None,
    k: int = 60,
) -> list[float]:
    """Reciprocal Rank Fusion (Cormack et al., SIGIR 2009).

    Returns a score per document; higher = better match.
    k=60 is the standard constant; it smooths rank contribution.
    """
    n = len(bm25_ranks or dense_ranks or [])
    if n == 0:
        return []
    scores = [0.0] * n
    for ranks in (bm25_ranks, dense_ranks):
        if ranks is None:
            continue  # graceful degradation: skip missing ranker
        for i, r in enumerate(ranks):
            if r <= n:  # r == n+1 means zero BM25 signal, so skip
                scores[i] += 1.0 / (k + r)
    return scores

# Example scores at k=60:
# rank 1:  1/(60+1)  = 0.01639
# rank 10: 1/(60+10) = 0.01429  (~13% gap: consensus matters more than top rank)
# rank 50: 1/(60+50) = 0.00909
Why rank, not score? BM25 scores are unbounded positive numbers. Cosine similarity scores are in [-1, 1]. Normalizing these into a common scale requires assumptions about score distributions. RRF sidesteps this entirely — it only uses rank order, which is distribution-agnostic.
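To see rank-only fusion at work, here is a minimal sketch that converts two score lists on very different scales into ranks and fuses them. The helper names `to_ranks` and `rrf` and all score values are illustrative assumptions, not Recall's code or real measurements.

```python
def to_ranks(scores: list[float]) -> list[int]:
    """Convert raw scores to 1-based ranks (highest score gets rank 1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, doc_index in enumerate(order, start=1):
        ranks[doc_index] = position
    return ranks


def rrf(rank_lists: list[list[int]], k: int = 60) -> list[float]:
    """Sum 1 / (k + rank) over every ranker, per document."""
    n = len(rank_lists[0])
    return [sum(1.0 / (k + ranks[i]) for ranks in rank_lists) for i in range(n)]


# Made-up scores for three memories; the scales differ on purpose.
bm25_scores = [12.4, 3.1, 7.8]     # unbounded positive
dense_scores = [0.21, 0.35, 0.83]  # cosine, bounded in [-1, 1]

bm25_ranks = to_ranks(bm25_scores)    # [1, 3, 2]
dense_ranks = to_ranks(dense_scores)  # [3, 2, 1]
fused = rrf([bm25_ranks, dense_ranks])
best = max(range(len(fused)), key=lambda i: fused[i])
print(bm25_ranks, dense_ranks, best)
```

Memory 2 is never worse than second in either list and ends up with the highest fused score, even though memory 0 wins BM25 outright. No score normalization was needed at any point.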
Recall uses BM25Plus (not BM25Okapi) specifically because BM25Okapi’s IDF collapses to zero for small corpora. When N=2 and a term appears in exactly one document, log(1.5/1.5)=0. Every memory gets a score of zero. BM25Plus adds a lower bound delta that keeps scores positive and meaningful — critical for memory systems where most users have fewer than 100 memories.
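The small-corpus collapse is easy to check by hand. The sketch below compares the classic Okapi IDF, log((N − n + 0.5) / (n + 0.5)), with the IDF form usually given for BM25+, log((N + 1) / n). Whether a particular library uses exactly these forms is an assumption here; the function names are hypothetical.

```python
import math


def okapi_idf(n_docs: int, doc_freq: int) -> float:
    """Classic Okapi BM25 IDF: log((N - n + 0.5) / (n + 0.5))."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_plus_idf(n_docs: int, doc_freq: int) -> float:
    """IDF form commonly paired with BM25+: log((N + 1) / n)."""
    return math.log((n_docs + 1) / doc_freq)


# (corpus size, number of documents containing the term)
for n_docs, doc_freq in [(2, 1), (10, 1), (100, 5)]:
    print(
        f"N={n_docs:<3} n={doc_freq}  "
        f"okapi={okapi_idf(n_docs, doc_freq):.3f}  "
        f"plus={bm25_plus_idf(n_docs, doc_freq):.3f}"
    )
```

For N=2 and n=1 the Okapi value is exactly zero, while the BM25+ form gives log(3), roughly 1.1. The gap narrows as the corpus grows, which is why the choice matters most for users with only a handful of memories.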
Graceful degradation: If rank_bm25 is not installed, BM25 ranks are skipped. If embeddings are disabled, dense ranks are skipped. The system degrades without breaking.
MMR for Diversity
Consider a user who has been using an AI assistant for six months. They’ve mentioned Python dozens of times in different contexts. Recall has stored 15–20 memories, most encoding the same underlying fact: the user prefers Python.
When the user asks “help me think through technology choices for a new project”, search_memories returns the 20 most relevant memories. Twelve of them are variants of “user prefers Python.” The context window fills with redundant signal. The remaining 8 memories — about team size, infrastructure constraints, past project decisions — get crowded out.
This is the diversity problem. High memory density on a single topic creates redundancy that displaces coverage.
Maximal Marginal Relevance (MMR), introduced by Carbonell and Goldstein at SIGIR 1998, addresses this directly.
The MMR formula:
MMR = ArgMax_{d in R\S} [ λ · Sim1(d, q) − (1−λ) · max_{di in S} Sim2(d, di) ]
At each step, MMR selects the candidate that maximizes: (relevance to the query) minus (similarity to what’s already been selected). Lambda controls the tradeoff — λ=1 is pure relevance ranking, λ=0 is maximal diversity, λ=0.5 gives equal weight to both.
def mmr_rerank(
    candidates: list[dict],
    query_vec: list[float],
    lambda_: float = 0.5,
    k: int = 10,
) -> list[dict]:
    """MMR reranking (Carbonell & Goldstein, SIGIR 1998).

    Selects memories that are both relevant AND non-redundant.
    lambda_=0.5 recommended for memory injection.
    """
    from recall.embeddings import cosine_scores, embed

    texts = [c["text"] for c in candidates]
    doc_vecs = embed(texts)
    query_similarities = cosine_scores(query_vec, doc_vecs)
    selected = []
    remaining = list(range(len(candidates)))
    for _ in range(min(k, len(candidates))):
        if not remaining:
            break
        best_score = -float("inf")
        best_idx = remaining[0]
        for i in remaining:
            relevance = query_similarities[i]
            if not selected:
                redundancy = 0.0
            else:
                sims = cosine_scores(doc_vecs[i], [doc_vecs[j] for j in selected])
                redundancy = max(sims)
            mmr_score = lambda_ * relevance - (1 - lambda_) * redundancy
            if mmr_score > best_score:
                best_score = mmr_score
                best_idx = i
        selected.append(best_idx)
        remaining.remove(best_idx)
    return [candidates[i] for i in selected]
When to use MMR: Users with dense memory stores (20+ memories per topic); tasks requiring broad context (strategy, planning); when retrieved memories visibly cluster around one topic for broad queries.
When to skip MMR: Sparse memory stores where diversity isn’t a problem; highly focused queries where you want the top N most relevant memories specifically.
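The effect of λ is easiest to see on a toy case. The sketch below is a dependency-free version of the same greedy selection, run on three hand-made vectors standing in for embeddings: two near-duplicate "Python preference" memories and one unrelated memory. The vectors, memory ids, and the `mmr_select` helper are all illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query: list[float], docs: dict[str, list[float]],
               lambda_: float, k: int) -> list[str]:
    """Greedy MMR over {memory_id: vector}; returns the selected ids in order."""
    selected: list[str] = []
    remaining = list(docs)
    while remaining and len(selected) < k:
        def mmr_score(doc_id: str) -> float:
            relevance = cosine(query, docs[doc_id])
            # Redundancy is the closest match among already-selected memories.
            redundancy = max(
                (cosine(docs[doc_id], docs[s]) for s in selected), default=0.0
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0, 0.0]
memories = {
    "python_pref_a": [1.0, 0.05, 0.0],
    "python_pref_b": [1.0, 0.10, 0.0],  # near-duplicate of python_pref_a
    "team_size": [0.0, 1.0, 0.0],
}
pure_relevance = mmr_select(query, memories, lambda_=1.0, k=2)
balanced = mmr_select(query, memories, lambda_=0.5, k=2)
print(pure_relevance)  # both Python memories
print(balanced)        # one Python memory plus team_size
```

At λ=1 both slots go to the Python near-duplicates. At λ=0.5 the second slot goes to `team_size`, because its lower relevance is outweighed by the redundancy penalty on the duplicate.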
Context Budget Management
The context window is not unlimited, and injection is not free. Every memory injected displaces something else — conversation history, system prompt content, or generation headroom. The budget should be explicit. Treat the 25% ceiling on injected memories as hard: when the candidate set would exceed it, reduce limit first, then raise the threshold.
| Component | Recommended Token Share | Notes |
|---|---|---|
| System prompt | 10–15% | Agent persona, instructions, tool schemas. Keep stable — don't let memory injection crowd this. |
| Retrieved memories | 15–25% | Inject above conversation history. Position at start = better attention (Liu et al. 2023). |
| Conversation history | 40–50% | Recent turns, sliding window. Largest legitimate consumer. |
| Current query + headroom | 15–25% | Query elaboration, generation buffer. Memory injection should never eat into this. |
Liu et al. (2023, “Lost in the Middle”, arXiv:2307.03172) demonstrated that LLMs attend less to information placed in the middle of long contexts, with measurable performance degradation. Memories placed at the beginning of the context receive stronger attention than memories buried mid-history. Injection position matters — not just injection volume.
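One way to make the memory share explicit is to trim the scored list to a token budget before injection. This is a minimal sketch under stated assumptions: token counts are approximated by whitespace splitting, `fit_to_budget` is a hypothetical helper, and the `text` and `_score` fields mirror the dict shape used elsewhere in this issue.

```python
def fit_to_budget(memories: list[dict], context_tokens: int,
                  max_share: float = 0.25) -> list[dict]:
    """Keep the highest-scoring memories whose combined size fits the budget.

    Token counts are approximated by whitespace splitting (an assumption;
    swap in the model's tokenizer for real use).
    """
    budget = int(context_tokens * max_share)
    kept, used = [], 0
    for memory in sorted(memories, key=lambda m: m["_score"], reverse=True):
        cost = len(memory["text"].split())
        if used + cost > budget:
            break  # effectively lowers `limit` to what fits
        kept.append(memory)
        used += cost
    return kept


# Illustrative data: a tiny 40-token context gives a 10-token memory budget.
memories = [
    {"text": "likes Italian food", "_score": 0.73},
    {"text": "prefers Python for backend services", "_score": 0.91},
    {"text": "team has three engineers", "_score": 0.84},
]
kept = fit_to_budget(memories, context_tokens=40)
for memory in kept:
    print(memory["_score"], memory["text"])
```

Stopping at the first memory that does not fit keeps the injected set a strict prefix of the ranking, which matches the "reduce limit first" rule. A variant could skip oversized items and keep scanning, at the cost of injecting lower-ranked memories ahead of higher-ranked ones.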
As covered in Issue 3, the MANAGE phase applies decay scoring. The READ phase uses decay_score as one of four components in the ranking formula. The importance field written in Issue 2’s WRITE phase feeds directly into this formula as well.
The 4-component scoring formula (from _hybrid_search in Recall’s server.py):
import math

# Weight interpolation based on recency_weight parameter (0.0 to 1.0)
w_rrf = 0.70 - 0.30 * recency_weight     # RRF score weight
w_recency = 0.40 * recency_weight        # recency decay weight
w_import = 0.20 - 0.10 * recency_weight  # decay-adjusted importance weight
w_strength = 0.10                        # access frequency (constant)

# Weights always sum to 1.0:
# recency_weight=0 → (0.70, 0.00, 0.20, 0.10): relevance-dominant
# recency_weight=1 → (0.40, 0.40, 0.10, 0.10): recency-dominant

# max_ac is the highest access_count among the candidates.
# The last term assumes max_ac > 0; otherwise log(1 + max_ac) is zero.
score = (
    w_rrf * rrf_score
    + w_recency * math.exp(-age_days / (1 + access_count))
    + w_import * (importance * decay_score)  # WRITE + MANAGE feeds here
    + w_strength * math.log(1 + access_count) / math.log(1 + max_ac)
)
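For experimenting with the formula outside the server, a self-contained sketch might look like the following. The function names are hypothetical, and the zero guard on the strength term is an added assumption rather than confirmed Recall behavior.

```python
import math


def component_weights(recency_weight: float) -> tuple[float, float, float, float]:
    """Return (w_rrf, w_recency, w_import, w_strength) for recency_weight in [0, 1]."""
    return (
        0.70 - 0.30 * recency_weight,
        0.40 * recency_weight,
        0.20 - 0.10 * recency_weight,
        0.10,
    )


def composite_score(rrf_score: float, age_days: float, access_count: int,
                    max_access_count: int, importance: float,
                    decay_score: float, recency_weight: float) -> float:
    """Combine the four components using the interpolated weights."""
    w_rrf, w_recency, w_import, w_strength = component_weights(recency_weight)
    recency = math.exp(-age_days / (1 + access_count))
    # Guard the normalizer: log(1 + 0) is 0 when nothing has been accessed yet.
    strength = (
        math.log(1 + access_count) / math.log(1 + max_access_count)
        if max_access_count > 0 else 0.0
    )
    return (
        w_rrf * rrf_score
        + w_recency * recency
        + w_import * importance * decay_score
        + w_strength * strength
    )


for rw in (0.0, 0.3, 1.0):
    weights = component_weights(rw)
    print(rw, [round(w, 2) for w in weights], round(sum(weights), 6))
```

At `recency_weight=0.3` the weights come out to roughly (0.61, 0.12, 0.17, 0.10), still summing to 1.0, so relevance dominates with a modest freshness bonus.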
Recall’s retrieval controls map directly to budget management:
limit (default 20, max 50): caps the number of memories returned. This is the primary budget lever.
recency_weight (0.0–1.0): shifts which memories fill the budget, not how many.
Failure Mode: Retrieval Blindness
The Italian food story is an over-injection failure. There’s an opposite failure mode, less visible but equally damaging: retrieval blindness.
Threshold too high. The agent retrieves nothing, or retrieves so rarely that the memory layer provides no value. The agent acts as if it has no memory, even when memories exist and are relevant.
Retrieval blindness is harder to catch than over-injection. Over-injection shows up as annoying, repetitive responses that users notice. Retrieval blindness shows up as an agent that simply doesn’t feel personalized — easy to attribute to other causes.
Detection signal: Monitor the total field in search_memories responses. If total: 0 is common across many queries for a user who has stored memories, retrieval blindness is likely.
def retrieval_blindness_rate(search_logs: list[dict]) -> float:
    """Fraction of search calls that returned zero memories."""
    empty = sum(1 for log in search_logs if log["data"]["total"] == 0)
    return empty / len(search_logs) if search_logs else 0.0

# If blindness_rate > 0.5, diagnose:
# - Threshold is too high (lower it)
# - rank_bm25 not installed (install it)
# - Embedding model not loaded (memories get no dense ranks)
# - Memory store is sparse for this user (check get_memory_stats)
A second approach: call inspect_memories when search_memories returns empty. If memories exist but search returns nothing, the ranking function is suppressing them. Adjust threshold or ranker configuration.
Decision Guide
Use recency_weight = 0.0 when:
├── Task involves facts that don't change (user preferences, stated constraints)
├── Relevance matters more than freshness
└── Memory store has old but accurate information
Use recency_weight = 0.7–1.0 when:
├── Task involves current state (active projects, recent decisions)
├── User's situation changes frequently
└── Old memories may be stale
Use limit = 5 when:
├── Context window is tight
├── Query is focused (one specific topic)
└── Memory store is dense (high redundancy risk)
Use limit = 15–20 when:
├── Query is broad (strategy, planning, preferences overview)
├── Memory store is sparse
└── MMR is enabled (handles deduplication)
Enable embeddings when:
├── Users use varied vocabulary (paraphrase matching needed)
├── Semantic intent matters more than exact terms
└── Memory store has > 50 memories per user
Skip embeddings when:
├── Deployment is resource-constrained
├── Queries are primarily keyword-based (technical terms, IDs)
└── BM25 alone achieves acceptable injection rate
The tuning workflow: deploy with threshold=0.7, limit=20, recency_weight=0.3. After 200+ queries, compute injection rate. Above 70%: raise threshold. Below 20%: lower threshold or check ranker installation. Many redundant memories on one topic: enable MMR. Document final parameters and why each was chosen.
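The adjustment step of this workflow can be sketched as a small helper. The 70% and 20% bounds, the 0.05 step, and the 200-query minimum come from the workflow above; the function itself and its parameter names are hypothetical.

```python
def suggest_threshold(current: float, injection_rate: float, n_queries: int,
                      high: float = 0.70, low: float = 0.20,
                      step: float = 0.05, min_queries: int = 200) -> float:
    """Return the next threshold to try, or the current one if no change is indicated."""
    if n_queries < min_queries:
        return current  # not enough data to judge yet
    if injection_rate > high:
        return round(min(current + step, 1.0), 2)  # over-injection: tighten the gate
    if injection_rate < low:
        return round(max(current - step, 0.0), 2)  # possible blindness: loosen the gate
    return current


print(suggest_threshold(0.70, injection_rate=0.90, n_queries=250))
print(suggest_threshold(0.70, injection_rate=0.10, n_queries=250))
print(suggest_threshold(0.70, injection_rate=0.45, n_queries=250))
```

A low injection rate can also mean a missing ranker or a sparse store, so a suggested decrease is best treated as a prompt to check those first, not as an automatic change.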
Production Checklist
| Item | Score |
|---|---|
| Similarity threshold set (starting at 0.7) and injection rate monitored over 200+ real queries. Target: 40–55%. Above 70% = threshold too low. Below 20% = threshold too high or ranker missing. | |
| rank-bm25 package installed and BM25Plus active. Verify _bm25_ranks returns non-None. Without it, retrieval falls back to dense-only, and exact identifier queries will fail silently. | |
| recency_weight chosen deliberately. Use 0.0–0.3 for fact-heavy stores where relevance dominates. Use 0.7–1.0 for time-sensitive stores where freshness matters. | |
| MMR evaluated for users with 20+ memories per topic. Start with λ=0.5. Skip for sparse stores, where diversity is not a problem because there are few candidates. | |
| Retrieval blindness baseline established. If total: 0 rate exceeds 20% for users with stored memories, diagnose before deploying. Do not ship a memory layer that silently retrieves nothing. | |
| Context budget allocation documented and enforced. Memory injection share is bounded (15–25% of context), not open-ended. Injection position is explicit (before history, not buried mid-context). | |
Resources
This is Issue 4 of Memory in AI Systems. Issue 3 covered the MANAGE phase — decay scoring and the consolidation worker. Issue 5 covers Four Design Patterns — how retrieval configuration choices compose into complete memory architectures.
Recall (szl-recall on PyPI) is the open-source reference implementation for this series. The search_memories MCP tool and _hybrid_search function discussed throughout are in recall/src/recall/server.py.
Until next issue,
Sentient Zero Labs