Building Effective Tools for AI Issue 4/7

Tool Design in the Real World

How Recall's 8-tool design collapsed to 5 tools — and why designing for the LLM's decision surface, not your backend's capability, is the key to lower planning error rates.

May 12, 2026 · 21 min read · Sentient Zero Labs


Recall v0.1 shipped on a Tuesday afternoon with eight tools, a working FastMCP server, and a demo that looked good.

The tools were designed carefully. store_memory persisted memories for background extraction. search_memories retrieved results by relevance. search_memories_recent handled temporal queries — a separate tool because “what happened last week” seemed different enough from “find everything about X” to warrant its own interface. inspect_memories listed stored memories with pagination. get_memory_by_id fetched a specific memory when the agent already knew the ID. update_memory modified an existing memory’s content. delete_memory handled GDPR erasure. get_memory_stats returned health metrics.

Eight tools. Every one defensible on its own. None of them obviously wrong.

The problem appeared in the logs within a day.

Three failure patterns emerged. First: on temporal queries like “What did we talk about last Tuesday regarding the deployment issue?”, the agent had both search_memories and search_memories_recent available and couldn’t reliably distinguish which to use, so it called both, got overlapping results, and spent two round trips where one sufficed. Second: when the agent wanted to review a specific result, it would call inspect_memories to browse, then get_memory_by_id to fetch the same record in full, a two-tool sequence that should have been one call. Third, and most consequential: update_memory caused persistent confusion. The agent would reason “I need to fix this memory” and call update_memory, but on a subsequent turn it would call delete_memory followed by store_memory for what appeared to be the same intent, unsure whether to update in place or replace. Three tools covered essentially the same modify-a-memory workflow, with no clear decision rule.

On cleaner queries the failure rate was lower, but on the ambiguous queries that dominate real usage — partially specified, temporally framed, revisiting something from a prior session — the agent would hedge, calling more tools than necessary or calling them in the wrong order.

When we traced the failures, the pattern was consistent: tool selection errors — the agent choosing the wrong tool on the first attempt, or calling multiple tools where one should have sufficed — accounted for roughly 40% of all tool call failures. Not network errors. Not execution bugs. Planning errors. The LLM was being asked to make a decision we hadn’t designed carefully enough.

The redesign took a day. The 5-tool version shipped the following week: store_memory, search_memories, inspect_memories, delete_memory, get_memory_stats. In internal testing, planning errors dropped to around 8%. The agent called the right tool on the first attempt 94% of the time.

The agent wasn’t choosing wrong. We’d given it choices that were too similar.

Mental Model: Design for the LLM’s Decision Surface

The wrong question when designing a tool layer is: “What can my backend do?”

The right question is: “What does the agent need to accomplish?”

Those questions seem similar. They produce completely different tool designs.

A backend that can do relevance search and recency-weighted search has two retrieval operations. The agent has one goal: find the right memory. The distinction between those retrieval strategies is an implementation detail — it belongs inside the tool as a tunable parameter, not between tools. When it lives between tools, the LLM has to make a choice it isn’t equipped to make. It doesn’t know whether the user’s query is more temporal or more semantic. It doesn’t know that recency_weight=0.8 handles “what happened last week” better than a dedicated date-range tool. It knows the user’s query and the tool descriptions you wrote, and it’s going to make its best guess — which will be wrong roughly as often as the query is ambiguous.

Database operations and LLM operations are not the same thing. Every tool you add is a choice the LLM has to make. Design for the LLM’s decision-making burden, not your backend’s capability.

Here’s what that shift looks like in practice, using Recall’s 8-to-5 reduction — which tools were removed, and why:

  FINE-GRAINED                                            COARSE-GRAINED
  (one tool per DB operation)                             (one tool per user intent)
  ────────────────────────────────────────────────────────────────────────────────

  search_memories()                    ←───┐  recency_weight param replaces
  search_memories_recent()             ←───┘  separate tool  ───►  search_memories()
                                               (recency_weight: float = 0.3)

  inspect_memories()                   ← stays — list intent
  get_memory_by_id()                   ← REMOVED — agent was calling inspect_memories
                                         then get_memory_by_id for the same record;
                                         use inspect_memories directly

  update_memory()                      ← REMOVED — LLM confused update vs delete+store;
                                         explicit delete_memory() + store_memory()
                                         is clearer and safer

  delete_memory()                      ← stays separate — consequential, hard-to-reverse

  store_memory()                       ← stays — write intent
  get_memory_stats()                   ← stays — health check


  Recall v0.1:  8 tools   →   Recall v0.2:  5 tools
  Planning error rate:   ~40%  →   ~8%  (internal testing)

  ──────────────────────────────────────────────────────────────────────────────
  The production failure signal: the agent calls 2+ tools per query
  because the tools describe similar things. That is a design problem,
  not a model problem.
  ──────────────────────────────────────────────────────────────────────────────

Fine-grained tools expose implementation. Coarse-grained tools expose intent.

Fine-grained tools work in demos because demos are curated. The queries are clean. You know which tool to call because you wrote the demo. In production, users write the queries, and production queries are ambiguous, underspecified, and partially correct. A system designed for demo queries will fail on production queries — not because the tools are broken, but because the LLM is making the implementation choice the developer should have made.

The insight that makes the reframe concrete: the implementation details belong inside the tool, not between tools. Here’s where that line falls:

  What the agent knows:              What the tool handles internally:
  ─────────────────────              ────────────────────────────────

  "Store this memory"           ───► idempotency check
                                     batch or single?
                                     extraction queued or inline?
                                     which table to write?

  "Find memories about X"       ───► BM25 keyword search
                                     vector similarity (v0.2)
                                     RRF merge (k=60)
                                     recency multiplier
                                     result ranking

  "Show me my memories"         ───► pagination query
                                     total count
                                     has_more calculation
                                     column selection

  "Delete this memory"          ───► ownership check (user_id match)
                                     hard delete vs. soft delete
                                     cascade cleanup

  ─────────────────────────────────────────────────────────────────────────
  The LLM sees: what to do.    The tool handles: how to do it.
  Mixing these two is where 8 tools became necessary.
  ─────────────────────────────────────────────────────────────────────────

  One parameter that IS exposed (and why):
  recency_weight: float = 0.3   ← meaningful choice for the caller
                                   (relevance vs. recency tradeoff)
                                   The search algorithm is NOT exposed.
                                   Which DB index to use is NOT exposed.

The LLM maps to user intent. Everything below the line is the tool’s problem.

Notice recency_weight. It’s exposed as a parameter on search_memories even though it’s an implementation detail in a narrow sense. It’s exposed because it represents a genuine caller choice: whether recent memories should rank above older but more semantically relevant ones. That’s a tradeoff the caller can reason about — “the user just asked about something from last week, so recency matters.” Which search algorithm to run is not a tradeoff the caller can reason about. It’s hidden.

When fine-grained is actually right. The argument here isn’t “always consolidate everything.” Fine-grained tools are correct in three specific cases:

1. The agent has genuinely different goals for each variant, not just different implementations of the same goal. delete_memory and search_memories stayed separate because deletion is a consequential, hard-to-reverse action. When the agent is choosing between “find a memory” and “delete a memory,” those are different intents with asymmetric consequences. The cost of choosing wrong is not symmetric.
2. The latency profiles are meaningfully different and the caller needs to choose based on time budget: for instance, a fast approximate search vs. an expensive precise one where the caller has a strict SLA to meet.
3. Different permission levels are required. Admin tools and user-facing tools must be explicitly separated because they carry different authorization requirements.

The consolidation heuristic is simple: if two tools serve the same user goal, consolidation is probably right. If they serve different goals that happen to touch the same database table, separation is right. The question isn’t “are these similar?” — it’s “would the agent ever have a reason to choose one and not the other?”

The 8-to-5 reduction wasn’t about simplicity for its own sake. It was about reducing the surface area of decisions the LLM has to make correctly on every query. The model that now calls search_memories and receives hybrid results doesn’t need to know about BM25. It needs to know whether it found the right memory.

Implementation: Before/After Tool Interface

The before/after comparison shows where the implementation detail boundary lives — what the LLM sees versus what the tool handles internally.

The ReAct architecture (Yao et al., 2022) that underpins modern tool-using agents depends on the model reasoning correctly about which tool to call and when. What that formalization makes visible, and what Recall’s logs bore out, is that tool count acts as a tax on agent planning quality. Every additional tool the agent must reason about is another opportunity for the selection step to fail. Intent-aligned design is the engineering response to that constraint: fewer tools, more internal complexity, better selection accuracy.

Here’s the full before/after:

# ══════════════════════════════════════════════════════════════════════════════
# BEFORE: Recall v0.1 — 8 tools
# ══════════════════════════════════════════════════════════════════════════════

@mcp.tool()
async def store_memory(user_id: str, text: str, topic: str, idempotency_key: str) -> dict:
    """Store a memory for background extraction."""
    ...

@mcp.tool()
async def search_memories(user_id: str, query: str) -> dict:
    """Search memories by relevance."""
    ...

@mcp.tool()
async def search_memories_recent(user_id: str, query: str, days: int = 7) -> dict:
    """Search memories from the last N days."""
    # LLM had to decide: is this a recency query or a relevance query?
    # Wrong call ~30% of the time on temporal queries.
    ...

@mcp.tool()
async def inspect_memories(user_id: str, limit: int = 20, offset: int = 0) -> dict:
    """List stored memories with pagination."""
    ...

@mcp.tool()
async def get_memory_by_id(user_id: str, memory_id: str) -> dict:
    """Fetch a single memory by its ID."""
    # Agent pattern: call inspect_memories to browse, then get_memory_by_id
    # to fetch the same record. Two calls where one sufficed.
    ...

@mcp.tool()
async def update_memory(user_id: str, memory_id: str, text: str, topic: str) -> dict:
    """Update an existing memory's content."""
    # Agent confusion: update_memory or delete_memory + store_memory?
    # Both patterns appeared in logs for semantically identical intent.
    ...

@mcp.tool()
async def delete_memory(user_id: str, memory_id: str) -> dict:
    """Permanently delete a memory by ID."""
    ...

@mcp.tool()
async def get_memory_stats(user_id: str) -> dict:
    """Return memory counts."""
    ...


# ══════════════════════════════════════════════════════════════════════════════
# AFTER: Recall v0.2 — 5 tools (1 search, 1 store, 1 inspect, 1 delete, 1 stats)
# ══════════════════════════════════════════════════════════════════════════════

@mcp.tool()
async def search_memories(
    query: str,
    limit: int = 20,
    recency_weight: float = 0.3,  # 0=relevance only, 1=strongest recency weighting
) -> dict:
    """Search memories using hybrid retrieval (~200-500ms). recency_weight 0-1."""
    ...

@mcp.tool()
async def store_memory(
    text: str,
    topic: str,
    idempotency_key: str,
) -> dict:
    """Store conversation messages for background memory extraction.
    Returns immediately — extraction runs async. Safe to retry with the same key."""
    ...

@mcp.tool()
async def inspect_memories(limit: int = 20, offset: int = 0) -> dict:
    """List stored memories with pagination. Default 20 per page, max 50."""
    ...

@mcp.tool()
async def delete_memory(memory_id: str) -> dict:
    """Permanently delete a memory by ID. This action cannot be undone."""
    ...

@mcp.tool()
async def get_memory_stats() -> dict:
    """Return memory counts and storage stats for the current user. Fast health check."""
    ...

# Note: user_id removed from all signatures — injected by auth middleware via ContextVar.
# Note: delete_memory stays separate — not consolidated. Different intent, different consequence.
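The user_id note above is easier to picture with a small sketch. The one below shows one way a ContextVar can carry the authenticated user into a tool body. The names user_id_ctx and handle_request are assumptions for illustration, and handle_request stands in for whatever the real auth middleware does; it is not Recall's actual middleware.

```python
import asyncio
from contextvars import ContextVar

# Assumed name: the article only says user_id is injected "via ContextVar".
user_id_ctx: ContextVar[str] = ContextVar("user_id")


async def get_memory_stats() -> dict:
    """Tool body with no user_id parameter, so the LLM never has to supply one."""
    user_id = user_id_ctx.get()  # raises LookupError if nothing set it
    return {"status": "ok", "data": {"user_id": user_id, "count": 0}, "error": None}


async def handle_request(authenticated_user: str) -> dict:
    """Hypothetical stand-in for auth middleware: set the user, run the tool, reset."""
    token = user_id_ctx.set(authenticated_user)
    try:
        return await get_memory_stats()
    finally:
        user_id_ctx.reset(token)


async def demo() -> list:
    # gather() runs each coroutine as its own task, and each task works on a
    # copy of the context, so concurrent requests do not see each other's user.
    return await asyncio.gather(handle_request("alice"), handle_request("bob"))


results = asyncio.run(demo())
```

The design point is the same one the signatures make: identity is a fact the server already knows, so it should not be a parameter the model can get wrong.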

There are five decisions worth walking through, some less obvious than others.

The search consolidation. This is the most consequential change. search_memories and search_memories_recent became one tool — but search_memories_recent wasn’t merged by absorbing its logic as a hidden branch. It was replaced by a parameter: recency_weight: float = 0.3. That’s a deliberate design decision. Recency versus relevance is a meaningful tradeoff the caller can reason about — “the user just asked about something from last week, so set recency_weight=0.8.” Which retrieval strategy to use internally is not a tradeoff the caller should reason about. The alternative — keeping search_memories_recent as a separate tool because temporal queries are common — fails for a specific reason: “What happened last week about the deployment?” is not a pure recency query. It’s a hybrid query where time is one signal among many. A dedicated recency tool trains the agent to treat all temporal queries as pure filters. recency_weight=0.8 handles this correctly: it upweights recent results while still ranking by relevance.
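To make the tradeoff concrete, here is a small numeric sketch using the same multiplier form as the search_memories implementation later in this issue. The two memories and their relevance scores are made up for illustration; only the formula comes from Recall.

```python
def recency_multiplier(age_days: float, recency_weight: float) -> float:
    """Multiplier applied to a relevance score, mirroring the formula in this issue."""
    if not 0.0 <= recency_weight <= 1.0:
        raise ValueError("recency_weight must be between 0.0 and 1.0")
    recency_score = 1.0 / (1.0 + age_days * 0.1)  # 1.0 today, 0.5 at 10 days
    return (1 - recency_weight) + recency_weight * recency_score


def final_score(memory: dict, recency_weight: float) -> float:
    return memory["relevance"] * recency_multiplier(memory["age_days"], recency_weight)


# Hypothetical memories: the older one matches the query better.
old_memory = {"age_days": 30, "relevance": 1.0}
new_memory = {"age_days": 2, "relevance": 0.6}

for weight in (0.3, 0.8):
    print(weight, final_score(old_memory, weight), final_score(new_memory, weight))
```

At the default 0.3 the older, more relevant memory still ranks first; at 0.8 the recent one overtakes it. Neither setting discards relevance, which is the behavior a hard date filter cannot give you.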

The get_memory_by_id removal. The logs showed a consistent two-call sequence: inspect_memories to browse the list, then get_memory_by_id to fetch the same record the agent had just seen. inspect_memories already returns full memory content. get_memory_by_id was adding a second round trip for data already present in the first response. Removing it forced the agent to work with what inspect_memories returned — which was sufficient. This is the simplest case of redundancy: a tool that duplicates data already returned by another tool in normal usage.

The update_memory removal. This one required the most care. update_memory was a genuinely useful operation — modifying a memory’s text without deleting and recreating it. The problem was behavioral: in the logs, the agent would sometimes call update_memory, and other times call delete_memory followed by store_memory, for what appeared to be the same user intent. The model had no reliable rule for when to update in place versus delete-and-replace. Removing update_memory collapsed the choice: the only path for modifying a memory is now delete_memory + store_memory. Two explicit steps, clear intent at each step, no ambiguity about which pattern to use.

The store tool. store_memory existed in v0.1 and is functionally unchanged in v0.2: the same idempotency pattern from Issue 01 and the same parameters, except that user_id now comes from auth middleware, as it does for every other tool. What changed is that the adjacent tools (update_memory, get_memory_by_id) that were creating decision overhead around it were removed. store_memory is now the unambiguous write path.

What wasn’t merged. delete_memory stayed separate. Not for technical reasons — deletion could be a parameter of a general manage_memory tool. It stayed separate because erasure is a consequential, hard-to-reverse action. When the LLM is choosing between “find a memory” and “delete a memory,” those are genuinely different intents with asymmetric consequences. If the model confuses search_memories and inspect_memories, the cost is cosmetic — a list view instead of a search result. If it confuses search_memories and delete_memory, the cost is data loss. The asymmetry justifies the separation.

Now here’s the internal implementation of search_memories — the code the LLM never sees:

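import time
from contextvars import ContextVar

from rank_bm25 import BM25Okapi

# Set per request by the auth middleware (see the note on user_id above).
# `mcp` and `_fetch_all_memories` are defined elsewhere in the server module.
user_id_ctx: ContextVar[str] = ContextVar("user_id")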

@mcp.tool()
async def search_memories(
    query: str,
    limit: int = 20,
    recency_weight: float = 0.3,
) -> dict:
    """Search memories using hybrid retrieval (~200-500ms). recency_weight 0-1."""
    if not 0.0 <= recency_weight <= 1.0:
        return {
            "status": "error",
            "error": "recency_weight must be between 0.0 and 1.0.",
            "code": "INVALID_PARAM",
        }

    user_id = user_id_ctx.get()
    # Simplified: production path uses pgvector ANN search + pre-filter, not a full-table scan
    memories = await _fetch_all_memories(user_id)

    if not memories:
        return {"status": "ok", "data": {"results": [], "total": 0}, "error": None}

    # BM25 keyword search — handles exact terms, names, dates in text
    tokenized = [m["text"].lower().split() for m in memories]
    bm25 = BM25Okapi(tokenized)
    bm25_scores = bm25.get_scores(query.lower().split())

    # RRF merge (k=60, Cormack et al. 2009) — score-agnostic, no normalization needed
    # v0.1: merges BM25 with itself (placeholder for v0.2 vector scores)
    # v0.2: replace bm25_scores copy with cosine similarity from pgvector
    k = 60
    bm25_ranked = sorted(range(len(memories)), key=lambda i: bm25_scores[i], reverse=True)
    vector_ranked = bm25_ranked  # v0.1 stub — same ranking until vectors ship

    rrf_scores = {}
    for rank, idx in enumerate(bm25_ranked):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank + 1)
    for rank, idx in enumerate(vector_ranked):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank + 1)

    # Apply recency multiplier post-merge
    now = time.time()
    for idx, mem in enumerate(memories):
        age_days = (now - mem["created_ts"]) / 86400
        recency_score = 1.0 / (1.0 + age_days * 0.1)  # hyperbolic decay: 1.0 now, 0.5 at 10 days
        # Note: RRF is rank-based, so every memory has a positive score here, even
        # one whose BM25 score was 0. recency_weight rescales that relevance score;
        # even at 1.0 the result is relevance * recency, not a pure time-sort.
        rrf_scores[idx] = rrf_scores.get(idx, 0) * (
            (1 - recency_weight) + recency_weight * recency_score
        )

    ranked = sorted(rrf_scores.keys(), key=lambda i: rrf_scores[i], reverse=True)
    results = [memories[i] for i in ranked[:limit]]

    return {"status": "ok", "data": {"results": results, "total": len(results)}, "error": None}

# What the LLM never had to know:
# • BM25 vs. vector vs. hybrid — implementation choice, not user intent
# • RRF merge algorithm — internal fusion strategy
# • recency decay formula — tuning detail
# What the LLM CAN control:
# • recency_weight — meaningful tradeoff the caller can reason about

A few things worth noting in the implementation. The RRF algorithm (Reciprocal Rank Fusion, Cormack et al. 2009) is score-agnostic — it operates on rank positions rather than raw scores, which means you don’t need to normalize BM25 scores against cosine similarities. That makes the v0.1-to-v0.2 upgrade straightforward: replace the vector_ranked = bm25_ranked stub with actual pgvector cosine similarity rankings, and the RRF merge handles the combination without any score-scale conversion. The recency multiplier is applied post-merge as a scalar weight on the RRF score — simple, tunable, and the LLM’s recency_weight parameter maps directly to the recency_weight variable in the formula.
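The score-agnostic property is easiest to see when the two rankings actually disagree, which the v0.1 stub never shows. The sketch below fuses two hypothetical rankings with the same k=60 formula. The memory IDs and both orderings are invented, and rrf_merge is an illustrative helper, not a function from the Recall codebase.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking, rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position + 1)
    return scores


# Hypothetical orderings: keyword search and vector search disagree.
bm25_ranked = ["m1", "m2", "m3"]
vector_ranked = ["m3", "m1", "m2"]

fused = rrf_merge([bm25_ranked, vector_ranked])
order = sorted(fused, key=fused.get, reverse=True)
print(order)
```

Only positions enter the formula, so it makes no difference that BM25 scores are unbounded while cosine similarities sit in a fixed range. m1 wins here because it is near the top of both lists, not because either raw score was large.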

What the LLM sees: one tool, one tunable parameter, a docstring that mentions expected latency. What it doesn’t see: the BM25 tokenization, the RRF implementation, the recency decay formula. That separation — what’s exposed vs. what’s hidden — is the design decision that made the ~40% planning error rate fall to ~8% in testing.

Failure Modes

The consolidation decision has two ways to fail: too far in one direction, not far enough in the other. There’s a third failure that emerges after consolidation works: hiding complexity so well that the LLM treats an expensive operation as a cheap one.

TOOL PROLIFERATION

What happens:  Multiple tools with overlapping purposes: same user intent,
               different implementation. LLM calls 2+ tools per query,
               unsure which to use. "What did we talk about last Tuesday
               about deployment" → agent calls search_memories and
               search_memories_recent before giving up. Latency doubles.
               Correct results available from the first call.
               Or: agent calls inspect_memories, then get_memory_by_id
               for the same record — two round trips, one needed.

Root cause:    Each logical variation in behavior gets its own tool. The
               engineer designed for the operation, not for what the agent
               needs to accomplish.

How to detect: Log tool selection events. Count queries where 2+ tools
               with similar names (search_*, get_*) are called in the
               same agent turn. Above 10% of turns: consolidation needed.
               Recall's signal: 40% of all failures traced to selection
               errors, not execution errors (internal testing).

Fix:           Identify the user intent behind each cluster. Merge the
               cluster into one tool. Move implementation differences to
               internal routing. Expose only tunable parameters that
               represent genuine caller choices (recency_weight, limit).
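The detection step above can be sketched in a few lines. This assumes a log of tool selection events with turn_id and tool fields; that shape is an assumption for illustration, not Recall's actual log schema. Grouping by name prefix follows the search_*, get_* idea, and the optional clusters map is a hypothetical extension for pairs whose names do not share a prefix.

```python
from collections import defaultdict
from typing import Optional


def overlapping_turn_rate(events: list[dict], clusters: Optional[dict] = None) -> float:
    """Fraction of turns where 2+ distinct tools from the same cluster were called."""
    clusters = clusters or {}
    tools_by_turn: dict = defaultdict(set)
    for event in events:
        tools_by_turn[event["turn_id"]].add(event["tool"])
    if not tools_by_turn:
        return 0.0
    flagged = 0
    for tools in tools_by_turn.values():
        counts: dict = defaultdict(int)
        for name in tools:
            # Default cluster is the name prefix, e.g. "search" for search_memories.
            counts[clusters.get(name, name.split("_", 1)[0])] += 1
        if any(n >= 2 for n in counts.values()):
            flagged += 1
    return flagged / len(tools_by_turn)


events = [
    {"turn_id": "t1", "tool": "search_memories"},
    {"turn_id": "t1", "tool": "search_memories_recent"},
    {"turn_id": "t2", "tool": "search_memories"},
    {"turn_id": "t3", "tool": "inspect_memories"},
    {"turn_id": "t3", "tool": "get_memory_by_id"},
    {"turn_id": "t4", "tool": "store_memory"},
]
read_cluster = {"inspect_memories": "read", "get_memory_by_id": "read"}

prefix_rate = overlapping_turn_rate(events)
cluster_rate = overlapping_turn_rate(events, read_cluster)
```

Prefix matching alone catches the search pair but misses inspect_memories followed by get_memory_by_id, so a hand-maintained cluster map is probably worth the few lines it costs.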

OVER-CONSOLIDATION

What happens:  One memory_manager tool with 12 parameters covering all
               operations: store, search, inspect, delete, stats. LLM
               confuses parameters, fills optional fields with defaults
               that conflict, or triggers a delete when it meant to search.
               Parameter validation errors become the dominant failure class.

Root cause:    Consolidation past intent boundaries. Single tool for
               genuinely different goals with genuinely different
               consequences.

How to detect: Parameter count per tool. If a tool has 8+ parameters,
               it is likely covering two distinct intents. Count parameter
               validation errors in logs — distinct from execution errors.
               High validation error rate means the LLM is struggling with
               the interface.

Fix:           Each tool should have 2-5 parameters. At 8+, ask: is this
               one intent or two? If two: split. The rule is one intent
               per tool, not one tool per system.
               delete_memory is one intent with unique consequence.
               It stays separate even when other tools consolidate.
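The parameter-count check is simple to automate. The sketch below uses the standard library's inspect module on plain functions; memory_manager is a hypothetical over-consolidated tool written for this example, and the threshold of 8 is the one from the text. It assumes you can reach the undecorated functions, since a framework decorator may wrap them in its own object.

```python
import inspect
from typing import Callable


def parameter_counts(tools: list[Callable]) -> dict[str, int]:
    """Number of declared parameters for each tool function."""
    return {fn.__name__: len(inspect.signature(fn).parameters) for fn in tools}


def flag_wide_tools(tools: list[Callable], threshold: int = 8) -> list[str]:
    """Names of tools at or above the threshold, i.e. candidates for a split."""
    return [name for name, n in parameter_counts(tools).items() if n >= threshold]


async def search_memories(query: str, limit: int = 20, recency_weight: float = 0.3) -> dict:
    ...


# Hypothetical over-consolidated tool: one function covering five intents.
async def memory_manager(
    action: str,
    text: str = "",
    topic: str = "",
    idempotency_key: str = "",
    query: str = "",
    limit: int = 20,
    offset: int = 0,
    recency_weight: float = 0.3,
    memory_id: str = "",
) -> dict:
    ...


counts = parameter_counts([search_memories, memory_manager])
wide = flag_wide_tools([search_memories, memory_manager])
```

A check like this fits in a unit test, so a tool that quietly grows a ninth parameter fails the build instead of surfacing later as validation errors in the logs.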

HIDDEN COMPLEXITY

What happens:  search_memories sounds like a cache lookup — instant,
               cheap, safe to call repeatedly. It is actually running
               BM25 tokenization, RRF merge, and recency scoring on
               the full memory corpus. Agent calls it every turn to
               "refresh context." With 100 concurrent users: latency
               spikes, CPU pegged, timeouts start appearing.

Root cause:    Tool docstring communicates purpose but not performance
               expectations. The LLM optimizes for goal completion, not
               for resource cost — unless cost is communicated.

How to detect: P95 latency per tool in structured logs. Tools with high
               variance (50ms sometimes, 800ms other times) are doing
               non-obvious work. search_memories P95 at scale: ~500ms.

Fix:           Add expected latency to the docstring first line.
               "Search memories using hybrid retrieval (~200-500ms)."
               Not to scare the agent — to let it make informed decisions
               about call frequency. It will call less aggressively.
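For the detection step, a per-tool P95 can be computed from structured logs with the standard library alone. The sample latencies below are invented, and the (tool, milliseconds) tuple shape is an assumption about what your logs contain.

```python
import statistics
from collections import defaultdict


def p95_by_tool(samples: list[tuple[str, float]]) -> dict[str, float]:
    """95th-percentile latency in milliseconds for each tool name."""
    by_tool: dict = defaultdict(list)
    for tool, latency_ms in samples:
        by_tool[tool].append(latency_ms)
    result = {}
    for tool, values in by_tool.items():
        if len(values) < 2:
            result[tool] = values[0]  # a single sample has no spread to interpolate
        else:
            # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
            result[tool] = statistics.quantiles(values, n=20, method="inclusive")[-1]
    return result


# Invented samples: search is usually fast but occasionally very slow.
samples = (
    [("search_memories", 50.0)] * 19
    + [("search_memories", 800.0)]
    + [("get_memory_stats", 10.0)] * 4
)
p95 = p95_by_tool(samples)
```

Comparing each tool's P95 against its median is a quick way to spot the high-variance tools described above: a tool whose P95 sits far above its typical latency is doing work its name does not advertise.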

The third failure — hidden complexity — is the one that appears after the first two are fixed. Once consolidation works and the agent is selecting correctly, the next thing to monitor is call frequency. A tool that runs fast in testing may run at 500ms under production load. The LLM won’t reduce call frequency unless the docstring tells it there’s a cost to calling often. Notice that the search_memories docstring already includes ~200-500ms — that line is doing active work, not just documentation.

Decision Guide: When to Split, When to Merge

The decision isn’t binary — it’s a question about the shape of the user’s intent and the asymmetry of the consequences.

Consolidate tools when:                Split tools when:
────────────────────────────────────   ─────────────────────────────────────
Same user intent, different impl       Genuinely different goals
Implementation details only            Different consequence profiles
Planning errors > 10%                  Different permission levels required
Tool names overlap in purpose          Different latency profiles by design
One intent, multiple variants          Agent needs fine-grained control

The rule with the most practical leverage: start coarse. The 5-tool Recall interface isn’t the ideal final state — it’s the correct starting state. As Recall grows, search_memories might need a variant for structured data versus unstructured text. store_memory might need a fast path for high-frequency logging sessions where inline extraction is too slow. Those splits will be driven by production data: specific, documented edge cases where the consolidated tool can’t serve two distinct needs well. Not by engineering instinct, and not by what the database is capable of.

The consolidation discipline matters more than the consolidation itself. Teams that design tool layers by mapping database operations to tool names will end up at 8, 12, or 20 tools and wonder why the agent seems confused. Teams that start with the question “what does the agent need to accomplish?” will design 4-6 tools that cover the same operations more reliably.

For most teams: start with the coarsest design that covers your use cases. Measure planning error rate — wrong tool chosen on first attempt, tracked per query turn. Split only when a specific, documented production case demands it. The burden of proof is on splitting, not on consolidation.
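Measuring planning error rate needs a definition you can compute. One minimal version, assuming each logged turn records the ordered tool calls and a hand-labeled expected_tool (both field names are hypothetical), looks like this:

```python
def planning_error_rate(turns: list[dict]) -> float:
    """Share of labeled turns whose first tool call was not the expected tool."""
    labeled = [t for t in turns if t.get("expected_tool") and t.get("calls")]
    if not labeled:
        return 0.0
    wrong = sum(1 for t in labeled if t["calls"][0] != t["expected_tool"])
    return wrong / len(labeled)


# Invented sample: five labeled turns, one wrong first choice.
turns = [
    {"expected_tool": "search_memories", "calls": ["search_memories"]},
    {"expected_tool": "search_memories", "calls": ["inspect_memories", "search_memories"]},
    {"expected_tool": "store_memory", "calls": ["store_memory"]},
    {"expected_tool": "delete_memory", "calls": ["delete_memory"]},
    {"expected_tool": "inspect_memories", "calls": ["inspect_memories"]},
    {"expected_tool": None, "calls": ["search_memories"]},  # unlabeled, ignored
]
rate = planning_error_rate(turns)
```

The labeling is the expensive part. A sample of a few hundred production turns, labeled by hand, is usually enough to see whether you are above or below the 10% line, and it gives any later split a documented case to point to.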

Resources

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al., 2022 — arXiv

The paper that formalized interleaved reasoning and acting for tool-using agents; the tool-selection step it describes is where Recall’s planning errors occurred, and the 40% → 8% error reduction in this issue is the engineering response.

Reciprocal Rank Fusion outperforms Condorcet and Individual Rank Learning Methods

Cormack et al. — SIGIR 2009

Source of the RRF k=60 algorithm inside search_memories; score-agnostic rank fusion is why the BM25→vector upgrade requires no score normalization.

rank-bm25: BM25 algorithms for Python

dorianbrown — PyPI

The BM25Okapi class used in search_memories; tokenized keyword retrieval without a search server dependency.

FastMCP: The fast, Pythonic way to build MCP servers

jlowin — GitHub

The framework whose auto-schema generation makes the before/after tool interface comparison literal: 8 functions vs. 5 functions, identical decorator.

Recall Reference Implementation

Sentient Zero Labs — GitHub

Both the 8-tool v0.1 and 5-tool v0.2 are in the commit history; the planning error data is in the release notes.

Production Checklist: Is Your Tool Granularity Right?

Seven binary checks. If any answer is no, that’s the next thing to address before shipping.

	1. Each tool name maps to one user/agent goal (not one DB operation)
	2. No two tools have overlapping purposes or near-identical descriptions
	3. Tools have 2-5 parameters; if more, evaluate whether one goal became two
	4. Implementation details (query type, batch vs. single) are inside tools
	5. You've measured planning error rate (wrong tool chosen on first attempt)
	6. Planning errors are < 10% of tool calls
	7. You started coarse and split only for documented, specific edge cases

The full Recall implementation — both the v0.1 8-tool version and the v0.2 5-tool version — is at github.com/Sentient-Zero-Labs/szl-recall. The before/after is in the commit history, with the planning error data from internal testing in the release notes.

Issue 05 covers A2A — agent-to-agent coordination, and the task lifecycle pattern that makes it work. The tool granularity principle from this issue carries forward directly: just as a tool should map to one user intent, an agent should map to one capability. The same consolidation discipline — hide implementation details, expose only meaningful choices, design for the caller’s decision burden — applies to the agent interface layer too. The design vocabulary transfers even when the protocol changes.

Building Effective Tools for AI is a seven-issue series from Sentient Zero Labs. Each issue ships with working code from the Recall memory server — a production MCP tool built in public alongside the series.

Until next issue,

Sentient Zero Labs