Building Effective Tools for AI Issue 1/7

What Makes a Good Tool

Five properties every production MCP tool must have — and why most demo tools satisfy only one of them.

May 12, 2026 · 17 min read · Sentient Zero Labs

Three days after the session ended, the problem surfaced.

A user had opened Recall, worked through a long planning conversation, and closed the tab. Standard end-of-session behavior — the client called store_memory to persist the exchange, the tool returned success, and everything looked fine. Except the network had hiccupped during that close-of-session call. The client retried. The tool ran again. Two copies of the same memory block landed in the database — near-identical content, timestamps 400ms apart, slightly different importance scores because LLM extraction is non-deterministic.

The tool returned success both times. No error was raised. Nothing looked wrong.

Three days later, the agent started surfacing duplicate memories on every retrieval. Context windows filled faster than expected. Response quality dropped in a way that initially looked like a model problem — responses felt slightly off, slightly less coherent — not like a data problem. The investigation went in the wrong direction for half a day. The actual fix was two lines of SQL: a SELECT before any write, checking whether a record with that idempotency key already existed.

Two lines. The recognition — that this had to be designed in from the start, not patched in later — took considerably longer.

Idempotency isn’t a feature you add. It’s a property you design in — and most tools aren’t designed with it at all.

Five Properties. Most Tools Miss Three.

Tool design is software design — and most tools fail on idempotency or error communication before they fail on logic.

The ReAct paper (Yao et al., 2022) demonstrated that LLMs could use tools reliably by interleaving reasoning and action steps. What the paper took as given — that tools behave predictably, return structured output, and don’t execute twice on retry — turned out to be an open engineering problem. The five properties here are the answer to that assumption.

A tool isn’t a function. It’s a contract between a model and a system — and like any contract, it has obligations in both directions. The model agrees to call the tool with valid inputs. The tool agrees to do exactly one thing, behave predictably on retry, tell the model what went wrong in terms the model can act on, return a shape the model can parse without guessing, and leave a trace the developer can follow. Most demo tools satisfy one of these: they do something. They fail the other four.

The framework has five properties. These aren’t aspirational — they’re the minimum for a tool you can trust in production. A tool missing any one of them will fail in a specific, predictable way. The Recall store_memory incident was property 2. Most production outages trace to property 3 or 5. Property 4 is the one teams discover late, when they try to parse a freeform string in an agent loop and the whole thing breaks.

┌─────────────────────────────────────────────┐
│  A PRODUCTION TOOL HAS:                     │
│                                             │
│  1. SINGLE RESPONSIBILITY                   │
│     One action. One contract.               │
│     One failure mode to reason about.       │
│                                             │
│  2. IDEMPOTENCY                             │
│     Safe to call N times with same inputs.  │
│     State changes once. Result is the same. │
│                                             │
│  3. LLM-READABLE ERRORS                     │
│     "Invalid format. Expected UUID.         │
│      Got: 'alice'. Call create_user()       │
│      first."                                │
│     Not: "bad input"                        │
│                                             │
│  4. STRUCTURED RETURN                       │
│     {status, data, error} — same shape      │
│     always. Never: "Done!" or a freeform    │
│     string.                                 │
│                                             │
│  5. CALL-LEVEL OBSERVABILITY                │
│     Every call logged: inputs, result,      │
│     duration. You will debug this at 2 AM.  │
└─────────────────────────────────────────────┘

Every production tool needs all five. Most demo tools have one.

The tutorial pattern (one @mcp.tool() decorator, a docstring, return a string) works in demos because demos don’t retry, don’t have multiple concurrent callers, don’t need to debug failures at 2 AM, and don’t have an LLM deciding what to do with the error. Walk through any tutorial tool and score it against the five properties: it probably does one thing (property 1: fine). It has no idempotency key check (property 2: broken by design). It raises ValueError("bad input") (property 3: broken). It returns a plain string or an untyped dict (property 4: fragile). It has no logging (property 5: invisible). Four of the five properties are missing. That’s the average tutorial tool in production.

These aren’t polish. They’re the difference between a tool that works in a demo and a tool that works when your agent is retrying because the network dropped. The rest of this issue is one property in depth: idempotency. Issues 2 through 7 build the rest of the stack — protocol, auth, observability, multi-agent patterns, security. But idempotency is the one that bites first, hurts most, and is the easiest to design in from day one if you know to do it.

Let’s build the pattern correctly — starting with why the caller, not the tool, should generate the key.

Building Idempotency Into `store_memory`

Every Recall tool that writes state passes through the same pattern. Here’s store_memory — the first Recall tool built, the one that failed, and the one that now demonstrates all five properties.

import asyncio
import hashlib
import logging
import time
import uuid
from dataclasses import dataclass, field
from typing import Any

import aiosqlite
from fastmcp import FastMCP

logger = logging.getLogger(__name__)
mcp = FastMCP("recall")


@dataclass
class ToolCallRecord:
    """Mirrors LangSmith run fields — swap in LangSmith later without changes."""
    tool_name: str
    inputs: dict[str, Any]
    status: str          # "ok" | "error" | "duplicate"
    duration_ms: float
    error: str | None = None
    cached: bool = False
    extra: dict[str, Any] = field(default_factory=dict)


async def log_tool_call(record: ToolCallRecord) -> None:
    """Structured log — one line, all fields. Grep-friendly in production."""
    logger.info(
        "tool_call",
        extra={
            "tool": record.tool_name,
            "inputs": record.inputs,  # property 5: log the inputs, not just the outcome
            "status": record.status,
            "duration_ms": round(record.duration_ms, 2),
            "cached": record.cached,
            "error": record.error,
            **record.extra,
        },
    )


@mcp.tool()
async def store_memory(
    user_id: str,
    messages: list[dict],
    idempotency_key: str,
) -> dict:
    """
    Store conversation messages for background memory extraction.
    Returns immediately — extraction runs in background.
    Idempotent: calling twice with the same idempotency_key is safe.
    Generate the key as: sha256(user_id + session_id + 'store_memory').
    """
    start = time.monotonic()

    async with aiosqlite.connect("recall.db") as db:
        # Check-before-write: the idempotency guarantee
        existing = await db.execute_fetchall(
            "SELECT id FROM operations WHERE idempotency_key = ?",
            (idempotency_key,),
        )
        if existing:
            record = ToolCallRecord(
                tool_name="store_memory",
                inputs={"user_id": user_id, "message_count": len(messages)},
                status="duplicate",
                duration_ms=(time.monotonic() - start) * 1000,
                cached=True,
            )
            await log_tool_call(record)
            return {"status": "ok", "data": None, "error": None, "cached": True}

        # First call: insert the operation record, then queue extraction
        await db.execute(
            "INSERT INTO operations (id, idempotency_key, user_id, status) VALUES (?, ?, ?, 'queued')",
            (str(uuid.uuid4()), idempotency_key, user_id),
        )
        await db.commit()

    # Queue extraction. The extraction itself runs in the background; note that
    # put() on a bounded queue waits if the queue is already full.
    await extraction_queue.put({"user_id": user_id, "messages": messages})

    duration_ms = (time.monotonic() - start) * 1000
    record = ToolCallRecord(
        tool_name="store_memory",
        inputs={"user_id": user_id, "message_count": len(messages)},
        status="ok",
        duration_ms=duration_ms,
        extra={"queued": True},
    )
    await log_tool_call(record)
    return {"status": "ok", "data": {"queued": True}, "error": None, "cached": False}


# In-process queue — sufficient for v0.1 (single process, modest load).
# Replace with Redis/SQS when you need durability across restarts.
extraction_queue: asyncio.Queue = asyncio.Queue(maxsize=1000)

Why the caller generates the key. The idempotency key is a required parameter, not generated inside the function. This is the most important design decision in the implementation. If the tool generated its own key — say, a uuid4() call — every call would be unique by definition. Idempotency becomes impossible regardless of the caller’s intent. The key must be deterministic from the caller’s perspective: a hash of user ID + session ID + a stable identifier for the operation. In practice, that’s one line: hashlib.sha256(f"{user_id}{session_id}store_memory".encode()).hexdigest(). The caller knows what it’s trying to do — which session, which operation, which user. The tool doesn’t. Responsibility follows knowledge. If you’re designing a tool that generates its own idempotency key, you’re designing a tool that cannot be safely retried.
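One way a caller might derive such a key is sketched below. The helper name make_idempotency_key and the separator character are assumptions for illustration, not part of Recall. The separator is a small hardening over plain concatenation: without it, the pairs ("ab", "c") and ("a", "bc") would produce the same hash input.

```python
import hashlib


# Hypothetical helper: the name and signature are illustrative, not Recall's API.
def make_idempotency_key(user_id: str, session_id: str, operation: str = "store_memory") -> str:
    """Derive a deterministic key from what the caller already knows."""
    # Join with a unit-separator character so that ("ab", "c") and ("a", "bc")
    # do not collapse into the same hash input.
    material = "\x1f".join([user_id, session_id, operation])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


first = make_idempotency_key("user-42", "session-7")
retry = make_idempotency_key("user-42", "session-7")   # same inputs, same key
other = make_idempotency_key("user-42", "session-8")   # new session, new key
```

A retry of the same session produces the same key, so the tool can recognize it; a different session produces a different key, so legitimate new writes are never suppressed.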

The check-before-insert pattern. The DB check is a SELECT before any write. If a record with this idempotency key already exists, the tool returns immediately with the same shape and the same fields, with "cached": True marking the replay. The tool reports success again, and the caller can proceed exactly as it would after a first call. This is the contract: calling twice with the same key changes state exactly as calling once does. There’s an important edge case worth naming: if the first call fails mid-write (say, the process crashes after the INSERT is committed but before the extraction is queued), the idempotency key is in the DB but the operation is incomplete. A second call with the same key will return the cached success result without completing the work. That gap is acceptable for Recall v0.1; the extraction queue is best-effort. If you need stronger guarantees (every call must either fully complete or fully fail), you need two-phase commit or a transactional outbox. That’s outside the scope of this issue, but it is worth knowing the failure mode exists.
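A SELECT followed by an INSERT is two statements, so two retries arriving at almost the same moment could both pass the check. A minimal sketch of one way to close that window, assuming the operations table can carry a UNIQUE constraint on idempotency_key. The schema and the claim_operation helper are assumptions (the article does not show the real table), and the sketch uses the standard-library sqlite3 module rather than aiosqlite to stay self-contained.

```python
import sqlite3
import uuid

# Assumed schema: the real Recall operations table is not shown in this issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS operations (
    id TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    user_id TEXT NOT NULL,
    status TEXT NOT NULL
)
"""


def claim_operation(conn: sqlite3.Connection, idempotency_key: str, user_id: str) -> bool:
    """Return True if this call claimed the key, False if it is a replay."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO operations (id, idempotency_key, user_id, status) "
        "VALUES (?, ?, ?, 'queued')",
        (str(uuid.uuid4()), idempotency_key, user_id),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the UNIQUE
    # constraint caused the insert to be ignored.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first_call = claim_operation(conn, "abc123", "user-42")
retry_call = claim_operation(conn, "abc123", "user-42")
row_count = conn.execute("SELECT COUNT(*) FROM operations").fetchone()[0]
```

Here the database, not the application code, decides which call wins, so the check and the write cannot be separated by a concurrent retry.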

The return shape. Every code path in store_memory returns the same structure: {"status": "ok"|"error", "data": ..., "error": None|"...", "cached": bool}. Always. The shape doesn’t change based on outcome, which is what makes it parseable. This matters because the LLM is parsing this return value — if success returns {"result": "queued"} and failure returns {"error": "bad input"}, the model has to handle two schemas. When the tool has been called a hundred times across a long agent session, those two schemas become a source of silent bugs. One schema, all paths. The status field is the sentinel: check it first, then read data or error depending on the result.
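One way to keep every code path on a single schema is to build results through two small constructors instead of writing dict literals by hand. The names ToolResult, ok, and fail below are illustrative assumptions, not Recall’s API.

```python
from typing import Any, Literal, Optional, TypedDict


class ToolResult(TypedDict):
    """The single return shape shared by every code path."""
    status: Literal["ok", "error"]
    data: Optional[dict[str, Any]]
    error: Optional[str]
    cached: bool


def ok(data: Optional[dict[str, Any]] = None, cached: bool = False) -> ToolResult:
    """Build a success result; error is always None here."""
    return {"status": "ok", "data": data, "error": None, "cached": cached}


def fail(message: str) -> ToolResult:
    """Build a failure result; data is always None here."""
    return {"status": "error", "data": None, "error": message, "cached": False}


first = ok({"queued": True})
replay = ok(cached=True)
failure = fail("The 'messages' list is empty.")
```

Because both constructors emit the same four keys, a type checker can flag any path that drifts from the schema.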

Now here’s the second place tools break: the error messages they produce.

# ── BAD: written for human logs ──────────────────────────────────────────

async def store_memory_bad(user_id: str, messages: list[dict]) -> dict:
    user = await db.get_user(user_id)
    if not user:
        raise ValueError("invalid user")  # LLM gets: "invalid user"
                                          # LLM decision: retry with same input
                                          # Result: infinite loop

    if not messages:
        raise ValueError("no messages")  # LLM gets: "no messages"
                                         # LLM has no idea what "messages" expects


# ── GOOD: written for the LLM caller ─────────────────────────────────────

async def store_memory_good(user_id: str, messages: list[dict]) -> dict:
    user = await db.get_user(user_id)
    if not user:
        raise ValueError(
            f"User '{user_id}' not found. "
            f"Expected a registered user ID (UUID format). "
            f"Got: '{user_id}'. "
            f"To register this user, call create_user() first."
        )
        # LLM gets the full sentence above.
        # LLM decision: call create_user(user_id), then retry.
        # Result: self-correcting.

    if not messages:
        raise ValueError(
            "The 'messages' list is empty. "
            "Expected at least one message dict with keys: "
            "'role' ('user' or 'assistant') and 'content' (string). "
            "Example: [{'role': 'user', 'content': 'Hello'}]"
        )


# ── RULE ──────────────────────────────────────────────────────────────────
# Every error must answer three questions for the model:
#   1. What was wrong?         ("User 'alice' not found")
#   2. What was expected?      ("Expected a UUID format user ID")
#   3. What should it do next? ("Call create_user() first")
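The three-question rule can also be enforced mechanically. The sketch below assumes a hypothetical tool_error helper (not part of Recall or FastMCP) that refuses to build an error unless all three answers are supplied.

```python
# Hypothetical helper: the name and signature are illustrative only.
def tool_error(problem: str, expected: str, next_step: str) -> ValueError:
    """Build an error that answers all three questions for the model."""
    parts = {"problem": problem, "expected": expected, "next_step": next_step}
    missing = [name for name, text in parts.items() if not text.strip()]
    if missing:
        # Fail loudly during development rather than ship an opaque error.
        raise TypeError(f"tool_error is missing: {', '.join(missing)}")
    return ValueError(" ".join(text.strip() for text in parts.values()))


err = tool_error(
    problem="User 'alice' not found.",
    expected="Expected a registered user ID (UUID format).",
    next_step="Call create_user() first.",
)
message = str(err)
```

A tool would then write raise tool_error(...) in place of raise ValueError("invalid user"), which makes an incomplete message a development-time failure instead of a production retry loop.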

Why not catch-all exceptions. The most common mistake after implementing idempotency is adding except Exception: pass or except Exception: return {"status": "ok"} to “handle errors gracefully.” This silently swallows real failures. The tool reports success. The caller proceeds. The operation never completed. This is harder to debug than a visible crash — because there is no crash. The logs are clean. The state is quietly wrong. The rule is simple: never catch an exception unless you are going to either handle it specifically or return {"status": "error", "error": str(e)} explicitly, with a log call before the return. A visible crash is better than silent wrong state. The crash shows you where the problem is. Silent wrong state shows you nothing.
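A minimal sketch of the "log, then return an explicit error" branch of that rule. The guarded wrapper and the two stand-in operations are hypothetical and exist only to show the shape; they are not taken from the Recall codebase.

```python
import asyncio
import logging

logger = logging.getLogger("recall.tools")


async def guarded(tool_name: str, operation) -> dict:
    """Run an async operation and report failure explicitly, never silently."""
    try:
        data = await operation()
    except Exception as exc:
        # Log first (with traceback), then surface the failure to the caller.
        logger.exception("tool_call_failed", extra={"tool": tool_name})
        return {
            "status": "error",
            "data": None,
            "error": f"{type(exc).__name__}: {exc}",
            "cached": False,
        }
    return {"status": "ok", "data": data, "error": None, "cached": False}


async def _failing_write() -> dict:
    # Stand-in for a write that fails.
    raise ConnectionError("database unreachable")


async def _working_write() -> dict:
    # Stand-in for a write that succeeds.
    return {"queued": True}


failed = asyncio.run(guarded("store_memory", _failing_write))
succeeded = asyncio.run(guarded("store_memory", _working_write))
```

The broad except is acceptable here only because both obligations are met: the traceback is logged and the caller receives status "error" rather than a false success.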

  CALL 1                        CALL 2 (retry)
  ──────                        ──────────────
  store_memory(                 store_memory(
    idempotency_key="abc123"      idempotency_key="abc123"
  )                             )
       │                               │
       ▼                               ▼
  ┌─────────────────┐           ┌─────────────────┐
  │ SELECT id FROM  │           │ SELECT id FROM  │
  │ operations WHERE│           │ operations WHERE│
  │ key = 'abc123'  │           │ key = 'abc123'  │
  └────────┬────────┘           └────────┬────────┘
           │                             │
     NOT FOUND                      FOUND ──────────────┐
           │                                            │
           ▼                                            ▼
  ┌─────────────────┐                    ┌──────────────────────┐
  │  INSERT record  │                    │  Return cached result│
  │  Run extraction │                    │  {status: "ok",      │
  │  Return result  │                    │   cached: true}      │
  └─────────────────┘                    └──────────────────────┘
           │                                            │
           ▼                                            ▼
   {status: "ok",                           {status: "ok",
    data: {...},                             data: {...},
    cached: false}                           cached: true}

First call writes. Retry returns the cached result. The caller sees success either way.

Idempotency handles the double-execution failure. It is the first of three failures; two others are still waiting.

Three Failures. One Is In Your Codebase Right Now.

DOUBLE-WRITE

What happens:  Tool executes on first call. Network retry triggers a
               second execution before the client receives the response.
               State written twice. Tool returns success both times.
               In Recall: two copies of the same memory block, one with
               slightly different importance scores (LLM extraction is
               non-deterministic). Retrieval returns duplicates. Context
               windows fill faster. Response quality drops. No error raised.

Root cause:    No idempotency check. Tool assumed it would be called once.

How to detect: Duplicate records in DB with near-identical content and
               timestamps < 1s apart. Or: retrieval returning the same
               memory twice in the same result set.

Fix:           SELECT 1 FROM operations WHERE idempotency_key = ?
               before any write. Return cached result immediately if
               found. Two lines.
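The detection step can be sketched as a self-join that flags rows for the same user written less than one second apart. The memories table, its columns, and the sample rows are assumptions for illustration; the real Recall schema is not shown in this issue. Because extraction is non-deterministic, the query matches on user and timestamp rather than on exact content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed table layout, for illustration only.
conn.execute(
    "CREATE TABLE memories ("
    "id INTEGER PRIMARY KEY, user_id TEXT, content TEXT, created_at REAL)"
)
conn.executemany(
    "INSERT INTO memories (user_id, content, created_at) VALUES (?, ?, ?)",
    [
        ("user-42", "Plans a Q3 launch", 1000.0),
        ("user-42", "Planning a Q3 launch", 1000.4),  # retry, 400 ms later
        ("user-42", "Prefers email updates", 5000.0),
        ("user-7", "Likes tea", 1000.2),
    ],
)

# Pairs of rows for the same user created less than one second apart.
SUSPECT_PAIRS = """
SELECT a.id, b.id
FROM memories AS a
JOIN memories AS b
  ON a.user_id = b.user_id
 AND a.id < b.id
 AND ABS(a.created_at - b.created_at) < 1.0
ORDER BY a.id, b.id
"""
suspects = conn.execute(SUSPECT_PAIRS).fetchall()
```

Each returned pair is a candidate double-write to review, not proof of one; a user can legitimately produce two memories in the same second.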

  BAD ERROR                     GOOD ERROR
  ─────────────────────         ──────────────────────────────────
  raise ValueError(             raise ValueError(
    "bad input"                   f"User '{user_id}' not found. "
  )                               f"Expected a registered user "
                                  f"ID (UUID format). "
                                  f"Got: '{user_id}'. "
                                  f"Call create_user() first."
                                )
       │                                │
       ▼                                ▼
  LLM receives:                 LLM receives:
  "bad input"                   "User 'alice' not found.
                                 Expected a registered user ID
  LLM decision:                  (UUID format). Got: 'alice'.
  Retry with same               Call create_user() first."
  inputs.
                                LLM decision:
  Result: loop.                 Call create_user("alice"),
                                then retry store_memory.

                                Result: self-correcting.

The LLM reads your error and decides what to do next. Bad errors cause retry loops.

SILENT TOOL FAILURE

What happens:  Tool returns {"status": "ok"} but the operation never
               completed. An exception was raised internally, caught,
               and swallowed. The caller proceeds as if the operation
               succeeded. Downstream state is quietly wrong.

Root cause:    except Exception: pass — or worse, except Exception:
               return {"status": "ok"}. Added to "handle errors
               gracefully." Achieves the opposite.

How to detect: Tool call logs show consistent success. Downstream state
               doesn't match what the tool should have written. Requires
               end-to-end tests to catch — it won't show up in unit tests
               of the tool in isolation.

Fix:           Never swallow exceptions. Either:
               (a) catch the specific exception you expect and handle it
               (b) return {"status": "error", "error": str(e)} and log
               (c) let it propagate — a visible crash is better than
                   silent wrong state.

THE OPAQUE ERROR

What happens:  Tool raises ValueError("bad input"). The LLM receives
               "bad input." It has no information about what was wrong,
               what format was expected, or what to try next. It retries
               with identical inputs. The error repeats. The agent enters
               a loop. No progress is made.

Root cause:    Error messages written for human log readers, not for the
               model caller. The developer knew what "bad input" meant.
               The LLM does not.

How to detect: Agent retry loop with identical inputs in the tool call
               history. Tool call count > 2 for the same operation
               without input variation.

Fix:           Error messages must answer three questions for the model:
               (1) What was wrong?
               (2) What was expected?
               (3) What should the model try instead?
               See the store_memory_bad / store_memory_good example
               above for the exact pattern.
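The detection rule above (more than two calls to the same tool with no input variation) can be sketched as a scan over the tool call history. The history format, the find_retry_loops name, and the search_memory tool are assumptions for illustration.

```python
import json
from collections import Counter


def find_retry_loops(call_history: list[dict], max_identical: int = 2) -> list[str]:
    """Return tools called more than max_identical times with identical inputs."""
    counts = Counter(
        # Serialize inputs with sorted keys so key order does not hide a repeat.
        (call["tool"], json.dumps(call["inputs"], sort_keys=True))
        for call in call_history
    )
    return sorted({tool for (tool, _), n in counts.items() if n > max_identical})


history = [
    {"tool": "store_memory", "inputs": {"user_id": "alice"}},
    {"tool": "store_memory", "inputs": {"user_id": "alice"}},
    {"tool": "store_memory", "inputs": {"user_id": "alice"}},
    {"tool": "search_memory", "inputs": {"query": "launch"}},
    {"tool": "search_memory", "inputs": {"query": "launch date"}},
]
looping = find_retry_loops(history)
```

Only store_memory is flagged: it was called three times with identical inputs, while the search_memory calls varied their query.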

Now that you know what breaks, here’s the decision boundary: when should you build a new tool at all?

When to Build a New Tool (And When Not To)

The five properties aren’t just a design checklist — they’re a filter. If you can’t satisfy all five for a proposed tool, that’s a signal about the tool’s design, not a reason to skip the properties. Idempotency is hard to design into a tool that has two distinct side effects. Structured return is hard to maintain when the tool does different things depending on inputs. If you’re struggling to satisfy a property, the tool is probably doing too much.

Build a new tool when:          Don't build a new tool when:
────────────────────────        ─────────────────────────────────
It does exactly one thing       It requires two separate actions
It's idempotent by design       State changes can't be deduplicated
You own the data contract       The return format varies by case
You can log every call          Side effects are non-deterministic
The LLM needs to invoke it      The app can call it directly instead

For most teams, start with the narrowest possible scope: one verb, one noun, one side effect. If you can’t write the idempotency key in one line — hash(user_id + session_id + "store") — the tool scope is probably too large. Split it. The alternative isn’t always a new tool: sometimes the right move is adding a parameter to an existing tool, or composing existing tools at the agent level rather than inside a single tool boundary.

In Issue 4, we’ll show how Recall’s initial 8-tool design collapsed to 5 tools. Two of the tools removed weren’t wrong — they were too small, and the LLM was confused about which one to call when. Tool count has a cost too. Fewer, better-designed tools outperform many narrow ones in practice.

Before you ship: eight questions your tool should answer yes to.

Resources

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al., 2022 — arXiv

The foundational paper demonstrating tool-using LLMs via interleaved reasoning and action; its assumption that tools behave predictably is exactly the gap this issue addresses.

FastMCP: The fast, Pythonic way to build MCP servers

jlowin — GitHub

The Python framework used for all Recall tool implementations; the @mcp.tool() decorator and auto-schema generation are central to every code example in this issue.

Model Context Protocol Specification

modelcontextprotocol.io

The current MCP spec; the structured return and error contract requirements flow directly from how MCP serializes tool output for model context.

aiosqlite: async SQLite for Python

omnilib — GitHub

The async SQLite library used in store_memory; its async connection context manager wraps the check-before-insert idempotency logic.

Recall Reference Implementation

Sentient Zero Labs — GitHub

The open-source memory server built throughout this series; the store_memory code in this issue follows its production implementation (the shipped signature differs slightly) and is not a tutorial stub.

Is Your Tool Production-Ready?

Eight binary questions. If any answer is no, that’s the next thing to fix — not the ninth question.

1. Each tool does exactly one thing: one verb, one noun, one side effect
2. All write operations accept an idempotency_key parameter
3. All write operations check for existing idempotency_key before writing
4. Error messages state what was wrong, what was expected, and what to try next
5. Return type is a typed dict with the same shape on all code paths
6. Every call logs: tool_name, user_id, inputs, result status, duration_ms
7. Tool is safe to call twice with the same inputs (verified, not assumed)
8. At least one test calls the tool twice with the same idempotency_key and asserts the result is identical and state was written only once
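The last question is the easiest to turn into code. The sketch below assumes a simplified, synchronous stand-in for store_memory backed by an in-memory SQLite database; store_memory_stub and make_db are hypothetical names, and the real test in the Recall repository may look different. Since a replay is marked with "cached": True, the test compares status and shape rather than demanding byte-identical results.

```python
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed operations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE operations ("
        "id TEXT PRIMARY KEY, idempotency_key TEXT UNIQUE, "
        "user_id TEXT, status TEXT)"
    )
    return conn


def store_memory_stub(conn: sqlite3.Connection, user_id: str, idempotency_key: str) -> dict:
    """Simplified stand-in for store_memory: check, then write once."""
    existing = conn.execute(
        "SELECT id FROM operations WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if existing:
        return {"status": "ok", "data": None, "error": None, "cached": True}
    conn.execute(
        "INSERT INTO operations (id, idempotency_key, user_id, status) "
        "VALUES (?, ?, ?, 'queued')",
        (str(uuid.uuid4()), idempotency_key, user_id),
    )
    conn.commit()
    return {"status": "ok", "data": {"queued": True}, "error": None, "cached": False}


def test_store_memory_is_idempotent() -> None:
    conn = make_db()
    first = store_memory_stub(conn, "user-42", "abc123")
    second = store_memory_stub(conn, "user-42", "abc123")
    rows = conn.execute(
        "SELECT COUNT(*) FROM operations WHERE idempotency_key = ?",
        ("abc123",),
    ).fetchone()[0]
    assert first["status"] == second["status"] == "ok"
    assert first.keys() == second.keys()   # same shape on both paths
    assert second["cached"] is True        # the retry is recognized as a replay
    assert rows == 1                       # state was written only once


test_store_memory_is_idempotent()
```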

The full implementation is in github.com/Sentient-Zero-Labs/szl-recall — clone it, run the tests, and break it deliberately. Every failure mode described here surfaces exactly as written. (The shipped store_memory signature uses text: str, topic: str rather than messages: list[dict] — the idempotency pattern is identical.)

These five properties are the application-layer contract. Issue 2 goes one level down: the MCP protocol that transports your tool calls, and why some of these constraints exist because of how the protocol works — not just because they’re good practice. Structured returns, in particular, are a direct consequence of how MCP serializes tool output for the model context. Understanding that gets you closer to understanding why the model behaves the way it does when a tool fails.

Building Effective Tools for AI is a seven-issue series from Sentient Zero Labs. Each issue ships with working code from the Recall memory server — a production MCP tool built in public alongside the series.

Until next issue,

Sentient Zero Labs