Building Effective Tools for AI Issue 3/7

Building Your First Production MCP Server

How to wire auth, timeout, and logging middleware before your first tool — and the async-acknowledge pattern that prevents hanging tool calls.

May 12, 2026 · 22 min read · Sentient Zero Labs

The extraction worker had been running fine in development for two weeks. store_memory accepted a conversation transcript, ran it through a local embedding model to extract facts, stored the result in SQLite. Average latency: 200ms. Reliable. Nothing to flag.

The first production call landed on a cold server. The embedding model wasn’t loaded — first call initializes it. The initialization took 45 seconds. The tool call returned nothing. No error. No partial response. No indication of progress. Just silence.

From the agent’s perspective, the tool was called and never answered. It waited. Then the client-side timeout fired. The agent retried. By the time the retry arrived, the embedding model was warm — the second call completed in 180ms. Success returned. The agent continued.

But the original call was still running in the background. It completed 3 seconds later. Two extraction jobs. Two sets of memories for the same conversation. The idempotency key from Issue 01 caught the duplicate write — that part worked. But the 45-second hang had already become a production incident. Users reported the agent “freezing.” The dashboard showed a tool call that never returned.

The investigation was fast once pointed at the right layer. The extraction ran synchronously inside the tool call. The embedding model had no timeout. There was no middleware to cap the call. When the model was cold, the tool held the connection for the full initialization window and the entire caller chain waited with it.

Two root causes, both structural: no timeout configured at any level, and a slow operation running inside a synchronous tool call instead of behind a queue.

A tool that doesn’t respond isn’t a slow tool. It’s a broken tool.

The fix required two things: a timeout, and a different mental model for what a tool call is supposed to do.

Schema From Code, Middleware Before Tools

FastMCP’s Schema Generation

The common approach to building MCP tools — writing JSON schema by hand, or maintaining a YAML spec file alongside your code — breaks the moment the function signature changes. FastMCP eliminates this entirely. The function is the schema.

The mechanism: @mcp.tool() introspects the function at startup using Python’s type inspection. Parameter names become JSON schema property names. Type annotations map directly to JSON schema types: str → "string", int → "integer", list[dict] → "array" of "object". Parameters with default values are marked optional; parameters without defaults are required. The docstring becomes the tool’s description, which is the only instruction the model receives when deciding whether to call this tool.

This connection matters historically. The ReAct paper (Yao et al., 2022) showed that LLMs could use tools reliably by interleaving reasoning steps with action steps. That research took for granted that tool interfaces are stable, machine-readable, and don’t require the model to guess at a schema. In practice, keeping them that way was an open engineering problem. FastMCP’s auto-schema generation is the practical answer: the schema is always in sync with the code because the schema is derived from the code. You can’t have schema drift when there’s no separate schema to drift.

  Python function (what you write)         JSON schema (what the LLM sees)
  ─────────────────────────────────        ──────────────────────────────────

  @mcp.tool()                              {
  async def store_memory(                    "name": "store_memory",
      text: str,               ──────────►  "description": "Store conversation
      topic: str,                             messages for background memory
      idempotency_key: str,                   extraction. Returns immediately —
  ) -> dict:                                  extraction runs async. Safe to retry.",
      """Store conversation                 "inputSchema": {
      messages for background                 "type": "object",
      memory extraction.                      "properties": {
      Returns immediately —                     "text": {"type": "string"},
      extraction runs async.                    "topic": {"type": "string"},
      Safe to retry."""                         "idempotency_key": {"type": "string"}
                                             },
                                             "required": ["text", "topic",
                                                          "idempotency_key"]
                                           }
                                         }

  FastMCP inspects the signature at startup and generates this automatically.
  You never write JSON schema by hand. If the signature changes, the schema
  updates automatically on next startup.

  ──────────────────────────────────────────────────────────────────────────
  The docstring is what the LLM receives as the tool description.
  Lead with the summary; text placed after it may also reach the model.
  Write it as a runtime instruction, not documentation.
  ──────────────────────────────────────────────────────────────────────────

FastMCP reads your function signature once — at startup. The schema the LLM sees is derived, not written.
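To make the derivation concrete, here is a minimal sketch of signature-to-schema introspection using only the standard library. It is not FastMCP's implementation: sketch_schema and the _JSON_TYPES mapping are hypothetical names, and the mapping covers only four scalar types, falling back to "object" for anything else.

```python
import inspect
from typing import Any, get_type_hints

# Hypothetical mapping for illustration; a real library handles many more types.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_schema(fn) -> dict[str, Any]:
    """Derive a minimal tool schema from a function's signature and docstring."""
    hints = get_type_hints(fn)
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "object")}
        # No default value means the caller must supply the parameter.
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "inputSchema": {"type": "object", "properties": properties, "required": required},
    }


async def search_memories(query: str, limit: int = 20, recency_weight: float = 0.3) -> dict:
    """Search memories using hybrid retrieval."""
    return {}


schema = sketch_schema(search_memories)
```

Only query lands in "required" because limit and recency_weight carry defaults. Changing the signature changes the schema on the next call to sketch_schema, which is the property the section relies on.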

The docstring rule matters more than it looks. The first line of the docstring is a runtime instruction to the model, not a comment for a human reader. “Store conversation messages for background memory extraction. Returns immediately — extraction runs async. Safe to retry.” — every clause is load-bearing. “Returns immediately” tells the agent not to wait for a downstream result. “Safe to retry” tells the agent that calling it twice with the same key won’t corrupt state. Compare this to “Stores memories.” The model has no context for when to call it, how to call it safely, or whether a retry is appropriate. The docstring is a production artifact. Treat it that way.

The Middleware Stack

Here is the lesson of the opening incident: the hang wasn’t caused by a missing feature in the tool code. It was caused by a missing layer around the tool. Auth, timeout, and logging are not things you add later when you “need them.” They ship before the first tool is written.

Recall’s middleware stack has three layers. The ordering is not arbitrary.

  Incoming request
        │
        ▼
  ┌─────────────────────────────────────────────────────────┐
  │  LAYER 1: BearerAuthMiddleware                          │
  │                                                         │
  │  • Extract token from Authorization header              │
  │  • Validate against DB                                  │
  │  • Inject user_id via ContextVar                        │
  │  • Return 401 if invalid — tool code never runs         │
  └─────────────────────────┬───────────────────────────────┘
                            │  (authorized requests only)
                            ▼
  ┌─────────────────────────────────────────────────────────┐
  │  LAYER 2: TimeoutMiddleware                             │
  │                                                         │
  │  • asyncio.wait_for(call_next(request), timeout=30.0)   │
  │  • Hard 30s cap — applies to ALL tools automatically    │
  │  • Raises TimeoutError → returns 504 with structured    │
  │    error: {"status": "error", "code": "TOOL_TIMEOUT"}   │
  └─────────────────────────┬───────────────────────────────┘
                            │  (fast-path requests only)
                            ▼
  ┌─────────────────────────────────────────────────────────┐
  │  LAYER 3: LoggingMiddleware (wraps the tool call)       │
  │                                                         │
  │  • Pre-call: tool_name, user_id, inputs_hash, start_ts  │
  │  • Post-call: status, duration_ms, error_code           │
  │  • Runs in finally block: logs success AND failure      │
  └─────────────────────────┬───────────────────────────────┘
                            │
                            ▼
                    Tool function runs

  Ordering matters:
  Auth first  → unauthorized calls never consume timeout budget
  Timeout second → all calls capped regardless of what the tool does
  Logging last → sits closest to the tool, so duration_ms measures tool
                 execution and user_id is already set in the ContextVar

  Code note: Starlette's add_middleware runs in REVERSE of call order.
  create_app() adds: LoggingMiddleware, TimeoutMiddleware, BearerAuthMiddleware
  Runtime order:     BearerAuth (outermost), Timeout, Logging (innermost)
  Add new layers BEFORE the BearerAuthMiddleware line to keep auth outermost.

Auth rejects bad requests. Timeout caps slow ones. Logging records every call that gets through.

Auth: Bearer token validated before any tool code runs. The user_id is injected into a ContextVar — tools read it with user_id_ctx.get() without it being passed as a parameter. This is the pattern from Issue 02: auth must stay out of tool signatures. If user_id were a tool parameter, the model controls it. The model could pass anything. Keeping auth in middleware means the identity on every tool call is verified by infrastructure, not supplied by the caller.
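A minimal sketch of the ContextVar hand-off, using only asyncio and no web framework. handle_request is a hypothetical stand-in for the auth layer, and whoami stands in for a tool; neither name comes from the Recall server.

```python
import asyncio
from contextvars import ContextVar

user_id_ctx: ContextVar[str] = ContextVar("user_id", default="")


async def whoami() -> str:
    # The "tool": it has no user_id parameter, so a caller cannot supply one.
    return user_id_ctx.get()


async def handle_request(verified_user: str) -> str:
    # Stand-in for the auth layer: set the verified identity, then call the tool.
    user_id_ctx.set(verified_user)
    return await whoami()


async def main() -> list[str]:
    # gather() runs each coroutine as its own task with its own context copy,
    # so one request's identity never leaks into another's.
    return list(await asyncio.gather(handle_request("alice"), handle_request("bob")))


results = asyncio.run(main())
```

Each request sees only the identity set for it, and the value set inside a request does not escape to the module-level context.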

Timeout: 30-second hard limit on all tool calls via asyncio.wait_for(call_next(request), timeout=30.0). Not per-tool — middleware applies to every tool. One configuration change affects the entire server. The 45-second hang would have become a clean 504 response at the 30-second mark.
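The same cap can be tried outside the server. This sketch assumes a hypothetical with_timeout helper and scales the numbers down so it runs in well under a second; only asyncio.wait_for and the TOOL_TIMEOUT error shape come from the article.

```python
import asyncio


async def with_timeout(coro, seconds: float) -> dict:
    """Cap any awaitable and convert a timeout into a structured error."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return {
            "status": "error",
            "error": f"Tool call exceeded {seconds}s limit.",
            "code": "TOOL_TIMEOUT",
        }


async def cold_start_tool() -> dict:
    await asyncio.sleep(0.5)  # stands in for the 45-second model load
    return {"status": "ok"}


async def warm_tool() -> dict:
    return {"status": "ok"}


slow = asyncio.run(with_timeout(cold_start_tool(), seconds=0.05))
fast = asyncio.run(with_timeout(warm_tool(), seconds=0.05))
```

The slow call comes back as a structured TOOL_TIMEOUT error instead of silence, while the fast call passes through untouched.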

Logging: Pre-call log captures tool name, user, inputs hash, and start timestamp. Post-call log captures status and duration. Runs in a finally block — it records both successful and failed calls. This format is structured from day one because Issue 06 adds dashboards, and those dashboards require consistent field names. Retrofitting logging after the fact is the kind of work that gets skipped when there’s a production fire. Build it once, correctly, before any tool.
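A small sketch of the finally-block pattern, assuming a hypothetical logged_call wrapper and an in-memory list in place of a real logger. The field names follow the ones listed above; everything else is illustrative.

```python
import asyncio
import json
import time

log_lines: list[str] = []  # stand-in for a real structured logger


async def logged_call(tool_name: str, user_id: str, tool) -> dict:
    """Run a tool coroutine function and emit one record, success or failure."""
    start = time.monotonic()
    status = "error"  # assume failure until the call returns
    try:
        result = await tool()
        status = "ok"
        return result
    finally:
        # finally runs on return and on exception, so no call goes unrecorded.
        log_lines.append(json.dumps({
            "tool_name": tool_name,
            "user_id": user_id,
            "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
        }))


async def healthy() -> dict:
    return {"status": "ok"}


async def broken() -> dict:
    raise RuntimeError("embedding model failed to load")


async def main() -> None:
    await logged_call("get_memory_stats", "alice", healthy)
    try:
        await logged_call("store_memory", "alice", broken)
    except RuntimeError:
        pass  # the exception still propagates; the log line was written first


asyncio.run(main())
records = [json.loads(line) for line in log_lines]
```

Both calls produce a record with the same field names, which is what a dashboard query needs.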

The mental model is straightforward. Now look at what it produces: a complete server, all five tools, ready to fork.

The Complete Recall Server

This is recall/server.py as it exists after Issue 03. All five tools wired. Auth middleware injecting user_id via ContextVar. Timeout middleware at 30 seconds. Structured errors on every failure path. Pagination on inspect_memories. The async-acknowledge pattern on store_memory. Run it with fastmcp dev server.py.

import asyncio
import hashlib
import uuid
from contextvars import ContextVar
from typing import Any

import aiosqlite
from fastmcp import FastMCP
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse

mcp = FastMCP("recall")

user_id_ctx: ContextVar[str] = ContextVar("user_id", default="")
extraction_queue: asyncio.Queue = asyncio.Queue(maxsize=1000)


# ── Middleware ────────────────────────────────────────────────────────────────

class BearerAuthMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Any) -> Any:
        auth = request.headers.get("Authorization", "")
        token = auth.removeprefix("Bearer ").strip()
        user_id = await _validate_token(token)
        if not user_id:
            return JSONResponse({"error": "Unauthorized"}, status_code=401)
        user_id_ctx.set(user_id)
        return await call_next(request)


class TimeoutMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Any) -> Any:
        try:
            return await asyncio.wait_for(call_next(request), timeout=30.0)
        except asyncio.TimeoutError:
            return JSONResponse(
                {"status": "error", "error": "Tool call exceeded 30s limit.",
                 "code": "TOOL_TIMEOUT"},
                status_code=504,
            )


class LoggingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: Any) -> Any:
        import time
        start = time.monotonic()
        try:
            response = await call_next(request)
            return response
        finally:
            duration_ms = round((time.monotonic() - start) * 1000, 2)
            user_id = user_id_ctx.get()
            # structured log: tool_name, user_id, duration_ms, status
            _ = duration_ms, user_id  # plug into your logger here


# ── Lifecycle ─────────────────────────────────────────────────────────────────

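_worker_task: asyncio.Task | None = None  # strong reference; a bare create_task() result can be garbage-collected

@mcp.on_startup  # NOTE: verify these lifecycle hooks against your installed FastMCP version
async def startup() -> None:
    global _worker_task
    await _init_db()
    _worker_task = asyncio.create_task(_extraction_worker())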


@mcp.on_shutdown
async def shutdown() -> None:
    pass  # flush queues, close connections


# ── App factory ───────────────────────────────────────────────────────────────

def create_app():
    """Return the Starlette ASGI app with all middleware wired. Used by uvicorn."""
    app = mcp.streamable_http_app()
    # Starlette applies add_middleware in REVERSE order.
    # Last added = outermost (runs first on the way in).
    # add order below: Logging → Timeout → BearerAuth
    # runtime order:   BearerAuth first (outermost), Timeout second, Logging innermost
    app.add_middleware(LoggingMiddleware)
    app.add_middleware(TimeoutMiddleware)
    app.add_middleware(BearerAuthMiddleware)
    return app


# ── Tools ─────────────────────────────────────────────────────────────────────

@mcp.tool()
async def store_memory(text: str, topic: str, idempotency_key: str) -> dict:
    """Store conversation messages for background memory extraction.
    Returns immediately — extraction runs async. Safe to retry with the same key."""
    user_id = user_id_ctx.get()
    async with aiosqlite.connect("recall.db") as db:
        existing = await db.execute_fetchall(
            "SELECT 1 FROM operations WHERE idempotency_key = ?", (idempotency_key,)
        )
        if existing:
            return {"status": "ok", "data": {"queued": False, "cached": True}, "error": None}
        job_id = str(uuid.uuid4())
        await db.execute(
            "INSERT INTO operations (id, idempotency_key, user_id, status) VALUES (?,?,?,'queued')",
            (job_id, idempotency_key, user_id),
        )
        await db.commit()
    await extraction_queue.put({"job_id": job_id, "user_id": user_id, "text": text, "topic": topic})
    return {"status": "ok", "data": {"queued": True, "job_id": job_id}, "error": None}


@mcp.tool()
async def search_memories(query: str, limit: int = 20, recency_weight: float = 0.3) -> dict:
    """Search memories using hybrid retrieval (~200-500ms). recency_weight 0-1 upweights recent results."""
    if not 0.0 <= recency_weight <= 1.0:
        return {"status": "error", "data": None,
                "error": "recency_weight must be 0.0–1.0.", "code": "INVALID_PARAM"}
    user_id = user_id_ctx.get()
    results = await _hybrid_search(user_id, query, limit, recency_weight)
    return {"status": "ok", "data": {"results": results, "total": len(results)}, "error": None}


@mcp.tool()
async def inspect_memories(limit: int = 20, offset: int = 0) -> dict:
    """List stored memories with pagination. Default 20 per page, max 50."""
    limit = max(1, min(limit, 50))  # clamp both ends: SQLite treats a negative LIMIT as "no limit"
    user_id = user_id_ctx.get()
    async with aiosqlite.connect("recall.db") as db:
        rows = await db.execute_fetchall(
            "SELECT id, text, topic, importance, type, created_at "
            "FROM memories WHERE user_id = ? ORDER BY created_at DESC LIMIT ? OFFSET ?",
            (user_id, limit, offset),
        )
        count = await db.execute_fetchall(
            "SELECT COUNT(*) FROM memories WHERE user_id = ?", (user_id,)
        )
        total = count[0][0]
    memories = [
        {"id": r[0], "text": r[1], "topic": r[2], "importance": r[3],
         "type": r[4], "created_at": r[5]}
        for r in rows
    ]
    return {"status": "ok", "data": {"memories": memories, "total": total,
            "has_more": offset + limit < total,
            "next_offset": offset + limit if offset + limit < total else None},
            "error": None}


@mcp.tool()
async def delete_memory(memory_id: str) -> dict:
    """Permanently delete a memory by ID. This action cannot be undone."""
    user_id = user_id_ctx.get()
    async with aiosqlite.connect("recall.db") as db:
        result = await db.execute(
            "DELETE FROM memories WHERE id = ? AND user_id = ?", (memory_id, user_id)
        )
        await db.commit()
        if result.rowcount == 0:
            return {"status": "error",
                    "error": f"Memory '{memory_id}' not found. Use inspect_memories to list valid IDs.",
                    "code": "MEMORY_NOT_FOUND"}
    return {"status": "ok", "data": {"deleted": memory_id}, "error": None}


@mcp.tool()
async def get_memory_stats() -> dict:
    """Return memory counts and storage stats for the current user. Fast health check."""
    user_id = user_id_ctx.get()
    async with aiosqlite.connect("recall.db") as db:
        rows = await db.execute_fetchall(
            "SELECT type, COUNT(*) FROM memories WHERE user_id = ? GROUP BY type", (user_id,)
        )
        pending = await db.execute_fetchall(
            "SELECT COUNT(*) FROM operations WHERE user_id = ? AND status = 'queued'", (user_id,)
        )
    return {"status": "ok", "data": {"by_type": dict(rows), "total": sum(r[1] for r in rows),
            "pending_extractions": pending[0][0]}, "error": None}


# ── Internal helpers (not tools) ─────────────────────────────────────────────

async def _init_db() -> None:
    pass  # create tables if not exist — see db/schema.sql


async def _validate_token(token: str) -> str | None:
    if not token:
        return None
    async with aiosqlite.connect("recall.db") as db:
        rows = await db.execute_fetchall(
            "SELECT user_id FROM api_tokens WHERE token_hash = ? AND revoked = 0",
            (hashlib.sha256(token.encode()).hexdigest(),),
        )
    return rows[0][0] if rows else None


async def _hybrid_search(user_id: str, query: str, limit: int, recency_weight: float) -> list:
    # v0.1: BM25 only. RRF merge architecture is in place for v0.2 vector upgrade.
    return []


async def _extraction_worker() -> None:
    while True:
        job = await extraction_queue.get()
        try:
            await _run_extraction(job)
        except Exception:
            pass  # log and continue — worker must not die on per-job failures
        finally:
            extraction_queue.task_done()


async def _run_extraction(job: dict) -> None:
    pass  # LLM extraction pipeline — implemented in worker.py

Server name and startup. FastMCP("recall") — not FastMCP("Recall Memory Service v1.0"). The server name is the identifier the host uses in routing and logs. A slug is stable, scriptable, and collision-free. A version string embedded in the identifier makes every log line harder to parse and breaks when the version increments. Name it like a hostname, not a heading.

@mcp.on_startup calls startup(), which runs two things in order: _init_db() to ensure tables exist, then the extraction worker, started as a background asyncio task. The matching @mcp.on_shutdown hook gives you a clean place to flush queues and close connections; both lifecycle hooks are registered before the app is served. The worker pulls from extraction_queue, a standard asyncio.Queue, and runs extraction independently of any tool call. This is the infrastructure that makes store_memory’s async-acknowledge pattern work: the queue was there before the tool was written, not added after.

The app factory and middleware ordering. create_app() calls mcp.streamable_http_app(), which returns a standard Starlette ASGI app, and then wires the three middleware layers. There is a non-obvious detail here: Starlette’s app.add_middleware() applies middleware in reverse order. The last middleware added runs outermost (first to receive the request). So while the code reads LoggingMiddleware → TimeoutMiddleware → BearerAuthMiddleware, at runtime BearerAuth runs first, Timeout runs second, and Logging runs last, closest to the tool. This matches the diagram above: unauthorized calls never consume timeout budget, and the logger runs after auth has set user_id in the ContextVar, so every log line carries a verified identity. The trade-off is that duration_ms measures the tool call rather than the full request, and requests rejected with a 401 never reach the logger. When you add a fourth middleware layer, add it before the BearerAuthMiddleware line so that auth stays outermost.
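The "last added is outermost" rule is easy to misread, so here is a toy model of it. MiniApp is a hypothetical class written for this illustration; it imitates only the ordering behavior and shares no code with Starlette.

```python
from typing import Callable


class MiniApp:
    """Toy model of the ordering rule; this is not Starlette's code."""

    def __init__(self) -> None:
        self._layers: list[str] = []

    def add_middleware(self, name: str) -> None:
        # Each new layer goes to the front, so the last one added is outermost.
        self._layers.insert(0, name)

    def handle(self) -> list[str]:
        trace: list[str] = []

        def tool() -> None:
            trace.append("tool")

        handler: Callable[[], None] = tool
        # Wrap from the innermost layer outwards.
        for name in reversed(self._layers):
            handler = self._wrap(name, handler, trace)
        handler()
        return trace

    @staticmethod
    def _wrap(name: str, inner: Callable[[], None], trace: list[str]) -> Callable[[], None]:
        def layer() -> None:
            trace.append(name)  # recorded on the way in
            inner()
        return layer


app = MiniApp()
app.add_middleware("logging")
app.add_middleware("timeout")
app.add_middleware("auth")
order = app.handle()
```

With the same add order as create_app(), the request passes through auth, then timeout, then logging, and only then reaches the tool.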

The docstring contract. Look at two examples from the server. store_memory: “Store conversation messages for background memory extraction. Returns immediately — extraction runs async. Safe to retry with the same key.” Every clause does work. “Returns immediately” tells the agent not to block waiting for a result. “Safe to retry with the same key” tells the agent that calling it twice with an identical idempotency_key is explicitly safe. Compare to “Stores memories.” The model has no signal for when to call it, how the call behaves, or whether retry is appropriate.

FastMCP takes the tool description from the docstring, so treat everything in it as text the model may read. Lead with the instruction to the model, keep it short, and move notes meant only for human readers into code comments.

Structured errors as LLM instructions. Every error return in the server follows the same shape: {"status": "error", "error": "...", "code": "SNAKE_CASE_CODE"}. The error field is written for the model to understand. The code field is written for code to switch on. Look at delete_memory: "Memory '{memory_id}' not found. Use inspect_memories to list valid IDs." — the model can relay that to the user and knows exactly what to do next. "MEMORY_NOT_FOUND" — a programmatic handler can match it without string parsing. Both fields are present because there are two consumers. Never return just a string. Never raise an unhandled exception from a tool — FastMCP will serialize it, but the model receives a generic error object with no actionable content.
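One way to keep the shape consistent is a pair of small constructors. The ok, err, and handle helpers below are hypothetical, as is the in-memory store; the error message and code are the ones from delete_memory.

```python
from typing import Any


def ok(data: Any) -> dict:
    return {"status": "ok", "data": data, "error": None}


def err(message: str, code: str) -> dict:
    # message is written for the model; code is written for programs.
    return {"status": "error", "data": None, "error": message, "code": code}


def delete_memory(store: dict, memory_id: str) -> dict:
    if memory_id not in store:
        return err(
            f"Memory '{memory_id}' not found. Use inspect_memories to list valid IDs.",
            "MEMORY_NOT_FOUND",
        )
    del store[memory_id]
    return ok({"deleted": memory_id})


def handle(result: dict) -> str:
    """A programmatic consumer: branches on the code, never on the message text."""
    if result["status"] == "ok":
        return "done"
    if result.get("code") == "MEMORY_NOT_FOUND":
        return "refresh_ids"
    return "escalate"


store = {"m1": "likes tea"}
first = delete_memory(store, "m1")
second = delete_memory(store, "m1")
```

The second delete fails with a message the model can act on and a code the handler can match without string parsing.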

Why not put timeout inside each tool? Because you’ll miss one. Middleware applies universally — every tool added in the future gets it automatically. A per-tool timeout requires every developer who adds a tool to remember to add it. The middleware pattern makes correctness the default. Missing it becomes impossible, not just unlikely.

Now the async-acknowledge pattern in focused form — the idempotency check, the enqueue, and the immediate return in isolation:

@mcp.tool()
async def store_memory(text: str, topic: str, idempotency_key: str) -> dict:
    """Store conversation messages for background memory extraction.
    Returns immediately — extraction runs async. Safe to retry with the same key."""
    user_id = user_id_ctx.get()

    async with aiosqlite.connect("recall.db") as db:
        # Idempotency check BEFORE enqueue — fast SELECT, no allocation
        existing = await db.execute_fetchall(
            "SELECT 1 FROM operations WHERE idempotency_key = ?",
            (idempotency_key,),
        )
        if existing:
            # Duplicate call: same envelope as success; "cached": True marks it as a replay
            return {"status": "ok", "data": {"queued": False, "cached": True}, "error": None}

        # First call: create the operation record, then enqueue
        job_id = str(uuid.uuid4())
        await db.execute(
            "INSERT INTO operations (id, idempotency_key, user_id, status) VALUES (?,?,?,'queued')",
            (job_id, idempotency_key, user_id),
        )
        await db.commit()

    # Enqueue AFTER the DB record is committed — no orphaned jobs
    await extraction_queue.put({
        "job_id": job_id,
        "user_id": user_id,
        "text": text,
        "topic": topic,
    })

    # Return in <10ms. Agent continues. Worker runs independently.
    return {"status": "ok", "data": {"queued": True, "job_id": job_id}, "error": None}

The ordering here is deliberate. Idempotency check before enqueue: the check is a fast SELECT read, while enqueuing means allocating a job record. Take the fast path first. DB commit before queue put: if the process crashes between commit and enqueue, the record exists as queued and can be recovered. An orphaned queue entry with no DB record is harder to detect and reconcile.
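The check and the insert are two separate statements, so two calls arriving close together could both pass the SELECT. One possible hardening, sketched here with the standard-library sqlite3 module, is to let the database decide. This assumes a UNIQUE constraint on idempotency_key, which this issue does not show, and the claim helper is hypothetical.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE operations ("
    "id TEXT PRIMARY KEY, "
    "idempotency_key TEXT UNIQUE, "  # assumption: not confirmed by the article's schema
    "user_id TEXT, status TEXT)"
)


def claim(idempotency_key: str, user_id: str) -> dict:
    """Insert the operation if the key is new; report a replay otherwise."""
    job_id = str(uuid.uuid4())
    cur = db.execute(
        "INSERT OR IGNORE INTO operations (id, idempotency_key, user_id, status) "
        "VALUES (?,?,?,'queued')",
        (job_id, idempotency_key, user_id),
    )
    db.commit()
    if cur.rowcount == 0:
        # The key already existed, so the insert was skipped.
        return {"queued": False, "cached": True}
    return {"queued": True, "job_id": job_id}


first = claim("conv-42", "alice")
second = claim("conv-42", "alice")
row_count = db.execute("SELECT COUNT(*) FROM operations").fetchone()[0]
```

The check and the write collapse into one statement, so there is no window between them for a second caller to slip through.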

The job_id UUID is the extension point. A future get_job_status tool (not yet shipped) would accept this ID and return the extraction state. For now, the agent treats {"status": "ok", "data": {"queued": true}} as a confirmation: the work will happen. If the agent needs the extracted memories, it calls search_memories after a delay. This is not polling — it’s a design choice: write operations are fire-and-confirm, read operations are always synchronous.

  SYNC TOOL (blocks agent)               ASYNC-ACKNOWLEDGE (returns fast)
  ─────────────────────────              ────────────────────────────────

  Agent                                  Agent
    │                                      │
    │  call store_memory(...)              │  call store_memory(...)
    │                                      │
    │  ← waiting ─────────────────────┐   │  ◄── {"status":"ok",           ◄──┐
    │  ← waiting                      │   │       "data":{"queued":true,       │
    │  ← waiting                      │   │       "job_id":"abc123"}} (10ms)   │
    │                                 │   │                                    │
    │  [EMBEDDING MODEL LOADS]        │   │  Agent continues next step         │
    │  [EXTRACTION RUNS: 45s]         │   │                                    │
    │  [WRITE TO DB]                  │   │       ┌─── background worker ───┐  │
    │                                 │   │       │  dequeues job           │  │
    │  ← response (45s later)  ───────┘   │       │  loads embedding model  │  │
    │                                      │       │  runs extraction        │  │
  Client timeout fires.                   │       │  writes to DB           │  │
  Retry triggers.                         │       │  marks job complete     │  │
  Original call still running.            │       └─────────────────────────┘  │
  Duplicate execution.                    │                                     │
                                          └─────────────────────────────────────┘

Sync tools block. Async-acknowledge returns in <10ms and lets the worker finish in its own time.
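The right-hand side of the diagram can be reproduced with nothing but asyncio. This is a scaled-down sketch: the 0.05-second sleep stands in for extraction, and the local worker and store_memory functions are simplified stand-ins for the server's versions, with no database or auth.

```python
import asyncio
import uuid


async def main() -> tuple[dict, list[str], list[str]]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)
    completed: list[str] = []

    async def worker() -> None:
        while True:
            job = await queue.get()
            try:
                await asyncio.sleep(0.05)  # stands in for slow extraction
                completed.append(job["job_id"])
            finally:
                queue.task_done()

    async def store_memory(text: str) -> dict:
        job_id = str(uuid.uuid4())
        await queue.put({"job_id": job_id, "text": text})
        # Acknowledge immediately; the worker finishes on its own schedule.
        return {"status": "ok", "data": {"queued": True, "job_id": job_id}, "error": None}

    worker_task = asyncio.create_task(worker())
    ack = await store_memory("user prefers tea over coffee")
    completed_at_ack = list(completed)  # snapshot taken when the ack arrives
    await queue.join()  # wait here only so the demo can observe completion
    worker_task.cancel()
    return ack, completed_at_ack, completed


ack, completed_at_ack, completed = asyncio.run(main())
```

At the moment the acknowledgement arrives, nothing has been extracted yet. The job finishes later, and the agent never waited for it.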

Three failures broke this server before that design settled. Here they are, named.

Three Failures, Named

THE HANGING TOOL

What happens:  Tool makes a slow external call with no timeout. Embedding
               model cold-starts at 45 seconds. Agent waits. Client-side
               timeout fires. Retry triggers. Original call completes in
               background. Duplicate execution. Idempotency key catches
               the duplicate write — but the 45-second hang is already
               a production incident. Users report the agent "freezing."

Root cause:    No timeout at middleware or tool level. Cold-start conditions
               not tested — development used a warm model that responded
               in 200ms.

How to detect: P95 latency on tool calls. Anything above 5 seconds on a
               read operation or 10 seconds on a write operation warrants
               investigation. Set these as alerts from day one.

Fix:           30-second hard timeout in TimeoutMiddleware. asyncio.wait_for
               wraps call_next. Never increase the timeout to accommodate
               a slow operation — redesign the operation to use
               async-acknowledge instead.

SCHEMA DRIFT

What happens:  A new required parameter is added to a tool signature.
               FastMCP automatically updates the schema on next startup.
               Existing agents — configured before the update — send
               requests missing the new field. FastMCP raises a validation
               error. The agent cannot call the tool until it is
               reconfigured with the new schema.

Root cause:    Tool interface treated as internal to the server. Changed
               without a compatibility path for existing callers.

How to detect: Validation error spike in logs immediately after deployment.
               FastMCP validation errors include the missing field name —
               easy to identify in structured logs.

Fix:           New parameters must have defaults (optional). Never add
               a required field to an existing tool in place. Breaking
               changes get a new tool name (store_memory_v2) and a
               deprecation period for the old one.
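A quick way to see the compatibility rule, using inspect.signature to imitate argument validation. The source parameter and the accepts helper are hypothetical; the article does not say which parameter was added.

```python
import inspect


def store_memory_required(text: str, topic: str, idempotency_key: str, source: str) -> None:
    """Breaking change: the new parameter has no default."""


def store_memory_optional(text: str, topic: str, idempotency_key: str, source: str = "chat") -> None:
    """Compatible change: the new parameter has a default."""


def accepts(fn, arguments: dict) -> bool:
    """Return True if the arguments satisfy the function's signature."""
    try:
        inspect.signature(fn).bind(**arguments)
        return True
    except TypeError:
        return False


# A call shaped by the schema that existed before the change.
old_call = {"text": "hello", "topic": "greeting", "idempotency_key": "k1"}

breaks_old_callers = not accepts(store_memory_required, old_call)
keeps_old_callers = accepts(store_memory_optional, old_call)
```

A check like this can run in CI against recorded calls from the previous release, so a breaking signature change fails before deployment rather than after.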

CONTEXT WINDOW OVERFLOW

What happens:  inspect_memories returns all stored memories for a user —
               2,000+ entries after a year of use. Agent injects the full
               result into context. Token limit exceeded. LLM either
               truncates the response, errors, or silently ignores the
               overflow. In all cases, the agent behaves incorrectly
               without a clear error signal.

Root cause:    No result size limit. SELECT * FROM memories WHERE
               user_id = ? with no LIMIT. The tool was tested with
               5-10 memories. Production users have thousands.

How to detect: Token usage spike on any tool that returns a list. Trace
               the tool call immediately preceding the spike.

Fix:           Default limit of 20. Hard cap of 50 enforced in code
               (min(limit, 50) — not just documented). Pagination with
               offset and has_more. Return total so the agent knows
               how many exist without fetching them all.
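The paging arithmetic from the fix, isolated into a hypothetical paginate helper that works on an in-memory list. The field names match inspect_memories; clamping the lower bound of limit is an addition made here for safety.

```python
def paginate(items: list, limit: int = 20, offset: int = 0, max_limit: int = 50) -> dict:
    """Return one page plus the metadata an agent needs to fetch the next one."""
    limit = max(1, min(limit, max_limit))  # enforce the cap in code, not in docs
    offset = max(0, offset)
    total = len(items)
    end = offset + limit
    return {
        "memories": items[offset:end],
        "total": total,
        "has_more": end < total,
        "next_offset": end if end < total else None,
    }


all_memories = [f"memory-{i}" for i in range(120)]

greedy = paginate(all_memories, limit=500)               # asks for far too many
last = paginate(all_memories, limit=50, offset=100)      # final partial page
```

A request for 500 entries still returns 50, and the final page reports has_more as False with no next_offset, so the agent knows when to stop.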

When do you use async-acknowledge, and when do you just return the result? One table.

Return Direct or Acknowledge Fast?

The decision is not about preference. It’s about what the agent needs and what the operation can promise.

Use async-acknowledge when:              Return direct when:
──────────────────────────────────       ────────────────────────────────────
Operation takes >500ms                   Operation takes <200ms
Has external dependency (LLM, API)       Agent needs result to continue
Agent can continue without result        Real-time response is the contract
Operation is idempotent                  Fast DB read or pure computation
(store_memory, batch extract)            (search_memories, get_memory_stats)

search_memories must be synchronous — the agent is waiting for retrieved context before generating a response. The delay matters to the user. get_memory_stats is always direct: it’s a health check, expected to be instant. store_memory is always async-acknowledge: it involves embedding model inference and DB writes, the agent doesn’t need the extraction result to continue the conversation, and the operation is idempotent with an explicit key.

The practical default: if a tool does any I/O you don’t control — external API, model inference, third-party service — default to async-acknowledge. You can always move it to direct-return later if it proves reliably fast in production. Going the other direction requires a new tool name or a breaking change, because agents that depended on the synchronous contract can’t silently absorb an asynchronous one.

There is a middle case worth naming: operations that take 200–500ms and where the agent can’t continue without the result. search_memories with a slow BM25 implementation falls here. The answer is not async-acknowledge — it’s to make the operation faster, or to accept that the agent will pause for that duration. Don’t push an operation to async because it’s slow and you don’t want to optimize it. Push it to async because the operation’s timing is genuinely uncontrollable.

Eight questions. If any are “no,” your server is not ready.

Resources

FastMCP: The fast, Pythonic way to build MCP servers (jlowin, GitHub)

The framework whose @mcp.tool() decorator, on_startup/on_shutdown hooks, and streamable_http_app() factory are demonstrated in the complete server listing.

Starlette Middleware documentation (Starlette)

The foundation for the three-layer middleware stack; the add_middleware reverse-ordering behavior explained in this issue is a Starlette-specific detail documented here.

asyncio: Python standard library, asyncio.wait_for (Python Docs)

The primitive behind TimeoutMiddleware; understanding its cancellation semantics explains why the 30-second cap produces a clean 504 rather than a hung connection.

aiosqlite: async SQLite for Python (omnilib, GitHub)

Used in every tool's DB access pattern; async context manager ensures the idempotency check runs without blocking the event loop.

Recall Reference Implementation (Sentient Zero Labs, GitHub)

The server.py listed in full is the actual repository file; fork it to get the complete middleware stack and all five tools without writing any boilerplate.

Is Your MCP Server Production-Ready?

Recall passes all eight. Check yours against this before exposing your server to an agent in production.

1. All tools use Python type hints, with no JSON schema written by hand
2. Tool docstrings are short runtime instructions for the LLM, not documentation
3. Middleware stack includes auth, timeout, and logging, in that order
4. Timeout is at the middleware level, not inside individual tools
5. All list-returning tools have a default limit and hard max (never unbounded)
6. store_memory (and any write >500ms) uses the async-acknowledge pattern
7. All error returns are typed dicts: {"status": "error", "error": "...", "code": "..."}
8. Server name is a stable slug, not a version string or display name

v0.1 of Recall passes all eight. The code from this issue is the baseline — fork it from github.com/Sentient-Zero-Labs/szl-recall, run it with fastmcp dev server.py, and add your tools to the structure that’s already there. The middleware is wired. The patterns are in place. The only remaining question is whether your tools are designed correctly — and that’s Issue 04.

Recall v0.1 started with eight tools. Three were removed before this server shipped. The decisions behind that reduction — why fewer, better-scoped tools outperformed the broader original set, and what “intent-aligned” means in practice — are the subject of Issue 04.

Building Effective Tools for AI is a seven-issue series from Sentient Zero Labs. Each issue ships with working code from the Recall memory server — a production MCP tool built in public alongside the series.

Until next issue,

Sentient Zero Labs