Building Effective Tools for AI · Issue 2/7

MCP Architecture In Depth

Three transports, three primitives, and the trust boundary that determines where auth and secrets belong in an MCP server.

May 12, 2026 · 18 min read · Sentient Zero Labs


The parameter was called user_token: str, and it looked completely reasonable.

Recall’s first MCP server was being built on a Tuesday. The design question was mundane: how does the server know which user is making the request? The obvious answer — add user_token as a parameter to the store_memory tool — took about forty seconds to implement. The LLM receives the tool schema, sees the user_token field, passes it through on every call, and the server validates it. Clean. Obvious. Shipped.

The problems didn’t surface immediately. They surfaced in stages.

First: a trace log from Claude Desktop showed the full tool schema (user_id, user_token, text, topic) embedded in the conversation context, and every tool call carried the token value as an argument. Nothing in the logs was encrypted. The token was there, verbatim, in plain text, in the agent’s prompt history. In every session where Recall ran, the token was being written into the conversation record.

Second: under a test that replayed a long session, the model forgot to pass user_token three times. Not occasionally — three times in a single session. The model judged that the parameter was “probably not needed for this particular call.” Silent auth failure. The tool ran without a valid user context and returned a 200 because the fallback defaulted to an empty string rather than raising.

Third: a junior engineer added a new tool to the server, looked at the existing schema, and copied the pattern. Now two tools had user_token as a parameter. The surface area had doubled.

The root cause of all three problems was the same: auth had been placed where the LLM could see it, manage it, and forget it. Auth was handed to the wrong controller.

Auth in a tool parameter means the LLM is your auth manager. It isn’t.

The fix took two hours. Understanding why it was wrong took longer — because the fix only makes sense once you understand the distinction the MCP spec makes that most tutorials gloss over. Before the code, we need the mental model — because the same mistake shows up in three different shapes once you know what to look for.

Mental Model: Three Transports, Three Primitives, One Distinction

MCP is three transports, three primitives, and one distinction. Most engineers learn the transports and the primitives. The distinction is the one that changes what you build.

The Three Transports

MCP spec revisions are identified by date, and between them they have defined three transports:

2024-11-05: stdio and HTTP+SSE
2025-03-26: Streamable HTTP added, replacing HTTP+SSE
2025-11-25: stdio and Streamable HTTP are the standard transports; HTTP+SSE survives only for backward compatibility

┌──────────────────┬──────────────────────┬───────────────────────┬────────────────────────────┐
│                  │  stdio               │  SSE (deprecated)     │  Streamable HTTP           │
├──────────────────┼──────────────────────┼───────────────────────┼────────────────────────────┤
│  Introduced in   │  2024-11-05          │  2024-11-05           │  2025-03-26                │
├──────────────────┼──────────────────────┼───────────────────────┼────────────────────────────┤
│  Architecture    │  stdin/stdout pipes  │  GET /sse +           │  Single POST endpoint      │
│                  │                      │  POST /messages       │  Optional inline SSE       │
├──────────────────┼──────────────────────┼───────────────────────┼────────────────────────────┤
│  Concurrency     │  One process pair    │  Multiple clients,    │  Multiple clients,         │
│                  │  No multiplexing     │  stateful connection  │  can run stateless         │
├──────────────────┼──────────────────────┼───────────────────────┼────────────────────────────┤
│  Auth            │  None (local only)   │  Possible, complex    │  Bearer token in header    │
├──────────────────┼──────────────────────┼───────────────────────┼────────────────────────────┤
│  Horizontal      │  No                  │  No (connection maps  │  Yes                       │
│  scaling         │                      │  require sticky       │                            │
│                  │                      │  sessions)            │                            │
├──────────────────┼──────────────────────┼───────────────────────┼────────────────────────────┤
│  Use when        │  Claude Desktop,     │  Legacy clients only  │  All production servers.   │
│                  │  local dev,          │  (backward compat)    │  Any multi-user scenario.  │
│                  │  single process pair │                       │                            │
└──────────────────┴──────────────────────┴───────────────────────┴────────────────────────────┘

Three transports for three contexts. SSE is backward-compatible but deprecated.

stdio is structurally incapable of handling multiple concurrent clients. It’s a single process pair — one stdin, one stdout, one conversation at a time. This isn’t a configuration limit; it’s architectural. Running stdio in a server that handles concurrent users isn’t a bad practice — it doesn’t work at all.

SSE was deprecated for a specific reason worth understanding if you’re already using it. The SSE architecture required two endpoints: a GET /sse for the downstream event stream and a POST /messages for upstream requests. Maintaining the mapping between those two connections required stateful connection tracking, which made horizontal scaling nearly impossible without sticky sessions. If your load balancer sent the GET and the POST to different server instances — the common default — the server had no way to correlate them. Streamable HTTP solved this by collapsing to a single POST endpoint with optional inline SSE for streaming responses. Stateless, load-balanceable, and simpler.
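To make the single-endpoint shape concrete, here is a minimal sketch of the request a client might POST for one tool call. The endpoint path /mcp, the token value, and the helper name build_tool_call_request are assumptions for illustration; the JSON-RPC envelope, the tools/call method, and the two-value Accept header follow the MCP spec. The sketch builds the request but does not send it.

```python
import json


def build_tool_call_request(tool: str, arguments: dict, token: str, request_id: int = 1) -> dict:
    """Build (but do not send) one Streamable HTTP request for an MCP tool call."""
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return {
        "method": "POST",
        "path": "/mcp",  # assumed endpoint path
        "headers": {
            "Content-Type": "application/json",
            # The client accepts either a plain JSON reply or an inline SSE stream.
            "Accept": "application/json, text/event-stream",
            # The credential rides in the header, outside anything the model sees.
            "Authorization": f"Bearer {token}",
        },
        "body": json.dumps(body),
    }


request = build_tool_call_request(
    "store_memory", {"text": "likes ramen", "topic": "food"}, "sk-abc123"
)
print(request["headers"]["Authorization"])
print(request["body"])
```

Everything the server needs arrives in that one request, which is why any instance behind a load balancer can answer it.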

If you’re running SSE in production today, it continues to work while clients maintain backward compatibility. But the scaling ceiling is real, and new servers should default to Streamable HTTP.

The Three Primitives

┌──────────────┬───────────────────────┬─────────────────────────────┬───────────────────────────┐
│  Primitive   │  Who controls it      │  Who decides when            │  Where auth/secrets go    │
├──────────────┼───────────────────────┼─────────────────────────────┼───────────────────────────┤
│  TOOL        │  Model-controlled     │  LLM reads schema, decides  │  Never in schema params.  │
│              │                       │  when to invoke + what       │  Auth belongs in          │
│              │                       │  parameters to pass          │  transport/middleware.    │
├──────────────┼───────────────────────┼─────────────────────────────┼───────────────────────────┤
│  RESOURCE    │  App-controlled       │  Application injects on      │  App owns injection.      │
│              │                       │  each request — LLM sees     │  Secrets injected by app, │
│              │                       │  content, cannot invoke      │  never passed via LLM.    │
├──────────────┼───────────────────────┼─────────────────────────────┼───────────────────────────┤
│  PROMPT      │  User-controlled      │  Human triggers via slash    │  User identity via        │
│              │                       │  command or UI menu          │  session/client context.  │
└──────────────┴───────────────────────┴─────────────────────────────┴───────────────────────────┘

Trust boundary rule:
  Transport layer   →  secrets (tokens, keys, credentials)
  App layer         →  session context, user profile, reference data (Resources)
  Model layer       →  what the LLM needs to ACT on (Tool parameters)

Crossing this boundary in the wrong direction is how secrets leak.

Who controls each primitive determines where your auth and secrets belong.

Each primitive has a distinct controller. Tools are model-controlled: the LLM reads the schema, decides when to invoke the tool, and provides the parameters. The LLM is the decision-maker. Resources are app-controlled: the application decides what context to inject, and the LLM sees the content when it’s included in the request context — but cannot independently invoke a Resource the way it invokes a Tool. Prompts are user-controlled: slash commands and menu items in client UIs, triggered by the human directly.

The Distinction That Changes Everything

The ReAct paper (Yao et al., 2022) demonstrated that LLMs could use tools reliably by interleaving reasoning and action steps. What the paper assumed — that tool calls are initiated by the model when it decides to act — is exactly the model-controlled primitive in MCP’s terminology. But that same paper took something else as given: that context is available to the model without the model having to fetch it. The model-controlled/app-controlled axis formalizes this split. Modern agent architectures depend on it. The LLM acts through Tools; the application provides context through Resources. Mixing these up means either the LLM is doing work it shouldn’t (fetching its own context) or carrying secrets it shouldn’t.

The Recall token mistake was a category error: putting a secret — something that belongs at the transport layer — into a Tool parameter, which is part of the model layer. It crossed the trust boundary in the wrong direction.

The correct location for every piece of context follows directly from who should control it:

Anything the LLM should act on (data it needs to call a tool correctly) → Tool parameter. Fine.
Anything the LLM should have access to but not manage (user profile, session state, auth identity) → Resource, injected by the app.
Anything that is a secret (tokens, keys, credentials) → middleware at the transport layer. It never touches the LLM layer at all.

The spec’s model-controlled/app-controlled axis isn’t a labeling choice. It’s a trust boundary. Crossing it in the wrong direction is how secrets leak.

Here’s what the correct architecture looks like in code — and what changed in Recall’s server once we moved auth out of the tool.

Implementation: The Correct Auth Pattern

The pattern below is Recall’s FastMCP server scaffold: Streamable HTTP transport, bearer token middleware, and context variable injection. It’s the starting point for any MCP server that needs auth. The middleware is a Starlette BaseHTTPMiddleware subclass — FastMCP’s streamable_http_app() returns a standard Starlette ASGI app.

The middleware intercepts every request before any tool runs. The token never appears in any tool’s parameter schema.

import hashlib
from contextvars import ContextVar

import aiosqlite
from fastmcp import FastMCP
from starlette.middleware.base import BaseHTTPMiddleware, RequestResponseEndpoint
from starlette.requests import Request
from starlette.responses import JSONResponse, Response

mcp = FastMCP("recall")

# Request-scoped user identity: async-safe, never shared between requests.
# No default on purpose: reading it outside an authenticated request raises
# LookupError instead of silently yielding an empty user.
user_id_ctx: ContextVar[str] = ContextVar("user_id")


class BearerAuthMiddleware(BaseHTTPMiddleware):
    """Validates Bearer token before any tool code runs. Token never touches a tool schema."""

    async def dispatch(self, request: Request, call_next: RequestResponseEndpoint) -> Response:
        auth = request.headers.get("Authorization", "")
        if not auth:
            return JSONResponse({"error": "Unauthorized"}, status_code=401)

        # Handles both "Bearer <token>" and bare "<token>"
        token = auth.removeprefix("Bearer ").strip()
        user_id = await validate_token(token)  # your DB / cache lookup

        if not user_id:
            return JSONResponse({"error": "Unauthorized"}, status_code=401)

        # Inject identity for tools to read, without a parameter in their schema
        user_id_ctx.set(user_id)
        return await call_next(request)


async def validate_token(token: str) -> str | None:
    """Return user_id if token is valid, None otherwise. Swap in your auth logic."""
    if not token:
        return None
    # Only the SHA-256 hash of each token is stored, so hash the presented
    # token and look up the hash.
    token_hash = hashlib.sha256(token.encode()).hexdigest()
    async with aiosqlite.connect("recall.db") as db:
        rows = await db.execute_fetchall(
            "SELECT user_id FROM api_tokens WHERE token_hash = ? AND revoked = 0",
            (token_hash,),
        )
    return rows[0][0] if rows else None


@mcp.tool()
async def store_memory(text: str, topic: str) -> dict:
    """Store a memory for the current user. Returns immediately — extraction runs async."""
    user_id = user_id_ctx.get()  # injected by middleware, not a parameter
    # ... store logic
    return {"status": "ok", "data": {"queued": True}, "error": None}


@mcp.tool()
async def search_memories(query: str, limit: int = 20) -> dict:
    """Search the user's memories using hybrid retrieval (~200-500ms)."""
    user_id = user_id_ctx.get()
    # ... search logic
    return {"status": "ok", "data": {"results": [], "total": 0}, "error": None}


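if __name__ == "__main__":
    import uvicorn

    # The method name depends on the FastMCP release: newer 2.x versions expose
    # http_app() and treat streamable_http_app() as a deprecated alias.
    app = mcp.streamable_http_app()
    app.add_middleware(BearerAuthMiddleware)  # auth wraps everything
    uvicorn.run(app, host="0.0.0.0", port=8000)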

Three non-obvious decisions in this scaffold are worth walking through.

Why ContextVar and not a module-level global. A module-level user_id = "" is shared state across all concurrent requests in the same process. Under asyncio, two requests running concurrently can interleave: request A sets user_id = "user-123", request B sets user_id = "user-456", and when request A’s tool handler reads user_id it gets "user-456". This isn’t a theoretical race — it’s the default behavior of a global in an async application with any concurrency at all. ContextVar creates a separate binding per asyncio task. Each request gets its own slot. user_id_ctx.get() inside a tool always returns the value set by the middleware for that specific request, regardless of what’s happening in parallel tasks.
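The interleaving is easy to reproduce with nothing but the standard library. The sketch below is a stand-in for two concurrent requests, not Recall's code: the handler names and the asyncio.sleep(0) yield point are assumptions chosen to force the same task switch that any real await on I/O would allow.

```python
import asyncio
from contextvars import ContextVar

current_user_global = ""  # shared by every task in the process
current_user_ctx: ContextVar[str] = ContextVar("current_user")  # no default: unset reads raise


async def handle_with_global(user: str) -> str:
    global current_user_global
    current_user_global = user
    await asyncio.sleep(0)  # yield to the event loop, as a real await on I/O would
    return current_user_global  # may now hold another request's user


async def handle_with_ctx(user: str) -> str:
    current_user_ctx.set(user)
    await asyncio.sleep(0)
    return current_user_ctx.get()  # always this task's own value


async def main() -> tuple[list[str], list[str]]:
    users = ["user-123", "user-456"]
    leaked = await asyncio.gather(*(handle_with_global(u) for u in users))
    isolated = await asyncio.gather(*(handle_with_ctx(u) for u in users))
    return list(leaked), list(isolated)


leaked, isolated = asyncio.run(main())
print(leaked)    # ['user-456', 'user-456']
print(isolated)  # ['user-123', 'user-456']
```

With the global, both handlers return the last writer's user. With the ContextVar, each task reads back what it set. Because current_user_ctx is declared without a default, reading it where nothing set it raises LookupError rather than returning an empty identity.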

The removeprefix pattern. auth.removeprefix("Bearer ").strip() handles both Authorization: Bearer sk-abc123 and bare Authorization: sk-abc123 without branching or regex. removeprefix is a Python 3.9+ string method that removes the prefix if present and returns the string unchanged if it is absent. The .strip() drops any surrounding whitespace. The result is a clean token string either way, so two client implementations that format the header differently both work.
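A quick check of that parsing on the header shapes a server is likely to see. extract_token is a hypothetical helper name; the body is the same expression the middleware uses.

```python
def extract_token(header_value: str) -> str:
    """Mirror the scaffold's parsing: drop an optional 'Bearer ' prefix, trim whitespace."""
    return header_value.removeprefix("Bearer ").strip()


print(extract_token("Bearer sk-abc123"))        # sk-abc123
print(extract_token("sk-abc123"))               # sk-abc123
print(repr(extract_token("bearer sk-abc123")))  # 'bearer sk-abc123'
print(repr(extract_token("")))                  # ''
```

The third case is the one to know about: the prefix match is case-sensitive, so a lowercase scheme is left in place and the lookup fails closed with a 401. HTTP auth scheme names are case-insensitive, so a server that must accept such clients would need to normalize the scheme before stripping it.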

What happens on auth failure. The middleware returns JSONResponse({"error": "Unauthorized"}, status_code=401) before calling call_next(request). The tool function never runs. The MCP client receives a well-formed JSON error (not an exception traceback, not an empty response) that it can surface to the user or feed into the agent’s error-recovery logic. The 401 also prevents the request from reaching any part of the tool layer, which means there’s no partial execution to reason about.
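The short-circuit can be checked without a running server. The sketch below is a dependency-free stand-in written against the raw ASGI interface rather than Starlette's BaseHTTPMiddleware, so it illustrates the control flow and is not the scaffold itself. VALID_TOKENS, tool_layer, require_bearer, and the fake request driver are all hypothetical names.

```python
import asyncio
import json

VALID_TOKENS = {"sk-abc123": "user-123"}  # hypothetical in-memory token table
calls: list[str] = []  # records whether the tool layer was reached


async def tool_layer(scope, receive, send):
    """Stand-in for the wrapped MCP app."""
    calls.append(scope["path"])
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"{}"})


def require_bearer(app):
    """Wrap an ASGI app so unauthenticated requests never reach it."""
    async def wrapped(scope, receive, send):
        headers = dict(scope.get("headers", []))
        raw = headers.get(b"authorization", b"").decode()
        token = raw.removeprefix("Bearer ").strip()
        if token not in VALID_TOKENS:
            body = json.dumps({"error": "Unauthorized"}).encode()
            await send({"type": "http.response.start", "status": 401,
                        "headers": [(b"content-type", b"application/json")]})
            await send({"type": "http.response.body", "body": body})
            return  # short-circuit: the inner app is never called
        await app(scope, receive, send)
    return wrapped


async def request(app, headers):
    """Drive the app with a fake request and return the response status."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "POST", "path": "/mcp", "headers": headers}
    await app(scope, receive, send)
    return sent[0]["status"]


app = require_bearer(tool_layer)
rejected = asyncio.run(request(app, []))
accepted = asyncio.run(request(app, [(b"authorization", b"Bearer sk-abc123")]))
print(rejected, accepted, calls)  # 401 200 ['/mcp']
```

The calls list holds a single entry after both requests, which is the property that matters: the rejected request left no trace in the tool layer.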

Here is the structural difference between the old design and the new one:

  BEFORE (broken)                       AFTER (correct)
  ───────────────                       ───────────────

  Agent                                 Agent
    │                                     │
    │  call store_memory(                 │  call store_memory(
    │    user_id="u1",                    │    text="...",
    │    user_token="sk-abc123",  ←────── │    topic="food"
    │    text="..."                       │  )
    │  )                                  │     +  Authorization: Bearer sk-abc123
    │                                     │        (in HTTP header, not schema)
    ▼                                     ▼
  ┌──────────────────────────┐          ┌──────────────────────────┐
  │  FastMCP tool schema:    │          │  BearerAuthMiddleware:   │
  │                          │          │    extracts token         │
  │  user_token: string  ←── │ visible  │    validates with DB      │
  │  user_id: string         │ to LLM   │    injects user_id via   │
  │  text: string            │          │    ContextVar            │
  │  topic: string           │          └──────────────────────────┘
  └──────────────────────────┘                     │
                                                    ▼
  Problems with BEFORE:                   ┌──────────────────────────┐
  • Token in LLM conversation logs        │  FastMCP tool schema:    │
  • LLM sometimes drops it silently       │                          │
  • Token embedded in every agent prompt  │  text: string            │
    that uses Recall                      │  topic: string           │
  • LLM is your auth manager. It isn't.  │                          │
                                          │  (no token in schema)    │
                                          └──────────────────────────┘

Before: token visible in LLM context. After: token validated before the LLM layer.

The before state isn’t just a security risk — it’s a reliability risk. The LLM occasionally drops parameters it judges as “probably not needed.” A token passed as a parameter can disappear silently. Middleware auth cannot be dropped. It runs on every request, unconditionally, before the tool handler is ever reached.

Resource vs. Tool: The Same Data, Two Controllers

The same data can be a Tool or a Resource. Which one you choose determines who controls it — and that has direct consequences for latency, token cost, and reliability.

# ── AS A TOOL (model-controlled) ─────────────────────────────────────────────
# LLM decides when to call it. LLM must remember to call it. Burns tokens each time.

@mcp.tool()
async def get_user_profile(user_id: str) -> dict:
    """Get the profile and preferences for a user."""
    profile = await db.get_profile(user_id)
    return {"status": "ok", "data": profile.to_dict(), "error": None}

# Signal that this is a Resource in disguise:
# • Called on >80% of conversation turns
# • Result is static within a session
# • LLM often calls it first before doing anything else
# • Every call is a round-trip that burns tokens and latency


# ── AS A RESOURCE (app-controlled) ───────────────────────────────────────────
# App injects it before the LLM sees anything. Always available. Zero tool calls.

@mcp.resource("user://profile")
async def user_profile_resource() -> str:
    """User profile and preferences — injected by app, not called by LLM."""
    user_id = user_id_ctx.get()  # available from auth middleware
    profile = await db.get_profile(user_id)
    return profile.to_json()

# The app includes this resource in every request context:
# resource_result = await mcp.read_resource("user://profile")
# # LLM receives the profile automatically — no tool call needed


# ── RULE ─────────────────────────────────────────────────────────────────────
# Tool:     LLM drives the action. Data changes per call. Side effects.
# Resource: App controls injection. Read-only. Static within session.
#
# If a tool is called >80% of turns and always returns the same value
# for that session — it is a Resource in disguise. Move it.

A get_user_profile tool implemented as model-controlled will be called on nearly every turn. The LLM “needs the data” before it can respond helpfully, so it calls the tool first. Each call costs a round-trip to the server, tokens for the call and its response, and latency before any user-facing work begins. At scale, that cost grows in proportion to session count. Any data that doesn’t change within a session and has no side effects belongs in a Resource. The observable signal is tool call frequency: any tool called on more than 80% of turns is a Resource in disguise. Move it.

These are the architectural decisions that prevent failures. Here are the failures you’ll hit anyway.

Failure Modes

Three failure modes — named specifically enough that you’ll recognize them when they appear.

AUTH LEAKAGE

What happens:  Token passed as tool parameter appears verbatim in LLM
               conversation context. It logs to agent traces, embeds in
               every prompt that includes the tool schema, and may surface
               in debugging output. In Recall: 'user_token' field visible
               in Claude Desktop's tool schema view for every session.

Root cause:    Auth designed for the tool layer instead of the transport
               layer. The engineer put auth where the schema lives, not
               where credentials belong.

How to detect: Audit all tool parameter names. If any contain 'token',
               'key', 'secret', 'auth', or 'credential' — this failure
               mode is present. The audit takes 2 minutes.

Fix:           Remove auth parameters from all tool schemas. Add a
               BaseHTTPMiddleware subclass. Validate in dispatch(),
               inject identity via ContextVar. The tool sees user_id,
               not the credential that proved it.
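The two-minute audit can be scripted. This is a minimal sketch that assumes the tool definitions are available as dicts in the shape an MCP tools/list response uses (name plus a JSON Schema under inputSchema). The function name and the sample schemas are hypothetical.

```python
SUSPECT_WORDS = ("token", "key", "secret", "auth", "credential")


def find_auth_params(tools: list[dict]) -> list[tuple[str, str]]:
    """Return (tool_name, param_name) pairs whose parameter name looks like a credential."""
    hits = []
    for tool in tools:
        properties = tool.get("inputSchema", {}).get("properties", {})
        for param in properties:
            if any(word in param.lower() for word in SUSPECT_WORDS):
                hits.append((tool["name"], param))
    return hits


# Hypothetical schemas in the shape returned by an MCP tools/list call.
tools = [
    {"name": "store_memory",
     "inputSchema": {"type": "object",
                     "properties": {"user_token": {"type": "string"},
                                    "text": {"type": "string"},
                                    "topic": {"type": "string"}}}},
    {"name": "search_memories",
     "inputSchema": {"type": "object",
                     "properties": {"query": {"type": "string"},
                                    "limit": {"type": "integer"}}}},
]
print(find_auth_params(tools))  # [('store_memory', 'user_token')]
```

Substring matching is deliberately blunt. It will also flag harmless names such as keyword or author, so treat the output as a list to review rather than a verdict.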

WRONG TRANSPORT IN PRODUCTION

What happens:  stdio transport deployed for a server that needs multiple
               concurrent clients. Second connection blocks waiting for
               the process pair to free. Under load, requests queue.
               At 10 concurrent users: complete service failure.

Root cause:    stdio is a single-process-pair transport by design.
               It has no multiplexing path. This is structural,
               not configurable.

How to detect: Run two simultaneous client connections to the server.
               One will block immediately. This is a one-minute test
               that most teams skip until production.

Fix:           Streamable HTTP for any server with more than one client.
               stdio is correct for exactly two cases: Claude Desktop
               plugins and local development where the process pair
               constraint is acceptable.

TOOL-RESOURCE CONFUSION

What happens:  Static reference data (user profile, app config, session
               preferences) implemented as a Tool. LLM calls it on nearly
               every turn — it "needs the data" before it can help. Token
               consumption rises. Latency spikes per turn. Under scale,
               cost grows proportionally.

Root cause:    Resource primitives not used for app-controlled, read-only
               context. The data is the same for every call in a session
               but was built as a Tool because that's the familiar pattern.

How to detect: Tool call frequency metrics. Any tool called >80% of
               conversation turns is a strong candidate. Observable as
               latency spikes at turn start — before any user-facing
               work begins.

Fix:           Implement as an MCP Resource. App injects it per-request
               using @mcp.resource() and includes it in context. LLM
               receives the data without invoking a tool. Zero round-trips.
               Zero token burn for schema parsing.
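The frequency check is a few lines once tool calls are logged per turn. A minimal sketch, assuming a log shaped as one list of tool names per conversation turn; the function name, the log format, and the sample session are hypothetical.

```python
from collections import Counter


def resource_candidates(turns: list[list[str]], threshold: float = 0.8) -> dict[str, float]:
    """Return tools invoked on more than `threshold` of turns, with their turn share."""
    if not turns:
        return {}
    # Count each tool once per turn, however many times it was called in that turn.
    turns_with_tool = Counter(name for turn in turns for name in set(turn))
    return {
        name: count / len(turns)
        for name, count in turns_with_tool.items()
        if count / len(turns) > threshold
    }


# Hypothetical log: one inner list of tool names per conversation turn.
session = [
    ["get_user_profile", "search_memories"],
    ["get_user_profile"],
    ["get_user_profile", "store_memory"],
    ["get_user_profile"],
    ["get_user_profile", "search_memories"],
]
print(resource_candidates(session))  # {'get_user_profile': 1.0}
```

Frequency alone does not prove the data is static within a session, so a flagged tool still needs the second check: does it return the same value on every call?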

Knowing the failure modes is half the answer. The other half is knowing which approach to use when — before you build.

Decision Guide

Two decisions every MCP server forces you to make upfront: which transport, and which primitive for which data.

Use stdio when:                     Use Streamable HTTP when:
──────────────────────────          ──────────────────────────────────
Claude Desktop plugin               Any production server
Local dev, single process pair      Multi-client / multi-user
No auth required                    Auth + scaling required
Testing in isolation                Horizontal scaling needed
One consumer, one server            Any real deployment

Default to Streamable HTTP. Use stdio only when you’re building a local tool and the process pair constraint is acceptable.

Use Tool when:                      Use Resource when:
──────────────────────────          ──────────────────────────────────
LLM drives the action               App controls injection
Side effects (writes)               Read-only / reference data
User-initiated operations           Per-request context
Data that changes per call          Static within session
Different result per invocation     Called >80% of turns with same result
LLM needs to decide when to call    Data should always be available

If the data is the same for every call in a session and has no side effects — it’s a Resource. Everything else is probably a Tool.

One note on Prompts: they’re for pre-built user-facing workflows — slash commands in Claude Desktop, menu items in client UIs. They’re relatively rare in production server code. Most of what you build will be Tools and Resources. Prompts become relevant when you’re building a polished user experience on top of the protocol, not when you’re building the server internals.

One last pass before shipping.

Resources

Model Context Protocol Specification (2025-11-25)

modelcontextprotocol.io

The authoritative spec defining the three transports and three primitives covered in this issue; the tool/resource/prompt distinction maps to the spec's core sections.

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al., 2022 — arXiv

Establishes the model-controlled tool invocation pattern that makes the tool-vs-resource trust boundary consequential.

FastMCP: The fast, Pythonic way to build MCP servers

jlowin — GitHub

The framework used to build Recall's server; BaseHTTPMiddleware, ContextVar injection, and streamable_http_app() are demonstrated throughout this issue.

Starlette Middleware documentation

Starlette

The middleware layer where auth lives; understanding Starlette's dispatch chain explains the call_next(request) injection point and reverse-ordering behavior.

Agentic AI Foundation (AAIF), Linux Foundation

AAIF / Linux Foundation

The governance body that moved MCP under neutral open governance; relevant context for understanding why the spec versions are stable standards.

Is Your MCP Architecture Correct?

	Item	Score
	No tool parameter names contain: token, key, secret, auth, or credential
	Auth validation lives in a BaseHTTPMiddleware subclass — not in tools
	User identity injected via ContextVar, not passed as a tool parameter
	Transport is Streamable HTTP for any server with more than one client
	stdio is only used for Claude Desktop or local single-process dev
	Static per-session data (user profile, config) is a Resource, not a Tool
	Any tool called >80% of turns has been evaluated for conversion to Resource


Recall’s server passes all seven. If yours fails any item, that item is worth fixing before the first real user connects.

Issue 3 builds the full five-tool Recall server on top of this scaffold — auth wired in, timeouts on the extraction worker, structured error handling on every tool, and a pagination pattern for search_memories that doesn’t blow up the model’s context window on large result sets. The scaffold above is the foundation. Issue 3 is everything you add to it before production.

Building Effective Tools for AI is a seven-issue series from Sentient Zero Labs. Each issue ships with working code from the Recall memory server — a production MCP tool built in public alongside the series.

Until next issue,

Sentient Zero Labs