← Technical Series
Building Effective Tools for AI Issue 6/7

Tool Observability

Every tool call produces one record that answers three questions — did it succeed, how long did it take, and how much did it cost — and those three questions, asked consistently, are the foundation of everything useful you'll ever know about your tool layer in production.

May 12, 2026 · 22 min read · Sentient Zero Labs
In this issue (6 sections)

The monthly invoice arrived on a Tuesday.

$2,100. In a month where Recall’s user base was still small enough to know most of the active sessions by name.

The number looked wrong immediately. Normal extraction costs for that month should have been somewhere around $40 — give or take a few dollars for indexing overhead. $2,100 was a different order of magnitude. Someone had done something very expensive.

The session log took two minutes to tell the story. One user. One session: sess_abc123. Six hours. 847 calls to the store_memory + extraction pair. The first call had hit a malformed JSON response from the extraction LLM — this happens roughly 2% of the time, a known quirk, a known retry candidate. What wasn’t known was that the retry used the same response from the LLM’s cache. Same input. Same malformed output. Same error. Same retry. The loop had run for six hours before the session expired.

No alert had fired. No log metric had tracked the session’s cost accumulating. No dashboard had shown 847 calls from a single session in a six-hour window. The tool had kept running. The LLM had kept billing.

The $2,100 wasn’t the worst part. The worst part was the arithmetic: a cost-per-session alert set at $10 would have fired after approximately four calls — three minutes into the loop. Total cost if that alert had existed: $0.40. The gap between $0.40 and $2,100 was not a code failure. The extraction worker’s retry behavior, the LLM caching behavior, the malformed JSON — all of that was recoverable. The gap was the absence of a number being watched.

There is a version of this story where the number existed and the alert fired and someone looked at a cost chart three minutes into the loop and stopped it. That version costs $0.40.

This is the version where no one had built the cost chart yet.

💡 The Core Insight
Observability isn’t about catching failures. It’s about catching them at $10, not $2,100.

Mental Model: One Record Per Call

Every tool call produces exactly one record. That record answers three questions: did it succeed, how long did it take, and how much did it cost. These three questions, asked consistently across every call, are the foundation of everything useful you’ll ever know about your tool layer in production.

The ToolCallRecord

The minimum viable observability unit for a production tool layer is a dataclass with twelve fields. Every field earns its place.

┌─────────────────────────────────────────────────────────────────────────┐
│  ToolCallRecord — one row written per tool call                         │
├─────────────────────┬───────────────────────────────────────────────────┤
│  FIELD              │  PURPOSE                                          │
├─────────────────────┼───────────────────────────────────────────────────┤
│  id                 │  UUID — join key across systems and logs          │
│  tool_name          │  "store_memory" — dashboards group by this        │
│  user_id            │  who triggered the call (from auth middleware)    │
│  session_id         │  which conversation session — per-session cost    │
├─────────────────────┼───────────────────────────────────────────────────┤
│  inputs_hash        │  sha256[:16] of inputs — detects duplicate calls  │
│                     │  without storing sensitive content                │
├─────────────────────┼───────────────────────────────────────────────────┤
│  status             │  "success" | "error" | "timeout"                  │
│  error_code         │  "EXTRACTION_FAILED" | null — structured for code │
│  duration_ms        │  wall-clock time — P50/P95/P99 come from this     │
├─────────────────────┼───────────────────────────────────────────────────┤
│  llm_tokens_in      │  prompt tokens consumed (0 for non-LLM tools)    │
│  llm_tokens_out     │  completion tokens generated                      │
│  cost_usd           │  estimated cost — the metric that catches loops   │
│  timestamp          │  UTC — partition key for time-series queries      │
└─────────────────────┴───────────────────────────────────────────────────┘

  Fields for operational questions:  tool_name, status, error_code, duration_ms
  Fields for cost questions:         llm_tokens_in, llm_tokens_out, cost_usd, session_id
  Fields for attribution:            user_id, session_id, timestamp
  Field for duplicate detection:     inputs_hash + session_id + timestamp window

One record per tool call. Seven fields answer operational questions. Four fields answer cost questions.

A few fields deserve particular attention. inputs_hash is a 16-character prefix of a sha256 hash of the canonicalized inputs — specifically, sha256(json.dumps(inputs, sort_keys=True))[:16]. The sort_keys=True ensures key ordering doesn’t create spurious hash differences between two calls with identical inputs. Logging full inputs at 10,000 calls per day is 80GB per day of log data; the hash detects duplicate calls (same hash, same session, close timestamps) without storing the content. For the runaway loop failure, this is Alert Query 4: if inputs_hash appears more than ten times in sixty minutes from the same session, something is stuck.

cost_usd is estimated, not exact. Model providers bill in tiers with rounding and caching discounts — your logged cost will be within 5% of actual charges. That accuracy is more than sufficient for the purpose: not accounting, but anomaly detection. When a session’s cost crosses $10, you need an alert, not a certified invoice.

llm_tokens_in and llm_tokens_out are 0 for non-LLM tools like search_memories, which hits a vector DB without calling any model. This is correct. Non-LLM tools have no LLM cost to track. The zeros are data: when these fields are non-zero, the call went through a model.

The Three Dashboards

The minimum viable monitoring setup for a production tool layer is three dashboards. These three catch 90% of production failures.

  ┌──────────────────────────────────────────────────────────────────────┐
  │  DASHBOARD 1: Volume + Success Rate                                  │
  │                                                                      │
  │  Calls per tool per hour (bar chart)                                 │
  │  ┌─────────────────────────────────────────────────────────────┐    │
  │  │ store_memory    ████████████████████  847  (6h window)       │    │
  │  │ search_memories █████                  92                    │    │
  │  │ inspect_memories ██                    18                    │    │
  │  └─────────────────────────────────────────────────────────────┘    │
  │                                                                      │
  │  Error rate per tool (last 1h)                                       │
  │  store_memory: 98.7% error rate  ◄── THIS IS YOUR ALERT             │
  │  search_memories: 0.4%                                               │
  │                                                                      │
  │  ALERT: error_rate > 5% for any tool                                 │
  └──────────────────────────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────────────────────────┐
  │  DASHBOARD 2: Latency                                                │
  │                                                                      │
  │  P50 / P95 / P99 per tool (last 24h)                                 │
  │                                 P50      P95      P99                │
  │  store_memory                    12ms     38ms    210ms              │
  │  search_memories                182ms    487ms    820ms              │
  │  inspect_memories                 8ms     22ms     41ms              │
  │                                                                      │
  │  Slow calls list: calls > 5s in last 1h                              │
  │  Latency trend: 4-week P95 per tool                                  │
  │                                                                      │
  │  ALERT: P95 > 5s for any tool                                        │
  └──────────────────────────────────────────────────────────────────────┘

  ┌──────────────────────────────────────────────────────────────────────┐
  │  DASHBOARD 3: Cost                                                   │
  │                                                                      │
  │  Cost per tool per day                                               │
  │  store_memory (extraction):  $1.82 / day avg                         │
  │  search_memories:            $0.00 / day (no LLM calls)              │
  │                                                                      │
  │  Top 10 sessions by cost (current day)                               │
  │  sess_abc123:  $2,100.00  ◄── THIS IS YOUR ALERT (should have fired) │
  │                                                                      │
  │  ALERT: session_cost > $10 in rolling 1h window                      │
  │  ALERT: daily_cost > $100 (circuit breaker)                          │
  └──────────────────────────────────────────────────────────────────────┘

Dashboard 1 catches errors. Dashboard 2 catches slowness. Dashboard 3 catches cost loops. Three views, 90% of production failures.

Dashboard 1 answers the most fundamental question: is this tool working? The error rate view tells you whether your tool layer is degrading right now. Dashboard 2 answers: is it working at acceptable speed? Latency P95, not P50 — the median hides outliers; P95 is what your users at the tail are experiencing. Dashboard 3 answers: is it working without runaway cost? The top sessions view is the one that would have showed sess_abc123 and its $2,100 bill accumulating in real time.

LangSmith and OpenTelemetry Compatibility

The ToolCallRecord is structurally compatible with LangSmith’s run model — tool_name maps to run name, duration_ms to latency, status to run status. Teams already using LangSmith can log ToolCallRecord data directly into LangSmith runs and get these three dashboards without a custom Postgres setup. FastMCP v3.0+ also includes built-in OpenTelemetry export: set OTEL_EXPORTER_OTLP_ENDPOINT and tool_name becomes span.name, cost_usd becomes a custom span attribute, and everything routes to whatever OTel backend you already have — Jaeger, Honeycomb, or Grafana Tempo. The pattern doesn’t require a specific platform. The Postgres approach shown in this issue is the zero-dependency starting point for teams without existing observability infrastructure.

Understanding what to measure is the mental model. Here’s the code that produces these records on every call — automatically, without touching tool code.


Implementation

The pattern is a middleware that wraps every tool call. The tool doesn’t know it’s being logged. The middleware runs in a finally block — it logs success, timeout, and errors without the tool author having to remember to log anything. This is the most important architectural choice in the implementation: observability that depends on each tool developer remembering to add logging is observability that will have gaps exactly when you need it most.

Primary Code Block: ToolCallRecord + LoggingMiddleware

Four components ship together: the ToolCallRecord dataclass, the hash_inputs helper that canonicalizes inputs for deduplication, the LoggingMiddleware that creates a record on every call, and the estimate_cost function that translates token counts into dollars. LoggingMiddleware takes a user_id_ctx: ContextVar[str] in its constructor — this is the auth-layer context variable that carries the authenticated user ID into the logging path without the tool code having to pass it explicitly. The middleware only logs requests where the MCP JSON payload contains "method": "tools/call" — health checks and other protocol messages pass through unlogged.

import hashlib
import json
import time
import uuid
from contextvars import ContextVar
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Literal

import aiosqlite
from fastmcp import FastMCP
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

mcp = FastMCP("recall")
session_id_ctx: ContextVar[str] = ContextVar("session_id", default="")
user_id_ctx: ContextVar[str] = ContextVar("user_id", default="")
token_in_ctx: ContextVar[int] = ContextVar("token_in", default=0)
token_out_ctx: ContextVar[int] = ContextVar("token_out", default=0)

_MODEL_RATES: dict[str, tuple[float, float]] = {
    "claude-sonnet": (3.00 / 1_000_000, 15.00 / 1_000_000),
    "claude-haiku": (0.25 / 1_000_000, 1.25 / 1_000_000),
    "claude-opus": (15.00 / 1_000_000, 75.00 / 1_000_000),
}


@dataclass
class ToolCallRecord:
    id: str
    tool_name: str
    user_id: str
    session_id: str
    inputs_hash: str
    status: Literal["success", "error", "timeout"]
    error_code: str | None
    duration_ms: int
    llm_tokens_in: int
    llm_tokens_out: int
    cost_usd: float
    timestamp: datetime


def hash_inputs(inputs: Any) -> str:
    """16-char prefix of sha256 — detects duplicates without storing content."""
    serialized = json.dumps(inputs, sort_keys=True, default=str)
    return hashlib.sha256(serialized.encode()).hexdigest()[:16]


class LoggingMiddleware(BaseHTTPMiddleware):
    """Creates one ToolCallRecord per MCP tool call. Runs in finally — logs success AND errors.

    Only logs requests where method == "tools/call". Health checks and
    other MCP protocol messages are passed through without logging.
    """

    def __init__(self, app: Any, user_id_ctx: ContextVar[str]) -> None:
        super().__init__(app)
        self._user_id_ctx = user_id_ctx

    async def dispatch(self, request: Request, call_next: Any) -> Any:
        body = b""
        tool_name = ""
        inputs_data: Any = {}

        try:
            body = await request.body()
            payload = json.loads(body) if body else {}
            # MCP Streamable HTTP: {"method": "tools/call", "params": {"name": "...", "arguments": {...}}}
            if payload.get("method") == "tools/call":
                tool_name = payload.get("params", {}).get("name", "unknown")
                inputs_data = payload.get("params", {}).get("arguments", {})
        except (json.JSONDecodeError, AttributeError):
            pass

        if not tool_name:
            return await call_next(request)

        start = time.monotonic()
        status: Literal["success", "error", "timeout"] = "success"
        error_code: str | None = None

        try:
            response = await call_next(request)
            return response
        except TimeoutError:
            status = "timeout"
            error_code = "TOOL_TIMEOUT"
            raise
        except Exception as exc:
            status = "error"
            error_code = type(exc).__name__
            raise
        finally:
            duration_ms = int((time.monotonic() - start) * 1000)
            tokens_in = token_in_ctx.get()
            tokens_out = token_out_ctx.get()

            record = ToolCallRecord(
                id=str(uuid.uuid4()),
                tool_name=tool_name,
                user_id=self._user_id_ctx.get(),
                session_id=session_id_ctx.get(),
                inputs_hash=hash_inputs(inputs_data),
                status=status,
                error_code=error_code,
                duration_ms=duration_ms,
                llm_tokens_in=tokens_in,
                llm_tokens_out=tokens_out,
                cost_usd=estimate_cost(tokens_in, tokens_out),
                timestamp=datetime.utcnow(),
            )
            await _insert_record(record)


def estimate_cost(tokens_in: int, tokens_out: int, model: str = "claude-sonnet") -> float:
    """Estimate cost from token counts. Within ~5% of actual billing."""
    rate_in, rate_out = _MODEL_RATES.get(model, _MODEL_RATES["claude-sonnet"])
    return round(tokens_in * rate_in + tokens_out * rate_out, 6)


async def _insert_record(record: ToolCallRecord) -> None:
    async with aiosqlite.connect("recall.db") as db:
        await db.execute(
            """INSERT INTO tool_call_records
               (id, tool_name, user_id, session_id, inputs_hash, status, error_code,
                duration_ms, llm_tokens_in, llm_tokens_out, cost_usd, timestamp)
               VALUES (?,?,?,?,?,?,?,?,?,?,?,?)""",
            (record.id, record.tool_name, record.user_id, record.session_id,
             record.inputs_hash, record.status, record.error_code, record.duration_ms,
             record.llm_tokens_in, record.llm_tokens_out, record.cost_usd,
             record.timestamp.isoformat()),
        )
        await db.commit()

Three non-obvious decisions in this implementation are worth walking through carefully.

Why finally, not except. Logging in a finally block means the record is created on success AND failure. An except block only creates records when something goes wrong — meaning you have no baseline data of successful calls to compare against. “Error rate” is mathematically meaningless without a denominator of successful calls. If your logging only fires on exceptions, you have a record of failures but no denominator. The middleware above will write a status = "success" record for every clean call, and status = "error" for every exception. Both counts are required to compute the number that triggers Dashboard 1’s alert.

Why inputs_hash not full inputs. The full conversation transcript passed to store_memory averages 8KB. At 10,000 calls per day, logging full inputs to Postgres means 80GB per day of log data. The inputs_hash field is a 16-character prefix of sha256 over json.dumps(inputs, sort_keys=True) — the sort_keys canonicalization ensures key ordering never creates false non-matches between identical calls. Computationally cheap, sufficient for duplicate detection (same hash, same session, close timestamps) without storing any of the content. If you need full inputs for a debugging session, those belong in an append-only blob store with a debug flag that gates writes per session.

The cost formula and what it doesn’t tell you. cost_usd = tokens_in * rate_in + tokens_out * rate_out. Current rates: claude-sonnet at $3.00/$15.00 per million tokens in/out; claude-haiku at $0.25/$1.25; claude-opus at $15.00/$75.00. The estimate is within 5% of actual billing — model providers round and apply caching discounts at invoice time, not per call. For the runaway loop use case, 5% accuracy is more than sufficient: what matters is that cost_usd per session crossed $10 and triggered an alert. An approximate metric that alerts is worth more than an exact metric that doesn’t exist yet.

Secondary Code Block: Postgres Schema and Alert Queries

The schema is designed for time-series queries: partitioned by day on timestamp, BRIN index on the partition column for cheap range scans, B-tree on session_id for per-session aggregation.

-- tool_call_records: one row per tool call
CREATE TABLE tool_call_records (
    id              UUID        PRIMARY KEY,
    tool_name       TEXT        NOT NULL,
    user_id         TEXT        NOT NULL,
    session_id      TEXT        NOT NULL,
    inputs_hash     TEXT        NOT NULL,
    status          TEXT        NOT NULL CHECK (status IN ('success', 'error', 'timeout')),
    error_code      TEXT,
    duration_ms     INTEGER     NOT NULL,
    llm_tokens_in   INTEGER     NOT NULL DEFAULT 0,
    llm_tokens_out  INTEGER     NOT NULL DEFAULT 0,
    cost_usd        NUMERIC(10, 6) NOT NULL DEFAULT 0,
    timestamp       TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (timestamp);

-- Daily partitions — older data can be dropped or archived without table locks
CREATE TABLE tool_call_records_2026_05
    PARTITION OF tool_call_records
    FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');

-- BRIN index on timestamp for partition pruning (cheap to maintain, fast for range scans)
CREATE INDEX idx_tcr_timestamp ON tool_call_records USING BRIN (timestamp);
-- B-tree for per-session aggregation
CREATE INDEX idx_tcr_session ON tool_call_records (session_id, timestamp DESC);
-- B-tree for per-tool monitoring
CREATE INDEX idx_tcr_tool ON tool_call_records (tool_name, timestamp DESC);


-- ── Alert Query 1: Error rate per tool in the last hour ───────────────────
-- ALERT when > 5% for any tool
SELECT
    tool_name,
    COUNT(*)                                            AS total_calls,
    ROUND(100.0 * SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) / COUNT(*), 1) AS error_rate_pct
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY tool_name
HAVING ROUND(100.0 * SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) / COUNT(*), 1) > 5;


-- ── Alert Query 2: P95 latency per tool in the last hour ─────────────────
-- ALERT when > 5000ms for any tool
SELECT
    tool_name,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY tool_name
HAVING PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) > 5000;


-- ── Alert Query 3: Session cost in the last hour ─────────────────────────
-- ALERT when any session exceeds $10 in the last hour
SELECT
    session_id,
    user_id,
    ROUND(SUM(cost_usd)::NUMERIC, 4) AS session_cost_usd,
    COUNT(*)                          AS call_count
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY session_id, user_id
HAVING SUM(cost_usd) > 10.00
ORDER BY session_cost_usd DESC;


-- ── Alert Query 4: Duplicate call detection (runaway loop signal) ─────────
-- ALERT when same inputs_hash appears > 10 times in 60 minutes from same session
SELECT
    session_id,
    tool_name,
    inputs_hash,
    COUNT(*) AS call_count,
    MIN(timestamp) AS first_call,
    MAX(timestamp) AS last_call
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY session_id, tool_name, inputs_hash
HAVING COUNT(*) > 10
ORDER BY call_count DESC;

The daily partition design has a practical consequence worth stating: a table running for a year at 10,000 calls per day has approximately 3.6 million rows — manageable with standard Postgres and the indexes above, no specialized time-series database required at this scale. When scale requires it, dropping a daily partition is a metadata operation with no table lock. The pg_partman extension can automate partition creation and expiry.

On partitioning vs. OpenTelemetry: for teams that already have OTel infrastructure, ToolCallRecord maps cleanly to OTel spans. tool_name becomes span.name, duration_ms becomes span.duration, cost_usd is a custom attribute. The dataclass approach shown here is the starting point for teams without an existing OTel collector. Both approaches produce the same four alert signals. FastMCP v3.0+ ships built-in OTel export — OTEL_EXPORTER_OTLP_ENDPOINT in your environment, and your spans flow to whatever backend you already have.

Before the Recall incident, Recall’s observability consisted of print statements. After, it consisted of these patterns. The gap in between is measured in dollars, not in difficulty. The following failures illustrate what these patterns prevent.


Failure Modes

Three failures — two you’ve probably seen, one you will see.

  Time →     0h          1h          2h          3h          4h          5h    6h
             │           │           │           │           │           │      │
  Calls  ___/▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
         1   142         142         142         142         142         139   847 total
             per hour    per hour    per hour    per hour    per hour

  Cost   $0  $0.35       $0.70       $1.05       $1.40       $1.75     $2,100

                                                              Invoice arrives 3 days later.
                                                              Pattern visible immediately.

  When a $10/session alert would have fired:

             ▼ (4 calls in, ~3 minutes)
  Caught at: $0.40

  ──────────────────────────────────────────────────────────────────────────────
  The failure wasn't that the loop ran. The failure was that there was
  no metric watching for it. The data was there. The alert was not.
  ──────────────────────────────────────────────────────────────────────────────

847 calls in 6 hours. The loop was visible in the cost chart immediately — but there was no cost chart.

THE RUNAWAY LOOP

What happens:  Extraction worker retries on LLM malformed-JSON error.
               LLM caches the response — same input, same malformed
               output, same retry. Loop runs for 6 hours. 847 calls.
               $2,100 billed. No alert fires. Discovery: the monthly
               invoice. The session log showed the pattern in 2 minutes
               once someone looked.

Root cause:    No cost-per-session alert. No duplicate-call detection.
               No retry max count. Three missing controls, all of which
               would have independently caught this.

How to detect: Alert Query 4 (duplicate inputs_hash) fires after 10
               calls — approximately 4 minutes in. Alert Query 3 (session
               cost) fires after ~$10 — approximately 3 minutes in.
               Either alert would have capped the incident at $0.40.

Fix:           Max retry count of 3 (not configurable — hardcoded
               in the worker). Session cost alert at $10.
               Duplicate-call alert on >10 same inputs_hash per hour.
THE SILENT DEGRADATION

What happens:  search_memories P95 latency drifts from 480ms to 8 seconds
               over 6 weeks. DB index fragmentation and connection pool
               exhaustion, both gradual. No one notices because the point-
               in-time monitor shows "current P95 = 8.1s, threshold = 5s"
               only after degradation is complete and users are complaining.

Root cause:    Latency monitored as point-in-time value, not as a trend.
               The signal was present in the data for weeks — it just
               wasn't being compared to a baseline.

How to detect: Latency trend alert: 20% increase in P95 week-over-week
               for any tool. This fires 4 weeks before point-in-time
               threshold breach. Run as a weekly Cron on the logs table.

Fix:           Add week-over-week trend query to monitoring. P95 at week
               N vs. P95 at week N-1, per tool. Alert at 20% increase.
               Weekly ops review includes latency trend charts, not just
               current values.
HIGH-CARDINALITY LOG EXPLOSION

What happens:  store_memory logs full conversation transcripts (avg 8KB)
               per call to the tool_call_records table. At 10,000 calls
               per day: 80GB/day. Postgres disk fills in 3 days.
               Emergency disk expansion. Production outage during resize.

Root cause:    Logging inputs without evaluating cardinality or storage
               cost. "Log everything" sounds safe until storage becomes
               the outage.

How to detect: pg_relation_size('tool_call_records') growth rate vs.
               call volume. If bytes-per-row is growing, input size
               is likely the cause. Check average row size with:
               SELECT pg_relation_size(...) / reltuples FROM pg_class
               WHERE relname = 'tool_call_records'.

Fix:           Log inputs_hash (16 chars) not full inputs.
               Log structured metadata: message_count, transcript_chars.
               Full inputs to append-only blob store (S3/GCS) only when
               a debug flag is set for a specific session.

The observability literature in distributed systems — the Dapper paper (Sigelman et al., 2010), Twitter’s Zipkin, the principles that became OpenTelemetry — established that tracing and metrics are prerequisites for operating systems at scale, not optional additions. What’s new for LLM-backed tools is the cost dimension: a trace in a traditional microservice has negligible runtime cost. A tool call that invokes a model has a real, variable, per-call cost. The cost_usd field in ToolCallRecord exists because of this difference. “You can’t improve what you can’t measure” has always been true. For LLM tools, “you can’t control what you can’t measure” is equally important — and the Recall incident is the proof. The data that would have stopped the loop at $0.40 was always being generated. It just wasn’t being captured.

Two decisions these failures force you to make — what to alert on immediately, and what to track as a trend.


Decision Guide

Two questions every observability setup forces: what to alert on immediately, and what to track as a trend.

Alert immediately on:                   Monitor as trend:
────────────────────────────────────    ──────────────────────────────────────
Error rate > 5% (any tool, 1h)          Latency increase > 20% week-over-week
P95 latency > 5s (any tool, 1h)         Daily cost increase > 30% week-over-week
Session cost > $10 (rolling 1h)         Success rate drift downward
Same inputs_hash > 10x/hour             Tool call volume anomaly
Log per tool call:                      Don't log:
────────────────────────────────────    ──────────────────────────────────────
tool_name, user_id, session_id          Full input content (cardinality)
status, error_code                      Full response content (cardinality)
duration_ms                             PII in inputs or outputs
inputs_hash (16-char prefix only)       Sub-10ms internal operations
estimated cost_usd                      Tool internals / debug traces

The first table is threshold vs. drift. Immediate alerts fire on point-in-time threshold violations — states that are wrong right now. Trend monitoring fires on drift — states that are getting worse over time but haven’t crossed a threshold yet. Silent degradation is a trend problem, not a threshold problem. Adding trend monitoring after your first silent degradation incident is the normal sequence; doing it before is better.

The second table is what to capture vs. what to avoid. The cardinality concern in the “don’t log” column isn’t theoretical — the high-cardinality log explosion failure mode in the previous section is the practical consequence of ignoring it. Full input content, full response content, and PII don’t belong in the structured log table. They belong in a blob store, accessed by session ID, only when a specific debugging session requires them.

Start with four alerts on day one. Add more only after a production incident surfaces a new signal. Don’t pre-build alerts for failure modes you haven’t seen — false positives burn alert fatigue faster than real failures. The rule of thumb: every alert in your system should have a response playbook. If you don’t know what to do when the alert fires, the alert isn’t ready to be an alert.


Resources

Sigelman et al. — Google Research, 2010
Google's foundational tracing paper; the principle that tracing and metrics are prerequisites for operating systems at scale is the intellectual basis for the ToolCallRecord pattern.
CNCF / opentelemetry.io
The industry standard for traces, metrics, and logs; FastMCP v3.0+ exports ToolCallRecord fields as OTel spans, making the Postgres schema an on-ramp to any OTel-compatible backend.
LangChain
The managed observability platform whose run model is structurally compatible with ToolCallRecord; teams already using LangSmith can skip the custom Postgres setup.
PostgreSQL Documentation
The foundational reference for the daily partition design in tool_call_records; explains BRIN index behavior and why partition drops are metadata-only operations.
Sentient Zero Labs — GitHub
The LoggingMiddleware, ToolCallRecord dataclass, Postgres schema, and all four alert queries are the actual production code running on the Recall server.

Is Your Tool Layer Observable?

Recall’s LoggingMiddleware is 40 lines of Python. It adds approximately 2ms to every tool call — the cost of a finally block and an async SQLite write. It would have caught the $2,100 incident after $0.40 and a three-minute window.

Item Score
Every tool call creates one ToolCallRecord
Records include: tool name, user, session, status, duration, estimated cost
Cost is estimated per tool call (token count × rate)
Error rate alert configured — threshold: error_rate > 5% in 1h window
P95 latency alert configured — threshold: P95 > 5s in 1h window
Session cost alert configured — threshold: cumulative cost > $10 in 1h window
Duplicate-call detection on same inputs_hash + session within 60 minutes
Dashboard shows all four views: volume, success rate, latency, cost
0 of 8

If any of these is unchecked, that’s the next thing to fix — not the ninth item, the unchecked one. Each item on this list corresponds to a specific failure mode. Missing item three is the runaway loop uncaught. Missing item seven is the silent degradation unnoticed. Missing item two is the entire cost axis dark.

The full ToolCallRecord dataclass, LoggingMiddleware, Postgres schema, and all four alert queries are in github.com/Sentient-Zero-Labs/szl-recall. The implementation is production code, not tutorial code — it’s the same middleware running on the actual Recall server.

Six issues have covered tool design from first principles to production observability: what makes a good tool, how MCP works, building your first production server, tool granularity decisions, A2A coordination patterns, and now the instrumentation layer that tells you what’s actually happening once everything ships.

Issue 7 closes the series with the ecosystem in 2026: where MCP adoption stands, the security threats that are now real attack vectors rather than theoretical risks, and the five open problems that remain genuinely unsolved — the questions the community is still working through. It’s the view from where we are and a clear-eyed look at what’s still hard.


Until next issue,

Sentient Zero Labs

Building Effective Tools for AI is a seven-issue series from Sentient Zero Labs. Each issue ships with working code from the Recall memory server — a production MCP tool built in public alongside the series.