Tool Observability
Every tool call produces one record that answers three questions — did it succeed, how long did it take, and how much did it cost — and those three questions, asked consistently, are the foundation of everything useful you'll ever know about your tool layer in production.
In this issue (6 sections)
The monthly invoice arrived on a Tuesday.
$2,100. In a month where Recall’s user base was still small enough to know most of the active sessions by name.
The number looked wrong immediately. Normal extraction costs for that month should have been somewhere around $40 — give or take a few dollars for indexing overhead. $2,100 was a different order of magnitude. Someone had done something very expensive.
The session log took two minutes to tell the story. One user. One session: sess_abc123. Six hours. 847 calls to the store_memory + extraction pair. The first call had hit a malformed JSON response from the extraction LLM — this happens roughly 2% of the time, a known quirk, a known retry candidate. What wasn’t known was that the retry used the same response from the LLM’s cache. Same input. Same malformed output. Same error. Same retry. The loop had run for six hours before the session expired.
No alert had fired. No log metric had tracked the session’s cost accumulating. No dashboard had shown 847 calls from a single session in a six-hour window. The tool had kept running. The LLM had kept billing.
The $2,100 wasn’t the worst part. The worst part was the arithmetic: a cost-per-session alert set at $10 would have fired after approximately four calls — three minutes into the loop. Total cost if that alert had existed: $0.40. The gap between $0.40 and $2,100 was not a code failure. The extraction worker’s retry behavior, the LLM caching behavior, the malformed JSON — all of that was recoverable. The gap was the absence of a number being watched.
There is a version of this story where the number existed and the alert fired and someone looked at a cost chart three minutes into the loop and stopped it. That version costs $0.40.
This is the version where no one had built the cost chart yet.
Mental Model: One Record Per Call
Every tool call produces exactly one record. That record answers three questions: did it succeed, how long did it take, and how much did it cost. These three questions, asked consistently across every call, are the foundation of everything useful you’ll ever know about your tool layer in production.
The ToolCallRecord
The minimum viable observability unit for a production tool layer is a dataclass with twelve fields. Every field earns its place.
┌─────────────────────────────────────────────────────────────────────────┐
│ ToolCallRecord — one row written per tool call │
├─────────────────────┬───────────────────────────────────────────────────┤
│ FIELD │ PURPOSE │
├─────────────────────┼───────────────────────────────────────────────────┤
│ id │ UUID — join key across systems and logs │
│ tool_name │ "store_memory" — dashboards group by this │
│ user_id │ who triggered the call (from auth middleware) │
│ session_id │ which conversation session — per-session cost │
├─────────────────────┼───────────────────────────────────────────────────┤
│ inputs_hash │ sha256[:16] of inputs — detects duplicate calls │
│ │ without storing sensitive content │
├─────────────────────┼───────────────────────────────────────────────────┤
│ status │ "success" | "error" | "timeout" │
│ error_code │ "EXTRACTION_FAILED" | null — structured for code │
│ duration_ms │ wall-clock time — P50/P95/P99 come from this │
├─────────────────────┼───────────────────────────────────────────────────┤
│ llm_tokens_in │ prompt tokens consumed (0 for non-LLM tools) │
│ llm_tokens_out │ completion tokens generated │
│ cost_usd │ estimated cost — the metric that catches loops │
│ timestamp │ UTC — partition key for time-series queries │
└─────────────────────┴───────────────────────────────────────────────────┘
Fields for operational questions: tool_name, status, error_code, duration_ms
Fields for cost questions: llm_tokens_in, llm_tokens_out, cost_usd, session_id
Fields for attribution: user_id, session_id, timestamp
Field for duplicate detection: inputs_hash + session_id + timestamp window
One record per tool call. Seven fields answer operational questions. Four fields answer cost questions.
A few fields deserve particular attention. inputs_hash is a 16-character prefix of a sha256 hash of the canonicalized inputs — specifically, sha256(json.dumps(inputs, sort_keys=True))[:16]. The sort_keys=True ensures key ordering doesn’t create spurious hash differences between two calls with identical inputs. Logging full inputs at 10,000 calls per day is 80GB per day of log data; the hash detects duplicate calls (same hash, same session, close timestamps) without storing the content. For the runaway loop failure, this is Alert Query 4: if inputs_hash appears more than ten times in sixty minutes from the same session, something is stuck.
cost_usd is estimated, not exact. Model providers bill in tiers with rounding and caching discounts — your logged cost will be within 5% of actual charges. That accuracy is more than sufficient for the purpose: not accounting, but anomaly detection. When a session’s cost crosses $10, you need an alert, not a certified invoice.
llm_tokens_in and llm_tokens_out are 0 for non-LLM tools like search_memories, which hits a vector DB without calling any model. This is correct. Non-LLM tools have no LLM cost to track. The zeros are data: when these fields are non-zero, the call went through a model.
The Three Dashboards
The minimum viable monitoring setup for a production tool layer is three dashboards. These three catch 90% of production failures.
┌──────────────────────────────────────────────────────────────────────┐
│ DASHBOARD 1: Volume + Success Rate │
│ │
│ Calls per tool per hour (bar chart) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ store_memory ████████████████████ 847 (6h window) │ │
│ │ search_memories █████ 92 │ │
│ │ inspect_memories ██ 18 │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Error rate per tool (last 1h) │
│ store_memory: 98.7% error rate ◄── THIS IS YOUR ALERT │
│ search_memories: 0.4% │
│ │
│ ALERT: error_rate > 5% for any tool │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ DASHBOARD 2: Latency │
│ │
│ P50 / P95 / P99 per tool (last 24h) │
│ P50 P95 P99 │
│ store_memory 12ms 38ms 210ms │
│ search_memories 182ms 487ms 820ms │
│ inspect_memories 8ms 22ms 41ms │
│ │
│ Slow calls list: calls > 5s in last 1h │
│ Latency trend: 4-week P95 per tool │
│ │
│ ALERT: P95 > 5s for any tool │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│ DASHBOARD 3: Cost │
│ │
│ Cost per tool per day │
│ store_memory (extraction): $1.82 / day avg │
│ search_memories: $0.00 / day (no LLM calls) │
│ │
│ Top 10 sessions by cost (current day) │
│ sess_abc123: $2,100.00 ◄── THIS IS YOUR ALERT (should have fired) │
│ │
│ ALERT: session_cost > $10 in rolling 1h window │
│ ALERT: daily_cost > $100 (circuit breaker) │
└──────────────────────────────────────────────────────────────────────┘
Dashboard 1 catches errors. Dashboard 2 catches slowness. Dashboard 3 catches cost loops. Three views, 90% of production failures.
Dashboard 1 answers the most fundamental question: is this tool working? The error rate view tells you whether your tool layer is degrading right now. Dashboard 2 answers: is it working at acceptable speed? Latency P95, not P50 — the median hides outliers; P95 is what your users at the tail are experiencing. Dashboard 3 answers: is it working without runaway cost? The top sessions view is the one that would have showed sess_abc123 and its $2,100 bill accumulating in real time.
LangSmith and OpenTelemetry Compatibility
The ToolCallRecord is structurally compatible with LangSmith’s run model — tool_name maps to run name, duration_ms to latency, status to run status. Teams already using LangSmith can log ToolCallRecord data directly into LangSmith runs and get these three dashboards without a custom Postgres setup. FastMCP v3.0+ also includes built-in OpenTelemetry export: set OTEL_EXPORTER_OTLP_ENDPOINT and tool_name becomes span.name, cost_usd becomes a custom span attribute, and everything routes to whatever OTel backend you already have — Jaeger, Honeycomb, or Grafana Tempo. The pattern doesn’t require a specific platform. The Postgres approach shown in this issue is the zero-dependency starting point for teams without existing observability infrastructure.
Understanding what to measure is the mental model. Here’s the code that produces these records on every call — automatically, without touching tool code.
Implementation
The pattern is a middleware that wraps every tool call. The tool doesn’t know it’s being logged. The middleware runs in a finally block — it logs success, timeout, and errors without the tool author having to remember to log anything. This is the most important architectural choice in the implementation: observability that depends on each tool developer remembering to add logging is observability that will have gaps exactly when you need it most.
Primary Code Block: ToolCallRecord + LoggingMiddleware
Four components ship together: the ToolCallRecord dataclass, the hash_inputs helper that canonicalizes inputs for deduplication, the LoggingMiddleware that creates a record on every call, and the estimate_cost function that translates token counts into dollars. LoggingMiddleware takes a user_id_ctx: ContextVar[str] in its constructor — this is the auth-layer context variable that carries the authenticated user ID into the logging path without the tool code having to pass it explicitly. The middleware only logs requests where the MCP JSON payload contains "method": "tools/call" — health checks and other protocol messages pass through unlogged.
import hashlib
import json
import time
import uuid
from contextvars import ContextVar
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Literal
import aiosqlite
from fastmcp import FastMCP
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
mcp = FastMCP("recall")
session_id_ctx: ContextVar[str] = ContextVar("session_id", default="")
user_id_ctx: ContextVar[str] = ContextVar("user_id", default="")
token_in_ctx: ContextVar[int] = ContextVar("token_in", default=0)
token_out_ctx: ContextVar[int] = ContextVar("token_out", default=0)
_MODEL_RATES: dict[str, tuple[float, float]] = {
"claude-sonnet": (3.00 / 1_000_000, 15.00 / 1_000_000),
"claude-haiku": (0.25 / 1_000_000, 1.25 / 1_000_000),
"claude-opus": (15.00 / 1_000_000, 75.00 / 1_000_000),
}
@dataclass
class ToolCallRecord:
id: str
tool_name: str
user_id: str
session_id: str
inputs_hash: str
status: Literal["success", "error", "timeout"]
error_code: str | None
duration_ms: int
llm_tokens_in: int
llm_tokens_out: int
cost_usd: float
timestamp: datetime
def hash_inputs(inputs: Any) -> str:
"""16-char prefix of sha256 — detects duplicates without storing content."""
serialized = json.dumps(inputs, sort_keys=True, default=str)
return hashlib.sha256(serialized.encode()).hexdigest()[:16]
class LoggingMiddleware(BaseHTTPMiddleware):
"""Creates one ToolCallRecord per MCP tool call. Runs in finally — logs success AND errors.
Only logs requests where method == "tools/call". Health checks and
other MCP protocol messages are passed through without logging.
"""
def __init__(self, app: Any, user_id_ctx: ContextVar[str]) -> None:
super().__init__(app)
self._user_id_ctx = user_id_ctx
async def dispatch(self, request: Request, call_next: Any) -> Any:
body = b""
tool_name = ""
inputs_data: Any = {}
try:
body = await request.body()
payload = json.loads(body) if body else {}
# MCP Streamable HTTP: {"method": "tools/call", "params": {"name": "...", "arguments": {...}}}
if payload.get("method") == "tools/call":
tool_name = payload.get("params", {}).get("name", "unknown")
inputs_data = payload.get("params", {}).get("arguments", {})
except (json.JSONDecodeError, AttributeError):
pass
if not tool_name:
return await call_next(request)
start = time.monotonic()
status: Literal["success", "error", "timeout"] = "success"
error_code: str | None = None
try:
response = await call_next(request)
return response
except TimeoutError:
status = "timeout"
error_code = "TOOL_TIMEOUT"
raise
except Exception as exc:
status = "error"
error_code = type(exc).__name__
raise
finally:
duration_ms = int((time.monotonic() - start) * 1000)
tokens_in = token_in_ctx.get()
tokens_out = token_out_ctx.get()
record = ToolCallRecord(
id=str(uuid.uuid4()),
tool_name=tool_name,
user_id=self._user_id_ctx.get(),
session_id=session_id_ctx.get(),
inputs_hash=hash_inputs(inputs_data),
status=status,
error_code=error_code,
duration_ms=duration_ms,
llm_tokens_in=tokens_in,
llm_tokens_out=tokens_out,
cost_usd=estimate_cost(tokens_in, tokens_out),
timestamp=datetime.utcnow(),
)
await _insert_record(record)
def estimate_cost(tokens_in: int, tokens_out: int, model: str = "claude-sonnet") -> float:
"""Estimate cost from token counts. Within ~5% of actual billing."""
rate_in, rate_out = _MODEL_RATES.get(model, _MODEL_RATES["claude-sonnet"])
return round(tokens_in * rate_in + tokens_out * rate_out, 6)
async def _insert_record(record: ToolCallRecord) -> None:
async with aiosqlite.connect("recall.db") as db:
await db.execute(
"""INSERT INTO tool_call_records
(id, tool_name, user_id, session_id, inputs_hash, status, error_code,
duration_ms, llm_tokens_in, llm_tokens_out, cost_usd, timestamp)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)""",
(record.id, record.tool_name, record.user_id, record.session_id,
record.inputs_hash, record.status, record.error_code, record.duration_ms,
record.llm_tokens_in, record.llm_tokens_out, record.cost_usd,
record.timestamp.isoformat()),
)
await db.commit()
Three non-obvious decisions in this implementation are worth walking through carefully.
Why finally, not except. Logging in a finally block means the record is created on success AND failure. An except block only creates records when something goes wrong — meaning you have no baseline data of successful calls to compare against. “Error rate” is mathematically meaningless without a denominator of successful calls. If your logging only fires on exceptions, you have a record of failures but no denominator. The middleware above will write a status = "success" record for every clean call, and status = "error" for every exception. Both counts are required to compute the number that triggers Dashboard 1’s alert.
Why inputs_hash not full inputs. The full conversation transcript passed to store_memory averages 8KB. At 10,000 calls per day, logging full inputs to Postgres means 80GB per day of log data. The inputs_hash field is a 16-character prefix of sha256 over json.dumps(inputs, sort_keys=True) — the sort_keys canonicalization ensures key ordering never creates false non-matches between identical calls. Computationally cheap, sufficient for duplicate detection (same hash, same session, close timestamps) without storing any of the content. If you need full inputs for a debugging session, those belong in an append-only blob store with a debug flag that gates writes per session.
The cost formula and what it doesn’t tell you. cost_usd = tokens_in * rate_in + tokens_out * rate_out. Current rates: claude-sonnet at $3.00/$15.00 per million tokens in/out; claude-haiku at $0.25/$1.25; claude-opus at $15.00/$75.00. The estimate is within 5% of actual billing — model providers round and apply caching discounts at invoice time, not per call. For the runaway loop use case, 5% accuracy is more than sufficient: what matters is that cost_usd per session crossed $10 and triggered an alert. An approximate metric that alerts is worth more than an exact metric that doesn’t exist yet.
Secondary Code Block: Postgres Schema and Alert Queries
The schema is designed for time-series queries: partitioned by day on timestamp, BRIN index on the partition column for cheap range scans, B-tree on session_id for per-session aggregation.
-- tool_call_records: one row per tool call
CREATE TABLE tool_call_records (
id UUID PRIMARY KEY,
tool_name TEXT NOT NULL,
user_id TEXT NOT NULL,
session_id TEXT NOT NULL,
inputs_hash TEXT NOT NULL,
status TEXT NOT NULL CHECK (status IN ('success', 'error', 'timeout')),
error_code TEXT,
duration_ms INTEGER NOT NULL,
llm_tokens_in INTEGER NOT NULL DEFAULT 0,
llm_tokens_out INTEGER NOT NULL DEFAULT 0,
cost_usd NUMERIC(10, 6) NOT NULL DEFAULT 0,
timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (timestamp);
-- Daily partitions — older data can be dropped or archived without table locks
CREATE TABLE tool_call_records_2026_05
PARTITION OF tool_call_records
FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');
-- BRIN index on timestamp for partition pruning (cheap to maintain, fast for range scans)
CREATE INDEX idx_tcr_timestamp ON tool_call_records USING BRIN (timestamp);
-- B-tree for per-session aggregation
CREATE INDEX idx_tcr_session ON tool_call_records (session_id, timestamp DESC);
-- B-tree for per-tool monitoring
CREATE INDEX idx_tcr_tool ON tool_call_records (tool_name, timestamp DESC);
-- ── Alert Query 1: Error rate per tool in the last hour ───────────────────
-- ALERT when > 5% for any tool
SELECT
tool_name,
COUNT(*) AS total_calls,
ROUND(100.0 * SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) / COUNT(*), 1) AS error_rate_pct
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY tool_name
HAVING ROUND(100.0 * SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) / COUNT(*), 1) > 5;
-- ── Alert Query 2: P95 latency per tool in the last hour ─────────────────
-- ALERT when > 5000ms for any tool
SELECT
tool_name,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY tool_name
HAVING PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration_ms) > 5000;
-- ── Alert Query 3: Session cost in the last hour ─────────────────────────
-- ALERT when any session exceeds $10 in the last hour
SELECT
session_id,
user_id,
ROUND(SUM(cost_usd)::NUMERIC, 4) AS session_cost_usd,
COUNT(*) AS call_count
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY session_id, user_id
HAVING SUM(cost_usd) > 10.00
ORDER BY session_cost_usd DESC;
-- ── Alert Query 4: Duplicate call detection (runaway loop signal) ─────────
-- ALERT when same inputs_hash appears > 10 times in 60 minutes from same session
SELECT
session_id,
tool_name,
inputs_hash,
COUNT(*) AS call_count,
MIN(timestamp) AS first_call,
MAX(timestamp) AS last_call
FROM tool_call_records
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY session_id, tool_name, inputs_hash
HAVING COUNT(*) > 10
ORDER BY call_count DESC;
The daily partition design has a practical consequence worth stating: a table running for a year at 10,000 calls per day has approximately 3.6 million rows — manageable with standard Postgres and the indexes above, no specialized time-series database required at this scale. When scale requires it, dropping a daily partition is a metadata operation with no table lock. The pg_partman extension can automate partition creation and expiry.
On partitioning vs. OpenTelemetry: for teams that already have OTel infrastructure, ToolCallRecord maps cleanly to OTel spans. tool_name becomes span.name, duration_ms becomes span.duration, cost_usd is a custom attribute. The dataclass approach shown here is the starting point for teams without an existing OTel collector. Both approaches produce the same four alert signals. FastMCP v3.0+ ships built-in OTel export — OTEL_EXPORTER_OTLP_ENDPOINT in your environment, and your spans flow to whatever backend you already have.
Before the Recall incident, Recall’s observability consisted of print statements. After, it consisted of these patterns. The gap in between is measured in dollars, not in difficulty. The following failures illustrate what these patterns prevent.
Failure Modes
Three failures — two you’ve probably seen, one you will see.
Time → 0h 1h 2h 3h 4h 5h 6h
│ │ │ │ │ │ │
Calls ___/▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
1 142 142 142 142 142 139 847 total
per hour per hour per hour per hour per hour
Cost $0 $0.35 $0.70 $1.05 $1.40 $1.75 $2,100
↑
Invoice arrives 3 days later.
Pattern visible immediately.
When a $10/session alert would have fired:
│
▼ (4 calls in, ~3 minutes)
Caught at: $0.40
──────────────────────────────────────────────────────────────────────────────
The failure wasn't that the loop ran. The failure was that there was
no metric watching for it. The data was there. The alert was not.
──────────────────────────────────────────────────────────────────────────────
847 calls in 6 hours. The loop was visible in the cost chart immediately — but there was no cost chart.
THE RUNAWAY LOOP
What happens: Extraction worker retries on LLM malformed-JSON error.
LLM caches the response — same input, same malformed
output, same retry. Loop runs for 6 hours. 847 calls.
$2,100 billed. No alert fires. Discovery: the monthly
invoice. The session log showed the pattern in 2 minutes
once someone looked.
Root cause: No cost-per-session alert. No duplicate-call detection.
No retry max count. Three missing controls, all of which
would have independently caught this.
How to detect: Alert Query 4 (duplicate inputs_hash) fires after 10
calls — approximately 4 minutes in. Alert Query 3 (session
cost) fires after ~$10 — approximately 3 minutes in.
Either alert would have capped the incident at $0.40.
Fix: Max retry count of 3 (not configurable — hardcoded
in the worker). Session cost alert at $10.
Duplicate-call alert on >10 same inputs_hash per hour.
THE SILENT DEGRADATION
What happens: search_memories P95 latency drifts from 480ms to 8 seconds
over 6 weeks. DB index fragmentation and connection pool
exhaustion, both gradual. No one notices because the point-
in-time monitor shows "current P95 = 8.1s, threshold = 5s"
only after degradation is complete and users are complaining.
Root cause: Latency monitored as point-in-time value, not as a trend.
The signal was present in the data for weeks — it just
wasn't being compared to a baseline.
How to detect: Latency trend alert: 20% increase in P95 week-over-week
for any tool. This fires 4 weeks before point-in-time
threshold breach. Run as a weekly Cron on the logs table.
Fix: Add week-over-week trend query to monitoring. P95 at week
N vs. P95 at week N-1, per tool. Alert at 20% increase.
Weekly ops review includes latency trend charts, not just
current values.
HIGH-CARDINALITY LOG EXPLOSION
What happens: store_memory logs full conversation transcripts (avg 8KB)
per call to the tool_call_records table. At 10,000 calls
per day: 80GB/day. Postgres disk fills in 3 days.
Emergency disk expansion. Production outage during resize.
Root cause: Logging inputs without evaluating cardinality or storage
cost. "Log everything" sounds safe until storage becomes
the outage.
How to detect: pg_relation_size('tool_call_records') growth rate vs.
call volume. If bytes-per-row is growing, input size
is likely the cause. Check average row size with:
SELECT pg_relation_size(...) / reltuples FROM pg_class
WHERE relname = 'tool_call_records'.
Fix: Log inputs_hash (16 chars) not full inputs.
Log structured metadata: message_count, transcript_chars.
Full inputs to append-only blob store (S3/GCS) only when
a debug flag is set for a specific session.
The observability literature in distributed systems — the Dapper paper (Sigelman et al., 2010), Twitter’s Zipkin, the principles that became OpenTelemetry — established that tracing and metrics are prerequisites for operating systems at scale, not optional additions. What’s new for LLM-backed tools is the cost dimension: a trace in a traditional microservice has negligible runtime cost. A tool call that invokes a model has a real, variable, per-call cost. The cost_usd field in ToolCallRecord exists because of this difference. “You can’t improve what you can’t measure” has always been true. For LLM tools, “you can’t control what you can’t measure” is equally important — and the Recall incident is the proof. The data that would have stopped the loop at $0.40 was always being generated. It just wasn’t being captured.
Two decisions these failures force you to make — what to alert on immediately, and what to track as a trend.
Decision Guide
Two questions every observability setup forces: what to alert on immediately, and what to track as a trend.
Alert immediately on: Monitor as trend:
──────────────────────────────────── ──────────────────────────────────────
Error rate > 5% (any tool, 1h) Latency increase > 20% week-over-week
P95 latency > 5s (any tool, 1h) Daily cost increase > 30% week-over-week
Session cost > $10 (rolling 1h) Success rate drift downward
Same inputs_hash > 10x/hour Tool call volume anomaly
Log per tool call: Don't log:
──────────────────────────────────── ──────────────────────────────────────
tool_name, user_id, session_id Full input content (cardinality)
status, error_code Full response content (cardinality)
duration_ms PII in inputs or outputs
inputs_hash (16-char prefix only) Sub-10ms internal operations
estimated cost_usd Tool internals / debug traces
The first table is threshold vs. drift. Immediate alerts fire on point-in-time threshold violations — states that are wrong right now. Trend monitoring fires on drift — states that are getting worse over time but haven’t crossed a threshold yet. Silent degradation is a trend problem, not a threshold problem. Adding trend monitoring after your first silent degradation incident is the normal sequence; doing it before is better.
The second table is what to capture vs. what to avoid. The cardinality concern in the “don’t log” column isn’t theoretical — the high-cardinality log explosion failure mode in the previous section is the practical consequence of ignoring it. Full input content, full response content, and PII don’t belong in the structured log table. They belong in a blob store, accessed by session ID, only when a specific debugging session requires them.
Start with four alerts on day one. Add more only after a production incident surfaces a new signal. Don’t pre-build alerts for failure modes you haven’t seen — false positives burn alert fatigue faster than real failures. The rule of thumb: every alert in your system should have a response playbook. If you don’t know what to do when the alert fires, the alert isn’t ready to be an alert.
Resources
Is Your Tool Layer Observable?
Recall’s LoggingMiddleware is 40 lines of Python. It adds approximately 2ms to every tool call — the cost of a finally block and an async SQLite write. It would have caught the $2,100 incident after $0.40 and a three-minute window.
| Item | Score | |
|---|---|---|
| Every tool call creates one ToolCallRecord | ||
| Records include: tool name, user, session, status, duration, estimated cost | ||
| Cost is estimated per tool call (token count × rate) | ||
| Error rate alert configured — threshold: error_rate > 5% in 1h window | ||
| P95 latency alert configured — threshold: P95 > 5s in 1h window | ||
| Session cost alert configured — threshold: cumulative cost > $10 in 1h window | ||
| Duplicate-call detection on same inputs_hash + session within 60 minutes | ||
| Dashboard shows all four views: volume, success rate, latency, cost |
If any of these is unchecked, that’s the next thing to fix — not the ninth item, the unchecked one. Each item on this list corresponds to a specific failure mode. Missing item three is the runaway loop uncaught. Missing item seven is the silent degradation unnoticed. Missing item two is the entire cost axis dark.
The full ToolCallRecord dataclass, LoggingMiddleware, Postgres schema, and all four alert queries are in github.com/Sentient-Zero-Labs/szl-recall. The implementation is production code, not tutorial code — it’s the same middleware running on the actual Recall server.
Six issues have covered tool design from first principles to production observability: what makes a good tool, how MCP works, building your first production server, tool granularity decisions, A2A coordination patterns, and now the instrumentation layer that tells you what’s actually happening once everything ships.
Issue 7 closes the series with the ecosystem in 2026: where MCP adoption stands, the security threats that are now real attack vectors rather than theoretical risks, and the five open problems that remain genuinely unsolved — the questions the community is still working through. It’s the view from where we are and a clear-eyed look at what’s still hard.
Until next issue,
Sentient Zero Labs
Building Effective Tools for AI is a seven-issue series from Sentient Zero Labs. Each issue ships with working code from the Recall memory server — a production MCP tool built in public alongside the series.