Building Effective Tools for AI · Issue 5/7

A2A — When Agents Need to Talk to Each Other

A2A gives multi-agent systems a task lifecycle that makes every state in a sub-agent's execution visible, pausable, and recoverable — solving the coordination failures that async function calls cannot.

May 12, 2026 · 21 min read · Sentient Zero Labs

The consolidation worker accepted the task at 14:32:07 and went silent.

The orchestrating agent had submitted a batch — 47 conversation transcripts from a single user’s session history. Recall’s consolidation worker was designed for exactly this: receive the batch, run LLM extraction on each transcript, deduplicate the extracted facts, detect contradictions, store the final memory set. For a batch this size, the worker normally finished in two to four minutes. The orchestrator had submitted the task, received a task ID, and started waiting.

14:33. Nothing.

14:34. Still nothing.

14:35. The orchestrator had been designed to retry on timeout. But what timeout? The task hadn’t timed out — the task was still running. Or it had crashed. Or it was waiting. There was no way to tell the difference. The orchestrator was holding a task ID and three minutes of silence.

The batch had hit a contradiction. Two memories the worker had extracted from different transcripts contradicted each other directly: one said the user was vegetarian, one said they had ordered steak three weeks ago. A human would pause and ask which was correct. The consolidation worker had no mechanism to pause and ask. It could only complete or fail. It had done neither. It had hung — processing loop blocked on a question it couldn’t surface, holding its allocated memory, doing nothing.

The problem wasn’t that the worker was slow. The problem was that slow, failed, and waiting-for-input were all indistinguishable from the outside. The orchestrator had submitted a task. The task had entered a state that existed nowhere in its design. There was no error to catch, no timeout to trigger, no signal to act on.

💡 The Core Problem
An agent that can’t signal its state is an agent you can’t coordinate with.

A2A exists to solve that. Not with a different queue primitive or a fancier job system — with a task lifecycle that makes every state in the agent’s execution visible, pausable, and recoverable from the outside. The consolidation worker rebuilt on A2A can say WORKING while it extracts, INPUT_REQUIRED when it finds a contradiction it can’t resolve, and COMPLETED or FAILED when it finishes. The orchestrator can poll, receive the question, resolve it, and watch the worker resume. Three minutes of silence becomes a conversation.


Mental Model

MCP and A2A Are Complementary, Not Competing

Modern agent architectures evolved from single-model systems calling tools to multi-agent systems where specialized agents delegate work to one another. The ReAct pattern (Yao et al., 2022) demonstrated that LLMs could use tools reliably — but tools were always functions: call them, get a result, continue. When the work to be delegated requires its own reasoning loop, persistent state across multiple steps, and the ability to pause and surface questions, a function call is the wrong primitive. A sub-agent is not a tool. It needs a different protocol.

MCP and A2A are both under AAIF (the Agentic AI Foundation, under Linux Foundation governance as of December 9, 2025). They are designed as complements. MCP connects an agent to tools — database calls, API calls, file operations, fast operations that return immediately or nearly so. A2A connects an agent to other agents — sub-agents with their own reasoning loops, their own state, and tasks that may run for minutes, surface blockers, and need to be driven to completion by the caller. The question is not which protocol. The question is which pattern fits the work.

┌─────────────────────┬────────────────────────────────┬──────────────────────────────────┐
│                     │  MCP (tool call)               │  A2A (agent task)                │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  Caller             │  LLM or application            │  Orchestrating agent             │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  Callee             │  Tool function                 │  Sub-agent with its own          │
│                     │                                │  reasoning loop + state          │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  Response model     │  Synchronous result or         │  Task ID, then polling or        │
│                     │  async-acknowledge             │  push notification               │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  State machine      │  None — call and return        │  SUBMITTED → WORKING →           │
│                     │                                │  COMPLETED (or branch states)    │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  Pause / resume     │  Not possible                  │  INPUT_REQUIRED → resolve →      │
│                     │                                │  resume WORKING                  │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  Discovery          │  Hardcoded server URL          │  Agent Card at                   │
│                     │                                │  /.well-known/agent-card.json    │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  Governance         │  AAIF (MCP spec)               │  AAIF (A2A v1.0 spec)            │
├─────────────────────┼────────────────────────────────┼──────────────────────────────────┤
│  Use when           │  DB calls, API calls, fast     │  Sub-agent needs own reasoning   │
│                     │  operations returning <30s     │  loop, long tasks, INPUT_REQUIRED│
└─────────────────────┴────────────────────────────────┴──────────────────────────────────┘

Relationship: MCP + A2A are both under AAIF (Linux Foundation, founded Dec 9 2025).
They are complementary — no unification planned. A tool is not a sub-agent.

MCP is for tools. A2A is for sub-agents. The difference is more than latency.

The A2A Task Lifecycle

A2A v1.0 (the current stable spec under AAIF governance) defines a task as the fundamental unit of work between agents. Every task progresses through a state machine. The orchestrator submits a task, receives a task ID, and polls (or receives push notifications) to track state transitions. The sub-agent drives the transitions — it moves itself from SUBMITTED to WORKING when it begins, to INPUT_REQUIRED when it needs the caller to resolve something, back to WORKING when the resolution is received, and to a terminal state when it finishes.

                ┌─────────────────────────────────────────┐
                │         Orchestrator submits task        │
                └──────────────────┬──────────────────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                          │   SUBMITTED     │ ← task accepted, not yet running
                          └────────┬────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                      ┌──►  WORKING         │ ← sub-agent is executing
                      │   └────────┬────────┘
                      │            │
                      │   ┌────────┴────────────────────┐
                      │   │                             │
                      │   ▼                             ▼
                      │  ┌──────────────────┐  ┌─────────────────┐
                      │  │  INPUT_REQUIRED  │  │  COMPLETED  ✓   │ ← terminal
                      │  └────────┬─────────┘  └─────────────────┘
                      │           │
                      │   orchestrator calls resolve()
                      │   with answer to the question
                      │           │
                      └───────────┘  (resumes WORKING)

  Terminal states (no resume from these):

  COMPLETED  ✓    — task finished successfully
  FAILED     ✗    — task started and could not finish (retry with new task)
  CANCELED   ✗    — canceled by caller or by sub-agent timeout
  REJECTED   ✗    — sub-agent declined before starting (capacity, unsupported type, auth)
  AUTH_REQUIRED ✗ — sub-agent requires authentication before it will accept the task

  Key distinction:
  REJECTED = declined before start (retry semantics: check capacity, resubmit)
  FAILED   = started and broke    (retry semantics: new task, investigate root cause)

INPUT_REQUIRED is not an error. It is a first-class mechanism to surface a blocker and resume.

The REJECTED vs. FAILED distinction is worth holding onto — it determines retry strategy. If REJECTED, the sub-agent never started; the task can be resubmitted once the constraint is resolved (capacity freed, task type supported, auth provided). If FAILED, the sub-agent started and broke during execution; resubmitting the same task may hit the same failure. Different root causes, different handling.
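
The lifecycle above can be written down as a small transition table. This is a minimal sketch, not the spec's own data model: TaskState, TERMINAL, ALLOWED, and can_transition are hypothetical names, the wire values are the lowercase-hyphen strings used in this issue, and the transitions (including AUTH_REQUIRED as a terminal state) are read off the diagram and state descriptions here as assumptions.

```python
from enum import Enum


class TaskState(str, Enum):
    """Task states, valued by the wire strings used in this issue."""
    SUBMITTED = "submitted"
    WORKING = "working"
    INPUT_REQUIRED = "input-required"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
    REJECTED = "rejected"
    AUTH_REQUIRED = "auth-required"


# Terminal as described in this issue: nothing resumes from these states.
TERMINAL = frozenset({
    TaskState.COMPLETED, TaskState.FAILED, TaskState.CANCELED,
    TaskState.REJECTED, TaskState.AUTH_REQUIRED,
})

# Assumed transition table, derived from the diagram and descriptions above.
ALLOWED = {
    # REJECTED and AUTH_REQUIRED happen before the sub-agent starts work.
    TaskState.SUBMITTED: {TaskState.WORKING, TaskState.REJECTED,
                          TaskState.AUTH_REQUIRED, TaskState.CANCELED},
    # FAILED only makes sense once the task has started.
    TaskState.WORKING: {TaskState.INPUT_REQUIRED, TaskState.COMPLETED,
                        TaskState.FAILED, TaskState.CANCELED},
    # Resolved questions resume WORKING; unanswered ones are canceled.
    TaskState.INPUT_REQUIRED: {TaskState.WORKING, TaskState.CANCELED},
}


def can_transition(current: TaskState, new: TaskState) -> bool:
    """Return True if this sketch's table allows current -> new."""
    return new in ALLOWED.get(current, set())
```

A sub-agent that checks can_transition before every state write cannot silently leave a terminal state, which is the property the orchestrator's poll loop relies on.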

Agent Cards

A2A agents advertise their capabilities via an Agent Card — a JSON document served at /.well-known/agent-card.json. The orchestrator fetches the card before its first task submission and uses it to validate that the sub-agent supports the capabilities required for the work at hand.

{
  "name": "recall-consolidation",
  "description": "Batch memory consolidation worker — deduplicates and extracts memories from conversation transcripts.",
  "url": "https://agents.recall.internal/consolidation",
  "version": "1.2.0",

  "capabilities": {
    "streaming": false,
    "pushNotifications": true,
    "stateTransitionHistory": true,
    "inputRequired": true
  },

  "skills": [
    {
      "id": "consolidate_memories",
      "name": "Consolidate Memory Batch",
      "description": "Accepts a batch of conversation transcripts, extracts facts, deduplicates, and detects contradictions.",
      "inputModes": ["text"],
      "outputModes": ["text"],
      "examples": ["Consolidate 50 transcripts for user usr_123"]
    }
  ],

  "authentication": {
    "schemes": ["bearer"]
  }
}

────────────────────────────────────────────────────────────────────────────────
Key fields the orchestrator validates before submitting a task:

  capabilities.inputRequired  →  confirm sub-agent supports INPUT_REQUIRED
  capabilities.pushNotifications  →  prefer push over polling if true
  version  →  cache this, re-fetch on any submission failure (Agent Card drift)
  skills[*].id  →  use to target a specific capability in the task submission
────────────────────────────────────────────────────────────────────────────────

The Agent Card is how sub-agents advertise what they can do. The orchestrator validates it before submitting any task.

Agent Cards replace hardcoded endpoint assumptions with a discovery mechanism. A fleet of sub-agents can be discovered dynamically; the orchestrator validates capabilities before submitting, not after the first failure. The card version field is particularly important: when a sub-agent is redeployed with changed capabilities, the version should change, and the orchestrator can detect the drift by comparing its cached version against a freshly fetched card.
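
Drift detection can be as simple as diffing two cards. The sketch below is illustrative only: card_drift is a hypothetical helper, and it assumes cards shaped like the example above (a version string, a capabilities map of booleans, and a skills list with id fields).

```python
def card_drift(cached: dict, fresh: dict) -> dict:
    """Compare a cached Agent Card with a freshly fetched one.

    Reports a version change plus any skills or capabilities the cached
    card declared that the fresh card no longer does.
    """
    def skill_ids(card: dict) -> set:
        return {s["id"] for s in card.get("skills", [])}

    def enabled_caps(card: dict) -> set:
        return {k for k, v in card.get("capabilities", {}).items() if v is True}

    return {
        "version_changed": cached.get("version") != fresh.get("version"),
        "removed_skills": skill_ids(cached) - skill_ids(fresh),
        "removed_capabilities": enabled_caps(cached) - enabled_caps(fresh),
    }
```

A non-empty removed_skills or removed_capabilities set is the case worth alerting on; a version change alone may be harmless.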

Understanding the lifecycle is the mental model. The next question is what the code looks like when you’re the caller: you submit a task to a sub-agent, you don’t know how long it will take, and you need to handle every state — including the one where it stops and asks you something.


Implementation

Building on Issue 03: Issue 03 covered the FastMCP server pattern — the @mcp.tool() decorator, request context, error propagation. The ConsolidationClient below is the caller side of that architecture: the orchestrator that drives a sub-agent through its task lifecycle. The FastMCP server pattern handles what the sub-agent exposes; the A2A client pattern handles how the orchestrator calls it. Both sides are required for production multi-agent coordination.

The primary pattern for any A2A caller is a three-part loop: submit, poll, and branch. The branch on INPUT_REQUIRED is the part most implementations skip — until they need it.

The A2A Client: Submit, Poll, and Handle INPUT_REQUIRED

The ConsolidationClient below is the full A2A caller for Recall’s consolidation worker. It handles task submission, an async poll loop with exponential backoff, every terminal state, and INPUT_REQUIRED resolution via a callback. The resolution callback is the key architectural decision: the client stays generic, and the caller decides how to resolve any question the sub-agent surfaces.

import asyncio
import httpx
from typing import Awaitable, Callable

TaskResolutionCallback = Callable[[str, dict], Awaitable[str]]


class ConsolidationClient:
    """A2A client for the consolidation worker. Handles all task states."""

    def __init__(
        self,
        agent_url: str,
        bearer_token: str,
        resolution_callback: TaskResolutionCallback,
        poll_interval: float = 2.0,
    ) -> None:
        self.agent_url = agent_url
        self.headers = {"Authorization": f"Bearer {bearer_token}"}
        self.resolution_callback = resolution_callback
        self.poll_interval = max(2.0, poll_interval)  # 2s minimum — see notes

    async def consolidate(self, user_id: str, transcripts: list[str]) -> dict:
        """Submit a consolidation task and drive it to completion.
        Returns the final task result or raises on unrecoverable failure."""
        task_id = await self._submit(user_id, transcripts)
        return await self._poll(task_id)

    async def _submit(self, user_id: str, transcripts: list[str]) -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.agent_url}/tasks",
                json={"skill": "consolidate_memories",
                      "input": {"user_id": user_id, "transcripts": transcripts}},
                headers=self.headers,
                timeout=30.0,
            )
            resp.raise_for_status()
            return resp.json()["task_id"]

    async def _poll(self, task_id: str) -> dict:
        consecutive_working = 0
        current_interval = self.poll_interval

        while True:
            async with httpx.AsyncClient() as client:
                resp = await client.get(
                    f"{self.agent_url}/tasks/{task_id}",
                    headers=self.headers,
                    timeout=10.0,
                )
                resp.raise_for_status()
                task = resp.json()

            state = task["state"]
            # A2A v1.0 wire format uses lowercase with hyphens:
            # "submitted", "working", "input-required", "completed",
            # "failed", "canceled", "rejected", "auth-required"

            if state == "completed":
                return task["result"]

            elif state == "failed":
                raise RuntimeError(f"Task {task_id} failed: {task.get('reason', 'unknown')}")

            elif state in ("canceled", "rejected"):
                raise RuntimeError(f"Task {task_id} ended with state {state}: {task.get('reason')}")

            elif state == "auth-required":
                # Sub-agent requires authentication before it will accept the task.
                # AUTH_REQUIRED is a terminal state added in A2A v1.0 — refresh
                # credentials and resubmit as a new task.
                raise RuntimeError(
                    f"Task {task_id} requires authentication: {task.get('reason')}. "
                    "Refresh bearer token and resubmit."
                )

            elif state == "input-required":
                # Sub-agent stopped and is asking a question — resolve and resume
                question = task["input_required"]
                answer = await self.resolution_callback(task_id, question)
                await self._resolve(task_id, answer)
                consecutive_working = 0
                current_interval = self.poll_interval  # reset backoff after INPUT_REQUIRED

            elif state in ("submitted", "working"):
                consecutive_working += 1
                # Exponential backoff after 10 consecutive WORKING responses
                if consecutive_working > 10:
                    current_interval = min(current_interval * 2, 30.0)

            await asyncio.sleep(current_interval)

    async def _resolve(self, task_id: str, answer: str) -> None:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                f"{self.agent_url}/tasks/{task_id}/resolve",
                json={"answer": answer},
                headers=self.headers,
                timeout=10.0,
            )
            resp.raise_for_status()


# ── Usage ────────────────────────────────────────────────────────────────────
# The caller decides how to resolve INPUT_REQUIRED — present to user,
# use a lookup table, call another agent. The client just shuttles the Q+A.

async def resolve_contradiction(task_id: str, question: dict) -> str:
    """Example: resolve a contradiction with a fixed rule (stand-in for asking the user)."""
    print(f"Contradiction detected: {question['description']}")
    # In production: surface to user via UI, or route to a resolution agent
    return "keep_most_recent"


async def main():
    client = ConsolidationClient(
        agent_url="https://agents.recall.internal/consolidation",
        bearer_token="...",
        resolution_callback=resolve_contradiction,
    )
    result = await client.consolidate(user_id="usr_123", transcripts=[...])
    print(result)


if __name__ == "__main__":
    asyncio.run(main())

Why 2s minimum poll interval. At 100ms intervals, a client polling an LLM-backed sub-agent sends 600 status requests per minute. The sub-agent spends more CPU answering status checks than doing the actual work. 2s is a practical lower bound for any task that involves LLM calls — the worker completes one extraction pass in roughly 2-4 seconds, so polling faster than that adds no information and costs CPU on both sides. For production systems where the sub-agent declares capabilities.pushNotifications: true in its Agent Card, prefer push notification over polling entirely. Push eliminates polling for the happy path.
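
To make the backoff concrete, the rule used in _poll can be pulled out into a pure function and run over a sequence of consecutive WORKING responses. This is a sketch: next_poll_interval and poll_schedule are hypothetical helper names, and the threshold of 10 and the 30s cap are the values from the client above.

```python
def next_poll_interval(current: float, consecutive_working: int,
                       threshold: int = 10, cap: float = 30.0) -> float:
    """Double the interval once the WORKING streak exceeds the threshold."""
    if consecutive_working > threshold:
        return min(current * 2, cap)
    return current


def poll_schedule(base: float = 2.0, polls: int = 15) -> list:
    """Sleep durations after each of `polls` consecutive WORKING responses."""
    interval = base
    schedule = []
    for streak in range(1, polls + 1):
        interval = next_poll_interval(interval, streak)
        schedule.append(interval)
    return schedule
```

With the defaults, the first ten sleeps stay at the base interval, and the doubling only begins on the eleventh consecutive WORKING response, so short tasks never pay the backoff penalty.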

The resolution callback pattern. The resolution_callback is a Callable[[str, dict], Awaitable[str]] — the task ID and the question payload in, the answer out. The client doesn’t know or care what “resolve” means in your system. Maybe it presents a UI to the user. Maybe it routes to a different agent that specializes in contradiction resolution. Maybe it applies a deterministic rule (“always keep most recent”). The callback pattern keeps the client reusable across those three strategies without modification. The client’s job is to recognize INPUT_REQUIRED, call the callback, and relay the answer. Your system’s job is to decide what the answer is.

REJECTED vs. FAILED. These two terminal states look similar, since in both cases the task didn’t complete, but they have different retry semantics. REJECTED means the sub-agent evaluated the task before starting and declined it: capacity limit, unsupported skill ID, authentication not satisfied. The task never ran. Retry after addressing the constraint: wait for capacity, check the skill ID in the Agent Card, refresh the auth token. FAILED means the sub-agent started the task and could not finish: extraction threw an exception, or the LLM API returned an error. The task ran partway. Retry with a new task ID after investigating the root cause; resubmitting the same task ID will not restart it. An INPUT_REQUIRED question that goes unanswered past its timeout is a separate case: the sub-agent auto-cancels the task, so it ends as CANCELED, not FAILED.
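
One way to keep these retry semantics out of ad hoc if-statements is a lookup from terminal state to next step. The sketch below is an assumption-laden illustration: retry_action and the action labels are hypothetical, and the choice not to auto-retry CANCELED tasks is a design assumption, not something the protocol mandates.

```python
# Hypothetical action labels; the mapping mirrors the retry semantics above.
RETRY_ACTIONS = {
    "completed": "done",
    # Never started: clear the constraint, then the same work can be resubmitted.
    "rejected": "fix-constraint-then-resubmit",
    "auth-required": "refresh-credentials-then-resubmit",
    # Started and broke: a blind resubmit may hit the same failure.
    "failed": "investigate-then-new-task",
    # Assumption: a canceled task is not retried without a human decision.
    "canceled": "do-not-retry-automatically",
}


def retry_action(state: str) -> str:
    """Map a terminal wire state to a coarse next step for the orchestrator."""
    try:
        return RETRY_ACTIONS[state]
    except KeyError:
        raise ValueError(f"{state!r} is not a terminal state") from None
```

Raising on non-terminal states keeps a caller from asking "should I retry?" about a task that is still WORKING or waiting on INPUT_REQUIRED.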

Task state persistence. The sub-agent must persist its task state to durable storage — not in memory only. If the sub-agent restarts while a task is WORKING, in-memory state is gone. The orchestrator polling for that task ID will receive a 404 or an unexpected state. The fix is straightforward: write state transitions to a database as they happen. The orchestrator sees a consistent state regardless of sub-agent restarts. This is not optional for any task that runs longer than the sub-agent’s uptime guarantee.
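
A minimal sketch of that write-as-you-go persistence, using SQLite from the standard library as a stand-in for whatever durable store the sub-agent actually uses. The table name, column names, and the helpers open_store, record_transition, and current_state are all hypothetical.

```python
import sqlite3
import time
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the transition log at `path`."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_transitions ("
        " task_id TEXT NOT NULL, state TEXT NOT NULL,"
        " reason TEXT, at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def record_transition(conn: sqlite3.Connection, task_id: str, state: str,
                      reason: Optional[str] = None) -> None:
    """Append one state transition and commit it before returning."""
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO task_transitions (task_id, state, reason, at)"
            " VALUES (?, ?, ?, ?)",
            (task_id, state, reason, time.time()),
        )


def current_state(conn: sqlite3.Connection, task_id: str) -> Optional[str]:
    """Return the most recently recorded state, or None for an unknown task."""
    row = conn.execute(
        "SELECT state FROM task_transitions WHERE task_id = ?"
        " ORDER BY rowid DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    return row[0] if row else None
```

Because every transition is committed before the sub-agent moves on, a process that restarts and reopens the same file answers a status poll with the last recorded state instead of a 404. The append-only log also gives you the transition history for free.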

Agent Card Discovery and Validation

Before submitting any task, the orchestrator should fetch and validate the sub-agent’s Agent Card. This catches capability mismatches before they become task failures, and the version field in the response gives you a baseline for detecting drift after deployments.

import httpx


class AgentCardError(Exception):
    pass


async def fetch_and_validate_agent_card(
    agent_url: str,
    required_capabilities: set[str] | None = None,
) -> dict:
    """Fetch the Agent Card and validate it before submitting any task.

    Raises AgentCardError if the card is unreachable or missing required capabilities.
    The orchestrator should call this before the first task submission and re-fetch
    on any submission failure (handles Agent Card version drift).
    """
    if required_capabilities is None:
        required_capabilities = {"inputRequired"}

    async with httpx.AsyncClient() as client:
        try:
            resp = await client.get(
                f"{agent_url}/.well-known/agent-card.json",
                timeout=5.0,
            )
            resp.raise_for_status()
            card = resp.json()  # parse inside context — resp is only valid here
        except httpx.HTTPError as e:
            raise AgentCardError(f"Agent Card unreachable at {agent_url}: {e}") from e

    # Validate required capabilities are declared
    agent_caps = set(
        k for k, v in card.get("capabilities", {}).items() if v is True
    )
    missing = required_capabilities - agent_caps
    if missing:
        raise AgentCardError(
            f"Agent '{card.get('name')}' is missing required capabilities: {missing}. "
            f"Declared: {agent_caps}. Check agent deployment."
        )

    return card


# ── Usage in orchestrator ─────────────────────────────────────────────────────

async def submit_with_validation(user_id: str, transcripts: list[str]) -> dict:
    card = await fetch_and_validate_agent_card(
        agent_url="https://agents.recall.internal/consolidation",
        required_capabilities={"inputRequired", "pushNotifications"},
    )
    # Card version is available: record it so later failures can be checked for drift
    version = card.get("version", "unknown")
    print(f"Agent Card version for {card.get('name')}: {version}")

    client = ConsolidationClient(
        agent_url=card["url"],
        bearer_token="...",
        resolution_callback=resolve_contradiction,
    )
    return await client.consolidate(user_id=user_id, transcripts=transcripts)

Why validate before submission, not after failure. A missing capability discovered at task submission time returns an opaque error: the sub-agent receives a task it cannot process, fails it, and the orchestrator sees a FAILED task state without a clear root cause. Validating the Agent Card before submission means the error is specific (AgentCardError: missing capabilities: {'inputRequired'}) and caught before any task state is created. Note that submit_with_validation as written fetches the card once per call and does not retry. Handling the deployment drift case takes a re-fetch-on-failure strategy on top of it: if a submission fails with a validation error, re-fetch the card before deciding whether to retry, because the sub-agent may have been redeployed with a different schema.
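
The re-fetch-on-failure strategy can be sketched as a small wrapper around any fetch and submit pair. Everything here is hypothetical scaffolding: SubmissionError stands in for whatever exception your submission path raises on a validation failure, and fetch_card and submit are injected callables so the wrapper stays independent of the HTTP client.

```python
from typing import Awaitable, Callable


class SubmissionError(Exception):
    """Hypothetical error raised when the sub-agent refuses a submission."""


async def submit_with_refetch(
    fetch_card: Callable[[], Awaitable[dict]],
    submit: Callable[[dict], Awaitable[dict]],
    max_attempts: int = 2,
) -> dict:
    """Submit against the current Agent Card, re-fetching it after a failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    card = await fetch_card()
    for attempt in range(1, max_attempts + 1):
        try:
            return await submit(card)
        except SubmissionError:
            if attempt == max_attempts:
                raise  # still failing against a fresh card: give up
            # The card we hold may predate a redeploy; fetch it again.
            card = await fetch_card()
    raise AssertionError("unreachable")
```

In the orchestrator above, fetch_card would wrap fetch_and_validate_agent_card and submit would build a ConsolidationClient from the card and call consolidate. Keeping the attempt count low matters: a second failure against a fresh card is a real incompatibility, not drift.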

With the implementation pattern in place, the question shifts from “what code do I write” to “what can go wrong in production.” Three failure modes surface reliably once A2A coordination scales past a single orchestrator.


Failure Modes

POLLING THUNDERSTORM

What happens:  Multiple orchestrators poll the consolidation worker at
               100ms intervals. Worker spends more CPU responding to
               status checks than running extractions. Throughput
               collapses. New tasks queue while the worker is busy
               answering "are you done yet?"

Root cause:    Poll interval too short. No backoff. No coordination
               between callers. Each orchestrator acts independently,
               each assuming its task is the only one.

How to detect: Sub-agent request volume vs. task throughput ratio.
               If HTTP GET /tasks/{id} requests >> completed tasks,
               polling is the bottleneck. Observable in access logs
               within minutes of a load increase.

Fix:           Minimum poll interval 2s. Exponential backoff after 10
               consecutive WORKING polls (2s → 4s → 8s → cap at 30s).
               Prefer push notification when the sub-agent declares
               capabilities.pushNotifications = true. Push eliminates
               polling entirely for the happy path.
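
The detection ratio above is easy to compute from access logs. The sketch below assumes each log entry has already been reduced to a "METHOD path" string, which is an assumption about your log format; poll_to_completion_ratio is a hypothetical helper.

```python
import re

# Matches status polls such as "GET /tasks/abc123" but not
# "POST /tasks" or "POST /tasks/abc123/resolve".
STATUS_POLL = re.compile(r"^GET /tasks/[^/\s]+$")


def poll_to_completion_ratio(request_lines: list, completed_tasks: int) -> float:
    """Status polls per completed task over the same time window."""
    polls = sum(1 for line in request_lines if STATUS_POLL.match(line.strip()))
    if completed_tasks == 0:
        # Polls with nothing completing is the worst case; no polls is a no-op.
        return float("inf") if polls else 0.0
    return polls / completed_tasks
```

What counts as too high depends on task duration, so compare the ratio against its own baseline over time instead of a fixed threshold.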

HUNG INPUT_REQUIRED

What happens:  Sub-agent transitions to INPUT_REQUIRED ("input-required"
               on the wire). Orchestrator receives the question. Then the
               caller crashes before calling resolve(), or the resolution
               callback throws, or the user dismisses the UI. The task holds
               allocated memory indefinitely on the sub-agent. Across 50
               batches, the sub-agent leaks state until it OOMs and restarts,
               losing all in-progress tasks.

Root cause:    No timeout on INPUT_REQUIRED state. No cleanup path for
               tasks where the caller disappears after receiving the question.

How to detect: Monitor tasks in INPUT_REQUIRED state for longer than a
               configurable threshold. Default: 10 minutes. Count of
               tasks stuck in INPUT_REQUIRED is the key metric.

Fix:           Sub-agent auto-cancels tasks that remain in INPUT_REQUIRED
               past the timeout. Returns CANCELED ("canceled" on the wire)
               with reason: "INPUT_REQUIRED not resolved within 10m timeout."
               Caller receives a clean terminal state, not a hung task.
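
The auto-cancel fix can be sketched as a periodic sweep on the sub-agent side. The Task dataclass and cancel_stale_input_required are hypothetical; a real worker would run the sweep against its durable task store instead of an in-memory list.

```python
from dataclasses import dataclass
from typing import Optional

INPUT_REQUIRED_TIMEOUT_S = 600.0  # the 10-minute default from above


@dataclass
class Task:
    task_id: str
    state: str                 # wire-format state name
    state_entered_at: float    # epoch seconds when the state was entered
    reason: Optional[str] = None


def cancel_stale_input_required(
    tasks: list, now: float, timeout_s: float = INPUT_REQUIRED_TIMEOUT_S
) -> list:
    """Cancel tasks stuck in input-required past the timeout.

    Mutates the stale tasks in place and returns their IDs.
    """
    canceled = []
    for task in tasks:
        stale = now - task.state_entered_at > timeout_s
        if task.state == "input-required" and stale:
            task.state = "canceled"
            task.state_entered_at = now
            minutes = int(timeout_s // 60)
            task.reason = f"INPUT_REQUIRED not resolved within {minutes}m timeout."
            canceled.append(task.task_id)
    return canceled
```

The length of the returned list is also the metric to export: a sweep that keeps canceling tasks means callers are disappearing after they receive the question.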

AGENT CARD VERSION DRIFT

What happens:  Orchestrator caches the Agent Card at startup. Sub-agent
               is redeployed — new version removes a skill or changes
               input schema. Orchestrator continues submitting tasks
               targeting the old skill ID. Tasks fail immediately with
               schema validation errors. Error rate spikes. The root
               cause is invisible — the orchestrator sees failures, not
               the deployment that caused them.

Root cause:    Agent Card treated as static. Cached once, never re-fetched
               unless explicitly triggered.

How to detect: Validation error rate spike on task submission to a specific
               sub-agent immediately following a deployment. The correlation
               with deployment time is the signal.

Fix:           On any submission failure, re-fetch the Agent Card before
               retrying. Pin the card version in deployment manifests.
               Integration test: submit a sample task to every registered
               Agent Card after each deploy — before routing real traffic.

These three failures have a common shape: they’re invisible until they compound. The polling thunderstorm looks like a slow sub-agent. The hung INPUT_REQUIRED looks like a memory leak. The Agent Card drift looks like random task failures. All three are diagnosable from metrics — which is the subject of Issue 6. Before reaching for observability, though, the decision of whether to use A2A at all is the right first question.


Decision Guide

A2A adds protocol surface area. The poll loop, the Agent Card fetch, the INPUT_REQUIRED handler, the resolution callback — each is a piece of code that can fail, version-drift, or misbehave. Don’t pay for it until you need what it buys.

Use A2A when:                              Use async MCP tool when:
──────────────────────────────────────     ──────────────────────────────────────
Sub-agent needs its own reasoning loop     Task completes in <30 seconds
Multi-step execution with state            Fire-and-forget acceptable
Long-running tasks (minutes, not sec)      No contradiction or pause needed
INPUT_REQUIRED is a realistic path         Single call, single result
Human-in-the-loop possible                 Endpoint is known and static
Agent Card discovery needed                No Agent Card required
Sub-agent must persist state across        In-process queue sufficient
its own restarts

For most teams, start with async MCP tools. A job-ID pattern — submit an async MCP tool call, receive a job ID, poll a status endpoint — covers the majority of background work. It’s simpler, introduces less surface area, and is sufficient for any task that completes in under 30 seconds and never needs to pause and surface a question.

Add A2A when all three conditions are met: the sub-agent genuinely needs INPUT_REQUIRED behavior (it will find questions it cannot answer itself), the task runs long enough that state persistence across the sub-agent’s own restarts matters, and you need Agent Card-based discovery across a deployment of multiple sub-agents. If only one of those conditions is true, evaluate whether a simpler pattern covers it first.

The A2A protocol adds surface area. Don’t pay for it until you need what it buys.


Resources

AAIF / a2a-protocol.org
The authoritative spec for the task lifecycle state machine, Agent Card format, and wire-format state names (input-required vs. INPUT_REQUIRED); the ConsolidationClient implementation follows this spec exactly.
a2aproject — GitHub
The open-source repository with spec, samples, and reference implementations; the consolidation worker architecture maps to the server-side samples here.
AAIF / Linux Foundation
The governance body placing both MCP and A2A under neutral open governance; establishes why both protocols are stable infrastructure rather than vendor-specific choices.
encode/httpx
The HTTP client used in ConsolidationClient for task submission, polling, and resolution; its async context manager avoids connection pooling boilerplate.
Sentient Zero Labs — GitHub
Full ConsolidationClient, Agent Card, and server-side task state persistence code; the INPUT_REQUIRED callback pattern is the implementation most tutorials skip.

Production Checklist

Sub-agent implements all required states: SUBMITTED, WORKING, INPUT_REQUIRED, COMPLETED, FAILED, CANCELED, REJECTED, AUTH_REQUIRED
Poll interval is minimum 2s with exponential backoff after 10 consecutive WORKING responses (2s → 4s → 8s → cap at 30s)
INPUT_REQUIRED state has a configurable auto-cancel timeout (default: 10 min)
Agent Card is versioned and updated on every capability-changing deployment
Task state is persisted to durable storage — not held in memory only
FAILED and CANCELED states return structured reason field
Orchestrator poll loop handles all terminal states: COMPLETED, FAILED, CANCELED, REJECTED, AUTH_REQUIRED
Agent Card is validated against required capabilities before task submission
Orchestrator re-fetches Agent Card on any submission validation failure

The ConsolidationClient and fetch_and_validate_agent_card above cover the orchestrator-side items except the re-fetch on submission failure, which the caller still has to wire in. The sub-agent implementation, particularly state persistence and the INPUT_REQUIRED timeout, is the side that most teams underinvest in. A sub-agent that handles every state correctly in development but loses in-memory task state on a pod restart is not production-ready. The checklist items for the sub-agent are harder to satisfy than the client-side items, and they are the ones that determine whether your A2A coordination holds up under load.

There is one dimension the checklist does not cover: when A2A coordination is working, how do you know? When it breaks, how fast do you find out? That’s the subject of Issue 6 — tool observability: what to log at every tool call, the three dashboards that catch 90% of production failures, and how to trace LLM API cost back to the originating session. The patterns apply to A2A tasks as directly as they apply to MCP tool calls. The ToolCallRecord pattern from Issue 1 extends naturally to task-level records — submit time, state transitions, resolution events, final result, and cost. Issue 6 builds that instrumentation in full.


Until next issue,

Sentient Zero Labs

Building Effective Tools for AI is a seven-issue series from Sentient Zero Labs. Each issue ships with working code from the Recall memory server — a production MCP tool and A2A coordination pattern built in public alongside the series. The consolidation worker, Agent Card, and client code are at github.com/Sentient-Zero-Labs/szl-recall.