Agent Design Fieldbook Issue 1/8

The Foundation: Why Most AI Agents Fail in Production

Most AI agents are wrappers that crash in production; production agents are systems with 8 deterministic layers where the LLM is constrained, not trusted.

Apr 13, 2026 · 12 min read · Sentient Zero Labs

Most AI agents that work in demos crash in production. The gap isn’t the model — it’s the architecture.

We learned this the hard way. We built a “wrapper” — a thin UI around GPT-4 — and launched it. It worked perfectly for me and my co-founder. Then we let 100 real users in.

It was a bloodbath.

Hallucinated prices. Forgotten conversations. A $2,100 overnight API bill. All in the first month.

This series documents how we rebuilt our “demo toy” into a production system serving 50k+ queries. We made every mistake so you don’t have to.

In this first issue:

  • Why wrappers fail in production (and our $2,100 lesson)
  • The 8-layer architecture that actually survives real users
  • The “Agent Readiness” scorecard to assess your system
This is not a prompting tutorial. This is systems engineering.

History Anchor: Symbolic AI -> Modern Agent Architectures

Early AI systems were rule-based expert systems — deterministic, brittle, but predictable. Modern agents use LLMs for reasoning but still need the same deterministic guardrails those early systems had. The 8-layer architecture is the bridge: probabilistic intelligence wrapped in deterministic engineering.


The Disaster (Weeks 1-4)

Week 1: The Price Disaster

┌──────────────────────────────────────────────────────────┐
│ USER SEARCH: "Find me cheap laptops under $500"          │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│ BOT RECOMMENDS:                                          │
│ 1. MacBook Pro 16" - $5,299                              │
│ 2. Dell XPS 15 - $2,499                                  │
│ 3. HP Spectre - $1,899                                   │
└──────────────────────────────────────────────────────────┘

The user asked for laptops under $500. The bot recommended a $5,299 MacBook Pro.

Why it happened
The LLM interpreted “cheap” as “good value for money” and ignored the explicit $500 constraint. There was no validation layer between the model’s interpretation and the database query.
The fix
A filter extraction layer that converts “cheap” to price < 800 and validates that $500 means price < 500, not “whatever the model thinks.”
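That fix can be sketched as a deterministic validation pass. The registry, the vague-term table, and the function names below are illustrative, not our actual implementation:

```python
# A sketch of deterministic filter validation, assuming a field registry
# (covered in Issue 2). FIELD_REGISTRY and VAGUE_TERMS are illustrative.
FIELD_REGISTRY = {
    "price": {"type": float, "min": 0, "max": 10_000},
    "battery_life": {"type": float, "min": 0, "max": 24},
}

# Deterministic mapping for vague terms, applied during extraction --
# never left to the model's interpretation.
VAGUE_TERMS = {"cheap": ("price", "<", 800)}

def validate_filters(filters: dict) -> dict:
    """Drop unknown fields and clamp values to registry ranges."""
    validated = {}
    for field, value in filters.items():
        spec = FIELD_REGISTRY.get(field)
        if spec is None:
            continue  # LLM invented a field: discard it, don't query with it
        value = spec["type"](value)
        validated[field] = max(spec["min"], min(spec["max"], value))
    return validated
```

With this pass in place, an explicit "$500" survives as `price < 500` no matter how the model paraphrases it, and invented fields never reach the database.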

Week 2: The Memory Disaster

User: Find me laptops with good battery life
Bot:  Here are 5 great options! [shows list]
User: Tell me more about the first one
Bot:  Which laptop are you referring to?

The bot forgot its own recommendations from 30 seconds ago.

Why it happened
Each message was processed in isolation. The system had no session context — no memory of what it had fetched, what order results appeared in, or what the user was referring to.
The fix
A memory layer that tracks fetched products, current focus, and conversation state across turns.
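A minimal sketch of that memory layer, assuming a dataclass-based session store (the field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SessionContext:
    """Structured conversation memory: what was fetched, what's in focus."""
    fetched_products: list = field(default_factory=list)
    focus_index: Optional[int] = None
    turn_count: int = 0

    def add_results(self, products: list) -> None:
        self.fetched_products = products
        self.turn_count += 1

    def resolve_reference(self, index: int):
        """Resolve 'the first one' to a concrete product from the last fetch."""
        if 0 <= index < len(self.fetched_products):
            self.focus_index = index
            return self.fetched_products[index]
        return None
```

The key design choice: memory is structured state, not a transcript, so "the first one" is an index lookup rather than another LLM guess.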

Week 3: The Hallucination Disaster

User: What's the price of the Dell XPS 15?
Bot:  $1,299

Actual price in database: $1,899. The bot confidently stated a price that didn’t exist. Our legal team flagged us for false advertising.

Why it happened
The LLM wasn’t querying our database — it was generating a price that “sounded plausible” based on its training data. There was no grounding in real data.
The fix
A search layer that queries the actual database and passes real data to the response generator, not invented facts.
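One way to enforce that grounding is to build the generation prompt exclusively from database rows, so the model can rephrase facts but never invent them. A sketch with hypothetical names:

```python
def build_response_prompt(query: str, rows: list) -> str:
    """Ground the generator: the prompt contains only database facts.
    Illustrative sketch -- row schema and wording are assumptions."""
    if not rows:
        return "Tell the user no matching products were found."
    facts = "\n".join(f"- {r['name']}: ${r['price']}" for r in rows)
    return (
        f"User asked: {query}\n"
        f"Answer using ONLY these products and prices:\n{facts}"
    )
```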

Week 4: The Loop (The $2k Night)

We had a bug where the bot kept retrying a failed query. The error handler called the same function that failed, which failed again, which triggered the error handler again. Recursive error handling with no max_retries limit. Because we had no token tracking circuit (Layer 5), it ran all night.

Result: We woke up to a $2,100 OpenAI bill for a single user session.

Why it happened
No circuit breaker. No retry limit. No token budget. The loop had no termination condition — it would retry forever until the API rate-limited us (which took 8 hours).
The fix
A state machine that tracks token usage, enforces max_retries, and has hard per-session token limits.
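A minimal sketch of such a circuit breaker; the limits and class names are illustrative:

```python
class BudgetExceeded(Exception):
    pass

class SessionBudget:
    """Hard per-session limits: the circuit breaker that would have
    stopped the $2,100 night. Limit values are illustrative."""
    def __init__(self, max_tokens: int = 50_000, max_retries: int = 3):
        self.max_tokens = max_tokens
        self.max_retries = max_retries
        self.tokens_used = 0
        self.retries = 0

    def record_call(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget hit: {self.tokens_used}")

    def record_retry(self) -> None:
        self.retries += 1
        if self.retries > self.max_retries:
            raise BudgetExceeded("retry limit hit: fail fast, alert, stop")
```

Every LLM call goes through `record_call`, every error handler through `record_retry`. The loop still happens; it just terminates in seconds instead of 8 hours.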

The Realization

We had a demo that worked in controlled tests. We didn’t have a system that could handle real users.

The LLM was excellent at understanding intent and generating natural responses. But we were trusting it with jobs it couldn’t do: remembering context, validating constraints, and grounding in real data.

💡 The Core Insight
LLMs are probabilistic pattern matchers. They predict the next likely token, not the “correct” answer. That means you cannot trust them — you must constrain them.

The reliability math: In traditional software, components are 99.9% reliable. In AI systems, the core component (the LLM) is maybe 80% reliable for any given task. You cannot build a 99% reliable system from 80% reliable parts — unless you add validation layers, fallbacks, and constraints. That’s what the 8-layer architecture provides.
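The arithmetic is easy to verify. The 90% validation catch rate below is an illustrative assumption, not a measured number:

```python
# Chaining three ~80%-reliable LLM steps without validation gives
# roughly coin-flip reliability for the whole request.
p_llm = 0.80
steps = 3  # classify, extract, generate
print(round(p_llm ** steps, 3))  # → 0.512

# If a validation layer catches ~90% of each step's failures (an
# assumed number for illustration), effective per-step reliability
# rises to 0.8 + 0.2 * 0.9 = 0.98, and the chain recovers.
p_validated = p_llm + (1 - p_llm) * 0.9
print(round(p_validated ** steps, 3))  # → 0.941
```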


The Diagnosis: Why Wrappers Fail

What We Built (The Wrapper)

┌─────────────────────────────────────────────────────────┐
│                    THE WRAPPER                          │
│                    (What We Built)                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   User Input  →  LLM API Call  →  Text Output           │
│                                                         │
│   "Find cheap laptops"                                  │
│         ↓                                               │
│   [Send to GPT-4]                                       │
│         ↓                                               │
│   "Here are some options..."                            │
│                                                         │
└─────────────────────────────────────────────────────────┘

This architecture has five fatal gaps:

Gap 1: No Memory

Each message is isolated. The bot can’t remember previous results. “Tell me about the first one” becomes “Which one?”

Gap 2: No Validation

The bot can’t verify if extracted data is real. “Cheap” means whatever the model guesses. The LLM invents field names that don’t exist in your schema.

Gap 3: No Grounding

The LLM doesn’t query your actual database. It makes up prices, specs, and availability based on training data. It sounds confident, but it’s wrong.

Gap 4: No Fallback

When the database is down, the bot hallucinates data. When zero results are found, it invents products. It fails silently, and users don’t know.

Gap 5: No Debugging

You can’t trace why the bot gave a wrong answer. The prompt is a black box. “It just doesn’t work” is the only diagnostic.

The pattern: Every gap is a missing layer.

What LLMs Are Good At vs. Not Good At

┌──────────────────────────────────────────────────────────┐
│  LLM STRENGTHS (Flexible, Probabilistic)                 │
├──────────────────────────────────────────────────────────┤
│  ✓ Understanding natural language intent                 │
│  ✓ Extracting structure from unstructured text           │
│  ✓ Classifying into categories                           │
│  ✓ Generating natural, conversational responses          │
│  ✓ Handling synonyms and vague terms                     │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│  LLM WEAKNESSES (Where Code Must Take Over)              │
├──────────────────────────────────────────────────────────┤
│  ✗ Remembering previous context                          │
│  ✗ Validating against your schema                        │
│  ✗ Enforcing constraints (price < 500 means price < 500) │
│  ✗ Querying databases with precision                     │
│  ✗ Handling edge cases deterministically                 │
│  ✗ Providing traceable, debuggable decisions             │
└──────────────────────────────────────────────────────────┘
💡 The Design Principle
Use LLMs where they’re strong (understanding, extraction, generation). Use code where it’s strong (validation, logic, persistence).

The Architecture: The 8-Layer System

We rebuilt the agent as a layered system. Each layer has one job, and the LLM is constrained — not trusted — at every step.

The Layer Roadmap

┌─────────────────────────────────────────────────────────┐
│  LAYER 0: FOUNDATION (This Issue)                       │
│  Goal: Architecture & Strategy                          │
│  Key Insight: Agents are systems, not prompts           │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  LAYER 1: DATA (Issue 2)                                │
│  Goal: Schema & Field Registry                          │
│  Key Insight: LLMs need explicit field definitions      │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  LAYER 2: INGESTION (Issue 3)                           │
│  Goal: Clean Data Pipeline                              │
│  Key Insight: Garbage in, hallucination out             │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  LAYER 3: INTENT (Issue 4)                              │
│  Goal: Classification & Routing                         │
│  Key Insight: Classify first, execute second            │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  LAYER 4: FILTERS (Issue 5)                             │
│  Goal: NL → Structured Query                            │
│  Key Insight: Extract → Validate → Clamp                │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  LAYER 5: MEMORY (Issue 6)                              │
│  Goal: Context & State Persistence                      │
│  Key Insight: Memory must be structured, not text       │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  LAYER 6: RANKING (Issue 7)                             │
│  Goal: Sorting & Scoring                                │
│  Key Insight: "Best" needs explicit definitions         │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  LAYER 7: PRODUCT DEEP-DIVE (Issue 8)                   │
│  Goal: Hybrid RAG + Structured                          │
│  Key Insight: SQL for specs, RAG for reviews            │
└─────────────────────────────────────────────────────────┘

Each issue in this series will deep-dive one layer. But here’s how they connect in a single request.

How the Layers Connect

USER: "Find cheap laptops with good battery"

┌─────────────────────────────────────────────────────────┐
│ INTENT CLASSIFICATION (Layer 3)                         │
│ Classify: NEW_SEARCH                                    │
│ Method: LLM (200 tokens, fast model)                    │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ FILTER EXTRACTION (Layer 4)                             │
│ Extract: price < 800, battery_life > 8                  │
│ Method: LLM with field definitions injected             │
│ Validate: ✓ Both fields exist in registry               │
│ Clamp: ✓ Values within valid ranges                     │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ SEARCH & RANKING (Layer 1 + 6)                          │
│ Query database with validated filters                   │
│ Sort by default for laptops (value rating)              │
│ Found: 5 products                                       │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ MEMORY & CONTEXT (Layer 5)                              │
│ Store: 5 products in session context                    │
│ Update: Turn count = 1                                  │
│ Track: Token usage for cost control                     │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ RESPONSE GENERATION                                     │
│ Format: Template with product list                      │
│ Method: LLM with real data from database                │
│ Return: "Found 5 laptops..."                            │
└─────────────────────────────────────────────────────────┘

Notice: The LLM is used three times — for classification, extraction, and response generation. But at each step, the output is validated, constrained, or templated. The LLM never has direct access to the database. It never decides what’s “true.”


The Implementation: The Agent Pipeline

Here’s the minimal pseudocode for the complete pipeline:

class ProductAgent:
    async def process(self, message: str, session_id: str):
        # 1. Load memory (deterministic)
        context = self.session_store.get(session_id)
        
        # 2. Classify intent (LLM, ~200 tokens)
        intent = await self.classify(message, context)
        
        # 3. Route to handler (deterministic)
        if intent == "NEW_SEARCH":
            # 3a. Extract filters (LLM, ~500 tokens)
            filters = await self.extract_filters(message)
            
            # 3b. Validate filters (deterministic)
            validated = self.validate(filters)
            
            # 3c. Search database (deterministic)
            results = await self.search(validated)
            
            # 3d. Update context (deterministic)
            context.add_results(results)
            
            # 3e. Generate response (LLM, ~1500 tokens)
            return await self.format_response(results)
        
        elif intent == "DEEP_DIVE":
            # Use context to resolve "the first one"
            product = context.resolve_reference(message)
            return await self.describe_product(product)
        
        # ... other handlers

Why this works:

  • Each step has one job
  • LLM only where needed (understanding, extraction, generation)
  • Validation catches errors before they reach users
  • Context enables multi-turn conversation
  • Every step is traceable and debuggable
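Traceability can be as simple as appending one structured event per step; the event schema here is our sketch, not the actual implementation:

```python
import json
import time

def trace(step: str, payload: dict, log: list) -> None:
    """Append one structured trace event per pipeline step.
    Illustrative schema: timestamp, step name, step-specific payload."""
    log.append({"ts": time.time(), "step": step, **payload})

# With a trace like this, "why did the bot answer wrong?" becomes a
# log scan instead of guesswork about a black-box prompt.
log = []
trace("intent", {"label": "NEW_SEARCH"}, log)
trace("filters", {"raw": {"price": "<500"}, "validated": {"price": 500}}, log)
print(json.dumps([e["step"] for e in log]))
```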

Production Realities

Latency Budget (Target: 3000ms total)

Component              Target   Notes
Intent Classification  200ms    Small, fast model
Filter Extraction      500ms    Medium model, structured output
Database Search        200ms    PostgreSQL with indexes
Response Generation    1500ms   Streaming, first token < 500ms
Overhead               300ms    Network, parsing, validation
TOTAL                  ~2700ms  Leaves 300ms buffer
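One way to keep that budget honest is to time each stage against it in production. The helper below is our sketch; the stage names and budget numbers mirror the table:

```python
import time
from contextlib import contextmanager

# Per-stage budgets in milliseconds, mirroring the latency table.
BUDGET_MS = {"intent": 200, "filters": 500, "search": 200, "generate": 1500}

@contextmanager
def stage_timer(name: str, timings: dict):
    """Record a stage's wall-clock time and flag budget overruns."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[name] = elapsed_ms
        if elapsed_ms > BUDGET_MS.get(name, float("inf")):
            # Log, don't crash: budget overruns are signals, not errors.
            print(f"over budget: {name} took {elapsed_ms:.0f}ms")
```

Usage: wrap each pipeline stage in `with stage_timer("intent", timings):` and ship `timings` with the trace log.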

Error Handling Strategy

Error Type        Strategy
Ambiguous Intent  Ask a clarifying question
Validation Fail   Correct the user ("I can't filter by X")
Zero Results      Relax constraints, explain what changed
System Error      Graceful fallback message
💡 Key Insight
Error handling is not an afterthought. It’s designed into each layer. The user should always get a useful response, even when things go wrong.

The Proof: Before/After

Metrics That Changed

Before                                           After
40% of queries returned nothing useful           95% query success rate
22% of responses had hallucinated prices/specs   0% hallucinations (all data from validated DB)
0% user retention after 3 sessions               67% user retention after 3 sessions
0 purchases attributed to the bot                23 purchases in the first month
3 legal complaints per week                      0 legal complaints

What changed: We stopped treating the LLM as magic and started treating it as one component in a deterministic system.


The Checklist: Agent Readiness Scorecard

Before you build an AI agent, assess your readiness:

Item                                                      Layer
Do you have structured data with a defined schema?        Data
Are filterable fields explicitly defined (not guessed)?   Data
Can you enumerate all user intents for your domain?       Intent
Can you validate extracted filters against your schema?   Extraction
Do you handle empty results gracefully (relaxation)?      Search
Do you track session context (fetched products, focus)?   Memory
Is conversation state persisted (survives refresh)?       Memory
Are outputs templated or grounded (not free-form)?        Response

Score Interpretation:

  • 0-3 checked: Don’t build yet. Too many gaps. Fix your data layer first.
  • 4-6 checked: Prototype with human-in-the-loop. Monitor failures closely.
  • 7-8 checked: Production-ready foundation. Start building.

What’s Next

Issue 2: The Data Layer

Now that you understand the architecture, we’ll dive deep into Layer 1: the data layer.

“Our LLM kept inventing field names that didn’t exist. Users would search for ‘laptops with good GPU’ but our database called it ‘graphics_card’. The bot had no way to know.”

What You’ll Learn:

  • How to design schemas that LLMs can understand
  • The field registry pattern (single source of truth)
  • Validation at every boundary
  • When to use typed columns vs JSONB

Key Takeaways

  1. The Wrapper Trap: UI -> LLM -> Text = crashes in production
  2. The System Approach: 8 layers, each with one job, the LLM constrained at every step
  3. The Core Insight: LLMs are probabilistic pattern matchers. They predict likely tokens, not correct answers. You cannot trust them -- you must constrain them with validation, grounding, and structure.
  4. The Bottom Line: Agents are systems, not prompts. Build layers, not wrappers.

Glossary

  • Wrapper: UI -> LLM API -> Text (no structure, no moat, cloneable in a weekend)
  • System: Multiple layers, each with a job, with validation and persistence
  • Intent: What the user wants to do (search, compare, ask question)
  • Session Context: Memory of the conversation (fetched products, selections, focus)
  • Validation: Checking if extracted data is real and within valid ranges
  • Grounding: Ensuring LLM responses are based on real data, not hallucinations
  • Constraint Relaxation: Removing filters when zero results, in priority order

Resources

  • Vaswani et al., "Attention Is All You Need": the transformer paper. Focus on self-attention as context mapping.
  • Andrej Karpathy: explains the "predict next token" engine clearly.
  • Google Research: the ReAct (Reasoning + Acting) pattern for agents.
  • Open source: frameworks for building agent pipelines (though we'll build our own).

Until next issue,

Sentient Zero Labs