Most AI agents that work in demos crash in production. The gap isn’t the model — it’s the architecture.
We learned this the hard way. We built a “wrapper” — a thin UI around GPT-4 — and launched it. It worked perfectly for me and my co-founder. Then we let 100 real users in.
It was a bloodbath.
Hallucinated prices. Forgotten conversations. A $2,100 overnight API bill. All in the first month.
This series documents how we rebuilt our “demo toy” into a production system serving 50k+ queries. We made every mistake so you don’t have to.
In this first issue:
- Why wrappers fail in production (and our $2,000 lesson)
- The 8-layer architecture that actually survives real users
- The “Agent Readiness” scorecard to assess your system
History Anchor: Symbolic AI -> Modern Agent Architectures
Early AI systems were rule-based expert systems — deterministic, brittle, but predictable. Modern agents use LLMs for reasoning but still need the same deterministic guardrails those early systems had. The 8-layer architecture is the bridge: probabilistic intelligence wrapped in deterministic engineering.
The Disaster (Weeks 1-4)
Week 1: The Price Disaster
┌──────────────────────────────────────────────────────────┐
│ USER SEARCH: "Find me cheap laptops under $500" │
└──────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────┐
│ BOT RECOMMENDS: │
│ 1. MacBook Pro 16" - $5,299 │
│ 2. Dell XPS 15 - $2,499 │
│ 3. HP Spectre - $1,899 │
└──────────────────────────────────────────────────────────┘
The user asked for laptops under $500. The bot recommended a $5,299 MacBook Pro.
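The kind of deterministic check that would have caught this fits in a few lines. This is a sketch, not our actual code; `extract_price_cap` and its fallback values are illustrative assumptions:

```python
# Hypothetical Layer-4 guard: decide the price cap from what the user
# actually said, instead of trusting the model's guess.
import re

def extract_price_cap(message: str, llm_guess: float, default_cap: float = 800) -> float:
    """Explicit dollar amounts win; then a bounded model guess; then a default."""
    match = re.search(r"\$\s?(\d+)", message)
    if match:
        return float(match.group(1))   # "$500" means 500, period
    if 0 < llm_guess <= default_cap:
        return llm_guess               # model guess, but only within sane bounds
    return default_cap                 # vague terms like "cheap" get the default
```

With this guard, "under $500" can never produce a $5,299 recommendation, no matter what the model extracts.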
The fix (Layer 4): map vague terms like "cheap" to a bounded default (price < 800) and validate that an explicit $500 means price < 500, not "whatever the model thinks."
Week 2: The Memory Disaster
The bot forgot its own recommendations from 30 seconds ago.
Week 3: The Hallucination Disaster
The bot confidently stated a price that didn't exist; the actual price in our database was $1,899. Our legal team flagged us for false advertising.
Week 4: The Loop (The $2k Night)
We had a bug where the bot kept retrying a failed query. The error handler called the same function that failed, which failed again, which triggered the error handler again. Recursive error handling with no max_retries limit. Because we had no token tracking circuit (Layer 5), it ran all night.
Result: We woke up to a $2,100 OpenAI bill for a single user session.
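The two missing guardrails are small. A minimal sketch, assuming a hypothetical `call_llm` function and illustrative token numbers:

```python
# Sketch: bounded retries plus a hard per-session token budget, so a
# failing call can never loop all night.
class TokenBudgetExceeded(Exception):
    pass

class SessionGuard:
    def __init__(self, max_retries: int = 3, max_session_tokens: int = 50_000):
        self.max_retries = max_retries
        self.max_session_tokens = max_session_tokens
        self.tokens_used = 0

    def call_with_limits(self, call_llm, prompt: str, est_tokens: int):
        for _ in range(self.max_retries):
            if self.tokens_used + est_tokens > self.max_session_tokens:
                raise TokenBudgetExceeded("hard per-session cap hit")
            self.tokens_used += est_tokens
            try:
                return call_llm(prompt)
            except Exception:
                continue               # bounded retry, never recursive
        return None                    # give up; graceful fallback upstream
```

Even if every call fails, spend is capped at max_retries calls or the session budget, whichever trips first.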
The fix: error handlers never call back into the code path that just failed, every retry is bounded by max_retries, and every session has a hard token limit that trips a circuit breaker.
The Realization
We had a demo that worked in controlled tests. We didn’t have a system that could handle real users.
The LLM was excellent at understanding intent and generating natural responses. But we were trusting it with jobs it couldn’t do: remembering context, validating constraints, and grounding in real data.
The reliability math: In traditional software, individual components are 99.9% reliable. In AI systems, the core component (the LLM) is maybe 80% reliable for any given task, and unreliability compounds: chain three 80% steps and the pipeline succeeds barely half the time. You cannot build a 99% reliable system from 80% reliable parts unless you add validation layers, fallbacks, and constraints. That's what the 8-layer architecture provides.
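The arithmetic is worth working out. The sketch below assumes validation detects a bad output perfectly and allows one retry per step, which is optimistic but shows the direction of the effect:

```python
# Worked reliability math: three chained 80%-reliable LLM steps, raw
# versus wrapped in a validate-and-retry layer.
def chain_reliability(p_step: float, n_steps: int) -> float:
    """Probability every step in a chain succeeds."""
    return p_step ** n_steps

def with_one_retry(p_step: float) -> float:
    """Per-step reliability if validation catches a failure and retries once."""
    return p_step + (1 - p_step) * p_step

naive = chain_reliability(0.80, 3)                      # ~0.51: a coin flip
guarded = chain_reliability(with_one_retry(0.80), 3)    # ~0.88: usable
```

Raw chaining collapses to roughly 51%; one validated retry per step lifts the same pipeline to roughly 88%, and stacking fallbacks pushes it further.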
The Diagnosis: Why Wrappers Fail
What We Built (The Wrapper)
┌─────────────────────────────────────────────────────────┐
│ THE WRAPPER │
│ (What We Built) │
├─────────────────────────────────────────────────────────┤
│ │
│ User Input → LLM API Call → Text Output │
│ │
│ "Find cheap laptops" │
│ ↓ │
│ [Send to GPT-4] │
│ ↓ │
│ "Here are some options..." │
│ │
└─────────────────────────────────────────────────────────┘
This architecture has five fatal gaps:
Gap 1: No Memory
Each message is isolated. The bot can’t remember previous results. “Tell me about the first one” becomes “Which one?”
Gap 2: No Validation
The bot can’t verify if extracted data is real. “Cheap” means whatever the model guesses. The LLM invents field names that don’t exist in your schema.
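Closing this gap means checking every extracted filter against an explicit registry. A minimal sketch, with a hypothetical two-field registry standing in for a real schema:

```python
# Sketch: validate extracted filters against an explicit field registry,
# so invented field names are rejected instead of queried.
FIELD_REGISTRY = {                     # illustrative schema fragment
    "price": {"min": 0, "max": 10_000},
    "battery_life": {"min": 0, "max": 30},
}

def validate_filters(raw: dict) -> tuple[dict, list[str]]:
    """Keep only filters whose field exists; clamp values into range."""
    valid, rejected = {}, []
    for name, value in raw.items():
        spec = FIELD_REGISTRY.get(name)
        if spec is None:
            rejected.append(name)      # e.g. an LLM-invented "gpu_power"
            continue
        valid[name] = min(max(value, spec["min"]), spec["max"])
    return valid, rejected
```

The rejected list matters as much as the valid dict: it feeds the "I can't filter by X" correction instead of a silent wrong query.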
Gap 3: No Grounding
The LLM doesn’t query your actual database. It makes up prices, specs, and availability based on training data. It sounds confident, but it’s wrong.
Gap 4: No Fallback
When the database is down, the bot hallucinates data. When zero results are found, it invents products. It fails silently, and users don’t know.
Gap 5: No Debugging
You can’t trace why the bot gave a wrong answer. The prompt is a black box. “It just doesn’t work” is the only diagnostic.
The pattern: Every gap is a missing layer.
What LLMs Are Good At vs. Not Good At
┌──────────────────────────────────────────────────────────┐
│ LLM STRENGTHS (Flexible, Probabilistic) │
├──────────────────────────────────────────────────────────┤
│ ✓ Understanding natural language intent │
│ ✓ Extracting structure from unstructured text │
│ ✓ Classifying into categories │
│ ✓ Generating natural, conversational responses │
│ ✓ Handling synonyms and vague terms │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ LLM WEAKNESSES (Where Code Must Take Over) │
├──────────────────────────────────────────────────────────┤
│ ✗ Remembering previous context │
│ ✗ Validating against your schema │
│ ✗ Enforcing constraints (price < 500 means price < 500) │
│ ✗ Querying databases with precision │
│ ✗ Handling edge cases deterministically │
│ ✗ Providing traceable, debuggable decisions │
└──────────────────────────────────────────────────────────┘
The Architecture: The 8-Layer System
We rebuilt the agent as a layered system. Each layer has one job, and the LLM is constrained — not trusted — at every step.
The Layer Roadmap
┌─────────────────────────────────────────────────────────┐
│ LAYER 0: FOUNDATION (This Issue) │
│ Goal: Architecture & Strategy │
│ Key Insight: Agents are systems, not prompts │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 1: DATA (Issue 2) │
│ Goal: Schema & Field Registry │
│ Key Insight: LLMs need explicit field definitions │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 2: INGESTION (Issue 3) │
│ Goal: Clean Data Pipeline │
│ Key Insight: Garbage in, hallucination out │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 3: INTENT (Issue 4) │
│ Goal: Classification & Routing │
│ Key Insight: Classify first, execute second │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 4: FILTERS (Issue 5) │
│ Goal: NL → Structured Query │
│ Key Insight: Extract → Validate → Clamp │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 5: MEMORY (Issue 6) │
│ Goal: Context & State Persistence │
│ Key Insight: Memory must be structured, not text │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 6: RANKING (Issue 7) │
│ Goal: Sorting & Scoring │
│ Key Insight: "Best" needs explicit definitions │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ LAYER 7: PRODUCT DEEP-DIVE (Issue 8) │
│ Goal: Hybrid RAG + Structured │
│ Key Insight: SQL for specs, RAG for reviews │
└─────────────────────────────────────────────────────────┘
Each issue in this series will deep-dive one layer. But here’s how they connect in a single request.
How the Layers Connect
USER: "Find cheap laptops with good battery"
↓
┌─────────────────────────────────────────────────────────┐
│ INTENT CLASSIFICATION (Layer 3) │
│ Classify: NEW_SEARCH │
│ Method: LLM (200 tokens, fast model) │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ FILTER EXTRACTION (Layer 4) │
│ Extract: price < 800, battery_life > 8 │
│ Method: LLM with field definitions injected │
│ Validate: ✓ Both fields exist in registry │
│ Clamp: ✓ Values within valid ranges │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ SEARCH & RANKING (Layer 1 + 6) │
│ Query database with validated filters │
│ Sort by default for laptops (value rating) │
│ Found: 5 products │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ MEMORY & CONTEXT (Layer 5) │
│ Store: 5 products in session context │
│ Update: Turn count = 1 │
│ Track: Token usage for cost control │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ RESPONSE GENERATION │
│ Format: Template with product list │
│ Method: LLM with real data from database │
│ Return: "Found 5 laptops..." │
└─────────────────────────────────────────────────────────┘
Notice: The LLM is used three times — for classification, extraction, and response generation. But at each step, the output is validated, constrained, or templated. The LLM never has direct access to the database. It never decides what’s “true.”
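Constraining the first of those three LLM calls takes only a few lines. A sketch, assuming a hypothetical `classify_llm` stand-in for the small, fast model; the intent set extends the NEW_SEARCH/DEEP_DIVE examples and is not exhaustive:

```python
# Sketch: the classifier's raw label is checked against an enumerated
# set, so a free-form answer can never route a request.
ALLOWED_INTENTS = {"NEW_SEARCH", "DEEP_DIVE", "COMPARE", "QUESTION"}

def classify_intent(message: str, classify_llm) -> str:
    label = classify_llm(message).strip().upper()
    # anything outside the enum falls back to asking a clarifying question
    return label if label in ALLOWED_INTENTS else "CLARIFY"
```

The fallback is deliberate: an unrecognized label becomes a clarifying question, not a guess.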
The Implementation: The Agent Pipeline
Here’s the minimal pseudocode for the complete pipeline:
class ProductAgent:
    async def process(self, message: str, session_id: str):
        # 1. Load memory (deterministic)
        context = self.session_store.get(session_id)

        # 2. Classify intent (LLM, ~200 tokens)
        intent = await self.classify(message, context)

        # 3. Route to handler (deterministic)
        if intent == "NEW_SEARCH":
            # 3a. Extract filters (LLM, ~500 tokens)
            filters = await self.extract_filters(message)
            # 3b. Validate filters (deterministic)
            validated = self.validate(filters)
            # 3c. Search database (deterministic)
            results = await self.search(validated)
            # 3d. Update context (deterministic)
            context.add_results(results)
            # 3e. Generate response (LLM, ~1500 tokens)
            return await self.format_response(results)

        elif intent == "DEEP_DIVE":
            # Use context to resolve "the first one"
            product = context.resolve_reference(message)
            return await self.describe_product(product)

        # ... other handlers
Why this works:
- Each step has one job
- LLM only where needed (understanding, extraction, generation)
- Validation catches errors before they reach users
- Context enables multi-turn conversation
- Every step is traceable and debuggable
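The context object the pipeline leans on is ordinary structured state, not a transcript. A minimal sketch of the idea; the names and the ordinal lookup are illustrative, not our actual implementation:

```python
# Sketch: structured (not free-text) session memory, so "the first one"
# resolves deterministically against stored results.
from dataclasses import dataclass, field

ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3, "fifth": 4}

@dataclass
class SessionContext:
    results: list = field(default_factory=list)
    turn_count: int = 0

    def add_results(self, products: list) -> None:
        self.results = products
        self.turn_count += 1

    def resolve_reference(self, message: str):
        """Map 'the first one' to a stored product, or None if unresolvable."""
        for word, index in ORDINALS.items():
            if word in message.lower() and index < len(self.results):
                return self.results[index]
        return None
```

A None return routes to a clarifying question instead of letting the LLM guess which product was meant.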
Production Realities
Latency Budget (Target: 3000ms total)
| Component | Target | Notes |
|---|---|---|
| Intent Classification | 200ms | Small, fast model |
| Filter Extraction | 500ms | Medium model, structured output |
| Database Search | 200ms | PostgreSQL with indexes |
| Response Generation | 1500ms | Streaming, first token < 500ms |
| Overhead | 300ms | Network, parsing, validation |
| TOTAL | ~2700ms | Leaves 300ms buffer |
Error Handling Strategy
| Error Type | Strategy |
|---|---|
| Ambiguous Intent | Ask clarifying question |
| Validation Fail | Correct user ("I can't filter by X") |
| Zero Results | Relax constraints, explain what changed |
| System Error | Graceful fallback message |
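The zero-results row deserves a sketch, since relaxation order is a design decision. The priority order and `run_query` below are illustrative assumptions:

```python
# Sketch: constraint relaxation for zero-result searches. Filters are
# dropped in a fixed priority order until something matches, and the
# dropped list is returned so the response can explain what changed.
RELAX_ORDER = ["battery_life", "brand", "price"]   # least important first (assumed)

def search_with_relaxation(filters: dict, run_query):
    dropped = []
    current = dict(filters)
    while True:
        results = run_query(current)
        if results or not current:
            return results, dropped
        for name in RELAX_ORDER:
            if name in current:
                del current[name]      # relax one constraint at a time
                dropped.append(name)
                break
        else:
            return results, dropped    # nothing left to relax
```

Returning the dropped filters is the point: "I relaxed the battery requirement" is honest; silently inventing products is not.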
The Proof: Before/After
Metrics That Changed
| Before | After |
|---|---|
| 40% of queries returned nothing useful | 95% query success rate |
| 22% of responses had hallucinated prices/specs | 0% hallucinations (all data from validated DB) |
| 0% user retention after 3 sessions | 67% user retention after 3 sessions |
| 0 successful purchases attributed to bot | 23 purchases in first month |
| Legal complaints: 3/week | Legal complaints: 0 |
What changed: We stopped treating the LLM as magic and started treating it as one component in a deterministic system.
The Checklist: Agent Readiness Scorecard
Before you build an AI agent, assess your readiness:
| Item | Category |
|---|---|
| Do you have structured data with a defined schema? | Data |
| Are filterable fields explicitly defined (not guessed)? | Data |
| Can you enumerate all user intents for your domain? | Intent |
| Can you validate extracted filters against your schema? | Extraction |
| Do you handle empty results gracefully (relaxation)? | Search |
| Do you track session context (fetched products, focus)? | Memory |
| Is conversation state persisted (survives refresh)? | Memory |
| Are outputs templated or grounded (not free-form)? | Response |
Score Interpretation:
- 0-3 checked: Don’t build yet. Too many gaps. Fix your data layer first.
- 4-6 checked: Prototype with human-in-the-loop. Monitor failures closely.
- 7-8 checked: Production-ready foundation. Start building.
What’s Next
Issue 2: The Data Layer
Now that you understand the architecture, we’ll dive deep into Layer 1: the data layer.
“Our LLM kept inventing field names that didn’t exist. Users would search for ‘laptops with good GPU’ but our database called it ‘graphics_card’. The bot had no way to know.”
What You’ll Learn:
- How to design schemas that LLMs can understand
- The field registry pattern (single source of truth)
- Validation at every boundary
- When to use typed columns vs JSONB
Key Takeaways
1. The Wrapper Trap: UI -> LLM -> Text = crashes in production.
2. The System Approach: 8 layers, each with one job, the LLM constrained at every step.
3. The Core Insight: LLMs are probabilistic pattern matchers. They predict likely tokens, not correct answers. You cannot trust them; you must constrain them with validation, grounding, and structure.
4. The Bottom Line: Agents are systems, not prompts. Build layers, not wrappers.
Glossary
- Wrapper: UI -> LLM API -> Text (no structure, no moat, cloneable in a weekend)
- System: Multiple layers, each with a job, with validation and persistence
- Intent: What the user wants to do (search, compare, ask question)
- Session Context: Memory of the conversation (fetched products, selections, focus)
- Validation: Checking if extracted data is real and within valid ranges
- Grounding: Ensuring LLM responses are based on real data, not hallucinations
- Constraint Relaxation: Removing filters when zero results, in priority order
Until next issue,
Sentient Zero Labs