Agent Design Fieldbook Issue 2/8

The Data Layer: Your Agent Is Only As Good As Your Data

LLMs will hallucinate field names and values unless you explicitly define what's real through schemas, field registries, and validation boundaries.

Apr 13, 2026 · 11 min read · Sentient Zero Labs

In Issue 1, we showed you the 8-layer architecture. Now we dive into Layer 1: the data layer — the foundation everything else builds on.

Here’s the uncomfortable truth: most agent failures aren’t LLM failures. They’re data failures. The model is doing exactly what you asked — extracting filters from natural language. But it has no idea what fields actually exist in your database.

In this issue:

  • Why LLMs hallucinate field names (and how to stop it)
  • The Field Registry pattern (single source of truth)
  • The 3 validation boundaries every agent needs
  • When to use typed columns vs JSONB

Get this layer wrong, and Issues 3-8 won’t save you.

History Anchor: Symbolic AI -> Transformers

Database schema design predates AI entirely — relational models, ORMs, NoSQL flexibility. But LLMs brought a new problem: the model doesn’t know your schema. The field registry pattern is essentially Symbolic AI thinking — explicit, structured knowledge definitions — applied to ground probabilistic models.


The Disasters

Disaster 1: The Invented Field

USER: "Find laptops with good GPU"
  |
  v
LLM EXTRACTION:
{
  "gpu_quality": {"operator": ">", "value": "good"}
}
  |
  v
DATABASE QUERY:
SELECT * FROM products WHERE gpu_quality > 'good'

ERROR: Column "gpu_quality" does not exist
Actual field name: "graphics_card"

The LLM invented a field name that sounded reasonable but didn’t exist. The query crashed. The user got nothing.

Why it happened: The LLM has no idea what your schema looks like. It generated a plausible field name based on the user’s language, not your database structure.

Disaster 2: The Unit Mismatch

USER: "Laptops under 2kg"
  |
  v
LLM EXTRACTION: weight < 2
  |
  v
DATABASE: Stores weight in pounds (lbs)
  |
  v
QUERY: WHERE weight < 2
  |
  v
RESULT: Shows laptops under 2 lbs (0.9kg)
        Misses laptops at 3 lbs (1.36kg) that user wanted

The user asked for laptops under 2kg. They got laptops under 2 pounds — less than half what they wanted.

Why it happened: The LLM correctly extracted “2” but had no way to know your database stores weight in pounds, not kilograms.

Disaster 3: The Range Hallucination

USER: "Cheap laptops"
  |
  v
LLM EXTRACTION: price < 10000000 (hallucinated huge number)
  |
  v
DATABASE: Has laptops from $300 to $5,000
  |
  v
RESULT: Returns ALL laptops (everything is < $10M)

The user asked for “cheap.” The LLM didn’t know what “cheap” means in your domain, so it guessed. Badly.

Why it happened: No range validation. The LLM had no idea that your prices max out at $5,000, so it picked a number that seemed “safe.”

The Pattern

💡 The Core Problem
LLMs are excellent at understanding what users mean. They’re terrible at knowing what your database looks like. You have to bridge that gap explicitly.

The Diagnosis: Why LLMs Can’t Know Your Schema

What LLMs Know vs. Don’t Know

WHAT LLM KNOWS (From Training):
  - "GPU" and "graphics card" mean the same thing
  - "cheap" implies low price
  - "2kg" is a weight measurement
  - Natural language patterns and synonyms

WHAT LLM DOESN'T KNOW (Your System):
  - Your database calls it "graphics_card" not "gpu"
  - Your weight is stored in pounds, not kilograms
  - Your price range is $0-$10,000, not unlimited
  - Which fields are filterable vs display-only
  - What "cheap" means in your specific domain

The Gap

USER INTENT
"Find laptops with good GPU under 2kg"
         |
         v
LLM UNDERSTANDING (Semantic)
  - GPU = graphics processing unit
  - good = high quality
  - 2kg = weight constraint
         |
         v
     THE GAP
         |
         v
YOUR DATABASE (Actual Schema)
  - Field: "graphics_card" (not "gpu")
  - Values: "NVIDIA RTX 4060", "AMD Radeon" (not "good")
  - Weight: stored in lbs (not kg)

The solution: Bridge the gap with a Field Registry — an explicit definition of what fields exist, what they’re called, and what values are valid.


The Field Registry Pattern

What Is a Field Registry?

A Field Registry is a single source of truth that tells the LLM:

  1. What fields exist in your database
  2. What they’re actually called (and their aliases)
  3. What type of data they hold
  4. What range of values is valid
  5. Whether they can be filtered/sorted

Enterprise engineers might call this a Semantic Layer or Data Catalog. We call it a Field Registry. The purpose is the same: grounding the LLM in your actual schema.

Why not just dump the schema into the prompt? Context window limits. A 500-table database schema won’t fit. Even if it did, the LLM gets confused with too many options. The Field Registry lets you inject only the relevant fields for each query — this is called schema scoping, and it’s a key scalability benefit.
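Schema scoping can be sketched as a small selector: given the user's query, keep only the registry entries whose name or aliases appear in it, and render just those into the prompt. This is a minimal sketch under that assumption; `scope_fields` and `render_for_prompt` are illustrative names, and the naive substring matching would be replaced by real tokenization in production.

```python
# Sketch of schema scoping: inject only the registry entries relevant to
# this query, instead of the full schema. (Trimmed example registry;
# names are illustrative, not from a specific library.)

field_registry = {
    "price": {"aliases": ["cost", "budget", "cheap"], "data_type": "numeric", "unit": "USD"},
    "weight": {"aliases": ["mass", "lightweight"], "data_type": "numeric", "unit": "kg"},
    "graphics_card": {"aliases": ["gpu", "graphics"], "data_type": "text"},
}

def scope_fields(user_query, registry):
    """Return only the fields whose name or an alias appears in the query.

    Naive substring match for brevity; a real system would tokenize.
    """
    words = user_query.lower()
    return {
        name: spec
        for name, spec in registry.items()
        if name in words or any(alias in words for alias in spec["aliases"])
    }

def render_for_prompt(scoped):
    """Render the scoped fields as a compact snippet for the prompt."""
    lines = []
    for name, spec in scoped.items():
        unit = f" ({spec['unit']})" if "unit" in spec else ""
        lines.append(f"- {name}: {spec['data_type']}{unit}")
    return "\n".join(lines)

scoped = scope_fields("find cheap laptops with a good gpu", field_registry)
# "price" matched via alias "cheap", "graphics_card" via "gpu";
# "weight" is excluded, so only two fields reach the prompt
```

With a 500-field registry, a typical query still injects only a handful of fields, which keeps the prompt small and the LLM focused.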

FIELD REGISTRY: Single Source of Truth

For each field, define:

  1. IDENTITY
     - Name: "graphics_card"
     - DB Column: "graphics_card"
     - Aliases: ["gpu", "gpu_quality", "graphics"]

  2. TYPE & UNIT
     - Data Type: "text" / "numeric" / "enum"
     - Unit: "USD", "kg", "hours", etc.

  3. CAPABILITIES
     - Filterable: Yes/No
     - Sortable: Yes/No

  4. VALIDATION
     - Range: min/max values
     - Allowed values: enum list

Example: The Price Field

FIELD: "price"

  Identity:
  - Name: "price"
  - DB Column: "price"
  - Aliases: ["cost", "budget", "amount", "cheap"]

  Type & Unit:
  - Data Type: "numeric"
  - Unit: "USD"

  Capabilities:
  - Filterable: Yes
  - Sortable: Yes

  Validation:
  - Range: $0 - $10,000
  - Vague Mappings:
    - "cheap"      -> price < 800
    - "affordable" -> price < 1200
    - "budget"     -> price < 600

How It’s Used

USER: "Find cheap laptops"
  |
  v
LLM EXTRACTION:
"cheap" -> need to map to price field
  |
  v
FIELD REGISTRY LOOKUP:
  - Check aliases: "cheap" -> "price"
  - Map vague term: "cheap" -> price < 800
  - Validate: 800 is within range [0, 10000]
  |
  v
VALIDATED FILTER:
{"price": {"operator": "<", "value": 800}}

Now “cheap” has a precise meaning. The LLM doesn’t guess — it uses your definition.

💡 Business Logic Injection
These vague term mappings (“cheap” -> price < 800) live in your code or config, not in the prompt itself. This is business logic. When your pricing strategy changes (holiday sale?), you update the config once — every prompt automatically gets the new values.
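The resolution step can be sketched as a tiny lookup: the vague word goes into the registry's config, and a concrete, pre-approved filter comes out. `resolve_vague_term` is an illustrative helper, not from a specific framework; the trimmed registry mirrors the price example above.

```python
# Sketch: resolving a vague term through registry config instead of
# letting the LLM guess a number. Names are illustrative.

field_registry = {
    "price": {
        "db_column": "price",
        "vague_mappings": {
            "cheap": {"operator": "<", "value": 800},
            "affordable": {"operator": "<", "value": 1200},
        },
    }
}

def resolve_vague_term(term, registry):
    """Map a vague word ("cheap") to a concrete filter, or None if unknown."""
    for field, spec in registry.items():
        mapping = spec.get("vague_mappings", {}).get(term)
        if mapping:
            return {"field": spec["db_column"], **mapping}
    return None  # unknown vague term: ask the user, don't guess

filter_ = resolve_vague_term("cheap", field_registry)
# -> {"field": "price", "operator": "<", "value": 800}
```

When the holiday sale redefines “cheap” as $600, you change one number in this config; no prompts are touched.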

The Implementation

Field Registry in Code

# The problem: LLM invents "gpu_quality", database has "graphics_card"
# The solution: Registry maps aliases to real fields

field_registry = {
    "graphics_card": {
        "db_column": "graphics_card",
        "aliases": ["gpu", "gpu_quality", "graphics", "video card"],
        "data_type": "text",
        "filterable": True,
        "sortable": False
    },
    "price": {
        "db_column": "price",
        "aliases": ["cost", "budget", "amount"],
        "data_type": "numeric",
        "unit": "USD",
        "range_min": 0,
        "range_max": 10000,
        "filterable": True,
        "sortable": True,
        "vague_mappings": {
            "cheap": {"operator": "<", "value": 800},
            "affordable": {"operator": "<", "value": 1200},
            "expensive": {"operator": ">", "value": 2000}
        }
    },
    "weight": {
        "db_column": "weight_kg",
        "aliases": ["weight", "mass", "lightweight", "heavy"],
        "data_type": "numeric",
        "unit": "kg",
        "range_min": 0.5,
        "range_max": 5,
        "filterable": True,
        "sortable": True
    }
}

Validation Function

def validate_filter(extracted_field, extracted_value, registry):
    # 1. Resolve alias to real field (case-insensitive: the LLM may emit "GPU")
    needle = extracted_field.lower()
    real_field = None
    for field_name, spec in registry.items():
        if needle == field_name or needle in spec.get("aliases", []):
            real_field = field_name
            break
    
    if not real_field:
        return False, f"Unknown field: {extracted_field}"
    
    spec = registry[real_field]
    
    # 2. Check if filterable
    if not spec.get("filterable"):
        return False, f"Field {real_field} is not filterable"
    
    # 3. Clamp numeric values to the valid range
    #    (use "is not None" so a legitimate bound of 0 still clamps)
    if spec["data_type"] == "numeric":
        if spec.get("range_max") is not None and extracted_value > spec["range_max"]:
            extracted_value = spec["range_max"]
        if spec.get("range_min") is not None and extracted_value < spec["range_min"]:
            extracted_value = spec["range_min"]
    
    return True, {"field": spec["db_column"], "value": extracted_value}
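Once filters have passed validation, turning them into SQL is mechanical. This is a minimal sketch of that last step; `build_where_clause` and `ALLOWED_OPERATORS` are illustrative names, and the input is assumed to have already cleared `validate_filter`, so every field is a real column and every value is in range.

```python
# Sketch: turning validated filters into a parameterized WHERE clause.
# Never interpolate values into the SQL string -- use placeholders so
# the database driver handles escaping (prevents SQL injection).

ALLOWED_OPERATORS = {"<", "<=", ">", ">=", "="}

def build_where_clause(filters):
    clauses, params = [], []
    for f in filters:
        op = f["operator"]
        if op not in ALLOWED_OPERATORS:
            raise ValueError(f"Unsupported operator: {op}")
        # f["field"] is a real column name from the registry, not user input
        clauses.append(f'{f["field"]} {op} %s')
        params.append(f["value"])
    return " AND ".join(clauses), params

sql, params = build_where_clause([
    {"field": "price", "operator": "<", "value": 800},
    {"field": "weight_kg", "operator": "<", "value": 2},
])
# sql == "price < %s AND weight_kg < %s", params == [800, 2]
```

Note the whitelist on operators: the LLM's output never reaches the SQL string directly, only values that survived both the registry lookup and this check.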

Schema Design: Typed Columns vs JSONB

Not all data belongs in typed columns. Here’s how to decide:

When to Use Typed Columns

  • Field is common across all products
  • You need to filter/sort by this field
  • Type is consistent (all prices are numeric)
  • Examples: price, name, brand, created_at
  • Benefits: fast queries (indexed), type safety (database enforces), clear schema

When to Use JSONB

  • Fields vary by product category
  • Vendor-specific data you need to preserve
  • Display-only (not filtered or sorted)
  • Examples: specs, vendor_data, full_data
  • Benefits: flexibility (schema can evolve), preserve vendor data as-is, no migration needed for new fields

Example Schema

CREATE TABLE product (
    -- Typed columns (common, filterable)
    id BIGSERIAL PRIMARY KEY,
    sku TEXT NOT NULL UNIQUE,
    name TEXT NOT NULL,
    price NUMERIC(10,2),
    weight_kg NUMERIC(5,2),
    brand TEXT,
    category TEXT,
    graphics_card TEXT,    -- The field LLM kept inventing as "gpu_quality"
    created_at TIMESTAMP DEFAULT now(),
    
    -- JSONB (variable, vendor-specific)
    specs JSONB,           -- {"screen_size": 15.6, "ram_gb": 16}
    vendor_data JSONB,     -- Vendor-specific fields
    full_data JSONB        -- Original data for debugging
);

-- Index typed columns you'll filter by
CREATE INDEX idx_price ON product(price);
CREATE INDEX idx_brand ON product(brand);
CREATE INDEX idx_graphics ON product(graphics_card);

-- Index JSONB fields you'll query (if any)
CREATE INDEX idx_specs ON product USING GIN (specs);

The pattern: Typed columns for what you filter. JSONB for everything else.
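When you do need to query a JSONB field, Postgres's containment operator (`@>`) can use the GIN index above. A minimal sketch of building such a query from application code; `jsonb_containment_query` is an illustrative helper, and the table and column names follow the example schema.

```python
# Sketch: filtering on a JSONB column with a containment query.
# The GIN index on "specs" makes @> lookups fast.
import json

def jsonb_containment_query(column, subdoc):
    """Build a parameterized containment query against a JSONB column.

    `column` must come from a trusted whitelist (e.g. the field registry),
    never from user input -- only the value goes through a placeholder.
    """
    sql = f"SELECT sku, name FROM product WHERE {column} @> %s::jsonb"
    return sql, [json.dumps(subdoc)]

sql, params = jsonb_containment_query("specs", {"ram_gb": 16})
# sql == "SELECT sku, name FROM product WHERE specs @> %s::jsonb"
```

This finds every product whose `specs` document contains `"ram_gb": 16`, without ever promoting `ram_gb` to a typed column.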

The Three Validation Boundaries

Every data point in your system crosses three boundaries. Validate at each one.

Boundary 1: Input Validation (Vendor -> Database)

When data comes from vendors:

  • Required fields present?
  • Types correct? (price is number, not text)
  • Units normalized? (all weights in kg)
  • Ranges valid? (price > 0)
  • Reject: Bad data never enters database
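The checks above can be sketched as a single gate function that every vendor record passes through before an INSERT. This is a minimal sketch, not a full ingestion pipeline; `validate_vendor_record` and the field names are illustrative, and the lbs-to-kg conversion assumes the database standardizes on kilograms as in the example schema.

```python
# Sketch of Boundary 1: validate and normalize a vendor record before it
# enters the database. Bad records are rejected with a reason.

LBS_TO_KG = 0.453592

def validate_vendor_record(record):
    """Return (ok, normalized_record) or (False, error_message)."""
    # Required fields present?
    for field in ("sku", "name", "price"):
        if field not in record:
            return False, f"Missing required field: {field}"

    # Types correct? (price is a number, not text like "$1,299")
    if not isinstance(record["price"], (int, float)):
        return False, f"Price must be numeric, got {record['price']!r}"

    # Ranges valid?
    if record["price"] <= 0:
        return False, f"Price must be positive, got {record['price']}"

    # Units normalized? (this database stores all weights in kg)
    normalized = dict(record)
    if "weight_lbs" in normalized:
        normalized["weight_kg"] = round(normalized.pop("weight_lbs") * LBS_TO_KG, 2)
    return True, normalized

ok, rec = validate_vendor_record(
    {"sku": "LT-1", "name": "UltraBook", "price": 999, "weight_lbs": 3.0}
)
# -> ok is True, rec["weight_kg"] == 1.36; the 3 lb laptop from Disaster 2
#    now stores correctly as 1.36 kg
```

This is exactly the fix for Disaster 2: units are normalized once, at ingestion, so the LLM and the query layer only ever see kilograms.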

Boundary 2: Extraction Validation (LLM -> Query)

When LLM extracts filters:

  • Field exists in registry?
  • Field is filterable?
  • Value in valid range?
  • Operator supported?
  • Fix: Clamp values, reject unknown fields

Boundary 3: Output Validation (Query -> User)

Before showing to user:

  • Required fields present?
  • Prices formatted correctly?
  • No null values in critical fields?
  • Fix: Filter out invalid results
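The output checks above can be sketched as one pass over the result set before rendering. A minimal sketch under the example schema's field names; `validate_results` and `REQUIRED_OUTPUT_FIELDS` are illustrative, and a real system would also log the rows it drops.

```python
# Sketch of Boundary 3: drop rows that would look broken to the user
# and format prices for display.

REQUIRED_OUTPUT_FIELDS = ("name", "price")

def validate_results(rows):
    """Keep only rows safe to display: critical fields present and non-null."""
    clean = []
    for row in rows:
        if any(row.get(f) is None for f in REQUIRED_OUTPUT_FIELDS):
            continue  # null/missing critical field: hide the row
        clean.append({**row, "price_display": f"${row['price']:,.2f}"})
    return clean

rows = validate_results([
    {"name": "UltraBook", "price": 999},
    {"name": "MysteryBook", "price": None},  # dropped: null price
])
# -> one row, with price_display "$999.00"
```
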

💡 The Key Insight
If you validate at all three boundaries, bad data can’t reach users. Not from vendors. Not from the LLM. Not from the database.

The Proof: Before/After

Before -> After

  • 35% of queries failed (unknown field names) -> 98% query success rate
  • 18% returned wrong results (unit mismatches) -> 0% wrong results (units normalized at ingestion)
  • 12% returned everything (no range validation) -> 0% unbounded queries (ranges validated at extraction)
  • 0 vendor-specific fields filterable -> 23 vendor-specific fields now filterable (via aliases)

What changed: We stopped hoping the LLM would guess our schema and started telling it explicitly.


The Checklist: Data Layer Readiness

Score one point for each “yes”:

  • Schema: Is your database schema documented?
  • Fields: Are filterable fields explicitly listed?
  • Aliases: Do you map common aliases to real fields?
  • Units: Are all units normalized to one standard?
  • Ranges: Do you validate min/max values?
  • Types: Do you enforce data types?
  • JSONB: Is JSONB used only for variable data?
  • Validation: Do you validate at all 3 boundaries?

Score Interpretation:

  • 0-3: Data layer not ready. Stop here.
  • 4-6: Prototype-ready. Proceed with caution.
  • 7-8: Production-ready. Move to Issue 3.

What’s Next

Issue 3: The Ingestion Layer

Now that you have a schema and field registry, how do you get messy vendor data into it?

“Vendor A sends prices in EUR. Vendor B sends them in USD. Vendor C sends them in cents. Vendor D sends them as text with dollar signs. How do we normalize this chaos?”

What You’ll Learn:

  • Script-based ingestion (why we chose it over Airflow)
  • Normalization strategies (units, nulls, duplicates)
  • Error handling patterns
  • Why full_data JSONB saves your life

Key Takeaways

  1. The Problem: LLMs hallucinate field names, units, and ranges because they don't know your schema.
  2. The Solution: Field Registry (single source of truth) + 3 Validation Boundaries (input, extraction, output) + Typed columns for filtering, JSONB for flexibility.
  3. Key Takeaway: Your agent is only as good as your data. Define it explicitly.

Glossary

  • Schema: Structure of your data (what fields exist, what types)
  • Field Registry: Explicit list of filterable/sortable fields with metadata (also called Semantic Layer)
  • Semantic Layer: Enterprise term for Field Registry; a metadata layer that defines what data means
  • Schema Scoping: Injecting only relevant fields into prompts instead of the full schema
  • Typed Column: Database column with specific type (price NUMERIC)
  • JSONB: Flexible JSON storage for variable/vendor-specific data
  • Validation Boundary: Point where data crosses systems (vendor->DB, LLM->query, query->user)
  • Aliases: Different names for the same field (“gpu” = “graphics_card”)
  • Vague Mapping: Translation of imprecise terms (“cheap” -> price < 800)
  • Business Logic Injection: Storing domain rules (like “cheap” = $800) in code/config, not prompts

Until next issue,

Sentient Zero Labs