In Issue 1, we showed you the 8-layer architecture. Now we dive into Layer 1: the data layer — the foundation everything else builds on.
Here’s the uncomfortable truth: most agent failures aren’t LLM failures. They’re data failures. The model is doing exactly what you asked — extracting filters from natural language. But it has no idea what fields actually exist in your database.
In this issue:
- Why LLMs hallucinate field names (and how to stop it)
- The Field Registry pattern (single source of truth)
- The 3 validation boundaries every agent needs
- When to use typed columns vs JSONB
History Anchor: Symbolic AI -> Transformers
Database schema design predates AI entirely — relational models, ORMs, NoSQL flexibility. But LLMs brought a new problem: the model doesn’t know your schema. The field registry pattern is essentially Symbolic AI thinking — explicit, structured knowledge definitions — applied to ground probabilistic models.
The Disasters
Disaster 1: The Invented Field
```
USER: "Find laptops with good GPU"
        |
        v
LLM EXTRACTION:
{
  "gpu_quality": {"operator": ">", "value": "good"}
}
        |
        v
DATABASE QUERY:
SELECT * FROM products WHERE gpu_quality > 'good'

ERROR: Column "gpu_quality" does not exist
Actual field name: "graphics_card"
```
The LLM invented a field name that sounded reasonable but didn’t exist. The query crashed. The user got nothing.
Why it happened: The LLM has no idea what your schema looks like. It generated a plausible field name based on the user’s language, not your database structure.
Disaster 2: The Unit Mismatch
```
USER: "Laptops under 2kg"
        |
        v
LLM EXTRACTION: weight < 2
        |
        v
DATABASE: Stores weight in pounds (lbs)
        |
        v
QUERY: WHERE weight < 2
        |
        v
RESULT: Shows laptops under 2 lbs (0.9 kg)
        Misses laptops at 3 lbs (1.36 kg) that the user wanted
```
The user asked for laptops under 2kg. They got laptops under 2 pounds — less than half what they wanted.
Why it happened: The LLM correctly extracted “2” but had no way to know your database stores weight in pounds, not kilograms.
Disaster 3: The Range Hallucination
```
USER: "Cheap laptops"
        |
        v
LLM EXTRACTION: price < 10000000 (hallucinated huge number)
        |
        v
DATABASE: Has laptops from $300 to $5,000
        |
        v
RESULT: Returns ALL laptops (everything is < $10M)
```
The user asked for “cheap.” The LLM didn’t know what “cheap” means in your domain, so it guessed. Badly.
Why it happened: No range validation. The LLM had no idea that your prices max out at $5,000, so it picked a number that seemed “safe.”
The Pattern
All three disasters share one root cause: the LLM fills gaps in schema knowledge with plausible-sounding guesses, and nothing in the pipeline checks those guesses against reality.
The Diagnosis: Why LLMs Can’t Know Your Schema
What LLMs Know vs. Don’t Know
WHAT LLM KNOWS (From Training):
- "GPU" and "graphics card" mean the same thing
- "cheap" implies low price
- "2kg" is a weight measurement
- Natural language patterns and synonyms
WHAT LLM DOESN'T KNOW (Your System):
- Your database calls it "graphics_card" not "gpu"
- Your weight is stored in pounds, not kilograms
- Your price range is $0-$10,000, not unlimited
- Which fields are filterable vs display-only
- What "cheap" means in your specific domain
The Gap
```
USER INTENT
"Find laptops with good GPU under 2kg"
        |
        v
LLM UNDERSTANDING (Semantic)
- GPU = graphics processing unit
- good = high quality
- 2kg = weight constraint
        |
        v
    THE GAP
        |
        v
YOUR DATABASE (Actual Schema)
- Field: "graphics_card" (not "gpu")
- Values: "NVIDIA RTX 4060", "AMD Radeon" (not "good")
- Weight: stored in lbs (not kg)
```
The solution: Bridge the gap with a Field Registry — an explicit definition of what fields exist, what they’re called, and what values are valid.
The Field Registry Pattern
What Is a Field Registry?
A Field Registry is a single source of truth that tells the LLM:
- What fields exist in your database
- What they’re actually called (and their aliases)
- What type of data they hold
- What range of values is valid
- Whether they can be filtered/sorted
Why not just dump the schema into the prompt? Context window limits. A 500-table database schema won’t fit. Even if it did, the LLM gets confused with too many options. The Field Registry lets you inject only the relevant fields for each query — this is called schema scoping, and it’s a key scalability benefit.
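Schema scoping can be as simple as filtering the registry before building the prompt. A minimal sketch, assuming a registry shaped like the one later in this issue; the keyword matching here is a stand-in for whatever retrieval you actually use (embeddings, category routing, etc.):

```python
# Schema scoping sketch: inject only fields relevant to this query,
# not the whole registry. Matching is naive substring search here.

def scope_registry(query: str, registry: dict) -> dict:
    """Return only registry entries whose name or aliases appear in the query."""
    query_lower = query.lower()
    scoped = {}
    for field, spec in registry.items():
        terms = [field] + spec.get("aliases", [])
        if any(term in query_lower for term in terms):
            scoped[field] = spec
    return scoped

registry = {
    "price": {"aliases": ["cheap", "cost", "budget"]},
    "weight": {"aliases": ["lightweight", "heavy"]},
    "brand": {"aliases": ["manufacturer"]},
}

relevant = scope_registry("find cheap laptops", registry)
# Only the "price" spec goes into the prompt; weight and brand stay out.
```

The payoff: the prompt stays small and the LLM only sees fields it is allowed to use for this query.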
```
FIELD REGISTRY: Single Source of Truth

For each field, define:

1. IDENTITY
   - Name: "graphics_card"
   - DB Column: "graphics_card"
   - Aliases: ["gpu", "gpu_quality", "graphics"]

2. TYPE & UNIT
   - Data Type: "text" / "numeric" / "enum"
   - Unit: "USD", "kg", "hours", etc.

3. CAPABILITIES
   - Filterable: Yes/No
   - Sortable: Yes/No

4. VALIDATION
   - Range: min/max values
   - Allowed values: enum list
```
Example: The Price Field
```
FIELD: "price"

Identity:
- Name: "price"
- DB Column: "price"
- Aliases: ["cost", "budget", "amount", "cheap"]

Type & Unit:
- Data Type: "numeric"
- Unit: "USD"

Capabilities:
- Filterable: Yes
- Sortable: Yes

Validation:
- Range: $0 - $10,000
- Vague Mappings:
  - "cheap" -> price < 800
  - "affordable" -> price < 1200
  - "budget" -> price < 600
```
How It’s Used
```
USER: "Find cheap laptops"
        |
        v
LLM EXTRACTION:
"cheap" -> need to map to price field
        |
        v
FIELD REGISTRY LOOKUP:
- Check aliases: "cheap" -> "price"
- Map vague term: "cheap" -> price < 800
- Validate: 800 is within range [0, 10000]
        |
        v
VALIDATED FILTER:
{"price": {"operator": "<", "value": 800}}
```
Now “cheap” has a precise meaning. The LLM doesn’t guess — it uses your definition.
The Implementation
Field Registry in Code
```python
# The problem: LLM invents "gpu_quality", database has "graphics_card"
# The solution: Registry maps aliases to real fields

field_registry = {
    "graphics_card": {
        "db_column": "graphics_card",
        "aliases": ["gpu", "gpu_quality", "graphics", "video card"],
        "data_type": "text",
        "filterable": True,
        "sortable": False
    },
    "price": {
        "db_column": "price",
        "aliases": ["cost", "budget", "amount"],
        "data_type": "numeric",
        "unit": "USD",
        "range_min": 0,
        "range_max": 10000,
        "filterable": True,
        "sortable": True,
        "vague_mappings": {
            "cheap": {"operator": "<", "value": 800},
            "affordable": {"operator": "<", "value": 1200},
            "expensive": {"operator": ">", "value": 2000}
        }
    },
    "weight": {
        "db_column": "weight_kg",
        "aliases": ["weight", "mass", "lightweight", "heavy"],
        "data_type": "numeric",
        "unit": "kg",
        "range_min": 0.5,
        "range_max": 5,
        "filterable": True,
        "sortable": True
    }
}
```
Validation Function
```python
def validate_filter(extracted_field, extracted_value, registry):
    # 1. Resolve alias to real field
    real_field = None
    for field_name, spec in registry.items():
        if extracted_field == field_name or extracted_field in spec.get("aliases", []):
            real_field = field_name
            break
    if not real_field:
        return False, f"Unknown field: {extracted_field}"

    spec = registry[real_field]

    # 2. Check if filterable
    if not spec.get("filterable"):
        return False, f"Field {real_field} is not filterable"

    # 3. Clamp to valid range (numeric fields only)
    # Compare against None, not truthiness, so a legitimate bound of 0
    # (like price's range_min) still triggers the clamp
    if spec["data_type"] == "numeric":
        if spec.get("range_max") is not None and extracted_value > spec["range_max"]:
            extracted_value = spec["range_max"]
        if spec.get("range_min") is not None and extracted_value < spec["range_min"]:
            extracted_value = spec["range_min"]

    return True, {"field": spec["db_column"], "value": extracted_value}
```
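The validation function resolves aliases and clamps ranges, but the registry's vague_mappings still need their own resolution step before validation runs. A self-contained sketch (the helper name and the trimmed registry here are illustrative):

```python
# Resolve a vague term ("cheap") into a concrete filter using the
# registry's vague_mappings, before range validation runs.

registry = {
    "price": {
        "db_column": "price",
        "aliases": ["cost", "budget"],
        "data_type": "numeric",
        "range_min": 0,
        "range_max": 10000,
        "vague_mappings": {
            "cheap": {"operator": "<", "value": 800},
            "expensive": {"operator": ">", "value": 2000},
        },
    }
}

def resolve_vague_term(term: str, registry: dict):
    """Find the field whose vague_mappings define this term, if any."""
    for field, spec in registry.items():
        mapping = spec.get("vague_mappings", {}).get(term)
        if mapping:
            return field, mapping
    return None, None

field, flt = resolve_vague_term("cheap", registry)
# field == "price", flt == {"operator": "<", "value": 800}
```

After this step the filter carries a concrete operator and value, so it can go through validate_filter like any other extraction.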
Schema Design: Typed Columns vs JSONB
Not all data belongs in typed columns. Here’s how to decide:
When to Use Typed Columns
- Field is common across all products
- You need to filter/sort by this field
- Type is consistent (all prices are numeric)
- Examples: price, name, brand, created_at
- Benefits: fast queries (indexed), type safety (database enforces), clear schema
When to Use JSONB
- Fields vary by product category
- Vendor-specific data you need to preserve
- Display-only (not filtered or sorted)
- Examples: specs, vendor_data, full_data
- Benefits: flexibility (schema can evolve), preserve vendor data as-is, no migration needed for new fields
Example Schema
```sql
CREATE TABLE product (
  -- Typed columns (common, filterable)
  id BIGSERIAL PRIMARY KEY,
  sku TEXT NOT NULL UNIQUE,
  name TEXT NOT NULL,
  price NUMERIC(10,2),
  weight_kg NUMERIC(5,2),
  brand TEXT,
  category TEXT,
  graphics_card TEXT,  -- The field the LLM kept inventing as "gpu_quality"
  created_at TIMESTAMP DEFAULT now(),

  -- JSONB (variable, vendor-specific)
  specs JSONB,         -- {"screen_size": 15.6, "ram_gb": 16}
  vendor_data JSONB,   -- Vendor-specific fields
  full_data JSONB      -- Original data for debugging
);

-- Index typed columns you'll filter by
CREATE INDEX idx_price ON product(price);
CREATE INDEX idx_brand ON product(brand);
CREATE INDEX idx_graphics ON product(graphics_card);

-- Index JSONB fields you'll query (if any)
CREATE INDEX idx_specs ON product USING GIN (specs);
```
The pattern: Typed columns for what you filter. JSONB for everything else.
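If you occasionally do need to filter on a JSONB field, containment queries can use a GIN index. A sketch against the schema above (the values are illustrative):

```sql
-- Containment (@>) on specs is index-assisted via the GIN index
SELECT name, price
FROM product
WHERE brand = 'Lenovo'            -- typed column, uses idx_brand
  AND specs @> '{"ram_gb": 16}';  -- JSONB containment, uses idx_specs
```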
The Three Validation Boundaries
Every data point in your system crosses three boundaries. Validate at each one.
Boundary 1: Input Validation (Vendor -> Database)
When data comes from vendors:
- Required fields present?
- Types correct? (price is number, not text)
- Units normalized? (all weights in kg)
- Ranges valid? (price > 0)
- Reject: Bad data never enters database
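The checks above can be sketched as one gate function per vendor feed. The field names and the pounds-to-kilograms rule below are illustrative assumptions, not a fixed schema:

```python
# Boundary 1 sketch: validate and normalize a vendor row before insert.
# A row that fails any check is rejected and never reaches the database.

LB_TO_KG = 0.453592

def ingest_row(row: dict):
    """Return (ok, normalized_row_or_error_message)."""
    # Required fields present?
    for required in ("sku", "name", "price"):
        if not row.get(required):
            return False, f"missing required field: {required}"
    # Types correct? (price must parse as a number)
    try:
        price = float(row["price"])
    except (TypeError, ValueError):
        return False, f"price is not numeric: {row['price']!r}"
    # Ranges valid?
    if price <= 0:
        return False, f"price out of range: {price}"
    # Units normalized: this vendor ships weight in pounds, we store kg
    weight_kg = None
    if row.get("weight_lb") is not None:
        weight_kg = round(float(row["weight_lb"]) * LB_TO_KG, 2)
    return True, {"sku": row["sku"], "name": row["name"],
                  "price": price, "weight_kg": weight_kg}

ok, clean = ingest_row(
    {"sku": "LN-42", "name": "ThinkPad", "price": "999.99", "weight_lb": 3}
)
# ok is True; clean["weight_kg"] == 1.36 -- the 3 lb laptop from Disaster 2
```

Normalizing units here, at ingestion, is exactly what makes the unit-mismatch disaster impossible downstream: by the time the LLM extracts "under 2kg", the column really is in kilograms.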
Boundary 2: Extraction Validation (LLM -> Query)
When LLM extracts filters:
- Field exists in registry?
- Field is filterable?
- Value in valid range?
- Operator supported?
- Fix: Clamp values, reject unknown fields
Boundary 3: Output Validation (Query -> User)
Before showing to user:
- Required fields present?
- Prices formatted correctly?
- No null values in critical fields?
- Fix: Filter out invalid results
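A minimal sketch of this last gate, assuming rows shaped like the product table above (the critical-field list and display format are illustrative):

```python
# Boundary 3 sketch: drop rows with null critical fields and format
# prices before results reach the user.

CRITICAL_FIELDS = ("name", "price")

def validate_output(rows: list) -> list:
    """Filter out invalid results; format what survives."""
    out = []
    for row in rows:
        if any(row.get(f) is None for f in CRITICAL_FIELDS):
            continue  # never show a product missing a name or price
        row = dict(row, price_display=f"${row['price']:,.2f}")
        out.append(row)
    return out

results = validate_output([
    {"name": "ThinkPad", "price": 999.99},
    {"name": "Mystery laptop", "price": None},  # dropped
])
# One row survives, formatted as "$999.99"
```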
The Proof: Before/After
| Before | After |
|---|---|
| 35% of queries failed (unknown field names) | 98% query success rate |
| 18% returned wrong results (unit mismatches) | 0% wrong results (units normalized at ingestion) |
| 12% returned everything (no range validation) | 0% unbounded queries (ranges validated at extraction) |
| 0 vendor-specific fields filterable | 23 vendor-specific fields now filterable (via aliases) |
What changed: We stopped hoping the LLM would guess our schema and started telling it explicitly.
The Checklist: Data Layer Readiness
| # | Item | Category |
|---|---|---|
| 1 | Is your database schema documented? | Schema |
| 2 | Are filterable fields explicitly listed? | Fields |
| 3 | Do you map common aliases to real fields? | Aliases |
| 4 | Are all units normalized to one standard? | Units |
| 5 | Do you validate min/max values? | Ranges |
| 6 | Do you enforce data types? | Types |
| 7 | Is JSONB used only for variable data? | JSONB |
| 8 | Do you validate at all 3 boundaries? | Validation |

Score one point for each "yes."
Score Interpretation:
- 0-3: Data layer not ready. Stop here.
- 4-6: Prototype-ready. Proceed with caution.
- 7-8: Production-ready. Move to Issue 3.
What’s Next
Issue 3: The Ingestion Layer
Now that you have a schema and field registry, how do you get messy vendor data into it?
“Vendor A sends prices in EUR. Vendor B sends them in USD. Vendor C sends them in cents. Vendor D sends them as text with dollar signs. How do we normalize this chaos?”
What You’ll Learn:
- Script-based ingestion (why we chose it over Airflow)
- Normalization strategies (units, nulls, duplicates)
- Error handling patterns
- Why `full_data` JSONB saves your life
Key Takeaways
1. The Problem: LLMs hallucinate field names, units, and ranges because they don't know your schema.
2. The Solution: Field Registry (single source of truth) + 3 Validation Boundaries (input, extraction, output) + typed columns for filtering, JSONB for flexibility.
3. Key Takeaway: Your agent is only as good as your data. Define it explicitly.
Glossary
- Schema: Structure of your data (what fields exist, what types)
- Field Registry: Explicit list of filterable/sortable fields with metadata (also called Semantic Layer)
- Semantic Layer: Enterprise term for Field Registry; a metadata layer that defines what data means
- Schema Scoping: Injecting only relevant fields into prompts instead of the full schema
- Typed Column: Database column with specific type (price NUMERIC)
- JSONB: Flexible JSON storage for variable/vendor-specific data
- Validation Boundary: Point where data crosses systems (vendor->DB, LLM->query, query->user)
- Aliases: Different names for the same field (“gpu” = “graphics_card”)
- Vague Mapping: Translation of imprecise terms (“cheap” -> price < 800)
- Business Logic Injection: Storing domain rules (like “cheap” = $800) in code/config, not prompts
Until next issue,
Sentient Zero Labs