← Technical Series
Agent Design Fieldbook Issue 5/8

The Filter Extraction Layer: From Natural Language to Query

LLMs are excellent at extracting filters from natural language, but terrible at enforcing boundaries -- inject your field registry, validate everything, and relax constraints when needed.

Apr 13, 2026 · 14 min read · Sentient Zero Labs
In this issue (12 sections)

In Issue 4, we built intent classification. Now we dive into what happens when the intent is “search”: converting natural language to precise database queries.

User says: “Find me a lightweight laptop with great battery, not too expensive.”

What does that mean? “Lightweight” — is that under 1kg? 2kg? 3kg? “Great battery” — 5 hours? 10 hours? “Not too expensive” — $500? $1,000? $2,000?

If you let the LLM guess, it will hallucinate numbers. We learned this the hard way.

In this issue:

  • Why LLMs can’t extract filters without field definitions
  • The field-aware extraction pattern (inject your registry)
  • Validation and clamping (reject unknown fields, clamp ranges)
  • Constraint relaxation (what to do when you get zero results)
This is where your field registry (Issue 2) becomes critical.

History Anchor: Symbolic AI -> Tool Use + ReAct

Extracting structured data from natural language is one of AI’s oldest problems. Frame-based dialog systems (1990s) solved it with hand-coded slot definitions — “departure_city,” “arrival_city,” “date.” Our field registry is the same idea, updated for LLMs: explicit definitions that constrain extraction instead of letting the model guess.


The Disasters

Disaster 1: The Vague Terms

USER: "Find me a lightweight laptop with great battery, not too expensive"

VAGUE TERMS:
• "lightweight" = ??? (< 1kg? < 2kg? < 3kg?)
• "great battery" = ??? (> 5h? > 8h? > 12h?)
• "not too expensive" = ??? (< $500? < $1000? < $2000?)

┌──────────────────────────────────────────────────────────┐
│ LLM EXTRACTION (Without Field Definitions):              │
├──────────────────────────────────────────────────────────┤
│ {                                                        │
│   "weight": {"<": 100},        ← Hallucinated nonsense   │
│   "battery": {">": 1000},      ← Impossible value        │
│   "price": {"<": 999999}       ← Unbounded garbage       │
│ }                                                        │
└──────────────────────────────────────────────────────────┘

DATABASE QUERY: WHERE weight < 100 AND battery > 1000 AND price < 999999

RESULT: 0 products (or everything) ❌

The LLM had no idea what reasonable values looked like. So it made them up.

Why it happened
No field definitions. The LLM knows “cheap” means low price, but it doesn’t know that in your domain, prices range from $300 to $5,000.

Disaster 2: The Invented Field

USER: "Find laptops with good GPU performance"

LLM EXTRACTION: {"gpu_performance": {">": "good"}}

DATABASE: ERROR - column "gpu_performance" does not exist

ACTUAL FIELD: "graphics_card" (text field with model names like "RTX 4060")

The LLM invented a field that sounded reasonable. But our database doesn’t have gpu_performance. It has graphics_card.

Why it happened
No field registry injection. The LLM guessed a plausible field name instead of using your actual schema.

Disaster 3: The Zero Results

USER: "Find Dell laptops under $500 with 32GB RAM"

FILTERS: brand = "Dell", price < 500, ram >= 32

DATABASE QUERY: WHERE brand = 'Dell' AND price < 500 AND ram >= 32

RESULT: 0 products ❌

REALITY: No Dell laptop under $500 has 32GB RAM

USER: "Your search sucks!" (leaves)

The filters were valid. The extraction was correct. But no products match those criteria.

Why it happened
No constraint relaxation strategy. When you get zero results, you need to intelligently relax constraints.

The Pattern

LLMs are excellent at understanding intent. They’re terrible at knowing:

  • What fields exist in your database
  • What ranges are valid
  • What vague terms mean in your domain
  • What to do when constraints are impossible
You must tell them explicitly.

The Diagnosis: Why Extraction Alone Fails

What LLMs Can vs. Can’t Do

┌──────────────────────────────────────────────────────────┐
│ WHAT LLM CAN DO (Extraction)                             │
├──────────────────────────────────────────────────────────┤
│ ✓ Understand "cheap" means low price                     │
│ ✓ Understand "lightweight" means low weight              │
│ ✓ Extract multiple constraints from one sentence         │
│ ✓ Handle synonyms ("budget" = "cheap" = "affordable")    │
│ ✓ Map natural language to filter structure               │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│ WHAT LLM CAN'T DO (Boundaries)                           │
├──────────────────────────────────────────────────────────┤
│ ❌ Know your field names ("price" vs "list_price")        │
│ ❌ Know your units (USD vs EUR, kg vs lbs)                │
│ ❌ Know your ranges (price: $0-$10,000, not unlimited)    │
│ ❌ Know what "cheap" means in your domain ($500? $1000?)  │
│ ❌ Enforce constraints (will hallucinate if unsure)       │
└──────────────────────────────────────────────────────────┘

The Gap

USER: "Find cheap laptops"

┌─────────────────────────────────────────────────────────┐
│ WITHOUT FIELD DEFINITIONS                               │
├─────────────────────────────────────────────────────────┤
│ LLM guesses:                                            │
│ {"price": {"<": 10000000}}  ← Unbounded hallucination   │
└─────────────────────────────────────────────────────────┘

RETURNS EVERYTHING (useless)

  vs

┌─────────────────────────────────────────────────────────┐
│ WITH FIELD DEFINITIONS                                  │
├─────────────────────────────────────────────────────────┤
│ Prompt includes:                                        │
│ "price (numeric) [unit: USD] [range: 0-10000]           │
│  [aliases: cost, budget, cheap]                         │
│  [mapping: cheap < 800, affordable < 1200]"             │
│                                                         │
│ LLM extracts:                                           │
│ {"price": {"<": 800}}  ← Grounded in field definition   │
└─────────────────────────────────────────────────────────┘

RETURNS RELEVANT RESULTS
💡 Key Insight
The gap is not intelligence — the LLM is smart enough. The gap is knowledge. It doesn’t know your schema. Inject it.

Field Registry Injection

Telling the LLM What Exists

Your field registry (from Issue 2) becomes critical here. You inject it directly into the extraction prompt.

Field Definition Format

┌─────────────────────────────────────────────────────────┐
│  FIELD DEFINITION (What Goes in Prompt)                 │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  For each filterable field, provide:                    │
│                                                         │
│  1. NAME: Canonical field name                          │
│  2. TYPE: numeric, text, enum                           │
│  3. UNIT: USD, kg, hours, etc.                          │
│  4. RANGE: min and max values                           │
│  5. ALIASES: Alternative names users might say          │
│  6. MAPPINGS: Vague term → specific value               │
│                                                         │
└─────────────────────────────────────────────────────────┘

Example: The Price Field

FIELD: price
├─ Type: numeric
├─ Unit: USD
├─ Range: 0 - 10,000
├─ Aliases: cost, budget, amount
└─ Mappings:
   • "cheap" → < 800
   • "affordable" → < 1200
   • "mid-range" → 1200 - 2500
   • "expensive" → > 2500
   • "premium" → > 4000

Generating Field Prompts from Registry

def get_field_prompt(field_registry: dict) -> str:
    """Generate field definitions for injection into LLM prompt."""
    lines = []
    for name, spec in field_registry.items():
        if not spec.get("filterable"):
            continue
        
        line = f"- {name} ({spec['data_type']})"
        
        if spec.get("unit"):
            line += f" [unit: {spec['unit']}]"
        
        if spec.get("range_min") is not None:
            line += f" [range: {spec['range_min']}-{spec['range_max']}]"
        
        if spec.get("aliases"):
            line += f" [aliases: {', '.join(spec['aliases'])}]"
        
        lines.append(line)
    
    return "\n".join(lines)

# Example output:
# - price (numeric) [unit: USD] [range: 0-10000] [aliases: cost, budget]
# - weight_kg (numeric) [unit: kg] [range: 0.5-5] [aliases: weight, lightweight]
# - battery_life (numeric) [unit: hours] [range: 1-24] [aliases: battery]
# - graphics_card (text) [aliases: gpu, graphics]

The Extraction Prompt

Designing a Field-Aware Prompt

FILTER_PROMPT = """
Extract search filters from the user's query.

AVAILABLE FIELDS (Only use these):
{field_definitions}

OPERATORS:
- = (equals)
- < (less than)
- <= (less than or equal)
- > (greater than)
- >= (greater than or equal)
- between (range)

VAGUE TERM MAPPINGS:
- "cheap" / "budget" → price < 800
- "affordable" → price < 1200
- "lightweight" → weight_kg < 1.5
- "ultralight" → weight_kg < 1.0
- "good battery" / "long battery" → battery_life > 8
- "all-day battery" → battery_life > 12

USER QUERY: "{query}"

Return JSON:
{{
  "filters": {{
    "field_name": {{"operator": "value"}},
    ...
  }},
  "reasoning": "Why these filters were extracted"
}}

RULES:
- ONLY use fields from the list above
- Map vague terms to specific values using the mappings
- If a term is unclear, OMIT it (don't guess)
- Use correct units from field definitions
- Values must be within the field's range
"""

Extraction Flow

USER: "Find cheap laptops with good battery, under 2kg"

┌─────────────────────────────────────────────────────────┐
│ PROMPT INJECTION                                        │
├─────────────────────────────────────────────────────────┤
│ Available fields:                                       │
│ - price (numeric) [unit: USD] [range: 0-10000]          │
│ - weight_kg (numeric) [unit: kg] [range: 0.5-5]         │
│ - battery_life (numeric) [unit: hours] [range: 1-24]    │
│ - brand (text)                                          │
│ - graphics_card (text)                                  │
│                                                         │
│ Vague term mappings:                                    │
│ - "cheap" → price < 800                                 │
│ - "good battery" → battery_life > 8                     │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ LLM EXTRACTION                                          │
├─────────────────────────────────────────────────────────┤
│ {                                                       │
│   "filters": {                                          │
│     "price": {"<": 800},                                │
│     "battery_life": {">": 8},                           │
│     "weight_kg": {"<": 2}                               │
│   },                                                    │
│   "reasoning": "cheap→price<800, good battery→>8h,      │
│                 explicit 'under 2kg'"                   │
│ }                                                       │
└─────────────────────────────────────────────────────────┘

The Extraction Function

async def extract_filters(query: str, field_registry: dict) -> dict:
    """Extract filters from natural language query."""
    # Generate field definitions
    field_defs = get_field_prompt(field_registry)
    
    # Build prompt
    prompt = FILTER_PROMPT.format(
        field_definitions=field_defs,
        query=query
    )
    
    # Call LLM
    response = await llm.generate(
        prompt=prompt,
        temperature=0.1,  # Low temperature for consistency
        max_tokens=300
    )
    
    # Parse JSON response
    result = json.loads(response)
    return result["filters"]
💡 Key Insight
By extracting JSON instead of asking the LLM to write raw SQL, you immunize your agent against SQL injection attacks. A prompt injection that tries to DROP TABLE will fail — the LLM can only output filter JSON, which your code validates and converts to parameterized queries. This is a critical security win.
For schemas with more than 50 fields, the field registry prompt can get large. Use context pruning — only inject fields relevant to the detected intent or component type. This keeps token usage low and improves extraction accuracy.

Validation and Clamping

The Validation Flow

Even with field definitions, LLMs can still return invalid values. Always validate.

LLM EXTRACTION: {"price": {"<": 50000}, "gpu_performance": {">": "good"}}

┌─────────────────────────────────────────────────────────┐
│ STEP 1: CHECK FIELD EXISTS                              │
├─────────────────────────────────────────────────────────┤
│ • "price" in registry? ✓                                │
│ • "gpu_performance" in registry? ❌                      │
│   → REJECT: Unknown field "gpu_performance"             │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ STEP 2: CHECK VALUE IN RANGE                            │
├─────────────────────────────────────────────────────────┤
│ • price: 50000 in range [0, 10000]? ❌                   │
│   → CLAMP: 50000 → 10000                                │
│   → WARN: "Clamped price to max $10,000"                │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ VALIDATED FILTERS                                       │
├─────────────────────────────────────────────────────────┤
│ {                                                       │
│   "price": {"<": 10000}                                 │
│ }                                                       │
│ Warnings: ["Clamped price to $10,000",                  │
│            "Ignored unknown field: gpu_performance"]    │
└─────────────────────────────────────────────────────────┘

The Validation Function

from dataclasses import dataclass
from typing import Dict, List, Tuple, Any

@dataclass
class ValidationResult:
    filters: Dict[str, Any]
    warnings: List[str]

def validate_filters(raw_filters: dict, field_registry: dict) -> ValidationResult:
    """Validate and clamp extracted filters."""
    validated = {}
    warnings = []
    
    for field, condition in raw_filters.items():
        # Step 1: Check field exists
        if field not in field_registry:
            warnings.append(f"Unknown field ignored: '{field}'")
            continue
        
        spec = field_registry[field]
        
        # Step 2: Check field is filterable
        if not spec.get("filterable", True):
            warnings.append(f"Field '{field}' is not filterable")
            continue
        
        # Extract operator and value
        operator = list(condition.keys())[0]
        value = condition[operator]
        
        # Step 3: Clamp to valid range
        if spec.get("range_max") is not None and value > spec["range_max"]:
            warnings.append(f"{field}: clamped {value} to max {spec['range_max']}")
            value = spec["range_max"]
        
        if spec.get("range_min") is not None and value < spec["range_min"]:
            warnings.append(f"{field}: clamped {value} to min {spec['range_min']}")
            value = spec["range_min"]
        
        validated[field] = {operator: value}
    
    return ValidationResult(filters=validated, warnings=warnings)

Constraint Relaxation

The Zero Results Problem

Sometimes valid filters return no results. Don’t just show “No results found” — help the user.

USER: "Find Dell laptops under $500 with 32GB RAM"

FILTERS: brand = "Dell", price < 500, ram >= 32

DATABASE QUERY: WHERE brand = 'Dell' AND price < 500 AND ram >= 32

RESULT: 0 products ❌

WHAT NOW? (Don't just fail!)

Relaxation Strategy

Define a priority order for constraints. Relax the least important first.

┌─────────────────────────────────────────────────────────┐
│ PRIORITY ORDER (Most to Least Important)               │
├─────────────────────────────────────────────────────────┤
│ 1. Core Specs (RAM, storage) - Most important          │
│ 2. Price - Somewhat flexible                           │
│ 3. Brand - Least important (user might accept others)  │
└─────────────────────────────────────────────────────────┘

Relaxation Flow

INITIAL FILTERS: brand = "Dell", price < 500, ram >= 32

┌─────────────────────────────────────────────────────────┐
│ TRY 1: All constraints                                  │
│ WHERE brand = 'Dell' AND price < 500 AND ram >= 32      │
│ → 0 results                                             │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ TRY 2: Relax brand (least important)                    │
│ WHERE price < 500 AND ram >= 32                         │
│ → 0 results                                             │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ TRY 3: Relax price (next least important)               │
│ WHERE ram >= 32                                         │
│ → 12 results ✓                                          │
└─────────────────────────────────────────────────────────┘

RESPONSE: "No Dell laptops under $500 with 32GB RAM found.
           I found 12 laptops with 32GB RAM (relaxed brand and price):"

Relaxation Implementation

async def search_with_relaxation(
    filters: dict,
    priority_order: list,
    min_results: int = 3
) -> Tuple[list, list]:
    """Search with automatic constraint relaxation."""
    
    # Try strict search first
    results = await search(filters)
    if len(results) >= min_results:
        return results, []  # No relaxation needed
    
    # Relax constraints in priority order (least important first)
    relaxed_fields = []
    current_filters = filters.copy()
    
    for field in reversed(priority_order):  # Least important first
        if field not in current_filters:
            continue
        
        # Remove this constraint
        del current_filters[field]
        relaxed_fields.append(field)
        
        # Try again
        results = await search(current_filters)
        if len(results) >= min_results:
            return results, relaxed_fields
    
    # Return whatever we have
    return results, relaxed_fields

# Usage:
results, relaxed = await search_with_relaxation(
    filters={"brand": "Dell", "price": {"<": 500}, "ram": {">=": 32}},
    priority_order=["ram", "price", "brand"]  # Most to least important
)

if relaxed:
    print(f"Note: Relaxed constraints: {', '.join(relaxed)}")

The Complete Flow

End-to-End Filter Extraction

USER: "Find cheap Dell laptops with good battery"

┌─────────────────────────────────────────────────────────┐
│ STEP 1: INJECT FIELD DEFINITIONS                        │
│ Generate prompt with field registry                     │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ STEP 2: LLM EXTRACTION                                  │
│ {                                                       │
│   "price": {"<": 800},                                  │
│   "brand": {"=": "Dell"},                               │
│   "battery_life": {">": 8}                              │
│ }                                                       │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ STEP 3: VALIDATION                                      │
│ All fields exist? ✓                                     │
│ All values in range? ✓                                  │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ STEP 4: DATABASE SEARCH                                 │
│ Found: 0 results                                        │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│ STEP 5: CONSTRAINT RELAXATION                           │
│ Relax brand → 3 results ✓                               │
└─────────────────────────────────────────────────────────┘

RESPONSE: "No Dell laptops matched, but I found 3 cheap 
           laptops with good battery from other brands:"

The Proof: Before/After

Metrics That Changed

Before After
Unknown field errors: 35% Unknown field errors: 2% (rejected with warning)
Out-of-range hallucinations: 28% Out-of-range hallucinations: 0% (clamped)
'Cheap' mapped correctly: 0% (unbounded guesses) 'Cheap' mapped correctly: 94% (price < $800)
Zero-result dead ends: 42% Zero-result dead ends: 8% (relaxation helps)
User frustration: 'Search doesn't work' User satisfaction: 'Search is smart'

What changed: We stopped letting the LLM guess and started telling it exactly what exists.


The Checklist: Filter Extraction Readiness

Item Score
Do you inject field definitions into the extraction prompt? Field Defs
Do you map vague terms to specific values? Vague Terms
Do you reject unknown fields? Validation
Do you clamp out-of-range values? Clamping
Do you warn users about adjustments? Warnings
Do you have a constraint relaxation strategy? Relaxation
Is relaxation priority order defined? Priority
Do you tell users which constraints were relaxed? Feedback
0 of 8

Score Interpretation:

  • 0-3 checked: Extraction not ready. Start with field definitions.
  • 4-6 checked: Prototype-ready. Add validation and clamping.
  • 7-8 checked: Production-ready.

What’s Next

Issue 6: The Memory Layer

Now that you can extract filters and search, how do you remember what was found?

“User says ‘Tell me about the first one.’ You need to remember: (1) what products were fetched, (2) what order they were in, (3) what the user has selected.”

What You’ll Learn:

  • Session context design (what to remember)
  • Database persistence (chat history, selections)
  • Token tracking (cost control across sessions)

Key Takeaways

  1. 1 The Problem: LLMs extract filters but don't know your schema's boundaries
  2. 2 The Solution: Inject field definitions (name, type, unit, range, aliases), validate everything (reject unknown, clamp out-of-range), relax constraints when needed (least important first)
  3. 3 Security Win: JSON extraction instead of raw SQL immunizes against injection attacks
  4. 4 Key Takeaway: Field-aware extraction turns vague language into precise queries

Glossary

  • Filter Extraction: Converting natural language to structured query filters
  • Field Registry: List of filterable fields with metadata (units, ranges, aliases)
  • Validation: Checking if extracted filters match your schema
  • Clamping: Adjusting out-of-range values to valid boundaries
  • Constraint Relaxation: Removing filters when query returns 0 results
  • Vague Terms: Words like “cheap”, “good”, “lightweight” that need explicit mapping
  • Priority Order: Ranking constraints from most to least important for relaxation
  • Intermediate Representation: JSON filters instead of raw SQL — provides security against injection
  • Context Pruning: Only injecting relevant fields into prompts for large schemas (>50 fields)

Until next issue,

Sentient Zero Labs