In Issue 2, we built a clean schema with a field registry. Now comes the unglamorous truth: getting vendor data into that schema is 80% of the work.
We spent 2 weeks designing the perfect database. We spent 6 weeks fighting with vendor data. Every price in a different format. Every null represented differently. Every product with a slightly different SKU.
In this issue:
- Why vendor data is chaos (and always will be)
- The 4-step ingestion pipeline (fetch -> transform -> validate -> upsert)
- Normalization strategies for units, nulls, and duplicates
- Why we chose scripts over Airflow
History Anchor: Batch ETL -> Modern ELT
Data ingestion has evolved from batch ETL jobs (Informatica, SSIS) through Hadoop MapReduce to modern ELT with dbt and Airflow. For AI agents, the ingestion layer has an additional constraint: the data must be normalized precisely enough for an LLM to reason about it. Messy data doesn’t just cause query errors — it causes hallucinations.
The Chaos
Nightmare 1: The Unit Chaos
SAME LAPTOP, 4 VENDORS, 4 DIFFERENT PRICE FORMATS:
Vendor A (Europe):
"price": "EUR 1.299,00"
Vendor B (US):
"price": "$1,499.00 USD"
Vendor C (API):
"price_cents": 149900
Vendor D (CSV):
"list_price": "1499"
OUR SCHEMA EXPECTS:
price: NUMERIC (1499.00)
unit: USD
PROBLEM: 4 different formats, 2 currencies (and one vendor quoting in cents)
The same laptop. Four vendors. Four completely different representations of the price. Our schema expects one consistent format.
Nightmare 2: The Null Chaos
MISSING WEIGHT DATA, 4 DIFFERENT REPRESENTATIONS:
Vendor A: "weight": null
Vendor B: "weight": "N/A"
Vendor C: "weight": 0
Vendor D: "weight": ""
QUESTION: What do we do?
- Reject the product? (lose data)
- Use default value? (inaccurate)
- Flag as missing? (track it)
PROBLEM: No consistent null handling
“Missing” doesn’t mean the same thing to every vendor. Some send null. Some send "N/A". Some send 0 (is that missing or actual zero weight?). Some send empty strings.
Nightmare 3: The Duplicate Chaos
SAME LAPTOP, 3 VENDORS, 3 DIFFERENT IDENTIFIERS:
Vendor A:
"sku": "XPS-15-2024"
"name": "Dell XPS 15"
"price": 1899
Vendor B:
"sku": "XPS15"
"name": "XPS 15 Laptop"
"price": 1899.00
Vendor C:
"sku": "DELL-XPS-15"
"name": "Dell XPS 15 (2024)"
"price": 1899
QUESTION: Are these the same product?
- Different SKUs
- Slightly different names
- Same price (but different formats)
PROBLEM: Need deduplication strategy
Three vendors, three different SKUs, three slightly different names. Is “XPS-15-2024” the same as “XPS15”? Your database needs one canonical entry.
The Realization
We spent 2 weeks building the perfect schema. We spent 6 weeks getting vendor data to fit it. Ingestion is 80% of the work.
The Diagnosis: Why Vendor Data Is Chaos
What Vendors Send vs. What You Need
WHAT VENDORS SEND (The Reality):
- Different field names ("price" vs "list_price")
- Different units (EUR, USD, cents)
- Different formats ("$1,499" vs 1499 vs "1.499,00")
- Different null representations (null, "N/A", 0, "")
- Different schemas (some have weight, some don't)
- Typos and errors ("weightt": 2.5)
WHAT YOUR SCHEMA NEEDS (The Standard):
- Consistent field names (price, weight_kg, brand)
- One unit (USD, kg)
- One format (NUMERIC: 1499.00)
- One null strategy (null = missing, not 0 or "N/A")
- Required fields always present
- No typos, no errors
The pattern: Vendors don’t care about your schema. You have to transform their data into your format, every time, for every vendor.
The Gap (The Ingestion Layer Bridges It)
VENDOR DATA (Chaos)
|
v
INGESTION LAYER
1. FETCH: Get data from vendor (API, CSV, scrape)
2. TRANSFORM: Normalize to your schema
3. VALIDATE: Check if it meets rules
4. UPSERT: Update if exists, insert if new
|
v
YOUR DATABASE (Clean, Standardized)
The Ingestion Pipeline
The 4-Step Flow
Step 1: FETCH — Get raw data from vendor
Sources: API calls (JSON), CSV files (download), web scraping (HTML).
Output: Raw vendor data (as-is).
Key Principle: Don’t transform during fetch. Get the data exactly as the vendor sends it.
Step 2: TRANSFORM — Convert to your schema
- Map fields: "list_price" -> "price", "product_name" -> "name", "weightt" -> "weight_kg" (fix typos)
- Normalize units: "EUR 1,299" -> 1428.90 (USD), "5.5 lbs" -> 2.49 (kg)
- Handle nulls: "N/A" -> null, "" -> null, 0 -> null (if 0 means missing)
- Preserve original: store raw data in full_data JSONB
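Field mapping works best as a small per-vendor registry instead of scattered if/else logic. A minimal sketch (the vendor keys and mappings are illustrative, not our real config):

```python
# Hypothetical per-vendor field maps: our field name -> vendor's field name
FIELD_MAPS = {
    "vendor_b": {"price": "price", "name": "name", "weight_kg": "weight"},
    "vendor_d": {"price": "list_price", "name": "product_name", "weight_kg": "weightt"},  # vendor's typo absorbed here
}

def map_fields(raw, vendor):
    # Unknown vendors map to an empty dict rather than crashing
    mapping = FIELD_MAPS.get(vendor, {})
    return {ours: raw.get(theirs) for ours, theirs in mapping.items()}
```

Fixing the `weightt` typo in the map means the fix lives in one place, not in every transform function.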
Step 3: VALIDATE — Check if data meets schema rules
- Required fields: sku present? name present? price present?
- Types: price is numeric? weight is numeric?
- Ranges: price > 0? weight > 0?
- If validation fails: log error with details, skip product, continue with next
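The validation step can be a plain function that returns a list of human-readable errors (the field names follow our schema; the rules here are a sketch, not an exhaustive set):

```python
REQUIRED_FIELDS = ("sku", "name", "price")

def validate(product):
    errors = []
    # Required fields must be present and non-null
    for field in REQUIRED_FIELDS:
        if product.get(field) is None:
            errors.append(f"missing required field: {field}")
    # Type and range checks (only when the value is present)
    for field in ("price", "weight_kg"):
        value = product.get(field)
        if value is not None:
            if not isinstance(value, (int, float)):
                errors.append(f"{field} is not numeric: {value!r}")
            elif value <= 0:
                errors.append(f"{field} out of range: {value}")
    return errors
```

Returning a list (instead of raising on the first problem) means the error log shows everything wrong with a product at once.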
Step 4: UPSERT — Insert or update in database
- Check: Does SKU already exist?
- Yes -> UPDATE (price, specs, updated_at)
- No -> INSERT (new product)
- Conflict resolution: if price changed, update; if specs changed, update; if vendor data is older, skip
INSERT INTO product (...) VALUES (...)
ON CONFLICT (sku) DO UPDATE SET ...
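As a runnable sketch of the same upsert pattern, here is the ON CONFLICT clause against SQLite (3.24+, which supports the Postgres-style syntax); the column list is illustrative, not our full schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        sku        TEXT PRIMARY KEY,
        name       TEXT,
        price      NUMERIC,
        updated_at TEXT
    )
""")

def upsert(conn, sku, name, price, updated_at):
    # Insert a new row, or update price/updated_at if the SKU already exists
    conn.execute(
        """
        INSERT INTO product (sku, name, price, updated_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (sku) DO UPDATE SET
            price = excluded.price,
            updated_at = excluded.updated_at
        """,
        (sku, name, price, updated_at),
    )

upsert(conn, "XPS-15-2024", "Dell XPS 15", 1899.00, "2024-01-01")
upsert(conn, "XPS-15-2024", "Dell XPS 15", 1849.00, "2024-02-01")  # price drop -> update, not a duplicate row
row = conn.execute("SELECT price FROM product WHERE sku = 'XPS-15-2024'").fetchone()
```

`excluded` refers to the row that would have been inserted, so the update always takes the incoming vendor values.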
Normalization Strategies
Strategy 1: Price Normalization
Input: "EUR 1.299,00"
|
v
1. Detect currency: EUR
2. Handle European format: 1.299,00 -> 1299.00
3. Remove symbols: 1299.00
4. Convert EUR -> USD: 1299 x 1.10 = 1428.90
5. Round to 2 decimals: 1428.90
|
v
Output: 1428.90 (NUMERIC, USD)
```python
# The problem: "EUR 1.299,00" needs to become 1428.90 (USD)
# The solution: Normalize in steps
import re

def normalize_price(raw, source_currency="USD"):
    # 1. Strip currency symbols and codes, keep digits and separators
    #    (a plain .replace("EUR", "") misses trailing codes like "USD")
    cleaned = re.sub(r"[^0-9.,]", "", str(raw))
    # 2. Handle European format (1.299,00 -> 1299.00)
    if "," in cleaned and "." in cleaned:
        if cleaned.rindex(",") > cleaned.rindex("."):
            # European: 1.299,00 -> remove dots, swap comma
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            # US: 1,499.00 -> just remove commas
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        # Lone comma: treat as decimal separator (e.g. "1299,00")
        cleaned = cleaned.replace(",", ".")
    # 3. Convert to float
    value = float(cleaned)
    # 4. Convert currency to USD (static rates here; use a live feed in production)
    rates = {"EUR": 1.10, "GBP": 1.27, "USD": 1.0}
    value = value * rates.get(source_currency, 1.0)
    # 5. Round to 2 decimals
    return round(value, 2)
```
Strategy 2: Weight Normalization
Input: "5.5 lbs"
|
v
1. Extract number: 5.5
2. Extract unit: "lbs"
3. Convert to kg: 5.5 x 0.453592 = 2.49
4. Round to 2 decimals: 2.49
|
v
Output: 2.49 (NUMERIC, kg)
```python
# The problem: "5.5 lbs" needs to become 2.49 (kg)
# The solution: Extract unit, convert
import re

def normalize_weight(raw):
    if not raw:
        return None
    # Extract number and unit
    match = re.match(r"([\d.]+)\s*(kg|lbs|lb|g|oz)?", str(raw).lower())
    if not match:
        return None
    value = float(match.group(1))
    unit = match.group(2) or "kg"  # Default to kg if no unit
    # Convert to kg
    conversions = {
        "lbs": 0.453592,
        "lb": 0.453592,
        "g": 0.001,
        "oz": 0.0283495,
        "kg": 1.0,
    }
    value = value * conversions.get(unit, 1.0)
    return round(value, 2)
```
Strategy 3: Null Handling
NULL HANDLING DECISION TREE:
Is field required (sku, name, price)?
YES -> Reject product (log error)
NO -> Continue
Is value a null-equivalent ("N/A", "", 0)?
YES -> Convert to NULL
NO -> Keep value
Should we flag missing data?
YES -> Set flag: missing_weight = true
NO -> Just store null
```python
# The problem: "N/A", "", 0, null all mean "missing"
# The solution: Normalize to Python None
NULL_EQUIVALENTS = {None, "", "N/A", "n/a", "NA", "null", "NULL", "-"}

def normalize_null(value, treat_zero_as_null=False):
    if value in NULL_EQUIVALENTS:
        return None
    if treat_zero_as_null and value == 0:
        return None
    return value
```
Strategy 4: Deduplication
DEDUPLICATION STRATEGY:
Step 1: Normalize SKU
- Remove hyphens, spaces
- Lowercase
- "XPS-15-2024" -> "xps152024"
- "XPS 15" -> "xps15"
Step 2: Check for match
- Exact normalized SKU match? -> Update existing
- Fuzzy name match (>90%)? -> Flag for review
- No match? -> Insert new
Step 3: Conflict resolution
- If prices differ: Use most recent vendor data
- If specs differ: Merge (keep most complete)
- Always track: which vendor, when updated
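Steps 1 and 2 of the strategy above can be sketched with the stdlib, using difflib for the fuzzy name match (the 90% threshold is the one from the strategy; `match_action` and its return values are illustrative names):

```python
import re
from difflib import SequenceMatcher

def normalize_sku(sku):
    # "XPS-15-2024" -> "xps152024": strip separators, lowercase
    if not sku:
        return None
    return re.sub(r"[\s\-_]", "", str(sku)).lower()

def name_similarity(a, b):
    # Ratio in [0, 1]; above 0.90 counts as a probable match
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_action(incoming, existing):
    # existing: {normalized_sku: product_name} for products already in the DB
    sku = normalize_sku(incoming["sku"])
    if sku in existing:
        return "update"
    if any(name_similarity(incoming["name"], name) > 0.90 for name in existing.values()):
        return "review"  # probable duplicate: flag for manual review
    return "insert"
```

SequenceMatcher is a rough heuristic; it catches "Dell XPS 15" vs "Dell XPS 15 (2024)" but a production system would also compare on brand plus model number.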
Scripts vs. Pipelines
When to Use Each
Use scripts when:
- Schema is evolving (frequent changes)
- Vendors are inconsistent (different formats)
- Volume is manageable (< 10M products)
- Need manual review/validation
- Team is small (< 5 engineers)
- Benefits: easy to debug, fast to iterate, flexible, low overhead
Use pipelines (Airflow/dbt) when:
- Schema is stable (rarely changes)
- Vendors are consistent (same format)
- Volume is large (> 10M products)
- Need scheduling/orchestration
- Team is large (> 5 engineers)
- Benefits: automated scheduling, dependency management (DAGs), monitoring/alerting built-in, scales to large volumes
Our choice: Scripts. Schema evolving, vendors inconsistent, volume under 1M. We can always migrate to Airflow later — but starting there would have slowed us down.
Error Handling
The Error Handling Pattern
```python
# The problem: Some products fail validation
# The solution: Log errors, continue with valid products
def ingest_vendor(vendor_name, vendor_data):
    results = {"success": [], "errors": []}
    for raw_product in vendor_data:
        try:
            # 1. Transform
            product = transform(raw_product, vendor_name)
            # 2. Validate
            errors = validate(product)
            if errors:
                raise ValueError(f"Validation failed: {errors}")
            # 3. Upsert
            db.upsert(product)
            results["success"].append(product["sku"])
        except Exception as e:
            results["errors"].append({
                "sku": raw_product.get("sku", "UNKNOWN"),
                "vendor": vendor_name,
                "error": str(e),
                "raw_data": raw_product  # Preserve for debugging
            })
    # Report
    print(f"Ingested {len(results['success'])} products")
    print(f"Errors: {len(results['errors'])}")
    return results
```
The full_data Safety Net (Your Data Lake in Postgres)
Always preserve the original vendor data. Think of full_data as a mini Data Lake — raw, untransformed vendor data sitting right next to your normalized columns:
```python
from datetime import datetime

def transform(raw_product, vendor_name):
    return {
        "sku": normalize_sku(raw_product.get("sku")),  # SKU normalization from Strategy 4
        "name": raw_product.get("name") or raw_product.get("product_name"),
        "price": normalize_price(raw_product.get("price")),
        "weight_kg": normalize_weight(raw_product.get("weight")),
        "brand": raw_product.get("brand"),
        # ALWAYS preserve original data
        "full_data": {
            "vendor": vendor_name,
            "fetched_at": datetime.now().isoformat(),
            "raw": raw_product  # Original, untouched
        }
    }
```
The Proof: Before/After
| Before | After |
|---|---|
| 30% of products rejected (validation failures) | 95% of products ingested successfully |
| 15% duplicate products (same laptop, different SKUs) | 2% duplicates (normalized SKUs catch most) |
| 40% missing weight data (nulls handled inconsistently) | 15% missing weight (but flagged, not broken) |
| 0% vendor data preserved (lost for debugging) | 100% vendor data preserved (in full_data JSONB) |
| Manual fixes: 20 hours/week | Manual fixes: 2 hours/week |
What changed: We stopped hoping vendor data was clean and started normalizing it systematically.
The Checklist: Ingestion Layer Readiness
| Item | Question | Score |
|---|---|---|
| Units | Are all units normalized to one standard? | |
| Formats | Are all formats converted to schema types? | |
| Nulls | Do you handle nulls consistently? | |
| Duplicates | Do you deduplicate across vendors? | |
| Raw Data | Do you preserve original vendor data? | |
| Validation | Do you validate before inserting? | |
| Errors | Do you log failed products with details? | |
| Upsert | Do you update existing products? | |
Score Interpretation:
- 0-3: Ingestion not ready. Fix this before building agents.
- 4-6: Prototype-ready. Proceed with caution.
- 7-8: Production-ready. Move to Issue 4.
What’s Next
Issue 4: The Intent Layer
Now that you have clean data in your database, how does the agent understand what users want?
“User says ‘Find me something good.’ Is that a search? A question? A comparison? We had one giant prompt doing everything. It broke constantly.”
What You’ll Learn:
- Intent classification design (search, refine, compare, ask)
- Multi-intent handling (“Find laptops and compare the top two”)
- Entity extraction (resolving “the first one” to an actual product)
Key Takeaways
1. The Problem: Vendor data is chaos (different units, formats, nulls, duplicates).
2. The Solution: A 4-step pipeline (Fetch -> Transform -> Validate -> Upsert), aggressive normalization (units, formats, nulls), raw data preserved in full_data JSONB, and scripts over pipelines (for flexibility).
3. Key Takeaway: Ingestion is 80% of the work. Get it right, and everything else becomes easier.
Glossary
- ELT: Extract-Load-Transform — load raw data first, transform in database (vs ETL which transforms before loading)
- Normalization: Converting data to standard format (EUR -> USD, lbs -> kg)
- Upsert: Update if exists, insert if new (INSERT … ON CONFLICT DO UPDATE)
- Null Handling: Strategy for missing data (reject, default, or flag)
- Deduplication: Removing duplicate entries across vendors
- Raw Data: Original vendor data before transformation
- Validation: Checking if transformed data meets schema rules
- full_data: JSONB column storing original vendor data — your “Data Lake in Postgres”
- The Airflow Tax: Overhead of running orchestration infrastructure before you need it
Until next issue,
Sentient Zero Labs