How we moved data quality checks from "after the INSERT" to the ingest boundary, and why Cloudflare's edge runtime turned out to be a surprisingly good fit for real-time schema drift detection.
Your pipeline ran successfully last night. Every DAG is green. Airflow says everything is fine.
But the dashboard is broken this morning. A field went null. A type changed from number to string. A column that existed yesterday is gone today. And nothing caught it, because the quality checks run after the data is already in the warehouse.
This is the default failure mode for most data teams. Great Expectations, dbt tests, Soda — they're all good tools, but they validate data that's already been written. By the time they flag an issue, bad rows have been in production for hours. If you're feeding ML models, those bad rows are already in the training set.
We wanted something different: a check that runs before storage, before transformation, before damage.
We evaluated a few approaches before landing on Workers. Lambda was the obvious first thought, but cold starts of 500ms+ felt wrong for something that's supposed to be a gate, not a bottleneck. We also considered adding validation logic to the application layer, but that means every pipeline needs its own implementation. We wanted a single API endpoint that any pipeline can call.
Workers turned out to be a good fit for a few reasons that aren't immediately obvious:
- **No filesystem access.** This sounds like a limitation, but it's actually a privacy feature. Our screening engine processes data entirely in memory; Workers expose no filesystem API, so there is no disk to write to at the edge layer. Customer data goes in, a verdict comes out, and the raw payload is gone. We store schema fingerprints and aggregate statistics, never row-level data.
- **Durable Objects for consistent state.** Schema baselines need to be consistent — if two requests arrive simultaneously, they both need to compare against the same baseline. DOs give us single-threaded consistency per tenant without running a database.
- **KV for fast reads.** Schema fingerprints and baseline stats are cached in KV for sub-millisecond reads. The DO is the source of truth, but the hot path reads from KV.
- **D1 for the control plane.** Job history, tenant metadata, billing state — anything that doesn't need to be in the critical path lives in D1 (Cloudflare's SQLite-based database).
- **Global edge execution.** If your data source is in Frankfurt and your warehouse is in Virginia, the quality gate runs in Frankfurt. The screening happens close to the data, not close to the warehouse.
Every batch goes through a single-pass column analysis on a deterministically sampled subset of rows. Here's what gets computed:
| Check | What it catches |
|---|---|
| Null rate | Fields with too many missing values |
| Type mismatch | Fields where values aren't a consistent type |
| Empty string rate | Fields full of "" instead of null |
| Duplicate rate | Cardinality collapse |
| Outliers (IQR) | Numeric values beyond 1.5x interquartile range |
| Distinct count (HLL) | Approximate unique values via HyperLogLog |
| Enum tracking | New values in low-cardinality string fields |
| Timestamp staleness | Most recent timestamp older than expected |
| Schema fingerprint | SHA-256 of sorted field names + inferred types |
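The schema fingerprint is the anchor for everything downstream. Here's a minimal sketch of how such a fingerprint can be computed — the exact canonicalization and type inference in our engine differ, and `infer_type` below is deliberately simplified:

```python
import hashlib
import json

def infer_type(value):
    # Simplified type inference; a real engine also handles dates, enums, etc.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"

def schema_fingerprint(rows):
    # SHA-256 over sorted field names + inferred types.
    # First-seen type wins here; a real engine tracks per-field type mixes.
    fields = {}
    for row in rows:
        for name, value in row.items():
            fields.setdefault(name, infer_type(value))
    canonical = json.dumps(sorted(fields.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the field list is sorted before hashing, row order and key order don't affect the fingerprint, while a renamed field or a changed type produces a completely different hash.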
After the first batch for a given source, every subsequent batch is compared against the stored baseline. This is where drift detection kicks in:
| Drift event | Severity |
|---|---|
| Field added / removed | WARN |
| Type changed (e.g. number → string) | BLOCK |
| Null rate spike (>20% from baseline) | WARN / BLOCK |
| New enum value appeared | WARN |
| Row count anomaly (>3x from average) | WARN / BLOCK |
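The structural part of that comparison reduces to set operations on the field maps. A sketch, covering only the field-set and type checks (the rate-based checks also need the baseline statistics; event names here are illustrative):

```python
def detect_drift(baseline: dict, current: dict) -> list:
    # baseline/current map field name -> inferred type, e.g. {"amount": "number"}.
    events = []
    for field in sorted(baseline.keys() - current.keys()):
        events.append(("field_removed", field, "WARN"))
    for field in sorted(current.keys() - baseline.keys()):
        events.append(("field_added", field, "WARN"))
    for field in sorted(baseline.keys() & current.keys()):
        if baseline[field] != current[field]:
            events.append(("type_changed", field, "BLOCK"))
    return events
```

A batch where `amount` flips from number to string yields a `type_changed` event at BLOCK severity, while a new `coupon` field only warns.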
The verdict logic is straightforward: any BLOCK-severity event or a health score below 0.5 returns BLOCK. A health score below 0.8 or any WARN event returns WARN. Everything clean returns PASS.
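Expressed as code, those rules collapse to a few lines. This is a sketch of the thresholds described above, not the production implementation:

```python
def verdict(health_score: float, severities: list) -> str:
    # Any BLOCK-severity drift event, or health below 0.5, blocks the batch.
    if "BLOCK" in severities or health_score < 0.5:
        return "BLOCK"
    # Health below 0.8, or any WARN-severity event, degrades to WARN.
    if "WARN" in severities or health_score < 0.8:
        return "WARN"
    return "PASS"
```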
Here's how a request flows through the system:
```
// 1. Request arrives at the Worker
POST /v1/screen
X-API-Key: dsiq_live_...
Content-Type: application/json

{
  "source": "orders",
  "rows": [
    {"order_id": "ORD-001", "amount": 99.5},
    {"order_id": "ORD-002", "amount": "broken"}
  ]
}

// 2. Auth check (KV lookup: API key → tenant)
// 3. Read schema baseline from KV (fast path)
// 4. Run single-pass column analysis (in-memory)
// 5. Compare against baseline → detect drift
// 6. Compute health score → determine verdict
// 7. Return response immediately
// 8. waitUntil() → async updates to DO + D1
```
The key insight was separating the read path from the write path. Steps 1-7 complete without touching the Durable Object. The DO only gets involved in step 8, after the response has already been sent back to the caller. The caller never waits on storage writes.
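The shape of that pattern can be sketched with Python's asyncio standing in for the Worker runtime's waitUntil(). This is illustrative only — the real service is a Worker, not Python, and in a Worker an unregistered background task would be killed, which is exactly why waitUntil() exists:

```python
import asyncio

WRITES = []  # stands in for the Durable Object + D1 write path

async def persist(result):
    # Simulated storage latency for the DO/D1 update.
    await asyncio.sleep(0.05)
    WRITES.append(result)

async def handle(request):
    # Steps 2-6: KV read + in-memory analysis (fast path, no DO involved).
    result = {"status": "PASS", "health_score": 0.97}
    # Step 8: schedule the write, but don't await it before responding.
    asyncio.create_task(persist(result))
    return result  # step 7: the caller gets the verdict immediately
```

When `handle()` returns, `WRITES` is still empty; the write lands a few milliseconds later, after the response has already gone out.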
Lesson learned: We originally had the DO in the hot path for everything — reads and writes. Under 100 concurrent requests, latency jumped to 2.5+ seconds because DOs serialize requests to the same object. Moving reads to KV and writes to `waitUntil()` dropped steady-state latency to ~100ms.
We ran a concurrency test using 10 parallel users across 5 rounds, hitting the live production endpoint. Round 1 includes a cold start (the DO hadn't been accessed recently). Rounds 2-5 show warm performance.
| Round | Server latency (p50) | Throughput | Notes |
|---|---|---|---|
| 1 | 2,714 ms | 3.2 req/s | Cold start — DO spinning up |
| 2 | 128 ms | 42.6 req/s | Warm |
| 3 | 138 ms | 43.9 req/s | Warm |
| 4 | 119 ms | 46.7 req/s | Warm |
| 5 | 118 ms | 49.0 req/s | Warm, settling in |
The cold start is the one gotcha. Cloudflare evicts idle DOs after roughly 30 seconds on the free plan. The first request after a quiet period pays a ~2.7 second penalty while the DO spins up and loads its SQLite state. With real traffic keeping it warm, you consistently see sub-150ms responses.
We also tested 100 and 200 concurrent requests from a single API key. All 200 succeeded with zero errors, though server-side latency climbed to ~3.5s because one key maps to one DO, which serializes its requests. In production, 100 different users means 100 different DOs running in parallel.
Here's what a BLOCK response looks like:

```json
{
  "status": "BLOCK",
  "health_score": 0.34,
  "decision": {
    "action": "BLOCK",
    "reason": "Type mismatch in 'amount'; High null rate in 'email' (67%)"
  },
  "schema": {
    "order_id": { "type": "string", "confidence": 1.0 },
    "amount": { "type": "number", "confidence": 0.67 }
  },
  "issues": {
    "type_mismatches": {
      "amount": {
        "expected": "number",
        "found": ["string"],
        "sample_value": "broken",
        "severity": "critical"
      }
    }
  },
  "latency_ms": 7
}
```
Response headers also carry the verdict for lightweight integration — you can route on `X-DataScreenIQ-Status` without parsing the body.
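For example, an ingest script can branch on that header alone. The routing targets below are illustrative, and failing closed on a missing header is our own defensive choice, not API behavior:

```python
def route_on_header(response_headers: dict) -> str:
    # The verdict travels in X-DataScreenIQ-Status: PASS, WARN, or BLOCK.
    status = response_headers.get("X-DataScreenIQ-Status", "BLOCK")  # fail closed
    return {
        "PASS": "load",            # safe to write to the warehouse
        "WARN": "load_and_alert",  # load, but notify the team
        "BLOCK": "dead_letter",    # quarantine the batch
    }.get(status, "dead_letter")
```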
The Python SDK makes the integration a few lines:
```python
import datascreeniq as dsiq
from datascreeniq.exceptions import DataQualityError

client = dsiq.Client()  # reads DATASCREENIQ_API_KEY from env

try:
    client.screen(rows, source="orders").raise_on_block()
    load_to_warehouse(rows)  # only runs on PASS or WARN
except DataQualityError as e:
    send_to_dead_letter_queue(rows)
    alert_team(f"Blocked: {e.report.summary()}")
```
It also works as a raw HTTP call — send JSON or CSV, get back a verdict. No SDK required if you don't want one.
```bash
curl -X POST https://api.datascreeniq.com/v1/screen \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: text/csv" \
  -H "X-Source: orders" \
  --data-binary @orders.csv
```
A few things we learned along the way:
- **Durable Object migrations on the free plan require `new_sqlite_classes`, not `new_classes`.** The docs don't make this obvious and the error message doesn't help. Cost us a couple of hours.
- **Welcome emails need to be awaited.** We initially used fire-and-forget for the signup confirmation email. On Workers, the runtime is torn down after the response is sent unless you use `waitUntil()`, so the email never went out. We had to await it before returning the signup response.
- **KV write limits on the free plan are 1,000/day.** Every screen request updates the schema cache and billing counters, so at 500+ requests/day you hit the ceiling. The $5/month Workers Paid plan lifts this cap.
- **Cloudflare's email obfuscation injects a script (`email-decode.min.js`) that blocks all inline JavaScript on Pages**, including unrelated canvas animations. It took a while to figure out why our landing page animation was broken. The fix: use plain `mailto:` links instead of obfuscated emails.
Pre-ingest screening isn't a replacement for warehouse-level testing. dbt tests and Great Expectations are still useful for catching transformation bugs, business rule violations, and cross-table consistency. Those checks need access to the full dataset.
What a quality gate at the ingest boundary gives you is speed. You know in milliseconds whether this batch is safe to load, before any storage writes happen. It's the difference between "the pipeline failed" and "the pipeline blocked bad data and kept running."
For teams running Airflow, Prefect, or any orchestrator, it slots in as a pre-load task. For teams hitting APIs directly, it's a single HTTP call. Either way, the pattern is the same: screen first, load second.
Free tier: 500K rows/month. No credit card. API key in 30 seconds.
Get a free API key →