How we moved data quality checks from "after the INSERT" to the ingest boundary, and why Cloudflare's edge runtime turned out to be a surprisingly good fit for real-time schema drift detection.
Your pipeline ran successfully last night. Every DAG is green. Airflow says everything is fine.
But the dashboard is broken this morning. A field went null. A type changed from number to string. A column that existed yesterday is gone today. And nothing caught it, because the quality checks run after the data is already in the warehouse.
This is the default failure mode for most data teams. Great Expectations, dbt tests, Soda — they're all good tools, but they validate data that's already been written. By the time they flag an issue, bad rows have been in production for hours. If you're feeding ML models, those bad rows are already in the training set.
We wanted something different: a check that runs before storage, before transformation, before damage.
We evaluated a few approaches before landing on Workers. Lambda was the obvious first thought, but cold starts of 500ms+ felt wrong for something that's supposed to be a gate, not a bottleneck. We also considered adding validation logic to the application layer, but that means every pipeline needs its own implementation. We wanted a single API endpoint that any pipeline can call.
Workers turned out to be a good fit for a few reasons that aren't immediately obvious:
- **No filesystem access.** This sounds like a limitation, but it's actually a privacy feature. Our screening engine processes data entirely in memory; Workers expose no filesystem API, so there is no disk to write to at the edge layer. Customer data goes in, a verdict comes out, and the raw payload is gone. We store schema fingerprints and aggregate statistics, never row-level data.
- **Durable Objects for consistent state.** Schema baselines need to be consistent — if two requests arrive simultaneously, they both need to compare against the same baseline. DOs give us single-threaded consistency per tenant without running a database.
- **KV for fast reads.** Schema fingerprints and baseline stats are cached in KV for sub-millisecond reads. The DO is the source of truth, but the hot path reads from KV.
- **D1 for the control plane.** Job history, tenant metadata, billing state — anything that doesn't need to be in the critical path lives in D1 (Cloudflare's SQLite-based database).
- **Global edge execution.** If your data source is in Frankfurt and your warehouse is in Virginia, the quality gate runs in Frankfurt. The screening happens close to the data, not close to the warehouse.
Every batch goes through a single-pass column analysis on a deterministically sampled subset of rows. Here's what gets computed:
| Check | What it catches |
|---|---|
| Null rate | Fields with too many missing values |
| Type mismatch | Fields where values aren't a consistent type |
| Empty string rate | Fields full of "" instead of null |
| Duplicate rate | Cardinality collapse |
| Outliers (IQR) | Numeric values beyond 1.5x interquartile range |
| Distinct count (HLL) | Approximate unique values via HyperLogLog |
| Enum tracking | New values in low-cardinality string fields |
| Timestamp staleness | Most recent timestamp older than expected |
| Schema fingerprint | SHA-256 of sorted field names + inferred types |
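The schema fingerprint is the anchor for everything downstream. Here's a minimal sketch of how such a fingerprint can be computed — the exact canonicalization and type inference in our engine differ, and `infer_type` below is deliberately simplified:

```python
import hashlib
import json

def infer_type(value):
    # Simplified type inference; a real engine also handles dates, enums, etc.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"

def schema_fingerprint(rows):
    # SHA-256 over sorted field names + inferred types.
    # First-seen type wins here; a real engine tracks per-field type mixes.
    fields = {}
    for row in rows:
        for name, value in row.items():
            fields.setdefault(name, infer_type(value))
    canonical = json.dumps(sorted(fields.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the field list is sorted before hashing, row order and key order don't affect the fingerprint, while a renamed field or a changed type produces a completely different hash.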
After the first batch for a given source, every subsequent batch is compared against the stored baseline. This is where drift detection kicks in:
| Drift event | Severity |
|---|---|
| Field added / removed | WARN |
| Type changed (e.g. number → string) | BLOCK |
| Null rate spike (>20% from baseline) | WARN / BLOCK |
| New enum value appeared | WARN |
| Row count anomaly (>3x from average) | WARN / BLOCK |
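The structural part of that comparison reduces to set operations on the field maps. A sketch, covering only the field-set and type checks (the rate-based checks also need the baseline statistics; event names here are illustrative):

```python
def detect_drift(baseline: dict, current: dict) -> list:
    # baseline/current map field name -> inferred type, e.g. {"amount": "number"}.
    events = []
    for field in sorted(baseline.keys() - current.keys()):
        events.append(("field_removed", field, "WARN"))
    for field in sorted(current.keys() - baseline.keys()):
        events.append(("field_added", field, "WARN"))
    for field in sorted(baseline.keys() & current.keys()):
        if baseline[field] != current[field]:
            events.append(("type_changed", field, "BLOCK"))
    return events
```

A batch where `amount` flips from number to string yields a `type_changed` event at BLOCK severity, while a new `coupon` field only warns.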
The verdict logic is straightforward: any BLOCK-severity event or a health score below 0.5 returns BLOCK. A health score below 0.8 or any WARN event returns WARN. Everything clean returns PASS.
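Expressed as code, those rules collapse to a few lines. This is a sketch of the thresholds described above, not the production implementation:

```python
def verdict(health_score: float, severities: list) -> str:
    # Any BLOCK-severity drift event, or health below 0.5, blocks the batch.
    if "BLOCK" in severities or health_score < 0.5:
        return "BLOCK"
    # Health below 0.8, or any WARN-severity event, degrades to WARN.
    if "WARN" in severities or health_score < 0.8:
        return "WARN"
    return "PASS"
```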
Here's how a request flows through the system:
```
// 1. Request arrives at the Worker
POST /v1/screen
X-API-Key: dsiq_live_...
Content-Type: application/json

{
  "source": "orders",
  "rows": [
    {"order_id": "ORD-001", "amount": 99.5},
    {"order_id": "ORD-002", "amount": "broken"}
  ]
}

// 2. Auth check (KV lookup: API key → tenant)
// 3. Read schema baseline from KV (fast path)
// 4. Run single-pass column analysis (in-memory)
// 5. Compare against baseline → detect drift
// 6. Compute health score → determine verdict
// 7. Return response immediately
// 8. waitUntil() → async updates to DO + D1
```
The key insight was separating the read path from the write path. Steps 1-7 complete without touching the Durable Object. The DO only gets involved in step 8, after the response has already been sent back to the caller. The caller never waits on storage writes.
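The shape of that pattern can be sketched with Python's asyncio standing in for the Worker runtime's waitUntil(). This is illustrative only — the real service is a Worker, not Python, and in a Worker an unregistered background task would be killed, which is exactly why waitUntil() exists:

```python
import asyncio

WRITES = []  # stands in for the Durable Object + D1 write path

async def persist(result):
    # Simulated storage latency for the DO/D1 update.
    await asyncio.sleep(0.05)
    WRITES.append(result)

async def handle(request):
    # Steps 2-6: KV read + in-memory analysis (fast path, no DO involved).
    result = {"status": "PASS", "health_score": 0.97}
    # Step 8: schedule the write, but don't await it before responding.
    asyncio.create_task(persist(result))
    return result  # step 7: the caller gets the verdict immediately
```

When `handle()` returns, `WRITES` is still empty; the write lands a few milliseconds later, after the response has already gone out.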
Lesson learned: We originally had the DO in the hot path for everything — reads and writes. Under 100 concurrent requests, latency jumped to 2.5+ seconds because DOs serialize requests to the same object. Moving reads to KV and writes to `waitUntil()` dropped steady-state latency to ~100ms.
We ran a concurrency test using 10 parallel users across 5 rounds, hitting the live production endpoint. Round 1 includes a cold start (the DO hadn't been accessed recently). Rounds 2-5 show warm performance.
| Round | Server latency (p50) | Throughput | Notes |
|---|---|---|---|
| 1 | 2,714 ms | 3.2 req/s | Cold start — DO spinning up |
| 2 | 128 ms | 42.6 req/s | Warm |
| 3 | 138 ms | 43.9 req/s | Warm |
| 4 | 119 ms | 46.7 req/s | Warm |
| 5 | 118 ms | 49.0 req/s | Warm, settling in |
The cold start is the one gotcha. Cloudflare evicts idle DOs after roughly 30 seconds on the free plan. The first request after a quiet period pays a ~2.7 second penalty while the DO spins up and loads its SQLite state. With real traffic keeping it warm, you consistently see sub-150ms responses.
We also tested 100 and 200 concurrent requests from a single API key. All 200 succeeded with zero errors, though server-side latency climbed to ~3.5s because one key maps to one DO, which serializes its requests. In production, 100 different users means 100 different DOs running in parallel.
Here's what a BLOCK response looks like:

```json
{
  "status": "BLOCK",
  "health_score": 0.34,
  "decision": {
    "action": "BLOCK",
    "reason": "Type mismatch in 'amount'; High null rate in 'email' (67%)"
  },
  "schema": {
    "order_id": { "type": "string", "confidence": 1.0 },
    "amount": { "type": "number", "confidence": 0.67 }
  },
  "issues": {
    "type_mismatches": {
      "amount": {
        "expected": "number",
        "found": ["string"],
        "sample_value": "broken",
        "severity": "critical"
      }
    }
  },
  "latency_ms": 7
}
```
Response headers also carry the verdict for lightweight integration — you can route on `X-DataScreenIQ-Status` without parsing the body.
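For example, an ingest script can branch on that header alone. The routing targets below are illustrative, and failing closed on a missing header is our own defensive choice, not API behavior:

```python
def route_on_header(response_headers: dict) -> str:
    # The verdict travels in X-DataScreenIQ-Status: PASS, WARN, or BLOCK.
    status = response_headers.get("X-DataScreenIQ-Status", "BLOCK")  # fail closed
    return {
        "PASS": "load",            # safe to write to the warehouse
        "WARN": "load_and_alert",  # load, but notify the team
        "BLOCK": "dead_letter",    # quarantine the batch
    }.get(status, "dead_letter")
```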
The Python SDK makes the integration a few lines:
```python
import datascreeniq as dsiq
from datascreeniq.exceptions import DataQualityError

client = dsiq.Client()  # reads DATASCREENIQ_API_KEY from env

try:
    client.screen(rows, source="orders").raise_on_block()
    load_to_warehouse(rows)  # only runs on PASS or WARN
except DataQualityError as e:
    send_to_dead_letter_queue(rows)
    alert_team(f"Blocked: {e.report.summary()}")
```
It also works as a raw HTTP call — send JSON or CSV, get back a verdict. No SDK required if you don't want one.
```bash
curl -X POST https://api.datascreeniq.com/v1/screen \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: text/csv" \
  -H "X-Source: orders" \
  --data-binary @orders.csv
```
A few things we learned along the way:
- **Durable Object migrations on the free plan require `new_sqlite_classes`, not `new_classes`.** The docs don't make this obvious and the error message doesn't help. Cost us a couple of hours.
- **Welcome emails need to be awaited.** We initially used fire-and-forget for the signup confirmation email. On Workers, the runtime is torn down after the response is sent unless you use `waitUntil()`, so the email never went out. We had to await it before returning the signup response.
- **KV write limits on the free plan are 1,000/day.** Every screen request updates the schema cache and billing counters, so at 500+ requests/day you hit the ceiling. The $5/month Workers Paid plan lifts this cap.
- **Cloudflare's email obfuscation injects a script (`email-decode.min.js`) that blocks all inline JavaScript on Pages**, including unrelated canvas animations. It took a while to figure out why our landing page animation was broken. The fix: use plain `mailto:` links instead of obfuscated emails.
Pre-ingest screening isn't a replacement for warehouse-level testing. dbt tests and Great Expectations are still useful for catching transformation bugs, business rule violations, and cross-table consistency. Those checks need access to the full dataset.
What a quality gate at the ingest boundary gives you is speed. You know in milliseconds whether this batch is safe to load, before any storage writes happen. It's the difference between "the pipeline failed" and "the pipeline blocked bad data and kept running."
For teams running Airflow, Prefect, or any orchestrator, it slots in as a pre-load task. For teams hitting APIs directly, it's a single HTTP call. Either way, the pattern is the same: screen first, load second.
Free tier: 500K rows/month. No credit card. API key in 30 seconds.
Get a free API key →