A single null-rate threshold applied across every source in your pipeline sounds sensible. In practice it means your quality gate is simultaneously too strict for some sources and dangerously permissive for others. Here's the fix.
Most data quality tools let you set a threshold — say, WARN if null rate exceeds 30%, BLOCK if it exceeds 70%. That sounds reasonable until you look at what's actually flowing through a real pipeline.
Consider a typical setup with four sources:
| Source | Critical field | Acceptable null rate | Why |
|---|---|---|---|
| payments | amount | 0% — any null is a bug | Financial data. A null amount means money went somewhere unrecorded. |
| orders | shipping_address | ≤ 5% | Digital orders don't always have a shipping address. Some nulls are expected. |
| events | user_agent | ≤ 40% | Bot traffic, server-side events, and API calls rarely send a user agent. High nulls are normal. |
| user_profiles | phone_number | ≤ 60% | Optional field. Most users don't provide it. Most nulls are legitimate. |
Now apply a global threshold of WARN at 30%, BLOCK at 70%. What happens?
Your payments pipeline never alerts on a null amount — because 1 null in 1,000 rows is only 0.1%, well below your 30% WARN threshold. The bad row silently loads into your financial ledger.
Your events pipeline alerts constantly — because 40% null user_agent
is completely normal, but your global threshold fires a WARN every single run. Your team
learns to ignore it. And then they start ignoring the alerts that actually matter.
Alert fatigue is a data quality failure mode. When every pipeline run produces a WARN, engineers stop looking at WARNs. The threshold that was supposed to protect your pipeline becomes the reason nobody notices when something real breaks.
A payments pipeline and an events pipeline are fundamentally different things. They have different schemas, different upstream owners, different downstream consumers, and different tolerances for data imperfection. Applying the same quality rules to both is a category error.
What you actually want is a priority chain:

1. Inline thresholds passed in the API request (highest priority)
2. Per-source thresholds saved in the dashboard
3. Global defaults, for anything not explicitly overridden
This means you can configure tight rules for payments, relaxed rules for
events, and leave everything else at sensible defaults — without writing
any custom validation logic.
Here's how you'd configure the four sources from the table above, each with thresholds that actually reflect the reality of that data:
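One way to sketch that configuration is a plain mapping from source to its threshold overrides. The key names below follow the inline `options={"thresholds": ...}` shape from the API; the exact values are illustrative, chosen to match the table above, not prescriptive defaults.

```python
# Per-source null-rate overrides mirroring the table above.
# Thresholds fire when the measured rate exceeds them, so a
# block threshold of 0.0 means any null at all blocks the load.
SOURCE_THRESHOLDS = {
    "payments": {                  # financial data: zero tolerance
        "null_rate_warn": 0.0,
        "null_rate_block": 0.0,    # a single null blocks
    },
    "orders": {
        "null_rate_warn": 0.05,    # digital orders may lack an address
        "null_rate_block": 0.20,
    },
    "events": {
        "null_rate_warn": 0.45,    # ~40% missing user_agent is normal
        "null_rate_block": 0.90,
    },
    "user_profiles": {
        "null_rate_warn": 0.60,    # phone_number is optional
        "null_rate_block": 0.90,
    },
}
```

Everything not listed here falls through to the global defaults, so you only write overrides for the sources that genuinely deviate.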
Now the payments pipeline fires a BLOCK the moment a single null slips through.
The events pipeline runs clean at 40% null user_agent — no alert,
no noise. And your team only gets paged when something is actually wrong.
Pass thresholds directly in the request body. These take highest priority and are useful for one-off overrides or pipeline-specific logic baked into your orchestration code.
```python
import datascreeniq as dsiq

client = dsiq.Client()

report = client.screen(
    rows,
    source="payments",
    options={
        "thresholds": {
            "null_rate_warn": 0.01,       # WARN if > 1% nulls
            "null_rate_block": 0.02,      # BLOCK if > 2% nulls
            "type_mismatch_warn": 0.0,    # WARN on any type mismatch
            "type_mismatch_block": 0.01,  # BLOCK if > 1% mismatch
        }
    },
)

report.raise_on_block()
load_to_warehouse(rows)
```
Set thresholds once in the dashboard Thresholds tab, scoped to a specific source.
These persist across all future requests for that source — no code change required.
Every pipeline call to source="payments" automatically applies the
saved payments thresholds, falling back to your global defaults for
anything not explicitly overridden.
The priority chain matters here. If you set thresholds inline in the API request and have saved per-source thresholds for the same source, the inline values win. This lets pipeline code override saved settings for one-off runs — useful for backfills, migrations, or testing new thresholds before committing them.
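The resolution order amounts to a layered dict merge: start from the global defaults, overlay the saved per-source values, then overlay anything passed inline. A minimal sketch of that logic, with illustrative threshold values (not the product's actual internals):

```python
# Priority chain: inline > saved per-source > global defaults.
# Later layers win for any key they set.
GLOBAL_DEFAULTS = {"null_rate_warn": 0.30, "null_rate_block": 0.70}

SAVED_OVERRIDES = {
    "payments": {"null_rate_warn": 0.0, "null_rate_block": 0.0},
    "events": {"null_rate_warn": 0.45, "null_rate_block": 0.90},
}

def resolve_thresholds(source, inline=None):
    """Merge the three layers, lowest priority first."""
    merged = dict(GLOBAL_DEFAULTS)
    merged.update(SAVED_OVERRIDES.get(source, {}))
    merged.update(inline or {})
    return merged

# A backfill can loosen the saved payments rule for one run:
backfill = resolve_thresholds("payments", inline={"null_rate_block": 0.05})

# A normal run just picks up the saved per-source values:
normal = resolve_thresholds("payments")
```

Because the merge is per-key, an inline override only replaces the thresholds it names; everything else still comes from the saved or global layers.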
Every check category supports per-source overrides:
| Check | Threshold keys | Default |
|---|---|---|
| Null rate | null_rate_warn · null_rate_block | 30% · 70% |
| Type mismatch | type_mismatch_warn · type_mismatch_block | 5% · 20% |
| Empty string rate | empty_string_warn · empty_string_block | 20% · 50% |
| Duplicate rate | duplicate_warn · duplicate_block | 10% · 50% |
| Health score | health_warn · health_block | 0.8 · 0.5 |
| Row count anomaly | row_count_min_warn · row_count_max_warn | 3× deviation |
| Timestamp staleness | timestamp_stale_warn_hours · timestamp_stale_block_hours | 24h · 72h |
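However the thresholds are resolved, each check reduces to the same decision: compare a measured rate against the warn and block values and pick a severity. A sketch of that mapping, using the "fire when the rate exceeds the threshold" semantics from the comments in the API example above:

```python
def classify(rate, warn, block):
    """Map a measured failure rate to OK / WARN / BLOCK.

    Check BLOCK first so the stricter verdict always wins.
    """
    if rate > block:
        return "BLOCK"
    if rate > warn:
        return "WARN"
    return "OK"

# With the global defaults (WARN > 30%, BLOCK > 70%):
classify(0.001, warn=0.30, block=0.70)  # payments' 0.1% nulls pass silently
classify(0.40, warn=0.30, block=0.70)   # events' normal 40% fires a WARN
```

This is exactly the failure mode from the opening example: under global defaults the first call returns "OK" even though any null in payments is a bug, and the second returns "WARN" even though a 40% null user_agent is healthy.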
There's a legitimate debate about whether quality thresholds belong in code (version controlled, reviewable, reproducible) or in a dashboard (fast to change, no deploy required). The answer depends on how quickly your data changes and who owns the thresholds.
If your data engineering team owns thresholds, code is probably right — you want PRs, history, and the ability to roll back. If your analytics team or a data owner needs to tune thresholds in response to upstream changes without going through a deploy cycle, a dashboard is the right tool.
DataScreenIQ supports both: inline thresholds in code via the API, and saved overrides in the dashboard. The priority chain means you can use either or both without conflicts. What you don't want is to be forced to choose one model and apply it to every source in your pipeline.
If you're setting up thresholds for the first time, don't try to get everything right immediately. Start with global defaults and let your pipelines run for a week. Look at which sources produce the most WARN noise — those are your candidates for per-source overrides. Look at which sources handle financial or identity data — those are your candidates for tighter-than-default rules.
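That "watch for WARN noise" step is a simple aggregation over your run history. A sketch with made-up run outcomes (not real tool output), counting WARNs per source to surface override candidates:

```python
from collections import Counter

# One (source, outcome) tuple per pipeline run over the
# observation week. Illustrative data only.
runs = [
    ("events", "WARN"), ("events", "WARN"), ("events", "WARN"),
    ("orders", "OK"), ("orders", "WARN"),
    ("payments", "OK"), ("payments", "OK"),
]

warn_counts = Counter(src for src, outcome in runs if outcome == "WARN")
noisiest = warn_counts.most_common(1)[0][0]  # -> "events"
```

A source at the top of that list is either genuinely broken or, more often, a candidate for a relaxed per-source threshold.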
The goal isn't zero alerts. It's alerts that mean something.
Free tier: 500K rows/month. Dashboard included. No credit card.
Get a free API key →