A single null-rate threshold applied across every source in your pipeline sounds sensible. In practice it means your quality gate is simultaneously too strict for some sources and dangerously permissive for others. Here's the fix.
Most data quality tools let you set a threshold — say, WARN if null rate exceeds 30%, BLOCK if it exceeds 70%. That sounds reasonable until you look at what's actually flowing through a real pipeline.
Consider a typical setup with four sources:
| Source | Critical field | Acceptable null rate | Why |
|---|---|---|---|
| payments | amount | 0% — any null is a bug | Financial data. A null amount means money went somewhere unrecorded. |
| orders | shipping_address | ≤ 5% | Digital orders don't always have a shipping address. Some nulls are expected. |
| events | user_agent | ≤ 40% | Bot traffic, server-side events, and API calls rarely send a user agent. High nulls are normal. |
| user_profiles | phone_number | ≤ 60% | Optional field. Most users don't provide it. Most nulls are legitimate. |
Now apply a global threshold of WARN at 30%, BLOCK at 70%. What happens?
Your payments pipeline never alerts on a null amount — because 1 null in 1,000 rows is only 0.1%, well below your 30% WARN threshold. The bad row silently loads into your financial ledger.
Your events pipeline alerts constantly — because 40% null user_agent
is completely normal, but your global threshold fires a WARN every single run. Your team
learns to ignore it. And then they start ignoring the alerts that actually matter.
Alert fatigue is a data quality failure mode. When every pipeline run produces a WARN, engineers stop looking at WARNs. The threshold that was supposed to protect your pipeline becomes the reason nobody notices when something real breaks.
A payments pipeline and an events pipeline are fundamentally different things. They have different schemas, different upstream owners, different downstream consumers, and different tolerances for data imperfection. Applying the same quality rules to both is a category error.
What you actually want is a priority chain:

1. Inline thresholds passed in the API request (highest priority)
2. Per-source thresholds saved in the dashboard
3. Global defaults, for anything not explicitly overridden
This means you can configure tight rules for payments, relaxed rules for
events, and leave everything else at sensible defaults — without writing
any custom validation logic.
Here's how you'd configure the four sources from the table above, each with thresholds that actually reflect the reality of that data:
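One way to sketch that configuration is a plain mapping from source to its threshold overrides. The key names below follow the inline `options={"thresholds": ...}` shape from the API; the exact values are illustrative, chosen to match the table above, not prescriptive defaults.

```python
# Per-source null-rate overrides mirroring the table above.
# Thresholds fire when the measured rate exceeds them, so a
# block threshold of 0.0 means any null at all blocks the load.
SOURCE_THRESHOLDS = {
    "payments": {                  # financial data: zero tolerance
        "null_rate_warn": 0.0,
        "null_rate_block": 0.0,    # a single null blocks
    },
    "orders": {
        "null_rate_warn": 0.05,    # digital orders may lack an address
        "null_rate_block": 0.20,
    },
    "events": {
        "null_rate_warn": 0.45,    # ~40% missing user_agent is normal
        "null_rate_block": 0.90,
    },
    "user_profiles": {
        "null_rate_warn": 0.60,    # phone_number is optional
        "null_rate_block": 0.90,
    },
}
```

Everything not listed here falls through to the global defaults, so you only write overrides for the sources that genuinely deviate.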
Now the payments pipeline fires a BLOCK the moment a single null slips through.
The events pipeline runs clean at 40% null user_agent — no alert,
no noise. And your team only gets paged when something is actually wrong.
Pass thresholds directly in the request body. These take highest priority and are useful for one-off overrides or pipeline-specific logic baked into your orchestration code.
```python
import datascreeniq as dsiq

client = dsiq.Client()

report = client.screen(
    rows,
    source="payments",
    options={
        "thresholds": {
            "null_rate_warn": 0.01,       # WARN if > 1% nulls
            "null_rate_block": 0.02,      # BLOCK if > 2% nulls
            "type_mismatch_warn": 0.0,    # WARN on any type mismatch
            "type_mismatch_block": 0.01,  # BLOCK if > 1% mismatch
        }
    },
)

report.raise_on_block()
load_to_warehouse(rows)
```
Set thresholds once in the dashboard Thresholds tab, scoped to a specific source.
These persist across all future requests for that source — no code change required.
Every pipeline call to source="payments" automatically applies the
saved payments thresholds, falling back to your global defaults for
anything not explicitly overridden.
The priority chain matters here. If you set thresholds inline in the API request and have saved per-source thresholds for the same source, the inline values win. This lets pipeline code override saved settings for one-off runs — useful for backfills, migrations, or testing new thresholds before committing them.
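The resolution order amounts to a layered dict merge: start from the global defaults, overlay the saved per-source values, then overlay anything passed inline. A minimal sketch of that logic, with illustrative threshold values (not the product's actual internals):

```python
# Priority chain: inline > saved per-source > global defaults.
# Later layers win for any key they set.
GLOBAL_DEFAULTS = {"null_rate_warn": 0.30, "null_rate_block": 0.70}

SAVED_OVERRIDES = {
    "payments": {"null_rate_warn": 0.0, "null_rate_block": 0.0},
    "events": {"null_rate_warn": 0.45, "null_rate_block": 0.90},
}

def resolve_thresholds(source, inline=None):
    """Merge the three layers, lowest priority first."""
    merged = dict(GLOBAL_DEFAULTS)
    merged.update(SAVED_OVERRIDES.get(source, {}))
    merged.update(inline or {})
    return merged

# A backfill can loosen the saved payments rule for one run:
backfill = resolve_thresholds("payments", inline={"null_rate_block": 0.05})

# A normal run just picks up the saved per-source values:
normal = resolve_thresholds("payments")
```

Because the merge is per-key, an inline override only replaces the thresholds it names; everything else still comes from the saved or global layers.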
Every check category supports per-source overrides:
| Check | Threshold keys | Default |
|---|---|---|
| Null rate | null_rate_warn · null_rate_block | 30% · 70% |
| Type mismatch | type_mismatch_warn · type_mismatch_block | 5% · 20% |
| Empty string rate | empty_string_warn · empty_string_block | 20% · 50% |
| Duplicate rate | duplicate_warn · duplicate_block | 10% · 50% |
| Health score | health_warn · health_block | 0.8 · 0.5 |
| Row count anomaly | row_count_min_warn · row_count_max_warn | 3× deviation |
| Timestamp staleness | timestamp_stale_warn_hours · timestamp_stale_block_hours | 24h · 72h |
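However the thresholds are resolved, each check reduces to the same decision: compare a measured rate against the warn and block values and pick a severity. A sketch of that mapping, using the "fire when the rate exceeds the threshold" semantics from the comments in the API example above:

```python
def classify(rate, warn, block):
    """Map a measured failure rate to OK / WARN / BLOCK.

    Check BLOCK first so the stricter verdict always wins.
    """
    if rate > block:
        return "BLOCK"
    if rate > warn:
        return "WARN"
    return "OK"

# With the global defaults (WARN > 30%, BLOCK > 70%):
classify(0.001, warn=0.30, block=0.70)  # payments' 0.1% nulls pass silently
classify(0.40, warn=0.30, block=0.70)   # events' normal 40% fires a WARN
```

This is exactly the failure mode from the opening example: under global defaults the first call returns "OK" even though any null in payments is a bug, and the second returns "WARN" even though a 40% null user_agent is healthy.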
There's a legitimate debate about whether quality thresholds belong in code (version controlled, reviewable, reproducible) or in a dashboard (fast to change, no deploy required). The answer depends on how quickly your data changes and who owns the thresholds.
If your data engineering team owns thresholds, code is probably right — you want PRs, history, and the ability to roll back. If your analytics team or a data owner needs to tune thresholds in response to upstream changes without going through a deploy cycle, a dashboard is the right tool.
DataScreenIQ supports both: inline thresholds in code via the API, and saved overrides in the dashboard. The priority chain means you can use either or both without conflicts. What you don't want is to be forced to choose one model and apply it to every source in your pipeline.
If you're setting up thresholds for the first time, don't try to get everything right immediately. Start with global defaults and let your pipelines run for a week. Look at which sources produce the most WARN noise — those are your candidates for per-source overrides. Look at which sources handle financial or identity data — those are your candidates for tighter-than-default rules.
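That "watch for WARN noise" step is a simple aggregation over your run history. A sketch with made-up run outcomes (not real tool output), counting WARNs per source to surface override candidates:

```python
from collections import Counter

# One (source, outcome) tuple per pipeline run over the
# observation week. Illustrative data only.
runs = [
    ("events", "WARN"), ("events", "WARN"), ("events", "WARN"),
    ("orders", "OK"), ("orders", "WARN"),
    ("payments", "OK"), ("payments", "OK"),
]

warn_counts = Counter(src for src, outcome in runs if outcome == "WARN")
noisiest = warn_counts.most_common(1)[0][0]  # -> "events"
```

A source at the top of that list is either genuinely broken or, more often, a candidate for a relaxed per-source threshold.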
The goal isn't zero alerts. It's alerts that mean something.
Free tier: 500K rows/month. Dashboard included. No credit card.
Get a free API key →