Data Engineering · Monitoring · Observability · Data Quality

Data Pipeline Monitoring: What to Track and Why

March 20, 2026 · 7 min read · By Hybridyn Engineering

The most dangerous data pipeline failure is the one nobody notices. A pipeline that crashes loudly gets fixed in minutes. A pipeline that silently produces wrong data can corrupt weeks of business decisions before anyone realizes.

Monitoring is not optional. It's the difference between "our data is reliable" and "we think our data is probably fine."

What to Monitor

1. Pipeline Execution Status

The basics: did the pipeline run? Did it succeed or fail? How long did it take?

Metrics to track:

  • Run status — success, failure, running, skipped
  • Duration — how long each run takes (and trend over time)
  • Schedule adherence — did it start on time?
  • Retry count — how many retries before success (or ultimate failure)

Why it matters: Duration creep is an early warning sign. If a pipeline that took 5 minutes now takes 25, something changed — data volume grew, a query became inefficient, or a source system is slower.
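A duration-creep check like the one described above can be sketched in a few lines. This is an illustrative example, not a specific product feature; the function name, window size, and 2x threshold are assumptions you would tune to your own pipelines.

```python
# Sketch: flag duration creep by comparing the latest run against a
# rolling baseline of recent runs. Names and thresholds are illustrative.
from statistics import mean

def duration_alert(durations_min, threshold=2.0, window=10):
    """Return True if the latest run took more than `threshold` times
    the average of the previous `window` runs."""
    if len(durations_min) < 2:
        return False  # not enough history to form a baseline
    baseline = mean(durations_min[-(window + 1):-1])
    return durations_min[-1] > threshold * baseline

# A pipeline that usually takes ~5 minutes suddenly takes 25: alert fires.
history = [5, 5, 6, 5, 5, 25]
print(duration_alert(history))  # True
```

Comparing against a rolling window rather than a fixed number means the baseline adapts as data volume grows gradually, while still catching sudden jumps.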

2. Data Volume

How much data did the pipeline process?

Metrics to track:

  • Row counts — input rows, output rows, filtered rows, error rows
  • Byte volume — data size processed
  • Row count ratios — output/input ratio (should be stable across runs)

Why it matters: A pipeline that usually processes 50,000 rows but suddenly processes 500 has a problem. Either the source is empty (extraction failure) or the filter logic changed. Both need investigation.
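A minimal volume check covering both failure modes above (empty source and ratio drift) might look like this. The function name and the 50% tolerance are assumptions for illustration.

```python
# Sketch: check row-count sanity for a single run. The expected
# output/input ratio would come from historical runs.
def volume_check(input_rows, output_rows, expected_ratio, tolerance=0.5):
    """Return a list of issues: empty input, or an output/input ratio
    that deviates from the historical norm by more than `tolerance`."""
    issues = []
    if input_rows == 0:
        issues.append("zero input rows: possible extraction failure")
        return issues
    ratio = output_rows / input_rows
    if abs(ratio - expected_ratio) > tolerance * expected_ratio:
        issues.append(
            f"ratio {ratio:.2f} deviates from expected {expected_ratio:.2f}"
        )
    return issues

# 500 input rows where 50,000 is normal, and only 50 survive the filter:
print(volume_check(500, 50, expected_ratio=0.9))
```

A healthy run (say 50,000 in, 45,000 out against an expected ratio of 0.9) returns an empty list, so the check is cheap to run after every load.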

3. Data Freshness

How old is the data in your destination?

Metrics to track:

  • Last successful run — when did the pipeline last complete successfully?
  • Data latency — time between event occurrence and availability in the destination
  • SLA compliance — is data available within the promised window?

Why it matters: A dashboard showing "updated 3 hours ago" when the SLA is 15 minutes means downstream consumers are making decisions on stale data.
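A freshness/SLA check reduces to comparing the dataset's last-update timestamp against the promised window. This sketch uses a hypothetical `freshness_status` helper; the `now` parameter exists only to make the check testable.

```python
# Sketch: is a dataset within its freshness SLA?
from datetime import datetime, timedelta, timezone

def freshness_status(last_updated, sla_minutes, now=None):
    """Return (is_fresh, lag_minutes) for a dataset given its SLA."""
    now = now or datetime.now(timezone.utc)
    lag_minutes = (now - last_updated).total_seconds() / 60
    return lag_minutes <= sla_minutes, lag_minutes

# The scenario from the text: updated 3 hours ago, SLA is 15 minutes.
now = datetime(2026, 3, 20, 12, 0, tzinfo=timezone.utc)
fresh, lag = freshness_status(now - timedelta(minutes=180), 15, now=now)
print(fresh, lag)  # False 180.0
```

Running this per critical dataset gives you exactly the freshness table described in the dashboard section below: dataset, lag, and SLA pass/fail.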

4. Data Quality

Is the data correct?

Metrics to track:

  • Null rates — percentage of nulls in critical columns
  • Uniqueness — are IDs actually unique?
  • Value distribution — has the distribution of values changed unexpectedly?
  • Schema compliance — does the output match the expected schema?
  • Referential integrity — do foreign keys resolve?

Why it matters: A pipeline can succeed (status: green) while producing garbage data. Quality checks catch issues that execution monitoring misses.
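Two of the quality checks above, null rates and ID uniqueness, can be expressed as a small batch-level report. The `quality_report` helper and its field names are hypothetical; real deployments would typically use a framework like Great Expectations or dbt tests instead.

```python
# Sketch: null rates on critical columns plus ID uniqueness,
# computed over a batch of dict-shaped rows.
def quality_report(rows, id_field, critical_fields):
    """Return a dict of quality metrics for one batch of rows."""
    report = {}
    total = len(rows)
    for field in critical_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        report[f"null_rate:{field}"] = nulls / total if total else 0.0
    ids = [r.get(id_field) for r in rows]
    report["ids_unique"] = len(ids) == len(set(ids))
    return report

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": None},  # duplicate id AND a null in a critical column
]
print(quality_report(rows, "id", ["email"]))
```

The key property is that this runs *after* a "successful" load and can fail the run even when execution status is green.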

Alert Strategy

Not everything needs an alert. Too many alerts cause fatigue — the team ignores them all, including the critical ones.

Tier 1: Page Immediately

  • Pipeline failure after all retries exhausted
  • SLA breach on critical pipelines
  • Data volume drops to zero
  • Source system connection failure

Tier 2: Alert Within Business Hours

  • Duration exceeding 2x normal
  • Data volume deviation greater than 50%
  • Quality rule failures above threshold
  • Schedule delays over 30 minutes

Tier 3: Report Weekly

  • Gradual duration increases
  • Minor quality score changes
  • Retry frequency trends
  • Resource utilization patterns
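The three tiers above boil down to a routing decision: where does each alert go? A sketch, with hypothetical channel names standing in for your paging service, chat integration, and reporting job:

```python
# Sketch: route an alert to a destination by tier. Channel names
# ("pager", "team-channel", "weekly-digest") are placeholders.
def route_alert(alert):
    """Tier 1 pages immediately, Tier 2 goes to the team channel
    during business hours, everything else lands in the weekly digest."""
    tier = alert["tier"]
    if tier == 1:
        return "pager"
    if tier == 2:
        return "team-channel"
    return "weekly-digest"

print(route_alert({"tier": 1, "msg": "data volume dropped to zero"}))  # pager
```

Keeping the routing table this explicit makes it easy to audit which conditions can actually wake someone up, which is the main defense against alert fatigue.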

Building a Monitoring Dashboard

A good pipeline monitoring dashboard answers three questions at a glance:

  1. Is everything running? — Status overview of all pipelines
  2. Is anything failing? — Failed pipelines with error details
  3. Is the data fresh? — Freshness indicators for critical datasets

Essential Dashboard Panels

Pipeline Health Overview: A grid showing every pipeline with green/yellow/red status. Sort by severity so problems are always at the top.

Recent Failures: A table of failed runs with pipeline name, error message, failure time, and a link to logs. This is what the team looks at first thing every morning.

Duration Trends: A line chart showing pipeline duration over the past 30 days. Spikes and upward trends are immediately visible.

Data Freshness: A table showing each critical dataset, when it was last updated, and whether it meets its SLA.

Common Monitoring Mistakes

1. Only Monitoring Execution

"The pipeline ran successfully" is not the same as "the data is correct." A pipeline can extract zero rows, transform nothing, load an empty table, and report success. You need data quality checks, not just execution checks.

2. Alert Fatigue

If the team gets 50 alerts a day, they'll ignore all of them. Be ruthless about alert thresholds. A quality rule that fires on 0.1% null rate in a column that's naturally nullable isn't helpful.

3. No Historical Context

"The pipeline took 45 minutes" means nothing without context. Is that normal? Was it 5 minutes last week? Monitoring without baselines is guessing.

4. Missing Dependencies

Pipeline B depends on Pipeline A. Pipeline A fails. Pipeline B runs on stale data and "succeeds." Without dependency-aware monitoring, B's success hides A's failure.
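Dependency-aware status can be computed by propagating upstream failures downstream. This is an illustrative sketch, assuming a simple `deps` map of pipeline name to upstream names; it distinguishes a pipeline that failed itself from one that "succeeded" on stale upstream data.

```python
# Sketch: a pipeline is only truly healthy if it succeeded AND every
# upstream dependency did. `deps` maps pipeline -> list of upstreams.
def effective_status(pipeline, statuses, deps):
    """Return 'failed', 'stale', or 'success' for a pipeline."""
    if statuses.get(pipeline) != "success":
        return "failed"
    for upstream in deps.get(pipeline, []):
        if effective_status(upstream, statuses, deps) != "success":
            return "stale"  # ran, but on outdated upstream data
    return "success"

# The scenario from the text: A fails, B "succeeds" on stale data.
statuses = {"A": "failure", "B": "success"}
deps = {"B": ["A"]}
print(effective_status("B", statuses, deps))  # stale
```

Surfacing "stale" as its own state on the dashboard is what stops B's green checkmark from hiding A's failure.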

Monitoring in Practice

F-Pulse includes built-in pipeline monitoring with run history, duration tracking, status alerts, and row count monitoring. For enterprise teams, D-Pulse adds an SLA engine, platform health scores, and integration with Prometheus and Grafana for custom dashboards.

The key insight is that monitoring should be built into the pipeline tool, not bolted on afterward. When monitoring is an afterthought, gaps are inevitable. When it's native, every pipeline gets baseline coverage automatically.

Summary

Monitor pipeline execution (did it run?), data volume (did it process the right amount?), data freshness (is it timely?), and data quality (is it correct?). Set alerts at three tiers to avoid fatigue. Build dashboards that answer the three critical questions. And choose tools where monitoring is built in, not bolted on.

Build data pipelines visually

F-Pulse is open source. Try it in under 3 minutes.