Handling Missing Sensor Readings Without Triggering False Violations

In municipal water SCADA environments, telemetry interruptions are an operational certainty: communication dropouts, PLC polling failures, and scheduled calibration cycles all create gaps in a continuous data stream. The exact engineering task on this page is to take a parameter series that contains NaN holes and decide — deterministically, and with an audit trail — which gaps are harmless polling artifacts and which are genuine compliance failures, so that a missing reading never fabricates a Maximum Contaminant Level (MCL) exceedance or a monitoring-frequency violation. This is the reference implementation of safe absence-handling within the parent Monitoring Gap Detection Algorithms section, written for the Python automation builders who own the ingestion service and the environmental compliance teams who sign the resulting reports. Get it wrong in one direction and you self-report a violation the utility never committed; get it wrong in the other and a real missed sample slips through to a primacy-agency audit unreported.

The distinguishing rule is simple to state and unforgiving in practice: EPA compliance frameworks such as the Disinfectants and Disinfection Byproducts Rule (DBPR) and the Revised Total Coliform Rule (RTCR) do not permit arbitrary statistical imputation or forward-filling for compliance determination. A gap must be classified, not filled, before it reaches the Violation Detection & Rule Engine Logic evaluation stage. The regulatory side of the classification is anchored in 40 CFR Part 141, which defines the minimum monitoring frequencies and reporting windows a missed reading is measured against; the short telemetry tolerance used below is an operational setting, not a regulatory value.
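The failure mode this rule guards against is easy to reproduce. In this minimal illustration (the residual values are made up), a single forward-fill turns two missing polls into two apparently valid readings that nothing downstream can distinguish from real measurements:

```python
import numpy as np
import pandas as pd

# Free-chlorine residual (mg/L) with two missed polls; illustrative values only.
raw = pd.Series([0.8, np.nan, np.nan, 0.9])

filled = raw.ffill()

print(raw.isna().sum())     # 2 explicit gaps the classifier could act on
print(filled.isna().sum())  # 0: the gaps are gone
print(filled.tolist())      # [0.8, 0.8, 0.8, 0.9], two fabricated readings
```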

Prerequisites & Environment Setup

The implementation targets Python 3.10+ and uses pandas for the vectorized gap arithmetic and numpy for the mutually exclusive state selection. Pin the versions explicitly so that the groupby(...).transform broadcasting and timedelta handling used in the classifier behave identically in development, CI, and production. Schema enforcement with pydantic is recommended once routed records enter the compliance submission queue; python-dateutil arrives transitively as a pandas dependency and is available for timezone-aware boundary math.

This module runs strictly downstream of ingestion. It assumes its input has already been coerced to a monotonic, timezone-aware UTC axis by the upstream time-series alignment strategies module — the same normalization described in Aligning Irregular SCADA Timestamps to UTC. It does not open a live Modbus or OPC UA session itself, so no OT-network egress is required at this layer.

python3 -m venv .venv && source .venv/bin/activate
pip install "pandas==2.2.*" "numpy==1.26.*" "pydantic==2.7.*"
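classify_gaps can only see a gap that is present as a NaN row; a poll that never arrived leaves no row at all. Producing those rows is the upstream alignment module's job, but a minimal sketch of the idea follows, assuming readings are already UTC, deduplicated, and snapped to a fixed polling grid. The helper name reindex_to_cadence and the 15-minute default are hypothetical:

```python
import pandas as pd


def reindex_to_cadence(df: pd.DataFrame, time_col: str = "timestamp", freq: str = "15min") -> pd.DataFrame:
    """Hypothetical helper: expose missing polls as explicit NaN rows.

    Assumes readings are UTC, deduplicated, and already snapped to the polling
    grid (the upstream alignment module's job).
    """
    indexed = df.set_index(time_col).sort_index()

    # reindex() would silently drop any reading that is not exactly on a grid
    # slot, so refuse off-grid input instead of losing data.
    if not (indexed.index == indexed.index.floor(freq)).all():
        raise ValueError("readings are not aligned to the polling grid; align upstream first")

    grid = pd.date_range(indexed.index.min(), indexed.index.max(), freq=freq)
    # Empty grid slots become NaN; reindex() never fabricates values.
    out = indexed.reindex(grid)
    out.index.name = time_col
    return out.reset_index()
```

Raising on off-grid timestamps is a deliberate choice: the alternative, letting reindex() discard them, is itself a silent data loss.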

Step-by-Step Implementation

The pipeline separates data-quality validation from regulatory evaluation across three stages: normalize the incoming frame, classify each contiguous gap against duration thresholds, then route each non-valid record to an operational action. No stage ever writes a synthetic value into the series.

Step 1 — Normalize ingestion and detect out-of-sequence records

SCADA historians frequently exhibit clock skew across distributed RTUs and PLCs. Before any compliance logic executes, all timestamps must be coerced to UTC, deduplicated, and sorted chronologically. Out-of-sequence records are flagged for manual reconciliation rather than interpolated — and the flag is computed against the original arrival order, before the rows are sorted, so the evidence of disorder is preserved.

import pandas as pd


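def normalize_scada_ingestion(raw_df: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Convert timestamps to UTC, flag out-of-sequence arrivals, drop duplicate
    timestamps, and enforce monotonic ordering. Naive timestamps are rejected."""
    df = raw_df.copy()

    # Reject naive timestamps instead of coercing them: pd.to_datetime(..., utc=True)
    # would silently read local wall-clock time as UTC. Upstream alignment is
    # expected to deliver a timezone-aware datetime column.
    if not isinstance(df[time_col].dtype, pd.DatetimeTZDtype):
        raise TypeError(f"{time_col!r} must be timezone-aware; naive timestamps are rejected, not coerced")
    df[time_col] = df[time_col].dt.tz_convert("UTC")

    # Detect out-of-sequence records in the original arrival order, before sorting.
    df["is_out_of_sequence"] = df[time_col].diff() < pd.Timedelta(0)

    # Drop duplicate timestamps (first arrival wins), then enforce monotonic ordering.
    df = df.drop_duplicates(subset=time_col, keep="first").sort_values(time_col).reset_index(drop=True)
    return df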
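Why the flag must be computed before sorting can be checked directly with pandas. In this illustration (the timestamps are made up), the late 00:15 record is visible in arrival order and invisible once the series is sorted:

```python
import pandas as pd

# Arrival order from the historian: the 00:15 reading arrives after 00:30.
arrivals = pd.Series(pd.to_datetime(
    ["2026-07-03T00:00Z", "2026-07-03T00:30Z", "2026-07-03T00:15Z", "2026-07-03T00:45Z"]
))

flag_in_arrival_order = arrivals.diff() < pd.Timedelta(0)
flag_after_sorting = arrivals.sort_values().diff() < pd.Timedelta(0)

print(flag_in_arrival_order.tolist())  # [False, False, True, False]
print(flag_after_sorting.any())        # False: sorting erased the evidence
```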

Step 2 — Classify each contiguous gap by duration

Once the data is normalized, the engine identifies contiguous null sequences, measures the duration of each gap, and assigns a compliance state. Duration is measured from the first to the last missing timestamp in a block, so a single isolated null has a duration of zero. A gap no longer than the telemetry tolerance is a routine polling artifact; a gap beyond the tolerance but at or below the monitoring-violation threshold is held for review; a gap beyond the threshold is a formal violation. The step deliberately avoids DataFrame.interpolate(), since statistical imputation would violate primacy-agency audit requirements and could mask a real exceedance.

import numpy as np
import pandas as pd
from enum import Enum


class ComplianceState(Enum):
    VALID = "VALID"
    TELEMETRY_GAP = "TELEMETRY_GAP"
    MONITORING_VIOLATION = "MONITORING_VIOLATION"
    PENDING_REVIEW = "PENDING_REVIEW"


def classify_gaps(
    df: pd.DataFrame,
    value_col: str,
    time_col: str = "timestamp",
    telemetry_tolerance_min: int = 15,
    monitoring_violation_threshold_min: int = 1440,
) -> pd.DataFrame:
    """Identify contiguous null sequences and classify each against the telemetry
    tolerance and the regulatory monitoring-violation threshold."""
    df = df.copy()  # work on a copy: the caller's raw frame is audit evidence and must not be mutated
    tolerance = pd.Timedelta(minutes=telemetry_tolerance_min)
    violation_threshold = pd.Timedelta(minutes=monitoring_violation_threshold_min)

    # Boolean mask of missing readings.
    is_missing = df[value_col].isna()

    # Assign a stable group id to every contiguous run of identical missingness.
    group_id = is_missing.ne(is_missing.shift()).cumsum()

    # Gap duration is the span of each contiguous null block (first to last
    # missing timestamp), broadcast back to every row in that block.
    df["gap_duration"] = pd.Timedelta(0)
    df.loc[is_missing, "gap_duration"] = (
        df[is_missing]
        .groupby(group_id[is_missing])[time_col]
        .transform(lambda block: block.max() - block.min())
    )

    # Mutually exclusive classification conditions, evaluated row by row.
    conditions = [
        ~is_missing,
        is_missing & (df["gap_duration"] <= tolerance),
        is_missing & (df["gap_duration"] > tolerance) & (df["gap_duration"] <= violation_threshold),
        is_missing & (df["gap_duration"] > violation_threshold),
    ]
    choices = [
        ComplianceState.VALID,
        ComplianceState.TELEMETRY_GAP,
        ComplianceState.PENDING_REVIEW,
        ComplianceState.MONITORING_VIOLATION,
    ]

    df["compliance_state"] = np.select(conditions, choices, default=ComplianceState.PENDING_REVIEW)
    return df

The classifier maps each reading into one of four mutually exclusive compliance states. A present reading is always VALID; once a reading is absent, the duration of its gap relative to the telemetry tolerance and the monitoring-violation threshold places it in TELEMETRY_GAP, PENDING_REVIEW, or MONITORING_VIOLATION, and only a gap past the violation threshold (24 hours by default) becomes a formal violation.
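The contiguous-block logic is the part most worth seeing on a small input. This standalone illustration (made-up values on a 15-minute cadence) repeats the same ne/shift/cumsum labeling and per-block span that classify_gaps uses:

```python
import numpy as np
import pandas as pd

timestamps = pd.Series(pd.date_range("2026-07-03T00:00Z", periods=7, freq="15min"))
values = pd.Series([1.0, np.nan, 1.1, np.nan, np.nan, np.nan, 1.2])

is_missing = values.isna()
# A new label starts wherever missingness flips, so each contiguous run shares one id.
group_id = is_missing.ne(is_missing.shift()).cumsum()

# First-to-last span of each null block, broadcast to every row in the block.
span = (
    timestamps[is_missing]
    .groupby(group_id[is_missing])
    .transform(lambda block: block.max() - block.min())
)

print(group_id.tolist())  # [1, 2, 3, 4, 4, 4, 5]
print(span.tolist())      # isolated null -> 0 min; each row of the 3-row block -> 30 min
```

Note the convention this exposes: the three-row block reports 30 minutes, while the interval between the bracketing valid readings (00:30 to 01:30) is 60 minutes. Either convention can work, provided the thresholds are calibrated against the same one.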

Step 3 — Route each record and suppress false positives

When a gap exceeds the telemetry tolerance but does not exceed the formal violation threshold, the system suppresses automated violation generation and routes the record to a fallback queue. This prevents false submissions to the Safe Drinking Water Information System (SDWIS) while keeping an audit payload for every suppressed or deferred record. Valid rows are skipped, tolerated telemetry gaps are logged and closed, review cases open a CMMS ticket, and only confirmed violations advance toward primacy-agency notification.

from dataclasses import dataclass, field
from typing import Any, Dict, List

import pandas as pd

# ComplianceState is the enum defined with classify_gaps in Step 2; it must be in scope here.


@dataclass
class ComplianceRoute:
    sensor_id: str
    state: ComplianceState
    gap_minutes: float
    action: str
    audit_payload: Dict[str, Any] = field(default_factory=dict)


def route_compliance_records(df: pd.DataFrame, sensor_id: str) -> List[ComplianceRoute]:
    """Suppress false positives and route each non-valid record to its workflow."""
    routes: List[ComplianceRoute] = []
    for _, row in df.iterrows():
        if row["compliance_state"] == ComplianceState.VALID:
            continue

        gap_min = row["gap_duration"].total_seconds() / 60

        if row["compliance_state"] == ComplianceState.TELEMETRY_GAP:
            # Auto-resolve: log as a known polling artifact and suppress the violation.
            routes.append(ComplianceRoute(
                sensor_id=sensor_id,
                state=row["compliance_state"],
                gap_minutes=gap_min,
                action="SUPPRESS_AND_LOG",
                audit_payload={"reason": "SCADA_POLLING_TIMEOUT", "resolved": True},
            ))
        elif row["compliance_state"] == ComplianceState.PENDING_REVIEW:
            # Fallback: route to the compliance dashboard and open a CMMS ticket.
            routes.append(ComplianceRoute(
                sensor_id=sensor_id,
                state=row["compliance_state"],
                gap_minutes=gap_min,
                action="ROUTE_TO_REVIEW",
                audit_payload={"requires_manual_substitution": True, "sdwis_blocked": True},
            ))
        else:
            # Formal violation: generate an SDWIS-ready violation record.
            routes.append(ComplianceRoute(
                sensor_id=sensor_id,
                state=row["compliance_state"],
                gap_minutes=gap_min,
                action="FLAG_VIOLATION",
                audit_payload={"requires_primacy_agency_notification": True},
            ))

    return routes

route_compliance_records maps each non-valid state to its own action: benign gaps are suppressed and logged, borderline gaps block SDWIS submission until reviewed, and only confirmed violations reach the primacy agency. Verified contiguous windows continue on to MCL Exceedance Logic Implementation, while confirmed monitoring violations carry a severity contribution into the Severity Scoring Models.
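Because route_compliance_records emits one route per missing row, a four-row outage yields four routes and, if each opens a ticket, four CMMS tickets. A minimal sketch of collapsing a classified frame to one record per contiguous gap follows. The helper name one_record_per_gap and its output columns are hypothetical; it assumes the gap_duration and compliance_state columns that classify_gaps produces:

```python
import pandas as pd


def one_record_per_gap(classified: pd.DataFrame, value_col: str, time_col: str = "timestamp") -> pd.DataFrame:
    """Hypothetical helper: collapse every contiguous null block to a single row.

    classify_gaps broadcasts one gap_duration and one compliance_state across
    each block, so taking the first value per block loses nothing.
    """
    is_missing = classified[value_col].isna()
    block_id = is_missing.ne(is_missing.shift()).cumsum()
    gaps = classified[is_missing].assign(gap_block=block_id[is_missing])
    return gaps.groupby("gap_block", as_index=False).agg(
        gap_start=(time_col, "min"),
        gap_end=(time_col, "max"),
        missing_rows=(time_col, "count"),
        gap_duration=("gap_duration", "first"),
        compliance_state=("compliance_state", "first"),
    )
```

Each returned row can then open a single ticket covering the whole outage, with gap_start and gap_end giving the reviewer the exact window to substitute.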

Configuration Reference

The two duration thresholds are the only tuning knobs that change behaviour, and neither should be a hard-coded literal in production — both are governed per parameter by the Threshold Tuning Frameworks, because a jittery raw-water sensor and a stable finished-water residual cannot be held to the same spacing.

Parameter	Default	Unit	Purpose
`telemetry_tolerance_min`	`15`	minutes	Upper bound of a routine polling gap; at or below this, a null is a `TELEMETRY_GAP` and is auto-suppressed
`monitoring_violation_threshold_min`	`1440`	minutes	Duration (24 h) past which a gap becomes a formal `MONITORING_VIOLATION`
`value_col`	—	column	Name of the parameter column carrying `NaN` where telemetry was absent
`time_col`	`timestamp`	column	Name of the timezone-aware UTC timestamp column (a regular column, not the DataFrame index)
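A minimal sketch of sourcing both knobs per parameter instead of from call-site literals follows. The GapThresholds dataclass, the parameter keys, and the numeric values are all hypothetical placeholders, not regulatory values; in production they would be loaded from the threshold-tuning configuration:

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GapThresholds:
    """Hypothetical per-parameter configuration for classify_gaps."""
    telemetry_tolerance_min: int
    monitoring_violation_threshold_min: int

    def __post_init__(self) -> None:
        # An inverted pair would let violation-length gaps classify as
        # TELEMETRY_GAP; an equal pair leaves no PENDING_REVIEW band.
        if not 0 < self.telemetry_tolerance_min < self.monitoring_violation_threshold_min:
            raise ValueError("require 0 < telemetry_tolerance_min < monitoring_violation_threshold_min")


# Placeholder values for illustration only.
GAP_THRESHOLDS = {
    "raw_water_turbidity": GapThresholds(telemetry_tolerance_min=30, monitoring_violation_threshold_min=1440),
    "finished_free_chlorine": GapThresholds(telemetry_tolerance_min=15, monitoring_violation_threshold_min=1440),
}


def thresholds_for(parameter: str) -> dict:
    """Fail loudly for an unconfigured parameter rather than falling back to defaults."""
    try:
        return asdict(GAP_THRESHOLDS[parameter])
    except KeyError:
        raise KeyError(f"no gap thresholds configured for {parameter!r}") from None
```

The returned keys match the classify_gaps keyword arguments, so a call site reduces to classify_gaps(df, value_col="ntu", **thresholds_for("raw_water_turbidity")).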

The four compliance-flag codes that leave this module form the shared vocabulary downstream stages route on:

Compliance state	Condition	Routing action
`VALID`	Reading present	Forward to exceedance evaluation
`TELEMETRY_GAP`	`duration <= tolerance`	`SUPPRESS_AND_LOG` as a polling artifact
`PENDING_REVIEW`	`tolerance < duration <= threshold`	`ROUTE_TO_REVIEW`; SDWIS blocked
`MONITORING_VIOLATION`	`duration > threshold`	`FLAG_VIOLATION`; notify primacy agency

Verification & Testing

Confirm the classifier’s boundary behaviour before trusting it in a reporting path. The critical property is that a gap exactly at the tolerance stays suppressed while a gap one interval past it escalates to review — and that a real, extended outage always reaches MONITORING_VIOLATION regardless of how many rows it spans.

import numpy as np
import pandas as pd


def test_gap_classification_boundaries():
    # 15-minute cadence: one short gap (<= tolerance) and one long gap (> tolerance).
    idx = pd.date_range("2026-07-03T00:00Z", periods=8, freq="15min")
    values = [1.0, np.nan, 1.1, np.nan, np.nan, np.nan, np.nan, 1.2]
    df = pd.DataFrame({"timestamp": idx, "ntu": values})

    out = classify_gaps(df, value_col="ntu", telemetry_tolerance_min=15,
                        monitoring_violation_threshold_min=60)

    states = out["compliance_state"].tolist()
    assert states[0] == ComplianceState.VALID
    # Single isolated null spans 0 min -> within tolerance -> suppressed.
    assert states[1] == ComplianceState.TELEMETRY_GAP
    # Four contiguous nulls span 45 min -> review, not a violation, at this threshold.
    assert states[4] == ComplianceState.PENDING_REVIEW

    routes = route_compliance_records(out, sensor_id="TURB-01")
    assert all(r.action != "FLAG_VIOLATION" for r in routes)
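The test above does not exercise the exact tolerance boundary or the violation branch. One way to pin those down is a scalar oracle that mirrors the np.select conditions in classify_gaps and can be compared against its output row by row. The names below are hypothetical, and the block counts assume the first-to-last duration convention on a 15-minute cadence, where n contiguous nulls span (n - 1) * 15 minutes:

```python
from datetime import timedelta
from typing import Optional

CADENCE = timedelta(minutes=15)


def expected_state(gap: Optional[timedelta], tolerance: timedelta, threshold: timedelta) -> str:
    """Hypothetical oracle mirroring classify_gaps; gap is None for a present reading."""
    if gap is None:
        return "VALID"
    if gap <= tolerance:
        return "TELEMETRY_GAP"
    if gap <= threshold:
        return "PENDING_REVIEW"
    return "MONITORING_VIOLATION"


def block_span(n_nulls: int) -> timedelta:
    """First-to-last span of n contiguous nulls on the polling cadence."""
    return (n_nulls - 1) * CADENCE


tolerance, threshold = timedelta(minutes=15), timedelta(minutes=60)
for n_nulls in (1, 2, 3, 5, 6):
    print(n_nulls, expected_state(block_span(n_nulls), tolerance, threshold))
```

Two nulls (exactly 15 minutes) stay TELEMETRY_GAP, three escalate to PENDING_REVIEW, five (exactly 60 minutes) remain in review, and six cross into MONITORING_VIOLATION. Comparing out["compliance_state"].map(lambda s: s.value) against this oracle turns those boundaries into regression tests.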

Acceptance criteria before this module feeds a compliance report:

All timestamps are timezone-aware UTC on entry; naive timestamps are rejected, not coerced.
No interpolate(), ffill(), or bfill() runs anywhere in the classification or routing path.
A gap exactly at telemetry_tolerance_min classifies as TELEMETRY_GAP, and one interval beyond it as PENDING_REVIEW.
Contiguous null blocks share a single gap_duration, so a multi-row outage is not split into several short gaps.
Records flagged is_out_of_sequence (computed from arrival order) are routed to manual reconciliation, never silently sorted away.
Tolerance and threshold values are sourced from the threshold-tuning configuration, not literals in the call site.
Every PENDING_REVIEW record blocks SDWIS submission until a documented manual substitution is recorded.
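None of the steps above yet routes is_out_of_sequence rows anywhere. A minimal sketch of a reconciliation queue follows; reconciliation_queue and the MANUAL_RECONCILIATION action are hypothetical names. The flagged rows stay in the series, since they carry real readings; the queue only records that someone must confirm their timestamps:

```python
import pandas as pd


def reconciliation_queue(normalized: pd.DataFrame, sensor_id: str, time_col: str = "timestamp") -> list[dict]:
    """Hypothetical helper: turn rows flagged is_out_of_sequence into manual
    reconciliation work items without removing them from the series."""
    flagged = normalized.loc[normalized["is_out_of_sequence"], time_col]
    return [
        {"sensor_id": sensor_id, "timestamp": ts.isoformat(), "action": "MANUAL_RECONCILIATION"}
        for ts in flagged
    ]
```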

Troubleshooting & Gotchas

Phantom multiple gaps from a single outage. If contiguous nulls are classified row-by-row without grouping, one 6-hour outage becomes dozens of tiny gaps that each fall under tolerance and are all suppressed — hiding a real violation. Confirm the group_id = is_missing.ne(is_missing.shift()).cumsum() grouping is intact and that gap_duration is broadcast across the whole block.
Naive timestamps reintroducing DST error. A field device emitting local wall-clock time can appear to travel backward across a fall-back transition, producing a negative diff() that is misread as out-of-sequence, or a duplicate slot that inflates a gap. Normalize to UTC upstream and reject naive input; use Python’s datetime module for timezone-aware boundary math.
Silent forward-fill upstream. A single ffill() in a preprocessing step erases the very nulls this module exists to find, so every gap classifies as VALID. Assert that the series reaching classify_gaps still contains explicit NaN where telemetry was absent.
Confusing a continuous-sensor gap with a missed required sample. A missed required compliance sample is a monitoring violation by definition, independent of duration, and must not be bucketed with routine sensor dropouts. Resolve parameter identity and required-sample status through the SDWA MCL Reference Mapping rather than inferring it from the data.
Unmapped violation codes reaching the report. A MONITORING_VIOLATION needs a standardized regulatory code before submission; route it through Violation Code Classification so the SDWIS record is well-formed.
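An upstream forward-fill cannot be proven from the filled series alone, but it often leaves a fingerprint: a run of bit-identical values far longer than the sensor's normal noise allows. A minimal heuristic sketch follows; the function name and the max_repeats default are hypothetical, and a hit should prompt a review of the preprocessing path rather than be treated as proof:

```python
import pandas as pd


def suspicious_flatlines(series: pd.Series, max_repeats: int = 8) -> pd.Series:
    """Hypothetical heuristic: flag readings inside runs of identical values
    longer than max_repeats, a common fingerprint of an upstream ffill()."""
    # New run id wherever the value changes (NaN != NaN, so NaN rows never merge).
    run_id = series.ne(series.shift()).cumsum()
    run_length = run_id.map(run_id.value_counts())
    return series.notna() & (run_length > max_repeats)
```

A quantized or genuinely stable sensor can trip this check, so max_repeats belongs in the same per-parameter configuration as the gap thresholds.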

Related pages

Monitoring Gap Detection Algorithms: parent section and the full gap-detection pipeline
Violation Detection & Rule Engine Logic: the domain this evaluation runs inside
Python Logic for Detecting MCL Exceedances in Real Time: where verified contiguous windows are evaluated
Threshold Tuning Frameworks: per-parameter tolerance and threshold calibration
Aligning Irregular SCADA Timestamps to UTC: the upstream normalization this module depends on