Violation Code Classification: Engineering Deterministic SDWA Compliance Pipelines

Violation code classification is the deterministic translation layer that turns continuous utility telemetry into enforceable regulatory outcomes. This topic sits within the Core Architecture & SDWA Compliance Taxonomy and is written for the water utility operators who own the historian, the environmental compliance teams who sign the reports, and the municipal Python developers who build the automation between them. The subsystem converts raw SCADA historian data, laboratory information management system (LIMS) exports, and field technician logs into standardized EPA violation identifiers. To stay regulatorily defensible across distributed treatment plants and distribution networks, the classification logic must interpret Safe Drinking Water Act (SDWA) mandates uniformly regardless of facility topology or data source — a value measured at a filtered-water turbidimeter and a value hand-keyed from a bench sheet must reach the same code through the same auditable path.

The four-phase pipeline: each phase emits the artifact the next depends on, with missing data diverted to a dead-letter queue and sealed verdicts routed to reporting.

Regulatory & Protocol Foundation

The subsystem exists to satisfy a finite body of federal regulation, and every code it emits must trace back to a specific rule. The controlling framework is the EPA Safe Drinking Water Act and its implementing regulations in 40 CFR Part 141, which define the contaminant limits, monitoring frequencies, and reporting obligations the pipeline enforces. Rather than embedding these rules as scattered constants, a durable classifier resolves them at runtime against the SDWA MCL Reference Mapping, so a regulatory change is a data change, not a code deployment.

EPA violation codes follow a structured numeric convention, and each family carries an implicit set of thresholds, averaging shapes, and primacy-agency notification triggers. The classifier’s job is to decide which family a given record belongs to before it decides on a specific code.

Code family	Meaning	Typical trigger	Governing rule set
`01`–`02`	Maximum Contaminant Level (MCL) violation	Single-sample or averaged exceedance	Nitrate, TCR, DBPR
`11`–`12`	Treatment technique (TT) violation	Filtration/disinfection performance breach	Surface Water Treatment Rules
`21`–`27`	Maximum Residual Disinfectant Level (MRDL)	Running-average disinfectant exceedance	Stage 1/2 DBPR
`03` / `51`	Monitoring & reporting (M/R) violation	Missed sample or late submission	All monitoring schedules
`1A`–`1B`	Lead & Copper action-level exceedance	90th-percentile tap result over action level	Lead and Copper Rule
`60`+	Public-notification failure	Missed Tier 1/2/3 notice deadline	Public Notification Rule

The regulations most relevant to real-time automation each contribute a distinct averaging shape that the classifier must implement precisely: the Total Coliform Rule and its successor, the Revised Total Coliform Rule, drive presence/absence microbial logic; the Surface Water Treatment Rules constrain the continuous turbidity readings delivered through Modbus TCP parsing workflows; the Stage 2 Disinfectants and Disinfection Byproducts Rule imposes a locational running annual average on total trihalomethanes (TTHM) and haloacetic acids (HAA5); and the Lead and Copper Rule replaces per-sample comparison with a 90th-percentile action level across tap sites. Because these constraints govern legal evidence, the classifier is a strictly passive consumer of the upstream SCADA data ingestion stream and never writes back toward the control network.

Architecture & Design Decisions

The central design decision is to keep regulatory knowledge out of the execution code. The classifier is a declarative rule engine: contaminant thresholds, averaging periods, minimum reporting levels (MRLs), and conditional modifiers are externalized into version-controlled manifests and resolved at runtime, so compliance staff can adjust a limit without a redeploy and every classification can name the exact rule version it applied.
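
A minimal sketch of that runtime resolution is shown below. The `Rule` shape, the `MANIFEST` layout, the `2024.1` version label, and the `load_rules` and `resolve` helpers are all assumptions made for illustration; the limits are the commonly cited federal values and should be confirmed against 40 CFR Part 141 before use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    contaminant_id: str
    limit_mg_l: float
    averaging_basis: str  # "single_sample", "lraa", or "p90"
    rule_version: str


# Hypothetical manifest, inlined for illustration. A real deployment would parse
# this from a version-controlled file; confirm every limit against 40 CFR Part 141.
MANIFEST = {
    "rule_version": "2024.1",  # hypothetical version label
    "rules": [
        {"contaminant_id": "NITRATE", "limit_mg_l": 10.0, "averaging_basis": "single_sample"},
        {"contaminant_id": "TTHM", "limit_mg_l": 0.080, "averaging_basis": "lraa"},
        {"contaminant_id": "HAA5", "limit_mg_l": 0.060, "averaging_basis": "lraa"},
        {"contaminant_id": "LEAD", "limit_mg_l": 0.015, "averaging_basis": "p90"},
    ],
}


def load_rules(manifest: dict) -> dict[str, Rule]:
    """Build a lookup keyed by contaminant_id, stamped with the manifest version."""
    version = manifest["rule_version"]
    rules: dict[str, Rule] = {}
    for entry in manifest["rules"]:
        rule = Rule(rule_version=version, **entry)
        if rule.contaminant_id in rules:
            raise ValueError(f"Duplicate rule for {rule.contaminant_id!r}")
        rules[rule.contaminant_id] = rule
    return rules


def resolve(rules: dict[str, Rule], contaminant_id: str) -> Rule:
    """Fail loudly on an unmapped contaminant instead of defaulting to a limit."""
    if contaminant_id not in rules:
        raise LookupError(f"No rule mapped for {contaminant_id!r}")
    return rules[contaminant_id]
```

Because each `Rule` carries the manifest's `rule_version`, any verdict built from it can name the version it applied without a separate lookup.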

A second decision is the shape of the record that enters and leaves the subsystem. An inbound record is expected to already carry the enriched compliance contract used across the taxonomy — a contaminant_id, an EPA method_code, a timezone-aware UTC sample_ts, a numeric value, and a quality_flag — having crossed the perimeter defined in Security Boundary Design. An outbound record adds the classification verdict: an EPA violation_code, the rule_version applied, the calculated_value (which for an averaged parameter differs from any single reading), and the input vector’s source_hash. Downstream, those verdicts feed the Violation Detection Rule Engine and are mapped onto operator-facing escalation by Translating EPA Violation Codes to Internal Alerts.
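
The two record shapes can be sketched as frozen dataclasses. The field names come from the contract described above; the class names, the types, and the UTC check in `__post_init__` are assumptions for illustration rather than the taxonomy's actual definitions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class InboundRecord:
    contaminant_id: str
    method_code: str
    sample_ts: datetime
    value: float | None
    quality_flag: str

    def __post_init__(self) -> None:
        # utcoffset() is None for naive datetimes, so they are rejected as well.
        if self.sample_ts.utcoffset() != timedelta(0):
            raise ValueError("sample_ts must be timezone-aware UTC")


@dataclass(frozen=True)
class OutboundVerdict:
    contaminant_id: str
    violation_code: str
    rule_version: str
    calculated_value: float  # may be an average, not any single reading
    source_hash: str  # digest of the input vector that produced the verdict
```

Freezing both shapes means a record cannot be altered between the perimeter and the classifier, or between the classifier and the reporting tier.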

The third decision separates two independent violation vectors. An exceedance (an analytical result over a limit) and a monitoring lapse (a sample that was never collected, or collected late) are different codes with different deadlines, and conflating them is a common source of misreporting. The classifier therefore cross-references sampling timestamps against the mandated collection intervals produced by Monitoring Frequency Scheduling, and emits M/R codes on a path entirely separate from the threshold-evaluation path.
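
A minimal sketch of the monitoring/reporting path follows. It assumes the scheduler hands the classifier a list of half-open UTC collection windows; `Window`, `missed_windows`, and `monitoring_codes` are hypothetical names, and the M/R code is passed in rather than hard-coded.

```python
from __future__ import annotations

from datetime import datetime

Window = tuple[datetime, datetime]  # half-open [start, end), both in UTC


def missed_windows(windows: list[Window], sample_times: list[datetime]) -> list[Window]:
    """Return every mandated window that contains no collected sample."""
    return [
        (start, end)
        for start, end in windows
        if not any(start <= ts < end for ts in sample_times)
    ]


def monitoring_codes(windows: list[Window], sample_times: list[datetime], mr_code: str) -> list[dict]:
    """Emit one M/R finding per missed window; analytical values are never consulted."""
    return [
        {"violation_code": mr_code, "window_start": start, "window_end": end}
        for start, end in missed_windows(windows, sample_times)
    ]
```

Nothing in this path reads a measured value, which is what keeps a monitoring lapse from being confused with an exceedance.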

The data contract: an enriched inbound record enters the classifier, the rule manifest and MCL mapping feed it from above, and the sealed verdict leaves on two separate paths — exceedance versus monitoring/reporting.

Phase-by-Phase Implementation

For municipal developers, the classifier is implemented as a stateful, vectorized pipeline. Row-by-row iteration adds latency at scale and makes deterministic, reproducible results harder to guarantee, so production architectures use Polars, pandas, or Apache Spark with the four phases below. Each phase produces an artifact the next depends on: a normalized frame, a set of evaluated thresholds, a vectorized verdict, and a routed, sealed record.

Phase 1 — Deterministic data ingestion & preprocessing

Before any regulatory logic executes, incoming telemetry passes through a strict normalization pipeline. Raw measurements from disparate sources are aggregated into a unified, time-indexed schema. Values are never silently interpolated into compliance calculations; instead, gaps are marked MISSING_DATA and routed for explicit handling, because regulatory determinations rely on collected samples rather than estimated values. Any operational backfill used for trending must be kept clearly distinct from compliance data — a discipline kept auditable by following formal data-quality frameworks such as the EPA Guidance for Quality Assurance Project Plans.

Implementation steps:

Align every stream to a common compliance clock (UTC) using deterministic resampling rules, coordinated with the time-series alignment strategies module.
Normalize concentration units (mg/L, µg/L, NTU, pH) to EPA reporting standards using version-controlled, idempotent conversion factors.
Flag sensor drift, calibration gaps, and transmission errors rather than interpolating; route flagged gaps to explicit handling.

from __future__ import annotations

import math
from datetime import timezone
from typing import Any

# Conversion factors are version-controlled and applied idempotently.
# Mass concentrations normalize to mg/L; NTU and pH are already in their
# reporting units, so their factor is 1.0.
UNIT_TO_MG_L: dict[str, float] = {"mg/l": 1.0, "ug/l": 0.001, "ntu": 1.0, "ph": 1.0}


def normalize_record(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a unit-normalized, UTC-stamped record with an explicit quality flag."""
    value = raw.get("value")
    unit = str(raw.get("unit", "")).lower()

    if value is None:
        # No sample collected, or a transmission gap.
        quality, normalized = "MISSING_DATA", None
    elif isinstance(value, float) and not math.isfinite(value):
        # NaN/Inf is a bad reading rather than a missing one, so it is quarantined.
        quality, normalized = "BAD", None
    else:
        factor = UNIT_TO_MG_L.get(unit)
        if factor is None:
            raise ValueError(f"Unversioned unit encountered: {unit!r}")
        quality, normalized = raw.get("quality_flag", "GOOD"), round(value * factor, 6)

    ts = raw["sample_ts"]
    if ts.tzinfo is None:
        raise ValueError("sample_ts must be timezone-aware")

    return {
        "contaminant_id": raw["contaminant_id"],
        "method_code": raw["method_code"],
        "value": normalized,
        "sample_ts": ts.astimezone(timezone.utc),
        "quality_flag": quality,
    }

Phase 2 — Rule evaluation & threshold mapping

The classification engine applies a declarative rule matrix to the normalized frame. Each parameter is evaluated against statutory thresholds, with logic branching on contaminant type, exposure duration, and regulatory context. Nitrate has an acute MCL where one confirmed exceedance triggers immediate classification; lead is governed by a 90th-percentile action level across tap-sampling sites rather than a per-sample MCL; and TTHM/HAA5 are judged by the locational running annual average (LRAA) at each monitoring site. The LRAA for quarter $q$ at a site is the mean of the four most recent quarterly averages:

$$\text{LRAA}_{q} = \frac{1}{4}\sum_{i=q-3}^{q} \bar{C}_{i}$$
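
As a worked instance of the formula, assume hypothetical TTHM quarterly averages of 0.070, 0.085, 0.090, and 0.078 mg/L at a single monitoring site (illustrative values, not measured data):

$$\text{LRAA}_{q} = \frac{0.070 + 0.085 + 0.090 + 0.078}{4} = \frac{0.323}{4} = 0.08075\ \text{mg/L}$$

Against the 0.080 mg/L TTHM MCL this site exceeds the limit even though two of its four quarters are individually below it, which is why the comparison has to be made on the running average rather than on any single quarter.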

Threshold evaluation branches on contaminant type into three averaging shapes that converge on one EPA code, while the monitoring/reporting path runs independently on the sampling timestamp.

Implementation steps:

Resolve the threshold, averaging basis, and code family for each contaminant_id from the reference mapping, pinned to a rule_version.
Apply the averaging shape the rule demands — single sample, percentile, or running annual average — rather than a blanket per-row comparison.
Emit a monitoring/reporting code when the sampling timestamp falls outside its mandated interval, independent of the analytical result.

from __future__ import annotations

from statistics import quantiles


def classify_lraa(quarterly_means: list[float], mcl: float, code: str) -> str | None:
    """Return an EPA MCL code if the locational running annual average exceeds the MCL."""
    if len(quarterly_means) < 4:
        return None  # Insufficient window: defer, never assume compliance.
    lraa = sum(quarterly_means[-4:]) / 4
    return code if lraa > mcl else None


def classify_lead_action_level(tap_results: list[float], action_level: float, code: str) -> str | None:
    """Lead is judged by the 90th percentile across tap sites, not any single sample."""
    if len(tap_results) < 5:
        return None  # Too few tap sites: defer, never assume compliance.
    # quantiles() sorts its input; with n=10 the last of the nine cut points is the
    # 90th percentile. "inclusive" interpolates linearly between order statistics,
    # so confirm it matches the percentile convention the primacy agency applies.
    p90 = quantiles(tap_results, n=10, method="inclusive")[8]
    return code if p90 > action_level else None
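
The third averaging shape, the single-sample comparison used for an acute MCL such as nitrate, is not covered by the two functions above. A hedged sketch follows; `classify_single_sample`, the `confirmed` argument, and the `EXCEEDANCE_ELIGIBLE` set are hypothetical names, and how a confirmation sample is established is left to the governing rule.

```python
from __future__ import annotations

# Assumption: only these quality flags may produce an exceedance code.
EXCEEDANCE_ELIGIBLE = {"GOOD", "SUSPECT"}


def classify_single_sample(
    value: float | None,
    mcl: float,
    code: str,
    quality_flag: str = "GOOD",
    confirmed: bool = False,
) -> str | None:
    """Return the code only for a confirmed, eligible reading strictly above the MCL."""
    if value is None or quality_flag not in EXCEEDANCE_ELIGIBLE:
        return None  # Missing or quarantined data can never yield an exceedance.
    if not confirmed:
        return None  # Unconfirmed results are deferred, not classified.
    return code if value > mcl else None
```

A reading exactly at the MCL returns `None` here because the comparison is strict; the rounding applied before that comparison has to follow the primacy agency's convention.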

Phase 3 — Vectorized pipeline execution

Threshold comparisons, rolling aggregations, and window functions run as native array operations, not Python loops, so the classifier sustains throughput across multi-year historical datasets. Rule matrices, EPA code mappings, and calculation windows are parsed from version-controlled manifests at initialization, so a regulatory update is applied by reloading the manifest rather than redeploying code.

Implementation steps:

Load the rule manifest once at startup and validate incoming frames against an expected schema before evaluation begins.
Compute rolling averages with native window functions (pandas rolling() or Polars group_by_dynamic()) rather than explicit iteration; see the pandas Window Functions documentation.
Fail fast on type mismatches, null propagation, or malformed timestamps and route the offending rows to a dead-letter queue.

import pandas as pd


def vectorized_lraa(frame: pd.DataFrame, mcl: float, code: str) -> pd.DataFrame:
    """Assign an EPA code where the 4-quarter running average of a site exceeds the MCL.

    Expects columns: site_id, sample_ts (UTC, tz-aware), value.
    """
    frame = frame.sort_values("sample_ts").set_index("sample_ts")
    # Calendar-aware quarter-end bins; the "QE" alias needs pandas 2.2+ (older releases use "Q").
    quarterly = (
        frame.groupby("site_id")["value"]
        .resample("QE")
        .mean()
        .reset_index()
    )
    # min_periods=4 leaves lraa as NaN until four quarterly means exist for the site.
    quarterly["lraa"] = (
        quarterly.groupby("site_id")["value"]
        .rolling(window=4, min_periods=4)
        .mean()
        .reset_index(level=0, drop=True)
    )
    # NaN compares False, so an incomplete window yields no code here. Read a NaN
    # lraa as "no determination", not as compliance.
    quarterly["violation_code"] = quarterly["lraa"].gt(mcl).map({True: code, False: None})
    return quarterly
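
The fail-fast step can be sketched with the standard library alone, as below. This is a row-level illustration of the routing decision, not a substitute for a DataFrame schema library; `REQUIRED`, `row_error`, and `partition` are hypothetical names, and the in-memory list stands in for a real dead-letter queue.

```python
from __future__ import annotations

import math
from datetime import datetime, timedelta

# Assumed schema for one row of the frame passed to the LRAA step.
REQUIRED = {"site_id": str, "sample_ts": datetime, "value": float}


def row_error(row: dict) -> str | None:
    """Return the first schema problem found in a row, or None if it is clean."""
    for name, expected in REQUIRED.items():
        if row.get(name) is None:
            return f"missing field: {name}"
        if not isinstance(row[name], expected):
            return f"type mismatch: {name}"
    if not math.isfinite(row["value"]):
        return "non-finite value"
    if row["sample_ts"].utcoffset() != timedelta(0):
        return "sample_ts is not timezone-aware UTC"
    return None


def partition(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split rows into (accepted, dead_letter) and keep the reason with each reject."""
    accepted: list[dict] = []
    dead_letter: list[dict] = []
    for row in rows:
        reason = row_error(row)
        if reason is None:
            accepted.append(row)
        else:
            dead_letter.append({"row": row, "reason": reason})
    return accepted, dead_letter
```

Storing the rejection reason next to the offending row lets the dead-letter queue be triaged without re-running validation.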

Phase 4 — Validation, edge cases & routing

Once a verdict is computed it is validated, sealed, and routed. Schema enforcement with a declarative library rejects malformed or out-of-range verdicts before they reach a report; a SHA-256 digest of the input vector binds each classification to the exact data that produced it; and only then is the record forwarded to the alerting and reporting tiers.

Implementation steps:

Validate the outbound verdict’s field types, code membership, and UTC timestamp with Pydantic.
Seal the verdict with the input vector’s source_hash and append it to a write-once audit store.
Route the sealed verdict to the deterministic alerting path, never mutating an existing row.

import hashlib
from datetime import datetime, timedelta

from pydantic import BaseModel, Field, field_validator

VALID_CODES = {"01", "02", "03", "11", "21", "1A", "1B", "51", "60"}


class Verdict(BaseModel):
    contaminant_id: str
    violation_code: str
    rule_version: str
    calculated_value: float
    sample_ts: datetime
    source_hash: str = Field(min_length=64, max_length=64)

    @field_validator("violation_code")
    @classmethod
    def _known_code(cls, code: str) -> str:
        if code not in VALID_CODES:
            raise ValueError(f"Unknown EPA violation code: {code}")
        return code

    @field_validator("sample_ts")
    @classmethod
    def _utc_only(cls, ts: datetime) -> datetime:
        # utcoffset() is None for naive datetimes, so they fail this check too.
        if ts.utcoffset() != timedelta(0):
            raise ValueError("sample_ts must be timezone-aware UTC")
        return ts


def seal_and_route(input_vector: bytes, verdict: dict) -> Verdict:
    """Bind the verdict to its input and validate before it is appended to the ledger."""
    verdict = {**verdict, "source_hash": hashlib.sha256(input_vector).hexdigest()}
    record = Verdict(**verdict)
    with open("classification_ledger.jsonl", "a", encoding="utf-8") as ledger:
        ledger.write(record.model_dump_json() + "\n")
    return record
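
The digest is only reproducible if the input vector is serialized the same way every time. One possible canonical encoding is sketched below; `canonical_bytes` and `source_hash` are hypothetical helpers, and the choice of sorted-key compact JSON is an assumption, not a documented requirement.

```python
from __future__ import annotations

import hashlib
import json
from datetime import datetime, timezone


def _encode(obj: object) -> str:
    """Serialize datetimes as UTC ISO-8601 and reject anything ambiguous."""
    if isinstance(obj, datetime):
        if obj.utcoffset() is None:
            raise TypeError("naive datetime cannot be hashed deterministically")
        return obj.astimezone(timezone.utc).isoformat()
    raise TypeError(f"Unsupported type: {type(obj).__name__}")


def canonical_bytes(records: list[dict]) -> bytes:
    """Sorted keys and fixed separators make the bytes independent of dict ordering."""
    text = json.dumps(records, sort_keys=True, separators=(",", ":"), default=_encode)
    return text.encode("utf-8")


def source_hash(records: list[dict]) -> str:
    """SHA-256 hex digest of the canonical encoding of the input vector."""
    return hashlib.sha256(canonical_bytes(records)).hexdigest()
```

The order of records within the list still affects the digest, so the input vector should be sorted on a stable key (for example `sample_ts`) before hashing.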

Validation, Quality Flags & Edge Cases

Every reading carries a quality flag that survives the entire pipeline, so a classification decision can always be traced to the trustworthiness of its inputs. A MISSING_DATA reading can still produce a valid monitoring/reporting code even though it can never produce an exceedance code.

Quality flag	Meaning	Eligible for exceedance code?
`GOOD`	In-range reading from a calibrated sensor	Yes
`SUSPECT`	Passed range check but failed a rate-of-change/deadband test	Yes, with review
`MISSING_DATA`	No sample collected or transmission gap	No — may trigger M/R code
`BAD`	`NaN`/`Inf` or out-of-physical-range reading	No — quarantined
`INTERPOLATED`	Operational backfill for trending only	No — excluded from compliance
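
The eligibility column above can be sketched as a lookup that every reading passes through before threshold evaluation. The flag names come from the table; the route labels and the `route_for` and `eligible_for_exceedance` helpers are hypothetical.

```python
from enum import Enum


class QualityFlag(str, Enum):
    GOOD = "GOOD"
    SUSPECT = "SUSPECT"
    MISSING_DATA = "MISSING_DATA"
    BAD = "BAD"
    INTERPOLATED = "INTERPOLATED"


# Hypothetical route labels, one per row of the table.
ROUTES = {
    QualityFlag.GOOD: "exceedance",
    QualityFlag.SUSPECT: "exceedance_with_review",
    QualityFlag.MISSING_DATA: "monitoring_reporting",
    QualityFlag.BAD: "quarantine",
    QualityFlag.INTERPOLATED: "excluded",
}


def route_for(flag: str) -> str:
    """Map a quality flag to its path; an unknown flag raises ValueError."""
    return ROUTES[QualityFlag(flag)]


def eligible_for_exceedance(flag: str) -> bool:
    """Only GOOD and SUSPECT readings may reach threshold evaluation."""
    return route_for(flag) in {"exceedance", "exceedance_with_review"}
```

Raising on an unrecognized flag keeps a new or misspelled flag from silently falling through to the exceedance path.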

The pipeline maintains a per-parameter classification state so a partial averaging window never silently reports “compliant.” A parameter is PENDING until its window is complete, becomes CLASSIFIED once a verdict is sealed, and enters DEFERRED when data is insufficient — the reporting tier is told explicitly that no determination could be made rather than being left to infer compliance.

Per-parameter states: PENDING while the window fills, CLASSIFIED once a verdict is sealed, and DEFERRED when data is insufficient — so the reporting tier is never left to infer compliance.
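
One way to make those states explicit is a small pure function, sketched below. The transition conditions are an interpretation of the description above (in particular, that DEFERRED applies once the window has closed without enough samples), and `window_state` with its four arguments is a hypothetical interface.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    CLASSIFIED = "CLASSIFIED"
    DEFERRED = "DEFERRED"


def window_state(
    samples_collected: int,
    samples_required: int,
    window_closed: bool,
    verdict_sealed: bool,
) -> State:
    """Derive the per-parameter state from the window's fill level and sealing status."""
    if samples_collected >= samples_required and verdict_sealed:
        return State.CLASSIFIED
    if window_closed and samples_collected < samples_required:
        # The window ended short of data: report "no determination" explicitly.
        return State.DEFERRED
    # Still filling, or complete but not yet sealed.
    return State.PENDING
```

No branch returns a compliant result by default, so the reporting tier only ever sees a sealed verdict or an explicit non-determination.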

Several edge cases demand explicit handling:

Timezone and DST drift. Field devices often emit local wall-clock time. A record stamped in a zone that observes daylight saving can appear to travel backward across a fall-back transition and corrupt an averaging window, so every sample_ts is normalized to UTC on ingress.
Leap-year and calendar boundaries. Quarterly and annual windows that span a calendar boundary — especially a leap year — must use calendar-aware resampling, never fixed 90-day arithmetic, or a boundary sample lands in the wrong quarter.
Exact-threshold crossings. A reading exactly at the MCL must be classified with the same rounding convention the primacy agency uses; a mismatched rounding rule flips the verdict at the boundary.
Partial and duplicated frames. A replayed message must resolve to its existing verdict through an idempotent, source_hash-keyed write; it must never append a second row or inflate a sample count.
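
The first three edge cases can be illustrated with the standard library. This is a sketch under stated assumptions: fixed offsets stand in for a real time-zone database, and the half-up rounding with a per-rule number of decimals in `exceeds` is a placeholder for whatever convention the primacy agency actually applies. All function names are hypothetical.

```python
from __future__ import annotations

from datetime import date, datetime, timedelta, timezone
from decimal import ROUND_HALF_UP, Decimal

# Fixed offsets standing in for US Eastern daylight and standard time.
EDT = timezone(timedelta(hours=-4))
EST = timezone(timedelta(hours=-5))


def to_utc(ts: datetime) -> datetime:
    """Normalize an aware timestamp to UTC; naive wall-clock time is rejected."""
    if ts.utcoffset() is None:
        raise ValueError("sample_ts must be timezone-aware")
    return ts.astimezone(timezone.utc)


def calendar_quarter(d: date) -> tuple[int, int]:
    """Return (year, quarter) from the calendar month, never from day counts."""
    return d.year, (d.month - 1) // 3 + 1


def exceeds(value: float, limit: float, decimals: int) -> bool:
    """Round to the reporting precision first, then compare strictly.

    ROUND_HALF_UP and `decimals` are assumptions; use the agency's convention.
    """
    step = Decimal(1).scaleb(-decimals)
    rounded = Decimal(str(value)).quantize(step, rounding=ROUND_HALF_UP)
    return rounded > Decimal(str(limit))
```

On a fall-back night, 01:30 occurs twice on the wall clock; the two readings are indistinguishable without their offsets but are one hour apart after `to_utc`. In the leap year 2024, March 31 is 90 days after January 1, so fixed 90-day buckets would push it into the second bucket, while `calendar_quarter` keeps it in the first quarter.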

Deployment & Integration Patterns

The classifier is best deployed as a small, single-purpose container with a read-only root filesystem, so a compromised process cannot persist a foothold or rewrite its own rule manifests. Mount the append-only ledger on a dedicated write volume and grant the container only the egress it needs to reach the reporting tier. Because the subsystem must never apply backpressure to the OT side — slowing ingestion is not an option when the control loop is upstream — buffer bursts through a message broker (a Kafka topic or an MQTT queue) and let classification workers drain it at their own pace.

Long-running reprocessing jobs — for example, re-running a full year of history after a rule_version bump — should be dispatched through an asynchronous worker tier rather than blocking the live path, using the async batch processing setup pattern. Deterministic idempotency makes this safe: re-running the pipeline over identical input windows yields identical codes, because state mutations are isolated to append-only audit tables and every write is keyed on the input vector’s source_hash.
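
The replay-safety property can be sketched with an in-memory stand-in for the audit store. `AppendOnlyLedger` is hypothetical, and keying on the pair (`source_hash`, `rule_version`) rather than on `source_hash` alone is an assumption made here so that a reprocessing run under a new rule version can record its own verdict for the same input.

```python
from __future__ import annotations


class AppendOnlyLedger:
    """In-memory stand-in for a write-once audit store."""

    def __init__(self) -> None:
        self._rows: list[dict] = []
        self._keys: set[tuple[str, str]] = set()

    def append(self, verdict: dict) -> bool:
        """Append once per (source_hash, rule_version); a replay is a no-op."""
        key = (verdict["source_hash"], verdict["rule_version"])
        if key in self._keys:
            return False  # Already recorded: nothing is rewritten or duplicated.
        self._keys.add(key)
        self._rows.append(dict(verdict))
        return True

    def rows(self) -> list[dict]:
        """Return copies so callers cannot mutate stored verdicts."""
        return [dict(row) for row in self._rows]
```

Re-running a batch therefore leaves the ledger unchanged unless the input or the rule version differs.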

Failure Modes & Gotchas

The single most consequential misconfiguration is an averaging-window boundary error. The classifier’s correctness can only be verified by adversarial testing against known compliance scenarios, so before deploying, run the full pipeline against a curated set of historical analytical results with confirmed outcomes. That set should include values exactly at the MCL, results at and just below the minimum reporting level (MRL), including non-detects, and quarterly averaging windows that straddle a calendar-year boundary. Every classification the engine makes on those cases should match the primacy agency’s determination. A discrepancy almost always reveals an averaging-window boundary error, a rounding-convention mismatch, or a unit-of-measure inconsistency that would otherwise stay hidden until an enforcement action. Catch these by making the known-case regression suite a merge gate: a rule_version change that flips any known verdict must fail the build, not ship.
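
A merge gate of that kind can be as small as the sketch below. The cases are hypothetical placeholders, not real determinations; a production suite would load confirmed historical outcomes. `KNOWN_CASES`, `regression_failures`, and the `"02"` code are assumptions for illustration, and `classify_lraa` is restated so the block runs on its own.

```python
from __future__ import annotations

from typing import Callable, Optional


def classify_lraa(quarterly_means: list[float], mcl: float, code: str) -> Optional[str]:
    """Restated LRAA check: defer on a short window, otherwise compare the mean."""
    if len(quarterly_means) < 4:
        return None
    lraa = sum(quarterly_means[-4:]) / 4
    return code if lraa > mcl else None


# Hypothetical placeholder cases; replace with confirmed historical outcomes.
KNOWN_CASES = [
    {"name": "lraa_above_mcl", "means": [0.070, 0.085, 0.090, 0.078], "mcl": 0.080, "expected": "02"},
    {"name": "lraa_below_mcl", "means": [0.060, 0.070, 0.075, 0.080], "mcl": 0.080, "expected": None},
    {"name": "incomplete_window", "means": [0.095, 0.095, 0.095], "mcl": 0.080, "expected": None},
]


def regression_failures(
    cases: list[dict],
    classify: Callable[[list[float], float, str], Optional[str]],
    code: str = "02",
) -> list[str]:
    """Return the names of cases whose verdict differs from the confirmed outcome."""
    return [
        case["name"]
        for case in cases
        if classify(case["means"], case["mcl"], code) != case["expected"]
    ]
```

In CI, a non-empty return value would fail the build, which is what turns a flipped verdict into a blocked merge rather than a shipped regression.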

Related pages

Core Architecture & SDWA Compliance Taxonomy: parent architecture and shared data contracts
Translating EPA Violation Codes to Internal Alerts: mapping sealed verdicts onto operator escalation
SDWA MCL Reference Mapping: the runtime threshold and averaging-basis source
Monitoring Frequency Scheduling: mandated intervals behind monitoring/reporting codes
Security Boundary Design: the perimeter records cross before classification
Violation Detection Rule Engine: downstream consumer of classified codes