Python Logic for Detecting MCL Exceedances in Real Time

Detecting a Maximum Contaminant Level (MCL) breach the moment it happens — rather than during a monthly laboratory reconciliation — is the exact engineering task on this page: turning high-frequency SCADA telemetry into deterministic, audit-ready exceedance events against the EPA 40 CFR Part 141 framework. This is the reference implementation within the parent MCL Exceedance Logic Implementation section, written for the Python automation builders who own the streaming evaluation service and the environmental compliance teams who answer for the resulting reports. A real-time engine has to normalize raw sensor payloads, maintain stateful running-average and percentile windows without reloading history, and route every flagged event so that no violation is ever silently dropped before it reaches the broader Violation Detection Rule Engine.

The four steps below build that engine end to end: strict edge normalization, a stateful rolling-window evaluator, fault-tolerant routing, and an immutable audit trail.

Prerequisites & Environment Setup

The engine targets Python 3.10+ for the timezone-aware datetime handling and type annotations used throughout. It relies only on the standard library for evaluation (asyncio, collections.deque, dataclasses, datetime), plus a broker client for ingestion and, optionally, pydantic for schema enforcement once records enter the compliance pipeline. Pin versions explicitly so streaming behavior stays reproducible across deployments.

Ingestion typically subscribes to OPC-UA gateways or MQTT brokers (the same field sources described in OPC UA Data Extraction) and buffers normalized payloads in Redis Streams or Apache Kafka. Confine all window state to a single event loop: because asyncio is single-threaded and cooperatively scheduled, coroutines on that loop can read and update the windows without data races, as long as no update is split across an await. Worker threads or processes must not touch that state directly.

python3 -m venv .venv && source .venv/bin/activate
pip install "paho-mqtt==2.1.*" "redis==5.0.*"
# Optional: schema enforcement once events feed the compliance pipeline
pip install "pydantic==2.7.*"

Step-by-Step Implementation

Step 1 — Normalize telemetry deterministically at the edge

Before any compliance evaluation runs, incoming telemetry must pass through deterministic normalization, or the running averages you compute will be wrong in ways that only surface during an audit. Three transforms are non-negotiable:

Timestamp coercion. All timestamps are anchored to UTC at the edge. Local UTC offsets and Daylight Saving Time transitions are resolved during ingestion so that hourly aggregations are neither duplicated nor skipped — the same discipline enforced by the time-series alignment strategies module upstream.
Quality-flag filtering. SCADA quality codes are evaluated against a strict allowlist. In OPC-UA, the two most significant bits of the status code carry the quality severity (GOOD, UNCERTAIN, or BAD); any reading that is not GOOD is routed to a quarantine queue and excluded from MCL calculations.
Unit standardization. Sensor outputs are converted to EPA reporting units (mg/L or µg/L) through a deterministic conversion table. Calibration events are tagged with a calibration_flag so that maintenance-induced spikes do not trigger compliance violations.
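The three transforms above can be sketched as a single edge function. This is a minimal illustration, not the pipeline's actual normalizer: the names normalize_reading, NormalizedReading, and TO_MG_PER_L are hypothetical, the conversion table is an assumed subset that normalizes everything to mg/L, and the mask assumes a one-byte quality code with severity in its top two bits. Unlike a silent UTC assumption, the sketch rejects naive timestamps outright, which is a design choice of the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed conversion table: multiplier from the source unit to mg/L.
TO_MG_PER_L = {"mg/L": 1.0, "µg/L": 1e-3, "ug/L": 1e-3, "g/L": 1e3}

# Assumes a one-byte quality code; any set severity bit means non-GOOD.
SEVERITY_MASK = 0xC0


@dataclass(frozen=True)
class NormalizedReading:
    timestamp: datetime        # always timezone-aware, in UTC
    value_mg_per_l: float
    calibration_flag: bool


def normalize_reading(ts: datetime, value: float, unit: str, quality_code: int,
                      calibrating: bool = False) -> Optional[NormalizedReading]:
    """Return a normalized reading, or None when the caller should quarantine it."""
    if ts.tzinfo is None:
        # Refuse to guess the zone; the edge must attach it (e.g. via zoneinfo).
        raise ValueError("naive timestamp: attach the site's zone before normalizing")
    if quality_code & SEVERITY_MASK:
        return None                      # UNCERTAIN or BAD: route to quarantine
    factor = TO_MG_PER_L[unit]           # KeyError on an unknown unit is deliberate
    return NormalizedReading(ts.astimezone(timezone.utc), value * factor, calibrating)


# Example: 07:00 at a fixed UTC-5 offset is stored as 12:00 UTC.
site_tz = timezone(timedelta(hours=-5))
reading = normalize_reading(datetime(2024, 1, 15, 7, 0, tzinfo=site_tz), 15.0, "µg/L", 0x00)
```

Raising on an unknown unit or a naive timestamp keeps bad inputs out of the window instead of letting them skew a running average.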

Normalized payloads are then pulled by Python consumers in 1-to-5-second micro-batches, keeping evaluation latency low while preserving ordering and at-least-once delivery for downstream audit logging. Where those micro-batches are heavy, offload them to the workers described in Async Batch Processing Setup.
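One way to form those micro-batches on the event loop is to wait for the first payload and then collect whatever else arrives before a deadline. The sketch below assumes the broker client feeds an asyncio.Queue; drain_micro_batch and its parameters are hypothetical names, and the polling interval is an arbitrary choice.

```python
import asyncio
from typing import Any, List


async def drain_micro_batch(queue: asyncio.Queue, max_wait_s: float = 1.0,
                            max_items: int = 500, poll_s: float = 0.01) -> List[Any]:
    """Wait for one item, then gather more until the deadline or the size cap."""
    batch = [await queue.get()]              # block until there is work to do
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s
    while len(batch) < max_items and loop.time() < deadline:
        try:
            batch.append(queue.get_nowait())  # FIFO order is preserved
        except asyncio.QueueEmpty:
            await asyncio.sleep(poll_s)       # yield to other coroutines
    return batch
```

Polling with get_nowait keeps the sketch simple and never cancels a pending get, at the cost of a batch that can overrun the deadline by up to one polling interval.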

Step 2 — Build the stateful rolling-window evaluator

The compliance engine evaluates contaminants using three calculation methods defined by federal regulations: instantaneous limits, running annual averages (RAA), and percentile-based thresholds such as the Lead and Copper Rule 90th percentile. The running annual average over a window of $N$ samples is simply the arithmetic mean,

$\text{RAA} = \frac{1}{N}\sum_{i=1}^{N} c_i,$

while the 90th-percentile rank is $k = \lceil 0.9\,N \rceil$, counted from 1 over the sample set sorted in ascending order, which corresponds to index $k - 1$ in a zero-based Python list. The evaluator maintains state across streaming windows without reloading the full dataset, using bounded deque structures and explicit type annotations to keep outputs deterministic and memory usage stable.

import logging
import math
from collections import deque
from datetime import datetime, timedelta, timezone
from typing import Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class TelemetryPoint:
    timestamp: datetime
    value: float
    quality_code: int
    unit: str
    calibration_flag: bool = False

@dataclass
class ComplianceEvent:
    contaminant: str
    mcl_type: str
    threshold: float
    observed_value: float
    timestamp: datetime
    severity: str  # 'warning', 'exceedance', 'critical'
    routing_status: str = 'pending'

class MCLEvaluator:
    """
    Production-grade evaluator for EPA 40 CFR Part 141 thresholds.
    Maintains stateful rolling windows without full dataset reloads.
    """
    def __init__(self, contaminant: str, mcl_type: str, threshold: float, window_days: int = 365):
        self.contaminant = contaminant
        self.mcl_type = mcl_type  # 'instantaneous', 'running_avg', 'percentile_90'
        self.threshold = threshold
        self.window_days = window_days
        self.history: deque = deque()
        self._last_eval_time = datetime.now(timezone.utc)

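    # Quality severity is a two-bit field: 0b00 = GOOD, 0b01 = UNCERTAIN,
    # 0b10 = BAD (0b11 is reserved and is treated as BAD here). Mask 0xC0
    # assumes the gateway delivers a one-byte quality code with severity in
    # its top two bits; a raw 32-bit OPC UA StatusCode carries severity in
    # bits 30-31 and would need mask 0xC0000000 instead.
    QUALITY_SEVERITY_MASK = 0xC0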

    def ingest(self, point: TelemetryPoint) -> Optional[ComplianceEvent]:
        if point.calibration_flag or (point.quality_code & self.QUALITY_SEVERITY_MASK):
            logger.debug("Skipping %s point: calibration or non-GOOD quality", self.contaminant)
            return None

        # UTC anchor enforcement
        if point.timestamp.tzinfo is None:
            point.timestamp = point.timestamp.replace(tzinfo=timezone.utc)

        self.history.append(point)
        self._prune_window()

        return self._evaluate()

    def _prune_window(self) -> None:
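        # Compute the cutoff with timedelta so month and year boundaries are
        # handled correctly (hand-rolled arithmetic on day fields overflows).
        # The cutoff is wall-clock based and assumes points arrive in
        # timestamp order, so the oldest point is always at the left.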
        cutoff = datetime.now(timezone.utc) - timedelta(days=self.window_days)
        while self.history and self.history[0].timestamp < cutoff:
            self.history.popleft()

    def _evaluate(self) -> Optional[ComplianceEvent]:
        if not self.history:
            return None

        latest = self.history[-1]
        result = None

        if self.mcl_type == 'instantaneous':
            if latest.value > self.threshold:
                result = self._build_event(latest, 'exceedance')
        elif self.mcl_type == 'running_avg':
            avg = sum(p.value for p in self.history) / len(self.history)
            if avg > self.threshold:
                result = self._build_event(latest, 'exceedance', metric_value=avg)
        elif self.mcl_type == 'percentile_90':
            if len(self.history) >= 5:  # Minimum sample size for statistical validity
                sorted_vals = sorted(p.value for p in self.history)
                idx = math.ceil(0.9 * len(sorted_vals)) - 1
                p90 = sorted_vals[idx]
                if p90 > self.threshold:
                    result = self._build_event(latest, 'exceedance', metric_value=p90)

        return result

    def _build_event(self, point: TelemetryPoint, severity: str, metric_value: Optional[float] = None) -> ComplianceEvent:
        return ComplianceEvent(
            contaminant=self.contaminant,
            mcl_type=self.mcl_type,
            threshold=self.threshold,
            observed_value=metric_value if metric_value is not None else point.value,
            timestamp=point.timestamp,
            severity=severity
        )

The _evaluate method branches on mcl_type, applying a different comparison for each regulatory calculation method. Match each contaminant’s mcl_type and threshold to the authoritative values catalogued in the SDWA MCL Reference Mapping, so the engine and the rule table never drift apart.
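The running_avg branch re-sums the whole window on every ingest, which costs O(N) per point. For high-frequency telemetry, one possible refinement is to keep a running total and adjust it as points enter and leave the window. The sketch below is illustrative only: RollingMean is a hypothetical helper, it prunes against the newest point's timestamp rather than the wall clock, and it assumes points arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta, timezone
from typing import Deque, Tuple


class RollingMean:
    """Hypothetical helper: time-bounded mean with an incrementally kept sum."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._points: Deque[Tuple[datetime, float]] = deque()
        self._total = 0.0

    def add(self, ts: datetime, value: float) -> float:
        """Append a point (assumed in timestamp order) and return the mean."""
        self._points.append((ts, value))
        self._total += value
        cutoff = ts - self.window          # event-time cutoff, not wall clock
        # The point just added is never older than the cutoff, so the deque
        # cannot become empty inside this loop.
        while self._points[0][0] < cutoff:
            _, old = self._points.popleft()
            self._total -= old
        return self._total / len(self._points)
```

Repeated additions and subtractions accumulate floating-point error over a long-lived window, so a production variant would likely recompute the total from the deque periodically.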

Step 3 — Route exceedances with a circuit breaker and dead-letter queue

Detection is only half of the compliance lifecycle. When an exceedance is flagged, the system must route the event deterministically, degrade gracefully on downstream failures, and give operators an immediate resolution path. The routing layer combines a circuit breaker, a dead-letter queue, and multi-channel dispatch — the same fault-tolerance posture used on the ingestion side when handling missing sensor readings without triggering false violations.

import asyncio
import logging
from collections import deque
from datetime import datetime, timezone
from enum import Enum
from typing import Awaitable, Callable, Dict

logger = logging.getLogger(__name__)

class RoutingChannel(Enum):
    SCADA_HMI = "scada_hmi"
    EMAIL_ALERT = "email"
    WEBHOOK_PAGER = "pagerduty_webhook"
    DEAD_LETTER = "dead_letter_queue"

class ExceedanceRouter:
    def __init__(self, primary_handlers: Dict[RoutingChannel, Callable[[ComplianceEvent], Awaitable[None]]]):
        self.handlers = primary_handlers
        self.circuit_open = False
        self.dlq: deque = deque(maxlen=10000)

    async def route_event(self, event: ComplianceEvent) -> None:
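        if self.circuit_open:
            # Circuit is open: skip dispatch, but record the outcome on the
            # event so it does not stay 'pending' in the audit trail.
            event.routing_status = 'dead_lettered'
            self._queue_to_dlq(event)
            return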

        # Primary dispatch: SCADA HMI write + PagerDuty webhook. gather with
        # return_exceptions=True never raises, so inspect each result and treat
        # any exception as a failed primary channel.
        results = await asyncio.gather(
            self.handlers[RoutingChannel.SCADA_HMI](event),
            self.handlers[RoutingChannel.WEBHOOK_PAGER](event),
            return_exceptions=True
        )
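        # BaseException also catches asyncio.CancelledError, which gather can
        # return as a result when a handler task is cancelled.
        failures = [r for r in results if isinstance(r, BaseException)]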

        if not failures:
            event.routing_status = 'dispatched'
            logger.info("Routed %s exceedance to primary channels", event.contaminant)
            return

        logger.error("Primary routing failed: %s", failures)
        # Fallback: email alert as a degraded path.
        try:
            await self.handlers[RoutingChannel.EMAIL_ALERT](event)
            event.routing_status = 'fallback_dispatched'
        except Exception as fallback_err:
            logger.critical("All routing channels exhausted: %s", fallback_err)
            event.routing_status = 'dead_lettered'
            self._queue_to_dlq(event)
            self._trip_circuit()

    def _queue_to_dlq(self, event: ComplianceEvent) -> None:
        self.dlq.append({
            "event": event.__dict__,
            "queued_at": datetime.now(timezone.utc).isoformat(),
            "retry_count": 0
        })

    def _trip_circuit(self) -> None:
        self.circuit_open = True
        # Schedule an automatic reset 300s later on the running event loop.
        loop = asyncio.get_running_loop()
        loop.call_later(300, self._reset_circuit)

    def _reset_circuit(self) -> None:
        self.circuit_open = False
        logger.info("Routing circuit breaker reset. Processing DLQ backlog...")
        # Implement DLQ replay logic here for production deployments.

The router attempts primary channels first, degrades to email, and finally dead-letters the event while tripping the circuit breaker. Operators receive HMI tag updates and mobile alerts, while the dead-letter queue preserves audit continuity for later replay. Map the routed contaminant and mcl_type to a stable internal alert taxonomy, as covered in Translating EPA Violation Codes to Internal Alerts, so a paged responder sees a canonical code rather than a raw threshold comparison.
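The router leaves DLQ replay as a placeholder. A minimal sketch of one replay pass follows, assuming entries have the shape written by _queue_to_dlq (an event dict, queued_at, and retry_count). The function replay_dlq, its max_retries parameter, and the idea of returning exhausted entries to the caller are all hypothetical choices, not part of the router above.

```python
import asyncio
from collections import deque
from typing import Any, Awaitable, Callable, Deque, Dict, List, Tuple

Entry = Dict[str, Any]


async def replay_dlq(dlq: Deque[Entry],
                     handler: Callable[[Dict[str, Any]], Awaitable[None]],
                     max_retries: int = 3) -> Tuple[int, List[Entry]]:
    """Attempt each queued entry once; return (delivered count, parked entries)."""
    delivered = 0
    parked: List[Entry] = []
    for _ in range(len(dlq)):              # one pass over entries present now
        entry = dlq.popleft()
        try:
            await handler(entry["event"])
            delivered += 1
        except Exception:
            entry["retry_count"] += 1
            if entry["retry_count"] < max_retries:
                dlq.append(entry)          # try again on the next pass
            else:
                parked.append(entry)       # hand to an operator, never discard
    return delivered, parked
```

Entries that exhaust their retries are returned rather than dropped, so the caller can persist them for manual follow-up. Note that a deque created with maxlen discards its oldest entry when full, so a durable store is a safer backing for the queue in production.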

Step 4 — Serialize an immutable audit trail

Every evaluation and routing decision must be serialized to an immutable audit store. Production deployments typically write JSON or Parquet records to a time-series data lake, capturing the raw telemetry payload (pre-normalization), the quality-flag resolution path, the window-state snapshot at evaluation time, the routing-channel responses and latency metrics, and operator acknowledgment timestamps. Regulatory audits require traceable lineage from sensor to compliance decision, so chain a SHA-256 digest across each evaluation batch — linking each record’s hash into the next — which makes any post-hoc tampering detectable.
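The hash chaining described above can be sketched with the standard library alone. This is an illustration under stated assumptions: chain_records, verify_chain, and the all-zero GENESIS value are hypothetical, and records are assumed to be JSON-serializable dicts (timestamps already rendered as ISO 8601 strings).

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed starting value for the first record in a chain


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_records(records: List[Dict[str, Any]],
                  prev_hash: str = GENESIS) -> List[Dict[str, Any]]:
    """Wrap each record with the previous hash and its own SHA-256 digest."""
    chained = []
    for record in records:
        digest = _digest(prev_hash, record)
        chained.append({"record": record, "prev_hash": prev_hash, "hash": digest})
        prev_hash = digest
    return chained


def verify_chain(chained: List[Dict[str, Any]], prev_hash: str = GENESIS) -> bool:
    """Recompute every digest; any edited or reordered record breaks the chain."""
    for entry in chained:
        expected = _digest(prev_hash, entry["record"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = expected
    return True
```

Because each digest covers the previous one, changing any stored record invalidates every hash after it, which is what makes tampering detectable during an audit.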

Configuration Reference

The tables below capture the regulatory thresholds, calculation methods, and quality codes the evaluator depends on. Treat the MCL values and mcl_type assignments as versioned configuration sourced from the rule table, never as inline constants.

Representative MCL reference values (40 CFR Part 141)

Contaminant	MCL	Reporting unit	`mcl_type`
Nitrate (as N)	10	mg/L	`instantaneous`
Arsenic	10	µg/L	`running_avg`
Total trihalomethanes (TTHM)	0.080	mg/L	`running_avg`
Lead (action level)	15	µg/L	`percentile_90`
Turbidity (filtered, combined)	0.3	NTU	`instantaneous`

Evaluator parameters

Parameter	Type	Default	Meaning
`contaminant`	str	—	Analyte name used in the emitted `ComplianceEvent`
`mcl_type`	str	—	`instantaneous`, `running_avg`, or `percentile_90`
`threshold`	float	—	Regulatory limit in the analyte’s reporting unit
`window_days`	int	`365`	Rolling-window span for average / percentile methods
`QUALITY_SEVERITY_MASK`	int	`0xC0`	Bitmask isolating the quality severity bits of a one-byte quality code

OPC-UA quality-severity codes

Severity bits	Code	Meaning	Engine action
`0b00`	GOOD	Reading trusted	Included in MCL calculations
`0b01`	UNCERTAIN	Questionable quality	Quarantined, excluded
`0b10` (and reserved `0b11`)	BAD	Sensor/comms fault	Quarantined, excluded
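The table gives severity as a two-bit field. In a raw OPC UA StatusCode, which is a 32-bit value, those are bits 30 and 31, so a one-byte mask such as 0xC0 only applies when the gateway has already reduced the status to a single byte. The sketch below assumes the gateway hands over the raw 32-bit code; Severity and severity_from_status are hypothetical names.

```python
from enum import Enum


class Severity(Enum):
    GOOD = 0b00
    UNCERTAIN = 0b01
    BAD = 0b10


def severity_from_status(status_code: int) -> Severity:
    """Classify a full 32-bit OPC UA StatusCode by its two top bits."""
    bits = (status_code >> 30) & 0b11
    # 0b11 is reserved; this sketch treats it as BAD so it fails safe.
    return Severity.BAD if bits >= 0b10 else Severity(bits)
```

Only readings that classify as Severity.GOOD would then be passed on to the evaluator, matching the engine actions in the table.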

Verification & Testing

Confirm the evaluator with deterministic unit tests that replay synthetic telemetry against known EPA thresholds before any production promotion. Because the running-average and percentile branches are the subtle ones, assert them explicitly rather than trusting a single instantaneous check.

from datetime import datetime, timezone


def _pt(value):
    return TelemetryPoint(
        timestamp=datetime.now(timezone.utc),
        value=value,
        quality_code=0x00,  # GOOD
        unit="mg/L",
    )


def test_instantaneous_exceedance_flags_nitrate():
    ev = MCLEvaluator("nitrate", "instantaneous", threshold=10.0)
    assert ev.ingest(_pt(9.9)) is None
    event = ev.ingest(_pt(10.5))
    assert event is not None and event.observed_value == 10.5


def test_non_good_quality_is_excluded():
    ev = MCLEvaluator("nitrate", "instantaneous", threshold=10.0)
    bad = _pt(50.0)
    bad.quality_code = 0x80  # BAD severity -> masked out
    assert ev.ingest(bad) is None


def test_percentile_90_requires_min_sample_size():
    ev = MCLEvaluator("lead", "percentile_90", threshold=15.0)
    for v in [20.0, 20.0, 20.0, 20.0]:  # only 4 samples
        assert ev.ingest(_pt(v)) is None  # below n=5 threshold
    assert ev.ingest(_pt(20.0)) is not None  # 5th sample triggers evaluation

Acceptance criteria before promoting the engine to production:

Every mcl_type (instantaneous, running_avg, percentile_90) has a passing exceedance and non-exceedance test.
Non-GOOD quality codes and calibration_flag points are excluded from all calculations.
Naive (tz-unaware) timestamps are coerced to UTC before entering the window.
window_days pruning is verified across a leap-year and DST boundary.
percentile_90 returns no event below the n >= 5 minimum sample size.
Router dead-letters and trips the circuit when all channels fail, and auto-resets after 300s.
Each emitted record carries contaminant, threshold, observed value, and a UTC timestamp for the audit store.

Troubleshooting & Gotchas

Running averages drift because of unfiltered quality codes. A single BAD or UNCERTAIN reading that slips past the mask skews the mean for the entire window. Confirm QUALITY_SEVERITY_MASK matches your gateway’s status encoding and quarantine — never silently zero — any non-GOOD point.
Phantom exceedances after a maintenance window. Calibration spikes decode as real readings unless tagged. Ensure the edge sets calibration_flag=True during service so ingest drops those points before they reach _evaluate.
window_days pruning misbehaves around DST transitions. Hand-rolled day arithmetic on local timestamps breaks across DST and month boundaries. The _prune_window cutoff uses timedelta(days=...) against UTC-anchored timestamps for exactly this reason, so normalize timestamps to UTC at ingestion before any window math.
percentile_90 returns nothing on a fresh stream. By design: the branch requires len(self.history) >= 5 for statistical validity. Backfill the window from the historian on startup, or accept that the first four samples cannot trigger a Lead and Copper Rule evaluation.
Silent violation loss during a broker outage. If you bypass ExceedanceRouter and dispatch directly, a downstream failure drops the event with no trace. Always route through the circuit breaker and dead-letter queue so a webhook or SMTP outage becomes a replayable DLQ entry rather than a regulatory reporting gap.

Related pages

MCL Exceedance Logic Implementation: parent section and the exceedance-evaluation contract
Handling Missing Sensor Readings Without Triggering False Violations: the gap-detection counterpart to this detection logic
SDWA MCL Reference Mapping: authoritative thresholds and mcl_type assignments for the evaluator
Translating EPA Violation Codes to Internal Alerts: canonical alert taxonomy for routed events
Parsing Modbus Registers for Turbidity Sensors: upstream sensor decode that feeds this engine
Aligning Irregular SCADA Timestamps to UTC: the UTC normalization this engine depends on