Severity Scoring Models for Water Utility Compliance Automation

Severity scoring models convert a confirmed regulatory event into a normalized, risk-weighted index that drives triage, resource allocation, and escalation. This topic sits within the Violation Detection & Rule Engine Logic domain and covers how to build a deterministic scoring stage that runs immediately after an exceedance is detected: the weighting schema that combines exceedance magnitude, temporal persistence, population exposure, and data confidence; the phase-by-phase engine that computes the index; and the deployment patterns that keep the calculation auditable and reproducible. It is written for the utility operators who prioritize field response, the environmental compliance teams who must defend every score in an audit, and the Python engineers who build the automation between them. A severity score informs how urgently a utility responds — it never changes whether a violation occurred, which remains a fixed determination under the governing rule. The workflow below traces one confirmed exceedance from event ingestion, through the weighted index, to a routed response tier.

End-to-end severity scoring: a confirmed exceedance is enriched, normalized across four dimensions, weighted into an index, and routed to a low, mid, or high response tier.

Regulatory / Protocol Foundation

Severity scoring has no independent statutory definition; it is an operational layer built on top of the Safe Drinking Water Act (SDWA) determinations produced upstream, and its legitimacy depends entirely on preserving those determinations intact. The scoring engine consumes a confirmed exceedance already evaluated by MCL Exceedance Logic Implementation against the applicable Maximum Contaminant Level (MCL) or Maximum Residual Disinfectant Level (MRDL) and its averaging methodology; it must never re-derive, soften, or override that result. A score of 12/100 on a nitrate exceedance is still a reportable acute violation with a 24-hour public-notification obligation; the score only sequences the response, never excuses it.

Three regulatory constraints shape the model. First, the threshold values, reporting units, and significant-figure conventions used to compute exceedance magnitude are resolved against the SDWA MCL Reference Mapping rather than hard-coded, so a rule revision propagates without touching scoring code. Second, temporal persistence must be measured against the EPA monitoring frequency and compliance window that governs each contaminant class — a single-sample acute contaminant and a locational running annual average contaminant carry fundamentally different notions of “duration.” Third, because public notification tiers are themselves codified (Tier 1 acute, Tier 2, Tier 3 under 40 CFR Part 141 Subpart Q), the highest severity band must map to — never gate — the statutory notification deadline.

The contaminant classes that most commonly drive high scores align with the same methodology map the detection engine uses:

Contaminant class	Example parameters	Health basis	Notification urgency
Acute microbial / chemical	Nitrate, nitrite, E. coli	Immediate (hours)	Tier 1 — 24 hours
Disinfection byproducts	TTHM, HAA5	Chronic (years)	Tier 2 — 30 days
Inorganic & organic chemicals	Arsenic, atrazine	Chronic (years)	Tier 2 — 30 days
Secondary / monitoring	Sampling gap, late report	Procedural	Tier 3 — annual/CCR

Architecture & Design Decisions

The scoring engine runs as a pure, deterministic stage: it consumes an immutable confirmed-exceedance record and emits a new record carrying a numeric index and its full derivation, so any score can be reconstructed from its inputs during a primacy-agency review. Placing it downstream of detection — rather than folding scoring into the threshold check — is a deliberate separation-of-concerns decision. Detection answers a binary regulatory question; scoring answers a continuous operational one. Coupling them would make it impossible to re-prioritize response without appearing to re-litigate compliance.

Several interfaces cross this stage’s edges. On the way in, records arrive already validated: completeness is confirmed by Monitoring Gap Detection Algorithms so that missing telemetry, sensor drift, and irregular sampling never inflate a risk score, and events are delivered clean across the boundary established by Security Boundary Design. The warning-band calibration that distinguishes an early operational signal from a hard exceedance is owned by Threshold Tuning Frameworks, so the scoring engine consumes tuned bands rather than inventing its own. On the way out, a scored event is routed to compliance reporting and, for high-severity events, to public-notification workflows keyed by the standardized codes resolved through Violation Code Classification.

The central design decision is that the weighting schema is data, not code. Weight matrices and tier boundaries live in versioned configuration validated against a schema before pipeline execution, letting compliance officers adjust risk tolerances — seasonal, jurisdictional, or event-driven — through a reviewable change rather than a code deploy. The record entering the engine carries the same enriched compliance contract used across the domain: a contaminant_id, a location_id, a UTC event_ts, the measured value, the bound mcl, and a quality_flag. The record leaving it adds a severity_index, a tier, and a component_breakdown for the audit log.
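As a rough illustration of the outgoing contract, the sketch below adds the three output fields named above to the input fields. The class name `ScoredEvent`, the `weights_version` field, the `to_audit_json` helper, and all sample values are assumptions made for this example, not part of the documented contract.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ScoredEvent:
    """Hypothetical output record: the input contract plus score and derivation."""
    contaminant_id: str
    location_id: str
    event_ts: str                  # ISO-8601, timezone-aware UTC
    value: float
    mcl: float
    quality_flag: str
    severity_index: float          # 0-100
    tier: str                      # "low", "mid", or "high"
    component_breakdown: dict[str, float] = field(default_factory=dict)
    weights_version: str = "unversioned"   # assumption: field name is illustrative


def to_audit_json(scored: ScoredEvent) -> str:
    """Serialize with sorted keys so identical records yield identical lines."""
    return json.dumps(asdict(scored), sort_keys=True)


# Illustrative values only.
scored = ScoredEvent(
    contaminant_id="nitrate", location_id="zone-7",
    event_ts="2024-06-01T12:00:00+00:00", value=15.0, mcl=10.0,
    quality_flag="GOOD", severity_index=28.4, tier="low",
    component_breakdown={"magnitude": 0.25, "persistence": 0.5,
                         "exposure": 0.4, "confidence": 0.8},
    weights_version="v3",
)
audit_line = to_audit_json(scored)
```

Keeping the record frozen and serializing with sorted keys is one way to make the audit line byte-stable for a given input.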

The engine's data contract: an enriched record threads four pure stages, a versioned weight-matrix config feeds aggregation, and the routed result fans out to reporting, high-tier notification, and the append-only audit ledger.

Phase-by-Phase Implementation

The engine is built in four phases, each producing an artifact the next depends on: an enriched event, a set of normalized components, a weighted index, and a routed response tier.

Phase 1 — Event enrichment

Bind the confirmed exceedance to the contextual data the score depends on: historical baselines for the parameter and location, asset topology (which pressure zone, how many downstream connections), and service-area demographics including sensitive populations and critical facilities. Enrichment is additive and must fail loudly — a missing demographic join should surface as a flagged low-confidence score, never a silent default that under-ranks a genuine risk.

Implementation steps:

Resolve contaminant_id and location_id to their asset and service-area context.
Attach the parameter baseline (rolling median and variance) for magnitude normalization.
Attach population, sensitive-population, and critical-facility counts for the exposure dimension.

from dataclasses import dataclass


@dataclass(frozen=True)
class ExceedanceEvent:
    """A confirmed exceedance handed down from the detection engine."""
    contaminant_id: str
    location_id: str
    event_ts: str          # ISO-8601, timezone-aware UTC
    value: float
    mcl: float
    quality_flag: str


@dataclass(frozen=True)
class EnrichedEvent:
    """An exceedance enriched with the context the score depends on."""
    event: ExceedanceEvent
    persistence_hours: float
    population: int
    sensitive_population: int
    critical_facilities: int
    confidence: float       # 0.0-1.0 from calibration + telemetry health


def enrich(event: ExceedanceEvent, context: dict) -> EnrichedEvent:
    """Join a confirmed exceedance to its asset, demographic, and confidence context."""
    try:
        ctx = context[event.location_id]
    except KeyError as exc:
        raise LookupError(f"No enrichment context for location {event.location_id!r}") from exc
    return EnrichedEvent(
        event=event,
        persistence_hours=ctx["persistence_hours"],
        population=ctx["population"],
        sensitive_population=ctx["sensitive_population"],
        critical_facilities=ctx["critical_facilities"],
        confidence=ctx["confidence"],
    )
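One way to surface an incomplete join as an explicit flag instead of an exception or a silent default is sketched below. `EnrichmentResult`, `lookup_context`, and `REQUIRED_KEYS` are hypothetical names introduced here; the required keys mirror the fields read by `enrich` above.

```python
from dataclasses import dataclass
from typing import Optional

# Keys the enrichment step reads from each location's context.
REQUIRED_KEYS = ("persistence_hours", "population", "sensitive_population",
                 "critical_facilities", "confidence")


@dataclass(frozen=True)
class EnrichmentResult:
    """Hypothetical wrapper: a usable context or an explicit remediation flag."""
    location_id: str
    context: Optional[dict]
    needs_remediation: bool
    reason: str = ""


def lookup_context(location_id: str, context: dict) -> EnrichmentResult:
    """Return the location's context, or a flagged result when the join is incomplete."""
    ctx = context.get(location_id)
    if ctx is None:
        return EnrichmentResult(location_id, None, True, "no context for location")
    missing = [key for key in REQUIRED_KEYS if key not in ctx]
    if missing:
        return EnrichmentResult(location_id, None, True, f"missing keys: {missing}")
    return EnrichmentResult(location_id, ctx, False)


# Illustrative context table with one fully joined location.
sample_context = {
    "zone-7": {"persistence_hours": 12.0, "population": 18000,
               "sensitive_population": 2000, "critical_facilities": 4,
               "confidence": 0.8},
}
found = lookup_context("zone-7", sample_context)
absent = lookup_context("zone-9", sample_context)
```

A caller could score `found` normally and send `absent` to data remediation, so a missing join never becomes zero exposure.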

Phase 2 — Dimension normalization

Reduce each raw dimension to a dimensionless value on [0, 1] so that heterogeneous units — NTU, mg/L, hours, population counts — become comparable before weighting. Normalization is where domain judgment lives, and each dimension uses an explicit, documented transform rather than an implicit one.

Exceedance magnitude is normalized against the contaminant’s tolerance band, saturating at a configured multiple $k$ of the limit so a tenfold breach and a hundredfold breach both map near the ceiling instead of dominating the index:

$$m = \min\!\left(\frac{x - \text{MCL}}{k \cdot \text{MCL}},\ 1\right),\quad m \in [0, 1]$$

Temporal persistence is normalized against the compliance window $T_c$ for that contaminant class, so “duration” is always expressed relative to the regulatory clock rather than an absolute number of hours:

$$p = \min\!\left(\frac{t_{\text{exceed}}}{T_c},\ 1\right)$$

Population and vulnerability impact combines served population with a weighted uplift for sensitive populations and critical facilities, then normalizes against the largest service area in the system, $P_{\max}$:

$$v = \min\!\left(\frac{P + \alpha S + \beta F}{P_{\max}},\ 1\right)$$

where $P$ is population served, $S$ sensitive population, $F$ critical-facility count, and $\alpha, \beta$ their configured uplift factors.
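A worked example may help fix the scales. Every number below is an illustrative assumption, not a regulatory or recommended value: a reading of $x = 15$ against a limit of $10$ with $k = 2$; an exceedance lasting $12$ hours against a window $T_c = 24$ hours; and $P = 18000$, $S = 2000$, $F = 4$, $\alpha = 2$, $\beta = 500$, $P_{\max} = 60000$.

```latex
m = \min\!\left(\frac{15 - 10}{2 \cdot 10},\ 1\right) = 0.25
\qquad
p = \min\!\left(\frac{12}{24},\ 1\right) = 0.5
\qquad
v = \min\!\left(\frac{18000 + 2 \cdot 2000 + 500 \cdot 4}{60000},\ 1\right)
  = \frac{24000}{60000} = 0.4
```

With $k = 2$, any reading at or above three times the limit saturates at $m = 1$, which is the ceiling behavior the magnitude transform is designed to produce.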

Implementation steps:

Normalize magnitude against the tolerance-band ceiling $k$.
Normalize persistence against the contaminant’s compliance window.
Compute the exposure index and clip every component to [0, 1].

from dataclasses import dataclass


@dataclass(frozen=True)
class NormConfig:
    k: float                 # magnitude saturation multiple of the MCL
    window_hours: float      # T_c for this contaminant class
    pop_max: int             # largest service-area population in the system
    alpha: float             # sensitive-population uplift
    beta: float              # critical-facility uplift


def clip01(x: float) -> float:
    return max(0.0, min(1.0, x))


def normalize(ev: EnrichedEvent, cfg: NormConfig) -> dict[str, float]:
    """Map the four raw dimensions onto a dimensionless [0, 1] scale."""
    e = ev.event
    magnitude = clip01((e.value - e.mcl) / (cfg.k * e.mcl)) if e.mcl > 0 else 1.0
    persistence = clip01(ev.persistence_hours / cfg.window_hours)
    exposure = clip01(
        (ev.population + cfg.alpha * ev.sensitive_population
         + cfg.beta * ev.critical_facilities) / cfg.pop_max
    )
    return {
        "magnitude": magnitude,
        "persistence": persistence,
        "exposure": exposure,
        "confidence": clip01(ev.confidence),
    }

Phase 3 — Weighted aggregation

Combine the normalized components through a versioned weight matrix into a single index scaled to [0, 100]. Weights are constrained to sum to 1 so the index stays interpretable across configuration changes. Confidence enters as a multiplicative attenuator rather than an additive term: a high-magnitude event on a poorly calibrated sensor is deliberately damped and flagged for verification instead of triggering an expensive emergency response on suspect data.

$$S = 100 \cdot c \cdot \big(w_m m + w_p p + w_v v\big),\qquad w_m + w_p + w_v = 1$$
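To see the attenuation at work, take illustrative component values $m = 0.25$, $p = 0.5$, $v = 0.4$ and assumed weights $w_m = 0.5$, $w_p = 0.3$, $w_v = 0.2$ (none of these are recommended settings), then compare a confidence of $c = 0.8$ with $c = 1$:

```latex
w_m m + w_p p + w_v v = 0.5 \cdot 0.25 + 0.3 \cdot 0.5 + 0.2 \cdot 0.4
                      = 0.125 + 0.15 + 0.08 = 0.355

S_{c = 0.8} = 100 \cdot 0.8 \cdot 0.355 = 28.4
\qquad
S_{c = 1} = 100 \cdot 1 \cdot 0.355 = 35.5
```

Because $c$ multiplies the whole weighted sum, lowering confidence scales the index proportionally without changing the relative contribution of the three dimensions.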

Implementation steps:

Validate that the response weights sum to 1 before any event is scored.
Compute the weighted sum of magnitude, persistence, and exposure.
Attenuate by the confidence factor and scale to [0, 100].

from dataclasses import dataclass


@dataclass(frozen=True)
class WeightMatrix:
    """Versioned scoring weights; w_magnitude + w_persistence + w_exposure must equal 1."""
    version: str
    w_magnitude: float
    w_persistence: float
    w_exposure: float

    def __post_init__(self) -> None:
        total = self.w_magnitude + self.w_persistence + self.w_exposure
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"Weights must sum to 1.0, got {total} (matrix {self.version})")


def severity_index(components: dict[str, float], weights: WeightMatrix) -> float:
    """Weighted, confidence-attenuated severity index on a 0-100 scale."""
    weighted = (
        weights.w_magnitude * components["magnitude"]
        + weights.w_persistence * components["persistence"]
        + weights.w_exposure * components["exposure"]
    )
    return round(100.0 * components["confidence"] * weighted, 1)

Phase 4 — Tier routing

Map the index to an operational response tier and route it. Tier boundaries are configuration, not literals, so a utility can widen or narrow bands without a code change. Every routed event — regardless of tier — is written to the audit trail with its component breakdown and the weight-matrix version, so a low score is as reproducible as a high one.

Implementation steps:

Classify the index into LOW, MID, or HIGH against configured boundaries.
Attach the routing action and required downstream workflows.
Emit the scored event with its full derivation for the audit log.

from enum import Enum


class Tier(str, Enum):
    LOW = "low"
    MID = "mid"
    HIGH = "high"


def route(index: float, mid_cutoff: float = 40.0, high_cutoff: float = 70.0) -> Tier:
    """Map a 0-100 severity index to a response tier using configurable cutoffs."""
    if index >= high_cutoff:
        return Tier.HIGH
    if index >= mid_cutoff:
        return Tier.MID
    return Tier.LOW
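If the cutoffs are to live in configuration rather than in default arguments, a small validated container is one option. The sketch below is an assumption-laden illustration: `TierBoundaries` and `classify` are hypothetical names, and the cutoff values are examples only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierBoundaries:
    """Hypothetical tier-boundary config on the 0-100 index scale."""
    version: str
    mid_cutoff: float
    high_cutoff: float

    def __post_init__(self) -> None:
        # Reject inverted or out-of-range bands at load time.
        if not 0.0 <= self.mid_cutoff < self.high_cutoff <= 100.0:
            raise ValueError(
                f"Need 0 <= mid_cutoff < high_cutoff <= 100, got "
                f"{self.mid_cutoff}/{self.high_cutoff} (boundaries {self.version})"
            )


def classify(index: float, bounds: TierBoundaries) -> str:
    """Map an index to 'low', 'mid', or 'high' using configured cutoffs."""
    if index >= bounds.high_cutoff:
        return "high"
    if index >= bounds.mid_cutoff:
        return "mid"
    return "low"


bounds = TierBoundaries(version="tiers-v1", mid_cutoff=40.0, high_cutoff=70.0)

# An inverted configuration is rejected before any event is classified.
try:
    TierBoundaries(version="tiers-bad", mid_cutoff=80.0, high_cutoff=70.0)
    inverted_rejected = False
except ValueError:
    inverted_rejected = True
```

Carrying a `version` on the boundaries lets the audit record name the band set that produced a tier, alongside the weight-matrix version.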

Validation, Quality Flags & Edge Cases

A severity score is only defensible if its inputs were fit for use and its derivation is reproducible. The engine tracks each event through a small state machine so that a low-confidence or incomplete input is never scored as if it were clean: an event is PENDING until enrichment completes; it becomes SCORED once all four dimensions normalize successfully; it diverts to WITHHELD when the input quality flag or confidence is too low to trust, routing to verification instead of response; and it resolves to ROUTED once a tier and audit record are written.

Per-event scoring state machine: an event moves from PENDING to SCORED to ROUTED on clean data, or diverts to WITHHELD — to be re-scored after verification or resolved as a telemetry artifact.
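The state machine can be sketched as an explicit transition table. The state names come from the description above; the `ARTIFACT` terminal state, the choice to re-enter at `PENDING` after verification, and the `advance` helper are assumptions made for this illustration.

```python
from enum import Enum


class ScoringState(str, Enum):
    PENDING = "pending"
    SCORED = "scored"
    WITHHELD = "withheld"
    ROUTED = "routed"
    ARTIFACT = "artifact"   # assumption: terminal state for a telemetry artifact


# Allowed transitions; anything not listed is rejected.
ALLOWED = {
    ScoringState.PENDING: {ScoringState.SCORED, ScoringState.WITHHELD},
    ScoringState.SCORED: {ScoringState.ROUTED},
    # Assumption: a verified event re-enters at PENDING to be re-scored.
    ScoringState.WITHHELD: {ScoringState.PENDING, ScoringState.ARTIFACT},
    ScoringState.ROUTED: set(),
    ScoringState.ARTIFACT: set(),
}


def advance(current: ScoringState, target: ScoringState) -> ScoringState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target


# Clean path: PENDING -> SCORED -> ROUTED.
state = advance(ScoringState.PENDING, ScoringState.SCORED)
state = advance(state, ScoringState.ROUTED)

# A routed event cannot be silently reopened.
try:
    advance(ScoringState.ROUTED, ScoringState.PENDING)
    reopen_blocked = False
except ValueError:
    reopen_blocked = True
```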

Several edge cases must be handled explicitly:

Suspect and bad quality flags. Only events carrying a GOOD flag, or an INTERPOLATED flag at reduced confidence, are scored for response. A SUSPECT or BAD flag diverts the event to the WITHHELD branch for confirmation sampling, because scoring a fouled-sensor reading as HIGH burns field resources and erodes trust in the system.
Missing enrichment context. A location with no demographic or asset join must not default silently to zero exposure, which would under-rank a real risk. Treat it as a low-confidence score and flag it for data remediation.
Zero or near-zero limits. Contaminants with an MCL at or near zero (some microbial and treatment-technique parameters) break the magnitude ratio; these route through a presence/absence rule that maps any detection to the magnitude ceiling rather than a division.
Persistence across DST and timezone drift. Duration is computed from timezone-aware UTC timestamps only; otherwise a DST transition in local wall-clock time makes an exceedance appear an hour shorter (fall-back) or an hour longer (spring-forward) than it really was and mis-normalizes the persistence dimension.
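The DST edge case is easy to reproduce with the standard library. The sketch below uses fixed UTC offsets for US Eastern time around the 3 November 2024 fall-back; the timestamps are illustrative, and the variable names are chosen for this example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for US Eastern daylight and standard time.
EDT = timezone(timedelta(hours=-4))
EST = timezone(timedelta(hours=-5))

# Exceedance starts before the fall-back and ends after it.
start = datetime(2024, 11, 3, 0, 30, tzinfo=EDT)
end = datetime(2024, 11, 3, 2, 30, tzinfo=EST)

# Wall-clock arithmetic drops the offsets and loses the repeated hour.
naive_hours = (
    end.replace(tzinfo=None) - start.replace(tzinfo=None)
).total_seconds() / 3600

# UTC arithmetic measures the time that actually elapsed.
utc_hours = (
    end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
).total_seconds() / 3600
```

Here the wall-clock difference is 2 hours while the elapsed time is 3 hours, so a persistence value computed from local time would be understated by a third.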

The quality-flag vocabulary travels with each record from ingestion through scoring:

Quality flag	Meaning	Eligible for scoring
`GOOD`	Passed all range and calibration checks	Yes
`INTERPOLATED`	Gap-filled by an approved upstream method	Yes, with reduced confidence
`SUSPECT`	Out-of-range or drift-flagged; held for review	No — route to `WITHHELD`
`BAD`	Sensor fault, `NaN`/`Inf`, or failed QC	No — route to `WITHHELD`
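The table can be read as a small gate in front of the scorer. In the sketch below, the 0.8 reduction for interpolated records and the 0.5 minimum confidence are hypothetical placeholders, and `gate` is a name introduced for this example; unknown flags are withheld as a conservative assumption.

```python
from typing import Optional

# Assumption: illustrative reduction applied to gap-filled records.
INTERPOLATED_CONFIDENCE_FACTOR = 0.8


def gate(quality_flag: str, confidence: float,
         min_confidence: float = 0.5) -> Optional[float]:
    """Return the confidence to score with, or None when the event is WITHHELD."""
    if quality_flag == "GOOD":
        effective = confidence
    elif quality_flag == "INTERPOLATED":
        effective = confidence * INTERPOLATED_CONFIDENCE_FACTOR
    else:
        # SUSPECT, BAD, or any unrecognized flag is never scored.
        return None
    # A trusted flag with too little confidence is also withheld.
    return effective if effective >= min_confidence else None
```

Returning `None` rather than zero keeps "withheld" distinct from "scored with zero confidence", which would otherwise produce a misleading index of 0.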

Deployment & Integration Patterns

Deploy the scoring engine as a stateless microservice or an embedded stream processor sitting between detection and reporting. Because each score depends only on its enriched event and the active weight matrix, the service scales horizontally with no shared mutable state. Use a message broker (a Kafka topic or an MQTT topic) to decouple detection from scoring and to absorb backpressure when an incident produces a burst of correlated exceedances; long-running rescoring jobs (for example, after a weight-matrix revision is applied to historical events) belong on the async batch processing setup rather than the live path. For batch and vectorized scoring across high-frequency feeds, pandas or Polars expressions apply the normalization and weighting to whole frames at once; interval-based indexing keeps the historical-baseline joins memory-safe.

Every weight matrix and tier-boundary set must be schema-validated before it reaches the pipeline. Enforce this with a declarative model in Pydantic so an invalid configuration — weights that do not sum to 1, a mid-cutoff above the high-cutoff — fails at load time rather than silently mis-scoring live events. Emit structured, provenance-carrying logs for every scored event: the component breakdown, the weight-matrix version, the configuration checksum, and the execution timestamp, so any score is reconstructable during an audit. Containerize the service with a read-only root filesystem and egress controls to meet municipal cybersecurity baselines, and pin random seeds in tests for reproducible scoring fixtures.
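The provenance fields listed above can be assembled with the standard library alone. This sketch does not show the Pydantic validation step; it only illustrates one way to derive a stable configuration checksum and a structured log line. The function names, the `event_id` field, and the sample configuration are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def config_checksum(config: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def provenance_line(event_id: str, components: dict, index: float,
                    config: dict, executed_at: datetime) -> str:
    """Build one structured log line; executed_at must be timezone-aware."""
    return json.dumps({
        "event_id": event_id,                      # hypothetical identifier
        "severity_index": index,
        "component_breakdown": components,
        "weights_version": config["version"],
        "config_checksum": config_checksum(config),
        "executed_at": executed_at.astimezone(timezone.utc).isoformat(),
    }, sort_keys=True)


# Illustrative configuration and event values.
config = {"version": "v3", "w_magnitude": 0.5, "w_persistence": 0.3,
          "w_exposure": 0.2, "mid_cutoff": 40.0, "high_cutoff": 70.0}
line = provenance_line(
    "evt-001",
    {"magnitude": 0.25, "persistence": 0.5, "exposure": 0.4, "confidence": 0.8},
    28.4, config, datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
)
```

Hashing a canonical serialization means two configuration files that differ only in key order or whitespace produce the same checksum.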

The most consequential deployment decision is version pinning the weight matrix per scored event. A score computed under matrix v3 must remain reproducible even after v4 is promoted, so the engine records the matrix version alongside every result and can replay any historical event under the exact configuration that produced its original score.
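A minimal replay sketch follows, assuming an in-memory registry of matrices keyed by version. The registry, the `replay` helper, and the weight values for `v3` and `v4` are all hypothetical; a real deployment would load pinned matrices from versioned configuration.

```python
# Hypothetical registry of promoted weight matrices, keyed by version.
MATRICES = {
    "v3": {"magnitude": 0.5, "persistence": 0.3, "exposure": 0.2},
    "v4": {"magnitude": 0.4, "persistence": 0.3, "exposure": 0.3},
}
ACTIVE_VERSION = "v4"


def score(components: dict, weights: dict) -> float:
    """Confidence-attenuated weighted index on a 0-100 scale."""
    weighted = sum(weights[name] * components[name] for name in weights)
    return round(100.0 * components["confidence"] * weighted, 1)


def replay(stored: dict) -> float:
    """Re-score a stored event under the matrix version pinned on it."""
    return score(stored["component_breakdown"], MATRICES[stored["weights_version"]])


# A stored event that was originally scored under v3 (illustrative values).
stored_event = {
    "weights_version": "v3",
    "severity_index": 28.4,
    "component_breakdown": {"magnitude": 0.25, "persistence": 0.5,
                            "exposure": 0.4, "confidence": 0.8},
}
replayed = replay(stored_event)
under_active = score(stored_event["component_breakdown"], MATRICES[ACTIVE_VERSION])
```

The replayed value matches the stored index, while scoring the same components under the active matrix gives a different number; that gap is why the version has to travel with every result.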

Production Validation Checklist

Failure Modes & Gotchas

The single most consequential misconfiguration is letting a low severity score suppress a mandated action. A severity index sequences operational response; it has no authority over the statutory obligations attached to the underlying violation. If routing logic ever treats a LOW tier as a reason to skip public notification or omit an event from a Consumer Confidence Report or SDWIS submission, the utility converts a scoring convenience into a reporting violation. It is easy to introduce because the low tier legitimately does suppress a field work order — the fix is to keep notification and reporting triggers bound to the confirmed violation and its Violation Code Classification, never to the severity tier, and to regression-test that a LOW-scored acute nitrate exceedance still fires its Tier 1 notification.
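The regression test described above might look like the following sketch. The class keys, the `plan_actions` function, and the shape of the returned plan are assumptions for illustration; the tier and deadline pairs are taken from the contaminant-class table earlier in this article.

```python
# Notification obligations keyed by contaminant class (hypothetical keys).
NOTIFICATION_BY_CLASS = {
    "acute": ("Tier 1", "24 hours"),
    "disinfection_byproduct": ("Tier 2", "30 days"),
    "chronic_chemical": ("Tier 2", "30 days"),
    "monitoring": ("Tier 3", "annual/CCR"),
}


def plan_actions(contaminant_class: str, severity_tier: str) -> dict:
    """Build the action plan; only the field work order depends on severity."""
    notice_tier, deadline = NOTIFICATION_BY_CLASS[contaminant_class]
    return {
        "public_notification": notice_tier,       # bound to the violation
        "notification_deadline": deadline,        # bound to the violation
        "include_in_reporting": True,             # never gated by severity
        "field_work_order": severity_tier in {"mid", "high"},
    }


def check_low_scored_acute_still_notifies() -> None:
    """Regression check: a LOW severity tier must not suppress Tier 1 notice."""
    plan = plan_actions("acute", "low")
    assert plan["public_notification"] == "Tier 1"
    assert plan["notification_deadline"] == "24 hours"
    assert plan["include_in_reporting"] is True
    assert plan["field_work_order"] is False


check_low_scored_acute_still_notifies()
```

The severity tier appears in exactly one place in `plan_actions`, which makes it straightforward to review that no statutory action reads it.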

A close second is a weight matrix that silently drifts out of normalization. If a configuration edit leaves the response weights summing to something other than 1 — or scoring code reads a stale matrix version — indices become incomparable across events and tier boundaries lose meaning, so genuinely high-risk events can slip into the mid band. Catch it by validating the sum-to-one invariant at load time (the WeightMatrix.__post_init__ guard above), pinning the matrix version onto every scored event, and asserting in a canary run that a fixed set of reference events reproduces their expected scores before a new matrix is promoted.
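A canary check along these lines could be sketched as below. The reference events, their expected scores, and the `canary` helper are illustrative assumptions; the scoring formula matches the one used in Phase 3.

```python
import math


def validate_weights(weights: dict) -> None:
    """Enforce the sum-to-one invariant before any scoring happens."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights must sum to 1.0, got {total}")


def score(components: dict, weights: dict) -> float:
    """Confidence-attenuated weighted index on a 0-100 scale."""
    weighted = sum(weights[name] * components[name] for name in weights)
    return round(100.0 * components["confidence"] * weighted, 1)


# Hypothetical reference set: (components, expected score under the candidate).
REFERENCE_EVENTS = [
    ({"magnitude": 0.25, "persistence": 0.5, "exposure": 0.4, "confidence": 0.8}, 28.4),
    ({"magnitude": 1.0, "persistence": 1.0, "exposure": 1.0, "confidence": 1.0}, 100.0),
]


def canary(weights: dict, references: list) -> list:
    """Return (expected, actual) pairs for every reference event that drifted."""
    validate_weights(weights)
    mismatches = []
    for components, expected in references:
        actual = score(components, weights)
        if actual != expected:
            mismatches.append((expected, actual))
    return mismatches


candidate = {"magnitude": 0.5, "persistence": 0.3, "exposure": 0.2}
stale = {"magnitude": 0.4, "persistence": 0.3, "exposure": 0.3}

candidate_mismatches = canary(candidate, REFERENCE_EVENTS)   # expected to be empty
stale_mismatches = canary(stale, REFERENCE_EVENTS)           # flags the drifted score
```

Promotion would be blocked whenever the mismatch list is non-empty or `validate_weights` raises, which covers both the stale-matrix and the broken-normalization cases.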

Related pages

Violation Detection & Rule Engine Logic: parent domain and shared rule-engine pipeline
MCL Exceedance Logic Implementation: produces the confirmed exceedances this engine scores
Monitoring Gap Detection Algorithms: completeness gate that prevents false risk inflation
Threshold Tuning Frameworks: calibrates the warning bands this engine consumes
Violation Code Classification: standardized codes that route high-severity notifications
SDWA MCL Reference Mapping: source of limit values used to normalize exceedance magnitude