Automating Monthly vs. Quarterly SDWA Monitoring Schedules

Deciding whether a given sampling location must be sampled monthly or quarterly is one of the most error-prone tasks in a Safe Drinking Water Act (SDWA) compliance program, because the answer is conditional: it shifts with population served, contaminant family, historical results, and rule-specific escalation triggers. This page walks Python automation engineers and environmental compliance teams through building a daemon that resolves that frequency from live compliance state — instead of a hand-maintained spreadsheet — and emits audit-ready work orders that prove every sample landed inside its window. It sits under the Monitoring Frequency Scheduling module and consumes the limits published by the SDWA MCL Reference Mapping, so a frequency shift always follows compliance status rather than an arbitrary administrative cycle.

The escalation math is anchored to the Stage 2 Disinfectants and Disinfection Byproducts Rule (DBPR) locational running annual average (LRAA). A location can hold reduced (quarterly) monitoring only while its trailing four-quarter mean stays at or below half the Maximum Contaminant Level (MCL):

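\text{LRAA}_{\ell,q} = \frac{1}{4}\sum_{i=q-3}^{q}\bar{C}_{\ell,i} \;\le\; 0.5\cdot\text{MCL}

Here \bar{C}_{\ell,i} is the mean of the results at location \ell in quarter i, and q is the quarter being evaluated.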

When that inequality breaks — or when any single EPA violation code is raised — the location escalates to monthly sampling, and the scheduler must react on the next evaluation cycle rather than at the next manual review.
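The inequality can be sketched directly in Python. This is a minimal illustration, not part of the scheduler listing on this page: the names lraa and qualifies_for_reduced_monitoring are hypothetical, results are assumed to arrive as four quarterly means in mg/L, and the MCL values (0.080 mg/L for TTHM, 0.060 mg/L for HAA5) are hard-coded only for the example; a deployment would read them from the SDWA MCL Reference Mapping.

```python
from statistics import fmean
from typing import Sequence

# Federal MCLs in mg/L; hard-coded here only to keep the sketch self-contained.
MCL_MG_L = {"TTHM": 0.080, "HAA5": 0.060}


def lraa(quarterly_means: Sequence[float]) -> float:
    """Mean of the trailing four quarterly averages for one location."""
    if len(quarterly_means) != 4:
        raise ValueError("LRAA needs exactly four quarterly means")
    return fmean(quarterly_means)


def qualifies_for_reduced_monitoring(contaminant: str, quarterly_means: Sequence[float]) -> bool:
    """True while LRAA <= 0.5 * MCL, the condition for staying quarterly."""
    return lraa(quarterly_means) <= 0.5 * MCL_MG_L[contaminant]


# A TTHM location averaging 0.0355 mg/L clears the 0.040 mg/L cutoff.
print(qualifies_for_reduced_monitoring("TTHM", [0.031, 0.038, 0.041, 0.032]))
# One high quarter lifts the mean to 0.043 mg/L and forces escalation.
print(qualifies_for_reduced_monitoring("TTHM", [0.031, 0.038, 0.071, 0.032]))
```

Keeping the averaging and the comparison in separate functions lets the same lraa value be logged in the work order alongside the decision it produced.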

Prerequisites & Environment Setup

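The scheduler targets Python 3.11+. The zoneinfo module has been in the standard library since 3.9, but 3.11 is the first release in which datetime.fromisoformat accepts most ISO 8601 formats, including a trailing Z, which matters if the compliance API emits timestamps such as 2025-01-31T07:00:00Z. zoneinfo reads the operating system's timezone database, so minimal container images may also need the tzdata package. The scheduler uses APScheduler for recurring evaluation, requests with urllib3 retry logic for the compliance-database call, and standard-library modules for date math and caching. Pin versions so daylight-saving and scheduler-misfire behavior stays reproducible across deployments.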

python3 -m venv .venv
source .venv/bin/activate
pip install "apscheduler>=3.10,<4.0" "requests>=2.31" "urllib3>=2.0"
# zoneinfo, dataclasses, enum, json, logging are all standard library on 3.11+

The daemon needs outbound HTTPS to your internal compliance-database API, write access to a local cache directory (/var/cache/sdwa), and a writable log path. Run it under systemd or a container supervisor so a crash restarts cleanly and misfired jobs are surfaced rather than lost.
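Before installing the service unit, a short preflight check can confirm some of those requirements. The sketch below is an assumption about how such a check might look: preflight is a hypothetical helper, and it verifies only the interpreter version, the timezone key, and cache-directory write access, not network reachability or the log path.

```python
import os
import sys
import tempfile
from typing import List
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def preflight(cache_dir: str, tz_key: str = "America/New_York") -> List[str]:
    """Returns a list of human-readable problems; an empty list means ready."""
    problems: List[str] = []

    if sys.version_info < (3, 11):
        problems.append(f"Python 3.11+ required, found {sys.version.split()[0]}")

    try:
        ZoneInfo(tz_key)
    except (ZoneInfoNotFoundError, ValueError):
        problems.append(f"timezone data for {tz_key!r} not available")

    try:
        os.makedirs(cache_dir, exist_ok=True)
        # Creating and deleting a temp file proves the directory is writable.
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
    except OSError as exc:
        problems.append(f"cache directory {cache_dir!r} not writable: {exc}")

    return problems
```

Returning a list rather than raising on the first failure lets a deployment script report every missing prerequisite in one pass.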

Step-by-Step Implementation

Frequency determination is a stateful evaluation. Each cycle the daemon queries a compliance ledger, evaluates the last twelve months of analytical results, reconciles the current population served, and applies rule-specific escalation before generating the next sampling window. The decision itself is a small branch resolved per asset:

Figure: Resolving required cadence. Any single trigger escalates a location to monthly; only a location that clears all three checks stays quarterly.

The implementation below is organized as five stages:

1. Model the asset and resolve frequency: a SamplingAsset dataclass plus determine_frequency, which encodes the population threshold and the rolling twelve-month escalation window.
2. Compute the next window with calendar-correct date math: parse_utc, add_months, and calculate_next_due advance by whole calendar months so quarters never drift across months of differing length.
3. Fetch compliance state resiliently: fetch_compliance_state retries with exponential backoff and falls back to a local cache when the primary database is unreachable.
4. Emit an audit-ready work order: generate_work_order and evaluate_and_schedule build the CMMS payload and persist the computed schedule for fallback.
5. Wire the recurring job: an APScheduler cron job with a listener that alarms on missed or failed executions.

import json
import logging
import os
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Any, Dict, Optional

import requests
from apscheduler.events import EVENT_JOB_ERROR, EVENT_JOB_MISSED
from apscheduler.schedulers.blocking import BlockingScheduler
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from zoneinfo import ZoneInfo

# Production logging configuration
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("sdwa_compliance_scheduler.log", encoding="utf-8")
    ]
)
logger = logging.getLogger("sdwa_scheduler")
LOCAL_TZ = ZoneInfo("America/New_York")

class Frequency(str, Enum):
    MONTHLY = "monthly"
    QUARTERLY = "quarterly"

def parse_utc(value: str) -> datetime:
    """Parses an ISO 8601 timestamp into an aware datetime in LOCAL_TZ.

    Values without an explicit offset are assumed to be UTC, then converted to
    the local timezone so that all downstream date math is offset-aware.
    """
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(LOCAL_TZ)

def add_months(start: datetime, months: int) -> datetime:
    """Advances a datetime by whole calendar months, preserving alignment.

    Naive day arithmetic (e.g. 91 days for a quarter) drifts across months of
    differing length; advancing by calendar months keeps sampling windows
    anchored to the same point in each period. The day is clamped to the last
    valid day of the target month (e.g. Jan 31 + 1 month -> Feb 28/29).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    # Last day of the target month via the first day of the following month.
    if month == 12:
        next_month_first = start.replace(year=year + 1, month=1, day=1)
    else:
        next_month_first = start.replace(year=year, month=month + 1, day=1)
    last_day = (next_month_first - timedelta(days=1)).day
    return start.replace(year=year, month=month, day=min(start.day, last_day))


@dataclass
class SamplingAsset:
    asset_id: str
    contaminant: str
    system_population: int
    last_sample_date: Optional[datetime]
    last_exceedance_date: Optional[datetime] = None
    current_frequency: Frequency = Frequency.QUARTERLY

def determine_frequency(asset: SamplingAsset) -> Frequency:
    """Evaluates SDWA conditional triggers and returns the required frequency."""
    # Stage 2 DBPR population threshold.
    if asset.contaminant in ("TTHM", "HAA5") and asset.system_population > 10000:
        return Frequency.MONTHLY

    # Violation-driven escalation (EPA 12-month rolling window).
    if asset.last_exceedance_date:
        days_since = (datetime.now(LOCAL_TZ) - asset.last_exceedance_date).days
        if days_since < 365:
            return Frequency.MONTHLY

    return Frequency.QUARTERLY

def calculate_next_due(last_sample: Optional[datetime], frequency: Frequency) -> datetime:
    """Calculates the next sampling deadline with timezone awareness."""
    if last_sample is None:
        return datetime.now(LOCAL_TZ)

    months = 1 if frequency == Frequency.MONTHLY else 3
    return add_months(last_sample, months)


def resolve_fallback_schedule(asset_id: str) -> Optional[datetime]:
    """Immediate operational resolution: reads local cache if primary DB is unreachable."""
    cache_path = f"/var/cache/sdwa/{asset_id}_schedule.json"
    try:
        with open(cache_path, "r") as f:
            data = json.load(f)
            return parse_utc(data["next_due"])
    except (FileNotFoundError, json.JSONDecodeError, KeyError, ValueError) as e:
        logger.warning(f"Local cache fallback failed for {asset_id}: {e}")
        return None

def fetch_compliance_state(asset_id: str) -> SamplingAsset:
    """Retrieves asset state with retry logic and a local-cache fallback."""
    session = requests.Session()
    retry_strategy = Retry(total=3, backoff_factor=1.5, status_forcelist=[500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retry_strategy))

    try:
        resp = session.get(f"https://compliance-db.internal/api/v1/assets/{asset_id}", timeout=10)
        resp.raise_for_status()
        payload = resp.json()
        last_exceedance = payload.get("last_exceedance_utc")
        return SamplingAsset(
            asset_id=asset_id,
            contaminant=payload["contaminant"],
            system_population=payload["population"],
            last_sample_date=parse_utc(payload["last_sample_utc"]),
            last_exceedance_date=parse_utc(last_exceedance) if last_exceedance else None
        )
    except requests.exceptions.RequestException as e:
        logger.error(f"Primary compliance DB unreachable for {asset_id}: {e}. Triggering fallback routing.")
        fallback_due = resolve_fallback_schedule(asset_id)
        if fallback_due:
            # WARNING, not INFO: running on cached state must stay visible to operators.
            logger.warning(f"Resuming operations using cached schedule for {asset_id}")
            return SamplingAsset(
                asset_id=asset_id, contaminant="UNKNOWN", system_population=0,
                last_sample_date=add_months(fallback_due, -1), current_frequency=Frequency.MONTHLY
            )
        raise RuntimeError(f"Critical: No viable schedule source for {asset_id}") from e

def generate_work_order(asset: SamplingAsset, due_date: datetime) -> Dict[str, Any]:
    """Formats an audit-ready work order payload for CMMS ingestion."""
    return {
        "asset_id": asset.asset_id,
        "contaminant": asset.contaminant,
        "scheduled_frequency": asset.current_frequency.value,
        "due_date": due_date.isoformat(),
        "priority": "HIGH" if asset.current_frequency == Frequency.MONTHLY else "STANDARD",
        "compliance_rule": "SDWA_Stage2_DBPR" if asset.contaminant in ("TTHM", "HAA5") else "SDWA_GENERAL",
        "generated_at": datetime.now(LOCAL_TZ).isoformat()
    }

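def evaluate_and_schedule(asset_id: str) -> None:
    """Core job execution with state evaluation and CMMS dispatch."""
    try:
        asset = fetch_compliance_state(asset_id)
        # fetch_compliance_state marks a cache-derived asset with contaminant="UNKNOWN".
        from_cache = asset.contaminant == "UNKNOWN"

        if from_cache:
            # Outage path: reissue the cached deadline unchanged. Re-deriving it from
            # the placeholder asset would move the date and overwrite the cache.
            next_due = resolve_fallback_schedule(asset_id)
            if next_due is None:
                raise RuntimeError(f"Cached schedule for {asset_id} became unreadable mid-run")
        else:
            asset.current_frequency = determine_frequency(asset)
            next_due = calculate_next_due(asset.last_sample_date, asset.current_frequency)

        wo_payload = generate_work_order(asset, next_due)
        logger.info(f"Dispatching work order: {json.dumps(wo_payload, indent=2)}")

        if not from_cache:
            # Persist to local cache for immediate fallback routing
            os.makedirs("/var/cache/sdwa", exist_ok=True)
            with open(f"/var/cache/sdwa/{asset_id}_schedule.json", "w") as f:
                json.dump({"next_due": next_due.isoformat(), "frequency": asset.current_frequency.value}, f)

    except Exception as e:
        logger.critical(f"Job execution failed for {asset_id}: {e}")
        # Route to dead-letter queue or SCADA alarm system here
        raise  # Re-raise so APScheduler emits EVENT_JOB_ERROR to the listener.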

def scheduler_event_listener(event):
    if event.exception:
        logger.error(f"Scheduler job failed: {event.job_id} | {event.exception}")
    elif event.code == EVENT_JOB_MISSED:
        logger.warning(f"Job missed execution window: {event.job_id}")

if __name__ == "__main__":
    scheduler = BlockingScheduler(timezone=LOCAL_TZ)
    scheduler.add_listener(scheduler_event_listener, EVENT_JOB_ERROR | EVENT_JOB_MISSED)

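    # Daily evaluation at 02:00 local time. misfire_grace_time (an illustrative one
    # hour) lets a late run still execute; coalesce folds a backlog into one run.
    scheduler.add_job(
        evaluate_and_schedule, "cron", hour=2, minute=0, args=["ASSET_001"],
        id="daily_compliance_eval", misfire_grace_time=3600, coalesce=True,
    )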
    logger.info("SDWA Compliance Scheduler initialized. Press Ctrl+C to exit.")
    try:
        scheduler.start()
    except KeyboardInterrupt:
        logger.info("Scheduler shutdown requested.")
        scheduler.shutdown()

Production pipelines cannot tolerate silent failures. When the primary compliance database or CMMS API becomes unreachable, the scheduler pivots to deterministic fallback routing: it reads the cached deadline, logs a WARNING, and keeps generating work orders from the last known compliant state so a transient outage never opens a sampling gap that would be recorded as a monitoring violation.

Figure: The fetch path. A reachable database drives a freshly computed schedule; an outage falls back to the cached deadline so no sampling gap opens.
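Because the fallback path depends entirely on the cache file being readable, one hardening step worth considering is an atomic write, so a crash partway through never leaves truncated JSON behind. The following is a sketch under that assumption: write_schedule_cache is a hypothetical helper, not a function from the listing above, and it takes the cache directory as a parameter so it can be exercised outside /var/cache/sdwa.

```python
import json
import os
import tempfile
from datetime import datetime


def write_schedule_cache(cache_dir: str, asset_id: str, next_due: datetime, frequency: str) -> str:
    """Writes the schedule to a temp file, then renames it over the target."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, f"{asset_id}_schedule.json")
    record = {"next_due": next_due.isoformat(), "frequency": frequency}

    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, where the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, target)
    except BaseException:
        # Leave the previous cache file untouched and clean up the partial write.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target
```

A reader of the cache then sees either the previous complete file or the new complete file, never a half-written one.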

Configuration Reference

Every tunable that affects compliance timing is centralized below. Treat these as configuration, not code — a regulatory change or a network-policy change should be a value edit, not a redeploy.

| Parameter | Value / Location | Purpose |
| --- | --- | --- |
| `LOCAL_TZ` | `America/New_York` | Anchors all window math to the utility’s reporting timezone; must match the primacy agency’s local time. |
| Population threshold | `> 10000` in `determine_frequency` | Stage 2 DBPR monthly-monitoring cutoff for TTHM/HAA5 systems. |
| Escalation window | `365` days in `determine_frequency` | Rolling window during which a prior MCL exceedance forces monthly sampling. |
| Monthly interval | `months = 1` in `calculate_next_due` | Calendar months added for a monthly cadence. |
| Quarterly interval | `months = 3` in `calculate_next_due` | Calendar months added for a quarterly cadence. |
| `Retry(total=3, backoff_factor=1.5)` | `fetch_compliance_state` | Exponential backoff on 5xx responses before fallback. |
| Request timeout | `10` seconds | Prevents thread starvation in the scheduler pool. |
| Cache path | `/var/cache/sdwa/{asset_id}_schedule.json` | Last-known-good deadline for fallback routing. |
| Cron trigger | `hour=2, minute=0` | Daily off-peak re-evaluation of every asset. |
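One way to make those values editable without touching code is to load them from the environment at startup. The sketch below assumes hypothetical variable names prefixed with SDWA_ and a hypothetical SchedulerConfig dataclass; the defaults mirror the table, but nothing in the listing above reads these variables yet.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SchedulerConfig:
    local_tz: str = "America/New_York"
    population_threshold: int = 10_000
    escalation_window_days: int = 365
    retry_total: int = 3
    retry_backoff_factor: float = 1.5
    request_timeout_s: float = 10.0
    cache_dir: str = "/var/cache/sdwa"
    cron_hour: int = 2
    cron_minute: int = 0


def load_config(env: Mapping[str, str] = os.environ) -> SchedulerConfig:
    """Builds a config from SDWA_* variables, falling back to the defaults."""
    defaults = SchedulerConfig()
    overrides = {}
    for name, default in vars(defaults).items():
        raw = env.get(f"SDWA_{name.upper()}")
        if raw is not None:
            # Cast to the type of the default so "15000" becomes an int.
            overrides[name] = type(default)(raw)
    return SchedulerConfig(**overrides)
```

With this in place, exporting SDWA_POPULATION_THRESHOLD=15000 before starting the daemon would change the cutoff, provided determine_frequency is adapted to read it from the loaded config.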

The compliance_rule field written into each work order maps the contaminant to its governing rule set, which downstream reporting reads directly:

| Contaminant | `compliance_rule` | Default cadence | Escalation trigger |
| --- | --- | --- | --- |
| TTHM | `SDWA_Stage2_DBPR` | Monthly if pop. > 10,000 | LRAA > 0.5·MCL, or any exceedance in 365 days |
| HAA5 | `SDWA_Stage2_DBPR` | Monthly if pop. > 10,000 | LRAA > 0.5·MCL, or any exceedance in 365 days |
| Other regulated | `SDWA_GENERAL` | Quarterly | MCL exceedance, Tier 1 violation, or source-water change |
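The same mapping can be held as data rather than as an inline conditional, which keeps the table and the code from drifting apart. This is an illustrative sketch: RuleProfile, RULE_PROFILES, rule_for, and population_forces_monthly are hypothetical names, and only the rule and population columns of the table are encoded.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RuleProfile:
    compliance_rule: str
    monthly_population_cutoff: Optional[int]  # None: population never forces monthly


RULE_PROFILES: Dict[str, RuleProfile] = {
    "TTHM": RuleProfile("SDWA_Stage2_DBPR", 10_000),
    "HAA5": RuleProfile("SDWA_Stage2_DBPR", 10_000),
}
DEFAULT_PROFILE = RuleProfile("SDWA_GENERAL", None)


def rule_for(contaminant: str) -> RuleProfile:
    """Looks up the governing rule, defaulting to the general SDWA profile."""
    return RULE_PROFILES.get(contaminant, DEFAULT_PROFILE)


def population_forces_monthly(contaminant: str, population: int) -> bool:
    """Applies the strict greater-than cutoff from the contaminant's profile."""
    cutoff = rule_for(contaminant).monthly_population_cutoff
    return cutoff is not None and population > cutoff
```

Adding a contaminant family then becomes a single new dictionary entry rather than another branch in determine_frequency and generate_work_order.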

Verification & Testing

Frequency resolution and calendar arithmetic are the two functions most likely to produce a silent, off-by-a-month compliance gap, so pin their behavior with unit tests before deploying. The following pytest snippet exercises the population threshold, an exceedance inside the escalation window, and the month-clamping edge case that naive day math gets wrong.

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

from scheduler import SamplingAsset, Frequency, determine_frequency, add_months

LOCAL_TZ = ZoneInfo("America/New_York")

def test_large_dbpr_system_is_monthly():
    asset = SamplingAsset("A1", "TTHM", 25000, last_sample_date=None)
    assert determine_frequency(asset) == Frequency.MONTHLY

def test_small_system_defaults_quarterly():
    asset = SamplingAsset("A2", "TTHM", 8000, last_sample_date=None)
    assert determine_frequency(asset) == Frequency.QUARTERLY

def test_recent_exceedance_forces_monthly():
    recent = datetime.now(LOCAL_TZ) - timedelta(days=100)
    asset = SamplingAsset("A3", "HAA5", 3000, last_sample_date=None,
                          last_exceedance_date=recent)
    assert determine_frequency(asset) == Frequency.MONTHLY

def test_month_end_clamps_to_february():
    jan31 = datetime(2025, 1, 31, tzinfo=LOCAL_TZ)
    assert add_months(jan31, 1).day == 28  # not a March 3 rollover
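The tests above depend on datetime.now through determine_frequency, so they cannot pin the exact day on which the escalation window closes. A possible refinement, sketched here as an assumption rather than as part of the scheduler module, is to factor the window check into a function that takes the current time as an argument. The name within_escalation_window is hypothetical, and UTC is used only to keep the example free of timezone-data dependencies.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def within_escalation_window(
    last_exceedance: Optional[datetime], now: datetime, window_days: int = 365
) -> bool:
    """True when an exceedance occurred fewer than window_days whole days ago."""
    if last_exceedance is None:
        return False
    return (now - last_exceedance).days < window_days


def test_escalation_window_boundary():
    now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
    assert within_escalation_window(now - timedelta(days=364), now)
    # Exactly 365 days later the asset is eligible to return to quarterly.
    assert not within_escalation_window(now - timedelta(days=365), now)
    assert not within_escalation_window(None, now)
```

Injecting the clock keeps the boundary test deterministic; determine_frequency could call such a helper with datetime.now(LOCAL_TZ) without changing its behavior.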

Confirm the deployment against these acceptance criteria before enabling the live cron trigger:

determine_frequency returns MONTHLY for TTHM/HAA5 systems with population strictly above 10,000.
A confirmed MCL exceedance inside the trailing 365 days escalates any asset to monthly.
add_months clamps to the last valid day of short months (Jan 31 → Feb 28/29) with no rollover.
Every timestamp is offset-aware; parse_utc promotes naive inputs to UTC before conversion.
A simulated primary-database outage resolves the schedule from /var/cache/sdwa and logs a WARNING.
EVENT_JOB_MISSED and EVENT_JOB_ERROR both reach the SCADA/ITSM alarm channel, not just the log file.

Troubleshooting & Gotchas

Naive timestamps silently drift across DST. If the compliance API returns a bare datetime with no offset and it is used directly, spring-forward and fall-back transitions shift the window by an hour and can push a sample across a period boundary. Always route inbound timestamps through parse_utc so every value is offset-aware before any arithmetic.
Day-count quarters land on the wrong month. Adding timedelta(days=91) for a quarter accumulates drift because months differ in length; a January anchor eventually walks off its intended day. add_months advances by whole calendar months and clamps the day, keeping windows anchored — verify it is used everywhere instead of raw timedelta.
Population exactly at the threshold. The Stage 2 DBPR cutoff uses strict greater-than (> 10000), so a system serving exactly 10,000 stays quarterly by default. Confirm your primacy agency’s interpretation and adjust the comparison rather than assuming; a wrong operator here misclassifies an entire system.
Fallback masks a stale schedule. The cache keeps operations running during an outage, but if the primary database stays down for days the daemon will happily reissue an outdated deadline. Emit an escalating alert when the fallback path fires more than once for the same asset, and treat repeated fallback as an incident, not a warning.
Missed jobs after daemon downtime. If the process is down at 02:00, APScheduler may skip the run entirely. Set a misfire_grace_time on the job and confirm the EVENT_JOB_MISSED listener alarms, so a late run still executes instead of waiting a full day. With the default in-memory job store, a restart recomputes the next fire time from scratch and never sees the missed run, so pair the grace time with a persistent job store or an evaluation pass at startup. Sampling-completeness signals also flow to the Violation Detection Rule Engine, which independently flags a window that closed without a sample; use it as a backstop, not the primary guard.
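The repeated-fallback alert recommended in this list can be approximated with a small per-asset counter. The sketch below is hypothetical: FallbackTracker and its severity labels are illustrative names, the counts live in memory only and therefore reset on restart, and wiring the returned severity to a real alarm channel is left to the deployment.

```python
from collections import defaultdict
from typing import DefaultDict


class FallbackTracker:
    """Counts consecutive cache fallbacks per asset and grades their severity."""

    def __init__(self, incident_after: int = 2) -> None:
        self.incident_after = incident_after
        self._streak: DefaultDict[str, int] = defaultdict(int)

    def record_fallback(self, asset_id: str) -> str:
        """Call when a run used the cache; returns 'WARNING' or 'INCIDENT'."""
        self._streak[asset_id] += 1
        if self._streak[asset_id] >= self.incident_after:
            return "INCIDENT"
        return "WARNING"

    def record_success(self, asset_id: str) -> None:
        """Call after a run that reached the primary database."""
        self._streak.pop(asset_id, None)


tracker = FallbackTracker()
print(tracker.record_fallback("ASSET_001"))  # first outage run
print(tracker.record_fallback("ASSET_001"))  # second consecutive outage run
```

Resetting the streak on success means a brief outage followed by recovery stays a warning, while a database that remains down across runs is promoted to an incident.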

Monitoring Frequency Scheduling — parent module: resolving cadence, opening and closing windows, and reconciling sampling tasks.
Core Architecture & SDWA Compliance Taxonomy — the domain and data models this scheduler plugs into.
SDWA MCL Reference Mapping — the MCL limits and parameter codes the escalation logic reads.
Translating EPA Violation Codes to Internal Alerts — how a raised violation becomes the escalation trigger consumed here.
Handling Missing Sensor Readings Without Triggering False Violations — the downstream guard that distinguishes a missed sample from a data gap.