Configuring Celery for Async Water Quality Batches

Municipal Python developers who build compliance automation need a task queue that runs Safe Drinking Water Act (SDWA) validation over batched SCADA telemetry without ever stalling the live acquisition path. This page is the concrete broker, backend, and routing configuration that implements the execution tier described in Async Batch Processing Setup: how to wire Celery so that discrete, time-bounded measurement windows are validated against EPA thresholds, persisted to an audit-ready ledger, and quarantined for review when a payload is malformed or a worker dies mid-commit. Get the acknowledgment, prefetch, and dead-letter settings wrong and you silently drop readings that a Discharge Monitoring Report (DMR) later depends on; get them right and the pipeline stays fully attributable even during telemetry bursts. The task queue consumes the single canonical reading shape that Time-Series Alignment Strategies emits after Modbus TCP Parsing Workflows and OPC UA Data Extraction have normalized the raw signals.

The worker acknowledges a message only after it commits to the ledger, so a crash between delivery and commit requeues the batch instead of dropping readings that a DMR later depends on.

Prerequisites & Environment Setup

This configuration targets Celery 5.4 on Python 3.11+, with RabbitMQ as the broker and PostgreSQL as the result backend. RabbitMQ is the operational standard here because quorum queues give you replicated, disk-persistent delivery that survives a node restart during a bursty polling cycle; PostgreSQL is the recommended backend because its ACID guarantees let a compliance auditor trace every task state against an EPA reporting period. Redis is viable for throughput-only workloads, but its result backend does not offer the durability municipal audit trails require.

Provision the runtime and a dedicated broker virtual host before writing any task code:

# Python dependencies (pin minor versions for reproducible compliance builds)
pip install "celery[librabbitmq]==5.4.*" "kombu==5.4.*" \
            "sqlalchemy==2.0.*" "psycopg2-binary==2.9.*"

# RabbitMQ: isolate the utility workload on its own vhost + user.
# Queue type is fixed at declaration time and cannot be applied by policy, so
# make quorum the vhost default; queues declared without x-queue-type inherit it.
# Quorum queues do not support global QoS prefetch; confirm your Celery release
# handles that before rollout.
rabbitmqctl add_vhost utility_vhost --default-queue-type quorum
rabbitmqctl add_user scada_broker "$BROKER_PASSWORD"
rabbitmqctl set_permissions -p utility_vhost scada_broker ".*" ".*" ".*"

Every batch enters already carrying contaminant_id, parameter, reading_value, sample_ts, quality_flag, and a 64-character source_hash, so the worker never has to reason about protocol quirks — it only validates and persists. The batch_id doubles as the idempotency key that makes reprocessing safe.
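The reading contract described above can be pinned down as a small typed record. The sketch below is an assumption about types and validation rules: only the field names and the 64-character source_hash come from this page, while CanonicalReading, the hex-digest check, and the timezone check are hypothetical illustrations rather than the schema the upstream alignment stage actually emits.

```python
import math
import string
from dataclasses import dataclass
from datetime import datetime, timezone

_HEX = set(string.hexdigits)


@dataclass(frozen=True)
class CanonicalReading:
    """Hypothetical typed view of one normalized reading (field types assumed)."""
    batch_id: str          # idempotency key shared by every reading in a window
    contaminant_id: str
    parameter: str
    reading_value: float
    sample_ts: datetime    # assumed to be timezone-aware
    quality_flag: str
    source_hash: str       # 64 characters; assumed to be a hex digest

    def problems(self) -> list[str]:
        """Return human-readable contract violations (empty list means valid)."""
        found = []
        if not self.batch_id:
            found.append("batch_id is empty")
        if len(self.source_hash) != 64 or not set(self.source_hash) <= _HEX:
            found.append("source_hash is not 64 hex characters")
        if self.sample_ts.tzinfo is None:
            found.append("sample_ts is not timezone-aware")
        if math.isnan(self.reading_value):
            found.append("reading_value is NaN")
        return found


good = CanonicalReading(
    batch_id="B-2024-001", contaminant_id="C-100", parameter="turbidity_ntu",
    reading_value=0.4, sample_ts=datetime(2024, 1, 1, tzinfo=timezone.utc),
    quality_flag="GOOD", source_hash="a" * 64,
)
bad = CanonicalReading(
    batch_id="", contaminant_id="C-100", parameter="ph",
    reading_value=7.1, sample_ts=datetime(2024, 1, 1),
    quality_flag="GOOD", source_hash="xyz",
)
print(good.problems())
print(bad.problems())
```

A check like this belongs at the producer boundary, before enqueue, so that a reading that breaks the contract is rejected once rather than retried by a worker.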

Step-by-Step Implementation

Step 1 — Initialize the Celery application

Create the app with strict JSON serialization (never pickle, which is a remote-code-execution vector on an OT-adjacent network), late acknowledgments, and a prefetch of one. Late acks plus task_reject_on_worker_lost guarantee the broker only removes a message after the validation logic commits, preserving chain-of-custody for SCADA Data Ingestion & Time-Series Sync workloads.

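import os
from urllib.parse import quote

from celery import Celery
from kombu import Exchange, Queue

# scada_broker is the RabbitMQ user created above, not a hostname. The password
# comes from the environment; localhost is a placeholder for the broker host.
BROKER_PASSWORD = quote(os.environ['BROKER_PASSWORD'], safe='')

app = Celery(
    'water_quality_pipeline',
    broker=f'amqp://scada_broker:{BROKER_PASSWORD}@localhost:5672/utility_vhost',
    backend='db+postgresql://compliance_user:pwd@localhost:5432/celery_results',
)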

app.conf.update(
    task_serializer='json',
    accept_content=['json'],
    result_serializer='json',
    timezone='UTC',
    enable_utc=True,
    task_track_started=True,
    task_acks_late=True,              # ack only after a successful commit
    worker_prefetch_multiplier=1,     # no worker hoards batches it cannot start
    task_reject_on_worker_lost=True,  # requeue if the worker dies mid-task
    task_acks_on_failure_or_timeout=False,  # reject (dead-letter) failed tasks instead of acking them
    task_default_delivery_mode='persistent',
    broker_transport_options={'confirm_publish': True},
    task_routes={
        'water_quality_pipeline.tasks.process_water_quality_batch':
            {'queue': 'compliance_validation'},
        'water_quality_pipeline.tasks.archive_raw_telemetry':
            {'queue': 'data_archival'},
    },
)

Step 2 — Define the validation task

Keep the pure compliance arithmetic separate from I/O so it can be unit-tested without a database. The evaluate_exceedances helper compares each reading against the configured limits; the Celery task fetches the window, calls the helper, and upserts exactly one ledger row per batch_id.

import logging
import random
from datetime import datetime, timezone
from sqlalchemy import create_engine, text

logger = logging.getLogger(__name__)

EPA_THRESHOLDS = {
    'turbidity_ntu': 1.0,           # max
    'residual_chlorine_mg_l': 0.2,  # min
    'ph_range': (6.5, 8.5),         # inclusive band
}


def evaluate_exceedances(rows, thresholds=EPA_THRESHOLDS):
    """Pure function: return the list of readings that breach EPA limits."""
    exceedances = []
    for row in rows:
        param, val = row.parameter, row.reading_value
        if param == 'turbidity_ntu' and val > thresholds['turbidity_ntu']:
            exceedances.append({'sensor_id': row.sensor_id, 'param': param,
                                'value': val, 'limit': thresholds['turbidity_ntu']})
        elif param == 'residual_chlorine_mg_l' and val < thresholds['residual_chlorine_mg_l']:
            exceedances.append({'sensor_id': row.sensor_id, 'param': param,
                                'value': val, 'limit': thresholds['residual_chlorine_mg_l']})
        elif param == 'ph':
            low, high = thresholds['ph_range']
            if val < low or val > high:
                exceedances.append({'sensor_id': row.sensor_id, 'param': param,
                                    'value': val, 'limit': thresholds['ph_range']})
    return exceedances


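# One engine (and connection pool) per worker process, not one per task call.
# The engine is lazy, so no connection is opened until the first query runs.
engine = create_engine('postgresql://compliance_user:pwd@localhost:5432/staging')


@app.task(bind=True, max_retries=3, default_retry_delay=60)
def process_water_quality_batch(self, batch_id: str, start_ts: str, end_ts: str) -> dict:
    """Validate one reporting window against EPA limits and commit the outcome."""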
    try:
        query = text("""
            SELECT sensor_id, parameter, reading_value, recorded_at
            FROM raw_telemetry
            WHERE recorded_at BETWEEN :start_ts AND :end_ts AND batch_id = :batch_id
            ORDER BY recorded_at ASC
        """)
        with engine.connect() as conn:
            rows = conn.execute(query, {'start_ts': start_ts, 'end_ts': end_ts,
                                        'batch_id': batch_id}).fetchall()

        if not rows:
            return {'status': 'empty', 'batch_id': batch_id,
                    'processed_at': datetime.now(timezone.utc).isoformat()}

        exceedances = evaluate_exceedances(rows)
        status = 'EXCEEDED' if exceedances else 'VALIDATED'

        # Idempotent write: batch_id is the primary key, so a retried batch
        # updates in place rather than emitting a second compliance record.
        with engine.begin() as conn:
            conn.execute(text("""
                INSERT INTO compliance_ledger (batch_id, exceedance_count, processed_at, status)
                VALUES (:batch_id, :count, :ts, :status)
                ON CONFLICT (batch_id) DO UPDATE
                SET exceedance_count = EXCLUDED.exceedance_count,
                    processed_at = EXCLUDED.processed_at, status = EXCLUDED.status
            """), {'batch_id': batch_id, 'count': len(exceedances),
                   'ts': datetime.now(timezone.utc).isoformat(), 'status': status})

        return {'status': status, 'batch_id': batch_id, 'exceedances': exceedances}

    except Exception as exc:
        logger.error("Batch %s failed: %s", batch_id, exc)
        # Exponential backoff with jitter for transient DB/network failures
        backoff = min(60 * (2 ** self.request.retries), 300)
        countdown = backoff + random.uniform(0, backoff * 0.1)
        raise self.retry(exc=exc, countdown=countdown)
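The retry arithmetic in the except block is easier to reason about in isolation. The sketch below restates it as a standalone function so the schedule can be inspected without a worker; retry_countdown and its parameters are hypothetical names introduced for illustration, with defaults copied from the task above.

```python
import random
from typing import Optional


def retry_countdown(retries: int, base: float = 60.0, cap: float = 300.0,
                    jitter: float = 0.1,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before the next attempt; mirrors the task's arithmetic."""
    if rng is None:
        rng = random.Random()
    backoff = min(base * (2 ** retries), cap)          # exponential, capped
    return backoff + rng.uniform(0, backoff * jitter)  # up to 10% jitter on top


# Show the window each retry's delay falls into.
for attempt in range(4):
    floor = min(60.0 * 2 ** attempt, 300.0)
    delay = retry_countdown(attempt)
    print(f"retries={attempt}: waited {delay:.1f}s (floor {floor:.0f}s)")
```

With max_retries=3 the task only schedules a retry while self.request.retries is 0, 1, or 2, so the waits are roughly 60, 120, and 240 seconds plus jitter. The 300-second cap only comes into play if max_retries is raised.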

Exceedance results feed directly into the Violation Detection Rule Engine; the real-time evaluation logic those flags trigger is covered in Python Logic for Detecting MCL Exceedances in Real Time.

Step 3 — Configure dead-letter quarantine

Non-recoverable failures — malformed payloads, corrupted calibration offsets, schema drift — must be quarantined for engineering review, never silently dropped. Bind the primary queue to a dead-letter exchange (DLX) so RabbitMQ routes rejected or expired messages to a compliance_dlq for manual, idempotent replay.

app.conf.task_default_queue = 'compliance_validation'
app.conf.task_default_exchange = 'water_quality'
app.conf.task_default_exchange_type = 'direct'
app.conf.task_default_routing_key = 'validate'

app.conf.task_queues = (
    Queue(
        'compliance_validation',
        Exchange('water_quality', type='direct'),
        routing_key='validate',
        queue_arguments={
            'x-dead-letter-exchange': 'water_quality_dlx',
            'x-dead-letter-routing-key': 'dlq',
        },
    ),
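    # Declared here so Celery creates and binds the DLQ. Workers consume every
    # queue in task_queues by default, so start them with -Q compliance_validation
    # to keep them from re-running quarantined messages.
    Queue('compliance_dlq', Exchange('water_quality_dlx', type='direct'), routing_key='dlq'),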
)

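Bounded retries with backoff absorb transient failures. Once they are exhausted the task fails, and RabbitMQ dead-letters the payload to compliance_dlq only if the worker rejects the message. Celery acknowledges failed late-ack tasks by default, so task_acks_on_failure_or_timeout must be False for the quarantine to take effect.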
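One way to keep non-recoverable payloads out of the retry loop is to classify the exception before deciding what to do with the batch. The sketch below is hypothetical: FailureAction, classify_failure, and the exception grouping are assumptions for illustration, not part of this pipeline. Inside a Celery task with late acks, the quarantine branch would typically raise celery.exceptions.Reject with requeue=False so that RabbitMQ dead-letters the message.

```python
from enum import Enum


class FailureAction(Enum):
    RETRY = "retry"            # transient: back off and try again
    QUARANTINE = "quarantine"  # non-recoverable: route to the dead-letter queue


# Assumed grouping: errors that indicate bad data rather than a bad moment.
# Tune this tuple to the exceptions your parsers and drivers actually raise.
NON_RECOVERABLE = (ValueError, KeyError, TypeError)


def classify_failure(exc: BaseException, retries: int, max_retries: int) -> FailureAction:
    """Decide whether a failed batch should be retried or quarantined."""
    if isinstance(exc, NON_RECOVERABLE):
        return FailureAction.QUARANTINE   # retrying cannot fix a malformed payload
    if retries < max_retries:
        return FailureAction.RETRY        # assume transient while attempts remain
    return FailureAction.QUARANTINE       # attempts exhausted


print(classify_failure(ConnectionError("db down"), retries=0, max_retries=3))
print(classify_failure(ValueError("bad calibration offset"), retries=0, max_retries=3))
print(classify_failure(TimeoutError("slow query"), retries=3, max_retries=3))
```

Quarantining malformed payloads on the first failure saves the retry budget for faults that can actually clear, and gets the bad batch in front of an engineer several minutes sooner.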

Attach a task_failure signal handler as the operational alerting hook so on-call staff learn about a quarantined payload the moment retries are exhausted:

from celery.signals import task_failure

@task_failure.connect
def alert_on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
    """Alert compliance operations once retries are exhausted and the payload
    has been dead-lettered to compliance_dlq for manual review."""
    retries = getattr(getattr(sender, 'request', None), 'retries', 0)
    max_retries = getattr(sender, 'max_retries', 0) or 0
    if retries >= max_retries:
        logger.critical(
            "Task %s exhausted retries; payload quarantined in compliance_dlq. Exception: %s",
            task_id, exception,
        )
        # Integration point: compliance_ops.notify_dlq(task_id, exception)

Configuration Reference

Every setting below is load-bearing for either data durability or audit integrity. Treat these values as the minimum production baseline for a compliance pipeline.

Setting	Value	Why it matters for compliance
`task_acks_late`	`True`	Message is acked only after the ledger commit, so a mid-task crash requeues the batch instead of losing readings.
`worker_prefetch_multiplier`	`1`	Prevents one worker from reserving batches it cannot yet start, so a crash re-releases at most one window.
`task_reject_on_worker_lost`	`True`	Requeues (not drops) a batch when the worker process is killed mid-execution.
`task_acks_on_failure_or_timeout`	`False`	Rejects a failed batch instead of acking it, so RabbitMQ dead-letters it to `compliance_dlq`; the default (`True`) acks and discards it.
`task_serializer` / `accept_content`	`json`	Blocks `pickle` deserialization on an OT-adjacent network and enforces a structured payload contract.
`task_default_delivery_mode`	`persistent` (`2`)	Persists messages to disk so an in-flight batch survives a broker restart.
`broker_transport_options.confirm_publish`	`True`	Producer blocks until RabbitMQ confirms the enqueue, closing the publish-loss gap.
`max_retries` / `default_retry_delay`	`3` / `60s`	Bounded retries with exponential backoff before a batch is dead-lettered.
`x-dead-letter-exchange`	`water_quality_dlx`	Routes rejected/expired batches to `compliance_dlq` for operator replay.
`timezone` / `enable_utc`	`UTC` / `True`	Keeps every timestamp on the monotonic UTC axis the reporting windows are cut on.

Verification & Testing

Because the compliance arithmetic lives in a pure function, the highest-value assertions need no broker or database. Test evaluate_exceedances directly against the EPA limits:

import pytest
from types import SimpleNamespace
from water_quality_pipeline.tasks import evaluate_exceedances

def _reading(param, value, sensor_id="S-1"):
    return SimpleNamespace(parameter=param, reading_value=value, sensor_id=sensor_id)

def test_turbidity_over_limit_flags_exceedance():
    result = evaluate_exceedances([_reading("turbidity_ntu", 1.4)])
    assert len(result) == 1 and result[0]["limit"] == 1.0

def test_low_chlorine_residual_flags_exceedance():
    result = evaluate_exceedances([_reading("residual_chlorine_mg_l", 0.1)])
    assert len(result) == 1

def test_ph_inside_band_is_compliant():
    assert evaluate_exceedances([_reading("ph", 7.2)]) == []

def test_ph_outside_band_flags_exceedance():
    assert len(evaluate_exceedances([_reading("ph", 9.0)])) == 1

Then confirm the live queue behaves, using Celery’s inspection API and RabbitMQ’s queue stats:

celery -A water_quality_pipeline inspect active
celery -A water_quality_pipeline inspect stats
rabbitmqctl list_queues -p utility_vhost name messages consumers

Acceptance criteria before this configuration reaches production:

task_acks_late, worker_prefetch_multiplier=1, and task_reject_on_worker_lost are all set
Broker queues are durable quorum queues with persistent delivery mode, and the result backend is PostgreSQL
compliance_validation declares x-dead-letter-exchange and a bound compliance_dlq
Killing a worker mid-batch requeues the window and leaves no partial ledger row
Re-running the same batch_id produces exactly one compliance record (idempotent)
evaluate_exceedances unit tests pass for turbidity, chlorine, and pH edge cases
A forced task failure fires the task_failure alert and lands the payload in compliance_dlq
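The settings-level criteria in the list above can be checked mechanically before a deploy. The following is a minimal sketch, not part of the original pipeline: REQUIRED_SETTINGS and durability_violations are hypothetical names, and the function works on a plain dict so it runs without a broker.

```python
# Baseline values taken from the configuration shown earlier on this page.
REQUIRED_SETTINGS = {
    "task_acks_late": True,
    "worker_prefetch_multiplier": 1,
    "task_reject_on_worker_lost": True,
    "task_serializer": "json",
    "accept_content": ["json"],
    "result_serializer": "json",
    "task_default_delivery_mode": "persistent",
    "enable_utc": True,
}


def durability_violations(conf: dict) -> list[str]:
    """Compare a settings mapping against the baseline and list mismatches."""
    problems = []
    for key, expected in REQUIRED_SETTINGS.items():
        actual = conf.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems


# A configuration that drifted: someone raised the prefetch multiplier.
drifted = dict(REQUIRED_SETTINGS, worker_prefetch_multiplier=4)
print(durability_violations(REQUIRED_SETTINGS))
print(durability_violations(drifted))
```

In a deployment the dict could be built from the live configuration, for example {key: app.conf.get(key) for key in REQUIRED_SETTINGS}, and the check wired into CI so that a regression in any durability setting fails the build.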

Troubleshooting & Gotchas

Queue depth climbs while consumers sit idle. With worker_prefetch_multiplier left at its default of 4, each worker process reserves batches it cannot start yet, so some workers hold a backlog while others idle. Confirm the depth with rabbitmqctl list_queues, then restart workers with explicit flags: celery -A water_quality_pipeline worker -Q compliance_validation --concurrency=4 --prefetch-multiplier=1 --loglevel=info. The -Q flag also keeps workers from consuming compliance_dlq.
Duplicate compliance records after a retry. Late acknowledgment means a batch can legitimately run twice. Without the ON CONFLICT (batch_id) upsert you get two ledger rows and a double-counted DMR figure. Always key ledger writes on batch_id.
kombu.exceptions.ContentDisallowed in the worker log. A producer serialized with pickle while the worker accepts only json, so the worker refuses to deserialize the message. Align task_serializer/accept_content on both ends; never re-enable pickle to paper over it.
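Payloads vanish instead of quarantining. Either the DLX arguments are missing or misspelled, or failed tasks are being acknowledged instead of rejected (task_acks_on_failure_or_timeout left at its default of True). Verify with rabbitmqadmin --vhost=utility_vhost get queue=compliance_dlq count=10: an empty DLQ after a known-bad batch points to one of those two causes. Queue arguments such as x-dead-letter-exchange are fixed at queue creation, so correcting them means deleting and re-declaring the queue once it has drained.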
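Result backend bloats and audit queries slow down. result_expires defaults to one day, but the database backend purges expired rows only when celery beat runs the built-in celery.backend_cleanup task; without beat, task states accumulate indefinitely. Set result_expires explicitly to match your statutory retention window and run beat, rather than deleting rows ad hoc, so the celery_results database still cross-references cleanly against the compliance ledger before DMR submission.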
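The duplicate-record gotcha above can be reproduced without PostgreSQL. The sketch below uses an in-memory SQLite database purely as a stand-in (SQLite 3.24+ accepts the same ON CONFLICT ... DO UPDATE form); the table mirrors the compliance_ledger columns used in the task, and record_batch is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE compliance_ledger (
        batch_id TEXT PRIMARY KEY,
        exceedance_count INTEGER NOT NULL,
        processed_at TEXT NOT NULL,
        status TEXT NOT NULL
    )
""")

UPSERT = """
    INSERT INTO compliance_ledger (batch_id, exceedance_count, processed_at, status)
    VALUES (:batch_id, :count, :ts, :status)
    ON CONFLICT (batch_id) DO UPDATE
    SET exceedance_count = excluded.exceedance_count,
        processed_at = excluded.processed_at,
        status = excluded.status
"""


def record_batch(batch_id: str, count: int, ts: str, status: str) -> None:
    """Write the outcome for one batch; a repeat call overwrites, never duplicates."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, {"batch_id": batch_id, "count": count,
                              "ts": ts, "status": status})


# Simulate a late-ack redelivery: the same batch is processed twice.
record_batch("B-001", 2, "2024-01-01T00:05:00+00:00", "EXCEEDED")
record_batch("B-001", 2, "2024-01-01T00:06:10+00:00", "EXCEEDED")

rows = conn.execute("SELECT batch_id, processed_at FROM compliance_ledger").fetchall()
print(rows)
```

The second call updates processed_at in place, so the ledger holds one row for B-001 regardless of how many times the broker redelivers the batch.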

Related pages

Async Batch Processing Setup: parent topic and the execution-tier architecture this page implements
SCADA Data Ingestion & Time-Series Sync: the ingestion domain these batches belong to
Parsing Modbus Registers for Turbidity Sensors: how the readings a batch validates are decoded
Aligning Irregular SCADA Timestamps to UTC: how batches get a clean UTC time axis before enqueue
Python Logic for Detecting MCL Exceedances in Real Time: downstream evaluation of the exceedance flags this task emits