Batch Processing Pipelines for DuckDB Spatial Workloads

A batch geospatial pipeline turns a directory of source files into a derived dataset under a fixed memory and time budget, repeatably, with no interactive operator watching it run. The hard part is not the spatial SQL — it is making that SQL deterministic: bounded peak memory, predictable parallelism, controlled spill, and a clean restart story when one partition fails. This reference is part of the Python & DuckDB integration workflows guide and inherits its governing rule — Python orchestrates, DuckDB computes, and geometry crosses the boundary as zero-copy Arrow buffers — then adds the partitioning, checkpointing, and resource-guardrail discipline that distinguishes a one-off query from a production pipeline.

The defining constraint of batch work is that a job must fit a worst-case partition under its memory ceiling, not an average one. A pipeline tuned to the median tile will OOM on the one dense urban tile with ten million parcels. Every pattern below exists to keep the per-partition working set bounded so the whole run completes in one pass instead of dying two hours in.

Runtime Configuration & Memory Guardrails

Batch pipelines must cap parallelism and memory explicitly. By default DuckDB scales threads to the host core count and sets memory_limit to 80% of system RAM, which is fine for an interactive session but dangerous in a container, where the orchestrator OOM-kills the process the moment resident memory crosses the cgroup limit. Set every guardrail at connection startup so the same code behaves identically on a laptop and in a scheduler.

import duckdb

def open_batch_connection(spill_dir: str = "/data/duckdb_spill") -> duckdb.DuckDBPyConnection:
    con = duckdb.connect(":memory:")
    con.execute("INSTALL spatial; LOAD spatial;")

    # Engine-internal parallelism. 8 is a sane baseline on a 16-core host;
    # past the physical core count, R-tree lookups contend on shared cache
    # lines and spatial-join throughput DROPS rather than rises.
    con.execute("SET threads = 8;")

    # Hard ceiling. On breach DuckDB spills intermediates to temp_directory
    # instead of being OOM-killed. Set to ~70% of the container memory cap so
    # the Python interpreter and any fetched Arrow buffers still have headroom.
    con.execute("SET memory_limit = '12GB';")

    # Spill target — MUST be on fast local storage (see trade-off below).
    con.execute(f"SET temp_directory = '{spill_dir}';")

    # Bound the spill so one runaway ST_Union can't fill the disk and take
    # down every other tenant on the node.
    con.execute("SET max_temp_directory_size = '50GB';")

    # Drops the row-order barrier, unlocking parallel hash aggregation and
    # spatial partitioning. Trade-off: output row order is undefined — any
    # downstream consumer must rely on explicit ORDER BY or a surrogate key.
    con.execute("SET preserve_insertion_order = false;")

    # No TTY in a scheduler; the progress bar just pollutes logs.
    con.execute("SET enable_progress_bar = false;")
    return con

Two trade-offs dominate. First, the spill path: when intermediates approach memory_limit, DuckDB transitions hash tables and sort buffers to temporary files, and uncontrolled spilling degrades throughput by 3–5× on NVMe and 10–50× on network or HDD storage. Keep temp_directory on a volume with greater than 2000 MB/s sequential write; on slower storage, drop to SET threads = 4 so fewer lanes spill concurrently. The decision between an in-memory database and a persisted file changes how the buffer manager treats spill and cache — see in-memory vs disk storage for the buffer-pool mechanics. Second, memory_limit and threads interact multiplicatively: each thread can hold its own copy of an in-flight hash table, so peak memory scales roughly with thread count. When a spatial join thrashes, lower threads before lowering memory_limit.

Trade-off: disabling preserve_insertion_order improves grouped and windowed spatial aggregation throughput by 20–40%, but destroys row-sequence guarantees. Treat result order as undefined and add an explicit ORDER BY wherever a consumer depends on it.

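Note: memory_limit is the canonical setting name, and max_memory is accepted as an alias for it. A misspelled setting name is rejected with an error rather than silently ignored, so let that error fail the job at startup instead of catching it and running unbounded until the OOM killer intervenes.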
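The sizing rules in this section can be derived from the container rather than hard-coded. The following is a minimal sketch, not a DuckDB API: derive_guardrails and Guardrails are hypothetical names, the 70% headroom factor comes from the comment in the connection function, and the 2000 MB/s threshold with the threads = 4 fallback comes from the spill trade-off above. Treat the numbers as starting points to measure against, not as fixed constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    threads: int
    memory_limit: str  # value for SET memory_limit, e.g. '12GB'


def derive_guardrails(container_gb: float, physical_cores: int,
                      spill_write_mb_s: float) -> Guardrails:
    """Derive SET values from the container cap (hypothetical helper)."""
    # Leave ~30% of the cap for the interpreter and fetched Arrow buffers.
    memory_gb = max(1, int(container_gb * 0.70))
    # Never ask for more engine threads than physical cores.
    threads = max(1, physical_cores)
    # On slow spill storage, cap at 4 so fewer threads spill at once.
    if spill_write_mb_s < 2000:
        threads = min(threads, 4)
    return Guardrails(threads=threads, memory_limit=f"{memory_gb}GB")
```

The result can then be formatted into the SET threads and SET memory_limit statements at connection startup, so a laptop and a scheduler node each receive limits matched to their own resources.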

Primary Execution Pattern: Partition, Process, Checkpoint

The canonical batch shape is three-staged. Partition the input on a natural key that aligns with the file layout (region, tile, or date), so each unit of work reads a contiguous slice rather than scanning everything. Process each partition with a single self-contained spatial query whose working set fits under memory_limit. Checkpoint each partition’s output to its own GeoParquet file and record success, so a re-run skips completed partitions instead of recomputing the whole dataset. This converts one unbounded query into many bounded, independently restartable ones.

Each partition is an isolated, restartable unit: bounded working set, own output file, recorded in a manifest so a re-run resumes instead of recomputing.

The processing query keeps all geometry work inside SQL. Construction, validation, measurement, and the spatial join itself run as vectorized ST_ kernels; Python only chooses the partition and writes the result. The bounding-box predicate goes in the ON clause so the planner can route it through an R-tree index before the exact topology test, the same predicate-ordering discipline detailed in the spatial joins and proximity filters patterns.

PROCESS_PARTITION = """
COPY (
    SELECT
        z.zone_id,
        ST_Area(ST_Union(p.geom))      AS covered_area_m2,   -- aggregate in-engine
        COUNT(DISTINCT p.parcel_id)    AS parcel_count
    FROM read_parquet(?) p             -- one partition's parcels
    JOIN read_parquet('zones/*.parquet') z
      ON z.geom && p.geom              -- cheap MBR pre-filter, R-tree assisted
     AND ST_Intersects(z.geom, p.geom) -- exact test only on survivors
    WHERE p.status = 'active'
      AND ST_IsValid(p.geom)           -- quarantine bad geometry before the join
    GROUP BY z.zone_id
) TO ? (FORMAT PARQUET, COMPRESSION ZSTD)
"""

def run_partition(con, src_glob: str, out_path: str) -> None:
    con.execute(PROCESS_PARTITION, [src_glob, out_path])

The driver loop owns idempotency. It reads a manifest of completed partitions, skips them, runs the rest, and records each success atomically — so a crash on partition 47 of 200 costs one partition’s work, not the whole run.

import json, os, pathlib

def run_pipeline(con, partitions: dict[str, str], out_dir: str, manifest: str):
    manifest_path = pathlib.Path(manifest)
    done = set()
    if manifest_path.exists():
        done = set(json.loads(manifest_path.read_text()))

    for key, src_glob in partitions.items():
        if key in done:
            continue                                   # idempotent re-run: skip completed
        out_path = f"{out_dir}/{key}.parquet"
        run_partition(con, src_glob, out_path)         # bounded working set per call
        done.add(key)
        # Checkpoint atomically: write a sibling temp file, then rename it over
        # the manifest so a crash never leaves the manifest half-written.
        tmp = manifest_path.with_name(manifest_path.name + ".tmp")
        tmp.write_text(json.dumps(sorted(done)))
        os.replace(tmp, manifest_path)

Geometry pre-simplification is the highest-leverage knob for keeping each partition under budget. Applying ST_SimplifyPreserveTopology(geom, tolerance) before a union or self-join cuts vertex counts — and therefore hash-table and intermediate-buffer size — often by more than half on dense polygon layers, at the cost of sub-tolerance positional accuracy. Apply it inside the partition query, not in Python.

For partitions that still overflow even one tile at a time, fall back to streaming the result to GeoPandas in bounded Arrow batches rather than materializing it whole; the zero-copy interchange and lazy Shapely rehydration are covered in DuckDB to GeoPandas sync. When independent partitions can run concurrently, give each its own connection and temp path and dispatch through the async execution patterns — DuckDB’s Python client is synchronous, so concurrency comes from running multiple connections on worker threads, never from sharing one connection across coroutines.
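The one-connection-per-worker rule can be sketched with a thread pool. This is an illustrative outline under stated assumptions: run_concurrently is a hypothetical helper, and the process callable is assumed to open, use, and close its own DuckDB connection with its own temp path, so nothing is shared between workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_concurrently(partitions: Dict[str, str],
                     process: Callable[[str, str], None],
                     max_workers: int = 2) -> Dict[str, BaseException]:
    """Run process(key, src_glob) for each partition on worker threads.

    ASSUMPTION: `process` creates its own connection and temp path.
    Returns a mapping of failed partition key -> raised exception.
    """
    failures: Dict[str, BaseException] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process, key, src): key
                   for key, src in partitions.items()}
        for fut in as_completed(futures):
            key = futures[fut]
            exc = fut.exception()
            if exc is not None:
                failures[key] = exc  # one bad partition does not stop the rest
    return failures
```

Returning the failures instead of raising on the first one keeps the restart story intact: successful partitions can be recorded in the manifest, and only the failed keys need another pass.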

Execution Plan Validation

A batch query that is correct on a small fixture can still regress catastrophically at scale if the planner picks the wrong join. Capture the plan with EXPLAIN (ANALYZE, FORMAT JSON) on a representative partition and inspect operator-level timing and row counts before committing the pipeline to a full run.

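import json

def analyze_partition(con, src_glob: str) -> dict:
    # Reuse the SELECT that sits inside the COPY ( ... ) TO wrapper.
    inner_select = PROCESS_PARTITION.split("COPY (")[1].split(") TO")[0]
    row = con.execute(
        "EXPLAIN (ANALYZE, FORMAT JSON) " + inner_select,
        [src_glob],
    ).fetchone()
    # EXPLAIN returns (explain_key, explain_value); the plan is the value.
    return json.loads(row[1])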

A correctly optimized plan shows a spatial join (an RTREE_INDEX_SCAN or a hash-style spatial join) feeding a HASH_GROUP_BY — never a CROSS_PRODUCT or a bare NESTED_LOOP_JOIN over geometry. The diagnostic thresholds that matter for batch work:

Pre-filter engaged. If the row count reaching ST_Intersects equals N × M (the full cross product), no bounding-box pre-filtering happened. Confirm both inputs are cast to GEOMETRY and that the && predicate sits in the ON clause without a nested scalar wrapper.
Row-estimate drift. When the planner’s estimated cardinality diverges from the measured count by more than ~10×, join order is likely wrong and an explosion is imminent on the larger partitions. Refresh statistics on the base tables.
Spill onset. If EXPLAIN ANALYZE runtime on a partition spikes more than ~300% versus a memory-resident baseline, intermediates are spilling. Reduce partition size, raise memory_limit, or pre-simplify geometry before the union.
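The first two thresholds can be checked mechanically once the plan is parsed. The sketch below walks a plan tree and reports forbidden operators and estimate drift. It is illustrative only: check_plan is a hypothetical helper, and the 'name', 'estimated', 'actual', and 'children' keys are assumptions about the node shape. Real EXPLAIN JSON key names differ between DuckDB versions, so adapt the lookups to the output you actually see.

```python
FORBIDDEN = {"CROSS_PRODUCT", "NESTED_LOOP_JOIN"}


def check_plan(node: dict, max_drift: float = 10.0) -> list:
    """Walk a plan tree and return human-readable findings.

    ASSUMPTION: each node is a dict with 'name', optional 'estimated'
    and 'actual' row counts, and a 'children' list.
    """
    findings = []
    name = node.get("name", "?")
    if name in FORBIDDEN:
        findings.append(f"{name}: unbounded join operator in plan")
    est, act = node.get("estimated"), node.get("actual")
    if est and act:  # skip nodes with missing or zero counts
        drift = max(est, act) / min(est, act)
        if drift > max_drift:
            findings.append(f"{name}: estimate off by {drift:.0f}x")
    for child in node.get("children", []):
        findings.extend(check_plan(child, max_drift))
    return findings
```

An empty list means the representative partition passed both checks. Any finding is a reason to stop before the full run.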

Diagnostic boundary: if spatial-join cardinality explodes beyond 10× the input row count, or collapses to zero matches, suspect a coordinate reference system mismatch: two layers in different units never satisfy a metric predicate cleanly. Verify alignment using the CRS mapping and transformations guidance and reproject both layers to a common projected frame before intersection.
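A cheap pre-flight check can catch the most common mismatch before any join runs. The sketch below compares the bounding boxes of two layers and is a heuristic, not a CRS validator: looks_geographic and assert_same_frame are hypothetical names, the extents are assumed to have been computed beforehand, and a projected layer that happens to sit near the origin would be misclassified as degrees.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def looks_geographic(bbox: BBox) -> bool:
    """Heuristic: every coordinate fits inside lon/lat degree ranges."""
    xmin, ymin, xmax, ymax = bbox
    return -180 <= xmin <= xmax <= 180 and -90 <= ymin <= ymax <= 90


def assert_same_frame(a: BBox, b: BBox) -> None:
    """Raise if one layer looks like degrees and the other does not,
    or if the two extents do not overlap at all."""
    if looks_geographic(a) != looks_geographic(b):
        raise ValueError(f"CRS mismatch suspected: {a} vs {b}")
    disjoint = a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]
    if disjoint:
        raise ValueError(f"Layer extents do not overlap: {a} vs {b}")
```

Running this once per pipeline start costs two extent lookups and turns a silent wrong answer into an immediate failure.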

Performance Trade-offs

The extraction door and the partitioning strategy dominate end-to-end batch latency far more than any single SQL tweak. The table quantifies the common variants on a 16-core host at memory_limit = '12GB' over a 200-million-row parcel layer.

Variant	Peak memory	Throughput	When to apply
One unpartitioned query over the whole dataset	Highest — spills hard	Slowest; often never finishes	Never for production; only on data that fits comfortably in RAM
Partition by natural key, COPY ... TO per partition	Bounded per partition	Best for pure ETL (file → file)	Default for derived-dataset builds
Partition + ST_SimplifyPreserveTopology pre-join	40–60% lower per partition	1.5–2× faster on dense polygons	Union/overlay work where sub-tolerance accuracy is acceptable
Stream partition result via fetch_record_batch	Capped at one batch	Slight per-batch overhead	Result feeds Python/GeoPandas rather than a file
Concurrent partitions, one connection each	concurrency × per-query peak	Scales until CPU/IO saturate	Many small partitions on a host with spare cores

Two scaling rules follow. First, raising SET threads past the physical core count reduces spatial-join throughput because R-tree lookups contend on shared cache lines — measure before exceeding physical_cores. Second, concurrency multiplies connections, not engine threads: running four concurrent partitions each at threads = 8 on an 8-core box oversubscribes the CPU 4× and inflates tail latency. Size concurrent partitions × per-query threads to the core count, and prefer many small partitions at modest thread counts over few large ones at high thread counts, since the former restart more cheaply on failure.
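The second rule is plain arithmetic, and a small helper makes the budget explicit. This is a sketch with hypothetical function names. It only encodes the rule stated above: keep concurrent partitions × per-query threads at or below the physical core count.

```python
def threads_per_connection(physical_cores: int, concurrency: int) -> int:
    """Largest per-connection thread count that keeps
    concurrency * threads <= physical_cores (minimum 1)."""
    if physical_cores < 1 or concurrency < 1:
        raise ValueError("physical_cores and concurrency must be >= 1")
    return max(1, physical_cores // concurrency)


def oversubscription(physical_cores: int, concurrency: int, threads: int) -> float:
    """Ratio of requested engine threads to physical cores."""
    return (concurrency * threads) / physical_cores
```

For the example in the text, oversubscription(8, 4, 8) evaluates to 4.0, and threads_per_connection(8, 4) suggests threads = 2 per connection instead.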

When the per-partition aggregation is itself the bottleneck — large grouped rollups, distance summaries, or windowed metrics — push the heavy lifting into vectorized aggregations rather than rehydrating geometry into Python and looping.

Edge Cases & Anti-Patterns

The failure modes below recur in batch spatial code; each has a minimal fix.

Tuning to the median partition. Sizing memory_limit and partition granularity to the average tile guarantees an OOM on the densest one. Fix: profile the largest partition first and budget for it, or split partitions adaptively by row count rather than by a fixed key.
Non-idempotent writes. Re-running a failed job appends to or corrupts partial output. Fix: write each partition to its own file and record completion in a manifest; treat the manifest as the source of truth on restart.
WHERE-only spatial predicates. Placing ST_Intersects solely in the WHERE clause can defeat index routing. The bounding-box && predicate belongs in the ON clause so the planner pushes it into the R-tree — the same rule that governs point-in-polygon optimization.
Invalid geometry aborting a partition. A single malformed WKB blob makes ST_Union or a Shapely rehydrate raise mid-partition and lose the whole unit’s work. Fix: guard with WHERE ST_IsValid(geom) (or repair with ST_MakeValid) inside the query and quarantine offenders to a side table for later inspection.
CRS drift between layers. No error is raised when two layers disagree on units; the join simply returns wrong areas or empty results. Fix: reproject both layers to a common projected CRS before any metric operation, and assert coordinate ranges in a pre-flight check.
Fetching geometry as Python tuples. Ending a partition with fetchall() triples resident memory by materializing every geometry as a bytes object. Fix: write straight to GeoParquet with COPY ... TO, or stay on the Arrow door via fetch_arrow_table().
GeoJSON as the batch source. Row-oriented GeoJSON ingestion re-parses to WKB on every run and never pushes predicates down. Fix: normalize to GeoParquet once, then read the columnar copy thereafter.
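The adaptive split suggested in the first item can be sketched as a greedy packer over per-file row counts. This is one possible approach, not the only one: pack_by_row_count is a hypothetical helper, and the row counts are assumed to be known already, for example from Parquet footers or a catalog.

```python
from typing import Dict, List


def pack_by_row_count(file_rows: Dict[str, int], max_rows: int) -> List[List[str]]:
    """Greedily group files into partitions of at most max_rows rows.

    A single file larger than max_rows becomes its own partition and
    should be flagged for further splitting upstream.
    """
    partitions: List[List[str]] = []
    current: List[str] = []
    current_rows = 0
    # Largest first, so the worst case shows up in the first partition.
    for path, rows in sorted(file_rows.items(), key=lambda kv: (-kv[1], kv[0])):
        if current and current_rows + rows > max_rows:
            partitions.append(current)
            current, current_rows = [], 0
        current.append(path)
        current_rows += rows
    if current:
        partitions.append(current)
    return partitions
```

Sorting largest first means the densest unit is processed, and profiled, before anything else, which matches the advice to budget for the worst partition rather than the median.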

Query Regression Analysis

Batch pipelines regress silently: a statistics change, a schema edit, or a new data distribution flips a plan from index scan to nested loop, and the nightly run that took twenty minutes now runs for six hours without error. Capture a plan fingerprint per partition query and diff it against a committed baseline in CI to catch the regression before it reaches the scheduler.

import hashlib, json

def plan_fingerprint(con, sql: str, params: list) -> dict:
    """Operator skeleton + measured timing for one partition query."""
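    row = con.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {sql}", params).fetchone()
    # EXPLAIN returns (explain_key, explain_value); the plan is the value.
    plan = json.loads(row[1])
    if isinstance(plan, list):          # the root may be wrapped in a list
        plan = plan[0]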

    def operators(node, acc):
        acc.append(node.get("name", node.get("operator_type", "?")))
        for child in node.get("children", []):
            operators(child, acc)
        return acc

    ops = operators(plan, [])
    return {
        "op_skeleton": ops,                                       # ordered operators
        "skeleton_hash": hashlib.sha1("|".join(ops).encode()).hexdigest(),
        "rows": plan.get("cardinality"),
        "seconds": plan.get("timing"),
    }

def assert_no_regression(current: dict, baseline: dict, slack: float = 1.5):
    # Structural: an operator changed (e.g. RTREE scan -> nested loop).
    assert current["skeleton_hash"] == baseline["skeleton_hash"], (
        f"Plan shape changed:\n  was {baseline['op_skeleton']}\n  now {current['op_skeleton']}"
    )
    # Timing: tolerate noise up to `slack`x the baseline runtime.
    assert current["seconds"] <= baseline["seconds"] * slack, (
        f"Latency regressed: {current['seconds']:.3f}s vs baseline {baseline['seconds']:.3f}s"
    )

Store the baseline fingerprint for a representative partition alongside the query in the repo, regenerate current on every CI run against the same fixture, and fail the build when the operator skeleton changes or runtime exceeds the slack multiplier. A skeleton hash flip from RTREE_INDEX_SCAN to NESTED_LOOP_JOIN is the canonical early warning that a spatial index was dropped or a CRS mismatch defeated the bounding-box pre-filter — exactly the silent degradation that turns a bounded nightly job into one that misses its window.

Pair the plan gate with two runtime guards baked into the pipeline itself: a spill-ratio check (abort and re-partition if spill time exceeds 25% of total runtime) and a validity gate (abort and quarantine if more than 5% of a partition’s geometries fail ST_IsValid, since topology errors cascade into wrong join results). Together these bound both plan-level and data-level regressions before a bad run ever reaches production.
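Both runtime guards reduce to ratio checks against the thresholds above. The sketch below assumes the caller already measures spill time and counts invalid geometries per partition. How those numbers are collected is left open, and PartitionAbort and check_runtime_guards are hypothetical names.

```python
class PartitionAbort(RuntimeError):
    """Raised when a partition breaches a runtime guard."""


def check_runtime_guards(spill_seconds: float, total_seconds: float,
                         invalid_rows: int, total_rows: int,
                         max_spill_ratio: float = 0.25,
                         max_invalid_ratio: float = 0.05) -> None:
    """Abort when spill time or invalid-geometry share exceeds its budget."""
    if total_seconds > 0 and spill_seconds / total_seconds > max_spill_ratio:
        raise PartitionAbort("spill ratio exceeded: re-partition")
    if total_rows > 0 and invalid_rows / total_rows > max_invalid_ratio:
        raise PartitionAbort("too many invalid geometries: quarantine")
```

Calling this after each partition, before the manifest is updated, keeps a breaching partition out of the completed set so the next run retries it with a smaller unit of work.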

See also

Async execution patterns — running independent partitions concurrently without blocking the event loop, each with an isolated connection and temp path.
DuckDB to GeoPandas sync — streaming a partition result into a GeoDataFrame over zero-copy Arrow when the output feeds Python rather than a file.
Shapely integration — when to offload per-feature topology work to Python objects and when to keep it in SQL.
Spatial joins & proximity filters — the ON-clause predicate discipline that keeps each partition’s join index-routed.
In-memory vs disk storage — how the buffer manager and spill path shape per-partition throughput.
