Async Execution Patterns for DuckDB Spatial Workloads in Python

Production geospatial workloads rarely conform to synchronous, monolithic query patterns. When routing spatial joins, coordinate transformations, and large-scale geometry aggregations through DuckDB from Python, blocking execution stalls the event loop, inflates resident memory, and starves downstream consumers. This reference covers how to dispatch DuckDB Spatial queries without blocking the asyncio event loop, stream Arrow batches back under backpressure, and validate that the work actually parallelizes. It is part of the broader Python & DuckDB integration workflows guide, and it assumes the orchestrate-vs-compute split established there: Python schedules and routes, DuckDB computes, and geometry crosses the boundary as zero-copy Arrow buffers rather than per-row Python objects.

Async dispatch keeps the event loop responsive: the blocking DuckDB call runs on a worker thread and streams Arrow batches back to the consumer.

The single most important fact to internalize first: DuckDB’s Python client is synchronous and there is no native async query API. Every await-shaped pattern on this page is built by handing the blocking call to a worker thread via asyncio.to_thread() (or a ThreadPoolExecutor) so the event loop stays free to service other coroutines. The engine’s own parallelism comes from its internal thread pool, configured with SET threads; the asyncio layer exists only to keep Python responsive, not to make a single query faster.
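A minimal sketch of that claim, using only the standard library. Here `time.sleep` stands in for the blocking DuckDB call (an assumption, so the example runs without a database), and the hypothetical `heartbeat` coroutine counts how often the event loop regains control while the worker thread is busy.

```python
import asyncio
import time


def blocking_query(seconds: float) -> str:
    # Stand-in for a blocking conn.execute(...) call; it blocks its own thread.
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    # Appends one tick per loop turn; it only advances while the loop is free.
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(0.01)


async def main() -> tuple:
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    # The blocking call runs on a worker thread; the loop keeps ticking.
    result = await asyncio.to_thread(blocking_query, 0.2)
    stop.set()
    await beat
    return result, len(ticks)


result, tick_count = asyncio.run(main())
```

If `blocking_query` were called directly inside `main` instead of through `asyncio.to_thread`, the heartbeat would not run at all until the call returned.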

Runtime Configuration & Memory Guardrails

A DuckDB query call blocks the calling Python thread until the result is ready, even though the engine fans the work out across its own thread pool, so true non-blocking behavior requires dispatching each query onto an executor. For spatial workloads the engine’s thread count must be balanced against memory pressure and NVMe I/O bandwidth before any async wrapper is added. Over-provisioning threads on geometry-heavy scans triggers CPU cache thrashing, increases lock contention on R-tree lookups, and frequently precipitates out-of-memory (OOM) termination. The settings below form the minimal reproducible session for the patterns on this page.

import asyncio
import duckdb

def open_spatial_connection() -> duckdb.DuckDBPyConnection:
    conn = duckdb.connect(":memory:")
    conn.execute("INSTALL spatial; LOAD spatial;")

    # threads = engine-internal parallelism, NOT asyncio concurrency.
    # 8 is a sane baseline for a 16-core host on heavy geometry ops;
    # exceeding physical cores yields diminishing returns from cache contention.
    conn.execute("SET threads = 8;")

    # Hard ceiling. When breached, DuckDB spills intermediates to temp_directory
    # instead of being OOM-killed. Set to ~70% of the container's memory budget.
    conn.execute("SET memory_limit = '4GB';")

    # Spill target. Must live on fast storage (see trade-off below).
    conn.execute("SET temp_directory = '/tmp/duckdb_spatial_spill';")

    # Unlocks parallel hash aggregation by dropping the row-order barrier.
    # Trade-off: result row order is no longer stable — rely on ORDER BY.
    conn.execute("SET preserve_insertion_order = false;")
    return conn

Two configuration trade-offs dominate async spatial work. First, the spill path: when memory_limit is approached, DuckDB transitions intermediate results to temporary files, and uncontrolled spilling degrades throughput by 10–50× depending on storage IOPS. The safe boundary is a temp_directory on NVMe with greater than 2000 MB/s sequential write throughput; on HDD-backed hosts, drop to SET threads = 4 and chunk queries with windowed predicates to avoid spill-induced I/O starvation. Second, connection topology: a DuckDB connection is not safe to share across concurrent coroutines that each dispatch to different threads. Give each worker its own connection, or serialize access behind an asyncio.Lock. The decision between an in-memory database and a persisted file changes the spill and cache behavior materially — see in-memory vs disk storage for how the buffer manager treats each. A practical diagnostic: if SET threads = N leaves more than 15% CPU idle during spatial joins, reduce N toward ceil(physical_cores / 2) to mitigate cache thrashing.
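One way to serialize access to a shared connection is a thin wrapper that holds an asyncio.Lock for the duration of each dispatched call. The sketch below is illustrative only: `LockedConnection` and `FakeConnection` are hypothetical names, and the fake stands in for a real DuckDB connection so the example runs without the database installed.

```python
import asyncio
import threading
import time


class FakeConnection:
    """Stand-in for a DuckDB connection; records overlapping execute() calls."""

    def __init__(self) -> None:
        self._active = 0
        self.max_active = 0
        self._guard = threading.Lock()

    def execute(self, query: str) -> str:
        with self._guard:
            self._active += 1
            self.max_active = max(self.max_active, self._active)
        time.sleep(0.02)  # simulate blocking engine work
        with self._guard:
            self._active -= 1
        return f"ran: {query}"


class LockedConnection:
    """Serializes execute() so only one worker thread touches the connection."""

    def __init__(self, conn) -> None:
        self._conn = conn
        self._lock = asyncio.Lock()

    async def execute(self, query: str):
        async with self._lock:
            # The lock stays held until the worker thread returns.
            return await asyncio.to_thread(self._conn.execute, query)


async def main() -> tuple:
    fake = FakeConnection()
    shared = LockedConnection(fake)
    results = await asyncio.gather(*(shared.execute(f"q{i}") for i in range(5)))
    return list(results), fake.max_active


results, max_active = asyncio.run(main())
```

The lock trades throughput for safety, because queries on the shared connection run strictly one at a time. Giving each worker its own connection avoids that bottleneck at the cost of more memory.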

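Note: max_memory is an accepted alias for memory_limit in DuckDB, so both names set the same ceiling; pick one (this page uses memory_limit) and use it consistently. A misspelled setting name is rejected with an error rather than silently ignored, so treat a failed SET at startup as fatal instead of letting the workload run unbounded.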
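The sizing rules of thumb in this section can be turned into a small helper. This is a sketch under the assumptions stated in the text (memory_limit at roughly 70% of the container budget, threads at ceil(physical_cores / 2) as a conservative start); `guardrail_settings` is a hypothetical name, and the numbers should be validated against your own workload.

```python
import math


def guardrail_settings(container_bytes: int, physical_cores: int) -> list:
    """Derive SET statements from a memory budget and a core count."""
    # Integer arithmetic avoids float rounding when taking 70% of the budget.
    limit_mb = container_bytes * 70 // 100 // 1_000_000
    threads = max(1, math.ceil(physical_cores / 2))
    return [
        f"SET threads = {threads};",
        f"SET memory_limit = '{limit_mb}MB';",
    ]


settings = guardrail_settings(container_bytes=8_000_000_000, physical_cores=16)
```

For a 16-core host with an 8 GB budget this yields threads = 8, matching the baseline used in the session above, and a memory_limit of 5600MB.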

Primary Execution Pattern: Dispatch, Stream, Materialize

The canonical async spatial pattern is three-staged: dispatch the blocking query onto a worker thread, stream the result as Arrow RecordBatch chunks to bound peak memory, and materialize into GeoPandas only at the final consumer. Materializing a large spatial result in one synchronous call blocks the loop and risks heap fragmentation; streaming preserves async responsiveness while keeping geometry on the zero-copy Arrow path described in the parent guide.

async def run_spatial_async(
    conn: duckdb.DuckDBPyConnection,
) -> duckdb.DuckDBPyConnection:  # execute() returns the connection itself
    query = """
        SELECT
            a.id,
            ST_AsWKB(a.geom) AS geom_wkb,
            b.zone_name
        FROM spatial_points a
        JOIN spatial_zones b
          ON b.geom && a.geom            -- cheap MBR pre-filter (R-tree assisted)
         AND ST_Contains(b.geom, a.geom) -- exact topology test on survivors
        WHERE a.ts > '2024-01-01'
    """
    # The execute() call blocks; to_thread hands it to the default executor
    # so the event loop keeps servicing other coroutines while DuckDB runs.
    return await asyncio.to_thread(conn.execute, query)

The && bounding-box operator in the ON clause is what lets the planner route through an R-tree before exact evaluation; the same predicate-ordering discipline is detailed in the modern spatial SQL query patterns guide and is the difference between an index scan and a Cartesian blow-up. With the query dispatched, stream the result rather than calling fetchall():

import geopandas as gpd
import pandas as pd
from shapely import wkb

async def stream_to_geodataframe(
    result: duckdb.DuckDBPyConnection,
    chunk_size: int = 50_000,
) -> gpd.GeoDataFrame:
    # `result` is what conn.execute() returned: the connection holding the
    # pending result set. fetch_record_batch keeps geometry as an Arrow binary
    # column and bounds peak memory to one batch at a time.
    reader = await asyncio.to_thread(result.fetch_record_batch, chunk_size)

    chunks: list[gpd.GeoDataFrame] = []
    while True:
        # Each pull blocks on the engine, so keep it off the event loop.
        # next(reader, None) absorbs StopIteration inside the worker thread;
        # a StopIteration raised there cannot be delivered through the awaited
        # future, so an `except StopIteration` around the await never fires.
        batch = await asyncio.to_thread(next, reader, None)
        if batch is None:
            break
        df = batch.to_pandas()
        # Per-row decode; shapely.from_wkb(array) is the vectorized alternative.
        df["geometry"] = df["geom_wkb"].map(wkb.loads)
        chunks.append(
            gpd.GeoDataFrame(df.drop(columns="geom_wkb"),
                             geometry="geometry", crs="EPSG:4326")
        )
    if not chunks:
        # pd.concat raises on an empty list; return an empty frame instead.
        return gpd.GeoDataFrame({"geometry": []}, geometry="geometry", crs="EPSG:4326")
    return pd.concat(chunks, ignore_index=True)

WKB-to-Shapely rehydration is the only unavoidable Python-side cost; defer it to the last possible stage. The full set of zero-copy interchange options — and when to skip Shapely entirely — is documented in DuckDB to GeoPandas sync.

Backpressure & stream orchestration

When a downstream transform is slower than the engine produces batches, an unbounded buffer grows until the process is OOM-killed. Decouple producer and consumer with a bounded asyncio.Queue so the producer naturally blocks when the consumer falls behind. This is the backpressure edge shown in the sequence diagram above.

async def producer(relation, queue: asyncio.Queue, chunk_size: int = 50_000):
    reader = await asyncio.to_thread(relation.fetch_record_batch, chunk_size)
    while True:
        # next(reader, None) returns None at end of stream instead of raising
        # StopIteration, which cannot cross the to_thread future boundary.
        batch = await asyncio.to_thread(next, reader, None)
        if batch is None:
            break
        await queue.put(batch)   # awaits (applies backpressure) when queue is full
    await queue.put(None)        # sentinel: signals completion

async def consumer(queue: asyncio.Queue):
    while (batch := await queue.get()) is not None:
        await handle(batch)      # downstream transform / write
        queue.task_done()

async def pipeline(relation):
    # maxsize caps in-flight batches; tune to (memory_limit budget / batch size).
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    await asyncio.gather(producer(relation, queue), consumer(queue))

For time-series telemetry, bound intermediate state further with sliding-window predicates (WHERE ts BETWEEN ? AND ?) so each dispatched query touches a fixed slice rather than an ever-growing history. When several independent jobs run together, this same pattern composes into the batch processing pipelines workflow, where each job owns an isolated connection and temp path.
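A sketch of how such windows might be generated. It uses half-open intervals (`ts >= ? AND ts < ?`) rather than BETWEEN, because BETWEEN is inclusive at both ends and adjacent windows would otherwise both include rows that fall exactly on a boundary. `time_windows` and `WINDOW_QUERY` are hypothetical names; the table and column follow the earlier example.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def time_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [lo, hi) windows that cover [start, end) without overlap."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # the last window may be shorter than step
        yield lo, hi
        lo = hi


# Hypothetical query text; each (lo, hi) pair supplies the two parameters.
WINDOW_QUERY = "SELECT id FROM spatial_points WHERE ts >= ? AND ts < ?"

windows = list(
    time_windows(datetime(2024, 1, 1), datetime(2024, 1, 1, 5), timedelta(hours=2))
)
```

Each pair would then be dispatched as its own query, for example through `asyncio.to_thread(conn.execute, WINDOW_QUERY, [lo, hi])`, so that peak intermediate state is bounded by the window size.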

Execution Plan Validation

Async dispatch does not create parallelism — it only prevents the event loop from blocking. Whether the spatial query itself parallelizes is a property of the plan, and the planner only parallelizes spatial operations when predicates are deterministic and an R-tree is available. Capture the plan with EXPLAIN (ANALYZE, FORMAT JSON) and inspect operator-level timing and thread utilization.

import json

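def analyze_spatial_plan(conn: duckdb.DuckDBPyConnection, query: str) -> dict:
    # EXPLAIN returns a single (explain_key, explain_value) row; the JSON text
    # is in the last column, not the first.
    row = conn.execute(
        f"EXPLAIN (ANALYZE, FORMAT JSON) {query}"
    ).fetchone()
    return json.loads(row[-1])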

A correctly optimized spatial join plan shows an RTREE_INDEX_SCAN or a spatial join operator feeding a HASH_GROUP_BY — never a CROSS_PRODUCT or a bare nested-loop join over geometry. The diagnostic thresholds that matter:

Thread under-utilization. If the plan reports fewer active threads than SET threads, a serial bottleneck exists — usually a non-deterministic spatial function (e.g. a random-seeded simplification) or a missing partition key that prevents the planner from splitting the scan.
Row-estimate drift. When the planner’s estimated cardinality diverges from the measured row count by more than ~10×, the join order is likely wrong and an explosion is imminent. Re-run ANALYZE on the base tables to refresh statistics.
Spill onset. If EXPLAIN ANALYZE execution time spikes by more than ~300% versus a memory-resident baseline, intermediates are spilling. Reduce chunk_size, raise memory_limit, or pre-cluster the input with ORDER BY ST_XMin(geom), ST_YMin(geom) so geometry density is even across partitions.
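Two of these checks can be automated over the parsed plan. The sketch below assumes the plan is a nested dict with a `children` list and an operator name under `name` or `operator_type`; those key names, the banned-operator set, and the sample plan are assumptions for illustration, not real DuckDB output.

```python
from typing import Iterator

# Operator names treated as red flags; adjust to what your plans actually emit.
BANNED_OPERATORS = {"CROSS_PRODUCT", "NESTED_LOOP_JOIN"}


def walk(node: dict) -> Iterator[dict]:
    """Depth-first traversal over a plan tree with a `children` list."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def banned_operators(plan: dict) -> list:
    """Return banned operator names found anywhere in the plan."""
    found = []
    for node in walk(plan):
        name = node.get("name") or node.get("operator_type") or ""
        if name in BANNED_OPERATORS:
            found.append(name)
    return found


def estimate_drift(estimated: int, actual: int) -> float:
    """Ratio >= 1 between estimated and measured rows; inf if exactly one is zero."""
    lo, hi = sorted((max(estimated, 0), max(actual, 0)))
    if hi == 0:
        return 1.0
    return float("inf") if lo == 0 else hi / lo


# Hand-written plan for illustration only.
sample_plan = {
    "name": "PROJECTION",
    "children": [
        {"name": "NESTED_LOOP_JOIN", "children": [
            {"name": "SEQ_SCAN", "children": []},
            {"name": "SEQ_SCAN", "children": []},
        ]},
    ],
}

flags = banned_operators(sample_plan)
drift = estimate_drift(estimated=1_000, actual=25_000)
```

A drift of 25 exceeds the roughly 10× threshold above, so this plan would be flagged for both the join operator and the cardinality estimate.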

Performance Trade-offs

The extraction door you choose dominates end-to-end latency far more than the asyncio wrapper does. The table below quantifies the common variants for a spatial result on a 16-core host with memory_limit = '4GB'.

Variant	Peak memory	Throughput	When to apply
`fetchall()` then build GeoDataFrame	Highest (full result + tuples)	Baseline	Small results (< 100k rows) where simplicity wins
`fetch_record_batch(chunk)` streaming	~60% lower peak vs `fetchall()`	Slight per-batch overhead	Default for results that approach `memory_limit`
`fetch_arrow_table()` + `from_arrow`	Low, single conversion	Fastest for large sets	Results > 10M rows; bypasses row-by-row Shapely instantiation
Bounded `asyncio.Queue` pipeline	Capped at `maxsize × batch`	Matches slowest stage	Slow downstream consumer; live streaming

Two scaling rules follow from this. First, raising SET threads beyond the physical core count typically reduces spatial-join throughput because R-tree lookups contend on shared cache lines — measure before exceeding physical_cores. Second, asyncio concurrency multiplies connections, not engine threads: running four concurrent queries each with threads = 8 on an 8-core box oversubscribes the CPU 4× and increases tail latency. Size total concurrent queries × per-query threads to the core count. For workloads above 10M rows, prefer the Arrow-table door and rehydrate Shapely lazily only on the rows a consumer actually inspects.
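The second rule is simple arithmetic, sketched below with two hypothetical helpers. It reproduces the 4× oversubscription from the example in the text and the per-query budget that would avoid it.

```python
def per_query_threads(physical_cores: int, concurrent_queries: int) -> int:
    """Largest per-query thread count that keeps the product within the cores."""
    if physical_cores < 1 or concurrent_queries < 1:
        raise ValueError("both arguments must be >= 1")
    return max(1, physical_cores // concurrent_queries)


def oversubscription(physical_cores: int, concurrent_queries: int, threads: int) -> float:
    """How many engine threads compete for each physical core."""
    return concurrent_queries * threads / physical_cores


factor = oversubscription(physical_cores=8, concurrent_queries=4, threads=8)
budget = per_query_threads(physical_cores=8, concurrent_queries=4)
```

On the 8-core box, four concurrent queries would each get a budget of two engine threads.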

Edge Cases & Anti-Patterns

The failure modes below recur in async spatial code and each has a minimal fix.

Sharing one connection across coroutines. A single DuckDBPyConnection driven by concurrent to_thread dispatches corrupts internal state and raises opaque errors. Fix: one connection per worker, or use conn.cursor() to get an isolated child connection per coroutine, or guard a shared connection with an asyncio.Lock.
Blocking the loop with a synchronous fetch. Wrapping execute() in to_thread but then calling relation.fetchall() directly on the event loop re-introduces the stall you just removed. Fix: dispatch the fetch through to_thread as well, or stream batches.
Ignoring cancellation mid-scan. When a coroutine is cancelled, the work already handed to a thread keeps running and holding memory because to_thread calls are not cancellable. Fix: gate long scans behind a LIMIT/window, or run them on a ThreadPoolExecutor you can drain on shutdown, and call conn.interrupt() to abort the in-flight query.
WHERE-only spatial predicates. Placing ST_Contains solely in the WHERE clause of a join can prevent index routing; the bounding-box predicate belongs in the ON clause so the planner can push it into the R-tree. This predicate-placement rule is the same one that governs point-in-polygon optimization.
Unbounded result buffering. await asyncio.gather(*[run(q) for q in many_queries]) materializes every result simultaneously. Fix: cap fan-out with an asyncio.Semaphore sized to your memory budget.
Invalid geometry crashing the rehydrate. A malformed WKB blob makes shapely.wkb.loads raise mid-stream and abort the whole pipeline. Fix: filter with WHERE ST_IsValid(geom) (or repair with ST_MakeValid) inside the SQL before extraction, quarantining the offenders.
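The fan-out cap mentioned above can be sketched with an asyncio.Semaphore. The blocking job here is a stand-in for a DuckDB query (an assumption, so the example runs without a database), and the module-level counters exist only to show that concurrency stays within the limit.

```python
import asyncio
import threading
import time

_active = 0
_peak = 0
_guard = threading.Lock()


def blocking_job(job_id: int) -> int:
    # Stand-in for a blocking query; tracks peak overlap for the demo.
    global _active, _peak
    with _guard:
        _active += 1
        _peak = max(_peak, _active)
    time.sleep(0.02)
    with _guard:
        _active -= 1
    return job_id * 2


async def run_bounded(job_ids: list, limit: int) -> list:
    sem = asyncio.Semaphore(limit)

    async def one(job_id: int) -> int:
        async with sem:  # at most `limit` jobs hold a slot at once
            return await asyncio.to_thread(blocking_job, job_id)

    return await asyncio.gather(*(one(j) for j in job_ids))


outputs = asyncio.run(run_bounded(list(range(8)), limit=2))
peak = _peak
```

The semaphore bounds how many queries run at once, but `gather` still collects every return value. To bound memory as well, each job should reduce or write out its result before releasing its slot.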

Query Regression Analysis

Async spatial pipelines regress silently: a statistics change or a schema edit flips a plan from index scan to nested loop, and latency degrades without any error. Capture a plan fingerprint per query and diff it against a stored baseline in CI to catch the regression before production.

import hashlib
import json

def plan_fingerprint(conn: duckdb.DuckDBPyConnection, query: str) -> dict:
    """Capture the operator skeleton and measured timing for one query."""
    # The JSON text sits in the last column of the single EXPLAIN row.
    plan = json.loads(
        conn.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {query}").fetchone()[-1]
    )

    def operators(node, acc):
        acc.append(node.get("name", node.get("operator_type", "?")))
        for child in node.get("children", []):
            operators(child, acc)
        return acc

    ops = operators(plan, [])
    return {
        "op_skeleton": ops,                                   # ordered operator list
        "skeleton_hash": hashlib.sha1("|".join(ops).encode()).hexdigest(),
        # Top-level key names vary across DuckDB versions; the fallbacks are a
        # best guess, so confirm them against the JSON your version emits.
        "rows": plan.get("cardinality", plan.get("rows_returned")),
        "seconds": plan.get("timing", plan.get("latency")),
    }

def assert_no_regression(current: dict, baseline: dict, slack: float = 1.5):
    # Structural regression: an operator changed (e.g. RTREE scan -> nested loop).
    assert current["skeleton_hash"] == baseline["skeleton_hash"], (
        f"Plan shape changed:\n  was {baseline['op_skeleton']}\n  now {current['op_skeleton']}"
    )
    # Timing regression: tolerate noise up to `slack`x the baseline runtime.
    assert current["seconds"] <= baseline["seconds"] * slack, (
        f"Latency regressed: {current['seconds']:.3f}s vs baseline {baseline['seconds']:.3f}s"
    )

Store the baseline fingerprint alongside the query (committed to the repo), regenerate the current fingerprint on every CI run against a representative fixture, and fail the build when the operator skeleton changes or runtime exceeds the slack multiplier. A skeleton hash flip from an RTREE_INDEX_SCAN to a NESTED_LOOP_JOIN is the canonical early warning that a spatial index was dropped or a CRS mismatch defeated the bounding-box pre-filter. Pair this with per-batch timing captured in the streaming stage to bound both plan-level and extraction-level regressions in one gate.
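One possible shape for the stored baseline is a small JSON file per query. The sketch below is an assumption about layout, not a prescribed format: the helper names and the file name are hypothetical, and only the `skeleton_hash` field from the fingerprint is compared.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_baseline(path: Path, fingerprint: dict) -> None:
    """Write a fingerprint as stable, diff-friendly JSON."""
    path.write_text(json.dumps(fingerprint, indent=2, sort_keys=True) + "\n")


def load_baseline(path: Path) -> Optional[dict]:
    """Return the stored fingerprint, or None when no baseline exists yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text())


def check_or_record(path: Path, current: dict) -> str:
    """First run records the baseline; later runs compare the plan shape."""
    baseline = load_baseline(path)
    if baseline is None:
        save_baseline(path, current)
        return "recorded"
    if baseline["skeleton_hash"] != current["skeleton_hash"]:
        return "shape-changed"
    return "ok"


# Demo in a temporary directory; in CI the file would live in the repo.
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "points_in_zones.plan.json"
    first = check_or_record(baseline_path, {"skeleton_hash": "abc"})
    second = check_or_record(baseline_path, {"skeleton_hash": "abc"})
    third = check_or_record(baseline_path, {"skeleton_hash": "def"})
```

Sorted keys and fixed indentation keep the committed file stable, so a plan change shows up as a readable diff in review.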

See also

Running async spatial queries in Python — connection-lifecycle and cancellation deep dive for the patterns above.
DuckDB to GeoPandas sync — zero-copy Arrow interchange and lazy Shapely rehydration at the consumer.
Batch processing pipelines — composing isolated async jobs with bounded memory and spill control.
Shapely integration — when to offload topology work to Python objects and when not to.
In-memory vs disk storage — how the buffer manager and spill path shape async throughput.
