Running Async Spatial Queries in Python

Concurrent DuckDB Spatial queries stall a Python service when a blocking execute() freezes the event loop and an unbounded fetchall() materializes a whole geometry result set into the heap at once. This walkthrough sits under Async Execution Patterns and isolates one failure shape — many spatial queries dispatched through asyncio against the thread-unsafe DuckDB client — then gives the connection-isolation rules, the Arrow streaming rewrite, plan validation, and fallback routing for invalid geometry and out-of-memory (OOM) conditions.

Root-Cause Analysis

DuckDB’s Python client has no native async API, so every naive integration trips one of four distinct failure modes. Each has a different cause and a different fix, so name the symptom before changing code.

Event-loop blocking. Calling conn.execute(query) directly inside a coroutine runs the entire spatial scan on the event-loop thread. Every other task (health checks, request handlers, batch yields) is frozen until the query returns. The fix is to dispatch the blocking call into a worker thread with asyncio.to_thread.
Shared-connection corruption. A single duckdb.DuckDBPyConnection is not safe to use from several threads at once. Sharing one connection across concurrent tasks can raise errors or, worse, return silently wrong results when one task's execute() replaces the pending result another task is about to fetch. Each task needs its own connection or a conn.cursor() clone scoped to that task.
Unbounded result materialization. fetchall() and fetchdf() pull the full result into CPython memory before the first row is usable. A wide geometry column over millions of rows spikes resident memory and triggers an OOM kill. Streaming fetch_record_batch() caps residency to one batch at a time.
GIL contention on deserialization. Converting WKB (well-known binary) blobs to Shapely objects row-by-row on the main thread holds the GIL and re-starves the loop you just unblocked. The representation trade-off behind that WKB payload is covered in the ST_Geometry vs WKB reference; the fix here is vectorized shapely.from_wkb dispatched off-thread.
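
The first failure mode is easy to reproduce without DuckDB at all. The sketch below is a minimal illustration rather than service code: blocking_scan is a hypothetical stand-in for conn.execute(query) built on time.sleep, and a ticker task counts how often the event loop gets to run while the scan is in flight.

```python
import asyncio
import time


def blocking_scan(seconds: float) -> str:
    # Hypothetical stand-in for conn.execute(query): holds the calling thread.
    time.sleep(seconds)
    return "done"


async def count_ticks(stop: asyncio.Event, interval: float = 0.01) -> int:
    # Counts how many times the event loop managed to wake this task.
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(interval)
        ticks += 1
    return ticks


async def measure(offload: bool, seconds: float = 0.2) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # let the ticker reach its first sleep

    if offload:
        # Worker thread: the loop keeps servicing the ticker.
        await asyncio.to_thread(blocking_scan, seconds)
    else:
        # Event-loop thread: nothing else runs until this returns.
        blocking_scan(seconds)

    stop.set()
    return await ticker


if __name__ == "__main__":
    print("ticks while blocking the loop:", asyncio.run(measure(offload=False)))
    print("ticks with asyncio.to_thread: ", asyncio.run(measure(offload=True)))
```

Run directly on the loop, the ticker should advance at most once; dispatched through asyncio.to_thread, it keeps ticking for the whole duration of the scan.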

Whether the working set fits in RAM at all is governed by the in-memory vs disk storage boundaries for your dataset — set those limits before tuning concurrency.

Deterministic Configuration

Pin memory and threads per connection so concurrent tasks cannot collectively oversubscribe the box. Confirm the prerequisites first:

duckdb and the spatial extension available in the runtime
shapely (2.x, vectorized API) and pyarrow installed for off-thread parsing
A per-connection memory_limit chosen so limit × max_concurrent_tasks stays below total RAM
Inputs share one CRS before any proximity predicate runs (see CRS mapping & transformations)

import duckdb

def open_spatial_connection() -> duckdb.DuckDBPyConnection:
    conn = duckdb.connect(":memory:")
    conn.execute("INSTALL spatial; LOAD spatial;")

    # Cap per-connection RAM: N concurrent tasks each hold their own limit,
    # so size this as total_ram / max_tasks to avoid collective OOM.
    conn.execute("SET memory_limit = '4GB';")

    # Bound per-connection threads: DuckDB parallelizes internally, so
    # oversubscribing across many async tasks thrashes the CPU scheduler.
    conn.execute("SET threads = 4;")

    # Quiet the progress bar so EXPLAIN ANALYZE timings are not skewed by render overhead.
    conn.execute("SET enable_progress_bar = false;")
    return conn

The single most important number here is memory_limit × concurrency. DuckDB enforces the limit per database instance, not per process, and every duckdb.connect(":memory:") call opens a separate instance, so eight tasks at 4 GB each can demand 32 GB that the engine never coordinates. Treat concurrency as a global budget divided across tasks, not a free dial.
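
The budget arithmetic can be made explicit before any connection is opened. The helper below is a sketch under stated assumptions: per_task_memory_limit and its reserve_fraction parameter are hypothetical names, and the 25% reserve for the Python heap and the operating system is an illustrative default, not a recommendation from DuckDB.

```python
def per_task_memory_limit(
    total_ram_bytes: int,
    max_tasks: int,
    reserve_fraction: float = 0.25,
) -> str:
    """Return a memory_limit string sized for one of max_tasks connections."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")

    # Hold back a share of RAM for everything that is not DuckDB.
    usable = int(total_ram_bytes * (1 - reserve_fraction))

    # Split the remainder evenly and round down to whole megabytes.
    per_task_mb = usable // max_tasks // 1_000_000
    return f"{per_task_mb}MB"


if __name__ == "__main__":
    # 32 GB host, 8 concurrent tasks, 25% held in reserve.
    print(per_task_memory_limit(32_000_000_000, 8))
```

The returned string can be interpolated into the SET memory_limit statement in open_spatial_connection, so the limit follows the concurrency cap instead of being hard-coded.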

Optimized Execution Pattern

The naive coroutine blocks the loop and materializes everything:

# Before: execute() and fetchall() both run on the event-loop thread.
# The whole result set lands in the heap before the first row is usable.
async def fetch_spatial(query: str):
    conn = open_spatial_connection()
    rows = conn.execute(query).fetchall()  # blocks loop + unbounded heap
    conn.close()
    return rows

The optimized form does two things: it moves the blocking call off the loop with asyncio.to_thread, and it replaces the eager fetch with a streaming Arrow iterator so peak memory stays bounded to one batch.

import asyncio
import duckdb

async def stream_spatial_batches(query: str, batch_size: int = 2048):
    # Connection is created and closed inside the task scope so an
    # exception can never leak a file descriptor or a shared cursor.
    conn = open_spatial_connection()
    try:
        # Dispatch the blocking scan into a worker thread: the event loop
        # stays free to service other tasks while DuckDB computes.
        rel = await asyncio.to_thread(conn.execute, query)

        # Stream Arrow RecordBatches instead of fetchall(): peak resident
        # memory is bounded to batch_size rows, not the full result set.
        reader = await asyncio.to_thread(rel.fetch_record_batch, batch_size)

        # Pull each batch in a worker thread too: a plain `for batch in
        # reader` would run every fetch on the event-loop thread.
        # next(reader, None) returns None at exhaustion, so StopIteration
        # never has to cross the thread boundary.
        while (batch := await asyncio.to_thread(next, reader, None)) is not None:
            yield batch  # async generator applies natural backpressure
    finally:
        conn.close()  # deterministic cleanup even on cancellation

The behavioural change is entirely in where and how the result is pulled. asyncio.to_thread keeps the loop responsive; fetch_record_batch turns a single unbounded allocation into a bounded stream. Peak RSS now tracks batch_size × row_width rather than the full geometry column. For datasets beyond ~500K rows, never fall back to fetchall() — the iterator is the contract.
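
Because every stream holds its own connection and its own memory limit, the number of streams open at once is the quantity the memory budget actually constrains. One way to enforce that cap is an asyncio.Semaphore around each task. The sketch below is illustrative: run_bounded is a hypothetical helper, and each job is assumed to be a zero-argument callable returning a coroutine, for example one that drains a single stream.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def run_bounded(
    jobs: Sequence[Callable[[], Awaitable[Any]]],
    max_concurrent: int,
) -> list[Any]:
    """Run every job, but never more than max_concurrent at the same time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job: Callable[[], Awaitable[Any]]) -> Any:
        # A job only starts once a slot is free, so the number of open
        # connections never exceeds max_concurrent.
        async with semaphore:
            return await job()

    # gather preserves the input order of the results.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Passing callables rather than already-created coroutines matters here: the connection is opened only after the semaphore admits the job, not when the job list is built.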

When the consumer is GeoPandas, keep WKB parsing off the loop too. Hand the raw blobs to vectorized shapely.from_wkb inside a thread and build one GeoDataFrame per batch; the Arrow hand-off itself is detailed in converting DuckDB queries to a GeoDataFrame efficiently. The version below assumes the query emits WKB in a column named geom, for example ST_AsWKB(geom) AS geom:

import asyncio
import geopandas as gpd
import pandas as pd
import shapely

async def build_geodataframe(query: str, crs=None):
    frames = []
    async for batch in stream_spatial_batches(query):
        df = batch.to_pandas()
        wkb = df.pop("geom").to_numpy()

        # Vectorized WKB -> geometry, dispatched off the event loop so the
        # parse never re-blocks the work to_thread just freed.
        geoms = await asyncio.to_thread(shapely.from_wkb, wkb)

        # Attach the parsed geometries directly: shapely objects cannot be
        # packed back into an Arrow binary column.
        frames.append(gpd.GeoDataFrame(df, geometry=geoms, crs=crs))

    if not frames:
        return gpd.GeoDataFrame()
    return pd.concat(frames, ignore_index=True)

Diagnostic Queries & Plan Validation

Confirm the dispatch actually parallelizes rather than serializing behind one connection. Capture the plan from the same thread DuckDB runs on:

# EXPLAIN ANALYZE from inside the worker thread so timings reflect the
# real dispatch path, not a synchronous warm-up call.
async def explain_spatial(query: str):
    conn = open_spatial_connection()
    try:
        plan = await asyncio.to_thread(
            conn.execute, f"EXPLAIN ANALYZE {query}"
        )
        return plan.fetchall()
    finally:
        conn.close()

Read the result and check three things:

Wall-clock vs total operator time. If N concurrent queries take roughly N × the time of one, the tasks are serializing — usually a shared connection or a thread pool capped at one worker. Independent connections should overlap.
Memory limit honoured. A query that reports spilling well below its configured memory_limit suggests another task in the same process is consuming the shared OS budget. Cross-check live allocation.
Batch cadence. Time between yielded batches should be steady. A long stall before the first batch means the engine buffered the full result — verify you called fetch_record_batch, not fetchdf.
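
The batch-cadence check can be measured rather than eyeballed. The sketch below wraps any async iterator and records the delay before each item; batch_gaps is a hypothetical helper, and it is assumed to be fed something like stream_spatial_batches(query).

```python
import asyncio
import time
from typing import Any, AsyncIterator


async def batch_gaps(batches: AsyncIterator[Any]) -> list[float]:
    """Return the seconds elapsed before each batch arrived.

    gaps[0] is the time to the first batch; later entries are the
    intervals between consecutive batches.
    """
    gaps: list[float] = []
    last = time.perf_counter()
    async for _batch in batches:
        now = time.perf_counter()
        gaps.append(now - last)
        last = now
    return gaps
```

A first gap that dwarfs all the others is the signature of a result that was buffered in full before streaming began; roughly even gaps indicate the engine is producing batches incrementally.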

One-liner monitors for a live session:

-- Confirm the active limit before trusting any concurrency math.
SELECT value FROM duckdb_settings() WHERE name = 'memory_limit';

-- Watch peak allocation and spill across the running tasks.
SELECT tag, memory_usage_bytes, temporary_storage_bytes
FROM duckdb_memory() ORDER BY memory_usage_bytes DESC LIMIT 10;

Geometry Validation & Fallback Routing

Invalid geometry tends to fail after the query succeeds: malformed blobs break the WKB parse, and topologically invalid shapes raise in later Shapely operations, so the failure surfaces in Python, not SQL. Identify the offending rows inside the query before they ever reach shapely.from_wkb:

-- Quarantine invalid geometry before it poisons the WKB parse.
SELECT id, geom
FROM features
WHERE NOT ST_IsValid(geom);

Repair in place, or repair inline so the stream only ever emits parseable blobs:

-- Emit repaired WKB; ST_MakeValid keeps edge-noise polygons from
-- raising shapely.errors.GEOSException downstream.
SELECT id, ST_AsWKB(ST_MakeValid(geom)) AS geom
FROM features;

When the call dispatched through asyncio.to_thread raises MemoryError, duckdb.IOException, or pyarrow.lib.ArrowInvalid, route the failed task to a smaller-footprint synchronous path with tighter limits and an explicit garbage-collection pass between chunks rather than killing the whole service:

import gc
import duckdb

def fallback_sync_execution(query: str, chunk_size: int = 1024):
    conn = duckdb.connect(":memory:")
    try:
        conn.execute("INSTALL spatial; LOAD spatial;")

        # Halve the ceiling: the fallback exists because the async path
        # already breached memory, so trade throughput for survival.
        conn.execute("SET memory_limit = '2GB';")
        conn.execute("SET threads = 2;")

        rel = conn.execute(query)
        for batch in rel.fetch_record_batch(chunk_size):
            yield batch
            gc.collect()  # run the cyclic garbage collector between chunks
    finally:
        conn.close()

Trigger the fallback when available system memory falls below a safe threshold, and shrink chunk_size on each retry. Pair it with the ST_IsValid guard above so a single malformed polygon downgrades one task instead of cascading into repeated OOM kills across the pool. For the row-bounded chunking strategy that underpins this fallback at larger scale, see the batch processing pipelines reference.
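
The shrink-on-retry rule can be sketched independently of DuckDB. In the example below, retry_with_smaller_chunks, min_chunk_size, and retry_on are hypothetical names introduced for illustration; run is assumed to be any callable that takes a chunk size, processes the whole query at that size, and raises one of the retry_on exceptions when it runs out of memory.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_smaller_chunks(
    run: Callable[[int], T],
    chunk_size: int = 1024,
    min_chunk_size: int = 64,
    retry_on: tuple[type[BaseException], ...] = (MemoryError,),
) -> T:
    """Call run(chunk_size), halving the chunk size after each failure."""
    size = chunk_size
    while True:
        try:
            return run(size)
        except retry_on:
            next_size = size // 2
            if next_size < min_chunk_size:
                # Out of room to shrink: surface the original error.
                raise
            size = next_size
```

In this setting run would wrap a function that drains fallback_sync_execution(query, chunk_size) into its destination. Each retry restarts the query from the beginning, so this only fits sinks that tolerate being rewritten.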

See also

Converting DuckDB Queries to a GeoDataFrame Efficiently — the Arrow zero-copy hand-off the async consumer relies on.
Batch Processing Pipelines — row-bounded chunking that backs the OOM fallback path.


External Reference Standards

DuckDB Spatial extension overview — geometry type mappings and WKB compliance: https://duckdb.org/docs/stable/core_extensions/spatial/overview.html
