Shapely Integration with DuckDB for Production Geospatial Workflows

Shapely owns Python’s de-facto geometry manipulation API, but direct in-memory operations on large spatial datasets quickly exhaust RAM and stall on single-threaded execution. This page addresses one specific workflow: moving geometries across the Shapely ↔ DuckDB boundary efficiently, so the columnar, vectorized engine performs the heavy ST_ compute while Shapely stays reserved for the topology repairs it does best. It sits under the Python & DuckDB Integration Workflows reference, which establishes the broader principle these patterns inherit — Python orchestrates, DuckDB computes, and geometry crosses the membrane as serialized buffers rather than per-row Python objects. The reference below covers session guardrails, the canonical serialization round-trip, plan validation, quantified trade-offs, and a regression harness you can wire into CI.

Runtime Configuration & Memory Guardrails

The Shapely bridge is memory-sensitive on both sides: DuckDB materializes vectorized geometry batches while the CPython process holds the WKB byte buffers you hand it. Configure the engine before the first geometry touches it, otherwise an unbounded spatial join will compete with the interpreter for the same physical pages and trigger an OOM kill mid-pipeline.

import duckdb

# Each setting below is a guardrail, not a tuning knob — pick conservative values first.
con = duckdb.connect(config={
    "threads": 8,                 # parallelizes ST_ kernels; higher = more temp-file/lock contention
    "memory_limit": "12GB",       # cap BELOW physical RAM (≈70–80%) to leave OS page-cache headroom
    "preserve_insertion_order": "false",  # frees the engine to reorder batches and spill cleanly
    "temp_directory": "/mnt/fast-nvme/duckdb_temp",  # spill to NVMe, not the OS root volume
})
con.execute("INSTALL spatial; LOAD spatial;")  # GEOS-backed ST_ functions live in this extension

The same knobs are available as SQL SET statements when you connect to a persistent database rather than constructing the config inline:

SET threads = 8;                       -- match physical cores; oversubscription hurts spill paths
SET memory_limit = '12GB';             -- hard ceiling; sustained temp-file growth means it is too high
SET preserve_insertion_order = false;  -- required for clean out-of-core execution on large joins
SET temp_directory = '/mnt/fast-nvme/duckdb_temp';  -- isolate per worker to avoid I/O contention

The memory_limit ceiling interacts directly with how DuckDB decides between in-memory and spill execution; the trade-offs of that decision are covered in in-memory vs disk storage. The practical rule for this workflow: reserve roughly 20–30% of physical RAM for the OS and the Python heap that will hold your WKB buffers, because that allocation is invisible to DuckDB’s own accounting.
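As a rough illustration of that rule, the sketch below derives a memory_limit string from total RAM and a reserve fraction. The helper name suggest_memory_limit and the 25% default are assumptions made for this example; they are not DuckDB settings.

```python
def suggest_memory_limit(total_ram_gb: float, reserve_fraction: float = 0.25) -> str:
    """Return a DuckDB memory_limit string that leaves reserve_fraction of RAM free.

    Hypothetical helper: the reserve covers the OS page cache and the Python
    heap holding WKB buffers, neither of which DuckDB accounts for.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_gb = int(total_ram_gb * (1.0 - reserve_fraction))  # round down, never up
    if usable_gb < 1:
        raise ValueError("not enough RAM to budget a whole gigabyte")
    return f"{usable_gb}GB"


print(suggest_memory_limit(16))        # 16 GB host, 25% reserved
print(suggest_memory_limit(64, 0.30))  # 64 GB host, 30% reserved
```

On a 16 GB host this yields the 12GB value used in the configuration above.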

Primary Execution Pattern: WKB Round-Trip

The canonical pattern moves geometry as WKB bytes, never as serialized WKT strings interpolated into SQL. Shapely exposes raw WKB on every geometry, DuckDB ingests it with ST_GeomFromWKB, and ST_AsWKB returns it without ever constructing an intermediate Python object inside the query. The distinction between the engine’s internal geometry type and the WKB wire format is detailed in ST_Geometry vs WKB; for this bridge, treat WKB as the only thing that should cross.

import duckdb
import shapely  # shapely.box and shapely.from_wkb require Shapely >= 2.0

con = duckdb.connect(config={"threads": 8, "memory_limit": "12GB"})
con.execute("INSTALL spatial; LOAD spatial;")

# Build a geometry in Python and serialize to WKB bytes (zero string parsing in SQL).
poly = shapely.box(0, 0, 10, 10)
wkb_bytes = poly.wkb  # -> bytes

# Bind WKB as a parameter ($1) — never f-string a geometry into the SQL text.
con.execute(
    "CREATE TABLE test_geom AS SELECT ST_GeomFromWKB($1::BLOB) AS geom",
    [wkb_bytes],
)

# Pull it back as WKB and reconstruct the Shapely object on the Python side.
raw = con.execute("SELECT ST_AsWKB(geom) FROM test_geom").fetchone()[0]
restored = shapely.from_wkb(raw)
assert poly.equals(restored)

Parameter binding ($1::BLOB) matters for more than injection safety: a string-interpolated WKT geometry forces DuckDB to re-parse text on every execution, whereas a bound BLOB goes straight to the binary WKB reader. Validate topology immediately after deserialization with ST_IsValid, because invalid geometries propagate silently through joins and surface later as hard-to-trace GEOS exceptions far from their origin.
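To make concrete what "bytes, not text" means at the boundary, here is a minimal sketch that hand-encodes a 2D point in the little-endian WKB layout using only the standard library. The functions point_to_wkb and wkb_to_point are illustrative names; in practice Shapely's .wkb attribute and shapely.from_wkb do this work for every geometry type.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: byte order, geometry type, x, y."""
    # "<" little-endian, B = byte-order flag (1), I = geometry type (1 = Point),
    # dd = the two coordinates as 64-bit doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(buf: bytes) -> tuple[float, float]:
    """Decode the encoding above; rejects anything that is not a little-endian 2D point."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("not a little-endian WKB Point")
    return x, y


buf = point_to_wkb(-73.9857, 40.7484)
print(len(buf), buf[:5].hex())  # total size and the 5-byte header
print(wkb_to_point(buf))
```

The fixed-width binary layout is why a bound BLOB needs no text parsing: coordinates are read as raw doubles rather than tokenized from a string.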

Two-stage batch ingestion

Spatial joins, buffers, and convex hulls over millions of polygons grow memory super-linearly. The durable pattern is two-stage: stage one lands geometry in DuckDB in bounded chunks; stage two runs the expensive ST_ work inside the engine where it can spill. Chunk inputs through partitioned Parquet or explicit LIMIT/OFFSET windows, and always apply a bounding-box pre-filter (the && operator) before any GEOS predicate so the candidate set shrinks before the costly routine runs. The chunking discipline itself generalizes to the batch processing pipelines patterns, and the GeoParquet ingestion path is documented under GeoParquet parsing.

# Stage 1: ingest WKB in bounded batches so the Python heap never holds the full set.
def ingest_batch(con, rows):
    """rows: list[tuple[int, bytes]] of (id, wkb_bytes)."""
    con.executemany(
        "INSERT INTO parcels (id, geom) VALUES (?, ST_GeomFromWKB(?::BLOB))",
        rows,
    )

# Stage 2: compute inside the engine; the && bbox pre-filter prunes before ST_Intersects.
con.execute("""
    CREATE OR REPLACE TABLE matched AS
    SELECT a.id, z.zone_name
    FROM parcels a
    JOIN zoning z
      ON a.geom && z.geom              -- cheap bounding-box overlap, runs first
     AND ST_Intersects(a.geom, z.geom) -- exact GEOS predicate on the survivors only
""")
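Stage one above assumes the rows already arrive in bounded batches. One way to produce them, sketched here with the standard library only, is a generator that slices any iterable into fixed-size lists. The name chunked and the stand-in rows are assumptions for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(rows: Iterable[tuple[int, bytes]], size: int) -> Iterator[list[tuple[int, bytes]]]:
    """Yield lists of at most `size` rows without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


# Stand-in rows; real input would be (id, wkb_bytes) pairs streamed from a reader.
rows = ((i, b"\x01") for i in range(5))
batches = list(chunked(rows, 2))
print([len(b) for b in batches])
```

Each yielded batch can be passed to ingest_batch(con, batch), so the Python heap holds at most one batch at a time.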

When the result must return to a GeoDataFrame rather than stay in the engine, hand it off through the zero-copy path described in DuckDB to GeoPandas sync instead of reconstructing geometries row by row in Python. And because DuckDB’s connection object is not thread-safe, overlapping ingestion with compute belongs in the async execution patterns model — one connection per task, dispatched through asyncio.to_thread():

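import asyncio
import duckdb

def _buffer_chunk(chunk_id: int, db_path: str) -> None:
    """Runs in a worker thread; each call owns its connection and never shares it."""
    conn = duckdb.connect(db_path)
    try:
        conn.execute("LOAD spatial;")
        # Append instead of CREATE OR REPLACE so chunks do not overwrite each other.
        conn.execute(
            """
            INSERT INTO chunk_result
            SELECT id, ST_Buffer(geom, 0.001) AS buffered_geom  -- units follow the data's CRS
            FROM raw_geometries
            WHERE chunk_id = ?
            """,
            [chunk_id],
        )
    finally:
        conn.close()

async def run_pipeline(db_path: str) -> None:
    # db_path must be a file-backed database that already holds raw_geometries:
    # every duckdb.connect(":memory:") call opens a separate, empty database.
    setup = duckdb.connect(db_path)
    setup.execute("INSTALL spatial; LOAD spatial;")
    # Create an empty result table with the right column types before the workers start.
    setup.execute("""
        CREATE OR REPLACE TABLE chunk_result AS
        SELECT id, ST_Buffer(geom, 0.001) AS buffered_geom FROM raw_geometries LIMIT 0
    """)
    setup.close()
    await asyncio.gather(*(asyncio.to_thread(_buffer_chunk, i, db_path) for i in range(8)))

asyncio.run(run_pipeline("parcels.duckdb"))  # placeholder path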

Execution Plan Validation

A spatial query that looks correct can still degrade into a full cross product if the optimizer never finds a bounding-box predicate to push down. Confirm operator behavior with EXPLAIN ANALYZE before trusting throughput numbers:

EXPLAIN ANALYZE
SELECT a.id, b.zone_name
FROM parcels a
JOIN zoning b ON ST_Intersects(a.geom, b.geom)
WHERE ST_Contains(b.geom, ST_Point(-73.9857, 40.7484));

A healthy plan applies the Filter node before the join phase and reports a Spatial Join (or hash join over an && predicate) rather than a CROSS_PRODUCT.

Read the plan top-down for three signals:

CROSS_PRODUCT with a high Actual Rows count: the optimizer found no spatial predicate to drive the join. Force a hash join by materializing the smaller table, or add an explicit && bounding-box pre-filter to the ON clause.
Filter applied after the join: the selective ST_Contains predicate ran late, so the join processed rows it could have discarded. Move the predicate into a CTE or subquery that filters before the join.
Row-estimate drift: when the optimizer’s estimated rows diverge from actual rows by more than ~10×, its join order is built on bad statistics. Persisting an R-tree index (CREATE INDEX … USING RTREE (geom)) gives the planner real selectivity to work with; without an index, DuckDB falls back to runtime bounding-box pruning and hash joins.
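Two of the three signals above can be checked mechanically. The sketch below is a hypothetical helper, not part of DuckDB: it assumes the caller has already extracted the plan text and one pair of estimated and actual row counts, and it does not attempt to detect the late-filter case, which depends on operator ordering.

```python
def plan_signals(plan_text: str, estimated_rows: int, actual_rows: int,
                 drift_threshold: float = 10.0) -> list[str]:
    """Return human-readable warnings for a plan; an empty list means no signal fired."""
    warnings = []
    if "CROSS_PRODUCT" in plan_text:
        warnings.append("cross product: no spatial predicate drives the join")
    # Symmetric ratio so both over- and under-estimation count as drift;
    # clamping to 1 avoids division by zero on empty results.
    lo, hi = sorted((max(estimated_rows, 1), max(actual_rows, 1)))
    if hi / lo > drift_threshold:
        warnings.append(f"row-estimate drift: {hi / lo:.0f}x")
    return warnings


print(plan_signals("HASH_JOIN\nSEQ_SCAN", estimated_rows=1_000, actual_rows=1_200))
print(plan_signals("CROSS_PRODUCT\nSEQ_SCAN", estimated_rows=500, actual_rows=90_000))
```

The 10× default mirrors the drift threshold named in the list; tune it to your own workload.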

The diagnostic threshold worth alerting on: if total query time scales super-linearly with row count, the candidate set is not being pruned — re-check that the && predicate survived into the plan.
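One way to turn "scales super-linearly" into a number is to fit the slope of log(time) against log(rows) across a few runs of increasing size. This is a minimal sketch with made-up timings; the function name and the sample values are assumptions, and a real harness would feed it measured durations.

```python
import math


def scaling_exponent(samples: list[tuple[int, float]]) -> float:
    """Least-squares slope of log(seconds) against log(rows).

    About 1.0 means linear scaling; values well above 1.0 suggest the
    candidate set is not being pruned.
    """
    if len(samples) < 2:
        raise ValueError("need at least two (rows, seconds) samples")
    xs = [math.log(rows) for rows, _ in samples]
    ys = [math.log(seconds) for _, seconds in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("samples must cover at least two distinct row counts")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var


# Made-up timings for illustration: doubling the rows quadruples the time.
quadratic = [(10_000, 0.5), (20_000, 2.0), (40_000, 8.0)]
print(round(scaling_exponent(quadratic), 2))
```

An exponent near 2 is what a join comparing every pair of rows would produce, which is the behaviour the && pre-filter is meant to prevent.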

Performance Trade-offs

Every choice in this bridge trades CPU, RAM, precision, or developer effort. The quantified guidance below assumes the two-stage pattern above on a dense urban dataset.

Dimension	Trade-off	When to apply / threshold
WKB binding vs WKT strings	Bound `$1::BLOB` skips text re-parsing; full-table WKT serialization back to Python degrades throughput by >40%.	Always bind WKB. Reserve WKT for human-readable debugging dumps only.
`&&` bbox pre-filter vs raw predicate	Bounding-box pruning before `ST_Intersects`/`ST_Contains` removes 60–90% of candidate pairs in dense layers.	Apply on every join over more than a few thousand geometries.
Engine-side compute vs Shapely fallback	`ST_` kernels are vectorized; round-tripping a full table to Shapely for `make_valid`/`union` costs the >40% serialization penalty plus single-threaded execution.	Export only the affected subset via WKB; never the whole table.
Spill to NVMe vs in-memory	Out-of-core execution keeps the pipeline alive past `memory_limit` but adds I/O latency; partitioning by `ST_Envelope` cuts cross-partition join overhead ~40–60%.	Spill when the working set exceeds `memory_limit`; partition when joins span many envelopes.
Precision vs speed	Double-precision coordinates preserve topology; rounding saves CPU but risks invalid geometries and coordinate drift.	Keep double precision; if `ST_Intersects` fails on valid inputs, check for drift `>1e-6`.
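
For intuition about why the bounding-box row in the table is cheap, the envelope test reduces to four comparisons per pair. The sketch below is a pure-Python illustration of that idea, not DuckDB's implementation; BBox and bbox_overlaps are names invented for the example.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    minx: float
    miny: float
    maxx: float
    maxy: float


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """True when the two envelopes share any point, touching edges included."""
    return (a.minx <= b.maxx and b.minx <= a.maxx
            and a.miny <= b.maxy and b.miny <= a.maxy)


parcel = BBox(0, 0, 10, 10)
zones = {"north": BBox(5, 8, 20, 30), "far_east": BBox(50, 0, 60, 10)}

# Only envelopes that overlap the parcel would go on to the exact predicate.
candidates = [name for name, box in zones.items() if bbox_overlaps(parcel, box)]
print(candidates)
```

Pairs rejected here never reach the exact intersection routine, which is where the pruning savings come from.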

Where geometry must be reduced to summary statistics rather than returned whole, chaining into vectorized aggregations keeps the entire reduction inside the engine and avoids the Shapely round-trip cost altogether.

Edge Cases & Anti-Patterns

CRS mismatch across the boundary. Shapely is CRS-agnostic — it operates on raw coordinates and silently ignores units. A buffer of 0.001 is ~111 m near the equator in EPSG:4326 degrees but 1 mm in a metric projection. DuckDB will not warn you. Establish the working CRS explicitly using the rules in CRS mapping and transformations before any distance or buffer operation, and keep both sides in the same system.

# ANTI-PATTERN: buffer value whose unit depends on an unstated CRS.
con.execute("SELECT ST_Buffer(geom, 0.001) FROM parcels")  # 0.001 of WHAT?

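# FIX: transform to a metric CRS first, buffer in metres, document the unit.
# always_xy := true makes ST_Transform read EPSG:4326 input as (lon, lat);
# without it the CRS-defined (lat, lon) axis order applies.
con.execute("""
    SELECT ST_Buffer(
        ST_Transform(geom, 'EPSG:4326', 'EPSG:32618', always_xy := true), 50
    ) AS buf_50m
    FROM parcels
""")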
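
The unit arithmetic behind this anti-pattern can be sketched in a few lines. The example below uses a spherical-Earth approximation (about 111.32 km per degree) and the standard 6-degree UTM zone rule. Both helper names are assumptions, and the zone picker ignores the Norway and Svalbard exceptions.

```python
import math

METRES_PER_DEGREE = 111_320.0  # approximate length of one degree on a spherical Earth


def lon_degrees_to_metres(deg: float, latitude: float) -> float:
    """Approximate east-west ground distance of `deg` degrees of longitude at a latitude."""
    return deg * METRES_PER_DEGREE * math.cos(math.radians(latitude))


def utm_epsg(lon: float, lat: float) -> str:
    """EPSG code of the WGS 84 / UTM zone containing a lon/lat point."""
    zone = min(int((lon + 180) // 6) + 1, 60)  # clamp so lon == 180 stays in zone 60
    base = 32600 if lat >= 0 else 32700        # northern vs southern hemisphere series
    return f"EPSG:{base + zone}"


print(round(lon_degrees_to_metres(0.001, 0.0), 2))      # at the equator
print(round(lon_degrees_to_metres(0.001, 40.7484), 2))  # at the latitude used earlier
print(utm_epsg(-73.9857, 40.7484))
```

The same 0.001 shrinks east-west as latitude increases, which is one more reason to buffer in a projected CRS chosen for the data's location.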

Predicate in WHERE instead of ON. Placing the bounding-box filter in a WHERE clause on a join can prevent the optimizer from using it to drive the join itself, leaving a cross product. Keep the && pruning predicate in the ON clause beside the exact predicate.

-- ANTI-PATTERN: bbox filter stranded in WHERE; join may materialize the full product first.
SELECT a.id, b.zone_name
FROM parcels a JOIN zoning b ON ST_Intersects(a.geom, b.geom)
WHERE a.geom && b.geom;

-- FIX: pruning predicate in ON, so the optimizer builds the join around it.
SELECT a.id, b.zone_name
FROM parcels a JOIN zoning b
  ON a.geom && b.geom AND ST_Intersects(a.geom, b.geom);

Invalid geometry surfacing downstream. Self-intersecting rings deserialized from WKB pass ST_GeomFromWKB but throw inside GEOS predicates later. Guard at the boundary and route repairs through Shapely or ST_MakeValid only for the failing subset:

-- Detect and repair in place, touching only invalid rows.
UPDATE parcels SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);

Per-row Python loops over geometry. Pulling a result set into Python and iterating Shapely objects to compute, say, areas reintroduces the single-threaded bottleneck this whole bridge exists to avoid. Push the computation into the engine and return only scalars.

Query Regression Analysis

Spatial plans regress quietly: a DuckDB upgrade, a dropped index, or a statistics shift can turn a hash join back into a cross product with no error. Capture the plan as a normalized baseline and diff it in CI so a regression fails the build rather than the on-call pager.

import duckdb
import json
import re

def capture_plan(con: duckdb.DuckDBPyConnection, sql: str) -> dict:
    """Capture a normalized EXPLAIN ANALYZE fingerprint for regression diffing."""
    rows = con.execute("EXPLAIN ANALYZE " + sql).fetchall()
    plan_text = "\n".join(r[-1] for r in rows)

    # Strip volatile fields (timings, row counts) so only structure is compared.
    structural = re.sub(r"\d+(\.\d+)?\s*(ms|s|rows)", "<N>", plan_text)
    timing_ms = None
    m = re.search(r"Total Time:\s*([\d.]+)\s*s", plan_text)
    if m:
        timing_ms = float(m.group(1)) * 1000

    return {
        "has_cross_product": "CROSS_PRODUCT" in plan_text,
        "operators": sorted(set(re.findall(r"[A-Z_]{4,}", structural))),
        "timing_ms": timing_ms,
        "fingerprint": structural,
    }

def assert_no_regression(baseline: dict, current: dict, slack: float = 1.5) -> None:
    """Fail loudly if the join strategy changed or timing blew past the budget."""
    assert not current["has_cross_product"], "Regression: plan fell back to CROSS_PRODUCT"
    assert baseline["operators"] == current["operators"], (
        f"Regression: operator set changed\n"
        f"  was: {baseline['operators']}\n  now: {current['operators']}"
    )
    if baseline["timing_ms"] and current["timing_ms"]:
        budget = baseline["timing_ms"] * slack
        assert current["timing_ms"] <= budget, (
            f"Regression: {current['timing_ms']:.0f} ms exceeds budget {budget:.0f} ms"
        )

# Usage: store the baseline once, then compare on every CI run.
# Connect to the database that holds parcels and zoning; a bare duckdb.connect()
# opens an empty in-memory database. "parcels.duckdb" is a placeholder path.
con = duckdb.connect("parcels.duckdb")
con.execute("INSTALL spatial; LOAD spatial;")
SQL = "SELECT a.id FROM parcels a JOIN zoning b ON a.geom && b.geom AND ST_Intersects(a.geom, b.geom)"

# First run only:
# with open("plan_baseline.json", "w") as f:
#     json.dump(capture_plan(con, SQL), f)
with open("plan_baseline.json") as f:
    baseline = json.load(f)
assert_no_regression(baseline, capture_plan(con, SQL))

Two thresholds make this actionable: the operator-set check catches structural regressions (a vanished Spatial Join, a new CROSS_PRODUCT) deterministically, while the timing budget with a 1.5× slack absorbs normal noise but trips on a real slowdown. Pair it with a one-liner that flags runaway spill during the run — SELECT * FROM duckdb_temporary_files(); showing sustained growth means a spatial filter went missing.
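
"Sustained growth" needs a definition before it can page anyone. A minimal sketch follows, assuming the caller samples a total byte count at intervals (for example by summing the file sizes that duckdb_temporary_files() reports). The function name, the run length of three, and the sample values are all assumptions.

```python
def sustained_growth(samples: list[int], min_run: int = 3) -> bool:
    """True when each of the last `min_run` intervals between samples grew."""
    if len(samples) <= min_run:
        return False  # not enough history to judge
    tail = samples[-(min_run + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


# Made-up byte totals for illustration.
steady = [0, 512, 512, 480, 512]
runaway = [0, 512, 2_048, 8_192, 32_768]
print(sustained_growth(steady), sustained_growth(runaway))
```

A plateau or dip resets the signal, so a query that spills once and then stabilizes does not trip it.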

See also

DuckDB to GeoPandas sync — zero-copy handoff when results must land in a GeoDataFrame rather than Shapely objects.
Async execution patterns — connection-per-task isolation for overlapping ingestion with spatial compute.
Batch processing pipelines — chunking and out-of-core strategies the two-stage ingestion pattern builds on.
Understanding ST_Geometry vs WKB — why WKB is the right wire format for the boundary.

External Reference Standards

DuckDB Spatial extension overview: https://duckdb.org/docs/stable/core_extensions/spatial/overview.html
GEOS (the geometry engine shared by DuckDB Spatial and Shapely): https://libgeos.org/
