DuckDB to GeoPandas Sync

Moving query results from DuckDB’s analytical engine into a GeoPandas GeoDataFrame is the boundary where most Python spatial pipelines lose their performance. This workflow within the Python & DuckDB Integration Workflows reference covers the deterministic transfer path: bounding DuckDB’s memory and threads, projecting geometry as Well-Known Binary, streaming Arrow record batches, and reconstructing Shapely geometries in one vectorized call, so geometry never decays into per-row Python objects. The governing rule is that DuckDB computes, Python orchestrates, and geometry crosses the boundary as a contiguous binary buffer rather than a stringified column.

Runtime Configuration & Memory Guardrails

The sync runs in-process, so DuckDB’s working set, Python’s heap, and GeoPandas’ geometry construction all contend for the same address space. Before any spatial query executes, pin the connection to hard memory and thread ceilings. Without them, an unbounded spatial join can allocate far past available RAM and trigger an OOM kill that takes the Python process with it.

import duckdb

# Connection tuned for the extract-and-convert path, not ad-hoc analytics.
con = duckdb.connect(config={
    "threads": 4,                          # Match physical cores; over-subscription
                                           #   adds context-switch cost during WKB scans.
    "memory_limit": "8GB",                 # Hard ceiling — spatial joins spill past this
                                           #   instead of being OOM-killed.
    "temp_directory": "/tmp/duckdb_spill", # Spill target for hash tables / sort buffers.
    "preserve_insertion_order": False,     # Unlocks parallel scans; we re-order in SQL.
    "enable_object_cache": False,          # Legacy metadata-cache switch; documented as a no-op in recent releases.
})

con.execute("INSTALL spatial; LOAD spatial;")

Performance Trade-off: Capping threads below the logical core count reduces context-switching while DuckDB materializes geometry, but lengthens wall-clock time on non-spatial aggregations in the same query. Set it to the physical core count and measure, rather than defaulting to the logical count. enable_object_cache historically controlled caching of file metadata such as Parquet footers, and recent DuckDB releases document it as a legacy setting with no effect, so treat the flag as a statement of intent rather than a guarantee of deterministic I/O. Plan stability is checked explicitly when you diff execution plans later on this page.

Whether DuckDB keeps intermediates in RAM or spills them depends on how you balance memory_limit against result size; the trade-offs are covered in in-memory vs disk storage. For the sync path specifically, the dangerous allocation is not the query itself but the copy into Python: a result that fits comfortably inside memory_limit can still double its footprint the moment it is materialized as a DataFrame plus a Shapely geometry array, and that copy lives on the Python heap, which memory_limit does not govern. Size the ceiling so that DuckDB’s share plus the doubled result still fits in physical RAM.
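
As a rough illustration of that sizing rule, the sketch below splits a machine's RAM between DuckDB and the Python side. The function name, the 2x materialization factor (taken from the doubling described above), and the fixed headroom are assumptions for illustration, not DuckDB behaviour; measure your own workload before relying on the numbers.

```python
def suggest_memory_limit_gb(total_ram_gb: float,
                            result_gb: float,
                            python_factor: float = 2.0,
                            headroom_gb: float = 1.0) -> float:
    """Return a DuckDB memory_limit (GB) that leaves room for the Python copy.

    python_factor models the DataFrame plus Shapely geometry array footprint
    as a multiple of the raw result size (an assumption; measure your data).
    """
    python_side = result_gb * python_factor
    limit = total_ram_gb - python_side - headroom_gb
    if limit <= 0:
        raise ValueError(
            "Result too large to materialize in one piece; stream it in batches.")
    return round(limit, 2)


# 16 GB machine, 2 GB result: 16 - (2 * 2) - 1 = 11 GB left for DuckDB.
print(suggest_memory_limit_gb(16, 2))  # 11.0
```

The returned figure would be passed as the "memory_limit" value in the connection config. When the function raises, the result is a candidate for batched streaming rather than a single fetch.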

Monitor spill behaviour before scaling batch sizes:

-- Non-zero rows mean DuckDB is spilling intermediates to temp_directory.
SELECT path, size FROM duckdb_temporary_files();

The Two-Stage Sync Pattern

The canonical sync is two stages with a single hand-off format. Stage one runs entirely inside DuckDB and emits geometry as Well-Known Binary rather than the verbose WKT text encoding. Stage two pulls the result as an Arrow table and parses the WKB column with a single vectorized Shapely call. The two stages never exchange Python objects — only an Arrow buffer.

The boundary between DuckDB and Python is crossed once, as a binary Arrow buffer; geometry is reconstructed in a single vectorized pass on the Python side.

Stage one — project geometry as WKB in SQL

-- Emit a binary BLOB column, NOT stringified WKT.
SELECT
    parcel_id,
    zone_code,
    ST_Area(geom)   AS area_sqm,
    ST_AsWKB(geom)  AS geom_wkb   -- Arrow receives this as a BLOB, zero string allocation.
FROM parcels
WHERE ST_IsValid(geom);

ST_AsWKB() is the load-bearing choice. Projecting geometry as text with ST_AsText() instead yields one Python string per row and, typically, a per-row shapely.wkt.loads() call on the GeoPandas side: an object-dtype column that defeats every vectorized path. WKB arrives as a contiguous binary column that Shapely parses in one pass.
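
To make the difference between the two encodings concrete, here is a minimal standard-library sketch that builds and decodes the WKB for a single 2D point. The helper names are hypothetical; the byte layout (byte-order flag, geometry type, two float64 coordinates) follows the OGC WKB point encoding. Real pipelines should leave this work to ST_AsWKB() and shapely.from_wkb.

```python
import struct

def point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB."""
    # 1 byte order flag (1 = little-endian), uint32 geometry type (1 = Point),
    # then two float64 coordinates. "<" disables padding, so 1 + 4 + 8 + 8 bytes.
    return struct.pack("<BIdd", 1, 1, x, y)

def parse_point_wkb(buf: bytes) -> tuple[float, float]:
    """Decode a point produced by point_wkb()."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("not a little-endian WKB point")
    return x, y

wkb = point_wkb(12.4924, 41.8902)
wkt = "POINT (12.4924 41.8902)"

print(len(wkb))             # 21: fixed size, whatever the coordinate values
print(len(wkt.encode()))    # text length grows with the number of printed digits
print(parse_point_wkb(wkb)) # coordinates round-trip exactly, no text parsing
```

The binary form has a fixed layout that can be decoded without tokenizing or float parsing, which is what makes a single vectorized pass over the column possible.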

Stage two — Arrow to GeoDataFrame, vectorized

import geopandas as gpd
import pandas as pd
import shapely
import pyarrow as pa

def arrow_to_geodataframe(
    table: pa.Table,
    wkb_col: str = "geom_wkb",
    crs: str = "EPSG:4326",
) -> gpd.GeoDataFrame:
    """Build a GeoDataFrame from an Arrow table with a WKB BLOB column."""
    # Single vectorized parse — no per-row Python iteration.
    geometries = shapely.from_wkb(table.column(wkb_col).to_numpy(zero_copy_only=False))

    # Non-geometry columns stay in Arrow until the final to_pandas() call.
    attrs = table.drop_columns([wkb_col]).to_pandas(types_mapper=pd.ArrowDtype)
    return gpd.GeoDataFrame(attrs, geometry=geometries, crs=crs)

arrow_table = con.execute(SQL_FROM_STAGE_ONE).fetch_arrow_table()
gdf = arrow_to_geodataframe(arrow_table)

The CRS is set once, explicitly, on construction — DuckDB Spatial does not propagate an SRS tag through ST_AsWKB(), so the geometry arrives coordinate-system-agnostic. Tag it with the CRS the query actually produced; if the query reprojected, use the target CRS, following the rules in CRS mapping and transformations. For the deeper conversion mechanics — Arrow dtype mapping, large-buffer tuning, and async variants — see converting DuckDB queries to GeoDataFrame efficiently.

Streaming large result sets

A single fetch_arrow_table() materializes the whole result in one contiguous block, which is fastest but bounded by RAM. When the result exceeds what you can safely double in memory, stream record batches instead and convert one chunk at a time — the same chunked discipline used across batch processing pipelines and async execution patterns.

result = con.execute("SELECT parcel_id, ST_AsWKB(geom) AS geom_wkb FROM parcels")

frames = []
for batch in result.fetch_record_batch(500_000):   # 500k-row chunks
    if batch.num_rows == 0:
        continue
    frames.append(arrow_to_geodataframe(pa.Table.from_batches([batch])))

gdf = gpd.GeoDataFrame(pd.concat(frames, ignore_index=True), crs="EPSG:4326")

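Performance Trade-off: fetch_record_batch() bounds the Arrow-side intermediate to one chunk at a time, at the price of Python-level iteration and a final concat. It keeps total memory flat only if each chunk is written out or reduced rather than accumulated; the loop above still ends with the whole result in one GeoDataFrame. fetch_arrow_table() is roughly 30–40% faster end-to-end on results that fit, at the risk of an OOM kill if the Arrow table, the DataFrame, and the geometry array together exceed the RAM left over after DuckDB’s memory_limit. Choose batched streaming above ~5M rows or for any polygon-heavy result.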
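
The loop above appends every converted chunk to a list, so the Python side still ends up holding the full result. When the goal is genuinely bounded memory, each chunk has to be handed off and released. The sketch below shows that shape with plain Python stand-ins; drain_batches, convert, and sink are hypothetical names, and lists of integers replace Arrow record batches so the example runs without DuckDB.

```python
from typing import Callable, Iterable, TypeVar

Batch = TypeVar("Batch")
Frame = TypeVar("Frame")

def drain_batches(batches: Iterable[Batch],
                  convert: Callable[[Batch], Frame],
                  sink: Callable[[Frame], None]) -> int:
    """Convert and hand off one chunk at a time; return the chunk count.

    Only the current chunk is referenced here, so peak memory is bounded by
    the largest chunk as long as `sink` does not retain what it receives.
    """
    count = 0
    for batch in batches:
        sink(convert(batch))   # e.g. write the chunk to disk, then let it go
        count += 1
    return count


# Stand-in data: lists of ints instead of Arrow record batches.
fake_batches = ([i, i + 1] for i in range(0, 6, 2))
totals = []
n = drain_batches(fake_batches, convert=sum, sink=totals.append)
print(n, totals)  # 3 [1, 5, 9]
```

In the real pipeline, batches would be the reader returned by fetch_record_batch(), convert would wrap arrow_to_geodataframe(), and sink would be whatever persists or aggregates a chunk, for example a function that appends it to a file.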

Execution Plan Validation

A slow sync is usually a slow query, not slow conversion. Before optimizing the Python side, confirm the SQL stage uses spatial index routing rather than a nested-loop scan. Inspect the plan with EXPLAIN ANALYZE.

EXPLAIN ANALYZE
SELECT
    a.parcel_id,
    b.zone_code,
    ST_AsWKB(a.geom) AS geom_wkb
FROM parcels a
JOIN zoning_districts b
  ON ST_Intersects(a.geom, b.geom)
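-- Radius is in CRS units. Assumes geom is stored in a metric CRS (EPSG:32633 is a
-- placeholder); the lon/lat point is reprojected so that 5000 means 5000 metres.
WHERE ST_DWithin(a.geom, ST_Transform(ST_Point(12.45, 41.89), 'EPSG:4326', 'EPSG:32633', true), 5000)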
  AND b.zone_code IN ('R1', 'C2');

A healthy plan applies the attribute filter early, resolves the join through a hash join fed by bounding-box pre-filtering, and serializes WKB last. It does not run a nested loop over a cross product.

Read the plan with these priorities:

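HASH_JOIN vs NESTED_LOOP_JOIN. ST_Intersects should ride on bounding-box (&&) pre-filtering so the engine prunes candidate pairs before exact predicate evaluation. The exact operator label varies with the DuckDB and spatial extension version (recent releases can report a dedicated SPATIAL_JOIN), so read for the absence of a nested loop or cross product rather than for one specific name. The mechanics of that pruning live in the R-tree spatial index internals, and the join-side tuning in spatial joins and proximity filters. A nested-loop node on a multi-million-row input is the signal that index routing failed.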
FILTER placement. Attribute filters (zone_code IN (…)) should execute before geometry evaluation so WKB is never deserialized for rows that are discarded anyway.
SCAN cost. A high ratio of rows_scanned to rows_output points to a missing index or an unpartitioned table. For inputs above 10M rows, ordering by a space-filling (Hilbert or XY) key localizes scans.

Diagnostic Boundary: Treat a join that emits more rows than the larger input as a cardinality explosion. The usual cause is a CRS mismatch, where degree-unit and metre-unit geometries are compared as if equal, or an ST_DWithin radius expressed in the wrong unit. Confirm both inputs share one coordinate system before blaming the planner.

The cost model for an unindexed intersection is $O(N \times M)$ over the two input cardinalities; bounding-box pre-filtering is what collapses it toward near-linear behaviour by discarding non-overlapping pairs first.
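
The effect of pre-filtering on the number of exact comparisons can be seen in a small pure-Python experiment. This is only a sketch: a uniform grid stands in for the R-tree and bounding-box machinery, the box sizes and cell size are arbitrary assumptions, and none of it reflects how DuckDB implements the join internally.

```python
import random
from collections import defaultdict
from itertools import product

Box = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def boxes_overlap(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def naive_pairs(left, right):
    """Test every pair: exactly len(left) * len(right) comparisons."""
    checks, hits = 0, set()
    for (i, a), (j, b) in product(enumerate(left), enumerate(right)):
        checks += 1
        if boxes_overlap(a, b):
            hits.add((i, j))
    return hits, checks

def grid_pairs(left, right, cell=10.0):
    """Bucket boxes into grid cells, then compare only boxes sharing a cell."""
    def cells(box):
        xs = range(int(box[0] // cell), int(box[2] // cell) + 1)
        ys = range(int(box[1] // cell), int(box[3] // cell) + 1)
        return product(xs, ys)

    buckets = defaultdict(list)
    for j, b in enumerate(right):
        for c in cells(b):
            buckets[c].append(j)

    hits, seen = set(), set()
    for i, a in enumerate(left):
        for c in cells(a):
            for j in buckets.get(c, ()):
                if (i, j) in seen:
                    continue          # pair already compared via another cell
                seen.add((i, j))
                if boxes_overlap(a, right[j]):
                    hits.add((i, j))
    return hits, len(seen)

def random_boxes(n, rng):
    out = []
    for _ in range(n):
        x, y = rng.uniform(0, 1000), rng.uniform(0, 1000)
        out.append((x, y, x + rng.uniform(1, 5), y + rng.uniform(1, 5)))
    return out

rng = random.Random(0)
left, right = random_boxes(300, rng), random_boxes(300, rng)
naive_hits, naive_checks = naive_pairs(left, right)
grid_hits, grid_checks = grid_pairs(left, right)
print(naive_checks, grid_checks, naive_hits == grid_hits)
```

Both functions return the same overlapping pairs, but the grid version performs far fewer comparisons than the 90,000 of the naive product, because boxes that share no cell are never compared.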

Performance Trade-offs

The transfer path has four decision points, each with a measurable cost. The numbers below are directional for parcel-scale polygon data on a 4-core, 8 GB session and should be re-measured against your own geometry complexity.

Decision	Faster option	Cost of the alternative
Geometry encoding	`ST_AsWKB()` → `shapely.from_wkb`	WKT + `wkt.loads()` per row: 5–10× slower, `object` dtype
Result transfer	`fetch_arrow_table()` (fits in RAM)	`fetch_record_batch()` streaming: ~30–40% slower, memory-flat
Validation timing	defer `make_valid()` for read-only loads	early validation: needed for write/ETL, adds a full pass
CRS handling	tag once on construction	reproject in GeoPandas post-load: 15–30% overhead, GIL-bound

Three rules follow from the table. For read-heavy extraction, defer geometry validation until something downstream actually needs valid topology; an early ST_MakeValid() on every row is wasted work if you are only reading. For write-heavy ETL, validate early so invalid geometry never reaches a destination that will reject the whole batch. In both cases, set the CRS in SQL or at construction time; reprojecting a materialized GeoDataFrame is the slowest place to do it.

Edge Cases & Anti-Patterns

.fetchdf() then GeoDataFrame(...). The most common anti-pattern. With geometry projected as WKT, fetchdf() materializes it as Python strings in an object-dtype column, then GeoPandas re-parses each string. The geometry is copied twice (binary → string → GEOS object) and the parse serializes under the GIL, erasing the benefit of DuckDB’s multi-threaded execution. Fix: project ST_AsWKB() and parse with shapely.from_wkb as shown above.

Geometry predicate in WHERE instead of ON. Writing the spatial relation as a post-join filter rather than the join condition can suppress index routing and force a near-cross-product before the filter runs:

-- Anti-pattern: cross product, then filter.
SELECT a.parcel_id, ST_AsWKB(a.geom)
FROM parcels a, zoning_districts b
WHERE ST_Intersects(a.geom, b.geom);

-- Fix: spatial relation as the join predicate so the planner can route it.
SELECT a.parcel_id, ST_AsWKB(a.geom)
FROM parcels a
JOIN zoning_districts b ON ST_Intersects(a.geom, b.geom);

Wrong CRS unit in ST_DWithin. A radius of 5000 means 5000 degrees on EPSG:4326 geometry — effectively the whole globe — but 5000 metres on a projected CRS. Either reproject to a metric system first or use the geography-aware distance path. The unit semantics are detailed in how DuckDB Spatial handles coordinate systems.
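
The scale of the unit error is easy to check by hand. The sketch below uses a spherical Earth approximation to estimate how long one degree is on the ground and what a 5000 m radius looks like in degrees at a given latitude. The helper names and the mean-radius constant are assumptions for illustration; the result is suitable for sanity checks or a coarse pre-filter, not for exact distance work.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius (spherical approximation)

def metres_per_degree(lat_deg: float) -> tuple[float, float]:
    """Approximate ground length of one degree of (longitude, latitude)."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lon, per_deg_lat

def metres_to_degrees(radius_m: float, lat_deg: float) -> float:
    """Conservative degree radius covering radius_m in every direction.

    Uses the shorter east-west degree length, so the result over-covers
    north-south. Breaks down near the poles, where cos(lat) approaches 0.
    """
    per_deg_lon, _ = metres_per_degree(lat_deg)
    return radius_m / per_deg_lon

lon_m, lat_m = metres_per_degree(41.89)
print(round(lat_m / 1000), "km per degree of latitude")
print(round(lon_m / 1000), "km per degree of longitude at 41.89 N")
print(round(metres_to_degrees(5000, 41.89), 4), "degrees for a 5000 m radius")
```

One degree of latitude is on the order of 111 km, so a radius of 5000 passed against EPSG:4326 geometry is several orders of magnitude larger than the 5 km that was intended.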

Missing CRS on the GeoDataFrame. ST_AsWKB() carries no SRS tag, so a GeoDataFrame built without an explicit crs= argument is silently coordinate-system-less. Every later reprojection or distance call then either errors or produces wrong answers. Always pass crs= at construction.

Invalid geometry forcing a slow fallback. When more than a few percent of rows fail ST_IsValid, Shapely’s vectorized parse degrades and downstream operations fall back to per-geometry Python loops, cutting throughput 3–5×. Filter or repair at the SQL stage with ST_MakeValid(geom) before export rather than after.

Query Regression Analysis

The sync is only deterministic if the plan stays stable as data grows and statistics drift. Capture the plan as a fingerprint, store a baseline, and fail loudly when join strategy or row estimates change. This catches the silent regression where a HASH_JOIN degrades to a nested loop after a table doubles in size.

import json
import re

def plan_fingerprint(con, sql: str) -> dict:
    """Capture join strategy and row-estimate signal from an EXPLAIN plan."""
    plan = con.execute("EXPLAIN " + sql).fetchall()[0][1]
    return {
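        # Also match the other join operators DuckDB can report, so a switch to any of them is caught.
        "joins":  sorted(re.findall(r"(HASH_JOIN|NESTED_LOOP_JOIN|BLOCKWISE_NL_JOIN|PIECEWISE_MERGE_JOIN|IE_JOIN|SPATIAL_JOIN|CROSS_PRODUCT)", plan)),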
        "scans":  len(re.findall(r"SEQ_SCAN|TABLE_SCAN", plan)),
        "uses_index": "RTREE" in plan or "&&" in plan,
    }

def assert_no_regression(con, sql: str, baseline_path: str) -> None:
    current = plan_fingerprint(con, sql)
    try:
        with open(baseline_path) as f:
            baseline = json.load(f)
    except FileNotFoundError:
        with open(baseline_path, "w") as f:
            json.dump(current, f, indent=2)   # First run: record the baseline.
        return

    # Hard gate: any join strategy change or lost index usage is a regression.
    assert current["joins"] == baseline["joins"], (
        f"Join strategy drift: {baseline['joins']} -> {current['joins']}")
    assert current["uses_index"] == baseline["uses_index"], (
        "Spatial index routing changed — re-check statistics and CRS alignment.")

Run this in CI against a representative fixture. Pair it with a runtime check on the live connection so a degraded plan surfaces before it floods memory:

def guard_runtime_state(con) -> None:
    spill = con.execute("SELECT count(*) FROM duckdb_temporary_files()").fetchone()[0]
    if spill > 50:
        raise RuntimeError(
            f"{spill} spill files — likely cardinality misestimate; "
            f"re-run ANALYZE on base tables or raise memory_limit.")

Diagnostic Boundaries:

Spill count. More than 50 entries in duckdb_temporary_files() indicates severe cardinality misestimation. Re-run ANALYZE on the base tables or raise memory_limit before retrying.
Geometry validity. Reject any batch where ST_IsValid fails on more than 5% of rows; that ratio reliably predicts the Shapely slow-path fallback described above.
Row-estimate drift. When the planner’s estimated output diverges from the actual by more than an order of magnitude, stale statistics are usually the cause — refresh them rather than rewriting the query.
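
The validity and row-estimate boundaries above can be written as small predicates so they are applied consistently across jobs. This is a minimal sketch: the function names are hypothetical, the thresholds are the 5% and order-of-magnitude figures from the list, and the counts are assumed to come from your own queries (for example a count of rows failing ST_IsValid, and estimated versus actual cardinality read from a plan).

```python
import math

def validity_gate(invalid_rows: int, total_rows: int, max_ratio: float = 0.05) -> bool:
    """True when the batch is acceptable (invalid share at or below max_ratio)."""
    if total_rows == 0:
        return True
    return invalid_rows / total_rows <= max_ratio

def estimate_drift(estimated_rows: int, actual_rows: int) -> float:
    """Absolute log10 ratio between estimated and actual row counts."""
    # Clamp to 1 so empty results do not hit log10(0).
    return abs(math.log10(max(estimated_rows, 1) / max(actual_rows, 1)))

def stats_look_stale(estimated_rows: int, actual_rows: int) -> bool:
    """True when the estimate is off by more than one order of magnitude."""
    return estimate_drift(estimated_rows, actual_rows) > 1.0

print(validity_gate(40, 1000))          # True: 4% invalid
print(validity_gate(60, 1000))          # False: 6% invalid
print(stats_look_stale(1_000, 50_000))  # True: estimate is 50x too low
print(stats_look_stale(1_000, 5_000))   # False: 5x is within one order of magnitude
```

Working in log space treats over-estimates and under-estimates symmetrically, so an estimate that is 50 times too high trips the check just as one that is 50 times too low does.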

See also

Converting DuckDB queries to GeoDataFrame efficiently — Arrow dtype mapping and async conversion at scale
Batch processing pipelines — chunked, bounded-memory orchestration around this sync
Async execution patterns — running the extract stage off the event loop
Shapely integration — geometry construction and validity on the Python side
CRS mapping and transformations — getting coordinate systems right before export

External Reference Standards

DuckDB Spatial extension: https://duckdb.org/docs/stable/core_extensions/spatial/overview.html
GeoPandas GeoDataFrame construction and geometry handling: https://geopandas.org/en/stable/docs/reference/geodataframe.html
