Converting DuckDB Queries to GeoDataFrame Efficiently

Turning a DuckDB Spatial result into a GeoPandas GeoDataFrame takes seconds rather than minutes only when geometry crosses the boundary as a contiguous binary buffer instead of decaying into per-row Python objects. This page, part of the DuckDB to GeoPandas Sync workflow, isolates that conversion bottleneck and the fixes for it.

The naive idiom — .fetchdf() followed by geopandas.GeoDataFrame(df, geometry='geom') — forces row-by-row WKT parsing, bypasses vectorized memory layouts, and fragments the Python heap. For result sets beyond a few million rows or with complex polygons, that pattern routinely ends in an out-of-memory (OOM) crash from duplicated intermediate buffers. The deterministic path routes everything through Arrow and reconstructs geometry with one vectorized call.

Root-Cause Analysis

The conversion fails along four distinct axes, and each has a different remedy. Diagnosing which one you are hitting is the prerequisite to fixing it — the symptoms (slow, then OOM) look identical from the outside.

Stringified geometry round-trip. .fetchdf() materializes the whole result into Pandas and converts each GEOMETRY value to a WKT string for legacy compatibility. GeoPandas then calls shapely.wkt.loads() once per row. The same coordinate set is encoded as bytes, re-encoded as text, then parsed back into GEOS objects — three full copies. The internal representation behind this — and why WKB is the cheaper interchange format — is covered in the ST_Geometry vs WKB reference.
object-dtype geometry column. Pandas infers a generic object dtype for the geometry column, so it never gets a contiguous backing buffer. Every downstream operation chases pointers across a fragmented heap.
GIL-bound row iteration. Per-row WKT parsing runs single-threaded under the GIL, throwing away the multi-threaded execution DuckDB already paid for. The query finishes fast; the conversion is the wall.
Unbounded materialization. .fetchdf() and .fetchall() build the entire frame before you can touch it. Whether the working set even fits is governed by the in-memory vs disk storage boundaries for your machine; past that line the process is killed mid-fetch.

A correctly built converter is bounded by the size of the result itself: $O(N)$ in row count with a single vectorized parse pass, versus the naive $O(N)$ row-by-row parse that carries a large per-row Python constant and several redundant buffer copies.
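As a rough illustration of the two parse paths compared above, the sketch below builds a small synthetic point set (the coordinates and the row count are arbitrary assumptions, not drawn from any real table), then reconstructs it once through per-row WKT parsing and once through a single shapely.from_wkb() call. It assumes shapely >= 2.0 and numpy are installed. Timings vary by machine, so none are asserted here.

```python
import time

import numpy as np
import shapely
from shapely import wkt

# Synthetic stand-in for a query result: 20,000 points (arbitrary size).
rng = np.random.default_rng(seed=0)
coords = rng.uniform(-180.0, 180.0, size=(20_000, 2))
points = shapely.points(coords)

# The same geometries in both interchange encodings.
wkt_strings = shapely.to_wkt(points, rounding_precision=-1).tolist()
wkb_blobs = shapely.to_wkb(points).tolist()

# Path 1: one Python-level parse call per row (the naive pattern).
start = time.perf_counter()
from_text = np.empty(len(wkt_strings), dtype=object)
from_text[:] = [wkt.loads(s) for s in wkt_strings]
wkt_seconds = time.perf_counter() - start

# Path 2: one vectorized call over the whole column of WKB blobs.
start = time.perf_counter()
from_binary = shapely.from_wkb(wkb_blobs)
wkb_seconds = time.perf_counter() - start

print(f"per-row WKT: {wkt_seconds:.3f}s, vectorized WKB: {wkb_seconds:.3f}s")
```

Both paths yield the same geometries; only the amount of Python-level work per row differs.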

Deterministic Configuration

Configure the session for vectorized geometry export before any query runs. Confirm the prerequisites first:

spatial extension installed and loaded in the session
memory_limit set below total RAM, with headroom for the Arrow result buffer
A fast local temp_directory for graceful spill
shapely >= 2.0 available (vectorized from_wkb lives there)

import duckdb
import pandas as pd
import shapely
import geopandas as gpd

con = duckdb.connect(config={
    # Match physical cores: Arrow export parallelizes, so undersubscribing
    # leaves throughput on the table; oversubscribing only raises peak memory.
    "threads": 8,
    # Cap resident memory below total RAM; the result buffer and the spill
    # region must both fit, so leave headroom rather than claiming all of it.
    "memory_limit": "16GB",
    # Route spill to fast local NVMe — DuckDB writes here when memory_limit
    # is breached, and a slow disk turns a spill into a stall.
    "temp_directory": "/tmp/duckdb_temp",
})
con.execute("INSTALL spatial; LOAD spatial;")
# Export string/BLOB columns with 64-bit offsets (Arrow large types) so a
# geometry buffer that grows very large does not overflow 32-bit offsets.
con.execute("SET arrow_large_buffer_size = true;")

The single most important rule: never let geometry leave DuckDB as WKT. Project it as raw Well-Known Binary with ST_AsWKB(geom) so the Arrow column arrives as a binary BLOB, ready for one vectorized shapely.from_wkb() call. This is the same binary-boundary discipline used across the Shapely integration patterns.
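
To make the size argument concrete, the sketch below hand-encodes a 2D point as little-endian WKB using only the standard library and compares it with the equivalent WKT text. The coordinates are arbitrary and the helper name encode_point_wkb is hypothetical; in the actual pipeline DuckDB's ST_AsWKB produces these bytes.

```python
import struct


def encode_point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (illustrative helper)."""
    # 1-byte byte-order flag (1 = little endian), uint32 geometry type
    # (1 = Point), then two float64 coordinates: 1 + 4 + 8 + 8 = 21 bytes.
    return struct.pack("<BIdd", 1, 1, x, y)


x, y = -122.41941550000001, 37.774929500000003
wkb = encode_point_wkb(x, y)
wkt_text = f"POINT ({x!r} {y!r})"

# WKB is fixed-width per coordinate and decodes without any text parsing.
byte_order, geom_type, dx, dy = struct.unpack("<BIdd", wkb)
print(len(wkb), len(wkt_text.encode("ascii")))
```

The binary form has a fixed size per coordinate regardless of how many decimal digits the value needs, which is what lets a parser walk the buffer without tokenizing text.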

Optimized Execution Pattern

The behavioral change is small in code and enormous in cost. Before — implicit Pandas materialization with stringified geometry:

# ANTI-PATTERN: full Pandas frame, WKT strings, text parse for every row.
# ST_AsText makes the WKT projection explicit so from_wkt receives strings.
df = con.execute("SELECT id, name, ST_AsText(geom) AS geom FROM regions").fetchdf()
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.GeoSeries.from_wkt(df["geom"]),  # parses every WKT string
    crs="EPSG:4326",
)

After — Arrow result with WKB geometry and a single vectorized reconstruction:

def fetch_geodataframe(query: str, geometry_col: str = "geom_wkb") -> gpd.GeoDataFrame:
    """Convert a DuckDB query to a GeoDataFrame via Arrow + one vectorized WKB parse.

    The query MUST project geometry as ST_AsWKB(geom) AS geom_wkb so the Arrow
    column is a binary BLOB rather than a WKT string.
    """
    arrow_table = con.execute(query).fetch_arrow_table()

    # Pull the geometry column out as raw bytes and parse it in one vectorized
    # call — this replaces N per-row shapely.wkt.loads() invocations.
    wkb_bytes = arrow_table.column(geometry_col).to_pylist()
    geometries = shapely.from_wkb(wkb_bytes)

    # Build the attribute frame from the remaining Arrow columns (no geometry),
    # so Pandas never sees the BLOB column and never infers an object dtype for it.
    attr_cols = [c for c in arrow_table.column_names if c != geometry_col]
    df = arrow_table.select(attr_cols).to_pandas()

    return gpd.GeoDataFrame(df, geometry=geometries, crs="EPSG:4326")


gdf = fetch_geodataframe(
    "SELECT id, name, ST_AsWKB(geom) AS geom_wkb FROM regions"
)

The annotated diff: geometry is projected to WKB in SQL (ST_AsWKB), the result is pulled as an Arrow table rather than a Pandas frame (fetch_arrow_table vs fetchdf), and reconstruction is one shapely.from_wkb(list_of_bytes) instead of a per-row from_wkt. Attribute columns stay on the fast Arrow→Pandas path; only the geometry detours through Shapely.

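When the single-shot conversion's duplicated buffers would exceed available RAM, stream the result as Arrow record batches so the intermediate Arrow and WKB buffers are bounded by the chunk rather than by the full result. The assembled GeoDataFrame must still fit in memory; a result that cannot fit at all has to be processed or written out chunk by chunk instead. This is the same chunking discipline detailed in the batch processing pipelines workflow: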

def fetch_geodataframe_batched(
    query: str, geometry_col: str = "geom_wkb", chunk_size: int = 100_000
) -> gpd.GeoDataFrame:
    """Stream Arrow record batches so transient buffers track chunk_size.

    Only the intermediate Arrow batch and WKB byte list are bounded by the
    chunk; the returned GeoDataFrame still holds every row.
    """
    reader = con.execute(query).fetch_record_batch(chunk_size)
    frames = []
    for batch in reader:
        if batch.num_rows == 0:
            continue
        geometries = shapely.from_wkb(batch.column(geometry_col).to_pylist())
        attr_cols = [c for c in batch.schema.names if c != geometry_col]
        df = batch.select(attr_cols).to_pandas()
        frames.append(gpd.GeoDataFrame(df, geometry=geometries, crs="EPSG:4326"))
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return gpd.GeoDataFrame(geometry=[], crs="EPSG:4326")
    return gpd.GeoDataFrame(pd.concat(frames, ignore_index=True), crs="EPSG:4326")

For pipelines that must not block an event loop while a large geometry scan runs, dispatch the blocking execute() on a thread and await it — the connection-isolation rules for that live in the async execution patterns reference.

Diagnostic Queries & Plan Validation

Measure the result before materializing it, then confirm the export path is actually vectorized. First, size the payload so you can route to the right converter:

-- One-liner sizing probe: geometry bytes + row count before any fetch.
SELECT
    sum(octet_length(ST_AsWKB(geom)))           AS geo_bytes,
    count(*)                                     AS row_count,
    sum(octet_length(ST_AsWKB(geom))) / count(*) AS avg_geom_bytes
FROM your_table;

Interpretation thresholds: an avg_geom_bytes in the tens of bytes is point/line data that converts trivially; tens of kilobytes signals dense multipolygons where ST_Simplify pre-filtering pays off. If geo_bytes approaches your memory_limit, route to the batched converter rather than the single-shot one.
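
The routing rule in that last sentence can be sketched as a small pure function. The 0.5 headroom factor and the name choose_converter are illustrative assumptions (the text only says "approaches"); tune the factor to how many intermediate copies your conversion path makes.

```python
def choose_converter(
    geo_bytes: int, memory_limit_bytes: int, headroom: float = 0.5
) -> str:
    """Return 'batched' when the geometry payload nears the memory limit.

    headroom is the fraction of memory_limit the payload may occupy before
    the single-shot path is considered unsafe (assumed value, not measured).
    """
    if memory_limit_bytes <= 0:
        raise ValueError("memory_limit_bytes must be positive")
    if geo_bytes >= headroom * memory_limit_bytes:
        return "batched"
    return "single"


GIB = 1024**3
small = choose_converter(geo_bytes=2 * GIB, memory_limit_bytes=16 * GIB)
large = choose_converter(geo_bytes=12 * GIB, memory_limit_bytes=16 * GIB)
print(small, large)
```

Feed it the geo_bytes value from the sizing probe and the byte equivalent of the session's memory_limit.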

Validate that the query itself is not the bottleneck with EXPLAIN ANALYZE. The conversion is only worth optimizing once the scan is clean:

plan = con.execute(
    "EXPLAIN ANALYZE SELECT id, ST_AsWKB(geom) AS geom_wkb FROM regions"
).fetchall()
print(plan[0][1])

Read the plan for two things: a TABLE_SCAN (or PARQUET_SCAN) whose actual row count matches the estimate — large estimate drift means the optimizer is mis-sizing buffers — and the total operator time. If query time is small but your wall-clock is large, the cost is in the Python conversion, confirming the WKB-vs-WKT diagnosis above. Loading source data as GeoParquet keeps that scan column-pruned and predicate-pushed so the geometry buffer is the only large object in flight.
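
One way to check where the wall-clock time goes is to time the fetch and the conversion as separate stages. The sketch below is generic: time_stages is a hypothetical helper, and the two lambdas are stand-ins for con.execute(query).fetch_arrow_table() and the shapely.from_wkb() reconstruction, so it runs without a database.

```python
import time
from typing import Any, Callable


def time_stages(
    fetch: Callable[[], Any], convert: Callable[[Any], Any]
) -> tuple[Any, dict[str, float]]:
    """Run fetch then convert, returning the result and per-stage seconds."""
    t0 = time.perf_counter()
    raw = fetch()
    t1 = time.perf_counter()
    result = convert(raw)
    t2 = time.perf_counter()
    return result, {"fetch_s": t1 - t0, "convert_s": t2 - t1}


# Stand-in stages; replace with the real fetch and conversion callables.
result, timings = time_stages(
    fetch=lambda: list(range(1_000)),
    convert=lambda rows: [r * 2 for r in rows],
)
print(timings)
```

If convert_s dominates fetch_s on the real workload, the Python conversion is the cost centre and the WKB path is where to look.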

Geometry Validation & Fallback Routing

shapely.from_wkb() will happily reconstruct topologically invalid geometry: self-intersections survive the byte round-trip and only explode later inside a GeoPandas overlay, while an unclosed ring is typically rejected outright at parse time. Guard the boundary in SQL with ST_IsValid, repairing with ST_MakeValid, and route on estimated size so a large result never reaches the single-shot converter.

def safe_fetch(query: str, memory_threshold_gb: float = 12.0) -> gpd.GeoDataFrame:
    """Repair invalid geometry in SQL, size the payload, route to the right converter."""
    # Project valid WKB once; ST_MakeValid repairs invalid geometry at the
    # source so Shapely never reconstructs a broken shape.
    # NOTE: the string replace assumes the query starts with a plain "SELECT "
    # (no CTE, no trailing semicolon) and exposes a column named geom.
    projected = query.replace(
        "SELECT ", "SELECT ST_AsWKB(ST_MakeValid(geom)) AS geom_wkb, ", 1
    )
    est_gb = con.execute(
        f"SELECT sum(octet_length(geom_wkb))::DOUBLE / 1073741824.0 FROM ({projected}) t"
    ).fetchone()[0]

    if est_gb and est_gb > memory_threshold_gb:
        return fetch_geodataframe_batched(projected, chunk_size=50_000)  # bounded path
    return fetch_geodataframe(projected)                                  # fast path
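
The survival of invalid geometry across the WKB boundary is easy to reproduce client-side. This sketch assumes shapely >= 2.0 and uses a hand-written self-intersecting "bowtie" polygon as test data; shapely.make_valid plays the role that ST_MakeValid plays in the SQL above.

```python
import shapely

# A self-intersecting "bowtie": the ring crosses itself at (0.5, 0.5).
bowtie = shapely.from_wkt("POLYGON ((0 0, 1 1, 1 0, 0 1, 0 0))")

# Round-trip through WKB, as the converter does.
restored = shapely.from_wkb([shapely.to_wkb(bowtie)])

# Parsing succeeds even though the geometry is topologically invalid.
valid_before = bool(shapely.is_valid(restored).all())
reason = shapely.is_valid_reason(restored[0])

# Repairing produces valid geometry from the same input.
repaired = shapely.make_valid(restored)
valid_after = bool(shapely.is_valid(repaired).all())
print(valid_before, reason, valid_after)
```

The parse raises nothing for the bowtie, which is why the repair belongs before export rather than after a failed overlay.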

Three fallbacks keep the conversion inside its memory envelope when the result is genuinely large:

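Tolerance-based simplification. ST_Simplify(geom, tolerance) applied in SQL before export can cut WKB byte count by 40–70% for coarse-scale analysis, depending on the tolerance and the data. Plain simplification can itself introduce invalid geometry, so prefer ST_SimplifyPreserveTopology or re-run ST_MakeValid afterwards, and apply it only when the analytical question tolerates generalized boundaries.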
Spill, don’t crash. Keep temp_directory on fast NVMe so a momentary breach of memory_limit spills to disk instead of killing the process mid-fetch.
Post-conversion assertions. After conversion, confirm the CRS survived and geometry is sound: check gdf.crs is set and shapely.is_valid(gdf.geometry).all() is true. A silent CRS drop here propagates into every downstream join; the alignment rules are in the CRS mapping & transformations reference.
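
The byte-count effect of the tolerance-based simplification fallback can be previewed client-side before committing to a tolerance. The sketch below assumes shapely >= 2.0 and uses a synthetic 720-vertex circle with an arbitrary tolerance of 0.01. shapely.simplify with preserve_topology=True stands in for a topology-preserving simplification in SQL, and the reduction on real data will differ.

```python
import math

import shapely

# Synthetic dense polygon: a unit circle approximated by 720 vertices.
n = 720
ring = [
    (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
    for i in range(n)
]
dense = shapely.Polygon(ring)

# Topology-preserving simplification with an illustrative tolerance.
simplified = shapely.simplify(dense, tolerance=0.01, preserve_topology=True)

dense_bytes = len(shapely.to_wkb(dense))
simplified_bytes = len(shapely.to_wkb(simplified))
reduction = 1 - simplified_bytes / dense_bytes
print(f"{dense_bytes} -> {simplified_bytes} bytes ({reduction:.0%} smaller)")
```

Running this against a sample of real geometries at a few tolerances gives a grounded estimate before the setting goes into the export query.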

See also

Shapely Integration — the binary-boundary discipline for moving geometry between GEOS and DuckDB.
Running Async Spatial Queries in Python — keep a large geometry fetch off the event loop without blocking.
ST_Geometry vs WKB — why WKB is the cheaper interchange representation.
