Python & DuckDB Integration Workflows

This guide covers how to drive DuckDB Spatial from Python so that geometry stays in vectorized columnar buffers and never decays into per-row Python objects. It is written for data engineers, GIS analysts, and platform teams who orchestrate spatial pipelines in Python but want DuckDB’s embedded OLAP engine to do the heavy spatial compute. It is one of the three core areas on this Analytical SQL for GIS reference, sitting alongside the DuckDB Spatial architecture fundamentals and the modern spatial SQL query patterns that the SQL examples here build on. The governing principle throughout is a strict boundary: Python orchestrates, DuckDB computes, and geometry crosses between them as zero-copy Arrow buffers rather than serialized objects.

The IO membrane: Python orchestrates and DuckDB computes, exchanging geometry as zero-copy Arrow buffers rather than serialized objects.

Execution Model & Core Concepts

DuckDB runs in-process inside the Python interpreter as a columnar, vectorized query engine. There is no client/server socket, no wire protocol, and no network round-trip: import duckdb loads a shared library, and queries execute on threads the engine spawns inside your process. For spatial work this matters because geometry is large and irregular, so the cost of moving it tends to dominate everything else. Keeping the engine embedded means a 2-million-row geometry column can be filtered, joined, and aggregated without ever being copied across a process boundary.

The engine processes data in fixed-size column chunks (a “vector”, default 2048 values). Spatial predicates such as ST_Intersects, ST_DWithin, and ST_Contains are implemented as vectorized functions that sweep a whole vector at a time, applying a cheap bounding-box test before the exact topology test. This is the same vectorized pipeline described in the DuckDB Spatial architecture fundamentals, and it is why pushing predicates into SQL beats looping in Python: a Python for loop evaluates one geometry per interpreter step, while a single vectorized call inside DuckDB processes up to 2048 values with no per-row interpreter overhead.

The second core concept is the type membrane. Inside DuckDB a geometry column is the GEOMETRY extension type, stored as variable-length WKB bytes in a contiguous buffer with an offset vector. When results leave the engine, they cross into Python’s world through one of three doors, in increasing order of cost:

fetch_arrow_table() / fetch_record_batch() — geometry arrives as an Arrow binary/extension column with no copy. This is the fast path and the one to design around.
fetchnumpy() / df() — columns are materialized into NumPy or pandas. Numeric columns are cheap; geometry degrades to opaque WKB blobs in an object column.
fetchall() / fetchone() — rows become Python tuples. Every geometry becomes a bytes object on the heap. Avoid this for geometry columns entirely.

Understanding which door you use is the difference between a pipeline that holds a flat memory profile and one that triples its resident set on every fetch. The DuckDB to GeoPandas sync patterns exist precisely to keep you on the Arrow door for as long as possible and only rehydrate Shapely objects at the very end.

The orchestrate-vs-compute split

A useful mental model: treat every Python statement that touches a geometry value element-by-element as a bug until proven otherwise. Construction (ST_MakeEnvelope, ST_Point), validation (ST_IsValid, ST_MakeValid), measurement (ST_Area, ST_Length), and relation tests all belong in SQL. Python’s job is to assemble the SQL, manage connections and concurrency, route data between stages, and hand the final result to a consumer such as GeoPandas, Polars, or a Parquet writer.
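As a rough illustration of that split, the sketch below has Python assemble a parameterized query and leaves the geometry test to SQL. The helper name envelope_filter_sql, the parcels table, and the coordinate values are hypothetical; only ST_Intersects and ST_MakeEnvelope come from the text above.

```python
def envelope_filter_sql(table: str, geom_col: str = "geom") -> str:
    """Return a parameterized query that filters `table` by a bounding box in SQL."""
    # Identifiers cannot be bound as parameters, so validate them before interpolating.
    for ident in (table, geom_col):
        if not ident.isidentifier():
            raise ValueError(f"unsafe identifier: {ident!r}")
    return (
        f"SELECT * FROM {table} "
        f"WHERE ST_Intersects({geom_col}, ST_MakeEnvelope(?, ?, ?, ?))"
    )

sql = envelope_filter_sql("parcels")
params = [-0.5, 51.3, 0.3, 51.7]   # min_x, min_y, max_x, max_y (illustrative values)
# With a spatial-enabled connection: con.execute(sql, params).fetch_arrow_table()
```

Python never inspects a geometry here. It only builds the statement and supplies four floats, which keeps the per-row work inside the engine.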

Configuration Reference

A DuckDB connection in Python is configured either through the config dict passed to duckdb.connect() or via SET statements after the connection opens. For spatial workloads the spatial extension must be installed and loaded once per connection. Every setting below carries a real trade-off — none of them are free.

import duckdb

con = duckdb.connect(database=":memory:", config={
    "memory_limit": "8GB",        # hard ceiling; the engine spills past it instead of OOM-killing
    "threads": 4,                 # parallel pipeline lanes; more threads = more peak RAM per query
    "temp_directory": "/var/lib/duckdb/spill",  # where over-limit intermediates land
    "max_temp_directory_size": "20GB",          # cap the spill so a runaway join cannot fill the disk
    "preserve_insertion_order": False,          # lets aggregations reorder freely — faster, but output order is undefined
    "enable_object_cache": True,  # reuse parsed Parquet/metadata across queries on the same files
})
con.execute("INSTALL spatial; LOAD spatial;")

Trade-off: memory_limit and threads interact multiplicatively. Each thread can hold its own copy of an in-flight hash table or sort buffer, so peak memory roughly scales with threads. On an 8 GB box, four threads at a 6 GB limit is safer than eight threads at the same limit. Lower the thread count before lowering the memory limit when a spatial join thrashes.

Trade-off: preserve_insertion_order = False measurably speeds up grouped and windowed spatial aggregations because the engine no longer has to keep a stable row order, but downstream code must not assume the result order. If you need deterministic output, add an explicit ORDER BY instead of relying on insertion order.

For deeper tuning of the spill path and when the engine chooses disk over RAM, see in-memory vs disk storage. For workloads that fan out across many connections, the async execution patterns section explains why each worker should own its own cursor rather than sharing one configured connection.

Connection lifecycle in Python

A single Connection object is not free-threaded. The supported concurrency model is one connection with multiple cursors (con.cursor()), or one connection per worker. Spawning a fresh cursor() per task is cheap and gives each task an isolated transaction context while sharing the underlying database and buffer pool:

# Shared engine, isolated per-task cursors — the canonical concurrency unit
def run_stage(sql: str):
    cur = con.cursor()            # lightweight; shares buffers, isolates transaction state
    try:
        return cur.execute(sql).fetch_arrow_table()
    finally:
        cur.close()               # release the cursor promptly to free its result buffers

Ingestion & Format Support

DuckDB reads most geospatial formats directly into Arrow buffers without a staging copy, which is what makes Python ingestion fast. The reader functions are SQL functions, so you invoke them from con.execute() and never marshal file contents through Python.

GeoParquet — the zero-copy default

GeoParquet is the preferred interchange format because it is already columnar and its geometry column is WKB. DuckDB projects only the columns you select and pushes row-group filters down to the scan, so a narrow spatial query never reads the whole file. The mechanics of metadata and CRS extraction are detailed under GeoParquet parsing.

# Projection + predicate pushdown happen at the Parquet scan, before geometry decode
tbl = con.execute("""
    SELECT parcel_id, land_use, geom
    FROM read_parquet('s3://bucket/parcels/*.parquet', hive_partitioning = true)
    WHERE land_use = 'residential'
""").fetch_arrow_table()   # geometry stays WKB in an Arrow column — no Python copy

GeoJSON and other vector formats

GeoJSON, Shapefile, FlatGeobuf, and GDAL-backed sources are read through ST_Read. GeoJSON is row-oriented and must be parsed into WKB on ingest, so it is the slowest common path; convert it to GeoParquet once and read that thereafter. The batch conversion strategy lives in GeoJSON ingestion.

# One-time normalization: read GeoJSON, persist as GeoParquet for all future runs
con.execute("""
    COPY (SELECT * FROM ST_Read('boundaries.geojson'))
    TO 'boundaries.parquet' (FORMAT PARQUET)
""")

Pushing Python data into DuckDB

The membrane runs both ways. A pandas, Polars, or Arrow object in Python can be queried directly by name — DuckDB registers it as a virtual table with no copy when the source is Arrow-backed:

import pyarrow as pa

points = pa.table({"id": [1, 2], "wkt": ["POINT(0 0)", "POINT(1 1)"]})
con.execute("""
    SELECT id, ST_GeomFromText(wkt) AS geom
    FROM points                       -- the Arrow table is visible to SQL by variable name
""")

Trade-off: registering an Arrow table is zero-copy and ideal; registering a GeoPandas GeoDataFrame forces its geometry through WKB serialization. When the source is already a GeoDataFrame, convert its geometry to WKB explicitly so you control where the cost lands rather than letting an implicit conversion surprise you.

Query Planning & Optimization

Because the engine is embedded, EXPLAIN and EXPLAIN ANALYZE are run from Python exactly like any other query, and reading their output is the single most useful debugging skill for spatial pipelines. EXPLAIN shows the chosen plan; EXPLAIN ANALYZE runs the query and annotates each operator with real timing and row counts.

plan = con.execute("""
    EXPLAIN ANALYZE
    SELECT z.zone_id, COUNT(*) AS hits
    FROM zones z
    JOIN sensor_points p ON ST_Intersects(z.geom, p.geom)
    GROUP BY z.zone_id
""").fetchall()
print(plan[0][1])   # the rendered plan tree with per-operator timing

The things to look for in a spatial plan:

A bounding-box pre-filter ahead of the exact predicate. When an R-tree or implicit bbox filter is engaged you will see far fewer rows reaching the ST_Intersects evaluation than the cross product would produce. If the row count into the predicate equals N × M, no pre-filtering happened — revisit your join condition and indexing. The index mechanics are covered in spatial indexing internals.
Predicate placement. A spatial test in the ON or WHERE clause can be pushed into the join; the same test applied after the join (e.g. wrapped in a subquery or a HAVING) cannot. Keep spatial predicates in the join condition.
Plan shape drift. A spatial join that silently switches from a hash-style join to a nested-loop evaluation is the classic cause of a query that was fast last week and times out today. Capture the plan as a baseline so you can diff it.

The query patterns that benefit most from plan inspection — proximity joins, distance matrices, and grouped rollups — are documented under spatial joins and proximity filters and vectorized aggregations, and the planning advice there carries straight into Python because the SQL is identical.

Capturing a plan baseline from Python

Treat the execution plan as a testable artifact. Snapshot it on a representative dataset and fail your CI when the operator shape changes:

import json

def plan_signature(con, sql: str) -> list[str]:
    # EXPLAIN returns (explain_key, explain_value) rows; the JSON plan is the second column
    raw = con.execute(f"EXPLAIN (FORMAT JSON) {sql}").fetchone()[1]
    tree = json.loads(raw)
    names = []
    def walk(node):
        # Depth-first walk that records each operator name in plan order
        names.append(node.get("name", "?"))
        for child in node.get("children", []):
            walk(child)
    walk(tree[0] if isinstance(tree, list) else tree)
    return names   # e.g. ['HASH_GROUP_BY', 'HASH_JOIN', 'SEQ_SCAN', 'SEQ_SCAN']; exact names vary by DuckDB version

baseline = plan_signature(con, "SELECT z.zone_id FROM zones z JOIN sensor_points p ON ST_Intersects(z.geom, p.geom)")
# Assert against `baseline` in CI; a NESTED_LOOP_JOIN appearing where a HASH_JOIN was is a regression.
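The CI assertion itself can be as simple as a multiset comparison of operator names. The following sketch is one possible shape; plan_regressions is a hypothetical helper, and the two operator lists are made-up inputs rather than output captured from a real plan.

```python
from collections import Counter

def plan_regressions(baseline: list[str], current: list[str]) -> dict[str, list[str]]:
    """Compare two operator-name lists and report what appeared or disappeared."""
    before, after = Counter(baseline), Counter(current)
    return {
        "added": sorted((after - before).elements()),
        "removed": sorted((before - after).elements()),
    }

baseline_ops = ["HASH_GROUP_BY", "HASH_JOIN", "SEQ_SCAN", "SEQ_SCAN"]
current_ops = ["HASH_GROUP_BY", "NESTED_LOOP_JOIN", "SEQ_SCAN", "SEQ_SCAN"]
report = plan_regressions(baseline_ops, current_ops)
```

Comparing counts rather than exact sequences tolerates harmless reordering of siblings while still flagging a swapped join operator.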

Production Deployment Boundaries

DuckDB’s embedded model removes a whole class of operational concerns (no server to patch, no connection pool to size) and replaces them with process-level ones. The boundaries you must engineer around are memory, file access, and concurrency within the host process.

Resource isolation

Because the engine shares the host process’s address space, its memory_limit is the only thing standing between a heavy spatial join and the OS OOM killer. In a serverless function or container, set memory_limit below the container’s hard memory cap, leaving headroom for the Python interpreter itself and for the Arrow buffers you fetch out. A query that respects an 8 GB limit can still OOM an 8 GB container once you fetch a 3 GB Arrow table on top of it.
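The headroom arithmetic can be made explicit. In this sketch the helper name duckdb_memory_limit_mb and the 512 MB interpreter allowance are assumptions chosen for illustration, not recommended values; measure your own process before settling on numbers.

```python
def duckdb_memory_limit_mb(container_mb: int, interpreter_mb: int = 512, fetch_mb: int = 0) -> int:
    """Return a memory_limit in MB that leaves room for Python and fetched Arrow buffers."""
    budget = container_mb - interpreter_mb - fetch_mb
    if budget <= 0:
        raise ValueError("container too small for the requested headroom")
    return budget

# An 8 GB container, an assumed 512 MB for the interpreter, and a 3 GB Arrow result.
limit_mb = duckdb_memory_limit_mb(container_mb=8192, interpreter_mb=512, fetch_mb=3072)
set_stmt = f"SET memory_limit = '{limit_mb}MB'"
```

Under these assumptions the engine gets 4608 MB, well below the container cap, which is the gap that absorbs the fetched table.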

Multi-tenant and concurrent access

DuckDB has no GRANT/role system. Isolation between tenants is enforced at the file boundary: attach databases read-only and rely on OS file permissions.

con.execute("ATTACH 'tenant_a.duckdb' AS a (READ_ONLY)")  # tenant cannot mutate the shared store

For write workloads, a single DuckDB file allows one read/write process at a time. Concurrent analytical readers are fine; concurrent writers are not. When a pipeline needs parallelism, fan out reads across cursors but funnel writes through a single owner — the model the async execution patterns section formalizes, including running async spatial queries in Python without blocking the event loop.

Batching for throughput

Large ETL jobs should be partitioned so each unit of work fits comfortably under memory_limit rather than scanning an entire dataset in one query. Partitioning on a natural key (region, tile, date) aligns the work with the file layout and keeps spill off the disk. The partitioning and checkpointing strategy is the subject of the batch processing pipelines section.
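A simple form of checkpointing is to skip partitions whose output already exists. The sketch below assumes one output file per region in a local directory; pending_regions and the region names are hypothetical.

```python
import tempfile
from pathlib import Path

def pending_regions(regions: list[str], out_dir: Path) -> list[str]:
    """Return the regions whose output file does not exist yet."""
    return [r for r in regions if not (out_dir / f"{r}.parquet").exists()]

out_dir = Path(tempfile.mkdtemp())
(out_dir / "north.parquet").touch()   # pretend a previous run finished this region
todo = pending_regions(["north", "south", "east"], out_dir)
```

A rerun after a failure then processes only the missing regions. In practice you would probably also write to a temporary name and rename on success, so that a half-written file is never mistaken for a finished one.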

# Partition the workload so each query's working set stays in RAM
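regions = ["north", "south"]   # illustrative partition keys; take them from a trusted source
for region in regions:
    src = f"s3://bucket/parcels/region={region}/*.parquet"
    dst = f"out/{region}.parquet"
    # The COPY target is written as a literal here rather than bound as a parameter
    con.execute(f"""
        COPY (
            SELECT region_id, ST_Centroid(geom) AS centroid, ST_Area(geom) AS area_m2  -- metres only if geom is in a metric CRS
            FROM read_parquet('{src}')
            WHERE ST_IsValid(geom)
        ) TO '{dst}' (FORMAT PARQUET)
    """)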

Failure Modes & Diagnostics

Spatial pipelines rarely crash loudly. They degrade — silently spilling to disk, silently falling back to a slower join, or silently producing wrong answers because two layers disagreed on their coordinate reference system. The table below maps the common silent failures to the signal that exposes them and the fix.

Symptom	Likely cause	Diagnostic	Remediation
RSS climbs then job stalls	Spill thrash past `memory_limit`	`SELECT * FROM duckdb_temporary_files();` shows large/growing files	Lower `threads`, raise `memory_limit`, or partition the input
Query 10× slower than last run	Plan regression to nested-loop join	`EXPLAIN` shows `NESTED_LOOP_JOIN` where a hash join was	Restore the bbox predicate to the `ON` clause; check stats
Distances or areas off by orders of magnitude	CRS mismatch (degrees vs metres)	Coordinate ranges fall in ±180/±90 but areas are tiny	`ST_Transform` both layers to a common projected CRS first
`fetchall()` triples memory	Geometry materialized as Python `bytes`	Memory spikes only at fetch time	Switch to `fetch_arrow_table()`
Empty or partial join result	Invalid geometry skipped by predicate	`SELECT COUNT(*) FROM t WHERE NOT ST_IsValid(geom)` > 0	Guard with `ST_MakeValid(geom)` before the join

CRS drift is the most insidious of these because no error is ever raised — the numbers are simply wrong. DuckDB geometries carry no inline SRID, so the burden is on you to track each layer’s CRS and reproject before any join or measurement. The full treatment is in CRS mapping and transformations; the short version is to reproject to a common projected frame before computing anything metric.

# Pre-flight checks to run before a production spatial join
checks = con.execute("""
    SELECT
        SUM(CASE WHEN NOT ST_IsValid(geom) THEN 1 ELSE 0 END) AS invalid_geoms,
        MIN(ST_XMin(geom)) AS min_x, MAX(ST_XMax(geom)) AS max_x  -- |x| > 180 ⇒ already projected
    FROM layer_a
""").fetchone()
assert checks[0] == 0, f"{checks[0]} invalid geometries — run ST_MakeValid first"
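The extent values returned by a check like this can feed a coarse CRS sanity test. The heuristic below is only a sketch: looks_geographic and same_frame are hypothetical helpers, the extents are invented, and the test cannot tell two different projected systems apart or recognise projected data that happens to sit near the origin.

```python
def looks_geographic(min_x: float, max_x: float, min_y: float, max_y: float) -> bool:
    """Heuristic: True when the extent fits inside longitude/latitude bounds."""
    return -180.0 <= min_x <= max_x <= 180.0 and -90.0 <= min_y <= max_y <= 90.0

def same_frame(extent_a: tuple, extent_b: tuple) -> bool:
    """Two layers are comparable only if both look geographic or both look projected."""
    return looks_geographic(*extent_a) == looks_geographic(*extent_b)

# Extents are (min_x, max_x, min_y, max_y); both are illustrative values.
wgs84_extent = (-0.51, 0.33, 51.28, 51.69)
projected_extent = (503000.0, 562000.0, 155000.0, 201000.0)
compatible = same_frame(wgs84_extent, projected_extent)
```

A False result here is a reason to stop and reproject before joining; a True result is not proof that the layers share a CRS.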

Memory overflow and graceful spilling

When intermediates exceed memory_limit the engine spills to temp_directory rather than failing — but only if you gave it a writable directory and enough headroom. Cap the spill so a runaway query degrades to slow instead of filling the disk, and watch the spill files to know when a query has left the in-memory fast path:

con.execute("SET memory_limit = '6GB'")
con.execute("SET max_temp_directory_size = '20GB'")  # bound the blast radius of a runaway join
con.execute("SET enable_progress_bar = true")        # surface long-running spatial ops in logs
# While a query runs, poll for spill activity from a separate cursor or thread:
spilling = con.cursor().execute("SELECT count(*), sum(size) FROM duckdb_temporary_files()").fetchone()

When even spilling is not enough — for example a self-join on a very large, complex polygon layer — fall back to chunked execution: partition the driving side, run the join per partition, and union the results. This trades a single large query for many bounded ones and keeps each working set under the limit, the same fallback pattern the batch processing pipelines section applies at job scale.
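The partitioning step of that fallback can be sketched as a list of key ranges over the driving side. Here chunk_ranges is a hypothetical helper, and the polygons table with an integer id column is an assumption made for the example.

```python
def chunk_ranges(min_id: int, max_id: int, size: int) -> list[tuple[int, int]]:
    """Split the inclusive id range [min_id, max_id] into inclusive chunks of at most `size` ids."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [(lo, min(lo + size - 1, max_id)) for lo in range(min_id, max_id + 1, size)]

# Hypothetical table and column names; the join runs once per chunk of the driving side.
CHUNK_SQL = """
    SELECT a.id AS left_id, b.id AS right_id
    FROM polygons a
    JOIN polygons b ON a.id < b.id AND ST_Intersects(a.geom, b.geom)
    WHERE a.id BETWEEN ? AND ?
"""

chunks = chunk_ranges(1, 10, 4)
# For each (lo, hi) in chunks: con.execute(CHUNK_SQL, [lo, hi]) and append the rows to the output.
```

Each chunk bounds the driving side only, so the working set shrinks roughly in proportion to the chunk size while the union of all chunks still covers every pair once.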

See also

DuckDB to GeoPandas sync — preserve CRS metadata across the zero-copy handoff, including converting DuckDB queries to a GeoDataFrame efficiently.
Shapely integration — mapping Arrow geometry columns to Python objects only when SQL cannot express the operation.
Async execution patterns — connection-per-task isolation and overlapping IO with spatial compute.
Batch processing pipelines — partitioning and checkpointing for large-scale spatial ETL.
Modern spatial SQL query patterns — the SQL these workflows orchestrate, from vectorized aggregations to window functions for geospatial.


External Reference Standards: the zero-copy interchange described here is the Apache Arrow C Data Interface; the ST_ function semantics follow the OGC Simple Features specification; version-specific configuration flags and client semantics are in the DuckDB Python API documentation.
