DuckDB Spatial Architecture & Fundamentals

DuckDB Spatial operates as an embedded, vectorized OLAP extension rather than a traditional GIS server. This reference — one of the three core topic areas on the Analytical SQL for GIS home — explains how the engine actually executes geospatial work: its columnar data layout, memory and I/O boundaries, ingestion pipelines, coordinate-system handling, indexing mechanics, and the failure modes that silently erode performance in production. It is written for data engineers, GIS analysts, and Python developers who need to reason about why a spatial query behaves the way it does, not just which function to call. Every section pairs the engine internals with runnable SQL or Python and the diagnostic queries you use to confirm the engine is doing what you expect.

The vectorized pipeline: columnar ingestion feeds SIMD-accelerated spatial kernels, spilling to disk only when the working set exceeds memory_limit.

Execution Model & Core Concepts

DuckDB processes spatial data through a strictly columnar, vectorized execution pipeline. Unlike row-oriented engines that materialize one geometry object per record, DuckDB Spatial maintains variable-length geometry columns as contiguous byte arrays paired with offset vectors. This layout minimizes pointer chasing, enables SIMD-accelerated bounding box evaluation, and aligns with modern CPU cache hierarchies. The engine moves data through the plan in fixed-size vectors (2,048 values per chunk by default), so a spatial predicate is never evaluated one row at a time — it is applied across a whole vector of geometries before the next operator sees the batch.
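The "contiguous byte array plus offset vector" layout described above can be illustrated with a small pure-Python sketch. This is a conceptual model only, not DuckDB's internal structure; the helper names `pack` and `get` are hypothetical, and the byte strings stand in for encoded geometries.

```python
# Conceptual sketch: variable-length values stored as one contiguous buffer
# plus an offsets list. Not DuckDB's actual internal representation.

def pack(blobs):
    """Concatenate byte strings and record where each value ends."""
    data = bytearray()
    offsets = [0]
    for blob in blobs:
        data.extend(blob)
        offsets.append(len(data))
    return bytes(data), offsets


def get(data, offsets, i):
    """Recover value i by slicing between two adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]]


# Placeholder payloads standing in for three encoded geometries.
blobs = [b"POINT", b"LINESTRING", b"POLYGON"]
data, offsets = pack(blobs)

print(offsets)                # [0, 5, 15, 22]
print(get(data, offsets, 1))  # b'LINESTRING'
```

Reading value `i` costs two integer lookups and one slice of a single buffer, which is why a scan over such a column touches memory sequentially instead of following one pointer per row.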

The in-memory geometry representation is well-known binary (WKB) wrapped in DuckDB’s native GEOMETRY type. The distinction matters for performance: the GEOMETRY type carries a cached bounding box and validity state that the planner can read without decoding vertices, whereas a raw BLOB of WKB forces a full parse on every touch. The trade-offs between storing columns as GEOMETRY versus carrying portable WKB are covered in depth in ST_Geometry vs WKB storage; as a rule, materialize geometries as GEOMETRY at ingestion and reserve WKB for interchange across process boundaries.
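To make the cost of "a full parse on every touch" concrete, here is a minimal sketch of what decoding a single WKB point involves, using only the standard `struct` module. It handles the 2D Point case only; the function names are hypothetical and this is not how the extension itself decodes geometries.

```python
import struct

WKB_POINT = 1  # geometry type code for Point in the WKB encoding


def encode_wkb_point(x, y):
    """Encode a 2D point as little-endian WKB: order byte, type, x, y."""
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_wkb_point(wkb):
    """Decode a 2D WKB point, honouring the byte-order flag."""
    endian = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"not a WKB point (type code {geom_type})")
    # Coordinates start after the 1-byte order flag and 4-byte type code.
    return struct.unpack_from(endian + "dd", wkb, 5)


wkb = encode_wkb_point(-122.4194, 37.7749)
print(len(wkb))               # 21
print(decode_wkb_point(wkb))  # (-122.4194, 37.7749)
```

Even this simplest case needs a byte-order check and a type check before any coordinate is available; polygons add ring counts and vertex counts on top, which is the work a decoded column avoids repeating.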

Spatial predicates execute as a two-phase evaluation. First, a cheap minimum-bounding-rectangle (MBR) overlap test runs across the vector using the cached bounding boxes. Only geometries that survive the MBR filter proceed to the expensive exact topology computation (ST_Intersects, ST_Contains, ST_Within). Without this short-circuit, a spatial join between two tables of cardinality $N$ and $M$ degrades to $O(N \times M)$ vertex comparisons. The MBR pre-filter is what turns that into a tractable workload, and the structures that accelerate it are explained in the spatial indexing internals reference.
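The two-phase evaluation can be sketched in plain Python as a bounding-box test followed by an exact point-in-polygon test. This is an illustration of the technique, assuming a simple ray-casting containment check and hypothetical helper names; the engine's own kernels are vectorized and considerably more involved.

```python
def bbox(ring):
    """Minimum bounding rectangle of a ring as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(box, pt):
    """Phase 1: cheap rectangle test, four comparisons per point."""
    xmin, ymin, xmax, ymax = box
    x, y = pt
    return xmin <= x <= xmax and ymin <= y <= ymax


def point_in_ring(ring, pt):
    """Phase 2: exact ray-casting test, one pass over every edge."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


triangle = [(0, 0), (10, 0), (0, 10)]
points = [(1, 1), (9, 9), (50, 50), (-3, 2)]

box = bbox(triangle)
candidates = [p for p in points if in_bbox(box, p)]            # phase 1
hits = [p for p in candidates if point_in_ring(triangle, p)]   # phase 2

print(candidates)  # [(1, 1), (9, 9)]
print(hits)        # [(1, 1)]
```

The point (9, 9) shows why phase 2 is still needed: it lies inside the bounding box but outside the triangle. The two points outside the box never reach the per-edge loop at all.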

Two architectural facts shape every decision downstream. First, DuckDB is single-process and embedded — it runs inside your Python interpreter, your data pipeline, or your serverless function; there is no separate daemon, connection pool, or shared buffer cache to tune. Second, the engine is deterministic about memory only if you make it so. Geometry expansion from operations like ST_Buffer, ST_Union, or ST_Intersection can inflate a working set far beyond the size of the source columns, and whether that inflation stays in RAM or spills to disk is governed entirely by your configuration. The behavioral split between resident and disk-backed execution is the subject of the in-memory vs disk storage guide.

Why columnar wins for spatial analytics

Row-oriented spatial databases optimize for transactional access: fetch one feature, update its geometry, commit. Analytical GIS work has the opposite shape — scan millions of geometries, filter by a bounding region, aggregate by zone. Columnar storage lets DuckDB read only the geometry column (and any projected attributes) while skipping unrelated columns entirely, and the contiguous layout means the MBR pre-filter streams through L2/L3 cache instead of dereferencing scattered heap objects. The result is that a well-formed scan-and-aggregate over a GeoParquet dataset is bandwidth-bound rather than latency-bound, which is exactly the regime where vectorization pays off.

Configuration Reference

Memory behavior is predictable only when you configure it explicitly. Spatial operations frequently trigger temporary materialization, and DuckDB’s default memory_limit (80% of physical RAM) leaves little headroom on a shared host, so an unchecked geometry expansion can push the process into swap or get it killed by the operating system. Configure hard boundaries at session initialization, and treat every knob as a stated trade-off rather than a default to accept silently:

-- Enforce a memory ceiling; spatial expansion (ST_Buffer/ST_Union) can
-- multiply the working set, so size this BELOW total RAM to leave headroom.
SET memory_limit = '8GB';

-- More threads speed vectorized kernels until memory bandwidth saturates;
-- past the saturation point extra threads only increase peak memory.
SET threads = 4;

-- Where intermediate results spill once memory_limit is hit. Put this on
-- fast local NVMe — a slow spill directory turns OOM-avoidance into I/O thrash.
SET temp_directory = '/var/lib/duckdb/spill';

-- Cap total spill so a runaway overlay cannot fill the disk and crash the host.
SET max_temp_directory_size = '50GB';

-- Cache Parquet metadata (footers, statistics) across queries to cut re-scan
-- latency on repeated GeoParquet reads. Recent DuckDB releases treat this as a
-- legacy no-op, so check the behavior on your version.
SET enable_object_cache = true;

One of the most impactful settings for large analytical workloads is preserve_insertion_order. DuckDB preserves row order by default so result sets are reproducible, but order preservation forces additional buffering that penalizes large vectorized aggregations and raises peak memory. When the output order is irrelevant, which is the common case for spatial summaries grouped by zone, disabling it removes a real bottleneck:

import duckdb

con = duckdb.connect(config={
    "threads": 8,
    "memory_limit": "16GB",
    "temp_directory": "/mnt/fast-ssd/duckdb_spill",
    "enable_object_cache": True,
    # Order preservation buffers whole pipelines; turn it off when the final
    # ORDER BY (or no ordering at all) makes intermediate order meaningless.
    "preserve_insertion_order": False,
})
con.execute("INSTALL spatial; LOAD spatial;")

The INSTALL spatial; LOAD spatial; pair is mandatory: the spatial functions, the GEOMETRY type, and the R-tree index access method all live in the extension, not the core engine. In locked-down or air-gapped deployments, pre-stage the extension binary and keep allow_unsigned_extensions at false (the default) so the process refuses to load anything that is not signed. Sizing guidance for raster-heavy and large-geometry workloads, where the memory ceiling interacts with tile decoding, is detailed in memory limits for large raster data, and a from-scratch local setup walkthrough lives in setting up the DuckDB Spatial CLI.

Ingestion & Format Support

DuckDB Spatial bypasses row-by-row serialization by reading geospatial formats directly into Arrow memory buffers. The extension supports columnar and semi-structured spatial payloads without an intermediate conversion step, which is what makes its ingestion path “zero-copy” in the meaningful sense: bytes move from the file or object store into Arrow buffers that the vectorized kernels read in place.

GeoParquet and Parquet

GeoParquet layers spatial metadata on top of the standard Parquet columnar format. DuckDB reads geometry columns as binary WKB and applies vectorized decoding during the scan, while honoring Parquet’s row-group statistics for predicate pushdown. The full metadata-extraction and OGC-compliance path — including how CRS information is recovered from the file header — is documented in the GeoParquet parsing reference. Project only the columns you need so the scanner skips unrelated column chunks entirely:

-- Direct GeoParquet ingestion with schema projection and partition pruning.
-- Listing only three columns lets the scanner skip every other column chunk.
CREATE OR REPLACE TABLE parcels AS
SELECT parcel_id, land_use_code, geometry
FROM read_parquet('s3://bucket/parcels/*.parquet', hive_partitioning = true);

When you are deciding whether to migrate an existing archive to GeoParquet at all, the head-to-head numbers in GeoParquet vs Shapefile performance quantify the scan-time and storage difference; the short version is that the columnar format’s statistics and projection make it the default choice for analytical access.

GeoJSON and semi-structured payloads

GeoJSON ingestion requires parsing nested JSON into WKB. The bulk path through st_read() is dramatically more efficient than manual JSON extraction for well-formed files, because it streams features through the GDAL-backed reader instead of building intermediate JSON objects per row. The batch conversion strategies and memory-efficient patterns for large feature collections are covered in the GeoJSON ingestion workflow:

-- Preferred: stream GeoJSON into a spatial table via st_read.
CREATE OR REPLACE TABLE boundaries AS
SELECT geom, name
FROM st_read('s3://bucket/boundaries.geojson');

-- Fallback for non-standard JSON layouts: extract and convert explicitly.
-- Assumes each input row exposes the feature as a JSON column named data.
-- maximum_object_size guards against a single oversized feature OOM-ing the parse.
CREATE OR REPLACE TABLE boundaries_manual AS
SELECT
    -- json_extract_string returns bare text; json_extract(...)::VARCHAR keeps the JSON quotes.
    json_extract_string(data, '$.properties.name') AS boundary_name,
    st_geomfromgeojson(json_extract(data, '$.geometry')::VARCHAR) AS geometry
FROM read_json_auto('s3://bucket/boundaries/*.json', maximum_object_size = 10485760);

Beyond these two, st_read() also fronts shapefiles, GeoPackage, FlatGeobuf, and the rest of the GDAL vector drivers, so legacy formats land in the same GEOMETRY columns without a separate conversion tool. The practical rule across all formats is identical: decode once at ingestion, store as GEOMETRY, and never re-parse text in a hot path.

Coordinate Reference Systems & Geodetic Precision

DuckDB geometries are stored without inline CRS metadata. A GEOMETRY value is just coordinates and topology; it carries no SRID tag the way a PostGIS geometry does. This is a deliberate simplification that keeps the columnar layout compact, but it means correctness is your responsibility: every spatial operation assumes its inputs share a reference frame, and the engine will happily compute a meaningless intersection between a layer in degrees and a layer in metres. How EPSG codes are resolved and applied during transformation is detailed in the CRS mapping and transformations reference, with the resolution and caching mechanics broken out further in how DuckDB Spatial handles coordinate systems.

Track each layer’s CRS explicitly in your own catalog and reproject to a common frame with ST_Transform before any join, overlay, or distance computation:

-- Reproject layer_a into layer_b's frame inside the join so both operands
-- share EPSG:3857 before the predicate runs. Mixing frames silently returns
-- wrong (often empty) results rather than an error.
-- always_xy := true treats the input as lon/lat (x, y); without it the
-- EPSG:4326 authority order (lat, lon) applies and the axes can end up swapped.
SELECT a.id, b.zone
FROM layer_a a
JOIN layer_b b
  ON ST_Intersects(
       ST_Transform(a.geometry, 'EPSG:4326', 'EPSG:3857', always_xy := true),
       b.geometry  -- already EPSG:3857
     );

Precision drift is the second CRS hazard. Planar predicates evaluated on geographic (degree) coordinates produce distances and areas in degrees, not metres: an ST_DWithin(..., 100) against EPSG:4326 data means “100 degrees”, which is most of the planet. Validate coordinate ranges as a cheap sanity check (geographic data must fall within ±180 longitude and ±90 latitude) and confirm every layer is on the same metric CRS before computing areas or buffers. Note also that ST_Transform is not free: reprojecting inside a join evaluates the transform for every candidate pair, so where possible reproject once into a materialized column rather than per-row inside the predicate.
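A quick piece of arithmetic shows how far off a degree-valued distance is. The sketch below assumes a spherical Earth with a mean radius of 6,371,008.8 m, which is an approximation rather than a geodetic computation; the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def metres_per_degree_lat():
    """Arc length of one degree of latitude on the sphere."""
    return math.pi * EARTH_RADIUS_M / 180.0


def metres_per_degree_lon(lat_deg):
    """One degree of longitude shrinks with the cosine of the latitude."""
    return metres_per_degree_lat() * math.cos(math.radians(lat_deg))


def metres_to_degrees_lat(metres):
    """Convert a north-south distance in metres to degrees of latitude."""
    return metres / metres_per_degree_lat()


print(round(metres_per_degree_lat()))        # about 111 km per degree
print(round(metres_per_degree_lon(60.0)))    # roughly half that at 60 degrees north
print(round(100 * metres_per_degree_lat()))  # "100 degrees" expressed in metres
print(metres_to_degrees_lat(100.0))          # 100 m expressed in degrees
```

Under this approximation a threshold of 100 on degree data spans roughly 11,000 km north to south, while a genuine 100 m tolerance is on the order of 0.0009 degrees. The longitude figure also depends on latitude, which is why a single degree threshold cannot stand in for a metric one.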

Query Planning & Optimization

Spatial predicates and functions compile into vectorized kernels, and the planner aggressively pushes filters down to the scan phase. Two tools tell you whether that happened: EXPLAIN shows the physical plan and EXPLAIN ANALYZE adds per-operator timing and actual-vs-estimated row counts. Read the plan before you tune anything — most “slow spatial query” reports are a missing pushdown, not a slow function.

-- Inspect the plan: look for the bbox filter landing inside the scan,
-- not a TABLE_SCAN feeding a separate downstream FILTER.
EXPLAIN
SELECT zone_id,
       count(*)            AS parcel_count,
       sum(ST_Area(geometry)) AS total_area_m2
FROM parcels
WHERE ST_Intersects(
        geometry,
        ST_GeomFromText('POLYGON((0 0, 100 0, 100 100, 0 100, 0 0))'))
GROUP BY zone_id;  -- required: zone_id is selected alongside aggregates

-- Measure it: compare actual vs estimated rows per operator and watch
-- for spill ("Temporary files") appearing under load.
EXPLAIN ANALYZE
SELECT p.zone_id, count(*) AS parcel_count
FROM parcels p
JOIN flood_zones f ON ST_Intersects(p.geometry, f.geometry)
GROUP BY p.zone_id;

The canonical optimization for spatial joins is the two-stage predicate: put the cheap MBR overlap operator && in the ON clause so the planner can route it through an R-tree index, and keep the exact topology check as a WHERE filter. This pruning routinely eliminates 85–99% of candidate pairs before any vertex math runs:

-- Stage 1 (&&) prunes pairs via the R-tree; Stage 2 (ST_Contains) runs exact
-- topology only on survivors. Folding both into one ST_Contains forfeits the index.
SELECT p.id, b.name
FROM points p
JOIN boundaries b
  ON b.geom && ST_Point(p.lon, p.lat)
WHERE ST_Contains(b.geom, ST_Point(p.lon, p.lat));

When the plan shows SpatialFilter or SpatialJoin nodes, the planner has successfully isolated the MBR stage. When it shows a plain nested-loop join over TABLE_SCAN nodes with no index access, you are paying the full $O(N \times M)$ cost — create a persistent R-tree first, as the spatial indexing internals reference walks through. The broader catalogue of predicate-ordering and proximity patterns lives in the Modern Spatial SQL Query Patterns reference, with proximity-specific tuning in spatial joins and proximity filters and grouped-summary throughput in vectorized aggregations.

Production Deployment Boundaries

DuckDB Spatial is designed for embedded deployment inside analytical applications, data pipelines, and serverless functions. It exposes no network listener and manages no concurrent client sessions — each process instance owns an isolated memory space and must be given explicit resource limits to avoid contention with whatever else shares the host. This is a feature for analytics (no daemon to operate, no connection pool to size) and a constraint for multi-tenancy (no built-in session isolation).

Because there is no SQL GRANT/role system, enforce access at the boundary. Attach production databases read-only and rely on OS-level file permissions to separate tenants; expose curated views rather than raw tables:

-- Attach read-only so downstream consumers physically cannot mutate the source.
ATTACH 'production.duckdb' AS prod (READ_ONLY);

-- The target schema must exist before the view is created.
CREATE SCHEMA IF NOT EXISTS analytics;

-- Publish a validated view instead of the raw table.
CREATE OR REPLACE VIEW analytics.spatial_summary AS
SELECT id, geom FROM prod.parcels WHERE ST_IsValid(geom);

The orchestration layer that drives these embedded instances is the subject of the Python & DuckDB integration workflows reference. Three patterns recur in production. For concurrency, async execution patterns cover running spatial scans off the event loop without blocking, since a single connection serializes its own queries. For throughput, batch processing pipelines cover partitioned, chunked ingestion that keeps each instance under its memory ceiling. For handoff, DuckDB-to-GeoPandas sync and Shapely integration cover moving geometry into the Python ecosystem over Arrow without serializing through WKT. In every case the deployment unit is one process with one explicit memory budget — scale by running more bounded instances, not by oversubscribing one.
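The "more bounded instances" rule comes down to simple budgeting: divide the RAM you are willing to give DuckDB across the number of processes on the host. The helper below is a hypothetical sketch of that arithmetic; the headroom figure and the integer-gigabyte rounding are assumptions, not recommendations from the engine.

```python
def per_instance_memory_limit(total_ram_gb, headroom_gb, instances):
    """Split usable RAM evenly and return a memory_limit string per process.

    headroom_gb is what you reserve for the OS and other services; the
    result is rounded down to whole gigabytes so the sum never exceeds it.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    usable_gb = total_ram_gb - headroom_gb
    if usable_gb <= 0:
        raise ValueError("headroom leaves no usable memory")
    per_instance_gb = usable_gb // instances
    if per_instance_gb < 1:
        raise ValueError("too many instances for the available memory")
    return f"{per_instance_gb}GB"


# A 64 GB host, 8 GB held back, four worker processes.
print(per_instance_memory_limit(64, 8, 4))  # 14GB
```

The returned string is in the form the memory_limit setting accepts, so each worker can pass it in its connection config. Raising an error when the split drops below one gigabyte makes oversubscription fail loudly at startup rather than as a spill storm later.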

Failure Modes & Diagnostics

Most spatial performance problems in DuckDB are silent: the query still returns rows, just slowly or with subtly wrong results. The four degradation patterns below account for the overwhelming majority of production incidents, each with a detection query and a remediation.

Diagnostic — uncontrolled spill (OOM avoidance gone wrong): when a working set exceeds memory_limit, the engine spills intermediates to temp_directory. That keeps the query alive but converts a memory-bound job into an I/O-bound one, often a 10–100x slowdown if the spill disk is slow. Watch for it directly:

-- Non-empty during a query = the engine is spilling. If this is steady-state,
-- raise memory_limit, lower threads (fewer concurrent buffers), or chunk the job.
SELECT * FROM duckdb_temporary_files();

Diagnostic — plan regression (lost pushdown): an index drop, stale statistics, or an innocuous query rewrite can move the bbox filter out of the scan and collapse the join back to a nested loop. Detect it by diffing EXPLAIN against a known-good baseline; a SpatialJoin node turning into a generic join over TABLE_SCAN is the signature. Re-create the R-tree and re-check that && sits in the ON clause. Capturing and diffing plans on a schedule is the regression-analysis discipline detailed in spatial joins and proximity filters.
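Diffing a plan against a baseline needs nothing more than the standard library once the EXPLAIN text has been captured. The sketch below uses short, hypothetical plan strings as stand-ins; real EXPLAIN output is longer and its operator names depend on the DuckDB and extension versions in use, so treat the marker string as something to adapt.

```python
import difflib

# Hypothetical, simplified plan text standing in for captured EXPLAIN output.
BASELINE_PLAN = """\
HASH_GROUP_BY
  SpatialJoin
    TABLE_SCAN parcels
    TABLE_SCAN flood_zones
"""

CURRENT_PLAN = """\
HASH_GROUP_BY
  NestedLoopJoin
    TABLE_SCAN parcels
    TABLE_SCAN flood_zones
"""


def plan_diff(baseline, current):
    """Return unified-diff lines between two plan texts."""
    return list(difflib.unified_diff(
        baseline.splitlines(), current.splitlines(),
        fromfile="baseline", tofile="current", lineterm=""))


def lost_operator(baseline, current, marker):
    """True when an operator present in the baseline is gone from the current plan."""
    return marker in baseline and marker not in current


diff = plan_diff(BASELINE_PLAN, CURRENT_PLAN)
print("\n".join(diff))
print(lost_operator(BASELINE_PLAN, CURRENT_PLAN, "SpatialJoin"))  # True
```

Storing the baseline text alongside the query in version control keeps the comparison reproducible; a scheduled job can then fail on `lost_operator` and attach the diff for review.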

Diagnostic — CRS mismatch (wrong, not slow): mixed reference frames produce empty or nonsensical overlays without raising an error. The cheapest guard is a coordinate-range assertion before the join:

-- A non-zero count means the layer is NOT in a geographic CRS as assumed:
-- stop and reproject before joining, or every result is suspect.
SELECT count(*) AS out_of_range
FROM layer_a
WHERE ST_XMax(geometry) > 180 OR ST_XMin(geometry) < -180
   OR ST_YMax(geometry) >  90 OR ST_YMin(geometry) <  -90;

Diagnostic — invalid geometry (silent topology failures): self-intersections and unclosed rings make exact predicates return wrong answers or skip rows. Gate on validity at ingestion and repair before indexing, because an R-tree built over invalid geometries propagates the error into every query that uses it:

-- Find offenders.
SELECT id, geom FROM boundaries WHERE NOT ST_IsValid(geom);

-- Repair in place, then rebuild the index so the R-tree reflects the fixes.
UPDATE boundaries SET geom = ST_MakeValid(geom) WHERE NOT ST_IsValid(geom);
DROP INDEX IF EXISTS idx_boundaries_geom;
CREATE INDEX idx_boundaries_geom ON boundaries USING RTREE (geom);

When memory limits are breached despite pre-filtering, fall back to chunked execution — partition with row_number() or pre-aggregate points to a coarse grid before the exact join — rather than raising the ceiling indefinitely. The detailed point-in-polygon recovery path, including grid pre-aggregation, is worked through in optimizing point-in-polygon queries. Taken together, these guards make DuckDB Spatial’s throughput predictable: vectorized execution, explicit resource boundaries, validated geometry, and a verified plan are the four conditions under which the engine delivers deterministic, enterprise-scale spatial analytics.
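Grid pre-aggregation, mentioned above as a fallback, amounts to snapping each point to a coarse cell and counting per cell, so the exact join runs against far fewer rows. The following is a minimal pure-Python sketch of that idea; the cell size and helper names are hypothetical, and a production version would do the same with a GROUP BY inside the engine.

```python
import math
from collections import Counter


def grid_cell(x, y, cell_size):
    """Snap a coordinate to the integer index of its grid cell."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def pre_aggregate(points, cell_size):
    """Collapse points to one (cell, count) entry per occupied cell."""
    return Counter(grid_cell(x, y, cell_size) for x, y in points)


points = [(0.5, 0.5), (0.9, 0.1), (1.2, 0.3), (-0.1, 0.2), (0.4, 0.6)]
cells = pre_aggregate(points, cell_size=1.0)

print(len(points), "points ->", len(cells), "cells")  # 5 points -> 3 cells
print(cells[(0, 0)])                                   # 3
```

Using `math.floor` rather than integer truncation keeps cells consistent across the origin: the point at x = -0.1 lands in cell -1, not cell 0. The trade-off is resolution, since any point near a polygon edge is now judged by its cell, so cells that straddle a boundary still need the exact test on their original points.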

See also

Spatial indexing internals — R-tree construction, MBR pushdown, and persistent index maintenance.
GeoParquet parsing and GeoJSON ingestion — zero-copy format readers and schema projection.
CRS mapping and transformations — EPSG resolution and ST_Transform semantics.
In-memory vs disk storage — spill behavior and memory-ceiling tuning.
Modern Spatial SQL Query Patterns — predicate ordering, joins, and aggregation throughput.
Python & DuckDB integration workflows — async, batch, and GeoPandas/Shapely handoff.


External Reference Standards

Apache Arrow columnar format and C Data Interface — the zero-copy memory model underlying DuckDB’s ingestion and Python handoff: https://arrow.apache.org/
