Spatial Indexing Internals

DuckDB’s spatial indexing model diverges from traditional RDBMS overlay indexes. Unless you create an explicit RTREE index, the spatial extension maintains no persistent B-tree or R-tree files on disk; it computes bounding-box hierarchies and Hilbert-curve orderings during query execution. This vectorized, columnar approach eliminates background index maintenance but shifts the computational burden to the scan phase. For a comprehensive overview of the execution pipeline, consult the DuckDB Spatial Architecture & Fundamentals. Production workloads require explicit tuning of memory allocation, thread parallelism, and geometry serialization to prevent degenerate full-table scans.

Index Architecture & Predicate Pushdown

The spatial extension constructs an in-memory R-tree over minimum bounding rectangles (MBRs) extracted from ST_Geometry columns. During query planning, spatial predicates (ST_Intersects, ST_Contains, ST_DWithin) trigger a two-phase evaluation: a fast MBR overlap filter followed by exact topology validation. The optimizer pushes the MBR filter into the TABLE_SCAN operator, reducing I/O before expensive exact checks execute.

-- ST_GeomFromText takes WKT text; its optional second argument is an ignore_invalid flag, not an SRID
EXPLAIN SELECT * FROM parcels WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'));

Typical EXPLAIN Output:

graph TD
  S["TABLE_SCAN<br/>parcels"] --> F["FILTER · MBR<br/>ST_Intersects(geom, …)"]
  F --> P["PROJECTION<br/>*"]
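The two-phase evaluation described above can be illustrated outside the database. The following is a minimal Python sketch, not the extension's implementation: it tests points against a single polygon, and the helper names (mbr_of, mbr_contains, point_in_polygon, two_phase_filter) are hypothetical. A cheap bounding-box test discards most candidates, and the costlier ray-casting test runs only on the survivors.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr_of(ring: List[Point]) -> Box:
    """Minimum bounding rectangle of a polygon ring."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_contains(box: Box, pt: Point) -> bool:
    """Phase 1: cheap bounding-box test (four comparisons)."""
    xmin, ymin, xmax, ymax = box
    return xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax


def point_in_polygon(pt: Point, ring: List[Point]) -> bool:
    """Phase 2: exact test by ray casting.

    Points lying exactly on the boundary are not handled specially here.
    """
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def two_phase_filter(points: List[Point], ring: List[Point]):
    """Return (MBR candidates, exact matches)."""
    box = mbr_of(ring)
    candidates = [p for p in points if mbr_contains(box, p)]
    matches = [p for p in candidates if point_in_polygon(p, ring)]
    return candidates, matches


# Hypothetical data: a right triangle and three probe points.
triangle = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(1.0, 1.0), (9.0, 9.0), (20.0, 20.0)]
candidates, matches = two_phase_filter(points, triangle)
print(len(candidates), len(matches))  # 2 candidates, 1 exact match
```

The point (9, 9) shows why the second phase is required: it falls inside the triangle's bounding box but outside the triangle itself.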

Performance Trade-off: The MBR filter is highly selective for well-distributed geometries but degrades toward a full scan when bounding boxes overlap heavily or when the dataset lacks spatial locality. The planner selects the access path automatically: with a persistent RTREE index (built with CREATE INDEX ... USING RTREE) it can perform an index scan; without one, a spatial join falls back to a hash or nested-loop join chosen from cardinality estimates. Materializing inputs in Hilbert-sorted order keeps MBR selectivity high.
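To see why Hilbert ordering helps, the sketch below compares two physical orderings of the same points. It is an illustration under stated assumptions, not DuckDB's code: the grid, the chunk size of 16 (standing in for a row group with min/max statistics), and the names hilbert_key and chunks_touched are all hypothetical. The key function is the standard iterative Hilbert mapping from grid coordinates to a distance along the curve.

```python
def hilbert_key(order: int, x: int, y: int) -> int:
    """Map integer cell (x, y) on a 2**order square grid to its Hilbert index."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def chunks_touched(points, chunk_size, window):
    """Count chunks whose bounding box overlaps the query window."""
    qx0, qy0, qx1, qy1 = window
    touched = 0
    for i in range(0, len(points), chunk_size):
        chunk = points[i:i + chunk_size]
        cx0 = min(p[0] for p in chunk)
        cx1 = max(p[0] for p in chunk)
        cy0 = min(p[1] for p in chunk)
        cy1 = max(p[1] for p in chunk)
        if cx0 <= qx1 and cx1 >= qx0 and cy0 <= qy1 and cy1 >= qy0:
            touched += 1
    return touched


ORDER = 4  # 16 x 16 grid
grid = [(x, y) for x in range(16) for y in range(16)]

lexicographic = sorted(grid)  # sort by x, then y
hilbert = sorted(grid, key=lambda p: hilbert_key(ORDER, p[0], p[1]))

window = (0, 0, 3, 3)  # small query rectangle
lex_chunks = chunks_touched(lexicographic, 16, window)
hil_chunks = chunks_touched(hilbert, 16, window)
print(lex_chunks, hil_chunks)  # 4 chunks vs 1 chunk on this grid
```

On this grid each lexicographic chunk is a full-height column, so the window touches four of them, while each Hilbert chunk is a compact 4 x 4 block and the window touches one. Real data is rarely this regular, so the gap will vary.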

Storage Tiers & Materialization Behavior

Index structures in DuckDB are ephemeral by default. They are reconstructed per query unless you explicitly materialize sorted or partitioned tables. For high-throughput pipelines, persisting geometry in GeoParquet format preserves spatial locality and reduces parsing latency. The storage engine dynamically switches between vectorized buffers and memory-mapped files based on dataset size and memory_limit configuration. Review In-Memory vs Disk Storage for tiering thresholds and spill-to-disk behavior.

-- Configure runtime memory and thread allocation for spatial workloads
SET memory_limit = '8GB';
SET threads TO 12;

-- Materialize sorted by bounding-box corner (x, then y) as a coarse approximation of spatial clustering
COPY (
    SELECT * FROM parcels
    ORDER BY ST_XMin(geom), ST_YMin(geom)
) TO 'parcels_sorted.parquet' (FORMAT PARQUET);

-- Subsequent reads leverage preserved MBR locality without recomputing sort order
SELECT COUNT(*) FROM read_parquet('parcels_sorted.parquet')
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((...))'));

Diagnostic Boundary: Monitor execution metrics via EXPLAIN ANALYZE. If TABLE_SCAN consumes >80% of execution time without a FILTER (MBR) node, the optimizer bypassed spatial indexing due to insufficient memory for the R-tree build or missing sort order.

Geometry Serialization & Index Key Generation

Spatial indexes operate on serialized binary representations. DuckDB normalizes ST_Geometry objects to WKB during index traversal. The Hilbert curve mapping relies on coordinate extraction from WKB to generate spatial keys. Understanding the overhead of geometry conversion is critical when tuning join performance. Refer to Understanding ST_Geometry vs WKB for binary layout specifics. Coordinate reference system alignment directly impacts key generation accuracy; mismatched projections force implicit transformations that invalidate precomputed Hilbert keys. See CRS Mapping & Transformations for projection normalization strategies.

import duckdb

# Compare the share of rows passing the exact predicate vs. the bounding-box prefilter
con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("CREATE TABLE test AS SELECT * FROM read_parquet('parcels_sorted.parquet');")

# Substitute a real WKT polygon for the elided placeholder before running.
# ST_Intersects_Extent compares bounding boxes only; ST_Overlaps is an exact
# topological predicate and does not measure the MBR filter.
res = con.execute("""
    SELECT
        AVG(CASE WHEN ST_Intersects(geom, ST_GeomFromText('POLYGON((...))')) THEN 1 ELSE 0 END) AS exact_match_rate,
        AVG(CASE WHEN ST_Intersects_Extent(geom, ST_GeomFromText('POLYGON((...))')) THEN 1 ELSE 0 END) AS mbr_filter_rate
    FROM test
""").fetchdf()  # fetchdf() requires pandas

Performance Trade-off: WKB parsing adds ~15–30% CPU overhead compared to raw numeric column scans. Pre-parsing geometries into separate x_min, y_min, x_max, y_max columns eliminates runtime WKB extraction at the cost of storage duplication. For GeoJSON ingestion, the extension automatically converts to WKB, but large nested JSON structures trigger JSON-to-WKB serialization bottlenecks. Consult the GeoParquet specification for columnar encoding best practices and the OGC Simple Features standard for binary compliance requirements.
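Pre-extracting MBR columns amounts to reading the coordinates out of each WKB value once, at load time. The sketch below shows that extraction for standard OGC WKB using only Python's struct module. It is an assumption-laden illustration: it handles 2D points and polygons only, the function names wkb_polygon and wkb_mbr are hypothetical, and DuckDB's internal geometry layout may differ from plain WKB.

```python
import struct


def wkb_polygon(rings):
    """Encode rings as a little-endian 2D WKB polygon (geometry type 3)."""
    out = struct.pack("<BI", 1, 3)          # byte order flag, geometry type
    out += struct.pack("<I", len(rings))    # number of rings
    for ring in rings:
        out += struct.pack("<I", len(ring)) # number of points in the ring
        for x, y in ring:
            out += struct.pack("<dd", x, y)
    return out


def wkb_mbr(buf):
    """Return (x_min, y_min, x_max, y_max) for a 2D WKB point or polygon."""
    order = "<" if buf[0] == 1 else ">"     # 1 = little-endian, 0 = big-endian
    (gtype,) = struct.unpack_from(order + "I", buf, 1)
    offset = 5
    if gtype == 1:                          # Point
        x, y = struct.unpack_from(order + "dd", buf, offset)
        return (x, y, x, y)
    if gtype != 3:                          # only Polygon beyond this line
        raise ValueError("unsupported WKB geometry type: %d" % gtype)
    (n_rings,) = struct.unpack_from(order + "I", buf, offset)
    offset += 4
    xs, ys = [], []
    for _ in range(n_rings):
        (n_points,) = struct.unpack_from(order + "I", buf, offset)
        offset += 4
        for _ in range(n_points):
            x, y = struct.unpack_from(order + "dd", buf, offset)
            offset += 16
            xs.append(x)
            ys.append(y)
    if not xs:
        raise ValueError("empty polygon has no bounding box")
    return (min(xs), min(ys), max(xs), max(ys))


square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]
blob = wkb_polygon([square])
print(len(blob), wkb_mbr(blob))  # 93 bytes, MBR (0.0, 0.0, 10.0, 10.0)
```

Every vertex must be visited to find the extent, which is the per-row cost that dedicated x_min, y_min, x_max, y_max columns avoid at query time.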

Diagnostic Boundaries & Production Tuning

Symptom | Root Cause | Resolution
rows_scanned ≫ rows_returned in EXPLAIN ANALYZE | R-tree bypassed due to memory pressure or unsorted input | Increase memory_limit, materialize sorted GeoParquet, or build a USING RTREE index
ST_Intersects latency spikes on large datasets | WKB deserialization contention across threads | Pre-extract MBR columns, reduce threads to physical core count, or batch queries
Incorrect spatial join results | CRS mismatch between source tables | Normalize projections using ST_Transform prior to materialization; verify SRID consistency
Out of Memory during spatial join | Ephemeral R-tree build exceeds contiguous allocation | Set temp_directory so the engine can spill, or partition input data

Execution Checklist:

  1. Verify that the EXPLAIN plan contains a FILTER (MBR) node consuming the output of TABLE_SCAN (it prints directly above the scan).
  2. Confirm memory_limit ≥ 2× dataset size in bytes for in-memory R-tree construction.
  3. Align thread count with physical cores (SET threads TO <n>) to avoid context-switching penalties.
  4. Enforce explicit ST_Transform in ETL pipelines to prevent CRS drift during key generation.
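Items 2 and 3 of the checklist can be turned into a small helper that emits the corresponding SET statements. This is a sketch under the checklist's own rules of thumb (the 2× factor and one thread per physical core); the function name tuning_statements is hypothetical, and the caller is assumed to supply the physical core count, since Python's standard library reports only logical CPUs.

```python
from typing import List


def tuning_statements(dataset_bytes: int, physical_cores: int) -> List[str]:
    """Build DuckDB SET statements from the checklist's rules of thumb."""
    if dataset_bytes <= 0 or physical_cores <= 0:
        raise ValueError("dataset_bytes and physical_cores must be positive")
    needed = 2 * dataset_bytes                 # checklist item 2: >= 2x dataset size
    gigabytes = -(-needed // 10**9)            # ceiling division to whole decimal GB
    return [
        "SET memory_limit = '%dGB';" % gigabytes,
        "SET threads TO %d;" % physical_cores,  # checklist item 3
    ]


# Hypothetical workload: a 3.5 GB dataset on an 8-core machine.
for statement in tuning_statements(3_500_000_000, 8):
    print(statement)
```

The emitted strings can be passed to a connection's execute method before running spatial queries.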

DuckDB’s spatial indexing prioritizes execution-time vectorization over persistent index structures. By aligning storage locality, memory allocation, and serialization formats, engineers can keep spatial query latency low at scale. Continuous monitoring of EXPLAIN ANALYZE metrics and strict adherence to WKB normalization boundaries prevent degenerate query plans in production.