Spatial Indexing Internals

Spatial indexing is the single mechanism that separates a sub-second bounding-box lookup from a degenerate full-table scan that re-evaluates exact topology on every row. This page dissects how DuckDB’s spatial extension builds, stores, and traverses its persistent R-tree, how the optimizer pushes a minimum-bounding-rectangle (MBR) filter ahead of expensive geometry math, and how to confirm — from the query plan, not from intuition — that the index is actually being used. It sits inside the DuckDB Spatial Architecture & Fundamentals reference, and the patterns here underpin every higher-level operator, from spatial joins and proximity filters to vectorized spatial aggregations. If you only read one section, read Execution Plan Validation: an index that exists but is never traversed is the most common — and most invisible — spatial performance bug in production.

Runtime Configuration & Memory Guardrails

The R-tree in DuckDB Spatial is built and traversed in memory, and its build phase is the most memory-sensitive step in the entire indexing lifecycle. Before creating an index on a large GEOMETRY column, set explicit guardrails so the engine spills predictably instead of aborting with an out-of-memory error mid-build.

INSTALL spatial; LOAD spatial;

-- Cap the working set. The R-tree node buffers plus the source column
-- must fit; if they don't, the build spills to temp_directory instead of OOM-ing.
SET memory_limit = '8GB';

-- Match physical cores, not logical threads. R-tree node splits are
-- contention-heavy; hyperthreaded oversubscription degrades build throughput.
SET threads = 8;

-- Give the engine a place to spill index pages and sort runs.
-- Without this, a build larger than memory_limit fails hard.
SET temp_directory = '/var/tmp/duckdb_spill';

-- Bound spill so a runaway build cannot fill the disk.
SET max_temp_directory_size = '50GB';

Trade-off Analysis: Raising memory_limit keeps the entire R-tree resident and avoids spill latency, but it competes directly with the columnar buffers the scan needs — over-allocating the index starves the scan and the net query gets slower. The threads setting cuts both ways: more threads accelerate the bulk sort that precedes a bottom-up R-tree build, but past the physical core count the node-split locks serialize and throughput plateaus. Treat these as a single budget: index build memory plus scan buffers must stay under the process ceiling described in in-memory vs disk storage tiering.
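The single-budget rule can be written down as a quick pre-flight check. This is a back-of-envelope sketch, not a DuckDB API: the function names, the 10% headroom, and the scan-buffer figures are all hypothetical. The payload estimate assumes four 4-byte floats per envelope, which is only a lower bound on what a build actually touches.

```python
GB = 10**9  # decimal gigabytes; an assumption made for this sketch


def envelope_payload_bytes(row_count: int, bytes_per_float: int = 4) -> int:
    """Lower bound on raw MBR data: four floats (xmin, ymin, xmax, ymax) per row."""
    return row_count * 4 * bytes_per_float


def fits_budget(index_bytes: int, scan_bytes: int, limit_bytes: int,
                headroom: float = 0.10) -> bool:
    """True when index build plus scan buffers stay under the limit minus headroom."""
    usable = limit_bytes * (1.0 - headroom)
    return index_bytes + scan_bytes <= usable


rows = 100_000_000
mbr_bytes = envelope_payload_bytes(rows)
print(f"envelope payload: {mbr_bytes / GB:.1f} GB")

# Hypothetical scan-buffer sizes checked against an 8 GB limit.
print(fits_budget(mbr_bytes, 4 * GB, 8 * GB))  # 5.6 GB needed, 7.2 GB usable
print(fits_budget(mbr_bytes, 6 * GB, 8 * GB))  # 7.6 GB needed, 7.2 GB usable
```

The point of the exercise is the shape of the check rather than the numbers: measure real build and scan footprints on your own data before trusting any threshold.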

A reproducible session for the patterns below assumes the geometry column is already materialized as the native GEOMETRY type rather than a raw BLOB of WKB — the difference in cached-bounding-box behavior is covered in ST_Geometry vs WKB storage, and it directly determines whether the index can read an MBR without decoding vertices.

R-Tree Construction & the Two-Stage Filter

DuckDB exposes a persistent R-tree through CREATE INDEX ... USING RTREE. The structure indexes the MBR of every geometry — a four-float (xmin, ymin, xmax, ymax) envelope — and arranges those envelopes into a balanced tree of internal nodes whose own bounding boxes enclose their children. A spatial predicate then descends only the subtrees whose node envelope overlaps the query window, pruning the rest without ever touching the underlying vertices.

-- Build a persistent R-tree over the geometry column.
CREATE INDEX idx_parcels_geom ON parcels USING RTREE (geom);

-- Optional: tune the node fanout. Larger max_node_capacity means shallower
-- trees (fewer hops) but coarser pruning per node; the default suits most
-- mixed-extent datasets.
CREATE INDEX idx_parcels_geom2 ON parcels USING RTREE (geom)
  WITH (max_node_capacity = 64);
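
To make the pruning descent concrete, here is a toy R-tree in pure Python. It is a simplified illustration, not DuckDB's implementation: the packing strategy (sort by xmin, then fill nodes of a fixed capacity) and every name in it are assumptions chosen for brevity.

```python
from dataclasses import dataclass, field


def union(boxes):
    """Smallest envelope enclosing every box in the list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def overlaps(a, b):
    """Closed-interval overlap test on (xmin, ymin, xmax, ymax) boxes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    box: tuple
    children: list = field(default_factory=list)  # child nodes (internal node)
    entries: list = field(default_factory=list)   # (row_id, box) pairs (leaf)


def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


def build(entries, capacity=4):
    """Naive bottom-up packing: sort, fill leaves, group nodes until one root remains."""
    ordered = sorted(entries, key=lambda e: (e[1][0], e[1][1]))
    level = [Node(union([box for _, box in c]), entries=c)
             for c in chunks(ordered, capacity)]
    while len(level) > 1:
        level = [Node(union([n.box for n in c]), children=c)
                 for c in chunks(level, capacity)]
    return level[0]


def search(node, window, stats):
    """Collect overlapping row ids, descending only into overlapping subtrees."""
    stats["nodes_visited"] += 1
    hits = [row_id for row_id, box in node.entries if overlaps(box, window)]
    for child in node.children:
        if overlaps(child.box, window):
            hits.extend(search(child, window, stats))
    return hits


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)


# 64 small squares on an 8x8 grid; row_id encodes the grid cell.
entries = [(y * 8 + x, (x, y, x + 0.5, y + 0.5)) for y in range(8) for x in range(8)]
root = build(entries)
stats = {"nodes_visited": 0}
hits = search(root, (0, 0, 1.6, 1.6), stats)
print(sorted(hits), stats["nodes_visited"], "of", count_nodes(root), "nodes")
# prints: [0, 1, 8, 9] 4 of 21 nodes
```

Only 4 of the 21 nodes are touched for this window; the other subtrees are rejected by a single envelope comparison at their parent.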

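Every index-eligible predicate is evaluated as a two-stage filter. Which predicates qualify depends on the extension version: the spatial extension's R-tree documentation lists the intersection-implying predicates (ST_Intersects, ST_Contains, ST_Within, ST_Covers, ST_Overlaps, and similar) and requires one argument to be a constant known at planning time. Treat ST_DWithin and the && bounding-box operator as eligible only once EXPLAIN shows an index scan for them on your build. Stage one is the cheap MBR overlap test served by the R-tree; stage two is the exact topology check on the small candidate set that survives. The whole point of the index is to make stage two run over a tiny fraction of the table.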

The R-tree answers the cheap envelope question first; only survivors pay for exact geometry math.

-- The query the planner can route through the R-tree above.
SELECT p.parcel_id, p.geom
FROM parcels p
WHERE ST_Intersects(
        p.geom,
        ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))')
      );
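
The two stages can be mimicked outside the database to show why stage one alone is not enough. The sketch below is illustrative only: it uses a point-in-polygon test (ray casting) as a stand-in for the exact topology check, and the row ids and shapes are made up.

```python
def mbr(ring):
    """Envelope (xmin, ymin, xmax, ymax) of a polygon ring."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_contains_point(box, pt):
    return box[0] <= pt[0] <= box[2] and box[1] <= pt[1] <= box[3]


def point_in_polygon(pt, ring):
    """Exact test by ray casting: count edge crossings to the right of the point."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def two_stage(rows, pt):
    # Stage one: cheap envelope test over every row.
    candidates = [rid for rid, ring in rows.items()
                  if mbr_contains_point(mbr(ring), pt)]
    # Stage two: exact geometry test over the survivors only.
    matches = [rid for rid in candidates if point_in_polygon(pt, rows[rid])]
    return candidates, matches


rows = {
    1: [(0, 0), (10, 0), (0, 10)],        # triangle; envelope covers (8, 8), shape does not
    2: [(20, 20), (30, 20), (20, 30)],    # far away; rejected by the envelope
    3: [(5, 5), (9, 5), (9, 9), (5, 9)],  # square that really contains (8, 8)
}
candidates, matches = two_stage(rows, (8, 8))
print(candidates, matches)  # prints: [1, 3] [3]
```

Row 1 is the false positive that stage two exists to remove: its envelope contains the query point, its actual shape does not.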

Index quality is not just a function of the tree — it is a function of insertion order. An R-tree built over rows with no spatial locality produces internal nodes whose envelopes overlap heavily, which forces the descent to follow many subtrees and erodes pruning. Materializing the table in a space-filling-curve order before building the index keeps sibling nodes spatially compact:

-- Cluster rows along a Hilbert curve so the R-tree's internal node
-- envelopes stay tight and non-overlapping. Trade-off: the COPY/ORDER BY
-- is a one-time sort cost paid to make every later range query selective.
-- ST_Hilbert takes its bounds as a BOX_2D, so the aggregated extent
-- (a GEOMETRY) is converted with ST_Extent and computed once in a subquery.
COPY (
  SELECT *
  FROM parcels
  ORDER BY ST_Hilbert(
    geom,
    (SELECT ST_Extent(ST_Extent_Agg(geom)) FROM parcels)
  )
) TO 'parcels_sorted.parquet' (FORMAT PARQUET);
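
For intuition about what a Hilbert key does, the standard integer Hilbert mapping fits in a few lines of Python. This is a generic textbook formulation, not the extension's own code; the `hilbert_key` helper, its default `order`, and the quantization scheme are assumptions for illustration.

```python
def hilbert_index(x: int, y: int, order: int) -> int:
    """Distance along a Hilbert curve covering a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_key(px: float, py: float, bounds, order: int = 16) -> int:
    """Quantize a coordinate into the grid spanned by bounds, then index it.

    bounds is (xmin, ymin, xmax, ymax) and must have non-zero width and height.
    """
    xmin, ymin, xmax, ymax = bounds
    cells = (1 << order) - 1
    gx = int((px - xmin) / (xmax - xmin) * cells)
    gy = int((py - ymin) / (ymax - ymin) * cells)
    return hilbert_index(gx, gy, order)


# Walk a 4x4 grid in Hilbert order: every step moves to an adjacent cell.
cells_4x4 = [(x, y) for x in range(4) for y in range(4)]
walk = sorted(cells_4x4, key=lambda c: hilbert_index(c[0], c[1], 2))
print(walk[:4])  # prints: [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Because consecutive keys are always neighbouring cells, rows sorted by this key land next to their spatial neighbours, which is what keeps sibling envelopes compact.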

Persisting in this order also benefits ingestion-time pruning: a GeoParquet file written with spatially sorted row groups carries per-row-group bounding boxes that DuckDB uses to skip whole groups before the R-tree is even consulted. The same locality argument applies to GeoJSON ingestion, where ST_Read streams features in file order — re-sorting after load is what makes the subsequent index worth building.
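
The effect of row order on row-group skipping can be simulated without any file at all. The sketch below is a toy model under stated assumptions: 256 points on a 16x16 grid, row groups of 16 rows, a deterministic scatter standing in for "natural" order, and a simple tile sort standing in for a space-filling-curve sort.

```python
def bbox(points):
    """Per-row-group statistics: the envelope of the points in the group."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def groups_read(rows, window, group_size=16):
    """Count row groups whose bounding box overlaps the window and must be read."""
    groups = [rows[i:i + group_size] for i in range(0, len(rows), group_size)]
    return sum(1 for g in groups if overlaps(bbox(g), window))


points = [(i % 16, i // 16) for i in range(256)]

# "Natural" order with no spatial locality: a deterministic scatter of the rows.
scattered = [points[(i * 97) % 256] for i in range(256)]

# Locality-preserving order: each run of 16 rows is exactly one 4x4 tile.
tiled = sorted(points, key=lambda p: (p[0] // 4, p[1] // 4))

window = (0, 0, 3, 3)
print(groups_read(scattered, window), groups_read(tiled, window))  # prints: 16 1
```

With scattered rows every group's bounding box spans most of the extent, so nothing can be skipped; with tiled rows a single group covers the window.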

Execution Plan Validation

An index that exists is not an index that runs. The optimizer chooses an R-tree scan only when the predicate is index-eligible, the cardinality estimate favors it, and the column statistics are current. Always confirm the choice with EXPLAIN, and confirm the payoff with EXPLAIN ANALYZE.

EXPLAIN
SELECT * FROM parcels
WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'));

A healthy plan contains an RTREE_INDEX_SCAN (or a FILTER carrying the MBR predicate pushed into the scan) ahead of any exact-topology operator.

The anti-pattern is a bare SEQ_SCAN feeding a FILTER that holds the full ST_Intersects call: every row’s geometry is decoded and tested, and the index — if one exists — was bypassed. Switch to EXPLAIN ANALYZE to read the runtime counters that expose this:

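-- Caution: the radius is in the column's coordinate units. If geom is stored
-- as lon/lat, 500 means 500 degrees, not 500 meters; project both sides to a
-- metric CRS for a real proximity test.
EXPLAIN ANALYZE
SELECT count(*) FROM parcels
WHERE ST_DWithin(geom, ST_Point(-73.98, 40.75), 500);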

Read three signals from the output:

Rows scanned vs rows returned. When RTREE_INDEX_SCAN emits a candidate count far below the table size, the index pruned. When the scan emits ≈ the full row count, it didn’t.
Operator wall time. If SEQ_SCAN or the exact-ST_ FILTER dominates total time (say, > 80%) with no MBR stage above it, the engine fell back to brute force.
Estimated vs actual cardinality drift. A wildly low estimate next to a huge actual is what makes the optimizer skip the index; it is the root cause behind most plan regressions and the metric the regression harness below tracks.
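
The three signals can be turned into a small checklist function once the numbers have been read off the EXPLAIN ANALYZE output. This is a hedged sketch: the function name and arguments are hypothetical, the counters are assumed to be transcribed by hand or by your own parser, and the 90% and 10x thresholds are illustrative choices (only the 80% wall-time figure comes from the text above).

```python
def read_signals(table_rows, rows_from_scan, scan_seconds, total_seconds,
                 estimated_rows, actual_rows):
    """Return human-readable findings for the three plan signals."""
    findings = []

    # Signal 1: did the scan emit close to the whole table?
    if table_rows and rows_from_scan / table_rows > 0.9:
        findings.append("no pruning: scan emitted nearly the full table")

    # Signal 2: does the scan / exact filter dominate wall time?
    if total_seconds and scan_seconds / total_seconds > 0.8:
        findings.append("brute force: scan or exact filter dominates wall time")

    # Signal 3: how far apart are the estimate and the actual row count?
    lo, hi = sorted((max(estimated_rows, 1), max(actual_rows, 1)))
    if hi / lo > 10:
        findings.append(f"cardinality drift: {hi / lo:.0f}x between estimate and actual")

    return findings


healthy = read_signals(1_000_000, 1_200, 0.02, 0.05, 1_000, 1_100)
regressed = read_signals(1_000_000, 1_000_000, 0.95, 1.0, 50, 40_000)
print(healthy)
for line in regressed:
    print(line)
```

The numbers passed in are invented for the demonstration; substitute the counters from your own plan.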

Diagnostic Boundary: If EXPLAIN ANALYZE shows no MBR/RTREE_INDEX_SCAN node despite a built index, the usual causes are (a) the predicate isn’t index-eligible after rewriting (e.g. a function wraps geom), (b) the estimator believes the predicate is non-selective, or (c) the build never finished because it lacked memory. Confirm spill pressure with SELECT * FROM duckdb_temporary_files(); and confirm the index exists with SELECT * FROM duckdb_indexes() WHERE table_name = 'parcels';.

Performance Trade-offs

The index is not free, and the right choice depends on the query mix. The table below quantifies the levers that matter most for spatial workloads.

| Decision | Cost | Payoff | When to apply |
|---|---|---|---|
| Persistent R-tree vs ephemeral filter | One-time build (sort + tree, often minutes on 10⁸ rows) plus storage and write-amplification on updates | Range queries drop from $O(N)$ exact tests toward $O(\log N)$ descent; 60–95% candidate pruning on well-distributed data | Read-heavy tables queried by many spatial windows |
| Hilbert-sorted load vs natural order | A full sort before `COPY`/`CREATE INDEX` | Tighter node envelopes, higher MBR selectivity, fewer subtree descents | Any table that will carry an R-tree |
| `GEOMETRY` column vs raw WKB `BLOB` | Slightly larger on-disk footprint (cached bbox + validity flag) | Eliminates the 15–30% WKB re-parse overhead the MBR read would otherwise pay per row | Always, except pure interchange tables |
| Pre-extracted `xmin/ymin/xmax/ymax` columns | Storage duplication, ETL maintenance | Removes runtime envelope extraction entirely for pure bbox filters | Hot paths dominated by `&&` overlap, not exact topology |
| More `threads` on build | Lock contention past physical cores | Faster bulk sort and bottom-up assembly | Build time matters and cores are physical |

The dominant rule: an R-tree pays for itself when the same table is hit by many selective spatial windows, and it loses when the predicate is non-selective (the window covers most of the data) or the table is rebuilt constantly. For one-shot scans over a freshly loaded file, the ephemeral row-group bbox pruning from a sorted GeoParquet layout often beats the cost of building a throwaway index.
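
The "pays for itself" rule is a break-even calculation: the build cost is recovered once the per-query savings add up to it. The sketch below uses a deliberately simple cost model (constant per-query times, no maintenance cost), and every number in it is invented for illustration.

```python
import math


def break_even_queries(build_seconds, scan_seconds, indexed_seconds):
    """Number of queries after which the index has repaid its build cost.

    Returns None when the indexed query is not faster than the plain scan,
    which is the non-selective-window case where the index never pays off.
    """
    saving = scan_seconds - indexed_seconds
    if saving <= 0:
        return None
    return math.ceil(build_seconds / saving)


# Selective window: 2.0 s full scan vs 0.05 s indexed, 120 s build.
print(break_even_queries(120, 2.0, 0.05))  # prints: 62

# Non-selective window: the indexed path is no faster than the scan.
print(break_even_queries(120, 2.0, 2.1))   # prints: None
```

A table that is queried once or twice after each reload never reaches the break-even count, which is the one-shot case where row-group pruning alone is the better deal.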

Edge Cases & Anti-Patterns

Predicate wrapped so it can’t push down. Applying a function to the indexed column defeats the R-tree because the planner can no longer reason about the stored MBR.

-- ANTI-PATTERN: ST_Transform wraps the indexed column; the R-tree on geom
-- can no longer serve the predicate, so this falls back to SEQ_SCAN.
SELECT * FROM parcels
WHERE ST_Intersects(ST_Transform(geom, 'EPSG:4326', 'EPSG:3857'), :window);

-- FIX: transform once at materialization, index the projected column, and
-- pass the query window already in the indexed CRS.
ALTER TABLE parcels ADD COLUMN geom_3857 GEOMETRY;
UPDATE parcels SET geom_3857 = ST_Transform(geom, 'EPSG:4326', 'EPSG:3857');
CREATE INDEX idx_parcels_3857 ON parcels USING RTREE (geom_3857);

CRS mismatch producing silently wrong MBRs. The R-tree compares raw coordinates; it has no concept of projection. If the indexed geometries and the query window are in different coordinate systems, the envelopes don’t align and the index returns a wrong — but confidently fast — candidate set. Normalize projections before indexing, as detailed in CRS mapping and transformations. A ST_DWithin radius in degrees against a metric CRS is the same class of bug: the distance literal must match the index’s coordinate units.
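
The unit mismatch is easy to quantify. The sketch below converts a metric radius into degrees using the common approximation of about 111,320 meters per degree of latitude, with longitude scaled by the cosine of the latitude. It is a rough spherical estimate for sanity checks, not a substitute for projecting the data, and the helper name is hypothetical.

```python
import math

METERS_PER_DEGREE_LAT = 111_320.0  # approximate; varies slightly with latitude


def meters_to_degrees(meters: float, latitude_deg: float):
    """Approximate (delta_lon, delta_lat) in degrees for a distance in meters."""
    dlat = meters / METERS_PER_DEGREE_LAT
    dlon = meters / (METERS_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg)))
    return dlon, dlat


dlon, dlat = meters_to_degrees(500, 40.75)
print(f"500 m at latitude 40.75 is about {dlon:.4f} deg lon, {dlat:.4f} deg lat")

# A literal 500 applied to lon/lat data is read as 500 degrees,
# which is wider than the whole 360-degree longitude range.
print(500 > 360)
```

A radius that is five orders of magnitude too large makes every envelope a candidate, so the query is both wrong and unindexed in effect.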

Index not maintained under heavy updates. Each insert and delete mutates the tree and can fragment node envelopes over time. For append-mostly tables this is negligible; for tables under constant churn, periodically rebuild after a Hilbert re-sort rather than letting envelope overlap creep upward.

High intrinsic MBR overlap. Long, diagonal, or sprawling geometries (rivers, road networks, multipolygon administrative boundaries) have envelopes that overlap regardless of insertion order, so stage one prunes little and stage two carries the cost. Here, pre-extracting tighter sub-envelopes or decomposing geometries beats tuning the tree.
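
The gain from decomposition shows up directly in envelope area. The sketch below compares the single envelope of a diagonal line with the envelopes of its individual segments; the geometry is a made-up example and the helper names are hypothetical.

```python
def envelope(points):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


# A diagonal line from (0, 0) to (100, 100) with a vertex every 10 units.
line = [(i, i) for i in range(0, 101, 10)]

whole = area(envelope(line))

# One envelope per segment instead of one for the whole line.
pieces = [envelope(line[i:i + 2]) for i in range(len(line) - 1)]
decomposed = sum(area(b) for b in pieces)

print(whole, decomposed)  # prints: 10000 1000
```

The ten segment envelopes cover a tenth of the area of the single envelope, so a query window far from the diagonal no longer produces a stage-one candidate.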

| Symptom | Root cause | Resolution |
|---|---|---|
| Rows scanned ≈ table row count in `EXPLAIN ANALYZE` | Index bypassed or pruning nothing: missing index, wrapped predicate, or input with no spatial locality | Build `USING RTREE`; remove the function around `geom`; materialize Hilbert-sorted |
| `ST_Intersects` latency spikes on large tables | WKB re-parse contention across threads | Store as `GEOMETRY`; pre-extract MBR columns; cap `threads` at physical cores |
| Spatially wrong results, fast plan | CRS mismatch between column and query window | Normalize with `ST_Transform` before indexing; match radius units |
| `Out of Memory` during `CREATE INDEX` | R-tree build exceeds `memory_limit` | Set `temp_directory` to spill, or partition the input and index per partition |

Query Regression Analysis

Plan regressions are silent: the query still returns correct rows, just an order of magnitude slower because the optimizer quietly switched from RTREE_INDEX_SCAN to SEQ_SCAN after a statistics shift or a schema change. Capture the plan as a fingerprint and diff it against a stored baseline in CI.

import duckdb, json, hashlib, pathlib

con = duckdb.connect("parcels.duckdb")
con.execute("INSTALL spatial; LOAD spatial;")

QUERIES = {
    "intersects_window": """
        SELECT count(*) FROM parcels
        WHERE ST_Intersects(geom,
              ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'))
    """,
    "dwithin_point": """
        SELECT count(*) FROM parcels
        WHERE ST_DWithin(geom, ST_Point(-73.98, 40.75), 500)
    """,
}

def plan_fingerprint(sql: str) -> dict:
    # EXPLAIN (FORMAT JSON) gives a stable, machine-comparable tree; the plan
    # text is the second column of the result row.
    plan = con.execute("EXPLAIN (FORMAT JSON) " + sql).fetchone()[1]
    tree = json.loads(plan)
    ops = sorted(_collect_operators(tree))
    return {
        "operators": ops,
        "uses_index": any("RTREE_INDEX_SCAN" in o for o in ops),
        "hash": hashlib.sha1(json.dumps(ops).encode()).hexdigest()[:12],
    }

def _collect_operators(node, acc=None):
    acc = acc if acc is not None else []
    # The JSON root may be a list of plan nodes rather than a single object,
    # so walk lists and dicts alike.
    if isinstance(node, list):
        for item in node:
            _collect_operators(item, acc)
        return acc
    name = node.get("name") or node.get("operator_type")
    if name:
        acc.append(name.strip())
    for child in node.get("children", []):
        _collect_operators(child, acc)
    return acc

baseline_path = pathlib.Path("plan_baseline.json")
baseline = json.loads(baseline_path.read_text()) if baseline_path.exists() else {}

regressions = []
updated = dict(baseline)
for label, sql in QUERIES.items():
    current = plan_fingerprint(sql)
    prior = baseline.get(label)
    # Hard fail if a query that used the index stops using it.
    if prior and prior["uses_index"] and not current["uses_index"]:
        regressions.append(f"{label}: lost R-tree scan ({prior['hash']} -> {current['hash']})")
    updated[label] = current

if regressions:
    # Leave the stored baseline untouched so the regression keeps failing
    # until it is fixed or the baseline is deliberately refreshed.
    raise SystemExit("PLAN REGRESSION:\n" + "\n".join(regressions))
baseline_path.write_text(json.dumps(updated, indent=2))
print("plans stable; index scans preserved")

Pair the fingerprint check with a timing budget so you catch the cases where the operator set is unchanged but actual cardinality drifted. The same harness slots cleanly into a batch processing pipeline as a pre-deploy gate, and the queried results can be handed straight to a GeoDataFrame for inspection when a regression needs a visual diff.
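
A timing budget can be as small as a median-of-N measurement compared against a stored baseline. The sketch below is one possible shape, with hypothetical helper names and an arbitrary 1.5x tolerance; the workload here is a stand-in, and in the harness above the callable would wrap something like `con.execute(sql).fetchall()`.

```python
import statistics
import time


def median_seconds(run, repeats: int = 5) -> float:
    """Median wall time of calling run() several times; damps one-off noise."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def over_budget(measured: float, baseline: float, tolerance: float = 1.5) -> bool:
    """True when the measured time exceeds the baseline by more than the tolerance."""
    return measured > baseline * tolerance


# Stand-in workload for the demonstration.
measured = median_seconds(lambda: sum(range(10_000)))
print(f"median: {measured:.6f} s")

# Hypothetical stored baselines, in seconds.
print(over_budget(0.30, 0.10))  # three times the baseline
print(over_budget(0.12, 0.10))  # within the 1.5x tolerance
```

Using the median rather than the mean keeps a single cold-cache run from tripping the gate; pick the tolerance from the run-to-run variance you actually observe.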

See also

ST_Geometry vs WKB storage — why the column type decides whether the MBR is read without decoding vertices.
Spatial joins & proximity filters — the two-stage filter applied across two tables.
In-memory vs disk storage — the memory budget the index build competes for.
CRS mapping & transformations — keeping index envelopes in a single coordinate system.

External Reference Standards: geometry encoding follows the OGC Simple Features standard for WKB binary compliance, and columnar spatial persistence follows the GeoParquet specification.
