In-Memory vs Disk Storage: Tactical Patterns for DuckDB Spatial Workflows

DuckDB’s vectorized engine defaults to aggressive in-memory processing, but geospatial workloads routinely violate that assumption: complex polygon intersections, unbounded point clouds, and rasterized pixel tables exhaust RAM long before the on-disk footprint looks alarming. This guide, part of the DuckDB Spatial architecture and fundamentals reference, addresses one decision in depth — when a spatial query should stay resident in memory and when it should spill to a temporary directory — and treats it as a configuration and query-optimization problem rather than architectural dogma. The sections below give a reproducible session setup with memory guardrails, the canonical two-stage filter that keeps spills cheap, an EXPLAIN ANALYZE walkthrough for spotting external operators, quantified trade-offs between formats and routing strategies, the anti-patterns that silently turn a join into a Cartesian product, and a Python regression harness you can wire into CI.

Runtime Configuration & Memory Guardrails

DuckDB Spatial inherits the core engine’s memory model: it allocates contiguous blocks for columnar vectors, parallelizes spatial operators across cores, and spills intermediate results to a temporary directory once the working set breaches the configured ceiling. The defaults set memory_limit to 80% of physical RAM and threads to the number of CPU cores, so latency oscillates with whatever machine runs the query. Production pipelines pin every knob explicitly so behaviour is deterministic.

-- Minimal reproducible session for memory-bounded spatial execution.
SET memory_limit = '8GB';         -- Hard ceiling: when joins/aggregations approach it,
                                  -- DuckDB switches to external merge-sort and hash-join
                                  -- spilling instead of OOM-killing the process.
SET threads = 8;                  -- More threads saturate vectorized spatial operators
                                  -- but each in-flight pipeline holds its own buffers,
                                  -- so high thread counts RAISE peak RAM and spill risk.
SET temp_directory = '/mnt/nvme/duckdb-temp/spatial'; -- Put spills on NVMe: once a query
                                  -- spills, scratch I/O becomes the latency floor, so a
                                  -- slow volume erases the columnar advantage entirely.
SET preserve_insertion_order = false; -- Frees the scan to reorder row groups, cutting
                                      -- buffering on large multi-file spatial reads.
SET enable_progress_bar = false;  -- Removes UI/stderr overhead in headless ETL.

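Treat memory_limit as a hard ceiling rather than a hint, with one caveat: it governs the memory that DuckDB's buffer manager tracks, and allocations made outside it (extension code evaluating geometry predicates, for example) can push the process above the configured figure. Spatial operators such as ST_Intersects and ST_DWithin materialize bounding boxes and geometry arrays before topology evaluation, so their transient footprint runs well above the size of the geometry column on disk. As a rule of thumb, if the working set exceeds roughly 60% of memory_limit, expect I/O-bound execution and budget for spill.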

Diagnostic boundary: watch live spill with SELECT * FROM duckdb_memory(); and SELECT * FROM duckdb_temporary_files();. If duckdb_temporary_files() grows during a plain SELECT that you expected to stream, the working set exceeds RAM and the engine is thrashing rather than streaming. That is the signal to raise the ceiling, shrink the projection, or push spatial predicates earlier.
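The two diagnostics above can be folded into a single verdict per snapshot. The following is a minimal sketch, not part of DuckDB: spill_verdict and its argument names are hypothetical, the inputs are assumed to be values you have already fetched from duckdb_memory() and duckdb_temporary_files(), and the 0.60 threshold is the rule of thumb quoted earlier rather than an engine constant.

```python
GB = 1000 ** 3  # assumption: decimal gigabytes, matching a limit written as '8GB'


def spill_verdict(memory_used_bytes, temp_file_sizes, memory_limit_bytes,
                  threshold=0.60):
    """Classify one snapshot of a running query (hypothetical helper).

    memory_used_bytes: total usage summed from duckdb_memory()
    temp_file_sizes:   per-file sizes taken from duckdb_temporary_files()
    threshold:         the ~60% rule of thumb, not an engine setting
    """
    if sum(temp_file_sizes) > 0:
        return "spilling"      # scratch files exist: the working set left RAM
    if memory_used_bytes > threshold * memory_limit_bytes:
        return "near-limit"    # still resident, but budget for spill
    return "resident"


print(spill_verdict(2 * GB, [], 8 * GB))
print(spill_verdict(5 * GB, [], 8 * GB))
print(spill_verdict(5 * GB, [64 * 1024 * 1024], 8 * GB))
```

Because temporary files only exist while a query is running, a verdict like this is only meaningful when the snapshot is taken during execution, for instance from a second cursor.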

Primary Execution Patterns

The cheapest spill is the one that never happens, and the most reliable way to avoid it on a spatial join is to keep the in-memory hash table small. That means staging every join and aggregation as two phases: a disk-friendly pruning phase that uses bounding-box predicates to discard non-candidate pairs before any exact topology is computed, followed by an exact phase that runs ST_Intersects or ST_Contains only on the survivors. The same discipline that powers spatial joins and proximity filters is what keeps memory pressure proportional to candidate pairs rather than the full cross product.

-- Two-stage join: bbox prune first, exact topology on the survivors only.
SELECT a.id, b.zone_name
FROM parcels a
JOIN zoning b
  ON a.geom && b.geom            -- Phase 1 (disk-friendly): cheap bounding-box overlap
                                 -- on min/max envelopes; prunes most pairs before any
                                 -- WKB is decoded, keeping the hash table small.
 AND ST_Intersects(a.geom, b.geom) -- Phase 2 (in-memory/spilled): exact topology runs
                                    -- only on the candidates that survived Phase 1.
WHERE b.zone_type = 'commercial';
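To make the two phases concrete outside the engine, here is a pure-Python sketch of the same idea for points against polygons. It is an illustration under simplifying assumptions (plain coordinate tuples, a ray-casting containment test, hypothetical function names) and not how DuckDB implements the join; it only shows that the exact test runs on the pairs that survive the envelope check.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Ring = Sequence[Point]  # polygon vertices in order, first vertex not repeated


def envelope(ring: Ring) -> Tuple[float, float, float, float]:
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return min(xs), min(ys), max(xs), max(ys)


def in_envelope(pt: Point, box) -> bool:
    xmin, ymin, xmax, ymax = box
    return xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax


def point_in_polygon(pt: Point, ring: Ring) -> bool:
    """Exact phase: ray casting, counting edge crossings to the right of pt."""
    x, y = pt
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def two_stage_match(points: List[Point], polygons: List[Ring]):
    """Return (matches, exact_tests): matching index pairs and exact-phase calls."""
    boxes = [envelope(r) for r in polygons]
    matches, exact_tests = [], 0
    for pi, pt in enumerate(points):
        for gi, ring in enumerate(polygons):
            if not in_envelope(pt, boxes[gi]):   # phase 1: cheap prune
                continue
            exact_tests += 1                     # phase 2: survivors only
            if point_in_polygon(pt, ring):
                matches.append((pi, gi))
    return matches, exact_tests


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
pts = [(1.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
print(two_stage_match(pts, [triangle]))
```

The point (3, 3) sits inside the triangle's envelope but outside the triangle, so it costs one exact test; (5, 5) is rejected by the envelope alone and never reaches the exact phase.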

Spilling is far less punishing when spatially adjacent rows land near each other on disk, because external hash joins and sorts then read contiguous scratch pages instead of seeking randomly. Sorting by a space-filling curve before materialization clusters proximate geometries into the same row groups — the same locality principle that the R-tree spatial indexing internals exploit for selective lookups.

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")
conn.execute("SET memory_limit = '4GB';")           # bound resident set
conn.execute("SET temp_directory = '/mnt/nvme/duckdb_spill';")  # NVMe scratch

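# Hilbert-sort on ingest so spatially close rows share row groups; during a
# disk spill this turns random scratch seeks into sequential reads.
# ST_Hilbert takes the geometry plus a BOX_2D covering the whole dataset, so the
# bounds come from a scalar subquery over ST_Extent_Agg (ST_Extent works on a
# single geometry and cannot be used as a window aggregate).
conn.execute("""
    CREATE TABLE indexed_parcels AS
    SELECT *
    FROM parcels
    ORDER BY ST_Hilbert(
        geom,
        (SELECT ST_Extent(ST_Extent_Agg(geom))::BOX_2D FROM parcels)
    );
""")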
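For intuition about why a Hilbert sort clusters neighbours, the curve itself can be sketched in a few lines of plain Python. This uses the textbook bit-rotation algorithm on an integer grid and is not DuckDB's ST_Hilbert implementation; hilbert_index, hilbert_sort, and the order parameter are illustrative names.

```python
def hilbert_index(order, x, y):
    """Distance along a Hilbert curve of the given order for grid cell (x, y).

    The grid is 2**order cells per side; x and y must lie in [0, 2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant so the curve joins up
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_sort(points, bounds, order=16):
    """Sort (x, y) points by Hilbert distance within bounds=(xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bounds
    n = 1 << order

    def cell(v, lo, hi):
        if hi == lo:
            return 0
        q = int((v - lo) / (hi - lo) * (n - 1))
        return min(max(q, 0), n - 1)     # clamp points that fall outside bounds

    return sorted(
        points,
        key=lambda p: hilbert_index(order, cell(p[0], xmin, xmax), cell(p[1], ymin, ymax)),
    )


print(hilbert_sort([(0, 0), (1, 1), (0, 1), (1, 0)], (0, 0, 1, 1), order=1))
```

Consecutive distances always map to grid cells that share an edge, which is the property that lets a sorted table keep nearby geometries in the same row groups.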

For very large vector tables, persist the Hilbert-sorted result as a GeoParquet dataset: the columnar layout preserves the spatial ordering, defers WKB deserialization until an operator needs it, and lets bounding-box-selective queries skip whole row groups via footer statistics — so the next read starts memory-friendly instead of re-sorting.

Execution Plan Validation

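EXPLAIN ANALYZE is the authoritative source for where a query spent its time, and the first place to look for a spill. In the plans shown here, the operator name carries the verdict: a plain Hash Join ran resident, while an External Hash Join (or External Order By) means the engine breached memory_limit and went to temp_directory. Operator labels differ between DuckDB versions and output formats, so if your plan does not show this prefix, confirm the spill with duckdb_temporary_files() while the query runs.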

EXPLAIN ANALYZE
SELECT a.id, b.zone_name
FROM parcels a
JOIN zoning b ON a.geom && b.geom AND ST_Intersects(a.geom, b.geom)
WHERE b.zone_type = 'commercial';

Representative output for a disk-spilled execution:

Operator	Timing (ms)	Rows Produced	Memory (MB)	Spill (MB)
Projection	12.4	15,200	0.0	0.0
External Hash Join	145.2	15,200	412.5	890.0
Table Scan (parcels)	18.1	45,000	0.0	0.0
Table Scan (zoning)	14.3	12,500	0.0	0.0

Read the plan top-down against these thresholds:

External prefix on any join or sort — the query spilled. Acceptable for one-off batch jobs; a problem if it appears on every execution of a latency-sensitive query.
Non-zero Spill (MB) larger than Memory (MB) — the operator wrote more to disk than it held in RAM, so scratch throughput, not CPU, is the bottleneck.
Row-estimate drift — if the planner’s estimated rows for the join diverge sharply from rows produced, the bounding-box predicate is not pruning and the optimizer is sizing the hash table wrong; recheck that the && overlap is actually present in the ON clause.
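The three checks above are mechanical enough to script. The sketch below assumes each operator has already been reduced to a dict with the columns of the representative table plus an optional estimated_rows; the field names, the triage function, and the 10x drift factor are all hypothetical choices, since the text only says estimates should not "diverge sharply".

```python
def triage(op, drift_factor=10.0):
    """Return the list of findings for one plan operator (hypothetical helper)."""
    findings = []
    if op["name"].startswith("External"):
        findings.append("spilled")
    spill = op.get("spill_mb", 0.0)
    if spill > 0 and spill > op.get("memory_mb", 0.0):
        findings.append("scratch-bound")     # more written to disk than held in RAM
    est, act = op.get("estimated_rows"), op.get("rows")
    if est and act:
        ratio = max(est, act) / min(est, act)
        if ratio >= drift_factor:            # assumption: 10x counts as sharp drift
            findings.append("estimate-drift")
    return findings


plan = [
    {"name": "Projection", "rows": 15_200, "memory_mb": 0.0, "spill_mb": 0.0},
    {"name": "External Hash Join", "rows": 15_200, "memory_mb": 412.5,
     "spill_mb": 890.0, "estimated_rows": 2_000_000},
    {"name": "Table Scan (parcels)", "rows": 45_000, "memory_mb": 0.0, "spill_mb": 0.0},
]
for op in plan:
    print(op["name"], triage(op))
```

The estimated_rows value in the sample is invented for the illustration; the representative output does not include planner estimates.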

If spilling occurs on every run, the fixes in order of leverage are: push the bounding-box predicate earlier so fewer pairs reach the hash table, move temp_directory to faster storage such as NVMe, raise memory_limit if headroom exists, or reduce threads so fewer pipelines buffer and spill at the same time.

Performance Trade-offs

Routing between resident and disk-backed execution is a quantified decision, not a heuristic. The dominant levers are ingestion format, predicate placement, and coordinate-system uniformity.

Trade-off: ingestion format sets the floor for peak RAM. GeoJSON ingestion parses nested JSON row-by-row and inflates memory 3–5× over columnar input, so large GeoJSON reads spill almost immediately. GeoParquet reads geometry as binary WKB and defers deserialization, cutting peak RAM during the filter phase by up to ~70% — frequently the difference between a resident scan and an external one. Convert GeoJSON to GeoParquet upstream whenever a file will be queried more than once.

Trade-off: the bounding-box pre-filter is the single highest-leverage change for join memory. On realistic parcel-to-zoning workloads, the && envelope prune discards 60–90% of candidate pairs before any geometry is decoded, shrinking the hash table by the same proportion and often keeping it under memory_limit entirely. The cost is one cheap comparison per pair; the payoff is avoiding both WKB decode and a spill.

Trade-off: on-the-fly CRS transformation adds roughly 15–30% CPU and transient memory because each ST_Transform materializes intermediate coordinate arrays before topology runs. Normalizing both sides of a join to a shared CRS once, at ingest, removes that per-row overhead from every downstream query.

Use the following matrix to route a workload before you run it:

Workload Characteristic	In-Memory Execution	Disk-Spill Execution
Dataset Size	< 15GB uncompressed	> 15GB or unbounded
Join Type	Point-to-Polygon, <1M rows	Polygon-to-Polygon, complex topology
CRS State	Uniform, pre-validated	Mixed, requires `ST_Transform`
Temp Storage IOPS	N/A	NVMe, >50k IOPS
Memory Utilization	< 60% of `memory_limit`	> 60%, triggers external operators
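The matrix can also be applied programmatically before a job is scheduled. The following is a rough sketch only: the Workload fields and function names are hypothetical, and treating any single row as sufficient to route to disk is an assumption, because the matrix does not say how its rows combine.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    size_gb: float              # uncompressed dataset size
    polygon_to_polygon: bool    # False means point-to-polygon
    join_rows: int              # rows on the larger join side
    mixed_crs: bool             # True if ST_Transform is needed at query time
    memory_utilization: float   # working set as a fraction of memory_limit
    temp_iops: int = 0          # measured IOPS of the temp_directory volume


def spill_signals(w: Workload) -> list:
    """Names of the matrix rows that point at disk-spill execution."""
    signals = []
    if w.size_gb > 15:
        signals.append("dataset size")
    if w.polygon_to_polygon or w.join_rows >= 1_000_000:
        signals.append("join type")
    if w.mixed_crs:
        signals.append("CRS state")
    if w.memory_utilization > 0.60:
        signals.append("memory utilization")
    return signals


def route(w: Workload) -> dict:
    signals = spill_signals(w)
    mode = "disk-spill" if signals else "in-memory"
    # Scratch speed only matters once the workload is expected to spill.
    scratch_ok = mode == "in-memory" or w.temp_iops > 50_000
    return {"mode": mode, "signals": signals, "scratch_ok": scratch_ok}


print(route(Workload(4, False, 200_000, False, 0.35)))
print(route(Workload(40, True, 5_000_000, True, 0.80, temp_iops=20_000)))
```

A result with mode "disk-spill" and scratch_ok False is the case worth blocking on: the workload is expected to spill onto a volume slower than the matrix recommends.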

Edge Cases & Anti-Patterns

CRS drift silently defeats the bounding-box prune. DuckDB has no per-table SRID and cannot detect coordinate-system mismatches automatically. Join two layers that disagree on CRS (legacy EPSG codes on one side, WKT strings on the other) and their envelopes are compared in incompatible units. Either the boxes never overlap and the join silently returns nothing, or they overlap by coincidence and the prune passes pairs with no real spatial relationship, inflating the candidate set toward a cross product that spills hard. The fix is to normalize both layers explicitly before joining and to spot-check coordinate ranges:

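-- Anti-pattern: joining mismatched CRS layers (envelopes compared in different units).
-- Fix: normalize both sides to one CRS at ingest, naming the source explicitly.
-- 'EPSG:3857' is a placeholder source here: pass each layer's actual source CRS,
-- which by definition differs between mismatched layers.
-- always_xy := true keeps x = longitude, y = latitude in the EPSG:4326 output.
CREATE OR REPLACE TABLE parcels_norm AS
SELECT * EXCLUDE (geom),
       ST_Transform(geom, 'EPSG:3857', 'EPSG:4326', always_xy := true) AS geom
FROM parcels;

CREATE OR REPLACE TABLE zoning_norm AS
SELECT * EXCLUDE (geom),
       ST_Transform(geom, 'EPSG:3857', 'EPSG:4326', always_xy := true) AS geom
FROM zoning;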
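The coordinate-range spot check mentioned above can be automated with a simple heuristic. This sketch assumes you have already fetched each layer's overall extent as (xmin, ymin, xmax, ymax), for example from the minimum and maximum of ST_XMin, ST_YMin, ST_XMax, and ST_YMax; the function names are hypothetical, and the test is only a heuristic, since projected data close to the origin can also fall inside the degree range.

```python
def looks_geographic(xmin, ymin, xmax, ymax):
    """True if the extent fits inside longitude/latitude degree ranges."""
    return -180 <= xmin <= xmax <= 180 and -90 <= ymin <= ymax <= 90


def crs_mismatch_suspected(extent_a, extent_b):
    """Flag a pair of layers where one looks like degrees and the other does not."""
    return looks_geographic(*extent_a) != looks_geographic(*extent_b)


# Illustrative extents: one in degrees, one in Web Mercator metres.
parcels_extent = (-122.52, 37.70, -122.35, 37.83)
zoning_extent = (-13_640_000.0, 4_538_000.0, -13_620_000.0, 4_556_000.0)

print(crs_mismatch_suspected(parcels_extent, zoning_extent))
```

Running a check like this before the join turns a silent pruning failure into an explicit error at ingest time.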

Streaming raw GeoJSON into a repeated query. Pulling a large .geojson through st_read on every run pays the 3–5× parse penalty each time and spills on workloads that GeoParquet would hold in RAM. If a one-off read is unavoidable, at least project only the columns you need so the parser does not materialize unused properties:

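-- One-off GeoJSON read with explicit projection to cap the parser's footprint.
-- st_read exposes each feature property as its own column, so select it by name
-- (name stands for whichever property the file actually carries).
CREATE OR REPLACE TABLE parcels_stream AS
SELECT geom, name
FROM st_read('s3://bucket/parcels.geojson');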

Treating rasters as a native type. DuckDB has no raster type or TIFF reader; rasters must be converted to tabular pixel points or tiles upstream (for example via GDAL → Parquet) before ingestion. A single 10,000×10,000 three-band export can exceed ~900MB uncompressed, so chunk the conversion and load the result like any large vector table, then bound memory as usual. The full spill-threshold playbook for these workloads lives in memory limits for large raster data.

-- Load pre-converted raster pixels (GDAL → Parquet) as a normal vector table.
CREATE OR REPLACE TABLE raster_points AS
SELECT band, ST_Point(x, y) AS geom, value
FROM read_parquet('s3://bucket/ortho_pixels/*.parquet');
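Chunking the upstream conversion comes down to walking the raster in fixed-size windows. The sketch below shows the size arithmetic and a window generator in plain Python; the function names and the 4096-pixel tile size are hypothetical, and the 4 bytes per sample is an assumption because the sample type is not stated above. Reading each window is left to whatever GDAL-based tool performs the conversion.

```python
def raster_bytes(width, height, bands, bytes_per_sample):
    """Uncompressed size of the raw samples, before any per-row table overhead."""
    return width * height * bands * bytes_per_sample


def tile_windows(width, height, tile):
    """Yield (x_off, y_off, x_size, y_size) windows that cover the raster exactly once."""
    for y_off in range(0, height, tile):
        for x_off in range(0, width, tile):
            yield (x_off, y_off, min(tile, width - x_off), min(tile, height - y_off))


# Assumption: 4-byte samples (for example float32).
print(raster_bytes(10_000, 10_000, 3, 4) / 1e9, "GB of raw samples")

windows = list(tile_windows(10_000, 10_000, 4096))
print(len(windows), "windows; last one:", windows[-1])
```

Converting one window at a time keeps the converter's footprint bounded by the tile size rather than the full image, and each window can be written as its own Parquet file for the glob read shown above.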

Sharing one process across tenants without isolation. Multi-tenant deployments must keep the temp_directory on a dedicated, encrypted volume with restricted POSIX permissions, and should attach tenant databases read-only — DuckDB has no GRANT/role system, so file permissions plus READ_ONLY attachments are the isolation boundary. The hardened CLI baseline (deterministic environment, credential injection, audit logging) is covered in setting up the DuckDB Spatial CLI.

-- Read-only, per-file isolation for multi-tenant attachments.
ATTACH 'tenant_a.db' AS tenant_a (READ_ONLY);
ATTACH 'tenant_b.db' AS tenant_b (READ_ONLY);

Query Regression Analysis

Spill behaviour drifts as data grows, statistics change, or a refactor moves a predicate from the ON clause to a WHERE clause. Capture the plan as structured JSON, persist a baseline, and fail CI when an operator that used to run resident starts spilling.

import json

import duckdb

PROBE = """
EXPLAIN (FORMAT JSON)
SELECT a.id, b.zone_name
FROM parcels a
JOIN zoning b ON a.geom && b.geom AND ST_Intersects(a.geom, b.geom)
WHERE b.zone_type = 'commercial';
"""

def capture_plan(conn) -> dict:
    # EXPLAIN returns (explain_key, explain_value) rows; the JSON plan is the value.
    plan = json.loads(conn.execute(PROBE).fetchone()[1])
    nodes, stack = [], [plan]
    while stack:                       # walk the plan tree, flattening operator names
        node = stack.pop()
        if isinstance(node, list):     # the JSON root is a list of plan nodes
            stack.extend(node)
        elif isinstance(node, dict):
            name = node.get("name", "")
            if name:
                nodes.append(name)
            stack.extend(node.get("children", []))
    return {
        "operators": nodes,
        # Case-insensitive match, since plan output may upper-case operator names.
        "external": [n for n in nodes if n.upper().startswith("EXTERNAL")],
    }

def assert_no_new_spill(conn, baseline_path="plan_baseline.json"):
    current = capture_plan(conn)
    try:
        with open(baseline_path) as fh:
            baseline = json.load(fh)
    except FileNotFoundError:
        with open(baseline_path, "w") as fh:
            json.dump(current, fh)     # first run seeds the baseline
        return
    new_spills = set(current["external"]) - set(baseline["external"])
    assert not new_spills, f"Regression: new external (spilling) operators {new_spills}"

# "fixture.duckdb" is a hypothetical path: point it at the CI fixture database that
# holds parcels and zoning, because a fresh in-memory connection has neither table.
conn = duckdb.connect("fixture.duckdb")
conn.execute("INSTALL spatial; LOAD spatial;")
conn.execute("SET memory_limit = '4GB';")
assert_no_new_spill(conn)

Run this against a representative fixture in CI. A newly appearing External Hash Join is the earliest signal that a query crossed the memory boundary — far cheaper to catch in a pipeline than as a production latency cliff.

Conclusion

DuckDB Spatial’s execution model is adaptive, but production stability comes from making the in-memory-versus-disk decision explicit. Pin memory_limit, threads, and temp_directory; stage joins as bounding-box prune then exact topology; prefer columnar GeoParquet over row-wise GeoJSON; normalize CRS once at ingest; and read EXPLAIN ANALYZE for the External prefix that betrays a spill. With those guardrails and a plan-regression check in CI, workloads route deterministically between resident vectorized execution and graceful disk-backed spilling.

See also:

GeoParquet parsing in DuckDB Spatial — columnar ingestion that keeps reads memory-friendly.
CRS mapping and transformations — eliminate the transform overhead that pushes joins to disk.
R-tree spatial indexing internals — the locality model behind bounding-box pruning.
Memory limits for large raster data — spill thresholds for rasterized workloads.
Setting up the DuckDB Spatial CLI — hardened session and isolation baseline.

External Reference Standards: raster tiling and compression should follow the Open Geospatial Consortium (OGC) GeoTIFF specification; Python integrations should follow the official DuckDB Python API documentation for connection and transaction handling.
