In-Memory vs Disk Storage: Tactical Patterns for DuckDB Spatial Workflows
DuckDB’s vectorized execution engine defaults to aggressive in-memory processing, but geospatial workloads routinely violate that assumption. Complex polygon intersections, unbounded point clouds, and high-resolution rasters will exhaust RAM if query design, memory limits, and spill-to-disk thresholds are not explicitly managed. The decision between in-memory and disk-backed execution is not architectural dogma; it is a configuration and query-optimization problem. Understanding how DuckDB Spatial Architecture & Fundamentals manages buffer pools, operator fusion, and temporary storage is mandatory for production stability.
Memory Allocation & Thread Configuration
DuckDB Spatial inherits the core engine’s memory management model: it allocates contiguous blocks for columnar vectors, spills intermediate results to a temporary directory when thresholds are breached, and parallelizes spatial operators across available cores. In production, you must explicitly bound these behaviors using SET commands (replacing legacy PRAGMA syntax in v0.9+):
-- Production memory & thread configuration
SET memory_limit='8GB';
SET threads=8;
SET temp_directory='/mnt/duckdb-temp/spatial';
SET enable_progress_bar=true;
The memory_limit acts as a hard ceiling. When spatial joins or aggregations approach this limit, DuckDB automatically switches to external merge-sort and hash-join spilling. However, spatial operators (e.g., ST_Intersects, ST_DWithin) materialize bounding boxes and geometry arrays before evaluation. If your working set exceeds ~60% of memory_limit, expect I/O-bound execution. Monitor spill behavior using EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT a.id, b.zone_name
FROM parcels a
JOIN zoning b ON ST_Intersects(a.geom, b.geom)
WHERE b.zone_type = 'commercial';
Representative output for a disk-spilled execution:
| Operator | Timing (ms) | Rows Produced | Memory (MB) | Blocking IO (MB) | Spill (MB) |
|---|---|---|---|---|---|
| Projection | 12.4 | 15,200 | 0.0 | 0.0 | 0.0 |
| External Hash Join | 145.2 | 15,200 | 412.5 | 1,024.0 | 890.0 |
| Table Scan (parcels) | 18.1 | 45,000 | 0.0 | 0.0 | 0.0 |
| Table Scan (zoning) | 14.3 | 12,500 | 0.0 | 0.0 | 0.0 |
Inspect Physical Operators for Hash Join vs External Hash Join. The presence of External confirms disk spilling. If spilling occurs on every execution, increase temp_directory throughput (NVMe recommended) or pre-filter geometries using bounding box predicates before invoking exact topology functions.
Format-Specific Ingestion & Memory Footprints
GeoJSON ingestion is notoriously memory-inefficient. The format’s nested JSON structure forces row-by-row parsing, inflating memory usage by 3–5x compared to columnar alternatives. For production pipelines, convert GeoJSON to GeoParquet upstream or use DuckDB’s streaming reader with explicit chunking:
-- Streaming GeoJSON ingestion with memory caps
CREATE OR REPLACE TABLE parcels_stream AS
SELECT * FROM read_json_auto('s3://bucket/parcels.json',
maximum_object_size='50MB',
ignore_errors=true);
GeoParquet parsing leverages columnar compression, dictionary encoding, and predicate pushdown. The spatial extension reads geometry columns as binary WKB, deferring deserialization until spatial operators execute. This deferred evaluation reduces peak RAM by up to 70% during filtering phases. Detailed implementation specifics are covered in GeoParquet Parsing, which outlines how the extension handles coordinate precision and metadata retention during columnar reads.
Spatial Indexing Internals & Query Execution
DuckDB does not maintain persistent spatial indexes like PostGIS R-Trees; instead, it relies on in-memory Hilbert curve sorting and just-in-time bounding box filtering. When datasets exceed available RAM, the engine falls back to disk-based partitioning. The performance trade-off is explicit: in-memory Hilbert sorting enables sub-millisecond spatial joins for datasets under 10M features, while disk-backed execution introduces I/O latency proportional to partition count.
To optimize disk-spilled joins, enforce a two-phase filter:
- Phase 1 (Disk-Friendly): bounding-box pre-filter with the
&&operator (orST_DWithinwith coarse thresholds). - Phase 2 (In-Memory/Spilled): Exact topology checks (
ST_Intersects,ST_Contains).
import duckdb
conn = duckdb.connect()
conn.execute("SET memory_limit='4GB'")
# Force explicit partitioning hint via ORDER BY on spatial key
conn.execute("""
CREATE TABLE indexed_parcels AS
SELECT * FROM parcels ORDER BY ST_Hilbert(geom);
""")
Sorting by Hilbert ID clusters spatially proximate records, minimizing cross-partition I/O during disk spills.
CRS Mapping, Transformations & Drift Troubleshooting
On-the-fly coordinate reference system transformations impose significant CPU and memory overhead. Each ST_Transform call instantiates PROJ pipelines, materializing intermediate coordinate arrays before topology evaluation. When joining datasets with mismatched CRS definitions, DuckDB must either transform one side dynamically or fail with precision loss.
-- DuckDB has no per-table SRID; normalize both layers to a common CRS explicitly
CREATE OR REPLACE TABLE parcels_norm AS
SELECT * EXCLUDE (geom), ST_Transform(geom, 'EPSG:3857', 'EPSG:4326') AS geom FROM parcels;
CREATE OR REPLACE TABLE zoning_norm AS
SELECT * EXCLUDE (geom), ST_Transform(geom, 'EPSG:3857', 'EPSG:4326') AS geom FROM zoning;
If CRS drift occurs (e.g., legacy EPSG codes vs WKT strings), spatial joins degrade to full Cartesian products. Diagnostic boundaries for CRS-related memory bloat include monitoring ST_Transform execution time and verifying that both layers were transformed to a common CRS before joining. Refer to CRS Mapping & Transformations for authoritative guidance on projection caching and transformation pipeline optimization.
Raster Workloads & Large Object Handling
DuckDB has no native raster type or TIFF reader; rasters must be converted to tabular/vector form (pixel points or tiles) before ingestion. A single 10,000×10,000 3-band export can exceed ~900MB uncompressed, so chunk the conversion upstream and load the result like any large vector table:
-- Convert rasters to tabular form first (e.g. GDAL: gdal2xyz / gdal_translate
-- → Parquet), then load the pixel points and bound memory as usual.
CREATE OR REPLACE TABLE raster_points AS
SELECT band, ST_Point(x, y) AS geom, value
FROM read_parquet('s3://bucket/ortho_pixels/*.parquet');
For enterprise deployments, isolate raster processing in dedicated worker pools and cap raster memory allocation independently of vector workloads. Comprehensive thresholds and spill configurations are documented in Memory Limits for Large Raster Data. External raster processing standards, such as those defined by the Open Geospatial Consortium (OGC) GeoTIFF specification, should guide your tiling and compression strategies.
Enterprise Deployment & Security Patterns
Production DuckDB Spatial instances require strict isolation boundaries. The temp_directory must reside on a dedicated, encrypted volume with restricted POSIX permissions. Multi-tenant environments should leverage DuckDB’s read-only database attachments and explicit ATTACH statements to prevent cross-tenant data leakage.
-- Secure multi-tenant attachment
ATTACH 'tenant_a.db' AS tenant_a (READ_ONLY);
ATTACH 'tenant_b.db' AS tenant_b (READ_ONLY);
-- DuckDB has no GRANT/role system; READ_ONLY attachments plus per-file OS
-- permissions provide tenant isolation.
CLI initialization for enterprise workloads requires deterministic environment variables and hardened connection strings. The operational baseline is established in Setting Up DuckDB Spatial CLI, which details secure credential injection and audit logging. For Python integrations, enforce connection pooling and explicit transaction boundaries using the official DuckDB Python API documentation.
Diagnostic Boundaries & Decision Matrix
The choice between in-memory and disk execution should follow explicit thresholds, not heuristics. Use the following diagnostic matrix to route workloads:
graph TD
A["Spatial workload"] --> B{"Dataset < 15 GB<br/>and memory use < 60%?"}
B -->|"yes"| C["In-memory<br/>vectorized execution"]
B -->|"no"| D["Set temp_directory (NVMe)"]
D --> E["Disk-backed spilling<br/>external hash join / sort"]
| Workload Characteristic | In-Memory Execution | Disk-Spill Execution |
|---|---|---|
| Dataset Size | < 15GB uncompressed | > 15GB or unbounded |
| Join Type | Point-to-Polygon, <1M rows | Polygon-to-Polygon, complex topology |
| CRS State | Uniform, pre-validated | Mixed, requires ST_Transform |
| Temp Storage IOPS | N/A | NVMe, >50k IOPS |
| Memory Utilization | < 60% of memory_limit |
> 60%, triggers external operators |
When EXPLAIN ANALYZE reports External Hash Join or External Order By, verify that temp_directory throughput matches the query’s spill rate. If I/O wait exceeds 40%, reduce threads to prevent lock contention on the spill buffer, or refactor the query to apply spatial filters earlier in the execution pipeline.
Conclusion
DuckDB Spatial’s execution model is highly adaptive, but production stability requires explicit configuration boundaries. By aligning memory limits, ingestion formats, and spatial indexing strategies with your hardware profile, you can deterministically route workloads between in-memory vectorized execution and disk-backed spilling. Monitor spill operators, enforce CRS consistency, and isolate raster workloads to maintain predictable latency and throughput at scale.