Memory Limits for Large Raster Data
Root-Cause Analysis: Heap Exhaustion in Raster Pipelines
DuckDB’s spatial extension has no raster type and no raster reader: ST_Read is a GDAL/OGR vector reader, not a raster one. The memory problem therefore appears one step later: rasters must be converted to a tabular form (pixel points, tiles, or zonal statistics), and that tabular product can be enormous. A single 10,000×10,000 3-band image expands to 300M rows when every band of every pixel becomes a row; loading it eagerly into DuckDB without bounds triggers std::bad_alloc or the OS OOM killer.
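The row-count arithmetic above can be sketched in a few lines. This is a rough estimate only: the function names are hypothetical, and the 24 bytes per row (two float64 coordinates plus one float64 value) is an assumption about the tabular layout, not something GDAL or DuckDB guarantees.

```python
def pixel_row_count(width: int, height: int, bands: int) -> int:
    """Rows produced when every band of every pixel becomes one table row."""
    return width * height * bands


def estimated_bytes(rows: int, bytes_per_row: int = 24) -> int:
    """Uncompressed size estimate.

    bytes_per_row=24 is an assumption: x and y as float64 (16 bytes)
    plus one float64 band value (8 bytes), ignoring any overhead.
    """
    return rows * bytes_per_row


rows = pixel_row_count(10_000, 10_000, 3)
print(f"{rows:,} rows, ~{estimated_bytes(rows) / 1e9:.1f} GB uncompressed")
# prints: 300,000,000 rows, ~7.2 GB uncompressed
```

Under these assumptions the raw table is already larger than a 6GB memory budget, which is why the conversion has to stream and the engine has to be allowed to spill.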
As documented in DuckDB Spatial Architecture & Fundamentals, the engine prioritizes vectorized analytical throughput. The fix is to keep the raster→table conversion outside DuckDB (GDAL CLI), write compressed Parquet, and then read it under explicit memory governance so the engine can spill rather than accumulate the full footprint in the process heap.
Configuration Governance: Session & Environment Parameters
Bound DuckDB’s memory and route intermediate results to a high-throughput NVMe-backed temporary directory. As covered in In-Memory vs Disk Storage, the point at which a workload moves from memory to disk determines its stability under constrained infrastructure.
Apply these session-level parameters before loading the converted pixel tables:
SET memory_limit = '6GB';
SET temp_directory = '/mnt/duckdb_io/temp';
-- enable_external_access is true by default and cannot be switched back on in a running session, so it is not set here
SET preserve_insertion_order = false;
SET threads = 4;
Configure GDAL’s cache through environment variables for the preprocessing step (the GDAL CLI, not DuckDB). Reference the official GDAL Configuration Options for cache semantics:
export GDAL_CACHEMAX=512
export GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
export VSI_CACHE=TRUE
export VSI_CACHE_SIZE=536870912
These settings cap GDAL’s block cache at 512 MB during conversion, skip directory listings for sidecar files when opening datasets on remote object storage, and enable a 512 MiB read cache for VSI file access. DuckDB then reads the resulting Parquet and spills intermediate blocks to temp_directory rather than accumulating them in RAM.
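When the GDAL CLI is driven from Python rather than a shell, the same variables can be passed per subprocess instead of exported globally. The following is a minimal sketch; gdal_env is a hypothetical helper, and the defaults simply mirror the values above.

```python
import os


def gdal_env(cache_mb: int = 512, vsi_cache_mb: int = 512) -> dict:
    """Return a copy of the current environment with GDAL cache options set."""
    env = dict(os.environ)
    env.update({
        "GDAL_CACHEMAX": str(cache_mb),                   # small values are read as megabytes
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",      # skip sidecar directory listings
        "VSI_CACHE": "TRUE",
        "VSI_CACHE_SIZE": str(vsi_cache_mb * 1024 ** 2),  # expressed in bytes
    })
    return env


env = gdal_env()
print(env["GDAL_CACHEMAX"], env["VSI_CACHE_SIZE"])
# prints: 512 536870912
```

The returned dictionary can be handed to subprocess.run(command, env=env, check=True) so the limits apply only to the conversion step and not to the rest of the process.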
Diagnostic Queries & Runtime Telemetry
Validate configuration application and monitor memory pressure before executing production pipelines.
-- Verify active session parameters
SELECT name, value
FROM duckdb_settings()
WHERE name IN ('memory_limit', 'temp_directory', 'threads', 'enable_external_access');
-- Inspect the plan for the pixel-table scan (look for spill / high-cardinality nodes)
EXPLAIN ANALYZE
SELECT band, count(*) FROM read_parquet('/data/ortho_pixels/*.parquet') GROUP BY band;
-- Monitor process RSS during execution (Python-side)
-- import psutil, os; print(psutil.Process(os.getpid()).memory_info().rss / 1024**3)
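EXPLAIN ANALYZE reports per-operator timing and row counts rather than memory, so pair it with the RSS check: if RSS climbs past roughly 75% of memory_limit while the scan runs, the pixel table is too large for the budget. Reduce threads to 2 and read fewer partitions per query (filter by tile or band).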
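One way to apply the 75% rule is to compare the process RSS from the psutil snippet against the configured limit. The sketch below uses hypothetical helper names and assumes that decimal suffixes (GB) are powers of 1000 and binary suffixes (GiB) powers of 1024. RSS also counts memory outside DuckDB's control, so treat the result as a warning signal rather than an exact measure.

```python
import re

# Assumption: decimal units are powers of 1000, binary units powers of 1024.
_UNITS = {
    "KB": 1000, "MB": 1000 ** 2, "GB": 1000 ** 3, "TB": 1000 ** 4,
    "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "TiB": 1024 ** 4,
}


def parse_memory_limit(text: str) -> int:
    """Convert a limit string such as '6GB' or '5.5 GiB' to bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]i?B)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized memory limit: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def over_budget(rss_bytes: int, memory_limit: str, threshold: float = 0.75) -> bool:
    """True when resident memory exceeds the given fraction of the limit."""
    return rss_bytes > threshold * parse_memory_limit(memory_limit)


print(over_budget(5_000_000_000, "6GB"))  # prints: True
print(over_budget(4_000_000_000, "6GB"))  # prints: False
```

In a pipeline, rss_bytes would come from psutil.Process(os.getpid()).memory_info().rss, sampled while the query is running.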
Reproducible Workflow: External Conversion & CRS Correction
DuckDB cannot open a GeoTIFF directly. Convert it to pixel-point Parquet with the GDAL CLI (reprojecting and tiling there), then load the Parquet under DuckDB’s memory limits.
# 1. Reproject + tile into a Cloud Optimized GeoTIFF (streaming, bounded cache)
gdalwarp -t_srs EPSG:4326 -of COG \
-co BLOCKSIZE=2048 -co COMPRESS=ZSTD \
/data/large_raster.tif /data/large_raster_4326.tif
# 2. Emit pixel coordinates + values as comma-separated XYZ, then let DuckDB write Parquet
gdal2xyz.py -band 1 -csv /data/large_raster_4326.tif /data/pixels_band1.csv
import duckdb
con = duckdb.connect()
con.execute("SET memory_limit = '6GB';")
con.execute("SET temp_directory = '/mnt/duckdb_io/temp';")
con.execute("SET threads = 4;")
con.execute("SET preserve_insertion_order = false;")
con.execute("INSTALL spatial; LOAD spatial;")
# Load the GDAL-produced pixel CSV, build point geometry, write compressed Parquet.
con.execute("""
COPY (
SELECT
ST_Point(column0, column1) AS geom, -- x, y from gdal2xyz
column2 AS band_value
FROM read_csv('/data/pixels_band1.csv', header = false)
) TO '/output/processed_raster.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
""")
con.close()
As described in GeoParquet Parsing, the COPY statement streams compressed columnar row groups to disk instead of materializing the full result in the heap. If downstream consumers require vector overlays, Spatial Indexing Internals recommends building spatial indexes after ingestion, on a DuckDB table loaded from the Parquet output, not during conversion.
Incident Resolution & Fallback Routing
If OOM termination persists after applying baseline configurations, execute the following fallback sequence:
- Thread & Cache Reduction: Lower threads to 2 and set GDAL_CACHEMAX=256. High thread counts multiply decompression buffers linearly during conversion.
- Explicit Tiling: Convert in tiles with gdal_retile.py so each Parquet partition stays small: gdal_retile.py -ps 2048 2048 -targetDir /data/tiles /data/large_raster_4326.tif
- CRS Drift Troubleshooting: Mismatched EPSG codes or missing .aux.xml sidecars produce silently shifted coordinates. Validate the source CRS with gdalinfo -json /data/input.tif before conversion. See CRS Mapping & Transformations.
- I/O Path Verification: Confirm temp_directory resides on NVMe storage with >500 MB/s sequential write throughput. HDD-backed temp paths cause pipeline stalls that manifest as heap exhaustion.
- Enterprise Deployment Patterns: In containerized environments, mount temp_directory as a dedicated volume with noexec,nosuid flags and enforce strict chmod 0750 on spill directories.
- Read COG Tiles On Demand: For very large mosaics, keep the COG on object storage and convert only the tiles a query needs (filter by bounding box during gdalwarp -te ...), then read_parquet the per-tile output.
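To make the tiling step concrete, the pixel windows for a 2048-pixel grid can be enumerated up front. This is a sketch with a hypothetical helper name; gdal_retile.py computes its own grid, so the function only illustrates how many tiles to expect and how the edge tiles shrink.

```python
from typing import Iterator, Tuple


def tile_windows(width: int, height: int, tile: int = 2048) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (xoff, yoff, xsize, ysize) windows covering the raster row by row."""
    for yoff in range(0, height, tile):
        for xoff in range(0, width, tile):
            # Edge tiles are clipped to the raster extent.
            yield xoff, yoff, min(tile, width - xoff), min(tile, height - yoff)


windows = list(tile_windows(10_000, 10_000))
print(len(windows), windows[-1])
# prints: 25 (8192, 8192, 1808, 1808)
```

Each tuple has the same shape as the -srcwin xoff yoff xsize ysize argument of gdal2xyz.py, so a 10,000×10,000 raster can be converted as 25 bounded jobs instead of one large one.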
Reference the DuckDB Configuration Reference for parameter precedence rules. Session-level SET statements override options supplied at connection time and built-in defaults.