GeoParquet Parsing: Production-Grade Workflows in DuckDB Spatial
GeoParquet parsing is the foundational I/O operation for modern analytical GIS pipelines. DuckDB Spatial implements a vectorized, columnar parser that decodes WKB geometry directly from Parquet row groups, bypassing traditional row-by-row deserialization. This architecture enables sub-second scans over billion-row spatial datasets while preserving coordinate precision and topology. Understanding the underlying DuckDB Spatial Architecture & Fundamentals is mandatory for tuning ingestion latency, preventing silent geometry corruption, and aligning parser behavior with enterprise SLAs.
Memory Allocation & Spill Control
The GeoParquet parser operates on a streaming, chunk-based execution model. By default, DuckDB reads WKB columns into memory and decodes them into native GEOMETRY values before passing data to downstream operators. This decoding checks byte structure, not topological validity, so ST_IsValid() remains a separate step. When working sets exceed the configured memory_limit, the engine spills to disk. Production workloads require explicit control over this pipeline to avoid unpredictable latency spikes.
-- Pre-parse configuration for high-throughput ingestion
SET threads = 16;
SET memory_limit = '12GB';
SET preserve_insertion_order = false;
SET enable_progress_bar = false; -- Eliminates UI overhead in headless ETL
-- Spill location used when operators exceed memory_limit (e.g., large spatial joins)
SET temp_directory = '/mnt/nvme_scratch/duckdb_temp';
Monitor spill behavior during parsing using PRAGMA database_size;, SELECT * FROM duckdb_memory();, and SELECT * FROM duckdb_temporary_files();. If duckdb_temporary_files() grows rapidly during a simple SELECT * FROM read_parquet(...), the working set exceeds RAM and the query is spilling heavily. The architectural trade-off between In-Memory vs Disk Storage dictates whether your pipeline runs at memory-bandwidth speeds (tens of GB/s for DRAM) or becomes bound by storage throughput (typically a few GB/s per NVMe device). As a starting point, size memory_limit to about 1.5× the largest uncompressed geometry column and place temp_directory on a dedicated NVMe volume.
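The 1.5× sizing rule above can be turned into a small helper. This is a minimal sketch, not part of DuckDB: the function names `suggest_memory_limit_gb` and `config_statements` are hypothetical, and the uncompressed column size is assumed to be known already (for example from Parquet metadata). It uses decimal gigabytes (10^9 bytes) and rounds up.

```python
import math
from typing import List


def suggest_memory_limit_gb(largest_geom_column_bytes: int, factor: float = 1.5) -> int:
    """Return a memory_limit in whole decimal GB: factor x column size, rounded up."""
    if largest_geom_column_bytes <= 0:
        raise ValueError("column size must be positive")
    return math.ceil(largest_geom_column_bytes * factor / 10**9)


def config_statements(largest_geom_column_bytes: int, temp_directory: str) -> List[str]:
    """Build the SET statements for a session (hypothetical helper)."""
    limit_gb = suggest_memory_limit_gb(largest_geom_column_bytes)
    return [
        f"SET memory_limit = '{limit_gb}GB';",
        f"SET temp_directory = '{temp_directory}';",
    ]


if __name__ == "__main__":
    # An 8 GB uncompressed geometry column yields the 12GB limit used earlier.
    for stmt in config_statements(8_000_000_000, "/mnt/nvme_scratch/duckdb_temp"):
        print(stmt)
```

The rule is a starting point only; joins and aggregations over the parsed geometries can need more headroom than the scan itself.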
CRS Validation & Transformation Pipelines
GeoParquet stores CRS information in the file-level geo metadata key, as a per-column crs field encoded as PROJJSON (GeoParquet 1.0 and later; earlier drafts used WKT). DuckDB Spatial reads this metadata to recognize geometry columns, but enterprise datasets frequently suffer from CRS drift: embedded metadata conflicts with actual coordinate ranges, or legacy exports strip projection definitions. Parsing without validation propagates silent coordinate misalignment into analytical models.
-- Parse with immediate CRS enforcement and drift detection
CREATE TABLE parcels_parsed AS
SELECT
-- GeoParquet records the CRS in file metadata; DuckDB geometries carry no inline
-- SRID, so transform from the known source CRS (EPSG:4326 here) to the target.
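-- always_xy := true keeps (longitude, latitude) axis order, which is how GeoParquet stores coordinates.
ST_Transform(geom, 'EPSG:4326', 'EPSG:3857', always_xy := true) AS geom_3857,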
parcel_id,
area_sqm,
ST_IsValid(geom) AS is_valid
FROM read_parquet('s3://data-lake/parcels/*.parquet', union_by_name = true);
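To make the metadata side of CRS drift concrete, the sketch below reads a GeoParquet geo metadata document and reports which CRS the primary geometry column declares. It assumes the JSON string has already been extracted from the file; the sample document and the function name `describe_crs` are illustrative, and the PROJJSON object is heavily truncated. Per the GeoParquet specification, a missing crs field means OGC:CRS84 and an explicit null means the CRS is unknown.

```python
import json

# Illustrative metadata document; a real PROJJSON crs object is much longer.
SAMPLE_GEO_METADATA = json.dumps({
    "version": "1.0.0",
    "primary_column": "geom",
    "columns": {
        "geom": {
            "encoding": "WKB",
            "geometry_types": ["Polygon"],
            "crs": {"type": "GeographicCRS", "name": "WGS 84",
                    "id": {"authority": "EPSG", "code": 4326}},
            "bbox": [-74.05, 40.68, -73.90, 40.88],
        }
    },
})


def describe_crs(geo_metadata_json: str) -> dict:
    """Summarize encoding, CRS, and bbox of the primary geometry column."""
    meta = json.loads(geo_metadata_json)
    column = meta["primary_column"]
    col_meta = meta["columns"][column]
    if "crs" not in col_meta:
        crs = "OGC:CRS84 (default)"      # field absent: spec default
    elif col_meta["crs"] is None:
        crs = "unknown"                  # explicit null: CRS undefined
    else:
        ident = col_meta["crs"].get("id") or {}
        if "authority" in ident and "code" in ident:
            crs = f"{ident['authority']}:{ident['code']}"
        else:
            crs = col_meta["crs"].get("name", "unnamed PROJJSON")
    return {
        "column": column,
        "encoding": col_meta.get("encoding"),
        "crs": crs,
        "bbox": col_meta.get("bbox"),
    }


if __name__ == "__main__":
    print(describe_crs(SAMPLE_GEO_METADATA))
```

Comparing the reported CRS with the source CRS passed to ST_Transform catches the case where the query hard-codes a CRS the file does not actually declare.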
When troubleshooting CRS drift, inspect the raw geo metadata using SELECT * FROM parquet_kv_metadata('s3://data-lake/parcels/file.parquet'); (parquet_metadata() returns row group and column statistics rather than key-value metadata) and cross-reference coordinate bounds against expected extents. DuckDB delegates projection math to the PROJ library, which handles datum shifts and ellipsoid transformations. For comprehensive projection mapping strategies, consult CRS Mapping & Transformations. Always check ST_IsValid() after transforming; invalid geometries introduced during reprojection will corrupt downstream spatial joins and indexing operations.
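The cross-reference of coordinate bounds against expected extents can be sketched as a plain range check. The extent table, the `detect_drift` name, and the sample bounds are assumptions for illustration; the EPSG:3857 limits are approximate. The check is one-directional: degree-sized values also fit inside the Web Mercator extent, so it can flag metre-scale coordinates labeled as EPSG:4326 but not the reverse.

```python
from typing import Dict, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

# Approximate valid extents per CRS (illustrative, not exhaustive).
CRS_EXTENTS: Dict[str, BBox] = {
    "EPSG:4326": (-180.0, -90.0, 180.0, 90.0),
    "EPSG:3857": (-20037508.35, -20048966.11, 20037508.35, 20048966.11),
}


def within_extent(bounds: BBox, crs: str) -> bool:
    """True when the bounds lie entirely inside the CRS's expected extent."""
    exmin, eymin, exmax, eymax = CRS_EXTENTS[crs]
    xmin, ymin, xmax, ymax = bounds
    return exmin <= xmin <= xmax <= exmax and eymin <= ymin <= ymax <= eymax


def detect_drift(bounds: BBox, declared_crs: str) -> Optional[str]:
    """Return a warning when bounds cannot belong to the declared CRS, else None."""
    if within_extent(bounds, declared_crs):
        return None
    candidates = [c for c in CRS_EXTENTS
                  if c != declared_crs and within_extent(bounds, c)]
    hint = f"; values fit {', '.join(candidates)}" if candidates else ""
    return f"bounds {bounds} fall outside {declared_crs}{hint}"


if __name__ == "__main__":
    # Metre-scale coordinates declared as degrees are flagged.
    print(detect_drift((-8243000.0, 4965000.0, -8226000.0, 4995000.0), "EPSG:4326"))
```

The bounds would typically come from an aggregate such as min(ST_XMin(geom)) and max(ST_XMax(geom)) over the parsed column.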
Indexing Internals & Query Planning
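GeoParquet files do not contain spatial indexes. DuckDB Spatial does not derive H3 or quadtree indexes from the file; the index structure it provides is the R-Tree, created explicitly on a table with CREATE INDEX ... USING RTREE, and recent releases can also build a temporary R-Tree inside the spatial join operator. Pruning at scan time relies on Parquet row group statistics: WHERE predicates on ordinary columns, or on the per-row bounding-box columns that GeoParquet 1.1 allows, are compared against min/max values so that non-matching row groups can be skipped before their WKB is decoded, which can substantially reduce I/O.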
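The idea of skipping row groups from bounding-box statistics can be illustrated independently of any engine. The sketch below is not DuckDB's implementation: `RowGroupBBox` and `prune_row_groups` are hypothetical names, and the per-row-group boxes are assumed to have been derived from min/max statistics.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass(frozen=True)
class RowGroupBBox:
    """Bounding box covering every geometry in one row group (assumed input)."""
    index: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(rg: RowGroupBBox, query: BBox) -> bool:
    """Axis-aligned box overlap test; touching edges count as overlap."""
    qxmin, qymin, qxmax, qymax = query
    return not (rg.xmax < qxmin or rg.xmin > qxmax or
                rg.ymax < qymin or rg.ymin > qymax)


def prune_row_groups(row_groups: List[RowGroupBBox], query: BBox) -> List[int]:
    """Return indexes of row groups that may contain matching geometries."""
    return [rg.index for rg in row_groups if intersects(rg, query)]


if __name__ == "__main__":
    groups = [
        RowGroupBBox(0, 0.0, 0.0, 10.0, 10.0),
        RowGroupBBox(1, 20.0, 20.0, 30.0, 30.0),
        RowGroupBBox(2, 8.0, 8.0, 22.0, 22.0),
    ]
    # Only row groups 0 and 2 overlap the query window, so group 1 is never decoded.
    print(prune_row_groups(groups, (5.0, 5.0, 9.0, 9.0)))
```

Pruning is only as effective as the spatial clustering of the file: if every row group spans the whole extent, no group can be skipped.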
EXPLAIN ANALYZE
SELECT a.parcel_id, b.zoning_code
FROM parcels_parsed a
JOIN zoning_zones b ON ST_Intersects(a.geom_3857, b.geom_3857)
WHERE a.area_sqm > 500;
Illustrative EXPLAIN output (operator names and timings vary by DuckDB version and data):
HASH_JOIN · INNER (Rows: 142,893 · 412 ms)
  Build: zoning_zones (R-Tree)
  Probe: parcels_parsed (H3)
The trade-off is explicit: index construction adds ~50-150ms overhead per million geometries, but reduces full table scans by 80-95% on selective spatial predicates. When migrating from legacy formats, the GeoParquet vs Shapefile Performance differential is primarily driven by columnar compression and predicate pushdown, not raw geometry parsing speed.
Cross-Format Interop & Enterprise Deployment
GeoJSON ingestion requires schema inference and string-to-WKB conversion, introducing ~3-5× CPU overhead compared to native GeoParquet. Use read_json_auto() only for ad-hoc exploration; production pipelines should normalize to Parquet upstream.
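import duckdb

con = duckdb.connect()
# ST_GeomFromGeoJSON needs the spatial extension; s3:// paths need httpfs
for ext in ("spatial", "httpfs"):
    con.install_extension(ext)
    con.load_extension(ext)

# Batch conversion: parse GeoJSON text into GEOMETRY and write ZSTD-compressed Parquet.
# The target below is a prefix; point it at a .parquet object (or add PARTITION_BY)
# depending on how the output should be laid out.
con.execute("""
    COPY (
        SELECT
            ST_GeomFromGeoJSON(geojson_col) AS geom,
            feature_id
        FROM read_json_auto('s3://ingest/geojson/*.json')
    ) TO 's3://warehouse/normalized/parquet/' (FORMAT PARQUET, COMPRESSION ZSTD)
""")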
For enterprise data lakes, GeoParquet files can sit underneath table formats that support ACID transactions and time travel. The Integrating DuckDB with Apache Iceberg for GIS pattern enables partition pruning on spatial extents while preserving GeoParquet’s columnar efficiency. Security boundaries must be enforced at the storage layer: use IAM role assumption for S3 access and enforce TLS 1.3 for all remote reads. DuckDB is an embedded engine without its own user accounts or row-level security, so per-user row filtering has to be applied by the service that issues the queries (for example through views it controls) rather than through current_user() predicates. The official GeoParquet specification defines required metadata fields; files that omit or malform them may fail to parse or may surface geometry as plain binary, so validate metadata before ingestion.
Diagnostic Boundaries & Troubleshooting Matrix
| Symptom | Root Cause | Diagnostic Command | Resolution |
|---|---|---|---|
| ParserException: Invalid WKB | Corrupted byte sequence or endian mismatch | SELECT hex(geom) FROM read_parquet(...) LIMIT 1; | Re-export source data with explicit little-endian WKB |
| MemoryError: Allocation failed | memory_limit < working set geometry size | SELECT * FROM duckdb_memory(); | Raise memory_limit, or set temp_directory so the engine can spill to disk |
| Silent coordinate shift | CRS drift or missing PROJ grid files | SELECT ST_XMin(geom), ST_XMax(geom) FROM ...; | Validate against known extents; install proj-data |
| High CPU, low I/O | GeoJSON parsing or missing row group stats | EXPLAIN (ANALYZE) ... | Convert to GeoParquet; regenerate Parquet statistics |
| Permission denied on S3 | Missing IAM credentials or expired token | SELECT * FROM duckdb_settings() WHERE name LIKE 's3_%'; | Configure s3_access_key_id / s3_secret_access_key or use AWS credential provider |
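For the Invalid WKB row of the matrix, the byte order and geometry type can be read straight from the first five bytes of a WKB value. The sketch below uses only the standard library; `describe_wkb_header` and `point_wkb` are hypothetical helper names, and the point builder exists solely to produce test input.

```python
import struct


def describe_wkb_header(wkb: bytes) -> dict:
    """Decode the WKB byte-order flag and the 4-byte geometry type code."""
    if len(wkb) < 5:
        raise ValueError("WKB header needs at least 5 bytes")
    flag = wkb[0]
    if flag == 1:
        order, fmt = "little-endian", "<I"
    elif flag == 0:
        order, fmt = "big-endian", ">I"
    else:
        raise ValueError(f"invalid byte-order flag: {flag}")
    (geometry_type,) = struct.unpack_from(fmt, wkb, 1)
    return {"byte_order": order, "geometry_type": geometry_type}


def point_wkb(x: float, y: float, little_endian: bool = True) -> bytes:
    """Build a 2D point WKB (type code 1) in the requested byte order."""
    prefix = "<" if little_endian else ">"
    return struct.pack(prefix + "BIdd", 1 if little_endian else 0, 1, x, y)


if __name__ == "__main__":
    print(describe_wkb_header(point_wkb(1.0, 2.0)))
    print(describe_wkb_header(point_wkb(1.0, 2.0, little_endian=False)))
```

Hex output from the diagnostic query can be passed through bytes.fromhex() before calling `describe_wkb_header`. A flag other than 0 or 1 in the first byte usually means the column holds something other than standard WKB.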
Execution Boundaries:
- < 10M rows: In-memory parsing with default settings. No temp directory required.
- 10M–500M rows: Set memory_limit = '8GB', enable temp_directory on NVMe. Disable progress bar.
- > 500M rows or distributed joins: Partition GeoParquet files by spatial extent (e.g., H3 resolution 6). Use Iceberg metadata tables for partition pruning.
- GeoJSON ingestion: Cap batch size at 100k features per transaction. Convert to Parquet immediately post-ingest.
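The boundaries above can be encoded as a small lookup so that a pipeline picks its settings from a row count. This is a sketch under assumptions: `execution_tier` and `batches` are hypothetical helpers, the tier labels are invented for illustration, and the temp directory path is the one used earlier in this guide.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def execution_tier(row_count: int, distributed_join: bool = False) -> dict:
    """Map a row count onto the execution boundaries listed above."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if distributed_join or row_count > 500_000_000:
        return {"tier": "partitioned",
                "actions": ["partition files by spatial extent",
                            "prune partitions via table metadata"]}
    if row_count >= 10_000_000:
        return {"tier": "spill-enabled",
                "actions": ["SET memory_limit = '8GB';",
                            "SET temp_directory = '/mnt/nvme_scratch/duckdb_temp';",
                            "SET enable_progress_bar = false;"]}
    return {"tier": "in-memory", "actions": []}


def batches(features: Iterable, size: int = 100_000) -> Iterator[List]:
    """Yield lists of at most `size` features, matching the GeoJSON batch cap."""
    it = iter(features)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    print(execution_tier(42_000_000)["tier"])
    print([len(b) for b in batches(range(250_000))])
```

Treating the thresholds as inclusive at 10M and 500M is a choice made here; the list above does not say which tier the exact boundary values belong to.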
Adhering to these boundaries ensures deterministic latency, prevents silent topology degradation, and aligns spatial I/O with enterprise data governance standards.