GeoParquet Parsing in DuckDB Spatial

GeoParquet parsing is the foundational I/O operation for analytical GIS pipelines, and this guide covers the specific workflow of turning a columnar Parquet file with embedded geometry into validated, queryable GEOMETRY columns inside DuckDB. It sits under the broader DuckDB Spatial architecture and fundamentals reference and focuses on one operator chain: read_parquet() row-group pruning, WKB decode, CRS reconciliation, and materialization. DuckDB’s parser is vectorized and columnar — it decodes well-known binary (WKB) directly from Parquet row groups rather than deserializing one feature object at a time — which is what lets a single process scan hundred-million-row spatial datasets in seconds while preserving coordinate precision. The sections below give a reproducible session setup, the canonical two-stage read pattern, plan-validation thresholds, quantified trade-offs, the anti-patterns that silently corrupt geometry, and a regression harness you can wire into CI.

Runtime Configuration & Memory Guardrails

The GeoParquet parser runs on a streaming, chunk-based execution model: it materializes WKB columns in 2,048-row vectors, applies geometry validation, projects native GEOMETRY types, and only spills to disk when the working set exceeds the configured ceiling. Production ingestion needs explicit control over that pipeline so latency stays deterministic instead of oscillating with whatever the defaults infer from your machine.

-- Minimal reproducible session for high-throughput GeoParquet ingestion.
SET threads = 16;                 -- Saturates vectorized WKB decode until memory
                                  -- bandwidth caps out; beyond that, extra threads
                                  -- only raise peak RAM, they do not raise throughput.
SET memory_limit = '12GB';        -- Size BELOW total RAM: geometry decode plus any
                                  -- downstream ST_Buffer/ST_Union can multiply the
                                  -- working set well past the on-disk column size.
SET preserve_insertion_order = false; -- Lets the scan reorder row groups freely,
                                      -- cutting buffering on large multi-file reads.
SET enable_progress_bar = false;  -- Removes UI/stderr overhead in headless ETL.

-- Disk-backed scratch for spills during large parses and joins.
SET temp_directory = '/mnt/nvme_scratch/duckdb_temp'; -- Put this on NVMe: when the
                                  -- parser spills, scratch I/O becomes the latency
                                  -- floor, so a slow volume erases the columnar win.

Watch spill behaviour with two queries: SELECT * FROM duckdb_memory() and SELECT * FROM duckdb_temporary_files(). If duckdb_temporary_files() grows during a plain SELECT * FROM read_parquet(...), the working set is exceeding memory_limit and the parser is thrashing rather than streaming. Whether your read runs at memory-bandwidth speed or becomes I/O-bound is governed by the in-memory versus disk storage trade-off; size memory_limit to hold the largest uncompressed geometry column and keep temp_directory on a dedicated volume so a spill degrades gracefully instead of cliff-edging.
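
As a rough sketch of how the spill check could be automated, the helper below totals the rows returned by duckdb_temporary_files() and flags any read that spilled. The function name spill_summary and the (path, size) row shape are assumptions for illustration; confirm the column order your DuckDB version returns before relying on it.

```python
def spill_summary(temp_file_rows):
    """Summarize spill activity from (path, size_in_bytes) rows.

    Assumes each row has a file path followed by its size in bytes,
    which is the shape this sketch expects from duckdb_temporary_files().
    """
    total = sum(size for _path, size in temp_file_rows)
    return {
        "files": len(temp_file_rows),
        "bytes": total,
        "spilled": total > 0,
    }


# Hypothetical sample rows, shaped like the result of
# con.execute("SELECT * FROM duckdb_temporary_files()").fetchall()
sample = [
    ("/mnt/nvme_scratch/duckdb_temp/a.tmp", 262144),
    ("/mnt/nvme_scratch/duckdb_temp/b.tmp", 524288),
]
summary = spill_summary(sample)
```

Calling this before and after a read, and comparing the two "bytes" values, gives a simple signal that a plain scan has started spilling.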

Primary Execution Pattern

Treat parsing as two stages: a pruning stage that uses Parquet footer statistics to skip row groups before any WKB is touched, and a decode-and-reconcile stage that materializes survivors into GEOMETRY, reconciles the coordinate reference system, and stamps validity. Separating them is what keeps the cost proportional to rows returned rather than rows stored.
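
To make the pruning stage concrete, here is a minimal simulation of footer-statistics pruning for a predicate of the form value > threshold. It is not DuckDB's implementation; RowGroupStats and surviving_row_groups are hypothetical names, and the statistics are made-up numbers chosen to show the three possible cases.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Simplified footer statistics for one row group (illustrative only)."""
    min_value: float
    max_value: float
    num_rows: int


def surviving_row_groups(stats, threshold):
    """Return indexes of row groups that may hold a row with value > threshold.

    A group is skipped only when its max is <= threshold, because then no
    row inside it can satisfy the predicate.
    """
    return [i for i, rg in enumerate(stats) if rg.max_value > threshold]


footer = [
    RowGroupStats(10.0, 480.0, 100_000),    # max <= 500: skipped entirely
    RowGroupStats(200.0, 900.0, 100_000),   # straddles 500: must be decoded
    RowGroupStats(650.0, 4000.0, 100_000),  # entirely above 500: decoded
]
kept = surviving_row_groups(footer, 500.0)
decoded_rows = sum(footer[i].num_rows for i in kept)
```

The straddling group shows why pruning is coarse: it is decoded in full even though only part of it matches, so the row-level filter still has to run on the survivors.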

GeoParquet records its coordinate reference system in the geo file-metadata block (WKT2 or PROJJSON); DuckDB reads that block automatically, but the in-memory GEOMETRY carries no inline SRID, so any reprojection must name the source CRS explicitly. The canonical read folds pruning, decode, transform, and validation into one statement:

-- Parse with predicate pushdown, explicit CRS reconciliation, and a validity flag.
CREATE TABLE parcels_parsed AS
SELECT
    ST_Transform(geom, 'EPSG:4326', 'EPSG:3857', always_xy := true) AS geom_3857,
                                  -- source → target; GeoParquet metadata names the
                                  -- source, the GEOMETRY itself does not, so pass
                                  -- EPSG:4326 explicitly. always_xy keeps lon/lat
                                  -- (x, y) order, which is how GeoParquet stores it.
    parcel_id,
    area_sqm,
    ST_IsValid(geom) AS is_valid -- stamp validity at ingest so downstream joins can
                                 -- route invalid rows instead of failing mid-plan.
FROM read_parquet('s3://data-lake/parcels/*.parquet', union_by_name = true)
WHERE area_sqm > 500;            -- pushed into the scan: row groups whose footer
                                 -- max(area_sqm) <= 500 are skipped before WKB decode.
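
For intuition about what the EPSG:4326 to EPSG:3857 step computes, the sketch below applies the spherical Web Mercator forward formula to a single longitude/latitude pair. It is an illustration only: the function name is hypothetical, it takes longitude first, and it is no substitute for ST_Transform.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, the sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


origin = lonlat_to_web_mercator(0.0, 0.0)
east_edge = lonlat_to_web_mercator(180.0, 0.0)
```

Swapping the two arguments produces coordinates that are still numerically valid but land in the wrong place, which is why axis order is worth checking whenever a transform involves EPSG:4326.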

The reprojection math is delegated to the PROJ engine, so datum shifts and ellipsoid transforms stay rigorous; the projection-mapping strategies and unit pitfalls live in the CRS mapping and transformations guide. When a source arrives as GeoJSON instead of Parquet, normalize it first — string-to-WKB conversion is far costlier than columnar decode, and the dedicated GeoJSON ingestion workflow covers the conversion path:

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")

# One-time normalization: GeoJSON -> GeoParquet with ZSTD + schema enforcement.
con.execute("""
    COPY (
        SELECT
            st_geomfromgeojson(feature->>'geometry') AS geom,
            (feature->'properties'->>'feature_id')::INTEGER AS feature_id
        FROM read_json_auto('s3://ingest/geojson/*.json')
    ) TO 's3://warehouse/normalized/parquet/' (FORMAT PARQUET, COMPRESSION ZSTD)
""")

Materializing the result as a native GEOMETRY column (rather than leaving raw WKB BLOBs) lets the planner read a cached bounding box without re-decoding vertices on every touch; the storage trade-off between the two is detailed in ST_Geometry versus WKB storage.

Execution Plan Validation

GeoParquet files carry no spatial index, so the planner builds bounding-box filtering at query time and pushes scalar predicates down to the scan using the row-group min/max statistics in the Parquet footer. Confirm both behaviours with EXPLAIN ANALYZE before trusting throughput numbers:

EXPLAIN ANALYZE
SELECT a.parcel_id, b.zoning_code
FROM parcels_parsed a
JOIN zoning_zones b ON ST_Intersects(a.geom_3857, b.geom_3857)
WHERE a.area_sqm > 500;

Read the plan against these expectations:

PARQUET_SCAN / READ_PARQUET with Filters: applied — the area_sqm > 500 predicate must appear on the scan node, not only in a later FILTER. If it surfaces only downstream, the footer statistics were not used and every row group was decoded.
rows_scanned close to rows_returned for the scalar predicate: near-equality means pruning is working; a large gap means row groups were decoded and then filtered out, so it is not.
An MBR FILTER ahead of the exact ST_Intersects — the cheap bounding-box overlap must short-circuit before exact topology runs. A naive spatial join over cardinalities $N$ and $M$ degrades to $O(N \times M)$ vertex comparisons; the MBR pre-filter is what makes it tractable, and the structures behind it are documented in the spatial indexing internals reference.
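
The bounding-box short-circuit in the last expectation can be sketched in a few lines. This is a toy nested-loop version with hypothetical helper names, not the planner's actual join operator: it still compares every pair, but each comparison is four float tests rather than a walk over vertices, and only the surviving pairs would go on to the exact predicate.

```python
def bbox(points):
    """Minimum bounding rectangle of a list of (x, y) vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def bboxes_overlap(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes share any area or edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(left, right):
    """Return (i, j) index pairs whose bounding boxes overlap."""
    left_boxes = [bbox(g) for g in left]
    right_boxes = [bbox(g) for g in right]
    return [
        (i, j)
        for i, lb in enumerate(left_boxes)
        for j, rb in enumerate(right_boxes)
        if bboxes_overlap(lb, rb)
    ]


# Made-up square rings standing in for parcels and zoning polygons.
parcels = [[(0, 0), (2, 0), (2, 2), (0, 2)], [(10, 10), (12, 10), (12, 12), (10, 12)]]
zones = [[(1, 1), (3, 1), (3, 3), (1, 3)], [(50, 50), (51, 50), (51, 51), (50, 51)]]
pairs = candidate_pairs(parcels, zones)
```

Here one of the four possible pairs survives the box test, so the exact intersection check would run once instead of four times.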

Diagnostic threshold: if the scan reports rows_scanned within roughly 10% of total file rows on a selective query, predicate pushdown failed — usually because the Parquet file was written without per-row-group statistics, or because the predicate wraps the column in a function the optimizer cannot push (for example WHERE floor(area_sqm) > 500). Rewrite the predicate to keep the column bare, or regenerate the file with statistics enabled.
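
The threshold above is simple arithmetic, and a small helper makes it easy to assert on in a script. The function name and the default tolerance parameter are assumptions for illustration; how you extract the scanned-row count from the plan depends on your DuckDB version and output format.

```python
def pushdown_suspect(rows_scanned, total_rows, tolerance=0.10):
    """Flag a scan that read within `tolerance` of every row in the file.

    Only meaningful for a selective query, where a healthy scan should
    read far fewer rows than the file holds.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return rows_scanned >= (1.0 - tolerance) * total_rows


near_full_scan = pushdown_suspect(9_500_000, 10_000_000)
pruned_scan = pushdown_suspect(1_200_000, 10_000_000)
```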

Performance Trade-offs

The numbers below frame when each variant is worth it:

Predicate pushdown vs full decode: on a selective scalar filter, footer-stat pruning skips 80–95% of row groups before any WKB is decoded. The win collapses on low-selectivity predicates (returning most rows) and on files written as one giant row group — keep row groups in the 100k–1M-row range so statistics stay granular.
CRS reconciliation overhead: ST_Transform adds roughly 15–30% CPU to a parse-and-load depending on geometry vertex count. Transform once at ingest and persist the target CRS rather than reprojecting on every analytical query.
GeoParquet vs converting from GeoJSON: GeoJSON ingestion carries 3–5x the CPU of native GeoParquet decode because of schema inference and string-to-WKB conversion. Reserve st_read() / st_geomfromgeojson() for ad-hoc exploration; normalize to Parquet upstream for anything recurring.
GeoParquet vs Shapefile: the format gap is driven by columnar compression and predicate pushdown, not raw geometry parsing speed — the full breakdown lives in GeoParquet versus Shapefile performance.

Execution boundaries by scale:

< 10M rows: in-memory parse with defaults; no temp_directory needed.
10M–500M rows: set memory_limit = '8GB', enable temp_directory on NVMe, disable the progress bar.
> 500M rows or distributed joins: Hive-partition the GeoParquet directory by spatial extent (bounding-box tiles) so the scan prunes whole partitions, not just row groups.
GeoJSON ingestion: cap at ~100k features per transaction and convert to Parquet immediately after ingest.
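
One way to keep these boundaries out of people's heads is to encode them in a small lookup. The sketch below mirrors the row-count tiers listed above; the function name, the dictionary keys, and the choice to carry the 8GB limit into the largest tier are assumptions, not settings DuckDB prescribes.

```python
def ingest_profile(row_count):
    """Map a row count to the scale tiers described in the text."""
    if row_count < 10_000_000:
        return {
            "tier": "in-memory",
            "memory_limit": None,          # defaults are fine at this scale
            "needs_temp_directory": False,
            "partition_by_extent": False,
        }
    if row_count <= 500_000_000:
        return {
            "tier": "spill-enabled",
            "memory_limit": "8GB",
            "needs_temp_directory": True,
            "partition_by_extent": False,
        }
    return {
        "tier": "partitioned",
        "memory_limit": "8GB",             # assumption: carried over from the middle tier
        "needs_temp_directory": True,
        "partition_by_extent": True,
    }


small = ingest_profile(5_000_000)
medium = ingest_profile(10_000_000)
large = ingest_profile(500_000_001)
```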

Edge Cases & Anti-Patterns

Most production failures cluster into a few repeatable causes. Each row pairs the symptom with the diagnostic command and the fix.

Symptom	Root cause	Diagnostic command	Resolution
`ParserException: Invalid WKB`	Corrupted byte sequence or endian mismatch	`SELECT hex(geom) FROM read_parquet(...) LIMIT 1;`	Re-export source with explicit little-endian WKB
`OutOfMemoryException`	`memory_limit` below working-set geometry size	`SELECT * FROM duckdb_memory();`	Raise `memory_limit`, or set `temp_directory` so the engine can spill
Silent coordinate shift	CRS drift or missing PROJ grid files	`SELECT ST_XMin(geom), ST_XMax(geom) FROM ...;`	Validate against known extents; install the `proj-data` grids
High CPU, low I/O	GeoJSON parsing or missing row-group statistics	`EXPLAIN ANALYZE ...`	Convert to GeoParquet; regenerate Parquet statistics
`rows_scanned` ≈ total rows	Predicate wrapped in a function, or stats absent	`EXPLAIN ANALYZE ...` (inspect scan `Filters:`)	Keep the column bare in the predicate; rewrite or re-encode the file
`Permission denied` on S3	Missing IAM credentials or expired token	`SELECT * FROM duckdb_settings() WHERE name LIKE 's3_%';`	Configure via `CREATE SECRET` or environment variables
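
For the endian-mismatch row, the byte order of a raw WKB value is visible in its first byte, so a quick header check can confirm what an exporter actually wrote. The sketch below follows the standard WKB layout (one byte-order byte, then a 4-byte geometry type); wkb_header is a hypothetical helper, and the sample blobs are built by hand rather than read from a file.

```python
import struct


def wkb_header(blob):
    """Return (byte_order, geometry_type_code) from the first 5 bytes of WKB."""
    if len(blob) < 5:
        raise ValueError("WKB needs at least 5 bytes")
    flag = blob[0]
    if flag == 1:
        order, fmt = "little", "<I"
    elif flag == 0:
        order, fmt = "big", ">I"
    else:
        raise ValueError(f"invalid byte-order flag: {flag}")
    (geom_type,) = struct.unpack(fmt, blob[1:5])
    return order, geom_type


# Hand-built WKB for POINT(1 2): order flag, type code 1, then two float64 values.
point_le = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
point_be = struct.pack(">BIdd", 0, 1, 1.0, 2.0)
le_header = wkb_header(point_le)
be_header = wkb_header(point_be)
```

A first byte other than 0 or 1 points at corruption rather than an endianness difference, which narrows down which resolution in the table applies.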

Two anti-patterns deserve a worked example. First, CRS drift — trusting embedded metadata that disagrees with the actual coordinate ranges. Inspect the raw block and cross-check bounds before transforming:

-- Confirm the embedded CRS matches the data's real extent before reprojecting.
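SELECT decode(value) AS geo_metadata          -- the 'geo' key holds the CRS block
FROM parquet_kv_metadata('s3://data-lake/parcels/file.parquet')
WHERE decode(key) = 'geo';
SELECT min(ST_XMin(geom)) AS xmin, max(ST_XMax(geom)) AS xmax,
       min(ST_YMin(geom)) AS ymin, max(ST_YMax(geom)) AS ymax
FROM read_parquet('s3://data-lake/parcels/file.parquet');
-- For EPSG:4326, x must fall within [-180, 180] and y within [-90, 90].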

If the extent is in the thousands while metadata claims EPSG:4326, the file is mislabeled (often projected metres tagged as degrees); reproject from the true source CRS rather than the recorded one. Second, skipping post-transform validation — reprojection can introduce self-intersections, so always re-stamp validity and route failures instead of letting them poison downstream joins:

-- Anti-pattern: assume transformed geometry is still valid.
-- Fix: re-check after ST_Transform and repair only the rows that fail.
SELECT parcel_id,
       CASE WHEN ST_IsValid(geom_3857) THEN geom_3857
            ELSE ST_MakeValid(geom_3857) END AS geom_safe
FROM parcels_parsed;
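
Returning to the first anti-pattern, the extent cross-check reduces to a range test that is easy to run on the aggregated bounds before any reprojection. The helper below is a sketch with a hypothetical name; it only detects bounds that cannot be degrees, so passing it does not prove the recorded CRS is correct.

```python
def plausible_epsg4326(xmin, ymin, xmax, ymax):
    """True when the bounds could be longitude/latitude in degrees."""
    return (
        -180.0 <= xmin <= xmax <= 180.0
        and -90.0 <= ymin <= ymax <= 90.0
    )


# Made-up extents: one in degrees, one in projected metres tagged as degrees.
degrees_extent = plausible_epsg4326(-122.5, 37.7, -122.3, 37.9)
metres_extent = plausible_epsg4326(550000.0, 4180000.0, 560000.0, 4190000.0)
```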

Query Regression Analysis

Parsing performance regresses quietly: a re-exported source loses row-group statistics, a predicate gets wrapped in a cast, or a CRS change inflates decode cost — and the query still returns correct rows, just slower. Capture the plan as a baseline and diff it in CI so the regression surfaces as a failed check rather than a pager alert.

import duckdb, hashlib, json, pathlib, re

def capture_plan(con, sql):
    """Return a normalized fingerprint of the physical plan + scan accounting."""
    rows = con.execute("EXPLAIN ANALYZE " + sql).fetchall()
    plan = "\n".join(r[1] for r in rows)
    pushdown = "Filters:" in plan          # scalar predicate reached the scan
    exact = plan.count("ST_Intersects")    # exact-topology predicates in the plan
    # Strip wall-clock timings (assumed to print like "0.01s") so the hash
    # tracks plan shape rather than run-to-run timing noise.
    stable = re.sub(r"\d+(?:\.\d+)?s\b", "<t>", plan)
    digest = hashlib.sha256(stable.encode()).hexdigest()[:12]
    return {"pushdown": pushdown, "exact_predicates": exact, "plan_hash": digest}

def assert_no_regression(con, sql, baseline_path):
    current = capture_plan(con, sql)
    p = pathlib.Path(baseline_path)
    if not p.exists():
        p.parent.mkdir(parents=True, exist_ok=True)   # baselines/ may not exist yet
        p.write_text(json.dumps(current, indent=2))   # first run seeds the baseline
        return
    baseline = json.loads(p.read_text())
    # Pushdown silently turning off is the highest-signal parse regression.
    assert current["pushdown"] == baseline["pushdown"], (
        f"predicate pushdown changed: {baseline['pushdown']} -> {current['pushdown']}"
    )

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")
assert_no_regression(
    con,
    "SELECT parcel_id FROM read_parquet('parcels.parquet') WHERE area_sqm > 500",
    "baselines/parcels_parse.json",
)

Wire this into the same job that runs your batch processing pipelines: seed a baseline per critical read, fail the build when pushdown flips off, and refresh the baseline deliberately when you intend a plan change. For the handoff into analysis frameworks, the DuckDB-to-GeoPandas sync workflow covers moving parsed geometry into Python without a second WKB round-trip.

See also

GeoParquet versus Shapefile performance — format-level root-cause comparison.
CRS mapping and transformations — reconciling projections during the decode stage.
GeoJSON ingestion — the normalization path when the source is not Parquet.
In-memory versus disk storage — sizing the spill boundary for large parses.
Spatial indexing internals — how MBR filtering accelerates joins over parsed geometry.


External Reference Standards

GeoParquet specification v1.1.0 — https://geoparquet.org/releases/v1.1.0/ (column metadata and CRS encoding requirements).
PROJ — https://proj.org/ (datum and ellipsoid transformation engine used by ST_Transform).
