GeoParquet vs Shapefile Performance

Migrating a spatial read path from Shapefile to GeoParquet replaces the dominant costs of analytical GIS queries, geometry deserialization and full-file scanning, with vectorized, prune-aware columnar reads. This guide isolates where that cost lives and how to remove it. It sits under the GeoParquet parsing guide within the broader DuckDB Spatial architecture and fundamentals reference, and focuses on one decision: when a legacy .shp source is the bottleneck, what does converting to GeoParquet actually change in the execution plan, and how do you prove it?

Root-Cause Analysis

The performance gap is not a tuning artifact; it is structural. Four distinct failure modes make a Shapefile scan slow, and each one disappears for a different reason once the data is GeoParquet.

1. Multi-file row-oriented I/O. A Shapefile is a composite of .shp (geometry), .shx (offset index), .dbf (attributes), and .prj (projection). Reading one feature forces synchronized seeks across three file descriptors, and because the layout is row-oriented the engine cannot project only the columns a query touches. A SELECT id, ST_Area(geom) still pays to read every attribute in the .dbf. GeoParquet stores geometry and attributes as independent columns in a single file, so column pruning loads only the geometry and filtered attribute columns into the execution buffer.

2. Per-row deserialization. The Shapefile format encodes geometry as packed binary records with no self-describing schema; every vertex must be unpacked sequentially and every attribute parsed from a fixed-width .dbf record. GeoParquet stores geometry as Well-Known Binary (WKB) in a dedicated column that DuckDB's Parquet reader decodes in vectorized batches, the same vectorized decode described for GeoParquet parsing. The cost difference here is the same class of penalty seen in GeoJSON ingestion, where JSON tokenization and dynamic type resolution dominate; binary columnar reads avoid both.

3. No embedded statistics, no pushdown. A Shapefile carries no row-group metadata, so a spatial predicate cannot skip data: the engine reconstructs the .shx offset table in memory and then scans. A GeoParquet file can carry per-row-group min/max bounding-box statistics in the Parquet footer (in GeoParquet 1.1, through the optional bbox covering column), which the optimizer reads at planning time to skip irrelevant row groups before any geometry is decoded.

4. No native spatial index. Shapefiles have no in-payload index; spatial predicates (ST_Intersects, ST_Contains) trigger full sequential scans unless an external .qix is generated and kept current. GeoParquet’s footer statistics provide coarse pruning for free, and once data is resident you can layer an R-tree index on the materialized GEOMETRY column for fine-grained filtering. Whether a column is a native GEOMETRY or an opaque WKB BLOB also governs whether the planner can prune at all — see ST_Geometry vs WKB.

A fifth, quieter failure mode is CRS handoff: a Shapefile’s .prj is frequently missing or mismatched during ETL, producing silent coordinate drift, whereas GeoParquet embeds OGC CRS metadata in the geo schema key that DuckDB’s CRS transformation pipeline reads at plan time.
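
The R-tree layering mentioned in item 4 can be sketched as follows. This is a minimal sketch, assuming a DuckDB release whose spatial extension supports RTREE indexes; the table name parcels_resident, the index name, and the generated point grid are hypothetical stand-ins for GeoParquet data that has already been materialized into a native table.

```sql
INSTALL spatial; LOAD spatial;

-- Hypothetical resident table: a 100 x 100 grid of points standing in for parcels.
CREATE OR REPLACE TABLE parcels_resident AS
SELECT i AS id,
       ST_Point(i % 100, i // 100) AS geom
FROM range(10000) t(i);

-- Build the R-tree on the native GEOMETRY column.
CREATE INDEX parcels_resident_rtree ON parcels_resident USING RTREE (geom);

-- Inspect the plan for a predicate against a constant geometry.
EXPLAIN
SELECT count(*)
FROM parcels_resident
WHERE ST_Intersects(
        geom,
        ST_GeomFromText('POLYGON((0 0,10 0,10 10,0 10,0 0))'));
```

Whether the planner actually chooses the index depends on the DuckDB version and the shape of the predicate, so treat the EXPLAIN output, not the presence of the index, as confirmation.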

Deterministic Configuration

This page’s patterns need only a small, predictable session. Set boundaries before any bulk read so latency does not drift with whatever the defaults infer from the host.

INSTALL spatial; LOAD spatial;

SET threads = 8;                 -- Saturates vectorized WKB decode; past the point
                                 -- where memory bandwidth caps out, more threads
                                 -- only raise peak RAM without raising throughput.
SET memory_limit = '12GB';       -- Keep BELOW total RAM: a Shapefile fallback via
                                 -- st_read materializes whole rows, so the working
                                 -- set can exceed the on-disk size during conversion.
SET preserve_insertion_order = false; -- Lets the scan reorder row groups freely,
                                      -- cutting buffering on multi-file Parquet reads.
SET enable_http_metadata_cache = true; -- Caches HTTP metadata (file size, headers) for
                                       -- repeated S3/HTTP reads, so each query spends
                                       -- fewer round-trips before fetching the footer.

If you are reading from a remote store, footer and statistics access is what enables pruning, and every metadata round-trip adds latency before planning can finish. With enable_http_metadata_cache off, repeated queries re-issue those metadata requests, which erodes part of the GeoParquet advantage on small, frequent queries. Memory sizing follows the same logic detailed for in-memory vs disk storage: leave headroom for downstream ST_Buffer/ST_Union that can multiply the working set.
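
Before running any bulk read, it can help to confirm that the session actually picked up these values. The query below is a small sketch that uses DuckDB's built-in duckdb_settings() table function; the choice of which three settings to list is only an illustration.

```sql
-- List the effective values of the settings configured above.
SELECT name, value
FROM duckdb_settings()
WHERE name IN ('threads', 'memory_limit', 'preserve_insertion_order')
ORDER BY name;
```

The memory limit may be reported in different units than the ones it was set with, so compare magnitudes rather than exact strings.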

Optimized Execution Pattern

The migration is rarely a one-line swap, because the slow Shapefile habit — read everything, then filter — survives the format change unless the predicate is written so the planner can push it down. The contrast below is the behavioral change that matters.

-- BEFORE: legacy Shapefile read. No statistics, no pushdown — every feature is
-- decoded and every .dbf column is read before the spatial filter runs.
CREATE TABLE hits AS
SELECT id, ST_Area(geom) AS area
FROM st_read('/data/parcels.shp')          -- full sequential scan via GDAL/OGR
WHERE ST_Intersects(
        geom,
        ST_GeomFromText('POLYGON((0 0,1 0,1 1,0 1,0 0))'));

-- AFTER: GeoParquet read. The bounding box of the query polygon is matched against
-- row-group footer statistics, so non-overlapping row groups are skipped before any
-- WKB is decoded, and only the two referenced columns are read off disk.
CREATE TABLE hits AS
SELECT id, ST_Area(geometry) AS area       -- column pruning: id + geometry only
FROM read_parquet('s3://bucket/parcels/*.parquet')
WHERE ST_Intersects(
        geometry,
        ST_GeomFromText('POLYGON((0 0,1 0,1 1,0 1,0 0))'));  -- bbox → row-group skip

The single load-bearing difference is that read_parquet exposes the footer to the optimizer while st_read does not. The polygon literal in the WHERE clause gives the planner a bounding box to test against each row group’s min/max statistics; non-overlapping groups never touch the decoder. Partitioned layouts compound this — writing the data as /year=2024/month=01/data.parquet lets directory-based partition pruning eliminate whole files before footer evaluation even begins. None of this is reachable from a Shapefile source, which is why the conversion, not just the query, is the optimization.
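
Whether ST_Intersects alone is lowered into a row-group filter depends on the DuckDB version and on how the file was written. One way to make the coarse filter explicit is a plain numeric predicate on a bounding-box column, which is an ordinary min/max comparison. The sketch below assumes a struct column named bbox with xmin, ymin, xmax, ymax fields (the GeoParquet 1.1 covering-column convention); the local file name and the generated points are hypothetical stand-ins for the S3 dataset.

```sql
INSTALL spatial; LOAD spatial;

-- Hypothetical local file with a bbox column next to the geometry.
COPY (
  SELECT i AS id,
         ST_Point(i % 100, i // 100) AS geometry,
         {'xmin': (i % 100)::DOUBLE,  'ymin': (i // 100)::DOUBLE,
          'xmax': (i % 100)::DOUBLE,  'ymax': (i // 100)::DOUBLE} AS bbox
  FROM range(10000) t(i)
) TO 'parcels_bbox_demo.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 2048);

-- Coarse numeric bbox test first, exact spatial test second.
SELECT count(*) AS hits
FROM read_parquet('parcels_bbox_demo.parquet')
WHERE bbox.xmin <= 1 AND bbox.xmax >= 0      -- query box x-range [0, 1]
  AND bbox.ymin <= 1 AND bbox.ymax >= 0      -- query box y-range [0, 1]
  AND ST_Intersects(
        geometry,
        ST_GeomFromText('POLYGON((0 0,1 0,1 1,0 1,0 0))'));
```

The bbox comparisons are redundant with the spatial predicate by design: they do not change the result, they only give the scan something it can compare against row-group statistics.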

Diagnostic Queries & Plan Validation

Prove the pushdown rather than assuming it. The plan node and its filter flag are the ground truth.

EXPLAIN ANALYZE
SELECT id, ST_Area(geometry)
FROM read_parquet('s3://bucket/parcels/*.parquet')
WHERE ST_Intersects(geometry,
        ST_GeomFromText('POLYGON((0 0,1 0,1 1,0 1,0 0))'));

Read the output top-down:

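The scan should appear as a Parquet scan node (PARQUET_SCAN, or READ_PARQUET in some DuckDB versions), not SEQ_SCAN or ST_READ. A SEQ_SCAN means the query resolved to a materialized table, and ST_READ means it went through the GDAL/OGR reader (a stale st_read fallback, for example).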
Look for Filters / pushdown on the scan node. If pushdown is absent, the predicate was not lowered into the scan — verify the spatial predicate is in the WHERE clause (not applied after a CTE materialization) and that the file carries valid geo metadata.
Compare Rows scanned against total table rows. A healthy spatial filter shows a row-group skip rate well above zero; if the skip rate is ~0% the footer statistics are missing or the query bbox covers the whole dataset.

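Three one-liners confirm the file is actually prune-capable before you trust the plan:

-- Confirm the geometry column is present and check its physical and logical type.
SELECT name, type, logical_type FROM parquet_schema('parcels.parquet')
WHERE name = 'geometry';

-- Confirm the GeoParquet "geo" metadata key (which carries the CRS) is present.
SELECT decode(value) AS geo_metadata FROM parquet_kv_metadata('parcels.parquet')
WHERE decode(key) = 'geo';

-- Confirm row groups carry statistics. Empty/NULL stats = no pushdown possible.
SELECT row_group_id, row_group_num_rows, path_in_schema, stats_min_value, stats_max_value
FROM parquet_metadata('parcels.parquet') LIMIT 10;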

If parquet_metadata returns row groups but the plan still scans everything, the writer emitted geometry without bounding-box statistics; regenerate the file with a GeoParquet-compliant writer (for example pyarrow with the geo metadata block) so the footer carries usable min/max extents.
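
For reference, it can be useful to see what healthy statistics look like on a file you control. The following sketch writes a small, hypothetical file (stats_demo.parquet, with a numeric ymin column standing in for one bounding-box field), lists its per-row-group min/max values, and then runs a filtered scan under EXPLAIN ANALYZE. Names and sizes here are illustrative assumptions, not part of the parcels dataset.

```sql
-- Write 10,000 sorted rows in small row groups so the min/max ranges are disjoint.
COPY (
  SELECT i AS id, (i // 100)::DOUBLE AS ymin
  FROM range(10000) t(i)
) TO 'stats_demo.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 2048);

-- One line per row group: non-NULL, non-overlapping ranges are what pruning needs.
SELECT row_group_id, path_in_schema, stats_min_value, stats_max_value
FROM parquet_metadata('stats_demo.parquet')
WHERE path_in_schema = 'ymin'
ORDER BY row_group_id;

-- A selective filter on the sorted column; compare rows scanned with the total.
EXPLAIN ANALYZE
SELECT count(*) FROM read_parquet('stats_demo.parquet') WHERE ymin <= 1;
```

If the real file shows NULL statistics where this one shows ranges, the problem is in how the file was written rather than in the query.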

Geometry Validation & Fallback Routing

Format conversion is also the moment invalid geometry surfaces, because the conversion re-parses every feature and downstream spatial functions are far less forgiving than a Shapefile reader that silently tolerated malformed rings. Guard the boundary, and keep a deterministic fallback for corrupt footers.

-- Shapefile fallback: if a GeoParquet footer is corrupt or schema drift breaks the
-- read, route ingestion back through the GDAL/OGR reader so the pipeline continues.
CREATE TABLE legacy_fallback AS
SELECT * FROM st_read('/data/legacy.shp');

-- Validate before promoting into the analytical path.
SELECT count(*) FILTER (WHERE NOT ST_IsValid(geom)) AS invalid_count
FROM legacy_fallback;

When invalid_count is non-zero, repair with a tolerance-aware pass rather than discarding rows. ST_MakeValid resolves self-intersections and unclosed rings that would otherwise abort an ST_Intersects mid-scan:

CREATE TABLE parcels_clean AS
SELECT * EXCLUDE (geom),
       CASE WHEN ST_IsValid(geom) THEN geom
            ELSE ST_MakeValid(geom) END AS geom
FROM legacy_fallback;
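
To see what the repair pass does to a single feature, the sketch below runs the same two functions on a deliberately self-intersecting "bowtie" polygon. The sample geometry is a hypothetical test case, not data from the parcels set.

```sql
INSTALL spatial; LOAD spatial;

WITH sample AS (
  -- The ring crosses itself at (0.5, 0.5), which makes the polygon invalid.
  SELECT ST_GeomFromText('POLYGON((0 0,1 1,1 0,0 1,0 0))') AS geom
)
SELECT ST_IsValid(geom)               AS valid_before,
       ST_IsValid(ST_MakeValid(geom)) AS valid_after,
       ST_AsText(ST_MakeValid(geom))  AS repaired_wkt
FROM sample;
```

The repaired geometry can come back as a different geometry type than the input (a polygon may become a multipolygon), so downstream code that assumes a single type should be checked after the repair.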

For datasets large enough to threaten memory_limit during conversion, write the output partitioned by a coordinate-derived grid cell so that no single output file has to hold the whole dataset. This bounds the write side only; the source table is already materialized at this point:

-- Write per grid cell so the Shapefile → GeoParquet conversion emits small, prunable files.
-- Reads the repaired table so invalid rows are fixed rather than dropped.
COPY (
  SELECT *, CAST(floor(ST_X(ST_Centroid(geom)) / 1.0) AS INTEGER) AS gx,  -- 1.0 = cell size in CRS units
            CAST(floor(ST_Y(ST_Centroid(geom)) / 1.0) AS INTEGER) AS gy
  FROM parcels_clean
) TO 's3://bucket/parcels/' (FORMAT PARQUET, PARTITION_BY (gx, gy));

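This writes a partition-pruned GeoParquet dataset directly, turning the one-time conversion into the same layout the optimized read pattern above depends on. Because the files land in nested gx=<n>/gy=<n> subdirectories, read them back with a recursive glob such as s3://bucket/parcels/**/*.parquet.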
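
The round trip can be rehearsed locally before pointing it at a bucket. The sketch below is a minimal, assumed setup: a hypothetical twelve-point table, a local grid_demo directory instead of S3, and a read-back that filters on the partition columns so only one directory is touched.

```sql
INSTALL spatial; LOAD spatial;

-- Hypothetical sample: twelve points spread over a 3 x 2 grid of unit cells.
CREATE OR REPLACE TABLE pts AS
SELECT i AS id,
       ST_Point(i % 3 + 0.5, i % 2 + 0.5) AS geom
FROM range(12) t(i);

-- Partitioned write: one directory per (gx, gy) cell.
COPY (
  SELECT *, CAST(floor(ST_X(geom)) AS INTEGER) AS gx,
            CAST(floor(ST_Y(geom)) AS INTEGER) AS gy
  FROM pts
) TO 'grid_demo' (FORMAT PARQUET, PARTITION_BY (gx, gy), OVERWRITE_OR_IGNORE);

-- Read back a single cell; the filter on gx/gy is resolved from directory names.
SELECT count(*) AS rows_in_cell
FROM read_parquet('grid_demo/**/*.parquet', hive_partitioning = true)
WHERE gx = 0 AND gy = 0;
```

A cell size of one unit is only sensible for this toy data; for real parcels the divisor should be chosen so that each partition holds enough rows to form at least one reasonably sized row group.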

See also:

GeoJSON ingestion — the other legacy text-format read path and why its tokenization cost mirrors Shapefile’s per-row decode.
ST_Geometry vs WKB — why native GEOMETRY enables the pushdown that an opaque WKB BLOB blocks.

External Reference Standards

OGC GeoParquet Specification v1.1.0 — https://geoparquet.org/releases/v1.1.0/ (schema-level CRS and bounding-box requirements referenced above).
