Understanding ST_Geometry vs WKB
The distinction between ST_Geometry and Well-Known Binary (WKB) in DuckDB dictates vectorized execution paths, memory allocation strategies, and spatial index construction overhead. ST_Geometry, exposed by DuckDB’s spatial extension as the GEOMETRY type, is the native, strongly typed spatial object. WKB is a serialized byte stream (BLOB) that must be deserialized at runtime, for example with ST_GeomFromWKB, before coordinates can be extracted. Selecting the wrong representation in analytical pipelines adds measurable CPU overhead, inflates working-set memory, and degrades predicate selectivity.
Memory Layout & Vectorized Execution
DuckDB’s columnar engine processes ST_Geometry natively through the Arrow-compatible GEOMETRY extension type. Coordinates are stored in contiguous, cache-aligned buffers, enabling SIMD-accelerated bounding box evaluation. WKB forces a decode pass per row. In high-cardinality joins or spatial predicates (ST_Intersects, ST_Contains), WKB triggers repeated heap allocations for temporary coordinate arrays.
-- WKB path: incurs decode overhead per predicate evaluation
SELECT id FROM parcels WHERE ST_Intersects(ST_GeomFromWKB(wkb_col), ST_Point(-73.98, 40.75));
-- Native path: no per-row WKB decode
SELECT id FROM parcels WHERE ST_Intersects(geom_col, ST_Point(-73.98, 40.75));
The query planner recognizes native geometry types and pushes bounding box filters directly into the scan operator. WKB columns bypass this optimization, materializing full rows before spatial evaluation. For datasets exceeding 10M rows, this manifests as a 3–5x increase in peak memory and sustained CPU utilization.
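To try both paths side by side, a small synthetic table helps. The sketch below assumes the spatial extension can be installed in the current environment and builds a hypothetical 1,000-row parcels table (the coordinates and row count are illustrative, not taken from any real dataset). It then runs the same predicate through each path under EXPLAIN ANALYZE so the operator timings can be compared directly.

```sql
-- Assumption: the spatial extension is installable in this environment.
INSTALL spatial;
LOAD spatial;

-- Hypothetical sample data: 1,000 small rectangles laid out along one row,
-- each stored both as native GEOMETRY and as a serialized WKB copy.
CREATE OR REPLACE TABLE parcels AS
SELECT
    id,
    geom_col,
    ST_AsWKB(geom_col) AS wkb_col
FROM (
    SELECT
        i AS id,
        ST_MakeEnvelope(
            -74.0 + i::DOUBLE * 0.001,          40.7000,
            -74.0 + i::DOUBLE * 0.001 + 0.0005, 40.7005
        ) AS geom_col
    FROM range(1000) AS t(i)
) AS g;

-- WKB path: every row is decoded before the predicate runs.
EXPLAIN ANALYZE
SELECT count(*) FROM parcels
WHERE ST_Intersects(ST_GeomFromWKB(wkb_col), ST_Point(-73.98, 40.7002));

-- Native path: the predicate reads the GEOMETRY column directly.
EXPLAIN ANALYZE
SELECT count(*) FROM parcels
WHERE ST_Intersects(geom_col, ST_Point(-73.98, 40.7002));
```

Run this in a scratch database, since it replaces any existing table named parcels. On a table this small the timings will be dominated by fixed costs, so increase the range() argument before drawing conclusions.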
Spatial Index Construction & Query Planning
R-tree and H3 index builds require explicit coordinate extraction. Native geometry feeds directly into the index builder without intermediate serialization. When operating on WKB, DuckDB must materialize a temporary geometry column before index construction, doubling I/O during the build phase. Index granularity and node packing efficiency depend on pre-extracted coordinate buffers, as detailed in Spatial Indexing Internals.
Configuration fix for forced index utilization on WKB-heavy tables:
-- Pre-materialize native geometry. DuckDB has no ALTER ... ADD generated column,
-- so add a plain column, populate it, then build an R-tree index.
ALTER TABLE parcels ADD COLUMN geom_col GEOMETRY;
UPDATE parcels SET geom_col = ST_GeomFromWKB(wkb_col);
CREATE INDEX idx_parcels_geom ON parcels USING RTREE (geom_col);
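After the build, it is worth confirming that the index exists and that the planner actually considers it. The following is a minimal sketch, assuming the parcels table and idx_parcels_geom index from the statements above. The envelope coordinates are illustrative, and the exact operator name shown in the plan may differ between DuckDB versions.

```sql
-- List indexes registered on the table, with the statement that created them.
SELECT index_name, table_name, sql
FROM duckdb_indexes()
WHERE table_name = 'parcels';

-- Inspect the plan for a window query against a constant search geometry.
-- An R-tree index scan operator in the output indicates the index was chosen;
-- a plain sequential scan followed by a filter indicates it was not.
EXPLAIN
SELECT id
FROM parcels
WHERE ST_Intersects(geom_col, ST_MakeEnvelope(-73.99, 40.74, -73.97, 40.76));
```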
In-Memory vs Disk Storage & I/O Patterns
DuckDB’s buffer pool manages geometry vectors differently than raw BLOBs. ST_Geometry columns benefit from columnar compression and direct memory mapping during spill-to-disk operations. WKB columns remain opaque to the storage engine, preventing predicate pushdown and forcing full-page reads during out-of-core execution. When memory_limit is constrained, WKB-heavy workloads trigger aggressive spilling, increasing disk I/O latency by 40–70%.
Mitigation requires explicit materialization before analytical joins:
CREATE TABLE parcels_opt AS
SELECT *, ST_GeomFromWKB(wkb_col) AS geom_col FROM parcels;
CRS Mapping, Transformations, and Drift Troubleshooting
Neither DuckDB’s GEOMETRY nor raw WKB carries an inline SRID: standard WKB, as defined by the OGC Simple Features specification, has no SRID field, and DuckDB tracks no per-geometry CRS. A layer’s projection must therefore be tracked out-of-band, in column or table metadata. Mixing layers in different CRSs produces silent coordinate drift or ST_Transform failures.
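One way to track projection out-of-band is a small registry table. The sketch below is an assumption about how such metadata could be organized, not a DuckDB feature: the layer_crs table, its columns, and the roads layer are hypothetical names introduced for illustration.

```sql
-- Hypothetical registry: one row per layer, recording the CRS of its coordinates.
CREATE TABLE IF NOT EXISTS layer_crs (
    layer_name VARCHAR PRIMARY KEY,
    crs        VARCHAR NOT NULL      -- e.g. 'EPSG:4326'
);

INSERT OR REPLACE INTO layer_crs VALUES
    ('parcels', 'EPSG:4326'),
    ('roads',   'EPSG:3857');

-- Before joining layers, list every pair whose recorded CRS differs.
SELECT
    a.layer_name AS left_layer,
    a.crs        AS left_crs,
    b.layer_name AS right_layer,
    b.crs        AS right_crs
FROM layer_crs AS a
JOIN layer_crs AS b
  ON a.layer_name < b.layer_name
WHERE a.crs <> b.crs;
```

Any row returned by the final query names two layers that need an ST_Transform step before they can be compared safely.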
Diagnostic query for CRS drift detection (by coordinate range, since there is no ST_SRID):
-- Geographic (lon/lat) data sits within ±180/±90; projected metric data does not.
SELECT
ST_XMin(geom_col) BETWEEN -180 AND 180
AND ST_YMin(geom_col) BETWEEN -90 AND 90 AS looks_geographic,
COUNT(*) AS row_count,
MIN(ST_XMin(geom_col)) AS min_x,
MAX(ST_XMax(geom_col)) AS max_x
FROM parcels
GROUP BY looks_geographic;
Fallback routing for mixed-projection ingestion:
-- Standardize to EPSG:4326 during ingestion (transform from the known source CRS).
-- always_xy := true keeps lon/lat (x, y) axis order in the output.
INSERT INTO parcels_clean (id, geom_col)
SELECT id, ST_Transform(geom_col, 'EPSG:3857', 'EPSG:4326', always_xy := true)
FROM parcels_raw
WHERE ST_IsValid(geom_col);
GeoParquet Parsing & GeoJSON Ingestion Pipelines
GeoParquet files store geometry columns as WKB by default, with the encoding and CRS described in file-level metadata. When the spatial extension is loaded, DuckDB’s Parquet reader converts these columns to GEOMETRY during the scan. GeoJSON ingestion defaults to JSON parsing followed by ST_GeomFromGeoJSON, which adds JSON tokenization overhead before the geometry is constructed.
Pipeline optimization for high-throughput ingestion:
-- Direct GeoParquet scan (geometry columns arrive as GEOMETRY when spatial is loaded)
SELECT * FROM read_parquet('s3://bucket/data.parquet');
-- Optimized GeoJSON ingestion with batched geometry conversion
COPY (
SELECT
id,
ST_GeomFromGeoJSON(geojson_col) AS geom_col
FROM read_json_auto('s3://bucket/data.json', columns={'id': 'INTEGER', 'geojson_col': 'VARCHAR'})
) TO 's3://bucket/data_optimized.parquet' (FORMAT PARQUET);
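A quick local round trip can confirm how geometry columns come back from Parquet before pointing the pipeline at remote storage. This is a minimal sketch, assuming a parcels table with a geom_col column and a recent DuckDB release with the spatial extension loaded. The file name parcels_check.parquet is a hypothetical scratch path.

```sql
LOAD spatial;

-- Write a small Parquet file from the native geometry column.
COPY (SELECT id, geom_col FROM parcels)
TO 'parcels_check.parquet' (FORMAT PARQUET);

-- Inspect the column types the reader reports for the file.
DESCRIBE SELECT * FROM read_parquet('parcels_check.parquet');
```

If geom_col is reported as BLOB rather than GEOMETRY, the extension is probably not loaded in the session or the file lacks GeoParquet metadata. In that case an explicit ST_GeomFromWKB call is needed downstream.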
Enterprise Deployment & Access Control
Spatial columns require explicit access controls in multi-tenant environments. DuckDB has no GRANT/role system, so enforce access at the boundary: expose only curated views (dropping the raw wkb_col), attach the database read-only for analysts, and rely on filesystem permissions.
-- DuckDB has no GRANT/REVOKE; expose a curated view instead of the base table
-- (omit the raw WKB column, enforce a predicate) and ATTACH read-only for analysts.
CREATE VIEW parcels_secured AS
SELECT id, geom_col FROM parcels
WHERE ST_Area(geom_col) < 1000000; -- size-based exposure policy
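A view only hides the base table by convention, because anyone who can open the same database file can still query parcels directly. One way to make the boundary harder, sketched here under assumptions, is to materialize the curated rows into a separate database file and give analysts only that file. The file name parcels_curated.duckdb and the alias curated are hypothetical.

```sql
-- Maintainer session: export only the curated rows into a separate file.
ATTACH 'parcels_curated.duckdb' AS curated;

CREATE OR REPLACE TABLE curated.parcels_secured AS
SELECT id, geom_col
FROM parcels
WHERE ST_Area(geom_col) < 1000000;   -- same size-based exposure policy

DETACH curated;

-- Analyst session: attach the curated file read-only and query it.
ATTACH 'parcels_curated.duckdb' AS curated (READ_ONLY);

SELECT count(*) AS visible_parcels
FROM curated.parcels_secured;
```

Because the raw WKB column and the filtered-out rows never reach the curated file, filesystem permissions on that one file become the effective access policy.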
Diagnostic Queries & Fallback Routing
Identify WKB-induced bottlenecks before deployment:
-- Detect WKB (BLOB) columns that should be promoted to native GEOMETRY
SELECT table_name, column_name, data_type
FROM duckdb_columns()
WHERE data_type = 'BLOB'
AND column_name ILIKE '%wkb%';
-- Verify index utilization during spatial scan
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT id FROM parcels WHERE ST_Intersects(geom_col, ST_Point(-73.98, 40.75));
Fallback routing when spatial indexes fail to materialize:
- Verify the predicate is selective enough for the planner to choose the R-tree (inspect EXPLAIN).
- Compare against a sequential scan by dropping the index or setting PRAGMA disabled_optimizers.
- Refresh statistics with ANALYZE parcels;.
- If memory pressure persists, partition by a coordinate-derived grid cell (floor(ST_X(geom)/cell), floor(ST_Y(geom)/cell)) and process per partition.
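The grid-cell fallback in the last step can be sketched as follows. The cell size of 0.01 (in the layer's coordinate units) and the table name parcels_gridded are illustrative assumptions. The centroid is used because ST_X and ST_Y are defined for points, while parcels are typically polygons.

```sql
-- Assumption: cell size 0.01 in the layer's coordinate units; tune per dataset.
CREATE OR REPLACE TABLE parcels_gridded AS
SELECT
    *,
    CAST(floor(ST_X(ST_Centroid(geom_col)) / 0.01) AS BIGINT) AS cell_x,
    CAST(floor(ST_Y(ST_Centroid(geom_col)) / 0.01) AS BIGINT) AS cell_y
FROM parcels;

-- Inspect partition sizes to spot cells that are still too heavy.
SELECT cell_x, cell_y, count(*) AS n
FROM parcels_gridded
GROUP BY cell_x, cell_y
ORDER BY n DESC
LIMIT 10;
```

Each (cell_x, cell_y) pair can then be processed as its own batch. Geometries that straddle a cell boundary are assigned by centroid only, so joins across neighbouring cells still need separate handling.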
Configuration Reference & Tuning Parameters
| Parameter | Default | Recommended for Spatial Workloads | Effect |
|---|---|---|---|
| preserve_insertion_order | true | false | Unlocks parallel, out-of-order scans and aggregation |
| memory_limit | auto | 75% of host RAM | Prevents aggressive WKB decode spilling |
| threads | auto | physical_cores | Maximizes SIMD coordinate evaluation |
| enable_http_metadata_cache | false | true | Caches remote file metadata for repeated S3/HTTP reads |
| max_expression_depth | 1000 | 2000 | Prevents stack overflow in nested ST_Transform chains |
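Applied in a session, the recommendations in the table might look like the sketch below. The concrete values assume a hypothetical host with 16 GB of RAM and 8 physical cores, so adjust them to the actual machine. The enable_http_metadata_cache setting belongs to the httpfs extension, which is assumed to be installable here.

```sql
-- Assumption: httpfs is available; it provides enable_http_metadata_cache.
INSTALL httpfs;
LOAD httpfs;

SET preserve_insertion_order = false;
SET memory_limit = '12GB';              -- roughly 75% of a 16 GB host (hypothetical)
SET threads = 8;                        -- physical cores on the hypothetical host
SET enable_http_metadata_cache = true;
SET max_expression_depth = 2000;

-- Confirm the values the session is actually using.
SELECT name, value
FROM duckdb_settings()
WHERE name IN (
    'preserve_insertion_order',
    'memory_limit',
    'threads',
    'enable_http_metadata_cache',
    'max_expression_depth'
);
```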
Architecture-level tuning must align with DuckDB Spatial Architecture & Fundamentals to ensure planner optimizations propagate through the execution pipeline. Always validate spatial type consistency before production deployment.