GeoJSON Ingestion
GeoJSON ingestion in production environments is fundamentally an I/O and memory-bound operation. DuckDB Spatial resolves this through a vectorized, zero-copy JSON parser that materializes coordinate arrays directly into contiguous columnar buffers. Understanding the underlying DuckDB Spatial Architecture & Fundamentals is required before scaling ingestion pipelines, as the engine bypasses traditional row-by-row deserialization in favor of SIMD-optimized coordinate streaming. This page documents execution-ready patterns, execution plan tuning, and diagnostic boundaries for data engineering and GIS workflows.
Production Ingestion Pipeline & Configuration
Direct ingestion must bypass intermediate Python serialization layers. Use DuckDB’s native st_read() with explicit schema projection to prevent property bloat and enforce strict type safety. GeoJSON compliance follows RFC 7946: The GeoJSON Format, which mandates WGS84 coordinates and a strict FeatureCollection structure.
-- Configure execution boundaries before ingestion
SET threads = 12;
SET memory_limit = '24GB';
SET preserve_insertion_order = false;
SET max_temp_directory_size = '100GB';
-- Ingest with explicit schema projection and geometry validation
CREATE OR REPLACE TABLE raw_geojson AS
SELECT
(properties->>'id')::BIGINT AS id,
(properties->>'name')::VARCHAR AS name,
(properties->>'category')::VARCHAR AS category,
geom
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');
-- Enforce geometry validity early to prevent downstream query failures
UPDATE raw_geojson
SET geom = ST_MakeValid(geom)
WHERE NOT ST_IsValid(geom);
Python Integration Pattern When orchestrating via Python, disable automatic progress reporting and bind parameters to prevent injection and optimize query caching:
import duckdb
con = duckdb.connect(config={'threads': 12, 'memory_limit': '24GB'})
con.execute("SET preserve_insertion_order = false;")
# Parameterized ingestion for repeatable pipelines
query = """
CREATE OR REPLACE TABLE raw_geojson AS
SELECT (properties->>'id')::BIGINT AS id, geom
FROM st_read($1)
"""
con.execute(query, ['s3://data-lake/ingest/parcels_2024.geojson'])
Execution Plan Analysis & Diagnostics
Run EXPLAIN (ANALYZE, FORMAT JSON) against the ingestion query to validate operator behavior. The following diagnostic boundaries apply to production workloads:
{
"operator": "PhysicalProjection",
"estimated_rows": 1250000,
"actual_rows": 1248932,
"timing": {"total_time_ms": 842, "setup_ms": 12, "execution_ms": 830},
"children": [
{
"operator": "ST_Read",
"parallel_scan": true,
"chunk_size": 1024,
"bytes_processed": "4.2GB",
"timing": {"execution_ms": 615}
}
]
}
Key Operator Thresholds:
ST_Read: Must reportparallel_scan: true. Iffalse, verify the source is not a single-line minified blob. Line-delimited features ({"type":"Feature"}\n{"type":"Feature"}) enable vectorized chunking across worker threads.PhysicalProjection: Confirms column pruning. UnprojectedSELECT *selections trigger full JSON tree traversal, increasing peak RSS by 30–50% and delaying memory reclamation.Spatial Filter: Predicates onst_intersectsorst_containsare pushed down to the scan phase. If the plan shows a post-materializationFilteroperator, rewrite the query to apply spatial predicates before geometry validation.
Diagnostic Boundary: If the plan shows Spill to Disk under memory pressure, adjust SET memory_limit or partition the source file. DuckDB’s spill behavior is deterministic; monitor duckdb_temporary_files to avoid I/O thrashing on network-attached storage.
Storage Topology & Memory Boundaries
GeoJSON ingestion defaults to in-memory materialization, which is optimal for datasets under 4GB. For larger payloads, explicit disk-backed storage prevents OOM termination and enables concurrent analytical workloads. Refer to In-Memory vs Disk Storage for engine-level allocation strategies.
Production Disk-Backed Pattern:
-- Attach a persistent database file for spill and final storage
ATTACH 's3://analytics-db/warehouse.duckdb' AS warehouse;
CREATE OR REPLACE TABLE warehouse.raw_geojson AS
SELECT * FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');
CRS Handling & Transformation GeoJSON strictly uses EPSG:4326. If downstream analytics require projected coordinates (e.g., EPSG:3857 or local state plane), apply transformations immediately after ingestion to avoid repeated projection overhead during query execution. Consult CRS Mapping & Transformations for coordinate system resolution logic and precision loss mitigation.
-- Project to metric coordinate system post-ingestion
CREATE OR REPLACE TABLE parcels_metric AS
SELECT
id, name, category,
st_transform(geom, 'EPSG:4326', 'EPSG:32633') AS geom
FROM raw_geojson;
Performance Trade-offs & Tuning Guidelines
| Configuration | Performance Impact | Diagnostic Trigger |
|---|---|---|
preserve_insertion_order = true |
Disables parallel scan. Ingestion latency increases 2–4×. | parallel_scan: false in EXPLAIN. |
threads > physical_cores |
Thread thrashing and L3 cache contention. Peak RSS spikes. | CPU utilization < 70% with high context switches. |
ST_MakeValid() on 100% of rows |
CPU-bound validation overhead (~15–25% ingestion time). | Only apply when ST_IsValid(geom) returns > 2% invalid. |
temp_directory unset (no spill target) |
OOM termination on payloads > memory_limit. |
duckdb.OutOfMemoryException during ST_Read. |
Execution Boundaries:
- Memory Pressure: If
peak_memory_mbexceeds 85% ofmemory_limit, partition the source GeoJSON by feature count or geographic bounding box before ingestion. - Invalid Geometry Rate: When
ST_IsValidreturns > 5% invalid features, isolate them into a quarantine table rather than applyingST_MakeValidglobally. This preserves query determinism and prevents topology corruption. - Network I/O: For cloud storage, enable
SET s3_region='us-east-1'andSET s3_use_ssl=true. DisableSET s3_url_style='path'unless required by legacy object gateways.
Validation & Error Handling
Production pipelines must enforce strict geometry contracts. Use ST_IsEmpty and ST_IsValid to filter malformed features before indexing or spatial joins.
-- Quarantine invalid geometries for manual review
-- (DuckDB exposes no ST_IsValidReason; classify with the available predicates)
CREATE OR REPLACE TABLE geojson_quarantine AS
SELECT id, name, geom,
CASE WHEN ST_IsEmpty(geom) THEN 'empty'
WHEN NOT ST_IsValid(geom) THEN 'invalid'
END AS failure_reason
FROM raw_geojson
WHERE NOT ST_IsValid(geom) OR ST_IsEmpty(geom);
-- Drop quarantined rows from primary table
DELETE FROM raw_geojson WHERE id IN (SELECT id FROM geojson_quarantine);
Monitor failure_reason distributions to identify upstream serialization errors. Common patterns include self-intersecting polygons, duplicate vertices, or non-planar rings. Resolve at the source ETL layer when possible to avoid repeated compute costs during ingestion.