GeoJSON Ingestion

GeoJSON ingestion in production environments is fundamentally an I/O and memory-bound operation. DuckDB Spatial resolves this through a vectorized, zero-copy JSON parser that materializes coordinate arrays directly into contiguous columnar buffers. Understanding the underlying DuckDB Spatial Architecture & Fundamentals is required before scaling ingestion pipelines, as the engine bypasses traditional row-by-row deserialization in favor of SIMD-optimized coordinate streaming. This page documents execution-ready patterns, execution plan tuning, and diagnostic boundaries for data engineering and GIS workflows.

Production Ingestion Pipeline & Configuration

Direct ingestion must bypass intermediate Python serialization layers. Use DuckDB’s native st_read() with explicit schema projection to prevent property bloat and enforce strict type safety. GeoJSON compliance follows RFC 7946: The GeoJSON Format, which mandates WGS84 coordinates and a strict FeatureCollection structure.

-- Configure execution boundaries before ingestion
SET threads = 12;
SET memory_limit = '24GB';
SET preserve_insertion_order = false;
SET max_temp_directory_size = '100GB';

-- Ingest with explicit schema projection and geometry validation
CREATE OR REPLACE TABLE raw_geojson AS
SELECT
    (properties->>'id')::BIGINT AS id,
    (properties->>'name')::VARCHAR AS name,
    (properties->>'category')::VARCHAR AS category,
    geom
FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');

-- Enforce geometry validity early to prevent downstream query failures
UPDATE raw_geojson
SET geom = ST_MakeValid(geom)
WHERE NOT ST_IsValid(geom);

Python Integration Pattern When orchestrating via Python, disable automatic progress reporting and bind parameters to prevent injection and optimize query caching:

import duckdb

con = duckdb.connect(config={'threads': 12, 'memory_limit': '24GB'})
con.execute("SET preserve_insertion_order = false;")

# Parameterized ingestion for repeatable pipelines
query = """
    CREATE OR REPLACE TABLE raw_geojson AS
    SELECT (properties->>'id')::BIGINT AS id, geom
    FROM st_read($1)
"""
con.execute(query, ['s3://data-lake/ingest/parcels_2024.geojson'])

Execution Plan Analysis & Diagnostics

Run EXPLAIN (ANALYZE, FORMAT JSON) against the ingestion query to validate operator behavior. The following diagnostic boundaries apply to production workloads:

{
  "operator": "PhysicalProjection",
  "estimated_rows": 1250000,
  "actual_rows": 1248932,
  "timing": {"total_time_ms": 842, "setup_ms": 12, "execution_ms": 830},
  "children": [
    {
      "operator": "ST_Read",
      "parallel_scan": true,
      "chunk_size": 1024,
      "bytes_processed": "4.2GB",
      "timing": {"execution_ms": 615}
    }
  ]
}

Key Operator Thresholds:

  • ST_Read: Must report parallel_scan: true. If false, verify the source is not a single-line minified blob. Line-delimited features ({"type":"Feature"}\n{"type":"Feature"}) enable vectorized chunking across worker threads.
  • PhysicalProjection: Confirms column pruning. Unprojected SELECT * selections trigger full JSON tree traversal, increasing peak RSS by 30–50% and delaying memory reclamation.
  • Spatial Filter: Predicates on st_intersects or st_contains are pushed down to the scan phase. If the plan shows a post-materialization Filter operator, rewrite the query to apply spatial predicates before geometry validation.

Diagnostic Boundary: If the plan shows Spill to Disk under memory pressure, adjust SET memory_limit or partition the source file. DuckDB’s spill behavior is deterministic; monitor duckdb_temporary_files to avoid I/O thrashing on network-attached storage.

Storage Topology & Memory Boundaries

GeoJSON ingestion defaults to in-memory materialization, which is optimal for datasets under 4GB. For larger payloads, explicit disk-backed storage prevents OOM termination and enables concurrent analytical workloads. Refer to In-Memory vs Disk Storage for engine-level allocation strategies.

Production Disk-Backed Pattern:

-- Attach a persistent database file for spill and final storage
ATTACH 's3://analytics-db/warehouse.duckdb' AS warehouse;
CREATE OR REPLACE TABLE warehouse.raw_geojson AS
SELECT * FROM st_read('s3://data-lake/ingest/parcels_2024.geojson');

CRS Handling & Transformation GeoJSON strictly uses EPSG:4326. If downstream analytics require projected coordinates (e.g., EPSG:3857 or local state plane), apply transformations immediately after ingestion to avoid repeated projection overhead during query execution. Consult CRS Mapping & Transformations for coordinate system resolution logic and precision loss mitigation.

-- Project to metric coordinate system post-ingestion
CREATE OR REPLACE TABLE parcels_metric AS
SELECT
    id, name, category,
    st_transform(geom, 'EPSG:4326', 'EPSG:32633') AS geom
FROM raw_geojson;

Performance Trade-offs & Tuning Guidelines

Configuration Performance Impact Diagnostic Trigger
preserve_insertion_order = true Disables parallel scan. Ingestion latency increases 2–4×. parallel_scan: false in EXPLAIN.
threads > physical_cores Thread thrashing and L3 cache contention. Peak RSS spikes. CPU utilization < 70% with high context switches.
ST_MakeValid() on 100% of rows CPU-bound validation overhead (~15–25% ingestion time). Only apply when ST_IsValid(geom) returns > 2% invalid.
temp_directory unset (no spill target) OOM termination on payloads > memory_limit. duckdb.OutOfMemoryException during ST_Read.

Execution Boundaries:

  1. Memory Pressure: If peak_memory_mb exceeds 85% of memory_limit, partition the source GeoJSON by feature count or geographic bounding box before ingestion.
  2. Invalid Geometry Rate: When ST_IsValid returns > 5% invalid features, isolate them into a quarantine table rather than applying ST_MakeValid globally. This preserves query determinism and prevents topology corruption.
  3. Network I/O: For cloud storage, enable SET s3_region='us-east-1' and SET s3_use_ssl=true. Disable SET s3_url_style='path' unless required by legacy object gateways.

Validation & Error Handling

Production pipelines must enforce strict geometry contracts. Use ST_IsEmpty and ST_IsValid to filter malformed features before indexing or spatial joins.

-- Quarantine invalid geometries for manual review
-- (DuckDB exposes no ST_IsValidReason; classify with the available predicates)
CREATE OR REPLACE TABLE geojson_quarantine AS
SELECT id, name, geom,
       CASE WHEN ST_IsEmpty(geom) THEN 'empty'
            WHEN NOT ST_IsValid(geom) THEN 'invalid'
       END AS failure_reason
FROM raw_geojson
WHERE NOT ST_IsValid(geom) OR ST_IsEmpty(geom);

-- Drop quarantined rows from primary table
DELETE FROM raw_geojson WHERE id IN (SELECT id FROM geojson_quarantine);

Monitor failure_reason distributions to identify upstream serialization errors. Common patterns include self-intersecting polygons, duplicate vertices, or non-planar rings. Resolve at the source ETL layer when possible to avoid repeated compute costs during ingestion.