Modern Spatial SQL Query Patterns

Spatial analytics has migrated from monolithic RDBMS extensions to vectorized, in-process analytical engines, and the query patterns that ran acceptably on a row-store now behave very differently on columnar hardware. This reference, part of the wider DuckDB Spatial and analytical SQL knowledge base, catalogues the engineering-grade query patterns that data engineers, GIS analysts, and platform teams rely on to keep geometry workloads deterministic at scale. It covers how the engine executes spatial SQL, how to configure a session for predictable memory behaviour, how to ingest geometry without serialization overhead, and how to read execution plans so that regressions surface before they reach production. The operator-specific deep dives live in three areas: spatial joins and proximity filters, vectorized aggregations, and window functions for geospatial context. For engine internals such as storage layout and index construction, consult the companion DuckDB Spatial architecture reference; for orchestration from Python, see the Python and DuckDB integration workflows.

The query path for spatial SQL: cheap pruning and bounding-box pre-filters run at the scan, the vectorized operators run in batches, and expensive exact-topology kernels are invoked only on surviving candidate rows.

Execution Model & Core Concepts

DuckDB processes spatial workloads with a vectorized query engine that operates on fixed-size column chunks (2,048 values per vector by default) rather than iterating one row at a time. A geometry column is stored as a contiguous buffer of Well-Known Binary (WKB) values paired with an offset vector, so the engine can stride across thousands of geometries without pointer chasing. This layout is the foundation of every pattern on this page: it lets the planner evaluate a cheap predicate over an entire vector, discard non-matching rows, and only then hand the survivors to an expensive geometric kernel.

The single most important concept in modern spatial SQL is the two-stage filter. Every GEOMETRY value carries, or can cheaply compute, a minimum bounding rectangle. The && operator (bounding-box overlap) compares those rectangles using four floating-point comparisons, whereas ST_Intersects walks both geometries’ edges and can cost orders of magnitude more for complex polygons. The optimizer is built to run the bounding-box stage first and short-circuit before the exact stage ever fires:

-- Two-stage filter: bbox overlap gates exact topology
SELECT a.asset_id, b.zone_id
FROM assets a
JOIN zones  b
  ON a.geom && b.geom            -- stage 1: SIMD-friendly bbox overlap, runs on every row
 AND ST_Intersects(a.geom, b.geom);  -- stage 2: exact, runs only on bbox survivors

For a join of $N$ rows against $M$ rows, a naive nested predicate is $O(N \times M)$ exact topology evaluations. The bounding-box stage, accelerated by an in-memory R-tree, prunes the candidate set so that the exact stage runs on a fraction of the pairs — typically 60–95% fewer, depending on data density. The mechanics of that pruning structure are detailed in the R-tree spatial indexing internals reference; what matters here is that your SQL must expose the bounding-box predicate so the planner can use it.
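To make the pruning arithmetic concrete, the following is a minimal sketch of the bounding-box stage in plain Python, outside the engine. The names `bbox_overlaps` and `candidate_pairs` and the sample rectangles are assumptions made for illustration, and the brute-force loop only stands in for the R-tree lookup; it is not how DuckDB finds candidates.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """Stage 1: four comparisons, true when the rectangles share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(left: List[BBox], right: List[BBox]) -> List[Tuple[int, int]]:
    """Return index pairs that survive the bounding-box stage.

    Only these pairs would be handed to an exact kernel such as ST_Intersects.
    """
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if bbox_overlaps(a, b)
    ]


# Hypothetical envelopes for three assets and two zones.
assets = [(0, 0, 1, 1), (5, 5, 6, 6), (10, 10, 11, 11)]
zones = [(0.5, 0.5, 2, 2), (20, 20, 21, 21)]

survivors = candidate_pairs(assets, zones)
pruned = 1 - len(survivors) / (len(assets) * len(zones))
print(survivors, f"{pruned:.0%} of pairs pruned before exact topology")
```

In this toy input only one of the six pairs reaches the exact stage. Real pruning ratios depend on how densely the envelopes overlap.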

The three families of pattern in this section all build on the same staged model:

Spatial joins and proximity correlate two datasets by geometric relationship. The canonical forms are containment (ST_Contains, ST_Within), intersection (ST_Intersects), and distance (ST_DWithin). The dedicated spatial joins and proximity filters guide covers join-order and predicate-placement rules in depth, and the point-in-polygon optimization deep dive handles the most common high-cardinality case.
Aggregations collapse many geometries into summaries: dissolving parcels with ST_Union_Agg (the aggregate form of ST_Union in DuckDB Spatial), collecting points with ST_Collect, or computing pairwise distance matrices. These run directly over coordinate arrays in columnar memory.
Window functions add per-partition ranking and neighbourhood context without self-joins, powering nearest-neighbour ranking, trajectory segmentation, and density-based grouping such as ST_ClusterDBSCAN spatial grouping.

A second core concept is coordinate units. Distance and area predicates are unit-naive: ST_DWithin(a, b, 5000) means “5000 of whatever units the coordinates are in.” In a geographic CRS (EPSG:4326) those units are degrees, so the predicate is meaningless as a metric distance. Either project to a metric CRS before the predicate or wrap geometries with the correct transform; the rules and overhead are documented in the CRS mapping and transformations reference and are a recurring failure mode discussed below.
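Why a degree threshold cannot stand in for a metric one can be shown with a short calculation. The sketch below uses the haversine formula on a spherical Earth, which is an approximation chosen for illustration and not the method the engine uses; `haversine_m` and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (spherical approximation)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


# The same 1-degree step in longitude, measured at two latitudes.
at_equator = haversine_m(0.0, 0.0, 1.0, 0.0)
at_60_north = haversine_m(0.0, 60.0, 1.0, 60.0)
print(round(at_equator), round(at_60_north))
```

One degree of longitude spans roughly half the ground distance at 60° north that it does at the equator, so a fixed numeric threshold in degrees selects a different physical radius for every row.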

Configuration Reference

Spatial SQL is sensitive to a small number of session-level knobs. Set them explicitly at connection time rather than relying on defaults — the defaults are tuned for general analytics, not for the memory spikes of geometry materialization.

-- Memory ceiling: must exceed the combined uncompressed geometry footprint of
-- the largest operator's working set. Too low → silent disk spill; too high on a
-- shared box → OS OOM-kill of the whole process.
SET memory_limit = '8GB';

-- Thread count: match physical cores. Hyperthread siblings contend for the same
-- SIMD units and degrade index-build and topology throughput rather than help it.
SET threads = 8;

-- Spill directory: without it, an over-budget query fails instead of spilling.
-- Point it at fast local NVMe, never a network mount, or spill I/O dominates.
SET temp_directory = '/var/lib/duckdb/spill';

-- Cap spill so a runaway query cannot fill the disk and take down co-tenants.
SET max_temp_directory_size = '50GB';

-- Drop row-order guarantees to unlock parallel scans and index builds. Re-impose
-- ordering with an explicit ORDER BY on the final result if you need it.
SET preserve_insertion_order = false;

Trade-off: raising memory_limit reduces spill but increases blast radius on shared hosts, because one query can starve every co-tenant. Pair a generous limit with a strict max_temp_directory_size and isolate heavy tenants in their own database instance or process rather than trusting a single global ceiling.

Trade-off: preserve_insertion_order = false enables parallel R-tree construction and parallel scans, but any downstream consumer that assumed input order will break. Materialize a sorted output table when order is contractual.
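One way to keep every session on the same baseline is to hold the settings in one mapping and render the SET statements from it. This is a minimal sketch under that assumption: `render_set_statements` and `SPATIAL_BASELINE` are hypothetical names, the values simply mirror the ones shown above, and setting names are not validated, so the mapping should come from trusted configuration only.

```python
def render_set_statements(settings: dict) -> list:
    """Render a settings mapping as SET statements.

    Strings are single-quoted (embedded quotes doubled), booleans become
    true/false, and integers are emitted bare.
    """
    statements = []
    for name, value in settings.items():
        if isinstance(value, bool):      # check bool first: bool is a subclass of int
            literal = "true" if value else "false"
        elif isinstance(value, int):
            literal = str(value)
        else:
            literal = "'" + str(value).replace("'", "''") + "'"
        statements.append(f"SET {name} = {literal};")
    return statements


# Hypothetical baseline mirroring the values in the configuration block above.
SPATIAL_BASELINE = {
    "memory_limit": "8GB",
    "threads": 8,
    "temp_directory": "/var/lib/duckdb/spill",
    "max_temp_directory_size": "50GB",
    "preserve_insertion_order": False,
}

for statement in render_set_statements(SPATIAL_BASELINE):
    print(statement)
```

The rendered statements can then be executed once at connection time by whatever client opens the session.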

Load the spatial extension and confirm the build once per session. The extension ships the GEOS-backed topology kernels and the GDAL bridge (ST_Read for input, COPY ... TO ... WITH (FORMAT GDAL) for output):

INSTALL spatial;          -- one-time download into the local extension cache
LOAD spatial;             -- per-session; required before any ST_ function resolves

-- Confirm the extension is installed and loaded, and record its version,
-- so a planner regression after an upgrade is attributable.
SELECT extension_name, loaded, installed, extension_version, install_mode
FROM duckdb_extensions()
WHERE extension_name = 'spatial';

For interactive tuning and reproducible local setups, the DuckDB Spatial CLI setup walkthrough shows how to persist these settings in a config file so every session starts from the same baseline. When choosing between an in-process database file and a pure in-memory connection, weigh the durability and spill behaviour described in in-memory vs disk storage.

Ingestion & Format Support

The query patterns below assume geometry is already in a native, columnar form. How you get it there determines whether the planner can prune, push down, and run zero-copy — so ingestion is part of query design, not a separate concern.

GeoParquet is the preferred path. Geometry is stored as WKB inside a Parquet column with per-row-group statistics, which lets DuckDB skip entire row groups before decoding a single geometry. Project only the columns you need and let predicate pushdown reach the scan:

-- Column projection + row-group pruning happen at the scan, before any ST_ kernel.
SELECT parcel_id, geom
FROM read_parquet('s3://bucket/parcels/*.parquet')
WHERE region_id = 42;     -- pruned via Parquet statistics, not a full scan

The encoding details and the performance gap versus legacy formats are covered in GeoParquet parsing and the GeoParquet vs Shapefile performance comparison. For document-oriented sources, GeoJSON ingestion describes how to stream features through ST_Read without materializing the whole file.
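The row-group pruning described for GeoParquet above comes down to a range check against stored min/max statistics. The sketch below illustrates that check in plain Python; `RowGroupStats`, `groups_to_read`, and the statistics themselves are hypothetical and do not reflect the Parquet metadata layout or the engine's reader.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (illustrative shape)."""
    group_id: int
    min_value: int
    max_value: int


def groups_to_read(stats: List[RowGroupStats], wanted: int) -> List[int]:
    """Keep only row groups whose [min, max] range could contain `wanted`."""
    return [s.group_id for s in stats if s.min_value <= wanted <= s.max_value]


# Hypothetical statistics for the region_id column of a parcels file.
region_stats = [
    RowGroupStats(0, 1, 20),
    RowGroupStats(1, 21, 45),
    RowGroupStats(2, 46, 90),
]

print(groups_to_read(region_stats, 42))
```

The check is only selective when the file is written with the filter column clustered; if every row group spans the full range of region_id, nothing can be skipped.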

Materialize geometry once, at ingestion. Parsing text geometry (ST_GeomFromText, ST_GeomFromGeoJSON) is expensive and defeats vectorization. Do it during load into a native GEOMETRY column, never inside a hot join or aggregation:

-- Convert text → native GEOMETRY at load time, so query-time kernels see WKB.
CREATE TABLE stations AS
SELECT station_id,
       ST_GeomFromText(wkt) AS geom   -- parsed once here, never per query
FROM read_csv('stations.csv');

Arrow interop is zero-copy. Results materialize as Arrow tables with WKB extension types, so handing a query result to a Python consumer involves no serialization round-trip. That boundary, and the GeoPandas handoff, are the subject of the DuckDB-to-GeoPandas sync guide. Keeping geometry in Arrow buffers — rather than round-tripping through WKT strings — is what makes the Python integration workflows performant.

Query Planning & Optimization

Every spatial query should be validated with EXPLAIN (estimated plan) and EXPLAIN ANALYZE (measured plan). Reading the plan is how you confirm that the two-stage filter is actually being applied rather than silently degrading into a full topology scan.

EXPLAIN ANALYZE
SELECT a.asset_id, b.zone_id,
       ST_Area(ST_Intersection(a.geom, b.geom)) AS overlap_m2
FROM read_parquet('s3://bucket/assets/*.parquet') a
JOIN zones b
  ON a.geom && b.geom              -- expect: bbox predicate at/near the scan
 AND ST_Intersects(a.geom, b.geom);

What to look for in the output:

Join operator. A HASH_JOIN or a spatial/range join over the bounding-box predicate is healthy. A NESTED_LOOP_JOIN carrying the full ST_Intersects predicate means the bounding-box stage was not exposed — the query has collapsed to $O(N \times M)$ exact evaluations.
Predicate placement. The && filter should appear at or immediately above the scan. If it sits above the join, no pruning happened before topology.
Cardinality drift. Compare estimated vs actual rows per operator. A large gap (orders of magnitude) signals stale statistics or skewed spatial density and predicts unstable plans across data versions.
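The cardinality-drift check in the list above can be expressed as a gap in orders of magnitude. This is a sketch under stated assumptions: the function names, the one-decade threshold, and the sample row counts are hypothetical, and the triples are assumed to have been read off a plan by hand or by a separate parser.

```python
import math


def cardinality_drift(estimated: int, actual: int) -> float:
    """Return the estimate error in orders of magnitude (0.0 means exact).

    Counts are clamped to 1 so empty operators do not produce log(0).
    """
    return abs(math.log10(max(actual, 1)) - math.log10(max(estimated, 1)))


def flag_drift(operators, threshold=1.0):
    """Yield names of operators whose estimate is off by >= `threshold` decades."""
    for name, estimated, actual in operators:
        if cardinality_drift(estimated, actual) >= threshold:
            yield name


# Hypothetical (operator, estimated rows, actual rows) triples.
observed = [
    ("SEQ_SCAN assets", 1_000_000, 1_050_000),
    ("HASH_JOIN", 5_000, 4_200_000),
    ("PROJECTION", 4_000_000, 4_200_000),
]

print(list(flag_drift(observed)))
```

Working in logarithms treats a tenfold overestimate and a tenfold underestimate as equally suspicious, which matches the "orders of magnitude" framing.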

Capture the plan as JSON when you need machine-readable metrics for regression tracking:

-- Machine-readable plan: diff operator/timing/peak_memory across builds.
EXPLAIN (ANALYZE, FORMAT JSON)
SELECT COUNT(*), ST_Union_Agg(geom)   -- aggregate form; ST_Union takes two geometries
FROM sensor_readings
WHERE ST_Intersects(
        geom,
        ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))'));

Track three fields across builds: operator_name (a shift from hash to nested-loop is a regression), operator_timing (timing deltas per node localize the slowdown), and peak_memory plus any spill indicators (early warning of a memory-ceiling violation). The per-operator workflow — capturing a baseline plan, diffing it, and alerting on drift from a Python harness — is detailed in each operator guide, starting with spatial joins and proximity filters.
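A baseline-versus-candidate comparison over those fields might look like the sketch below. It assumes each plan node is a dictionary with `operator_name`, `operator_timing`, and `children` keys, following the field names used above; the exact JSON shape varies by engine version, and the two plans here are hand-written for illustration rather than real engine output.

```python
import json


def walk(node, path=""):
    """Yield (path, node) for every operator in a nested plan dictionary."""
    here = f"{path}/{node['operator_name']}"
    yield here, node
    for child in node.get("children", []):
        yield from walk(child, here)


def diff_plans(baseline, candidate, slowdown=1.5):
    """Report operators that are new in the candidate plan or slower than baseline."""
    base = dict(walk(baseline))
    findings = []
    for path, node in walk(candidate):
        if path not in base:
            findings.append(f"operator changed: {path}")
        elif node["operator_timing"] > slowdown * base[path]["operator_timing"]:
            findings.append(f"slower: {path}")
    return findings


# Hypothetical plans, hand-written for illustration.
baseline_plan = json.loads("""
{"operator_name": "PROJECTION", "operator_timing": 0.01, "children": [
  {"operator_name": "HASH_JOIN", "operator_timing": 0.40, "children": []}]}
""")
candidate_plan = json.loads("""
{"operator_name": "PROJECTION", "operator_timing": 0.01, "children": [
  {"operator_name": "NESTED_LOOP_JOIN", "operator_timing": 9.50, "children": []}]}
""")

print(diff_plans(baseline_plan, candidate_plan))
```

Keying nodes by their path from the root keeps the diff simple, at the cost of conflating sibling operators that share a name; a production harness would need a more robust node identity.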

A few optimization rules apply across every pattern:

Prefer ST_DWithin over ST_Distance(...) < d. ST_DWithin lets the planner apply an expanded-envelope pre-filter before the exact check; the ST_Distance(...) < d form computes an exact distance for every pair first and gives the planner nothing to prune with.
Project before you join. Reduce both inputs to (key, geom) before the spatial join so the hash side stays small and cache-resident.
Reduce precision before set operations. Snapping coordinates to a fixed grid with ST_ReducePrecision before ST_Union/ST_Intersection both speeds the merge and eliminates sliver artifacts, as covered under vectorized aggregations.
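The expanded-envelope pre-filter behind the ST_DWithin rule can be illustrated in a few lines. This sketch works on planar points only and counts how many exact distance evaluations remain after the cheap test; `expand`, `within_distance`, and the sample points are hypothetical and are not the engine's implementation.

```python
import math


def expand(bbox, d):
    """Grow a (xmin, ymin, xmax, ymax) box by d units on every side."""
    xmin, ymin, xmax, ymax = bbox
    return (xmin - d, ymin - d, xmax + d, ymax + d)


def within_distance(points, target, d):
    """Return points within d of target, plus the number of exact distance checks."""
    xmin, ymin, xmax, ymax = expand((target[0], target[1], target[0], target[1]), d)
    exact_checks = 0
    hits = []
    for x, y in points:
        # Cheap pre-filter: anything outside the expanded box is farther than d.
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue
        exact_checks += 1
        if math.hypot(x - target[0], y - target[1]) <= d:
            hits.append((x, y))
    return hits, exact_checks


points = [(1, 1), (4, 4), (50, 50), (-30, 2), (3, 0)]
hits, exact_checks = within_distance(points, target=(0, 0), d=5)
print(hits, exact_checks)
```

The point (4, 4) passes the box test but fails the exact test, which is why the envelope stage can only narrow the candidate set and never replace the exact predicate.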

Production Deployment Boundaries

DuckDB runs in-process, so its resource boundaries are the host’s boundaries. There is no separate database server to absorb a runaway query — a misconfigured spatial join competes directly with the application that embeds it.

Multi-tenant isolation. memory_limit is a single ceiling for the whole database instance, shared by every concurrent query and connection attached to it. For multi-tenant analytics, give each tenant its own database instance (or process) with its own memory_limit and max_temp_directory_size, or serialize heavy spatial jobs through a queue. Geometry operations spike unpredictably: ST_Buffer and ST_Union on dense polygons can expand the working set well beyond the input footprint, so size limits to the worst case, not the average.

CPU and thread contention. Spatial topology kernels are CPU-bound and SIMD-heavy. Setting threads above the physical core count makes index builds and topology evaluation slower, not faster, because hyperthread siblings fight over the same vector units. On a shared host, cap threads below the core count to leave headroom for co-resident services.

OS-level constraints. The spill directory must be fast local storage; pointing temp_directory at a network or container-overlay mount turns every spill into a throughput cliff. Ensure the process file-descriptor limit is high enough for wide multi-file Parquet scans, and confirm the spill volume has the headroom implied by max_temp_directory_size.
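The two OS-level checks above can be run as a preflight step before a heavy job. The sketch below uses Python's standard `resource` and `shutil` modules, so it applies to Unix-like hosts only; the function names, the reserve of 64 descriptors, the 500-file scan, and the use of "." as a stand-in spill directory are all assumptions for illustration.

```python
import resource
import shutil


def fd_limit_ok(files_needed: int, reserve: int = 64) -> bool:
    """True when the soft file-descriptor limit covers the scan plus a reserve."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= files_needed + reserve


def spill_volume_ok(spill_dir: str, cap_bytes: int) -> bool:
    """True when the volume holding spill_dir has at least cap_bytes free."""
    return shutil.disk_usage(spill_dir).free >= cap_bytes


# Hypothetical values: a 500-file Parquet glob and a 50 GB spill cap
# (taken here as 50 * 1000**3 bytes); "." stands in for the real spill path.
print(fd_limit_ok(files_needed=500), spill_volume_ok(".", 50 * 1000**3))
```

Failing either check before the query starts is cheaper than discovering the limit halfway through a spill.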

Storage model. Choose deliberately between an in-memory connection (fastest, volatile, bounded strictly by RAM) and a persistent database file (durable, larger-than-memory working sets via the buffer manager). The decision and its spill implications are analyzed in in-memory vs disk storage, and the related limits for very large rasters in memory limits for large raster data.

Failure Modes & Diagnostics

Spatial SQL rarely fails loudly. The dangerous failures are silent: a plan that quietly degrades, a unit mismatch that returns plausible-but-wrong results, or an invalid geometry that corrupts a downstream union. Detect each with a targeted query.

Plan regression (silent slowdown). A query that was a hash join becomes a nested-loop after a data or version change. Detect it by asserting the join operator in the captured plan:

-- Flag the anti-pattern: full topology scan with no bbox stage.
EXPLAIN ANALYZE
SELECT a.id, b.id
FROM a JOIN b ON ST_Intersects(a.geom, b.geom);  -- missing && → nested loop
-- Fix: add `a.geom && b.geom AND` ahead of the exact predicate.

CRS / unit mismatch (silent wrong answers). A distance predicate evaluated in degrees returns results that look reasonable but are geometrically nonsense. Detect it by checking the declared SRID before running metric predicates:

-- Diagnostic: confirm geometries are in a metric CRS before a metre threshold.
SELECT DISTINCT ST_SRID(geom) AS srid FROM points;   -- expect a projected CRS, not 4326
-- Remediate by projecting once; see the CRS transformations reference for cost.

Memory-ceiling violation (OOM spill). Intermediate geometry materialization exceeds memory_limit and the query either spills to disk or, without a spill directory, fails outright. Detect it before it cascades:

-- Diagnostic: watch live allocation and active spill files during heavy jobs.
SELECT * FROM duckdb_memory();
SELECT * FROM duckdb_temporary_files();   -- non-empty during a job = spilling

Invalid geometry (downstream corruption). Self-intersections, unclosed rings, and CRS-import artifacts violate the Simple Features rules and poison ST_Union/ST_Intersection. Validate at ingestion, isolate offenders, and repair deterministically — never propagate an unchecked buffer:

-- Sanitize every row: pass valid geometries through, repair invalid ones,
-- and keep a flag so repaired rows can be audited or quarantined.
WITH validation AS (
  SELECT id, geom, ST_IsValid(geom) AS is_valid
  FROM raw_imports
)
SELECT id,
       NOT is_valid AS was_repaired,
       CASE WHEN is_valid THEN geom
            ELSE ST_MakeValid(geom)   -- deterministic repair for known defects
       END AS sanitized_geom
FROM validation;

Always enforce a consistent precision grid with ST_ReducePrecision before any set operation: floating-point drift under IEEE 754 is a frequent, quiet origin of sliver polygons and topology exceptions. Pipelines should treat ST_IsValid = FALSE as a hard stop for the offending row, not for the batch: route those rows to a quarantine table and continue with the validated remainder, so that a single bad geometry cannot abort the run.
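The quarantine routing described above is a simple partition over a validity predicate. In the sketch below, `ring_is_closed` is a deliberately minimal stand-in for a real validity check such as ST_IsValid (it tests ring closure only), and `partition_rows` and the sample rows are hypothetical.

```python
def ring_is_closed(ring):
    """Minimal validity stand-in: a ring needs >= 4 points and first == last."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def partition_rows(rows, is_valid=ring_is_closed):
    """Split (id, ring) rows into a validated batch and a quarantine batch."""
    validated, quarantine = [], []
    for row_id, ring in rows:
        (validated if is_valid(ring) else quarantine).append((row_id, ring))
    return validated, quarantine


raw_imports = [
    (1, [(0, 0), (1, 0), (1, 1), (0, 0)]),          # closed triangle
    (2, [(0, 0), (2, 0), (2, 2), (0, 2)]),          # unclosed ring
    (3, [(5, 5), (6, 5), (6, 6), (5, 6), (5, 5)]),  # closed square
]

validated, quarantine = partition_rows(raw_imports)
print([r[0] for r in validated], [r[0] for r in quarantine])
```

Because the bad row is set aside rather than raised as an error, the batch completes and the quarantine list becomes the work queue for repair or manual review.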

See also

Spatial joins and proximity filters — join order, predicate placement, and ST_DWithin pruning in depth.
Vectorized aggregations — columnar ST_Union/ST_Collect and distance-matrix computation.
Window functions for geospatial context — partitioned ranking and ST_ClusterDBSCAN grouping.
DuckDB Spatial architecture and fundamentals — storage layout, R-tree indexing internals, and CRS transformations.
Python and DuckDB integration workflows — orchestrating these patterns with zero-copy Arrow and GeoPandas sync.

External Reference Standards: The columnar and zero-copy behaviour referenced above follows the Apache Arrow columnar memory format; plan-metric extraction follows the DuckDB profiling documentation; and geometry validity follows the OGC Simple Features specification.
