Vectorized Aggregations in DuckDB Spatial

Vectorized aggregation is how DuckDB collapses millions of geometries into grids, dissolves, collections, and pairwise summaries without ever iterating one row at a time. The engine processes data in fixed-size column chunks (2,048 values per vector by default) and pushes ST_ functions down to SIMD-friendly kernels, so an aggregation that would crawl on a row-store runs in a single streaming pass on columnar hardware. The danger is that vectorization is silent when it works and silent when it breaks: an implicit GEOMETRY → VARCHAR cast, a scalar Python UDF inside a GROUP BY, or an unbounded geometry union can quietly drop the query back to per-row evaluation and inflate runtime by an order of magnitude. This guide is part of Modern Spatial SQL Query Patterns and covers the configuration, canonical patterns, plan validation, and regression analysis that keep spatial aggregations deterministic at scale.

Runtime Configuration & Memory Guardrails

Aggregation is the most memory-intensive class of spatial query because hash tables, window buffers, and geometry collections all accumulate state before they emit a single result row. Set the session up explicitly before running heavy workloads — the general-analytics defaults do not account for the working-set spikes of geometry materialization.

INSTALL spatial;          -- one-time download into the local extension cache
LOAD spatial;             -- per-session; required before any ST_ aggregate resolves

-- Match physical cores. Hyperthread siblings contend for the same SIMD units, so
-- over-subscribing slows topology kernels and hash-build rather than speeding them.
SET threads = 8;

-- Memory ceiling for the largest aggregate's working set. Too low → silent disk
-- spill mid-GROUP BY; too high on a shared host → OS OOM-kill of the whole process.
SET memory_limit = '12GB';

-- Drop row-order guarantees so GROUP BY and window partitions run in parallel and
-- out of order. Re-impose order with an explicit ORDER BY on the final result only.
SET preserve_insertion_order = false;

-- Spill target for over-budget aggregates. Point at fast local NVMe, never a network
-- mount, or a spilling ST_Union turns into an I/O cliff.
SET temp_directory = '/var/lib/duckdb/spill';

Before running an aggregation pipeline in production, confirm these preconditions hold:

memory_limit exceeds the uncompressed footprint of the widest geometry partition the aggregate will hold in flight.
Geometry columns are native GEOMETRY (WKB-backed), not text parsed per query — materialize at ingestion, as covered in the GeoParquet parsing and GeoJSON ingestion references.
Grouping keys are scalar (integers, hashes) wherever possible, so the hash table stays small and cache-resident.
A spill directory is configured on local storage so an over-budget union spills instead of failing outright.

Trade-off: raising memory_limit reduces spill but enlarges the blast radius on shared hosts — one runaway ST_Union can starve every co-tenant. Pair a generous limit with SET max_temp_directory_size and per-connection isolation rather than trusting one global ceiling. The durability and spill implications of the underlying storage choice are analyzed in in-memory vs disk storage.

Primary Execution Patterns

Every well-behaved spatial aggregation follows the same shape: reduce geometry to a cheap, scalar grouping key in a vectorized projection, then aggregate the survivors in a single hash pass. Pushing geometry evaluation into the GROUP BY is the cardinal mistake — it forces the kernel to run per group rather than per vector.

Grid binning instead of pairwise containment

Grouping by a precomputed spatial grid is the workhorse pattern. Iterative point-in-polygon checks scale poorly; snapping each point to a fixed-size cell with integer arithmetic on its coordinates gives the same binning with a single hash aggregation and no spatial join. DuckDB has no built-in hex/quad grid generator, but coordinate-derived cell keys are fully vectorizable:

WITH binned AS (
    -- Snap each point to a fixed 1000-unit grid cell (requires a metric CRS).
    -- floor() returns DOUBLE, so cast to get true integer grouping keys.
    SELECT
        CAST(floor(ST_X(geom) / 1000) AS BIGINT) AS cell_x,
        CAST(floor(ST_Y(geom) / 1000) AS BIGINT) AS cell_y,
        metric,
        geom
    FROM points_table
)
SELECT
    cell_x,
    cell_y,
    COUNT(*)        AS point_count,
    SUM(metric)     AS total_metric,
    ST_Collect(list(geom)) AS merged_geometry   -- ST_Collect is a scalar over a list; list() does the aggregating
FROM binned
GROUP BY cell_x, cell_y;

Because the grouping keys are plain integers, this collapses the work from a pairwise $O(N \times M)$ comparison to a single $O(N)$ hash aggregation. The grid is unit-naive — floor(x / 1000) only means “1 km cells” if the coordinates are in a metric CRS, so project before binning; the rules and overhead are documented in CRS mapping and transformations. When you genuinely need true containment rather than a regular grid, the index-aware predicate-pushdown strategies in spatial joins and proximity filters — and the high-cardinality case in point-in-polygon optimization — apply.
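
To make the single-pass claim concrete, here is a minimal Python sketch of the same binning outside the database. The `bin_points` helper and the sample coordinates are hypothetical; the point is only that each row is touched once and the grouping key is a pair of plain integers.

```python
from collections import defaultdict
from math import floor

def bin_points(points, cell_size=1000.0):
    """Group (x, y, metric) tuples by integer grid cell in one pass over the input."""
    cells = defaultdict(lambda: {"point_count": 0, "total_metric": 0.0})
    for x, y, metric in points:
        # math.floor returns an int, so the key is a pair of integers.
        key = (floor(x / cell_size), floor(y / cell_size))
        cells[key]["point_count"] += 1
        cells[key]["total_metric"] += metric
    return dict(cells)

# Hypothetical sample in a metric CRS (coordinates in metres).
sample = [(120.0, 80.0, 1.5), (950.0, 999.9, 2.5), (1000.0, 0.0, 4.0), (-0.1, 10.0, 1.0)]
binned = bin_points(sample)
```

Note that a point just left of the origin lands in cell -1 rather than cell 0, because floor rounds toward negative infinity. The SQL version behaves the same way.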

Dissolve vs collect: `ST_Union` and `ST_Collect`

The two geometry-combining aggregates have very different cost profiles, and choosing the wrong one is the most common source of aggregation OOM:

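ST_Collect bundles inputs into a MULTI* structure with no topological resolution. It is cheap, streams well, and is the right default when a downstream step will do the real work. In DuckDB Spatial it is a scalar function over a list of geometries, so the aggregate form is ST_Collect(list(geom)).
ST_Union computes topological boundaries and merges overlapping polygons through GEOS. It is CPU-heavy and memory-heavy, because the working set can balloon well beyond the input footprint during the merge. In DuckDB Spatial the aggregate form is ST_Union_Agg(geom); plain ST_Union is the two-argument scalar.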

-- Dissolve administrative parcels into region boundaries (topology resolved).
-- ST_Union_Agg is the aggregate; two-argument ST_Union is the scalar form.
SELECT region_id, ST_Union_Agg(geom) AS dissolved
FROM parcels
GROUP BY region_id;

-- Reduce precision FIRST: snapping to a fixed grid both speeds the merge and
-- eliminates sliver artifacts from IEEE-754 drift before the union runs.
SELECT region_id, ST_Union_Agg(ST_ReducePrecision(geom, 0.001)) AS dissolved
FROM parcels
GROUP BY region_id;

Prefer ST_Collect for downstream processing unless strict boundary simplification is required, and always apply ST_ReducePrecision ahead of a union to keep the merge stable.
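
The sliver problem comes from coordinates that should be identical but differ in the last floating-point bit. The following Python sketch illustrates the idea of grid snapping only; it is not how GEOS implements precision reduction, and the `snap` helper is hypothetical.

```python
def snap(value, grid=0.001):
    """Snap a coordinate to the nearest multiple of `grid`, returned as an integer grid index."""
    return round(value / grid)

a = 0.1 + 0.2   # accumulates IEEE-754 rounding error
b = 0.3

raw_equal = (a == b)                   # the two "same" coordinates compare unequal
snapped_equal = (snap(a) == snap(b))   # both land on the same grid index
```

Two vertices that differ only by rounding noise end up on the same grid node after snapping, so a union no longer sees a hairline gap or overlap between them.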

Window functions for running spatial metrics

Running aggregates over ordered spatial sequences — trajectory smoothing, cumulative distance, rolling density — use window frames rather than self-joins. Careful partitioning is what keeps them from materializing the whole input:

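-- Window calls cannot be nested, so compute the per-step distance first,
-- then run the cumulative SUM over the same partition and ordering.
WITH steps AS (
    SELECT
        id, route_id, timestamp, geom,
        ST_Distance(geom, LAG(geom) OVER (PARTITION BY route_id ORDER BY timestamp)) AS step_dist
    FROM gps_tracks
)
SELECT id, geom, SUM(step_dist) OVER w AS cumulative_dist
FROM steps
WINDOW w AS (PARTITION BY route_id ORDER BY timestamp ROWS UNBOUNDED PRECEDING);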

Trade-off: large ROWS BETWEEN frames force DuckDB to cache intermediate geometry states for the whole frame. For high-cardinality partitions, prefer RANGE frames or stage intermediate buffers in a materialized table. The full set of frame-sizing heuristics and partition gotchas lives in window functions for geospatial context, and density-based grouping is covered in ST_ClusterDBSCAN spatial grouping.
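
For readers who want to check the window semantics by hand, this Python sketch reproduces a per-route running distance over planar points. The `cumulative_distance` helper and the sample track are hypothetical, and it reports 0.0 on the first row of each route where a SQL SUM over a NULL lag value would return NULL.

```python
from itertools import groupby
from math import hypot
from operator import itemgetter

def cumulative_distance(rows):
    """rows: (route_id, timestamp, x, y) tuples -> {(route_id, timestamp): running distance}."""
    out = {}
    # Equivalent of PARTITION BY route_id ORDER BY timestamp.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for route_id, group in groupby(ordered, key=itemgetter(0)):
        prev = None      # no previous point on the first row of a partition
        total = 0.0
        for _, ts, x, y in group:
            if prev is not None:
                total += hypot(x - prev[0], y - prev[1])  # planar point-to-point distance
            out[(route_id, ts)] = total
            prev = (x, y)
    return out

tracks = [("r1", 1, 0.0, 0.0), ("r1", 2, 3.0, 4.0), ("r1", 3, 3.0, 10.0), ("r2", 1, 5.0, 5.0)]
running = cumulative_distance(tracks)
```

Only the previous point and one running total are held per partition, which is why an unbounded-preceding frame with a simple SUM is cheap compared with frames that must retain every geometry.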

Bounded pairwise aggregation

Pairwise spatial operations scale quadratically and break vectorization when unbounded. Gate them with a distance predicate so the bounding-box stage prunes pairs before any exact distance is computed:

-- Bounded k-NN style aggregation: ST_DWithin pushes an envelope pre-filter to the scan.
SELECT
    p1.id AS src,
    p2.id AS dst,
    ST_Distance(p1.geom, p2.geom) AS dist
FROM points p1
JOIN points p2
  ON ST_DWithin(p1.geom, p2.geom, 500.0)   -- bbox-expanded pre-filter, then exact
WHERE p1.id < p2.id;                        -- dedupe symmetric pairs

For full $N \times N$ matrix generation, the partitioned block strategies that prevent OOM are detailed in calculating distance matrices with SQL.
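
The pruning idea can be sketched in plain Python by bucketing points into cells the size of the search radius and comparing only neighbouring cells. This is an illustration of coarse filtering followed by an exact check, not a description of how DuckDB implements ST_DWithin; the `pairs_within` helper and the sample points are hypothetical.

```python
from collections import defaultdict
from math import floor, hypot

def pairs_within(points, radius):
    """points: (id, x, y) tuples -> sorted (src, dst, dist) with src < dst and dist <= radius."""
    grid = defaultdict(list)
    for pid, x, y in points:
        grid[(floor(x / radius), floor(y / radius))].append((pid, x, y))

    pairs = []
    for (cx, cy), members in grid.items():
        # Two points within `radius` can only sit in the same or an adjacent cell.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for pid2, x2, y2 in grid.get((cx + dx, cy + dy), ()):
                    for pid1, x1, y1 in members:
                        if pid1 < pid2:                      # dedupe symmetric pairs
                            dist = hypot(x1 - x2, y1 - y2)   # exact check on survivors only
                            if dist <= radius:
                                pairs.append((pid1, pid2, dist))
    return sorted(pairs)

pts = [(1, 0.0, 0.0), (2, 300.0, 400.0), (3, 2000.0, 2000.0), (4, 499.0, 0.0)]
close = pairs_within(pts, 500.0)
```

Point 3 is never compared against the others because its cell has no occupied neighbours, which is the same effect the distance predicate is meant to achieve in the join.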

Execution Plan Validation

Reading the plan is the only reliable way to confirm an aggregation stayed vectorized. Inspect the estimated plan with EXPLAIN and the measured plan with EXPLAIN ANALYZE.

EXPLAIN
SELECT cell_x, cell_y, COUNT(*), ST_Collect(list(geom))
FROM binned
GROUP BY cell_x, cell_y;

A healthy plan shows a HASH_GROUP_BY over a plain vectorized TABLE_SCAN, with the geometry kernels living in a PROJECTION rather than inside the grouping operator.

What to assert in the output:

Operator type. A HASH_GROUP_BY over a scalar key is correct. A NESTED_LOOP_JOIN in the plan means an accidental cross join slipped in — verify the GROUP BY keys are the integer cell coordinates, not geometries.
No row-at-a-time fallback. If the plan shows ROW_EXECUTION or a SCALAR_FUNCTION_EVAL of an ST_ function inside the aggregation phase, the query is bypassing SIMD kernels. The usual triggers are scalar Python UDFs, implicit GEOMETRY → VARCHAR casts, or geometry functions evaluated directly in GROUP BY instead of a precomputed key. Operator labels can differ between DuckDB releases, so confirm the exact names against a plan from your own build before hard-coding them in a check.
Cardinality drift. Capture actual vs estimated rows per operator. A gap larger than ~30% signals skewed geometry density or stale statistics and predicts unstable plans across data versions.
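
The drift check reduces to a small calculation. The sketch below assumes the gap is measured relative to the estimate, which is one reasonable reading of the 30% rule; the helper names and the row counts are hypothetical.

```python
def cardinality_drift(estimated, actual):
    """Relative gap between actual and estimated rows, measured against the estimate."""
    if estimated == 0:
        return float("inf") if actual else 0.0
    return abs(actual - estimated) / estimated

def flag_drift(operators, threshold=0.30):
    """operators: (name, estimated_rows, actual_rows) tuples -> names over the threshold."""
    return [name for name, est, act in operators if cardinality_drift(est, act) > threshold]

# Hypothetical per-operator counts copied out of a measured plan.
observed = [
    ("HASH_GROUP_BY", 1_000, 1_250),
    ("TABLE_SCAN", 500_000, 910_000),
    ("PROJECTION", 500_000, 500_000),
]
flagged = flag_drift(observed)
```

Here only the scan is flagged: its actual row count is 82% above the estimate, while the group-by sits at 25% and stays under the threshold.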

-- Measured plan with per-operator timing and peak memory.
EXPLAIN ANALYZE
SELECT cell_x, cell_y, COUNT(*), SUM(metric)
FROM binned
GROUP BY cell_x, cell_y;

Use EXPLAIN ANALYZE to localize a regression to a node: a sudden jump in operator_timing on the HASH_GROUP_BY, or a non-empty spill indicator on the union, tells you exactly where the budget broke.

Performance Trade-offs

Aggregation tuning is a series of bounded trade-offs. Quantify them against your own data, but these are the typical ranges:

Choice	Effect	When to apply
Integer grid key vs containment join	Reduces $O(N \times M)$ to $O(N)$; eliminates the spatial join entirely	Regular binning where exact polygon membership is not required
`ST_Collect` vs `ST_Union`	5–20× lower CPU and far lower peak memory; no topology resolution	Any pipeline where a later step consumes the multi-geometry
`ST_ReducePrecision` before union	Removes sliver artifacts; commonly 20–40% faster merge	Always, ahead of `ST_Union`/`ST_Intersection`
`ST_DWithin` vs `ST_Distance(...) < d`	Pushes an envelope pre-filter to the scan; prunes 60–95% of pairs	Every bounded pairwise or proximity aggregation
`RANGE` vs unbounded `ROWS` frame	Caps per-partition buffer; avoids full materialization	High-cardinality window partitions

The single highest-leverage move is keeping the grouping key scalar. A geometry-valued GROUP BY not only costs more per comparison, it can defeat the hash aggregate entirely and force a sort-based group. Project the cheap key first, aggregate, then reconstruct geometry in the final projection.

Edge Cases & Anti-Patterns

Geometry in the GROUP BY key (silent slowdown). Grouping directly on geom hashes serialized WKB byte-for-byte, which is both slow and semantically wrong for “same location” grouping.

-- Anti-pattern: hashes raw WKB, runs ST_Centroid per group, defeats vectorization.
SELECT ST_Centroid(geom), COUNT(*) FROM parcels GROUP BY geom;

-- Fix: group on a scalar key, compute geometry in the outer projection.
WITH keyed AS (
  SELECT floor(ST_X(ST_Centroid(geom)) / 1000) AS cx,
         floor(ST_Y(ST_Centroid(geom)) / 1000) AS cy,
         geom
  FROM parcels
)
SELECT cx, cy, COUNT(*), ST_Collect(list(geom)) FROM keyed GROUP BY cx, cy;
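
Why byte-level grouping is the wrong notion of "same location" can be shown with standard WKB alone. The sketch below assumes plain ISO WKB for a 2D point; DuckDB's internal geometry layout may differ, and the helper names are hypothetical.

```python
import struct

def point_wkb(x, y, little_endian=True):
    """Encode a 2D point as WKB: byte-order flag, geometry type 1, then x and y as doubles."""
    order = "<" if little_endian else ">"
    flag = 1 if little_endian else 0
    return struct.pack("B", flag) + struct.pack(order + "Idd", 1, x, y)

def read_point(wkb):
    """Decode the coordinates of a WKB point, honouring its byte-order flag."""
    order = "<" if wkb[0] == 1 else ">"
    _geom_type, x, y = struct.unpack(order + "Idd", wkb[1:])
    return (x, y)

le = point_wkb(10.0, 20.0, little_endian=True)
be = point_wkb(10.0, 20.0, little_endian=False)

same_bytes = (le == be)                             # the serialized forms differ
same_location = (read_point(le) == read_point(be))  # the decoded coordinates agree
```

The same mismatch appears with polygons whose rings start at different vertices, which is why a derived scalar key is the safer grouping column.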

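Grid binning in a geographic CRS (silent wrong answers). floor(ST_X(geom) / 1000) over EPSG:4326 divides degrees by 1000. Longitude spans only 360 degrees and latitude 180, so every point lands in cell 0 or -1 on each axis and the whole dataset collapses into at most four bins. Confirm the SRID before binning by metres: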

SELECT DISTINCT ST_SRID(geom) AS srid FROM points;  -- expect a projected CRS, not 4326
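
The failure is easy to reproduce without a database. This Python sketch applies the same floor-divide key to a few approximate longitude/latitude pairs; the `cell` helper and the city coordinates are illustrative only.

```python
from math import floor

def cell(x, y, size=1000.0):
    """Same cell key as the SQL binning: floor-divide each coordinate by the cell size."""
    return (floor(x / size), floor(y / size))

# Approximate lon/lat in degrees: Berlin, Nairobi, Sydney, Lima.
lonlat = [(13.40, 52.52), (36.82, -1.29), (151.21, -33.87), (-77.04, -12.05)]
cells_in_degrees = {cell(lon, lat) for lon, lat in lonlat}
```

Four cities on four continents produce only three distinct cells, and no input in valid degree ranges can ever produce more than four. The query still returns rows, which is what makes the mistake silent.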

Unbounded ST_Union across a giant group (OOM spill). A single GROUP BY bucket holding millions of overlapping polygons forces the entire merge working set into memory at once. Coarsen the group, reduce precision, or switch to ST_Collect and union downstream in batches.

Invalid geometry poisoning the aggregate. Self-intersections and unclosed rings make ST_Union throw or return corrupt boundaries. Validate and repair before aggregating — never feed an unchecked buffer into a union:

-- Isolate offenders and repair them; recombine with the valid rows before aggregating.
SELECT id, ST_MakeValid(geom) AS geom
FROM parcels
WHERE NOT ST_IsValid(geom);

Implicit casts on mixed inputs. Mixing a native GEOMETRY column with a VARCHAR WKT literal inside an aggregate triggers a per-row parse. Cast literals once with ST_GeomFromText outside the hot path, so the kernel only ever sees WKB.

Query Regression Analysis

Production aggregation pipelines need a baseline plan and an automated diff so a degradation surfaces in CI rather than in an on-call page. Capture the plan as JSON, walk the tree, and assert that no row-at-a-time or nested-loop node appears in the aggregation phase. This harness slots naturally into the orchestration covered in Python and DuckDB integration workflows and the batch processing pipelines guide.

import duckdb
import json

con = duckdb.connect(":memory:")
con.execute("INSTALL spatial; LOAD spatial;")
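con.execute("SET threads = 8; SET memory_limit = '12GB';")

# Stand-in for the binned relation so the plan can be built in a fresh session;
# point the query at the real table or view in your pipeline instead.
con.execute("CREATE TABLE binned (cell_x BIGINT, cell_y BIGINT, metric DOUBLE)")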

QUERY = """
SELECT cell_x, cell_y, COUNT(*) AS n
FROM binned
GROUP BY cell_x, cell_y
"""

# FORMAT JSON gives a machine-readable plan tree for diffing across builds.
plan = json.loads(
    con.execute(f"EXPLAIN (FORMAT JSON) {QUERY}").fetchone()[1]
)

BANNED = {"NESTED_LOOP_JOIN", "ROW_EXECUTION"}

def walk(node):
    name = node.get("name", "")
    yield name
    for child in node.get("children", []):
        yield from walk(child)

found = set(walk(plan[0])) if isinstance(plan, list) else set(walk(plan))
regressions = found & BANNED
assert not regressions, f"Plan regression: {regressions} appeared in aggregation plan"

Three fields are worth tracking across builds: operator_name (a shift to NESTED_LOOP_JOIN or ROW_EXECUTION is a hard regression), operator_timing (per-node deltas localize the slowdown), and peak_memory plus any spill indicator (early warning of a memory_limit violation). For the GeoPandas handoff that consumes these aggregated results without a serialization round-trip, see DuckDB-to-GeoPandas sync; to run the capture without blocking the event loop, see async execution patterns.
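
Tracking operator names across builds amounts to diffing two plan trees. The sketch below assumes each node is a dict with "name" and "children" keys, matching the shape the harness above walks; both plans are hand-written fixtures, not real DuckDB output.

```python
from collections import Counter

def operator_counts(node):
    """Count how often each operator name appears in a plan tree."""
    counts = Counter([node.get("name", "")])
    for child in node.get("children", []):
        counts += operator_counts(child)
    return counts

def plan_diff(baseline, current):
    """Operators that appeared or disappeared between a baseline plan and the current one."""
    before, after = operator_counts(baseline), operator_counts(current)
    return {"added": dict(after - before), "removed": dict(before - after)}

# Hand-written fixtures standing in for stored and freshly captured plans.
baseline_plan = {"name": "HASH_GROUP_BY", "children": [
    {"name": "PROJECTION", "children": [{"name": "TABLE_SCAN", "children": []}]}]}
current_plan = {"name": "HASH_GROUP_BY", "children": [
    {"name": "NESTED_LOOP_JOIN", "children": [
        {"name": "TABLE_SCAN", "children": []},
        {"name": "TABLE_SCAN", "children": []}]}]}

diff = plan_diff(baseline_plan, current_plan)
```

Counting occurrences rather than collecting a set means a second scan of the same table shows up as a change, which a plain set comparison would miss.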

Diagnostic thresholds to alert on:

Signal	Threshold	Action
`ROW_EXECUTION` in plan	any occurrence	Isolate scalar expressions; replace Python UDFs with native `ST_` equivalents
Spill to disk	> 5% of `memory_limit`	Raise `memory_limit`, move `temp_directory` to NVMe, or switch `ST_Union` → `ST_Collect`
`NESTED_LOOP_JOIN` in aggregation	any occurrence	Verify `GROUP BY` keys are scalar; remove the accidental cross join
Actual vs estimated rows	> 30% drift	Refresh statistics; check for skewed geometry density
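
The spill threshold can be evaluated with a few lines of Python. This sketch assumes decimal units (1 GB = 10^9 bytes) and a spilled-byte figure you have already extracted from profiling output; `parse_size` and `spill_alert` are hypothetical helpers, not part of any DuckDB API.

```python
# Assumption: decimal units. Adjust the factors if your build interprets sizes differently.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def parse_size(text):
    """Convert a size string such as '12GB' into bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * factor)
    return int(text)  # bare number of bytes

def spill_alert(spilled_bytes, memory_limit, ratio=0.05):
    """True when spilled bytes exceed the given fraction of memory_limit."""
    return spilled_bytes > ratio * parse_size(memory_limit)

quiet = spill_alert(400 * 10**6, "12GB")   # 400 MB spilled, threshold is 600 MB
noisy = spill_alert(900 * 10**6, "12GB")   # 900 MB spilled, over the threshold
```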

See also

Spatial joins and proximity filters — the containment and ST_DWithin pruning that grid binning replaces, including point-in-polygon optimization.
Window functions for geospatial context — running spatial metrics and ST_ClusterDBSCAN grouping.
Calculating distance matrices with SQL — partitioned block strategies for bounded pairwise aggregation.
R-tree spatial indexing internals and CRS mapping and transformations — the structures and units that aggregation correctness depends on.

External Reference Standards: Kernel-level behaviour of the spatial aggregates follows the DuckDB Spatial extension documentation, and execution-context management for the Python harness follows the DuckDB Python API reference.
