DBSCAN-Style Spatial Grouping in DuckDB

DuckDB Spatial ships no ST_ClusterDBSCAN function, so density-based grouping must be expressed in SQL — and the naive translation, a pairwise distance join, is exactly what turns an $O(N \log N)$ job into an $O(N^2)$ memory blow-up. This walkthrough sits under Window Functions for Geospatial and isolates that one failure: how to group nearby points into density clusters at scale without ever materializing a distance matrix, using grid-cell binning that stays a single linear pass.

Root-Cause Analysis of Clustering Blow-Up

The PostGIS habit of calling ST_ClusterDBSCAN(geom, eps, minpoints) has no equivalent in DuckDB, so engineers reach for a self-join on ST_DWithin to recreate the neighbour relation. Without a bounding-box pre-filter the planner has no way to prune candidates, materializes the full Cartesian product, and exhausts the memory budget before the first aggregation completes. The error surfaces as Out of Memory: failed to allocate X bytes, but the cause is structural, not a bad memory_limit. Density grouping in DuckDB fails along four distinct axes, each with its own fix:

Missing bounding-box pre-filter. A self-join or CROSS JOIN on ST_DWithin with no envelope test in front of it compares every point to every other point. The plan shows a CROSS_PRODUCT or NESTED_LOOP_JOIN node ahead of the grouping. The fix is to abandon the distance matrix entirely and snap each point to a fixed grid cell with integer arithmetic, which collapses the relation to a single hash aggregation.
CRS / unit mismatch. Binning on raw longitude/latitude makes the eps-equivalent cell size latitude-dependent: a 0.001° cell is ~111 m wide at the equator but only ~72 m wide at 50° N, while its north-south height stays near 111 m, so the density threshold means different things in different regions. Project to a metric CRS first; the rules live in the coordinate reference system reference.
Geometry invalidity. A null or empty GEOMETRY yields a null cell key, and GROUP BY does not discard such rows: it pools them into a single NULL-keyed group that corresponds to no real cell and can even pass the density threshold. Guard the input with ST_IsValid and explicit null checks before binning, covered in the fallback section below.
Index left on the table, unused. An R-tree index accelerates the region-of-interest pre-filter but cannot help a floor()-derived cell key, which is a scan-time computation. Knowing which stage the index serves prevents wasted build cost.
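The unit-mismatch point can be checked with a few lines of arithmetic. The sketch below assumes a spherical approximation of 111,320 m per degree (an assumption, not a value from this page) and needs no spatial extension:

```sql
-- Ground size of a 0.001-degree cell at several latitudes.
-- Assumption: 111320 m per degree, a spherical approximation.
SELECT
    lat,
    round(0.001 * 111320 * cos(radians(lat)), 1) AS width_m,   -- east-west extent shrinks with cos(lat)
    round(0.001 * 111320, 1)                     AS height_m   -- north-south extent stays roughly constant
FROM (VALUES (0), (30), (50), (60)) AS t(lat)
ORDER BY lat;
```

The width column falls as latitude rises while the height column does not, so a fixed degree-sized cell is neither square nor consistent across regions.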

The DuckDB-native answer, grounded in the set-based model of Modern Spatial SQL Query Patterns, replaces the distance matrix with grid-cell assignment: the cell size plays the role of DBSCAN’s eps, and a HAVING COUNT(*) >= k threshold plays the role of minPts.

Deterministic Configuration

The minimal session below enforces memory boundaries, directs spill to fast local storage, and prepares a metric-projected, validity-checked input table. Confirm the prerequisites before running any grouping query:

spatial extension installed and loaded in the session
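Input geometry projected to a metric CRS, not degree-space: a local UTM zone where ground distances matter, or EPSG:3857 with the caveat that Web Mercator inflates distances by roughly 1/cos(latitude)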
memory_limit set below total RAM with headroom for the hash table and spill
A fast local temp_directory available for graceful spill

INSTALL spatial;
LOAD spatial;

-- Cap resident memory below total RAM: a skewed cell distribution can grow the
-- hash table unexpectedly, so leave headroom to spill rather than OOM.
SET memory_limit = '16GB';

-- Match physical cores; grid binning is a memory-bandwidth-bound hash aggregate,
-- so oversubscribing threads raises peak memory without speeding the group-by.
SET threads = 8;

-- Keep spill files on fast local NVMe; a network mount turns a graceful spill
-- into a stall under sustained write pressure.
SET temp_directory = '/mnt/nvme/duckdb_spatial_spill';

-- Cap spill size so a runaway too-fine grid fails fast instead of filling disk.
SET max_temp_directory_size = '50GB';

-- Window/group output rarely needs source order; releasing it frees the
-- aggregate to parallelize across cells.
SET preserve_insertion_order = false;

-- Project once to a metric CRS and keep only valid geometry, so every
-- downstream cell size is expressed in metres and no null key slips through.
-- Assumption: geom_wkb stores lon/lat (x, y). DuckDB follows the CRS-defined
-- axis order (lat/lon for EPSG:4326) unless always_xy is set.
CREATE OR REPLACE TEMP TABLE points_metric AS
SELECT
    id,
    ST_Transform(ST_GeomFromWKB(geom_wkb), 'EPSG:4326', 'EPSG:3857',
                 always_xy := true) AS geom
FROM raw_sensor_data
WHERE ST_IsValid(ST_GeomFromWKB(geom_wkb));   -- drop invalid input before binning
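Before running any grouping query, it can help to confirm that the session actually picked up these values. A minimal check, using DuckDB's built-in duckdb_settings() table function:

```sql
-- List the settings configured above with their effective values.
SELECT name, value
FROM duckdb_settings()
WHERE name IN ('memory_limit', 'threads', 'temp_directory',
               'max_temp_directory_size', 'preserve_insertion_order')
ORDER BY name;
```

DuckDB may report sizes in its own normalized units, so compare magnitudes rather than expecting the literal strings passed to SET.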

Optimized Execution Pattern

The behavioral change is to delete the join. The “before” query emulates DBSCAN with a self-join and pays the full quadratic price; the “after” query derives a deterministic cell key and groups in one linear pass.

-- BEFORE: self-join distance matrix — materializes O(N^2) pairs, then OOMs.
SELECT a.id, COUNT(b.id) AS neighbours
FROM points_metric a
JOIN points_metric b                       -- no bbox pre-filter: full pairwise blow-up
  ON ST_DWithin(a.geom, b.geom, 100)
GROUP BY a.id
HAVING COUNT(b.id) >= 5;

-- AFTER: grid-cell binning — snap to a fixed ~100 m cell, then hash-aggregate.
SELECT
    floor(ST_X(geom) / 100) AS cell_x,     -- 100 m cell width == DBSCAN eps
    floor(ST_Y(geom) / 100) AS cell_y,
    COUNT(*)                AS density,    -- COUNT >= k == DBSCAN minPts
    ST_Centroid(ST_Collect(list(geom))) AS group_center  -- ST_Collect takes a GEOMETRY[] list, so aggregate with list() first
FROM points_metric
GROUP BY cell_x, cell_y
HAVING COUNT(*) >= 5;                       -- minimum density per cell (minPts)

The annotated diff is the join removal itself: the ON ST_DWithin(...) clause forces a NESTED_LOOP_JOIN over every pair, while floor(ST_X(geom) / 100) is a scalar computed during the scan, so the grouping drops from quadratic pair evaluation to a single hash pass. The cell size (100) is the tuning knob: it stands in for the neighbourhood radius, and the HAVING threshold (5) is the minimum number of points a cell needs to qualify as dense rather than noise. The stand-in is approximate, since two points in one cell can be up to 100·√2 ≈ 141 m apart while two points a metre apart can fall on opposite sides of a cell boundary. Anchoring downstream logic to the derived centroid and density rather than the raw cell ids keeps results stable when you re-tune the grid. This composes directly with vectorized aggregations: bin first, then aggregate over the bins.
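To see how the cell key behaves at the edges, here is a small sketch on hypothetical metric coordinates that stand in for ST_X(geom) and ST_Y(geom). It runs without the spatial extension, and the BIGINT cast is an optional addition that turns the floor() result into a true integer key:

```sql
-- Hypothetical points; x and y play the role of ST_X(geom) and ST_Y(geom).
SELECT
    id,
    CAST(floor(x / 100) AS BIGINT) AS cell_x,
    CAST(floor(y / 100) AS BIGINT) AS cell_y
FROM (VALUES
    (1,  99.0,  10.0),   -- just inside cell (0, 0)
    (2, 101.0,  10.0),   -- 2 m from id 1, yet lands in cell (1, 0)
    (3,   0.0,   0.0),   -- shares cell (0, 0) with id 1 at about 99.5 m distance
    (4, -50.0, -50.0)    -- negative coordinates: floor assigns cell (-1, -1)
) AS pts(id, x, y)
ORDER BY id;
```

Points 1 and 2 illustrate the boundary effect that adjacent-cell merging has to absorb, and point 4 shows why floor() is preferable to truncation toward zero, which would fold coordinates on both sides of the origin into cell 0.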

When a single uniform grid is too coarse for dense hotspots, pre-aggregate per cell and run a finer second pass only on dense cells — this reduces N before any further work:

-- Grid pre-aggregation: centroid + density per dense cell, feeding a finer pass.
WITH grid_agg AS (
    SELECT
        floor(ST_X(geom) / 100) AS cell_x,
        floor(ST_Y(geom) / 100) AS cell_y,
        ST_Centroid(ST_Collect(list(geom))) AS centroid,  -- list() feeds ST_Collect its GEOMETRY[] input
        COUNT(*)                       AS density
    FROM points_metric
    GROUP BY cell_x, cell_y
    HAVING COUNT(*) >= 5               -- only dense cells survive to the next stage
)
SELECT cell_x, cell_y, centroid, density
FROM grid_agg
ORDER BY density DESC;

To merge adjacent dense cells into larger clusters, self-join only on neighbouring cell ids (cell_x ± 1, cell_y ± 1) — a bounded join over a handful of integer keys, never a full distance matrix.
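One way to sketch that merge is a bounded neighbour join followed by a recursive walk that labels connected cells. The cell values, CTE names, and the cluster_id labelling scheme below are all illustrative assumptions, not part of the queries above:

```sql
WITH RECURSIVE cells AS (
    -- Hypothetical dense cells; in practice, the output of the grid aggregation.
    SELECT row_number() OVER (ORDER BY cell_x, cell_y) AS id, cell_x, cell_y, density
    FROM (VALUES (0, 0, 7), (1, 0, 5), (1, 1, 9), (5, 5, 6)) AS t(cell_x, cell_y, density)
),
offsets AS (
    -- The 3x3 neighbourhood as nine explicit offsets, so the join below stays an equi-join.
    SELECT ox.d AS dx, oy.d AS dy
    FROM (VALUES (-1), (0), (1)) AS ox(d)
    CROSS JOIN (VALUES (-1), (0), (1)) AS oy(d)
),
edges AS (
    -- Each cell is expanded by a constant factor of nine, never by N.
    SELECT a.id AS src, b.id AS dst
    FROM cells a
    CROSS JOIN offsets o
    JOIN cells b
      ON b.cell_x = a.cell_x + o.dx
     AND b.cell_y = a.cell_y + o.dy
),
reach(id, root) AS (
    SELECT id, id FROM cells
    UNION                        -- set semantics deduplicate pairs, so the walk terminates
    SELECT e.dst, r.root
    FROM reach r
    JOIN edges e ON e.src = r.id
)
-- Label each cell with the smallest cell id reachable from it.
SELECT c.cell_x, c.cell_y, c.density, min(r.root) AS cluster_id
FROM reach r
JOIN cells c ON c.id = r.id
GROUP BY c.cell_x, c.cell_y, c.density
ORDER BY cluster_id, c.cell_x, c.cell_y;
```

In this sample the three touching cells share one cluster_id and the isolated cell at (5, 5) keeps its own. The reach table grows with the square of each component's cell count, which is acceptable over dense cells only but would warrant an iterative label-propagation approach for very large connected regions.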

Diagnostic Queries & Plan Validation

Validate the plan shape before production: a healthy grouping is a HASH_GROUP_BY over a table scan (shown as TABLE_SCAN or SEQ_SCAN depending on the DuckDB version) with no join node between them.

-- 1. Confirm the plan is a hash aggregate over a scan, with no cross product.
EXPLAIN ANALYZE
SELECT floor(ST_X(geom) / 100) AS cell_x,
       floor(ST_Y(geom) / 100) AS cell_y,
       COUNT(*)
FROM points_metric
GROUP BY cell_x, cell_y;

A CROSS_PRODUCT or NESTED_LOOP_JOIN node means a pairwise distance step slipped back in — the single most important anti-pattern signal for this operation. The plan should read scan → hash group-by → projection, the same scan-then-aggregate shape that the window functions for geospatial reference relies on for density-ranked partitioning.

-- 2. Quantify cell distribution and the noise ratio before deploying.
SELECT
    cell_x, cell_y,
    COUNT(*) AS point_count,
    CASE WHEN COUNT(*) >= 5 THEN 'CLUSTER' ELSE 'NOISE' END AS classification
FROM (
    SELECT floor(ST_X(geom) / 100) AS cell_x,
           floor(ST_Y(geom) / 100) AS cell_y
    FROM points_metric
) g
GROUP BY cell_x, cell_y
ORDER BY point_count DESC;

Diagnostic Boundary: if the NOISE share exceeds ~40% of cells, the grid is too fine for the data density — widen the cell size using the dataset’s 95th-percentile nearest-neighbour distance as a starting eps. Sustained spill writes to temp_directory are the same signal seen from the other side: a too-fine grid produces many sparse cells. The decision of whether the working set fits in memory at all is governed by the in-memory versus on-disk storage boundaries for your dataset.
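The noise share itself can be reduced to a single row. A minimal sketch on hypothetical coordinates, reusing the threshold of 5 and the 100 m cell size from the queries above:

```sql
WITH pts AS (
    -- Hypothetical stand-in for ST_X(geom), ST_Y(geom) from points_metric.
    SELECT * FROM (VALUES
        (10.0, 10.0), (20.0, 30.0), (35.0, 60.0), (50.0, 50.0), (90.0, 15.0),  -- five points in cell (0, 0)
        (250.0, 40.0),                                                          -- lone point in cell (2, 0)
        (410.0, 980.0)                                                          -- lone point in cell (4, 9)
    ) AS t(x, y)
),
cells AS (
    SELECT floor(x / 100) AS cell_x, floor(y / 100) AS cell_y, COUNT(*) AS point_count
    FROM pts
    GROUP BY cell_x, cell_y
)
SELECT
    COUNT(*)                                AS total_cells,
    COUNT(*) FILTER (WHERE point_count < 5) AS noise_cells,
    round(100.0 * COUNT(*) FILTER (WHERE point_count < 5) / COUNT(*), 1) AS noise_pct
FROM cells;
```

Here two of the three cells fall below the threshold, so the share sits well above the ~40% boundary and the grid would be a candidate for widening.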

Geometry Validation & Fallback Routing

Density grouping is only as trustworthy as its input geometry. A null or empty feature yields a null cell key, which GROUP BY pools into a NULL-keyed group belonging to no real cell, without raising an error. Guard explicitly, and route invalid rows through repair rather than dropping them silently.

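-- Validity gate with repair: salvage fixable geometry, quarantine the rest.
-- Note: points_metric above already filters on ST_IsValid, so the repair
-- branch only fires if that upstream WHERE clause is relaxed.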
CREATE OR REPLACE TEMP TABLE points_clean AS
SELECT
    id,
    CASE
        WHEN ST_IsValid(geom) THEN geom
        ELSE ST_MakeValid(geom)        -- repair self-intersections/duplicate vertices
    END AS geom
FROM points_metric
WHERE geom IS NOT NULL
  AND NOT ST_IsEmpty(geom);            -- empty geometry produces a null cell key
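A short sketch of what unguarded null coordinates do under GROUP BY, using hypothetical values in place of ST_X(geom) and ST_Y(geom):

```sql
-- Two real points plus three rows whose coordinates are NULL.
SELECT
    floor(x / 100) AS cell_x,
    floor(y / 100) AS cell_y,
    COUNT(*)       AS density
FROM (VALUES
    (10.0, 10.0), (20.0, 20.0),                 -- two real points in cell (0, 0)
    (NULL, NULL), (NULL, NULL), (NULL, NULL)    -- coordinates derived from null geometry
) AS t(x, y)
GROUP BY cell_x, cell_y
ORDER BY cell_x NULLS LAST;
```

The NULL rows come back as one group with a NULL key and a density of 3, larger than the real cell. With a low enough HAVING threshold that group would be reported as a cluster, which is why the IS NOT NULL and ST_IsEmpty guards belong ahead of the binning step.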

For out-of-memory conditions on very large inputs, chunk the binning by a coarse spatial tile so each pass fits the budget, then union the per-tile cluster summaries. Because cell keys are deterministic and the 10 km tile size is an exact multiple of the 100 m cell size, every cell lies wholly inside one tile, so tile boundaries never split a cell or double-count a point:

-- Chunked execution: process one coarse 10 km tile at a time to bound memory.
SELECT
    floor(ST_X(geom) / 100) AS cell_x,
    floor(ST_Y(geom) / 100) AS cell_y,
    COUNT(*) AS density
FROM points_clean
WHERE floor(ST_X(geom) / 10000) = $tile_x   -- bind one tile per execution (DuckDB named parameter)
  AND floor(ST_Y(geom) / 10000) = $tile_y
GROUP BY cell_x, cell_y
HAVING COUNT(*) >= 5;
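The tile pairs to bind can be enumerated up front with the same floor() arithmetic. The sketch below uses hypothetical coordinates in place of points_clean and reports how many points each tile holds:

```sql
WITH pts AS (
    -- Hypothetical stand-in for ST_X(geom), ST_Y(geom) from points_clean.
    SELECT * FROM (VALUES
        (120.0, 80.0), (9950.0, 40.0),     -- both inside tile (0, 0)
        (10020.0, 60.0),                   -- just across the boundary, tile (1, 0)
        (25500.0, 31000.0)                 -- tile (2, 3)
    ) AS t(x, y)
)
SELECT
    CAST(floor(x / 10000) AS BIGINT) AS tile_x,
    CAST(floor(y / 10000) AS BIGINT) AS tile_y,
    COUNT(*)                         AS points_in_tile
FROM pts
GROUP BY tile_x, tile_y
ORDER BY points_in_tile DESC;
```

Each output row is one execution of the chunked query, and the point counts give an early signal of which tiles are heavy enough to need a smaller tile size.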

Route to a coarser grid when execution time exceeds 3× the baseline or memory utilization consistently breaches 85% of the configured memory_limit. When you need true DBSCAN semantics that a fixed grid cannot express, such as clusters chained through density-reachable core points, the honest fallback is to push that subset to an ST_ClusterDBSCAN-capable engine; the grid approach here covers the fixed-radius majority case at a fraction of the cost.

See also

Window Functions for Geospatial — the density-ranked partitioning pattern this page extends with a tunable threshold.
Vectorized Aggregations — bin first, then aggregate centroid and density over the cells.
CRS Mapping & Transformations — why metric projection must precede any cell-size threshold.

External Reference Standards: For true DBSCAN clustering semantics, see the PostGIS ST_ClusterDBSCAN documentation; for DuckDB’s memory and spill behaviour on geometry-heavy aggregations, the DuckDB Spatial Extension Documentation.
