DBSCAN-Style Spatial Grouping in DuckDB

Root-Cause Analysis: Why Naive Clustering Blows Up

DuckDB’s spatial extension does not ship a ST_ClusterDBSCAN function — density-based clustering must be expressed in SQL. The common failure mode is reaching for a pairwise distance evaluation (a CROSS JOIN or self-join on ST_DWithin) to emulate DBSCAN. Without a bounding-box pre-filter the planner materializes the full Cartesian product, shifting complexity from O(NlogN)O(N \log N) to O(N2)O(N^2) and exhausting the memory budget before the first aggregation pass completes.

Symptoms manifest as Error: Out of Memory: failed to allocate X bytes. Root-cause verification means inspecting the execution plan for a CROSS_PRODUCT or NESTED_LOOP_JOIN node ahead of the grouping step. The DuckDB-native answer, covered in Modern Spatial SQL Query Patterns, is to replace the distance matrix with a grid-cell assignment: snap each point to a fixed cell with integer arithmetic, then group. That is a single linear pass with no join.

Exact Configuration & Deterministic Execution Baseline

The following baseline enforces memory boundaries, projects to a metric CRS (so cell sizes are in metres), and assigns grid cells without any pairwise comparison.

-- 1. Enforce execution boundaries & temp storage
SET memory_limit = '16GB';
SET temp_directory = '/tmp/duckdb_spatial_cache';
SET enable_progress_bar = false;
SET preserve_insertion_order = false;

-- 2. Project to a metric CRS and keep only valid geometry
CREATE OR REPLACE TEMP TABLE points_metric AS
SELECT
    id,
    ST_Transform(ST_GeomFromWKB(geom_wkb), 'EPSG:4326', 'EPSG:3857') AS geom
FROM raw_sensor_data
WHERE ST_IsValid(ST_GeomFromWKB(geom_wkb));

-- 3. Assign each point to a fixed ~100 m grid cell (density grouping, no join)
SELECT
    id,
    geom,
    floor(ST_X(geom) / 100) AS cell_x,
    floor(ST_Y(geom) / 100) AS cell_y
FROM points_metric;

Configuration Fix Checklist:

  • Project to a metric CRS (EPSG:3857 or a local UTM zone) with ST_Transform before binning. Degree-based cell sizes introduce latitude-dependent distortion that breaks density thresholds.
  • Pick the cell size to match your neighbourhood radius (here 100 metres); it plays the role of DBSCAN’s eps. The HAVING COUNT(*) >= k threshold plays the role of minPts.
  • If too many cells fall below the density threshold (the “noise” set), increase the cell size or lower the threshold in 10–15% increments. If you need true DBSCAN semantics (variable-density clusters), PostGIS exposes ST_ClusterDBSCAN; DuckDB does not.

Memory Boundaries & Vectorized Execution Tuning

Grid binning is a hash aggregation, so it stays vectorized and bounds memory naturally. Set explicit limits so a skewed cell distribution can spill rather than OOM:

SET threads = 8;
SET memory_limit = '16GB';
SET max_temp_directory_size = '50GB';

Monitor temp_directory I/O throughput; sustained spill writes indicate a too-fine grid producing many sparse cells — increase the cell size or pre-filter with ST_DWithin to a region of interest. The DuckDB Spatial extension documentation details memory allocation strategies for geometry-heavy workloads.

Diagnostic Queries & Plan Verification

Validate the grouping and inspect cluster (cell) distribution before production deployment.

-- 1. Verify the plan is a HASH_GROUP_BY over a scan (no nested loop / cross product)
EXPLAIN ANALYZE
SELECT floor(ST_X(geom) / 100) AS cell_x,
       floor(ST_Y(geom) / 100) AS cell_y,
       COUNT(*)
FROM points_metric
GROUP BY cell_x, cell_y;

-- 2. Quantify cell distribution & noise ratio
SELECT
    cell_x,
    cell_y,
    COUNT(*) AS point_count,
    CASE WHEN COUNT(*) >= 5 THEN 'CLUSTER' ELSE 'NOISE' END AS classification
FROM (
    SELECT floor(ST_X(geom) / 100) AS cell_x,
           floor(ST_Y(geom) / 100) AS cell_y
    FROM points_metric
) g
GROUP BY cell_x, cell_y
ORDER BY point_count DESC;

Inspect the EXPLAIN output for a HASH_GROUP_BY over a TABLE_SCAN. A CROSS_PRODUCT or NESTED_LOOP_JOIN means a pairwise distance step slipped in. If the noise ratio exceeds 40%, widen the cell size using the dataset’s 95th-percentile nearest-neighbour distance.

Fallback Routing & Degradation Paths

When a single uniform grid is too coarse or too fine, route to these deterministic strategies:

  1. Grid Pre-Aggregation: aggregate per cell first (centroid + density), then run a finer second pass only on dense cells — this reduces N before any distance work.
  2. Bounding-Box Pre-Filter: use ST_DWithin (paired with the && operator) to prune isolated points to a region of interest before grouping.
  3. Connected Cells: to merge adjacent dense cells into larger clusters, self-join only neighbouring cell ids (cell_x ± 1, cell_y ± 1) — a bounded join, not a full distance matrix.
-- Grid pre-aggregation: centroid + density per dense cell
WITH grid_agg AS (
    SELECT
        floor(ST_X(geom) / 100) AS cell_x,
        floor(ST_Y(geom) / 100) AS cell_y,
        ST_Centroid(ST_Collect(list(geom))) AS centroid,
        COUNT(*) AS density
    FROM points_metric
    GROUP BY cell_x, cell_y
    HAVING COUNT(*) >= 5
)
SELECT cell_x, cell_y, centroid, density
FROM grid_agg
ORDER BY density DESC;

Route to a coarser grid when query execution time exceeds 3× the baseline or when memory utilization consistently breaches 85% of the configured memory_limit. For the windowing primitives that complement this grouping, see Window Functions for Geospatial.