DBSCAN-Style Spatial Grouping in DuckDB
Root-Cause Analysis: Why Naive Clustering Blows Up
DuckDB’s spatial extension does not ship a ST_ClusterDBSCAN function — density-based clustering must be expressed in SQL. The common failure mode is reaching for a pairwise distance evaluation (a CROSS JOIN or self-join on ST_DWithin) to emulate DBSCAN. Without a bounding-box pre-filter the planner materializes the full Cartesian product, shifting complexity from to and exhausting the memory budget before the first aggregation pass completes.
Symptoms manifest as Error: Out of Memory: failed to allocate X bytes. Root-cause verification means inspecting the execution plan for a CROSS_PRODUCT or NESTED_LOOP_JOIN node ahead of the grouping step. The DuckDB-native answer, covered in Modern Spatial SQL Query Patterns, is to replace the distance matrix with a grid-cell assignment: snap each point to a fixed cell with integer arithmetic, then group. That is a single linear pass with no join.
Exact Configuration & Deterministic Execution Baseline
The following baseline enforces memory boundaries, projects to a metric CRS (so cell sizes are in metres), and assigns grid cells without any pairwise comparison.
-- 1. Enforce execution boundaries & temp storage
SET memory_limit = '16GB';
SET temp_directory = '/tmp/duckdb_spatial_cache';
SET enable_progress_bar = false;
SET preserve_insertion_order = false;
-- 2. Project to a metric CRS and keep only valid geometry
CREATE OR REPLACE TEMP TABLE points_metric AS
SELECT
id,
ST_Transform(ST_GeomFromWKB(geom_wkb), 'EPSG:4326', 'EPSG:3857') AS geom
FROM raw_sensor_data
WHERE ST_IsValid(ST_GeomFromWKB(geom_wkb));
-- 3. Assign each point to a fixed ~100 m grid cell (density grouping, no join)
SELECT
id,
geom,
floor(ST_X(geom) / 100) AS cell_x,
floor(ST_Y(geom) / 100) AS cell_y
FROM points_metric;
Configuration Fix Checklist:
- Project to a metric CRS (EPSG:3857 or a local UTM zone) with
ST_Transformbefore binning. Degree-based cell sizes introduce latitude-dependent distortion that breaks density thresholds. - Pick the cell size to match your neighbourhood radius (here
100metres); it plays the role of DBSCAN’seps. TheHAVING COUNT(*) >= kthreshold plays the role ofminPts. - If too many cells fall below the density threshold (the “noise” set), increase the cell size or lower the threshold in 10–15% increments. If you need true DBSCAN semantics (variable-density clusters), PostGIS exposes
ST_ClusterDBSCAN; DuckDB does not.
Memory Boundaries & Vectorized Execution Tuning
Grid binning is a hash aggregation, so it stays vectorized and bounds memory naturally. Set explicit limits so a skewed cell distribution can spill rather than OOM:
SET threads = 8;
SET memory_limit = '16GB';
SET max_temp_directory_size = '50GB';
Monitor temp_directory I/O throughput; sustained spill writes indicate a too-fine grid producing many sparse cells — increase the cell size or pre-filter with ST_DWithin to a region of interest. The DuckDB Spatial extension documentation details memory allocation strategies for geometry-heavy workloads.
Diagnostic Queries & Plan Verification
Validate the grouping and inspect cluster (cell) distribution before production deployment.
-- 1. Verify the plan is a HASH_GROUP_BY over a scan (no nested loop / cross product)
EXPLAIN ANALYZE
SELECT floor(ST_X(geom) / 100) AS cell_x,
floor(ST_Y(geom) / 100) AS cell_y,
COUNT(*)
FROM points_metric
GROUP BY cell_x, cell_y;
-- 2. Quantify cell distribution & noise ratio
SELECT
cell_x,
cell_y,
COUNT(*) AS point_count,
CASE WHEN COUNT(*) >= 5 THEN 'CLUSTER' ELSE 'NOISE' END AS classification
FROM (
SELECT floor(ST_X(geom) / 100) AS cell_x,
floor(ST_Y(geom) / 100) AS cell_y
FROM points_metric
) g
GROUP BY cell_x, cell_y
ORDER BY point_count DESC;
Inspect the EXPLAIN output for a HASH_GROUP_BY over a TABLE_SCAN. A CROSS_PRODUCT or NESTED_LOOP_JOIN means a pairwise distance step slipped in. If the noise ratio exceeds 40%, widen the cell size using the dataset’s 95th-percentile nearest-neighbour distance.
Fallback Routing & Degradation Paths
When a single uniform grid is too coarse or too fine, route to these deterministic strategies:
- Grid Pre-Aggregation: aggregate per cell first (centroid + density), then run a finer second pass only on dense cells — this reduces N before any distance work.
- Bounding-Box Pre-Filter: use
ST_DWithin(paired with the&&operator) to prune isolated points to a region of interest before grouping. - Connected Cells: to merge adjacent dense cells into larger clusters, self-join only neighbouring cell ids (
cell_x ± 1,cell_y ± 1) — a bounded join, not a full distance matrix.
-- Grid pre-aggregation: centroid + density per dense cell
WITH grid_agg AS (
SELECT
floor(ST_X(geom) / 100) AS cell_x,
floor(ST_Y(geom) / 100) AS cell_y,
ST_Centroid(ST_Collect(list(geom))) AS centroid,
COUNT(*) AS density
FROM points_metric
GROUP BY cell_x, cell_y
HAVING COUNT(*) >= 5
)
SELECT cell_x, cell_y, centroid, density
FROM grid_agg
ORDER BY density DESC;
Route to a coarser grid when query execution time exceeds 3× the baseline or when memory utilization consistently breaches 85% of the configured memory_limit. For the windowing primitives that complement this grouping, see Window Functions for Geospatial.