Skip to content

DBSCAN

Formula

\[ \text{core point } x_i: |\{x_j: d(x_i,x_j)\le \varepsilon\}| \ge \text{MinPts} \]

Parameters

  • \(\varepsilon\): neighborhood radius
  • \(\text{MinPts}\): minimum points for a dense region

What it means

Density-based clustering groups points in dense regions and labels sparse points as noise.

What it's used for

  • Finding arbitrarily shaped clusters.
  • Outlier-aware clustering without specifying \(K\).

Key properties

  • Naturally identifies noise points.
  • Works poorly when cluster densities differ strongly.

Common gotchas

  • Feature scaling and distance choice matter a lot.
  • Parameter tuning can be difficult in high dimensions.

Example

GPS points form route clusters while isolated points remain labeled as noise.

How to Compute (Pseudocode)

Input: points x[1..n], radius eps, MinPts
Output: cluster labels (including noise)

mark all points as unvisited
cluster_id <- 0

for each point p:
  if p is visited:
    continue
  mark p visited
  N <- neighbors within distance eps of p
  if size(N) < MinPts:
    label p as noise (tentatively)
    continue

  cluster_id <- cluster_id + 1
  assign p to cluster_id
  expand cluster by repeatedly visiting neighbors:
    if a neighbor q is unvisited, mark visited and query its eps-neighbors
    if q is a core point, add its neighbors to the expansion frontier
    assign density-reachable points to cluster_id

return labels

Complexity

  • Time: Depends on neighborhood-query implementation; typically \(O(n\log n)\) to \(O(n^2)\), with \(O(n^2)\) worst case for naive distance checks
  • Space: \(O(n)\) to \(O(n^2)\) depending on whether neighborhoods/distances are materialized
  • Assumptions: \(n\) points; indexing structures (KD-tree/ball tree/grid) and data dimensionality strongly affect runtime