Practical DBSCAN

Author

David McCabe

Published

September 27, 2024

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an intuitive clustering technique that groups points based on density. It identifies clusters as connected high-density regions while marking low-density points as noise. Unlike K-means, DBSCAN does not require specifying the number of clusters beforehand.¹

Typical Workflow in R

1. Import DBSCAN library

library(dbscan)

2. Standardise the data (optional - intention dependent)

Since DBSCAN is distance-based, it’s useful to standardise the data across dimensions:

# Centers and scales each feature
dt_scaled <- scale(dt)

## View the mean before scaling
# attr(dt_scaled, "scaled:center")

## Check additional attributes
# attributes(dt_scaled)

3. Initial choice of epsilon (\(\epsilon\)) and minPts

DBSCAN uses two key parameters:

epsilon (\(\epsilon\)) - the radial distance around a point within which points are considered proximate (The radius within which points are considered neighbors.)
minPts (\(k\) or minPts) - tThe minimum number of points required to form a dense region core point i.e. a point on the interior of a cluster as opposed to a non-core point which lies on the edge of a cluster

To determine optimal input parameters - plot the k-nearest neighbor distances (we need to choose a nearest neighbour size which includes a neighbor over the “gap” to capture the) and look for an “elbow” where the distance rises sharply:

kNNdistplot(as.matrix(dt_scaled), k = 24)  # Commonly, k = 4 to 6
abline(h = 0.34, col = "red", lty = 2)  # Choose an appropriate threshold

4. Perform DBSCAN clustering

After \(\epsilon\) and minPts are chosen, we apply DBSCAN:

db <- dbscan(dt_scaled, eps = 0.34, minPts = 24)  # Example parameters
print(db)

DBSCAN clustering for 910 objects.
Parameters: eps = 0.34, minPts = 24
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 6 noise points.

  0   1   2 
  6 739 165 

Available fields: cluster, eps, minPts, metric, borderPoints

5. Visualising Clusters

cluster labels can be appended to the dataset for visualisation, obviously we can only see a 3-D projection of higher dimensional data so this is generally not possible and we must approach the clustering exercise an exercise on raw data:

dt <- dt[,
  cluster := factor(db$cluster)
]

ggplot(dt, aes(x, y,  color = cluster)) +
    geom_point() +
    theme_minimal()

head(dt)

            x         y cluster
        <num>     <num>  <fctr>
1: -13.677425 -14.87027       1
2: -13.838819 -14.88740       1
3: -12.282123 -14.24483       1
4: -11.120282 -14.68589       1
5: -10.257439 -14.33073       1
6:  -9.563675 -15.57450       1

6. Compute Silhouette Scores

# convert cluster assignment to numeric
dt[, cluster := as.numeric(as.character(cluster))]

# exclude outliers
dt <- dt[cluster != 0]

# Compute silhouette score
sil_score <- silhouette(as.numeric(dt$cluster), dist(dt[,.(x,y)]))

# no. cluseters
nClust <- length(unique(dt$cluster))

# Plot silhouette
plot(sil_score, col = viridis(nClust),  border = NA)

Footnotes

Watch: Starmer, J. [StatQuest]. (2022, Jan 10). Clustering with DBSCAN, Clearly Explained!!! [Video]. YouTube. https://youtu.be/RDZUdRSDOok?si=zvwllek-o1nIeEGd↩︎