Practical DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an intuitive clustering technique that groups points based on density. It identifies clusters as connected high-density regions while marking low-density points as noise. Unlike K-means, DBSCAN does not require specifying the number of clusters beforehand [1].
Typical Workflow in R
1. Import DBSCAN library
library(dbscan)
2. Standardise the data (optional, depending on the goal)
Since DBSCAN is distance-based, it’s useful to standardise the data across dimensions:
# Centre and scale each feature (mean 0, standard deviation 1)
dt_scaled <- scale(dt)
## View the mean before scaling
# attr(dt_scaled, "scaled:center")
## Check additional attributes
# attributes(dt_scaled)
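As a minimal sketch of what standardising does, the example below applies scale() to a small made-up matrix (the name toy and its two columns are hypothetical, not part of the dataset used here) whose features sit on very different scales. Without scaling, the larger-valued feature would dominate the Euclidean distances that DBSCAN relies on.

```r
# Hypothetical toy data: two features on very different scales
set.seed(42)
toy <- cbind(
  height_cm = rnorm(100, mean = 170, sd = 10),
  income    = rnorm(100, mean = 50000, sd = 15000)
)

# Subtract each column's mean and divide by its standard deviation
toy_scaled <- scale(toy)

# After scaling, each column has mean ~0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)

# The original means and standard deviations are kept as attributes
attr(toy_scaled, "scaled:center")
attr(toy_scaled, "scaled:scale")
```

The stored attributes make it possible to map cluster centres or thresholds back to the original units later.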
3. Initial choice of epsilon (\(\epsilon\)) and minPts
DBSCAN uses two key parameters:
- epsilon (\(\epsilon\)): the radius around a point within which other points are considered its neighbours.
- minPts (\(k\) or minPts): the minimum number of points within \(\epsilon\) required for a point to be a core point, i.e. a point in the interior of a cluster, as opposed to a non-core point, which lies on the edge of a cluster.
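To make the two parameters concrete, the sketch below labels points as core, border or noise by hand using only base R. The points, eps = 0.3 and minPts = 5 are made-up values for illustration, and the neighbour count is assumed to include the point itself, which is how the dbscan package documents minPts.

```r
# Hypothetical toy data: a tight group of five points, one point on its
# edge, and one point far away
pts <- rbind(
  c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.05, 0.05),
  c(0.3, 0.05),
  c(5, 5)
)
eps    <- 0.3  # assumed radius
minPts <- 5    # assumed density threshold (counting the point itself)

# Pairwise Euclidean distances
d <- as.matrix(dist(pts))

# Number of points within eps of each point, including itself
n_neighbours <- rowSums(d <= eps)

# Core points have at least minPts points in their eps-neighbourhood
is_core <- n_neighbours >= minPts

# Border points are not core but lie within eps of at least one core point
near_core <- apply(d[, is_core, drop = FALSE] <= eps, 1, any)
is_border <- !is_core & near_core

# Everything else is noise
point_type <- ifelse(is_core, "core", ifelse(is_border, "border", "noise"))
data.frame(x = pts[, 1], y = pts[, 2], n_neighbours, point_type)
```

This only illustrates the point types; the full algorithm additionally links core points that are within \(\epsilon\) of each other into clusters.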
To choose the input parameters, plot the k-nearest-neighbour distances and look for an “elbow” where the distance rises sharply. The neighbourhood size k needs to be large enough to include a neighbour across the “gap”:
kNNdistplot(as.matrix(dt_scaled), k = 24) # k = 24 matches minPts below; k = 4 to 6 is a common starting range
abline(h = 0.34, col = "red", lty = 2) # Candidate eps, read off at the elbow
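The curve in that plot can also be inspected numerically. The sketch below assumes a recent version of the dbscan package, where kNNdist(x, k) returns one distance per point, and uses made-up data (toy) and an arbitrary k = 5. It is a rough aid for bracketing the elbow, not a rule for picking \(\epsilon\).

```r
library(dbscan)

# Hypothetical toy data: two dense blobs plus a few scattered points
set.seed(1)
toy <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.3), ncol = 2),
  matrix(runif(20, min = -2, max = 6), ncol = 2)
)

k <- 5  # assumed neighbourhood size

# Distance from each point to its k-th nearest neighbour, sorted ascending
knn_d <- sort(kNNdist(toy, k = k))

# The same kind of curve that kNNdistplot() draws
plot(knn_d, type = "l",
     xlab = "Points sorted by distance",
     ylab = paste0(k, "-NN distance"))

# Upper quantiles help bracket where the curve starts to rise sharply
quantile(knn_d, probs = c(0.5, 0.9, 0.95, 0.99))
```

Points in dense regions sit on the flat part of the curve; the scattered points produce the steep tail, and a candidate \(\epsilon\) lies near where the two meet.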
4. Perform DBSCAN clustering
Once \(\epsilon\) and minPts are chosen, apply DBSCAN:
db <- dbscan(dt_scaled, eps = 0.34, minPts = 24) # Example parameters
print(db)
DBSCAN clustering for 910 objects.
Parameters: eps = 0.34, minPts = 24
Using euclidean distances and borderpoints = TRUE
The clustering contains 2 cluster(s) and 6 noise points.
  0   1   2
  6 739 165
Available fields: cluster, eps, minPts, metric, borderPoints
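The summary above is built from the cluster field, where label 0 marks noise. As a minimal sketch on made-up data (toy, eps = 0.5 and minPts = 5 are illustrative assumptions, not the values used in this workflow), the labels can be tabulated and the noise points located directly:

```r
library(dbscan)

# Hypothetical toy data: two well-separated blobs and one isolated point
set.seed(123)
toy <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),  # blob A, rows 1-100
  matrix(rnorm(200, mean = 4, sd = 0.3), ncol = 2),  # blob B, rows 101-200
  c(2, 2)                                            # isolated point, row 201
)

db_toy <- dbscan(toy, eps = 0.5, minPts = 5)

# Cluster labels: 0 = noise; 1, 2, ... = clusters
table(db_toy$cluster)

# Row indices of the points flagged as noise
which(db_toy$cluster == 0)
```

If the table shows most points in cluster 0, \(\epsilon\) is probably too small or minPts too large for the data; a single giant cluster suggests the opposite.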
5. Visualising Clusters
A plot can show at most a 3-D projection of the data, so for higher-dimensional data direct visualisation is generally not possible and the clustering must be assessed on the raw data. For two-dimensional data such as this example, the cluster labels can be appended to the dataset and plotted:
library(data.table)
library(ggplot2)

# Append the cluster labels as a factor (label 0 marks noise points)
dt <- dt[, cluster := factor(db$cluster)]

ggplot(dt, aes(x, y, color = cluster)) +
  geom_point() +
  theme_minimal()
head(dt)
x y cluster
<num> <num> <fctr>
1: -13.677425 -14.87027 1
2: -13.838819 -14.88740 1
3: -12.282123 -14.24483 1
4: -11.120282 -14.68589 1
5: -10.257439 -14.33073 1
6: -9.563675 -15.57450 1
6. Compute Silhouette Scores
library(cluster) # silhouette()
library(viridis) # viridis() colour palette

# Convert the cluster assignment from factor back to numeric
dt[, cluster := as.numeric(as.character(cluster))]

# Exclude noise points (cluster 0)
dt <- dt[cluster != 0]

# Compute silhouette widths from pairwise Euclidean distances
sil_score <- silhouette(as.numeric(dt$cluster), dist(dt[, .(x, y)]))

# Number of clusters
nClust <- length(unique(dt$cluster))

# Plot silhouette
plot(sil_score, col = viridis(nClust), border = NA)
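Alongside the plot, a single summary number is often useful. For each point, the silhouette width is \((b - a) / \max(a, b)\), where \(a\) is its mean distance to the other points in its own cluster and \(b\) is its mean distance to the points in the nearest other cluster. The sketch below computes the average width on made-up data with known labels (toy and labels are hypothetical); values near 1 indicate compact, well-separated clusters.

```r
library(cluster)

# Hypothetical toy data: two well-separated blobs with known labels
set.seed(7)
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2)
)
labels <- rep(1:2, each = 50)

# One row per point, with columns cluster, neighbor and sil_width
sil <- silhouette(labels, dist(toy))

# Average silhouette width over all points
avg_width <- mean(sil[, "sil_width"])
avg_width

# Average width per cluster
tapply(sil[, "sil_width"], sil[, "cluster"], mean)
```

Silhouette widths favour compact, roughly convex clusters, so a low score for an elongated or irregular DBSCAN cluster does not necessarily mean the clustering is poor.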
Footnotes
[1] Watch: Starmer, J. [StatQuest]. (2022, Jan 10). Clustering with DBSCAN, Clearly Explained!!! [Video]. YouTube. https://youtu.be/RDZUdRSDOok?si=zvwllek-o1nIeEGd