Clustering in High Dimensions: HDBSCAN
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a good first-attempt clustering algorithm for very high-dimensional data. It detects clusters of varying density and identifies outliers automatically. It improves on DBSCAN because it works better on datasets where clusters have different densities.
HDBSCAN Philosophy:
1. Clusters are recognised as dense patches of data in the feature space.
2. Cluster identification is equivalent to finding peaks in the probability density function (PDF); which peaks count as separate clusters is subjective.
A cluster may be defined as the samples in the volume enclosed by a contour line or surface (a connected component of a level set) of the probability density function of the underlying (and unknown) distribution from which our set is drawn [1]. Notice that there are multiple ways to bound a cluster, depending on the isolation and prominence of the peaks in the PDF.
3. We employ hierarchical clustering to pick the most prominent and isolated peaks.
The connected components of the level sets, taken over all levels, form a tree. An analogy: as the water level drops, islands emerge and then merge into larger land masses.
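The level-set picture can be made concrete with a small sketch. It assumes a known one-dimensional density (an equal mixture of two Gaussians, chosen purely for illustration; HDBSCAN only ever estimates the density from data) and counts the connected components of the level set at a high and a low level. The helper name `count_components` and all parameter values are illustrative assumptions, not part of any library.

```r
# Assumed toy density: equal mixture of two Gaussians centred at -1.5 and 1.5
grid <- seq(-5, 5, length.out = 1001)
f <- 0.5 * dnorm(grid, mean = -1.5, sd = 0.6) +
     0.5 * dnorm(grid, mean =  1.5, sd = 0.6)

# Count connected components of the level set {x : f(x) >= level}.
# On a 1D grid, a component is a run of consecutive points above the level.
count_components <- function(f, level) {
  runs <- rle(f >= level)
  sum(runs$values)
}

peak <- max(f)
levels_to_try <- c(high = 0.5 * peak, low = 0.05 * peak)
n_components <- sapply(levels_to_try, function(l) count_components(f, l))
print(n_components)
# high: 2 components (two separate islands), low: 1 component (merged)
```

At the higher level each peak is its own component; at the lower level the two merge into a single component, which plays the role of the parent node in the tree.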
Basic Workflow
0. import library
library(dbscan)   # provides hdbscan()
library(ggplot2)  # used for the plots below
1. data munging
head(dt)
X Y
<num> <num>
1: -0.4354521 -1.080102738
2: 0.3596130 -0.008365066
3: 0.3196329 0.830285436
4: 0.3796975 -0.012767551
5: -0.9018681 0.579905113
6: 0.4148725 -0.021580262
2. run HDBSCAN
nPrune <- 32  # the minimum cluster size
hdb <- hdbscan(dt, minPts = nPrune)
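Before plotting, it can help to look at what the fitted object contains. The sketch below is self-contained and uses a hypothetical synthetic dataset (`toy`, two Gaussian blobs plus uniform background noise) as a stand-in for `dt`; the field names `cluster`, `membership_prob` and `outlier_scores` are those documented for the object returned by `dbscan::hdbscan()`.

```r
library(dbscan)

set.seed(42)
# Hypothetical stand-in for `dt`: two dense blobs plus sparse background noise
toy <- rbind(
  cbind(rnorm(200, -2, 0.3), rnorm(200, 0, 0.3)),
  cbind(rnorm(200,  2, 0.3), rnorm(200, 0, 0.3)),
  cbind(runif(40, -4, 4),    runif(40, -2, 2))
)
colnames(toy) <- c("X", "Y")

toy_hdb <- hdbscan(toy, minPts = 32)

# Cluster sizes; label 0 collects the points treated as noise
print(table(toy_hdb$cluster))

# Per-point strength of membership in the assigned cluster (between 0 and 1)
print(summary(toy_hdb$membership_prob))

# Per-point outlier scores; larger values indicate more outlying points
print(head(toy_hdb$outlier_scores))
```

The same three calls applied to `hdb` give a quick sanity check on whether the chosen `minPts` yields a sensible number of clusters and amount of noise.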
3. done?
# Plot the results
df2 <- data.frame(dt, cluster = as.factor(hdb$cluster))
ggplot(df2, aes(x = X, y = Y, color = cluster)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "HDBSCAN Clustering in R",
color = "Cluster") +
scale_color_viridis_d()
Outliers are always assigned to cluster 0, so we can color them differently:
df2$cluster <- as.factor(df2$cluster)  # Convert to factor for ggplot
# Rename the level itself; assigning a new string into a factor would only produce NA
levels(df2$cluster)[levels(df2$cluster) == "0"] <- "Outlier"  # Rename outliers
ggplot(df2, aes(x = X, y = Y, color = cluster)) +
geom_point(size = 2) +
theme_minimal() +
labs(title = "HDBSCAN Clustering in R", color = "Cluster") +
scale_color_manual(values = c("black", rainbow(length(unique(df2$cluster)) - 1)))
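The hierarchy behind the flat cluster labels can also be inspected directly. The sketch below is self-contained and again uses a hypothetical synthetic dataset rather than `dt`; it assumes the `hc` field (an `hclust` object) and the `plot()` method with the `show_flat` argument, as described in the dbscan package documentation for HDBSCAN objects.

```r
library(dbscan)

set.seed(42)
# Hypothetical toy data (not the `dt` used above): two blobs plus background noise
toy <- rbind(
  cbind(rnorm(200, -2, 0.3), rnorm(200, 0, 0.3)),
  cbind(rnorm(200,  2, 0.3), rnorm(200, 0, 0.3)),
  cbind(runif(40, -4, 4),    runif(40, -2, 2))
)
toy_hdb <- hdbscan(toy, minPts = 32)

# Full hierarchy as an ordinary dendrogram
plot(toy_hdb$hc, labels = FALSE, main = "HDBSCAN* hierarchy")

# Simplified tree; show_flat = TRUE marks the clusters chosen for the flat labelling
plot(toy_hdb, show_flat = TRUE)
```

Reading the simplified tree from top to bottom corresponds to the water-level analogy: wide branches that persist over a long range of levels are the prominent, isolated peaks.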
Footnotes
[1] The Tutte Institute for Mathematics and Computing (TIMC) (2019, Feb 1). HDBSCAN, Fast Density Based Clustering, the How and the Why | John Healy | PyData New York 2018. [Video]. YouTube. https://www.youtube.com/watch?v=dGsxd67IFiU&t=973s&ab_channel=PyData