Clustering in High Dimensions: HDBSCAN

Author

D.McCabe

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a good first-attempt clustering algorithm for very high-dimensional data. It detects clusters of varying density and identifies outliers automatically, which makes it an improvement over DBSCAN on datasets where clusters have different densities.

HDBSCAN Philosophy:

1. Clusters are recognised as dense patches of data in the feature space.

2. Cluster identification is equivalent to finding peaks in the probability density function (PDF); which peaks count as separate clusters is subjective.

A cluster may be defined as the samples in the volume enclosed by a contour line/surface (a connected component of a level set) of the probability density function of the underlying (and unknown) distribution from which our data set is drawn ¹. Notice that there are multiple ways to bound a cluster, depending on the isolation and prominence of the peaks in the PDF.

https://en.wikipedia.org/wiki/Topographic_prominence

3. We employ hierarchical clustering to pick the most prominent/isolated peaks.

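The connected components of the level sets, taken over all density levels, form a tree. An analogy: treat the PDF as a landscape and let the water level drop. Peaks first appear as separate islands, then grow and merge as the water recedes.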
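The island analogy can be sketched in base R. This is only an illustration under stated assumptions: the two-mode sample `x` is made up, the density comes from a plain kernel density estimate (`density()`), and `count_islands` is a hypothetical helper. HDBSCAN itself does not build its hierarchy this way; the sketch only shows how the number of connected components of a level set changes as the level drops.

```r
# Toy 1-D sample with two modes (assumed data, not from the article)
set.seed(1)
x <- c(rnorm(300, mean = -2, sd = 0.5), rnorm(300, mean = 2, sd = 0.5))

# Kernel density estimate evaluated on a regular grid
kde <- density(x, n = 512)

# Hypothetical helper: count the connected components ("islands") of the
# level set {f >= level}. On a 1-D grid these are the maximal runs of
# consecutive grid points whose density is at or above the level.
count_islands <- function(f, level) {
  runs <- rle(f >= level)
  sum(runs$values)
}

# Lower the "water level" from near the highest peak down towards zero
levels <- max(kde$y) * c(0.9, 0.5, 0.01)
islands <- sapply(levels, count_islands, f = kde$y)

data.frame(level = round(levels, 4), islands = islands)
```

With two well-separated modes, one would expect two islands at intermediate levels and a single island once the level falls below the saddle between the peaks. Recording where islands merge as the level drops is what produces the tree.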

Basic Workflow

0. import library

library(dbscan)   # provides hdbscan()
library(ggplot2)  # used for the plots below

1. data munging

head(dt)

            X            Y
        <num>        <num>
1: -0.4354521 -1.080102738
2:  0.3596130 -0.008365066
3:  0.3196329  0.830285436
4:  0.3796975 -0.012767551
5: -0.9018681  0.579905113
6:  0.4148725 -0.021580262

2. run HDBSCAN

nPrune <- 32  # the minimum cluster size
hdb <- hdbscan(dt, minPts = nPrune)
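Before plotting, it can help to look at what the fitted object contains. The following is a minimal sketch, assuming the `dbscan` package is installed; the matrix `toy` is made-up stand-in data (two Gaussian blobs plus uniform background noise), not the `dt` used in this article, and `hdb_toy` is a name chosen for illustration.

```r
library(dbscan)

set.seed(42)
# Hypothetical stand-in data: two dense blobs plus sparse uniform noise
toy <- rbind(
  cbind(rnorm(200, -2, 0.3), rnorm(200, 0, 0.3)),
  cbind(rnorm(200,  2, 0.3), rnorm(200, 0, 0.3)),
  cbind(runif(40, -4, 4),    runif(40, -3, 3))
)
colnames(toy) <- c("X", "Y")

hdb_toy <- hdbscan(toy, minPts = 32)

# Cluster sizes; label 0 collects the points treated as noise
table(hdb_toy$cluster)

# Strength of each point's membership in its assigned cluster
summary(hdb_toy$membership_prob)

# Row indices of the points with the highest outlier scores
head(order(hdb_toy$outlier_scores, decreasing = TRUE))

# Simplified cluster tree, with the selected flat clusters highlighted
plot(hdb_toy, show_flat = TRUE)
```

The same calls should work on `hdb` from the step above. The tree plot is the most direct link back to the philosophy section: each branch corresponds to a candidate cluster, and the highlighted ones are those the algorithm kept.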

3. done?

# Plot the results
df2 <- data.frame(dt, cluster = as.factor(hdb$cluster))

ggplot(df2, aes(x = X, y = Y, color = cluster)) +
  geom_point(size = 2) +
  theme_minimal() +
  labs(title = "HDBSCAN Clustering in R",
       color = "Cluster") +
  scale_color_viridis_d()

Outliers (noise points) are always assigned to cluster 0, so we can color them differently:

df2$cluster <- as.character(df2$cluster)      # plain labels; a factor would reject the new level
df2$cluster[df2$cluster == "0"] <- "Outlier"  # rename outliers
df2$cluster <- factor(df2$cluster, levels = c("Outlier", sort(setdiff(df2$cluster, "Outlier"))))  # "Outlier" first, so it is drawn in black

ggplot(df2, aes(x = X, y = Y, color = cluster)) +
  geom_point(size = 2) +
  theme_minimal() +
  labs(title = "HDBSCAN Clustering in R", color = "Cluster") +
  scale_color_manual(values = c("black", rainbow(length(unique(df2$cluster)) - 1)))

Footnotes

The Tutte Institute for Mathematics and Computing (TIMC) (2019, Feb 1). HDBSCAN, Fast Density Based Clustering, the How and the Why | John Healy | PyData New York 2018. [Video]. YouTube. https://www.youtube.com/watch?v=dGsxd67IFiU&t=973s&ab_channel=PyData