4.3 - DBSCAN

Motivation

  • Recall using k-means to identify 5 clusters in the multishapes data:
  • Non-convex clusters present an issue.
  • The problem with k-means/PAM/agglomerative clustering under most linkages (single linkage being the possible exception!) is the implicit definition of a cluster: a set of points that are similar to one another in \(p\)-dimensional space.

multiple different shaped clusters with kmeans results

A different definition of a cluster

Methods like DBSCAN redefine what it means to be a cluster:

multishapes data under a density-based definition of clusters (outliers shown in blue)

Clusters are dense regions in \(p\)-dimensional space, separated by regions of lower point density.

With this definition, you might not even belong to a cluster at all! (See the blue points).

The idea behind DBSCAN

  • DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise
  • Each point is determined to be one of the following:
    • A core point in a cluster
    • A border point of a cluster
    • An outlier
  • These definitions are governed by two parameters:
    • \(\epsilon\)
    • MinPts

Definitions

  • A neighbor of a point is one within distance \(\epsilon\) of that point
  • A point is a core point if it has at least MinPts neighbors
  • A point is a border point if it does not have at least MinPts neighbors, but does have at least one core point as a neighbor
  • A point is an outlier if it is neither a core point nor a border point

Core point if MinPts = 5

Border point if MinPts = 5

Outlier point
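These definitions are straightforward to compute directly. The following is a minimal sketch (not the dbscan package's implementation) that classifies the points of a small simulated data set as core, border, or outlier; the simulated data and the values of \(\epsilon\) and MinPts are purely illustrative.

# Sketch: classify points as core / border / outlier by hand
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)   # 20 illustrative points in 2 dimensions
eps    <- 0.8
MinPts <- 5

d <- as.matrix(dist(toy))            # pairwise Euclidean distances
# Count neighbors within eps; the point itself is counted, consistent with the
# (k-1)st-nearest-neighbor convention used later when choosing epsilon
n_nbrs <- rowSums(d <= eps)

is_core <- n_nbrs >= MinPts
# Border: not a core point, but has at least one core point among its neighbors
is_border <- !is_core &
  sapply(seq_len(nrow(toy)), function(i) any(is_core[d[i, ] <= eps]))

status <- ifelse(is_core, "core", ifelse(is_border, "border", "outlier"))
table(status)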

Density reachable

  • Two points \(A\) and \(B\) are density reachable if at least one is a core point, and you can travel from \(A\) to \(B\) along a series of core points within \(\epsilon\) of each other

  • In this plot, the black circle \(A\) is a border point, and the orange triangle \(B\) is a core point.
  • These points are density reachable since \(A\) can travel to \(B\) using a series of orange core points.

Density connected

  • Two points \(A\) and \(B\) are density connected if there is a core point \(C\) that is density reachable from both \(A\) and \(B\)

  • In this plot, \(A\) and \(B\) (black circles) are two border points.
  • \(C\) is a core point that is density reachable from both.
  • Thus \(A\) and \(B\) are density connected.

DBSCAN cluster definition

  • A cluster is a collection of density connected points.
  • Not every point is density connected to others! Such points are outliers, i.e. points not belonging to any cluster.
  • The definition of “connected” depends both on MinPts and \(\epsilon\)

Impact of minPts

  • Can think of minPts as “minimum expected size of a cluster”: can incorporate domain expertise here
  • For lower-dimensional data, minPts of 4 or 5 is typical default
  • Higher minPts implies points must be more dense to be considered core points \(\Rightarrow\) fewer clusters, more outliers
  • minPts = 1 \(\Rightarrow\) a cluster could have just a single point

Impact of \(\epsilon\)

  • Smaller \(\epsilon \Rightarrow\) more disconnected clusters
  • Larger \(\epsilon \Rightarrow\) distinct points could be merged, noise points go to clusters

Various \(\epsilon\) and minPts

  • The choices of minPts and \(\epsilon\) together determine the number of clusters and which points are labeled outliers (see the sketch below):
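As a rough illustration, the sketch below tabulates how the number of clusters and the number of outliers change over a small (arbitrarily chosen) grid of \(\epsilon\) and minPts values for the multishapes data; it assumes the dbscan and factoextra packages (the latter supplies the data).

library(dbscan)
library(factoextra)                  # provides the multishapes data set
data("multishapes")

params <- expand.grid(eps = c(0.10, 0.15, 0.25), minPts = c(5, 10))
params$n_clusters <- NA
params$n_outliers <- NA
for (i in seq_len(nrow(params))) {
  fit <- dbscan(multishapes[, 1:2], eps = params$eps[i], minPts = params$minPts[i])
  params$n_clusters[i] <- length(setdiff(unique(fit$cluster), 0))   # cluster 0 = outliers
  params$n_outliers[i] <- sum(fit$cluster == 0)
}
params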

Choosing \(\epsilon\)

  • It’s typical to fix minPts (default in dbscan is 5) and choose \(\epsilon\)
  • What’s a good \(\epsilon\)?
  • The k-nearest-neighbor plot can help
  • Given minPts (say \(k=5\)), calculate the distance of each point to its \((k-1)^{st}\) nearest neighbor
  • Plot in increasing order, look for an elbow in the plot, where the distance to the \((k-1)^{st}\) nearest neighbor starts to increase sharply
  • In practice, may need to iterate to see what clustering results “look good” (subjective!)

Execution in dbscan

  • The dbscan library includes both the dbscan() and kNNdistplot() functions
  • Implementing dbscan for the multishapes data set
library(dbscan)
library(factoextra)   # the multishapes data set ships with factoextra
data("multishapes")
kNNdistplot(multishapes[,1:2], minPts = 5)
abline(h = 0.15)
k nearest neighbor distance plot
  • Here we see an elbow at \(\approx 0.15\) (horizontal line added with abline(h = 0.15))

Result

Implementing dbscan with eps = 0.15:

do_dbscan <- dbscan(multishapes[,1:2], eps = 0.15, minPts = 5)

# Add cluster assignments to the data set:
multishapes$dbcluster <- factor(do_dbscan$cluster)

ggplot(data = multishapes, aes(x = x, y = y, color=dbcluster)) +
  geom_point() + 
  labs(color='Cluster') + 
  theme_classic(base_size = 16)

multishapes data clustered by dbscan
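To see how many points landed in each cluster, and how many were flagged as outliers (dbscan labels these as cluster 0), a quick tabulation works:

table(do_dbscan$cluster)   # cluster 0 collects the outlier points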

DBSCANning the drugs data set

  • Let’s see what DBSCAN does for our case study data set:
library(tidyverse)   # for %>% and column_to_rownames()
drugs <- read.csv('Data/IllicitDrug.csv') %>% 
  column_to_rownames('State')
drugs_scaled <- scale(drugs)
head(drugs_scaled)
              DrugUse BingeDrink    Poverty     HSdrop     Income
Alabama    -0.9150748 -1.2127521  0.6098588  0.9105382 -1.0569412
Alaska      3.0829126  0.4407902 -0.8767792  0.2063128  0.2459105
Arizona     0.6994970 -0.5040911  1.2220039  1.6561886 -0.5053843
Arkansas   -1.2994966 -0.7403114  0.6973081  0.4134379 -1.2513061
California  1.3145720 -0.7796815  0.8722067  1.5733385  0.5486712
Colorado    1.5452251  0.4407902 -0.9350788 -0.2493624  0.9829551

Choosing \(\epsilon\) and scanning

kNNdistplot(drugs_scaled, minPts = 5)
abline(h = 2.65)
k nearest neighbor distance plot for drug use data
  • The k-nearest-neighbor distances really take off around a distance of \(\approx 2.65\)
  • Implementing DBSCAN with this value:
drug_dbscan <- dbscan(drugs_scaled, eps = 2.65, minPts = 5)

PCA biplot with DBSCAN clusters

drug_pca <- prcomp(drugs, center=TRUE, scale. = TRUE)
fviz_pca(drug_pca, 
         habillage = factor(drug_dbscan$cluster),
         repel = TRUE) + 
      ggtitle('DBSCAN cluster results') +
      guides(shape='none') + 
      labs(color='Cluster',shape='Cluster')
  • DBSCAN put everything into one cluster, with a single outlier point!
  • Not great: our \(\epsilon\) value is too high.

PCA biplot with DBSCAN cluster assignments
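A quick tabulation of the cluster labels is an easy way to verify this (again, cluster 0 collects the outliers):

table(drug_dbscan$cluster)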

Second attempt

  • There are several “knees” in this plot. One appears around \(\epsilon \approx 1.3\):
kNNdistplot(drugs_scaled, minPts = 5)
abline(h = 1.3)
k nearest neighbor distance plot for drug use data
  • Implementing DBSCAN with this value:
drug_dbscan <- dbscan(drugs_scaled,eps = 1.3, minPts = 5)

PCA biplot: 2nd attempt

Results with \(\epsilon = 1.3\), minPts = 5:

fviz_pca(drug_pca, 
         habillage = factor(drug_dbscan$cluster),
         repel = TRUE) + 
      ggtitle('DBSCAN cluster results') +
      guides(shape='none') + 
      labs(color='Cluster',shape='Cluster')
  • A second cluster is created
  • Many points are now labeled “outliers,” as there isn’t sufficient density of points in \(p\) dimensions to create a third cluster
  • We could try decreasing \(\epsilon\) further (and I tried), but this just pushes more points to “outlier” status.

PCA biplot with DBSCAN cluster assignments
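To see this numerically, a small sketch like the following (the particular \(\epsilon\) values are arbitrary) tracks how the cluster and outlier counts change as \(\epsilon\) shrinks:

sapply(c(1.3, 1.1, 0.9), function(e) {
  fit <- dbscan(drugs_scaled, eps = e, minPts = 5)
  c(eps = e,
    clusters = length(setdiff(unique(fit$cluster), 0)),   # cluster 0 = outliers
    outliers = sum(fit$cluster == 0))
})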

New example: olive oils

  • Recall the olive oil data set:
oils <- read.csv('Data/oliveoil.csv')
head(oils)
  Oil_ID   Region         Area Palmitic Palmitoleic Strearic Oleic Linoleic
1      1 Southern North-Apulia     1075          75      226  7823      672
2      2 Southern North-Apulia     1088          73      224  7709      781
3      3 Southern North-Apulia      911          54      246  8113      549
4      4 Southern North-Apulia      966          57      240  7952      619
5      5 Southern North-Apulia     1051          67      259  7771      672
6      6 Southern North-Apulia      911          49      268  7924      678
  Linolenic Eicosanoic Eicosenoic
1        36         60         29
2        31         61         29
3        31         63         29
4        50         78         35
5        50         80         46
6        51         70         44
  • Here, we have an important labeling column: Area
  • We’ve seen in the NMDS notes that the oil characteristics are quite distinct for different areas
  • How do our clustering methods organize these oils?
    • k-means
    • DBSCAN

Scaling and PCA

oils_scaled <- scale(oils %>% select(Palmitic:Eicosenoic))
pca <- prcomp(oils_scaled)

fviz_pca(pca, 
         habillage = factor(oils$Area),
         label='var',
         repel = TRUE) + 
      ggtitle('PCA of olive oil data color coded by area') +
      labs(color='Area',shape='Area')
PCA biplot with color coding by area

K-means: choosing \(k\)

# The default method is 'silhouette'; the second call uses 'wss'. The two
# ggplots are combined side by side via "+" (e.g. with the patchwork package).
fviz_nbclust(oils_scaled, 
             FUNcluster = kmeans) + 
fviz_nbclust(oils_scaled, 
             FUNcluster = kmeans,
             method='wss')
Plots for selecting k for k-means clustering
  • Here both the average silhouette and WSS plots would agree: 5 clusters appear best
kmeans_best <- kmeans(oils_scaled, centers = 5, nstart = 10)

K-means 5-cluster solution

fviz_pca(pca, 
         habillage = factor(oils$Area),
         label='var',
         repel = TRUE) + 
      ggtitle('PCA of olive oil data color coded by region') +
      labs(color='Area',shape='Area') + 
  
fviz_pca(pca, 
         habillage = factor(kmeans_best$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('PCA of olive oil data color coded by K-means cluster') +
      labs(color='Cluster',shape='Cluster')

PCA biplots with area and k-means clustering indicated

  • Clustered together:
    • Coastal- and Inland Sardinia
    • North-Apulia, Sicily, Calabria
    • Umbria and East Liguria
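These groupings can be read off directly by cross-tabulating the growing areas against the k-means cluster labels:

table(oils$Area, kmeans_best$cluster)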

DBSCAN: choosing \(\epsilon\)

  • For minPts = 5, a couple of candidate \(\epsilon\) values:
kNNdistplot(oils_scaled, minPts = 5)
abline(h = 1.2)
abline(h = 1.7)
knn plot to choose epsilon

Implementation:

oil_dbscan1 <- dbscan(oils_scaled, eps = 1.2)   # minPts defaults to 5
oil_dbscan2 <- dbscan(oils_scaled, eps = 1.7)

DBSCAN results

fviz_pca(pca, 
         habillage = factor(oil_dbscan1$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN clustering results, epsilon = 1.2') +
      labs(color='Area',shape='Area') + 
  
fviz_pca(pca, 
         habillage = factor(oil_dbscan2$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN clustering results, epsilon = 1.7') +
      labs(color='Area',shape='Area')  
PCA biplots with DBSCAN clustering indicated
  • The \(\epsilon = 1.2\) clustering certainly seems more informative, creating 6 clusters plus outliers, although Cluster 6 is very small
  • Perhaps consider increasing minPts?
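Comparing the cluster sizes under the two \(\epsilon\) values makes the trade-off concrete (cluster 0 is the outlier group):

list(eps_1.2 = table(oil_dbscan1$cluster),
     eps_1.7 = table(oil_dbscan2$cluster))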

Increasing minPts to 10

  • For minPts = 10, a couple of candidate \(\epsilon\) values:
kNNdistplot(oils_scaled, minPts = 10)
abline(h = 1.2)
abline(h = 1.4)
abline(h = 1.8)
knn plot to choose epsilon

\(\epsilon\) results when minPts = 10

PCA biplots with DBSCAN clustering indicated
  • By eye, both \(\epsilon = 1.4\) and certainly \(\epsilon = 1.8\) create clusters that are too large
  • The \(\epsilon = 1.2\) solution seems best, although it does appear to have a lot of outliers very close to the gold Cluster 1 (but remember this is just a 2D representation of 8D data).

Code for previous slide

oil_dbscan3 <- dbscan(oils_scaled, eps = 1.2, minPts = 10)
oil_dbscan4 <- dbscan(oils_scaled, eps = 1.4, minPts = 10)
oil_dbscan5 <- dbscan(oils_scaled, eps = 1.8, minPts = 10)

fviz_pca(pca, 
         habillage = factor(oil_dbscan3$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN, eps = 1.2, minPts= 10') +
      labs(color='Cluster',shape='Cluster') + 
  
  
fviz_pca(pca, 
         habillage = factor(oil_dbscan4$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN, eps = 1.4, minPts= 10') +
      labs(color='Cluster',shape='Cluster') +
  
fviz_pca(pca, 
         habillage = factor(oil_dbscan5$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN, eps = 1.8, minPts= 10') +
      labs(color='Cluster',shape='Cluster')  

Area, kmeans, and DBSCAN

PCA biplots with area and k-means and DBSCAN clustering indicated
  • Similarities between k-means and DBSCAN:
    • Coastal and Inland Sardinia form a cluster
    • Umbria and East Liguria form a cluster
    • West Liguria forms a cluster
  • Differences between k-means and DBSCAN:
    • Calabria is clustered together with Sicily and North Apulia for k-means, but with South Apulia for DBSCAN
    • The Sicily/North Apulia k-means cluster is essentially a cluster + outliers for DBSCAN
    • I do think DBSCAN correctly identifies poorly-clustering points as outliers (one could consider filtering the k-means results to points with large silhouette scores; a sketch of this idea follows below)
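As a sketch of that last idea (it assumes the cluster package is available, and the 0.25 cutoff is an arbitrary illustrative choice), silhouette widths for the k-means solution can flag oils that cluster poorly, playing a role loosely analogous to DBSCAN's outliers:

library(cluster)                                    # for silhouette()
sil <- silhouette(kmeans_best$cluster, dist(oils_scaled))
summary(sil)

# Oils with small silhouette widths fit their k-means cluster poorly
poor_fit <- which(sil[, "sil_width"] < 0.25)
oils[poor_fit, c("Region", "Area")]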