4.3 - DBSCAN

Motivation

  • Recall using k-means to identify 5 clusters in the multishapes data:
  • Non-convex clusters present an issue.
  • The problem with k-means/PAM/agglomerative clustering under most linkages (single linkage being the possible exception!) is the implicit definition of a cluster: a set of points that are similar to one another in \(p\)-dimensional space.

multiple different shaped clusters with kmeans results

A different definition of a cluster

Methods like DBSCAN redefine what it means to be a cluster:

multishapes data under a density-based definition of clusters (outliers shown in blue)

Clusters are dense regions in \(p\)-dimensional space, separated by regions of lower point density.

With this definition, you might not even belong to a cluster at all! (See the blue points).

The idea behind DBSCAN

  • DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise
  • Each point is determined to be one of the following:
    • A core point in a cluster
    • A border point of a cluster
    • An outlier
  • These definitions are governed by two parameters:
    • \(\epsilon\)
    • MinPts

Definitions

  • A neighbor of a point is one within distance \(\epsilon\) of that point
  • A point is a core point if it has at least MinPts neighbors
  • A point is a border point if it does not have at least MinPts neighbors, but does have at least one core point as a neighbor
  • A point is an outlier if it is neither a core point nor a border point

Core point if MinPts = 5

Border point if MinPts = 5

Outlier point
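These definitions are straightforward to compute directly. The following is a minimal sketch (not the dbscan package's implementation) that classifies the points of a small simulated data set as core, border, or outlier; the simulated data and the values of \(\epsilon\) and MinPts are purely illustrative.

# Sketch: classify points as core / border / outlier by hand
set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)   # 20 illustrative points in 2 dimensions
eps    <- 0.8
MinPts <- 5

d <- as.matrix(dist(toy))            # pairwise Euclidean distances
# Count neighbors within eps; the point itself is counted, consistent with the
# (k-1)st-nearest-neighbor convention used later when choosing epsilon
n_nbrs <- rowSums(d <= eps)

is_core <- n_nbrs >= MinPts
# Border: not a core point, but has at least one core point among its neighbors
is_border <- !is_core &
  sapply(seq_len(nrow(toy)), function(i) any(is_core[d[i, ] <= eps]))

status <- ifelse(is_core, "core", ifelse(is_border, "border", "outlier"))
table(status)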

Density reachable

  • Two points \(A\) and \(B\) are density reachable if at least one is a core point, and you can travel from \(A\) to \(B\) along a series of core points within \(\epsilon\) of each other

  • In this plot, the black circle \(A\) is a border point, and the orange triangle \(B\) is a core point.
  • These points are density reachable since \(A\) can travel to \(B\) using a series of orange core points.

Density connected

  • Two points \(A\) and \(B\) are density connected if there is a core point \(C\) that is density reachable from both \(A\) and \(B\)

  • In this plot, \(A\) and \(B\) (black circles) are two border points.
  • \(C\) is a core point that is density reachable from both.
  • Thus \(A\) and \(B\) are density connected.

DBSCAN cluster definition

  • A cluster is a collection of density connected points.
  • Not every point is density connected to others! Such points are outliers, i.e. points not belonging to any cluster.
  • The definition of “connected” depends both on MinPts and \(\epsilon\)

Impact of minPts

  • Can think of minPts as “minimum expected size of a cluster”: can incorporate domain expertise here
  • For lower-dimensional data, minPts of 4 or 5 is typical default
  • Higher minPts implies points must be more dense to be considered core points \(\Rightarrow\) fewer clusters, more outliers
  • minPts = 1 \(\Rightarrow\) a cluster could have just a single point

Impact of \(\epsilon\)

  • Smaller \(\epsilon \Rightarrow\) more disconnected clusters
  • Larger \(\epsilon \Rightarrow\) distinct points could be merged, noise points go to clusters

Various \(\epsilon\) and minPts

  • The choices of minPts and \(\epsilon\) together determine the number of clusters and which points are labeled outliers (see the sketch below):
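As a rough illustration, the sketch below tabulates how the number of clusters and the number of outliers change over a small (arbitrarily chosen) grid of \(\epsilon\) and minPts values for the multishapes data; it assumes the dbscan and factoextra packages (the latter supplies the data).

library(dbscan)
library(factoextra)                  # provides the multishapes data set
data("multishapes")

params <- expand.grid(eps = c(0.10, 0.15, 0.25), minPts = c(5, 10))
params$n_clusters <- NA
params$n_outliers <- NA
for (i in seq_len(nrow(params))) {
  fit <- dbscan(multishapes[, 1:2], eps = params$eps[i], minPts = params$minPts[i])
  params$n_clusters[i] <- length(setdiff(unique(fit$cluster), 0))   # cluster 0 = outliers
  params$n_outliers[i] <- sum(fit$cluster == 0)
}
params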

Choosing \(\epsilon\)

  • It’s typical to fix minPts (default in dbscan is 5) and choose \(\epsilon\)
  • What’s a good \(\epsilon\)?
  • The k-nearest-neighbor plot can help
  • Given minPts (say \(k=5\)), calculate the distance of each point to its \((k-1)^{st}\) nearest neighbor
  • Plot in increasing order, look for an elbow in the plot, where the distance to the \((k-1)^{st}\) nearest neighbor starts to increase sharply
  • In practice, may need to iterate to see what clustering results “look good” (subjective!)

Execution in dbscan

  • The dbscan library includes both the dbscan() and kNNdistplot() functions
  • Implementing dbscan for the multishapes data set
library(dbscan)
library(factoextra)   # the multishapes data set ships with factoextra
data("multishapes")
kNNdistplot(multishapes[,1:2], minPts = 5)
abline(h = 0.15)
k nearest neighbor distance plot
  • Here we see an elbow at \(\approx 0.15\) (horizontal line added with abline(h = 0.15))

Result

Implementing dbscan with eps = 0.15:

do_dbscan <- dbscan(multishapes[,1:2], eps = 0.15, minPts = 5)

# Add cluster assignments to the data set:
multishapes$dbcluster <- factor(do_dbscan$cluster)

ggplot(data = multishapes, aes(x = x, y = y, color=dbcluster)) +
  geom_point() + 
  labs(color='Cluster') + 
  theme_classic(base_size = 16)

multishapes data clustered by dbscan
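To see how many points landed in each cluster, and how many were flagged as outliers (dbscan labels these as cluster 0), a quick tabulation works:

table(do_dbscan$cluster)   # cluster 0 collects the outlier points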

DBSCANning the drugs data set

  • Let’s see what DBSCAN does for our case study data set:
library(tidyverse)   # for %>% and column_to_rownames()
drugs <- read.csv('Data/IllicitDrug.csv') %>% 
  column_to_rownames('State')
drugs_scaled <- scale(drugs)
head(drugs_scaled)
              DrugUse BingeDrink    Poverty     HSdrop     Income
Alabama    -0.9150748 -1.2127521  0.6098588  0.9105382 -1.0569412
Alaska      3.0829126  0.4407902 -0.8767792  0.2063128  0.2459105
Arizona     0.6994970 -0.5040911  1.2220039  1.6561886 -0.5053843
Arkansas   -1.2994966 -0.7403114  0.6973081  0.4134379 -1.2513061
California  1.3145720 -0.7796815  0.8722067  1.5733385  0.5486712
Colorado    1.5452251  0.4407902 -0.9350788 -0.2493624  0.9829551

Choosing \(\epsilon\) and scanning

kNNdistplot(drugs_scaled, minPts = 5)
abline(h = 2.65)
k nearest neighbor distance plot for drug use data
  • The k-nearest-neighbor distances really take off around a distance of \(\approx 2.65\)
  • Implementing DBSCAN with this value:
drug_dbscan <- dbscan(drugs_scaled, eps = 2.65, minPts = 5)

PCA biplot with DBSCAN clusters

drug_pca <- prcomp(drugs, center=TRUE, scale. = TRUE)
fviz_pca(drug_pca, 
         habillage = factor(drug_dbscan$cluster),
         repel = TRUE) + 
      ggtitle('DBSCAN cluster results') +
      guides(shape='none') + 
      labs(color='Cluster',shape='Cluster')
  • DBSCAN put everything into one cluster, with a single outlier point!
  • Not great: our \(\epsilon\) value is too high.

PCA biplot with DBSCAN cluster assignments
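A quick tabulation of the cluster labels is an easy way to verify this (again, cluster 0 collects the outliers):

table(drug_dbscan$cluster)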

Second attempt

  • There are several “knees” in this plot. One appears around \(\epsilon \approx 1.3\):
kNNdistplot(drugs_scaled, minPts = 5)
abline(h = 1.3)
k nearest neighbor distance plot for drug use data
  • Implementing DBSCAN with this value:
drug_dbscan <- dbscan(drugs_scaled,eps = 1.3, minPts = 5)

PCA biplot: 2nd attempt

Results with \(\epsilon = 1.3\), minPts = 5:

fviz_pca(drug_pca, 
         habillage = factor(drug_dbscan$cluster),
         repel = TRUE) + 
      ggtitle('DBSCAN cluster results') +
      guides(shape='none') + 
      labs(color='Cluster',shape='Cluster')
  • A second cluster is created
  • Many points are now labeled “outliers,” as there isn’t sufficient density of points in \(p\) dimensions to create a third cluster
  • We could try decreasing \(\epsilon\) further (and I tried), but this just pushes more points to “outlier” status.

PCA biplot with DBSCAN cluster assignments
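To see this numerically, a small sketch like the following (the particular \(\epsilon\) values are arbitrary) tracks how the cluster and outlier counts change as \(\epsilon\) shrinks:

sapply(c(1.3, 1.1, 0.9), function(e) {
  fit <- dbscan(drugs_scaled, eps = e, minPts = 5)
  c(eps = e,
    clusters = length(setdiff(unique(fit$cluster), 0)),   # cluster 0 = outliers
    outliers = sum(fit$cluster == 0))
})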

New example: olive oils

  • Recall the olive oil data set:
oils <- read.csv('Data/oliveoil.csv')
head(oils)
  Oil_ID   Region         Area Palmitic Palmitoleic Strearic Oleic Linoleic
1      1 Southern North-Apulia     1075          75      226  7823      672
2      2 Southern North-Apulia     1088          73      224  7709      781
3      3 Southern North-Apulia      911          54      246  8113      549
4      4 Southern North-Apulia      966          57      240  7952      619
5      5 Southern North-Apulia     1051          67      259  7771      672
6      6 Southern North-Apulia      911          49      268  7924      678
  Linolenic Eicosanoic Eicosenoic
1        36         60         29
2        31         61         29
3        31         63         29
4        50         78         35
5        50         80         46
6        51         70         44
  • Here, we have an important labeling column: Area
  • We’ve seen in the NMDS notes that the oil characteristics are quite distinct for different areas
  • How do our clustering methods organize these oils?
    • k-means
    • DBSCAN

Scaling and PCA

oils_scaled <- scale(oils %>% select(Palmitic:Eicosenoic))
pca <- prcomp(oils_scaled)

fviz_pca(pca, 
         habillage = factor(oils$Area),
         label='var',
         repel = TRUE) + 
      ggtitle('PCA of olive oil data color coded by area') +
      labs(color='Area',shape='Area')
PCA biplot with color coding by area

K-means: choosing \(k\)

# The default method is 'silhouette'; the second call uses 'wss'. The two
# ggplots are combined side by side via "+" (e.g. with the patchwork package).
fviz_nbclust(oils_scaled, 
             FUNcluster = kmeans) + 
fviz_nbclust(oils_scaled, 
             FUNcluster = kmeans,
             method='wss')
Plots for selecting k for k-means clustering
  • Here both the average silhouette and WSS plots would agree: 5 clusters appear best
kmeans_best <- kmeans(oils_scaled, centers = 5, nstart = 10)

K-means 5-cluster solution

fviz_pca(pca, 
         habillage = factor(oils$Area),
         label='var',
         repel = TRUE) + 
      ggtitle('PCA of olive oil data color coded by region') +
      labs(color='Area',shape='Area') + 
  
fviz_pca(pca, 
         habillage = factor(kmeans_best$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('PCA of olive oil data color coded by K-means cluster') +
      labs(color='Cluster',shape='Cluster')

PCA biplots with area and k-means clustering indicated

  • Clustered together:
    • Coastal- and Inland Sardinia
    • North-Apulia, Sicily, Calabria
    • Umbria and East Liguria
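These groupings can be read off directly by cross-tabulating the growing areas against the k-means cluster labels:

table(oils$Area, kmeans_best$cluster)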

DBSCAN: choosing \(\epsilon\)

  • For minPts = 5, a couple of candidate \(\epsilon\) values:
kNNdistplot(oils_scaled, minPts = 5)
abline(h = 1.2)
abline(h = 1.7)
knn plot to choose epsilon

Implementation:

oil_dbscan1 <- dbscan(oils_scaled, eps = 1.2)   # minPts defaults to 5
oil_dbscan2 <- dbscan(oils_scaled, eps = 1.7)

DBSCAN results

fviz_pca(pca, 
         habillage = factor(oil_dbscan1$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN clustering results, epsilon = 1.2') +
      labs(color='Area',shape='Area') + 
  
fviz_pca(pca, 
         habillage = factor(oil_dbscan2$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN clustering results, epsilon = 1.7') +
      labs(color='Area',shape='Area')  
PCA biplots with DBSCAN clustering indicated
  • The \(\epsilon = 1.2\) clustering certainly seems more informative, creating 6 clusters plus outliers, although Cluster 6 is very small
  • Perhaps consider increasing minPts?
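Comparing the cluster sizes under the two \(\epsilon\) values makes the trade-off concrete (cluster 0 is the outlier group):

list(eps_1.2 = table(oil_dbscan1$cluster),
     eps_1.7 = table(oil_dbscan2$cluster))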

Increasing minPts to 10

  • For minPts = 10, a couple of candidate \(\epsilon\) values:
kNNdistplot(oils_scaled, minPts = 10)
abline(h = 1.2)
abline(h = 1.4)
abline(h = 1.8)
knn plot to choose epsilon

\(\epsilon\) results when minPts = 10

PCA biplots with DBSCAN clustering indicated
  • By eye, both \(\epsilon = 1.4\) and certainly \(\epsilon = 1.8\) create clusters that are too large
  • The \(\epsilon = 1.2\) solution seems best, although it does appear to have a lot of outliers very close to the gold Cluster 1 (but remember this is just a 2D representation of 8D data).

Code for previous slide

oil_dbscan3 <- dbscan(oils_scaled, eps = 1.2, minPts = 10)
oil_dbscan4 <- dbscan(oils_scaled, eps = 1.4, minPts = 10)
oil_dbscan5 <- dbscan(oils_scaled, eps = 1.8, minPts = 10)

fviz_pca(pca, 
         habillage = factor(oil_dbscan3$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN, eps = 1.2, minPts= 10') +
      labs(color='Cluster',shape='Cluster') + 
  
  
fviz_pca(pca, 
         habillage = factor(oil_dbscan4$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN, eps = 1.4, minPts= 10') +
      labs(color='Cluster',shape='Cluster') +
  
fviz_pca(pca, 
         habillage = factor(oil_dbscan5$cluster),
         label='var',
         repel = TRUE) + 
      ggtitle('DBSCAN, eps = 1.8, minPts= 10') +
      labs(color='Cluster',shape='Cluster')  

Area, kmeans, and DBSCAN

PCA biplots with area and k-means and DBSCAN clustering indicated
  • Similarities between k-means and DBSCAN:
    • Coastal and Inland Sardinia form a cluster
    • Umbria and East Liguria form a cluster
    • West Liguria forms a cluster
  • Differences between k-means and DBSCAN:
    • Calabria is clustered together with Sicily and North Apulia for k-means, but with South Apulia for DBSCAN
    • The Sicily/North Apulia k-means cluster is essentially a cluster + outliers for DBSCAN
    • I do think DBSCAN correctly identifies poorly-clustering points as outliers (one could consider filtering the k-means results to points with large silhouette scores; a sketch of this idea follows below)
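As a sketch of that last idea (it assumes the cluster package is available, and the 0.25 cutoff is an arbitrary illustrative choice), silhouette widths for the k-means solution can flag oils that cluster poorly, playing a role loosely analogous to DBSCAN's outliers:

library(cluster)                                    # for silhouette()
sil <- silhouette(kmeans_best$cluster, dist(oils_scaled))
summary(sil)

# Oils with small silhouette widths fit their k-means cluster poorly
poor_fit <- which(sil[, "sil_width"] < 0.25)
oils[poor_fit, c("Region", "Area")]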