Density-Based Spatial Clustering

How it Works

Density-based spatial clustering of applications with noise (DBSCAN) has a potential advantage over k-means clustering: we don’t need to tell it how many clusters there are in advance. Another distinction, which could be seen as either a feature or a bug, is that it will classify some points as outliers and not place them in any cluster.

DBSCAN has two parameters:

eps - (“the size of the epsilon neighborhood”) can be thought of as a radius. If a point lies within this radius of another point then it is a potential neighbor of that point.

minPts - the minimum number of points that must lie within a point’s radius for that neighborhood to count as dense enough to start (or extend) a cluster. The dbscan package counts the point itself toward this number.
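To make the two parameters concrete, here is a small base R sketch that counts how many points fall inside each point's radius. The six points, the values of eps and minPts, and the variable names are all made up for illustration, and the count assumes the convention that a point counts itself.

```r
# Toy data (hypothetical): five points bunched together plus one far-away point
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.05, 0.05), c(3, 3))

eps    <- 0.2   # radius of the epsilon neighborhood
minPts <- 4     # points required inside the radius (counting the point itself)

# Pairwise Euclidean distances as a full matrix
d <- as.matrix(dist(pts))

# Number of points within eps of each point; the diagonal is 0, so each point counts itself
n_neighbors <- rowSums(d <= eps)

# A point's circle is "dense" when it holds at least minPts points
is_dense <- n_neighbors >= minPts

data.frame(x = pts[, 1], y = pts[, 2], n_neighbors, is_dense)
```

The five bunched points each see all five of themselves within the radius, so their circles are dense; the far-away point sees only itself and is not.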

Step 1: Begin with an arbitrary starting point (that has not yet been visited).

Step 2: Count the number of points within its circle (with radius = eps). If this number is at least minPts, then this is considered a “dense” circle and a cluster starts to form; otherwise this point is labelled as noise (for now, since it can still be absorbed later if it falls inside another point’s dense circle).

If a cluster starts, all points within the radius are added to the cluster and, if those points are within dense circles, all of the points within their circles are added to the cluster as well. This continues until no more points are added to the cluster and then we return to Step 1 (and keep doing so until all points have been visited).
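The procedure above can be sketched in a few lines of base R. This is a minimal illustration for small data sets, not the implementation used by the dbscan package; the function name simple_dbscan and the toy points are hypothetical, and it assumes a point counts itself toward minPts.

```r
# A bare-bones DBSCAN sketch: returns 0 for noise and 1, 2, ... for clusters
simple_dbscan <- function(x, eps, minPts) {
  d <- as.matrix(dist(x))                  # all pairwise distances (fine for small n)
  n <- nrow(d)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  labels  <- rep(0L, n)                    # 0 = noise or not yet assigned
  visited <- rep(FALSE, n)
  cluster_id <- 0L

  for (i in seq_len(n)) {
    if (visited[i]) next                   # Step 1: pick a point not yet visited
    visited[i] <- TRUE
    if (length(neighbors[[i]]) < minPts) next   # Step 2: not dense, noise for now

    cluster_id <- cluster_id + 1L          # dense circle: a new cluster starts
    labels[i] <- cluster_id
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (labels[j] == 0L) labels[j] <- cluster_id   # claim unassigned or noise points
      if (visited[j]) next
      visited[j] <- TRUE
      # If j's circle is dense too, everything inside it joins the cluster
      if (length(neighbors[[j]]) >= minPts) queue <- c(queue, neighbors[[j]])
    }
  }
  labels
}

# Toy data (hypothetical): two 3x3 grids of points, one edge point, one isolated point
grid <- as.matrix(expand.grid(x = c(0, 0.1, 0.2), y = c(0, 0.1, 0.2)))
toy <- rbind(
  c(0.3, 0),    # edge point: close to the first grid but not dense itself
  grid,         # first blob (9 points)
  grid + 5,     # second blob (9 points)
  c(10, 10)     # isolated point
)

labels <- simple_dbscan(toy, eps = 0.15, minPts = 4)
table(labels)
```

The edge point in the first row is visited before any cluster exists, so it starts out as noise and is only later absorbed into the first cluster; the isolated last point keeps the label 0.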

For a demonstration of this procedure go to: https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

How Many Pitches Does Max Scherzer Throw? (revisited)

df <- read.csv("/home/jcross/data/max_scherzer_2019.csv")

library(ggplot2)
library(dplyr)

mid <- 85
ggplot(df, aes(pfx_x, pfx_z, color = start_speed)) +
  geom_point() +
  scale_color_gradient2(midpoint = mid, mid = "yellow", low = "blue", high = "red")

scaled.df <- scale(df)


library(dbscan)
dbscan.cluster <- dbscan(scaled.df, eps=0.3, minPts = 10)

df %>%
  as_tibble() %>%
  mutate(cluster = dbscan.cluster$cluster) %>%
  ggplot(aes(pfx_x, pfx_z, color = factor(cluster))) +
  geom_point()

Notice that the red points (cluster 0) are outliers, not considered to be members of any cluster.

If we make the radius small and the number of points (that must be within a circle) large, no circle is dense enough to start a cluster, so ALL points end up labelled as outliers (as follows):

dbscan.cluster <- dbscan(scaled.df, eps=0.1, minPts = 10)

df %>%
  as_tibble() %>%
  mutate(cluster = dbscan.cluster$cluster) %>%
  ggplot(aes(pfx_x, pfx_z, color = factor(cluster))) +
  geom_point()

whereas, if we make the radius larger and the number of points small…

dbscan.cluster <- dbscan(scaled.df, eps=0.6, minPts = 3)

df %>%
  as_tibble() %>%
  mutate(cluster = dbscan.cluster$cluster) %>%
  ggplot(aes(pfx_x, pfx_z, color = factor(cluster))) +
  geom_point()
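One way to see why these settings behave so differently is to count how many points have a dense circle under each one: if no point does, no cluster can start and everything is noise. The sketch below uses synthetic standardized data as a stand-in (it is NOT the pitch data, so the shares will differ from what scaled.df gives), and core_share is a hypothetical helper name.

```r
# Synthetic stand-in for scaled.df: 300 standardized 2-D points (not the pitch data)
set.seed(42)
fake <- scale(matrix(rnorm(600), ncol = 2))
d <- as.matrix(dist(fake))

# Share of points whose eps-circle holds at least minPts points (self included)
core_share <- function(d, eps, minPts) mean(rowSums(d <= eps) >= minPts)

# The three (eps, minPts) settings tried above
settings <- data.frame(eps = c(0.3, 0.1, 0.6), minPts = c(10, 10, 3))
settings$core_share <- mapply(function(e, m) core_share(d, e, m),
                              settings$eps, settings$minPts)
settings
```

Shrinking eps or raising minPts can only lower this share, and enlarging eps or lowering minPts can only raise it, which is why the small-radius setting drifts toward all noise and the large-radius setting toward a few big clusters.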

Try this clustering algorithm on cake ingredients and mammal teeth

You’ll want to do the same data prep as before (see below):

library(cluster.datasets)
data(cake.ingredients.1961)
df <- cake.ingredients.1961
df[is.na(df)] <- 0
row.names(df) <- df$Cake
df <- df[,-1]
df <- scale(df)
df <- df[,-2]


data(mammal.dentition)
df <- mammal.dentition
row.names(df) <- df$name
df <- df[,-1]
df <- scale(df)