DBSCAN is an acronym that stands for Density-Based Spatial Clustering of Applications with Noise,
where noise refers to outlier points that don’t belong to any true cluster.
We’ll be using the synthetic multishapes data set from the factoextra
package, since it has features that
demonstrate the advantages DBSCAN has over k-means clustering.
## Loading in the data from the factoextra package
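The code below leans on the tidyverse for wrangling and plotting, plus the dbscan and factoextra packages, so a minimal setup (assuming all three are installed) looks like:

library(tidyverse)   # dplyr, ggplot2, tidyr, readr
library(dbscan)      # dbscan(), kNNdist(), kNNdistplot()
library(factoextra)  # the multishapes data set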
multishapes <-
factoextra::multishapes |>
# Changing shape to a factor so there are separate colors per group
# instead of a gradient
mutate(shape = factor(shape))
# Displaying the data
tibble(multishapes)
## # A tibble: 1,100 × 3
##          x       y shape
##      <dbl>   <dbl> <fct>
##  1  -0.804 -0.853  1
##  2   0.853  0.368  1
##  3   0.927 -0.275  1
##  4  -0.753 -0.512  1
##  5   0.707  0.811  1
##  6   1.03   0.395  1
##  7  -0.476  0.990  1
##  8  -0.460 -0.859  1
##  9  -0.680 -0.661  1
## 10  -1.03  -0.0220 1
## # ℹ 1,090 more rows
As always, we should start by plotting the data:
multishapes |>
# Indicating which shape is the outliers (noise)
mutate(
noise = shape == 5
) |>
# Creating the graph with the noise group
ggplot(
mapping = aes(
x = x,
y = y
)
) +
geom_point(
mapping = aes(
color = shape,
shape = noise
)
)
If we just had the green, teal, and purple clusters, then k-means would be an appropriate method to cluster the data.
However, the blue, red, and yellow points make clustering with k-means inappropriate: k-means represents each cluster by its centroid, and the red and yellow groups have essentially the same centroid!
The blue triangles were generated at random and don’t come from any of the “true” clusters; they are the “noise”. k-means has no way of handling noise, since it forces every point into one of the k specified clusters.
Even though DBSCAN clusters based on the “density” of the points rather than distances to a centroid, density is still measured using distances between points, so we need to rescale the data before running the algorithm.
Checking the scale of the variables:
multishapes |>
summarize(
across(
.cols = where(is.numeric),
.fns = sd
)
)
##           x       y
## 1 0.6449673 1.17617
Since the SD of y is nearly twice the SD of x, we need to rescale the data!
### Rescaling the data to have sd of 1
shape_sc <-
multishapes |>
# Standardizing both x and y
mutate(
across(
.cols = where(is.numeric),
.fns = ~ (. - mean(.))/sd(.)
)
) |>
# Dropping shape since it isn't a numeric variable
dplyr::select(-shape)
# Checking the data now
shape_sc |>
summarize(
across(
.cols = where(is.numeric),
.fns = sd
)
)
##   x y
## 1 1 1
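As an aside, base R’s scale() performs the same standardization in one step; an equivalent version (producing the same standardized columns) would be:

# Equivalent standardization using base R's scale()
shape_sc2 <-
  multishapes |>
  dplyr::select(x, y) |>
  scale() |>
  data.frame()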
Let’s start by naively running the k-means algorithm:
set.seed(1234)
km6 <-
kmeans(
shape_sc,
centers = 6,
nstart = 10
)
### Plotting the results
data.frame(
shape_sc,
clusters = factor(km6$cluster)
) |>
ggplot(
mapping = aes(
x = x,
y = y,
color = clusters
)
) +
geom_point(
show.legend = FALSE,
size = 1.5,
alpha = 0.5
) +
labs(title = "Clusters using K-means")
That definitely isn’t the best choice of clusters, so let’s look at the results using DBSCAN!
One advantage of DBSCAN over k-means is that we don’t need to specify how many clusters exist in the data to run the algorithm!
Instead, DBSCAN has two hyperparameters that we need to specify:

- min points = the minimum number of points needed to start a new cluster
- ε = the distance from a point that the algorithm will “scan” for a new point to join the cluster

Min points typically depends on the number of cases and variables; we’ll use min points = 5. So how do we find ε?
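Before answering that, it helps to see what a candidate ε actually does. The sketch below counts how many points (including the point itself) fall within a trial radius of each point, which is the quantity DBSCAN compares against min points; eps_try = 0.2 here is just an arbitrary illustrative value:

# For a trial eps, count how many points (including itself) fall
# within eps of each point; a point is a "core" point when this
# count is at least minPts
eps_try <- 0.2
n_within <- rowSums(as.matrix(dist(shape_sc)) <= eps_try)
summary(n_within)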
The most common way is to use a kNN distance plot and find the elbow of the plot. The elbow marks the distance at which the kNN distances start climbing sharply, separating points inside dense clusters from the noise points between them.
We can use the kNNdist() and kNNdistplot() functions from the dbscan package. Both have two arguments:

- x = the (scaled) data set
- k = the number of nearest neighbors to find the distances for

Let’s start by creating a scatterplot to visualize how far the 5th nearest neighbor is for each point:
### Creating a plot to visualize how the distance changes for each point
shape_sc |>
mutate(
kdist = kNNdist(x = dist(shape_sc),
k = 5
)
) |>
ggplot(
mapping = aes(
x = x,
y = y,
color = kdist
)
) +
geom_point(
show.legend = TRUE
) +
scale_color_gradient(
low="blue",
high = "red",
breaks = seq(0.1, 0.7, by = 0.1)
) +
theme(legend.position = "bottom") +
guides(color = guide_colorbar(barwidth = 20,
title.vjust = 0.75))
The denser the points, the bluer they appear. The red dots are the “noise” points we saw earlier!
Now let’s actually create a kNN distance plot:
kNNdistplot(
x = dist(shape_sc),
k = 5
)
The elbow of the plot appears to be around 0.2, so we’ll use ε = 0.2.
# Using the dbscan() function
set.seed(1234)
shapes_db <-
dbscan(
x = shape_sc,
minPts = 5,
eps = 0.2
)
shapes_db
## DBSCAN clustering for 1100 objects.
## Parameters: eps = 0.2, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 5 cluster(s) and 25 noise points.
##
##   0   1   2   3   4   5
##  25 410 407 106 101  51
##
## Available fields: cluster, eps, minPts, dist, borderPoints
Finding what objects are stored in the dbscan output:
names(shapes_db)
## [1] "cluster" "eps" "minPts" "dist" "borderPoints"
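The cluster field is the one we’ll use most. Cluster 0 is reserved for the noise points, so we can, for example, count them directly as a sanity check against the printed summary:

# Cluster 0 holds the noise points
sum(shapes_db$cluster == 0)
## [1] 25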
Let’s plot the results of the dbscan algorithm:
multishapes |>
mutate(cluster = factor(shapes_db$cluster)) |>
ggplot(
mapping = aes(
x = x,
y = y,
color = cluster,
shape = factor(shape)
)
) +
geom_point(
show.legend = FALSE,
size = 2
) +
labs(title = "Clusters using DBScan with eps = 0.20 and minPts = 5")
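As an aside, the dbscan package also provides a quick built-in visualization, hullplot(), which draws a convex hull around each discovered cluster; a one-line sketch:

# Built-in dbscan visualization: a hull around each cluster
hullplot(shape_sc, shapes_db)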
Let’s create a graph to see how the choice of min points will impact the value of ε:
# Choices of min points to compare
minpts <- c(7, 4, 14)
# Calculating the kNN distance for the minpts choices
kNNdist(
shape_sc,
k = max(minpts),
all = TRUE
) |>
# Converting to a data frame to manipulate and plot the results
data.frame() |>
# Keeping just the columns specified in minpts
dplyr::select(
contains(as.character(minpts))
) |>
# Stacking the distance columns into one column
pivot_longer(
cols = everything(),
names_to = "MinPts",
values_to = "Distance"
) |>
# Arranging the rows in order from minpts then distance
arrange(MinPts, Distance) |>
# Adding the row number for each knn distances for each min point choice
mutate(
.by = MinPts,
point = row_number(),
# Converting the minpts column to a numeric instead of a character
minPts = factor(parse_number(MinPts),
levels = sort(minpts))
) |>
# Creating the graph
ggplot(
mapping = aes(
x = point,
y = Distance,
color = minPts
)
) +
geom_line(
linewidth = 1
) +
labs(
title = paste("kNN Distance plot for k =",
paste(sort(minpts),
collapse = ", "))
)
Our (min points, ε) pairs appear to be (4, 0.2), (7, 0.25), and (14, 0.35). Let’s compare the three choices:
shapes_db4 <- dbscan(x = shape_sc, minPts = 4, eps = 0.2)
shapes_db7 <- dbscan(x = shape_sc, minPts = 7, eps = 0.25)
shapes_db14 <- dbscan(x = shape_sc, minPts = 14, eps = 0.35)
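Before plotting, a quick numeric comparison can be useful; this sketch simply tabulates the cluster sizes (cluster 0 being noise) for each fit:

# Number of points per cluster (cluster 0 = noise) for each fit
lapply(
  list(minPts4 = shapes_db4, minPts7 = shapes_db7, minPts14 = shapes_db14),
  \(db) table(db$cluster)
)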
# Graphing the results
multishapes |>
mutate(
cluster4 = factor(shapes_db4$cluster),
cluster7 = factor(shapes_db7$cluster),
cluster14 = factor(shapes_db14$cluster)
) |>
# Pivoting the cluster number all into 1 column
pivot_longer(
cols = starts_with("cluster"),
values_to = "cluster"
) |>
# Extracting the min points choice and changing the name
mutate(
minpts = parse_number(name)
) |>
# Arranging them in order from lowest min points to highest min points
arrange(minpts) |>
mutate(
minpts = paste("Min points =", minpts) |> as_factor()
) |>
# Graphing the results
ggplot(
mapping = aes(
x = x,
y = y,
color = cluster
)
) +
geom_point(
show.legend = FALSE
) +
# Creating 3 graphs, one per minpoints choice
facet_wrap(
facets = ~ minpts
) +
labs(title = "Clusters Result using DBScan")