Simple DBSCAN example

Introduction

If you are not familiar with what DBSCAN does, you might want to watch this 3-minute video introduction: https://www.youtube.com/watch?v=_A9Tq6mGtLI.

Here we will be using the R package dbscan. Another R package, fpc, also implements DBSCAN. I’m using the ‘dbscan’ package here because, at a glance, it seemed to have more up-to-date documentation and to be more actively maintained.

I should note that most of the instructions here are based on this tutorial on YouTube, but you don’t need to watch it as everything is included here.

Setup

Install necessary packages, if you don’t already have them, by copy-pasting the code below, uncommenting it, then running it.

dbscan: package which implements the DBSCAN algorithm
factoextra: package which has some useful example data to try out DBSCAN with

#install.packages('dbscan')
#install.packages('factoextra')

Load in everything from the factoextra package to enable use of the ‘multishapes’ example data set.

library("factoextra")

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Set the ‘random seed’ so that the results are the same every time the following commands are run. This fixes the random aspects of the algorithms, such as the random starting centers used by K-means.

set.seed(123456789)
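As a quick illustration of what setting the seed does, the sketch below draws random numbers twice after setting the same seed. The variable names are only for illustration and are not used later in this example.

```r
# Draw three random numbers after setting the seed
set.seed(123456789)
first_draw <- runif(3)

# Reset the seed to the same value and draw again
set.seed(123456789)
second_draw <- runif(3)

# TRUE: the same seed gives the same sequence of random numbers
identical(first_draw, second_draw)
```

If you run this, remember to set the seed again before continuing, so that the commands below start from the same random state.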

Extract just the first two columns (the ‘x’ and ‘y’ coordinates) from the ‘multishapes’ data set, since the third column, ‘shape’, would otherwise be treated as a third dimension in the procedures that follow.

multishapes <- multishapes[, 1:2]

Plotting the data

Plot the example data.

plot(multishapes)

Applying K-means clustering

Apply K-means clustering to the data set. K-means is a popular clustering algorithm and is often the first one people learn when studying clustering. It is included here mostly to show that K-means doesn’t perform very well with the kinds of clusters in this data set.

km_res <- kmeans(multishapes, 5, nstart = 25)

Use the results from applying K-means clustering to plot the example data again, this time coloring the points according to which cluster the algorithm grouped each point in.

plot(multishapes, col=km_res$cluster+1, main="K-means")
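One way to see why K-means struggles with shapes like these is to try it on a small made-up example. The sketch below uses hypothetical toy data (two concentric rings, not the ‘multishapes’ data) and only base R. K-means assigns each point to its nearest center, so with two centers the boundary between the groups is roughly a straight line, and a straight line cannot separate an inner ring from the ring around it.

```r
set.seed(123456789)

# Hypothetical toy data: two concentric rings of 200 points each
angles <- seq(0, 2 * pi, length.out = 201)[-1]
inner <- cbind(x = cos(angles), y = sin(angles))
outer <- cbind(x = 5 * cos(angles), y = 5 * sin(angles))
rings <- rbind(inner, outer)
ring_id <- rep(c("inner", "outer"), each = length(angles))

# Ask K-means for two clusters, as a person looking at the rings might
km_toy <- kmeans(rings, centers = 2, nstart = 25)

# Cross-tabulate K-means clusters against the ring each point came from
ring_table <- table(cluster = km_toy$cluster, ring = ring_id)
ring_table

# Color by K-means cluster to see how the rings are cut
plot(rings, col = km_toy$cluster + 1, asp = 1, main = "K-means on two rings")
```

In the cross-tabulation, at least one K-means cluster contains points from both rings, which is the same kind of mismatch visible in the K-means plot of the example data.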

Applying DBSCAN

Load in all functions etc. from the dbscan package.

library("dbscan")

Run the DBSCAN algorithm, specifying:

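eps: ‘epsilon’, the radius of the ‘epsilon neighborhood’ around each point (the maximum distance at which two points count as neighbors; points in the same cluster can be farther apart than this, as long as they are connected through a chain of neighboring core points)
minPts: the minimum number of points (including the point itself) that must lie in a point’s ‘epsilon neighborhood’ for that point to count as a core point.

Note that a ‘core point’ is a point with at least ‘minPts’ points (including itself) within its ‘epsilon neighborhood’. A point that is not a core point but lies within the ‘epsilon neighborhood’ of one is called a ‘border point’ and is assigned to that core point’s cluster.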

More details can be found in the dbscan package documentation.

dbscan_res <- dbscan(multishapes, eps = 0.15, minPts = 5)

Use the results from applying DBSCAN to plot the example data once more, coloring points according to which cluster DBSCAN grouped each point in. In the DBSCAN results, cluster group ‘0’, plotted below in black, indicates ‘noise points’. A ‘noise point’ is one which is neither a core point (it has fewer than ‘minPts’ - 1 other points within its ‘epsilon neighborhood’) nor within the ‘epsilon neighborhood’ of any core point, so it is not assigned to any cluster.

plot(multishapes, col=dbscan_res$cluster+1, main="DBSCAN")
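To make the roles of eps and minPts more concrete, here is a minimal sketch that labels points as core, border or noise by hand, using only base R. The toy data and variable names are hypothetical, and the sketch assumes that a neighbor is any point at distance less than or equal to eps. It only illustrates the definitions and is not how the dbscan package is implemented.

```r
# Hypothetical toy data: a tight group of six points,
# one point near the edge of the group, and one isolated point
toy <- rbind(
  c(0.00, 0.00), c(0.05, 0.00), c(0.00, 0.05),
  c(0.05, 0.05), c(0.10, 0.00), c(0.10, 0.05),
  c(0.22, 0.00),  # near the group
  c(2.00, 2.00)   # far away from everything
)
eps <- 0.15
min_pts <- 5

# Pairwise Euclidean distances between all points
d <- as.matrix(dist(toy))

# Number of points within eps of each point, including the point itself
neighbor_count <- rowSums(d <= eps)

# Core points have at least min_pts points in their epsilon neighborhood
is_core <- neighbor_count >= min_pts

# Border points are not core themselves but lie within eps of a core point
near_core <- rowSums(d[, is_core, drop = FALSE] <= eps) > 0
point_type <- ifelse(is_core, "core", ifelse(near_core, "border", "noise"))

data.frame(x = toy[, 1], y = toy[, 2], neighbor_count, point_type)
```

Here the six grouped points come out as core points, the seventh point is a border point (it has only three points in its neighborhood, but two of them are core points), and the isolated point is noise.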

Modifying the algorithm parameters

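The DBSCAN results are very sensitive to changes to the epsilon and ‘minPts’ values. In general, a larger epsilon tends to merge nearby clusters, while a larger ‘minPts’ tends to turn more points into noise. The two runs below each change one parameter at a time.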

# eps: 0.15 -> 0.4
# minPts stays the same
# (multishapes already contains only the 'x' and 'y' columns)
dbscan_res_changed <- dbscan(multishapes, eps = 0.4, minPts = 5)
plot(multishapes, col=dbscan_res_changed$cluster+1, main="DBSCAN (eps = 0.4, minPts = 5)")

# eps: 0.15 (like in first run)
# minPts: 5 -> 40
dbscan_res_changed_2 <- dbscan(multishapes, eps = 0.15, minPts = 40)
plot(multishapes, col=dbscan_res_changed_2$cluster+1, main="DBSCAN (eps = 0.15, minPts = 40)")
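Because the results depend so much on epsilon, a common heuristic is to look at how far each point is from its k-th nearest neighbor (with k = minPts - 1) and pick epsilon near the ‘knee’ of the sorted distances. The sketch below does this with base R on hypothetical toy data, as an illustration rather than a recommendation for the ‘multishapes’ data. The dbscan package also provides a kNNdistplot() function for this kind of plot.

```r
set.seed(1)

# Hypothetical toy data: one dense blob plus a few scattered points
toy <- rbind(
  matrix(rnorm(200, sd = 0.1), ncol = 2),
  matrix(runif(20, min = -3, max = 3), ncol = 2)
)
k <- 4  # minPts - 1, for minPts = 5

# Pairwise distances; each row holds one point's distances to all points
d <- as.matrix(dist(toy))

# After sorting a row, element 1 is the point itself (distance 0),
# so the k-th nearest neighbor is element k + 1
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Points in dense regions have small k-NN distances; the sharp rise
# at the right of the curve suggests a range for epsilon
plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance", ylab = "4-NN distance")
```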