If you are not at all familiar with what DBSCAN does, you might want to watch this 3-minute video introduction: https://www.youtube.com/watch?v=_A9Tq6mGtLI.
Here we will be using the R package dbscan. There is another R package that has been used for DBSCAN as well, called fpc. The reason I’m using the ‘dbscan’ package here is that at a glance it seemed to have more up-to-date documentation and be more actively maintained.
I should note that most of the instructions here are based on this tutorial on YouTube, but you don’t need to watch it as everything is included here.
Install necessary packages, if you don’t already have them, by copy-pasting the code below, uncommenting it, then running it.
#install.packages('dbscan')
#install.packages('factoextra')
Load in everything from the factoextra package to enable use of the ‘multishapes’ example data set.
library("factoextra")
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Set the ‘random seed’, which means that the results will always be the same when running the following commands, since the random aspects of the algorithms are controlled.
set.seed(123456789)
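As a quick illustration of what setting the seed does, the sketch below draws random numbers twice from the same seed and compares them. The seed value 42 and the variable names are arbitrary choices for this example only.

```r
# Draw three random numbers after setting a seed
set.seed(42)
first_draw <- runif(3)

# Reset the same seed and draw again: the sequence repeats
set.seed(42)
second_draw <- runif(3)

same_after_reseed <- identical(first_draw, second_draw)
print(same_after_reseed)  # TRUE
```

The same idea applies to K-means further down, which picks its starting centers at random.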
Extract just the first two columns (the ‘x’ and ‘y’ coordinates) from the ‘multishapes’ data set, since the third column, ‘shape’, would otherwise be treated as a third dimension in the procedures that follow.
multishapes <- multishapes[, 1:2]
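If you want to check what is being dropped, a small sketch like the following inspects the full data set first. It reads the data via `factoextra::multishapes` into new objects (`shapes_full` and `xy_only` are names made up for this example) so that it does not interfere with the `multishapes` object used in the rest of this walkthrough.

```r
# Look at the untouched data set under a separate name
shapes_full <- factoextra::multishapes
str(shapes_full)    # shows the column names and types
print(names(shapes_full))

# Selecting the coordinate columns by name is equivalent to [, 1:2]
xy_only <- shapes_full[, c("x", "y")]
print(head(xy_only))
```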
Plot the example data.
plot(multishapes)
Apply K-means clustering to the data set. K-means is a popular clustering algorithm and is often the first one people learn when studying clustering. The main purpose here is to show that K-means doesn’t perform very well with the kinds of clusters in this data set.
km_res <- kmeans(multishapes, 5, nstart = 25)
Use the results from applying K-means clustering to plot the example data again, this time coloring the points according to which cluster the algorithm grouped each point in.
plot(multishapes, col=km_res$cluster+1, main="K-means")
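To see why K-means struggles with shapes like these, here is a minimal base-R sketch on made-up data: two concentric rings, which are not part of the multishapes example. K-means with two centers divides the plane with a straight boundary, so it cannot separate an inner ring from an outer one. All names and values below are assumptions chosen for illustration.

```r
set.seed(1)
n <- 200  # points per ring (arbitrary)

# Two noisy concentric rings with radii 1 and 3
angle <- runif(2 * n, 0, 2 * pi)
radius <- rep(c(1, 3), each = n) + rnorm(2 * n, sd = 0.1)
rings <- cbind(x = radius * cos(angle), y = radius * sin(angle))
true_ring <- rep(c("inner", "outer"), each = n)

# K-means with two centers
km_rings <- kmeans(rings, centers = 2, nstart = 25)

# Cross-tabulate the true ring against the K-means cluster:
# each K-means cluster ends up containing points from both rings
confusion <- table(true_ring, km_rings$cluster)
print(confusion)
```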
Load all functions etc. from the dbscan package.
library("dbscan")
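Run the DBSCAN algorithm, specifying ‘eps’ (the radius of the ‘epsilon neighborhood’ around each point) and ‘minPts’ (the minimum number of points that must fall within a point’s epsilon neighborhood for it to count as a core point).
Note that ‘core points’ are points which have at least ‘minPts’ points, counting the point itself, within their ‘epsilon neighborhood’. Points that are not core points but lie within the epsilon neighborhood of a core point are ‘border points’ and are still assigned to a cluster.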
More details can be found in the dbscan package documentation.
dbscan_res <- dbscan(multishapes, eps = 0.15, minPts = 5)
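To make the point types concrete, here is a minimal base-R sketch that labels a handful of points as core, border, or noise using the usual DBSCAN definitions. The seven coordinates are made up for illustration, and the code uses a plain distance matrix rather than anything from the dbscan package, so treat it as a teaching aid and not as how the package works internally.

```r
# Hypothetical toy data: a tight clump of five points, one point on its
# edge, and one point far away
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(0.1, 0.1), c(0.05, 0.05),
             c(0.22, 0.05),
             c(1, 1))
eps <- 0.15
minPts <- 5

# Pairwise distances; the diagonal is 0, so each point counts itself
d <- as.matrix(dist(pts))
within_eps <- d <= eps

# Core: at least minPts points (including itself) within eps
is_core <- rowSums(within_eps) >= minPts

# Border: not core, but within eps of at least one core point
is_border <- !is_core & rowSums(within_eps[, is_core, drop = FALSE]) > 0

# Noise: everything else
point_type <- ifelse(is_core, "core", ifelse(is_border, "border", "noise"))
print(unname(point_type))
# "core" "core" "core" "core" "core" "border" "noise"
```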
Use the results from applying DBSCAN to plot the example data once more, coloring points according to which cluster DBSCAN grouped each point in. In the DBSCAN results, cluster group ‘0’, plotted below in black, indicates ‘noise points’. A ‘noise point’ is one which has fewer than ‘minPts’ points (counting itself) within its epsilon neighborhood and is also not within the epsilon neighborhood of any core point, so it is not considered part of any cluster.
plot(multishapes, col=dbscan_res$cluster+1, main="DBSCAN")
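Besides plotting, it can help to count how many points landed in each cluster. The sketch below reruns DBSCAN with the same settings on a fresh copy of the coordinates (`xy`, `res`, and the other names are chosen for this example only) and tabulates the `cluster` vector, in which 0 marks noise.

```r
library(dbscan)

# Fresh copy of the x/y coordinates, independent of earlier objects
xy <- factoextra::multishapes[, c("x", "y")]
res <- dbscan(xy, eps = 0.15, minPts = 5)

# Points per cluster label; label 0 is noise
cluster_sizes <- table(res$cluster)
print(cluster_sizes)

# Number of noise points and number of actual clusters
n_noise <- sum(res$cluster == 0)
n_clusters <- length(setdiff(unique(res$cluster), 0))
cat("clusters:", n_clusters, " noise points:", n_noise, "\n")
```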
The DBSCAN algorithm is very sensitive to changes to the epsilon and ‘minPts’ values, as the two reruns below show: first with a larger epsilon, then with a larger ‘minPts’.
# eps: 0.15 -> 0.4
# minPts stays the same
dbscan_res_changed <- dbscan(multishapes, eps = 0.4, minPts = 5)
plot(multishapes, col=dbscan_res_changed$cluster+1, main="DBSCAN (eps = 0.4, minPts = 5)")
# eps: 0.15 (like in first run)
# minPts: 5 -> 40
dbscan_res_changed_2 <- dbscan(multishapes, eps = 0.15, minPts = 40)
plot(multishapes, col=dbscan_res_changed_2$cluster+1, main="DBSCAN (eps = 0.15, minPts = 40)")
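The effect of the two parameters can also be summarized in numbers. The following sketch runs DBSCAN over every combination of the epsilon and ‘minPts’ values used above and records the number of clusters and noise points for each. The helper `summarise_run` and the other names are made up for this example, and the combination eps = 0.4 with minPts = 40 is included only because the grid produces it, not because it was tried above.

```r
library(dbscan)

# Fresh copy of the x/y coordinates, independent of earlier objects
xy <- factoextra::multishapes[, c("x", "y")]

# All combinations of the parameter values used in this walkthrough
settings <- expand.grid(eps = c(0.15, 0.4), minPts = c(5, 40))

# Hypothetical helper: run DBSCAN once and count clusters and noise points
summarise_run <- function(eps, minPts) {
  res <- dbscan(xy, eps = eps, minPts = minPts)
  c(n_clusters = length(setdiff(unique(res$cluster), 0)),
    n_noise = sum(res$cluster == 0))
}

# One row of counts per parameter combination
counts <- t(mapply(summarise_run, settings$eps, settings$minPts))
sweep_summary <- cbind(settings, counts)
print(sweep_summary)
```

Comparing the rows gives a quick feel for how strongly the result depends on each parameter before deciding which settings to plot.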