DBSCAN is an acronym that stands for Density-Based Spatial Clustering of Applications with Noise,
where noise refers to outlier points that don’t belong to any true cluster.
We’ll be using the synthetic multishapes data set from the factoextra
package, since it has features that
demonstrate the advantages DBSCAN has over k-means clustering.
## Loading in the data from the factoextra package
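The code below leans on the tidyverse for wrangling and plotting, plus the dbscan and factoextra packages, so a minimal setup (assuming all three are installed) looks like:

library(tidyverse)   # dplyr, ggplot2, tidyr, readr
library(dbscan)      # dbscan(), kNNdist(), kNNdistplot()
library(factoextra)  # the multishapes data set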
multishapes <-
factoextra::multishapes |>
# Changing shape to a factor so there are separate colors per group
# instead of a gradient
mutate(shape = factor(shape))
# Displaying the data
tibble(multishapes)
## # A tibble: 1,100 × 3
##          x       y shape
##      <dbl>   <dbl> <fct>
##  1  -0.804 -0.853  1
##  2   0.853  0.368  1
##  3   0.927 -0.275  1
##  4  -0.753 -0.512  1
##  5   0.707  0.811  1
##  6   1.03   0.395  1
##  7  -0.476  0.990  1
##  8  -0.460 -0.859  1
##  9  -0.680 -0.661  1
## 10  -1.03  -0.0220 1
## # ℹ 1,090 more rows
As always, we should start by plotting the data:
multishapes |>
# Indicating which shape is the outliers (noise)
mutate(
noise = shape == 5
) |>
# Creating the graph with the noise group
ggplot(
mapping = aes(
x = x,
y = y
)
) +
geom_point(
mapping = aes(
color = shape,
shape = noise
)
)
If we just had the green, teal, and purple clusters, then k-means would be an appropriate method to cluster the data.
However, the blue, red, and yellow points make clustering with k-means inappropriate: k-means represents each cluster by its centroid, and the red and yellow groups have essentially the same centroid!
The blue triangles were generated at random and don’t come from any of the “true” clusters; they are the “noise”. k-means has no way of handling noise, since it forces every point into one of the k specified clusters.
Even though DBSCAN clusters based on the “density” of the points rather than distances to a centroid, density is still measured using distances between points, so we need to rescale the data before running the algorithm.
Checking the scale of the variables:
multishapes |>
summarize(
across(
.cols = where(is.numeric),
.fns = sd
)
)
##           x       y
## 1 0.6449673 1.17617
Since the SD of y is nearly twice the SD of x, we need to rescale the data!
### Rescaling the data to have sd of 1
shape_sc <-
multishapes |>
# Standardizing both x and y
mutate(
across(
.cols = where(is.numeric),
.fns = ~ (. - mean(.))/sd(.)
)
) |>
# Dropping shape since it isn't a numeric variable
dplyr::select(-shape)
# Checking the data now
shape_sc |>
summarize(
across(
.cols = where(is.numeric),
.fns = sd
)
)
##   x y
## 1 1 1
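As an aside, base R’s scale() performs the same standardization in one step; an equivalent version (producing the same standardized columns) would be:

# Equivalent standardization using base R's scale()
shape_sc2 <-
  multishapes |>
  dplyr::select(x, y) |>
  scale() |>
  data.frame()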
Let’s start by naively running the k-means algorithm:
set.seed(1234)
km6 <-
kmeans(
shape_sc,
centers = 6,
nstart = 10
)
### Plotting the results
data.frame(
shape_sc,
clusters = factor(km6$cluster)
) |>
ggplot(
mapping = aes(
x = x,
y = y,
color = clusters
)
) +
geom_point(
show.legend = FALSE,
size = 1.5,
alpha = 0.5
) +
labs(title = "Clusters using K-means")
That definitely isn’t the best choice of clusters, so let’s look at the results using DBSCAN!
One advantage of DBSCAN over k-means is that we don’t need to specify how many clusters exist in the data to run the algorithm!
Instead, DBSCAN has two hyperparameters that we need to specify:

- min points = the minimum number of points needed to start a new cluster
- ε = the distance from a point that the algorithm will “scan” for a new point to join the cluster

Min points typically depends on the number of cases and variables; we’ll use min points = 5. So how do we find ε?
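Before answering that, it helps to see what a candidate ε actually does. The sketch below counts how many points (including the point itself) fall within a trial radius of each point, which is the quantity DBSCAN compares against min points; eps_try = 0.2 here is just an arbitrary illustrative value:

# For a trial eps, count how many points (including itself) fall
# within eps of each point; a point is a "core" point when this
# count is at least minPts
eps_try <- 0.2
n_within <- rowSums(as.matrix(dist(shape_sc)) <= eps_try)
summary(n_within)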
The most common way is to use a kNN distance plot and find the elbow of the plot. The elbow marks the distance at which the kNN distances start climbing sharply, separating points inside dense clusters from the noise points between them.
We can use the kNNdist() and kNNdistplot() functions from the dbscan package. Both have two arguments:

- x = the (scaled) data set
- k = the number of nearest neighbors to find the distances for

Let’s start by creating a scatterplot to visualize how far the 5th nearest neighbor is for each point:
### Creating a plot to visualize how the distance changes for each point
shape_sc |>
mutate(
kdist = kNNdist(x = dist(shape_sc),
k = 5
)
) |>
ggplot(
mapping = aes(
x = x,
y = y,
color = kdist
)
) +
geom_point(
show.legend = TRUE
) +
scale_color_gradient(
low="blue",
high = "red",
breaks = seq(0.1, 0.7, by = 0.1)
) +
theme(legend.position = "bottom") +
guides(color = guide_colorbar(barwidth = 20,
title.vjust = 0.75))
The denser the points, the bluer they appear. The red dots are the “noise” points we saw earlier!
Now let’s actually create a kNN distance plot:
kNNdistplot(
x = dist(shape_sc),
k = 5
)
The elbow of the plot appears to be around 0.2, so we’ll use ε = 0.2.
# Using the dbscan() function
set.seed(1234)
shapes_db <-
dbscan(
x = shape_sc,
minPts = 5,
eps = 0.2
)
shapes_db
## DBSCAN clustering for 1100 objects.
## Parameters: eps = 0.2, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 5 cluster(s) and 25 noise points.
##
##   0   1   2   3   4   5
##  25 410 407 106 101  51
##
## Available fields: cluster, eps, minPts, dist, borderPoints
Finding what objects are stored in the dbscan output:
names(shapes_db)
## [1] "cluster" "eps" "minPts" "dist" "borderPoints"
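The cluster field is the one we’ll use most. Cluster 0 is reserved for the noise points, so we can, for example, count them directly as a sanity check against the printed summary:

# Cluster 0 holds the noise points
sum(shapes_db$cluster == 0)
## [1] 25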
Let’s plot the results of the dbscan algorithm:
multishapes |>
mutate(cluster = factor(shapes_db$cluster)) |>
ggplot(
mapping = aes(
x = x,
y = y,
color = cluster,
shape = factor(shape)
)
) +
geom_point(
show.legend = FALSE,
size = 2
) +
labs(title = "Clusters using DBScan with eps = 0.20 and minPts = 5")
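As an aside, the dbscan package also provides a quick built-in visualization, hullplot(), which draws a convex hull around each discovered cluster; a one-line sketch:

# Built-in dbscan visualization: a hull around each cluster
hullplot(shape_sc, shapes_db)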
Let’s create a graph to see how the choice of min points will impact the value of ε:
# Choices of min points to compare
minpts <- c(7, 4, 14)
# Calculating the kNN distance for the minpts choices
kNNdist(
shape_sc,
k = max(minpts),
all = TRUE
) |>
# Converting to a data frame to manipulate and plot the results
data.frame() |>
# Keeping just the columns specified in minpts
dplyr::select(
contains(as.character(minpts))
) |>
# Stacking the distance columns into one column
pivot_longer(
cols = everything(),
names_to = "MinPts",
values_to = "Distance"
) |>
# Arranging the rows in order from minpts then distance
arrange(MinPts, Distance) |>
# Adding the row number for each knn distances for each min point choice
mutate(
.by = MinPts,
point = row_number(),
# Converting the minpts column to a numeric instead of a character
minPts = factor(parse_number(MinPts),
levels = sort(minpts))
) |>
# Creating the graph
ggplot(
mapping = aes(
x = point,
y = Distance,
color = minPts
)
) +
geom_line(
linewidth = 1
) +
labs(
title = paste("kNN Distance plot for k =",
paste(sort(minpts),
collapse = ", "))
)
Our (min points, ε) pairs appear to be (4, 0.2), (7, 0.25), and (14, 0.35). Let’s compare the three choices:
shapes_db4 <- dbscan(x = shape_sc, minPts = 4, eps = 0.2)
shapes_db7 <- dbscan(x = shape_sc, minPts = 7, eps = 0.25)
shapes_db14 <- dbscan(x = shape_sc, minPts = 14, eps = 0.35)
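Before plotting, a quick numeric comparison can be useful; this sketch simply tabulates the cluster sizes (cluster 0 being noise) for each fit:

# Number of points per cluster (cluster 0 = noise) for each fit
lapply(
  list(minPts4 = shapes_db4, minPts7 = shapes_db7, minPts14 = shapes_db14),
  \(db) table(db$cluster)
)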
# Graphing the results
multishapes |>
mutate(
cluster4 = factor(shapes_db4$cluster),
cluster7 = factor(shapes_db7$cluster),
cluster14 = factor(shapes_db14$cluster)
) |>
# Pivoting the cluster number all into 1 column
pivot_longer(
cols = starts_with("cluster"),
values_to = "cluster"
) |>
# Extracting the min points choice and changing the name
mutate(
minpts = parse_number(name)
) |>
# Arranging them in order from lowest min points to highest min points
arrange(minpts) |>
mutate(
minpts = paste("Min points =", minpts) |> as_factor()
) |>
# Graphing the results
ggplot(
mapping = aes(
x = x,
y = y,
color = cluster
)
) +
geom_point(
show.legend = FALSE
) +
# Creating 3 graphs, one per minpoints choice
facet_wrap(
facets = ~ minpts
) +
labs(title = "Clusters Result using DBScan")