Introduction.

DBSCAN works by defining a cluster as a maximal set of density-connected points.

We can think of it as a process of grouping similar objects together, or simply as a search for similarity. Given a set of points in some space, it groups together points that are closely packed and marks as outliers the points that lie alone in low-density regions.

DBSCAN Parameters.

There are two parameters that are taken into account:

  1. Epsilon \(\epsilon\): the maximum radius of the neighbourhood around a point \(x\) (also called the epsilon neighbourhood of \(x\)).

  2. MinPts (minimum points): the minimum number of points within the epsilon neighbourhood required to form a dense region.
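Formally, the epsilon neighbourhood of a point \(x\) is the set \(N_{\epsilon}(x) = \{\, y \mid d(x, y) \le \epsilon \,\}\), where \(d\) is the chosen distance metric (typically Euclidean).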

Each point is then classified as one of three types, namely core, border, and outlier.

A core point has at least MinPts points in its epsilon neighbourhood, including itself. These are the points in the interior of a cluster.

A border point has fewer than MinPts points within its epsilon neighbourhood, but lies within the epsilon neighbourhood of a core point, so it can still be reached by the cluster.

A noise (outlier) point is one that cannot be reached by any cluster: it is neither a core point nor within the epsilon neighbourhood of one.

Density-Reachability.

An object \(y\) is directly density-reachable from an object \(x\) if \(x\) is a core object and \(y\) is in \(x\)'s epsilon neighbourhood. More generally, \(y\) is density-reachable from \(x\) if there is a chain \(x = p_1, \ldots, p_n = y\) in which each \(p_{i+1}\) is directly density-reachable from \(p_i\); two points are density-connected if both are density-reachable from some common core point.

DBSCAN Step by step.

  1. Pick a random point that has not yet been assigned to a cluster or designated as an outlier. Determine whether it is a core point. If not, label the point as an outlier.

  2. Once a core point has been found, add all points directly reachable from it to its cluster. Then perform neighbourhood jumps to each newly reachable point and add those points to the cluster as well. If a point previously labelled as an outlier is added, relabel it as a border point.

  3. Repeat these steps until all points are assigned to a cluster or labelled as outliers (a sketch of the procedure follows below).
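To make these steps concrete, here is a minimal, unoptimised sketch of the procedure in base R. The function name naive_dbscan and its interface are my own for illustration; for real work, use the fpc or dbscan packages introduced below.

# A minimal DBSCAN sketch following the three steps above (illustrative only)
naive_dbscan = function(X, eps, minPts) {
  n = nrow(X)
  dmat = as.matrix(dist(X))                 # pairwise Euclidean distances
  labels = rep(0L, n)                       # 0 = not yet assigned
  cl = 0L
  for (i in seq_len(n)) {
    if (labels[i] != 0L) next
    nbrs = which(dmat[i, ] <= eps)          # epsilon neighbourhood (incl. i)
    if (length(nbrs) < minPts) {            # not a core point:
      labels[i] = -1L                       # tentatively mark as outlier
      next
    }
    cl = cl + 1L                            # found a core point: new cluster
    labels[i] = cl
    queue = setdiff(nbrs, i)                # directly reachable points
    while (length(queue) > 0L) {
      j = queue[1L]; queue = queue[-1L]
      if (labels[j] == -1L) labels[j] = cl  # former outlier becomes border
      if (labels[j] != 0L) next
      labels[j] = cl
      jn = which(dmat[j, ] <= eps)
      if (length(jn) >= minPts) queue = union(queue, jn)  # neighbour jump
    }
  }
  labels[labels == -1L] = 0L                # report outliers as cluster 0
  labels
}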

DBSCAN with R

To perform this algorithm in R, we will need three packages, namely:

  1. fpc

  2. dbscan

  3. factoextra
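If any of these are not already installed, they can be fetched from CRAN:

# install the required packages (run once)
install.packages(c("fpc", "dbscan", "factoextra"))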

The dbscan() function from the fpc package can handle this task; its signature looks like this:

function (data, eps, MinPts = 5, scale = FALSE, method = c("hybrid", "raw", "dist"), seeds = TRUE, showplot = FALSE, countmode = NULL) NULL

Let's see what these arguments mean. Briefly, following the fpc documentation:

  - data: the data matrix, data frame, or dissimilarity matrix/dist object to cluster.
  - eps: the reachability distance (the \(\epsilon\) discussed above).
  - MinPts: the reachability minimum number of points (the MinPts parameter discussed above).
  - scale: whether to scale the data before clustering.
  - method: "dist" treats data as a distance matrix, "raw" treats it as raw data, and "hybrid" expects raw data but computes the distance matrices in chunks.
  - seeds: whether to include the isseed vector in the result.
  - showplot: whether to plot the state of the clustering while it runs.
  - countmode: NULL, or a vector of point numbers at which to report progress.
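As a quick illustration of the interface (the dataset, eps, and MinPts values here are arbitrary, chosen only for demonstration):

# illustrative call on a built-in dataset; eps and MinPts are arbitrary
fit = fpc::dbscan(scale(iris[, 1:4]), eps = 0.6, MinPts = 5, method = "hybrid")
fit$cluster[1:10]  # cluster label for each observation; 0 = noise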

Project Framing and Definition.

In this project we are going to investigate a weather stations dataset from Environment Canada's monthly values.

The main objective here is to cluster the weather stations based on their distance from each other.

Below is a quick overview of the dataset variables.

| Name | Meaning |
| --- | --- |
| Stn_Name | Station Name |
| Lat | Latitude (North+, degrees) |
| Long | Longitude (West-, degrees) |
| Prov | Province |
| Tm | Mean Temperature (°C) |
| DwTm | Days without Valid Mean Temperature |
| D | Mean Temperature difference from Normal (1981-2010) (°C) |
| Tx | Highest Monthly Maximum Temperature (°C) |
| DwTx | Days without Valid Maximum Temperature |
| Tn | Lowest Monthly Minimum Temperature (°C) |
| DwTn | Days without Valid Minimum Temperature |
| S | Snowfall (cm) |
| DwS | Days without Valid Snowfall |
| S%N | Percent of Normal (1981-2010) Snowfall |
| P | Total Precipitation (mm) |
| DwP | Days without Valid Precipitation |
| P%N | Percent of Normal (1981-2010) Precipitation |
| S_G | Snow on the ground at the end of the month (cm) |
| Pd | Number of days with Precipitation 1.0 mm or more |
| BS | Bright Sunshine (hours) |
| DwBS | Days without Valid Bright Sunshine |
| BS% | Percent of Normal (1981-2010) Bright Sunshine |
| HDD | Degree Days below 18 °C |
| CDD | Degree Days above 18 °C |
| Stn_No | Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically) |

NA indicates a value that is not available.

Data Import

data = read.csv("C:\\Users\\RuralNet011\\Documents\\GitHub\\Machine-Learning-with-R\\unsupervised learning\\DBSCAN\\weather.csv")
head(data)
##                 Stn_Name    Lat     Long Prov  Tm DwTm   D   Tx DwTx   Tn
## 1              CHEMAINUS 48.935 -123.742   BC 8.2    0  NA 13.5    0  1.0
## 2 COWICHAN LAKE FORESTRY 48.824 -124.133   BC 7.0    0 3.0 15.0    0 -3.0
## 3          LAKE COWICHAN 48.829 -124.052   BC 6.8   13 2.8 16.0    9 -2.5
## 4       DISCOVERY ISLAND 48.425 -123.226   BC  NA   NA  NA 12.5    0   NA
## 5    DUNCAN KELVIN CREEK 48.735 -123.728   BC 7.7    2 3.4 14.5    2 -1.0
## 6      ESQUIMALT HARBOUR 48.432 -123.439   BC 8.8    0  NA 13.1    0  1.9
##   DwTn  S DwS S.N     P DwP P.N S_G Pd BS DwBS BS.   HDD CDD  Stn_No
## 1    0  0   0  NA 178.8   0  NA   0 12 NA   NA  NA 273.3   0 1011500
## 2    0  0   0   0 258.6   0 104   0 12 NA   NA  NA 307.0   0 1012040
## 3    9  0   9  NA 264.6   9  NA  NA 11 NA   NA  NA 168.1   0 1012055
## 4   NA NA  NA  NA    NA  NA  NA  NA NA NA   NA  NA    NA  NA 1012475
## 5    2  0   2  NA 168.4   2  NA  NA 11 NA   NA  NA 267.7   0 1012573
## 6    0 NA  NA  NA  81.0   8  NA  NA 12 NA   NA  NA 258.6   0 1012710

Quick Data Exploration.

Names

names(data)
##  [1] "Stn_Name" "Lat"      "Long"     "Prov"     "Tm"       "DwTm"    
##  [7] "D"        "Tx"       "DwTx"     "Tn"       "DwTn"     "S"       
## [13] "DwS"      "S.N"      "P"        "DwP"      "P.N"      "S_G"     
## [19] "Pd"       "BS"       "DwBS"     "BS."      "HDD"      "CDD"     
## [25] "Stn_No"

Shape

dim(data)
## [1] 1341   25

Structure

str(data)
## 'data.frame':    1341 obs. of  25 variables:
##  $ Stn_Name: Factor w/ 1318 levels "100 MILE HOUSE 6NE",..: 203 255 616 298 307 347 413 680 717 799 ...
##  $ Lat     : num  48.9 48.8 48.8 48.4 48.7 ...
##  $ Long    : num  -124 -124 -124 -123 -124 ...
##  $ Prov    : Factor w/ 13 levels "AB","BC","MB",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Tm      : num  8.2 7 6.8 NA 7.7 8.8 8.9 7.2 NA NA ...
##  $ DwTm    : int  0 0 13 NA 2 0 7 1 NA NA ...
##  $ D       : num  NA 3 2.8 NA 3.4 NA NA NA NA NA ...
##  $ Tx      : num  13.5 15 16 12.5 14.5 13.1 13.5 12.7 NA NA ...
##  $ DwTx    : int  0 0 9 0 2 0 7 1 NA NA ...
##  $ Tn      : num  1 -3 -2.5 NA -1 1.9 2 2.2 NA NA ...
##  $ DwTn    : int  0 0 9 NA 2 0 7 0 NA NA ...
##  $ S       : num  0 0 0 NA 0 NA 0 NA 0 0 ...
##  $ DwS     : int  0 0 9 NA 2 NA 7 NA 0 0 ...
##  $ S.N     : int  NA 0 NA NA NA NA NA NA 0 0 ...
##  $ P       : num  179 259 265 NA 168 ...
##  $ DwP     : int  0 0 9 NA 2 8 7 10 0 0 ...
##  $ P.N     : int  NA 104 NA NA NA NA NA NA 95 114 ...
##  $ S_G     : int  0 0 NA NA NA NA 0 NA 0 0 ...
##  $ Pd      : int  12 12 11 NA 11 12 10 12 16 13 ...
##  $ BS      : logi  NA NA NA NA NA NA ...
##  $ DwBS    : logi  NA NA NA NA NA NA ...
##  $ BS.     : logi  NA NA NA NA NA NA ...
##  $ HDD     : num  273 307 168 NA 268 ...
##  $ CDD     : int  0 0 0 NA 0 0 0 0 NA NA ...
##  $ Stn_No  : Factor w/ 1341 levels "1011500","1012040",..: 1 2 3 4 5 6 7 8 9 10 ...

Missing Data

table(is.na(data))
## 
## FALSE  TRUE 
## 23184 10341

Quite a large amount of information is missing here.

round(table(is.na(data))[2]/sum(table(is.na(data)))*100,2)
##  TRUE 
## 30.85

This shows that about 31% of all data points are missing. We can visualize where these missing values occur.

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(data,y.at = F,y.labels = F)

We can drop variables with more than 50% missing data.

d = data[, sapply(data, function(x) mean(is.na(x))) <= 0.5]  # keep columns with at most 50% NA
missmap(d,y.at = T,y.labels = F)

Data preparation for DBSCAN.

In the data preparation step we are going to subset only the variables that we will need for the analysis. These include the location and the temperature of the stations.

vars = c("Stn_Name","Lat","Long","Tx","Tn")
d.sub = dplyr::select(d, dplyr::all_of(vars))  # all_of() selects by a character vector in newer dplyr
head(d.sub)
##                 Stn_Name    Lat     Long   Tx   Tn
## 1              CHEMAINUS 48.935 -123.742 13.5  1.0
## 2 COWICHAN LAKE FORESTRY 48.824 -124.133 15.0 -3.0
## 3          LAKE COWICHAN 48.829 -124.052 16.0 -2.5
## 4       DISCOVERY ISLAND 48.425 -123.226 12.5   NA
## 5    DUNCAN KELVIN CREEK 48.735 -123.728 14.5 -1.0
## 6      ESQUIMALT HARBOUR 48.432 -123.439 13.1  1.9

Treating Missing values.

We are going to replace the missing values with the mean of the respective variables.

d.sub$Tx[is.na(d.sub$Tx)] <- mean(d.sub$Tx, na.rm = T)
d.sub$Tn[is.na(d.sub$Tn)] <- mean(d.sub$Tn, na.rm = T)
missmap(d.sub)

Quick EDA

library(leaflet)
library(htmlwidgets)
# limits of longitude and latitude
lln = min(d.sub$Long)
uln = max(d.sub$Long)
llat = min(d.sub$Lat)
ulat = max(d.sub$Lat)
# centres of longitude and latitude
centlon = (lln+uln)/2
centlat = (llat+ulat)/2

sbt = d.sub
mp = leaflet(sbt) %>%
  setView(centlon,centlat,zoom = 4) %>%
  addProviderTiles("OpenStreetMap") %>%
  addCircleMarkers(lng = sbt$Long,
                   lat = sbt$Lat,
                   popup = sbt$Stn_Name,
                   fillColor = "Black",
                   fillOpacity = 1,
                   radius = 4,
                   stroke = F)
mp

Clustering by Location.

locs = dplyr::select(d.sub,Lat,Long)

# scaling the data points

locs.scaled = scale(locs,center = T,scale = T)

head(locs.scaled)
##             Lat      Long
## [1,] -0.3619795 -1.162576
## [2,] -0.3798354 -1.179367
## [3,] -0.3790311 -1.175889
## [4,] -0.4440201 -1.140416
## [5,] -0.3941523 -1.161975
## [6,] -0.4428940 -1.149563
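As a side note, scale() stores the centring and scaling constants as attributes of its result, which is handy if we later want to convert scaled coordinates back into degrees:

# the centring/scaling constants used by scale()
attr(locs.scaled, "scaled:center")
attr(locs.scaled, "scaled:scale")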

Computing DBSCAN using the dbscan package

db = dbscan::dbscan(locs.scaled,eps=0.15,minPts = 12)
db
## DBSCAN clustering for 1341 objects.
## Parameters: eps = 0.15, minPts = 12
## The clustering contains 4 cluster(s) and 271 noise points.
## 
##   0   1   2   3   4 
## 271 603  22 401  44 
## 
## Available fields: cluster, eps, minPts

Plotting DBSCAN results.

factoextra::fviz_cluster(db,locs.scaled,stand = F,ellipse = T,geom = "point")
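As a simpler alternative to fviz_cluster(), a base-R plot of the same result (with noise points drawn in black) could look like this:

# base-R alternative: colour points by cluster; noise (cluster 0) is black
plot(locs.scaled, col = db$cluster + 1L, pch = 20,
     main = "DBSCAN (eps = 0.15, minPts = 12)")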

Displaying the DBSCAN results

print(db)
## DBSCAN clustering for 1341 objects.
## Parameters: eps = 0.15, minPts = 12
## The clustering contains 4 cluster(s) and 271 noise points.
## 
##   0   1   2   3   4 
## 271 603  22 401  44 
## 
## Available fields: cluster, eps, minPts

Cluster Membership

Please note that noise/outlier points are coded as 0. Let's have a look at a random subset.

set.seed(2)
db$cluster[sample(1:length(db$cluster),60)]
##  [1] 1 3 1 1 3 3 1 3 0 1 1 0 3 1 1 3 4 0 1 1 3 1 3 1 1 1 1 1 3 1 1 1 3 3 1
## [36] 0 3 0 3 1 3 0 1 1 3 3 3 1 1 3 1 1 3 0 0 3 3 3 0 3
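To work with the clustering downstream, we can attach the labels back onto the station data (the column name cluster is my own choice here):

# attach cluster labels to the stations; 0 = noise/outlier
d.sub$cluster = db$cluster
head(d.sub[d.sub$cluster == 0, c("Stn_Name", "Lat", "Long")])  # e.g. inspect the noise stations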

DBSCAN is sensitive to the choice of \(\epsilon\), particularly when clusters have different densities. A smaller epsilon causes sparser clusters to be labelled as noise, while a larger epsilon may cause denser clusters to be merged together. It follows that when local densities vary, a single eps value may not be appropriate, as the sketch below illustrates.
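A quick way to see this sensitivity on our data (the eps values here are chosen arbitrarily for demonstration):

# how cluster count and noise vary with eps
for (e in c(0.05, 0.15, 0.5)) {
  fit = dbscan::dbscan(locs.scaled, eps = e, minPts = 12)
  cat("eps =", e,
      "-> clusters:", length(setdiff(unique(fit$cluster), 0)),
      "| noise points:", sum(fit$cluster == 0), "\n")
}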

Determining the optimal eps value.

To determine the optimal distance, consider every point and its average distance to its nearest neighbours. We can use the k-nearest-neighbour distance matrix to compute this, with the value of k set to MinPts.

We then plot these distances in ascending order, with the aim of finding the “knee”, which corresponds to the optimal eps parameter. The “knee” here is the threshold at which a sharp change occurs along the k-distance curve.

Back in R, we can use the kNNdistplot() function from the dbscan package.

dbscan::kNNdistplot(locs.scaled, k = 12)
abline(h = 0.15, lty = 2, col = "red")  # mark the knee at eps = 0.15

From the chart above, the knee sits at roughly eps = 0.15.

Cluster Prediction with DBSCAN.

DBSCAN can be used to predict cluster points in a new dataset.

Based on the data we are working with in this project, I am going to split it into training and test sets, then use the test set for prediction.

library(dbscan)
# train/test split
set.seed(2)
trainIndex = sample(nrow(locs.scaled), size = round(0.8 * nrow(locs.scaled)))
X_train = locs.scaled[trainIndex, ]
X_test = locs.scaled[-trainIndex, ]

# model application
dbsc = fpc::dbscan(X_train, eps = 0.15, MinPts = 12)

# cluster prediction: assign each held-out point to a training cluster
preds = fpc::predict.dbscan(object = dbsc, data = X_train, newdata = X_test)

Let's see how many clusters were predicted.

table(preds)
## preds
##   0   1   2   3   4   5   6 
##  55 106  82   8   9   6   2

From the output, the model assigned the test points to six clusters, with 55 points labelled as noise (cluster 0). This differs from the four clusters found on the full dataset: the model here was fitted on only the 80% training subset, so the recovered cluster structure can change.