DBSCAN works by defining a cluster as a maximal set of density-connected points.
We can think of it as a process of grouping similar objects together, or simply as a search for similarity. Given a set of points in some space, it groups together points that are closely packed and marks as outliers the points that lie alone in low-density regions.
Typical applications include:
Clustering of weather data
Best locations for service centers
Crime rate data clustering
There are two parameters that are taken into account:
Epsilon \(\epsilon\): the maximum radius of the neighbourhood around a point x (also called the \(\epsilon\)-neighbourhood of x).
Minimum points (MinPts): the minimum number of points required in the \(\epsilon\)-neighbourhood to define a cluster.
There are also three classifications of points, namely core, border and outlier points.
A core point has at least MinPts points in its \(\epsilon\)-neighbourhood, including itself. These are points in the interior of a cluster.
A border point has fewer than MinPts points within its \(\epsilon\)-neighbourhood, but can still be reached from the cluster (it lies in the neighbourhood of a core point).
A noise/outlier point is one that cannot be reached from any cluster.
An object y is directly reachable from an object x if x is a core object and y is in x's \(\epsilon\)-neighbourhood.
Pick a random point that has not yet been assigned to a cluster or designated as an outlier. Determine whether it is a core point. If not, label the point as an outlier.
Once a core point has been found, add all points directly reachable from it to its cluster. Then perform neighbourhood jumps to each reachable point and add them to the cluster. If a point previously labelled as an outlier is added, relabel it as a border point.
Repeat these steps until all points are assigned to a cluster or labelled as outliers (a minimal sketch of this procedure in R is given below).
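To make the steps concrete, here is a minimal, un-optimized sketch of this procedure in base R. The function name naive_dbscan and its structure are our own illustration, not the implementation used later in this post.
# A minimal, illustrative sketch of the DBSCAN procedure described above.
# Written for clarity, not speed; not part of any package.
naive_dbscan <- function(X, eps, minPts) {
  n <- nrow(X)
  labels  <- integer(n)          # 0 = noise / not yet assigned to a cluster
  visited <- logical(n)
  dists   <- as.matrix(dist(X))  # pairwise Euclidean distances
  cluster_id <- 0
  for (i in seq_len(n)) {
    if (visited[i]) next
    visited[i] <- TRUE
    nbrs <- which(dists[i, ] <= eps)      # epsilon-neighbourhood (includes i)
    if (length(nbrs) < minPts) next       # not a core point; stays noise unless
                                          # later reached by some cluster
    cluster_id <- cluster_id + 1          # found a core point: start a new cluster
    labels[i] <- cluster_id
    queue <- setdiff(nbrs, i)             # points directly reachable from i
    while (length(queue) > 0) {
      j <- queue[1]; queue <- queue[-1]
      if (labels[j] == 0) labels[j] <- cluster_id   # outlier reached -> border point
      if (!visited[j]) {
        visited[j] <- TRUE
        j_nbrs <- which(dists[j, ] <= eps)
        if (length(j_nbrs) >= minPts)               # j is also a core point:
          queue <- c(queue, j_nbrs)                 # keep expanding the cluster
      }
    }
  }
  labels                                  # 0 = noise, 1..k = cluster membership
}
In practice we will rely on the optimized implementations in the fpc and dbscan packages instead.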
To perform this algorithm in R, we will need three R packages, namely:
fpc
dbscan
factoextra
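If any of these packages are not yet installed, they can be obtained from CRAN:
# One-off installation from CRAN, if the packages are not yet available
install.packages(c("fpc", "dbscan", "factoextra"))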
The dbscan function from the fpc package can help us with this task, and its signature looks something like this:
function (data, eps, MinPts = 5, scale = FALSE, method = c("hybrid", "raw", "dist"), seeds = TRUE, showplot = FALSE, countmode = NULL) NULL
Let's see what these arguments mean.
data: the data matrix you are working on, or the dissimilarity matrix.
eps: reachability maximum distance.
MinPts: reachability minimum number of points.
scale: if TRUE, the data will be scaled.
method: possible values for this parameter include:
dist: treats the data as a distance matrix.
raw: treats the data as raw data.
hybrid: also expects raw data, but calculates partial distance matrices.
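As an illustration of these arguments (using the built-in iris data rather than our weather data, with arbitrary example values for eps and MinPts), a call might look like this:
# Illustrative call on the built-in iris data; eps and MinPts are example values
toy <- scale(iris[, 1:4])                         # raw numeric data, standardized
fit <- fpc::dbscan(toy, eps = 0.6, MinPts = 5,
                   scale = FALSE, method = "raw")
fit$cluster                                       # 0 = noise, 1..k = cluster labels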
In this project we are going to investigate a weather stations dataset from Environment Canada's monthly values.
The main objective here is to cluster the weather stations based on their distance from each other.
Below is a quick overview of the dataset variables.
Stn_Name Station Name
Lat Latitude (North+, degrees)
Long Longitude (West - , degrees)
Prov Province
Tm Mean Temperature (°C)
DwTm Days without Valid Mean Temperature
D Mean Temperature difference from Normal (1981-2010) (°C)
Tx Highest Monthly Maximum Temperature (°C)
DwTx Days without Valid Maximum Temperature
Tn Lowest Monthly Minimum Temperature (°C)
DwTn Days without Valid Minimum Temperature
S Snowfall (cm)
DwS Days without Valid Snowfall
S%N Percent of Normal (1981-2010) Snowfall
P Total Precipitation (mm)
DwP Days without Valid Precipitation
P%N Percent of Normal (1981-2010) Precipitation
S_G Snow on the ground at the end of the month (cm)
Pd Number of days with Precipitation 1.0 mm or more
BS Bright Sunshine (hours)
DwBS Days without Valid Bright Sunshine
BS% Percent of Normal (1981-2010) Bright Sunshine
HDD Degree Days below 18 °C
CDD Degree Days above 18 °C
Stn_No Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically).
NA Not Available
data = read.csv("C:\\Users\\RuralNet011\\Documents\\GitHub\\Machine-Learning-with-R\\unsupervised learning\\DBSCAN\\weather.csv")
head(data)
## Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn
## 1 CHEMAINUS 48.935 -123.742 BC 8.2 0 NA 13.5 0 1.0
## 2 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0 3.0 15.0 0 -3.0
## 3 LAKE COWICHAN 48.829 -124.052 BC 6.8 13 2.8 16.0 9 -2.5
## 4 DISCOVERY ISLAND 48.425 -123.226 BC NA NA NA 12.5 0 NA
## 5 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2 3.4 14.5 2 -1.0
## 6 ESQUIMALT HARBOUR 48.432 -123.439 BC 8.8 0 NA 13.1 0 1.9
## DwTn S DwS S.N P DwP P.N S_G Pd BS DwBS BS. HDD CDD Stn_No
## 1 0 0 0 NA 178.8 0 NA 0 12 NA NA NA 273.3 0 1011500
## 2 0 0 0 0 258.6 0 104 0 12 NA NA NA 307.0 0 1012040
## 3 9 0 9 NA 264.6 9 NA NA 11 NA NA NA 168.1 0 1012055
## 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1012475
## 5 2 0 2 NA 168.4 2 NA NA 11 NA NA NA 267.7 0 1012573
## 6 0 NA NA NA 81.0 8 NA NA 12 NA NA NA 258.6 0 1012710
Names
names(data)
## [1] "Stn_Name" "Lat" "Long" "Prov" "Tm" "DwTm"
## [7] "D" "Tx" "DwTx" "Tn" "DwTn" "S"
## [13] "DwS" "S.N" "P" "DwP" "P.N" "S_G"
## [19] "Pd" "BS" "DwBS" "BS." "HDD" "CDD"
## [25] "Stn_No"
Shape
dim(data)
## [1] 1341 25
Structure
str(data)
## 'data.frame': 1341 obs. of 25 variables:
## $ Stn_Name: Factor w/ 1318 levels "100 MILE HOUSE 6NE",..: 203 255 616 298 307 347 413 680 717 799 ...
## $ Lat : num 48.9 48.8 48.8 48.4 48.7 ...
## $ Long : num -124 -124 -124 -123 -124 ...
## $ Prov : Factor w/ 13 levels "AB","BC","MB",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Tm : num 8.2 7 6.8 NA 7.7 8.8 8.9 7.2 NA NA ...
## $ DwTm : int 0 0 13 NA 2 0 7 1 NA NA ...
## $ D : num NA 3 2.8 NA 3.4 NA NA NA NA NA ...
## $ Tx : num 13.5 15 16 12.5 14.5 13.1 13.5 12.7 NA NA ...
## $ DwTx : int 0 0 9 0 2 0 7 1 NA NA ...
## $ Tn : num 1 -3 -2.5 NA -1 1.9 2 2.2 NA NA ...
## $ DwTn : int 0 0 9 NA 2 0 7 0 NA NA ...
## $ S : num 0 0 0 NA 0 NA 0 NA 0 0 ...
## $ DwS : int 0 0 9 NA 2 NA 7 NA 0 0 ...
## $ S.N : int NA 0 NA NA NA NA NA NA 0 0 ...
## $ P : num 179 259 265 NA 168 ...
## $ DwP : int 0 0 9 NA 2 8 7 10 0 0 ...
## $ P.N : int NA 104 NA NA NA NA NA NA 95 114 ...
## $ S_G : int 0 0 NA NA NA NA 0 NA 0 0 ...
## $ Pd : int 12 12 11 NA 11 12 10 12 16 13 ...
## $ BS : logi NA NA NA NA NA NA ...
## $ DwBS : logi NA NA NA NA NA NA ...
## $ BS. : logi NA NA NA NA NA NA ...
## $ HDD : num 273 307 168 NA 268 ...
## $ CDD : int 0 0 0 NA 0 0 0 0 NA NA ...
## $ Stn_No : Factor w/ 1341 levels "1011500","1012040",..: 1 2 3 4 5 6 7 8 9 10 ...
Missing Data
table(is.na(data))
##
## FALSE TRUE
## 23184 10341
Quite a large amount of missing data here.
round(table(is.na(data))[2]/sum(table(is.na(data)))*100,2)
## TRUE
## 30.85
This shows that about 31% of the total data points are missing. We can visualize where these values occur.
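For a per-variable view (output not shown here), something like the following can be used:
# Approximate share of missing values per variable, highest first
round(sort(colMeans(is.na(data)), decreasing = TRUE) * 100, 1)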
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(data,y.at = F,y.labels = F)
We can drop variables with more than 50% missing data.
d = data[, sapply(data, function(x) mean(is.na(x))) <= 0.5]
missmap(d,y.at = T,y.labels = F)
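To double-check which variables were removed by this rule, we can compare the column names (output not shown here):
# Variables dropped by the 50% missingness rule
setdiff(names(data), names(d))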
In the data preparation step we are going to subset only the variables that we will need for the analysis. These include the location and the temperature of the stations.
vars = c("Stn_Name","Lat","Long","Tx","Tn")
d.sub = dplyr::select(d,vars)
head(d.sub)
## Stn_Name Lat Long Tx Tn
## 1 CHEMAINUS 48.935 -123.742 13.5 1.0
## 2 COWICHAN LAKE FORESTRY 48.824 -124.133 15.0 -3.0
## 3 LAKE COWICHAN 48.829 -124.052 16.0 -2.5
## 4 DISCOVERY ISLAND 48.425 -123.226 12.5 NA
## 5 DUNCAN KELVIN CREEK 48.735 -123.728 14.5 -1.0
## 6 ESQUIMALT HARBOUR 48.432 -123.439 13.1 1.9
We are going to replace the missing values with the mean of the respective variables.
d.sub$Tx[is.na(d.sub$Tx)] <- mean(d.sub$Tx, na.rm = T)
d.sub$Tn[is.na(d.sub$Tn)] <- mean(d.sub$Tn, na.rm = T)
missmap(d.sub)
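As a quick numeric check alongside the missingness map, we can confirm that no missing values remain in the subset:
# No missing values should remain in Tx and Tn after imputation
colSums(is.na(d.sub))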
library(leaflet)
library(htmlwidgets)
# limits of longitude and latitude
lln = min(d.sub$Long)
uln = max(d.sub$Long)
llat = min(d.sub$Lat)
ulat = max(d.sub$Lat)
# centers of longitude and latitude
centlon = (lln+uln)/2
centlat = (llat+ulat)/2
sbt = d.sub
mp = leaflet(sbt) %>%
setView(centlon,centlat,zoom = 4) %>%
addProviderTiles("OpenStreetMap") %>%
addCircleMarkers(lng = sbt$Long,
lat = sbt$Lat,
popup = sbt$Stn_Name,
fillColor = "Black",
fillOpacity = 1,
radius = 4,
stroke = F)
mp
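The htmlwidgets package loaded above can be used to save the interactive map to a standalone HTML file if needed; the file name below is just an example:
# Optionally save the interactive map to a standalone HTML file
htmlwidgets::saveWidget(mp, "weather_stations_map.html", selfcontained = TRUE)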
locs = dplyr::select(d.sub,Lat,Long)
# scaling the data points
locs.scaled = scale(locs,center = T,scale = T)
head(locs.scaled)
## Lat Long
## [1,] -0.3619795 -1.162576
## [2,] -0.3798354 -1.179367
## [3,] -0.3790311 -1.175889
## [4,] -0.4440201 -1.140416
## [5,] -0.3941523 -1.161975
## [6,] -0.4428940 -1.149563
db = dbscan::dbscan(locs.scaled,eps=0.15,minPts = 12)
db
## DBSCAN clustering for 1341 objects.
## Parameters: eps = 0.15, minPts = 12
## The clustering contains 4 cluster(s) and 271 noise points.
##
## 0 1 2 3 4
## 271 603 22 401 44
##
## Available fields: cluster, eps, minPts
factoextra::fviz_cluster(db,locs.scaled,stand = F,ellipse = T,geom = "point")
Please note that the noise/outlier points are coded as 0. Let's have a look at a random subset.
set.seed(2)
db$cluster[sample(1:length(db$cluster),60)]
## [1] 1 3 1 1 3 3 1 3 0 1 1 0 3 1 1 3 4 0 1 1 3 1 3 1 1 1 1 1 3 1 1 1 3 3 1
## [36] 0 3 0 3 1 3 0 1 1 3 3 3 1 1 3 1 1 3 0 0 3 3 3 0 3
DBSCAN is sensitive to the choice of \(\epsilon\), in particular when clusters have different densities. A smaller epsilon leads to sparser clusters being labelled as noise, while a larger epsilon may cause denser clusters to be merged. With this reasoning it can be seen that, when local densities differ, a single eps value may not be appropriate.
To determine the optimal distance, consider every point and its average distance to its k nearest neighbours. We can therefore use the k-nearest-neighbour distances to compute this, with k set to the value of MinPts.
We then plot these distances in ascending order with the aim of finding the "knee", which corresponds to the optimal eps parameter. The "knee" here is the threshold where a sharp change occurs along the k-distance curve.
Back in R, we can use the kNNdistplot() function from the dbscan package.
dbscan::kNNdistplot(locs.scaled,k=12)
abline(h=0.15,lty = 2,col=rainbow(1),main="eps optimal value")
From the chart above, the optimal value of eps is approximately 0.15.
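To see the sensitivity discussed above in practice, we could re-run the clustering with a slightly larger eps and compare the cluster sizes; the value 0.25 below is just an illustrative alternative:
# Re-run the clustering with a larger eps to illustrate the sensitivity
db_alt = dbscan::dbscan(locs.scaled, eps = 0.25, minPts = 12)
table(db$cluster)      # result with eps = 0.15
table(db_alt$cluster)  # result with eps = 0.25 (typically fewer, larger clusters)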
DBSCAN can also be used to predict cluster membership for points in a new dataset.
Based on the data we are working with in this project, I am going to split it into train and test sets and then use the test part for prediction.
library(dbscan)
#train test split.
set.seed(2)
trainIndex = sample(nrow(locs.scaled),size = round(0.8*nrow(locs)))
X_train = locs.scaled[trainIndex,]
X_test = locs.scaled[-trainIndex,]
#model application
dbsc = fpc::dbscan(X_train,eps = 0.15,MinPts = 12)
# Cluster prediction
preds = fpc::predict.dbscan(object = dbsc, data = X_train, newdata = X_test)
Let's see how many clusters were predicted.
table(preds)
## preds
## 0 1 2 3 4 5 6
## 55 106 82 8 9 6 2
We can see from the output that the model fitted on the training sample assigns the test points to six clusters (labels 1 to 6) plus noise (label 0), rather than the four clusters found earlier on the full dataset. This is not surprising: DBSCAN is density based, so fitting it on an 80% sample with the same eps and minPts can produce a different cluster structure.