The goal is to use various clustering algorithms to determine the best number of clusters in the data.
myData <- read.csv("/Users/ruthokoilu/Desktop/ALL_HW/Project4_practise/OKOILU RUTH.csv", header = FALSE)
head(myData)
## V1 V2 V3
## 1 -62.50976 0.049984160 46.86403
## 2 51.20361 0.008728482 22.27773
## 3 183.55355 0.031539503 33.78600
## 4 -21.71337 0.025953398 16.07261
## 5 95.84524 0.002009111 12.64464
## 6 69.07477 0.030820270 24.37754
summary(myData)
## V1 V2 V3
## Min. :-93.788 Min. :9.239e-05 Min. : 7.704
## 1st Qu.: 5.549 1st Qu.:9.197e-03 1st Qu.:13.971
## Median : 38.745 Median :2.174e-02 Median :24.255
## Mean : 39.836 Mean :3.565e-02 Mean :24.777
## 3rd Qu.: 71.035 3rd Qu.:4.727e-02 3rd Qu.:34.531
## Max. :210.109 Max. :3.027e-01 Max. :51.563
dim(myData)
## [1] 565 3
The data has 565 rows and 3 columns.
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.2
library(ggplot2)
library(RColorBrewer)
colors <- brewer.pal(n=6, name="Dark2")
colors2 <- brewer.pal(n=12, name="Paired")
Here, we set the seed to ensure consistent and reproducible results.
set.seed(12345)
A 3D plot of the data points is given below:
library(rgl)
library(car)
## Warning: package 'car' was built under R version 3.4.3
scatter3d(x = myData[,2], y = myData[,1], z = myData[,3], groups = NULL, surface=FALSE)
rglwidget()
Three Clusters Observed
Observing the plot, we can roughly see 3 clusters. These clusters can be discovered by methods that use a distance metric to group data points (e.g. k-means and hierarchical methods), because each cluster consists of points that minimize the distance to other points in the same cluster while maximizing the distance to points in other clusters. Density-based methods like DBSCAN might not recover this structure.
1. Apply the hierarchical algorithm to the dataset.
1.1 Plot the dendrogram and the distance graph (if it is given by your package).
# Prepare hierarchical cluster
hc = hclust(dist(myData), method = "ward.D")
hc
##
## Call:
## hclust(d = dist(myData), method = "ward.D")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 565
# Dendrogram
plot(hc, hang = -5)
As we can see from the plot above, the merge heights of the dendrogram range from 0 to about 14000. A Euclidean distance matrix is used. Note: Ward’s criterion, used here, minimizes the total within-cluster variance.
We choose the cut height d such that a slight change in d does not lead to a completely different cluster partition. At about 14000 we have one cluster; at 12000, 2 clusters; at 3500, 3 clusters; at 3000, 4 clusters; and at 2000, about 5 clusters.
Visually determining where to cut our dendrogram: from the above, we can cut the dendrogram at a distance of 220, which gives 2 clusters. We can also form 3 and 4 clusters and compare them to select our final choice.
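As a quick sanity check on these cuts, we can count how many clusters each candidate height produces; this is a small sketch using the approximate heights read off the dendrogram above, so the exact counts depend on the merge heights stored in hc.
# Number of clusters obtained when cutting the dendrogram at candidate heights
cut_heights <- c(2000, 3000, 3500, 12000, 14000)
sapply(cut_heights, function(h) length(unique(cutree(hc, h = h))))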
cluster_grps_3<- cutree(hc, k = 3)
cluster_grps_4<- cutree(hc, k = 4)
myData$cluster_grps_3 <- cluster_grps_3
myData$cluster_grps_4 <- cluster_grps_4
1.3 Color the data according to their cluster, and do a 3D scatter diagram. Rotate the diagram to identify the clusters.
# Interactive 3D plot
var1 <- myData$V1
var2 <- myData$V2
var3 <- myData$V3
clusters_3 <- as.factor(myData$cluster_grps_3)
clusters_4 <- as.factor(myData$cluster_grps_4)
scatter3d(x = var2, y = var1, z = var3, groups = clusters_3, grid=FALSE, fit= "smooth")
rglwidget()
scatter3d(x = var2, y = var1, z = var3, groups = clusters_4, grid=FALSE, fit= "smooth")
rglwidget()
scatter3d(x = var2, y = var1, z = var3, groups = clusters_3, surface=FALSE, ellipsoid= TRUE)
rglwidget()
scatter3d(x = var2, y = var1, z = var3, groups = clusters_4, surface=FALSE, ellipsoid= TRUE)
rglwidget()
The Elbow plot below helps determine the optimal number of clusters.
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(NbClust)
wss_table1 <- fviz_nbclust(myData[,1:3], hcut, method = "wss")
wss_table1 +
geom_vline(xintercept = 3, linetype = 2)+
labs(subtitle = "Elbow method")
plot(hc, hang = -5)
rect.hclust(hc, k=4, border="red")
plot(hc, hang = -5)
rect.hclust(hc, k=3, border="red")
wss_table1$data
## clusters y
## 1 1 1531618.71
## 2 2 614378.76
## 3 3 376305.00
## 4 4 249160.71
## 5 5 196214.05
## 6 6 160282.32
## 7 7 129945.67
## 8 8 116605.98
## 9 9 105000.54
## 10 10 93823.02
For evaluation purposes, we use the total within sum of squared error (wss). From the 3D plot of 4 clusters, it looks as if the middle clusters are squashed together. From the wss table above, 3 clusters have a wss of 376305.00 and 4 clusters have a lower wss of 249160.71. Although 4 clusters have a lower wss, the elbow plot shows the optimal number of clusters more clearly, so 3 clusters will be chosen as the optimal number of clusters.
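The marginal gain from each additional cluster can also be read off the table numerically; the sketch below computes the percentage drop in wss when moving from k to k + 1 clusters, using the wss_table1 data frame shown above.
# Percentage decrease in wss from k to k+1 clusters (hierarchical)
wss_h <- wss_table1$data$y
round(100 * (head(wss_h, -1) - tail(wss_h, -1)) / head(wss_h, -1), 1)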
2.1 Apply the algorithm for several values of k starting with k=2.
myData_kmeans <- myData[,1:3]
# k-means for k = 2,3,4,5,6
two_means <- kmeans(myData_kmeans, 2, nstart = 25)
three_means <- kmeans(myData_kmeans, 3, nstart = 25)
four_means <- kmeans(myData_kmeans, 4, nstart = 25)
five_means <- kmeans(myData_kmeans, 5, nstart = 25)
six_means <- kmeans(myData_kmeans, 6, nstart = 25)
2.2 Use the elbow method to determine the best value of k.
wss_table2 <- fviz_nbclust(myData[,1:3], kmeans, method = "wss")
wss_table2 + geom_vline(xintercept = 3, linetype = 2) +
  labs(subtitle = "Elbow method")
# Total wss for all k means - notice only slight changes in wss between clusters 3,4 and 5
wss_table2$data
## clusters y
## 1 1 1531618.71
## 2 2 595856.99
## 3 3 353688.48
## 4 4 226415.34
## 5 5 178524.04
## 6 6 159458.86
## 7 7 117449.15
## 8 8 101999.90
## 9 9 92516.81
## 10 10 82134.62
2.3 For the best k value, color the data according to their cluster, and do a 3D scatter diagram. Rotate the diagram to identify visually the clusters.
scatter3d(x = var2, y = var1, z = var3, groups = as.factor(three_means$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors)
rglwidget()
scatter3d(x = var2, y = var1, z = var3, groups = as.factor(four_means$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors)
rglwidget()
Using the total within sum of squared error (wss), and judging from the 3D plots and the elbow plot, 3 clusters are indicated. From the wss table above, 3 clusters have a wss of 353688.48 and 4 clusters have a lower wss of 226415.34. One could conclude that the optimal number of clusters is between 3 and 4. For this experiment we choose 3, because the differences between the wss values for 3, 4, 5… clusters are small.
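Another way to hedge this choice is the share of the total sum of squares explained by the between-cluster component, which kmeans objects report directly via betweenss and totss; values closer to 1 indicate tighter, better-separated clusters.
# Proportion of total sum of squares explained by the clustering (k = 3 vs k = 4)
round(c(k3 = three_means$betweenss / three_means$totss,
        k4 = four_means$betweenss / four_means$totss), 3)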
3. Apply the DBSCAN algorithm to the dataset to determine the number of clusters.
3.1 For Minpts=3, use the elbow method to determine the best values of ε. Run the DBSCAN algorithm for the best value of ε and Minpts=3. Color the data according to their cluster, and do a 3D scatter diagram. Rotate the diagram to identify visually the clusters.
library(fpc)
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:fpc':
##
## dbscan
#library(factoextra)
myData_dbscan <- myData[,1:3]
dbscan::kNNdistplot(myData_dbscan, k = 3)
abline(h = 6, lty = 2)
From the elbow plot above, we choose ε = 6. The analysis for MinPts = 3 is shown below.
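Before committing to ε = 6, we can also inspect the 3-nearest-neighbour distances numerically; dbscan::kNNdist computes the same distances that kNNdistplot draws, so the quantiles below are only a rough cross-check of the chosen ε.
# Quantiles of the 3-NN distances; eps should sit near the knee of the sorted curve
knn_d <- dbscan::kNNdist(myData_dbscan, k = 3)
quantile(as.numeric(knn_d), probs = c(0.50, 0.75, 0.90, 0.95, 0.99))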
set.seed(12345)
db <- fpc::dbscan(myData_dbscan, eps = 6, MinPts = 3)
#library(factoextra)
fviz_cluster(db, data = myData_dbscan, stand = FALSE,
ellipse = TRUE, show.clust.cent = FALSE,
geom = "point",palette = "jco", ggtheme = theme_classic())
scatter3d(x = var2, y = var1, z = var3, groups = as.factor(db$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors2)
rglwidget()
From the above, we can see that a total of 9 clusters were formed, plus one additional noise cluster (cluster 0).
The analysis for MinPts = 4 is shown below.
set.seed(12345)
db <- fpc::dbscan(myData_dbscan, eps = 6, MinPts = 4)
fviz_cluster(db, data = myData_dbscan, stand = FALSE,
ellipse = TRUE, show.clust.cent = FALSE,
geom = "point",palette = "jco", ggtheme = theme_classic())
scatter3d(x = var2, y = var1, z = var3, groups = as.factor(db$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors2)
rglwidget()
From the above, we can see that a total of 9 clusters were formed, plus one additional noise cluster (cluster 0).
The analysis for MinPts = 5 is shown below.
set.seed(12345)
db <- fpc::dbscan(myData_dbscan, eps = 6, MinPts = 5)
fviz_cluster(db, data = myData_dbscan, stand = FALSE,
ellipse = TRUE, show.clust.cent = FALSE,
geom = "point",palette = "jco", ggtheme = theme_classic())
scatter3d(x = myData_dbscan[,1], y = myData_dbscan[,2], z = myData_dbscan[,3], groups = as.factor(db$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors2)
rglwidget()
From the above, we can see that a total of 8 clusters were formed, plus one additional noise cluster (cluster 0).
The analysis for MinPts = 6 is shown below. Observe the clusters formed.
set.seed(12345)
db <- fpc::dbscan(myData_dbscan, eps = 6, MinPts = 6)
fviz_cluster(db, data = myData_dbscan, stand = FALSE,
ellipse = TRUE, show.clust.cent = FALSE,
geom = "point",palette = "jco", ggtheme = theme_classic())
scatter3d(x = myData_dbscan[,1], y = myData_dbscan[,2], z = myData_dbscan[,3], groups = as.factor(db$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors2)
rglwidget()
From the above, we can see that a total of 7 clusters were formed, plus one additional noise cluster (cluster 0).
We can notice that as the value of MinPts increases, the number of clusters formed decreases.
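This trend can be verified programmatically by sweeping MinPts at the fixed ε = 6; since cluster 0 is the noise label in the fpc output, max(db_mp$cluster) gives the number of actual clusters. A minimal sketch:
# Number of clusters (excluding noise) for MinPts = 3..6 at eps = 6
sapply(3:6, function(mp) {
  db_mp <- fpc::dbscan(myData_dbscan, eps = 6, MinPts = mp)
  max(db_mp$cluster)
})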
4. Compare and discuss the results from all three methods. Identify the best clustering of the dataset.
Hierarchical Clustering
Here, we pick 3 clusters, which have a wss of 376305. We concluded that cutting the dendrogram into 3 clusters is optimal because a slight increase or decrease in the cut height d does not produce a different partition, unlike cuts at other heights. Our elbow diagram also suggests 3 clusters.
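For reference, the sizes of the three hierarchical clusters can be tabulated directly from the cutree labels computed earlier.
# Number of points assigned to each of the 3 hierarchical clusters
table(cluster_grps_3)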
scatter3d(x = var2, y = var1, z = var3, groups = clusters_3, surface=FALSE, ellipsoid= TRUE)
rglwidget()
plot(hc, hang = -1)
rect.hclust(hc, k=3, border="red")
K-Means Clustering
We have chosen k = 3 based on the elbow experiment; it gives a total within sum of squared error of 353688.48.
colors <- brewer.pal(n=6, name="Dark2")
scatter3d(x = var2, y = var1, z = var3, groups = as.factor(three_means$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors)
rglwidget()
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
As the name implies, DBSCAN groups the data objects into n clusters and places the noise into one additional cluster, so the large light blue cluster 0 is the noise cluster. DBSCAN is therefore good at finding outliers and noise. Aside from this, we can see that DBSCAN generates more clusters than the first two methods. The closest cluster formation is at MinPts = 6, where 7 clusters were formed. Because DBSCAN relies on density reachability and density connectivity, it was not able to find the 3 clusters that can be formed by minimizing distance.
db <- fpc::dbscan(myData_dbscan, eps = 6, MinPts = 6)
#db <-dbscan(myData_dbscan, eps = 6, minPts = 6)
scatter3d(x = var2, y = var1, z = var3, groups = as.factor(db$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors2)
rglwidget()
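To quantify how much of the data DBSCAN treats as noise, we can tabulate the cluster labels; label 0 is the noise group.
# Cluster sizes, including the noise group (label 0)
table(db$cluster)
sum(db$cluster == 0)  # number of points labelled as noise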
We were not provided with the ground truth (the true number of clusters present in the data set). However, as stated earlier, by looking at the data in 3D below, we can roughly see 3 clusters. We therefore choose the best method based on its ability to group the objects so that points in the same cluster are more similar and closer to each other than to those in other clusters, i.e. we pick the method that minimizes the distance within each cluster. DBSCAN does not do this. Both hierarchical and k-means clustering do, but k-means groups the points more adequately and achieves a lower total within sum of squared error, so k-means is chosen as the best method for this experiment.
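To ground this comparison, the two solutions can be put on the same footing by computing the total within-cluster sum of squares for the hierarchical 3-cluster labels directly; wss_for_labels below is a hypothetical helper written for this sketch, and the k-means value comes from tot.withinss.
# Total within-cluster sum of squares for an arbitrary labelling of the data
wss_for_labels <- function(x, labels) {
  sum(sapply(split(x, labels), function(grp) {
    centred <- scale(as.matrix(grp), center = TRUE, scale = FALSE)  # centre on the cluster centroid
    sum(centred^2)                                                  # squared distances to the centroid
  }))
}
wss_for_labels(myData[, 1:3], cluster_grps_3)  # hierarchical, k = 3
three_means$tot.withinss                       # k-means, k = 3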
scatter3d(x = myData_kmeans[,2], y = myData_kmeans[,1], z = myData_kmeans[,3], groups = as.factor(three_means$cluster), surface=FALSE, ellipsoid= TRUE, surface.col = colors)
rglwidget()
Note: if we had been provided a ground truth, our final choice might be different. Also, the question does not state what benchmark should be used to identify the best clustering of the dataset, so visual observation, the distance metric, and the total within sum of squared error were used.