For this assignment, we will use the “iris” dataset that ships with R.
Selecting only the first four columns of the “iris” dataset:
rm(list = ls());gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 368362 19.7 592000 31.7 460000 24.6
## Vcells 559183 4.3 1023718 7.9 790604 6.1
data <- iris[,1:4]
dim(data)
## [1] 150 4
Creating an empty vector “tot_wss” to store the total within-cluster sum of squares for each number of clusters:
tot_wss <- c()
tot_wss
## NULL
Next, we run k-means once for each number of clusters from 1 to 15:
for (i in 1:15) {
  cl <- kmeans(data, centers = i)
  tot_wss[i] <- cl$tot.withinss
}
tot_wss
## [1] 681.37060 152.34795 78.85144 57.22847 49.82228 47.61943 38.18643
## [8] 29.98894 28.50441 26.80009 24.44235 23.37951 21.46090 20.82540
## [15] 24.75149
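Each entry of the above vector is the total within-cluster sum of squares for one value of k (the number of clusters), not for an individual cluster. Because kmeans() starts from randomly chosen centers, a single run can stop in a local optimum, which is why the value for 15 clusters (24.75) is higher than the value for 14 clusters (20.83).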
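A minimal sketch of making the loop reproducible, assuming a fixed seed and several random restarts per value of k. The seed 123, nstart = 25 and the name tot_wss_stable are illustrative choices, not part of the original analysis:

```r
# Sketch: same loop as above, but reproducible and less sensitive to the
# random initial centers. Seed and nstart values are illustrative assumptions.
data <- iris[, 1:4]

set.seed(123)                      # fix the random number generator
tot_wss_stable <- numeric(15)      # preallocate instead of growing c()

for (k in 1:15) {
  # nstart = 25 runs k-means from 25 random starts and keeps the best fit
  fit <- kmeans(data, centers = k, nstart = 25)
  tot_wss_stable[k] <- fit$tot.withinss
}

tot_wss_stable
```

The result is kept in a separate vector so that the tot_wss values printed above stay untouched.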
Plotting the values of tot_wss against the number of clusters:
plot(x = 1:15,
     y = tot_wss,
     type = "b",
     xlab = "Number of clusters",
     ylab = "Within groups sum of squares")
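From the above graph we can observe that the within groups sum of squares generally decreases as the number of clusters increases. The largest drops occur when going from 1 to 2 and from 2 to 3 clusters; beyond that the curve flattens, so the “elbow” suggests 2 or 3 clusters.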
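To put numbers on the shape of the curve, the relative reduction at each step can be computed. A small sketch, using the tot_wss values copied from the output printed earlier (the name pct_drop is illustrative):

```r
# Values copied from the tot_wss output shown earlier in this document
tot_wss <- c(681.37060, 152.34795, 78.85144, 57.22847, 49.82228,
             47.61943, 38.18643, 29.98894, 28.50441, 26.80009,
             24.44235, 23.37951, 21.46090, 20.82540, 24.75149)

# Percentage reduction when moving from k clusters to k + 1 clusters
pct_drop <- -diff(tot_wss) / head(tot_wss, -1) * 100
names(pct_drop) <- paste0(1:14, "->", 2:15)

round(pct_drop, 1)
```

The first two steps give the largest relative reductions; the negative final entry reflects the increase from 14 to 15 clusters in that particular run.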
Here we will use the NbClust package to determine the number of clusters. It evaluates a large number of indices that measure how suitable a clustering is.
Clearing memory and loading the library:
rm(list = ls());gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 376237 20.1 750400 40.1 565988 30.3
## Vcells 570630 4.4 1308461 10.0 934067 7.2
library(NbClust)
Using the “iris” dataset as mentioned before. Here we are ignoring the 5th column, since it is categorical.
data <- iris[,-c(5)]
As the evaluation produces 2 graphs, we need to reduce the default margins to accommodate them:
par(mar=c(2,2,2,2))
NbClust measures the appropriateness of a clustering using a number of indices.
By default, it checks from 2 clusters to 15 clusters.
nb <- NbClust(data, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 10 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
As we can see from the above, according to the majority rule, the best number of clusters is 2.
Drawing a histogram to see how various indices have voted for number of clusters:
hist(nb$Best.nc[1,],breaks = 15)
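From the above histogram, we can observe that 2 clusters received the most votes (10). Of the 26 indices computed, 24 propose a number of clusters (the Hubert and D indices are graphical only), which matches the vote counts in the summary.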
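The majority rule is a simple tally of the proposals. A minimal sketch, with the vote counts hard-coded from the printed NbClust summary rather than taken from the nb object (the names votes, vote_table and best_k are illustrative):

```r
# Vote counts copied from the NbClust summary printed above
votes <- c(rep(2, 10), rep(3, 8), rep(4, 2), 5, 8, 14, 15)

vote_table <- table(votes)           # number of indices per proposed k
vote_table

# Majority rule: the k with the most votes
best_k <- as.integer(names(vote_table)[which.max(vote_table)])
best_k
```

Note that 3 clusters is a close second with 8 votes, so the majority here is not overwhelming.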
Here we will use the “vegan” package.
Clearing memory and loading the library:
rm(list = ls());gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 404631 21.7 750400 40.1 750400 40.1
## Vcells 592028 4.6 1308461 10.0 1308461 10.0
library(vegan)
## Warning: package 'vegan' was built under R version 3.3.3
## Loading required package: permute
## Warning: package 'permute' was built under R version 3.3.3
## Loading required package: lattice
## This is vegan 2.4-2
Reading the “iris” dataset:
data <- iris[,-c(5)]
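The Calinski criterion (Calinski-Harabasz index) is the ratio of between-cluster variance (BC) to within-cluster variance (WC), each scaled by its degrees of freedom: (BC / (k - 1)) / (WC / (n - k)) for k clusters and n observations.
cascadeKM() runs k-means for 1 to 10 groups, using 100 random starting configurations for each number of groups: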
model <- cascadeKM(data, 1, 10, iter = 100)
Below are two figures. One is a heatmap-style plot of the partitions, and the other shows the criterion value (BC/WC) against the number of clusters:
plot(model, sortg = TRUE)
The Calinski values for each number of groups are shown below:
model$results[2,]
## 1 groups 2 groups 3 groups 4 groups 5 groups 6 groups 7 groups
## NA 513.9245 561.6278 530.7658 495.5415 473.8506 449.6410
## 8 groups 9 groups 10 groups
## 440.6205 414.5753 392.6606
To check for the maximum value i.e. highest BC/WC value:
which.max(model$results[2,])
## 3 groups
## 3
Since 3 groups have the highest BC/WC value among all of these, the number of clusters should be 3.
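The criterion can also be computed directly from a kmeans() fit, which makes the BC/WC ratio concrete. A minimal sketch, assuming the standard Calinski-Harabasz definition with degrees-of-freedom scaling; the seed and nstart = 25 are illustrative choices:

```r
# Sketch: Calinski-Harabasz value for k = 3, computed from a kmeans() fit.
data <- iris[, 1:4]
n <- nrow(data)   # number of observations
k <- 3            # number of clusters

set.seed(123)
fit <- kmeans(data, centers = k, nstart = 25)

between <- fit$betweenss     # between-cluster sum of squares (BC)
within  <- fit$tot.withinss  # total within-cluster sum of squares (WC)

# Scale each sum of squares by its degrees of freedom, then take the ratio
calinski <- (between / (k - 1)) / (within / (n - k))
calinski
```

If the fit reaches the same partition as cascadeKM(), the result should agree with the 561.6278 reported for 3 groups above.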