To decide the number of clusters

For this assignment, we will use the “iris” dataset that ships with R.

METHOD 1

Selecting only the first four columns (the numeric measurements) of the “iris” dataset:

rm(list = ls());gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 368362 19.7     592000 31.7   460000 24.6
## Vcells 559183  4.3    1023718  7.9   790604  6.1

data <- iris[,1:4]
dim(data)

## [1] 150   4

Creating an empty vector “tot_wss” to store the total within-cluster sum of squares for each number of clusters:

tot_wss <- c()
tot_wss

## NULL

Next, we run k-means 15 times, once for each number of clusters from 1 to 15:

for(i in 1:15)
{
  cl <- kmeans(data,centers = i)
  tot_wss[i] <- cl$tot.withinss
}
tot_wss

##  [1] 681.37060 152.34795  78.85144  57.22847  49.82228  47.61943  38.18643
##  [8]  29.98894  28.50441  26.80009  24.44235  23.37951  21.46090  20.82540
## [15]  24.75149
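Because kmeans() starts from randomly chosen centers, rerunning the loop gives slightly different values, and an unlucky start can leave a larger number of clusters with a higher total (as at 15 clusters above). A minimal sketch of a more repeatable version follows; the seed, nstart = 25 and iter.max = 50 are illustrative choices, not part of the original analysis:

```r
set.seed(42)                  # arbitrary seed, used only for reproducibility
data <- iris[, 1:4]
tot_wss <- numeric(15)        # preallocate instead of growing the vector
for (k in 1:15) {
  # nstart = 25 keeps the best of 25 random initialisations for each k
  fit <- kmeans(data, centers = k, nstart = 25, iter.max = 50)
  tot_wss[k] <- fit$tot.withinss
}
round(tot_wss, 2)
```

With several starts per value of k, the totals should differ much less from run to run than with a single start.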

The vector above holds the total within-cluster sum of squares for each number of clusters.
Plotting the values of tot_wss against the number of clusters:

plot(x=1:15,
     y=tot_wss,
     type = "b",
     xlab = "Number of clusters",
     ylab = "Within groups sum of squares")

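From the above graph we can observe that as the number of clusters increases, the within-groups sum of squares generally decreases. The drop is steep up to about 3 clusters and much flatter after that; this bend (the “elbow”) suggests roughly 3 clusters.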
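One way to make the elbow less subjective is to look at how much each extra cluster reduces the sum of squares. The sketch below reuses the tot_wss values printed earlier; the names gain and pct_gain are illustrative, not from the original analysis:

```r
# Values copied from the tot_wss output printed above
tot_wss <- c(681.37060, 152.34795, 78.85144, 57.22847, 49.82228, 47.61943,
             38.18643, 29.98894, 28.50441, 26.80009, 24.44235, 23.37951,
             21.46090, 20.82540, 24.75149)

# Reduction in sum of squares gained by adding one more cluster
gain <- -diff(tot_wss)

# The same reduction as a percentage of the previous value
pct_gain <- 100 * gain / tot_wss[-length(tot_wss)]

round(pct_gain, 1)
```

Large percentages for the first steps followed by small ones point to the same elbow as the plot; a negative entry marks a step where the sum of squares went up.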

METHOD 2

Here we will use the NbClust library to determine the number of clusters. It evaluates a large number of cluster-validity indices.
Clearing memory and loading the library:

rm(list = ls());gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 376237 20.1     750400 40.1   565988 30.3
## Vcells 570630  4.4    1308461 10.0   934067  7.2

library(NbClust)

Using the “iris” dataset as before. Again we drop the 5th column, since it is categorical:

data <- iris[,-c(5)]

As our evaluation will produce 2 graphs, we need to reduce the default margins to accommodate them:

par(mar=c(2,2,2,2))

NbClust measures how appropriate each number of clusters is using a number of indices.
By default, it checks from 2 to 15 clusters.

nb <- NbClust(data, method = "kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 10 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

As we can see from the above, according to the majority rule, the best number of clusters is 2.
Drawing a histogram to see how various indices have voted for number of clusters:

hist(nb$Best.nc[1,],breaks = 15)

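From the above histogram, we can observe that 2 clusters received the most votes (10), closely followed by 3 clusters (8). The votes in the summary add up to 24 rather than 26 because the Hubert and D indices are graphical and do not propose a number.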
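A bar chart with one bar per proposed number of clusters can be easier to read than a histogram with 15 breaks. The sketch below uses the vote counts copied from the NbClust summary above rather than the nb object, so it runs without the package; the vector name votes is illustrative:

```r
# Vote counts copied from the NbClust summary above
# (names are the proposed numbers of clusters)
votes <- c("2" = 10, "3" = 8, "4" = 2, "5" = 1, "8" = 1, "14" = 1, "15" = 1)

barplot(votes,
        xlab = "Number of clusters",
        ylab = "Number of indices voting")

# Number of clusters with the most votes
names(votes)[which.max(votes)]
```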

METHOD 3

Here we will use the “vegan” package.
Clearing memory and loading the library:

rm(list = ls());gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 404631 21.7     750400 40.1   750400 40.1
## Vcells 592028  4.6    1308461 10.0  1308461 10.0

library(vegan)

## Warning: package 'vegan' was built under R version 3.3.3

## Loading required package: permute

## Warning: package 'permute' was built under R version 3.3.3

## Loading required package: lattice

## This is vegan 2.4-2

Reading the “iris” dataset, again without the categorical 5th column:

data <- iris[,-c(5)]

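The Calinski criterion is the ratio of between-cluster variance to within-cluster variance (BC/WC), with each term divided by its degrees of freedom. Larger values indicate better-separated clusters. cascadeKM() uses this criterion by default and runs k-means for each number of groups, here from 1 to 10: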

model <- cascadeKM(data, 1, 10, iter = 100)

Below are two figures. One is a heatmap of group membership for each number of clusters, and the other plots the Calinski value (BC/WC) against the number of clusters.

plot(model, sortg = TRUE)

The Calinski values for each number of groups are shown below (the value for 1 group is NA because the criterion is undefined for a single cluster):

model$results[2,]

##  1 groups  2 groups  3 groups  4 groups  5 groups  6 groups  7 groups 
##        NA  513.9245  561.6278  530.7658  495.5415  473.8506  449.6410 
##  8 groups  9 groups 10 groups 
##  440.6205  414.5753  392.6606
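The 3-group value above can be cross-checked by hand from a plain kmeans() fit. This is a minimal sketch, assuming the usual Calinski-Harabasz form (BSS / (k - 1)) / (WSS / (n - k)); calinski() is a hypothetical helper written for this illustration and is not part of vegan, and the seed and nstart are arbitrary choices:

```r
# Calinski-Harabasz index for a k-means fit on n observations:
# (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
calinski <- function(fit, n) {
  k <- nrow(fit$centers)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

set.seed(1)                                   # arbitrary seed for reproducibility
data <- iris[, -5]
fit3 <- kmeans(data, centers = 3, nstart = 25)  # best of 25 random starts
ch3 <- calinski(fit3, nrow(data))
ch3
```

If the fit reaches the same 3-cluster solution as cascadeKM(), ch3 should agree with the “3 groups” entry printed above.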

To find the number of groups with the maximum value, i.e. the highest BC/WC:

which.max(model$results[2,])

## 3 groups 
##        3

Since 3 groups have the highest BC/WC value of all, the number of clusters should be 3.

Ritesh Satapathy

6 April 2017
