Introduction To Clustering

The steps to create clusters are:
* Step 1
* Step 2

Problem Definition
Perform cluster analysis on the iris dataset available in R.

Data Location
Available in R as the built-in iris data frame (datasets package).

Data Description
The iris dataset (included with R) contains four measurements for 150 flowers representing three species of iris (Iris setosa, versicolor and virginica).
The dataset contains the following columns:
1. Sepal.Length
2. Sepal.Width
3. Petal.Length
4. Petal.Width
5. Species

Setup

Load Libs

library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("NbClust")
library(NbClust)

Functions

detect_na <- function(inp) {
  sum(is.na(inp))
}

Load Dataset

dfrDataset <- iris
head(dfrDataset)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Dataframe Structure

str(dfrDataset)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Dataframe Summary

lapply(dfrDataset, FUN=summary)
## $Sepal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900 
## 
## $Sepal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400 
## 
## $Petal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900 
## 
## $Petal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500 
## 
## $Species
##     setosa versicolor  virginica 
##         50         50         50

Missing Data

lapply(dfrDataset, FUN=detect_na)
## $Sepal.Length
## [1] 0
## 
## $Sepal.Width
## [1] 0
## 
## $Petal.Length
## [1] 0
## 
## $Petal.Width
## [1] 0
## 
## $Species
## [1] 0

Plot Relationship

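# index the colour vector by species so each species gets one colour
# (without the index the three colours are recycled row by row)
plot(dfrDataset, col=c("red","green","blue")[dfrDataset$Species])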

Plot Sepal Relationship

ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=Species)) + geom_point(size=0.9)

Plot Petal Relationship

ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=Species)) + geom_point(size=0.9)

Create Test Dataframe

tstDataset <- select(dfrDataset, -Species)
head(tstDataset)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

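Note:
Next, we calculate the “Within-Cluster Sum of Squares (WSS)” for a range of cluster counts.
WSS measures homogeneity within clusters: it is the sum of squared distances from each point to the centre of its cluster, so lower values mean tighter clusters.
The wssplot() function below plots WSS against the number of clusters.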
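
Note:
To make the definition concrete, the sketch below computes WSS by hand for one k-means run and compares it with the value kmeans() reports. The choice of 3 clusters, the nstart value and the names numData, fit and wssByHand are assumptions made for illustration only; they are not part of the analysis in this notebook.

```r
# Sketch: WSS computed by hand versus the value reported by kmeans()
numData <- iris[, 1:4]                      # numeric columns only
set.seed(707)
fit <- kmeans(numData, centers = 3, nstart = 10)

# For each cluster: centre the points on the cluster mean,
# square the deviations and add them up
wssByHand <- sapply(split(numData, fit$cluster), function(grp) {
  sum(scale(grp, center = TRUE, scale = FALSE)^2)
})

wssByHand                                   # per-cluster WSS
sum(wssByHand)                              # total WSS
all.equal(sum(wssByHand), fit$tot.withinss)

# With a single cluster, WSS is the total sum of squares,
# which equals (n - 1) times the sum of the column variances
totalSS <- (nrow(numData) - 1) * sum(apply(numData, 2, var))
all.equal(totalSS, fit$totss)
```

The last two lines show why a one-cluster WSS can be computed from the column variances without running kmeans() at all.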

wss Function

# create function wssplot: plot WSS for 1..nc clusters
wssplot <- function(data, nc=30, seed=707){
    # WSS for a single cluster is the total sum of squares
    wss <- (nrow(data)-1)*sum(apply(data,2,var))
    for (i in 2:nc){
        # reset the seed so every k-means run is reproducible
        set.seed(seed)
        wss[i] <- sum(kmeans(data, centers=i)$withinss)
    }
    plot(1:nc, wss, type="b", xlab="Number of Clusters",
        ylab="Within groups sum of squares")
}
# calling function wssplot()
wssplot(tstDataset)

Note:
We have plotted WSS against the number of clusters. The plot shows little further decrease in WSS once the number of clusters goes beyond 7 or 8.
This graph is known as an “Elbow Curve”, and the bending point (nc = 7 or 8 in our case) is known as the “Elbow Point”.
From the plot we conclude that 7 or 8 clusters should give reasonably homogeneous clusters.
Let’s fix the number of clusters at “8”. The clusters themselves are built below with hierarchical clustering (hclust() followed by cutree()), not with kmeans(); k-means is used here only to choose the number of clusters.
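
Note:
Reading an elbow off a plot is a judgment call, so it can help to look at the numbers behind the curve. The sketch below tabulates WSS and the percentage drop from one cluster count to the next. The upper limit of 10 clusters, the nstart and iter.max settings, and the names numData, maxK, wss, pctDrop and wssTable are assumptions for illustration, not values used elsewhere in this notebook.

```r
# Sketch: tabulate WSS and its relative decrease for k = 1..maxK
numData <- iris[, 1:4]
maxK <- 10                                   # assumed upper limit
set.seed(707)

wss <- numeric(maxK)
wss[1] <- (nrow(numData) - 1) * sum(apply(numData, 2, var))   # one cluster
for (k in 2:maxK) {
  # several random starts make each WSS value less dependent on the seed
  wss[k] <- kmeans(numData, centers = k, nstart = 25, iter.max = 50)$tot.withinss
}

# percentage drop in WSS when going from k - 1 to k clusters
pctDrop <- c(NA, round(100 * -diff(wss) / head(wss, -1), 1))
wssTable <- data.frame(k = 1:maxK, wss = round(wss, 2), pctDrop = pctDrop)
wssTable
```

A candidate elbow is the last cluster count with a large pctDrop before the values level off.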

Create Clusters

lstHClust <- hclust(dist(tstDataset))
plot(lstHClust)

Cut Clusters

intClustSize <- 8
vctCutClust <- cutree(lstHClust, intClustSize)
lstHClust
## 
## Call:
## hclust(d = dist(tstDataset))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 150
vctCutClust
##   [1] 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 1
##  [36] 1 2 1 1 1 1 1 1 2 2 1 2 1 2 1 3 3 3 4 3 4 3 5 3 4 5 4 4 3 4 3 4 4 3 4
##  [71] 6 4 6 3 3 3 3 3 3 4 4 4 4 6 4 3 3 3 4 4 4 3 4 5 4 4 4 3 5 4 7 6 8 7 7
## [106] 8 4 8 7 8 7 6 7 6 6 7 7 8 8 3 7 6 8 6 7 8 6 6 7 8 8 8 7 6 6 8 7 7 6 7
## [141] 7 7 6 7 7 7 6 7 7 6
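
Note:
cutree() only cuts the existing tree, so the same hclust() result can be cut at several cluster counts without re-clustering. The sketch below shows the cluster sizes for 8 clusters and checks that a coarser cut is a merge of the finer one. The comparison against 3 clusters and the names hc, cut8, cut3 and nested are assumptions added for illustration.

```r
# Sketch: cut the same dendrogram at two different cluster counts
hc <- hclust(dist(iris[, 1:4]))   # defaults: Euclidean distance, complete linkage
cut8 <- cutree(hc, k = 8)
cut3 <- cutree(hc, k = 3)

table(cut8)                       # number of flowers in each of the 8 clusters

# Cross-tabulate the two cuts: in every row exactly one cell is non-zero,
# because each fine cluster lies entirely inside one coarse cluster
nested <- table(cut8, cut3)
nested
all(rowSums(nested > 0) == 1)
```

Passing h instead of k to cutree() cuts the tree at a given height rather than at a given number of clusters.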

Validation

dfrDataset$Cluster <- vctCutClust
table(dfrDataset$Species, dfrDataset$Cluster)
##             
##               1  2  3  4  5  6  7  8
##   setosa     29 21  0  0  0  0  0  0
##   versicolor  0  0 20 23  4  3  0  0
##   virginica   0  0  1  1  0 14 22 12

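Note:
The table shows species overlap in clusters 3, 4 and 6: clusters 3 and 4 are mostly versicolor with one virginica each, and cluster 6 is mostly virginica with three versicolor. Clusters 1, 2, 5, 7 and 8 each contain a single species.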
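
Note:
The overlap can also be summarised as a single number. One common summary is cluster purity: the share of flowers that belong to the majority species of their cluster. The sketch below is a minimal illustration of that idea and recomputes the clusters so that it runs on its own; purity is not a measure used elsewhere in this notebook, and the names hc, clusters, tab, majority, purity and mixed are assumptions.

```r
# Sketch: cluster purity from the species-by-cluster table
hc <- hclust(dist(iris[, 1:4]))
clusters <- cutree(hc, k = 8)
tab <- table(iris$Species, clusters)

# largest species count in each cluster
majority <- apply(tab, 2, max)

# purity: flowers in their cluster's majority species / all flowers
purity <- sum(majority) / sum(tab)
purity

# clusters that contain more than one species
mixed <- names(which(colSums(tab > 0) > 1))
mixed
```

Applied to the table printed above, the majority counts are 29, 21, 20, 23, 4, 14, 22 and 12, which gives a purity of 145 / 150. Purity rises as the number of clusters grows, so it should be read together with the cluster count.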

Plot Sepal Relationship

# map the cluster to colour inside aes() so the manual scale applies
ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=factor(Cluster))) + 
    geom_point(size=0.9) +
    scale_color_manual(name="Cluster", values=rainbow(intClustSize))

Plot Petal Relationship

# map the cluster to colour inside aes() so the manual scale applies
ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=factor(Cluster))) +
    geom_point(size=0.9) +
    scale_color_manual(name="Cluster", values=rainbow(intClustSize))

Note:
Judging by the plots and the validation table, the clusters are quite good: apart from a few exceptions, no cluster mixes different species. Keep in mind that with 8 clusters and 3 species, each species is split across several clusters.

Wind Up

print("Wind Up")
## [1] "Wind Up"