Clustering_K-meansClass_IRIS

Introduction Clustering

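The steps to create clusters are:
* Step 1: Choose the number of clusters from a plot of the within-cluster sum of squares (WSS) against the number of clusters.
* Step 2: Run kmeans() with that number of clusters and compare the resulting clusters with the known species.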

Problem Definition
Perform cluster analysis on the iris dataset available in R.

Data Location
Available In R

Data Description
The iris dataset (included with R) contains four measurements for 150 flowers representing three species of iris (Iris setosa, versicolor and virginica).
The dataset contains the following columns:
1. Sepal.Length
2. Sepal.Width
3. Petal.Length
4. Petal.Width
5. Species

Setup

Load Libs

library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("NbClust")
library(NbClust)

Functions

detect_na <- function(inp) {
  sum(is.na(inp))
}

Load Dataset

dfrDataset <- iris
head(dfrDataset)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Dataframe Structure

str(dfrDataset)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Dataframe Summary

lapply(dfrDataset, FUN=summary)

## $Sepal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900 
## 
## $Sepal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.800   3.000   3.057   3.300   4.400 
## 
## $Petal.Length
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.600   4.350   3.758   5.100   6.900 
## 
## $Petal.Width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500 
## 
## $Species
##     setosa versicolor  virginica 
##         50         50         50

Missing Data

lapply(dfrDataset, FUN=detect_na)

## $Sepal.Length
## [1] 0
## 
## $Sepal.Width
## [1] 0
## 
## $Petal.Length
## [1] 0
## 
## $Petal.Width
## [1] 0
## 
## $Species
## [1] 0

Plot Relationship

# Index the palette by species so each point is colored by its own species
plot(dfrDataset, col=c("red","green","blue")[dfrDataset$Species])

Plot Sepal Relationship

ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=Species)) + geom_point(size=0.9)

Plot Petal Relationship

ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=Species)) + geom_point(size=0.9)

Create Test Dataframe

tstDataset <- select(dfrDataset, -Species)
head(tstDataset)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

Note:
Next, we calculate the within-cluster sum of squares (WSS) for a range of cluster counts.
WSS measures the homogeneity within clusters: the lower the WSS, the closer the points in each cluster are to their cluster mean.
The wssplot() function plots WSS against the number of clusters.
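
To make the definition concrete, here is a minimal sketch that recomputes WSS by hand and compares it with the values reported by kmeans(). The names features, fit and manual_wss are illustrative and not part of the original analysis, and centers = 3 is an arbitrary choice made only for this illustration.

```r
# Illustrative only: recompute the within-cluster sum of squares by hand.
features <- iris[, 1:4]          # the four numeric measurements
set.seed(707)
fit <- kmeans(features, centers = 3)

# For each cluster, sum the squared deviations of its points from the cluster mean.
manual_wss <- sapply(seq_len(nrow(fit$centers)), function(k) {
  pts <- as.matrix(features[fit$cluster == k, , drop = FALSE])
  sum(sweep(pts, 2, colMeans(pts))^2)
})

# Both comparisons should return TRUE.
all.equal(manual_wss, fit$withinss, check.attributes = FALSE)
all.equal(sum(manual_wss), fit$tot.withinss)
```

The total of the per-cluster values is what kmeans() stores in tot.withinss, which is the quantity an elbow plot tracks.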

wss Function

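# Plot the within-cluster sum of squares (WSS) for 1 to nc clusters
wssplot <- function(data, nc=30, seed=707){
    # With a single cluster, WSS equals the total sum of squares
    wss <- (nrow(data)-1)*sum(apply(data,2,var))
    for (i in 2:nc){
        # Reset the seed so every run starts from reproducible random centers
        set.seed(seed)
        wss[i] <- sum(kmeans(data, centers=i)$withinss)
    }
    plot(1:nc, wss, type="b", xlab="Number of Clusters",
        ylab="Within groups sum of squares")
}
# Draw the elbow curve for the four measurement columns
wssplot(tstDataset)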

Note:
The plot shows WSS against the number of clusters. WSS decreases only slightly once the number of clusters goes beyond 7 or 8.
This plot is known as an “Elbow Curve”, and the bending point (nc = 7 or 8 in our case) is known as the “Elbow Point”.
From the plot we conclude that 7 or 8 clusters should give clusters with good homogeneity within themselves.
Let’s fix the number of clusters at 8 and call the kmeans() function to create the clusters.
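
Reading an elbow by eye is subjective, so it can help to tabulate how much WSS drops with each added cluster. The sketch below is one possible way to do that. The names features, ks, wss and drop_pct are illustrative, and nstart = 10 is an assumption added here to make each fit less dependent on its random start; it is not used in the original analysis.

```r
# Illustrative only: tabulate the percentage drop in WSS for each extra cluster.
features <- iris[, 1:4]
ks <- 1:10

wss <- sapply(ks, function(k) {
  set.seed(707)
  # nstart = 10 keeps the best of ten random starts (an added assumption)
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

# Drop from k-1 to k clusters, as a percentage of the WSS at k-1.
drop_pct <- round(100 * -diff(wss) / wss[-length(wss)], 1)
data.frame(k = ks[-1], wss = round(wss[-1], 2), drop_pct = drop_pct)
```

Small values of drop_pct mark the range where adding clusters no longer buys much homogeneity.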

Create Clusters

intClustSize <- 8
set.seed(707)
lstKmeans <- kmeans(tstDataset,intClustSize)
lstKmeans$centers

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.512500    4.000000     1.475000    0.275000
## 2     6.442105    2.978947     4.594737    1.431579
## 3     5.100000    3.513043     1.526087    0.273913
## 4     5.532143    2.635714     3.960714    1.228571
## 5     6.568182    3.086364     5.536364    2.163636
## 6     6.036842    2.705263     5.000000    1.778947
## 7     7.475000    3.125000     6.300000    2.050000
## 8     4.678947    3.084211     1.378947    0.200000

lstKmeans$size

## [1]  8 19 23 28 22 19 12 19

lstKmeans

## K-means clustering with 8 clusters of sizes 8, 19, 23, 28, 22, 19, 12, 19
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.512500    4.000000     1.475000    0.275000
## 2     6.442105    2.978947     4.594737    1.431579
## 3     5.100000    3.513043     1.526087    0.273913
## 4     5.532143    2.635714     3.960714    1.228571
## 5     6.568182    3.086364     5.536364    2.163636
## 6     6.036842    2.705263     5.000000    1.778947
## 7     7.475000    3.125000     6.300000    2.050000
## 8     4.678947    3.084211     1.378947    0.200000
## 
## Clustering vector:
##   [1] 3 8 8 8 3 1 8 3 8 8 1 3 8 8 1 1 1 3 1 3 3 3 8 3 3 8 3 3 3 8 8 3 1 1 8
##  [36] 8 3 3 8 3 3 8 8 3 3 8 3 8 3 3 2 2 2 4 2 4 2 4 2 4 4 4 4 2 4 2 4 4 6 4
##  [71] 6 4 6 2 2 2 2 2 2 4 4 4 4 6 4 2 2 2 4 4 4 2 4 4 4 4 4 2 4 4 5 6 7 5 5
## [106] 7 4 7 5 7 5 6 5 6 6 5 5 7 7 6 5 6 7 6 5 7 6 6 5 7 7 7 5 6 6 7 5 5 6 5
## [141] 5 5 6 5 5 5 6 5 5 6
## 
## Within cluster sum of squares by cluster:
## [1] 0.958750 3.708421 2.094783 9.749286 4.315455 4.125263 4.655000 2.488421
##  (between_SS / total_SS =  95.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Note:
kmeans() formed 8 clusters: cluster 1 contains 8 observations, cluster 2 contains 19, and so on.
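
The components listed in the printout can also be read directly from the fitted object. The following sketch refits the model under the illustrative names features and fit and shows how the sizes and the between_SS / total_SS ratio are derived. Depending on the R version, the random starts, and therefore the exact cluster sizes, may differ from the printout above.

```r
# Illustrative only: inspect the components of a kmeans() result.
features <- iris[, 1:4]
set.seed(707)
fit <- kmeans(features, centers = 8)

# Cluster sizes are simply counts of the cluster labels.
sizes <- tabulate(fit$cluster, nbins = 8)
identical(sizes, as.integer(fit$size))

# The total sum of squares splits into within-cluster and between-cluster parts.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Share of the total variation explained by the clustering, in percent.
round(100 * fit$betweenss / fit$totss, 1)
```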

Validation

dfrDataset$Cluster <-lstKmeans$cluster
table(dfrDataset$Species, dfrDataset$Cluster)

##             
##               1  2  3  4  5  6  7  8
##   setosa      8  0 23  0  0  0  0 19
##   versicolor  0 19  0 27  0  4  0  0
##   virginica   0  0  0  1 22 15 12  0

Note:
The clusters are fairly pure: almost every cluster contains a single species. The exceptions are cluster 4 (27 versicolor, 1 virginica) and cluster 6 (4 versicolor, 15 virginica).
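
One way to summarize the table in a single number is cluster purity: the share of observations that belong to the majority species of their cluster. The sketch below re-enters the counts from the table above by hand, so it does not depend on rerunning kmeans(). The names tab and purity are illustrative.

```r
# Illustrative only: purity of the species-by-cluster table shown above.
# Counts are entered column by column (setosa, versicolor, virginica per cluster).
tab <- matrix(
  c( 8,  0,  0,
     0, 19,  0,
    23,  0,  0,
     0, 27,  1,
     0,  0, 22,
     0,  4, 15,
     0,  0, 12,
    19,  0,  0),
  nrow = 3,
  dimnames = list(c("setosa", "versicolor", "virginica"), 1:8)
)

# Majority count per cluster, summed and divided by the number of observations.
purity <- sum(apply(tab, 2, max)) / sum(tab)
round(purity, 4)  # 0.9667, i.e. 145 of 150 flowers sit with their cluster's majority species
```

Purity always rises as clusters are added, so it is best read alongside the number of clusters rather than on its own.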

Plot Sepal Relationship

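# Map the cluster to the color aesthetic so that scale_color_manual() takes effect
ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=factor(Cluster))) +
    geom_point(size=0.9) +
    scale_color_manual(name="Cluster", values=rainbow(intClustSize))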

Plot Petal Relationship

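# Map the cluster to the color aesthetic so that scale_color_manual() takes effect
ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=factor(Cluster))) +
    geom_point(size=0.9) +
    scale_color_manual(name="Cluster", values=rainbow(intClustSize))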

Wind Up

print("Wind Up")

## [1] "Wind Up"

Clustering_K-meansClass_IRIS_Dataset

Shubhendu Awasthi

July 01, 2017