1.0 Problem Definition

Overview

This project is submitted in partial fulfillment of my internship at The Spark Foundation.

Specifying the Research Question

To determine the optimal number of clusters for the given Iris dataset and represent the output visually.

Defining the Metric for Success

This project will be successful if we are able to answer the research question.

Recording the Experimental Design

  1. Problem Definition
  2. Data Sourcing
  3. Check the Data
  4. Perform Data Cleaning
  5. Modelling
  6. Challenge the Solution

2.0 Data Sourcing

Dataset

The dataset provided by the client can be downloaded from: https://drive.google.com/file/d/11Iq7YvbWZbt8VXjfm06brx66b10YiwK-/view

Understanding the dataset

The dataset consists of the following attributes:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class: Iris Setosa, Iris Versicolour, Iris Virginica

3.0 Check the Data

Loading the Libraries

# Libraries
library(data.table)
library(plyr)
library(ggplot2)
library(moments)
library(ggcorrplot)
library(caret)
library(rpart)
library(rpart.plot)
library(gridExtra)
library(grid)
library(lattice)
library(tidyverse)  # data manipulation
library(cluster)    # clustering algorithms
library(factoextra) # clustering algorithms & visualization
library(GGally)
library("ape")
library("ggplot2")
library("ggdendro")

Checking the working directory

getwd()
## [1] "C:/Users/HP/Desktop"

Loading the Dataset

df <- read.csv('Iris.csv')

Previewing the top of our dataset

head(df)
##   Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm     Species
## 1  1           5.1          3.5           1.4          0.2 Iris-setosa
## 2  2           4.9          3.0           1.4          0.2 Iris-setosa
## 3  3           4.7          3.2           1.3          0.2 Iris-setosa
## 4  4           4.6          3.1           1.5          0.2 Iris-setosa
## 5  5           5.0          3.6           1.4          0.2 Iris-setosa
## 6  6           5.4          3.9           1.7          0.4 Iris-setosa

Previewing the bottom of the dataset

tail(df)
##      Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm        Species
## 145 145           6.7          3.3           5.7          2.5 Iris-virginica
## 146 146           6.7          3.0           5.2          2.3 Iris-virginica
## 147 147           6.3          2.5           5.0          1.9 Iris-virginica
## 148 148           6.5          3.0           5.2          2.0 Iris-virginica
## 149 149           6.2          3.4           5.4          2.3 Iris-virginica
## 150 150           5.9          3.0           5.1          1.8 Iris-virginica

Checking the data types

# Data set structure.
str(df)
## 'data.frame':    150 obs. of  6 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SepalLengthCm: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ SepalWidthCm : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ PetalLengthCm: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ PetalWidthCm : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species      : chr  "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...

4.0 Perform Data Cleaning

Validity

# Checking for unnecessary columns
colnames(df)
## [1] "Id"            "SepalLengthCm" "SepalWidthCm"  "PetalLengthCm"
## [5] "PetalWidthCm"  "Species"

All columns seem necessary for this study.

# Checking for anomalies
summary(df)
##        Id         SepalLengthCm    SepalWidthCm   PetalLengthCm  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.054   Mean   :3.759  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   PetalWidthCm     Species         
##  Min.   :0.100   Length:150        
##  1st Qu.:0.300   Class :character  
##  Median :1.300   Mode  :character  
##  Mean   :1.199                     
##  3rd Qu.:1.800                     
##  Max.   :2.500

There are no anomalies in the dataset; all values fall within plausible ranges.

Completeness

# checking for missing values
colSums(is.na(df))
##            Id SepalLengthCm  SepalWidthCm PetalLengthCm  PetalWidthCm 
##             0             0             0             0             0 
##       Species 
##             0

There are no missing values.

Consistency

# checking for duplicates
sum(duplicated(df))
## [1] 0

There are no duplicates in this dataset.

Uniformity

# Checking column names uniformity
colnames(df)
## [1] "Id"            "SepalLengthCm" "SepalWidthCm"  "PetalLengthCm"
## [5] "PetalWidthCm"  "Species"

The column names are uniform.

Outliers

Using boxplots to check for outliers in numerical columns

# Selecting numerical columns only
num_cols <- unlist(lapply(df, is.numeric))   # Identify numeric columns
num_df <- df[ , num_cols]                    # Subset the numeric columns
# Boxplots
par(mfrow = c(2, 2))
for (i in seq_along(num_df)) {
  boxplot(num_df[i], main = paste('Boxplot of', names(num_df)[i]),
          ylab = 'Value')
}

There are outliers in Sepal Width, but we will keep them for further analysis.

# Dropping the Id and target (Species) columns
df <- subset(df, select = -c(Id, Species))

5.0 Modelling

K-Means Clustering

Normalization

Standardize the dataset using R's scale() function.

scale_data <- as.data.frame(scale(df))

Getting a summary of the unscaled data

summary(df)
##  SepalLengthCm    SepalWidthCm   PetalLengthCm    PetalWidthCm  
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Comparing it with the scaled data

summary(scale_data)
##  SepalLengthCm       SepalWidthCm     PetalLengthCm      PetalWidthCm    
##  Min.   :-1.86378   Min.   :-2.4308   Min.   :-1.5635   Min.   :-1.4396  
##  1st Qu.:-0.89767   1st Qu.:-0.5858   1st Qu.:-1.2234   1st Qu.:-1.1776  
##  Median :-0.05233   Median :-0.1245   Median : 0.3351   Median : 0.1328  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5674   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.1043   Max.   : 1.7804   Max.   : 1.7052

Scaling transforms each variable so that it has a mean of zero and unit variance.

The scale() function applies this standardization by centering each variable and dividing it by its standard deviation.
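As a quick sanity check (a minimal sketch, not part of the original analysis; manual_scaled is just an illustrative name), we can reproduce what scale() does by hand:

# Sketch: standardization is (x - mean(x)) / sd(x), applied column-wise
manual_scaled <- as.data.frame(
  apply(df, 2, function(x) (x - mean(x)) / sd(x))
)
all.equal(manual_scaled, scale_data, check.attributes = FALSE)  # should be TRUE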

Model

Applying the k-means clustering algorithm with the number of centroids k = 3.

result <- kmeans(scale_data, 3)

# Previewing the no. of records in each cluster

result$size 
## [1] 47 53 50
# Getting the cluster centers (3 centers for k=3)
result$centers 
##   SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## 1    1.13217737    0.0962759     0.9929445    1.0137756
## 2   -0.05005221   -0.8773526     0.3463713    0.2811215
## 3   -1.01119138    0.8394944    -1.3005215   -1.2509379
# Getting the cluster vector that shows the cluster where each record falls
result$cluster
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   3   3   3   3   3   3   3   3   3   3   1   1   1   2   2   2   1   2   2   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   2   2   2   2   2   1   2   2   2   2   1   2   2   2   2   1   1   1   2   2 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   2   2   2   2   2   1   1   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   2   1   1   1   1   2   1   1   1   1   1   1   2   2   1   1   1   1   2 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   1   2   1   2   1   1   2   1   1   1   1   1   1   2   2   1   1   1   2   1 
## 141 142 143 144 145 146 147 148 149 150 
##   1   1   2   1   1   1   2   1   1   2

The kmeans() function returns several quantities that indicate how compact each cluster is and how well separated the clusters are from one another.

betweenss. The between-cluster sum of squares. In an optimal segmentation, we expect this value to be as high as possible, since we want the clusters to be well separated from one another.

withinss. Vector of within-cluster sums of squares, one component per cluster. In an optimal segmentation, we expect each component to be as low as possible, since we want homogeneity within each cluster.

tot.withinss. Total within-cluster sum of squares.

totss. The total sum of squares.

# Between-cluster sum of squares
result$betweenss
## [1] 455.974
# Within-cluster sum of squares
result$withinss
## [1] 47.60995 44.25778 48.15831
# Total within-cluster sum of squares 
result$tot.withinss
## [1] 140.026
# Total sum of squares
result$totss
## [1] 596
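These quantities satisfy the identity totss = betweenss + tot.withinss (indeed, 455.974 + 140.026 = 596). A minimal sketch to verify this on the fitted result object:

# Sanity check: the total sum of squares decomposes exactly into
# the between-cluster and total within-cluster sums of squares
all.equal(result$totss, result$betweenss + result$tot.withinss)  # TRUE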

Determining the Number of Clusters (k)

Elbow Criterion

To study graphically which value of k gives the best partition, we can plot betweenss and tot.withinss against the choice of k.

bss <- numeric()
wss <- numeric()

# Run the algorithm for different values of k 
set.seed(1234)

for(i in 1:10){

  # For each k, fit one model and record both betweenss and tot.withinss
  # (fitting a single model per k keeps the two statistics consistent)
  km <- kmeans(scale_data, centers = i)
  bss[i] <- km$betweenss
  wss[i] <- km$tot.withinss

}

# Between-cluster sum of squares vs Choice of k
p3 <- qplot(1:10, bss, geom=c("point", "line"), 
            xlab="Number of clusters", ylab="Between-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Total within-cluster sum of squares vs Choice of k
p4 <- qplot(1:10, wss, geom=c("point", "line"),
            xlab="Number of clusters", ylab="Total within-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Subplot
grid.arrange(p3, p4, ncol=2)

Conveniently, the elbow method computation is wrapped up in a single function, fviz_nbclust().

fviz_nbclust(scale_data, kmeans, method = "wss")

From the elbow method, the optimal number of clusters is k = 3.

Average Silhouette Method
# Function to compute the average silhouette width for k clusters
avg_sil <- function(k) {
  km.res <- kmeans(scale_data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(scale_data))  # distances on the scaled data
  mean(ss[, 3])
}

# Compute and plot the average silhouette width for k = 2 to k = 15
k.values <- 2:15

# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)

plot(k.values, avg_sil_values,
       type = "b", pch = 19, frame = FALSE, 
       xlab = "Number of clusters K",
       ylab = "Average Silhouettes")

Similar to the elbow method, the computation for the average silhouette method is wrapped up in a single function (fviz_nbclust):

fviz_nbclust(scale_data, kmeans, method = "silhouette")

The silhouette method gives an optimal number of clusters of k = 2.

Gap Statistic Method
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(scale_data, FUN = kmeans, nstart = 5,
                    K.max = 10, B = 10)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = scale_data, FUNcluster = kmeans, K.max = 10, B = 10,     nstart = 5)
## B=10 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 3
##           logW   E.logW       gap     SE.sim
##  [1,] 4.534831 4.774555 0.2397240 0.03699714
##  [2,] 4.024283 4.499760 0.4754762 0.02531904
##  [3,] 3.809974 4.300163 0.4901890 0.02432931
##  [4,] 3.699549 4.147151 0.4476022 0.02402350
##  [5,] 3.589439 4.051000 0.4615613 0.01996632
##  [6,] 3.527540 3.972617 0.4450770 0.01850089
##  [7,] 3.450057 3.908514 0.4584571 0.01915980
##  [8,] 3.406124 3.850725 0.4446006 0.02043261
##  [9,] 3.342182 3.798023 0.4558406 0.02132674
## [10,] 3.296782 3.754313 0.4575307 0.02450223

We can visualize the results with fviz_gap_stat, which, consistent with the printed output above, suggests 3 clusters as the optimal number.

fviz_gap_stat(gap_stat)

Implementing k=3

# Execution of k-means with k=3
set.seed(1234)

df_k <- kmeans(scale_data, centers=3)

# Mean values of each cluster
aggregate(df, by=list(df_k$cluster), mean)
##   Group.1 SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## 1       1      5.006000     3.418000      1.464000     0.244000
## 2       2      5.801887     2.673585      4.369811     1.413208
## 3       3      6.780851     3.095745      5.510638     1.972340
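Since the true species labels are known for this dataset, we can cross-tabulate them against the cluster assignments to see how well the clusters recover the species. This is a small sketch, not part of the original analysis; it re-reads the labels because the Species column was dropped during cleaning:

# Sketch: compare cluster assignments with the known species labels
species <- read.csv('Iris.csv')$Species   # labels were dropped from df earlier
table(Cluster = df_k$cluster, Species = species)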
# Pairwise plots coloured by cluster assignment
ggpairs(cbind(scale_data, Cluster=as.factor(df_k$cluster)),
        columns=1:5, aes(colour=Cluster, alpha=0.5),
        lower=list(continuous="points"),
        upper=list(continuous="blank"),
        axisLabels="none", switch="both") +
        theme_bw()

The pairwise plots above visualize how the records separate into the three clusters.
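For a complementary view, fviz_cluster() from the factoextra package loaded above projects the clusters onto the first two principal components; a minimal sketch:

# Sketch: PCA-based plot of the three clusters
fviz_cluster(df_k, data = scale_data, geom = "point",
             ellipse.type = "convex") +
  theme_bw()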

6.0 Challenge the Solution

From the visualizations, two of the clusters (in green and blue) overlap considerably, and the silhouette method groups them into a single cluster (k = 2). However, the elbow method and the gap statistic both point to k = 3, and given prior knowledge of the target variable (three species), we pick the elbow method's k = 3.