This project is submitted in partial fulfillment of my internship at The Spark Foundation.
Research question: predict the optimal number of clusters for the given Iris data set and represent the output visually.
The project will be considered successful if we are able to answer this research question.
The data set provided by the client can be downloaded from this link: https://drive.google.com/file/d/11Iq7YvbWZbt8VXjfm06brx66b10YiwK-/view
The data set consists of the following attributes:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class: Iris Setosa, Iris Versicolour, Iris Virginica
# Libraries
library(data.table)
library(plyr)
library(ggplot2)
library(moments)
library(ggcorrplot)
library(caret)
## Loading required package: lattice
library(rpart)
library(rpart.plot)
library(gridExtra)
library(grid)
library(lattice)
library(tidyverse) # data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::combine() masks gridExtra::combine()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ✖ purrr::transpose() masks data.table::transpose()
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library("ape")
library("ggplot2")
library("ggdendro")
getwd()
## [1] "C:/Users/HP/Desktop"
Loading the Dataset
df <- read.csv('Iris.csv')
Previewing the top of our dataset
head(df)
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 6 5.4 3.9 1.7 0.4 Iris-setosa
Previewing the bottom of the dataset
tail(df)
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 145 145 6.7 3.3 5.7 2.5 Iris-virginica
## 146 146 6.7 3.0 5.2 2.3 Iris-virginica
## 147 147 6.3 2.5 5.0 1.9 Iris-virginica
## 148 148 6.5 3.0 5.2 2.0 Iris-virginica
## 149 149 6.2 3.4 5.4 2.3 Iris-virginica
## 150 150 5.9 3.0 5.1 1.8 Iris-virginica
Checking the data types
# Data set structure.
str(df)
## 'data.frame': 150 obs. of 6 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SepalLengthCm: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ SepalWidthCm : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ PetalLengthCm: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ PetalWidthCm : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
# Checking for unnecessary columns
colnames(df)
## [1] "Id" "SepalLengthCm" "SepalWidthCm" "PetalLengthCm"
## [5] "PetalWidthCm" "Species"
All the columns seem necessary for this study.
# Checking for anomalies
summary(df)
## Id SepalLengthCm SepalWidthCm PetalLengthCm
## Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median : 75.50 Median :5.800 Median :3.000 Median :4.350
## Mean : 75.50 Mean :5.843 Mean :3.054 Mean :3.759
## 3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
## PetalWidthCm Species
## Min. :0.100 Length:150
## 1st Qu.:0.300 Class :character
## Median :1.300 Mode :character
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
There are no anomalies in the data set.
# checking for missing values
colSums(is.na(df))
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## 0 0 0 0 0
## Species
## 0
There are no missing values.
# checking for duplicates
sum(duplicated(df))
## [1] 0
There are no duplicates in this dataset.
# Checking column names uniformity
colnames(df)
## [1] "Id" "SepalLengthCm" "SepalWidthCm" "PetalLengthCm"
## [5] "PetalWidthCm" "Species"
The column names are uniform
Using boxplots to check for outliers in numerical columns
# Selecting numerical columns only
num_cols <- unlist(lapply(df, is.numeric)) # Identify numeric columns
num_df <- df[ , num_cols] # Subset numeric columns of data
# Boxplots
par(mfrow = c(2,2))
for (i in seq_along(num_df)){
  boxplot(num_df[i], main = paste('Boxplot of', names(num_df)[i]),
          ylab = 'Value')
}
We have outliers in SepalWidthCm but will keep them for further analysis.
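As a quick hedged check (not in the original analysis), the SepalWidthCm outliers can be counted with the 1.5 * IQR rule:
# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in SepalWidthCm
q <- quantile(df$SepalWidthCm, c(0.25, 0.75))
iqr_w <- IQR(df$SepalWidthCm)
sum(df$SepalWidthCm < q[1] - 1.5 * iqr_w | df$SepalWidthCm > q[2] + 1.5 * iqr_w)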
# Dropping ID and Target variable
df <- subset(df, select = -c(Id, Species))
Normalizing the data set using the standard scaler, scale().
scale_data <- as.data.frame(scale(df))
Getting a summary of the un-scaled data
summary(df)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.054 Mean :3.759 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Comparing it with scaled data
summary(scale_data)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## Min. :-1.86378 Min. :-2.4308 Min. :-1.5635 Min. :-1.4396
## 1st Qu.:-0.89767 1st Qu.:-0.5858 1st Qu.:-1.2234 1st Qu.:-1.1776
## Median :-0.05233 Median :-0.1245 Median : 0.3351 Median : 0.1328
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67225 3rd Qu.: 0.5674 3rd Qu.: 0.7602 3rd Qu.: 0.7880
## Max. : 2.48370 Max. : 3.1043 Max. : 1.7804 Max. : 1.7052
We scale the data values so that every variable has a mean of zero and unit variance.
The scale() function applies this standardization by centering and scaling each variable.
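As a sketch of what scale() does under the hood (assuming the defaults center = TRUE, scale = TRUE), one column can be standardized manually:
# Manual z-score: subtract the column mean, divide by the column sd
z <- (df$SepalLengthCm - mean(df$SepalLengthCm)) / sd(df$SepalLengthCm)
all.equal(z, scale_data$SepalLengthCm) # should be TRUE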
Applying the k-means clustering algorithm with the number of centroids k = 3
result <- kmeans(scale_data, 3)
# Previewing the no. of records in each cluster
result$size
## [1] 47 53 50
# Getting the cluster centers (3 centers for k = 3)
result$centers
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## 1 1.13217737 0.0962759 0.9929445 1.0137756
## 2 -0.05005221 -0.8773526 0.3463713 0.2811215
## 3 -1.01119138 0.8394944 -1.3005215 -1.2509379
# Getting the cluster vector that shows the cluster where each record falls
result$cluster
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 3 3 3 3 3 3 3 3 3 3 1 1 1 2 2 2 1 2 2 2
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 1 1 1 2 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 1 2 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 2 1
## 141 142 143 144 145 146 147 148 149 150
## 1 1 2 1 1 1 2 1 1 2
The kmeans() function returns several quantities that tell us how compact each cluster is and how well separated the clusters are from one another (a single-number summary of these is sketched after the outputs below):
betweenss: the between-cluster sum of squares. In an optimal segmentation we expect this to be as high as possible, since we want heterogeneous, well-separated clusters.
withinss: a vector of within-cluster sums of squares, one component per cluster. In an optimal segmentation we expect each component to be as low as possible, since we want homogeneity within the clusters.
tot.withinss: the total within-cluster sum of squares.
totss: the total sum of squares.
# Between-cluster sum of squares
result$betweenss
## [1] 455.974
# Within-cluster sum of squares
result$withinss
## [1] 47.60995 44.25778 48.15831
# Total within-cluster sum of squares
result$tot.withinss
## [1] 140.026
# Total sum of squares
result$totss
## [1] 596
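A compactness score often reported alongside these is the ratio betweenss / totss, i.e. the share of the total sum of squares explained by the clustering; a quick sketch using the values above:
# Share of total variance that lies between clusters; higher is better
result$betweenss / result$totss # ≈ 0.765 here (455.974 / 596)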
To study graphically which value of k gives us the best partition, we can plot betweenss and tot.withinss against the choice of k.
bss <- numeric()
wss <- numeric()
# Run the algorithm for different values of k
set.seed(1234)
for(i in 1:10){
  # For each k, fit once and take betweenss and tot.withinss from the same fit
  km <- kmeans(scale_data, centers = i)
  bss[i] <- km$betweenss
  wss[i] <- km$tot.withinss
}
# Between-cluster sum of squares vs Choice of k
p3 <- qplot(1:10, bss, geom=c("point", "line"),
xlab="Number of clusters", ylab="Between-cluster sum of squares") +
scale_x_continuous(breaks=seq(0, 10, 1)) +
theme_bw()
# Total within-cluster sum of squares vs Choice of k
p4 <- qplot(1:10, wss, geom=c("point", "line"),
xlab="Number of clusters", ylab="Total within-cluster sum of squares") +
scale_x_continuous(breaks=seq(0, 10, 1)) +
theme_bw()
# Subplot
grid.arrange(p3, p4, ncol=2)
Fortunately, the computation behind the “elbow method” has been wrapped up in a single function, fviz_nbclust().
fviz_nbclust(scale_data, kmeans, method = "wss")
From the elbow method, the optimal number of clusters is k = 3.
# function to compute average silhouette for k clusters
avg_sil <- function(k) {
  km.res <- kmeans(scale_data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(scale_data)) # distances on the same scaled data that was clustered
  mean(ss[, 3])
}
# Compute and plot the average silhouette width for k = 2 to k = 15
k.values <- 2:15
# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)
plot(k.values, avg_sil_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Average Silhouettes")
Similar to the elbow method, the computation behind the “average silhouette method” has been wrapped up in a single function, fviz_nbclust():
fviz_nbclust(scale_data, kmeans, method = "silhouette")
The silhouette method gives an optimal number of clusters of k = 2; a per-observation view is sketched below.
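As a hedged follow-up (the seed and nstart = 25 are assumptions added for stability), the per-observation silhouette widths at k = 2 can be inspected with cluster::silhouette() and factoextra::fviz_silhouette():
# Fit k = 2 and plot each observation's silhouette width
set.seed(123)
km2 <- kmeans(scale_data, centers = 2, nstart = 25)
sil2 <- silhouette(km2$cluster, dist(scale_data))
fviz_silhouette(sil2) # bars near zero or negative flag borderline points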
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(scale_data, FUN = kmeans, nstart = 5,
K.max = 10, B = 10)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = scale_data, FUNcluster = kmeans, K.max = 10, B = 10, nstart = 5)
## B=10 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 3
## logW E.logW gap SE.sim
## [1,] 4.534831 4.774555 0.2397240 0.03699714
## [2,] 4.024283 4.499760 0.4754762 0.02531904
## [3,] 3.809974 4.300163 0.4901890 0.02432931
## [4,] 3.699549 4.147151 0.4476022 0.02402350
## [5,] 3.589439 4.051000 0.4615613 0.01996632
## [6,] 3.527540 3.972617 0.4450770 0.01850089
## [7,] 3.450057 3.908514 0.4584571 0.01915980
## [8,] 3.406124 3.850725 0.4446006 0.02043261
## [9,] 3.342182 3.798023 0.4558406 0.02132674
## [10,] 3.296782 3.754313 0.4575307 0.02450223
We can visualize the results with fviz_gap_stat(), which suggests 2 clusters as the optimal number. (By default fviz_gap_stat() applies the “firstSEmax” criterion, which here differs from the “firstmax” choice of 3 printed above.)
fviz_gap_stat(gap_stat)
# Execution of k-means with k=3
set.seed(1234)
df_k <- kmeans(scale_data, centers=3)
# Mean values of each cluster
aggregate(df, by=list(df_k$cluster), mean)
## Group.1 SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## 1 1 5.006000 3.418000 1.464000 0.244000
## 2 2 5.801887 2.673585 4.369811 1.413208
## 3 3 6.780851 3.095745 5.510638 1.972340
# Clustering
ggpairs(cbind(scale_data, Cluster=as.factor(df_k$cluster)),
columns=1:5, aes(colour=Cluster, alpha=0.5),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Visualizing the clusterings, the three methods can be summarized:
The Elbow Method gives k = 3.
The Average Silhouette Method gives k = 2.
The Gap Statistic Method gives k = 2.
From the visualizations, two of the classes (in green and blue) are relatively similar, and the silhouette and gap statistic methods group them as one cluster. However, with prior knowledge of the target variable, we pick the Elbow Method result of k = 3; a quick check against the known labels is sketched below.
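As a final hedged check (not part of the original pipeline), we can cross-tabulate the k = 3 assignments against the known Species labels; the labels are re-read from the CSV since Species was dropped from df earlier:
# Confusion-style table: rows are k-means clusters, columns are true species
species <- read.csv('Iris.csv')$Species
table(Cluster = df_k$cluster, Species = species)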