Introduction Clustering
The steps to create clusters are:
* Step 1
* Step 2
Problem Definition
Perform Cluster Analysis On The Dataset IRIS Available In R
Data Location
Available In R
Data Description
The iris dataset (included with R) contains four measurements for 150 flowers representing three species of iris (Iris setosa, versicolor and virginica).
The dataset contains the following columns:
1. Sepal.Length
2. Sepal.Width
3. Petal.Length
4. Petal.Width
5. Species
Setup
Load Libs
library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("NbClust")
library(NbClust)
Functions
# Count the missing values in a vector
detect_na <- function(inp) {
  sum(is.na(inp))
}
Load Dataset
dfrDataset <- iris
head(dfrDataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Dataframe Structure
str(dfrDataset)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Dataframe Summary
lapply(dfrDataset, FUN=summary)
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## setosa versicolor virginica
## 50 50 50
Missing Data
lapply(dfrDataset, FUN=detect_na)
## $Sepal.Length
## [1] 0
##
## $Sepal.Width
## [1] 0
##
## $Petal.Length
## [1] 0
##
## $Petal.Width
## [1] 0
##
## $Species
## [1] 0
Plot Relationship
# Index the colours by species so each species gets one colour
plot(dfrDataset, col=c("red","green","blue")[dfrDataset$Species])
Plot Sepal Relationship
ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=Species)) + geom_point(size=0.9)
Plot Petal Relationship
ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=Species)) + geom_point(size=0.9)
Create Test Dataframe
tstDataset <- select(dfrDataset, -Species)
head(tstDataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
Note:
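Now, we need to calculate the “Within-Cluster Sum of Squares (WSS)” for an increasing number of clusters.
WSS measures how homogeneous a cluster is: it is the sum of squared distances between each point and the centre of its cluster, so a lower WSS means tighter clusters.
The wssplot() function plots WSS against the number of clusters.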
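To make the definition concrete, the following minimal sketch computes WSS by hand for one k-means fit and compares it with the value that kmeans() reports. The choice of 3 clusters and the names numData, fit and wssByHand are assumptions made for this illustration only; they are not part of the analysis.

```r
# Illustration only: WSS computed by hand versus the value reported by kmeans().
# k = 3 is an arbitrary choice for this sketch.
numData <- iris[, 1:4]
set.seed(707)
fit <- kmeans(numData, centers = 3)

# For each cluster, centre the points on the cluster mean and
# sum the squared deviations
wssByHand <- sapply(split(numData, fit$cluster), function(grp) {
  sum(scale(grp, center = TRUE, scale = FALSE)^2)
})

sum(wssByHand)      # total WSS computed manually
fit$tot.withinss    # total WSS reported by kmeans()
```

The two printed values should agree up to floating-point rounding, which is why summing the withinss component of a kmeans() fit gives the quantity plotted on the vertical axis.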
wss Function
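# Plot the total within-cluster sum of squares (WSS) for 1..nc clusters
wssplot <- function(data, nc=30, seed=707){
  # With a single cluster, WSS equals the total sum of squares
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    # Reset the seed so every k-means run is reproducible
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}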
# calling function wssplot()
wssplot(tstDataset)
Note:
We have plotted WSS against the number of clusters. The plot shows that WSS decreases very little once the number of clusters goes beyond 7 or 8.
This graph is known as an “Elbow Curve”, and the bending point (e.g., nc = 7 or 8 in our case) is known as the “Elbow Point”.
From the plot we can conclude that with 7 or 8 clusters we should get clusters that are reasonably homogeneous internally.
Let’s fix the number of clusters at 8 and use hierarchical clustering, hclust() followed by cutree(), to produce the clusters. Note that the WSS curve was computed with kmeans(), so it serves only as a guide for where to cut the hierarchical tree.
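Reading an elbow off a plot is a visual judgement. One way to support it with numbers is to tabulate how much WSS falls each time a cluster is added. The sketch below is an assumption-laden illustration, not part of the original analysis: the upper bound maxK = 10, the use of nstart = 25 and the names numData, wss and pctDrop are all hypothetical choices.

```r
# Illustration only: tabulate the relative drop in WSS per added cluster.
numData <- iris[, 1:4]
maxK <- 10                      # hypothetical upper bound for this sketch
wss <- numeric(maxK)

# With one cluster, WSS is the total sum of squares
wss[1] <- (nrow(numData) - 1) * sum(apply(numData, 2, var))

set.seed(707)
for (k in 2:maxK) {
  # nstart = 25 reruns k-means from several random starts and keeps the best fit
  wss[k] <- kmeans(numData, centers = k, nstart = 25)$tot.withinss
}

# Percentage of the previous WSS removed by going from k - 1 to k clusters
pctDrop <- round(100 * -diff(wss) / head(wss, -1), 1)
data.frame(k = 2:maxK, wss = round(wss[-1], 2), pctDrop = pctDrop)
```

Rows where pctDrop becomes small mark the region where extra clusters stop paying off. Where exactly to draw that line remains a judgement call.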
Create Clusters
lstHClust <- hclust(dist(tstDataset))
plot(lstHClust)
Cut Clusters
intClustSize <- 8
vctCutClust <- cutree(lstHClust, intClustSize)
lstHClust
##
## Call:
## hclust(d = dist(tstDataset))
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 150
vctCutClust
## [1] 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 1
## [36] 1 2 1 1 1 1 1 1 2 2 1 2 1 2 1 3 3 3 4 3 4 3 5 3 4 5 4 4 3 4 3 4 4 3 4
## [71] 6 4 6 3 3 3 3 3 3 4 4 4 4 6 4 3 3 3 4 4 4 3 4 5 4 4 4 3 5 4 7 6 8 7 7
## [106] 8 4 8 7 8 7 6 7 6 6 7 7 8 8 3 7 6 8 6 7 8 6 6 7 8 8 8 7 6 6 8 7 7 6 7
## [141] 7 7 6 7 7 7 6 7 7 6
Validation
dfrDataset$Cluster <- vctCutClust
table(dfrDataset$Species, dfrDataset$Cluster)
##
## 1 2 3 4 5 6 7 8
## setosa 29 21 0 0 0 0 0 0
## versicolor 0 0 20 23 4 3 0 0
## virginica 0 0 1 1 0 14 22 12
Note:
The table shows that clusters 3, 4 and 6 each contain a mix of versicolor and virginica, while clusters 1, 2, 5, 7 and 8 each contain a single species.
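The cross-tabulation can also be summarised programmatically. The sketch below re-enters the counts from the table above as a matrix, so that it runs on its own, then lists the clusters that contain more than one species and computes the cluster purity. The names tabSpeciesCluster, mixedClusters and purity are hypothetical, and purity is only one of several possible agreement measures.

```r
# Counts copied from the Species x Cluster table above
tabSpeciesCluster <- matrix(
  c(29, 21,  0,  0, 0,  0,  0,  0,
     0,  0, 20, 23, 4,  3,  0,  0,
     0,  0,  1,  1, 0, 14, 22, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("setosa", "versicolor", "virginica"), as.character(1:8))
)

# Number of species present in each cluster; more than 1 means a mixed cluster
speciesPerCluster <- colSums(tabSpeciesCluster > 0)
mixedClusters <- names(speciesPerCluster)[speciesPerCluster > 1]
mixedClusters   # "3" "4" "6"

# Purity: share of flowers that belong to the majority species of their cluster
purity <- sum(apply(tabSpeciesCluster, 2, max)) / sum(tabSpeciesCluster)
purity          # 145 / 150, about 0.967
```

In the analysis itself the same calculation can be applied directly to the result of table(dfrDataset$Species, dfrDataset$Cluster).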
Plot Sepal Relationship
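# Map colour to the cluster inside aes() so the manual scale is actually applied
ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=factor(Cluster))) +
  geom_point(size=0.9) +
  scale_color_manual(name="Cluster", values=rainbow(intClustSize))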
Plot Petal Relationship
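# Map colour to the cluster inside aes() so the manual scale is actually applied
ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=factor(Cluster))) +
  geom_point(size=0.9) +
  scale_color_manual(name="Cluster", values=rainbow(intClustSize))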
Note:
The results suggest that the clusters are quite good: apart from a few exceptions, each cluster contains flowers of a single species.
Wind Up
print("Wind Up")
## [1] "Wind Up"