Introduction Clustering
The steps to create clusters are:
* Step 1
* Step 2
Problem Definition
Perform cluster analysis on the iris dataset available in R.
Data Location
Available In R
Data Description
The iris dataset (included with R) contains four measurements for 150 flowers representing three species of iris (Iris setosa, versicolor and virginica).
The dataset contains the following columns:
1. Sepal.Length
2. Sepal.Width
3. Petal.Length
4. Petal.Width
5. Species
Setup
Load Libs
library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("NbClust")
library(NbClust)
Functions
detect_na <- function(inp) {
  sum(is.na(inp))
}
Load Dataset
dfrDataset <- iris
head(dfrDataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Dataframe Structure
str(dfrDataset)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Dataframe Summary
lapply(dfrDataset, FUN=summary)
## $Sepal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $Sepal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
##
## $Petal.Length
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
##
## $Petal.Width
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## setosa versicolor virginica
## 50 50 50
Missing Data
lapply(dfrDataset, FUN=detect_na)
## $Sepal.Length
## [1] 0
##
## $Sepal.Width
## [1] 0
##
## $Petal.Length
## [1] 0
##
## $Petal.Width
## [1] 0
##
## $Species
## [1] 0
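As a cross-check, the same per-column counts can be obtained with a single vectorized call. This is a minimal sketch that assumes only base R; `naCounts` is an illustrative name that is not used elsewhere in this document.

```r
# Count missing values per column in one call (base R only).
# naCounts is a hypothetical name introduced for this illustration.
naCounts <- colSums(is.na(iris))
print(naCounts)

# Stop early if any column has missing values, since kmeans() cannot handle NA.
stopifnot(all(naCounts == 0))
```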
Plot Relationship
plot(dfrDataset, col=c("red","green","blue")[dfrDataset$Species])
Plot Sepal Relationship
ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=Species)) + geom_point(size=0.9)
Plot Petal Relationship
ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=Species)) + geom_point(size=0.9)
Create Test Dataframe
tstDataset <- select(dfrDataset, -Species)
head(tstDataset)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
Note:
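Next, we calculate the “Within-Cluster Sum of Squares (WSS)” for an increasing number of clusters.
WSS measures the homogeneity within clusters: the smaller the WSS, the closer the observations lie to their cluster centres.
The wssplot() function plots WSS against the number of clusters.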
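To make the definition concrete, WSS can be recomputed by hand as the sum of squared differences between each observation and the centre of its assigned cluster. The sketch below does this for a single kmeans() run and compares the result with the value that kmeans() reports. The choice of k = 3 and the variable names are assumptions made for this illustration only.

```r
# Numeric iris measurements (Species dropped), matching tstDataset above.
dat <- as.matrix(iris[, 1:4])

set.seed(707)                    # arbitrary seed, for reproducibility only
km <- kmeans(dat, centers = 3)   # k = 3 is an arbitrary choice for this sketch

# Look up the centre of each observation's cluster, then sum the squared
# differences over all rows and columns.
centroidPerRow <- km$centers[km$cluster, ]
wssManual <- sum((dat - centroidPerRow)^2)

# The hand-computed value should agree with what kmeans() reports.
all.equal(wssManual, km$tot.withinss)

# With a single cluster, WSS is the total sum of squares, which equals
# (n - 1) times the sum of the column variances.
wssOneCluster <- (nrow(dat) - 1) * sum(apply(dat, 2, var))
all.equal(wssOneCluster, km$totss)
```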
wss Function
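# Plot the within-cluster sum of squares (WSS) for 1 to nc clusters
wssplot <- function(data, nc=30, seed=707){
  # With a single cluster, WSS equals the total sum of squares
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    # Reset the seed so every run starts from the same random state
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
# Call wssplot() on the numeric columns
wssplot(tstDataset)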
Note:
We plotted WSS against the number of clusters. The plot shows that WSS decreases only slightly once the number of clusters goes beyond 7 or 8.
This graph is also known as an “Elbow Curve”, and the bending point (e.g. nc = 7 or 8 in our case) is known as the “Elbow Point”.
From the plot we can conclude that with 7 or 8 clusters we should get clusters with good internal homogeneity.
Let’s fix the number of clusters at 8 and call the kmeans() function to form the clusters.
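Reading an elbow off a plot is somewhat subjective, so it can help to look at the numbers behind the curve. The sketch below is an illustration rather than part of the original analysis: it tabulates WSS for 1 to 10 clusters together with the percentage drop from each step to the next. The range of 10 and `nstart = 25` are assumptions chosen for this example (`nstart` is a standard kmeans() argument that reruns the algorithm from several random starts and keeps the best result), and all variable names are hypothetical.

```r
dat <- iris[, 1:4]
kValues <- 1:10                  # assumed range, narrower than nc = 30 above

set.seed(707)
wssByK <- c(
  # k = 1: WSS is the total sum of squares
  (nrow(dat) - 1) * sum(apply(dat, 2, var)),
  # k = 2..10: total within-cluster sum of squares from kmeans()
  sapply(kValues[-1], function(k) {
    kmeans(dat, centers = k, nstart = 25)$tot.withinss
  })
)

# Percentage reduction in WSS when moving from k - 1 to k clusters
pctDrop <- c(NA, -diff(wssByK) / head(wssByK, -1) * 100)

data.frame(k = kValues, wss = round(wssByK, 2), pctDrop = round(pctDrop, 1))
```

Small values of `pctDrop` mark the region where adding another cluster buys little extra homogeneity.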
Create Clusters
intClustSize <- 8
set.seed(707)
lstKmeans <- kmeans(tstDataset,intClustSize)
lstKmeans$centers
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.512500 4.000000 1.475000 0.275000
## 2 6.442105 2.978947 4.594737 1.431579
## 3 5.100000 3.513043 1.526087 0.273913
## 4 5.532143 2.635714 3.960714 1.228571
## 5 6.568182 3.086364 5.536364 2.163636
## 6 6.036842 2.705263 5.000000 1.778947
## 7 7.475000 3.125000 6.300000 2.050000
## 8 4.678947 3.084211 1.378947 0.200000
lstKmeans$size
## [1] 8 19 23 28 22 19 12 19
lstKmeans
## K-means clustering with 8 clusters of sizes 8, 19, 23, 28, 22, 19, 12, 19
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.512500 4.000000 1.475000 0.275000
## 2 6.442105 2.978947 4.594737 1.431579
## 3 5.100000 3.513043 1.526087 0.273913
## 4 5.532143 2.635714 3.960714 1.228571
## 5 6.568182 3.086364 5.536364 2.163636
## 6 6.036842 2.705263 5.000000 1.778947
## 7 7.475000 3.125000 6.300000 2.050000
## 8 4.678947 3.084211 1.378947 0.200000
##
## Clustering vector:
## [1] 3 8 8 8 3 1 8 3 8 8 1 3 8 8 1 1 1 3 1 3 3 3 8 3 3 8 3 3 3 8 8 3 1 1 8
## [36] 8 3 3 8 3 3 8 8 3 3 8 3 8 3 3 2 2 2 4 2 4 2 4 2 4 4 4 4 2 4 2 4 4 6 4
## [71] 6 4 6 2 2 2 2 2 2 4 4 4 4 6 4 2 2 2 4 4 4 2 4 4 4 4 4 2 4 4 5 6 7 5 5
## [106] 7 4 7 5 7 5 6 5 6 6 5 5 7 7 6 5 6 7 6 5 7 6 6 5 7 7 7 5 6 6 7 5 5 6 5
## [141] 5 5 6 5 5 5 6 5 5 6
##
## Within cluster sum of squares by cluster:
## [1] 0.958750 3.708421 2.094783 9.749286 4.315455 4.125263 4.655000 2.488421
## (between_SS / total_SS = 95.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Note:
Eight clusters were formed: cluster 1 contains 8 observations, cluster 2 contains 19, and so on.
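The line `between_SS / total_SS = 95.3 %` in the printout can be reproduced from the other numbers shown. The sketch below copies the per-cluster sums of squares from the output above and computes the total sum of squares directly from the data. The variable names are illustrative.

```r
# Within-cluster sums of squares, copied from the kmeans printout above
withinSS <- c(0.958750, 3.708421, 2.094783, 9.749286,
              4.315455, 4.125263, 4.655000, 2.488421)

# Total sum of squares: squared deviation of every value from its column mean
dat <- as.matrix(iris[, 1:4])
totalSS <- sum(scale(dat, center = TRUE, scale = FALSE)^2)

# Whatever variation is not within clusters lies between them
betweenSS <- totalSS - sum(withinSS)
round(100 * betweenSS / totalSS, 1)
# → 95.3
```

In other words, about 95% of the total variation in the four measurements is explained by the differences between the eight cluster centres.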
Validation
dfrDataset$Cluster <-lstKmeans$cluster
table(dfrDataset$Species, dfrDataset$Cluster)
##
## 1 2 3 4 5 6 7 8
## setosa 8 0 23 0 0 0 0 19
## versicolor 0 19 0 27 0 4 0 0
## virginica 0 0 0 1 22 15 12 0
Note:
The clusters look quite good: almost every cluster contains a single species. The exceptions are cluster 4, which contains one virginica among 27 versicolor, and cluster 6, which mixes 4 versicolor with 15 virginica.
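The agreement between clusters and species can be summarized in a single number, often called purity: the share of observations that belong to the majority species of their cluster. The sketch below computes it from the counts in the validation table above. The names `confusion`, `majorityCounts` and `purity` are hypothetical.

```r
# Species-by-cluster counts, copied from the validation table above
confusion <- matrix(
  c( 8,  0, 23,  0,  0,  0,  0, 19,
     0, 19,  0, 27,  0,  4,  0,  0,
     0,  0,  0,  1, 22, 15, 12,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("setosa", "versicolor", "virginica"), as.character(1:8))
)

# For each cluster, count the observations of its most common species
majorityCounts <- apply(confusion, 2, max)

# Purity: share of observations that belong to their cluster's majority species
purity <- sum(majorityCounts) / sum(confusion)
round(purity, 3)
# → 0.967
```

Purity tends to rise as the number of clusters grows, so a high value with 8 clusters for 3 species should be read with that in mind.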
Plot Sepal Relationship
ggplot(dfrDataset, aes(Sepal.Length, Sepal.Width, color=factor(Cluster))) +
  geom_point(size=0.9) +
  scale_color_manual(name="Cluster", values=rainbow(intClustSize))
Plot Petal Relationship
ggplot(dfrDataset, aes(Petal.Length, Petal.Width, color=factor(Cluster))) +
  geom_point(size=0.9) +
  scale_color_manual(name="Cluster", values=rainbow(intClustSize))
Wind Up
print("Wind Up")
## [1] "Wind Up"