Cluster analysis is a statistical method for processing data: it organizes items into groups, or clusters, based on how closely associated they are. It is an unsupervised learning method and a powerful data-mining tool for grouping similar objects.
Data:
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class (Iris setosa) is linearly separable from the other 2 (Iris virginica and Iris versicolor); the latter are NOT linearly separable from each other.
Objective:
Carry out advanced cluster analysis techniques below on the Iris data set and compare the results.
Cluster analysis techniques:
1. K-Means Clustering
2. Hierarchical Clustering (using the “ward.D2” linkage method)
3. Model-based Clustering
4. Density-based Clustering
Data Preparation
library(mclust) # For model based clustering
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
library(dbscan) # For density based clustering
library(factoextra) # Clustering Visualization
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2) # For visualization
library(GGally) # For visualization
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggpubr) # For visualization
# Clear the workspace:
rm(list = ls())
# Load data
data(iris)
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
ggpairs(iris, # Data frame
columns = 1:4, # Columns
aes(color = Species, # Color by group (cat. variable)
alpha = 0.5)) # Transparency
#Fitting Model
fitK <- kmeans(iris[, -5], 3, nstart=20)
fitK
## K-means clustering with 3 clusters of sizes 38, 50, 62
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.006000 3.428000 1.462000 0.246000
## 3 5.901613 2.748387 4.393548 1.433871
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
I chose an nstart value of 20 so that R will try 20 different initial configurations and then select the one with the lowest within-cluster variation.
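As a minimal sketch of this effect (the seed below is arbitrary and only makes this illustration reproducible), compare a single random start with 20 restarts:
# Sketch: nstart controls how many random initial configurations kmeans tries
set.seed(123) # arbitrary seed for this illustration
single_start <- kmeans(iris[, -5], centers = 3, nstart = 1)
multi_start <- kmeans(iris[, -5], centers = 3, nstart = 20)
single_start$tot.withinss # may be stuck in a poorer local optimum
multi_start$tot.withinss # best of 20 restarts, never higher than a single start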
#Print clusters
fitK$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
#Plot Clusters
ggpairs(iris, aes(color = factor(fitK$cluster), alpha = 0.5),
lower = list(combo = "count"))
In this section, I will be using the ward.D2 linkage (minimum variance) method. Ward’s method is a criterion applied in hierarchical cluster analysis.
Ward’s method starts with n clusters, each containing a single object. These n clusters are successively merged until one cluster contains all objects. At each step, the method merges the pair of clusters that yields the smallest increase in total within-cluster variance, measured by the error sum of squares (ESS).
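In symbols, the quantity Ward’s method keeps as small as possible at each merge is the total within-cluster error sum of squares over the current K clusters:

$$ESS = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2$$

where \(\bar{x}_k\) is the mean of cluster \(C_k\). The “ward.D2” option in hclust implements this criterion correctly when given Euclidean distances.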
# Create a function of dendrogram:
dend_func <- function(x) {
  fviz_dend(x,
            k = 3,                     # Cut the tree into 3 groups
            cex = 0.5,                 # Label size
            rect = TRUE,               # Draw rectangles around groups
            rect_fill = TRUE,
            horiz = FALSE,
            palette = "jco",           # Color palette
            rect_border = "jco",
            color_labels_by_k = TRUE)  # Color labels by group
}
# Create a theme of dendrogram plots :
theme1<- theme_gray() +
theme(plot.margin = unit(rep(0.7, 4), "cm"))
Distance matrix computation and visualization
# Compute distances:
dd <- dist(iris[, 1:4], method = "euclidean")
#Fit Model
fitH <- hclust(dd, "ward.D2")
#Plot graph
dend_func(fitH) -> basic_plot
basic_plot + theme1 +
labs(title = "Hierarchical Clustering")
#Print clusters
clusters <- cutree(fitH, k = 3)
clusters
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
ggpairs(iris, aes(color = factor(clusters), alpha = 0.5),
lower = list(combo = "count"))
Model-based clustering is an advanced clustering method which assumes the data come from a distribution that is a mixture of two or more clusters. Model-based clustering rests on 3 general assumptions:
1. The optimal model is chosen based on the BIC (Bayesian Information Criterion).
2. Each observation in the data has a certain probability of belonging to each cluster.
3. The observations within each cluster follow a normal distribution (with the appropriate dimension).
The mclust package provides the EM (Expectation-Maximization) algorithm for normal mixture models with a variety of covariance structures. The EM algorithm is used for maximum likelihood estimation.
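The standard formulation behind this approach models each observation x as drawn from a finite mixture of G multivariate normal densities:

$$f(x) = \sum_{k=1}^{G} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{G} \pi_k = 1,$$

where the mixing proportions \(\pi_k\) and the component parameters \((\mu_k, \Sigma_k)\) are estimated by EM.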
#Fit Model Based clustering
fitM <- Mclust(iris[,-5])
#Choose the model
BIC <- mclustBIC(iris[,-5])
plot(BIC)
We can see that VEV is the winning model, reaching the top BIC score.
summary(fitM)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 2 components:
##
## log-likelihood n df BIC ICL
## -215.726 150 26 -561.7285 -561.7289
##
## Clustering table:
## 1 2
## 50 100
The EM algorithm suggests using 2 clusters. The optimal selected model is the VEV model: the two components are ellipsoidal with varying volume, equal shape, and varying orientation, based on the covariance matrix. The clustering table in the summary also shows the number of observations assigned to each component.
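As a side check (not part of the original analysis), one could force a three-component solution and compare its BIC against the selected two-component model:
# Sketch: refit with G = 3 components and compare BIC values
fitM3 <- Mclust(iris[, -5], G = 3)
fitM3$bic # compare with fitM$bic for the 2-component model
table(iris$Species, fitM3$classification)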
#Density Plot
dens <- densityMclust(iris[,1:4])
plot(dens, what = "density", type = "hdr", data = iris[,1:4])
#Classification: Plot shows Boundaries
#Dimension Reduction for Model Based Clustering
dr <- MclustDR(fitM)
plot(dr, what = "classification")
# Classification: plot shows clusters
fviz_mclust(fitM, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
mclust also provides us with the specific information below:
fitM$modelName # Optimal selected model
## [1] "VEV"
fitM$G # Optimal number of cluster
## [1] 2
head(fitM$z, 5) # Probability to belong to a given cluster
## [,1] [,2]
## [1,] 1.0000000 2.513256e-11
## [2,] 0.9999999 5.556629e-08
## [3,] 1.0000000 3.635567e-09
## [4,] 0.9999999 8.612037e-08
## [5,] 1.0000000 8.504814e-12
head(fitM$classification, 5) # Cluster assignment of each observation
## [1] 1 1 1 1 1
Density-based clustering is an unsupervised learning method that identifies distinctive groups/clusters in the data by separating regions of high point density from regions of low point density. Density-based clustering is also commonly used to identify outliers.
The density-based clustering tool works by detecting areas where points are concentrated and where they are separated by areas that are empty or sparse. Points that are not part of a cluster are labeled as noise.
eps is the maximum distance between two points for them to be treated as neighbors; it is the distance the algorithm uses to decide whether to connect two points together.
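As a minimal illustration of this neighborhood idea (recomputing a distance matrix just for the example), eps = 0.7 means two observations within Euclidean distance 0.7 of each other count as neighbors:
# Sketch: count the eps-neighbours of the first observation
d_mat <- as.matrix(dist(iris[, -5]))
sum(d_mat[1, -1] <= 0.7) # neighbours of observation 1 within eps = 0.7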
# Choose the optimal eps value from the k-nearest-neighbor distance plot
kNNdistplot(iris[,-5], k = 3)
abline(h = 0.7, col = "red", lty = 2)
From the elbow of the plot, it can be seen that the optimal eps value is around a distance of 0.70.
fitD <- dbscan(iris[,-5], eps = 0.7, minPts = 5)
fitD
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.7, minPts = 5
## The clustering contains 2 cluster(s) and 3 noise points.
##
## 0 1 2
## 3 50 97
##
## Available fields: cluster, eps, minPts
Please note that the noise points/outliers are coded as cluster 0.
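They can be inspected directly from the fitted object (a small sketch):
# Sketch: list the observations DBSCAN flagged as noise (cluster 0)
which(fitD$cluster == 0) # row indices of the noise points
iris[fitD$cluster == 0, -5] # their feature values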
#Cluster Plot
fviz_cluster(fitD, iris[,-5], geom = "point")
The density-based model also suggests that 2 clusters separate the data set best. The black points on the graph are outliers.
Comparison
#K-Means
table(iris$Species,fitK$cluster)
##
## 1 2 3
## setosa 0 50 0
## versicolor 2 0 48
## virginica 36 0 14
#Hierarchical
table(iris$Species,clusters)
## clusters
## 1 2 3
## setosa 50 0 0
## versicolor 0 49 1
## virginica 0 15 35
The K-Means and Hierarchical Clustering models return almost the same result.
Model-based clustering (based on the BIC score) and density-based clustering both suggest using 2 clusters to obtain the best-fitting model.
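As a final sketch, the agreement of each clustering with the true species labels can be quantified with the adjusted Rand index (adjustedRandIndex() comes with the mclust package loaded above):
# Sketch: adjusted Rand index of each clustering vs. the species labels
adjustedRandIndex(iris$Species, fitK$cluster) # K-means
adjustedRandIndex(iris$Species, clusters) # hierarchical
adjustedRandIndex(iris$Species, fitM$classification) # model-based
adjustedRandIndex(iris$Species, fitD$cluster) # density-based (noise = 0)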