Cluster analysis is a statistical method for processing data: it organizes items into groups, or clusters, based on how closely associated they are. It is an unsupervised learning method and a powerful data-mining tool for grouping similar objects.
Data:
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class (Iris setosa) is linearly separable from the other 2 (Iris virginica and Iris versicolor); the latter are NOT linearly separable from each other.
Objective:
Carry out advanced cluster analysis techniques below on the Iris data set and compare the results.
Cluster analysis techniques:
1. K-Means Clustering
2. Hierarchical Clustering (using the “ward.D2” linkage method)
3. Model-based Clustering
4. Density-based Clustering
Data Preparation
library(mclust) # For model based clustering
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
library(dbscan) # For density based clustering
library(factoextra) # Clustering Visualization
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2) # For visualization
library(GGally) # For visualization
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggpubr) # For visualization
# Clear the workspace:
rm(list = ls())
# Load data
data(iris)
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
ggpairs(iris, # Data frame
columns = 1:4, # Columns
aes(color = Species, # Color by group (cat. variable)
alpha = 0.5)) # Transparency
#Fitting Model
fitK <- kmeans(iris[, -5], 3, nstart=20)
fitK
## K-means clustering with 3 clusters of sizes 38, 50, 62
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.006000 3.428000 1.462000 0.246000
## 3 5.901613 2.748387 4.393548 1.433871
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
I chose an nstart value of 20 so that R will try 20 different initial configurations and then select the one with the lowest within-cluster variation.
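As a minimal sketch of this effect (the seed below is arbitrary and only makes this illustration reproducible), compare a single random start with 20 restarts:
# Sketch: nstart controls how many random initial configurations kmeans tries
set.seed(123) # arbitrary seed for this illustration
single_start <- kmeans(iris[, -5], centers = 3, nstart = 1)
multi_start <- kmeans(iris[, -5], centers = 3, nstart = 20)
single_start$tot.withinss # may be stuck in a poorer local optimum
multi_start$tot.withinss # best of 20 restarts, never higher than a single start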
#Print clusters
fitK$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
#Plot Clusters
ggpairs(iris, aes(color = factor(fitK$cluster), alpha = 0.5),
lower = list(combo = "count"))
In this section, I will be using the ward.D2 linkage (minimum variance) method. Ward’s method is a criterion applied in hierarchical cluster analysis.
Ward’s method starts with n clusters, each containing a single object. These n clusters are successively merged until one cluster contains all objects. At each step, the method merges the pair of clusters that yields the smallest increase in total within-cluster variance, measured by the error sum of squares (ESS).
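In symbols, the quantity Ward’s method keeps as small as possible at each merge is the total within-cluster error sum of squares over the current K clusters:

$$ESS = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2$$

where \(\bar{x}_k\) is the mean of cluster \(C_k\). The “ward.D2” option in hclust implements this criterion correctly when given Euclidean distances.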
# Create a function of dendrogram:
dend_func <- function(x) {
  fviz_dend(x,
            k = 3,                     # Cut the tree into 3 groups
            cex = 0.5,                 # Label size
            rect = TRUE,               # Draw rectangles around groups
            rect_fill = TRUE,
            horiz = FALSE,
            palette = "jco",           # Color palette
            rect_border = "jco",
            color_labels_by_k = TRUE)  # Color labels by group
}
# Create a theme of dendrogram plots :
theme1<- theme_gray() +
theme(plot.margin = unit(rep(0.7, 4), "cm"))
Distance matrix computation and visualization
# Compute distances:
dd <- dist(iris[, 1:4], method = "euclidean")
#Fit Model
fitH <- hclust(dd, "ward.D2")
#Plot graph
dend_func(fitH) -> basic_plot
basic_plot + theme1 +
labs(title = "Hierarchical Clustering")
#Print clusters
clusters <- cutree(fitH, k = 3)
clusters
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
ggpairs(iris, aes(color = factor(clusters), alpha = 0.5),
lower = list(combo = "count"))
Model-based clustering is an advanced clustering method which assumes the data come from a distribution that is a mixture of two or more clusters. Model-based clustering rests on 3 general assumptions:
1. The optimal model is chosen based on the BIC (Bayesian Information Criterion).
2. Each observation in the data has a certain probability of belonging to each cluster.
3. The observations within each cluster follow a normal distribution (with the appropriate dimension).
The mclust package provides the EM (Expectation-Maximization) algorithm for normal mixture models with a variety of covariance structures. The EM algorithm is used for maximum likelihood estimation.
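The standard formulation behind this approach models each observation x as drawn from a finite mixture of G multivariate normal densities:

$$f(x) = \sum_{k=1}^{G} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{G} \pi_k = 1,$$

where the mixing proportions \(\pi_k\) and the component parameters \((\mu_k, \Sigma_k)\) are estimated by EM.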
#Fit Model Based clustering
fitM <- Mclust(iris[,-5])
#Choose the model
BIC <- mclustBIC(iris[,-5])
plot(BIC)
We can see that VEV is the winning model, reaching the top BIC score.
summary(fitM)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 2 components:
##
## log-likelihood n df BIC ICL
## -215.726 150 26 -561.7285 -561.7289
##
## Clustering table:
## 1 2
## 50 100
The EM algorithm suggests using 2 clusters. The optimal selected model is the VEV model: the two components are ellipsoidal with varying volume, equal shape, and varying orientation, based on the covariance matrix. The clustering table in the summary also shows the number of observations assigned to each component.
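As a side check (not part of the original analysis), one could force a three-component solution and compare its BIC against the selected two-component model:
# Sketch: refit with G = 3 components and compare BIC values
fitM3 <- Mclust(iris[, -5], G = 3)
fitM3$bic # compare with fitM$bic for the 2-component model
table(iris$Species, fitM3$classification)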
#Density Plot
dens <- densityMclust(iris[,1:4])
plot(dens, what = "density", type = "hdr", data = iris[,1:4])
#Classification: Plot shows Boundaries
#Dimension Reduction for Model Based Clustering
dr <- MclustDR(fitM)
plot(dr, what = "classification")
# Classification: plot shows clusters
fviz_mclust(fitM, "classification", geom = "point",
pointsize = 1.5, palette = "jco")
mclust also provides us with the specific information below:
fitM$modelName # Optimal selected model
## [1] "VEV"
fitM$G # Optimal number of cluster
## [1] 2
head(fitM$z, 5) # Probability to belong to a given cluster
## [,1] [,2]
## [1,] 1.0000000 2.513256e-11
## [2,] 0.9999999 5.556629e-08
## [3,] 1.0000000 3.635567e-09
## [4,] 0.9999999 8.612037e-08
## [5,] 1.0000000 8.504814e-12
head(fitM$classification, 5) # Cluster assignment of each observation
## [1] 1 1 1 1 1
Density-based clustering is an unsupervised learning method that identifies distinctive groups/clusters in the data by separating regions of high point density from regions of low point density. Density-based clustering is also commonly used to identify outliers.
The density-based clustering tool works by detecting areas where points are concentrated and where they are separated by areas that are empty or sparse. Points that are not part of a cluster are labeled as noise.
eps is the maximum distance between two points for them to be treated as neighbors; it is the distance the algorithm uses to decide whether to connect two points together.
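As a minimal illustration of this neighborhood idea (recomputing a distance matrix just for the example), eps = 0.7 means two observations within Euclidean distance 0.7 of each other count as neighbors:
# Sketch: count the eps-neighbours of the first observation
d_mat <- as.matrix(dist(iris[, -5]))
sum(d_mat[1, -1] <= 0.7) # neighbours of observation 1 within eps = 0.7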
# Choose the optimal eps value from the k-nearest-neighbor distance plot
kNNdistplot(iris[,-5], k = 3)
abline(h = 0.7, col = "red", lty = 2)
From the elbow of the plot, it can be seen that the optimal eps value is around a distance of 0.70.
fitD <- dbscan(iris[,-5], eps = 0.7, minPts = 5)
fitD
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.7, minPts = 5
## The clustering contains 2 cluster(s) and 3 noise points.
##
## 0 1 2
## 3 50 97
##
## Available fields: cluster, eps, minPts
Please note that the noise points/outliers are coded as cluster 0.
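They can be inspected directly from the fitted object (a small sketch):
# Sketch: list the observations DBSCAN flagged as noise (cluster 0)
which(fitD$cluster == 0) # row indices of the noise points
iris[fitD$cluster == 0, -5] # their feature values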
#Cluster Plot
fviz_cluster(fitD, iris[,-5], geom = "point")
The density-based model also suggests that 2 clusters separate the data set best. The black points on the graph are outliers.
Comparison
#K-Means
table(iris$Species,fitK$cluster)
##
## 1 2 3
## setosa 0 50 0
## versicolor 2 0 48
## virginica 36 0 14
#Hierarchical
table(iris$Species,clusters)
## clusters
## 1 2 3
## setosa 50 0 0
## versicolor 0 49 1
## virginica 0 15 35
The K-Means and Hierarchical Clustering models return almost the same result.
Model-based clustering (based on the BIC score) and density-based clustering both suggest using 2 clusters to obtain the best-fitting model.
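As a final sketch, the agreement of each clustering with the true species labels can be quantified with the adjusted Rand index (adjustedRandIndex() comes with the mclust package loaded above):
# Sketch: adjusted Rand index of each clustering vs. the species labels
adjustedRandIndex(iris$Species, fitK$cluster) # K-means
adjustedRandIndex(iris$Species, clusters) # hierarchical
adjustedRandIndex(iris$Species, fitM$classification) # model-based
adjustedRandIndex(iris$Species, fitD$cluster) # density-based (noise = 0)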