Introduction

Clustering is about grouping a set of objects in such way that objects in same group should be as similar to each other as possible, while objects in different groups should be as different as possible. It can be applied to any kind of data and thus it, it is used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. The data used to conduct analysis contain statistics describing 150 irises. The Iris flower data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper. In this study, i will try to find the best clustering approach which will group data using various methods such as different partitional and hierarchical. At the begging let’s download the data and view names of statics.

library(readxl)
IRIS <- read_excel("C:/Users/darek/OneDrive/Desktop/IRIS.xls")
names(IRIS)

## [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"

Analysis of dataset

In the first step of analysis I show density distribution of variables in dataset.

Two facts can be seen from this analysis. The distributions of most of the variables are not normal and variables are presented at different scales. I will normalize them using scale() function in order to get proper and interpretable results.

IRIS<- scale(IRIS)

In next step I check if the dataset contains outliers. In order to do this, one can use the IRQ rule. Observations suspected of being outliers must be lower than Q1-1.5 × IQR or higher than Q3+1.5 × IQR, where IQR is interquartile range (IQR=Q3-Q1).

Outliers <- c()
names <- c(colnames(IRIS))
for(i in names){
  max <- quantile(IRIS[,i], 0.75) + (IQR(IRIS[,i]) * 1.5 )
  min <- quantile(IRIS[,i], 0.25) - (IQR(IRIS[,i]) * 1.5 )
  idx <- which(IRIS[,i] < min | IRIS[,i] > max)
  print(paste(i, length(idx), sep=' ')) 
  Outliers <- c(Outliers, idx) 
}

## [1] "sepal_length 0"
## [1] "sepal_width 4"
## [1] "petal_length 0"
## [1] "petal_width 0"

Using such a method, four potentials outlires in “sepal_width” column were detected. In order to see if these observations stand out from the rest, a plot was drawn.

plot(IRIS[,1], main = "Plot of sepal_width", ylab = "x")

After examining the plot, I decided not to exclude any of the observations from the analysis.

Clustering

Before applying cluster methods, the first step is to assess whether the data is clusterable. For this purpose I used Hopkins’ statistic and a visual approach.

library(clustertend)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

get_clust_tendency(IRIS, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))

## $hopkins_stat
## [1] 0.8712614
## 
## $plot

The value of Hopkins statistics is close to 1, we can reject the null hypothesis and conclude that the dataset is significantly a clusterable data. This is also confirmed by plot analysis.

In next step I cunducted an analysis to find the appropriate number of groups. For this purpose I applied k-means, PAM, CLARA and hierarchical algorithms.

library(gridExtra)
fviz_nbclust(IRIS, FUNcluster = kmeans, method = "silhouette") + theme_classic()

fviz_nbclust(IRIS, FUNcluster = cluster::pam, method = "silhouette") + theme_classic()

fviz_nbclust(IRIS, FUNcluster = cluster::clara, method = "silhouette") + theme_classic()

fviz_nbclust(IRIS, FUNcluster = hcut, method = "silhouette") + theme_classic()

According to all statistics, the optimal number of clusters is two.

K-means

K-means clustering is a method of vector quantization. It splits observations into clusters in which each observation belongs to the cluster with the nearest mean. Its goal is to minimize square error of the intra-class dissimilarity. In order to measure the similarity or regularity among the data-items, distance metrics plays a very important role. In this paper it was decided to measure the distance between observations in three ways: by Euclidean distance, Manhattan Distance and Pearson’s correlation.

Euclidean distance

library(factoextra)
xd<- eclust(IRIS, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
fviz_cluster(xd, data = IRIS, elipse.type = "convex") + theme_minimal()

fviz_silhouette(xd)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Manhattan Distance

library(factoextra)
cd <- eclust(IRIS, k=2, FUNcluster="kmeans", hc_metric="manhattan", graph=FALSE)
fviz_cluster(cd, data = IRIS, elipse.type = "convex") + theme_minimal()

fviz_silhouette(cd)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Pearson’s correlation

library(factoextra)
x <- eclust(IRIS, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
fviz_cluster(x, data = IRIS, elipse.type = "convex") + theme_minimal()

fviz_silhouette(x)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Summarizing above results it occurs that there is no difference when different metrics measuring distance was applied. Silhouette statistic for all is equal to 0.58.

PAM

In this paper dataset is quite small, therefore PAM was selected from the PAM and CLARA pair. That is because CLARA is a PAM algorithm implementation for large datasets. The PAM algorithm is intended to find a sequence of objects called medoids that are centrally located in clusters. This algorithm aims to minimize the sum of the distances of all non-medoid elements from the medoids closest to them. As in the case of k-means, different metrics were used to measure distance.

Pearson correlation

pam <- eclust(IRIS, k=2, FUNcluster="pam", hc_metric="pearson", graph=FALSE)
fviz_cluster(pam, data = IRIS, elipse.type = "convex") + theme_minimal()

fviz_silhouette(pam)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Euclidean distance

pamm <- eclust(IRIS, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
fviz_cluster(pamm, data = IRIS, elipse.type = "convex") + theme_minimal()

fviz_silhouette(pamm)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Same as in k-means there is no difference what metric we will use. Silhouette statistic for both is equal to 0.58. The reason for these results is the database. It contains observations that clearly fall into two separate groups. Each algorithm divides them in the same way, regardless of the distance metric used, so the results are the same.

Hierarchical clustering

Hierarchical cluster analysis is a method of which seeks to build a hierarchy of clusters. There is two types of hierarchical clustering agglomerative and divisive. Agglomerative is a a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. I applied first agglomerative hierarchical clustering. In this method we have to specify the method how measure distances between clusters. We can apply single linkage, complete linkage, average linkage or Ward’s method. For single linkage we measure minimal distance between a points in different groups. In complete linkage, the distance between clusters is equal to the the distance between most disimilar members of two clusters. In average linkage we measure average of all the distances between any points of both groups. Ward’s minimum variance criterion minimizes the total within-cluster variance.

Agglomerative hierarchical clustering

The following dendrograms present agglomerative hierarchical clustering using four different linkage methods.

Single linkage

sl <- eclust(IRIS, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
fviz_dend(sl, show_labels = FALSE, main="Dendrogram of HAC",
         palette = "jco", as.ggplot = TRUE)

fviz_silhouette(sl)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Complete linkage

cl <- eclust(IRIS, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
fviz_dend(cl, show_labels = FALSE, main="Dendrogram of HAC",
         palette = "jco", as.ggplot = TRUE)

fviz_silhouette(cl)

##   cluster size ave.sil.width
## 1       1   73          0.32
## 2       2   77          0.55

Average linkage

al <- eclust(IRIS, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
fviz_dend(al, show_labels = FALSE, main="Dendrogram of HAC",
         palette = "jco", as.ggplot = TRUE)

fviz_silhouette(al)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Ward’s method

wm <- eclust(IRIS, k=2, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
fviz_dend(wm, show_labels = FALSE, main="Dendrogram of HAC",
         palette = "jco", as.ggplot = TRUE)

fviz_silhouette(wm)

##   cluster size ave.sil.width
## 1       1   49          0.69
## 2       2  101          0.52

Divisive hierarchical clustering

In divisive hierarchical clustering, all the points in the dataset belong to one cluster and split is performed recursively as one moves down the hierarchy. That’s why there is no different linkage methods.

hc <- eclust(IRIS, k=2, FUNcluster="diana")
fviz_dend(hc, show_labels = FALSE, main="Dendrogram of DIANA",
         palette = "jco", as.ggplot = TRUE)

fviz_silhouette(hc)

##   cluster size ave.sil.width
## 1       1   50          0.68
## 2       2  100          0.53

Conclusions

Comparing the hierarchical division method to the previously used flat algorithms, it is noticeable that this time there are differences in the division. Agglomerative hierarchical clustering using complete linkage performed clearly worse results than the rest algorithms. There was also a noticeable small difference in the size of clusters using Ward’s method comparing to the rest. However, it is not possible to clearly indicate which algorithm is the best, because each has a different effectiveness depending on the database. The database used in the study was quite divided, so the results of the algorithms were similar.