Introduction

In this paper I perform clustering on a dataset containing data about patients with heart disease. The dataset comes from Kaggle (https://www.kaggle.com/sonumj/heart-disease-dataset-from-uci). The algorithms applied are k-means, PAM, agglomerative hierarchical clustering, and divisive hierarchical clustering (DIANA).

Description of the dataset

The dataset contains information about patients who were diagnosed with heart disease.
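The analysis below relies on functions from several R packages. A minimal set of library calls (inferred from the functions used later in the report, so treat it as an assumption and adjust to your installation) could look as follows:

# Packages assumed from the functions used in this report
library(factoextra)  # get_clust_tendency(), eclust(), fviz_nbclust(), fviz_cluster(), fviz_silhouette()
library(NbClust)     # NbClust()
library(ClusterR)    # Optimal_Clusters_KMeans()
library(fpc)         # calinhara(), pamk()
library(cluster)     # agnes(), diana(), pam(), silhouette()
# withindiss() and inertdiss() are assumed to come from the class materials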

dane <- read.csv("heart_disease_patients.csv", header = TRUE, sep = ",")
any(is.na(dane))   # check for missing values
## [1] FALSE
summary(dane)
##        id             age             sex               cp       
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   Min.   :1.000  
##  1st Qu.: 76.5   1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :152.0   Median :56.00   Median :1.0000   Median :3.000  
##  Mean   :152.0   Mean   :54.44   Mean   :0.6799   Mean   :3.158  
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000  
##  Max.   :303.0   Max.   :77.00   Max.   :1.0000   Max.   :4.000  
##     trestbps          chol            fbs            restecg      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :241.0   Median :0.0000   Median :1.0000  
##  Mean   :131.7   Mean   :246.7   Mean   :0.1485   Mean   :0.9901  
##  3rd Qu.:140.0   3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##     thalach          exang           oldpeak         slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.00   Min.   :1.000  
##  1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.80   Median :2.000  
##  Mean   :149.6   Mean   :0.3267   Mean   :1.04   Mean   :1.601  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.20   Max.   :3.000

Before conducting clustering, we should examine the clustering tendency of the dataset. Using the function get_clust_tendency from the factoextra package, we can assess the clustering tendency via the Hopkins statistic and display it visually.

get_clust_tendency(dane, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.7993521
## 
## $plot

From theory we know that values of the Hopkins statistic close to 1 indicate that the data may be clusterable. The obtained value is about 0.80, and the dissimilarity plot shows distinguishable structure, which is enough to consider this dataset clusterable. Having checked the clustering tendency, let's use an automatic tool, NbClust, that suggests the best number of clusters according to the Calinski-Harabasz ("ch") index. I will use four variants, combining the Manhattan or Euclidean metric with complete or average linkage.
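The "ch" index used here, and later in the calinhara() calls, is the Calinski-Harabasz pseudo-F statistic. For n observations partitioned into k clusters it compares between-cluster to within-cluster dispersion,

$$CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)},$$

where B(k) and W(k) denote the between- and within-cluster sums of squares; larger values indicate a better partition.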

manhattan and complete

autom<-NbClust(dane, distance="manhattan", min.nc=2, max.nc=8, method="complete", index="ch")
autom$Best.nc
## Number_clusters     Value_Index 
##          7.0000        131.0724

manhattan and average

autom1<-NbClust(dane, distance="manhattan", min.nc=2, max.nc=8, method="average", index="ch")
autom1$Best.nc
## Number_clusters     Value_Index 
##          3.0000        158.2153

euclidean and complete

autoe<-NbClust(dane, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
autoe$Best.nc
## Number_clusters     Value_Index 
##          3.0000        169.4849

euclidean and average

autoe1<-NbClust(dane, distance="euclidean", min.nc=2, max.nc=8, method="average", index="ch")
autoe1$Best.nc
## Number_clusters     Value_Index 
##          3.0000        169.4849

As we can see, three of the four variants indicate 3 clusters as the best choice; only the Manhattan metric with complete linkage suggests 7. From this analytical tool we may conclude that 3 and 7 are the main candidates for the number of clusters.

Now let’s move on to performing clustering on our data.

K-means

par(mfrow=c(1,2))
fviz_nbclust(dane, FUNcluster=kmeans)

fviz_nbclust(dane, FUNcluster=kmeans, method = "gap_stat") 

opt<-Optimal_Clusters_KMeans(dane, max_clusters=10, plot_clusters = TRUE)
opt1<-Optimal_Clusters_KMeans(dane, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")

Based on the obtained results, 2 clusters gives the best silhouette value. Bearing in mind the results from the analytical tool, I will consider clustering with 2, 3 and 7 clusters. As before, I will consider two distance metrics: Euclidean and Manhattan.
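Before looking at each configuration separately, a compact way to compare the candidate values of k is to collect the average silhouette width and the Calinski-Harabasz index in one table. This is only a sketch, using base kmeans() on the raw data as above:

# Sketch: average silhouette width and CH index for the candidate k values
set.seed(123)
d <- dist(dane)
comparison <- sapply(c(2, 3, 7), function(k) {
  km <- kmeans(dane, centers = k, nstart = 25)
  c(k = k,
    avg.sil = mean(silhouette(km$cluster, d)[, "sil_width"]),
    CH = calinhara(dane, km$cluster))
})
round(t(comparison), 2)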

k=2, metric: euclidean

km20<-eclust(dane, "kmeans", hc_metric="euclidean",k=2)

fviz_cluster(km20, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(km20)
##   cluster size ave.sil.width
## 1       1  146          0.44
## 2       2  157          0.42

round(calinhara(dane, km20$cluster),digits=2)
## [1] 316.19

k=3, metric: euclidean

km30<-eclust(dane, "kmeans", hc_metric="euclidean",k=3)

fviz_cluster(km30, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(km30)
##   cluster size ave.sil.width
## 1       1   94          0.19
## 2       2  132          0.39
## 3       3   77          0.40

round(calinhara(dane, km30$cluster),digits=2)
## [1] 243.27

k=7, metric: euclidean

km70<-eclust(dane, "kmeans", hc_metric="euclidean",k=7)

fviz_cluster(km70, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(km70)
##   cluster size ave.sil.width
## 1       1   33          0.15
## 2       2   60          0.26
## 3       3   54          0.19
## 4       4   60          0.21
## 5       5   35          0.25
## 6       6   21          0.11
## 7       7   40          0.26

round(calinhara(dane, km70$cluster),digits=2)
## [1] 159.59

k=2, metric: manhattan

km21<-eclust(dane, "kmeans", hc_metric="manhattan",k=2)

fviz_cluster(km21, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(km21)
##   cluster size ave.sil.width
## 1       1  146          0.44
## 2       2  157          0.42

round(calinhara(dane, km21$cluster),digits=2)
## [1] 316.19

k=3, metric: manhattan

km31<-eclust(dane, "kmeans", hc_metric="manhattan",k=3)

fviz_cluster(km31, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(km31)
##   cluster size ave.sil.width
## 1       1   94          0.19
## 2       2  132          0.39
## 3       3   77          0.40

round(calinhara(dane, km31$cluster),digits=2)
## [1] 243.27

k=7, metric: manhattan

km71<-eclust(dane, "kmeans", hc_metric="manhattan",k=7)

fviz_cluster(km71, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(km71)
##   cluster size ave.sil.width
## 1       1   33          0.15
## 2       2   60          0.26
## 3       3   54          0.19
## 4       4   60          0.21
## 5       5   35          0.25
## 6       6   21          0.11
## 7       7   40          0.26

round(calinhara(dane, km71$cluster),digits=2)
## [1] 159.59

The clustering results are relatively poor. The best option seems to be k=2. Moreover, there is no difference between the Manhattan and Euclidean runs; this is expected, since stats::kmeans works with Euclidean distances and, according to the factoextra documentation, the hc_metric argument of eclust() applies only to hierarchical methods. Besides, the cluster plots show that the clusters cut across each other, which suggests that observations in different clusters are quite similar to one another.
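Since eclust() uses a fixed random seed by default, the Euclidean and Manhattan runs should indeed produce the same partition; a quick cross-tabulation (a sketch using the objects created above) makes this explicit:

# Sketch: cross-tabulate the k = 2 partitions from the two runs
table(km20$cluster, km21$cluster)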

PAM Algorithm

fviz_nbclust(dane, FUNcluster=cluster::pam, linecolor = "red", barcolor  = "white")+theme_dark()

fviz_nbclust(dane, FUNcluster=cluster::pam, method = "gap_stat") 

pamk.best<-pamk(dane, krange=2:10, criterion="ch", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(dane, "dist"),critout=FALSE)
pamk.best$crit
##  [1]   0.0000 310.6594 233.1590 203.4968 199.4986 180.5159 164.8531 160.2082
##  [9] 144.6861 138.2763
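The CH criterion reported by pamk() is maximized at k = 2 (310.66), in line with the earlier diagnostics. A short sketch to read this off programmatically (pamk() also stores the chosen number of clusters in the nc component):

# Sketch: best k according to the CH criterion computed by pamk()
which.max(pamk.best$crit)
pamk.best$nc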

k=2, metric: manhattan

pam21<-eclust(dane, "pam",hc_metric="manhattan", k=2) 

fviz_cluster(pam21, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(pam21)
##   cluster size ave.sil.width
## 1       1  169          0.39
## 2       2  134          0.47

k=3, metric: euclidean

pam3<-eclust(dane, "pam",hc_metric="euclidean", k=3) 

fviz_cluster(pam3, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(pam3)
##   cluster size ave.sil.width
## 1       1   75          0.38
## 2       2  108          0.17
## 3       3  120          0.38

k=3, metric: manhattan

pam31<-eclust(dane, "pam",hc_metric="manhattan", k=3) 

fviz_cluster(pam31, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(pam31)
##   cluster size ave.sil.width
## 1       1   75          0.38
## 2       2  108          0.17
## 3       3  120          0.38

k=2, metric: euclidean

pam2<-eclust(dane, "pam",hc_metric="euclidean", k=2) 

fviz_cluster(pam2, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal() 

fviz_silhouette(pam2)
##   cluster size ave.sil.width
## 1       1  169          0.39
## 2       2  134          0.47

The results are similar to those obtained from the k-means algorithm. The best fit is for PAM with 2 clusters, and the Manhattan and Euclidean runs again coincide, since the hc_metric argument of eclust() is not passed on to pam(), which has its own metric argument (see the sketch below).
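If one actually wants PAM with the Manhattan metric, cluster::pam() can be called directly through its metric argument. A minimal sketch (the resulting partition may of course differ from the eclust() runs above):

# Sketch: PAM with Manhattan distance via the cluster package
pam2.man <- pam(dane, k = 2, metric = "manhattan")
table(pam2.man$clustering)    # cluster sizes
pam2.man$silinfo$avg.width    # average silhouette width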

Hierarchical clustering

WSS

fviz_nbclust(dane, FUN = hcut, method = "wss")

Silhouette

fviz_nbclust(dane, FUN = hcut, method="silhouette")

To perform hierarchical clustering, we first need to calculate the distances between observations. Then we run agglomerative hierarchical clustering with average, single, complete and Ward linkage. By comparing the agglomerative coefficients, we will decide which linkage method is best.

distances <- dist(dane)                        # Euclidean distance matrix
hc1 <- hclust(distances, method = "average")
ac1 <- agnes(dane, method = "average")         # agnes() stores the agglomerative coefficient in $ac
hc2 <- hclust(distances, method = "single")
ac2 <- agnes(dane, method = "single")
hc3 <- hclust(distances, method = "complete")
ac3 <- agnes(dane, method = "complete")
hc4 <- hclust(distances, method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
ac4 <- agnes(dane, method = "ward")

r1<-c("average","single","complete","ward")
r2<-c(ac1$ac, ac2$ac, ac3$ac, ac4$ac)
data.frame(r1,r2)
##         r1        r2
## 1  average 0.8814798
## 2   single 0.8518162
## 3 complete 0.9414316
## 4     ward 0.9854996

The largest coefficient is obtained for Ward's method, so I will continue with this linkage.
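Before cutting the tree, it may be helpful to look at the Ward dendrogram itself; a small sketch using the hc4 object computed above:

# Sketch: Ward dendrogram with a 3-cluster cut highlighted
plot(hc4, labels = FALSE, hang = -1, main = "Ward linkage dendrogram")
rect.hclust(hc4, k = 3, border = 2:4)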

Now let's compute and assess the variability within the clusters and between them.

clust.vec.2<-cutree(hc4, k=2)
clust.vec.3<-cutree(hc4, k=3)
clust.vec.4<-cutree(hc4, k=4)
clust.vec.7<-cutree(hc4, k=7)
m <- matrix(0, nrow = 4, ncol = 4)
colnames(m) <- c("k = 2","k = 3","k = 4","k = 7")
# row 1: within-cluster dissimilarity, row 2: total inertia,
# row 3: within/total ratio,           row 4: share of variability between clusters
m[1,1] <- withindiss(distances, part = clust.vec.2)
m[2,1] <- inertdiss(distances)
m[3,1] <- m[1,1] / m[2,1]
m[4,1] <- 1 - m[3,1]

m[1,2]<-withindiss(distances, part=clust.vec.3) 
m[2,2]<-inertdiss(distances)                
m[3,2]<-m[1,2]/ m[2,2]      
m[4,2]<-1-m[3,2]    

m[1,3]<-withindiss(distances, part=clust.vec.4) 
m[2,3]<-inertdiss(distances)            
m[3,3]<-m[1,3]/ m[2,3]  
m[4,3]<-1-m[3,3]

m[1,4]<-withindiss(distances, part=clust.vec.7) 
m[2,4]<-inertdiss(distances)                
m[3,4]<-m[1,4]/ m[2,4]      
m[4,4]<-1-m[3,4]
options("scipen"=100, "digits"=4)
m
##           k = 2      k = 3     k = 4      k = 7
## [1,]  5940.8803  4761.3515  3866.061  2610.3494
## [2,] 11238.5553 11238.5553 11238.555 11238.5553
## [3,]     0.5286     0.4237     0.344     0.2323
## [4,]     0.4714     0.5763     0.656     0.7677

As we can see, the 2-cluster solution leaves the largest share of variability inside the clusters (about 53% of the total inertia), while the 7-cluster solution gives the most homogeneous clusters (only about 23% of the variability remains within them).
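The same comparison can be generated more compactly; a sketch that mirrors the withindiss()/inertdiss() calls above:

# Sketch: share of variability remaining within clusters for each k
ks <- c(2, 3, 4, 7)
within.share <- sapply(ks, function(k)
  withindiss(distances, part = cutree(hc4, k = k)) / inertdiss(distances))
round(rbind(k = ks, within = within.share, between = 1 - within.share), 4)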

Let’s now display Silhouette plots for each case.

k=2

plot(silhouette(clust.vec.2,distances),col=5:6, border=NA)

k=3

plot(silhouette(clust.vec.3,distances),col=5:7, border=NA)

k=4

plot(silhouette(clust.vec.4,distances),col=5:8, border=NA)

k=7

plot(silhouette(clust.vec.7,distances),col=5:11, border=NA)

According to the silhouette statistic, the best clustering is obtained for 2 clusters.

Divisive Hierarchical Clustering

In contrast to agglomerative clustering methods, Divisive Hierarchical Clustering (DIANA) builds the hierarchy in the opposite direction. Following the description from datanovia.com, it starts with all observations in one big cluster; at each iteration the most heterogeneous cluster is split into two, and this continues until every object is in its own cluster. Algorithms of this kind perform well at identifying large clusters.

After this short definition, let's run the algorithm using the cluster library.

library(cluster)
diana1 <- diana(dane)
plot(diana1)

clust.vec.2<-cutree(diana1, k=2)
clust.vec.3<-cutree(diana1, k=3)
clust.vec.4<-cutree(diana1, k=4)
clust.vec.7<-cutree(diana1, k=7)
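Analogously to the agglomerative coefficient reported by agnes(), a diana object stores a divisive coefficient in its dc component, which can be compared with the values obtained earlier:

# Divisive coefficient; values close to 1 indicate a strong clustering structure
diana1$dc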

In this section, let's again display the silhouette plots for each case.

k=2

plot(silhouette(clust.vec.2,distances),col=5:6, border=NA)

k=3

plot(silhouette(clust.vec.3,distances),col=5:7, border=NA)

k=4

plot(silhouette(clust.vec.4,distances),col=5:8, border=NA)

k=7

plot(silhouette(clust.vec.7,distances),col=5:11, border=NA)

For each of the considered values of k, DIANA obtained a higher average silhouette width, which indicates better clustering quality than the agglomerative hierarchical approach.
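To verify this claim numerically, the average silhouette widths of the two hierarchies can be put side by side; a sketch reusing the distances object and the two trees:

# Sketch: average silhouette width for Ward (hc4) vs DIANA (diana1) cuts
avg.sil <- function(k, tree) mean(silhouette(cutree(tree, k = k), distances)[, "sil_width"])
ks <- c(2, 3, 4, 7)
res <- rbind(ward  = sapply(ks, avg.sil, tree = hc4),
             diana = sapply(ks, avg.sil, tree = diana1))
colnames(res) <- paste0("k = ", ks)
round(res, 3)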

Conclusions

In this paper I examined four clustering algorithms and compared their performance on data about patients with heart disease. The results show that, on this dataset, the best clustering quality was achieved by the DIANA algorithm, followed by k-means, PAM and agglomerative hierarchical clustering.

References

https://www.datanovia.com/en/lessons/divisive-hierarchical-clustering/

https://statweb.stanford.edu/~gwalther/gap

https://www.sciencedirect.com/science/article/pii/S1110016815001933

https://uc-r.github.io/hc_clustering

Class Materials on Unsupervised Learning, Faculty of Economic Sciences, University of Warsaw