In this paper I perform clustering of a dataset containing data on patients with heart disease. The dataset comes from Kaggle (https://www.kaggle.com/sonumj/heart-disease-dataset-from-uci). The algorithms applied are k-means, PAM, agglomerative hierarchical clustering and divisive hierarchical clustering.
The dataset contains information about patients who were diagnosed with heart disease.
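The chunks below call functions from several packages without explicit library() calls; a minimal setup, assuming these packages are installed, could look like the sketch below (the withindiss() and inertdiss() helpers used later are assumed to come from the class materials, so they are not loaded here).
# Packages assumed throughout the analysis
library(factoextra)   # get_clust_tendency, eclust, fviz_* plots
library(NbClust)      # automatic selection of the number of clusters
library(ClusterR)     # Optimal_Clusters_KMeans
library(fpc)          # calinhara, pamk
library(cluster)      # silhouette, agnes, diana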
dane<-read.csv("heart_disease_patients.csv", header=T, sep=",")
any(is.na(dane))
## [1] FALSE
summary(dane)
##        id             age             sex               cp
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   Min.   :1.000
##  1st Qu.: 76.5   1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000
##  Median :152.0   Median :56.00   Median :1.0000   Median :3.000
##  Mean   :152.0   Mean   :54.44   Mean   :0.6799   Mean   :3.158
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000
##  Max.   :303.0   Max.   :77.00   Max.   :1.0000   Max.   :4.000
##     trestbps          chol            fbs             restecg
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000
##  Median :130.0   Median :241.0   Median :0.0000   Median :1.0000
##  Mean   :131.7   Mean   :246.7   Mean   :0.1485   Mean   :0.9901
##  3rd Qu.:140.0   3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000
##     thalach          exang           oldpeak         slope
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.00   Min.   :1.000
##  1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000
##  Median :153.0   Median :0.0000   Median :0.80   Median :2.000
##  Mean   :149.6   Mean   :0.3267   Mean   :1.04   Mean   :1.601
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000
##  Max.   :202.0   Max.   :1.0000   Max.   :6.20   Max.   :3.000
Before clustering, we should examine the clustering tendency of the dataset. Using the get_clust_tendency function from the factoextra package, we can assess the clustering tendency with the Hopkins statistic and obtain a visual display of it.
get_clust_tendency(dane, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.7993521
##
## $plot
From theory we know that when the Hopkins statistic is close to 1, the data can be considered clusterable. The obtained value is about 0.8, and the dissimilarity plot shows distinguishable blocks, which is enough to consider this dataset clusterable. Having checked the clustering tendency, let's use an automatic tool that selects the best number of clusters. I will use four different variants, combining the Manhattan or Euclidean metric with complete or average linkage.
autom<-NbClust(dane, distance="manhattan", min.nc=2, max.nc=8, method="complete", index="ch")
autom$Best.nc
## Number_clusters Value_Index
## 7.0000 131.0724
autom1<-NbClust(dane, distance="manhattan", min.nc=2, max.nc=8, method="average", index="ch")
autom1$Best.nc
## Number_clusters Value_Index
## 3.0000 158.2153
autoe<-NbClust(dane, distance="euclidean", min.nc=2, max.nc=8, method="complete", index="ch")
autoe$Best.nc
## Number_clusters Value_Index
## 3.0000 169.4849
autoe1<-NbClust(dane, distance="euclidean", min.nc=2, max.nc=8, method="average", index="ch")
autoe1$Best.nc
As we can see, most variants indicate 3 clusters as the best choice for this dataset; only the Manhattan distance with complete linkage points to 7. From this automatic tool we may conclude that 3 and 7 are the best candidates for the number of clusters.
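The Best.nc component only reports the winning number of clusters; the full Calinski-Harabasz curve over the examined range can also be inspected, for example for the autoe variant created above:
# CH index values for k = 2, ..., 8 (Euclidean distance, complete linkage)
autoe$All.index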
Now let’s move on to performing clustering on our data.
# Silhouette (default) and gap statistic criteria for the number of k-means clusters
fviz_nbclust(dane, FUNcluster=kmeans)
fviz_nbclust(dane, FUNcluster=kmeans, method = "gap_stat")
# Two criteria from ClusterR's Optimal_Clusters_KMeans: the default and the silhouette
opt<-Optimal_Clusters_KMeans(dane, max_clusters=10, plot_clusters = TRUE)
opt1<-Optimal_Clusters_KMeans(dane, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
Based on the obtained results, two clusters give the best silhouette values. Bearing in mind the results of the automatic tool as well, I will consider clustering into 2, 3 and 7 clusters. As before, I will consider two distance metrics: Euclidean and Manhattan.
km20<-eclust(dane, "kmeans", hc_metric="euclidean",k=2)
fviz_cluster(km20, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(km20)
## cluster size ave.sil.width
## 1 1 146 0.44
## 2 2 157 0.42
round(calinhara(dane, km20$cluster),digits=2)
## [1] 316.19
km30<-eclust(dane, "kmeans", hc_metric="euclidean",k=3)
fviz_cluster(km30, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(km30)
## cluster size ave.sil.width
## 1 1 94 0.19
## 2 2 132 0.39
## 3 3 77 0.40
round(calinhara(dane, km30$cluster),digits=2)
## [1] 243.27
km70<-eclust(dane, "kmeans", hc_metric="euclidean",k=7)
fviz_cluster(km70, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(km70)
## cluster size ave.sil.width
## 1 1 33 0.15
## 2 2 60 0.26
## 3 3 54 0.19
## 4 4 60 0.21
## 5 5 35 0.25
## 6 6 21 0.11
## 7 7 40 0.26
round(calinhara(dane, km70$cluster),digits=2)
## [1] 159.59
km21<-eclust(dane, "kmeans", hc_metric="manhattan",k=2)
fviz_cluster(km21, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(km21)
## cluster size ave.sil.width
## 1 1 146 0.44
## 2 2 157 0.42
round(calinhara(dane, km21$cluster),digits=2)
## [1] 316.19
km31<-eclust(dane, "kmeans", hc_metric="manhattan",k=3)
fviz_cluster(km31, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(km31)
## cluster size ave.sil.width
## 1 1 94 0.19
## 2 2 132 0.39
## 3 3 77 0.40
round(calinhara(dane, km31$cluster),digits=2)
## [1] 243.27
km71<-eclust(dane, "kmeans", hc_metric="manhattan",k=7)
fviz_cluster(km71, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(km71)
## cluster size ave.sil.width
## 1 1 33 0.15
## 2 2 60 0.26
## 3 3 54 0.19
## 4 4 60 0.21
## 5 5 35 0.25
## 6 6 21 0.11
## 7 7 40 0.26
round(calinhara(dane, km71$cluster),digits=2)
## [1] 159.59
The clustering results are relatively poor; the best variant seems to be k = 2. Moreover, there is no difference between the Manhattan and Euclidean settings: the hc_metric argument is used only by hierarchical clustering functions, and k-means itself always minimises within-cluster Euclidean distances. In addition, the clusters cut across each other in the plots, meaning that observations assigned to different clusters are still very similar to one another and the groups are not well separated.
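A quick sanity check on the objects above (a sketch, relying on the default fixed seed used by eclust) confirms that the two 2-cluster solutions are identical:
# Cross-tabulation of the "euclidean" and "manhattan" 2-cluster k-means partitions
table(euclidean = km20$cluster, manhattan = km21$cluster)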
fviz_nbclust(dane, FUNcluster=cluster::pam, linecolor = "red", barcolor = "white")+theme_dark()
fviz_nbclust(dane, FUNcluster=cluster::pam, method = "gap_stat")
pamk.best<-pamk(dane, krange=2:10, criterion="ch", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(dane, "dist"),critout=FALSE)
pamk.best$crit
## [1] 0.0000 310.6594 233.1590 203.4968 199.4986 180.5159 164.8531 160.2082
## [9] 144.6861 138.2763
pam21<-eclust(dane, "pam",hc_metric="manhattan", k=2)
fviz_cluster(pam21, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(pam21)
## cluster size ave.sil.width
## 1 1 169 0.39
## 2 2 134 0.47
pam3<-eclust(dane, "pam",hc_metric="euclidean", k=3)
fviz_cluster(pam3, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(pam3)
## cluster size ave.sil.width
## 1 1 75 0.38
## 2 2 108 0.17
## 3 3 120 0.38
pam31<-eclust(dane, "pam",hc_metric="manhattan", k=3)
fviz_cluster(pam31, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(pam31)
## cluster size ave.sil.width
## 1 1 75 0.38
## 2 2 108 0.17
## 3 3 120 0.38
pam2<-eclust(dane, "pam",hc_metric="euclidean", k=2)
fviz_cluster(pam2, geom="point", ellipse.type="norm", palette = "Set1")+theme_minimal()
fviz_silhouette(pam2)
## cluster size ave.sil.width
## 1 1 169 0.39
## 2 2 134 0.47
The results are similar to those obtained with the k-means algorithm. The best fit is for PAM with 2 clusters, and again it does not matter whether the Manhattan or Euclidean metric is used.
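To see how closely the PAM and k-means solutions agree, the two 2-cluster partitions can be cross-tabulated (a quick sketch using the objects created above):
# Agreement between the 2-cluster k-means and PAM partitions
table(kmeans = km20$cluster, pam = pam2$clustering)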
fviz_nbclust(dane, FUN = hcut, method = "wss")
fviz_nbclust(dane, FUN = hcut, method="silhouette")
To perform hierarchical clustering, we first need to calculate the distances between observations. Then we run agglomerative clustering with the average, single, complete and Ward linkage methods. By comparing the agglomerative coefficients, we will decide which linkage is best.
distances<-dist(dane)
hc1<-hclust(distances, method="average")
ac1 <- agnes(dane, method = "average")
hc2<-hclust(distances, method="single")
ac2 <- agnes(dane, method = "single")
hc3<-hclust(distances,method="complete")
ac3 <- agnes(dane, method = "complete")
hc4<-hclust(distances, method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
ac4 <- agnes(dane, method = "ward")
r1<-c("average","single","complete","ward")
r2<-c(ac1$ac, ac2$ac, ac3$ac, ac4$ac)
data.frame(r1,r2)
## r1 r2
## 1 average 0.8814798
## 2 single 0.8518162
## 3 complete 0.9414316
## 4 ward 0.9854996
The largest value is obtained for Ward's method, so I will continue with this linkage.
Now let's compute and assess the variability within the clusters and between them.
clust.vec.2<-cutree(hc4, k=2)
clust.vec.3<-cutree(hc4, k=3)
clust.vec.4<-cutree(hc4, k=4)
clust.vec.7<-cutree(hc4, k=7)
m<-matrix(0, nrow=4, ncol=4)
colnames(m)<-c("k = 2","k = 3","k = 4","k = 7")
m[1,1]<-withindiss(distances, part=clust.vec.2)
m[2,1]<-inertdiss(distances)
m[3,1]<-m[1,1]/ m[2,1]
m[4,1]<-1-m[3,1]
m[1,2]<-withindiss(distances, part=clust.vec.3)
m[2,2]<-inertdiss(distances)
m[3,2]<-m[1,2]/ m[2,2]
m[4,2]<-1-m[3,2]
m[1,3]<-withindiss(distances, part=clust.vec.4)
m[2,3]<-inertdiss(distances)
m[3,3]<-m[1,3]/ m[2,3]
m[4,3]<-1-m[3,3]
m[1,4]<-withindiss(distances, part=clust.vec.7)
m[2,4]<-inertdiss(distances)
m[3,4]<-m[1,4]/ m[2,4]
m[4,4]<-1-m[3,4]
options("scipen"=100, "digits"=4)
m
## k = 2 k = 3 k = 4 k = 7
## [1,] 5940.8803 4761.3515 3866.061 2610.3494
## [2,] 11238.5553 11238.5553 11238.555 11238.5553
## [3,] 0.5286 0.4237 0.344 0.2323
## [4,] 0.4714 0.5763 0.656 0.7677
In the table above, the first row is the within-cluster dissimilarity for each cut, the second row the total dissimilarity of the data, and the last two rows their ratio and its complement. As we may observe, the 2-cluster option leaves the most diversity inside the clusters, while the 7-cluster option gives the most homogeneous clusters.
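The same table can be built more compactly with a loop; this is only a sketch reusing the objects above and assuming withindiss() and inertdiss() are available as before:
# Within-cluster vs. total dissimilarity for each cut of the Ward tree
ks <- c(2, 3, 4, 7)
m2 <- sapply(ks, function(k) {
  part <- cutree(hc4, k = k)
  w   <- withindiss(distances, part = part)  # dissimilarity inside the clusters
  tot <- inertdiss(distances)                # total dissimilarity of the data
  c(within = w, total = tot, within.share = w / tot, between.share = 1 - w / tot)
})
colnames(m2) <- paste("k =", ks)
m2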
Let’s now display Silhouette plots for each case.
plot(silhouette(clust.vec.2,distances),col=5:6, border=NA)
plot(silhouette(clust.vec.3,distances),col=5:7, border=NA)
plot(silhouette(clust.vec.4,distances),col=5:8, border=NA)
plot(silhouette(clust.vec.7,distances),col=5:11, border=NA)
According to the silhouette statistic, the best clustering is obtained for 2 clusters.
In contrast to the agglomerative clustering methods above, Divisive Hierarchical Clustering (DIANA) builds the hierarchy in the opposite direction. Following the definition from datanovia.com, it starts by placing all observations in one big cluster; at each iteration the most heterogeneous cluster is split in two, and the process continues until every object ends up in its own cluster. Algorithms of this kind perform particularly well at identifying large clusters.
After this short definition, let's run the algorithm using the cluster library.
library(cluster)
diana1 <- diana(dane)
plot(diana1)
clust.vec.2<-cutree(diana1, k=2)
clust.vec.3<-cutree(diana1, k=3)
clust.vec.4<-cutree(diana1, k=4)
clust.vec.7<-cutree(diana1, k=7)
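As a counterpart to the agglomerative coefficients compared earlier, the strength of the divisive hierarchy can be summarised by its divisive coefficient (a quick check on the object created above):
# Divisive coefficient of the DIANA hierarchy (values closer to 1 indicate stronger structure)
diana1$dc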
In this section, let's again display the silhouette plots for each case.
plot(silhouette(clust.vec.2,distances),col=5:6, border=NA)
plot(silhouette(clust.vec.3,distances),col=5:7, border=NA)
plot(silhouette(clust.vec.4,distances),col=5:8, border=NA)
plot(silhouette(clust.vec.7,distances),col=5:11, border=NA)
For each of the cases, the DIANA algorithm obtained a higher average silhouette width, which indicates better clustering quality than the agglomerative hierarchical technique.
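To make this comparison concrete, the average silhouette widths of the Ward and DIANA partitions can be put side by side (a sketch that recomputes the cuts under separate names, since the clust.vec.* vectors were overwritten above):
# Average silhouette width: agglomerative (Ward) vs. divisive (DIANA) cuts
avg_sil <- function(cl) summary(silhouette(cl, distances))$avg.width
ks <- c(2, 3, 4, 7)
data.frame(k = ks,
           ward  = sapply(ks, function(k) avg_sil(cutree(hc4, k = k))),
           diana = sapply(ks, function(k) avg_sil(cutree(diana1, k = k))))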
In this paper I examined four clustering algorithms and compared their performance on data describing patients with heart disease. The results show that the best clustering quality on the given dataset was achieved by the DIANA algorithm, followed by k-means, PAM and agglomerative hierarchical clustering.
https://www.datanovia.com/en/lessons/divisive-hierarchical-clustering/
https://statweb.stanford.edu/~gwalther/gap
https://www.sciencedirect.com/science/article/pii/S1110016815001933
https://uc-r.github.io/hc_clustering
Class Materials on Unsupervised Learning, Faculty of Economic Sciences, University of Warsaw