Clustering is arguably the most important problem in unsupervised learning, for an obvious reason: it tries to find structure in unlabelled data, that is, to make sense of a dataset that otherwise looks chaotic. In clustering, we group objects with homogeneous attributes together, in such a way that the resulting groups are highly heterogeneous with respect to one another.
For this analysis, we will cluster a dataset using two techniques, k-means clustering and hierarchical clustering. Afterwards, we will compare the results and assess which technique did a better job.
The dataset used for this analysis contains patient records for predicting heart disease. It was obtained from https://www.kaggle.com/faressayah/predicting-heart-disease-using-machine-learning/data and holds a number of attributes that can be used to predict whether heart disease is present. The attributes are: age, sex, chest pain type (cp), resting blood pressure (trestbps), serum cholesterol (chol), fasting blood sugar (fbs), resting ECG results (restecg), maximum heart rate achieved (thalach), exercise-induced angina (exang), ST depression induced by exercise (oldpeak), the slope of the peak exercise ST segment (slope), the number of major vessels coloured by fluoroscopy (ca), thalassemia (thal), and the target label indicating the presence of heart disease.
We will inspect the dataset first. Afterwards, we will remove the target column, since the point of the analysis is to recover this label through clustering without using it.
heart_data <- read.csv("~/Downloads/heart.csv.xls")
heart <- heart_data            # working copy
heart$target <- NULL           # drop the label so that clustering cannot use it
head(heart)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
str(heart)
## 'data.frame': 303 obs. of 13 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
summary(heart)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal
## Min. :0.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.314
## 3rd Qu.:3.000
## Max. :3.000
Next, we will standardise the variables with scale(), since they are measured on different scales.
heart= scale(heart)
Next, we examine the relationships between these variables using a correlation matrix.
library(corrplot)
## corrplot 0.90 loaded
heart_matrix <- data.matrix(heart, rownames.force = NA)
M <- cor(heart_matrix)
corrplot(M, method = "number", number.cex = 0.70, order="hclust")
From the correlation matrix, one can see that the dataset contains some correlated features, but only at a low level, e.g. slope and thalach (0.39).
We will use the interquartile range (IQR) statistic to check whether the dataset contains outliers. The IQR is defined as IQR = Q3 − Q1, and observations below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as potential outliers.
vars <- colnames(heart)
Outliers <- c()
for (i in vars) {
  # IQR rule: flag values more than 1.5 * IQR beyond the quartiles
  max <- quantile(heart[, i], 0.75) + (IQR(heart[, i]) * 1.5)
  min <- quantile(heart[, i], 0.25) - (IQR(heart[, i]) * 1.5)
  idx <- which(heart[, i] < min | heart[, i] > max)
  print(paste(i, length(idx), sep = ' '))  # variable name and number of potential outliers
  Outliers <- c(Outliers, idx)
}
## [1] "age 0"
## [1] "sex 0"
## [1] "cp 0"
## [1] "trestbps 9"
## [1] "chol 5"
## [1] "fbs 45"
## [1] "restecg 0"
## [1] "thalach 1"
## [1] "exang 0"
## [1] "oldpeak 5"
## [1] "slope 0"
## [1] "ca 25"
## [1] "thal 2"
The results show that individual features contain up to 45 potential outliers (fbs). We will plot the variables that contain outliers.
# Scatter plots of the variables flagged as containing potential outliers
par(mfrow = c(2, 2))
outlier_vars <- colnames(heart)[c(4:6, 8, 10, 12:13)]  # trestbps, chol, fbs, thalach, oldpeak, ca, thal
for (i in outlier_vars) {
  plot(heart[, i], main = paste("Plot of ", i), ylab = i)
}
For this analysis, none of the observations will be excluded. Before carrying out the clustering techniques, we need to determine the optimal number of clusters:
library(gridExtra)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
a <- fviz_nbclust(heart, FUNcluster = kmeans, method = "silhouette") + theme_classic()
b <- fviz_nbclust(heart, FUNcluster = hcut, method = "silhouette") + theme_classic()
grid.arrange(a, b)
According to these silhouette plots, the optimal number of clusters is 2 for both k-means and hierarchical clustering.
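As an optional cross-check (an extra step, not part of the original workflow), the average silhouette width behind these plots can also be computed directly with cluster::silhouette(); for this data, k = 2 should give the larger value.

library(cluster)
set.seed(123)   # arbitrary seed so the k-means starts are reproducible
avg_sil <- sapply(2:3, function(k) {
  km <- kmeans(heart, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(heart))[, "sil_width"])
})
setNames(avg_sil, paste0("k = ", 2:3))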
K-means is one of the simplest unsupervised learning algorithms for the clustering problem. The procedure classifies a given data set into a certain number of clusters k, fixed a priori. The main idea is to define k centres, one per cluster, assign each observation to its nearest centre, recompute each centre as the mean of its assigned observations, and repeat until the assignments no longer change.
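For intuition, here is a minimal sketch of that assign-and-update loop (Lloyd's algorithm). It is an illustration only: it uses the scaled heart matrix from above, a fixed number of iterations, an arbitrary random seed, and it ignores edge cases such as empty clusters. The actual analysis relies on the kmeans() implementation called through eclust().

set.seed(1)                                   # arbitrary seed, for reproducibility only
k <- 2
centres <- heart[sample(nrow(heart), k), , drop = FALSE]   # random initial centres
for (iter in 1:10) {
  # assignment step: squared Euclidean distance from every observation to each centre
  d2 <- sapply(1:k, function(j) colSums((t(heart) - centres[j, ])^2))
  assignment <- max.col(-d2)                  # index of the nearest centre
  # update step: each centre becomes the mean of the observations assigned to it
  centres <- t(sapply(1:k, function(j) colMeans(heart[assignment == j, , drop = FALSE])))
}
table(assignment)                             # sizes of the two resulting clusters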
With k-means clustering, we will run the analysis using both the Pearson correlation and the Euclidean distance settings. The results of the two runs will be analysed and compared.
library(factoextra)
cl_kmeans <- eclust(heart, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
a <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 108 0.12
## 2 2 195 0.19
b <- fviz_cluster(cl_kmeans, data = heart, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)
#### EFFICIENCY TEST OF PEARSON CORRELATION
table(cl_kmeans$cluster, heart_data$target)
##
## 0 1
## 1 95 13
## 2 43 152
cl_kmeans1 <- eclust(heart, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
g <- fviz_silhouette(cl_kmeans1)
## cluster size ave.sil.width
## 1 1 108 0.12
## 2 2 195 0.19
h <- fviz_cluster(cl_kmeans1, data = heart, ellipse.type = "convex") + theme_minimal()
grid.arrange(g, h, ncol=2)
#### EFFICIENCY TEST OF EUCLIDEAN DISTANCE
table(cl_kmeans1$cluster, heart_data$target)
##
## 0 1
## 1 95 13
## 2 43 152
The summary of the results above shows no difference between the Pearson correlation and the Euclidean distance runs. This is not surprising: the k-means algorithm itself always partitions observations using Euclidean distances, and the hc_metric setting matters mainly for hierarchical methods. As for accuracy, both runs assign 13 + 43 = 56 observations to the wrong cluster when compared against the held-out target.
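Because the cluster numbers returned by the algorithms are arbitrary, a convenient way to count misclassifications is to take the better of the two possible cluster-to-class matchings. The small helper below is a hypothetical convenience function, not part of the original analysis; applied to the k-means result above it should return 56.

# Hypothetical helper: for a 2-cluster solution and the true binary labels,
# count misassigned observations under the better of the two label matchings.
misassigned <- function(cluster, truth) {
  tab <- table(cluster, truth)
  min(sum(diag(tab)), sum(tab) - sum(diag(tab)))
}
misassigned(cl_kmeans$cluster, heart_data$target)   # expected: 13 + 43 = 56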
Hierarchical clustering addresses a limitation of k-means clustering: it removes the need to fix the number of clusters before the analysis. In the agglomerative form, each observation starts in its own cluster; the two most similar clusters are then merged, and this step is repeated until only a single cluster remains. As the name implies, the method builds a hierarchy of clusters.
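As a small illustration of this bottom-up merging (using only the first five observations of the scaled data, an arbitrary choice for display), hclust() records exactly this sequence of merges.

# Each observation starts as its own cluster; hclust() records every merge
# and the dissimilarity (height) at which it happens.
d_small  <- dist(heart[1:5, ])
hc_small <- hclust(d_small, method = "complete")
hc_small$merge    # negative entries = single observations, positive = earlier merges
hc_small$height   # height at which each merge takes place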
We will install the libraries below for this analysis.
library(tidyverse) # data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(cluster) # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
##
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
There are two types of hierarchical clustering: agglomerative (bottom-up) and divisive (top-down). We will carry out our analysis with both.
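As a hedged side check (an extra step, not in the original workflow), the cluster package also reports an agglomerative coefficient (agnes()$ac) and a divisive coefficient (diana()$dc), which describe how much clustering structure each approach finds; values closer to 1 indicate stronger structure.

library(cluster)
# Agglomerative coefficient for Ward linkage and divisive coefficient for DIANA
agnes(heart, method = "ward")$ac
diana(heart)$dc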
We will apply four linkage methods (single, average, complete and Ward) and compare their results. Single linkage merges the two clusters with the smallest minimum pairwise distance, complete linkage uses the largest pairwise distance, average linkage uses the mean pairwise distance, and Ward's method merges the pair of clusters that gives the smallest increase in within-cluster variance.
d <- dist(heart, method = "euclidean")
hc1 <- hclust(d, method = "single" )
plot(hc1, cex = 0.6, hang = -1)
##### EFFICIENCY TEST OF SINGLE LINKAGE
clusterCut1 <- cutree(hc1, 2)
table(clusterCut1, heart_data$target)
##
## clusterCut1 0 1
## 1 137 165
## 2 1 0
d <- dist(heart, method = "euclidean")
hc2 <- hclust(d, method = "average" )
plot(hc2, cex = 0.6, hang = -1)
##### EFFICIENCY TEST OF AVERAGE LINKAGE
clusterCut2 <- cutree(hc2, 2)
table(clusterCut2, heart_data$target)
##
## clusterCut2 0 1
## 1 138 164
## 2 0 1
d <- dist(heart, method = "euclidean")
hc3 <- hclust(d, method = "complete" )
plot(hc3, cex = 0.6, hang = -1)
##### EFFICIENCY TEST OF COMPLETE LINKAGE
clusterCut3 <- cutree(hc3, 2)
table(clusterCut3, heart_data$target)
##
## clusterCut3 0 1
## 1 138 164
## 2 0 1
d <- dist(heart, method = "euclidean")
hc4 <- hclust(d, method = "ward.D" )
plot(hc4, cex = 0.6, hang = -1)
##### EFFICIENCY TEST OF WARD LINKAGE
clusterCut4 <- cutree(hc4, 2)
table(clusterCut4, heart_data$target)
##
## clusterCut4 0 1
## 1 56 154
## 2 82 11
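Before drawing conclusions, the four linkage methods can be compared directly by reusing the hypothetical misassigned() helper defined in the k-means section; the counts follow from the tables above.

# Misassignment counts under the better cluster-to-class matching.
# From the tables above: single ~137, average ~139, complete ~139, ward ~67.
sapply(list(single   = clusterCut1,
            average  = clusterCut2,
            complete = clusterCut3,
            ward     = clusterCut4),
       misassigned, truth = heart_data$target)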
Judging from the tables, the single, average and complete linkages are of little practical use here: each of them places essentially all observations into one large cluster, which corresponds to roughly 137-139 wrongly assigned observations under the better cluster-to-class matching. Ward linkage is the only agglomerative method that produces a meaningful split, with 56 + 11 = 67 wrongly assigned observations.
As for divisive hierarchical clustering, it can be implemented with the cluster::diana() function, here called through factoextra::eclust(); the only parameter that has to be set is the number of clusters.
hc5 <- eclust(heart, k=2, FUNcluster="diana")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
pltree(hc5, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc5, k=2, border='red')
clusterCut5 <- cutree(hc5, 2)
table(clusterCut5, heart_data$target)
##
## clusterCut5 0 1
## 1 114 92
## 2 24 73
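The same helper gives the counts for the divisive solution and, for comparison, the best agglomerative one (Ward).

# Expected, from the tables above: diana 92 + 24 = 116, ward 56 + 11 = 67
sapply(list(diana = clusterCut5, ward = clusterCut4),
       misassigned, truth = heart_data$target)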
The divisive algorithm assigns 92 + 24 = 116 observations to the wrong cluster. It therefore clearly outperforms the single, average and complete linkages, which collapse almost all observations into a single cluster, but it remains noticeably worse than Ward linkage (67 wrongly assigned observations).
In terms of accuracy, the best algorithm in this analysis is k-means clustering: it has the smallest number of wrongly assigned observations (56). So, for the purpose of separating patients with and without heart disease based on this data, k-means proves to be the most effective technique.
Ultimately, though, selecting the best algorithm depends on the individual goals and desired outcomes, and on the dataset itself, so I cannot categorically state that k-means is the best clustering technique in general.
In summary, it is worth remembering that clustering algorithms can be applied in many fields. This analysis came from the medical field; other fields include marketing, insurance, social media, geography and many more. Choosing the right algorithm for the analysis at hand makes the difference and gives a more reliable outcome.