INTRODUCTION

Clustering is widely regarded as one of the most important problems in unsupervised learning, for an obvious reason: it finds structure in unlabelled data, making sense of a dataset that otherwise appears chaotic. In clustering, we group objects with homogeneous attributes together, in such a way that objects within a group are similar to one another while the groups themselves are highly heterogeneous.

For this analysis we will cluster a dataset using two techniques, K-means clustering and hierarchical clustering. Afterwards, we will compare the results and assess which of these techniques did a better job.

DATASET

The dataset used for this analysis concerns the prediction of heart disease. It was obtained from https://www.kaggle.com/faressayah/predicting-heart-disease-using-machine-learning/data and contains various attributes that can be used to predict whether heart disease is present or not. The attributes are:

  1. age - age in years
  2. sex - (1 = male; 0 = female)
  3. cp - chest pain type (0: typical angina, chest pain related to decreased blood supply to the heart; 1: atypical angina, chest pain not related to the heart; 2: non-anginal pain, typically esophageal spasms (not heart related); 3: asymptomatic, chest pain not showing signs of disease)
  4. trestbps - resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern
  5. chol - serum cholesterol in mg/dl (serum = LDL + HDL + 0.2 * triglycerides); above 200 is cause for concern
  6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false); above 126 mg/dl signals diabetes
  7. restecg - resting electrocardiographic results (0: nothing to note; 1: ST-T wave abnormality, which can range from mild symptoms to severe problems and signals a non-normal heartbeat; 2: possible or definite left ventricular hypertrophy, an enlarged main pumping chamber of the heart)
  8. thalach - maximum heart rate achieved
  9. exang - exercise-induced angina (1 = yes; 0 = no)
  10. oldpeak - ST depression induced by exercise relative to rest; reflects the stress on the heart during exercise (an unhealthy heart will stress more)
  11. slope - the slope of the peak exercise ST segment (0: upsloping, better heart rate with exercise (uncommon); 1: flat, minimal change (typical healthy heart); 2: downsloping, signs of an unhealthy heart)
  12. ca - number of major vessels (0-3) colored by fluoroscopy; a colored vessel means the doctor can see the blood passing through, and the more blood movement the better (no clots)
  13. thal - thallium stress test result (1, 3: normal; 6: fixed defect, there used to be a defect but it is okay now; 7: reversible defect, no proper blood movement when exercising)
  14. target - have disease or not (1=yes, 0=no) (= the predicted attribute)

DATA INSPECTION AND SUMMARY

We will first inspect the dataset. Afterwards, we will remove the column target, since the clustering analysis is meant to recover this outcome without using it.

heart_data= read.csv("~/Downloads/heart.csv.xls")
heart= heart_data
heart$target=NULL
head(heart)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6  57   1  0      140  192   0       1     148     0     0.4     1  0    1
str(heart)
## 'data.frame':    303 obs. of  13 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
summary(heart)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal      
##  Min.   :0.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.314  
##  3rd Qu.:3.000  
##  Max.   :3.000

Next, we will normalize the values of our variables since they are on different scales.

heart= scale(heart)
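
As a quick sanity check (a minimal sketch added here, not part of the original chunk), scale() centers each column at mean 0 and rescales it to standard deviation 1, which is equivalent to computing z-scores by hand:

# after scaling, every column should have mean ~0 and standard deviation 1
round(colMeans(heart), 3)
round(apply(heart, 2, sd), 3)
# manual z-score for one column, e.g. age, matches the scaled column
z_age <- (heart_data$age - mean(heart_data$age)) / sd(heart_data$age)
all.equal(as.numeric(heart[, "age"]), z_age)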

CORRELATION MATRIX

Here, we will examine the relationships that exist between the variables.

library(corrplot)
## corrplot 0.90 loaded
heart_matrix <- data.matrix(heart, rownames.force = NA)
M <- cor(heart_matrix)
corrplot(M, method = "number", number.cex = 0.70, order="hclust")

The correlation matrix shows that the dataset contains some features that are correlated, but only weakly, e.g. slope and thalach (0.39).
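
To read the matrix programmatically, the pairs with the largest absolute correlations can also be listed directly (a small sketch, not part of the original analysis; the 0.3 threshold is arbitrary):

# work on a copy so the matrix used for plotting is left untouched
M2 <- M
M2[upper.tri(M2, diag = TRUE)] <- NA          # keep each pair only once
idx <- which(abs(M2) > 0.3, arr.ind = TRUE)
data.frame(var1 = rownames(M2)[idx[, 1]],
           var2 = colnames(M2)[idx[, 2]],
           cor  = round(M2[idx], 2))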

We will use the interquartile range (IQR) statistic to check whether the dataset contains outliers. The IQR is defined as

IQR = Q3 − Q1

and observations below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as potential outliers.

vars <- colnames(heart)
Outliers <- c()
for(i in vars){
  max <- quantile(heart[,i], 0.75) + (IQR(heart[,i]) * 1.5)  # upper fence: Q3 + 1.5 * IQR
  min <- quantile(heart[,i], 0.25) - (IQR(heart[,i]) * 1.5)  # lower fence: Q1 - 1.5 * IQR
  idx <- which(heart[,i] < min | heart[,i] > max)
  print(paste(i, length(idx), sep=' '))  # variable name and number of potential outliers
  Outliers <- c(Outliers, idx)
}
## [1] "age 0"
## [1] "sex 0"
## [1] "cp 0"
## [1] "trestbps 9"
## [1] "chol 5"
## [1] "fbs 45"
## [1] "restecg 0"
## [1] "thalach 1"
## [1] "exang 0"
## [1] "oldpeak 5"
## [1] "slope 0"
## [1] "ca 25"
## [1] "thal 2"

The results show that some features contain potential outliers, up to 45 in the case of fbs (fbs is a binary indicator, so the IQR rule simply flags all of its less frequent values). We will plot the variables that contain outliers.

par(mfrow=c(2,2))
colnames <- colnames(heart[,c(4:6, 8, 10, 12:13)])
for (i in colnames) {
  plot(heart[,i], main = paste("Plot of ", i), ylab = i)
}

For this analysis, none of the observations will be excluded. To carry out the clustering techniques, we need to determine the optimum number of clusters:

library(gridExtra)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
a <- fviz_nbclust(heart, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(heart, FUNcluster = hcut, method = "silhouette") + theme_classic() 

grid.arrange(a, b)

According to the results, the optimal number of clusters is 2 for both K-means and hierarchical clustering.
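
As a cross-check (a sketch reusing the same factoextra helper; it was not part of the original analysis), the total within-cluster sum of squares ("elbow") criterion can be plotted as well; a pronounced bend in the curve suggests a reasonable number of clusters:

# elbow method for k-means and for hierarchical clustering (hcut)
p1 <- fviz_nbclust(heart, FUNcluster = kmeans, method = "wss") + theme_classic()
p2 <- fviz_nbclust(heart, FUNcluster = hcut, method = "wss") + theme_classic()
grid.arrange(p1, p2)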

KMEANS CLUSTERING

K-means is one of the simplest unsupervised learning algorithms for the clustering problem. The procedure classifies a given dataset into a certain number of clusters (say k) fixed a priori. The main idea is to define k centers, one for each cluster, assign each observation to its nearest center, and update the centers iteratively until the assignments stabilise.
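
To make the procedure concrete, here is a minimal sketch using base R's kmeans() on the scaled data (the object name km_base is illustrative; nstart repeats the random initialization several times and keeps the best solution):

set.seed(123)                                   # k-means starts from random centers
km_base <- kmeans(heart, centers = 2, nstart = 25)
km_base$size                                    # number of observations per cluster
head(km_base$cluster)                           # cluster labels of the first observations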

With K-means clustering, we will analyse our data using the Pearson correlation and the Euclidean distance. The results from both will be compared.

PEARSON CORRELATION

library(factoextra)
cl_kmeans <- eclust(heart, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
a <- fviz_silhouette(cl_kmeans)
##   cluster size ave.sil.width
## 1       1  108          0.12
## 2       2  195          0.19
b <- fviz_cluster(cl_kmeans, data = heart, ellipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)

EFFICIENCY TEST OF PEARSON CORRELATION

table(cl_kmeans$cluster, heart_data$target)
##    
##       0   1
##   1  95  13
##   2  43 152

EUCLIDEAN DISTANCE

cl_kmeans1 <- eclust(heart, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
g <- fviz_silhouette(cl_kmeans1)
##   cluster size ave.sil.width
## 1       1  108          0.12
## 2       2  195          0.19
h <- fviz_cluster(cl_kmeans1, data = heart, ellipse.type = "convex") + theme_minimal()
grid.arrange(g, h, ncol=2)

EFFICIENCY TEST OF EUCLIDEAN DISTANCE

table(cl_kmeans1$cluster, heart_data$target)
##    
##       0   1
##   1  95  13
##   2  43 152

KMEANS CLUSTERING RESULT INTERPRETATION

The results above show no difference between the Pearson correlation and Euclidean distance runs. This is expected: the hc_metric argument of eclust() applies only to hierarchical clustering functions, so it does not change the k-means result. In terms of accuracy, 13 + 43 = 56 observations are assigned to the wrong cluster.
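
The misassignment count quoted above can be recomputed directly from the confusion table; the sketch below matches each cluster to its majority class before counting errors (object names are illustrative):

# observations falling outside their cluster's majority class
tab <- table(cl_kmeans$cluster, heart_data$target)
sum(tab) - sum(apply(tab, 1, max))   # 13 + 43 = 56 for the k-means solution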

HIERARCHICAL CLUSTERING

Hierarchical clustering addresses a limitation of K-means clustering: it does not require the number of clusters to be fixed before the analysis. In the agglomerative form, each observation starts in its own cluster; based on the homogeneity of these clusters, the most similar clusters are merged, and the process is repeated until only a single cluster remains. As the name implies, the method builds a hierarchy of clusters.

We will install the libraries below for this analysis.

library(tidyverse)  # data manipulation
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
library(cluster)    # clustering algorithms
library(factoextra) # clustering visualization
library(dendextend) # for comparing two dendrograms
## 
## ---------------------
## Welcome to dendextend version 1.15.2
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree

There are two types of hierarchical clustering: agglomerative and divisive. We will carry out our analysis with both.

AGGLOMERATIVE HIERARCHICAL CLUSTERING

We will apply four linkage methods (single, average, complete, and Ward) and compare their results.

SINGLE LINKAGE

d <- dist(heart, method = "euclidean")
hc1 <- hclust(d, method = "single" )
plot(hc1, cex = 0.6, hang = -1)

EFFICIENCY TEST OF SINGLE LINKAGE

clusterCut1 <- cutree(hc1, 2)
table(clusterCut1, heart_data$target)
##            
## clusterCut1   0   1
##           1 137 165
##           2   1   0

AVERAGE LINKAGE

d <- dist(heart, method = "euclidean")
hc2 <- hclust(d, method = "average" )
plot(hc2, cex = 0.6, hang = -1)

EFFICIENCY TEST OF AVERAGE LINKAGE

clusterCut2 <- cutree(hc2, 2)
table(clusterCut2, heart_data$target)
##            
## clusterCut2   0   1
##           1 138 164
##           2   0   1

COMPLETE LINKAGE

d <- dist(heart, method = "euclidean")
hc3 <- hclust(d, method = "complete" )
plot(hc3, cex = 0.6, hang = -1)

EFFICIENCY TEST OF COMPLETE LINKAGE

clusterCut3 <- cutree(hc3, 2)
table(clusterCut3, heart_data$target)
##            
## clusterCut3   0   1
##           1 138 164
##           2   0   1

WARD LINKAGE

d <- dist(heart, method = "euclidean")
hc4 <- hclust(d, method = "ward.D" )
plot(hc4, cex = 0.6, hang = -1)

EFFICIENCY TEST OF WARD LINKAGE

clusterCut4 <- cutree(hc4, 2)
table(clusterCut4, heart_data$target)
##            
## clusterCut4   0   1
##           1  56 154
##           2  82  11

Comparing the four cross-tabulations, single, average, and complete linkage collapse almost all observations into a single cluster (the second cluster contains only one observation), which makes them of little practical use here. Ward linkage is the only method that produces a genuine split, and it is the most accurate of the four, with 56 + 11 = 67 observations falling outside their cluster's majority class.
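
This comparison can be condensed into a short loop (a sketch reusing the Euclidean distance matrix d and the majority-class counting used above; it was not part of the original analysis):

# wrongly assigned observations per linkage method
linkages <- c("single", "average", "complete", "ward.D")
sapply(linkages, function(m) {
  cl <- cutree(hclust(d, method = m), k = 2)
  tab <- table(cl, heart_data$target)
  sum(tab) - sum(apply(tab, 1, max))   # observations outside the majority class
})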

DIVISIVE HIERARCHICAL CLUSTERING

Divisive hierarchical clustering can be implemented with the cluster::diana() function; here it is run through eclust(), where the only parameter that has to be set is the number of clusters.
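
Before running it through eclust() below, here is a minimal sketch of the direct cluster::diana() call (Euclidean dissimilarities are assumed, matching the distance used earlier; object names are illustrative):

# divisive clustering on the scaled data, then cut the tree into two groups
dv <- diana(heart, metric = "euclidean")
pltree(dv, cex = 0.6, hang = -1, main = "Dendrogram of DIANA (cluster::diana)")
cut_dv <- cutree(as.hclust(dv), k = 2)
table(cut_dv, heart_data$target)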

hc5 <- eclust(heart, k=2, FUNcluster="diana")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
pltree(hc5, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc5, k=2, border='red')

clusterCut5 <- cutree(hc5, 2)
table(clusterCut5, heart_data$target)
##            
## clusterCut5   0   1
##           1 114  92
##           2  24  73

HIERARCHICAL CLUSTERING RESULT

The divisive algorithm produces a far more balanced split than single, average, and complete linkage, which placed nearly all observations in one cluster. With 92 + 24 = 116 wrongly assigned observations it outperforms those three methods, although it does not match Ward linkage (56 + 11 = 67 wrongly assigned).

CONCLUSION

Based on this analysis, the best algorithm in terms of efficiency is K-means clustering: it had the fewest observations wrongly assigned to clusters (56). For the purpose of predicting heart disease from this dataset, the K-means clustering technique therefore proves the most efficient.

Ultimately, though, selecting the best algorithm depends on individual goals and desired outcomes, as well as on the dataset itself, so I cannot categorically state that K-means is the best clustering technique in general.

In summary, it is important to understand that clustering algorithms can be applied in many fields. This analysis was applied to the medical field; other fields include marketing, insurance, social media, geography, and many more. Choosing the most suitable algorithm for the analysis at hand makes the difference and leads to more reliable outcomes.