We have decided to use the Kaggle dataset Heart Disease patients authored by Abid Ali Awan. The data we are analyzing comes from the Veterans Administration Medical Center in Long Beach, California, USA.

The data has 12 variables, namely:

id, Index, numeric;
age, Age of patient, numeric;
sex, Gender of patient, categorical* 2 levels (0 = Male, 1 = Female);
cp, Chest pain type, ordinal (levels 1 to 4 indicating the severity of pain subject feels, 4 being the highest);
trestbps, Resting systolic blood pressure (in mm/Hg on admission to the hospital), numeric;
chol, Serum cholesterol in mg/dl, numeric;
fbs, Fasting blood sugar > 120 mg/dl (1 = true; 0 = false), categorical;
restecg, Resting electrocardiographic results (0 = normal result, 2 = ST-T wave abnormality), categorical;
thalach, Maximum heart rate achieved, numeric;
exang, Exercise induced angina (1 = yes; 0 = no), categorical;
oldpeak, ST depression induced by exercise relative to rest, numeric;
slope, The slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping), categorical.

It is known that upsloping is relatively harmless and there is low likelihood of massive coronary artery disease (CAD). A flat result suggests further testing is required to determine whether a subject has a massive CAD, while downsloping suggests a high likelihood of a massive CAD. Oldpeak is a measure of myocardial ischemia (inadequate blood supply to the heart muscle) and the bigger value is indicative of a more severe coronary disease.

Exploratory data analysis

We load the data:

library(readr)
library(psych)

data <- read_csv("C:/R/heart_disease_patients.csv")

## Rows: 303 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): id, age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, ol...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# recode the factor variables

data$sex <- factor(data$sex, levels = c(0, 1), labels = c("Male", "Female"))
data$fbs <- factor(data$fbs, levels = c(0, 1), labels = c("False", "True"))
data$restecg <- factor(data$restecg, levels = c(0, 2), labels = c("Normal", "ST-T abnormality"))
data$exang <- factor(data$exang, levels = c(0, 1), labels = c("No", "Yes"))
data$slope <- factor(data$slope, levels = c(1, 2, 3), labels = c("Upsloping", "Flat", "Downsloping"))

# preview the data
head(data)

## # A tibble: 6 × 12
##      id   age sex       cp trestbps  chol fbs   restecg    thalach exang oldpeak
##   <dbl> <dbl> <fct>  <dbl>    <dbl> <dbl> <fct> <fct>        <dbl> <fct>   <dbl>
## 1     1    63 Female     1      145   233 True  ST-T abno…     150 No        2.3
## 2     2    67 Female     4      160   286 False ST-T abno…     108 Yes       1.5
## 3     3    67 Female     4      120   229 False ST-T abno…     129 Yes       2.6
## 4     4    37 Female     3      130   250 False Normal         187 No        3.5
## 5     5    41 Male       2      130   204 False ST-T abno…     172 No        1.4
## 6     6    56 Female     2      120   236 False Normal         178 No        0.8
## # ℹ 1 more variable: slope <fct>

# check for missing values
colSums(apply(data, 2, is.na))

##       id      age      sex       cp trestbps     chol      fbs  restecg 
##        0        0        0        0        0        0        0        4 
##  thalach    exang  oldpeak    slope 
##        0        0        0        0

# print the descriptive statistics 
summary(data)

##        id             age            sex            cp           trestbps    
##  Min.   :  1.0   Min.   :29.00   Male  : 97   Min.   :1.000   Min.   : 94.0  
##  1st Qu.: 76.5   1st Qu.:48.00   Female:206   1st Qu.:3.000   1st Qu.:120.0  
##  Median :152.0   Median :56.00                Median :3.000   Median :130.0  
##  Mean   :152.0   Mean   :54.44                Mean   :3.158   Mean   :131.7  
##  3rd Qu.:227.5   3rd Qu.:61.00                3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :303.0   Max.   :77.00                Max.   :4.000   Max.   :200.0  
##       chol          fbs                  restecg       thalach      exang    
##  Min.   :126.0   False:258   Normal          :151   Min.   : 71.0   No :204  
##  1st Qu.:211.0   True : 45   ST-T abnormality:148   1st Qu.:133.5   Yes: 99  
##  Median :241.0               NA's            :  4   Median :153.0            
##  Mean   :246.7                                      Mean   :149.6            
##  3rd Qu.:275.0                                      3rd Qu.:166.0            
##  Max.   :564.0                                      Max.   :202.0            
##     oldpeak             slope    
##  Min.   :0.00   Upsloping  :142  
##  1st Qu.:0.00   Flat       :140  
##  Median :0.80   Downsloping: 21  
##  Mean   :1.04                    
##  3rd Qu.:1.60                    
##  Max.   :6.20

We can see that the non-numeric variables are imbalanced, i.e. there exists the biggest class in most cases. Moreover, no missing data are present except for restecg, where there are 4 subjects with missing values. For simplicity, we will assume those findings to be normal and impute with the majority class.

data$restecg[is.na(data$restecg)] <- "Normal"

Let us now visualise the data.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

# plot histogram for age
ggplot(data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(title = "Histogram of Age", x = "Age", y = "Frequency")

# plot histogram for trestbps
ggplot(data, aes(x = trestbps)) +
  geom_histogram(binwidth = 10, fill = "green", color = "black") +
  labs(title = "Histogram of Resting Systolic Blood Pressure", x = "Resting Systolic Blood Pressure", y = "Frequency")

# plot histogram for chol
ggplot(data, aes(x = chol)) +
  geom_histogram(binwidth = 20, fill = "red", color = "black") +
  labs(title = "Histogram of Serum Cholesterol", x = "Serum Cholesterol (mg/dl)", y = "Frequency")

# Plot histogram for thalach
ggplot(data, aes(x = thalach)) +
  geom_histogram(binwidth = 10, fill = "purple", color = "black") +
  labs(title = "Histogram of Maximum Heart Rate Achieved", x = "Maximum Heart Rate Achieved", y = "Frequency")

# plot histogram for oldpeak
ggplot(data, aes(x = oldpeak)) +
  geom_histogram(binwidth = 0.5, fill = "orange", color = "black") +
  labs(title = "Histogram of ST Depression", x = "ST Depression Induced by Exercise", y = "Frequency")

Every variable does not look normally distributed, i.e. symmetrical and bell-shaped. We can test for normality using Shapiro’s test. We write the function which tests for normality and test the columns above:

check_normality <- function(variable) {
  test <- shapiro.test(variable)
  p_value <- test$p.value
  
  if (p_value < 0.05) {
    return(paste("p-value:", round(p_value, 3), "- The data are not normally distributed."))
  } else {
    return(paste("p-value:", round(p_value, 3), "- The data are normally distributed.")) # paste simply makes a string, similar to f-strings in Python
  }
}
# check normality for age - cat concatenates strings and "\n" is a new line
cat("Age:", check_normality(data$age), "\n")

## Age: p-value: 0.006 - The data are not normally distributed.

# check normality for trestbps
cat("Resting Systolic Blood Pressure:", check_normality(data$trestbps), "\n")

## Resting Systolic Blood Pressure: p-value: 0 - The data are not normally distributed.

# check normality for chol
cat("Serum Cholesterol:", check_normality(data$chol), "\n")

## Serum Cholesterol: p-value: 0 - The data are not normally distributed.

# check normality for thalach
cat("Maximum Heart Rate Achieved:", check_normality(data$thalach), "\n")

## Maximum Heart Rate Achieved: p-value: 0 - The data are not normally distributed.

# check normality for oldpeak, makes little sense but let's do it anyways
cat("ST Depression:", check_normality(data$oldpeak), "\n")

## ST Depression: p-value: 0 - The data are not normally distributed.

As expected, Shapiro’s test rejects the null hypothesis of normality. This is to be expected, since we are not dealing with normal population, but subjects who already have some sort of disease, hence the deviations from the average values are to be expected.

A flavor of testing-type homework

We can, for example, test whether the median cholesterol levels deviate from normal, which are less than 200 mg/dl in an average subject, disregarding age, sex and other factors.

# load the effectsize library
library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

# perform Wilcoxon signed rank test to check if the median cholesterol level is greater than 200
# hypothesis: h0: median cholesterol level = 200, h1: median cholesterol level > 200
wilcox.test(data$chol, mu = 200, alternative = "greater")

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$chol
## V = 42158, p-value < 2.2e-16
## alternative hypothesis: true location is greater than 200

interpret_rank_biserial(rank_biserial(wilcox.test(data$chol, mu = 200, alternative = "greater")))

## r (rank biserial) |       95% CI | Interpretation
## -------------------------------------------------
## 0.84              | [0.81, 1.00] |     very large
## 
## - Deviation from a difference of 200.
## - One-sided CIs: upper bound fixed at [1.00].
## - Interpretation rule: funder2019

As expected, the median cholesterol levels are significantly higher than 200, pointing to the abnormal levels in sick patients. Bisserial correlation is 0.84, which corresponds to a large effect size according to Funder.

We can to the same with systolic blood pressure in rest, assumed to be 120 mm/Hg in a healthy adult. We want to test whether the median blood pressure in our subjects is greater than the normal value.

wilcox.test(data$trestbps, mu = 120, alternative = "greater")

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data$trestbps
## V = 30220, p-value < 2.2e-16
## alternative hypothesis: true location is greater than 120

interpret_rank_biserial(rank_biserial(wilcox.test(data$trestbps, mu = 120, alternative = "greater")))

## r (rank biserial) |       95% CI | Interpretation
## -------------------------------------------------
## 0.70              | [0.64, 1.00] |     very large
## 
## - Deviation from a difference of 120.
## - One-sided CIs: upper bound fixed at [1.00].
## - Interpretation rule: funder2019

Again, as expected, the test results suggest that systolic blood pressure is greater than 120 mm/Hg with a very large effect size (r=0.71). We can go on, testing for differences between men and women and so on, but this is outside of the scope of this homework and testing part serves only to illustrate a possible avenue of research.

# load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# select specified columns
selected_data <- data %>%
  select(age, trestbps, chol, thalach, cp, oldpeak)

# scale the data
scaled_data <- as.data.frame(scale(selected_data))

# calculate the dissimilarity index
scaled_data$Dissimilarity <- sqrt(scaled_data$age^2 + scaled_data$trestbps^2 + scaled_data$chol^2 + scaled_data$thalach^2+scaled_data$cp^2+scaled_data$oldpeak^2)

# show the top rows ordered by descending dissimilarity
top_outliers <- head(scaled_data[order(-scaled_data$Dissimilarity), ], 5)

print(top_outliers)

##            age   trestbps       chol    thalach         cp   oldpeak
## 153 1.38970298 -0.9482959  6.1283473  0.4543273 -0.1649949 0.4826527
## 92  0.83652378  1.6085589 -1.5971030 -0.2014103  0.8765355 4.4444984
## 127 0.17270875  3.8813188  0.7977866 -0.7260004  0.8765355 2.5497026
## 184 0.50461627  2.6313008  0.4501413 -0.2014103 -2.2480557 2.7219568
## 124 0.06207291  0.4721790 -0.5734808 -1.6877489  0.8765355 3.9277359
##     Dissimilarity
## 153      6.391695
## 92       5.138134
## 127      4.850495
## 184      4.459190
## 124      4.427144

We see that the observation 153 is the outlier and we will remove it from the data, rescale and compute dissimilarity again.

# select specified columns
selected_data <- data[-153, ] %>%
  select(age, trestbps, chol, thalach, cp, oldpeak)

# scale the data
scaled_data <- as.data.frame(scale(selected_data))

# calculate the dissimilarity index
scaled_data$Dissimilarity <- sqrt(scaled_data$age^2 + scaled_data$trestbps^2 + scaled_data$chol^2 + scaled_data$thalach^2+scaled_data$cp^2+scaled_data$oldpeak^2)

# show the top rows ordered by descending dissimilarity
top_outliers <- head(scaled_data[order(-scaled_data$Dissimilarity), ], 5)

print(top_outliers)

##            age  trestbps       chol    thalach         cp  oldpeak
## 92  0.84243865 1.6051582 -1.6826684 -0.1996431  0.8745772 4.440448
## 127 0.17758724 3.8775491  0.8730003 -0.7235436  0.8745772 2.548060
## 122 0.95324721 1.0370605  3.3256178  0.1932823  0.8745772 2.548060
## 183 0.51001294 2.6277341  0.5020161 -0.1996431 -2.2449776 2.720095
## 124 0.06677867 0.4689628 -0.5903262 -1.6840279  0.8745772 3.924342
##     Dissimilarity
## 92       5.161388
## 127      4.859019
## 122      4.509851
## 183      4.460473
## 124      4.424270

scaled_data$Dissimilarity<-NULL
# we don't need this anymore

Now it looks better. We will not remove any further data points, since the large differences in dissimilarity+ aren’t present in our data.

Is the data clusterable?

Let us investigate the distance between the data points now.

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(NbClust)
Distances <- get_dist(scaled_data, method = "euclidian")


fviz_dist(Distances,
         gradient = list(low = "darkred", mid = "grey95", high = "white"))

get_clust_tendency(scaled_data,
                   n = nrow(scaled_data) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7041747
## 
## $plot
## NULL

The dissimilarity matrix suggests there isn’t that great of a separation between points as in our class, but Hopkins test, used to indicate whether natural groups exist, suggests they do. Our data is deemed clusterable, albeit it’s not ideal. We can see 2 to 4 clear patches of red. We need to carefully determine the appropriate number of clusters.

Hierarchical clustering

nb<-NbClust(scaled_data,
          method ="ward.D2") # default distance is euclidean, 2 to 15 clusters are considered

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

# The data suggests 3 clusters are optimal. 
(WardD2 <- scaled_data %>%
  get_dist(method = "euclidian") %>%
  hclust(method = "ward.D2"))

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 302

fviz_dend(WardD2,  k = 3,                    
          palette = c("orange", "violet", "green"),  
          color_labels_by_k = TRUE)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We opt for Ward’s D squared approach since we are aiming for spherical and compact clusters.

For example, if we are interested in a complete linkage method, utilising a similarity approach, we get this:

(complete <- scaled_data %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "complete"))

## 
## Call:
## hclust(d = ., method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 302

fviz_dend(complete)

Hierarchical K-Means approach

Let us train the hierarchical K-Means model with 3 clusters and see what happens.

HKM <- hkmeans(scaled_data, 
                   k = 3,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

HKM

## Hierarchical K-means clustering with 3 clusters of sizes 74, 97, 131
## 
## Cluster means:
##          age    trestbps       chol    thalach         cp     oldpeak
## 1  0.6956922  0.90731927  0.7833179  0.1797128 -0.3057949 -0.09291508
## 2  0.4757423 -0.09386392 -0.1570864 -0.9467864  0.6923351  0.75499038
## 3 -0.7452536 -0.44302920 -0.3261691  0.5995385 -0.3399060 -0.50655230
## 
## Clustering vector:
##   [1] 1 2 2 3 3 3 2 1 2 2 2 1 1 3 1 3 3 2 3 3 3 1 1 2 2 3 1 1 3 2 1 2 1 1 3 3 2
##  [38] 2 1 1 2 3 1 1 1 2 3 2 1 3 3 2 3 3 2 2 3 3 3 3 2 3 2 1 2 2 2 1 1 2 1 1 2 2
##  [75] 3 1 2 1 3 2 2 2 3 1 3 3 3 2 3 3 1 2 2 3 1 3 2 2 3 3 3 3 1 2 2 3 3 3 2 3 1
## [112] 2 3 2 2 3 3 3 2 2 3 1 3 2 1 3 1 2 3 3 3 3 3 3 3 1 2 2 3 3 3 1 3 2 3 3 1 3
## [149] 3 3 1 3 1 2 2 1 1 1 1 3 1 3 2 3 3 3 1 3 3 2 2 1 1 2 2 3 2 3 3 3 1 3 1 1 3
## [186] 3 1 1 1 3 2 3 2 2 2 1 3 3 1 3 1 3 1 3 1 2 2 3 1 3 3 3 1 3 3 3 3 2 1 3 3 3
## [223] 2 2 3 3 1 2 2 3 1 3 1 1 2 2 3 3 3 3 3 3 1 2 2 3 2 3 3 2 2 2 3 3 3 2 2 1 3
## [260] 3 1 1 3 2 2 3 2 3 3 2 2 2 2 3 1 1 3 1 2 2 3 2 3 1 2 2 3 3 3 1 1 2 2 2 3 2
## [297] 2 3 2 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 374.5193 407.8028 449.4075
##  (between_SS / total_SS =  31.8 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

fviz_cluster(HKM, 
             palette = "set2", 
             repel = TRUE,
             ggtheme = theme_classic())

It looks decent and it can be seen 188 and 231 are far away in the red part, 133, 178, 182, 22 and 58 are far away in the blue part and 30, 122, 124, 92, 155 in the green one. 245 is close to other points, so we do not want to discard it. We opt for even clusters which are well-rounded.

selected_data1 <- data[-c(153, 188, 231, 133, 177, 178, 182, 22, 58, 30, 122, 124, 92, 155),]%>%
  select(age, trestbps, chol, thalach, cp, oldpeak)

# scale the data
scaled_data1 <- as.data.frame(scale(selected_data1))

# we remove the outliers and rescale the data

HKM1 <- hkmeans(scaled_data1, 
                   k = 3,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

HKM1

## Hierarchical K-means clustering with 3 clusters of sizes 137, 95, 57
## 
## Cluster means:
##          age   trestbps        chol    thalach         cp    oldpeak
## 1 -0.6657077 -0.3918684 -0.23602212  0.5948045 -0.3227616 -0.4985808
## 2  0.4865753 -0.1690940 -0.06552748 -0.9121756  0.7013572  0.6836192
## 3  0.7890754  1.2236825  0.67649372  0.0906748 -0.3931684  0.0589779
## 
## Clustering vector:
##   [1] 3 2 2 1 1 1 2 3 2 2 2 3 1 1 3 1 1 2 1 1 1 1 2 2 1 1 3 1 3 2 3 1 1 1 2 2 2
##  [38] 3 2 1 3 3 3 2 1 2 3 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 2 2 3 3 2 3 2 2 2 1 3 2
##  [75] 3 1 2 2 2 1 3 1 1 1 2 1 1 2 2 1 3 1 2 2 1 1 1 1 1 2 2 1 1 1 2 1 3 2 1 2 2
## [112] 1 1 1 2 2 1 1 3 1 3 2 1 1 1 1 1 1 1 2 2 1 1 1 3 1 2 1 1 3 1 1 1 3 2 3 2 3
## [149] 1 3 2 1 3 1 2 1 1 1 1 1 1 3 2 3 3 2 2 1 1 1 1 3 3 1 1 3 3 1 2 1 2 2 2 3 1
## [186] 1 3 1 3 1 3 1 1 2 2 1 3 1 1 1 3 1 1 1 1 2 1 1 1 1 2 2 1 1 3 2 2 3 1 3 3 2
## [223] 2 1 1 1 1 1 1 3 2 2 1 2 1 1 2 2 2 1 1 1 2 2 3 1 1 3 3 1 2 2 1 2 1 1 2 2 2
## [260] 2 1 3 3 1 3 2 2 1 2 1 3 2 3 1 1 1 3 3 2 2 2 1 2 2 1 2 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 488.6907 398.0597 302.8731
##  (between_SS / total_SS =  31.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

fviz_cluster(HKM1, 
             palette = "set2", 
             repel = TRUE,
             ggtheme = theme_classic())

We do not have any other outliers and our clusters are well-balanced.

K-Means approach

We now focus on K-Means algorithm. Let us determine the optimal number of clusters.

library(NbClust)
fviz_nbclust(scaled_data1, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

fviz_nbclust(scaled_data1, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

nb<-NbClust(scaled_data1,
        method ="kmeans") # default distance is euclidean, 2 to 15 clusters are considered

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 6 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 2 proposed 11 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 3 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

Elbow method alone is inconclusive, it could be 2, it could be 6 and anything in between, but sillhouette method suggests 2 as an optimal number of clusters for this analysis. Nbclust function which implements a voting approach, suggests 2. 3 is also high in the number of votes.

KMNH <- kmeans(scaled_data1,
                     centers = 2,
                     nstart = 50,
                    iter.max =500)

KMNH

## K-means clustering with 2 clusters of sizes 153, 136
## 
## Cluster means:
##          age   trestbps       chol    thalach         cp    oldpeak
## 1 -0.5976141 -0.2730750 -0.2082443  0.5865003 -0.4007050 -0.4299568
## 2  0.6723159  0.3072094  0.2342749 -0.6598129  0.4507931  0.4837014
## 
## Clustering vector:
##   [1] 2 2 2 1 1 1 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 2 2 2 2 1 1 2 2 2
##  [38] 2 2 1 2 2 2 1 1 2 2 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2
##  [75] 2 1 2 1 2 1 2 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 1 2 2 1 2 2
## [112] 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 2
## [149] 1 2 2 1 2 1 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 1 1 1 2 1 2 1 2 2 2 2 1
## [186] 1 2 1 2 1 2 1 1 2 2 1 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2 2 1 1 2 2 2 2 1 2 1 2
## [223] 2 1 1 1 1 1 1 2 2 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 1 1 1 1 2 2 1 2 1 1 2 2 2
## [260] 2 1 1 2 1 1 2 2 1 2 1 2 2 2 1 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 623.9686 725.4272
##  (between_SS / total_SS =  21.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We can see that the amount of variability explained by clustering is only 21.9 and that the second smaller cluster has more variability in itself. Let us see whether the outliers contributed to such poor result:

fviz_cluster(KMNH,
             palette = "Set2",
             repel = TRUE,
             ggtheme = theme_bw(),
             data = scaled_data1)

KMNH1 <- kmeans(scaled_data1,
                     centers = 3,
                     nstart = 50,
                    iter.max =500)

KMNH1

## K-means clustering with 3 clusters of sizes 77, 96, 116
## 
## Cluster means:
##          age    trestbps        chol    thalach         cp    oldpeak
## 1  0.7177555  0.60411394  0.55299651  0.2741042 -0.5346015 -0.3018214
## 2  0.4809916  0.05062011 -0.02664637 -0.9396673  0.7250369  0.8076828
## 3 -0.8745032 -0.44289917 -0.34502310  0.5957072 -0.2451657 -0.4680802
## 
## Clustering vector:
##   [1] 1 2 2 3 3 3 2 1 2 2 2 1 1 3 1 3 3 2 3 3 1 1 2 2 3 1 1 3 1 2 1 1 3 3 2 2 2
##  [38] 1 2 3 1 1 1 2 3 2 1 3 3 2 3 3 2 2 3 1 3 2 3 2 1 2 2 2 1 2 2 1 1 2 2 3 1 2
##  [75] 1 3 2 2 2 3 1 1 3 3 2 3 3 1 2 3 1 3 2 2 3 3 3 3 1 1 3 3 3 3 2 3 1 2 3 2 2
## [112] 3 1 3 2 2 3 3 1 3 2 2 3 3 3 3 3 3 1 2 2 3 3 1 1 3 2 3 3 2 3 3 1 1 3 1 2 1
## [149] 1 1 1 3 1 3 2 3 3 3 1 3 3 2 2 1 1 2 2 3 3 3 3 1 1 1 3 1 1 3 2 3 2 2 2 1 3
## [186] 3 1 3 1 3 1 3 3 2 2 1 2 3 3 3 1 3 3 3 3 2 1 3 3 3 2 2 3 3 1 2 2 2 3 1 1 2
## [223] 2 3 3 3 3 3 3 1 2 2 3 2 3 1 2 2 2 3 3 3 2 2 1 1 3 1 1 3 2 2 3 2 3 3 2 2 2
## [260] 2 1 1 1 3 1 2 2 3 2 3 1 2 2 1 3 3 1 1 2 2 2 3 2 2 3 2 2 1 3
## 
## Within cluster sum of squares by cluster:
## [1] 352.5733 430.9888 399.1071
##  (between_SS / total_SS =  31.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

fviz_cluster(KMNH1,
             palette = "Set2",
             repel = TRUE,
             ggtheme = theme_bw(),
             data = scaled_data1)

3 clusters looks quite decent and we are content with the results, comparable to ones in the hierarchical case. 4 clusters would lead to poor separation issues and we omit it here for the reasons of brevity. We see that in terms of variability, second cluster is the most versatile one, but we are ready to pay that price, so not to remove many points. 121 may be an outlier, but that would lead to removal of more points, and the improvement is not significant.

Let us plot the average values per cluster, as shown in class.

library(tidyr)
(centres <- as.data.frame(KMNH1$centers))  # use centroids of our last cluster

##          age    trestbps        chol    thalach         cp    oldpeak
## 1  0.7177555  0.60411394  0.55299651  0.2741042 -0.5346015 -0.3018214
## 2  0.4809916  0.05062011 -0.02664637 -0.9396673  0.7250369  0.8076828
## 3 -0.8745032 -0.44289917 -0.34502310  0.5957072 -0.2451657 -0.4680802

Figure <- as.data.frame(centres)
Figure$id <- 1:nrow(Figure)  # add cluster IDs, not having these before
Figure <- pivot_longer(Figure, 
                       cols = c("age", "trestbps", "chol", "thalach", "cp", "oldpeak"),
                       names_to = "Variable", 
                       values_to = "Value")


Figure$Group <- factor(Figure$id, 
                       levels = c(1, 2, 3),  # we have 3 groups
                       labels = c("Cluster 1", "Cluster 2", "Cluster 3"))

Figure$VariableFactor <- factor(Figure$Variable, 
                                levels = c("age", "trestbps", "chol", "thalach", "cp", "oldpeak"),
                                labels =  c("age", "trestbps", "chol", "thalach", "cp", "oldpeak"))

# plot the data
ggplot(Figure, aes(x = VariableFactor, y = Value)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  theme_classic() +
  geom_point(aes(shape = Group, col = Group), size = 4) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster Variables") +
  scale_color_brewer(palette = "Set2") +
  ylim(-1.44, 1.44) + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 10)) +
  ggtitle("Cluster Centroids")

We are running a one-way ANOVA on each cluster variable to see if these variables really help separate the groups. If we find that the means of two groups are different for each variable, then we know we have picked the right variables for clustering.

data1<-data[-c(153, 188, 231, 133, 177, 178, 182, 22, 58, 30, 122, 124, 92, 155),]
data1$cluster <- KMNH1$cluster

fit <- aov(cbind(age, trestbps, chol, thalach, cp, oldpeak) ~ as.factor(cluster),
           data = data1)
summary(fit)

##  Response age :
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(cluster)   2  12075  6037.5  156.72 < 2.2e-16 ***
## Residuals          286  11018    38.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response trestbps :
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(cluster)   2  15838  7918.8  30.847 7.399e-13 ***
## Residuals          286  73420   256.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response chol :
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(cluster)   2  81503   40751  21.357 2.266e-09 ***
## Residuals          286 545712    1908                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response thalach :
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(cluster)   2  66769   33384  120.52 < 2.2e-16 ***
## Residuals          286  79223     277                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response cp :
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(cluster)   2  72.99  36.495  54.472 < 2.2e-16 ***
## Residuals          286 191.61   0.670                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response oldpeak :
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(cluster)   2 112.53  56.267   70.45 < 2.2e-16 ***
## Residuals          286 228.42   0.799                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response age:

The F value is 156.72 with a p-value of < 2.2e-16. This means there are significant differences in age across the clusters.

Response trestbps (Resting Blood Pressure):

The F value is 30.847 with a p-value of 7.399e-13. This indicates significant differences in resting blood pressure across the clusters.

Response chol (Cholesterol):

The F value is 21.357 with a p-value of 2.266e-09. This suggests significant differences in cholesterol levels across the clusters.

Response thalach (Maximum Heart Rate Achieved):

The F value is 120.52 with a p-value of < 2.2e-16. This means there are significant differences in maximum heart rate achieved across the clusters.

Response CP (Chest Pain Type):

The F value is 54.472 with a p-value of < 2.2e-16. This indicates significant differences in chest pain type across the clusters.

Response oldpeak (ST Depression Induced by Exercise Relative to Rest):

The F value is 70.45 with a p-value of < 2.2e-16. This suggests significant differences in ST depression across the clusters.

The results indicate that all p-values are significantly below the threshold of 0.05, suggesting that the differences between clusters are statistically significant for all variables examined. This demonstrates that the clustering variables effectively differentiate between the groups.

Criterion validity

# we write the function which is going to help us do the chi squared test analysis, as shown in class

chi_square <- function(data, cluster_var, selected_var) {
  # perform chi-square test
  chisq_results <- chisq.test(data[[selected_var]], as.factor(data[[cluster_var]]))
  
  # print chi-square test results
  print(chisq_results)
  
  # add observed frequencies with margins
  observed_with_margins <- addmargins(chisq_results$observed)
  print(observed_with_margins)
  
  # print expected frequencies
  expected_frequencies <- round(chisq_results$expected, 2)
  print(expected_frequencies)
  
  # print standardized residuals
  standardized_residuals <- round(chisq_results$residuals, 2)
  print(standardized_residuals)
}


chi_square(data1, "cluster", "exang")

## 
##  Pearson's Chi-squared test
## 
## data:  data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 48.754, df = 2, p-value = 2.59e-11
## 
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]   1   2   3 Sum
##                  No   62  39  95 196
##                  Yes  15  57  21  93
##                  Sum  77  96 116 289
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##                  No  52.22 65.11 78.67
##                  Yes 24.78 30.89 37.33
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##                  No   1.35 -3.24  1.84
##                  Yes -1.96  4.70 -2.67

chi_square(data1, "cluster", "sex")

## 
##  Pearson's Chi-squared test
## 
## data:  data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 7.8664, df = 2, p-value = 0.01958
## 
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]   1   2   3 Sum
##               Male    34  25  32  91
##               Female  43  71  84 198
##               Sum     77  96 116 289
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##               Male   24.25 30.23 36.53
##               Female 52.75 65.77 79.47
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##               Male    1.98 -0.95 -0.75
##               Female -1.34  0.64  0.51

chi_square(data1, "cluster", "slope")

## Warning in chisq.test(data[[selected_var]], as.factor(data[[cluster_var]])):
## Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 56.999, df = 4, p-value = 1.238e-11
## 
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]   1   2   3 Sum
##          Upsloping    46  16  76 138
##          Flat         27  72  34 133
##          Downsloping   4   8   6  18
##          Sum          77  96 116 289
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##          Upsloping   36.77 45.84 55.39
##          Flat        35.44 44.18 53.38
##          Downsloping  4.80  5.98  7.22
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##          Upsloping    1.52 -4.41  2.77
##          Flat        -1.42  4.19 -2.65
##          Downsloping -0.36  0.83 -0.46

chi_square(data1, "cluster", "restecg")

## 
##  Pearson's Chi-squared test
## 
## data:  data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 8.4297, df = 2, p-value = 0.01477
## 
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]   1   2   3 Sum
##     Normal            34  45  73 152
##     ST-T abnormality  43  51  43 137
##     Sum               77  96 116 289
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]    1     2     3
##     Normal           40.5 50.49 61.01
##     ST-T abnormality 36.5 45.51 54.99
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##     Normal           -1.02 -0.77  1.53
##     ST-T abnormality  1.08  0.81 -1.62

chi_square(data1, "cluster", "fbs")

## 
##  Pearson's Chi-squared test
## 
## data:  data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 7.0638, df = 2, p-value = 0.02925
## 
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]   1   2   3 Sum
##                False  59  82 105 246
##                True   18  14  11  43
##                Sum    77  96 116 289
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##                False 65.54 81.72 98.74
##                True  11.46 14.28 17.26
##                     as.factor(data[[cluster_var]])
## data[[selected_var]]     1     2     3
##                False -0.81  0.03  0.63
##                True   1.93 -0.08 -1.51

Hypotheses for the Chi-Square Test:

\(H_0\): There is no association between the selected variable and the cluster labels.
\(H_1\): There is an association between the selected variable and the cluster labels.

We deduce there is an association between cluster labels and exang, sex, slope, restecg and fbs, which is to be expected, since every of those factors may contribute to some end to the coronary disease.

Conclusion of the clustering analysis

Cluster 1

Cluster 1 has the oldest patients who are predominantly male and have enhanced fasting blood sugar levels. They are having high systolic blood pressure (\(\mu = 0.604\)), high cholesterol levels (\(\mu = 0.553\)), and medium levels of maximum heart rate (\(\mu = 0.274\)). The pain is lowest (\(\mu = -0.535\)) and ST peak is not that high (\(\mu = -0.302\)), suggesting these patients probably have issues with hypertension. They also show ST upsloping, which is normal, but their resting ECG values are often abnormal, hence they need long-term monitoring after additional checks to exclude more serious symptoms. It seems their life is not compromised at the moment.

Cluster 2

Cluster 2 has middle-aged patients who are predominantely female. Their systolic blood pressure (\(\mu = 0.051\)) and cholesterol levels (\(\mu = -0.027\)) are not that high. Their maximum heart rate is high (\(\mu = -0.940\)), they report a significant level of pain (\(\mu = 0.725\)), and their ST peak is high (\(\mu = 0.808\)), suggesting that the heart muscle is not getting enough blood. Their ST is either flat or downsloping and their resting ECG values are often abnormal. These patients need urgent medical care, since their symptoms are pointing to the serious coronary artery disease.

Cluster 3

In Cluster 3, we have the youngest patients who are predominantly female, often do not show ST slope and resting ECG abnormalities, while their fasting blood sugar is often normal. They have not that high blood pressure (\(\mu = -0.443\)) and cholesterol (\(\mu = -0.345\)), but high pulse rates (\(\mu = 0.596\)). The chest pain is low (\(\mu = -0.245\)) and ST peaks are the lowest (\(\mu = -0.468\)). These patients need further checks to exclude chest pain, and then the tests would focus on their heart rhythm. Anxiety may be one of the reasons they are showing these symptoms, being ex-service personnel.

Homework 2

Lazar Kostić

2025-01-31