We have decided to use the Kaggle dataset Heart Disease patients authored by Abid Ali Awan. The data we are analyzing comes from the Veterans Administration Medical Center in Long Beach, California, USA.
The data has 12 variables, namely:
It is known that upsloping is relatively harmless and there is low likelihood of massive coronary artery disease (CAD). A flat result suggests further testing is required to determine whether a subject has a massive CAD, while downsloping suggests a high likelihood of a massive CAD. Oldpeak is a measure of myocardial ischemia (inadequate blood supply to the heart muscle) and the bigger value is indicative of a more severe coronary disease.
We load the data:
library(readr)
library(psych)
data <- read_csv("C:/R/heart_disease_patients.csv")
## Rows: 303 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): id, age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, ol...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# recode the factor variables
data$sex <- factor(data$sex, levels = c(0, 1), labels = c("Male", "Female"))
data$fbs <- factor(data$fbs, levels = c(0, 1), labels = c("False", "True"))
data$restecg <- factor(data$restecg, levels = c(0, 2), labels = c("Normal", "ST-T abnormality"))
data$exang <- factor(data$exang, levels = c(0, 1), labels = c("No", "Yes"))
data$slope <- factor(data$slope, levels = c(1, 2, 3), labels = c("Upsloping", "Flat", "Downsloping"))
# preview the data
head(data)
## # A tibble: 6 × 12
## id age sex cp trestbps chol fbs restecg thalach exang oldpeak
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct> <dbl>
## 1 1 63 Female 1 145 233 True ST-T abno… 150 No 2.3
## 2 2 67 Female 4 160 286 False ST-T abno… 108 Yes 1.5
## 3 3 67 Female 4 120 229 False ST-T abno… 129 Yes 2.6
## 4 4 37 Female 3 130 250 False Normal 187 No 3.5
## 5 5 41 Male 2 130 204 False ST-T abno… 172 No 1.4
## 6 6 56 Female 2 120 236 False Normal 178 No 0.8
## # ℹ 1 more variable: slope <fct>
# check for missing values
colSums(apply(data, 2, is.na))
## id age sex cp trestbps chol fbs restecg
## 0 0 0 0 0 0 0 4
## thalach exang oldpeak slope
## 0 0 0 0
# print the descriptive statistics
summary(data)
## id age sex cp trestbps
## Min. : 1.0 Min. :29.00 Male : 97 Min. :1.000 Min. : 94.0
## 1st Qu.: 76.5 1st Qu.:48.00 Female:206 1st Qu.:3.000 1st Qu.:120.0
## Median :152.0 Median :56.00 Median :3.000 Median :130.0
## Mean :152.0 Mean :54.44 Mean :3.158 Mean :131.7
## 3rd Qu.:227.5 3rd Qu.:61.00 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :303.0 Max. :77.00 Max. :4.000 Max. :200.0
## chol fbs restecg thalach exang
## Min. :126.0 False:258 Normal :151 Min. : 71.0 No :204
## 1st Qu.:211.0 True : 45 ST-T abnormality:148 1st Qu.:133.5 Yes: 99
## Median :241.0 NA's : 4 Median :153.0
## Mean :246.7 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:166.0
## Max. :564.0 Max. :202.0
## oldpeak slope
## Min. :0.00 Upsloping :142
## 1st Qu.:0.00 Flat :140
## Median :0.80 Downsloping: 21
## Mean :1.04
## 3rd Qu.:1.60
## Max. :6.20
We can see that the non-numeric variables are imbalanced, i.e. there exists the biggest class in most cases. Moreover, no missing data are present except for restecg, where there are 4 subjects with missing values. For simplicity, we will assume those findings to be normal and impute with the majority class.
data$restecg[is.na(data$restecg)] <- "Normal"
Let us now visualise the data.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
# plot histogram for age
ggplot(data, aes(x = age)) +
geom_histogram(binwidth = 5, fill = "blue", color = "black") +
labs(title = "Histogram of Age", x = "Age", y = "Frequency")
# plot histogram for trestbps
ggplot(data, aes(x = trestbps)) +
geom_histogram(binwidth = 10, fill = "green", color = "black") +
labs(title = "Histogram of Resting Systolic Blood Pressure", x = "Resting Systolic Blood Pressure", y = "Frequency")
# plot histogram for chol
ggplot(data, aes(x = chol)) +
geom_histogram(binwidth = 20, fill = "red", color = "black") +
labs(title = "Histogram of Serum Cholesterol", x = "Serum Cholesterol (mg/dl)", y = "Frequency")
# Plot histogram for thalach
ggplot(data, aes(x = thalach)) +
geom_histogram(binwidth = 10, fill = "purple", color = "black") +
labs(title = "Histogram of Maximum Heart Rate Achieved", x = "Maximum Heart Rate Achieved", y = "Frequency")
# plot histogram for oldpeak
ggplot(data, aes(x = oldpeak)) +
geom_histogram(binwidth = 0.5, fill = "orange", color = "black") +
labs(title = "Histogram of ST Depression", x = "ST Depression Induced by Exercise", y = "Frequency")
Every variable does not look normally distributed, i.e. symmetrical and bell-shaped. We can test for normality using Shapiro’s test. We write the function which tests for normality and test the columns above:
check_normality <- function(variable) {
test <- shapiro.test(variable)
p_value <- test$p.value
if (p_value < 0.05) {
return(paste("p-value:", round(p_value, 3), "- The data are not normally distributed."))
} else {
return(paste("p-value:", round(p_value, 3), "- The data are normally distributed.")) # paste simply makes a string, similar to f-strings in Python
}
}
# check normality for age - cat concatenates strings and "\n" is a new line
cat("Age:", check_normality(data$age), "\n")
## Age: p-value: 0.006 - The data are not normally distributed.
# check normality for trestbps
cat("Resting Systolic Blood Pressure:", check_normality(data$trestbps), "\n")
## Resting Systolic Blood Pressure: p-value: 0 - The data are not normally distributed.
# check normality for chol
cat("Serum Cholesterol:", check_normality(data$chol), "\n")
## Serum Cholesterol: p-value: 0 - The data are not normally distributed.
# check normality for thalach
cat("Maximum Heart Rate Achieved:", check_normality(data$thalach), "\n")
## Maximum Heart Rate Achieved: p-value: 0 - The data are not normally distributed.
# check normality for oldpeak, makes little sense but let's do it anyways
cat("ST Depression:", check_normality(data$oldpeak), "\n")
## ST Depression: p-value: 0 - The data are not normally distributed.
As expected, Shapiro’s test rejects the null hypothesis of normality. This is to be expected, since we are not dealing with normal population, but subjects who already have some sort of disease, hence the deviations from the average values are to be expected.
We can, for example, test whether the median cholesterol levels deviate from normal, which are less than 200 mg/dl in an average subject, disregarding age, sex and other factors.
# load the effectsize library
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
# perform Wilcoxon signed rank test to check if the median cholesterol level is greater than 200
# hypothesis: h0: median cholesterol level = 200, h1: median cholesterol level > 200
wilcox.test(data$chol, mu = 200, alternative = "greater")
##
## Wilcoxon signed rank test with continuity correction
##
## data: data$chol
## V = 42158, p-value < 2.2e-16
## alternative hypothesis: true location is greater than 200
interpret_rank_biserial(rank_biserial(wilcox.test(data$chol, mu = 200, alternative = "greater")))
## r (rank biserial) | 95% CI | Interpretation
## -------------------------------------------------
## 0.84 | [0.81, 1.00] | very large
##
## - Deviation from a difference of 200.
## - One-sided CIs: upper bound fixed at [1.00].
## - Interpretation rule: funder2019
As expected, the median cholesterol levels are significantly higher than 200, pointing to the abnormal levels in sick patients. Bisserial correlation is 0.84, which corresponds to a large effect size according to Funder.
We can to the same with systolic blood pressure in rest, assumed to be 120 mm/Hg in a healthy adult. We want to test whether the median blood pressure in our subjects is greater than the normal value.
wilcox.test(data$trestbps, mu = 120, alternative = "greater")
##
## Wilcoxon signed rank test with continuity correction
##
## data: data$trestbps
## V = 30220, p-value < 2.2e-16
## alternative hypothesis: true location is greater than 120
interpret_rank_biserial(rank_biserial(wilcox.test(data$trestbps, mu = 120, alternative = "greater")))
## r (rank biserial) | 95% CI | Interpretation
## -------------------------------------------------
## 0.70 | [0.64, 1.00] | very large
##
## - Deviation from a difference of 120.
## - One-sided CIs: upper bound fixed at [1.00].
## - Interpretation rule: funder2019
Again, as expected, the test results suggest that systolic blood pressure is greater than 120 mm/Hg with a very large effect size (r=0.71). We can go on, testing for differences between men and women and so on, but this is outside of the scope of this homework and testing part serves only to illustrate a possible avenue of research.
# load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# select specified columns
selected_data <- data %>%
select(age, trestbps, chol, thalach, cp, oldpeak)
# scale the data
scaled_data <- as.data.frame(scale(selected_data))
# calculate the dissimilarity index
scaled_data$Dissimilarity <- sqrt(scaled_data$age^2 + scaled_data$trestbps^2 + scaled_data$chol^2 + scaled_data$thalach^2+scaled_data$cp^2+scaled_data$oldpeak^2)
# show the top rows ordered by descending dissimilarity
top_outliers <- head(scaled_data[order(-scaled_data$Dissimilarity), ], 5)
print(top_outliers)
## age trestbps chol thalach cp oldpeak
## 153 1.38970298 -0.9482959 6.1283473 0.4543273 -0.1649949 0.4826527
## 92 0.83652378 1.6085589 -1.5971030 -0.2014103 0.8765355 4.4444984
## 127 0.17270875 3.8813188 0.7977866 -0.7260004 0.8765355 2.5497026
## 184 0.50461627 2.6313008 0.4501413 -0.2014103 -2.2480557 2.7219568
## 124 0.06207291 0.4721790 -0.5734808 -1.6877489 0.8765355 3.9277359
## Dissimilarity
## 153 6.391695
## 92 5.138134
## 127 4.850495
## 184 4.459190
## 124 4.427144
We see that the observation 153 is the outlier and we will remove it from the data, rescale and compute dissimilarity again.
# select specified columns
selected_data <- data[-153, ] %>%
select(age, trestbps, chol, thalach, cp, oldpeak)
# scale the data
scaled_data <- as.data.frame(scale(selected_data))
# calculate the dissimilarity index
scaled_data$Dissimilarity <- sqrt(scaled_data$age^2 + scaled_data$trestbps^2 + scaled_data$chol^2 + scaled_data$thalach^2+scaled_data$cp^2+scaled_data$oldpeak^2)
# show the top rows ordered by descending dissimilarity
top_outliers <- head(scaled_data[order(-scaled_data$Dissimilarity), ], 5)
print(top_outliers)
## age trestbps chol thalach cp oldpeak
## 92 0.84243865 1.6051582 -1.6826684 -0.1996431 0.8745772 4.440448
## 127 0.17758724 3.8775491 0.8730003 -0.7235436 0.8745772 2.548060
## 122 0.95324721 1.0370605 3.3256178 0.1932823 0.8745772 2.548060
## 183 0.51001294 2.6277341 0.5020161 -0.1996431 -2.2449776 2.720095
## 124 0.06677867 0.4689628 -0.5903262 -1.6840279 0.8745772 3.924342
## Dissimilarity
## 92 5.161388
## 127 4.859019
## 122 4.509851
## 183 4.460473
## 124 4.424270
scaled_data$Dissimilarity<-NULL
# we don't need this anymore
Now it looks better. We will not remove any further data points, since the large differences in dissimilarity+ aren’t present in our data.
Let us investigate the distance between the data points now.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(NbClust)
Distances <- get_dist(scaled_data, method = "euclidian")
fviz_dist(Distances,
gradient = list(low = "darkred", mid = "grey95", high = "white"))
get_clust_tendency(scaled_data,
n = nrow(scaled_data) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.7041747
##
## $plot
## NULL
The dissimilarity matrix suggests there isn’t that great of a separation between points as in our class, but Hopkins test, used to indicate whether natural groups exist, suggests they do. Our data is deemed clusterable, albeit it’s not ideal. We can see 2 to 4 clear patches of red. We need to carefully determine the appropriate number of clusters.
nb<-NbClust(scaled_data,
method ="ward.D2") # default distance is euclidean, 2 to 15 clusters are considered
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
# The data suggests 3 clusters are optimal.
(WardD2 <- scaled_data %>%
get_dist(method = "euclidian") %>%
hclust(method = "ward.D2"))
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 302
fviz_dend(WardD2, k = 3,
palette = c("orange", "violet", "green"),
color_labels_by_k = TRUE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We opt for Ward’s D squared approach since we are aiming for spherical and compact clusters.
For example, if we are interested in a complete linkage method, utilising a similarity approach, we get this:
(complete <- scaled_data %>%
get_dist(method = "euclidean") %>%
hclust(method = "complete"))
##
## Call:
## hclust(d = ., method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 302
fviz_dend(complete)
Let us train the hierarchical K-Means model with 3 clusters and see what happens.
HKM <- hkmeans(scaled_data,
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
HKM
## Hierarchical K-means clustering with 3 clusters of sizes 74, 97, 131
##
## Cluster means:
## age trestbps chol thalach cp oldpeak
## 1 0.6956922 0.90731927 0.7833179 0.1797128 -0.3057949 -0.09291508
## 2 0.4757423 -0.09386392 -0.1570864 -0.9467864 0.6923351 0.75499038
## 3 -0.7452536 -0.44302920 -0.3261691 0.5995385 -0.3399060 -0.50655230
##
## Clustering vector:
## [1] 1 2 2 3 3 3 2 1 2 2 2 1 1 3 1 3 3 2 3 3 3 1 1 2 2 3 1 1 3 2 1 2 1 1 3 3 2
## [38] 2 1 1 2 3 1 1 1 2 3 2 1 3 3 2 3 3 2 2 3 3 3 3 2 3 2 1 2 2 2 1 1 2 1 1 2 2
## [75] 3 1 2 1 3 2 2 2 3 1 3 3 3 2 3 3 1 2 2 3 1 3 2 2 3 3 3 3 1 2 2 3 3 3 2 3 1
## [112] 2 3 2 2 3 3 3 2 2 3 1 3 2 1 3 1 2 3 3 3 3 3 3 3 1 2 2 3 3 3 1 3 2 3 3 1 3
## [149] 3 3 1 3 1 2 2 1 1 1 1 3 1 3 2 3 3 3 1 3 3 2 2 1 1 2 2 3 2 3 3 3 1 3 1 1 3
## [186] 3 1 1 1 3 2 3 2 2 2 1 3 3 1 3 1 3 1 3 1 2 2 3 1 3 3 3 1 3 3 3 3 2 1 3 3 3
## [223] 2 2 3 3 1 2 2 3 1 3 1 1 2 2 3 3 3 3 3 3 1 2 2 3 2 3 3 2 2 2 3 3 3 2 2 1 3
## [260] 3 1 1 3 2 2 3 2 3 3 2 2 2 2 3 1 1 3 1 2 2 3 2 3 1 2 2 3 3 3 1 1 2 2 2 3 2
## [297] 2 3 2 2 3 3
##
## Within cluster sum of squares by cluster:
## [1] 374.5193 407.8028 449.4075
## (between_SS / total_SS = 31.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
fviz_cluster(HKM,
palette = "set2",
repel = TRUE,
ggtheme = theme_classic())
It looks decent and it can be seen 188 and 231 are far away in the red part, 133, 178, 182, 22 and 58 are far away in the blue part and 30, 122, 124, 92, 155 in the green one. 245 is close to other points, so we do not want to discard it. We opt for even clusters which are well-rounded.
selected_data1 <- data[-c(153, 188, 231, 133, 177, 178, 182, 22, 58, 30, 122, 124, 92, 155),]%>%
select(age, trestbps, chol, thalach, cp, oldpeak)
# scale the data
scaled_data1 <- as.data.frame(scale(selected_data1))
# we remove the outliers and rescale the data
HKM1 <- hkmeans(scaled_data1,
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
HKM1
## Hierarchical K-means clustering with 3 clusters of sizes 137, 95, 57
##
## Cluster means:
## age trestbps chol thalach cp oldpeak
## 1 -0.6657077 -0.3918684 -0.23602212 0.5948045 -0.3227616 -0.4985808
## 2 0.4865753 -0.1690940 -0.06552748 -0.9121756 0.7013572 0.6836192
## 3 0.7890754 1.2236825 0.67649372 0.0906748 -0.3931684 0.0589779
##
## Clustering vector:
## [1] 3 2 2 1 1 1 2 3 2 2 2 3 1 1 3 1 1 2 1 1 1 1 2 2 1 1 3 1 3 2 3 1 1 1 2 2 2
## [38] 3 2 1 3 3 3 2 1 2 3 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 2 2 3 3 2 3 2 2 2 1 3 2
## [75] 3 1 2 2 2 1 3 1 1 1 2 1 1 2 2 1 3 1 2 2 1 1 1 1 1 2 2 1 1 1 2 1 3 2 1 2 2
## [112] 1 1 1 2 2 1 1 3 1 3 2 1 1 1 1 1 1 1 2 2 1 1 1 3 1 2 1 1 3 1 1 1 3 2 3 2 3
## [149] 1 3 2 1 3 1 2 1 1 1 1 1 1 3 2 3 3 2 2 1 1 1 1 3 3 1 1 3 3 1 2 1 2 2 2 3 1
## [186] 1 3 1 3 1 3 1 1 2 2 1 3 1 1 1 3 1 1 1 1 2 1 1 1 1 2 2 1 1 3 2 2 3 1 3 3 2
## [223] 2 1 1 1 1 1 1 3 2 2 1 2 1 1 2 2 2 1 1 1 2 2 3 1 1 3 3 1 2 2 1 2 1 1 2 2 2
## [260] 2 1 3 3 1 3 2 2 1 2 1 3 2 3 1 1 1 3 3 2 2 2 1 2 2 1 2 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 488.6907 398.0597 302.8731
## (between_SS / total_SS = 31.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
fviz_cluster(HKM1,
palette = "set2",
repel = TRUE,
ggtheme = theme_classic())
We do not have any other outliers and our clusters are
well-balanced.
We now focus on K-Means algorithm. Let us determine the optimal number of clusters.
library(NbClust)
fviz_nbclust(scaled_data1, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
fviz_nbclust(scaled_data1, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")
nb<-NbClust(scaled_data1,
method ="kmeans") # default distance is euclidean, 2 to 15 clusters are considered
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 2 proposed 11 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
Elbow method alone is inconclusive, it could be 2, it could be 6 and anything in between, but sillhouette method suggests 2 as an optimal number of clusters for this analysis. Nbclust function which implements a voting approach, suggests 2. 3 is also high in the number of votes.
KMNH <- kmeans(scaled_data1,
centers = 2,
nstart = 50,
iter.max =500)
KMNH
## K-means clustering with 2 clusters of sizes 153, 136
##
## Cluster means:
## age trestbps chol thalach cp oldpeak
## 1 -0.5976141 -0.2730750 -0.2082443 0.5865003 -0.4007050 -0.4299568
## 2 0.6723159 0.3072094 0.2342749 -0.6598129 0.4507931 0.4837014
##
## Clustering vector:
## [1] 2 2 2 1 1 1 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 2 2 2 2 1 1 2 2 2
## [38] 2 2 1 2 2 2 1 1 2 2 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2
## [75] 2 1 2 1 2 1 2 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 1 2 2 1 2 2
## [112] 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 2
## [149] 1 2 2 1 2 1 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 1 1 1 2 1 2 1 2 2 2 2 1
## [186] 1 2 1 2 1 2 1 1 2 2 1 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2 2 1 1 2 2 2 2 1 2 1 2
## [223] 2 1 1 1 1 1 1 2 2 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 1 1 1 1 2 2 1 2 1 1 2 2 2
## [260] 2 1 1 2 1 1 2 2 1 2 1 2 2 2 1 1 1 2 1 1 2 2 1 2 2 1 2 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 623.9686 725.4272
## (between_SS / total_SS = 21.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We can see that the amount of variability explained by clustering is only 21.9 and that the second smaller cluster has more variability in itself. Let us see whether the outliers contributed to such poor result:
fviz_cluster(KMNH,
palette = "Set2",
repel = TRUE,
ggtheme = theme_bw(),
data = scaled_data1)
KMNH1 <- kmeans(scaled_data1,
centers = 3,
nstart = 50,
iter.max =500)
KMNH1
## K-means clustering with 3 clusters of sizes 77, 96, 116
##
## Cluster means:
## age trestbps chol thalach cp oldpeak
## 1 0.7177555 0.60411394 0.55299651 0.2741042 -0.5346015 -0.3018214
## 2 0.4809916 0.05062011 -0.02664637 -0.9396673 0.7250369 0.8076828
## 3 -0.8745032 -0.44289917 -0.34502310 0.5957072 -0.2451657 -0.4680802
##
## Clustering vector:
## [1] 1 2 2 3 3 3 2 1 2 2 2 1 1 3 1 3 3 2 3 3 1 1 2 2 3 1 1 3 1 2 1 1 3 3 2 2 2
## [38] 1 2 3 1 1 1 2 3 2 1 3 3 2 3 3 2 2 3 1 3 2 3 2 1 2 2 2 1 2 2 1 1 2 2 3 1 2
## [75] 1 3 2 2 2 3 1 1 3 3 2 3 3 1 2 3 1 3 2 2 3 3 3 3 1 1 3 3 3 3 2 3 1 2 3 2 2
## [112] 3 1 3 2 2 3 3 1 3 2 2 3 3 3 3 3 3 1 2 2 3 3 1 1 3 2 3 3 2 3 3 1 1 3 1 2 1
## [149] 1 1 1 3 1 3 2 3 3 3 1 3 3 2 2 1 1 2 2 3 3 3 3 1 1 1 3 1 1 3 2 3 2 2 2 1 3
## [186] 3 1 3 1 3 1 3 3 2 2 1 2 3 3 3 1 3 3 3 3 2 1 3 3 3 2 2 3 3 1 2 2 2 3 1 1 2
## [223] 2 3 3 3 3 3 3 1 2 2 3 2 3 1 2 2 2 3 3 3 2 2 1 1 3 1 1 3 2 2 3 2 3 3 2 2 2
## [260] 2 1 1 1 3 1 2 2 3 2 3 1 2 2 1 3 3 1 1 2 2 2 3 2 2 3 2 2 1 3
##
## Within cluster sum of squares by cluster:
## [1] 352.5733 430.9888 399.1071
## (between_SS / total_SS = 31.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(KMNH1,
palette = "Set2",
repel = TRUE,
ggtheme = theme_bw(),
data = scaled_data1)
3 clusters looks quite decent and we are content with the results, comparable to ones in the hierarchical case. 4 clusters would lead to poor separation issues and we omit it here for the reasons of brevity. We see that in terms of variability, second cluster is the most versatile one, but we are ready to pay that price, so not to remove many points. 121 may be an outlier, but that would lead to removal of more points, and the improvement is not significant.
Let us plot the average values per cluster, as shown in class.
library(tidyr)
(centres <- as.data.frame(KMNH1$centers)) # use centroids of our last cluster
## age trestbps chol thalach cp oldpeak
## 1 0.7177555 0.60411394 0.55299651 0.2741042 -0.5346015 -0.3018214
## 2 0.4809916 0.05062011 -0.02664637 -0.9396673 0.7250369 0.8076828
## 3 -0.8745032 -0.44289917 -0.34502310 0.5957072 -0.2451657 -0.4680802
Figure <- as.data.frame(centres)
Figure$id <- 1:nrow(Figure) # add cluster IDs, not having these before
Figure <- pivot_longer(Figure,
cols = c("age", "trestbps", "chol", "thalach", "cp", "oldpeak"),
names_to = "Variable",
values_to = "Value")
Figure$Group <- factor(Figure$id,
levels = c(1, 2, 3), # we have 3 groups
labels = c("Cluster 1", "Cluster 2", "Cluster 3"))
Figure$VariableFactor <- factor(Figure$Variable,
levels = c("age", "trestbps", "chol", "thalach", "cp", "oldpeak"),
labels = c("age", "trestbps", "chol", "thalach", "cp", "oldpeak"))
# plot the data
ggplot(Figure, aes(x = VariableFactor, y = Value)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
theme_classic() +
geom_point(aes(shape = Group, col = Group), size = 4) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Averages") +
xlab("Cluster Variables") +
scale_color_brewer(palette = "Set2") +
ylim(-1.44, 1.44) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 10)) +
ggtitle("Cluster Centroids")
We are running a one-way ANOVA on each cluster variable to see if these variables really help separate the groups. If we find that the means of two groups are different for each variable, then we know we have picked the right variables for clustering.
data1<-data[-c(153, 188, 231, 133, 177, 178, 182, 22, 58, 30, 122, 124, 92, 155),]
data1$cluster <- KMNH1$cluster
fit <- aov(cbind(age, trestbps, chol, thalach, cp, oldpeak) ~ as.factor(cluster),
data = data1)
summary(fit)
## Response age :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 2 12075 6037.5 156.72 < 2.2e-16 ***
## Residuals 286 11018 38.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response trestbps :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 2 15838 7918.8 30.847 7.399e-13 ***
## Residuals 286 73420 256.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response chol :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 2 81503 40751 21.357 2.266e-09 ***
## Residuals 286 545712 1908
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response thalach :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 2 66769 33384 120.52 < 2.2e-16 ***
## Residuals 286 79223 277
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response cp :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 2 72.99 36.495 54.472 < 2.2e-16 ***
## Residuals 286 191.61 0.670
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response oldpeak :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(cluster) 2 112.53 56.267 70.45 < 2.2e-16 ***
## Residuals 286 228.42 0.799
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F value is 156.72 with a p-value of < 2.2e-16. This means there are significant differences in age across the clusters.
The F value is 30.847 with a p-value of 7.399e-13. This indicates significant differences in resting blood pressure across the clusters.
The F value is 21.357 with a p-value of 2.266e-09. This suggests significant differences in cholesterol levels across the clusters.
The F value is 120.52 with a p-value of < 2.2e-16. This means there are significant differences in maximum heart rate achieved across the clusters.
The F value is 54.472 with a p-value of < 2.2e-16. This indicates significant differences in chest pain type across the clusters.
The F value is 70.45 with a p-value of < 2.2e-16. This suggests significant differences in ST depression across the clusters.
The results indicate that all p-values are significantly below the threshold of 0.05, suggesting that the differences between clusters are statistically significant for all variables examined. This demonstrates that the clustering variables effectively differentiate between the groups.
# we write the function which is going to help us do the chi squared test analysis, as shown in class
chi_square <- function(data, cluster_var, selected_var) {
# perform chi-square test
chisq_results <- chisq.test(data[[selected_var]], as.factor(data[[cluster_var]]))
# print chi-square test results
print(chisq_results)
# add observed frequencies with margins
observed_with_margins <- addmargins(chisq_results$observed)
print(observed_with_margins)
# print expected frequencies
expected_frequencies <- round(chisq_results$expected, 2)
print(expected_frequencies)
# print standardized residuals
standardized_residuals <- round(chisq_results$residuals, 2)
print(standardized_residuals)
}
chi_square(data1, "cluster", "exang")
##
## Pearson's Chi-squared test
##
## data: data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 48.754, df = 2, p-value = 2.59e-11
##
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3 Sum
## No 62 39 95 196
## Yes 15 57 21 93
## Sum 77 96 116 289
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## No 52.22 65.11 78.67
## Yes 24.78 30.89 37.33
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## No 1.35 -3.24 1.84
## Yes -1.96 4.70 -2.67
chi_square(data1, "cluster", "sex")
##
## Pearson's Chi-squared test
##
## data: data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 7.8664, df = 2, p-value = 0.01958
##
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3 Sum
## Male 34 25 32 91
## Female 43 71 84 198
## Sum 77 96 116 289
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## Male 24.25 30.23 36.53
## Female 52.75 65.77 79.47
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## Male 1.98 -0.95 -0.75
## Female -1.34 0.64 0.51
chi_square(data1, "cluster", "slope")
## Warning in chisq.test(data[[selected_var]], as.factor(data[[cluster_var]])):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 56.999, df = 4, p-value = 1.238e-11
##
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3 Sum
## Upsloping 46 16 76 138
## Flat 27 72 34 133
## Downsloping 4 8 6 18
## Sum 77 96 116 289
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## Upsloping 36.77 45.84 55.39
## Flat 35.44 44.18 53.38
## Downsloping 4.80 5.98 7.22
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## Upsloping 1.52 -4.41 2.77
## Flat -1.42 4.19 -2.65
## Downsloping -0.36 0.83 -0.46
chi_square(data1, "cluster", "restecg")
##
## Pearson's Chi-squared test
##
## data: data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 8.4297, df = 2, p-value = 0.01477
##
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3 Sum
## Normal 34 45 73 152
## ST-T abnormality 43 51 43 137
## Sum 77 96 116 289
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## Normal 40.5 50.49 61.01
## ST-T abnormality 36.5 45.51 54.99
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## Normal -1.02 -0.77 1.53
## ST-T abnormality 1.08 0.81 -1.62
chi_square(data1, "cluster", "fbs")
##
## Pearson's Chi-squared test
##
## data: data[[selected_var]] and as.factor(data[[cluster_var]])
## X-squared = 7.0638, df = 2, p-value = 0.02925
##
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3 Sum
## False 59 82 105 246
## True 18 14 11 43
## Sum 77 96 116 289
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## False 65.54 81.72 98.74
## True 11.46 14.28 17.26
## as.factor(data[[cluster_var]])
## data[[selected_var]] 1 2 3
## False -0.81 0.03 0.63
## True 1.93 -0.08 -1.51
Hypotheses for the Chi-Square Test:
We deduce there is an association between cluster labels and exang, sex, slope, restecg and fbs, which is to be expected, since every of those factors may contribute to some end to the coronary disease.
Cluster 1 has the oldest patients who are predominantly male and have enhanced fasting blood sugar levels. They are having high systolic blood pressure (\(\mu = 0.604\)), high cholesterol levels (\(\mu = 0.553\)), and medium levels of maximum heart rate (\(\mu = 0.274\)). The pain is lowest (\(\mu = -0.535\)) and ST peak is not that high (\(\mu = -0.302\)), suggesting these patients probably have issues with hypertension. They also show ST upsloping, which is normal, but their resting ECG values are often abnormal, hence they need long-term monitoring after additional checks to exclude more serious symptoms. It seems their life is not compromised at the moment.
Cluster 2 has middle-aged patients who are predominantely female. Their systolic blood pressure (\(\mu = 0.051\)) and cholesterol levels (\(\mu = -0.027\)) are not that high. Their maximum heart rate is high (\(\mu = -0.940\)), they report a significant level of pain (\(\mu = 0.725\)), and their ST peak is high (\(\mu = 0.808\)), suggesting that the heart muscle is not getting enough blood. Their ST is either flat or downsloping and their resting ECG values are often abnormal. These patients need urgent medical care, since their symptoms are pointing to the serious coronary artery disease.
In Cluster 3, we have the youngest patients who are predominantly female, often do not show ST slope and resting ECG abnormalities, while their fasting blood sugar is often normal. They have not that high blood pressure (\(\mu = -0.443\)) and cholesterol (\(\mu = -0.345\)), but high pulse rates (\(\mu = 0.596\)). The chest pain is low (\(\mu = -0.245\)) and ST peaks are the lowest (\(\mu = -0.468\)). These patients need further checks to exclude chest pain, and then the tests would focus on their heart rhythm. Anxiety may be one of the reasons they are showing these symptoms, being ex-service personnel.