library(readr)
mydata<- read_csv("HeartDiseaseTrain-Test.csv")
## Rows: 1025 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): sex, chest_pain_type, fasting_blood_sugar, rest_ecg, exercise_induced_angina, slope, vessels_colored_by_flo...
## dbl (6): age, resting_blood_pressure, cholestoral, Max_heart_rate, oldpeak, target
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mydata)
## # A tibble: 6 × 14
## age sex chest_pain_type resting_blood_pressure cholestoral fasting_blood_sugar rest_ecg Max_heart_rate
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 52 Male Typical angina 125 212 Lower than 120 mg/ml ST-T wave abnor… 168
## 2 53 Male Typical angina 140 203 Greater than 120 mg/ml Normal 155
## 3 70 Male Typical angina 145 174 Lower than 120 mg/ml ST-T wave abnor… 125
## 4 61 Male Typical angina 148 203 Lower than 120 mg/ml ST-T wave abnor… 161
## 5 62 Female Typical angina 138 294 Greater than 120 mg/ml ST-T wave abnor… 106
## 6 58 Female Typical angina 100 248 Lower than 120 mg/ml Normal 122
## # ℹ 6 more variables: exercise_induced_angina <chr>, oldpeak <dbl>, slope <chr>, vessels_colored_by_flourosopy <chr>,
## # thalassemia <chr>, target <dbl>
Unit of observation is one patient.
Sample size is 1025.
Explanation of variables:
The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.
# Add the ID column and place it as the first column
mydata <- data.frame(ID = seq(1, nrow(mydata)), mydata)
# View the updated dataset
head(mydata)
## ID age sex chest_pain_type resting_blood_pressure cholestoral fasting_blood_sugar rest_ecg
## 1 1 52 Male Typical angina 125 212 Lower than 120 mg/ml ST-T wave abnormality
## 2 2 53 Male Typical angina 140 203 Greater than 120 mg/ml Normal
## 3 3 70 Male Typical angina 145 174 Lower than 120 mg/ml ST-T wave abnormality
## 4 4 61 Male Typical angina 148 203 Lower than 120 mg/ml ST-T wave abnormality
## 5 5 62 Female Typical angina 138 294 Greater than 120 mg/ml ST-T wave abnormality
## 6 6 58 Female Typical angina 100 248 Lower than 120 mg/ml Normal
## Max_heart_rate exercise_induced_angina oldpeak slope vessels_colored_by_flourosopy thalassemia target
## 1 168 No 1.0 Downsloping Two Reversable Defect 0
## 2 155 Yes 3.1 Upsloping Zero Reversable Defect 0
## 3 125 Yes 2.6 Upsloping Zero Reversable Defect 0
## 4 161 No 0.0 Downsloping One Reversable Defect 0
## 5 106 No 1.9 Flat Three Fixed Defect 0
## 6 122 No 1.0 Flat Zero Fixed Defect 1
Here I added column displaying IDs.
colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Age"
colnames(mydata) [3] <- "Sex"
colnames(mydata) [4] <- "Chest pain"
colnames(mydata) [5] <- "Resting blood pressure"
colnames(mydata) [6] <- "Cholestoral"
colnames(mydata) [7] <- "Fasting blood sugar"
colnames(mydata) [8] <- "Resting electrocardiographic results"
colnames(mydata) [9] <- "Max heart rate"
colnames(mydata) [10] <- "Exercise induced angina"
colnames(mydata) [11] <- "Oldpeak"
colnames(mydata) [12] <- "Slope of the peak"
colnames(mydata) [13] <- "Number of major vessels"
colnames(mydata) [14] <- "Thal"
colnames(mydata) [15] <- "Target"
head(mydata)
## ID Age Sex Chest pain Resting blood pressure Cholestoral Fasting blood sugar
## 1 1 52 Male Typical angina 125 212 Lower than 120 mg/ml
## 2 2 53 Male Typical angina 140 203 Greater than 120 mg/ml
## 3 3 70 Male Typical angina 145 174 Lower than 120 mg/ml
## 4 4 61 Male Typical angina 148 203 Lower than 120 mg/ml
## 5 5 62 Female Typical angina 138 294 Greater than 120 mg/ml
## 6 6 58 Female Typical angina 100 248 Lower than 120 mg/ml
## Resting electrocardiographic results Max heart rate Exercise induced angina Oldpeak Slope of the peak
## 1 ST-T wave abnormality 168 No 1.0 Downsloping
## 2 Normal 155 Yes 3.1 Upsloping
## 3 ST-T wave abnormality 125 Yes 2.6 Upsloping
## 4 ST-T wave abnormality 161 No 0.0 Downsloping
## 5 ST-T wave abnormality 106 No 1.9 Flat
## 6 Normal 122 No 1.0 Flat
## Number of major vessels Thal Target
## 1 Two Reversable Defect 0
## 2 Zero Reversable Defect 0
## 3 Zero Reversable Defect 0
## 4 One Reversable Defect 0
## 5 Three Fixed Defect 0
## 6 Zero Fixed Defect 1
summary(mydata)
## ID Age Sex Chest pain Resting blood pressure Cholestoral
## Min. : 1 Min. :29.00 Length:1025 Length:1025 Min. : 94.0 Min. :126
## 1st Qu.: 257 1st Qu.:48.00 Class :character Class :character 1st Qu.:120.0 1st Qu.:211
## Median : 513 Median :56.00 Mode :character Mode :character Median :130.0 Median :240
## Mean : 513 Mean :54.43 Mean :131.6 Mean :246
## 3rd Qu.: 769 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
## Max. :1025 Max. :77.00 Max. :200.0 Max. :564
## Fasting blood sugar Resting electrocardiographic results Max heart rate Exercise induced angina Oldpeak
## Length:1025 Length:1025 Min. : 71.0 Length:1025 Min. :0.000
## Class :character Class :character 1st Qu.:132.0 Class :character 1st Qu.:0.000
## Mode :character Mode :character Median :152.0 Mode :character Median :0.800
## Mean :149.1 Mean :1.072
## 3rd Qu.:166.0 3rd Qu.:1.800
## Max. :202.0 Max. :6.200
## Slope of the peak Number of major vessels Thal Target
## Length:1025 Length:1025 Length:1025 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :1.0000
## Mean :0.5132
## 3rd Qu.:1.0000
## Max. :1.0000
mydata$`Sex` <- factor(mydata$`Sex`)
mydata$`Chest pain` <- factor(mydata$`Chest pain`)
mydata$`Fasting blood sugar` <- factor(mydata$`Fasting blood sugar`)
mydata$`Resting electrocardiographic results` <- factor(mydata$`Resting electrocardiographic results`)
mydata$`Exercise induced angina` <- factor(mydata$`Exercise induced angina`)
mydata$`Slope of the peak` <- factor(mydata$`Slope of the peak`)
mydata$`Number of major vessels` <- factor(mydata$`Number of major vessels`)
mydata$Thal <- factor(mydata$Thal)
mydata$Target <- factor(mydata$Target)
head(mydata)
## ID Age Sex Chest pain Resting blood pressure Cholestoral Fasting blood sugar
## 1 1 52 Male Typical angina 125 212 Lower than 120 mg/ml
## 2 2 53 Male Typical angina 140 203 Greater than 120 mg/ml
## 3 3 70 Male Typical angina 145 174 Lower than 120 mg/ml
## 4 4 61 Male Typical angina 148 203 Lower than 120 mg/ml
## 5 5 62 Female Typical angina 138 294 Greater than 120 mg/ml
## 6 6 58 Female Typical angina 100 248 Lower than 120 mg/ml
## Resting electrocardiographic results Max heart rate Exercise induced angina Oldpeak Slope of the peak
## 1 ST-T wave abnormality 168 No 1.0 Downsloping
## 2 Normal 155 Yes 3.1 Upsloping
## 3 ST-T wave abnormality 125 Yes 2.6 Upsloping
## 4 ST-T wave abnormality 161 No 0.0 Downsloping
## 5 ST-T wave abnormality 106 No 1.9 Flat
## 6 Normal 122 No 1.0 Flat
## Number of major vessels Thal Target
## 1 Two Reversable Defect 0
## 2 Zero Reversable Defect 0
## 3 Zero Reversable Defect 0
## 4 One Reversable Defect 0
## 5 Three Fixed Defect 0
## 6 Zero Fixed Defect 1
summary(mydata)
## ID Age Sex Chest pain Resting blood pressure Cholestoral
## Min. : 1 Min. :29.00 Female:312 Asymptomatic : 77 Min. : 94.0 Min. :126
## 1st Qu.: 257 1st Qu.:48.00 Male :713 Atypical angina :167 1st Qu.:120.0 1st Qu.:211
## Median : 513 Median :56.00 Non-anginal pain:284 Median :130.0 Median :240
## Mean : 513 Mean :54.43 Typical angina :497 Mean :131.6 Mean :246
## 3rd Qu.: 769 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
## Max. :1025 Max. :77.00 Max. :200.0 Max. :564
## Fasting blood sugar Resting electrocardiographic results Max heart rate Exercise induced angina
## Greater than 120 mg/ml:153 Left ventricular hypertrophy: 15 Min. : 71.0 No :680
## Lower than 120 mg/ml :872 Normal :497 1st Qu.:132.0 Yes:345
## ST-T wave abnormality :513 Median :152.0
## Mean :149.1
## 3rd Qu.:166.0
## Max. :202.0
## Oldpeak Slope of the peak Number of major vessels Thal Target
## Min. :0.000 Downsloping:469 Four : 18 Fixed Defect :544 0:499
## 1st Qu.:0.000 Flat :482 One :226 No : 7 1:526
## Median :0.800 Upsloping : 74 Three: 69 Normal : 64
## Mean :1.072 Two :134 Reversable Defect:410
## 3rd Qu.:1.800 Zero :578
## Max. :6.200
The minimum age is 29.
The average resting blood pressure is 131,6.
Fifty percent of patients have 240 or less than 240 and fifty percent of patients have more than 240 cholestoral.
Can we discover meaningful clusters of patients based on their clinical and demographic features, and how do these clusters correspond to heart disease diagnoses?
Now, because my dataset has too many units, I will shorten my sample size to 150 units for easier further interpretation.
set.seed(1)
# Randomly sample 150 rows from the dataset
mydata_sample <- mydata %>% sample_n(150)
# View the sampled dataset
head(mydata_sample)
## ID Age Sex Chest pain Resting blood pressure Cholestoral Fasting blood sugar
## 1 1017 65 Male Asymptomatic 138 282 Greater than 120 mg/ml
## 2 836 49 Male Non-anginal pain 118 149 Lower than 120 mg/ml
## 3 679 41 Female Non-anginal pain 112 268 Lower than 120 mg/ml
## 4 129 52 Male Non-anginal pain 138 223 Lower than 120 mg/ml
## 5 930 60 Male Typical angina 130 206 Lower than 120 mg/ml
## 6 509 56 Female Typical angina 200 288 Greater than 120 mg/ml
## Resting electrocardiographic results Max heart rate Exercise induced angina Oldpeak Slope of the peak
## 1 Normal 174 No 1.4 Flat
## 2 Normal 126 No 0.8 Downsloping
## 3 Normal 172 Yes 0.0 Downsloping
## 4 ST-T wave abnormality 169 No 0.0 Downsloping
## 5 Normal 132 Yes 2.4 Flat
## 6 Normal 133 Yes 4.0 Upsloping
## Number of major vessels Thal Target
## 1 One Fixed Defect 0
## 2 Three Fixed Defect 0
## 3 Zero Fixed Defect 1
## 4 Four Fixed Defect 1
## 5 Two Reversable Defect 0
## 6 Two Reversable Defect 0
# Check the number of rows to confirm the sample size
nrow(mydata_sample)
## [1] 150
# Standardization of cluster variables
mydata_clu_std <- as.data.frame(scale(mydata_sample[c("Resting blood pressure", "Cholestoral", "Max heart rate")]))
We standardize all cluster variables, because we want all of the variables to have the same effect, despite the difference in variance.
mydata_sample$Dissimilarity = sqrt(mydata_clu_std$`Resting blood pressure`^2 + mydata_clu_std$Cholestoral^2 + mydata_clu_std$`Max heart rate`^2)
head(mydata_sample[order(-mydata_sample$Dissimilarity), c("ID", "Resting blood pressure", "Cholestoral", "Max heart rate", "Dissimilarity")], 10)
## ID Resting blood pressure Cholestoral Max heart rate Dissimilarity
## 73 193 115 564 160 5.391256
## 84 465 115 564 160 5.391256
## 6 509 200 288 133 3.888467
## 88 176 200 288 133 3.888467
## 137 295 200 288 133 3.888467
## 77 987 180 327 117 3.366625
## 95 141 152 274 88 3.174031
## 67 642 134 409 150 2.712762
## 133 48 178 228 165 2.646638
## 132 682 170 326 140 2.544133
From the table above I can see some jump in dissimilarity numbers after the first two IDs that are displayed here, so I identify ID465 and ID193 as potential outlierS. This is why I will remove these units.
# Remove rows with specific outliers based on Dissimilarity or variable values
mydata_sample <- mydata_sample %>%
filter(!ID %in% c(465, 193)) # Use backticks for column names with spaces
# Reassign IDs
mydata_sample$ID <- seq(1, nrow(mydata_sample))
# Standardize the clustering variables
mydata_clu_std <- as.data.frame(scale(mydata_sample[c("Resting blood pressure", "Cholestoral", "Max heart rate")]))
# Set row names of the standardized data to a unique identifier
rownames(mydata_clu_std) <- mydata_sample$ID
Now the sampler size got smaller, the new sample size is now 148.
# Access specific rows by their identifiers (replace with actual IDs or values in your dataset)
mydata_clu_std[c(1, 5, 10), ] # Replace 1, 5, 10 with the row indices or identifiers relevant to your dataset
## Resting blood pressure Cholestoral Max heart rate
## 1 0.37608783 0.8184698 1.0531856
## 5 -0.05497508 -0.7946629 -0.8802467
## 10 -0.16274081 -0.8158883 1.5135266
# Graphical presentation of the dissimilarity matrix
library(factoextra)
# Compute the distance matrix using Euclidean distance
Distance <- get_dist(mydata_clu_std,
method = "euclidean") # Note the correct spelling of "euclidean"
# Visualize the dissimilarity matrix
fviz_dist(Distance,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
From the matrix of distances displayed above, we can see that some groups of clusters are forming, and I see 4 or 5 groups.
# Load the factoextra library
library(factoextra)
get_clust_tendency(mydata_clu_std,
n= nrow(mydata_clu_std)-1,
graph=FALSE)
## $hopkins_stat
## [1] 0.6657298
##
## $plot
## NULL
Hopkins statistics in my case is above 0.5, so I can conclude that the data is clusterable. If it would be greater, closer to 1, it would be even more appropriate. Now the next question is how many clusters to use, so I will check this.
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 148
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Based on the dendrogram, I would choose 2 clusters, as there is the biggest jump in vertical line.
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
With the elbow method, I can see the slope changes the most at 2 and 5, so both are possible options for number of clusters.
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")
The highest value of the Silhouette analysis is at 6, so this is the suggested numbers of clusters.
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 4 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB Silhouette Duda
## 2 3.4746 53.5714 37.9606 -2.6363 162.4096 4000271 9373.262 322.6215 1.4594 1.3669 0.3507 1.6100 0.2695 1.3098
## 3 0.9411 52.3692 38.1343 -4.4964 255.8988 4785588 9330.161 256.0479 2.5131 1.7223 0.3280 1.3736 0.2345 0.7192
## 4 0.8115 56.4144 32.3503 -5.3607 339.7263 4828670 4410.775 202.7307 3.6009 2.1753 0.3780 1.1465 0.2956 1.5856
## 5 0.9738 59.4877 9.7934 -3.9121 435.4611 3951128 2774.193 165.5410 5.2847 2.6640 0.3629 1.1124 0.2890 1.0052
## 6 4.0081 52.4388 13.2846 -5.1898 461.8185 4761457 3183.146 154.9306 5.7124 2.8464 0.3247 1.2655 0.2376 2.4653
## 7 0.1171 49.6492 32.7782 -5.5243 507.5370 4758565 2760.723 141.6763 6.7311 3.1127 0.3226 1.2199 0.2237 1.5312
## 8 3.8897 56.7269 6.6299 -3.2021 594.8110 3446365 1660.113 114.9531 8.7615 3.8363 0.3595 1.1164 0.2698 0.9387
## 9 0.5681 52.4381 18.9386 -4.0267 610.4642 3924038 1488.517 109.7555 9.1230 4.0180 0.3877 1.1483 0.2436 1.7593
## 10 191.4579 54.6706 9.0077 -3.1185 675.9204 3112949 1218.597 96.5946 11.3004 4.5655 0.3353 1.0552 0.2755 1.4890
## Pseudot2 Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 -20.5785 -0.3948 0.3499 161.3107 0.3897 0.7216 0.5959 0.0448 0.0039 2.5461 1.3421 1.5280
## 3 18.7448 0.6500 0.3699 85.3493 0.3952 -0.6624 1.3145 0.0713 0.0039 2.1388 1.1972 2.4664
## 4 -25.8536 -0.6113 0.3673 50.6827 0.5377 0.8726 1.0462 0.1015 0.0055 1.9241 1.1020 0.8803
## 5 -0.2367 -0.0085 0.3530 33.1082 0.4907 0.8134 1.6782 0.0997 0.0063 2.1876 0.9905 0.5494
## 6 -22.5861 -0.9523 0.3287 25.8218 0.4500 1.0138 2.2926 0.0947 0.0066 2.1364 0.9463 0.5478
## 7 -8.6726 -0.5537 0.3113 20.2395 0.4088 -0.0809 2.9965 0.0950 0.0067 2.3157 0.8957 0.4413
## 8 0.8487 0.1010 0.3040 14.3691 0.4385 15.1804 2.8837 0.1151 0.0076 2.3309 0.8131 0.3798
## 9 -16.4010 -0.6998 0.2889 12.1951 0.4063 0.1725 3.4008 0.0883 0.0078 2.4678 0.8062 0.2845
## 10 -5.5831 -0.4970 0.2794 9.6595 0.4011 0.2082 3.7689 0.0825 0.0079 2.3107 0.7502 0.2459
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.4783 94.9094 1.0000
## 3 0.4587 56.6455 0.5843
## 4 0.4208 96.3332 1.0000
## 5 0.4105 66.0575 1.0000
## 6 0.2617 107.2113 1.0000
## 7 0.2464 76.4466 1.0000
## 8 0.1434 77.6251 0.9588
## 9 0.3119 83.8289 1.0000
## 10 0.0819 190.4474 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 10.0000 5.0000 8.0000 2.0000 5.0000 8 4.000 5.0000 10.0000 8.0000 7.0000 10.0000
## Value_Index 191.4579 59.4877 26.1483 -2.6363 95.7348 1789874 4919.386 26.5792 2.1774 -0.5419 0.3226 1.0552
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain Dunn Hubert SDindex
## Number_clusters 4.0000 2.0000 2.0000 2.0000 3.0000 3.0000 4.0000 1 2.0000 8.0000 0 4.0000
## Value_Index 0.2956 1.3098 -20.5785 -0.3948 0.3699 75.9614 0.5377 NA 0.5959 0.1151 0 1.9241
## Dindex SDbw
## Number_clusters 0 10.0000
## Value_Index 0 0.2459
##
## $Best.partition
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 1 1 2 1 1 2 2 1 2 2 1 2
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 2 2 1 1 2 1 2 2 2 2 1 2 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 1 1 2 2 2 2 2 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
## 2 2 2 1 2 2 2 1 2 1 2 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 2
The most of all indices suggested 2 as the number of clusters to use, so I will proceed with 2 clusters.
Clustering <- kmeans(mydata_clu_std,
centers = 2, #Number of groups
nstart = 25) #Number of different positions of initial leaders
Clustering
## K-means clustering with 2 clusters of sizes 57, 91
##
## Cluster means:
## Resting blood pressure Cholestoral Max heart rate
## 1 0.3051893 0.7938930 -0.7461824
## 2 -0.1911625 -0.4972736 0.4673890
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 1 1 2 1 1 1 2 1 2 2 1 2
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 2 2 1 2 2 1 2 1 1 2 2 1 2 2 2 1 2 2 1 1 2 1 2 2 2 2 1 2 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 2 2 2 2 2 1 1 1 2 2 1 2 1 2 2 1 1 2 2 2 1 2 2 2 2 1 2 1 1 2
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 1 1 2 2 2 2 2 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 1
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
## 2 2 2 1 2 2 2 1 2 1 2 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 2
##
## Within cluster sum of squares by cluster:
## [1] 153.9651 168.3566
## (between_SS / total_SS = 26.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
## [8] "iter" "ifault"
Here I performed K-Means clustering. Biggest ratio of final leaders is 26.9%, which explains the total variability of clustering the 3 variables into 2 groups, and the other 73.1% is not explained.
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)
With the help of Principal Component Analysis around 73.9% (42.2+31.7) of is showed when combining 3 variables into 2 dimensions.
I decided to not remove any units, because none of the units seem really far away from the other units in the cluster.
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Resting blood pressure Cholestoral Max heart rate
## 1 0.3051893 0.7938930 -0.7461824
## 2 -0.1911625 -0.4972736 0.4673890
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
Figure <- pivot_longer(Figure, cols = c("Resting blood pressure", "Cholestoral", "Max heart rate"),
names_to = "name",
values_to = "value")
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3, 4),
labels = c("1", "2", "3", "4"))
Figure$ImeF <- factor(Figure$name,
levels = c("Resting blood pressure", "Cholestoral", "Max heart rate"),
labels = c("Resting Blood Pressure", "Cholestoral", "Max Heart Rate"))
library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))
I will use this profiles of groups at the end for my conclusion.
mydata_sample$Group <- Clustering$cluster #Assigning units to groups
# Perform one-way ANOVA for each clustering variable to test differences between clusters
fit <- aov(cbind(`Resting blood pressure`, `Cholestoral`, `Max heart rate`) ~ as.factor(Group),
data = mydata_sample) # Replace `Cluster` with the name of your cluster assignment column
# Summarize the ANOVA results
summary(fit)
## Response Resting blood pressure :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 2974 2973.94 9.1108 0.002999 **
## Residuals 146 47657 326.42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Cholestoral :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 129690 129690 96.311 < 2.2e-16 ***
## Residuals 146 196601 1347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Max heart rate :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 1 24357 24357.1 79.007 2.131e-15 ***
## Residuals 146 45011 308.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response for Resting blood pressure:
H0: μ(Resting blood pressure, G1) = μ(Resting blood pressure, G2)
H1: At least one μ(Social_support, j) is different.
We can reject H0 at p < 0.01. We can reject H0 for all other cluster variables at p < 0.001. Therefore we can assume that the groups are statistically different in the mean values of the cluster variables.
Now I will check the criterion validity of the classification with variable that was not used in the clustering process. For this I chose Oldpeak. We have to check two assumptions: 1. Normal distribution of Oldpeak in each group 2. Homogenity of the variances
# Compute the mean of a variable for each cluster
aggregate(mydata_sample$Oldpeak,
by = list(mydata_sample$Group), # Replace `Cluster` with your cluster assignment column
FUN = mean)
## Group.1 x
## 1 1 1.4771930
## 2 2 0.8912088
On average group 1 has the biggest Oldpeak (ST depression induced by exercise relative to rest).
# Load the car library
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Perform Levene's test for homogeneity of variance
leveneTest(mydata_sample$Oldpeak, as.factor(mydata_sample$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.7646 0.05428 .
## 146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: σ2 (Oldpeak, G1) = σ2 (Oldpeak, G2)
H1: At least one σ2 (Oldpeak, j) is different. We cannot reject H0, because p > 0.05, meaning that we cannot say at least one variance is different. With this we checked the homogeneity of variances. Now we can proceed with Welch F-test (if normality is not violated).
# Load the required libraries
library(dplyr)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
# Perform Shapiro-Wilk normality test for each cluster group
mydata_sample %>%
group_by(Group) %>% # Replace `Cluster` with your cluster assignment column
shapiro_test(Oldpeak) # Replace `Oldpeak` with the variable you want to test
## # A tibble: 2 × 4
## Group variable statistic p
## <int> <chr> <dbl> <dbl>
## 1 1 Oldpeak 0.906 0.000316
## 2 2 Oldpeak 0.841 0.0000000176
H0: Oldpeak is normally distributed in G1.
H1: Oldpeak is not normally distributed in G1.
We cannot reject H0.
H0: Oldpeak is normally distributed in G2.
H1: Oldpeak is not normally distributed in G2.
We can reject H0.
Because I rejected one H0, I now have to use Kruskal-Wallis Rank Sum Test.
# Perform Kruskal-Wallis test for the variable of interest across clusters
kruskal.test(Oldpeak ~ Group,
data = mydata_sample) # Replace `Cluster` with your cluster assignment column
##
## Kruskal-Wallis rank sum test
##
## data: Oldpeak by Group
## Kruskal-Wallis chi-squared = 8.1107, df = 1, p-value = 0.004401
H0: Location distribution of Oldpeak are the same for all groups.
H1: Location distribution of Oldpeak are not the same for all groups.
# Compute the effect size for the Kruskal-Wallis test
library(rstatix)
kruskal_effsize(Oldpeak ~ Group,
data = mydata_sample) # Replace `Cluster` with your cluster assignment column
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Oldpeak 148 0.0487 eta2[H] small
We reject H0 at p < 0.005, therefore at least one location distribution is different from the other, meaning I trust that my clusters are meaningful. All in all, my result is validated (effect size is small).
Based on three standardized variables (Resting Blood Pressure, Cholesterol, and Maximum Heart Rate), I identified two distinct groups using hierarchical and K-means clustering.
Group 1 (57/148 = 38.5%) – High Cholesterol & Elevated Blood Pressure with Lower Heart Rate This group consists of individuals with significantly higher cholesterol levels and above-average resting blood pressure. However, their maximum heart rate is considerably lower than that of the other group. This suggests that these patients may have reduced cardiovascular efficiency or underlying heart conditions that limit their ability to reach higher heart rates. The presence of higher cholesterol levels combined with elevated blood pressure may put them at a higher risk for heart disease, strokes, and other cardiovascular complications. These individuals likely require medical intervention, including lifestyle changes and potentially medication, to manage their cardiovascular health.
Group 2 (91/148 = 61.5%) – Lower Cholesterol & Moderate Blood Pressure with Higher Heart Rate Individuals in this group tend to have lower cholesterol levels and resting blood pressure compared to the first group. However, they exhibit a significantly higher maximum heart rate, indicating better cardiovascular function and adaptability to physical exertion. This suggests that these patients may have better heart efficiency and a lower risk of severe cardiovascular conditions compared to Group 1. Nevertheless, individuals with particularly high heart rates may still be at risk of arrhythmias or other heart-related concerns. Lifestyle modifications such as balanced exercise routines and continued monitoring of blood pressure and cholesterol levels may be beneficial for these individuals.