library(readr)
mydata<- read_csv("HeartDiseaseTrain-Test.csv")

## Rows: 1025 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): sex, chest_pain_type, fasting_blood_sugar, rest_ecg, exercise_induced_angina, slope, vessels_colored_by_flo...
## dbl (6): age, resting_blood_pressure, cholestoral, Max_heart_rate, oldpeak, target
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(mydata)

## # A tibble: 6 × 14
##     age sex    chest_pain_type resting_blood_pressure cholestoral fasting_blood_sugar    rest_ecg         Max_heart_rate
##   <dbl> <chr>  <chr>                            <dbl>       <dbl> <chr>                  <chr>                     <dbl>
## 1    52 Male   Typical angina                     125         212 Lower than 120 mg/ml   ST-T wave abnor…            168
## 2    53 Male   Typical angina                     140         203 Greater than 120 mg/ml Normal                      155
## 3    70 Male   Typical angina                     145         174 Lower than 120 mg/ml   ST-T wave abnor…            125
## 4    61 Male   Typical angina                     148         203 Lower than 120 mg/ml   ST-T wave abnor…            161
## 5    62 Female Typical angina                     138         294 Greater than 120 mg/ml ST-T wave abnor…            106
## 6    58 Female Typical angina                     100         248 Lower than 120 mg/ml   Normal                      122
## # ℹ 6 more variables: exercise_induced_angina <chr>, oldpeak <dbl>, slope <chr>, vessels_colored_by_flourosopy <chr>,
## #   thalassemia <chr>, target <dbl>

Unit of observation is one patient.
Sample size is 1025.
Explanation of variables:

Age (in years)
Sex (male or female)
Chest pain type (4 values)
Resting blood pressure
Serum cholestoral in mg/dl
Fasting blood sugar > 120 mg/dl
Resting electrocardiographic results (values 0,1,2)
Maximum heart rate achieved
Exercise induced angina
Oldpeak = ST depression induced by exercise relative to rest
The slope of the peak exercise ST segment
Number of major vessels (0-3) colored by flourosopy
Thal: 0 = normal; 1 = fixed defect; 2 = reversable defect
Target (presence of heart disease in the patient): 0= not reached, 1= reached

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

My source is Kaggle.

# Add the ID column and place it as the first column
mydata <- data.frame(ID = seq(1, nrow(mydata)), mydata)

# View the updated dataset
head(mydata)

##   ID age    sex chest_pain_type resting_blood_pressure cholestoral    fasting_blood_sugar              rest_ecg
## 1  1  52   Male  Typical angina                    125         212   Lower than 120 mg/ml ST-T wave abnormality
## 2  2  53   Male  Typical angina                    140         203 Greater than 120 mg/ml                Normal
## 3  3  70   Male  Typical angina                    145         174   Lower than 120 mg/ml ST-T wave abnormality
## 4  4  61   Male  Typical angina                    148         203   Lower than 120 mg/ml ST-T wave abnormality
## 5  5  62 Female  Typical angina                    138         294 Greater than 120 mg/ml ST-T wave abnormality
## 6  6  58 Female  Typical angina                    100         248   Lower than 120 mg/ml                Normal
##   Max_heart_rate exercise_induced_angina oldpeak       slope vessels_colored_by_flourosopy       thalassemia target
## 1            168                      No     1.0 Downsloping                           Two Reversable Defect      0
## 2            155                     Yes     3.1   Upsloping                          Zero Reversable Defect      0
## 3            125                     Yes     2.6   Upsloping                          Zero Reversable Defect      0
## 4            161                      No     0.0 Downsloping                           One Reversable Defect      0
## 5            106                      No     1.9        Flat                         Three      Fixed Defect      0
## 6            122                      No     1.0        Flat                          Zero      Fixed Defect      1

Here I added column displaying IDs.

colnames(mydata) [1] <- "ID"
colnames(mydata) [2] <- "Age"
colnames(mydata) [3] <- "Sex"
colnames(mydata) [4] <- "Chest pain"
colnames(mydata) [5] <- "Resting blood pressure"
colnames(mydata) [6] <- "Cholestoral"
colnames(mydata) [7] <- "Fasting blood sugar"
colnames(mydata) [8] <- "Resting electrocardiographic results"
colnames(mydata) [9] <- "Max heart rate"
colnames(mydata) [10] <- "Exercise induced angina"
colnames(mydata) [11] <- "Oldpeak"
colnames(mydata) [12] <- "Slope of the peak"
colnames(mydata) [13] <- "Number of major vessels"
colnames(mydata) [14] <- "Thal"
colnames(mydata) [15] <- "Target"
head(mydata)

##   ID Age    Sex     Chest pain Resting blood pressure Cholestoral    Fasting blood sugar
## 1  1  52   Male Typical angina                    125         212   Lower than 120 mg/ml
## 2  2  53   Male Typical angina                    140         203 Greater than 120 mg/ml
## 3  3  70   Male Typical angina                    145         174   Lower than 120 mg/ml
## 4  4  61   Male Typical angina                    148         203   Lower than 120 mg/ml
## 5  5  62 Female Typical angina                    138         294 Greater than 120 mg/ml
## 6  6  58 Female Typical angina                    100         248   Lower than 120 mg/ml
##   Resting electrocardiographic results Max heart rate Exercise induced angina Oldpeak Slope of the peak
## 1                ST-T wave abnormality            168                      No     1.0       Downsloping
## 2                               Normal            155                     Yes     3.1         Upsloping
## 3                ST-T wave abnormality            125                     Yes     2.6         Upsloping
## 4                ST-T wave abnormality            161                      No     0.0       Downsloping
## 5                ST-T wave abnormality            106                      No     1.9              Flat
## 6                               Normal            122                      No     1.0              Flat
##   Number of major vessels              Thal Target
## 1                     Two Reversable Defect      0
## 2                    Zero Reversable Defect      0
## 3                    Zero Reversable Defect      0
## 4                     One Reversable Defect      0
## 5                   Three      Fixed Defect      0
## 6                    Zero      Fixed Defect      1

summary(mydata)

##        ID            Age            Sex             Chest pain        Resting blood pressure  Cholestoral 
##  Min.   :   1   Min.   :29.00   Length:1025        Length:1025        Min.   : 94.0          Min.   :126  
##  1st Qu.: 257   1st Qu.:48.00   Class :character   Class :character   1st Qu.:120.0          1st Qu.:211  
##  Median : 513   Median :56.00   Mode  :character   Mode  :character   Median :130.0          Median :240  
##  Mean   : 513   Mean   :54.43                                         Mean   :131.6          Mean   :246  
##  3rd Qu.: 769   3rd Qu.:61.00                                         3rd Qu.:140.0          3rd Qu.:275  
##  Max.   :1025   Max.   :77.00                                         Max.   :200.0          Max.   :564  
##  Fasting blood sugar Resting electrocardiographic results Max heart rate  Exercise induced angina    Oldpeak     
##  Length:1025         Length:1025                          Min.   : 71.0   Length:1025             Min.   :0.000  
##  Class :character    Class :character                     1st Qu.:132.0   Class :character        1st Qu.:0.000  
##  Mode  :character    Mode  :character                     Median :152.0   Mode  :character        Median :0.800  
##                                                           Mean   :149.1                           Mean   :1.072  
##                                                           3rd Qu.:166.0                           3rd Qu.:1.800  
##                                                           Max.   :202.0                           Max.   :6.200  
##  Slope of the peak  Number of major vessels     Thal               Target      
##  Length:1025        Length:1025             Length:1025        Min.   :0.0000  
##  Class :character   Class :character        Class :character   1st Qu.:0.0000  
##  Mode  :character   Mode  :character        Mode  :character   Median :1.0000  
##                                                                Mean   :0.5132  
##                                                                3rd Qu.:1.0000  
##                                                                Max.   :1.0000

mydata$`Sex` <- factor(mydata$`Sex`)
mydata$`Chest pain` <- factor(mydata$`Chest pain`)
mydata$`Fasting blood sugar` <- factor(mydata$`Fasting blood sugar`)
mydata$`Resting electrocardiographic results` <- factor(mydata$`Resting electrocardiographic results`)
mydata$`Exercise induced angina` <- factor(mydata$`Exercise induced angina`)
mydata$`Slope of the peak` <- factor(mydata$`Slope of the peak`)
mydata$`Number of major vessels` <- factor(mydata$`Number of major vessels`)
mydata$Thal <- factor(mydata$Thal)
mydata$Target <- factor(mydata$Target)



head(mydata)

##   ID Age    Sex     Chest pain Resting blood pressure Cholestoral    Fasting blood sugar
## 1  1  52   Male Typical angina                    125         212   Lower than 120 mg/ml
## 2  2  53   Male Typical angina                    140         203 Greater than 120 mg/ml
## 3  3  70   Male Typical angina                    145         174   Lower than 120 mg/ml
## 4  4  61   Male Typical angina                    148         203   Lower than 120 mg/ml
## 5  5  62 Female Typical angina                    138         294 Greater than 120 mg/ml
## 6  6  58 Female Typical angina                    100         248   Lower than 120 mg/ml
##   Resting electrocardiographic results Max heart rate Exercise induced angina Oldpeak Slope of the peak
## 1                ST-T wave abnormality            168                      No     1.0       Downsloping
## 2                               Normal            155                     Yes     3.1         Upsloping
## 3                ST-T wave abnormality            125                     Yes     2.6         Upsloping
## 4                ST-T wave abnormality            161                      No     0.0       Downsloping
## 5                ST-T wave abnormality            106                      No     1.9              Flat
## 6                               Normal            122                      No     1.0              Flat
##   Number of major vessels              Thal Target
## 1                     Two Reversable Defect      0
## 2                    Zero Reversable Defect      0
## 3                    Zero Reversable Defect      0
## 4                     One Reversable Defect      0
## 5                   Three      Fixed Defect      0
## 6                    Zero      Fixed Defect      1

summary(mydata)

##        ID            Age            Sex                 Chest pain  Resting blood pressure  Cholestoral 
##  Min.   :   1   Min.   :29.00   Female:312   Asymptomatic    : 77   Min.   : 94.0          Min.   :126  
##  1st Qu.: 257   1st Qu.:48.00   Male  :713   Atypical angina :167   1st Qu.:120.0          1st Qu.:211  
##  Median : 513   Median :56.00                Non-anginal pain:284   Median :130.0          Median :240  
##  Mean   : 513   Mean   :54.43                Typical angina  :497   Mean   :131.6          Mean   :246  
##  3rd Qu.: 769   3rd Qu.:61.00                                       3rd Qu.:140.0          3rd Qu.:275  
##  Max.   :1025   Max.   :77.00                                       Max.   :200.0          Max.   :564  
##              Fasting blood sugar           Resting electrocardiographic results Max heart rate  Exercise induced angina
##  Greater than 120 mg/ml:153      Left ventricular hypertrophy: 15               Min.   : 71.0   No :680                
##  Lower than 120 mg/ml  :872      Normal                      :497               1st Qu.:132.0   Yes:345                
##                                  ST-T wave abnormality       :513               Median :152.0                          
##                                                                                 Mean   :149.1                          
##                                                                                 3rd Qu.:166.0                          
##                                                                                 Max.   :202.0                          
##     Oldpeak        Slope of the peak Number of major vessels                Thal     Target 
##  Min.   :0.000   Downsloping:469     Four : 18               Fixed Defect     :544   0:499  
##  1st Qu.:0.000   Flat       :482     One  :226               No               :  7   1:526  
##  Median :0.800   Upsloping  : 74     Three: 69               Normal           : 64          
##  Mean   :1.072                       Two  :134               Reversable Defect:410          
##  3rd Qu.:1.800                       Zero :578                                              
##  Max.   :6.200

The minimum age is 29.

The average resting blood pressure is 131,6.

Fifty percent of patients have 240 or less than 240 and fifty percent of patients have more than 240 cholestoral.

Research question 1

Can we discover meaningful clusters of patients based on their clinical and demographic features, and how do these clusters correspond to heart disease diagnoses?

Now, because my dataset has too many units, I will shorten my sample size to 150 units for easier further interpretation.

set.seed(1)
# Randomly sample 150 rows from the dataset
mydata_sample <- mydata %>% sample_n(150)

# View the sampled dataset
head(mydata_sample)

##     ID Age    Sex       Chest pain Resting blood pressure Cholestoral    Fasting blood sugar
## 1 1017  65   Male     Asymptomatic                    138         282 Greater than 120 mg/ml
## 2  836  49   Male Non-anginal pain                    118         149   Lower than 120 mg/ml
## 3  679  41 Female Non-anginal pain                    112         268   Lower than 120 mg/ml
## 4  129  52   Male Non-anginal pain                    138         223   Lower than 120 mg/ml
## 5  930  60   Male   Typical angina                    130         206   Lower than 120 mg/ml
## 6  509  56 Female   Typical angina                    200         288 Greater than 120 mg/ml
##   Resting electrocardiographic results Max heart rate Exercise induced angina Oldpeak Slope of the peak
## 1                               Normal            174                      No     1.4              Flat
## 2                               Normal            126                      No     0.8       Downsloping
## 3                               Normal            172                     Yes     0.0       Downsloping
## 4                ST-T wave abnormality            169                      No     0.0       Downsloping
## 5                               Normal            132                     Yes     2.4              Flat
## 6                               Normal            133                     Yes     4.0         Upsloping
##   Number of major vessels              Thal Target
## 1                     One      Fixed Defect      0
## 2                   Three      Fixed Defect      0
## 3                    Zero      Fixed Defect      1
## 4                    Four      Fixed Defect      1
## 5                     Two Reversable Defect      0
## 6                     Two Reversable Defect      0

# Check the number of rows to confirm the sample size
nrow(mydata_sample)

## [1] 150

Is data appropriate?

It has to be numerical. (it is)
Dissimilarity metric should be checked with Hopkins

# Standardization of cluster variables
mydata_clu_std <- as.data.frame(scale(mydata_sample[c("Resting blood pressure", "Cholestoral", "Max heart rate")]))

We standardize all cluster variables, because we want all of the variables to have the same effect, despite the difference in variance.

mydata_sample$Dissimilarity = sqrt(mydata_clu_std$`Resting blood pressure`^2 + mydata_clu_std$Cholestoral^2 + mydata_clu_std$`Max heart rate`^2)

head(mydata_sample[order(-mydata_sample$Dissimilarity), c("ID", "Resting blood pressure", "Cholestoral", "Max heart rate", "Dissimilarity")], 10)

##      ID Resting blood pressure Cholestoral Max heart rate Dissimilarity
## 73  193                    115         564            160      5.391256
## 84  465                    115         564            160      5.391256
## 6   509                    200         288            133      3.888467
## 88  176                    200         288            133      3.888467
## 137 295                    200         288            133      3.888467
## 77  987                    180         327            117      3.366625
## 95  141                    152         274             88      3.174031
## 67  642                    134         409            150      2.712762
## 133  48                    178         228            165      2.646638
## 132 682                    170         326            140      2.544133

From the table above I can see some jump in dissimilarity numbers after the first two IDs that are displayed here, so I identify ID465 and ID193 as potential outlierS. This is why I will remove these units.

# Remove rows with specific outliers based on Dissimilarity or variable values
mydata_sample <- mydata_sample %>%
  filter(!ID %in% c(465, 193))  # Use backticks for column names with spaces

# Reassign IDs
mydata_sample$ID <- seq(1, nrow(mydata_sample))

# Standardize the clustering variables
mydata_clu_std <- as.data.frame(scale(mydata_sample[c("Resting blood pressure", "Cholestoral", "Max heart rate")]))

# Set row names of the standardized data to a unique identifier
rownames(mydata_clu_std) <- mydata_sample$ID

Now the sampler size got smaller, the new sample size is now 148.

# Access specific rows by their identifiers (replace with actual IDs or values in your dataset)
mydata_clu_std[c(1, 5, 10), ]  # Replace 1, 5, 10 with the row indices or identifiers relevant to your dataset

##    Resting blood pressure Cholestoral Max heart rate
## 1              0.37608783   0.8184698      1.0531856
## 5             -0.05497508  -0.7946629     -0.8802467
## 10            -0.16274081  -0.8158883      1.5135266

# Graphical presentation of the dissimilarity matrix
library(factoextra)

# Compute the distance matrix using Euclidean distance
Distance <- get_dist(mydata_clu_std, 
                     method = "euclidean")  # Note the correct spelling of "euclidean"

# Visualize the dissimilarity matrix
fviz_dist(Distance, 
          gradient = list(low = "darkred", 
                          mid = "grey95", 
                          high = "white"))

From the matrix of distances displayed above, we can see that some groups of clusters are forming, and I see 4 or 5 groups.

# Load the factoextra library
library(factoextra)

get_clust_tendency(mydata_clu_std,
                   n= nrow(mydata_clu_std)-1,
                   graph=FALSE)

## $hopkins_stat
## [1] 0.6657298
## 
## $plot
## NULL

Hopkins statistics in my case is above 0.5, so I can conclude that the data is clusterable. If it would be greater, closer to 1, it would be even more appropriate. Now the next question is how many clusters to use, so I will check this.

How many clusters to use?

WARD <- mydata_clu_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          


WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 148

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Based on the dendrogram, I would choose 2 clusters, as there is the biggest jump in vertical line.

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

With the elbow method, I can see the slope changes the most at 2 and 5, so both are possible options for number of clusters.

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

The highest value of the Silhouette analysis is at 6, so this is the suggested numbers of clusters.

NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 4 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 4 proposed 8 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

## $All.index
##          KL      CH Hartigan     CCC    Scott Marriot   TrCovW   TraceW Friedman  Rubin Cindex     DB Silhouette   Duda
## 2    3.4746 53.5714  37.9606 -2.6363 162.4096 4000271 9373.262 322.6215   1.4594 1.3669 0.3507 1.6100     0.2695 1.3098
## 3    0.9411 52.3692  38.1343 -4.4964 255.8988 4785588 9330.161 256.0479   2.5131 1.7223 0.3280 1.3736     0.2345 0.7192
## 4    0.8115 56.4144  32.3503 -5.3607 339.7263 4828670 4410.775 202.7307   3.6009 2.1753 0.3780 1.1465     0.2956 1.5856
## 5    0.9738 59.4877   9.7934 -3.9121 435.4611 3951128 2774.193 165.5410   5.2847 2.6640 0.3629 1.1124     0.2890 1.0052
## 6    4.0081 52.4388  13.2846 -5.1898 461.8185 4761457 3183.146 154.9306   5.7124 2.8464 0.3247 1.2655     0.2376 2.4653
## 7    0.1171 49.6492  32.7782 -5.5243 507.5370 4758565 2760.723 141.6763   6.7311 3.1127 0.3226 1.2199     0.2237 1.5312
## 8    3.8897 56.7269   6.6299 -3.2021 594.8110 3446365 1660.113 114.9531   8.7615 3.8363 0.3595 1.1164     0.2698 0.9387
## 9    0.5681 52.4381  18.9386 -4.0267 610.4642 3924038 1488.517 109.7555   9.1230 4.0180 0.3877 1.1483     0.2436 1.7593
## 10 191.4579 54.6706   9.0077 -3.1185 675.9204 3112949 1218.597  96.5946  11.3004 4.5655 0.3353 1.0552     0.2755 1.4890
##    Pseudot2   Beale Ratkowsky     Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  -20.5785 -0.3948    0.3499 161.3107     0.3897  0.7216  0.5959 0.0448 0.0039  2.5461 1.3421 1.5280
## 3   18.7448  0.6500    0.3699  85.3493     0.3952 -0.6624  1.3145 0.0713 0.0039  2.1388 1.1972 2.4664
## 4  -25.8536 -0.6113    0.3673  50.6827     0.5377  0.8726  1.0462 0.1015 0.0055  1.9241 1.1020 0.8803
## 5   -0.2367 -0.0085    0.3530  33.1082     0.4907  0.8134  1.6782 0.0997 0.0063  2.1876 0.9905 0.5494
## 6  -22.5861 -0.9523    0.3287  25.8218     0.4500  1.0138  2.2926 0.0947 0.0066  2.1364 0.9463 0.5478
## 7   -8.6726 -0.5537    0.3113  20.2395     0.4088 -0.0809  2.9965 0.0950 0.0067  2.3157 0.8957 0.4413
## 8    0.8487  0.1010    0.3040  14.3691     0.4385 15.1804  2.8837 0.1151 0.0076  2.3309 0.8131 0.3798
## 9  -16.4010 -0.6998    0.2889  12.1951     0.4063  0.1725  3.4008 0.0883 0.0078  2.4678 0.8062 0.2845
## 10  -5.5831 -0.4970    0.2794   9.6595     0.4011  0.2082  3.7689 0.0825 0.0079  2.3107 0.7502 0.2459
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.4783            94.9094       1.0000
## 3          0.4587            56.6455       0.5843
## 4          0.4208            96.3332       1.0000
## 5          0.4105            66.0575       1.0000
## 6          0.2617           107.2113       1.0000
## 7          0.2464            76.4466       1.0000
## 8          0.1434            77.6251       0.9588
## 9          0.3119            83.8289       1.0000
## 10         0.0819           190.4474       1.0000
## 
## $Best.nc
##                       KL      CH Hartigan     CCC   Scott Marriot   TrCovW  TraceW Friedman   Rubin Cindex      DB
## Number_clusters  10.0000  5.0000   8.0000  2.0000  5.0000       8    4.000  5.0000  10.0000  8.0000 7.0000 10.0000
## Value_Index     191.4579 59.4877  26.1483 -2.6363 95.7348 1789874 4919.386 26.5792   2.1774 -0.5419 0.3226  1.0552
##                 Silhouette   Duda PseudoT2   Beale Ratkowsky    Ball PtBiserial Frey McClain   Dunn Hubert SDindex
## Number_clusters     4.0000 2.0000   2.0000  2.0000    3.0000  3.0000     4.0000    1  2.0000 8.0000      0  4.0000
## Value_Index         0.2956 1.3098 -20.5785 -0.3948    0.3699 75.9614     0.5377   NA  0.5959 0.1151      0  1.9241
##                 Dindex    SDbw
## Number_clusters      0 10.0000
## Value_Index          0  0.2459
## 
## $Best.partition
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
##   2   2   2   2   2   1   2   2   1   2   2   2   2   2   2   2   2   1   1   1   2   1   1   2   2   1   2   2   1   2 
##  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   2   2   1   2   2   2   2   1   1   2   2   1   2   2   2   1   2   2   1   1   2   1   2   2   2   2   1   2   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   2   2   2   2   2   2   1   1   2   2   1   2   1   2   2   1   1   2   2   2   1   2   2   2   2   1   2   2   2   2 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   1   1   2   2   2   2   2   1   1   2   2   1   1   2   1   2   2   2   2   2   2   2   2   2   1   2   1   2   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 
##   2   2   2   1   2   2   2   1   2   1   2   2   1   1   1   1   2   1   2   2   1   1   1   1   2   2   1   2

The most of all indices suggested 2 as the number of clusters to use, so I will proceed with 2 clusters.

K-Means Clustering

Clustering <- kmeans(mydata_clu_std, 
                     centers = 2, #Number of groups
                     nstart = 25) #Number of different positions of initial leaders

Clustering

## K-means clustering with 2 clusters of sizes 57, 91
## 
## Cluster means:
##   Resting blood pressure Cholestoral Max heart rate
## 1              0.3051893   0.7938930     -0.7461824
## 2             -0.1911625  -0.4972736      0.4673890
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
##   2   2   2   2   2   1   2   2   1   2   2   2   2   2   2   2   2   1   1   1   2   1   1   1   2   1   2   2   1   2 
##  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   2   2   1   2   2   1   2   1   1   2   2   1   2   2   2   1   2   2   1   1   2   1   2   2   2   2   1   2   1   1 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   2   2   2   2   2   1   1   1   2   2   1   2   1   2   2   1   1   2   2   2   1   2   2   2   2   1   2   1   1   2 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   1   1   2   2   2   2   2   1   1   2   2   1   1   2   1   2   2   2   2   2   2   2   2   2   1   2   1   2   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 
##   2   2   2   1   2   2   2   1   2   1   2   2   1   1   1   1   2   1   2   2   1   1   1   1   2   2   1   2 
## 
## Within cluster sum of squares by cluster:
## [1] 153.9651 168.3566
##  (between_SS / total_SS =  26.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
## [8] "iter"         "ifault"

Here I performed K-Means clustering. Biggest ratio of final leaders is 26.9%, which explains the total variability of clustering the 3 variables into 2 groups, and the other 73.1% is not explained.

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

With the help of Principal Component Analysis around 73.9% (42.2+31.7) of is showed when combining 3 variables into 2 dimensions.

I decided to not remove any units, because none of the units seem really far away from the other units in the cluster.

Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups

##   Resting blood pressure Cholestoral Max heart rate
## 1              0.3051893   0.7938930     -0.7461824
## 2             -0.1911625  -0.4972736      0.4673890

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

Figure <- pivot_longer(Figure, cols = c("Resting blood pressure", "Cholestoral", "Max heart rate"),
                       names_to = "name",
                       values_to = "value")

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3, 4), 
                       labels = c("1", "2", "3", "4"))

Figure$ImeF <- factor(Figure$name, 
              levels = c("Resting blood pressure", "Cholestoral", "Max heart rate"), 
                      labels = c("Resting Blood Pressure", "Cholestoral", "Max Heart Rate"))

library(ggplot2)
ggplot(Figure, aes(x = ImeF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

I will use this profiles of groups at the end for my conclusion.

mydata_sample$Group <- Clustering$cluster #Assigning units to groups

Appropriatness of cluster variables used

# Perform one-way ANOVA for each clustering variable to test differences between clusters
fit <- aov(cbind(`Resting blood pressure`, `Cholestoral`, `Max heart rate`) ~ as.factor(Group), 
           data = mydata_sample)  # Replace `Cluster` with the name of your cluster assignment column

# Summarize the ANOVA results
summary(fit)

##  Response Resting blood pressure :
##                   Df Sum Sq Mean Sq F value   Pr(>F)   
## as.factor(Group)   1   2974 2973.94  9.1108 0.002999 **
## Residuals        146  47657  326.42                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Cholestoral :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   1 129690  129690  96.311 < 2.2e-16 ***
## Residuals        146 196601    1347                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Max heart rate :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   1  24357 24357.1  79.007 2.131e-15 ***
## Residuals        146  45011   308.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Response for Resting blood pressure:

H0: μ(Resting blood pressure, G1) = μ(Resting blood pressure, G2)

H1: At least one μ(Social_support, j) is different.

We can reject H0 at p < 0.01. We can reject H0 for all other cluster variables at p < 0.001. Therefore we can assume that the groups are statistically different in the mean values of the cluster variables.

Criterion validity

Now I will check the criterion validity of the classification with variable that was not used in the clustering process. For this I chose Oldpeak. We have to check two assumptions: 1. Normal distribution of Oldpeak in each group 2. Homogenity of the variances

# Compute the mean of a variable for each cluster
aggregate(mydata_sample$Oldpeak, 
          by = list(mydata_sample$Group),  # Replace `Cluster` with your cluster assignment column
          FUN = mean)

##   Group.1         x
## 1       1 1.4771930
## 2       2 0.8912088

On average group 1 has the biggest Oldpeak (ST depression induced by exercise relative to rest).

# Load the car library
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Perform Levene's test for homogeneity of variance
leveneTest(mydata_sample$Oldpeak, as.factor(mydata_sample$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   1  3.7646 0.05428 .
##       146                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: σ2 (Oldpeak, G1) = σ2 (Oldpeak, G2)
H1: At least one σ2 (Oldpeak, j) is different. We cannot reject H0, because p > 0.05, meaning that we cannot say at least one variance is different. With this we checked the homogeneity of variances. Now we can proceed with Welch F-test (if normality is not violated).

# Load the required libraries
library(dplyr)
library(rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

# Perform Shapiro-Wilk normality test for each cluster group
mydata_sample %>%
  group_by(Group) %>%  # Replace `Cluster` with your cluster assignment column
  shapiro_test(Oldpeak)  # Replace `Oldpeak` with the variable you want to test

## # A tibble: 2 × 4
##   Group variable statistic            p
##   <int> <chr>        <dbl>        <dbl>
## 1     1 Oldpeak      0.906 0.000316    
## 2     2 Oldpeak      0.841 0.0000000176

H0: Oldpeak is normally distributed in G1.

H1: Oldpeak is not normally distributed in G1.

We cannot reject H0.

H0: Oldpeak is normally distributed in G2.

H1: Oldpeak is not normally distributed in G2.

We can reject H0.

Because I rejected one H0, I now have to use Kruskal-Wallis Rank Sum Test.

# Perform Kruskal-Wallis test for the variable of interest across clusters
kruskal.test(Oldpeak ~ Group, 
             data = mydata_sample)  # Replace `Cluster` with your cluster assignment column

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Oldpeak by Group
## Kruskal-Wallis chi-squared = 8.1107, df = 1, p-value = 0.004401

H0: Location distribution of Oldpeak are the same for all groups.
H1: Location distribution of Oldpeak are not the same for all groups.

# Compute the effect size for the Kruskal-Wallis test
library(rstatix)
kruskal_effsize(Oldpeak ~ Group, 
                data = mydata_sample)  # Replace `Cluster` with your cluster assignment column

## # A tibble: 1 × 5
##   .y.         n effsize method  magnitude
## * <chr>   <int>   <dbl> <chr>   <ord>    
## 1 Oldpeak   148  0.0487 eta2[H] small

We reject H0 at p < 0.005, therefore at least one location distribution is different from the other, meaning I trust that my clusters are meaningful. All in all, my result is validated (effect size is small).

Conclusion

Based on three standardized variables (Resting Blood Pressure, Cholesterol, and Maximum Heart Rate), I identified two distinct groups using hierarchical and K-means clustering.

Group 1 (57/148 = 38.5%) – High Cholesterol & Elevated Blood Pressure with Lower Heart Rate This group consists of individuals with significantly higher cholesterol levels and above-average resting blood pressure. However, their maximum heart rate is considerably lower than that of the other group. This suggests that these patients may have reduced cardiovascular efficiency or underlying heart conditions that limit their ability to reach higher heart rates. The presence of higher cholesterol levels combined with elevated blood pressure may put them at a higher risk for heart disease, strokes, and other cardiovascular complications. These individuals likely require medical intervention, including lifestyle changes and potentially medication, to manage their cardiovascular health.

Group 2 (91/148 = 61.5%) – Lower Cholesterol & Moderate Blood Pressure with Higher Heart Rate Individuals in this group tend to have lower cholesterol levels and resting blood pressure compared to the first group. However, they exhibit a significantly higher maximum heart rate, indicating better cardiovascular function and adaptability to physical exertion. This suggests that these patients may have better heart efficiency and a lower risk of severe cardiovascular conditions compared to Group 1. Nevertheless, individuals with particularly high heart rates may still be at risk of arrhythmias or other heart-related concerns. Lifestyle modifications such as balanced exercise routines and continued monitoring of blood pressure and cholesterol levels may be beneficial for these individuals.

MVA HW 2

Maj Batagelj

2025-01-27