Heart disease, a term encompassing ailments that affect the heart and circulatory system, emerges as both a worldwide health concern and chief instigator of disability. Because the heart plays an essential role in bodily functions; any affliction to it can trigger extensive consequences on other organs and physiological processes. A myriad of forms exist within this category - from conditions inducing coronary artery narrowing to valve malfunctions, heart enlargements - frequently culminating in heart failure or even fatal cardiac episodes.

Tailored specifically for heart disease, the data set provided offers a valuable resource: the potential to extract insights that illuminate each feature’s significance and their interrelationships. Our primary objective in this analysis is to determine–with precision–the probability of an individual’s susceptibility to severe heart problems.

Research Question: Can we identify distinct groups of individuals based on their cardiovascular health-related attributes?

Utilizing clustering analysis as a suitable method, we can actively explore patterns and groupings within the dataset. The mix of categorical and numeric variables associated with cardiovascular health presents an opportunity to identify individuals’ similarities and differences based on these attributes; this is where clustering proves particularly helpful. Through our application of clustering analysis, we unearth natural groupings or clusters in the dataset - a process that enables us to pinpoint subgroups of individuals exhibiting akin characteristics. The dataset, commonly found in clustering applications, comprises a mix of numeric and categorical variables; clustering algorithms can accommodate both types of data. This provides insights into the diverse phenotype of cardiovascular health: it handles various distributions within these variable types–an invaluable tool for comprehensive analysis. As an unsupervised learning technique, clustering necessitates no labeled data; it’s a method well-suited for our exploration of patterns and groupings without predefined categories. We interpret the results of clustering to understand each identified group’s characteristics: this is how we derive meaningful insights from raw information. Developing targeted interventions or understanding risk factors associated with cardiovascular health can benefit from this.

#Load the required libraries
library(tidyverse)
library(dplyr)
library(cluster)
#Import the dataset 
#set working directory
setwd("C:/Users/Baha/Downloads")
Heart <-read.csv("heart.csv")
head(Heart)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  52   1  0      125  212   0       1     168     0     1.0     2  2    3
## 2  53   1  0      140  203   1       0     155     1     3.1     0  0    3
## 3  70   1  0      145  174   0       1     125     1     2.6     0  0    3
## 4  61   1  0      148  203   0       1     161     0     0.0     2  1    3
## 5  62   0  0      138  294   1       1     106     0     1.9     1  3    2
## 6  58   0  0      100  248   0       0     122     0     1.0     1  0    2
##   target
## 1   0.23
## 2   0.37
## 3   0.24
## 4   0.28
## 5   0.21
## 6   0.78

Description of variables: Age: Numeric (e.g., 52) Sex: Categorical (0: Female, 1: Male) Chest Pain Type: Categorical (0: Typical Angina, 1: Atypical Angina, 2: Non-anginal Pain, 3: Asymptomatic) Resting Blood Pressure: Numeric (e.g., 125) Serum Cholesterol: Numeric in mg/dL (e.g., 212) Fasting Blood Sugar: Categorical (0: <= 120 mg/dL, 1: > 120 mg/dL) Resting Electrocardiographic Results: Categorical (0: Normal, 1: Abnormality, 2: Hypertrophy) Maximum Heart Rate Achieved: Numeric (e.g., 168) Exercise-Induced Angina: Categorical (0: No, 1: Yes) Oldpeak (ST Depression): Numeric (e.g., 1.0) Slope of Peak Exercise ST Segment: Categorical (0: Upsloping, 1: Flat, 2: Downsloping) Number of Major Vessels Colored by Fluoroscopy: Numeric (0 to 3) Thalassemia: Categorical (0: Normal, 1: Fixed Defect, 2: Reversible Defect)

Data source: The dataset was obtained from kaggle website https://www.kaggle.com/datasets/juledz/heart-attack-prediction

#factor the categorical variables
attach(Heart)
Heart$sex <-as.factor(Heart$sex)
Heart$cp <-as.factor(Heart$cp)
Heart$fbs <-as.factor(Heart$fbs)
Heart$restecg <-as.factor(Heart$restecg)
Heart$exang <-as.factor(Heart$exang)
Heart$slope <-as.factor(Heart$slope)
Heart$ca <-as.factor(Heart$ca)
#convert the loaded dataset into a data frame
head(as.data.frame(Heart))
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  52   1  0      125  212   0       1     168     0     1.0     2  2    3
## 2  53   1  0      140  203   1       0     155     1     3.1     0  0    3
## 3  70   1  0      145  174   0       1     125     1     2.6     0  0    3
## 4  61   1  0      148  203   0       1     161     0     0.0     2  1    3
## 5  62   0  0      138  294   1       1     106     0     1.9     1  3    2
## 6  58   0  0      100  248   0       0     122     0     1.0     1  0    2
##   target
## 1   0.23
## 2   0.37
## 3   0.24
## 4   0.28
## 5   0.21
## 6   0.78
#Sample 300 observations from the loaded dataset
sample_size <-300
# Randomly sample 300 observations from the data frame
Heart <- Heart[sample(nrow(Heart), size = sample_size, replace = FALSE), ]
dim(Heart)
## [1] 300  14
dplyr::glimpse(Heart)
## Rows: 300
## Columns: 14
## $ age      <int> 57, 63, 58, 52, 56, 41, 38, 45, 44, 64, 67, 64, 46, 58, 52, 6…
## $ sex      <fct> 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1…
## $ cp       <fct> 0, 1, 2, 1, 1, 1, 2, 1, 2, 0, 2, 0, 0, 0, 0, 2, 3, 2, 2, 2, 0…
## $ trestbps <int> 128, 140, 105, 134, 140, 110, 138, 130, 130, 145, 152, 145, 1…
## $ chol     <int> 303, 195, 240, 201, 294, 235, 175, 234, 233, 212, 212, 212, 2…
## $ fbs      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ restecg  <fct> 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0…
## $ thalach  <int> 159, 179, 154, 158, 153, 153, 173, 175, 179, 132, 150, 132, 1…
## $ exang    <fct> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1…
## $ oldpeak  <dbl> 0.0, 0.0, 0.6, 0.8, 1.3, 0.0, 0.0, 0.6, 0.4, 2.0, 0.8, 2.0, 0…
## $ slope    <fct> 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 0, 2, 2, 2, 2, 2, 1, 1…
## $ ca       <fct> 1, 2, 0, 1, 0, 0, 4, 0, 0, 2, 0, 2, 0, 3, 1, 0, 1, 1, 3, 0, 2…
## $ thal     <int> 2, 2, 3, 2, 2, 2, 2, 2, 2, 1, 3, 1, 3, 1, 2, 2, 2, 2, 2, 3, 3…
## $ target   <dbl> 0.90, 0.75, 0.83, 0.78, 0.73, 0.72, 0.85, 0.78, 0.72, 0.37, 0…
#Check for missing values in the sampled dataset
colSums(is.na(Heart)) # there are no missing values in the dataset
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

There are no missing values in the loaded dataset.

#Get the summary statistics of the dataset
summary(Heart)
##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 97   0:149   Min.   : 94.0   Min.   :149.0   0:257  
##  1st Qu.:47.00   1:203   1: 46   1st Qu.:120.0   1st Qu.:212.0   1: 43  
##  Median :55.00           2: 85   Median :130.0   Median :239.5          
##  Mean   :54.44           3: 20   Mean   :131.4   Mean   :249.1          
##  3rd Qu.:61.25                   3rd Qu.:140.0   3rd Qu.:275.5          
##  Max.   :77.00                   Max.   :200.0   Max.   :564.0          
##  restecg    thalach      exang      oldpeak      slope   ca           thal     
##  0:143   Min.   : 71.0   0:194   Min.   :0.000   0: 25   0:166   Min.   :0.00  
##  1:153   1st Qu.:138.0   1:106   1st Qu.:0.000   1:127   1: 74   1st Qu.:2.00  
##  2:  4   Median :154.0           Median :0.800   2:148   2: 37   Median :2.00  
##          Mean   :150.1           Mean   :1.083           3: 17   Mean   :2.32  
##          3rd Qu.:165.0           3rd Qu.:1.800           4:  6   3rd Qu.:3.00  
##          Max.   :202.0           Max.   :6.200                   Max.   :3.00  
##      target      
##  Min.   :0.1000  
##  1st Qu.:0.2575  
##  Median :0.7200  
##  Mean   :0.5516  
##  3rd Qu.:0.8200  
##  Max.   :0.9000

Descriptive statistics: The average age of individuals in this study is 54 years, the median is 55 years and the minimum and maximum ages are 34 years and 77 years respectively. The number of individuals with normal Resting Electrocardiographic is 149, those with abnormal Resting Electrocardiographic is 147 and those with Hypertrophic resting Electrocardiographic is 4. The number of males under study is 215 compared to the number of females which is only 85. The number of individuals with typical angina type of chest pain is 152. Atypical angina is 44, non-anginal pain is 79 and individiuals with asymptomatic type of chest pain is 25.

Standardize the numeric variables

Heart$age_z <-scale(Heart$age)
Heart$trestbps_z <-scale(Heart$trestbps)
Heart$chol_z <-scale(Heart$chol)
Heart$thalach_z <-scale(Heart$thalach)
Heart$oldpeak_z <-scale(Heart$oldpeak)
Heart$thal_z <-scale(Heart$thal)
Heart$target_z <-scale(Heart$target)
#Load the required package
library(Hmisc)
rcorr(as.matrix(Heart[, c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")]), 
      type = "pearson")
##            age_z trestbps_z chol_z thalach_z oldpeak_z thal_z target_z
## age_z       1.00       0.32   0.23     -0.31      0.16   0.05    -0.15
## trestbps_z  0.32       1.00   0.11     -0.05      0.27  -0.03    -0.08
## chol_z      0.23       0.11   1.00     -0.03      0.10   0.09    -0.11
## thalach_z  -0.31      -0.05  -0.03      1.00     -0.35  -0.10     0.38
## oldpeak_z   0.16       0.27   0.10     -0.35      1.00   0.21    -0.44
## thal_z      0.05      -0.03   0.09     -0.10      0.21   1.00    -0.33
## target_z   -0.15      -0.08  -0.11      0.38     -0.44  -0.33     1.00
## 
## n= 300 
## 
## 
## P
##            age_z  trestbps_z chol_z thalach_z oldpeak_z thal_z target_z
## age_z             0.0000     0.0000 0.0000    0.0043    0.4193 0.0091  
## trestbps_z 0.0000            0.0491 0.3717    0.0000    0.6558 0.1641  
## chol_z     0.0000 0.0491            0.5846    0.0740    0.1208 0.0646  
## thalach_z  0.0000 0.3717     0.5846           0.0000    0.0786 0.0000  
## oldpeak_z  0.0043 0.0000     0.0740 0.0000              0.0002 0.0000  
## thal_z     0.4193 0.6558     0.1208 0.0786    0.0002           0.0000  
## target_z   0.0091 0.1641     0.0646 0.0000    0.0000    0.0000

Upon examining the correlation matrix, I ascertained that a close proximity to zero is desirable for its correlations: indeed, it’s optimal when they hover around this value. However – should the values deviate significantly from zero - introducing two variables measuring the same dimension can be advantageous; such an approach mitigates potential distortions in our data analysis.

To verify potential deviations within the units, I aim to identify outliers. In this pursuit, my strategy involves calculating each unit’s Euclidean distance from the mean across all classification variables.

Heart$Difference_z <- sqrt(Heart$age_z^2 + Heart$trestbps_z^2 + Heart$chol_z^2 + Heart$thalach_z^2 + Heart$oldpeak_z^2 + Heart$thal_z^2 + Heart$target_z^2)
head(Heart[order(-Heart$Difference_z), ], 10)
##     age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 193  67   0  2      115  564   0       0     160     0     1.6     1  0    3
## 70   62   0  0      160  164   0       0     145     0     6.2     0  3    3
## 295  56   0  0      200  288   1       0     133     1     4.0     0  2    3
## 686  63   0  0      150  407   0       0     154     0     4.0     1  3    3
## 614  55   1  0      140  217   0       1     111     1     5.6     0  0    3
## 890  63   0  0      150  407   0       0     154     0     4.0     1  3    3
## 247  54   1  1      192  283   0       0     195     0     0.0     2  1    3
## 357  59   1  0      164  176   1       0      90     0     1.0     1  2    1
## 360  53   0  2      128  216   0       0     115     0     0.0     2  0    0
## 330  53   0  2      128  216   0       0     115     0     0.0     2  0    0
##     target       age_z trestbps_z     chol_z  thalach_z   oldpeak_z    thal_z
## 193   0.89  1.40386657 -0.9133619  6.0339727  0.4530751  0.43474623  1.104760
## 70    0.32  0.84515077  1.5979182 -1.6294313 -0.2366536  4.30040216  1.104760
## 295   0.23  0.17469181  3.8301672  0.7462240 -0.7884365  2.45161019  1.104760
## 686   0.13  0.95689393  1.0398560  3.0260866  0.1771836  2.45161019  1.104760
## 614   0.33  0.06294865  0.4817937 -0.6140302 -1.8000385  3.79618617  1.104760
## 890   0.19  0.95689393  1.0398560  3.0260866  0.1771836  2.45161019  1.104760
## 247   0.20 -0.04879451  3.3837174  0.6504314  2.0624420 -0.90982974  1.104760
## 357   0.19  0.50992129  1.8211431 -1.3995291 -2.7656587 -0.06946976 -2.144534
## 360   0.87 -0.16053767 -0.1878810 -0.6331887 -1.6161109 -0.90982974 -3.769181
## 330   0.80 -0.16053767 -0.1878810 -0.6331887 -1.6161109 -0.90982974 -3.769181
##       target_z Difference_z
## 193  1.1730204     6.496511
## 70  -0.8028118     5.131972
## 295 -1.1147854     4.934857
## 686 -1.4614226     4.533452
## 614 -0.7681481     4.480495
## 890 -1.2534403     4.470743
## 247 -1.2187765     4.434225
## 357 -1.2534403     4.399897
## 360  1.1036930     4.396188
## 330  0.8610469     4.341627
#Loading the required package
library(factoextra)
Distances <- get_dist(Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")], 
                     method = "euclidean")

Distances1 <- Distances^2

fviz_dist(Distances1)

get_clust_tendency(Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")],
                   n = nrow(Heart) - 1, 
                   graph = FALSE)
## $hopkins_stat
## [1] 0.8033171
## 
## $plot
## NULL

I calculated the Hopkins statistics. Since it is greater than 0.5 (0.7963), it is OK and I can continue displaying the Dendrogram.

library(dplyr)
Ward <- Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")] %>% 
                            
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          

print(Ward)
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 300
library(factoextra)
fviz_dend(Ward)

CONCLUSION: I can see from the dendrogram that, based on the distances, it makes the most sense to classify into 3 groups.

set.seed(123)
library(dplyr)
library(NbClust)
TotalGroups <- Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")] %>%
  
  NbClust(distance = "euclidean",
          min.nc = 2, max.nc = 8, 
          method = "ward.D2", 
          index = "all") 

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 4 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

the number of clusters equals 3, determines the number of groups, display hierarchically.

Heart$ClassificationWard<- cutree(Ward, 
                                  k = 3)  
head(Heart)
##     age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 421  57   0  0      128  303   0       0     159     0     0.0     2  1    2
## 600  63   0  1      140  195   0       1     179     0     0.0     2  2    2
## 369  58   1  2      105  240   0       0     154     1     0.6     1  0    3
## 79   52   1  1      134  201   0       1     158     0     0.8     2  1    2
## 310  56   0  1      140  294   0       0     153     0     1.3     1  0    2
## 140  41   1  1      110  235   0       1     153     0     0.0     2  0    2
##     target      age_z trestbps_z     chol_z thalach_z  oldpeak_z    thal_z
## 421   0.90  0.2864350 -0.1878810  1.0336016 0.4070932 -0.9098297 -0.519887
## 600   0.75  0.9568939  0.4817937 -1.0355175 1.3267314 -0.9098297 -0.519887
## 369   0.83  0.3981781 -1.4714241 -0.1733845 0.1771836 -0.4056138  1.104760
## 79    0.78 -0.2722808  0.1469564 -0.9205664 0.3611113 -0.2375418 -0.519887
## 310   0.73  0.1746918  0.4817937  0.8611750 0.1312017  0.1826382 -0.519887
## 140   0.72 -1.5014556 -1.1923930 -0.2691771 0.1312017 -0.9098297 -0.519887
##      target_z Difference_z ClassificationWard
## 421 1.2076841     1.976860                  1
## 600 0.6877283     2.356125                  1
## 369 0.9650381     2.168273                  1
## 79  0.7917195     1.423764                  1
## 310 0.6184008     1.306728                  1
## 140 0.5837371     2.281373                  1
Start_leaders <-  aggregate(Heart[, c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")], 
                           by = list(Heart$ClassificationWard), 
                           FUN = mean)

print(Start_leaders)
##   Group.1      age_z trestbps_z     chol_z   thalach_z  oldpeak_z     thal_z
## 1       1 -0.4525937 -0.3109083 -0.2790466  0.55862448 -0.5049290 -0.2860363
## 2       2  0.5717087  0.6111329  0.6170731 -0.89879309  0.5583286 -0.1758441
## 3       3  0.1343027 -0.1314024 -0.1881573  0.03203591  0.2312374  0.6349825
##     target_z
## 1  0.8736519
## 2 -0.3105870
## 3 -1.0713513

Perform now the sorting according to the method of leaders.

library(factoextra)

GuideMethod <- hkmeans(Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")],  
                   k = 3, 
                   hc.metric = "euclidean", 
                   hc.method = "ward.D2") 

print(GuideMethod)
## Hierarchical K-means clustering with 3 clusters of sizes 136, 64, 100
## 
## Cluster means:
##         age_z trestbps_z      chol_z  thalach_z  oldpeak_z     thal_z
## 1 -0.47111793 -0.3680202 -0.32566649  0.5311767 -0.5261065 -0.3765358
## 2  0.91149827  1.1750117  0.71838425 -0.2711400  0.3874760 -0.4437317
## 3  0.05736149 -0.2515000 -0.01685949 -0.5488707  0.4675203  0.7960770
##     target_z
## 1  0.8149136
## 2 -0.1664075
## 3 -1.0017816
## 
## Clustering vector:
##  421  600  369   79  310  140   84  504  808 1001  242  609   44  151  804  318 
##    1    1    1    1    2    1    1    1    1    2    3    2    3    2    3    1 
##  715  933  364  158  541  908  717  168 1011  757  897  112  990  823   68  974 
##    1    2    1    1    3    1    3    1    3    3    2    3    2    3    2    1 
##  135  222  718  172  146  879  676  347  265  391  508  942  408  776  701  420 
##    1    3    2    3    3    3    3    3    2    2    1    1    3    1    1    2 
##  427  439  998  131   78  124  976  719  921  901  970  659  194  467  791  270 
##    1    1    3    2    3    2    3    1    3    3    1    3    2    1    3    1 
##  377  751  567  730  951  379  306  236   86  850  199  867 1004  359  418   19 
##    1    1    1    1    2    3    3    2    1    3    1    1    1    3    1    1 
##  330  419  136  694  296  269  460  345  999  184  711  539  433  271  666  473 
##    1    1    2    1    3    3    1    1    2    1    1    3    1    1    2    1 
##  315  410  649   87   16  809  332  454  556   40  625  618  826  303  594  773 
##    1    3    1    1    1    1    3    1    3    3    2    1    1    1    3    2 
##  468  729 1012  996  950  230  501 1020  357  813  680  129  373  877  611  675 
##    3    3    1    1    1    2    1    1    2    2    2    1    1    3    3    3 
##  772  284  444  575  940  363  375  936 1019  191  633  615  193  693  660  785 
##    1    1    1    3    1    1    1    1    3    1    1    3    2    3    1    1 
##  644  206  128  721  276  872  763  317  395  949  758  895  851  947  165  958 
##    1    1    1    2    3    1    1    2    3    3    2    3    3    2    3    3 
##  760  665  295  255  961  104  183  205  960  864  733  401    2  360  663  520 
##    2    2    2    3    1    1    3    2    2    3    1    3    3    1    1    3 
##   12  272  610  392   92   97  938  578  722   91  353  712  423  927  132  946 
##    3    1    2    3    1    1    2    2    1    1    3    3    1    3    1    1 
##  549  481   64  533  731  843  141  125  393  979  911  266  890  382  417  386 
##    1    3    2    3    1    3    3    3    1    1    3    1    2    3    1    1 
##  399  366   80  349  855  142  282  195  703  636  499  622  780  228  319    3 
##    2    1    1    3    2    3    1    2    2    1    1    3    1    1    3    3 
##  188  688  750  247   14  370  736  179  383  381  783  342   39  207  920  613 
##    3    3    1    2    3    1    1    1    3    3    2    1    3    2    3    2 
##  643  868  538  253  829   65  935  204  678  214  530  894  943  513  477  981 
##    3    1    1    3    1    1    1    2    3    1    2    1    2    3    2    1 
##  792  580  251   72  654  396  614  743  686  458  545  725  766   70  916  691 
##    3    3    3    3    3    2    3    3    2    1    1    2    3    3    3    1 
##  453   24  161   34  496  912  106   49  510  954   31  331  111  261  682  333 
##    2    1    2    2    1    2    3    2    2    1    1    1    1    1    2    1 
##  992  464  201  641  243  500  818  138  830  860  301  175 
##    3    1    1    1    1    1    1    2    3    3    1    3 
## 
## Within cluster sum of squares by cluster:
## [1] 484.9268 464.8431 468.9403
##  (between_SS / total_SS =  32.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

FINDING: R has determined 3 centers. The group sizes are 102, 129 and 69. The final leaders are marked with 1, 2 and 3. 30.7% tells us that the groups are successfully separated from each other by that much.

fviz_cluster(GuideMethod, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

Heart$ClassificationVod <- GuideMethod$cluster
head(Heart[c("sex", "ClassificationWard", "ClassificationVod")])
##     sex ClassificationWard ClassificationVod
## 421   0                  1                 1
## 600   0                  1                 1
## 369   1                  1                 1
## 79    1                  1                 1
## 310   0                  1                 2
## 140   1                  1                 1

The separation and groupings are nicely visible in the above display.

table(Heart$ClassificationWard)
## 
##   1   2   3 
## 132  85  83
table(Heart$ClassificationVod)
## 
##   1   2   3 
## 136  64 100
table(Heart$ClassificationWard, Heart$ClassificationVod)
##    
##       1   2   3
##   1 128   4   0
##   2   5  47  33
##   3   3  13  67

Below are the average values of the ranking variables by group, which also show the positions of the final leaders.

Average <- GuideMethod$centers
print(Average)
##         age_z trestbps_z      chol_z  thalach_z  oldpeak_z     thal_z
## 1 -0.47111793 -0.3680202 -0.32566649  0.5311767 -0.5261065 -0.3765358
## 2  0.91149827  1.1750117  0.71838425 -0.2711400  0.3874760 -0.4437317
## 3  0.05736149 -0.2515000 -0.01685949 -0.5488707  0.4675203  0.7960770
##     target_z
## 1  0.8149136
## 2 -0.1664075
## 3 -1.0017816

The averages are shown in the graph below,

library(ggplot2)
library(tidyr)

Picture <- as.data.frame(Average)
Picture$id <- 1:nrow(Picture)
Picture <- pivot_longer(Picture, cols = c(age_z, trestbps_z, chol_z, thalach_z, oldpeak_z,thal_z, target_z))

Picture$Group <- factor(Picture$id, 
                        levels = c(1, 2, 3), 
                        labels = c("1", "2", "3"))

Picture$nameFactor <- factor(Picture$name, 
                           levels = c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z"), 
                           labels = c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z"))

ggplot(Picture, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size=3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Average") +
  xlab("Classification variables") +
  ylim(-1.5, 1.5)

FINDING: 4 out of 7 classification variables in group 3 are above average. Only 2 classification variables in group 2 are above average and 3 out of the 7 classification variables in group 1 are above average.

fit <- aov(cbind(age_z, trestbps_z, chol_z, thalach_z, oldpeak_z,thal_z, target_z) ~ as.factor(ClassificationVod), 
              data = Heart)

summary(fit)
##  Response 1 :
##                               Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationVod)   2  83.688  41.844  57.719 < 2.2e-16 ***
## Residuals                    297 215.312   0.725                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                               Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationVod)   2 113.11  56.553  90.355 < 2.2e-16 ***
## Residuals                    297 185.89   0.626                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                               Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationVod)   2  47.481 23.7406  28.034 7.039e-12 ***
## Residuals                    297 251.519  0.8469                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                               Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationVod)   2  73.203  36.602  48.144 < 2.2e-16 ***
## Residuals                    297 225.797   0.760                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                               Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationVod)   2  69.11  34.555  44.642 < 2.2e-16 ***
## Residuals                    297 229.89   0.774                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 6 :
##                               Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationVod)   2  95.257  47.629  69.429 < 2.2e-16 ***
## Residuals                    297 203.743   0.686                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 7 :
##                               Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClassificationVod)   2 192.44  96.222   268.2 < 2.2e-16 ***
## Residuals                    297 106.56   0.359                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I utilized the analysis of variance (ANOVA) to determine if all variables effectively divided units into groups. The p-value informed me that I could reject the assumption of equal means for these variables; indeed, they differed significantly from one another - their distinctions growing in proportion to their statistical significance in ANOVA.

options(scipen = 999)
he_quadrat <- chisq.test(Heart$sex, as.factor(Heart$ClassificationVod))
print(he_quadrat)
## 
##  Pearson's Chi-squared test
## 
## data:  Heart$sex and as.factor(Heart$ClassificationVod)
## X-squared = 39.58, df = 2, p-value = 0.000000002543
addmargins(he_quadrat$observed)
##          
## Heart$sex   1   2   3 Sum
##       0    52  35  10  97
##       1    84  29  90 203
##       Sum 136  64 100 300
addmargins(round(he_quadrat$expected, 2))
##          
## Heart$sex      1     2      3    Sum
##       0    43.97 20.69  32.33  96.99
##       1    92.03 43.31  67.67 203.01
##       Sum 136.00 64.00 100.00 300.00
round(he_quadrat$res, 2)
##          
## Heart$sex     1     2     3
##         0  1.21  3.15 -3.93
##         1 -0.84 -2.17  2.71

Conclusions:

The study involved the performance of the classification of 300 individuals on 7 standardized variables. In the hierarchical classification, the study used Ward’s clustering algorithm and, based on the analysis of the dendrogram, decided to classify them into three groups (k = 3). The study further optimized the classification using the leader method.

Answer to research question: individuals can be identified into distinct groups based on their cardiovascular health-related attributes