Heart disease, a term encompassing ailments that affect the heart and circulatory system, emerges as both a worldwide health concern and chief instigator of disability. Because the heart plays an essential role in bodily functions; any affliction to it can trigger extensive consequences on other organs and physiological processes. A myriad of forms exist within this category - from conditions inducing coronary artery narrowing to valve malfunctions, heart enlargements - frequently culminating in heart failure or even fatal cardiac episodes.
Tailored specifically for heart disease, the data set provided offers a valuable resource: the potential to extract insights that illuminate each feature’s significance and their interrelationships. Our primary objective in this analysis is to determine–with precision–the probability of an individual’s susceptibility to severe heart problems.
Research Question: Can we identify distinct groups of individuals based on their cardiovascular health-related attributes?
Utilizing clustering analysis as a suitable method, we can actively explore patterns and groupings within the dataset. The mix of categorical and numeric variables associated with cardiovascular health presents an opportunity to identify individuals’ similarities and differences based on these attributes; this is where clustering proves particularly helpful. Through our application of clustering analysis, we unearth natural groupings or clusters in the dataset - a process that enables us to pinpoint subgroups of individuals exhibiting akin characteristics. The dataset, commonly found in clustering applications, comprises a mix of numeric and categorical variables; clustering algorithms can accommodate both types of data. This provides insights into the diverse phenotype of cardiovascular health: it handles various distributions within these variable types–an invaluable tool for comprehensive analysis. As an unsupervised learning technique, clustering necessitates no labeled data; it’s a method well-suited for our exploration of patterns and groupings without predefined categories. We interpret the results of clustering to understand each identified group’s characteristics: this is how we derive meaningful insights from raw information. Developing targeted interventions or understanding risk factors associated with cardiovascular health can benefit from this.
#Load the required libraries
library(tidyverse)
library(dplyr)
library(cluster)
#Import the dataset
#set working directory
setwd("C:/Users/Baha/Downloads")
Heart <-read.csv("heart.csv")
head(Heart)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## target
## 1 0.23
## 2 0.37
## 3 0.24
## 4 0.28
## 5 0.21
## 6 0.78
Description of variables: Age: Numeric (e.g., 52) Sex: Categorical (0: Female, 1: Male) Chest Pain Type: Categorical (0: Typical Angina, 1: Atypical Angina, 2: Non-anginal Pain, 3: Asymptomatic) Resting Blood Pressure: Numeric (e.g., 125) Serum Cholesterol: Numeric in mg/dL (e.g., 212) Fasting Blood Sugar: Categorical (0: <= 120 mg/dL, 1: > 120 mg/dL) Resting Electrocardiographic Results: Categorical (0: Normal, 1: Abnormality, 2: Hypertrophy) Maximum Heart Rate Achieved: Numeric (e.g., 168) Exercise-Induced Angina: Categorical (0: No, 1: Yes) Oldpeak (ST Depression): Numeric (e.g., 1.0) Slope of Peak Exercise ST Segment: Categorical (0: Upsloping, 1: Flat, 2: Downsloping) Number of Major Vessels Colored by Fluoroscopy: Numeric (0 to 3) Thalassemia: Categorical (0: Normal, 1: Fixed Defect, 2: Reversible Defect)
Data source: The dataset was obtained from kaggle website https://www.kaggle.com/datasets/juledz/heart-attack-prediction
#factor the categorical variables
attach(Heart)
Heart$sex <-as.factor(Heart$sex)
Heart$cp <-as.factor(Heart$cp)
Heart$fbs <-as.factor(Heart$fbs)
Heart$restecg <-as.factor(Heart$restecg)
Heart$exang <-as.factor(Heart$exang)
Heart$slope <-as.factor(Heart$slope)
Heart$ca <-as.factor(Heart$ca)
#convert the loaded dataset into a data frame
head(as.data.frame(Heart))
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## target
## 1 0.23
## 2 0.37
## 3 0.24
## 4 0.28
## 5 0.21
## 6 0.78
#Sample 300 observations from the loaded dataset
sample_size <-300
# Randomly sample 300 observations from the data frame
Heart <- Heart[sample(nrow(Heart), size = sample_size, replace = FALSE), ]
dim(Heart)
## [1] 300 14
dplyr::glimpse(Heart)
## Rows: 300
## Columns: 14
## $ age <int> 57, 63, 58, 52, 56, 41, 38, 45, 44, 64, 67, 64, 46, 58, 52, 6…
## $ sex <fct> 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1…
## $ cp <fct> 0, 1, 2, 1, 1, 1, 2, 1, 2, 0, 2, 0, 0, 0, 0, 2, 3, 2, 2, 2, 0…
## $ trestbps <int> 128, 140, 105, 134, 140, 110, 138, 130, 130, 145, 152, 145, 1…
## $ chol <int> 303, 195, 240, 201, 294, 235, 175, 234, 233, 212, 212, 212, 2…
## $ fbs <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ restecg <fct> 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0…
## $ thalach <int> 159, 179, 154, 158, 153, 153, 173, 175, 179, 132, 150, 132, 1…
## $ exang <fct> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1…
## $ oldpeak <dbl> 0.0, 0.0, 0.6, 0.8, 1.3, 0.0, 0.0, 0.6, 0.4, 2.0, 0.8, 2.0, 0…
## $ slope <fct> 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 0, 2, 2, 2, 2, 2, 1, 1…
## $ ca <fct> 1, 2, 0, 1, 0, 0, 4, 0, 0, 2, 0, 2, 0, 3, 1, 0, 1, 1, 3, 0, 2…
## $ thal <int> 2, 2, 3, 2, 2, 2, 2, 2, 2, 1, 3, 1, 3, 1, 2, 2, 2, 2, 2, 3, 3…
## $ target <dbl> 0.90, 0.75, 0.83, 0.78, 0.73, 0.72, 0.85, 0.78, 0.72, 0.37, 0…
#Check for missing values in the sampled dataset
colSums(is.na(Heart)) # there are no missing values in the dataset
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
There are no missing values in the loaded dataset.
#Get the summary statistics of the dataset
summary(Heart)
## age sex cp trestbps chol fbs
## Min. :29.00 0: 97 0:149 Min. : 94.0 Min. :149.0 0:257
## 1st Qu.:47.00 1:203 1: 46 1st Qu.:120.0 1st Qu.:212.0 1: 43
## Median :55.00 2: 85 Median :130.0 Median :239.5
## Mean :54.44 3: 20 Mean :131.4 Mean :249.1
## 3rd Qu.:61.25 3rd Qu.:140.0 3rd Qu.:275.5
## Max. :77.00 Max. :200.0 Max. :564.0
## restecg thalach exang oldpeak slope ca thal
## 0:143 Min. : 71.0 0:194 Min. :0.000 0: 25 0:166 Min. :0.00
## 1:153 1st Qu.:138.0 1:106 1st Qu.:0.000 1:127 1: 74 1st Qu.:2.00
## 2: 4 Median :154.0 Median :0.800 2:148 2: 37 Median :2.00
## Mean :150.1 Mean :1.083 3: 17 Mean :2.32
## 3rd Qu.:165.0 3rd Qu.:1.800 4: 6 3rd Qu.:3.00
## Max. :202.0 Max. :6.200 Max. :3.00
## target
## Min. :0.1000
## 1st Qu.:0.2575
## Median :0.7200
## Mean :0.5516
## 3rd Qu.:0.8200
## Max. :0.9000
Descriptive statistics: The average age of individuals in this study is 54 years, the median is 55 years and the minimum and maximum ages are 34 years and 77 years respectively. The number of individuals with normal Resting Electrocardiographic is 149, those with abnormal Resting Electrocardiographic is 147 and those with Hypertrophic resting Electrocardiographic is 4. The number of males under study is 215 compared to the number of females which is only 85. The number of individuals with typical angina type of chest pain is 152. Atypical angina is 44, non-anginal pain is 79 and individiuals with asymptomatic type of chest pain is 25.
Standardize the numeric variables
Heart$age_z <-scale(Heart$age)
Heart$trestbps_z <-scale(Heart$trestbps)
Heart$chol_z <-scale(Heart$chol)
Heart$thalach_z <-scale(Heart$thalach)
Heart$oldpeak_z <-scale(Heart$oldpeak)
Heart$thal_z <-scale(Heart$thal)
Heart$target_z <-scale(Heart$target)
#Load the required package
library(Hmisc)
rcorr(as.matrix(Heart[, c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")]),
type = "pearson")
## age_z trestbps_z chol_z thalach_z oldpeak_z thal_z target_z
## age_z 1.00 0.32 0.23 -0.31 0.16 0.05 -0.15
## trestbps_z 0.32 1.00 0.11 -0.05 0.27 -0.03 -0.08
## chol_z 0.23 0.11 1.00 -0.03 0.10 0.09 -0.11
## thalach_z -0.31 -0.05 -0.03 1.00 -0.35 -0.10 0.38
## oldpeak_z 0.16 0.27 0.10 -0.35 1.00 0.21 -0.44
## thal_z 0.05 -0.03 0.09 -0.10 0.21 1.00 -0.33
## target_z -0.15 -0.08 -0.11 0.38 -0.44 -0.33 1.00
##
## n= 300
##
##
## P
## age_z trestbps_z chol_z thalach_z oldpeak_z thal_z target_z
## age_z 0.0000 0.0000 0.0000 0.0043 0.4193 0.0091
## trestbps_z 0.0000 0.0491 0.3717 0.0000 0.6558 0.1641
## chol_z 0.0000 0.0491 0.5846 0.0740 0.1208 0.0646
## thalach_z 0.0000 0.3717 0.5846 0.0000 0.0786 0.0000
## oldpeak_z 0.0043 0.0000 0.0740 0.0000 0.0002 0.0000
## thal_z 0.4193 0.6558 0.1208 0.0786 0.0002 0.0000
## target_z 0.0091 0.1641 0.0646 0.0000 0.0000 0.0000
Upon examining the correlation matrix, I ascertained that a close proximity to zero is desirable for its correlations: indeed, it’s optimal when they hover around this value. However – should the values deviate significantly from zero - introducing two variables measuring the same dimension can be advantageous; such an approach mitigates potential distortions in our data analysis.
To verify potential deviations within the units, I aim to identify outliers. In this pursuit, my strategy involves calculating each unit’s Euclidean distance from the mean across all classification variables.
Heart$Difference_z <- sqrt(Heart$age_z^2 + Heart$trestbps_z^2 + Heart$chol_z^2 + Heart$thalach_z^2 + Heart$oldpeak_z^2 + Heart$thal_z^2 + Heart$target_z^2)
head(Heart[order(-Heart$Difference_z), ], 10)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 193 67 0 2 115 564 0 0 160 0 1.6 1 0 3
## 70 62 0 0 160 164 0 0 145 0 6.2 0 3 3
## 295 56 0 0 200 288 1 0 133 1 4.0 0 2 3
## 686 63 0 0 150 407 0 0 154 0 4.0 1 3 3
## 614 55 1 0 140 217 0 1 111 1 5.6 0 0 3
## 890 63 0 0 150 407 0 0 154 0 4.0 1 3 3
## 247 54 1 1 192 283 0 0 195 0 0.0 2 1 3
## 357 59 1 0 164 176 1 0 90 0 1.0 1 2 1
## 360 53 0 2 128 216 0 0 115 0 0.0 2 0 0
## 330 53 0 2 128 216 0 0 115 0 0.0 2 0 0
## target age_z trestbps_z chol_z thalach_z oldpeak_z thal_z
## 193 0.89 1.40386657 -0.9133619 6.0339727 0.4530751 0.43474623 1.104760
## 70 0.32 0.84515077 1.5979182 -1.6294313 -0.2366536 4.30040216 1.104760
## 295 0.23 0.17469181 3.8301672 0.7462240 -0.7884365 2.45161019 1.104760
## 686 0.13 0.95689393 1.0398560 3.0260866 0.1771836 2.45161019 1.104760
## 614 0.33 0.06294865 0.4817937 -0.6140302 -1.8000385 3.79618617 1.104760
## 890 0.19 0.95689393 1.0398560 3.0260866 0.1771836 2.45161019 1.104760
## 247 0.20 -0.04879451 3.3837174 0.6504314 2.0624420 -0.90982974 1.104760
## 357 0.19 0.50992129 1.8211431 -1.3995291 -2.7656587 -0.06946976 -2.144534
## 360 0.87 -0.16053767 -0.1878810 -0.6331887 -1.6161109 -0.90982974 -3.769181
## 330 0.80 -0.16053767 -0.1878810 -0.6331887 -1.6161109 -0.90982974 -3.769181
## target_z Difference_z
## 193 1.1730204 6.496511
## 70 -0.8028118 5.131972
## 295 -1.1147854 4.934857
## 686 -1.4614226 4.533452
## 614 -0.7681481 4.480495
## 890 -1.2534403 4.470743
## 247 -1.2187765 4.434225
## 357 -1.2534403 4.399897
## 360 1.1036930 4.396188
## 330 0.8610469 4.341627
#Loading the required package
library(factoextra)
Distances <- get_dist(Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")],
method = "euclidean")
Distances1 <- Distances^2
fviz_dist(Distances1)
get_clust_tendency(Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")],
n = nrow(Heart) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8033171
##
## $plot
## NULL
I calculated the Hopkins statistics. Since it is greater than 0.5 (0.7963), it is OK and I can continue displaying the Dendrogram.
library(dplyr)
Ward <- Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")] %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
print(Ward)
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 300
library(factoextra)
fviz_dend(Ward)
CONCLUSION: I can see from the dendrogram that, based on the distances, it makes the most sense to classify into 3 groups.
set.seed(123)
library(dplyr)
library(NbClust)
TotalGroups <- Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")] %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 8,
method = "ward.D2",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 3 proposed 5 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
the number of clusters equals 3, determines the number of groups, display hierarchically.
Heart$ClassificationWard<- cutree(Ward,
k = 3)
head(Heart)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 421 57 0 0 128 303 0 0 159 0 0.0 2 1 2
## 600 63 0 1 140 195 0 1 179 0 0.0 2 2 2
## 369 58 1 2 105 240 0 0 154 1 0.6 1 0 3
## 79 52 1 1 134 201 0 1 158 0 0.8 2 1 2
## 310 56 0 1 140 294 0 0 153 0 1.3 1 0 2
## 140 41 1 1 110 235 0 1 153 0 0.0 2 0 2
## target age_z trestbps_z chol_z thalach_z oldpeak_z thal_z
## 421 0.90 0.2864350 -0.1878810 1.0336016 0.4070932 -0.9098297 -0.519887
## 600 0.75 0.9568939 0.4817937 -1.0355175 1.3267314 -0.9098297 -0.519887
## 369 0.83 0.3981781 -1.4714241 -0.1733845 0.1771836 -0.4056138 1.104760
## 79 0.78 -0.2722808 0.1469564 -0.9205664 0.3611113 -0.2375418 -0.519887
## 310 0.73 0.1746918 0.4817937 0.8611750 0.1312017 0.1826382 -0.519887
## 140 0.72 -1.5014556 -1.1923930 -0.2691771 0.1312017 -0.9098297 -0.519887
## target_z Difference_z ClassificationWard
## 421 1.2076841 1.976860 1
## 600 0.6877283 2.356125 1
## 369 0.9650381 2.168273 1
## 79 0.7917195 1.423764 1
## 310 0.6184008 1.306728 1
## 140 0.5837371 2.281373 1
Start_leaders <- aggregate(Heart[, c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")],
by = list(Heart$ClassificationWard),
FUN = mean)
print(Start_leaders)
## Group.1 age_z trestbps_z chol_z thalach_z oldpeak_z thal_z
## 1 1 -0.4525937 -0.3109083 -0.2790466 0.55862448 -0.5049290 -0.2860363
## 2 2 0.5717087 0.6111329 0.6170731 -0.89879309 0.5583286 -0.1758441
## 3 3 0.1343027 -0.1314024 -0.1881573 0.03203591 0.2312374 0.6349825
## target_z
## 1 0.8736519
## 2 -0.3105870
## 3 -1.0713513
Perform now the sorting according to the method of leaders.
library(factoextra)
GuideMethod <- hkmeans(Heart[c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z")],
k = 3,
hc.metric = "euclidean",
hc.method = "ward.D2")
print(GuideMethod)
## Hierarchical K-means clustering with 3 clusters of sizes 136, 64, 100
##
## Cluster means:
## age_z trestbps_z chol_z thalach_z oldpeak_z thal_z
## 1 -0.47111793 -0.3680202 -0.32566649 0.5311767 -0.5261065 -0.3765358
## 2 0.91149827 1.1750117 0.71838425 -0.2711400 0.3874760 -0.4437317
## 3 0.05736149 -0.2515000 -0.01685949 -0.5488707 0.4675203 0.7960770
## target_z
## 1 0.8149136
## 2 -0.1664075
## 3 -1.0017816
##
## Clustering vector:
## 421 600 369 79 310 140 84 504 808 1001 242 609 44 151 804 318
## 1 1 1 1 2 1 1 1 1 2 3 2 3 2 3 1
## 715 933 364 158 541 908 717 168 1011 757 897 112 990 823 68 974
## 1 2 1 1 3 1 3 1 3 3 2 3 2 3 2 1
## 135 222 718 172 146 879 676 347 265 391 508 942 408 776 701 420
## 1 3 2 3 3 3 3 3 2 2 1 1 3 1 1 2
## 427 439 998 131 78 124 976 719 921 901 970 659 194 467 791 270
## 1 1 3 2 3 2 3 1 3 3 1 3 2 1 3 1
## 377 751 567 730 951 379 306 236 86 850 199 867 1004 359 418 19
## 1 1 1 1 2 3 3 2 1 3 1 1 1 3 1 1
## 330 419 136 694 296 269 460 345 999 184 711 539 433 271 666 473
## 1 1 2 1 3 3 1 1 2 1 1 3 1 1 2 1
## 315 410 649 87 16 809 332 454 556 40 625 618 826 303 594 773
## 1 3 1 1 1 1 3 1 3 3 2 1 1 1 3 2
## 468 729 1012 996 950 230 501 1020 357 813 680 129 373 877 611 675
## 3 3 1 1 1 2 1 1 2 2 2 1 1 3 3 3
## 772 284 444 575 940 363 375 936 1019 191 633 615 193 693 660 785
## 1 1 1 3 1 1 1 1 3 1 1 3 2 3 1 1
## 644 206 128 721 276 872 763 317 395 949 758 895 851 947 165 958
## 1 1 1 2 3 1 1 2 3 3 2 3 3 2 3 3
## 760 665 295 255 961 104 183 205 960 864 733 401 2 360 663 520
## 2 2 2 3 1 1 3 2 2 3 1 3 3 1 1 3
## 12 272 610 392 92 97 938 578 722 91 353 712 423 927 132 946
## 3 1 2 3 1 1 2 2 1 1 3 3 1 3 1 1
## 549 481 64 533 731 843 141 125 393 979 911 266 890 382 417 386
## 1 3 2 3 1 3 3 3 1 1 3 1 2 3 1 1
## 399 366 80 349 855 142 282 195 703 636 499 622 780 228 319 3
## 2 1 1 3 2 3 1 2 2 1 1 3 1 1 3 3
## 188 688 750 247 14 370 736 179 383 381 783 342 39 207 920 613
## 3 3 1 2 3 1 1 1 3 3 2 1 3 2 3 2
## 643 868 538 253 829 65 935 204 678 214 530 894 943 513 477 981
## 3 1 1 3 1 1 1 2 3 1 2 1 2 3 2 1
## 792 580 251 72 654 396 614 743 686 458 545 725 766 70 916 691
## 3 3 3 3 3 2 3 3 2 1 1 2 3 3 3 1
## 453 24 161 34 496 912 106 49 510 954 31 331 111 261 682 333
## 2 1 2 2 1 2 3 2 2 1 1 1 1 1 2 1
## 992 464 201 641 243 500 818 138 830 860 301 175
## 3 1 1 1 1 1 1 2 3 3 1 3
##
## Within cluster sum of squares by cluster:
## [1] 484.9268 464.8431 468.9403
## (between_SS / total_SS = 32.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
FINDING: R has determined 3 centers. The group sizes are 102, 129 and 69. The final leaders are marked with 1, 2 and 3. 30.7% tells us that the groups are successfully separated from each other by that much.
fviz_cluster(GuideMethod,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
Heart$ClassificationVod <- GuideMethod$cluster
head(Heart[c("sex", "ClassificationWard", "ClassificationVod")])
## sex ClassificationWard ClassificationVod
## 421 0 1 1
## 600 0 1 1
## 369 1 1 1
## 79 1 1 1
## 310 0 1 2
## 140 1 1 1
The separation and groupings are nicely visible in the above display.
table(Heart$ClassificationWard)
##
## 1 2 3
## 132 85 83
table(Heart$ClassificationVod)
##
## 1 2 3
## 136 64 100
table(Heart$ClassificationWard, Heart$ClassificationVod)
##
## 1 2 3
## 1 128 4 0
## 2 5 47 33
## 3 3 13 67
Below are the average values of the ranking variables by group, which also show the positions of the final leaders.
Average <- GuideMethod$centers
print(Average)
## age_z trestbps_z chol_z thalach_z oldpeak_z thal_z
## 1 -0.47111793 -0.3680202 -0.32566649 0.5311767 -0.5261065 -0.3765358
## 2 0.91149827 1.1750117 0.71838425 -0.2711400 0.3874760 -0.4437317
## 3 0.05736149 -0.2515000 -0.01685949 -0.5488707 0.4675203 0.7960770
## target_z
## 1 0.8149136
## 2 -0.1664075
## 3 -1.0017816
The averages are shown in the graph below,
library(ggplot2)
library(tidyr)
Picture <- as.data.frame(Average)
Picture$id <- 1:nrow(Picture)
Picture <- pivot_longer(Picture, cols = c(age_z, trestbps_z, chol_z, thalach_z, oldpeak_z,thal_z, target_z))
Picture$Group <- factor(Picture$id,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Picture$nameFactor <- factor(Picture$name,
levels = c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z"),
labels = c("age_z", "trestbps_z", "chol_z", "thalach_z", "oldpeak_z","thal_z", "target_z"))
ggplot(Picture, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size=3) +
geom_line(aes(group = id), linewidth = 1) +
ylab("Average") +
xlab("Classification variables") +
ylim(-1.5, 1.5)
FINDING: 4 out of 7 classification variables in group 3 are above average. Only 2 classification variables in group 2 are above average and 3 out of the 7 classification variables in group 1 are above average.
fit <- aov(cbind(age_z, trestbps_z, chol_z, thalach_z, oldpeak_z,thal_z, target_z) ~ as.factor(ClassificationVod),
data = Heart)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationVod) 2 83.688 41.844 57.719 < 2.2e-16 ***
## Residuals 297 215.312 0.725
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationVod) 2 113.11 56.553 90.355 < 2.2e-16 ***
## Residuals 297 185.89 0.626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationVod) 2 47.481 23.7406 28.034 7.039e-12 ***
## Residuals 297 251.519 0.8469
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationVod) 2 73.203 36.602 48.144 < 2.2e-16 ***
## Residuals 297 225.797 0.760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 5 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationVod) 2 69.11 34.555 44.642 < 2.2e-16 ***
## Residuals 297 229.89 0.774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 6 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationVod) 2 95.257 47.629 69.429 < 2.2e-16 ***
## Residuals 297 203.743 0.686
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 7 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClassificationVod) 2 192.44 96.222 268.2 < 2.2e-16 ***
## Residuals 297 106.56 0.359
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I utilized the analysis of variance (ANOVA) to determine if all variables effectively divided units into groups. The p-value informed me that I could reject the assumption of equal means for these variables; indeed, they differed significantly from one another - their distinctions growing in proportion to their statistical significance in ANOVA.
options(scipen = 999)
he_quadrat <- chisq.test(Heart$sex, as.factor(Heart$ClassificationVod))
print(he_quadrat)
##
## Pearson's Chi-squared test
##
## data: Heart$sex and as.factor(Heart$ClassificationVod)
## X-squared = 39.58, df = 2, p-value = 0.000000002543
addmargins(he_quadrat$observed)
##
## Heart$sex 1 2 3 Sum
## 0 52 35 10 97
## 1 84 29 90 203
## Sum 136 64 100 300
addmargins(round(he_quadrat$expected, 2))
##
## Heart$sex 1 2 3 Sum
## 0 43.97 20.69 32.33 96.99
## 1 92.03 43.31 67.67 203.01
## Sum 136.00 64.00 100.00 300.00
round(he_quadrat$res, 2)
##
## Heart$sex 1 2 3
## 0 1.21 3.15 -3.93
## 1 -0.84 -2.17 2.71
Conclusions:
The study involved the performance of the classification of 300 individuals on 7 standardized variables. In the hierarchical classification, the study used Ward’s clustering algorithm and, based on the analysis of the dendrogram, decided to classify them into three groups (k = 3). The study further optimized the classification using the leader method.
Answer to research question: individuals can be identified into distinct groups based on their cardiovascular health-related attributes