https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/
The goal of this analysis is to reduce the dimensionality of the data, reduce the noise in the variables, and predict whether a new patient will develop heart disease based on their clinical measurements and the information in this dataset. To do this, we perform a principal component analysis (PCA), a linear discriminant analysis (LDA), and a multiple correspondence analysis (MCA).
table_dataset
| Characteristics of the study participants | N = 918¹ |
|---|---|
| Age | 54 (9) |
| Sex | |
| F | 193 / 918 (21%) |
| M | 725 / 918 (79%) |
| ChestPainType | |
| ASY | 496 / 918 (54%) |
| ATA | 173 / 918 (19%) |
| NAP | 203 / 918 (22%) |
| TA | 46 / 918 (5.0%) |
| RestingBP | 132 (19) |
| Cholesterol | 199 (109) |
| FastingBS | |
| High | 214 / 918 (23%) |
| Normal | 704 / 918 (77%) |
| RestingECG | |
| LVH | 188 / 918 (20%) |
| Normal | 552 / 918 (60%) |
| ST | 178 / 918 (19%) |
| MaxHR | 137 (25) |
| ExerciseAngina | |
| N | 547 / 918 (60%) |
| Y | 371 / 918 (40%) |
| Oldpeak | 0.89 (1.07) |
| ST_Slope | |
| Down | 63 / 918 (6.9%) |
| Flat | 460 / 918 (50%) |
| Up | 395 / 918 (43%) |
| HeartDisease | |
| N | 410 / 918 (45%) |
| Y | 508 / 918 (55%) |
| ¹ Mean (SD); n / N (%) | |
As Table 1 shows, categorical variables are described by the count n over the total sample size N together with the percentage, whereas continuous variables are described by the arithmetic mean and the standard deviation.
The sample comprises 918 participants with a mean age of 54 years (SD 9 years). The large majority are male (79%). Of all participants, 54% have the asymptomatic (ASY) chest pain type. The mean resting blood pressure is 132 mm Hg (SD 19 mm Hg) and the mean cholesterol level is 199 mg/dl (SD 109 mg/dl). Fasting blood sugar is normal for 77% of the patients and the resting ECG is normal for 60%. The mean maximum heart rate is 137 beats per minute (SD 25). Only 40% of the participants experience angina during exercise, and half of all participants have a flat ST slope. Finally, 55% have heart disease and the remaining 45% do not.
First, the numerical variables must be standardized before performing the PCA, which the scale.unit argument of PCA() takes care of.
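library(FactoMineR)  # provides PCA() and MCA()
# Keep only the five numerical variables
heart.pca <- heart[, c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")]
pca <- PCA(heart.pca, scale.unit = TRUE, ncp = 5, graph = FALSE)
# Scree plot of the eigenvalues with the eigenvalue = 1 threshold
barplot(pca$eig[, "eigenvalue"], xlab = "Components", ylab = "Eigenvalues")
abline(h = 1, lty = "dashed")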
Kaiser's rule stipulates that only components with an eigenvalue greater
than 1 should be retained. Based on the barplot, we keep the first
two components.
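The eigenvalue-greater-than-1 rule can be illustrated on any standardized dataset. The sketch below is a minimal example, assuming base R's prcomp() and the built-in mtcars data as a stand-in for the heart data (neither is used in this analysis). With standardized variables the eigenvalues sum to the number of variables, so 1 is the average eigenvalue, which is why it serves as the cut-off.

```r
# Stand-in data: five numeric columns of the built-in mtcars dataset
# (an assumption for illustration; the report itself uses the heart data).
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]

# PCA on standardized variables, i.e. on the correlation matrix
fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- fit$sdev^2

# Components whose eigenvalue exceeds the average eigenvalue of 1
kept <- which(eig > 1)

print(round(eig, 3))
print(kept)
```

The same check applied to pca$eig[, "eigenvalue"] from FactoMineR should give the components retained in the barplot.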
# Refit the PCA, keeping only the first two components
pca <- PCA(heart.pca, scale.unit = TRUE, ncp = 2, graph = FALSE)
lapply(pca$var, round, 3)
## $coord
## Dim.1 Dim.2
## Age 0.787 0.010
## RestingBP 0.489 0.518
## Cholesterol -0.233 0.812
## MaxHR -0.704 0.376
## Oldpeak 0.546 0.352
##
## $cor
## Dim.1 Dim.2
## Age 0.787 0.010
## RestingBP 0.489 0.518
## Cholesterol -0.233 0.812
## MaxHR -0.704 0.376
## Oldpeak 0.546 0.352
##
## $cos2
## Dim.1 Dim.2
## Age 0.620 0.000
## RestingBP 0.239 0.268
## Cholesterol 0.054 0.660
## MaxHR 0.496 0.141
## Oldpeak 0.298 0.124
##
## $contrib
## Dim.1 Dim.2
## Age 36.319 0.009
## RestingBP 14.006 22.483
## Cholesterol 3.168 55.273
## MaxHR 29.073 11.829
## Oldpeak 17.434 10.406
These results show that age is strongly positively correlated with the first component (0.787), whereas maximum heart rate is strongly negatively correlated with it (-0.704). Cholesterol is strongly positively correlated with the second component (0.812), and resting blood pressure moderately so (0.518); resting blood pressure is in fact split almost evenly between the two components (0.489 on the first).
The results also show that age and MaxHR contribute 36.3% and 29.1% respectively to the first dimension, while resting blood pressure and cholesterol contribute 22.5% and 55.3% to the second.
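The link between the coordinates, the cos² values and the contributions can be checked by hand. The sketch below is a minimal illustration that assumes only the rounded coordinates printed in the $coord output: for a standardized PCA the cos² of a variable is its squared coordinate, and its contribution to a dimension is its share of that dimension's total cos², in percent. Because the inputs are rounded to three decimals, the results match the $contrib output only approximately.

```r
# Variable coordinates copied from the $coord output (rounded to 3 decimals)
coord <- matrix(
  c( 0.787, 0.010,
     0.489, 0.518,
    -0.233, 0.812,
    -0.704, 0.376,
     0.546, 0.352),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"),
                  c("Dim.1", "Dim.2"))
)

# Quality of representation: squared coordinates
cos2 <- coord^2

# Contribution (%): share of each variable in the column total of cos2
contrib <- 100 * sweep(cos2, 2, colSums(cos2), "/")

print(round(contrib, 1))
```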
round(sort(rowSums(pca$var$cos2)), digits = 3)
## Oldpeak RestingBP Age MaxHR Cholesterol
## 0.422 0.508 0.620 0.638 0.714
The sums of cos² over the two retained components are listed in ascending order. Oldpeak is the least well represented variable on the first two principal components, since its cos² is below 0.5 (0.422), and RestingBP is only marginally above that level (0.508). The correlation circle confirms this.
plot.PCA(pca, choix = "var", axes=c(1,2))
As noted above, Oldpeak and RestingBP have the shortest arrows in the correlation circle, which means that they are not adequately represented on the first two components. Oldpeak appears independent of Cholesterol, and RestingBP appears independent of MaxHR, but because these variables are poorly represented, it would be risky to draw such conclusions.
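The length of each arrow in the correlation circle is the square root of the variable's summed cos² on the two plotted axes, so the ordering of the arrows can be recomputed without the plot. The following sketch assumes only the rounded coordinates from the $coord output; the object names are illustrative.

```r
# Variable coordinates copied from the $coord output of the two-component PCA
coord <- rbind(
  Age         = c( 0.787, 0.010),
  RestingBP   = c( 0.489, 0.518),
  Cholesterol = c(-0.233, 0.812),
  MaxHR       = c(-0.704, 0.376),
  Oldpeak     = c( 0.546, 0.352)
)

# Arrow length = Euclidean norm of the coordinates = sqrt(sum of cos2)
arrow_length <- sqrt(rowSums(coord^2))

# Shortest arrows first: these variables are the least well represented
print(round(sort(arrow_length), 3))
```

An arrow that reaches the unit circle (length 1) would mean the variable is perfectly represented on the plane.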
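library(factoextra)  # provides fviz_pca_ind()
# Plot the 100 individuals with the highest cos2, coloured by cos2
fviz_pca_ind(pca, select.ind = list(cos2 = 100),
             col.ind = "cos2", repel = TRUE,
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))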
The plot shows the 100 observations with the highest cos². Two
groups seem to emerge:
- A concentrated group of younger people on the left of the map, with
rather high MaxHR and cholesterol.
- A more scattered group of older people, with lower cholesterol and
MaxHR than the previous group.
These impressions come from comparing the correlation circle with
the map of individuals.
## Adding our main categorical variable (heart disease) to the PCA
heart.pca$HeartDisease <- heart$HeartDisease
# Column 6 (HeartDisease) is a supplementary qualitative variable:
# it is displayed on the map but does not influence the components
res <- PCA(heart.pca, ncp = 2, graph = FALSE,
           quali.sup = 6)
# Figure 4: Individual plot showing the two levels of the variable heart disease
plot(res, choix = "ind", cex = 0.4)
The description of this plot still needs to be written, but I am not sure
that this plot is relevant to the analysis.
## Linear discriminant analysis
library(MASS)
# The values of the first and only linear discriminant
heart.lda <- lda(HeartDisease ~ .,
data = heart.pca,
method = "mle")
heart.lda
## Call:
## lda(HeartDisease ~ ., data = heart.pca, method = "mle")
##
## Prior probabilities of groups:
## N Y
## 0.4466231 0.5533769
##
## Group means:
## Age RestingBP Cholesterol MaxHR Oldpeak
## N 50.55122 130.1805 227.1220 148.1512 0.4080488
## Y 55.89961 134.1850 175.9409 127.6555 1.2742126
##
## Coefficients of linear discriminants:
## LD1
## Age 0.014676526
## RestingBP 0.002479736
## Cholesterol -0.003578610
## MaxHR -0.023266312
## Oldpeak 0.705322727
heart.pred <- predict(heart.lda)
d <- cbind(heart.pca, heart.pred)
head(d)
## Age RestingBP Cholesterol MaxHR Oldpeak HeartDisease class posterior.N
## 1 40 140 289 172 0.0 N N 0.9116175
## 2 49 160 180 156 1.0 Y N 0.5228573
## 3 37 130 283 98 0.0 N N 0.5143677
## 4 48 138 214 108 1.5 Y Y 0.1615806
## 5 54 150 195 122 0.0 N Y 0.4954692
## 6 39 120 339 170 0.0 N N 0.9309189
## posterior.Y LD1
## 1 0.08838255 -1.9468646
## 2 0.47714275 -0.2975290
## 3 0.48563229 -0.2725128
## 4 0.83841935 0.9810119
## 5 0.50453085 -0.2168910
## 6 0.06908107 -2.1435337
z <- predict(heart.lda)$x
cor(heart.pca[,-6], z)
## LD1
## Age 0.5037050
## RestingBP 0.1921479
## Cholesterol -0.4156618
## MaxHR -0.7151291
## Oldpeak 0.7214334
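These coefficients represent the correlation of each variable with the first discriminant axis LD1. They indicate how each variable contributes to the separation between the ‘Yes’ and ‘No’ categories: Oldpeak (0.72) and MaxHR (-0.72) are the most strongly correlated with LD1, and RestingBP (0.19) the least. The following plots show the distribution of LD1 in each category: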
library(ggplot2)
# Distribution of the LD1 scores by observed heart disease status
ggplot(d, aes(x = HeartDisease, y = LD1)) + geom_boxplot()
ggplot(d, aes(x = LD1)) + geom_histogram(bins = 10) + facet_grid(HeartDisease ~ .)
In the first graph, the third quartile of the “No” category lies below zero on the first discriminant axis (LD1). Because Cholesterol and MaxHR are negatively correlated with LD1, the “No” category tends to have higher values of these variables than the “Yes” category, which agrees with the group means (Cholesterol 227 versus 176, MaxHR 148 versus 128). Conversely, Oldpeak and Age are positively correlated with LD1 and tend to be higher in the “Yes” category. These variables contribute to the separation between the categories along LD1.
In addition, the first quartile of the “Yes” category is just below zero on LD1, so a subset of “Yes” observations have LD1 values close to or below zero. The “Yes” category therefore overlaps with the “No” category along LD1, despite the general separation between the two groups.
The second graph confirms the link between a negative (positive) value on LD1 and membership of the ‘No’ (‘Yes’) category.
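To make the link between the coefficients and the LD1 scores concrete, a score can be recomputed by hand. The sketch below is a minimal illustration, assuming that the score is obtained by centring each variable on the prior-weighted average of the group means and taking the linear combination with the LD1 coefficients, which is how predict() in MASS proceeds to my understanding. All numbers are copied from the printed lda() output, so the result should agree with the LD1 value of the first row of head(d) only up to rounding.

```r
# Prior probabilities, group means and LD1 coefficients from the lda() output
prior <- c(N = 0.4466231, Y = 0.5533769)
group_means <- rbind(
  N = c(Age = 50.55122, RestingBP = 130.1805, Cholesterol = 227.1220,
        MaxHR = 148.1512, Oldpeak = 0.4080488),
  Y = c(Age = 55.89961, RestingBP = 134.1850, Cholesterol = 175.9409,
        MaxHR = 127.6555, Oldpeak = 1.2742126)
)
scaling <- c(Age = 0.014676526, RestingBP = 0.002479736,
             Cholesterol = -0.003578610, MaxHR = -0.023266312,
             Oldpeak = 0.705322727)

# Centre of the data: prior-weighted average of the two group means
centre <- colSums(prior * group_means)

# Measurements of the first patient shown in head(d)
patient <- c(Age = 40, RestingBP = 140, Cholesterol = 289,
             MaxHR = 172, Oldpeak = 0)

# LD1 score: linear combination of the centred measurements
ld1 <- sum((patient - centre) * scaling)
print(round(ld1, 4))
```

A negative score places this patient on the “No” side of the axis, in line with the boxplot.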
## The 2 by 2 table showing the observed versus the predicted classification
validation<-with(d, table(obs = HeartDisease, pred = class))
validation
## pred
## obs N Y
## N 295 115
## Y 111 397
#The proportion of well-classified patients:
proportion <- sum(diag(validation))/nrow(heart.pca)
proportion
## [1] 0.7538126
## Sensitivity: proportion of observed Y that are predicted Y
sens <- validation[2,2] / (validation[2,2] + validation[2,1])
sens
## [1] 0.7814961
## Specificity: proportion of observed N that are predicted N
spec <- validation[1,1] / (validation[1,1] + validation[1,2])
spec
## [1] 0.7195122
As the results above show, the sensitivity of the classification is 78% and the specificity is 72%. In other words, 78% of the patients who have heart disease are correctly classified as “Y”, and 72% of the patients who do not have heart disease are correctly classified as “N”. Overall, 75% of the patients are well classified. Because these figures are computed on the same observations that were used to fit the model, they are likely to be optimistic for new patients.
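The three rates follow directly from the four cells of the confusion table. The sketch below rebuilds the table from the printed counts and indexes it by name rather than by position, which avoids mixing up rows and columns. The positive and negative predictive values are an addition not computed in the analysis; they answer the complementary question of how trustworthy a given prediction is.

```r
# Confusion table copied from the output: rows = observed, columns = predicted
validation <- matrix(c(295, 115,
                       111, 397),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(obs = c("N", "Y"), pred = c("N", "Y")))

# Proportion of all patients that are correctly classified
accuracy <- sum(diag(validation)) / sum(validation)

# Among patients who have heart disease, share predicted "Y"
sensitivity <- validation["Y", "Y"] / sum(validation["Y", ])

# Among patients without heart disease, share predicted "N"
specificity <- validation["N", "N"] / sum(validation["N", ])

# Among patients predicted "Y" (resp. "N"), share that are truly "Y" (resp. "N")
ppv <- validation["Y", "Y"] / sum(validation[, "Y"])
npv <- validation["N", "N"] / sum(validation[, "N"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv), 3))
```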
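par(mfrow = c(1, 1))  # reset the plotting layout to a single panel
# Keep the categorical variables, including the heart disease status
heart.mca <- heart[, c("Sex", "ChestPainType", "FastingBS", "RestingECG",
                       "ExerciseAngina", "ST_Slope", "HeartDisease")]
mca <- MCA(heart.mca, ncp = 5, graph = FALSE)
round(mca$eig, digits = 3)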
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.385 24.528 24.528
## dim 2 0.165 10.497 35.025
## dim 3 0.156 9.898 44.923
## dim 4 0.142 9.060 53.983
## dim 5 0.141 8.952 62.935
## dim 6 0.136 8.644 71.579
## dim 7 0.123 7.807 79.386
## dim 8 0.120 7.617 87.003
## dim 9 0.085 5.400 92.403
## dim 10 0.074 4.703 97.106
## dim 11 0.045 2.894 100.000
barplot(mca$eig[,"percentage of variance"],
ylab = "Percentage of variance",xlab = "Dimensions")
As shown here, the percentage of variance explained by each dimension is low (only 35% for the first two dimensions together). For simplicity, we nevertheless use only the first two dimensions.
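Low percentages are typical of MCA, because the total inertia depends only on the number of variables and categories. The sketch below is a minimal check using the rounded eigenvalues printed above, with J = 7 variables and K = 18 categories counted from the summary table. The mean-eigenvalue threshold 1/J is a common heuristic that the analysis itself does not use; it is shown here only as a point of comparison with the choice of two dimensions.

```r
# Eigenvalues copied from the mca$eig output (rounded to 3 decimals)
eig <- c(0.385, 0.165, 0.156, 0.142, 0.141, 0.136,
         0.123, 0.120, 0.085, 0.074, 0.045)

J <- 7   # number of categorical variables
K <- 18  # total number of categories across the 7 variables

# An MCA has K - J non-trivial dimensions and total inertia K / J - 1
n_dim <- K - J
total_inertia <- K / J - 1

# Percentage of inertia carried by each dimension
pct <- 100 * eig / total_inertia

# Heuristic (assumption, not the report's rule): keep eigenvalues above 1 / J
above_average <- which(eig > 1 / J)

print(round(pct, 1))
print(above_average)
```

Under this heuristic a third dimension would also be a candidate, so restricting the analysis to two dimensions is a simplification rather than a necessity.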
# Refit the MCA, keeping only the first two dimensions
mca <- MCA(heart.mca, ncp = 2, graph = FALSE)
lapply(mca$var, round, 3)
## $coord
## Dim 1 Dim 2
## F 0.812 0.039
## M -0.216 -0.010
## ASY -0.632 -0.201
## ATA 1.179 -0.610
## NAP 0.466 0.410
## TA 0.330 2.647
## FastingBS_High -0.627 0.660
## FastingBS_Normal 0.191 -0.201
## RestingECG_LVH -0.035 1.103
## RestingECG_Normal 0.142 -0.437
## RestingECG_ST -0.402 0.189
## ExerciseAngina_N 0.587 0.184
## ExerciseAngina_Y -0.866 -0.271
## Down -0.783 1.256
## Flat -0.644 -0.149
## Up 0.875 -0.027
## HeartDisease_N 0.958 0.037
## HeartDisease_Y -0.773 -0.030
##
## $contrib
## Dim 1 Dim 2
## F 5.143 0.028
## M 1.369 0.007
## ASY 8.008 1.886
## ATA 9.706 6.064
## NAP 1.778 3.220
## TA 0.202 30.418
## FastingBS_High 3.396 8.794
## FastingBS_Normal 1.032 2.673
## RestingECG_LVH 0.009 21.592
## RestingECG_Normal 0.447 9.936
## RestingECG_ST 1.162 0.601
## ExerciseAngina_N 7.614 1.746
## ExerciseAngina_Y 11.226 2.574
## Down 1.557 9.379
## Flat 7.704 0.962
## Up 12.206 0.027
## HeartDisease_N 15.184 0.052
## HeartDisease_Y 12.255 0.042
##
## $cos2
## Dim 1 Dim 2
## F 0.176 0.000
## M 0.176 0.000
## ASY 0.470 0.047
## ATA 0.323 0.086
## NAP 0.062 0.048
## TA 0.006 0.370
## FastingBS_High 0.119 0.132
## FastingBS_Normal 0.119 0.132
## RestingECG_LVH 0.000 0.314
## RestingECG_Normal 0.030 0.288
## RestingECG_ST 0.039 0.009
## ExerciseAngina_N 0.508 0.050
## ExerciseAngina_Y 0.508 0.050
## Down 0.045 0.116
## Flat 0.417 0.022
## Up 0.578 0.001
## HeartDisease_N 0.740 0.001
## HeartDisease_Y 0.740 0.001
##
## $v.test
## Dim 1 Dim 2
## F 12.694 0.614
## M -12.694 -0.614
## ASY -20.760 -6.591
## ATA 17.202 -8.895
## NAP 7.515 6.616
## TA 2.293 18.414
## FastingBS_High -10.468 11.019
## FastingBS_Normal 10.468 -11.019
## RestingECG_LVH -0.537 16.956
## RestingECG_Normal 5.265 -16.244
## RestingECG_ST -5.972 2.811
## ExerciseAngina_N 21.591 6.762
## ExerciseAngina_Y -21.591 -6.762
## Down -6.432 10.326
## Flat -19.547 -4.517
## Up 23.024 -0.711
## HeartDisease_N 26.055 0.997
## HeartDisease_Y -26.055 -0.997
##
## $eta2
## Dim 1 Dim 2
## Sex 0.176 0.000
## ChestPainType 0.531 0.480
## FastingBS 0.119 0.132
## RestingECG 0.044 0.371
## ExerciseAngina 0.508 0.050
## ST_Slope 0.579 0.120
## HeartDisease 0.740 0.001
The results above show that the categories contributing the most to the first dimension are heart disease status (N: 15.2%, Y: 12.3%) and the upsloping ST slope (12.2%), whereas typical angina (TA, 30.4%) and left ventricular hypertrophy (LVH, 21.6%) contribute the most to dimension 2.
plot(mca, cex = 0.4)
From this figure, we can see that patients with heart disease tend to have high fasting blood sugar, an ST-T wave abnormality (ST) in their resting ECG, and a flat or downsloping ST slope. They also tend to have the asymptomatic (ASY) chest pain type and exercise-induced angina. Left ventricular hypertrophy (LVH) lies close to zero on dimension 1 (-0.035), so its association with heart disease is weak on this map. On the other hand, patients without heart disease tend to have normal fasting blood sugar, a normal resting ECG, an upsloping ST slope, no exercise-induced angina, and atypical angina (ATA) or non-anginal pain (NAP).
The in-depth statistical analysis of the Heart dataset has provided valuable insights into factors influencing the predisposition to heart diseases. Through the application of the three techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Multiple Correspondence Analysis (MCA), several significant findings have emerged.
PCA Insight: PCA showed that age and maximum heart rate dominate the first principal component and cholesterol the second. These variables are strongly correlated with the retained components, which indicates that they carry much of the variance in the numerical data.
LDA Analysis: Linear Discriminant Analysis separated the “Yes” and “No” heart disease categories along a single axis, driven primarily by Oldpeak and maximum heart rate, with age and cholesterol contributing less. The two categories nevertheless overlap: about 75% of the patients are correctly classified, with a sensitivity of 78% and a specificity of 72%.
MCA Insights: Multiple Correspondence Analysis revealed patterns linking chest pain type, ST slope, exercise-induced angina, and other categorical indicators to heart disease status. These insights improved the understanding of the relationships among the categorical variables and their association with the presence or absence of heart disease.
Future Perspectives: These analyses provide crucial leads for future studies and clinical interventions. Advanced approaches can explore refining predictive models for heart diseases by considering a broader range of factors and integrating machine learning techniques for improved predictive accuracy.
In summary, this statistical analysis uncovered significant trends and set the stage for further research to better understand and more effectively prevent heart disease.