Statistical Analysis of the Heart Dataset

Source of the dataset:

https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/

Variables description:

Age: age of the patient in years
Sex: sex of the patient [M: Male, F: Female]
ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP: resting blood pressure [mm Hg]
Cholesterol: serum cholesterol [mg/dl]
FastingBS: fasting blood sugar [High: if FastingBS > 120 mg/dl, Normal: otherwise]
RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: left ventricular hypertrophy]
MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease: output class [1: heart disease, 0: Normal]

Introduction

The goal of this analysis is to reduce the dimensionality and the noise of the variables, and to predict whether a new patient will develop heart disease based on that patient's measurements and the information contained in this dataset. To do this, we perform a principal component analysis, a linear discriminant analysis, and a multiple correspondence analysis.

Principal Component Analysis

table_dataset

Demographic characteristics of the study participants	N = 918¹
Age	54 (9)
Sex
F	193 / 918 (21%)
M	725 / 918 (79%)
ChestPainType
ASY	496 / 918 (54%)
ATA	173 / 918 (19%)
NAP	203 / 918 (22%)
TA	46 / 918 (5.0%)
RestingBP	132 (19)
Cholesterol	199 (109)
FastingBS
High	214 / 918 (23%)
Normal	704 / 918 (77%)
RestingECG
LVH	188 / 918 (20%)
Normal	552 / 918 (60%)
ST	178 / 918 (19%)
MaxHR	137 (25)
ExerciseAngina
N	547 / 918 (60%)
Y	371 / 918 (40%)
Oldpeak	0.89 (1.07)
ST_Slope
Down	63 / 918 (6.9%)
Flat	460 / 918 (50%)
Up	395 / 918 (43%)
HeartDisease
N	410 / 918 (45%)
Y	508 / 918 (55%)
¹ Mean (SD); n / N (%)

As we can see from Table 1, all categorical variables are described by the count n over the total sample size N and the corresponding percentage, whereas the continuous variables are described by the arithmetic mean and the standard deviation.

The total sample size is 918 participants, with a mean age of 54 years and a standard deviation of 9 years. The large majority of the participants are male (79%). Of all the participants, 54% have the asymptomatic chest pain type. The mean resting blood pressure is 132 mm Hg with a standard deviation of 19 mm Hg, and the mean cholesterol level is 199 mg/dl with a standard deviation of 109 mg/dl. 77% of the patients have normal fasting blood sugar and 60% have a normal resting ECG. The mean maximum heart rate is 137 beats per minute with a standard deviation of 25 beats per minute. 40% of the participants suffer from angina during exercise, and half of all participants have a flat ST slope. Finally, 55% have heart disease and the remaining 45% do not.

Figure 1 : Barplot of eigenvalues

First, the numerical variables must be standardized before performing the PCA; this is handled by the scale.unit argument of PCA():

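library(FactoMineR)  # provides PCA() and MCA()
# Keep only the five continuous variables
heart.pca <- heart[, c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")]
pca <- PCA(heart.pca, scale.unit = TRUE, ncp = 5, graph = FALSE)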

barplot(pca$eig[, "eigenvalue"], xlab = "Components", ylab = "Eigenvalues")
abline(h = 1, lty = "dashed")

Kaiser's rule stipulates that only components with an eigenvalue greater than 1 should be retained. Based on the barplot, we keep the first two components.
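The eigenvalue-greater-than-1 rule can be illustrated without FactoMineR. The following is a minimal sketch that uses the built-in mtcars data as a stand-in (an assumption; it is not the heart dataset) and relies on the fact that the eigenvalues of a standardized PCA are those of the correlation matrix:

```r
# Stand-in data (assumption): five numeric columns of the built-in mtcars set,
# used only to illustrate the rule; this is NOT the heart dataset.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]

# For a standardized PCA, the eigenvalues are those of the correlation matrix.
eig <- eigen(cor(X))$values

# Each standardized variable carries a variance of 1, so the eigenvalues
# sum to the number of variables.
total <- sum(eig)

# Keep the components that explain more than a single original variable.
keep <- which(eig > 1)
print(round(eig, 3))
print(keep)
```

A component with an eigenvalue below 1 summarizes less variance than one standardized variable on its own, which is the rationale for the dashed reference line drawn at 1 in the barplot.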

Variables Analysis

pca <- PCA(heart.pca, scale.unit = TRUE, ncp = 2, graph = FALSE)

lapply(pca$var,round,3)

## $coord
##              Dim.1 Dim.2
## Age          0.787 0.010
## RestingBP    0.489 0.518
## Cholesterol -0.233 0.812
## MaxHR       -0.704 0.376
## Oldpeak      0.546 0.352
## 
## $cor
##              Dim.1 Dim.2
## Age          0.787 0.010
## RestingBP    0.489 0.518
## Cholesterol -0.233 0.812
## MaxHR       -0.704 0.376
## Oldpeak      0.546 0.352
## 
## $cos2
##             Dim.1 Dim.2
## Age         0.620 0.000
## RestingBP   0.239 0.268
## Cholesterol 0.054 0.660
## MaxHR       0.496 0.141
## Oldpeak     0.298 0.124
## 
## $contrib
##              Dim.1  Dim.2
## Age         36.319  0.009
## RestingBP   14.006 22.483
## Cholesterol  3.168 55.273
## MaxHR       29.073 11.829
## Oldpeak     17.434 10.406

These results show that age has a strong positive correlation with the first component (0.787), whereas maximum heart rate has a strong negative correlation with it (-0.704). On the second dimension, cholesterol has a strong positive correlation (0.812) and resting blood pressure a moderate one (0.518).

The results also show that age and MaxHR contribute 36.3% and 29.1% of the inertia of the first dimension, while resting blood pressure and cholesterol contribute 22.5% and 55.3% of the inertia of the second dimension.
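The link between the cos² and the contributions can be checked by hand. The sketch below assumes a standardized PCA with equally weighted variables, in which case the eigenvalue of a dimension equals the sum of the cos² of the variables on that dimension; the input values are the rounded cos² printed above, so small rounding differences are expected:

```r
# Squared cosines on Dim.1, copied from the $cos2 output above (rounded values).
cos2_dim1 <- c(Age = 0.620, RestingBP = 0.239, Cholesterol = 0.054,
               MaxHR = 0.496, Oldpeak = 0.298)

# In a standardized PCA, the eigenvalue of a dimension is the sum of the
# squared coordinates (here equal to the cos2) of the variables on it.
eig_dim1 <- sum(cos2_dim1)

# Contribution (%) of each variable to the inertia of Dim.1.
contrib_dim1 <- 100 * cos2_dim1 / eig_dim1
print(round(contrib_dim1, 1))
```

The contributions of a dimension always sum to 100, so a variable contributing more than 100 / 5 = 20% weighs more than average on that dimension.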

The sum of the cos² of each variable for the first two principal components:

round(sort(rowSums(pca$var$cos2)), digits = 3)

##     Oldpeak   RestingBP         Age       MaxHR Cholesterol 
##       0.422       0.508       0.620       0.638       0.714

The sums of cos² are listed in ascending order. Oldpeak is the least well represented variable on the first two principal components, since its summed cos² is below 0.5 (0.422). The correlation circle confirms this.

Figure 2 : Correlation circle of all the continuous variables on the first 2 dimensions

plot.PCA(pca, choix = "var", axes=c(1,2))

As noted before, Oldpeak and RestingBP have the shortest arrows in the correlation circle, which means that they are not adequately represented on the first two components. Oldpeak appears roughly uncorrelated with Cholesterol, and RestingBP with MaxHR, but since their quality of representation is rather low, it would be risky to draw such conclusions.
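The reading of the correlation circle can be made numeric. The sketch below uses the rounded coordinates printed in the $coord output and computes two things: the length of each arrow (the square root of the summed cos² on the plane) and the cosine of the angle between two arrows. The helper cos_angle() is a hypothetical name introduced for illustration:

```r
# Coordinates on the first two dimensions, copied from the $coord output.
coord <- rbind(Age         = c( 0.787, 0.010),
               RestingBP   = c( 0.489, 0.518),
               Cholesterol = c(-0.233, 0.812),
               MaxHR       = c(-0.704, 0.376),
               Oldpeak     = c( 0.546, 0.352))
colnames(coord) <- c("Dim.1", "Dim.2")

# Arrow length = square root of the summed cos2 on the plotted plane.
arrow_length <- sqrt(rowSums(coord^2))
print(round(sort(arrow_length), 3))

# Cosine of the angle between two arrows (0 means perpendicular arrows).
cos_angle <- function(a, b) {
  sum(coord[a, ] * coord[b, ]) / (arrow_length[[a]] * arrow_length[[b]])
}
print(round(cos_angle("Oldpeak", "Cholesterol"), 3))
print(round(cos_angle("RestingBP", "MaxHR"), 3))
```

An angle read on the plane only approximates the correlation between two variables when both arrows are close to the unit circle, which is why short arrows call for caution.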

Figure 3 : Graph of the individuals showing the 100 observations with the highest cos²

library(factoextra)  # provides fviz_pca_ind()
fviz_pca_ind(pca, select.ind = list(cos2=100),
             col.ind="cos2", repel=TRUE,
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

The plot shows the 100 observations with the highest cos². Two groups seem to emerge:
- A concentrated group of younger people on the left of the map, with rather high MaxHR and cholesterol.
- A more scattered group of older people, with lower cholesterol and MaxHR than the previous group.
These impressions come from comparing the correlation circle with the map of individuals.

## Adding our main categorical variable (heart disease) to the PCA as a supplementary qualitative variable
heart.pca$HeartDisease<-heart$HeartDisease

res <- PCA(heart.pca, ncp = 2, graph = FALSE, 
           quali.sup = c(6))

# Figure 4 : Individual plot showing the two levels of the variable heart disease
plot(res, choix = "ind", cex=0.4)

The description of this plot still needs to be written, but I am not sure that this graph is relevant to the analysis.

Linear Discriminant Analysis

## Linear discriminant analysis
library(MASS)
# The values of the first and only linear discriminant
heart.lda <- lda(HeartDisease ~ ., 
                data = heart.pca, 
                method = "mle") 
heart.lda

## Call:
## lda(HeartDisease ~ ., data = heart.pca, method = "mle")
## 
## Prior probabilities of groups:
##         N         Y 
## 0.4466231 0.5533769 
## 
## Group means:
##        Age RestingBP Cholesterol    MaxHR   Oldpeak
## N 50.55122  130.1805    227.1220 148.1512 0.4080488
## Y 55.89961  134.1850    175.9409 127.6555 1.2742126
## 
## Coefficients of linear discriminants:
##                      LD1
## Age          0.014676526
## RestingBP    0.002479736
## Cholesterol -0.003578610
## MaxHR       -0.023266312
## Oldpeak      0.705322727

Adding the posterior probabilities for each level of the heart disease variable, as well as the linear discriminant scores, to the original dataset:

heart.pred <- predict(heart.lda)
d <- cbind(heart.pca, heart.pred)
head(d)

##   Age RestingBP Cholesterol MaxHR Oldpeak HeartDisease class posterior.N
## 1  40       140         289   172     0.0            N     N   0.9116175
## 2  49       160         180   156     1.0            Y     N   0.5228573
## 3  37       130         283    98     0.0            N     N   0.5143677
## 4  48       138         214   108     1.5            Y     Y   0.1615806
## 5  54       150         195   122     0.0            N     Y   0.4954692
## 6  39       120         339   170     0.0            N     N   0.9309189
##   posterior.Y        LD1
## 1  0.08838255 -1.9468646
## 2  0.47714275 -0.2975290
## 3  0.48563229 -0.2725128
## 4  0.83841935  0.9810119
## 5  0.50453085 -0.2168910
## 6  0.06908107 -2.1435337
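The LD1 column can be reproduced by hand from the printed model output. The sketch below assumes that the scores are centred on the mean of the group means weighted by the prior probabilities, which is consistent with the numbers shown here; ld1_score() is a hypothetical helper name introduced for illustration:

```r
# Values copied from the lda() output above.
prior <- c(N = 0.4466231, Y = 0.5533769)
group_means <- rbind(
  N = c(Age = 50.55122, RestingBP = 130.1805, Cholesterol = 227.1220,
        MaxHR = 148.1512, Oldpeak = 0.4080488),
  Y = c(Age = 55.89961, RestingBP = 134.1850, Cholesterol = 175.9409,
        MaxHR = 127.6555, Oldpeak = 1.2742126))
coef_ld1 <- c(Age = 0.014676526, RestingBP = 0.002479736,
              Cholesterol = -0.003578610, MaxHR = -0.023266312,
              Oldpeak = 0.705322727)

# Centre (assumption): group means weighted by the prior probabilities.
centre <- colSums(prior * group_means)

# LD1 score = centred measurements multiplied by the coefficients.
ld1_score <- function(x) sum((x - centre) * coef_ld1)

# Measurements of patients 1 and 4, copied from head(d).
patient1 <- c(Age = 40, RestingBP = 140, Cholesterol = 289, MaxHR = 172, Oldpeak = 0)
patient4 <- c(Age = 48, RestingBP = 138, Cholesterol = 214, MaxHR = 108, Oldpeak = 1.5)
print(round(c(ld1_score(patient1), ld1_score(patient4)), 3))
```

Up to rounding, the two scores match the LD1 values listed for patients 1 and 4 (about -1.947 and 0.981). This also makes the raw coefficients easier to read: Oldpeak has by far the largest coefficient partly because it is measured on a much smaller scale than the other variables.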

Correlation of the original variables with the canonical discriminant variable:

z <- predict(heart.lda)$x
cor(heart.pca[,-6], z)

##                    LD1
## Age          0.5037050
## RestingBP    0.1921479
## Cholesterol -0.4156618
## MaxHR       -0.7151291
## Oldpeak      0.7214334

These coefficients represent the correlations of each variable with the first discriminant axis LD1. They indicate how each variable contributes to the separation between the ‘Yes’ and ‘No’ categories: Oldpeak (0.72) and MaxHR (-0.72) are the most strongly correlated with LD1, followed by Age (0.50) and Cholesterol (-0.42).

Boxplot and histogram showing the distribution of LD1 across the levels of the heart disease variable:

library(ggplot2)
ggplot(d, aes(x = HeartDisease, y = LD1)) + geom_boxplot()

ggplot(d, aes(x = LD1)) + geom_histogram(bins = 10) + facet_grid(HeartDisease ~ .)

In the first graph, the third quartile of the “No” category lies below zero on the first discriminant axis (LD1). Since Cholesterol and MaxHR are negatively correlated with LD1, the values of these variables tend to be higher in the “No” category than in the “Yes” category, which is consistent with the group means (Cholesterol: 227 vs 176; MaxHR: 148 vs 128). These variables contribute to the separation between the categories along LD1.

In addition, the first quartile of the “Yes” category is just below zero on LD1, which indicates that a subset of “Yes” observations have LD1 values close to or below zero. The “Yes” category therefore overlaps partly with the “No” category along LD1, despite the general separation between the two groups.

The second graph confirms the link between a negative (positive) value on LD1 and membership in the ‘No’ (‘Yes’) category.
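The direction of each variable's relationship with LD1 can be checked against the group means reported by lda(). The following is a small sketch using only values printed above; it compares the sign of the difference in means (Y minus N) with the sign of the correlation with LD1:

```r
# Group means and correlations with LD1, copied from the outputs above.
mean_N <- c(Age = 50.55122, RestingBP = 130.1805, Cholesterol = 227.1220,
            MaxHR = 148.1512, Oldpeak = 0.4080488)
mean_Y <- c(Age = 55.89961, RestingBP = 134.1850, Cholesterol = 175.9409,
            MaxHR = 127.6555, Oldpeak = 1.2742126)
cor_ld1 <- c(Age = 0.5037050, RestingBP = 0.1921479, Cholesterol = -0.4156618,
             MaxHR = -0.7151291, Oldpeak = 0.7214334)

# Positive difference: the variable is higher, on average, in the "Y" group.
diff_Y_minus_N <- mean_Y - mean_N

# Since the "Y" group sits on the positive side of LD1, the sign of each
# difference is expected to match the sign of the correlation with LD1.
check <- data.frame(diff_Y_minus_N = round(diff_Y_minus_N, 2),
                    cor_LD1 = round(cor_ld1, 2),
                    same_sign = sign(diff_Y_minus_N) == sign(cor_ld1))
print(check)
```

A negative correlation with LD1 therefore goes with lower average values in the “Y” group: patients with heart disease have, on average, lower Cholesterol and MaxHR in this dataset.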

Second part of LDA

## The 2 by 2 table showing the observed versus the predicted classification
validation<-with(d, table(obs = HeartDisease, pred = class))
validation

##    pred
## obs   N   Y
##   N 295 115
##   Y 111 397

#The proportion of well-classified patients:
proportion <- sum(diag(validation))/nrow(heart.pca)
proportion

## [1] 0.7538126

##The sensitivity
sens <- validation[2,2] / (validation[2,2] + validation[2,1])
sens

## [1] 0.7814961

##The specificity
spec <- validation[1,1] / (validation[1,1] + validation[1,2])
spec

## [1] 0.7195122

As the results above show, the sensitivity of the classification is 78% and the specificity is 72%. In other words, 78% of the patients with heart disease are correctly classified as “Y”, and 72% of the patients without heart disease are correctly classified as “N”.
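The rates above are computed on the same patients that were used to fit the model, so they may be optimistic for new patients. One possible check, sketched here on a two-class subset of the built-in iris data (a stand-in, not the heart dataset), is the leave-one-out option CV = TRUE of MASS::lda():

```r
library(MASS)

# Stand-in data (assumption): two iris species, used only for illustration.
two <- droplevels(subset(iris, Species != "setosa"))

# With CV = TRUE, lda() returns leave-one-out predictions in $class:
# each observation is classified by a model fitted without it.
fit_cv <- lda(Species ~ ., data = two, CV = TRUE)

# Confusion table and proportion of correctly classified observations.
tab_cv <- table(obs = two$Species, pred = fit_cv$class)
acc_cv <- sum(diag(tab_cv)) / sum(tab_cv)
print(tab_cv)
print(round(acc_cv, 3))
```

Applied to heart.pca with HeartDisease as the response, the same table would give cross-validated counterparts of the accuracy, sensitivity and specificity.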

Multiple Correspondence Analysis

par(mfrow=c(1,1))
heart.mca<- heart[,c("Sex","ChestPainType","FastingBS","RestingECG","ExerciseAngina","ST_Slope","HeartDisease")]
mca <- MCA(heart.mca, ncp = 5, graph = FALSE)

Eigenvalues

round(mca$eig, digits = 3)

##        eigenvalue percentage of variance cumulative percentage of variance
## dim 1       0.385                 24.528                            24.528
## dim 2       0.165                 10.497                            35.025
## dim 3       0.156                  9.898                            44.923
## dim 4       0.142                  9.060                            53.983
## dim 5       0.141                  8.952                            62.935
## dim 6       0.136                  8.644                            71.579
## dim 7       0.123                  7.807                            79.386
## dim 8       0.120                  7.617                            87.003
## dim 9       0.085                  5.400                            92.403
## dim 10      0.074                  4.703                            97.106
## dim 11      0.045                  2.894                           100.000

Figure 6 : Barplot of the percentage of variance across the dimensions

barplot(mca$eig[,"percentage of variance"], 
        ylab = "Percentage of variance",xlab = "Dimensions")

As we see here, the percentage of variance explained by each dimension is low (only 35% for the first two dimensions together). For simplicity, we keep the first two dimensions.
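Low percentages are typical of MCA, because the inertia is spread over as many dimensions as there are categories minus variables. The following sketch uses the rounded eigenvalues printed above and the counts of variables and categories in heart.mca; it assumes the usual result for an MCA of the indicator matrix, namely a total inertia of (K - J) / J:

```r
# Eigenvalues copied from the rounded mca$eig output above.
eig <- c(0.385, 0.165, 0.156, 0.142, 0.141, 0.136,
         0.123, 0.120, 0.085, 0.074, 0.045)

J <- 7   # number of categorical variables in heart.mca
K <- 18  # total number of categories across those variables

# In an MCA of the indicator matrix, the total inertia is (K - J) / J
# and there are K - J dimensions.
total_inertia <- (K - J) / J
n_dims <- K - J

# Percentage of variance per dimension and cumulative percentage.
pct <- 100 * eig / sum(eig)
print(round(c(theory = total_inertia, observed = sum(eig)), 3))
print(round(cumsum(pct)[1:2], 1))
```

With 11 dimensions, a dimension carrying an equal share would explain about 9% of the variance, so the 24.5% of the first dimension is well above that baseline even though it looks small.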

mca <- MCA(heart.mca, ncp = 2, graph = FALSE)
lapply(mca$var, round, 3)

## $coord
##                    Dim 1  Dim 2
## F                  0.812  0.039
## M                 -0.216 -0.010
## ASY               -0.632 -0.201
## ATA                1.179 -0.610
## NAP                0.466  0.410
## TA                 0.330  2.647
## FastingBS_High    -0.627  0.660
## FastingBS_Normal   0.191 -0.201
## RestingECG_LVH    -0.035  1.103
## RestingECG_Normal  0.142 -0.437
## RestingECG_ST     -0.402  0.189
## ExerciseAngina_N   0.587  0.184
## ExerciseAngina_Y  -0.866 -0.271
## Down              -0.783  1.256
## Flat              -0.644 -0.149
## Up                 0.875 -0.027
## HeartDisease_N     0.958  0.037
## HeartDisease_Y    -0.773 -0.030
## 
## $contrib
##                    Dim 1  Dim 2
## F                  5.143  0.028
## M                  1.369  0.007
## ASY                8.008  1.886
## ATA                9.706  6.064
## NAP                1.778  3.220
## TA                 0.202 30.418
## FastingBS_High     3.396  8.794
## FastingBS_Normal   1.032  2.673
## RestingECG_LVH     0.009 21.592
## RestingECG_Normal  0.447  9.936
## RestingECG_ST      1.162  0.601
## ExerciseAngina_N   7.614  1.746
## ExerciseAngina_Y  11.226  2.574
## Down               1.557  9.379
## Flat               7.704  0.962
## Up                12.206  0.027
## HeartDisease_N    15.184  0.052
## HeartDisease_Y    12.255  0.042
## 
## $cos2
##                   Dim 1 Dim 2
## F                 0.176 0.000
## M                 0.176 0.000
## ASY               0.470 0.047
## ATA               0.323 0.086
## NAP               0.062 0.048
## TA                0.006 0.370
## FastingBS_High    0.119 0.132
## FastingBS_Normal  0.119 0.132
## RestingECG_LVH    0.000 0.314
## RestingECG_Normal 0.030 0.288
## RestingECG_ST     0.039 0.009
## ExerciseAngina_N  0.508 0.050
## ExerciseAngina_Y  0.508 0.050
## Down              0.045 0.116
## Flat              0.417 0.022
## Up                0.578 0.001
## HeartDisease_N    0.740 0.001
## HeartDisease_Y    0.740 0.001
## 
## $v.test
##                     Dim 1   Dim 2
## F                  12.694   0.614
## M                 -12.694  -0.614
## ASY               -20.760  -6.591
## ATA                17.202  -8.895
## NAP                 7.515   6.616
## TA                  2.293  18.414
## FastingBS_High    -10.468  11.019
## FastingBS_Normal   10.468 -11.019
## RestingECG_LVH     -0.537  16.956
## RestingECG_Normal   5.265 -16.244
## RestingECG_ST      -5.972   2.811
## ExerciseAngina_N   21.591   6.762
## ExerciseAngina_Y  -21.591  -6.762
## Down               -6.432  10.326
## Flat              -19.547  -4.517
## Up                 23.024  -0.711
## HeartDisease_N     26.055   0.997
## HeartDisease_Y    -26.055  -0.997
## 
## $eta2
##                Dim 1 Dim 2
## Sex            0.176 0.000
## ChestPainType  0.531 0.480
## FastingBS      0.119 0.132
## RestingECG     0.044 0.371
## ExerciseAngina 0.508 0.050
## ST_Slope       0.579 0.120
## HeartDisease   0.740 0.001

The results above show that the categories contributing the most to the first dimension are the two heart disease categories (15.2% and 12.3%) and the upsloping ST slope (12.2%), whereas typical angina (TA, 30.4%) and left ventricular hypertrophy (LVH, 21.6%) contribute the most to the second dimension.
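A common rule of thumb, used here as an assumption rather than a formal test, is to look at the categories whose contribution exceeds the value they would have if all categories contributed equally (100 / 18, about 5.6%). The sketch below applies it to the contributions printed above; above() is a hypothetical helper name:

```r
# Contributions (%) copied from the $contrib output above.
contrib <- data.frame(
  row.names = c("F", "M", "ASY", "ATA", "NAP", "TA",
                "FastingBS_High", "FastingBS_Normal",
                "RestingECG_LVH", "RestingECG_Normal", "RestingECG_ST",
                "ExerciseAngina_N", "ExerciseAngina_Y",
                "Down", "Flat", "Up", "HeartDisease_N", "HeartDisease_Y"),
  Dim1 = c(5.143, 1.369, 8.008, 9.706, 1.778, 0.202, 3.396, 1.032, 0.009,
           0.447, 1.162, 7.614, 11.226, 1.557, 7.704, 12.206, 15.184, 12.255),
  Dim2 = c(0.028, 0.007, 1.886, 6.064, 3.220, 30.418, 8.794, 2.673, 21.592,
           9.936, 0.601, 1.746, 2.574, 9.379, 0.962, 0.027, 0.052, 0.042))

# Contribution each category would have if all contributed equally.
threshold <- 100 / nrow(contrib)

# Categories contributing more than average, sorted in decreasing order.
above <- function(dim) {
  x <- setNames(contrib[[dim]], rownames(contrib))
  sort(x[x > threshold], decreasing = TRUE)
}
print(round(above("Dim1"), 1))
print(round(above("Dim2"), 1))
```

On the first dimension this also brings out exercise-induced angina, the chest pain types ASY and ATA, and the flat ST slope, all of which lie above the average contribution.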

plot(mca, cex = 0.4)

From this figure, we can see that patients with heart disease tend to have high fasting blood sugar, left ventricular hypertrophy (LVH) or ST-T wave abnormality on their resting ECG, and a downsloping ST segment. These patients also tend to have the asymptomatic chest pain type and exercise-induced angina. On the other hand, patients without heart disease tend to have normal fasting blood sugar, a normal resting ECG, and non-anginal pain.

Conclusion

The in-depth statistical analysis of the Heart dataset has provided valuable insights into factors influencing the predisposition to heart diseases. Through the application of the three techniques: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Multiple Correspondence Analysis (MCA), several significant findings have emerged.

Key Findings:

PCA Insight: PCA highlighted the substantial contribution of variables such as age, maximum heart rate, and cholesterol in reducing the dataset’s dimensionality. These variables exhibited strong correlations with the principal components, suggesting their pivotal role in data variance.

LDA Analysis: Linear Discriminant Analysis separated the “Yes” and “No” categories for heart disease reasonably well (about 75% of patients correctly classified), with Oldpeak and maximum heart rate most strongly correlated with the discriminant axis, followed by age and cholesterol. However, the two categories overlap to some extent, indicating similar values of these variables for certain cases.

MCA Insights: Multiple Correspondence Analysis revealed patterns in chest pain types, resting electrocardiogram results, and other indicators related to heart disease. These insights enhanced the understanding of relationships among the categorical variables and their association with the presence or absence of heart disease.

Future Perspectives: These analyses provide crucial leads for future studies and clinical interventions. Advanced approaches can explore refining predictive models for heart diseases by considering a broader range of factors and integrating machine learning techniques for improved predictive accuracy.

In summary, this statistical analysis uncovers notable trends and sets the stage for further research to better understand and more effectively prevent heart disease.