This study investigates the effects of gender, age, maximum heart rate, and resting blood pressure on heart disease prediction. Using a combined Kaggle dataset sourced from the UCI Machine Learning Repository, the study applied statistical analysis to (1) explore the individual effect of gender on predicting heart disease, (2) examine the combined effects of gender, age, maximum heart rate, and resting blood pressure on heart disease, and (3) analyze how gender changes the effects of the other variables on heart disease. Results highlight the significance of gender, age, maximum heart rate, and resting blood pressure in predicting heart disease. Moreover, they reveal that the effect of resting blood pressure on heart disease differs by gender. These findings point to the need for future research to expand on the current study, analyze gender-specific biological and physiological factors in predicting heart disease, and conduct longitudinal studies to enhance prediction accuracy and reliability.
Heart disease has been the leading cause of death in the United States since the 1950s, taking a life every 33 seconds.10 Consider how many lives that amounts to in a year. Worldwide, approximately 17.9 million people die from heart disease annually.15 Heart disease, also known as cardiovascular disease, is a range of disorders affecting the heart and its related blood vessels. Recent statistics highlight the growing prominence of heart disease, projecting that the situation will worsen in the coming decades.1 In 2023, heart disease accounted for more than one in every four deaths in the United States.11 Additionally, the U.S. healthcare system spends over $200 billion yearly on hospital care and medications related to cardiovascular diseases.6,11 Beyond the substantial death toll, millions of survivors endure chronic health problems or are at risk of developing them.6
Through decades-long research, scientists have determined many risk factors for heart disease, including smoking, high blood pressure, high cholesterol, gender, other existing health conditions, and age.4,6 However, with so many potential risk factors and the overlap of many symptoms with other diseases, detecting heart disease remains a challenge for both physicians and researchers.2 One potential solution to this challenge is the use of statistical models to identify key characteristics commonly associated with individuals suffering from heart disease and predict the likelihood of developing the condition.
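This study uses a Kaggle dataset compiled from the University of California, Irvine’s Machine Learning Repository to analyze the relationship between heart disease and four specific risk factors: Gender, Age, Maximum Heart Rate, and Resting Blood Pressure. The study aims to address the following key inquiries:
(1) What is the individual effect of gender on the likelihood of heart disease?
(2) What are the combined effects of gender, age, maximum heart rate, and resting blood pressure on the likelihood of heart disease?
(3) How does gender change the effects of the other variables on the likelihood of heart disease?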
The null hypotheses tested throughout this study are as follows:
Null Hypothesis #1: Gender does not affect the likelihood of heart disease in any way.
Null Hypothesis #2: There is no meaningful effect of gender, age, maximum heart rate, and resting blood pressure on the likelihood of heart disease.
Null Hypothesis #3: The effects of age, maximum heart rate, and resting blood pressure on the likelihood of heart disease do not vary based on gender.
The dataset was sourced from Kaggle and is a cleaned compilation of five heart disease datasets from different locations: Cleveland, Hungary, Switzerland, and elsewhere in the United States.3 The individual heart disease datasets were obtained from two databases within the UCI Machine Learning Repository. The first database, “Heart Disease,” consists of four datasets of patients who underwent angiography at the Cleveland Clinic, the Hungarian Institute of Cardiology, the University Hospitals in Zurich and Basel, and the Veterans Administration Medical Center.8 The second database, “Statlog (Heart),” does not provide specific details about its data collection process.
| HeartDisease | Age | RestingBP | MaxHR | Sex |
|---|---|---|---|---|
| No | 40 | 140 | 172 | Male |
| Yes | 49 | 160 | 156 | Female |
| No | 37 | 130 | 98 | Male |
| Yes | 48 | 138 | 108 | Female |
| No | 54 | 150 | 122 | Male |
| No | 39 | 120 | 170 | Male |
The variables are described below:
‘HeartDisease’: The binary response variable of the dataset. 1 (shown as ‘Yes’ in the sample above) indicates the presence of heart disease and 0 (shown as ‘No’) indicates its absence.
‘Age’: An explanatory variable representing the age of a subject in years.
‘Sex’: An explanatory, categorical variable representing the gender of a subject in two levels: Male and Female.
‘MaxHR’: An explanatory variable measuring the maximum heart rate achieved by a subject.
‘RestingBP’: An explanatory variable measuring the resting blood pressure in millimeters of mercury (mm Hg).
The data were analyzed in two primary sections. First, univariate and bivariate exploratory data analyses were conducted to examine the relationship between the categorical predictor ‘Sex’ and the response variable ‘HeartDisease,’ using a bar graph and a mosaic plot. A simple logistic regression was then used to model the relationship, and a likelihood ratio test was conducted to evaluate its significance. In the second section, boxplots were used to perform a bivariate exploratory data analysis of ‘Age,’ ‘RestingBP,’ and ‘MaxHR’ and their relationships with ‘HeartDisease.’ A stepwise regression procedure was used to identify the best-fitting model, judged by AIC and residual deviance. To assess the model’s linearity and independence (including multicollinearity), empirical logit plots and a variance inflation factor (VIF) test were used. The empirical logit plots were also used to identify potential interaction terms, which were then fitted in additional models. These additional models were compared with the original model using likelihood ratio tests, and the results determined the final model used for further analysis.
Likelihood Ratio Test (LRT)
Null Hypothesis: Model # is no different from the base model.
Alternative Hypothesis: Model # is significantly better at predicting ‘HeartDisease’ than the base model.
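As a rough illustration of how such a test can be evaluated, the sketch below computes the likelihood ratio statistic as the drop in residual deviance between two nested models and converts it to a p-value. It assumes the two models differ by exactly one parameter (one degree of freedom), which lets the chi-square tail be written with the standard library alone. The function names are hypothetical and are not taken from the software used in this study.

```python
import math

def lrt_statistic(deviance_reduced, deviance_full):
    """Likelihood ratio statistic: drop in residual deviance between nested models."""
    return deviance_reduced - deviance_full

def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 df, P(X > g) = P(|Z| > sqrt(g)) = erfc(sqrt(g / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))

# Statistic reported in the Model Comparisons section for the added
# 'RestingBP:Sex' interaction term (one extra parameter).
reported_statistic = 9.1147
p_value = chi_square_p_value_df1(reported_statistic)
print(f"chi-square = {reported_statistic}, p-value = {p_value:.4f}")
```

A p-value below 0.05 leads to rejecting the null hypothesis that the larger model is no better than the base model. Comparisons involving more than one added parameter would need a general chi-square distribution function instead of this one-degree-of-freedom shortcut.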
Variance Inflation Factor Test (VIF)
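For reference, the variance inflation factor of a predictor is commonly defined as follows. This is the textbook form for linear predictors; the exact variant computed by the software used in this study is not stated and is an assumption here.

\[ \text{VIF}_j = \frac{1}{1 - R_j^2} \]

Here \(R_j^2\) is the coefficient of determination obtained by regressing predictor \(j\) on all the other predictors. A value of 1 means predictor \(j\) is linearly unrelated to the others, and larger values indicate stronger multicollinearity.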
I. EDA
The dataset contains a total of 725 males and 193 females. Of the 410 subjects who do not have heart disease, 143 are female and 267 are male. Of the 508 subjects who do have heart disease, 50 are female and 458 are male.
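The counts above are enough to work out the sample proportions, odds, and odds ratio by hand. The short sketch below does that arithmetic; the variable names are illustrative only.

```python
import math

# Counts reported in the text: heart disease status by sex.
female_yes, female_no = 50, 143
male_yes, male_no = 458, 267

# Proportion of each sex with heart disease.
prop_female = female_yes / (female_yes + female_no)
prop_male = male_yes / (male_yes + male_no)

# Odds of heart disease within each sex, and the male-to-female odds ratio.
odds_female = female_yes / female_no
odds_male = male_yes / male_no
odds_ratio = odds_male / odds_female

# On the log scale these are the quantities a logistic regression on 'Sex'
# estimates: the female log-odds and the log odds ratio.
log_odds_female = math.log(odds_female)
log_odds_ratio = math.log(odds_ratio)

print(f"proportion with heart disease: female {prop_female:.3f}, male {prop_male:.3f}")
print(f"odds ratio (male vs female): {odds_ratio:.2f}")
print(f"log-odds (female): {log_odds_female:.3f}, log odds ratio: {log_odds_ratio:.3f}")
```

The two log-scale values agree, to rounding, with the intercept and ‘SexMale’ coefficient reported in Appendix #A, which is expected for a logistic regression with a single binary predictor.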
II. Modeling & Assessment
Refer to Appendix #A for the statistical summary of the simple logistic regression model.
Linearity - Automatically satisfied since the model includes only a categorical predictor with no numerical ordering.
Independence - Satisfied, given that each subject is a unique individual with their own physical and health characteristics.
Randomness - Considered questionable due to the data collection process, so the analysis will proceed with caution.
A likelihood ratio test was used to assess the model’s significance.
The test results show that Model 2 has a chi-square value of 286.39 and a p-value near 0. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the model predicting ‘HeartDisease’ with ‘Sex’ is statistically significant, warranting its use for further analysis.
\[ \text{Logit(HeartDisease)} = -1.051 + 1.590(\text{SexMale}) \]
INTERPRETATION: The model predicts that the odds of heart disease for males are approximately 4.90 times the odds for females.
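To make the fitted equation concrete, the sketch below plugs the two possible values of ‘SexMale’ into the model and converts the resulting log-odds into predicted probabilities. The coefficients are the rounded values from the equation above; the function and constant names are hypothetical.

```python
import math

# Rounded coefficients from the fitted simple logistic regression.
INTERCEPT = -1.051
COEF_SEX_MALE = 1.590

def inverse_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def predicted_probability(is_male):
    """Predicted probability of heart disease for a male (True) or female (False)."""
    sex_male = 1 if is_male else 0
    return inverse_logit(INTERCEPT + COEF_SEX_MALE * sex_male)

p_female = predicted_probability(False)
p_male = predicted_probability(True)
odds_ratio = math.exp(COEF_SEX_MALE)

print(f"P(heart disease | female) = {p_female:.3f}")
print(f"P(heart disease | male)   = {p_male:.3f}")
print(f"odds ratio (male vs female) = {odds_ratio:.2f}")
```

Note that the odds ratio of about 4.90 compares odds, not probabilities: the predicted probability for males is roughly 2.4 times that for females.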
I. BIVARIATE AND MULTIVARIATE EDA
Please refer to Appendix #B for full numerical data summaries. In heart disease cases, age ranges from 31 to 77, with a mean of 55.90; MaxHR ranges from 60 to 195, with a mean of 127.66; and RestingBP ranges from 0 to 200, with a mean of 134.19. In normal cases, age ranges from 28 to 76, with a mean of 50.55; MaxHR ranges from 69 to 202, with a mean of 148.15; and RestingBP ranges from 80 to 190, with a mean of 130.18. On average, then, subjects with heart disease are older, have a higher resting blood pressure, and reach a lower maximum heart rate. The boxplots reinforce this pattern: both Age and RestingBP have higher medians in heart disease cases (57 vs. 51 and 132 vs. 130), while MaxHR has a lower median (126 vs. 150).
II. MODELING & ASSESSMENT
| Step | Df | Deviance | Resid. Df | Resid. Dev | AIC |
|---|---|---|---|---|---|
| | NA | NA | 917 | 1262.136 | 1264.136 |
| + MaxHR | 1 | 159.11958 | 916 | 1103.017 | 1107.017 |
| + Sex | 1 | 57.84577 | 915 | 1045.171 | 1051.171 |
| + Age | 1 | 23.17421 | 914 | 1021.997 | 1029.997 |
The stepwise regression procedure added ‘MaxHR,’ ‘Sex,’ and ‘Age’ in turn, with each step lowering both the residual deviance and the AIC. The multiple logistic regression model carried into the assessment below includes all four predictors, ‘Sex,’ ‘Age,’ ‘MaxHR,’ and ‘RestingBP,’ to predict the binary response variable ‘HeartDisease.’ Refer to Appendix #C for the model’s statistical summary.
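The AIC and deviance columns of the stepwise table are linked by simple arithmetic, which the sketch below reproduces. It assumes the usual relationship for a logistic regression on individual (ungrouped) binary outcomes, where AIC equals the residual deviance plus twice the number of estimated coefficients. The helper name is hypothetical.

```python
def aic_from_deviance(residual_deviance, n_coefficients):
    """AIC for a binary logistic regression: residual deviance + 2 * coefficients."""
    return residual_deviance + 2 * n_coefficients

# (step, residual deviance, number of coefficients including the intercept),
# taken from the stepwise table.
steps = [
    ("intercept only", 1262.136, 1),
    ("+ MaxHR", 1103.017, 2),
    ("+ Sex", 1045.171, 3),
    ("+ Age", 1021.997, 4),
]

aics = [aic_from_deviance(dev, k) for _, dev, k in steps]

# Drop in residual deviance at each step (the 'Deviance' column of the table).
deviance_drops = [
    steps[i - 1][1] - steps[i][1] for i in range(1, len(steps))
]

for (label, dev, _), aic in zip(steps, aics):
    print(f"{label:15s} residual deviance = {dev:9.3f}  AIC = {aic:9.3f}")

# The four-predictor model in Appendix #C: residual deviance 1020.4, 5 coefficients.
aic_full = aic_from_deviance(1020.4, 5)
print(f"all four predictors: AIC = {aic_full:.1f}")
```

Each deviance drop is also the likelihood ratio statistic for the term added at that step, so the same table can be read as a sequence of one-degree-of-freedom tests.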
Linearity - The empirical logit plots, which examine the relationship between the log-odds of the response variable (HeartDisease) and each of the three quantitative variables (Age, MaxHR, and RestingBP), show approximately linear trends with no sign of curvature.
Independence - The variance inflation factors for all variables are less than two, which indicates no significant multicollinearity (please refer to Appendix #D). This, combined with the fact that each subject is a unique individual whose measurements do not depend on those of other subjects, satisfies the independence assumption.
Randomness - Due to the unclear data collection process, randomness is questionable.
Unusual Features - Given the Cook’s distance vs. leverage plot in Appendix #E, there is one data point (Subject #450) with unusually high leverage.
Data Transformation:
The data for subject #450 are excluded, and the multiple logistic regression model is refitted on the remaining data. Conditions are rechecked, and it is confirmed that there are no unusual features influencing the model, as shown in Appendix #E, Appendix #H, and the empirical logit plot below. Refer to Appendix #G for the statistical summary of the model.
IV. MODEL COMPARISONS
In the empirical logit plots, two plots show potential interaction effects: the lines for males and females have different slopes and appear to intersect at a point beyond the plotted range. This suggests that MaxHR and RestingBP may have varying effects on the likelihood of heart disease depending on gender.
To determine whether the interaction terms are significant and should be included in the multiple logistic regression, two models are created, with the interaction terms added one after another (refer to Appendix #I for the statistical summaries of the models). They are then compared through a likelihood ratio test to determine the final model.
Given the results of the likelihood ratio test, only Model 2 shows a significant improvement, with a chi-square value of 9.1147 and a p-value smaller than 0.05 (0.002 < 0.05). Therefore, there is enough evidence to reject the null hypothesis for this model. As such, the final model predicts ‘HeartDisease’ using all the main terms and the interaction term ‘RestingBP:Sex’.
\[ \begin{aligned} \text{Logit(HeartDisease)} = {} & -3.089 + 0.038(\text{Age}) - 0.030(\text{MaxHR}) + 0.032(\text{RestingBP}) \\ & + 5.733(\text{SexMale}) - 0.032(\text{RestingBP} \times \text{SexMale}) \end{aligned} \]
INTERPRETATION: The model indicates that each additional year of age multiplies the odds of heart disease by approximately 1.04. Conversely, each additional beat in maximum heart rate multiplies the odds by about 0.97, a decrease of roughly 3%. Because the model includes an interaction term, the remaining coefficients must be read together. For females, an increase of 1 mm Hg in resting blood pressure multiplies the odds of heart disease by about 1.03. For males, the interaction coefficient almost exactly cancels this effect (0.032 - 0.032 ≈ 0), so resting blood pressure has essentially no association with the odds of heart disease. The ‘SexMale’ coefficient on its own corresponds to an odds ratio of approximately 308.89, but this value applies only at a resting blood pressure of 0 mm Hg, which lies outside any realistic range; at a resting blood pressure of 130 mm Hg, for example, the model gives males roughly 5 times the odds of females.
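The odds ratios implied by the final model can be recomputed directly from the coefficients in Appendix #I. The sketch below is illustrative arithmetic only; the dictionary keys mirror the coefficient names in the R output, and the helper function is hypothetical.

```python
import math

# Coefficients of the final model (m4) as reported in Appendix #I.
COEF = {
    "Age": 0.038243,
    "MaxHR": -0.030160,
    "RestingBP": 0.031600,
    "SexMale": 5.733437,
    "RestingBP:SexMale": -0.031681,
}

# Odds ratios per one-unit increase in each predictor.
or_age = math.exp(COEF["Age"])
or_maxhr = math.exp(COEF["MaxHR"])

# With an interaction, the RestingBP slope differs by sex:
# females use the main effect, males use main effect + interaction.
or_bp_female = math.exp(COEF["RestingBP"])
or_bp_male = math.exp(COEF["RestingBP"] + COEF["RestingBP:SexMale"])

def male_vs_female_odds_ratio(resting_bp):
    """Odds ratio of males vs females at a given resting blood pressure (mm Hg)."""
    return math.exp(COEF["SexMale"] + COEF["RestingBP:SexMale"] * resting_bp)

print(f"Age: {or_age:.3f} per year")
print(f"MaxHR: {or_maxhr:.3f} per beat")
print(f"RestingBP: {or_bp_female:.3f} per mm Hg (female), {or_bp_male:.3f} (male)")
for bp in (0, 110, 130, 150):
    print(f"male vs female at RestingBP {bp}: {male_vs_female_odds_ratio(bp):.2f}")
```

The loop shows why the ‘SexMale’ coefficient should not be read in isolation: the male-to-female odds ratio shrinks as resting blood pressure rises, and the very large value appears only at the unrealistic reference point of 0 mm Hg.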
The objective of this study was to explore the effects of gender, age, maximum heart rate, and resting blood pressure on heart disease risk, with a particular focus on gender and its interaction with other factors. Using data from the UCI Machine Learning Repository and logistic regression models, the study confirmed the significant influence of these factors on heart disease and highlighted the interaction between resting blood pressure and gender. Key findings showed that both age and resting blood pressure were associated with increased heart disease risk and that, in the model with gender alone, males had about 4.9 times the odds of heart disease of females. In contrast, maximum heart rate had a negative coefficient: a lower maximum heart rate was associated with higher risk, reflecting cardiovascular weakness and reduced flexibility. The interaction term indicates that resting blood pressure has a weaker effect on heart disease risk for males than for females.
These results are consistent with existing research: increased age and lower maximum heart rate signify reduced cardiovascular function and efficiency, while high resting blood pressure exacerbates the risk by enlarging the left ventricle, making the heart less effective at pumping blood.8,12,13 Gender differences are also significant, with men being biologically predisposed to higher blood pressure and women experiencing a more direct effect of resting blood pressure, especially after menopause.12 This is due to the decline in estrogen levels after menopause, which increases women’s vulnerability to heart disease.12 As such, women would experience a more pronounced effect of resting blood pressure and, with it, a heightened risk of heart disease.
I. Limitations
This study has several limitations. First, the dataset combines records from the ‘Heart Disease’ and ‘Statlog (Heart)’ datasets in the UCI Machine Learning Repository. While the ‘Heart Disease’ dataset mentions data collected from patients undergoing angiography at specific medical sites, its sampling methods are unspecified. Additionally, the ‘Statlog (Heart)’ dataset lacks details about its data collection procedures, which introduces uncertainty into the interpretation of the results. Despite these issues, the study proceeded with caution. Furthermore, the dataset has a gender imbalance, with 725 male and 193 female participants, which may introduce bias and affect the reliability and interpretability of the results.
II. Implications
Future research should place emphasis on transparent data collection, including detailed sampling methods and patient demographics, to ensure valid and interpretable results. Given the varying effects of resting blood pressure on heart disease by gender, further studies should explore how biological and physiological factors interact with known risk factors, providing deeper insights into the relationship between heart disease and gender. Longitudinal studies would also help model the long-term effects of variables like maximum heart rate, resting blood pressure, and gender, improving the accuracy and reliability of predictions. Overall, the study emphasizes the impact of age, gender, maximum heart rate, and resting blood pressure on heart disease, particularly the varying gender-based effects of resting blood pressure. Addressing the research gaps and limitations discussed above would improve the accuracy and reliability of predictions, which can be used to develop better heart disease prevention and diagnosis strategies and thereby help reduce gender disparities in treatment.
A. Summary of Statistics for the model ‘m1’ predicting ‘HeartDisease’ with ‘Sex’
msummary(m1)
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0508 0.1643 -6.396 1.6e-10 ***
## SexMale 1.5904 0.1814 8.766 < 2e-16 ***
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1262.1 on 917 degrees of freedom
## Residual deviance: 1175.0 on 916 degrees of freedom
## AIC: 1179
##
## Number of Fisher Scoring iterations: 4
B. Bivariate Statistics: Age
| HeartDisease | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| No | 28 | 43 | 51 | 57 | 76 | 50.55122 | 9.444915 | 410 | 0 |
| Yes | 31 | 51 | 57 | 62 | 77 | 55.89961 | 8.727056 | 508 | 0 |
Bivariate Statistics: MaxHR
| HeartDisease | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| No | 69 | 134 | 150 | 165.00 | 202 | 148.1512 | 23.28807 | 410 | 0 |
| Yes | 60 | 112 | 126 | 144.25 | 195 | 127.6555 | 23.38692 | 508 | 0 |
Bivariate Statistics: RestingBP
| HeartDisease | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| No | 80 | 120 | 130 | 140 | 190 | 130.1805 | 16.49958 | 410 | 0 |
| Yes | 0 | 120 | 132 | 145 | 200 | 134.1850 | 19.82868 | 508 | 0 |
C. Summary of Statistics for the model ‘m2’ predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, and ‘MaxHR’
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.331579 0.915613 0.362 0.717
## Age 0.039206 0.009077 4.319 1.57e-05 ***
## SexMale 1.460560 0.195878 7.456 8.89e-14 ***
## RestingBP 0.005342 0.004270 1.251 0.211
## MaxHR -0.029503 0.003511 -8.403 < 2e-16 ***
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1262.1 on 917 degrees of freedom
## Residual deviance: 1020.4 on 913 degrees of freedom
## AIC: 1030.4
##
## Number of Fisher Scoring iterations: 4
D. Variance Inflation Factor Test
| Variable | VIF | Multicollinear |
|---|---|---|
| Age | 1.145 | No |
| Sex | 1.012 | No |
| RestingBP | 1.061 | No |
| MaxHR | 1.077 | No |
E. Cook’s Dist vs Leverage for m2 (Left) and m3 (Right)
F. Data Point Corresponding to Extreme High Leverage Point
extremely_high_leverage_point
## 450
## 450
G. Summary of Statistics for the model ‘m3’ excluding data point #450 predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, and ‘MaxHR’
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.183637 0.921671 0.199 0.842
## SexMale 1.458917 0.196134 7.438 1.02e-13 ***
## Age 0.038169 0.009105 4.192 2.77e-05 ***
## MaxHR -0.029682 0.003518 -8.437 < 2e-16 ***
## RestingBP 0.007050 0.004447 1.585 0.113
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1261.0 on 916 degrees of freedom
## Residual deviance: 1018.2 on 912 degrees of freedom
## AIC: 1028.2
##
## Number of Fisher Scoring iterations: 4
H. VIF Test of m3
| Variable | VIF | Multicollinear |
|---|---|---|
| Sex | 1.012 | No |
| Age | 1.151 | No |
| MaxHR | 1.078 | No |
| RestingBP | 1.067 | No |
I. Summary of Statistics for the model ‘m4’ predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, ‘MaxHR’, and ‘RestingBP:SexMale’.
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.089364 1.484244 -2.081 0.037394 *
## Age 0.038243 0.009142 4.183 2.88e-05 ***
## MaxHR -0.030160 0.003544 -8.511 < 2e-16 ***
## RestingBP 0.031600 0.009742 3.244 0.001180 **
## SexMale 5.733437 1.488778 3.851 0.000118 ***
## RestingBP:SexMale -0.031681 0.010862 -2.917 0.003538 **
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1261 on 916 degrees of freedom
## Residual deviance: 1009 on 911 degrees of freedom
## AIC: 1021
##
## Number of Fisher Scoring iterations: 4
Summary of Statistics for the model ‘m5’ predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, ‘MaxHR’, ‘MaxHR:SexMale’, ‘SexFemale:RestingBP’, and ‘SexMale:RestingBP’.
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4278476 1.7463477 -2.535 0.01123 *
## Age 0.0376766 0.0091539 4.116 3.86e-05 ***
## MaxHR -0.0193601 0.0081847 -2.365 0.01801 *
## SexMale 7.4402550 1.8961840 3.924 8.72e-05 ***
## MaxHR:SexMale -0.0130733 0.0090073 -1.451 0.14666
## SexFemale:RestingBP 0.0305362 0.0094510 3.231 0.00123 **
## SexMale:RestingBP -0.0002518 0.0050365 -0.050 0.96013
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1261 on 916 degrees of freedom
## Residual deviance: 1007 on 910 degrees of freedom
## AIC: 1021
##
## Number of Fisher Scoring iterations: 4