Predicting Heart Disease

Abstract

This study investigates the effects of gender, age, maximum heart rate, and resting blood pressure on heart disease prediction. Using a combined dataset on Kaggle sourced from the UCI Machine Learning Repository, the study applied statistical analysis to (1) explore the individual effect of gender on predicting heart disease, (2) examine the combined effects of gender, age, maximum heart rate, and resting blood pressure on heart disease, and (3) analyze how gender modifies the effects of the other variables on heart disease. Results highlight the significance of gender, age, maximum heart rate, and resting blood pressure in predicting heart disease. Moreover, they reveal that the effect of resting blood pressure on heart disease varies by gender. These findings point to the need for future research to expand on the current study, analyze gender-specific biological and physiological factors in predicting heart disease, and conduct longitudinal studies to enhance prediction accuracy and reliability.

Background & Significance

Heart disease has been the leading cause of death in the United States since the 1950s, claiming a life every 33 seconds.¹⁰ Worldwide, approximately 17.9 million people die from heart disease each year.¹⁵ Heart disease, also known as cardiovascular disease, is a range of disorders affecting the heart and its related blood vessels. Recent statistics highlight its growing prominence and project that the situation will worsen in the coming decades.¹ In 2023, heart disease accounted for more than one in four deaths among Americans.¹¹ Additionally, the U.S. healthcare system spends over $200 billion yearly on hospital care and medications related to cardiovascular diseases.⁶,¹¹ Beyond the substantial death toll, millions of survivors endure chronic health problems or are at risk of developing them.⁶

Through decades-long research, scientists have determined many risk factors for heart disease, including smoking, high blood pressure, high cholesterol, gender, other existing health conditions, and age.⁴,⁶ However, with so many potential risk factors and the overlap of many symptoms with other diseases, detecting heart disease remains a challenge for both physicians and researchers.² One potential solution to this challenge is the use of statistical models to identify key characteristics commonly associated with individuals suffering from heart disease and predict the likelihood of developing the condition.

This study uses a dataset on Kaggle compiled from the University of California, Irvine’s Machine Learning Repository to analyze the relationship between heart disease and four specific risk factors: gender, age, maximum heart rate, and resting blood pressure. The study aims to address the following key questions:

How does gender affect the likelihood of heart disease?
What is the combined effect of gender, age, maximum heart rate, and resting blood pressure on the likelihood of heart disease?
Do the effects of age, maximum heart rate, and resting blood pressure on the likelihood of heart disease vary based on gender?

The null hypotheses tested throughout this study are as follows:

Null Hypothesis #1: Gender does not affect the likelihood of heart disease in any way.
Null Hypothesis #2: There is no meaningful effect of gender, age, maximum heart rate, and resting blood pressure on the likelihood of heart disease.
Null Hypothesis #3: The effects of age, maximum heart rate, and resting blood pressure on the likelihood of heart disease do not vary based on gender.

Methods

I. Data

The dataset was sourced from Kaggle and is a cleaned compilation of five heart disease datasets collected at sites in the United States, Hungary, and Switzerland.³ The individual datasets come from two databases within the UCI Machine Learning Repository. The first database, “Heart Disease,” consists of four datasets of patients undergoing angiography at the Cleveland Clinic, the Hungarian Institute of Cardiology, the University Hospitals in Zurich and Basel, and the Veterans Administration Medical Center.⁸ The second database, “Statlog (Heart),” does not provide specific details about its data collection process.

II. Variables

Glimpse of Data ‘Heart’
HeartDisease	Age	RestingBP	MaxHR	Sex
No	40	140	172	Male
Yes	49	160	156	Female
No	37	130	98	Male
Yes	48	138	108	Female
No	54	150	122	Male
No	39	120	170	Male

The variables are described below:

‘HeartDisease’: The binary response variable of the dataset. 1 (shown as ‘Yes’ above) indicates the presence of heart disease and 0 (shown as ‘No’) indicates its absence.

‘Age’: An explanatory variable representing the age of a subject in years.

‘Sex’: An explanatory, categorical variable representing the gender of a subject in two levels: Male and Female.

‘MaxHR’: An explanatory variable measuring the maximum heart rate achieved by a subject.

‘RestingBP’: An explanatory variable measuring the resting blood pressure in millimeters of mercury (mm Hg).

III. Statistical Procedures:

The data was analyzed in two primary sections. First, univariate and bivariate exploratory data analyses were conducted to examine the relationship between the categorical predictor ‘Sex’ and the response variable ‘HeartDisease,’ using a bar graph and a mosaic plot. A simple logistic regression was then used to model the relationship, and a likelihood ratio test was conducted to evaluate its significance. In the second section, boxplots were used to perform a bivariate exploratory data analysis of ‘Age,’ ‘RestingBP,’ and ‘MaxHR’ and their relationships with ‘HeartDisease.’ A stepwise regression procedure was used to search for the best-fitting model, judged by the smallest AIC and residual deviance. To assess the model’s linearity and independence (including multicollinearity), empirical logit plots and a variance inflation factor (VIF) test were used. The empirical logit plots were also used to identify potential interaction terms, which were then fitted in additional models. These models were compared with the original model using likelihood ratio tests, and the results determined the final model used for further analysis.

Likelihood Ratio Test (LRT)

Null Hypothesis: Model # is no different from the base model.
Alternative Hypothesis: Model # is significantly better at predicting ‘HeartDisease’ than the base model.

Variance Inflation Factor Test (VIF)

Detects multicollinearity among the predictors. In this study, a VIF above 2 is treated as indicating multicollinearity.
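
For intuition, under the usual linear-model definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others, a VIF can be converted back into the R² it implies. The Python sketch below applies that conversion to the VIF values listed in Appendix #D. It is an illustration only: the appendix output comes from R, the helper name `implied_r2` is hypothetical, and software may compute VIFs for a logistic model slightly differently, so the implied R² values should be read as approximations.

```python
def implied_r2(vif: float) -> float:
    """R^2 of one predictor regressed on the others, assuming VIF = 1 / (1 - R^2)."""
    if vif < 1:
        raise ValueError("a VIF cannot be below 1")
    return 1.0 - 1.0 / vif

# VIF values copied from Appendix #D
vifs = {"Age": 1.145, "Sex": 1.012, "RestingBP": 1.061, "MaxHR": 1.077}
threshold = 2.0  # cutoff used in this study

for name, vif in vifs.items():
    flag = "Yes" if vif > threshold else "No"
    print(f"{name:10s} VIF={vif:.3f}  implied R^2={implied_r2(vif):.3f}  multicollinear={flag}")
```

Under this definition, the cutoff of 2 corresponds to half of a predictor's variance being explained by the other predictors.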

Results

A. SIMPLE LOGISTIC REGRESSION

I. EDA

The dataset contains 725 males and 193 females. Of the 410 subjects without heart disease, 143 are female and 267 are male. Of the 508 subjects with heart disease, 50 are female and 458 are male.

II. Modeling & Assessment

Refer to Appendix #A for statistical summary of the simple logistic regression model.

Linearity - Automatically satisfied, since the model includes only a categorical predictor with no numerical ordering.
Independence - Satisfied, given that each subject is a unique individual with their own physical and health characteristics.
Randomness - Considered questionable due to the data collection process, so the analysis proceeds with caution.

A likelihood ratio test was used to assess the model’s significance.

Model 1 (null model): HeartDisease ~ 1
Model 2: HeartDisease ~ Sex

The deviance falls from 1262.1 for the null model to 1175.0 for Model 2 (Appendix #A), which gives a chi-square value of approximately 87.1 on 1 degree of freedom and a p-value near 0. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the model predicting ‘HeartDisease’ with ‘Sex’ is statistically significant, warranting its use for further analysis.
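
The likelihood ratio statistic is the drop in deviance between the two models, and with a single binary predictor both deviances can be recomputed directly from the counts given in the EDA above. The Python sketch below does this with the standard library only. It is an illustration rather than the original R workflow, and the function name `binom_deviance` is hypothetical. The two deviances it prints can be compared against the null and residual deviances listed in Appendix #A.

```python
import math

# Counts from the EDA section: (heart disease, no heart disease) by sex
counts = {"Female": (50, 143), "Male": (458, 267)}

def binom_deviance(groups):
    """-2 * log-likelihood when each group gets its own fitted probability."""
    total = 0.0
    for yes, no in groups:
        p = yes / (yes + no)
        total += yes * math.log(p) + no * math.log(1 - p)
    return -2.0 * total

yes_all = sum(y for y, _ in counts.values())
no_all = sum(n for _, n in counts.values())

null_dev = binom_deviance([(yes_all, no_all)])     # HeartDisease ~ 1
resid_dev = binom_deviance(list(counts.values()))  # HeartDisease ~ Sex
lrt_stat = null_dev - resid_dev                    # chi-square statistic on 1 df

# For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lrt_stat / 2))

print(f"null deviance     = {null_dev:.1f}")
print(f"residual deviance = {resid_dev:.1f}")
print(f"LRT statistic     = {lrt_stat:.2f}, p = {p_value:.3g}")
```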

\[ \text{Logit(HeartDisease)} = -1.051 + 1.590(\text{SexMale}) \]

INTERPRETATION: The model predicts that the odds of heart disease for males are approximately 4.90 times the odds for females, since exp(1.590) ≈ 4.90.
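
Because the only predictor is binary, the fitted coefficients can be recovered by hand from the 2×2 counts in the EDA: the intercept is the log-odds of heart disease among females (the reference level), and the ‘SexMale’ coefficient is the log of the male-to-female odds ratio. The short Python sketch below shows this arithmetic. It is a cross-check under that assumption, not the original R code, and the variable names are illustrative.

```python
import math

# Counts from the EDA section
female_yes, female_no = 50, 143
male_yes, male_no = 458, 267

# With one binary predictor, the fitted logit of each group equals its observed log-odds
intercept = math.log(female_yes / female_no)         # log-odds for females (reference level)
sex_male = math.log(male_yes / male_no) - intercept  # log odds ratio, male vs. female
odds_ratio = math.exp(sex_male)

print(f"Intercept  = {intercept:.4f}")
print(f"SexMale    = {sex_male:.4f}")
print(f"Odds ratio = {odds_ratio:.2f}")
```

The two coefficients can be compared with the estimates in Appendix #A. The odds ratio computed from the unrounded counts may differ in the second decimal place from the value obtained by exponentiating the rounded coefficient.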

B. MULTIPLE LOGISTIC REGRESSION

I. BIVARIATE AND MULTIVARIATE EDA

Please refer to Appendix #B for full numerical data summaries. In heart disease cases, age ranges from 31 to 77, with a mean of 55.90. MaxHR ranges from 60 to 195, with a mean of 127.66. RestingBP ranges from 0 to 200, with a mean of 134.19. In normal cases, age ranges from 28 to 76, with a mean of 50.55. MaxHR ranges from 69 to 202, with a mean of 148.15. RestingBP ranges from 80 to 190, with a mean of 130.18. The boxplots reinforce these differences: both Age and RestingBP have higher medians in heart disease cases (57 vs. 51 and 132 vs. 130, respectively), while MaxHR has a lower median in heart disease cases (126 vs. 150).

II. MODELING & ASSESSMENT

Stepwise Selection Results
Step	Df	Deviance	Resid. Df	Resid. Dev	AIC
(null model)	NA	NA	917	1262.136	1264.136
+ MaxHR	1	159.11958	916	1103.017	1107.017
+ Sex	1	57.84577	915	1045.171	1051.171
+ Age	1	23.17421	914	1021.997	1029.997

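The stepwise regression procedure added ‘MaxHR,’ ‘Sex,’ and ‘Age’ in that order, lowering the AIC from 1264.136 to 1029.997; ‘RestingBP’ does not appear among the selected steps. Because resting blood pressure is one of the four risk factors under study, the multiple logistic regression model analyzed here retains all four predictors (‘Sex,’ ‘Age,’ ‘MaxHR,’ and ‘RestingBP’) to predict the binary response variable ‘HeartDisease.’ Its AIC of 1030.4 is nearly identical to that of the stepwise result. Refer to Appendix #C for the model’s statistical summary.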
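
For ungrouped binary data, the residual deviance equals -2 times the log-likelihood, so the AIC is the residual deviance plus 2 for each estimated coefficient. The Python sketch below re-derives the AIC column of the stepwise table from the residual deviances, and does the same for the four-predictor model in Appendix #C. The helper name `aic` is hypothetical and the numbers are copied from the tables, so this is a consistency check rather than a refit.

```python
# (step, residual deviance, number of estimated coefficients including the intercept)
steps = [
    ("null model", 1262.136, 1),
    ("+ MaxHR", 1103.017, 2),
    ("+ Sex", 1045.171, 3),
    ("+ Age", 1021.997, 4),
]

def aic(resid_dev: float, n_params: int) -> float:
    """AIC for ungrouped binary data: residual deviance plus 2 per coefficient."""
    return resid_dev + 2 * n_params

for name, dev, k in steps:
    print(f"{name:11s} deviance={dev:9.3f}  AIC={aic(dev, k):9.3f}")

# Four-predictor model from Appendix #C (5 coefficients)
full_dev = 1020.4
print(f"{'full model':11s} deviance={full_dev:9.3f}  AIC={aic(full_dev, 5):9.3f}")
```

The comparison shows why adding a fourth predictor barely moves the AIC: the deviance falls by less than the penalty of 2 for the extra coefficient.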

Linearity - The empirical logit plots, which examine the relationship between the log-odds of the response variable (HeartDisease) and each of the three quantitative variables (Age, MaxHR, and RestingBP), show approximately linear trends with no signs of curvature.
Independence - The variance inflation factors for all variables are less than two, which indicates no significant multicollinearity (please refer to Appendix #D). This, combined with the fact that each subject is a unique individual whose outcome does not depend on other subjects, satisfies the independence assumption.
Randomness - Due to unclear data collection, randomness is questionable.
Unusual Features - The Cook’s distance vs. leverage plot in Appendix #E shows one data point (Subject #450) with unusually high leverage.

Data Transformation:
Subject #450 is excluded and the multiple logistic regression is refitted. The conditions are rechecked, and no unusual features remain that influence the model, as shown in Appendix #E and #H and in the empirical logit plot below. Refer to Appendix #G for the statistical summary of the refitted model.

III. MODEL COMPARISONS

In the empirical logit plots, two plots suggest potential interaction effects: the lines for males and females are not parallel and appear to intersect at a point beyond the plotted range. This suggests that MaxHR and RestingBP may have different effects on the likelihood of heart disease depending on gender.

To determine whether the interaction terms are significant and should be included in the multiple logistic regression, two models are created by adding the interaction terms one at a time (refer to Appendix #I for the statistical summaries of the models). They are then compared through likelihood ratio tests to determine the final model.

Model 1: HeartDisease ~ Sex + Age + MaxHR + RestingBP
Model 2: HeartDisease ~ Sex + Age + MaxHR + RestingBP + RestingBP:Sex
Model 3: HeartDisease ~ Sex + Age + MaxHR + RestingBP + RestingBP:Sex + MaxHR:Sex

In the likelihood ratio tests, only Model 2 shows a significant improvement, with a chi-square value of 9.1147 and a p-value smaller than 0.05 (p ≈ 0.0025). Therefore, there is enough evidence to reject the null hypothesis for Model 2, but not for Model 3. As such, the final model predicts ‘HeartDisease’ using all the main terms and the interaction term ‘RestingBP:Sex’.
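
Each of these comparisons adds a single interaction coefficient, so the test statistic is referred to a chi-square distribution with 1 degree of freedom, whose upper-tail probability has the closed form erfc(sqrt(x / 2)). The Python sketch below uses that identity to recover the p-values. It assumes Model 3 is compared against Model 2, and because the statistic for that comparison is not reported in the text, it is approximated from the rounded residual deviances in Appendix #I. The function name `chi2_sf_1df` is hypothetical.

```python
import math

def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Model 2 vs. Model 1: statistic reported in the text
p_model2 = chi2_sf_1df(9.1147)

# Model 3 vs. Model 2: rough statistic from the rounded residual deviances
# in Appendix #I (1009 - 1007), so this is only an approximate check
p_model3 = chi2_sf_1df(1009 - 1007)

print(f"Model 2 vs. Model 1: p = {p_model2:.4f}")
print(f"Model 3 vs. Model 2: p = {p_model3:.3f} (approximate)")
```

The second p-value is in line with the non-significant ‘MaxHR:SexMale’ coefficient in the Appendix #I summary.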

\[ \begin{aligned} \text{Logit(HeartDisease)} = -3.089 + 0.038(\text{Age}) - 0.030(\text{MaxHR}) + 0.032(\text{RestingBP}) \\ + 5.733(\text{SexMale}) - 0.032(\text{RestingBP} \times \text{SexMale}) \end{aligned} \]

INTERPRETATION: All effects are expressed as multiplicative changes in the odds of heart disease, holding the other variables constant. Each additional year of age multiplies the odds by approximately 1.04 (exp(0.038)). Each additional beat in maximum heart rate multiplies the odds by about 0.97 (exp(-0.030)), a reduction of roughly 3%. Because the model contains the interaction term, the effect of resting blood pressure depends on sex: for females, each additional 1 mm Hg multiplies the odds by about 1.03 (exp(0.032)), whereas for males the net slope is 0.032 - 0.032 ≈ 0, so resting blood pressure has almost no effect on the odds. For the same reason, the ‘SexMale’ coefficient on its own (exp(5.733) ≈ 308.89) is the male-to-female odds ratio only at a resting blood pressure of 0, which lies outside the plausible range. At a resting blood pressure of 130 mm Hg, for example, the odds for males are roughly 5 times those for females (exp(5.733 - 0.0317 × 130) ≈ 5.0).
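
With an interaction term in the model, the coefficients for ‘RestingBP’ and ‘SexMale’ cannot be exponentiated in isolation: the resting blood pressure slope for males is the sum of the main effect and the interaction, and the male-to-female odds ratio changes with resting blood pressure. The Python sketch below works through this arithmetic using the estimates in Appendix #I. The helper `male_vs_female_or` and the example blood pressure values are illustrative choices, not part of the original analysis.

```python
import math

# Coefficients of the final model (Appendix #I, model m4)
coef = {
    "Age": 0.038243,
    "MaxHR": -0.030160,
    "RestingBP": 0.031600,
    "SexMale": 5.733437,
    "RestingBP:SexMale": -0.031681,
}

# Odds ratios per one-unit increase for terms not involved in the interaction
or_age = math.exp(coef["Age"])
or_maxhr = math.exp(coef["MaxHR"])

# The RestingBP slope depends on sex because of the interaction term
or_bp_female = math.exp(coef["RestingBP"])
or_bp_male = math.exp(coef["RestingBP"] + coef["RestingBP:SexMale"])

def male_vs_female_or(resting_bp: float) -> float:
    """Male-to-female odds ratio at a given resting blood pressure (same Age and MaxHR)."""
    return math.exp(coef["SexMale"] + coef["RestingBP:SexMale"] * resting_bp)

print(f"Age:       odds ratio per year    = {or_age:.3f}")
print(f"MaxHR:     odds ratio per beat    = {or_maxhr:.3f}")
print(f"RestingBP: odds ratio per mm Hg   = {or_bp_female:.3f} (females), {or_bp_male:.3f} (males)")
for bp in (0, 120, 130, 140):  # 0 is shown only to explain the raw SexMale coefficient
    print(f"Male vs. female odds ratio at RestingBP {bp:3d} = {male_vs_female_or(bp):.2f}")
```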

Discussion

The objective of this study was to explore the effects of gender, age, maximum heart rate, and resting blood pressure on heart disease risk, with a particular focus on gender and its interaction with other factors. Using data from the UCI Machine Learning Repository and logistic regression models, the study confirmed the significant influence of these factors on heart disease and highlighted the interaction between resting blood pressure and gender. Key findings showed that both age and resting blood pressure increased heart disease risk, and that in the simple model with gender alone the odds of heart disease for males were about 4.9 times those for females. In contrast, maximum heart rate had a negative coefficient: a lower maximum heart rate was associated with higher risk, reflecting cardiovascular weakness and reduced flexibility. The interaction term indicates that resting blood pressure has a weaker effect on heart disease risk for males than for females.

These results are consistent with existing research: increased age and lower maximum heart rate signify reduced cardiovascular function and efficiency, while high resting blood pressure exacerbates the risk by enlarging the left ventricle, making the heart less effective at pumping blood.⁸,¹²,¹³ Gender differences are also significant, with men being biologically predisposed to higher blood pressure and women experiencing a more direct effect of resting blood pressure, especially after menopause.¹² This is attributed to the decline in estrogen levels after menopause, which increases women’s vulnerability to heart disease.¹² As such, women would experience a more pronounced effect of resting blood pressure and thus a heightened risk of heart disease.

I. Limitations

This study has several limitations. First, the dataset combines records from the ‘Heart Disease’ and ‘Statlog (Heart)’ datasets in the UCI Machine Learning Repository. While the ‘Heart Disease’ dataset mentions data collected from patients undergoing angiography at specific medical sites, its sampling methods are unspecified. Additionally, the ‘Statlog (Heart)’ dataset lacks details about its data collection procedures, which introduces uncertainty into the interpretation of the results. Despite these issues, the study proceeded with caution. Furthermore, the dataset has a gender imbalance, with 725 male and 193 female participants, which may introduce bias and affect the reliability and interpretability of the results.

II. Implications

Future research should emphasize transparent data collection, including detailed sampling methods and patient demographics, to ensure valid and interpretable results. Given the varying effects of resting blood pressure on heart disease by gender, further studies should explore how biological and physiological factors interact with known risk factors, providing deeper insight into the relationship between heart disease and gender. Longitudinal studies would also help model the long-term effects of variables like maximum heart rate, resting blood pressure, and gender, improving the accuracy and reliability of predictions. Overall, the study emphasizes the impact of age, gender, maximum heart rate, and resting blood pressure on heart disease, particularly the varying gender-based effects of resting blood pressure. Addressing the research gaps and limitations discussed above would improve the accuracy and reliability of predictions, which can be used to develop better heart disease prevention and diagnosis strategies and thereby also reduce gender disparities in treatment.

Appendix

A. Summary of Statistics for the model ‘m1’ predicting ‘HeartDisease’ with ‘Sex’

msummary(m1)

## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.0508     0.1643  -6.396  1.6e-10 ***
## SexMale       1.5904     0.1814   8.766  < 2e-16 ***
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1262.1  on 917  degrees of freedom
## Residual deviance: 1175.0  on 916  degrees of freedom
## AIC: 1179
## 
## Number of Fisher Scoring iterations: 4

B. Bivariate Statistics: Age

HeartDisease	min	Q1	median	Q3	max	mean	sd	n	missing
No	28	43	51	57	76	50.55122	9.444915	410	0
Yes	31	51	57	62	77	55.89961	8.727056	508	0

Bivariate Statistics: MaxHR

HeartDisease	min	Q1	median	Q3	max	mean	sd	n	missing
No	69	134	150	165.00	202	148.1512	23.28807	410	0
Yes	60	112	126	144.25	195	127.6555	23.38692	508	0

Bivariate Statistics: RestingBP

HeartDisease	min	Q1	median	Q3	max	mean	sd	n	missing
No	80	120	130	140	190	130.1805	16.49958	410	0
Yes	0	120	132	145	200	134.1850	19.82868	508	0

C. Summary of Statistics for the model ‘m2’ predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, and ‘MaxHR’

## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.331579   0.915613   0.362    0.717    
## Age          0.039206   0.009077   4.319 1.57e-05 ***
## SexMale      1.460560   0.195878   7.456 8.89e-14 ***
## RestingBP    0.005342   0.004270   1.251    0.211    
## MaxHR       -0.029503   0.003511  -8.403  < 2e-16 ***
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1262.1  on 917  degrees of freedom
## Residual deviance: 1020.4  on 913  degrees of freedom
## AIC: 1030.4
## 
## Number of Fisher Scoring iterations: 4

D. Variance Inflation Factor Test

Variance Inflation Factor (VIF) for Predictors and Multicollinearity Status
	VIF	Multicollinear
Age	1.145	No
Sex	1.012	No
RestingBP	1.061	No
MaxHR	1.077	No

E. Cook’s Dist vs Leverage for m2 (Left) and m3 (Right)

F. Data Point Corresponding to Extreme High Leverage Point

extremely_high_leverage_point

## 450 
## 450

G. Summary of Statistics for the model ‘m3’ excluding data point #450 predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, and ‘MaxHR’

## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.183637   0.921671   0.199    0.842    
## SexMale      1.458917   0.196134   7.438 1.02e-13 ***
## Age          0.038169   0.009105   4.192 2.77e-05 ***
## MaxHR       -0.029682   0.003518  -8.437  < 2e-16 ***
## RestingBP    0.007050   0.004447   1.585    0.113    
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1261.0  on 916  degrees of freedom
## Residual deviance: 1018.2  on 912  degrees of freedom
## AIC: 1028.2
## 
## Number of Fisher Scoring iterations: 4

H. VIF Test of m3

Variance Inflation Factor (VIF) for Predictors and Multicollinearity Status
	VIF	Multicollinear
Sex	1.012	No
Age	1.151	No
MaxHR	1.078	No
RestingBP	1.067	No

I. Summary of Statistics for the model ‘m4’ predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, ‘MaxHR’, and ‘RestingBP:SexMale’.

## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -3.089364   1.484244  -2.081 0.037394 *  
## Age                0.038243   0.009142   4.183 2.88e-05 ***
## MaxHR             -0.030160   0.003544  -8.511  < 2e-16 ***
## RestingBP          0.031600   0.009742   3.244 0.001180 ** 
## SexMale            5.733437   1.488778   3.851 0.000118 ***
## RestingBP:SexMale -0.031681   0.010862  -2.917 0.003538 ** 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1261  on 916  degrees of freedom
## Residual deviance: 1009  on 911  degrees of freedom
## AIC: 1021
## 
## Number of Fisher Scoring iterations: 4

Summary of Statistics for the model ‘m5’ predicting ‘HeartDisease’ with ‘Sex’, ‘Age’, ‘RestingBP’, ‘MaxHR’, ‘MaxHR:SexMale’, ‘SexFemale:RestingBP’, and ‘SexMale:RestingBP’.

## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -4.4278476  1.7463477  -2.535  0.01123 *  
## Age                  0.0376766  0.0091539   4.116 3.86e-05 ***
## MaxHR               -0.0193601  0.0081847  -2.365  0.01801 *  
## SexMale              7.4402550  1.8961840   3.924 8.72e-05 ***
## MaxHR:SexMale       -0.0130733  0.0090073  -1.451  0.14666    
## SexFemale:RestingBP  0.0305362  0.0094510   3.231  0.00123 ** 
## SexMale:RestingBP   -0.0002518  0.0050365  -0.050  0.96013    
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1261  on 916  degrees of freedom
## Residual deviance: 1007  on 910  degrees of freedom
## AIC: 1021
## 
## Number of Fisher Scoring iterations: 4

BIBLIOGRAPHY

1. Bozkurt, B., Ahmad, T., Alexander, K., Baker, W. L., Bosak, K., Breathett, K., Carter, S., Drazner, M. H., Dunlay, S. M., Fonarow, G. C., Greene, S. J., Heidenreich, P., Ho, J. E., Hsich, E., Ibrahim, N. E., Jones, L. M., Khan, S. S., Khazanie, P., Koelling, T., Lee, C. S., … WRITING COMMITTEE MEMBERS (2024). HF STATS 2024: Heart Failure Epidemiology and Outcomes Statistics An Updated 2024 Report from the Heart Failure Society of America. Journal of cardiac failure, S1071-9164(24)00232-X. Advance online publication. https://doi.org/10.1016/j.cardfail.2024.07.001
2. Brunner-La Rocca, H. P., Fleischhacker, L., Golubnitschaja, O., Heemskerk, F., Helms, T., Hoedemakers, T., Allianses, S. H., Jaarsma, T., Kinkorova, J., Ramaekers, J., Ruff, P., Schnur, I., Vanoli, E., Verdu, J., & Zippel-Schultz, B. (2016). Challenges in personalised management of chronic diseases-heart failure as prominent example to advance the care process. The EPMA journal, 7(1), 2. https://doi.org/10.1186/s13167-016-0051-9
3. Fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [Date Retrieved] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.
4. Fryar CD, Chen T-C, Li X. Prevalence of uncontrolled risk factors for cardiovascular disease: United States, 1999–2010. NCHS Data Brief. 2012;(103):1–8.
5. Hajar R. (2017). Risk Factors for Coronary Artery Disease: Historical Perspectives. Heart views : the official journal of the Gulf Heart Association, 18(3), 109–114. https://doi.org/10.4103/HEARTVIEWS.HEARTVIEWS_106_17
6. Institute of Medicine (US) Committee on a National Surveillance System for Cardiovascular and Select Chronic Diseases. A Nationwide Framework for Surveillance of Cardiovascular and Chronic Lung Diseases. Washington (DC): National Academies Press (US); 2011. 2, Cardiovascular Disease. Available from: https://www.ncbi.nlm.nih.gov/books/NBK83160/
7. Jakovljevic D. G. (2018). Physical activity and cardiovascular aging: Physiological and molecular insights. Experimental gerontology, 109, 67–74. https://doi.org/10.1016/j.exger.2017.05.016
8. Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1989). Heart Disease [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.
9. Mayo Clinic Staff. (2023). High blood pressure dangers: Hypertension’s effects on your body. Mayo Clinic. https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/in-depth/high-blood-pressure/art-20045868
10. National Center for Health Statistics. Multiple Cause of Death 2018–2022 on CDC WONDER Database. Accessed May 3, 2024. https://wonder.cdc.gov/mcd.html
11. Nora Eccles Harrison Cardiovascular Research & Training Institute. (n.d.). 2024 Heart Disease Statistics and Their Implications for Public Health. CVRTI. https://cvrti.utah.edu/2024-heart-disease-statistics-and-their-implications/
12. Reckelhoff J. F. (2001). Gender differences in the regulation of blood pressure. Hypertension (Dallas, Tex. : 1979), 37(5), 1199–1208. https://doi.org/10.1161/01.hyp.37.5.1199
13. Saheera, S., & Krishnamurthy, P. (2020). Cardiovascular Changes Associated with Hypertensive Heart Disease and Aging. Cell transplantation, 29, 963689720920830. https://doi.org/10.1177/0963689720920830
14. Statlog (Heart) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C57303.
15. World Health Organization. (2021). Cardiovascular diseases (CVDs) [Fact sheet].