Unlocking Predictive Insights of Heart Attack Risk

Final Report
Data Science 1 with R (STAT 301-1)

Author

Louise Oh

Published

February 2, 2024

Github Repo Link

https://github.com/stat301-1-2023-fall/final-project-1-louiseoh2025.git

Introduction

The Heart Attack Risk Prediction Dataset ¹ provides a comprehensive array of features relevant to heart health, encompassing patient-specific details, lifestyle choices, health indicators, and socioeconomic factors.

This exploratory data analysis aims to unlock predictive insights into heart attack risk by examining whether patient characteristics, lifestyle factors, or socioeconomic status are correlated with it, as a first step toward identifying potential causes. The goal is to uncover how these variables interact to influence the probability of a heart attack. Given the ongoing global concern surrounding heart attacks, it is crucial to gain a deeper understanding of the factors that contribute to their occurrence and of potential ways to reduce the risks.

Data Overview & Quality

The dataset has 26 variables and 8763 observations. There is one unique ID, 12 numerical variables, and 13 categorical variables. Figure 1 below is a codebook table listing the variables and their descriptions.

Figure 1: Heart Attack Data Codebook

Binary (1/0) and other categorical variables were converted into factors, and numerical variables were converted into integers or doubles. The new variable types are used throughout the explorations. There were no missingness issues in the dataset, but some variable types needed to be mutated. Blood pressure, recorded in systolic/diastolic format, was separated into two new variables, and blood pressure levels were mutated into categories of normal, elevated, high, and hypertension.
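The blood pressure separation described above can be sketched as follows. This is a minimal illustration in Python (the report's own cleaning was done in R), and it assumes each reading is stored as a string such as "158/88". The function name and the sample readings are hypothetical and are not taken from the dataset.

```python
from typing import Tuple


def split_blood_pressure(reading: str) -> Tuple[int, int]:
    """Split a 'systolic/diastolic' string into two integers."""
    systolic_text, diastolic_text = reading.split("/")
    # strip() tolerates stray spaces around the slash, e.g. "120 / 80"
    return int(systolic_text.strip()), int(diastolic_text.strip())


# Hypothetical readings for illustration only
readings = ["158/88", "165/93", "120 / 80"]
parsed = [split_blood_pressure(r) for r in readings]
print(parsed)
```

Keeping the two values as separate integer variables is what makes the later grouping by blood pressure level possible.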

While the dataset structure had no issues that may impact the analysis, the data itself is synthetic rather than collected from real patients. Because it is unethical to share and use patient data without consent, a well-crafted synthetic dataset was used instead. This may affect the outcomes, trends, and conclusions of the exploratory data analysis, which may not be generalizable to the real world. Despite this limited validity, this exploratory data analysis intends to establish a framework and preliminary exploration for future studies utilizing authentic patient data.

Explorations

First, it is important to note that there are more patients without heart attack risk than with heart attack risk in this dataset. Figure 2 below shows that there are 5624 patients without heart attack risk and 3139 patients with heart attack risk.

Figure 2: Number of Patients by Heart Attack Risk

Therefore, most of the following analyses involving more than one variable will be performed by comparing the proportions of heart attack risks of patients by variable, instead of comparing their counts.
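To illustrate why proportions are compared rather than counts, the sketch below computes the overall share of at-risk patients from the two counts reported above, then the at-risk share within each level of a grouping variable. It is written in Python for illustration only. The column names `smoking` and `heart_attack_risk` and the toy records are assumptions, not the dataset's actual contents.

```python
from collections import Counter


def risk_share_by_group(records, group_key, risk_key="heart_attack_risk"):
    """Return, for each group, the share of patients flagged as at risk (1)."""
    totals = Counter()
    at_risk = Counter()
    for row in records:
        totals[row[group_key]] += 1
        at_risk[row[group_key]] += row[risk_key]  # risk is coded 1/0
    return {group: at_risk[group] / totals[group] for group in totals}


# Overall class balance, using the counts reported in the text
no_risk, with_risk = 5624, 3139
overall_share = with_risk / (no_risk + with_risk)
print(round(overall_share, 3))  # 0.358

# Hypothetical toy records for illustration only
records = [
    {"smoking": 1, "heart_attack_risk": 1},
    {"smoking": 1, "heart_attack_risk": 0},
    {"smoking": 1, "heart_attack_risk": 0},
    {"smoking": 0, "heart_attack_risk": 1},
    {"smoking": 0, "heart_attack_risk": 0},
]
print(risk_share_by_group(records, "smoking"))
```

Because each share is divided by its own group size, groups of very different sizes can be compared on the same scale.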

The analysis will explore four categories of variables - patient-specific details, lifestyle factors, medical indicators, and socioeconomic aspects. Thorough initial investigation was done and only relevant variables are included in this report.

Patient-Specific Details

The patient-specific details are explored first, beginning with the distribution of patients’ heart rate, which intuitively seems the factor most likely to be correlated with heart attack risk. In Figure 3, when the heart rate distribution is visualized in a histogram with narrow bins, there is a large spike of patients with a heart rate of about 105 beats per minute, suggesting a potentially unimodal distribution. A normal resting heart rate is between 60 and 100 beats per minute, so the proportion of patients in each range was explored through a density graph plotted by patients’ heart attack risk. However, this bivariate analysis showed no clear difference in heart rate by risk status; across the full range of heart rates, the percentages of patients with and without heart attack risk were similar. The distribution was uniform and nearly rectangular for both groups, with no identifiable modes.

Cholesterol levels of patients were explored next because a higher pulse rate is often associated with high cholesterol (American Heart Association, 2023)². The density plot of cholesterol level by heart attack risk in Figure 4 showed a difference in distribution at the extreme ends of the cholesterol range. A total cholesterol below 200 is typically considered healthy. Patients with cholesterol below 200 are less likely to have heart attack risk, as shown in the blue area on the left end. On the other hand, among patients with extremely high total cholesterol of over 325, there is a small peak showing that more patients are at risk of a heart attack. The distribution was nearly uniform, but the smaller modes at the ends suggest possible multimodality.

Cholesterol levels were further investigated by plotting them against BMI, grouped by heart attack risk. The observations were scattered across the full ranges of cholesterol and BMI with no visible correlation. The regression lines by heart attack risk were slightly different. While patients with no heart attack risk showed no correlation between cholesterol and BMI, the line for patients with heart attack risk deviated slightly, especially at the lower end of the cholesterol range.

Figure 4: Distribution of Cholesterol Level by Heart Attack Risk

Lifestyle Choices

Next, lifestyle factors like smoking and average duration of sleep were compared by patients’ heart attack risk status. Figure 5 shows the distribution of smoking status in patients. Interestingly, the distribution across all patients shows that most patients in the dataset were smokers. Nevertheless, the proportion of smokers was similar for patients with and without heart attack risk. In other words, smoking status showed no clear association with heart attack risk in this dataset.

Figure 5: Distribution of Smoking Status

Similar analyses were performed for patients’ sex, diabetes status, family history of heart attack, obesity status, alcohol consumption, and diet levels, but none showed a clear association with heart attack risk, just as Figure 5 showed for smoking status.

The hours patients spent exercising, sleeping, and being sedentary were explored next. While none showed a clear bivariate relationship with heart attack risk, sleep duration showed the strongest hint of a trend. Figure 6 illustrates the distribution of patients’ average sleep per day in hours by heart attack risk. Although there is no linear trend in heart attack risk by sleep duration, the additional line graph indicates that the number of patients without heart attack risk slightly increases as sleep duration gets longer. 10 hours of sleep per day has the highest count among patients without heart attack risk. In contrast, 9-10 hours of sleep has the lowest counts among patients with heart attack risk. Although the distributions show no strong pattern, these observations may be worth noting.

Figure 6: Distribution of Sleep Duration by Heart Attack Risk

Health Indicators

In the next steps, health indicators like Body Mass Index (BMI), triglycerides, and blood pressure were investigated in relation to patients’ heart attack risk. First, the distribution of patients’ BMI was presented on a density plot in Figure 7. The distribution is uniform and nearly rectangular with no notable modes. Counterintuitively, a larger proportion of patients within the healthy range between 20 and 26 had heart attack risk compared to those within the overweight and obese range between 26 and 34. To investigate further, the subsequent boxplot explores the distribution of BMI when grouped by blood pressure levels, because BMI is known to be positively associated with both systolic and diastolic blood pressure (Landi et al., 2018)³.

The blood pressure levels are divided into four categories: normal for systolic less than 120 and diastolic less than 80; elevated for systolic from 120 to 129 and diastolic less than 80; high for systolic from 130 to 179 or diastolic from 80 to 119; hypertension for systolic over 180 and/or diastolic over 120; all measured in mmHg (American Heart Association, 2023)⁴. Even when the distribution of BMI was categorized by blood pressure, there was no clear difference between patients with and without heart attack risk. The quartiles and ranges in Figure 7 imply that patients were spread evenly across the full range of BMI.
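The four blood pressure categories can be expressed as a small decision rule. The sketch below is a Python illustration, not the report's own R code. The stated ranges leave readings of exactly 180 systolic or 120 diastolic unassigned, so this sketch places them in "high"; that choice is an assumption, as is the function name.

```python
def blood_pressure_category(systolic: int, diastolic: int) -> str:
    """Classify a reading (mmHg) into the four levels used in this report."""
    # Check the most severe category first so that "and/or" rules take priority.
    if systolic > 180 or diastolic > 120:
        return "hypertension"
    # Assumption: exactly 180 systolic or 120 diastolic falls through to "high".
    if systolic >= 130 or diastolic >= 80:
        return "high"
    # Diastolic is already known to be below 80 at this point.
    if systolic >= 120:
        return "elevated"
    return "normal"


for reading in [(118, 75), (125, 78), (135, 70), (118, 85), (185, 90)]:
    print(reading, blood_pressure_category(*reading))
```

Ordering the checks from most to least severe means a reading with one high component, such as 118/85, is classified by its worse value.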

Figure 7: Body Mass Index (BMI) by Blood Pressure Levels and Heart Attack Risk

Furthermore, Figure 8 shows an analysis similar to Figure 7, but for triglyceride levels instead of BMI. For the patients in this dataset (adults between 18 and 90 years old), triglycerides above 500 mg/dL are considered very high, with critical risk (National Institutes of Health, 2023)⁵. Figure 8 shows that for triglycerides over 500, there are always more patients with heart attack risk than without. For the relatively lower range from 200 to 500, there are always more patients without heart attack risk than with. This shows a potential association between higher triglyceride levels and heart attack risk. The general distribution of patients both with and without heart attack risk was uniform and rectangular.

In a multivariate analysis with blood pressure level, triglyceride levels varied slightly by heart attack risk status, but again, the overall distribution was uniform. Among patients with normal and elevated blood pressure levels, those at risk of heart attack had a slightly higher median and first quartile triglyceride level. This was less obvious in patients with hypertension, and not observed in patients with high blood pressure. However, the boxplot still shows that patients with normal blood pressure and without heart attack risk have the lowest triglyceride levels on average.

Figure 8: Triglycerides by Blood Pressure Levels and Heart Attack Risk

Socioeconomic Factors

Finally, socioeconomic aspects may also be indicators of heart attack risk. Figure 9 shows income distributions by country, which are not always symmetrical. Some countries show little association between income and heart attack risk, while others show a larger contrast. For example, in Japan, Spain, and New Zealand, the distribution of patients with heart attack risk was shifted to the right, with a slightly higher income compared to patients without heart attack risk. On the other hand, in countries like Thailand and South Africa, the opposite was observed. The distribution of patients without heart attack risk was slightly shifted to the right - patients without heart attack risk had a very slightly higher median income.

Figure 9: Distribution of Income by Country and Heart Attack Risk

It is crucial to determine whether these observations hold significance or if they are merely the result of chance. It is also important to approach any speculation about these relationships with caution, as these observations could be influenced by a multitude of factors, and individual cases may vary. The contrast in income level was observed in countries with the largest difference in GDPs. In developed countries like Japan, Spain, and New Zealand, many patients visit the hospital because of a chronic disease (a disease that develops with aging). On the other hand, in developing countries like South Africa and Nigeria, patients are more likely to have visited the hospital because of a non-chronic disease (e.g., EID).

In more developed countries, individuals may have better access to healthcare services, leading to early detection and management of health risks, including heart attack risks. This could result in patients with heart attack risk having relatively higher incomes as they may be more actively managing their health. Similarly, developed countries may have higher levels of health awareness and education, leading individuals to adopt healthier lifestyles and seek medical attention earlier. This could contribute to a scenario where those with heart attack risks in developed nations have higher incomes.

There seemed to be more varying proportions and potential trends of patients at heart attack risk when grouped by country than by any other variable. Thus, the proportion of patients’ smoking status was calculated by country and heart attack risk, in addition to its exploration in Figure 5. In Figure 10, in the United States, 27.5% of non-smokers were at risk for heart attack while a higher 40.8% of smokers were at risk for heart attack. This shows that in the United States, smoking could potentially be a risk factor for developing a heart attack. On the other hand, in Colombia, non-smokers had similar proportions of patients with and without heart attack risk, at 51% and 49% respectively, while a higher 63.9% of smokers were without heart attack risk. This is surprising because one would expect a higher proportion of patients who smoke to be at risk of heart attack, since smoking is harmful to health.
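One way to check whether a gap such as 27.5% versus 40.8% could be the result of chance is a two-proportion z-test. The Python sketch below is offered as one possible approach, not as the method used in this report. The group sizes are hypothetical, chosen only to reproduce the two percentages, because the report does not state the underlying counts.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 11 of 40 non-smokers (27.5%) and
# 160 of 392 smokers (40.8%) at risk. Not taken from the dataset.
z, p_value = two_proportion_z_test(11, 40, 160, 392)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With real counts from each country, a small p-value would suggest the difference is unlikely to be due to chance alone, although running many such comparisons across countries would call for a multiple-testing correction.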

Figure 10: Percentage of Heart Attack Risk by Country and Smoking Status

Conclusion

This exploratory data analysis provides a valuable foundation for understanding the complex interplay of factors associated with heart attack risk within the Heart Attack Risk dataset. While the dataset’s simulated nature may limit its direct applicability to real-world patient scenarios, the analysis serves as a valuable model and approach for finding insights into potential trends and associations. A thorough investigation shed light on aspects of heart attack risk in relation to patient-specific details, lifestyle choices, health indicators, and socioeconomic factors.

Outcomes

Despite the complexity of factors influencing heart health, patient-specific details such as heart rate did not exhibit clear differences between patients with and without heart attack risk. Patients within the normal cholesterol range were less likely to be at risk of a heart attack, and those in the extremely high range were more likely to be at risk. Lifestyle factors, including smoking and sleep duration, did not reveal a clear trend in heart attack risk. Health indicators like BMI and triglyceride levels displayed a potential relationship with heart attack risk when patients were grouped by blood pressure levels. Surprisingly, socioeconomic aspects, particularly income levels by country, showed variations by heart attack risk for particular more and less developed countries. Similarly, the proportion of patients with and without heart attack risk varied by smoking status, but the trend differed country by country.

Overall, it may be the case that for patients who are already visiting the hospital and have unhealthy habits, there is only a modest difference in heart attack risk. A heart attack may come unexpectedly regardless of health and lifestyle indicators. The outcomes were surprising because nothing seemed to strongly influence heart attack risk. However, the notable distributions of health indicators and heart attack risk by country revealed the potential unpredictability of heart attacks. This realization emphasizes the impact of unforeseen social and situational events that can suddenly disrupt the normal blood flow to the heart.

Future Research

Because this dataset was not obtained from real patients, it contributed little noteworthy new knowledge to the prediction of heart attacks other than their unpredictability. The synthetic nature of the dataset may limit the generalizability of findings to real-world patient populations. Moreover, it is unclear how patients’ heart attack risk status was measured. Finally, while this exploratory data analysis provides valuable descriptive insights, further steps must involve the development of advanced predictive models and the use of longitudinal data, which may contribute to more accurate risk prediction.

Moving forward, several avenues for further exploration and analysis present themselves. Most importantly, collaborating with healthcare institutions to access real patient data would strengthen the external validity of findings. Instead of heart attack risk data, the genuine occurrence of heart attacks could provide a more accurate representation of how these factors influence heart attacks. Engaging with healthcare professionals and domain experts for clinical validation can offer a practical perspective on the observed trends and guide the translation of findings into actionable insights.

Final Insights

In summary, while this exploratory data analysis may not directly contribute to predicting heart attack risk in real patients, it lays the groundwork for subsequent advanced modeling and analyses. While there were some notable aspects in the distributions of cholesterol levels, sleep duration, and triglyceride levels by blood pressure when grouped by heart attack risk, it is unclear whether the modest scale of these observations holds significance or if they are merely the result of chance. The synthesized insights implied that unforeseen events based on socioeconomic backgrounds may be the most frequent and influential factor in heart attack occurrences. This exploratory data analysis provides a starting point for further research aimed at improving cardiovascular risk prediction and informing preventive strategies.

Footnotes

1. https://www.kaggle.com/datasets/iamsouravbanerjee/heart-attack-prediction-dataset
2. https://www.heart.org/en/news/2019/02/01/8-things-that-can-affect-your-heart-and-what-to-do-about-them
3. https://doi.org/10.3390/nu10121976
4. https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings
5. https://www.nhlbi.nih.gov/health/high-blood-triglycerides