Predicting a heart attack can be difficult because one can strike without warning. Some patients are warned by physical or mental symptoms that can lead to complications and, eventually, a heart attack. Because high cholesterol is one of the leading risk factors for a heart attack [1], this study asks whether cholesterol levels can statistically predict a heart attack and how likely a patient is to experience one.
For this study, we use a heart attack data set from Kaggle. The data were collected at the Cleveland Clinic Foundation in Cleveland, Ohio, and are distributed through the UC Irvine Machine Learning Repository. The data set was donated in 1988 with all patient names and credentials removed for privacy. Because of this, we do not have an identification variable for this data set.
We created an outcome variable that evaluates the odds of a heart attack. Cholesterol is the numerical explanatory variable. The data set did not include a suitable categorical variable, so we created one by splitting the age variable into three age brackets: young, middle, and old. We also kept two other variables that we deemed important for this study: maximum heart rate (thalach) and resting blood pressure (trestbps).
Here is a sample of 5 randomly selected observations from the data used in this study.
| heart_attack | chol | age_bracket | thalach | trestbps |
|---|---|---|---|---|
| 2.589286 | 290 | Young | 153 | 112 |
| 1.648649 | 244 | Young | 178 | 148 |
| 2.583333 | 341 | Young | 136 | 132 |
| 1.946154 | 253 | Middle | 144 | 130 |
| 1.518182 | 167 | Young | 114 | 110 |
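The text does not spell out how the outcome variable was computed. In the five sample rows above, however, `heart_attack` coincides with `chol` divided by `trestbps`. The sketch below only checks that observation on the sampled rows. The helper name `chol_to_bp_ratio` is hypothetical, and whether the same formula was used for the full data set is an assumption.

```python
# Sample rows copied from the table above: (heart_attack, chol, trestbps).
sample_rows = [
    (2.589286, 290, 112),
    (1.648649, 244, 148),
    (2.583333, 341, 132),
    (1.946154, 253, 130),
    (1.518182, 167, 110),
]


def chol_to_bp_ratio(chol, trestbps):
    """Ratio of cholesterol to resting blood pressure (hypothetical helper)."""
    return chol / trestbps


# Compare the ratio with the tabulated outcome, allowing for 6-decimal rounding.
matches = [
    abs(chol_to_bp_ratio(chol, trestbps) - heart_attack) < 5e-6
    for heart_attack, chol, trestbps in sample_rows
]

for (heart_attack, chol, trestbps), ok in zip(sample_rows, matches):
    print(f"chol={chol}, trestbps={trestbps}, table={heart_attack}, match={ok}")
```

If the ratio is indeed the definition, the outcome is mechanically tied to cholesterol, which is worth keeping in mind when reading the correlations reported later.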
Our sample contains 303 patients, and every observation has complete, valid values for all variables. This supports the statistical validity of our analysis, since no patient has to be dropped because of missing data. The mean of the odds of a heart attack was greatest for old patients (n = 79, \(\bar{x}\) = 1.9, sd = 0.6), intermediate for middle-aged patients (n = 168, \(\bar{x}\) = 1.9, sd = 0.4), and lowest for young patients (n = 56, \(\bar{x}\) = 1.8, sd = 0.4).
| age_bracket | n | correlation | mean | median | sd | min | max |
|---|---|---|---|---|---|---|---|
| Middle | 168 | 0.7877395 | 1.898677 | 1.878671 | 0.4106272 | 0.840000 | 3.117647 |
| Old | 79 | 0.8833069 | 1.932442 | 1.881482 | 0.5550622 | 1.025000 | 4.904348 |
| Young | 56 | 0.8797350 | 1.840309 | 1.785285 | 0.3834713 | 1.268116 | 2.634783 |
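As a quick consistency check on the summary table, the group sizes should add up to the 303 patients in the sample, and the group means can be pooled into an overall mean by weighting each one by its group size. This is a minimal sketch using only the values printed in the table. The overall mean it produces is derived here and is not a figure reported elsewhere in the study.

```python
# Group sizes and means copied from the summary table above.
groups = {
    "Middle": {"n": 168, "mean": 1.898677},
    "Old": {"n": 79, "mean": 1.932442},
    "Young": {"n": 56, "mean": 1.840309},
}

# Total sample size across the three age brackets.
total_n = sum(g["n"] for g in groups.values())

# Overall mean as the size-weighted average of the group means.
overall_mean = sum(g["n"] * g["mean"] for g in groups.values()) / total_n

print(f"total n = {total_n}")
print(f"overall mean = {overall_mean:.3f}")
```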
In Figure 1, the distribution of the odds of a heart attack is widely spread and right skewed. There is also an outlier near 5 on the odds of a heart attack. Because this observation stands apart from the rest of the distribution, we keep it in mind throughout this study.
Figure 1. Average heart attack value of patients
Figure 2 is a scatterplot of the relationship between the chances of a heart attack and cholesterol levels. The scatterplot shows an overall positive trend, indicating a positive relationship between our outcome variable and our numerical explanatory variable: as cholesterol levels increase, so do the chances of a heart attack. Our correlation coefficient is 0.79, confirming a strong positive correlation. There is also an outlier located past 550 on the x axis and near 5 on the y axis, which has high leverage and low influence.
Figure 2. Relationship between the average chance of a heart attack and the cholesterol levels
In the boxplot in Figure 3, the odds of a heart attack appear to be greatest for the old age bracket and lowest for the young age bracket. The young age bracket has no outliers, but the other two age brackets do. The old age bracket has one outlier near 5 on the chances of a heart attack, while the middle age bracket has two outliers closer to the whiskers of its boxplot.
Figure 3. The relationship between the chances of a heart attack and age bracket
Figure 4. Relationship between the odds of a heart attack, cholesterol (mg/dl), and age bracket of patients
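The components of our multiple linear regression model are the following: the outcome variable \(y\) is the chance of a heart attack, the numerical explanatory variable \(x_1\) is cholesterol (mg/dl), and the categorical explanatory variable \(x_2\) is the age bracket.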
Table 2. Regression table of the parallel slopes model of chance of heart attack as a function of cholesterol (mg/dl) and age bracket:
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 0.078 | 0.071 | 1.105 | 0.270 | -0.061 | 0.218 |
| chol | 0.007 | 0.000 | 26.643 | 0.000 | 0.007 | 0.008 |
| age_bracket: Old | -0.070 | 0.034 | -2.082 | 0.038 | -0.136 | -0.004 |
| age_bracket: Young | 0.083 | 0.038 | 2.173 | 0.031 | 0.008 | 0.158 |
The regression equation for chances of a heart attack is the following:
\[ \begin{aligned}\widehat {chance} =& b_{0} + b_{chol} \cdot chol + b_{Old} \cdot 1_{is\ Old}(x_2) + b_{Young} \cdot 1_{is\ Young}(x_2) \\ =& 0.078 + 0.007 \cdot chol -0.070 \cdot 1_{is\ Old}(x_2) + 0.083 \cdot 1_{is\ Young}(x_2) \end{aligned} \]
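The intercept (\(b_0\) = 0.078) represents the predicted value of the outcome variable for a patient in the middle age bracket (the baseline group) with a cholesterol level of 0 mg/dl. Since no patient has a cholesterol level of zero, the intercept has no practical interpretation on its own.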
The slope estimate for cholesterol level (\(b_{chol}\) = 0.007) is the same for every age bracket, because this is a parallel slopes model. According to this estimate, every 1 mg/dl increase in cholesterol is associated with an average increase of 0.007 in the outcome variable, holding the age bracket constant. In other words, as the cholesterol level increases, so do the chances of experiencing a heart attack, and they do so at the same rate in the young, middle, and old age brackets.
The Old adult age bracket estimate (\(b_{Old}\) = -0.070) and the Young adult age bracket estimate (\(b_{Young}\) = 0.083) are offsets in the intercept relative to the middle adult age bracket. Essentially, for the same cholesterol level, the predicted chance of experiencing a heart attack is on average 0.070 lower for the old adult age bracket than for the middle adult age bracket, and 0.083 higher for the young adult age bracket than for the middle adult age bracket.
Thus the three regression lines have equations:
\[ \begin{aligned} \text{Young adult age bracket (in blue)}: \widehat {chance} =& 0.161 + 0.007 \cdot chol \\ \text{Middle adult age bracket (in red)}: \widehat {chance} =& 0.078 + 0.007 \cdot chol\\ \text{Old adult age bracket (in green)}: \widehat {chance} =& 0.008 + 0.007 \cdot chol\\ \end{aligned} \]
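The three lines above differ only in their intercepts, which come from adding each age bracket offset to the baseline intercept. The sketch below reproduces that arithmetic with the rounded estimates from Table 2. The function names `bracket_intercept` and `predict_chance` are hypothetical, and the example cholesterol value of 250 mg/dl is chosen only for illustration. Because the coefficients are rounded to three decimals, predictions are approximate.

```python
# Rounded estimates copied from Table 2.
INTERCEPT = 0.078   # baseline intercept (middle age bracket)
SLOPE_CHOL = 0.007  # common slope for cholesterol (mg/dl)
OFFSETS = {"Middle": 0.0, "Old": -0.070, "Young": 0.083}


def bracket_intercept(age_bracket):
    """Intercept of the regression line for one age bracket."""
    return INTERCEPT + OFFSETS[age_bracket]


def predict_chance(chol, age_bracket):
    """Fitted value from the parallel slopes model for a given patient."""
    return bracket_intercept(age_bracket) + SLOPE_CHOL * chol


for bracket in ("Young", "Middle", "Old"):
    print(
        f"{bracket}: intercept = {bracket_intercept(bracket):.3f}, "
        f"fitted at 250 mg/dl = {predict_chance(250, bracket):.3f}"
    )
```

The gap between any two brackets stays the same at every cholesterol level, which is what "parallel slopes" means.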
Using the output of our regression table, we can test two sets of null hypotheses. Our first null hypothesis is that there is no relationship between cholesterol levels and the odds of having a heart attack. \[ \begin{aligned} H_0&: \beta_{chol} = 0 \\ \text{vs } H_A&: \beta_{chol} \neq 0 \end{aligned} \] The estimate for cholesterol is positive (\(b_{chol}\) = 0.007), which indicates a positive relationship between cholesterol levels and the odds of having a heart attack. Table 2 also shows that the confidence interval for this slope, (0.007, 0.008), does not contain zero and that the p-value is reported as 0.000, so we reject the null hypothesis.
For the second set of null hypotheses, we test whether the differences in intercept between the Old and Young age brackets and the Middle (baseline) age bracket are zero.
\[ \begin{aligned} H_0&: \beta_{Old} = 0 \\ \text{vs } H_A&: \beta_{Old} \neq 0 \end{aligned} \] and \[ \begin{aligned} H_0&: \beta_{Young} = 0 \\ \text{vs } H_A&: \beta_{Young} \neq 0 \end{aligned} \] In other words, we ask whether the old and young age brackets share the same intercept as the middle age bracket or whether their intercepts are shifted. Our estimates are \(b_{Old}\) = -0.070 and \(b_{Young}\) = 0.083. Table 2 shows that the confidence intervals for both offsets, (-0.136, -0.004) for Old and (0.008, 0.158) for Young, exclude zero, and that the p-values are 0.038 and 0.031, both below the conventional 0.05 level.
Having tested these hypotheses, we reject the null hypothesis in every case: the slope for cholesterol is not zero, and the intercept offsets for the old and young age brackets are not zero. The three age brackets therefore do not share a single intercept.
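The test decisions can be cross-checked against Table 2 by asking, for each term, whether the p-value falls below a significance level and whether the confidence interval excludes zero. The sketch below does this with the values printed in the table. The significance level of 0.05 is an assumption, since the report does not state one, and the function name `rejects_null` is hypothetical.

```python
ALPHA = 0.05  # assumed significance level; not stated in the report

# p-values and confidence bounds copied from Table 2.
TABLE_2 = {
    "intercept": {"p_value": 0.270, "lower_ci": -0.061, "upper_ci": 0.218},
    "chol": {"p_value": 0.000, "lower_ci": 0.007, "upper_ci": 0.008},
    "age_bracket: Old": {"p_value": 0.038, "lower_ci": -0.136, "upper_ci": -0.004},
    "age_bracket: Young": {"p_value": 0.031, "lower_ci": 0.008, "upper_ci": 0.158},
}


def rejects_null(row, alpha=ALPHA):
    """True when the p-value is below alpha and the interval excludes zero."""
    ci_excludes_zero = row["lower_ci"] > 0 or row["upper_ci"] < 0
    return row["p_value"] < alpha and ci_excludes_zero


decisions = {term: rejects_null(row) for term, row in TABLE_2.items()}

for term, reject in decisions.items():
    print(f"{term}: {'reject H0' if reject else 'fail to reject H0'}")
```

Under these assumptions, the null is rejected for the cholesterol slope and both age bracket offsets, while the baseline intercept is not distinguishable from zero.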
Figure 5. Histogram of residuals for the statistical model
Figure 6. Scatterplot of residuals against the numeric explanatory variable (cholesterol)
Figure 7. Scatterplot of residuals against the fitted values
Figure 8. Boxplot of residuals for each age bracket
In Figure 5, the residuals appear approximately normally distributed. In Figure 6, there is an outlier near the top of the plot in the region above roughly 550 mg/dl of cholesterol, but otherwise there is nothing unusual in the graph. Figure 7 looks much the same, with an outlier near the top at a fitted value of about 4.5 and no visible pattern. In Figure 8, outliers are present in each age bracket: the larger the bracket, the more outliers it contains, and the old age bracket has one especially high outlier. Apart from that one outlier, we conclude that the assumptions for inference in multiple linear regression have been reasonably met.
We conclude that the chances of a heart attack are associated with both cholesterol level and age bracket. On average, every 1 mg/dl increase in cholesterol is associated with a 0.007 increase in the odds of a heart attack, regardless of age bracket. Old patients had the highest average odds, but middle-aged and young adults can still experience a heart attack from high cholesterol alone. Although their average odds are lower, the possibility remains.
Older adults are expected to have a higher chance of experiencing a heart attack because of their age, but it is concerning that young adults also have a measurable chance. This may be the result of mental factors such as stress [2], which can contribute to heart disease.
Overall, we cannot do anything about our age, but we can improve our lifestyle. Cholesterol is one of the main factors in heart disease, so by eating a healthy, balanced diet, we can improve our health and lessen the chances of experiencing a heart attack.
We discovered a few limitations during this study. First, our goal was to predict the occurrence of a heart attack based on age and health, and we used only a few variables from the data set, such as cholesterol and age, to do so. Our final results could have been somewhat different had we incorporated other variables. In general, the chances of a heart attack are estimated using not only the variables we included but also factors such as daily lifestyle, genetics, and smoking.
Another limitation is that our data set excludes names and social security numbers for confidentiality purposes, which left us without any identification variable. Our last possible limitation is that the data set was donated in 1988. Although this does not affect our final results on predicting the occurrence of a heart attack based on age and health, the data set can be considered outdated.
If we were to continue this work, we would broaden our study from a single region to several regions across America. This would let us generalize our findings better and produce more precise information that health organizations around America could use.
Because our study focuses only on cholesterol, we could also improve it by adding other factors, such as heart rate, which can signal a heart attack, and existing heart complications.
1. Fryar CD, Chen T-C, Li X. Prevalence of uncontrolled risk factors for cardiovascular disease: United States, 1999–2010. NCHS Data Brief, August 2012. https://www.washingtonpost.com/news/answer-sheet/wp/2017/03/06/what-the-numbers-really-tell-us-about-americas-public-schools/?noredirect=on&utm_term=.d9a5b415678d
2. Pickering TG. Mental stress as a causal factor in the development of hypertension and cardiovascular disease. Current Hypertension Reports. 2001 Jun;3(3):249-54. https://link.springer.com/article/10.1007/s11906-001-0047-1