1. Introduction

Predicting a heart attack can be difficult because a heart attack can strike without warning. Some patients are warned by physical or mental symptoms that can lead to complications and, eventually, a heart attack. Because high cholesterol is one of the leading risk factors for a heart attack [1], this study asks whether we can statistically predict a heart attack from cholesterol levels and estimate how likely a patient is to experience one.

For this study, we use a heart attack data set from Kaggle. The data were collected at the Cleveland Clinic Foundation in Cleveland, Ohio, and come from the UC Irvine Machine Learning Repository. The data set was donated in 1988 with all patient names and identifying details removed for privacy. Because of this, we do not have an identification variable for this data set.

We created an outcome variable that evaluates the odds of a heart attack. Cholesterol is the numerical explanatory variable. The data set did not provide a suitable categorical variable, so we created one by splitting the age variable into three age brackets: young, middle, and old. We also deemed two other variables important to this study: maximum heart rate (thalach) and resting blood pressure (trestbps).
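
The bracketing step can be sketched as a small function. This is only an illustration in Python: the report does not state the ages that separate the brackets, so the cutoffs `YOUNG_MAX_AGE` and `OLD_MIN_AGE` below are hypothetical, and the function name `age_bracket` is our own.

```python
# Hypothetical cutoffs: the report does not state the ages that separate the brackets.
YOUNG_MAX_AGE = 45  # assumed upper bound (inclusive) for "Young"
OLD_MIN_AGE = 60    # assumed lower bound (inclusive) for "Old"


def age_bracket(age: int) -> str:
    """Map an age in years to one of the three brackets used in this study."""
    if age <= YOUNG_MAX_AGE:
        return "Young"
    if age >= OLD_MIN_AGE:
        return "Old"
    return "Middle"


# Made-up ages chosen to exercise both boundaries.
ages = [29, 45, 46, 59, 60, 77]
brackets = [age_bracket(a) for a in ages]
print(list(zip(ages, brackets)))
```

Changing the two constants changes how many patients fall into each bracket, so the group sizes reported later depend on whichever cutoffs were actually used.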

Here is a sample of 5 randomly selected observations from the data used in this study.

heart_attack chol age_bracket thalach trestbps
2.589286 290 Young 153 112
1.648649 244 Young 178 148
2.583333 341 Young 136 132
1.946154 253 Middle 144 130
1.518182 167 Young 114 110

2. Exploratory data analysis

Our sample contains 303 patients, and every observation has complete, valid values for each variable. This supports the statistical validity of our analysis, since no patients had to be dropped for missing data. The mean odds of a heart attack were greatest for old aged patients (n = 79, \(\bar{x}\) = 1.93, sd = 0.56), intermediate for middle aged patients (n = 168, \(\bar{x}\) = 1.90, sd = 0.41), and lowest for young patients (n = 56, \(\bar{x}\) = 1.84, sd = 0.38).


age_bracket n correlation mean median sd min max
Middle 168 0.7877395 1.898677 1.878671 0.4106272 0.840000 3.117647
Old 79 0.8833069 1.932442 1.881482 0.5550622 1.025000 4.904348
Young 56 0.8797350 1.840309 1.785285 0.3834713 1.268116 2.634783
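
A grouped summary of this kind can be sketched in a few lines. The example below is a minimal Python illustration, not necessarily the tool used to build the table. It uses only the five sample rows shown in the introduction, so its numbers will not match the full-data table, and the names `rows` and `summary` are our own.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# The five sample rows from the introduction: (heart_attack, chol, age_bracket).
rows = [
    (2.589286, 290, "Young"),
    (1.648649, 244, "Young"),
    (2.583333, 341, "Young"),
    (1.946154, 253, "Middle"),
    (1.518182, 167, "Young"),
]

# Collect the outcome values for each age bracket.
groups = defaultdict(list)
for heart_attack, _chol, bracket in rows:
    groups[bracket].append(heart_attack)

summary = {}
for bracket, values in groups.items():
    summary[bracket] = {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        # The sample standard deviation needs at least two observations.
        "sd": stdev(values) if len(values) > 1 else None,
        "min": min(values),
        "max": max(values),
    }

for bracket, stats in sorted(summary.items()):
    print(bracket, stats)
```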

Upon analyzing the distribution in Figure 1, we noticed that the values are widely spread and right skewed. There is also an outlier near 5 on the heart attack scale. Because this value stands out from the rest of the distribution, we must keep it in mind throughout this study.

Figure 1. Average heart attack value of patients

Figure 2 is a scatterplot of the relationship between the chances of a heart attack and cholesterol levels. It shows an overall positive trend, indicating a positive relationship between our outcome variable and our numerical explanatory variable: as cholesterol levels increase, so do the chances of a heart attack. The correlation coefficient is 0.79, confirming a strong positive correlation. There is also an outlier located past 550 on the x axis and near 5 on the y axis, which has high leverage and low influence.

Figure 2. Relationship between the average chance of a heart attack and the cholesterol levels
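
To make the correlation coefficient concrete, here is a minimal Python sketch of the Pearson formula. It is applied only to the five sample rows from the introduction, so the result is illustrative and will differ from the full-data value of 0.79. The helper name `pearson_r` is our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Cholesterol (mg/dl) and outcome values from the five sample rows.
chol = [290, 244, 341, 253, 167]
heart_attack = [2.589286, 1.648649, 2.583333, 1.946154, 1.518182]

r = pearson_r(chol, heart_attack)
print(round(r, 2))
```

A value close to 1 indicates a strong positive linear relationship. With only five points the estimate is very unstable, which is why the report relies on all 303 patients.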

In the boxplot in Figure 3, the odds of a heart attack appear to be greatest for the old age bracket and lowest for the young age bracket. The young age bracket has no outliers, but the other two age brackets do. The old age bracket has an outlier near 5 on the chances of a heart attack, while the middle age bracket has two outliers closer to the whiskers of its boxplot.

Figure 3. The relationship between the chances of a heart attack and age bracket

Lastly, for Figure 4, we generated a colored scatterplot to compare all three variables. Within each age bracket there is a positive relationship between cholesterol levels and the odds of a heart attack. The young and middle age brackets have parallel regression lines, and the old age bracket has a longer regression line that extends toward the outlier.

Figure 4. Relationship between the odds of a heart attack, cholesterol (mg/dl), and age bracket of patients


3. Multiple linear regression

3.1 Methods

The components of our multiple linear regression model are the following:

  • Outcome variable \(y\) = Odds of heart attack
  • Numerical explanatory variable \(x_1\) = Cholesterol level in mg/dl
  • Categorical explanatory variable \(x_2\) = Adult Age Bracket

3.2 Model Results


Table 2. Regression table of the parallel slopes model of the chance of a heart attack as a function of cholesterol (mg/dl) and age bracket:

term estimate std_error statistic p_value lower_ci upper_ci
intercept 0.078 0.071 1.105 0.270 -0.061 0.218
chol 0.007 0.000 26.643 0.000 0.007 0.008
age_bracket: Old -0.070 0.034 -2.082 0.038 -0.136 -0.004
age_bracket: Young 0.083 0.038 2.173 0.031 0.008 0.158

3.3 Interpreting the regression table

The regression equation for chances of a heart attack is the following:

\[ \begin{aligned}\widehat {chance} =& b_{0} + b_{chol} \cdot chol + b_{Old} \cdot 1_{is\ Old}(x_2) + b_{Young} \cdot 1_{is\ Young}(x_2) \\ =& 0.078 + 0.007 \cdot chol -0.070 \cdot 1_{is\ Old}(x_2) + 0.083 \cdot 1_{is\ Young}(x_2) \end{aligned} \]

  • The intercept (\(b_0\) = 0.078) represents the predicted outcome for a patient in the middle age bracket (the baseline group) with a cholesterol level of 0 mg/dl. Since no patient has a cholesterol level of 0, this value has no practical interpretation.

  • The slope estimate for cholesterol (\(b_{chol}\) = 0.007) means that, holding age bracket fixed, every 1 mg/dl increase in cholesterol is associated with an average increase of 0.007 in the outcome variable. Because this is a parallel slopes model, the slope is the same for the young, middle, and old age brackets. As the cholesterol level increases, so do the chances of experiencing a heart attack.

  • The Old adult age bracket estimate (\(b_{Old}\) = -0.070) and the Young adult age bracket estimate (\(b_{Young}\) = 0.083) are the offsets in the intercept relative to the middle age bracket. Essentially, on average and at the same cholesterol level, the old adult age bracket has a 0.070 lower chance of experiencing a heart attack than the middle adult age bracket, and the young adult age bracket has a 0.083 higher chance of experiencing a heart attack than the middle age bracket.
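
As a worked step, the intercept for each age bracket is the baseline intercept plus that bracket's offset. Using the rounded estimates from Table 2:

\[ \begin{aligned} \text{Young}: b_{0} + b_{Young} =& 0.078 + 0.083 = 0.161 \\ \text{Middle}: b_{0} =& 0.078 \\ \text{Old}: b_{0} + b_{Old} =& 0.078 - 0.070 = 0.008 \end{aligned} \]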

Thus the three regression lines have equations:

\[ \begin{aligned} \text{Young adult age bracket (in blue)}: \widehat {chance} =& 0.161 + 0.007 \cdot chol \\ \text{Middle adult age bracket (in red)}: \widehat {chance} =& 0.078 + 0.007 \cdot chol\\ \text{Old adult age bracket (in green)}: \widehat {chance} =& 0.008 + 0.007 \cdot chol\\ \end{aligned} \]
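
These three lines can be turned into a small prediction function. The sketch below is a Python illustration that uses the rounded estimates from Table 2, so its predictions are approximate. The function name `predict_chance` and the example cholesterol value of 200 mg/dl are our own.

```python
# Rounded coefficient estimates copied from Table 2.
INTERCEPT = 0.078   # baseline intercept (Middle age bracket)
SLOPE_CHOL = 0.007  # change in the outcome per 1 mg/dl of cholesterol
OFFSETS = {"Middle": 0.0, "Old": -0.070, "Young": 0.083}


def predict_chance(chol: float, bracket: str) -> float:
    """Fitted value of the parallel slopes model for one patient."""
    return INTERCEPT + OFFSETS[bracket] + SLOPE_CHOL * chol


# Same cholesterol level, three brackets: only the intercept differs.
for bracket in ("Young", "Middle", "Old"):
    print(bracket, round(predict_chance(200, bracket), 3))
```

Because the slopes are parallel, the gap between any two brackets is the same at every cholesterol level. For example, Young minus Old is always 0.083 + 0.070 = 0.153.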

3.4 Inference for multiple regression

If we use the output of our regression table, we can test two sets of null hypotheses. Our first null hypothesis is that there is no relationship between cholesterol levels and the odds of having a heart attack.

\[ \begin{aligned} H_0:& \ \beta_{chol} = 0 \\ \text{vs } H_A:& \ \beta_{chol} \neq 0 \end{aligned} \]

Because the estimate for cholesterol is positive, the data suggest a positive relationship between cholesterol levels and the odds of having a heart attack. The table also shows that:

  • the 95% confidence interval for \(b_{chol}\) is (0.007, 0.008), which lies entirely above zero, and
  • the p-value is reported as 0.000 (that is, p < 0.001), which is below the conventional 0.05 significance level, so we reject the null hypothesis and conclude that \(\beta_{chol}\) ≠ 0.

For the second set of null hypotheses, we test whether the differences in intercepts for the Old and Young age brackets, relative to the Middle age bracket, are zero.

\[ \begin{aligned} H_0:& \ \beta_{Old} = 0 \\ \text{vs } H_A:& \ \beta_{Old} \neq 0 \end{aligned} \]

and

\[ \begin{aligned} H_0:& \ \beta_{Young} = 0 \\ \text{vs } H_A:& \ \beta_{Young} \neq 0 \end{aligned} \]

In other words, do the old and young age brackets each have the same intercept as the middle age bracket, or not? Our estimates are \(b_{Old}\) = -0.070 and \(b_{Young}\) = 0.083, which suggests that the intercepts differ. We have observed in the table that:

  • the 95% confidence intervals are (-0.136, -0.004) for \(b_{Old}\) and (0.008, 0.158) for \(b_{Young}\). Neither interval contains zero, so we reject both null hypotheses.
  • the p-values (0.038 for \(b_{Old}\) and 0.031 for \(b_{Young}\)) are both below the 0.05 significance level, so we again reject both null hypotheses.
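
The intervals and test statistics in Table 2 can be approximately reconstructed from the estimates and standard errors. The Python sketch below assumes a critical value of 1.96, the normal approximation to the t critical value for a sample of this size. It works from rounded table entries, so its results match the table only to about two decimal places. The cholesterol row is left out because its standard error is rounded to 0.000 in the table. The names `wald_interval` and `results` are our own.

```python
# (estimate, std_error) pairs copied from Table 2.
terms = {
    "intercept": (0.078, 0.071),
    "age_bracket: Old": (-0.070, 0.034),
    "age_bracket: Young": (0.083, 0.038),
}

CRIT = 1.96  # assumed critical value (normal approximation, 95% level)


def wald_interval(estimate, std_error, crit=CRIT):
    """Return (lower, upper) as estimate -/+ crit * std_error."""
    half_width = crit * std_error
    return estimate - half_width, estimate + half_width


results = {}
for name, (estimate, std_error) in terms.items():
    lower, upper = wald_interval(estimate, std_error)
    results[name] = {
        "statistic": estimate / std_error,
        "lower": lower,
        "upper": upper,
        # An interval that excludes zero corresponds to rejecting H0 at the 5% level.
        "excludes_zero": lower > 0 or upper < 0,
    }

for name, res in results.items():
    print(name, {k: round(v, 3) if isinstance(v, float) else v for k, v in res.items()})
```

The intervals for both age bracket offsets exclude zero, while the interval for the baseline intercept does not. This agrees with the p-values in Table 2.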

Having tested these hypotheses, we reject every null hypothesis: the slope for cholesterol is not zero, and the intercept offsets for the Old and Young age brackets are not zero. So the three age brackets do not share the same intercept.

3.5 Residual Analysis

Figure 5. The histogram of residuals for statistical model

Figure 6. The Scatterplots of residuals against the numeric explanatory variable (cholesterol)

Figure 7. The Scatterplots of residuals against the fitted values

Figure 8. The boxplot of residuals of each age bracket

In Figure 5, the histogram of residuals appears approximately normally distributed. In Figure 6, there is an outlier near the top of the plot, at a cholesterol level of more than 550 mg/dl; other than that, there is nothing unusual in the graph. Figure 7 looks nearly identical, with an outlier near the top at a fitted value of about 4.5 and no other patterns. In Figure 8, there are outliers in each age bracket: the larger the bracket, the more outliers it contains, and the old age bracket has one especially high outlier. Apart from these outliers, we conclude that the assumptions for inference in multiple linear regression are reasonably met.
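
For reference, a residual is the observed outcome minus the fitted value from the model. The Python sketch below computes residuals for the five sample rows from the introduction. It uses the rounded estimates from Table 2, so the values are approximate, and the names `fitted` and `residuals` are our own.

```python
# Rounded coefficient estimates copied from Table 2.
INTERCEPT = 0.078
SLOPE_CHOL = 0.007
OFFSETS = {"Middle": 0.0, "Old": -0.070, "Young": 0.083}

# The five sample rows from the introduction: (heart_attack, chol, age_bracket).
rows = [
    (2.589286, 290, "Young"),
    (1.648649, 244, "Young"),
    (2.583333, 341, "Young"),
    (1.946154, 253, "Middle"),
    (1.518182, 167, "Young"),
]


def fitted(chol, bracket):
    """Fitted value of the parallel slopes model."""
    return INTERCEPT + OFFSETS[bracket] + SLOPE_CHOL * chol


# Residual = observed - fitted; positive means the model under-predicts.
residuals = [observed - fitted(chol, bracket) for observed, chol, bracket in rows]

for (observed, chol, bracket), res in zip(rows, residuals):
    print(bracket, chol, round(res, 3))
```

Plotting these residuals against cholesterol, against the fitted values, and by age bracket gives the kinds of displays described in Figures 6 to 8.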


4. Discussion

4.1 Conclusions

We can conclude that the chances of a heart attack are associated with both cholesterol and age. On average, older patients had higher odds of a heart attack, and each 1 mg/dl increase in cholesterol is associated with a 0.007 increase in those odds. Even though old age adults have a higher possibility of having a heart attack, middle and young adults can still experience a heart attack from cholesterol levels alone. Although the possibility is lower, it is still present.

It is expected that old adults have a higher possibility of experiencing a heart attack because of their age, but it is concerning that young adults also have a measurable chance. This may be the result of mental factors such as stress [2], which can contribute to heart disease.

Overall, we cannot do anything about our age, but we can improve our lifestyle. Cholesterol is one of the main risk factors for heart disease, so by keeping a healthy, balanced diet we can improve our health and lessen the chances of experiencing a heart attack.

4.2 Limitations

We discovered a few limitations during this study. Firstly, our goal was to predict the occurrence of a heart attack based on age and health, and for that we used only a few variables from the data set, such as cholesterol and age. Our final results could have been different if we had incorporated other variables. For example, the chances of a heart attack are generally estimated using additional factors beyond the ones we incorporated, such as daily lifestyle, genetics, or smoking.

Another limitation is that our data set excludes names and social security numbers for confidentiality purposes, which left us without any identification variables. Our last limitation is that the data set dates from 1988. Although this does not affect our results on predicting the occurrence of a heart attack based on age and health, the data set can be considered outdated.

4.3 Further questions

If we were to continue this study, we would broaden it from a single region to several regions across America. This way, our results would generalize better and could provide more precise information for health organizations around America.

Because our study focuses only on cholesterol, we could improve it by adding other factors, such as heart rate, which can signal a heart attack, and other heart complications that would broaden the study.


5. Citations and References


  1. Fryar CD, Chen T-C, Li X. Prevalence of uncontrolled risk factors for cardiovascular disease: United States, 1999–2010. NCHS Data Brief, August 2012. https://www.washingtonpost.com/news/answer-sheet/wp/2017/03/06/what-the-numbers-really-tell-us-about-americas-public-schools/?noredirect=on&utm_term=.d9a5b415678d

  2. Pickering TG. Mental stress as a causal factor in the development of hypertension and cardiovascular disease. Current Hypertension Reports. 2001 Jun;3(3):249-54. https://link.springer.com/article/10.1007/s11906-001-0047-1