Our dataset comes from the Cleveland Clinic in Cleveland, OH, which observes a random sample of 303 patients undergoing angiotherapy. It streamlines 4 datasets on heart disease into a subset of 14 variables. Of these 14, we chose to focus on 9, the details of which are given below:
To begin using the data provided, it is imperative that we perform data cleaning by replacing categorical data labeled with numerals (e.g. 0, 1) with factor variables to prevent interpretation of categorical data as numeric. Then in order to understand the relationships between the above variables, we will conduct several hypothesis tests. We begin by formulating hypotheses for each relationship, conducting randomization tests using simple linear regression (SLR), and forming a confidence interval for our findings. The final step will be to create a distribution to demonstrate our findings visually.
Maximum heart rate (measured in beats per minute) is generally taken as an indicator of general health. For the same amount of physical exercise, lower heart rate indicates that the heart pumps blood more efficiently to bodily extremities. Conversely, as humans age, our bodies become less efficient; thus, we predict an inverse relationship between age and max heart rate–that is, older adults will have a higher maximum heart rate than younger adults.
To examine the possible relationship between age and maximum heart rate, we must first establish the following hypotheses:
\(H_0: {\beta}_1 = 0\), where \({\beta}_1\) = estimated change of predicted max heart rate for each one-unit increase in age.
\(H_{\alpha}: {\beta}_1 < 0\), where \({\beta}_1\) = estimated change of predicted max heart rate for each one-unit increase in age.
The statistic to be investigated is a slope. At an \(\alpha\) level of 0.05, we shall determine a 95% bootstrap percentile confidence interval.
#Bootstrap Percentile Confidence Interval for Hypothesis 1
set.seed(47)
#(1) obtain observed stat
slope_obs <- heart %>%
specify(MaxHR ~ Age) %>%
calculate(stat = "slope")
#(2) resample data with replacement
boot_heart <- heart %>%
specify(MaxHR ~ Age) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "slope")
#(3) calculate 95% CI
ci_heart_95 <- boot_heart %>%
summarize(lower = quantile(stat, probs = 0.025),
upper = quantile(stat, probs = 0.975))
ci_heart_95
With regards to this 95% Bootstrap Percentile Confidence Interval, the value 0, as indicated by the null hypothesis’ null value, lies beyond the interval. This can be further visualized with a bar plot of the bootstrapped sampling distribution with its shaded CI region (see Figure 1).
#(4) visualize distribution
visualize(boot_heart) + labs(x = "Bootstrapped Coefficients", y = "Count") +
shade_ci(endpoints = ci_heart_95)
Bootstrapped Samp. Dist. of Max Heart Rate’s Estimated Change for One-unit Age Increase
Conclusion for Hypothesis #1: With a 95% Boostrap Percentile Confidence Interval, we are 95% certain that, in the general population of angiography patients, the true change in maximum heart rate (measured in BPM) per 1 year increase in age lies between -1.29 and -0.76 years at an alpha significance level of 0.05. Since our null value lies within the range of plausible values, the data support our alternative hypothesis that there is a negative correlation between the true change in maximum heart rate (measured in beats per minute) per 1 year increase in age, suggesting that as age increases, maximum heart rate decreases.
With regards to the scope of our inferences, because this study was observational rather than experimental, we are not able to claim causation, however we may infer correlation. We may also generalize our findings to the general population of angiography patients.
Angina, a type of chest pain caused by reduced blood flow to the heart, can be sorted into different types. Stable angina, the more common variant, is commonly induced by exertion; unstable angina, on the other hand, can appear without much physical exertion. Thus, angina itself can come about during exercise. Serum cholesterol, on the other hand, is the concentration of cholesterol in the bloodstream, measured in mg/dL.
To examine the possible relationship between cholesterol and exercise-induced angina, we must first establish the following hypotheses:
To summarize, the null hypothesis stipulates that there is no difference in average serum cholesterol level between those with exercise-induced angina and those without, and the alternate hypothesis stipulates that the average serum cholesterol level for those with exercise-induced angina is greater than that of those without exercise-induced angina.
In effect, the statistic that is to be investigated is a difference of means. This shall include a randomization test as demonstrated below, with an \(\alpha\) level of 0.05.
diff_obs = heart %>%
specify(Cholesterol ~ ExerciseAnginaType) %>%
calculate(stat = "diff in means", order = c("Presence", "Absence"))
null_chol <- heart %>%
specify(Cholesterol ~ ExerciseAnginaType) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Presence", "Absence"))
## [1] p-value: 0.099
To demonstrate the p-value generated above, we can use a visualization of the null dstribution with the p-value region (see Figure 2).
visualize(null_chol) + labs(x = "Null Difference in Means", y = "Count") +
shade_p_value(obs_stat = diff_obs, direction = "greater")
Null Dist. of a Difference in Means of Serum Cholesterol
Analysis of Randomization Test: Provided our findings of a 0.099 p-value in the context of an \(\alpha\)-level of 0.05, we fail to reject the null hypothesis which supposes that there is no difference in average serum cholesterol levels (measured in mg/dL) between those with exercise-induced angina and those without. We have no evidence to support the notion that those with exercise-induced angina have a higher average serum cholesterol level than those without exercise-induced angina.
Since this study was observational rather than experimental, we are not able to claim causation, however we may infer correlation. We may also generalize our findings to the general population of angiography patients.
Motivation Heart disease is one of the most common health problems in the U.S.A and the world, thousands of seniors and people from difference ages died every year unexpectedly by this problem. An interesting causality was reveal when it has been seen samples where men patients suffer heart disease condition were more than women that suffer from heart disease condition. In this study, we will analyse this difference in a sample of patients from {…}, and we will analyse if this result is statistically significant to make a conclusion about this found and discard whether is a coincidence or a real relation.
Confidence Interval for Gender and Heart Disease Correlation
# Confidential Interval of 95%
heart <- heart_df %>%
mutate(Sex = factor(Sex, levels = c(0,1), labels = c("Female", "Male")))
## Bootstrap Distribution for Difference in Proportions
boot_diff <- heart %>%
specify(HeartDisease ~ Sex, success = "Presence") %>%
generate(reps = 1000, type ="bootstrap") %>%
calculate (stat = "diff in props", order = c("Female", "Male"))
ci_diff_95<-boot_diff %>%
summarize(
lower = quantile(stat,probs = 0.025),
upper = quantile(stat,probs = 0.975)
)
print(ci_diff_95)
## # A tibble: 1 × 2
## lower upper
## <dbl> <dbl>
## 1 -0.419 -0.200
To visualize the confidence interval as shown above, it is imperative to visualize the bootstrapped sampling distribution as a bar plot with a shaded CI region (see Figure 3).
# Visualize CI
visualize(boot_diff)+
shade_ci(endpoints = c(ci_diff_95$lower, ci_diff_95$upper)) +
labs(
title = "95% Bootstrap Confidence Interval for Difference in Proportions",
x = "Difference in Proportion (Female - Male)",
y = "Frequency"
)
Bootstrapped Samp. Dist. of a Difference in Prop. of Heart Disease
Conclusions
On average, the proportion of women with heart disease was about 8 percentage points lower than that of men (-0.08). The 95% of CI for the difference in proportions is (-0.15, 0.02). (because 0, no difference assumption, lies inside this interval, we cannot rule out that the true difference is zero. This means the evidence is not strong enough to claim that men and women differ in heart-disease prevalence in this sample. In the Hypothesis Test, we got that p-value is greater than 0.05, so we fail to reject the \(H_0\).There is no statistically significant difference between men and women in the proportion of heart disease at the 5% significance level.
In conclusion, based on our randomization test and bootstrap confidence interval, there is no statistically significant evidence of difference between men and women in proportion of individuals with heart disease (p = 0.21).The 95% confidence interval for the difference in proportions (female-male) ranged approximately -0.15 to 0.02, which includes 0.Therefore, we cannot conclude that gender is associated with heart disease in this sample. However, the negative point estimate suggests that women may have a slightly lower proportion of heart disease than men, though this difference is not statistically reliable given the sample variability.
heart_df2 <- heart_df
heart_df2 <- na.omit(heart_df2)
model1 <- lm(hd_presence ~ ExerciseAngina + STdepression + Cholesterol, data = heart_df2)
summary(model1)
##
## Call:
## lm(formula = hd_presence ~ ExerciseAngina + STdepression + Cholesterol,
## data = heart_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9169 -0.2845 -0.1495 0.3138 0.8799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0185653 0.1292081 -0.144 0.886
## ExerciseAngina 0.3413819 0.0572730 5.961 7.95e-09 ***
## STdepression 0.1421230 0.0234895 6.050 4.88e-09 ***
## Cholesterol 0.0008061 0.0005020 1.606 0.109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4242 on 266 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2738
## F-statistic: 34.81 on 3 and 266 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model1)
Upper left: Check for Linearity; Upper right: Verifying Normally Distributed Errors; Lower left: Homoscedasticity Plot; Lower right: Check for Influential Datapoints (High Leverage)
par(mfrow = c(1,1))
The multiple regression model shows that exercise-induced angina (\(\beta\) = 0.34, \(p\) < 0.001) and ST depression (\(\beta\) = 0.14, \(p\) < 0.001) are strong and statistically significant predictors of heart disease presence, meaning that patients with angina during exercise or higher ST depression values are substantially more likely to have heart disease. In contrast, cholesterol (\(\beta\) = 0.0008, \(p\) = 0.109) is not a significant predictor and does not add meaningful explanatory power once the other variables are included. Overall, the model is significant (F(3,266) = 34.81, \(p\) < 0.001) and explains about 28% of the variability in heart disease presence (R2 = 0.28), indicating that exercise-induced angina and ST depression are the most clinically informative predictors in this dataset.
Visualizations are an important way to communicate information about data clearly. Interactive visualizations in particular increase engagement and readability, allowing users find details about the data at their own pace. For this extension, we will be using the package plotly, which makes use of ggplot functions, to create interactive plots that display information about each data point when moused over. For our data, this helps us to see where our data comes from for each patient.
Visualization for Hypothesis #1
# Interactive visualization for Hypothesis #1
lm1 <- lm(MaxHR ~ Age, heart)
plot1 <- ggplot(heart, aes(x = Age,
y = MaxHR,
color = Gender)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "purple") +
labs(title = "Maximum Heart Rate (BPM) by Age",
x = "Age (years)",
y = "Max Heart Rate (BPM)",
color = "Sex") +
theme_bw()
suppressMessages(suppressWarnings(ggplotly(plot1)))
Interpretation of Visualization #1: Plotting the distribution of maximum heart rate (in BPM) against the age of angiography patients (in years), as well as the regression line for our linear model, we can see a general downward-sloping trend that supports the findings of Hypothesis #1: for each 1-year increase in age, maximum heart rate decreases. The third variable, sex, is indicated by each point’s color (blue = “male” and red = “female”), and there appears to be a fairly even distribution across gender.
We justify fitting our linear model using four criteria, laid out by the acronym LINE (linearity between X and Y variables, independence in observations, normality of the Y, and equal variance of results). Based on the project guidelines, we assume observations are independent, and we can see normality in our distribution under HT#2. From this plot, we also confirm linearity between age and max heart rate due to the general downward trend in heart rate as age increases. Finally, for equality in variance, we can also visually see a relatively even scatter of points above and below the line across different ages. Altogether, the downward slope continues to affirm the conclusion for HT#1.
Visualization for Hypothesis #2
# Interactive visualization for Hypothesis #2
plot2 <- ggplot(heart, aes(x = ExerciseAnginaType,
y = Cholesterol,
fill = ExerciseAnginaType,
text = paste(
"Presence/Absence of Exercise-Induced Angina:", HeartDisease,
"<br>Average Cholesterol Level:", Cholesterol, "mg/dL" ))) +
geom_point() +
geom_jitter() +
labs(title = "Presence of Exercise-Induced Angina versus Average Serum Cholesterol Level",
x = "Presence of Angina (Absence/Presence)",
y = "Average Serum Cholesterol Level (mg/dL)",
color = "Sex") +
theme_bw()
suppressMessages(ggplotly(plot2, tooltip = "text"))
Interpretation of Visualization #2: In our scatterplot plotting average serum cholesterol levels (in mg/dL) against the presence of exercise-induced angina (indicated by presence/absence), we can compare the exact values for each patient’s cholesterol levels between the two levels of our categorical variable. Aside from a few outliers in the ‘absence’ category, we see similar distributions of cholesterol level for each. While examining these graphs alone do not give us significantly strong evidence from which to draw a conclusion, using these visualiations we can visually support the conclusion we found via randomization test for Hypothesis #2: there is no evidence to support the notion that those with exercise-induced angina have a higher average serum cholesterol level than those without exercise-induced angina.
Conclusion: In conclusion, our interactive visualizations help us to further contextualize and understand the conclusions we made for our previous hypotheses tests.The takeaway is that, while the visualizations and individual datapoints do not provide strong basis from which to draw the conclusions themselves, they can provide further support or nuance to other statistical tests we run.
What is Logistic Regression?
Logistic regression shares many similarities with linear regression; in fact, it is a generalized linear model, which refers to the relation of a linear model to the response variable via a link function (in logistic regression, this is the logit). This generalized linear model can be expressed as the following:
\[ln\bigg{(}\frac{p}{1-p}\bigg{)}=\beta_0+\beta_1x_1+\beta_2x_2+...\]
On the left-hand side, \(ln(\frac{p}{1-p})\) is the logit, otherwise called the log-odds, since it is a logarithmic function of the odds, or the probability of a binary outcome expressed as a ratio of its chance of success and its chance of failure. On the right-hand side lies the well-known linear regression model. As this is a generalized linear model, we can also express it as the following below:
\[p=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\beta_2x_2+...)}}\]
Since \(p\), the probability of an event’s success, may only lie between 0 and 1, we arrive at the quintessential problem that necessitates the use of logistic regression: a logistic regression model models the odds of a single event with binary outcomes, not the distribution of probabilities on the outcomes for many like events. In other words, a logistic regression effectively models a binomial trial with sample size \(n=1\).
Since this is rooted in a Bernoulli (yes-no) distribution rather than a Gaussian (normal) distribution, the raw response residuals in a logistic regression are stuck in the range \([-1,1]\) and thus cannot be normally distributed. This is a violation of the assumptions of a linear regression model, and thus requires the log-odds—the link function—to form a generalized linear model.
We can interpret the model’s output through the Odds Ratio (OR), or in other words the change in odds of an event occurring as a multiplicative factor. A given OR is calculated by exponentiating the logit or predictor variable(s) and ranges between 0 and \(\infty\); a given OR \(n\) for variable \(x\) means that a given event is \(n\) times as likely to occur provided a one-unit increase in \(x\) compared to without. It thus stands to reason that in the instance of no association between two given variables, the OR would be 1 and the predictor variable \(\beta\) would be 0—which is the typical assumption taken by a null hypothesis.
Testing the Assumptions of Logistic Regression
In studying the effect of blood pressure, age, sex, and number of vessels colored by fluoroscopy on the presence of heart disease, we must first test the assumptions of the logistic regression. While we are aware that the sample size is sufficiently large, the observations are independent, and the response variable is binary, we must test for outliers, linearity between the logit and predictor variables, and multicollinearity.
log_model <- glm(hd_presence ~ Number_vessels_fluro + BP + Age + Sex,
data = heart_df,
family = "binomial"
)
Below, we shall use Cook’s Distance as a means of detecting significant outliers (see Figure 5).
Cook’s Distance graph for measuring outliers
Following the Cook’s Distance graph (which did not display significant outliers), we shall use the Box-Tidwell Test for testing the assumption of linearity between the logit and the predictor variables. To evaluate the Box-Tidwell Test, any p-values lower than 0.05 will indicate a violation of the assumption of linearity. Categorical variables and variables without significant variance were excluded—this leaves blood pressure and age.
## MLE of lambda Score Statistic (t) Pr(>|t|)
## BP -3.058029 -1.1061 0.2697
## Age -0.071243 -0.7901 0.4302
##
## iterations = 7
##
## Score test for null hypothesis that all lambdas = 1:
## F = 0.91419, df = 2 and 265, Pr(>F) = 0.4021
Seeing as all the p-values are above 0.05, we can assume that the linearity assumption is not broken.
Next, I shall use the VIF to determine if there is any multicollinearity between the predictor variables. For this particular VIF, any values above a 10 will indicate severe multicollinearity between a given predictor variable and other variables.
## Number_vessels_fluro BP Age
## 1.175497 1.093793 1.251324
## Sex
## 1.099686
Given that all the VIF values lie below 10, we can see that severe multicollinearity is not present.
The Multiple Logistic Regression Model
With all of the assumptions of the logistic regression model met, we may begin and present the common assumptions raised by the null hypothesis:
\(H_0: \beta_{fluoro}=\beta_{blood}=\beta_{age}=\beta_{sex}=0\)
\(H_\alpha: \neg H_0\)
summary(log_model)
##
## Call:
## glm(formula = hd_presence ~ Number_vessels_fluro + BP + Age +
## Sex, family = "binomial", data = heart_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.771061 1.420595 -4.062 4.86e-05 ***
## Number_vessels_fluro 1.176481 0.203797 5.773 7.80e-09 ***
## BP 0.019593 0.008651 2.265 0.0235 *
## Age 0.017741 0.018657 0.951 0.3417
## Sex 1.740496 0.365412 4.763 1.91e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 370.96 on 269 degrees of freedom
## Residual deviance: 277.94 on 265 degrees of freedom
## AIC: 287.94
##
## Number of Fisher Scoring iterations: 4
We now know from this model that sex, blood pressure, and the number of vessels colored by fluoroscopy are very statistically significant as predictor variables for influencing the odds of getting heart disease (as evidenced by the p-values), but what is the p-value of the entire model? More importantly, the estimates generated by the logistic regression are log-odds, not Odds Ratios.
## [1] Logistic regression model overall p-value:
## [2] 0
exp_values <- exp(coef(log_model))
std_values <- format(exp_values, scientific = FALSE)
std_values
## (Intercept) Number_vessels_fluro BP
## "0.003116449" "3.242942449" "1.019786162"
## Age Sex
## "1.017898814" "5.700170871"
Results and Conclusions
We can now see that we have sufficient evidence to reject the null hypothesis, which stipulates that the coefficients of the model (aka the log-odds) would all equal zero. Additionally, with the reveal of the Odds Ratios, we may interpret the statistically significant predictor variables in the following manner:
In sum, from our group’s analysis of the heart dataset we determined several conclusions. From HT #1, we found that the data supported our alternative hypothesis that there is a negative correlation between the true change in maximum heart rate (measured in beats per minute) per 1 year increase in age, suggesting that as age increases, maximum heart rate decreases. From HT #2, we found no evidence to support the notion that those with exercise-induced angina have a higher average serum cholesterol level than those without exercise-induced angina. And from HT #3, we lacked strong evidence to claim that male and female angiography patients differ in heart-disease prevalence. Our extensions furthered our inquiry; through multiple linear regression modelling we found that exercise-induced angina and ST depression were the strongest and most statistically significant predictors of heart disease presence in the dataset, through interactive visualizations we further affirmed the conclusions of our above hypotheses tests, and using logistic regression we also found that blood pressure and sex were statistically significant predictors of heart disease.
Overall, we were surprised to find that age was not a significant predictor in Aria’s model, as were we surprised by the massive increase in heart disease risk when switching categories from women to men. It was interesting to go through the process needed to come to the conclusions surrounding the relationships between different variables, as we went through several steps of cleaning our data, ensuring we met all requirements, running the tests all before coming to our conclusions. We also found it took more time than expected to neatly format and compile our work into one, cohesive document.
For further study, as the dataset contains a large 14 variables, we recommend running tests using similar methods (SLR, MLR, LogR, etc.) among different combinations of variables that we tested and those we did not to discover what unusual relationships may occur. As stated before, with regards to our hypotheses tests, since this study was observational rather than experimental, we are not able to claim causation, however we may infer correlation, and also generalize our findings to the general population of angiography patients. Even so, it may be useful for other clinics to gather newer such datasets to compare our findings from angiography patients in the 1980s to patients in the modern day.