Our data originates from a study conducted at the Cleveland Clinic in Cleveland, Ohio during the 1980s, focusing on patients undergoing angiography. While the original study collected 76 variables, our analysis uses a wrangled subset of 14 variables. The data is provided in CSV format and can be assumed to be a random sample of independent observations.
For this project, we examine whether there is a correlation between cholesterol levels (our explanatory variable) and the presence of heart disease (our response variable). Heart disease is originally coded as a categorical variable (presence or absence). However, we will make heart disease numeric in some instances to explore alternative modeling approaches. Additionally, each observational unit represents a single patient from the study.
(a) Measures of central tendency.
To summarize cholesterol levels, we computed central tendency measures and a five-number summary. The five-number summary (minimum, Q1, median, Q3, maximum) shows the spread and shape of the distribution, while the mean provides a measure of average cholesterol level. Our graph shows most values clustered around 250 mg/dL, with a roughly symmetric shape.
## min Q1 median Q3 max mean sd n missing
## 126 213 245 280 564 249.6593 51.68624 270 0
(b) Measures of dispersion.
Measures of dispersion, such as the standard deviation and interquartile range, describe how spread out cholesterol values are around the mean and median. The results show a fairly symmetric distribution (since the mean is about 250 and the median 245), and moderate variability (since the standard deviation is about 52 and the IQR is 67).
## # A tibble: 1 × 4
## mean_chol median_chol sd_chol IQR_chol
## <dbl> <dbl> <dbl> <dbl>
## 1 250. 245 51.7 67
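As a quick check on the two summaries above, the short sketch below recomputes the IQR from the reported quartiles and expresses the gap between the mean and the median in standard-deviation units. The values are copied from the output above; the variable names are ours and purely illustrative.

```r
# Values copied from the summary output above (cholesterol, mg/dL)
q1  <- 213
q3  <- 280
med <- 245
avg <- 249.6593
s   <- 51.68624

# Interquartile range: width of the middle 50% of the data
iqr <- q3 - q1

# Mean-median gap in standard-deviation units; a value near 0 is
# consistent with a roughly symmetric distribution
gap_in_sd <- (avg - med) / s

print(iqr)
print(round(gap_in_sd, 2))
```

A gap of less than a tenth of a standard deviation is in line with the "fairly symmetric" description, although the maximum of 564 shows there is at least one unusually high value.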
(a) Randomization test.
We chose a randomization test because it lets us assess the statistical significance of the observed relationship without relying on strict parametric assumptions. This is especially helpful with medical data like ours, since we do not know whether the underlying distribution is skewed or otherwise non-normal.
Randomization tests are based on simulated random reassignment of the explanatory variable under the null hypothesis, which states that the two variables have no relationship.
The goal of this randomization test is to determine if there’s a difference in the mean cholesterol level between patients with heart disease and patients without heart disease. The population parameter of interest is the true difference in mean cholesterol between patients with heart disease and those without. The statistic of interest is the observed difference in sample means of cholesterol between these two groups.
Under the null hypothesis, we assume there is no difference in average cholesterol levels between patients with and without heart disease. The alternative hypothesis states that patients with heart disease have higher cholesterol levels. This is stated in math notation below, where the average cholesterol level of patients with heart disease is denoted by \(\mu_d\) and that of patients without heart disease (healthy patients) is denoted by \(\mu_h\). We will determine if our results are statistically significant based on \(\alpha = 0.05\).
\(H_0: \mu_d - \mu_h = 0\) \(H_a: \mu_d - \mu_h > 0\)
## Response: cholesterol (numeric)
## Explanatory: heart_disease (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 12.3
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.023
The p-value of 0.023 means that, if the null hypothesis were true, there would be only a 2.3% chance of seeing a difference in means as extreme as or more extreme than our observed statistic of 12.3. Because the p-value is less than our significance level of \(\alpha = 0.05\), we reject the null hypothesis. We conclude that the average cholesterol level is significantly higher among patients with heart disease compared to those without.
Since this is an observational study, we cannot infer causality, and confounding variables may influence the relationship. However, assuming the sample was randomly selected and the observations are independent, it is reasonable to generalize these findings to similar patient populations undergoing angiography.
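The shuffling logic behind the randomization test can be sketched in base R as follows. This is a minimal illustration, not the code used for the results above: the data are simulated stand-ins, the group sizes (120 and 150), means, and standard deviations are assumptions chosen only to resemble the study, and `diff_means` is a hypothetical helper.

```r
set.seed(1)  # make the shuffles reproducible

# Hypothetical stand-in data (NOT the study data), cholesterol in mg/dL
chol  <- c(rnorm(120, mean = 256, sd = 50), rnorm(150, mean = 244, sd = 55))
group <- rep(c("disease", "healthy"), times = c(120, 150))

# Difference in sample means: disease minus healthy
diff_means <- function(x, g) {
  mean(x[g == "disease"]) - mean(x[g == "healthy"])
}

obs_diff <- diff_means(chol, group)

# Under the null, group labels carry no information about cholesterol,
# so shuffling them simulates differences we could see by chance alone
null_diffs <- replicate(2000, diff_means(chol, sample(group)))

# One-sided p-value: share of shuffled differences at least as large
# as the observed difference
p_value <- mean(null_diffs >= obs_diff)
p_value
```

Because the alternative hypothesis is one-sided, only shuffled differences in the upper tail count toward the p-value.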
## # A tibble: 1 × 2
## lower upper
## <dbl> <dbl>
## 1 0.300 23.9
We are 95% confident that the true difference in mean cholesterol (patients with heart disease minus those without) lies between about 0.30 and 23.9 mg/dL. Because 0 is not in this interval, it supports our alternative hypothesis that patients with heart disease have higher cholesterol levels.
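One common way to obtain an interval like this is a percentile bootstrap. The sketch below is an assumption about the general technique, not a record of how the interval above was computed: it uses simulated stand-in data and resamples within each group, whereas some tools resample whole rows instead.

```r
set.seed(2)  # make the resampling reproducible

# Hypothetical stand-in data (NOT the study data), cholesterol in mg/dL
disease <- rnorm(120, mean = 256, sd = 50)
healthy <- rnorm(150, mean = 244, sd = 55)

# Resample each group with replacement and recompute the difference
# in means (disease minus healthy) many times
boot_diffs <- replicate(2000, {
  mean(sample(disease, replace = TRUE)) - mean(sample(healthy, replace = TRUE))
})

# Percentile interval: the middle 95% of the bootstrap differences
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci
```

If the resulting interval lies entirely above 0, it points the same way as a one-sided test of \(\mu_d - \mu_h > 0\).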
An interactive graph of the bootstrap distribution can be seen in the HTML submission of this document. The static histogram already shows that the resampled differences in mean cholesterol cluster above zero, giving evidence for us to reject the null hypothesis. However, the interactive version allows readers to explore the distribution more closely and engage with the variability in the resampled statistics, making the bootstrap process more transparent and intuitive.
(b) Two-Sample t-test.
A two-sample t-test compares two population means without assuming the population standard deviations are known. Instead, it uses the sample standard deviations to estimate the standard error of the difference in means, which makes the t-test appropriate when population variances are unknown. A randomization test, by contrast, compares the sample means without assuming normality. The two tests address the same research question through different inferential frameworks. Using both strengthens our statistical evidence about whether cholesterol levels differ between the two groups, and so whether cholesterol is associated with heart disease.
We also conducted a two-sample t-test to compare the mean cholesterol levels of patients with heart disease to those without heart disease. This test allows us to determine whether the observed difference in sample means is statistically significant under the assumption that, because we have a large sample size, the distribution of the difference in means is approximately normal.
Unlike the randomization test (which does not assume a particular shape, such as normality, for the underlying distribution), the t-test uses a theoretical sampling distribution and standardizes our observed difference using a t-score. This provides an analytic approach to assessing significance.
The population parameter of interest is the true difference in mean cholesterol levels between the two groups: patients with heart disease and patients without heart disease.
The sample statistic is the observed difference in sample means. We want to determine whether patients with heart disease have a higher mean cholesterol level than healthy patients. \(\mu_d\) will represent the mean cholesterol of patients with heart disease and \(\mu_h\) will represent the mean cholesterol of healthy patients.
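The test statistic standardizes the observed difference in sample means, where \(\bar{x}\), \(s\), and \(n\) denote each group's sample mean, sample standard deviation, and sample size: \(t = \frac{\bar{x}_d - \bar{x}_h}{\sqrt{\frac{s_d^2}{n_d} + \frac{s_h^2}{n_h}}}\)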
Our hypotheses are as follows: \(H_0: \mu_d - \mu_h = 0\) \(H_a: \mu_d - \mu_h > 0\)
Under the null hypothesis, there is no difference in average cholesterol between the groups. Under the alternative hypothesis, patients diagnosed with heart disease are expected to have higher mean cholesterol levels. We will interpret our results at an alpha significance level of \(\alpha = 0.05\).
## [1] 256.4667
## [1] 244.2133
## [1] 1.971501
## [1] 265.0557
## [1] 0.02485327
The mean cholesterol level of the heart disease group is 256, and for the group without heart disease it is 244. We also computed a t-score value of 1.97.
Because the p-value is 0.025, which is less than \(\alpha = 0.05\), we reject the null hypothesis. This means there is about a 2.5% chance of observing a difference as extreme or more extreme than the one in our sample, if the null hypothesis were true. Thus, we have evidence that patients with heart disease have higher mean cholesterol than patients without heart disease.
A t-score of 1.97 indicates that the observed difference in mean cholesterol between patients with heart disease and healthy patients is 1.97 standard errors above 0, the difference expected under the null hypothesis.
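To make the link between the t-score and the p-value concrete, the sketch below works backward from the numbers printed above. It assumes the five printed values are, in order, the two group means, the t-score, the degrees of freedom, and the p-value; the variable names are ours.

```r
# Values copied from the output above (order of the printed values is assumed)
mean_d <- 256.4667   # mean cholesterol, heart disease group
mean_h <- 244.2133   # mean cholesterol, group without heart disease
t_stat <- 1.971501   # t-score
dof    <- 265.0557   # degrees of freedom

# Observed difference in sample means (disease minus healthy)
obs_diff <- mean_d - mean_h

# Standard error implied by t = obs_diff / se
se <- obs_diff / t_stat

# One-sided p-value: upper-tail area of the t distribution
p_value <- pt(t_stat, df = dof, lower.tail = FALSE)

print(round(obs_diff, 2))
print(round(se, 2))
print(round(p_value, 4))
```

The non-integer degrees of freedom suggest an unequal-variance (Welch) form of the test, which matches the standard error in the formula above.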
(c) Linear regression model.
We chose to use correlation and a linear regression model to evaluate the relationship between cholesterol level (our explanatory variable) and the presence of heart disease (our response variable, which we coded as numeric). This method allows us to quantify both the direction and strength of the relationship, and regression enables us to model how changes in cholesterol may predict the likelihood of heart disease.
Correlation measures how closely two variables move together, while regression fits a mathematical model that estimates the change in the response variable per unit change in the explanatory variable. These methods allow us to evaluate statistical significance (via p-value) and the size of the effect (via the regression slope).
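The purpose of this model is to assess whether cholesterol level is associated with the presence of heart disease in this population. The population parameters of interest are the population correlation between cholesterol level and heart disease status and the population regression slope \(\beta_1\).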
Under the null hypothesis, we assume there is no linear association between cholesterol level and the presence of heart disease. The alternative hypothesis states that higher cholesterol levels are associated with a higher probability of heart disease. We will interpret our results at an alpha significance level of \(\alpha = 0.05\).
\(\beta_1\) will represent the slope coefficient of cholesterol in the regression model. \(H_0: \beta_1 = 0\) \(H_a: \beta_1 > 0\)
We convert heart disease to a numeric variable (1 = presence, 0 = absence) so we can apply correlation and regression techniques.
## cor
## 0.1180205
##
## Call:
## lm(formula = hd_numeric ~ cholesterol, data = heart)
##
## Coefficients:
## (Intercept) cholesterol
## 0.160647 0.001137
## [1] 0.05273889
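The regression output shows that the slope coefficient is 0.001137; that is, each additional 1 mg/dL of cholesterol is associated with an increase of about 0.0011 in the predicted probability of heart disease. The p-value for this coefficient is 0.05274, which corresponds to a two-sided test and is just above \(\alpha = 0.05\), so by this test cholesterol level is not a statistically significant predictor of heart disease in this sample.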
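In a simple linear regression with one explanatory variable, the t-statistic for the slope equals the t-statistic for the correlation, so the reported p-value can be recovered from the correlation and the sample size alone. The sketch below uses the correlation printed above and n = 270 from the summary statistics; the variable names are ours.

```r
# Values copied from the output above
r <- 0.1180205   # sample correlation between cholesterol and hd_numeric
n <- 270         # number of patients

# t-statistic for the slope (identical to the one for the correlation)
t_slope <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value with n - 2 degrees of freedom
p_two_sided <- 2 * pt(abs(t_slope), df = n - 2, lower.tail = FALSE)

print(round(t_slope, 3))
print(round(p_two_sided, 4))
```

This reproduces a p-value of about 0.053, which indicates that the reported value is the two-sided one.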
Extension: Test for Difference in Variance
We performed an F-test to determine whether cholesterol levels had different variances between patients with and without heart disease. The null hypothesis states that both groups have equal variances, while the alternative hypothesis asserts that the variances are different.
\(H_0: \sigma_{\text{heart disease}}^2 = \sigma_{\text{healthy}}^2\) \(H_a: \sigma_{\text{heart disease}}^2 \neq \sigma_{\text{healthy}}^2\)
##
## F test to compare two variances
##
## data: cholesterol by heart_disease
## F = 0.78855, num df = 119, denom df = 149, p-value = 0.1771
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5618631 1.1140688
## sample estimates:
## ratio of variances
## 0.7885512
The F-test produced a test statistic of F = 0.7886 with degrees of freedom 119 and 149, and a p-value of 0.1771. The 95% confidence interval for the ratio of variances was (0.562, 1.114), which includes 1, the value expected if the variances were equal.
Because the p-value is greater than our significance level of 0.05, we fail to reject the null hypothesis. There is not sufficient evidence to conclude that cholesterol variability differs between patients with and without heart disease.
This means that while average cholesterol levels may differ between groups (as shown in earlier tests), the spread or variability of cholesterol levels appears similar across groups.
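The p-value and confidence interval in the F-test output follow directly from the F statistic and its degrees of freedom. The sketch below recomputes both from the reported values using the F distribution functions in base R; the variable names are ours.

```r
# Values copied from the F-test output above
f_stat <- 0.7885512   # ratio of sample variances
df1    <- 119         # numerator degrees of freedom
df2    <- 149         # denominator degrees of freedom

# Two-sided p-value: twice the smaller tail area of the F distribution
lower_tail <- pf(f_stat, df1, df2)
p_value <- 2 * min(lower_tail, 1 - lower_tail)

# 95% confidence interval for the true ratio of variances
ci <- f_stat / qf(c(0.975, 0.025), df1, df2)

print(round(p_value, 4))
print(round(ci, 3))
```

An interval that contains 1 and a p-value above 0.05 lead to the same decision, since both are derived from the same F distribution.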
An interesting aspect of this project was seeing how different inferential methods (randomization test, t-test, regression) sometimes differed in their conclusions. The randomization test and the t-test both provided clear evidence that patients with heart disease had higher average cholesterol levels, which aligned with expectations. However, the linear regression model did not show a statistically significant relationship (p = 0.053, just above 0.05), which suggests cholesterol may not strongly predict heart disease when modeled directly. Overall, the data were approximately what we anticipated, but the contrast between tests shows the importance of using multiple approaches to gain a fuller understanding.