This data is from a study about heart disease.
The data comes from a random sample of patients undergoing an angiography, which is a medical procedure to see blood vessels, at the Cleveland Clinic in Ohio during the 1980s.
Note on the population and generalizing conclusions
The conclusions from the hypothesis tests apply to the population from which this data was sampled: patients who underwent angiography at the Cleveland Clinic in the 1980s. These patients were likely mostly from Ohio, so they may not be representative of US demographics as a whole. We also cannot generalize the results to the US population today, because the population has changed. Differences in lifestyle, diet, and other factors since the 1980s make it difficult to generalize to Americans in 2025.
Here we use the plotly library to make interactive visualizations!
First, we will see if high blood pressure is correlated with higher risk of heart disease.
library(plotly)  # interactive charts; the `heart` data frame is assumed to be loaded already
fig_bp <- plot_ly(heart, y = ~BP, type = "box", color = ~HeartDisease, colors = c('Blue', 'Red'))
fig_bp
Interactive visualization: blood pressure and occurrence of heart disease
The median blood pressures for people with and without heart disease do not appear to differ: both are 130 mm Hg. However, the upper fence for the Presence group is higher, at 180 mm Hg. Let’s move on to explore other variables, such as the number of fluorescent blood vessels that show up during angiography.
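The numbers read off the interactive boxplot can also be checked in code. Below is a minimal sketch that uses a small made-up data frame (`toy` is hypothetical and is NOT the study data) whose values were chosen to mimic the pattern described above. It uses base R's `boxplot.stats()`, whose quartile rule may differ slightly from the one plotly uses.

```r
# Hypothetical toy values for illustration only, not the Cleveland data
toy <- data.frame(
  BP = c(110, 120, 130, 140, 150, 120, 130, 130, 160, 180),
  HeartDisease = rep(c("Absence", "Presence"), each = 5)
)

# Median blood pressure in each group
medians <- tapply(toy$BP, toy$HeartDisease, median)

# Upper whisker end: the largest value within 1.5 * IQR above the upper hinge
upper_whisker <- tapply(toy$BP, toy$HeartDisease,
                        function(x) boxplot.stats(x)$stats[5])

medians
upper_whisker
```

Swapping `toy` for `heart` should give the group medians and upper whiskers for the real data, assuming the column names match.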
library(dplyr)  # provides filter() and %>%
heart_pres <- filter(heart, HeartDisease == 'Presence')
fig_pres <- plot_ly(heart_pres, labels = ~Number_vessels_fluro,
                    type = "pie")
fig_pres
Pie chart of number of fluorescent blood vessels for those with heart disease
heart_abs <- filter(heart, HeartDisease == 'Absence')
fig_abs <- plot_ly(heart_abs, labels = ~Number_vessels_fluro,
                   type = "pie")
fig_abs
Pie chart of number of fluorescent blood vessels for those without heart disease
The groups with and without heart disease show different distributions of the number of fluorescent blood vessels. A larger share of people with heart disease have at least one fluorescent blood vessel than of those without. This difference may make for an interesting hypothesis test.
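The pie charts show these shares visually, and the same comparison can be made numerically. Here is a minimal sketch with `prop.table()`, using a tiny made-up data frame (`toy` is hypothetical, not the study data) that borrows the column names from `heart`.

```r
# Hypothetical toy values for illustration only, not the Cleveland data
toy <- data.frame(
  HeartDisease = rep(c("Absence", "Presence"), each = 4),
  Number_vessels_fluro = c(0, 0, 0, 1, 0, 1, 2, 3)
)

# Counts of each vessel number within each group
counts <- table(toy$HeartDisease, toy$Number_vessels_fluro)

# Row proportions: each group's row sums to 1
props <- prop.table(counts, margin = 1)

# Share of each group with at least one fluorescent vessel
share_any <- 1 - props[, "0"]
share_any
```

With `margin = 1` the proportions are computed within each heart disease group, which is the comparison the two pie charts make.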
Null Hypothesis: Sex and Heart Disease are independent
Alternative Hypothesis: Sex and Heart Disease are NOT independent
The motivation for this test is to see whether sex is associated with the likelihood of having heart disease. Understanding sex differences in heart disease is critical to improving diagnosis, treatment, and prevention strategies tailored to each sex, ultimately reducing disparities and improving outcomes for both men and women.
The p-value, 1.926e-06, is smaller than our significance level of alpha = 0.05, so we reject the null hypothesis. Since the null hypothesis is that sex and heart disease are independent, rejecting it means there is statistically significant evidence of a relationship between sex and heart disease. Judging by the counts in each group, males in this sample appear more likely to have heart disease than females. However, we cannot draw a causal conclusion and say that being male causes heart disease.
library(ggplot2)  # static plots
# recode the numeric Sex column (1/0) as readable labels
heart$Sex[heart$Sex == 1] <- 'Male'
heart$Sex[heart$Sex == 0] <- 'Female'
ggplot(heart, aes(x = HeartDisease, fill = Sex)) +
  geom_bar(position = "dodge")
# store list of people with heart disease
disease_yes <- filter(heart, HeartDisease=="Presence")
# store number of females and males with disease
female_disease <- sum(disease_yes$Sex == 'Female')
male_disease <- sum(disease_yes$Sex == 'Male')
# store list of people without heart disease
disease_no <- filter(heart, HeartDisease=="Absence")
# store number of females and males without disease
female_noDisease <- sum(disease_no$Sex == 'Female')
male_noDisease <- sum(disease_no$Sex == 'Male')
disease_data <- matrix(c(male_disease, female_disease, male_noDisease,
                         female_noDisease), ncol=2, byrow=TRUE)
sex <- c("Male", "Female")
disease <- c("yes disease","no disease")
chi <- data.frame(disease_data)
colnames(chi) <- sex
rownames(chi) <- disease
chi
chi %>%
  chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 22.667, df = 1, p-value = 1.926e-06
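As an alternative to counting each cell by hand, a 2x2 table like this one can be built in one step with `table()`. The sketch below uses made-up counts (`toy` is hypothetical, not the study data) and also prints the expected counts under independence, which is what the chi-squared statistic compares the observed counts against.

```r
# Hypothetical toy counts for illustration only, not the Cleveland data
toy <- data.frame(
  Sex = rep(c("Male", "Male", "Female", "Female"),
            times = c(40, 20, 10, 30)),
  HeartDisease = rep(c("Presence", "Absence", "Presence", "Absence"),
                     times = c(40, 20, 10, 30))
)

# Cross-tabulate heart disease status (rows) by sex (columns)
tab <- table(toy$HeartDisease, toy$Sex)
tab

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
res <- chisq.test(tab)
res$expected  # counts expected if the two variables were independent
res$p.value
```

On the real data, `table(heart$HeartDisease, heart$Sex)` should hold the same four counts as `chi`, assuming `Sex` has already been recoded.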
The motivation for this test is to determine whether those with and without heart disease have different numbers of fluorescent blood vessels. These show up during the angiography test. If the two groups differ in their number of fluorescent blood vessels, this count may help determine who needs extra screening for heart disease.
\(H_0\): People with and without heart disease have the same average number of vessels that fluoresce during an angiography.
\(\mu_{\text{heart disease}} = \mu_{\text{healthy}}\)
\(H_A\): People with heart disease have a higher average number of vessels that fluoresce during an angiography than those without heart disease.
\(\mu_{\text{heart disease}} > \mu_{\text{healthy}}\)
# color must be mapped inside aes() to take effect
ggplot(data = heart, aes(x = HeartDisease, y = Number_vessels_fluro, color = HeartDisease)) +
  geom_boxplot()
Boxplot for number of fluorescent vessels based on heart disease group
t.test(Number_vessels_fluro ~ HeartDisease, data = heart, alternative = 'less')
##
## Welch Two Sample t-test
##
## data: Number_vessels_fluro by HeartDisease
## t = -7.9751, df = 190.58, p-value = 6.868e-14
## alternative hypothesis: true difference in means between group Absence and group Presence is less than 0
## 95 percent confidence interval:
## -Inf -0.684403
## sample estimates:
## mean in group Absence mean in group Presence
## 0.2866667 1.1500000
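To see where the p-value comes from, it can be recomputed from the test statistic and degrees of freedom printed above. This is a small sketch, not part of the original analysis; the variable names are my own. Because the test uses `alternative = 'less'`, the p-value is the lower-tail area of the t distribution.

```r
# Values copied from the Welch t-test output
t_stat <- -7.9751
df_welch <- 190.58

# One-sided ('less') p-value: area to the left of the observed t statistic
p_less <- pt(t_stat, df = df_welch)
p_less

# For comparison, a two-sided test would double the tail area
p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)
p_two_sided
```

The one-sided value should be close to the p-value reported by `t.test()`, with small differences due to rounding of the printed statistic.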
Conclusion: Since the p-value < 0.05, I reject the null hypothesis that there is no difference in the mean number of fluorescent blood vessels between those who do and do not have heart disease, at a significance level of 0.05. If there were truly no difference in the population, a sample difference as large as the one observed would be extremely unlikely. This test provides evidence that people with heart disease have a higher mean number of fluorescent blood vessels. However, we cannot draw a causal conclusion from this test because the data is observational and does not come from a controlled experiment.
Angiography may be useful as a screening tool. The mean in the group that does not have heart disease is about 0.29. Because vessel counts are whole numbers, a mean this low implies that at most about 29% of that group has any fluorescent vessel, so the majority of people without heart disease have none. Someone with at least one fluorescent vessel may therefore warrant further screening for heart disease.
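The motivation for this multiple linear regression is to see how age and cholesterol together relate to the response variable in the fitted model, blood pressure (BP). Both age (p = 5.03e-05) and cholesterol (p = 0.0486) have p-values below 0.05, so both are statistically significant predictors, although cholesterol only narrowly. Since both coefficients are positive, higher age and higher cholesterol are each associated with higher blood pressure when the other variable is held fixed. However, the R-squared of about 0.088 shows that the model explains only about 9% of the variation in blood pressure, so these are weak predictors on their own. Because the response here is blood pressure rather than heart disease status, and the model has no interaction term, it does not directly estimate the risk of heart disease.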
mult_lines <- lm(BP ~ Age + Cholesterol, data=heart)
mult_lines
##
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
##
## Coefficients:
## (Intercept) Age Cholesterol
## 94.74812 0.48421 0.04101
summary(mult_lines)
##
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.453 -11.543 -1.160 9.834 66.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.74812 7.35892 12.875 < 2e-16 ***
## Age 0.48421 0.11748 4.122 5.03e-05 ***
## Cholesterol 0.04101 0.02071 1.981 0.0486 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.12 on 267 degrees of freedom
## Multiple R-squared: 0.08796, Adjusted R-squared: 0.08113
## F-statistic: 12.88 on 2 and 267 DF, p-value: 4.59e-06
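To make the coefficients concrete, the fitted equation can be written out and evaluated by hand. The sketch below copies the estimates from the output above; `predict_bp` and the example patient values (age 50, cholesterol 200) are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the lm() output
coefs <- c(intercept = 94.74812, age = 0.48421, cholesterol = 0.04101)

# Predicted blood pressure for a given age and cholesterol level
predict_bp <- function(age, cholesterol) {
  coefs[["intercept"]] + coefs[["age"]] * age +
    coefs[["cholesterol"]] * cholesterol
}

# Hypothetical 50-year-old with cholesterol of 200
bp_50 <- predict_bp(50, 200)
bp_50  # 127.16062

# Ten extra years of age, cholesterol held fixed
ten_year_gap <- predict_bp(60, 200) - bp_50
ten_year_gap  # 4.8421
```

With the fitted model object available, `predict(mult_lines, newdata = data.frame(Age = 50, Cholesterol = 200))` should give essentially the same prediction, up to rounding of the printed coefficients.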
I really enjoyed working with real-world data to run statistical analyses. As we learned in class, statistics is about making inferences about a population based on empirical data. It was fun to run tests such as the chi-square test to see whether sex is associated with heart disease. Results like these can be used to better society through public awareness, policy action, and better treatment. Knowing that there is a relationship between sex and heart disease, we can emphasize preventative and diagnostic care for males, who had a higher prevalence of heart disease in this sample.
I did not know what an angiography was, so it was interesting to see that the number of fluorescent vessels it reveals is associated with having heart disease. I was interested in seeing which variables were associated with having heart disease.