Exploratory data analysis

Interactive data visualization

Here we use the plotly library to make interactive visualizations!

First, we will see if high blood pressure is correlated with higher risk of heart disease.

library(plotly)

# Box plot of blood pressure, split and colored by heart disease status
fig_bp <- plot_ly(heart, y = ~BP, type = "box", color = ~HeartDisease, colors = c('Blue', 'Red'))

fig_bp

Interactive visualization: blood pressure and occurrence of heart disease

The median blood pressure does not appear to differ between people with and without heart disease: both groups have a median of 130 mm Hg. However, the upper fence for the Presence group is higher, at 180 mm Hg. Next, we explore other variables, such as the number of blood vessels that fluoresce during angiography.

heart_pres <- filter(heart, HeartDisease == 'Presence')

fig_pres <- plot_ly(heart_pres, labels = ~Number_vessels_fluro,
                    type = "pie")

fig_pres

Pie chart of number of fluorescent blood vessels for those with heart disease

heart_abs <- filter(heart, HeartDisease == 'Absence')

fig_abs <- plot_ly(heart_abs, labels = ~Number_vessels_fluro,
                    type = "pie")

fig_abs

Pie chart of number of fluorescent blood vessels for those without heart disease

The groups with and without heart disease have different distributions of the number of fluorescent blood vessels. A larger share of people with heart disease have at least one fluorescent vessel than of those without. This difference is a good candidate for a hypothesis test.
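The comparison read off the two pie charts can also be made numerically with row proportions. The following is a minimal sketch on made-up toy data, not the heart dataset; the names `toy`, `counts`, and `props` are illustrative assumptions.

```r
# Hypothetical toy data (NOT the heart dataset): vessel counts for 8 people
toy <- data.frame(
  HeartDisease = rep(c("Absence", "Presence"), each = 4),
  vessels      = c(0, 0, 0, 1,   0, 1, 2, 3)
)

# Cross-tabulate disease status (rows) against number of vessels (columns)
counts <- table(toy$HeartDisease, toy$vessels)

# Convert counts to proportions within each row, so each row sums to 1
props <- prop.table(counts, margin = 1)
props
```

Each row of `props` corresponds to one pie chart, so the rows can be compared side by side instead of by eye.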

Hypothesis Test: Chi Squared Test of Independence between Sex and Heart Disease

Null Hypothesis: Sex and Heart Disease are independent

Alternative Hypothesis: Sex and Heart Disease are NOT independent

The motivation for this test is to see whether sex is associated with the likelihood of having heart disease. Understanding sex differences in heart disease is critical to improving diagnosis, treatment, and prevention strategies tailored to each sex, which can reduce disparities and improve outcomes for both men and women.

The p-value of 1.926e-06 is smaller than our significance level of alpha = 0.05, so we reject the null hypothesis. Because the null hypothesis states that sex and heart disease are independent, rejecting it means there is statistically significant evidence of an association between sex and heart disease. In this sample, males appear more likely to have heart disease than females. However, we cannot conclude that being male causes heart disease.

heart$Sex[heart$Sex == 1] <- 'Male'
heart$Sex[heart$Sex == 0] <- 'Female'

ggplot(heart, aes(x = HeartDisease, fill = Sex)) + 
  geom_bar(position = "dodge")

# store list of people with heart disease
disease_yes <- filter(heart, HeartDisease=="Presence")

# store number of females and males with disease
female_disease <- sum(disease_yes$Sex == 'Female')
male_disease <- sum(disease_yes$Sex == 'Male')

# store list of people without heart disease
disease_no <- filter(heart, HeartDisease=="Absence")

# store number of females and males without disease
female_noDisease <- sum(disease_no$Sex == 'Female')
male_noDisease <- sum(disease_no$Sex == 'Male')

disease_data <- matrix(c(male_disease, female_disease, male_noDisease, 
                         female_noDisease), ncol=2, byrow=TRUE)
sex <- c("Male", "Female")
disease <- c("yes disease","no disease")
chi <- data.frame(disease_data)
colnames(chi) <- sex
rownames(chi) <- disease

chi

chi %>%
  chisq.test()

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 22.667, df = 1, p-value = 1.926e-06
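As a minimal sketch of the same kind of test, the contingency table can also be built directly with `table()`, and the expected counts under independence can be inspected from the result. The data below are made-up toy counts, not the heart dataset, and the names `toy`, `tab`, and `res` are illustrative assumptions.

```r
# Hypothetical toy data (NOT the heart dataset): 60 males and 40 females
toy <- data.frame(
  Sex = c(rep("Male", 60), rep("Female", 40)),
  HeartDisease = c(rep("Presence", 35), rep("Absence", 25),   # males
                   rep("Presence", 10), rep("Absence", 30))   # females
)

# 2 x 2 contingency table: disease status (rows) by sex (columns)
tab <- table(toy$HeartDisease, toy$Sex)

# Chi-squared test of independence (Yates' correction is applied to 2 x 2 tables by default)
res <- chisq.test(tab)

res$expected  # counts expected if sex and disease were independent
res$p.value   # compare against the chosen significance level
```

Each expected count is (row total × column total) / grand total. For the toy data, the Presence and Male cell is 45 × 60 / 100 = 27, compared with 35 observed.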

Hypothesis test: t-test for difference in means

The motivation for this test is to determine whether those with and without heart disease have different numbers of fluorescent blood vessels. These show up during the angiography test. If they have different numbers of blood vessels, this may be used to determine who needs extra screening for heart disease.

\(H_0\): People with and without heart disease have the same average number of vessels that fluoresce during an angiography.

\(\mu_{\text{heart disease}} = \mu_{\text{healthy}}\)

\(H_A\): People with heart disease have a higher average number of vessels that fluoresce during an angiography than those without heart disease.

\(\mu_{\text{heart disease}} > \mu_{\text{healthy}}\)

# color must be mapped inside aes(); outside it, the argument is silently ignored
ggplot(data = heart, aes(x = HeartDisease, y = Number_vessels_fluro, color = HeartDisease)) + 
  geom_boxplot()

Boxplot for number of fluorescent vessels based on heart disease group

t.test(Number_vessels_fluro ~ HeartDisease, data = heart, alternative = 'less')

## 
##  Welch Two Sample t-test
## 
## data:  Number_vessels_fluro by HeartDisease
## t = -7.9751, df = 190.58, p-value = 6.868e-14
## alternative hypothesis: true difference in means between group Absence and group Presence is less than 0
## 95 percent confidence interval:
##       -Inf -0.684403
## sample estimates:
##  mean in group Absence mean in group Presence 
##              0.2866667              1.1500000
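The direction of a one-sided test in the formula interface depends on the order of the grouping levels: `t.test()` estimates the first level minus the second. Because "Absence" sorts before "Presence", `alternative = "less"` matches the alternative hypothesis that the Presence mean is higher. The following minimal sketch illustrates this on made-up toy data, not the heart dataset; `toy`, `group`, and `vessels` are illustrative names.

```r
# Hypothetical toy data (NOT the heart dataset): 5 people per group
toy <- data.frame(
  group   = factor(rep(c("Absence", "Presence"), each = 5)),
  vessels = c(0, 0, 0, 1, 0,   1, 2, 1, 3, 0)
)

# Level order decides the sign of the difference: first level minus second
levels(toy$group)

# H_A: mean(Absence) - mean(Presence) < 0, i.e. the Presence mean is higher
res <- t.test(vessels ~ group, data = toy, alternative = "less")

res$estimate   # group means, in level order
res$statistic  # negative when the first group's mean is lower
```

If the levels were ordered the other way, the same alternative hypothesis would require `alternative = "greater"`.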

Conclusion: Since the p-value (6.868e-14) is less than 0.05, I reject the null hypothesis that there is no difference in the mean number of fluorescent blood vessels between those who do and do not have heart disease, at a significance level of 0.05. If the two groups truly had the same mean, a difference at least as large as the one in our sample would be extremely unlikely. This test provides evidence that people with heart disease have a higher mean number of fluorescent blood vessels. However, we cannot draw a causal conclusion from this test because the data are observational and did not come from a designed experiment.

Angiography may be useful as a screening tool. The mean in the group without heart disease is about 0.29, well below one vessel, which tells us that the majority of people without heart disease do not have any fluorescent blood vessels. Someone with at least one fluorescent vessel may therefore warrant screening for heart disease.

Multiple Linear Regression Model with more than 2 variables

The motivation for this multiple linear regression is to see how age and cholesterol together relate to the response variable in the fitted model, blood pressure (BP). Since both age (p = 5.03e-05) and cholesterol (p = 0.0486) have p-values less than 0.05, both are statistically significant predictors of blood pressure and should be kept in the linear model. Since the coefficients for both variables are positive, increases in age and cholesterol are each associated with higher blood pressure, holding the other variable fixed. However, the R-squared of 0.088 shows that the model explains only about 9% of the variation in blood pressure, so these two variables alone are weak predictors.

mult_lines <- lm(BP ~ Age + Cholesterol, data=heart)
mult_lines

## 
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
## 
## Coefficients:
## (Intercept)          Age  Cholesterol  
##    94.74812      0.48421      0.04101

summary(mult_lines)

## 
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.453 -11.543  -1.160   9.834  66.324 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 94.74812    7.35892  12.875  < 2e-16 ***
## Age          0.48421    0.11748   4.122 5.03e-05 ***
## Cholesterol  0.04101    0.02071   1.981   0.0486 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.12 on 267 degrees of freedom
## Multiple R-squared:  0.08796,    Adjusted R-squared:  0.08113 
## F-statistic: 12.88 on 2 and 267 DF,  p-value: 4.59e-06
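To make the fitted coefficients concrete, the sketch below plugs them into the regression equation by hand. The coefficient values are copied from the output above; the helper `predict_bp()` and the example patient (age 50, cholesterol 200) are hypothetical and only for illustration.

```r
# Coefficients copied from the printed lm() output
coefs <- c(intercept = 94.74812, age = 0.48421, cholesterol = 0.04101)

# Hypothetical helper: predicted BP = intercept + b_age * age + b_chol * cholesterol
predict_bp <- function(age, cholesterol) {
  unname(coefs["intercept"] + coefs["age"] * age + coefs["cholesterol"] * cholesterol)
}

# Predicted blood pressure for a hypothetical 50-year-old with cholesterol 200
predict_bp(50, 200)

# Ten more years of age, cholesterol held fixed, adds 10 * 0.48421 to the prediction
predict_bp(60, 200) - predict_bp(50, 200)
```

With the model object available, `predict(mult_lines, newdata = data.frame(Age = 50, Cholesterol = 200))` should give the same prediction up to rounding of the printed coefficients.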

Reflection

I really enjoyed working with real-world data to run statistical analyses. As we learned in class, statistics is about making inferences about a population based on empirical data. It was fun running tests like the chi-squared test to see if sex is associated with heart disease. Results like these can be used to better society through public awareness, policy action, and better treatment. Knowing that there is an association between sex and heart disease, we can emphasize preventative and diagnostic care for males, who have a higher prevalence of heart disease in this data.

I did not know what an angiography was, so it was interesting to see that the number of vessels it highlights was associated with having heart disease. I was interested in seeing which variables were correlated with having heart disease.

Exploration of variables impacting occurrence of heart disease

AL and ES