Import Data

heart <- read.csv("heart.csv")

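library(dplyr)    # filter() and the %>% pipe
library(ggplot2)  # ggplot() bar chart
library(plotly)   # plot_ly() interactive plots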

Hypothesis Test: Chi-Squared Test of Independence between Sex and Heart Disease

Analysis: The null hypothesis is that there is no relationship between sex and heart disease. The p-value is far smaller than the significance level of 0.05, so we reject the null hypothesis. There is statistical evidence of an association between sex and heart disease.

# Sex is coded 1 = male, 0 = female, so the column sum counts the males
male <- sum(heart$Sex)
female <- nrow(heart) - male

# Dimensions of the full data set: 270 rows in total
dim(heart)
## [1] 270  14
# Patients with heart disease: 120 rows in total
disease_yes <- filter(heart, HeartDisease=="Presence")
dim(disease_yes)
## [1] 120  14
# Heart disease present, by sex: 20 females (0) and 100 males (1)
table(disease_yes$Sex)
## 
##   0   1 
##  20 100
# Patients without heart disease: 150 rows in total
disease_no <- filter(heart, HeartDisease=="Absence")
dim(disease_no)
## [1] 150  14
# Heart disease absent, by sex: 67 females (0) and 83 males (1)
table(disease_no$Sex)
## 
##  0  1 
## 67 83
# 2x2 contingency table of the counts above: rows = disease status, columns = sex
disease_data <- matrix(c(100, 20, 83, 67), ncol=2, byrow=TRUE)
sex <- c("Male", "Female")
disease <- c("yes disease","no disease")
chi <- data.frame(disease_data)
colnames(chi) <- sex
rownames(chi) <- disease

chi
##             Male Female
## yes disease  100     20
## no disease    83     67
chi %>%
  chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 22.667, df = 1, p-value = 1.926e-06
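
The reported statistic can be reproduced by hand from the four counts in the table, which is a useful check that they were entered correctly. The following is a minimal sketch that assumes only those counts; the names `observed`, `expected`, `x2`, and `p_value` are illustrative and not part of the original analysis.

```r
# Observed counts copied from the table (rows: disease status, columns: sex)
observed <- matrix(c(100, 20, 83, 67), ncol = 2, byrow = TRUE,
                   dimnames = list(c("yes disease", "no disease"),
                                   c("Male", "Female")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic with Yates' continuity correction,
# which chisq.test() applies to 2x2 tables by default
x2 <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail p-value on (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(x2, df = 1, lower.tail = FALSE)

x2
p_value
```

Both values should agree with the `chisq.test()` output up to rounding.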

Visualizations

disease_data <- matrix(c(20, 67, 100, 83), 
                   ncol=2, byrow=TRUE)
sex_f <- rep(rep(c("female", "male"), 2), times = disease_data)
disease_f <- rep(rep(c("present","absent"), each = 2), times = disease_data)
plot_this <- data.frame(sex_f, disease_f)


ggplot(plot_this) +
  geom_bar(aes(x = disease_f, fill = sex_f), position = "fill") +
  xlab("heart disease") +
  ylab("proportion") +  # position = "fill" scales each bar to 1, so y is a proportion
  labs(fill = "sex")
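
The filled bar chart shows, for each disease status, the share of patients of each sex. As a rough sketch, the same proportions can be computed directly from the counts; the names `counts` and `props` are illustrative and not part of the original analysis.

```r
# Counts copied from the analysis (rows: sex, columns: disease status)
counts <- matrix(c(20, 67, 100, 83), ncol = 2, byrow = TRUE,
                 dimnames = list(sex = c("female", "male"),
                                 disease = c("present", "absent")))

# Proportion of each sex within each disease status (each column sums to 1)
props <- prop.table(counts, margin = 2)
round(props, 3)
```

Males make up 100/120 of the patients with heart disease but only 83/150 of those without, which is the difference the two bars display.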

Multiple Linear Regression Model with more than 2 variables

mult_lines <- lm(BP ~ Age + Cholesterol, data=heart)
mult_lines
## 
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
## 
## Coefficients:
## (Intercept)          Age  Cholesterol  
##    94.74812      0.48421      0.04101
summary(mult_lines)
## 
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.453 -11.543  -1.160   9.834  66.324 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 94.74812    7.35892  12.875  < 2e-16 ***
## Age          0.48421    0.11748   4.122 5.03e-05 ***
## Cholesterol  0.04101    0.02071   1.981   0.0486 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.12 on 267 degrees of freedom
## Multiple R-squared:  0.08796,    Adjusted R-squared:  0.08113 
## F-statistic: 12.88 on 2 and 267 DF,  p-value: 4.59e-06
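
To read the fitted model concretely, the coefficients can be plugged into the regression equation, and the adjusted R-squared can be recovered from the reported R-squared. This sketch copies the numbers from the output above; the patient values (age 50, cholesterol 250) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary output above
b0 <- 94.74812
b_age <- 0.48421
b_chol <- 0.04101

# Predicted BP for a hypothetical patient (illustrative values, not from the data)
age <- 50
chol <- 250
predicted_bp <- b0 + b_age * age + b_chol * chol
predicted_bp

# Adjusted R-squared from R-squared, with n = 270 observations and p = 2 predictors
r2 <- 0.08796
n <- 270
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2
```

With the data loaded, `predict(mult_lines, newdata = data.frame(Age = 50, Cholesterol = 250))` should give the same prediction up to rounding. The R-squared of about 0.09 indicates that age and cholesterol together explain only a small share of the variation in blood pressure, even though both coefficients are significant at the 0.05 level.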

Hypothesis test: t-test for difference in means

The motivation for this test is to determine whether people with and without heart disease have different cholesterol levels, since high cholesterol is linked to a higher risk of heart disease.

\(H_0\): people with and without heart disease have the same mean cholesterol level

\(H_A\): people with heart disease have a higher mean cholesterol level than those without heart disease

t.test(Cholesterol ~ HeartDisease, data = heart, alternative = 'less')
## 
##  Welch Two Sample t-test
## 
## data:  Cholesterol by HeartDisease
## t = -1.9715, df = 265.06, p-value = 0.02485
## alternative hypothesis: true difference in means between group Absence and group Presence is less than 0
## 95 percent confidence interval:
##       -Inf -1.994333
## sample estimates:
##  mean in group Absence mean in group Presence 
##               244.2133               256.4667

Conclusion: Since the p-value (0.02485) is less than 0.05, I reject the null hypothesis that there is no difference in mean cholesterol level between those who do and do not have heart disease. If the two population means were equal, a sample difference at least this large in this direction would occur only about 2.5% of the time. This test provides evidence that people with heart disease have a higher mean cholesterol level, although we cannot conclude a causal relationship because the data do not come from a designed experiment.
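
The p-value and the confidence bound can be recovered from the reported t statistic, degrees of freedom, and group means, which makes the role of `alternative = 'less'` explicit. This is a sketch using only the rounded numbers printed in the output; the variable names are illustrative, and small discrepancies from rounding are expected.

```r
# Values copied from the t-test output above
t_stat <- -1.9715
df <- 265.06
mean_absence <- 244.2133
mean_presence <- 256.4667

# One-sided p-value for alternative = 'less': lower tail of the t distribution
p_value <- pt(t_stat, df = df)

# Standard error implied by the reported difference in means and t statistic
diff_means <- mean_absence - mean_presence
se <- diff_means / t_stat

# Upper bound of the one-sided 95% confidence interval for (Absence - Presence)
upper <- diff_means + qt(0.95, df = df) * se

p_value
upper
```

Because the formula interface orders the groups as Absence then Presence, `'less'` tests whether the Absence mean is below the Presence mean, which matches \(H_A\).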

Interactive Visualizations: Blood Pressure by Age and Heart Disease Status

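# Scatter plot of blood pressure against age, colored by heart disease status
scatter_fig <- plot_ly(heart, x = ~Age, y = ~BP, text = ~HeartDisease,
                       type = 'scatter', mode = 'markers', color = ~HeartDisease)

scatter_fig <- scatter_fig %>% layout(title = 'Age v Blood Pressure for patients at the Cleveland Clinic',
         xaxis = list(showgrid = FALSE),
         yaxis = list(showgrid = FALSE))
scatter_fig

# Box plot of blood pressure by heart disease status
# (stored under its own name so it does not overwrite the scatter plot)
box_fig <- plot_ly(heart, y = ~BP, type = "box", color = ~HeartDisease)
box_fig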