Import Data
heart <- read.csv("heart.csv")
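library(dplyr)    # filter() and the %>% pipe
library(ggplot2)  # ggplot() bar chart
library(plotly)   # plot_ly() interactive plots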
Hypothesis Test: Chi-Squared Test of Independence between Sex and Heart Disease
# Analysis: The null hypothesis is that there is no relationship between sex and heart disease. The p-value (1.926e-06) is far smaller than the alpha value of 0.05, so we reject the null hypothesis. There is statistical evidence of an association between sex and heart disease.
male <- sum(heart$Sex)        # Sex is coded 1 = male, 0 = female
female <- nrow(heart) - male  # remaining rows are female
#how many rows and columns in heart data: total is 270
dim(heart)
## [1] 270 14
#how many rows and columns in those with disease: total is 120
disease_yes <- filter(heart, HeartDisease=="Presence")
dim(disease_yes)
## [1] 120 14
#how many females (0) have heart disease, total is 20. how many males (1) have heart disease, total is 100
table(disease_yes$Sex)
##
## 0 1
## 20 100
#how many rows and columns in those without disease: total is 150
disease_no <- filter(heart, HeartDisease=="Absence")
dim(disease_no)
## [1] 150 14
#how many females (0) do not have heart disease, total is 67. how many males (1) do not have heart disease, total is 83
table(disease_no$Sex)
##
## 0 1
## 67 83
disease_data <- matrix(c(100, 20, 83, 67), ncol=2, byrow=TRUE)
sex <- c("Male", "Female")
disease <- c("yes disease","no disease")
chi <- data.frame(disease_data)
colnames(chi) <- sex
rownames(chi) <- disease
chi
## Male Female
## yes disease 100 20
## no disease 83 67
chi %>%
chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 22.667, df = 1, p-value = 1.926e-06
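The statistic reported above can be reproduced by hand, which makes it clearer what the Yates-corrected test is measuring. The following is a minimal sketch that assumes only the four counts from the table above; the names `observed`, `expected`, `yates_stat`, and `p_value` are illustrative and not part of the original analysis.

```r
# Observed counts copied from the table above (rows: disease status, columns: sex)
observed <- matrix(c(100, 20, 83, 67), ncol = 2, byrow = TRUE,
                   dimnames = list(disease = c("yes disease", "no disease"),
                                   sex = c("Male", "Female")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates continuity correction: shrink each |observed - expected| by 0.5
yates_stat <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail p-value on (2 - 1) * (2 - 1) = 1 degree of freedom
p_value <- pchisq(yates_stat, df = 1, lower.tail = FALSE)

round(expected, 2)
yates_stat
p_value
```

With the data frame loaded, passing `table(heart$HeartDisease, heart$Sex)` to `chisq.test()` should give the same result without retyping the counts.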
Visualizations
disease_data <- matrix(c(20, 67, 100, 83),
ncol=2, byrow=TRUE)
sex_f <- rep(rep(c("female", "male"), 2), times = disease_data)
disease_f <- rep(rep(c("present","absent"), each = 2), times = disease_data)
plot_this <- data.frame(sex_f, disease_f)
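ggplot(plot_this) +
  geom_bar(aes(x = disease_f, fill = sex_f), position = "fill") +
  # with position = "fill" the y axis shows proportions; sex is the fill color
  labs(x = "heart disease", y = "proportion of patients", fill = "sex")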
Multiple Linear Regression Model with more than 2 variables
mult_lines <- lm(BP ~ Age + Cholesterol, data=heart)
mult_lines
##
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
##
## Coefficients:
## (Intercept) Age Cholesterol
## 94.74812 0.48421 0.04101
summary(mult_lines)
##
## Call:
## lm(formula = BP ~ Age + Cholesterol, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.453 -11.543 -1.160 9.834 66.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.74812 7.35892 12.875 < 2e-16 ***
## Age 0.48421 0.11748 4.122 5.03e-05 ***
## Cholesterol 0.04101 0.02071 1.981 0.0486 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.12 on 267 degrees of freedom
## Multiple R-squared: 0.08796, Adjusted R-squared: 0.08113
## F-statistic: 12.88 on 2 and 267 DF, p-value: 4.59e-06
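To read the fitted model concretely, the coefficients can be plugged into the regression equation for a single patient. This is a minimal sketch under stated assumptions: the coefficients are copied from the output above, and the patient (age 50, cholesterol 250) is a hypothetical example, not a row from heart.csv.

```r
# Coefficients copied from the summary output above
coefs <- c(intercept = 94.74812, age = 0.48421, cholesterol = 0.04101)

# Hypothetical patient (illustrative values, not taken from the data)
patient <- c(age = 50, cholesterol = 250)

# Predicted BP = intercept + b_age * Age + b_chol * Cholesterol
predicted_bp <- coefs[["intercept"]] +
  coefs[["age"]] * patient[["age"]] +
  coefs[["cholesterol"]] * patient[["cholesterol"]]

predicted_bp
```

With the fitted object available, `predict(mult_lines, newdata = data.frame(Age = 50, Cholesterol = 250))` should return essentially the same value. Note that the multiple R-squared of 0.08796 means the two predictors together explain only about 9% of the variation in BP, so individual predictions carry wide uncertainty.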
Hypothesis Test: Two-Sample t-Test of Cholesterol by Heart Disease Status
The motivation for this test is to determine whether people with and without heart disease have different cholesterol levels, since high cholesterol is linked to a higher risk of heart disease.
\(H_0\): people with and without heart disease have the same mean cholesterol level
\(H_A\): people with heart disease have a higher mean cholesterol level than those without heart disease
t.test(Cholesterol ~ HeartDisease, data = heart, alternative = 'less')
##
## Welch Two Sample t-test
##
## data: Cholesterol by HeartDisease
## t = -1.9715, df = 265.06, p-value = 0.02485
## alternative hypothesis: true difference in means between group Absence and group Presence is less than 0
## 95 percent confidence interval:
## -Inf -1.994333
## sample estimates:
## mean in group Absence mean in group Presence
## 244.2133 256.4667
Conclusion: Since the p-value (0.02485) is less than 0.05, I reject the null hypothesis that mean cholesterol is the same for those who do and do not have heart disease. If the null hypothesis were true, a difference at least this large in the hypothesized direction would arise in only about 2.5% of samples. This test provides evidence that people with heart disease have a higher mean cholesterol level, although we cannot infer a causal relationship because the data do not come from a randomized experiment.
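As a consistency check, the p-value and the upper confidence bound can be recovered from the numbers printed in the t-test output. This is a rough sketch that assumes only those rounded values (group means, t statistic, and Welch degrees of freedom), so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the Welch t-test output above
mean_absence  <- 244.2133
mean_presence <- 256.4667
t_stat        <- -1.9715
df            <- 265.06

# Difference in means in the order R uses: Absence minus Presence
diff_means <- mean_absence - mean_presence

# Standard error implied by t = diff_means / std_error
std_error <- diff_means / t_stat

# One-sided (alternative = "less") p-value: lower tail of the t distribution
p_value <- pt(t_stat, df = df)

# Upper end of the one-sided 95% confidence interval
upper_bound <- diff_means + qt(0.95, df = df) * std_error

p_value
upper_bound
```

The upper bound sits just below zero, which agrees with rejecting the null hypothesis at the 0.05 level but also shows that the evidence is not overwhelming.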
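# Scatter plot of age against blood pressure, colored by heart disease status
fig_scatter <- plot_ly(heart, x = ~Age, y = ~BP, text = ~HeartDisease, type = 'scatter',
                       mode = 'markers', color = ~HeartDisease)
fig_scatter <- fig_scatter %>% layout(title = 'Age v Blood Pressure for patients at the Cleveland Clinic',
                                      xaxis = list(showgrid = FALSE),
                                      yaxis = list(showgrid = FALSE))
fig_scatter
# Box plot of blood pressure by heart disease status
# (stored under its own name so it does not overwrite the scatter plot)
fig_box <- plot_ly(heart, y = ~BP, type = "box", color = ~HeartDisease)
fig_box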