Hypothesis testing usually starts from a premise that we expect to be false: the null hypothesis. The alternative hypothesis is the statement we adopt if the data lead us to reject the null hypothesis. The purpose of the test is to determine whether an observed difference is statistically significant. However, when two groups appear to differ, the difference can also come from random error or sampling variability. Therefore, a significance cut-off alone is not enough to decide whether there is an association between two groups; the confidence interval should also be reported.
In the REVEAL data, we want to know whether habitual alcohol drinking can lead to hepatocellular carcinoma. To get more clues about this hypothesis, we can test whether the characteristics of the study population differ between the exposed (drinking) and unexposed (non-drinking) groups.
The goals of this lab class are listed below:
Test the binary variables and continuous variables
Discuss the results and make “Table 1: Characteristics of the study population”
library("tidyverse")
REVEAL <- read.csv("/Users/liupochen_macbook/Desktop/MGH_EPI/clean_data/REVEAL_new.csv")
glimpse(REVEAL)
## Observations: 23,612
## Variables: 11
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ gender <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age <int> 48, 59, 45, 49, 50, 46, 61, 47, 55, 49, 43, 50, 53, …
## $ smoke <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1…
## $ drink <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1…
## $ alt <int> 12, 32, 28, 24, 53, 19, 15, 14, 18, 23, 13, 15, 43, …
## $ antihcv <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HBsAg <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
## $ hcc <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ person_time <dbl> 23.8, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9…
table(REVEAL$gender, REVEAL$drink)
##    
##         0     1
##   0 11667    68
##   1  9432  2445
chisq.test(table(REVEAL$gender, REVEAL$drink))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(REVEAL$gender, REVEAL$drink)
## X-squared = 2482.2, df = 1, p-value < 2.2e-16
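The chi-squared test reports only a test statistic and a p-value. Since a confidence interval should accompany the test, one possible sketch is to compare the proportion of drinkers between the two gender groups with prop.test(), which also returns a 95% confidence interval for the difference in proportions. The counts are copied from the table above; the object names (drinkers, group_n, res) are illustrative and not part of the original lab.

```r
# Counts copied from table(REVEAL$gender, REVEAL$drink) above
drinkers <- c(gender1 = 2445, gender0 = 68)                 # drink == 1 in each gender group
group_n  <- c(gender1 = 9432 + 2445, gender0 = 11667 + 68)  # row totals of the table

# Two-sample test of equal proportions (continuity correction is the default)
res <- prop.test(x = drinkers, n = group_n)

res$estimate   # proportion of drinkers in each gender group
res$conf.int   # 95% confidence interval for the difference in proportions
res$p.value
```

The interval describes how large the difference in drinking prevalence between the two gender groups plausibly is, which the p-value alone does not show.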
Please use chisq.test() to test the other binary variables against drink.
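As a starting point for this exercise, the same test can be repeated over several variables at once. The sketch below assumes that smoke, antihcv, HBsAg and hcc are coded 0/1, as they appear to be in the glimpse() output; chisq_by_exposure is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper: chi-squared test of each binary variable against the exposure
chisq_by_exposure <- function(data, vars, exposure = "drink") {
  rows <- lapply(vars, function(v) {
    res <- chisq.test(table(data[[v]], data[[exposure]]))
    data.frame(variable  = v,
               statistic = unname(res$statistic),
               df        = unname(res$parameter),
               p_value   = res$p.value)
  })
  do.call(rbind, rows)   # one row per tested variable
}

# Runs only if the REVEAL data frame from the earlier step is loaded
if (exists("REVEAL")) {
  print(chisq_by_exposure(REVEAL, c("smoke", "antihcv", "HBsAg", "hcc")))
}
```

Collecting the results in one data frame makes it easier to copy the p-values into Table 1 later.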
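# Compare the age distribution of drinkers and non-drinkers
ggplot(REVEAL) +
  geom_boxplot(aes(x = drink, y = age, group = drink))
# Split the data by exposure status
drink <- REVEAL %>%
  filter(drink == 1)
non_drink <- REVEAL %>%
  filter(drink == 0)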
t.test(drink$age, non_drink$age)
##
## Welch Two Sample t-test
##
## data: drink$age and non_drink$age
## t = 5.3617, df = 3122.1, p-value = 8.843e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7266102 1.5644121
## sample estimates:
## mean of x mean of y
## 48.35893 47.21342
Please use t.test() to test the other continuous variables.
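One way to approach this exercise is to wrap the comparison in a small function and apply it to each remaining continuous variable. The sketch below assumes that alt and person_time are the other continuous variables, which is how they look in the glimpse() output; ttest_by_exposure is a hypothetical helper, not part of the original lab.

```r
# Hypothetical helper: Welch t-test of each continuous variable by exposure group
ttest_by_exposure <- function(data, vars, exposure = "drink") {
  rows <- lapply(vars, function(v) {
    exposed   <- data[[v]][which(data[[exposure]] == 1)]
    unexposed <- data[[v]][which(data[[exposure]] == 0)]
    res <- t.test(exposed, unexposed)   # Welch test is the default in t.test()
    data.frame(variable       = v,
               mean_exposed   = mean(exposed, na.rm = TRUE),
               mean_unexposed = mean(unexposed, na.rm = TRUE),
               ci_lower       = res$conf.int[1],
               ci_upper       = res$conf.int[2],
               p_value        = res$p.value)
  })
  do.call(rbind, rows)   # one row per tested variable
}

# Runs only if the REVEAL data frame from the earlier step is loaded
if (exists("REVEAL")) {
  print(ttest_by_exposure(REVEAL, c("alt", "person_time")))
}
```

The confidence interval columns refer to the difference in means (exposed minus unexposed), matching the order used in t.test(drink$age, non_drink$age).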
Please use the REVEAL data to calculate the descriptive statistics for the characteristics in the table above, and fill in the table.
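A minimal sketch of how the per-group descriptive statistics could be computed with dplyr is given here. The rows of Table 1 are not reproduced in this text, so age and smoke are used purely as illustrative characteristics; describe_by_exposure and the output column names are hypothetical.

```r
library(dplyr)

# Hypothetical helper: summaries by drinking status for a Table 1 draft
describe_by_exposure <- function(data) {
  data %>%
    group_by(drink) %>%
    summarise(
      n         = n(),                                # group size
      age_mean  = mean(age, na.rm = TRUE),            # continuous: mean
      age_sd    = sd(age, na.rm = TRUE),              # continuous: standard deviation
      smoke_pct = 100 * mean(smoke == 1, na.rm = TRUE),  # binary: percent coded 1
      .groups   = "drop"
    )
}

# Runs only if the REVEAL data frame from the earlier step is loaded
if (exists("REVEAL")) {
  print(describe_by_exposure(REVEAL))
}
```

Continuous characteristics are summarised here as mean and standard deviation and binary ones as a percentage; other characteristics can be added to summarise() in the same way.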