Introduction

Hypothesis testing usually starts from a premise that we hope to reject: the null hypothesis. The alternative hypothesis is what we accept when the data provide enough evidence against the null hypothesis. The purpose of the test is to determine whether an observed difference is statistically significant. However, when two groups appear to differ, the difference can also come from random error or sampling variability. A p-value cut-off alone is therefore not enough to judge the association between two groups; the confidence interval should also be reported. In the REVEAL data, we want to know whether habitual alcohol drinking can lead to hepatocellular carcinoma. As a first step towards this question, we test whether the characteristics of the exposed (drinking) and unexposed (non-drinking) groups differ.
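The role of sampling variability can be illustrated with a small simulation. The sketch below is not part of the REVEAL analysis: it draws two groups from the same hypothetical population (the mean of 48, the SD of 10, and the group size of 100 are made-up values), so any observed difference is due to chance alone, and then counts how often a t-test still gives p < 0.05.

```r
# Illustrative simulation; all numbers here are assumptions, not REVEAL data
set.seed(2024)

n_sim       <- 2000  # number of simulated studies
n_per_group <- 100   # subjects per group in each simulated study

p_values <- replicate(n_sim, {
  # Both groups come from the same population, so the null hypothesis is true
  group_a <- rnorm(n_per_group, mean = 48, sd = 10)
  group_b <- rnorm(n_per_group, mean = 48, sd = 10)
  t.test(group_a, group_b)$p.value
})

# Share of simulated studies that look "significant" purely by chance
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

The share should be close to 0.05: a cut-off of 0.05 accepts that roughly 1 in 20 comparisons with no true difference will look significant. The confidence interval adds what the cut-off lacks, namely how large the difference plausibly is.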

The goals of this lab class are listed below:

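# Load the tidyverse (dplyr, ggplot2, ...) used throughout this lab
library("tidyverse")
# Read the cleaned REVEAL data; adjust the path to where the file is stored on your machine
REVEAL <- read.csv("/Users/liupochen_macbook/Desktop/MGH_EPI/clean_data/REVEAL_new.csv")
# Inspect the variables and their types
glimpse(REVEAL)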
## Observations: 23,612
## Variables: 11
## $ X           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ gender      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age         <int> 48, 59, 45, 49, 50, 46, 61, 47, 55, 49, 43, 50, 53, …
## $ smoke       <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1…
## $ drink       <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1…
## $ alt         <int> 12, 32, 28, 24, 53, 19, 15, 14, 18, 23, 13, 15, 43, …
## $ antihcv     <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HBsAg       <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
## $ hcc         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ person_time <dbl> 23.8, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9…

Chi-square test

table(REVEAL$gender, REVEAL$drink)
##    
##         0     1
##   0 11667    68
##   1  9432  2445
chisq.test(table(REVEAL$gender, REVEAL$drink))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(REVEAL$gender, REVEAL$drink)
## X-squared = 2482.2, df = 1, p-value < 2.2e-16
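
The test above shows that the association is unlikely to be due to chance, but not how large it is. As a sketch, the counts printed by table() can be re-entered by hand to look at the proportion of drinkers in each gender group and a confidence interval for their difference. The object names (tab, res, pt) are chosen here for illustration, the matrix labels are simply the 0/1 codes from the data, and prop.test() is only one of several ways to obtain such an interval.

```r
# Counts copied from the table() output above: rows = gender, columns = drink
tab <- matrix(c(11667,   68,
                 9432, 2445),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("0", "1"), drink = c("0", "1")))

# Same test as in the lab; this should reproduce X-squared = 2482.2
res <- chisq.test(tab)
res$statistic

# Expected counts under the null hypothesis of no association
res$expected

# Proportion of drinkers within each gender group (row proportions)
prop.table(tab, margin = 1)

# Difference in the proportion of drinkers (gender 1 minus gender 0) with a 95% CI
pt <- prop.test(x = c(tab["1", "1"], tab["0", "1"]),
                n = c(sum(tab["1", ]), sum(tab["0", ])))
pt$estimate
pt$conf.int
```

Reporting the two proportions and the interval for their difference alongside the p-value follows the point made in the introduction: the cut-off alone does not describe the size of the association.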

Exercise 1

Please use chisq.test() to test the association between drink and each of the other binary variables.

T-test

ggplot(REVEAL)+
  geom_boxplot(aes(x=drink, y=age, group=drink))

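# Split the data by drinking status
drink <- REVEAL %>%
  filter(drink == 1)
non_drink <- REVEAL %>%
  filter(drink == 0)

# Welch two-sample t-test comparing mean age between the two groups
t.test(drink$age, non_drink$age)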
## 
##  Welch Two Sample t-test
## 
## data:  drink$age and non_drink$age
## t = 5.3617, df = 3122.1, p-value = 8.843e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7266102 1.5644121
## sample estimates:
## mean of x mean of y 
##  48.35893  47.21342
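
To see where the numbers in this output come from, the Welch statistic, its degrees of freedom, and the confidence interval can be computed by hand. The sketch below uses simulated ages (the group sizes, means, and SDs are made up for illustration and are not the REVEAL values) and checks the manual result against t.test().

```r
# Simulated data for illustration only; not the REVEAL values
set.seed(1)
age_drink     <- rnorm(300, mean = 48, sd = 10)
age_non_drink <- rnorm(500, mean = 47, sd = 9)

# Group summaries: size, mean, and variance
n1 <- length(age_drink);     m1 <- mean(age_drink);     v1 <- var(age_drink)
n2 <- length(age_non_drink); m2 <- mean(age_non_drink); v2 <- var(age_non_drink)

# Welch t statistic: difference in means divided by its standard error,
# without assuming equal variances in the two groups
se     <- sqrt(v1 / n1 + v2 / n2)
t_stat <- (m1 - m2) / se

# Welch-Satterthwaite approximation for the degrees of freedom
df <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# 95% confidence interval for the difference in means
ci <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se

# Compare with the built-in test
fit <- t.test(age_drink, age_non_drink)
c(manual = t_stat, t.test = unname(fit$statistic))
rbind(manual = ci, t.test = as.numeric(fit$conf.int))
```

In the REVEAL output above, the interval from 0.73 to 1.56 years excludes 0, which agrees with the small p-value, and it also shows that the age difference between drinkers and non-drinkers is only about one year.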

Exercise 2

Please use t.test() to compare the other continuous variables between drinkers and non-drinkers.

Homework: Table 1: Characteristics of the study population

table1.png

Please use the REVEAL data to calculate the descriptive statistics for the characteristics in the table above and fill in the table.
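
As a starting point, grouped summaries can be computed with dplyr. The sketch below uses a small made-up data frame with REVEAL-style column names rather than the real data. Which statistics the table asks for is an assumption here (mean and SD for a continuous variable, count and percentage for a binary one), and the object names are hypothetical.

```r
library(dplyr)

# Made-up toy data using REVEAL-style column names; not real study data
toy <- data.frame(
  drink  = c(1, 1, 1, 0, 0, 0, 0, 0),
  age    = c(48, 59, 45, 49, 50, 46, 61, 47),
  gender = c(1, 1, 1, 0, 1, 0, 0, 1)
)

# Continuous variable: group size, mean, and SD by exposure group
age_summary <- toy %>%
  group_by(drink) %>%
  summarise(n = n(), mean_age = mean(age), sd_age = sd(age))
age_summary

# Binary variable: count and percentage of gender == 1 by exposure group
gender_summary <- toy %>%
  group_by(drink) %>%
  summarise(n_gender1 = sum(gender == 1),
            pct_gender1 = 100 * mean(gender == 1))
gender_summary
```

The same pattern can be applied to the REVEAL data frame by replacing toy with REVEAL and choosing the variables listed in the table.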