Introduction

In the previous class, we can measure the disease frequency (incidence risk and incidence rate) to simply understand the profile for the population. However, the effect of the risk factors can not be elaborated by only analyzing the single group (case group). Comparing the risk factors among two or more exposure groups can help us to clarify the relationship between exposures and outcomes in our data, further, generate the hypotheses. In this lab class, we are going to calculate the risk, rate, risk ratio and rate ratio for every risk factors in the REVEAL dataset.

The goals of this lab class are listed below:

To draw a painting, we have to prepare the brushes, palettes, pigments, canvas, etc. The same as exploring data, we need to do some preparation. Firstly, please load the tidyverse package and import the REVEAL data.

library("tidyverse")
REVEAL<- read.csv("/Users/liupochen_macbook/Desktop/MGH_EPI/clean_data/REVEAL_new.csv")
glimpse(REVEAL)
## Observations: 23,612
## Variables: 11
## $ X           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ gender      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age         <int> 48, 59, 45, 49, 50, 46, 61, 47, 55, 49, 43, 50, 53, …
## $ smoke       <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1…
## $ drink       <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1…
## $ alt         <int> 12, 32, 28, 24, 53, 19, 15, 14, 18, 23, 13, 15, 43, …
## $ antihcv     <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HBsAg       <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
## $ hcc         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ person_time <dbl> 23.8, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9…

Incidence proportion and risk ratio

Create a table for incidence proportion

We paired to use the group_by() and summarize() function to change the units of analysis from the original dataset to an individual table. Table with groups created by group_by() will result in one row in the output for each group. The variable count create by summarize will store the count results of the function n().

risk_tbl<- REVEAL%>%
  group_by(drink, hcc)%>%
  summarize(count=n())
risk_tbl
## # A tibble: 4 x 3
## # Groups:   drink [2]
##   drink   hcc count
##   <int> <int> <int>
## 1     0     0 20516
## 2     0     1   583
## 3     1     0  2377
## 4     1     1   136

Why not create a two by two table? The main reason is that R typically deals with the dataset by the vector, column. Therefore, as we counting the numbers of people in each condition, the same type of data which come from the same calculation will go inside the same column.

However, we still can use the other functions to generate the 2x2 table, here I only provide one of the method, function table().

table(REVEAL$drink, REVEAL$hcc)
##    
##         0     1
##   0 20516   583
##   1  2377   136

Unfortunately for the reason above, we have to calculate the incidence proportion by hand.

Calculate the incidence proportion and incidence risk ratio

#Incidence proportion in the drink group
136/(136+2377)
## [1] 0.05411858
#Incidence proportion in the non-drink group
583/(583+20516)
## [1] 0.02763164
#Risk ratio
risk_ratio<- (136/(136+2377))/(583/(583+20516))
risk_ratio
## [1] 1.958573

Exercise 1

Please fill in the correct answer in the following sentence.

Interpretation: The risk of HCC among those who drink was ___ times as high as the risk of HCC among those who did not drink.

Exercise 2

Please create an incidence proportion table for the smoke risk factor then calculate the incidence proportion for both smoke and non-smoke group. After you successfully calculate the incidence proportion for both two groups, can you make some interpretation about it?

Hint

## # A tibble: 2 x 3
##   Group     Non_HCC   HCC
##   <chr>       <dbl> <dbl>
## 1 Non-smoke   16366   419
## 2 Smoke        6527   300

95% confidence interval of risk ratio

se_risk<- 1.96*sqrt(1/136-1/2513+1/583-1/21099)
#se_risk<- 1.96*sqrt((2377/136)/2513+(20516/583)/21099)
se_risk
## [1] 0.1820046
#95% confidence interval upper limit for risk ratio
ul_risk<- exp(log(risk_ratio)+se_risk)
ul_risk
## [1] 2.349543
#95% confidence interval lower limit for risk ratio
ll_risk<- exp(log(risk_ratio)-se_risk)
ll_risk
## [1] 1.632662

Exercise

Please fill in the correct answer in the following sentence.

Interpretation: We are 95% confident that the relative risk of HCC among those who drink compared to HCC who do not drink is between 1.632 and 2.349. The null value is ___.

Since the 95% confidence interval does not include the null value, the finding is ___.

Concept of confidence interval

Confidence interval only deals with random error, do not take the known or unknown biases or confounding into account, therefore we must consider whether the result is misleading.

Incidence rate and rate ratio

Create a table for incidence rate

The filter() function can return the matching condition results from the dataset. This extremely useful function allows us to subset the data based on their values. You may notice that in the filter(), we use == to set the matching query but except =. You can test for using = in the filter() function, what would be going on?

rate_tbl<-REVEAL %>%
  group_by(drink)%>%
  summarize(total_hcc=sum(hcc),
            total_time=sum(person_time),
            rate=total_hcc/total_time)
rate_tbl
## # A tibble: 2 x 4
##   drink total_hcc total_time    rate
##   <int>     <int>      <dbl>   <dbl>
## 1     0       583    479373. 0.00122
## 2     1       136     56322. 0.00241

The process for creating the rate table is in line with the data transform nature. We can create a new variable named rate by calculating the incidence rate with summarise(rate=hcc/pt).

Calculate the incidence rate ratio

rate_ratio<- 0.002414/0.001216
rate_ratio
## [1] 1.985197

Exercise 3

Please create an incidence rate table for the smoke risk factor then calculate the incidence rate ratio.

95% confidence interval of rate ratio

se_rate<- sqrt(1/136+1/583)
se_rate
## [1] 0.09522713
#95% confidence interval upper limit for rate ratio
ul_rate<- exp(log(rate_ratio)+1.96*se_rate)
ul_rate
## [1] 2.392559
#95% confidence interval lower limit for rate ratio
ll_rate<- exp(log(rate_ratio)-1.96*se_rate)
ll_rate
## [1] 1.647194

Data visualization

95% confidence interval of RR

RR_vis<-tibble(group=c("risk_ratio", "rate_ratio"),
               RR=c(risk_ratio,rate_ratio),
               lowerlimit=c(ll_risk,ll_rate),
               upperlimit=c(ul_risk,ul_rate))
RR_vis
## # A tibble: 2 x 4
##   group         RR lowerlimit upperlimit
##   <chr>      <dbl>      <dbl>      <dbl>
## 1 risk_ratio  1.96       1.63       2.35
## 2 rate_ratio  1.99       1.65       2.39
plot<- ggplot(RR_vis,aes(x= RR, y= group, xmin= lowerlimit, xmax= upperlimit))+
       geom_point()+
       geom_vline(aes(xintercept= 1), col= "#193175", alpha= .5)+
       geom_errorbarh(height=.2)+
       scale_x_continuous(trans= "log")
plot

Add the other risk factors

## # A tibble: 14 x 4
##    group           RR lowerlimit upperlimit
##    <chr>        <dbl>      <dbl>      <dbl>
##  1 Drink.Risk    1.96       1.63       2.35
##  2 Gender.Risk   2.21       1.89       2.59
##  3 Smoke.Risk    1.76       1.52       2.04
##  4 AntiHCV.Risk  6.33       5.42       7.39
##  5 HBsAg.Risk    6.11       5.29       7.05
##  6 ALT.Risk      4.19       3.60       4.86
##  7 Age.Risk      2.43       2.09       2.83
##  8 Drink.Rate    1.99       1.65       2.39
##  9 Gender.Rate   2.25       1.92       2.63
## 10 Smoke.Rate    1.77       1.53       2.06
## 11 AntiHCV.Rate  6.66       5.65       7.85
## 12 HBsAg.Rate    6.37       5.50       7.38
## 13 ALT.Rate      4.29       3.68       5.00
## 14 Age.Rate      2.47       2.11       2.88