In the previous class, we can measure the disease frequency (incidence risk and incidence rate) to simply understand the profile for the population. However, the effect of the risk factors can not be elaborated by only analyzing the single group (case group). Comparing the risk factors among two or more exposure groups can help us to clarify the relationship between exposures and outcomes in our data, further, generate the hypotheses. In this lab class, we are going to calculate the risk, rate, risk ratio and rate ratio for every risk factors in the REVEAL dataset.
The goals of this lab class are listed below:
Create the table for the calculation of incidence proportion and incidence rate
Calculate the incidence risk ratio and the incidence rate ratio
To draw a painting, we have to prepare the brushes, palettes, pigments, canvas, etc. The same as exploring data, we need to do some preparation. Firstly, please load the tidyverse package and import the REVEAL data.
library("tidyverse")
REVEAL<- read.csv("/Users/liupochen_macbook/Desktop/MGH_EPI/clean_data/REVEAL_new.csv")
glimpse(REVEAL)
## Observations: 23,612
## Variables: 11
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 1…
## $ gender <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ age <int> 48, 59, 45, 49, 50, 46, 61, 47, 55, 49, 43, 50, 53, …
## $ smoke <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1…
## $ drink <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1…
## $ alt <int> 12, 32, 28, 24, 53, 19, 15, 14, 18, 23, 13, 15, 43, …
## $ antihcv <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HBsAg <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
## $ hcc <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ person_time <dbl> 23.8, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 23.9…
We paired to use the group_by() and summarize() function to change the units of analysis from the original dataset to an individual table. Table with groups created by group_by() will result in one row in the output for each group. The variable count create by summarize will store the count results of the function n().
risk_tbl<- REVEAL%>%
group_by(drink, hcc)%>%
summarize(count=n())
risk_tbl
## # A tibble: 4 x 3
## # Groups: drink [2]
## drink hcc count
## <int> <int> <int>
## 1 0 0 20516
## 2 0 1 583
## 3 1 0 2377
## 4 1 1 136
Why not create a two by two table? The main reason is that R typically deals with the dataset by the vector, column. Therefore, as we counting the numbers of people in each condition, the same type of data which come from the same calculation will go inside the same column.
However, we still can use the other functions to generate the 2x2 table, here I only provide one of the method, function table().
table(REVEAL$drink, REVEAL$hcc)
##
## 0 1
## 0 20516 583
## 1 2377 136
Unfortunately for the reason above, we have to calculate the incidence proportion by hand.
#Incidence proportion in the drink group
136/(136+2377)
## [1] 0.05411858
#Incidence proportion in the non-drink group
583/(583+20516)
## [1] 0.02763164
#Risk ratio
risk_ratio<- (136/(136+2377))/(583/(583+20516))
risk_ratio
## [1] 1.958573
Please fill in the correct answer in the following sentence.
Interpretation: The risk of HCC among those who drink was ___ times as high as the risk of HCC among those who did not drink.
Please create an incidence proportion table for the smoke risk factor then calculate the incidence proportion for both smoke and non-smoke group. After you successfully calculate the incidence proportion for both two groups, can you make some interpretation about it?
Hint
## # A tibble: 2 x 3
## Group Non_HCC HCC
## <chr> <dbl> <dbl>
## 1 Non-smoke 16366 419
## 2 Smoke 6527 300
95% confidence interval of risk ratio
se_risk<- 1.96*sqrt(1/136-1/2513+1/583-1/21099)
#se_risk<- 1.96*sqrt((2377/136)/2513+(20516/583)/21099)
se_risk
## [1] 0.1820046
#95% confidence interval upper limit for risk ratio
ul_risk<- exp(log(risk_ratio)+se_risk)
ul_risk
## [1] 2.349543
#95% confidence interval lower limit for risk ratio
ll_risk<- exp(log(risk_ratio)-se_risk)
ll_risk
## [1] 1.632662
Exercise
Please fill in the correct answer in the following sentence.
Interpretation: We are 95% confident that the relative risk of HCC among those who drink compared to HCC who do not drink is between 1.632 and 2.349. The null value is ___.
Since the 95% confidence interval does not include the null value, the finding is ___.
Concept of confidence interval
Confidence interval only deals with random error, do not take the known or unknown biases or confounding into account, therefore we must consider whether the result is misleading.
The filter() function can return the matching condition results from the dataset. This extremely useful function allows us to subset the data based on their values. You may notice that in the filter(), we use == to set the matching query but except =. You can test for using = in the filter() function, what would be going on?
rate_tbl<-REVEAL %>%
group_by(drink)%>%
summarize(total_hcc=sum(hcc),
total_time=sum(person_time),
rate=total_hcc/total_time)
rate_tbl
## # A tibble: 2 x 4
## drink total_hcc total_time rate
## <int> <int> <dbl> <dbl>
## 1 0 583 479373. 0.00122
## 2 1 136 56322. 0.00241
The process for creating the rate table is in line with the data transform nature. We can create a new variable named rate by calculating the incidence rate with summarise(rate=hcc/pt).
rate_ratio<- 0.002414/0.001216
rate_ratio
## [1] 1.985197
Please create an incidence rate table for the smoke risk factor then calculate the incidence rate ratio.
95% confidence interval of rate ratio
se_rate<- sqrt(1/136+1/583)
se_rate
## [1] 0.09522713
#95% confidence interval upper limit for rate ratio
ul_rate<- exp(log(rate_ratio)+1.96*se_rate)
ul_rate
## [1] 2.392559
#95% confidence interval lower limit for rate ratio
ll_rate<- exp(log(rate_ratio)-1.96*se_rate)
ll_rate
## [1] 1.647194
95% confidence interval of RR
RR_vis<-tibble(group=c("risk_ratio", "rate_ratio"),
RR=c(risk_ratio,rate_ratio),
lowerlimit=c(ll_risk,ll_rate),
upperlimit=c(ul_risk,ul_rate))
RR_vis
## # A tibble: 2 x 4
## group RR lowerlimit upperlimit
## <chr> <dbl> <dbl> <dbl>
## 1 risk_ratio 1.96 1.63 2.35
## 2 rate_ratio 1.99 1.65 2.39
plot<- ggplot(RR_vis,aes(x= RR, y= group, xmin= lowerlimit, xmax= upperlimit))+
geom_point()+
geom_vline(aes(xintercept= 1), col= "#193175", alpha= .5)+
geom_errorbarh(height=.2)+
scale_x_continuous(trans= "log")
plot
## # A tibble: 14 x 4
## group RR lowerlimit upperlimit
## <chr> <dbl> <dbl> <dbl>
## 1 Drink.Risk 1.96 1.63 2.35
## 2 Gender.Risk 2.21 1.89 2.59
## 3 Smoke.Risk 1.76 1.52 2.04
## 4 AntiHCV.Risk 6.33 5.42 7.39
## 5 HBsAg.Risk 6.11 5.29 7.05
## 6 ALT.Risk 4.19 3.60 4.86
## 7 Age.Risk 2.43 2.09 2.83
## 8 Drink.Rate 1.99 1.65 2.39
## 9 Gender.Rate 2.25 1.92 2.63
## 10 Smoke.Rate 1.77 1.53 2.06
## 11 AntiHCV.Rate 6.66 5.65 7.85
## 12 HBsAg.Rate 6.37 5.50 7.38
## 13 ALT.Rate 4.29 3.68 5.00
## 14 Age.Rate 2.47 2.11 2.88