Example
Out of 350 patients that adhered to a new drug treatment, 273 recovered.The response proportion was 273/350 or 0.78 (78%). We can use the rbinom function to simulate this binomial process. Let’s simulate this 10 times (i.e., 10 replicates).
replicates <- 10
num_of_events <- 273
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
sim_events
## [1] 288 271 266 272 262 262 262 277 268 270
Exercise 5.1 (Bootstrap a confidence interval). In the same study above, 350 patients did not take the new drug treatment, and 289 recovered. Calculate the bootstrap confidence interval for this proportion (289/350).
replicates <- 5000
num_of_events <- 289
num_at_risk <- 350
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
conf.level <- 0.95
alpha <- 1 - conf.level
## 0.95 CI for binomial counts
quantile(sim_events, c(alpha/2, 1-alpha/2))
## 2.5% 97.5%
## 275 303
## 0.95 CI for binomial proportions
quantile(sim_events, c(alpha/2, 1-alpha/2))/num_at_risk
## 2.5% 97.5%
## 0.7857143 0.8657143
Exercise 5.2 (Function for bootstrapping a confidence interval). Adapt your code from the previous problem to write and test a function for calculating the bootstrap confidence interval for a binomial proportion.
boots <- function(x1, x0, conf.level = 0.95, replicates = 5000){
#### x1 = c(number cases among exposed, number of exposed)
#### x0 = c(number cases among nonexposed, number of nonexposed)
n1 <- x1[2]
n0 <- x0[2]
p1 <- x1/n1 # risk among exposed
p0 <- x0/n0 # risk among unexposed
risk <- num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
alpha <- 1 - conf.level
## 0.95 CI for binomial counts
quantile(sim_events, c(alpha/2, 1-alpha/2))
ci<-quantile(sim_events, c(alpha/2, 1-alpha/2))/num_at_risk
list(
conf.int = unname(ci),
conf.level = conf.level,
replicates = replicates)
}
## test
x1<-289
x0<-350
boots(x1, x0, conf.level = 0.95, replicates = 5000)
## $conf.int
## [1] 0.7857143 0.8657143
##
## $conf.level
## [1] 0.95
##
## $replicates
## [1] 5000
Exercise 5.3 (Calculating risks and risk ratio). Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study.
Using the results from Table 5.4 create and test a function that:
Calculates the risks 𝑅1 and 𝑅0
Calculates the risk ratio (𝑅𝑅)
Calculates 95% confidence intervals
Uses the fisher.test function (Fisher’s Exact Test) to calculate the twosided 𝑝-value of the null hypothesis of no association. (Hint: see epitable function in Exercise 5.5)
Collects and returns all the results as a ‘list’.
Tab5.4 <- matrix(c(273,289,77, 61), 2, 2)
rownames(Tab5.4) <- c('Yes', 'No')
colnames(Tab5.4) <- c('Yes', 'No')
dimnames(Tab5.4)<- list( Treatment= c('Yes','No'), Recovery= c('Yes','No') )
epitable <- function(a, b, c, d, digits = 3, ...){
N1=a+c
N0=b+d
R0 <- b/N0
R1 <- a/N1
RR <- R1/R0
OR <- (R1/(1-R1)) / (R0/(1-R0))
ft <- fisher.test(Tab5.4,simulate.p.value=TRUE,B=2000,...)
pv <- ft$p.value
SE.RR<-sqrt((1/a)-(1/N1)+(1/b)-(1/N0))
conf.int<-c(RR-1.96*SE.RR, RR+1.96*SE.RR)
results <- signif(c(R0, R1, RR, OR, pv, conf.int), digits = digits)
names(results) <- c('R_0', 'R_1', 'RR', 'OR', 'p.value', '95 lower ci', '95 upper ci')
list(data = Tab5.4, results = results)
}
epitable(273,289, 77, 61)
## $data
## Recovery
## Treatment Yes No
## Yes 273 77
## No 289 61
##
## $results
## R_0 R_1 RR OR p.value 95 lower ci
## 0.826 0.780 0.945 0.748 0.154 0.871
## 95 upper ci
## 1.020
fisher.test(Tab5.4)
##
## Fisher's Exact Test for Count Data
##
## data: Tab5.4
## p-value = 0.154
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.504886 1.106634
## sample estimates:
## odds ratio
## 0.7486641
Exercise 5.4 (Evaluating drug treatment). Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study.Using the read.csv function import the data frame for Table 5.4. Use the code below.
Analyze the drugrx.df data frame using the xtab function, your function from the previous problem, and any R functions you wish. Is the drug treatment effective? Is there a difference in drug treatment effectiveness comparing men and women?
drugrx.df<- read.csv("/Users/dr.auwal/Desktop/Personals/UC Berkeley/Classes/PB HLTH 251D Applied Epidemiology Using R/Practice/data/drugrx-pearl2.csv")
xTab5.4<- xtabs(~ Drug + Recovered, data = drugrx.df)
fisher.test(xTab5.4)
##
## Fisher's Exact Test for Count Data
##
## data: xTab5.4
## p-value = 0.154
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.504886 1.106634
## sample estimates:
## odds ratio
## 0.7486641
Statistically, it seems the drug is not effective. Since the drug is not effective, its effectiveness comparing both men and women will not be appreciated
Exercise 5.5 (Oswego outbreak investigation). Study this clean version of the Oswego data set and import it as a data frame: ~/data/oswego/oswego2.txt. Remember this data set is available from https://github.com/taragonmd/data. Locate the data set and select the “Raw” tab. Copy the URL and use with the read.csv function. Import this clean version.
oswego2 <- read.csv("https://raw.githubusercontent.com/taragonmd/data/master/oswego/oswego2.txt", strip.white = TRUE)
food.item <- c("Baked.ham", "Spinach", "Mashed.potato", "Cabbage.salad",
"Jello", "Rolls", "Brown.bread", "Milk", "Coffee", "Water", "Cakes",
"Vanilla.ice.cream", "Chocolate.ice.cream", "Fruit.salad")
names(oswego2)[9:22] <- food.item
str(oswego2)
## 'data.frame': 75 obs. of 22 variables:
## $ ID : int 2 3 4 6 7 8 9 10 14 16 ...
## $ Age : int 52 65 59 63 70 40 15 33 10 32 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 2 1 1 2 1 1 1 2 1 ...
## $ MealDate : Factor w/ 1 level "04/18/1940": 1 1 1 1 1 1 1 1 1 1 ...
## $ MealTime : Factor w/ 6 levels "11:00:00","18:30:00",..: 5 2 2 4 4 4 6 3 4 NA ...
## $ Ill : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ OnsetDate : Factor w/ 2 levels "4/18/1940","4/19/1940": 2 2 2 1 1 2 2 1 2 2 ...
## $ OnsetTime : Factor w/ 17 levels "00:00:00","00:30:00",..: 2 2 2 15 15 4 3 16 4 7 ...
## $ Baked.ham : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
## $ Spinach : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
## $ Mashed.potato : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 1 2 1 1 ...
## $ Cabbage.salad : Factor w/ 2 levels "N","Y": 1 2 1 2 1 1 1 1 1 1 ...
## $ Jello : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
## $ Rolls : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 2 ...
## $ Brown.bread : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 1 1 ...
## $ Milk : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Coffee : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 1 1 1 2 ...
## $ Water : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 2 1 1 ...
## $ Cakes : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 1 1 2 ...
## $ Vanilla.ice.cream : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Chocolate.ice.cream: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 2 ...
## $ Fruit.salad : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
Study the following epitable function, especially the use of the … argument to past options to the fisher.test function.
epitable <- function(exposure, outcome, digits = 3, ...){
# 2x2 table from 'xtabs' function
# row names are exposure status
# col names are illness outcome
# digits = significant digits (default = 3)
# ... = options to fisher.test function
x <- xtabs(~ exposure + outcome, na.action=na.pass)
rowtot <- apply(x, 1, sum)
rowdist <- sweep(x, 1, rowtot, '/')
risk0 <- rowdist['N','Y']
risk1 <- rowdist['Y','Y']
rr <- risk1/risk0
or <- (risk1/(1-risk1)) / (risk0/(1-risk0))
ft <- fisher.test(x, ...)
pv <- ft$p.value
results <- signif(c(risk0, risk1, rr, or, pv), digits = digits)
names(results) <- c('R_0', 'R_1', 'RR', 'OR', 'p.value')
list(data = x, results = results)
}
Use the epitable function to test each food item to see if it is associated with developing illness. Construct a table with the following column headings
epitable(oswego2$Baked.ham, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 12 17
## Y 17 29
##
## $results
## R_0 R_1 RR OR p.value
## 0.586 0.630 1.080 1.200 0.809
epitable(oswego2$Spinach, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 12 20
## Y 17 26
##
## $results
## R_0 R_1 RR OR p.value
## 0.625 0.605 0.967 0.918 1.000
epitable(oswego2$Mashed.potato, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 14 23
## Y 14 23
##
## $results
## R_0 R_1 RR OR p.value
## 0.622 0.622 1.000 1.000 1.000
epitable(oswego2$Cabbage.salad, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 19 28
## Y 10 18
##
## $results
## R_0 R_1 RR OR p.value
## 0.596 0.643 1.080 1.220 0.808
epitable(oswego2$Jello, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 22 30
## Y 7 16
##
## $results
## R_0 R_1 RR OR p.value
## 0.577 0.696 1.210 1.680 0.442
epitable(oswego2$Rolls, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 13 25
## Y 16 21
##
## $results
## R_0 R_1 RR OR p.value
## 0.658 0.568 0.863 0.682 0.482
epitable(oswego2$Brown.bread, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 20 28
## Y 9 18
##
## $results
## R_0 R_1 RR OR p.value
## 0.583 0.667 1.140 1.430 0.622
epitable(oswego2$Milk, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 27 44
## Y 2 2
##
## $results
## R_0 R_1 RR OR p.value
## 0.620 0.500 0.807 0.614 0.638
epitable(oswego2$Coffee, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 17 27
## Y 12 19
##
## $results
## R_0 R_1 RR OR p.value
## 0.614 0.613 0.999 0.997 1.000
epitable(oswego2$Water, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 18 33
## Y 11 13
##
## $results
## R_0 R_1 RR OR p.value
## 0.647 0.542 0.837 0.645 0.450
epitable(oswego2$Cakes, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 16 19
## Y 13 27
##
## $results
## R_0 R_1 RR OR p.value
## 0.543 0.675 1.240 1.750 0.342
epitable(oswego2$Vanilla.ice.cream, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 18 3
## Y 11 43
##
## $results
## R_0 R_1 RR OR p.value
## 1.43e-01 7.96e-01 5.57e+00 2.35e+01 2.60e-07
epitable(oswego2$Chocolate.ice.cream, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 7 20
## Y 22 25
##
## $results
## R_0 R_1 RR OR p.value
## 0.7410 0.5320 0.7180 0.3980 0.0891
epitable(oswego2$Fruit.salad, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 27 42
## Y 2 4
##
## $results
## R_0 R_1 RR OR p.value
## 0.609 0.667 1.100 1.290 1.000
The Table
Table <- function(x) {
epitable(x, oswego2$Ill)$results
}
t(sapply(oswego2[, food.item], Table))
## R_0 R_1 RR OR p.value
## Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
## Spinach 0.625 0.605 0.967 0.918 1.00e+00
## Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
## Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
## Jello 0.577 0.696 1.210 1.680 4.42e-01
## Rolls 0.658 0.568 0.863 0.682 4.82e-01
## Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
## Milk 0.620 0.500 0.807 0.614 6.38e-01
## Coffee 0.614 0.613 0.999 0.997 1.00e+00
## Water 0.647 0.542 0.837 0.645 4.50e-01
## Cakes 0.543 0.675 1.240 1.750 3.42e-01
## Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
## Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
## Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
From the list of results, only Vanilla ice cream’s p-value is significant. Therefore Vanilla ice cream may be associated with illness
Exercise 5.6 (Generate outbreak investigation results table). Write a for loop to automate the creation of the outbreak table from previous exercise. Hint provided:
outbreak <- matrix(NA, length(food.item), 5)
rownames(outbreak) <- food.item
colnames(outbreak) <- c('R_0','R_1','RR','OR','P value')
for(i in 1:length(food.item)) {
outbreak[i,] <- epitable(oswego2[,food.item[i]], oswego2$Ill)$results
}
outbreak
## R_0 R_1 RR OR P value
## Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
## Spinach 0.625 0.605 0.967 0.918 1.00e+00
## Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
## Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
## Jello 0.577 0.696 1.210 1.680 4.42e-01
## Rolls 0.658 0.568 0.863 0.682 4.82e-01
## Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
## Milk 0.620 0.500 0.807 0.614 6.38e-01
## Coffee 0.614 0.613 0.999 0.997 1.00e+00
## Water 0.647 0.542 0.837 0.645 4.50e-01
## Cakes 0.543 0.675 1.240 1.750 3.42e-01
## Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
## Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
## Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
Exercise 5.7 (Comparing lapply and sapply). Study the following use of lapply and sapply. Explain how both functions work, and what are their differences.
myepitab <- function(x) {
epitable(x, oswego2$Ill)$results
}
lapply(oswego2[, food.item], myepitab)
## $Baked.ham
## R_0 R_1 RR OR p.value
## 0.586 0.630 1.080 1.200 0.809
##
## $Spinach
## R_0 R_1 RR OR p.value
## 0.625 0.605 0.967 0.918 1.000
##
## $Mashed.potato
## R_0 R_1 RR OR p.value
## 0.622 0.622 1.000 1.000 1.000
##
## $Cabbage.salad
## R_0 R_1 RR OR p.value
## 0.596 0.643 1.080 1.220 0.808
##
## $Jello
## R_0 R_1 RR OR p.value
## 0.577 0.696 1.210 1.680 0.442
##
## $Rolls
## R_0 R_1 RR OR p.value
## 0.658 0.568 0.863 0.682 0.482
##
## $Brown.bread
## R_0 R_1 RR OR p.value
## 0.583 0.667 1.140 1.430 0.622
##
## $Milk
## R_0 R_1 RR OR p.value
## 0.620 0.500 0.807 0.614 0.638
##
## $Coffee
## R_0 R_1 RR OR p.value
## 0.614 0.613 0.999 0.997 1.000
##
## $Water
## R_0 R_1 RR OR p.value
## 0.647 0.542 0.837 0.645 0.450
##
## $Cakes
## R_0 R_1 RR OR p.value
## 0.543 0.675 1.240 1.750 0.342
##
## $Vanilla.ice.cream
## R_0 R_1 RR OR p.value
## 1.43e-01 7.96e-01 5.57e+00 2.35e+01 2.60e-07
##
## $Chocolate.ice.cream
## R_0 R_1 RR OR p.value
## 0.7410 0.5320 0.7180 0.3980 0.0891
##
## $Fruit.salad
## R_0 R_1 RR OR p.value
## 0.609 0.667 1.100 1.290 1.000
t(sapply(oswego2[, food.item], myepitab)) # transpose output
## R_0 R_1 RR OR p.value
## Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
## Spinach 0.625 0.605 0.967 0.918 1.00e+00
## Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
## Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
## Jello 0.577 0.696 1.210 1.680 4.42e-01
## Rolls 0.658 0.568 0.863 0.682 4.82e-01
## Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
## Milk 0.620 0.500 0.807 0.614 6.38e-01
## Coffee 0.614 0.613 0.999 0.997 1.00e+00
## Water 0.647 0.542 0.837 0.645 4.50e-01
## Cakes 0.543 0.675 1.240 1.750 3.42e-01
## Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
## Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
## Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
Both lapply and sapply are used to apply function to each component (bin) of a list. They are identical except that sapply “simplifies” the final result and lapply returns list
Exercise 5.8 (Lexical scoping). Study the following R code and explain the final answer for g(2):
a <- 1
b <- 2
f <- function(x) {
a * x + b
}
g <- function(x) {
a <- 2
b <- 1
f(x)
}
g(2)
## [1] 4
The values executed by g(x) is predefined by the function of ‘f’ and R executes the function defined before ‘f’. Although the functions are defined before ‘g’, R doesn’t consider the position relative to ‘g’. Since a <- 1, b <- 2 was defined before ‘f’, hence R works with this code. Therefore a * x + b= 1*2+2=4.
Exercise 5.9 (More lexical scoping). Study the following R code and explain the final answer for g(2):
g <- function(x) {
a <- 2
b <- 1
f <- function(x) {
a * x + b
}
f(x)
}
g(2)
## [1] 5
This time, the functions are defined after ‘g’. Since the position related to ‘g’ doesn’t count, R only executes the funtion defined before of ‘f’. therefore, R executes a <- 2, b <- 1. a * x + b= 2*2+1= 5