#Exercise 5.1 (Bootstrap a confidence interval)
In the same study above, 350 patients did not take the new drug treatment, and 289 recovered. Calculate the bootstrap confidence interval for this proportion (289/350).
replicates<-5000
num_at_risk<-350
num_of_events<-289
risk <-num_of_events/num_at_risk
sim_events <- rbinom(n = replicates, size = num_at_risk, prob = risk)
conf.level <- 0.95
alpha <- 1 - conf.level
## 0.95 CI for binomial counts
quantile(sim_events, c(alpha/2, 1-alpha/2))
## 2.5% 97.5%
## 275 303
#Exercise 5.2 (Function for bootstrapping a confidence interval)
Adapt your code from the previous problem to write and test a function for calculating the bootstrap confidence interval for a binomial proportion.
For example, if your previous code was
boots<-function(r,e,t,c){
sim<-rbinom(n=r, size =t, p=e/t)
a<-1-c
quantile(sim, c(a/2, 1-a/2))
}
boots(500, 289, 350, .95)
## 2.5% 97.5%
## 274.475 303.000
#Exercise 5.3
Using the results from Table 5.4 create and test a function that: a. Calculates the risks R0 and R1 b. Calculates the risk ratio c. Calculates 95% confidence intervals d. Uses the fisher.test function (Fisher’s Exact Test) to calculate the two-sided p-value of the null hypothesis of no association. (Hint: see epitable function in Exercise 5.5) Collects and returns all the results as a ‘list’
epistats<-function(a,b,c,d){
r0<-a/a+b
r1<-c/c+d
rr<-r0/r1
matr<-cbind(c(a,b),c(c,d))
ci<-quantile(rr, c(.05/2, .95/2))
fisher<-fisher.test(x=matr, alternative="two.sided", conf.level = 0.95)
list(risks = c(r0 = r0, r1 = r1),
risk.ratio = rr,
conf.level = ci,
fisher=fisher)
}
epistats(273, 289, 77, 61)
## $risks
## r0 r1
## 290 62
##
## $risk.ratio
## [1] 4.677419
##
## $conf.level
## 2.5% 47.5%
## 4.677419 4.677419
##
## $fisher
##
## Fisher's Exact Test for Count Data
##
## data: matr
## p-value = 0.154
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.504886 1.106634
## sample estimates:
## odds ratio
## 0.7486641
#Exercise 5.4 (Evaluating drug treatment) Consider an observational study where 700 patients were given access to a new drug for an ailment. A total of 350 patients chose to take the drug and 350 patients did not. Table 5.4 displays the results of this study. Using the read.csv function import the data frame for Table 5.4. Use the code below.
Analyze the drugrx.df data frame using the xtab function, your function from the previous problem, and any R functions you wish. Is the drug treatment effective? Is there a difference in drug treatment effectiveness comparing men and women?
drugrx.df <- read.csv("https://raw.githubusercontent.com/taragonmd/data/master/exported/jp-drugrx-p2.txt")
str(drugrx.df)
## 'data.frame': 700 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Recovered: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Drug : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Gender : Factor w/ 2 levels "Men","Women": 1 1 1 1 1 1 1 1 1 1 ...
tab.test <- xtabs(~ Recovered + Drug + Gender, data = drugrx.df)
tab.test
## , , Gender = Men
##
## Drug
## Recovered No Yes
## No 36 6
## Yes 234 81
##
## , , Gender = Women
##
## Drug
## Recovered No Yes
## No 25 71
## Yes 55 192
male<-epistats(tab.test[4], tab.test[3], tab.test[2], tab.test[1]); male
## $risks
## r0 r1
## 7 37
##
## $risk.ratio
## [1] 0.1891892
##
## $conf.level
## 2.5% 47.5%
## 0.1891892 0.1891892
##
## $fisher
##
## Fisher's Exact Test for Count Data
##
## data: matr
## p-value = 0.1266
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8238548 6.2439802
## sample estimates:
## odds ratio
## 2.073293
fem<-epistats(tab.test[8], tab.test[7], tab.test[6], tab.test[5]); fem
## $risks
## r0 r1
## 72 26
##
## $risk.ratio
## [1] 2.769231
##
## $conf.level
## 2.5% 47.5%
## 2.769231 2.769231
##
## $fisher
##
## Fisher's Exact Test for Count Data
##
## data: matr
## p-value = 0.4786
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.6795617 2.1820561
## sample estimates:
## odds ratio
## 1.228405
Since for both men and women the CI include zero, we cannot reject the null hypothesis that the treatment is effective.
#Exercise 5.5 (Oswego outbreak investigation) Study this clean version of the Oswego data set and import it as a data frame: ~/data/oswego/oswego2.txt. Remember this data set is available from https://github.com/taragonmd/data. Locate the data set and select the “Raw” tab. Copy the URL and use with the read.csv function. Import this clean version.
oswego2 <- read.csv("https://raw.githubusercontent.com/taragonmd/data/master/oswego/oswego2.txt", strip.white = TRUE)
food.item <- c("Baked.ham", "Spinach", "Mashed.potato", "Cabbage.salad",
"Jello", "Rolls", "Brown.bread", "Milk", "Coffee", "Water", "Cakes",
"Vanilla.ice.cream", "Chocolate.ice.cream", "Fruit.salad")
names(oswego2)[9:22] <- food.item
str(oswego2)
## 'data.frame': 75 obs. of 22 variables:
## $ ID : int 2 3 4 6 7 8 9 10 14 16 ...
## $ Age : int 52 65 59 63 70 40 15 33 10 32 ...
## $ Sex : Factor w/ 2 levels "F","M": 1 2 1 1 2 1 1 1 2 1 ...
## $ MealDate : Factor w/ 1 level "04/18/1940": 1 1 1 1 1 1 1 1 1 1 ...
## $ MealTime : Factor w/ 6 levels "11:00:00","18:30:00",..: 5 2 2 4 4 4 6 3 4 NA ...
## $ Ill : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ OnsetDate : Factor w/ 2 levels "4/18/1940","4/19/1940": 2 2 2 1 1 2 2 1 2 2 ...
## $ OnsetTime : Factor w/ 17 levels "00:00:00","00:30:00",..: 2 2 2 15 15 4 3 16 4 7 ...
## $ Baked.ham : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
## $ Spinach : Factor w/ 2 levels "N","Y": 2 2 2 2 2 1 1 2 1 2 ...
## $ Mashed.potato : Factor w/ 2 levels "N","Y": 2 2 1 1 2 1 1 2 1 1 ...
## $ Cabbage.salad : Factor w/ 2 levels "N","Y": 1 2 1 2 1 1 1 1 1 1 ...
## $ Jello : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 1 1 1 ...
## $ Rolls : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 2 ...
## $ Brown.bread : Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 1 1 ...
## $ Milk : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ Coffee : Factor w/ 2 levels "N","Y": 2 2 2 1 2 1 1 1 1 2 ...
## $ Water : Factor w/ 2 levels "N","Y": 1 1 1 2 2 1 1 2 1 1 ...
## $ Cakes : Factor w/ 2 levels "N","Y": 1 1 2 1 1 1 2 1 1 2 ...
## $ Vanilla.ice.cream : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Chocolate.ice.cream: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 2 ...
## $ Fruit.salad : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
Study the following epitable function, especially the use of the … argument to past options to the fisher.test function. Use the epitable function to test each food item to see if it is associated with developing illness. Construct a table with the following column headings:
Food item R0 R1 Risk ratio Odds ratio Fisher’s p-value
Discuss your findings.
epitable <- function(exposure, outcome, digits = 3, ...){
# Updated 2016-10-16
# 2x2 table from 'xtabs' function
# row names are exposure status
# col names are illness outcome
# digits = significant digits (default = 3)
# ... = options to fisher.test function
x <- xtabs(~ exposure + outcome, na.action=na.pass)
rowtot <- apply(x, 1, sum)
rowdist <- sweep(x, 1, rowtot, '/')
risk0 <- rowdist['N','Y']
risk1 <- rowdist['Y','Y']
rr <- risk1/risk0
or <- (risk1/(1-risk1)) / (risk0/(1-risk0))
ft <- fisher.test(x, ...)
pv <- ft$p.value
results <- signif(c(risk0, risk1, rr, or, pv), digits = digits)
names(results) <- c('R_0', 'R_1', 'RR', 'OR', 'p.value')
list(data = x, results = results)
}
#Here is an example of testing one food item:
epitable(oswego2$Baked.ham, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 12 17
## Y 17 29
##
## $results
## R_0 R_1 RR OR p.value
## 0.586 0.630 1.080 1.200 0.809
epitable(oswego2$Spinach, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 12 20
## Y 17 26
##
## $results
## R_0 R_1 RR OR p.value
## 0.625 0.605 0.967 0.918 1.000
epitable(oswego2$Mashed.potato, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 14 23
## Y 14 23
##
## $results
## R_0 R_1 RR OR p.value
## 0.622 0.622 1.000 1.000 1.000
epitable(oswego2$Cabbage.salad, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 19 28
## Y 10 18
##
## $results
## R_0 R_1 RR OR p.value
## 0.596 0.643 1.080 1.220 0.808
epitable(oswego2$Jello, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 22 30
## Y 7 16
##
## $results
## R_0 R_1 RR OR p.value
## 0.577 0.696 1.210 1.680 0.442
epitable(oswego2$Rolls, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 13 25
## Y 16 21
##
## $results
## R_0 R_1 RR OR p.value
## 0.658 0.568 0.863 0.682 0.482
epitable(oswego2$Brown.bread, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 20 28
## Y 9 18
##
## $results
## R_0 R_1 RR OR p.value
## 0.583 0.667 1.140 1.430 0.622
epitable(oswego2$Milk, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 27 44
## Y 2 2
##
## $results
## R_0 R_1 RR OR p.value
## 0.620 0.500 0.807 0.614 0.638
epitable(oswego2$Coffee, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 17 27
## Y 12 19
##
## $results
## R_0 R_1 RR OR p.value
## 0.614 0.613 0.999 0.997 1.000
epitable(oswego2$Water, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 18 33
## Y 11 13
##
## $results
## R_0 R_1 RR OR p.value
## 0.647 0.542 0.837 0.645 0.450
epitable(oswego2$Cakes, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 16 19
## Y 13 27
##
## $results
## R_0 R_1 RR OR p.value
## 0.543 0.675 1.240 1.750 0.342
epitable(oswego2$Vanilla.ice.cream, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 18 3
## Y 11 43
##
## $results
## R_0 R_1 RR OR p.value
## 1.43e-01 7.96e-01 5.57e+00 2.35e+01 2.60e-07
epitable(oswego2$Fruit.salad, oswego2$Ill)
## $data
## outcome
## exposure N Y
## N 27 42
## Y 2 4
##
## $results
## R_0 R_1 RR OR p.value
## 0.609 0.667 1.100 1.290 1.000
myepitab <- function(x) {
epitable(x, oswego2$Ill)$results
}
t(sapply(oswego2[, food.item], myepitab))
## R_0 R_1 RR OR p.value
## Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
## Spinach 0.625 0.605 0.967 0.918 1.00e+00
## Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
## Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
## Jello 0.577 0.696 1.210 1.680 4.42e-01
## Rolls 0.658 0.568 0.863 0.682 4.82e-01
## Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
## Milk 0.620 0.500 0.807 0.614 6.38e-01
## Coffee 0.614 0.613 0.999 0.997 1.00e+00
## Water 0.647 0.542 0.837 0.645 4.50e-01
## Cakes 0.543 0.675 1.240 1.750 3.42e-01
## Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
## Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
## Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
When we explore each item, we observe that the items associated with higher risks include (according to their RR).
#Exercise 5.6 (Generate outbreak investigation results table)
Write a for loop to automate the creation of the outbreak table from previous exercise. Hint provided:
outbreak <- matrix(NA, length(food.item), 5)
rownames(outbreak) <- food.item
colnames(outbreak) <- c('R_0','R_1','RR','OR','P value')
for(i in 1:length(food.item)) {
outbreak[i,] <- epitable(oswego2[,food.item[i]], oswego2$Ill)$results
}
outbreak
## R_0 R_1 RR OR P value
## Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
## Spinach 0.625 0.605 0.967 0.918 1.00e+00
## Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
## Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
## Jello 0.577 0.696 1.210 1.680 4.42e-01
## Rolls 0.658 0.568 0.863 0.682 4.82e-01
## Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
## Milk 0.620 0.500 0.807 0.614 6.38e-01
## Coffee 0.614 0.613 0.999 0.997 1.00e+00
## Water 0.647 0.542 0.837 0.645 4.50e-01
## Cakes 0.543 0.675 1.240 1.750 3.42e-01
## Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
## Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
## Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
#Exercise 5.7 (Comparing lapply and sapply)
Study the following use of lapply and sapply. Explain how both functions work, and what are their differences.
myepitab <- function(x) {
epitable(x, oswego2$Ill)$results
}
lapply(oswego2[, food.item], myepitab)
## $Baked.ham
## R_0 R_1 RR OR p.value
## 0.586 0.630 1.080 1.200 0.809
##
## $Spinach
## R_0 R_1 RR OR p.value
## 0.625 0.605 0.967 0.918 1.000
##
## $Mashed.potato
## R_0 R_1 RR OR p.value
## 0.622 0.622 1.000 1.000 1.000
##
## $Cabbage.salad
## R_0 R_1 RR OR p.value
## 0.596 0.643 1.080 1.220 0.808
##
## $Jello
## R_0 R_1 RR OR p.value
## 0.577 0.696 1.210 1.680 0.442
##
## $Rolls
## R_0 R_1 RR OR p.value
## 0.658 0.568 0.863 0.682 0.482
##
## $Brown.bread
## R_0 R_1 RR OR p.value
## 0.583 0.667 1.140 1.430 0.622
##
## $Milk
## R_0 R_1 RR OR p.value
## 0.620 0.500 0.807 0.614 0.638
##
## $Coffee
## R_0 R_1 RR OR p.value
## 0.614 0.613 0.999 0.997 1.000
##
## $Water
## R_0 R_1 RR OR p.value
## 0.647 0.542 0.837 0.645 0.450
##
## $Cakes
## R_0 R_1 RR OR p.value
## 0.543 0.675 1.240 1.750 0.342
##
## $Vanilla.ice.cream
## R_0 R_1 RR OR p.value
## 1.43e-01 7.96e-01 5.57e+00 2.35e+01 2.60e-07
##
## $Chocolate.ice.cream
## R_0 R_1 RR OR p.value
## 0.7410 0.5320 0.7180 0.3980 0.0891
##
## $Fruit.salad
## R_0 R_1 RR OR p.value
## 0.609 0.667 1.100 1.290 1.000
t(sapply(oswego2[, food.item], myepitab)) # transpose output
## R_0 R_1 RR OR p.value
## Baked.ham 0.586 0.630 1.080 1.200 8.09e-01
## Spinach 0.625 0.605 0.967 0.918 1.00e+00
## Mashed.potato 0.622 0.622 1.000 1.000 1.00e+00
## Cabbage.salad 0.596 0.643 1.080 1.220 8.08e-01
## Jello 0.577 0.696 1.210 1.680 4.42e-01
## Rolls 0.658 0.568 0.863 0.682 4.82e-01
## Brown.bread 0.583 0.667 1.140 1.430 6.22e-01
## Milk 0.620 0.500 0.807 0.614 6.38e-01
## Coffee 0.614 0.613 0.999 0.997 1.00e+00
## Water 0.647 0.542 0.837 0.645 4.50e-01
## Cakes 0.543 0.675 1.240 1.750 3.42e-01
## Vanilla.ice.cream 0.143 0.796 5.570 23.500 2.60e-07
## Chocolate.ice.cream 0.741 0.532 0.718 0.398 8.91e-02
## Fruit.salad 0.609 0.667 1.100 1.290 1.00e+00
The difference is that lapply does not give only one table, but several small ones and sapply gives one table with altogether all the results.
#Exercise 5.8 (Lexical scoping) Study the following R code and explain the final answer for g(2):
a<-1
b<-2
f<-function(x) {
a * x + b
}
g <-function(x) {
a<-2
b<-1
f(x)
}
g(2)
## [1] 4
I think it ignores the second values and just executes the first function.
#Exercise 5.9 (More lexical scoping) Study the following R code and explain the final answer for g(2):
g <- function(x) {
a <- 2
b <- 1
f <- function(x) {
a * x + b
}
f(x)
}
g(2)
## [1] 5
Here I think it is using the new values because it is separating the functions and just uses the first arguments given.