Epidemiology 208: Homework #2
Due: Friday, July 27, 2012
Question #1:
exp.dis.tab <- function(a,b,c,d, dimnames = list(exposure = c("Exposed", "Non-Exposed"),
disease = c("Disease+", "Disease-"))) {
mat <- matrix(c(a,b,c,d), ncol = 2, byrow = TRUE, dimnames = dimnames)
mat
}
Question 1.1: What is the value for the crude Odds Ratio for death, comparing the risk of death for diabetic subjects (exposed group) to non-diabetics? (Hint: You did this in Homework #1)
Table 1.2
tab1.2 <-
exp.dis.tab(94,27,1456,2857,
list(diabetes2 = c("1 Diabetic", "2 Not diabetic"),
death2 = c("1 Death","2 No Death")
)
)
addmargins(tab1.2)
death2
diabetes2 1 Death 2 No Death Sum
1 Diabetic 94 27 121
2 Not diabetic 1456 2857 4313
Sum 1550 2884 4434
The crude odds ratio can be calculated from Table 1.2. By cross product method, it is:
(94 * 2857) / (27 * 1456) = 6.8315
.
Question 1.2: Using the Mantel-Haenszel weighting formula, what is the value for the age-adjusted Odds Ratio for death, comparing the risk of death for diabetics (exposed group) to non-diabetics?
## Data in array-format
tab1.3.array <- array(c( 2, 90, 2, 660,
16,360,13,1250,
41,557, 7, 774,
35,449, 5, 173),
dim = c(2,2,4),
list(diabetes2 = c("1 Diabetic", "2 Not diabetic"),
death2 = c("1 Yes", "2 No"),
agecat = c(1:4)))
## Matrix element vectors
ai <- tab1.3.array[1,1,]
bi <- tab1.3.array[1,2,]
ci <- tab1.3.array[2,1,]
di <- tab1.3.array[2,2,]
## Create list for displaying
tab1.3.list <- alply(.data = tab1.3.array, .margins = 3, .fun = invisible)
names(tab1.3.list) <- c("agecat1","agecat2","agecat3","agecat4")
lapply(tab1.3.list, addmargins)
$agecat1
death2
diabetes2 1 Yes 2 No Sum
1 Diabetic 2 2 4
2 Not diabetic 90 660 750
Sum 92 662 754
$agecat2
death2
diabetes2 1 Yes 2 No Sum
1 Diabetic 16 13 29
2 Not diabetic 360 1250 1610
Sum 376 1263 1639
$agecat3
death2
diabetes2 1 Yes 2 No Sum
1 Diabetic 41 7 48
2 Not diabetic 557 774 1331
Sum 598 781 1379
$agecat4
death2
diabetes2 1 Yes 2 No Sum
1 Diabetic 35 5 40
2 Not diabetic 449 173 622
Sum 484 178 662
The odds ratios for the age category strata are: 7.3333, 4.2735, 8.139, 2.6971
(agecat = 1, 2, 3, 4, respectively).
From page 187 of Rothman's Epidemiology: An Introduction.
\[ OR_{MH} = \frac {\sum_i \frac {a_i d_i} {T_i}} {\sum_i \frac {b_i c_i} {T_i}} \]
Calculation for the odds ratio with Mantel-Haenszel weighting using R.
ai <- c( 2, 16, 41, 35) # Exposed, Disease+ cells
bi <- c( 2, 13, 7, 5) # Exposed, Disease- cells
ci <- c( 90, 360, 557, 449) # Non-Exposed, Disease+ cells
di <- c(660, 1250, 774, 173) # Non-Exposed, Disease- cells
Ti <- (ai + bi + ci + di) # 1. Total for each level separately
numerator <- sum((ai * di) / Ti) # 2. (ai * di) / Ti for each level, then sum
denominator <- sum((bi * ci) / Ti) # 3. (bi * ci) / Ti for each level, then sum
OR.mh <- numerator / denominator # 4. Divide 2 by 3
\( OR_{MH} \) = 4.9515
is the answer.
Visualization with a meta-analysis package.
epi.mh(ev.trt = ai, n.trt = ai + bi, ev.ctrl = ci, n.ctrl = ci + di, names, method = "odds.ratio")
meta.res <-
meta::metabin(event.e = ai, n.e = ai + bi,
event.c = ci, n.c = ci + di,
sm = "OR", studlab = paste("agecat", c(1:4), sep = ""),
method="MH",
label.e="Exposed", label.c="Non-Exposed", outclab = "Death",
## label.left="LEFT", label.right="RIGHT",
comb.fixed = TRUE, comb.random = FALSE)
meta.res
meta::forest.meta(meta.res)
Question 1.3: Is age a confounder in these data? Why or why not
It is a counfounder.
As age can be associated with both diabetes (exposure) and death (outcome), it is a potential confounder. Also the crude OR and \( OR_{MH} \) are different indicating presence of confounding. In fact, the age composition of the diabetics (exposed) is different from that of the non-diabetics because there are more older people in the diabetic population, at least partly contributing to the higher mortality in the diabetic group.
Question 1.4: Is age an effect-modifier in these data? Why or why not
It is an effect measure modifier.
The ORs are different by more than 20% across the age strata, indicating different effect of exposure (being diabetic) on the outcome (death) in these strata, at least in the ratio scale. Thus, the effect measure (odds ratio here) is modified by the agecat variable, thus age is an effect measure modifier.
Question # 2
Question 2.1: Complete the following crude and sex-specific tables showing data that are consistent with the results given above:
tab.q2 <- function(a,b,c,d) {
mat <- matrix(c(a,b,c,d), ncol = 2, byrow = TRUE,
dimnames = list(Exposure = c("Exposed", "Non-Exposed"),
Disease = c("Disease+", "Disease-")))
mat
}
Crude Table
addmargins(tab.q2(260,240,70,430))
Disease
Exposure Disease+ Disease- Sum
Exposed 260 240 500
Non-Exposed 70 430 500
Sum 330 670 1000
Sex-specific tables:
Males
addmargins(tab.q2(240,160,30,70))
Disease
Exposure Disease+ Disease- Sum
Exposed 240 160 400
Non-Exposed 30 70 100
Sum 270 230 500
Females
addmargins(tab.q2(20,80,40,360))
Disease
Exposure Disease+ Disease- Sum
Exposed 20 80 100
Non-Exposed 40 360 400
Sum 60 440 500
Question 2.2: Is sex a confounder? Why or why not?
It is a confounder.
The crude risk ratio is (260 / 500) / (70 / 500) = 3.7143
.
The adjusted risk ratio is 2.0 as shown below.
The difference between these suggest there is a confounding by the stratification variable, sex.
The adjusted risk ratio by the Mantel-Haenszel method is defined as:
\[ RR_{MH} = \frac {\sum_i \frac {a_i (c_i + d_i)} {T_i}} {\sum_i \frac {c_i (a_i + b_i)} {T_i}} \]
Calculation for the \( RR_{MH} \).
ai <- c(240, 20) # Exposed, Disease+ cells
bi <- c(160, 80) # Exposed, Disease- cells
ci <- c( 30, 40) # Non-Exposed, Disease+ cells
di <- c( 70, 360) # Non-Exposed, Disease- cells
Ti <- ai + bi + ci + di # Total cells
numerator <- sum(ai * (ci + di) / Ti)
denominator <- sum(ci * (ai + bi) / Ti)
RR.mh <- numerator / denominator
cat("RR.mh =", RR.mh, "\n")
RR.mh = 2
Visualization of \( RR_{male}, RR_{female}, RR_{MH} \).
epi.mh(ev.trt = ai, n.trt = ai + bi, ev.ctrl = ci, n.ctrl = ci + di, names, method = "risk.ratio")
meta.res <-
meta::metabin(event.e = ai, n.e = ai + bi,
event.c = ci, n.c = ci + di,
sm = "RR", studlab = c("Male","Female"),
method="MH",
label.e="Exposed", label.c="Non-Exposed", outclab = "Death",
comb.fixed = TRUE, comb.random = FALSE)
meta.res
meta::forest.meta(meta.res)
There are more exposed males than exposed females, and also the baseline risk of disease among non-exposed is higher among the males. Therefore, sex is associated with both exposure status and outcome, and it is a definition of a confounder. By having more high-risk (mainly male) exposed people and low-risk (mainly female) non-exposed people, the risk ratio between these two groups became larger in the overall sample.
Question 2.3: Is sex an effect modifier of the Risk Ratio? Why or not?
It is not an effect (measure) modifier on the risk ratio scale.
Effect (measure) modification is present in a situation in which a measure of effect changes over value of some other variable (page 199, Rothman's Epidemiology: An Introduction, 2012). In this case, risk ratios are same across the gender strata (2.0 in both groups as shown in the diagram above), meaning sex does not modify the effect of exposure on the risk ratio scale.
In men, it is 0.6 / 0.3 = 2.
In women, it is 0.2 / 0.1 = 2.
However, the risk difference in the men (0.6 - 0.3 = 0.3) is greater than that of the women (0.2 - 0.1 = 0.1), indicating effect modification in the risk difference scale.
Visualization of \( RD_{male}, RD_{female}, RD_{MH} \).
meta.res <-
meta::metabin(event.e = ai, n.e = ai + bi,
event.c = ci, n.c = ci + di,
sm = "RD", studlab = c("Male","Female"),
method="MH",
label.e="Exposed", label.c="Non-Exposed", outclab = "Death",
## label.left="LEFT", label.right="RIGHT",
comb.fixed = TRUE, comb.random = FALSE)
meta.res
meta::forest.meta(meta.res)
Question 2.4: Change the statements for one or more of the seven “findings” listed in the original table to reverse your answer to Question 2.2. (You do not need to change the answers to Question 2.1 but only write the new “finding(s)” in the space that follows).
Finding 2:
80% of exposed subjects were males \( \dashrightarrow \) 50% of the exposed subjects were males
20% of the non-exposed subjects were males \( \dashrightarrow \) 50% of the non-exposed subjects were males
In this case, the overall risk ratio is (200/500) / (100/500) = 0.4 / 0.2 = 2, and it is also 2 in each stratum.
Males: (150/250) / (75/250) = 0.6 / 0.3 = 2.0
Females: (50/250) / (25/250) = 0.2 / 0.1 = 2.0
\( RR_{MH} \) = 2.0
Visualization of \( RR_{male}, RR_{female}, RR_{MH} \).
ai <- c(150, 50) # Exposed, Disease+ cells
bi <- c(100, 200) # Exposed, Disease- cells
ci <- c( 75, 25) # Non-Exposed, Disease+ cells
di <- c(175, 225) # Non-Exposed, Disease- cells
Ti <- ai + bi + ci + di # Total cells
numerator <- sum(ai * (ci + di) / Ti)
denominator <- sum(ci * (ai + bi) / Ti)
RR.mh <- numerator / denominator
cat("RR.mh =", RR.mh, "\n")
meta.res <-
meta::metabin(event.e = ai, n.e = ai + bi,
event.c = ci, n.c = ci + di,
sm = "RR", studlab = c("Male","Female"),
method="MH",
label.e="Exposed", label.c="Non-Exposed", outclab = "Death",
## label.left="LEFT", label.right="RIGHT",
comb.fixed = TRUE, comb.random = FALSE)
meta.res
meta::forest.meta(meta.res)
Although being male is still associated with higher baseline risk of disease (0.3), now there is no association between sex and exposure, i.e., male:female ratios are the same between the exposed and non-exposed. Without having association with both exposure and outcome, the sex variable is no longer a confounder.
Crude Table
men <- addmargins(tab.q2(150,100,75,175))
women <- addmargins(tab.q2(50,200,25,225))
total <- men + women
total
Disease
Exposure Disease+ Disease- Sum
Exposed 200 300 500
Non-Exposed 100 400 500
Sum 300 700 1000
Sex-specific tables:
Males
men
Disease
Exposure Disease+ Disease- Sum
Exposed 150 100 250
Non-Exposed 75 175 250
Sum 225 275 500
Females
women
Disease
Exposure Disease+ Disease- Sum
Exposed 50 200 250
Non-Exposed 25 225 250
Sum 75 425 500
Question 2.5: Change the statements for one or more of the seven “findings” listed in the original table to reverse your answer to Question 2.3. (You do not need to change the answers to Question 2.1 but only write the new “finding(s)” in the space that follows).
Changed:
Finding 4: three times \( \dashrightarrow \) two times
Finding 6: 2.0 \( \dashrightarrow \) 1.333333
In this case, the risk ratio are different across the sex strata.
In the males, it is (160 / 400) / (30 / 100) = 0.4 / 0.3 = 1.333333.
In the females, it is (20 / 100) / (40 / 400) = 0.2 / 0.1 = 2.
The risk ratio shows effect modification by sex. However, the risk difference is constant across sex strata, i.e., 0.4 - 0.3 = 0.2 - 0.1 = 0.1.
Crude Table
men <- addmargins(tab.q2(160,240,30,70))
women <- addmargins(tab.q2(20,80,40,360))
total <- men + women
total
Disease
Exposure Disease+ Disease- Sum
Exposed 180 320 500
Non-Exposed 70 430 500
Sum 250 750 1000
Sex-specific tables:
Males
men
Disease
Exposure Disease+ Disease- Sum
Exposed 160 240 400
Non-Exposed 30 70 100
Sum 190 310 500
Females
women
Disease
Exposure Disease+ Disease- Sum
Exposed 20 80 100
Non-Exposed 40 360 400
Sum 60 440 500
Question #3:
After reading this manuscript your colleague says that the study design is biased because of the role of random error in any single blood pressure measurement impacting the selection of subjects for this study. Do you agree with your colleague? Why or why not? If the study design is biased, how would you modify the study design to avoid this problem?
I agree.
There are more people with the true blood pressure below 140 mmHg than above 140 mmHg, since by definition there should be more normal people than abnormal people. Thus, if a single blood pressure measurement is used for enrollment, there will be more false positives (normotensive people enrolled) than false negatives (hypertensive people missed). These people are expected to show regression to their mean at the final measurement, thus, the previously normotensive people are expected to have more normal measurements, causing the sample mean to be naturally lower regardless of the intervention.
To avoid this issue with random error, the mean of several blood pressure measurement should be calculated for each candidate for inclusion assessment, as random errors are expected to cancel each other out if the mean is used. The investigators should be able to include hypertensive subjects more accurately with this method.
Also, a randomized controlled trial is ideal, as comparison to a placebo arm is ultimately necessary to rule out an effect of simple secular trend.
Question #4:
Question 4.1 Comparing Groups A to Group B
Different: new vs standard treatment
Same: both in trial and randomized, blinded to treatment, cancer patients survived to 1 year
After randomization, both groups are expected to have similar baseline characteristics and are comparable except for the treatment allocation. Therefore, any between-group difference in the outcome is explained solely by the treatment allocation without confounding by other factors. Since both groups were in the trial, it is not possible to measure the “care effect” from this comparison.
Question 4.2 Comparing Groups A to Group C
Different: new vs standard treatment, in or outside trial, with or without randomization and blinding
Same: cancer patients survived to 1 year
There are many differences between these groups. Thus, the difference in outcome can be explained by either one or combinations of benefit or harm from the new treatment, the “care effect” by being in the trial, high expectation from being in the trial, and baseline differences (one arm is not randomized). Therefore, it is hard to measure the independent effect of the “care effect” by simply comparing these groups because of confounding by other between-group differences.
Question 4.3 Comparing Groups B to Group C
Different: in or outside trial, with or without randomization and blinding
Same: standard care, cancer patients survived to 1 year
These groups both receive standard of care, and the largest difference is whether they were enrolled in the trial or not, making this comparison most appropriate for measurement of the “care effect”. However, as the patients outside the trial have not been randomized, there may be baseline differences, for example, age, sex, comorbidities, and stage of cancer. Therefore, meticulous adjustment for the baseline differences to remove confounding is required for a valid comparison. Even after the adjustment, unmeasured differences such as the high expectation from being in the trial may partly explain differences in the outcome.
Question #5.
tab.q5 <- function(a,b,c,d, exposure = c("Exposed", "Non-Exposed"), disease = c("Disease+", "Disease-")) {
mat <- matrix(c(a,b,c,d), ncol = 2, byrow = TRUE,
dimnames = list(Exposure = exposure,
Disease = disease)
)
mat
}
Question 5.1 Complete the following 2x2 table showing the physical activity status among cases and controls. Calculate an appropriate measure of association between physical activity and cancer from these data.
addmargins(tab.q5(40,50,60,50, c("Physically Active","Not Physically Active"), c("cases","controls")))
Disease
Exposure cases controls Sum
Physically Active 40 50 90
Not Physically Active 60 50 110
Sum 100 100 200
Odds ratio = (40 * 50) / (50 * 60) = 0.6667
This shows a negative association with physical activity and cancer.
Question 5.2 Suppose that the next-of-kin of the cases and the controls misreported the true physical activity status in both groups with the same error probabilities. Would the result of such misreporting among cases and controls make the value for the measure of association in the previous question be:
2. Biased towards the null (i.e. the reported value in Question 5.1. underestimates the true effect of physical activity on cancer)
As the errors occur randomly, both types of misclassification (active to not active, not active to active) occur similarly in both groups. Therefore, it is a nondifferential misclassification. This type of misclassification tends to bias the result towards null.
Question 5.3 Suppose that the next-of-kin of the cases over-reported the actual number of physically active cases at the start of the PHS but that the controls correctly reported their true physical activity at the start of the PHS. Would the result of such over-reporting among cases make the value for the measure of association in the previous question be:
2. Biased towards the null (i.e. the reported value in Question 5.1. underestimates the true effect of physical activity on cancer)
The over-report means there were fewer physically active people among the cases than actually shown in the data. As it occurred exclusively in the cases, it is a differential misclassification. For example, if there were only 20 physically active people among the cases, the odds ratio would have been:
\( (20 * 50) / (50 * 80) = 0.25 \).
This means that the effect of physical activity was actually more prominent. Thus, the unidirectional over-reporting of the number of physically active people among the cases only, but not in the controls, caused a bias towards null.
Question 5.4 Concerned that the next-of-kin of diseased subjects might recall past physical activity differently than alive subjects, suppose you enrolled deceased subjects as controls and obtain information of their level of physical activity at the start of the PHS from their next-of-kin (as done for the 100 cases). Suppose that you enroll 100 controls who died from coronary heart disease as controls and that next of kin reported that 20 of these controls were physically active at the start of the PHS.
Complete the following 2x2 table showing the physical activity status among cases and controls. Calculate an appropriate measure of association between physical activity and cancer from these data.
addmargins(tab.q5(40,20,60,80, c("Physically Active","Not Physically Active"), c("cases","controls")))
Disease
Exposure cases controls Sum
Physically Active 40 20 60
Not Physically Active 60 80 140
Sum 100 100 200
Odds ratio = (40 * 80) / (20 * 60) = 2.6667
This shows a positive association with physical activity and cancer.
Do you prefer to use the deceased controls in this table or the alive controls in the table in Question 5.1? Why?
Deceased controls are preferable in terms of information bias, however, those who died of coronary heart disease are not appropriate.
Controls in a case-control study should represent the population at risk for the disease of interest, in this case cancer death. The deceased control in this example is solely comprised of patients died of a single cause, coronary heart disease. If the reason of death is associated with the exposure in question, the result will be biased, as the exposure status is expected to be different form the population at risk. In this particular example, there were fewer physically active people in the controls, probably fewer than the general at-risk population, which may be the cause or result of their heart disease. As the controls did not correctly represent the physical activity status of the at-risk population, the resulting odds ratio was probably biased upward.
Alive controls can recall the exercise status themselves, thus it is expected to be more accurate than information from their family. This can cause information bias.
The most appropriate kind of patients are the deceased patients at the same hospital whose cause of death was not associated with exercise habits. These patients would be from the same source population both geographically and socioeconomically as they were at the same hospital at death. If the cause of death was not associated with exercise, the exposure status is expected to represent that of the source population.
Question #6.
Question 6.1 Construct a 2x2 table showing the relationship between time of bleeding and mortality, using the Late Bleeders as the “exposed” group. Calculate the mortality risk ratio form these data.
addmargins(tab.q5(9, 36-9, 18, 41-18, c("Late Bleeders","Early Bleeders"), c("Died","Survived")))
Disease
Exposure Died Survived Sum
Late Bleeders 9 27 36
Early Bleeders 18 23 41
Sum 27 50 77
Risk ratio = (9/36) / (18/41) = 0.5694
(lower risk among late bleeders)
Question 6.2 Louis claimed that the two comparison groups were comparable with respect to diet, age, and treatments other than bloodletting. However, if the early bleeders had severe symptoms that caused them to see a physician early in the course of disease, but the late bleeders were less severely ill and saw a physician later in the course of disease, how might this effect the interpretation of the answer to Question 6.1 and any conclusion that you might reach. (Note: In reality, Louis claimed that the two comparison groups had similar severity of symptoms).
If there had been more severely ill patients in the Early Bleeder group, the observed difference in mortality ratio (more in the Early Bleeder) might be explained by the severity of their illness, rather than the exposure status. If that was the case, the severity of illness was associated with both the exposure (blood letting initiation time) and outcome (death), serving as a counfounder.
Question 6.3 The reported mean duration of disease was 17.8 days for the 41 early bleeders and 20.8 days for the 36 late bleeders. Do these data on duration of disease support a benefit from bloodletting? Why or why not?
These data do not give enough information, thus it is inconclusive.
These data are really time-to-death data with censoring by cure. Thus, short duration may mean a severe disease that killed a patient or a mild disease in which the patient was cured early. Without combining the time variable with the outcome variable and doing survival analysis, it is not possible to fully utilize these data. However, in this particular case, the shorter mean duration of the 41 early bleeders probably mean early death among 18 people who ultimately died.
As a matter of fact, if survival analysis is performed on the original data (manually reproduced from J Clin Epidemiol. 1996 Dec;49(12):1327-33.), The late bleeders show better survival, indicating the shorter mean disease duration among the early bleeders was indeed due to early death.
Data also available at :http://www.epidemiology.ch/history/louis.htm
blood.letting <- read.table(header = TRUE,
text = "start.letting duration n.times.letting death age
1 10 3 0 NA
1 12 2 0 NA
1 14 2 0 NA
2 7 3 0 NA
2 10 2 0 NA
2 12 2 0 NA
3 19 3 0 NA
3 29 3 0 NA
3 20 2 0 NA
3 20 NA 0 NA
3 16 3 0 NA
3 17 4 0 NA
4 19 3 0 NA
4 12 2 0 NA
4 15 2 0 NA
4 22 4 0 NA
4 12 4 0 NA
4 21 2 0 NA
4 25 3 0 NA
4 28 4 0 NA
4 40 2 0 NA
4 16 2 0 NA
4 12 4 0 NA
5 28 2 0 NA
5 17 3 0 NA
5 40 2 0 NA
5 13 2 0 NA
5 21 2 0 NA
5 13 2 0 NA
6 13 1 0 NA
6 16 2 0 NA
6 23 3 0 NA
6 35 5 0 NA
6 17 2 0 NA
7 24 2 0 NA
7 12 4 0 NA
7 19 2 0 NA
7 18 2 0 NA
7 15 2 0 NA
7 27 2 0 NA
8 19 2 0 NA
8 12 1 0 NA
8 18 1 0 NA
8 20 3 0 NA
8 13 2 0 NA
8 21 2 0 NA
9 35 1 0 NA
9 11 2 0 NA
9 17 2 0 NA
9 30 3 0 NA
1 6 5 1 18
2 53 5 1 65
2 12 3 1 69
2 8 2 1 65
2 12 1 1 55
2 17 7 1 75
3 4 1 1 57
3 16 2 1 54
3 6 3 1 30
3 6 4 1 47
3 47 2 1 75
3 11 4 1 45
4 29 2 1 19
4 29 4 1 46
4 12 1 1 85
4 15 3 1 37
4 17 1 1 67
4 20 3 1 22
5 16 4 1 58
5 8 2 1 63
5 9 4 1 24
6 62 4 1 20
6 10 2 1 40
6 29 3 1 24
7 20 2 1 68
8 25 1 1 40
9 22 1 1 50")
blood.letting <- within(blood.letting, {
late.bleeder <- start.letting >= 5
})
## layout(matrix(c(1:2), ncol = 2))
## Time to death
survplot(survfit(Surv(duration, death == 1) ~ late.bleeder,
data = blood.letting), label.curves=list(keys = 1:2),
n.risk = TRUE, time.inc = 10, conf = "none")
## In fact, if the graph is plotted for time to cure (death as censoring), these two groups show similar curves.
## Time to cure
## survplot(survfit(Surv(duration, death == 0) ~ late.bleeder,
## data = blood.letting), label.curves=list(keys = 1:2),
## n.risk = TRUE, time.inc = 10, conf = "none",
## fun = function(x) 1 - x, ylab = "Cumulative Cure Probability")
All computations were carried out with R 2.15.1 (optional R packages used: plyr, rms, meta, epiR).