I. CRUDE ANALYSIS

Epidemiology 202: Homework 3

hw3.dat <-
    read.table(header = T, text = "
Sex IncomeLevel Distance Cases PersonYears
Men    Low   <200m  35  704
Men    Mid   <200m  26  684
Men    High  <200m  31 1134
Men    Low  >=200m 145 3871
Men    Mid  >=200m 138 4238
Men    High >=200m 280 9765
Women  Low   <200m  28  587
Women  Mid   <200m  20  275
Women  High  <200m  26  377
Women  Low  >=200m  93 1990
Women  Mid  >=200m  98 2076
Women  High >=200m 151 3796
")

hw3.dat$IncomeLevel <- factor(hw3.dat$IncomeLevel, c("Low","Mid","High"))
hw3.dat

     Sex IncomeLevel Distance Cases PersonYears
1    Men         Low    <200m    35         704
2    Men         Mid    <200m    26         684
3    Men        High    <200m    31        1134
4    Men         Low   >=200m   145        3871
5    Men         Mid   >=200m   138        4238
6    Men        High   >=200m   280        9765
7  Women         Low    <200m    28         587
8  Women         Mid    <200m    20         275
9  Women        High    <200m    26         377
10 Women         Low   >=200m    93        1990
11 Women         Mid   >=200m    98        2076
12 Women        High   >=200m   151        3796

I. CRUDE ANALYSIS

1.Using the data in the table, create a new table to explore the crude association between distance and all-cause mortality. Calculate the crude incidence rate ratio (IRR) comparing those living <200m from a road to those living >=200m (reference group). Interpret the results.

library(plyr)

I1 <- ddply(.data      = hw3.dat,
            .variables = "Distance",
            .fun       = summarize,
            "Total Cases"                 = sum(Cases),
            "Total PY"                    = sum(PersonYears),
            "Incidence rate (per 1000PY)" = sum(Cases) / sum(PersonYears) * 1000
            )
print(I1, row.names = F)

 Distance Total Cases Total PY Incidence rate (per 1000PY)
    <200m         166     3761                       44.14
   >=200m         905    25736                       35.16

library(epiR)
epi.2by2(as.matrix(I1[,2:3]), method = "cohort.time", units = 1000)

             Disease +    Time at risk        Inc rate *
Exposed +          166            3761              44.1
Exposed -          905           25736              35.2
Total             1071           29497              36.3

Point estimates and 95 % CIs:
---------------------------------------------------------
Inc rate ratio                           1.26 (1.06, 1.48)
Attrib rate *                            8.97 (1.88, 16.07)
Attrib rate in population *              1.14 (-2.01, 4.3)
Attrib fraction in exposed (%)           20.33 (5.41, 32.55)
Attrib fraction in population (%)        3.15 (2.64, 3.66)
---------------------------------------------------------
 * Cases per 1000 units of population time at risk

Therefore, the incidnece rate ratio is 44.13720 / 35.16475 = 1.26. The people living closer to roads are at 26% increased rate of all-cause mortality.

2.Construct a 95% confidence interval for the crude IRR in Question 1. Please interpret.

The confidence interval on the natural log scale is defined as:

\( X \pm Z_{1 - \alpha /2} \sqrt{\hat{Var}(X)} \)

where \( \hat{Var}(X) = \frac {1} {a} + \frac {1} {b} \) and \( X \) is the natural log of the estimated rate ratio.

X      <- log(I1[1,4] / I1[2,4])        # natural log of rate ratio
Z      <- qnorm(p = 0.975)              # Z 1-alpha/2
ci.log <- X + c(-1,1) * Z * sqrt(1/I1[1,2] + 1/I1[2,2]) # 95% CI on log scale
ci.exp <- exp(ci.log)                   # back to normal scale
ci.exp

[1] 1.064 1.481

From the formula, the 95% confidence interval of the rate ratio is from 1.06 to 1.48, indicating 6% to 48% increase in rate of all-cause mortality. The values between these two limits are considered compatible with the data obtained in the study. We are 95% confident that the interval has captures the true value we tried to estimate. The interval is tight, meaning the estimation was stable. The interval also does not contain the null value of 1, thus, the result is likely to be significant by hypothesis testing with the alpha level at 5%.

3.Calculate the crude incidence rate difference (IRD) comparing the all-cause mortality rate among the study participants living <200m from a road to those living >=200m (reference group). Interpret the results.

The incidence rate difference is 44.13720 - 35.16475 = 8.97 per 1000 person-years. For every 1000 person-years of observation, the people living closer to roads experience 8.97 more cases of all-cause mortality.

4.Construct a 95% confidence interval for the IRD. Please interpret it.

The confidence interval is defined as:

\( X \pm Z_{1 - \alpha /2} \sqrt{\hat{Var}(X)} \)

where \( \hat{Var}(X) = \frac {a} {N^2_1} + \frac {b} {N^2_0} \) and \( X \) is the the estimated rate difference.

X       <- -diff(I1[,2] / I1[,3]) # rate difference
Z       <- qnorm(p = 0.975)       # Z 1-alpha/2
ci.diff <- X + c(-1,1) * Z * sqrt(I1[1,"Total Cases"]/I1[1,"Total PY"]^2 + I1[2,"Total Cases"]/I1[2,"Total PY"]^2)
ci.diff * 1000

[1]  1.878 16.067

From the formula, the 95% confidence interval is from 1.88 to 16.07 (per 1000 person-years). The values between these two limits are considered compatible with the data obtained in the study. We are 95% confident that the interval has captured the true value we tried to estimate. The interval is relatively tight, meaning the estimation is stable. It also does not include the null value of zero, thus, the result is likely to be significant by hypothesis testing with the alpha level at 5%.

II. CONFOUNDING

1.The authors think that income might be a confounder of the association between distance to a major roadway and all-cause mortality. What properties must a confounder have? From the author’s perspective, how does income exemplify these properties? (Note: if you disagree with the authors, you will have a chance to make your point in question 5b. For now, assume this is the correct model.)

Properties required

The variable must be associated with exposure in the study base.
The variable must be associated with outcome in the unexposed.
The variable must not be a downstream consequence of the exposure or the outcome.

In this study

The authors mention that Those living close to a major roadway tended to live in areas of lower socioeconomic position as noted by lower block group income and lower neighborhood education, although their demographic and clinical characteristics were similar to those living farther from a major roadway, with the exception that fewer were of white race.

The income was associated with where people lived in the study base, i.e., people with lower socioeconomic status (measured by income) lived closer to a major roadway.
The income may have been associated with all-cause mortality in the control, i.e., people with lower socieconomic status (measured by income) are known to have worse health outcomes, including mortality, when they are not living close a major roadway.
The income was not a downstream consequence of the living near roads or the all-cause mortality, i.e., living close to a major roadway is unlikely to affect income, and all-cause mortality can not affect income.

2.The authors are trying to convince their colleagues that income is indeed a confounder and it is important to adjust for in the analysis. When making a comment such as this, it is important to consider what the results might have been if, in fact, the potential confounder were controlled for.

a.Draw a directed acyclic graph (DAG) to indicate under what potential circumstances confounding by income might give a crude measure of association between proximity to a major roadway and all-cause mortality that is higher (in absolute value) than the true value (either overestimating a harmful effect or underestimating a protective effect). Use the (+) symbol and the (-) symbol to indicate positive or inverse associations, respectively, between income and proximity to a major roadway and also between income and all-cause mortality.

library(Epi)

par(mar = c(0,0,0,0), cex = 1.5)
plot(NA, bty = "n", xlim = 0:1*100, ylim = c(0.4,0.9)*100, xaxt = "n", yaxt = "n", xlab = "", ylab = "")
dag.a  <- tbox( "Proximity to a major roadway", 20, 50, 22, 10, col.border = "white")
dag.c  <- tbox( "Low income",                   10, 80, 22, 10, col.border = "white")
dag.y  <- tbox( "All-cause mortality",          70, 50, 22, 10, col.border = "white" )
boxarr(dag.c, dag.a , col = gray(0.7), lwd = 3)
boxarr(dag.c, dag.y , col = gray(0.7), lwd = 3)

text(18, 65, "+")
text(45, 65, "+")

plot of chunk unnamed-chunk-7

Low income people tend to live closer to major roadways and these people are at higher risk of all-cause mortality because low income decrease the standard of living and health care they receive. In this scenario, those who live close to major roads are the low income people, who are at higher risk of all-cause mortality, thus there will be a spurious positive association between living close to a major roadway and all-cause mortality, even if there is no true association.

b.Using the data provided in Table 1, create a new table to properly calculate a measure of association that assesses the relationship between income and all-cause mortality among individuals who do not live close to a roadway (here considered the unexposed). Is income positively or inversely associated with all-cause mortality among the unexposed? Please be consistent with your choice of reference level for income in parts b, c and d.

Incidence rate (per 1000PY) among those who were living far from a major roadway (>=200m)

II1 <- ddply(hw3.dat,
             c("Distance","IncomeLevel"),
             summarize,
             "Total Cases" = sum(Cases),
             "Total PY" = sum(PersonYears),
             "Incidence rate (per 1000PY)" = sum(Cases) / sum(PersonYears) * 1000
             )
print(II1[II1$Distance == ">=200m",-1], digits = 4, row.names = F)

 IncomeLevel Total Cases Total PY Incidence rate (per 1000PY)
         Low         238     5861                       40.61
         Mid         236     6314                       37.38
        High         431    13561                       31.78

As the income level increases, the mortality rate (incidence rate of all-cause mortality) decreases monotonously, thus, there is an inverse association (the higher the income, the lower the all-cause mortality). The incidence rate ratio compared to the reference level (low income) was 37.37726 / 40.60740 = 0.92 for the middle income people and 31.78232 / 40.60740 = 0.78 for the high income people.

c. Using the data provided in Table 1, create a new table to assess the relationship between income and distance to a major roadway in the study base (i.e. the person-time that gave rise to the cases). Calculate this association and indicate whether income is positively or inversely associated with distance to a major roadway in the study base.

tab.ii.c <- xtabs(PersonYears ~ Distance +IncomeLevel, hw3.dat)
library(gmodels)
CrossTable(tab.ii.c, prop.r = F, prop.t = F, prop.chisq = F, digits = 2)



   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|


Total Observations in Table:  29497 


             | IncomeLevel 
    Distance |       Low |       Mid |      High | Row Total | 
-------------|-----------|-----------|-----------|-----------|
       <200m |      1291 |       959 |      1511 |      3761 | 
             |      0.18 |      0.13 |      0.10 |           | 
-------------|-----------|-----------|-----------|-----------|
      >=200m |      5861 |      6314 |     13561 |     25736 | 
             |      0.82 |      0.87 |      0.90 |           | 
-------------|-----------|-----------|-----------|-----------|
Column Total |      7152 |      7273 |     15072 |     29497 | 
             |      0.24 |      0.25 |      0.51 |           | 
-------------|-----------|-----------|-----------|-----------|

Person-years grouped by Distance and Income Level (Column proportion shown at bottom)

             | Income Level
    Distance |       Low |       Mid |      High | Row Total |
-------------|-----------|-----------|-----------|-----------|
       <200m |      1291 |       959 |      1511 |      3761 |
             |      0.18 |      0.13 |      0.10 |           |
-------------|-----------|-----------|-----------|-----------|
      >=200m |      5861 |      6314 |     13561 |     25736 |
             |      0.82 |      0.87 |      0.90 |           |
-------------|-----------|-----------|-----------|-----------|
Column Total |      7152 |      7273 |     15072 |     29497 |
-------------|-----------|-----------|-----------|-----------|

Among the person-years spent in the low income group, 18% was spent in the <200m group. The proportion of person-years spent in the <200m group decreased with increasing income levels to 13% (Middle income) and then 10% (High income group). Thus, the income is inversely associated with distance to a major roadway in the study base. Thus, in the study base, increasing income is positively associated with distance to a major roadway (the higher the income, the farther the house from a major roadway).

d. Given the results from Part II questions 2b-c, and still assuming that income is a confounder, could adjusting for income change the crude effect of living distance to a major roadway on all-cause mortality? If so, in which direction would the IRR change?

The hypothesis that the income level is a confounder is feasible, given the associations shown in questions 2b and 2c, thus, adjustment for the income level is likely to change the effect of living close to a major roadway. The direction of confounding is upward because both of the associations to the exposure and outcome are positive, therefore, adjusting for it will change the effect downward.

e. Given the direction and the magnitude of the results decide whether it is reasonable to be concerned about income substantially affecting the results in this way.

Compared to the low income stratum (exposed person-years 18% of total), the proportion of exposed person-years were only about half (exposed person-years 10% of total; 45% reduction in exposure). All-cause mortality was reduced by 22% in the high income stratum compared to the low income stratum (0.78 in question II-2-c). It is reasonable to adjust for it, as the crude result of positive association was small in magnitude (incidence rate ratio 1.26 in question I-1).

f. After going through this exercise, do you now think that living <200m from a major roadway could increase the risk of all-cause mortality? Please explain.

It is possible there could be some effect of exposure on outcome even after adjustment for confounding. The hypothesis also biologically plausible as living close to major roads can increase exposure to polluted air and potentially increase traffic accidents.

g. For the purposes of this question, assume that sex is not associated with the exposure in the study base, but is moderately associated with the outcome in the unexposed (you do not need to show this). Based on these results, would you expect that adjusting for sex would change the point estimate of the association of distance to a major roadway with all-cause mortality? Explain (1 sentence) Would you expect it to change the precision of the estimate? Explain (1 sentence)

The point estimate will not be affected, as sex is not associated with the exposure but only with the outcome.

The precision of the estimate will be adversely affected, as sex is associated with the outcome.

3. If you were to include two different categories when controlling for income (above average income, below average income), and simultaneously stratified on sex, a two-category variable, (male, female), how many strata would you have? Alternatively, if you simultaneously control for income as the original four-category variable (quartiles) seen in the paper and sex, how many strata would you have?

For the first scenario, it is 2 x 2 = 4 strata.

For the second scenario, it is 4 x 2 = 8 strata.

4. Jointly control for confounding by income (as a 3-category variable) and sex and calculate the following:

a. Mantel-Haenszel summary incidence rate ratio and 95% confidence interval for the association between living <200m from a major roadway and all-cause mortality. Interpret your results.

## Data creation
hw3.4a <- dlply(hw3.dat,
                c("Sex","IncomeLevel"),
                function(df) {
                    out <- df[,c("Cases","PersonYears")]
                    rownames(out) <- df$Distance
                    as.matrix(out)
                }
                )

hw3.4a.names <- names(hw3.4a)
attributes(hw3.4a)$split_labels <- NULL
hw3.4a.array <- laply(hw3.4a, .fun = invisible)
hw3.4a.array <- aperm(hw3.4a.array, c(2,3,1))
dimnames(hw3.4a.array)[[3]] <- hw3.4a.names
hw3.4a.array

, , Men.Low

       Cases PersonYears
<200m     35         704
>=200m   145        3871

, , Men.Mid

       Cases PersonYears
<200m     26         684
>=200m   138        4238

, , Men.High

       Cases PersonYears
<200m     31        1134
>=200m   280        9765

, , Women.Low

       Cases PersonYears
<200m     28         587
>=200m    93        1990

, , Women.Mid

       Cases PersonYears
<200m     20         275
>=200m    98        2076

, , Women.High

       Cases PersonYears
<200m     26         377
>=200m   151        3796


## Test with epiR::epi.2by2()
epi.2by2(hw3.4a.array, method = "cohort.time", units = 1000)

             Disease +    Time at risk        Inc rate *
Exposed +          166            3761              44.1
Exposed -          905           25736              35.2
Total             1071           29497              36.3

Point estimates and 95 % CIs:
---------------------------------------------------------
Inc rate ratio (crude)                    1.26 (1.06, 1.48)
Inc rate ratio (M-H)                      1.22 (1.03, 1.44)
Inc rate ratio (crude:M-H)                1.03
Attrib rate (crude) *                     8.97 (1.88, 16.07)
Attrib rate (M-H) *                       7.96 (0.82, 15.11)
Attrib rate (crude:M-H)                   1.13
---------------------------------------------------------
 * Cases per 1000 units of population time at risk

Calculation was carried out using the following summary rate ratio formula (note 7, page 4).

Summary rate ratio = \( exp(X) = \frac {\sum {w_i \frac {a_i / N_{1i}} {b_i / N_{0i}}}} {\sum {W_i}} \)

\( w_i = \frac {b_i N_{1i}} {T_i} \)

\( \hat{Var}(X) = \frac {\sum{(M_{1i} N_{1i} N_{0i}) / T_i^2}} {[\sum{\frac{a_i N_{0i}} {T_i}] [\sum{\frac{b_i N_{1i}} {T_i}]}}} \)

Confidence interval (log scale) = \( X \pm Z_{1-\alpha /2} \sqrt{\hat{Var}(X)} \)

## Vectorized calculation using R
ai  = hw3.4a.array[1,1,] # Exposed case count vector
bi  = hw3.4a.array[2,1,] # Unexposed case count vector
N1i = hw3.4a.array[1,2,] # Exposed person-time vector
N0i = hw3.4a.array[2,2,] # Unexposed person-time vector
M1i = ai + bi            # Total case count vector
Ti  = N1i + N0i          # Total person-time vector

wi  = (bi * N1i) / Ti    # Weight vector
Var = sum(1/Ti^2 * M1i * N1i * N0i) / ((sum(ai * N0i / Ti)) * (sum(bi * N1i / Ti))) # Variance

## Mantel-Haenszel summary incidence rate ratio
mh.irr = sum(wi * ((ai / N1i) / (bi / N0i))) / sum(wi)
cat(mh.irr,"\n")

1.221


## 95% confidence interval for Mantel-Haenszel summary incidence rate ratio
log.95ci = log(mh.irr) + c(-1,1) * Z * sqrt(Var)
cat(exp(log.95ci),"\n")

1.034 1.441

Mantel-Haenszel adjusted incidence rate ratio = 1.22 (95% confidence interval 1.03, 1.44)

After adjusting for income and sex, living closer to a major roadway is still associated with 22% increased incidence of all-cause mortality. Values between 3% to 44% are compatible with the data. We are 95% confident that the true value lies between this boundaries. The confidence interval is wide and include a near-null value of 3%, indicating the estimate (22%) is not very stable. However, it still does not include the null value of 1.00, thus it is likely to be significant in hypothesis testing.

b. Calculate the summary incidence rate difference and the 95% confidence interval (using the most efficient weighting scheme) for the association between living <200m from a major roadway and all-cause mortality. Interpret your results.

Calculation was carried out using the following summary rate difference formula (note 7, page 4).

Summary rate difference = \( X = \frac {\sum {w_i [\frac{a_i} {N_{1i}} - \frac{b_i} {N_{0i}}]}} {\sum {w_i}} \)

\( w_i = \frac {N_{1i}^2 N_{0i}^2} {a_i N_{0i}^2 + b_i N_{1i}^2} \)

\( \hat{Var}(X) = \frac {1} {\sum {w_i}} \)

Confidence interval = \( X \pm Z_{1-\alpha /2} \sqrt{\hat{Var}(X)} \)

## ai, bi, N1i, N0i, M1i, and Ti were defined previously.
wi  = (N1i^2 * N0i^2) / (ai * N0i^2 + bi * N1i^2) # Inverse variance weight vector
Var = 1 / sum(wi)                                 # Variance

## Inverse variance weight summary incidence rate difference
iw.ird = sum(wi * (ai/N1i - bi/N0i)) / sum(wi)
cat(iw.ird * 1000, "\n")

5.254


## 95% CI for inverse weight summary incidence rate difference
ci95 = iw.ird + c(-1,1) * Z * sqrt(Var)
cat(ci95 * 1000, "\n")

-1.532 12.04

Summary incidence rate difference = 5.25 (95% confidence interval -1.53, 12.03) per 1000 person-years.

After adjusting for income and sex, living closer to a major roadway is still associated with 5.25 excess cases of all-cause mortality per 1000 person-years of observation. Values between -1.53 to 12.03 (per 1000 person-years) are compatible with the data. We are 95% confident that the true value lies between these boundaries. The confidence interval is relatively wide, and it includes the null value of 0, meaning no difference or difference in the opposite direction (negative value) is also compatible with the data. Thus it is likely to be non-significant in hypothesis testing.

c. Test the hypothesis of no association between living <200m from a major roadway and all-cause mortality. State the null and alternative hypotheses. Calculate the p-value for the test statistic. Interpret your results.

X       = sum(ai)                       # X
EX_h0   = sum((N1i * M1i) / Ti)         # E(X|H0)
VarX_h0 = sum(1/Ti^2 * N1i * N0i * M1i) # Var(X|H0)

Z_squared = (X - EX_h0)^2 / VarX_h0     # Z^2 value
pchisq(Z_squared, 1, low = F)           # Chi-squared distribution with 1 degree of freedom

[1] 0.01856

\( X = \sum a_i = 166 \)

\( \hat{E}(X|H_0) = \frac {N_{1i} M_{1i}} {T_i} = 140.2027 \)

\( \hat{Var}(\hat{X}|H_0) = \frac {N_{1i} M_{1i} M_{1i}} {T_i^2} = 120.0692 \)

\( Z^2 = \frac {[X - \hat{E}(X|H_0)]^2} {\hat{Var}(X|H_0)} = (166 - 140.2027)^2 / 120.0692 = 5.542643 \)

Chi-squared p-value (degree of freedom = 1) for this value is 0.019. This figure means that thre is a 0.019 chance of having this result even if the null hypothesis (living close to a major roadway has no effect) is true. It is also lower than the conventional value of 0.05, therefore, there is a significant difference between these two groups after adjusting for income and sex.

5. Compare the results of the crude IRR and the adjusted IRR for the association between proximity to a major roadway and all-cause mortality calculated above.

crude IRR 8.97 (95% confidence interval 1.88, 16.07) / 1000 person-years
adjusted IRR 5.25 (95% confidence interval -1.53, 12.04) / 1000 person-years

The effect was changed downward as expected by upward confounding. In the differene scale, the 95% confidence interval includes the null value of zero after adjustment

a. From the authors’ perspective, explain why the results might be similar/dissimilar.

The authors stated that Adjustment for key personal, clinical, and neighborhood-level socioeconomic covariates did not alter the results substantially after age adjustment. The authors think these estimates are similar. These values are indeed similar as they are the same in direction, and almost the same in magnitude.

b. Income may be a consequence of proximity to a major roadway. Draw the DAG that includes income, proximity to a major roadway, and all-cause mortality (No need to include direction of associations here). How would you interpret the adjusted and unadjusted estimates calculated above based on this assumption?

par( mar = c(0,0,0,0), cex = 1.5 )
plot(NA, bty = "n", xlim = 0:1*100, ylim = c(0.4,0.9)*100, xaxt = "n", yaxt = "n", xlab = "", ylab = "")
dag.a  <- tbox( "Proximity to a major roadway", 20, 50, 22, 10, col.border = "white")
dag.c  <- tbox( "Low income",                   50, 80, 22, 10, col.border = "white")
dag.y  <- tbox( "All-cause mortality",          70, 50, 22, 10, col.border = "white" )
boxarr(dag.a, dag.c , col = gray(0.7), lwd = 3)
boxarr(dag.c, dag.y , col = gray(0.7), lwd = 3)

plot of chunk unnamed-chunk-14

In this case income is an intermediate variable through which proximity to a major roadway influences the all-cause mortality. In this scenario, the crude estimate of the effect is the correct effect, and adjusting for the income will distort this legitimate effect. The adjusted estimate will give the effect of proximity to a major roadway on all-cause mortality through pathways other than through income. This is appropriate only if the authors wanted the effect of the exposure not mediated by income.

6. After collecting the data, you realize that age, rather than income, is the appropriate variable to control for confounding, but you have not collected data on age. Age and income are moderately correlated in the population (on average, income increases with age). Draw the DAG representing this scenario. What will be the consequence of adjusting for income on your estimate of the effect of distance to a major roadway on all-cause mortality, relative to the estimates you would have obtained adjusting for 1) nothing (unadjusted), and 2) age?

par( mar = c(0,0,0,0), cex = 1.5 )
plot(NA, bty = "n", xlim = 0:1*100, ylim = c(0.4,0.9)*100, xaxt = "n", yaxt = "n", xlab = "", ylab = "")

dag.a  <- tbox( "Proximity to a major roadway", 20, 50, 22, 10, col.border = "white")
dag.c1 <- tbox( "Age",                          10, 80, 22, 10, col.border = "white")
dag.c2 <- tbox( "Low income",                   50, 80, 22, 10, col.border = "white")
dag.y  <- tbox( "All-cause mortality",          70, 50, 22, 10, col.border = "white")

boxarr(dag.c1, dag.a , col = gray(0.7), lwd = 3)
boxarr(dag.c1, dag.y , col = gray(0.7), lwd = 3)
boxarr(dag.c1, dag.c2 , col = gray(0.7), lwd = 3)

text(18, 65, "-")
text(45, 65, "+")
text(30, 82, "-")

plot of chunk unnamed-chunk-15

It is assumed that higher age is positively correlated with higher income, that higher age is inversely associated with living closer to a major roadway, and that higher age is positively associated with all-cause mortality. In this case, the direction of confounding by age will be downward (the closer to a major roadway, the lower the all-cause mortality).

1) Compared to the unadjusted estimate, adjustment for income will decrease the magnitude of confounding but not as much as adjustment for age, depending on the degree of association between age and income. Thus, the effect estimate will change upward, but not enough to make the estimate unbiased.

2) Compared to the estimate adjusted for age, the estimate adjusted for income is incomplete. Therefore, the estimate will be lower than the estimate obtained by adjusting for age.

III. EFFECT MODIFICATION

1. Please calculate the following: a. Incidence rate ratio and incidence rate difference for the association between distance to major roadway and all-cause mortality for each strata defined by sex and income level simultaneously. Using the “eyeball test”, briefly comment on the direction of the associations and if they are consistent across strata.

Incidence rate ratio stratified by sex and income level

III.1.a1 <- ddply(hw3.dat,
                  c("Sex","IncomeLevel"),
                  function(df) {
                      df$IR <- df$Cases / df$PersonYears
                      IRR <- df[df$Distance == "<200m","IR"] / df[df$Distance == ">=200m","IR"]
                      data.frame(IRR = IRR)
                  }
                  )
round(xtabs(IRR ~ Sex + IncomeLevel, III.1.a1), 2)

       IncomeLevel
Sex      Low  Mid High
  Men   1.33 1.17 0.95
  Women 1.02 1.54 1.73

For men, the incidence rate ratio decreases as the income level increases. However, for women the incidence rate ratio increases as the income level increases. In each group, income appears to have effect on the outcome, however the effect is reversed in women compared to men. For men in the high income category, the effect is negative (0.95), whereas for women in the high income category, the effect is most prominent (1.73).

Incidence rate difference (per 1000 person-years) stratified by sex and income level

III.1.a2 <- ddply(hw3.dat,
                  c("Sex","IncomeLevel"),
                  function(df) {
                      df$IR <- df$Cases / df$PersonYears
                      IRD <- df[df$Distance == "<200m","IR"] - df[df$Distance == ">=200m","IR"]
                      data.frame(IRD = IRD)
                  }
                  )
round(xtabs(IRD ~ Sex + IncomeLevel, III.1.a2) * 1000, 2)

       IncomeLevel
Sex       Low   Mid  High
  Men   12.26  5.45 -1.34
  Women  0.97 25.52 29.19

For men, the incidence rate difference decreases as the income level increases. However, for women the incidence rate difference increases as the income level increases. In each group, income appears to have effect on the outcome, however the effect is reversed in women compared to men. For men in the high income category, the effect is negative (-1.34/1000PY), whereas for women in the high income category, the effect is most prominent (29.19/1000PY).

b. For the stratified data perform the test of homogeneity for the rate ratio. State the null and alternative hypotheses. Find the p-value for the test statistic and interpret. Interpret your findings.

\( H_0 \): Across all strata, the effect of exposure (living close to a major roadway) on outcome (all-cause mortality) measured by rate ratio is uniform.
\( H_1 \): The effect is not uniform.

\( H = \sum_{i=1}^I \frac {[\hat{X}_i - \hat{X}_{summary}]^2} {\hat{Var}_i[\hat{X_i}]} \)

where \( \hat{X}_i = log(IRR_i) \), \( \hat{X}_{summary} = log(IRR_{MH}) \), and \( \hat{Var}_i(\hat{X}_i) = \frac {1} {a_i} + \frac {1} {b_i} \)

Vectorized calculation using R

IRRi     = (ai/N1i) / (bi/N0i) # Incidence rate ratio for each stratum
logIRRi  = log(IRRi)           # log incidence rate ratio for each stratum
logIRRmh = log(mh.irr)         # log Mantel-Haenszel summary incidence rate ratio (Q II-4-a)
VarXi    = 1 / ai  + 1 / bi    # Variance for each stratum

## Chi squared value for I - 1 degrees of freedom
H        = sum((logIRRi - logIRRmh)^2 / VarXi) 
pchisq(H, df = 6-1, lower.tail = F)

[1] 0.2812

The chi-squared value is 6.27 with five degrees of freedom, thus, P = 0.281. Therefore, statistically speaking, there was no statically significant heterogeneity across strata defined by sex and income in the effect measured in the ratio scale.

c. Now perform the test of homogeneity for the rate difference. State the null and alternative hypotheses. Find the p-value for the test statistic and interpret. Interpret your findings.

\( H = \sum_{i=1}^I \frac {[\hat{X}_i - \hat{X}_{summary}]^2} {\hat{Var}_i[\hat{X_i}]} \)

where \( \hat{X}_i = IRD_i \), \( \hat{X}_{summary} = IRD_{inv var} \), and \( \hat{Var}_i(\hat{X}_i) = \frac {a} {N^2_{1i}} + \frac {b} {N^2_{0i}} \)

Vectorized calculation using R

IRDi      = (ai/N1i - bi/N0i) # Incidence rate difference for each stratum
IRDinvvar = iw.ird            # Inverse weight summary incidence rate difference (Q II-4-b)
VarXi     = ai / N1i^2 + bi / N0i^2 # Variance for each stratum
H         = sum((IRDi - IRDinvvar)^2 / VarXi) # Chi squared value for I - 1 degrees of freedom
pchisq(H, df = 6-1, lower.tail = F)

[1] 0.2371

The chi-squared value is 6.79 with five degrees of freedom, thus, P = 0.237, therefore, statistically speaking, there was no statically significant heterogeneity across strata defined by sex and income in the effect measured in the difference scale.

2. Is there evidence of effect measure modification on either the additive or multiplicative scales by income and sex jointly? Explain briefly.

Plots of effects in each stratum

library(ggplot2)

ird.plot <-
    ggplot(III.1.a2, aes(x = IncomeLevel, y = IRD*1000, group = Sex, color = Sex)) +
    theme_bw() + labs(title = "Incident rate difference", y = "IRD / 1000PY") +
    geom_point() + geom_line(lwd = 1, aes(lty = Sex)) +
    scale_color_manual(values = c("Men" = "blue", "Women" = "Red"))

irr.plot <-
    ggplot(III.1.a1, aes(x = IncomeLevel, y = IRR, group = Sex, color = Sex)) +
    theme_bw() + labs(title = "Incident rate ratio") +
    geom_point() + geom_line(lwd = 1, aes(lty = Sex)) +
    scale_color_manual(values = c("Men" = "blue", "Women" = "Red"))

..gg.multiplot(ird.plot, irr.plot, cols = 2)

plot of chunk unnamed-chunk-20

Although the test for depature from homogeneity was not significant, different trends in men (solid line) and women (broken line) are appant from the graphs, i.e., higher income level is protective in men whereas it is harmful in women. This tendency is observed both in the incidence rate difference graph and incidence rate ratio graph, thus, the effect measure modification between sex and income level is present in both additive and multiplicative scales.