Overview

Today:

Go over problem set
Review RD, RR, and ORs
Stratification for adjustment
Stratification for EMM
Causal hypothesis (in terms of RD, RR, OR, etc.)


Assignment Schedule and Due Dates

Updated assignments and due dates:

  • ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan
  • NOV 10 (Fri): DRAFT of analytic plan due
  • NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.
  • NOV 24 (Fri): DRAFT of analyses with code due
  • DEC 08 (Fri): Final written report
    • Post your written reports to Laulima by the end of the due date

Measures of Associations

Build clean dataset

# A descriptive table restricted to those with blood pressure AND PFOA data, and  stratified by blood pressure:
tbl1 <- 
 merged_final_clean %>% 
  filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>% 
  filter(age >= 18 & age <= 65) %>% 
  table1(~ RIAGENDR + RIDRETH3 + age + as.factor(edu_cat) + hh_inc + inc_ratio + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | hi_bp, data = .,
         topclass="Rtable1-zebra",
         overall = F, extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"))
tbl1
Normal BP
(N=969)
High BP
(N=188)
P-value
Participant Sex
Female 493 (50.9%) 83 (44.1%) 0.108
Male 476 (49.1%) 105 (55.9%)
Self-reported Race/Ethnicity
Mexican American 153 (15.8%) 35 (18.6%) <0.001
Non-Hispanic Asian 151 (15.6%) 24 (12.8%)
Non-Hispanic Black 193 (19.9%) 67 (35.6%)
Non-Hispanic White 326 (33.6%) 37 (19.7%)
Other Hispanic 92 (9.5%) 18 (9.6%)
Other Race - Including Multi-Racial 54 (5.6%) 7 (3.7%)
Participant Age (years)
Mean (SD) 39.9 (14.5) 53.4 (9.68) <0.001
Median [MIN, MAX] 39.0 [18.0, 65.0] 56.0 [23.0, 65.0]
IQR [25%ile, 75%ile] 26.0 [27.0, 53.0] 13.0 [48.0, 61.0]
N Missing 0 0
Highest Completed Education
< HS 176 (18.2%) 49 (26.1%) 0.0429
HS 573 (59.1%) 102 (54.3%)
College 219 (22.6%) 37 (19.7%)
Missing 1 (0.1%) 0 (0%)
Household Income (in $1000s)
Mean (SD) 48.6 (32.4) 46.6 (33.8) 0.477
Median [MIN, MAX] 45.0 [0, 100] 35.0 [0, 100]
IQR [25%ile, 75%ile] 55.0 [20.0, 75.0] 55.0 [20.0, 75.0]
N Missing 78 16
Missing 78 (8.0%) 16 (8.5%)
Family Income:Poverty Line Ratio
Mean (SD) 2.47 (1.66) 2.58 (1.61) 0.442
Median [MIN, MAX] 2.02 [0, 5.00] 2.34 [0, 5.00]
IQR [25%ile, 75%ile] 3.02 [1.06, 4.08] 3.03 [1.19, 4.22]
N Missing 112 30
Missing 112 (11.6%) 30 (16.0%)
Systolic Blood Pressure (mmHg)
Mean (SD) 116 (11.1) 149 (14.5) <0.001
Median [MIN, MAX] 116 [88.0, 138] 146 [122, 216]
IQR [25%ile, 75%ile] 18.0 [108, 126] 12.5 [142, 154]
N Missing 0 0
Diastolic Blood Pressure (mmHg)
Mean (SD) 70.1 (10.3) 85.5 (13.8) <0.001
Median [MIN, MAX] 72.0 [0, 88.0] 88.0 [0, 124]
IQR [25%ile, 75%ile] 12.0 [64.0, 76.0] 16.5 [77.5, 94.0]
N Missing 0 0
RBC Folate (ng/mL)
Mean (SD) 1070 (415) 1100 (657) 0.818
Median [MIN, MAX] 1010 [379, 3340] 976 [219, 4540]
IQR [25%ile, 75%ile] 469 [791, 1260] 544 [751, 1300]
N Missing 483 133
Missing 483 (49.8%) 133 (70.7%)
Plasma PFOA (ng/mL)
Mean (SD) 1.59 (2.24) 1.70 (1.14) 0.311
Median [MIN, MAX] 1.20 [0.0700, 52.8] 1.50 [0.100, 9.50]
IQR [25%ile, 75%ile] 1.10 [0.700, 1.80] 1.13 [1.00, 2.13]
N Missing 0 0
Plasma PFOS (ng/mL)
Mean (SD) 4.20 (5.21) 5.73 (7.26) 0.0065
Median [MIN, MAX] 2.80 [0.0700, 63.1] 3.60 [0.100, 59.9]
IQR [25%ile, 75%ile] 3.10 [1.60, 4.70] 3.25 [2.18, 5.43]
N Missing 0 0
Plasma PFNA (ng/mL)
Mean (SD) 0.523 (0.526) 0.630 (0.492) 0.00735
Median [MIN, MAX] 0.400 [0.0700, 6.50] 0.500 [0.0700, 3.00]
IQR [25%ile, 75%ile] 0.400 [0.200, 0.600] 0.500 [0.300, 0.800]
N Missing 0 0
Plasma PFHxS (ng/mL)
Mean (SD) 1.61 (2.78) 1.55 (1.63) 0.709
Median [MIN, MAX] 1.00 [0.0700, 48.8] 1.10 [0.100, 17.0]
IQR [25%ile, 75%ile] 1.20 [0.500, 1.70] 1.10 [0.800, 1.90]
N Missing 0 0

Select adjustment set with DAGs

The implied adjustment set (if this DAG is true):

ggdag_adjustment_set(bp_dag_new2) + theme_classic() + geom_dag_text(col = "Orange")

Regression-based estimates

# Create an analytic set
# Previously, we had edu_cat, hi_bp, and hi_pfas as labelled variables to make the tables nices
# For ease of analysis, I'm going to transform them back to numbers

analytic_set <- 
  merged_final_clean %>% 
  #filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>% 
  filter(!is.na(hi_bp) & !is.na(hi_pfas)) %>% 
  filter(age >= 18 & age <= 65) %>% # Restricting our analytic set as we decided above
  mutate(edu_cat = as.numeric(edu_cat)-1,
         hi_bp = as.numeric(hi_bp)-1,
         hi_pfas = as.numeric(hi_pfas)-1) 

Last time we showed mean difference in systolic blood pressures by 1 unit (ng/mL) change in PFOS:

mdl <- lm(data = analytic_set, formula = BPXSY1 ~ LBXNFOS) 
summary(mdl)
## 
## Call:
## lm(formula = BPXSY1 ~ LBXNFOS, data = analytic_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.566 -11.153  -2.349   8.651  95.652 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.74715    0.62407 191.880  < 2e-16 ***
## LBXNFOS       0.40034    0.08712   4.595  4.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.64 on 1155 degrees of freedom
## Multiple R-squared:  0.01795,    Adjusted R-squared:  0.0171 
## F-statistic: 21.11 on 1 and 1155 DF,  p-value: 4.802e-06
beta = round(summary(mdl)$coefficients[2,1], 2)
ci_lb = round(summary(mdl)$coefficients[2,1] - 1.96*summary(mdl)$coefficients[2,2], 2)
ci_ub = round(summary(mdl)$coefficients[2,1] + 1.96*summary(mdl)$coefficients[2,2], 2)

# We also showed the diagnostics for this regression
# mdl %>% ggplot() + geom_point(aes(x = .fitted, y = .resid)) + geom_hline(yintercept = 0) # one way
# ggResidpanel::resid_panel(mdl) # more comprehensive way

We observe a 0.4 mmHg higher in systolic blood pressure per 1 ng/mL higher PFAS exposure (beta = 0.4, 95% CI: 0.23, 0.57).

We can also look at exposure in a different way, rather than linear changes per unit, we can look at changes in orders of magnitude (many biological systems work in this way). We can do this quickly by specifying this in the model, here we specify log2 which conveniently gives us the effect size from a doubling of the measure:

mdl2 <- lm(data = analytic_set, formula = BPXSY1 ~ log2(LBXNFOS)) 
summary(mdl2)
## 
## Call:
## lm(formula = BPXSY1 ~ log2(LBXNFOS), data = analytic_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.666 -10.974  -2.393   8.473  96.284 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   118.6089     0.7600 156.056  < 2e-16 ***
## log2(LBXNFOS)   1.8918     0.3775   5.012 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.61 on 1155 degrees of freedom
## Multiple R-squared:  0.02129,    Adjusted R-squared:  0.02044 
## F-statistic: 25.12 on 1 and 1155 DF,  p-value: 6.228e-07
beta2 = round(summary(mdl2)$coefficients[2,1], 2)
ci_lb2 = round(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2], 2)
ci_ub2 = round(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2], 2)

We observe a 1.89 mmHg higher in systolic blood pressure for each doubling of PFAS exposure (beta = 1.89, 95% CI: 1.15, 2.63).

We can also look at Risk Ratios:

mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "log")) 
summary(mdl3)
## 
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "log"), 
##     data = analytic_set)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.13449    0.11661 -18.305  < 2e-16 ***
## log2(LBXNFOS)  0.18681    0.04969   3.759  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1013.6  on 1155  degrees of freedom
## AIC: 1017.6
## 
## Number of Fisher Scoring iterations: 6
rr3 = round(exp(summary(mdl2)$coefficients[2,1]), 2)
ci_lb3 = round(exp(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2]), 2)
ci_ub3 = round(exp(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2]), 2)

Each doubling of plasma PFAS exposure is associated with 6.63 as high risk of high blood pressure (beta = 6.63, 95% CI: 3.16, 13.9), in U.S. adults between the ages of 18-65.

…or Odds Ratios:

mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit")) 
summary(mdl3)
## 
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.01585    0.13754 -14.656  < 2e-16 ***
## log2(LBXNFOS)  0.22539    0.06245   3.609 0.000307 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1013.7  on 1155  degrees of freedom
## AIC: 1017.7
## 
## Number of Fisher Scoring iterations: 4
rr3 = round(exp(summary(mdl2)$coefficients[2,1]), 2)
ci_lb3 = round(exp(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2]), 2)
ci_ub3 = round(exp(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2]), 2)

Each doubling of plasma PFAS exposure is associated with 6.63 as high odds of high blood pressure (beta = 6.63, 95% CI: 3.16, 13.9), in U.S. adults between the ages of 18-65.

RD, RR, OR by hand

These regression results are not magical, and this can be seen by comparing the associations between a binary exposure (hi_pfas) and binary outcome (hi_bp) computed by hand and by regression.

Let’s see the original 2x2 table: Rows = Outcome (0 = normal BP, 1 = high BP) Columns = Exposure (0 = normal PFAS, 1 = high PFAS)

org_2x2 <- table(analytic_set$hi_bp, analytic_set$hi_pfas)
org_2x2
##    
##       0   1
##   0 524 445
##   1  78 110

A risk difference can be calculated as (risk amongst exposed) - (risk amongst unexposed).

(110/(110+445)) - (78/(78+524))

What is the regression-based estimate:

lm(data = analytic_set, formula = hi_bp ~ hi_pfas)

Why is this?

## `geom_smooth()` using formula = 'y ~ x'

How about the _risk ratio__ (risk amongst exposed) / (risk amongst unexposed):

(110/(110+445)) / (78/(78+524))

rr_mdl <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "log"))

exp(summary(rr_mdl)$coefficients[2,1])

And the odds ratio (ODDS amongst exposed) / (ODDS amongst unexposed):

(110/(445)) / (78/(524)), or to re-arrange

(110 * 524) / (78 * 445)

or_mdl <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"))

exp(summary(or_mdl)$coefficients[2,1])

For continuous exposures, the model fit is just repeated across the range of X values!

Adjustment

We know for our final model we need to adjust for the confounder educational status. What does adjustment actually do? How does it work?

Let’s go back to binary odds ratios:

mdl4 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit")) 
summary(mdl4)
## 
## Call:
## glm(formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9048     0.1214 -15.695  < 2e-16 ***
## hi_pfas       0.5072     0.1615   3.141  0.00168 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1016.9  on 1155  degrees of freedom
## AIC: 1020.9
## 
## Number of Fisher Scoring iterations: 4

Having higher than median plasma PFOS is associated with a 1.66 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65.

Regression-based adjustment

Next, we adjust for educational category (as a categorical variable, meaning each value has a different estimate):

mdl5 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit")) 
summary(mdl5)
## 
## Call:
## glm(formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.5404     0.1843  -8.358  < 2e-16 ***
## hi_pfas               0.5245     0.1624   3.230  0.00124 ** 
## as.factor(edu_cat)1  -0.4503     0.1951  -2.308  0.02098 *  
## as.factor(edu_cat)2  -0.5486     0.2419  -2.268  0.02334 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.5  on 1155  degrees of freedom
## Residual deviance: 1010.0  on 1152  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1018
## 
## Number of Fisher Scoring iterations: 4

Having higher than median plasma PFOS is associated with a 1.69 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65, adjusted for differences in distribution of education category AND assuming that the odds ratios are the same between educational categories.

How do we arrive at this estimate? Basically, we take the estimated association within each level of the confounder and then average the results in a particular way. The MH-stratified adjustment shows us this more clearly:

Stratum-specific Estimates

set_edu0 = table(filter(analytic_set, edu_cat == 0)$hi_bp, filter(analytic_set, edu_cat == 0)$hi_pfas)
set_edu1 = table(filter(analytic_set, edu_cat == 1)$hi_bp, filter(analytic_set, edu_cat == 1)$hi_pfas)
set_edu2 = table(filter(analytic_set, edu_cat == 2)$hi_bp, filter(analytic_set, edu_cat == 2)$hi_pfas)

Stratum-specific odds ratio for Less than High School (edu_cat = 0):

set_edu0
##    
##      0  1
##   0 96 80
##   1 25 24

Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)

oddsratio.wald(set_edu0, rev = "both")

OR = [exposed cases (24) * unexposed controls (95)] / [unexposed cases (25) / exposed controls (81)]

There is less data in this stratum, so estimates within this group are less stable, which has implications for the overall adjustment!

Stratum-specific odds ratio for Less than High School (edu_cat = 1):

set_edu1
##    
##       0   1
##   0 325 248
##   1  40  62

Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)

oddsratio.wald(set_edu1, rev = c("both"))

Compute this OR.

Stratum-specific odds ratio for Less than High School (edu_cat = 2):

set_edu2
##    
##       0   1
##   0 103 116
##   1  13  24

Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)

oddsratio.wald(set_edu2, rev = c("both"))
## $data
##        
##           1   0 Total
##   1      24  13    37
##   0     116 103   219
##   Total 140 116   256
## 
## $measure
##    odds ratio with 95% C.I.
##     estimate     lower    upper
##   1 1.000000        NA       NA
##   0 1.639257 0.7937441 3.385429
## 
## $p.value
##    two-sided
##     midp.exact fisher.exact chi.square
##   1         NA           NA         NA
##   0   0.184154    0.2128106  0.1787665
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

Compute this OR.

Mantel-Haeszel Adjustment

Compute the MH Adjusted Estimate by Hand:

a = exposed cases b = exposed controls c = un-exposed cases d = un-exposed controls

  1. Summing across all strata of the confounder
  2. Take the cross product of a X d / divide by the total number in the stratum
  3. Take the cross product of b X c / divide by the total number in the stratum
  4. Take the ratio of those two measures

Check your work:

strat_set <-array(c(set_edu0, set_edu1, set_edu2), dim = c(2, 2, 3))
epi.2by2(strat_set)

Recall the logistic regression estimate from the previous model:

OR = 1.69 [95% CI: 1.23, 2.32]

What adjustment does

Adjustment takes the estimates within level of the confounder and then lets each stratum contribute to the overall estimate proportional to the amount of data (observations) in each of the stratum, assuming that there is one common odds ratio.

Stratum-specific estimates

Let us assume that there are different effects for people with different educational attainment. We can represent that with an interaction term, and therefore there is not one common OR:

mdl6 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas * as.factor(edu_cat), family = binomial(link = "logit")) 
summary(mdl6)
## 
## Call:
## glm(formula = hi_bp ~ hi_pfas * as.factor(edu_cat), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.3455     0.2245  -5.992 2.07e-09 ***
## hi_pfas                       0.1415     0.3234   0.438  0.66172    
## as.factor(edu_cat)1          -0.7495     0.2802  -2.675  0.00747 ** 
## as.factor(edu_cat)2          -0.7243     0.3702  -1.957  0.05040 .  
## hi_pfas:as.factor(edu_cat)1   0.5672     0.3909   1.451  0.14683    
## hi_pfas:as.factor(edu_cat)2   0.3527     0.4914   0.718  0.47288    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.5  on 1155  degrees of freedom
## Residual deviance: 1007.9  on 1150  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1019.9
## 
## Number of Fisher Scoring iterations: 4

OR in edu_cat == 0: 1.152

OR in edu_cat == 1: 2.03125

OR in edu_cat == 2: 1.6392573