Week 9 - Measures of Associations and Adjustments

Overview

Today:

Go over problem set
Review RD, RR, and ORs
Stratification for adjustment
Stratification for EMM
Causal hypothesis (in terms of RD, RR, OR, etc.)

Assignment Schedule and Due Dates

Updated assignments and due dates:

ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan

NOV 10 (Fri): DRAFT of analytic plan due

NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.

NOV 24 (Fri): DRAFT of analyses with code due

DEC 08 (Fri): Final written report

Post your written reports to Laulima by the end of the due date

Measures of Associations

Build clean dataset

# A descriptive table restricted to those with blood pressure AND PFOA data, and  stratified by blood pressure:
tbl1 <- 
 merged_final_clean %>% 
  filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>% 
  filter(age >= 18 & age <= 65) %>% 
  table1(~ RIAGENDR + RIDRETH3 + age + as.factor(edu_cat) + hh_inc + inc_ratio + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | hi_bp, data = .,
         topclass="Rtable1-zebra",
         overall = F, extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"))
tbl1

	Normal BP (N=969)	High BP (N=188)	P-value
Participant Sex
Female	493 (50.9%)	83 (44.1%)	0.108
Male	476 (49.1%)	105 (55.9%)
Self-reported Race/Ethnicity
Mexican American	153 (15.8%)	35 (18.6%)	<0.001
Non-Hispanic Asian	151 (15.6%)	24 (12.8%)
Non-Hispanic Black	193 (19.9%)	67 (35.6%)
Non-Hispanic White	326 (33.6%)	37 (19.7%)
Other Hispanic	92 (9.5%)	18 (9.6%)
Other Race - Including Multi-Racial	54 (5.6%)	7 (3.7%)
Participant Age (years)
Mean (SD)	39.9 (14.5)	53.4 (9.68)	<0.001
Median [MIN, MAX]	39.0 [18.0, 65.0]	56.0 [23.0, 65.0]
IQR [25%ile, 75%ile]	26.0 [27.0, 53.0]	13.0 [48.0, 61.0]
N Missing	0	0
Highest Completed Education
< HS	176 (18.2%)	49 (26.1%)	0.0429
HS	573 (59.1%)	102 (54.3%)
College	219 (22.6%)	37 (19.7%)
Missing	1 (0.1%)	0 (0%)
Household Income (in $1000s)
Mean (SD)	48.6 (32.4)	46.6 (33.8)	0.477
Median [MIN, MAX]	45.0 [0, 100]	35.0 [0, 100]
IQR [25%ile, 75%ile]	55.0 [20.0, 75.0]	55.0 [20.0, 75.0]
N Missing	78	16
Missing	78 (8.0%)	16 (8.5%)
Family Income:Poverty Line Ratio
Mean (SD)	2.47 (1.66)	2.58 (1.61)	0.442
Median [MIN, MAX]	2.02 [0, 5.00]	2.34 [0, 5.00]
IQR [25%ile, 75%ile]	3.02 [1.06, 4.08]	3.03 [1.19, 4.22]
N Missing	112	30
Missing	112 (11.6%)	30 (16.0%)
Systolic Blood Pressure (mmHg)
Mean (SD)	116 (11.1)	149 (14.5)	<0.001
Median [MIN, MAX]	116 [88.0, 138]	146 [122, 216]
IQR [25%ile, 75%ile]	18.0 [108, 126]	12.5 [142, 154]
N Missing	0	0
Diastolic Blood Pressure (mmHg)
Mean (SD)	70.1 (10.3)	85.5 (13.8)	<0.001
Median [MIN, MAX]	72.0 [0, 88.0]	88.0 [0, 124]
IQR [25%ile, 75%ile]	12.0 [64.0, 76.0]	16.5 [77.5, 94.0]
N Missing	0	0
RBC Folate (ng/mL)
Mean (SD)	1070 (415)	1100 (657)	0.818
Median [MIN, MAX]	1010 [379, 3340]	976 [219, 4540]
IQR [25%ile, 75%ile]	469 [791, 1260]	544 [751, 1300]
N Missing	483	133
Missing	483 (49.8%)	133 (70.7%)
Plasma PFOA (ng/mL)
Mean (SD)	1.59 (2.24)	1.70 (1.14)	0.311
Median [MIN, MAX]	1.20 [0.0700, 52.8]	1.50 [0.100, 9.50]
IQR [25%ile, 75%ile]	1.10 [0.700, 1.80]	1.13 [1.00, 2.13]
N Missing	0	0
Plasma PFOS (ng/mL)
Mean (SD)	4.20 (5.21)	5.73 (7.26)	0.0065
Median [MIN, MAX]	2.80 [0.0700, 63.1]	3.60 [0.100, 59.9]
IQR [25%ile, 75%ile]	3.10 [1.60, 4.70]	3.25 [2.18, 5.43]
N Missing	0	0
Plasma PFNA (ng/mL)
Mean (SD)	0.523 (0.526)	0.630 (0.492)	0.00735
Median [MIN, MAX]	0.400 [0.0700, 6.50]	0.500 [0.0700, 3.00]
IQR [25%ile, 75%ile]	0.400 [0.200, 0.600]	0.500 [0.300, 0.800]
N Missing	0	0
Plasma PFHxS (ng/mL)
Mean (SD)	1.61 (2.78)	1.55 (1.63)	0.709
Median [MIN, MAX]	1.00 [0.0700, 48.8]	1.10 [0.100, 17.0]
IQR [25%ile, 75%ile]	1.20 [0.500, 1.70]	1.10 [0.800, 1.90]
N Missing	0	0

Select adjustment set with DAGs

The implied adjustment set (if this DAG is true):

ggdag_adjustment_set(bp_dag_new2) + theme_classic() + geom_dag_text(col = "Orange")

Regression-based estimates

# Create an analytic set
# Previously, we had edu_cat, hi_bp, and hi_pfas as labelled variables to make the tables nices
# For ease of analysis, I'm going to transform them back to numbers

analytic_set <- 
  merged_final_clean %>% 
  #filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>% 
  filter(!is.na(hi_bp) & !is.na(hi_pfas)) %>% 
  filter(age >= 18 & age <= 65) %>% # Restricting our analytic set as we decided above
  mutate(edu_cat = as.numeric(edu_cat)-1,
         hi_bp = as.numeric(hi_bp)-1,
         hi_pfas = as.numeric(hi_pfas)-1)

Last time we showed mean difference in systolic blood pressures by 1 unit (ng/mL) change in PFOS:

mdl <- lm(data = analytic_set, formula = BPXSY1 ~ LBXNFOS) 
summary(mdl)

## 
## Call:
## lm(formula = BPXSY1 ~ LBXNFOS, data = analytic_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.566 -11.153  -2.349   8.651  95.652 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.74715    0.62407 191.880  < 2e-16 ***
## LBXNFOS       0.40034    0.08712   4.595  4.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.64 on 1155 degrees of freedom
## Multiple R-squared:  0.01795,    Adjusted R-squared:  0.0171 
## F-statistic: 21.11 on 1 and 1155 DF,  p-value: 4.802e-06

beta = round(summary(mdl)$coefficients[2,1], 2)
ci_lb = round(summary(mdl)$coefficients[2,1] - 1.96*summary(mdl)$coefficients[2,2], 2)
ci_ub = round(summary(mdl)$coefficients[2,1] + 1.96*summary(mdl)$coefficients[2,2], 2)

# We also showed the diagnostics for this regression
# mdl %>% ggplot() + geom_point(aes(x = .fitted, y = .resid)) + geom_hline(yintercept = 0) # one way
# ggResidpanel::resid_panel(mdl) # more comprehensive way

We observe a 0.4 mmHg higher in systolic blood pressure per 1 ng/mL higher PFAS exposure (beta = 0.4, 95% CI: 0.23, 0.57).

We can also look at exposure in a different way, rather than linear changes per unit, we can look at changes in orders of magnitude (many biological systems work in this way). We can do this quickly by specifying this in the model, here we specify log2 which conveniently gives us the effect size from a doubling of the measure:

mdl2 <- lm(data = analytic_set, formula = BPXSY1 ~ log2(LBXNFOS)) 
summary(mdl2)

## 
## Call:
## lm(formula = BPXSY1 ~ log2(LBXNFOS), data = analytic_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.666 -10.974  -2.393   8.473  96.284 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   118.6089     0.7600 156.056  < 2e-16 ***
## log2(LBXNFOS)   1.8918     0.3775   5.012 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.61 on 1155 degrees of freedom
## Multiple R-squared:  0.02129,    Adjusted R-squared:  0.02044 
## F-statistic: 25.12 on 1 and 1155 DF,  p-value: 6.228e-07

beta2 = round(summary(mdl2)$coefficients[2,1], 2)
ci_lb2 = round(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2], 2)
ci_ub2 = round(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2], 2)

We observe a 1.89 mmHg higher in systolic blood pressure for each doubling of PFAS exposure (beta = 1.89, 95% CI: 1.15, 2.63).

We can also look at Risk Ratios:

mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "log")) 
summary(mdl3)

## 
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "log"), 
##     data = analytic_set)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.13449    0.11661 -18.305  < 2e-16 ***
## log2(LBXNFOS)  0.18681    0.04969   3.759  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1013.6  on 1155  degrees of freedom
## AIC: 1017.6
## 
## Number of Fisher Scoring iterations: 6

rr3 = round(exp(summary(mdl2)$coefficients[2,1]), 2)
ci_lb3 = round(exp(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2]), 2)
ci_ub3 = round(exp(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2]), 2)

Each doubling of plasma PFAS exposure is associated with 6.63 as high risk of high blood pressure (beta = 6.63, 95% CI: 3.16, 13.9), in U.S. adults between the ages of 18-65.

…or Odds Ratios:

mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit")) 
summary(mdl3)

## 
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.01585    0.13754 -14.656  < 2e-16 ***
## log2(LBXNFOS)  0.22539    0.06245   3.609 0.000307 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1013.7  on 1155  degrees of freedom
## AIC: 1017.7
## 
## Number of Fisher Scoring iterations: 4

rr3 = round(exp(summary(mdl2)$coefficients[2,1]), 2)
ci_lb3 = round(exp(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2]), 2)
ci_ub3 = round(exp(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2]), 2)

Each doubling of plasma PFAS exposure is associated with 6.63 as high odds of high blood pressure (beta = 6.63, 95% CI: 3.16, 13.9), in U.S. adults between the ages of 18-65.

RD, RR, OR by hand

These regression results are not magical, and this can be seen by comparing the associations between a binary exposure (hi_pfas) and binary outcome (hi_bp) computed by hand and by regression.

Let’s see the original 2x2 table: Rows = Outcome (0 = normal BP, 1 = high BP) Columns = Exposure (0 = normal PFAS, 1 = high PFAS)

org_2x2 <- table(analytic_set$hi_bp, analytic_set$hi_pfas)
org_2x2

##    
##       0   1
##   0 524 445
##   1  78 110

A risk difference can be calculated as (risk amongst exposed) - (risk amongst unexposed).

(110/(110+445)) - (78/(78+524))

What is the regression-based estimate:

lm(data = analytic_set, formula = hi_bp ~ hi_pfas)

Why is this?

## `geom_smooth()` using formula = 'y ~ x'

How about the _risk ratio__ (risk amongst exposed) / (risk amongst unexposed):

(110/(110+445)) / (78/(78+524))

rr_mdl <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "log"))

exp(summary(rr_mdl)$coefficients[2,1])

And the odds ratio (ODDS amongst exposed) / (ODDS amongst unexposed):

(110/(445)) / (78/(524)), or to re-arrange

(110 * 524) / (78 * 445)

or_mdl <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"))

exp(summary(or_mdl)$coefficients[2,1])

For continuous exposures, the model fit is just repeated across the range of X values!

Adjustment

We know for our final model we need to adjust for the confounder educational status. What does adjustment actually do? How does it work?

Let’s go back to binary odds ratios:

mdl4 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit")) 
summary(mdl4)

## 
## Call:
## glm(formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9048     0.1214 -15.695  < 2e-16 ***
## hi_pfas       0.5072     0.1615   3.141  0.00168 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1016.9  on 1155  degrees of freedom
## AIC: 1020.9
## 
## Number of Fisher Scoring iterations: 4

Having higher than median plasma PFOS is associated with a 1.66 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65.

Regression-based adjustment

Next, we adjust for educational category (as a categorical variable, meaning each value has a different estimate):

mdl5 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit")) 
summary(mdl5)

## 
## Call:
## glm(formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.5404     0.1843  -8.358  < 2e-16 ***
## hi_pfas               0.5245     0.1624   3.230  0.00124 ** 
## as.factor(edu_cat)1  -0.4503     0.1951  -2.308  0.02098 *  
## as.factor(edu_cat)2  -0.5486     0.2419  -2.268  0.02334 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.5  on 1155  degrees of freedom
## Residual deviance: 1010.0  on 1152  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1018
## 
## Number of Fisher Scoring iterations: 4

Having higher than median plasma PFOS is associated with a 1.69 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65, adjusted for differences in distribution of education category AND assuming that the odds ratios are the same between educational categories.

How do we arrive at this estimate? Basically, we take the estimated association within each level of the confounder and then average the results in a particular way. The MH-stratified adjustment shows us this more clearly:

Stratum-specific Estimates

set_edu0 = table(filter(analytic_set, edu_cat == 0)$hi_bp, filter(analytic_set, edu_cat == 0)$hi_pfas)
set_edu1 = table(filter(analytic_set, edu_cat == 1)$hi_bp, filter(analytic_set, edu_cat == 1)$hi_pfas)
set_edu2 = table(filter(analytic_set, edu_cat == 2)$hi_bp, filter(analytic_set, edu_cat == 2)$hi_pfas)

Stratum-specific odds ratio for Less than High School (edu_cat = 0):

set_edu0

##    
##      0  1
##   0 96 80
##   1 25 24

Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)

oddsratio.wald(set_edu0, rev = "both")

OR = [exposed cases (24) * unexposed controls (95)] / [unexposed cases (25) / exposed controls (81)]

There is less data in this stratum, so estimates within this group are less stable, which has implications for the overall adjustment!

Stratum-specific odds ratio for Less than High School (edu_cat = 1):

set_edu1

##    
##       0   1
##   0 325 248
##   1  40  62

Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)

oddsratio.wald(set_edu1, rev = c("both"))

Compute this OR.

Stratum-specific odds ratio for Less than High School (edu_cat = 2):

set_edu2

##    
##       0   1
##   0 103 116
##   1  13  24

Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)

oddsratio.wald(set_edu2, rev = c("both"))

## $data
##        
##           1   0 Total
##   1      24  13    37
##   0     116 103   219
##   Total 140 116   256
## 
## $measure
##    odds ratio with 95% C.I.
##     estimate     lower    upper
##   1 1.000000        NA       NA
##   0 1.639257 0.7937441 3.385429
## 
## $p.value
##    two-sided
##     midp.exact fisher.exact chi.square
##   1         NA           NA         NA
##   0   0.184154    0.2128106  0.1787665
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

Compute this OR.

Mantel-Haeszel Adjustment

Compute the MH Adjusted Estimate by Hand:

a = exposed cases b = exposed controls c = un-exposed cases d = un-exposed controls

Summing across all strata of the confounder
Take the cross product of a X d / divide by the total number in the stratum
Take the cross product of b X c / divide by the total number in the stratum
Take the ratio of those two measures

Check your work:

strat_set <-array(c(set_edu0, set_edu1, set_edu2), dim = c(2, 2, 3))
epi.2by2(strat_set)

Recall the logistic regression estimate from the previous model:

OR = 1.69 [95% CI: 1.23, 2.32]

What adjustment does

Adjustment takes the estimates within level of the confounder and then lets each stratum contribute to the overall estimate proportional to the amount of data (observations) in each of the stratum, assuming that there is one common odds ratio.

Stratum-specific estimates

Let us assume that there are different effects for people with different educational attainment. We can represent that with an interaction term, and therefore there is not one common OR:

mdl6 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas * as.factor(edu_cat), family = binomial(link = "logit")) 
summary(mdl6)

## 
## Call:
## glm(formula = hi_bp ~ hi_pfas * as.factor(edu_cat), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.3455     0.2245  -5.992 2.07e-09 ***
## hi_pfas                       0.1415     0.3234   0.438  0.66172    
## as.factor(edu_cat)1          -0.7495     0.2802  -2.675  0.00747 ** 
## as.factor(edu_cat)2          -0.7243     0.3702  -1.957  0.05040 .  
## hi_pfas:as.factor(edu_cat)1   0.5672     0.3909   1.451  0.14683    
## hi_pfas:as.factor(edu_cat)2   0.3527     0.4914   0.718  0.47288    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.5  on 1155  degrees of freedom
## Residual deviance: 1007.9  on 1150  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1019.9
## 
## Number of Fisher Scoring iterations: 4

OR in edu_cat == 0: 1.152

OR in edu_cat == 1: 2.03125

OR in edu_cat == 2: 1.6392573

Week 9 - Measures of Associations and Adjustments

Jon Huang

2023-10-18

Overview

Assignment Schedule and Due Dates

Measures of Associations

Build clean dataset

Select adjustment set with DAGs

Regression-based estimates

RD, RR, OR by hand

Adjustment

Regression-based adjustment

Stratum-specific Estimates

Mantel-Haeszel Adjustment

What adjustment does

Stratum-specific estimates