Today:
Go over problem set
Review RD, RR, and ORs
Stratification for adjustment
Stratification for EMM
Causal hypothesis (in terms of RD, RR, OR, etc.)
Updated assignments and due dates:
- ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan
- NOV 10 (Fri): DRAFT of analytic plan due
- NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.
- NOV 24 (Fri): DRAFT of analyses with code due
- DEC 08 (Fri): Final written report
- Post your written reports to Laulima by the end of the due date
# A descriptive table restricted to those with blood pressure AND PFOA data, and stratified by blood pressure:
tbl1 <-
merged_final_clean %>%
filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>%
filter(age >= 18 & age <= 65) %>%
table1(~ RIAGENDR + RIDRETH3 + age + as.factor(edu_cat) + hh_inc + inc_ratio + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | hi_bp, data = .,
topclass="Rtable1-zebra",
overall = F, extra.col=list(`P-value`=pvalue),
render.continuous=c(.="Mean (SD)",
.="Median [MIN, MAX]",
"IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
"N Missing" = "NMISS"))
tbl1
| Normal BP (N=969) |
High BP (N=188) |
P-value | |
|---|---|---|---|
| Participant Sex | |||
| Female | 493 (50.9%) | 83 (44.1%) | 0.108 |
| Male | 476 (49.1%) | 105 (55.9%) | |
| Self-reported Race/Ethnicity | |||
| Mexican American | 153 (15.8%) | 35 (18.6%) | <0.001 |
| Non-Hispanic Asian | 151 (15.6%) | 24 (12.8%) | |
| Non-Hispanic Black | 193 (19.9%) | 67 (35.6%) | |
| Non-Hispanic White | 326 (33.6%) | 37 (19.7%) | |
| Other Hispanic | 92 (9.5%) | 18 (9.6%) | |
| Other Race - Including Multi-Racial | 54 (5.6%) | 7 (3.7%) | |
| Participant Age (years) | |||
| Mean (SD) | 39.9 (14.5) | 53.4 (9.68) | <0.001 |
| Median [MIN, MAX] | 39.0 [18.0, 65.0] | 56.0 [23.0, 65.0] | |
| IQR [25%ile, 75%ile] | 26.0 [27.0, 53.0] | 13.0 [48.0, 61.0] | |
| N Missing | 0 | 0 | |
| Highest Completed Education | |||
| < HS | 176 (18.2%) | 49 (26.1%) | 0.0429 |
| HS | 573 (59.1%) | 102 (54.3%) | |
| College | 219 (22.6%) | 37 (19.7%) | |
| Missing | 1 (0.1%) | 0 (0%) | |
| Household Income (in $1000s) | |||
| Mean (SD) | 48.6 (32.4) | 46.6 (33.8) | 0.477 |
| Median [MIN, MAX] | 45.0 [0, 100] | 35.0 [0, 100] | |
| IQR [25%ile, 75%ile] | 55.0 [20.0, 75.0] | 55.0 [20.0, 75.0] | |
| N Missing | 78 | 16 | |
| Missing | 78 (8.0%) | 16 (8.5%) | |
| Family Income:Poverty Line Ratio | |||
| Mean (SD) | 2.47 (1.66) | 2.58 (1.61) | 0.442 |
| Median [MIN, MAX] | 2.02 [0, 5.00] | 2.34 [0, 5.00] | |
| IQR [25%ile, 75%ile] | 3.02 [1.06, 4.08] | 3.03 [1.19, 4.22] | |
| N Missing | 112 | 30 | |
| Missing | 112 (11.6%) | 30 (16.0%) | |
| Systolic Blood Pressure (mmHg) | |||
| Mean (SD) | 116 (11.1) | 149 (14.5) | <0.001 |
| Median [MIN, MAX] | 116 [88.0, 138] | 146 [122, 216] | |
| IQR [25%ile, 75%ile] | 18.0 [108, 126] | 12.5 [142, 154] | |
| N Missing | 0 | 0 | |
| Diastolic Blood Pressure (mmHg) | |||
| Mean (SD) | 70.1 (10.3) | 85.5 (13.8) | <0.001 |
| Median [MIN, MAX] | 72.0 [0, 88.0] | 88.0 [0, 124] | |
| IQR [25%ile, 75%ile] | 12.0 [64.0, 76.0] | 16.5 [77.5, 94.0] | |
| N Missing | 0 | 0 | |
| RBC Folate (ng/mL) | |||
| Mean (SD) | 1070 (415) | 1100 (657) | 0.818 |
| Median [MIN, MAX] | 1010 [379, 3340] | 976 [219, 4540] | |
| IQR [25%ile, 75%ile] | 469 [791, 1260] | 544 [751, 1300] | |
| N Missing | 483 | 133 | |
| Missing | 483 (49.8%) | 133 (70.7%) | |
| Plasma PFOA (ng/mL) | |||
| Mean (SD) | 1.59 (2.24) | 1.70 (1.14) | 0.311 |
| Median [MIN, MAX] | 1.20 [0.0700, 52.8] | 1.50 [0.100, 9.50] | |
| IQR [25%ile, 75%ile] | 1.10 [0.700, 1.80] | 1.13 [1.00, 2.13] | |
| N Missing | 0 | 0 | |
| Plasma PFOS (ng/mL) | |||
| Mean (SD) | 4.20 (5.21) | 5.73 (7.26) | 0.0065 |
| Median [MIN, MAX] | 2.80 [0.0700, 63.1] | 3.60 [0.100, 59.9] | |
| IQR [25%ile, 75%ile] | 3.10 [1.60, 4.70] | 3.25 [2.18, 5.43] | |
| N Missing | 0 | 0 | |
| Plasma PFNA (ng/mL) | |||
| Mean (SD) | 0.523 (0.526) | 0.630 (0.492) | 0.00735 |
| Median [MIN, MAX] | 0.400 [0.0700, 6.50] | 0.500 [0.0700, 3.00] | |
| IQR [25%ile, 75%ile] | 0.400 [0.200, 0.600] | 0.500 [0.300, 0.800] | |
| N Missing | 0 | 0 | |
| Plasma PFHxS (ng/mL) | |||
| Mean (SD) | 1.61 (2.78) | 1.55 (1.63) | 0.709 |
| Median [MIN, MAX] | 1.00 [0.0700, 48.8] | 1.10 [0.100, 17.0] | |
| IQR [25%ile, 75%ile] | 1.20 [0.500, 1.70] | 1.10 [0.800, 1.90] | |
| N Missing | 0 | 0 |
The implied adjustment set (if this DAG is true):
ggdag_adjustment_set(bp_dag_new2) + theme_classic() + geom_dag_text(col = "Orange")
# Create an analytic set
# Previously, we had edu_cat, hi_bp, and hi_pfas as labelled variables to make the tables nices
# For ease of analysis, I'm going to transform them back to numbers
analytic_set <-
merged_final_clean %>%
#filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>%
filter(!is.na(hi_bp) & !is.na(hi_pfas)) %>%
filter(age >= 18 & age <= 65) %>% # Restricting our analytic set as we decided above
mutate(edu_cat = as.numeric(edu_cat)-1,
hi_bp = as.numeric(hi_bp)-1,
hi_pfas = as.numeric(hi_pfas)-1)
Last time we showed mean difference in systolic blood pressures by 1 unit (ng/mL) change in PFOS:
mdl <- lm(data = analytic_set, formula = BPXSY1 ~ LBXNFOS)
summary(mdl)
##
## Call:
## lm(formula = BPXSY1 ~ LBXNFOS, data = analytic_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.566 -11.153 -2.349 8.651 95.652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.74715 0.62407 191.880 < 2e-16 ***
## LBXNFOS 0.40034 0.08712 4.595 4.8e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.64 on 1155 degrees of freedom
## Multiple R-squared: 0.01795, Adjusted R-squared: 0.0171
## F-statistic: 21.11 on 1 and 1155 DF, p-value: 4.802e-06
beta = round(summary(mdl)$coefficients[2,1], 2)
ci_lb = round(summary(mdl)$coefficients[2,1] - 1.96*summary(mdl)$coefficients[2,2], 2)
ci_ub = round(summary(mdl)$coefficients[2,1] + 1.96*summary(mdl)$coefficients[2,2], 2)
# We also showed the diagnostics for this regression
# mdl %>% ggplot() + geom_point(aes(x = .fitted, y = .resid)) + geom_hline(yintercept = 0) # one way
# ggResidpanel::resid_panel(mdl) # more comprehensive way
We observe a 0.4 mmHg higher in systolic blood pressure per 1 ng/mL higher PFAS exposure (beta = 0.4, 95% CI: 0.23, 0.57).
We can also look at exposure in a different way, rather than linear changes per unit, we can look at changes in orders of magnitude (many biological systems work in this way). We can do this quickly by specifying this in the model, here we specify log2 which conveniently gives us the effect size from a doubling of the measure:
mdl2 <- lm(data = analytic_set, formula = BPXSY1 ~ log2(LBXNFOS))
summary(mdl2)
##
## Call:
## lm(formula = BPXSY1 ~ log2(LBXNFOS), data = analytic_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.666 -10.974 -2.393 8.473 96.284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.6089 0.7600 156.056 < 2e-16 ***
## log2(LBXNFOS) 1.8918 0.3775 5.012 6.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.61 on 1155 degrees of freedom
## Multiple R-squared: 0.02129, Adjusted R-squared: 0.02044
## F-statistic: 25.12 on 1 and 1155 DF, p-value: 6.228e-07
beta2 = round(summary(mdl2)$coefficients[2,1], 2)
ci_lb2 = round(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2], 2)
ci_ub2 = round(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2], 2)
We observe a 1.89 mmHg higher in systolic blood pressure for each doubling of PFAS exposure (beta = 1.89, 95% CI: 1.15, 2.63).
We can also look at Risk Ratios:
mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "log"))
summary(mdl3)
##
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "log"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.13449 0.11661 -18.305 < 2e-16 ***
## log2(LBXNFOS) 0.18681 0.04969 3.759 0.00017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.9 on 1156 degrees of freedom
## Residual deviance: 1013.6 on 1155 degrees of freedom
## AIC: 1017.6
##
## Number of Fisher Scoring iterations: 6
rr3 = round(exp(summary(mdl2)$coefficients[2,1]), 2)
ci_lb3 = round(exp(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2]), 2)
ci_ub3 = round(exp(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2]), 2)
Each doubling of plasma PFAS exposure is associated with 6.63 as high risk of high blood pressure (beta = 6.63, 95% CI: 3.16, 13.9), in U.S. adults between the ages of 18-65.
…or Odds Ratios:
mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit"))
summary(mdl3)
##
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.01585 0.13754 -14.656 < 2e-16 ***
## log2(LBXNFOS) 0.22539 0.06245 3.609 0.000307 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.9 on 1156 degrees of freedom
## Residual deviance: 1013.7 on 1155 degrees of freedom
## AIC: 1017.7
##
## Number of Fisher Scoring iterations: 4
rr3 = round(exp(summary(mdl2)$coefficients[2,1]), 2)
ci_lb3 = round(exp(summary(mdl2)$coefficients[2,1] - 1.96*summary(mdl2)$coefficients[2,2]), 2)
ci_ub3 = round(exp(summary(mdl2)$coefficients[2,1] + 1.96*summary(mdl2)$coefficients[2,2]), 2)
Each doubling of plasma PFAS exposure is associated with 6.63 as high odds of high blood pressure (beta = 6.63, 95% CI: 3.16, 13.9), in U.S. adults between the ages of 18-65.
These regression results are not magical, and this can be seen by comparing the associations between a binary exposure (hi_pfas) and binary outcome (hi_bp) computed by hand and by regression.
Let’s see the original 2x2 table: Rows = Outcome (0 = normal BP, 1 = high BP) Columns = Exposure (0 = normal PFAS, 1 = high PFAS)
org_2x2 <- table(analytic_set$hi_bp, analytic_set$hi_pfas)
org_2x2
##
## 0 1
## 0 524 445
## 1 78 110
A risk difference can be calculated as (risk amongst exposed) - (risk amongst unexposed).
(110/(110+445)) - (78/(78+524))
What is the regression-based estimate:
lm(data = analytic_set, formula = hi_bp ~ hi_pfas)
Why is this?
## `geom_smooth()` using formula = 'y ~ x'
How about the _risk ratio__ (risk amongst exposed) / (risk amongst unexposed):
(110/(110+445)) / (78/(78+524))
rr_mdl <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "log"))
exp(summary(rr_mdl)$coefficients[2,1])
And the odds ratio (ODDS amongst exposed) / (ODDS amongst unexposed):
(110/(445)) / (78/(524)), or to re-arrange
(110 * 524) / (78 * 445)
or_mdl <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"))
exp(summary(or_mdl)$coefficients[2,1])
For continuous exposures, the model fit is just repeated across the range of X values!
We know for our final model we need to adjust for the confounder educational status. What does adjustment actually do? How does it work?
Let’s go back to binary odds ratios:
mdl4 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"))
summary(mdl4)
##
## Call:
## glm(formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9048 0.1214 -15.695 < 2e-16 ***
## hi_pfas 0.5072 0.1615 3.141 0.00168 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.9 on 1156 degrees of freedom
## Residual deviance: 1016.9 on 1155 degrees of freedom
## AIC: 1020.9
##
## Number of Fisher Scoring iterations: 4
Having higher than median plasma PFOS is associated with a 1.66 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65.
Next, we adjust for educational category (as a categorical variable, meaning each value has a different estimate):
mdl5 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit"))
summary(mdl5)
##
## Call:
## glm(formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5404 0.1843 -8.358 < 2e-16 ***
## hi_pfas 0.5245 0.1624 3.230 0.00124 **
## as.factor(edu_cat)1 -0.4503 0.1951 -2.308 0.02098 *
## as.factor(edu_cat)2 -0.5486 0.2419 -2.268 0.02334 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.5 on 1155 degrees of freedom
## Residual deviance: 1010.0 on 1152 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 1018
##
## Number of Fisher Scoring iterations: 4
Having higher than median plasma PFOS is associated with a 1.69 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65, adjusted for differences in distribution of education category AND assuming that the odds ratios are the same between educational categories.
How do we arrive at this estimate? Basically, we take the estimated association within each level of the confounder and then average the results in a particular way. The MH-stratified adjustment shows us this more clearly:
set_edu0 = table(filter(analytic_set, edu_cat == 0)$hi_bp, filter(analytic_set, edu_cat == 0)$hi_pfas)
set_edu1 = table(filter(analytic_set, edu_cat == 1)$hi_bp, filter(analytic_set, edu_cat == 1)$hi_pfas)
set_edu2 = table(filter(analytic_set, edu_cat == 2)$hi_bp, filter(analytic_set, edu_cat == 2)$hi_pfas)
Stratum-specific odds ratio for Less than High School (edu_cat = 0):
set_edu0
##
## 0 1
## 0 96 80
## 1 25 24
Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)
oddsratio.wald(set_edu0, rev = "both")
OR = [exposed cases (24) * unexposed controls (95)] / [unexposed cases (25) / exposed controls (81)]
There is less data in this stratum, so estimates within this group are less stable, which has implications for the overall adjustment!
Stratum-specific odds ratio for Less than High School (edu_cat = 1):
set_edu1
##
## 0 1
## 0 325 248
## 1 40 62
Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)
oddsratio.wald(set_edu1, rev = c("both"))
Compute this OR.
Stratum-specific odds ratio for Less than High School (edu_cat = 2):
set_edu2
##
## 0 1
## 0 103 116
## 1 13 24
Columns = Exposure (1 = high PFAS, 0 = normal PFAS)
Rows = Outcome (1 = high BP, 0 = normal BP)
oddsratio.wald(set_edu2, rev = c("both"))
## $data
##
## 1 0 Total
## 1 24 13 37
## 0 116 103 219
## Total 140 116 256
##
## $measure
## odds ratio with 95% C.I.
## estimate lower upper
## 1 1.000000 NA NA
## 0 1.639257 0.7937441 3.385429
##
## $p.value
## two-sided
## midp.exact fisher.exact chi.square
## 1 NA NA NA
## 0 0.184154 0.2128106 0.1787665
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
Compute this OR.
Compute the MH Adjusted Estimate by Hand:
a = exposed cases b = exposed controls c = un-exposed cases d = un-exposed controls
Check your work:
strat_set <-array(c(set_edu0, set_edu1, set_edu2), dim = c(2, 2, 3))
epi.2by2(strat_set)
Recall the logistic regression estimate from the previous model:
OR = 1.69 [95% CI: 1.23, 2.32]
Adjustment takes the estimates within level of the confounder and then lets each stratum contribute to the overall estimate proportional to the amount of data (observations) in each of the stratum, assuming that there is one common odds ratio.
Let us assume that there are different effects for people with different educational attainment. We can represent that with an interaction term, and therefore there is not one common OR:
mdl6 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas * as.factor(edu_cat), family = binomial(link = "logit"))
summary(mdl6)
##
## Call:
## glm(formula = hi_bp ~ hi_pfas * as.factor(edu_cat), family = binomial(link = "logit"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3455 0.2245 -5.992 2.07e-09 ***
## hi_pfas 0.1415 0.3234 0.438 0.66172
## as.factor(edu_cat)1 -0.7495 0.2802 -2.675 0.00747 **
## as.factor(edu_cat)2 -0.7243 0.3702 -1.957 0.05040 .
## hi_pfas:as.factor(edu_cat)1 0.5672 0.3909 1.451 0.14683
## hi_pfas:as.factor(edu_cat)2 0.3527 0.4914 0.718 0.47288
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.5 on 1155 degrees of freedom
## Residual deviance: 1007.9 on 1150 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 1019.9
##
## Number of Fisher Scoring iterations: 4
OR in edu_cat == 0: 1.152
OR in edu_cat == 1: 2.03125
OR in edu_cat == 2: 1.6392573