Assignment Schedule and Due Dates

Updated assignments and due dates:

  • ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan
  • NOV 10 (Fri): DRAFT of analytic plan due
  • NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.
  • NOV 24 (Fri): DRAFT of analyses with code due
  • DEC 08 (Fri): Final written report
    • Post your written reports to Laulima by the end of the due date

Overview

Today:

Review Topics - Questions?
Tabulating and Plotting Regression Results
Table 2 Fallacy
Regression and Statistical Inference


Review Topics

  1. Role of the quantitative epidemiologist
  2. Good causal questions (vs descriptive & predictive)
  3. Importing, merging, cleaning data
  4. Tabulating & visualizing data
  5. Quantitative causal inference - Causal assumptions
  6. Making and using DAGs
  7. Measures of association and regression
  8. Adjustment by stratification and regression
  9. Confounding and effect modification
  10. Disparities and poorly-defined exposures (“race”)
  11. Missing data - weighting and imputation

Questions about these topics (including “why is this important / why should we care?”)?

Any you would like more practice, examples, coding for?


Tabulating Regression Results

Let’s say that we’ve gone through all the steps above and we are now happy with our causal question(s), dataset, analytic models and estimand(s), and statistical models. That is, we are ready to fit our final regression models and report our results.

Specifically, we want to report the effect of a doubling of PFAS on systolic blood pressure, adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.

Table for a single regression model

First, let’s look at a common way to present regression results for the n-PFOA results using the gtsummary package:

analytic_set <- readxl::read_xlsx(paste0(alt_path, "analytic_set.xlsx"))

analytic_set %>%
  lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>%
  gtsummary::tbl_regression()
Characteristic Beta 95% CI1 p-value
log2(LBXNFOA) 1.1 0.10, 2.0 0.030
RIAGENDR


    Female
    Male 3.7 1.8, 5.5 <0.001
age 0.45 0.38, 0.51 <0.001
as.factor(RIDRETH3)


    Mexican American
    Non-Hispanic Asian -2.1 -5.5, 1.4 0.2
    Non-Hispanic Black 3.3 0.30, 6.4 0.031
    Non-Hispanic White -4.8 -7.6, -1.9 0.001
    Other Hispanic -4.7 -8.4, -0.90 0.015
    Other Race - Including Multi-Racial 0.01 -4.5, 4.5 >0.9
as.factor(edu_cat)


    0
    1 -0.76 -3.3, 1.8 0.6
    2 -3.6 -6.9, -0.25 0.035
inc_ratio -0.20 -0.83, 0.44 0.5
1 CI = Confidence Interval

While this is not wrong per-se as a way for you to look or present your regression results. There is a problem with trying to interpret the coefficients for any variables that is not your main exposure (i.e. any of your confounders) based on this regression model.

Why?

HINT: You based your adjustment model on your DAG for your main exposure, not your confounders!


DAG for main regression

ggdag::dagify(BP ~ PFAS + AGE + EDU + ETHNICITY + INCOME, 
              PFAS ~ EDU + AGE + ETHNICITY + INCOME, 
              EDU ~ AGE + ETHNICITY, 
              INCOME ~ AGE + ETHNICITY, 
              exposure = "PFAS", outcome = "BP") %>% 
  ggdag_adjustment_set() + theme_classic() + geom_dag_text(col = "black")


DAG for covariate?

Our covariate, even if it qualifies as a causal exposure will not have the same DAG or confounders as the main exposure! Intepreting the coefficient for the covariate suggests that we ought to adjuste for the same set of confounders.

However, in a proper DAG for this covariate these variables might be mediators or colliders!

ggdag::dagify(BP ~ PFAS + EDU + ETHNICITY + INCOME + MARRIED, 
              PFAS ~ EDU + AGE + ETHNICITY + INCOME + MARRIED, 
              EDU ~ AGE + ETHNICITY + MARRIED, 
              INCOME ~ AGE + ETHNICITY + MARRIED, 
              exposure = "AGE", outcome = "BP") %>% 
  ggdag_dseparated(controlling_for = c("PFAS", "EDU", "ETHNICITY", "INCOME")) + theme_classic() + geom_dag_text(col = "black")


Table 2 Fallacy

Intepreting the coefficients in a regression table for anything other than the main exposure (for which you drew your DAG) as a meaningful quantity has been termed “Table 2 Fallacy”.

In a study of 17,278,392 adults (!), researchers did just this, producing a large regression model throwing all manner of potential exposures and covariates into a single model to predict COVID-19 related deaths. They then presented these “mutually adjusted” results in a single plot, representing the individual coefficients from this model.

They found that current smoking and high blood pressure were associated with a REDUCED RISK of COVID death!

*Williamson, et al. Nature. 2020.*

Williamson, et al. Nature. 2020.

Despite others pointing out the problems with adjusting for consequences and colliders of smoking and death in commentaries and elsewhere, this study was never corrected or retracted!


Summary Results Tables and Figures

Let’s do it better by presenting only the estimates for our main exposures. Since there are 5 PFASs that have been measured, we want to report 5 estimands.

Flextable

We can put these results together into a single table using the very appropriately named flextable package:

## RUN AND SAVE REGRESSION MODEL ESTIMATES
nfoa_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

bfoa_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXBFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

nfos_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXNFOS) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

pfhs_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXPFHS) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

pfna_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXPFNA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

## SET SOME FORMATTING OPTIONS FOR FLEXTABLE
flextable::set_flextable_defaults(digits = 2)

## MERGE THE MODEL RESULTS, SELECT ONLY THE MAIN EXPOSURES, AND DO SOME REFORMATTING
clean_table <- 
  add_row(nfoa_res, bfoa_res) %>% add_row(nfos_res) %>% add_row(pfhs_res) %>% add_row(pfna_res) %>% 
  filter(grepl("log2", term)) %>%
  mutate(LB = estimate - 1.96*std.error, UB = estimate + 1.96*std.error) %>% 
  mutate(term = case_when(term == "log2(LBXNFOA)" ~ "n-PFOA",
                          term == "log2(LBXBFOA)" ~ "branched-PFOA",
                          term == "log2(LBXNFOS)" ~ "n-PFOS",
                          term == "log2(LBXPFHS)" ~ "PFHxS",
                          term == "log2(LBXPFNA)" ~ "PFNA")) %>% 
  mutate(CI = paste0("(", round(LB,2), ", ", round(UB,2), ")")) %>% 
  select(PFAS = term, estimate, LB, UB, "95% CI" = CI, "p-value" = p.value) 

## FORMAT THE FLEXTABLE FROM THE CLEANED RESULTS
results_table <- 
  clean_table %>% select(-LB, -UB) %>% # DON'T NEED THESE IN THE FINAL TABLE
  flextable::flextable() %>% 
  flextable::colformat_double() %>% 
  # flextable::color(~ estimate < 0, color = "green", ~ estimate) %>% 
  # flextable::color(~ estimate > 0, color = "red", ~ estimate) %>% 
  # flextable::bold(~ `p-value` < 0.05) %>%
  flextable::footnote(i = 1, j = 1:2,
            value = as_paragraph(
              c("PFAS measured in plasma",
                "Change in systolic blood pressure in mmHg per doubled PFAS. Adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.") ),
            ref_symbols = c("a", "b"),
            part = "header") %>% 
  flextable::autofit()
        
results_table

PFASa

estimateb

95% CI

p-value

n-PFOA

1.05

(0.1, 2)

0.03

branched-PFOA

0.65

(-1.65, 2.95)

0.58

n-PFOS

-0.19

(-1, 0.62)

0.64

PFHxS

0.66

(-0.13, 1.45)

0.10

PFNA

0.41

(-0.41, 1.23)

0.32

aPFAS measured in plasma

bChange in systolic blood pressure in mmHg per doubled PFAS. Adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.

We can also save it to a docx (or other file format) and throw in that first table of just the n-PFOA results at the same time:

table2 <- 
  analytic_set %>%
  lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>%
  gtsummary::tbl_regression() %>% gtsummary::as_flex_table()

flextable::save_as_docx("Table 2 - n-PFOA results" = table2, "Table 3 - All PFAS results" = results_table, path = "results_tables.docx")

Like many R functions, flextable is almost infinitely customizable, but of course if you want to export it to a document first to format it, that’s also possible!

Forest plot

clean_table %>% ggplot() + 
  geom_point(aes(x = PFAS, y = estimate)) +
  geom_errorbar(aes(x = PFAS, ymin = LB, ymax = UB)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Plasma PFAS", y = "Increase in systolic BP, per doubled PFAS (mmHg)") +
  coord_flip() + theme_bw()


Regression and Statistical Inference

We will go over the concepts and play with the visualization on how regression works from Kristoffer Magnusson’s wonderful website.

*Williamson, et al. Nature. 2020.*

Williamson, et al. Nature. 2020.

We will also discuss the meaning and failings of p-values through this visualization linking mean, test-statistics, and p-values.