Week 12 - Review and Some More Core Statistical Concepts

Assignment Schedule and Due Dates

Updated assignments and due dates:

ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan

NOV 10 (Fri): DRAFT of analytic plan due

NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.

NOV 24 (Fri): DRAFT of analyses with code due

DEC 08 (Fri): Final written report

Post your written reports to Laulima by the end of the due date

Overview

Today:

Review Topics - Questions?
Tabulating and Plotting Regression Results
Table 2 Fallacy
Regression and Statistical Inference

Review Topics

Role of the quantitative epidemiologist
Good causal questions (vs descriptive & predictive)
Importing, merging, cleaning data
Tabulating & visualizing data
Quantitative causal inference - Causal assumptions
Making and using DAGs
Measures of association and regression
Adjustment by stratification and regression
Confounding and effect modification
Disparities and poorly-defined exposures (“race”)
Missing data - weighting and imputation

Questions about these topics (including “why is this important / why should we care?”)?

Any you would like more practice, examples, coding for?

Tabulating Regression Results

Let’s say that we’ve gone through all the steps above and we are now happy with our causal question(s), dataset, analytic models and estimand(s), and statistical models. That is, we are ready to fit our final regression models and report our results.

Specifically, we want to report the effect of a doubling of PFAS on systolic blood pressure, adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.

Table for a single regression model

First, let’s look at a common way to present regression results for the n-PFOA results using the gtsummary package:

analytic_set <- readxl::read_xlsx(paste0(alt_path, "analytic_set.xlsx"))

analytic_set %>%
  lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>%
  gtsummary::tbl_regression()

Characteristic	Beta	95% CI¹	p-value
log2(LBXNFOA)	1.1	0.10, 2.0	0.030
RIAGENDR
Female	—	—
Male	3.7	1.8, 5.5	<0.001
age	0.45	0.38, 0.51	<0.001
as.factor(RIDRETH3)
Mexican American	—	—
Non-Hispanic Asian	-2.1	-5.5, 1.4	0.2
Non-Hispanic Black	3.3	0.30, 6.4	0.031
Non-Hispanic White	-4.8	-7.6, -1.9	0.001
Other Hispanic	-4.7	-8.4, -0.90	0.015
Other Race - Including Multi-Racial	0.01	-4.5, 4.5	>0.9
as.factor(edu_cat)
0	—	—
1	-0.76	-3.3, 1.8	0.6
2	-3.6	-6.9, -0.25	0.035
inc_ratio	-0.20	-0.83, 0.44	0.5
¹ CI = Confidence Interval

While this is not wrong per-se as a way for you to look or present your regression results. There is a problem with trying to interpret the coefficients for any variables that is not your main exposure (i.e. any of your confounders) based on this regression model.

Why?

HINT: You based your adjustment model on your DAG for your main exposure, not your confounders!

DAG for main regression

ggdag::dagify(BP ~ PFAS + AGE + EDU + ETHNICITY + INCOME, 
              PFAS ~ EDU + AGE + ETHNICITY + INCOME, 
              EDU ~ AGE + ETHNICITY, 
              INCOME ~ AGE + ETHNICITY, 
              exposure = "PFAS", outcome = "BP") %>% 
  ggdag_adjustment_set() + theme_classic() + geom_dag_text(col = "black")

DAG for covariate?

Our covariate, even if it qualifies as a causal exposure will not have the same DAG or confounders as the main exposure! Intepreting the coefficient for the covariate suggests that we ought to adjuste for the same set of confounders.

However, in a proper DAG for this covariate these variables might be mediators or colliders!

ggdag::dagify(BP ~ PFAS + EDU + ETHNICITY + INCOME + MARRIED, 
              PFAS ~ EDU + AGE + ETHNICITY + INCOME + MARRIED, 
              EDU ~ AGE + ETHNICITY + MARRIED, 
              INCOME ~ AGE + ETHNICITY + MARRIED, 
              exposure = "AGE", outcome = "BP") %>% 
  ggdag_dseparated(controlling_for = c("PFAS", "EDU", "ETHNICITY", "INCOME")) + theme_classic() + geom_dag_text(col = "black")

Table 2 Fallacy

Intepreting the coefficients in a regression table for anything other than the main exposure (for which you drew your DAG) as a meaningful quantity has been termed “Table 2 Fallacy”.

In a study of 17,278,392 adults (!), researchers did just this, producing a large regression model throwing all manner of potential exposures and covariates into a single model to predict COVID-19 related deaths. They then presented these “mutually adjusted” results in a single plot, representing the individual coefficients from this model.

They found that current smoking and high blood pressure were associated with a REDUCED RISK of COVID death!

Williamson, et al. Nature. 2020.

Despite others pointing out the problems with adjusting for consequences and colliders of smoking and death in commentaries and elsewhere, this study was never corrected or retracted!

Summary Results Tables and Figures

Let’s do it better by presenting only the estimates for our main exposures. Since there are 5 PFASs that have been measured, we want to report 5 estimands.

Flextable

We can put these results together into a single table using the very appropriately named flextable package:

## RUN AND SAVE REGRESSION MODEL ESTIMATES
nfoa_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

bfoa_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXBFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

nfos_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXNFOS) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

pfhs_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXPFHS) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

pfna_res <- analytic_set %>% 
  lm(BPXSY1 ~ log2(LBXPFNA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()

## SET SOME FORMATTING OPTIONS FOR FLEXTABLE
flextable::set_flextable_defaults(digits = 2)

## MERGE THE MODEL RESULTS, SELECT ONLY THE MAIN EXPOSURES, AND DO SOME REFORMATTING
clean_table <- 
  add_row(nfoa_res, bfoa_res) %>% add_row(nfos_res) %>% add_row(pfhs_res) %>% add_row(pfna_res) %>% 
  filter(grepl("log2", term)) %>%
  mutate(LB = estimate - 1.96*std.error, UB = estimate + 1.96*std.error) %>% 
  mutate(term = case_when(term == "log2(LBXNFOA)" ~ "n-PFOA",
                          term == "log2(LBXBFOA)" ~ "branched-PFOA",
                          term == "log2(LBXNFOS)" ~ "n-PFOS",
                          term == "log2(LBXPFHS)" ~ "PFHxS",
                          term == "log2(LBXPFNA)" ~ "PFNA")) %>% 
  mutate(CI = paste0("(", round(LB,2), ", ", round(UB,2), ")")) %>% 
  select(PFAS = term, estimate, LB, UB, "95% CI" = CI, "p-value" = p.value) 

## FORMAT THE FLEXTABLE FROM THE CLEANED RESULTS
results_table <- 
  clean_table %>% select(-LB, -UB) %>% # DON'T NEED THESE IN THE FINAL TABLE
  flextable::flextable() %>% 
  flextable::colformat_double() %>% 
  # flextable::color(~ estimate < 0, color = "green", ~ estimate) %>% 
  # flextable::color(~ estimate > 0, color = "red", ~ estimate) %>% 
  # flextable::bold(~ `p-value` < 0.05) %>%
  flextable::footnote(i = 1, j = 1:2,
            value = as_paragraph(
              c("PFAS measured in plasma",
                "Change in systolic blood pressure in mmHg per doubled PFAS. Adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.") ),
            ref_symbols = c("a", "b"),
            part = "header") %>% 
  flextable::autofit()
        
results_table

PFASa	estimateb	95% CI	p-value
n-PFOA	1.05	(0.1, 2)	0.03
branched-PFOA	0.65	(-1.65, 2.95)	0.58
n-PFOS	-0.19	(-1, 0.62)	0.64
PFHxS	0.66	(-0.13, 1.45)	0.10
PFNA	0.41	(-0.41, 1.23)	0.32
aPFAS measured in plasma
bChange in systolic blood pressure in mmHg per doubled PFAS. Adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.

We can also save it to a docx (or other file format) and throw in that first table of just the n-PFOA results at the same time:

table2 <- 
  analytic_set %>%
  lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>%
  gtsummary::tbl_regression() %>% gtsummary::as_flex_table()

flextable::save_as_docx("Table 2 - n-PFOA results" = table2, "Table 3 - All PFAS results" = results_table, path = "results_tables.docx")

Like many R functions, flextable is almost infinitely customizable, but of course if you want to export it to a document first to format it, that’s also possible!

Forest plot

clean_table %>% ggplot() + 
  geom_point(aes(x = PFAS, y = estimate)) +
  geom_errorbar(aes(x = PFAS, ymin = LB, ymax = UB)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Plasma PFAS", y = "Increase in systolic BP, per doubled PFAS (mmHg)") +
  coord_flip() + theme_bw()

Regression and Statistical Inference

We will go over the concepts and play with the visualization on how regression works from Kristoffer Magnusson’s wonderful website.