Updated assignments and due dates:
- ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan
- NOV 10 (Fri): DRAFT of analytic plan due
- NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.
- NOV 24 (Fri): DRAFT of analyses with code due
- DEC 08 (Fri): Final written report
- Post your written reports to Laulima by the end of the due date
Today:
Review Topics - Questions?
Tabulating and Plotting Regression Results
Table 2 Fallacy
Regression and Statistical Inference
Questions about these topics (including “why is this important / why should we care?”)?
Any you would like more practice, examples, coding for?
Let’s say that we’ve gone through all the steps above and we are now happy with our causal question(s), dataset, analytic models and estimand(s), and statistical models. That is, we are ready to fit our final regression models and report our results.
Specifically, we want to report the effect of a doubling of PFAS on systolic blood pressure, adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.
First, let’s look at a common way to present regression results for the n-PFOA results using the gtsummary package:
analytic_set <- readxl::read_xlsx(paste0(alt_path, "analytic_set.xlsx"))
analytic_set %>%
lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>%
gtsummary::tbl_regression()
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| log2(LBXNFOA) | 1.1 | 0.10, 2.0 | 0.030 |
| RIAGENDR | |||
| Female | — | — | |
| Male | 3.7 | 1.8, 5.5 | <0.001 |
| age | 0.45 | 0.38, 0.51 | <0.001 |
| as.factor(RIDRETH3) | |||
| Mexican American | — | — | |
| Non-Hispanic Asian | -2.1 | -5.5, 1.4 | 0.2 |
| Non-Hispanic Black | 3.3 | 0.30, 6.4 | 0.031 |
| Non-Hispanic White | -4.8 | -7.6, -1.9 | 0.001 |
| Other Hispanic | -4.7 | -8.4, -0.90 | 0.015 |
| Other Race - Including Multi-Racial | 0.01 | -4.5, 4.5 | >0.9 |
| as.factor(edu_cat) | |||
| 0 | — | — | |
| 1 | -0.76 | -3.3, 1.8 | 0.6 |
| 2 | -3.6 | -6.9, -0.25 | 0.035 |
| inc_ratio | -0.20 | -0.83, 0.44 | 0.5 |
| 1 CI = Confidence Interval | |||
While this is not wrong per-se as a way for you to look or present your regression results. There is a problem with trying to interpret the coefficients for any variables that is not your main exposure (i.e. any of your confounders) based on this regression model.
Why?
HINT: You based your adjustment model on your DAG for your main exposure, not your confounders!
ggdag::dagify(BP ~ PFAS + AGE + EDU + ETHNICITY + INCOME,
PFAS ~ EDU + AGE + ETHNICITY + INCOME,
EDU ~ AGE + ETHNICITY,
INCOME ~ AGE + ETHNICITY,
exposure = "PFAS", outcome = "BP") %>%
ggdag_adjustment_set() + theme_classic() + geom_dag_text(col = "black")
Our covariate, even if it qualifies as a causal exposure will not have the same DAG or confounders as the main exposure! Intepreting the coefficient for the covariate suggests that we ought to adjuste for the same set of confounders.
However, in a proper DAG for this covariate these variables might be mediators or colliders!
ggdag::dagify(BP ~ PFAS + EDU + ETHNICITY + INCOME + MARRIED,
PFAS ~ EDU + AGE + ETHNICITY + INCOME + MARRIED,
EDU ~ AGE + ETHNICITY + MARRIED,
INCOME ~ AGE + ETHNICITY + MARRIED,
exposure = "AGE", outcome = "BP") %>%
ggdag_dseparated(controlling_for = c("PFAS", "EDU", "ETHNICITY", "INCOME")) + theme_classic() + geom_dag_text(col = "black")
Intepreting the coefficients in a regression table for anything other than the main exposure (for which you drew your DAG) as a meaningful quantity has been termed “Table 2 Fallacy”.
In a study of 17,278,392 adults (!), researchers did just this, producing a large regression model throwing all manner of potential exposures and covariates into a single model to predict COVID-19 related deaths. They then presented these “mutually adjusted” results in a single plot, representing the individual coefficients from this model.
They found that current smoking and high blood pressure were associated with a REDUCED RISK of COVID death!
Williamson, et al. Nature. 2020.
Despite others pointing out the problems with adjusting for consequences and colliders of smoking and death in commentaries and elsewhere, this study was never corrected or retracted!
Let’s do it better by presenting only the estimates for our main exposures. Since there are 5 PFASs that have been measured, we want to report 5 estimands.
We can put these results together into a single table using the very appropriately named flextable package:
## RUN AND SAVE REGRESSION MODEL ESTIMATES
nfoa_res <- analytic_set %>%
lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()
bfoa_res <- analytic_set %>%
lm(BPXSY1 ~ log2(LBXBFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()
nfos_res <- analytic_set %>%
lm(BPXSY1 ~ log2(LBXNFOS) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()
pfhs_res <- analytic_set %>%
lm(BPXSY1 ~ log2(LBXPFHS) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()
pfna_res <- analytic_set %>%
lm(BPXSY1 ~ log2(LBXPFNA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>% broom::tidy()
## SET SOME FORMATTING OPTIONS FOR FLEXTABLE
flextable::set_flextable_defaults(digits = 2)
## MERGE THE MODEL RESULTS, SELECT ONLY THE MAIN EXPOSURES, AND DO SOME REFORMATTING
clean_table <-
add_row(nfoa_res, bfoa_res) %>% add_row(nfos_res) %>% add_row(pfhs_res) %>% add_row(pfna_res) %>%
filter(grepl("log2", term)) %>%
mutate(LB = estimate - 1.96*std.error, UB = estimate + 1.96*std.error) %>%
mutate(term = case_when(term == "log2(LBXNFOA)" ~ "n-PFOA",
term == "log2(LBXBFOA)" ~ "branched-PFOA",
term == "log2(LBXNFOS)" ~ "n-PFOS",
term == "log2(LBXPFHS)" ~ "PFHxS",
term == "log2(LBXPFNA)" ~ "PFNA")) %>%
mutate(CI = paste0("(", round(LB,2), ", ", round(UB,2), ")")) %>%
select(PFAS = term, estimate, LB, UB, "95% CI" = CI, "p-value" = p.value)
## FORMAT THE FLEXTABLE FROM THE CLEANED RESULTS
results_table <-
clean_table %>% select(-LB, -UB) %>% # DON'T NEED THESE IN THE FINAL TABLE
flextable::flextable() %>%
flextable::colformat_double() %>%
# flextable::color(~ estimate < 0, color = "green", ~ estimate) %>%
# flextable::color(~ estimate > 0, color = "red", ~ estimate) %>%
# flextable::bold(~ `p-value` < 0.05) %>%
flextable::footnote(i = 1, j = 1:2,
value = as_paragraph(
c("PFAS measured in plasma",
"Change in systolic blood pressure in mmHg per doubled PFAS. Adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio.") ),
ref_symbols = c("a", "b"),
part = "header") %>%
flextable::autofit()
results_table
PFASa | estimateb | 95% CI | p-value |
|---|---|---|---|
n-PFOA | 1.05 | (0.1, 2) | 0.03 |
branched-PFOA | 0.65 | (-1.65, 2.95) | 0.58 |
n-PFOS | -0.19 | (-1, 0.62) | 0.64 |
PFHxS | 0.66 | (-0.13, 1.45) | 0.10 |
PFNA | 0.41 | (-0.41, 1.23) | 0.32 |
aPFAS measured in plasma | |||
bChange in systolic blood pressure in mmHg per doubled PFAS. Adjusted for sex, age, self-reported ethnicity, education category, and income-to-poverty ratio. | |||
We can also save it to a docx (or other file format) and throw in that first table of just the n-PFOA results at the same time:
table2 <-
analytic_set %>%
lm(BPXSY1 ~ log2(LBXNFOA) + RIAGENDR + age + as.factor(RIDRETH3) + as.factor(edu_cat) + inc_ratio, data = .) %>%
gtsummary::tbl_regression() %>% gtsummary::as_flex_table()
flextable::save_as_docx("Table 2 - n-PFOA results" = table2, "Table 3 - All PFAS results" = results_table, path = "results_tables.docx")
Like many R functions, flextable is almost infinitely customizable, but of course if you want to export it to a document first to format it, that’s also possible!
clean_table %>% ggplot() +
geom_point(aes(x = PFAS, y = estimate)) +
geom_errorbar(aes(x = PFAS, ymin = LB, ymax = UB)) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Plasma PFAS", y = "Increase in systolic BP, per doubled PFAS (mmHg)") +
coord_flip() + theme_bw()
We will go over the concepts and play with the visualization on how regression works from Kristoffer Magnusson’s wonderful website.
Williamson, et al. Nature. 2020.
We will also discuss the meaning and failings of p-values through this visualization linking mean, test-statistics, and p-values.