Week 8 - Analytic & Statistical Model Design

Overview

Today we will finalize our DAG, consider some solutions to unmeasured confounding, and fit models implied by our DAG.

We will emphasize intuitions about what adjustments do, including by stratification and regression, and discuss how to interpret coefficients.

If we have time, we will extend this to effect measure modification parameters.

Assignment Schedule and Due Dates

Updated assignments and due dates:

ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan

OCT 27 (Fri): DRAFT of analytic plan due

NOV 8, 15, 22: In lieu of analytic plan presentation, workshop (analyses, code)

NOV 17 (Fri): DRAFT of analyses with code due

DEC 08 (Fri): Final written report

Post your written reports to Laulima by the end of the due date

PFAS and adult blood pressure analysis

Review public health context

Exposure to “forever chemicals” (per- and poly-fluoroalkyl substances; PFAS) may cause a number of adverse health outcomes including high blood pressure, as they have similar structures as fatty acids and may have similar biological functions (hormonal; cell-signaling) but unlike fatty acids from food are not easily broken down.

Furthermore, they may have a role to play in ethnic and socioeconomic health disparities, due to differences in exposure and differences in effect. Differences in exposure may arise from different diets and environmental exposures due to historical and structurally discriminatory practices (e.g. less safe food and water) and differences in effect may arise from differences in susceptibility due to factors such as food insecurity or nutritional state.

Today’s analysis

ANALYTIC DESIGN: Try to estimate effect of early-life / pre-adult changes in PFAS concentration on adult blood pressure.
- Discuss what first DAG implies
- Consider implications and revise
- Look at what adjustment sets are required, including additional data
- Finalize sample - target population, exclusions/selection, adjustment sets
STATISTICAL DESIGN: Explore considerations for estimand and estimator
- Target estimand: mean difference, risk ratio, odds ratio
- Estimator: Transformations of exposure or outcome variables
STATISTICAL DESIGN: Methods for adjustment
- Stratification / Mantel-Haenzel
- Regression

ANALYTIC DESIGN

Considering available data

Load nhanes_merged_18.xlsx.

Refer to “nhanes_pfas_project.R” on how the file was created. This file merges five files from the 2017-2018 wave of the CDC’s NHANES annual survey.

Documentation for the merged NHANES datasets:

Demographics (DEMO_J)
Blood (serum) measures of PFAS (PFAS_J)
Blood (RBC) measures of Folate (FOLATE_J)
Blood pressure measurement (BPX_J)
Food security (P_FSQ)

NOTE: For this and most remaining analyses we will set aside consideration of the survey weighting. It is important for inference, but not immediately related to our decisions and challenges around analytic and statistical design. We will return to this later.

merged_final <- readxl::read_xlsx(paste0(alt_path, "nhanes_merged_18.xlsx"))
merged_final

The first thing we want to do, to do a quick check of the dataset is to see whether the variable types seem to line up: variables that should be numbers are dbl or int. Categorical variables can be numeric, or if they are text chr. One thing we notice is that participant age (RIDAGEYR) should be a number, in years, but it is in character (chr) format. If count() it, we find it this because one of the values is not a number (“80 years of age and over”):

We will need to figure out what to do with this, let’s look at how many these are:

Seems like a lot, but we will only be working with the subset that have PFAS measurements, let’s restrict and see how many this might affect:

merged_final %>% filter(!is.na(LBXNFOA)) %>% count(RIDAGEYR == "80 years of age and over")

Still a fair number, but perhaps we can recode it to “80” for now and potentially restrict the final study to younger people (emulating a study with shorter “follow-up”) anyway.

In addition to age, we also see a similar problem with the family income to poverty ratio ($INDFMPIR$):

I’ve gone through this data, so here I’m just going to do a shortcut to filter down to the problematic values:

Since this is a large number we will need to figure out what to do with them, categorization will lose data, but perhaps that is the best we can do. For now, let’s consider recoding it to “5”. This and age is an example of “top coding” with an analogue of “bottom coding” that is often done to preserve anonymity (extreme values can identify people), but serves as a challenge to analyses.

In any case, let’s start building a cleaner dataset (merged_final_clean) with age and income ratio re-code.

Then let’s quickly tabulate it to see what we else need to fix:

merged_final_clean <- 
  merged_final %>% 
  mutate(age = case_when(RIDAGEYR == "80 years of age and over" ~ 80,
                         T ~ as.numeric(RIDAGEYR)),
         inc_ratio = case_when(INDFMPIR == "Value greater than or equal to 5.00" ~ 5,
                               T ~ as.numeric(INDFMPIR))) 
tbl1 <- 
 merged_final_clean %>% 
  #filter(RIDRETH3 %in% c("Non-Hispanic Asian", "Non-Hispanic White")) %>% 
  table1(~ RIAGENDR + age + DMDEDUC2 + DMDEDUC3 + INDFMIN2 + INDHHIN2 + inc_ratio + BPXPLS + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | RIDRETH3, data = .,
         topclass="Rtable1-zebra",
         #overall = F, extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"))
tbl1

	Mexican American (N=1367)	Non-Hispanic Asian (N=1168)	Non-Hispanic Black (N=2115)	Non-Hispanic White (N=3150)	Other Hispanic (N=820)	Other Race - Including Multi-Racial (N=634)	Overall (N=9254)
RIAGENDR
Female	719 (52.6%)	615 (52.7%)	1083 (51.2%)	1567 (49.7%)	420 (51.2%)	293 (46.2%)	4697 (50.8%)
Male	648 (47.4%)	553 (47.3%)	1032 (48.8%)	1583 (50.3%)	400 (48.8%)	341 (53.8%)	4557 (49.2%)
age
Mean (SD)	29.6 (23.2)	36.6 (22.5)	34.7 (25.0)	37.0 (27.7)	34.5 (24.6)	25.5 (23.2)	34.3 (25.5)
Median [MIN, MAX]	23.0 [0, 80.0]	36.0 [0, 80.0]	32.0 [0, 80.0]	34.0 [0, 80.0]	33.0 [0, 80.0]	16.5 [0, 80.0]	31.0 [0, 80.0]
IQR [25%ile, 75%ile]	40.0 [10.0, 50.0]	39.0 [16.0, 55.0]	48.0 [11.0, 59.0]	53.0 [10.0, 63.0]	47.0 [11.0, 58.0]	37.0 [7.00, 44.0]	47.0 [11.0, 58.0]
N Missing	0	0	0	0	0	0	0
DMDEDUC2
9-11th grade (Includes 12th grade with no diploma)	133 (9.7%)	56 (4.8%)	158 (7.5%)	193 (6.1%)	72 (8.8%)	26 (4.1%)	638 (6.9%)
College graduate or above	53 (3.9%)	453 (38.8%)	248 (11.7%)	449 (14.3%)	79 (9.6%)	54 (8.5%)	1336 (14.4%)
Don't Know	2 (0.1%)	0 (0%)	2 (0.1%)	1 (0.0%)	5 (0.6%)	1 (0.2%)	11 (0.1%)
High school graduate/GED or equivalent	168 (12.3%)	105 (9.0%)	342 (16.2%)	545 (17.3%)	98 (12.0%)	67 (10.6%)	1325 (14.3%)
Less than 9th grade	206 (15.1%)	54 (4.6%)	34 (1.6%)	50 (1.6%)	124 (15.1%)	11 (1.7%)	479 (5.2%)
Some college or AA degree	173 (12.7%)	142 (12.2%)	513 (24.3%)	697 (22.1%)	139 (17.0%)	114 (18.0%)	1778 (19.2%)
Refused	0 (0%)	1 (0.1%)	1 (0.0%)	0 (0%)	0 (0%)	0 (0%)	2 (0.0%)
Missing	632 (46.2%)	357 (30.6%)	817 (38.6%)	1215 (38.6%)	303 (37.0%)	361 (56.9%)	3685 (39.8%)
DMDEDUC3
10th grade	26 (1.9%)	20 (1.7%)	28 (1.3%)	39 (1.2%)	13 (1.6%)	13 (2.1%)	139 (1.5%)
11th grade	30 (2.2%)	24 (2.1%)	32 (1.5%)	46 (1.5%)	7 (0.9%)	16 (2.5%)	155 (1.7%)
12th grade, no diploma	2 (0.1%)	3 (0.3%)	7 (0.3%)	4 (0.1%)	0 (0%)	4 (0.6%)	20 (0.2%)
1st grade	31 (2.3%)	14 (1.2%)	37 (1.7%)	60 (1.9%)	12 (1.5%)	22 (3.5%)	176 (1.9%)
2nd grade	31 (2.3%)	23 (2.0%)	51 (2.4%)	57 (1.8%)	22 (2.7%)	18 (2.8%)	202 (2.2%)
3rd grade	37 (2.7%)	15 (1.3%)	57 (2.7%)	56 (1.8%)	16 (2.0%)	21 (3.3%)	202 (2.2%)
4th grade	34 (2.5%)	17 (1.5%)	34 (1.6%)	53 (1.7%)	18 (2.2%)	23 (3.6%)	179 (1.9%)
5th grade	42 (3.1%)	15 (1.3%)	54 (2.6%)	57 (1.8%)	16 (2.0%)	15 (2.4%)	199 (2.2%)
6th grade	28 (2.0%)	14 (1.2%)	39 (1.8%)	49 (1.6%)	14 (1.7%)	10 (1.6%)	154 (1.7%)
7th grade	27 (2.0%)	19 (1.6%)	37 (1.7%)	43 (1.4%)	11 (1.3%)	14 (2.2%)	151 (1.6%)
8th grade	29 (2.1%)	16 (1.4%)	42 (2.0%)	44 (1.4%)	10 (1.2%)	13 (2.1%)	154 (1.7%)
9th grade	22 (1.6%)	18 (1.5%)	39 (1.8%)	47 (1.5%)	10 (1.2%)	18 (2.8%)	154 (1.7%)
High school graduate	29 (2.1%)	19 (1.6%)	23 (1.1%)	53 (1.7%)	12 (1.5%)	14 (2.2%)	150 (1.6%)
Less than 9th grade	1 (0.1%)	2 (0.2%)	0 (0%)	6 (0.2%)	2 (0.2%)	0 (0%)	11 (0.1%)
More than high school	17 (1.2%)	8 (0.7%)	13 (0.6%)	23 (0.7%)	5 (0.6%)	5 (0.8%)	71 (0.8%)
Never attended / kindergarten only	30 (2.2%)	25 (2.1%)	38 (1.8%)	58 (1.8%)	17 (2.1%)	16 (2.5%)	184 (2.0%)
GED or equivalent	0 (0%)	0 (0%)	0 (0%)	5 (0.2%)	0 (0%)	0 (0%)	5 (0.1%)
Missing	951 (69.6%)	916 (78.4%)	1584 (74.9%)	2450 (77.8%)	635 (77.4%)	412 (65.0%)	6948 (75.1%)
INDFMIN2
$ 0 to $ 4,999	57 (4.2%)	16 (1.4%)	119 (5.6%)	82 (2.6%)	47 (5.7%)	19 (3.0%)	340 (3.7%)
$ 5,000 to $ 9,999	35 (2.6%)	14 (1.2%)	101 (4.8%)	81 (2.6%)	23 (2.8%)	27 (4.3%)	281 (3.0%)
$10,000 to $14,999	86 (6.3%)	27 (2.3%)	116 (5.5%)	144 (4.6%)	59 (7.2%)	22 (3.5%)	454 (4.9%)
$100,000 and Over	130 (9.5%)	399 (34.2%)	238 (11.3%)	580 (18.4%)	94 (11.5%)	135 (21.3%)	1576 (17.0%)
$15,000 to $19,999	92 (6.7%)	32 (2.7%)	140 (6.6%)	200 (6.3%)	55 (6.7%)	37 (5.8%)	556 (6.0%)
$20,000 and Over	38 (2.8%)	51 (4.4%)	50 (2.4%)	80 (2.5%)	6 (0.7%)	21 (3.3%)	246 (2.7%)
$20,000 to $24,999	83 (6.1%)	36 (3.1%)	146 (6.9%)	204 (6.5%)	45 (5.5%)	41 (6.5%)	555 (6.0%)
$25,000 to $34,999	152 (11.1%)	60 (5.1%)	237 (11.2%)	384 (12.2%)	92 (11.2%)	80 (12.6%)	1005 (10.9%)
$35,000 to $44,999	164 (12.0%)	76 (6.5%)	215 (10.2%)	327 (10.4%)	65 (7.9%)	49 (7.7%)	896 (9.7%)
$45,000 to $54,999	96 (7.0%)	70 (6.0%)	115 (5.4%)	227 (7.2%)	44 (5.4%)	46 (7.3%)	598 (6.5%)
$55,000 to $64,999	88 (6.4%)	88 (7.5%)	114 (5.4%)	178 (5.7%)	37 (4.5%)	35 (5.5%)	540 (5.8%)
$65,000 to $74,999	56 (4.1%)	49 (4.2%)	79 (3.7%)	168 (5.3%)	29 (3.5%)	37 (5.8%)	418 (4.5%)
$75,000 to $99,999	97 (7.1%)	111 (9.5%)	160 (7.6%)	340 (10.8%)	56 (6.8%)	40 (6.3%)	804 (8.7%)
Don't know	79 (5.8%)	7 (0.6%)	43 (2.0%)	17 (0.5%)	55 (6.7%)	6 (0.9%)	207 (2.2%)
Refused	24 (1.8%)	29 (2.5%)	68 (3.2%)	22 (0.7%)	26 (3.2%)	10 (1.6%)	179 (1.9%)
Under $20,000	28 (2.0%)	10 (0.9%)	38 (1.8%)	20 (0.6%)	25 (3.0%)	4 (0.6%)	125 (1.4%)
Missing	62 (4.5%)	93 (8.0%)	136 (6.4%)	96 (3.0%)	62 (7.6%)	25 (3.9%)	474 (5.1%)
INDHHIN2
$ 0 to $ 4,999	52 (3.8%)	11 (0.9%)	108 (5.1%)	56 (1.8%)	39 (4.8%)	16 (2.5%)	282 (3.0%)
$ 5,000 to $ 9,999	32 (2.3%)	12 (1.0%)	87 (4.1%)	74 (2.3%)	22 (2.7%)	25 (3.9%)	252 (2.7%)
$10,000 to $14,999	79 (5.8%)	25 (2.1%)	101 (4.8%)	126 (4.0%)	48 (5.9%)	20 (3.2%)	399 (4.3%)
$100,000 and Over	134 (9.8%)	408 (34.9%)	244 (11.5%)	600 (19.0%)	101 (12.3%)	137 (21.6%)	1624 (17.5%)
$15,000 to $19,999	83 (6.1%)	30 (2.6%)	138 (6.5%)	198 (6.3%)	50 (6.1%)	37 (5.8%)	536 (5.8%)
$20,000 and Over	47 (3.4%)	65 (5.6%)	66 (3.1%)	101 (3.2%)	25 (3.0%)	24 (3.8%)	328 (3.5%)
$20,000 to $24,999	77 (5.6%)	32 (2.7%)	147 (7.0%)	193 (6.1%)	42 (5.1%)	38 (6.0%)	529 (5.7%)
$25,000 to $34,999	148 (10.8%)	57 (4.9%)	222 (10.5%)	374 (11.9%)	86 (10.5%)	73 (11.5%)	960 (10.4%)
$35,000 to $44,999	162 (11.9%)	75 (6.4%)	208 (9.8%)	330 (10.5%)	68 (8.3%)	50 (7.9%)	893 (9.6%)
$45,000 to $54,999	103 (7.5%)	70 (6.0%)	123 (5.8%)	218 (6.9%)	46 (5.6%)	47 (7.4%)	607 (6.6%)
$55,000 to $64,999	88 (6.4%)	87 (7.4%)	134 (6.3%)	189 (6.0%)	36 (4.4%)	39 (6.2%)	573 (6.2%)
$65,000 to $74,999	64 (4.7%)	51 (4.4%)	81 (3.8%)	176 (5.6%)	28 (3.4%)	41 (6.5%)	441 (4.8%)
$75,000 to $99,999	98 (7.2%)	108 (9.2%)	165 (7.8%)	359 (11.4%)	57 (7.0%)	42 (6.6%)	829 (9.0%)
Don't know	85 (6.2%)	11 (0.9%)	46 (2.2%)	18 (0.6%)	54 (6.6%)	6 (0.9%)	220 (2.4%)
Refused	24 (1.8%)	24 (2.1%)	63 (3.0%)	21 (0.7%)	28 (3.4%)	10 (1.6%)	170 (1.8%)
Under $20,000	28 (2.0%)	9 (0.8%)	35 (1.7%)	19 (0.6%)	25 (3.0%)	4 (0.6%)	120 (1.3%)
Missing	63 (4.6%)	93 (8.0%)	147 (7.0%)	98 (3.1%)	65 (7.9%)	25 (3.9%)	491 (5.3%)
inc_ratio
Mean (SD)	1.88 (1.39)	3.22 (1.63)	2.05 (1.52)	2.55 (1.57)	2.04 (1.58)	2.42 (1.60)	2.38 (1.60)
Median [MIN, MAX]	1.50 [0, 5.00]	3.47 [0, 5.00]	1.62 [0, 5.00]	2.08 [0, 5.00]	1.51 [0, 5.00]	2.02 [0, 5.00]	1.92 [0, 5.00]
IQR [25%ile, 75%ile]	1.73 [0.810, 2.54]	3.35 [1.65, 5.00]	2.26 [0.870, 3.13]	2.75 [1.23, 3.98]	2.21 [0.770, 2.98]	2.87 [1.05, 3.92]	2.65 [1.04, 3.69]
N Missing	231	190	335	235	174	66	1231
Missing	231 (16.9%)	190 (16.3%)	335 (15.8%)	235 (7.5%)	174 (21.2%)	66 (10.4%)	1231 (13.3%)
BPXPLS
Mean (SD)	74.3 (12.2)	72.9 (11.8)	73.2 (12.4)	73.7 (12.4)	73.4 (12.2)	77.1 (13.4)	73.7 (12.4)
Median [MIN, MAX]	74.0 [42.0, 120]	72.0 [42.0, 134]	72.0 [44.0, 136]	72.0 [34.0, 132]	72.0 [46.0, 128]	76.0 [50.0, 130]	72.0 [34.0, 136]
IQR [25%ile, 75%ile]	16.0 [66.0, 82.0]	16.0 [64.0, 80.0]	16.0 [64.0, 80.0]	18.0 [64.0, 82.0]	14.0 [66.0, 80.0]	18.0 [68.0, 86.0]	16.0 [66.0, 82.0]
N Missing	389	275	531	891	227	199	2512
Missing	389 (28.5%)	275 (23.5%)	531 (25.1%)	891 (28.3%)	227 (27.7%)	199 (31.4%)	2512 (27.1%)
BPXSY1
Mean (SD)	118 (18.1)	121 (19.5)	125 (22.1)	121 (19.0)	121 (21.2)	117 (18.6)	121 (20.0)
Median [MIN, MAX]	116 [82.0, 216]	117 [72.0, 216]	122 [78.0, 228]	118 [78.0, 216]	116 [82.0, 210]	114 [80.0, 190]	118 [72.0, 228]
IQR [25%ile, 75%ile]	22.0 [106, 128]	22.0 [108, 130]	28.0 [108, 136]	24.0 [108, 132]	26.0 [106, 132]	22.0 [104, 126]	26.0 [106, 132]
N Missing	439	326	655	1039	268	225	2952
Missing	439 (32.1%)	326 (27.9%)	655 (31.0%)	1039 (33.0%)	268 (32.7%)	225 (35.5%)	2952 (31.9%)
BPXDI1
Mean (SD)	66.3 (16.5)	70.0 (14.8)	69.0 (17.4)	67.1 (15.7)	67.8 (15.3)	66.4 (19.2)	67.8 (16.4)
Median [MIN, MAX]	68.0 [0, 114]	72.0 [0, 110]	70.0 [0, 136]	68.0 [0, 120]	70.0 [0, 112]	68.0 [0, 120]	70.0 [0, 136]
IQR [25%ile, 75%ile]	16.0 [60.0, 76.0]	14.0 [64.0, 78.0]	20.0 [60.0, 80.0]	16.0 [60.0, 76.0]	16.0 [60.0, 76.0]	18.0 [60.0, 78.0]	16.0 [60.0, 76.0]
N Missing	439	326	655	1039	268	225	2952
Missing	439 (32.1%)	326 (27.9%)	655 (31.0%)	1039 (33.0%)	268 (32.7%)	225 (35.5%)	2952 (31.9%)
LBDRFOSI
Mean (SD)	1110 (396)	1140 (393)	988 (422)	1280 (535)	1150 (413)	1120 (452)	1150 (470)
Median [MIN, MAX]	1060 [154, 4540]	1090 [239, 3150]	911 [253, 4220]	1170 [114, 4420]	1090 [344, 2990]	1060 [313, 5870]	1060 [114, 5870]
IQR [25%ile, 75%ile]	395 [865, 1260]	484 [859, 1340]	437 [723, 1160]	576 [914, 1490]	458 [882, 1340]	427 [859, 1290]	496 [842, 1340]
N Missing	647	568	1087	1629	426	315	4672
Missing	647 (47.3%)	568 (48.6%)	1087 (51.4%)	1629 (51.7%)	426 (52.0%)	315 (49.7%)	4672 (50.5%)
LBXNFOA
Mean (SD)	1.26 (1.17)	2.40 (3.96)	1.47 (1.13)	1.68 (1.12)	1.51 (1.22)	1.33 (0.871)	1.63 (1.82)
Median [MIN, MAX]	1.00 [0.0700, 11.4]	1.50 [0.200, 52.8]	1.20 [0.0700, 10.7]	1.40 [0.0700, 10.4]	1.30 [0.200, 9.50]	1.10 [0.0700, 5.40]	1.30 [0.0700, 52.8]
IQR [25%ile, 75%ile]	0.700 [0.700, 1.40]	1.70 [0.900, 2.60]	1.10 [0.700, 1.80]	1.20 [0.900, 2.10]	0.900 [0.800, 1.70]	0.800 [0.800, 1.60]	1.20 [0.800, 2.00]
N Missing	1070	911	1685	2483	644	532	7325
Missing	1070 (78.3%)	911 (78.0%)	1685 (79.7%)	2483 (78.8%)	644 (78.5%)	532 (83.9%)	7325 (79.2%)
LBXNFOS
Mean (SD)	2.92 (2.92)	6.18 (7.79)	6.32 (9.86)	4.34 (3.76)	4.07 (5.41)	3.75 (5.01)	4.75 (6.43)
Median [MIN, MAX]	2.00 [0.0700, 21.3]	3.40 [0.200, 64.0]	3.30 [0.0700, 88.4]	3.30 [0.0700, 33.6]	2.70 [0.500, 54.5]	2.60 [0.0700, 43.9]	3.00 [0.0700, 88.4]
IQR [25%ile, 75%ile]	2.10 [1.30, 3.40]	5.70 [1.40, 7.10]	4.95 [1.70, 6.65]	3.60 [2.00, 5.60]	2.70 [1.70, 4.40]	2.58 [1.50, 4.08]	3.70 [1.70, 5.40]
N Missing	1070	911	1685	2483	644	532	7325
Missing	1070 (78.3%)	911 (78.0%)	1685 (79.7%)	2483 (78.8%)	644 (78.5%)	532 (83.9%)	7325 (79.2%)
LBXPFNA
Mean (SD)	0.427 (0.420)	0.826 (0.744)	0.601 (0.620)	0.560 (0.478)	0.513 (0.370)	0.445 (0.303)	0.574 (0.543)
Median [MIN, MAX]	0.300 [0.0700, 3.90]	0.600 [0.0700, 5.10]	0.500 [0.0700, 6.50]	0.400 [0.0700, 4.50]	0.450 [0.0700, 3.00]	0.400 [0.0700, 1.40]	0.400 [0.0700, 6.50]
IQR [25%ile, 75%ile]	0.300 [0.200, 0.500]	0.800 [0.300, 1.10]	0.400 [0.300, 0.700]	0.400 [0.300, 0.700]	0.400 [0.300, 0.700]	0.400 [0.200, 0.600]	0.400 [0.300, 0.700]
N Missing	1070	911	1685	2483	644	532	7325
Missing	1070 (78.3%)	911 (78.0%)	1685 (79.7%)	2483 (78.8%)	644 (78.5%)	532 (83.9%)	7325 (79.2%)
LBXPFHS
Mean (SD)	1.14 (1.20)	2.21 (4.76)	1.54 (2.04)	1.74 (1.81)	1.28 (1.68)	1.29 (1.16)	1.60 (2.39)
Median [MIN, MAX]	0.800 [0.0700, 12.0]	0.900 [0.0700, 48.8]	1.10 [0.0700, 21.4]	1.30 [0.0700, 19.3]	0.900 [0.0700, 19.9]	0.900 [0.0700, 6.90]	1.10 [0.0700, 48.8]
IQR [25%ile, 75%ile]	0.800 [0.500, 1.30]	1.30 [0.500, 1.80]	1.40 [0.500, 1.90]	1.40 [0.700, 2.10]	0.800 [0.600, 1.40]	0.975 [0.500, 1.48]	1.20 [0.600, 1.80]
N Missing	1070	911	1685	2483	644	532	7325
Missing	1070 (78.3%)	911 (78.0%)	1685 (79.7%)	2483 (78.8%)	644 (78.5%)	532 (83.9%)	7325 (79.2%)

Notably, there are two education variables $DMDEDUC2$ and $DMDEDUC3$ and they both have a high percentage of missing. From the documentation (and some quick tabulations), we see they are mutually exclusive and $DMDEDUC3$is only asked of young people (<20):

Checking if anyone has non-missing values for both (no):

What is the age range for people with a value for $DMDEDUC2$:

What is the age range for people with a value for $DMDEDUC3$:

We can potentially address this by merging the education values.

Conceputal question: Why would we want to capture “highest completed education” variables differently for young people than for adults? Does a value of “8th grade” mean the same thing for a 13 year-old as a 25 year-old?

There are too many values to look at as separate categories, but at the same time there is not a clear way to make them ordinal / continuous measures. We also have two possible income measures, family income ($INDFMIN2$) and household income ($INDHHIN2$) to choose from. For now, let’s create one joint educational categories and take total household income and make it ordinal:

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `hh_inc = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion

Now let’s create a table of our cleaned variables, restricted down to the set we will probably use for analysis, i.e. that have non-missing PFAS and blood pressure data:

tbl1 <- 
 merged_final_clean %>% 
  filter(!is.na(BPXSY1) & !is.na(LBXNFOA)) %>% 
  filter(age >= 18 & age <= 65) %>% 
  mutate(hi_bp = case_when(BPXSY1 >= 140 | BPXDI1 >= 90 ~ "High BP",
                           is.na(BPXSY1) | is.na(BPXDI1) ~ NA_character_,
                           T ~ "Normal BP")) %>% 
  table1(~ RIAGENDR + RIDRETH3 + age + as.factor(edu_cat) + hh_inc + inc_ratio + BPXPLS + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | hi_bp, data = .,
         topclass="Rtable1-zebra",
         overall = F, extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"))
tbl1

	High BP (N=188)	Normal BP (N=969)	P-value
RIAGENDR
Female	83 (44.1%)	493 (50.9%)	0.108
Male	105 (55.9%)	476 (49.1%)
RIDRETH3
Mexican American	35 (18.6%)	153 (15.8%)	<0.001
Non-Hispanic Asian	24 (12.8%)	151 (15.6%)
Non-Hispanic Black	67 (35.6%)	193 (19.9%)
Non-Hispanic White	37 (19.7%)	326 (33.6%)
Other Hispanic	18 (9.6%)	92 (9.5%)
Other Race - Including Multi-Racial	7 (3.7%)	54 (5.6%)
age
Mean (SD)	53.4 (9.68)	39.9 (14.5)	<0.001
Median [MIN, MAX]	56.0 [23.0, 65.0]	39.0 [18.0, 65.0]
IQR [25%ile, 75%ile]	13.0 [48.0, 61.0]	26.0 [27.0, 53.0]
N Missing	0	0
as.factor(edu_cat)
0	49 (26.1%)	176 (18.2%)	0.0429
1	102 (54.3%)	573 (59.1%)
2	37 (19.7%)	219 (22.6%)
Missing	0 (0%)	1 (0.1%)
hh_inc
Mean (SD)	46.6 (33.8)	48.6 (32.4)	0.477
Median [MIN, MAX]	35.0 [0, 100]	45.0 [0, 100]
IQR [25%ile, 75%ile]	55.0 [20.0, 75.0]	55.0 [20.0, 75.0]
N Missing	16	78
Missing	16 (8.5%)	78 (8.0%)
inc_ratio
Mean (SD)	2.58 (1.61)	2.47 (1.66)	0.442
Median [MIN, MAX]	2.34 [0, 5.00]	2.02 [0, 5.00]
IQR [25%ile, 75%ile]	3.03 [1.19, 4.22]	3.02 [1.06, 4.08]
N Missing	30	112
Missing	30 (16.0%)	112 (11.6%)
BPXPLS
Mean (SD)	71.9 (12.0)	72.7 (11.9)	0.424
Median [MIN, MAX]	70.0 [42.0, 112]	72.0 [44.0, 118]
IQR [25%ile, 75%ile]	14.0 [64.0, 78.0]	16.0 [64.0, 80.0]
N Missing	0	0
BPXSY1
Mean (SD)	149 (14.5)	116 (11.1)	<0.001
Median [MIN, MAX]	146 [122, 216]	116 [88.0, 138]
IQR [25%ile, 75%ile]	12.5 [142, 154]	18.0 [108, 126]
N Missing	0	0
BPXDI1
Mean (SD)	85.5 (13.8)	70.1 (10.3)	<0.001
Median [MIN, MAX]	88.0 [0, 124]	72.0 [0, 88.0]
IQR [25%ile, 75%ile]	16.5 [77.5, 94.0]	12.0 [64.0, 76.0]
N Missing	0	0
LBDRFOSI
Mean (SD)	1100 (657)	1070 (415)	0.818
Median [MIN, MAX]	976 [219, 4540]	1010 [379, 3340]
IQR [25%ile, 75%ile]	544 [751, 1300]	469 [791, 1260]
N Missing	133	483
Missing	133 (70.7%)	483 (49.8%)
LBXNFOA
Mean (SD)	1.70 (1.14)	1.59 (2.24)	0.311
Median [MIN, MAX]	1.50 [0.100, 9.50]	1.20 [0.0700, 52.8]
IQR [25%ile, 75%ile]	1.13 [1.00, 2.13]	1.10 [0.700, 1.80]
N Missing	0	0
LBXNFOS
Mean (SD)	5.73 (7.26)	4.20 (5.21)	0.0065
Median [MIN, MAX]	3.60 [0.100, 59.9]	2.80 [0.0700, 63.1]
IQR [25%ile, 75%ile]	3.25 [2.18, 5.43]	3.10 [1.60, 4.70]
N Missing	0	0
LBXPFNA
Mean (SD)	0.630 (0.492)	0.523 (0.526)	0.00735
Median [MIN, MAX]	0.500 [0.0700, 3.00]	0.400 [0.0700, 6.50]
IQR [25%ile, 75%ile]	0.500 [0.300, 0.800]	0.400 [0.200, 0.600]
N Missing	0	0
LBXPFHS
Mean (SD)	1.55 (1.63)	1.61 (2.78)	0.709
Median [MIN, MAX]	1.10 [0.100, 17.0]	1.00 [0.0700, 48.8]
IQR [25%ile, 75%ile]	1.10 [0.800, 1.90]	1.20 [0.500, 1.70]
N Missing	0	0

Importantly, we can see for the categorical data, whether we have sufficient data for our adjustment models. We will return to this later.

Now that we have an idea of the basic data that we have, we can revisit our analytic model to see if we have sufficient data to attempt to estimate a causal effect.

Refine the DAG

Let’s make a first attempt at a DAG. We think about the major sources of exposure and outcomes, relative to our target population, adult US-residents.

We should include issues we may forsee given this is a cross-sectional study, which creates two problems:

Potential reverse causality
Differential loss to follow-up or selection bias

Because of the potential for reverse causality (BP –> PFAS), there is no way to identify an adjustment set:

## Warning in dag_adjustment_sets(., exposure = exposure, outcome = outcome, : Failed to close backdoor paths. Common reasons include:
##             * graph is not acyclic
##             * backdoor paths are not closeable with given set of variables
##             * necessary variables are unmeasured (latent)

Moreover, the cross-sectional analysis means that we are stuck with a selected sample population (a “censored” one) that is missing older (higher Age) and sicker (higher BP) people, which may lead to bias:

The only potential solution here (other than to not do the analysis or find a different data set) is to restrict our sample and adjust for what we can to prevent this selection bias:

Notably, by restricting to adults, we can somewhat address issues due to reverse causation and censorship issues. For example, older populations may have higher blood pressure resulting in more medical and pharmaceutical intervention and PFAS exposures by that route. Moreover, in older populations there is more likely going to be illness or death that prevent them from participating in surveys. Since illness and death are consequences of exposure, outcome, and covariates, this results in a selection bias (as we have shown in the DAG)

Additionally, by restricting to adults, we narrow down on the potential ways people can be exposed (to PFAS), for example by residual exposures from birth (in infants), exposures due to parents and home environment (in children), or as the result of potential greater exposure to intensive medical and pharmacological treatments. This improves, though doesn’t solve causal consistency issues for the exposure.

In summary, restricting our analyses to adults may help reduce the potential for selection bias and reverse causality due to the cross-sectional nature of the survey. It also improve the potential that the measured exposures arise from common sources and thus have common effects, fulfilling the causal consistency we need for effect estimation.

We show this in our modified DAG here, where we remove the reverse causal arrow (BP –> PFAS) and Age as a determinant of Censorship (i.e. because we’ve restricted to a population where this selection is not as strong):

bp_dag_new <- ggdag::dagify(BP ~ Diet + Food_Security + Income + Ethnicity + Smoking + Age, 
                        PFAS ~ Diet + Income + Ethnicity + Smoking + Food_Security + Age,
                        Diet ~ Food_Security + Ethnicity + Age,
                        Food_Security ~ Income + Ethnicity + Age,
                        Smoking ~ Income + Ethnicity + Age,
                        Censoring ~ BP,
                        exposure = "PFAS", outcome = "BP")

So now we see that a simple adjustment will allow us to estimate the effect of BP on PFAS:

ggdag_adjustment_set(bp_dag_new) + theme_classic() + geom_dag_text(col = "Orange")

And confirmed BP and PFAS are d-separated given this adjustment set:

ggdag_dseparated(bp_dag_new, controlling_for = c("Age", "Income", "Ethnicity", "Food_Security", "Smoking", "Diet")) + theme_classic() + geom_dag_text(col = "Orange")

But we have a problem that we don’t have a good variable capturing “Diet” (since it is not a particularly clear variable).

We probably mean something closer to: “Foods that contain higher PFAS, but ALSO influence BP through other nutrient content.” In this case, let us assume that we can replace this variable with another, the micronutrient Folate (there are probably better options such as “fatty fish intake” or plasma fatty acids, but we’ll go with this for now). We can represent this in our DAG as follows, by replacing “Diet” with our proxy:

bp_dag_new2 <- ggdag::dagify(BP ~ Folate + Food_Security + Income + Ethnicity + Smoking + Age, 
                        PFAS ~ Folate + Income + Ethnicity + Smoking + Food_Security + Age,
                        Folate ~ Food_Security + Ethnicity + Age,
                        Food_Security ~ Income + Ethnicity + Age,
                        Smoking ~ Income + Ethnicity + Age,
                        Censoring ~ BP,
                        Folate ~ Diet,
                        exposure = "PFAS", outcome = "BP")
ggdag(bp_dag_new2) + theme_classic() + geom_dag_text(col = "Orange")

And we see that the implied adjustment set (if this DAG is true) just replaces “Diet” with it’s proxy “Folate”:

ggdag_adjustment_set(bp_dag_new2) + theme_classic() + geom_dag_text(col = "Orange")

STATISTICAL DESIGN

Now that we have our adjustment set, let’s consider what else we need to build our statistical model.

Estimands and estimators

Estimand:

The target of inference, the thing that you estimate to show that there is an association or effect. For causal effects this is usually some outcome contrast i.e. that exposed (or more exposed groups) will have higher means, risk, rates, odds, etc. of the outcome. In other words, that the potential outcomes (mean, risk, rate, odds) under exposure is different than the potential outcomes under no exposure.

For our outcome, blood pressure, it’s unlikely that we can look at rates, especially as we do not have incidence data (we only have one measure of blood pressure). But we could look at differences in mean blood pressure or look at different in prevalence in terms of risk or odds. Let’s look at models for both.

Estimator: This is the statistical model that will allow you to estimate the estimand.

We typically use regression-based estimators, which try to find some best prediction of an outcome, with or without tranformation by a “link function,” on the basis of a weighted set of predictors, with the “weights” being the parameters for those predictors

For differences in means, we typically use linear (Ordinary Least Squares; OLS) regression, or the “identity” link function (no transformation of the outcome)

For differences in risk or odds, we can use logistic, log-binomial, robust Poisson (or OLS if you an economist); which are simply different “link functions”

The predictors in the models include the exposure, for which the parameter can give you the estimate of effect, and just as importantly, the covariates, without the consideration of which the parameter for the exposure is biased and incorrect.

For a quick demonstration, let’s restrict our analytic set as we’ve decided above and look at the un-adjusted estimates (assuming incorrectly that this is an unbiased estimate) of the effect PFOS on mean systolic blood pressure.

analytic_set <- merged_final_clean %>% 
  filter(!is.na(BPXSY1) & !is.na(LBXNFOA)) %>% 
  filter(age >= 18 & age <= 65) %>% # Restricting our analytic set as we decided above
  mutate(hi_bp = case_when(BPXSY1 >= 140 | BPXDI1 >= 90 ~ 1,
                           is.na(BPXSY1) | is.na(BPXDI1) ~ NA_real_,
                           T ~ 0)) # creating our binary hi_bp indicator

Remember that for linear regression, a key criterion for the standard errors / variance estimates to be correct (in finite samples), the residuals for the regression need to be normally distributed (or, identically points need to be normally distributed around the regression line). We might be worried this is not the case given the skewed distribution of blood PFAS measures:

ggplot(analytic_set) + geom_histogram(aes(LBXNFOS))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(analytic_set) + geom_point(aes(LBXNFOS, BPXSY1))

However, we can check this after running our regression:

mdl <- lm(data = analytic_set, formula = BPXSY1 ~ LBXNFOS) 
summary(mdl)

## 
## Call:
## lm(formula = BPXSY1 ~ LBXNFOS, data = analytic_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.566 -11.153  -2.349   8.651  95.652 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.74715    0.62407 191.880  < 2e-16 ***
## LBXNFOS       0.40034    0.08712   4.595  4.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.64 on 1155 degrees of freedom
## Multiple R-squared:  0.01795,    Adjusted R-squared:  0.0171 
## F-statistic: 21.11 on 1 and 1155 DF,  p-value: 4.802e-06

#mdl %>% ggplot() + geom_point(aes(x = .fitted, y = .resid)) + geom_hline(yintercept = 0) # one way
ggResidpanel::resid_panel(mdl) # more comprehensive way

> > Question: What parameter do we interpret and how would we interpret it in a sentence? (PFAS is in units of ng/mL and blood pressure in mmHg) >

The model seems okay, but for biological reasons, we might want to transform our exposure anyway. i.e. if we believe that the effect of increasing PFAS is not linear, but occurs by changes in orders of magnitude (many biological systems work in this way). We can do this quickly by specifying this in the model, here we specify log2 which conveniently gives us the effect size from a doubling of the measure:

mdl2 <- lm(data = analytic_set, formula = BPXSY1 ~ log2(LBXNFOS)) 
summary(mdl2)

## 
## Call:
## lm(formula = BPXSY1 ~ log2(LBXNFOS), data = analytic_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.666 -10.974  -2.393   8.473  96.284 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   118.6089     0.7600 156.056  < 2e-16 ***
## log2(LBXNFOS)   1.8918     0.3775   5.012 6.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.61 on 1155 degrees of freedom
## Multiple R-squared:  0.02129,    Adjusted R-squared:  0.02044 
## F-statistic: 25.12 on 1 and 1155 DF,  p-value: 6.228e-07

ggResidpanel::resid_panel(mdl2)

> > Question: Why is the parameter different in this model? What is the one-sentence description of the association? >

We can also look at risk or odds ratios for our newly created $hi_bp$ binary outcome:

mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit")) 
summary(mdl3)

## 
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.01585    0.13754 -14.656  < 2e-16 ***
## log2(LBXNFOS)  0.22539    0.06245   3.609 0.000307 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1013.7  on 1155  degrees of freedom
## AIC: 1017.7
## 
## Number of Fisher Scoring iterations: 4

Question: How do we interpret this estimate? Is there something we need to do with the parameter (hint: there is a link function)?

Each doubling of plasma PFOS is associated with a 1.25 as high odds of high blood pressure (versus having half the concentration) in U.S. adults between the ages of 18-65.

Stratification

Let’s consider the association adjusted for educational category. What does this do? How does it work? Let’s simplify the problem a bit by changing the exposure to a binary one, we will call hi_pfas where it is indicated by being >median of the observed measures (we can discuss other relevant thresholds):

analytic_set <- 
  analytic_set %>% 
  mutate(hi_pfas = case_when(is.na(LBXNFOS) ~ NA_real_,
                             LBXNFOS > quantile(analytic_set$LBXNFOS, c(0.5), na.rm = T) ~ 1,
                             T ~ 0)) 

mdl4 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit")) 
summary(mdl4)

## 
## Call:
## glm(formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9056     0.1229 -15.500  < 2e-16 ***
## hi_pfas       0.4973     0.1619   3.071  0.00214 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.9  on 1156  degrees of freedom
## Residual deviance: 1017.3  on 1155  degrees of freedom
## AIC: 1021.3
## 
## Number of Fisher Scoring iterations: 4

Having higher than median plasma PFOS is associated with a 1.64 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65.

Next, we adjust for educational category (as a categorical variable, meaning each value has a different estimate):

mdl5 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit")) 
summary(mdl5)

## 
## Call:
## glm(formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit"), 
##     data = analytic_set)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.5386     0.1847  -8.332   <2e-16 ***
## hi_pfas               0.5169     0.1628   3.174   0.0015 ** 
## as.factor(edu_cat)1  -0.4579     0.1951  -2.347   0.0189 *  
## as.factor(edu_cat)2  -0.5475     0.2419  -2.264   0.0236 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1026.5  on 1155  degrees of freedom
## Residual deviance: 1010.3  on 1152  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1018.3
## 
## Number of Fisher Scoring iterations: 4

Having higher than median plasma PFOS is associated with a 1.68 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65, adjusted for differences in distribution of education category AND assuming that the odds ratios are the same between educational categories.

Mantel-Haeszel Adjustment

set_edu0 = table(filter(analytic_set, edu_cat == 0)$hi_bp, filter(analytic_set, edu_cat == 0)$hi_pfas)
set_edu1 = table(filter(analytic_set, edu_cat == 1)$hi_bp, filter(analytic_set, edu_cat == 1)$hi_pfas)
set_edu2 = table(filter(analytic_set, edu_cat == 2)$hi_bp, filter(analytic_set, edu_cat == 2)$hi_pfas)

Stratum-specific odds ratio for Less than High School (edu_cat = 0):

oddsratio.wald(set_edu0, rev = "both")

## $data
##        
##           1   0 Total
##   1      24  25    49
##   0      81  95   176
##   Total 105 120   225
## 
## $measure
##    odds ratio with 95% C.I.
##     estimate     lower    upper
##   1 1.000000        NA       NA
##   0 1.125926 0.5974418 2.121896
## 
## $p.value
##    two-sided
##     midp.exact fisher.exact chi.square
##   1         NA           NA         NA
##   0  0.7161814    0.7478975  0.7136658
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

OR = [exposed cases (24) * unexposed controls (95)] / [unexposed cases (25) / exposed controls (81)]

There is less data in this stratum, so estimates within this group are less stable, which has implications for the overall adjustment!

Stratum-specific odds ratio for Less than High School (edu_cat = 1):

oddsratio.wald(set_edu1, rev = c("both"))

Compute this OR.

Stratum-specific odds ratio for Less than High School (edu_cat = 2):

oddsratio.wald(set_edu2, rev = c("both"))

Compute this OR.

Compute the MH Adjusted Estimate by Hand:

a = exposed cases b = exposed controls c = un-exposed cases d = un-exposed controls

Summing across all strata of the confounder
Take the cross product of a X d / divide by the total number in the stratum
Take the cross product of b X c / divide by the total number in the stratum
Take the ratio of those two measures

strat_set <-array(c(set_edu0, set_edu1, set_edu2), dim = c(2, 2, 3))
epi_test::epi.2by2(strat_set)

Recall the logistic regression estimate from the previous model:

OR = 1.68 [95% CI: 1.22, 2.31]

NEXT TIME

Not just reading off coefficients: Other effects given by the main parameters
Why intepreting parameters for covariates in main models is wrong (“Table 2 Fallacy”)
Adding data on additional confounders
Centering and transforming variables for estimation and interpretation