Today we will finalize our DAG, consider some solutions to unmeasured confounding, and fit models implied by our DAG.
We will emphasize intuitions about what adjustments do, including by stratification and regression, and discuss how to interpret coefficients.
If we have time, we will extend this to effect measure modification parameters.
Updated assignments and due dates:
- ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan
- OCT 27 (Fri): DRAFT of analytic plan due
- NOV 8, 15, 22: In lieu of analytic plan presentation, workshop (analyses, code)
- NOV 17 (Fri): DRAFT of analyses with code due
- DEC 08 (Fri): Final written report
- Post your written reports to Laulima by the end of the due date
Exposure to “forever chemicals” (per- and poly-fluoroalkyl substances; PFAS) may cause a number of adverse health outcomes including high blood pressure, as they have similar structures as fatty acids and may have similar biological functions (hormonal; cell-signaling) but unlike fatty acids from food are not easily broken down.
Furthermore, they may have a role to play in ethnic and socioeconomic health disparities, due to differences in exposure and differences in effect. Differences in exposure may arise from different diets and environmental exposures due to historical and structurally discriminatory practices (e.g. less safe food and water) and differences in effect may arise from differences in susceptibility due to factors such as food insecurity or nutritional state.
Load nhanes_merged_18.xlsx.
Refer to “nhanes_pfas_project.R” on how the file was created. This file merges five files from the 2017-2018 wave of the CDC’s NHANES annual survey.
Documentation for the merged NHANES datasets:
NOTE: For this and most remaining analyses we will set aside consideration of the survey weighting. It is important for inference, but not immediately related to our decisions and challenges around analytic and statistical design. We will return to this later.
merged_final <- readxl::read_xlsx(paste0(alt_path, "nhanes_merged_18.xlsx"))
merged_final
The first thing we want to do, to do a quick check of the dataset is to see whether the variable types seem to line up: variables that should be numbers are dbl or int. Categorical variables can be numeric, or if they are text chr. One thing we notice is that participant age (RIDAGEYR) should be a number, in years, but it is in character (chr) format. If count() it, we find it this because one of the values is not a number (“80 years of age and over”):
Seems like a lot, but we will only be working with the subset that have PFAS measurements, let’s restrict and see how many this might affect:
merged_final %>% filter(!is.na(LBXNFOA)) %>% count(RIDAGEYR == "80 years of age and over")
Still a fair number, but perhaps we can recode it to “80” for now and potentially restrict the final study to younger people (emulating a study with shorter “follow-up”) anyway.
In addition to age, we also see a similar problem with the family income to poverty ratio (\(INDFMPIR\)):Since this is a large number we will need to figure out what to do with them, categorization will lose data, but perhaps that is the best we can do. For now, let’s consider recoding it to “5”. This and age is an example of “top coding” with an analogue of “bottom coding” that is often done to preserve anonymity (extreme values can identify people), but serves as a challenge to analyses.
In any case, let’s start building a cleaner dataset (merged_final_clean) with age and income ratio re-code.
Then let’s quickly tabulate it to see what we else need to fix:
merged_final_clean <-
merged_final %>%
mutate(age = case_when(RIDAGEYR == "80 years of age and over" ~ 80,
T ~ as.numeric(RIDAGEYR)),
inc_ratio = case_when(INDFMPIR == "Value greater than or equal to 5.00" ~ 5,
T ~ as.numeric(INDFMPIR)))
tbl1 <-
merged_final_clean %>%
#filter(RIDRETH3 %in% c("Non-Hispanic Asian", "Non-Hispanic White")) %>%
table1(~ RIAGENDR + age + DMDEDUC2 + DMDEDUC3 + INDFMIN2 + INDHHIN2 + inc_ratio + BPXPLS + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | RIDRETH3, data = .,
topclass="Rtable1-zebra",
#overall = F, extra.col=list(`P-value`=pvalue),
render.continuous=c(.="Mean (SD)",
.="Median [MIN, MAX]",
"IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
"N Missing" = "NMISS"))
tbl1
| Mexican American (N=1367) |
Non-Hispanic Asian (N=1168) |
Non-Hispanic Black (N=2115) |
Non-Hispanic White (N=3150) |
Other Hispanic (N=820) |
Other Race - Including Multi-Racial (N=634) |
Overall (N=9254) |
|
|---|---|---|---|---|---|---|---|
| RIAGENDR | |||||||
| Female | 719 (52.6%) | 615 (52.7%) | 1083 (51.2%) | 1567 (49.7%) | 420 (51.2%) | 293 (46.2%) | 4697 (50.8%) |
| Male | 648 (47.4%) | 553 (47.3%) | 1032 (48.8%) | 1583 (50.3%) | 400 (48.8%) | 341 (53.8%) | 4557 (49.2%) |
| age | |||||||
| Mean (SD) | 29.6 (23.2) | 36.6 (22.5) | 34.7 (25.0) | 37.0 (27.7) | 34.5 (24.6) | 25.5 (23.2) | 34.3 (25.5) |
| Median [MIN, MAX] | 23.0 [0, 80.0] | 36.0 [0, 80.0] | 32.0 [0, 80.0] | 34.0 [0, 80.0] | 33.0 [0, 80.0] | 16.5 [0, 80.0] | 31.0 [0, 80.0] |
| IQR [25%ile, 75%ile] | 40.0 [10.0, 50.0] | 39.0 [16.0, 55.0] | 48.0 [11.0, 59.0] | 53.0 [10.0, 63.0] | 47.0 [11.0, 58.0] | 37.0 [7.00, 44.0] | 47.0 [11.0, 58.0] |
| N Missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DMDEDUC2 | |||||||
| 9-11th grade (Includes 12th grade with no diploma) | 133 (9.7%) | 56 (4.8%) | 158 (7.5%) | 193 (6.1%) | 72 (8.8%) | 26 (4.1%) | 638 (6.9%) |
| College graduate or above | 53 (3.9%) | 453 (38.8%) | 248 (11.7%) | 449 (14.3%) | 79 (9.6%) | 54 (8.5%) | 1336 (14.4%) |
| Don't Know | 2 (0.1%) | 0 (0%) | 2 (0.1%) | 1 (0.0%) | 5 (0.6%) | 1 (0.2%) | 11 (0.1%) |
| High school graduate/GED or equivalent | 168 (12.3%) | 105 (9.0%) | 342 (16.2%) | 545 (17.3%) | 98 (12.0%) | 67 (10.6%) | 1325 (14.3%) |
| Less than 9th grade | 206 (15.1%) | 54 (4.6%) | 34 (1.6%) | 50 (1.6%) | 124 (15.1%) | 11 (1.7%) | 479 (5.2%) |
| Some college or AA degree | 173 (12.7%) | 142 (12.2%) | 513 (24.3%) | 697 (22.1%) | 139 (17.0%) | 114 (18.0%) | 1778 (19.2%) |
| Refused | 0 (0%) | 1 (0.1%) | 1 (0.0%) | 0 (0%) | 0 (0%) | 0 (0%) | 2 (0.0%) |
| Missing | 632 (46.2%) | 357 (30.6%) | 817 (38.6%) | 1215 (38.6%) | 303 (37.0%) | 361 (56.9%) | 3685 (39.8%) |
| DMDEDUC3 | |||||||
| 10th grade | 26 (1.9%) | 20 (1.7%) | 28 (1.3%) | 39 (1.2%) | 13 (1.6%) | 13 (2.1%) | 139 (1.5%) |
| 11th grade | 30 (2.2%) | 24 (2.1%) | 32 (1.5%) | 46 (1.5%) | 7 (0.9%) | 16 (2.5%) | 155 (1.7%) |
| 12th grade, no diploma | 2 (0.1%) | 3 (0.3%) | 7 (0.3%) | 4 (0.1%) | 0 (0%) | 4 (0.6%) | 20 (0.2%) |
| 1st grade | 31 (2.3%) | 14 (1.2%) | 37 (1.7%) | 60 (1.9%) | 12 (1.5%) | 22 (3.5%) | 176 (1.9%) |
| 2nd grade | 31 (2.3%) | 23 (2.0%) | 51 (2.4%) | 57 (1.8%) | 22 (2.7%) | 18 (2.8%) | 202 (2.2%) |
| 3rd grade | 37 (2.7%) | 15 (1.3%) | 57 (2.7%) | 56 (1.8%) | 16 (2.0%) | 21 (3.3%) | 202 (2.2%) |
| 4th grade | 34 (2.5%) | 17 (1.5%) | 34 (1.6%) | 53 (1.7%) | 18 (2.2%) | 23 (3.6%) | 179 (1.9%) |
| 5th grade | 42 (3.1%) | 15 (1.3%) | 54 (2.6%) | 57 (1.8%) | 16 (2.0%) | 15 (2.4%) | 199 (2.2%) |
| 6th grade | 28 (2.0%) | 14 (1.2%) | 39 (1.8%) | 49 (1.6%) | 14 (1.7%) | 10 (1.6%) | 154 (1.7%) |
| 7th grade | 27 (2.0%) | 19 (1.6%) | 37 (1.7%) | 43 (1.4%) | 11 (1.3%) | 14 (2.2%) | 151 (1.6%) |
| 8th grade | 29 (2.1%) | 16 (1.4%) | 42 (2.0%) | 44 (1.4%) | 10 (1.2%) | 13 (2.1%) | 154 (1.7%) |
| 9th grade | 22 (1.6%) | 18 (1.5%) | 39 (1.8%) | 47 (1.5%) | 10 (1.2%) | 18 (2.8%) | 154 (1.7%) |
| High school graduate | 29 (2.1%) | 19 (1.6%) | 23 (1.1%) | 53 (1.7%) | 12 (1.5%) | 14 (2.2%) | 150 (1.6%) |
| Less than 9th grade | 1 (0.1%) | 2 (0.2%) | 0 (0%) | 6 (0.2%) | 2 (0.2%) | 0 (0%) | 11 (0.1%) |
| More than high school | 17 (1.2%) | 8 (0.7%) | 13 (0.6%) | 23 (0.7%) | 5 (0.6%) | 5 (0.8%) | 71 (0.8%) |
| Never attended / kindergarten only | 30 (2.2%) | 25 (2.1%) | 38 (1.8%) | 58 (1.8%) | 17 (2.1%) | 16 (2.5%) | 184 (2.0%) |
| GED or equivalent | 0 (0%) | 0 (0%) | 0 (0%) | 5 (0.2%) | 0 (0%) | 0 (0%) | 5 (0.1%) |
| Missing | 951 (69.6%) | 916 (78.4%) | 1584 (74.9%) | 2450 (77.8%) | 635 (77.4%) | 412 (65.0%) | 6948 (75.1%) |
| INDFMIN2 | |||||||
| $ 0 to $ 4,999 | 57 (4.2%) | 16 (1.4%) | 119 (5.6%) | 82 (2.6%) | 47 (5.7%) | 19 (3.0%) | 340 (3.7%) |
| $ 5,000 to $ 9,999 | 35 (2.6%) | 14 (1.2%) | 101 (4.8%) | 81 (2.6%) | 23 (2.8%) | 27 (4.3%) | 281 (3.0%) |
| $10,000 to $14,999 | 86 (6.3%) | 27 (2.3%) | 116 (5.5%) | 144 (4.6%) | 59 (7.2%) | 22 (3.5%) | 454 (4.9%) |
| $100,000 and Over | 130 (9.5%) | 399 (34.2%) | 238 (11.3%) | 580 (18.4%) | 94 (11.5%) | 135 (21.3%) | 1576 (17.0%) |
| $15,000 to $19,999 | 92 (6.7%) | 32 (2.7%) | 140 (6.6%) | 200 (6.3%) | 55 (6.7%) | 37 (5.8%) | 556 (6.0%) |
| $20,000 and Over | 38 (2.8%) | 51 (4.4%) | 50 (2.4%) | 80 (2.5%) | 6 (0.7%) | 21 (3.3%) | 246 (2.7%) |
| $20,000 to $24,999 | 83 (6.1%) | 36 (3.1%) | 146 (6.9%) | 204 (6.5%) | 45 (5.5%) | 41 (6.5%) | 555 (6.0%) |
| $25,000 to $34,999 | 152 (11.1%) | 60 (5.1%) | 237 (11.2%) | 384 (12.2%) | 92 (11.2%) | 80 (12.6%) | 1005 (10.9%) |
| $35,000 to $44,999 | 164 (12.0%) | 76 (6.5%) | 215 (10.2%) | 327 (10.4%) | 65 (7.9%) | 49 (7.7%) | 896 (9.7%) |
| $45,000 to $54,999 | 96 (7.0%) | 70 (6.0%) | 115 (5.4%) | 227 (7.2%) | 44 (5.4%) | 46 (7.3%) | 598 (6.5%) |
| $55,000 to $64,999 | 88 (6.4%) | 88 (7.5%) | 114 (5.4%) | 178 (5.7%) | 37 (4.5%) | 35 (5.5%) | 540 (5.8%) |
| $65,000 to $74,999 | 56 (4.1%) | 49 (4.2%) | 79 (3.7%) | 168 (5.3%) | 29 (3.5%) | 37 (5.8%) | 418 (4.5%) |
| $75,000 to $99,999 | 97 (7.1%) | 111 (9.5%) | 160 (7.6%) | 340 (10.8%) | 56 (6.8%) | 40 (6.3%) | 804 (8.7%) |
| Don't know | 79 (5.8%) | 7 (0.6%) | 43 (2.0%) | 17 (0.5%) | 55 (6.7%) | 6 (0.9%) | 207 (2.2%) |
| Refused | 24 (1.8%) | 29 (2.5%) | 68 (3.2%) | 22 (0.7%) | 26 (3.2%) | 10 (1.6%) | 179 (1.9%) |
| Under $20,000 | 28 (2.0%) | 10 (0.9%) | 38 (1.8%) | 20 (0.6%) | 25 (3.0%) | 4 (0.6%) | 125 (1.4%) |
| Missing | 62 (4.5%) | 93 (8.0%) | 136 (6.4%) | 96 (3.0%) | 62 (7.6%) | 25 (3.9%) | 474 (5.1%) |
| INDHHIN2 | |||||||
| $ 0 to $ 4,999 | 52 (3.8%) | 11 (0.9%) | 108 (5.1%) | 56 (1.8%) | 39 (4.8%) | 16 (2.5%) | 282 (3.0%) |
| $ 5,000 to $ 9,999 | 32 (2.3%) | 12 (1.0%) | 87 (4.1%) | 74 (2.3%) | 22 (2.7%) | 25 (3.9%) | 252 (2.7%) |
| $10,000 to $14,999 | 79 (5.8%) | 25 (2.1%) | 101 (4.8%) | 126 (4.0%) | 48 (5.9%) | 20 (3.2%) | 399 (4.3%) |
| $100,000 and Over | 134 (9.8%) | 408 (34.9%) | 244 (11.5%) | 600 (19.0%) | 101 (12.3%) | 137 (21.6%) | 1624 (17.5%) |
| $15,000 to $19,999 | 83 (6.1%) | 30 (2.6%) | 138 (6.5%) | 198 (6.3%) | 50 (6.1%) | 37 (5.8%) | 536 (5.8%) |
| $20,000 and Over | 47 (3.4%) | 65 (5.6%) | 66 (3.1%) | 101 (3.2%) | 25 (3.0%) | 24 (3.8%) | 328 (3.5%) |
| $20,000 to $24,999 | 77 (5.6%) | 32 (2.7%) | 147 (7.0%) | 193 (6.1%) | 42 (5.1%) | 38 (6.0%) | 529 (5.7%) |
| $25,000 to $34,999 | 148 (10.8%) | 57 (4.9%) | 222 (10.5%) | 374 (11.9%) | 86 (10.5%) | 73 (11.5%) | 960 (10.4%) |
| $35,000 to $44,999 | 162 (11.9%) | 75 (6.4%) | 208 (9.8%) | 330 (10.5%) | 68 (8.3%) | 50 (7.9%) | 893 (9.6%) |
| $45,000 to $54,999 | 103 (7.5%) | 70 (6.0%) | 123 (5.8%) | 218 (6.9%) | 46 (5.6%) | 47 (7.4%) | 607 (6.6%) |
| $55,000 to $64,999 | 88 (6.4%) | 87 (7.4%) | 134 (6.3%) | 189 (6.0%) | 36 (4.4%) | 39 (6.2%) | 573 (6.2%) |
| $65,000 to $74,999 | 64 (4.7%) | 51 (4.4%) | 81 (3.8%) | 176 (5.6%) | 28 (3.4%) | 41 (6.5%) | 441 (4.8%) |
| $75,000 to $99,999 | 98 (7.2%) | 108 (9.2%) | 165 (7.8%) | 359 (11.4%) | 57 (7.0%) | 42 (6.6%) | 829 (9.0%) |
| Don't know | 85 (6.2%) | 11 (0.9%) | 46 (2.2%) | 18 (0.6%) | 54 (6.6%) | 6 (0.9%) | 220 (2.4%) |
| Refused | 24 (1.8%) | 24 (2.1%) | 63 (3.0%) | 21 (0.7%) | 28 (3.4%) | 10 (1.6%) | 170 (1.8%) |
| Under $20,000 | 28 (2.0%) | 9 (0.8%) | 35 (1.7%) | 19 (0.6%) | 25 (3.0%) | 4 (0.6%) | 120 (1.3%) |
| Missing | 63 (4.6%) | 93 (8.0%) | 147 (7.0%) | 98 (3.1%) | 65 (7.9%) | 25 (3.9%) | 491 (5.3%) |
| inc_ratio | |||||||
| Mean (SD) | 1.88 (1.39) | 3.22 (1.63) | 2.05 (1.52) | 2.55 (1.57) | 2.04 (1.58) | 2.42 (1.60) | 2.38 (1.60) |
| Median [MIN, MAX] | 1.50 [0, 5.00] | 3.47 [0, 5.00] | 1.62 [0, 5.00] | 2.08 [0, 5.00] | 1.51 [0, 5.00] | 2.02 [0, 5.00] | 1.92 [0, 5.00] |
| IQR [25%ile, 75%ile] | 1.73 [0.810, 2.54] | 3.35 [1.65, 5.00] | 2.26 [0.870, 3.13] | 2.75 [1.23, 3.98] | 2.21 [0.770, 2.98] | 2.87 [1.05, 3.92] | 2.65 [1.04, 3.69] |
| N Missing | 231 | 190 | 335 | 235 | 174 | 66 | 1231 |
| Missing | 231 (16.9%) | 190 (16.3%) | 335 (15.8%) | 235 (7.5%) | 174 (21.2%) | 66 (10.4%) | 1231 (13.3%) |
| BPXPLS | |||||||
| Mean (SD) | 74.3 (12.2) | 72.9 (11.8) | 73.2 (12.4) | 73.7 (12.4) | 73.4 (12.2) | 77.1 (13.4) | 73.7 (12.4) |
| Median [MIN, MAX] | 74.0 [42.0, 120] | 72.0 [42.0, 134] | 72.0 [44.0, 136] | 72.0 [34.0, 132] | 72.0 [46.0, 128] | 76.0 [50.0, 130] | 72.0 [34.0, 136] |
| IQR [25%ile, 75%ile] | 16.0 [66.0, 82.0] | 16.0 [64.0, 80.0] | 16.0 [64.0, 80.0] | 18.0 [64.0, 82.0] | 14.0 [66.0, 80.0] | 18.0 [68.0, 86.0] | 16.0 [66.0, 82.0] |
| N Missing | 389 | 275 | 531 | 891 | 227 | 199 | 2512 |
| Missing | 389 (28.5%) | 275 (23.5%) | 531 (25.1%) | 891 (28.3%) | 227 (27.7%) | 199 (31.4%) | 2512 (27.1%) |
| BPXSY1 | |||||||
| Mean (SD) | 118 (18.1) | 121 (19.5) | 125 (22.1) | 121 (19.0) | 121 (21.2) | 117 (18.6) | 121 (20.0) |
| Median [MIN, MAX] | 116 [82.0, 216] | 117 [72.0, 216] | 122 [78.0, 228] | 118 [78.0, 216] | 116 [82.0, 210] | 114 [80.0, 190] | 118 [72.0, 228] |
| IQR [25%ile, 75%ile] | 22.0 [106, 128] | 22.0 [108, 130] | 28.0 [108, 136] | 24.0 [108, 132] | 26.0 [106, 132] | 22.0 [104, 126] | 26.0 [106, 132] |
| N Missing | 439 | 326 | 655 | 1039 | 268 | 225 | 2952 |
| Missing | 439 (32.1%) | 326 (27.9%) | 655 (31.0%) | 1039 (33.0%) | 268 (32.7%) | 225 (35.5%) | 2952 (31.9%) |
| BPXDI1 | |||||||
| Mean (SD) | 66.3 (16.5) | 70.0 (14.8) | 69.0 (17.4) | 67.1 (15.7) | 67.8 (15.3) | 66.4 (19.2) | 67.8 (16.4) |
| Median [MIN, MAX] | 68.0 [0, 114] | 72.0 [0, 110] | 70.0 [0, 136] | 68.0 [0, 120] | 70.0 [0, 112] | 68.0 [0, 120] | 70.0 [0, 136] |
| IQR [25%ile, 75%ile] | 16.0 [60.0, 76.0] | 14.0 [64.0, 78.0] | 20.0 [60.0, 80.0] | 16.0 [60.0, 76.0] | 16.0 [60.0, 76.0] | 18.0 [60.0, 78.0] | 16.0 [60.0, 76.0] |
| N Missing | 439 | 326 | 655 | 1039 | 268 | 225 | 2952 |
| Missing | 439 (32.1%) | 326 (27.9%) | 655 (31.0%) | 1039 (33.0%) | 268 (32.7%) | 225 (35.5%) | 2952 (31.9%) |
| LBDRFOSI | |||||||
| Mean (SD) | 1110 (396) | 1140 (393) | 988 (422) | 1280 (535) | 1150 (413) | 1120 (452) | 1150 (470) |
| Median [MIN, MAX] | 1060 [154, 4540] | 1090 [239, 3150] | 911 [253, 4220] | 1170 [114, 4420] | 1090 [344, 2990] | 1060 [313, 5870] | 1060 [114, 5870] |
| IQR [25%ile, 75%ile] | 395 [865, 1260] | 484 [859, 1340] | 437 [723, 1160] | 576 [914, 1490] | 458 [882, 1340] | 427 [859, 1290] | 496 [842, 1340] |
| N Missing | 647 | 568 | 1087 | 1629 | 426 | 315 | 4672 |
| Missing | 647 (47.3%) | 568 (48.6%) | 1087 (51.4%) | 1629 (51.7%) | 426 (52.0%) | 315 (49.7%) | 4672 (50.5%) |
| LBXNFOA | |||||||
| Mean (SD) | 1.26 (1.17) | 2.40 (3.96) | 1.47 (1.13) | 1.68 (1.12) | 1.51 (1.22) | 1.33 (0.871) | 1.63 (1.82) |
| Median [MIN, MAX] | 1.00 [0.0700, 11.4] | 1.50 [0.200, 52.8] | 1.20 [0.0700, 10.7] | 1.40 [0.0700, 10.4] | 1.30 [0.200, 9.50] | 1.10 [0.0700, 5.40] | 1.30 [0.0700, 52.8] |
| IQR [25%ile, 75%ile] | 0.700 [0.700, 1.40] | 1.70 [0.900, 2.60] | 1.10 [0.700, 1.80] | 1.20 [0.900, 2.10] | 0.900 [0.800, 1.70] | 0.800 [0.800, 1.60] | 1.20 [0.800, 2.00] |
| N Missing | 1070 | 911 | 1685 | 2483 | 644 | 532 | 7325 |
| Missing | 1070 (78.3%) | 911 (78.0%) | 1685 (79.7%) | 2483 (78.8%) | 644 (78.5%) | 532 (83.9%) | 7325 (79.2%) |
| LBXNFOS | |||||||
| Mean (SD) | 2.92 (2.92) | 6.18 (7.79) | 6.32 (9.86) | 4.34 (3.76) | 4.07 (5.41) | 3.75 (5.01) | 4.75 (6.43) |
| Median [MIN, MAX] | 2.00 [0.0700, 21.3] | 3.40 [0.200, 64.0] | 3.30 [0.0700, 88.4] | 3.30 [0.0700, 33.6] | 2.70 [0.500, 54.5] | 2.60 [0.0700, 43.9] | 3.00 [0.0700, 88.4] |
| IQR [25%ile, 75%ile] | 2.10 [1.30, 3.40] | 5.70 [1.40, 7.10] | 4.95 [1.70, 6.65] | 3.60 [2.00, 5.60] | 2.70 [1.70, 4.40] | 2.58 [1.50, 4.08] | 3.70 [1.70, 5.40] |
| N Missing | 1070 | 911 | 1685 | 2483 | 644 | 532 | 7325 |
| Missing | 1070 (78.3%) | 911 (78.0%) | 1685 (79.7%) | 2483 (78.8%) | 644 (78.5%) | 532 (83.9%) | 7325 (79.2%) |
| LBXPFNA | |||||||
| Mean (SD) | 0.427 (0.420) | 0.826 (0.744) | 0.601 (0.620) | 0.560 (0.478) | 0.513 (0.370) | 0.445 (0.303) | 0.574 (0.543) |
| Median [MIN, MAX] | 0.300 [0.0700, 3.90] | 0.600 [0.0700, 5.10] | 0.500 [0.0700, 6.50] | 0.400 [0.0700, 4.50] | 0.450 [0.0700, 3.00] | 0.400 [0.0700, 1.40] | 0.400 [0.0700, 6.50] |
| IQR [25%ile, 75%ile] | 0.300 [0.200, 0.500] | 0.800 [0.300, 1.10] | 0.400 [0.300, 0.700] | 0.400 [0.300, 0.700] | 0.400 [0.300, 0.700] | 0.400 [0.200, 0.600] | 0.400 [0.300, 0.700] |
| N Missing | 1070 | 911 | 1685 | 2483 | 644 | 532 | 7325 |
| Missing | 1070 (78.3%) | 911 (78.0%) | 1685 (79.7%) | 2483 (78.8%) | 644 (78.5%) | 532 (83.9%) | 7325 (79.2%) |
| LBXPFHS | |||||||
| Mean (SD) | 1.14 (1.20) | 2.21 (4.76) | 1.54 (2.04) | 1.74 (1.81) | 1.28 (1.68) | 1.29 (1.16) | 1.60 (2.39) |
| Median [MIN, MAX] | 0.800 [0.0700, 12.0] | 0.900 [0.0700, 48.8] | 1.10 [0.0700, 21.4] | 1.30 [0.0700, 19.3] | 0.900 [0.0700, 19.9] | 0.900 [0.0700, 6.90] | 1.10 [0.0700, 48.8] |
| IQR [25%ile, 75%ile] | 0.800 [0.500, 1.30] | 1.30 [0.500, 1.80] | 1.40 [0.500, 1.90] | 1.40 [0.700, 2.10] | 0.800 [0.600, 1.40] | 0.975 [0.500, 1.48] | 1.20 [0.600, 1.80] |
| N Missing | 1070 | 911 | 1685 | 2483 | 644 | 532 | 7325 |
| Missing | 1070 (78.3%) | 911 (78.0%) | 1685 (79.7%) | 2483 (78.8%) | 644 (78.5%) | 532 (83.9%) | 7325 (79.2%) |
Notably, there are two education variables \(DMDEDUC2\) and \(DMDEDUC3\) and they both have a high percentage of missing. From the documentation (and some quick tabulations), we see they are mutually exclusive and \(DMDEDUC3\)is only asked of young people (<20):
Checking if anyone has non-missing values for both (no):We can potentially address this by merging the education values.
Conceputal question: Why would we want to capture “highest completed education” variables differently for young people than for adults? Does a value of “8th grade” mean the same thing for a 13 year-old as a 25 year-old?
There are too many values to look at as separate categories, but at the same time there is not a clear way to make them ordinal / continuous measures. We also have two possible income measures, family income (\(INDFMIN2\)) and household income (\(INDHHIN2\)) to choose from. For now, let’s create one joint educational categories and take total household income and make it ordinal:
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `hh_inc = case_when(...)`.
## Caused by warning:
## ! NAs introduced by coercion
Now let’s create a table of our cleaned variables, restricted down to the set we will probably use for analysis, i.e. that have non-missing PFAS and blood pressure data:
tbl1 <-
merged_final_clean %>%
filter(!is.na(BPXSY1) & !is.na(LBXNFOA)) %>%
filter(age >= 18 & age <= 65) %>%
mutate(hi_bp = case_when(BPXSY1 >= 140 | BPXDI1 >= 90 ~ "High BP",
is.na(BPXSY1) | is.na(BPXDI1) ~ NA_character_,
T ~ "Normal BP")) %>%
table1(~ RIAGENDR + RIDRETH3 + age + as.factor(edu_cat) + hh_inc + inc_ratio + BPXPLS + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | hi_bp, data = .,
topclass="Rtable1-zebra",
overall = F, extra.col=list(`P-value`=pvalue),
render.continuous=c(.="Mean (SD)",
.="Median [MIN, MAX]",
"IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
"N Missing" = "NMISS"))
tbl1
| High BP (N=188) |
Normal BP (N=969) |
P-value | |
|---|---|---|---|
| RIAGENDR | |||
| Female | 83 (44.1%) | 493 (50.9%) | 0.108 |
| Male | 105 (55.9%) | 476 (49.1%) | |
| RIDRETH3 | |||
| Mexican American | 35 (18.6%) | 153 (15.8%) | <0.001 |
| Non-Hispanic Asian | 24 (12.8%) | 151 (15.6%) | |
| Non-Hispanic Black | 67 (35.6%) | 193 (19.9%) | |
| Non-Hispanic White | 37 (19.7%) | 326 (33.6%) | |
| Other Hispanic | 18 (9.6%) | 92 (9.5%) | |
| Other Race - Including Multi-Racial | 7 (3.7%) | 54 (5.6%) | |
| age | |||
| Mean (SD) | 53.4 (9.68) | 39.9 (14.5) | <0.001 |
| Median [MIN, MAX] | 56.0 [23.0, 65.0] | 39.0 [18.0, 65.0] | |
| IQR [25%ile, 75%ile] | 13.0 [48.0, 61.0] | 26.0 [27.0, 53.0] | |
| N Missing | 0 | 0 | |
| as.factor(edu_cat) | |||
| 0 | 49 (26.1%) | 176 (18.2%) | 0.0429 |
| 1 | 102 (54.3%) | 573 (59.1%) | |
| 2 | 37 (19.7%) | 219 (22.6%) | |
| Missing | 0 (0%) | 1 (0.1%) | |
| hh_inc | |||
| Mean (SD) | 46.6 (33.8) | 48.6 (32.4) | 0.477 |
| Median [MIN, MAX] | 35.0 [0, 100] | 45.0 [0, 100] | |
| IQR [25%ile, 75%ile] | 55.0 [20.0, 75.0] | 55.0 [20.0, 75.0] | |
| N Missing | 16 | 78 | |
| Missing | 16 (8.5%) | 78 (8.0%) | |
| inc_ratio | |||
| Mean (SD) | 2.58 (1.61) | 2.47 (1.66) | 0.442 |
| Median [MIN, MAX] | 2.34 [0, 5.00] | 2.02 [0, 5.00] | |
| IQR [25%ile, 75%ile] | 3.03 [1.19, 4.22] | 3.02 [1.06, 4.08] | |
| N Missing | 30 | 112 | |
| Missing | 30 (16.0%) | 112 (11.6%) | |
| BPXPLS | |||
| Mean (SD) | 71.9 (12.0) | 72.7 (11.9) | 0.424 |
| Median [MIN, MAX] | 70.0 [42.0, 112] | 72.0 [44.0, 118] | |
| IQR [25%ile, 75%ile] | 14.0 [64.0, 78.0] | 16.0 [64.0, 80.0] | |
| N Missing | 0 | 0 | |
| BPXSY1 | |||
| Mean (SD) | 149 (14.5) | 116 (11.1) | <0.001 |
| Median [MIN, MAX] | 146 [122, 216] | 116 [88.0, 138] | |
| IQR [25%ile, 75%ile] | 12.5 [142, 154] | 18.0 [108, 126] | |
| N Missing | 0 | 0 | |
| BPXDI1 | |||
| Mean (SD) | 85.5 (13.8) | 70.1 (10.3) | <0.001 |
| Median [MIN, MAX] | 88.0 [0, 124] | 72.0 [0, 88.0] | |
| IQR [25%ile, 75%ile] | 16.5 [77.5, 94.0] | 12.0 [64.0, 76.0] | |
| N Missing | 0 | 0 | |
| LBDRFOSI | |||
| Mean (SD) | 1100 (657) | 1070 (415) | 0.818 |
| Median [MIN, MAX] | 976 [219, 4540] | 1010 [379, 3340] | |
| IQR [25%ile, 75%ile] | 544 [751, 1300] | 469 [791, 1260] | |
| N Missing | 133 | 483 | |
| Missing | 133 (70.7%) | 483 (49.8%) | |
| LBXNFOA | |||
| Mean (SD) | 1.70 (1.14) | 1.59 (2.24) | 0.311 |
| Median [MIN, MAX] | 1.50 [0.100, 9.50] | 1.20 [0.0700, 52.8] | |
| IQR [25%ile, 75%ile] | 1.13 [1.00, 2.13] | 1.10 [0.700, 1.80] | |
| N Missing | 0 | 0 | |
| LBXNFOS | |||
| Mean (SD) | 5.73 (7.26) | 4.20 (5.21) | 0.0065 |
| Median [MIN, MAX] | 3.60 [0.100, 59.9] | 2.80 [0.0700, 63.1] | |
| IQR [25%ile, 75%ile] | 3.25 [2.18, 5.43] | 3.10 [1.60, 4.70] | |
| N Missing | 0 | 0 | |
| LBXPFNA | |||
| Mean (SD) | 0.630 (0.492) | 0.523 (0.526) | 0.00735 |
| Median [MIN, MAX] | 0.500 [0.0700, 3.00] | 0.400 [0.0700, 6.50] | |
| IQR [25%ile, 75%ile] | 0.500 [0.300, 0.800] | 0.400 [0.200, 0.600] | |
| N Missing | 0 | 0 | |
| LBXPFHS | |||
| Mean (SD) | 1.55 (1.63) | 1.61 (2.78) | 0.709 |
| Median [MIN, MAX] | 1.10 [0.100, 17.0] | 1.00 [0.0700, 48.8] | |
| IQR [25%ile, 75%ile] | 1.10 [0.800, 1.90] | 1.20 [0.500, 1.70] | |
| N Missing | 0 | 0 |
Importantly, we can see for the categorical data, whether we have sufficient data for our adjustment models. We will return to this later.
Now that we have an idea of the basic data that we have, we can revisit our analytic model to see if we have sufficient data to attempt to estimate a causal effect.
Let’s make a first attempt at a DAG. We think about the major sources of exposure and outcomes, relative to our target population, adult US-residents.
We should include issues we may forsee given this is a cross-sectional study, which creates two problems:
Because of the potential for reverse causality (BP –> PFAS), there is no way to identify an adjustment set:
## Warning in dag_adjustment_sets(., exposure = exposure, outcome = outcome, : Failed to close backdoor paths. Common reasons include:
## * graph is not acyclic
## * backdoor paths are not closeable with given set of variables
## * necessary variables are unmeasured (latent)
Moreover, the cross-sectional analysis means that we are stuck with a selected sample population (a “censored” one) that is missing older (higher Age) and sicker (higher BP) people, which may lead to bias:
The only potential solution here (other than to not do the analysis or find a different data set) is to restrict our sample and adjust for what we can to prevent this selection bias:
Notably, by restricting to adults, we can somewhat address issues due to reverse causation and censorship issues. For example, older populations may have higher blood pressure resulting in more medical and pharmaceutical intervention and PFAS exposures by that route. Moreover, in older populations there is more likely going to be illness or death that prevent them from participating in surveys. Since illness and death are consequences of exposure, outcome, and covariates, this results in a selection bias (as we have shown in the DAG)
Additionally, by restricting to adults, we narrow down on the potential ways people can be exposed (to PFAS), for example by residual exposures from birth (in infants), exposures due to parents and home environment (in children), or as the result of potential greater exposure to intensive medical and pharmacological treatments. This improves, though doesn’t solve causal consistency issues for the exposure.
In summary, restricting our analyses to adults may help reduce the potential for selection bias and reverse causality due to the cross-sectional nature of the survey. It also improve the potential that the measured exposures arise from common sources and thus have common effects, fulfilling the causal consistency we need for effect estimation.
We show this in our modified DAG here, where we remove the reverse causal arrow (BP –> PFAS) and Age as a determinant of Censorship (i.e. because we’ve restricted to a population where this selection is not as strong):
bp_dag_new <- ggdag::dagify(BP ~ Diet + Food_Security + Income + Ethnicity + Smoking + Age,
PFAS ~ Diet + Income + Ethnicity + Smoking + Food_Security + Age,
Diet ~ Food_Security + Ethnicity + Age,
Food_Security ~ Income + Ethnicity + Age,
Smoking ~ Income + Ethnicity + Age,
Censoring ~ BP,
exposure = "PFAS", outcome = "BP")
So now we see that a simple adjustment will allow us to estimate the effect of BP on PFAS:
ggdag_adjustment_set(bp_dag_new) + theme_classic() + geom_dag_text(col = "Orange")
And confirmed BP and PFAS are d-separated given this adjustment set:
ggdag_dseparated(bp_dag_new, controlling_for = c("Age", "Income", "Ethnicity", "Food_Security", "Smoking", "Diet")) + theme_classic() + geom_dag_text(col = "Orange")
But we have a problem that we don’t have a good variable capturing “Diet” (since it is not a particularly clear variable).
We probably mean something closer to: “Foods that contain higher PFAS, but ALSO influence BP through other nutrient content.” In this case, let us assume that we can replace this variable with another, the micronutrient Folate (there are probably better options such as “fatty fish intake” or plasma fatty acids, but we’ll go with this for now). We can represent this in our DAG as follows, by replacing “Diet” with our proxy:
bp_dag_new2 <- ggdag::dagify(BP ~ Folate + Food_Security + Income + Ethnicity + Smoking + Age,
PFAS ~ Folate + Income + Ethnicity + Smoking + Food_Security + Age,
Folate ~ Food_Security + Ethnicity + Age,
Food_Security ~ Income + Ethnicity + Age,
Smoking ~ Income + Ethnicity + Age,
Censoring ~ BP,
Folate ~ Diet,
exposure = "PFAS", outcome = "BP")
ggdag(bp_dag_new2) + theme_classic() + geom_dag_text(col = "Orange")
And we see that the implied adjustment set (if this DAG is true) just replaces “Diet” with it’s proxy “Folate”:
ggdag_adjustment_set(bp_dag_new2) + theme_classic() + geom_dag_text(col = "Orange")
Now that we have our adjustment set, let’s consider what else we need to build our statistical model.
Estimand:
The target of inference, the thing that you estimate to show that there is an association or effect. For causal effects this is usually some outcome contrast i.e. that exposed (or more exposed groups) will have higher means, risk, rates, odds, etc. of the outcome. In other words, that the potential outcomes (mean, risk, rate, odds) under exposure is different than the potential outcomes under no exposure.
For our outcome, blood pressure, it’s unlikely that we can look at rates, especially as we do not have incidence data (we only have one measure of blood pressure). But we could look at differences in mean blood pressure or look at different in prevalence in terms of risk or odds. Let’s look at models for both.
Estimator: This is the statistical model that will allow you to estimate the estimand.
We typically use regression-based estimators, which try to find some best prediction of an outcome, with or without tranformation by a “link function,” on the basis of a weighted set of predictors, with the “weights” being the parameters for those predictors
For differences in means, we typically use linear (Ordinary Least Squares; OLS) regression, or the “identity” link function (no transformation of the outcome)
For differences in risk or odds, we can use logistic, log-binomial, robust Poisson (or OLS if you an economist); which are simply different “link functions”
The predictors in the models include the exposure, for which the parameter can give you the estimate of effect, and just as importantly, the covariates, without the consideration of which the parameter for the exposure is biased and incorrect.
For a quick demonstration, let’s restrict our analytic set as we’ve decided above and look at the un-adjusted estimates (assuming incorrectly that this is an unbiased estimate) of the effect PFOS on mean systolic blood pressure.
analytic_set <- merged_final_clean %>%
filter(!is.na(BPXSY1) & !is.na(LBXNFOA)) %>%
filter(age >= 18 & age <= 65) %>% # Restricting our analytic set as we decided above
mutate(hi_bp = case_when(BPXSY1 >= 140 | BPXDI1 >= 90 ~ 1,
is.na(BPXSY1) | is.na(BPXDI1) ~ NA_real_,
T ~ 0)) # creating our binary hi_bp indicator
Remember that for linear regression, a key criterion for the standard errors / variance estimates to be correct (in finite samples), the residuals for the regression need to be normally distributed (or, identically points need to be normally distributed around the regression line). We might be worried this is not the case given the skewed distribution of blood PFAS measures:
ggplot(analytic_set) + geom_histogram(aes(LBXNFOS))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(analytic_set) + geom_point(aes(LBXNFOS, BPXSY1))
However, we can check this after running our regression:
mdl <- lm(data = analytic_set, formula = BPXSY1 ~ LBXNFOS)
summary(mdl)
##
## Call:
## lm(formula = BPXSY1 ~ LBXNFOS, data = analytic_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.566 -11.153 -2.349 8.651 95.652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.74715 0.62407 191.880 < 2e-16 ***
## LBXNFOS 0.40034 0.08712 4.595 4.8e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.64 on 1155 degrees of freedom
## Multiple R-squared: 0.01795, Adjusted R-squared: 0.0171
## F-statistic: 21.11 on 1 and 1155 DF, p-value: 4.802e-06
#mdl %>% ggplot() + geom_point(aes(x = .fitted, y = .resid)) + geom_hline(yintercept = 0) # one way
ggResidpanel::resid_panel(mdl) # more comprehensive way
> > Question: What parameter do we interpret and how
would we interpret it in a sentence? (PFAS is in units of ng/mL and
blood pressure in mmHg) >
The model seems okay, but for biological reasons, we might want to transform our exposure anyway. i.e. if we believe that the effect of increasing PFAS is not linear, but occurs by changes in orders of magnitude (many biological systems work in this way). We can do this quickly by specifying this in the model, here we specify log2 which conveniently gives us the effect size from a doubling of the measure:
mdl2 <- lm(data = analytic_set, formula = BPXSY1 ~ log2(LBXNFOS))
summary(mdl2)
##
## Call:
## lm(formula = BPXSY1 ~ log2(LBXNFOS), data = analytic_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.666 -10.974 -2.393 8.473 96.284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.6089 0.7600 156.056 < 2e-16 ***
## log2(LBXNFOS) 1.8918 0.3775 5.012 6.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.61 on 1155 degrees of freedom
## Multiple R-squared: 0.02129, Adjusted R-squared: 0.02044
## F-statistic: 25.12 on 1 and 1155 DF, p-value: 6.228e-07
ggResidpanel::resid_panel(mdl2)
> > Question: Why is the parameter different in this
model? What is the one-sentence description of the association? >
We can also look at risk or odds ratios for our newly created \(hi_bp\) binary outcome:
mdl3 <- glm(data = analytic_set, formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit"))
summary(mdl3)
##
## Call:
## glm(formula = hi_bp ~ log2(LBXNFOS), family = binomial(link = "logit"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.01585 0.13754 -14.656 < 2e-16 ***
## log2(LBXNFOS) 0.22539 0.06245 3.609 0.000307 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.9 on 1156 degrees of freedom
## Residual deviance: 1013.7 on 1155 degrees of freedom
## AIC: 1017.7
##
## Number of Fisher Scoring iterations: 4
Question: How do we interpret this estimate? Is there something we need to do with the parameter (hint: there is a link function)?
Each doubling of plasma PFOS is associated with a 1.25 as high odds of high blood pressure (versus having half the concentration) in U.S. adults between the ages of 18-65.
Let’s consider the association adjusted for educational category. What does this do? How does it work? Let’s simplify the problem a bit by changing the exposure to a binary one, we will call hi_pfas where it is indicated by being >median of the observed measures (we can discuss other relevant thresholds):
analytic_set <-
analytic_set %>%
mutate(hi_pfas = case_when(is.na(LBXNFOS) ~ NA_real_,
LBXNFOS > quantile(analytic_set$LBXNFOS, c(0.5), na.rm = T) ~ 1,
T ~ 0))
mdl4 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"))
summary(mdl4)
##
## Call:
## glm(formula = hi_bp ~ hi_pfas, family = binomial(link = "logit"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9056 0.1229 -15.500 < 2e-16 ***
## hi_pfas 0.4973 0.1619 3.071 0.00214 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.9 on 1156 degrees of freedom
## Residual deviance: 1017.3 on 1155 degrees of freedom
## AIC: 1021.3
##
## Number of Fisher Scoring iterations: 4
Having higher than median plasma PFOS is associated with a 1.64 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65.
Next, we adjust for educational category (as a categorical variable, meaning each value has a different estimate):
mdl5 <- glm(data = analytic_set, formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit"))
summary(mdl5)
##
## Call:
## glm(formula = hi_bp ~ hi_pfas + as.factor(edu_cat), family = binomial(link = "logit"),
## data = analytic_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5386 0.1847 -8.332 <2e-16 ***
## hi_pfas 0.5169 0.1628 3.174 0.0015 **
## as.factor(edu_cat)1 -0.4579 0.1951 -2.347 0.0189 *
## as.factor(edu_cat)2 -0.5475 0.2419 -2.264 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1026.5 on 1155 degrees of freedom
## Residual deviance: 1010.3 on 1152 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 1018.3
##
## Number of Fisher Scoring iterations: 4
Having higher than median plasma PFOS is associated with a 1.68 as high odds of high blood pressure (than having below-median PFAS) in U.S. adults between the ages of 18-65, adjusted for differences in distribution of education category AND assuming that the odds ratios are the same between educational categories.
set_edu0 = table(filter(analytic_set, edu_cat == 0)$hi_bp, filter(analytic_set, edu_cat == 0)$hi_pfas)
set_edu1 = table(filter(analytic_set, edu_cat == 1)$hi_bp, filter(analytic_set, edu_cat == 1)$hi_pfas)
set_edu2 = table(filter(analytic_set, edu_cat == 2)$hi_bp, filter(analytic_set, edu_cat == 2)$hi_pfas)
Stratum-specific odds ratio for Less than High School (edu_cat = 0):
oddsratio.wald(set_edu0, rev = "both")
## $data
##
## 1 0 Total
## 1 24 25 49
## 0 81 95 176
## Total 105 120 225
##
## $measure
## odds ratio with 95% C.I.
## estimate lower upper
## 1 1.000000 NA NA
## 0 1.125926 0.5974418 2.121896
##
## $p.value
## two-sided
## midp.exact fisher.exact chi.square
## 1 NA NA NA
## 0 0.7161814 0.7478975 0.7136658
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
OR = [exposed cases (24) * unexposed controls (95)] / [unexposed cases (25) / exposed controls (81)]
There is less data in this stratum, so estimates within this group are less stable, which has implications for the overall adjustment!
Stratum-specific odds ratio for Less than High School (edu_cat = 1):
oddsratio.wald(set_edu1, rev = c("both"))
Compute this OR.
Stratum-specific odds ratio for Less than High School (edu_cat = 2):
oddsratio.wald(set_edu2, rev = c("both"))
Compute this OR.
Compute the MH Adjusted Estimate by Hand:
a = exposed cases b = exposed controls c = un-exposed cases d = un-exposed controls
strat_set <-array(c(set_edu0, set_edu1, set_edu2), dim = c(2, 2, 3))
epi_test::epi.2by2(strat_set)
Recall the logistic regression estimate from the previous model:
OR = 1.68 [95% CI: 1.22, 2.31]