Updated assignments and due dates:
- ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan
- NOV 10 (Fri): DRAFT of analytic plan due
- NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.
- NOV 24 (Fri): DRAFT of analyses with code due
- DEC 08 (Fri): Final written report
- Post your written reports to Laulima by the end of the due date
Today:
Review constructing measures
Review measures of assocations RD, RR, and ORs
Discuss concepts of social exposures
Decomposition of effects (discrimination, etc.)
merged_final <- readxl::read_xlsx(paste0(alt_path, "nhanes_merged_18.xlsx"))
# Re-code some text variables to numeric, making some decisions about how to treat top-coded variables
merged_final_clean <-
merged_final %>%
mutate(age = case_when(RIDAGEYR == "80 years of age and over" ~ 80,
T ~ as.numeric(RIDAGEYR)),
inc_ratio = case_when(INDFMPIR == "Value greater than or equal to 5.00" ~ 5,
T ~ as.numeric(INDFMPIR)))
# Create a joint "edu_cat" variable by first re-coding DMDEDUC2 if they have it.
# And then replacing missing values with DMDEDUC3, since I know they are mutually exclusive
merged_final_clean <-
merged_final_clean %>%
mutate(edu_cat = case_when(DMDEDUC2 %in% c("Less than 9th grade", "9-11th grade (Includes 12th grade with no diploma)") ~ 0,
DMDEDUC2 %in% c("High school graduate/GED or equivalent", "Some college or AA degree") ~ 1,
DMDEDUC2 == "College graduate or above" ~ 2,
T ~ NA_real_)) %>%
mutate(edu_cat = case_when(is.na(DMDEDUC3) ~ edu_cat, # keep DMDEDUC2 value if it has it
DMDEDUC3 %in% c("High school graduate", "More than high school", "GED or equivalent") ~ 1,
T ~ 0))
# merged_final_clean %>% count(edu_cat) # 1379 that were originally missing both + 13 that had DMDEDUC2 == "Don't Know" or "Refused"
# Create a household income variable "hh_inc" expressed in $1000s and dealing with top-coding
merged_final_clean <-
merged_final_clean %>%
mutate(hh_inc = case_when(INDHHIN2 %in% c("Don't know", "Refused") | is.na(INDHHIN2) ~ NA_real_,
INDHHIN2 == "$20,000 and Over" ~ 20,
INDHHIN2 == "Under $20,000" ~ 15,
INDHHIN2 == "$100,000 and Over" ~ 100,
T ~ as.numeric(substring(INDHHIN2,2,3))))
# merged_final_clean %>% count(hh_inc)
# Add relevant labels for variables and educational categories
# Add a binary outcome variable, hi_bp, for blood pressure >= 140 systolic or >= 90 diastolic
# Add a binary exposure variable, hi_pfas, for pfas greater than the median observed value
merged_final_clean <-
merged_final_clean %>%
mutate(edu_cat = factor(edu_cat, levels = c(0,1,2), labels = c("/u003C HS", "HS", "College"), ordered = T)) %>%
mutate(hi_bp = factor(case_when(BPXSY1 >= 140 | BPXDI1 >= 90 ~ 1,
is.na(BPXSY1) | is.na(BPXDI1) ~ NA_real_,
T ~ 0), levels = c(0,1), labels = c("Normal BP", "High BP"))) %>%
mutate(hi_pfas = factor(case_when(is.na(LBXNFOS) ~ NA_real_,
LBXNFOS > quantile(.$LBXNFOS, c(0.5), na.rm = T) ~ 1,
T ~ 0), levels = c(0,1), labels = c("Normal PFOS", "High PFOS"))) %>%
mutate(RIAGENDR = structure(RIAGENDR, label = "Participant Sex"),
age = structure(age, label = "Participant Age (years)"),
RIDRETH3 = structure(RIDRETH3, label = "Self-reported Race/Ethnicity"),
edu_cat = structure(edu_cat, label = "Highest Completed Education"),
hh_inc = structure(hh_inc, label = "Household Income (in $1000s)"),
inc_ratio = structure(inc_ratio, label = "Family Income:Poverty Line Ratio"),
BPXSY1 = structure(BPXSY1, label = "Systolic Blood Pressure (mmHg)"),
BPXDI1 = structure(BPXDI1, label = "Diastolic Blood Pressure (mmHg)"),
LBDRFOSI = structure(LBDRFOSI, label = "RBC Folate (ng/mL)"),
LBXNFOA = structure(LBXNFOA, label = "Plasma PFOA (ng/mL)"),
LBXNFOS = structure(LBXNFOS, label = "Plasma PFOS (ng/mL)"),
LBXPFNA = structure(LBXPFNA, label = "Plasma PFNA (ng/mL)"),
LBXPFHS = structure(LBXPFHS, label = "Plasma PFHxS (ng/mL)"),
hi_bp = structure(hi_bp, label = "High Blood Pressure"))
# A descriptive table restricted to those with blood pressure AND PFOA data, and stratified by blood pressure:
tbl1 <-
merged_final_clean %>%
filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>%
filter(age >= 18 & age <= 65) %>%
table1(~ RIAGENDR + RIDRETH3 + age + as.factor(edu_cat) + hh_inc + inc_ratio + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | hi_bp, data = .,
topclass="Rtable1-zebra",
overall = F, extra.col=list(`P-value`=pvalue),
render.continuous=c(.="Mean (SD)",
.="Median [MIN, MAX]",
"IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
"N Missing" = "NMISS"))
tbl1
| Normal BP (N=969) |
High BP (N=188) |
P-value | |
|---|---|---|---|
| Participant Sex | |||
| Female | 493 (50.9%) | 83 (44.1%) | 0.108 |
| Male | 476 (49.1%) | 105 (55.9%) | |
| Self-reported Race/Ethnicity | |||
| Mexican American | 153 (15.8%) | 35 (18.6%) | <0.001 |
| Non-Hispanic Asian | 151 (15.6%) | 24 (12.8%) | |
| Non-Hispanic Black | 193 (19.9%) | 67 (35.6%) | |
| Non-Hispanic White | 326 (33.6%) | 37 (19.7%) | |
| Other Hispanic | 92 (9.5%) | 18 (9.6%) | |
| Other Race - Including Multi-Racial | 54 (5.6%) | 7 (3.7%) | |
| Participant Age (years) | |||
| Mean (SD) | 39.9 (14.5) | 53.4 (9.68) | <0.001 |
| Median [MIN, MAX] | 39.0 [18.0, 65.0] | 56.0 [23.0, 65.0] | |
| IQR [25%ile, 75%ile] | 26.0 [27.0, 53.0] | 13.0 [48.0, 61.0] | |
| N Missing | 0 | 0 | |
| Highest Completed Education | |||
| /u003C HS | 176 (18.2%) | 49 (26.1%) | 0.0429 |
| HS | 573 (59.1%) | 102 (54.3%) | |
| College | 219 (22.6%) | 37 (19.7%) | |
| Missing | 1 (0.1%) | 0 (0%) | |
| Household Income (in $1000s) | |||
| Mean (SD) | 48.6 (32.4) | 46.6 (33.8) | 0.477 |
| Median [MIN, MAX] | 45.0 [0, 100] | 35.0 [0, 100] | |
| IQR [25%ile, 75%ile] | 55.0 [20.0, 75.0] | 55.0 [20.0, 75.0] | |
| N Missing | 78 | 16 | |
| Missing | 78 (8.0%) | 16 (8.5%) | |
| Family Income:Poverty Line Ratio | |||
| Mean (SD) | 2.47 (1.66) | 2.58 (1.61) | 0.442 |
| Median [MIN, MAX] | 2.02 [0, 5.00] | 2.34 [0, 5.00] | |
| IQR [25%ile, 75%ile] | 3.02 [1.06, 4.08] | 3.03 [1.19, 4.22] | |
| N Missing | 112 | 30 | |
| Missing | 112 (11.6%) | 30 (16.0%) | |
| Systolic Blood Pressure (mmHg) | |||
| Mean (SD) | 116 (11.1) | 149 (14.5) | <0.001 |
| Median [MIN, MAX] | 116 [88.0, 138] | 146 [122, 216] | |
| IQR [25%ile, 75%ile] | 18.0 [108, 126] | 12.5 [142, 154] | |
| N Missing | 0 | 0 | |
| Diastolic Blood Pressure (mmHg) | |||
| Mean (SD) | 70.1 (10.3) | 85.5 (13.8) | <0.001 |
| Median [MIN, MAX] | 72.0 [0, 88.0] | 88.0 [0, 124] | |
| IQR [25%ile, 75%ile] | 12.0 [64.0, 76.0] | 16.5 [77.5, 94.0] | |
| N Missing | 0 | 0 | |
| RBC Folate (ng/mL) | |||
| Mean (SD) | 1070 (415) | 1100 (657) | 0.818 |
| Median [MIN, MAX] | 1010 [379, 3340] | 976 [219, 4540] | |
| IQR [25%ile, 75%ile] | 469 [791, 1260] | 544 [751, 1300] | |
| N Missing | 483 | 133 | |
| Missing | 483 (49.8%) | 133 (70.7%) | |
| Plasma PFOA (ng/mL) | |||
| Mean (SD) | 1.59 (2.24) | 1.70 (1.14) | 0.311 |
| Median [MIN, MAX] | 1.20 [0.0700, 52.8] | 1.50 [0.100, 9.50] | |
| IQR [25%ile, 75%ile] | 1.10 [0.700, 1.80] | 1.13 [1.00, 2.13] | |
| N Missing | 0 | 0 | |
| Plasma PFOS (ng/mL) | |||
| Mean (SD) | 4.20 (5.21) | 5.73 (7.26) | 0.0065 |
| Median [MIN, MAX] | 2.80 [0.0700, 63.1] | 3.60 [0.100, 59.9] | |
| IQR [25%ile, 75%ile] | 3.10 [1.60, 4.70] | 3.25 [2.18, 5.43] | |
| N Missing | 0 | 0 | |
| Plasma PFNA (ng/mL) | |||
| Mean (SD) | 0.523 (0.526) | 0.630 (0.492) | 0.00735 |
| Median [MIN, MAX] | 0.400 [0.0700, 6.50] | 0.500 [0.0700, 3.00] | |
| IQR [25%ile, 75%ile] | 0.400 [0.200, 0.600] | 0.500 [0.300, 0.800] | |
| N Missing | 0 | 0 | |
| Plasma PFHxS (ng/mL) | |||
| Mean (SD) | 1.61 (2.78) | 1.55 (1.63) | 0.709 |
| Median [MIN, MAX] | 1.00 [0.0700, 48.8] | 1.10 [0.100, 17.0] | |
| IQR [25%ile, 75%ile] | 1.20 [0.500, 1.70] | 1.10 [0.800, 1.90] | |
| N Missing | 0 | 0 |
# Create a working analytic set
# Previously, we had edu_cat, hi_bp, and hi_pfas as labelled variables to make the tables nices
# For ease of analysis, I'm going to transform them back to numbers
analytic_set <-
merged_final_clean %>%
#filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>%
filter(!is.na(hi_bp) & !is.na(hi_pfas)) %>%
filter(age >= 18 & age <= 65) %>% # Restricting our analytic set as we decided above
mutate(edu_cat = as.numeric(edu_cat)-1,
hi_bp = as.numeric(hi_bp)-1,
hi_pfas = as.numeric(hi_pfas)-1)
Recall, we showed mean difference in systolic blood pressures by 1 unit (ng/mL) change in PFOS:
mdl <- lm(data = analytic_set, formula = BPXSY1 ~ LBXNFOS)
summary(mdl)
##
## Call:
## lm(formula = BPXSY1 ~ LBXNFOS, data = analytic_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.566 -11.153 -2.349 8.651 95.652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.74715 0.62407 191.880 < 2e-16 ***
## LBXNFOS 0.40034 0.08712 4.595 4.8e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.64 on 1155 degrees of freedom
## Multiple R-squared: 0.01795, Adjusted R-squared: 0.0171
## F-statistic: 21.11 on 1 and 1155 DF, p-value: 4.802e-06
beta = round(summary(mdl)$coefficients[2,1], 2)
ci_lb = round(summary(mdl)$coefficients[2,1] - 1.96*summary(mdl)$coefficients[2,2], 2)
ci_ub = round(summary(mdl)$coefficients[2,1] + 1.96*summary(mdl)$coefficients[2,2], 2)
# We also showed the diagnostics for this regression
# mdl %>% ggplot() + geom_point(aes(x = .fitted, y = .resid)) + geom_hline(yintercept = 0) # one way
# ggResidpanel::resid_panel(mdl) # more comprehensive way
We observe a 0.4 mmHg higher in systolic blood pressure per 1 ng/mL higher PFAS exposure (beta = 0.4, 95% CI: 0.23, 0.57).
What if we want to at the extent to which PFAS might explain racial / ethnic disparities in blood pressure?
One approach is to use a Kitagawa-Oaxaca-Blinder -type of effect decomposition approach. First proposed by Evelyn Mae Kitagawa a sociologist and demographer in 1955, it was unfortunately attributed to two male economist 20 years later (Ronald Oaxaca and Alan Blinder) in what is an unfortunately typical trends in the sciences.
This decomposition partitions the total effect size, i.e. the mean difference between two groups, into the difference in distribution of a cause of the outcome (called “endowments”) and the differences in effect size for that cause (called “coefficients”).
analytic_set2 <- mutate(analytic_set, ASIAN = case_when(RIDRETH3 == "Non-Hispanic Asian" ~ F,
RIDRETH3 == "Non-Hispanic White" ~ T,
T ~ NA),
PFOS_center = LBXNFOS / mean(analytic_set$LBXNFOS))
decomp_res <- oaxaca(formula = BPXSY1 ~ LBXNFOS | ASIAN, data = analytic_set2, R = 100)
## oaxaca: oaxaca() performing analysis. Please wait.
##
## Bootstrapping standard errors:
## 1 / 100 (1%)
## 10 / 100 (10%)
## 20 / 100 (20%)
## 30 / 100 (30%)
## 40 / 100 (40%)
## 50 / 100 (50%)
## 60 / 100 (60%)
## 70 / 100 (70%)
## 80 / 100 (80%)
## 90 / 100 (90%)
## 100 / 100 (100%)
decomp_res$x
## $x.mean.A
## (Intercept) LBXNFOS
## 1.000000 6.679429
##
## $x.mean.B
## (Intercept) LBXNFOS
## 1.000000 3.678237
##
## $x.mean.diff
## (Intercept) LBXNFOS
## 0.000000 3.001192
decomp_res$y
## $y.A
## [1] 120.8686
##
## $y.B
## [1] 118.0826
##
## $y.diff
## [1] 2.785927
decomp_res$threefold$overall
## coef(endowments) se(endowments) coef(coefficients) se(coefficients)
## 1.5738764 0.8672679 2.2596093 1.1940000
## coef(interaction) se(interaction)
## -1.0475589 0.9419813
plot(decomp_res, components = c("endowments", "coefficients"))
## The effect on Systolic BP due to difference in exposure levels (in %):
round(decomp_res$threefold$overall[1]/decomp_res$y$y.diff, 2)*100
## coef(endowments)
## 56
## The effect on Systolic BP due to difference in effect (in %):
round((decomp_res$threefold$overall[3] + decomp_res$threefold$overall[5])/decomp_res$y$y.diff, 2)*100
## coef(coefficients)
## 44
## The difference in effect estimate in Asians:
decomp_res$beta$beta.diff["LBXNFOS"]
## LBXNFOS
## -0.3490476
#plot(decomp_res,decomposition="twofold",group.weight=-1)
analytic_set2 <- mutate(analytic_set, ASIAN = case_when(RIDRETH3 == "Non-Hispanic Asian" ~ F,
RIDRETH3 == "Non-Hispanic White" ~ T,
T ~ NA),
PFOS_center = LBXNFOS / mean(analytic_set$LBXNFOS))
decomp_res <- oaxaca(formula = BPXSY1 ~ LBXNFOS + as.factor(RIAGENDR) + age + as.factor(edu_cat) + hh_inc | ASIAN, data = analytic_set2, R = 100)
## oaxaca: oaxaca() performing analysis. Please wait.
##
## Bootstrapping standard errors:
## 1 / 100 (1%)
## 10 / 100 (10%)
## 20 / 100 (20%)
## 30 / 100 (30%)
## 40 / 100 (40%)
## 50 / 100 (50%)
## 60 / 100 (60%)
## 70 / 100 (70%)
## 80 / 100 (80%)
## 90 / 100 (90%)
## 100 / 100 (100%)
decomp_res$x
## $x.mean.A
## (Intercept) LBXNFOS as.factor(RIAGENDR)Male
## 1.0000000 6.6319277 0.5180723
## age as.factor(edu_cat)1 as.factor(edu_cat)2
## 43.5843373 0.4036145 0.4939759
## hh_inc
## 63.1325301
##
## $x.mean.B
## (Intercept) LBXNFOS as.factor(RIAGENDR)Male
## 1.0000000 3.6738506 0.5373563
## age as.factor(edu_cat)1 as.factor(edu_cat)2
## 41.2413793 0.6666667 0.2385057
## hh_inc
## 50.9770115
##
## $x.mean.diff
## (Intercept) LBXNFOS as.factor(RIAGENDR)Male
## 0.00000000 2.95807714 -0.01928403
## age as.factor(edu_cat)1 as.factor(edu_cat)2
## 2.34295804 -0.26305221 0.25547016
## hh_inc
## 12.15551863
decomp_res$y
## $y.A
## [1] 120.9398
##
## $y.B
## [1] 117.8908
##
## $y.diff
## [1] 3.048954
decomp_res$threefold$overall
## coef(endowments) se(endowments) coef(coefficients) se(coefficients)
## -0.4514409 1.1126093 3.3260037 1.4555947
## coef(interaction) se(interaction)
## 0.1743917 1.3304903
plot(decomp_res, components = c("endowments", "coefficients"))
Given that race are socially- (human) constructed categories, it is a critical discussion point to try to understand what is meant when are thinking about “effects” of race or even associations with race, specifically as we are entering these variables into statistical models and trying to interpret the parameter estimates from those models. A lack of clarity on what is intended, and how the models match that intent, at best leads to lack of clarity and justification for the interpretation. At worst, they lead to incorrect estimates as well as lending themselves to stereotypical and even racist interpretations.
An excellent paper on this issue was published by Robinson and Vanderweele in 2014. This should be read by anyone seeing to look at racial/ethnic disparities, but provides insights on issues in studying disparities along other social constructed categories such as gender and socioeconomic status as well.
Make sure to also read the short commentaries and responses to this paper, especially the one by Jay Kaufman which sets this work in the context of the broader study of on the
For any particular study, we need to be clearer about how we believe race or ethnicity to be relevant to the causes being studied.
A few examples:
For each, this would determine the relevant exposure measure and categorization. It would also determine what are appropriate mediating or intervenable variables under study.
Entering “Race” (R) into a diagram.
Entering components such as culture (C), parental appearance (PP), own appearance (P), genetic ancestry (G) as separate elements that make up the construct to correspond to different effects of racism, own background, etc.