Assignment Schedule and Due Dates

Updated assignments and due dates:

  • ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan
  • NOV 10 (Fri): DRAFT of analytic plan due
  • NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.
  • NOV 24 (Fri): DRAFT of analyses with code due
  • DEC 08 (Fri): Final written report
    • Post your written reports to Laulima by the end of the due date

Looking for datasets?

data.mendeley.com Resource


Overview

Today:

Review constructing measures
Review measures of assocations RD, RR, and ORs
Discuss concepts of social exposures
Decomposition of effects (discrimination, etc.)


Measures of Associations

Constructing Measures

merged_final <- readxl::read_xlsx(paste0(alt_path, "nhanes_merged_18.xlsx"))

# Re-code some text variables to numeric, making some decisions about how to treat top-coded variables
merged_final_clean <- 
  merged_final %>% 
  mutate(age = case_when(RIDAGEYR == "80 years of age and over" ~ 80,
                         T ~ as.numeric(RIDAGEYR)),
         inc_ratio = case_when(INDFMPIR == "Value greater than or equal to 5.00" ~ 5,
                               T ~ as.numeric(INDFMPIR))) 

# Create a joint "edu_cat" variable by first re-coding DMDEDUC2 if they have it. 
# And then replacing missing values with DMDEDUC3, since I know they are mutually exclusive
merged_final_clean <- 
  merged_final_clean %>% 
  mutate(edu_cat = case_when(DMDEDUC2 %in% c("Less than 9th grade", "9-11th grade (Includes 12th grade with no diploma)") ~ 0,
                             DMDEDUC2 %in% c("High school graduate/GED or equivalent", "Some college or AA degree") ~ 1,
                             DMDEDUC2 == "College graduate or above" ~ 2,
                             T ~ NA_real_)) %>% 
  mutate(edu_cat = case_when(is.na(DMDEDUC3) ~ edu_cat, # keep DMDEDUC2 value if it has it
                             DMDEDUC3 %in% c("High school graduate", "More than high school", "GED or equivalent") ~ 1,
                             T ~ 0)) 

# merged_final_clean %>% count(edu_cat) # 1379 that were originally missing both + 13 that had DMDEDUC2 == "Don't Know" or "Refused"

# Create a household income variable "hh_inc" expressed in $1000s and dealing with top-coding
merged_final_clean <- 
  merged_final_clean %>% 
  mutate(hh_inc = case_when(INDHHIN2 %in% c("Don't know", "Refused") | is.na(INDHHIN2) ~ NA_real_,
                            INDHHIN2 == "$20,000 and Over" ~ 20,
                            INDHHIN2 == "Under $20,000" ~ 15,
                            INDHHIN2 == "$100,000 and Over" ~ 100,
                            T ~ as.numeric(substring(INDHHIN2,2,3)))) 
  
# merged_final_clean %>% count(hh_inc)

# Add relevant labels for variables and educational categories
# Add a binary outcome variable, hi_bp, for blood pressure >= 140 systolic or >= 90 diastolic
# Add a binary exposure variable, hi_pfas, for pfas greater than the median observed value
merged_final_clean <- 
  merged_final_clean %>% 
  mutate(edu_cat = factor(edu_cat, levels = c(0,1,2), labels = c("/u003C HS", "HS", "College"), ordered = T)) %>% 
   mutate(hi_bp = factor(case_when(BPXSY1 >= 140 | BPXDI1 >= 90 ~ 1, 
                           is.na(BPXSY1) | is.na(BPXDI1) ~ NA_real_,
                           T ~ 0), levels = c(0,1), labels = c("Normal BP", "High BP"))) %>% 
    mutate(hi_pfas = factor(case_when(is.na(LBXNFOS) ~ NA_real_,
                             LBXNFOS > quantile(.$LBXNFOS, c(0.5), na.rm = T) ~ 1,
                             T ~ 0), levels = c(0,1), labels = c("Normal PFOS", "High PFOS"))) %>% 
  mutate(RIAGENDR = structure(RIAGENDR, label = "Participant Sex"),
         age = structure(age, label = "Participant Age (years)"),
         RIDRETH3 = structure(RIDRETH3, label = "Self-reported Race/Ethnicity"),
         edu_cat = structure(edu_cat, label = "Highest Completed Education"),
         hh_inc = structure(hh_inc, label = "Household Income (in $1000s)"),
         inc_ratio = structure(inc_ratio, label = "Family Income:Poverty Line Ratio"),
         BPXSY1 = structure(BPXSY1, label = "Systolic Blood Pressure (mmHg)"),
         BPXDI1 = structure(BPXDI1, label = "Diastolic Blood Pressure (mmHg)"),
         LBDRFOSI = structure(LBDRFOSI, label = "RBC Folate (ng/mL)"),
         LBXNFOA = structure(LBXNFOA, label = "Plasma PFOA (ng/mL)"),
         LBXNFOS = structure(LBXNFOS, label = "Plasma PFOS (ng/mL)"),
         LBXPFNA = structure(LBXPFNA, label = "Plasma PFNA (ng/mL)"),
         LBXPFHS = structure(LBXPFHS, label = "Plasma PFHxS (ng/mL)"),
         hi_bp = structure(hi_bp, label = "High Blood Pressure")) 
# A descriptive table restricted to those with blood pressure AND PFOA data, and  stratified by blood pressure:
tbl1 <- 
 merged_final_clean %>% 
  filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>% 
  filter(age >= 18 & age <= 65) %>% 
  table1(~ RIAGENDR + RIDRETH3 + age + as.factor(edu_cat) + hh_inc + inc_ratio + BPXSY1 + BPXDI1 + LBDRFOSI + LBXNFOA + LBXNFOS + LBXPFNA + LBXPFHS | hi_bp, data = .,
         topclass="Rtable1-zebra",
         overall = F, extra.col=list(`P-value`=pvalue),
         render.continuous=c(.="Mean (SD)", 
                             .="Median [MIN, MAX]",
                             "IQR [25%ile, 75%ile]" ="IQR [Q1, Q3]",
                           "N Missing" = "NMISS"))
tbl1
Normal BP
(N=969)
High BP
(N=188)
P-value
Participant Sex
Female 493 (50.9%) 83 (44.1%) 0.108
Male 476 (49.1%) 105 (55.9%)
Self-reported Race/Ethnicity
Mexican American 153 (15.8%) 35 (18.6%) <0.001
Non-Hispanic Asian 151 (15.6%) 24 (12.8%)
Non-Hispanic Black 193 (19.9%) 67 (35.6%)
Non-Hispanic White 326 (33.6%) 37 (19.7%)
Other Hispanic 92 (9.5%) 18 (9.6%)
Other Race - Including Multi-Racial 54 (5.6%) 7 (3.7%)
Participant Age (years)
Mean (SD) 39.9 (14.5) 53.4 (9.68) <0.001
Median [MIN, MAX] 39.0 [18.0, 65.0] 56.0 [23.0, 65.0]
IQR [25%ile, 75%ile] 26.0 [27.0, 53.0] 13.0 [48.0, 61.0]
N Missing 0 0
Highest Completed Education
/u003C HS 176 (18.2%) 49 (26.1%) 0.0429
HS 573 (59.1%) 102 (54.3%)
College 219 (22.6%) 37 (19.7%)
Missing 1 (0.1%) 0 (0%)
Household Income (in $1000s)
Mean (SD) 48.6 (32.4) 46.6 (33.8) 0.477
Median [MIN, MAX] 45.0 [0, 100] 35.0 [0, 100]
IQR [25%ile, 75%ile] 55.0 [20.0, 75.0] 55.0 [20.0, 75.0]
N Missing 78 16
Missing 78 (8.0%) 16 (8.5%)
Family Income:Poverty Line Ratio
Mean (SD) 2.47 (1.66) 2.58 (1.61) 0.442
Median [MIN, MAX] 2.02 [0, 5.00] 2.34 [0, 5.00]
IQR [25%ile, 75%ile] 3.02 [1.06, 4.08] 3.03 [1.19, 4.22]
N Missing 112 30
Missing 112 (11.6%) 30 (16.0%)
Systolic Blood Pressure (mmHg)
Mean (SD) 116 (11.1) 149 (14.5) <0.001
Median [MIN, MAX] 116 [88.0, 138] 146 [122, 216]
IQR [25%ile, 75%ile] 18.0 [108, 126] 12.5 [142, 154]
N Missing 0 0
Diastolic Blood Pressure (mmHg)
Mean (SD) 70.1 (10.3) 85.5 (13.8) <0.001
Median [MIN, MAX] 72.0 [0, 88.0] 88.0 [0, 124]
IQR [25%ile, 75%ile] 12.0 [64.0, 76.0] 16.5 [77.5, 94.0]
N Missing 0 0
RBC Folate (ng/mL)
Mean (SD) 1070 (415) 1100 (657) 0.818
Median [MIN, MAX] 1010 [379, 3340] 976 [219, 4540]
IQR [25%ile, 75%ile] 469 [791, 1260] 544 [751, 1300]
N Missing 483 133
Missing 483 (49.8%) 133 (70.7%)
Plasma PFOA (ng/mL)
Mean (SD) 1.59 (2.24) 1.70 (1.14) 0.311
Median [MIN, MAX] 1.20 [0.0700, 52.8] 1.50 [0.100, 9.50]
IQR [25%ile, 75%ile] 1.10 [0.700, 1.80] 1.13 [1.00, 2.13]
N Missing 0 0
Plasma PFOS (ng/mL)
Mean (SD) 4.20 (5.21) 5.73 (7.26) 0.0065
Median [MIN, MAX] 2.80 [0.0700, 63.1] 3.60 [0.100, 59.9]
IQR [25%ile, 75%ile] 3.10 [1.60, 4.70] 3.25 [2.18, 5.43]
N Missing 0 0
Plasma PFNA (ng/mL)
Mean (SD) 0.523 (0.526) 0.630 (0.492) 0.00735
Median [MIN, MAX] 0.400 [0.0700, 6.50] 0.500 [0.0700, 3.00]
IQR [25%ile, 75%ile] 0.400 [0.200, 0.600] 0.500 [0.300, 0.800]
N Missing 0 0
Plasma PFHxS (ng/mL)
Mean (SD) 1.61 (2.78) 1.55 (1.63) 0.709
Median [MIN, MAX] 1.00 [0.0700, 48.8] 1.10 [0.100, 17.0]
IQR [25%ile, 75%ile] 1.20 [0.500, 1.70] 1.10 [0.800, 1.90]
N Missing 0 0
# Create a working analytic set
# Previously, we had edu_cat, hi_bp, and hi_pfas as labelled variables to make the tables nices
# For ease of analysis, I'm going to transform them back to numbers

analytic_set <- 
  merged_final_clean %>% 
  #filter(!is.na(BPXSY1) & !is.na(LBXNFOS)) %>% 
  filter(!is.na(hi_bp) & !is.na(hi_pfas)) %>% 
  filter(age >= 18 & age <= 65) %>% # Restricting our analytic set as we decided above
  mutate(edu_cat = as.numeric(edu_cat)-1,
         hi_bp = as.numeric(hi_bp)-1,
         hi_pfas = as.numeric(hi_pfas)-1) 

Recap of regression-based estimates

Recall, we showed mean difference in systolic blood pressures by 1 unit (ng/mL) change in PFOS:

mdl <- lm(data = analytic_set, formula = BPXSY1 ~ LBXNFOS) 
summary(mdl)
## 
## Call:
## lm(formula = BPXSY1 ~ LBXNFOS, data = analytic_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.566 -11.153  -2.349   8.651  95.652 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 119.74715    0.62407 191.880  < 2e-16 ***
## LBXNFOS       0.40034    0.08712   4.595  4.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.64 on 1155 degrees of freedom
## Multiple R-squared:  0.01795,    Adjusted R-squared:  0.0171 
## F-statistic: 21.11 on 1 and 1155 DF,  p-value: 4.802e-06
beta = round(summary(mdl)$coefficients[2,1], 2)
ci_lb = round(summary(mdl)$coefficients[2,1] - 1.96*summary(mdl)$coefficients[2,2], 2)
ci_ub = round(summary(mdl)$coefficients[2,1] + 1.96*summary(mdl)$coefficients[2,2], 2)

# We also showed the diagnostics for this regression
# mdl %>% ggplot() + geom_point(aes(x = .fitted, y = .resid)) + geom_hline(yintercept = 0) # one way
# ggResidpanel::resid_panel(mdl) # more comprehensive way

We observe a 0.4 mmHg higher in systolic blood pressure per 1 ng/mL higher PFAS exposure (beta = 0.4, 95% CI: 0.23, 0.57).

What if we want to at the extent to which PFAS might explain racial / ethnic disparities in blood pressure?


Decomposition of Racial / Ethnic Disparities

One approach is to use a Kitagawa-Oaxaca-Blinder -type of effect decomposition approach. First proposed by Evelyn Mae Kitagawa a sociologist and demographer in 1955, it was unfortunately attributed to two male economist 20 years later (Ronald Oaxaca and Alan Blinder) in what is an unfortunately typical trends in the sciences.

This decomposition partitions the total effect size, i.e. the mean difference between two groups, into the difference in distribution of a cause of the outcome (called “endowments”) and the differences in effect size for that cause (called “coefficients”).

Basic decomposition of Asian-White disparities

analytic_set2 <- mutate(analytic_set, ASIAN = case_when(RIDRETH3 == "Non-Hispanic Asian" ~ F,
                                                        RIDRETH3 == "Non-Hispanic White" ~ T,
                                                        T ~ NA),
                        PFOS_center = LBXNFOS / mean(analytic_set$LBXNFOS))
decomp_res <- oaxaca(formula = BPXSY1 ~ LBXNFOS | ASIAN, data = analytic_set2, R = 100)
## oaxaca: oaxaca() performing analysis. Please wait.
## 
## Bootstrapping standard errors:
## 1 / 100 (1%)
## 10 / 100 (10%)
## 20 / 100 (20%)
## 30 / 100 (30%)
## 40 / 100 (40%)
## 50 / 100 (50%)
## 60 / 100 (60%)
## 70 / 100 (70%)
## 80 / 100 (80%)
## 90 / 100 (90%)
## 100 / 100 (100%)
decomp_res$x
## $x.mean.A
## (Intercept)     LBXNFOS 
##    1.000000    6.679429 
## 
## $x.mean.B
## (Intercept)     LBXNFOS 
##    1.000000    3.678237 
## 
## $x.mean.diff
## (Intercept)     LBXNFOS 
##    0.000000    3.001192
decomp_res$y
## $y.A
## [1] 120.8686
## 
## $y.B
## [1] 118.0826
## 
## $y.diff
## [1] 2.785927
decomp_res$threefold$overall
##   coef(endowments)     se(endowments) coef(coefficients)   se(coefficients) 
##          1.5738764          0.8672679          2.2596093          1.1940000 
##  coef(interaction)    se(interaction) 
##         -1.0475589          0.9419813
plot(decomp_res, components = c("endowments", "coefficients"))

## The effect on Systolic BP due to difference in exposure levels (in %):
round(decomp_res$threefold$overall[1]/decomp_res$y$y.diff, 2)*100
## coef(endowments) 
##               56
## The effect on Systolic BP due to difference in effect (in %): 
round((decomp_res$threefold$overall[3] + decomp_res$threefold$overall[5])/decomp_res$y$y.diff, 2)*100
## coef(coefficients) 
##                 44
## The difference in effect estimate in Asians:
decomp_res$beta$beta.diff["LBXNFOS"]
##    LBXNFOS 
## -0.3490476
#plot(decomp_res,decomposition="twofold",group.weight=-1)

Importance of unincluded variables

analytic_set2 <- mutate(analytic_set, ASIAN = case_when(RIDRETH3 == "Non-Hispanic Asian" ~ F,
                                                        RIDRETH3 == "Non-Hispanic White" ~ T,
                                                        T ~ NA),
                        PFOS_center = LBXNFOS / mean(analytic_set$LBXNFOS))
decomp_res <- oaxaca(formula = BPXSY1 ~ LBXNFOS + as.factor(RIAGENDR) + age + as.factor(edu_cat) + hh_inc | ASIAN, data = analytic_set2, R = 100)
## oaxaca: oaxaca() performing analysis. Please wait.
## 
## Bootstrapping standard errors:
## 1 / 100 (1%)
## 10 / 100 (10%)
## 20 / 100 (20%)
## 30 / 100 (30%)
## 40 / 100 (40%)
## 50 / 100 (50%)
## 60 / 100 (60%)
## 70 / 100 (70%)
## 80 / 100 (80%)
## 90 / 100 (90%)
## 100 / 100 (100%)
decomp_res$x
## $x.mean.A
##             (Intercept)                 LBXNFOS as.factor(RIAGENDR)Male 
##               1.0000000               6.6319277               0.5180723 
##                     age     as.factor(edu_cat)1     as.factor(edu_cat)2 
##              43.5843373               0.4036145               0.4939759 
##                  hh_inc 
##              63.1325301 
## 
## $x.mean.B
##             (Intercept)                 LBXNFOS as.factor(RIAGENDR)Male 
##               1.0000000               3.6738506               0.5373563 
##                     age     as.factor(edu_cat)1     as.factor(edu_cat)2 
##              41.2413793               0.6666667               0.2385057 
##                  hh_inc 
##              50.9770115 
## 
## $x.mean.diff
##             (Intercept)                 LBXNFOS as.factor(RIAGENDR)Male 
##              0.00000000              2.95807714             -0.01928403 
##                     age     as.factor(edu_cat)1     as.factor(edu_cat)2 
##              2.34295804             -0.26305221              0.25547016 
##                  hh_inc 
##             12.15551863
decomp_res$y
## $y.A
## [1] 120.9398
## 
## $y.B
## [1] 117.8908
## 
## $y.diff
## [1] 3.048954
decomp_res$threefold$overall
##   coef(endowments)     se(endowments) coef(coefficients)   se(coefficients) 
##         -0.4514409          1.1126093          3.3260037          1.4555947 
##  coef(interaction)    se(interaction) 
##          0.1743917          1.3304903
plot(decomp_res, components = c("endowments", "coefficients"))

Issues around conceptualizing race & ethnicity for research

What is the role of “race” in causality?

Given that race are socially- (human) constructed categories, it is a critical discussion point to try to understand what is meant when are thinking about “effects” of race or even associations with race, specifically as we are entering these variables into statistical models and trying to interpret the parameter estimates from those models. A lack of clarity on what is intended, and how the models match that intent, at best leads to lack of clarity and justification for the interpretation. At worst, they lead to incorrect estimates as well as lending themselves to stereotypical and even racist interpretations.

An excellent paper on this issue was published by Robinson and Vanderweele in 2014. This should be read by anyone seeing to look at racial/ethnic disparities, but provides insights on issues in studying disparities along other social constructed categories such as gender and socioeconomic status as well.

Make sure to also read the short commentaries and responses to this paper, especially the one by Jay Kaufman which sets this work in the context of the broader study of on the

Key message on race in disparity analyses

For any particular study, we need to be clearer about how we believe race or ethnicity to be relevant to the causes being studied.

A few examples:

  • interpersonal discrimination / racism – based on appearance
  • systematic, institutional, or legal racism – based on family history by externally assigned categories
  • family, inter-generational cultural practices – based on family’s self-identification
  • own internalized behaviors – based on own self-identification
  • intergenerational transmission of resources – based on measures of wealth/income and not race per se

For each, this would determine the relevant exposure measure and categorization. It would also determine what are appropriate mediating or intervenable variables under study.

Consider the following DAG

1. “RACE”

Entering “Race” (R) into a diagram.


2. Race, de-constructed

Entering components such as culture (C), parental appearance (PP), own appearance (P), genetic ancestry (G) as separate elements that make up the construct to correspond to different effects of racism, own background, etc.