Week 11 - Missing Data

Assignment Schedule and Due Dates

Updated assignments and due dates:

ASAP after selecting data & hypothesis, schedule to meet with me about analysis plan

NOV 10 (Fri): DRAFT of analytic plan due

NOV 8, 15, 22: Analytic workshop - to refine plans, coding, etc.

NOV 24 (Fri): DRAFT of analyses with code due

DEC 08 (Fri): Final written report

Post your written reports to Laulima by the end of the due date

Overview

Today:

“visdat” and “naniar” packages
Consequences of missing data
Inverse probability weighting
Multiple imputation

“visdat” and “naniar” packages

These packages helps you get a sense of your data, particularly patterns of missing data.

This helps you understand: 1) if you have a missing data problem 2) how you might deal with it.

You can find more details in the documentation here.

Another potentially useful package is VIM.

There are two basic types of missingness (these are my names for them) and the “naniar” package can help you see them:

Structural - Questions were not asked to large groups of people either because of a skip logic, (sub-) sampling, i.e. for “big picture” reasons.
Sporadic - Questions were not answered by individuals, because they didn’t know or didn’t want to answer. i.e. for individual reasons.

For (1) you’ll need to think about how to address that in your question and study design – i.e. Do you want to just study a sub-group? Consult with experts and past papers on the subject.

For (2) you’ll need to think about why these values are missing and whether it’s conceivable that these are Not At Random – specifically the reasons they are missing are related to your exposure and outcome e.g. people who are less healthy may be more likely to miss data on them. If so, complete case analyses are not the right solution!

Let’s look at the missingness from our data, which has some of both:

analytic_set <- readxl::read_xlsx(paste0(alt_path, "analytic_set.xlsx"))

vis_dat(analytic_set)

vis_miss(analytic_set)

Let’s look at the relationship between PFAS and Blood Pressure, split by missingness for income ratio:

ggplot(analytic_set) + geom_point(aes(x = log2(LBXNFOS), y = BPXSY1, color = is.na(inc_ratio))) + theme_bw()

Now for folate:

ggplot(analytic_set) + geom_point(aes(x = log2(LBXNFOS), y = BPXSY1, color = is.na(LBDRFOSI))) + theme_bw()

Consequences of missing data

Looking again at the PFAS -> BP dataset. Let’s assume that age and education category are the only confounders and that we have a dataset representing the full data (i.e. we know what the correct answer is):

full_set <- readxl::read_xlsx(paste0(alt_path, "full_set.xlsx"))

summary(lm(full_set, formula = BP ~ LBXNFOS + age + edu_cat))

## 
## Call:
## lm(formula = BP ~ LBXNFOS + age + edu_cat, data = full_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8740  -3.3931  -0.0483   3.6030  19.1514 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.427681   0.517542 192.115   <2e-16 ***
## LBXNFOS      0.056396   0.028227   1.998    0.046 *  
## age          0.009796   0.010687   0.917    0.360    
## edu_cat      0.146321   0.235201   0.622    0.534    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.151 on 1152 degrees of freedom
## Multiple R-squared:  0.005837,   Adjusted R-squared:  0.003248 
## F-statistic: 2.255 on 3 and 1152 DF,  p-value: 0.08043

The coefficient is 0.056. Remember this number. How do we interpret it again?

What if there was some censorship such that we were not able to collect BP outcome data on some large segment of the population:

censor_set <- readxl::read_xlsx(paste0(alt_path, "censor_set.xlsx"))

gg_miss_var(censor_set)

And we could see that this missingness was dependent on edu_cat among other things. What can we do about it?

gg_miss_var(censor_set, facet = edu_cat)

Inverse probability weighting

First let’s see what happens when we do nothing, and just look at the association for those who have outcome values:

mdl_out <- summary(lm(censor_set, formula = BP ~ LBXNFOS + age + edu_cat))
mdl_out

## 
## Call:
## lm(formula = BP ~ LBXNFOS + age + edu_cat, data = censor_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9157  -3.2803  -0.0462   3.5222  19.2350 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.983e+01  6.091e-01 163.902   <2e-16 ***
## LBXNFOS     7.279e-02  4.798e-02   1.517    0.130    
## age         2.532e-04  1.297e-02   0.020    0.984    
## edu_cat     4.611e-02  2.743e-01   0.168    0.867    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.181 on 798 degrees of freedom
##   (354 observations deleted due to missingness)
## Multiple R-squared:  0.003126,   Adjusted R-squared:  -0.0006215 
## F-statistic: 0.8342 on 3 and 798 DF,  p-value: 0.4753

round(mdl_out$coefficients[2,1], 2)

## [1] 0.07

This coefficient is very wrong.

If we know the reasons for the missingness, we can restore the data distribution to what it might have looked like without the censorship through inverse probability weighting – Remember that adjustments operated by similar principles. But here rather than adjusting for observed covariates – we’re actually adjusting for the chance of being missing.

We do this first by fitting a model to predict the chance that the outcome measure is missing based on covariates that determine that missing (like edu_cat):

IPCW_model <- glm(data = mutate(censor_set, OBSERVED = 1 - CENSOR), 
                  formula = CENSOR ~ LBXNFOS + age + edu_cat, family = binomial(link = "logit"))

Then, we create a weight based on the inverse of this probability of missingness (1/probability). We use this weight variable to adjust our original regression model:

censor_set %>% 
  add_column(IPCW = 1/predict(IPCW_model, newdata = censor_set, type = "response")) %>% 
  #ggplot() + geom_density(aes(IPCW, fill = as.factor(1-CENSOR), alpha = 0.4))
  lm(formula = BP ~ LBXNFOS + age + edu_cat, weights = IPCW) %>% summary()

## 
## Call:
## lm(formula = BP ~ LBXNFOS + age + edu_cat, data = ., weights = IPCW)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.878  -6.278  -0.095   6.680  32.363 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.772301   0.571518 174.574   <2e-16 ***
## LBXNFOS      0.059207   0.059163   1.001    0.317    
## age          0.001983   0.013187   0.150    0.881    
## edu_cat      0.086048   0.278757   0.309    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.951 on 798 degrees of freedom
##   (354 observations deleted due to missingness)
## Multiple R-squared:  0.001598,   Adjusted R-squared:  -0.002156 
## F-statistic: 0.4257 on 3 and 798 DF,  p-value: 0.7346

This result is much closer to the unbiased (true) answer. Note that though we were able to get the unbiased estimate, the coefficient is not now statistically significant reflecting the greater uncertainty in the model (and using a smaller sample size). This should remind us that statisical significance and p-values are definitely not related to judging whether a result is unbiased or correct.

Multiple imputation

Let us take a slightly different scenario where we have exposure and outcomes measured, but some people did not report on their own educational category. A naive approach is just to do the complete case analysis:

miss_set <- readxl::read_xlsx(paste0(alt_path, "miss_set.xlsx"))

summary(lm(data = miss_set, formula = BP ~ LBXNFOS + age + edu_cat))

## 
## Call:
## lm(formula = BP ~ LBXNFOS + age + edu_cat, data = miss_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6480  -3.4689  -0.0455   3.6768  19.0321 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.107448   0.603227 164.296   <2e-16 ***
## LBXNFOS     -0.009362   0.048001  -0.195   0.8454    
## age          0.023041   0.012764   1.805   0.0714 .  
## edu_cat      0.242847   0.290511   0.836   0.4035    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.087 on 774 degrees of freedom
##   (378 observations deleted due to missingness)
## Multiple R-squared:  0.005406,   Adjusted R-squared:  0.001551 
## F-statistic: 1.402 on 3 and 774 DF,  p-value: 0.2408

You can see that answer is very wrong.

Another tool in the toolbox, especially for missing covariates, is to restore the missing values based on their relationships with covariates. Particularly this works if the missingness of the data can be fully explained by observable values, i.e. Missing Conditionally At Random (or MAR).

We do this by replacing the missing covariate values with observed values from individuals with similar values of other covariates. Moreover, we don’t do this once, but we do this a large number of times to fully represent the range of values that this missing covariate could take. Usually, we draw as many samples (imputation) as a function of the percentage of data that are missing, which roughly corresponds to the percent missing (but not exactly).

This method is called Multivariate Imputations by Chained Equations (MICE), where the “chained equations” part describes how it imputes covariates iteratively in the situations where multiple covariates are missing. Here we just have one covariate “edu_cat” that has missing data:

miss_set %>% count(is.na(edu_cat)) # about 33% missing

imputed_Data <- mice::mice(miss_set, m = 33, method = 'pmm', seed = 1)

## 
##  iter imp variable
##   1   1  edu_cat
##   1   2  edu_cat
##   1   3  edu_cat
##   1   4  edu_cat
##   1   5  edu_cat
##   1   6  edu_cat
##   1   7  edu_cat
##   1   8  edu_cat
##   1   9  edu_cat
##   1   10  edu_cat
##   1   11  edu_cat
##   1   12  edu_cat
##   1   13  edu_cat
##   1   14  edu_cat
##   1   15  edu_cat
##   1   16  edu_cat
##   1   17  edu_cat
##   1   18  edu_cat
##   1   19  edu_cat
##   1   20  edu_cat
##   1   21  edu_cat
##   1   22  edu_cat
##   1   23  edu_cat
##   1   24  edu_cat
##   1   25  edu_cat
##   1   26  edu_cat
##   1   27  edu_cat
##   1   28  edu_cat
##   1   29  edu_cat
##   1   30  edu_cat
##   1   31  edu_cat
##   1   32  edu_cat
##   1   33  edu_cat
##   2   1  edu_cat
##   2   2  edu_cat
##   2   3  edu_cat
##   2   4  edu_cat
##   2   5  edu_cat
##   2   6  edu_cat
##   2   7  edu_cat
##   2   8  edu_cat
##   2   9  edu_cat
##   2   10  edu_cat
##   2   11  edu_cat
##   2   12  edu_cat
##   2   13  edu_cat
##   2   14  edu_cat
##   2   15  edu_cat
##   2   16  edu_cat
##   2   17  edu_cat
##   2   18  edu_cat
##   2   19  edu_cat
##   2   20  edu_cat
##   2   21  edu_cat
##   2   22  edu_cat
##   2   23  edu_cat
##   2   24  edu_cat
##   2   25  edu_cat
##   2   26  edu_cat
##   2   27  edu_cat
##   2   28  edu_cat
##   2   29  edu_cat
##   2   30  edu_cat
##   2   31  edu_cat
##   2   32  edu_cat
##   2   33  edu_cat
##   3   1  edu_cat
##   3   2  edu_cat
##   3   3  edu_cat
##   3   4  edu_cat
##   3   5  edu_cat
##   3   6  edu_cat
##   3   7  edu_cat
##   3   8  edu_cat
##   3   9  edu_cat
##   3   10  edu_cat
##   3   11  edu_cat
##   3   12  edu_cat
##   3   13  edu_cat
##   3   14  edu_cat
##   3   15  edu_cat
##   3   16  edu_cat
##   3   17  edu_cat
##   3   18  edu_cat
##   3   19  edu_cat
##   3   20  edu_cat
##   3   21  edu_cat
##   3   22  edu_cat
##   3   23  edu_cat
##   3   24  edu_cat
##   3   25  edu_cat
##   3   26  edu_cat
##   3   27  edu_cat
##   3   28  edu_cat
##   3   29  edu_cat
##   3   30  edu_cat
##   3   31  edu_cat
##   3   32  edu_cat
##   3   33  edu_cat
##   4   1  edu_cat
##   4   2  edu_cat
##   4   3  edu_cat
##   4   4  edu_cat
##   4   5  edu_cat
##   4   6  edu_cat
##   4   7  edu_cat
##   4   8  edu_cat
##   4   9  edu_cat
##   4   10  edu_cat
##   4   11  edu_cat
##   4   12  edu_cat
##   4   13  edu_cat
##   4   14  edu_cat
##   4   15  edu_cat
##   4   16  edu_cat
##   4   17  edu_cat
##   4   18  edu_cat
##   4   19  edu_cat
##   4   20  edu_cat
##   4   21  edu_cat
##   4   22  edu_cat
##   4   23  edu_cat
##   4   24  edu_cat
##   4   25  edu_cat
##   4   26  edu_cat
##   4   27  edu_cat
##   4   28  edu_cat
##   4   29  edu_cat
##   4   30  edu_cat
##   4   31  edu_cat
##   4   32  edu_cat
##   4   33  edu_cat
##   5   1  edu_cat
##   5   2  edu_cat
##   5   3  edu_cat
##   5   4  edu_cat
##   5   5  edu_cat
##   5   6  edu_cat
##   5   7  edu_cat
##   5   8  edu_cat
##   5   9  edu_cat
##   5   10  edu_cat
##   5   11  edu_cat
##   5   12  edu_cat
##   5   13  edu_cat
##   5   14  edu_cat
##   5   15  edu_cat
##   5   16  edu_cat
##   5   17  edu_cat
##   5   18  edu_cat
##   5   19  edu_cat
##   5   20  edu_cat
##   5   21  edu_cat
##   5   22  edu_cat
##   5   23  edu_cat
##   5   24  edu_cat
##   5   25  edu_cat
##   5   26  edu_cat
##   5   27  edu_cat
##   5   28  edu_cat
##   5   29  edu_cat
##   5   30  edu_cat
##   5   31  edu_cat
##   5   32  edu_cat
##   5   33  edu_cat

summary(imputed_Data)

## Class: mids
## Number of multiple imputations:  33 
## Imputation methods:
##     age edu_cat LBXNFOS      BP 
##      ""   "pmm"      ""      "" 
## PredictorMatrix:
##         age edu_cat LBXNFOS BP
## age       0       1       1  1
## edu_cat   1       0       1  1
## LBXNFOS   1       1       0  1
## BP        1       1       1  0

densityplot(imputed_Data)

Then we run the regression in each of the imputed sets and pool them together by:

Taking the average (mean) as the point estimate
Finding the Standard Error through special rules (Rubin’s Rules) to combine within vs. across imputation Standard Errors

mice_model <- with(imputed_Data, lm(BP ~ LBXNFOS + age + edu_cat))
summary(pool(mice_model))

We can see, not only do we pretty much get the correct coefficient, but the p-value is closer to the true value as well!

Fin.