Setup

We will use the following libraries.

library(data.table)
library(parallel)
library(ggplot2)
library(brglm2)
library(emmeans)

Types

Missing data can lead to loss of statistical power, introduce bias and compromise the causal interpretation of experiments. The fundamental rule is that prevention is better than cure: it is far better to reduce missingness by design than to try to fix the problem after the fact. Missingness relates to both predictors \(X\) and response variables \(Y\). There are three common classes:

Missing completely at random: \(Pr(\text{Missing} | Y, X) = Pr(\text{Missing})\)

Missing at random: \(Pr(\text{Missing}|Y, X) = Pr(\text{Missing}|Y_{obs}, X_{obs})\)

Missing not at random: \(Pr(\text{Missing}|Y, X)\) depends on \(Y_{mis}\) and/or \(X_{mis}\) even after conditioning on the observed data

MNAR (on the response variable) can equate to censoring, such as when all people earning more than 100k refuse to respond. In this extreme case, the imputation model would need to predict beyond the range of the observed data, which can be problematic. At a minimum, covariates that plausibly predict missingness of the response should be included in the analysis model, with the aim of bringing the MNAR situation closer to MAR. When MNAR relates to unobserved predictors, it should either be modelled explicitly or you have to live with some level of bias in your results.

All this said, the records that end up missing are usually missing for a variety of reasons rather than sitting neatly under one of the above classifications.

Logistic regression example

Imagine an experiment where we have some intervention that is intended to prevent relapse in a cohort of alcoholics. The baseline rate of relapse is slightly higher in males than in females, and the treatment is equally effective in both groups on the log-odds scale. However, a treatment effect that is constant on the log-odds scale does not mean that the chance of relapse reduces by the same amount on the probability scale.

The following settings are used throughout. If you want things to run quicker then just reduce nbatch to 3 or lower.

n <- 100
nbatch <- 10
nsim <- 1000
ndel <- 25
# Design matrix covering each trt x sex combination (used later to predict probabilities by group)
XX <- data.table(CJ(1, trt = 0:1, sex = 0:1))
# Intercept (control females), treatment odds ratio 0.2, male odds ratio 1.75
b <- c(qlogis(0.65), log(0.2), log(1.75))
eta <- as.matrix(XX) %*% b
# Probability of event in each group
p <- as.numeric(plogis(eta))

df0 <- data.table(p = c(p, mean(p[1:2]), mean(p[3:4])), 
                  variable = c("ctl_f", "ctl_m", 
                               "trt_f", "trt_m", 
                               "ctl", "trt"))

cbind(XX[,-1], p = p)
##    trt sex         p
## 1:   0   0 0.6500000
## 2:   0   1 0.7647059
## 3:   1   0 0.2708333
## 4:   1   1 0.3939394

It is worth mentioning now that the properties of the ML estimator for GLMs, and of the inferential procedures that depend on its asymptotic normality, may deteriorate for small or moderate sample sizes or when the number of parameters is large relative to the number of observations. The brglm2 package provides various ways to correct for this bias by specifying method = "brglmFit" in the glm call. It gives improved frequentist properties, but it is a fair bit slower than the default glm fit; see https://link.springer.com/article/10.1007/s11222-019-09860-6 for more information.
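
As a minimal illustration of why this can matter (the toy data below are made up solely for this example), under quasi-separation the ML estimate of the treatment coefficient is unbounded (glm returns a huge coefficient), while the bias-reduced fit stays finite:

# Made-up toy data: every treated subject has the event (quasi-separation)
d_toy <- data.table(trt = rep(0:1, each = 4), y = c(0, 0, 0, 1, 1, 1, 1, 1))
coef(glm(y ~ trt, data = d_toy, family = binomial()))
coef(glm(y ~ trt, data = d_toy, family = binomial(), method = "brglmFit"))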

If we repeatedly simulate a dataset and then fit a GLM to it, we should be able to recover the parameters (the probabilities of an event in each group) that we used to simulate the data.

I am looking at the probabilities of an event by group rather than the parameters from the linear predictor because the bias in the former is much more readily apparent than in the latter.

set.seed(1)
d0 <-
  rbindlist(mclapply(1:nbatch, function(i) {
    rbindlist(lapply(1:nsim, function(j) {
      X <- data.table(
        1,
        trt = rbinom(n, 1, prob = 0.5),
        sex = rbinom(n, 1, prob = 0.5)
      )
      eta <- as.matrix(X) %*% b
      y <- rbinom(n, 1, prob = plogis(eta))
      lm1 <- glm(y ~ trt + sex, data = X, family = binomial(), 
                 method = "brglmFit")
      # probably better to use emmeans but it is dog slow.
      # s <- summary(emmeans(lm1, "trt", type = "response"))
      pp0 <- plogis(
        predict(lm1, newdata = data.table(trt = 0:1, sex = mean(X$sex)))
      )
      pp <- plogis(predict(lm1, newdata = XX))
      data.table(
        ctl_f = pp[1], ctl_m = pp[2], trt_f = pp[3], trt_m = pp[4],
        # ctl = s$prob[1], trt = s$prob[2]
        ctl = pp0[1], trt = pp0[2], n = sum(!is.na(y))
      )
    }))
  }, mc.cores = 10))

The simulation parameter estimates and average number of observed records:

rbind(p = c(p, mean(p[1:2]), mean(p[3:4]), NA),
      p_hat = colMeans(d0))
##           ctl_f     ctl_m     trt_f     trt_m       ctl       trt   n
## p     0.6500000 0.7647059 0.2708333 0.3939394 0.7073529 0.3323864  NA
## p_hat 0.6449866 0.7581273 0.2769861 0.3985623 0.7066639 0.3332507 100

You can compute the empirical standard errors by taking the sd of the estimates across the simulation results.
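
For instance, for the complete-data simulations:

# Empirical standard errors: the sd of the estimates across simulations
d0[, lapply(.SD, sd), .SDcols = c("ctl_f", "ctl_m", "trt_f", "trt_m", "ctl", "trt")]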

MCAR

Introduce 25% (high) missingness via MCAR and conduct a complete case analysis.

d1 <-
  rbindlist(mclapply(1:nbatch, function(i) {
    rbindlist(lapply(1:nsim, function(j) {
      X <- data.table(
        1,
        trt = rbinom(n, 1, prob = 0.5),
        sex = rbinom(n, 1, prob = 0.5)
      )
      eta <- as.matrix(X) %*% b
      # Missingness is completely at random (unrelated to y or the covariates)
      mis <- rep(0, n)
      mis[sample(1:n, size = ndel, replace = FALSE)] <- 1
      y <- rbinom(n, 1, prob = plogis(eta))
      # cbind(y, mis, sex = X$sex, trt = X$trt)
      y[mis == 1] <- NA
      lm1 <- glm(y ~ trt + sex, data = X, 
                 family = binomial(), 
                 method = "brglmFit")
      pp0 <- plogis(
        predict(lm1, newdata = data.table(trt = 0:1, sex = mean(X$sex)))
      )
      pp <- plogis(predict(lm1, newdata = XX))
      data.table(
        ctl_f = pp[1], ctl_m = pp[2], trt_f = pp[3], trt_m = pp[4],
        ctl = pp0[1], trt = pp0[2], n = sum(!is.na(y))
      )
    }))
  }, mc.cores = 10))

The simulation parameter estimates and average number of observed records:

rbind(p = c(p, mean(p[1:2]), mean(p[3:4]), NA),
      p_hat = colMeans(d1))
##           ctl_f     ctl_m     trt_f     trt_m       ctl       trt  n
## p     0.6500000 0.7647059 0.2708333 0.3939394 0.7073529 0.3323864 NA
## p_hat 0.6430859 0.7544981 0.2798861 0.3986755 0.7044634 0.3343952 75
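
Comparing the empirical standard errors of the group estimates from the MCAR complete case runs with those from the complete-data runs:

# Empirical SEs: complete data (d0) vs MCAR complete case analysis (d1)
rbind(complete = sapply(d0[, .(ctl_f, ctl_m, trt_f, trt_m)], sd),
      mcar     = sapply(d1[, .(ctl_f, ctl_m, trt_f, trt_m)], sd))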

Precision is lost (the SEs increased) under the complete case analysis, but the estimates remain unbiased. The dashed line shows the distribution of the parameters obtained from modeling the complete data set.

ggplot(melt(d1[, .(ctl_f,ctl_m,trt_f,trt_m)], 
            measure.vars = c("ctl_f","ctl_m","trt_f","trt_m")), 
       aes(x = value)) +
  geom_density() +
  geom_density(data = melt(d0[, .(ctl_f,ctl_m,trt_f,trt_m)], 
            measure.vars = c("ctl_f","ctl_m","trt_f","trt_m")),
            aes(x = value), lty = 2) +
  geom_vline(data = df0[variable %in% c("ctl_f","ctl_m","trt_f","trt_m"),], 
             aes(xintercept = p), lty = 3) +
  theme_bw() +
  facet_wrap(.~variable)

MAR

Now set missingness to be conditional on sex, which is observed and hence MAR.

d2 <-
  rbindlist(mclapply(1:nbatch, function(i) {
    rbindlist(lapply(1:nsim, function(j) {
      X <- data.table(
        1,
        trt = rbinom(n, 1, prob = 0.5),
        sex = rbinom(n, 1, prob = 0.5)
      )
      eta <- as.matrix(X) %*% b
      # Missingness related to being male
      mis <- rep(0, n)
      mis[which(X$sex == 1)[1:ndel]] <- 1
      y <- rbinom(n, 1, prob = plogis(eta))
      # cbind(y, mis, sex = X$sex, trt = X$trt)
      y[mis == 1] <- NA
      lm1 <- glm(y ~ trt + sex, data = X, 
                 family = binomial(),  
                 method = "brglmFit")
      pp0 <- plogis(
        predict(lm1, newdata = data.table(trt = 0:1, sex = mean(X$sex)))
      )
      pp <- plogis(predict(lm1, newdata = XX))
      data.table(
        ctl_f = pp[1], ctl_m = pp[2], trt_f = pp[3], trt_m = pp[4],
        ctl = pp0[1], trt = pp0[2], n = sum(!is.na(y))
      )
    }))
  }, mc.cores = 10))

The simulation parameter estimates and average number of observed records:

rbind(p = c(p, mean(p[1:2]), mean(p[3:4]), NA),
      p_hat = colMeans(d2))
##           ctl_f     ctl_m     trt_f     trt_m       ctl       trt  n
## p     0.6500000 0.7647059 0.2708333 0.3939394 0.7073529 0.3323864 NA
## p_hat 0.6455365 0.7524469 0.2779039 0.3989317 0.7052170 0.3338020 75

Again, the dashed line is from modeling based on the complete data set. The estimates are still unbiased.

ggplot(melt(d2, 
            measure.vars = c("ctl_f","ctl_m","trt_f","trt_m")), 
       aes(x = value)) +
  geom_density() +
  geom_density(data = melt(d0, 
            measure.vars = c("ctl_f","ctl_m","trt_f","trt_m")),
            aes(x = value), lty = 2) +
  geom_vline(data = df0[variable %in% c("ctl_f","ctl_m","trt_f","trt_m"),], 
             aes(xintercept = p), lty = 3) +
  theme_bw() +
  facet_wrap(.~variable)

If the model is misspecified (we leave sex out) then we can get into trouble.

d2b <-
  rbindlist(mclapply(1:nbatch, function(i) {
    rbindlist(lapply(1:nsim, function(j) {
      X <- data.table(
        1,
        trt = rbinom(n, 1, prob = 0.5),
        sex = rbinom(n, 1, prob = 0.5)
      )
      eta <- as.matrix(X) %*% b
      # Missingness related to being male
      mis <- rep(0, n)
      mis[which(X$sex == 1)[1:ndel]] <- 1
      y <- rbinom(n, 1, prob = plogis(eta))
      # cbind(y, mis, sex = X$sex, trt = X$trt)
      y[mis == 1] <- NA
      # cbind(y, X$sex, X$trt)
      # just model trt effect
      lm1 <- glm(y ~ trt, data = X, 
                 family = binomial(), 
                 method = "brglmFit")
      pp <- plogis(predict(lm1, newdata = XX[c(1,3),]))
      data.table(
        ctl = pp[1], trt = pp[2], n = sum(!is.na(y))
      )
    }))
  }, mc.cores = 10))

The simulation parameter estimates and average number of observed records. Because missingness depends on sex and the model no longer adjusts for it, males are under-represented among the complete cases and the marginal estimates are pulled towards the female rates:

rbind(p = c(mean(p[1:2]), mean(p[3:4]), NA),
      p_hat = colMeans(d2b))
##             ctl       trt  n
## p     0.7073529 0.3323864 NA
## p_hat 0.6828577 0.3165260 75
ggplot(melt(d2b[, .(ctl, trt)], 
            measure.vars = c("ctl", "trt")), 
       aes(x = value)) +
  geom_density() +
  geom_density(data = melt(d0[, .(ctl, trt)], 
                           measure.vars = c("ctl", "trt")),
            aes(x = value), lty = 2) +
  geom_vline(data = df0[variable %in% c("ctl", "trt"),], 
             aes(xintercept = p), lty = 3) +
  theme_bw() +
  facet_wrap(.~variable)

MNAR

Now introduce missingness that is related to the response (MNAR). Here, missingness only occurs among participants who have the event, i.e. those who relapse are more likely to be missing.

d3 <-
  rbindlist(mclapply(1:nbatch, function(i) {
    rbindlist(lapply(1:nsim, function(j) {
      X <- data.table(
        1,
        trt = rbinom(n, 1, prob = 0.5),
        sex = rbinom(n, 1, prob = 0.5)
      )
      eta <- as.matrix(X) %*% b
      # Missingness related to relapse
      mis <- rep(0, n)
      y <- rbinom(n, 1, prob = plogis(eta))
      mis[which(y == 1)[1:ndel]] <- 1
      y[mis == 1] <- NA
      lm1 <- glm(y ~ trt + sex, data = X, 
                 family = binomial(), 
                 method = "brglmFit")
      pp0 <- plogis(
        predict(lm1, newdata = data.table(trt = 0:1, sex = mean(X$sex)))
      )
      pp <- plogis(predict(lm1, newdata = XX))
      data.table(
        ctl_f = pp[1], ctl_m = pp[2], trt_f = pp[3], trt_m = pp[4],
        ctl = pp0[1], trt = pp0[2], n = sum(!is.na(y))
      )
    }))
  }, mc.cores = 10))

The simulation parameter estimates and average number of observed records:

rbind(p = c(p, mean(p[1:2]), mean(p[3:4]), NA),
      p_hat = colMeans(d3))
##           ctl_f     ctl_m     trt_f     trt_m       ctl       trt  n
## p     0.6500000 0.7647059 0.2708333 0.3939394 0.7073529 0.3323864 NA
## p_hat 0.4901856 0.6209195 0.1714068 0.2608004 0.5574717 0.2097219 75

There is bias in all groups relative to the estimates from the complete dataset.

ggplot(melt(d3[, .(ctl_f,ctl_m,trt_f,trt_m)], 
            measure.vars = c("ctl_f","ctl_m","trt_f","trt_m")), 
       aes(x = value)) +
  geom_density() +
  geom_density(data = melt(d0[, .(ctl_f,ctl_m,trt_f,trt_m)], 
            measure.vars = c("ctl_f","ctl_m","trt_f","trt_m")),
            aes(x = value), lty = 2) +
  geom_vline(data = df0[variable %in% c("ctl_f","ctl_m","trt_f","trt_m"),], 
             aes(xintercept = p), lty = 3) +
  theme_bw() +
  facet_wrap(.~variable)

Compare means.

rr <- rbind(truth = c(p, mean(p[1:2]), mean(p[3:4]), NA),
            complete = colMeans(d0),
            mcar = colMeans(d1),
            mar = colMeans(d2),
            mar_mis = c(rep(NA, 4), colMeans(d2b)),
            mnar = colMeans(d3))
rr
##              ctl_f     ctl_m     trt_f     trt_m       ctl       trt   n
## truth    0.6500000 0.7647059 0.2708333 0.3939394 0.7073529 0.3323864  NA
## complete 0.6449866 0.7581273 0.2769861 0.3985623 0.7066639 0.3332507 100
## mcar     0.6430859 0.7544981 0.2798861 0.3986755 0.7044634 0.3343952  75
## mar      0.6455365 0.7524469 0.2779039 0.3989317 0.7052170 0.3338020  75
## mar_mis         NA        NA        NA        NA 0.6828577 0.3165260  75
## mnar     0.4901856 0.6209195 0.1714068 0.2608004 0.5574717 0.2097219  75

Aggregate all the data and plot everything, including the marginal estimates of the average treatment effect.

rl <- rbind(cbind(desc = "complete", d0),
            cbind(desc = "mcar", d1),
            cbind(desc = "mar", d2),
            cbind(desc = "mar_mis", d2b),
            cbind(desc = "mnar", d3), fill=TRUE)
cols <- colnames(rl)[c(-length(colnames(rl)))]
rl <- melt(rl[, ..cols], measure.vars = cols[-1])

dfig1 <- rl[, .(mu = mean(value, na.rm = TRUE), 
               lb = quantile(value, probs = 0.025, na.rm = TRUE),
               ub = quantile(value, probs = 0.975, na.rm = TRUE)),
           by = .(desc, variable)]
dfig1$desc <- factor(dfig1$desc, levels = c("mnar",
                                            "mar_mis",
                                            "mar",
                                            "mcar",
                                            "complete"
                                            ))
dfig2 <- data.table(p = c(p, mean(p[1:2]), mean(p[3:4])), 
                    variable = unique(rl$variable))
ggplot(dfig1, aes(x = mu, y = desc)) +
  geom_point(col = 1, size = 3) +
  geom_errorbar(aes(xmin = lb, xmax = ub), width = 0) +
  geom_vline(data = dfig2, 
             aes(xintercept = p), lty = 3) +
  theme_bw() +
  scale_y_discrete("") +
  facet_wrap(~variable, ncol = 2)
## Warning: Removed 4 rows containing missing values (geom_point).

Addressing missingness

The first thing to do is to look at the amount and pattern of missingness. Assuming MCAR is generally unlikely to be reasonable; it is more likely that you have a mix of missing data mechanisms. Looking at the patterns of missingness sometimes helps you get a sense of what is driving the missingness, but there is no way to know for sure, so your best bet is to examine various what-if scenarios.
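
As a starting point, something along these lines will quantify and tabulate the missingness (dat is a hypothetical dataset; md.pattern comes from the mice package, which is not loaded in the Setup above):

# Proportion of missing values per variable
sapply(dat, function(x) mean(is.na(x)))
# Tabulate the joint patterns of missingness across variables
mice::md.pattern(dat)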

There are three things you can do when faced with missing data:

  1. Exclude the participants that have missing data and do a complete case analysis
  2. Use single value imputation (e.g. impute with the mean, regression, nearest neighbour)
  3. Multiple imputation

If you have 5-10% missingness, a complete case analysis (supplemented with best-worst-case scenarios) is usually reasonable. The idea with the best-worst-case scenario is to set the missing outcomes in one group to be as good as they could possibly be, set the missing outcomes in the other group to be as bad as they could possibly be, and then repeat your analysis. You then flip the assignment (worst-best-case) and repeat the analysis again; a sketch follows. For a continuous response, you might assume that the missing values are something like the group mean plus 2 standard deviations.
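
A rough sketch of the best-worst / worst-best approach for the binary relapse outcome used above (y and X are as in a single simulated dataset; which arm counts as "best" depends on the direction of the outcome):

# Best-worst case: missing outcomes in the treated arm assumed not to relapse,
# missing outcomes in the control arm assumed to relapse
y_bw <- y
y_bw[is.na(y) & X$trt == 1] <- 0
y_bw[is.na(y) & X$trt == 0] <- 1
# Worst-best case: the reverse assignment
y_wb <- y
y_wb[is.na(y) & X$trt == 1] <- 1
y_wb[is.na(y) & X$trt == 0] <- 0
# Refit the analysis model under each scenario and compare with the complete case result
fit_bw <- glm(y_bw ~ trt + sex, data = X, family = binomial(), method = "brglmFit")
fit_wb <- glm(y_wb ~ trt + sex, data = X, family = binomial(), method = "brglmFit")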

If the proportion of missing data is large then you can still do a complete case analysis (supplemented with best/worst-case scenarios) and report the interpretative limitations appropriately. If more than 40% of the trial data are missing then the trial result cannot really be relied on.

With a single-value imputation approach, the standard errors will usually be deflated relative to what they should be. This is because we have substantial uncertainty about the missing values, but we are pretending that we know the true values. Last observation carried forward can also be problematic due to regression to the mean. However, for missing categorical predictors, creating a new level to represent missingness can be quite effective, as sketched below.
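
A sketch of the missing-level recode for a hypothetical categorical covariate x_cat:

# Hypothetical categorical predictor with some missing values
x_cat <- factor(c("low", "high", NA, "low", NA))
# Add an explicit "missing" level and recode the NAs to it
x_cat <- factor(x_cat, levels = c(levels(x_cat), "missing"))
x_cat[is.na(x_cat)] <- "missing"
table(x_cat)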

If the above do not apply and MNAR is plausible then you should probably use MI.

Propensity weighting

Build a model to predict non-response for a given variable. The inverse of the predicted probability of response can then be used as a weight to make the complete case sample more representative of the full cohort.
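
A sketch using the variables from the simulations above (the response model here uses trt and sex; in practice it would include whatever covariates plausibly predict non-response):

# Indicator of whether the outcome was observed
r <- as.numeric(!is.na(y))
# Model the probability of responding given observed covariates
m_resp <- glm(r ~ trt + sex, data = X, family = binomial())
# Inverse probability of response weights
w <- 1 / predict(m_resp, type = "response")
# Weighted complete case analysis (quasibinomial avoids the non-integer weights warning)
fit_w <- glm(y ~ trt + sex, data = X, weights = w, family = quasibinomial())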

Resampling imputation

Resampling imputation fills each missing value by sampling from the observed values of the same variable. It is a convenient starting point rather than a full fix because it ignores other (potentially predictive) observed data. A somewhat better approach is regression imputation: fit a model to the observed data, predict the means for the missing records, and then pass those predictions to rnorm (or whatever distribution is appropriate) along with the residual error so that the imputation is stochastic and reflects the variation in the data. A sketch of the regression-based version follows the resampling function below.

random.imp <- function(a) {
  # Fill each missing value by resampling (with replacement) from the observed values
  missing <- is.na(a)
  n.missing <- sum(missing)
  a.obs <- a[!missing]
  imputed <- a
  imputed[missing] <- sample(a.obs, n.missing, replace = TRUE)
  return(imputed)
}
random.imp(c(1, 2, 3, NA, 4, 5))
## [1] 1 2 3 1 4 5
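
A corresponding sketch of the regression-based stochastic imputation for a hypothetical continuous variable a with a single predictor x:

regression.imp <- function(a, x) {
  mis <- is.na(a)
  # Fit the imputation model to the observed records only
  fit <- lm(a ~ x, data = data.frame(a = a[!mis], x = x[!mis]))
  # Predicted mean for each missing record plus residual noise
  mu <- predict(fit, newdata = data.frame(x = x[mis]))
  a[mis] <- rnorm(sum(mis), mean = mu, sd = sigma(fit))
  a
}
regression.imp(a = c(2.1, NA, 3.9, 5.2, NA), x = 1:5)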

Multiple imputation

MI involves

  1. imputing multiple complete datasets (5 to 50)
  2. running the analysis on each imputed (now complete) dataset
  3. pooling the results

Simplistically, pooling can be achieved by obtaining the parameter estimate \(\hat{\beta}_m\) from each of \(M\) imputed datasets and computing \(\hat{\beta} = \frac{1}{M}\sum_{m=1}^M \hat{\beta}_m\). The variance estimate \(V_{\beta}\) reflects the within- and between-imputation variation:

\[ \begin{aligned} V_{\beta} = W + \left(1 + \frac{1}{M} \right) B \end{aligned} \]

where \(W = \frac{1}{M}\sum_{m = 1}^M s_m^2\) and \(B = \frac{1}{M-1}\sum_{m = 1}^M \left( \hat{\beta}_m - \hat{\beta} \right)^2\). Full information maximum likelihood (FIML) is a deterministic alternative to MI. The FIML approach estimates the joint distribution of the outcome and covariates that maximises the probability of observing the values that we observed.
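
A minimal sketch of this pooling step, given (hypothetical) vectors of per-imputation estimates and standard errors:

# Rubin's rules: pool estimates and variances across M imputations
pool_rubin <- function(beta_m, se_m) {
  M <- length(beta_m)
  beta_bar <- mean(beta_m)        # pooled estimate
  W <- mean(se_m^2)               # within-imputation variance
  B <- var(beta_m)                # between-imputation variance
  V <- W + (1 + 1 / M) * B        # total variance
  c(est = beta_bar, se = sqrt(V))
}
# e.g. with made-up estimates from M = 5 imputed datasets
pool_rubin(beta_m = c(-1.52, -1.48, -1.61, -1.45, -1.55),
           se_m   = c(0.42, 0.40, 0.44, 0.41, 0.43))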

Mixed effects models

When you have repeated measures on patients, or heterogeneity across the sites or practitioners delivering the intervention, then you may be using a mixed-effects model.

In this context, the approach to addressing missing data must account for the multi-level nature of the data. For example, in a school / class / student repeated-measures hierarchical model, you might need one model to address imputation at the student level, another at the class level and another at the school level. The higher-level models would typically rely on aggregates of the lower-level data.

Bayesian analyses

In a Bayesian analysis, the model is generative, so missing response variables are easy to impute: they can simply be drawn from the posterior predictive distribution. It gets a bit more complicated when predictors are missing.
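
A sketch of the missing-response case using brms (an assumption here; brms is not loaded in the Setup, and dat is a hypothetical data.table with the same columns as the simulations and NAs in y):

library(brms)
# Fit the generative model to the observed responses
fit <- brm(y ~ trt + sex, family = bernoulli(), data = dat[!is.na(y)])
# Posterior predictive draws for the missing responses:
# rows are posterior draws, columns are the missing records
y_mis <- posterior_predict(fit, newdata = dat[is.na(y)])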

References

The ARM text by Gelman and Hill (Data Analysis Using Regression and Multilevel/Hierarchical Models) has a good chapter on missing data.

Flexible Imputation of Missing Data: https://stefvanbuuren.name/fimd/