Simulating missing data

The required packages:

library(mice)
library(ggplot2)

#General setup
set.seed(131)

#input parameters simulation
n        <- 1000
prop.mis <- seq(0.1,0.4, by=0.05)
n.iter   <- 1000

Setup variables

We will randomly draw 3 variables from normal distributions. These hypothetical data are from a cohort of 1000 patients with TBI. The research question we will be answering is:

“What is the effect of degree of depression, and degree of functioning, on quality of life following TBI?”

Parameter	Meaning	Range
x1	Degree of depression	0-100
x2	Degree of (executive) functioning	0-100
y	Quality of life	0-100

We start by drawing values for x1 and x2. The code below draws random values for x1 and x2 from a multivariate normal distribution. The multivariate normal distribution is required, because the degree of depression and the degree of functioning are correlated: the better a patient recovers after TBI, the better both the degree of depression and function likely are. To take into account the relationship between x1 and x2, this approach is therefore better than drawing from two random normal distributions. The input parameters (mu = mean, sd = standard deviation, cov = covariance) are arbitrarily chosen.

#make two continue somewhat correlated variabeles
mux1 <- 60     # Depression (0-100)
sdx1 <- 10
mux2 <- 40     # Executive funcitoning (0-100)
sdx2 <- 10            

cov_x1x2 <- 40

vcov_x1x2 <- matrix(c(sdx1^2,cov_x1x2,cov_x1x2,sdx2^2), ncol = 2)

x <- mvtnorm::rmvnorm(n = n, mean = c(mux1, mux2), sigma = vcov_x1x2)

Based on the randomly drawn x1 and x2, we can now assume the “real model” underlying the data. We chose arbitrary values for the effect (beta, or coefficient) of depression and functioning on quality of life. The higher the degree of depression or functioning, the higher the quality of life. Moreover, a random variation is included in y, because these two factors do not explain the complete variation in quality of life.

# Regression input
beta_x1 <- +0.3
beta_x2 <- +0.5
var_err <- 8

# Make y
y <- 20 + beta_x1*x[,1] + beta_x2*x[,2] + rnorm(n = n, mean = 0, sd = var_err)

Below, you can see the relationship between the three variables, and their distribution:

#relationships betweeon x and y
df           <- data.frame(cbind(x,y))
colnames(df) <- c("x1", "x2", "y")
GGally::ggpairs(df)

Preparing the simulation

The simulation needs to be prepared, by first fitting the “true model” (that is, the model with the original, complete data). Second, a dataframe is created, in which the results of the simulation can be stored.

#Fit the "true model" 
fit <- lm(y~x1+x2, data=df)

#Prepare loop for making data missing and then cc/mi fit
res_sim <- expand.grid(iter     = 1:n.iter,
                       prop.mis = prop.mis,
                       x        = c("x1","x2"),
                       meth     = c("cc","mi"),
                       beta     = NA,
                       se       = NA)

Simulate

The simulation will be executed in the following order:

Create random missingness in x1 (depression), conditional on x2 (functioning): the higher the degree of functioning, the lower the probability of a missing observation. The effect is arbitrarily set at -0.03 on the log-odds scale, with a random variation with a standard deviation of 0.05.
Fit the model on the dataset with missing data. The model will automatically only analyse complete cases, so this model is referred to as the “complete case” analysis. The coefficients and standard errors of this complete case analysis is stored in the earlier created dataframe.
Create a multiple imputed dataset (with 5 completed datasets) with the MICE package. Fit the model on the 5 completed datasets, pool the results, and also store the coefficients and standard errors of this analysis, which is called the “Multiple imputed” analysis.

Repeat these steps for different proportions of missingness (which modulates step 1), and do 1000 iterations to arrive at stable estimates.

for(p in prop.mis){
  for(i in 1:n.iter){
    df_iter <- df
    
    # Make proportion of x1 missing, conditional on x2, so that prop.mis will be missing
    prob_mis <- plogis(-0.03*df_iter$x2+rnorm(n = n, sd=0.05))
    prob_mis <- prob_mis-(mean(prob_mis)-p)
    prob_mis[prob_mis<0] <- 0
    yn_mis   <- rbinom(n = n, size = 1, prob = prob_mis)==1
    df_iter$x1[yn_mis] <- NA
    
    #Fit the complete case model
    fitcc <- lm(y~x1+x2, data=df_iter)
    
    #add to result
    res_sim[res_sim$iter==i&res_sim$meth=="cc"&res_sim$prop.mis==p,
            c("beta","se")] <- coef(summary(fitcc))[2:3,1:2]
    
    #MI 
    dfi <- mice(df_iter, m=5, printFlag = FALSE)
    
    #Fit mi
    fitmi <- with(dfi, lm(y~x1+x2))
    fitmi <- summary(pool(fitmi))
    
    #add to result
    res_sim[res_sim$iter==i&res_sim$meth=="mi"&res_sim$prop.mis==p,
            c("beta","se")] <- fitmi[2:3,2:3]
  }
  print(paste(paste(ceiling(which(prop.mis%in%p)*(100/length(prop.mis)))),"% completed", sep="")) #To show the progress
}

Show results

Below, the results are shown of the simulation. We compare the coefficients of the simulation to the 95% confidence interval of the coefficients of the “true model”. If an estimate falls outside the 95% confidence interval of the true model estimate, we will label this estimate as a “wrong conclusion”: because of this estimate, we will conclude that the effect lies outside a reasonable interval where we expect the estimate of the target population.

First, the results for the effect of x1 (degree of depression) on y will be shown:

#Plot the probability of drawing a "wrong" conclusion, based on cc/mi and prop.mis
max_x1 <- fit$coefficients["x1"]+1.96*sqrt(diag(vcov(fit)))["x1"]
min_x1 <- fit$coefficients["x1"]-1.96*sqrt(diag(vcov(fit)))["x1"]
max_x2 <- fit$coefficients["x2"]+1.96*sqrt(diag(vcov(fit)))["x2"]
min_x2 <- fit$coefficients["x2"]-1.96*sqrt(diag(vcov(fit)))["x2"]

#Calculate and plot proportion of x1 wrong
wrng_x1_df <- res_sim[res_sim$x=="x1",]
wrng_x1_df$wrng <- wrng_x1_df$beta>max_x1|wrng_x1_df$beta<min_x1

wrng_x1_df$meth <- factor(wrng_x1_df$meth)
levels(wrng_x1_df$meth) <- c("Complete case analysis", "Imputed data")

#Add minimal and maximal to dataframe to make a ribbon
ribbon_data <- data.frame(prop.mis=prop.mis, min_x1=min_x1, max_x1=max_x1)
wrng_x1_df <- merge(wrng_x1_df, ribbon_data, by="prop.mis")

#To make sure it fills all the plot
wrng_x1_df$prop.mis2 <- wrng_x1_df$prop.mis
wrng_x1_df$prop.mis2[wrng_x1_df$prop.mis2==0.1] <- 0.05
wrng_x1_df$prop.mis2[wrng_x1_df$prop.mis2==0.4] <- 0.45

x1_wrn_sct <- ggplot(wrng_x1_df, aes(x=prop.mis, y=beta))+
  geom_ribbon(aes(x=prop.mis2,ymin=min_x1, ymax=max_x1), fill="#7570b3", alpha=0.2)+
  geom_point(position=position_jitter(), aes(col=beta>max_x1|beta<min_x1))+
  labs(x="% missing values",y="Beta x1",col="Conclusion")+ 
  scale_color_brewer(labels=c("Right", "Wrong"), palette = "Set2")+facet_wrap(~meth)+theme_bw()
x1_wrn_sct

This figure shows that the estimates of the multiple imputed are similar in uncertainty and direction. The higher the proportion of missingness, the more often the estimates of both analyses lie outside the confidence interval of the true model, because the uncertainty in estimation increases.

Now follows the results for the effect of functioning (x2) on quality of life.

#Calculate proportion of x2 wrong
wrng_x2_df <- res_sim[res_sim$x=="x2",]
wrng_x2_df$wrng <- wrng_x2_df$beta>max_x2|wrng_x2_df$beta<min_x2

#Add minimal and maximal to dataframe to make a ribbon
ribbon_data <- data.frame(prop.mis=prop.mis, min_x2=min_x2, max_x2=max_x2)
wrng_x2_df <- merge(wrng_x2_df, ribbon_data, by="prop.mis")

wrng_x2_df$meth <- factor(wrng_x2_df$meth)
levels(wrng_x2_df$meth) <- c("Complete case", "Imputed data")


#To make sure it fills all the plot
wrng_x2_df$prop.mis2 <- wrng_x2_df$prop.mis
wrng_x2_df$prop.mis2[wrng_x2_df$prop.mis2==0.1] <- 0.05
wrng_x2_df$prop.mis2[wrng_x2_df$prop.mis2==0.4] <- 0.45

x2_wrn_sct <- ggplot(wrng_x2_df, aes(x=prop.mis, y=beta))+
  geom_ribbon(aes(x=prop.mis2,ymin=min_x2, ymax=max_x2), fill="#7570b3", alpha=0.2)+
  geom_point(position=position_jitter(), aes(col=beta>max_x2|beta<min_x2))+
  labs(x="% missing values",y="Beta x2",col="Conclusion")+ 
  scale_color_brewer(labels=c("Right", "Wrong"), palette = "Set2")+facet_wrap(~meth)+theme_bw()
x2_wrn_sct

We now observe that because all x2 data can be used, the uncertainty is much lower in the multiple imputed dataset. Because of the lower uncertainty, all estimates now lie within the 95% confidence interval of the “true model”. This result shows why multiple imputation is often beneficial, and not harmful: we do not create new information, but use all available information as efficiently as possible.

Simulating missing data

Benjamin Gravesteijn

12-12-2020

Setup variables

Preparing the simulation

Simulate

Show results