The required packages:
library(mice)
library(ggplot2)
#General setup
set.seed(131)
#input parameters simulation
n <- 1000
prop.mis <- seq(0.1,0.4, by=0.05)
n.iter <- 1000
We will randomly draw 3 variables from normal distributions. These hypothetical data are from a cohort of 1000 patients with TBI. The research question we will be answering is:
“What is the effect of degree of depression, and degree of functioning, on quality of life following TBI?”
Parameter | Meaning | Range |
---|---|---|
x1 | Degree of depression | 0-100 |
x2 | Degree of (executive) functioning | 0-100 |
y | Quality of life | 0-100 |
We start by drawing values for x1 and x2. The code below draws random values for x1 and x2 from a multivariate normal distribution. The multivariate normal distribution is required, because the degree of depression and the degree of functioning are correlated: the better a patient recovers after TBI, the better both the degree of depression and function likely are. To take into account the relationship between x1 and x2, this approach is therefore better than drawing from two random normal distributions. The input parameters (mu = mean, sd = standard deviation, cov = covariance) are arbitrarily chosen.
#make two continue somewhat correlated variabeles
mux1 <- 60 # Depression (0-100)
sdx1 <- 10
mux2 <- 40 # Executive funcitoning (0-100)
sdx2 <- 10
cov_x1x2 <- 40
vcov_x1x2 <- matrix(c(sdx1^2,cov_x1x2,cov_x1x2,sdx2^2), ncol = 2)
x <- mvtnorm::rmvnorm(n = n, mean = c(mux1, mux2), sigma = vcov_x1x2)
Based on the randomly drawn x1 and x2, we can now assume the “real model” underlying the data. We chose arbitrary values for the effect (beta, or coefficient) of depression and functioning on quality of life. The higher the degree of depression or functioning, the higher the quality of life. Moreover, a random variation is included in y, because these two factors do not explain the complete variation in quality of life.
# Regression input
beta_x1 <- +0.3
beta_x2 <- +0.5
var_err <- 8
# Make y
y <- 20 + beta_x1*x[,1] + beta_x2*x[,2] + rnorm(n = n, mean = 0, sd = var_err)
Below, you can see the relationship between the three variables, and their distribution:
#relationships betweeon x and y
df <- data.frame(cbind(x,y))
colnames(df) <- c("x1", "x2", "y")
GGally::ggpairs(df)
The simulation needs to be prepared, by first fitting the “true model” (that is, the model with the original, complete data). Second, a dataframe is created, in which the results of the simulation can be stored.
#Fit the "true model"
fit <- lm(y~x1+x2, data=df)
#Prepare loop for making data missing and then cc/mi fit
res_sim <- expand.grid(iter = 1:n.iter,
prop.mis = prop.mis,
x = c("x1","x2"),
meth = c("cc","mi"),
beta = NA,
se = NA)
The simulation will be executed in the following order:
Repeat these steps for different proportions of missingness (which modulates step 1), and do 1000 iterations to arrive at stable estimates.
for(p in prop.mis){
for(i in 1:n.iter){
df_iter <- df
# Make proportion of x1 missing, conditional on x2, so that prop.mis will be missing
prob_mis <- plogis(-0.03*df_iter$x2+rnorm(n = n, sd=0.05))
prob_mis <- prob_mis-(mean(prob_mis)-p)
prob_mis[prob_mis<0] <- 0
yn_mis <- rbinom(n = n, size = 1, prob = prob_mis)==1
df_iter$x1[yn_mis] <- NA
#Fit the complete case model
fitcc <- lm(y~x1+x2, data=df_iter)
#add to result
res_sim[res_sim$iter==i&res_sim$meth=="cc"&res_sim$prop.mis==p,
c("beta","se")] <- coef(summary(fitcc))[2:3,1:2]
#MI
dfi <- mice(df_iter, m=5, printFlag = FALSE)
#Fit mi
fitmi <- with(dfi, lm(y~x1+x2))
fitmi <- summary(pool(fitmi))
#add to result
res_sim[res_sim$iter==i&res_sim$meth=="mi"&res_sim$prop.mis==p,
c("beta","se")] <- fitmi[2:3,2:3]
}
print(paste(paste(ceiling(which(prop.mis%in%p)*(100/length(prop.mis)))),"% completed", sep="")) #To show the progress
}
Below, the results are shown of the simulation. We compare the coefficients of the simulation to the 95% confidence interval of the coefficients of the “true model”. If an estimate falls outside the 95% confidence interval of the true model estimate, we will label this estimate as a “wrong conclusion”: because of this estimate, we will conclude that the effect lies outside a reasonable interval where we expect the estimate of the target population.
First, the results for the effect of x1 (degree of depression) on y will be shown:
#Plot the probability of drawing a "wrong" conclusion, based on cc/mi and prop.mis
max_x1 <- fit$coefficients["x1"]+1.96*sqrt(diag(vcov(fit)))["x1"]
min_x1 <- fit$coefficients["x1"]-1.96*sqrt(diag(vcov(fit)))["x1"]
max_x2 <- fit$coefficients["x2"]+1.96*sqrt(diag(vcov(fit)))["x2"]
min_x2 <- fit$coefficients["x2"]-1.96*sqrt(diag(vcov(fit)))["x2"]
#Calculate and plot proportion of x1 wrong
wrng_x1_df <- res_sim[res_sim$x=="x1",]
wrng_x1_df$wrng <- wrng_x1_df$beta>max_x1|wrng_x1_df$beta<min_x1
wrng_x1_df$meth <- factor(wrng_x1_df$meth)
levels(wrng_x1_df$meth) <- c("Complete case analysis", "Imputed data")
#Add minimal and maximal to dataframe to make a ribbon
ribbon_data <- data.frame(prop.mis=prop.mis, min_x1=min_x1, max_x1=max_x1)
wrng_x1_df <- merge(wrng_x1_df, ribbon_data, by="prop.mis")
#To make sure it fills all the plot
wrng_x1_df$prop.mis2 <- wrng_x1_df$prop.mis
wrng_x1_df$prop.mis2[wrng_x1_df$prop.mis2==0.1] <- 0.05
wrng_x1_df$prop.mis2[wrng_x1_df$prop.mis2==0.4] <- 0.45
x1_wrn_sct <- ggplot(wrng_x1_df, aes(x=prop.mis, y=beta))+
geom_ribbon(aes(x=prop.mis2,ymin=min_x1, ymax=max_x1), fill="#7570b3", alpha=0.2)+
geom_point(position=position_jitter(), aes(col=beta>max_x1|beta<min_x1))+
labs(x="% missing values",y="Beta x1",col="Conclusion")+
scale_color_brewer(labels=c("Right", "Wrong"), palette = "Set2")+facet_wrap(~meth)+theme_bw()
x1_wrn_sct
This figure shows that the estimates of the multiple imputed are similar in uncertainty and direction. The higher the proportion of missingness, the more often the estimates of both analyses lie outside the confidence interval of the true model, because the uncertainty in estimation increases.
Now follows the results for the effect of functioning (x2) on quality of life.
#Calculate proportion of x2 wrong
wrng_x2_df <- res_sim[res_sim$x=="x2",]
wrng_x2_df$wrng <- wrng_x2_df$beta>max_x2|wrng_x2_df$beta<min_x2
#Add minimal and maximal to dataframe to make a ribbon
ribbon_data <- data.frame(prop.mis=prop.mis, min_x2=min_x2, max_x2=max_x2)
wrng_x2_df <- merge(wrng_x2_df, ribbon_data, by="prop.mis")
wrng_x2_df$meth <- factor(wrng_x2_df$meth)
levels(wrng_x2_df$meth) <- c("Complete case", "Imputed data")
#To make sure it fills all the plot
wrng_x2_df$prop.mis2 <- wrng_x2_df$prop.mis
wrng_x2_df$prop.mis2[wrng_x2_df$prop.mis2==0.1] <- 0.05
wrng_x2_df$prop.mis2[wrng_x2_df$prop.mis2==0.4] <- 0.45
x2_wrn_sct <- ggplot(wrng_x2_df, aes(x=prop.mis, y=beta))+
geom_ribbon(aes(x=prop.mis2,ymin=min_x2, ymax=max_x2), fill="#7570b3", alpha=0.2)+
geom_point(position=position_jitter(), aes(col=beta>max_x2|beta<min_x2))+
labs(x="% missing values",y="Beta x2",col="Conclusion")+
scale_color_brewer(labels=c("Right", "Wrong"), palette = "Set2")+facet_wrap(~meth)+theme_bw()
x2_wrn_sct
We now observe that because all x2 data can be used, the uncertainty is much lower in the multiple imputed dataset. Because of the lower uncertainty, all estimates now lie within the 95% confidence interval of the “true model”. This result shows why multiple imputation is often beneficial, and not harmful: we do not create new information, but use all available information as efficiently as possible.