Missing Data Example with MICE in R

First, we need to load the mice package and the data that we will be using for the example. The nhanes2 data set is a small data set designed for missing data examples. It contains four variables: age, a categorical variable with three age groups; bmi and chl, which are quantitative; and hyp, which is binary (yes/no). Using the mice package is straightforward: pass your data to the mice function and supply a seed for reproducibility.

The mice call creates the default five imputed data sets, each of which is analyzed separately in the fit object. To obtain correct standard errors, each imputed data set should be analyzed separately; the parameter estimates are then combined after the analyses are complete, using the pool function described below.

library(mice)
data("nhanes2")
head(nhanes2)

##     age  bmi  hyp chl
## 1 20-39   NA <NA>  NA
## 2 40-59 22.7   no 187
## 3 20-39   NA   no 187
## 4 60-99   NA <NA>  NA
## 5 20-39 20.4   no 113
## 6 60-99   NA <NA> 184
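
Before imputing, it can help to check the variable types and count the missing values in each column. The following is a minimal sketch using base R plus md.pattern from the mice package; the name n_missing is illustrative only and does not appear in the original example.

```r
library(mice)
data("nhanes2")

# Variable types: age and hyp are factors, bmi and chl are numeric
sapply(nhanes2, class)

# Number of missing values per column (n_missing is an illustrative name)
n_missing <- colSums(is.na(nhanes2))
n_missing

# Tabulate the missing data patterns across rows
# (recent versions of mice may also draw a plot)
md.pattern(nhanes2)
```

The counts in n_missing should agree with the "Missing cells per column" line in the imputation summary shown later.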

nhances2.imp = mice(nhanes2, seed = 12345)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl
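
To see what the imputation produced, one option is to extract a completed data set with complete and to look at the imputed values stored in the result. This is a sketch rather than part of the original example: the names imp and completed1 are illustrative, and printFlag = FALSE is used only to suppress the iteration log.

```r
library(mice)
data("nhanes2")

# Re-run the imputation quietly (imp is an illustrative name)
imp <- mice(nhanes2, seed = 12345, printFlag = FALSE)

# First of the five completed data sets: observed values plus imputations
completed1 <- mice::complete(imp, 1)
head(completed1)

# Imputed bmi values: one row per missing case, one column per imputation
imp$imp$bmi
```

Each of the five completed data sets shares the same observed values and differs only in the imputed cells.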

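Then we can get a summary of what the imputation method did. It first shows the number of imputed data sets it created, which in this case is five. Then it shows the number of missing values per variable. It also lists the imputation method used for each variable. The quantitative variables bmi and chl are imputed with predictive mean matching (pmm). For the binary variable hyp, it uses logistic regression (logreg), because hyp is a two-level factor rather than a continuous variable. No method is listed for age because it has no missing values.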

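The predictor matrix shows, in each row, which variables were used to predict missing values for the variable named in that row. For example, when predicting missing values for bmi, the model used all three of the other variables (age, hyp, and chl). In the output shown here, the row for age is all zeros because age has no missing values and so is not imputed.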
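
The pieces of the summary can also be read directly from the imputation object. The sketch below assumes an object named imp (an illustrative name) created with the same call as in this example; m, method, and predictorMatrix are components of the object returned by mice.

```r
library(mice)
data("nhanes2")

# Re-run the imputation quietly (imp is an illustrative name)
imp <- mice(nhanes2, seed = 12345, printFlag = FALSE)

imp$m                          # number of imputed data sets
imp$method                     # imputation method per variable
imp$predictorMatrix["bmi", ]   # 1 marks a variable used to predict bmi
```

Depending on the installed version of mice, the row for a fully observed variable such as age may be printed differently from the output shown in this example.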

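Then we can run a regression on the imputed data sets with the standard lm function in R, wrapped in with so that the model is fit to each imputed data set in turn. Here we create a model that regresses chl on age and bmi.

Finally, we can use the pool function to combine the regression results from the five data sets into one final result. Pooling follows Rubin's rules, so the pooled standard errors reflect both the sampling variability within each data set and the variability between imputations.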

summary(nhances2.imp)

## Multiply imputed data set
## Call:
## mice(data = nhanes2, seed = 12345)
## Number of multiple imputations:  5
## Missing cells per column:
## age bmi hyp chl 
##   0   9   8  10 
## Imputation methods:
##      age      bmi      hyp      chl 
##       ""    "pmm" "logreg"    "pmm" 
## VisitSequence:
## bmi hyp chl 
##   2   3   4 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   0   0   0
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
## Random generator seed value:  12345

fit = with(nhances2.imp, lm(chl ~ age + bmi))
round(summary(pool(fit)),2)

##               est    se    t    df Pr(>|t|)   lo 95  hi 95 nmis  fmi
## (Intercept)  9.02 62.43 0.14 10.95     0.89 -128.46 146.51   NA 0.38
## age2        43.66 17.70 2.47 14.72     0.03    5.88  81.44   NA 0.25
## age3        59.46 20.79 2.86 10.42     0.02   13.40 105.53   NA 0.40
## bmi          5.98  2.09 2.87 12.53     0.01    1.46  10.51    9 0.33
##             lambda
## (Intercept)   0.28
## age2          0.16
## age3          0.30
## bmi           0.23
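
To make the pooling step concrete, the estimates and standard errors can be combined by hand with Rubin's rules. This is a minimal sketch, not a replacement for pool: the object names (imp, est, u, qbar, ubar, b, total, lambda) are illustrative, and exact numbers may differ from the output above depending on the installed versions of R and mice.

```r
library(mice)
data("nhanes2")

# Impute and fit the same model as above (imp is an illustrative name)
imp <- mice(nhanes2, seed = 12345, printFlag = FALSE)
fit <- with(imp, lm(chl ~ age + bmi))

m   <- length(fit$analyses)                              # number of imputations
est <- sapply(fit$analyses, coef)                        # one column per imputation
u   <- sapply(fit$analyses, function(f) diag(vcov(f)))   # squared standard errors

qbar   <- rowMeans(est)                # pooled estimate
ubar   <- rowMeans(u)                  # within-imputation variance
b      <- apply(est, 1, var)           # between-imputation variance
total  <- ubar + (1 + 1 / m) * b       # total variance
lambda <- (1 + 1 / m) * b / total      # share of variance due to missing data

round(cbind(est = qbar, se = sqrt(total), lambda = lambda), 2)
```

The est, se, and lambda columns computed this way correspond to the columns of the same names in the pooled summary; the degrees of freedom and confidence limits need further steps that pool handles itself.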