Overview

Missing values can be a big obstacle in the way of model fitting and predictive analytics. The choice of method to impute missing values,largely influences the model’s predictive ability. In most statistical analysis methods,listwise deletion is the default method used to impute missing values. But, it not as good since it leads to information loss.

R has several packages to impute missing values, including MICE, Amelia, missForest, Hmisc and mi. This blog tests the different methods used by MICE to impute missing values and then comapres with the original dataset to test which one goes closest to the shape of the proginal distribution.

MICE (Multivariate Imputation via Chained Equations) is one of the commonly used package by R users. Creating multiple imputations as compared to a single imputation (such as mean) takes care of uncertainty in missing values.MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on observed value and can be predicted using them. It imputes data on a variable by variable basis by specifying an imputation model per variable.

Load Libraries

library(tidyverse)
library(mice)
library(missForest)
library(VIM)
library(kableExtra)

Read in data and produce missing at random

We will use the iris dataset for this analysis. The dataset has 150 observations for 5 variables. We remove the categorical variable and focus our analysis on the continuous variables.

data<-iris
data <- subset(data, select = -c(Species))

Next we produce 10% missing values at random using the prodNA function from the missForest package.

iris.mis <- prodNA(data, noNA = 0.1)

We will focus on the continuous variables and remove the categorical variable Species. Let us have a look at the missingness.There are 65% values in the data set with no missing value. There are 10% missing values in Petal.Length, 9.5% missing values in Petal.Width and so on.

mice_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(iris.mis), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##      Variable      Count
##  Sepal.Length 0.14000000
##   Sepal.Width 0.10000000
##   Petal.Width 0.09333333
##  Petal.Length 0.06666667

Impute missing values

Let us impute the missing values now. We will use a seed to keep the results unaltered.

PMM (Predicted Mean Matching)

Data_pmm <- mice(iris.mis, method = 'pmm', seed = 5000)
## 
##  iter imp variable
##   1   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
pmm<- complete(Data_pmm)

CART(Classification and regression trees)

Data_cart<- mice(iris.mis, method = 'cart', seed = 5001)
## 
##  iter imp variable
##   1   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
cart<-complete(Data_cart)

Midastouch(Weighted predictive mean matching)

Data_midastouch <- mice(iris.mis,  method = 'midastouch', seed = 5002)
## 
##  iter imp variable
##   1   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
midastouch<-complete(Data_midastouch)

RF(Random Forest Imputation)

Data_rf <- mice(iris.mis, method = 'rf', seed = 5003)
## 
##  iter imp variable
##   1   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
rf<-complete(Data_rf)

Sample (Random sample from observed values)

Data_sample <- mice(iris.mis, method = 'sample', seed = 5004)
## 
##  iter imp variable
##   1   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
sample<-complete(Data_sample)

Mean (Unconditional mean imputation)

Data_mean <- mice(iris.mis, method = 'mean', seed = 5005)
## 
##  iter imp variable
##   1   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   1   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   2   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   3   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   4   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
##   5   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
mean<-complete(Data_mean)

Compare the methods

Here we compare the summary of the methods with the original distribution.

Sepal.length

We see that pmm and rf methods return distributions closest to original.

sepal_length<-as.data.frame(cbind(data$Sepal.Length, pmm$Sepal.Length, cart$Sepal.Length,                          midastouch$Sepal.Length, rf$Sepal.Length, sample$Sepal.Length, 
                      mean$Sepal.Length))
names(sepal_length)<-c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(sepal_length, main="Comparison of different imputation methods on sepal.length")

Sepal.width

Except for midastouch, all the other methods have a returned an almost similar distribution.

sepal_width<-as.data.frame(cbind(data$Sepal.Width, pmm$Sepal.Width, cart$Sepal.Width,      
                    midastouch$Sepal.Width, rf$Sepal.Width, sample$Sepal.Width, 
                    mean$Sepal.Width))

names(sepal_width) <- c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(sepal_width, main="Comparison of different imputation methods on sepal.width")

Petal.length

Cart is the winner here.

petal_length<-as.data.frame(cbind(data$Petal.Length, pmm$Petal.Length, cart$Petal.Length,      
                    midastouch$Petal.Length, rf$Petal.Length, sample$Petal.Length, 
                    mean$Petal.Length))

names(petal_length) <- c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(petal_length, main="Comparison of different imputation methods on petal_length")

Petal.width

Except for mean, all the other methods have a returned an almost similar distribution.

petal_width<-as.data.frame(cbind(data$Petal.Width, pmm$Petal.Width, cart$Petal.Width,      
                    midastouch$Petal.Width, rf$Petal.Width, sample$Petal.Width, 
                    mean$Petal.Width))

names(petal_width) <- c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(petal_width, main="Comparison of different imputation methods on petal_width")

Conclusion

We have had different results for different variables when using different imputation methods using the MICE package. PMM seems to have done a better job than most other methods with every variable. Here is a density plot for the distributions, decide for yourself!

par(mfrow=c(3,3))
plot(density(data$Sepal.Length), main='Original')
plot(density(pmm$Sepal.Length),  main='pmm')
plot(density(cart$Sepal.Length),  main='cart')
plot(density(midastouch$Sepal.Length),  main='midastouch')
plot(density(rf$Sepal.Length),  main='rf')
plot(density(sample$Sepal.Length),  main='sample')
plot(density(mean$Sepal.Length),  main='mean')