Overview
Missing values can be a big obstacle in the way of model fitting and predictive analytics. The choice of method to impute missing values,largely influences the model’s predictive ability. In most statistical analysis methods,listwise deletion is the default method used to impute missing values. But, it not as good since it leads to information loss.
R has several packages to impute missing values, including MICE, Amelia, missForest, Hmisc and mi. This blog tests the different methods used by MICE to impute missing values and then comapres with the original dataset to test which one goes closest to the shape of the proginal distribution.
MICE (Multivariate Imputation via Chained Equations) is one of the commonly used package by R users. Creating multiple imputations as compared to a single imputation (such as mean) takes care of uncertainty in missing values.MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on observed value and can be predicted using them. It imputes data on a variable by variable basis by specifying an imputation model per variable.
Load Libraries
library(tidyverse)
library(mice)
library(missForest)
library(VIM)
library(kableExtra)
Read in data and produce missing at random
We will use the iris dataset for this analysis. The dataset has 150 observations for 5 variables. We remove the categorical variable and focus our analysis on the continuous variables.
data<-iris
data <- subset(data, select = -c(Species))
Next we produce 10% missing values at random using the prodNA function from the missForest package.
iris.mis <- prodNA(data, noNA = 0.1)
We will focus on the continuous variables and remove the categorical variable Species. Let us have a look at the missingness.There are 65% values in the data set with no missing value. There are 10% missing values in Petal.Length, 9.5% missing values in Petal.Width and so on.
mice_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(iris.mis), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Sepal.Length 0.14000000
## Sepal.Width 0.10000000
## Petal.Width 0.09333333
## Petal.Length 0.06666667
Impute missing values
Let us impute the missing values now. We will use a seed to keep the results unaltered.
PMM (Predicted Mean Matching)
Data_pmm <- mice(iris.mis, method = 'pmm', seed = 5000)
##
## iter imp variable
## 1 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
pmm<- complete(Data_pmm)
CART(Classification and regression trees)
Data_cart<- mice(iris.mis, method = 'cart', seed = 5001)
##
## iter imp variable
## 1 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
cart<-complete(Data_cart)
Midastouch(Weighted predictive mean matching)
Data_midastouch <- mice(iris.mis, method = 'midastouch', seed = 5002)
##
## iter imp variable
## 1 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
midastouch<-complete(Data_midastouch)
RF(Random Forest Imputation)
Data_rf <- mice(iris.mis, method = 'rf', seed = 5003)
##
## iter imp variable
## 1 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
rf<-complete(Data_rf)
Sample (Random sample from observed values)
Data_sample <- mice(iris.mis, method = 'sample', seed = 5004)
##
## iter imp variable
## 1 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
sample<-complete(Data_sample)
Mean (Unconditional mean imputation)
Data_mean <- mice(iris.mis, method = 'mean', seed = 5005)
##
## iter imp variable
## 1 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 2 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 3 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 2 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 3 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 4 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 5 Sepal.Length Sepal.Width Petal.Length Petal.Width
mean<-complete(Data_mean)
Compare the methods
Here we compare the summary of the methods with the original distribution.
Sepal.length
We see that pmm and rf methods return distributions closest to original.
sepal_length<-as.data.frame(cbind(data$Sepal.Length, pmm$Sepal.Length, cart$Sepal.Length, midastouch$Sepal.Length, rf$Sepal.Length, sample$Sepal.Length,
mean$Sepal.Length))
names(sepal_length)<-c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(sepal_length, main="Comparison of different imputation methods on sepal.length")
Sepal.width
Except for midastouch, all the other methods have a returned an almost similar distribution.
sepal_width<-as.data.frame(cbind(data$Sepal.Width, pmm$Sepal.Width, cart$Sepal.Width,
midastouch$Sepal.Width, rf$Sepal.Width, sample$Sepal.Width,
mean$Sepal.Width))
names(sepal_width) <- c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(sepal_width, main="Comparison of different imputation methods on sepal.width")
Petal.length
Cart is the winner here.
petal_length<-as.data.frame(cbind(data$Petal.Length, pmm$Petal.Length, cart$Petal.Length,
midastouch$Petal.Length, rf$Petal.Length, sample$Petal.Length,
mean$Petal.Length))
names(petal_length) <- c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(petal_length, main="Comparison of different imputation methods on petal_length")
Petal.width
Except for mean, all the other methods have a returned an almost similar distribution.
petal_width<-as.data.frame(cbind(data$Petal.Width, pmm$Petal.Width, cart$Petal.Width,
midastouch$Petal.Width, rf$Petal.Width, sample$Petal.Width,
mean$Petal.Width))
names(petal_width) <- c('original', 'pmm', 'cart', 'midastouch', 'rf', 'sample', 'mean')
boxplot(petal_width, main="Comparison of different imputation methods on petal_width")
Conclusion
We have had different results for different variables when using different imputation methods using the MICE package. PMM seems to have done a better job than most other methods with every variable. Here is a density plot for the distributions, decide for yourself!
par(mfrow=c(3,3))
plot(density(data$Sepal.Length), main='Original')
plot(density(pmm$Sepal.Length), main='pmm')
plot(density(cart$Sepal.Length), main='cart')
plot(density(midastouch$Sepal.Length), main='midastouch')
plot(density(rf$Sepal.Length), main='rf')
plot(density(sample$Sepal.Length), main='sample')
plot(density(mean$Sepal.Length), main='mean')