1. MICE
2. Amelia
3. missForest
4. Hmisc
5. mi
For example, suppose we have variables X1, X2, ..., Xk. If X1 has missing values, it is regressed on the other variables X2 to Xk, and its missing values are replaced with the resulting predictions. Similarly, if X2 has missing values, then X1 and X3 to Xk are used as independent variables in the prediction model, and X2's missing values are replaced with the predicted values. The process continues variable by variable.
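To make the mechanics concrete, here is a hand-rolled toy version of that idea on the iris data (just an illustration; mice does this properly by iterating several times and drawing from predictive distributions instead of plugging in point predictions):
#toy sketch of the "regress each variable on the others" loop
set.seed(1)
df <- iris[, 1:3]                       #three numeric columns
df[sample(150, 10), 1] <- NA            #punch holes in Sepal.Length
df[sample(150, 10), 2] <- NA            #...and in Sepal.Width
for (v in names(df)) {
  miss <- is.na(df[[v]])
  if (!any(miss)) next
  #regress the variable with NAs on the remaining variables
  fit  <- lm(reformulate(setdiff(names(df), v), v), data = df, na.action = na.omit)
  pred <- predict(fit, newdata = df[miss, , drop = FALSE])
  #fall back to the column mean where a predictor is itself still missing
  df[[v]][miss] <- ifelse(is.na(pred), mean(df[[v]], na.rm = TRUE), pred)
}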
**Note:**
By default, linear regression is used to predict continuous missing values, and logistic regression is used for categorical missing values.
Precisely, the methods used by the mice package are listed below (a short sketch of specifying them per column follows the list):
1. pmm (Predictive Mean Matching) – for numeric variables
2. logreg (Logistic Regression) – for binary factor variables (2 levels)
3. polyreg (Bayesian polytomous regression) – for unordered factor variables (>= 2 levels)
4. polr (Proportional odds model) – for ordered factor variables (>= 2 levels)
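For instance, a minimal sketch of asking mice() for a specific method per column (the data frame df here is hypothetical, with a numeric, a binary factor, an unordered factor and an ordered factor column, in that order):
library(mice)
#one method per column, matched by position
imp <- mice(df, method = c("pmm", "logreg", "polyreg", "polr"), m = 5, seed = 123)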
Let’s understand it in R
data("iris")
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
MICE assumes the data are missing at random (MAR). Let's seed missing values into our data set using the prodNA() function, which you can access by installing the missForest package.
library(missForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
#generate 10% missing values at random
iris.mis <- prodNA(iris, noNA = 0.1)
#Check missing values introduced in the data
summary(iris.mis)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.200 Min. :1.000 Min. :0.10
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.30
## Median :5.800 Median :3.000 Median :4.400 Median :1.30
## Mean :5.814 Mean :3.064 Mean :3.792 Mean :1.18
## 3rd Qu.:6.400 3rd Qu.:3.400 3rd Qu.:5.100 3rd Qu.:1.80
## Max. :7.700 Max. :4.400 Max. :6.900 Max. :2.50
## NA's :17 NA's :18 NA's :12 NA's :18
## Species
## setosa :47
## versicolor:46
## virginica :47
## NA's :10
##
##
##
I've removed the categorical variable below so we can focus on continuous values. To treat a categorical variable, simply encode its levels and follow the same procedure.
#remove categorical variables
iris.mis <- subset(iris.mis, select = -c(Species))
summary(iris.mis)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.200 Min. :1.000 Min. :0.10
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.30
## Median :5.800 Median :3.000 Median :4.400 Median :1.30
## Mean :5.814 Mean :3.064 Mean :3.792 Mean :1.18
## 3rd Qu.:6.400 3rd Qu.:3.400 3rd Qu.:5.100 3rd Qu.:1.80
## Max. :7.700 Max. :4.400 Max. :6.900 Max. :2.50
## NA's :17 NA's :18 NA's :12 NA's :18
The mice package has a function called md.pattern(). It returns a tabular summary of the missing values present in each variable of the data set.
library(mice)
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.3.2
## mice 2.25 2015-11-09
md.pattern(iris.mis)
## Petal.Length Sepal.Length Sepal.Width Petal.Width
## 98 1 1 1 1 0
## 10 1 0 1 1 1
## 12 1 1 0 1 1
## 9 0 1 1 1 1
## 11 1 1 1 0 1
## 1 1 0 0 1 2
## 1 0 1 0 1 2
## 4 1 0 1 0 2
## 2 1 1 0 0 2
## 1 0 0 0 1 3
## 1 0 0 0 0 4
## 12 17 18 18 65
Let's understand this table. There are 98 observations with no missing values, 10 observations missing only Sepal.Length, 12 missing only Sepal.Width, and so on. The bottom row gives the total number of missing values in each variable (65 in all).
Let's create a visual representation of this using the VIM package.
library(VIM)
## Loading required package: colorspace
## Warning: package 'colorspace' was built under R version 3.3.2
## Loading required package: grid
## Loading required package: data.table
## Warning: package 'data.table' was built under R version 3.3.2
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
mice_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(iris.mis), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Sepal.Width 0.1200000
## Petal.Width 0.1200000
## Sepal.Length 0.1133333
## Petal.Length 0.0800000
Quick points from the graphs above:
1. About 65% of the observations in the data set have no missing values.
2. Sepal.Width and Petal.Width each have 12% missing values, Sepal.Length about 11%, and Petal.Length 8%.
3. The left-hand bar chart shows the proportion of missing values per variable, and the right-hand plot shows the combinations in which they occur.
Now let’s impute missing values.
imputed_Data <- mice(iris.mis, m=5, maxit = 50, method = 'pmm', seed = 500)
summary(imputed_Data)
Here is an explanation of the parameters used:
1. m – number of imputed data sets (5 here)
2. maxit – number of iterations used to impute the missing values
3. method – imputation method; we used predictive mean matching (pmm)
4. seed – random seed, so the results are reproducible
#check imputed values
imputed_Data$imp$Sepal.Width
## 1 2 3 4 5
## 13 3.7 3.9 3.1 3.7 3.1
## 18 3.1 3.6 3.5 3.9 3.7
## 19 3.4 3.4 3.4 3.9 3.4
## 36 3.1 3.6 3.3 3.2 3.5
## 53 3.2 2.7 3.4 3.0 3.0
## 61 2.7 3.0 3.0 2.3 3.0
## 75 3.2 3.2 2.9 3.2 2.8
## 76 3.0 3.0 2.8 3.6 3.4
## 77 3.1 3.1 3.8 3.4 3.2
## 80 2.5 3.8 3.0 3.8 2.9
## 89 2.7 2.9 2.9 2.8 2.4
## 92 2.5 2.8 2.5 3.2 2.9
## 110 2.6 3.3 3.3 2.4 2.4
## 115 3.0 2.8 3.0 3.0 2.7
## 136 3.5 3.7 3.1 3.4 3.2
## 139 2.4 3.4 2.8 3.0 2.9
## 141 3.4 2.3 3.0 3.2 3.0
## 143 2.2 2.7 3.4 2.4 2.9
Since there are 5 imputed data sets, you can select any one of them using the complete() function.
#get complete data ( 2nd out of 5)
completeData <- complete(imputed_Data,2)
Also, if you wish to build models on all 5 data sets, you can do it in one go using the with() command. You can then combine the results from these models into a consolidated (pooled) output using the pool() command.
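For instance, a minimal sketch continuing from the imputed_Data object above (the model formula is purely illustrative):
#build a model on each of the 5 imputed data sets
fit <- with(imputed_Data, lm(Sepal.Width ~ Sepal.Length + Petal.Width))
#pool the 5 sets of estimates into one consolidated result
pooled <- pool(fit)
summary(pooled)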
The Amelia package also performs multiple imputation (it generates several imputed data sets) to deal with missing values. Multiple imputation helps to reduce bias and increase efficiency. Amelia uses a bootstrap-based EMB algorithm, which makes it fast and able to impute many variables, including cross-sectional and time-series data. It also supports parallel imputation using multiple CPU cores.
It makes the following assumptions:
1. All the variables in the data set have a multivariate normal distribution (MVN); the data are summarized using means and covariances.
2. Missing data are missing at random (MAR).
It works this way: first, Amelia takes m bootstrap samples and applies the EMB algorithm to each sample, so the m sets of estimated means and covariances differ. The first set of estimates is then used to impute the first set of missing values via regression, the second set of estimates is used for the second set, and so on.
Compared with MICE, the MVN approach lags on some crucial aspects: MICE imputes data on a variable-by-variable basis, whereas MVN uses a joint model based on the multivariate normal distribution, and MICE handles mixed variable types directly, whereas MVN requires the variables to be (approximately) normally distributed. Hence, this package works best when the data have a multivariate normal distribution; if not, the data should be transformed to bring them closer to normality.
Let’s understand it practically now.
library(Amelia)
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2016 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The only thing that you need to be careful about is classifying the variables. amelia() has 3 parameters for this:
1. idvars – ID variables and any other variables you want kept out of the imputation model but copied into the output
2. noms – nominal (unordered categorical) variables
3. ords – ordinal (ordered categorical) variables
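For instance, a minimal sketch on a hypothetical data frame df with an ID column, a nominal column and an ordinal column (for iris below, only noms is needed):
fit_sketch <- amelia(df, m = 5, idvars = "id", noms = "gender", ords = "rating")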
#seed 10% missing values
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
#specify columns and run amelia
amelia_fit <- amelia(iris.mis, m=5, parallel = "multicore", noms = "Species")
#access imputed outputs
amelia_fit$imputations[[1]]
amelia_fit$imputations[[2]]
amelia_fit$imputations[[3]]
amelia_fit$imputations[[4]]
amelia_fit$imputations[[5]]
To check a particular column in a data set, use the following command:
amelia_fit$imputations[[5]]$Sepal.Length
## [1] 5.100000 4.900000 4.700000 4.600000 5.000000 5.400000 4.600000
## [8] 5.000000 4.400000 4.900000 5.400000 4.800000 4.800000 4.224045
## [15] 5.800000 5.700000 5.400000 5.100000 5.660459 5.100000 5.400000
## [22] 5.100000 4.600000 5.100000 4.800000 5.000000 5.000000 5.200000
## [29] 5.200000 4.700000 4.800000 5.400000 5.200000 5.500000 4.900000
## [36] 5.000000 5.500000 4.900000 4.400000 5.100000 5.000000 4.500000
## [43] 4.400000 5.000000 5.100000 4.800000 5.100000 4.600000 5.300000
## [50] 5.000000 7.000000 6.400000 6.900000 5.500000 6.500000 5.700000
## [57] 6.300000 4.900000 6.600000 5.200000 5.000000 5.900000 6.000000
## [64] 6.100000 5.600000 6.700000 5.600000 5.800000 6.200000 5.600000
## [71] 5.900000 6.458963 6.051256 6.100000 6.400000 6.600000 6.800000
## [78] 6.700000 6.000000 5.700000 5.500000 5.500000 5.435076 6.000000
## [85] 5.400000 6.000000 6.700000 6.300000 5.600000 5.500000 5.500000
## [92] 6.100000 5.800000 5.000000 5.600000 5.700000 5.700000 6.200000
## [99] 5.100000 5.700000 6.300000 5.800000 7.100000 6.300000 6.500000
## [106] 7.600000 4.900000 7.300000 6.700000 7.200000 6.500000 6.400000
## [113] 6.800000 5.700000 5.800000 6.400000 6.500000 7.700000 7.806741
## [120] 5.308664 6.900000 6.204861 7.700000 6.300000 6.700000 7.200000
## [127] 6.200000 6.100000 6.400000 7.200000 7.400000 7.900000 6.400000
## [134] 6.300000 6.100000 7.700000 6.300000 6.400000 6.000000 6.900000
## [141] 6.700000 6.900000 5.800000 6.736529 6.700000 6.700000 6.300000
## [148] 5.928050 6.200000 5.900000
#export the outputs to csv files
write.amelia(amelia_fit, file.stem = "imputed_data_set")
As the name suggests, missForest is an implementation of the random forest algorithm. It's a non-parametric imputation method applicable to various variable types. So, what's a non-parametric method?
A non-parametric method does not make explicit assumptions about the functional form of f (the underlying function). Instead, it tries to estimate f so that it is as close to the data points as possible without becoming impractical.
missForest yields an OOB (out-of-bag) imputation error estimate. Moreover, it provides a high level of control over the imputation process: it can return the OOB error separately for each variable instead of aggregating over the whole data matrix, which lets you see how accurately the model has imputed each variable.
Let's understand it practically. Since random forests handle categorical variables as well, we don't need to remove them here; missForest deals with missing values according to their variable types:
library(missForest)
#load data
data("iris")
#seed 10% missing values
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
#impute missing values, using all parameters as default values
iris.imp <- missForest(iris.mis)
#check imputed values
iris.imp$ximp
#check imputation error
iris.imp$OOBerror
NRMSE is the normalized root mean squared error; it represents the error from imputing continuous values. PFC (proportion of falsely classified) represents the error from imputing categorical values.
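If you prefer the per-variable view mentioned earlier, here is a minimal sketch using missForest's variablewise argument:
#request the OOB error for each variable separately
iris.imp.vw <- missForest(iris.mis, variablewise = TRUE)
iris.imp.vw$OOBerror   #one value per column: MSE for numeric, PFC for factors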
#comparing actual data accuracy
iris.err <- mixError(iris.imp$ximp, iris.mis, iris)
iris.err
## NRMSE PFC
## 0.1504115 0.1052632
This suggests that categorical variables were imputed with roughly 10% error and continuous variables with roughly 15% error. This can be improved by tuning the mtry and ntree parameters: mtry is the number of variables randomly sampled at each split, and ntree is the number of trees grown in the forest.
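A minimal tuning sketch (the mtry and ntree values here are illustrative, not recommendations):
#re-run the imputation with a tuned forest and compare the error again
iris.imp2 <- missForest(iris.mis, mtry = 3, ntree = 300)
mixError(iris.imp2$ximp, iris.mis, iris)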
Hmisc is a multi-purpose package useful for data analysis, high-level graphics, imputing missing values, advanced table making, model fitting and diagnostics (linear regression, logistic regression and Cox regression), and more. Among the wide range of functions in this package, it offers two powerful functions for imputing missing values: impute() and aregImpute(). It also has a transcan() function, but aregImpute() is generally the better choice.
The impute() function simply imputes missing values using a user-defined statistical summary (e.g. mean, max, median); its default is the median. aregImpute(), on the other hand, performs multiple imputation using additive regression, bootstrapping, and predictive mean matching.
In bootstrapping, a different bootstrap resample is used for each of the multiple imputations. A flexible additive model (a non-parametric regression method) is fitted on samples drawn with replacement from the original data, and the missing values (acting as the dependent variable) are predicted from the non-missing values (the independent variables).
It then uses predictive mean matching (the default) to impute the missing values. Predictive mean matching works well for continuous and categorical variables (binary and multi-level) without requiring residuals or a maximum likelihood fit.
Here are some important highlights of this package: aregImpute() identifies variable types automatically, and by default it assumes linearity in the variables being predicted (note the "Transformation of Target Variables Forced to be Linear" line in the output further below).
Let’s understand it practically.
#load data
data("iris")
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
#seed missing values ( 10% )
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.550 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.300 Median :1.300
## Mean :5.873 Mean :3.076 Mean :3.754 Mean :1.227
## 3rd Qu.:6.400 3rd Qu.:3.350 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## NA's :15 NA's :19 NA's :15 NA's :13
## Species
## setosa :49
## versicolor:46
## virginica :42
## NA's :13
##
##
##
# impute with mean value
iris.mis$imputed_age <- with(iris.mis, impute(Sepal.Length, mean))
# impute with random value
iris.mis$imputed_age2 <- with(iris.mis, impute(Sepal.Length, 'random'))
#similarly you can use min, max, median to impute missing value
#using aregImpute
impute_arg <- aregImpute(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width +
Species, data = iris.mis, n.impute = 5)
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
## Iteration 8
aregImpute() automatically identifies the variable types and treats them accordingly.
impute_arg
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~Sepal.Length + Sepal.Width + Petal.Length +
## Petal.Width + Species, data = iris.mis, n.impute = 5)
##
## n: 150 p: 5 Imputations: 5 nk: 3
##
## Number of NAs:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 15 19 15 13 13
##
## type d.f.
## Sepal.Length s 2
## Sepal.Width s 2
## Petal.Length s 2
## Petal.Width s 2
## Species c 2
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0.898 0.704 0.982 0.967 0.988
The output shows the R² values for predicting the non-missing values of each variable. The higher the value, the better the predictions. You can also check the imputed values using the following command:
#check imputed variable Sepal.Length
impute_arg$imputed$Sepal.Length
## [,1] [,2] [,3] [,4] [,5]
## 8 5.0 4.6 5.1 5.2 5.1
## 10 5.0 4.4 4.7 4.6 4.6
## 22 5.0 5.3 5.4 5.1 5.1
## 36 4.9 4.6 4.8 4.7 4.6
## 43 4.4 4.9 4.8 5.5 4.9
## 48 5.0 5.1 4.8 5.0 5.1
## 52 6.4 6.2 6.4 6.8 6.1
## 53 6.3 5.9 6.4 6.9 6.0
## 56 5.9 6.5 5.8 5.9 5.4
## 65 5.5 5.5 5.4 5.7 5.7
## 82 5.5 5.1 5.0 5.7 5.5
## 102 5.6 5.7 5.7 6.1 6.0
## 143 6.6 6.0 6.6 6.7 6.3
## 146 6.1 5.7 5.6 6.0 5.4
## 149 6.3 6.4 6.5 6.3 6.0
The mi (multiple imputation with diagnostics) package provides several features for dealing with missing values. Like the other packages, it builds multiple imputation models to approximate missing values, and it uses the predictive mean matching method.
I've already explained predictive mean matching (pmm) above, but if it isn't clear yet, here's a simpler version: for each missing entry of a variable, we find the observation (among the available values) whose predictive mean is closest to that of the missing entry. The observed value from this "match" is then used as the imputed value.
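Here is a hand-rolled toy illustration of that idea (not the actual internals of mice or Hmisc), continuing with the iris.mis object from above:
#fit on the rows where both variables are observed
obs <- iris.mis[!is.na(iris.mis$Sepal.Width) & !is.na(iris.mis$Sepal.Length), ]
fit <- lm(Sepal.Width ~ Sepal.Length, data = obs)
#predictive mean for a hypothetical row whose Sepal.Width is missing
target <- predict(fit, newdata = data.frame(Sepal.Length = 6.1))
#donor = observed row whose predictive mean is closest to the target
donor <- which.min(abs(fitted(fit) - target))
obs$Sepal.Width[donor]   #this observed value would be used as the imputation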
A unique characteristic of this package is that it provides graphical diagnostics of the imputation models and of the convergence of the imputation process.
Let’s understand it practically.
library(mi)
## Loading required package: Matrix
## Loading required package: stats4
## mi (Version 1.0, packaged: 2015-04-16 14:03:10 UTC; goodrich)
## mi Copyright (C) 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015 Trustees of Columbia University
## This program comes with ABSOLUTELY NO WARRANTY.
## This is free software, and you are welcome to redistribute it
## under the General Public License version 2 or later.
## Execute RShowDoc('COPYING') for details.
##
## Attaching package: 'mi'
## The following objects are masked from 'package:mice':
##
## complete, pool
#load data
data("iris")
#seed missing values ( 10% )
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
#imputing missing value with mi
mi_data <- mi(iris.mis, seed = 335)
I've used the default values of the other parameters (such as the number of chains and the number of iterations).
summary(mi_data)
## $Sepal.Length
## $Sepal.Length$is_missing
## missing
## FALSE TRUE
## 135 15
##
## $Sepal.Length$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.06800 -0.44890 0.08531 -0.08574 0.24260 0.59790
##
## $Sepal.Length$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.92860 -0.44510 -0.02194 0.00000 0.34070 1.24700
##
##
## $Sepal.Width
## $Sepal.Width$is_missing
## missing
## FALSE TRUE
## 137 13
##
## $Sepal.Width$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.84050 -0.24530 0.05836 0.04201 0.28630 0.77990
##
## $Sepal.Width$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.17700 -0.27250 -0.04624 0.00000 0.29310 1.53700
##
##
## $Petal.Length
## $Petal.Length$is_missing
## missing
## FALSE TRUE
## 139 11
##
## $Petal.Length$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.8111 -0.4138 0.2619 0.1279 0.5365 0.8304
##
## $Petal.Length$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7764 -0.6059 0.1615 0.0000 0.3889 0.9005
##
##
## $Petal.Width
## $Petal.Width$is_missing
## missing
## FALSE TRUE
## 135 15
##
## $Petal.Width$imputed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.89520 -0.55830 -0.01227 -0.03212 0.42540 0.82440
##
## $Petal.Width$observed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.72820 -0.59570 0.06674 0.00000 0.39800 0.86170
##
##
## $Species
## $Species$crosstab
##
## observed imputed
## setosa 188 26
## versicolor 168 33
## virginica 160 25
Here is a snapshot of the summary output produced by the mi package after imputing missing values. As shown, it compares summary statistics of the imputed and observed values for each variable (on a standardized scale).
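If you also want the completed data sets out of the mi object, here is a sketch (mi's complete() masks the mice version, as the package load message above showed; depending on m it may return a single data frame or a list of them, possibly with extra missingness-indicator columns):
#pull one completed data set out of the mi object
mi_completed <- mi::complete(mi_data, m = 1)
head(mi_completed)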
**End Notes**
So, which is the best of these 5 packages? I am sure many of you are asking this! Having created this tutorial, I feel Hmisc should be your first choice for missing value imputation, followed by missForest and MICE.
Hmisc automatically recognizes the variable types and uses bootstrap samples with predictive mean matching to impute missing values. You don't need to separate out or specially treat categorical variables, as we did with the MICE package. However, missForest can outperform Hmisc if the observed variables contain sufficient information.