D621B1

Michael Muller

May 24, 2018


Experimenting with the mice package in R for data imputation.

I’ve selected the horse-kick data set; more information about it can be found at http://www.randomservices.org/random/data/HorseKicks.html

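library(pscl)   # assumed source of the prussian data set
library(psych)  # assumed source of describe(), judging by the output format
data(prussian)
describe(prussian)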
##       vars   n mean   sd median trimmed  mad min max range skew kurtosis
## y        1 280  0.7 0.87    0.0    0.56 0.00   0   4     4 1.23     1.12
## year     2 280 84.5 5.78   84.5   84.50 7.41  75  94    19 0.00    -1.22
## corp*    3 280  7.5 4.04    7.5    7.50 5.19   1  14    13 0.00    -1.23
##         se
## y     0.05
## year  0.35
## corp* 0.24
hist(prussian$y,probability = TRUE,main='Horse Kicks')
curve(dnorm(x, mean=mean(prussian$y), sd=sd(prussian$y)), add=TRUE)
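
The normal curve is only a visual reference, so here is a rough sketch of a more direct Poisson check: compare observed frequencies with the counts a Poisson with the same mean would predict. The frequency table below is the commonly cited tabulation for the 280 corps-years and is an assumption on my part (it agrees with the mean of 0.7 and sd of 0.87 above); with the data loaded, `table(prussian$y)` should be used instead.

```r
# Assumed tabulation of deaths per corps-year; replace with table(prussian$y)
deaths   <- 0:4
observed <- c(144, 91, 32, 11, 2)

n      <- sum(observed)                  # number of corps-years
lambda <- sum(deaths * observed) / n     # sample mean, used as the Poisson rate

# Expected frequencies under Poisson(lambda), for deaths = 0..4
expected <- n * dpois(deaths, lambda)

comparison <- data.frame(deaths, observed, expected = round(expected, 1))
print(comparison)

# A Poisson variable has variance equal to its mean, so this ratio should be near 1
y <- rep(deaths, times = observed)
dispersion <- var(y) / mean(y)
print(dispersion)
```

A ratio close to 1 and expected counts near the observed ones are consistent with a Poisson model, though this is an eyeball check rather than a formal test.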

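Pooled across cavalry corps, the distribution of horse-kick deaths looks approximately Poisson: the counts are right-skewed, the variance (0.87² ≈ 0.76) is close to the mean (0.7), and the normal curve overlaid for reference is a poor fit. Since we’re testing out data imputation, let’s remove some data and impute it back to see how the distribution changes.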

sampleData50 = prussian
sampleData50$dummy = floor(runif(280,0,2))
sampleData50$y = ifelse(sampleData50$dummy==1, NA, sampleData50$y)
hist(sampleData50$y,main='Horse kicks with 50% of data NA',probability = TRUE)
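
The `runif` mask above removes about half of the values, but the exact count changes on every run. As a small sketch of an alternative, the helper below (`mask_fraction` is a hypothetical name, not part of any package) blanks out an exact fraction reproducibly. The vector `y` is a stand-in built from an assumed tabulation of the horse-kick counts; in practice it would be `prussian$y`.

```r
set.seed(15)  # makes the mask reproducible

# Stand-in for prussian$y, built from an assumed frequency table
y <- rep(0:4, times = c(144, 91, 32, 11, 2))

# Hypothetical helper: set exactly `fraction` of the values to NA, chosen at random
mask_fraction <- function(values, fraction) {
  n_missing <- round(length(values) * fraction)
  values[sample.int(length(values), n_missing)] <- NA
  values
}

y50 <- mask_fraction(y, 0.5)  # half of the values removed
y90 <- mask_fraction(y, 0.9)  # only 10% of the values kept

print(c(missing50 = sum(is.na(y50)), missing90 = sum(is.na(y90))))
```

Because the positions are chosen completely at random, the remaining values are a simple random subsample of the original data.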

Note that the distribution, for the most part, didn’t change after removing 50% of the data (roughly 140 observations). So before any imputation, a good question to ask is: ‘Is imputation necessary?’ You may hear some people claim, ‘No, as long as we have over 32 observations, we can get a good enough picture of our distribution.’ Let’s test that by keeping only about 10% of the data (roughly 28 observations).

sampleData10 = prussian
sampleData10$dummy = floor(runif(280,0,10))
# Keep y only where dummy == 1 (about 10% of rows); the other ~90% become NA
sampleData10$y = ifelse(sampleData10$dummy!=1, NA, sampleData10$y)
hist(sampleData10$y,main='Horse kicks with roughly 90% of data NA',probability = TRUE)
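
A single random subsample can look fine or terrible by luck, so here is a sketch that repeats the experiment many times and compares how much the estimated mean wanders at 140 versus 28 retained observations. The vector `y` is again a stand-in from an assumed tabulation (swap in `prussian$y`), and `subsample_means` is a hypothetical helper name.

```r
set.seed(15)

# Stand-in for prussian$y, built from an assumed frequency table
y <- rep(0:4, times = c(144, 91, 32, 11, 2))

# Hypothetical helper: mean of `reps` random subsamples of a given size
subsample_means <- function(values, size, reps = 2000) {
  replicate(reps, mean(sample(values, size)))
}

means140 <- subsample_means(y, 140)  # roughly 50% of the data kept
means28  <- subsample_means(y, 28)   # roughly 10% of the data kept

# Standard deviation of the estimated mean across repeated subsamples
spread <- c(n140 = sd(means140), n28 = sd(means28))
print(round(spread, 3))
```

The smaller subsample should show a noticeably wider spread, which is the sense in which 28 observations give a shakier picture than 140.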

Without further ado, let’s run the mice package on both samples and see what happens.

library(mice)
## Loading required package: lattice
imputedData50 = mice(sampleData50, m=2, maxit = 5, method = 'pmm', seed = 15)
## 
##  iter imp variable
##   1   1  y
##   1   2  y
##   2   1  y
##   2   2  y
##   3   1  y
##   3   2  y
##   4   1  y
##   4   2  y
##   5   1  y
##   5   2  y
imputedData10 = mice(sampleData10, m=2, maxit = 5, method = 'pmm', seed = 15)
## 
##  iter imp variable
##   1   1  y
##   1   2  y
##   2   1  y
##   2   2  y
##   3   1  y
##   3   2  y
##   4   1  y
##   4   2  y
##   5   1  y
##   5   2  y
# Keep only the second of the m = 2 imputed data sets from each run
imputedData50 = mice::complete(imputedData50,2)
imputedData10 = mice::complete(imputedData10,2)
par(mfrow=c(2,2))
hist(sampleData50$y,main='Horse kicks w/ 50% data')
hist(sampleData10$y,main='Horse kicks w/ 10% data')
hist(imputedData50$y, main='Imputed Horse kicks w/ 50% data')
hist(imputedData10$y, main='Imputed Horse kicks w/ 10% data')
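
To make the 'pmm' method used above less of a black box, here is a stripped-down sketch of the idea behind predictive mean matching in base R. It is a simplification under my own assumptions, not the mice implementation (which, among other things, also adds randomness to the regression step). The toy data, `pmm_impute`, and the `donors` argument are all hypothetical.

```r
set.seed(15)

# Toy data (hypothetical): a count outcome that depends on one numeric predictor
n <- 200
x <- runif(n, 0, 10)
y <- rpois(n, lambda = 0.2 + 0.1 * x)
y_missing <- y
y_missing[sample.int(n, 100)] <- NA  # hide half of the outcomes

# Simplified predictive mean matching:
# 1. fit a regression on the observed rows,
# 2. predict for every row,
# 3. for each missing row, find the observed rows with the closest predictions,
# 4. copy the observed value of one of those donors, picked at random.
pmm_impute <- function(y, x, donors = 5) {
  d    <- data.frame(y = y, x = x)
  obs  <- !is.na(y)
  fit  <- lm(y ~ x, data = d[obs, ])
  pred <- predict(fit, newdata = d)

  y_obs    <- y[obs]
  pred_obs <- pred[obs]
  imputed  <- y
  for (i in which(!obs)) {
    nearest <- order(abs(pred_obs - pred[i]))[seq_len(donors)]
    imputed[i] <- y_obs[nearest[sample.int(length(nearest), 1)]]
  }
  imputed
}

completed <- pmm_impute(y_missing, x)
print(table(completed))
```

Since every imputed value is copied from an observed one, this kind of imputation can only reproduce values already present in the retained data, which is worth keeping in mind when reading the histograms above.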

Imputing data in an already large enough dataset (over 100 observations) did not change the distribution. If you’re going to train an algorithm that can handle NAs, I don’t see a pressing reason to impute. Although using the mice package on a sparse dataset can yield a distribution closer to the true one, it could also work against you, as seen above.

Conclusion: the mice package works, but I’d only make a point of using it on a LARGE sparse dataset.