I’ve selected the horse-kick data set; more information about it can be found at http://www.randomservices.org/random/data/HorseKicks.html.
library(pscl)   # provides the prussian data set
library(psych)  # provides describe()
data(prussian)
describe(prussian)
## vars n mean sd median trimmed mad min max range skew kurtosis
## y 1 280 0.7 0.87 0.0 0.56 0.00 0 4 4 1.23 1.12
## year 2 280 84.5 5.78 84.5 84.50 7.41 75 94 19 0.00 -1.22
## corp* 3 280 7.5 4.04 7.5 7.50 5.19 1 14 13 0.00 -1.23
## se
## y 0.05
## year 0.35
## corp* 0.24
hist(prussian$y,probability = TRUE,main='Horse Kicks')
curve(dnorm(x, mean=mean(prussian$y), sd=sd(prussian$y)), add=TRUE)
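Because the outcome is a count, one way to judge its shape is to compare the observed frequencies with what a Poisson distribution with the same mean would predict. The sketch below is only an illustration: it uses simulated counts as a stand-in for prussian$y (the rate 0.7 is taken from the describe() output above), and the names y, lambda_hat, observed and expected are my own.

```r
# Stand-in for prussian$y: 280 simulated Poisson counts (assumption, not the real data)
set.seed(1)
y <- rpois(280, lambda = 0.7)

lambda_hat <- mean(y)                                  # estimated Poisson rate
observed <- table(factor(y, levels = 0:max(y)))        # how often each count occurs
expected <- length(y) * dpois(0:max(y), lambda_hat)    # counts a Poisson would predict

# A Poisson variable has variance equal to its mean
c(mean = lambda_hat, variance = var(y))

# Observed vs. expected frequencies side by side
round(rbind(observed, expected), 1)

# Histogram with one bar per count, plus the Poisson probabilities as points
hist(y, breaks = seq(-0.5, max(y) + 0.5, 1), probability = TRUE, main = "Horse Kicks")
points(0:max(y), dpois(0:max(y), lambda_hat), pch = 19)
```

To run this on the real data, replace the simulated y with prussian$y.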
We can see that, regardless of cavalry corps, the number of deaths from horse kicks looks Poisson-distributed: most counts are 0 or 1, with a long right tail. Since we’re testing out data imputation, let’s remove some data and impute it back to see how the distribution changes.
sampleData50 = prussian
sampleData50$dummy = floor(runif(280,0,2))
sampleData50$y = ifelse(sampleData50$dummy==1, NA, sampleData50$y)
hist(sampleData50$y,main='Horse kicks with 50% of data NA',probability = TRUE)
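As a numeric complement to the histogram, the share of each count can be tabulated before and after masking. This is a minimal sketch on simulated counts standing in for prussian$y; the names masked, before and after are hypothetical, and the masking is written as a single ifelse() rather than through a dummy column.

```r
# Stand-in for prussian$y (assumption: simulated, not the real data)
set.seed(42)
y <- rpois(280, lambda = 0.7)

# Mask each value with probability 0.5, independently of y itself
masked <- ifelse(runif(length(y)) < 0.5, NA, y)

# Share of each count before and after masking (table() drops NA by default)
lv <- 0:max(y)
before <- prop.table(table(factor(y, levels = lv)))
after  <- prop.table(table(factor(masked, levels = lv)))
round(rbind(before, after), 3)

# Number of values actually removed: random, so only roughly half
sum(is.na(masked))
```

Because the mask does not depend on y, the two rows should differ only by sampling noise.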
Something to note here is that the distribution, for the most part, didn’t change when removing 50% of the data (roughly 140 observations). So before any imputation, a good question to ask is: ‘Is imputation necessary?’ You may hear some people claim, ‘No, as long as we have over 32 observations, we can get a good enough picture of our distribution.’ Let’s test that by keeping only about 10% of the observations.
sampleData10 = prussian
sampleData10$dummy = floor(runif(280,0,10))
sampleData10$y = ifelse(sampleData10$dummy!=1, NA, sampleData10$y)
hist(sampleData10$y,main='Horse kicks with roughly 90% of data NA',probability = TRUE)
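The filter above keeps only rows where dummy equals 1, so about 28 of the 280 observations survive. A rough way to see what that costs is the standard error of the mean, which for Poisson counts is sqrt(lambda / n). The sketch below is back-of-the-envelope arithmetic, assuming the rate 0.7 from the describe() output and a normal approximation for the interval; the variable names are my own.

```r
lambda <- 0.7                                # sample mean of y reported by describe()
n <- c(full = 280, half = 140, tenth = 28)   # approximate sample sizes in this post

# For Poisson counts the variance equals the mean, so SE(mean) = sqrt(lambda / n)
se <- sqrt(lambda / n)
round(se, 3)

# Approximate 95% interval for the mean at each sample size
round(cbind(lower = lambda - 1.96 * se, upper = lambda + 1.96 * se), 2)
```

The interval widens as n shrinks, which is the trade-off behind the ‘over 32 observations’ rule of thumb.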
Without further ado, let’s run the mice package on both samples and see what happens.
library(mice)
## Loading required package: lattice
imputedData50 = mice(sampleData50, m=2, maxit = 5, method = 'pmm', seed = 15)
##
## iter imp variable
## 1 1 y
## 1 2 y
## 2 1 y
## 2 2 y
## 3 1 y
## 3 2 y
## 4 1 y
## 4 2 y
## 5 1 y
## 5 2 y
imputedData10 = mice(sampleData10, m=2, maxit = 5, method = 'pmm', seed = 15)
##
## iter imp variable
## 1 1 y
## 1 2 y
## 2 1 y
## 2 2 y
## 3 1 y
## 3 2 y
## 4 1 y
## 4 2 y
## 5 1 y
## 5 2 y
imputedData50 = mice::complete(imputedData50,2)
imputedData10 = mice::complete(imputedData10,2)
par(mfrow=c(2,2))
hist(sampleData50$y,main='Horse kicks w/ 50% data')
hist(sampleData10$y,main='Horse kicks w/ 10% data')
hist(imputedData50$y, main='Imputed Horse kicks w/ 50% data')
hist(imputedData10$y, main='Imputed Horse kicks w/ 10% data')
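The code above keeps only the second completed data set from each run. A minimal sketch of looking at every imputation, and of pooling an estimate across them, might look like the following. It assumes the mice package is installed and uses a simulated data frame (toy) in place of the masked prussian data; m = 5 and the intercept-only Poisson model are illustrative choices, not part of the original analysis.

```r
library(mice)

# Simulated stand-in for the masked data (assumption: not the real prussian data)
set.seed(15)
toy <- data.frame(
  y    = rpois(280, lambda = 0.7),
  year = rep(75:94, times = 14),
  corp = factor(rep(1:14, each = 20))
)
toy$y[sample(280, 140)] <- NA   # mask half of y completely at random

# Keep the mids object under its own name so all m imputations stay available
imp <- mice(toy, m = 5, maxit = 5, method = "pmm", seed = 15, printFlag = FALSE)

# Stack the m completed data sets and compare the mean of y in each
long <- mice::complete(imp, action = "long")
tapply(long$y, long$.imp, mean)

# Fit the same model on every completed data set and pool with Rubin's rules
fit <- with(imp, glm(y ~ 1, family = poisson))
summary(pool(fit))
```

If the per-imputation means vary a lot, a single completed data set is giving an overly confident picture.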
Conclusion: the mice package works, but I’d only make a point of using it on a LARGE sparse dataset.