Introduction

Missing data can be a non-trivial problem when analysing a dataset, and accounting for it is usually not straightforward either.

If the amount of missing data is very small relative to the size of the dataset, then leaving out the few samples with missing features may be the best strategy in order not to bias the analysis. However, leaving out available datapoints deprives the data of some amount of information, and depending on the situation you face, you may want to look for other fixes before wiping out potentially useful datapoints from your dataset.

Airquality Dataset as Example

In this post we are going to impute missing values using the airquality dataset (available in R). For the purpose of the article I am going to remove some datapoints from the dataset.
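
As a minimal sketch of how the summary that follows could be produced: the variable name `aq` is an assumption of this example, not something from the original post. The summary shows 37 missing Ozone values and 7 missing Solar.R values, which is what the dataset contains as shipped, so the sketch assumes no further values were removed.

```r
# airquality ships with base R (datasets package), so no extra package is needed.
aq <- airquality   # 'aq' is a hypothetical name used only in this sketch

# Per-column summary; columns with missing values get an extra "NA's" row.
print(summary(aq))

# Count missing values per column explicitly.
na_counts <- colSums(is.na(aq))
print(na_counts)
```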

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
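
To put the listwise-deletion strategy from the introduction into numbers, the following sketch computes the percentage of missing values per variable and the number of rows that would survive if every incomplete row were dropped. The names `aq`, `pct_missing` and `aq_complete` are assumptions of this example.

```r
aq <- airquality   # hypothetical name, as in the rest of these sketches

# Percentage of missing values per variable.
pct_missing <- round(100 * colMeans(is.na(aq)), 1)
print(pct_missing)

# Listwise deletion: keep only rows with no missing value at all.
aq_complete <- na.omit(aq)

# Compare the number of rows before and after deletion.
cat("rows in original data: ", nrow(aq), "\n")
cat("rows after na.omit():  ", nrow(aq_complete), "\n")
```

Dropping incomplete rows here would discard a sizeable share of the data, which is the situation where imputation becomes worth considering.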

Using mice to look at the missing data pattern

The mice package provides a nice function, md.pattern(), to get a better understanding of the pattern of missing data.
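
A call along the following lines should produce a table like the one that follows. This is a sketch that assumes the unmodified airquality data and a version of mice in which md.pattern() accepts `plot = FALSE` (mice 3.x); the name `pat` is hypothetical.

```r
library(mice)

# md.pattern() returns a matrix: one row per missingness pattern,
# 1 = observed, 0 = missing. plot = FALSE skips the graphical display.
pat <- md.pattern(airquality, plot = FALSE)
print(pat)
```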

##     Wind Temp Month Day Solar.R Ozone   
## 111    1    1     1   1       1     1  0
## 35     1    1     1   1       1     0  1
## 5      1    1     1   1       0     1  1
## 2      1    1     1   1       0     0  2
##        0    0     0   0       7    37 44

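Each row is a missingness pattern (1 = observed, 0 = missing): the leftmost number is how many samples show that pattern, and the rightmost number is how many variables are missing in it. The output tells us that 111 samples are complete, 35 samples miss only the Ozone measurement, 5 samples miss only the Solar.R value, and 2 samples miss both. The bottom row counts the missing values per variable: 7 for Solar.R and 37 for Ozone, 44 in total.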

Imputing the missing data

The mice() function takes care of the imputing process.

m=5 refers to the number of imputed datasets; five is the default value. meth='pmm' refers to the imputation method, in this case predictive mean matching. Other imputation methods can be used; type methods(mice) for a list of the available imputation methods.
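
A minimal sketch of such a call is shown here. The object name `tempData`, the `seed` value and `printFlag = FALSE` are assumptions added for reproducibility and quieter output; only `m` and the `pmm` method come from the text (the text's `meth` is a partial match for the full argument name `method`).

```r
library(mice)

# Impute with predictive mean matching, producing m = 5 imputed datasets.
# seed (arbitrary value) makes the run reproducible; printFlag = FALSE
# suppresses the per-iteration log.
tempData <- mice(airquality, m = 5, method = "pmm",
                 seed = 500, printFlag = FALSE)

# Overview of the imputation: methods used and the predictor matrix.
summary(tempData)
```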

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day 
##   "pmm"   "pmm"      ""      ""      ""      "" 
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   1
## Solar.R     1       0    1    1     1   1
## Wind        1       1    0    1     1   1
## Temp        1       1    1    0     1   1
## Month       1       1    1    1     0   1
## Day         1       1    1    1     1   0

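If you would like to check the imputed data, for instance for the variable Ozone, inspect the imp component of the object returned by mice(). It holds, for each variable, one row per missing entry and one column per imputed dataset.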
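
As a sketch, assuming the imputation result is stored in an object called `tempData` (a hypothetical name, with an arbitrary seed), the imputed Ozone values can be pulled out like this:

```r
library(mice)

# Re-create the imputation so the sketch is self-contained.
tempData <- mice(airquality, m = 5, method = "pmm",
                 seed = 500, printFlag = FALSE)

# One row per originally missing Ozone value, one column per imputed dataset.
ozone_imp <- tempData$imp$Ozone
print(head(ozone_imp))
print(dim(ozone_imp))
```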

Now we can get back the completed dataset using the complete() function, whose name is almost plain English.

The missing values have been replaced with the imputed values in the first of the five datasets. If you wish to use another one, just change the second parameter in the complete() function.
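
A minimal sketch of that step, again assuming the hypothetical object name `tempData` and an arbitrary seed; `completedData` is likewise a name chosen for this example:

```r
library(mice)

tempData <- mice(airquality, m = 5, method = "pmm",
                 seed = 500, printFlag = FALSE)

# The second argument selects which of the m imputed datasets to return.
# mice::complete is spelled out to avoid clashes with other packages
# that also export a complete() function.
completedData <- mice::complete(tempData, 1)

# The observed values are untouched; only the NAs have been filled in.
print(colSums(is.na(completedData)))
```

Passing 2, 3, 4 or 5 as the second argument returns the corresponding alternative completed dataset.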

Inspecting the distribution of original and imputed data

Let’s compare the distributions of original and imputed data using some useful plots. First of all, we can use a scatterplot and plot Ozone against all the other variables.
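
One way to draw such a scatterplot is the xyplot() method that mice provides for imputation objects. The sketch assumes the hypothetical object name `tempData`; the choice of Wind, Temp and Solar.R on the right-hand side and the `pch`/`cex` settings are illustrative, not taken from the original post.

```r
library(mice)
library(lattice)   # supplies the xyplot() generic that mice extends

tempData <- mice(airquality, m = 5, method = "pmm",
                 seed = 500, printFlag = FALSE)

# Ozone against the other measurements; observed and imputed points
# are drawn in different colours.
p <- xyplot(tempData, Ozone ~ Wind + Temp + Solar.R, pch = 18, cex = 1)
print(p)
```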

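The shape of the magenta points (imputed) matches the shape of the blue ones (observed). The matching shape tells us that the imputed values are indeed “plausible values”. Another helpful plot is the density plot, which overlays the density of the imputed values from each imputed dataset on the density of the observed values.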
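
A sketch of the density plot, under the same assumptions as before (hypothetical object name `tempData`, arbitrary seed):

```r
library(mice)
library(lattice)   # supplies the densityplot() generic that mice extends

tempData <- mice(airquality, m = 5, method = "pmm",
                 seed = 500, printFlag = FALSE)

# One panel per imputed variable: the observed density plus one
# density curve for each of the m imputed datasets.
d <- densityplot(tempData)
print(d)
```

Imputed densities that follow the observed one closely are what we would hope to see; large departures are a reason to look more carefully at the imputation model.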

Conclusion

The variable modelFit1 contains the results of the fitting performed over the imputed datasets, while the pool() function pools them all together. Apparently, only the Ozone variable is statistically significant.
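
The fitting and pooling steps described above can be sketched as follows. The model formula (Temp regressed on Ozone, Solar.R and Wind) is an assumption made for illustration, as are the object names `tempData` and `pooled`; only `modelFit1` and `pool()` are named in the text.

```r
library(mice)

tempData <- mice(airquality, m = 5, method = "pmm",
                 seed = 500, printFlag = FALSE)

# Fit the same linear model on each of the m imputed datasets.
# The formula is a hypothetical choice for this sketch.
modelFit1 <- with(tempData, lm(Temp ~ Ozone + Solar.R + Wind))

# Combine the m sets of estimates into one set using Rubin's rules.
pooled <- pool(modelFit1)
print(summary(pooled))
```

The pooled summary reports one row per model term, with standard errors that reflect both the within-imputation and the between-imputation variability.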

Reference

van Buuren, Stef, and Karin Groothuis-Oudshoorn. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software [Online], 45.3 (2011): 1-67. Web. 21 May 2020.