Data Imputation

The last post discussed data exploration through visualization. During the exploratory phase, you may come across columns with varying amounts of missing values. Columns with a high percentage of missing values should usually be removed, since imputing them would mean substituting too much of the data. Columns with a small percentage can either be left alone or imputed. Then there are columns that can neither be ignored nor removed, and this is where imputation comes into play.

Imputing data simply means substituting missing values with estimated ones. Methods range from basic mean, median, or mode substitution to KNN imputation, predictive mean matching, or using a random forest to predict the missing values.
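As a minimal sketch of the simplest of these methods, the base R code below first blanks out cells of iris at random and then fills numeric columns with the median and factor columns with the most frequent level. The names iris_mcar and impute_simple, the seed, and the 20% missing rate are assumptions made for illustration. The 20% figure is only inferred from the 150 missing cells counted below (iris has 750 cells), and the prodNA() function from missForest linked in the sources produces this kind of data, so the original iris.na may well have been created that way.

```r
# Hypothetical setup: set 20% of all cells in iris to NA, completely at random.
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible
iris_mcar <- iris
n_cells <- nrow(iris) * ncol(iris)
na_cells <- sample(n_cells, floor(0.2 * n_cells))
na_mask <- matrix(FALSE, nrow = nrow(iris), ncol = ncol(iris))
na_mask[na_cells] <- TRUE
iris_mcar[na_mask] <- NA

# Replace NAs in a single column: median for numeric columns,
# most frequent level (the mode) for factor columns.
impute_simple <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    mode_level <- names(which.max(table(x)))  # table() ignores NA by default
    x[is.na(x)] <- mode_level
  }
  x
}

# Apply the column-wise imputation while keeping the data frame structure.
iris_simple <- iris_mcar
iris_simple[] <- lapply(iris_mcar, impute_simple)

colSums(is.na(iris_simple))  # every column should now report zero missing values
```

Median and mode substitution is fast, but it assigns the same value to every gap in a column, which shrinks the variance. That is one reason to look at the model-based methods in the rest of this post.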

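library(DataExplorer)  # provides plot_missing()
# Count the missing values in each column
colSums(is.na(iris.na))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##           21           27           32           32           38
# Plot the percentage of missing values per column
plot_missing(iris.na)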


Imputing Data with Various Methods

missForest

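library(missForest)
# xtrue is the complete data set (assumed here to be the original iris);
# it is only used to compute the true imputation error
imp_data1 <- missForest(iris.na, xtrue = iris)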
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
head(imp_data1$ximp)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1    3.500000     1.400000         0.2  setosa
## 2          4.9    3.000000     1.400000         0.2  setosa
## 3          4.7    3.200000     1.300000         0.2  setosa
## 4          4.6    3.100000     1.500000         0.2  setosa
## 5          5.0    3.600000     1.400000         0.2  setosa
## 6          5.4    3.745983     1.605083         0.4  setosa
imp_data1$OOBerror
##      NRMSE        PFC 
## 0.12967692 0.03571429

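NRMSE stands for normalized root mean squared error and summarizes the imputation error for the continuous variables, while PFC is the proportion of falsely classified entries for the categorical variable (Species). Both values in OOBerror are out-of-bag estimates, so they are computed without knowing the true values, and lower is better. Because the complete data was supplied through xtrue, missForest also reports the true imputation error: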
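To make the two error measures concrete, here is a minimal base R sketch of how they can be computed when the true values are known. It follows the definitions in the missForest documentation as I understand them (squared error over the imputed entries, divided by the variance of the true values at those entries), but the function names nrmse and pfc and the toy vectors are assumptions for illustration, not the package's exported API.

```r
# NRMSE for continuous data: error over the originally missing entries only,
# normalized by the variance of the true values at those entries.
nrmse <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  sqrt(mean((ximp[mis] - xtrue[mis])^2) / var(xtrue[mis]))
}

# PFC for categorical data: share of imputed entries that differ from the truth.
pfc <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(ximp[mis] != xtrue[mis])
}

# Toy continuous example (hypothetical numbers)
num_true <- c(1, 2, 3, 4, 5, 6)
num_mis  <- c(1, NA, 3, NA, 5, NA)
num_imp  <- c(1, 2.5, 3, 4, 5, 5)
num_err  <- nrmse(num_imp, num_mis, num_true)

# Toy categorical example (hypothetical labels): one of two imputed labels is wrong
cat_true <- c("a", "b", "a", "b")
cat_mis  <- c("a", NA, NA, "b")
cat_imp  <- c("a", "b", "b", "b")
cat_err  <- pfc(cat_imp, cat_mis, cat_true)  # 0.5

print(c(NRMSE = num_err, PFC = cat_err))
```

Both measures are 0 for a perfect imputation, which is why lower values indicate a better fit.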

imp_data1$error
##      NRMSE        PFC 
## 0.17559607 0.05263158

Using Predictive Mean Matching

library(mice)
# Create m = 5 imputed data sets using predictive mean matching
imp_data2 <- mice(iris.na, method = "pmm", m = 5)
## 
##  iter imp variable
##   1   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   1   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   1   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   1   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   1   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   2   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   2   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   2   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   2   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   2   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   3   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   3   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   3   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   3   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   3   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   4   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   4   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   4   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   4   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   4   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   5   1  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   5   2  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   5   3  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   5   4  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
##   5   5  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
# complete() returns the first of the five imputed data sets by default
data_imp_pmm <- complete(imp_data2)
head(data_imp_pmm)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.5          1.5         0.4  setosa
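The idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified illustration for a single variable, not what mice does internally: mice additionally draws the regression coefficients at random and cycles through all incomplete variables. The function name pmm_impute, the choice of predictors, the seed, and the donor count k = 5 are assumptions made for the example.

```r
# Simplified predictive mean matching for one numeric variable y,
# using a data frame X of fully observed predictors.
pmm_impute <- function(y, X, k = 5) {
  obs <- !is.na(y)
  dat <- data.frame(y = y, X)

  # 1. Fit a linear model on the cases where y is observed.
  fit <- lm(y ~ ., data = dat[obs, , drop = FALSE])

  # 2. Predict the mean of y for every case, observed or missing.
  pred <- predict(fit, newdata = dat)
  pred_obs <- pred[obs]
  y_obs <- y[obs]

  # 3. For each missing case, find the k observed cases with the closest
  #    predicted mean and copy the observed value of one random donor.
  n_donors <- min(k, length(y_obs))
  for (i in which(!obs)) {
    donors <- order(abs(pred_obs - pred[i]))[seq_len(n_donors)]
    pick <- donors[sample.int(n_donors, 1)]
    y[i] <- y_obs[pick]
  }
  y
}

# Usage on iris: remove 30 Petal.Length values, then impute them.
set.seed(1)  # arbitrary seed for reproducibility
y <- iris$Petal.Length
y[sample(nrow(iris), 30)] <- NA
X <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Width")]
y_imp <- pmm_impute(y, X)

# Every imputed value is a value that was actually observed.
all(y_imp[is.na(y)] %in% y[!is.na(y)])
```

Because each gap is filled with a donor's real measurement, the imputed values always fall within the observed range and keep the same precision as the data, which is why the imputed cells in row 6 above look like ordinary measurements.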

Let’s check how plausible the imputed values are by comparing the density of the imputed values in each of the five imputed data sets with the density of the observed values:

densityplot(imp_data2)


Conclusion

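Imputing data can be an important step in developing an effective model, and it’s important to evaluate both the methods being used and the values they produce. In this example, predictive mean matching produced realistic values, since it fills each gap with a value actually observed elsewhere in the data (row 6 above receives 3.5 and 1.5). The random forest approach came close, but it returned predictions such as 3.745983 and 1.605083 that match the data set less closely than the one-decimal measurements it contains.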


Sources:

https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
https://rdrr.io/cran/missForest/man/prodNA.html
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/