The last post discussed data exploration through data visualizations. In the exploratory phase, you may come across columns with varying shares of missing values. Columns where most of the values are missing are usually better removed, since imputing them would mean substituting too much of the data. Columns with only a few missing values can either be left alone or imputed. Then there are columns that can neither be ignored nor removed, and this is where imputing comes into play.
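To make the "how much is missing" decision concrete, here is a minimal base R sketch that computes the share of missing values per column and drops the columns above a cutoff. The toy data frame, its column names, and the 50% threshold are all made up for illustration; the right cutoff depends on your data.

```r
# Hypothetical toy data: one column is almost empty, one has a single gap.
df <- data.frame(
  id             = 1:10,
  mostly_missing = c(1, rep(NA, 9)),
  few_missing    = c(NA, 2:10),
  complete       = letters[1:10]
)

# Share of missing values per column (the mean of a logical is a proportion).
missing_share <- colMeans(is.na(df))
print(missing_share)

# Assumed cutoff: drop columns with more than 50% missing values.
threshold  <- 0.5
df_reduced <- df[, missing_share <= threshold, drop = FALSE]
print(names(df_reduced))
```

Columns that survive the cutoff but still contain gaps, like `few_missing` here, are the candidates for imputation.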
Imputing data simply means substituting missing values with estimated ones. The available methods range from basic mean, median, or mode substitution to KNN imputation, predictive mean matching, and random forests that predict the missing values.
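The post does not show how `iris.na` was created. The missing counts printed below add up to 150, which is 20% of the 750 cells in `iris`, so something like `prodNA(iris, noNA = 0.2)` from the missForest package seems likely, but that is an assumption. The sketch below rebuilds a comparable data set in base R and then applies the simplest methods from the list above (mean for numeric columns, mode for the factor) as a baseline. The helper names `mode_value` and `impute_simple` are hypothetical.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Assumption: blank out 20% of all cells at random, as prodNA() would.
n_cells <- nrow(iris) * ncol(iris)                      # 150 x 5 = 750 cells
na_idx  <- sample(n_cells, size = round(0.2 * n_cells)) # 150 cell positions
rows    <- (na_idx - 1) %% nrow(iris) + 1               # row of each cell
cols    <- (na_idx - 1) %/% nrow(iris) + 1              # column of each cell

iris.na <- iris
for (j in seq_along(iris.na)) {
  iris.na[rows[cols == j], j] <- NA
}

# Most frequent non-missing value of a vector.
mode_value <- function(x) {
  x  <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Mean for numeric columns, mode for everything else.
impute_simple <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- as.character(mode_value(x))
  }
  x
}

iris.simple <- as.data.frame(lapply(iris.na, impute_simple))
print(colSums(is.na(iris.simple)))  # no missing values remain
```

Swapping `mean` for `median` gives median imputation. These baselines ignore the relationships between columns, which is what the model-based methods used in the rest of this post try to exploit.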
colSums(is.na(iris.na))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 21 27 32 32 38
library(DataExplorer)  # provides plot_missing()
plot_missing(iris.na)
library(missForest)
imp_data1 <- missForest(iris.na, xtrue = dataset)  # xtrue: the complete data, used to compute the true error
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
head(imp_data1$ximp)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.500000 1.400000 0.2 setosa
## 2 4.9 3.000000 1.400000 0.2 setosa
## 3 4.7 3.200000 1.300000 0.2 setosa
## 4 4.6 3.100000 1.500000 0.2 setosa
## 5 5.0 3.600000 1.400000 0.2 setosa
## 6 5.4 3.745983 1.605083 0.4 setosa
imp_data1$OOBerror
## NRMSE PFC
## 0.12967692 0.03571429
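NRMSE stands for normalized root mean squared error and is reported for the continuous variables. PFC is the proportion of falsely classified entries and is reported for the categorical variables. OOBerror is an out-of-bag estimate of the imputation error. Because the complete data was passed in through xtrue, missForest also returns the true imputation error: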
imp_data1$error
## NRMSE PFC
## 0.17559607 0.05263158
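When the true values are known, both error measures can be computed by hand. The sketch below follows the definition I understand missForest to use: the squared error over the imputed entries, normalized by the variance of the true values at those entries. Other packages may normalize differently, so treat it as an illustration. The function names and the toy vectors are hypothetical.

```r
# NRMSE over the entries that were missing (continuous variable).
nrmse <- function(imputed, truth, was_missing) {
  err <- imputed[was_missing] - truth[was_missing]
  sqrt(mean(err^2) / var(truth[was_missing]))
}

# PFC: share of imputed categorical entries that differ from the truth.
pfc <- function(imputed, truth, was_missing) {
  mean(as.character(imputed[was_missing]) != as.character(truth[was_missing]))
}

# Toy continuous example: three of six values were missing and imputed.
truth_num   <- c(1, 2, 3, 4, 5, 6)
missing_num <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
imputed_num <- c(1, 2.5, 3, 4, 5, 5.5)
nrmse_value <- nrmse(imputed_num, truth_num, missing_num)
print(nrmse_value)

# Toy categorical example: one of three imputed labels is wrong.
truth_cat   <- c("a", "b", "a", "c")
missing_cat <- c(TRUE, TRUE, FALSE, TRUE)
imputed_cat <- c("a", "a", "a", "c")
pfc_value   <- pfc(imputed_cat, truth_cat, missing_cat)
print(pfc_value)
```

For both measures, values closer to 0 mean the imputed entries are closer to the truth.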
library(mice)
imp_data2 <- mice(iris.na, method = "pmm", m = 5)  # pmm: predictive mean matching, m: number of imputed data sets
##
## iter imp variable
## 1 1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 2 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 3 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 2 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 3 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 4 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 2 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 3 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 4 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 4 1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 4 2 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 4 3 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 4 4 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 4 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5 1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5 2 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5 3 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5 4 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 5 5 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
data_imp_pmm <- complete(imp_data2)
head(data_imp_pmm)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.5 1.5 0.4 setosa
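To make predictive mean matching less of a black box, here is a simplified base R sketch for a single column. It fits a regression on the observed rows, predicts every row, and fills each gap with the observed value of a randomly chosen "donor" whose prediction is among the closest. This is not what mice runs internally: as far as I know, mice also draws the regression coefficients at random and handles several incomplete columns in turn. The function name `pmm_once`, the choice of 5 donors, and the 30 artificial gaps are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Toy setup: hide 30 Petal.Length values and predict them from Petal.Width.
dat  <- iris[, c("Petal.Length", "Petal.Width")]
miss <- sample(nrow(dat), 30)
dat$Petal.Length[miss] <- NA

pmm_once <- function(y, x, donors = 5) {
  d   <- data.frame(y = y, x = x)
  obs <- !is.na(y)

  # 1. Fit a regression on the observed rows and predict every row.
  fit  <- lm(y ~ x, data = d[obs, ])
  pred <- predict(fit, newdata = d)

  y_obs    <- y[obs]
  pred_obs <- pred[obs]
  k        <- min(donors, length(y_obs))

  # 2. For each gap, find the k observed rows with the closest prediction
  #    and copy the observed value of one of them, chosen at random.
  for (i in which(!obs)) {
    nearest <- order(abs(pred_obs - pred[i]))[seq_len(k)]
    y[i]    <- y_obs[nearest[sample.int(k, 1)]]
  }
  y
}

imputed <- pmm_once(dat$Petal.Length, dat$Petal.Width)
print(head(imputed[miss]))
```

Because every imputed value is copied from an observed row, the results always look like real measurements, which is one reason the method tends to preserve the shape of the original distribution.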
Let’s check how closely the values imputed with predictive mean matching follow the distribution of the observed data:
densityplot(imp_data2)
Imputing data can be an important step in developing an effective model, and it is just as important to evaluate the methods being used and the values they produce. Predictive mean matching produced realistic values that matched the data set: in row 6, for example, it filled in 3.5 and 1.5, which have the same one-decimal precision as the observed measurements. The random forest approach came close, but its values (3.745983 and 1.605083 in the same row) did not follow the data set as faithfully.
Sources:
https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
https://rdrr.io/cran/missForest/man/prodNA.html
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/