Real-life datasets often contain missing values, and this can be a real problem for research. How can we recover the missing values in our datasets? In this post we will consider one very interesting method of recovering them: bagging of regression trees. It recovers missing values for several variables at once, based on the regression dependencies between the variables.
For bagging of regression trees we will use the caret package for the R programming language. As a learning dataset we use a file with the results of air pollution monitoring at one of the monitoring stations in the city of Kryvyi Rih (Ukraine).
For writing this post I used R version 3.3.3 and caret package version 6.0-76.
So! Let’s go!
First of all, we need to install the caret package and load it into the work environment:
library(caret)
Loading required package: lattice
Loading required package: ggplot2
Secondly, we must download the learning dataset. Its name is PSZ72014.csv. After that, we load the dataset into the work environment and use the summary function to inspect it:
PSZ72014 <- read.csv("PSZ72014.csv", header = TRUE, sep = ",", dec = ".")
summary(PSZ72014)
            time.posCT.Kiev       ws              temp          dew_point
 2014-01-02 08:00:00:   1   Min.   : 0.000   Min.   :-22.00   Min.   :-27.000
 2014-01-02 13:00:00:   1   1st Qu.: 3.000   1st Qu.:  2.00   1st Qu.: -1.000
 2014-01-02 20:00:00:   1   Median : 4.000   Median : 10.00   Median :  5.000
 2014-01-03 01:00:00:   1   Mean   : 4.058   Mean   : 10.69   Mean   :  4.519
 2014-01-03 08:00:00:   1   3rd Qu.: 5.000   3rd Qu.: 19.00   3rd Qu.: 11.000
 2014-01-03 13:00:00:   1   Max.   :13.000   Max.   : 36.00   Max.   : 21.000
 (Other)            :1070   NA's   :12       NA's   :12       NA's   :12
       rh          atmos_pres        SO2                CO             NO2
 Min.   : 18.40   Min.   : 995   Min.   :0.00000   Min.   :0.000   Min.   :0.00000
 1st Qu.: 53.52   1st Qu.:1011   1st Qu.:0.01400   1st Qu.:1.000   1st Qu.:0.03000
 Median : 73.70   Median :1016   Median :0.02100   Median :2.000   Median :0.05000
 Mean   : 70.19   Mean   :1016   Mean   :0.02372   Mean   :2.137   Mean   :0.05892
 3rd Qu.: 87.40   3rd Qu.:1021   3rd Qu.:0.03000   3rd Qu.:3.000   3rd Qu.:0.08000
 Max.   :100.00   Max.   :1040   Max.   :0.08600   Max.   :7.000   Max.   :0.27000
 NA's   :12       NA's   :12     NA's   :280       NA's   :513     NA's   :280
      H2S               C6H6O               NH3               CH2O
 Min.   :0.000000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0000
 1st Qu.:0.001000   1st Qu.:0.001000   1st Qu.:0.02000   1st Qu.:0.0040
 Median :0.001000   Median :0.001000   Median :0.03000   Median :0.0070
 Mean   :0.001544   Mean   :0.001674   Mean   :0.03588   Mean   :0.0091
 3rd Qu.:0.002000   3rd Qu.:0.002000   3rd Qu.:0.04000   3rd Qu.:0.0120
 Max.   :0.013000   Max.   :0.014000   Max.   :0.21000   Max.   :0.0440
                                                         NA's   :513
And now let's look at our data. Our dataset contains 13 columns and 1076 rows. What do we see? Some variables have only a few missing values (12 each for the meteorological variables). But two variables, CH2O (formaldehyde) and CO (carbon monoxide), have 513 missing values each. Two more variables, SO2 (sulphur dioxide) and NO2 (nitrogen dioxide), have 280 missing values each. We will try to recover all missing values in our dataset.
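A quick way to get just the missing-value counts per column, without reading them off the summary output, is to combine is.na with colSums. The small data frame below is a made-up stand-in for PSZ72014, not the real data:

```r
# Toy stand-in for the monitoring data (hypothetical values)
toy <- data.frame(
  temp = c(1.5, NA, 3.0, 4.2),
  SO2  = c(NA, NA, 0.02, 0.03),
  NH3  = c(0.02, 0.03, 0.04, 0.05)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of missing values in each column, in percent
na_share <- round(100 * colMeans(is.na(toy)), 1)
print(na_share)
```

Applied to the real data, colSums(is.na(PSZ72014)) should report the same counts as the NA's rows of the summary.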
First, we build the imputation model. With method = "bagImpute", preProcess fits a bagged regression tree model for each variable, using the other variables as predictors:
PSZ72014.pre <- preProcess(PSZ72014[, c(2:13)], method = "bagImpute")
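To give some intuition for what bagging of regression trees does, here is a minimal sketch for a single variable, written with the rpart package on simulated data. It only illustrates the idea (fit trees on bootstrap resamples of the observed rows, then average their predictions for the missing rows). It is not caret's internal implementation, and the toy data, the variable names and the choice of 25 resamples are assumptions made for the example:

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)

# Toy data (hypothetical): y depends on x1 and x2, and some y values are missing
n  <- 200
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- 2 * df$x1 - df$x2 + rnorm(n, sd = 0.1)
miss <- sample(n, 40)        # row indices whose y we pretend is missing
df$y[miss] <- NA

obs <- df[!is.na(df$y), ]    # rows where the target is observed
B   <- 25                    # number of bootstrap resamples (arbitrary choice)

# Fit one regression tree per bootstrap resample and predict the missing rows
preds <- sapply(seq_len(B), function(b) {
  boot <- obs[sample(nrow(obs), replace = TRUE), ]
  fit  <- rpart(y ~ x1 + x2, data = boot)
  predict(fit, newdata = df[miss, ])
})

# The bagged prediction is the average over all trees
df$y[miss] <- rowMeans(preds)
```

Averaging over many trees smooths out the instability of any single tree, which is the main reason to bag them.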
After that, we make a copy of the dataset, into which we will write the predicted values:
PSZ72014.bagImpute <- PSZ72014
PSZ72014.bagImpute[, c(2:13)] <- predict(PSZ72014.pre, PSZ72014.bagImpute[, c(2:13)])
After prediction, all originally non-missing values are preserved; only the missing values are replaced with predictions.
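This property is easy to check directly. The sketch below uses a tiny made-up data frame and a simple column-mean fill as a stand-in for the imputation step, so it runs without caret; with the real data, orig would be the original columns and imp the result of predict:

```r
# Toy original data with missing values (hypothetical)
orig <- data.frame(
  temp = c(1.5, NA, 3.0, 4.2),
  SO2  = c(NA, 0.01, 0.02, 0.03)
)

# Stand-in imputation: fill NAs with the column mean.
# With caret, `imp` would instead come from predict() on the preProcess object.
imp <- orig
for (v in names(imp)) {
  imp[[v]][is.na(imp[[v]])] <- mean(imp[[v]], na.rm = TRUE)
}

# TRUE where a value was present in the original data
observed <- !is.na(as.matrix(orig))

# Observed values must be identical before and after imputation
unchanged  <- all(as.matrix(orig)[observed] == as.matrix(imp)[observed])
# No missing values should remain
no_na_left <- !anyNA(imp)

print(c(unchanged = unchanged, no_na_left = no_na_left))
```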
But we have a problem: how can we validate our new predicted values? One of the simplest ways is to compare the summary statistics (minimum, quartiles, median, mean, and maximum) of the variables in both datasets. We will make two matrices with these statistics and compare them:
tmp <- do.call(cbind, lapply(PSZ72014[, c(2:13)], summary))
number of rows of result is not a multiple of vector length (arg 9)
tmp2 <- do.call(cbind, lapply(PSZ72014.bagImpute[, c(2:13)], summary))
tmp[1:6,] - tmp2
           ws temp dew_point    rh atmos_pres    SO2     CO      NO2 H2S C6H6O NH3      CH2O
Min.    0.000 0.00     0.000  0.00          0  0e+00  0.000  0.00000   0     0   0  0.000000
1st Qu. 0.000 0.00     0.000 -0.22          0 -2e-03 -0.361 -0.01000   0     0   0 -0.000561
Median  0.000 0.00     0.000 -0.40          0 -1e-03  0.000  0.00000   0     0   0 -0.000493
Mean    0.008 0.02    -0.018 -0.15          0  4e-05  0.054  0.00034   0     0   0  0.000430
3rd Qu. 0.000 0.00     0.000 -0.03          0  1e-03  0.500  0.00685   0     0   0  0.001000
Max.    0.000 0.00     0.000  0.00          0  0e+00  0.000  0.00000   0     0   0  0.000000
As we can see, the original and the imputed datasets have similar summary statistics. This suggests that the imputation introduced only minimal distortion into the distributions of the variables.
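The warning printed when building tmp appears because summary returns seven values for a column that contains NAs (the six statistics plus the NA count) and only six values otherwise, so cbind has to recycle the shorter vectors. The comparison still works because only the first six rows are used. One way to avoid the warning is to compute the six statistics explicitly. The helper six_stats, the toy data and the mean-fill imputation below are assumptions made for illustration:

```r
# Six summary statistics, computed explicitly so every column yields the same length
six_stats <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c("Min." = q[1], "1st Qu." = q[2], "Median" = q[3],
    "Mean" = mean(x, na.rm = TRUE), "3rd Qu." = q[4], "Max." = q[5])
}

# Toy data (hypothetical): column `a` has a missing value, column `b` does not
orig <- data.frame(a = c(1, 2, NA, 4, 10), b = c(5, 6, 7, 8, 9))

# Stand-in imputation: fill the NA with the column mean
imp <- orig
imp$a[is.na(imp$a)] <- mean(imp$a, na.rm = TRUE)

# Difference of the statistics, one column per variable
stats_diff <- sapply(orig, six_stats) - sapply(imp, six_stats)
print(stats_diff)
```

With the real data, the same call on PSZ72014[, 2:13] and PSZ72014.bagImpute[, 2:13] should give a table of the same shape as the one in the post.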
So, we now have a new dataset without NA values, and we can look for patterns in the time intervals that were previously missing.
P.S. The first version of this post was published on my own blog, datastory.org.ua.