One of the major challenges when preparing data for a machine learning model is dealing with missing values. Missing data typically arises from a poor data collection process or from data entry errors. It is critical to handle these missing values because most models cannot be built on a dataset containing NULLs or NAs.
There are many approaches we can follow to fix missing data. The purpose of this article is to review and test some of the most common and effective ones.
# Example data
library(knitr) # For kable tables
library(VIM) # For the aggr missing-data plot
set.seed(34555) # Create reproducible data
N <- 5000 # Sample size
y <- round(rnorm(N, 50, 30)) # Dependent variable
x1 <- round(0.2 * y + rnorm(N, 10, 4), 2) # Predictor 1
x2 <- round(y * rpois(N, 5)) # Predictor 2
x3 <- round(0.01 * y + runif(N, 0, 10)) # Predictor 3
data_orginal <- data.frame(y, x1, x2, x3)
data_missing <- data_orginal
data_missing$y[rbinom(N, 1, 0.2) == 1] <- NA # Approximately 20% missing values in y
data_missing$x3[rbinom(N, 1, 0.1) == 1] <- NA # Approximately 10% missing values in x3
kable(head(data_missing)) # First 6 rows of our example data
| y | x1 | x2 | x3 |
|---|---|---|---|
| 62 | 22.22 | 372 | 9 |
| 42 | 22.80 | 252 | 1 |
| NA | 30.18 | 354 | 1 |
| 67 | 26.79 | 536 | 6 |
| 95 | 30.77 | 190 | 7 |
| 101 | 34.51 | 505 | 5 |
Let's visualize the intensity of missing data in our dataset using the aggr function from the VIM package. Here is how to read the visualization.
The plot on the left shows the percentage of missing values for each variable. The plot on the right shows the combinations of missing values across variables.
From the left histogram we can see that about 20% of y and 10% of x3 are missing. The plot on the right shows that around 72% of the rows have no missing values, roughly 2% have both y and x3 missing, about 8% have only x3 missing, and about 18% have only y missing.
aggr_plot <- aggr(data_missing, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data_missing), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## y 0.1982
## x3 0.0988
## x1 0.0000
## x2 0.0000
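If you prefer a plain numeric summary over a plot, the per-column missing share can also be computed directly in base R. This is a quick sketch and not part of the aggr output above:
colMeans(is.na(data_missing)) # Proportion of NAs per column
colSums(is.na(data_missing)) # Absolute number of NAs per column
mean(complete.cases(data_missing)) # Share of rows with no missing values at all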
Let's try out different solutions to this problem.
The easiest solution is to drop the records containing NAs from the dataset. This works well only if there are very few missing records; otherwise we throw away a lot of information and hurt model performance.
If a particular column has a large percentage of missing values, it is often reasonable to drop that column from the dataset entirely (a sketch of such a column-level filter follows the table below).
data_rs1 <- na.omit(data_missing)
kable(head(data_rs1))
| | y | x1 | x2 | x3 |
|---|---|---|---|---|
| 1 | 62 | 22.22 | 372 | 9 |
| 2 | 42 | 22.80 | 252 | 1 |
| 4 | 67 | 26.79 | 536 | 6 |
| 5 | 95 | 30.77 | 190 | 7 |
| 6 | 101 | 34.51 | 505 | 5 |
| 7 | 50 | 18.86 | 150 | 2 |
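As mentioned above, entire columns can also be dropped when their missing share exceeds some threshold. A minimal sketch of this idea follows; the 50% cutoff and the name data_rs2 are purely illustrative choices, not part of the workflow above:
na_share <- colMeans(is.na(data_missing)) # Missing share per column
threshold <- 0.5 # Hypothetical cutoff: drop columns with more than 50% missing
data_rs2 <- data_missing[, na_share <= threshold, drop = FALSE] # Keep only columns below the cutoff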
Let's try another common approach: imputing the missing values with the mean of the non-missing values. This is the simplest form of imputation.
data_mean <- data_missing
data_mean$y[is.na(data_mean$y)] <- mean(data_mean$y, na.rm = TRUE)
data_mean$x3[is.na(data_mean$x3)] <- mean(data_mean$x3, na.rm = TRUE)
kable(head(data_mean))
| y | x1 | x2 | x3 |
|---|---|---|---|
| 62.00000 | 22.22 | 372 | 9 |
| 42.00000 | 22.80 | 252 | 1 |
| 49.59092 | 30.18 | 354 | 1 |
| 67.00000 | 26.79 | 536 | 6 |
| 95.00000 | 30.77 | 190 | 7 |
| 101.00000 | 34.51 | 505 | 5 |
Let's evaluate the drawback of this approach by plotting the original and the imputed data points for the rows that had missing y. Since every missing value is replaced by the same mean, the imputed points all lie on a single horizontal line and are often far from the real values.
plot(data_orginal[which(is.na(data_missing$y)), ]$x1, data_orginal[which(is.na(data_missing$y)), ]$y, # Plot of observed values
xlim = c(0, 20), ylim = c(- 15, 60),
main = "Simple Mean Imputation",
xlab = "X1", ylab = "Y")
points(data_mean[which(is.na(data_missing$y)), ]$x1, data_mean[which(is.na(data_missing$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x1, data_orginal), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Mean Imputed Values", "Regression Y ~ X1"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
Another problem is that the shape of the distribution changes drastically: mean imputation creates a spike at the mean and shrinks the variance, which can affect the final model.
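A quick numeric check illustrates this shrinkage. This is only a sketch comparing against the fully observed data we simulated earlier:
var(data_orginal$y) # Variance of the fully observed y
var(data_mean$y) # Variance after mean imputation (noticeably smaller)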
##### Density of y pre and post imputation #####
par(mfrow=c(1,2))
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$y)), ]$y),
lwd = 2,
ylim = c(0, 0.08),
main = "Density Pre and Post Mean Imputation(y)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_mean$y),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
##### Density of X3 pre and post imputation #####
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$x3)), ]$x3),
lwd = 2,
ylim = c(0, 0.3),
main = "Density Pre and Post Mean Imputation(x3)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_mean$x3),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
In regression imputation, a linear regression model is estimated from the observed values of the target variable y and some explanatory variables X. This model is then used to predict the missing values from the observed predictors. Relationships between X and y (correlations, regression coefficients, etc.) are preserved, since the imputed values are based on a regression model. This is a big advantage over simpler methods such as mean imputation or zero substitution.
There are two types of regression imputation.
Deterministic regression imputation replaces missing values with the exact predictions of the regression model. Random variation (i.e. an error term) around the regression slope is not considered. The imputed values are therefore often too precise and lead to an overestimation of the correlation between X and y.
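To make the mechanism concrete, here is a minimal base-R sketch of deterministic regression imputation for y. It uses only the fully observed predictors x1 and x2 to keep things simple, and the names obs, fit and data_det are illustrative; this is roughly what mice's "norm.predict" does, ignoring mice's internal details:
obs <- !is.na(data_missing$y) # Rows where y is observed
fit <- lm(y ~ x1 + x2, data = data_missing[obs, ]) # Model fitted on the observed y
data_det <- data_missing
data_det$y[!obs] <- predict(fit, newdata = data_missing[!obs, ]) # Replace NAs with exact predictions, no error term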
We will use a package called mice. A brief introduction to the mice package can be found at https://datasciencebeginners.com/2018/11/11/a-brief-introduction-to-mice-r-package/
We will use the method "norm.predict"; if we had categorical variables, we could try "logreg" instead.
library("mice") # Load package
imp <- mice(data_missing, method = "norm.predict", m = 1) # Impute data
data_regress_1 <- complete(imp) # Store data
Let's look at the scatter plot and the fitted line. As you can see, the imputed values (highlighted in red) sit right on the regression line. This is because the method does not add an error term.
plot(data_orginal[which(is.na(data_missing$y)), ]$x1, data_orginal[which(is.na(data_missing$y)), ]$y, # Plot of observed values
xlim = c(0, 20), ylim = c(- 15, 60),
main = "Deterministic regression imputation",
xlab = "X1", ylab = "Y")
points(data_regress_1[which(is.na(data_missing$y)), ]$x1, data_regress_1[which(is.na(data_missing$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x1, data_orginal), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Deterministic regression imputed Values", "Regression Y ~ X1"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
Let's compare the density plots of the original dataset (without missing values) and the imputed dataset.
##### Density of y pre and post imputation #####
par(mfrow=c(1,2))
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$y)), ]$y),
lwd = 2,
ylim = c(0, 0.08),
main = "Density Pre and Post Mean Imputation(y)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_1$y),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
##### Density of X3 pre and post imputation #####
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$x3)), ]$x3),
lwd = 2,
ylim = c(0, 0.3),
main = "Density Pre and Post Mean Imputation(x3)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_1$x3),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
Stochastic regression imputation was developed in order to solve this issue of deterministic regression imputation. Stochastic regression imputation adds a random error term to the predicted value and is therefore able to reproduce the correlation of X and Y more appropriately.
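Conceptually, stochastic regression imputation is the deterministic prediction plus random noise drawn from the residual distribution. Continuing the hypothetical base-R sketch from above (again with only x1 and x2 as predictors, and data_sto as an illustrative name):
obs <- !is.na(data_missing$y) # Rows where y is observed
fit <- lm(y ~ x1 + x2, data = data_missing[obs, ]) # Model fitted on the observed y
pred <- predict(fit, newdata = data_missing[!obs, ]) # Deterministic prediction
noise <- rnorm(sum(!obs), mean = 0, sd = summary(fit)$sigma) # Random error term with the residual standard deviation
data_sto <- data_missing
data_sto$y[!obs] <- pred + noise # Prediction plus noise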
Now let's do the stochastic regression imputation with mice.
imp <- mice(data_missing, method = "norm.nob", m = 1) # Impute data
data_regress_2 <- complete(imp) # Store data
As shown in the plot, the imputed points no longer sit exactly on the regression line, which looks much more realistic compared to the actual values.
plot(data_orginal[which(is.na(data_missing$y)), ]$x1, data_orginal[which(is.na(data_missing$y)), ]$y, # Plot of observed values
xlim = c(0, 20), ylim = c(- 15, 60),
main = "Stochastic regression imputation",
xlab = "X1", ylab = "Y")
points(data_regress_2[which(is.na(data_missing$y)), ]$x1, data_regress_2[which(is.na(data_missing$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x1, data_orginal), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Stochastic regression imputed Values", "Regression Y ~ X1"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
Let's look at the density plots again.
##### Density of y pre and post imputation #####
par(mfrow=c(1,2))
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$y)), ]$y),
lwd = 2,
ylim = c(0, 0.08),
main = "Density Pre and Post Mean Imputation(y)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_2$y),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
##### Density of X3 pre and post imputation #####
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$x3)), ]$x3),
lwd = 2,
ylim = c(0, 0.3),
main = "Density Pre and Post Mean Imputation(x3)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_2$x3),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
Although stochastic regression imputation handles missing values well, it has a few drawbacks too.
Let's create a dataset whose response variable takes only positive values and impute it with stochastic regression.
set.seed(34555) # Create reproducible data
N <- 5000 # Sample size
y <- round(rnorm(N, 0, 30)) # Dependent variable
y[y<0] <- y[y<0] * (-1)
y[rbinom(N, 1, 0.1) == 1] <- NA # Create approximately 10% missingness in y
x1 <- y + rnorm(N, 1000, 1500) # Auxiliary variables
x2 <- y + rnorm(N, - 5000, 2000)
data_pos_orginal <- data.frame(y, x1, x2, x3)
imp <- mice(data_pos_orginal, method = "norm.nob", m = 1) # Impute data
##
## iter imp variable
## 1 1 y x1 x2
## 2 1 y x1 x2
## 3 1 y x1 x2
## 4 1 y x1 x2
## 5 1 y x1 x2
data_regress_pos <- complete(imp) # Store data
kable(head(data_regress_pos))
| y | x1 | x2 | x3 |
|---|---|---|---|
| 12.00000 | -3415.6971 | -3331.953 | 9 |
| 8.00000 | 3710.3596 | -7756.619 | 1 |
| 9.00000 | -1045.4055 | -4258.297 | 1 |
| 17.00000 | -840.3655 | -7482.816 | 6 |
| 41.10276 | 516.7819 | -2337.856 | 7 |
| 51.00000 | 1985.6899 | -6789.144 | 5 |
data_regress_pos$y[data_regress_pos$y < 0] # Imputed values below 0 (implausible)
## [1] -20.24162288 -3.04376242 -4.62816265 -1.44968629 -15.07484767
## [6] -0.02032324 -11.38547515 -0.58412724 -1.63001199 -1.13812141
## [11] -8.69552529 -1.31993709 -4.08510342 -5.36279644 -6.09628257
## [16] -1.35591209 -6.60820159 -3.02735339 -16.30538052 -5.84970036
## [21] -18.15999608 -12.53182425 -14.35035405 -1.83651321 -4.44075923
## [26] -1.94525484 -10.94426660 -4.17282715 -8.63300036 -3.77449513
## [31] -5.14482786 -13.18530821 -14.37206480 -12.49475170 -1.51839089
## [36] -29.06196715 -3.14721667 -10.20476497 -8.81937085 -4.98694411
## [41] -6.00410951 -0.79388811 -12.75123539
Let's try imputing a heteroscedastic variable with missing values.
# Heteroscedastic data
set.seed(33324257) # Set seed
N <- 1:5000 # Index vector; its length is the sample size
a <- 0
b <- 1
sigma2 <- N^2
eps <- rnorm(N, mean = 0, sd = sqrt(sigma2))
y <- a + b * N + eps # Heteroscedastic variable
x <- 30 * N + rnorm(N[length(N)], 1000, 200) # Correlated variable
data_het_org <- data.frame(y, x)
data_het_miss <- data_het_org
data_het_miss$y[rbinom(N[length(N)], 1, 0.3) == 1] <- NA # 30% missings
Let's impute with the stochastic regression imputation method and check whether the imputed values show the same pattern as the original dataset.
imp <- mice(data_het_miss, method = "norm.nob", m = 1) # Impute data
data_regress_het_imp <- complete(imp) # Store data
plot(data_het_org[which(is.na(data_het_miss$y)), ]$x, data_het_org[which(is.na(data_het_miss$y)), ]$y, # Plot of observed values
main = "Stochastic regression imputation ",
xlab = "X", ylab = "Y")
points(data_regress_het_imp[which(is.na(data_het_miss$y)), ]$x, data_regress_het_imp[which(is.na(data_het_miss$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x, data_het_org), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Stochastic regression", "Regression Y ~ X"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
# Predictive mean matching (with single imputation)
Predictive mean matching (PMM) is a solution to both of these problems. It reduces the bias introduced by imputation by drawing real, observed values from the data. For each missing case, PMM builds a small set of donor observations whose predicted values are closest to the predicted value of the missing case and imputes the observed value of one randomly chosen donor.
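The core idea can be sketched in a few lines of base R: predict y for every row, then for each missing case randomly draw the observed y of one of its k closest donors in terms of predicted values. The value k = 5 mirrors mice's default, and the names data_pmm_sketch, obs, fit and pred are illustrative; mice's actual implementation differs in details (e.g. it also perturbs the regression coefficients):
obs <- !is.na(data_het_miss$y) # Rows where y is observed
fit <- lm(y ~ x, data = data_het_miss[obs, ]) # Model fitted on the observed y
pred <- predict(fit, newdata = data_het_miss) # Predicted y for every row
k <- 5 # Number of donor candidates
imputed <- sapply(which(!obs), function(i) {
  donors <- order(abs(pred[obs] - pred[i]))[1:k] # k observed cases with the closest predictions
  sample(data_het_miss$y[obs][donors], 1) # Draw one observed value as the imputation
})
data_pmm_sketch <- data_het_miss
data_pmm_sketch$y[!obs] <- imputed # Imputed values are always real observed values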
imp<- mice(data_het_miss, method = "pmm", m = 1)
data_regress_pmm <- complete(imp)
imp<- mice(data_pos_orginal, method = "pmm", m = 1)
data_pmm_pos <- complete(imp)
As we can see, there are no negative imputed values with the PMM method.
data_pmm_pos$y[data_pmm_pos$y < 0] # No values below 0 (implausible values are avoided)
## numeric(0)
plot(data_het_org[which(is.na(data_het_miss$y)), ]$x, data_het_org[which(is.na(data_het_miss$y)), ]$y, # Plot of observed values
main = "Predictive mean matching - single imputation ",
xlab = "X1", ylab = "Y")
points(data_regress_pmm[which(is.na(data_het_miss$y)), ]$x, data_regress_pmm[which(is.na(data_het_miss$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x, data_het_org), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Predictive Mean Matching", "Regression Y ~ X"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
From the tests above, predictive mean matching clearly outperforms the other methods. When a variable has a significant share of missing values, it is also advisable to use multiple imputation (m > 1) and pool the results, as sketched below.
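A minimal multiple-imputation sketch with mice: generate several imputed datasets, fit the model on each, and pool the estimates. The model formula here is just an example, not the one used above:
imp_multi <- mice(data_missing, method = "pmm", m = 5, seed = 123) # Generate 5 imputed datasets
fit_multi <- with(imp_multi, lm(y ~ x1 + x2 + x3)) # Fit the model on each imputed dataset
summary(pool(fit_multi)) # Pool the estimates across imputations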
References:
en.wikipedia.org, https://statisticsglobe.com/