One of the major challenges when preparing data for a machine learning model is dealing with missing values. Missing data typically arises from a poor data collection process or from data entry errors. It is critical to handle these missing values because most models cannot be built on a dataset containing NULLs or NAs.
There are many approaches we can follow to fix missing data. The purpose of this article is to review and test some of the most common and effective ones.
# Example data
library(knitr) # For kable tables
library(VIM) # For the aggr missing-data plot
set.seed(34555) # Create reproducible data
N <- 5000 # Sample size
y <- round(rnorm(N, 50, 30)) # Dependent variable
x1 <- round(0.2 * y + rnorm(N, 10, 4), 2) # Predictor 1
x2 <- round(y * rpois(N, 5)) # Predictor 2
x3 <- round(0.01 * y + runif(N, 0, 10)) # Predictor 3
data_orginal <- data.frame(y, x1, x2, x3)
data_missing <- data_orginal
data_missing$y[rbinom(N, 1, 0.2) == 1] <- NA # Approximately 20% missing values in y
data_missing$x3[rbinom(N, 1, 0.1) == 1] <- NA # Approximately 10% missing values in x3
kable(head(data_missing)) # First 6 rows of our example data
| y | x1 | x2 | x3 |
|---|---|---|---|
| 62 | 22.22 | 372 | 9 |
| 42 | 22.80 | 252 | 1 |
| NA | 30.18 | 354 | 1 |
| 67 | 26.79 | 536 | 6 |
| 95 | 30.77 | 190 | 7 |
| 101 | 34.51 | 505 | 5 |
Let's visualize the intensity of missing data in our dataset using the aggr function from the VIM package. Here is how to read the visualization.
The plot on the left shows the percentage of missing values for each variable. The plot on the right shows the combinations of missing values across variables.
From the left histogram we can see that about 20% of y and 10% of x3 are missing. The plot on the right shows that around 72% of the rows have no missing values, roughly 2% have both y and x3 missing, about 8% have only x3 missing, and about 18% have only y missing.
aggr_plot <- aggr(data_missing, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data_missing), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## y 0.1982
## x3 0.0988
## x1 0.0000
## x2 0.0000
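If you prefer a plain numeric summary over a plot, the per-column missing share can also be computed directly in base R. This is a quick sketch and not part of the aggr output above:
colMeans(is.na(data_missing)) # Proportion of NAs per column
colSums(is.na(data_missing)) # Absolute number of NAs per column
mean(complete.cases(data_missing)) # Share of rows with no missing values at all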
Let's try out different solutions to this problem.
The easiest solution is to drop the records containing NAs from the dataset. This works well only if there are very few missing records; otherwise we throw away a lot of information and hurt model performance.
If a particular column has a large percentage of missing values, it is often reasonable to drop that column from the dataset entirely (a sketch of such a column-level filter follows the table below).
data_rs1 <- na.omit(data_missing)
kable(head(data_rs1))
| | y | x1 | x2 | x3 |
|---|---|---|---|---|
| 1 | 62 | 22.22 | 372 | 9 |
| 2 | 42 | 22.80 | 252 | 1 |
| 4 | 67 | 26.79 | 536 | 6 |
| 5 | 95 | 30.77 | 190 | 7 |
| 6 | 101 | 34.51 | 505 | 5 |
| 7 | 50 | 18.86 | 150 | 2 |
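As mentioned above, entire columns can also be dropped when their missing share exceeds some threshold. A minimal sketch of this idea follows; the 50% cutoff and the name data_rs2 are purely illustrative choices, not part of the workflow above:
na_share <- colMeans(is.na(data_missing)) # Missing share per column
threshold <- 0.5 # Hypothetical cutoff: drop columns with more than 50% missing
data_rs2 <- data_missing[, na_share <= threshold, drop = FALSE] # Keep only columns below the cutoff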
Let's try another common approach: imputing the missing values with the mean of the non-missing values. This is the simplest form of imputation.
data_mean <- data_missing
data_mean$y[is.na(data_mean$y)] <- mean(data_mean$y, na.rm = TRUE)
data_mean$x3[is.na(data_mean$x3)] <- mean(data_mean$x3, na.rm = TRUE)
kable(head(data_mean))
| y | x1 | x2 | x3 |
|---|---|---|---|
| 62.00000 | 22.22 | 372 | 9 |
| 42.00000 | 22.80 | 252 | 1 |
| 49.59092 | 30.18 | 354 | 1 |
| 67.00000 | 26.79 | 536 | 6 |
| 95.00000 | 30.77 | 190 | 7 |
| 101.00000 | 34.51 | 505 | 5 |
Let's evaluate the drawback of this approach by plotting the original and the imputed data points for the rows that had missing y. Since every missing value is replaced by the same mean, the imputed points all lie on a single horizontal line and are often far from the real values.
plot(data_orginal[which(is.na(data_missing$y)), ]$x1, data_orginal[which(is.na(data_missing$y)), ]$y, # Plot of observed values
xlim = c(0, 20), ylim = c(- 15, 60),
main = "Simple Mean Imputation",
xlab = "X1", ylab = "Y")
points(data_mean[which(is.na(data_missing$y)), ]$x1, data_mean[which(is.na(data_missing$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x1, data_orginal), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Mean Imputed Values", "Regression Y ~ X1"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
Another problem is that the shape of the distribution changes drastically: mean imputation creates a spike at the mean and shrinks the variance, which can affect the final model.
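A quick numeric check illustrates this shrinkage. This is only a sketch comparing against the fully observed data we simulated earlier:
var(data_orginal$y) # Variance of the fully observed y
var(data_mean$y) # Variance after mean imputation (noticeably smaller)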
##### Density of y pre and post imputation #####
par(mfrow=c(1,2))
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$y)), ]$y),
lwd = 2,
ylim = c(0, 0.08),
main = "Density Pre and Post Mean Imputation(y)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_mean$y),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
##### Density of X3 pre and post imputation #####
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$x3)), ]$x3),
lwd = 2,
ylim = c(0, 0.3),
main = "Density Pre and Post Mean Imputation(x3)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_mean$x3),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
In regression imputation, a linear regression model is estimated from the observed values of the target variable y and some explanatory variables X. This model is then used to predict the missing values from the observed predictors. Relationships between X and y (correlations, regression coefficients, etc.) are preserved, since the imputed values are based on a regression model. This is a big advantage over simpler methods such as mean imputation or zero substitution.
There are two types of regression imputation.
Deterministic regression imputation replaces missing values with the exact predictions of the regression model. Random variation (i.e. an error term) around the regression slope is not considered. The imputed values are therefore often too precise and lead to an overestimation of the correlation between X and y.
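To make the mechanism concrete, here is a minimal base-R sketch of deterministic regression imputation for y. It uses only the fully observed predictors x1 and x2 to keep things simple, and the names obs, fit and data_det are illustrative; this is roughly what mice's "norm.predict" does, ignoring mice's internal details:
obs <- !is.na(data_missing$y) # Rows where y is observed
fit <- lm(y ~ x1 + x2, data = data_missing[obs, ]) # Model fitted on the observed y
data_det <- data_missing
data_det$y[!obs] <- predict(fit, newdata = data_missing[!obs, ]) # Replace NAs with exact predictions, no error term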
We will use a package called mice. A brief introduction to the mice package can be found at https://datasciencebeginners.com/2018/11/11/a-brief-introduction-to-mice-r-package/
We will use the method "norm.predict"; if we had categorical variables, we could try "logreg" instead.
library("mice") # Load package
imp <- mice(data_missing, method = "norm.predict", m = 1) # Impute data
data_regress_1 <- complete(imp) # Store data
Let's look at the scatter plot and the fitted line. As you can see, the imputed values (highlighted in red) sit right on the regression line. This is because the method does not add an error term.
plot(data_orginal[which(is.na(data_missing$y)), ]$x1, data_orginal[which(is.na(data_missing$y)), ]$y, # Plot of observed values
xlim = c(0, 20), ylim = c(- 15, 60),
main = "Deterministic regression imputation",
xlab = "X1", ylab = "Y")
points(data_regress_1[which(is.na(data_missing$y)), ]$x1, data_regress_1[which(is.na(data_missing$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x1, data_orginal), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Deterministic regression imputed Values", "Regression Y ~ X1"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
Let's compare the density plots of the original dataset (without missing values) and the imputed dataset.
##### Density of y pre and post imputation #####
par(mfrow=c(1,2))
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$y)), ]$y),
lwd = 2,
ylim = c(0, 0.08),
main = "Density Pre and Post Mean Imputation(y)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_1$y),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
##### Density of X3 pre and post imputation #####
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$x3)), ]$x3),
lwd = 2,
ylim = c(0, 0.3),
main = "Density Pre and Post Mean Imputation(x3)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_1$x3),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
Stochastic regression imputation was developed in order to solve this issue of deterministic regression imputation. Stochastic regression imputation adds a random error term to the predicted value and is therefore able to reproduce the correlation of X and Y more appropriately.
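Conceptually, stochastic regression imputation is the deterministic prediction plus random noise drawn from the residual distribution. Continuing the hypothetical base-R sketch from above (again with only x1 and x2 as predictors, and data_sto as an illustrative name):
obs <- !is.na(data_missing$y) # Rows where y is observed
fit <- lm(y ~ x1 + x2, data = data_missing[obs, ]) # Model fitted on the observed y
pred <- predict(fit, newdata = data_missing[!obs, ]) # Deterministic prediction
noise <- rnorm(sum(!obs), mean = 0, sd = summary(fit)$sigma) # Random error term with the residual standard deviation
data_sto <- data_missing
data_sto$y[!obs] <- pred + noise # Prediction plus noise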
Now let's do the stochastic regression imputation with mice.
imp <- mice(data_missing, method = "norm.nob", m = 1) # Impute data
data_regress_2 <- complete(imp) # Store data
As shown in the plot, the imputed points no longer sit exactly on the regression line, which looks much more realistic compared to the actual values.
plot(data_orginal[which(is.na(data_missing$y)), ]$x1, data_orginal[which(is.na(data_missing$y)), ]$y, # Plot of observed values
xlim = c(0, 20), ylim = c(- 15, 60),
main = "Stochastic regression imputation",
xlab = "X1", ylab = "Y")
points(data_regress_2[which(is.na(data_missing$y)), ]$x1, data_regress_2[which(is.na(data_missing$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x1, data_orginal), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Stochastic regression imputed Values", "Regression Y ~ X1"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
Let's look at the density plots again.
##### Density of y pre and post imputation #####
par(mfrow=c(1,2))
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$y)), ]$y),
lwd = 2,
ylim = c(0, 0.08),
main = "Density Pre and Post Mean Imputation(y)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_2$y),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
##### Density of X3 pre and post imputation #####
# Density of observed data
plot(density(data_orginal[which(!is.na(data_missing$x3)), ]$x3),
lwd = 2,
ylim = c(0, 0.3),
main = "Density Pre and Post Mean Imputation(x3)",
xlab = "Y")
# Density of observed & imputed data
points(density(data_regress_2$x3),
lwd = 2,
type = "l",
col = "red")
# Legend
legend("topleft",
c("Before Imputation", "After Imputation"),
lty = 1,
lwd = 2,
col = c("black", "red"))
Although stochastic regression imputation handles missing values well, it has a few drawbacks too.
Let's create a dataset whose response variable takes only positive values and impute it with stochastic regression.
set.seed(34555) # Create reproducible data
N <- 5000 # Sample size
y <- round(rnorm(N, 0, 30)) # Dependent variable
y[y<0] <- y[y<0] * (-1)
y[rbinom(N, 1, 0.1) == 1] <- NA # Create approximately 10% missingness in y
x1 <- y + rnorm(N, 1000, 1500) # Auxiliary variables
x2 <- y + rnorm(N, - 5000, 2000)
data_pos_orginal <- data.frame(y, x1, x2, x3)
imp <- mice(data_pos_orginal, method = "norm.nob", m = 1) # Impute data
##
## iter imp variable
## 1 1 y x1 x2
## 2 1 y x1 x2
## 3 1 y x1 x2
## 4 1 y x1 x2
## 5 1 y x1 x2
data_regress_pos <- complete(imp) # Store data
kable(head(data_regress_pos))
| y | x1 | x2 | x3 |
|---|---|---|---|
| 12.00000 | -3415.6971 | -3331.953 | 9 |
| 8.00000 | 3710.3596 | -7756.619 | 1 |
| 9.00000 | -1045.4055 | -4258.297 | 1 |
| 17.00000 | -840.3655 | -7482.816 | 6 |
| 41.10276 | 516.7819 | -2337.856 | 7 |
| 51.00000 | 1985.6899 | -6789.144 | 5 |
data_regress_pos$y[data_regress_pos$y < 0] # Imputed values below 0 (implausible)
## [1] -20.24162288 -3.04376242 -4.62816265 -1.44968629 -15.07484767
## [6] -0.02032324 -11.38547515 -0.58412724 -1.63001199 -1.13812141
## [11] -8.69552529 -1.31993709 -4.08510342 -5.36279644 -6.09628257
## [16] -1.35591209 -6.60820159 -3.02735339 -16.30538052 -5.84970036
## [21] -18.15999608 -12.53182425 -14.35035405 -1.83651321 -4.44075923
## [26] -1.94525484 -10.94426660 -4.17282715 -8.63300036 -3.77449513
## [31] -5.14482786 -13.18530821 -14.37206480 -12.49475170 -1.51839089
## [36] -29.06196715 -3.14721667 -10.20476497 -8.81937085 -4.98694411
## [41] -6.00410951 -0.79388811 -12.75123539
Let's try imputing a heteroscedastic variable with missing values.
# Heteroscedastic data
set.seed(33324257) # Set seed
N <- 1:5000 # Index vector; its length is the sample size
a <- 0
b <- 1
sigma2 <- N^2
eps <- rnorm(N, mean = 0, sd = sqrt(sigma2))
y <- a + b * N + eps # Heteroscedastic variable
x <- 30 * N + rnorm(N[length(N)], 1000, 200) # Correlated variable
data_het_org <- data.frame(y, x)
data_het_miss <- data_het_org
data_het_miss$y[rbinom(N[length(N)], 1, 0.3) == 1] <- NA # 30% missings
Let's impute with the stochastic regression imputation method and check whether the imputed values show the same pattern as the original dataset.
imp <- mice(data_het_miss, method = "norm.nob", m = 1) # Impute data
data_regress_het_imp <- complete(imp) # Store data
plot(data_het_org[which(is.na(data_het_miss$y)), ]$x, data_het_org[which(is.na(data_het_miss$y)), ]$y, # Plot of observed values
main = "Stochastic regression imputation ",
xlab = "X", ylab = "Y")
points(data_regress_het_imp[which(is.na(data_het_miss$y)), ]$x, data_regress_het_imp[which(is.na(data_het_miss$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x, data_het_org), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Stochastic regression", "Regression Y ~ X"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
# Predictive mean matching (with single imputation)
Predictive mean matching (PMM) is a solution to both of these problems. It reduces the bias introduced by imputation by drawing real, observed values from the data. For each missing case, PMM builds a small set of donor observations whose predicted values are closest to the predicted value of the missing case and imputes the observed value of one randomly chosen donor.
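The core idea can be sketched in a few lines of base R: predict y for every row, then for each missing case randomly draw the observed y of one of its k closest donors in terms of predicted values. The value k = 5 mirrors mice's default, and the names data_pmm_sketch, obs, fit and pred are illustrative; mice's actual implementation differs in details (e.g. it also perturbs the regression coefficients):
obs <- !is.na(data_het_miss$y) # Rows where y is observed
fit <- lm(y ~ x, data = data_het_miss[obs, ]) # Model fitted on the observed y
pred <- predict(fit, newdata = data_het_miss) # Predicted y for every row
k <- 5 # Number of donor candidates
imputed <- sapply(which(!obs), function(i) {
  donors <- order(abs(pred[obs] - pred[i]))[1:k] # k observed cases with the closest predictions
  sample(data_het_miss$y[obs][donors], 1) # Draw one observed value as the imputation
})
data_pmm_sketch <- data_het_miss
data_pmm_sketch$y[!obs] <- imputed # Imputed values are always real observed values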
imp<- mice(data_het_miss, method = "pmm", m = 1)
data_regress_pmm <- complete(imp)
imp<- mice(data_pos_orginal, method = "pmm", m = 1)
data_pmm_pos <- complete(imp)
As we can see, there are no negative imputed values with the PMM method.
data_pmm_pos$y[data_pmm_pos$y < 0] # No values below 0 (implausible values are avoided)
## numeric(0)
plot(data_het_org[which(is.na(data_het_miss$y)), ]$x, data_het_org[which(is.na(data_het_miss$y)), ]$y, # Plot of observed values
main = "Predictive mean matching - single imputation ",
xlab = "X1", ylab = "Y")
points(data_regress_pmm[which(is.na(data_het_miss$y)), ]$x, data_regress_pmm[which(is.na(data_het_miss$y)), ]$y, # Plot of missing values
col = "red")
abline(lm(y ~ x, data_het_org), col = "#1b98e0", lwd = 1.5) # Regression slope
legend("topleft", # Legend
c("Observed Values", "Predictive Mean Matching", "Regression Y ~ X"),
pch = c(1, 1, NA),
lty = c(NA, NA, 1),
col = c("black", "red", "#1b98e0"))
From the tests above, predictive mean matching clearly outperforms the other methods. When a variable has a significant share of missing values, it is also advisable to use multiple imputation (m > 1) and pool the results, as sketched below.
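A minimal multiple-imputation sketch with mice: generate several imputed datasets, fit the model on each, and pool the estimates. The model formula here is just an example, not the one used above:
imp_multi <- mice(data_missing, method = "pmm", m = 5, seed = 123) # Generate 5 imputed datasets
fit_multi <- with(imp_multi, lm(y ~ x1 + x2 + x3)) # Fit the model on each imputed dataset
summary(pool(fit_multi)) # Pool the estimates across imputations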
References:
en.wikipedia.org, https://statisticsglobe.com/