Introduction
In this lab, we will explore various methods for handling missing data in R. We will discuss the implications of missing data and different imputation techniques.
Handling Missing Data
## [1] 2.333333
Here \(y\) is a vector containing three numbers, and mean(y) is the R expression that returns their mean. Now suppose that the last number is missing. R indicates this with the symbol NA, which stands for “not available”:
## [1] NA
In this case the mean is undefined. To overcome this problem, it is possible to add the extra argument na.rm = TRUE, which removes any missing values before computing the mean.
## [1] 1.5
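The calls behind the three outputs above are not shown. A minimal sketch that reproduces them, assuming the vector holds the values 1, 2 and 4 (an assumption chosen only because it is consistent with the printed means):

```r
# Assumed values: chosen so that the full mean is 7/3 and the first two average 1.5
y_full <- c(1, 2, 4)
mean(y_full)             # mean of the complete vector

# Replace the last value with NA to mimic a missing observation
y <- c(1, 2, NA)
mean(y)                  # NA: the mean is undefined when a value is missing
mean(y, na.rm = TRUE)    # drops the NA first, then averages the remaining values
```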
This allows us to obtain a mean. However, by discarding values we change the probabilistic structure of our data. The problem is even worse when more than one column in the dataset has missing values, as in the airquality data:
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Suppose we want to predict Ozone by fitting a linear model. The lm() function also has an option for omitting rows with missing data: na.action = na.omit.
##
## Call:
## lm(formula = Ozone ~ Wind, data = airquality, na.action = na.omit)
##
## Coefficients:
## (Intercept) Wind
## 96.873 -5.551
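Before interpreting the coefficients, it can help to check how many rows na.omit actually removed. A short sketch using standard accessor functions on the model shown in the output above (the object name fit_wind is hypothetical, not from the lab):

```r
fit_wind <- lm(Ozone ~ Wind, data = airquality, na.action = na.omit)

nobs(fit_wind)                  # number of rows used in the fit
dropped <- na.action(fit_wind)  # indices of the rows removed by na.omit
length(dropped)                 # number of rows removed
naprint(dropped)                # human-readable summary of the deletion
```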
Next, we wish to plot the predicted ozone levels against the observed data, so we use predict() to calculate the predicted values, and add these to the data to prepare for plotting.
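# Model from the output above; rows with missing Ozone or Wind are dropped
fit <- lm(Ozone ~ Wind, data = airquality, na.action = na.omit)
airquality2 <- cbind(na.omit(airquality[, c("Ozone", "Wind")]),
                     predicted = predict(fit))
# Measured values on the x-axis, predicted values on the y-axis
plot(x = airquality2$Ozone, y = airquality2$predicted, col = "blue",
     xlim = c(-10, 160),
     ylim = c(-10, 160),
     xlab = "Ozone Measured",
     ylab = "Ozone Predicted")

Deleting every row with missing data is not a great strategy: unless the data are missing completely at random, it can bias the estimates and make them unreliable.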
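For the plotting step above, the predictions had to be matched to the rows that survived na.omit. One alternative is na.action = na.exclude, which fits the same model but pads the predictions with NA for the omitted rows, so they stay aligned with the original data. A sketch (the names fit_ex and aligned are illustrative, not from the lab):

```r
fit_ex <- lm(Ozone ~ Wind, data = airquality, na.action = na.exclude)

# predict() now returns one value per original row, with NA where a row was dropped
aligned <- data.frame(airquality[, c("Ozone", "Wind")],
                      predicted = predict(fit_ex))
nrow(aligned) == nrow(airquality)
head(aligned)
```

The coefficients are the same as with na.omit; only the bookkeeping of the omitted rows differs.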
Complete case analysis
Complete-case analysis (also called listwise deletion) removes every row that contains at least one NA. It is built into many statistical packages. In R, the function na.omit() does this, and many modelling functions apply it through their na.action argument.
Advantage: Convenience. If the data are missing completely at random (MCAR), it produces unbiased estimates of means, variances, and regression weights.
Disadvantage: It can be wasteful. Often more than half of the original sample is lost if the number of variables is large.
Complete-case analysis is not always bad: some imputation methods are much worse.
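To see how much data complete-case analysis discards in practice, the following sketch counts missing values and complete rows in airquality (the variable names are illustrative):

```r
n_total <- nrow(airquality)
complete_rows <- complete.cases(airquality)   # TRUE for rows without any NA

colSums(is.na(airquality))                    # missing values per column
sum(complete_rows)                            # rows kept by complete-case analysis
mean(!complete_rows)                          # fraction of rows discarded

cc <- na.omit(airquality)                     # the complete-case data set
```

Even though only two columns contain missing values, every row with an NA in either of them is lost for all variables.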
Pairwise deletion
Pairwise deletion (also called available-case analysis) calculates the means and covariances from all observed data and then uses these summaries in the statistical analysis. The idea is to use the information in the observed joint distribution of \(Y,X_1, \dots, X_p\), where \(Y\) is the dependent variable and \(X_1, \dots, X_p\) are the independent variables.
data <- airquality[, c("Ozone", "Solar.R", "Wind")]
mu <- colMeans(data, na.rm = TRUE)
cv <- cov(data, use = "pairwise")

The option ‘pairwise’ tells R to compute each entry of the covariance matrix from all rows in which both variables are observed. The standard lm() function does not take means and covariances as input, but the lavaan package (Rosseel 2012) provides this feature:
library(lavaan)
fit <- lavaan("Ozone ~ 1 + Wind + Solar.R
Ozone ~~ Ozone",
sample.mean = mu, sample.cov = cv,
sample.nobs = sum(complete.cases(data)))

## lavaan 0.6.16 ended normally after 1 iteration
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 4
##
## Number of observations 111
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## Ozone ~
## Wind -5.545 0.644 -8.605 0.000
## Solar.R 0.118 0.025 4.681 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .Ozone 75.402 8.463 8.910 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .Ozone 565.035 75.845 7.450 0.000
Advantages: Uses all available information. Consistent estimates of mean, correlations, and covariances under MCAR.
Disadvantages: Best used if data approximate a multivariate normal distribution. Not recommended if correlations are high or if MCAR is not plausible.
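One reason pairwise deletion needs care is that each entry of the covariance matrix can rest on a different number of rows. The sketch below counts, for every pair of variables, how many rows have both values observed (data_pw, observed and n_pairs are illustrative names):

```r
data_pw <- airquality[, c("Ozone", "Solar.R", "Wind")]

# 1 where a value is observed, 0 where it is missing
observed <- (!is.na(data_pw)) * 1

# Entry (i, j) is the number of rows in which variables i and j are both observed
n_pairs <- crossprod(observed)
n_pairs

# Number of complete cases, the value passed as sample.nobs in the lavaan call
sum(complete.cases(data_pw))
```

Because no single sample size applies to the whole matrix, the choice of sample.nobs is itself a judgment call; the lab uses the number of complete cases.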
Mean Imputation
Mean imputation consists of replacing each missing value with the mean of the observed values in its column. This can be done with the mice() function from the mice package in R.
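The call that produced the log below is not shown in this lab. A plausible sketch, assuming the full airquality data and a single imputed data set (the arguments m = 1 and maxit = 1 are assumptions inferred from the one-line log):

```r
library(mice)

# method = "mean" fills each missing value with the observed column mean;
# m = 1 and maxit = 1 are assumed: one imputed data set, one iteration
imp <- mice(airquality, method = "mean", m = 1, maxit = 1)

# complete() returns the data with the imputed values filled in
head(complete(imp))
```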
##
## iter imp variable
## 1 1 Ozone Solar.R
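# Colour each observation by whether its Ozone value was imputed (red) or observed (blue)
imputed_ind <- ifelse(imp$where[, "Ozone"], "red", "blue")
dta <- data.frame(ImputedOzone = complete(imp)$Ozone, imputed_ind)
old_par <- par(mfrow = c(1, 2))
plot(dta$ImputedOzone, col = imputed_ind, ylab = "Ozone - Mean Imputed")
# hist() colours bars, not observations, so the per-observation colours are not used here
hist(dta$ImputedOzone, breaks = 19, main = "", xlab = "Ozone - Mean Imputed")
par(old_par)

Mean imputation distorts the distribution in several ways: all imputed values sit at the observed mean, which creates a spike in the histogram and shrinks the variance.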
Advantages: Quick and simple fix.
Disadvantages: Underestimates the variance, disturbs relationships between variables, and introduces bias.
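The variance underestimation can be checked directly in base R, without mice. A minimal sketch on the Ozone column (ozone and ozone_imputed are illustrative names):

```r
ozone <- airquality$Ozone

# Replace every missing value with the mean of the observed values
ozone_imputed <- ifelse(is.na(ozone), mean(ozone, na.rm = TRUE), ozone)

# The mean is preserved, but the variance shrinks because every
# imputed value sits exactly at the mean
c(mean_observed = mean(ozone, na.rm = TRUE), mean_imputed = mean(ozone_imputed))
c(var_observed = var(ozone, na.rm = TRUE), var_imputed = var(ozone_imputed))
```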
Regression Imputation
Regression imputation incorporates knowledge of other variables with the idea of producing smarter imputations. This method is based on three steps:
Model Creation: For each variable with missing data, a regression model is constructed using other variables as predictors.
Prediction: Once the model is set up, it is used to predict the missing values of the variable from the observed values of the predictor variables.
Replacement: The predicted values are then used to replace the missing data points.
This can be done with the following code, which uses the mice() function from the mice package in R.
data <- airquality[, c("Ozone", "Solar.R")]
imp <- mice(data, method = "norm.predict", seed = 1,
m = 1, print = FALSE)
xyplot(imp, Ozone ~ Solar.R, col = c("blue", "red"))

Advantages: Incorporates knowledge of other variables by using the relationships between them to make educated guesses about the missing values.
Disadvantages: Artificially strengthens relations between variables. Correlations are biased upwards and variability is underestimated.
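The three steps of regression imputation can also be sketched by hand in base R, which makes the upward bias in the correlation visible. This is a simplified illustration, not what mice does internally: it imputes only Ozone, and only in rows where Solar.R is observed (all object names are illustrative):

```r
ozone_solar <- airquality[, c("Ozone", "Solar.R")]

# Step 1, model creation: lm() drops incomplete rows by default
model <- lm(Ozone ~ Solar.R, data = ozone_solar)

# Step 2, prediction: rows where Ozone is missing but Solar.R is observed
to_fill <- is.na(ozone_solar$Ozone) & !is.na(ozone_solar$Solar.R)
predicted <- predict(model, newdata = ozone_solar[to_fill, ])

# Step 3, replacement: write the predictions into a copy of the data
imputed <- ozone_solar
imputed$Ozone[to_fill] <- predicted

# Imputed points lie exactly on the regression line, so the correlation increases
cor_before <- cor(ozone_solar$Ozone, ozone_solar$Solar.R, use = "complete.obs")
cor_after <- cor(imputed$Ozone, imputed$Solar.R, use = "complete.obs")
c(before = cor_before, after = cor_after)
```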
References
van Buuren, S. (2018). Flexible Imputation of Missing Data, Second Edition (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429492259