One of the most common problems a data scientist finds when dealing with real
data is missing data: observations where one or more variables do not have
recorded values. In R, missing values are represented by NA, Not Available.
Before studying the data, or fitting a model, it is important to understand the
nature (why) and amount (how much) of missing data. Depending on these two
aspects we might have to consider recording the data again, to drop observations
with missing values in some variables, or fill in for the missing data in a
technique known as imputation. In this lecture we will explain different
imputation techniques with the airquality dataset, available in the basic
R installation. We start by looking at this dataset and finding how much data
are missing.
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
# Number and percentage of NAs per variable
na_counts <- colSums(is.na(airquality))
na_pct <- round(100 * na_counts / nrow(airquality), 1)
data.frame(Missing = na_counts, Percent = na_pct)Two variables are missing data: Ozone with 37 values missing, and Solar.R with 7
values missing, which represent about 24% and 4% of the observations, respectively.
We will study with a little more detail the nature of the missing data in Section
3, but first we are going to find about the different types of
missing data.
Understanding why data are missing matters because it determines which imputation methods are valid. The standard taxonomy, due to Rubin (1976), has three categories.
The probability of a value being missing is unrelated to any variable in the dataset, or to some unobserved variables. In other words, the data are missing by pure chance. For example, a sensor randomly fails to record Ozone levels, independently of the weather in those days.
In the case of MCAR, throwing away the observations (cases) with missing data will not bias the result, but we reduce our sample size. This is the best case of missing data (the ideal case is when data are complete, nothing missing), though it is hard to prove that missing data are completely at random.
How to check: compare the distribution of observed variables between cases with
and without missing data. Under MCAR, they should look similar. Little’s MCAR
test (available in the naniar package) provides a formal test.
The probability of a value being missing depends on other observed variables, but
not on the missing value itself once those are accounted for. Example: ozone readings
are more likely to be missing on hot days (high Temp), but given the temperature,
the probability of missingness is unrelated to the actual ozone level.
MAR is the most common working assumption in practice. Under MAR, methods such as multiple imputation and maximum likelihood estimation are valid.
The probability of a value being missing depends either on unobserved variables or on the variable itself. An example of the first case could be when sensors do not record values because high traffic vibrations make them not work properly. For the second case, very high ozone values may not be recorded because the instrument saturates and fails to log a reading.
MNAR is the most problematic case: no standard imputation method fully corrects for it. Handling MNAR requires explicit modelling of the missingness mechanism, which is beyond the scope of this course.
When data are MCAR or MAR one can impute them, that is, fill in the missing data with other plausible values. We will learn different imputations techniques in Section 4
Before imputing, one should always study the missing data, so one knows how much is missing and whether there is any structure to the missingness.
We start with a graphical representation of the above table about the amount of missing data.
df_na <- data.frame(
variable = names(na_pct),
percent = as.numeric(na_pct)
)
ggplot(df_na, aes(x = reorder(variable, -percent), y = percent)) +
geom_col(fill = "steelblue") +
geom_text(aes(label = paste0(percent, "%")), vjust = -0.4, size = 3.5) +
labs(x = NULL, y = "Missing (%)") +
theme_bw()Figure 3.1: Percentage of missing values per variable in the airquality dataset.
An important piece of information is whether the missing data for different variables take place in common or complementary observations. That is, whether we have to add the number of cases with missing values, or some observations miss values in more than one variable. A very simple way to find this out is to compute the total number of observations with missing values.
## Observations with missing data: 42
In this example there are 42 observations with NAs. Since we know that the Ozone variable has 37 NAs, and the Solar.R variable has 7 NAs, we can conclude that the are a 2 rows with missing data in both variables. We can check it with R, as we know the variables with missing data.
cat("Observations with more than one missing variable:",
sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R)))## Observations with more than one missing variable: 2
We can also do this study graphically in a missingness map,
each row is an observation, each column a variable, and cells are coloured
by whether the value is present or missing. This reveals if missing values
cluster in a group of observations, or spread along the dataset (like in
the case of airquality).
# Build a long data frame of missingness indicators
df_map <- reshape2::melt(is.na(airquality))
names(df_map) <- c("observation", "variable", "missing")
ggplot(df_map, aes(x = variable, y = observation, fill = missing)) +
geom_tile() +
scale_fill_manual(values = c("FALSE" = "grey90", "TRUE" = "firebrick"),
labels = c("Observed", "Missing")) +
labs(x = NULL, y = "Observation", fill = NULL) +
theme_bw() +
theme(legend.position = "top",
axis.text.y = element_blank(),
axis.ticks.y = element_blank())Figure 3.2: Missingness map for the airquality dataset. Red cells indicate missing values.
From Figure 3.2 we can see that Ozone and Solar.R
missing values do not appear to cluster at the same observations, as we check
earlier with the computations of NAs. This suggests that their reasons for
missing values are quite likely independent.
When missing data are not too many, and we have Missing at Random, we can replace the NAs by other values in what is known as Imputation. In this section we describe some imputation methods.
The simplest approach is to drop any observation that has at least one missing
value. This method will not introduce bias in the study in we delete not too
many observations, as we are not changing information. However, it will make
the data available in the sample smaller. Thus, one has to be careful and not
use this technique when the amount of missing data is not small (use your common
sense). In R, na.omit() or complete.cases() do this.
## Original rows: 153
## After deletion: 111
## Rows removed: 42
We lose 42 observations (27.5% of the data).
When to use: only when data are MCAR and the fraction missing is small. Otherwise, deletion introduces bias (the mode, for example, will seem more “important” than what really is) and reduces power.
When not to use: if data are MAR or MNAR, or if the proportion missing is non-trivial.
LOCF replaces each missing value with the most recent preceding non-missing value in the same variable. We can use this in a natural way if data are ordered according to some criteria like time. Otherwise, shuffling the data would result in different imputed values and would make the study worthless.
locf <- function(x) {
# carry forward the last non-NA value
for (i in seq_along(x)) {
if (is.na(x[i]) && i > 1) x[i] <- x[i - 1]
}
x
}
aq_locf <- airquality
aq_locf$Ozone <- locf(aq_locf$Ozone)
aq_locf$Solar.R <- locf(aq_locf$Solar.R)
cat("Remaining NAs after LOCF - Ozone:", sum(is.na(aq_locf$Ozone)),
" Solar.R:", sum(is.na(aq_locf$Solar.R)), "\n")## Remaining NAs after LOCF - Ozone: 0 Solar.R: 0
Note that LOCF cannot fill missing values that appear at the very start of the
series (no previous value to carry forward). In the airquality data, the first
observation of Ozone is not missing, so no leading NAs arise here. You can see
in the figure below how long stretches with NAs become flat after the imputation,
as all values are replaced by the last one available.
df_orig <- data.frame(day = 1:75, ozone = airquality$Ozone[1:75], type = "Observed")
df_locf <- data.frame(day = 1:75, ozone = aq_locf$Ozone[1:75], type = "LOCF")
# Mark imputed points
imputed_idx <- which(is.na(airquality$Ozone[1:75]))
ggplot() +
geom_line(data = df_locf, aes(x = day, y = ozone), colour = "grey70") +
geom_point(data = df_orig, aes(x = day, y = ozone),
colour = "steelblue", size = 2) +
geom_point(data = df_locf[imputed_idx, ], aes(x = day, y = ozone),
shape = 4, colour = "firebrick", size = 1, stroke = 1.5) +
labs(title = "LOCF imputation — Ozone (first 75 days)",
subtitle = "Blue dots: observed; red crosses: imputed",
x = "Day", y = "Ozone (ppb)") +
theme_bw()Figure 4.1: Original Ozone values (points) and LOCF-imputed values for the first 75 observations.
Limitation: LOCF artificially flatlines the series at missing stretches and ignores the natural variability of the variable. It tends to underestimate variance and can introduce bias if the variable is trending. In our example, variance is reduced by approximately 10%.
## Original variance : 1088.201
## Variance after LOCF Imputation: 995.9992
In this technique we replace the missing values with the mean, in the case of continous variables, or the mode, for categorical variables.
aq_mean <- airquality
ozone_mean <- mean(aq_mean$Ozone, na.rm = TRUE)
solar_mean <- mean(aq_mean$Solar.R, na.rm = TRUE)
aq_mean$Ozone[is.na(aq_mean$Ozone)] <- ozone_mean
aq_mean$Solar.R[is.na(aq_mean$Solar.R)] <- solar_mean
cat(sprintf("Ozone mean used for imputation: %.2f\n", ozone_mean))## Ozone mean used for imputation: 42.13
## Solar.R mean used for imputation: 185.93
Limitation: mean imputation preserves the mean of the variable but distorts its distribution — it compresses the variance and can distort correlations with other variables. The imputed values are all identical, which is unrealistic.
df_dist <- rbind(
data.frame(ozone = airquality$Ozone[!is.na(airquality$Ozone)], type = "Original (observed)"),
data.frame(ozone = aq_mean$Ozone, type = "After mean imputation")
)
df_dist$type <- factor(df_dist$type, levels = c("Original (observed)", "After mean imputation"))
ggplot(df_dist, aes(x = ozone, fill = type)) +
geom_histogram(binwidth = 10, colour = "white", alpha = 0.8) +
facet_wrap(~ type) +
scale_fill_manual(values = c("Original (observed)" = "steelblue",
"After mean imputation" = "firebrick")) +
labs(x = "Ozone (ppb)", y = "Count") +
theme_bw() +
theme(legend.position = "none")Figure 4.2: Distribution of Ozone before (left) and after mean imputation (right).
The figure clearly shows the artificial spike at the mean value after imputation.
Unlike mean imputation, regression imputation uses the relationship between variables to generate imputed values, producing estimates that are consistent with the correlation structure of the data.
Here we predict Ozone from Temp and Wind, and Solar.R from Temp.
# Fit models on complete cases only
fit_ozone <- lm(Ozone ~ Temp + Wind, data = airquality)
fit_solar <- lm(Solar.R ~ Temp, data = airquality)
aq_reg <- airquality
# Predict and fill missing Ozone values
missing_ozone <- is.na(aq_reg$Ozone)
aq_reg$Ozone[missing_ozone] <-
predict(fit_ozone, newdata = aq_reg[missing_ozone, ])
# Predict and fill missing Solar.R values
missing_solar <- is.na(aq_reg$Solar.R)
aq_reg$Solar.R[missing_solar] <-
predict(fit_solar, newdata = aq_reg[missing_solar, ])
cat("Remaining NAs - Ozone:", sum(is.na(aq_reg$Ozone)),
" Solar.R:", sum(is.na(aq_reg$Solar.R)), "\n")## Remaining NAs - Ozone: 0 Solar.R: 0
Limitation: regression imputation gives a single predicted value with no
uncertainty — it underestimates the variance of the variable and overestimates
the precision of subsequent analyses. For example, the value of \(R^2\) increases
slightly in the above example when we predict the Ozone variable from all the
other variables.
## R squared original data: 0.6249408
##
## R squared imputed data : 0.6545649
Stochastic regression imputation, that is, adding a random residual term to the prediction partially addresses this. See this link
miceThe gold standard for missing data under MAR is multiple imputation (Rubin, 1987). Instead of replacing each missing value with a single estimate, multiple imputation generates \(m\) complete datasets, each with slightly different imputed values drawn from the predictive distribution of the missing data. Each dataset is analysed separately, and the results are combined using Rubin’s rules.
The mice package (Multivariate Imputation by Chained Equations) implements
this in R.
library(mice)
set.seed(42)
# m = 5 imputed datasets, method = "pmm" (predictive mean matching, good for
# continuous variables)
imp <- mice(airquality, m = 5, method = "pmm", printFlag = FALSE)
summary(imp)## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## Ozone Solar.R Wind Temp Month Day
## "pmm" "pmm" "" "" "" ""
## PredictorMatrix:
## Ozone Solar.R Wind Temp Month Day
## Ozone 0 1 1 1 1 1
## Solar.R 1 0 1 1 1 1
## Wind 1 1 0 1 1 1
## Temp 1 1 1 0 1 1
## Month 1 1 1 1 0 1
## Day 1 1 1 1 1 0
The imputed datasets can be extracted and analysed individually, then combined:
# Fit the same regression on each imputed dataset and pool results
fit_mice <- with(imp, lm(Ozone ~ Solar.R + Wind + Temp))
pooled <- pool(fit_mice)
summary(pooled)Why multiple imputation is better: it properly accounts for the uncertainty in the imputed values. Single imputation methods (mean, regression) treat imputed values as if they were observed, leading to standard errors that are too small and p-values that are too significant.
When to use mice: when data are MAR, the proportion of missing data is
non-trivial, and the downstream analysis requires valid uncertainty quantification.
To illustrate the impact of imputation choice, we fit a simple linear regression
Ozone ~ Solar.R on each imputed dataset and compare the \(R^2\) values.
# Helper: R^2 from a fitted lm
r2 <- function(fit) summary(fit)$r.squared
fit_orig <- lm(Ozone ~ Solar.R, data = na.omit(airquality))
fit_locf_lm <- lm(Ozone ~ Solar.R, data = aq_locf)
fit_mean_lm <- lm(Ozone ~ Solar.R, data = aq_mean)
fit_reg_lm <- lm(Ozone ~ Solar.R, data = aq_reg)
# For mice, pool R^2 across imputed datasets using mean
mice_r2 <- mean(sapply(1:5, function(i)
r2(lm(Ozone ~ Solar.R, data = complete(imp, i)))))
df_r2 <- data.frame(
Method = c("Complete cases/Deletion", "LOCF",
"Mean imputation", "Regression imputation", "mice (mean)"),
R2 = round(c(r2(fit_orig), r2(fit_locf_lm), r2(fit_mean_lm),
r2(fit_reg_lm), mice_r2), 3)
)
df_r2ggplot(df_r2, aes(x = reorder(Method, R2), y = R2)) +
geom_col(fill = "steelblue") +
geom_text(aes(label = R2), hjust = -0.2, size = 3.5) +
coord_flip() +
ylim(0, max(df_r2$R2) * 1.15) +
labs(title = expression(R^2 ~ "by imputation method — Ozone ~ Solar.R"),
x = NULL, y = expression(R^2)) +
theme_bw()Figure 5.1: R-squared for the regression Ozone ~ Solar.R under each imputation method.
Note that the effect of mean imputation on \(R^2\) depends on which variable is imputed. When the response is imputed with its mean, those observations add noise to the regression (the imputed \(Y\) is constant regardless of \(X\)), which typically reduces \(R^2\). When a predictor is imputed with its mean, the imputed observations cluster at \(\bar{X}\) and their effect on \(R^2\) depends on the correlation structure. In either case, mean imputation distorts the distribution and biases downstream analyses.
Some practical advice before choosing an imputation method:
mice for MAR data when valid inference is important.| Method | Preserves mean | Preserves variance | Preserves correlations | Valid under |
|---|---|---|---|---|
| Deletion | Yes | Yes | Yes | MCAR only |
| LOCF | Approx. | No | No | Time series, MCAR |
| Mean imputation | Yes | No | No | MCAR, small % missing |
| Regression imputation | Yes | Partial | Yes | MAR |
Multiple imputation (mice) |
Yes | Yes | Yes | MAR |