1 Introduction

One of the most common problems a data scientist finds when dealing with real data is missing data: observations where one or more variables do not have recorded values. In R, missing values are represented by NA, Not Available. Before studying the data, or fitting a model, it is important to understand the nature (why) and amount (how much) of missing data. Depending on these two aspects we might have to consider recording the data again, to drop observations with missing values in some variables, or fill in for the missing data in a technique known as imputation. In this lecture we will explain different imputation techniques with the airquality dataset, available in the basic R installation. We start by looking at this dataset and finding how much data are missing.

data(airquality)
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Number and percentage of NAs per variable
na_counts <- colSums(is.na(airquality))
na_pct    <- round(100 * na_counts / nrow(airquality), 1)
data.frame(Missing = na_counts, Percent = na_pct)

Two variables are missing data: Ozone with 37 values missing, and Solar.R with 7 values missing, which represent about 24% and 4% of the observations, respectively. We will study with a little more detail the nature of the missing data in Section 3, but first we are going to find about the different types of missing data.


2 Types of Missing Data

Understanding why data are missing matters because it determines which imputation methods are valid. The standard taxonomy, due to Rubin (1976), has three categories.

2.1 Missing Completely at Random (MCAR)

The probability of a value being missing is unrelated to any variable in the dataset, or to some unobserved variables. In other words, the data are missing by pure chance. For example, a sensor randomly fails to record Ozone levels, independently of the weather in those days.

In the case of MCAR, throwing away the observations (cases) with missing data will not bias the result, but we reduce our sample size. This is the best case of missing data (the ideal case is when data are complete, nothing missing), though it is hard to prove that missing data are completely at random.

How to check: compare the distribution of observed variables between cases with and without missing data. Under MCAR, they should look similar. Little’s MCAR test (available in the naniar package) provides a formal test.

2.2 Missing at Random (MAR)

The probability of a value being missing depends on other observed variables, but not on the missing value itself once those are accounted for. Example: ozone readings are more likely to be missing on hot days (high Temp), but given the temperature, the probability of missingness is unrelated to the actual ozone level.

MAR is the most common working assumption in practice. Under MAR, methods such as multiple imputation and maximum likelihood estimation are valid.

2.3 Missing Not at Random (MNAR)

The probability of a value being missing depends either on unobserved variables or on the variable itself. An example of the first case could be when sensors do not record values because high traffic vibrations make them not work properly. For the second case, very high ozone values may not be recorded because the instrument saturates and fails to log a reading.

MNAR is the most problematic case: no standard imputation method fully corrects for it. Handling MNAR requires explicit modelling of the missingness mechanism, which is beyond the scope of this course.

When data are MCAR or MAR one can impute them, that is, fill in the missing data with other plausible values. We will learn different imputations techniques in Section 4


3 Visualising Missing Data

Before imputing, one should always study the missing data, so one knows how much is missing and whether there is any structure to the missingness.

We start with a graphical representation of the above table about the amount of missing data.

df_na <- data.frame(
  variable = names(na_pct),
  percent  = as.numeric(na_pct)
)

ggplot(df_na, aes(x = reorder(variable, -percent), y = percent)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = paste0(percent, "%")), vjust = -0.4, size = 3.5) +
  labs(x = NULL, y = "Missing (%)") +
  theme_bw()
Percentage of missing values per variable in the airquality dataset.

Figure 3.1: Percentage of missing values per variable in the airquality dataset.

An important piece of information is whether the missing data for different variables take place in common or complementary observations. That is, whether we have to add the number of cases with missing values, or some observations miss values in more than one variable. A very simple way to find this out is to compute the total number of observations with missing values.

cat("Observations with missing data:", sum(rowSums(is.na(airquality)) > 0))
## Observations with missing data: 42

In this example there are 42 observations with NAs. Since we know that the Ozone variable has 37 NAs, and the Solar.R variable has 7 NAs, we can conclude that the are a 2 rows with missing data in both variables. We can check it with R, as we know the variables with missing data.

cat("Observations with more than one missing variable:",
       sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R)))
## Observations with more than one missing variable: 2

We can also do this study graphically in a missingness map, each row is an observation, each column a variable, and cells are coloured by whether the value is present or missing. This reveals if missing values cluster in a group of observations, or spread along the dataset (like in the case of airquality).

# Build a long data frame of missingness indicators
df_map <- reshape2::melt(is.na(airquality))
names(df_map) <- c("observation", "variable", "missing")

ggplot(df_map, aes(x = variable, y = observation, fill = missing)) +
  geom_tile() +
  scale_fill_manual(values = c("FALSE" = "grey90", "TRUE" = "firebrick"),
                    labels = c("Observed", "Missing")) +
  labs(x = NULL, y = "Observation", fill = NULL) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Missingness map for the airquality dataset. Red cells indicate missing values.

Figure 3.2: Missingness map for the airquality dataset. Red cells indicate missing values.

From Figure 3.2 we can see that Ozone and Solar.R missing values do not appear to cluster at the same observations, as we check earlier with the computations of NAs. This suggests that their reasons for missing values are quite likely independent.


4 Imputation Methods

When missing data are not too many, and we have Missing at Random, we can replace the NAs by other values in what is known as Imputation. In this section we describe some imputation methods.

4.1 Complete Case Analysis (Deletion)

The simplest approach is to drop any observation that has at least one missing value. This method will not introduce bias in the study in we delete not too many observations, as we are not changing information. However, it will make the data available in the sample smaller. Thus, one has to be careful and not use this technique when the amount of missing data is not small (use your common sense). In R, na.omit() or complete.cases() do this.

aq_complete <- na.omit(airquality)
cat("Original rows:", nrow(airquality), "\n")
## Original rows: 153
cat("After deletion:", nrow(aq_complete), "\n")
## After deletion: 111
cat("Rows removed:", nrow(airquality) - nrow(aq_complete), "\n")
## Rows removed: 42

We lose 42 observations (27.5% of the data).

When to use: only when data are MCAR and the fraction missing is small. Otherwise, deletion introduces bias (the mode, for example, will seem more “important” than what really is) and reduces power.

When not to use: if data are MAR or MNAR, or if the proportion missing is non-trivial.

4.2 Last Observation Carried Forward (LOCF)

LOCF replaces each missing value with the most recent preceding non-missing value in the same variable. We can use this in a natural way if data are ordered according to some criteria like time. Otherwise, shuffling the data would result in different imputed values and would make the study worthless.

locf <- function(x) {
  # carry forward the last non-NA value
  for (i in seq_along(x)) {
    if (is.na(x[i]) && i > 1) x[i] <- x[i - 1]
  }
  x
}

aq_locf          <- airquality
aq_locf$Ozone    <- locf(aq_locf$Ozone)
aq_locf$Solar.R  <- locf(aq_locf$Solar.R)

cat("Remaining NAs after LOCF - Ozone:", sum(is.na(aq_locf$Ozone)),
    " Solar.R:", sum(is.na(aq_locf$Solar.R)), "\n")
## Remaining NAs after LOCF - Ozone: 0  Solar.R: 0

Note that LOCF cannot fill missing values that appear at the very start of the series (no previous value to carry forward). In the airquality data, the first observation of Ozone is not missing, so no leading NAs arise here. You can see in the figure below how long stretches with NAs become flat after the imputation, as all values are replaced by the last one available.

df_orig <- data.frame(day = 1:75, ozone = airquality$Ozone[1:75], type = "Observed")
df_locf <- data.frame(day = 1:75, ozone = aq_locf$Ozone[1:75],   type = "LOCF")

# Mark imputed points
imputed_idx <- which(is.na(airquality$Ozone[1:75]))

ggplot() +
  geom_line(data = df_locf, aes(x = day, y = ozone), colour = "grey70") +
  geom_point(data = df_orig, aes(x = day, y = ozone),
             colour = "steelblue", size = 2) +
  geom_point(data = df_locf[imputed_idx, ], aes(x = day, y = ozone),
             shape = 4, colour = "firebrick", size = 1, stroke = 1.5) +
  labs(title = "LOCF imputation — Ozone (first 75 days)",
       subtitle = "Blue dots: observed; red crosses: imputed",
       x = "Day", y = "Ozone (ppb)") +
  theme_bw()
Original Ozone values (points) and LOCF-imputed values for the first 75 observations.

Figure 4.1: Original Ozone values (points) and LOCF-imputed values for the first 75 observations.

Limitation: LOCF artificially flatlines the series at missing stretches and ignores the natural variability of the variable. It tends to underestimate variance and can introduce bias if the variable is trending. In our example, variance is reduced by approximately 10%.

cat("Original variance             :", var(airquality$Ozone, na.rm = TRUE), "\n")
## Original variance             : 1088.201
cat("Variance after LOCF Imputation:", var(aq_locf$Ozone))
## Variance after LOCF Imputation: 995.9992

4.3 Mean and Mode Imputation

In this technique we replace the missing values with the mean, in the case of continous variables, or the mode, for categorical variables.

aq_mean         <- airquality
ozone_mean      <- mean(aq_mean$Ozone,   na.rm = TRUE)
solar_mean      <- mean(aq_mean$Solar.R, na.rm = TRUE)

aq_mean$Ozone[is.na(aq_mean$Ozone)]     <- ozone_mean
aq_mean$Solar.R[is.na(aq_mean$Solar.R)] <- solar_mean

cat(sprintf("Ozone mean used for imputation:   %.2f\n", ozone_mean))
## Ozone mean used for imputation:   42.13
cat(sprintf("Solar.R mean used for imputation: %.2f\n", solar_mean))
## Solar.R mean used for imputation: 185.93

Limitation: mean imputation preserves the mean of the variable but distorts its distribution — it compresses the variance and can distort correlations with other variables. The imputed values are all identical, which is unrealistic.

df_dist <- rbind(
    data.frame(ozone = airquality$Ozone[!is.na(airquality$Ozone)], type = "Original (observed)"),
    data.frame(ozone = aq_mean$Ozone, type = "After mean imputation")
)

df_dist$type <- factor(df_dist$type, levels = c("Original (observed)", "After mean imputation"))

ggplot(df_dist, aes(x = ozone, fill = type)) +
  geom_histogram(binwidth = 10, colour = "white", alpha = 0.8) +
  facet_wrap(~ type) +
  scale_fill_manual(values = c("Original (observed)" = "steelblue",
                               "After mean imputation" = "firebrick")) +
  labs(x = "Ozone (ppb)", y = "Count") +
  theme_bw() +
  theme(legend.position = "none")
Distribution of Ozone before (left) and after mean imputation (right).

Figure 4.2: Distribution of Ozone before (left) and after mean imputation (right).

The figure clearly shows the artificial spike at the mean value after imputation.

4.4 Regression Imputation

Unlike mean imputation, regression imputation uses the relationship between variables to generate imputed values, producing estimates that are consistent with the correlation structure of the data.

Here we predict Ozone from Temp and Wind, and Solar.R from Temp.

# Fit models on complete cases only
fit_ozone  <- lm(Ozone   ~ Temp + Wind, data = airquality)
fit_solar  <- lm(Solar.R ~ Temp,        data = airquality)

aq_reg <- airquality

# Predict and fill missing Ozone values
missing_ozone           <- is.na(aq_reg$Ozone)
aq_reg$Ozone[missing_ozone] <-
  predict(fit_ozone, newdata = aq_reg[missing_ozone, ])

# Predict and fill missing Solar.R values
missing_solar               <- is.na(aq_reg$Solar.R)
aq_reg$Solar.R[missing_solar] <-
  predict(fit_solar, newdata = aq_reg[missing_solar, ])

cat("Remaining NAs - Ozone:", sum(is.na(aq_reg$Ozone)),
    " Solar.R:", sum(is.na(aq_reg$Solar.R)), "\n")
## Remaining NAs - Ozone: 0  Solar.R: 0

Limitation: regression imputation gives a single predicted value with no uncertainty — it underestimates the variance of the variable and overestimates the precision of subsequent analyses. For example, the value of \(R^2\) increases slightly in the above example when we predict the Ozone variable from all the other variables.

cat("R squared original data: ", summary(lm(Ozone ~ ., data = na.omit(airquality)))$r.squared)
## R squared original data:  0.6249408
cat("\nR squared imputed data :", summary(lm(Ozone ~ ., data = aq_reg))$r.squared)
## 
## R squared imputed data : 0.6545649

Stochastic regression imputation, that is, adding a random residual term to the prediction partially addresses this. See this link

4.5 Multiple Imputation with mice

The gold standard for missing data under MAR is multiple imputation (Rubin, 1987). Instead of replacing each missing value with a single estimate, multiple imputation generates \(m\) complete datasets, each with slightly different imputed values drawn from the predictive distribution of the missing data. Each dataset is analysed separately, and the results are combined using Rubin’s rules.

The mice package (Multivariate Imputation by Chained Equations) implements this in R.

library(mice)

set.seed(42)
# m = 5 imputed datasets, method = "pmm" (predictive mean matching, good for
# continuous variables)
imp <- mice(airquality, m = 5, method = "pmm", printFlag = FALSE)
summary(imp)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   Ozone Solar.R    Wind    Temp   Month     Day 
##   "pmm"   "pmm"      ""      ""      ""      "" 
## PredictorMatrix:
##         Ozone Solar.R Wind Temp Month Day
## Ozone       0       1    1    1     1   1
## Solar.R     1       0    1    1     1   1
## Wind        1       1    0    1     1   1
## Temp        1       1    1    0     1   1
## Month       1       1    1    1     0   1
## Day         1       1    1    1     1   0

The imputed datasets can be extracted and analysed individually, then combined:

# Fit the same regression on each imputed dataset and pool results
fit_mice <- with(imp, lm(Ozone ~ Solar.R + Wind + Temp))
pooled   <- pool(fit_mice)
summary(pooled)

Why multiple imputation is better: it properly accounts for the uncertainty in the imputed values. Single imputation methods (mean, regression) treat imputed values as if they were observed, leading to standard errors that are too small and p-values that are too significant.

When to use mice: when data are MAR, the proportion of missing data is non-trivial, and the downstream analysis requires valid uncertainty quantification.


5 Comparing the Methods

To illustrate the impact of imputation choice, we fit a simple linear regression Ozone ~ Solar.R on each imputed dataset and compare the \(R^2\) values.

# Helper: R^2 from a fitted lm
r2 <- function(fit) summary(fit)$r.squared

fit_orig     <- lm(Ozone ~ Solar.R, data = na.omit(airquality))
fit_locf_lm  <- lm(Ozone ~ Solar.R, data = aq_locf)
fit_mean_lm  <- lm(Ozone ~ Solar.R, data = aq_mean)
fit_reg_lm   <- lm(Ozone ~ Solar.R, data = aq_reg)

# For mice, pool R^2 across imputed datasets using mean
mice_r2 <- mean(sapply(1:5, function(i)
  r2(lm(Ozone ~ Solar.R, data = complete(imp, i)))))

df_r2 <- data.frame(
  Method = c("Complete cases/Deletion", "LOCF",
             "Mean imputation", "Regression imputation", "mice (mean)"),
  R2 = round(c(r2(fit_orig), r2(fit_locf_lm), r2(fit_mean_lm),
               r2(fit_reg_lm), mice_r2), 3)
)
df_r2
ggplot(df_r2, aes(x = reorder(Method, R2), y = R2)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = R2), hjust = -0.2, size = 3.5) +
  coord_flip() +
  ylim(0, max(df_r2$R2) * 1.15) +
  labs(title = expression(R^2 ~ "by imputation method  —  Ozone ~ Solar.R"),
       x = NULL, y = expression(R^2)) +
  theme_bw()
R-squared for the regression Ozone ~ Solar.R under each imputation method.

Figure 5.1: R-squared for the regression Ozone ~ Solar.R under each imputation method.

Note that the effect of mean imputation on \(R^2\) depends on which variable is imputed. When the response is imputed with its mean, those observations add noise to the regression (the imputed \(Y\) is constant regardless of \(X\)), which typically reduces \(R^2\). When a predictor is imputed with its mean, the imputed observations cluster at \(\bar{X}\) and their effect on \(R^2\) depends on the correlation structure. In either case, mean imputation distorts the distribution and biases downstream analyses.


6 Practical Guidelines and Summary

Some practical advice before choosing an imputation method:

  1. Always visualise the missing data pattern before doing anything else.
  2. Assess the mechanism: is missingness related to other observed variables? If so, MAR is more plausible than MCAR.
  3. Consider the proportion: if small number of data points are missing and MCAR is plausible, complete-case analysis is often acceptable.
  4. Avoid mean imputation when the proportion missing is large or downstream analyses require accurate standard errors.
  5. Prefer mice for MAR data when valid inference is important.
  6. Never impute the response variable \(Y\) in a supervised learning context — imputing \(Y\) is circular reasoning.
Method Preserves mean Preserves variance Preserves correlations Valid under
Deletion Yes Yes Yes MCAR only
LOCF Approx. No No Time series, MCAR
Mean imputation Yes No No MCAR, small % missing
Regression imputation Yes Partial Yes MAR
Multiple imputation (mice) Yes Yes Yes MAR

7 References

  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
  • Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
  • van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press. Freely available online at https://stefvanbuuren.name/fimd.
  • van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.