1 Abstract
2 Introduction
- 2.1 Detecting missing data
- 2.2 Multiple imputation with mice
3 Missing-data methods in R
4 Results
5 Discussion and conclusions
6 References

1 Abstract

Missing data are everywhere in real-world datasets and can easily mislead analyses if we ignore them. In this project I use R and several packages to explore how different missing-data strategies affect results.

First, I use the built-in airquality data and the naniar package to visualize where values are missing and to show that missingness can depend on time. Second, I use the nhanes data from the mice package to compare a complete-case regression with a regression that uses multiple imputation. Finally, I simulate an income–education dataset and compare four methods (complete cases, overall mean imputation, regression imputation, and multiple imputation with mice). I show how each method changes group means and distribution shapes.

The results highlight both how easy it is to mishandle missing data and how R tools can make better approaches (especially multiple imputation) much more accessible.

2 Introduction

In almost every real dataset, some values are missing. Survey respondents skip questions, labs lose samples, and sensors malfunction. A common first reaction is to drop any row with a missing value, but this “complete-case” strategy often wastes information and can introduce serious bias.

This paper focuses on how to handle missing data in R, rather than on one particular scientific dataset. My main goals are:

- to show how to visualize and diagnose missingness in R using naniar;
- to demonstrate basic multiple imputation in R with mice;
- to compare several simple strategies for a simulated income dataset and see how much they can distort results.

Conceptually, you can think of the main approaches in this paper as:

- Complete cases: keep only rows with no missing values at all.
- Single imputations (mean and regression): fill in one “best guess” for each missing value, then treat the filled-in dataset as if it were complete.
- Multiple imputation (mice): create several different completed versions of the data, analyze each one, and then combine the results to reflect extra uncertainty about the missing values.

Throughout the paper I keep the code relatively simple and use small, well-known datasets when possible, so that someone with limited R experience can reproduce the analyses.

2.1 Detecting missing data

The airquality dataset is built into base R. It contains daily air quality measurements in New York from May to September 1973.

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

summary(airquality)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

The summary shows 37 NA values in Ozone and 7 in Solar.R; the other four columns are complete. To display this per-variable pattern graphically, I use gg_miss_var() from the naniar package.

Conceptually, naniar does not “fix” missing data; it helps you see the holes. You can think of it as an X-ray for a data frame: instead of plotting values, it plots where the NAs are and how they are distributed across variables and groups.
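The counts that such a plot displays can also be checked numerically. The following is a minimal base-R sketch, assuming only the built-in airquality data; the names na_counts and na_percent are mine and are not part of naniar.

```r
# Count missing values per column of the built-in airquality data.
na_counts <- colSums(is.na(airquality))

# Express the same counts as a percentage of all rows.
na_percent <- round(100 * na_counts / nrow(airquality), 1)

print(na_counts)
print(na_percent)
```

For airquality these counts agree with the summary() output: 37 missing values in Ozone, 7 in Solar.R, and none elsewhere.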

library(naniar)   # gg_miss_var()
library(ggplot2)  # ggtitle(), xlab(), ylab()

p_air_var <- gg_miss_var(airquality) +
  ggtitle("Number of missing values by variable in airquality") +
  xlab("Variables") +
  ylab("Number of missing values")

p_air_var

Figure 1. Number of missing values by variable in the airquality dataset, created with naniar::gg_miss_var().

Figure 1 shows that missingness is highly uneven across variables: Ozone has 37 missing values and Solar.R has 7, while Wind, Temp, Month, and Day have none. In a real analysis this would be a warning sign that Ozone and Solar.R need special attention.

Next, I want to check whether missingness depends on month. If entire months have more missing values, complete-case analysis could distort trends over time.

library(dplyr)  # %>% and mutate()

airquality_month <- airquality %>%
  mutate(Month = factor(Month))

p_air_month <- gg_miss_var(airquality_month, facet = Month) +
  ggtitle("Number of missing values by variable, faceted by month") +
  xlab("Variables") +
  ylab("Number of missing values")

p_air_month

Figure 2. Number of missing values by variable, faceted by month. Some months have many more missing Ozone and Solar.R values than others.

Figure 2 adds another layer: some months have far more missing Ozone and Solar.R values than others. A simple complete-case analysis would therefore drop a disproportionate share of days from those months, which could distort any time trends.
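To put numbers on the month-by-month pattern, the sketch below tabulates missing Ozone values per month and counts how many rows a complete-case analysis would keep or drop in each month. It uses only base R and the built-in airquality data; the object names are illustrative.

```r
# Missing Ozone values per month (5 = May, ..., 9 = September).
ozone_na_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)

# Rows that a complete-case analysis would keep or drop, by month.
kept <- complete.cases(airquality)
rows_by_month <- table(Month  = airquality$Month,
                       Status = ifelse(kept, "kept", "dropped"))

print(ozone_na_by_month)
print(rows_by_month)
```

If the "dropped" column is much larger for one month than for the others, that month will be under-represented in any complete-case summary.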

2.2 Multiple imputation with mice

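The mice package provides the nhanes dataset, a small health dataset (25 rows) with missing values in three of its four variables: age group (age), body mass index (bmi), hypertension status (hyp), and cholesterol (chl).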

library(mice)  # provides the nhanes data, mice(), and pool()
nhanes

##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

summary(nhanes)

##       age            bmi             hyp             chl       
##  Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0  
##  1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0  
##  Median :2.00   Median :26.75   Median :1.000   Median :187.0  
##  Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4  
##  3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0  
##  Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0  
##                 NA's   :9       NA's   :8       NA's   :10

For illustration, I fit a linear regression of cholesterol (chl) on age and BMI, treating the age-group code as numeric. I compare two approaches:

- Complete cases: drop rows with a missing value in any model variable and fit the model.
- Multiple imputation (MI): use mice() to create several imputed datasets, fit the model in each, and combine the results with pool().

Conceptually, multiple imputation in mice works in three steps. First, it fills in the missing values several different ways to create \(m\) complete datasets (for example, \(m = 5\)); each completed dataset has slightly different imputed values because the algorithm adds randomness to reflect uncertainty. Second, you fit the same model in each completed dataset. Third, pool() combines the \(m\) sets of estimates, using both the variation within each dataset and the variation between datasets.
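The combining step follows Rubin's rules. As a rough illustration of the arithmetic only, here is a base-R sketch; pool_rubin() is a hypothetical helper written for this paper, the input numbers are made up, and the real pool() also computes degrees of freedom and other diagnostics that this sketch omits.

```r
# Pool m point estimates and their standard errors with Rubin's rules.
# `est` and `se` are hypothetical inputs: one entry per imputed dataset.
pool_rubin <- function(est, se) {
  m     <- length(est)
  q_bar <- mean(est)                 # pooled point estimate
  w     <- mean(se^2)                # within-imputation variance
  b     <- var(est)                  # between-imputation variance
  total <- w + (1 + 1 / m) * b       # total variance
  c(estimate = q_bar, std.error = sqrt(total))
}

# Made-up estimates from m = 5 imputed datasets, for illustration only.
pooled <- pool_rubin(est = c(10, 12, 11, 13, 9), se = rep(2, 5))
print(pooled)
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates disagree across imputations, which is how the extra uncertainty about the missing values enters the result.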

# 1. Complete-case regression: lm() drops rows with an NA in chl, age, or bmi
fit_cc <- lm(chl ~ age + bmi, data = nhanes)

# 2. Multiple imputation with mice: m = 5 datasets, predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

fit_mi <- with(imp, lm(chl ~ age + bmi))
pool_fit <- pool(fit_mi)
sum_mi <- summary(pool_fit)

fit_cc

## 
## Call:
## lm(formula = chl ~ age + bmi, data = nhanes)
## 
## Coefficients:
## (Intercept)          age          bmi  
##     -80.194       53.069        6.884

sum_mi

##          term  estimate std.error statistic       df    p.value
## 1 (Intercept) 11.430196 72.560560 0.1575263 8.020422 0.87872369
## 2         age 27.505289 10.784999 2.5503285 9.748109 0.02938517
## 3         bmi  4.979979  2.322195 2.1445134 7.835031 0.06503537

To compare them visually, I tidy up the coefficient estimates and confidence intervals and plot them.

# Tidy complete-case coefficients
coef_cc <- broom::tidy(fit_cc) %>%
  filter(term != "(Intercept)") %>%
  mutate(method = "Complete cases")

# Tidy MI coefficients
coef_mi <- sum_mi %>%
  filter(term != "(Intercept)") %>%
  transmute(
    term,
    estimate  = estimate,
    std.error = std.error,
    method    = "Multiple imputation"
  )

coef_all <- bind_rows(coef_cc, coef_mi)

# Point estimates with approximate 95% intervals (estimate +/- 2 standard errors)
ggplot(coef_all,
       aes(x = term, y = estimate, color = method)) +
  geom_point(position = position_dodge(width = 0.4)) +
  geom_errorbar(aes(ymin = estimate - 2 * std.error,
                    ymax = estimate + 2 * std.error),
                width = 0.1,
                position = position_dodge(width = 0.4)) +
  coord_flip() +
  labs(
    title = "Cholesterol regression: complete cases vs multiple imputation",
    x = "Predictor",
    y = "Estimated coefficient",
    color = "Method"
  ) +
  theme_minimal(base_size = 11)

Figure 3. Estimated coefficients for chl ~ age + bmi in the nhanes data, comparing a complete-case model and a multiple-imputation model using mice.

Figure 3 helps explain what multiple imputation is doing in practice. The red points and intervals come from the complete-case regression, which ignores any row where chl, age, or bmi is missing and so uses only 13 of the 25 rows. This throws away data and can bias the estimates if the dropped rows differ systematically from the rows that remain. The blue points and intervals come from the multiple-imputation model: mice fills in the missing values several different ways, fits the regression in each completed dataset, and then combines the estimates with pool().

Comparing the two sets of coefficients, the multiple-imputation estimates have the same signs as the complete-case estimates but are noticeably smaller: the age coefficient drops from about 53.1 to 27.5 and the bmi coefficient from about 6.9 to 5.0. These shifts show that including information from the partially observed cases can change the fitted relationship. The pooled standard errors also carry the extra uncertainty about what the missing values could have been, so the intervals are not automatically narrower even though more rows are used. In other words, multiple imputation uses more of the available data while also being more honest about how much we do not know.
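The error bars in Figure 3 use plus or minus 2 standard errors. Because the pooled degrees of freedom are small (roughly 8 to 10), a t-based 95% interval is somewhat wider than that. The sketch below recomputes such intervals from the pooled table printed earlier; the numbers are copied from that output and the data frame name pooled is mine.

```r
# Pooled estimates copied from the summary(pool_fit) output shown earlier.
pooled <- data.frame(
  term      = c("age", "bmi"),
  estimate  = c(27.505289, 4.979979),
  std.error = c(10.784999, 2.322195),
  df        = c(9.748109, 7.835031)
)

# 95% t-based interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = pooled$df)
pooled$lower <- pooled$estimate - t_crit * pooled$std.error
pooled$upper <- pooled$estimate + t_crit * pooled$std.error

print(pooled)
```

Consistent with the p-values in the pooled table, the interval for age excludes zero while the interval for bmi includes it. In mice, summary(pool_fit, conf.int = TRUE) should report comparable intervals directly.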

3 Missing-data methods in R

For the main part of the project, I simulate a simple income–education dataset and then apply several missing-data methods to it. In this artificial example I know the “truth,” so I can see how much each method distorts it.

I create a sample where:

- education has four levels (Less than HS, HS, Some college, Bachelor+),
- income is higher on average for higher education,
- missingness in income is more common for low-income people.

set.seed(123)

n <- 2000

education_levels <- c("Less than HS", "HS", "Some college", "Bachelor+")
education <- sample(education_levels, size = n, replace = TRUE,
                    prob = c(0.2, 0.3, 0.3, 0.2))

education <- factor(education, levels = education_levels)

age <- round(rnorm(n, mean = 40, sd = 12))

# Base income by education group
base_income <- c(
  "Less than HS" = 30000,
  "HS"           = 40000,
  "Some college" = 55000,
  "Bachelor+"    = 80000
)

income_true <- base_income[as.character(education)] +
  500 * (age - 40) +            # small age effect
  rnorm(n, mean = 0, sd = 10000)

sim_full <- data.frame(
  education = education,
  age       = age,
  income    = income_true
)

head(sim_full)

##      education age   income
## 1           HS  28 32496.93
## 2    Bachelor+  28 70722.43
## 3 Some college  40 40518.35
## 4 Less than HS  38 22027.15
## 5 Less than HS   9 40484.90
## 6           HS  52 45625.85

Now I introduce missing values in income. People with lower income are more likely to have missing values (for example, because of nonresponse).

data_biased <- sim_full %>%
  mutate(
    miss_prob = case_when(
      income < 40000 ~ 0.40,
      income < 60000 ~ 0.25,
      TRUE           ~ 0.10
    ),
    is_miss = rbinom(n(), size = 1, prob = miss_prob),
    income  = ifelse(is_miss == 1, NA, income)
  )

# Complete-case dataset
data_cc <- data_biased %>%
  filter(!is.na(income), !is.na(education))

mean(is.na(data_biased$income))

## [1] 0.2555

In the rest of this section I compare four methods for handling the missing incomes: complete cases, overall mean imputation, regression imputation, and multiple imputation with mice.

3.1 Method 1: complete cases

Complete-case analysis drops any row where income is missing. Conceptually, this is like saying “if you skipped at least one question on the survey, we pretend your whole survey never existed.” This is very simple and often the default in R, but if some groups are more likely to be missing (for example, low-income people), it can systematically bias results.

cc_means <- data_cc %>%
  group_by(education) %>%
  summarise(
    mean_income = mean(income),
    .groups = "drop"
  ) %>%
  mutate(method = "Complete cases")

cc_means

## # A tibble: 4 × 3
##   education    mean_income method        
##   <fct>              <dbl> <chr>         
## 1 Less than HS      29914. Complete cases
## 2 HS                40507. Complete cases
## 3 Some college      56534. Complete cases
## 4 Bachelor+         79961. Complete cases

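Complete-case analysis is common in practice because modeling functions such as lm() and glm() drop incomplete rows by default (the default na.action is na.omit). Other functions behave differently: cor(), for example, returns NA when values are missing unless you set use = "complete.obs".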
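The difference between these defaults is easy to check on the built-in airquality data. This is a small base-R sketch that assumes the session still has R's default na.action; the object names are illustrative.

```r
# lm() silently drops incomplete rows under the default na.action (na.omit).
fit     <- lm(Ozone ~ Temp, data = airquality)
n_used  <- nobs(fit)          # rows actually used in the fit
n_total <- nrow(airquality)   # rows in the data

# cor() does not drop rows by default: any NA makes the result NA.
r_default  <- cor(airquality$Ozone, airquality$Temp)
r_complete <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")

print(c(used = n_used, total = n_total))
print(c(default = r_default, complete = r_complete))
```

Comparing nobs() with nrow() after every model fit is a cheap way to notice how many rows were dropped without any warning.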

3.2 Method 2: overall mean imputation

Overall mean imputation replaces each missing income value with the overall mean of observed income. Conceptually, this is like saying “anyone who did not report income gets the average income.” This keeps all rows in the dataset, but it shrinks differences between groups and makes the data look more concentrated around the mean than it really is.

overall_mean <- mean(data_biased$income, na.rm = TRUE)

data_imp_mean <- data_biased %>%
  mutate(
    income_imp = ifelse(is.na(income), overall_mean, income)
  )

mean_means <- data_imp_mean %>%
  group_by(education) %>%
  summarise(
    mean_income = mean(income_imp),
    .groups = "drop"
  ) %>%
  mutate(method = "Overall mean impute")

mean_means

## # A tibble: 4 × 3
##   education    mean_income method             
##   <fct>              <dbl> <chr>              
## 1 Less than HS      38988. Overall mean impute
## 2 HS                44720. Overall mean impute
## 3 Some college      55902. Overall mean impute
## 4 Bachelor+         77257. Overall mean impute

This method is simple but can strongly distort group differences by pulling low-income groups upward and high-income groups downward.
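The loss of spread can be seen even in a tiny example. The sketch below uses a made-up five-value vector (not the simulated income data) so that the arithmetic is easy to follow by hand.

```r
# Toy income vector (made-up numbers) with two missing values.
income <- c(10, 20, 30, NA, NA)

# Overall mean imputation: replace each NA with the observed mean.
observed_mean <- mean(income, na.rm = TRUE)
income_imp    <- ifelse(is.na(income), observed_mean, income)

# The mean is unchanged, but the variance is cut in half in this example.
var_observed <- var(income, na.rm = TRUE)
var_imputed  <- var(income_imp)

print(c(mean = mean(income_imp),
        var_observed = var_observed,
        var_imputed  = var_imputed))
```

Every imputed value sits exactly at the mean and contributes nothing to the sum of squares, so the more values are imputed, the more the variance shrinks.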

3.3 Method 3: regression imputation

Regression imputation uses a regression model to predict income from education and age, and plugs the predictions in for missing values. Conceptually, this is like saying “for each person with missing income, we predict what they probably would have earned given their education and age, using a regression fitted on the people who did report income.” This usually preserves relationships between variables better than overall mean imputation, but because imputed values sit close to the regression line, it tends to underestimate variability.

# Fit regression using only complete cases
fit_reg <- lm(income ~ education + age, data = data_cc)

# Use predicted values for missing incomes
data_imp_reg <- data_biased %>%
  mutate(
    income_imp = ifelse(
      is.na(income),
      predict(fit_reg, newdata = data_biased),
      income
    )
  )

reg_means <- data_imp_reg %>%
  group_by(education) %>%
  summarise(
    mean_income = mean(income_imp),
    .groups = "drop"
  ) %>%
  mutate(method = "Regression impute")

reg_means

## # A tibble: 4 × 3
##   education    mean_income method           
##   <fct>              <dbl> <chr>            
## 1 Less than HS      29597. Regression impute
## 2 HS                40216. Regression impute
## 3 Some college      56247. Regression impute
## 4 Bachelor+         80185. Regression impute

Regression imputation usually preserves relationships with predictors better than overall mean imputation, but it tends to underestimate variability because predicted values fall close to the regression line.
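The variance point can be checked directly. To keep the example self-contained, the sketch below applies regression imputation to the built-in airquality data rather than to the simulated incomes: Ozone is predicted from Temp and Wind, which have no missing values. The object names are illustrative.

```r
# Regression imputation on airquality: predict Ozone from Temp and Wind,
# which are fully observed in this dataset.
fit       <- lm(Ozone ~ Temp + Wind, data = airquality)
pred      <- predict(fit, newdata = airquality)
ozone_imp <- ifelse(is.na(airquality$Ozone), pred, airquality$Ozone)

# Fitted values vary less than the observed outcome they were fitted to.
var_fitted   <- var(fitted(fit))
var_observed <- var(airquality$Ozone, na.rm = TRUE)

print(sum(is.na(ozone_imp)))   # no missing values remain
print(c(var_fitted = var_fitted, var_observed = var_observed))
```

For a linear model with an intercept, the variance of the fitted values equals R-squared times the variance of the outcome, so imputations taken from the regression line are always less spread out than real observations would be.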

3.4 Method 4: multiple imputation with mice (simulated data)

Finally, I apply multiple imputation to the simulated income data using the mice package. Conceptually, this works the same way as in the nhanes example: instead of filling in each missing income once, mice creates several different completed versions of the dataset, each with slightly different imputed incomes. These imputations are based on a model that uses the observed variables (income, education, and age) and include randomness to reflect uncertainty.

For this tutorial, I use the first completed dataset from mice to compute mean incomes by education and to compare distribution shapes. In a full multiple-imputation workflow, you would typically fit your model in each imputed dataset and then combine the estimates with pool().

Here, the main goal is to see how a model-based method like mice compares to complete cases, overall mean imputation, and single regression imputation on the same simulated problem.

## Multiple imputation for income using mice on simulated data ----

# We only need the variables used in imputation
data_for_mice <- data_biased %>%
  select(income, education, age)

# Run mice: m = 5 imputed datasets
imp_income <- mice(data_for_mice, m = 5, seed = 123)

## 
##  iter imp variable
##   1   1  income
##   1   2  income
##   1   3  income
##   1   4  income
##   1   5  income
##   2   1  income
##   2   2  income
##   2   3  income
##   2   4  income
##   2   5  income
##   3   1  income
##   3   2  income
##   3   3  income
##   3   4  income
##   3   5  income
##   4   1  income
##   4   2  income
##   4   3  income
##   4   4  income
##   4   5  income
##   5   1  income
##   5   2  income
##   5   3  income
##   5   4  income
##   5   5  income

# Take the first completed dataset for summaries/plots
data_imp_mi <- complete(imp_income, 1)

# Mean income by education under mice
mi_means <- data_imp_mi %>%
  group_by(education) %>%
  summarise(
    mean_income = mean(income),
    .groups = "drop"
  ) %>%
  mutate(method = "Multiple imputation (mice)")

mi_means

## # A tibble: 4 × 3
##   education    mean_income method                    
##   <fct>              <dbl> <chr>                     
## 1 Less than HS      29746. Multiple imputation (mice)
## 2 HS                40419. Multiple imputation (mice)
## 3 Some college      56101. Multiple imputation (mice)
## 4 Bachelor+         80292. Multiple imputation (mice)
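The group means above come from only the first completed dataset. A fuller summary would compute the same quantity in every completed dataset and average the results. The sketch below shows that pattern on the nhanes data (not the simulated incomes) so that it runs on its own; the requireNamespace() guard only keeps it from failing where mice is not installed, and bmi_means is an illustrative name.

```r
# Average a summary over all m completed datasets instead of using only one.
# Uses the nhanes data shipped with mice; skipped if mice is not installed.
bmi_means <- NULL
if (requireNamespace("mice", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mice))
  imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

  # Mean bmi in each of the five completed datasets.
  bmi_means <- sapply(1:5, function(i) mean(mice::complete(imp, i)$bmi))

  print(bmi_means)         # one mean per completed dataset
  print(mean(bmi_means))   # pooled point estimate: average over imputations
}
```

The spread of the five values gives a direct impression of how much the answer depends on the particular imputations drawn.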

4 Results

4.1 Mean income by education under each method

First I compute the true mean income by education (no missing data):

true_means <- sim_full %>%
  group_by(education) %>%
  summarise(
    mean_income = mean(income),
    .groups = "drop"
  ) %>%
  mutate(method = "True (no missing)")

# Add complete cases, mean impute, regression impute, and mice
all_means <- bind_rows(
  true_means,
  cc_means,
  mean_means,
  reg_means,
  mi_means
)

all_means

## # A tibble: 20 × 3
##    education    mean_income method                    
##    <fct>              <dbl> <chr>                     
##  1 Less than HS      29474. True (no missing)         
##  2 HS                39457. True (no missing)         
##  3 Some college      55553. True (no missing)         
##  4 Bachelor+         79768. True (no missing)         
##  5 Less than HS      29914. Complete cases            
##  6 HS                40507. Complete cases            
##  7 Some college      56534. Complete cases            
##  8 Bachelor+         79961. Complete cases            
##  9 Less than HS      38988. Overall mean impute       
## 10 HS                44720. Overall mean impute       
## 11 Some college      55902. Overall mean impute       
## 12 Bachelor+         77257. Overall mean impute       
## 13 Less than HS      29597. Regression impute         
## 14 HS                40216. Regression impute         
## 15 Some college      56247. Regression impute         
## 16 Bachelor+         80185. Regression impute         
## 17 Less than HS      29746. Multiple imputation (mice)
## 18 HS                40419. Multiple imputation (mice)
## 19 Some college      56101. Multiple imputation (mice)
## 20 Bachelor+         80292. Multiple imputation (mice)

Now I visualize the mean income for each education group and each method.

all_means$method <- factor(
  all_means$method,
  levels = c(
    "True (no missing)",
    "Complete cases",
    "Overall mean impute",
    "Regression impute",
    "Multiple imputation (mice)"
  )
)

ggplot(all_means,
       aes(x = education, y = mean_income, fill = method)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) +
  scale_fill_manual(
    values = c(
      "True (no missing)"           = "grey20",
      "Complete cases"              = "#D73027",
      "Overall mean impute"         = "#FC8D59",
      "Regression impute"           = "#4575B4",
      "Multiple imputation (mice)"  = "#1B9E77"
    )
  ) +
  labs(
    title = "Average income by education under different methods",
    x = "Education level",
    y = "Mean income",
    fill = "Method"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Figure 4. Average income by education level under the true data (no missing) and four missing-data methods.

Figure 4 shows how each method changes the average income pattern across education groups. The true bars increase steadily with education. The complete-case bars are all a little too high (by roughly $200 to $1,050), because low-income people are more likely to be missing and get dropped. Overall mean imputation pulls the groups toward a single middle value, shrinking the gap between “Less than HS” and “Bachelor+” from about $50,300 to about $38,300. Regression imputation and mice both recover a pattern close to the truth: every group mean is within about $1,000 of its true value, though still slightly too high.

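Interestingly, in this one simulated dataset the complete-case means are sometimes as close to the truth as the mice means (for example, in the “Bachelor+” group). This can happen when the amount of missingness is moderate (about 26% here) and the selection bias is not huge. It also reflects the fact that missingness here depends on income itself, so even within an education group the observed incomes are shifted upward, and an imputation model fitted to those observed incomes inherits part of that shift. Finally, I am summarizing mice with only one completed dataset instead of pooling results across all \(m\) imputations. In a larger simulation study averaged over many repetitions, we would usually expect properly specified multiple imputation to perform at least as well as complete cases.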

4.2 Distribution shape for the HS group

Means can hide important changes in the shape of the distribution. To see this, I focus on the “HS” group and compare the full income distributions under each method.

true_hs <- sim_full %>%
  filter(education == "HS")

cc_hs <- data_cc %>%
  filter(education == "HS")

mean_hs <- data_imp_mean %>%
  filter(education == "HS")

reg_hs <- data_imp_reg %>%
  filter(education == "HS")

# HS group from the first mice-completed dataset
mi_hs <- data_imp_mi %>%
  filter(education == "HS") %>%
  transmute(
    income = income,
    method = "Multiple imputation (mice)"
  )

hs_all <- bind_rows(
  true_hs %>%
    select(income) %>%
    mutate(method = "True (no missing)"),
  cc_hs %>%
    select(income) %>%
    mutate(method = "Complete cases"),
  mean_hs %>%
    transmute(income = income_imp,
              method = "Overall mean impute"),
  reg_hs %>%
    transmute(income = income_imp,
              method = "Regression impute"),
  mi_hs
)

hs_all$method <- factor(
  hs_all$method,
  levels = c(
    "True (no missing)",
    "Complete cases",
    "Overall mean impute",
    "Regression impute",
    "Multiple imputation (mice)"
  )
)

ggplot(hs_all,
       aes(x = income, color = method, linetype = method)) +
  geom_density(linewidth = 1) +
  scale_color_manual(
    values = c(
      "True (no missing)"           = "black",
      "Complete cases"              = "#D73027",
      "Overall mean impute"         = "#FC8D59",
      "Regression impute"           = "#4575B4",
      "Multiple imputation (mice)"  = "#1B9E77"
    )
  ) +
  labs(
    title = "Income distribution for HS group under different methods",
    x = "Income",
    y = "Density",
    color = "Method",
    linetype = "Method"
  ) +
  theme_minimal(base_size = 11)

Figure 5. Income distributions for the HS group under the true data and four missing-data methods.

Figure 5 focuses on the HS group and compares the whole income distribution under each method. The solid black curve is the true distribution. The complete-case curve is shifted slightly to the right, again showing that dropping missing incomes makes this group look richer than it really is. Overall mean imputation produces a tall, very narrow bump around the mean, because many missing values are set to the same number; this badly underestimates the spread of incomes. Regression imputation tracks the center reasonably well but has a sharper peak and thinner tails than the true curve, reflecting the fact that predicted values sit close to the regression line. The mice curve is a bit rough (since it is based on one imputed dataset) but stays much closer to the true shape than overall mean or regression imputation, especially in the tails.

4.3 Practical guide: choosing a method

In a real dataset we never know the true means or regression coefficients, so there is no single “best” method for every situation. A useful way to think about missing data is:

- Diagnose first. Use tools like naniar to see how much is missing, which variables are affected, and whether missingness depends on time or groups.
- If missingness is small and looks random, complete cases may be acceptable, but it is still worth checking sensitivity.
- If missingness clearly depends on other observed variables (for example, income missing more often in some education or age groups), then methods that use those variables, such as regression imputation or multiple imputation, are usually preferable.
- If results change a lot across methods, that is a sign that conclusions are sensitive to how missing data are handled, and this should be reported.

Table 1. Summary of methods and when a beginner might use them

Situation / goal	Method	R tools / functions	Simple description	Main pros	Main cons
Very small amount of missing data, looks random	Complete cases	`na.omit()`, default in `lm()` and many functions	Drop any row with an `NA`.	Very easy, already built in.	Throws away data; can bias results if some groups are missing more than others.
Need a quick fill just to run some code	Overall mean imputation	`mutate(... = ifelse(is.na(x), mean(x, na.rm = TRUE), x))`	Replace each missing value with the overall mean of that variable.	Simple and keeps all rows.	Destroys variability and shrinks differences between groups; usually not recommended.
Missingness depends on predictors, single step	Regression imputation	`lm()`, `predict()`	Predict missing values from a regression using other variables, then plug in those predictions.	Uses extra information; preserves relationships better.	Underestimates variability; still only one completed dataset (single imputation).
Serious analysis with several variables missing	Multiple imputation (`mice`)	`mice()`, `with()`, `pool()`	Create several completed datasets with random variation in imputations, fit models, then combine.	Uses all data; more realistic standard errors; flexible.	More complex; requires more decisions and computing than single imputation.

5 Discussion and conclusions

This project shows how different missing-data strategies, all easily implemented in R, can produce very different answers.

From the airquality and naniar example, I learned that visualizing missingness is an important first step. Missing values were not uniformly spread across months, which means that a simple complete-case analysis could distort the apparent seasonal pattern.

From the nhanes example, I saw how mice can perform multiple imputation with only a few lines of code. The coefficient plot (Figure 3) highlights that using all the data through multiple imputation can both change the point estimates and better reflect uncertainty.

In the simulated income example, where the true pattern is known, complete-case analysis overestimated mean income in all four education groups. Overall mean imputation severely distorted group differences and distribution shapes, while regression imputation did better but still underestimated variation. A single mice-completed dataset gave group means about as close to the truth as regression imputation and a distribution shape closer to the true one.

Overall, the main lessons are:

- Visualize missingness first (for example with naniar).
- Be cautious with complete-case analysis and simple mean imputation.
- Consider model-based approaches such as regression imputation or multiple imputation (mice) when the missingness mechanism is not completely random.

Future work could include adding a full multiple-imputation analysis to the simulated dataset, using more realistic missingness mechanisms, and comparing results across many simulated datasets rather than a single run.

6 References

R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

van Buuren, S. and Groothuis-Oudshoorn, K. (2011). “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software, 45(3), 1–67.

Tierney, N. J. and Cook, D. (2018). “naniar: data structures for missing data in R.” Journal of Open Source Software, 3(26), 642.

Handling Missing Data in R

Shreedip Das

2025-12-10
