Mastering Missing Data in R

What will we learn from this lesson?

Handling missing data is a critical step in the data cleaning process. In this guide, we will cover:

Detection: Locating missing values within a dataset.
Visualization: Understanding the density of missing data through graphs.
Removal: Safely dropping rows with missing values.
Imputation: Intelligently filling gaps using statistical measures (Mean/Median).

1. Setup and Initial Audit

We will use the tidyverse for data manipulation, plus the naniar and visdat packages, which are designed specifically for missing-data analysis.

Code

# Load necessary libraries
library(tidyverse)
library(naniar)
library(visdat)

# Preview the 'starwars' dataset
head(starwars)

# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Code

# Check total number of NAs in the entire dataset
sum(is.na(starwars))

[1] 105

Code

# Check number of NAs per column
colSums(is.na(starwars))

      name     height       mass hair_color skin_color  eye_color birth_year 
         0          6         28          5          0          0         44 
       sex     gender  homeworld    species      films   vehicles  starships 
         4          4         10          4          0          0          0

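Explanation: The is.na() function tests every individual value and returns TRUE wherever a value is missing. Because TRUE counts as 1, sum() gives the total number of NAs. In large datasets the full logical matrix is too big to read, so we use colSums() to get a quick per-column overview of which columns lack information.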
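To make the mechanics concrete, here is a minimal sketch on a tiny, hypothetical data frame (the name toy and its values are invented for illustration and are not part of starwars). It shows the logical matrix that is.na() produces and how sum(), colSums(), and colMeans() summarize it.

```r
# Hypothetical toy data, small enough to check by eye
toy <- data.frame(
  x = c(1, NA, 3, NA),
  y = c("a", "b", NA, "d"),
  z = 1:4
)

na_matrix  <- is.na(toy)           # logical matrix: TRUE marks a missing value
total_na   <- sum(na_matrix)       # TRUE counts as 1, so this is the NA total
na_per_col <- colSums(na_matrix)   # number of missing values per column
na_share   <- colMeans(na_matrix)  # proportion of missing values per column

print(na_matrix)
print(total_na)    # 3
print(na_per_col)  # x = 2, y = 1, z = 0
print(na_share)    # x = 0.50, y = 0.25, z = 0.00
```

The proportion from colMeans() is often easier to compare across columns than the raw count, since it does not depend on the number of rows.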

2. Visualizing Missing Data

Reading numbers can be hard; visualizing the gaps makes patterns obvious.

Code

# Map of where information is missing across the dataset
vis_miss(starwars)

Code

# Charting which variables have the most missing data
gg_miss_var(starwars)

Explanation: These graphs show that birth_year and mass have the largest gaps (44 and 28 missing values respectively, matching the colSums() output above). We must be cautious when using these variables for analysis.
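When exact figures are needed alongside the plots, naniar also offers a tabular summary. The sketch below assumes the naniar function miss_var_summary(), which returns one row per variable with a count and a percentage of missing values. Treat it as a companion to gg_miss_var(), not a replacement for looking at the plots.

```r
library(dplyr)   # provides the starwars dataset
library(naniar)

# Tabular counterpart to gg_miss_var(): one row per variable,
# with the count (n_miss) and percentage (pct_miss) of missing values
missing_summary <- miss_var_summary(starwars)
print(missing_summary)
```

The n_miss column should agree with the colSums(is.na(starwars)) output from the audit step.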

3. Removing Missing Data (Removal)

If a row lacks critical information, we can choose to remove it.

Code

# Option 1: Remove any row that contains at least one NA
starwars_clean <- starwars %>%
  drop_na()

# Option 2: Remove rows only if NA is found in a specific column (e.g., height)
starwars_height_clean <- starwars %>%
  drop_na(height)

# Verify removal
sum(is.na(starwars_height_clean$height))

[1] 0

Explanation: drop_na() is powerful but aggressive. Option 1 removes a row if any column contains an NA, so it can discard many rows that are complete for the variables you actually need. Option 2 is safer because it only removes rows where a specific, essential variable is missing.
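One way to judge how aggressive each option is: count the rows that survive. The following is a small sketch that compares the two strategies on starwars. The variable names are illustrative.

```r
library(dplyr)
library(tidyr)

rows_total        <- nrow(starwars)
rows_all_complete <- nrow(drop_na(starwars))          # Option 1: any NA drops the row
rows_height_known <- nrow(drop_na(starwars, height))  # Option 2: only NA in height drops the row

comparison <- tibble(
  strategy  = c("original", "drop_na()", "drop_na(height)"),
  rows_kept = c(rows_total, rows_all_complete, rows_height_known)
)
print(comparison)
```

Since height has 6 missing values, Option 2 should keep all but 6 rows, while Option 1 keeps only rows that are complete in every column.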

4. Filling Missing Data (Imputation)

Instead of deleting data, we can “impute” or fill the empty spaces.

A. Replacement with a Specific Value

Code

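# Replace NAs with a fixed placeholder. Caution: 0 is not a plausible
# height or mass, so it will pull summaries such as the mean downward.
starwars_replaced <- starwars %>%
  replace_na(list(height = 0, mass = 0))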
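A fixed replacement value changes any statistic computed afterwards. The sketch below illustrates this by comparing the mean height of the observed values with the mean after NAs have been replaced by 0.

```r
library(dplyr)
library(tidyr)

starwars_replaced <- starwars %>%
  replace_na(list(height = 0, mass = 0))

mean_observed   <- mean(starwars$height, na.rm = TRUE)  # ignores missing heights
mean_with_zeros <- mean(starwars_replaced$height)       # counts each former NA as 0 cm

print(c(observed = mean_observed, with_zeros = mean_with_zeros))
```

The second mean is lower because every former NA now contributes a height of 0. A fixed value is better suited to cases where the placeholder is itself meaningful, such as an explicit "unknown" category in a character column.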

B. Imputation using the Mean (Average)

This is a common and simple approach where we replace NAs with the average value of that column.

Code

starwars_mean_fill <- starwars %>%
  mutate(height = ifelse(is.na(height),
                         mean(height, na.rm = TRUE),
                         height))

Explanation: The ifelse() function checks each value: if height is NA, it is replaced with the mean; otherwise, the original height is kept. Note: na.rm = TRUE is required, because mean() returns NA if any value in the column is missing.
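The short sketch below first shows why na.rm = TRUE matters, then checks what mean imputation does to the column. The checks vector is just an illustrative way to collect the results.

```r
library(dplyr)

mean(c(1, NA, 3))                # NA: a single missing value makes the result NA
mean(c(1, NA, 3), na.rm = TRUE)  # 2

starwars_mean_fill <- starwars %>%
  mutate(height = ifelse(is.na(height),
                         mean(height, na.rm = TRUE),
                         height))

# Compare the column before and after imputation
checks <- c(
  remaining_na = sum(is.na(starwars_mean_fill$height)),
  mean_before  = mean(starwars$height, na.rm = TRUE),
  mean_after   = mean(starwars_mean_fill$height),
  sd_before    = sd(starwars$height, na.rm = TRUE),
  sd_after     = sd(starwars_mean_fill$height)
)
print(checks)
```

The mean is unchanged by construction, but the standard deviation shrinks, because every filled value sits exactly at the center of the distribution. This is worth keeping in mind before using an imputed column for anything that depends on spread.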

C. Imputation using the Median

The median is often safer than the mean if your data has extreme outliers.

Code

starwars_median_fill <- starwars %>%
  mutate(height = ifelse(is.na(height),
                         median(height, na.rm = TRUE),
                         height))
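To see the difference between the two statistics, it helps to compare them on a column with outliers. The sketch below uses mass, on the assumption that a few very heavy characters skew it; printing both values lets you verify that assumption before choosing a fill value.

```r
library(dplyr)

mass_mean   <- mean(starwars$mass, na.rm = TRUE)
mass_median <- median(starwars$mass, na.rm = TRUE)
print(c(mean = mass_mean, median = mass_median))

# Fill missing mass values with the median, which outliers barely move
starwars_mass_fill <- starwars %>%
  mutate(mass = ifelse(is.na(mass), mass_median, mass))

sum(is.na(starwars_mass_fill$mass))  # 0
```

If the mean comes out clearly above the median, a handful of large values is pulling it upward, and the median is the more representative fill value.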

5. Final Report: Before vs. After

Let’s see our cleaning in action by plotting the tallest characters, here those taller than 200 cm.

Code

starwars %>%
  drop_na(height) %>%   # Make NA removal explicit (filter() would also drop them)
  filter(height > 200) %>%
  ggplot(aes(x = reorder(name, height), y = height)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Tallest Characters (Over 200 cm)",
       subtitle = "Cleaned dataset excluding missing height values",
       x = "Character Name", 
       y = "Height (cm)") +
  theme_minimal()
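A fixed threshold such as 200 cm does not guarantee a particular number of characters. If an exact top 10 is wanted instead, one option is dplyr's slice_max(), sketched below. Setting with_ties = FALSE is a choice made here so that exactly 10 rows come back even when heights are tied.

```r
library(dplyr)

# Select exactly the 10 tallest characters, tallest first
top10_tallest <- starwars %>%
  slice_max(height, n = 10, with_ties = FALSE) %>%
  select(name, height)

print(top10_tallest)
```

The result can be piped into the same ggplot() call as above in place of the drop_na() and filter() steps.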

Summary Cheat Sheet

Task	Function
Detect	`colSums(is.na(data))`
Visualize	`vis_miss(data)`
Remove	`drop_na(data, column)`
Impute	`mutate(data, col = ifelse(is.na(col), mean(col, na.rm = TRUE), col))`
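When several numeric columns need the same treatment, the imputation pattern can be wrapped in a small function. The sketch below uses a hypothetical helper, impute_with(), which is not part of any package, together with dplyr's across(). Median is chosen as the default only for illustration.

```r
library(dplyr)

# Hypothetical helper (not from any package): fill NAs in a numeric
# vector with a summary statistic computed from the observed values
impute_with <- function(x, fun = median) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Apply the same imputation to several numeric columns at once
starwars_imputed <- starwars %>%
  mutate(across(c(height, mass, birth_year), impute_with))

remaining <- colSums(is.na(starwars_imputed[c("height", "mass", "birth_year")]))
print(remaining)  # all zero
```

Passing fun = mean switches the helper to mean imputation, for example mutate(across(height, ~ impute_with(.x, fun = mean))).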

Congratulations! You now have a systematic set of tools to handle missing data and maintain the integrity of your analysis.
