Mastering Missing Data in R: A Systematic Guide

Detection, Visualization, and Imputation

Author

Abdullah Al Shamim

Published

February 13, 2026

What will we learn from this lesson?

  • Detection: Finding exactly where missing values are located in the dataset.
  • Visualization: Understanding the scale of missing data through graphs.
  • Removal: Dropping rows with missing values when doing so is appropriate.
  • Imputation: Filling gaps with substitute values (such as the Mean or Median).

1. Environment Setup and Data Check

We will use the tidyverse alongside naniar and visdat to analyze and visualize missing data.

Code
# Install necessary packages
# install.packages(c("naniar", "visdat"))

# Loading required libraries
library(tidyverse)
library(naniar)
library(visdat)

Inspecting the Starwars Dataset

Code
# View the Starwars dataset
# view(starwars)

# Count total NAs in the entire dataset
sum(is.na(starwars))
[1] 105
Code
# Count NAs per column
colSums(is.na(starwars))
      name     height       mass hair_color skin_color  eye_color birth_year 
         0          6         28          5          0          0         44 
       sex     gender  homeworld    species      films   vehicles  starships 
         4          4         10          4          0          0          0 

Explanation: The is.na() function returns TRUE or FALSE for every single value, which is impractical to read by eye in a large dataset. Wrapping it in sum() gives one total for the whole dataset, and colSums() gives an at-a-glance count of missing values in each column.
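
Raw counts are easier to judge relative to the number of rows. As a small sketch (the name pct_missing is our own, not part of the original lesson), colMeans() can turn the same logical matrix into a percentage per column:

```r
library(dplyr)  # provides the starwars dataset

# Share of missing values per column, in percent, sorted from most to least
pct_missing <- sort(colMeans(is.na(starwars)) * 100, decreasing = TRUE)

# Show the five most affected columns, rounded to one decimal place
round(head(pct_missing, 5), 1)
```

This works because the mean of a TRUE/FALSE vector is the proportion of TRUE values.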


2. Visualizing Missing Data (Visual Overview)

Counts alone rarely reveal the pattern of missing data; a plot makes that pattern much easier to see.

Code
# Create a map of missing values across the dataset
vis_miss(starwars)

Code
# Chart showing which variables have the most missing data
gg_miss_var(starwars)

Explanation: These graphs make it obvious that birth_year (44 of 87 values missing) and mass (28 of 87) are the most affected columns. Any calculation on these variables needs explicit handling of NA values.
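
The plots can be backed up with numbers. The sketch below uses naniar's miss_var_summary() and miss_case_summary(); dropping the list columns first is our own simplifying assumption so that the summaries cover only ordinary columns, and the object names are illustrative:

```r
library(dplyr)
library(naniar)

# Assumption: drop the list columns (films, vehicles, starships)
# so the summaries only deal with ordinary atomic columns
starwars_flat <- starwars %>%
  select(where(~ !is.list(.x)))

# One row per variable: count and percentage of missing values
var_summary <- miss_var_summary(starwars_flat)
var_summary

# One row per case (character): how many values that row is missing
case_summary <- miss_case_summary(starwars_flat)
case_summary
```

The per-variable table mirrors gg_miss_var(), while the per-case table helps spot individual rows that are mostly empty.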


3. Removing Missing Data (Removal)

If a row lacks critical information, we can choose to remove that row entirely.

Code
# Option 1: Delete any row that contains at least one NA
starwars_clean <- starwars %>% 
  drop_na()

# Option 2: Delete rows only if NA exists in a specific column (e.g., height)
starwars_height_clean <- starwars %>% 
  drop_na(height)

# Verify (Result should be 0)
sum(is.na(starwars_height_clean$height))
[1] 0

Explanation: drop_na() is very powerful. Caution! Called without arguments, it removes every row that has an NA in any column, which can discard a lot of otherwise valuable data. Naming the column you actually need (Option 2) is generally the safer practice.
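
To see how much data each option costs, it helps to compare row counts before committing to one. A minimal sketch (variable names are our own):

```r
library(dplyr)
library(tidyr)

# Number of rows before and after each removal strategy
rows_original    <- nrow(starwars)
rows_drop_all    <- starwars %>% drop_na() %>% nrow()        # Option 1
rows_drop_height <- starwars %>% drop_na(height) %>% nrow()  # Option 2

c(original    = rows_original,
  drop_all    = rows_drop_all,
  drop_height = rows_drop_height)
```

Option 2 removes only the 6 rows with a missing height, whereas Option 1 removes every row that has a gap in any column.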


4. Filling Missing Data (Imputation)

Filling the empty spaces with a new value instead of deleting the data is called Imputation.

A. Replacement with a Specific Value

Code
starwars_fixed <- starwars %>%
  replace_na(list(height = 0, mass = 0))

# Verify
sum(is.na(starwars_fixed$height))
[1] 0

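Explanation: Here, we are instructing R to replace any missing height or mass value with 0. Keep in mind that 0 is only a placeholder: it is not a plausible height or mass, and it will be treated as a real measurement in later calculations.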
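
The effect of a placeholder on later calculations can be checked directly. This sketch compares the mean height before and after the replacement and, as one possible alternative that is not part of the original lesson, keeps the NA values while recording them in a hypothetical height_missing flag column:

```r
library(dplyr)
library(tidyr)

starwars_zero <- starwars %>%
  replace_na(list(height = 0, mass = 0))

# Mean of the observed heights vs. mean once the zeros are counted as data
mean_observed   <- mean(starwars$height, na.rm = TRUE)
mean_with_zeros <- mean(starwars_zero$height)
c(observed = mean_observed, with_zeros = mean_with_zeros)

# Alternative (our assumption): leave height untouched and flag the gaps
starwars_flagged <- starwars %>%
  mutate(height_missing = is.na(height))
```

The mean computed with zeros is lower than the mean of the observed values, because each placeholder pulls the average down.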

B. Filling with Mean (Advanced Mutate)

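This is a common and simple method. We fill the empty gaps with the average (Mean) of the observed values in that specific column. The column mean stays the same, but the spread of the data is understated because every filled value is identical.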

Code
starwars_mean_fill <- starwars %>%
  mutate(height = ifelse(is.na(height), 
                         mean(height, na.rm = TRUE), 
                         height))

# Verify
sum(is.na(starwars_mean_fill$height))
[1] 0

Explanation: The ifelse() function checks each value: if the height is NA, it is replaced with the mean; otherwise the original value is kept. Note that na.rm = TRUE is mandatory inside mean(); without it, the mean of a column containing NA is itself NA, and nothing would be filled.
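
When several columns need the same treatment, the ifelse() pattern can be wrapped in a function and applied with across(). This is a sketch only; impute_mean is a hypothetical helper of our own, not a function from any package:

```r
library(dplyr)

# Hypothetical helper: replace NAs in a numeric vector with the vector's mean
impute_mean <- function(x) {
  x <- as.numeric(x)  # height is stored as integer; the mean is a double
  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}

# Apply the same rule to height and mass in one step
starwars_multi_fill <- starwars %>%
  mutate(across(c(height, mass), impute_mean))

# Verify: no NAs remain in either column
colSums(is.na(starwars_multi_fill[c("height", "mass")]))
```

A quick sanity check is that the mean of the filled column equals the mean of the originally observed values.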

C. Filling with Median

Similar to the mean, we can use the Median to fill empty spaces. The median is preferable when the column contains outliers, because a few extreme values pull the mean toward them but barely move the median.

Code
# Filling empty height cells with the Median value
starwars_median_fill <- starwars %>%
  mutate(height = ifelse(is.na(height), 
                         median(height, na.rm = TRUE), 
                         height))

# Verify
sum(is.na(starwars_median_fill$height))
[1] 0
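
The mass column is a useful illustration of the outlier argument, since it contains at least one extremely heavy character. A short sketch (variable names are our own) compares the two statistics and then fills mass with the median:

```r
library(dplyr)

# Compare the two candidate fill values for mass
mass_mean   <- mean(starwars$mass, na.rm = TRUE)
mass_median <- median(starwars$mass, na.rm = TRUE)
c(mean = mass_mean, median = mass_median)

# Fill missing mass values with the median, which is less affected by outliers
starwars_mass_fill <- starwars %>%
  mutate(mass = ifelse(is.na(mass),
                       median(mass, na.rm = TRUE),
                       mass))

# Verify
sum(is.na(starwars_mass_fill$mass))
```

Here the mean is noticeably larger than the median, so mean imputation would assign the missing characters an unusually high mass.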

5. Final Report and Visualization (Before vs After)

Finally, let’s see if our cleaning process allows us to generate a valid visualization.

Code
# Chart of the tallest characters (height above 200 cm) after cleaning
starwars %>%
  drop_na(height) %>%   # Makes the NA removal explicit; filter() would also discard NA rows
  filter(height > 200) %>% 
  ggplot(aes(x = reorder(name, height), y = height)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top Tallest Characters", 
       x = "Character Name", 
       y = "Height (cm)") +
  theme_minimal()
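
The chart above selects characters with a fixed 200 cm threshold. If an exact number of characters is wanted instead, one option is dplyr's slice_max(); the sketch below is an alternative of our own, not the original author's code, and top10_tallest is an illustrative name:

```r
library(dplyr)

# Keep exactly the 10 tallest characters; with_ties = FALSE caps the result at 10 rows
top10_tallest <- starwars %>%
  slice_max(height, n = 10, with_ties = FALSE) %>%
  select(name, height)

top10_tallest
```

The result can be piped into the same ggplot() call in place of the drop_na() and filter() steps.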


Systematic Checklist (Cheat Sheet):

  • To Search: colSums(is.na(data))
  • To Visualize: vis_miss(data)
  • To Remove: data %>% drop_na(column)
  • To Fill: mutate(col = ifelse(is.na(col), mean(col, na.rm = TRUE), col))

Excellent work! You have now worked through a systematic workflow for handling missing data in R.