Handling missing data is a critical step in the data cleaning process. In this guide, we will cover:
Detection: Locating missing values within a dataset.
Visualization: Understanding the density of missing data through graphs.
Removal: Safely dropping rows with missing values.
Imputation: Intelligently filling gaps using statistical measures (Mean/Median).
1. Setup and Initial Audit
We will use the tidyverse for data manipulation, plus the naniar and visdat packages, which are designed specifically for missing-data analysis.
Code
# Load necessary libraries
library(tidyverse)
library(naniar)
library(visdat)

# Preview the 'starwars' dataset
head(starwars)
# A tibble: 6 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
Code
# Check total number of NAs in the entire dataset
sum(is.na(starwars))
[1] 105
Code
# Check number of NAs per column
colSums(is.na(starwars))
      name     height       mass hair_color skin_color  eye_color birth_year
         0          6         28          5          0          0         44
       sex     gender  homeworld    species      films   vehicles  starships
         4          4         10          4          0          0          0
Explanation: The is.na() function tests every individual value and returns TRUE where it is missing. In a large dataset the resulting logical matrix is too big to read, so we use sum() for the overall count and colSums() for a quick overview of which columns lack information.
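To see why summing works, it can help to run the same calls on a tiny example first. The sketch below uses made-up data (the names x and toy are purely illustrative) and relies only on base R:

```r
# A toy vector with one missing value (made-up data for illustration)
x <- c(10, NA, 30)
is.na(x)        # FALSE  TRUE FALSE
sum(is.na(x))   # 1, because each TRUE counts as 1

# A toy data frame: colSums() adds up the TRUEs column by column
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, "z"))
colSums(is.na(toy))   # a = 1, b = 2
```

The calls on starwars above work the same way, just on a larger logical matrix.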
2. Visualizing Missing Data
Reading numbers can be hard; visualizing the gaps makes patterns obvious.
Code
# Map of where information is missing across the dataset
vis_miss(starwars)
Code
# Charting which variables have the most missing data
gg_miss_var(starwars)
Explanation: These graphs show that birth_year and mass have the largest gaps: 44 and 28 missing values respectively, out of 87 rows. Any analysis that relies on these variables should account for that.
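To put numbers next to the charts, naniar also offers miss_var_summary(), which returns one row per variable with the count and percentage of missing values. A minimal sketch, assuming naniar and dplyr are installed; the name na_summary is only illustrative:

```r
library(dplyr)    # provides the starwars dataset
library(naniar)   # provides miss_var_summary()

# One row per variable, with columns variable, n_miss and pct_miss
na_summary <- miss_var_summary(starwars)
na_summary
```

Because the result is an ordinary tibble, it can be filtered or joined like any other table, for example to list only the variables that have at least one missing value.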
3. Removing Missing Data (Removal)
If a row lacks critical information, we can choose to remove it.
Code
# Option 1: Remove any row that contains at least one NA
starwars_clean <- starwars %>%
  drop_na()

# Option 2: Remove rows only if NA is found in a specific column (e.g., height)
starwars_height_clean <- starwars %>%
  drop_na(height)

# Verify removal
sum(is.na(starwars_height_clean$height))
[1] 0
Explanation: drop_na() is powerful but aggressive. Option 1 can delete valuable data when many columns contain NAs: birth_year alone is missing in 44 rows, so at least that many rows are dropped. Option 2 is safer because it only removes rows where a specific, essential variable is missing.
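A quick way to judge how aggressive each option is would be to compare row counts before and after. The sketch below assumes dplyr and tidyr are installed; the rows_* names are illustrative only:

```r
library(dplyr)   # starwars dataset and the %>% pipe
library(tidyr)   # drop_na()

rows_original <- nrow(starwars)
rows_any_na   <- starwars %>% drop_na() %>% nrow()        # Option 1
rows_height   <- starwars %>% drop_na(height) %>% nrow()  # Option 2

# Side-by-side comparison of how many rows survive each option
c(original = rows_original,
  drop_any_na = rows_any_na,
  drop_height_na = rows_height)
```

If Option 1 leaves only a small fraction of the original rows, targeting specific columns or imputing is usually worth considering instead.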
4. Filling Missing Data (Imputation)
Instead of deleting data, we can “impute” or fill the empty spaces.
A. Replacement with a Specific Value
Code
starwars_replaced <- starwars %>%
  replace_na(list(height = 0, mass = 0))
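A fixed placeholder such as 0 is easy to apply, but it is worth checking what it does to summary statistics, since a height of 0 is not a plausible measurement. The following sketch, assuming dplyr and tidyr are installed, compares the mean height before and after the replacement:

```r
library(dplyr)   # starwars dataset and the %>% pipe
library(tidyr)   # replace_na()

starwars_replaced <- starwars %>%
  replace_na(list(height = 0, mass = 0))

mean(starwars$height, na.rm = TRUE)   # mean of the observed heights only
mean(starwars_replaced$height)        # lower, because the zeros are averaged in
```

For this reason, a fixed value tends to suit cases where the placeholder has a real meaning (for example "unknown" in a character column) better than numeric measurements.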
B. Imputation using the Mean (Average)
A common approach is to replace each NA with the mean of the observed values in that column.
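A minimal sketch of this step, assuming dplyr is installed and using height as the column to fill; the name starwars_mean is illustrative only:

```r
library(dplyr)   # starwars dataset, mutate() and the %>% pipe

starwars_mean <- starwars %>%
  mutate(height = ifelse(is.na(height),
                         mean(height, na.rm = TRUE),  # fill value: mean of observed heights
                         height))                     # otherwise keep the original value

# Verify that no NAs remain in height
sum(is.na(starwars_mean$height))
```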
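Explanation: An expression such as ifelse(is.na(height), mean(height, na.rm = TRUE), height) checks each value: if height is NA, it is replaced with the column mean; otherwise the original height is kept. Note: na.rm = TRUE is required, because mean() returns NA when the column contains any NA.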
C. Imputation using the Median
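The median is often safer than the mean if your data has extreme outliers, because a few very large or very small values pull the mean toward them but barely move the median.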
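The same ifelse() pattern works with median() in place of mean(). The sketch below assumes dplyr is installed and uses mass as the example column, on the assumption that it is the more outlier-prone of the two numeric measurements; the name starwars_median is illustrative only:

```r
library(dplyr)   # starwars dataset, mutate() and the %>% pipe

# Compare the two candidate fill values before choosing one
mean(starwars$mass, na.rm = TRUE)
median(starwars$mass, na.rm = TRUE)

starwars_median <- starwars %>%
  mutate(mass = ifelse(is.na(mass),
                       median(mass, na.rm = TRUE),  # fill value: median of observed masses
                       mass))                       # otherwise keep the original value

# Verify that no NAs remain in mass
sum(is.na(starwars_median$mass))
```

When the mean and median differ noticeably, as a comparison like the one above can reveal, the median is usually the more representative fill value.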