Detection, Visualization, and Imputation
We will use the tidyverse alongside naniar and visdat to analyze and visualize missing data.
# Install necessary packages
# install.packages(c("naniar", "visdat"))
# Loading required libraries
library(tidyverse)
library(naniar)
library(visdat)

# View the starwars dataset
# view(starwars)

# Count total NAs in the entire dataset
sum(is.na(starwars))
[1] 105

# Count NAs per column
colSums(is.na(starwars))
      name     height       mass hair_color skin_color  eye_color birth_year
         0          6         28          5          0          0         44
       sex     gender  homeworld    species      films   vehicles  starships
         4          4         10          4          0          0          0
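Explanation: The is.na() function tests every single value and returns TRUE where it is missing, so sum() over the result counts all missing values in the dataset. A single total does not show where the gaps are, and scanning a large dataset by eye is impractical. We therefore use colSums() to get an at-a-glance summary of how many values are missing in each column.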
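Raw counts are easier to compare once they are expressed as a share of the rows. The following is a minimal sketch using base R's colMeans(); the `chars` data frame and its values are made up for illustration and are not taken from starwars.

```r
# Hypothetical toy data; the values are illustrative only
chars <- data.frame(
  name   = c("A", "B", "C", "D"),
  height = c(172, NA, 96, 202),
  mass   = c(77, NA, NA, 136)
)

# is.na() returns a logical matrix; the column mean of TRUE/FALSE values
# is the proportion of missing entries in that column
pct_missing <- colMeans(is.na(chars)) * 100

# Sort so that the most incomplete columns come first
pct_missing <- sort(pct_missing, decreasing = TRUE)
print(pct_missing)
```

The same expression should work unchanged on the real data, for example colMeans(is.na(starwars)) * 100.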
Counts alone rarely reveal the pattern of missing data; plotting it makes the pattern much easier to see.
# Create a map of missing values across the dataset
vis_miss(starwars)

# Chart showing which variables have the most missing data
gg_miss_var(starwars)

Explanation: These plots show that columns such as birth_year and mass have substantial missing data (44 and 28 missing values, respectively). We must be cautious when performing calculations on these variables.
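The plots summarise missingness mostly by column. A per-row count can complement them by showing which observations are the most incomplete. Below is a minimal sketch with base R's rowSums(), using a hypothetical toy data frame (`chars` and the `n_missing` column are names chosen for this example).

```r
# Hypothetical toy data; the values are illustrative only
chars <- data.frame(
  name   = c("A", "B", "C", "D"),
  height = c(172, NA, 96, 202),
  mass   = c(77, NA, NA, 136),
  stringsAsFactors = FALSE
)

# Number of missing values in each row
chars$n_missing <- rowSums(is.na(chars))

# Put the rows with the most gaps first
chars_by_gaps <- chars[order(-chars$n_missing), ]
print(chars_by_gaps)
```

Rows that are missing most of their fields are often better candidates for removal than rows with a single gap.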
If a row lacks critical information, we can choose to remove that row entirely.
# Option 1: Delete any row that contains at least one NA
starwars_clean <- starwars %>%
  drop_na()

# Option 2: Delete rows only if NA exists in a specific column (e.g., height)
starwars_height_clean <- starwars %>%
  drop_na(height)

# Verify (Result should be 0)
sum(is.na(starwars_height_clean$height))
[1] 0
Explanation: Called without arguments, drop_na() removes every row that has an NA in any column. Used blindly, it can discard a large share of otherwise valuable data. Restricting it to the columns you actually need (Option 2) is generally the safer practice.
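To make the trade-off concrete, it helps to compare row counts before and after each option. This is a small sketch on a hypothetical toy data frame (the `chars` data and the `rows_*` variable names are assumptions made for this example).

```r
library(tidyr)

# Hypothetical toy data; only row "A" is complete
chars <- data.frame(
  name   = c("A", "B", "C", "D"),
  height = c(172, NA, 96, 202),
  mass   = c(77, 75, NA, NA)
)

rows_all    <- nrow(chars)                   # all rows
rows_any_na <- nrow(drop_na(chars))          # rows with no NA in any column
rows_height <- nrow(drop_na(chars, height))  # rows with a known height

cat("All rows:             ", rows_all, "\n")
cat("After drop_na():      ", rows_any_na, "\n")
cat("After drop_na(height):", rows_height, "\n")
```

Here the unrestricted call keeps 1 of 4 rows, while the column-specific call keeps 3.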
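Filling in missing values with a substitute, instead of deleting the rows, is called imputation.
starwars_fixed <- starwars %>%
  replace_na(list(height = 0, mass = 0))

# Verify
sum(is.na(starwars_fixed$height))
[1] 0
Explanation: Here, we instruct R to replace any missing height or mass value with 0. Keep in mind that 0 is only a placeholder: it is not a plausible height or mass, so it will pull down any average computed on these columns.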
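The effect of a constant fill on summary statistics is easy to check. The sketch below uses a made-up three-value vector (not starwars data) and compares the mean of the observed values with the mean after filling the gap with 0.

```r
library(tidyr)

# Hypothetical heights with one missing value
height <- c(170, NA, 190)

# Mean of the observed values only
mean_observed <- mean(height, na.rm = TRUE)

# Fill the gap with 0, then take the mean of all three values
height_zero      <- replace_na(height, 0)
mean_zero_filled <- mean(height_zero)

cat("Mean of observed values:", mean_observed, "\n")
cat("Mean after filling with 0:", mean_zero_filled, "\n")
```

The observed mean is 180, but the zero-filled mean drops to 120, which is why a constant such as 0 is usually only appropriate when 0 is a meaningful value for the variable.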
Mean imputation is a common approach: we fill the gaps with the average (mean) of the observed values in that column.
starwars_mean_fill <- starwars %>%
  mutate(height = ifelse(is.na(height),
                         mean(height, na.rm = TRUE),
                         height))

# Verify
sum(is.na(starwars_mean_fill$height))
[1] 0
Explanation: The ifelse() function checks each value: if the height is NA, it is replaced with the column mean; otherwise the original value is kept. Note that na.rm = TRUE is required inside mean(); without it, mean() returns NA whenever the column contains a missing value.
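One possible variation is to use tidyr's replace_na() inside mutate() and to record which values were filled, so imputed entries can be told apart later. This is a sketch on a hypothetical toy tibble; the `height_imputed` column name is an assumption for this example. The toy height column is stored as a double. The starwars height column is an integer, so depending on your tidyr version you may need to convert it with as.numeric() before filling it with a non-integer mean.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; height is a double column
chars <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(150, NA, 170, 190)
)

chars_filled <- chars %>%
  mutate(
    # Flag the rows that are about to be filled (must come before the fill)
    height_imputed = is.na(height),
    # Replace NA with the mean of the observed heights
    height = replace_na(height, mean(height, na.rm = TRUE))
  )

print(chars_filled)
```

Keeping the flag column makes it possible to rerun an analysis with and without the imputed rows.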
Similar to the mean, we can use the median to fill missing values. The median is preferable when the column contains outliers, because extreme values affect it far less than they affect the mean.
# Filling empty height cells with the Median value
starwars_median_fill <- starwars %>%
  mutate(height = ifelse(is.na(height),
                         median(height, na.rm = TRUE),
                         height))

# Verify
sum(is.na(starwars_median_fill$height))
[1] 0
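To see why outliers matter for the choice between mean and median, consider this small sketch. The vector is made up for illustration, with one deliberately extreme value, and the variable names are assumptions for this example.

```r
# Hypothetical heights: one extreme outlier and one missing value
height <- c(160, 170, 180, 1000, NA)

# Candidate fill values
mean_fill   <- mean(height, na.rm = TRUE)
median_fill <- median(height, na.rm = TRUE)

cat("Mean fill value:  ", mean_fill, "\n")
cat("Median fill value:", median_fill, "\n")

# Fill the gap with the median
height_filled <- ifelse(is.na(height), median_fill, height)
print(height_filled)
```

The single outlier pulls the mean up to 377.5, while the median stays at 175, much closer to the typical values.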
Finally, let’s see if our cleaning process allows us to generate a valid visualization.
# Chart of the tallest characters (height over 200 cm) after cleaning
starwars %>%
  drop_na(height) %>% # Drop rows with a missing height before plotting
  filter(height > 200) %>%
  ggplot(aes(x = reorder(name, height), y = height)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top Tallest Characters",
       x = "Character Name",
       y = "Height (cm)") +
  theme_minimal()

colSums(is.na(data))
vis_miss(data)
drop_na(column)
mutate(col = ifelse(is.na(col), mean(col, na.rm = TRUE), col))

Excellent work! You have successfully mastered the systematic workflow for handling missing data in R.