In this code-through tutorial, we will learn how to clean data using
the dplyr package in R and visualize results using
ggplot2.
Data cleaning is a foundational step before any analysis. This tutorial
covers selecting variables, filtering rows, creating new variables, and
replacing missing values.
All changes are demonstrated using a simple toy dataset for clarity.
library(dplyr)
library(tibble)
library(ggplot2)
library(tidyr)
maroon <- "#8C1D40"
gold <- "#FFC627"
df <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82),
  extra_column = c("x", "x", "x", "x", "x")
)
df
## # A tibble: 5 × 4
## student math english extra_column
## <chr> <dbl> <dbl> <chr>
## 1 A 90 85 x
## 2 B 75 70 x
## 3 C NA 95 x
## 4 D 60 NA x
## 5 E 88 82 x
ggplot(df, aes(x = student, y = math)) +
  geom_col(fill = maroon) +
  labs(title = "Math Scores Before Cleaning",
       y = "Math Score")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_col()`).
glimpse(df)
## Rows: 5
## Columns: 4
## $ student <chr> "A", "B", "C", "D", "E"
## $ math <dbl> 90, 75, NA, 60, 88
## $ english <dbl> 85, 70, 95, NA, 82
## $ extra_column <chr> "x", "x", "x", "x", "x"
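Before cleaning, it can help to count the missing values per column rather than reading them off the `glimpse()` output. The following is a minimal sketch on the same toy data; the name `na_counts` is introduced here for illustration and is not part of the original tutorial.

```r
library(dplyr)
library(tibble)

# Same toy data as in the tutorial
df <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82),
  extra_column = c("x", "x", "x", "x", "x")
)

# Count the missing values in every column (na_counts is an illustrative name)
na_counts <- df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts

# A column with a single distinct value carries no information for analysis
n_distinct(df$extra_column)
```

Here `math` and `english` each contain one missing value, and `extra_column` holds a single distinct value, which is one reasonable justification for dropping it in the next step.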
# Keep only the columns needed for analysis; the constant extra_column is dropped
clean_df <- df %>%
  select(student, math, english)
clean_df
## # A tibble: 5 × 3
## student math english
## <chr> <dbl> <dbl>
## 1 A 90 85
## 2 B 75 70
## 3 C NA 95
## 4 D 60 NA
## 5 E 88 82
# Drop rows where both scores are missing (no row in the toy data qualifies)
clean_df <- clean_df %>%
  filter(!(is.na(math) & is.na(english)))
clean_df
## # A tibble: 5 × 3
## student math english
## <chr> <dbl> <dbl>
## 1 A 90 85
## 2 B 75 70
## 3 C NA 95
## 4 D 60 NA
## 5 E 88 82
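The filter above leaves all five rows in place because no student is missing both scores. To show what the condition actually does, here is a small sketch on hypothetical data: student "F" is invented for this example and does not appear in the tutorial's dataset, and the object names are illustrative only.

```r
library(dplyr)
library(tibble)

# Hypothetical data: student F is invented here and has no scores at all
demo_df <- tibble(
  student = c("C", "D", "F"),
  math = c(NA, 60, NA),
  english = c(95, NA, NA)
)

# The tutorial's rule: drop a row only when BOTH scores are missing
both_missing_removed <- demo_df %>%
  filter(!(is.na(math) & is.na(english)))

# A stricter alternative: drop a row when EITHER score is missing
any_missing_removed <- demo_df %>%
  filter(!is.na(math), !is.na(english))

both_missing_removed
any_missing_removed
```

Under the tutorial's rule only F is removed, so partially observed students such as C and D remain available for imputation later. The stricter rule would discard all three rows of this example.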
ggplot(clean_df, aes(x = math, y = english, label = student)) +
  geom_point(size = 3, color = gold) +
  geom_text(nudge_y = 3, color = maroon) +
  labs(title = "Math vs English (Before Cleaning)",
       x = "Math Score", y = "English Score")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).
# avg_score is NA whenever either score is missing, because NA propagates through arithmetic
clean_df <- clean_df %>%
  mutate(avg_score = (math + english) / 2)
clean_df
## # A tibble: 5 × 4
## student math english avg_score
## <chr> <dbl> <dbl> <dbl>
## 1 A 90 85 87.5
## 2 B 75 70 72.5
## 3 C NA 95 NA
## 4 D 60 NA NA
## 5 E 88 82 85
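Students C and D get an `NA` average because any arithmetic involving `NA` returns `NA`. The tutorial resolves this by imputing the missing scores in the next step. As a point of comparison only, the sketch below shows a different option, averaging whichever scores are available with base R's `rowMeans()`. The column names `avg_strict` and `avg_available` are made up for this illustration.

```r
library(dplyr)
library(tibble)

# Same scores as the tutorial's clean_df at this stage
scores <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82)
)

scores <- scores %>%
  mutate(
    # NA propagates: missing in either subject gives a missing average
    avg_strict = (math + english) / 2,
    # Alternative (not the tutorial's method): average only the observed scores
    avg_available = rowMeans(cbind(math, english), na.rm = TRUE)
  )

scores
```

With this alternative, a student with one observed score simply receives that score as the average, which may or may not be appropriate depending on the analysis.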
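# Column means computed from the observed (non-missing) scores
mean_math <- mean(clean_df$math, na.rm = TRUE)
mean_eng <- mean(clean_df$english, na.rm = TRUE)
# Replace each missing score with its column mean, then recompute the average
clean_df <- clean_df %>%
  mutate(
    math = ifelse(is.na(math), mean_math, math),
    english = ifelse(is.na(english), mean_eng, english),
    avg_score = (math + english) / 2
  )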
clean_df
## # A tibble: 5 × 4
## student math english avg_score
## <chr> <dbl> <dbl> <dbl>
## 1 A 90 85 87.5
## 2 B 75 70 72.5
## 3 C 78.2 95 86.6
## 4 D 60 83 71.5
## 5 E 88 82 85
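The imputed values can be checked by hand: the math mean is (90 + 75 + 60 + 88) / 4 = 78.25 and the English mean is (85 + 70 + 95 + 82) / 4 = 83. The sketch below reproduces the same mean imputation using `dplyr::coalesce()` in place of `ifelse()`. This is an equivalent alternative shown for illustration, not the tutorial's own code, and the names `scores` and `imputed` are assumptions.

```r
library(dplyr)
library(tibble)

# Same scores as the tutorial's clean_df before imputation
scores <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82)
)

# Means of the observed values only
mean_math <- mean(scores$math, na.rm = TRUE)    # (90 + 75 + 60 + 88) / 4
mean_eng <- mean(scores$english, na.rm = TRUE)  # (85 + 70 + 95 + 82) / 4

imputed <- scores %>%
  mutate(
    # coalesce() keeps the existing value and falls back to the mean where it is NA
    math = coalesce(math, mean_math),
    english = coalesce(english, mean_eng),
    avg_score = (math + english) / 2
  )

imputed
```

Note that the printed tibble shortens values for display, so the stored math score for student C is 78.25 even though it appears as 78.2 in the output above.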
ggplot(clean_df, aes(x = student, y = math)) +
  geom_col(fill = gold) +
  labs(title = "Math Scores After Cleaning",
       y = "Math Score")
ggplot(clean_df, aes(x = student, y = avg_score)) +
  geom_col(fill = maroon) +
  labs(title = "Average Scores After Cleaning",
       y = "Average Score")
# Stack the raw and cleaned data, then reshape to one row per student, version, and subject
compare_df <- df %>%
  select(student, math, english) %>%
  mutate(type = "Before Cleaning") %>%
  bind_rows(
    clean_df %>% mutate(type = "After Cleaning")
  ) %>%
  pivot_longer(cols = c(math, english), names_to = "subject", values_to = "score")
# Side-by-side bars per student, one panel per subject
ggplot(compare_df, aes(x = student, y = score, fill = type)) +
  geom_col(position = "dodge") +
  facet_wrap(~subject) +
  scale_fill_manual(values = c("Before Cleaning" = maroon,
                               "After Cleaning" = gold)) +
  labs(title = "Before vs After Cleaning: Subject Scores",
       y = "Score", fill = "Dataset Version")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_col()`).
clean_df
## # A tibble: 5 × 4
## student math english avg_score
## <chr> <dbl> <dbl> <dbl>
## 1 A 90 85 87.5
## 2 B 75 70 72.5
## 3 C 78.2 95 86.6
## 4 D 60 83 71.5
## 5 E 88 82 85
This code-through demonstrated a complete, beginner-friendly workflow for cleaning and preparing data using the dplyr package in R. We started with a raw dataset that contained missing values and an irrelevant column. Step by step, we applied essential data wrangling techniques: selecting variables, filtering rows, creating new variables, and replacing missing values with column means. Each transformation was visualized so that the impact of cleaning decisions could be clearly observed.
The visualizations in this tutorial reinforce why data cleaning is such a crucial step: patterns and relationships become easier to see once the data is organized and complete. Using maroon and gold styling, we also demonstrated how plots can be made more readable and visually appealing.
Overall, this tutorial highlights an important lesson: clean data leads to better insights. With just a few dplyr functions and clear visualizations, anyone can transform messy data into a polished dataset ready for analysis or modeling. These techniques form the foundation of most real-world data science workflows, and mastering them will greatly improve the quality and reliability of future projects.