Introduction

In this code-through tutorial, we will learn how to clean data using the dplyr package in R and visualize results using ggplot2.
Data cleaning is a foundational step before any analysis. This tutorial covers:

  • Inspecting a dataset
  • Selecting important variables
  • Filtering rows
  • Creating new variables
  • Handling missing values
  • Visualizing before and after cleaning
  • Comparing datasets

All changes are demonstrated using a simple toy dataset for clarity.


Load Libraries

library(dplyr)
library(tibble)
library(ggplot2)
library(tidyr)

maroon <- "#8C1D40"
gold <- "#FFC627"

Creating a Toy Dataset

df <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82),
  extra_column = c("x", "x", "x", "x", "x")
)

df
## # A tibble: 5 × 4
##   student  math english extra_column
##   <chr>   <dbl>   <dbl> <chr>       
## 1 A          90      85 x           
## 2 B          75      70 x           
## 3 C          NA      95 x           
## 4 D          60      NA x           
## 5 E          88      82 x

Visualization 1: Math Scores Before Cleaning

ggplot(df, aes(x = student, y = math)) +
  geom_col(fill = maroon) +
  labs(title = "Math Scores Before Cleaning",
       y = "Math Score")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_col()`).


Step 1: Inspecting the Data

glimpse(df)
## Rows: 5
## Columns: 4
## $ student      <chr> "A", "B", "C", "D", "E"
## $ math         <dbl> 90, 75, NA, 60, 88
## $ english      <dbl> 85, 70, 95, NA, 82
## $ extra_column <chr> "x", "x", "x", "x", "x"
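
glimpse() shows types and the first few values, but it does not count missing entries. As a small sketch of a complementary check (the names na_counts and distinct_counts are illustrative, not part of the original tutorial), the missing and distinct values in each column can be tallied with summarise() and across():

```r
library(dplyr)
library(tibble)

# Same toy data as above, rebuilt so this snippet runs on its own
df <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82),
  extra_column = c("x", "x", "x", "x", "x")
)

# Number of NA values in each column
na_counts <- df %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Number of distinct values in each column (NA counts as a value);
# a column with a single distinct value carries no information
distinct_counts <- df %>%
  summarise(across(everything(), n_distinct))

na_counts
distinct_counts
```

Here math and english each have one missing value, and extra_column has only one distinct value, which is one way to justify dropping it in the next step.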

Step 2: Selecting Relevant Columns

clean_df <- df %>%
  select(student, math, english)

clean_df
## # A tibble: 5 × 3
##   student  math english
##   <chr>   <dbl>   <dbl>
## 1 A          90      85
## 2 B          75      70
## 3 C          NA      95
## 4 D          60      NA
## 5 E          88      82
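
When only one column needs to go, it can be shorter to name the column to drop rather than the columns to keep. A minimal sketch, assuming the same toy data (the object names kept_by_name and dropped_by_name are hypothetical):

```r
library(dplyr)
library(tibble)

df <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82),
  extra_column = c("x", "x", "x", "x", "x")
)

# Keep columns by listing them, as in the step above
kept_by_name <- df %>%
  select(student, math, english)

# Drop a column by negating it; every other column is kept
dropped_by_name <- df %>%
  select(-extra_column)

# Both approaches give the same result for this dataset
identical(kept_by_name, dropped_by_name)
```

Listing the columns to keep is usually safer when new, unexpected columns may appear in the raw data; negation is convenient when the set of unwanted columns is small and known.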

Step 3: Filtering Rows

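# Drop only students with no scores at all (both math and english missing).
# No row in the toy data meets that condition, so all five rows are kept.
clean_df <- clean_df %>%
  filter(!(is.na(math) & is.na(english)))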

clean_df
## # A tibble: 5 × 3
##   student  math english
##   <chr>   <dbl>   <dbl>
## 1 A          90      85
## 2 B          75      70
## 3 C          NA      95
## 4 D          60      NA
## 5 E          88      82
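
The output is unchanged because the filter removes a row only when both scores are missing, and no student in the toy data is missing both. The sketch below checks that condition directly and, for contrast, shows a stricter alternative using drop_na() from tidyr, which removes a row when either score is missing. The stricter version is an illustration only; the tutorial keeps all five rows and imputes the gaps later.

```r
library(dplyr)
library(tidyr)
library(tibble)

scores <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82)
)

# The tutorial's condition: TRUE only where both scores are missing
both_missing <- is.na(scores$math) & is.na(scores$english)
sum(both_missing)  # 0, so the filter keeps every row

# Same filter as in the step above
kept <- scores %>%
  filter(!(is.na(math) & is.na(english)))

# Stricter alternative: keep only rows with no missing score
complete_rows <- scores %>%
  drop_na(math, english)

complete_rows$student  # "A" "B" "E"
```

Dropping incomplete rows would discard two of the five students, which is why replacing the missing values is the more attractive option for a dataset this small.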

Visualization 2: Scatterplot Before Cleaning

ggplot(clean_df, aes(x = math, y = english, label = student)) +
  geom_point(size = 3, color = gold) +
  geom_text(nudge_y = 3, color = maroon) +
  labs(title = "Math vs English (Before Cleaning)",
       x = "Math Score", y = "English Score")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text()`).


Step 4: Creating a New Variable

# NA propagates through arithmetic, so avg_score is NA when either score is missing
clean_df <- clean_df %>%
  mutate(avg_score = (math + english) / 2)

clean_df
## # A tibble: 5 × 4
##   student  math english avg_score
##   <chr>   <dbl>   <dbl>     <dbl>
## 1 A          90      85      87.5
## 2 B          75      70      72.5
## 3 C          NA      95      NA  
## 4 D          60      NA      NA  
## 5 E          88      82      85
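
Students C and D get an NA average because any arithmetic involving NA returns NA. One possible alternative, sketched here under the assumption that an average over the available scores is acceptable, is base R's rowMeans() with na.rm = TRUE. The column names avg_strict and avg_available are illustrative and do not appear in the tutorial.

```r
library(dplyr)
library(tibble)

scores <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82)
)

with_avg <- scores %>%
  mutate(
    # Same formula as the step above: NA if either score is missing
    avg_strict = (math + english) / 2,
    # Row-wise mean over whichever scores are present
    avg_available = rowMeans(cbind(math, english), na.rm = TRUE)
  )

with_avg
```

With this approach, a student with one score simply receives that score as the average (95 for C, 60 for D). Whether that is appropriate depends on the analysis; the tutorial instead fills the gaps with column means in the next step.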

Step 5: Handling Missing Values

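# Column means computed from the observed (non-missing) scores
mean_math <- mean(clean_df$math, na.rm = TRUE)
mean_eng  <- mean(clean_df$english, na.rm = TRUE)

clean_df <- clean_df %>%
  mutate(
    # Replace each missing score with the mean of its column
    math = ifelse(is.na(math), mean_math, math),
    english = ifelse(is.na(english), mean_eng, english),
    # Recompute the average now that no score is missing
    avg_score = (math + english) / 2
  )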

clean_df
## # A tibble: 5 × 4
##   student  math english avg_score
##   <chr>   <dbl>   <dbl>     <dbl>
## 1 A        90        85      87.5
## 2 B        75        70      72.5
## 3 C        78.2      95      86.6
## 4 D        60        83      71.5
## 5 E        88        82      85
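
The same mean imputation can be written more compactly when several columns need identical treatment. The following is a sketch of one alternative, not the tutorial's method: it combines across() with coalesce() from dplyr, which returns the first non-missing value at each position. The object names scores and imputed are illustrative.

```r
library(dplyr)
library(tibble)

scores <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82)
)

imputed <- scores %>%
  mutate(
    # For each listed column, fill NA with that column's observed mean
    across(c(math, english), ~ coalesce(.x, mean(.x, na.rm = TRUE))),
    # Average computed after imputation, so it is never NA
    avg_score = (math + english) / 2
  )

imputed
```

The imputed math score for student C is 78.25, which the tibble printout rounds to 78.2, and the imputed english score for student D is 83. Mean imputation keeps every row but shrinks the spread of each column, so it is best treated as a simple default rather than a general solution.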

Visualization 3: Math Scores After Cleaning

ggplot(clean_df, aes(x = student, y = math)) +
  geom_col(fill = gold) +
  labs(title = "Math Scores After Cleaning",
       y = "Math Score")


Visualization 4: Average Scores After Cleaning

ggplot(clean_df, aes(x = student, y = avg_score)) +
  geom_col(fill = maroon) +
  labs(title = "Average Scores After Cleaning",
       y = "Average Score")


Visualization 5: Before vs After Cleaning Comparison

compare_df <- df %>%
  select(student, math, english) %>%
  mutate(type = "Before Cleaning") %>%
  bind_rows(
    clean_df %>% mutate(type = "After Cleaning")
  ) %>%
  pivot_longer(cols = c(math, english), names_to = "subject", values_to = "score")

ggplot(compare_df, aes(x = student, y = score, fill = type)) +
  geom_col(position = "dodge") +
  facet_wrap(~subject) +
  scale_fill_manual(values = c("Before Cleaning" = maroon,
                               "After Cleaning" = gold)) +
  labs(title = "Before vs After Cleaning: Subject Scores",
       y = "Score", fill = "Dataset Version")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_col()`).
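
Because type is a character column, ggplot2 orders its values alphabetically, so "After Cleaning" is drawn and listed before "Before Cleaning". If the chronological order is preferred, one option is to convert type to a factor with explicit levels before plotting. The sketch below rebuilds the comparison data on its own (the objects before and after are stand-ins for df and clean_df) and only prepares the data; the result can be passed to the same ggplot() call.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Stand-in for the raw data (without the unused column)
before <- tibble(
  student = c("A", "B", "C", "D", "E"),
  math = c(90, 75, NA, 60, 88),
  english = c(85, 70, 95, NA, 82)
)

# Stand-in for the cleaned data: missing scores filled with column means
after <- before %>%
  mutate(across(c(math, english), ~ coalesce(.x, mean(.x, na.rm = TRUE))))

compare_df <- bind_rows(
  before %>% mutate(type = "Before Cleaning"),
  after %>% mutate(type = "After Cleaning")
) %>%
  # Explicit level order controls bar and legend order in ggplot2
  mutate(type = factor(type, levels = c("Before Cleaning", "After Cleaning"))) %>%
  pivot_longer(cols = c(math, english), names_to = "subject", values_to = "score")

levels(compare_df$type)
```

The two NA scores in the "Before Cleaning" rows are still present, so the plot would keep issuing the same warning about removed rows; that warning reflects the raw data, not a problem with the cleaned data.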


Final Clean Dataset

clean_df
## # A tibble: 5 × 4
##   student  math english avg_score
##   <chr>   <dbl>   <dbl>     <dbl>
## 1 A        90        85      87.5
## 2 B        75        70      72.5
## 3 C        78.2      95      86.6
## 4 D        60        83      71.5
## 5 E        88        82      85

Conclusion

In this tutorial, we demonstrated:

  • How to inspect and prepare raw data
  • How to remove unnecessary columns
  • How to handle missing values by replacing them with column means
  • How to create new variables
  • How to visualize both raw and cleaned data
  • How to compare datasets before and after preprocessing

This code-through demonstrated a complete, beginner-friendly workflow for cleaning and preparing data using the dplyr package in R. We started with a raw dataset that contained missing values and an irrelevant column. Step by step, we applied essential data wrangling techniques: selecting variables, filtering rows, creating new variables, and replacing missing values with column means. The key transformations were visualized so that the impact of cleaning decisions could be clearly observed.

The visualizations in this tutorial reinforce why data cleaning is such a crucial step: the plots of the raw data silently drop the missing scores, as the ggplot2 warnings show, while the plots of the cleaned data show every student. Using maroon and gold styling, we also demonstrated how plots can be made more readable and visually appealing.

Overall, this tutorial highlights an important lesson: clean data leads to better insights. With just a few dplyr functions and clear visualizations, anyone can transform messy data into a polished dataset ready for analysis or modeling. These techniques form the foundation of most real-world data science workflows, and mastering them will greatly improve the quality and reliability of future projects.