Code-Through Tutorial: Cleaning and Transforming Data with dplyr

Introduction

This code-through tutorial demonstrates how to clean, transform, and summarize data in R using the dplyr package. The goal is to provide a clear, beginner-friendly walkthrough of the core functions used in data wrangling. Using the built-in mtcars dataset, we will walk through filtering rows, selecting variables, creating new columns, sorting data, computing grouped summaries, and visualizing the results with ggplot2.

These steps represent a typical workflow that analysts follow when preparing datasets for deeper statistical analysis or modeling. By the end of the tutorial, a new R user should feel confident applying these techniques to their own data.

Using the Built-in mtcars Dataset

In this tutorial, we:

Load and inspect the raw mtcars data
Filter rows and select relevant variables
Create a new efficiency metric using mutate()
Sort the cars by fuel efficiency using arrange()
Group the data by cylinder count and summarize average MPG
Build a bar chart to visualize differences across engine types

These functions cover the core of a data-wrangling workflow in R, and the same techniques carry over to datasets in your own projects.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

data <- mtcars
head(data)
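
head() shows only the first six rows. As a minimal sketch of a fuller first look, the snippet below uses dplyr's glimpse() to list every column with its type, and tibble's rownames_to_column() to move the car names out of the row names and into a regular column. The object name cars and the column name model are illustrative choices, not part of the tutorial's own code.

```r
library(dplyr)

# glimpse() prints one line per column: its name, type, and first few values
glimpse(mtcars)

# mtcars stores the car names as row names rather than as a column.
# rownames_to_column() (from the tibble package, installed alongside dplyr)
# copies them into an ordinary column so they behave like any other variable.
# "cars" and "model" are hypothetical names chosen for this sketch.
cars <- tibble::rownames_to_column(mtcars, var = "model")

dim(cars)
head(cars$model, 3)
```

Keeping the model names in a column is optional here, but it makes them available to verbs such as select() and arrange() later on.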

Filtering Rows

filter() keeps only the rows that satisfy a condition. Here we keep the cars that achieve more than 20 miles per gallon.

filtered <- data %>%
  filter(mpg > 20)

head(filtered)
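
filter() also accepts more than one condition. The following sketch, which assumes the same mtcars data, shows two common patterns: comma-separated conditions (combined with AND) and the | operator (OR). The thresholds and the object names efficient_four and extremes are illustrative only.

```r
library(dplyr)

# Comma-separated conditions are combined with AND:
# cars over 20 mpg that also have four cylinders
efficient_four <- mtcars %>%
  filter(mpg > 20, cyl == 4)

# Use | for OR: cars that are either very economical or very powerful
extremes <- mtcars %>%
  filter(mpg > 30 | hp > 200)

nrow(efficient_four)
nrow(extremes)
```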

Selecting Columns

This section shows how to extract only the variables we need for a specific analysis. Selecting columns helps simplify the dataset and makes it easier to focus on relevant features.

selected <- data %>%
  select(mpg, cyl, hp)

head(selected)
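
Beyond naming columns one by one, select() supports a few shortcuts. The sketch below shows three of them on mtcars: a range of adjacent columns, dropping a column with a minus sign, and renaming while selecting. The object names and the new column names cylinders and horsepower are hypothetical.

```r
library(dplyr)

# A range of adjacent columns, from mpg through hp
range_cols <- mtcars %>% select(mpg:hp)

# Drop a column with a minus sign and keep everything else
no_carb <- mtcars %>% select(-carb)

# Rename while selecting, using new_name = old_name
renamed <- mtcars %>% select(mpg, cylinders = cyl, horsepower = hp)

names(range_cols)
names(renamed)
```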

Creating New Variables (mutate)

We can create new variables by transforming existing ones. Here we calculate power_to_weight as horsepower divided by weight. Because mtcars records wt in units of 1,000 lb, the result is horsepower per 1,000 lb, a simple performance metric.

mutated <- data %>%
  mutate(power_to_weight = hp / wt)

head(mutated)
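
A single mutate() call can create several columns, and later columns can use ones defined earlier in the same call. The sketch below illustrates this under the assumption that a weight in pounds and a simple efficiency label are useful. The names wt_lb, hp_per_lb, and efficient are made up for this example.

```r
library(dplyr)

with_metrics <- mtcars %>%
  mutate(
    wt_lb = wt * 1000,       # wt is recorded in units of 1,000 lb
    hp_per_lb = hp / wt_lb,  # a column created above can be reused immediately
    # if_else() assigns one of two labels depending on the condition
    efficient = if_else(mpg > 20, "over 20 mpg", "20 mpg or less")
  )

head(with_metrics[, c("mpg", "wt_lb", "hp_per_lb", "efficient")])
```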

Sorting Data (arrange)

Sorting reorders the rows by a chosen variable. Below, we arrange the cars from highest to lowest fuel efficiency (mpg), so head() shows the most fuel-efficient models first.

arranged <- data %>%
  arrange(desc(mpg))

head(arranged)
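
arrange() can sort by more than one variable, with later variables breaking ties in earlier ones. As a small sketch, the code below sorts by cylinder count in ascending order and then by mpg in descending order within each cylinder count. The object name by_cyl_then_mpg is illustrative.

```r
library(dplyr)

# Sort by cylinder count first (ascending), then by mpg within each
# cylinder count (descending)
by_cyl_then_mpg <- mtcars %>%
  arrange(cyl, desc(mpg))

head(by_cyl_then_mpg[, c("cyl", "mpg")])
```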

Grouping and Summarizing

Grouping lets us compute summary statistics for different categories within the dataset. Here we group by cylinder count and calculate the average fuel economy, average horsepower, and the number of cars in each group.

summary_table <- data %>%
  group_by(cyl) %>%
  summarize(
    avg_mpg = mean(mpg),
    avg_hp  = mean(hp),
    count = n()
  )

summary_table
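
The same pattern extends to more than one grouping variable. The sketch below first uses count() as a shortcut for counting rows per group, then groups by both cylinder count and transmission type. Grouping by am and the object names cars_per_cyl and by_cyl_am are assumptions added for illustration. The .groups argument requires dplyr 1.0.0 or later.

```r
library(dplyr)

# count() is shorthand for group_by() followed by summarize(n = n())
cars_per_cyl <- mtcars %>% count(cyl)

# Group by two variables: cylinder count and transmission type
# (am is 0 for automatic, 1 for manual)
by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(
    avg_mpg = mean(mpg),
    count = n(),
    .groups = "drop"  # return an ungrouped result
  )

cars_per_cyl
by_cyl_am
```

Dropping the grouping at the end avoids surprises if the result is passed to further verbs that would otherwise operate group by group.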

Plotting Results

To visually compare the summary statistics, we create a bar chart that displays the average MPG for cars with 4, 6, and 8 cylinders. This plot makes the differences across engine types easier to interpret.

ggplot(summary_table, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Average MPG by Cylinder Type",
    x = "Number of Cylinders",
    y = "Average MPG"
  )
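
Printing the value on each bar can make the chart easier to read. The sketch below is one possible variation, not the tutorial's original plot: it uses geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity"), and adds geom_text() labels. The summary is rebuilt inside the block so that it runs on its own, and the object name p is arbitrary.

```r
library(dplyr)
library(ggplot2)

# Rebuild the summary so this sketch is self-contained
summary_table <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))

p <- ggplot(summary_table, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col(fill = "steelblue") +
  # Print each bar's value, rounded to one decimal, just above the bar
  geom_text(aes(label = round(avg_mpg, 1)), vjust = -0.5) +
  labs(
    title = "Average MPG by Cylinder Type",
    x = "Number of Cylinders",
    y = "Average MPG"
  )

p
```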

Conclusion

This code-through demonstrated several essential data manipulation techniques using the dplyr package in R. We explored how to filter and select relevant information, create new variables, sort observations, and compute group-level summaries. Together, these steps form the backbone of a typical data-cleaning workflow used in real-world analytical tasks.

The final visualization highlighted how fuel efficiency varies across engine types, illustrating how summarized data can reveal meaningful insights. These foundational skills can be applied to any dataset and provide a solid starting point for more advanced analysis in R.
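
The individual steps can also be chained into a single pipeline. The following is a minimal sketch of how that might look on mtcars. The particular combination of steps and the column name avg_power_to_weight are illustrative choices rather than part of the walkthrough above.

```r
library(dplyr)

result <- mtcars %>%
  filter(mpg > 20) %>%                   # keep cars over 20 mpg
  select(mpg, cyl, hp, wt) %>%           # keep only the variables used below
  mutate(power_to_weight = hp / wt) %>%  # add the derived metric
  group_by(cyl) %>%                      # one summary row per cylinder count
  summarize(
    avg_mpg = mean(mpg),
    avg_power_to_weight = mean(power_to_weight),
    count = n()
  ) %>%
  arrange(desc(avg_mpg))                 # most fuel-efficient group first

result
```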

SHARAN KUMAR VARMA CHEKURI