Introduction
This code-through tutorial demonstrates how to clean, transform, and summarize data in R using the dplyr package. The goal is to provide a clear, beginner-friendly walkthrough of the core functions used in data wrangling. Using the built-in mtcars dataset, we will walk through filtering rows, selecting variables, creating new columns, sorting data, computing grouped summaries, and visualizing the results.
These steps represent a typical workflow that analysts follow when preparing datasets for deeper statistical analysis or modeling. By the end of the tutorial, a new R user should feel confident applying these techniques to their own data.
Using the built-in mtcars dataset
In this tutorial, we:
Load and inspect the raw mtcars data
Filter rows and select relevant variables
Create a new power-to-weight metric using mutate()
Sort the data using arrange()
Group the data by cylinder count and summarize average MPG
Build a bar chart to visualize differences across engine types
These functions cover the core dplyr workflow for data wrangling in R, and the same techniques carry over to datasets in your own projects.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
data <- mtcars
head(data)
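Beyond head(), a few more calls give a fuller picture of the raw data before any wrangling. The following is a minimal sketch added for illustration rather than part of the original walkthrough; it assumes only base R plus glimpse(), which dplyr re-exports.

```r
library(dplyr)

data <- mtcars

# Dimensions: number of rows (cars) and columns (variables)
dim(data)

# Column names, types, and the first few values of each variable
glimpse(data)

# Minimum, quartiles, mean, and maximum for every numeric column
summary(data)

# Number of missing values in each column
colSums(is.na(data))
```

Checking dimensions and missing values first makes it easier to notice when a later step drops more rows than expected.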
Filtering Rows
In this step, we demonstrate how to keep only the observations that meet a specific condition. Here we filter the dataset to focus on cars that achieve more than 20 miles per gallon.
# Keep only the cars that get more than 20 miles per gallon
filtered <- data %>%
  filter(mpg > 20)
head(filtered)
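filter() also accepts more than one condition. The sketch below is an illustrative extension of the step above; the particular thresholds (30 mpg, 200 hp) are arbitrary choices made for the example, not values from the tutorial.

```r
library(dplyr)

data <- mtcars

# Comma-separated conditions are combined with AND
efficient_four_cyl <- data %>%
  filter(mpg > 20, cyl == 4)

# Use | for OR: keep cars that are either very efficient or very powerful
efficient_or_powerful <- data %>%
  filter(mpg > 30 | hp > 200)

nrow(efficient_four_cyl)
nrow(efficient_or_powerful)
```

Comparing nrow() before and after a filter is a quick sanity check that the condition selects what you intended.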
Selecting Columns
This section shows how to extract only the variables we need for a specific analysis. Selecting columns helps simplify the dataset and makes it easier to focus on relevant features.
# Keep only the three columns needed for the analysis
selected <- data %>%
  select(mpg, cyl, hp)
head(selected)
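Listing columns by name is only one way to use select(). As a hedged sketch of some common alternatives, the example below uses a column range, a negative selection, and the starts_with() helper; the object names are made up for illustration.

```r
library(dplyr)

data <- mtcars

# A range of adjacent columns, from mpg through hp
first_four <- data %>%
  select(mpg:hp)

# Every column except carb
without_carb <- data %>%
  select(-carb)

# Columns whose names begin with "c"
c_columns <- data %>%
  select(starts_with("c"))

names(first_four)
names(c_columns)
```

Helpers such as starts_with() are most useful on wide datasets where typing every column name would be tedious.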
Creating New Variables (mutate)
We can create new variables by transforming existing ones.
Here we calculate a power_to_weight ratio by dividing horsepower (hp)
by vehicle weight (wt). Because mtcars records wt in units of 1,000 lbs,
the result is horsepower per 1,000 lbs, a simple performance metric.
# Add a new column computed from two existing ones
mutated <- data %>%
  mutate(power_to_weight = hp / wt)
head(mutated)
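A single mutate() call can create several columns, and later columns may refer to ones defined earlier in the same call. The sketch below illustrates this; the column names and the 0.05 cutoff are hypothetical choices for the example, not part of the original tutorial.

```r
library(dplyr)

data <- mtcars

mutated_more <- data %>%
  mutate(
    weight_lb = wt * 1000,         # wt is stored in thousands of pounds
    hp_per_lb = hp / weight_lb,    # uses the column created on the line above
    high_power = hp_per_lb > 0.05  # hypothetical cutoff, for illustration only
  )

head(mutated_more)
```

Converting wt to pounds first keeps the units of the derived ratio explicit.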
Sorting Data (arrange)
Sorting allows us to reorder the dataset based on a chosen variable. Below, we arrange the cars from highest to lowest fuel efficiency (mpg) to quickly see which models perform best.
# desc() reverses arrange()'s default ascending order
arranged <- data %>%
  arrange(desc(mpg))
head(arranged)
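arrange() can also sort by more than one variable, with later variables breaking ties in earlier ones. As a small illustrative sketch (the object name is invented for this example):

```r
library(dplyr)

data <- mtcars

# Sort by cylinder count (ascending), then by mpg (descending) within each count
arranged_two_keys <- data %>%
  arrange(cyl, desc(mpg))

head(arranged_two_keys)
```

This ordering puts the most fuel-efficient car of each cylinder group at the top of its group.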
Grouping and Summarizing
Grouping lets us compute summary statistics for different categories within the dataset. Here we group by cylinder count and calculate the average fuel economy, average horsepower, and the number of cars in each group.
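summary_table <- data %>%
  group_by(cyl) %>%
  summarize(
    avg_mpg = mean(mpg),  # mean fuel economy per cylinder group
    avg_hp = mean(hp),    # mean horsepower per cylinder group
    count = n()           # number of cars in each group
  )
summary_table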
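The same grouping pattern works with other summary functions. The sketch below is an illustrative variation: it uses count() as a shortcut for tallying groups and adds a few spread statistics. The object and column names are assumptions made for this example.

```r
library(dplyr)

data <- mtcars

# count() is shorthand for group_by() followed by summarize(n = n())
cyl_counts <- data %>%
  count(cyl)

# Several statistics per group; .groups = "drop" returns an ungrouped result
spread_table <- data %>%
  group_by(cyl) %>%
  summarize(
    median_mpg = median(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg),
    .groups = "drop"
  )

cyl_counts
spread_table
```

Reporting a spread measure next to the mean helps show whether the groups overlap or are clearly separated.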
Plotting Results
To visually compare the summary statistics, we create a bar chart that displays the average MPG for cars with 4, 6, and 8 cylinders. This plot makes the differences across engine types easier to interpret.
# factor(cyl) treats cylinder count as a category rather than a continuous number
ggplot(summary_table, aes(x = factor(cyl), y = avg_mpg)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(
    title = "Average MPG by Cylinder Type",
    x = "Number of Cylinders",
    y = "Average MPG"
  )
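When the bar heights are already computed, ggplot2's geom_col() draws the same chart as geom_bar(stat = "identity"). The sketch below is an optional variation, not the tutorial's own plot: it rebuilds the summary so the block runs on its own and adds value labels with geom_text(). The label rounding and vjust offset are styling assumptions.

```r
library(dplyr)
library(ggplot2)

# Rebuild the grouped summary so this block is self-contained
summary_table <- mtcars %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), .groups = "drop")

p <- ggplot(summary_table, aes(x = factor(cyl), y = avg_mpg)) +
  geom_col(fill = "steelblue") +                             # bar height taken directly from avg_mpg
  geom_text(aes(label = round(avg_mpg, 1)), vjust = -0.5) +  # print each value above its bar
  labs(
    title = "Average MPG by Cylinder Type",
    x = "Number of Cylinders",
    y = "Average MPG"
  )

p
```

Storing the plot in an object such as p makes it easy to print or save later, for example with ggsave().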
Conclusion
This code-through demonstrated several essential data manipulation
techniques using the dplyr package in R. We explored how to
filter and select relevant information, create new variables, sort
observations, and compute group-level summaries. Together, these steps
form the backbone of a typical data-cleaning workflow used in real-world
analytical tasks.
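Because each dplyr verb takes a data frame and returns one, the individual steps can also be chained into a single pipeline. The following is a sketch of one possible combination, not a prescribed workflow; the choice to filter before summarizing and the name pipeline_result are assumptions made for illustration.

```r
library(dplyr)

pipeline_result <- mtcars %>%
  filter(mpg > 20) %>%                       # keep fuel-efficient cars
  select(mpg, cyl, hp, wt) %>%               # keep only the columns used below
  mutate(power_to_weight = hp / wt) %>%      # horsepower per 1,000 lbs
  group_by(cyl) %>%
  summarize(
    avg_mpg = mean(mpg),
    avg_power_to_weight = mean(power_to_weight),
    count = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_mpg))                     # most efficient group first

pipeline_result
```

Chaining avoids a trail of intermediate objects, while the separate steps shown earlier are easier to inspect one at a time.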
The final visualization highlighted how fuel efficiency varies across engine types, illustrating how summarized data can reveal meaningful insights. These foundational skills can be applied to any dataset and provide a solid starting point for more advanced analysis in R.