We’ll use one tidyverse package for the data work, dplyr, on a
built-in dataset called mpg (car fuel economy). A second package,
ggplot2, supplies the dataset and the optional plot.
The goal is to show a simple, readable data
workflow:
peek → create a new column → group → summarize → sort → show top 5 → (plot)
# install.packages(c("dplyr","ggplot2")) # run once if needed
library(dplyr) # tidyverse data manipulation
library(ggplot2) # provides the built-in mpg dataset (and plotting)
mpg comes with ggplot2 and has one row per car variant (the same
model can appear several times with different years, engines, or
transmissions), with city (cty) and highway (hwy) MPG, plus
the car manufacturer, model, class, etc.
# Built-in dataset; no download needed
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
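The repeated a4 values in the model column suggest that a single model can occupy several rows. A quick way to check, sketched here with dplyr's count() and n_distinct(), is to compare the number of distinct models with the number of rows. The names rows_per_model and n_models are illustrative only and do not appear elsewhere in this walkthrough.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# How many rows does each model contribute?
rows_per_model <- mpg %>%
  count(model, sort = TRUE)  # one row per model, most frequent first

# Number of distinct model names in the data
n_models <- n_distinct(mpg$model)

rows_per_model
n_models < nrow(mpg)  # TRUE when at least one model appears more than once
```

If the comparison is TRUE, per-manufacturer averages are averages over variants, so a brand with many trims of one model weights that model more heavily.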
Why this dataset? No downloads, no cleaning — perfect for a tiny teaching demo.
Question: Which car manufacturers have the highest average fuel efficiency in this dataset?
mpg_summary <- mpg %>%
  mutate(
    # A simple efficiency score: average of city & highway MPG
    efficiency = (cty + hwy) / 2
  ) %>%
  group_by(manufacturer) %>%
  summarize(
    avg_eff = mean(efficiency, na.rm = TRUE), # average efficiency for each brand
    n = n()                                   # number of observations per manufacturer
  ) %>%
  arrange(desc(avg_eff)) %>% # sort from most to least efficient
  slice_head(n = 5)          # show just the top 5 brands
mpg_summary
## # A tibble: 5 × 3
## manufacturer avg_eff n
## <chr> <dbl> <int>
## 1 honda 28.5 9
## 2 volkswagen 25.1 27
## 3 hyundai 22.8 14
## 4 subaru 22.4 14
## 5 audi 22.0 18
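To see what each stage of the pipeline above contributes, it can help to run it one step at a time and look at how the row count changes. The sketch below does that; the intermediate names step1 through step4 are made up for illustration, and na.rm is omitted on the assumption that cty and hwy contain no missing values.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Step 1: mutate() adds one column and keeps every row
step1 <- mpg %>%
  mutate(efficiency = (cty + hwy) / 2)
step1$efficiency[1]  # first row: (18 + 29) / 2 = 23.5

# Step 2: group_by() + summarize() collapse to one row per manufacturer
step2 <- step1 %>%
  group_by(manufacturer) %>%
  summarize(avg_eff = mean(efficiency), n = n())

# Step 3: arrange() reorders rows without dropping any
step3 <- step2 %>%
  arrange(desc(avg_eff))

# Step 4: slice_head() keeps the first five rows of the sorted table
step4 <- step3 %>%
  slice_head(n = 5)

# Row counts after each step
c(nrow(step1), nrow(step2), nrow(step3), nrow(step4))
```

Only summarize() and slice_head() reduce the number of rows; mutate() and arrange() leave it unchanged.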
What each verb does (quickly):
- mutate() → makes a new column (efficiency) from existing ones
- group_by() + summarize() → calculates per-group statistics
- arrange() → sorts rows
- slice_head(n = 5) → keeps the top 5 rows

ggplot(mpg_summary, aes(x = reorder(manufacturer, avg_eff), y = avg_eff)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 5 Manufacturers by Average Efficiency (mpg dataset)",
    x = "Manufacturer",
    y = "Average Efficiency (simple avg of city & highway MPG)"
  ) +
  theme_minimal(base_size = 12)
How to read it: Longer bars, which reorder() also places higher up, mean higher average fuel efficiency among the models that brand has in this dataset.
We looked at the mpg data and built a tiny analysis with
dplyr.
First, we peeked at the columns; then we
created a simple efficiency score by averaging city and
highway MPG. Next, we grouped by manufacturer and
summarized the average score per brand, then
sorted and kept the top 5. The table
and bar chart show which brands’ models are most fuel-efficient
in this dataset (not a universal ranking, just what’s
in mpg). This demonstrates the core dplyr flow you’ll use
everywhere: mutate → group_by → summarize →
arrange.
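The last two steps of that flow, sort and keep the top rows, can also be written as a single call to dplyr's slice_max(). The sketch below is an alternative spelling, not the version used in this walkthrough; the object names top5_arrange and top5_slice_max are illustrative. One difference worth knowing: slice_max() keeps tied rows by default, so with_ties = FALSE is set here to cap the result at five rows.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Shared per-manufacturer summary
by_brand <- mpg %>%
  mutate(efficiency = (cty + hwy) / 2) %>%
  group_by(manufacturer) %>%
  summarize(avg_eff = mean(efficiency), n = n())

# Version from the walkthrough: sort, then keep the first five rows
top5_arrange <- by_brand %>%
  arrange(desc(avg_eff)) %>%
  slice_head(n = 5)

# Alternative: pick the five largest avg_eff values directly
top5_slice_max <- by_brand %>%
  slice_max(avg_eff, n = 5, with_ties = FALSE)

# Both approaches should select the same manufacturers
setequal(top5_arrange$manufacturer, top5_slice_max$manufacturer)
```

The arrange() plus slice_head() form makes the sort explicit, which suits a teaching demo; slice_max() is shorter once the idea is familiar.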
- class (e.g., compact vs. suv)
- filter() to focus on recent model years (year)

mpg_summary <- mpg %>%
  filter(year >= 2008) %>% # keep only the 2008 model year (mpg contains 1999 and 2008)
  mutate(
    # Create a simple efficiency score: average of city & highway MPG
    efficiency = (cty + hwy) / 2
  ) %>%
  group_by(manufacturer) %>%
  summarize(
    avg_eff = mean(efficiency, na.rm = TRUE), # average efficiency for each brand
    n = n()                                   # number of observations per manufacturer
  ) %>%
  arrange(desc(avg_eff)) %>% # sort from most to least efficient
  slice_head(n = 5)          # keep only the top 5 brands
mpg_summary
## # A tibble: 5 × 3
## manufacturer avg_eff n
## <chr> <dbl> <int>
## 1 honda 28.9 4
## 2 volkswagen 24.5 11
## 3 hyundai 22.9 8
## 4 toyota 22.6 14
## 5 subaru 22.6 8
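The class idea listed above can be sketched the same way: swap manufacturer for class in group_by() and leave the rest of the pipeline alone. This is one possible version, not output from the original analysis; the name class_summary is illustrative, and slice_head() is dropped on the assumption that the handful of vehicle classes is small enough to show in full.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

class_summary <- mpg %>%
  mutate(
    # Same efficiency score as before: average of city & highway MPG
    efficiency = (cty + hwy) / 2
  ) %>%
  group_by(class) %>%                  # group by vehicle class instead of brand
  summarize(
    avg_eff = mean(efficiency, na.rm = TRUE),
    n = n()                            # number of rows in each class
  ) %>%
  arrange(desc(avg_eff))               # most efficient class first

class_summary
```

Keeping the n column in view matters here too, since a class with only a few rows gives a less stable average than one with dozens.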