What this vignette teaches (in plain English)

We’ll use one tidyverse package, dplyr, on a built-in dataset called mpg (car fuel economy).
The goal is to show a simple, readable data workflow:

peek → create a new column → group → summarize → sort → show top 5 → (plot)


Packages

# install.packages(c("dplyr","ggplot2"))  # run once if needed
library(dplyr)     # tidyverse data manipulation
library(ggplot2)   # provides the built-in mpg dataset (and plotting)

What dataset is this?

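mpg comes with ggplot2 and has one row per car configuration (the same model can appear several times, for different years, engines, and transmissions). Each row has city (cty) and highway (hwy) MPG, plus the car's manufacturer, model, class, etc.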

# Built-in dataset; no download needed
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Why this dataset? No downloads, no cleaning — perfect for a tiny teaching demo.


The dplyr pipeline (one screen, well-commented)

Question: Which car manufacturers have the highest average fuel efficiency in this dataset?

Create a simple efficiency score and find the top 5 manufacturers

mpg_summary <- mpg %>%
  mutate(
    # A simple efficiency score: average of city & highway MPG
    efficiency = (cty + hwy) / 2
  ) %>%
  group_by(manufacturer) %>%
  summarize(
    avg_eff = mean(efficiency, na.rm = TRUE),      # average efficiency for each brand
    n = n()                                        # number of observations per manufacturer
  ) %>%
  arrange(desc(avg_eff)) %>%                       # sort from most to least efficient
  slice_head(n = 5)                                # show just the top 5 brands

mpg_summary
## # A tibble: 5 × 3
##   manufacturer avg_eff     n
##   <chr>          <dbl> <int>
## 1 honda           28.5     9
## 2 volkswagen      25.1    27
## 3 hyundai         22.8    14
## 4 subaru          22.4    14
## 5 audi            22.0    18

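What each verb does (quickly):

  • mutate() adds a new column (efficiency) without changing the number of rows.
  • group_by() marks the rows of each manufacturer as a group; the data itself is unchanged.
  • summarize() collapses each group to one row, here with the mean score (avg_eff) and the row count (n).
  • arrange(desc(avg_eff)) sorts the rows from highest to lowest average.
  • slice_head(n = 5) keeps the first five rows, which after sorting are the top 5.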
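
One way to see what each verb contributes is to run the pipeline one step at a time and inspect the intermediate results. The sketch below does that; the names step1 to step4 are illustrative and not part of the original pipeline.

```r
library(dplyr)
library(ggplot2)   # loaded only for the mpg dataset

# Step 1: mutate() adds a column; the number of rows is unchanged
step1 <- mpg %>% mutate(efficiency = (cty + hwy) / 2)
dim(step1)         # 234 rows, 12 columns (the original 11 plus efficiency)

# Step 2: group_by() changes no values; it only records the grouping
step2 <- step1 %>% group_by(manufacturer)
n_groups(step2)    # one group per manufacturer

# Step 3: summarize() collapses each group to a single row
step3 <- step2 %>% summarize(avg_eff = mean(efficiency), n = n())
nrow(step3)        # same as the number of groups in step 2

# Step 4: arrange() sorts, slice_head() keeps the first five rows
step4 <- step3 %>% arrange(desc(avg_eff)) %>% slice_head(n = 5)
step4
```

Printing each intermediate object is a handy habit when a longer pipeline returns something unexpected.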


A quick visual (bar chart)

ggplot(mpg_summary, aes(x = reorder(manufacturer, avg_eff), y = avg_eff)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 5 Manufacturers by Average Efficiency (mpg dataset)",
    x = "Manufacturer",
    y = "Average Efficiency (simple avg of city & highway MPG)"
  ) +
  theme_minimal(base_size = 12)

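How to read it: coord_flip() makes the bars horizontal, so a longer bar means a higher average efficiency. reorder() sorts the brands by avg_eff, which puts the most efficient brand at the top. The averages cover only the models each brand has in this dataset.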
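
To see why the bars come out in that order, it can help to look at what reorder() does on its own. This is a small sketch using base R only; the top5 data frame simply retypes the values from the summary table above, and the name ord is illustrative.

```r
# The five rows from the summary table, retyped by hand for illustration
top5 <- data.frame(
  manufacturer = c("honda", "volkswagen", "hyundai", "subaru", "audi"),
  avg_eff      = c(28.5, 25.1, 22.8, 22.4, 22.0)
)

# reorder() returns a factor whose levels are sorted by avg_eff, ascending
ord <- reorder(top5$manufacturer, top5$avg_eff)
levels(ord)
# "audi" "subaru" "hyundai" "volkswagen" "honda"
```

ggplot2 draws the first level nearest the origin, so after coord_flip() the last level (honda) ends up at the top of the chart.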


One-paragraph takeaway

We looked at the mpg data and built a tiny analysis with dplyr.
First, we peeked at the columns; then we created a simple efficiency score by averaging city and highway MPG. Next, we grouped by manufacturer and summarized the average score per brand, then sorted and kept the top 5. The table and bar chart show which brands’ models are most fuel-efficient in this dataset (not a universal ranking, just what’s in mpg). This demonstrates the core dplyr flow you’ll use everywhere: mutate → group_by → summarize → arrange.
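
As a quick illustration that the flow carries over, the same four verbs can be pointed at a different grouping column. This is a sketch rather than part of the analysis above; class_summary is an illustrative name, and class is the vehicle-class column shown in the glimpse() output.

```r
library(dplyr)
library(ggplot2)   # loaded only for the mpg dataset

class_summary <- mpg %>%
  mutate(efficiency = (cty + hwy) / 2) %>%           # same score as before
  group_by(class) %>%                                # only the grouping column changes
  summarize(avg_eff = mean(efficiency), n = n()) %>% # one row per vehicle class
  arrange(desc(avg_eff))                             # most efficient class first

class_summary
```

Only the group_by() argument changed; the rest of the pipeline is identical.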


(Optional) Next steps I could try

  • Compare by class (e.g., compact vs. suv)
  • Use filter() to focus on recent model years (year):

mpg_summary <- mpg %>%
  filter(year >= 2008) %>%                         # keep only recent model years
  mutate(
    # Create a simple efficiency score: average of city & highway MPG
    efficiency = (cty + hwy) / 2
  ) %>%
  group_by(manufacturer) %>%
  summarize(
    avg_eff = mean(efficiency, na.rm = TRUE),      # average efficiency for each brand
    n = n()                                        # number of observations per manufacturer
  ) %>%
  arrange(desc(avg_eff)) %>%                       # sort from most to least efficient
  slice_head(n = 5)                                # keep only the top 5 brands

mpg_summary
## # A tibble: 5 × 3
##   manufacturer avg_eff     n
##   <chr>          <dbl> <int>
## 1 honda           28.9     4
## 2 volkswagen      24.5    11
## 3 hyundai         22.9     8
## 4 toyota          22.6    14
## 5 subaru          22.6     8

  • Add a second plot with error bars (variation per brand)
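
For the error-bar idea, one possible sketch is to add a standard deviation to the summary and draw it with geom_errorbar(). The choice of ± 1 standard deviation as the measure of variation is an assumption made here, as are the names brand_spread, sd_eff, and p.

```r
library(dplyr)
library(ggplot2)

brand_spread <- mpg %>%
  mutate(efficiency = (cty + hwy) / 2) %>%
  group_by(manufacturer) %>%
  summarize(
    avg_eff = mean(efficiency),   # average score per brand
    sd_eff  = sd(efficiency),     # spread of the score within each brand
    n       = n()
  ) %>%
  arrange(desc(avg_eff)) %>%
  slice_head(n = 5)

# Bars show the mean; error bars show mean ± 1 standard deviation (an assumed choice)
p <- ggplot(brand_spread, aes(x = reorder(manufacturer, avg_eff), y = avg_eff)) +
  geom_col() +
  geom_errorbar(aes(ymin = avg_eff - sd_eff, ymax = avg_eff + sd_eff), width = 0.2) +
  coord_flip() +
  labs(x = "Manufacturer", y = "Average efficiency (mean ± 1 SD)")

p
```

With only a handful of rows for some brands, the error bars are a rough guide to spread rather than a formal comparison.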