I will walk through a small data exploration project using the built-in mpg dataset from ggplot2.
I will create a simple derived variable for average MPG, summarize MPG by car class with dplyr, and make basic plots with ggplot2.
First, load the tidyverse:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
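The mpg dataset is included with ggplot2, so it is available as soon as the tidyverse is loaded. glimpse() gives a compact overview of its columns:
mpg %>%
  glimpse()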
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
The key variables we will use are class, the type of car (compact, suv, etc.); cty, city miles per gallon; and hwy, highway miles per gallon.
To leave the original data untouched, we copy it into cars and look at the first five rows:
cars <- mpg
head(cars, 5)
## # A tibble: 5 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
Next, we combine city and highway mileage into a single, simple measure: the unweighted mean of cty and hwy.
cars <- cars %>%
  mutate(
    # unweighted mean of city and highway MPG
    avg_mpg = (cty + hwy) / 2
  )
head(cars, 5)
## # A tibble: 5 × 12
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## # ℹ 1 more variable: avg_mpg <dbl>
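Now each row has an avg_mpg value. The column is too wide to print here, so the tibble lists it as one more variable; for the first car it is (18 + 29) / 2 = 23.5.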
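Because avg_mpg is hidden in the printed tibble, it can help to look at just the columns involved. The following is a minimal sketch of such a check; the names cars_check and first_avg are placeholders introduced here and are not part of the original walkthrough.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Rebuild the derived column on a fresh copy of the data
cars_check <- mpg %>%
  mutate(avg_mpg = (cty + hwy) / 2)

# Show only the inputs and the result so avg_mpg is visible
cars_check %>%
  select(cty, hwy, avg_mpg) %>%
  head(3)

# The first car has cty = 18 and hwy = 29, so this should be 23.5
first_avg <- cars_check$avg_mpg[1]
first_avg
```

Selecting a few columns before calling head() is a simple way to keep a wide tibble readable in the console.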
Next, we calculate average MPG for each car class.
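mpg_by_class <- cars %>%
  group_by(class) %>%
  summarise(
    mean_avg_mpg = mean(avg_mpg, na.rm = TRUE),  # mean of the derived measure
    mean_hwy_mpg = mean(hwy, na.rm = TRUE),      # mean highway MPG
    n = n()                                      # number of cars in the class
  ) %>%
  arrange(desc(mean_avg_mpg))  # most efficient class first
mpg_by_class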
## # A tibble: 7 × 4
##   class      mean_avg_mpg mean_hwy_mpg     n
##   <chr>             <dbl>        <dbl> <int>
## 1 subcompact         24.3         28.1    35
## 2 compact            24.2         28.3    47
## 3 midsize            23.0         27.3    41
## 4 2seater            20.1         24.8     5
## 5 minivan            19.1         22.4    11
## 6 suv                15.8         18.1    62
## 7 pickup             14.9         16.9    33
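The same summary can also be written without a separate group_by() step. The sketch below assumes dplyr 1.1.0 or later, where summarise() accepts a .by argument; mpg_by_class_alt is a placeholder name used only for this illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

mpg_by_class_alt <- mpg %>%
  mutate(avg_mpg = (cty + hwy) / 2) %>%
  summarise(
    mean_avg_mpg = mean(avg_mpg, na.rm = TRUE),
    mean_hwy_mpg = mean(hwy, na.rm = TRUE),
    n = n(),
    .by = class  # per-operation grouping; the result is not a grouped tibble
  ) %>%
  arrange(desc(mean_avg_mpg))

mpg_by_class_alt
```

A quick sanity check on either version is that the n column sums to 234, the number of rows in mpg.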
We can turn the summary table into a bar chart:
# reorder() sorts the classes by mean_avg_mpg; coord_flip() makes the bars horizontal
ggplot(mpg_by_class, aes(x = reorder(class, mean_avg_mpg), y = mean_avg_mpg)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Fuel Efficiency by Car Class",
    x = "Car Class",
    y = "Mean Average MPG"
  )
Each bar is a car class. The bars are ordered by mean average MPG, with the most efficient class at the top.
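An alternative to coord_flip() is to map the class to the y axis directly. This is only a sketch, and it assumes ggplot2 3.3.0 or later, where geom_col() can detect a horizontal orientation from the aesthetics. The names class_summary and p_horizontal are placeholders introduced for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Recompute the per-class summary so the example runs on its own
class_summary <- mpg %>%
  mutate(avg_mpg = (cty + hwy) / 2) %>%
  group_by(class) %>%
  summarise(mean_avg_mpg = mean(avg_mpg, na.rm = TRUE))

# Putting the discrete variable on y gives horizontal bars without coord_flip()
p_horizontal <- ggplot(
  class_summary,
  aes(x = mean_avg_mpg, y = reorder(class, mean_avg_mpg))
) +
  geom_col() +
  labs(
    title = "Average Fuel Efficiency by Car Class",
    x = "Mean Average MPG",
    y = "Car Class"
  )

p_horizontal
```

With this form the axis labels in labs() match the axes as drawn, which can be easier to read than labels that are swapped by coord_flip().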
In addition, boxplots show the spread of highway MPG within each class:
# one box per class, showing the median, quartiles, and outliers of hwy
ggplot(cars, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(
    title = "Highway MPG Distribution by Car Class",
    x = "Car Class",
    y = "Highway MPG"
  )
This lets us see how much highway MPG varies within each class, not just its average.
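To put numbers on what the boxplots show, we could also tabulate the median and spread of highway MPG per class. The following is a minimal sketch; hwy_spread and its column names are placeholders chosen for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

hwy_spread <- mpg %>%
  group_by(class) %>%
  summarise(
    median_hwy = median(hwy),  # the line inside each box
    iqr_hwy = IQR(hwy),        # the height of each box
    min_hwy = min(hwy),
    max_hwy = max(hwy)
  ) %>%
  arrange(desc(median_hwy))

hwy_spread
```

A larger iqr_hwy corresponds to a taller box, so this table and the plot can be read side by side.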