Assignment:
In this assignment I will have fun learning about ggplot2 through playing with data.
Using the vehicles dataset from the fueleconomy library, I will present three ggplot2 plots that attempt to answer questions about the data that I think are interesting.
For each final graphic, I’ll also include at least one or two alternative plots in my PDF.
Precursors
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(skimr)
library(fueleconomy)
library(stringr)
library(ggplot2)
library(ggalt)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
Getting Familiar with the vehicles data set
?vehicles
summary(vehicles)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.772
## 3rd Qu.: 6.000
## Max. :16.000
## NA's :58
## displ fuel hwy cty
## Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00
## 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00
## Median :3.000 Mode :character Median : 23.00 Median : 17.00
## Mean :3.353 Mean : 23.55 Mean : 17.49
## 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00
## Max. :8.400 Max. :109.00 Max. :138.00
## NA's :57
#Fuel economy data from the EPA; the summary above shows model years 1984 to 2015
dim(vehicles)
## [1] 33442 12
#33442 rows, 12 columns
head(vehicles)
str(vehicles)
## tibble [33,442 × 12] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:33442] 13309 13310 13311 14038 14039 ...
## $ make : chr [1:33442] "Acura" "Acura" "Acura" "Acura" ...
## $ model: chr [1:33442] "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
## $ year : num [1:33442] 1997 1997 1997 1998 1998 ...
## $ class: chr [1:33442] "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
## $ trans: chr [1:33442] "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
## $ drive: chr [1:33442] "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
## $ cyl : num [1:33442] 4 4 6 4 4 6 4 4 6 5 ...
## $ displ: num [1:33442] 2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
## $ fuel : chr [1:33442] "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : num [1:33442] 26 28 26 27 29 26 27 29 26 23 ...
## $ cty : num [1:33442] 20 22 18 19 21 17 20 21 17 18 ...
#I'll want to make $class, $trans, $drive, $fuel columns factor datatype
skim(vehicles)
| Name | vehicles |
| Number of rows | 33442 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| make | 0 | 1 | 3 | 34 | 0 | 128 | 0 |
| model | 0 | 1 | 1 | 39 | 0 | 3198 | 0 |
| class | 0 | 1 | 4 | 34 | 0 | 34 | 0 |
| trans | 8 | 1 | 8 | 32 | 0 | 47 | 0 |
| drive | 0 | 1 | 13 | 26 | 0 | 7 | 0 |
| fuel | 0 | 1 | 3 | 27 | 0 | 13 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 17038.30 | 10087.01 | 1 | 8361.25 | 16723.5 | 25264.75 | 34932.0 | ▇▇▇▇▇ |
| year | 0 | 1 | 1999.11 | 9.38 | 1984 | 1991.00 | 1999.0 | 2008.00 | 2015.0 | ▇▆▅▆▇ |
| cyl | 58 | 1 | 5.77 | 1.74 | 2 | 4.00 | 6.0 | 6.00 | 16.0 | ▇▇▅▁▁ |
| displ | 57 | 1 | 3.35 | 1.36 | 0 | 2.30 | 3.0 | 4.30 | 8.4 | ▁▇▅▂▁ |
| hwy | 0 | 1 | 23.55 | 6.21 | 9 | 19.00 | 23.0 | 27.00 | 109.0 | ▇▁▁▁▁ |
| cty | 0 | 1 | 17.49 | 5.58 | 6 | 15.00 | 17.0 | 20.00 | 138.0 | ▇▁▁▁▁ |
#58 cars are missing $cyl data, 57 cars are missing $displ observations
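The wrangling code further down works on a data frame called `v` that has a `combined_mpg` column, but the step that creates it is not shown in this document. Below is a minimal sketch of how it might be built. The averaging formula for `combined_mpg` is an assumption, not the original definition, and the later output shows abbreviated drive labels such as FWD and RWD, so the original `v` evidently involved more recoding than this.

```r
library(dplyr)
library(fueleconomy)

# Assumption: combined_mpg is the plain average of highway and city mpg.
# The definition actually used for the plots below is not shown in the document.
v <- vehicles %>%
  mutate(
    across(c(class, trans, drive, fuel), factor),  # categorical columns as factors
    combined_mpg = (hwy + cty) / 2
  )

str(v$class)
summary(v$combined_mpg)
```

With the character columns stored as factors, `summary()` on a column such as `v$class` returns counts per level instead of just the column length.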
Brainstorm: Questions to ask, aspects of this dataset to graph…
- What is the combined mpg of these cars? I see only city and hwy.
- Is there a car maker that overall makes the most fuel-efficient cars?
- SUVs subset: What percentage of this dataset is SUVs? I imagine there is a significant increase in SUV numbers between 1985 and 2005 (RIP minivan). What is the most popular type of drivetrain and engine for an SUV? What is its mpg?
- Sports cars subset: How popular are sports cars? What type of engine do they have, and what is their fuel economy compared to the rest of the vehicle data set?
- Engines: Is the biggest engine in a sports car or an SUV / truck?
- When do electric cars enter the market?
Data Wrangling: Question 1: What is the range of fuel economy per car company?
#wrangling
sc = v %>% filter(year == 2005)
sc2s = sc %>%
group_by(make) %>%
select(make, model, year, hwy, cty, combined_mpg) %>%
mutate(low_mpg = min(combined_mpg), high_mpg= max(combined_mpg))
sc2s
#plot 1 - points
ggplot(sc2s, aes(x = combined_mpg, y = make)) +
geom_point()
#plot 2 - dumbbell
ggplot(sc2s, aes(x = low_mpg, xend = high_mpg, y = make)) +
geom_dumbbell()
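Because `mutate()` keeps one row per vehicle, every make's dumbbell in the plot above is drawn once for each of its models, on top of itself. One possible alternative is to collapse the data to a single row per make with `summarise()`. The sketch below uses a small made-up data frame (`toy` and `mpg_range` are illustrative names, not part of the assignment) with the same column names.

```r
library(dplyr)

# Toy data, made up for illustration; column names follow the ones used above
toy <- tibble(
  make = c("Honda", "Honda", "Ford", "Ford", "Ford"),
  combined_mpg = c(30, 24, 18, 22, 15)
)

# One row per make, holding the lowest and highest combined mpg
mpg_range <- toy %>%
  group_by(make) %>%
  summarise(
    low_mpg  = min(combined_mpg),
    high_mpg = max(combined_mpg),
    .groups  = "drop"
  )

mpg_range
```

A summary table shaped like `mpg_range` could be passed to the same `geom_dumbbell()` call in place of `sc2s`.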
#can I compare this dumbbell graph to the same car makers twenty years earlier?
ranges = v %>%
filter(year %in% c(2005, 1985)) %>% #%in% keeps all rows from both years; == would recycle the vector
arrange(make) %>%
group_by(make, year) %>% #group by year too, so each facet shows that year's own range
select(make, model, year, hwy, cty, combined_mpg) %>%
mutate(low_mpg = min(combined_mpg), high_mpg = max(combined_mpg))
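Filtering on two years is a place where `==` and `%in%` behave differently: `year == c(2005, 1985)` recycles the two values down the column and compares them row by row, while `year %in% c(2005, 1985)` checks each row against both values. A small sketch on made-up data (`toy`, `by_equals` and `by_in` are illustrative names):

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- tibble(
  make = c("A", "A", "B", "B"),
  year = c(2005, 1985, 1985, 2005)
)

# `==` recycles c(2005, 1985): row 1 vs 2005, row 2 vs 1985, row 3 vs 2005, row 4 vs 1985
by_equals <- toy %>% filter(year == c(2005, 1985))

# `%in%` tests every row against the whole set of years
by_in <- toy %>% filter(year %in% c(2005, 1985))

nrow(by_equals)  # 2: make B is dropped entirely
nrow(by_in)      # 4: every row from either year is kept
```

With `==`, which rows survive depends on row order, and R gives no warning when the column length is a multiple of the vector length.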
?fct_infreq()
#plot 3 : faceted dumbbell
ggplot(ranges, aes(x = low_mpg, xend = high_mpg, y = fct_infreq(make))) +
geom_dumbbell(aes(color = make)) +
theme(legend.position = "none") +
facet_grid(.~year)
# plot 4
ggplot(ranges, aes(x = low_mpg, xend = high_mpg, y = fct_infreq(make))) +
geom_dumbbell(aes(color = make)) +
theme(legend.position = "none") +
facet_grid(.~year) +
theme(axis.text.y = element_text(size = 5.5)) +
xlab("Miles Per Gallon") +
ylab("Auto Maker") +
ggtitle("Range of Fuel Economy Per Auto Maker")
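Note that `fct_infreq()` orders the makes by how many rows each one has, not by fuel economy. If an ordering by mileage is preferred, `forcats::fct_reorder()` is one option. The sketch below compares the two orderings on made-up values (`make` and `high_mpg` here are toy vectors, not the assignment data).

```r
library(forcats)

# Toy vectors, made up for illustration
make     <- factor(c("Ford", "Ford", "Ford", "Honda", "Kia", "Kia"))
high_mpg <- c(22, 22, 22, 30, 35, 35)

# Ordered by number of rows per make (most frequent first)
by_count <- levels(fct_infreq(make))

# Ordered by each make's highest mpg (lowest first)
by_mpg <- levels(fct_reorder(make, high_mpg, .fun = max))

by_count
by_mpg
```

In the plots above, swapping `fct_infreq(make)` for `fct_reorder(make, high_mpg, .fun = max)` would sort the y axis by each maker's best mileage instead.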
ABANDONED PROJECT ON CUMULATIVE GAS CONSUMPTION, VISUALIZING WITH AN AREA PLOT
# #plot 2
# #ggplot(sc2s, aes(x = year, y = gal_per100, fill = make)) +
# ?geom_area()
#
# #plot 3
# #new = sc2s %>%
# filter(make %in% c("Chevrolet", "Ford", "Dodge", "Pontiac")) %>%
# group_by(make, year) %>%
# summarize(gal_per100 = sum(gal_per100))
#
# #new
#
# #ggplot(new, aes(x = year, y = gal_per100, fill = make, color = make)) +
# geom_area(aes(fill = make), position = 'stack')
#
# #this doesn't tell me much interesting. I need to switch to a different group. =
#
# #civic = v %>%
# filter(make == "Honda" & grepl('Civic', model)) %>%
# group_by(year) %>%
# mutate(gpy = 12000/combined_mpg) %>%
# summarize(year, make, model, hwy, combined_mpg, gpy = sum(gpy))
*BONUS QUESTION* Which type of drivetrain is most fuel efficient in SUVs?
summary(v$class)
## Compact Cars Large Cars
## 4739 1533
## Midsize Cars Midsize Station Wagons
## 3621 415
## Midsize-Large Station Wagons Minicompact Cars
## 627 1080
## Minivan - 2WD Minivan - 4WD
## 308 44
## Small Pickup Trucks Small Pickup Trucks 2WD
## 538 392
## Small Pickup Trucks 4WD Small Sport Utility Vehicle 2WD
## 181 169
## Small Sport Utility Vehicle 4WD Small Station Wagons
## 213 1295
## Special Purpose Vehicle Special Purpose Vehicle 2WD
## 1 553
## Special Purpose Vehicle 4WD Special Purpose Vehicles
## 289 1453
## Special Purpose Vehicles/2wd Special Purpose Vehicles/4wd
## 2 2
## Sport Utility Vehicle - 2WD Sport Utility Vehicle - 4WD
## 1626 2091
## Standard Pickup Trucks Standard Pickup Trucks 2WD
## 2354 1106
## Standard Pickup Trucks 4WD Standard Pickup Trucks/2wd
## 910 4
## Standard Sport Utility Vehicle 2WD Standard Sport Utility Vehicle 4WD
## 76 171
## Subcompact Cars Two Seaters
## 4185 1602
## Vans Vans Passenger
## 1141 2
## Vans, Cargo Type Vans, Passenger Type
## 434 285
#filtering v for only SUVs
suv = v %>%
filter(grepl("Sport", class))
suv
#plot1, distribution by make
ggplot(suv, aes(x = fct_infreq(make))) +
coord_flip() +
geom_bar(aes(fill = make)) +
theme(axis.text.x = element_blank()) +
xlab("Vehicle make")
#plot2, not that cool. (geom_point and geom_jitter each draw every observation, so the points appear twice)
ggplot(suv, aes(x = displ, y = combined_mpg, color = drive)) +
geom_point(alpha = .01) +
geom_jitter() +
ylim(0,40)
## Warning: Removed 12 rows containing missing values (geom_point).
## Removed 12 rows containing missing values (geom_point).
suv_bydrive = suv %>%
group_by(drive) %>%
summarize(make, model, trans, hwy, cty, combined_mpg, avg_mpg = mean(combined_mpg))
## `summarise()` has grouped output by 'drive'. You can override using the
## `.groups` argument.
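The `summarize()` call above lists per-vehicle columns next to a group mean, so it returns one row per vehicle rather than one row per drive type, which is why the message about grouped output appears. Two more conventional alternatives are sketched below on made-up data (`toy`, `with_avg` and `per_drive` are illustrative names): `mutate()` to keep every vehicle, or `summarise()` with only summary columns to get one row per group.

```r
library(dplyr)

# Toy data, made up for illustration; column names follow the SUV data above
toy <- tibble(
  drive = c("FWD", "FWD", "RWD", "RWD"),
  combined_mpg = c(24, 20, 18, 16)
)

# Keep every vehicle and attach its group's mean
with_avg <- toy %>%
  group_by(drive) %>%
  mutate(avg_mpg = mean(combined_mpg)) %>%
  ungroup()

# Or collapse to one row per drive type
per_drive <- toy %>%
  group_by(drive) %>%
  summarise(avg_mpg = mean(combined_mpg), n = n(), .groups = "drop")

with_avg
per_drive
```

The `n` column in `per_drive` also shows how many vehicles sit behind each average, which helps when one group is very small.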
# dot plot for drive vs. average mpg, 2WD is far and away more fuel efficient.
#but doesn't tell us much more than that
ggplot(suv_bydrive, aes(x = drive, y = avg_mpg)) +
geom_point()
# maybe a box plot would be a better fit.
ggplot(suv_bydrive, aes(x = drive, y = combined_mpg)) +
geom_jitter(alpha = .2) +
geom_boxplot()
summary(suv$drive)
## 2WD 4WD 4WD/AWD AWD FWD
## 15 404 1563 468 871
## Part-time 4WD RWD
## 29 996
#Only 15 SUVs are labeled 2WD, but the boxplot gives that group the same visual weight as the much larger ones. I'll compare only the larger drive groups; the filter below keeps FWD, RWD, 4WD and 4WD/AWD (AWD is not included)
suv_24A = suv %>%
filter(drive %in% c("FWD", "RWD", "4WD", "4WD/AWD"))
ggplot(suv_24A, aes(x = drive, y = combined_mpg)) +
geom_jitter(alpha = .4, color = "grey") +
geom_boxplot(aes(color = drive)) +
ylim(0,40) +
theme(legend.position = "none") +
geom_hline(yintercept = mean(suv_24A$combined_mpg), color="red") +
ggtitle("Fuel Efficiency of SUVs, by drivetrain") +
ylab("Combined MPG")
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
## Warning: Removed 6 rows containing missing values (geom_point).
# From plot1, the ten makes with the most SUVs are: Chevrolet, Jeep, GMC, Toyota, Ford, Nissan, Suzuki, Mercedes-Benz, Dodge, Hyundai
toyota_suvs = suv %>%
filter(make == "Toyota")
Question 2: How does the mpg of the Toyota fleet differ from decade to decade (1990, 2000, 2010)?
v %>%
arrange(desc(year))
toyota = v %>%
filter(make == "Toyota" & (year == 1990 | year == 2000 | year == 2010)) %>%
group_by(model) %>%
summarize(year, class, trans, drive, cyl, displ, fuel, hwy, cty, combined_mpg,
avg_mpg = mean(combined_mpg))
## `summarise()` has grouped output by 'model'. You can override using the
## `.groups` argument.
toyota
#plot 1 - too much information
t = ggplot(data = toyota, aes(x = class, y = combined_mpg, color = year))
t + geom_point()
#plot 2 - wut
t +
facet_wrap(year~.) +
geom_polygon()
#boxplot?
t +
facet_wrap(year~.) +
geom_boxplot()
#Still busy...
ggplot(data = toyota, aes(x = class, y = combined_mpg, color = class)) +
coord_flip() +
geom_point(alpha = .7, aes(size = displ)) +
facet_grid(year~.,scales = "free") +
theme_classic() +
theme(legend.position = "none",
panel.grid.major = element_line()) +
xlab("Vehicle Class") +
ylab("MPG (city + highway)") +
ylim(10,35)
## Warning: Removed 2 rows containing missing values (geom_point).
#cleaner, more readable
ggplot(data = toyota, aes(x = class, y = combined_mpg, color = drive)) +
coord_flip() +
geom_point(alpha = .7, aes(size = displ)) +
facet_grid(year~.,scales = "free") +
theme_classic(7) +
theme(legend.position = "bottom",
panel.grid.major = element_line()) +
xlab("Vehicle Class") +
ylab("MPG (city + highway)") +
ylim(10,35) +
ggtitle("2 Decades of Toyota Vehicle Fuel Economy")
## Warning: Removed 2 rows containing missing values (geom_point).
Question 3: How do engine size and fuel economy of the BMW 3 Series change over time?
#wrangling 3 series data
bmw = v %>% filter(make == "BMW" )
TS = bmw %>% filter(grepl('32|3 Series|33', model))
#plot1:
ggplot(TS, aes(year, combined_mpg)) +
geom_point()
TS$trans = factor(TS$trans %>% substr(1,1))
str(TS$trans) #simplified trans variable for automatic / manual shape
## Factor w/ 2 levels "A","M": 1 2 1 2 1 2 1 2 1 2 ...
TS$displ = factor(TS$displ) #casting engine displacement as a factor to add more distinct color
#plot 2
ggplot(TS, aes(year, combined_mpg, color = displ, shape = trans)) +
geom_point(alpha = .5, size = 3)
#plot 3
ggplot(TS, aes(year, combined_mpg, shape = trans)) +
geom_point(alpha = .5, size = 3, aes(color = displ)) +
geom_smooth(se = FALSE, method = "lm", aes(color = trans))
## `geom_smooth()` using formula 'y ~ x'
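The `geom_smooth(method = "lm")` layer draws one straight-line fit per transmission type. To read those trends as numbers (change in mpg per model year), the same kind of model can be fit directly with `lm()`. The sketch below uses made-up values (`toy` and `slopes` are illustrative names), not the BMW data.

```r
# Toy data, made up for illustration; column names follow the TS data frame
toy <- data.frame(
  year = c(2000, 2005, 2010, 2000, 2005, 2010),
  combined_mpg = c(20, 21, 22, 22, 22.5, 23),
  trans = c("A", "A", "A", "M", "M", "M")
)

# One linear model per transmission type, mirroring the per-group fits in the plot
slopes <- sapply(
  split(toy, toy$trans),
  function(d) coef(lm(combined_mpg ~ year, data = d))[["year"]]
)

slopes  # change in mpg per model year, by transmission
```

Running the same `split()` and `lm()` steps on `TS` would give the slopes behind the two smoothed lines.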
#make it pretty
ggplot(TS, aes(year, combined_mpg, shape = trans)) +
geom_point(alpha = .5, size = 3, aes(color = displ)) +
geom_smooth(se = FALSE, method = "lm", aes(color = trans)) +
theme_bw() +
labs(title = "Fuel Economy & Engine Size of BMW 3 Series",
y = "Combined MPG (CITY + HWY)",
x = "") +
theme(legend.title=element_text())
## `geom_smooth()` using formula 'y ~ x'