HANKS_DATAVIZ

Assignment:

In this assignment I will have fun learning about ggplot2 through playing with data.

Using the vehicles dataset from the fueleconomy library, I will present three ggplot2 plots that attempt to answer questions about the data that I think are interesting.

For each final graphic in my PDF, I’ll also make at least one or two alternative plots.

Precursors

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(skimr)
library(fueleconomy)
library(stringr)
library(ggplot2)
library(ggalt)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

Getting Familiar with the vehicles data set

?vehicles
summary(vehicles)

##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

#Fuel economy data from the EPA; the summary above shows model years 1984 to 2015
dim(vehicles)

## [1] 33442    12

#33442 rows, 12 columns 
head(vehicles)

str(vehicles)

## tibble [33,442 × 12] (S3: tbl_df/tbl/data.frame)
##  $ id   : num [1:33442] 13309 13310 13311 14038 14039 ...
##  $ make : chr [1:33442] "Acura" "Acura" "Acura" "Acura" ...
##  $ model: chr [1:33442] "2.2CL/3.0CL" "2.2CL/3.0CL" "2.2CL/3.0CL" "2.3CL/3.0CL" ...
##  $ year : num [1:33442] 1997 1997 1997 1998 1998 ...
##  $ class: chr [1:33442] "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" "Subcompact Cars" ...
##  $ trans: chr [1:33442] "Automatic 4-spd" "Manual 5-spd" "Automatic 4-spd" "Automatic 4-spd" ...
##  $ drive: chr [1:33442] "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" "Front-Wheel Drive" ...
##  $ cyl  : num [1:33442] 4 4 6 4 4 6 4 4 6 5 ...
##  $ displ: num [1:33442] 2.2 2.2 3 2.3 2.3 3 2.3 2.3 3 2.5 ...
##  $ fuel : chr [1:33442] "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : num [1:33442] 26 28 26 27 29 26 27 29 26 23 ...
##  $ cty  : num [1:33442] 20 22 18 19 21 17 20 21 17 18 ...

#I'll want to make $class, $trans, $drive, $fuel columns factor datatype
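
The factor conversion mentioned in the comment above could look roughly like the sketch below. It runs on a small made-up data frame (`toy`, a hypothetical stand-in that only borrows the column names of `vehicles`), and `across()` assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for `vehicles`: made-up rows, same column names
toy <- data.frame(
  class = c("Compact Cars", "Vans", "Compact Cars"),
  trans = c("Manual 5-spd", "Automatic 4-spd", "Automatic 4-spd"),
  drive = c("Front-Wheel Drive", "Rear-Wheel Drive", "Front-Wheel Drive"),
  fuel  = c("Regular", "Regular", "Premium"),
  hwy   = c(30, 18, 28),
  stringsAsFactors = FALSE
)

# Convert the four categorical columns to factors in one step;
# numeric columns such as hwy are left untouched
toy_fct <- toy %>%
  mutate(across(c(class, trans, drive, fuel), as.factor))

str(toy_fct)
summary(toy_fct$class)  # on a factor, summary() prints counts per level
```

Once a column is a factor, `summary()` reports counts per level instead of just the length and class of a character vector.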

skim(vehicles)

Data summary
Name	vehicles
Number of rows	33442
Number of columns	12
_______________________
Column type frequency:
character	6
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
make	0	1	3	34	128
model	0	1	1	39	3198
class	0	1	4	34	34
trans	8	1	8	32	47
drive	0	1	13	26	7
fuel	0	1	3	27	13

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1	17038.30	10087.01	1	8361.25	16723.5	25264.75	34932.0	▇▇▇▇▇
year	0	1	1999.11	9.38	1984	1991.00	1999.0	2008.00	2015.0	▇▆▅▆▇
cyl	58	1	5.77	1.74	2	4.00	6.0	6.00	16.0	▇▇▅▁▁
displ	57	1	3.35	1.36	0	2.30	3.0	4.30	8.4	▁▇▅▂▁
hwy	0	1	23.55	6.21	9	19.00	23.0	27.00	109.0	▇▁▁▁▁
cty	0	1	17.49	5.58	6	15.00	17.0	20.00	138.0	▇▁▁▁▁

#58 cars are missing $cyl data, 57 cars are missing $displ observations
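
The missing-value counts read off the skim output can also be computed directly. This is a minimal base R sketch on a made-up data frame (`toy` is hypothetical, not the real data):

```r
# Hypothetical stand-in with a few missing values
toy <- data.frame(
  cyl   = c(4, NA, 6, 8),
  displ = c(2.2, NA, 3.0, NA),
  hwy   = c(28, 105, 26, 20)
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Rows that are missing cyl or displ (or both), for a closer look
incomplete <- toy[!complete.cases(toy[, c("cyl", "displ")]), ]
incomplete
```

Looking at the incomplete rows themselves, rather than only the counts, can help decide whether to drop them or keep them.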

Brainstorm: Questions to ask, aspects of this dataset to graph…
- What is the combined mpg of these cars? I see only city and hwy.
- Is there a car maker that overall makes the most fuel-efficient cars?
- SUVs subset: What percentage of this dataset is SUVs? I imagine there is a significant increase in SUV numbers between 1985 and 2005 (RIP minivan). What is the most popular type of drivetrain and engine for an SUV? What is its mpg?
- Sports cars subset: How popular are sports cars? What type of engine do they have, and what is their fuel economy compared to the rest of the vehicle data set?
- Engines: Is the biggest engine in a sports car or an SUV / truck?
- When do electric cars enter the market?

Data Wrangling:
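
The code that follows works on a data frame `v` with a `combined_mpg` column, and the step that creates them is not shown in this document. The sketch below illustrates two ways such a column could be derived from `cty` and `hwy`: a simple average, and the EPA-style weighted harmonic mean (55% city, 45% highway). Which definition the analysis actually used is an assumption here, and `toy` and `v_sketch` are hypothetical names.

```r
library(dplyr)

# Hypothetical stand-in for `vehicles` with made-up values
toy <- data.frame(
  make = c("Acura", "Acura", "Toyota"),
  hwy  = c(26, 28, 30),
  cty  = c(20, 22, 20)
)

v_sketch <- toy %>%
  mutate(
    # Option 1 (assumption): plain average of city and highway mpg
    combined_mean = (cty + hwy) / 2,
    # Option 2 (assumption): EPA-style weighting, 55% city / 45% highway,
    # averaged on the gallons-per-mile scale
    combined_epa = 1 / (0.55 / cty + 0.45 / hwy)
  )

v_sketch
```

The harmonic version gives somewhat lower values than the plain average whenever city mpg is below highway mpg, so the choice affects the absolute numbers on the plots but rarely the ranking of vehicles.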

Question 1: What is the range of fuel economy for each car maker?

#wrangling
sc = v %>% filter(year == 2005)

#per-make minimum and maximum combined mpg, kept alongside each vehicle row
sc2s = sc %>% 
  group_by(make) %>% 
  select(make, model, year, hwy, cty, combined_mpg) %>% 
  mutate(low_mpg = min(combined_mpg), high_mpg = max(combined_mpg))

sc2s

#plot 1 - points
ggplot(sc2s, aes(x = combined_mpg, y = make)) + 
  geom_point()

#plot 2 - dumbbell

ggplot(sc2s, aes(x = low_mpg, xend = high_mpg, y = make)) + 
  geom_dumbbell()

#can I compare this dumbbell graph to the same car makers, twenty years earlier? 
ranges = v %>%
  filter(year %in% c(2005, 1985)) %>%  #%in% keeps both years; == would recycle the vector row by row
  arrange(make) %>%
  group_by(make, year) %>%             #group by year too, so each facet gets its own range
  select(make, model, year, hwy, cty, combined_mpg) %>% 
  mutate(low_mpg = min(combined_mpg), high_mpg = max(combined_mpg))
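
When filtering on more than one year, `==` against a vector and `%in%` behave differently: `==` recycles the vector along the column and compares position by position, while `%in%` tests membership. The sketch below shows the difference on a made-up data frame (`toy` is hypothetical), and then computes a separate mpg range for each make and year, which is what a plot faceted by year would need.

```r
library(dplyr)

# Hypothetical stand-in for `v` with made-up values
toy <- data.frame(
  make = c("Ford", "Ford", "Honda", "Honda", "Ford", "Honda"),
  year = c(2005, 1985, 1985, 2005, 1985, 1995),
  combined_mpg = c(20, 15, 25, 30, 17, 28)
)

# == compares row 1 with 2005, row 2 with 1985, row 3 with 2005, and so on,
# so rows are kept or dropped depending on their position
by_equals <- toy %>% filter(year == c(2005, 1985))

# %in% keeps every row whose year is one of the two values
by_in <- toy %>% filter(year %in% c(2005, 1985))

nrow(by_equals)
nrow(by_in)

# One low/high pair per make and year
ranges_toy <- by_in %>%
  group_by(make, year) %>%
  summarize(low_mpg = min(combined_mpg),
            high_mpg = max(combined_mpg),
            .groups = "drop")

ranges_toy
```

Because `==` raises no error when the column length is a multiple of the vector length, the lost rows are easy to miss. Comparing `nrow()` before and after the filter is a quick sanity check.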
          
?fct_infreq()

#plot 3 : faceted dumbbell 
ggplot(ranges, aes(x = low_mpg, xend = high_mpg, y = fct_infreq(make))) + 
  geom_dumbbell(aes(color = make)) + 
  theme(legend.position = "none") +
  facet_grid(.~year)

# plot 4 
ggplot(ranges, aes(x = low_mpg, xend = high_mpg, y = fct_infreq(make))) + 
  geom_dumbbell(aes(color = make)) + 
  theme(legend.position = "none") +
  facet_grid(.~year) + 
  theme(axis.text.y = element_text(size = 5.5)) +
  xlab("Miles Per Gallon") + 
  ylab("Auto Maker") +
  ggtitle("Range of Fuel Economy Per Auto Maker")

ABANDONED PROJECT ON CUMULATIVE GAS CONSUMPTION, VISUALIZING WITH AN AREA PLOT

# #plot 2
# #ggplot(sc2s, aes(x = year, y = gal_per100, fill = make)) + 
#   ?geom_area()
# 
# #plot 3 
# #new = sc2s %>%
#   filter(make %in% c("Chevrolet", "Ford", "Dodge", "Pontiac")) %>%
#     group_by(make, year) %>% 
#       summarize(gal_per100 = sum(gal_per100))
# 
# #new
#   
# #ggplot(new, aes(x = year, y = gal_per100, fill = make, color = make)) + 
#   geom_area(aes(fill = make), position = 'stack') 
# 
# #this doesn't tell me much that is interesting. I need to switch to a different group.
# 
# #civic = v %>% 
#         filter(make == "Honda" & grepl('Civic', model)) %>% 
#           group_by(year) %>%
#               mutate(gpy = 12000/combined_mpg) %>% 
#                 summarize(year, make, model, hwy, combined_mpg, gpy = sum(gpy))




*BONUS QUESTION* Which type of drivetrain is most fuel efficient in SUVs? 


summary(v$class)

##                       Compact Cars                         Large Cars 
##                               4739                               1533 
##                       Midsize Cars             Midsize Station Wagons 
##                               3621                                415 
##       Midsize-Large Station Wagons                   Minicompact Cars 
##                                627                               1080 
##                      Minivan - 2WD                      Minivan - 4WD 
##                                308                                 44 
##                Small Pickup Trucks            Small Pickup Trucks 2WD 
##                                538                                392 
##            Small Pickup Trucks 4WD    Small Sport Utility Vehicle 2WD 
##                                181                                169 
##    Small Sport Utility Vehicle 4WD               Small Station Wagons 
##                                213                               1295 
##            Special Purpose Vehicle        Special Purpose Vehicle 2WD 
##                                  1                                553 
##        Special Purpose Vehicle 4WD           Special Purpose Vehicles 
##                                289                               1453 
##       Special Purpose Vehicles/2wd       Special Purpose Vehicles/4wd 
##                                  2                                  2 
##        Sport Utility Vehicle - 2WD        Sport Utility Vehicle - 4WD 
##                               1626                               2091 
##             Standard Pickup Trucks         Standard Pickup Trucks 2WD 
##                               2354                               1106 
##         Standard Pickup Trucks 4WD         Standard Pickup Trucks/2wd 
##                                910                                  4 
## Standard Sport Utility Vehicle 2WD Standard Sport Utility Vehicle 4WD 
##                                 76                                171 
##                    Subcompact Cars                        Two Seaters 
##                               4185                               1602 
##                               Vans                     Vans Passenger 
##                               1141                                  2 
##                   Vans, Cargo Type               Vans, Passenger Type 
##                                434                                285

#filtering v for only SUVs
suv = v %>%
  filter(grepl("Sport", class))

suv

#plot1, distribution by make
ggplot(suv, aes(x = fct_infreq(make))) +
  coord_flip() +
  geom_bar(aes(fill = make)) +
  theme(axis.text.x = element_blank()) +
  xlab("Vehicle make")

#plot2, not that cool.
ggplot(suv, aes(x = displ, y = combined_mpg, color = drive)) +
  geom_point(alpha = .01) +
  geom_jitter() +
  ylim(0,40)

## Warning: Removed 12 rows containing missing values (geom_point).
## Removed 12 rows containing missing values (geom_point).

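#note: summarize() here returns one row per vehicle rather than one per group;
#mutate(avg_mpg = mean(combined_mpg)) is the more idiomatic way to add a per-group mean
suv_bydrive = suv %>%
    group_by(drive) %>%
    summarize(make, model, trans, hwy, cty, combined_mpg, avg_mpg = mean(combined_mpg))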

## `summarise()` has grouped output by 'drive'. You can override using the
## `.groups` argument.

# dot plot for drive vs. average mpg, 2WD is far and away more fuel efficient.
#but doesn't tell us much more than that
ggplot(suv_bydrive, aes(x = drive, y =  avg_mpg)) +
  geom_point()

# maybe a box plot would be a better fit.
ggplot(suv_bydrive, aes(x = drive, y = combined_mpg)) +
  geom_jitter(alpha = .2) +
  geom_boxplot()

summary(suv$drive)

##           2WD           4WD       4WD/AWD           AWD           FWD 
##            15           404          1563           468           871 
## Part-time 4WD           RWD 
##            29           996

#Only 15 SUVs are labeled 2WD, but the boxplot gives that group as much visual weight as the larger ones. I'll compare only the bigger drive types; the filter below keeps FWD, RWD, 4WD, and 4WD/AWD (AWD, with 468 SUVs, is not included)

suv_24A = suv  %>%
      filter(drive %in% c("FWD", "RWD", "4WD", "4WD/AWD"))

ggplot(suv_24A, aes(x = drive, y = combined_mpg)) +
  geom_jitter(alpha = .4, color = "grey") +
  geom_boxplot(aes(color = drive)) +
  ylim(0,40) +
  theme(legend.position = "none") +
  geom_hline(yintercept = mean(suv_24A$combined_mpg), color="red") +
  ggtitle("Fuel Efficiency of SUVs, by drivetrain") +
  ylab("Combined MPG")

## Warning: Removed 6 rows containing non-finite values (stat_boxplot).

## Warning: Removed 6 rows containing missing values (geom_point).

# From plot1, the top 10 SUV makers are: Chevrolet, Jeep, GMC, Toyota, Ford, Nissan, Suzuki, Mercedes-Benz, Dodge, Hyundai

toyota_suvs = suv %>%
    filter(make == "Toyota")

Question 2: How does the mpg of the Toyota fleet differ at 10-year intervals (1990, 2000, 2010)?

v %>% 
  arrange(desc(year))

  toyota = v %>% 
    filter(make == "Toyota" & (year == 1990 | year == 2000 | year == 2010)) %>%
      group_by(model) %>%
        summarize(year, class, trans, drive, cyl, displ, fuel, hwy, cty, combined_mpg, 
                  avg_mpg = mean(combined_mpg))

## `summarise()` has grouped output by 'model'. You can override using the
## `.groups` argument.

toyota

#plot 1 - too much information
t = ggplot(data = toyota, aes(x = class, y = combined_mpg, color = year))
t + geom_point()

#plot 2 - wut 
t +  
  facet_wrap(year~.) +
  geom_polygon()

#boxplot? 
t +
  facet_wrap(year~.) +
  geom_boxplot()

#Still busy...
ggplot(data = toyota, aes(x = class, y = combined_mpg, color = class)) +  
  coord_flip() +
  geom_point(alpha = .7, aes(size = displ)) +
  facet_grid(year~.,scales = "free") +
  theme_classic() +
  theme(legend.position = "none", 
        panel.grid.major = element_line()) + 
  xlab("Vehicle Class") + 
  ylab("MPG (city + highway)") +
  ylim(10,35)

## Warning: Removed 2 rows containing missing values (geom_point).

#cleaner, more readable
ggplot(data = toyota, aes(x = class, y = combined_mpg, color = drive)) +  
  coord_flip() +
  geom_point(alpha = .7, aes(size = displ)) +
  facet_grid(year~.,scales = "free") +
  theme_classic(7) +
  theme(legend.position = "bottom", 
        panel.grid.major = element_line()) + 
  xlab("Vehicle Class") + 
  ylab("MPG (city + highway)") +
  ylim(10,35) +
  ggtitle("2 Decades of Toyota Vehicle Fuel Economy")

## Warning: Removed 2 rows containing missing values (geom_point).

Question 3: How do engine size and fuel economy of the BMW 3 Series change over time?

#wrangling 3 series data 
bmw = v %>% filter(make == "BMW" ) 
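#note: the pattern is unanchored, so "32" or "33" anywhere in a model name will match
#(a name like "533i" would be kept); "^3[0-9]{2}|3 Series" would be stricter
TS = bmw %>% filter(grepl('32|3 Series|33', model))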

#plot1: 
ggplot(TS, aes(year, combined_mpg)) + 
  geom_point()

TS$trans = factor(TS$trans %>% substr(1,1))
str(TS$trans) #simplified trans variable for automatic / manual shape

##  Factor w/ 2 levels "A","M": 1 2 1 2 1 2 1 2 1 2 ...

TS$displ = factor(TS$displ) #casting engine displacement as a factor to add more distinct color 

#plot 2
ggplot(TS, aes(year, combined_mpg, color = displ, shape = trans)) + 
  geom_point(alpha = .5, size = 3)

#plot 3 
ggplot(TS, aes(year, combined_mpg, shape = trans)) + 
  geom_point(alpha = .5, size = 3, aes(color = displ)) +
  geom_smooth(se = FALSE, method = "lm", aes(color = trans))

## `geom_smooth()` using formula 'y ~ x'

#make it pretty 
ggplot(TS, aes(year, combined_mpg, shape = trans)) + 
  geom_point(alpha = .5, size = 3, aes(color = displ)) +
  geom_smooth(se = FALSE, method = "lm", aes(color = trans)) +
  theme_bw() +
  labs(title = "Fuel Economy & Engine Size of BMW 3 Series",
       y = "Combined MPG (CITY + HWY)",
       x = "") +
  theme(legend.title=element_text())

## `geom_smooth()` using formula 'y ~ x'


HANKS_DATAVIZ_HW4

Charles Hanks

2022-09-28