library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Abstract

An analysis of relationships between selected variables in the mpg dataset, which is available in the ggplot2 package.

Introduction

The goal of this exercise is to practice plotting and analyzing a dataset using the ggplot2 package in RStudio. Two plots are developed and explored.

  1. Plot and analyze the engine size (displacement) against highway mileage, and identify outliers.
  2. Plot and analyze the engine size against city mileage by drive type and fuel type.

Description and structure of the mpg dataset

The mpg dataset is a tibble that ships with the ggplot2 package, which is part of the tidyverse. It contains a subset of the fuel economy data that the EPA makes available at https://fueleconomy.gov/. Only car models that had a new release every year between 1999 and 2008 are included; this was used as a proxy for the popularity of the car. The dataset has 234 rows (one per car) and 11 variables.

Variable      Type     Description                     Details
manufacturer  string   manufacturer name               15 manufacturers
model         string   model name                      38 models
displ         numeric  engine displacement, in liters  1.6 - 7.0, median: 3.3
year          integer  year of manufacturing           1999, 2008
cyl           integer  number of cylinders             4, 5, 6, 8
trans         string   type of transmission            automatic (8 sub-types), manual (2 sub-types)
drv           string   drive type                      f = front wheel, r = rear wheel, 4 = 4 wheel
cty           integer  city mileage                    miles per gallon
hwy           integer  highway mileage                 miles per gallon
fl            string   fuel type                       e = ethanol, d = diesel, r = regular, p = premium, c = natural gas
class         string   vehicle class                   7 class types

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
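
The counts quoted in the Details column of the variable table can be checked directly against the data. The following is a minimal sketch, assuming dplyr 1.0.0 or later is installed; the name distinct_counts is introduced here for illustration only.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of distinct values in every character column
# (manufacturer, model, trans, drv, fl, class)
distinct_counts <- mpg %>%
  summarise(across(where(is.character), n_distinct))
distinct_counts

# Rows (cars) and columns (variables)
dim(mpg)
```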

Plot 1 - Engine Displacement by Highway Mileage

In a plot of engine displacement against highway mileage, the expectation is that smaller engines will have greater fuel efficiency. The plot bears this out: larger engines are grouped below 30 mpg, and smaller engines reach higher mpg.

An initial guess for outliers as shown in the graph would be cars with mpg > 40 and the car with displ = 7 liters.

At first glance there appear to be two cars with smaller engines, at about 1.9 liters, that have mpg > 40. Examining the data reveals there are actually three cars with mpg > 40, two of which sit at the same point. Adding transparency to the plot helps emphasize where multiple cars share a point. The three cars are:

  • Volkswagen Jetta, manual, diesel with 44 mpg
  • Volkswagen New Beetle, manual, diesel with 44 mpg
  • Volkswagen New Beetle, automatic, diesel with 41 mpg

The largest engine, at 7 liters and 24 mpg, is the Chevrolet Corvette manual with 8 cylinders. The five cars with the lowest highway mileage, mpg = 12, are three 4-wheel drive pickups and two 4-wheel drive SUVs. These cars are heavier, and 4-wheel drive tends to use more fuel.
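
The cars named above can be pulled out of the data with a few filters. This is a sketch using dplyr; the object names high_hwy, largest_engine and lowest_hwy are made up for this example.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Cars with highway mileage above 40 mpg
high_hwy <- mpg %>%
  filter(hwy > 40) %>%
  select(manufacturer, model, displ, trans, fl, hwy)

# Car(s) with the largest engine displacement
largest_engine <- mpg %>%
  filter(displ == max(displ)) %>%
  select(manufacturer, model, displ, cyl, trans, hwy)

# Cars with the lowest highway mileage
lowest_hwy <- mpg %>%
  filter(hwy == min(hwy)) %>%
  select(manufacturer, model, drv, class, hwy)

high_hwy
largest_engine
lowest_hwy
```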

# TRUE in the "Outlier" legend marks the candidate outliers
ggplot(mpg, aes(hwy, displ, color = (displ > 6.5 | hwy > 40))) +
  geom_point(alpha = 0.5) +
  guides(color = guide_legend("Outlier"))

Boxplot and outliers of displ

A boxplot of engine displacement does not show any outliers. The IQR for displ is 2.2, and the upper boundary for determining outliers (third quartile plus 1.5 times the IQR) is 7.9 liters. So the Chevy Corvette is not an outlier for displ.

ggplot(mpg, aes(displ)) + geom_boxplot(outlier.color = "red")

IQR(mpg$displ)
## [1] 2.2
quantile(mpg$displ,0.75) + IQR(mpg$displ) * 1.5
## 75% 
## 7.9

Boxplot and outliers of hwy

A boxplot and outlier calculation of hwy mpg does show that the cars with mpg > 40 are outliers. The IQR for hwy is 9, and the upper bound for determining outliers is 40.5. So the three diesel Volkswagens are outliers for the hwy variable.

ggplot(mpg, aes(hwy)) + geom_boxplot(outlier.color = "red")

IQR(mpg$hwy)
## [1] 9
quantile(mpg$hwy,0.75) + IQR(mpg$hwy) * 1.5
##  75% 
## 40.5
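
The same 1.5 x IQR rule can be wrapped in a small function so that both bounds are computed in one step for any numeric variable. The sketch below is one possible way to do this; iqr_fences is a hypothetical helper written for this example, not a function from ggplot2 or base R.

```r
library(ggplot2)  # provides the mpg dataset

# Hypothetical helper: lower and upper bounds at k * IQR beyond the quartiles
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]  # interquartile range
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

hwy_fences   <- iqr_fences(mpg$hwy)
displ_fences <- iqr_fences(mpg$displ)

# Rows that fall outside the bounds for each variable
hwy_outliers <- mpg[mpg$hwy < hwy_fences["lower"] |
                      mpg$hwy > hwy_fences["upper"], ]
displ_outliers <- mpg[mpg$displ < displ_fences["lower"] |
                        mpg$displ > displ_fences["upper"], ]

hwy_fences
hwy_outliers
nrow(displ_outliers)
```

Applied to hwy and displ, this should agree with the upper bounds of 40.5 and 7.9 computed above.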

Plot 2 - Engine Displacement by City Mileage differentiated by Fuel and Drive Type

A graph with four variables is plotted to investigate what layering different plots can reveal about the data. A scatter plot of displ against cty mpg is layered with two categorical variables: fuel type and drive type.

In order to see all of the fuel types, both color and shape are mapped to fuel type. Without this, the single natural gas car at 24 mpg and the diesel car at displ = 3 were both obscured by cars using regular gas at the same points. A faceted plot helps identify where the points for each fuel type fall.

As expected, the smaller diesel cars that have good highway mileage also get better city mpg. Front wheel drive cars generally have higher city mpg, although in the range from 15 to 20 mpg, smaller 4 wheel drive cars tend to have better city mpg than slightly larger front wheel drive cars.
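
One way to check the drive-type comparison numerically is to summarize city mileage and engine size per drive type. This is a sketch with dplyr; the name cty_by_drv is introduced here for illustration.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Per drive type: number of cars, typical city mpg, typical engine size
cty_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(
    n          = n(),
    mean_cty   = mean(cty),
    median_cty = median(cty),
    mean_displ = mean(displ)
  )
cty_by_drv
```

Group means hide the spread within each drive type, so this table complements the smoothed lines in the plot rather than replacing them.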

What surprised me is that ethanol cars have low city mileage. These turn out to be the pickups and SUVs with the lowest highway mileage, whose city mpg is 9, plus two more SUVs with city mpg = 11. The minivan, which has a fairly small engine and front wheel drive, also runs on ethanol and gets city mpg = 11, so it doesn’t sound like a very good bargain.
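
The ethanol cars discussed above can be listed with a single filter on fl. This is a minimal sketch with dplyr; the object name ethanol_cars is made up for this example.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# All cars that run on ethanol, lowest city mileage first
ethanol_cars <- mpg %>%
  filter(fl == "e") %>%
  select(manufacturer, model, displ, drv, cty, hwy, class) %>%
  arrange(cty)
ethanol_cars
```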

ggplot(mpg, aes(cty, displ)) +
  # one smoothed trend line per drive type
  geom_smooth(aes(linetype = drv), color = "black", se = FALSE) +
  # size is a constant, so it is set outside aes() and needs no legend
  geom_point(aes(color = fl, shape = fl), size = 1.5, alpha = 0.4) +
  scale_color_discrete(name = "fuel", labels = c("natural gas", "diesel", "ethanol", "premium", "regular")) +
  scale_shape_discrete(name = "fuel", labels = c("natural gas", "diesel", "ethanol", "premium", "regular")) +
  scale_linetype_discrete(name = "drive", labels = c("4 wheel", "front wheel", "rear wheel"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(cty, displ)) + 
  geom_point(aes(color = fl), alpha = 0.5 ) + 
  facet_wrap(vars(fl)) +
  scale_color_discrete(name = "fuel", labels = c("natural gas", "diesel", "ethanol", "premium", "regular"))

Resources / Bibliography

  1. Visualization Cheat Sheet - https://ggplot2.tidyverse.org/reference/mpg.html
  2. R Markdown Reference Guide - https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
  3. R Documentation, mpg dataset description - https://www.rdocumentation.org/packages/ggplot2/versions/3.3.3/topics/mpg
  4. R for Data Science by Garrett Grolemund, Hadley Wickham - https://www.oreilly.com/library/view/r-for-data/9781491910382/ch01.html