library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
An analysis of relationships between selected variables in the mpg dataset, which is available in the ggplot2 package.
The goal of this exercise is to practice plotting and analyzing a dataset using the ggplot2 package in RStudio. Two plots are developed and explored.
The mpg dataset is a tibble that ships with the ggplot2 package, part of the tidyverse. It contains a subset of the fuel economy data that the EPA makes available at https://fueleconomy.gov/. It includes only car models that had a new release every year between 1999 and 2008, which was used as a proxy for the popularity of the car. The dataset contains 11 variables and 234 rows, one per car.
| Variable | Type | Description | Details |
|---|---|---|---|
| manufacturer | string | manufacturer name | 15 manufacturers |
| model | string | model name | 38 models |
| displ | numeric | engine displacement, in liters | 1.6 - 7.0, median: 3.3 |
| year | integer | year of manufacture | 1999, 2008 |
| cyl | integer | number of cylinders | 4, 5, 6, 8 |
| trans | string | type of transmission | automatic (8 sub-types), manual (2 sub-types) |
| drv | string | drive type | f = front wheel, r = rear wheel, 4 = 4 wheel |
| cty | integer | city mileage | miles per gallon |
| hwy | integer | highway mileage | miles per gallon |
| fl | string | fuel type | e = ethanol, d = diesel, r = regular, p = premium, c = natural gas |
| class | string | vehicle class | 7 class types |
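The figures in the Details column can be recomputed from the data. The following is a minimal sketch, assuming dplyr and ggplot2 are installed; the name `details` is introduced here for illustration only.

```r
library(dplyr)

mpg <- ggplot2::mpg  # the dataset ships with ggplot2

# Distinct counts and displacement range behind the "Details" column
details <- mpg %>%
  summarise(
    manufacturers = n_distinct(manufacturer),
    models        = n_distinct(model),
    classes       = n_distinct(class),
    displ_min     = min(displ),
    displ_median  = median(displ),
    displ_max     = max(displ)
  )
details

# Distinct values of the coded variables
sort(unique(mpg$year))
sort(unique(mpg$cyl))
sort(unique(mpg$drv))
sort(unique(mpg$fl))
```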
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
In a plot of engine displacement against highway mileage, the expectation is that smaller engines will have greater fuel efficiency. The plot bears this out: larger engines are grouped below 30 mpg, while smaller engines reach higher mpg.
An initial guess for outliers as shown in the graph would be cars with mpg > 40 and the car with displ = 7 liters.
It appears that there are two cars with smaller engines at about 1.8 liters which have mpg > 40. Examining the data reveals there are actually three cars with mpg > 40. Adding transparency to the plot helps emphasize where multiple cars share a point.
The largest engine, at 7 liters and 24 mpg, is the Chevrolet Corvette manual with 8 cylinders. The five cars with the lowest mpg (12) are three 4-wheel drive pickups and two 4-wheel drive SUVs. These cars are heavier, and 4-wheel drive tends to use more fuel.
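The rows behind these observations can be pulled out directly. This is a sketch assuming dplyr is available; the object names `high_hwy`, `largest`, and `low_hwy` are illustrative and do not come from the original analysis.

```r
library(dplyr)

mpg <- ggplot2::mpg

# Cars with highway mileage above 40 mpg
high_hwy <- mpg %>%
  filter(hwy > 40) %>%
  select(manufacturer, model, displ, trans, fl, hwy)
high_hwy

# Car(s) with the largest engine displacement
largest <- mpg %>%
  filter(displ == max(displ)) %>%
  select(manufacturer, model, displ, cyl, trans, hwy)
largest

# Cars tied for the lowest highway mileage, counted by class and drive type
low_hwy <- mpg %>%
  filter(hwy == min(hwy)) %>%
  select(manufacturer, model, drv, class, hwy)
count(low_hwy, class, drv)
```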
ggplot(mpg, aes(hwy, displ, color = (displ > 6.5 | hwy > 40))) +
  geom_point(alpha = 0.5) +
  guides(color = guide_legend("Outlier"))
A boxplot of engine displacement does not show any outliers. The IQR for displ is 2.2, and the upper bound for determining outliers (the third quartile plus 1.5 times the IQR) is 7.9 liters. So the Chevy Corvette is not an outlier for displ.
ggplot(mpg, aes(displ)) + geom_boxplot(outlier.color = "red")
IQR(mpg$displ)
## [1] 2.2
quantile(mpg$displ,0.75) + IQR(mpg$displ) * 1.5
## 75%
## 7.9
A boxplot and outlier calculation of hwy mpg do show that the cars with mpg > 40 are outliers. The IQR for hwy is 9, and the upper bound for determining outliers is 40.5. So the three diesel Volkswagens are outliers for the hwy variable.
ggplot(mpg, aes(hwy)) + geom_boxplot(outlier.color = "red")
IQR(mpg$hwy)
## [1] 9
quantile(mpg$hwy,0.75) + IQR(mpg$hwy) * 1.5
## 75%
## 40.5
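The two calculations above apply the same 1.5 × IQR rule, so they can be wrapped in one function. The helper below is a hypothetical sketch (`tukey_fences` is not part of base R or ggplot2) that returns both the lower and upper bounds using the default quantile method.

```r
mpg <- ggplot2::mpg

# Hypothetical helper: lower and upper fences at k * IQR beyond the quartiles
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

displ_fences <- tukey_fences(mpg$displ)
hwy_fences   <- tukey_fences(mpg$hwy)

# Rows falling outside the fences for each variable
displ_out <- mpg[mpg$displ < displ_fences["lower"] | mpg$displ > displ_fences["upper"], ]
hwy_out   <- mpg[mpg$hwy < hwy_fences["lower"] | mpg$hwy > hwy_fences["upper"], ]

nrow(displ_out)
hwy_out[, c("manufacturer", "model", "fl", "hwy")]
```

Returning both fences also checks the lower end of each variable, which the calculations above do not cover.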
A graph with four variables is plotted to investigate what layering different plots could reveal about the data. A scatter plot of displ against cty mpg is layered with two categorical variables: fuel type and drive type.
To make all of the fuel types visible, both color and shape have been mapped to the points on the graph. Without this, the single natural gas car at 24 mpg and the diesel car at displ = 3 were both obscured by cars using regular gas at the same points. A faceted plot helps identify where the points for each fuel type fall.
As expected, the smaller diesel cars that have good highway mileage also get better city mpg. Front wheel drive cars generally have higher city mpg, although in the range from 15 to 20 mpg, smaller 4 wheel drive cars tend to have better city mpg than slightly larger front wheel drive cars.
What surprised me is that ethanol cars have low city mileage. These turn out to be the pickups and SUVs that have low highway mileage, with city mpg = 9, plus two more SUVs with city mpg = 11. The minivan with a fairly small engine and front wheel drive using ethanol also has city mpg = 11, so it doesn’t sound like a very good bargain.
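The observations about drive type and ethanol cars can be checked against the data. This is a sketch assuming dplyr is available; `cty_by_drv` and `ethanol` are illustrative names.

```r
library(dplyr)

mpg <- ggplot2::mpg

# Average and median city mileage for each drive type
cty_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(mean_cty = mean(cty), median_cty = median(cty), n = n())
cty_by_drv

# All ethanol-fuelled cars, ordered by city mileage
ethanol <- mpg %>%
  filter(fl == "e") %>%
  select(manufacturer, model, displ, drv, class, cty, hwy) %>%
  arrange(cty)
ethanol
```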
ggplot(mpg, aes(cty, displ)) +
  # one smoothed trend line per drive type
  geom_smooth(aes(linetype = drv), color = "black", se = FALSE) +
  # size is a constant, so it is set outside aes() and needs no legend
  geom_point(aes(color = fl, shape = fl), size = 1.5, alpha = 0.4) +
  scale_color_discrete(name = "fuel", labels = c("natural gas", "diesel", "ethanol", "premium", "regular")) +
  scale_shape_discrete(name = "fuel", labels = c("natural gas", "diesel", "ethanol", "premium", "regular")) +
  scale_linetype_discrete(name = "drive", labels = c("4 wheel", "front wheel", "rear wheel"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(cty, displ)) +
  geom_point(aes(color = fl), alpha = 0.5) +
  facet_wrap(vars(fl)) +
  scale_color_discrete(name = "fuel", labels = c("natural gas", "diesel", "ethanol", "premium", "regular"))