In this study, I will explore a dataset that contains fuel efficiency performance metrics, measured in miles per gallon (MPG) over time, for most makes and models of automobiles available in the U.S. since 1984. This data is courtesy of the U.S. Department of Energy and the U.S. Environmental Protection Agency. The latest version is available at http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip .
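The code below expects vehicles.csv.zip in the working directory. As a minimal sketch, assuming the URL above is still valid, the archive could be fetched once and reused on later runs. The helper name fetch_vehicles is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): download the
# archive only when it is not already present at `destfile`.
fetch_vehicles <- function(url = "http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
                           destfile = "vehicles.csv.zip") {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the zip as binary, which matters on Windows
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage (requires network access on the first call):
# fetch_vehicles()
```

Skipping the download when the file already exists keeps repeated runs of the analysis fast and independent of the network.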
library(plyr)
library(ggplot2)
library(reshape)
vehicles <- read.csv(unz("vehicles.csv.zip", "vehicles.csv"),
stringsAsFactors = FALSE)
head(vehicles)
## barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 cityA08U
## 1 15.68944 0 0 0 19 0 0 0
## 2 29.95056 0 0 0 9 0 0 0
## 3 12.19557 0 0 0 23 0 0 0
## 4 29.95056 0 0 0 10 0 0 0
## 5 17.33749 0 0 0 17 0 0 0
## 6 14.96429 0 0 0 21 0 0 0
## cityCD cityE cityUF co2 co2A co2TailpipeAGpm co2TailpipeGpm comb08
## 1 0 0 0 -1 -1 0 423.1905 21
## 2 0 0 0 -1 -1 0 807.9091 11
## 3 0 0 0 -1 -1 0 329.1481 27
## 4 0 0 0 -1 -1 0 807.9091 11
## 5 0 0 0 -1 -1 0 467.7368 19
## 6 0 0 0 -1 -1 0 403.9545 22
## comb08U combA08 combA08U combE combinedCD combinedUF cylinders displ
## 1 0 0 0 0 0 0 4 2.0
## 2 0 0 0 0 0 0 12 4.9
## 3 0 0 0 0 0 0 4 2.2
## 4 0 0 0 0 0 0 8 5.2
## 5 0 0 0 0 0 0 4 2.2
## 6 0 0 0 0 0 0 4 1.8
## drive engId eng_dscr feScore fuelCost08
## 1 Rear-Wheel Drive 9011 (FFS) -1 1350
## 2 Rear-Wheel Drive 22020 (GUZZLER) -1 2550
## 3 Front-Wheel Drive 2100 (FFS) -1 1050
## 4 Rear-Wheel Drive 2850 -1 2550
## 5 4-Wheel or All-Wheel Drive 66031 (FFS,TRBO) -1 1850
## 6 Front-Wheel Drive 66020 (FFS) -1 1250
## fuelCostA08 fuelType fuelType1 ghgScore ghgScoreA highway08
## 1 0 Regular Regular Gasoline -1 -1 25
## 2 0 Regular Regular Gasoline -1 -1 14
## 3 0 Regular Regular Gasoline -1 -1 33
## 4 0 Regular Regular Gasoline -1 -1 12
## 5 0 Premium Premium Gasoline -1 -1 23
## 6 0 Regular Regular Gasoline -1 -1 24
## highway08U highwayA08 highwayA08U highwayCD highwayE highwayUF hlv hpv
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 19 77
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## id lv2 lv4 make model mpgData phevBlended pv2 pv4
## 1 1 0 0 Alfa Romeo Spider Veloce 2000 Y false 0 0
## 2 10 0 0 Ferrari Testarossa N false 0 0
## 3 100 0 0 Dodge Charger Y false 0 0
## 4 1000 0 0 Dodge B150/B250 Wagon 2WD N false 0 0
## 5 10000 0 14 Subaru Legacy AWD Turbo N false 0 90
## 6 10001 0 15 Subaru Loyale N false 0 88
## range rangeCity rangeCityA rangeHwy rangeHwyA trany UCity
## 1 0 0 0 0 0 Manual 5-spd 23.3333
## 2 0 0 0 0 0 Manual 5-spd 11.0000
## 3 0 0 0 0 0 Manual 5-spd 29.0000
## 4 0 0 0 0 0 Automatic 3-spd 12.2222
## 5 0 0 0 0 0 Manual 5-spd 21.0000
## 6 0 0 0 0 0 Automatic 3-spd 27.0000
## UCityA UHighway UHighwayA VClass year youSaveSpend guzzler
## 1 0 35.0000 0 Two Seaters 1985 -1250
## 2 0 19.0000 0 Two Seaters 1985 -7250 T
## 3 0 47.0000 0 Subcompact Cars 1985 250
## 4 0 16.6667 0 Vans 1985 -7250
## 5 0 32.0000 0 Compact Cars 1993 -3750
## 6 0 33.0000 0 Compact Cars 1993 -750
## trans_dscr tCharger sCharger atvType fuelType2 rangeA evMotor mfrCode
## 1 NA
## 2 NA
## 3 SIL NA
## 4 NA
## 5 TRUE
## 6 NA
## c240Dscr charge240b c240bDscr createdOn
## 1 0 Tue Jan 01 00:00:00 EST 2013
## 2 0 Tue Jan 01 00:00:00 EST 2013
## 3 0 Tue Jan 01 00:00:00 EST 2013
## 4 0 Tue Jan 01 00:00:00 EST 2013
## 5 0 Tue Jan 01 00:00:00 EST 2013
## 6 0 Tue Jan 01 00:00:00 EST 2013
## modifiedOn startStop phevCity phevHwy phevComb
## 1 Tue Jan 01 00:00:00 EST 2013 0 0 0
## 2 Tue Jan 01 00:00:00 EST 2013 0 0 0
## 3 Tue Jan 01 00:00:00 EST 2013 0 0 0
## 4 Tue Jan 01 00:00:00 EST 2013 0 0 0
## 5 Tue Jan 01 00:00:00 EST 2013 0 0 0
## 6 Tue Jan 01 00:00:00 EST 2013 0 0 0
names(vehicles)
## [1] "barrels08" "barrelsA08" "charge120"
## [4] "charge240" "city08" "city08U"
## [7] "cityA08" "cityA08U" "cityCD"
## [10] "cityE" "cityUF" "co2"
## [13] "co2A" "co2TailpipeAGpm" "co2TailpipeGpm"
## [16] "comb08" "comb08U" "combA08"
## [19] "combA08U" "combE" "combinedCD"
## [22] "combinedUF" "cylinders" "displ"
## [25] "drive" "engId" "eng_dscr"
## [28] "feScore" "fuelCost08" "fuelCostA08"
## [31] "fuelType" "fuelType1" "ghgScore"
## [34] "ghgScoreA" "highway08" "highway08U"
## [37] "highwayA08" "highwayA08U" "highwayCD"
## [40] "highwayE" "highwayUF" "hlv"
## [43] "hpv" "id" "lv2"
## [46] "lv4" "make" "model"
## [49] "mpgData" "phevBlended" "pv2"
## [52] "pv4" "range" "rangeCity"
## [55] "rangeCityA" "rangeHwy" "rangeHwyA"
## [58] "trany" "UCity" "UCityA"
## [61] "UHighway" "UHighwayA" "VClass"
## [64] "year" "youSaveSpend" "guzzler"
## [67] "trans_dscr" "tCharger" "sCharger"
## [70] "atvType" "fuelType2" "rangeA"
## [73] "evMotor" "mfrCode" "c240Dscr"
## [76] "charge240b" "c240bDscr" "createdOn"
## [79] "modifiedOn" "startStop" "phevCity"
## [82] "phevHwy" "phevComb"
A lot of these column or variable names are pretty descriptive and give us an idea of what they might contain. Remember, a more detailed description of the variables is available at http://www.fueleconomy.gov/feg/ws/index.shtml .
Let’s find out how many unique years of data are included in this dataset and determine the first and last years present in the dataset:
length(unique(vehicles[, "year"]))
## [1] 34
first_year <- min(vehicles[, "year"])
first_year
## [1] 1984
last_year <- max(vehicles[, "year"])
last_year
## [1] 2017
Now, let’s find out what types of fuel are used as the automobiles’ primary fuel types:
table(vehicles$fuelType1)
##
## Diesel Electricity Midgrade Gasoline Natural Gas
## 1096 106 64 60
## Premium Gasoline Regular Gasoline
## 9700 25896
From this, we can see that most cars in the dataset use regular gasoline, and the second most common fuel type is premium gasoline.
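To put numbers on "most", the counts can be turned into shares. The sketch below re-enters the counts from the table output above by hand, so it runs without the data file; the variable names fuel_counts and fuel_share are illustrative only.

```r
# Counts copied from the table(vehicles$fuelType1) output above
fuel_counts <- c("Diesel" = 1096, "Electricity" = 106,
                 "Midgrade Gasoline" = 64, "Natural Gas" = 60,
                 "Premium Gasoline" = 9700, "Regular Gasoline" = 25896)

# Convert counts to shares and sort from most to least common
fuel_share <- sort(fuel_counts / sum(fuel_counts), decreasing = TRUE)
round(100 * fuel_share, 1)

# With the data frame loaded, the equivalent one-liner would be:
# sort(prop.table(table(vehicles$fuelType1)), decreasing = TRUE)
```

With these counts, regular gasoline accounts for roughly 70% of the rows and premium gasoline for about 26%.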
Now, let’s explore the types of transmissions used by these automobiles:
table(vehicles$trany)
##
## Auto (AV-S6)
## 11 1
## Auto (AV-S8) Auto (AV)
## 1 2
## Auto(A1) Auto(A8)
## 4 1
## Auto(AM-S6) Auto(AM-S7)
## 82 206
## Auto(AM-S8) Auto(AM5)
## 4 12
## Auto(AM6) Auto(AM7)
## 103 122
## Auto(AM8) Auto(AV-S6)
## 4 131
## Auto(AV-S7) Auto(AV-S8)
## 48 24
## Auto(L3) Auto(L4)
## 2 2
## Automatic (A1) Automatic (A6)
## 92 4
## Automatic (AM5) Automatic (AM6)
## 2 1
## Automatic (AV-S6) Automatic (AV)
## 9 4
## Automatic (S4) Automatic (S5)
## 233 824
## Automatic (S6) Automatic (S7)
## 2465 228
## Automatic (S8) Automatic (S9)
## 767 16
## Automatic (variable gear ratios) Automatic 3-spd
## 644 3151
## Automatic 4-spd Automatic 5-spd
## 11039 2184
## Automatic 6-spd Automatic 6spd
## 1354 1
## Automatic 7-spd Automatic 8-spd
## 630 202
## Automatic 9-spd Manual 3-spd
## 55 77
## Manual 4-spd Manual 4-spd Doubled
## 1483 17
## Manual 5-spd Manual 5 spd
## 8287 1
## Manual 6-spd Manual 7-spd
## 2334 56
## Manual(M7)
## 2
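The transmission labels are inconsistent, so I treat the empty strings as missing values and collapse the remaining labels into a two-level variable, trany2:
vehicles$trany[vehicles$trany == ""] <- NA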
vehicles$trany2 <- ifelse(substr(vehicles$trany, 1, 4) == "Auto",
"Auto", "Manual")
vehicles$trany <- as.factor(vehicles$trany)
table(vehicles$trany2)
##
## Auto Manual
## 24654 12257
Let’s start by looking at whether there is an overall trend in how average MPG changes over time.
mpgByYr <- ddply(vehicles, ~year, summarise,
avgMPG = mean(comb08),
avgHwy = mean(highway08),
avgCity = mean(city08))
ggplot(mpgByYr, aes(year, avgMPG)) +
geom_point() + geom_smooth() +
xlab("Year") + ylab("Average MPG") +
ggtitle("All cars") +
scale_x_continuous(breaks = seq(1984,2017,2))
Based on this visualization, one might conclude that there has been a tremendous increase in the fuel economy of cars sold in the last few years. However, this conclusion may be misleading, because more hybrid and non-gasoline vehicles have appeared in the later years. Recall that the dataset contains several non-gasoline fuel types:
table(vehicles$fuelType1)
##
## Diesel Electricity Midgrade Gasoline Natural Gas
## 1096 106 64 60
## Premium Gasoline Regular Gasoline
## 9700 25896
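The overall table does not show when each fuel type appears. One way to check the claim about later years is to cross-tabulate fuel type by year. The sketch below uses a hypothetical toy data frame in place of vehicles, so the numbers it prints are illustrative only.

```r
# Hypothetical toy rows standing in for vehicles[, c("year", "fuelType1")]
toy <- data.frame(
  year      = c(1990, 1990, 1990, 2016, 2016, 2016),
  fuelType1 = c("Regular Gasoline", "Regular Gasoline", "Diesel",
                "Regular Gasoline", "Electricity", "Premium Gasoline")
)

# Rows are years, columns are fuel types
counts <- table(toy$year, toy$fuelType1)

# Share of each fuel type within a year (each row sums to 1)
shares <- prop.table(counts, margin = 1)
round(shares, 2)

# On the real data:
# prop.table(table(vehicles$year, vehicles$fuelType1), margin = 1)
```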
Let’s look at gasoline cars only and redraw the previous plot.
gasCars <- subset(vehicles, fuelType1 %in% c("Regular Gasoline",
"Premium Gasoline",
"Midgrade Gasoline") &
fuelType2 == "" & atvType != "Hybrid")
mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
ggplot(mpgByYr_Gas, aes(year, avgMPG)) + geom_point() +
geom_smooth() + xlab("Year") + ylab("Average MPG") +
ggtitle("Gasoline cars") +
scale_x_continuous(breaks = seq(1984,2017,2))
If fewer large-engine cars have been made recently, this could explain the increase. First, let’s verify whether cars with larger engines have worse fuel efficiency. The variable displ in the dataset represents the displacement of the engine in liters.
ggplot(gasCars, aes(displ, comb08)) +
geom_point() +
geom_smooth()
This scatter plot offers convincing evidence of a negative correlation between engine displacement and fuel efficiency; that is, cars with smaller engines tend to be more fuel-efficient.
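The visual impression can be backed by a number. A minimal sketch, using hypothetical toy values rather than the real columns, computes Spearman's rank correlation, which requires only a monotonic relationship rather than a linear one:

```r
# Hypothetical toy values; on the real data use gasCars$displ and gasCars$comb08
displ  <- c(1.6, 2.0, 2.5, 3.5, 5.0, 6.2)   # engine displacement in liters
comb08 <- c(33, 29, 26, 22, 17, 15)          # combined MPG

# use = "complete.obs" drops pairs in which either value is missing
rho <- cor(displ, comb08, method = "spearman", use = "complete.obs")
rho
```

A value close to -1 indicates that MPG falls almost monotonically as displacement grows; the toy values are constructed to show exactly that and say nothing about the strength of the relationship in the real data.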
Now, let’s see whether more cars with small engines were made in later years, which could explain the drastic increase in fuel efficiency:
avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
ggplot(avgCarSize, aes(year, avgDispl)) +
geom_point() +
geom_smooth() +
xlab("Year") +
ylab("Average engine displacement (l)") +
scale_x_continuous(breaks = seq(1984,2017,2))
From the figure, the average engine displacement has decreased substantially since 2008. To get a better sense of the impact this might have had on fuel efficiency, I will put both MPG and displacement by year on the same graph.
byYear <- ddply(gasCars, ~year,
summarise,
avgMPG = mean(comb08),
avgDispl = mean(displ))
# convert from wide to a long format:
byYear2 <- melt(byYear, id = "year")
levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")
ggplot(byYear2, aes(year, value)) +
geom_point() +
geom_smooth() +
facet_wrap(~variable, ncol = 1, scales = "free_y") +
xlab("Year") +
ylab("") +
scale_x_continuous(breaks = seq(1984,2017,2))
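From this plot, the rise in average MPG in recent years coincides with the decline in average engine displacement since 2008, which is consistent with smaller engines contributing to the improvement.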
Given the trend toward smaller-displacement engines, let’s see whether automatic or manual transmissions are more efficient for four-cylinder engines, and how the efficiencies have changed over time:
gasCars4 <- subset(gasCars, cylinders == 4)
ggplot(gasCars4, aes(factor(year), comb08)) +
geom_boxplot() +
facet_wrap(~trany2, ncol = 1) +
theme(axis.text.x = element_text(angle = 45)) +
labs(x = "Year", y = "MPG")
It appears that manual transmissions are more efficient than automatic transmissions, and both show a similar average increase in MPG since 2008.
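The comparison read off the box plots can also be tabulated. The sketch below computes the median MPG for each year and transmission type with base R's aggregate(); the toy data frame is a hypothetical stand-in for gasCars4.

```r
# Hypothetical toy rows standing in for gasCars4
toy <- data.frame(
  year   = c(2008, 2008, 2008, 2008, 2016, 2016, 2016, 2016),
  trany2 = c("Auto", "Auto", "Manual", "Manual",
             "Auto", "Auto", "Manual", "Manual"),
  comb08 = c(22, 24, 25, 27, 28, 30, 29, 31)
)

# Median combined MPG for every year/transmission pair
medMPG <- aggregate(comb08 ~ year + trany2, data = toy, FUN = median)
medMPG

# On the real data:
# aggregate(comb08 ~ year + trany2, data = gasCars4, FUN = median)
```

The median corresponds to the center line of each box, so this table is a numeric companion to the plot rather than a different analysis.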
Next, let’s look at the change in proportion of manual cars available each year:
ggplot(gasCars4, aes(factor(year), fill = factor(trany2))) +
geom_bar(position = "fill") +
labs(x = "Year", y = "Proportion of cars", fill = "Transmission") +
theme(axis.text.x = element_text(angle = 45)) +
geom_hline(yintercept = 0.5, linetype = 2)
Looking back at the box plots, there appear to be many very efficient cars (more than 40 MPG) with automatic transmissions in later years, and almost no manual transmission cars with similar efficiencies in the same time frame. The pattern is reversed in earlier years.
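The proportions drawn in the stacked bar chart can be computed directly as well. A minimal sketch, again on hypothetical toy rows rather than gasCars4, takes the mean of a logical vector within each year:

```r
# Hypothetical toy rows standing in for gasCars4[, c("year", "trany2")]
toy <- data.frame(
  year   = c(1990, 1990, 1990, 1990, 2016, 2016, 2016, 2016),
  trany2 = c("Manual", "Manual", "Manual", "Auto",
             "Manual", "Auto", "Auto", "Auto")
)

# The mean of TRUE/FALSE values is the share of TRUEs;
# na.rm = TRUE skips rows whose transmission type is missing
propManual <- tapply(toy$trany2 == "Manual", toy$year, mean, na.rm = TRUE)
propManual
```

A year in which the value falls below 0.5 corresponds to a bar in which the manual segment ends below the dashed reference line.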
Let’s look at how the makes and models of cars inform fuel efficiency over time. First, let’s look at the frequency of the makes and models of cars available in the US over this time and concentrate on four-cylinder cars:
carsMake <- ddply(gasCars4, ~year,
summarise, numberOfMakes = length(unique(make)))
ggplot(carsMake, aes(year, numberOfMakes)) +
geom_point() +
labs(x = "Year", y = "Number of available makes") +
ggtitle("Four cylinder cars") +
ylim(20, 45) +
scale_x_continuous(breaks = seq(1984,2017,2))
We see in the graph that there has been a decline in the number of makes available over this period, though there has been an uptick in recent times.
Let’s look at the makes that have been available for every year of this study:
uniqMakes <- dlply(subset(gasCars4, year != 2017), ~year, function(x) unique(x$make))
commonMakes <- Reduce(intersect, uniqMakes)
commonMakes
## [1] "Ford" "Honda" "Toyota" "Volkswagen" "Chevrolet"
## [6] "Chrysler" "Nissan" "Dodge" "Mazda" "Mitsubishi"
## [11] "Subaru" "Jeep"
There are only 12 manufacturers that made four-cylinder cars every year during this period (excluding the year 2017, which is still in the future).
Let’s see how these manufacturers have done over time with respect to fuel efficiency:
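The Reduce(intersect, ...) idiom used above folds a list of vectors into the set of elements common to all of them. A small sketch with hypothetical makes shows the mechanics:

```r
# Hypothetical makes per year, standing in for the uniqMakes list
makesByYear <- list(
  "1984" = c("Ford", "Honda", "Toyota", "Saab"),
  "1985" = c("Honda", "Ford", "Toyota"),
  "1986" = c("Toyota", "Ford", "Kia")
)

# Reduce applies intersect pairwise, left to right:
# intersect(intersect(makes 1984, makes 1985), makes 1986)
commonToy <- Reduce(intersect, makesByYear)
commonToy
```

Only the makes present in every list element survive, and they keep the order in which they appear in the first element.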
carsCommonMakes4 <- subset(gasCars4, make %in% commonMakes)
avgMPG_commonMakes <- ddply(carsCommonMakes4,
~year + make,
summarise,
avgMPG = mean(comb08))
ggplot(avgMPG_commonMakes, aes(year, avgMPG)) +
geom_line() +
facet_wrap(~make, nrow = 3) +
scale_x_continuous(breaks = seq(1984,2017,2)) +
theme(axis.text.x = element_text(angle = 90))
As the graph shows, most manufacturers have improved over this time, and several, such as Mazda and Honda, have shown quite sharp increases in fuel efficiency in the last five years.