Amid today's environmental concerns and economic pressures, fuel economy matters a great deal to car buyers, who want to know what MPG to expect from a new vehicle. A fuel-efficient vehicle can save money as gas prices keep rising. This study investigates how fuel economy has changed over time and how it relates to other vehicle attributes. Specifically, it asks whether combined MPG has changed over the model years covered, and whether it differs by number of cylinders and by engine displacement.
The data come from the R package fueleconomy, which sources its data from the EPA (Environmental Protection Agency). Within the package, the data are stored in the vehicles dataset.
The fuel economy data cover all cars sold in the US from 1984 to 2015. The vehicles dataset has 33,442 rows and 12 variables.
This is an observational study. The response variable is combined mpg, which is quantitative. The two explanatory variables are number of cylinders and engine displacement. Because the data are observational, the analysis focuses on identifying associations rather than establishing causation.
Further information about the data used in this study can be found at the links below:
library(dplyr)
library(ggplot2)
# load data
library(fueleconomy)
vehicles
summary(vehicles)
##        id            make              model                year
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991
##  Median :16724   Mode  :character   Mode  :character   Median :1999
##  Mean   :17038                                         Mean   :1999
##  3rd Qu.:25265                                         3rd Qu.:2008
##  Max.   :34932                                         Max.   :2015
##
##     class              trans              drive                cyl
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000
##  Class :character   Class :character   Class :character   1st Qu.: 4.000
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000
##                                                           Mean   : 5.772
##                                                           3rd Qu.: 6.000
##                                                           Max.   :16.000
##                                                           NA's   :58
##      displ           fuel                hwy              cty
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00
##  NA's   :57
Per EPA (Environmental Protection Agency) guidelines, combined fuel economy is a weighted average of the City and Highway MPG values, with the City value weighted at 55% and the Highway value at 45%.
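As a quick illustration of this weighting, the sketch below applies it to one made-up pair of ratings. The helper name `combined_mpg` and the input values are hypothetical and are not part of the fueleconomy package.

```r
# Hypothetical helper: 55% city / 45% highway weighting, as described above
combined_mpg <- function(cty, hwy) {
  0.55 * cty + 0.45 * hwy
}

# Made-up ratings, chosen only to make the arithmetic easy to follow:
# 0.55 * 20 + 0.45 * 30 = 11 + 13.5 = 24.5
example_mpg <- combined_mpg(cty = 20, hwy = 30)
example_mpg
```

Because the arithmetic is vectorized, the same expression can be applied to whole columns of city and highway values at once.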
# remove rows with missing values
vehicles <- na.omit(vehicles)
# combined mpg: 55% city, 45% highway
vehicles <- vehicles %>% mutate(mpg = 0.55 * cty + 0.45 * hwy)
vehicles
# dimensions
dim(vehicles)
## [1] 33384 13
unique(vehicles$cyl)
## [1] 4 6 5 8 12 10 16 3 2
summary(vehicles$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.80 16.70 19.70 20.11 22.60 54.40
hist(vehicles$mpg, breaks = 50)
qqnorm(vehicles$mpg)
qqline(vehicles$mpg)
# summary statistics of combined mpg by model year
vehicles %>%
  group_by(year) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
plot(mpg ~ year, data = vehicles)
# refer to columns by name inside aes() rather than with vehicles$...
ggplot(vehicles, aes(year, mpg)) +
  geom_point(aes(color = factor(cyl), alpha = factor(cyl))) +
  theme_minimal() +
  geom_smooth()
## Warning: Using alpha for a discrete variable is not advised.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# to find make and model having max mpg
vehicles[which.max(vehicles$mpg),]
any(is.na(vehicles$cyl))
## [1] FALSE
vehicles_cyl <- vehicles %>% filter(!is.na(cyl))
vehicles_cyl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
ggplot(vehicles_cyl, aes(cyl, mpg)) + geom_boxplot(aes(fill = factor(cyl)))
ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(cyl)))
The plots above show that, on average, combined mpg decreases as the number of cylinders increases.
vehicles_displ <- vehicles %>% filter(!is.na(displ))
vehicles_displ %>%
  group_by(displ) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
ggplot(vehicles_displ, aes(displ, mpg)) + geom_boxplot(aes(fill = factor(displ)))
The displacement plots show the same pattern: combined mpg decreases as engine displacement increases.
As the plots above suggest, combined mpg tends to decrease as the number of cylinders increases. Let's run an ANOVA (analysis of variance) test, which uses a hypothesis test to check whether the means across several groups are all equal:
\(H_0\): \(\mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6 = \mu_8 = \mu_{10} = \mu_{12} = \mu_{16}\)
\(H_A\): at least one group mean \(\mu_i\) differs from the others.
model_mpg_cyl <- lm(mpg~factor(cyl), data = vehicles)
anova(model_mpg_cyl)
Since the p-value is low (< 0.05), we reject \(H_0\) in favor of \(H_A\). Mean mpg therefore differs for at least one cylinder count.
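To make the mechanics of this test concrete, here is a minimal sketch on made-up data. The `toy` data frame and its values are hypothetical and are not taken from the vehicles dataset. It fits the same kind of model and reads the F statistic and p-value out of the ANOVA table.

```r
# Toy data (invented for illustration): mpg for three cylinder groups
toy <- data.frame(
  cyl = rep(c(4, 6, 8), each = 4),
  mpg = c(28, 30, 27, 29,   # 4 cylinders
          21, 22, 20, 23,   # 6 cylinders
          15, 16, 14, 17)   # 8 cylinders
)

# One-way ANOVA: treat cyl as a grouping factor, not a number
fit <- lm(mpg ~ factor(cyl), data = toy)
tab <- anova(fit)

# Row 1 of the table is the between-group term; row 2 is the residuals
f_stat  <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]

# TRUE means H_0 (all group means equal) is rejected at the 5% level
p_value < 0.05
```

The F statistic is the ratio of the between-group mean square to the within-group mean square, so widely separated group means with little spread inside each group give a large F and a small p-value.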
As noted above, combined mpg also decreases as engine displacement increases, so we run the same ANOVA test with displacement as the grouping factor:
model_mpg_displ <- lm(mpg ~ factor(displ), data = vehicles)
anova(model_mpg_displ)
Since the p-value in this case is also low (< 0.05), we reject \(H_0\) in favor of \(H_A\). Mean mpg therefore differs across engine displacements.
The 2000 Honda Insight has the highest combined mpg (54.4).
In the plots of mpg against year above, I see an overall increase in mpg over the years.
The analysis of the EPA data in the fueleconomy package shows that average MPG increased from ~17 MPG in 1984 to ~23 MPG in 2015. There is significant evidence that combined MPG differs across engines with different numbers of cylinders, and vehicles with more cylinders clearly tend to have lower combined MPG. There is also significant evidence that fuel economy differs across engine displacements: larger displacement is associated with lower overall MPG.
Although I did not investigate them in this project, the data include several other variables that could be analyzed further. Differences in fuel economy across other categorical variables (e.g., make, model, class, transmission, drive) could be investigated through appropriate hypothesis tests. We could also fit a multiple regression to predict a vehicle's gas mileage from these characteristics.
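As a rough sketch of what such a regression might look like, the example below uses R's built-in `mtcars` dataset as a stand-in, since it also has `mpg`, `cyl`, and a displacement column (`disp`, in cubic inches). This is an assumption made for illustration only: it is not the vehicles data, and the fitted values say nothing about the results above.

```r
# mtcars ships with base R; it stands in for the vehicles data here
fit <- lm(mpg ~ cyl + disp, data = mtcars)

# Estimated coefficients: intercept plus one slope per predictor
coef(fit)

# Predicted mpg for a hypothetical 4-cylinder car with a 120 cu. in. engine
new_car <- data.frame(cyl = 4, disp = 120)
pred <- predict(fit, newdata = new_car)
pred
```

With the vehicles data, categorical predictors such as class or drive could enter the formula the same way, wrapped in `factor()` if they are not already factors.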