Libraries
# load data
library(tidyverse)
library(fueleconomy)
library(knitr)
library(ggthemes)
library(kableExtra)
library(ggiraph)
1.) Introduction
In today’s money-conscious and eco-friendly society, consumers are usually concerned with how many MPG their new vehicle will bring them. Vehicles that are more eco-friendly help our environment, and also help consumers save money on ever-rising gas prices. This study will focus on answering the following questions:
Has the average combined highway and city MPG changed over time. If so, how?
How does the number of cylinders a vehicle has impact it’s combined MPG?
How does a vehicles displacment impact it’s combined MPG?
2.) Data
The data is collected from the R package: fueleconomy. The fueleconomy package’s data was collected from the Environmental Protection Agency’s website. The data is stored in the vehicles dataset.
The cases represent the various vehicles between years 1984 - 2015. There are 33,442 cases of vehicle data within the fueleconomy package.
The study is an observational study. The response variable is the combined (highway and city) MPG. The explanatory variable is how the fuel economy is effected by the number of cylinders and displacment in a vehicle.
Information relating to the data in this study can be found at the following links: https://blog.rstudio.com/2014/07/23/new-data-packages/
3 & 4.) Exploratory Data Analysis & Infrence
An overview of the data found within the vehicles package is below:
summary(vehicles)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.772
## 3rd Qu.: 6.000
## Max. :16.000
## NA's :58
## displ fuel hwy cty
## Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00
## 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00
## Median :3.000 Mode :character Median : 23.00 Median : 17.00
## Mean :3.353 Mean : 23.55 Mean : 17.49
## 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00
## Max. :8.400 Max. :109.00 Max. :138.00
## NA's :57
The vehicle dataset found within the fueleconomy package does not contain data on combined MPG. Instead, the dataset gives us the highway (hwy) and city (cty) MPG. In order to answer the research question, we must first combine hwy and cty into one combined variable: mpg.
In order to calculate this variable, the hwy and cty variables must be combined using the EPA standards: \[cty * 0.55 + hwy * 0.45\]
This value was then added to the vehicles dataset as mpg
#remove data that has null values
vehicles <- na.omit(vehicles)
#combine hwy and cty mpg following EPA standards
vehicles <- vehicles %>%
mutate(mpg = 0.55 * vehicles$cty + 0.45 * vehicles$hwy)
Now that our mpg variable has been calculated, we can do some further investigation.
ggplot1 <- ggplot(vehicles, aes(factor(vehicles$year), vehicles$mpg)) +
geom_boxplot_interactive(aes(tooltip = vehicles$year, data_id = year, fill = factor(vehicles$year))) +
theme(legend.position="none")+
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_hline(aes(yintercept = median(mpg)), colour = 'red')
ggiraph(code = print(ggplot1))
## Warning: package 'gdtools' was built under R version 3.4.4
The boxplot confirms our summary statistics above - the data seems to hover around the median of 19.70 MPG, until around the year 2009 when it starts to show an increase in the average combined MPG.
qqnorm(vehicles$mpg)
qqline(vehicles$mpg)
ggplot(vehicles, aes(sample = mpg)) +
stat_qq(aes(color = factor(year))) +
ylab("Sample Quantiles - MPG")
The two Q-Q plots also show that after 2009, the average combined MPG shows an increase.
mpg_year <- vehicles %>%
group_by(year) %>%
dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))
kable(mpg_year, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
row_spec(32, bold = T, color = "white", background = "green") %>%
row_spec(1, bold = T, color = "white", background = "red")
year | n | avgmpg | median | sd |
---|---|---|---|---|
1984 | 784 | 17.15944 | 17.250 | 4.182516 |
1985 | 1699 | 20.19741 | 19.600 | 5.322081 |
1986 | 1209 | 19.93201 | 19.600 | 5.255899 |
1987 | 1247 | 19.62097 | 19.350 | 5.135072 |
1988 | 1130 | 19.74969 | 19.250 | 5.041844 |
1989 | 1153 | 19.53877 | 19.150 | 5.175750 |
1990 | 1078 | 19.42032 | 19.050 | 4.955587 |
1991 | 1132 | 19.28101 | 18.700 | 4.916046 |
1992 | 1121 | 19.34095 | 19.050 | 4.894614 |
1993 | 1093 | 19.60018 | 19.050 | 4.869317 |
1994 | 982 | 19.53147 | 19.050 | 4.619627 |
1995 | 967 | 19.31541 | 18.700 | 4.639678 |
1996 | 773 | 20.11552 | 19.700 | 4.648382 |
1997 | 762 | 19.97749 | 19.600 | 4.547487 |
1998 | 809 | 19.92145 | 19.700 | 4.494452 |
1999 | 845 | 19.85521 | 20.050 | 4.531822 |
2000 | 836 | 19.77727 | 19.600 | 4.461690 |
2001 | 906 | 19.76297 | 19.600 | 4.607778 |
2002 | 972 | 19.57629 | 19.600 | 4.558214 |
2003 | 1043 | 19.44871 | 19.600 | 4.620918 |
2004 | 1122 | 19.58623 | 19.150 | 4.470160 |
2005 | 1166 | 19.75232 | 19.600 | 4.487480 |
2006 | 1104 | 19.51676 | 19.250 | 4.205074 |
2007 | 1126 | 19.52069 | 19.375 | 3.986880 |
2008 | 1186 | 19.78419 | 19.600 | 4.255072 |
2009 | 1184 | 20.34742 | 20.150 | 4.410107 |
2010 | 1109 | 21.19932 | 21.050 | 4.768022 |
2011 | 1126 | 21.46381 | 21.150 | 5.077788 |
2012 | 1144 | 22.09025 | 21.600 | 5.524931 |
2013 | 1170 | 23.02944 | 22.150 | 5.864990 |
2014 | 1202 | 23.47138 | 22.600 | 6.106523 |
2015 | 204 | 23.87426 | 23.500 | 4.898900 |
ggplotint2 <- ggplot(mpg_year, aes(mpg_year$year, mpg_year$avgmpg)) +
geom_point_interactive(tooltip = mpg_year$avgmpg, data_id = mpg_year$avgmpg) +
geom_smooth() +
ggtitle("MPG vs. Year") +
xlab("Years: 1985-2015") +
ylab("Combined MPG")
ggiraph(code = print(ggplotint2))
## `geom_smooth()` using method = 'loess'
The data above seem indicate that there is a right-skew within the combined MPG data. The data seems to show a mostly normal distribution. The data shows that combined MPG begins to increase after the year 2009.
\[H_0: \mu_{1984} = \mu_{2015}\]
\[H_a: \mu_{1984} \neq \mu_{2015}\]
diff_years <- mpg_year %>%
filter(year == "1984" | year == "2015")
kable(diff_years)
year | n | avgmpg | median | sd |
---|---|---|---|---|
1984 | 784 | 17.15944 | 17.25 | 4.182516 |
2015 | 204 | 23.87426 | 23.50 | 4.898900 |
\[T = \frac{(\bar{x}_{1}-\bar{x}_{2}) - \mu_0}{\sqrt{\frac{s_{1984}^2}{n_{1984}} + \frac{s_{2015}^2}{n_{2015}}}}\]
s1984 <- 4.182516
s2015 <- 4.898900
n1984 <- 784
n2015 <- 204
xbar1984 <- 17.15944
xbar2015 <- 23.87426
u0 <- 0
xdiff <- xbar2015 - xbar1984
SE <- sqrt((s1984^2/n1984) + (s2015^2/n2015))
t <- (xdiff-u0)/SE
2 * (1 - pt(17.94891, df = 203))
## [1] 0
Check to see if variance is equal to determine type of T test
test1984 <- filter(vehicles, year == "1984")
test1984 <- test1984$mpg
test2015 <- filter(vehicles, year == "2015")
test2015 <- test2015$mpg
var.test(test1984, test2015)
##
## F test to compare two variances
##
## data: test1984 and test2015
## F = 0.72892, num df = 783, denom df = 203, p-value = 0.003227
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5815171 0.9007487
## sample estimates:
## ratio of variances
## 0.728917
Since the p-score is low, the Null Hypothesis is rejected. The variance is not equal.
Using the t.test function to calculate p-value on the 1984 and 2015 data.
t.test(test1984, test2015, var.equal = F)
##
## Welch Two Sample t-test
##
## data: test1984 and test2015
## t = -17.949, df = 284.65, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.451194 -5.978458
## sample estimates:
## mean of x mean of y
## 17.15944 23.87426
Since the p-value is low - null must go; meaning, the \(H_0\) is rejected and \(H_a\) is accepted. There is signicant evidence that fuel economy has changed from 1984 and 2015.
Running ANOVA on all of the years in the dataset:
model_year <- lm(mpg~factor(year), data = vehicles)
anova(model_year)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(year) 31 47645 1536.95 65.149 < 2.2e-16 ***
## Residuals 33352 786813 23.59
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA test also shows a p-value < 0.05. Meaning, he \(H_0\) is rejected and \(H_a\) is accepted. There is signicant evidence that fuel economy has changed from 1984 through 2015.
How does the number of cylinders impact the average combined MPG?
ggplotint3 <- ggplot(vehicles, aes(factor(cyl), mpg)) +
geom_boxplot_interactive(aes(tooltip = mpg, data_id = mpg, fill=factor(cyl))) +
theme(legend.position="none")
ggiraph(code = print(ggplotint3))
ggplot(vehicles, aes(sample = mpg)) +
stat_qq(aes(color = factor(cyl)))
ggplot(vehicles, aes(mpg)) +
geom_histogram( bins = 50, aes(fill = factor(cyl)))
mpg_cyl <- vehicles %>%
group_by(cyl) %>%
dplyr::summarise(n = n(), avgmpg = mean(mpg), median = median(mpg), sd = sd(mpg))
kable(mpg_cyl, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
row_spec(2, bold = T, color = "white", background = "green") %>%
row_spec(9, bold = T, color = "white", background = "red")
cyl | n | avgmpg | median | sd |
---|---|---|---|---|
2 | 45 | 18.37000 | 18.600 | 0.5289784 |
3 | 182 | 37.11456 | 35.925 | 6.0392450 |
4 | 12381 | 24.12813 | 23.500 | 4.2924181 |
5 | 718 | 20.85578 | 20.600 | 2.7252662 |
6 | 11885 | 18.90963 | 19.050 | 2.6275592 |
8 | 7550 | 15.48340 | 15.250 | 2.6219897 |
10 | 138 | 14.46920 | 14.600 | 1.8040640 |
12 | 478 | 13.36056 | 13.700 | 1.7624723 |
16 | 7 | 10.95714 | 11.150 | 0.2405351 |
ggplotint <- ggplot(mpg_cyl, aes(mpg_cyl$cyl, mpg_cyl$avgmpg)) +
geom_point() +
geom_smooth()
Looking at the graphs: on average, as the number of cylinders increases, the average combined MPG decreases.
Stastiscally we can test this by running the ANOVA test. ANOVA is used in this case since we are testing more than two samples.
\[H_0: \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6 = \mu_8 = \mu_{10} = \mu_{12} = \mu_{16}\] \[H_a: \mu_2 \neq \mu_3 \neq \mu_4 \neq \mu_5 \neq \mu_6 \neq \mu_8 \neq \mu_{10} \neq \mu_{12} \neq \mu_{16}\]
model <- lm(mpg~factor(cyl), data = vehicles)
anova(model)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cyl) 8 458546 57318 5088.9 < 2.2e-16 ***
## Residuals 33375 375913 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is low (less than 0.05) - null must go; meaning, the \(H_0\) is rejected and the \(H_a\) is accepted. There is signicant evidence that fuel economy is different for engines with different # of cylinders.
How does the number of displacment impact the average combined MPG?
ggplot(vehicles, aes(vehicles$displ, mpg)) +
geom_point(aes( color=factor(cyl), alpha = factor(cyl))) +
theme_minimal() +
geom_smooth()
## `geom_smooth()` using method = 'gam'
model_dis <- lm(mpg~as.factor(displ), data = vehicles)
anova(model_dis)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(displ) 63 573567 9104.2 1162.8 < 2.2e-16 ***
## Residuals 33320 260892 7.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is low - null must go; meaning, the H0 is rejected. There is signicant evidence that fuel economy is different for engines with different #’s of displacment.
Correlogram
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.4
corrgram(vehicles, order=TRUE,
lower.panel=panel.shade,
upper.panel=panel.pie,
text.panel=panel.txt,
main="MPG Data")
From the above graph, we can see correlations between different variables in our dataset. For example: hwy, cty, and mpg are positively correlated. While hwy, mpg and cyl are negativelty correlated.
Conclusions:
After analyzing the fueleconomy package from the EPA, the data shows that the average MPG has increased since 1984 to 2015. From around ~17MPG to 23MPG.
According to the data, there is signicant evidence that fuel economy is different for engines with different # of cylinders. The data shows that overall, the more cylinders in a vehicle, the less overall MPG.
There is also signicant evidence that fuel economy is different for engines with different #’s of displacment. According to the data, overall, the more displacment the less the overall MPG.