In response to economic conditions and environmental concerns, fuel economy of cars has had varying levels of importance to consumers. This project investigates changes in fuel economy and relationships between fuel economy and other car attributes.
The following questions are investigated:
Has the fuel economy of vehicles changed from 1984 to 2015?
Is there a difference in fuel economy between engines with different cylinders?
What about vehicles requiring different fuel types (excluding hybrid or electric)?
Is there a relationship between gas prices and fuel economy?
Environmental Protection Agency data is collected from the U.S. Department of Energy’s Fuel Economy Data website and assembled in the fueleconomy R package. The data is stored in the vehicles data set. The cases in this data set represent model years of cars (makes and models) between 1984 and 2015 for which there exist at least ten years of data and complete data. Further information about the dataset is available via the package reference manual.
The structure of the data set is displayed below:
Classes 'tbl_df', 'tbl' and 'data.frame': 33442 obs. of 12 variables:
$ id : int 27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
$ make : chr "AM General" "AM General" "AM General" "AM General" ...
$ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
$ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
$ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
$ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
$ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
$ cyl : int 4 4 6 6 4 6 6 4 4 6 ...
$ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
$ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
$ hwy : int 17 17 13 13 17 13 21 26 28 26 ...
$ cty : int 18 18 13 13 16 13 14 20 22 18 ...
This is an observational study, as the data only monitors the variables — there is no assigning of any potential explanatory variables. Because of this, causal conclusions are potentially treacherous, and the research focuses on identifying associations.
Gas price data is collected from the U.S. Energy Information Administration. Data is downloaded from Table 9.4 of EIA’s Monthly Energy Review as an Excel file. The years and prices for “All Grades of Gasoline, U.S. City Average Retail Price” are stored in a csv and loaded into the gas data frame.
The structure of the gas data is displayed below:
'data.frame': 32 obs. of 2 variables:
$ year : int 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 ...
$ price: num 1.198 1.196 0.931 0.957 0.964 ...
In order to analyze fuel economy, the city and highway fuel economies are combined into a single value. Per EPA Guidelines,
Combined fuel economy is a weighted average of City and Highway MPG values that is calculated by weighting the City value by 55% and the Highway value by 45%.
The combined fuel economy for each case is calculated and stored in the variable mpg:
vehicles$mpg <- 0.55 * vehicles$cty + 0.45 * vehicles$hwy
The combined fuel economy, mpg, is the response variable of interest in this investigation. Brief summary statistics of this variable, as well as its distribution, are presented below.
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.80 16.70 19.70 20.22 22.70 123.20
The summary statistics and both charts indicate that the distribution of gas mileage is very strongly right-skewed. It appears that most values over roughly 35 miles per gallon are outliers from the population. As these outliers strongly affect the distribution and may lead to violation of conditions necessary for inference, they are removed from the dataset:
vehicles <- vehicles[!vehicles$mpg %in% boxplot.stats(vehicles$mpg)$out, ]
Summary statistics and distribution are again presented for mpg with outliers removed.
Min. 1st Qu. Median Mean 3rd Qu. Max.
7.80 16.70 19.60 19.73 22.35 31.70
The filtered data range from 7.8 miles per gallon to 31.7 miles per gallon and are centered just below 20 miles per gallon. The distribution of the data is nearly normal, with a possible very slight right skew.
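The outlier filter above relies on boxplot.stats, which flags values lying more than 1.5 times the interquartile range beyond the hinges. The following is a minimal sketch of that rule on made-up numbers (mpg_toy is illustrative and is not taken from the vehicles data); it recomputes the fences by hand and checks that they flag the same values as boxplot.stats.

```r
# Toy data standing in for vehicles$mpg (illustrative values only)
mpg_toy <- c(12, 15, 17, 18, 19, 20, 21, 22, 24, 26, 45, 120)

stats <- boxplot.stats(mpg_toy)          # default coef = 1.5
hinges <- stats$stats[c(2, 4)]           # lower and upper hinge

# Fences sit 1.5 hinge-spreads beyond each hinge
fence_width <- 1.5 * diff(hinges)
lower_fence <- hinges[1] - fence_width
upper_fence <- hinges[2] + fence_width

# Values outside the fences, found by hand and by boxplot.stats
flagged_manual <- mpg_toy[mpg_toy < lower_fence | mpg_toy > upper_fence]
flagged_builtin <- stats$out

# Same filtering idiom as used on the vehicles data
mpg_kept <- mpg_toy[!mpg_toy %in% flagged_builtin]
```

Applied to the quartiles reported earlier (16.70 and 22.70), the same arithmetic gives an upper fence of 22.70 + 1.5 × 6.00 = 31.70, which agrees with the maximum of the filtered data. The lower fence of 7.70 falls below the observed minimum, so no low values are dropped. This assumes the hinges coincide with the quartiles printed by summary, which is only approximately guaranteed.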
To view how fuel economy differs by explanatory variable, basic summary statistics (mean and standard deviation) are calculated, and the distributions of mpg visualized.
vehicles_year <- vehicles %>% filter(year == 1984 | year == 2015)
vehicles_year %>%
  group_by(year) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
Source: local data frame [2 x 5]
year n mean median sd
(int) (int) (dbl) (dbl) (dbl)
1 1984 784 17.15944 17.25 4.182516
2 2015 196 23.45281 23.10 4.512880
The distribution for 1984 is centered slightly above 17 miles per gallon, and the distribution for 2015 is centered slightly above 23 miles per gallon. The two distributions have similar variability, with standard deviations between 4 and 5 mpg. Both distributions appear somewhat bimodal, and the distribution for 1984 appears somewhat right-skewed. However, these departures from normality are not severe enough to caution against inference.
vehicles_cyl <- vehicles %>% filter(!is.na(cyl))
vehicles_cyl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
Source: local data frame [9 x 5]
cyl n mean median sd
(int) (int) (dbl) (dbl) (dbl)
1 2 45 18.37000 18.60 0.5289784
2 3 29 29.09310 29.80 1.5587726
3 4 11764 23.51209 23.25 3.3083728
4 5 718 20.85578 20.60 2.7252662
5 6 11884 18.90852 19.05 2.6248608
6 8 7550 15.48340 15.25 2.6219897
7 10 138 14.46920 14.60 1.8040640
8 12 478 13.36056 13.70 1.7624723
9 16 7 10.95714 11.15 0.2405351
The plot shows rather different distributions by number of cylinders. The highest mean is for engines with three cylinders; from there, means decrease steadily as the number of cylinders increases to 16. Two-cylinder engines are the exception, with a mean of roughly 18.4 miles per gallon. The distributions appear largely normal, with varying levels of skewness and spread for different numbers of cylinders. Generally, cylinder counts with more observations have greater spread and less skew.
vehicles_fuel <- vehicles %>% filter(!grepl("Electricity|or|and", fuel))
vehicles_fuel %>%
  group_by(fuel) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
Source: local data frame [5 x 5]
fuel n mean median sd
(chr) (int) (dbl) (dbl) (dbl)
1 CNG 55 17.81364 14.25 7.047579
2 Diesel 699 20.40572 19.15 4.156685
3 Midgrade 43 18.02907 18.05 1.645384
4 Premium 8575 19.51809 19.60 3.785541
5 Regular 22091 19.89339 19.70 4.516686
The different fuel types have distributions centered at different values, with the center of each distribution located between roughly 15 and 20 miles per gallon. The Premium and Regular fuel types have very wide ranges, with a number of potential outliers among vehicles using Premium fuel. The distributions for CNG and Diesel exhibit right-skewness, while the other three fuel types appear to be roughly symmetrical.
Brief summary statistics, as well as a histogram and boxplot, of the retail price of gasoline from 1984-2015 are prepared.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.931 1.195 1.366 1.864 2.541 3.695
From these plots, the distribution of annual average gas prices appears to be right-skewed – there are a number of instances between $1.10 and $1.30 per gallon, with most other bins showing far lower counts. Most occurrences not in the two most populated bins occur at higher values, ranging as high as $3.70.
After fluctuating between roughly $1.00 and $1.25 per gallon from 1984-2002, gas prices grew rapidly through 2008 to slightly less than $3.50 per gallon, and have since fluctuated between roughly $2.50 and $3.75 per gallon.
Statistical inference is performed in an attempt to answer the four questions posed in the introduction:
Has the fuel economy of vehicles changed from 1984 to 2015?
Is there a difference in fuel economy between engines with different cylinders?
What about vehicles requiring different fuel types (excluding hybrid or electric)?
Is there a relationship between gas prices and fuel economy?
To test if there is a difference between fuel economies in 1984 and 2015, a t-test is performed.
For this test, the hypotheses are
\(H_0: \mu_{1984} = \mu_{2015}\) (there is no difference in average fuel economies)
\(H_a: \mu_{1984} \neq \mu_{2015}\) (there is a difference in average fuel economies)
The data are not sampled, but there are no reasons why independence will not hold for the observations in this data set. As observed in section 3.1, the data are not significantly skewed – since the data represent a significant portion of the population, it can be deduced that the population is not significantly skewed. Because the conditions are met, the t-test is performed.
Using the statistics from section 3.2.1, the test point estimate for the difference in yearly averages is the difference in the sample means:
\[\bar{x}_{diff} = \bar{x}_{2015} - \bar{x}_{1984} = 23.45281 - 17.15944 = 6.29337\]
The standard error is given by
\[SE_{diff} = \sqrt{\frac{s_{1984}^2}{n_{1984}} + \frac{s_{2015}^2}{n_{2015}}} = \sqrt{\frac{(4.182516)^2}{784} + \frac{(4.512880)^2}{196}} = 0.3553\]
The \(T\) score associated with this test statistic is
\[T = \frac{\bar{x}_{diff} - \mu_0}{SE_{diff}} = \frac{6.293 - 0}{0.3553} = 17.712\]
Using the conservative choice of degrees of freedom, \(\min(n_{1984}, n_{2015}) - 1 = 195\), the p-value is given by
2 * (1 - pt(17.712, df = 195))
[1] 0
Due to the extremely small p-value, the null hypothesis is rejected — there is significant statistical evidence that fuel economy was different for cars in 1984 and 2015.
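The hand calculation above can be reproduced from the summary statistics alone. The sketch below is one way to do so; the variable names are illustrative. It also computes the Welch-Satterthwaite degrees of freedom as an alternative to the conservative value of 195, and uses lower.tail = FALSE so that the tiny p-value is not rounded to exactly zero by the subtraction from 1.

```r
# Summary statistics copied from the grouped summary for 1984 and 2015
n_1984 <- 784; mean_1984 <- 17.15944; sd_1984 <- 4.182516
n_2015 <- 196; mean_2015 <- 23.45281; sd_2015 <- 4.512880

# Point estimate and standard error of the difference in means
diff_means <- mean_2015 - mean_1984
v_1984 <- sd_1984^2 / n_1984
v_2015 <- sd_2015^2 / n_2015
se_diff <- sqrt(v_1984 + v_2015)
t_stat <- diff_means / se_diff

# Conservative degrees of freedom (smaller sample size minus one)
df_conservative <- min(n_1984, n_2015) - 1

# Welch-Satterthwaite approximation (an alternative, not used in the text)
df_welch <- (v_1984 + v_2015)^2 /
  (v_1984^2 / (n_1984 - 1) + v_2015^2 / (n_2015 - 1))

# Two-sided p-values computed from the upper tail directly
p_conservative <- 2 * pt(abs(t_stat), df = df_conservative, lower.tail = FALSE)
p_welch <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)
```

With the full data available, t.test(mpg ~ year, data = vehicles_year) should give a comparable result, since t.test does not assume equal variances by default.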
To test for difference in the means between groups with different number of engine cylinders, an ANOVA test is performed.
For this test, the hypotheses are
\(H_0: \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6 = \mu_8 = \mu_{10} = \mu_{12} = \mu_{16}\) (all means are the same)
\(H_a:\) \(\mu_i \neq \mu_j\) for at least one pair of cylinder counts \(i, j\) (at least one mean differs from the others)
As outlined in section 4.1.1, the conditions for independence and approximate normality are met. The summary by number of cylinders in section 3.2.2 shows that the variance across groups is approximately equal. All conditions are met, so the ANOVA test is performed.
To perform the ANOVA test, a linear regression is performed, and an ANOVA table is created using the anova function
fit_cyl <- lm(mpg ~ as.factor(cyl), data = vehicles_cyl)
anova(fit_cyl)
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(cyl) 8 339733 42467 5130.9 < 2.2e-16 ***
Residuals 32604 269854 8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Due to the extremely low p-value returned, the null hypothesis is rejected — there is statistically significant evidence that the fuel economy of vehicles is different for engines with different numbers of cylinders.
To test for difference in the means between groups with different fuel types, an ANOVA test is performed.
For this test, the hypotheses are
\(H_0: \mu_{CNG} = \mu_{Diesel} = \mu_{Midgrade} = \mu_{Premium} = \mu_{Regular}\) (all means are the same)
\(H_a:\) \(\mu_i \neq \mu_j\) for at least one pair of fuel types \(i, j\) (at least one mean differs from the others)
As outlined in section 4.1.1, the conditions for independence and approximate normality are met. The summary by fuel type in section 3.2.3 shows that the variance between groups is approximately equal. All conditions are met, so the ANOVA test is performed.
To perform the ANOVA test, a linear regression is performed, and an ANOVA table is created using the anova function
fit_fuel <- lm(mpg ~ fuel, data = vehicles_fuel)
anova(fit_fuel)
Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
fuel 4 1482 370.51 19.81 2.647e-16 ***
Residuals 31458 588370 18.70
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Due to the extremely low p-value returned, the null hypothesis is rejected — there is statistically significant evidence that the fuel economy of vehicles is different for cars using different fuel types.
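As a consistency check on the two ANOVA tables, the F statistic can be rebuilt from the printed sums of squares and degrees of freedom: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the group mean square to the residual mean square. The helper f_from_table below is a hypothetical name introduced for this sketch, and the inputs are the rounded values printed in the tables, so small rounding differences are expected.

```r
# Hypothetical helper: rebuild the F statistic and p-value from the sums of
# squares and degrees of freedom printed in a one-way ANOVA table.
f_from_table <- function(ss_group, df_group, ss_resid, df_resid) {
  ms_group <- ss_group / df_group        # mean square between groups
  ms_resid <- ss_resid / df_resid        # mean square within groups
  f_stat <- ms_group / ms_resid
  # Upper-tail probability of the F distribution
  p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)
  c(ms_group = ms_group, ms_resid = ms_resid,
    f_stat = f_stat, p_value = p_value)
}

# Values copied from the cylinder and fuel type ANOVA tables
cyl_check <- f_from_table(339733, 8, 269854, 32604)
fuel_check <- f_from_table(1482, 4, 588370, 31458)
```

The recomputed F statistics should land close to the reported 5130.9 and 19.81. The much smaller F for fuel type reflects how little of the total variation lies between fuel groups compared with cylinder groups.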
To investigate the relationship between fuel economy and gas prices, a linear regression will be performed. In order to prepare for this analysis, a data frame is constructed matching the cases of the vehicles data set with the gas prices in the gas data set.
vehicles <- inner_join(vehicles, gas, by = "year")
To explore the relationship, a scatterplot of the two variables is created, with a jitter on the price, since there is only one value for each year.
No relationship is immediately visible from the scatterplot, although it can be observed that most of the very high fuel economy values occur in years with high gas prices.
A linear regression is conducted and the fit summarized:
fit_gas <- lm(mpg ~ price, vehicles)
To check the conditions for linear regression, diagnostic plots are created:
There does not appear to be any pattern in the residuals in the scatterplot, so the condition of linearity can be accepted. The histogram indicates that the residuals are normally distributed. Finally, the scatterplot and Q-Q plot indicate that the residuals exhibit near-constant variability. Because the conditions are met, linear regression is continued.
The linear model fitting fuel economy and gas price is summarized:
summary(fit_gas)
Call:
lm(formula = mpg ~ price, data = vehicles)
Residuals:
Min 1Q Median 3Q Max
-11.4266 -3.2139 -0.1327 2.6060 12.6653
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.30388 0.05343 342.60 <2e-16 ***
price 0.75816 0.02555 29.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.266 on 32615 degrees of freedom
Multiple R-squared: 0.02629, Adjusted R-squared: 0.02626
F-statistic: 880.8 on 1 and 32615 DF, p-value: < 2.2e-16
The equation returned by the linear regression is \[\widehat{mpg} = 18.304 + 0.758 \times price\]
The p-values associated with the coefficients of the linear regression are very small, indicating that they have statistical significance. However, the \(R^2\) value associated with the regression is quite low – less than 3% of variation in fuel economy can be explained by variation in gas prices.
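To give a sense of the practical size of this effect, the fitted coefficients can be used to compare predicted fuel economy at two gas prices. The sketch below uses the coefficients from the summary output; predict_mpg is a hypothetical helper, and the prices of $1.00 and $3.00 are chosen only for illustration.

```r
# Coefficients and R-squared copied from the regression summary
intercept <- 18.30388
slope <- 0.75816
r_squared <- 0.02629

# Hypothetical helper: predicted combined mpg at a given gas price (USD/gal)
predict_mpg <- function(price) intercept + slope * price

# Predicted fuel economy at two illustrative prices
mpg_at_1 <- predict_mpg(1)
mpg_at_3 <- predict_mpg(3)
mpg_change <- mpg_at_3 - mpg_at_1        # equals 2 * slope

# Correlation implied by R-squared; its sign follows the slope
r <- sign(slope) * sqrt(r_squared)
```

Under this model, a $2.00 increase in the price of gas is associated with an increase of about 1.5 miles per gallon in predicted fuel economy, and the implied correlation is only about 0.16. Both figures are consistent with the weak association described above.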
The regression line is plotted over the scatterplot from above:
Investigating the EPA fuel economy data collected in the fueleconomy package, outliers in the combined gas mileage were identified. With these outliers removed, the distribution of fuel economy was found to be nearly normal. Using this modified data set, the following findings were reached, with strong statistical significance:
Fuel economy was different for cars in 1984 than it was in 2015
Fuel economy of vehicles is different for engines with different numbers of cylinders
Fuel economy of vehicles is different for cars using different fuel types
While not investigated as part of this project, the data set provides a number of other variables that could be used for further analysis. Differences in fuel economy based on other categorical variables (e.g., make, vehicle class, transmission, drive) could be investigated through appropriate hypothesis testing. Multiple regression could be performed to attempt to predict the gas mileage of a vehicle based upon its characteristics.
Additionally, a linear regression was performed comparing combined fuel economy to average gas prices. While the regression line produced (\(\widehat{mpg} = 18.304 + 0.758 \times price\)) has statistically significant coefficients, it only explains a small percent of the variation in fuel economy. This suggests that there are other variables affecting fuel economy – the completion of a multiple regression including more variables from this data set may be able to explain a higher percentage of the variation in fuel economy. The creation of such a model may be useful for someone seeking to predict the fuel economy (and associated fuel costs) for a vehicle with known attributes that has not yet undergone EPA testing for fuel economy.
U.S. Department of Energy Office of Energy Efficiency and Renewable Energy (2016). Download Fuel Economy Data. http://www.fueleconomy.gov/feg/download.shtml
U.S. Department of Energy Office of Energy Efficiency and Renewable Energy (2016). Gasoline Vehicles: Learn More About the New Label. https://www.fueleconomy.gov/feg/label/learn-more-gasoline-label.shtml#fuel-economy
U.S. Energy Information Administration (2016). Monthly Energy Review Table 9.4: Retail Motor Gasoline and On-Highway Diesel Fuel Prices. https://www.eia.gov/beta/MER/index.cfm?tbl=T09.04
Wickham, Hadley (2014). fueleconomy: EPA fuel economy data. R package version 0.1. https://CRAN.R-project.org/package=fueleconomy