1. Introduction

Given today’s environmental concerns and economic challenges, fuel economy matters a great deal to car buyers, who want to know what MPG to expect from a new vehicle. A fuel-efficient vehicle can save money as gas prices rise. This study investigates how fuel economy has changed and how it relates to the other attributes in the data. Here we focus on the following questions:

2. Data

The data comes from the R package fueleconomy, which sources its data from the EPA (Environmental Protection Agency). Within the package, the data is stored in the vehicles dataset.

The fuel economy data covers all cars sold in the US from 1984 to 2015. The vehicles dataset has 33,442 rows and 12 variables.

This is an observational study. The response variable is combined mpg, which is quantitative. The two explanatory variables are number of cylinders and engine displacement. Because the study is observational, the analysis focuses on identifying associations rather than establishing causation.

Further information about the data can be found at the links below:

3. Exploratory data analysis

library(dplyr)
library(ggplot2)

# load data
library(fueleconomy)

vehicles
summary(vehicles)
##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

Per EPA (Environmental Protection Agency) guidelines, combined fuel economy is a weighted average of the city and highway MPG values, with the city value weighted at 55% and the highway value at 45%. The code below applies these weights as a simple arithmetic average of cty and hwy.
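As a quick check of the weighting described above, the sketch below applies the 55/45 weights to one made-up vehicle. The helper name `combined_mpg` and the example values (20 mpg city, 30 mpg highway) are illustrative assumptions and do not come from the dataset.

```r
# Hypothetical helper: arithmetic 55/45 weighting of city and highway mpg,
# matching the formula used in this analysis.
combined_mpg <- function(cty, hwy) {
  0.55 * cty + 0.45 * hwy
}

# Made-up vehicle: 20 mpg city, 30 mpg highway
combined_mpg(20, 30)
# 0.55 * 20 + 0.45 * 30 = 11 + 13.5 = 24.5
```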

# remove rows with missing values (NA)
vehicles <- na.omit(vehicles)

# combined mpg: 55% city + 45% highway
vehicles <- vehicles %>% mutate(mpg = 0.55 * cty + 0.45 * hwy)
vehicles
# dimensions
dim(vehicles)
## [1] 33384    13
unique(vehicles$cyl)
## [1]  4  6  5  8 12 10 16  3  2
summary(vehicles$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.80   16.70   19.70   20.11   22.60   54.40
hist(vehicles$mpg, breaks = 50)

qqnorm(vehicles$mpg)
qqline(vehicles$mpg)

vehicles %>%
  group_by(year) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
plot(mpg ~ year, data = vehicles)

ggplot(vehicles, aes(year, mpg)) +
  geom_point(aes( color=factor(cyl), alpha = factor(cyl))) +
  theme_minimal() +
  geom_smooth()
## Warning: Using alpha for a discrete variable is not advised.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# find the make and model with the highest combined mpg
vehicles[which.max(vehicles$mpg),]
any(is.na(vehicles$cyl))
## [1] FALSE
vehicles_cyl <- vehicles %>% filter(!is.na(cyl))

vehicles_cyl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
ggplot(vehicles_cyl, aes(cyl, mpg)) + geom_boxplot(aes(fill = factor(cyl)))

ggplot(vehicles, aes(mpg)) + geom_histogram(bins = 50, aes(fill = factor(cyl)))

The plots above show that, on average, combined mpg decreases as the number of cylinders increases.

vehicles_displ <- vehicles %>% filter(!is.na(displ))

vehicles_displ %>%
  group_by(displ) %>%
  summarise(n = n(), mean = mean(mpg), median = median(mpg), sd = sd(mpg))
ggplot(vehicles_displ, aes(displ, mpg)) + geom_boxplot(aes(fill = factor(displ)))

The displacement plots above show that combined mpg decreases as engine displacement increases.

4. Inference

As the plots above show, on average, combined mpg decreases as the number of cylinders increases. Let’s run an ANOVA (analysis of variance) test. ANOVA uses a hypothesis test to check whether the means across several groups are equal:

\(H_0\): \(\mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6 = \mu_8 = \mu_{10} = \mu_{12} = \mu_{16}\)

\(H_A\): at least one mean \(\mu_k\) differs from the others,

where \(\mu_k\) is the mean combined mpg of vehicles with \(k\) cylinders.

model_mpg_cyl <- lm(mpg~factor(cyl), data = vehicles)
anova(model_mpg_cyl)

Since the p-value is low (<0.05), we reject \(H_0\) in favor of \(H_A\). Therefore mean mpg is not the same across all cylinder counts.
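The table returned by `anova()` is a data frame, so the p-value can also be read programmatically instead of by eye. The following is a minimal sketch on simulated data. The data frame `toy`, its values, and the helper `anova_p_value` are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: fit a one-way ANOVA and return the p-value of the F-test.
anova_p_value <- function(data, response, group) {
  # Build a formula such as mpg ~ factor(cyl) from the column names
  f <- as.formula(paste(response, "~ factor(", group, ")"))
  fit <- lm(f, data = data)
  # The first row of the table is the grouping term; "Pr(>F)" holds its p-value
  anova(fit)[["Pr(>F)"]][1]
}

# Simulated example: three groups whose true means differ
set.seed(1)
toy <- data.frame(
  cyl = rep(c(4, 6, 8), each = 30),
  mpg = c(rnorm(30, mean = 25), rnorm(30, mean = 20), rnorm(30, mean = 15))
)
p <- anova_p_value(toy, "mpg", "cyl")
p < 0.05  # TRUE means H_0 is rejected at the 5% level
```

With the real data, `anova_p_value(vehicles, "mpg", "cyl")` should return the p-value reported by `anova(model_mpg_cyl)`.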

Similarly, the displacement plots above show that combined mpg decreases as engine displacement increases.

ANOVA test with displacement:

# displacement is treated as a categorical factor, one group per distinct value
model_mpg_displ <- lm(mpg ~ factor(displ), data = vehicles)
anova(model_mpg_displ)

Since the p-value in this case is also low (<0.05), we reject \(H_0\) in favor of \(H_A\). Therefore mean mpg is not the same across all engine displacements.

The 2000 Honda Insight has the highest combined mpg (54.4).

The plots of mpg against year above show that mpg has increased over the years.

5. Conclusion

Analysis of the fueleconomy package data from the EPA shows that average MPG increased from ~17 MPG in 1984 to ~23 MPG in 2015. There is significant evidence that combined MPG differs across engines with different numbers of cylinders, and vehicles with more cylinders tend to get lower combined MPG. There is also significant evidence that fuel economy differs across engine displacements: larger displacement is associated with lower overall MPG.

Although I did not investigate them in this project, the data provides a number of other variables that could be analyzed further. Differences in fuel economy across other categorical variables (e.g. make, model, class, transmission, drive) could be investigated through appropriate hypothesis tests. We could also fit a multiple regression to predict the gas mileage of a vehicle from these characteristics.
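As a starting point for that follow-up, a multiple regression could look like the sketch below. It is not part of the analysis above: the helper name `fit_mpg_model` and the choice of predictors (`cyl`, `displ`, `year`, `drive`) are assumptions, and it expects a data frame that already contains the `mpg` column computed in Section 3.

```r
# Hypothetical sketch: regress combined mpg on several vehicle attributes.
fit_mpg_model <- function(data) {
  # cyl, displ and year enter as numeric predictors; drive as a categorical one
  lm(mpg ~ cyl + displ + year + factor(drive), data = data)
}

# Usage with the vehicles data prepared earlier (requires the fueleconomy package):
# fit <- fit_mpg_model(vehicles)
# summary(fit)                             # coefficients and R-squared
# predict(fit, newdata = vehicles[1:5, ])  # fitted mpg for the first five rows
```

Treating `cyl` and `displ` as numeric here, rather than as factors as in the ANOVA above, is a modeling choice; since the two are likely correlated, their individual coefficients should be interpreted with care.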