With growing concern about global warming and greenhouse gases, the auto industry strives to produce engines that perform better while consuming less fuel. The most common metric for fuel consumption is miles per gallon (MPG): the number of miles a vehicle can travel on 1 US gallon (about 3.785 liters) of fuel.
Multiple factors affect this metric. One of them, as we try to show in this paper, is the choice between an automatic and a manual transmission. Specifically, we’ll try to answer 2 questions:
To answer these questions, we’ll use a data set comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models), as reported by the US magazine Motor Trend in 1974. This data set ships with R as mtcars in the datasets package.
We’ll use this data to fit several models, select the most appropriate one for explaining the relationship between transmission and MPG, and check it with residual plots and diagnostics.
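The code that prepares the data is not shown in this report. A minimal sketch of how the my.mtcars data frame and its am_factor column used below could be built, assuming they are simply a copy of mtcars with the 0/1 transmission code turned into a labelled factor:

```r
# Assumption: my.mtcars is a copy of the built-in mtcars data set with an
# extra labelled transmission factor; the original preparation may differ.
my.mtcars <- mtcars

# In mtcars, am is coded 0 = automatic, 1 = manual.
my.mtcars$am_factor <- factor(my.mtcars$am,
                              levels = c(0, 1),
                              labels = c("Automatic", "Manual"))

# Quick look at the relevant columns and the group sizes.
str(my.mtcars[, c("mpg", "am", "am_factor")])
table(my.mtcars$am_factor)
```

Putting "Automatic" first makes it the reference level, which matches the am_factorManual coefficient name in the model output.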
Let’s check the data
At first glance (Appendix - Figure 1), transmission type appears to affect MPG: cars with manual transmissions have a higher average MPG than automatics (24.39 vs 17.15). But is this enough?
Let’s check how much of the variance this explains by fitting a linear model with a single predictor (Transmission Type):
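The two averages quoted above can be reproduced directly from the data. This is a small sketch that rebuilds the labelled factor under the same assumption as elsewhere in this report (my.mtcars is mtcars plus am_factor); group.means is a name introduced here for illustration:

```r
# Assumption: my.mtcars is mtcars with a labelled transmission factor.
my.mtcars <- mtcars
my.mtcars$am_factor <- factor(my.mtcars$am,
                              levels = c(0, 1),
                              labels = c("Automatic", "Manual"))

# Mean MPG within each transmission group.
group.means <- tapply(my.mtcars$mpg, my.mtcars$am_factor, mean)
round(group.means, 2)
```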
# Fit a linear model
one.variable.model <- lm(mpg ~ am_factor, my.mtcars)
summary(one.variable.model)
##
## Call:
## lm(formula = mpg ~ am_factor, data = my.mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       17.147      1.125  15.247 1.13e-15 ***
## am_factorManual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Looking at the coefficients, the mean MPG for vehicles with manual transmissions is 7.245 higher than for vehicles with automatic transmissions, and the difference is statistically significant (p = 0.000285). Unfortunately, the R^2 value tells us that this model explains only ~36% of the variance in MPG.
The pairs plot (Appendix - Figure 2) shows that mpg is strongly correlated with the number of cylinders (cyl), displacement (disp), horsepower (hp), weight (wt) and transmission type (am).
So, let’s look for a model that uses those variables correlated with MPG. The R function step performs stepwise selection: starting from an initial model, it repeatedly adds or drops one term at a time, keeping the change that most improves the AIC, and returns the final model once no single change improves it further.
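With a single two-level factor as predictor, the intercept is the mean of the reference group and the slope is the difference between the group means. The sketch below checks this and adds a confidence interval for the difference; it assumes my.mtcars is mtcars with a labelled am_factor column, and fit and group.means are illustrative names:

```r
# Assumption: my.mtcars is mtcars with a labelled transmission factor.
my.mtcars <- mtcars
my.mtcars$am_factor <- factor(my.mtcars$am,
                              levels = c(0, 1),
                              labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am_factor, data = my.mtcars)
group.means <- tapply(my.mtcars$mpg, my.mtcars$am_factor, mean)

# Intercept = mean MPG of automatics; slope = manual mean minus automatic mean.
coef(fit)
diff(group.means)

# 95% confidence interval for the manual vs automatic difference.
confint(fit, "am_factorManual", level = 0.95)

# Share of the variance in MPG explained by transmission type alone.
summary(fit)$r.squared
```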
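The correlations read off the pairs plot can also be computed numerically. A short sketch on the built-in mtcars data (vars and cors are names introduced here):

```r
# Pearson correlation of mpg with each candidate predictor.
vars <- c("cyl", "disp", "hp", "wt", "am")
cors <- sapply(mtcars[vars], cor, y = mtcars$mpg)
round(cors, 2)
```

Note that cyl and am are treated as numeric codes here, so these values are only a rough screening aid, not a substitute for the plot.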
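The call that produced best.model is not shown in this report. The following is a minimal sketch of how such a selection could be run, assuming the starting model contains the variables named above; the starting formula, the direction argument and the names full.model and selected.model are assumptions, and the result is not guaranteed to match the author’s best.model:

```r
# Assumption: my.mtcars is mtcars with a labelled transmission factor.
my.mtcars <- mtcars
my.mtcars$am_factor <- factor(my.mtcars$am,
                              levels = c(0, 1),
                              labels = c("Automatic", "Manual"))

# Hypothetical starting model with the predictors correlated with mpg.
full.model <- lm(mpg ~ am_factor + factor(cyl) + disp + hp + wt,
                 data = my.mtcars)

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step log.
selected.model <- step(full.model, direction = "both", trace = 0)

formula(selected.model)
AIC(full.model)
AIC(selected.model)
```

Because step only moves one term at a time, it does not evaluate every possible subset of predictors, so the returned model is a local optimum with respect to AIC.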
summary(best.model)
##
## Call:
## lm(formula = mpg ~ am_factor + factor(cyl) + hp + wt, data = my.mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9387 -1.2560 -0.4013  1.1253  5.0513
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     33.70832    2.60489  12.940 7.73e-13 ***
## am_factorManual  1.80921    1.39630   1.296  0.20646
## factor(cyl)6    -3.03134    1.40728  -2.154  0.04068 *
## factor(cyl)8    -2.16368    2.28425  -0.947  0.35225
## hp              -0.03211    0.01369  -2.345  0.02693 *
## wt              -2.49683    0.88559  -2.819  0.00908 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
The step function returned a model that includes the transmission type (am_factor), the number of cylinders (factor(cyl)), the horsepower (hp) and the weight (wt) as the variables that best explain the variance in MPG. This model explains ~87% of the variance (adjusted R^2 of 0.84), but the estimated transmission effect shrinks to 1.81 MPG and is no longer statistically significant (p = 0.206) once the other variables are held fixed.
Let’s compare the two models:
anova(one.variable.model, best.model)
Looking at the above results, the p-value obtained is highly significant, so we reject the null hypothesis that the confounding variables cyl, hp and wt do not improve the model.
We’ll now examine the residuals using diagnostic plots:
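For two nested linear models, anova carries out an F test on the reduction in residual sum of squares. The sketch below refits both models on mtcars (assuming my.mtcars is mtcars with a labelled am_factor) and recomputes the F statistic by hand; small, large and the other names are introduced here for illustration:

```r
# Assumption: my.mtcars is mtcars with a labelled transmission factor.
my.mtcars <- mtcars
my.mtcars$am_factor <- factor(my.mtcars$am,
                              levels = c(0, 1),
                              labels = c("Automatic", "Manual"))

small <- lm(mpg ~ am_factor, data = my.mtcars)
large <- lm(mpg ~ am_factor + factor(cyl) + hp + wt, data = my.mtcars)

comparison <- anova(small, large)
p.value <- comparison[["Pr(>F)"]][2]

# The same F statistic from the residual sums of squares:
# (drop in RSS per added coefficient) / (residual variance of larger model).
rss.small <- deviance(small)
rss.large <- deviance(large)
extra.df <- df.residual(small) - df.residual(large)
f.stat <- ((rss.small - rss.large) / extra.df) /
  (rss.large / df.residual(large))

comparison
f.stat
```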
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(best.model)
From these plots we can see:
The Residuals vs. Fitted plot shows points scattered randomly around zero with no clear pattern, which supports the independence condition.
The Normal Q-Q plot suggests that the residuals are approximately normally distributed, since almost all of the points fall on the line.
The Scale-Location plot shows points scattered in a band of roughly constant width, indicating constant variance.
A few distinct points of interest in the top right of the plots may be outliers or high-leverage points.
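The points flagged in the plots can be followed up numerically. The sketch below is one possible check rather than part of the original analysis: it refits the selected model on mtcars (assuming my.mtcars is mtcars with a labelled am_factor) and lists the cars with the highest leverage and Cook’s distance; fit, leverage and cooks.d are illustrative names:

```r
# Assumption: my.mtcars is mtcars with a labelled transmission factor.
my.mtcars <- mtcars
my.mtcars$am_factor <- factor(my.mtcars$am,
                              levels = c(0, 1),
                              labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am_factor + factor(cyl) + hp + wt, data = my.mtcars)

# Leverage (diagonal of the hat matrix): how unusual each car's predictors are.
leverage <- hatvalues(fit)
head(sort(leverage, decreasing = TRUE), 3)

# Cook's distance: how much the fitted values change if a car is left out.
cooks.d <- cooks.distance(fit)
head(sort(cooks.d, decreasing = TRUE), 3)

# Formal normality check to complement the Q-Q plot.
shapiro.test(residuals(fit))
```

The leverage values always sum to the number of estimated coefficients, so a car whose leverage is several times the average value is worth a closer look.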
Looking at the results we can see that:
Figure 1
Figure 2