Executive Summary

Working at Motor Trend, I was asked to write a paper about the impact of transmission type (automatic or manual) on MPG (miles per gallon). This paper tries to answer the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions

A quick exploratory data analysis (EDA) of the data set (from the 1974 Motor Trend US magazine) first shows the difference in MPG between transmission types; a regression analysis is then performed to confirm, or not, this result.

Source: Henderson and Velleman / mtcars R dataset

Exploratory Analysis

As with any new dataset, we first perform calls to str(), summary(), head(),…

# ?mtcars   # cannot output
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We can then compare mpg for automatic and manual transmissions using the group means and a plot:
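The call that produced the means is not shown in this report. A minimal sketch that would give the same named vector, assuming am is coded 0 = automatic and 1 = manual (as in the mtcars documentation), could be:

```r
# Assumed reconstruction, not the author's original call.
# am is coded 0 = automatic, 1 = manual in mtcars.
group_means <- with(mtcars, c(auto   = mean(mpg[am == 0]),
                              manual = mean(mpg[am == 1])))
group_means
```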

##     auto   manual 
## 17.14737 24.39231
library(ggplot2)  # provides ggplot(), geom_violin(), theme(), labs(), ggtitle()
g <- ggplot(mtcars, aes(y=mpg, x=factor(am), 
        color=factor(am, labels=c('0=auto', '1=manual')))) + geom_violin()
g + theme(legend.title=element_blank()) + labs(x = "Transmission", y = "Miles/Gallon") + 
        ggtitle("Impact of Transmission on Consumption")

Manual transmissions clearly seem to give a higher mpg on average (24.39 vs. 17.15).
This is further confirmed by a t-test (p < .05), as follows:

t.test(mtcars$mpg[mtcars$am==0], mtcars$mpg[mtcars$am==1])
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg[mtcars$am == 0] and mtcars$mpg[mtcars$am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231
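As a hedged aside, the figures printed above can also be read directly from the object returned by t.test(). The variable name tt is only illustrative:

```r
# Store the Welch test result instead of only printing it
tt <- t.test(mtcars$mpg[mtcars$am == 0], mtcars$mpg[mtcars$am == 1])

tt$p.value                  # p-value of the two-sided test
tt$conf.int                 # 95% confidence interval for mean(auto) - mean(manual)
unname(diff(tt$estimate))   # mean(manual) - mean(auto), the raw difference in mpg
```

Because the automatic group is passed first, the interval is for automatic minus manual, which is why both bounds are negative. An interval that excludes 0 agrees with the small p-value.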

But we should not take this result for granted: some additional analysis is needed…

Building a first basic model (mpg with am as single predictor)

fit <- lm(mpg~am, mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

So a manual transmission offers an increase of 7.24 mpg over an automatic one, but only 36% of the variance is explained by transmission, which is quite low…
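With a single 0/1 predictor, the intercept is the mean of the am = 0 group and the am coefficient is the difference between the two group means. A minimal sketch that checks this on the fitted model (refitted here so the block runs on its own):

```r
fit <- lm(mpg ~ am, mtcars)
b <- coef(fit)

b[["(Intercept)"]]               # mean mpg of automatic cars (am = 0)
b[["(Intercept)"]] + b[["am"]]   # mean mpg of manual cars (am = 1)
summary(fit)$r.squared           # share of mpg variance explained by am alone
```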

par(mfrow = c(1,2))
with(mtcars, plot(am, mpg, bg=am, pch=21, ylab="Miles per Gallon", xlab="Transmission"))
abline(fit, lwd = 2, col = "blue")
plot(resid(fit), mtcars$mpg, ylab="Miles per Gallon", xlab='Residuals')

We can visually confirm an intercept at ~17 and a slope of around 7.
We can also confirm that we should definitely improve our model, as there is some pattern left in the residuals.
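One hedged way to see what the residuals still contain is to plot them against a predictor left out of the model. The choice of wt here is an illustration, not something the original analysis shows at this point:

```r
fit <- lm(mpg ~ am, mtcars)
res <- resid(fit)

# Residuals of the am-only model against car weight (wt, in 1000 lb)
plot(mtcars$wt, res, xlab = "Weight (1000 lb)", ylab = "Residuals of mpg ~ am")
abline(h = 0, lty = 2)

# A correlation far from 0 means am alone leaves structure unexplained
cor(res, mtcars$wt)
```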

Building and comparing more models

We apply the nested-models method, comparing successive models with an anova test, to identify the important variables.
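The code for this step is not reproduced in the report. A minimal sketch of the nested-models approach follows. The order in which predictors are added and the intermediate names (fit1 to fit4) are assumptions, and fit_all is assumed to be the model with every predictor, since it is used later without a visible definition:

```r
fit1    <- lm(mpg ~ am, mtcars)            # baseline: transmission only
fit2    <- update(fit1, . ~ . + wt)        # add weight
fit3    <- update(fit2, . ~ . + hp)        # add horsepower
fit4    <- update(fit3, . ~ . + cyl)       # add number of cylinders
fit_all <- lm(mpg ~ ., mtcars)             # all ten predictors (assumed definition)

# Each row tests whether the terms added since the previous model reduce
# the residual sum of squares more than chance would
nested <- anova(fit1, fit2, fit3, fit4, fit_all)
nested
```

A small Pr(>F) on a row suggests the terms added at that step are worth keeping. The conclusions can depend on the order in which the predictors enter.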

Building a better model

We choose to keep cyl, hp and wt to build a better model (anova results with p < .05)

fit_improved <- lm(mpg~am+cyl+hp+wt, mtcars)
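# fit_all is not defined in this excerpt; presumably the model with all predictors from the previous step
anova(fit_improved, fit_all)  # additional check that the removed variables were not adding value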
summary(fit_improved)
## 
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4765 -1.8471 -0.5544  1.2758  5.6608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## am           1.47805    1.44115   1.026   0.3142    
## cyl         -0.74516    0.58279  -1.279   0.2119    
## hp          -0.02495    0.01365  -1.828   0.0786 .  
## wt          -2.60648    0.91984  -2.834   0.0086 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8267 
## F-statistic: 37.96 on 4 and 27 DF,  p-value: 1.025e-10

From the R^2 of our new model we see that ~85% of the variance is explained (against 36% for the first model)!
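Since plain R^2 cannot decrease when predictors are added, it can help to put the adjusted values side by side as well. A minimal sketch, refitting both models so the block runs on its own (the list names am_only and improved are illustrative):

```r
fit          <- lm(mpg ~ am, mtcars)
fit_improved <- lm(mpg ~ am + cyl + hp + wt, mtcars)

# One column per model, one row per statistic
r2 <- sapply(list(am_only = fit, improved = fit_improved), function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
})
r2
```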

We can now perform last tests on our final model…

par(mfrow = c(2,2))
plot(fit_improved)
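The four diagnostic plots can be complemented with a few numeric checks. This is a sketch using base R only; the choice of checks is an assumption, not the author's own list:

```r
fit_improved <- lm(mpg ~ am + cyl + hp + wt, mtcars)

# Normality of residuals (complements the Q-Q plot)
sw <- shapiro.test(resid(fit_improved))
sw

# Most influential observations (complements the residuals-vs-leverage plot)
head(sort(cooks.distance(fit_improved), decreasing = TRUE), 3)
head(sort(hatvalues(fit_improved), decreasing = TRUE), 3)
```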

All seems ok as:

Conclusion

Although manual transmissions do have a positive impact on MPG compared with automatic ones (+1.48 mpg), the effect turns out to be much smaller than our first model suggested (+7.24 mpg), and it is no longer statistically significant (p = 0.31) once the other variables are included. Further analysis showed that transmission by itself was not responsible for the difference.
While improving the model, we found several other variables with a greater impact (either negative or positive), for example weight (‘wt’, with -2.61 mpg per 1000 lb).
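To put an uncertainty range on the +1.48 mpg transmission effect, the confidence interval of the am coefficient can be inspected. A minimal sketch, refitting the final model so it runs on its own (the name ci_am is illustrative):

```r
fit_improved <- lm(mpg ~ am + cyl + hp + wt, mtcars)

# 95% confidence interval for the manual-vs-automatic effect,
# holding cyl, hp and wt fixed
ci_am <- confint(fit_improved)["am", ]
ci_am
```

An interval that contains 0 means the data do not rule out "no difference" between transmission types at the 5% level, once the other predictors are accounted for.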