Working at Motor Trend, I was asked to write a paper about the impact of transmission type (either automatic or manual) on MPG (Miles Per Gallon). This paper tries to answer the following 2 questions:
A quick exploratory data analysis (EDA) of the data set (from the 1974 Motor Trend US magazine) first shows the difference in MPG between transmission types; a regression analysis is then performed to confirm, or not, these results.
Source: Henderson and Velleman / mtcars R dataset
As with any new dataset, we first perform calls to str(), summary(), head(),…
# ?mtcars # cannot output
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
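The other exploratory calls mentioned above are not shown; a minimal sketch of what they might look like follows. The variable name `counts` is introduced here for illustration only.

```r
# mtcars ships with base R (package 'datasets'), so no import is needed
head(mtcars, 3)        # first three rows
summary(mtcars$mpg)    # min, quartiles, mean and max of mpg

# Number of cars per transmission code (0 = automatic, 1 = manual)
counts <- table(mtcars$am)
counts
```

Checking the group sizes early is useful because the two transmission groups are not the same size, which matters when comparing their means.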
We can then compare mpg for automatic and manual transmissions using group means and a plot:
## auto manual
## 17.14737 24.39231
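The call that produced the two means above is not shown. One way to reproduce them, as a sketch (the names `am_label` and `group_means` are illustrative assumptions, not taken from the original):

```r
# Label the 0/1 transmission code, then average mpg within each group
am_label <- factor(mtcars$am, levels = c(0, 1), labels = c("auto", "manual"))
group_means <- tapply(mtcars$mpg, am_label, mean)
group_means
```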
library(ggplot2) # required for ggplot()
g <- ggplot(mtcars, aes(y=mpg, x=factor(am),
color=factor(am, labels=c('0=auto', '1=manual')))) + geom_violin()
g + theme(legend.title=element_blank()) + labs(x = "Transmission", y = "Miles/Gallon") +
ggtitle("Impact of Transmission on Consumption")
Cars with manual transmission clearly seem to have a higher mpg on average (24.39 vs. 17.15).
This is further confirmed by a t-test (p < .05), as follows:
t.test(mtcars$mpg[mtcars$am==0], mtcars$mpg[mtcars$am==1])
##
## Welch Two Sample t-test
##
## data: mtcars$mpg[mtcars$am == 0] and mtcars$mpg[mtcars$am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
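The same test can be written with the formula interface, which also makes it easy to pull out individual numbers instead of reading them off the printout. This is a sketch, and the names `tt` and `diff_means` are illustrative:

```r
# Welch two-sample t-test (var.equal = FALSE is the default)
tt <- t.test(mpg ~ am, data = mtcars)

tt$p.value     # p-value of the test
tt$conf.int    # 95% confidence interval for mean(auto) - mean(manual)

# Difference in group means (automatic minus manual)
diff_means <- unname(tt$estimate[1] - tt$estimate[2])
diff_means
```

Because the difference is computed as automatic minus manual, a negative interval means manual cars have the higher average mpg.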
But we should not take this result for granted; some additional analysis is needed…
fit <- lm(mpg~am, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
So manual transmission offers an increase of 7.24 mpg over automatic, but only 36% of the variance is explained by transmission, which is quite low…
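With a single 0/1 predictor, the regression restates the group means: the intercept is the automatic mean and the slope is the manual-minus-automatic difference. A short sketch to check this (`group_means` and `slope_matches` are illustrative names):

```r
fit <- lm(mpg ~ am, data = mtcars)

coef(fit)                 # intercept and slope
summary(fit)$r.squared    # share of variance explained by am alone

# The slope on am should equal the difference between the two group means
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
slope_matches <- all.equal(unname(coef(fit)["am"]),
                           unname(group_means["1"] - group_means["0"]))
slope_matches
```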
par(mfrow = c(1,2))
with(mtcars, plot(am, mpg, bg=am, pch=21, ylab="Miles per Gallon", xlab="Transmission"))
abline(fit, lwd = 2, col = "blue")
plot(resid(fit), mtcars$mpg, ylab="Miles per Gallon", xlab='Residuals')
We can visually confirm an intercept at ~17 and a slope around 7.
We can also confirm that we should definitely improve our model, as some pattern is left in the residuals.
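One way to look for such a pattern numerically is to correlate the residuals with variables left out of the model. The sketch below is not part of the original analysis; the choice of `cyl`, `hp` and `wt` and the name `resid_cor` are assumptions made for illustration:

```r
fit <- lm(mpg ~ am, data = mtcars)
res <- resid(fit)

# Correlation between the residuals and predictors not yet in the model;
# values far from zero suggest the variable carries information am misses
resid_cor <- sapply(mtcars[, c("cyl", "hp", "wt")], function(x) cor(res, x))
resid_cor
```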
We choose to keep cyl, hp and wt to build a better model (anova results with p < .05):
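fit_improved <- lm(mpg~am+cyl+hp+wt, mtcars)
fit_all <- lm(mpg~., mtcars) # assumed definition (not shown in the source): model with all predictors
anova(fit_improved, fit_all) # additional check that the removed variables were not adding value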
summary(fit_improved)
##
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4765 -1.8471 -0.5544 1.2758 5.6608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14654 3.10478 11.642 4.94e-12 ***
## am 1.47805 1.44115 1.026 0.3142
## cyl -0.74516 0.58279 -1.279 0.2119
## hp -0.02495 0.01365 -1.828 0.0786 .
## wt -2.60648 0.91984 -2.834 0.0086 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8267
## F-statistic: 37.96 on 4 and 27 DF, p-value: 1.025e-10
The R^2 of our new model shows that ~85% of the variance is now explained (against 36% before)!
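The gain can also be checked with an F-test between the two nested models, and with adjusted R^2, which penalises the extra predictors. This is a sketch; `cmp` and `adj_r2` are illustrative names:

```r
fit          <- lm(mpg ~ am, data = mtcars)
fit_improved <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)

# F-test: do cyl, hp and wt jointly improve on the am-only model?
cmp <- anova(fit, fit_improved)
cmp

# Adjusted R^2 for both models
adj_r2 <- c(simple   = summary(fit)$adj.r.squared,
            improved = summary(fit_improved)$adj.r.squared)
adj_r2
```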
We can now perform last tests on our final model…
par(mfrow = c(2,2))
plot(fit_improved)
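The four diagnostic plots can be complemented with a few numeric checks. The sketch below uses base R functions only; which checks to run is a suggestion rather than part of the original analysis, and `hv` and `cd` are illustrative names:

```r
fit_improved <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)

# Normality of the residuals (Shapiro-Wilk test)
shapiro.test(resid(fit_improved))

# Leverage: hat values always sum to the number of coefficients (5 here)
hv <- hatvalues(fit_improved)
head(sort(hv, decreasing = TRUE), 3)

# Influence: the cars with the largest Cook's distance
cd <- cooks.distance(fit_improved)
head(sort(cd, decreasing = TRUE), 3)
```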
All seems ok as:
Manual transmission still has a positive impact on MPG compared with automatic (+1.48 mpg), but it turns out to be much smaller than our first model suggested (+7.24 mpg), and it is no longer statistically significant (p = 0.31). Further analysis showed that transmission by itself was not responsible for the whole difference.
While improving the model, we found several other variables with a greater impact (either negative or positive), for example weight (‘wt’, with -2.61 mpg per 1000 lbs).
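Confidence intervals on the coefficients put these numbers in context. In this sketch (`ci` is an illustrative name), the 95% interval for am spans zero, consistent with its p-value of 0.31, while the interval for wt lies entirely below zero:

```r
fit_improved <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)

# 95% confidence intervals for every coefficient
ci <- confint(fit_improved, level = 0.95)
ci["am", ]    # effect of manual vs. automatic, other variables held fixed
ci["wt", ]    # effect of 1000 lbs of extra weight
```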