In this brief analysis we explore how the variables in the mtcars dataset relate to fuel efficiency, measured in miles per gallon (MPG), in order to identify which of them best predict MPG. We fit a series of models to find the most effective predictors and, in particular, to assess whether a manual or an automatic transmission makes a difference.
The first question is whether automatic or manual transmissions achieve better MPG. The chart in Fig 1 in the appendix shows that cars with manual transmissions have clearly higher MPG. We can quantify this difference with a simple linear model.
mtcars$trans <- ifelse(mtcars$am == 0,"auto", "manual")
fit <- lm(mpg ~ am, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
fit$coefficients[1]
## (Intercept)
## 17.14737
sum(fit$coefficients)
## [1] 24.39231
The intercept gives the mean MPG for automatic transmissions (am = 0), 17.15 MPG, and the sum of the two coefficients gives the mean for manual transmissions (am = 1), 24.39 MPG, a difference of about 7.2 MPG. This can be observed in Fig 1 as well.
However, the model summary shows a low R-squared of 0.36: transmission type alone explains only about 36% of the variance in MPG. Although “am” is statistically significant (p = 0.000285, below the 0.001 level), we can check whether the other variables give a better model for predicting MPG.
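As a quick cross-check, assuming only base R and the built-in mtcars data, the two values above can be recovered directly as group means, because a regression on a single 0/1 indicator reproduces the mean of each group. The object names below (`fit_am`, `group_means`, `model_means`) are illustrative and not part of the original analysis.

```r
# Sketch: with one 0/1 predictor, the fitted values are just the group means.
fit_am <- lm(mpg ~ am, data = mtcars)

# Mean MPG by transmission type (0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept = automatic mean; intercept + slope = manual mean
model_means <- c(coef(fit_am)[1], sum(coef(fit_am)))

round(group_means, 2)
round(unname(model_means), 2)
```

Both vectors should agree, which confirms that the coefficient on “am” is simply the difference between the two group means.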
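# Sequential ANOVA on all candidate predictors; the result is named fit_aov
# so that it does not mask the aov() function
fit_aov <- aov(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)
summary(fit_aov)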
##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  817.7   817.7 116.425 5.03e-10 ***
## disp         1   37.6    37.6   5.353  0.03091 *
## hp           1    9.4     9.4   1.334  0.26103
## drat         1   16.5    16.5   2.345  0.14064
## wt           1   77.5    77.5  11.031  0.00324 **
## qsec         1    3.9     3.9   0.562  0.46166
## vs           1    0.1     0.1   0.018  0.89317
## am           1   14.5    14.5   2.061  0.16586
## gear         1    1.0     1.0   0.138  0.71365
## carb         1    0.4     0.4   0.058  0.81218
## Residuals   21  147.5     7.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
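Running an analysis of variance on all of the other variables, we see that “cyl”, “disp”, and “wt” are the only statistically significant terms, while “am” is not significant even at the 0.1 level (p = 0.166). Note that aov() reports sequential sums of squares, so each test depends on the order in which the variables enter the formula. We next create a model that uses these three variables, and we retain “am” for now.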
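The ANOVA table above tests each term after the ones listed before it, so the result for a variable can change with its position in the formula. The sketch below, which uses only base R, illustrates this for “am” and shows `drop1()` as one order-independent alternative. The formula objects and the `p_am_*` names are illustrative and not part of the original analysis.

```r
# Sketch: sequential (Type I) tests depend on term order; marginal tests do not.
f_first <- mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb
f_last  <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + gear + carb + am

# p-value for "am" when it enters the model first versus last
p_am_first <- anova(lm(f_first, data = mtcars))["am", "Pr(>F)"]
p_am_last  <- anova(lm(f_last,  data = mtcars))["am", "Pr(>F)"]
c(first = p_am_first, last = p_am_last)

# Order-independent alternative: test each term given all the others
drop1(lm(f_last, data = mtcars), test = "F")
```

Entered first, “am” absorbs variation that it shares with the other predictors; entered last, it is credited only with what the others cannot explain.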
fit2 <- lm(mpg ~ am + cyl + disp + wt, mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + cyl + disp + wt, data = mtcars)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.318 -1.362 -0.479  1.354  6.059
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.898313   3.601540  11.356 8.68e-12 ***
## am           0.129066   1.321512   0.098  0.92292
## cyl         -1.784173   0.618192  -2.886  0.00758 **
## disp         0.007404   0.012081   0.613  0.54509
## wt          -3.583425   1.186504  -3.020  0.00547 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared: 0.8327, Adjusted R-squared: 0.8079
## F-statistic: 33.59 on 4 and 27 DF, p-value: 4.038e-10
R-squared is now 0.83, but the model is likely overfitted: neither “am” (p = 0.92) nor “disp” (p = 0.55) is significant. We try a final model without these two variables and compare the results of all three.
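One plausible reason why some predictors lose significance once others are included is that they carry overlapping information. This is an assumption, since the original analysis does not check it. A minimal sketch of how it could be examined with base R follows; the names `predictors` and `cor_matrix` are illustrative.

```r
# Sketch: pairwise correlations among the predictors used in fit2.
predictors <- mtcars[, c("am", "cyl", "disp", "wt")]
cor_matrix <- cor(predictors)
round(cor_matrix, 2)
```

Large absolute correlations between predictors would suggest that the model cannot cleanly separate their individual effects.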
fit3 <- lm(mpg ~ cyl+wt, mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ cyl + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.2893 -1.5512 -0.4684  1.5743  6.1004
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 **
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
anova(fit, fit3, fit2)
AIC(fit, fit2, fit3)
shapiro.test(fit3$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit3$residuals
## W = 0.93745, p-value = 0.06341
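We see that the third model (fit3), which includes only “cyl” and “wt”, still has an R-squared of 0.83 after rounding (0.8302 versus 0.8327 for fit2), so we have lost very little explanatory power by removing the other variables. Its adjusted R-squared is in fact slightly higher (0.8185 versus 0.8079).
When we compare all three models, fit3 (mpg ~ cyl + wt) shows the greatest improvement over the transmission-only model and also has the lowest AIC, making it the best of the three. One caveat: fit and fit3 are not nested, so the F-test between them in anova() should be read with caution, whereas the AIC comparison does not require nested models.
Lastly, the graphical analysis in Fig 2 to Fig 5 shows residuals spread evenly around the fitted values and an approximately normal residual distribution. The Shapiro-Wilk test is consistent with this: it does not reject normality at the 0.05 level (W = 0.937, p = 0.063).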
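To make the cyl + wt model concrete, the sketch below predicts MPG for a hypothetical car with 4 cylinders weighing 2,500 lb (wt is recorded in units of 1000 lb in mtcars). The car and the names `fit_cw`, `new_car`, and `pred` are made up for illustration. From the coefficients reported above, the point estimate should be roughly 39.69 - 1.51 × 4 - 3.19 × 2.5, or about 25.7 MPG.

```r
# Sketch: predicted MPG from the cyl + wt model for a hypothetical car.
fit_cw <- lm(mpg ~ cyl + wt, data = mtcars)

# Hypothetical car: 4 cylinders, 2,500 lb (wt is in units of 1000 lb)
new_car <- data.frame(cyl = 4, wt = 2.5)

# Point prediction with a 95% prediction interval
pred <- predict(fit_cw, newdata = new_car, interval = "prediction")
pred
```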
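The comparison can be sketched as follows, restricting the F-test to a nested pair of models and using AIC across all three. The object names (`m_am`, `m_full`, `m_cw`) are illustrative stand-ins for fit, fit2, and fit3, and this is one reasonable way to run the comparison rather than a record of the original output.

```r
# Sketch: compare the three candidate models.
m_am   <- lm(mpg ~ am, data = mtcars)                    # "fit" in the text
m_full <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)  # "fit2"
m_cw   <- lm(mpg ~ cyl + wt, data = mtcars)              # "fit3"

# F-test is valid here because m_cw is nested in m_full
nested_test <- anova(m_cw, m_full)
nested_test

# AIC does not require nesting; lower is better
aic_table <- AIC(m_am, m_full, m_cw)
aic_table[order(aic_table$AIC), ]
```

A large p-value in the nested F-test would indicate that adding “am” and “disp” to the cyl + wt model does not significantly reduce the residual sum of squares.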
We can also see in the expanded QQ plot in Fig 6 that the values remain within the confidence bands.
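The appendix figures are not reproduced in this section, so the following is only a sketch of how comparable diagnostics could be generated with base R. The name `fit_cw` is illustrative. The confidence bands mentioned for Fig 6 suggest a function such as `qqPlot()` from the car package, but that is a guess; the base-R QQ plot below draws a reference line only.

```r
# Sketch: base-R residual diagnostics for the cyl + wt model.
fit_cw <- lm(mpg ~ cyl + wt, data = mtcars)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid for the four default panels
plot(fit_cw)                 # residuals vs fitted, QQ, scale-location, leverage
par(op)                      # restore the previous graphics settings

# Standalone normal QQ plot with a reference line (no confidence bands)
qqnorm(residuals(fit_cw))
qqline(residuals(fit_cw))

# Formal check of residual normality
sw <- shapiro.test(residuals(fit_cw))
sw$p.value
```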