In this project, I analysed the Motor Trend Car Road Tests dataset (mtcars) in R in order to answer the following questions: “Is an automatic or manual transmission better for MPG?” and “What is the MPG difference between automatic and manual transmissions?”
I fitted several models with different combinations of variables to find the regression model best suited to answering these questions.
Manual transmission was found to be better for MPG: in a model adjusted for weight and quarter-mile time, the difference between manual and automatic transmissions is approximately 2.94 MPG.
I’ll start by loading the packages needed for this project, then loading the data and converting the am (transmission: 0 = automatic, 1 = manual) and vs (engine shape: 0 = V-shaped, 1 = straight) variables from numeric to factors.
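# Load the packages used in this analysis
library(ggplot2)
library(UsingR)
library(GGally)   # ggpairs()
library(car)      # vif()
# Work on a local copy of the built-in dataset
data(mtcars)
# Treat transmission and engine shape as categorical variables
mtcars$am <- as.factor(mtcars$am)
mtcars$vs <- as.factor(mtcars$vs)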
Following the principle of Occam’s Razor, I will try to keep this as simple as possible. I’ll start by fitting a simple linear model as asked by the rubric using only mpg as an outcome and am as a predictor. (Appendix #1)
In this model, am is a highly significant predictor (p ≈ 0.0003). Not all is good though: judging by the adjusted R^2, this model explains only about 34% of the variability in mpg, which is poor.
Let us now try the other end of the spectrum: a model which uses all the other variables as predictors of mpg. (Appendix #2)
From this summary, we conclude that this model explains about 81% of the variability observed (adjusted R^2), which is very good. In exchange, none of the predictors is significant at the 0.05 level; only weight comes close (p ≈ 0.063, significant at the 0.1 level). This suggests that several predictors are correlated with one another, which inflates their standard errors and hurts our model.
I will start by calculating the variance inflation factors (VIFs) of the predictors in complete_model. (Appendix #3)
We can see that many variables have a VIF > 5, which indicates multicollinearity, especially cyl, disp and wt (all above 15), followed by hp and carb.
By building a pairs plot to check the correlation values and comparing correlations with the cor function, let’s try to see which variables we can omit from the final model. (Appendix #3)
From the analysis of the pairs plot, we expect higher MPG with manual transmission. Weight is also strongly correlated with disp, hp and drat (|r| between 0.66 and 0.89). As weight seems more important for MPG, let’s build a model that omits those variables, as well as cyl and carb, and keeps weight. (Appendix #4)
In this summary, vs and gear are clearly not statistically significant (p ≈ 0.98 and p ≈ 0.89), so let us build a final model that removes those variables as well. (Appendix #5)
All variables are significant at the 0.05 level, the model explains 83.36% of the total variability (adjusted R^2) and the F statistic is also highly significant. This seems like a very good model; the next step is to run some diagnostics on it to see whether it is truly valid. (I know we could have used the step function to find the best model quickly; however, with all the hot debate surrounding stepwise regression, I prefer to do it manually, as Prof. Brian also did.) Let’s do an analysis of variance to check that this is the best of the three models. (Appendix #6)
It is: the final model fits significantly better than the simple model (p ≈ 8.0e-08), while the extra variables in the complete model bring no significant improvement (p ≈ 0.86).
We will begin by plotting the final model in order to create four diagnostic plots, including a residuals-vs-fitted plot and a normal Q-Q plot. (Appendix #7)
The residuals plot shows no obvious pattern, which is consistent with independent errors, and the Q-Q plot suggests the residuals are approximately normal. The other two plots also look unremarkable, without outliers exerting heavy leverage on the fit. To confirm though, let’s check the maximum hat values and the maximum dfbetas. (Appendix #8)
As we can see, neither the hat values nor the dfbetas show an exorbitant value. We can therefore confirm this model is safe enough to draw some conclusions.
Let us finish by calculating 95% confidence intervals for the coefficients of our model, as well as using t.test to evaluate whether mean MPG is the same for automatic and manual transmissions. (Appendix #9)
We can see that, apart from the intercept, none of the intervals contains 0, so wt, qsec and am are all significant at the 5% level (the am interval excludes 0 only narrowly). Intervals constructed this way would contain the true coefficient in 95% of repeated samples. The t.test p-value is also statistically significant, which indicates that mean MPG is not equal for the two transmissions (this test is not adjusted for the other variables).
By analysis of the summary of the final model, we can draw the following conclusions:
1 - Manual transmission results in higher MPG.
2 - In a model adjusted for weight and quarter-mile time, changing from automatic transmission to manual transmission results in a gain of approximately 2.94 MPG which is fairly important for economic reasons.
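To make the second conclusion concrete, here is a minimal sketch that refits the final model and predicts MPG for two hypothetical cars that differ only in transmission. The weight (3, i.e. 3000 lbs) and quarter-mile time (18 s) are assumed illustrative values, not figures taken from the analysis; because the model is additive, the predicted difference is simply the am1 coefficient whatever values are chosen.

```r
# Refit the final model on a fresh copy of the built-in data.
cars <- datasets::mtcars
cars$am <- factor(cars$am)            # 0 = automatic, 1 = manual
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Two hypothetical cars, identical except for transmission (assumed values).
new_cars <- data.frame(
  wt   = c(3, 3),
  qsec = c(18, 18),
  am   = factor(c("0", "1"), levels = levels(cars$am))
)
pred <- predict(fit, newdata = new_cars)

# Manual minus automatic: equals the am1 coefficient (about 2.94 MPG).
mpg_gain <- unname(pred[2] - pred[1])
mpg_gain
```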
Appendix #1
simple_model <- lm(mpg ~ am, mtcars)
summary(simple_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am1 7.244939 1.764422 4.106127 2.850207e-04
Appendix #2
complete_model <- lm(mpg ~ ., mtcars)
summary(complete_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs1 0.31776281 2.10450861 0.1509915 0.88142347
## am1 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
Appendix #3
vif(complete_model)
## cyl disp hp drat wt qsec vs
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873
## am gear carb
## 4.648487 5.357452 7.908747
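As a cross-check on the vif output above, the following sketch computes the same quantity by hand in base R: each predictor is regressed on all the other predictors, and its VIF is 1 / (1 - R^2) of that regression. It uses the untouched numeric copy of the data (datasets::mtcars), which should give the same values here since am and vs are binary; the name manual_vif is only illustrative.

```r
# Numeric copy of the data, so every predictor can serve as a response.
cars <- datasets::mtcars
predictors <- setdiff(names(cars), "mpg")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the remaining predictors.
manual_vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  fit <- lm(reformulate(others, response = v), data = cars)
  1 / (1 - summary(fit)$r.squared)
})

round(manual_vif, 2)

# Predictors above the VIF > 5 threshold used in the text.
names(manual_vif)[manual_vif > 5]
```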
Appendix #3 (continued)
ggpairs(mtcars, columns = c(1, 2, 3, 4, 5, 6, 7, 9))
cor(mtcars[c(1,3,4,5,6,7)])
## mpg disp hp drat wt qsec
## mpg 1.0000000 -0.8475514 -0.7761684 0.68117191 -0.8676594 0.41868403
## disp -0.8475514 1.0000000 0.7909486 -0.71021393 0.8879799 -0.43369788
## hp -0.7761684 0.7909486 1.0000000 -0.44875912 0.6587479 -0.70822339
## drat 0.6811719 -0.7102139 -0.4487591 1.00000000 -0.7124406 0.09120476
## wt -0.8676594 0.8879799 0.6587479 -0.71244065 1.0000000 -0.17471588
## qsec 0.4186840 -0.4336979 -0.7082234 0.09120476 -0.1747159 1.00000000
Appendix #4
new_model <- lm(mpg ~ . -(disp + hp + drat + carb + cyl), mtcars)
summary(new_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.34484998 9.9831267 1.03623346 3.096353e-01
## wt -3.91800463 0.8258160 -4.74440399 6.598901e-05
## qsec 1.21149668 0.4734868 2.55867070 1.667702e-02
## vs1 0.05204275 1.8505928 0.02812221 9.777794e-01
## am1 3.08825472 1.8233513 1.69372444 1.022665e-01
## gear -0.14917917 1.0660835 -0.13993197 8.897921e-01
Appendix #5
final_model <- lm(mpg ~ . -(disp + hp + drat + carb + cyl + vs + gear), mtcars)
summary(final_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## am1 2.935837 1.4109045 2.080819 4.671551e-02
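The coefficient tables in these appendices do not show the adjusted R^2 values quoted in the text (about 34%, 81% and 83.36%). A short sketch for pulling them out of the three models side by side, refitting each model on a fresh copy of the data so that it runs on its own:

```r
# Fresh copy of the data with the same factor conversions as the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am)
cars$vs <- factor(cars$vs)

# The three models compared in the text.
models <- list(
  simple   = lm(mpg ~ am, data = cars),
  complete = lm(mpg ~ ., data = cars),
  final    = lm(mpg ~ wt + qsec + am, data = cars)
)

# Adjusted R^2 penalises extra predictors, so it is comparable across models.
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(adj_r2, 4)
```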
Appendix #6
anova(simple_model, final_model, complete_model)[["Pr(>F)"]]
## [1] NA 8.025056e-08 8.636073e-01
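The last p-value above comes from a nested-model F test. As a sketch of what anova does for the final-versus-complete comparison, the statistic can be rebuilt from the residual sums of squares of the two models; the variable names below are illustrative only.

```r
# Refit the two nested models on a fresh copy of the data.
cars <- datasets::mtcars
reduced <- lm(mpg ~ wt + qsec + am, data = cars)
full    <- lm(mpg ~ ., data = cars)

# For lm objects, deviance() returns the residual sum of squares.
rss_reduced <- deviance(reduced)
rss_full    <- deviance(full)

# Number of extra coefficients in the complete model.
df_extra <- df.residual(reduced) - df.residual(full)

# F = (drop in RSS per extra coefficient) / (residual variance of full model).
f_stat  <- ((rss_reduced - rss_full) / df_extra) /
  (rss_full / df.residual(full))
p_value <- pf(f_stat, df_extra, df.residual(full), lower.tail = FALSE)
p_value
```

A large p-value here means the seven extra predictors do not reduce the residual variance by more than chance would.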
Appendix #7
par(mfrow=c(2,2))
plot(final_model)
Appendix #8
max(hatvalues(final_model))
## [1] 0.2970422
max(dfbetas(final_model))
## [1] 1.093842
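To put these two numbers in context, a small sketch follows. Hat values always sum to the number of coefficients, so their average is p / n; a common rule of thumb (an assumption here, not a criterion used in the analysis) looks more closely at points whose leverage is two to three times that average. Since dfbetas can be large in either direction, the sketch also takes the maximum of their absolute values.

```r
# Refit the final model on a fresh copy of the data.
cars <- datasets::mtcars
cars$am <- factor(cars$am)
fit <- lm(mpg ~ wt + qsec + am, data = cars)

h <- hatvalues(fit)
p <- length(coef(fit))   # number of coefficients, intercept included
n <- nrow(cars)          # number of cars
mean_hat <- p / n        # average leverage

# The three cars with the highest leverage, for closer inspection.
head(sort(h, decreasing = TRUE), 3)

# Largest standardized change in any coefficient, regardless of sign.
max_abs_dfbetas <- max(abs(dfbetas(fit)))
max_abs_dfbetas
```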
Appendix #9
confint(final_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -4.63829946 23.873860
## wt -5.37333423 -2.459673
## qsec 0.63457320 1.817199
## am1 0.04573031 5.825944
t.test(mpg ~ am, mtcars)$p.value
## [1] 0.001373638
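The am1 interval above can be reproduced by hand as estimate ± t quantile × standard error, with the residual degrees of freedom of the final model. The sketch below does this from a fresh fit; ci_am is an illustrative name.

```r
# Refit the final model on a fresh copy of the data.
cars <- datasets::mtcars
cars$am <- factor(cars$am)
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Estimate and standard error of the transmission coefficient.
coefs <- coef(summary(fit))
est <- coefs["am1", "Estimate"]
se  <- coefs["am1", "Std. Error"]

# Two-sided 95% interval uses the 97.5% quantile of the t distribution.
t_crit <- qt(0.975, df.residual(fit))
ci_am <- c(lower = est - t_crit * se, upper = est + t_crit * se)
ci_am
```

The lower bound sits just above 0, which is why the am effect is significant at the 5% level only by a small margin.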