Findings
Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions
Based on the “mtcars” dataset, transmission in itself does not have a statistically significant impact on fuel efficiency. In a simple one-way view that ignores all other factors, manual cars appear to have a higher MPG (7.2 more on average). However, this is not the “pure” effect of transmission, because other variables confound it; for example, weight is strongly negatively correlated with manual transmission. Once weight, number of cylinders, horsepower and displacement are allowed for, transmission becomes an insignificant factor in MPG.
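That confounding is easy to check directly, for example:

cor(mtcars$wt, mtcars$am)  #am = 1 is manual; manual cars are systematically lighter (correlation about -0.69)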
Method
Final model
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp + disp:cyl, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7574 -1.3419 -0.2371  1.2777  3.9379 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.149633   4.277477  11.724 7.01e-12 ***
## wt          -2.894728   0.983595  -2.943  0.00676 ** 
## cyl         -2.561859   0.757871  -3.380  0.00230 ** 
## hp          -0.024119   0.011039  -2.185  0.03811 *  
## disp        -0.092176   0.040160  -2.295  0.03004 *  
## cyl:disp     0.012283   0.004585   2.679  0.01264 *  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.267 on 26 degrees of freedom
## Multiple R-squared: 0.8814, Adjusted R-squared: 0.8586
## F-statistic: 38.63 on 5 and 26 DF, p-value: 3.128e-11
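Written out with the coefficients above (rounded), the fitted model is:

mpg ≈ 50.15 - 2.89*wt - 2.56*cyl - 0.024*hp - 0.092*disp + 0.0123*cyl:disp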
Scatterplot matrices were created to identify factors of potential importance to MPG and any interesting correlations.
Obvious candidate predictors such as cyl, hp and wt appear strongly correlated with MPG, in the direction expected. Transmission is strongly negatively correlated with weight, so weight may act as a confounding variable. Looking at the scatterplot of am vs. MPG alone, manual cars appear to have a higher MPG.
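A matrix like this can be reproduced with base graphics; a minimal sketch (not necessarily the exact call used for the Appendix figures):

pairs(mtcars, panel = panel.smooth)  #pairwise scatterplots with smoothers to eyeball correlations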
Random forest and GBM are used here as a “brute force” method to identify the most predictive features to fit in our final linear model. The lasso is an L1-penalised feature selection method for linear models and represents an automated form of model subset selection. Repeated 10-fold cross-validation is used to select the best model, prevent over-fitting and estimate out-of-sample error (see Appendix for details of the random forest, GBM and lasso fits). The GBM, random forest and lasso all achieve an RMSE of roughly 2-3 on both the cross-validation folds and the held-out test set.
Top factors:
* random forest: disp, wt, hp, cyl
* GBM: disp, wt, hp, cyl
* Lasso: wt, cyl, hp, am
wt, hp and cyl come up as top factors in all three models, and disp comes up in two of them. These are also variables that make intuitive sense in the context of fuel efficiency.
lm1 - our first model looks purely at the impact of transmission without allowing for any other factors.
lm1 <- lm(formula = mpg~as.factor(am), data = data)
summary(lm1)
##
## Call:
## lm(formula = mpg ~ as.factor(am), data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.147      1.125  15.247 1.13e-15 ***
## as.factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
# par(mfrow=c(2,2))
# plot(lm1)
# par(mfrow=c(3,3))
# for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
# oneway(i,"mpg",lm1,data)
# }
The QQ plot shows normality is acceptable, the residuals look fine and the transmission factor is significant. This model says that manual transmission increases miles per gallon by 7.2. From a common-sense point of view, it is unlikely that transmission alone has such a big impact; this is very likely the effect of confounding variables.
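For reference, the uncertainty around that 7.2 estimate can be pulled out directly:

confint(lm1)["as.factor(am)1", ]  #95% CI for the one-way transmission effect (roughly 3.6 to 10.8 MPG)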
lm2 - we fit a linear model, starting with the top 4 factors identified by the random forest, GBM and lasso.
lm2 <- lm(formula = mpg~ wt + cyl + hp + disp, data = data)
summary(lm2)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0562 -1.4636 -0.4281  1.2854  5.8269 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.82854    2.75747  14.807 1.76e-14 ***
## wt          -3.85390    1.01547  -3.795 0.000759 ***
## cyl         -1.29332    0.65588  -1.972 0.058947 .  
## hp          -0.02054    0.01215  -1.691 0.102379    
## disp         0.01160    0.01173   0.989 0.331386    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
par(mfrow=c(2,2))
plot(lm2)
# par(mfrow=c(3,3))
# for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
# oneway(i,"mpg",lm2,data)
# }
anova(lm1,lm2)
## Analysis of Variance Table
##
## Model 1: mpg ~ as.factor(am)
## Model 2: mpg ~ wt + cyl + hp + disp
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                  
## 2     27 170.44  3    550.45 29.066 1.32e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The nested-models ANOVA F-test, as well as the adjusted R-squared, tells us lm2 is the superior model. Displacement comes up very strongly in the GBM and random forest, yet it is not significant here. Perhaps it acts through an interaction. I have tried fitting disp:cyl, disp:hp, disp:am and disp:wt, and only disp:cyl comes through as significant, so I fit it in lm3, as sketched below.
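That interaction search can be written as a short loop (a sketch of the process just described; each candidate is added to lm2 in turn and the interaction term, always the last coefficient row, is checked):

for (ia in c("disp:cyl", "disp:hp", "disp:am", "disp:wt")) {
  f <- as.formula(paste("mpg ~ wt + cyl + hp + disp +", ia))
  cf <- summary(lm(f, data = data))$coefficients
  cat(ia, "p-value:", cf[nrow(cf), 4], "\n")  #p-value of the interaction term
}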
lm3 - fitting an interaction to lm2
lm3 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl, data = data)
summary(lm3)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp + disp:cyl, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7574 -1.3419 -0.2371  1.2777  3.9379 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.149633   4.277477  11.724 7.01e-12 ***
## wt          -2.894728   0.983595  -2.943  0.00676 ** 
## cyl         -2.561859   0.757871  -3.380  0.00230 ** 
## hp          -0.024119   0.011039  -2.185  0.03811 *  
## disp        -0.092176   0.040160  -2.295  0.03004 *  
## cyl:disp     0.012283   0.004585   2.679  0.01264 *  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.267 on 26 degrees of freedom
## Multiple R-squared: 0.8814, Adjusted R-squared: 0.8586
## F-statistic: 38.63 on 5 and 26 DF, p-value: 3.128e-11
par(mfrow=c(2,2))
plot(lm3)
par(mfrow=c(3,3))
for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
  oneway(i, "mpg", lm3, data)  #oneway is a helper function I wrote to quickly plot one-way fits
}
anova(lm1,lm2,lm3)
## Analysis of Variance Table
##
## Model 1: mpg ~ as.factor(am)
## Model 2: mpg ~ wt + cyl + hp + disp
## Model 3: mpg ~ wt + cyl + hp + disp + disp:cyl
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                    
## 2     27 170.44  3    550.45 35.7134 2.274e-09 ***
## 3     26 133.58  1     36.86  7.1753   0.01264 *  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm3 has all coefficients significant, the best adjusted R-squared of all the models fitted, and the nested-models ANOVA F-test says it is a better model. The residuals show no signs of heteroscedasticity or bias, and the QQ plot shows the assumption of normality is fine. The nine one-way figures show the model fits well when cut by each of those nine factors (i.e. it is not under-fitting/biased).
Based on the outputs of the random forest, GBM and lasso, the remaining factors are unlikely to be very predictive. To be sure, I have fitted a linear model adding each remaining factor in turn and performed a nested-models F-test on each to see whether it is worth including. None of them come anywhere near passing the test.
In particular, transmission becomes an insignificant factor once the other variables are fitted. Even if we were to include it, its beta is insignificant and has no major impact on the response (0.58 greater MPG if manual, with a standard error larger than the estimate; see Appendix E).
Our final model is lm3: mpg ~ wt + cyl + hp + disp + disp:cyl.
library(caret)  #createDataPartition, train and trainControl come from caret
set.seed(1234)
#splitting data into train and test
trainIndex <- createDataPartition(data$mpg, p = .75, list = FALSE, times = 1)
train <- data[trainIndex, ]
test <- data[-trainIndex, ]
#"form" is the full model formula, defined earlier in the original script;
#reconstructed here from the variable-importance output below (an assumption):
form <- mpg ~ wt + hp + disp + cyl + drat + qsec + carb + gear + as.factor(am) + as.factor(vs)
#using repeated 10-fold cross-validation to estimate OOS error
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
rforestGrid <- expand.grid(mtry = c(2,3,4,5,6,7,8,9,10))
set.seed(934)
rforestFit1 <- train(form, data = train,
                     method = "rf",
                     tuneGrid = rforestGrid,
                     trControl = fitControl,
                     na.action = na.pass,
                     importance = TRUE)
#prediction on test
rforestPredict1_test <- predict(rforestFit1, newdata = test, na.action=na.pass)
#RMSE on train
# rforestFit1
#RMSE on test
RMSE_rforestFit1 <- sqrt(mean((rforestPredict1_test - test$mpg)^2))
RMSE_rforestFit1
## [1] 2.254119
#best hyperparameter
plot(rforestFit1)
#variable importance
varImp(rforestFit1)
## rf variable importance
##
##                Overall
## disp           100.000
## wt              91.516
## hp              72.757
## cyl             50.056
## gear             6.714
## as.factor(am)1   5.872
## as.factor(vs)1   1.760
## drat             1.120
## qsec             0.102
## carb             0.000
gbmGrid <- expand.grid(interaction.depth = c(6, 8),
                       n.trees = 3000,
                       shrinkage = c(0.001, 0.0001),
                       n.minobsinnode = c(ceiling(0.025 * nrow(train)),
                                          ceiling(0.05 * nrow(train))))
set.seed(825)
gbmFit1 <- train(form, data = train,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 trControl = fitControl,
                 bag.fraction = 0.7,
                 verbose = FALSE,
                 na.action = na.pass)
#hyperparameters
plot(gbmFit1)
#variable importance
summary(gbmFit1)
##                           var      rel.inf
## disp                     disp 43.688450842
## wt                         wt 26.037161748
## hp                         hp 18.144326806
## cyl                       cyl  8.699762517
## drat                     drat  1.826134668
## qsec                     qsec  0.998261245
## carb                     carb  0.485413654
## gear                     gear  0.109133911
## as.factor(vs)1 as.factor(vs)1  0.006052016
## as.factor(am)1 as.factor(am)1  0.005302593
#prediction on test
gbmPredict1_test <- predict(gbmFit1, newdata = test, na.action=na.pass)
#RMSE on train
gbmFit1
## Stochastic Gradient Boosting
##
## 25 samples
## 10 predictors
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 22, 23, 23, 23, 22, 22, ...
## Resampling results across tuning parameters:
##
##   shrinkage  interaction.depth  n.minobsinnode  RMSE      Rsquared   MAE
##   1e-04      6                  1               4.408247  0.9730900  3.715960
##   1e-04      6                  2               4.408846  0.9724921  3.694303
##   1e-04      8                  1               4.397575  0.9716614  3.707051
##   1e-04      8                  2               4.411354  0.9724283  3.696909
##   1e-03      6                  1               2.212191  0.9691160  2.025459
##   1e-03      6                  2               2.102792  0.9692171  1.942360
##   1e-03      8                  1               2.239043  0.9679627  2.045088
##   1e-03      8                  2               2.111993  0.9691577  1.953080
##
## Tuning parameter 'n.trees' was held constant at a value of 3000
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 3000,
## interaction.depth = 6, shrinkage = 0.001 and n.minobsinnode = 2.
#RMSE on test
RMSE_gbmFit1 <- sqrt(mean((gbmPredict1_test - test$mpg)^2))
RMSE_gbmFit1
## [1] 2.795255
lassoGrid <- expand.grid(fraction = seq(from=0.01, to=0.99, length.out=198))
set.seed(567)
lassoFit1 <- train(form, data = train,
                   method = "lasso",
                   tuneGrid = lassoGrid,
                   trControl = fitControl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
#hyperparameters
plot(lassoFit1)
#a fraction of 0.39 looks like the best fit
plot.enet(lassoFit1$finalModel, use.color = TRUE)
abline(v=lassoFit1$bestTune$fraction)
#grabbing the non-zero coefficients at the best tuning fraction of 0.39:
lassoCoefs <- predict(lassoFit1$finalModel, type = "coefficients")$coefficients[6, ]
lassoCoefs[lassoCoefs != 0]
##         cyl          hp          wt 
## -0.73203256 -0.01343803 -2.45343755 
#prediction on test
lassoPredict1_test <- predict(lassoFit1, newdata = test, na.action=na.pass)
#RMSE on train
min(lassoFit1$results$RMSE)
## [1] 3.110872
#RMSE on test
RMSE_lassoFit1 <- sqrt(mean((lassoPredict1_test - test$mpg)^2))
RMSE_lassoFit1
## [1] 2.136262
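Collecting the three held-out errors for comparison:

#held-out test RMSE for the three feature-selection models
data.frame(model = c("random forest", "GBM", "lasso"),
           test_RMSE = c(RMSE_rforestFit1, RMSE_gbmFit1, RMSE_lassoFit1))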
lm4_1 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + vs, data = data)
lm4_2 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + drat, data = data)
lm4_3 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + qsec, data = data)
lm4_4 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + carb, data = data)
lm4_5 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + gear, data = data)
anova(lm3,lm4_1)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + vs
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1     26 133.58                         
## 2     25 131.84  1    1.7401 0.33 0.5708
anova(lm3,lm4_2)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + drat
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     26 133.58                          
## 2     25 133.57  1 0.0054237 0.001 0.9748
anova(lm3,lm4_3)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + qsec
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 133.58                           
## 2     25 130.89  1    2.6889 0.5136 0.4802
anova(lm3,lm4_4)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + carb
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 133.58                           
## 2     25 133.23  1    0.3513 0.0659 0.7995
anova(lm3,lm4_5)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + gear
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 133.58                           
## 2     25 128.16  1    5.4156 1.0564 0.3139
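The same five comparisons can be collected in one pass (equivalent to the anova calls above, just more compact):

#nested-models F-test p-value for each candidate addition to lm3
models <- list(vs = lm4_1, drat = lm4_2, qsec = lm4_3, carb = lm4_4, gear = lm4_5)
sapply(models, function(m) anova(lm3, m)[2, "Pr(>F)"])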
lm5 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + am, data = data)
summary(lm5)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp + disp:cyl + am, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6088 -1.2570 -0.3296  1.3561  3.9353 
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.716992   5.535182   8.801 3.95e-09 ***
## wt          -2.736296   1.069002  -2.560  0.01691 *  
## cyl         -2.430332   0.831962  -2.921  0.00729 ** 
## hp          -0.026708   0.012814  -2.084  0.04751 *  
## disp        -0.086865   0.042744  -2.032  0.05288 .  
## am           0.579518   1.386101   0.418  0.67945    
## cyl:disp     0.011683   0.004876   2.396  0.02437 *  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.303 on 25 degrees of freedom
## Multiple R-squared: 0.8822, Adjusted R-squared: 0.8539
## F-statistic: 31.2 on 6 and 25 DF, p-value: 1.905e-10
par(mfrow=c(2,2))
plot(lm5)
# par(mfrow=c(3,3))
# for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
# oneway(i,"mpg",lm5,data)
# }
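As a final check, the confidence interval for the transmission coefficient in lm5 makes the same point:

confint(lm5)["am", ]  #95% CI for the manual effect; it comfortably spans zero (about -2.3 to 3.4)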