Executive summary

Findings
Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.
Based on the “mtcars” dataset, transmission by itself does not have a statistically significant impact on fuel efficiency. In a simple one-way view, without adjusting for other factors, manual cars appear to have a higher MPG (7.2 more on average). However, this is not the “pure effect” of transmission, because confounding variables are correlated with it (for example, weight is strongly negatively correlated with manual transmission). Once weight, number of cylinders and displacement are accounted for, transmission becomes an insignificant factor in MPG.

Method

  1. Perform EDA to identify potential predictors and relationships between variables.
  2. Fit random forest, GBM and lasso regression models to identify the best predictors for a linear model.
  3. Begin with a linear model that uses the best factors identified in Step 2.
  4. Iteratively add or remove factors according to their significance, using the top factors from the GBM, random forest and lasso as a guide.
  5. Perform model diagnostics throughout the iterative process of building the linear model.
  6. Steps 1-5 give us a final model that we can use to make inferences about the impact of transmission on MPG. (A sketch of the assumed setup follows this list.)
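
For reproducibility, the analysis below assumes the following setup. This is a minimal sketch: the original setup chunk is not shown, so the package list and the full model formula form are assumptions reconstructed from the variable-importance output in the Appendix.

library(caret) # createDataPartition, trainControl, train (the randomForest,
               # gbm and elasticnet packages must also be installed; caret
               # loads them as each method is fitted)
data <- mtcars # 32 cars, 11 variables; mpg is the response
# full model formula passed to the random forest, GBM and lasso fits (assumed)
form <- mpg ~ cyl + disp + hp + drat + wt + qsec + as.factor(vs) + as.factor(am) + gear + carb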

Final model

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp + disp:cyl, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7574 -1.3419 -0.2371  1.2777  3.9379 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.149633   4.277477  11.724 7.01e-12 ***
## wt          -2.894728   0.983595  -2.943  0.00676 ** 
## cyl         -2.561859   0.757871  -3.380  0.00230 ** 
## hp          -0.024119   0.011039  -2.185  0.03811 *  
## disp        -0.092176   0.040160  -2.295  0.03004 *  
## cyl:disp     0.012283   0.004585   2.679  0.01264 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.267 on 26 degrees of freedom
## Multiple R-squared:  0.8814, Adjusted R-squared:  0.8586 
## F-statistic: 38.63 on 5 and 26 DF,  p-value: 3.128e-11

Body

Step 1 - EDA

Scatterplot matrices were created to identify factors of potential importance for MPG and any interesting correlations.

Obvious candidate predictors such as cyl, hp and wt appear strongly correlated with MPG in the expected direction. Transmission is strongly negatively correlated with weight, so weight may act as a confounding variable. Looking at the scatterplot of am vs. MPG, manual cars appear to have a higher MPG.
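
The plotting code itself is not shown above; a scatterplot matrix of this kind can be produced with, for example:

# one possible way to reproduce the EDA view (the original plotting code is not shown)
pairs(mtcars, panel = panel.smooth, main = "mtcars scatterplot matrix")
cor(mtcars$wt, mtcars$am) # about -0.69: manual cars tend to be lighter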

Step 2 - identify important features

Random forest and GBM are used here as a “brute force” method to identify the most predictive features for our final linear model. The lasso is an L1-penalised feature selection method for linear models and represents an automated form of model subset selection. Repeated 10-fold cross-validation is used to select the best model, prevent over-fitting and estimate out-of-sample error (see Appendix for details of the random forest, GBM and lasso fits). The GBM, random forest and lasso all have an RMSE of 2-3 on both the cross-validation and test sets.

Top factors:
* random forest: disp, wt, hp, cyl
* GBM: disp, wt, hp, cyl
* Lasso: wt, cyl, hp, am

wt, hp and cyl come up as top factors in all three models and disp comes up in two of the models. These are also variables that make a lot of sense in the context of fuel efficiency.

Step 3 - fit linear models

lm1 - our first model looks purely at the impact of transmission, without adjusting for any other factors.

lm1 <- lm(formula = mpg~as.factor(am), data = data)
summary(lm1)
## 
## Call:
## lm(formula = mpg ~ as.factor(am), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.147      1.125  15.247 1.13e-15 ***
## as.factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
# par(mfrow=c(2,2))
# plot(lm1)
# par(mfrow=c(3,3))
# for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
#     oneway(i,"mpg",lm1,data)
# }

The QQ plot shows normality is acceptable, the residuals look fine, and the transmission factor is significant. This model suggests that a manual transmission increases MPG by 7.2. From a common-sense point of view, it is unlikely that transmission alone has such a big impact; this is very likely the effect of confounding variables.
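
As a quick sanity check on the size of this one-way effect, a 95% confidence interval for the manual-transmission coefficient follows directly from the printed estimate and standard error (roughly 7.2 ± 2.04 × 1.76):

# 95% CI for the unadjusted manual-transmission effect; from the summary
# above this is roughly [3.6, 10.8] MPG
confint(lm1)["as.factor(am)1", ]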
lm2 - we fit a linear model, starting with the top 4 factors identified by the random forest, GBM and lasso.

lm2 <- lm(formula = mpg~ wt + cyl + hp + disp, data = data)
summary(lm2)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0562 -1.4636 -0.4281  1.2854  5.8269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.82854    2.75747  14.807 1.76e-14 ***
## wt          -3.85390    1.01547  -3.795 0.000759 ***
## cyl         -1.29332    0.65588  -1.972 0.058947 .  
## hp          -0.02054    0.01215  -1.691 0.102379    
## disp         0.01160    0.01173   0.989 0.331386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared:  0.8486, Adjusted R-squared:  0.8262 
## F-statistic: 37.84 on 4 and 27 DF,  p-value: 1.061e-10
par(mfrow=c(2,2))
plot(lm2)

# par(mfrow=c(3,3))
# for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
#     oneway(i,"mpg",lm2,data)
# }
anova(lm1,lm2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ as.factor(am)
## Model 2: mpg ~ wt + cyl + hp + disp
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     27 170.44  3    550.45 29.066 1.32e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The nested-models ANOVA (F test) as well as the adjusted R-squared tells us lm2 is the superior model. Displacement comes up very strongly in the GBM and random forest, yet despite this it is not significant here. Perhaps it acts through an interaction. I have tried fitting disp:cyl, disp:hp, disp:am and disp:wt (screened as sketched below), and only disp:cyl comes through as significant, which I fit in lm3.
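
The screening code is not shown; a minimal sketch of how it might be automated, assuming the same data frame used throughout:

# refit lm2 with each candidate interaction in turn and inspect the
# interaction coefficient (the last row of the coefficient table)
for (int in c("disp:cyl", "disp:hp", "disp:am", "disp:wt")) {
    f <- reformulate(c("wt", "cyl", "hp", "disp", int), response = "mpg")
    print(tail(coef(summary(lm(f, data = data))), 1))
}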
lm3 - fitting an interaction to lm2

lm3 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl, data = data)
summary(lm3)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp + disp:cyl, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7574 -1.3419 -0.2371  1.2777  3.9379 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.149633   4.277477  11.724 7.01e-12 ***
## wt          -2.894728   0.983595  -2.943  0.00676 ** 
## cyl         -2.561859   0.757871  -3.380  0.00230 ** 
## hp          -0.024119   0.011039  -2.185  0.03811 *  
## disp        -0.092176   0.040160  -2.295  0.03004 *  
## cyl:disp     0.012283   0.004585   2.679  0.01264 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.267 on 26 degrees of freedom
## Multiple R-squared:  0.8814, Adjusted R-squared:  0.8586 
## F-statistic: 38.63 on 5 and 26 DF,  p-value: 3.128e-11
par(mfrow=c(2,2))
plot(lm3)

par(mfrow=c(3,3))
for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
    oneway(i,"mpg",lm3,data) #oneway is a function i made to quickly plot one-way fits
}
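
The oneway() helper is not defined in this report; a minimal sketch of what such a function might look like (an assumption; the author's actual implementation may differ):

# sketch of a one-way plotting helper: compare the observed mean response
# with the model's fitted mean response within each level of a variable x
oneway <- function(x, y, model, df) {
    lev <- sort(unique(df[[x]]))
    obs <- tapply(df[[y]], df[[x]], mean)       # observed mean of y per level of x
    fit <- tapply(fitted(model), df[[x]], mean) # fitted mean per level of x
    plot(lev, obs, main = x, xlab = x, ylab = y)
    lines(lev, fit, col = "red")
}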

anova(lm1,lm2,lm3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ as.factor(am)
## Model 2: mpg ~ wt + cyl + hp + disp
## Model 3: mpg ~ wt + cyl + hp + disp + disp:cyl
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     27 170.44  3    550.45 35.7134 2.274e-09 ***
## 3     26 133.58  1     36.86  7.1753   0.01264 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lm3 has all coefficients significant, the best adjusted R-squared of all the models fitted, and the nested-models ANOVA (F test) says it is a better model. The residuals show no signs of heteroscedasticity or bias, and the QQ plot shows that the assumption of normality holds. The 9 one-way figures produced show the model fits quite well when cut by those 9 factors (i.e. it is not under-fitting/biased).

Based on the outputs of the random forest, GBM and lasso, the remaining factors are unlikely to be very predictive. To be sure, I have fitted a linear model for each of the remaining factors and performed a nested-models ANOVA (F test) on each to see whether adding another factor is worthwhile. None of them come anywhere near passing the test (see Appendix D).

In particular, transmission becomes an insignificant factor once the other variables are fitted. Even if we were to include it, its coefficient is insignificant and has no major impact on the response (0.58 greater MPG if manual, with a standard error larger than the estimate; see Appendix E).

Our final model is lm3, mpg ~ wt + cyl + hp + disp + disp:cyl.
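
As a usage example, the final model can predict the MPG of a hypothetical car (the figures below are illustrative and not from the dataset):

# predicted MPG with a 95% confidence interval for a hypothetical car:
# 3,000 lb (wt = 3.0), 6 cylinders, 150 hp, 200 cu in of displacement
newcar <- data.frame(wt = 3.0, cyl = 6, hp = 150, disp = 200)
predict(lm3, newdata = newcar, interval = "confidence")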

Appendix

Appendix A - Random Forest

set.seed(1234)
# splitting data into train and test
trainIndex <- createDataPartition(data$mpg, p = .75, list = FALSE, times = 1)
train <- data[trainIndex,]
test  <- data[-trainIndex,]
# using cross-validation to estimate out-of-sample error
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

rforestGrid <-  expand.grid(mtry = c(2,3,4,5,6,7,8,9,10))

set.seed(934)
rforestFit1 <- train(form, data = train
                 ,method = "rf"
                 ,tuneGrid = rforestGrid
                 ,trControl = fitControl
                 ,na.action = na.pass
                 ,importance=T
                 )
#prediction on test
rforestPredict1_test <- predict(rforestFit1, newdata = test, na.action=na.pass)
# RMSE on train
# rforestFit1
# RMSE on test
RMSE_rforestFit1 <- sqrt(mean((rforestPredict1_test - test$mpg)^2))
RMSE_rforestFit1
## [1] 2.254119
#best hyperparameter
plot(rforestFit1)

#variable importance
varImp(rforestFit1)
## rf variable importance
## 
##                Overall
## disp           100.000
## wt              91.516
## hp              72.757
## cyl             50.056
## gear             6.714
## as.factor(am)1   5.872
## as.factor(vs)1   1.760
## drat             1.120
## qsec             0.102
## carb             0.000

Appendix B - GBM

gbmGrid <-  expand.grid(interaction.depth = c(6,8), 
                        n.trees = 3000, 
                        shrinkage = c(0.001, 0.0001),
                        n.minobsinnode = c(ceiling(0.025*dim(train)[1]),ceiling(0.05*dim(train)[1])))

set.seed(825)
gbmFit1 <- train(form, data = train
                 ,method = "gbm"
                 ,tuneGrid = gbmGrid
                 ,trControl = fitControl
                 ,bag.fraction = 0.7
                 ,verbose = FALSE
                 ,na.action = na.pass
                 # ,importance=T
                 )
#hyperparameters
plot(gbmFit1)

#variable importance
summary(gbmFit1)

##                           var      rel.inf
## disp                     disp 43.688450842
## wt                         wt 26.037161748
## hp                         hp 18.144326806
## cyl                       cyl  8.699762517
## drat                     drat  1.826134668
## qsec                     qsec  0.998261245
## carb                     carb  0.485413654
## gear                     gear  0.109133911
## as.factor(vs)1 as.factor(vs)1  0.006052016
## as.factor(am)1 as.factor(am)1  0.005302593
#prediction on test
gbmPredict1_test <- predict(gbmFit1, newdata = test, na.action=na.pass)
# RMSE on train
gbmFit1
## Stochastic Gradient Boosting 
## 
## 25 samples
## 10 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 22, 23, 23, 23, 22, 22, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  RMSE      Rsquared 
##   1e-04      6                  1               4.408247  0.9730900
##   1e-04      6                  2               4.408846  0.9724921
##   1e-04      8                  1               4.397575  0.9716614
##   1e-04      8                  2               4.411354  0.9724283
##   1e-03      6                  1               2.212191  0.9691160
##   1e-03      6                  2               2.102792  0.9692171
##   1e-03      8                  1               2.239043  0.9679627
##   1e-03      8                  2               2.111993  0.9691577
##   MAE     
##   3.715960
##   3.694303
##   3.707051
##   3.696909
##   2.025459
##   1.942360
##   2.045088
##   1.953080
## 
## Tuning parameter 'n.trees' was held constant at a value of 3000
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 3000,
##  interaction.depth = 6, shrinkage = 0.001 and n.minobsinnode = 2.
# RMSE on test
RMSE_gbmFit1 <- sqrt(mean((gbmPredict1_test - test$mpg)^2))
RMSE_gbmFit1
## [1] 2.795255

Appendix C - Lasso Regression

lassoGrid <-  expand.grid(fraction = seq(from=0.01, to=0.99, length.out=198))

set.seed(567)
lassoFit1 <- train(form
                 , data = train
                 ,method = "lasso"
                 ,tuneGrid = lassoGrid
                 ,trControl = fitControl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
#hyperparameters
plot(lassoFit1)

# a fraction of about 0.39 looks like the best fit
plot.enet(lassoFit1$finalModel, use.color = TRUE)
abline(v=lassoFit1$bestTune$fraction)

# grabbing the coefficients at a training fraction of 0.39:
lasso_coefs <- predict(lassoFit1$finalModel, type = "coefficients")$coefficients[6,]
lasso_coefs[lasso_coefs != 0]
##         cyl          hp          wt 
## -0.73203256 -0.01343803 -2.45343755
#prediction on test
lassoPredict1_test <- predict(lassoFit1, newdata = test, na.action=na.pass)
# RMSE on train
min(lassoFit1$results$RMSE)
## [1] 3.110872
# RMSE on test
RMSE_lassoFit1 <- sqrt(mean((lassoPredict1_test - test$mpg)^2))
RMSE_lassoFit1
## [1] 2.136262

Appendix D - nested-models ANOVA (F tests) on remaining factors

lm4_1 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + vs, data = data)
lm4_2 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + drat, data = data)
lm4_3 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + qsec, data = data)
lm4_4 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + carb, data = data)
lm4_5 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + gear, data = data)

anova(lm3, lm4_1)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + vs
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1     26 133.58                         
## 2     25 131.84  1    1.7401 0.33 0.5708
anova(lm3, lm4_2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + drat
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     26 133.58                          
## 2     25 133.57  1 0.0054237 0.001 0.9748
anova(lm3, lm4_3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + qsec
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 133.58                           
## 2     25 130.89  1    2.6889 0.5136 0.4802
anova(lm3, lm4_4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + carb
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 133.58                           
## 2     25 133.23  1    0.3513 0.0659 0.7995
anova(lm3, lm4_5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + cyl + hp + disp + disp:cyl
## Model 2: mpg ~ wt + cyl + hp + disp + disp:cyl + gear
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     26 133.58                           
## 2     25 128.16  1    5.4156 1.0564 0.3139

Appendix E - fitting am to the final model to see how it looks

lm5 <- lm(formula = mpg~ wt + cyl + hp + disp + disp:cyl + am, data = data)
summary(lm5)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp + disp + disp:cyl + am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6088 -1.2570 -0.3296  1.3561  3.9353 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.716992   5.535182   8.801 3.95e-09 ***
## wt          -2.736296   1.069002  -2.560  0.01691 *  
## cyl         -2.430332   0.831962  -2.921  0.00729 ** 
## hp          -0.026708   0.012814  -2.084  0.04751 *  
## disp        -0.086865   0.042744  -2.032  0.05288 .  
## am           0.579518   1.386101   0.418  0.67945    
## cyl:disp     0.011683   0.004876   2.396  0.02437 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.303 on 25 degrees of freedom
## Multiple R-squared:  0.8822, Adjusted R-squared:  0.8539 
## F-statistic:  31.2 on 6 and 25 DF,  p-value: 1.905e-10
par(mfrow=c(2,2))
plot(lm5)

# par(mfrow=c(3,3))
# for (i in c("am", "wt", "hp", "disp", "cyl", "drat", "qsec", "carb", "gear")) {
#     oneway(i,"mpg",lm5,data)
# }