Regression Models

EXECUTIVE SUMMARY

In this brief analysis we explore how the variables in the mtcars dataset relate to fuel efficiency, measured in miles per gallon (MPG), and which of them matter most for maximizing it. We fit a series of models to find the most effective way of predicting MPG, and in particular to determine whether it matters if the vehicle has a manual or an automatic transmission.

AUTO / MANUAL ANALYSIS

The first question is whether an automatic or a manual transmission has better MPG. The chart in Fig 1 in the appendix shows that, on this simple split, manual transmissions have a clearly higher MPG. We can quantify the difference with a simple linear model.

mtcars$trans <- ifelse(mtcars$am == 0, "auto", "manual")
fit <- lm(mpg ~ am, mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
fit$coefficients[1]
## (Intercept) 
##    17.14737
sum(fit$coefficients)
## [1] 24.39231

The coefficients show that an automatic transmission has a mean predicted value of 17.15 MPG (the intercept), whereas a manual transmission has an estimate of 24.39 MPG (the intercept plus the “am” coefficient of 7.245). This can be observed in Fig 1 as well.
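Because “am” is a 0/1 indicator, the two predicted values are simply the group means of MPG. The following is a minimal sketch that checks this directly against the built-in mtcars data; the names mean_auto, mean_manual and fit_am are ours and do not appear in the original analysis.

```r
# Mean MPG for each transmission type (am: 0 = automatic, 1 = manual)
mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])
c(auto = mean_auto, manual = mean_manual)

# The gap between the two means should equal the "am" slope of lm(mpg ~ am)
fit_am <- lm(mpg ~ am, data = mtcars)
all.equal(mean_manual - mean_auto, unname(coef(fit_am)["am"]))
```

The same comparison could also be run as a two-sample t-test with t.test(mpg ~ am, data = mtcars), which does not assume equal variances by default.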

However, the model summary shows a low R-squared of 0.36: transmission type alone explains only about 36% of the variance in MPG. Although “am” is statistically significant (p = 0.000285), we next use the other variables to see whether we get a better model for predicting MPG.

MODEL ANALYSIS

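# Sequential analysis of variance with every predictor; the object gets its
# own name so that it does not mask the aov() function
fit_all <- aov(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb, mtcars)
summary(fit_all)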
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cyl          1  817.7   817.7 116.425 5.03e-10 ***
## disp         1   37.6    37.6   5.353  0.03091 *  
## hp           1    9.4     9.4   1.334  0.26103    
## drat         1   16.5    16.5   2.345  0.14064    
## wt           1   77.5    77.5  11.031  0.00324 ** 
## qsec         1    3.9     3.9   0.562  0.46166    
## vs           1    0.1     0.1   0.018  0.89317    
## am           1   14.5    14.5   2.061  0.16586    
## gear         1    1.0     1.0   0.138  0.71365    
## carb         1    0.4     0.4   0.058  0.81218    
## Residuals   21  147.5     7.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

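Running an analysis of variance on a linear model that includes all other variables, we see that “cyl”, “disp”, and “wt” are the only terms significant at the 0.05 level, while “am” is not significant even at the 0.1 level (p = 0.166). These are sequential tests, so each p-value depends on the order in which the terms enter the formula. We next create a model that uses these three variables, plus we’ll retain “am” for now.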
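The table above comes from aov(), which reports sequential (Type I) sums of squares: each term is tested after the terms listed before it. As a hedged sketch of how sensitive this is to ordering, the code below compares the sequential table with marginal F tests from drop1() and with a reordered formula that puts “am” first. The names full and am_first are illustrative and not part of the original analysis.

```r
# Full model, same term order as the aov() call above
full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb,
           data = mtcars)

# Sequential (Type I) table: matches summary(aov(...)) for this ordering
anova(full)

# Marginal tests: each term is dropped from the full model in turn
drop1(full, test = "F")

# Same model with "am" entered first: its sequential sum of squares now
# equals the regression sum of squares of the simple model mpg ~ am
am_first <- lm(mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb,
               data = mtcars)
anova(am_first)["am", ]
```

If variable selection is the goal, the marginal tests or an information criterion are usually easier to interpret than an order-dependent table.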

fit2 <- lm(mpg ~ am + cyl + disp + wt, mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + cyl + disp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.318 -1.362 -0.479  1.354  6.059 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.898313   3.601540  11.356 8.68e-12 ***
## am           0.129066   1.321512   0.098  0.92292    
## cyl         -1.784173   0.618192  -2.886  0.00758 ** 
## disp         0.007404   0.012081   0.613  0.54509    
## wt          -3.583425   1.186504  -3.020  0.00547 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared:  0.8327, Adjusted R-squared:  0.8079 
## F-statistic: 33.59 on 4 and 27 DF,  p-value: 4.038e-10

Our R-squared is now at 0.83, but the model is likely overfitted: neither “am” (p = 0.92) nor “disp” (p = 0.55) is significant. We try a final model without these two terms and compare the results of all three.
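One plausible reason that “disp” and “am” lose significance once “cyl” and “wt” are in the model is that the predictors carry overlapping information. A quick way to check, sketched here under the assumption that pairwise correlation is an adequate first look, is a correlation matrix of the four predictors (the name predictors is ours).

```r
# Pairwise correlations among the predictors used in fit2
predictors <- mtcars[, c("am", "cyl", "disp", "wt")]
round(cor(predictors), 2)
```

Strong correlations between predictors inflate the standard errors of their coefficients, which would be consistent with the large p-values seen above.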

fit3 <- lm(mpg ~ cyl+wt, mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
anova(fit, fit3, fit2)
AIC(fit, fit2, fit3)
shapiro.test(fit3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.93745, p-value = 0.06341

The third model (fit3), which includes only “cyl” and “wt”, still has an R-squared of 0.83 after rounding (0.8302 versus 0.8327 for fit2), so we have lost very little explanatory power by removing the other variables. Its adjusted R-squared is in fact higher (0.8185 versus 0.8079).
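To put the three fits side by side, the following sketch collects R-squared, adjusted R-squared and the residual standard error into one table. The models are refitted here so the block runs on its own; the names models and fit_stats are ours.

```r
# Refit the three models from the text so this block is self-contained
models <- list(
  fit  = lm(mpg ~ am, data = mtcars),
  fit2 = lm(mpg ~ am + cyl + disp + wt, data = mtcars),
  fit3 = lm(mpg ~ cyl + wt, data = mtcars)
)

# One row per model, one column per summary statistic
fit_stats <- t(sapply(models, function(m) {
  s <- summary(m)
  c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared, sigma = s$sigma)
}))
round(fit_stats, 4)
```

Adjusted R-squared penalizes extra terms, which is why it can favor the smaller model even when plain R-squared is marginally lower.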

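When we compare all three models, we see that fit3 (mpg ~ cyl + wt) shows the greatest improvement: the residual standard error falls from 4.902 for fit to 2.568 for fit3, and the extra terms in fit2 do not reduce it further (2.642). fit3 also has the lowest AIC, making it the best model of the three.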
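One caveat on the comparison: the F test in anova() is designed for nested models, and fit (mpg ~ am) is not nested in fit3 (mpg ~ cyl + wt). A minimal sketch of a comparison that avoids this issue, assuming the same three formulas as above, tests only the nested pair and ranks all three by AIC (the name aic_table is ours).

```r
# Refit the three models so this block is self-contained
fit  <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)
fit3 <- lm(mpg ~ cyl + wt, data = mtcars)

# fit3 is nested in fit2, so an F test on the added terms (am, disp) is valid
anova(fit3, fit2)

# AIC can rank non-nested models fitted to the same response and data
aic_table <- AIC(fit, fit2, fit3)
aic_table[order(aic_table$AIC), ]
```

A large p-value in the nested F test would indicate that “am” and “disp” add little once “cyl” and “wt” are included.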

Lastly, the graphical analysis in Fig 2 to Fig 5 shows residuals spread evenly around the fitted values and an approximately normal distribution. The Shapiro-Wilk test is consistent with this: it does not reject normality at the 0.05 level (W = 0.937, p = 0.063).
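The appendix figures are not reproduced in this section. As a sketch, and assuming that Fig 2 to Fig 5 are the four default diagnostic panels that base R draws for a linear model, they could be generated as follows.

```r
# Refit the final model so this block is self-contained
fit3 <- lm(mpg ~ cyl + wt, data = mtcars)

# Draw the four default lm diagnostics in a 2 x 2 grid: residuals vs fitted,
# normal Q-Q, scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit3)
par(op)  # restore the previous graphics settings

# Formal normality check on the residuals
shapiro.test(residuals(fit3))
```

With only 32 observations the Shapiro-Wilk test has limited power, so the plots and the test are best read together.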

We can also see in the expanded QQ plot in Fig 6 that the values remain within the confidence bands.

APPENDIX