Regression Models Final Project: Analyzing MPG

Executive Summary

In this analysis, we explore the relationship between miles per gallon and transmission (automatic or manual). There are three steps to this analysis: exploratory data analysis, model building, and diagnostics.

The result of the analysis is the observation that manual transmission is associated with higher mpg. I estimate a 2.62 mpg improvement for a car with manual transmission versus an automatic transmission car with all of the same values for the other model variables, although this estimate falls just short of significance at the 0.05 level (p = 0.054). My final model has an adjusted R-squared of 0.8478, so it explains most of the variation in mpg.

Exploratory data analysis

First, let’s look at the data set. The help page for mtcars (?mtcars) contains descriptions of the variables. Here is the head of the data.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Next, we can take a look at some summary plots:

par(mfrow=c(3,4))
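The plotting calls themselves are not shown above. As a minimal sketch of what such a panel might look like, the code below draws one scatterplot of mpg against each of the other ten variables. The loop, the variable names `cars` and `predictors`, and the choice of scatterplots are assumptions for illustration, not the original code.

```r
# Work on a fresh copy so the original data set is untouched
cars <- datasets::mtcars
predictors <- setdiff(names(cars), "mpg")   # the ten non-mpg columns

op <- par(mfrow = c(3, 4))                   # 3 x 4 grid, as in the report
for (v in predictors) {
  # One panel per predictor, with mpg always on the vertical axis
  plot(cars[[v]], cars$mpg, xlab = v, ylab = "mpg", main = paste("mpg vs", v))
}
par(op)                                      # restore the previous graphics settings
```

Ten panels fit in the twelve-cell grid, leaving the last two cells empty.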

Several of the variables take only a few discrete values, so we are going to change them to factors:

mtcars$cyl<-as.factor(mtcars$cyl)
mtcars$vs<-as.factor(mtcars$vs)
mtcars$am<-as.factor(mtcars$am)
mtcars$gear<-as.factor(mtcars$gear)
mtcars$carb<-as.factor(mtcars$carb)

We can use the ggpairs function to assess correlation between variables:

library(GGally)
library(ggplot2)
ggpairs(mtcars)

Some observations: At first glance, there does seem to be a relationship between transmission and mpg. Pretty much all other variables seem to have either a positive or negative relationship with mpg.

* Positive relationships: rear axle ratio, 1/4 mile time, transmission, gear
* Negative relationships: cylinders, displacement, horsepower, weight, carburetors
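The direction of each relationship can be cross-checked numerically. The sketch below computes the correlation of every column with mpg on an untouched, all-numeric copy of the data. Treating the 0/1 and count variables as numeric is a simplification made here for a quick check, and the names `cars`, `r`, `positive`, and `negative` are illustrative.

```r
cars <- datasets::mtcars            # original copy, before any factor conversion
r <- cor(cars)["mpg", ]             # correlation of every column with mpg
r <- r[names(r) != "mpg"]           # drop mpg's correlation with itself

print(round(sort(r), 2))            # most negative to most positive

positive <- names(r)[r > 0]         # variables that rise with mpg
negative <- names(r)[r < 0]         # variables that fall as mpg rises
```

A correlation only describes a pairwise linear association, so these signs say nothing yet about what happens once other variables are held fixed.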

The vanilla linear regression incorporating all variables doesn’t give us useful information, as no coefficient is significant at the 0.05 level. With 17 coefficients estimated from only 32 observations, there are too many parameters relative to the amount of data.

summary(lm(mpg~factor(cyl)+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+factor(gear)+factor(carb),data=mtcars))
## 
## Call:
## lm(formula = mpg ~ factor(cyl) + disp + hp + drat + wt + qsec + 
##     factor(vs) + factor(am) + factor(gear) + factor(carb), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   23.87913   20.06582   1.190   0.2525  
## factor(cyl)6  -2.64870    3.04089  -0.871   0.3975  
## factor(cyl)8  -0.33616    7.15954  -0.047   0.9632  
## disp           0.03555    0.03190   1.114   0.2827  
## hp            -0.07051    0.03943  -1.788   0.0939 .
## drat           1.18283    2.48348   0.476   0.6407  
## wt            -4.52978    2.53875  -1.784   0.0946 .
## qsec           0.36784    0.93540   0.393   0.6997  
## factor(vs)1    1.93085    2.87126   0.672   0.5115  
## factor(am)1    1.21212    3.21355   0.377   0.7113  
## factor(gear)4  1.11435    3.79952   0.293   0.7733  
## factor(gear)5  2.52840    3.73636   0.677   0.5089  
## factor(carb)2 -0.97935    2.31797  -0.423   0.6787  
## factor(carb)3  2.99964    4.29355   0.699   0.4955  
## factor(carb)4  1.09142    4.44962   0.245   0.8096  
## factor(carb)6  4.47757    6.38406   0.701   0.4938  
## factor(carb)8  7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
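The concern about too many parameters can be checked directly by counting coefficients and residual degrees of freedom. This is a small sketch that refits the same full model on a fresh copy of the data; the object names `cars`, `full`, `n_obs`, `n_coef`, and `df_left` are illustrative.

```r
cars <- datasets::mtcars
full <- lm(mpg ~ factor(cyl) + disp + hp + drat + wt + qsec +
             factor(vs) + factor(am) + factor(gear) + factor(carb),
           data = cars)

n_obs   <- nrow(cars)            # number of cars in the data set
n_coef  <- length(coef(full))    # intercept plus one slope per model-matrix column
df_left <- df.residual(full)     # observations left over for estimating the error variance

cat(n_obs, "observations,", n_coef, "coefficients,", df_left, "residual df\n")
```

Roughly half of the information in the data is spent on estimating coefficients, which is consistent with the large standard errors in the output above.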

On the other hand, a simple linear regression on transmission alone isn’t reliable either, because it runs a higher risk that confounding variables are driving the result.

summary(lm(mpg~factor(am),data=mtcars))
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
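
With a single two-level predictor, the regression slope is simply the difference between the two group means, which is why it cannot account for any other variable. The sketch below illustrates this on a fresh copy of the data; the names `cars`, `group_means`, `simple`, `mean_diff`, and `slope` are illustrative.

```r
cars <- datasets::mtcars
group_means <- tapply(cars$mpg, cars$am, mean)   # named "0" (automatic) and "1" (manual)
simple <- lm(mpg ~ factor(am), data = cars)

mean_diff <- unname(group_means["1"] - group_means["0"])   # manual minus automatic
slope     <- unname(coef(simple)["factor(am)1"])           # coefficient from the simple model

print(c(mean_diff = mean_diff, slope = slope))             # the two values agree
```

Any variable that differs systematically between manual and automatic cars, such as weight, is folded into this raw difference.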

Conclusions

All other variables seem to “matter” in the sense that they have some relationship to mpg, so we won’t eliminate any going into the model building. We have an initial sense of how different variables impact mpg, but need to be mindful of confounding.

Model building

The leaps library has a tool for finding the best set of regression variables. I am using adjusted R-squared to evaluate candidate variable sets. Adjusted R-squared rewards goodness of fit but penalizes each additional parameter, so it strikes a balance between underfitting and overfitting.
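For reference, adjusted R-squared is computed from ordinary R-squared as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below checks that formula against the value R reports. The model `mpg ~ hp + wt` is chosen only for illustration and is not the report's final model.

```r
cars <- datasets::mtcars
m <- lm(mpg ~ hp + wt, data = cars)   # illustrative model only
s <- summary(m)

n <- nrow(cars)                       # number of observations
p <- length(coef(m)) - 1              # number of predictors, excluding the intercept

# Penalize R-squared by the ratio of total to residual degrees of freedom
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(manual = adj_manual, reported = s$adj.r.squared))
```

Plugging in the full model's reported values (R-squared 0.8931, n = 32, 16 predictors) gives about 0.779, matching its output.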

The leaps package provides a function called regsubsets that finds the best regression for each number of variables.

library(leaps)
fit<-regsubsets(mpg~factor(cyl)+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+factor(gear)+factor(carb),data=mtcars,nbest=1)

The adjusted r-square values for each combination can be plotted.

plot(fit,scale="adjr2")
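The same information can be read off numerically rather than from the plot. The sketch below assumes the leaps package is installed and uses the `adjr2` and `which` components of the regsubsets summary. The object names `cars`, `subsets`, `s`, and `best_size` are illustrative.

```r
library(leaps)   # assumes the leaps package is installed

cars <- datasets::mtcars
subsets <- regsubsets(mpg ~ factor(cyl) + disp + hp + drat + wt + qsec +
                        factor(vs) + factor(am) + factor(gear) + factor(carb),
                      data = cars, nbest = 1)
s <- summary(subsets)

best_size <- which.max(s$adjr2)              # model size with the highest adjusted R-squared
print(round(s$adjr2, 4))                     # one adjusted R-squared per model size
print(names(which(s$which[best_size, ])))    # model-matrix columns included at that size
```

Note that regsubsets only searches up to its default maximum subset size unless `nvmax` is raised, so "best" here means best among the sizes it examined.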

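Here is the final linear model, the one with the highest adjusted R-squared among the subsets considered. The term I(cyl==6) is an indicator for six-cylinder cars: regsubsets treats each dummy column of a factor as a separate candidate variable, so a single level of cyl can be selected on its own.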

fit<-lm(mpg~I(cyl==6)+hp+wt+factor(vs)+factor(am),data=mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ I(cyl == 6) + hp + wt + factor(vs) + factor(am), 
##     data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3317 -1.1979  0.0248  0.9276  4.6697 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     31.28241    3.19021   9.806 3.18e-10 ***
## I(cyl == 6)TRUE -2.20520    1.03213  -2.137  0.04221 *  
## hp              -0.03393    0.01044  -3.250  0.00318 ** 
## wt              -2.36781    0.86855  -2.726  0.01132 *  
## factor(vs)1      1.87741    1.24809   1.504  0.14457    
## factor(am)1      2.62112    1.29995   2.016  0.05421 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.351 on 26 degrees of freedom
## Multiple R-squared:  0.8724, Adjusted R-squared:  0.8478 
## F-statistic: 35.54 on 5 and 26 DF,  p-value: 7.991e-11

The results suggest that manual transmission (am=1) predicts better mpg. Note, however, that the p-value for this coefficient (0.054) is just above 0.05, the traditional cutoff for establishing significance, so the evidence is suggestive rather than conclusive. I will nonetheless support the assertion that manual transmission predicts higher mpg.

Based on the regression analysis, manual transmission yields an average improvement of 2.62 miles per gallon, holding the other variables in the model fixed. The adjusted R-squared is 0.8478, which is very high.
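A confidence interval conveys the uncertainty around the 2.62 estimate more directly than the p-value alone. The sketch below refits the final model on a fresh copy of the data and uses `confint` for the transmission coefficient. Because the p-value is slightly above 0.05, the 95% interval should extend slightly below zero. The names `cars`, `final`, and `ci_am` are illustrative.

```r
cars <- datasets::mtcars
final <- lm(mpg ~ I(cyl == 6) + hp + wt + factor(vs) + factor(am), data = cars)

# 95% confidence interval for the manual-transmission coefficient
ci_am <- confint(final, "factor(am)1", level = 0.95)   # 1 x 2 matrix: lower, upper
print(ci_am)
```

An interval that barely includes zero is consistent with reading the transmission effect as likely positive but not firmly established.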

Diagnostics

Below are standard diagnostic plots which we can use to evaluate the results.

par(mfrow=c(2,2))
plot(fit,which=1:4)

Observations:

* The residuals vs fitted plot shows no systematic pattern, which is good
* The Q-Q plot is approximately linear, which is good
* The scale-location plot also shows no pattern, which is good
* The Cook’s distance plot labels the three most influential observations; no issues here
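Influence and leverage are related but distinct: Cook's distance measures how much the fitted coefficients change when a car is dropped, while leverage measures how unusual a car's predictor values are. The sketch below computes both for the final model, refit on a fresh copy of the data. The names `cars`, `final`, `cd`, and `lev` are illustrative.

```r
cars <- datasets::mtcars
final <- lm(mpg ~ I(cyl == 6) + hp + wt + factor(vs) + factor(am), data = cars)

cd  <- cooks.distance(final)    # influence of each car on the fitted coefficients
lev <- hatvalues(final)         # leverage: how unusual each car's predictors are

print(head(sort(cd, decreasing = TRUE), 3))    # the three most influential cars
print(head(sort(lev, decreasing = TRUE), 3))   # the three highest-leverage cars
```

The three cars with the largest Cook's distance should correspond to the points labeled in the Cook's distance plot, and they need not be the same cars as those with the highest leverage.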

Conclusion

Manual transmission is predicted to provide a 2.62 mile/gallon boost versus the same car with an automatic transmission, though this effect is only marginally significant (p = 0.054). Six cylinders, horsepower, and weight are additional factors that have a significant impact on mpg. The model's high adjusted R-squared indicates that it fits the data well, so we are fairly confident of this result.