In this analysis, we explore the relationship between miles per gallon and transmission (automatic or manual). There are three steps to this analysis: exploratory data analysis, model building, and diagnostics.
The result of the analysis is that manual transmission does appear to predict higher mpg. I estimate a 2.62 mpg improvement for a car with manual transmission versus an automatic car with the same values for all other variables in the model. My final model has an adjusted R-squared of 0.8478, so the model fits the data well, although the transmission coefficient itself is only marginally significant (p = 0.054).
First, let’s look at the data set. The help page (?mtcars) contains descriptions of the variables. Here is the head of the data.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Next, we can take a look at some summary plots:
par(mfrow=c(3,4))
Several of the variables are categorical rather than continuous, so we are going to convert them to factors:
mtcars$cyl<-as.factor(mtcars$cyl)
mtcars$vs<-as.factor(mtcars$vs)
mtcars$am<-as.factor(mtcars$am)
mtcars$gear<-as.factor(mtcars$gear)
mtcars$carb<-as.factor(mtcars$carb)
We can use the ggpairs function to assess correlation between variables:
library(GGally)
library(ggplot2)
ggpairs(mtcars)
Some observations: At first glance, there does seem to be a relationship between transmission and mpg. Pretty much all other variables seem to have either a positive or negative relationship with mpg.
* Positive relationships: rear axle ratio, 1/4 mile time, transmission, gear
* Negative relationships: cylinders, displacement, horsepower, weight, carburetors
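To put rough numbers on these visual impressions, here is a minimal sketch that computes the correlation of each variable with mpg. It assumes a fresh, all-numeric copy of the data (the name `cars` is introduced only for this example), and it treats the categorical variables as numeric codes, so those values are only a crude guide.

```r
# Fresh copy of the built-in data, so the factor conversions above do not interfere.
# `cars` and `mpg_cor` are names introduced for this sketch.
cars <- datasets::mtcars

# Pearson correlation of every other column with mpg,
# sorted from most negative to most positive.
mpg_cor <- sort(cor(cars)["mpg", colnames(cars) != "mpg"])
round(mpg_cor, 2)
```

The signs should line up with the two lists above: negative for cylinders, displacement, horsepower, weight, and carburetors, and positive for the rest.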
The vanilla linear regression incorporating all variables doesn’t give us useful information, as none of the coefficients is significant at the 0.05 level. There are too many parameters relative to the number of observations: the model estimates 17 coefficients from only 32 cars.
summary(lm(mpg~factor(cyl)+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+factor(gear)+factor(carb),data=mtcars))
##
## Call:
## lm(formula = mpg ~ factor(cyl) + disp + hp + drat + wt + qsec +
## factor(vs) + factor(am) + factor(gear) + factor(carb), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   23.87913   20.06582   1.190   0.2525
## factor(cyl)6  -2.64870    3.04089  -0.871   0.3975
## factor(cyl)8  -0.33616    7.15954  -0.047   0.9632
## disp           0.03555    0.03190   1.114   0.2827
## hp            -0.07051    0.03943  -1.788   0.0939 .
## drat           1.18283    2.48348   0.476   0.6407
## wt            -4.52978    2.53875  -1.784   0.0946 .
## qsec           0.36784    0.93540   0.393   0.6997
## factor(vs)1    1.93085    2.87126   0.672   0.5115
## factor(am)1    1.21212    3.21355   0.377   0.7113
## factor(gear)4  1.11435    3.79952   0.293   0.7733
## factor(gear)5  2.52840    3.73636   0.677   0.5089
## factor(carb)2 -0.97935    2.31797  -0.423   0.6787
## factor(carb)3  2.99964    4.29355   0.699   0.4955
## factor(carb)4  1.09142    4.44962   0.245   0.8096
## factor(carb)6  4.47757    6.38406   0.701   0.4938
## factor(carb)8  7.25041    8.36057   0.867   0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
On the other hand, a simple linear regression on transmission alone isn’t reliable, because it runs a higher risk that confounding variables drive the result.
summary(lm(mpg~factor(am),data=mtcars))
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
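It helps to see what this simple regression is actually measuring. With a single two-level factor as the predictor, the intercept is just the mean mpg of the automatic cars and the slope is the raw difference between the two group means. A minimal sketch, assuming a fresh copy of the data (`cars`, `group_means`, `raw_diff`, and `simple_fit` are names introduced here for illustration):

```r
# Fresh copy of the built-in data; all names below are introduced for this sketch.
cars <- datasets::mtcars

# Mean mpg per transmission type: "0" is automatic, "1" is manual.
group_means <- tapply(cars$mpg, cars$am, mean)

# Unadjusted difference in mean mpg (manual minus automatic).
raw_diff <- unname(group_means["1"] - group_means["0"])

# The same number appears as the slope of the simple regression.
simple_fit <- lm(mpg ~ factor(am), data = cars)
all.equal(raw_diff, unname(coef(simple_fit)["factor(am)1"]))
```

So the 7.245 figure is an unadjusted comparison: it says nothing about whether the manual cars in the sample also happen to be lighter or less powerful.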
All other variables seem to “matter” in the sense that they have some relationship to mpg, so we won’t eliminate any going into the model building. We have an initial sense of how the different variables impact mpg, but we need to be mindful of confounding.
The leaps library has a tool for finding the best set of regression variables. I am using adjusted R-squared as the criterion to evaluate each candidate set. Adjusted R-squared strikes a balance between underfitting and overfitting: it rewards a better fit but penalizes every additional predictor.
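To make that penalty concrete, here is a small sketch that recomputes adjusted R-squared by hand from the standard formula, 1 - (1 - R²)(n - 1)/(n - p - 1), for the two models fitted so far. The helper `adj_r2` and the names `cars`, `simple_fit`, and `full_fit` are made up for this illustration.

```r
# Fresh copy of the built-in data; `cars` and `adj_r2` are names introduced for this sketch.
cars <- datasets::mtcars

# Adjusted R-squared from the plain R-squared, n observations, and p predictors
# (p counts the coefficients excluding the intercept).
adj_r2 <- function(model) {
  r2 <- summary(model)$r.squared
  n <- nobs(model)
  p <- length(coef(model)) - 1
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

simple_fit <- lm(mpg ~ factor(am), data = cars)
full_fit <- lm(mpg ~ factor(cyl) + disp + hp + drat + wt + qsec +
                 factor(vs) + factor(am) + factor(gear) + factor(carb),
               data = cars)

# Compare against the values printed by summary() above.
c(simple = adj_r2(simple_fit), full = adj_r2(full_fit))
```

The full model loses much more between its multiple and adjusted R-squared than the simple model does, because it spends 16 predictors on 32 observations.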
The leaps package has a function called regsubsets that finds the best regression for each number of variables.
library(leaps)
fit<-regsubsets(mpg~factor(cyl)+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+factor(gear)+factor(carb),data=mtcars,nbest=1)
The adjusted R-squared values for the best model of each size can be plotted.
plot(fit,scale="adjr2")
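The plot can be hard to read off precisely, so here is a sketch of how the same information could be pulled out numerically. It assumes the leaps package is installed and uses a fresh copy of the data; `cars`, `subsets`, `subset_summary`, and `best_size` are names introduced for this example.

```r
library(leaps)

# Fresh copy of the built-in data; all names below are introduced for this sketch.
cars <- datasets::mtcars

subsets <- regsubsets(mpg ~ factor(cyl) + disp + hp + drat + wt + qsec +
                        factor(vs) + factor(am) + factor(gear) + factor(carb),
                      data = cars, nbest = 1)

# summary() returns one row per model size, including its adjusted R-squared.
subset_summary <- summary(subsets)
best_size <- which.max(subset_summary$adjr2)

# Coefficients of the subset with the highest adjusted R-squared.
coef(subsets, best_size)
```

Note that regsubsets works on the individual dummy columns, so a factor such as cylinders can be represented in the chosen subset by just one of its levels.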
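Here is the final linear model, using the variable set with the highest adjusted R-squared from the search above. Because regsubsets treats each factor level as a separate dummy variable, cylinders enter only through the six-cylinder indicator, I(cyl==6).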
fit<-lm(mpg~I(cyl==6)+hp+wt+factor(vs)+factor(am),data=mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ I(cyl == 6) + hp + wt + factor(vs) + factor(am),
## data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3317 -1.1979 0.0248 0.9276 4.6697
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     31.28241    3.19021   9.806 3.18e-10 ***
## I(cyl == 6)TRUE -2.20520    1.03213  -2.137  0.04221 *
## hp              -0.03393    0.01044  -3.250  0.00318 **
## wt              -2.36781    0.86855  -2.726  0.01132 *
## factor(vs)1      1.87741    1.24809   1.504  0.14457
## factor(am)1      2.62112    1.29995   2.016  0.05421 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.351 on 26 degrees of freedom
## Multiple R-squared: 0.8724, Adjusted R-squared: 0.8478
## F-statistic: 35.54 on 5 and 26 DF, p-value: 7.991e-11
The results suggest that manual transmission (am=1) predicts better mpg. Note, however, that the p-value (0.054) is just above 0.05, which is traditionally the cutoff for establishing significance. Since it is so close, I will go ahead and support the assertion that manual transmission predicts higher mpg.
Based on the regression analysis, manual transmission yields an average improvement of 2.62 miles per gallon, holding the six-cylinder indicator, horsepower, weight, and vs fixed. The adjusted R-squared is 0.8478, which is very high.
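A confidence interval gives a better feel for the uncertainty around that 2.62 figure than the p-value alone. Here is a minimal sketch that refits the final model on a fresh copy of the data and asks for the 95% interval on the transmission coefficient (`cars`, `final_fit`, and `am_ci` are names introduced for this example).

```r
# Fresh copy of the built-in data; all names below are introduced for this sketch.
cars <- datasets::mtcars

# Same formula as the final model above.
final_fit <- lm(mpg ~ I(cyl == 6) + hp + wt + factor(vs) + factor(am),
                data = cars)

# 95% confidence interval for the manual-transmission coefficient.
am_ci <- confint(final_fit, "factor(am)1", level = 0.95)
am_ci
```

Because the p-value is slightly above 0.05, the lower end of this interval falls just below zero, which is worth keeping in mind when quoting the point estimate.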
Below are standard diagnostic plots which we can use to evaluate the results.
par(mfrow=c(3,2))
plot(fit,which=1:4)
Observations:
* The residuals vs. fitted plot shows no systematic pattern, which is good
* The Q-Q plot is approximately linear, which is good
* The scale-location plot also shows no pattern, which is good
* The Cook’s distance plot labels the three most influential observations; no issues here
Manual transmission is predicted to provide a 2.62 mpg boost versus the same car with an automatic transmission. Six cylinders, horsepower, and weight are additional factors that have a significant impact on mpg. Given the high adjusted R-squared of our model, we are fairly confident of this result.
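The Cook’s distance reading can also be checked numerically rather than by eye. The sketch below refits the final model on a fresh copy of the data and lists the most influential cars. The 4/n cutoff is a common rule of thumb that I am assuming here, not something taken from the plot, and `cars`, `final_fit`, `cd`, and `threshold` are names introduced for this example.

```r
# Fresh copy of the built-in data; all names below are introduced for this sketch.
cars <- datasets::mtcars

# Same formula as the final model above.
final_fit <- lm(mpg ~ I(cyl == 6) + hp + wt + factor(vs) + factor(am),
                data = cars)

# Cook's distance for every observation, largest first.
cd <- cooks.distance(final_fit)
head(sort(cd, decreasing = TRUE), 3)

# Rule-of-thumb cutoff (an assumption, not a hard test): flag values above 4/n.
threshold <- 4 / nrow(cars)
names(cd)[cd > threshold]

# Values above 1 are usually treated as a serious concern.
any(cd > 1)
```

Cars flagged this way are worth a second look, for example by refitting the model without them to see how much the transmission coefficient moves.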