Executive Summary

This paper explores the Motor Trend data set (mtcars) to determine whether there is a statistically significant relationship between transmission type and miles per gallon (MPG) across a variety of vehicles. The two goals are to determine which transmission type is better for MPG and to quantify the difference. With no adjustment for other factors such as weight, cylinders, or horsepower, cars with a manual transmission average approximately 7.2 MPG more than cars with an automatic transmission. We then performed a stepwise regression to arrive at the best-fitting model and added an interaction term between weight and transmission type, because the heavier cars in the data set tend to be automatics. The resulting model indicates that a manual transmission yields 9.90 - 3.14 × weight more MPG than an automatic (weight in units of 1000 lb) when all other variables are held constant, so the estimated advantage shrinks as weight increases.

Exploratory Analysis

The data set ships with R in the datasets package and can be loaded with library(datasets). Figure 1 in the Appendix suggests that manual transmissions get better gas mileage. Before building a regression model, we check the distribution of mpg. Figure 2 shows the density plot: mpg is approximately Gaussian, with no outliers producing a long tail. Figure 3 shows the density by transmission type.

A t-test of the difference in mean MPG between automatic and manual transmissions gives a 95% confidence interval of (-11.28, -3.21) for the automatic mean minus the manual mean. The interval does not contain 0, and the p-value of 0.0014 is small enough that we reject the null hypothesis of equal means in favor of the alternative that the difference in means is not 0. There is therefore a statistically significant difference in gas mileage between manual and automatic transmissions, and manuals get better MPG. How much better? By the difference in means, about 7.24 MPG.
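
The test described above can be reproduced with a short sketch. It assumes that am was recoded as a labelled factor (the later output shows a coefficient named amManual, but the recoding step itself is not shown in this report), and the object names cars, tt, and mean_diff are illustrative.

```r
# Assumption: am is recoded as a factor with labels Automatic/Manual.
# The report does not show this step; it is inferred from the output.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Welch two-sample t-test (R's default, var.equal = FALSE)
tt <- t.test(mpg ~ am, data = cars)

tt$conf.int  # 95% interval for mean(Automatic) - mean(Manual)
tt$p.value

# Difference in group means, Manual minus Automatic
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])
mean_diff
```

Because the formula interface subtracts the second factor level from the first, the interval is negative when manuals have the higher mean.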

Simple Linear Regression

We first build a regression model using only the binary variable am as a predictor of MPG, and use the t-test above to confirm that the model behaves as expected.

Simple <- lm(mpg~am, data=mtcars)
summary(Simple)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04

As you can see, the intercept of the model is the average MPG for an Automatic, and for a Manual the prediction increases by the slope, which equals the difference between the means we computed previously. The slope has a very small p-value, so the variable am is statistically significant. Unfortunately, the adjusted \(R^2\) for this model indicates that it explains only about 34% of the variability in MPG (the unadjusted \(R^2\) is about 36%).
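
The link between the Simple model and the group means can be checked directly. This is a minimal sketch under the same assumption as the rest of the report's output, namely that am is a factor labelled Automatic/Manual; the names cars and group_means are illustrative.

```r
# Assumption: am recoded as a labelled factor (not shown in the report).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

Simple <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept should equal the Automatic mean
intercept_ok <- isTRUE(all.equal(coef(Simple)[["(Intercept)"]],
                                 group_means[["Automatic"]]))

# Slope should equal the difference in means (Manual - Automatic)
slope_ok <- isTRUE(all.equal(coef(Simple)[["amManual"]],
                             group_means[["Manual"]] - group_means[["Automatic"]]))

c(intercept_ok = intercept_ok, slope_ok = slope_ok)

# Share of variability explained, unadjusted and adjusted
r2 <- summary(Simple)$r.squared
adj_r2 <- summary(Simple)$adj.r.squared
c(r2 = r2, adj_r2 = adj_r2)
```

The slope's p-value differs from the t-test's because lm pools the variance across both groups, whereas R's default t-test does not assume equal variances.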

Multivariate Linear Regression

To use the other mtcars variables to develop the best-fitting model, we can use stepwise regression, which is available in R through the step command. We use both forward selection (starting with no variables and testing the addition of each variable) and backward elimination (starting with all candidate variables and testing the deletion of each), repeating this process until no further improvement is possible.

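# Start from the full model with every other mtcars variable as a predictor
fit <- lm(mpg~., data=mtcars)
# Stepwise selection (AIC by default), allowing both additions and deletions
Algo <- step(fit, direction="both")
summary(Algo)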

The algorithm selects a model in which am is the predictor of interest and hp, wt, and cyl are included as adjustment (potentially confounding) variables that improve the fit. Automated selection can only take us so far, however. We know from the data that heavier cars tend to be automatics, at least in this data set; Fig. 4 illustrates the relationship between wt and transmission type. To adjust for this, we add an interaction term between am and wt to the model.
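
The claim that heavier cars tend to be automatics can also be checked numerically. The sketch below assumes am is recoded as a labelled factor, as the report's output implies; the names cars, mean_wt, and range_wt are illustrative.

```r
# Assumption: am recoded as a labelled factor (not shown in the report).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean weight (in 1000 lb) for each transmission type
mean_wt <- tapply(cars$wt, cars$am, mean)
mean_wt

# Weight range for each transmission type, to see how much the groups overlap
range_wt <- tapply(cars$wt, cars$am, range)
range_wt
```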

Final <- lm(mpg~ wt + cyl + hp + am + am:wt, data=mtcars)
summary(Final)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 30.65246564 2.90990761 10.533828 1.111073e-10
## wt          -2.20686780 0.85204530 -2.590083 1.577795e-02
## cyl6        -2.38062017 1.37363506 -1.733081 9.540117e-02
## cyl8        -2.89910832 2.19666356 -1.319778 1.988677e-01
## hp          -0.01781723 0.01484303 -1.200377 2.412443e-01
## amManual     9.89860309 4.28571455  2.309674 2.944922e-02
## wt:amManual -3.14498752 1.58475347 -1.984528 5.827576e-02
summary(Final)$adj.r.squared
## [1] 0.8563248
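
To make the interaction concrete, the estimated manual-versus-automatic difference can be read off the coefficient table above. The following is a worked sketch using rounded estimates for amManual and wt:amManual; the symbol \(\Delta\) and the example weights are introduced here only for illustration, and wt is in units of 1000 lb.

\[
\Delta(wt) = \hat\beta_{amManual} + \hat\beta_{wt:amManual}\, wt \approx 9.90 - 3.145\, wt
\]

\[
\Delta(2) \approx 9.90 - 6.29 = 3.61, \qquad \Delta(4) \approx 9.90 - 12.58 = -2.68, \qquad \Delta(wt) = 0 \text{ at } wt \approx 9.90 / 3.145 \approx 3.15
\]

Under this model the estimated manual advantage is positive for lighter cars and turns negative above roughly 3,150 lb, although the interaction coefficient carries substantial uncertainty (its p-value is about 0.058).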

Our Final model explains about 86% of the variability in MPG (adjusted \(R^2\)). Let’s compare the Simple model to our Final one:

anova(Simple, Final)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + cyl + hp + am + am:wt
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     25 130.47  5    590.42 22.627 1.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

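The p-value in the ANOVA table above is highly significant, so we reject the null hypothesis that the added terms wt, cyl, hp, and am:wt have no effect: taken together, they significantly improve the fit over the Simple model. This is a joint test. In the coefficient table, cyl, hp, and the interaction are not individually significant at the 0.05 level.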
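
The F statistic in the table can be reproduced by hand from the residual sums of squares, which shows what the test compares. This is a worked sketch; the subscripts S and F (for Simple and Final) are notation introduced here.

\[
F = \frac{(RSS_S - RSS_F)/(df_S - df_F)}{RSS_F/df_F} = \frac{(720.90 - 130.47)/5}{130.47/25} \approx \frac{118.09}{5.219} \approx 22.63
\]

The numerator is the average reduction in residual sum of squares per added parameter, and the denominator is the residual variance of the Final model. The small discrepancy from the table's 590.42 comes from rounding in the printed RSS values.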

Residuals and Diagnostic Tests

In Fig. 5 we can see the residual plots for our Final model.

1. Residuals vs Fitted: The points are scattered randomly around the line with no clear pattern, supporting the linearity and independence assumptions.
2. Normal Q-Q: The points are close to the line, indicating that the residuals are approximately normal.
3. Scale-Location: The points show no trend in spread, consistent with constant variance.
4. Residuals vs Leverage: The points all lie within the 0.5 Cook's distance bands, indicating no unduly influential outliers.

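To measure how much each observation influences each coefficient estimate, we can look at the dfbetas of the Final model, that is, the change in each coefficient, scaled by its standard error, when that observation is dropped. A rule of thumb is to check that \(-1<dfbetas<1\) for every observation and coefficient.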

sum((abs(dfbetas(Final)))>1)
## [1] 0

No dfbetas value falls outside that range. Together with the residual plots, this gives no evidence that the model violates the basic assumptions of multivariate linear regression.
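
Beyond counting values over the threshold, it can help to see how close each coefficient comes to it. The sketch below refits the Final model under the assumption that am and cyl were recoded as factors (inferred from coefficient names such as amManual and cyl6; the recoding is not shown in this report). The names cars, db, max_abs, and n_flagged are illustrative.

```r
# Assumption: am and cyl recoded as factors (not shown in the report).
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

Final <- lm(mpg ~ wt + cyl + hp + am + am:wt, data = cars)

# One row per car, one column per coefficient
db <- dfbetas(Final)

# Largest absolute standardized change for each coefficient
max_abs <- apply(abs(db), 2, max)
max_abs

# The car responsible for the largest change in each coefficient
most_influential <- rownames(db)[apply(abs(db), 2, which.max)]
names(most_influential) <- colnames(db)
most_influential

# Number of values beyond the rule-of-thumb threshold of 1
n_flagged <- sum(abs(db) > 1)
n_flagged
```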

APPENDIX

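Fig. 1: MPG by transmission type

Fig. 2: Density of mpg

Fig. 3: Density of mpg by transmission type

Fig. 4: Relationship between wt and transmission type

Fig. 5: Residual plots for the Final model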