Regression Project: How Does Transmission Type Impact the MPG of Cars?

Executive Summary

Motor Trend magazine wants to know which factors influence the fuel economy (MPG) of cars, and in particular how transmission type affects MPG. We use the mtcars dataset (32 observations of 11 variables) and fit regression models to answer these questions.

Exploratory Data Analysis & Model Selection

We run a pairs plot on mtcars (see Appendix, Section A). {mpg} and {am} are positively correlated (0.60), as are {hp} and {wt} (0.66, although both are negatively correlated with {mpg}) and {am} and {gear} (0.79). {cyl}, however, has a strong negative correlation with {mpg} (-0.85).
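The correlations quoted above can also be read off numerically rather than from the plot. A minimal sketch using base R's cor, assuming the pairs plot reports Pearson correlations (the default of cor):

```r
# Pearson correlation matrix of all mtcars variables
cors <- cor(mtcars)

# Correlations of mpg with the variables discussed above
round(cors["mpg", c("am", "cyl", "hp", "wt")], 2)

# Correlations between pairs of predictors
round(c(hp_wt = cors["hp", "wt"], am_gear = cors["am", "gear"]), 2)
```

Strong correlations between predictors, such as {am} and {gear}, are a hint that not all of them will survive model selection.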

Let’s explore the {mpg}/{am} relationship (see Appendix, Section B). The mean for automatic transmission is 17.14737 mpg and the mean for manual transmission is 24.39231 mpg, so manual transmission appears to give better mileage. To test this, we fit our first model.
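The two group means can be reproduced directly. A short sketch with base R's tapply; the variable names group_means and mean_diff are illustrative only:

```r
# Mean mpg by transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
group_means

# Raw difference between the manual and automatic means
mean_diff <- unname(group_means["1"] - group_means["0"])
mean_diff
```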

fit <- lm(mpg~factor(am),data=mtcars)
summary(fit)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## factor(am)1  7.244939   1.764422  4.106127 2.850207e-04

Observations: The intercept is the empirical mean for automatic transmission (17.147368 mpg). The {am} coefficient is the difference between the manual and automatic means, so manual transmission gives 7.244939 mpg better mileage (manual mean = 24.39231 mpg). Both the intercept and the slope are significant (p-value < 0.001). However, the adjusted R-squared is only 0.3385, so transmission type alone explains roughly a third of the variation in {mpg}, and we need a better-fitting model.
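With a single two-level predictor, this regression is equivalent to a pooled-variance two-sample t-test. The sketch below checks that equivalence and extracts the adjusted R-squared; the names tt and adj_r2 are illustrative:

```r
fit <- lm(mpg ~ factor(am), data = mtcars)

# Pooled-variance t-test of mpg by transmission type; its statistic
# matches the slope's t value in summary(fit) up to sign
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
abs(unname(tt$statistic))

# Adjusted R-squared of the single-predictor model
adj_r2 <- summary(fit)$adj.r.squared
adj_r2
```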

We use the following backward elimination strategy. Start with a model that includes all potential predictor variables. Eliminate variables one at a time: drop the variable with the largest p-value, re-fit the model, and re-assess the remaining variables. Repeat these steps until only variables with statistically significant p-values remain. That is the final model. We start by including all variables.

fit10<-lm(mpg~factor(am)+cyl+disp+hp+drat+wt+qsec+vs+gear+carb,data=mtcars)
max(summary(fit10)$coef[-1,4])
## [1] 0.9160874

The largest p-value (0.9160874) belongs to {cyl}, so we eliminate that variable first. We repeat the steps above, removing one variable at a time, in this order: {vs}, {carb}, {gear}, {drat}, {disp}, {hp}. We are left with the following model (R's step command could automate a similar search, although by default it selects on AIC rather than on p-values):
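The manual elimination steps can be sketched as a loop. This is an illustration under stated assumptions, not the code used for the report: the 0.05 cutoff (alpha) and the helper names dat, predictors and model are choices made here.

```r
# Backward elimination by largest p-value.
# Assumption: a variable is kept once its p-value is at or below alpha = 0.05.
dat <- transform(mtcars, am = factor(am))
predictors <- setdiff(names(dat), "mpg")
alpha <- 0.05

repeat {
  model <- lm(reformulate(predictors, response = "mpg"), data = dat)
  coefs <- summary(model)$coefficients
  # p-values of all terms except the intercept, with their names kept
  pvals <- setNames(coefs[-1, 4], rownames(coefs)[-1])
  worst <- which.max(pvals)
  if (pvals[worst] <= alpha || length(predictors) == 1) break
  # The factor am appears as coefficient "am1"; map it back to the column name
  drop_name <- sub("^am1$", "am", names(pvals)[worst])
  message("dropping ", drop_name, " (p = ", signif(pvals[worst], 3), ")")
  predictors <- setdiff(predictors, drop_name)
}

predictors
```

The loop stops at the first model in which every remaining term has a p-value at or below the cutoff. A different cutoff could stop earlier or later, so the final model depends on that choice.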

fit17<-lm(mpg~factor(am)+wt+qsec,data=mtcars)
summary(fit17)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## factor(am)1  2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04

Observations from the (fit17) model: The intercept (9.617781 mpg) is the estimated mileage of an automatic car with {wt} and {qsec} both equal to 0. This is an extrapolation with no physical meaning, and the estimate is not significantly different from zero (p = 0.178). The {am} coefficient is 2.935837: manual transmission gives an estimated 2.935837 mpg better mileage than automatic, holding {wt} and {qsec} constant. (At {wt} = 0 and {qsec} = 0, the estimate for a manual car would be 9.617781 + 2.935837 = 12.55362 mpg.) There is an estimated 3.916504 mpg decrease per 1000 lb increase in the weight of the car ({wt} is recorded in units of 1000 lb), holding {am} and {qsec} constant. There is an estimated 1.225886 mpg increase per one-second increase in {qsec}, the quarter-mile time, holding {am} and {wt} constant.
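To make "holding wt and qsec constant" concrete, the sketch below predicts mileage for two hypothetical cars that differ only in transmission. Setting both cars to the sample means of {wt} and {qsec} is an arbitrary choice made here for illustration:

```r
fit17 <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# Two hypothetical cars with average weight and quarter-mile time,
# differing only in transmission (am: 0 = automatic, 1 = manual)
new_cars <- data.frame(am   = c(0, 1),
                       wt   = mean(mtcars$wt),
                       qsec = mean(mtcars$qsec))
pred <- predict(fit17, newdata = new_cars)
pred

# The gap between the two predictions equals the am coefficient
gap <- unname(pred[2] - pred[1])
gap
```

Any other fixed values of {wt} and {qsec} would give the same gap, because the model has no interaction terms.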

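Uncertainty noted: The residual standard error is 2.459 mpg. The adjusted R-squared is 0.8336, so the model explains about 83% of the variation in {mpg} and leaves about 17% unexplained.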
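The uncertainty in the transmission effect itself can be expressed as a confidence interval. A minimal sketch using base R's confint and sigma; the 95% level is an assumption made here:

```r
fit17 <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# Residual standard error of the final model
rse <- sigma(fit17)
rse

# 95% confidence intervals for all coefficients; the row for
# factor(am)1 is the interval for the manual-vs-automatic difference
ci <- confint(fit17, level = 0.95)
ci["factor(am)1", ]
```

Since the p-value for {am} is only just below 0.05, the interval should exclude zero by a narrow margin.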

Now we apply anova (analysis of variance) to compare the two models.

anova(fit,fit17)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observations: The RSS (residual sum of squares) falls from 720.90 in the old model to 169.29 in the new one. The p-value of the F-test is 1.55e-09, far below alpha = 0.001, so the new model (fit17) fits significantly better than the old one (fit). The adjusted R-squared for (fit17) is 0.8336, much better than the 0.3385 for (fit).
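The F statistic in the table can be recomputed by hand from the two residual sums of squares, which shows what the test measures. A sketch with illustrative variable names:

```r
fit   <- lm(mpg ~ factor(am), data = mtcars)
fit17 <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

rss_small <- deviance(fit)    # RSS of the smaller model
rss_big   <- deviance(fit17)  # RSS of the larger model

df_num <- df.residual(fit) - df.residual(fit17)  # number of added predictors
df_den <- df.residual(fit17)                     # residual df of larger model

# F = (drop in RSS per added predictor) / (residual variance of larger model)
f_stat <- ((rss_small - rss_big) / df_num) / (rss_big / df_den)
p_val  <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
c(F = f_stat, p = p_val)
```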

We now apply residual diagnostics to (fit17) (see Appendix, Section C).

Observations: In the {Residual vs Fitted plot}, the variability of the residuals is nearly constant. In the {Q-Q plot}, the residuals of the model are nearly normal. In the {Residual vs Leverage plot}, three cars are labelled: “Chrysler Imperial”, “Merc 230” and “Fiat 128”. R's plot method labels the points with the largest Cook's distances, so these are the most influential observations and not necessarily the ones with the highest leverage. Leverage itself can be checked by running the hatvalues(fit17) command.
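The follow-up check mentioned above can be sketched as follows. The 2p/n threshold is a common rule of thumb assumed here for illustration; it is not part of the original analysis:

```r
fit17 <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

lev  <- hatvalues(fit17)       # leverage of each car
cook <- cooks.distance(fit17)  # influence of each car on the fit

# Cars with the highest leverage and the largest Cook's distance
head(sort(lev,  decreasing = TRUE), 3)
head(sort(cook, decreasing = TRUE), 3)

# Rule of thumb (assumption): flag leverage above 2 * p / n
p <- length(coef(fit17))
n <- nrow(mtcars)
names(lev)[lev > 2 * p / n]
```

Comparing the two sorted lists shows whether a car stands out because of unusual predictor values (leverage), a poor fit, or both.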

Conclusion

Based on our analysis, manual transmission gives better mileage than automatic, but the size of the effect depends on what else is in the model. In our initial model (fit), which considers only the {am} variable, the difference between manual and automatic transmission is 7.244939 mpg. In our final model (fit17), which also accounts for {wt} and {qsec}, the difference shrinks to 2.935837 mpg and is only marginally significant (p = 0.047).

Appendix

Section A : mtcars pairs plot

library(GGally)
library(ggplot2)
# Newer GGally versions no longer accept `params`; pass the smoother
# options through wrap() instead
g <- ggpairs(mtcars, lower = list(continuous = wrap("smooth", method = "loess")))
g

Section B : mpg-am box plot

# Group means, selected by column name rather than by position
autoMean <- mean(mtcars$mpg[mtcars$am == 0])
manualMean <- mean(mtcars$mpg[mtcars$am == 1])
# `linewidth` needs ggplot2 >= 3.4; on older versions use `size` instead
g2 <- ggplot(data = mtcars, aes(y = mpg, x = factor(am), fill = factor(am))) +
        geom_boxplot(colour = "black", linewidth = 0.5) +
        geom_hline(yintercept = autoMean, colour = "red", linewidth = 2) +
        geom_hline(yintercept = manualMean, colour = "blue", linewidth = 2) +
        labs(x = "Transmission (0 = Automatic, 1 = Manual)",
             y = "MPG",
             title = "MPG By Transmission")
g2

Section C : Residual Plots for Final Model (fit17)

par(mfrow = c(2, 2))
plot(fit17)