The purpose of this project is to provide the magazine “Motor Trend” with an analysis of the relationship between transmission type (automatic/manual) and MPG (miles per gallon). We need to answer two questions:
- “Is an automatic or manual transmission better for MPG”
- “Quantify the MPG difference between automatic and manual transmissions”
Data
We are using a dataset from 1974, “mtcars”, containing 32 observations (car models) and 11 variables. We’ll start with two variables: “mpg” as the outcome and “am” (transmission, 0 = automatic, 1 = manual) as the regressor. Then we’ll use other variables to explain the variability further.
Process
1. Exploratory data analysis: visualization, quantification, statistical inference.
2. Simple linear regression of “mpg” on “am”; check how much of the variance is explained by the model.
3. Fit multiple models to measure the impact of other variables.
4. Residual plots to detect/quantify uncertainty in the model.
Key findings
Manual transmission is better than automatic for MPG, with a difference of about 7.24 MPG between the two. In spite of the strong correlation between transmission type and mpg, transmission alone explains only 36% of the variability of mpg. Other factors that influence mpg are the number of cylinders, weight and horsepower. Including these factors in the model explains about 86% of the variability of mpg.
1. Exploratory Data Analysis
data(mtcars) #read data
head(mtcars, 2) #display first 2 observations & variable names
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#transform relevant variables into factors
mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am   <- factor(mtcars$am)
We create a boxplot (appendix 1) that shows a clear difference in MPG between transmission types: manual transmission is associated with a clearly higher MPG.
Let’s quantify this:
as.data.frame(rbind(summary(mtcars$mpg[mtcars$am==0]), summary(mtcars$mpg[mtcars$am==1])), row.names = c("automatic", "manual"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## automatic 10.4 14.95 17.3 17.15 19.2 24.4
## manual 15.0 21.00 22.8 24.39 30.4 33.9
Both the boxplot and the summaries show a clear difference, mean for automatic is 17.15 MPG, versus 24.39 for manual, a difference of about 7.24 MPG.
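The same difference can be computed directly; a minimal sketch, assuming the data prepared above:
#difference of the group means: manual (1) minus automatic (0)
diff(with(mtcars, tapply(mpg, am, mean))) #about 7.24 MPG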
Is this difference statistically significant? We can check this with a t test. The null hypothesis is that both means are equal (mu_automatic - mu_manual = 0).
(t.test(mpg ~ factor(am), data=mtcars))$p.value
## [1] 0.001373638
The p-value of 0.00137 is below the significance level alpha of 0.05. We reject the null hypothesis in favor of the alternative: there is a significant difference between the mean MPG of automatic and manual transmission cars.
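The same t test also provides a confidence interval for the difference in means; a quick sketch (the interval should exclude zero, consistent with rejecting the null):
#95% confidence interval for mean(automatic) - mean(manual)
t.test(mpg ~ am, data = mtcars)$conf.int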
2. Simple linear regression, mpg and am
The correlation between transmission and MPG is clear, but how much of MPG variance is explained by transmission? We can perform a simple linear regression:
fit1 <- lm(mpg ~ am, data = mtcars); summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We retrieve our previous results: the intercept (beta0) is the mean MPG for automatic transmission, and the am coefficient (beta1, the slope) is the increase in the mean for manual transmission; the sum of the two (17.15 + 7.24 = 24.39) is the mean for manual transmission.
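As a quick check, a minimal sketch assuming fit1 from above, the fitted coefficients reproduce the two group means:
coef(fit1) #intercept = automatic mean, am1 = manual increment
sum(coef(fit1)) #automatic mean + increment = manual mean, about 24.39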
What interests us here is the R-squared value of about 36%, which is the proportion of the variance of MPG explained by the transmission variable.
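This R-squared value can also be recomputed by hand as one minus the ratio of residual to total variation; a sketch, assuming fit1 from above:
#R-squared = 1 - RSS/TSS, should match summary(fit1)$r.squared (about 0.36)
1 - sum(resid(fit1)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)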
This is not enough; adding other variables may explain more of the variance of MPG. We need to select the relevant ones.
3. Fit multiple models
A pairs plot (appendix 2) shows that the following variables seem to have some correlation with mpg: am, cyl, disp, hp, wt and qsec. We confirm this intuition with the step function (stepwise selection of useful variables, omitting the ones that don’t contribute significantly).
step(lm(mpg ~ ., data = mtcars), trace = 0) #trace = 0 suppresses the intermediate steps; only the final model is printed
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Coefficients:
## (Intercept) cyl6 cyl8 hp wt
## 33.70832 -3.03134 -2.16368 -0.03211 -2.49683
## am1
## 1.80921
We retrieve here more or less the same variables: am, cyl, hp, wt. Only disp and qsec are not retained.
They may be strongly related to mpg, but probably won’t add much explained variance. We can verify this by fitting two multivariable regression models:
“fit4” uses the 4 regressors suggested by the step function; “fit6” uses the 6 regressors visually identified in the pairs plot.
fit4 <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
fit6 <- lm(mpg ~ am + cyl + hp + wt + disp + qsec, data = mtcars)
#Summary and R-squared value with 4 regressors; this result is used in the conclusion
summary(fit4)
##
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## am1 1.80921 1.39630 1.296 0.20646
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
#R squared value with 6 regressors
summary(fit6)$r.squared
## [1] 0.8736016
The difference in explained variance between the two models is only about 0.8 percentage points. It seems that adding the variables disp and qsec does not add much to the model.
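One way to see this, a sketch assuming fit4 and fit6 from above, is to compare the adjusted R-squared values, which penalize the two extra regressors:
#adjusted R-squared of fit6 is expected to be no better than that of fit4
c(fit4 = summary(fit4)$adj.r.squared, fit6 = summary(fit6)$adj.r.squared)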
We can formalize this with the anova function, comparing the different fitted models between one another:
anova(fit1, fit4, fit6)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl + hp + wt
## Model 3: mpg ~ am + cyl + hp + wt + disp + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.0231 4.303e-08 ***
## 3 24 142.33 2 8.70 0.7331 0.4909
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of 0.49 for model 3 confirms that disp and qsec do not significantly improve the fit, while the very small p-value for model 2 shows that cyl, hp and wt do; we therefore keep fit4.
4. Residual plot
The residual plot (appendix 3) allows us to diagnose potential issues with the model:
- Residuals vs. Fitted: points are randomly scattered, supporting the independence assumption.
- Normal Q-Q plot: the points lie roughly along the line, suggesting the residuals are normally distributed.
- Scale-Location: points are scattered in a constant band, meaning the variance is roughly constant.
- Residuals vs. Leverage: there are a few influential observations, which we can see on the plot and identify with the dfbetas and hatvalues functions (see the sketch after this list).
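A minimal sketch of how the dfbetas and hatvalues functions could be used to flag influential observations, assuming fit4 from above:
#leverage: the three observations with the largest hat values
head(sort(hatvalues(fit4), decreasing = TRUE), 3)
#influence: the largest absolute dfbeta per observation, across all coefficients
head(sort(apply(abs(dfbetas(fit4)), 1, max), decreasing = TRUE), 3)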
All other variables held constant (see the prediction sketch below):
- for every weight increase of 1,000 lb, mpg decreases by about 2.5;
- the number of cylinders also has an impact: mpg decreases by about 3 for 6 cylinders (compared to 4) and by about 2.16 for 8 cylinders.
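These effects can be illustrated with a small prediction sketch, assuming fit4 from above; the two cars below are hypothetical and differ only in weight:
#two hypothetical manual, 6-cylinder, 110 hp cars whose weights differ by 1,000 lb (wt is in units of 1,000 lb)
nd <- data.frame(am  = factor(c(1, 1), levels = levels(mtcars$am)),
                 cyl = factor(c(6, 6), levels = levels(mtcars$cyl)),
                 hp  = c(110, 110),
                 wt  = c(2.5, 3.5))
diff(predict(fit4, newdata = nd)) #expected to be about -2.5 MPG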
Appendix 1. Box plot
boxplot(mpg~am, data = mtcars, main = "Boxplot MPG & Transmission", xlab = "Transmission, 0 = automatic, 1 = manual", ylab = "MPG")
We see a clear difference, manual transmission showing a higher (better) MPG.
Appendix 2. Pairs plot for all variables
pairs(mpg ~., panel = panel.smooth, data = mtcars)
Appendix 3. Residual plot
par(mfrow=c(2, 2))
plot(fit4)