Executive Summary

The selected model estimates a coefficient of 1.478 for the transmission variable: after adjusting for number of cylinders, horsepower, and weight, cars with a manual transmission average about 1.478 more miles per gallon than cars with an automatic one. To formally answer the stated question, the manual transmission is therefore estimated to be better for MPG by about 1.478 miles per gallon on average, although this adjusted difference is not statistically significant at the 0.05 level (p = 0.314).

Introduction

The main goal of this analysis is to answer the question “Is an automatic or manual transmission better for MPG?” and to quantify the MPG difference between automatic and manual transmissions.

The dataset is called mtcars. It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Let’s begin by taking a quick look at the dataset.

Exploratory analysis

Before any more advanced analysis, let’s look at the boxplot of MPG for the two types of transmission (see Appendix A) and at the distribution of MPG for each type (see Appendix B).
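
The average MPG of each transmission type can also be computed directly. The following is a minimal sketch in base R, assuming the original coding of am (0 = automatic, 1 = manual); the names `cars` and `mpg_by_am` are illustrative and not part of the original analysis.

```r
# Work on a fresh copy so the objects used in the report are left untouched
cars <- datasets::mtcars

# Mean and median MPG per transmission type (columns "0" = automatic, "1" = manual)
mpg_by_am <- sapply(
  split(cars$mpg, cars$am),
  function(x) c(mean = mean(x), median = median(x))
)
print(mpg_by_am)
```

The result is a small matrix with one column per transmission type, which makes the gap between the two groups easy to read before any model is fitted.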

Regression Analysis

Notice that the median MPG of manual transmission cars is higher than that of automatic transmission cars (Appendix A). However, this comparison leaves out many other aspects, such as number of cylinders, horsepower, and weight. With that in mind, let’s first fit a linear model that still considers only the influence of the type of transmission on fuel consumption (MPG).

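data("mtcars")
# Recode transmission as a labeled factor: 0 = automatic (AT), 1 = manual (MT)
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("AT", "MT")
# Simple linear regression of MPG on transmission type only
fit <- lm(mpg ~ am, data = mtcars)
summary(fit)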
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amMT           7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Notice that the transmission (am) is statistically significant in its relationship with MPG: the Pr(>|t|) for amMT is 0.000285, which is below the typical benchmarks (0.05, for example). Also, the expected change in MPG going from automatic to manual transmission is 7.245 miles per gallon.

To summarize the coefficients, we have:

  1. (Intercept) -> 17.147 is the mean MPG of the automatic transmission cars.
  2. Estimate (amMT) -> 7.2449393 is the expected change in MPG going from automatic transmission to manual transmission.
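
The interpretation of the two coefficients can be checked numerically: with a single two-level factor as predictor, the intercept is the mean of the reference group and the slope is the difference between the group means. This is a sketch under that assumption; `cars`, `fit_am`, and `group_means` are illustrative names that do not appear in the original analysis.

```r
# Fresh copy of the data with transmission recoded as AT / MT
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("AT", "MT"))

fit_am <- lm(mpg ~ am, data = cars)
group_means <- sapply(split(cars$mpg, cars$am), mean)

# Intercept should equal the AT mean
intercept_ok <- isTRUE(all.equal(
  as.numeric(coef(fit_am)["(Intercept)"]),
  as.numeric(group_means["AT"])
))

# amMT should equal the MT mean minus the AT mean
slope_ok <- isTRUE(all.equal(
  as.numeric(coef(fit_am)["amMT"]),
  as.numeric(group_means["MT"] - group_means["AT"])
))
print(c(intercept_ok = intercept_ok, slope_ok = slope_ok))

# 95% confidence interval for the unadjusted MPG difference
print(confint(fit_am)["amMT", ])
```

The confidence interval gives a range for the unadjusted difference, which is a more informative summary than the point estimate alone.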

Multivariable Analysis

However, intuitively we know that other predictors besides the type of transmission can also influence MPG, for example the weight of the vehicle (wt) and the number of cylinders (cyl). Let’s look at other variables that could be added to the model.

Selecting the variables (predictors)

By analyzing the square root of the VIF (variance inflation factor), we can see that displacement (disp) has the most inflated variance, due to its correlation with other predictors such as the number of cylinders. So, we can probably remove displacement from the pool of predictors without hurting the model. The same goes for carb, vs, and drat, which are less strongly correlated with mpg than cyl, disp, hp, and wt.

data("mtcars")
require(car)
## Loading required package: car
fit <- lm(mpg ~ ., data = mtcars)
sqrt(vif(fit))
##      cyl     disp       hp     drat       wt     qsec       vs       am 
## 3.920948 4.649757 3.135608 1.837014 3.894212 2.743712 2.228424 2.156035 
##     gear     carb 
## 2.314617 2.812249
cor(mtcars)[1,]
##        mpg        cyl       disp         hp       drat         wt 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594 
##       qsec         vs         am       gear       carb 
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251
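
To make the VIF figures less of a black box, the value for disp can be reproduced by hand: regress disp on all the other predictors, take the R-squared of that auxiliary regression, and apply VIF = 1 / (1 - R²). This is a sketch in base R that does not need the car package; `cars`, `aux`, `r2_disp`, and `vif_disp` are illustrative names.

```r
cars <- datasets::mtcars

# Auxiliary regression: disp on every other predictor (mpg, the response, is excluded)
aux <- lm(disp ~ . - mpg, data = cars)
r2_disp <- summary(aux)$r.squared

# VIF = 1 / (1 - R^2); its square root is the inflation of the standard error
vif_disp <- 1 / (1 - r2_disp)
print(sqrt(vif_disp))
```

The printed value should agree with the disp entry of sqrt(vif(fit)) above. An R-squared close to 1 in the auxiliary regression means disp is almost fully explained by the other predictors, which is why its coefficient is estimated so imprecisely.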

Using the remaining variables, we can try some combinations of predictors and use the ANOVA test to verify which one generates a reasonable model.

# Nested model testing
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- update(fit1, mpg ~ am + cyl, data = mtcars)
fit3 <- update(fit2, mpg ~ am + cyl + hp, data = mtcars)
fit4 <- update(fit3, mpg ~ am + cyl + hp + wt, data = mtcars)
fit5 <- update(fit4, mpg ~ am + cyl + hp + wt + qsec, data = mtcars)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + hp
## Model 4: mpg ~ am + cyl + hp + wt
## Model 5: mpg ~ am + cyl + hp + wt + qsec
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 271.36  1    449.53 73.1328 4.952e-09 ***
## 3     28 220.55  1     50.81  8.2659  0.007954 ** 
## 4     27 170.00  1     50.56  8.2246  0.008091 ** 
## 5     26 159.82  1     10.18  1.6562  0.209459    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
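
The F statistic in each row of the table can be recomputed from residual sums of squares, which clarifies what the test measures. The sketch below does this for the step from Model 3 to Model 4 (adding wt). It assumes the default behaviour of anova() with several models, where the denominator is the residual mean square of the largest model; `cars`, `m3`, `m4`, `m5`, and the other names are illustrative.

```r
cars <- datasets::mtcars

m3 <- lm(mpg ~ am + cyl + hp, data = cars)
m4 <- lm(mpg ~ am + cyl + hp + wt, data = cars)
m5 <- lm(mpg ~ am + cyl + hp + wt + qsec, data = cars)

# Drop in residual sum of squares when wt is added, and the degrees of freedom used
ss_drop <- deviance(m3) - deviance(m4)
df_drop <- df.residual(m3) - df.residual(m4)

# Denominator: residual mean square of the largest model in the sequence
scale_big <- deviance(m5) / df.residual(m5)

f_wt <- (ss_drop / df_drop) / scale_big
p_wt <- pf(f_wt, df_drop, df.residual(m5), lower.tail = FALSE)
print(c(F = f_wt, p = p_wt))
```

A large F means the added predictor removes much more residual variation than would be expected by chance, so a small Pr(>F) supports keeping it.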

Model 4 seems to be a good model: adding wt significantly reduces the residual sum of squares (F = 8.22, Pr(>F) = 0.008), whereas adding qsec in Model 5 does not (Pr(>F) = 0.209). Model 4 also explains about 85% of the variance (or 83% if using the adjusted R-squared). Now let’s look at the model residuals and check for normality (see Appendix C). One way to do that is to look at the plot in the upper right corner (Normal Q-Q): if the residuals fall roughly on the line of the normal Q-Q plot, it is a good sign. Also, the Residuals vs Fitted plot should not show any pattern; a pattern would suggest heteroskedasticity (non-constant variance).

summary(fit4)
## 
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4765 -1.8471 -0.5544  1.2758  5.6608 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.14654    3.10478  11.642 4.94e-12 ***
## am           1.47805    1.44115   1.026   0.3142    
## cyl         -0.74516    0.58279  -1.279   0.2119    
## hp          -0.02495    0.01365  -1.828   0.0786 .  
## wt          -2.60648    0.91984  -2.834   0.0086 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.509 on 27 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8267 
## F-statistic: 37.96 on 4 and 27 DF,  p-value: 1.025e-10
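
Two numerical checks can complement the summary and the residual plots: a confidence interval for the adjusted transmission effect, and a formal normality test on the residuals. The sketch below uses the Shapiro-Wilk test, which is one common choice and is not part of the original analysis; `cars`, `m4`, `ci_am`, and `sw` are illustrative names.

```r
cars <- datasets::mtcars
m4 <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# 95% confidence interval for the am coefficient, adjusted for cyl, hp and wt
ci_am <- confint(m4)["am", ]
print(ci_am)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(residuals(m4))
print(sw$p.value)
```

Because the p-value of am in the summary is above 0.05, the interval includes zero. A Shapiro-Wilk p-value above the chosen threshold would mean there is no evidence against normality of the residuals, which is the same question the Normal Q-Q plot addresses visually.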

Appendix A - Box plot of MPG by type of transmission

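library(ggplot2)
# Label transmission for plotting: 0 = automatic (AT), 1 = manual (MT)
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("AT", "MT")
# Box plot of MPG for each transmission type
ggplot(mtcars, aes(x = am, y = mpg)) +
  geom_boxplot() +
  labs(x = "Transmission Type", y = "MPG")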

Appendix B - Histograms of MPG for AT and MT

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 3) + facet_grid(am ~ .)

Appendix C - Residual Plots

par(mfrow = c(2,2))
plot(fit4)