Automatic or Manual Transmission: Which Is Better for Miles per Gallon (MPG)?

Executive Summary

In this report we use a data set describing a collection of cars to answer the following two questions:

Is an automatic or manual transmission better for MPG?
How large is the MPG difference between automatic and manual transmissions?

This report includes five sections:

Exploratory Analysis
Model Selection Analysis
Residual Plot and Diagnostic
Conclusions
Annexes

1. Exploratory Analysis

We will use the mtcars data set. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

A data frame with 32 observations on 11 variables:

mpg: Miles/(US) gallon
cyl: Number of cylinders
disp: Displacement (cu.in.)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1000 lbs)
qsec: 1/4 mile time
vs: Engine shape (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors

Let’s see some of the values included in the data set:

data(mtcars)
print.data.frame(mtcars[1:3,])

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

We will focus mainly on the relationship between mpg and am. Let’s convert some variables from numeric to factor class and draw a box plot to check whether a relationship exists:

library(ggplot2)

mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)

# Box plot
p <- ggplot(mtcars, aes(x=am, y=mpg,fill=am,group=am)) + 
  labs(title="Plot of MPG per Transmission Type",
       x="Transmission Type (0=automatic, 1=manual)", 
       y = "Miles/gallon") + 
  geom_boxplot() + theme(legend.position="none")
## Output in Annexes Section
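
As a numeric complement to the box plot, the following minimal sketch tabulates the same quantities a box plot draws (minimum, quartiles, median, maximum) plus the mean, per transmission type. The names `am_label`, `box_stats` and `group_summary` are illustrative assumptions and are not part of the original analysis.

```r
# Self-contained: load the data set used in the report.
data(mtcars)

# Readable labels for the transmission type (0 = automatic, 1 = manual).
am_label <- factor(mtcars$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))

# Hypothetical helper: the statistics a box plot summarizes, plus the mean.
box_stats <- function(x) {
  c(min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x))
}

# One row per transmission type.
group_summary <- t(sapply(split(mtcars$mpg, am_label), box_stats))
print(round(group_summary, 2))
```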

According to the plot, there appears to be a difference in MPG between automatic and manual cars. Let’s run an inference test to validate that.

1.1 Inference Statistics Test

t_test <- t.test(mpg ~ am, data=mtcars)
## Output in Annexes Section

Because the p-value (0.0014) is below 0.05, we can reject the null hypothesis that both transmission types have the same mean MPG, so the difference is statistically significant. The mean MPG for manual transmission cars (am=1) is 24.39, about 7.2 higher than for automatic cars (17.15).
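
The figures quoted above can be read directly from the object returned by `t.test`. The sketch below is one way to do it; the variable names `group_means`, `mean_diff` and `ci_manual_minus_auto` are illustrative assumptions.

```r
# Self-contained: reload the data and rebuild the factor used in the report.
data(mtcars)
mtcars$am <- as.factor(mtcars$am)

# Welch two-sample t-test of mpg by transmission type.
t_test <- t.test(mpg ~ am, data = mtcars)

# Group means: group 0 is automatic, group 1 is manual.
group_means <- unname(t_test$estimate)
mean_diff <- group_means[2] - group_means[1]  # manual minus automatic

# The reported interval is for automatic minus manual,
# so flip its sign and order to match mean_diff.
ci_manual_minus_auto <- rev(-t_test$conf.int)

cat(sprintf("Difference (manual - automatic): %.2f MPG\n", mean_diff))
cat(sprintf("95%% CI: [%.2f, %.2f]\n",
            ci_manual_minus_auto[1], ci_manual_minus_auto[2]))
cat(sprintf("p-value: %.6f\n", t_test$p.value))
```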

2. Model Selection Analysis

To choose the best model, we will consider a Simple Linear Regression Model and a Multiple Linear Regression Model (Full and Stepwise Backward).

2.1. Simple Linear Regression Model

Let’s check the validity of a simple linear regression model using mpg ~ am:

s_model <- lm(mpg ~ am, data=mtcars)
## Output in Annexes Section

This model has a Residual Standard Error of 4.902 with 30 degrees of freedom and an Adjusted R-squared of 0.3385, which means that the model explains only about 34% of the variance of the MPG variable. So let’s consider including other variables in order to build a better model.
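
The fit statistics discussed above can be pulled out of the model summary programmatically rather than read off the printed output. This is a minimal sketch; `s_sum` and `fit_stats` are illustrative names. Note that with a single two-level factor as predictor, the `am1` coefficient is simply the difference between the two group means.

```r
# Self-contained: reload the data and rebuild the factor used in the report.
data(mtcars)
mtcars$am <- as.factor(mtcars$am)

# Simple linear regression of mpg on transmission type.
s_model <- lm(mpg ~ am, data = mtcars)
s_sum <- summary(s_model)

# Collect the statistics quoted in the text.
fit_stats <- c(rse           = s_sum$sigma,         # residual standard error
               df            = s_sum$df[2],         # residual degrees of freedom
               r_squared     = s_sum$r.squared,
               adj_r_squared = s_sum$adj.r.squared)
print(round(fit_stats, 4))

# The slope equals the difference in group means (manual minus automatic).
print(round(coef(s_model), 3))
```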

2.2. Multiple Linear Regression Model

Let’s fit the full model and then use stepwise regression to find the best one:

2.2.a Full Model

The full model including all the variables has the following results:

f_model <- lm(mpg ~ . , data=mtcars)
## Output in Annexes Section

This model has a Residual Standard Error of 2.833 with 15 degrees of freedom and an Adjusted R-squared of 0.779, which means that the model explains about 78% of the variance of the MPG variable, much better than the previous one.

2.2.b Stepwise Backward Model

Finally, let’s use the stepwise backward regression method:

b_model <- step(f_model, direction="backward", trace=0)
## Output in Annexes Section

According to this procedure, the best model is mpg ~ cyl + hp + wt + am, which has a Residual Standard Error of 2.41 with 26 degrees of freedom and an Adjusted R-squared of 0.8401, meaning the model explains about 84% of the variance of the MPG variable. We will choose this model because it has the highest Adjusted R-squared and the lowest Residual Standard Error of the three.
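
To see the three candidate models side by side, the sketch below rebuilds them and tabulates the statistics used for the comparison. The helper `fit_row` and the table name `comparison` are illustrative assumptions. The nested F-test with `anova` is an addition, not part of the original analysis: it checks whether the extra terms in the stepwise model improve on the simple model.

```r
# Self-contained: reload the data and rebuild the factors used in the report.
data(mtcars)
for (v in c("am", "cyl", "vs", "gear", "carb")) {
  mtcars[[v]] <- as.factor(mtcars[[v]])
}

# The three candidate models.
s_model <- lm(mpg ~ am, data = mtcars)
f_model <- lm(mpg ~ ., data = mtcars)
b_model <- step(f_model, direction = "backward", trace = 0)

# Hypothetical helper: the statistics quoted in the text for one model.
fit_row <- function(m) {
  s <- summary(m)
  c(rse = s$sigma, df = s$df[2], adj_r2 = s$adj.r.squared)
}

comparison <- t(sapply(list(simple = s_model, full = f_model,
                            stepwise = b_model), fit_row))
print(round(comparison, 4))

# Nested model comparison: mpg ~ am versus the stepwise model.
print(anova(s_model, b_model))
```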

3. Residual Plot and Diagnostic

par(mfrow = c(2, 2))
##plot(b_model)
## Output in Annexes Section

Based on the residual plots, we can check the following assumptions:

The Residuals vs Fitted plot shows no consistent pattern, supporting the independence assumption.
The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points lie close to a line.
The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed.
The Residuals vs Leverage plot suggests that no influential outliers are present, as all points fall well within the 0.5 bands.

The above analysis suggests that the model meets the basic assumptions of linear regression, so it can be used to answer the questions.
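
The visual checks can be complemented with a few numeric ones. The sketch below is an addition rather than part of the original analysis: it refits the selected model directly from the formula reported by the stepwise procedure, runs a Shapiro-Wilk test on the residuals (a numeric counterpart to the Q-Q plot), and lists the largest Cook's distances and leverages (the quantities behind the Residuals vs Leverage plot). The variable names are illustrative.

```r
# Self-contained: reload the data and rebuild the factors the model needs.
data(mtcars)
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)

# Selected model, refit from the formula reported by step().
b_model <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

# Normality of residuals: a small p-value would cast doubt on normality.
normality <- shapiro.test(residuals(b_model))
cat(sprintf("Shapiro-Wilk p-value: %.4f\n", normality$p.value))

# Influence: Cook's distance and leverage (hat values) per car.
cooks <- cooks.distance(b_model)
lev <- hatvalues(b_model)

# The three most influential and the three highest-leverage cars.
print(round(head(sort(cooks, decreasing = TRUE), 3), 3))
print(round(head(sort(lev, decreasing = TRUE), 3), 3))
```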

4. Conclusions

Looking at the selected model (please check 5.2.2.b Stepwise Backward Model), we can see how mpg is affected by changes in am, cyl, hp and wt:

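Holding the other variables constant, cars with manual transmission get about 1.8 MPG more than cars with automatic transmission, although this difference is not statistically significant in the selected model (p = 0.206).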
MPG decreases by about 2.5 for every 1000 pound increase in weight.
MPG decreases very marginally with horsepower, about 3 MPG for every 100 horsepower.
Compared with 4 cylinder engines, MPG is lower by about 3 for 6 cylinder engines and by about 2 for 8 cylinder engines.
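
The statements above are readings of the model coefficients. As a rough sketch, the following shows one way to obtain the estimates together with 95% confidence intervals, and to rescale the horsepower coefficient to a per-100-horsepower effect. The names `est`, `ci` and `effects` are illustrative, and the model is refit directly from the formula reported by the stepwise procedure.

```r
# Self-contained: reload the data and rebuild the factors the model needs.
data(mtcars)
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)

# Selected model, refit from the formula reported by step().
b_model <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

# Point estimates with 95% confidence intervals.
# cyl6 and cyl8 are differences from the 4 cylinder baseline;
# am1 is the difference from automatic transmission.
est <- coef(b_model)
ci <- confint(b_model, level = 0.95)
effects <- cbind(estimate = est, ci)
print(round(effects, 3))

# The hp coefficient is per 1 horsepower; rescale to per 100 horsepower.
cat(sprintf("Change in MPG per 100 hp: %.2f\n", 100 * est["hp"]))
```

An interval that contains zero, as happens for `am1` here, means the data are compatible with no effect for that term once the other variables are in the model.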

5. Annexes

Here you can find the output of the different functions and plots.

5.1. Plot of MPG per Transmission Type

5.1.1 Inference Statistics Test

t_test

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

5.2.1 Simple Linear Regression Model

summary(s_model)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

5.2.2.a Full Model

summary(f_model)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## am1          1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124

5.2.2.b Stepwise Backward Model

summary(b_model)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am1          1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

5.3 Residual Plot and Diagnostic

par(mfrow = c(2, 2))
plot(b_model)