Executive Summary

The primary objective of this report is to explore the relationship between transmission type, am, and fuel consumption, mpg, in the mtcars data set. More specifically, it aims to answer two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions

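The analysis shown below leads to the conclusion that there is a significant difference in MPG between manual and automatic transmissions, and that cars with manual transmissions have better fuel economy than those with automatic ones. The gap is about 7.25 MPG in the unadjusted comparison and about 1.81 MPG after adjusting for cyl, hp and wt.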

Exploratory Analysis

The mtcars data set was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). Here is a brief description of all the variables which will be mentioned later on.

It can be seen from the exploratory plots shown in the appendix that transmission type is associated with mpg. The following section quantifies this association and selects the best-fitting model to explain it.
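
The code in this report refers to a data frame named data whose construction is not shown. Judging from the coefficient names in the regression outputs (amManual, cyl6, vs1, gear4, carb2 and so on), it was presumably built from mtcars with the discrete columns converted to factors. A minimal sketch of such a preparation step, which is an assumption and not the original code, could look like this:

```r
# Assumed preparation of `data`; the original step is not shown in the report.
data <- mtcars

# Label transmission as it appears in the regression output: 0 = Auto, 1 = Manual
data$am <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Treat the remaining discrete columns as categorical
for (v in c("cyl", "vs", "gear", "carb")) {
  data[[v]] <- factor(data[[v]])
}

# Inspect the resulting column types
str(data)
```

With the columns coded this way, lm() creates one dummy coefficient per non-reference factor level, which is consistent with the names seen in the summaries.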

Regression Analysis

Initial Model

Initially, a regression of mpg on am alone is fitted to examine their relationship. This serves as the base model.

base <- lm(mpg ~ am, data = data)
summary(base)
## 
## Call:
## lm(formula = mpg ~ am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
aggregate(mpg ~ am, data = data, mean)
##       am      mpg
## 1   Auto 17.14737
## 2 Manual 24.39231

The analysis shown here supports the preliminary conclusion drawn from the exploratory plots mentioned above: transmission type has a significant association with fuel consumption. The average mpg is 24.39 for manual and 17.15 for automatic cars, a difference of 7.245 that matches the amManual coefficient, and its p-value of 0.000285 is well below 0.05. However, this model explains only about 36 percent of the variance in mpg (multiple R-squared 0.3598).
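
As a quick check on the base model, the sketch below confirms that the amManual coefficient is simply the difference between the two group means, and asks for a 95% confidence interval around it. The preparation of data is an assumption, since the report does not show it.

```r
# Assumed preparation of `data` (not shown in the report)
data <- mtcars
data$am <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))

base <- lm(mpg ~ am, data = data)

# Difference of group means, computed directly
group_means <- tapply(data$mpg, data$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Auto"])

# The amManual coefficient is the same quantity
slope <- unname(coef(base)["amManual"])
print(all.equal(slope, mean_diff))

# 95% confidence interval for the Manual minus Auto difference
ci <- confint(base, "amManual", level = 0.95)
print(ci)
```

An interval that excludes zero tells the same story as the small p-value, while also giving a range for the size of the unadjusted difference.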

Model Selection

It is also easy to notice that other variables, such as cyl, hp and wt, should be taken into consideration when exploring fuel consumption. Hence, a model named og, which includes all the variables, is fitted and used as the starting point for model selection.

# Full model: regress mpg on all the other variables
og <- lm(mpg ~ ., data = data)
summary(og)
## 
## Call:
## lm(formula = mpg ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## amManual     1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
# The base model again, for comparison: mpg vs am
base <- lm(mpg ~ am, data = data)
summary(base)
## 
## Call:
## lm(formula = mpg ~ am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

With the help of step(), R can perform a stepwise model selection, which decides which variables to keep in the regression model. At each step, the candidate model with the smallest AIC (Akaike Information Criterion) is selected. AIC balances goodness of fit against model complexity, so a lower value is better.

# Stepwise selection starting from the full model, to find the best-fitting one
best <- step(og, direction = "both")
summary(best)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
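
To make the AIC criterion used by step() more concrete, the sketch below prints the AIC of the three models side by side. It relies on extractAIC(), which reports AIC on the same scale that step() uses. The preparation of data is an assumption, since the report does not show it.

```r
# Assumed preparation of `data` (not shown in the report)
data <- mtcars
data$am <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))
for (v in c("cyl", "vs", "gear", "carb")) data[[v]] <- factor(data[[v]])

base <- lm(mpg ~ am, data = data)
og   <- lm(mpg ~ ., data = data)
best <- step(og, direction = "both", trace = 0)  # trace = 0 hides the search log

# extractAIC() returns c(edf, AIC); keep only the AIC value for each model
aic <- sapply(list(base = base, best = best, og = og),
              function(m) extractAIC(m)[2])
print(aic)
```

Because step() only accepts moves that lower the AIC, the selected model cannot have a higher AIC than the full model it started from.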

Then, by using anova(), the nested models can be compared directly.

# Compare the base, best and og models
anova(base, best, og)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     26 151.03  4    569.87 17.7489 1.476e-05 ***
## 3     15 120.40 11     30.62  0.3468    0.9588    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows that adding cyl, hp and wt to the base model to obtain best reduces the residual sum of squares significantly (F = 17.75, p = 1.476e-05), whereas adding the remaining variables to go from best to og gives no significant improvement (F = 0.35, p = 0.9588). This supports the view that the regression model obtained from step() is the most suitable of the three for answering the question.
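
The F values in the table can be reproduced by hand, which may make the comparison easier to follow. The sketch below assumes the usual convention for anova() on several lm fits: each drop in RSS per added degree of freedom is divided by the residual mean square of the largest model (og). The RSS and Res.Df values are copied from the table above, and the variable names are illustrative.

```r
# Values copied from the anova table above
rss <- c(base = 720.90, best = 151.03, og = 120.40)  # residual sums of squares
rdf <- c(base = 30,     best = 26,     og = 15)      # residual degrees of freedom

# Residual mean square of the largest model, used as the common denominator
scale <- unname(rss["og"] / rdf["og"])

# base -> best: 4 extra parameters
f_base_best <- unname((rss["base"] - rss["best"]) / (rdf["base"] - rdf["best"])) / scale
p_base_best <- pf(f_base_best, 4, 15, lower.tail = FALSE)

# best -> og: 11 extra parameters
f_best_og <- unname((rss["best"] - rss["og"]) / (rdf["best"] - rdf["og"])) / scale
p_best_og <- pf(f_best_og, 11, 15, lower.tail = FALSE)

print(c(F = f_base_best, p = p_base_best))
print(c(F = f_best_og,   p = p_best_og))
```

Small differences in the last digit relative to the table are expected, because the inputs here are the rounded values printed by anova().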

Inference

Recall the summary of best again.

summary(best)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

According to the p-values shown above, cyl6, hp and wt are significant at the 0.05 threshold, whereas cyl8 (0.352) and amManual (0.206) are not. After adjusting for cyl, hp and wt, the estimated advantage of manual transmission is 1.81 mpg, which is not statistically significant on its own. Part of the reason some individual p-values are large is that the predictors are correlated with each other. For example, hp depends strongly on cyl. It can be seen below.

summary(lm(hp ~ cyl - 1, data = data))
## 
## Call:
## lm(formula = hp ~ cyl - 1, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.21 -22.78  -8.25  15.97 125.79 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## cyl4    82.64      11.43   7.228 5.86e-08 ***
## cyl6   122.29      14.33   8.532 2.12e-09 ***
## cyl8   209.21      10.13  20.645  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.92 on 29 degrees of freedom
## Multiple R-squared:   0.95,  Adjusted R-squared:  0.9449 
## F-statistic: 183.7 on 3 and 29 DF,  p-value: < 2.2e-16
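
Two details of this output are worth checking. Because the model has no intercept, each coefficient is simply the mean hp of one cylinder group. In addition, R-squared for a no-intercept model is measured against zero instead of against the mean of hp, so it tends to look larger than the value from the equivalent model with an intercept. The sketch below illustrates both points. Treating cyl as a factor is an assumption based on the coefficient names above.

```r
# Assumed preparation: cyl as a factor, as the coefficient names suggest
data <- mtcars
data$cyl <- factor(data$cyl)

no_int   <- lm(hp ~ cyl - 1, data = data)  # as in the report
with_int <- lm(hp ~ cyl, data = data)      # same fit, different parameterization

# Coefficients of the no-intercept fit are the mean hp per cylinder group
group_means <- tapply(data$hp, data$cyl, mean)
print(cbind(coef = coef(no_int), group_mean = group_means))

# Same residuals, but R-squared is measured against different baselines
r2_no_int   <- summary(no_int)$r.squared
r2_with_int <- summary(with_int)$r.squared
print(c(no_intercept = r2_no_int, with_intercept = r2_with_int))
```

The with-intercept value is the more conventional measure of how much of the variation in hp is accounted for by cyl.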

But this does not mean that hp can be removed. More importantly, in the regression model best, the multiple \(R^2\) and adjusted \(R^2\) are 0.8659 and 0.8401 respectively, which are higher than the \(R^2\) values of the model without hp. This indicates that about 87 percent of the variance in mpg (84 percent after adjusting for the number of predictors) is explained by the selected variables. The F-statistic shows that the p-value of the whole model is 1.506e-10, which also supports the conclusion that the selected variables are jointly significant.
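
The multiple and adjusted \(R^2\) printed in the summary output of best (0.8659 and 0.8401) can be recomputed from the residual sum of squares. The sketch below takes the RSS of best from the anova table (151.03) and computes the total sum of squares of mpg directly from mtcars. The variable names are illustrative.

```r
rss_best <- 151.03   # residual sum of squares of best, copied from the anova table
n <- nrow(mtcars)    # 32 cars
p <- 5               # estimated slopes: cyl6, cyl8, hp, wt, amManual

# Total sum of squares of mpg around its mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Multiple R-squared: share of the total variation that is not residual
r2 <- 1 - rss_best / tss

# Adjusted R-squared: same idea, with each sum divided by its degrees of freedom
adj_r2 <- 1 - (rss_best / (n - p - 1)) / (tss / (n - 1))

print(round(c(r2 = r2, adj_r2 = adj_r2), 4))
```

The adjustment penalizes each extra coefficient, which is why the adjusted value is the more suitable one when comparing models with different numbers of predictors.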

Residual and Diagnostics

Additionally, a series of residual plots is shown below to carry out the regression diagnostics and to check for non-normality.

par(mfrow = c(2,2))
plot(best)

The four plots shown above show no obvious violation of the model assumptions: the residuals appear to be independent and approximately normally distributed with constant variance.
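
The visual reading of the plots can be supplemented with a numerical check. The sketch below, which is not part of the original analysis, refits best and applies the Shapiro-Wilk test to its residuals, then lists the observations with the highest leverage. The preparation of data is an assumption, since the report does not show it.

```r
# Assumed preparation of `data` (not shown in the report)
data <- mtcars
data$am  <- factor(data$am, levels = c(0, 1), labels = c("Auto", "Manual"))
data$cyl <- factor(data$cyl)

best <- lm(mpg ~ cyl + hp + wt + am, data = data)

# Shapiro-Wilk test: a small p-value would indicate non-normal residuals
sw <- shapiro.test(residuals(best))
print(sw)

# Leverage (hat values): large values flag cars with unusual predictor combinations
lev <- hatvalues(best)
print(head(sort(lev, decreasing = TRUE), 3))
```

With only 32 observations the test has limited power, so it is better read alongside the Q-Q plot than as a replacement for it.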

Conclusion

Based on the analysis conducted above, a few conclusions can be made:

Appendix

Here are some plots for exploratory analysis.

# Plot em all
# Plot all the box plots first
ggplot2.multiplot(plot1, plot7,  cols = 2)

ggplot2.multiplot(plot8, plot9, plot10, cols = 2)

# The others
ggplot2.multiplot(plot2, plot3, cols = 2)

ggplot2.multiplot(plot4, plot5, plot6, cols = 2)