Introduction

Rising fuel prices and climate change have driven motorists to be more concerned than ever with fuel economy. Consequently, understanding the factors which affect the miles per gallon (MPG) of cars is a key area of interest. This analysis will focus on the effect of transmission type (automatic or manual) on the MPG of cars. In particular, this analysis will address the following two questions:

  1. Is an automatic or manual transmission better for miles per gallon?
  2. How different is the MPG between automatic and manual transmissions?

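The data comes from the mtcars data set built into R, which was extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The variables are mpg (miles per US gallon), cyl (number of cylinders), disp (displacement in cubic inches), hp (gross horsepower), drat (rear axle ratio), wt (weight in 1000 lbs), qsec (quarter mile time in seconds), vs (engine shape: 0 = V-shaped, 1 = straight), am (transmission: 0 = automatic, 1 = manual), gear (number of forward gears) and carb (number of carburetors).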

Conclusion

There is a statistically significant difference (at the 5% significance level) in fuel efficiency between automatic and manual transmission cars. On average, manual cars have a 2.94 (3 s.f.) higher MPG than automatic cars, keeping all other factors constant.

Exploratory Analysis

We begin by loading the data and looking at some basic summaries:

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Since we are interested in the effect of transmission on MPG, we plot that now.
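A plot of this kind could be produced along the following lines. This is a minimal sketch rather than the original plotting code: it uses base R's boxplot on a local copy of the data, and the names cars, transmission and group_means, as well as the axis labels, are illustrative choices.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars

# Label the transmission codes (0 = automatic, 1 = manual) for plotting
transmission <- factor(cars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Side-by-side boxplots of MPG for each transmission type
boxplot(cars$mpg ~ transmission,
        xlab = "Transmission", ylab = "Miles per gallon (MPG)",
        main = "MPG by transmission type")

# Group means, for comparison with the plot
group_means <- tapply(cars$mpg, transmission, mean)
print(group_means)
```

The difference between the two group means is the same quantity that a single-predictor regression of mpg on transmission estimates as its slope.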

On average, manual transmission cars have a higher mpg than automatic cars. However there may be other confounding factors which we next examine by plotting mpg against all other variables (excluding transmission).
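One way to draw mpg against each of the remaining variables is sketched below. The 3 x 3 layout and the names cars, other_vars and old_par are assumptions made for illustration; the original plot may have been produced differently.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars

# All variables other than the response (mpg) and transmission (am)
other_vars <- setdiff(names(cars), c("mpg", "am"))

# Arrange the nine scatter plots in a 3 x 3 grid
old_par <- par(mfrow = c(3, 3), mar = c(4, 4, 1, 1))
for (v in other_vars) {
  plot(cars[[v]], cars$mpg, xlab = v, ylab = "mpg")
}
par(old_par)  # restore the previous graphics settings
```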

From the above plot, we note the following:

This makes intuitive sense: we would expect heavier cars to need bigger, more powerful engines with more cylinders and to be less efficient. Indeed, we see pairwise positive correlations between cylinders, engine size, weight and horsepower.

mtcars <- data.frame(apply(mtcars, 2, as.numeric))
cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

We also note:

It would seem that the relationship between transmission and fuel economy may be confounded by some of the other variables.
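A quick way to see which variables are candidate confounders is to place each variable's correlation with mpg next to its correlation with am. The sketch below does this on a local copy of the data; the names cars, cor_matrix and confound_check are illustrative, and sorting by the absolute correlation with am is just one reasonable ordering.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars
cor_matrix <- cor(cars)

# Correlation of every other variable with the response (mpg)
# and with transmission (am)
others <- setdiff(rownames(cor_matrix), c("mpg", "am"))
confound_check <- cor_matrix[others, c("mpg", "am")]

# Sort by strength of association with transmission
confound_check <- confound_check[order(-abs(confound_check[, "am"])), ]
print(round(confound_check, 2))
```

A variable that is strongly correlated with both columns, such as wt, is a plausible confounder of the transmission effect.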

Regression Analysis

We now fit several linear models to try to answer the two questions set for this analysis.

Model 1 - Issues with Confounding

Since we are trying to investigate the relationship between transmission and fuel efficiency, we can begin by fitting a model with fuel efficiency as the response and transmission as the (single) predictor.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual")
fit1 <- lm(mpg ~ am, mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

According to this model, there is a statistically significant difference (at the 5% significance level) between manual and automatic cars, with manual cars having on average 7.25 (3 s.f.) higher MPG than automatic cars. The \(R^2\) metric measures the proportion of the variation in the response explained by the predictor(s). However, there is evidence of a poor fit in the low \(R^2\) value of just 36% (2 s.f.).
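To make the definition of \(R^2\) concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below refits the single-predictor model on a local copy of the data; the names cars, fit_am, rss and tss are illustrative.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

fit_am <- lm(mpg ~ am, data = cars)

# Residual sum of squares: variation left unexplained by the model
rss <- sum(residuals(fit_am)^2)
# Total sum of squares: variation of mpg around its overall mean
tss <- sum((cars$mpg - mean(cars$mpg))^2)

r_squared <- 1 - rss / tss
print(r_squared)

# Should agree with the value reported by summary()
print(all.equal(r_squared, summary(fit_am)$r.squared))
```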

The issue with Model 1 is that it does not account for confounding from other variables. We saw in the exploratory analysis that many variables were highly correlated with both fuel efficiency and transmission, so part of the 7.25 MPG difference may be due to those variables rather than to transmission itself.

Model 2 - Issues with Collinearity

To overcome the confounding seen in Model 1, we now fit a model of fuel efficiency against all of the other variables.

fitall <- lm(mpg ~ ., mtcars)
summary(fitall)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## ammanual     2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

According to Model 2, none of the variables has a statistically significant effect (at the 5% significance level) on fuel efficiency. Given the strong relationships between many of the variables and fuel efficiency, this seems highly improbable. The issue here is collinearity; that is, predictors which are highly correlated with the response are also highly correlated with each other. Collinearity inflates the standard errors of the slope estimates and can lead to unstable models.

We can measure the effect of collinearity using the variance inflation factor (VIF). The VIF of a given predictor within a model is calculated as \(1/(1 - R^2)\), where the \(R^2\) comes from regressing the given predictor (as the response) against the remaining predictors in the model. A high value of \(R^2\) in such a model would indicate the presence of collinearity, since much of the variation in a predictor would already be explained by the other predictors. This would in turn lead to a high VIF for that particular predictor.
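The formula can be checked by hand for a single predictor. The sketch below computes the VIF for wt using only base R, on a local copy of the data; the names cars, aux_fit, r2_wt and vif_wt are illustrative.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars

# Auxiliary regression: wt as the response against all other predictors
# (mpg is excluded because it is the response of the main model)
aux_fit <- lm(wt ~ . - mpg, data = cars)
r2_wt <- summary(aux_fit)$r.squared

# VIF = 1 / (1 - R^2) from the auxiliary regression
vif_wt <- 1 / (1 - r2_wt)
print(vif_wt)
```

The result should match the wt entry returned by vif() from the car package for the full model.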

library(car)
vif(fitall)
##       cyl      disp        hp      drat        wt      qsec        vs        am 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  4.648487 
##      gear      carb 
##  5.357452  7.908747

As a general rule of thumb, a VIF greater than 5 warrants further investigation, while a VIF exceeding 10 indicates serious collinearity. Here cyl, disp and wt all exceed 10, and hp, qsec, carb and gear exceed 5, so there is evidence of collinearity among many of the variables, indicating Model 2 is over-fit.

Model 3 - Selecting Predictors

We have seen in Model 1 the problems that arise from omitting relevant variables, while Model 2 demonstrated the issues with including too many predictors which are highly correlated with each other. We would like to find a middle ground between these two extremes which minimises their negative effects. That is, we would like to choose a subset of the predictors which minimises both confounding and collinearity. This raises the question: which predictors should we use?

The Akaike Information Criterion (AIC) is a metric which estimates the prediction error (and hence quality) of a model given a set of data. It is calculated as \(AIC = 2k - 2\ln(L)\), where k is the number of estimated parameters in the model and L is the maximum value of the likelihood function for the model. A lower AIC is better: the \(-2\ln(L)\) term rewards good fit, while the \(2k\) term penalises overfitting. We will apply the AIC as follows:

  1. Fit the model of mpg as the response against all predictors and calculate the AIC.
  2. For each predictor, fit a model with that predictor removed and recalculate the AIC.
  3. Choose the model with the lowest AIC.
  4. Repeat steps 2-3 until removing any remaining predictor would increase the AIC.
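The four steps above can be sketched by hand as follows. This is a simplified, backward-only illustration and not the stepAIC implementation: it never adds a removed predictor back, it relies on base R's AIC(), and the names cars, current_terms, current_fit and candidate_aic are chosen here for illustration.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars

# Step 1: start from the model containing every predictor
current_terms <- setdiff(names(cars), "mpg")
current_fit <- lm(reformulate(current_terms, response = "mpg"), data = cars)

repeat {
  if (length(current_terms) <= 1) break

  # Step 2: AIC of each candidate model with one predictor removed
  candidate_aic <- sapply(current_terms, function(term) {
    remaining <- setdiff(current_terms, term)
    AIC(lm(reformulate(remaining, response = "mpg"), data = cars))
  })

  # Step 4 (stopping rule): no removal lowers the AIC
  if (min(candidate_aic) >= AIC(current_fit)) break

  # Step 3: keep the candidate with the lowest AIC
  drop_term <- names(which.min(candidate_aic))
  current_terms <- setdiff(current_terms, drop_term)
  current_fit <- lm(reformulate(current_terms, response = "mpg"), data = cars)
}

print(current_terms)
print(AIC(current_fit))
```

On these data the backward-only search is expected to settle on the same predictors as the automated selection, although that need not hold in general.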

Note at each step we also test the effect on the AIC of adding back in predictors which have already been removed.
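As a sanity check on the AIC formula, the value can be rebuilt from the maximised log-likelihood. The sketch below assumes base R's logLik() and AIC(); note that for a linear model R counts the error variance as an estimated parameter, so k is the number of coefficients plus one. The names cars, fit_full, log_lik and aic_by_hand are illustrative.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars
fit_full <- lm(mpg ~ ., data = cars)

# Maximised log-likelihood, ln(L), of the fitted model
log_lik <- logLik(fit_full)

# Number of estimated parameters: 11 coefficients + 1 error variance
k <- attr(log_lik, "df")

# AIC = 2k - 2 ln(L)
aic_by_hand <- 2 * k - 2 * as.numeric(log_lik)
print(aic_by_hand)

# Should agree with R's built-in AIC()
print(all.equal(aic_by_hand, AIC(fit_full)))
```

The AIC values printed by stepwise procedures for linear models come from extractAIC(), which drops an additive constant, so they differ from AIC() in level but rank models fitted to the same data identically.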

library(MASS)
fit <- stepAIC(fitall, direction = "both", trace = FALSE)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ammanual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The result of the AIC algorithm as described above is a model which contains weight, quarter mile time and transmission as predictors. With Model 3, we can say that there is a statistically significant difference (at the 5% significance level) in fuel efficiency between automatic and manual transmission cars. Indeed, on average, manual cars have a 2.94 (3 s.f.) higher MPG than automatic cars, keeping all other factors constant.
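To quantify the uncertainty around the transmission effect, a confidence interval for the coefficient can be computed. The sketch below refits the selected model on a local copy of the data and uses base R's confint(); the names cars, fit3 and ci are illustrative, and the 95% level is chosen to match the 5% significance level used throughout.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Model 3: weight, quarter mile time and transmission
fit3 <- lm(mpg ~ wt + qsec + am, data = cars)

# 95% confidence interval for the manual-vs-automatic difference in MPG
ci <- confint(fit3, "ammanual", level = 0.95)
print(ci)
```

An interval that excludes zero is consistent with the coefficient being significant at the 5% level; its width shows how imprecisely the size of the difference is estimated from 32 cars.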

Model Diagnostics

We would like to try to diagnose any issues with Model 3 as well as verify that the assumptions which underpin regression analysis hold true for the model.

To that end, we first try to identify collinearity. Again we use the VIF.

vif(fit)
##       wt     qsec       am 
## 2.482952 1.364339 2.541437

Based on the VIF metric, there does not seem to be collinearity between the predictors in the model.

Next we verify that the residuals are roughly normally distributed by plotting the standardised residuals against the theoretical quantiles of the standard normal distribution. We also verify the assumption that the residuals have constant variance by plotting the residuals and standardised residuals against the fitted values.

par(mfrow = c(2,2))
plot(fit)

Indeed, the standardised residuals fall roughly on a straight line when plotted against the normal quantiles indicating they are approximately normally distributed. There is some evidence of a trend in the standardised residuals but nothing to be too alarmed by.
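The visual check can be complemented with a formal test. The sketch below is an addition to the original analysis rather than part of it: it redraws the normal Q-Q plot from the standardised residuals and applies the Shapiro-Wilk test from base R. The names cars, fit3, std_res and sw are illustrative.

```r
# Work on a local copy so the mtcars object used elsewhere is untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))
fit3 <- lm(mpg ~ wt + qsec + am, data = cars)

# Standardised residuals, as used in the diagnostic plots
std_res <- rstandard(fit3)

# Normal Q-Q plot with a reference line
qqnorm(std_res)
qqline(std_res)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(std_res)
print(sw)
```

A small p-value would be evidence against normality; with only 32 observations the test has limited power, so it is best read alongside the plot rather than instead of it.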

Reflection

There are two main flaws with the data set used for this analysis; namely, it is too old and too small. The automotive industry moves quickly, with new features being added to cars all the time. At the time of writing, very few cars on the road will be from the era the data comes from. Hence there are serious concerns over how applicable the results would be to modern vehicles. Furthermore, the data set consists of only 32 cars; many more would be needed to have a high degree of confidence in the findings of the model. A further limitation is that the data set does not account for differences between vehicles of the same model. The horsepower of a car tends to deteriorate over its lifetime and other factors may change too. Ideally, we would want a data set that accounted for this.

Nevertheless, the analysis is useful for explaining some of the core concepts of regression such as confounding, collinearity, variable selection and model diagnostics which will be useful when working on other projects.