Rising fuel prices and climate change have driven motorists to be more concerned than ever with fuel economy. Consequently, understanding the factors that affect the miles per gallon (MPG) of cars is a key area of interest. This analysis focuses on the effect of transmission type (automatic or manual) on the MPG of cars. In particular, it addresses two questions: is there a difference in fuel efficiency between automatic and manual transmission cars, and if so, how large is that difference?
The data comes from the mtcars data set built into R, which was extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). The variables are as follows:
mpg: miles per (US) gallon
cyl: number of cylinders
disp: engine displacement (cubic inches)
hp: gross horsepower
drat: rear axle ratio
wt: weight (1000 lbs)
qsec: quarter-mile time (seconds)
vs: engine shape (0 = V-shaped, 1 = straight)
am: transmission (0 = automatic, 1 = manual)
gear: number of forward gears
carb: number of carburettors
We find that there is a statistically significant difference (at the 5% significance level) in fuel efficiency between automatic and manual transmission cars. Indeed, on average, manual cars have a 2.94 (3 s.f.) higher MPG than automatic cars, keeping all other factors constant.
We begin by loading the data and looking at some basic summaries.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Since we are interested in the effect of transmission on MPG, we plot that now.
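The comparison can be visualised with a box plot of MPG grouped by transmission. The sketch below illustrates the sort of plot used; at this point am is still coded numerically (0 = automatic, 1 = manual), as in the raw data.
# Box plot of fuel efficiency by transmission type
boxplot(mpg ~ am, data = mtcars,
        names = c("automatic", "manual"),
        xlab = "Transmission", ylab = "Miles per gallon (MPG)",
        main = "Fuel efficiency by transmission type")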
On average, manual transmission cars have a higher MPG than automatic cars. However, there may be other confounding factors, which we next examine by plotting mpg against all other variables (excluding transmission).
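One way to produce such a plot is with the base pairs function; this sketch is illustrative and not necessarily the exact plot used in the original analysis.
# Pairwise scatter plots of all variables except transmission
pairs(mtcars[, names(mtcars) != "am"], panel = panel.smooth,
      main = "Pairwise relationships between mpg and the remaining variables")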
From the above plot, we note that fuel efficiency is negatively associated with the number of cylinders, engine displacement, horsepower and weight. This intuitively makes sense: we would expect heavier cars to need bigger, more powerful engines with more cylinders and to be less efficient. Indeed, we see a positive pair-wise correlation between cylinders, engine size, weight and horsepower.
mtcars <- data.frame(apply(mtcars, 2, as.numeric))
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
We also note that transmission is itself fairly strongly correlated with several of these variables, most notably weight, rear axle ratio and the number of forward gears. It would therefore seem that the relationship between transmission and fuel economy may be confounded by some of the other variables.
We now fit several linear models to try to answer the two questions set for this analysis.
Since we are trying to investigate the relationship between transmission and fuel efficiency, we can begin by fitting a model with fuel efficiency as the response and transmission as the (single) predictor.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual")
fit1 <- lm(mpg ~ am, mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
According to this model, there is a statistically significant difference (at the 5% significance level) between manual and automatic cars, with manual cars having on average a 7.25 (3 s.f.) higher MPG than automatic cars. The R2 metric measures the proportion of the variation in the response explained by the predictor(s). However, the low R2 value of just 36% (2 s.f.) is evidence of a poor fit.
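As a quick supplementary check (not part of the original output), a 95% confidence interval for the estimated difference can be extracted directly from the fitted model.
# 95% confidence interval for the manual-vs-automatic difference in MPG
confint(fit1, "ammanual", level = 0.95)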
The issue with Model 1 is that it does not account for confounding from other variables: we saw in the exploratory analysis that many variables were highly correlated with both fuel efficiency and transmission, and Model 1 makes no attempt to adjust for them.
To overcome the issue of confounding from Model 1, we now fit a model of all variables against fuel efficiency.
fitall <- lm(mpg ~ ., mtcars)
summary(fitall)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## ammanual 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
According to Model 2, none of the variables has a statistically significant (at the 5% significance level) effect on fuel efficiency. Given the strong relationships between many of the variables and fuel efficiency, this seems highly improbable. The issue here is collinearity: predictors that are each highly correlated with the response are also highly correlated with one another. Collinearity inflates the standard errors of the slope estimates and can lead to unstable models.
We can measure the effect of collinearity using the variance inflation factor metric. The VIF of a given predictor within a model is calculated as \(1/(1 - R^2)\) where the R2 comes from regressing the given predictor (as the response) against the remaining predictors in the model. A high value of R2 in such a model would indicate the presence of collinearity since much of the variation in a predictor would already be explained by the other predictors. This would in turn lead to a high VIF for that particular predictor.
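To make the definition concrete, the VIF of a single predictor, say wt, can be computed by hand and should agree with the value reported by car::vif below; this sketch is illustrative and not part of the original analysis.
# R-squared from regressing wt on the remaining predictors (excluding the response mpg)
r2_wt <- summary(lm(wt ~ . - mpg, data = mtcars))$r.squared
1 / (1 - r2_wt)  # should match the wt entry of vif(fitall)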
library(car)
vif(fitall)
## cyl disp hp drat wt qsec vs am
## 15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873 4.648487
## gear carb
## 5.357452 7.908747
As a general rule of thumb, a VIF greater than 5 warrants further investigation, while a VIF exceeding 10 indicates serious collinearity. In this instance there is evidence of collinearity in many of the variables, indicating that Model 2 is over-fitted.
We have seen in Model 1 the problems that arise from omitting relevant variables, while Model 2 demonstrated the issues with including too many predictors which are highly correlated with each other. We would like to find a middle ground between these two extremes which minimises their negative effects. That is, we would like to choose a selection of the predictors which minimises confounding and collinearity. This raises the question: which predictors should we use?
The Akaike Information Criterion (AIC) is a metric which estimates the prediction error (and hence quality) of a model given a set of data. It is calculated as \(AIC = 2k - 2\ln(L)\), where k is the number of estimated parameters and L is the maximum value of the likelihood function for the model. Hence the AIC rewards good fit but also penalises overfitting. We use the AIC to select predictors as follows: starting from the full model, at each step we remove the predictor whose removal gives the largest reduction in AIC, stopping when no further removal lowers the AIC.
Note at each step we also test the effect on the AIC of adding back in predictors which have already been removed.
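For illustration (this comparison is not part of the original analysis), the AIC of two candidate models fitted to the same data can be compared directly with the built-in AIC function, lower values being preferred. Note that stepAIC uses extractAIC internally, whose values differ from AIC() by an additive constant, but the ranking of models is the same.
AIC(fit1, fitall)  # compare the single-predictor model with the full model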
library(MASS)
fit <- stepAIC(fitall, direction = "both", trace = FALSE)
summary(fit)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ammanual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The result of the AIC algorithm described above is a model which contains weight, quarter-mile time and transmission as predictors. With Model 3, we can say that there is a statistically significant difference (at the 5% significance level) in fuel efficiency between automatic and manual transmission cars. Indeed, on average, manual cars have a 2.94 (3 s.f.) higher MPG than automatic cars, keeping all other factors constant.
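As before, a confidence interval for the transmission coefficient in Model 3 can be obtained with confint (output not shown in the original report).
confint(fit, "ammanual", level = 0.95)  # 95% CI for the manual-vs-automatic difference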
We would like to try to diagnose any issues with Model 3 as well as verify that the assumptions which underpin regression analysis hold true for the model.
To that end, we first try to identify collinearity. Again we use the VIF.
vif(fit)
## wt qsec am
## 2.482952 1.364339 2.541437
Based on the VIF metric, there does not seem to be collinearity between the predictors in the model.
Next we verify that the residuals are roughly normally distributed by plotting the standardised residuals against the theoretical quantiles of the standard normal distribution. We also verify the assumption that the residuals have constant variance by plotting the residuals and standardised residuals against the fitted values.
par(mfrow = c(2,2))
plot(fit)
Indeed, the standardised residuals fall roughly on a straight line when plotted against the normal quantiles, indicating that they are approximately normally distributed. There is some evidence of a trend in the standardised residuals, but nothing too alarming.
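As an optional supplementary check on the normality assumption (not part of the original analysis), a Shapiro-Wilk test can be applied to the residuals; a large p-value would be consistent with approximate normality.
shapiro.test(residuals(fit))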
There are two main flaws with the data set used for this analysis; namely, it is too old and too small. The automotive industry moves quickly, with new features being added to cars all the time. At the time of writing, very few cars on the road will be from the era the data comes from, so there are serious concerns over how applicable the results are to modern vehicles. Furthermore, the data set consists of only 32 cars; many more would be needed to have a high degree of confidence in the findings the model generated. Finally, the data set does not account for differences between vehicles of the same model: the horsepower of a car tends to deteriorate over its lifetime, and other factors may change too. Ideally, we would want a data set that accounted for this.
Nevertheless, the analysis is useful for explaining some of the core concepts of regression such as confounding, collinearity, variable selection and model diagnostics which will be useful when working on other projects.