Executive Summary

Understanding how fuel efficiency depends on the characteristics of an automobile is complicated because many of those characteristics are interrelated and can confound one another. In this report, I examine how fuel consumption depends on transmission type, either automatic or manual, in a sample of 32 automobiles from the 1970s. I find that these data do not show a difference in fuel consumption between the two types of transmission once other characteristics are taken into account. A model that only includes transmission type is not appropriate for predicting fuel consumption, and for the cars in this sample, the effect of transmission type on fuel efficiency is not statistically distinguishable from zero at the 95% confidence level.

Exploratory Data Analysis

The data for this analysis come from the 1974 Motor Trend magazine and are available as mtcars in the datasets package in R. First, I load the data, look at the columns available, and examine the distribution of fuel consumption for the two types of transmission.

library(datasets)
data(mtcars)
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# am is coded 0 = automatic, 1 = manual; treat it as a categorical variable
mtcars$am <- as.factor(mtcars$am)

The histogram of MPG split by transmission type appears to show that manual transmission cars have higher fuel efficiency, i.e. higher MPG, while automatic transmission cars have lower MPG. There are many other possibly confounding factors in play here, however. What if manual transmission cars are on average lighter in weight? It will be important to consider other automobile characteristics when choosing an appropriate model.
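As a rough numeric companion to the histogram, the group means of MPG and weight can be compared directly. This is only a sketch: it assumes the coding given in the mtcars documentation (am: 0 = automatic, 1 = manual; wt in units of 1000 lb), and the name group_means is introduced here purely for illustration.

```r
library(datasets)
data(mtcars)

# Mean MPG and mean weight (in 1000 lb) for each transmission type.
# am is coded 0 = automatic, 1 = manual in the mtcars documentation.
group_means <- aggregate(cbind(mpg, wt) ~ am, data = mtcars, FUN = mean)

# Replace the 0/1 codes with readable labels for printing.
group_means$am <- factor(group_means$am, levels = c(0, 1),
                         labels = c("automatic", "manual"))
print(group_means)
```

In this sample the manual cars are lighter on average as well as more efficient, which is why weight is a natural first candidate for a confounder.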

Effect of Transmission Type

To model fuel consumption in MPG as a function of transmission type alone, I fit a linear model with lm().

fit <- lm(mpg ~ am, data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

This appears to show a highly significant difference between automatic and manual transmissions: the intercept, 17.147 MPG, is the mean for automatic cars, and manual cars average 7.245 MPG more (p = 0.000285). However, this model ignores every other characteristic of a car that affects fuel consumption, such as weight and number of cylinders, so the difference may reflect those confounders rather than the transmission itself. It would not be a good model to use. Instead, let’s build some other models that include these characteristics and see what role transmission type plays in them.
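One way to see what the transmission-only model is doing: with a single two-level factor as the only term, the regression is equivalent to a pooled-variance two-sample t-test. The sketch below checks that the two p-values agree and prints the 95% confidence interval for the difference; the names fit_am, tt, p_lm and p_tt are illustrative and not part of the original analysis.

```r
library(datasets)
data(mtcars)
mtcars$am <- as.factor(mtcars$am)

fit_am <- lm(mpg ~ am, data = mtcars)

# Pooled-variance (var.equal = TRUE) two-sample t-test on the same data.
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The p-value for the am1 coefficient and the t-test p-value should match.
p_lm <- summary(fit_am)$coefficients["am1", "Pr(>|t|)"]
p_tt <- tt$p.value
print(all.equal(p_lm, p_tt))

# 95% confidence interval for the manual-minus-automatic difference in MPG.
ci_am <- confint(fit_am)["am1", ]
print(ci_am)
```

Seen this way, Model 1 is a plain comparison of two group means, so it cannot separate the effect of the transmission from anything else that differs between the two groups.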

fit2 <- update(fit, mpg ~ am + wt)
fit3 <- update(fit, mpg ~ am + wt + cyl)
fit4 <- update(fit, mpg ~ am + wt + cyl + disp)
fit5 <- update(fit, mpg ~ am + wt + cyl + disp + gear)
fit6 <- update(fit, mpg ~ am + wt + cyl + disp + gear + hp)
fit7 <- update(fit, mpg ~ am + wt + cyl + disp + gear + hp + qsec)
fit8 <- update(fit, mpg ~ am + wt + cyl + disp + gear + hp + qsec + carb)
fitall <- update(fit, mpg ~ am + wt + cyl + disp + gear + hp + qsec + carb + drat)
anova(fit, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fitall)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + cyl
## Model 4: mpg ~ am + wt + cyl + disp
## Model 5: mpg ~ am + wt + cyl + disp + gear
## Model 6: mpg ~ am + wt + cyl + disp + gear + hp
## Model 7: mpg ~ am + wt + cyl + disp + gear + hp + qsec
## Model 8: mpg ~ am + wt + cyl + disp + gear + hp + qsec + carb
## Model 9: mpg ~ am + wt + cyl + disp + gear + hp + qsec + carb + drat
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 65.9424 4.602e-08 ***
## 3     28 191.05  1     87.27 13.0033  0.001569 ** 
## 4     27 188.43  1      2.62  0.3906  0.538436    
## 5     26 182.41  1      6.02  0.8971  0.353858    
## 6     25 163.06  1     19.35  2.8827  0.103639    
## 7     24 149.49  1     13.57  2.0212  0.169144    
## 8     23 149.31  1      0.18  0.0267  0.871743    
## 9     22 147.65  1      1.66  0.2472  0.623996    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

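This analysis of variance table compares each model with the one before it. Adding weight (Model 2) and number of cylinders (Model 3) each significantly reduces the residual sum of squares, while none of the terms added afterwards does (all p > 0.1). Most of the models with many parameters are therefore not good models to use; Models 2 and 3 look the most promising.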
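To make the F column concrete: when several models are passed to anova() at once, each F statistic divides the drop in RSS for that row by the residual mean square of the largest model in the table (the default behaviour described in the anova.lm documentation). The following sketch redoes that arithmetic for the wt and cyl rows using the rounded values printed in the table above, so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied from the ANOVA table above.
rss_largest <- 147.65   # residual sum of squares of Model 9
df_largest  <- 22       # residual degrees of freedom of Model 9
ss_wt  <- 442.58        # drop in RSS when wt is added (Model 2)
ss_cyl <- 87.27         # drop in RSS when cyl is added (Model 3)

# Residual mean square of the largest model: the common denominator.
ms_resid <- rss_largest / df_largest

# Each added term uses one degree of freedom, so mean square = sum of squares.
f_wt  <- (ss_wt / 1) / ms_resid
f_cyl <- (ss_cyl / 1) / ms_resid

# Upper-tail probabilities of the F distribution on (1, 22) degrees of freedom.
p_wt  <- pf(f_wt, df1 = 1, df2 = df_largest, lower.tail = FALSE)
p_cyl <- pf(f_cyl, df1 = 1, df2 = df_largest, lower.tail = FALSE)

print(round(c(F_wt = f_wt, F_cyl = f_cyl), 2))
print(c(p_wt = p_wt, p_cyl = p_cyl))
```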

Let’s look at the coefficients for the transmission term for each of these models.

# [2] picks the second entry of each coefficient matrix: the estimate for am1
rbind(summary(fit)$coef[2], summary(fit2)$coef[2], summary(fit3)$coef[2], 
      summary(fit4)$coef[2], summary(fit5)$coef[2], summary(fit6)$coef[2],
      summary(fit7)$coef[2], summary(fit8)$coef[2], summary(fitall)$coef[2])
##              [,1]
##  [1,]  7.24493927
##  [2,] -0.02361522
##  [3,]  0.17649316
##  [4,]  0.12906557
##  [5,]  1.26190687
##  [6,]  1.46499471
##  [7,]  2.60930196
##  [8,]  2.60526083
##  [9,]  2.45509738

These vary quite a bit, from about -0.02 to 2.61, but mainly notice that none comes close to the 7.24 of the first model, which fitted transmission type as the only parameter that fuel consumption depended on. In that first model, the large coefficient for transmission type arose because other important parameters were left out, so transmission type absorbed their effects. A linear model that only includes transmission type ignores other important characteristics of cars that affect fuel consumption.

It will be easier to see this graphically. The plot below shows the fitted coefficients from each of the models tested above. Specifically notice the transmission term.

Notice that for every model except Model 1 (the first model, which included only transmission type as a parameter), the coefficient for transmission type is much smaller and its 95% confidence interval includes zero. Based on this, we cannot conclude that there is any difference in fuel efficiency between automatic and manual transmissions in the cars in this sample once other characteristics are accounted for.
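The intervals behind this statement can also be tabulated rather than read off a plot. The sketch below refits the same nine nested models and collects the am1 estimate with its 95% confidence interval from confint(); terms_in_order and am_ci are names introduced for this illustration, and the models are rebuilt with reformulate() rather than update() only to keep the block self-contained.

```r
library(datasets)
data(mtcars)
mtcars$am <- as.factor(mtcars$am)

# Terms in the order they were added to Models 1 through 9.
terms_in_order <- c("am", "wt", "cyl", "disp", "gear", "hp", "qsec", "carb", "drat")

# For each k, fit mpg against the first k terms and keep the am1 row.
am_ci <- t(sapply(seq_along(terms_in_order), function(k) {
  f <- reformulate(terms_in_order[1:k], response = "mpg")
  m <- lm(f, data = mtcars)
  c(estimate = unname(coef(m)["am1"]), confint(m)["am1", ])
}))
rownames(am_ci) <- paste("Model", seq_len(nrow(am_ci)))

# Columns: point estimate, lower and upper 95% confidence limits.
print(round(am_ci, 2))
```

A row whose lower limit is negative and whose upper limit is positive corresponds to an interval that includes zero.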

Best Model and Residual Diagnostics

Given that I found no detectable effect of transmission type on fuel efficiency in this sample of cars, a best model would not include transmission type as a parameter at all. From the graph above and the ANOVA table, it appears that the important parameters for predicting MPG are weight and number of cylinders. These are the parameters whose coefficients have 95% confidence intervals that exclude zero in all or almost all of the models I tested, especially in the models that looked promising in the ANOVA table. Let’s make a new model and then investigate its residuals.

fitnotrans <- lm(mpg ~ wt + cyl, data = mtcars)
summary(fitnotrans)
## 
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

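Notice that the coefficients for weight and number of cylinders are both negative and significant at the 95% confidence level; heavier cars and cars with more cylinders use more fuel. Since wt is recorded in units of 1000 lb, each additional 1000 lb lowers the expected fuel efficiency by about 3.19 MPG with the number of cylinders held fixed, and each additional cylinder lowers it by about 1.51 MPG with weight held fixed. This is a better regression model to use than any of those in the earlier section because it includes the parameters that are important in predicting MPG and doesn’t include parameters that are not; its adjusted R-squared is 0.8185, compared with 0.3385 for the transmission-only model. Let’s look at the residuals for this model.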
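To illustrate how these coefficients are used, here is a small worked prediction for a hypothetical car that is not in the data set: 3000 lb (wt = 3) with 6 cylinders. By hand, 39.6863 - 3.1910 × 3 - 1.5078 × 6 is about 21.07 MPG; the sketch confirms that predict() returns the same number. The names fit_wt_cyl and new_car are illustrative.

```r
library(datasets)
data(mtcars)

fit_wt_cyl <- lm(mpg ~ wt + cyl, data = mtcars)

# Hypothetical car: 3000 lb (wt is in units of 1000 lb) with 6 cylinders.
new_car <- data.frame(wt = 3, cyl = 6)

# Prediction by hand: intercept + b_wt * wt + b_cyl * cyl.
by_hand <- sum(coef(fit_wt_cyl) * c(1, new_car$wt, new_car$cyl))

# The same prediction via predict().
by_predict <- unname(predict(fit_wt_cyl, newdata = new_car))
print(c(by_hand = by_hand, by_predict = by_predict))

# 95% prediction interval for a single new car with these characteristics.
print(predict(fit_wt_cyl, newdata = new_car, interval = "prediction"))
```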

The residuals for this model look about as we would like. The 95% confidence interval on the local smoothed fit to the residuals includes zero except at the very extreme high end. This indicates that there is no systematic pattern left in the residuals and therefore no serious problem with this model.
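A residual plot of this kind can be approximated with base graphics. The sketch below is an assumption about the layout, not a reproduction of the original figure: it plots residuals against fitted values and overlays a lowess() smooth, which does not draw a confidence band. The names fit_wt_cyl, res and fitted_mpg are illustrative.

```r
library(datasets)
data(mtcars)

fit_wt_cyl <- lm(mpg ~ wt + cyl, data = mtcars)
res <- resid(fit_wt_cyl)
fitted_mpg <- fitted(fit_wt_cyl)

# Residuals against fitted values, with a zero reference line and a
# lowess smooth; a smooth that stays close to zero suggests no trend.
plot(fitted_mpg, res, xlab = "Fitted MPG", ylab = "Residual (MPG)")
abline(h = 0, lty = 2)
lines(lowess(fitted_mpg, res), col = "red")

# The three cars with the largest absolute residuals.
print(head(sort(abs(res), decreasing = TRUE), 3))
```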