The mtcars dataset records fuel efficiency in MPG (miles per gallon) for a set of cars, along with factors that may affect it, such as the number of cylinders, displacement, horsepower, and weight. For some of these factors the direction of the effect is unsurprising: for example, MPG goes down as the weight of the car goes up. Figure 1 in the appendix summarizes how the different factors relate to MPG.
This paper, however, focuses mainly on transmission type: is an automatic or a manual transmission better for MPG? To answer this question, the paper proceeds in three sections. The exploratory analysis section summarizes the relationship between mpg and am without considering other variables; the model fitting section builds and compares different models; and the summary section presents the discussion and conclusion.
Exploratory analysis
First, we look at a very straightforward comparison of how MPG varies with am (transmission type, where 0 = automatic and 1 = manual), ignoring all other variables. The summary is shown in Figure 2 (see appendix).
In this figure, the barplot shows how the average MPG differs between the two transmission types, while the boxplot shows how the data are distributed. Both plots show that cars with a manual transmission have better fuel efficiency.
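The numbers behind these two plots can be reproduced with a short sketch along the following lines. The names `mtcars_labeled`, `group_means`, and `box_stats` are introduced here for illustration and are not part of the original analysis.

```r
# Label the transmission type (0 = automatic, 1 = manual) for readability.
mtcars_labeled <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
)

# Mean MPG per transmission type (the quantity a barplot of averages shows).
group_means <- aggregate(mpg ~ am, data = mtcars_labeled, FUN = mean)
print(group_means)

# Five-number summaries per transmission type (what a boxplot draws).
# plot = FALSE returns the statistics only; drop it to draw the boxplot.
box_stats <- boxplot(mpg ~ am, data = mtcars_labeled, plot = FALSE)
print(box_stats$stats)
```

Each column of `box_stats$stats` holds the lower whisker, lower hinge, median, upper hinge, and upper whisker for one transmission type.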
Model fitting
The naive model discussed in the exploratory analysis is fitted as follows.
fit <- lm(mpg ~ am, mtcars)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
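The model shows that the average MPG for automatic transmissions is 17.15 (the intercept), while the average for manual transmissions is 24.39 (the intercept plus the am coefficient, 17.15 + 7.24). The p-value of the am coefficient is very low, 2.85e-04: if the two transmission types had the same mean MPG, a difference at least this large would be very unlikely. It therefore seems safe to reject the null hypothesis and conclude that, in this model, the manual transmission is more fuel efficient.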
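As a sanity check on this reading of the coefficients, the sketch below recovers the two group means from the fitted model and compares the p-value with a pooled-variance two-sample t-test, which tests the same hypothesis as the am coefficient in this one-predictor model. The variable names (`mean_automatic`, `mean_manual`, `pooled_test`, `p_lm`) are illustrative only.

```r
fit <- lm(mpg ~ am, data = mtcars)
b <- coef(fit)

# Intercept = mean MPG for am == 0 (automatic);
# intercept + am coefficient = mean MPG for am == 1 (manual).
mean_automatic <- unname(b["(Intercept)"])
mean_manual <- unname(b["(Intercept)"] + b["am"])
print(c(automatic = mean_automatic, manual = mean_manual))

# The same means computed directly from the data.
print(tapply(mtcars$mpg, mtcars$am, mean))

# A two-sample t-test with pooled variance is equivalent to the
# t-test on the am coefficient, so the p-values should agree.
pooled_test <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
p_lm <- summary(fit)$coef["am", "Pr(>|t|)"]
print(c(t_test = pooled_test$p.value, lm = p_lm))
```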
However, this model is very naive: it considers only the transmission type and implicitly assumes that transmission type is uncorrelated with the other variables. As Figure 3 in the appendix shows, the opposite is the case. Transmission type is strongly correlated with disp (displacement), drat (rear axle ratio), and wt (weight) in particular.
A closer look at models that add each of these three variables (plus cyl and vs, which are discrete and not easily judged from the figure) is given below. Detailed summaries of the models can be found in the appendix. What is discussed here is the estimated difference between the two transmission types after each extra variable is added, which corresponds to the Estimate entry of the am row in the summary tables, together with its p-value (Pr(>|t|)), which indicates whether that difference is still statistically distinguishable from zero.
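The strength of these relationships can also be checked numerically. The sketch below computes the Pearson correlation between am and each of the other variables considered in this section; using Pearson correlation is an assumption made here, since the text does not say how Figure 3 was produced, and the names `other_vars` and `am_correlations` are illustrative.

```r
# Variables whose relationship with transmission type is examined.
other_vars <- c("disp", "drat", "wt", "cyl", "vs")

# Pearson correlation of am (0 = automatic, 1 = manual) with each variable.
am_correlations <- sapply(mtcars[other_vars], function(x) cor(mtcars$am, x))
print(round(am_correlations, 2))
```

A negative value means that manual cars tend to have lower values of that variable, and a positive value means they tend to have higher ones.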
fit1 <- lm(mpg ~ am + disp, data = mtcars)
fit2 <- lm(mpg ~ am + drat, data = mtcars)
fit3 <- lm(mpg ~ am + wt, data = mtcars)
fit4 <- lm(mpg ~ am + cyl, data = mtcars)
fit5 <- lm(mpg ~ am + vs, data = mtcars)
| Added variable | none | disp | drat | wt | cyl | vs |
| --- | --- | --- | --- | --- | --- | --- |
| am estimate (mpg) | 7.24 | 1.83 | 2.81 | -0.02 | 2.57 | 6.07 |
| am p-value | 0.00029 | 0.212 | 0.229 | 0.988 | 0.056 | 5e-05 |
As the table shows, each of the five added variables reduces the estimated mpg gap between automatic and manual transmissions. All of them except vs also raise the p-value of the am coefficient, weakening the evidence for a difference between the two transmission types. The effect of weight is the most striking: once wt is included, the estimated difference shrinks to -0.02 mpg with a p-value of 0.988. In other words, when weight is taken into account, the data show no detectable difference between automatic and manual transmissions.
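The am rows of the five two-variable models can be collected in one step rather than read off each summary by hand. The following is a minimal sketch of one way to do that; the names `added_vars` and `am_rows` are hypothetical and do not appear in the original analysis.

```r
# Variables added one at a time to the naive model mpg ~ am.
added_vars <- c("disp", "drat", "wt", "cyl", "vs")

# For each added variable, fit mpg ~ am + <variable> and keep the
# estimate and p-value of the am coefficient.
am_rows <- sapply(added_vars, function(v) {
  fit_v <- lm(reformulate(c("am", v), response = "mpg"), data = mtcars)
  summary(fit_v)$coef["am", c("Estimate", "Pr(>|t|)")]
})

# Rows: Estimate and Pr(>|t|); columns: the added variable.
print(signif(am_rows, 3))
```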
To look more closely, a new model is built that includes the three variables (disp, drat, and wt) that affect the result most.
fit6 <- lm(mpg ~ am + wt + disp + drat, data = mtcars)
summary(fit6)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.89479301 7.28161771 4.2428474 0.0002321489
## am -0.28147667 1.69703010 -0.1658643 0.8694995274
## wt -3.25888551 1.34392690 -2.4248979 0.0222787536
## disp -0.01605787 0.00995161 -1.6135951 0.1182435965
## drat 0.97307718 1.67310709 0.5815989 0.5656614773
anova(fit3, fit6)
## Analysis of Variance Table
##
## Model 1: mpg ~ am + wt
## Model 2: mpg ~ am + wt + disp + drat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 278.32
## 2 27 243.51 2 34.814 1.9301 0.1646
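The four-variable model reduces the residual sum of squares from 278.32 to 243.51 (see Figure 4). However, the p-value of the F-test, 0.1646, is not small: at the usual 5% level, the data do not show that adding disp and drat to the model with am and wt is necessary. In both models, the am coefficient is close to zero and far from significant.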
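The F statistic and p-value in the analysis of variance table can be reproduced by hand from the two residual sums of squares, which makes explicit what the comparison measures. This is a sketch with illustrative variable names (`rss_small`, `rss_large`, `df_extra`, `f_stat`, `p_value`).

```r
fit3 <- lm(mpg ~ am + wt, data = mtcars)
fit6 <- lm(mpg ~ am + wt + disp + drat, data = mtcars)

# Residual sums of squares of the smaller and the larger model.
rss_small <- sum(resid(fit3)^2)
rss_large <- sum(resid(fit6)^2)

# Number of extra coefficients in the larger model (disp and drat).
df_extra <- df.residual(fit3) - df.residual(fit6)

# F = (drop in RSS per extra coefficient) / (residual variance of larger model).
f_stat <- ((rss_small - rss_large) / df_extra) /
  (rss_large / df.residual(fit6))

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_stat, df_extra, df.residual(fit6), lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```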
Summary
Can we now answer the question of whether an automatic or a manual transmission is better for MPG? The answer is actually model dependent. Based on the naive model, we would conclude that the manual transmission is better. However, if we include more variables, especially weight, we would conclude that transmission type does not make a big difference, or even that the automatic transmission is slightly better. The reason for this uncertainty is that transmission type is confounded with other variables in our data: we do not have data in which am is not entangled with them. Based on what we have, nevertheless, and if we assume that the correlation between am and the other variables is always the same when cars are built, the four-variable model slightly favors the automatic transmission, with an mpg benefit of 0.28 \(\pm\) 1.70 (estimate \(\pm\) standard error). Since the standard error is far larger than the estimate, this benefit is not statistically distinguishable from zero.
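To put the size of this uncertainty in more familiar terms, the estimate and standard error of the am coefficient can be turned into a confidence interval. The sketch below does this by hand and with `confint`; the 95% level is an assumption chosen here for illustration, and the names `am_est`, `am_se`, `t_crit`, and `am_ci` are not part of the original analysis.

```r
fit6 <- lm(mpg ~ am + wt + disp + drat, data = mtcars)

# Estimate and standard error of the am coefficient.
am_est <- coef(summary(fit6))["am", "Estimate"]
am_se <- coef(summary(fit6))["am", "Std. Error"]

# Critical value of the t distribution for a two-sided 95% interval.
t_crit <- qt(0.975, df.residual(fit6))

# Interval computed by hand: estimate +/- critical value * standard error.
am_ci <- c(lower = am_est - t_crit * am_se,
           upper = am_est + t_crit * am_se)
print(am_ci)

# The same interval from the built-in helper.
print(confint(fit6, "am", level = 0.95))
```

An interval that contains zero is consistent with no difference between the two transmission types once wt, disp, and drat are in the model.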
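Appendix
Figure 1. Summary of how different factors affect MPG.
Figure 2. Barplot of average MPG and boxplot of the MPG distribution by transmission type.
Figure 3. Correlation of transmission type with the other variables.
Figure 4. Residual sum of squares of the fitted models.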
Summary of models
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.84808111 1.834071377 15.183750 2.452658e-15
## am 1.83345825 1.436099585 1.276693 2.118396e-01
## disp -0.03685086 0.005781896 -6.373490 5.747528e-07
summary(fit2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.949883 7.073285 -0.2756687 0.78475740
## am 2.807061 2.282159 1.2300023 0.22858143
## drat 5.811143 2.129833 2.7284496 0.01069548
summary(fit3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.32155131 3.0546385 12.21799285 5.843477e-13
## am -0.02361522 1.5456453 -0.01527855 9.879146e-01
## wt -5.35281145 0.7882438 -6.79080719 1.867415e-07
summary(fit4)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.522443 2.6031842 13.261621 7.694408e-14
## am 2.567035 1.2914280 1.987749 5.635445e-02
## cyl -2.500958 0.3608282 -6.931159 1.284560e-07
summary(fit5)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.594444 0.9261514 15.758162 9.352153e-16
## am 6.066667 1.2748423 4.758759 4.958115e-05
## vs 6.929365 1.2621316 5.490208 6.500962e-06