Executive summary

Does a manual transmission have better fuel economy in miles per gallons (mpg) than automatic? Let’s quantify this. We use mtcars and fit mpg with several models where am (transmission type) is the main predictor and include other regressors. I used nested models to select the key variables. The model suggests that manual gives a 2.08 more mpg than automatic, but due to the high p-value of am we have to reject that hypothesis.

Exploring the data

This a data.frame with 32 observations and 11 variables. Two variables are factors: am (see above with 0 = automatic and 1 = manual) and vs (V-shaped or straight engine - see Stackoverflow). I consider gear as numeric.

library(datasets) ; ?mtcars ; str(mtcars) ; cor(mtcars)

We see that am is quite correlated to drat (0.71), wt (-0.69) and gear (0.79).

boxplot(mpg~ am, data = mtcars, ylab="mpg", xlab = "FIGURE 1 : 0 = Automatic; 1 = Manual")

fit1 <- lm(mpg ~ am , data = mtcars) ; summary(fit1)

Manual transmission gives better fuel economy (see fig 1). There are no outliers but the max values of “automatic” overlap the lowest 25% percentile of the “manual” data. In this simple unadjusted model, since am acts as a factor, the intercept is the mean of mpg with am = 0 (automatic). The 1st coefficient is the gap of mean mpg with “manual”, i.e. the mpg is 7.24 higher with manual than automatic. Since the p-values are below 5%, the gap is significant. Yet we have not included other regressors therefore we might have ignored their effect. Let’s find out.

Adjustement with other variables

We will use a nested model and anova to have a feeling about influential variables,

fit2 <- lm(mpg ~am +cyl,data=mtcars); fit3 <-lm(mpg ~am+cyl+disp,data =mtcars) # ...Etc until using all regressors are taken into account.
fitAll<-lm(mpg ~ . ,data =mtcars);anova(fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9,fitAll)

cyl, hp and wt seem significant. Let’s analyze the model including those regressors only (see appendix). In fact we see that cyl is not so significant. Let’s drop it for simplicity and reduction of model inflation.

fitNewCyl <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
fitNew <- lm(mpg ~ am + wt + hp, data = mtcars); summary(fitNew)$coefficients

This model does not include interaction between regressors so the coefficent of am means that, all other parameters being equal, the mpg with manual transmission (am = 1) is 2.08 higher than with automatic. The coef of wt means that for every additional 1000 lbs, all other factors being equal, we are losing 2.87 mpg of fuel economy (negative). The coef of hp, though significant according to the low p-value (0.000546) has in fact little influence (0.0374 loss of mpg for every additional hp). However the standard error of am is 1.37 (p-value is .14), meaning that we would have to reject the hypothesis that “manual” gives a better fuel economy than “automatic”.We can have a look at the residuals sorted by am.

plot(predict(fitNew), resid(fitNew), type = "n"); points(predict(fitNew)[mtcars$am == 0], resid(fitNew)[mtcars$am == 0], col = "blue")
points(predict(fitNew)[mtcars$am == 1], resid(fitNew)[mtcars$am == 1], col = "red")

We cannot see a clear pattern of the residuals, except that there are a bit more red points (1 = manual) on the right and blue points (0) on the left.

Caveats or limitations

We could have considered gear a 3-level factor instead of a numeric variable. Plus, the step analysis suggests that wt, qsec, am are better predictors. It was not too natural for me to include the qsec variable (1/4 mile time) so kept my nested model approach. (see code in appendix).

Appendix

What is mtcars made of?

?mtcars ; str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

A view of the correlations between all variables and coefficients of fit1 (lm(mpg ~ am , data = mtcars))

cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
summary(fit1)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

Analysis of the nested models

data.frame(anova(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fit9, fitAll))
##    Res.Df      RSS Df   Sum.of.Sq           F       Pr..F.
## 1      30 720.8966 NA          NA          NA           NA
## 2      29 271.3621  1 449.5344789 64.00393598 8.231226e-08
## 3      28 252.0811  1  19.2809820  2.74519263 1.124136e-01
## 4      27 216.3673  1  35.7137970  5.08486820 3.493497e-02
## 5      26 214.4969  1   1.8704425  0.26631035 6.112088e-01
## 6      25 162.4336  1  52.0633331  7.41268668 1.275295e-02
## 7      24 149.0899  1  13.3437093  1.89985408 1.826036e-01
## 8      23 148.8728  1   0.2170472  0.03090280 8.621415e-01
## 9      22 147.9011  1   0.9717105  0.13835045 7.136533e-01
## 10     21 147.4944  1   0.4066688  0.05790079 8.121787e-01

Analysis of a first model but including cyl = (lm(mpg ~ am + cyl + hp + wt, data = mtcars))

summary(fitNewCyl)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
## am           1.47804771 1.44114927  1.025603 3.141799e-01
## cyl         -0.74515702 0.58278741 -1.278609 2.119166e-01
## hp          -0.02495106 0.01364614 -1.828433 7.855337e-02
## wt          -2.60648071 0.91983749 -2.833632 8.603218e-03

Analysis of the final model (dropping cyl) = lm(mpg ~ am + hp + wt, data = mtcars)

summary(fitNew)$coefficients
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## am           2.08371013 1.376420152  1.513862 1.412682e-01
## wt          -2.87857541 0.904970538 -3.180850 3.574031e-03
## hp          -0.03747873 0.009605422 -3.901830 5.464023e-04

Step method gives wt, qsec, am as best predictors.

step(fitAll, k = log(nrow(mtcars)))
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936