Does a manual transmission have better fuel economy in miles per gallons (mpg) than automatic? Let’s quantify this. We use mtcars and fit mpg with several models where am (transmission type) is the main predictor and include other regressors. I used nested models to select the key variables. The model suggests that manual gives a 2.08 more mpg than automatic, but due to the high p-value of am we have to reject that hypothesis.
This a data.frame with 32 observations and 11 variables. Two variables are factors: am (see above with 0 = automatic and 1 = manual) and vs (V-shaped or straight engine - see Stackoverflow). I consider gear as numeric.
library(datasets) ; ?mtcars ; str(mtcars) ; cor(mtcars)
We see that am is quite correlated to drat (0.71), wt (-0.69) and gear (0.79).
boxplot(mpg~ am, data = mtcars, ylab="mpg", xlab = "FIGURE 1 : 0 = Automatic; 1 = Manual")
fit1 <- lm(mpg ~ am , data = mtcars) ; summary(fit1)
Manual transmission gives better fuel economy (see fig 1). There are no outliers but the max values of “automatic” overlap the lowest 25% percentile of the “manual” data. In this simple unadjusted model, since am acts as a factor, the intercept is the mean of mpg with am = 0 (automatic). The 1st coefficient is the gap of mean mpg with “manual”, i.e. the mpg is 7.24 higher with manual than automatic. Since the p-values are below 5%, the gap is significant. Yet we have not included other regressors therefore we might have ignored their effect. Let’s find out.
We will use a nested model and anova to have a feeling about influential variables,
fit2 <- lm(mpg ~am +cyl,data=mtcars); fit3 <-lm(mpg ~am+cyl+disp,data =mtcars) # ...Etc until using all regressors are taken into account.
fitAll<-lm(mpg ~ . ,data =mtcars);anova(fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9,fitAll)
cyl, hp and wt seem significant. Let’s analyze the model including those regressors only (see appendix). In fact we see that cyl is not so significant. Let’s drop it for simplicity and reduction of model inflation.
fitNewCyl <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
fitNew <- lm(mpg ~ am + wt + hp, data = mtcars); summary(fitNew)$coefficients
This model does not include interaction between regressors so the coefficent of am means that, all other parameters being equal, the mpg with manual transmission (am = 1) is 2.08 higher than with automatic. The coef of wt means that for every additional 1000 lbs, all other factors being equal, we are losing 2.87 mpg of fuel economy (negative). The coef of hp, though significant according to the low p-value (0.000546) has in fact little influence (0.0374 loss of mpg for every additional hp). However the standard error of am is 1.37 (p-value is .14), meaning that we would have to reject the hypothesis that “manual” gives a better fuel economy than “automatic”.We can have a look at the residuals sorted by am.
plot(predict(fitNew), resid(fitNew), type = "n"); points(predict(fitNew)[mtcars$am == 0], resid(fitNew)[mtcars$am == 0], col = "blue")
points(predict(fitNew)[mtcars$am == 1], resid(fitNew)[mtcars$am == 1], col = "red")
We cannot see a clear pattern of the residuals, except that there are a bit more red points (1 = manual) on the right and blue points (0) on the left.
We could have considered gear a 3-level factor instead of a numeric variable. Plus, the step analysis suggests that wt, qsec, am are better predictors. It was not too natural for me to include the qsec variable (1/4 mile time) so kept my nested model approach. (see code in appendix).
What is mtcars made of?
?mtcars ; str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
A view of the correlations between all variables and coefficients of fit1 (lm(mpg ~ am , data = mtcars))
cor(mtcars)
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
summary(fit1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am 7.244939 1.764422 4.106127 2.850207e-04
Analysis of the nested models
data.frame(anova(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fit9, fitAll))
## Res.Df RSS Df Sum.of.Sq F Pr..F.
## 1 30 720.8966 NA NA NA NA
## 2 29 271.3621 1 449.5344789 64.00393598 8.231226e-08
## 3 28 252.0811 1 19.2809820 2.74519263 1.124136e-01
## 4 27 216.3673 1 35.7137970 5.08486820 3.493497e-02
## 5 26 214.4969 1 1.8704425 0.26631035 6.112088e-01
## 6 25 162.4336 1 52.0633331 7.41268668 1.275295e-02
## 7 24 149.0899 1 13.3437093 1.89985408 1.826036e-01
## 8 23 148.8728 1 0.2170472 0.03090280 8.621415e-01
## 9 22 147.9011 1 0.9717105 0.13835045 7.136533e-01
## 10 21 147.4944 1 0.4066688 0.05790079 8.121787e-01
Analysis of a first model but including cyl = (lm(mpg ~ am + cyl + hp + wt, data = mtcars))
summary(fitNewCyl)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
## am 1.47804771 1.44114927 1.025603 3.141799e-01
## cyl -0.74515702 0.58278741 -1.278609 2.119166e-01
## hp -0.02495106 0.01364614 -1.828433 7.855337e-02
## wt -2.60648071 0.91983749 -2.833632 8.603218e-03
Analysis of the final model (dropping cyl) = lm(mpg ~ am + hp + wt, data = mtcars)
summary(fitNew)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## am 2.08371013 1.376420152 1.513862 1.412682e-01
## wt -2.87857541 0.904970538 -3.180850 3.574031e-03
## hp -0.03747873 0.009605422 -3.901830 5.464023e-04
Step method gives wt, qsec, am as best predictors.
step(fitAll, k = log(nrow(mtcars)))
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936