I work for Motor Trend (not really) and was asked (i) if “automatic or manual transmission better for MPG” and (ii) to “quantify the MPG difference between automatic and manual transmissions.” Please note: I am trying to learn ggplot2 so those packages are retrieved. The corresponding base graphics can be found in Appendix A-4.
Superficially, a manual transmission is better for MPG (yes, it makes a difference). A naive regression the manual increases fuel economy by ~ 7 MPG (R.adj^2 = 0.33). But when we account for confounding variables (and interactions), the adj. R^2 increases to 0.88 but with nuance. Specifically, weight interacts. For lighter cars (~ 2,000 lbs), the manual adds ~ 5 to 7 MPG but, as the car’s weight increases, the impact of the manual decreases. At a certain point, for heavy cars, the weight overwhelms the influence of transmission.
The dataframe only contains 32 (old) observations. The key predictor is (am), a Bernoulli variable (0 or 1) that is obviously categorical. I was curious if as.factor() is necessary, but apparently it does not matter; see my test @ http://rpubs.com/bionicturtle/test-am-factor. The violin chart superfically suggests a manual (am = 1) is associated with better fuel economy (higher MPG).
library(ggplot2); data(mtcars) #dataframe with 32 obs of 11 variables
mtcars$am <- as.factor(mtcars$am); levels(mtcars$am) <- c("auto", "man")
avg.a <- mean(mtcars$mpg[mtcars$am=="auto"]); avg.m <- mean(mtcars$mpg[mtcars$am=="man"])
p <- ggplot(mtcars, aes(x = am, y = mpg)); p + geom_violin() +
geom_boxplot(width = .1, fill = "black", outlier.color=NA) +
stat_summary(fun.y = median, geom="point", fill="white", shape=21, size=5.0)
On this naive look, the averages differ:
And a t-test virtually confirms a difference in the averages (based on assumptions):
mean.diff <- t.test(mpg ~ am, data=mtcars)
Specifically, the p-value (aka, exact significance level) is 0.0013736. For a desired confidence level of 0.95, the confidence interval is -11.2801944 to -3.2096842, which easily excludes zero.
Per Brian Coffo’s lectures, I really love the simplicity of using pairs(.) to generate a matrix of scatterplots just to visually notice some relationships; see Appendix A-1. The obvious first try is ln(mpg ~ am):
fit.0.mpg.am <- lm(mpg ~ am, data = mtcars)
summary(fit.0.mpg.am)$coefficients; simple.r.sqr <- summary(fit.0.mpg.am)$adj.r.squared
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## amman 7.244939 1.764422 4.106127 2.850207e-04
For this model, the intercept is conveniently equal to the average MPG among automatic transmissions; i.e., when am = 0. Similarly, the sum of the intercept and slope coefficients is equal to the average MPG among manual transmissions; i.e., when am = 1. But the adjusted R^2 for this model is only 0.3384589.
So I used step(.) for the first time, which is incredibly powerful! See Appendix A-2. As I preferred a purely statistcal approach (as opposed to a fundamental hypothesis), I did the following:
# Best predictors are: wt, qsec, am; With interactions, highest adjusted R^2 is interact2: am*wt
mpg.interact2 <- lm(mpg ~ wt + qsec + am + am*wt, data=mtcars)
summary(mpg.interact2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.723053 5.8990407 1.648243 0.1108925394
## wt -2.936531 0.6660253 -4.409038 0.0001488947
## qsec 1.016974 0.2520152 4.035366 0.0004030165
## amman 14.079428 3.4352512 4.098515 0.0003408693
## wt:amman -4.141376 1.1968119 -3.460340 0.0018085763
best.r <- summary(mpg.interact2)$adj.r.squared
summary(mpg.interact2)$adj.r.squared #This model has the highest adjusted R^2
## [1] 0.8804219
The adjusted R^2 for our preferred model is 0.8804219.
We might still be careful to notice relationships illustrated in the Appendix A-3. Specifically,
Let’s look at the model (with three predictors) both with and without the interaction:
mpg.interact2.wo <- lm(mpg ~ wt + qsec + am, data=mtcars)
mpg.interact2 <- lm(mpg ~ wt + qsec + am + am*wt, data=mtcars)
mpg.interact2.wo$coefficients
## (Intercept) wt qsec amman
## 9.617781 -3.916504 1.225886 2.935837
mpg.interact2$coefficients
## (Intercept) wt qsec amman wt:amman
## 9.723053 -2.936531 1.016974 14.079428 -4.141376
# key coefficients
beta <- mpg.interact2.wo[[1]]["amman"]
beta2 <- mpg.interact2[[1]]["amman"]
beta2.i <- mpg.interact2[[1]]["wt:amman"]
The three-predictor model, without interactions, suggests that a manual transmission adds 2.9358372 MPG. Our preferred model, however, includes the interaction. With the interaction, the effect of the manual transmision is given by: coefficient[amman] + coefficient[wt:amman]*wt. If our model is correctly specified (a huge assumption), we can observe that the impact of a manual transmission on MPG is very sensitive to the car’s weight. For example,
This suggests that, at least at lower weight classes, a manual transmission improves MPG up to +7.0. But, as weight increases, the difference attenuates above a certain weight, weight overwhelms the transmission’s impact.
Finally, I check the assumptions concerning the residuals:
mtcars.add <- mtcars
mtcars.add$residuals<-resid(mpg.interact2)
mtcars.add$standardized.residuals <- rstandard(mpg.interact2)
mtcars.add$fitted <- mpg.interact2$fitted.values
require(gridExtra)
histogram.1 <- ggplot(mtcars.add, aes(standardized.residuals)) +
theme(legend.position = "none") +
geom_histogram(aes(y=..density..), colour="black", fill="white") +
labs(x = "Standardized Residuals", y = "Density") +
stat_function(fun = dnorm,
args = list(mean = mean(mtcars.add$standardized.residuals, na.rm = TRUE),
sd = sd(mtcars.add$standardized.residuals, na.rm = TRUE)),
colour = "red", size = 1)
qqplot.resid <- qplot(sample = mtcars.add$standardized.residuals, stat="qq") +
labs(x = "Theoretical Values", y = "Observed Values") # qqplot.resid
scatter <- ggplot(mtcars.add, aes(fitted, standardized.residuals)) +
geom_point() + geom_smooth(method = "lm", colour = "Red") +
labs(x = "Fitted Values", y = "Studentized Residual")
grid.arrange(histogram.1, qqplot.resid, scatter, nrow=1)
pairs(~ mpg + cyl + disp + hp + wt + am, data = mtcars, main="title")
fit.allpredicts <- lm(mpg ~ ., data = mtcars)
summary(fit.allpredicts)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## amman 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
mpg.step.b <- step(lm(mpg ~ ., data = mtcars), direction ="back", trace = 0 ) # trace = 0 to exclude the detail
# summary of step function:
# mpg ~ wt + qsec + am: AIC = 61.31
# mpg ~ hp + wt + qsec + am: AIC = 61.52
# mpg ~ disp + hp + wt + qsec + am: AIC = 62.16
# step leads to wt, qsec and am
mpg.hp <- lm(mpg ~ hp, data = mtcars)
summary(mpg.hp)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
mpg.wt.qsec.am <- lm(mpg ~ wt + qsec + am, data=mtcars)
summary(mpg.wt.qsec.am)$adj.r.squared
## [1] 0.8335561
mpg.wt.qsec.am.hp <- lm(mpg ~ wt + qsec + am + hp, data=mtcars)
summary(mpg.wt.qsec.am.hp)$adj.r.squared
## [1] 0.8367919
mpg.wt.qsec.am.hp.disp <- lm(mpg ~ wt + qsec + am + hp + disp, data=mtcars)
summary(mpg.wt.qsec.am.hp.disp)$adj.r.squared
## [1] 0.8375334
anova(mpg.wt.qsec.am, mpg.wt.qsec.am.hp, mpg.wt.qsec.am.hp.disp)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ wt + qsec + am + hp
## Model 3: mpg ~ wt + qsec + am + hp + disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 169.29
## 2 27 160.07 1 9.2195 1.5622 0.2225
## 3 26 153.44 1 6.6287 1.1232 0.2990
mpg.interact0 <- lm(mpg ~ wt + qsec + am, data=mtcars)
mpg.interact2 <- lm(mpg ~ wt + qsec + am + am*wt, data=mtcars)
mpg.interact3 <- lm(mpg ~ wt + qsec + am + am*wt + am*qsec, data=mtcars)
mpg.interact4 <- lm(mpg ~ wt + qsec + am + am*wt + am*qsec + wt*qsec, data=mtcars)
anova(mpg.interact0, mpg.interact2, mpg.interact3, mpg.interact4)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ wt + qsec + am + am * wt
## Model 3: mpg ~ wt + qsec + am + am * wt + am * qsec
## Model 4: mpg ~ wt + qsec + am + am * wt + am * qsec + wt * qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 169.29
## 2 27 117.28 1 52.010 11.9312 0.00198 **
## 3 26 116.47 1 0.802 0.1841 0.67159
## 4 25 108.98 1 7.496 1.7196 0.20166
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# mpg.interact1.b <- lm(mpg ~ qsec + am*wt, data=mtcars) Per Brian, asterick (*) includes predictors!
summary(mpg.interact0)$adj.r.squared
## [1] 0.8335561
summary(mpg.interact2)$adj.r.squared #This model has the highest adjusted R^2
## [1] 0.8804219
summary(mpg.interact3)$adj.r.squared
## [1] 0.8766723
summary(mpg.interact4)$adj.r.squared
## [1] 0.8799939
## Loading required package: GGally
par(mfrow=c(2,2))
plot(mpg.interact2)