Assignment

I work for Motor Trend (not really) and was asked (i) if “automatic or manual transmission better for MPG” and (ii) to “quantify the MPG difference between automatic and manual transmissions.” Please note: I am trying to learn ggplot2 so those packages are retrieved. The corresponding base graphics can be found in Appendix A-4.

Executive summary

Superficially, a manual transmission is better for MPG (yes, it makes a difference). A naive regression the manual increases fuel economy by ~ 7 MPG (R.adj^2 = 0.33). But when we account for confounding variables (and interactions), the adj. R^2 increases to 0.88 but with nuance. Specifically, weight interacts. For lighter cars (~ 2,000 lbs), the manual adds ~ 5 to 7 MPG but, as the car’s weight increases, the impact of the manual decreases. At a certain point, for heavy cars, the weight overwhelms the influence of transmission.

Exploratory data analyses

The dataframe only contains 32 (old) observations. The key predictor is (am), a Bernoulli variable (0 or 1) that is obviously categorical. I was curious if as.factor() is necessary, but apparently it does not matter; see my test @ http://rpubs.com/bionicturtle/test-am-factor. The violin chart superfically suggests a manual (am = 1) is associated with better fuel economy (higher MPG).

library(ggplot2); data(mtcars) #dataframe with 32 obs of 11 variables
mtcars$am <- as.factor(mtcars$am); levels(mtcars$am) <- c("auto", "man")
avg.a <- mean(mtcars$mpg[mtcars$am=="auto"]); avg.m <- mean(mtcars$mpg[mtcars$am=="man"])
p <- ggplot(mtcars, aes(x = am, y = mpg)); p + geom_violin() + 
  geom_boxplot(width = .1, fill = "black", outlier.color=NA) +
  stat_summary(fun.y = median, geom="point", fill="white", shape=21, size=5.0)

On this naive look, the averages differ:

Among automatic transmissions, average MPG is 17.1473684
Among manual transmissions, average MPG is 24.3923077

And a t-test virtually confirms a difference in the averages (based on assumptions):

mean.diff <- t.test(mpg ~ am, data=mtcars)

Specifically, the p-value (aka, exact significance level) is 0.0013736. For a desired confidence level of 0.95, the confidence interval is -11.2801944 to -3.2096842, which easily excludes zero.

Lazy regressions

Per Brian Coffo’s lectures, I really love the simplicity of using pairs(.) to generate a matrix of scatterplots just to visually notice some relationships; see Appendix A-1. The obvious first try is ln(mpg ~ am):

fit.0.mpg.am <- lm(mpg ~ am, data = mtcars)
summary(fit.0.mpg.am)$coefficients; simple.r.sqr <- summary(fit.0.mpg.am)$adj.r.squared

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amman        7.244939   1.764422  4.106127 2.850207e-04

For this model, the intercept is conveniently equal to the average MPG among automatic transmissions; i.e., when am = 0. Similarly, the sum of the intercept and slope coefficients is equal to the average MPG among manual transmissions; i.e., when am = 1. But the adjusted R^2 for this model is only 0.3384589.

Step-wise regression

So I used step(.) for the first time, which is incredibly powerful! See Appendix A-2. As I preferred a purely statistcal approach (as opposed to a fundamental hypothesis), I did the following:

Ran a “backwards” step-wise regression and selected the top three models; i.e., the three lowest AICs measures.
Based on ANOVA, I retained only three predictors (wt, qsec, am) as adding (+hp) and (+disp) did not statistically add value
Based on adjusted R^2, I retained only one interaction, am*wt
In this way, my preferred model is MPG = intercept + wt + qsec + am + am*wt

# Best predictors are: wt, qsec, am; With interactions, highest adjusted R^2 is interact2: am*wt
mpg.interact2 <- lm(mpg ~ wt + qsec + am + am*wt, data=mtcars)
summary(mpg.interact2)$coefficients

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.723053  5.8990407  1.648243 0.1108925394
## wt          -2.936531  0.6660253 -4.409038 0.0001488947
## qsec         1.016974  0.2520152  4.035366 0.0004030165
## amman       14.079428  3.4352512  4.098515 0.0003408693
## wt:amman    -4.141376  1.1968119 -3.460340 0.0018085763

best.r <- summary(mpg.interact2)$adj.r.squared
summary(mpg.interact2)$adj.r.squared #This model has the highest adjusted R^2

## [1] 0.8804219

The adjusted R^2 for our preferred model is 0.8804219.

Caveat

We might still be careful to notice relationships illustrated in the Appendix A-3. Specifically,

mpg is correlated (+0.60) to 1/4 mile time, and
mpg is negatively correlated (-0.868) to Weight

Interpretation of coefficients

Let’s look at the model (with three predictors) both with and without the interaction:

mpg.interact2.wo <- lm(mpg ~ wt + qsec + am, data=mtcars)
mpg.interact2 <- lm(mpg ~ wt + qsec + am + am*wt, data=mtcars)
mpg.interact2.wo$coefficients

## (Intercept)          wt        qsec       amman 
##    9.617781   -3.916504    1.225886    2.935837

mpg.interact2$coefficients

## (Intercept)          wt        qsec       amman    wt:amman 
##    9.723053   -2.936531    1.016974   14.079428   -4.141376

# key coefficients
beta <- mpg.interact2.wo[[1]]["amman"]
beta2 <- mpg.interact2[[1]]["amman"]
beta2.i <- mpg.interact2[[1]]["wt:amman"]

The three-predictor model, without interactions, suggests that a manual transmission adds 2.9358372 MPG. Our preferred model, however, includes the interaction. With the interaction, the effect of the manual transmision is given by: coefficient[amman] + coefficient[wt:amman]*wt. If our model is correctly specified (a huge assumption), we can observe that the impact of a manual transmission on MPG is very sensitive to the car’s weight. For example,

At 1,500 lbs (approx min of dataset), we might expect a manual transmission to add 7.8673632 MPG
At 3,325 lbs (approx median of dataset), we might expect a manual transmission to add 0.3093512 MPG
At 5,400 lbs (approx max of dataset), we might expect a manual transmission to add -8.2840049 MPG

This suggests that, at least at lower weight classes, a manual transmission improves MPG up to +7.0. But, as weight increases, the difference attenuates above a certain weight, weight overwhelms the transmission’s impact.

Checking residuals

Finally, I check the assumptions concerning the residuals:

mtcars.add <- mtcars
mtcars.add$residuals<-resid(mpg.interact2)
mtcars.add$standardized.residuals <- rstandard(mpg.interact2)
mtcars.add$fitted <- mpg.interact2$fitted.values

require(gridExtra)
histogram.1 <- ggplot(mtcars.add, aes(standardized.residuals)) + 
  theme(legend.position = "none") + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  labs(x = "Standardized Residuals", y = "Density") +
  stat_function(fun = dnorm, 
                args = list(mean = mean(mtcars.add$standardized.residuals, na.rm = TRUE),
                            sd = sd(mtcars.add$standardized.residuals, na.rm = TRUE)), 
                colour = "red", size = 1)

qqplot.resid <- qplot(sample = mtcars.add$standardized.residuals, stat="qq") + 
  labs(x = "Theoretical Values", y = "Observed Values") # qqplot.resid

scatter <- ggplot(mtcars.add, aes(fitted, standardized.residuals)) + 
  geom_point() +   geom_smooth(method = "lm", colour = "Red") +
  labs(x = "Fitted Values", y = "Studentized Residual") 

grid.arrange(histogram.1, qqplot.resid, scatter, nrow=1)

Appendix

A-1: pairwise look and “lazy” multivariate regression with all predictors

pairs(~ mpg + cyl + disp + hp + wt + am, data = mtcars, main="title")

fit.allpredicts <- lm(mpg ~ ., data = mtcars)
summary(fit.allpredicts)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## amman        2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

A-2: Step-wise regression and subsequent pursuit of highest adjusted R^2 (with interactions)

mpg.step.b <- step(lm(mpg ~ ., data = mtcars), direction ="back", trace = 0 ) # trace = 0 to exclude the detail
# summary of step function:
# mpg ~ wt + qsec + am: AIC = 61.31
# mpg ~ hp + wt + qsec + am: AIC = 61.52
# mpg ~ disp + hp + wt + qsec + am: AIC = 62.16

# step leads to wt, qsec and am
mpg.hp <- lm(mpg ~ hp, data = mtcars)
summary(mpg.hp)

## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

mpg.wt.qsec.am <- lm(mpg ~ wt + qsec + am, data=mtcars)
summary(mpg.wt.qsec.am)$adj.r.squared

## [1] 0.8335561

mpg.wt.qsec.am.hp <- lm(mpg ~ wt + qsec + am + hp, data=mtcars)
summary(mpg.wt.qsec.am.hp)$adj.r.squared

## [1] 0.8367919

mpg.wt.qsec.am.hp.disp <- lm(mpg ~ wt + qsec + am + hp + disp, data=mtcars)
summary(mpg.wt.qsec.am.hp.disp)$adj.r.squared

## [1] 0.8375334

anova(mpg.wt.qsec.am, mpg.wt.qsec.am.hp, mpg.wt.qsec.am.hp.disp)

## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ wt + qsec + am + hp
## Model 3: mpg ~ wt + qsec + am + hp + disp
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 169.29                           
## 2     27 160.07  1    9.2195 1.5622 0.2225
## 3     26 153.44  1    6.6287 1.1232 0.2990

mpg.interact0 <- lm(mpg ~ wt + qsec + am, data=mtcars)
mpg.interact2 <- lm(mpg ~ wt + qsec + am + am*wt, data=mtcars)
mpg.interact3 <- lm(mpg ~ wt + qsec + am + am*wt + am*qsec, data=mtcars)
mpg.interact4 <- lm(mpg ~ wt + qsec + am + am*wt + am*qsec + wt*qsec, data=mtcars)
anova(mpg.interact0, mpg.interact2, mpg.interact3, mpg.interact4)

## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + qsec + am
## Model 2: mpg ~ wt + qsec + am + am * wt
## Model 3: mpg ~ wt + qsec + am + am * wt + am * qsec
## Model 4: mpg ~ wt + qsec + am + am * wt + am * qsec + wt * qsec
##   Res.Df    RSS Df Sum of Sq       F  Pr(>F)   
## 1     28 169.29                                
## 2     27 117.28  1    52.010 11.9312 0.00198 **
## 3     26 116.47  1     0.802  0.1841 0.67159   
## 4     25 108.98  1     7.496  1.7196 0.20166   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# mpg.interact1.b <- lm(mpg ~ qsec + am*wt, data=mtcars) Per Brian, asterick (*) includes predictors!
summary(mpg.interact0)$adj.r.squared

## [1] 0.8335561

summary(mpg.interact2)$adj.r.squared #This model has the highest adjusted R^2

## [1] 0.8804219

summary(mpg.interact3)$adj.r.squared

## [1] 0.8766723

summary(mpg.interact4)$adj.r.squared

## [1] 0.8799939

A-3: Closer look at the three selected predictors

## Loading required package: GGally

A-4: Residual inspection (and Q-Q plot) using built-in plot()

par(mfrow=c(2,2))
plot(mpg.interact2)

John Hopkins Data Science > 6.Regression Models > Project by David Harper, CFA, FRM