Motor Trend Car Road Tests - Effects of Transmission Types on MPG

Author: Ken Ho

Synopsis

In this project, we analyze the mtcars dataset and explore the linear relationship between a set of variables and miles per gallon.

The main objectives of this research are as follows:

Is an automatic or manual transmission better for MPG?
How large is the difference in MPG between automatic and manual transmissions?

The results of this analysis are:

Manual transmission is better than automatic transmission for MPG.
Manual transmission is 2.94 mpg more fuel efficient than automatic transmission while holding other regressors constant.

Exploratory Data Analyses

t <- t.test(mpg ~ am, data = mtcars, paired = FALSE, var.equal = FALSE)
t

## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

The exploratory data analysis shows that:

The t-test gives a p-value of 1.374e-03 (< 5%) and a 95% confidence interval of (-11.28, -3.21) for mean(automatic) - mean(manual), so we reject the null hypothesis that the mean MPG is the same for the two transmission types.
The boxplot in the appendix below shows that manual transmission has better MPG than automatic transmission.
The pairs plot in the appendix below shows that several variables are highly correlated with mpg.
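
The correlations that the pairs plot displays can also be read off numerically. The following is a minimal sketch using only base R and the built-in mtcars data; the name `mpg_cor` is introduced here for illustration and is not part of the original analysis.

```r
# Correlation of every other variable with mpg, ordered by absolute strength
mpg_cor <- cor(mtcars)[, "mpg"]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]           # drop mpg's correlation with itself
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_cor, 2)
```

Variables near the top of this vector are natural candidates for inclusion in the regression models that follow.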

Model Selection

Single variable linear regression model - Model #1:

model_1 = lm(mpg ~ am, data = mtcars)
summary(model_1)$coef

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

summary(model_1)$adj.r.squared

## [1] 0.3384589

Multivariable linear regression model (Stepwise Regression) - Model #2:

fitAll <- lm(mpg ~ . , data = mtcars)
model_2 <- step(fitAll, direction = "both")    # stepwise regression

summary(model_2)$coef

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## am           2.935837  1.4109045  2.080819 4.671551e-02

summary(model_2)$adj.r.squared

## [1] 0.8335561

Model #2, as selected by the stepwise procedure above, keeps “wt” and “qsec” as adjustment variables (potential confounders) alongside “am”, the predictor of interest.
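
To check which terms the stepwise search retains without printing the full trace, a sketch along the following lines could be used. The names `full_fit` and `step_fit` are illustrative, and `trace = 0` only suppresses the step-by-step log.

```r
# Fit the full model, then let step() add/drop terms by AIC in both directions
full_fit <- lm(mpg ~ ., data = mtcars)
step_fit <- step(full_fit, direction = "both", trace = 0)

# Terms kept in the selected model
attr(terms(step_fit), "term.labels")
```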

Compare the Adjusted R-squared values of the two models:

##         Adjusted R-squared
## model_1             0.3385
## model_2             0.8336
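
The code that produced this table is not shown in the report. One way such a comparison could be built is sketched below; the data frame name `adj_r2` is an assumption, and both models are refitted here so the snippet runs on its own.

```r
# Refit both models so the snippet is self-contained
model_1 <- lm(mpg ~ am, data = mtcars)
model_2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Collect the adjusted R-squared values, rounded to 4 decimal places
adj_r2 <- data.frame(
  "Adjusted R-squared" = round(c(summary(model_1)$adj.r.squared,
                                 summary(model_2)$adj.r.squared), 4),
  row.names   = c("model_1", "model_2"),
  check.names = FALSE
)
adj_r2
```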

Now, let’s look at the Analysis of Variance Table of the models:

varTbl <- anova(model_1, model_2)
varTbl

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, the VIF (variance inflation factors) of model #2:

library(car)    # provides vif()
vif(model_2)

##       wt     qsec       am 
## 2.482952 1.364339 2.541437

Findings:

Model #2 has an Adjusted R-squared of 0.8336; that is, 83.36% of the variation in the response variable is explained by this linear model, compared with 33.85% for model #1.
The Analysis of Variance Table gives a P-value of 1.55e-09 (< 5%) for model #2 against model #1, which indicates that adding “wt” and “qsec” significantly improves the fit.
The VIFs (variance inflation factors) of model #2 are all below 3, which indicates no serious multicollinearity. Hence model #2 is chosen as the regression model of this analysis.
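
For readers who want to see what a variance inflation factor measures, it can be reproduced in base R: each predictor is regressed on the remaining predictors, and the VIF is 1 / (1 - R²) of that auxiliary regression. This is an illustrative sketch, not the method used in the report; `predictors`, `manual_vif` and `aux_fit` are hypothetical names.

```r
# VIF of a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on the other predictors in the model
predictors <- c("wt", "qsec", "am")

manual_vif <- sapply(predictors, function(p) {
  others  <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

round(manual_vif, 3)
```

For a model with only single-coefficient terms such as this one, these values should agree with the vif() output above.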

Residual Plot and Diagnostics

Refer to the Residual Plots and Diagnostics sections in the Appendix below for more details.

Findings:

No systematic patterns or large outlying observations found in the Residual Plots.
There are 2-3 observations that show low leverage but have fairly high influence.
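
To identify the observations behind these findings by name rather than by eye, the diagnostic measures can be tabulated. The sketch below assumes the same model formula as model #2; the table name `diag_tbl` and the choice to show the top three rows are illustrative.

```r
# Refit model #2 so the snippet is self-contained
model_2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# One row per car: leverage, Cook's distance and studentized residual
diag_tbl <- data.frame(
  leverage = hatvalues(model_2),
  cooks_d  = cooks.distance(model_2),
  rstudent = rstudent(model_2)
)

# The three observations with the largest Cook's distance
head(diag_tbl[order(diag_tbl$cooks_d, decreasing = TRUE), ], 3)

# Average leverage equals (number of coefficients) / (number of observations),
# a common reference point when judging whether a leverage value is high
mean(diag_tbl$leverage)
```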

Interpretation of Coefficients

For meaningful interpretation of coefficients, we applied centering on both “wt” and “qsec” variables:

fitCentered <- lm(mpg ~ I(wt - mean(wt)) + I(qsec - mean(qsec)) + factor(am), 
                  data = mtcars)
summary(fitCentered)$coef

##                       Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)          18.897941  0.7193542 26.270704 2.855851e-21
## I(wt - mean(wt))     -3.916504  0.7112016 -5.506882 6.952711e-06
## I(qsec - mean(qsec))  1.225886  0.2886696  4.246676 2.161737e-04
## factor(am)1           2.935837  1.4109045  2.080819 4.671551e-02

Interpretation:

18.90 mpg is the expected fuel efficiency of a car with average weight, average quarter mile time, and automatic transmission (am = 0).
3.92 mpg less fuel efficiency with every 1,000 lb of weight increase, while holding other regressors constant.
1.23 mpg more fuel efficiency with every second of quarter mile time increase, while holding other regressors constant.
2.94 mpg more fuel efficiency with manual transmission compared to automatic transmission, while holding other regressors constant.
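
These interpretations can be checked with predictions for a car of average weight and average quarter mile time under each transmission type. This is a sketch using the uncentered model formula, which gives the same fitted values; `avg_car` and `pred` are illustrative names.

```r
# Refit model #2 so the snippet is self-contained
model_2 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Two hypothetical cars at the sample means of wt and qsec,
# one automatic (am = 0) and one manual (am = 1)
avg_car <- data.frame(wt   = mean(mtcars$wt),
                      qsec = mean(mtcars$qsec),
                      am   = c(0, 1))

pred <- predict(model_2, newdata = avg_car)
names(pred) <- c("automatic", "manual")
pred

# The gap between the two predictions is the "am" coefficient
unname(pred["manual"] - pred["automatic"])
```

The automatic prediction corresponds to the intercept of the centered model, and the difference corresponds to the factor(am)1 estimate.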

Quantification of the Uncertainty

Below are the 95% confidence intervals for the intercept and the predictors:

confint(fitCentered)

##                            2.5 %    97.5 %
## (Intercept)          17.42441087 20.371471
## I(wt - mean(wt))     -5.37333423 -2.459673
## I(qsec - mean(qsec))  0.63457320  1.817199
## factor(am)1           0.04573031  5.825944
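
Each interval is the estimate plus or minus a t quantile times the standard error. As a worked check for the transmission coefficient, the sketch below recomputes that interval by hand; it uses the uncentered formula (the "am" estimate and standard error are unchanged by centering), and the names `fit`, `est`, `se`, `t_crit` and `am_ci` are illustrative.

```r
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Estimate and standard error of the transmission coefficient
est <- coef(summary(fit))["am", "Estimate"]
se  <- coef(summary(fit))["am", "Std. Error"]

# Two-sided 95% critical value on the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

am_ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
am_ci
```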

Conclusions

The multivariable regression model was chosen as the regression model of this analysis because it explains more of the response variable variation.
Manual transmission is better than automatic transmission for MPG.
Manual transmission is 2.94 mpg more fuel efficient than automatic transmission while holding other regressors constant.

Appendix

Exploratory Data Analysis

Boxplot

library(ggplot2)    # provides ggplot() and geom_boxplot()
mtcars2 <- mtcars
mtcars2$txType <- factor(mtcars$am, labels = c("Automatic","Manual"))
ggplot(mtcars2, aes(x = txType, y = mpg, fill = txType)) +
  geom_boxplot() + 
  labs(title = "Miles Per Gallon by Transmission Type", 
       x = "Transmission Type", 
       y = "Miles Per Gallon") +
  scale_fill_discrete(name = "Transmission")

Pairs Plot

library(GGally)    # provides ggpairs()
g <- ggpairs(mtcars, lower = list(continuous = "smooth"))
g

Residual Plots

par(mfrow = c(2, 2))
plot(model_2)

Diagnostics

par(mfrow = c(1, 2))
# Leverage
plot(hatvalues(model_2), main="Leverage")
# Influence
#plot(rstandard(model_2))
plot(rstudent(model_2), main="Studentized Residuals")

par(mfrow = c(1, 2))
plot(dffits(model_2), main="Influence - dffits")
plot(cooks.distance(model_2), main="Influence - Cook's Distance")

par(mfrow = c(1, 2))
plot(dfbetas(model_2)[, 2], main="Influence - dfbetas of \"wt\"")
plot(dfbetas(model_2)[, 3], main="Influence - dfbetas of \"qsec\"")