Executive Summary

This project explores how miles per gallon (MPG) varies across a collection of cars (R’s mtcars dataset). In particular, we are interested in two questions:

Is an automatic or manual transmission better for MPG?
How large is the MPG difference between automatic and manual transmissions?

Exploratory Data Analysis

First, let’s compare mpg for automatic and manual transmissions with a violin plot (the code to generate this plot is in the appendix):

[Figure: violin plot of miles per gallon by type of transmission (automatic vs. manual)]

The violin plot suggests that cars with manual transmission get more miles per gallon. The following sections test this formally using regression analysis with standard linear models.

Linear Model Fit

Since the question of interest is how mpg differs across the two types of transmission, one natural strategy is to regress mpg on the binary variable am (mpg ~ am).

Since cyl, disp, hp, and wt are the four variables most strongly correlated with mpg (in absolute value; see appendix), we’ll also fit a second model that includes cyl, disp, hp, and wt, separately for each transmission type (we’re limited to two models due to space constraints).

Model 1: mpg ~ am

Our model is of the form \[ Y_i = \beta_0 + \beta_1 X_{i} + \epsilon_i \]

with \(Y_i\) as the mpg of car \(i\) and \(X_i\) as a binary transmission indicator, where

0 represents automatic transmission and
1 represents manual transmission.

Then for manual transmission: \[ E[Y_i | X_i = 1] = \beta_0 + \beta_1 \]

and for automatic transmission: \[ E[Y_i | X_i = 0] = \beta_0 \]

Therefore, \(\beta_1 = E[Y_i | X_i = 1] - E[Y_i | X_i = 0]\) is interpreted as the increase or decrease in the mean/expected value of the mpg when comparing manual transmission to automatic transmission; \(\beta_0\) is interpreted as the mean mpg for automatic transmission, while \(\beta_0 + \beta_1\) is the mean mpg for manual transmission.

Let’s now fit a linear model with mpg as the outcome and am as the predictor to estimate the intercept \(\beta_0\) and slope \(\beta_1\).

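# Treat am as a categorical variable (0 = automatic, 1 = manual)
mtcars$am <- factor(mtcars$am)
# Regress mpg on transmission type
fit <- lm(mpg ~ am, data = mtcars)

summary(fit)$coef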

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am1          7.244939   1.764422  4.106127 2.850207e-04

Our model estimates that manual-transmission cars average 7.24 mpg more than automatic ones (about 24.39 vs. 17.15 mpg). The slope is statistically significant, with a p-value of \(2.85 \times 10^{-4}\): we can confidently reject the null hypothesis (that the two groups have the same mean mpg), and the positive estimate points to the alternative that manual transmission has higher mpg.
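As a quick check on this interpretation, the fitted coefficients can be compared with the raw group means. The sketch below is illustrative rather than part of the original analysis; the names `dat`, `fit_check`, `group_means`, and `mean_diff` are introduced here, and a fresh copy of the data is used so the check does not depend on earlier steps. It also calls `confint()` to put an interval around the difference.

```r
# Illustrative check (not part of the original analysis): with a single binary
# predictor, the lm() coefficients are the group means and their difference.
dat <- datasets::mtcars             # fresh copy; `dat` is a name introduced here
dat$am <- factor(dat$am)            # 0 = automatic, 1 = manual
fit_check <- lm(mpg ~ am, data = dat)

# Mean mpg per transmission type, and the manual-minus-automatic gap
group_means <- tapply(dat$mpg, dat$am, mean)
mean_diff <- unname(group_means["1"] - group_means["0"])

# Intercept should match the automatic mean; slope should match the gap
all.equal(unname(coef(fit_check)["(Intercept)"]), unname(group_means["0"]))
all.equal(unname(coef(fit_check)["am1"]), mean_diff)

# 95% confidence interval for the mpg difference (row "am1")
confint(fit_check)["am1", ]
```

Both `all.equal()` calls should return TRUE, which is the sense in which \(\beta_0\) is the automatic mean and \(\beta_1\) the difference in means.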

Residual plot and diagnostics are presented in the appendix.

Model 2: mpg ~ cyl + disp + hp + wt

We’ll fit this model separately for the automatic and manual cars:

mtcars0 <- mtcars[mtcars$am == 0,]
mtcars1 <- mtcars[mtcars$am == 1,]
lm0 <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars0)
lm1 <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars1)

The coefficients are presented in the appendix, but we’re more interested in the predicted values across the two groups.

predicted <- data.frame(y = c(predict(lm0), predict(lm1)), x = rep(c(0, 1), c(nrow(mtcars0), nrow(mtcars1))))

plot(y ~ factor(x), predicted)

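As we can see, the fitted mpg values for manual transmission (am = 1) are clearly higher than those for automatic transmission (am = 0). Note that this compares the two groups as observed, not at equal values of cyl, disp, hp, and wt.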
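To put numbers behind the plot, the fitted values can be summarized per group. This is a minimal sketch, not part of the original analysis: it refits the two models on a fresh copy of the data, and the names `dat`, `auto`, `manual`, `fit_auto`, and `fit_manual` are introduced here for illustration. Because each model includes an intercept, the mean of its fitted values equals the observed mean mpg of that group.

```r
# Illustrative sketch: numeric summary behind the plot of fitted values.
dat <- datasets::mtcars                 # fresh copy; `dat` is introduced here
auto <- dat[dat$am == 0, ]              # automatic cars
manual <- dat[dat$am == 1, ]            # manual cars
fit_auto <- lm(mpg ~ cyl + disp + hp + wt, data = auto)
fit_manual <- lm(mpg ~ cyl + disp + hp + wt, data = manual)

# Mean fitted mpg and mean observed mpg per group
fitted_means <- c(automatic = mean(fitted(fit_auto)),
                  manual = mean(fitted(fit_manual)))
observed_means <- c(automatic = mean(auto$mpg),
                    manual = mean(manual$mpg))

fitted_means
# Least-squares residuals sum to zero when the model has an intercept,
# so the two sets of means agree
all.equal(fitted_means, observed_means)
```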

Conclusion

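To answer our opening questions, we conclude that manual transmission has better (higher) mpg, and this is supported by our standard linear model: under mpg ~ am, manual cars average about 7.24 mpg more than automatic cars, a statistically significant difference (p-value of \(2.85 \times 10^{-4}\)).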

Appendix

Details on the mtcars dataset: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

Code for violin plot in exploratory data analysis section

library(ggplot2); data(mtcars)

# Violin plot of mpg by transmission type (am: 0 = automatic, 1 = manual)
g <- ggplot(data = mtcars, aes(y = mpg, x = factor(am, labels = c("automatic", "manual")), group = factor(am, labels = c("automatic", "manual")), fill = factor(am)))
g <- g + geom_violin(alpha = .5)
g <- g + xlab("Type of transmission") + ylab("Miles per gallon")
g <- g + scale_fill_discrete(name = "Transmission type", labels = c("automatic", "manual"))
g

Residual Plot and Diagnostics for our binary model (mpg ~ am)

par(mfrow = c(2,2))
plot(predict(fit), resid(fit))
plot(mtcars$am, dfbetas(fit)[,2], xlab = "Transmission type", ylab = "dfbetas")
hist(hatvalues(fit))
plot(hatvalues(fit), rstandard(fit))

As X is binary, the fitted values take only two values (one per transmission type). There does not seem to be a systematic pattern in the residuals, the dfbetas values for the slope are small compared to the estimate, and there aren’t any unusual hat values.
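One reason the hat values are unremarkable is that, in a model with a single binary predictor, each car’s leverage is simply one over the size of its group. The sketch below checks this on a fresh copy of the data; it is an illustration rather than part of the original analysis, and the names `dat`, `fit_check`, `n_per_group`, and `expected_hat` are introduced here.

```r
# Illustrative check: leverage in a one-factor model is 1 / (group size).
dat <- datasets::mtcars                 # fresh copy; `dat` is introduced here
dat$am <- factor(dat$am)                # 0 = automatic, 1 = manual
fit_check <- lm(mpg ~ am, data = dat)

# Number of cars per transmission type (19 automatic, 13 manual)
n_per_group <- table(dat$am)

# Expected leverage for each car: one over the size of its own group
expected_hat <- 1 / as.numeric(n_per_group[as.character(dat$am)])

# Compare with the hat values computed from the fitted model
all.equal(unname(hatvalues(fit_check)), expected_hat)
```

Only two distinct hat values can occur here, so no single car can stand out as a high-leverage point.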

Correlation between mpg and all variables

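# Use the original numeric data: am was converted to a factor above, and cor() needs numeric columns
cor(datasets::mtcars)["mpg", ]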

##        mpg        cyl       disp         hp       drat         wt 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594 
##       qsec         vs         am       gear       carb 
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251
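The choice of cyl, disp, hp, and wt can be reproduced by ranking the variables by the absolute value of their correlation with mpg. This is a minimal sketch; the names `cors`, `ranked`, and `top4` are introduced here for illustration.

```r
# Illustrative sketch: rank the other variables by |correlation| with mpg.
cors <- cor(datasets::mtcars)["mpg", ]   # original numeric data
cors <- cors[names(cors) != "mpg"]       # drop mpg's correlation with itself
ranked <- sort(abs(cors), decreasing = TRUE)

# Names of the four variables most strongly correlated with mpg
top4 <- names(ranked)[1:4]
top4
# "wt" "cyl" "disp" "hp"
```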

Here are the coefficients for the fits of mpg on cyl, disp, hp, and wt, separately for am = 0 and am = 1.

#For am = 0
summary(lm0)$coefficients

##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 34.145260059 3.61582440  9.4432849 1.890058e-07
## cyl         -0.907122428 0.64057054 -1.4161164 1.786046e-01
## disp         0.006790792 0.01109120  0.6122683 5.501759e-01
## hp          -0.027079832 0.01807299 -1.4983594 1.562463e-01
## wt          -2.209608418 1.07435194 -2.0566896 5.885142e-02

#For am = 1
summary(lm1)$coefficients

##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 43.463139334 5.49303834  7.9124041 4.726584e-05
## cyl          0.180200602 1.75727337  0.1025456 9.208475e-01
## disp        -0.021635960 0.03396445 -0.6370179 5.419087e-01
## hp           0.001261798 0.02765118  0.0456327 9.647215e-01
## wt          -7.067741384 2.67848077 -2.6387128 2.977276e-02

Regression Models - Course Project

Kevin Siswandi
