This project explores how miles per gallon (MPG) varies across a collection of cars, using R’s mtcars dataset. In particular, we are interested in two questions:
First, let’s compare mpg for automatic and manual transmissions with a violin plot (the code to generate this plot is in the appendix):
The violin plot suggests that cars with manual transmission get more miles per gallon. In the subsequent sections, we test this formally with regression analysis using standard linear models.
Since the question of interest is how mpg differs across the two types of transmission, one possible strategy is to construct a binary model of mpg ~ am.
Since cyl, disp, hp, and wt are the four variables most strongly correlated with mpg in absolute value (see appendix), we’ll also construct a second model, mpg ~ cyl + disp + hp + wt, fitted separately for each of the two transmission types (we’re limited to two models due to space constraints).
Model 1: mpg ~ am
Our model is of the form \[ Y_i = \beta_0 + \beta_1 X_{i} + \epsilon_i \]
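with \(Y_i\) as the mpg of car \(i\) and \(X_i\) as a binary variable, where \(X_i = 1\) for manual transmission and \(X_i = 0\) for automatic transmission.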
Then for manual transmission: \[ E[Y_i] = \beta_0 + \beta_1 \]
and for automatic transmission: \[ E[Y_i] = \beta_0 \]
Therefore, \(\beta_1 = E[Y_i | X_i = 1] - E[Y_i | X_i = 0]\) is interpreted as the increase or decrease in the mean/expected value of the mpg when comparing manual transmission to automatic transmission; \(\beta_0\) is interpreted as the mean mpg for automatic transmission, while \(\beta_0 + \beta_1\) is the mean mpg for manual transmission.
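As a quick sanity check on this interpretation, here is a minimal sketch that computes the two sample group means directly. With a single binary predictor, the least-squares estimates of \(\beta_0\) and \(\beta_1\) should coincide with the automatic-group mean and the difference in group means. The names `cars`, `group_means`, `beta0_hat`, and `beta1_hat` are illustrative and not part of the original analysis.

```r
# Work on a fresh copy of the built-in data so nothing else is modified.
cars <- datasets::mtcars

# Mean mpg per transmission type (am: 0 = automatic, 1 = manual).
group_means <- tapply(cars$mpg, cars$am, mean)

beta0_hat <- group_means[["0"]]                       # mean mpg, automatic
beta1_hat <- group_means[["1"]] - group_means[["0"]]  # manual minus automatic

print(round(c(beta0 = beta0_hat, beta1 = beta1_hat), 3))
```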
Let’s now fit a linear model with mpg as the outcome and am as the predictor to find the intercept \(\beta_0\) and slope \(\beta_1\).
mtcars$am <- factor(mtcars$am)
fit <- lm(mpg ~ am, mtcars)
summary(fit)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am1          7.244939   1.764422  4.106127 2.850207e-04
Our model estimates that cars with manual transmission average about 7.24 mpg more than cars with automatic transmission. The slope is statistically significant: its p-value is \(2.85 \times 10^{-4}\), so we reject the null hypothesis (that the two groups have the same mean mpg) at any conventional significance level, and the positive sign of the estimate points to higher mpg for manual transmission.
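The test on the slope can be cross-checked against a two-sample t-test. The sketch below assumes equal variances in the two groups (`var.equal = TRUE`), which is the assumption the simple linear model makes, and also prints a 95% confidence interval for the difference. The names `cars`, `fit_check`, `tt`, and `slope_p` are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)

# Same binary model as in the text, refitted on a fresh copy of the data.
fit_check <- lm(mpg ~ am, data = cars)
slope_p <- summary(fit_check)$coef["am1", "Pr(>|t|)"]

# Pooled-variance two-sample t-test of mpg by transmission type.
tt <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

# The two p-values should agree up to floating-point error.
print(c(lm = slope_p, t_test = tt$p.value))

# 95% confidence interval for the manual-minus-automatic difference.
print(confint(fit_check)["am1", ])
```

Note that `t.test` reports the automatic-minus-manual difference, so its t statistic has the opposite sign from the slope's t value.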
Residual plot and diagnostics are presented in the appendix.
Model 2: mpg ~ cyl + disp + hp + wt
We’ll fit this model briefly, once for automatic and once for manual transmission:
mtcars0 <- mtcars[mtcars$am == 0,]
mtcars1 <- mtcars[mtcars$am == 1,]
lm0 <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars0)
lm1 <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars1)
The coefficients are presented in the appendix, but we’re more interested in the predicted values across the two groups.
predicted <- data.frame(y = c(predict(lm0), predict(lm1)), x = rep(c(0, 1), c(nrow(mtcars0), nrow(mtcars1))))
plot(y ~ factor(x), predicted)
As the box plot shows, the fitted mpg values for manual transmission (am = 1) are clearly higher than those for automatic transmission (am = 0).
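To put a number on the gap between the two sets of fitted values, the following sketch averages them within each group. One point to keep in mind: because each per-group model includes an intercept, its fitted values average to that group's observed mean mpg, so the gap computed here equals the raw difference in group means rather than a difference adjusted for cyl, disp, hp, and wt. The names `cars`, `fit_auto`, `fit_manual`, and `fitted_means` are illustrative.

```r
cars <- datasets::mtcars

# Refit the per-group models on a fresh copy of the data.
fit_auto   <- lm(mpg ~ cyl + disp + hp + wt, data = cars[cars$am == 0, ])
fit_manual <- lm(mpg ~ cyl + disp + hp + wt, data = cars[cars$am == 1, ])

# Average fitted mpg in each group.
fitted_means <- c(automatic = mean(predict(fit_auto)),
                  manual    = mean(predict(fit_manual)))
print(round(fitted_means, 2))

# Gap between the groups (manual minus automatic).
print(round(fitted_means[["manual"]] - fitted_means[["automatic"]], 2))
```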
To answer our opening question, we conclude that manual transmission has better (higher) mpg, by about 7.24 mpg on average in the binary model, and this is supported by our standard linear models.
Details on the mtcars dataset: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
Code for violin plot in exploratory data analysis section
library(ggplot2); data(mtcars)
g = ggplot(data = mtcars,
           aes(y = mpg,
               x = factor(am, labels = c("automatic", "manual")),
               group = factor(am, labels = c("automatic", "manual")),
               fill = factor(am)))
g = g + geom_violin(alpha = .5)
g = g + xlab("Type of transmission") + ylab("Miles per gallon")
g = g + scale_fill_discrete(name = "Transmission type", labels=c("automatic", "manual"))
g
Residual Plot and Diagnostics for our binary model (mpg ~ am)
par(mfrow = c(2,2))
plot(predict(fit), resid(fit))
plot(mtcars$am, dfbetas(fit)[,2], xlab = "Transmission type", ylab = "dfbetas")
hist(hatvalues(fit))
plot(hatvalues(fit), rstandard(fit))
Because X is binary, the fitted values of Y take only two values. The residuals show no suspicious pattern, the dfbetas values for the slope are small compared to the estimate, and there are no unusual hat values.
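The hat values can also be checked numerically. In a model with a single binary predictor, every observation in a group has leverage equal to one over that group's size, so no car can stand out within its group. The sketch below illustrates this; the names `cars`, `fit_check`, and `leverage` are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am)
fit_check <- lm(mpg ~ am, data = cars)

# Group sizes: number of automatic (0) and manual (1) cars.
print(table(cars$am))

# Average leverage per group; should match 1 / group size.
leverage <- tapply(hatvalues(fit_check), cars$am, mean)
print(round(leverage, 4))

# Largest absolute dfbetas for each coefficient.
print(round(apply(abs(dfbetas(fit_check)), 2, max), 3))
```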
Correlation between mpg and all variables
cor(mtcars)[1,]
##        mpg        cyl       disp         hp       drat         wt
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
##       qsec         vs         am       gear       carb
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251
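The choice of cyl, disp, hp, and wt can be read off these correlations by ranking them in absolute value. A minimal sketch, with `cars`, `mpg_cor`, and `ranked` as illustrative names:

```r
cars <- datasets::mtcars

# Correlation of mpg with every other variable (drop mpg itself).
mpg_cor <- cor(cars)["mpg", -1]

# Rank by strength of association, ignoring sign.
ranked <- sort(abs(mpg_cor), decreasing = TRUE)
print(round(ranked[1:4], 3))
```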
Here are the coefficients for the model of mpg on cyl, disp, hp, and wt, fitted separately for am = 0 and am = 1.
#For am = 0
summary(lm0)$coefficients
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 34.145260059 3.61582440  9.4432849 1.890058e-07
## cyl         -0.907122428 0.64057054 -1.4161164 1.786046e-01
## disp         0.006790792 0.01109120  0.6122683 5.501759e-01
## hp          -0.027079832 0.01807299 -1.4983594 1.562463e-01
## wt          -2.209608418 1.07435194 -2.0566896 5.885142e-02
#For am = 1
summary(lm1)$coefficients
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 43.463139334 5.49303834  7.9124041 4.726584e-05
## cyl          0.180200602 1.75727337  0.1025456 9.208475e-01
## disp        -0.021635960 0.03396445 -0.6370179 5.419087e-01
## hp           0.001261798 0.02765118  0.0456327 9.647215e-01
## wt          -7.067741384 2.67848077 -2.6387128 2.977276e-02