Created with Knitr
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG”
"Quantify the MPG difference between automatic and manual transmissions"
We used a variety of comparison methods (t test, simple linear regression, and multivariate linear regression) to quantify the difference in mpg between automatic and manual transmissions.
We concluded that one can expect approximately 3 mpg advantage when one selects a manual transmission car vice an automatic transmission car.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
A data frame with 32 observations on 11 (numeric) variables.
library(datasets)
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
How does the transmission type affect gas mileage?
ggplot(data = mtcars, aes(as.factor(am), mpg)) +
geom_boxplot() +
theme_bw() +
xlab("Transmission Type (0 = Automatic, 1 = Manual)") +
ylab("MPG") +
labs(title="MPG by Transmission Type")
Visually, it appears that manual transmissions do have better gas mileage than automatics.
Let’s next test statistically if there is a difference between automatic and manual transmissions. We will perform a hypothesis test, with the automatic transmission as the null, and manual if we reject the null.
t.test(mpg ~ am, data = mtcars, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Since the p-value is 0.0013 (less than p < 0.005), and the confidence interval does not contain 0, we can reject the null hypothesis, restating, we can conclude that there is statistically significant difference between manual and automatic transmissions. This supports our visual estimation above.
If we continue to assess with different variables, more patterns may emerge. Whereas testing for differences in just the transmission did seem significant, we can take into account confounding variables.
Beginning with a simple linear regression:
summary(lm(mpg ~ as.factor(am), mtcars))
##
## Call:
## lm(formula = mpg ~ as.factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## as.factor(am)1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We see the \({R^2}\) value is 0.338, so this one variable linear regression model only accounts for 34% of the variance in fuel consumption.
The next question we want to ask, is what other variables correlate to gas mileage?
correlation_matrix <- round(cor(mtcars),1)
corrp.mat <- cor_pmat(mtcars)
ggcorrplot(correlation_matrix, hc.order =TRUE,
type ="lower", lab =TRUE)
It appears that number of gears, transmission, engine shape (vs), and quarter mile time (qsec) all correlate with mpg.
Can we fit a better model with multiple variables? We’ll create a
stepwise regression using the stepAIC() from the
MASS package to find the variables which result in the best
fit.
initialModel <- lm(mpg ~ ., data = mtcars)
stepReg <- stepAIC(initialModel, direction = "both")
# note: results are hidden
print(stepReg$anova)
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Final Model:
## mpg ~ wt + qsec + am
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 21 147.4944 70.89774
## 2 - cyl 1 0.07987121 22 147.5743 68.91507
## 3 - vs 1 0.26852280 23 147.8428 66.97324
## 4 - carb 1 0.68546077 24 148.5283 65.12126
## 5 - gear 1 1.56497053 25 150.0933 63.45667
## 6 - drat 1 3.34455117 26 153.4378 62.16190
## 7 - disp 1 6.62865369 27 160.0665 61.51530
## 8 - hp 1 9.21946935 28 169.2859 61.30730
The model with the best fit accounts for weight and quarter mile time in addition to transmission.
summary(bestModelFit <- lm(mpg ~ wt + qsec + as.factor(am), data = mtcars))
##
## Call:
## lm(formula = mpg ~ wt + qsec + as.factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## as.factor(am)1 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Now our \(R^2\) value is .833, so our adjusted model accounts for 83% of variance in gas mileage.
ggplot(mtcars, aes(wt, mpg, color = as.factor(am))) +
geom_point() +
geom_smooth(aes(y = predict(bestModelFit, mtcars)),method = lm) +
geom_smooth(aes(y = predict(lm(mpg ~ as.factor(am), mtcars))), method = lm) +
theme_bw() +
labs(title = "Best Fit Multivariate Model vs Simple Linear Regression Model \n by Transmission Type") +
xlab("Weight in 1000lbs") +
ylab("MPG") +
scale_color_discrete(name = "Transmission Type", labels = c("Automatic", "Manual"))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The plots of the residuals support the conclusion that accounting for multiple variables accounts for more of the variance.
The points in the Residuals vs. Fitted plot appear to be random which indicate the data are independent. The plot also suggests three potential outliers for the Chrysler Imperial, Fiat 128, and Toyota Corolla.
The points of the Normal Q-Q plot follow the line indicating that the residuals are normally distributed.
The points on the Scale-Location plot appear to be spread equally along the line with random points spread out allowing us to conclude equal variance (homoscedasticity).
The Residuals vs. Leverage plot doesn’t show any influential points. All points are within the 0.05 lines which conclude there are no outliers.
par(mfrow = c(2, 2))
plot(bestModelFit)
Yes, one can expect better gas mileage with a manual transmission vice an automatic transmission. There were a few confounding variables, such as weight and quarter mile time in addition to transmission type that better predict variance in gas mileage.
Both the t test and the simple linear regression showed that there is a 7.25 mpg premium when selecting a manual transmissions vice automatic transmissions.
However, by using the best fitting model (holding weight and quarter mile time constant), the premium dropped to 2.94 mpg when selecting a manual transmission vice automatic transmission.