In this report we analyze data collected by Motor Trend magazine in 1974 to determine whether cars get better gas mileage with an automatic or a manual transmission. We are interested in answering the following two questions:
“Is an automatic or manual transmission better for MPG?” “Quantify the MPG difference between automatic and manual transmissions.”
We begin by loading the required packages and the mtcars data set. We then compute a correlation matrix while all columns are still numeric, before converting the categorical variables to factors for further analysis.
library(ggplot2)
library(corrplot)
data(mtcars)
set.seed(123)
# Create correlation matrix before converting to factors
corrMat <- cor(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am, labels = c("Auto", "Manual"))
dim(mtcars)
## [1] 32 11
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0 Manual    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0 Manual    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1 Manual    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1   Auto    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0   Auto    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1   Auto    3    1
The data set consists of 11 variables for 32 different car models. Of these, 19 have an automatic transmission and 13 have a manual one.
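The split by transmission type can be checked directly. The sketch below works on a separate copy of the data (the name `d` is only illustrative) and relies on the coding documented for mtcars, where `am = 0` is automatic and `am = 1` is manual.

```r
# Work on a copy so the factor-converted mtcars used in the report is untouched
d <- datasets::mtcars

# am is coded 0 = automatic, 1 = manual in the original data
am <- factor(d$am, labels = c("Auto", "Manual"))

# Number of car models per transmission type
counts <- table(am)
print(counts)
```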
We can already see that there may be many confounding variables that affect MPG aside from transmission. Let’s look at a correlation matrix:
corrplot(corrMat, method = "number", order = "FPC", type = "lower", tl.cex = 0.6, tl.col = rgb(0, 0, 0), outline = FALSE)
Weight (wt), hp, disp and cyl are all strongly and negatively correlated with MPG (absolute correlation > 0.75).
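The same information can be read off numerically rather than from the plot. This is a small sketch that recomputes the correlations on an untouched numeric copy of the data (`d`, `r_mpg` and `strong` are illustrative names) and keeps only the variables whose absolute correlation with MPG exceeds the 0.75 threshold used above.

```r
d <- datasets::mtcars                      # numeric copy, no factors

# Correlation of every variable with mpg, dropping mpg itself
r_mpg <- cor(d)[, "mpg"]
r_mpg <- r_mpg[names(r_mpg) != "mpg"]

# Keep variables with |r| above the 0.75 threshold
strong <- r_mpg[abs(r_mpg) > 0.75]
print(round(sort(strong), 2))
```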
When we visualize the interaction between these variables below, some clear patterns emerge: small, dark triangles in the top left and big, brightly colored circles in the bottom right, meaning that light vehicles with low horsepower, low displacement and a manual transmission tend to have higher MPG. In particular, looking along the y-axis for MPG, it is easy to draw a line (shown in red) that separates most of the automatic cars (circles) from the manual ones (triangles). This visually suggests, though it does not prove, that MPG differs between transmission types.
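The plotting code is not shown in the report, so the following is only one possible reconstruction of such a figure with ggplot2. The aesthetic mappings (weight on the x-axis, point size for horsepower, color for displacement), the faceting by cylinder count and the position of the red line at 20 MPG are all assumptions chosen to match the description, not the original settings.

```r
library(ggplot2)

d <- datasets::mtcars
d$am  <- factor(d$am, labels = c("Auto", "Manual"))
d$cyl <- factor(d$cyl)

# Assumed mappings: shape = transmission, size = horsepower,
# colour = displacement, one panel per cylinder count
p <- ggplot(d, aes(x = wt, y = mpg, shape = am, size = hp, colour = disp)) +
  geom_point() +
  facet_wrap(~ cyl) +
  geom_hline(yintercept = 20, colour = "red") +  # hypothetical position of the red line
  labs(x = "Weight (1000 lbs)", y = "MPG", shape = "Transmission")

print(p)
```

With ggplot2's default scales the first factor level (Auto) is drawn as a circle and the second (Manual) as a triangle, and low displacement values are drawn in the darker shade.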
t.test(mpg ~ am, mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.28 -3.21
## sample estimates:
##   mean in group Auto mean in group Manual
##                17.15                24.39
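The t-test shows a statistically significant difference between transmission types (p = 0.0014): mean MPG is 17.2 for automatic versus 24.4 for manual cars. The 95% confidence interval for the difference (automatic minus manual) runs from -11.28 to -3.21 MPG and excludes zero. Note that this comparison does not adjust for any other variable.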
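To quantify the difference rather than just read it from the printout, the relevant pieces can be pulled out of the test object. This is a minimal sketch on a copy of the data; `tt`, `mean_diff` and `ci` are illustrative names.

```r
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Auto", "Manual"))

# Welch two-sample t-test (unequal variances is the default in t.test)
tt <- t.test(mpg ~ am, data = d)

# Difference in group means, expressed as Manual minus Auto
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])

# Confidence interval as reported by t.test, i.e. for Auto minus Manual
ci <- tt$conf.int

print(round(c(diff = mean_diff, p.value = tt$p.value), 4))
print(round(ci, 2))
```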
fit <- lm(mpg ~ am, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.392 -3.092 -0.297 3.244 9.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.15 1.12 15.25 1.1e-15 ***
## amManual 7.24 1.76 4.11 0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.338
## F-statistic: 16.9 on 1 and 30 DF, p-value: 0.000285
The adjusted R-squared is only 0.34 (multiple R-squared 0.36), so transmission type alone explains roughly a third of the variance in MPG. In the previous plot, each panel also shows a separation by transmission type along the x-axis, with triangles to the left and circles to the right, suggesting that weight is likely also a factor in MPG, as one would expect. Referring back to the correlation matrix, all four of these variables are clearly correlated with MPG, but disp and cyl are also cross-correlated with all the other variables under consideration, while weight and horsepower are mostly correlated with MPG. This suggests that weight and horsepower may be stronger predictors of MPG.
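A short sketch makes the single-predictor model easier to interpret: with only a two-level factor on the right-hand side, the slope is simply the difference between the two group means, and both R-squared figures can be extracted from the summary object. The names `fit_am`, `s`, `group_means` and `slope` are illustrative.

```r
d <- datasets::mtcars
d$am <- factor(d$am, labels = c("Auto", "Manual"))

fit_am <- lm(mpg ~ am, data = d)     # same formula as fit above
s <- summary(fit_am)

# Mean MPG per transmission type and the fitted slope for Manual
group_means <- tapply(d$mpg, d$am, mean)
slope <- unname(coef(fit_am)["amManual"])

print(round(c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, slope = slope), 3))
```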
Let’s investigate a model using all variables:
fitAll <- lm(mpg ~ ., mtcars)
summary(fitAll)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.19004 0.25253
## cyl6 -2.64870 3.04089 -0.87103 0.39747
## cyl8 -0.33616 7.15954 -0.04695 0.96317
## disp 0.03555 0.03190 1.11433 0.28267
## hp -0.07051 0.03943 -1.78835 0.09393
## drat 1.18283 2.48348 0.47628 0.64074
## wt -4.52978 2.53875 -1.78426 0.09462
## qsec 0.36784 0.93540 0.39325 0.69967
## vs1 1.93085 2.87126 0.67248 0.51151
## amManual 1.21212 3.21355 0.37719 0.71132
## gear4 1.11435 3.79952 0.29329 0.77332
## gear5 2.52840 3.73636 0.67670 0.50890
## carb2 -0.97935 2.31797 -0.42250 0.67865
## carb3 2.99964 4.29355 0.69864 0.49547
## carb4 1.09142 4.44962 0.24528 0.80956
## carb6 4.47757 6.38406 0.70137 0.49381
## carb8 7.25041 8.36057 0.86722 0.39948
With an adjusted R-squared of 0.779 this is a definite improvement. None of the individual coefficients is significant at the 5% level, but hp and wt have the largest absolute t-statistics (about 1.79, p ≈ 0.09). This is something we suspected from the correlation matrix. We now have enough context to create some additional models to compare.
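The summary figures for the full model can be extracted programmatically instead of being read from the coefficient table. The sketch below refits the model on a copy of the data with the same factor conversions as in the setup code; `fit_all`, `tvals` and `top2` are illustrative names.

```r
d <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) d[[v]] <- factor(d[[v]])
d$am <- factor(d$am, labels = c("Auto", "Manual"))

fit_all <- lm(mpg ~ ., data = d)     # same formula as fitAll above
s <- summary(fit_all)

# Rank the predictors by absolute t-statistic, leaving out the intercept
tvals <- s$coefficients[-1, "t value"]
top2 <- names(sort(abs(tvals), decreasing = TRUE))[1:2]

print(round(c(r2 = s$r.squared, adj_r2 = s$adj.r.squared), 3))
print(top2)
```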
fit1 <- lm(mpg ~ wt + hp, mtcars)
fit2 <- lm(mpg ~ wt + hp + cyl, mtcars)
fit3 <- lm(mpg ~ wt + hp + cyl + disp, mtcars)
anova(fit, fit1, fit2, fit3, fitAll)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + hp
## Model 3: mpg ~ wt + hp + cyl
## Model 4: mpg ~ wt + hp + cyl + disp
## Model 5: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 721
## 2 29 195 1 526 65.51 7.5e-07 ***
## 3 27 161 2 34 2.13 0.15
## 4 26 160 1 1 0.08 0.78
## 5 15 120 11 40 0.45 0.91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
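Using anova to compare the models, we see that while every step reduces the RSS, only the step to fit1 is significant. Note that fit (mpg ~ am) is not nested within fit1 (mpg ~ wt + hp), so that first F-test is only indicative; the later steps are properly nested comparisons. With an adjusted R-squared of 0.81, fit1 is a clear improvement over the model taking all variables into consideration (0.779).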
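Because the transmission-only model and the weight-plus-horsepower model are not nested, a criterion that does not rely on nesting, such as adjusted R-squared, is a reasonable cross-check. The sketch below is one way to tabulate it for three of the models; the list names are illustrative.

```r
d <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) d[[v]] <- factor(d[[v]])
d$am <- factor(d$am, labels = c("Auto", "Manual"))

# Three of the models discussed above, refitted on the copy
models <- list(
  am_only = lm(mpg ~ am, data = d),
  wt_hp   = lm(mpg ~ wt + hp, data = d),
  all     = lm(mpg ~ ., data = d)
)

# Adjusted R-squared penalizes extra parameters, unlike plain R-squared
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 3))
```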
Let’s continue with the weight and horsepower model (fit1) and add the interaction term between the two:
fit2.x <- lm(mpg ~ wt + hp + wt*hp, mtcars)
anova(fit2, fit2.x)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt + hp + cyl
## Model 2: mpg ~ wt + hp + wt * hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 161
## 2 28 130 -1 31
summary(fit2.x)$r.squared
## [1] 0.8848
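Here the R-squared is higher still (0.885). Note, however, that the anova call above compares two non-nested models (fit2 includes cyl, fit2.x does not), which is why it reports no F-statistic or p-value. What it does show is that fit2.x reaches a lower RSS (130 vs 161) with one fewer parameter. A formal test of the interaction term requires comparing fit2.x against the nested model fit1.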
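A formal test of the interaction term needs a pair of nested models, which here means comparing against the additive weight-plus-horsepower model. The RSS values reported in the tables above (195 for fit1, 130 for fit2.x) imply an F statistic of roughly 14 on 1 and 28 degrees of freedom, well above the 5% critical value of about 4.2. The sketch below runs that comparison directly; `fit_main`, `fit_int` and `cmp` are illustrative names.

```r
d <- datasets::mtcars

fit_main <- lm(mpg ~ wt + hp, data = d)   # same formula as fit1
fit_int  <- lm(mpg ~ wt * hp, data = d)   # wt * hp expands to wt + hp + wt:hp

# Nested comparison: the only added term is the wt:hp interaction
cmp <- anova(fit_main, fit_int)
print(cmp)
```

Writing the formula as `wt * hp` fits the same model as `wt + hp + wt*hp`, since R expands the product into both main effects plus the interaction.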
shapiro.test(fit2.x$residuals)
##
## Shapiro-Wilk normality test
##
## data: fit2.x$residuals
## W = 0.9545, p-value = 0.1928
With a Shapiro-Wilk p-value of 0.19 we do not reject the hypothesis that the residuals are normally distributed, making us more confident that we are on the right track.
The residuals show no apparent pattern against the fitted values. The Q-Q plot agrees with the Shapiro-Wilk test: the residuals are close to normally distributed.
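The diagnostic plots themselves are not reproduced here. A minimal sketch of how they could be generated with base R's `plot` method for `lm` objects follows; the side-by-side layout and the name `fit_int` are assumptions, not necessarily what was used for the original figures.

```r
d <- datasets::mtcars
fit_int <- lm(mpg ~ wt * hp, data = d)    # same terms as fit2.x

# Two panels side by side, restoring the previous settings afterwards
op <- par(mfrow = c(1, 2))
plot(fit_int, which = 1)                  # residuals vs fitted values
plot(fit_int, which = 2)                  # normal Q-Q plot of the residuals
par(op)

# Numerical companion to the Q-Q plot
sw <- shapiro.test(residuals(fit_int))
print(sw)
```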