by MAS, 16 Feb 2019
This report uses the mtcars dataset to answer: (1) Is an automatic or manual transmission better for mpg? (2) What is the quantified difference between automatic and manual transmissions? Regression modeling and statistical inference suggest that manual transmissions are better than automatic transmissions for mpg, by approximately 1.8 mpg after adjusting for other car features (cyl, disp, hp and wt), although this adjusted difference is not statistically significant at the 0.05 level (p = 0.215).
First, let’s explore mtcars using the str and head functions (results in Appendix A.1). The data consist of 32 cars with 11 numeric features. The am feature designates automatic (am=0) and manual (am=1) transmission. The boxplot in Appendix A.2 shows that there may be a difference in mean mpg between automatic and manual. To test the null hypothesis that there is no difference in mean mpg, we use a t-test (R code in Appendix A.3).
##
## Welch Two Sample t-test
##
## data: man_mpg and auto_mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.209684 11.280194
## sample estimates:
## mean of x mean of y
## 24.39231 17.14737
The p-value is 0.0014, which is below 0.05, and the 95% confidence interval for the difference in means (3.21 to 11.28 mpg) does not include 0. We therefore reject the null hypothesis of no difference in mean mpg between manual and automatic transmissions.
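The quantities used in this decision can also be read directly from the object returned by t.test. The following is a minimal sketch, assuming the same group split as in Appendix A.3; the names tt, mean_diff, ci and reject_null are introduced here for illustration only.

```r
# Split mpg by transmission type (am: 0 = automatic, 1 = manual)
man_mpg  <- mtcars$mpg[mtcars$am == 1]
auto_mpg <- mtcars$mpg[mtcars$am == 0]

# Welch two-sample t-test (the t.test default, which does not assume equal variances)
tt <- t.test(man_mpg, auto_mpg)

# Observed difference in group means (manual minus automatic)
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])

# The 95% confidence interval refers to this difference in means
ci <- as.numeric(tt$conf.int)

# Two-sided decision at the 5% level
reject_null <- tt$p.value < 0.05

mean_diff
ci
reject_null
```

For a two-sided test, rejecting at the 5% level and the 95% interval excluding 0 are the same decision, so the two checks always agree.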
Next, construct a linear model relating mpg to am:
# Construct linear model
mdl1 <- lm(mpg ~ factor(am), data=mtcars)
# Get coefficients
summary(mdl1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## factor(am)1 7.244939 1.764422 4.106127 2.850207e-04
# Get r-squared
summary(mdl1)$r.squared
## [1] 0.3597989
The intercept (17.147) is the mean mpg observed for automatic transmissions, and the coefficient of 7.245 for factor(am)1 is the difference in mean mpg between manual and automatic transmissions: manual cars average 7.245 mpg more. The p-values for both coefficients are well below 0.05. However, the r-squared is 0.36, so this model explains only 36% of the variability in mpg.
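With a single two-level factor as the only predictor, the fitted coefficients are just a re-expression of the two group means. A short sketch to confirm this on mtcars (the names group_means, intercept_ok and slope_ok are illustrative):

```r
# Single-predictor model from the main text
mdl1 <- lm(mpg ~ factor(am), data = mtcars)
b <- coef(mdl1)

# Mean mpg per transmission type; element names are "0" (automatic) and "1" (manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)

# Intercept equals the automatic group mean
intercept_ok <- isTRUE(all.equal(unname(b["(Intercept)"]),
                                 unname(group_means["0"])))

# factor(am)1 coefficient equals manual mean minus automatic mean
slope_ok <- isTRUE(all.equal(unname(b["factor(am)1"]),
                             unname(group_means["1"] - group_means["0"])))

c(intercept_ok = intercept_ok, slope_ok = slope_ok)
```

This is also why the coefficient of 7.245 matches the difference between the two sample means reported by the t-test.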
To construct better models, let’s look at the correlation between mpg and all the other car features (results in Appendix A.4). We see strong correlations (absolute value > 0.75, all negative) with cyl, disp, hp and wt. One strategy for model selection is to construct nested linear models that add these features one at a time and compare successive models using ANOVA (R code in Appendix A.5).
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + factor(cyl)
## Model 3: mpg ~ factor(am) + factor(cyl) + disp
## Model 4: mpg ~ factor(am) + factor(cyl) + disp + hp
## Model 5: mpg ~ factor(am) + factor(cyl) + disp + hp + wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 264.50 2 456.40 37.9300 2.678e-08 ***
## 3 27 230.46 1 34.04 5.6572 0.025339 *
## 4 26 183.04 1 47.42 7.8820 0.009541 **
## 5 25 150.41 1 32.63 5.4236 0.028246 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA shows that each added feature gives a statistically significant improvement over the preceding model at the 0.05 level; in particular, model 5 improves on model 4 (p-value = 0.028). The coefficients for model 5 are shown below:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.864276061 2.69541569 12.5636562 2.668321e-12
## factor(am)1 1.806099494 1.42107933 1.2709350 2.154510e-01
## factor(cyl)6 -3.136066556 1.46909031 -2.1346996 4.277253e-02
## factor(cyl)8 -2.717781289 2.89814941 -0.9377644 3.573375e-01
## disp 0.004087893 0.01276729 0.3201848 7.514890e-01
## hp -0.032480178 0.01398322 -2.3227963 2.862128e-02
## wt -2.738694608 1.17597755 -2.3288664 2.824553e-02
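In model 5, the factor(am)1 coefficient gives the adjusted difference in mean mpg between manual and automatic transmissions: holding cyl, disp, hp and wt fixed, manual transmissions are associated with 1.81 mpg more. Note that this coefficient has a p-value of 0.215, so the adjusted difference is not statistically significant at the 0.05 level.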
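A confidence interval is a convenient way to express the uncertainty around this adjusted estimate. The sketch below refits model 5 and uses confint; the names am_est and am_ci are introduced here for illustration.

```r
# Model 5 as specified in Appendix A.5
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data = mtcars)

# Adjusted manual-vs-automatic difference in mpg
am_est <- unname(coef(mdl5)["factor(am)1"])

# 95% confidence interval for that coefficient (1 x 2 matrix: lower, upper)
am_ci <- confint(mdl5, "factor(am)1", level = 0.95)

am_est
am_ci
```

Because the coefficient's p-value (0.215) is above 0.05, the 95% interval necessarily contains 0.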
# Model 5 r-squared
summary(mdl5)$r.squared
## [1] 0.8664276
Additionally, the r-squared value has increased to 0.866, indicating that 86.6% of the variability in mpg is now explained. To check the model, diagnostic plots for model 5 are shown in Appendix A.6; they show no clear pattern in the residuals, and the Q-Q plot shows the residuals are approximately normally distributed.
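Since r-squared cannot decrease when predictors are added to a nested model, the adjusted r-squared, which penalizes the number of coefficients, may be a useful complementary check. This is an extra check that was not part of the analysis above; a minimal sketch, with r2 as an illustrative name:

```r
# Refit the simplest and the largest models from the report
mdl1 <- lm(mpg ~ factor(am), data = mtcars)
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data = mtcars)

# Plain and adjusted r-squared side by side
r2 <- rbind(
  mdl1 = c(r.squared = summary(mdl1)$r.squared,
           adj.r.squared = summary(mdl1)$adj.r.squared),
  mdl5 = c(r.squared = summary(mdl5)$r.squared,
           adj.r.squared = summary(mdl5)$adj.r.squared)
)
r2
```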
mtcars Data Set
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
library(ggplot2)
# Boxplot of mpg by transmission type, with the individual cars overlaid
g <- ggplot(mtcars, aes(x=factor(am), y=mpg)) +
  geom_boxplot(aes(group=factor(am))) + geom_jitter(width=0.1) +
  xlab("Transmission Type") + ylab("MPG") +
  ggtitle("MPG as Function of Transmission Type") +
  scale_x_discrete(labels = c("Automatic","Manual")) +
  theme(plot.title = element_text(face="bold", hjust=0.5, size=12))
g
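# Split mpg by transmission type (am: 0 = automatic, 1 = manual)
auto_mpg <- mtcars$mpg[mtcars$am == 0]
man_mpg <- mtcars$mpg[mtcars$am == 1]
# Welch two-sample t-test of manual vs. automatic mean mpg
t.test(man_mpg, auto_mpg)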
# Get correlations among features
cor(mtcars)[1,]
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
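The screening rule applied to these correlations (absolute value above 0.75) can also be written out in code rather than read off the table. A small sketch; the 0.75 cutoff comes from the main text, and the names mpg_cor and strong are illustrative:

```r
# Correlation of mpg with every column, then drop mpg's correlation with itself
mpg_cor <- cor(mtcars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Features whose correlation with mpg exceeds 0.75 in absolute value
strong <- names(mpg_cor)[abs(mpg_cor) > 0.75]
strong

# Signs of the selected correlations (all negative here)
sign(mpg_cor[strong])
```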
# Fit multiple models, nested; varying by one additional feature
mdl2 <- lm(mpg ~ factor(am) + factor(cyl), data=mtcars)
mdl3 <- lm(mpg ~ factor(am) + factor(cyl) + disp, data=mtcars)
mdl4 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp, data=mtcars)
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data=mtcars)
# Compare models using ANOVA
anova(mdl1, mdl2, mdl3, mdl4, mdl5)
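When several models are passed to anova, each row tests a model against the one before it. A single step can also be checked in isolation; the sketch below compares only models 4 and 5, and the names cmp, p_wt and p_t are introduced here for illustration.

```r
# Refit the two largest nested models
mdl4 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp, data = mtcars)
mdl5 <- lm(mpg ~ factor(am) + factor(cyl) + disp + hp + wt, data = mtcars)

# F-test for adding wt to model 4
cmp <- anova(mdl4, mdl5)
p_wt <- cmp[2, "Pr(>F)"]

# For a single added coefficient, this F-test matches the t-test on wt in model 5
p_t <- summary(mdl5)$coefficients["wt", "Pr(>|t|)"]

c(anova_p = p_wt, t_test_p = p_t)
```

The two p-values agree because, when one coefficient is added, the F statistic is the square of that coefficient's t statistic.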
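# Arrange the four diagnostic plots for model 5 in a 2x2 grid
par(mar=c(4,4,2,2))
par(mfrow=c(2,2))
plot(mdl5, which=1)  # Residuals vs Fitted
plot(mdl5, which=3)  # Scale-Location
plot(mdl5, which=2)  # Normal Q-Q
plot(mdl5, which=5)  # Residuals vs Leverage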