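This report examines the relationship between transmission type and miles per gallon (mpg) in the mtcars dataset using linear regression. It addresses two questions: whether automatic or manual transmissions get better mpg, and how large the difference is.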
Regression analysis using transmission type, weight, and 1/4 mile time as explanatory variables leads to the conclusion that manual cars get on average 2.9 more mpg than automatic cars when weight and 1/4 mile time are held constant.
Let’s take a look at the dataset.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The variable am is represented as a numeric, where 0 encodes automatic and 1 encodes manual. In order to use regression analysis to compare mpg in manual vs. automatic transmissions, we will need to convert am to a factor. We can also replace the 0’s and 1’s with labels to make the output more readable.
# Convert am to a factor, labelling 0 as "auto" and 1 as "manual"
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("auto", "manual"))
Several other variables should also be treated as factors; namely, vs, gear, and carb.
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
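A quick check that the conversions took effect can be sketched as follows. The snippet works on a fresh copy of the data (the name `cars` is an assumption made for illustration) so that it does not disturb the `mtcars` object used in the rest of the report. It should report 19 automatic and 13 manual cars.

```r
# Work on a fresh copy so the report's mtcars object is left untouched.
# `cars` is a hypothetical name used only in this sketch.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
for (v in c("vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

# Confirm that the four converted columns are now factors.
col_classes <- sapply(cars[c("am", "vs", "gear", "carb")], class)
col_classes

# Count the cars in each transmission group.
am_counts <- table(cars$am)
am_counts
```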
A boxplot will give us a visual idea of whether or not mpg might differ between automatic and manual transmission cars.
library(ggplot2)   # ggplot(), aes(), geom_boxplot()
library(magrittr)  # provides the %>% pipe
mtcars %>%
  ggplot(aes(x = am, y = mpg, fill = am)) +
  geom_boxplot()
Manual cars appear to get better mileage on average than do automatic cars. This observation can be confirmed with statistical inference.
Refer to the Appendix for a pairs plot of the dataset.
We can use the R function t.test to test whether the difference in mean mpg between manual and automatic cars is statistically significant. By default, t.test performs a two-sided Welch test, which does not assume equal variances in the two groups.
t.test(mpg ~ am, mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group auto mean in group manual
## 17.14737 24.39231
The 95% confidence interval for the difference in means (automatic minus manual) does not contain 0, and the p-value is about 0.0014, so we can conclude that the difference in mean mpg between manual and automatic transmissions is statistically significant at the 0.05 level.
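The quantities discussed here can also be pulled out of the test object directly rather than read off the printed output. The following is a minimal sketch on a fresh copy of the data; the names `cars`, `tt`, and `mean_diff` are assumptions made for illustration.

```r
# Fresh copy of the data with am labelled as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Welch two-sample t-test (the default for t.test with a formula).
tt <- t.test(mpg ~ am, data = cars)

# Difference in group means, automatic minus manual.
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])
mean_diff

# 95% confidence interval for that difference, and the two-sided p-value.
tt$conf.int
tt$p.value
```

Because the interval is for automatic minus manual, both of its endpoints are negative when manual cars have the higher mean mpg.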
Let’s start by regressing mpg on just am.
am_model <- lm(mpg ~ am, mtcars)
summary(am_model)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The R-squared value for this model is 0.3598, which means that regressing mpg on am alone explains only about 36% of the variance in mpg.
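With a single two-level factor as the only regressor, the intercept is the mean mpg of the baseline group (auto) and the ammanual coefficient is the difference between the two group means. The sketch below checks this and recomputes R-squared from its definition; the names `cars`, `fit`, `group_means`, and `r_squared` are assumptions made for illustration.

```r
# Fresh copy of the data with am labelled as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

fit <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean of the auto group; slope = manual mean minus auto mean.
coef(fit)
group_means

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
rss <- sum(resid(fit)^2)
tss <- sum((cars$mpg - mean(cars$mpg))^2)
r_squared <- 1 - rss / tss
r_squared
```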
Building a model that regresses mpg on all other variables in the dataset will explain more of the variance.
full_model <- lm(mpg ~ ., mtcars)
summary(full_model)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6533 -1.3325 -0.5166 0.7643 4.7284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.31994 23.88164 1.060 0.3048
## cyl -1.02343 1.48131 -0.691 0.4995
## disp 0.04377 0.03058 1.431 0.1716
## hp -0.04881 0.03189 -1.531 0.1454
## drat 1.82084 2.38101 0.765 0.4556
## wt -4.63540 2.52737 -1.834 0.0853 .
## qsec 0.26967 0.92631 0.291 0.7747
## vs1 1.04908 2.70495 0.388 0.7032
## ammanual 0.96265 3.19138 0.302 0.7668
## gear4 1.75360 3.72534 0.471 0.6442
## gear5 1.87899 3.65935 0.513 0.6146
## carb2 -0.93427 2.30934 -0.405 0.6912
## carb3 3.42169 4.25513 0.804 0.4331
## carb4 -0.99364 3.84683 -0.258 0.7995
## carb6 1.94389 5.76983 0.337 0.7406
## carb8 4.36998 7.75434 0.564 0.5809
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.823 on 16 degrees of freedom
## Multiple R-squared: 0.8867, Adjusted R-squared: 0.7806
## F-statistic: 8.352 on 15 and 16 DF, p-value: 6.044e-05
As expected, the full model has a higher R-squared value (0.8867). However, the output of summary shows that none of the coefficients are significant at the 0.05 level, and the adjusted R-squared (0.7806) is noticeably lower than the unadjusted value.
Excluding variables that are correlated with both transmission type and mpg will bias the estimated transmission coefficient. However, including unnecessary regressors will inflate the standard errors of the coefficient estimates. We will use the step function in R, which by default selects variables by AIC, to determine which variables to include in our final model.
step_model <- step(full_model, direction = "backward", trace = FALSE)
summary(step_model)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ammanual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The model produced by the step algorithm includes wt, qsec, and am. The effects of all three variables are significant at the 0.05 level, and the model explains about 85% of the variance.
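Backward selection with step removes one term at a time for as long as doing so lowers the AIC. The following sketch compares the AIC of the full and selected models on a fresh copy of the data; the names `cars`, `full_fit`, and `step_fit` are assumptions made for illustration. Note that step uses extractAIC internally, which differs from AIC() by a constant for models fitted to the same data, so the ordering of the two models is the same.

```r
# Fresh copy of the data with the same factor conversions as the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
for (v in c("vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

# Full model and the model chosen by backward elimination.
full_fit <- lm(mpg ~ ., data = cars)
step_fit <- step(full_fit, direction = "backward", trace = FALSE)

# Terms retained by the selection procedure.
formula(step_fit)

# Lower AIC indicates a better trade-off between fit and complexity.
AIC(full_fit)
AIC(step_fit)
```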
From the coefficients of this model, we can conclude that, holding weight and qsec constant, manual cars get 2.9 more miles per gallon on average than automatic cars.
The 95% confidence interval for the transmission coefficient is:
confint(step_model)['ammanual', ]
## 2.5 % 97.5 %
## 0.04573031 5.82594408
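The interval reported by confint is the estimate plus or minus the critical t value times the standard error, using the residual degrees of freedom (28 here). A minimal sketch that reproduces it by hand follows; the names `cars`, `fit`, `est`, `se`, `t_crit`, and `manual_ci` are assumptions made for illustration.

```r
# Fresh copy of the data with am labelled as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

# Refit the selected model directly.
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Estimate and standard error of the transmission coefficient.
coef_table <- coef(summary(fit))
est <- coef_table["ammanual", "Estimate"]
se <- coef_table["ammanual", "Std. Error"]

# Two-sided 95% critical value on the residual degrees of freedom.
t_crit <- qt(0.975, df = df.residual(fit))

# Interval computed by hand; compare with confint(fit)["ammanual", ].
manual_ci <- est + c(-1, 1) * t_crit * se
manual_ci
```

The lower bound is only slightly above 0, which is consistent with the p-value of about 0.047 for this coefficient.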
Diagnostic plots produced with base graphics show no obvious pattern in the residuals against the fitted values. The quantile-quantile plot indicates that the distribution of the residuals is roughly normal.
par(mfrow = c(2,2))
plot(step_model)
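As a numerical complement to the quantile-quantile plot, a Shapiro-Wilk test can be run on the residuals. This test is not part of the original analysis; it is shown here only as a sketch, and the names `cars`, `fit`, and `sw` are assumptions made for illustration. A large p-value means the test finds no evidence against normality, which is weaker than showing that the residuals are normal.

```r
# Fresh copy of the data with am labelled as in the report.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Shapiro-Wilk test of normality on the model residuals.
sw <- shapiro.test(resid(fit))
sw

# Least-squares residuals are uncorrelated with the fitted values by
# construction, so this correlation is zero up to rounding error; the
# residuals-vs-fitted plot is useful for spotting non-linear patterns.
resid_fit_cor <- cor(resid(fit), fitted(fit))
resid_fit_cor
```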
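# Appendix: pairs plot of the dataset, coloured by transmission type
library(GGally)  # provides ggpairs(); ggplot2 supplies aes()
data(mtcars)     # reload the original, unmodified dataset
mtcars$am <- factor(mtcars$am)
ggpairs(mtcars, mapping = aes(colour = am))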