The type of transmission in an automobile has an effect on the vehicle’s miles per gallon. To quantify this difference, Motor Trend magazine has collected a data set that includes fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. Does an automatic or manual transmission produce better fuel efficiency? A careful statistical analysis of the data set will help answer this question and identify the specific variables involved in the relationship.
First, the data set is loaded into R and the mean miles per gallon (mpg) is computed for automatic and manual transmissions. The categorical variable am indicates the transmission type (0 = automatic, 1 = manual).
## transmission mpg
## 1 automatic 17.14737
## 2 manual 24.39231
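The group means above could be reproduced along the following lines. This is a minimal sketch that assumes base R's aggregate(); the original analysis may have used a different approach, and the names mt, transmission, and group_means are illustrative only.

```r
# Assumption: mtcars is the copy that ships with base R (datasets package).
# Label the 0/1 transmission code so the output is readable.
mt <- transform(
  mtcars,
  transmission = factor(am, levels = c(0, 1),
                        labels = c("automatic", "manual"))
)

# Mean mpg within each transmission group.
group_means <- aggregate(mpg ~ transmission, data = mt, FUN = mean)
print(group_means)
```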
The means for the two groups of vehicles differ by about 7.2 miles per gallon, which appears substantial. To provide additional context, ggplot is used to generate a boxplot comparing mpg values for manual and automatic transmissions.
The boxplot further illustrates what appears to be a substantial difference in the miles per gallon achieved by the two transmission types. A Welch two-sample t-test is used to determine whether this difference is statistically significant.
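A boxplot of this kind might be drawn as follows. This is a sketch only: it assumes the ggplot2 package is installed, and the axis labels, title, and object names (mt, p) are illustrative choices rather than those of the original figure.

```r
library(ggplot2)

# Label the transmission code so the x-axis shows names instead of 0/1.
mt <- transform(
  mtcars,
  transmission = factor(am, levels = c(0, 1),
                        labels = c("automatic", "manual"))
)

# One box per transmission type, mpg on the y-axis.
p <- ggplot(mt, aes(x = transmission, y = mpg)) +
  geom_boxplot() +
  labs(x = "Transmission", y = "Miles per gallon",
       title = "Fuel efficiency by transmission type")
print(p)
```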
##
## Welch Two Sample t-test
##
## data: mtcars$mpg[mtcars$am == 0] and mtcars$mpg[mtcars$am == 1]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
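The call that produces this output can be read off the "data:" line of the result. The sketch below splits it into named pieces for readability; the variable names automatic, manual, and welch are illustrative.

```r
# mpg values for each transmission group (0 = automatic, 1 = manual).
automatic <- mtcars$mpg[mtcars$am == 0]
manual    <- mtcars$mpg[mtcars$am == 1]

# t.test() defaults to the Welch test, which does not assume equal variances.
welch <- t.test(automatic, manual)

# Pull out the pieces discussed in the text.
print(welch$p.value)
print(welch$conf.int)
```

Because the groups are passed as automatic first and manual second, the negative confidence interval corresponds to automatic vehicles having the lower mean mpg.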
The t-test provides strong evidence (p = 0.0014) that the true difference in means is not zero. In order to quantify this difference, the relationships between mpg and the remaining variables in the mtcars data must be investigated. A pairs plot showing both correlations and scatterplots (see Appendix A) identifies four variables that are strongly related to mpg: cyl, disp, hp, and wt. Because cyl and disp are highly correlated with each other, only one of them should be included in the model.
First, a simple linear model (fit1) is generated by regressing mpg on am.
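The variable screening described here can be sketched numerically as well as graphically. The example below ranks the predictors by the absolute value of their correlation with mpg and checks the correlation between cyl and disp. It uses only base R; the names cors, mpg_cors, and ranked are illustrative, and the plain pairs() call is a simplified stand-in for the Appendix A plot.

```r
# Correlation matrix of all 11 variables in mtcars.
cors <- cor(mtcars)

# Correlations of each predictor with mpg, strongest first.
mpg_cors <- cors["mpg", colnames(cors) != "mpg"]
ranked <- mpg_cors[order(abs(mpg_cors), decreasing = TRUE)]
print(round(ranked, 3))

# Correlation between the two candidate predictors cyl and disp.
print(round(cors["cyl", "disp"], 3))

# Scatterplot matrix restricted to mpg and the four shortlisted variables.
pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])
```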
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
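The fit summarized above corresponds to the formula shown in its "Call:" line. As a small sketch, the code below refits it and checks that, with a single 0/1 predictor, the am coefficient is simply the difference between the two group means; mean_diff is an illustrative name.

```r
# Simple regression of mpg on the transmission indicator.
fit1 <- lm(mpg ~ am, data = mtcars)
print(coef(fit1))

# With one binary predictor, the slope equals the difference in group means
# and the intercept equals the mean of the am == 0 (automatic) group.
mean_diff <- mean(mtcars$mpg[mtcars$am == 1]) -
  mean(mtcars$mpg[mtcars$am == 0])
print(all.equal(unname(coef(fit1)["am"]), mean_diff))

# Share of the variation in mpg explained by am alone.
print(summary(fit1)$r.squared)
```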
Based on this model, manual transmission vehicles achieve on average 7.2 miles per gallon more than vehicles with automatic transmissions. The R-squared value is low: only 36% of the variation in mpg is explained by am alone. Additional models can now be created using the variables selected above (cyl, hp, and wt).
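# fit1 is the simple model of mpg on am; it is refit here so the comparison runs on its own.
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ am + hp, data = mtcars)
fit3 <- lm(mpg ~ am + hp + wt, data = mtcars)
fit4 <- lm(mpg ~ am + hp + wt + cyl, data = mtcars)
# Sequential F-tests: each row compares a model with the one listed before it.
anova(fit1, fit2, fit3, fit4)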
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + hp
## Model 3: mpg ~ am + hp + wt
## Model 4: mpg ~ am + hp + wt + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 245.44 1 475.46 75.5148 2.638e-09 ***
## 3 28 180.29 1 65.15 10.3472 0.003356 **
## 4 27 170.00 1 10.29 1.6348 0.211917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is apparent from the p-values listed above that fit2 is a significant improvement over fit1, and fit3 is a significant improvement over fit2. The additional variable in fit4, cyl, does not significantly improve the explanatory power of the model (p = 0.21). The selected model, then, is fit3, which explains mpg using am, hp, and wt.
Based on this model (summary results are found in Appendix B), manual transmission vehicles achieve on average about 2.1 miles per gallon more than vehicles with automatic transmissions when hp and wt are held constant. Note, however, that the am coefficient is not statistically significant at the 0.05 level in this model (p = 0.14), so much of the 7.2 mpg difference seen in the simple model is accounted for by horsepower and weight. In this case, the R-squared value is high (84%), indicating that the selected variables explain a large share of the variation in mpg.
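The model-selection reasoning can also be expressed programmatically by pulling the p-values out of the ANOVA table. This is a minimal sketch; the 0.05 threshold and the names comparison, p_values, and improves are assumptions made for illustration.

```r
# Nested models, each adding one predictor to the previous one.
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ am + hp, data = mtcars)
fit3 <- lm(mpg ~ am + hp + wt, data = mtcars)
fit4 <- lm(mpg ~ am + hp + wt + cyl, data = mtcars)

comparison <- anova(fit1, fit2, fit3, fit4)

# The first row has no p-value (NA) because it has nothing to be compared to.
p_values <- comparison[["Pr(>F)"]]
improves <- !is.na(p_values) & p_values < 0.05

print(data.frame(
  model    = c("fit1", "fit2", "fit3", "fit4"),
  p_value  = p_values,
  improves = improves
))
```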
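To see how precisely the adjusted transmission effect is estimated, a confidence interval for the am coefficient can be computed. The sketch below uses base R's confint(); the 95% level and the names am_est, am_ci, and r2 are illustrative choices.

```r
# Selected model: transmission, horsepower, and weight.
fit3 <- lm(mpg ~ am + hp + wt, data = mtcars)

# Estimated manual-vs-automatic difference, holding hp and wt fixed.
am_est <- coef(fit3)["am"]
print(am_est)

# 95% confidence interval for that difference.
am_ci <- confint(fit3, "am", level = 0.95)
print(am_ci)

# Share of the variation in mpg explained by the model.
r2 <- summary(fit3)$r.squared
print(r2)
```

An interval that spans zero is consistent with the p-value for am reported in the Appendix B summary.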
As a final test of the chosen model, it is important to perform some residual diagnostics to ensure that no trends are apparent between the residuals and fitted values. The normal Q-Q plot should also show no signs of non-normality (see Appendix C for residual plots). In this case, the residuals show no signs of heteroskedasticity and appear normally distributed, so it is safe to conclude that the selected model is a fairly accurate representation of the effects of am, hp, and wt on mpg.
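Diagnostics of this kind might be generated as sketched below. The four standard plots come from calling plot() on the fitted model; the Shapiro-Wilk test is an additional numerical check that is assumed here for illustration and is not part of the original analysis. The names op and sw are illustrative.

```r
# Selected model from the comparison above.
fit3 <- lm(mpg ~ am + hp + wt, data = mtcars)

# Standard diagnostic panels: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit3)
par(op)

# Assumption: a Shapiro-Wilk test as a numerical complement to the Q-Q plot.
# A large p-value gives no evidence against normality of the residuals.
sw <- shapiro.test(residuals(fit3))
print(sw)
```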
##
## Call:
## lm(formula = mpg ~ am + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
## am 2.083710 1.376420 1.514 0.141268
## hp -0.037479 0.009605 -3.902 0.000546 ***
## wt -2.878575 0.904971 -3.181 0.003574 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11