19 June 2018
We examined the effect of transmission type on vehicle mileage using R’s mtcars dataset.¹ The data consist of fuel consumption and 10 aspects of automobile design and performance for 32 cars from 1973 and 1974. Analysis of several multiple regression models revealed that having a manual transmission can increase vehicle mileage by up to 7.2 miles per gallon, on average, compared with having an automatic transmission. However, the effect of transmission type was statistically significant only when transmission type was the sole predictor of mileage; its significance disappeared when other variables were added to the models. In general, transmission type does affect vehicle mileage, but the magnitude of the effect varies according to the presence of numerous other covariates, each of which contributes a small, but cumulatively significant, influence on mileage.
We focused on two questions: (1) Is an automatic or manual transmission better for vehicle mileage? (2) How different is the mileage between automatic and manual transmissions?
The variables, how they were modeled, and their units are: mpg (continuous, miles/US gallon); cyl (factor, number of cylinders: 4, 6, or 8); disp (continuous, displacement in cu.in.); hp (continuous, gross horsepower); drat (continuous, rear axle ratio); wt (continuous, weight in 1000s of lb); qsec (continuous, 1/4 mile time in seconds); vs (factor, engine type: 0 = V-shaped, 1 = straight); am (factor, transmission: 0 = automatic, 1 = manual); gear (factor, number of forward gears: 3, 4, or 5); carb (factor, number of carburetors). We collapsed carb into two categories (1 to 3 = “few”, 4 to 8 = “many”) because of sparse cells. No interaction terms were included in any model because of the small sample size (n = 32).²
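The recoding described above can be sketched in base R as follows. This is a minimal illustration, not the original analysis script: the data frame name `cars` is an assumption, and the `.f` suffix on the factor columns is chosen only to match the term names that appear in the coefficient tables (for example `am.f1`, `carb.fmany`).

```r
# Work on a copy so the built-in mtcars data frame is left untouched.
# The name `cars` is an assumption made for this sketch.
cars <- mtcars

# Categorical predictors as factors; the ".f" suffix mirrors the
# term names shown in the coefficient tables.
cars$cyl.f  <- factor(cars$cyl)
cars$vs.f   <- factor(cars$vs)
cars$am.f   <- factor(cars$am)
cars$gear.f <- factor(cars$gear)

# Collapse carburetor counts into two levels: 1 to 3 = "few", 4 to 8 = "many".
cars$carb.f <- factor(ifelse(cars$carb <= 3, "few", "many"),
                      levels = c("few", "many"))

# Cell counts before and after collapsing, to see the sparse cells.
table(cars$carb)
table(cars$carb.f)
```

Listing "few" first makes it the reference level, so the fitted term is named `carb.fmany`, as in the tables.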
Figure 1 presents an exploratory analysis of the six continuous variables. Two variables departed slightly from normality (Shapiro-Wilk test: disp, p = 0.02; hp, p = 0.05), but the univariate histograms show that all six are essentially symmetrical in distribution. Each of the five continuous predictors shows a more-or-less linear relationship with the dependent variable mpg, and all are significantly correlated with it. However, eight of the ten pairwise correlations among the five continuous predictors are statistically significant, some highly so, which is problematic for multiple regression because collinear predictors inflate the standard errors of the coefficients.
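The normality checks and the count of significant pairwise correlations can be reproduced roughly as below. This is a sketch under two assumptions: that the correlations are Pearson correlations tested with `cor.test` (Figure 1 reports Pearson correlations), and that "significant" means p < 0.05. The object names are illustrative.

```r
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# Shapiro-Wilk p-value for each continuous variable.
shapiro_p <- sapply(mtcars[vars], function(x) shapiro.test(x)$p.value)
round(shapiro_p, 3)

# All 10 unordered pairs among the five continuous predictors.
predictors <- setdiff(vars, "mpg")
pred_pairs <- combn(predictors, 2)

# p-value of the Pearson correlation test for each pair.
cor_p <- apply(pred_pairs, 2, function(p) {
  cor.test(mtcars[[p[1]]], mtcars[[p[2]]])$p.value
})
names(cor_p) <- apply(pred_pairs, 2, paste, collapse = "~")

# Number of predictor pairs significantly correlated at the 0.05 level.
sum(cor_p < 0.05)
```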
Table 1 shows a minimal regression model with am as the sole predictor of mpg. The coefficient on am is highly significant (p = 0.0003), with a 95% confidence interval of 3.64 to 10.85. Taken alone, this result suggests that a manual transmission increases vehicle mileage by 7.2 miles per gallon, on average, compared with an automatic transmission.
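A minimal sketch of this model in base R is shown below; the object names `cars` and `minimal` are assumptions made for illustration.

```r
# Copy of the data with transmission as a factor (0 = automatic, 1 = manual).
cars <- mtcars
cars$am.f <- factor(cars$am)

# Minimal model: transmission type as the sole predictor of mileage.
minimal <- lm(mpg ~ am.f, data = cars)

# Coefficient table: estimate, standard error, t-statistic, p-value.
summary(minimal)$coefficients

# 95% confidence interval for the manual-versus-automatic difference.
ci <- confint(minimal, "am.f1", level = 0.95)
ci
```

Because am.f is a two-level factor, the am.f1 coefficient is simply the difference between the mean mpg of manual cars and the mean mpg of automatic cars.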
Recognizing the high degree of multicollinearity among the continuous predictors, we nevertheless fitted the full model, including all predictors (Table 2). The overall model was highly significant, with an adjusted R² of 0.81. None of the individual regression coefficients was significant, a symptom of multicollinearity.
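The full model could be fitted along the following lines. This is a sketch, not the original script: the helper `prep_cars()` and the object names are hypothetical, and the factor coding follows the description of the variables given earlier.

```r
# Hypothetical helper: mtcars with the five categorical variables as factors.
prep_cars <- function() {
  cars <- mtcars
  cars$cyl.f  <- factor(cars$cyl)
  cars$vs.f   <- factor(cars$vs)
  cars$am.f   <- factor(cars$am)
  cars$gear.f <- factor(cars$gear)
  cars$carb.f <- factor(ifelse(cars$carb <= 3, "few", "many"),
                        levels = c("few", "many"))
  cars
}
cars <- prep_cars()

# Full model: five continuous predictors plus five factors, no interactions.
full <- lm(mpg ~ disp + hp + drat + wt + qsec +
             vs.f + am.f + gear.f + carb.f + cyl.f, data = cars)
s <- summary(full)

# Adjusted R-squared of the overall model.
s$adj.r.squared

# p-value of the overall F-test.
f <- s$fstatistic
overall_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
overall_p

# Number of non-intercept coefficients individually significant at 0.05.
n_sig <- sum(s$coefficients[-1, "Pr(>|t|)"] < 0.05)
n_sig
```

A significant overall F-test combined with no individually significant coefficients is the pattern the text attributes to multicollinearity.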
To address the multicollinearity, we fitted a third, reduced model that used disp as the only continuous predictor, along with the five factors (Table 3). Figure 1 shows that disp is highly positively correlated with hp and wt, highly negatively correlated with drat, and also significantly negatively correlated with qsec. We therefore treated disp as a reasonable surrogate for the other continuous predictors. The reduced model as a whole was highly significant, with an adjusted R² of 0.77, but again none of the individual predictors was statistically significant.
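The reasoning behind the reduced model can be illustrated with the sketch below, which first checks the correlations of disp with the other continuous predictors and then fits the reduced model. The helper `prep_cars()` and the object names are hypothetical.

```r
# Hypothetical helper: mtcars with the five categorical variables as factors.
prep_cars <- function() {
  cars <- mtcars
  cars$cyl.f  <- factor(cars$cyl)
  cars$vs.f   <- factor(cars$vs)
  cars$am.f   <- factor(cars$am)
  cars$gear.f <- factor(cars$gear)
  cars$carb.f <- factor(ifelse(cars$carb <= 3, "few", "many"),
                        levels = c("few", "many"))
  cars
}
cars <- prep_cars()

# Pearson correlations of disp with the other continuous predictors.
disp_cor <- cor(cars$disp, cars[c("hp", "drat", "wt", "qsec")])
round(disp_cor, 2)

# Reduced model: disp stands in for all continuous predictors.
reduced <- lm(mpg ~ disp + vs.f + am.f + gear.f + carb.f + cyl.f,
              data = cars)
summary(reduced)$adj.r.squared
summary(reduced)$coefficients
```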
Residuals for all three models were normally distributed (Shapiro-Wilk test, p = 0.86, 0.31, and 0.40 for minimal, full, and reduced models). Figure 2 shows a residuals plot for the full model.
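A residual normality check of this kind might look like the following sketch. The helper `prep_cars()` and the list name `models` are hypothetical, and the model formulas are reconstructed from the terms listed in the coefficient tables.

```r
# Hypothetical helper: mtcars with the five categorical variables as factors.
prep_cars <- function() {
  cars <- mtcars
  cars$cyl.f  <- factor(cars$cyl)
  cars$vs.f   <- factor(cars$vs)
  cars$am.f   <- factor(cars$am)
  cars$gear.f <- factor(cars$gear)
  cars$carb.f <- factor(ifelse(cars$carb <= 3, "few", "many"),
                        levels = c("few", "many"))
  cars
}
cars <- prep_cars()

models <- list(
  minimal = lm(mpg ~ am.f, data = cars),
  full    = lm(mpg ~ disp + hp + drat + wt + qsec +
                 vs.f + am.f + gear.f + carb.f + cyl.f, data = cars),
  reduced = lm(mpg ~ disp + vs.f + am.f + gear.f + carb.f + cyl.f,
               data = cars)
)

# Shapiro-Wilk p-value for the residuals of each model;
# large p-values give no evidence against normality.
resid_p <- sapply(models, function(m) shapiro.test(residuals(m))$p.value)
round(resid_p, 2)
```

A plot comparable to Figure 2 could then be drawn with `plot(fitted(models$full), residuals(models$full))`.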
Table 4 presents an ANOVA of the three models. Compared with the minimal model, the reduced model explained a significantly greater proportion of total variance (p = 0.00002). The full model, however, did not explain significantly more variance than the reduced model (p = 0.15).
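The nested-model comparison can be sketched with `anova()`, which applies sequential F-tests to models listed from smallest to largest. The helper `prep_cars()` and the object names are hypothetical, and the formulas are reconstructed from the coefficient tables.

```r
# Hypothetical helper: mtcars with the five categorical variables as factors.
prep_cars <- function() {
  cars <- mtcars
  cars$cyl.f  <- factor(cars$cyl)
  cars$vs.f   <- factor(cars$vs)
  cars$am.f   <- factor(cars$am)
  cars$gear.f <- factor(cars$gear)
  cars$carb.f <- factor(ifelse(cars$carb <= 3, "few", "many"),
                        levels = c("few", "many"))
  cars
}
cars <- prep_cars()

minimal <- lm(mpg ~ am.f, data = cars)
reduced <- lm(mpg ~ disp + vs.f + am.f + gear.f + carb.f + cyl.f,
              data = cars)
full    <- lm(mpg ~ disp + hp + drat + wt + qsec +
                vs.f + am.f + gear.f + carb.f + cyl.f, data = cars)

# Row 2 tests reduced against minimal; row 3 tests full against reduced.
tab <- anova(minimal, reduced, full)
tab
```

Each row's F-statistic asks whether the extra terms reduce the residual sum of squares by more than would be expected by chance, given the degrees of freedom they consume.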
In the full model (Table 2), each individual predictor contributes a small effect towards explaining variation in vehicle mileage. In the aggregate, all of these small effects sum to produce an overall highly significant regression model, but one in which none of the individual predictors themselves are statistically significant. The same is true for the reduced model (Table 3), which contained only a single continuous predictor (engine displacement, disp) in an attempt to control for multicollinearity. In the minimal model, with just transmission type as predictor, transmission type is highly significant, but that is because transmission type is carrying all the cumulative effects of the other predictors that are not in the model.
The only statistically reliable model, in terms of interpreting the coefficients, is the minimal model, because it is the only one of the three in which the 95% confidence interval for the coefficient on transmission type did not include zero. In both the full and reduced models, no individual predictor can be statistically distinguished from zero, yet the models as a whole are highly significant. The full model would be preferred if all we were interested in was predicting mpg, because that model had the highest explained variance (adjusted R² = 0.81). But we are unable to determine which of the predictors have the most influence on mpg, because each of the individual effects is so small.
We could use the minimal model to make the statement that having a manual transmission would on average increase vehicle mileage by 7.2 miles per gallon. But our analysis has shown that the effect of transmission type is not due to transmission type alone, but to the effects of the other predictors that were not included in the model.

Table 1. Minimal model: mpg regressed on transmission type (am) alone.

| Term | Coefficient | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| (Intercept) | 17.147368 | 1.124603 | 15.247492 | 0.00000000 |
| am.f1 | 7.244939 | 1.764422 | 4.106127 | 0.00028502 |

Table 2. Full model: mpg regressed on all predictors.

| Term | Coefficient | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| (Intercept) | 16.105 | 17.343 | 0.929 | 0.365 |
| disp | 0.005 | 0.016 | 0.322 | 0.751 |
| hp | -0.042 | 0.028 | -1.520 | 0.145 |
| drat | 0.697 | 2.183 | 0.319 | 0.753 |
| wt | -2.935 | 1.788 | -1.642 | 0.117 |
| qsec | 0.691 | 0.808 | 0.855 | 0.403 |
| vs.f1 | 1.776 | 2.408 | 0.738 | 0.470 |
| am.f1 | 2.996 | 2.291 | 1.307 | 0.207 |
| gear.f4 | -0.039 | 2.765 | -0.014 | 0.989 |
| gear.f5 | 2.018 | 2.858 | 0.706 | 0.489 |
| carb.fmany | 0.423 | 2.837 | 0.149 | 0.883 |
| cyl.f6 | -0.820 | 2.391 | -0.343 | 0.735 |
| cyl.f8 | 2.907 | 5.582 | 0.521 | 0.609 |

Table 3. Reduced model: mpg regressed on disp and the five factors.

| Term | Coefficient | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| (Intercept) | 23.609 | 3.677 | 6.421 | 0.000 |
| disp | -0.010 | 0.013 | -0.762 | 0.454 |
| vs.f1 | 1.049 | 2.144 | 0.489 | 0.629 |
| am.f1 | 3.602 | 1.975 | 1.824 | 0.081 |
| gear.f4 | 0.743 | 2.489 | 0.299 | 0.768 |
| gear.f5 | -0.013 | 2.683 | -0.005 | 0.996 |
| carb.fmany | -3.443 | 1.833 | -1.878 | 0.073 |
| cyl.f6 | -2.131 | 2.057 | -1.036 | 0.311 |
| cyl.f8 | -3.752 | 3.590 | -1.045 | 0.307 |

Table 4. ANOVA comparing the minimal (1), reduced (2), and full (3) models.

| Model | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 1 | 30 | 720.8966 | NA | NA | NA | NA |
| 2 | 23 | 188.0170 | 7 | 532.8796 | 10.80644 | 0.00002 |
| 3 | 19 | 133.8449 | 4 | 54.1721 | 1.92250 | 0.14804 |
Figure 1. One Graph To Rule Them All: Exploratory data analysis, with Pearson correlations, of the continuous variables. mpg: miles per gallon; disp: engine displacement (cu.in.); hp: horsepower; drat: rear axle ratio; wt: vehicle weight (1000s of pounds); qsec: 1/4 mile time (seconds).
Figure 2. Residuals plot for the full model; see Table 2. Shapiro-Wilk normality test p = 0.3145.