Analysis of Transmission Type Impacts on MPG

Executive Summary

Cars with manual transmissions have higher miles per gallon (mpg) than cars with automatic transmissions; quantifying the improvement, however, is more difficult. The final model in this analysis estimates that a manual transmission increases mpg by 4.2995 on average, holding weight and quarter mile time fixed. Despite promising diagnostics, the model is incomplete: it excludes known-knowns (cylinders, gears, hp) because of limited degrees of freedom and correlation among regressors, and it includes variables that are confounded in that they represent both driver behavior (driving fast) and automobile efficiency.

Description of the Analysis

This analysis seeks to answer two questions: 1.) Which transmission type is better for MPG? 2.) What is the difference in MPG between the two transmission types?

Final Model and Findings

The final model for this analysis is estimated as:

##       Estimate Std. Error   t value     Pr(>|t|)
## am    4.299519  1.0241147  4.198279 2.329423e-04
## wt   -3.185455  0.4827586 -6.598442 3.128844e-07
## qsec  1.599823  0.1021276 15.664944 1.091522e-15
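
The code behind this table is not shown in the report. A minimal sketch that should reproduce it, assuming R's built-in mtcars data with am left as a 0/1 numeric column (0 = automatic, 1 = manual) and the no-intercept formula that appears later in the ANOVA table:

```r
# Assumption: mtcars as shipped with R; am is numeric 0/1, not a factor.
fit <- lm(mpg ~ am + wt + qsec - 1, data = mtcars)  # "- 1" drops the intercept

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
print(coefs)
```
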
Coefficient Interpretation

Each of the coefficients in the model is statistically different from 0 at the .001 level of significance (every p-value is below .001), so the hypothesis that a coefficient equals 0 is rejected at the 99.9% confidence level.
Because the model is additive with no interaction terms, the coefficients of wt and qsec apply to both transmission types; because no intercept is included, am is the only term that separates manual from automatic cars. Given a weight and quarter mile time, a manual transmission increases mpg by 4.2995 on average; holding the other regressors fixed, every 1,000 lb increase in the weight of the car decreases mpg by 3.1855 on average, and every 1 second increase in the quarter mile time increases mpg by 1.5998 on average.
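
To make the reading of the am coefficient concrete, the sketch below predicts mpg for two hypothetical cars that differ only in transmission type. The weight and quarter mile values are illustrative assumptions, not cars from the data; the gap between the two predictions is the am coefficient whatever values are chosen.

```r
fit <- lm(mpg ~ am + wt + qsec - 1, data = mtcars)

# Two hypothetical cars, identical except for transmission (illustrative values).
cars <- data.frame(am = c(0, 1), wt = c(3.0, 3.0), qsec = c(18, 18))
pred <- predict(fit, newdata = cars)

# Manual minus automatic, at the same weight and quarter mile time.
gap <- unname(pred[2] - pred[1])
print(gap)
```
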

The R-squared and Adjusted R-squared are 0.9871223 and 0.9857902, respectively. Finally, the F-statistic, model DoF, and residual DoF are 740.9865203, 3, and 29, respectively. The F-test is significant at the .001 level of significance, which rejects the hypothesis that all of the coefficients are 0. At the 99.9% confidence level, we infer that at least one of the variables in the final model belongs in the true model.
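
When a model has no intercept, R's summary() measures R-squared against the uncentered total sum of squares (the sum of squared mpg values) rather than against variation around the mean, so these figures are not directly comparable to the R-squared of a model with an intercept. A minimal sketch of how the reported values can be recomputed by hand, assuming the same mtcars fit:

```r
fit <- lm(mpg ~ am + wt + qsec - 1, data = mtcars)

rss <- sum(residuals(fit)^2)          # residual sum of squares
tss_uncentered <- sum(mtcars$mpg^2)   # no intercept: total SS is not mean-centered

r2 <- 1 - rss / tss_uncentered
k <- length(coef(fit))                # 3 regressors
f_stat <- ((tss_uncentered - rss) / k) / (rss / df.residual(fit))

print(c(r.squared = r2, f.statistic = f_stat))
```
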

Exploratory Analysis

Factor Variables

An exploration of the data shows that the variables of the mtcars dataset are a combination of categorical and continuous variables. The panel plot of histograms (Figures 1-4) in the Appendix shows the distribution of four of the continuous variables being considered for inclusion. Figure 5 shows the distribution of the dependent variable “Miles per US Gallon of Gas (mpg)”. The mean of mpg is 20.090625 with a standard deviation of 6.0269481.
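
The summary statistics for mpg can be checked directly against the data. A short sketch, assuming the mtcars dataset bundled with R:

```r
# mtcars ships with R; every column is stored as numeric, including the
# categorical ones (cyl, vs, am, gear, carb).
str(mtcars)

mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)
print(c(mean = mpg_mean, sd = mpg_sd))
```
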

Continuous Variables

The correlation matrix shows strong correlation of disp, wt, and drat with mpg; however, the regressors are also strongly correlated with one another: disp with wt, drat with disp, hp with wt, and hp with disp.

##             mpg       disp         wt        qsec        drat         hp
## mpg   1.0000000 -0.8475514 -0.8676594  0.41868403  0.68117191 -0.7761684
## disp -0.8475514  1.0000000  0.8879799 -0.43369788 -0.71021393  0.7909486
## wt   -0.8676594  0.8879799  1.0000000 -0.17471588 -0.71244065  0.6587479
## qsec  0.4186840 -0.4336979 -0.1747159  1.00000000  0.09120476 -0.7082234
## drat  0.6811719 -0.7102139 -0.7124406  0.09120476  1.00000000 -0.4487591
## hp   -0.7761684  0.7909486  0.6587479 -0.70822339 -0.44875912  1.0000000
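
A sketch of how this matrix can be produced and screened for strongly correlated pairs. The 0.65 cutoff is an illustrative assumption, not a threshold stated in the analysis.

```r
vars <- c("mpg", "disp", "wt", "qsec", "drat", "hp")
cor_mat <- cor(mtcars[, vars])
print(cor_mat)

# List each pair once (upper triangle) whose absolute correlation exceeds the cutoff.
cutoff <- 0.65  # illustrative assumption
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
print(pairs)
```
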
Naive Models

Multiple models were estimated to determine which combination of variables best predicts mpg. These naive models are single-factor models, each with one regressor. Figures 6 through 10 in “Figures for Naive Models” show a series of naive models (red lines) using the continuous variables from the data, including a confidence interval (black lines), a prediction interval (green lines), and the actual values (blue circles).
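
The plotting code for these figures is not included. One way such a figure could be drawn for a single regressor, sketched here for weight with the color scheme described above (the grid size and axis labels are assumptions):

```r
fit_wt <- lm(mpg ~ wt, data = mtcars)

# Evaluate the fit on an evenly spaced grid across the observed weights.
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
conf_band <- predict(fit_wt, newdata = grid, interval = "confidence")
pred_band <- predict(fit_wt, newdata = grid, interval = "prediction")

plot(mtcars$wt, mtcars$mpg, col = "blue",
     xlab = "Weight (1,000 lb)", ylab = "mpg")           # actual values
lines(grid$wt, conf_band[, "fit"], col = "red")           # fitted line
matlines(grid$wt, conf_band[, c("lwr", "upr")], col = "black", lty = 2)
matlines(grid$wt, pred_band[, c("lwr", "upr")], col = "green", lty = 2)
```

The prediction band is always wider than the confidence band because it adds the residual variance of a new observation to the uncertainty in the fitted mean.
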

Variable Selection Strategy

Including multiple factor variables would use up the model’s degrees of freedom, so only transmission type is included, since it is central to the questions being answered. From the exploratory analysis, three of the continuous variables seem most useful based on the dispersion of the actual values and the breadth of the confidence and prediction intervals: Displacement, Weight, and Horsepower. Weight has an intuitive connection to miles per gallon, so that variable is included. Horsepower and Displacement are therefore excluded because of their high correlations with Weight. R’s stepwise variable selection returned Displacement, Weight, Quarter Mile Time, and Transmission type. Because of its high correlation with Weight, Displacement was nonetheless excluded; Quarter Mile Time was included despite its relatively low correlation with miles per gallon. The intercept was removed because of the absurd interpretation it would have: “the average miles per gallon of a car with 0 pounds weight, a 0 second quarter mile time, and an automatic transmission.”
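
The report does not state which starting model or candidate set was passed to the stepwise search. The sketch below assumes base R's step() (AIC-based), starting from transmission type plus the continuous variables discussed above; with a different scope or criterion the selected set can differ, so the output need not match the variables listed in the text.

```r
# Assumption: candidate set is am plus the five continuous regressors.
full <- lm(mpg ~ am + disp + wt + qsec + drat + hp, data = mtcars)

# AIC-based stepwise search in both directions; trace = 0 silences the log.
selected <- step(full, direction = "both", trace = 0)
print(formula(selected))
```
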

Model Uncertainty and Residual Diagnostics

The residual variation of the multivariate model is 2.4551462. The mean of the residuals, 0.0375163, is close to 0 but not exactly 0. This is expected given the exclusion of the intercept: without an intercept, least squares does not force the residuals to sum to 0.
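
The effect of the intercept on the residual mean can be seen by fitting the model both ways. A minimal sketch, assuming the same mtcars fit:

```r
no_int <- lm(mpg ~ am + wt + qsec - 1, data = mtcars)
with_int <- lm(mpg ~ am + wt + qsec, data = mtcars)

mean_no_int <- mean(residuals(no_int))      # small but nonzero
mean_with_int <- mean(residuals(with_int))  # zero up to floating-point error
print(c(no_intercept = mean_no_int, with_intercept = mean_with_int))
```
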

95% confidence intervals surrounding the coefficients are:

##          2.5 %    97.5 %
## am    2.204969  6.394069
## wt   -4.172807 -2.198102
## qsec  1.390948  1.808697

These are interpreted as follows: each interval is constructed so that, under repeated sampling, 95% of such intervals would contain the true coefficient.
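
These intervals are the estimate plus or minus a t quantile times the standard error. A sketch that computes them with confint() and by hand, assuming the same mtcars fit:

```r
fit <- lm(mpg ~ am + wt + qsec - 1, data = mtcars)

ci <- confint(fit, level = 0.95)

# Same intervals by hand: estimate +/- t quantile (29 residual df) * std. error
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))
t_crit <- qt(0.975, df = df.residual(fit))
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(ci)
print(manual_ci)
```
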

ANOVA provides more color surrounding the R-squared and the F-Statistic of three of the compared models:

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec - 1
## Model 3: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 180.83  1    540.06 89.3270 3.295e-10 ***
## 3     28 169.29  1     11.55  1.9098    0.1779    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

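With the final model (Model 2), the regressors account for far more of the variation in the dependent variable than transmission type alone: the residual sum of squares falls from 720.90 in Model 1 to 180.83 (F = 89.33, p = 3.295e-10). Adding the intercept back (Model 3) lowers the residual sum of squares only to 169.29, a change that is not statistically significant (p = 0.1779). The ANOVA table therefore gives no evidence that excluding the intercept harmed the fit.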
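
The comparison can be reproduced with anova() on the three fits named in the table. A minimal sketch, assuming the built-in mtcars data:

```r
m1 <- lm(mpg ~ am, data = mtcars)                  # transmission type only
m2 <- lm(mpg ~ am + wt + qsec - 1, data = mtcars)  # final model, no intercept
m3 <- lm(mpg ~ wt + qsec + am, data = mtcars)      # same regressors, with intercept

tab <- anova(m1, m2, m3)
print(tab)
```
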

Residual Diagnostics

The Multivariate Model shows no significant evidence of heteroskedasticity or serial correlation, according to the results of the Breusch-Pagan and Durbin-Watson tests shown below (p-values of 0.103 and 0.542, respectively).

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.652473    Df = 1     p = 0.1033889
##  lag Autocorrelation D-W Statistic p-value
##    1       0.0251951      1.860574   0.542
##  Alternative hypothesis: rho != 0
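
The layout of this output resembles that of ncvTest() and durbinWatsonTest() from the car package, but that is an assumption; the report does not name the functions used. The sketch below computes the two test statistics in base R instead, so it needs no extra packages. It does not reproduce the Durbin-Watson p-value, which depends on how the test's reference distribution is approximated.

```r
fit <- lm(mpg ~ am + wt + qsec - 1, data = mtcars)
e <- residuals(fit)
n <- length(e)

# Score test for non-constant variance against the fitted values:
# scale squared residuals by the ML variance estimate, regress them on the
# fitted values, and take half the regression sum of squares.
u <- e^2 / (sum(e^2) / n)
aux <- lm(u ~ fitted(fit))
reg_ss <- sum((fitted(aux) - mean(u))^2)
chisq <- reg_ss / 2
p_bp <- pchisq(chisq, df = 1, lower.tail = FALSE)

# Durbin-Watson statistic at lag 1 (values near 2 suggest no serial correlation).
dw <- sum(diff(e)^2) / sum(e^2)

print(c(chisq = chisq, p = p_bp, dw = dw))
```
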

Figures 11 through 13 show residual plots against the independent variables for each model. Figures 14 and 15 show a comparison of each model’s predicted values (blue circles) to the actual values (red circles). This further supports the better performance of the Multivariate model.

Conclusions

1.) Question 1: The analysis indicates that across all models, particularly the model in which transmission type is the only regressor and the final model, a manual transmission has a positive impact on miles per gallon relative to an automatic transmission.

2.) Question 2: Quantifying the difference between the two is more difficult because that difference is conditional on the model being estimated. The naive model with transmission type alone suggests a 7.245 mpg increase on average for manual transmissions (95% confidence interval: 3.6 to 10.85). The final model suggests that, for a given weight and quarter mile time, a manual transmission leads to a 4.2995 mpg increase on average (95% confidence interval: 2.2 to 6.4).
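
The naive estimate and its interval quoted above can be checked with a one-regressor fit. A minimal sketch, assuming the built-in mtcars data with am coded 0/1:

```r
naive <- lm(mpg ~ am, data = mtcars)

am_effect <- coef(naive)[["am"]]     # mean mpg difference, manual minus automatic
am_ci <- confint(naive)["am", ]      # 95% confidence interval for that difference
print(am_effect)
print(am_ci)
```
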

Appendix