Analysis of Transmission Type Impacts on MPG

Executive Summary

This analysis evaluates the impact of transmission type on miles per gallon (mpg). Two models are estimated: a naive model in which the transmission-type dummy variable is the only predictor, and a multivariate model with three predictors and no intercept. Although both models yield estimates of the transmission effect, the multivariate model remains incomplete despite its high accuracy, low residual variation, and simplicity. It excludes known-knowns (cylinders, gears, hp) for lack of degrees of freedom, and it includes variables that are confounded in that they represent both driver behavior (driving fast) and automobile efficiency. To best estimate the impact of transmission type on mpg, additional variables must be controlled for. Even so, the results suggest that transmission type does have a significant impact on the miles per gallon a car achieves.

Description of the Analysis

This analysis seeks to answer two questions:

1.) What transmission type is better for MPG?

2.) What is the difference in MPG between the two transmission types?

Exploratory Analysis

Factor Variables

An exploration of the data shows that the variables in mtcars are a mix of categorical and continuous variables. The categorical variables are “number of cylinders (cyl)”, “number of forward gears (gear)”, “number of carburetors (carb)”, “transmission type (am)”, and “V/S (vs)”. Including all of them would eat away at the regression’s degrees of freedom, since the data set has only 32 observations. Only Transmission Type is considered for inclusion.

The panel plot of histograms (Figures 1-5) in the Appendix shows the distributions of the variables “Displacement (disp)” measured in cubic inches, “Rear axle ratio (drat)”, “weight (wt)” measured in 1,000 lbs, and “Quarter Mile Time (qsec)” measured in seconds.

Figure 5 shows the distribution of the dependent variable “Miles per US Gallon of Gas (mpg)”. The mean of mpg is 20.090625 with a standard deviation of 6.0269481.
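The summary statistics quoted above can be reproduced from R's built-in mtcars data set. The following is a minimal sketch; the names `mpg_mean` and `mpg_sd` are illustrative and do not come from the original analysis.

```r
# mtcars ships with base R (datasets package), so nothing needs to be downloaded.
data(mtcars)

mpg_mean <- mean(mtcars$mpg)  # sample mean of miles per gallon
mpg_sd   <- sd(mtcars$mpg)    # sample standard deviation (n - 1 denominator)

print(c(mean = mpg_mean, sd = mpg_sd))
```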

Continuous Variables

The correlation matrix shows strong correlation of disp, wt, and drat with mpg; however, there is also strong correlation between disp and wt, between wt and drat, and between drat and disp. Including any two of these three variables in a regression model introduces multicollinearity among the predictors, which inflates standard errors and can make individual coefficients appear insignificant.

##             mpg       disp         wt        qsec        drat         hp
## mpg   1.0000000 -0.8475514 -0.8676594  0.41868403  0.68117191 -0.7761684
## disp -0.8475514  1.0000000  0.8879799 -0.43369788 -0.71021393  0.7909486
## wt   -0.8676594  0.8879799  1.0000000 -0.17471588 -0.71244065  0.6587479
## qsec  0.4186840 -0.4336979 -0.1747159  1.00000000  0.09120476 -0.7082234
## drat  0.6811719 -0.7102139 -0.7124406  0.09120476  1.00000000 -0.4487591
## hp   -0.7761684  0.7909486  0.6587479 -0.70822339 -0.44875912  1.0000000
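A correlation matrix like the one above can be computed with base R's `cor()`. The sketch below also lists the predictor pairs whose absolute correlation exceeds 0.7; that cutoff is an illustrative assumption, not a threshold used in the report.

```r
data(mtcars)

# Continuous variables shown in the correlation matrix above.
cont_vars <- c("mpg", "disp", "wt", "qsec", "drat", "hp")
cor_mat <- cor(mtcars[, cont_vars])
print(round(cor_mat, 3))

# Drop the mpg row/column to look only at correlations among predictors.
pred_cor <- cor_mat[-1, -1]

# Illustrative cutoff (assumption): flag pairs with |r| > 0.7, upper triangle only.
high <- which(abs(pred_cor) > 0.7 & upper.tri(pred_cor), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(pred_cor)[high[, "row"]],
  var2 = colnames(pred_cor)[high[, "col"]],
  r    = unname(pred_cor[high])
)
print(flagged)
```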

Model Selection

Naive Models

Multiple models were estimated to determine which combination of variables best predicts mpg. These naive models are single-factor models. Figures 6 through 10 in “Figures for Naive Models” show a series of naive models using the continuous variables from the data; each plot includes a confidence interval (black lines), a prediction interval (green lines), and the actual values. Judged by the dispersion of the actual values and the width of the confidence and prediction intervals, the three most useful variables are Displacement, Weight, and Horsepower.

Naive Model with Transmission Type as a Single Predictor
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04

The summary results show that both the intercept and the variable “transmission type” are significant at the .001 level. The R-squared is 0.3597989. The F-statistic, 16.8602788 on 1 and 30 degrees of freedom, is significant at the .001 level. The model is therefore statistically significant, but its R-squared shows that transmission type alone explains only about 36% of the variation in mpg, so the model lacks predictive power.

Coefficient Interpretation

In this case, the intercept, 17.1473684, is the mean miles per gallon of automobiles with automatic transmissions. The coefficient of the variable “am” is interpreted as “cars with a manual transmission average 7.245 mpg more than cars with an automatic transmission”. Consequently, any prediction with am=1 (manual transmission) returns the average mpg of manual transmissions, 24.3923077.
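The link between the naive model's coefficients and the group means can be checked directly. This is a minimal sketch assuming the naive model is `lm(mpg ~ am)` on mtcars, with `am` coded 0 for automatic and 1 for manual; the object names are illustrative.

```r
data(mtcars)

# Single-predictor model: am = 0 (automatic), am = 1 (manual).
naive_fit <- lm(mpg ~ am, data = mtcars)
b <- coef(naive_fit)

# Mean mpg within each transmission group.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group_means)

# The intercept is the automatic mean; intercept + slope is the manual mean.
print(c(automatic = unname(b["(Intercept)"]),
        manual    = unname(b["(Intercept)"] + b["am"])))

# 95% confidence intervals for both coefficients.
print(confint(naive_fit, level = 0.95))
```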

95% confidence intervals surrounding the coefficients are:

##                2.5 %   97.5 %
## (Intercept) 14.85062 19.44411
## am           3.64151 10.84837

Multivariate Model Using Stepwise Selection Algorithm

With an intercept, the constant would represent the mpg of an automatic-transmission car with zero weight and a 0-second quarter mile, which has no physical meaning. The analysis therefore moves forward with a model without an intercept.

## 
## Call:
## lm(formula = mpg ~ disp + wt + qsec + am - 1, data = mtcars)
## 
## Coefficients:
##     disp        wt      qsec        am  
##  0.01202  -4.61279   1.70551   4.18085
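The report does not state which starting formula or search direction was passed to the selection routine. The sketch below therefore rests on assumptions: backward elimination by AIC using base R's `step()`, starting from an intercept-free model with the continuous predictors plus `am`. The terms it selects may differ from the output shown above.

```r
data(mtcars)

# Assumed starting model (not given in the report): continuous predictors
# plus the transmission dummy, with the intercept removed via "- 1".
full_fit <- lm(mpg ~ disp + wt + qsec + drat + hp + am - 1, data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
step_fit <- step(full_fit, direction = "backward", trace = 0)

print(formula(step_fit))
print(coef(step_fit))
```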

Weight and Displacement are highly correlated, so only one of them is kept. Displacement is excluded because the theoretical link between Weight and MPG is more intuitive. The results of the multivariate model are:

##       Estimate Std. Error   t value     Pr(>|t|)
## am    4.299519  1.0241147  4.198279 2.329423e-04
## wt   -3.185455  0.4827586 -6.598442 3.128844e-07
## qsec  1.599823  0.1021276 15.664944 1.091522e-15

The summary results once again show that all of the variables are statistically significant at the .001 level; the R-squared and adjusted R-squared are 0.9871223 and 0.9857902, respectively. Finally, the F-statistic, 740.9865203 on 3 and 29 degrees of freedom, is significant at the .001 level, which suggests that at least one of the coefficients differs from 0. By these measures the model has much higher predictive power than the naive models, although R computes R-squared for a no-intercept model about zero rather than about the mean of mpg, so the two R-squared values are not directly comparable.

Coefficient Interpretation

Coefficients are interpreted differently here than in the naive case. Because no intercept is included, the coefficients of wt and qsec describe the baseline case of an automatic transmission (am=0). Holding the other variables fixed, having a manual transmission increases mpg by 4.2995 miles per gallon on average; every 1,000 lb increase in the weight of the car decreases mpg by 3.1855 on average; and every 1-second increase in the quarter mile time increases mpg by 1.5998 on average.
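To make the interpretation concrete, the sketch below refits the intercept-free model and predicts mpg for two hypothetical cars that differ only in transmission type. The weight of 3,000 lb and the quarter mile time of 18 seconds are illustrative values, not cases discussed in the report.

```r
data(mtcars)

# Intercept-free model with weight, quarter mile time, and transmission type.
multi_fit <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)
print(coef(multi_fit))

# Two hypothetical cars: same weight and quarter mile time, different transmission.
new_cars <- data.frame(wt = c(3, 3), qsec = c(18, 18), am = c(0, 1))
pred <- predict(multi_fit, newdata = new_cars)
names(pred) <- c("automatic", "manual")
print(pred)

# With wt and qsec held fixed, the gap between the two predictions
# equals the am coefficient.
print(unname(pred["manual"] - pred["automatic"]))
```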

95% confidence intervals surrounding the coefficients are:

##          2.5 %    97.5 %
## am    2.204969  6.394069
## wt   -4.172807 -2.198102
## qsec  1.390948  1.808697

Model Comparison, and Residual Diagnostics

Naive Model: The residual variation of the naive model with transmission type as the single factor is 4.9020288, which seems high given the interpretation of the coefficients. The mean of the residuals is -6.5919492 × 10^{-17}, which is zero up to floating-point error, as expected for a least-squares fit that includes an intercept.

Multivariate Model: The residual variation of the multivariate model is 2.4551462, about half that of the naive model (4.9020288), so its predictions track the actual values much more closely. The mean of the residuals, 0.0375163, is close to 0 but not exactly 0. This is acceptable given the exclusion of the intercept.
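The report does not define “residual variation”. Both quoted values are numerically consistent with sqrt(RSS / (n - 2)), so the sketch below assumes that definition; this is inferred from the numbers, not stated by the author. For comparison it also prints R's residual standard error from `sigma()`, which divides by the residual degrees of freedom (n - 3 for the three-coefficient model) and is therefore slightly larger for the multivariate model.

```r
data(mtcars)

naive_fit <- lm(mpg ~ am, data = mtcars)
multi_fit <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)

# Assumed definition (inferred from the reported values): sqrt(RSS / (n - 2)).
resid_variation <- function(fit) {
  e <- resid(fit)
  sqrt(sum(e^2) / (length(e) - 2))
}

rv <- c(naive = resid_variation(naive_fit), multi = resid_variation(multi_fit))
print(rv)

# Residual standard error as reported by R (RSS divided by residual df).
print(c(naive = sigma(naive_fit), multi = sigma(multi_fit)))

# Mean residual: zero up to rounding with an intercept, small but nonzero without.
print(c(naive = mean(resid(naive_fit)), multi = mean(resid(multi_fit))))
```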

ANOVA provides more color surrounding the R-squared and the F-Statistic of each model:

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am - 1
## Model 3: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 180.83  1    540.06 89.3270 3.295e-10 ***
## 3     28 169.29  1     11.55  1.9098    0.1779    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The regressors of the multivariate model account for far more of the variance in the dependent variable than the naive model does: the residual sum of squares falls from 720.90 to 180.83. The ANOVA table also supports excluding the intercept, since adding it back (Model 3) does not significantly reduce the residual sum of squares (F = 1.9098, p = 0.1779).
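The three-model comparison can be sketched with base R's `anova()`, using the formulas listed in the table header. The object names `m1`, `m2`, and `m3` are illustrative.

```r
data(mtcars)

m1 <- lm(mpg ~ am, data = mtcars)                  # naive model
m2 <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)  # multivariate, no intercept
m3 <- lm(mpg ~ wt + qsec + am, data = mtcars)      # multivariate, with intercept

# Sequential F tests: each row compares a model with the one before it.
tab <- anova(m1, m2, m3)
print(tab)
```

Model 1 is not strictly nested in Model 2, because Model 2 drops the intercept that Model 1 contains, so the first F test is best read as a descriptive comparison of residual sums of squares. The comparison of Model 2 with Model 3 is a standard nested test of the intercept.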

Residual Diagnostics: Neither the naive model nor the multivariate model shows significant heteroskedasticity at the .05 level. The Breusch-Pagan test results below fail to reject constant variance for either model (p = 0.065 and p = 0.103, respectively).

BP Test Results for the Naive Model:

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 3.409732    Df = 1     p = 0.06481297

BP Test Results for the Multivariate Model:

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.652473    Df = 1     p = 0.1033889
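The output format above resembles that of `ncvTest()` in the car package, although the report does not name the function it used. The sketch below is a base R approximation of that score test under this assumption: squared residuals, scaled by RSS / n, are regressed on the fitted values, and half of the regression sum of squares is compared with a chi-square distribution on 1 degree of freedom. The helper name `score_test` is hypothetical.

```r
data(mtcars)

naive_fit <- lm(mpg ~ am, data = mtcars)
multi_fit <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)

# Score test for non-constant variance with variance formula ~ fitted values.
score_test <- function(fit) {
  e <- resid(fit)
  u <- e^2 / (sum(e^2) / length(e))         # squared residuals scaled by RSS / n
  aux <- lm(u ~ fitted(fit))                # auxiliary regression on fitted values
  reg_ss <- sum((fitted(aux) - mean(u))^2)  # regression sum of squares
  chisq <- reg_ss / 2
  c(chisq = chisq, df = 1, p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

bp_naive <- score_test(naive_fit)
bp_multi <- score_test(multi_fit)
print(rbind(naive = bp_naive, multi = bp_multi))
```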

The Naive model suffers from serial correlation, as shown in the Durbin-Watson test results below.

##  lag Autocorrelation D-W Statistic p-value
##    1        0.453459      1.064698   0.002
##  Alternative hypothesis: rho != 0

The Multivariate model without an intercept does not.

##  lag Autocorrelation D-W Statistic p-value
##    1       0.0251951      1.860574   0.524
##  Alternative hypothesis: rho != 0
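The Durbin-Watson statistic and the lag-1 autocorrelation in the tables above can be computed from the residuals in base R. This sketch does not reproduce the p-values, since the report does not say how they were obtained. The helper names `dw_stat` and `lag1_autocor` are hypothetical.

```r
data(mtcars)

naive_fit <- lm(mpg ~ am, data = mtcars)
multi_fit <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)

# Durbin-Watson statistic: squared successive differences over squared residuals.
dw_stat <- function(fit) {
  e <- resid(fit)
  sum(diff(e)^2) / sum(e^2)
}

# Lag-1 autocorrelation of the residuals, using the same denominator.
lag1_autocor <- function(fit) {
  e <- resid(fit)
  n <- length(e)
  sum(e[-1] * e[-n]) / sum(e^2)
}

dw <- c(naive = dw_stat(naive_fit), multi = dw_stat(multi_fit))
print(dw)
print(c(naive = lag1_autocor(naive_fit), multi = lag1_autocor(multi_fit)))
```

Both quantities depend on the row order of the data, so they describe correlation between neighboring rows of mtcars as stored, not correlation over time.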

Figures 11 through 13 show residual plots against the independent variables. Figures 14 and 15 compare each model’s predicted values (blue circles) with the actual values (red circles). These plots further support the better performance of the multivariate model.

Appendix

Figures for Exploratory Analysis

Figures for Naive Models

Residual Plots

Prediction Comparison