This analysis evaluates the impact of transmission type on miles per gallon (mpg). Two models are estimated: a naive model that uses only the transmission-type dummy variable, and a multivariate model that uses three predictors and no intercept. Although both models yield estimates of the transmission effect, the multivariate model is still incomplete despite its high accuracy, low residual variation, and simplicity. It excludes known-knowns (cylinders, gears, hp) for lack of degrees of freedom, and it includes variables that are confounded in that they represent both driver behavior (driving fast) and automobile efficiency. To best estimate the impact of transmission type on mpg, additional variables must be controlled for. Even so, the results suggest that transmission type does have a significant impact on mpg.
This analysis seeks to answer two questions:
1.) Which transmission type is better for MPG?
2.) What is the difference in MPG between the two transmission types?
An exploration of the data shows that the variables in mtcars are a mix of categorical and continuous variables. The categorical variables are “the number of cylinders (cyl)”, “number of forward gears (gear)”, “the number of carburetors (carb)”, “transmission type (am)”, and “V/S (vs)”. Including all of them as factors would eat away at the degrees of freedom of the regression, since the data set has only 32 observations. Of the categorical variables, only transmission type is considered for inclusion.
The panel of histograms (Figures 1-5) in the Appendix shows the distributions of the variables “Displacement (disp)”, measured in cubic inches; “Rear axle ratio (drat)”; “Weight (wt)”, measured in 1,000 lbs; and “Quarter Mile Time (qsec)”, measured in seconds.
Figure 5 shows the distribution of the dependent variable “Miles per US Gallon of Gas (mpg)”. The mean of mpg is 20.090625 with a standard deviation of 6.0269481.
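As a minimal sketch, the summary statistics above can be reproduced in base R, assuming the built-in `mtcars` data set is the data analyzed here. The 2 x 3 panel layout is an illustrative choice and not necessarily the one behind Figures 1-5.

```r
# Assumes the built-in mtcars data set (datasets package) is the data used here.
mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)
print(c(mean = mpg_mean, sd = mpg_sd))

# Histograms of the continuous predictors and of the dependent variable mpg.
# The panel layout is illustrative only.
old_par <- par(mfrow = c(2, 3))
for (v in c("disp", "drat", "wt", "qsec", "mpg")) {
  hist(mtcars[[v]], main = v, xlab = v)
}
par(old_par)
```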
The correlation matrix shows strong correlation of disp, wt, and drat with mpg; however, disp is also strongly correlated with both wt and drat. Including any two of the three variables disp, wt, and drat in a regression model introduces multicollinearity among the predictors, which inflates standard errors and can make individual variables appear insignificant.
##             mpg       disp         wt        qsec        drat         hp
## mpg   1.0000000 -0.8475514 -0.8676594  0.41868403  0.68117191 -0.7761684
## disp -0.8475514  1.0000000  0.8879799 -0.43369788 -0.71021393  0.7909486
## wt   -0.8676594  0.8879799  1.0000000 -0.17471588 -0.71244065  0.6587479
## qsec  0.4186840 -0.4336979 -0.1747159  1.00000000  0.09120476 -0.7082234
## drat  0.6811719 -0.7102139 -0.7124406  0.09120476  1.00000000 -0.4487591
## hp   -0.7761684  0.7909486  0.6587479 -0.70822339 -0.44875912  1.0000000
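A matrix like the one above can be produced with `cor()` on the six continuous columns of the built-in `mtcars` data. The sketch below also lists predictor pairs whose absolute correlation exceeds 0.7; that threshold is an illustrative assumption, not a cutoff stated in the analysis.

```r
# Correlation matrix of mpg and the continuous predictors (built-in mtcars data).
vars <- c("mpg", "disp", "wt", "qsec", "drat", "hp")
cor_mat <- cor(mtcars[, vars])
print(cor_mat)

# Flag strongly correlated predictor pairs. The 0.7 threshold is hypothetical.
pred <- setdiff(vars, "mpg")
pred_cor <- cor_mat[pred, pred]
high <- which(abs(pred_cor) > 0.7 & upper.tri(pred_cor), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = pred[high[, "row"]],
  var2 = pred[high[, "col"]],
  r = pred_cor[high]
)
print(high_pairs)
```

Any pair listed in `high_pairs` is a candidate for the multicollinearity problem described above if both variables enter the same model.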
Several single-factor (naive) models were estimated to determine which variables best predict mpg. Figures 6 through 10 in “Figures for Naive Models” show these models for the continuous variables in the data, each with a confidence interval (black lines), a prediction interval (green lines), and the actual values. Judged by the dispersion of the actual values and the width of the confidence and prediction intervals, the three most useful variables are Displacement, Weight, and Horsepower.
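One way such a single-factor plot could be built is sketched below for weight, using `predict()` with `interval = "confidence"` and `interval = "prediction"`. The grid size, line types, and axis labels are illustrative assumptions; the original figures may have been drawn differently.

```r
# Single-factor model of mpg on weight (built-in mtcars data).
fit_wt <- lm(mpg ~ wt, data = mtcars)

# Evenly spaced weights over the observed range; 50 points is an arbitrary choice.
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))

# 95% interval for the mean response and for a new observation.
conf_int <- predict(fit_wt, newdata = grid, interval = "confidence", level = 0.95)
pred_int <- predict(fit_wt, newdata = grid, interval = "prediction", level = 0.95)

# Actual values, fitted line, confidence band (black), prediction band (green).
plot(mpg ~ wt, data = mtcars, xlab = "Weight (1,000 lbs)", ylab = "Miles per gallon")
lines(grid$wt, conf_int[, "fit"])
lines(grid$wt, conf_int[, "lwr"], col = "black", lty = 2)
lines(grid$wt, conf_int[, "upr"], col = "black", lty = 2)
lines(grid$wt, pred_int[, "lwr"], col = "green", lty = 3)
lines(grid$wt, pred_int[, "upr"], col = "green", lty = 3)
```

The prediction band is always wider than the confidence band because it adds the variance of an individual observation to the uncertainty in the fitted mean.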
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am           7.244939   1.764422  4.106127 2.850207e-04
The summary results show that both the intercept and the variable “transmission type” are significant at the .001 level. The R-squared is 0.3597989. The F-statistic, 16.8602788 on 1 and 30 degrees of freedom, is significant at the .001 level. The model is therefore statistically significant, but its R-squared shows that it lacks predictive power.
In this case, the intercept, 17.1473684, is the mean miles per gallon of automobiles with automatic transmissions. The coefficient of the variable “am” is interpreted as “cars with a manual transmission average 7.245 mpg more than cars with an automatic transmission”. Consequently, any prediction with am=1 (manual transmission) returns the average mpg of manual transmissions, 24.3923077.
95% confidence intervals surrounding the coefficients are:
## 2.5 % 97.5 %
## (Intercept) 14.85062 19.44411
## am 3.64151 10.84837
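A minimal sketch of the naive model, assuming the built-in `mtcars` data with `am` left as a 0/1 numeric column. It also checks the interpretation given above, namely that the intercept is the automatic-transmission mean and that intercept plus slope is the manual-transmission mean.

```r
# Naive model: mpg regressed on the transmission dummy only (0 = automatic, 1 = manual).
fit_naive <- lm(mpg ~ am, data = mtcars)
print(coef(summary(fit_naive)))
print(confint(fit_naive, level = 0.95))

# Group means of mpg by transmission type, for comparison with the coefficients.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
b <- coef(fit_naive)
print(rbind(
  from_model = c(automatic = b[["(Intercept)"]], manual = b[["(Intercept)"]] + b[["am"]]),
  from_data = c(automatic = group_means[["0"]], manual = group_means[["1"]])
))
```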
In the multivariate model, the intercept would be interpreted as the mpg of a car with no weight, a 0-second quarter mile, and an automatic transmission. Because that interpretation is not meaningful, the analysis moves forward with a model without the intercept.
##
## Call:
## lm(formula = mpg ~ disp + wt + qsec + am - 1, data = mtcars)
##
## Coefficients:
## disp wt qsec am
## 0.01202 -4.61279 1.70551 4.18085
Because Weight and Displacement are highly correlated, Displacement is excluded; of the two, the theoretical link between Weight and MPG is the more intuitive. The results of the multivariate model are:
##       Estimate Std. Error   t value     Pr(>|t|)
## am    4.299519  1.0241147  4.198279 2.329423e-04
## wt   -3.185455  0.4827586 -6.598442 3.128844e-07
## qsec  1.599823  0.1021276 15.664944 1.091522e-15
The summary results once again show that all of the variables are statistically significant at the .001 level; the R-squared and adjusted R-squared are 0.9871223 and 0.9857902, respectively. Finally, the F-statistic, 740.9865203 on 3 and 29 degrees of freedom, is significant at the .001 level, which suggests that at least one of the coefficients is different from 0. By these measures the model has much higher predictive power than the naive models, with the caveat that R computes the R-squared of a model without an intercept relative to zero rather than to the mean of mpg, so the two R-squared values are not directly comparable.
In this case, coefficients are interpreted differently than in the naive case. Because no intercept is included, the wt and qsec terms alone give the prediction for a car with an automatic transmission (am = 0). Holding the other predictors fixed, having a manual transmission increases mpg by 4.2995 miles per gallon on average; every 1,000 lb increase in the weight of the car decreases mpg by 3.1855 on average; and every 1-second increase in the quarter-mile time increases mpg by 1.5998 on average.
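The R-squared of a fit without an intercept can be checked against the usual mean-centered definition with a short calculation. The sketch below assumes the built-in `mtcars` data; the variable names are illustrative.

```r
# Multivariate model without an intercept (built-in mtcars data).
fit_multi <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)
rss <- sum(resid(fit_multi)^2)

# summary.lm() uses the uncentered total sum of squares when there is no intercept.
r2_uncentered <- 1 - rss / sum(mtcars$mpg^2)

# The mean-centered version is the one comparable to a model that has an intercept.
r2_centered <- 1 - rss / sum((mtcars$mpg - mean(mtcars$mpg))^2)

print(c(reported = summary(fit_multi)$r.squared,
        uncentered = r2_uncentered,
        centered = r2_centered))
```

The centered figure is the one to set beside the naive model's R-squared when comparing the two fits.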
95% confidence intervals surrounding the coefficients are:
## 2.5 % 97.5 %
## am 2.204969 6.394069
## wt -4.172807 -2.198102
## qsec 1.390948 1.808697
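A minimal sketch of the multivariate fit and its confidence intervals, assuming the built-in `mtcars` data. The two cars in `new_cars` are hypothetical and differ only in transmission type, so the gap between their predictions equals the `am` coefficient.

```r
# Multivariate model without an intercept (built-in mtcars data).
fit_multi <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)
print(coef(summary(fit_multi)))
print(confint(fit_multi, level = 0.95))

# Two hypothetical cars: 3,000 lbs, 18-second quarter mile, automatic vs. manual.
new_cars <- data.frame(wt = 3, qsec = 18, am = c(0, 1))
pred <- predict(fit_multi, newdata = new_cars)
manual_gain <- unname(pred[2] - pred[1])
print(manual_gain)
```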
Naive Model
The residual variation of the naive model with transmission type as the single factor is 4.9020288, which seems high given the interpretation of the coefficients. The mean of the residuals is -6.5919492e-17, which is effectively zero, as expected for a least-squares fit that includes an intercept.
Multivariate Model
The residual variation of the multivariate model is 2.4551462, about half that of the naive model, 4.9020288, so its predictions sit much closer to the actual values. The mean of the residuals, 0.0375163, is close to but not exactly 0. This is acceptable because a model without an intercept does not force the residuals to sum to zero.
ANOVA provides more color surrounding the R-squared and the F-Statistic of each model:
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am - 1
## Model 3: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1     30 720.90
## 2     29 180.83  1    540.06 89.3270 3.295e-10 ***
## 3     28 169.29  1     11.55  1.9098    0.1779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The regressors of the multivariate model account for far more of the variance in the dependent variable than the naive model does: the residual sum of squares falls from 720.90 to 180.83. The ANOVA table also supports excluding the intercept: adding it back (Model 3) lowers the residual sum of squares by only 11.55, and that improvement is not statistically significant (F = 1.9098, p = 0.1779).
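A table of this form can be obtained by passing the three fitted models to `anova()` in the order shown. This is a sketch assuming the built-in `mtcars` data; the object names are illustrative.

```r
# The three models compared in the table (built-in mtcars data).
m1 <- lm(mpg ~ am, data = mtcars)
m2 <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)
m3 <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Sequential comparison: each row is tested against the row before it.
aov_tab <- anova(m1, m2, m3)
print(aov_tab)

# p-value for adding the intercept back to the multivariate model (row 3).
p_intercept <- aov_tab[["Pr(>F)"]][3]
print(p_intercept)
```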
Residual Diagnostics
Neither the naive model nor the multivariate model shows significant heteroskedasticity: the Breusch-Pagan tests below fail to reject constant variance at the .05 level for either model.
BP Test Results for the Naive Model:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3.409732 Df = 1 p = 0.06481297
BP Test Results for the Multivariate Model:
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.652473 Df = 1 p = 0.1033889
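The output format above resembles that of `ncvTest()` in the car package, though that is an assumption about how the tests were run. The sketch below computes the same kind of score test in base R: squared residuals are scaled by their mean, regressed on the fitted values, and half of the explained sum of squares is compared with a chi-square distribution on 1 degree of freedom. The helper name `bp_score_test` is hypothetical.

```r
# Score test for non-constant variance against the fitted values.
# bp_score_test is an illustrative helper, not a function from any package.
bp_score_test <- function(fit) {
  e2 <- resid(fit)^2
  u <- e2 / mean(e2)                        # scaled squared residuals
  aux <- lm(u ~ fitted(fit))                # auxiliary regression
  reg_ss <- sum((fitted(aux) - mean(u))^2)  # explained sum of squares
  chisq <- reg_ss / 2
  c(chisq = chisq, df = 1, p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

fit_naive <- lm(mpg ~ am, data = mtcars)
fit_multi <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)

bp_naive <- bp_score_test(fit_naive)
bp_multi <- bp_score_test(fit_multi)
print(rbind(naive = bp_naive, multivariate = bp_multi))

# Optional cross-check, only if the car package happens to be installed.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::ncvTest(fit_naive))
  print(car::ncvTest(fit_multi))
}
```

A p-value above the chosen significance level means the test does not reject constant residual variance.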
The Naive model suffers from serial correlation, as shown in the Durbin-Watson test results below.
## lag Autocorrelation D-W Statistic p-value
## 1 0.453459 1.064698 0.002
## Alternative hypothesis: rho != 0
The Multivariate model without an intercept does not.
## lag Autocorrelation D-W Statistic p-value
## 1 0.0251951 1.860574 0.524
## Alternative hypothesis: rho != 0
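The Durbin-Watson statistic itself is simple to compute by hand: it is the sum of squared successive differences of the residuals divided by the residual sum of squares. The sketch below assumes the built-in `mtcars` data in its default row order; `dw_stat` is a hypothetical helper. The p-values shown above look like output from `durbinWatsonTest()` in the car package, which by default estimates them by resampling, so they can vary slightly between runs.

```r
# Durbin-Watson statistic at lag 1. dw_stat is an illustrative helper.
dw_stat <- function(fit) {
  e <- resid(fit)
  sum(diff(e)^2) / sum(e^2)
}

fit_naive <- lm(mpg ~ am, data = mtcars)
fit_multi <- lm(mpg ~ wt + qsec + am - 1, data = mtcars)

dw_naive <- dw_stat(fit_naive)
dw_multi <- dw_stat(fit_multi)
print(c(naive = dw_naive, multivariate = dw_multi))

# Optional: p-values via the car package, only if it happens to be installed.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::durbinWatsonTest(fit_naive))
  print(car::durbinWatsonTest(fit_multi))
}
```

Values near 2 indicate little first-order autocorrelation, while values well below 2 indicate positive autocorrelation. The statistic depends on the order of the rows in the data.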
Figures 11 through 13 show residual plots against the independent variables. Figures 14 and 15 compare each model’s predicted values (blue circles) with the actual values (red circles). This comparison further supports the better performance of the multivariate model.
Figures for Exploratory Analysis
Figures for Naive Models
Residual Plots
Prediction Comparison