Course Project Regression Models

Main Report

1. Introduction

The aim here is to quantify the effect of automatic versus manual transmission on the miles per gallon (MPG) achieved by the vehicles in the mtcars dataset (data(mtcars)). Because numerous factors could influence the fuel efficiency of a car, this report also tries to account for the effect of the so-called ‘confounding variables’.

2. Exploratory Data Analysis

A quick plot of transmission type against miles per gallon can be seen in Figure 1 of the Appendix. Manual cars achieve higher mileage on average, but it is also evident that mainly light vehicles have manual transmission, whereas mostly heavy ones are automatic. That suggests that other variables could also be responsible for the difference. The goal is to find appropriate variables to include in a model, so that the influence of transmission alone can be separated from that of the other ‘confounders’.
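A minimal sketch of this first look, assuming only base R and the built-in mtcars data. The name `trans` and the use of a box plot (rather than the scatter plot of Figure 1) are illustrative choices, not necessarily what produced the figure.

```r
# Exploratory sketch: fuel efficiency and weight by transmission type.
data(mtcars)

# Label the 0/1 indicator for readability (0 = automatic, 1 = manual).
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Box plot of miles per gallon by transmission type.
boxplot(mtcars$mpg ~ trans, xlab = "Transmission", ylab = "Miles per gallon")

# Group means of mileage and weight (wt is recorded in 1000 lbs).
mean_mpg <- tapply(mtcars$mpg, trans, mean)
mean_wt  <- tapply(mtcars$wt, trans, mean)
print(round(rbind(mpg = mean_mpg, wt = mean_wt), 2))
```

If the manual group is both lighter and more economical, weight is a natural candidate for a confounder.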

3. Model Selection

In order to disentangle the effect of each variable, a quick look at the correlations and relations between them is necessary. Figure 2 shows that most of the so-called predictors are highly correlated, which could lead to variance inflation. Indeed, a quick ‘kitchen sink’ regression on all available variables yields only insignificant coefficients and high variance inflation factors from the ‘vif’ function (Output 4). On the other hand, a regression on transmission type alone leads to a highly significant coefficient, but that is certainly picking up influence from the other ‘confounders’. I follow the simple approach of testing nested models: starting with ‘am’, adding one extra variable at a time and estimating a new model each time. In the end the ‘anova’ function determines which predictors are ‘necessary’ for the ‘right’ model. Output 1 in the Appendix shows the result. From the output it can be seen that, after ‘am’, adding ‘wt’, ‘disp’, ‘hp’ and ‘qsec’ each improves the fit at least marginally (p < 0.1), whereas the later additions do not. According to that anova result the model looks like this:

##                Estimate Std. Error   t value    Pr(>|t|)
## (Intercept) 14.36190396 9.74079485  1.474408 0.152378367
## am1          3.47045340 1.48578009  2.335779 0.027487809
## wt          -4.08433206 1.19409972 -3.420428 0.002075008
## disp         0.01123765 0.01060333  1.059823 0.298972150
## hp          -0.02117055 0.01450469 -1.459565 0.156387279
## qsec         1.00689683 0.47543287  2.117853 0.043907652

Yet, in this model the variables ‘disp’ and ‘hp’ are statistically insignificant, i.e. we cannot rule out the possibility that their coefficients are in fact zero. Even though the ‘anova’ procedure suggested that they belong in the model, their strong correlation with the other predictors may make them redundant. The sensible thing to do is to omit them and arrive at a possible ‘best’ model:

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## am1          2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04

It should be added that the ‘step’ function in R, which selects a model based on the Akaike information criterion (AIC), gives the same model as a result, so maybe that is indeed the model that best fits our data. So in the end our model looks like this: \[MPG = \beta_0 + \beta_1 AM1 + \beta_2 WT + \beta_3 QSEC + u\]
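The selected model and the AIC-based check can be sketched as follows. This is only an illustration under stated assumptions: the object names `fit_final` and `fit_step` are hypothetical, and `step` is run here on the original numeric coding of mtcars, which may differ from the factor coding used for the report.

```r
# Sketch: fit the chosen model and compare with backward selection by AIC.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am)   # 0 = automatic, 1 = manual

# The chosen model: MPG on transmission, weight and quarter-mile time.
fit_final <- lm(mpg ~ am + wt + qsec, data = cars)
print(round(summary(fit_final)$coefficients, 4))

# Backward elimination by AIC, starting from all predictors
# (assumption: numeric coding of every variable, unlike the report).
fit_step <- step(lm(mpg ~ ., data = mtcars), trace = 0)
print(formula(fit_step))
```

Setting `trace = 0` only suppresses the step-by-step log; the selection itself is unchanged.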

4. Quantifying the Influence of Automatic or Manual Transmission on Miles per Gallon

The way to read our result of \(\beta_1 = 2.94\) is as follows: holding weight and quarter-mile time fixed, a vehicle with a manual transmission travels on average 2.94 miles more per gallon of fuel than one with an automatic transmission. It should be emphasized that the coefficient is calculated in such a way that it removes the effects of weight and quarter-mile time on miles per gallon. Possibly the quarter-mile time (qsec) picks up some of the influence of horsepower (hp) and displacement (disp), since they were rendered ‘insignificant’. Again, that is what our model selection yielded; of course it may not be the ‘best’ solution.

5. Residual Analysis

Residual analysis is an important part of deciding whether a selected model is indeed a good one. Figure 3 of the Appendix shows four different residual plots, each focusing on a different aspect of the residuals. The top left plot shows the residuals against the fitted values. For a good model that graph should look reasonably random, and in my opinion it does, which argues against, for example, heteroscedasticity. Next is the Q-Q plot, which checks the normality of the residuals. In my opinion the right tail shows some departure from normality, and in the middle there is also some deviation. That could have some effect on the inference for our parameters. The other two plots are a bit harder to interpret, so I calculate the actual leverage and influence values instead. Output 2 gives a summary of the results. From there it can be seen that heavy cars such as the Lincoln Continental, Cadillac Fleetwood and Chrysler Imperial have rather high leverage, suggesting that they may skew our results. The other measure, ‘dfbetas’, gives the standardized change in each coefficient when one particular observation is excluded. Again it can be seen that a few cars, most notably the Chrysler Imperial, Fiat 128 and Toyota Corolla, have a rather big impact on our results.

6. Uncertainty of the Conclusion and Inference

Of course there is uncertainty in our coefficients. The point estimate \(\beta_1 = 2.94\) is just one number and could partly reflect purely random effects. So, as always, we construct a 95% confidence interval as follows:

# 95% CI for the am1 coefficient (row 2 of the coefficient table of fit5_1)
summaryCoef[2, 1] + c(-1, 1) * qt(.975, df = fit5_1$df.residual) * summaryCoef[2, 2]

## [1] 0.04573031 5.82594408

So the effect of transmission type on miles per gallon could plausibly lie anywhere from 0.046 to 5.826. In my opinion our coefficient is barely significant: the lower end of the 95% interval is very close to 0, so an effect near zero cannot be ruled out. Again, we should treat our model with caution.
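As a cross-check, the same interval can be obtained from base R's `confint`. The sketch below refits the chosen model from scratch, so the names `fit`, `ci_manual` and `ci_builtin` are illustrative and not the objects used in the report.

```r
# Sketch: 95% confidence interval for the transmission coefficient, two ways.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am)   # 0 = automatic, 1 = manual
fit <- lm(mpg ~ am + wt + qsec, data = cars)

# By hand: estimate +/- t quantile * standard error.
est <- coef(summary(fit))["am1", "Estimate"]
se  <- coef(summary(fit))["am1", "Std. Error"]
ci_manual <- est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se

# Built-in equivalent.
ci_builtin <- confint(fit, "am1", level = 0.95)

print(ci_manual)
print(ci_builtin)
```

Both approaches use the t distribution with the residual degrees of freedom of the fit (28 here), so they should agree.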

7. Executive Summary

This report tried to quantify the effect of transmission type on miles per gallon. Removing the influence of possible ‘confounding variables’ such as weight and quarter-mile time, we concluded that on average a manual transmission increases the mileage of a car by 2.94 miles per gallon. That estimate should be taken with a ‘grain of salt’. The original article by Henderson and Velleman (1981), which used the same dataset, concluded that the ‘best’ model explaining fuel efficiency includes only weight and a combined variable hp/wt (horsepower divided by weight). I fitted the same model with the transmission variable added (Output 3). In this model transmission was insignificant, i.e. we cannot reject the possibility of no effect at all. So manual transmission probably helps when it comes to fuel efficiency, but maybe that is just a fluke.

Appendix

Figure 1 Scatter plot of manual (1) and automatic (0) transmission vs. miles per gallon

Figure 2 Pairwise plots of different variables and correlations
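A rough way to reproduce this kind of overview, assuming base R only. The selection of variables in `vars` is an illustrative subset, not necessarily the set shown in the figure.

```r
# Sketch: pairwise correlations and scatter plots for some continuous variables.
data(mtcars)
vars <- c("mpg", "wt", "disp", "hp", "qsec", "drat")

# Correlation matrix, rounded for readability.
cor_mat <- cor(mtcars[, vars])
print(round(cor_mat, 2))

# Matrix of pairwise scatter plots.
pairs(mtcars[, vars])
```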

Figure 3 Residuals Plot of Selected Model \(MPG = \beta_0 + \beta_1 AM1 + \beta_2 WT + \beta_3 QSEC + u\)

Output 1 Model Selection using ‘anova’ function.

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + disp
## Model 4: mpg ~ am + wt + disp + hp
## Model 5: mpg ~ am + wt + disp + hp + qsec
## Model 6: mpg ~ am + wt + disp + hp + qsec + cyl
## Model 7: mpg ~ am + wt + disp + hp + qsec + cyl + drat
## Model 8: mpg ~ am + wt + disp + hp + qsec + cyl + drat + gear
## Model 9: mpg ~ am + wt + disp + hp + qsec + cyl + drat + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 57.0917 1.154e-06 ***
## 3     28 246.56  1     31.76  4.0974  0.059976 .  
## 4     27 179.91  1     66.65  8.5976  0.009766 ** 
## 5     26 153.44  1     26.47  3.4146  0.083195 .  
## 6     24 142.33  2     11.11  0.7164  0.503524    
## 7     23 141.21  1      1.12  0.1451  0.708289    
## 8     21 137.68  2      3.53  0.2276  0.799011    
## 9     16 124.03  5     13.65  0.3520  0.873459    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: All dummy variables in the dataset are converted to factors at the beginning, so R does not treat them as numbers.
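The nested comparison above can be sketched as follows. The set of variables converted to factors (`cyl`, `vs`, `am`, `gear`, `carb`) is an assumption inferred from the degrees of freedom in the table, and the names `added`, `fits` and `nested` are illustrative.

```r
# Sketch: sequence of nested models compared with anova().
data(mtcars)
cars <- mtcars

# Assumption: these are the variables treated as factors in the report.
for (v in c("cyl", "vs", "am", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

# Predictors in the order in which they enter the models.
added <- c("am", "wt", "disp", "hp", "qsec", "cyl", "drat", "gear", "carb")

# Model k uses the first k predictors.
fits <- lapply(seq_along(added), function(k) {
  lm(reformulate(added[1:k], response = "mpg"), data = cars)
})

# Sequential F tests between consecutive models.
nested <- do.call(anova, fits)
print(nested)
```

Note that the result of this procedure depends on the order in which predictors are added, so a different ordering could favour a different model.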

Output 2 Leverage from ‘hatvalues’ (column 1) and influence measures from ‘dfbetas’ (columns 2 to 4, for the coefficients of ‘am1’, ‘wt’ and ‘qsec’)

##                      hatvalues         am1           wt         qsec
## Merc 280            0.06062835 -0.04169947 -0.023432317 -0.008336304
## Merc 280C           0.06114922  0.07973403  0.035755662 -0.025250662
## Merc 450SE          0.06105404 -0.04202540  0.025834062 -0.030097818
## Merc 450SL          0.05817788 -0.04074548 -0.013306840 -0.022420838
## Merc 450SLC         0.05303857  0.07137530  0.005011697  0.014273332
## Cadillac Fleetwood  0.22700693 -0.08089295 -0.150585664 -0.064948409
## Lincoln Continental 0.26421505  0.02475114  0.044808150  0.018084334
## Chrysler Imperial   0.22963383  0.56264176  1.093842173  0.336678138
## Fiat 128            0.12763129  0.47656803  0.128993948  0.496886062
## Honda Civic         0.11865613  0.01694722 -0.110885836  0.017887430
## Toyota Corolla      0.14634894  0.31746367 -0.051120600  0.451492785
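A table of this kind can be assembled from base R's `hatvalues` and `dfbetas`. The sketch below refits the chosen model; the name `diag_tab` and the cut-off of twice the average leverage are illustrative assumptions, not part of the original analysis.

```r
# Sketch: leverage and influence measures for the chosen model.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am)   # 0 = automatic, 1 = manual
fit <- lm(mpg ~ am + wt + qsec, data = cars)

# Leverage: diagonal of the hat matrix.
lev <- hatvalues(fit)

# Influence: standardized change in each coefficient when a car is dropped.
infl <- dfbetas(fit)[, c("am1", "wt", "qsec")]

diag_tab <- cbind(hatvalues = lev, infl)

# Rule of thumb (an assumption here): flag leverage above 2 * p / n.
p <- length(coef(fit))
n <- nrow(cars)
print(round(diag_tab[lev > 2 * p / n, , drop = FALSE], 3))
```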

Output 3 Model \(1/MPG = \beta_0 + \beta_1 AM1 + \beta_2 WT + \beta_3 HP/WT + u\)

##                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept) -0.0050079683 6.752230e-03 -0.7416762 4.644595e-01
## am1          0.0008369016 3.625290e-03  0.2308509 8.191091e-01
## wt           0.0150236408 1.799288e-03  8.3497717 4.392184e-09
## I(hp/wt)     0.0002329434 8.024195e-05  2.9030121 7.130316e-03

Note: The AM1 coefficient is far from significant (p = 0.82).
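A sketch of how a fit like the one in Output 3 might be specified, assuming `am` is coded as a factor. The names `fit_hv` and `coefs_hv` are illustrative.

```r
# Sketch: gallons per mile (1/mpg) on transmission, weight and hp/wt.
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am)   # 0 = automatic, 1 = manual

# I() protects the arithmetic so it is not read as formula syntax.
fit_hv <- lm(I(1 / mpg) ~ am + wt + I(hp / wt), data = cars)

coefs_hv <- summary(fit_hv)$coefficients
print(coefs_hv)
```

Because the response is 1/MPG, a positive coefficient here means higher fuel consumption, which is the opposite sign convention from the MPG models in the main report.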

Output 4 Variance Inflation from the ‘vif’ function in R

##      cyl     disp       hp     drat       wt     qsec       vs       am 
## 3.364380 7.769536 5.312210 2.609533 4.881683 3.284842 2.843970 3.151269 
##     gear     carb 
## 2.670408 1.862838

##       am       wt     qsec 
## 2.541437 2.482952 1.364339

The first result is from the regression on all variables, the second from our chosen model.
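For the chosen model the variance inflation factors can also be computed by hand in base R, which makes their meaning explicit: each is 1 / (1 - R²) from regressing one predictor on the others. This is a sketch that treats `am` as a 0/1 number (equivalent to a two-level factor for this purpose); the name `vif_by_hand` is illustrative. The all-variables case involves multi-level factors and generalized VIFs, which this sketch does not attempt to reproduce.

```r
# Sketch: variance inflation factors for the chosen model without extra packages.
data(mtcars)
predictors <- c("am", "wt", "qsec")

vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # R-squared from regressing predictor v on the remaining predictors.
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 3))
```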

September, 2015