Executive Summary

We examine `mtcars` data set to answer the following questions:

• Is an automatic or manual transmission better for MPG?
• Quantify the MPG difference between automatic and manual transmissions.

We first summarize the relationships between variables, and then fit a linear regression model that has the smallest BIC and the largest adjusted \(R^2\), followed by residual analysis and diagnostics. Wither with or wothout interaction, our model tells us a manual transmission is better for MPG, and no-pattern residual plots are indications for good model fitting.

Explonatory Data Analysis

The `mtcars` data set is an R data frame with 32 observations on 11 variables. Figure 1 in Appendix gives us a general picture of the variables including their histogram, scatter plots and correlation between variables. Marginally manual transmission seems to have higher MPG than automatic transmission.

Regression Models and Subset Selection

We first consider two naive models, the model including all predictors (`fit.full`) and the one with variable `am` only (`fit.am`).

``fit.full <- lm(mpg ~ ., data = mtcars); round(summary(fit.full)\$coef[, 4][-1], 2) # p-value``
``````##  cyl6  cyl8  disp    hp  drat    wt  qsec   vs1   am1 gear4 gear5 carb2
##  0.40  0.96  0.28  0.09  0.64  0.09  0.70  0.51  0.71  0.77  0.51  0.68
## carb3 carb4 carb6 carb8
##  0.50  0.81  0.49  0.40``````
``fit.am <- lm(mpg ~ am, data = mtcars); summary(fit.am)\$coef[2, ]``
``````##     Estimate   Std. Error      t value     Pr(>|t|)
## 7.2449392713 1.7644216316 4.1061269831 0.0002850207``````

In the full model, all coefficients are not significant at 5% significance level, although is has large adjusted \(R^2 =\) 0.78. Fitting many correlated variables results in multicollinearity and overfitting with inflated estimated standard error. On the contrary, the coefficients of the `am`-only model are significantly different from zero, saying that on average, a manual transmitted car has 7.245 MPG higher than an automatic transmitted car. However, the model has small adjusted \(R^2 =\) 0.34, impling small explanatory power for MPG.

To do variable selection, we use forward and backward stepwise slection with two criteria AIC (\(-2\log L(M) + 2k\)) and BIC (\(-2\log L(M) + k\log(n)\)), where \(L(M)\) is the maximum of the likelihood function of model \(M\) and \(k\) is the number of parameters in \(M\) and \(n\) is the number of observations.

Forward stepwise selection starts with a intercept-only model, and then adds predictors to the model, one at the time, until all of the predictors are in the model. At each step the variable that gives the greatest additional improvement to the fit is added to the model. Backward method, on the other hand, begins with the full model, and then removes the least useful covariate, one at the time.

There are four models `for.aic`, `for.bic`, `back.aic` and `back.bic` to be considered, each of which is the best model with the smallest AIC or BIC.

``````for.aic <- step(lm(mpg ~ 1, data = mtcars), direction = "forward",
scope = formula(fit.full), k = 2, trace = 0) # forward AIC
for.bic <- step(lm(mpg ~ 1, data=mtcars), direction = "forward",
scope = formula(fit.full), k = log(32), trace = 0) # forward BIC
back.aic <- step(fit.full, direction = "backward", k = 2, trace = 0) # backward AIC
back.bic <- step(fit.full, direction = "backward", k = log(32), trace = 0) # backward BIC``````
``````# back.aicRsq back.bicRsq  for.aicRsq  for.bicRsq
#   0.8335561   0.8335561   0.8263446   0.8185189 ``````
``````##              Estimate     Pr(>|t|)
## (Intercept)  9.617781 1.779152e-01
## wt          -3.916504 6.952711e-06
## qsec         1.225886 2.161737e-04
## am1          2.935837 4.671551e-02``````

Since the model `back.bic` has the largest adjusted \(R^2 =\) 0.834, the model including `wt`, `qsec`, and `am` has the most explanatory power for `mpg`. Under this model, a mamual transmission car, on average, has 2.936 miles per gallon more than an automatic transmission car, holding values of weights and 1/4 mile time constant.

We then fit four possible interaction models `fit.int`, `fit.int.aq`, `fit.int.aw` and `fit.int.wq` to check if any interaction is needed to be in the model.

``````fit.int <- summary(lm(mpg ~ wt * qsec * am, data = mtcars))
fit.int.aq <- summary(lm(mpg ~ wt + qsec * am, data = mtcars))
fit.int.aw <- summary(lm(mpg ~ qsec + wt * am, data = mtcars))
fit.int.wq <- summary(lm(mpg ~ am + qsec * wt, data = mtcars))``````
``````##    int_Rsq int.aq_Rsq int.aw_Rsq int.wq_Rsq
##  0.8759496  0.8531624  0.8804219  0.8347545``````

Since model `fit.int.aw` has the largest adjusted \(R^2 =\) 0.88, the model `mpg = 9.723 + (1.017)qsec + (-2.937)wt + (14.079)am + (-4.141)wt*am` is our final model. When `am = 0`, the slope of `wt` is -2.937 and the intercept is 9.723. When `am = 1`, the slope of `wt` is -7.078 and the intercept is 23.802. In term of uncertainty, the 95% confidence interval for the coefficients are shown below.

``````fit <- lm(mpg ~ qsec + wt * am, data = mtcars)
t(confint(fit))``````
``````##        (Intercept)      qsec        wt       am1    wt:am1
## 2.5 %    -2.380779 0.4998811 -4.303102  7.030875 -6.597032
## 97.5 %   21.826884 1.5340661 -1.569960 21.127981 -1.685721``````

Residual Diagnostics

Some plots for residual diagnostics are shown in Figure 2. There is no particular pattern in residuals vs fitted, scale-location, and residuals vs leverage plots. For QQ-plot, it seems that the residual is a little bit right skewed, but it still can be seen as normal from Shapiro-Wilk normality test.

We use \(2*k/n\) as a threshold for hat values, and there are four high leverage points, but according to `dfbeta()`, they are not so influential to our model.

``shapiro.test(fit\$res)``
``````##
##  Shapiro-Wilk normality test
##
## data:  fit\$res
## W = 0.9444, p-value = 0.1001``````
``round(hatvalues(fit)[hatvalues(fit) > 2*5/32], 2) # high leverage``
``````##            Merc 230 Lincoln Continental        Lotus Europa
##                0.35                0.32                0.33
##       Maserati Bora
##                0.37``````
``round(dfbeta(fit)[which(hatvalues(fit) > 2*5/32), ], 2) # check influence ``
``````##                     (Intercept)  qsec    wt   am1 wt:am1
## Merc 230                   1.53 -0.09  0.01  0.46  -0.19
## Lincoln Continental        1.87 -0.03 -0.37 -1.14   0.30
## Lotus Europa               0.12 -0.01  0.00  0.10  -0.04
## Maserati Bora              0.38 -0.02 -0.01 -1.37   0.64``````

In sum, our model fit the data quite well. Although there are some high leverage points, they does not affect the model much. We may use this model to do inference and prediction as long as we pay attention to those data points with careful explanation.