Regression Models: Project

Assignment

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG” “Quantify the MPG difference between automatic and manual transmissions”

Summary

I first summarized the relationships between variables, and then fit a linear regression model that has the smallest BIC and the largest adjusted R^2, followed by residual analysis and diagnostics. Either with or without interaction, our model tells us a manual transmission is better for MPG, and no-pattern residual plots are indications for good model fitting.

Exploratory Data Analysis

The mtcars dataset is a dataframe with 32 observations on 11 variables. Figure 1 in Appendix gives us an overview of the variables including their histogram, scatter plots and correlation between variables. Our predictor variable of interest, am, is a dichotomous variable: 0 stands for automatic, 1 for manual. According to Figure 1, manual transmission seems to have higher MPG than automatic transmission.

The distribution of MPG is approximately normal and there are no apparent outliers. Figure 2 in the Appendix shows how MPG varies by automatic versus manual transmission. Figure 2 shows a difference in the MPG by transmission type – manual transmission seems to get better miles per gallon than automatic transmission.

Regression Models and Subset Selection

First, let’s consider two simple models: 1) the model including all predictors and 2) the one with only one variable – am.

##model with all variables
fit.full <- lm(mpg ~ ., data = mtcars); round(summary(fit.full)$coef[, 4][-1], 2)

##  cyl disp   hp drat   wt qsec   vs   am gear carb 
## 0.92 0.46 0.33 0.64 0.06 0.27 0.88 0.23 0.67 0.81

In the full model, all coefficients are not significant at 5% significance level, although is has large adjusted R^2=0.78. This model results in multicollinearity and overfitting with inflated estimated standard error.

##model with one variable
fit.am <- lm(mpg ~ am, data = mtcars); summary(fit.am)$coef[2, ]

##     Estimate   Std. Error      t value     Pr(>|t|) 
## 7.2449392713 1.7644216316 4.1061269831 0.0002850207

The coefficients of the am-only model are significantly different from zero, saying that on average, a manual transmitted car has 7.245 MPG higher than an automatic transmitted car. However, the model has small adjusted R^2= 0.34, implying small explanatory power for MPG.

To do variable selection, we use forward and backward stepwise selection with AIC (Akaike’s ‘An Information Criterion’) and BIC (Bayesian Information Criterion) criteria. Forward stepwise selection starts with an intercept-only model, and then adds predictors to the model gradually until all of the predictors are in the model. At each step the variable that gives the greatest additional improvement to the fit is added to the model. Backward method, on the other hand, begins with the full model, and then removes the least useful predictor, one at the time.

Four models forward_aic, forward_bic, backward_aic and backward_bic are specified below:

forward_aic <- step(lm(mpg ~ 1, data = mtcars), direction = "forward", 
                scope = formula(fit.full), k = 2, trace = 0) 
forward_bic <- step(lm(mpg ~ 1, data=mtcars), direction = "forward", 
                scope = formula(fit.full), k = log(32), trace = 0) 
backward_aic <- step(fit.full, direction = "backward", k = 2, trace = 0) 
backward_bic <- step(fit.full, direction = "backward", k = log(32), trace = 0)

forward_aic

## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt          cyl           hp  
##    38.75179     -3.16697     -0.94162     -0.01804

forward_bic

## 
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt          cyl  
##      39.686       -3.191       -1.508

backward_aic

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936

backward_bic

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936

The model back.bic has the largest adjusted R^2= 0.834. It includes wt, qsec, and am has the most explanatory power for MPG. Under this model, a manual transmission car, on average, has 2.936 miles per gallon more than an automatic transmission car.

We then fit four possible interaction models fit.int, fit.int.aq, fit.int.aw and fit.int.wq to check if any interaction is needed to be in the model.

fit.int <- summary(lm(mpg ~ wt * qsec * am, data = mtcars))
fit.int.aq <- summary(lm(mpg ~ wt + qsec * am, data = mtcars))
fit.int.aw <- summary(lm(mpg ~ qsec + wt * am, data = mtcars))
fit.int.wq <- summary(lm(mpg ~ am + qsec * wt, data = mtcars))

fit.int

## 
## Call:
## lm(formula = mpg ~ wt * qsec * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3330 -1.4729 -0.4856  1.1495  4.0045 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.8073    57.7960  -0.204    0.840
## wt            3.8670    17.0915   0.226    0.823
## qsec          2.2148     3.1734   0.698    0.492
## am           -5.5842    65.4915  -0.085    0.933
## wt:qsec      -0.3805     0.9467  -0.402    0.691
## wt:am         4.0906    20.5487   0.199    0.844
## qsec:am       1.1768     3.6310   0.324    0.749
## wt:qsec:am   -0.5009     1.1648  -0.430    0.671
## 
## Residual standard error: 2.123 on 24 degrees of freedom
## Multiple R-squared:  0.904,  Adjusted R-squared:  0.8759 
## F-statistic: 32.27 on 7 and 24 DF,  p-value: 1.027e-10

fit.int.aq

## 
## Call:
## lm(formula = mpg ~ wt + qsec * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2822 -1.5572 -0.2782  1.1654  4.1293 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.5290     7.2671   2.275   0.0311 *  
## wt           -3.7772     0.6711  -5.629 5.66e-06 ***
## qsec          0.8169     0.3299   2.477   0.0198 *  
## am          -15.6137     8.6237  -1.811   0.0814 .  
## qsec:am       1.0600     0.4870   2.177   0.0384 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.309 on 27 degrees of freedom
## Multiple R-squared:  0.8721, Adjusted R-squared:  0.8532 
## F-statistic: 46.03 on 4 and 27 DF,  p-value: 1.119e-11

fit.int.aw

## 
## Call:
## lm(formula = mpg ~ qsec + wt * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5076 -1.3801 -0.5588  1.0630  4.3684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.723      5.899   1.648 0.110893    
## qsec           1.017      0.252   4.035 0.000403 ***
## wt            -2.937      0.666  -4.409 0.000149 ***
## am            14.079      3.435   4.099 0.000341 ***
## wt:am         -4.141      1.197  -3.460 0.001809 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.084 on 27 degrees of freedom
## Multiple R-squared:  0.8959, Adjusted R-squared:  0.8804 
## F-statistic: 58.06 on 4 and 27 DF,  p-value: 7.168e-13

fit.int.wq

## 
## Call:
## lm(formula = mpg ~ am + qsec * wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5999 -1.6316 -0.6345  1.3839  4.2888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -20.2272    28.0796  -0.720   0.4775  
## am            2.8596     1.4075   2.032   0.0521 .
## qsec          2.8927     1.5466   1.870   0.0723 .
## wt            5.7172     8.8117   0.649   0.5219  
## qsec:wt      -0.5403     0.4926  -1.097   0.2824  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.45 on 27 degrees of freedom
## Multiple R-squared:  0.8561, Adjusted R-squared:  0.8348 
## F-statistic: 40.15 on 4 and 27 DF,  p-value: 5.416e-11

Since model fit.int.aw has the largest adjusted R^2= 0.88, the model mpg = 9.723 + (1.017)qsec + (-2.937)wt + (14.079)am + (-4.141)wt*am is our final model. When am = 0, the slope of wt is -2.937 and the intercept is 9.723. When am = 1, the slope of wt is -7.078 and the intercept is 23.802. In term of uncertainty, the 95% confidence interval for the coefficients are shown below.

 fit <- lm(mpg ~ qsec + wt * am, data = mtcars)
t(confint(fit))

##        (Intercept)      qsec        wt        am     wt:am
## 2.5 %    -2.380779 0.4998811 -4.303102  7.030875 -6.597032
## 97.5 %   21.826884 1.5340661 -1.569960 21.127981 -1.685721

summary(fit)

## 
## Call:
## lm(formula = mpg ~ qsec + wt * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5076 -1.3801 -0.5588  1.0630  4.3684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.723      5.899   1.648 0.110893    
## qsec           1.017      0.252   4.035 0.000403 ***
## wt            -2.937      0.666  -4.409 0.000149 ***
## am            14.079      3.435   4.099 0.000341 ***
## wt:am         -4.141      1.197  -3.460 0.001809 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.084 on 27 degrees of freedom
## Multiple R-squared:  0.8959, Adjusted R-squared:  0.8804 
## F-statistic: 58.06 on 4 and 27 DF,  p-value: 7.168e-13

This model explains over 88.04% of the variance, which is better than previous models.

Residual Diagnostics

Some plots for residual diagnostics are shown in Figure 3 in Appendix. There is no particular pattern in residuals vs fitted, scale-location, and residuals vs leverage plots. For QQ-plot, it seems that the residual is a little bit right skewed, but it still can be seen as normal from Shapiro-Wilk normality test.

shapiro.test(fit$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit$res
## W = 0.94444, p-value = 0.1001

round(hatvalues(fit)[hatvalues(fit) > 2*5/32], 2) # high leverage

##            Merc 230 Lincoln Continental        Lotus Europa 
##                0.35                0.32                0.33 
##       Maserati Bora 
##                0.37

round(dfbeta(fit)[which(hatvalues(fit) > 2*5/32), ], 2) # check influence

##                     (Intercept)  qsec    wt    am wt:am
## Merc 230                   1.53 -0.09  0.01  0.46 -0.19
## Lincoln Continental        1.87 -0.03 -0.37 -1.14  0.30
## Lotus Europa               0.12 -0.01  0.00  0.10 -0.04
## Maserati Bora              0.38 -0.02 -0.01 -1.37  0.64

Our model fit the data quite well. Although there are some high leverage points, we may still use this model for prediction as long as we pay attention to those data points with careful explanation.

Appendix

library("PerformanceAnalytics")
chart.Correlation(mtcars, histogram=TRUE, pch=11)

Figure 1: Decriptive Summary

boxplot(mpg~am, data = mtcars,
        col = c("dark grey", "light grey"),
        xlab = "Transmission",
        ylab = "Miles per Gallon",
        main = "MPG by Transmission Type")

Figure 2: MPG by Transmission Type

require(graphics)
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(fit)

Figure 3: Residual diagnostics