Assignment
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
“Is an automatic or manual transmission better for MPG” “Quantify the MPG difference between automatic and manual transmissions”
Summary
I first summarized the relationships between variables, and then fit a linear regression model that has the smallest BIC and the largest adjusted R^2, followed by residual analysis and diagnostics. Either with or without interaction, our model tells us a manual transmission is better for MPG, and no-pattern residual plots are indications for good model fitting.
Exploratory Data Analysis
The mtcars dataset is a dataframe with 32 observations on 11 variables. Figure 1 in Appendix gives us an overview of the variables including their histogram, scatter plots and correlation between variables. Our predictor variable of interest, am, is a dichotomous variable: 0 stands for automatic, 1 for manual. According to Figure 1, manual transmission seems to have higher MPG than automatic transmission.
The distribution of MPG is approximately normal and there are no apparent outliers. Figure 2 in the Appendix shows how MPG varies by automatic versus manual transmission. Figure 2 shows a difference in the MPG by transmission type – manual transmission seems to get better miles per gallon than automatic transmission.
Regression Models and Subset Selection
First, let’s consider two simple models: 1) the model including all predictors and 2) the one with only one variable – am.
##model with all variables
fit.full <- lm(mpg ~ ., data = mtcars); round(summary(fit.full)$coef[, 4][-1], 2)
## cyl disp hp drat wt qsec vs am gear carb
## 0.92 0.46 0.33 0.64 0.06 0.27 0.88 0.23 0.67 0.81
In the full model, all coefficients are not significant at 5% significance level, although is has large adjusted R^2=0.78. This model results in multicollinearity and overfitting with inflated estimated standard error.
##model with one variable
fit.am <- lm(mpg ~ am, data = mtcars); summary(fit.am)$coef[2, ]
## Estimate Std. Error t value Pr(>|t|)
## 7.2449392713 1.7644216316 4.1061269831 0.0002850207
The coefficients of the am-only model are significantly different from zero, saying that on average, a manual transmitted car has 7.245 MPG higher than an automatic transmitted car. However, the model has small adjusted R^2= 0.34, implying small explanatory power for MPG.
To do variable selection, we use forward and backward stepwise selection with AIC (Akaike’s ‘An Information Criterion’) and BIC (Bayesian Information Criterion) criteria. Forward stepwise selection starts with an intercept-only model, and then adds predictors to the model gradually until all of the predictors are in the model. At each step the variable that gives the greatest additional improvement to the fit is added to the model. Backward method, on the other hand, begins with the full model, and then removes the least useful predictor, one at the time.
Four models forward_aic, forward_bic, backward_aic and backward_bic are specified below:
forward_aic <- step(lm(mpg ~ 1, data = mtcars), direction = "forward",
scope = formula(fit.full), k = 2, trace = 0)
forward_bic <- step(lm(mpg ~ 1, data=mtcars), direction = "forward",
scope = formula(fit.full), k = log(32), trace = 0)
backward_aic <- step(fit.full, direction = "backward", k = 2, trace = 0)
backward_bic <- step(fit.full, direction = "backward", k = log(32), trace = 0)
forward_aic
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Coefficients:
## (Intercept) wt cyl hp
## 38.75179 -3.16697 -0.94162 -0.01804
forward_bic
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Coefficients:
## (Intercept) wt cyl
## 39.686 -3.191 -1.508
backward_aic
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
backward_bic
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
The model back.bic has the largest adjusted R^2= 0.834. It includes wt, qsec, and am has the most explanatory power for MPG. Under this model, a manual transmission car, on average, has 2.936 miles per gallon more than an automatic transmission car.
We then fit four possible interaction models fit.int, fit.int.aq, fit.int.aw and fit.int.wq to check if any interaction is needed to be in the model.
fit.int <- summary(lm(mpg ~ wt * qsec * am, data = mtcars))
fit.int.aq <- summary(lm(mpg ~ wt + qsec * am, data = mtcars))
fit.int.aw <- summary(lm(mpg ~ qsec + wt * am, data = mtcars))
fit.int.wq <- summary(lm(mpg ~ am + qsec * wt, data = mtcars))
fit.int
##
## Call:
## lm(formula = mpg ~ wt * qsec * am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3330 -1.4729 -0.4856 1.1495 4.0045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.8073 57.7960 -0.204 0.840
## wt 3.8670 17.0915 0.226 0.823
## qsec 2.2148 3.1734 0.698 0.492
## am -5.5842 65.4915 -0.085 0.933
## wt:qsec -0.3805 0.9467 -0.402 0.691
## wt:am 4.0906 20.5487 0.199 0.844
## qsec:am 1.1768 3.6310 0.324 0.749
## wt:qsec:am -0.5009 1.1648 -0.430 0.671
##
## Residual standard error: 2.123 on 24 degrees of freedom
## Multiple R-squared: 0.904, Adjusted R-squared: 0.8759
## F-statistic: 32.27 on 7 and 24 DF, p-value: 1.027e-10
fit.int.aq
##
## Call:
## lm(formula = mpg ~ wt + qsec * am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2822 -1.5572 -0.2782 1.1654 4.1293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.5290 7.2671 2.275 0.0311 *
## wt -3.7772 0.6711 -5.629 5.66e-06 ***
## qsec 0.8169 0.3299 2.477 0.0198 *
## am -15.6137 8.6237 -1.811 0.0814 .
## qsec:am 1.0600 0.4870 2.177 0.0384 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 27 degrees of freedom
## Multiple R-squared: 0.8721, Adjusted R-squared: 0.8532
## F-statistic: 46.03 on 4 and 27 DF, p-value: 1.119e-11
fit.int.aw
##
## Call:
## lm(formula = mpg ~ qsec + wt * am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5076 -1.3801 -0.5588 1.0630 4.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.723 5.899 1.648 0.110893
## qsec 1.017 0.252 4.035 0.000403 ***
## wt -2.937 0.666 -4.409 0.000149 ***
## am 14.079 3.435 4.099 0.000341 ***
## wt:am -4.141 1.197 -3.460 0.001809 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.084 on 27 degrees of freedom
## Multiple R-squared: 0.8959, Adjusted R-squared: 0.8804
## F-statistic: 58.06 on 4 and 27 DF, p-value: 7.168e-13
fit.int.wq
##
## Call:
## lm(formula = mpg ~ am + qsec * wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5999 -1.6316 -0.6345 1.3839 4.2888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.2272 28.0796 -0.720 0.4775
## am 2.8596 1.4075 2.032 0.0521 .
## qsec 2.8927 1.5466 1.870 0.0723 .
## wt 5.7172 8.8117 0.649 0.5219
## qsec:wt -0.5403 0.4926 -1.097 0.2824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.45 on 27 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8348
## F-statistic: 40.15 on 4 and 27 DF, p-value: 5.416e-11
Since model fit.int.aw has the largest adjusted R^2= 0.88, the model mpg = 9.723 + (1.017)qsec + (-2.937)wt + (14.079)am + (-4.141)wt*am is our final model. When am = 0, the slope of wt is -2.937 and the intercept is 9.723. When am = 1, the slope of wt is -7.078 and the intercept is 23.802. In term of uncertainty, the 95% confidence interval for the coefficients are shown below.
fit <- lm(mpg ~ qsec + wt * am, data = mtcars)
t(confint(fit))
## (Intercept) qsec wt am wt:am
## 2.5 % -2.380779 0.4998811 -4.303102 7.030875 -6.597032
## 97.5 % 21.826884 1.5340661 -1.569960 21.127981 -1.685721
summary(fit)
##
## Call:
## lm(formula = mpg ~ qsec + wt * am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5076 -1.3801 -0.5588 1.0630 4.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.723 5.899 1.648 0.110893
## qsec 1.017 0.252 4.035 0.000403 ***
## wt -2.937 0.666 -4.409 0.000149 ***
## am 14.079 3.435 4.099 0.000341 ***
## wt:am -4.141 1.197 -3.460 0.001809 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.084 on 27 degrees of freedom
## Multiple R-squared: 0.8959, Adjusted R-squared: 0.8804
## F-statistic: 58.06 on 4 and 27 DF, p-value: 7.168e-13
This model explains over 88.04% of the variance, which is better than previous models.
Residual Diagnostics
Some plots for residual diagnostics are shown in Figure 3 in Appendix. There is no particular pattern in residuals vs fitted, scale-location, and residuals vs leverage plots. For QQ-plot, it seems that the residual is a little bit right skewed, but it still can be seen as normal from Shapiro-Wilk normality test.
shapiro.test(fit$res)
##
## Shapiro-Wilk normality test
##
## data: fit$res
## W = 0.94444, p-value = 0.1001
round(hatvalues(fit)[hatvalues(fit) > 2*5/32], 2) # high leverage
## Merc 230 Lincoln Continental Lotus Europa
## 0.35 0.32 0.33
## Maserati Bora
## 0.37
round(dfbeta(fit)[which(hatvalues(fit) > 2*5/32), ], 2) # check influence
## (Intercept) qsec wt am wt:am
## Merc 230 1.53 -0.09 0.01 0.46 -0.19
## Lincoln Continental 1.87 -0.03 -0.37 -1.14 0.30
## Lotus Europa 0.12 -0.01 0.00 0.10 -0.04
## Maserati Bora 0.38 -0.02 -0.01 -1.37 0.64
Our model fit the data quite well. Although there are some high leverage points, we may still use this model for prediction as long as we pay attention to those data points with careful explanation.
Appendix
library("PerformanceAnalytics")
chart.Correlation(mtcars, histogram=TRUE, pch=11)
Figure 1: Decriptive Summary
boxplot(mpg~am, data = mtcars,
col = c("dark grey", "light grey"),
xlab = "Transmission",
ylab = "Miles per Gallon",
main = "MPG by Transmission Type")
Figure 2: MPG by Transmission Type
require(graphics)
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(fit)
Figure 3: Residual diagnostics