This project evaluates the fuel efficiency of automatic and manual transmissions using data provided by Motor Trend magazine. The dataset includes information on 10 aspects of automobile design and performance for 32 separate 1973-74 automobile models.
The best linear-fit model for predicting mpg is as follows:
mpg = 9.62 - 3.92(wt) + 1.23(qsec) + 2.94(am)
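For illustration, the equation can be evaluated directly; the input values below describe a hypothetical car (roughly 3,000 lbs, an 18-second quarter mile, manual transmission) and are not taken from the dataset:
# hypothetical car: wt = 3 (in 1,000 lbs), qsec = 18 (seconds), am = 1 (manual)
9.62 - 3.92 * 3 + 1.23 * 18 + 2.94 * 1   # predicted mpg, roughly 22.9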
We begin with “housekeeping” steps of establishing a project folder and loading the associated dataset and supplemental packages:
setwd("~/R/07 Regression Models/finalProject")  # project working directory
attach(mtcars)     # mtcars ships with base R; attach so columns can be referenced directly
library(graphics)  # plotting functions (loaded by default; listed for completeness)
library(stats)     # modeling functions such as lm() and step()
The mtcars dataset has 11 variables, including the dependent variable (mpg) plus 10 independent variables, as shown below:
head(mtcars, 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
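A quick check of the dataset's dimensions confirms the 32 models and 11 variables described above:
dim(mtcars)
## [1] 32 11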
Next, we examine the correlations that each of the 10 independent variables have with mpg:
sapply(mtcars, function(x) cor(mpg, x))
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
Each of the variables has at least a moderate degree of correlation with mpg. The variables gear and qsec have the two weakest relationships (with correlations of 0.4802848 and 0.4186840, respectively).
For the next step, we’ll explore the correlations between mpg and the stronger independent variables (removing both gear and qsec).
pairs(mtcars[c('mpg','cyl','disp','hp','drat','wt','vs','am','carb')],
      panel = panel.smooth,
      main = 'Relationship between mpg and variables with stronger correlations', col = 3)
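To make the correlation ranking used above explicit, the variables can also be sorted by the absolute value of their correlation with mpg (a convenience step, not required for the plot):
# rank all columns (including mpg itself) by strength of correlation with mpg
sort(abs(sapply(mtcars, function(x) cor(mpg, x))), decreasing = TRUE)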
To begin modeling, we create a simple linear model with am as the only predictor variable:
lm01 <- lm(mpg ~ factor(am))
The summary statistics of this simple model are as follows:
summary(lm01)$coef ; print(c('R-squared', round(summary(lm01)$r.sq,2)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## factor(am)1 7.244939 1.764422 4.106127 2.850207e-04
## [1] "R-squared" "0.36"
The large t-value (or small p-value) suggests that am has explanatory significance; however, the low R-squared of 0.36 suggests that a multi-factor model may be more appropriate.
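As a sanity check on the coefficients above (an optional aside), the group means of mpg by transmission type should reproduce the intercept (automatic cars) and the intercept plus the am coefficient (manual cars):
# mean mpg by transmission type: am = 0 (automatic) vs. am = 1 (manual)
tapply(mpg, am, mean)   # should match 17.15 and 17.15 + 7.24 = 24.39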
Next, we'll add a confounder to the simple linear model. In selecting this confounding variable, we want to take into account its correlation with mpg (higher being preferable) and with the other independent variable, am (closer to zero being preferable). A good candidate is the weight (wt) variable.
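A quick check of wt against both mpg and am illustrates this reasoning (a rough screen only, not a formal test):
# wt should correlate strongly with mpg and, ideally, only weakly with am
cor(wt, mpg)
cor(wt, am)
If the second correlation turns out to be far from zero, that is an early warning of the multicollinearity issue discussed below.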
lm02 <- lm(mpg ~ factor(am) + wt)
The summary statistics of the updated model (02) are as follows:
summary(lm02)$coef; print(c('R-squared', round(summary(lm02)$r.sq,2)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.32155131 3.0546385 12.21799285 5.843477e-13
## factor(am)1 -0.02361522 1.5456453 -0.01527855 9.879146e-01
## wt -5.35281145 0.7882438 -6.79080719 1.867415e-07
## [1] "R-squared" "0.75"
Adding wt as a confounder increases R-squared from 36 percent to 75 percent. However, we see a sign reversal of the am coefficient, and its t-value drops to -0.015, to the point of insignificance. This signals a potential multicollinearity problem.
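One way to examine this suspicion, assuming the car package is installed (it is not used elsewhere in this report), is to compute variance inflation factors for the two-predictor model:
# variance inflation factors; values well above 1 (commonly > 5) suggest collinearity
library(car)
vif(lm02)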
Continuing in stepwise fashion, we add the next-most correlated variable: cyl.
lm03 <- lm(mpg ~ factor(am) + wt + factor(cyl))
The summary statistics of the updated model (03) are as follows:
summary(lm03)$coef; print(c('R-squared', round(summary(lm03)$r.sq,2)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.7535920 2.8134831 11.9970836 2.495549e-12
## factor(am)1 0.1501031 1.3002231 0.1154441 9.089474e-01
## wt -3.1495978 0.9080495 -3.4685309 1.770987e-03
## factor(cyl)6 -4.2573185 1.4112394 -3.0167231 5.514697e-03
## factor(cyl)8 -6.0791189 1.6837131 -3.6105432 1.227964e-03
## [1] "R-squared" "0.84"
Similar to before, R-squared has increased (this time, from 75 percent to 84 percent). However, the am variable produces a t-value too low to conclude that it has significant explanatory power.
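Before turning to automated selection, the three nested models can also be compared with an analysis of variance; this is a supplementary check, and the report's model selection proceeds via the step function below:
# nested model comparison: does each added term significantly reduce the residual error?
anova(lm01, lm02, lm03)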
Backward stepwise regression automates the otherwise tedious process of choosing the best subset of variables for the model. We employ the step function to perform this.
full <- lm(mpg ~ ., data=mtcars)
stepSol <- step(full, direction = "backward")
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
best <- stepSol$call
best
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
Our best model includes wt, qsec, and am; its summary statistics are as follows:
lmbest <- lm(mpg ~ wt + qsec + am, data=mtcars)
summary(lmbest)$coef;print(c('R-squared',round(summary(lmbest)$r.sq,2)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## am 2.935837 1.4109045 2.080819 4.671551e-02
## [1] "R-squared" "0.85"
Each independent variable is statistically significant at an alpha = 5% level. The R-squared of the model is 85 percent.
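The same conclusion can be read from 95 percent confidence intervals for the coefficients; the intervals for wt, qsec, and am should not contain zero:
# 95% confidence intervals for the coefficients of the best model
confint(lmbest)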
Finally, we review the residuals. A well-specified model will produce residuals that are random in nature (i.e., they do not exhibit any discernible pattern).
plot(lmbest,which=1)
The residuals appear to be random. We also want to check that the residuals are approximately normally distributed. From the QQ plot below, the residuals appear to be normally distributed.
plot(lmbest,which=2)
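As a supplementary check on normality (optional, since the QQ plot is the primary diagnostic here), a Shapiro-Wilk test can be applied to the residuals; a large p-value is consistent with normally distributed residuals:
# formal normality check of the residuals from the best model
shapiro.test(residuals(lmbest))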