This project evaluates fuel efficiency of automatic and manual transmissions using data provided by Motor Trend magazine. This dataset includes information on 10 aspects of auto design for 32 seperate 1973-74 automobile models.

Executive Summary

The best linear-fit model in predicting mpg is as follows:
mpg = -3.92(wt) + 2.94(am) + 1.23(qsec) + 9.62

Dataset Exploration

We begin with “housekeeping” steps of establishing a project folder and loading the associated dataset and supplemental packages:

setwd("~/R/07 Regression Models/finalProject")
attach(mtcars)
library(graphics)
library(stats)

The mtcars dataset has 11 variables, including the dependent variable (mpg) plus 10 independent variables, as shown below:

head(mtcars, 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1


Next, we examine the correlations that each of the 10 independent variables have with mpg:

sapply(mtcars,function(x) 
  cor(mpg,x))
##        mpg        cyl       disp         hp       drat         wt 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594 
##       qsec         vs         am       gear       carb 
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251


Each of the variables has at least a moderate degree of correlation with mpg. The variables gear and qsec have the two weakest relationships (with correlations of 0.4802848 and 0.4186840, respectively).

For the next step, we’ll explore the correlations between mpg and the stronger independent variables (removing both gear and qsec).

pairs(mtcars[c('mpg','cyl','disp','hp','drat','wt','vs','vs','am','carb')],
      panel=panel.smooth,
      main='Relationship between mpg and variables with stronger correlations', col=3)


Fitting Models

First, we create a simple linear model with am as the only predictor variable:

lm01 <- lm(mpg ~ factor(am))


The summary statistics of this simple model are as follows:

summary(lm01)$coef ; print(c('R-squared', round(summary(lm01)$r.sq,2)))
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## factor(am)1  7.244939   1.764422  4.106127 2.850207e-04
## [1] "R-squared" "0.36"

The large t-value (or small p-value) suggests that am is of explanatory significance; however, the low r-squred of 0.36 suggests that a multi-factor model may be more appropriate.

Next, we’ll add a cofounder to the simple linear model. In selecting this cofounder variable, we’ll want to take into account correlation with mpg (higher being preferable) and with the other independent variable (the more uncorrelated, or closer to zero, being preferable). A good candidate is the weight(wt) variable.

lm02 <- lm(mpg ~ factor(am) + wt)


The summary statistics of the updated model (02) are as follows:

summary(lm02)$coef; print(c('R-squared', round(summary(lm02)$r.sq,2)))
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) 37.32155131  3.0546385 12.21799285 5.843477e-13
## factor(am)1 -0.02361522  1.5456453 -0.01527855 9.879146e-01
## wt          -5.35281145  0.7882438 -6.79080719 1.867415e-07
## [1] "R-squared" "0.75"

Adding wt as a cofounder increases R-squared from 36 percent to 75 percent. However, we see a sign reversal of the am variable, as well as a decrease in t-value (-0.00528) to the point of insignificance. This signals a potential multicollinearity problem.

Continuing in stepwise fashion, we add the next-most correlated variable: cyl.

lm03 <- lm(mpg ~ factor(am) + wt + factor(cyl))


The summary statistics of the updated model (03) are as follows:

summary(lm03)$coef; print(c('R-squared', round(summary(lm03)$r.sq,2)))
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  33.7535920  2.8134831 11.9970836 2.495549e-12
## factor(am)1   0.1501031  1.3002231  0.1154441 9.089474e-01
## wt           -3.1495978  0.9080495 -3.4685309 1.770987e-03
## factor(cyl)6 -4.2573185  1.4112394 -3.0167231 5.514697e-03
## factor(cyl)8 -6.0791189  1.6837131 -3.6105432 1.227964e-03
## [1] "R-squared" "0.84"


Similar to before, r-squared has increased (this time, from 75 percent to 84 percent). However, the am variable produces a t-value too low to conclude that it is of signifcant explanatory power.

Selecting the Best Linear Fit

Backward stepwise regression is an approach automates an otherwise tedious approach to choosing the best subset of variables in model creation. We employ the step function to perform this.

full <- lm(mpg ~ ., data=mtcars)
stepSol <- step(full, direction = "backward")
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
best <- stepSol$call
best
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)

Our best model includes wt, qsec and am and the summary statistics of which are as follows:

lmbest <- lm(mpg ~ wt + qsec + am, data=mtcars)
summary(lmbest)$coef;print(c('R-squared',round(summary(lmbest)$r.sq,2)))
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## am           2.935837  1.4109045  2.080819 4.671551e-02
## [1] "R-squared" "0.85"

Each independent variable is statistically sigificant at an alpha = 5% level. The R-squared of the model is 85 percent.
Finally, we will review the residuals. A well-specified model will result in residuals that are more random in nature (i.e., do not exhibit any discernible pattern).

plot(lmbest,which=1)

The residuals appear to be random. We also want to check that the residuals are approximately normally distributed. From the QQ plot below, the residuals appear to be normally distributed.

plot(lmbest,which=2)