We made a model using all of the variables we were given but that
might not be our best model as we may be able to create a better model
using less predictors. Here we will go into some steps on how to reduce
our model and explain how one can tell if one model is better than the
other.
Step AIC
Now that we feel we have two comparable models we will run what is
called Akaike Information Criteria (AIC) analysis for each of them in
the backwards direction. This means that R is taking our total model
with all of the variables in it and taking one out at a time so that our
model’s AIC value improves every time, where the smaller the value the
better the model. When the AIC cannot get any better, it stops and R
tells us that this is the best model to use. It basically gets rid of
the worst predictor at every step until it reaches a point where it
would be removing a good predictor instead of a bad one. We should be
left with our best predictors after using the stepAIC() command on our
model which is a part of the R package MASS. An important thing to note
is that an AIC value decrease of at least 10 is necessary to justify
using a more complicated model over a simpler one.
Mallow’s \(C_P\) is another way to
test models for adding or removing predictors. Mallows \(C_P=\Sigma\left(Y_i-Y^2\right) / M S E\)
Ideally, this would roughly equal \(k+1\) but it is not used by many
statisticians so it is better to use AIC as AIC is a much more well
known test.
library(MASS)
stepAIC(mreg)
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
First running this on our original model with every variable, R gives
us an output telling us to keep only 3 variables as predictors: quarter
mile time, weight, and transmission type.
Now if we run this on our transformation model we get:
stepAIC(mreg1)
## Start: AIC=65.93
## mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + vs + am +
## gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.0001 126.28 63.928
## - drat 1 0.0721 126.35 63.947
## - qsec 1 0.3118 126.59 64.007
## - am 1 1.3473 127.63 64.268
## - wt 1 2.5214 128.80 64.561
## - cyl 1 5.1370 131.42 65.204
## - log(hp) 1 7.7323 134.01 65.830
## <none> 126.28 65.928
## - gear 1 8.1782 134.46 65.936
## - log(disp) 1 9.8204 136.10 66.325
## - carb 1 13.3555 139.63 67.146
##
## Step: AIC=63.93
## mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + am + gear +
## carb
##
## Df Sum of Sq RSS AIC
## - drat 1 0.0720 126.35 61.947
## - qsec 1 0.3497 126.63 62.017
## - am 1 1.4160 127.70 62.285
## - wt 1 2.5408 128.82 62.566
## - cyl 1 5.4640 131.74 63.284
## - log(hp) 1 7.9966 134.28 63.893
## <none> 126.28 63.928
## - gear 1 8.2188 134.50 63.946
## - log(disp) 1 9.9454 136.22 64.354
## - carb 1 13.3899 139.67 65.153
##
## Step: AIC=61.95
## mpg ~ cyl + log(disp) + log(hp) + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - qsec 1 0.3086 126.66 60.025
## - am 1 1.4720 127.82 60.317
## - wt 1 2.5345 128.89 60.582
## - cyl 1 5.3971 131.75 61.285
## <none> 126.35 61.947
## - log(hp) 1 8.5533 134.91 62.043
## - gear 1 8.6525 135.00 62.066
## - log(disp) 1 10.1591 136.51 62.421
## - carb 1 13.3729 139.72 63.166
##
## Step: AIC=60.02
## mpg ~ cyl + log(disp) + log(hp) + wt + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - am 1 1.1724 127.83 58.320
## - wt 1 2.3372 129.00 58.610
## - cyl 1 5.1185 131.78 59.292
## <none> 126.66 60.025
## - gear 1 8.7232 135.38 60.156
## - log(hp) 1 9.3330 135.99 60.300
## - log(disp) 1 12.4852 139.15 61.033
## - carb 1 15.9928 142.65 61.830
##
## Step: AIC=58.32
## mpg ~ cyl + log(disp) + log(hp) + wt + gear + carb
##
## Df Sum of Sq RSS AIC
## - wt 1 2.7307 130.56 56.996
## - cyl 1 6.6592 134.49 57.945
## <none> 127.83 58.320
## - log(hp) 1 8.7242 136.56 58.432
## - gear 1 15.8483 143.68 60.060
## - log(disp) 1 16.0475 143.88 60.104
## - carb 1 16.7435 144.58 60.258
##
## Step: AIC=57
## mpg ~ cyl + log(disp) + log(hp) + gear + carb
##
## Df Sum of Sq RSS AIC
## - log(hp) 1 7.249 137.81 56.725
## <none> 130.56 56.996
## - cyl 1 13.526 144.09 58.150
## - gear 1 32.779 163.34 62.164
## - carb 1 38.572 169.13 63.279
## - log(disp) 1 53.640 184.20 66.010
##
## Step: AIC=56.73
## mpg ~ cyl + log(disp) + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 8.707 146.52 56.686
## <none> 137.81 56.725
## - gear 1 26.481 164.29 60.349
## - carb 1 60.918 198.73 66.439
## - log(disp) 1 97.216 235.03 71.807
##
## Step: AIC=56.69
## mpg ~ log(disp) + gear + carb
##
## Df Sum of Sq RSS AIC
## <none> 146.52 56.686
## - gear 1 20.189 166.71 58.817
## - carb 1 52.570 199.09 64.497
## - log(disp) 1 152.562 299.08 77.519
##
## Call:
## lm(formula = mpg ~ log(disp) + gear + carb, data = mtcars)
##
## Coefficients:
## (Intercept) log(disp) gear carb
## 51.789 -6.592 1.787 -1.227
This output tells us to keep only 3 variables as well: log of
displacement, number of gears, and number of carburetors. Then we want
to run a model for each using the recommended variables.
mreg2<-lm(mpg~qsec+wt+am, data=mtcars)
summary(mreg2)
##
## Call:
## lm(formula = mpg ~ qsec + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## qsec 1.2259 0.2887 4.247 0.000216 ***
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
mreg3<-lm(mpg~log(disp)+gear+carb, data=mtcars)
summary(mreg3)
##
## Call:
## lm(formula = mpg ~ log(disp) + gear + carb, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0461 -1.3931 -0.5111 1.8053 4.2983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.7887 8.5069 6.088 1.45e-06 ***
## log(disp) -6.5917 1.2208 -5.399 9.31e-06 ***
## gear 1.7869 0.9097 1.964 0.05950 .
## carb -1.2271 0.3872 -3.170 0.00368 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.288 on 28 degrees of freedom
## Multiple R-squared: 0.8699, Adjusted R-squared: 0.8559
## F-statistic: 62.4 on 3 and 28 DF, p-value: 1.62e-12
Here we can see that the model with the transformation in it does not
improve much compared to the reduced model without any transformations.
It also is harder to interpret transformed models so we will stick with
model2 instead.
Partial F Test
Instead of using the stepAIC() command, we could also have created a
nested model and used the anova(first model, second model) command to
test if the nested model was an improvement to the parent model. This
command is testing the null hypothesis where the predictors that are
being dropped are equal to 0 against the alternative hypothesis that
these predictors are not equal to 0. If the nested model is an
improvement to the parent model, we do not want to reject the null
hypothesis. If we were to reject the null hypothesis, it would mean that
the predictors that we dropped were significant to the model and should
not have been dropped.
nestmreg = lm(mpg ~ wt + qsec + am,data=mtcars)
anova(mreg,nestmreg)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 147.49
## 2 28 169.29 -7 -21.791 0.4432 0.8636
From the ANOVA table above, we can see that the p-value is quite
large so we do not have enough evidence to reject the null hypothesis.
This means that the nested model using the predictors for weight,
quarter mile time, and the transmission type is an improvement to the
model using all of the predictors since we were able to create a simpler
model that still was accurate in predicting the response variable
mpg.