In this project, I will use a best-subset-of-predictors approach to fit multiple linear regression models and find which predictors create the most effective model. I will create models using Mallows’ Cp, forward selection, backward elimination, and stepwise selection to find which variables produce the best model. Then, I will repeat these steps after applying a logarithmic transformation to the response variable (the LogNassim column; the Log columns in the data match base-10 logarithms of the raw values). A portion of the dataset I will be using is shown below.
## Instar ActiveFeeding Fgp Mgp Mass LogMass Intake LogIntake WetFrass
## 1 1 Y Y Y 0.002064 -2.685290 0.165118 -0.7822056 0.000241
## 2 1 Y N N 0.005191 -2.284749 0.201008 -0.6967867 0.000063
## 3 2 N Y N 0.005603 -2.251579 0.189125 -0.7232511 0.001401
## 4 2 Y N N 0.019300 -1.714443 0.283280 -0.5477841 0.002045
## 5 2 N Y Y 0.029300 -1.533132 0.259569 -0.5857472 0.005377
## 6 3 Y Y N 0.062600 -1.203426 0.327864 -0.4843063 0.029500
## LogWetFrass DryFrass LogDryFrass Cassim LogCassim Nfrass LogNfrass
## 1 -3.617983 0.000208 -3.681937 0.01422378 -1.846985 6.61e-06 -5.179510
## 2 -4.200659 0.000061 -4.214670 0.01739189 -1.759653 1.03e-06 -5.986783
## 3 -2.853562 0.000969 -3.013676 0.01639923 -1.785177 2.78e-05 -4.555794
## 4 -2.689307 0.001834 -2.736601 0.02392468 -1.621154 4.64e-05 -4.333480
## 5 -2.269460 0.003523 -2.453087 0.02122857 -1.673079 9.97e-05 -4.001301
## 6 -1.530178 0.000789 -3.102923 0.02836365 -1.547238 1.84e-05 -4.735567
## Nassim LogNassim
## 1 0.001858999 -2.730721
## 2 0.002270091 -2.643957
## 3 0.002302210 -2.637855
## 4 0.003041352 -2.516933
## 5 0.002791898 -2.554100
## 6 0.003627464 -2.440397
##
## Call:
## lm(formula = Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass +
## Intake + WetFrass + DryFrass + Cassim + Nfrass, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0027588 -0.0001865 -0.0000518 0.0000977 0.0045538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.427e-05 1.885e-04 -0.182 0.855859
## Instar 2.006e-06 5.948e-05 0.034 0.973123
## ActiveFeedingY 9.271e-05 1.307e-04 0.709 0.478697
## FgpY -7.843e-05 1.483e-04 -0.529 0.597298
## MgpY 8.781e-05 1.211e-04 0.725 0.469109
## Mass 5.681e-05 4.617e-05 1.231 0.219635
## Intake -6.759e-03 7.115e-04 -9.500 < 2e-16 ***
## WetFrass -1.554e-03 4.078e-04 -3.809 0.000177 ***
## DryFrass 8.487e-02 5.100e-03 16.641 < 2e-16 ***
## Cassim 2.048e-01 7.483e-03 27.362 < 2e-16 ***
## Nfrass -9.653e-01 5.586e-02 -17.282 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0007354 on 243 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.998
## F-statistic: 1.287e+04 on 10 and 243 DF, p-value: < 2.2e-16
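This model uses all of the predictor variables provided by the data to fit a linear model for Nassim. As shown in the summary, the R-squared value is 0.9981, which means the model explains about 99.8% of the variation in Nassim, a nearly perfect linear fit. However, only Intake, WetFrass, DryFrass, Cassim, and Nfrass have significant coefficients, which suggests that a smaller model may fit just as well.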
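As a sanity check on the summary above, the R-squared and adjusted R-squared can be recomputed from sums of squares that appear in the selection traces in this report. The sketch below is written in Python purely for illustration (the analysis itself was run in R), and the variable names are my own. It assumes the total sum of squares equals the RSS of the intercept-only model (0.069734) and uses the RSS of the full model (0.00013140).

```python
# Recompute R-squared and adjusted R-squared for the full Nassim model
# from sums of squares reported in the selection traces.
n = 254            # 243 residual df + 11 estimated coefficients
p = 10             # predictor terms in the full model
tss = 0.069734     # RSS of the intercept-only model (Nassim ~ 1)
rss = 0.00013140   # RSS of the full 10-predictor model

r_squared = 1 - rss / tss
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R-squared:          {r_squared:.4f}")
print(f"Adjusted R-squared: {adj_r_squared:.4f}")
```

Both values agree with the summary output after rounding, which confirms that the reported fit follows directly from the residual and total sums of squares.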
## The best subset is:
## Intake, DryFrass, Cassim, Nfrass
## with Mallow's Cp =14.99251.
Based on Mallows’ Cp, the best subset of 4 predictor variables is Intake, DryFrass, Cassim, and Nfrass, with a Cp value of 14.99251. The Mallows’ Cp criterion states that a good model should have a Cp value roughly equal to the number of predictors in that model plus 1. Because this subset has 4 predictors, the target is about 5, so a Cp of 14.99251 is higher than ideal and suggests that the subset may be leaving out useful predictors. Among the subset sizes I tested, 4 predictors gave the best fitted model.
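To make the Cp criterion concrete, the value can be reproduced from the residual sums of squares reported in this output. This is a minimal sketch in Python (used only for illustration), assuming the usual definition Cp = RSS_subset / MSE_full - n + 2 * (number of coefficients, including the intercept). The function name `mallows_cp` is my own, and the RSS values are copied from the selection traces.

```python
def mallows_cp(rss_subset, n_coef, rss_full, df_full, n):
    """Mallows' Cp = RSS_subset / MSE_full - n + 2 * n_coef."""
    mse_full = rss_full / df_full
    return rss_subset / mse_full - n + 2 * n_coef

n = 254                              # observations used in the fits
rss_full, df_full = 0.00013140, 243  # full 10-predictor model

# 4-predictor subset: Cassim + Nfrass + DryFrass + Intake (5 coefficients)
cp_four = mallows_cp(0.00014005, 5, rss_full, df_full, n)

# For comparison: the 6-predictor model that also adds WetFrass and Mass
cp_six = mallows_cp(0.00013190, 7, rss_full, df_full, n)

print(f"4 predictors: Cp = {cp_four:.2f} (target about 5)")
print(f"6 predictors: Cp = {cp_six:.2f} (target about 7)")
```

The 4-predictor value matches the reported Cp up to rounding of the RSS figures. The target for each model is its own number of predictors plus 1, not the number of predictors in the full model.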
## Start: AIC=-2080.9
## Nassim ~ 1
##
## Df Sum of Sq RSS AIC
## + Cassim 1 0.068653 0.001081 -3137.3
## + Intake 1 0.066616 0.003118 -2868.2
## + DryFrass 1 0.056004 0.013730 -2491.7
## + Nfrass 1 0.049286 0.020448 -2390.5
## + WetFrass 1 0.046622 0.023112 -2359.4
## + Instar 1 0.036239 0.033496 -2265.2
## + Mass 1 0.026941 0.042793 -2202.9
## + ActiveFeeding 1 0.005302 0.064432 -2099.0
## + Mgp 1 0.000870 0.068864 -2082.1
## <none> 0.069734 -2080.9
## + Fgp 1 0.000010 0.069724 -2078.9
##
## Step: AIC=-3137.3
## Nassim ~ Cassim
##
## Df Sum of Sq RSS AIC
## + Nfrass 1 0.00066146 0.00041941 -3375.8
## + WetFrass 1 0.00065541 0.00042546 -3372.1
## + Mass 1 0.00037668 0.00070420 -3244.1
## + DryFrass 1 0.00031902 0.00076185 -3224.1
## + Intake 1 0.00027061 0.00081027 -3208.5
## + Fgp 1 0.00020422 0.00087665 -3188.5
## + ActiveFeeding 1 0.00008228 0.00099860 -3155.4
## + Mgp 1 0.00002349 0.00105738 -3140.9
## + Instar 1 0.00001551 0.00106536 -3139.0
## <none> 0.00108087 -3137.3
##
## Step: AIC=-3375.75
## Nassim ~ Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## + DryFrass 1 2.2225e-04 0.00019717 -3565.5
## + Intake 1 8.3816e-05 0.00033560 -3430.4
## + Instar 1 5.5925e-05 0.00036349 -3410.1
## + Mass 1 1.6583e-05 0.00040283 -3384.0
## + WetFrass 1 1.0456e-05 0.00040896 -3380.2
## + Fgp 1 1.0151e-05 0.00040926 -3380.0
## + Mgp 1 3.7440e-06 0.00041567 -3376.0
## <none> 0.00041941 -3375.8
## + ActiveFeeding 1 4.9600e-07 0.00041892 -3374.1
##
## Step: AIC=-3565.47
## Nassim ~ Cassim + Nfrass + DryFrass
##
## Df Sum of Sq RSS AIC
## + Intake 1 5.7116e-05 0.00014005 -3650.4
## + Mass 1 8.8580e-06 0.00018831 -3575.1
## + WetFrass 1 3.9390e-06 0.00019323 -3568.6
## + Instar 1 2.1660e-06 0.00019500 -3566.3
## <none> 0.00019717 -3565.5
## + ActiveFeeding 1 1.1730e-06 0.00019599 -3565.0
## + Fgp 1 2.6500e-07 0.00019690 -3563.8
## + Mgp 1 2.1800e-07 0.00019695 -3563.8
##
## Step: AIC=-3650.35
## Nassim ~ Cassim + Nfrass + DryFrass + Intake
##
## Df Sum of Sq RSS AIC
## + WetFrass 1 7.0914e-06 0.00013296 -3661.6
## <none> 0.00014005 -3650.4
## + Instar 1 4.5630e-07 0.00013959 -3649.2
## + Mass 1 1.8200e-07 0.00013987 -3648.7
## + Mgp 1 1.6950e-07 0.00013988 -3648.7
## + ActiveFeeding 1 8.2000e-09 0.00014004 -3648.4
## + Fgp 1 0.0000e+00 0.00014005 -3648.4
##
## Step: AIC=-3661.55
## Nassim ~ Cassim + Nfrass + DryFrass + Intake + WetFrass
##
## Df Sum of Sq RSS AIC
## + Mass 1 1.0564e-06 0.00013190 -3661.6
## <none> 0.00013296 -3661.6
## + Mgp 1 1.8389e-07 0.00013277 -3659.9
## + Instar 1 1.7244e-07 0.00013279 -3659.9
## + Fgp 1 6.9200e-08 0.00013289 -3659.7
## + ActiveFeeding 1 9.3100e-09 0.00013295 -3659.6
##
## Step: AIC=-3661.58
## Nassim ~ Cassim + Nfrass + DryFrass + Intake + WetFrass + Mass
##
## Df Sum of Sq RSS AIC
## <none> 0.00013190 -3661.6
## + ActiveFeeding 1 1.9569e-07 0.00013171 -3660.0
## + Mgp 1 1.8323e-07 0.00013172 -3659.9
## + Instar 1 1.7590e-09 0.00013190 -3659.6
## + Fgp 1 1.0440e-09 0.00013190 -3659.6
##
## Call:
## lm(formula = Nassim ~ Cassim + Nfrass + DryFrass + Intake + WetFrass +
## Mass, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0027704 -0.0001662 -0.0000398 0.0001088 0.0045810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.027e-06 6.633e-05 -0.015 0.987660
## Cassim 2.040e-01 7.375e-03 27.658 < 2e-16 ***
## Nfrass -9.622e-01 5.348e-02 -17.993 < 2e-16 ***
## DryFrass 8.374e-02 4.738e-03 17.676 < 2e-16 ***
## Intake -6.650e-03 6.955e-04 -9.562 < 2e-16 ***
## WetFrass -1.522e-03 3.941e-04 -3.862 0.000144 ***
## Mass 5.521e-05 3.925e-05 1.406 0.160839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0007308 on 247 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9981
## F-statistic: 2.172e+04 on 6 and 247 DF, p-value: < 2.2e-16
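Forward selection chose 6 predictor variables: Cassim, Nfrass, DryFrass, Intake, WetFrass, and Mass. This model has an R-squared of 0.9981, so these 6 variables fit the data essentially as well as all 10 predictors did. Note that Mass was added even though its coefficient is not significant (p = 0.161), because adding it lowered the AIC very slightly (from -3661.55 to -3661.58).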
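The AIC values in the selection trace can be checked by hand. The sketch below, in Python for illustration only, assumes the trace reports AIC on the scale n * log(RSS / n) + 2 * (number of coefficients), which is the scale R's `extractAIC` uses for linear models. The helper name `step_aic` is my own, and the RSS values are copied from the trace.

```python
import math

def step_aic(rss, n, n_coef):
    """AIC on the scale of the trace: n * log(RSS / n) + 2 * n_coef."""
    return n * math.log(rss / n) + 2 * n_coef

n = 254  # observations used in every fit

aic_null = step_aic(0.069734, n, 1)      # Nassim ~ 1
aic_cassim = step_aic(0.00108087, n, 2)  # Nassim ~ Cassim
aic_five = step_aic(0.00013296, n, 6)    # 5 predictors, before adding Mass
aic_six = step_aic(0.00013190, n, 7)     # 6 predictors, after adding Mass

for label, value in [("null", aic_null), ("Cassim", aic_cassim),
                     ("5 predictors", aic_five), ("6 predictors", aic_six)]:
    print(f"{label:>12}: AIC = {value:.2f}")
```

The last two values differ by only about 0.03, which shows how narrowly Mass enters the model: the drop in RSS barely outweighs the penalty of 2 for the extra coefficient.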
## Start: AIC=-3654.54
## Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + Intake +
## WetFrass + DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Instar 1 0.00000000 0.00013140 -3656.5
## - Fgp 1 0.00000015 0.00013155 -3656.3
## - ActiveFeeding 1 0.00000027 0.00013167 -3656.0
## - Mgp 1 0.00000028 0.00013169 -3656.0
## - Mass 1 0.00000082 0.00013222 -3655.0
## <none> 0.00013140 -3654.5
## - WetFrass 1 0.00000785 0.00013925 -3641.8
## - Intake 1 0.00004880 0.00018020 -3576.3
## - DryFrass 1 0.00014974 0.00028114 -3463.4
## - Nfrass 1 0.00016150 0.00029291 -3452.9
## - Cassim 1 0.00040486 0.00053626 -3299.3
##
## Step: AIC=-3656.54
## Nassim ~ ActiveFeeding + Fgp + Mgp + Mass + Intake + WetFrass +
## DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Fgp 1 0.00000015 0.00013156 -3658.2
## - ActiveFeeding 1 0.00000027 0.00013167 -3658.0
## - Mgp 1 0.00000028 0.00013169 -3658.0
## - Mass 1 0.00000098 0.00013238 -3656.7
## <none> 0.00013140 -3656.5
## - WetFrass 1 0.00000823 0.00013963 -3643.1
## - Intake 1 0.00004890 0.00018030 -3578.2
## - DryFrass 1 0.00015435 0.00028575 -3461.2
## - Nfrass 1 0.00016929 0.00030070 -3448.3
## - Cassim 1 0.00040488 0.00053628 -3301.3
##
## Step: AIC=-3658.25
## Nassim ~ ActiveFeeding + Mgp + Mass + Intake + WetFrass + DryFrass +
## Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Mgp 1 0.00000015 0.00013171 -3660.0
## - ActiveFeeding 1 0.00000016 0.00013172 -3659.9
## <none> 0.00013156 -3658.2
## - Mass 1 0.00000122 0.00013277 -3657.9
## - WetFrass 1 0.00000811 0.00013966 -3645.1
## - Intake 1 0.00004903 0.00018059 -3579.8
## - DryFrass 1 0.00016505 0.00029661 -3453.8
## - Nfrass 1 0.00017200 0.00030355 -3447.9
## - Cassim 1 0.00040767 0.00053922 -3301.9
##
## Step: AIC=-3659.96
## Nassim ~ ActiveFeeding + Mass + Intake + WetFrass + DryFrass +
## Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - ActiveFeeding 1 0.00000020 0.00013190 -3661.6
## <none> 0.00013171 -3660.0
## - Mass 1 0.00000124 0.00013295 -3659.6
## - WetFrass 1 0.00000811 0.00013981 -3646.8
## - Intake 1 0.00004900 0.00018071 -3581.6
## - DryFrass 1 0.00016696 0.00029866 -3454.0
## - Nfrass 1 0.00017187 0.00030357 -3449.9
## - Cassim 1 0.00040870 0.00054041 -3303.4
##
## Step: AIC=-3661.58
## Nassim ~ Mass + Intake + WetFrass + DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## <none> 0.00013190 -3661.6
## - Mass 1 0.00000106 0.00013296 -3661.6
## - WetFrass 1 0.00000797 0.00013987 -3648.7
## - Intake 1 0.00004882 0.00018073 -3583.6
## - DryFrass 1 0.00016685 0.00029875 -3455.9
## - Nfrass 1 0.00017289 0.00030479 -3450.8
## - Cassim 1 0.00040850 0.00054041 -3305.4
##
## Call:
## lm(formula = Nassim ~ Mass + Intake + WetFrass + DryFrass + Cassim +
## Nfrass, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0027704 -0.0001662 -0.0000398 0.0001088 0.0045810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.027e-06 6.633e-05 -0.015 0.987660
## Mass 5.521e-05 3.925e-05 1.406 0.160839
## Intake -6.650e-03 6.955e-04 -9.562 < 2e-16 ***
## WetFrass -1.522e-03 3.941e-04 -3.862 0.000144 ***
## DryFrass 8.374e-02 4.738e-03 17.676 < 2e-16 ***
## Cassim 2.040e-01 7.375e-03 27.658 < 2e-16 ***
## Nfrass -9.622e-01 5.348e-02 -17.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0007308 on 247 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9981
## F-statistic: 2.172e+04 on 6 and 247 DF, p-value: < 2.2e-16
Backward elimination arrived at the same 6 predictors (Mass, Intake, WetFrass, DryFrass, Cassim, and Nfrass). The fitted model is therefore identical to the forward selection model, with the same R-squared of 0.9981, so these predictors again provide a nearly perfect linear fit.
## Start: AIC=-2080.9
## Nassim ~ 1
##
## Call:
## lm(formula = Nassim ~ 1, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.016029 -0.011200 -0.008595 0.002465 0.050394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.013768 0.001042 13.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0166 on 253 degrees of freedom
## (13 observations deleted due to missingness)
## Start: AIC=-3654.54
## Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + Intake +
## WetFrass + DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Instar 1 0.00000000 0.00013140 -3656.5
## - Fgp 1 0.00000015 0.00013155 -3656.3
## - ActiveFeeding 1 0.00000027 0.00013167 -3656.0
## - Mgp 1 0.00000028 0.00013169 -3656.0
## - Mass 1 0.00000082 0.00013222 -3655.0
## <none> 0.00013140 -3654.5
## - WetFrass 1 0.00000785 0.00013925 -3641.8
## - Intake 1 0.00004880 0.00018020 -3576.3
## - DryFrass 1 0.00014974 0.00028114 -3463.4
## - Nfrass 1 0.00016150 0.00029291 -3452.9
## - Cassim 1 0.00040486 0.00053626 -3299.3
##
## Step: AIC=-3656.54
## Nassim ~ ActiveFeeding + Fgp + Mgp + Mass + Intake + WetFrass +
## DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Fgp 1 0.00000015 0.00013156 -3658.2
## - ActiveFeeding 1 0.00000027 0.00013167 -3658.0
## - Mgp 1 0.00000028 0.00013169 -3658.0
## - Mass 1 0.00000098 0.00013238 -3656.7
## <none> 0.00013140 -3656.5
## + Instar 1 0.00000000 0.00013140 -3654.5
## - WetFrass 1 0.00000823 0.00013963 -3643.1
## - Intake 1 0.00004890 0.00018030 -3578.2
## - DryFrass 1 0.00015435 0.00028575 -3461.2
## - Nfrass 1 0.00016929 0.00030070 -3448.3
## - Cassim 1 0.00040488 0.00053628 -3301.3
##
## Step: AIC=-3658.25
## Nassim ~ ActiveFeeding + Mgp + Mass + Intake + WetFrass + DryFrass +
## Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Mgp 1 0.00000015 0.00013171 -3660.0
## - ActiveFeeding 1 0.00000016 0.00013172 -3659.9
## <none> 0.00013156 -3658.2
## - Mass 1 0.00000122 0.00013277 -3657.9
## + Fgp 1 0.00000015 0.00013140 -3656.5
## + Instar 1 0.00000000 0.00013155 -3656.3
## - WetFrass 1 0.00000811 0.00013966 -3645.1
## - Intake 1 0.00004903 0.00018059 -3579.8
## - DryFrass 1 0.00016505 0.00029661 -3453.8
## - Nfrass 1 0.00017200 0.00030355 -3447.9
## - Cassim 1 0.00040767 0.00053922 -3301.9
##
## Step: AIC=-3659.96
## Nassim ~ ActiveFeeding + Mass + Intake + WetFrass + DryFrass +
## Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - ActiveFeeding 1 0.00000020 0.00013190 -3661.6
## <none> 0.00013171 -3660.0
## - Mass 1 0.00000124 0.00013295 -3659.6
## + Mgp 1 0.00000015 0.00013156 -3658.2
## + Fgp 1 0.00000002 0.00013169 -3658.0
## + Instar 1 0.00000000 0.00013171 -3658.0
## - WetFrass 1 0.00000811 0.00013981 -3646.8
## - Intake 1 0.00004900 0.00018071 -3581.6
## - DryFrass 1 0.00016696 0.00029866 -3454.0
## - Nfrass 1 0.00017187 0.00030357 -3449.9
## - Cassim 1 0.00040870 0.00054041 -3303.4
##
## Step: AIC=-3661.58
## Nassim ~ Mass + Intake + WetFrass + DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## <none> 0.00013190 -3661.6
## - Mass 1 0.00000106 0.00013296 -3661.6
## + ActiveFeeding 1 0.00000020 0.00013171 -3660.0
## + Mgp 1 0.00000018 0.00013172 -3659.9
## + Instar 1 0.00000000 0.00013190 -3659.6
## + Fgp 1 0.00000000 0.00013190 -3659.6
## - WetFrass 1 0.00000797 0.00013987 -3648.7
## - Intake 1 0.00004882 0.00018073 -3583.6
## - DryFrass 1 0.00016685 0.00029875 -3455.9
## - Nfrass 1 0.00017289 0.00030479 -3450.8
## - Cassim 1 0.00040850 0.00054041 -3305.4
##
## Call:
## lm(formula = Nassim ~ Mass + Intake + WetFrass + DryFrass + Cassim +
## Nfrass, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0027704 -0.0001662 -0.0000398 0.0001088 0.0045810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.027e-06 6.633e-05 -0.015 0.987660
## Mass 5.521e-05 3.925e-05 1.406 0.160839
## Intake -6.650e-03 6.955e-04 -9.562 < 2e-16 ***
## WetFrass -1.522e-03 3.941e-04 -3.862 0.000144 ***
## DryFrass 8.374e-02 4.738e-03 17.676 < 2e-16 ***
## Cassim 2.040e-01 7.375e-03 27.658 < 2e-16 ***
## Nfrass -9.622e-01 5.348e-02 -17.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0007308 on 247 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.9981
## F-statistic: 2.172e+04 on 6 and 247 DF, p-value: < 2.2e-16
This stepwise selection model uses the same 6 predictor variables as the last 2 models and results in the same R-squared value of 0.9981.
##
## Call:
## lm(formula = LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp +
## Mass + Intake + WetFrass + DryFrass + Cassim + Nfrass, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78898 -0.06022 0.00886 0.06858 0.24441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.007731 0.031993 -94.012 < 2e-16 ***
## Instar 0.159333 0.010096 15.781 < 2e-16 ***
## ActiveFeedingY 0.114118 0.022204 5.140 5.68e-07 ***
## FgpY -0.030847 0.025140 -1.227 0.221004
## MgpY 0.040446 0.020550 1.968 0.050186 .
## Mass 0.029500 0.007872 3.748 0.000223 ***
## Intake 0.075744 0.120911 0.626 0.531612
## WetFrass -0.163631 0.072638 -2.253 0.025176 *
## DryFrass 1.027704 0.865211 1.188 0.236074
## Cassim 1.675558 1.269629 1.320 0.188175
## Nfrass -24.273257 9.856176 -2.463 0.014485 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1247 on 242 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9414, Adjusted R-squared: 0.939
## F-statistic: 389 on 10 and 242 DF, p-value: < 2.2e-16
This model, which now uses the log-transformed response LogNassim, has an R-squared of 0.9414. This value is lower than the R-squared with untransformed Nassim as the response, but the two are not directly comparable because the response is on a different scale. An R-squared of 0.9414 still indicates a strong linear fit.
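It is worth confirming which logarithm the LogNassim column holds, since that determines how coefficients are interpreted. The quick check below (Python, for illustration only) compares a few Nassim and LogNassim pairs copied from the data preview against both the base-10 and the natural logarithm. The variable names are my own.

```python
import math

# (Nassim, LogNassim) pairs copied from rows 1, 2, and 6 of the data preview
rows = [
    (0.001858999, -2.730721),
    (0.002270091, -2.643957),
    (0.003627464, -2.440397),
]

# Largest absolute gap between the stored value and each candidate logarithm
max_gap_log10 = max(abs(math.log10(x) - y) for x, y in rows)
max_gap_ln = max(abs(math.log(x) - y) for x, y in rows)

print(f"max gap vs log10: {max_gap_log10:.6f}")
print(f"max gap vs ln:    {max_gap_ln:.6f}")
```

The stored values agree with the base-10 logarithm to the printed precision, while the natural logarithm is off by several units, so the transformed response is log10(Nassim).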
## The best subset is:
## Instar, ActiveFeedingY, Mass, Intake, Nfrass
## with Mallow's Cp =10.81244.
Using Mallows’ Cp and testing different combinations of predictor variables, I found that 5 variables (Instar, ActiveFeeding, Mass, Intake, and Nfrass) provide the best model. As explained above for the first Mallows’ Cp model, an acceptable Cp value is roughly the number of predictors in the model plus 1, which is 6 here. The Cp value for this model is 10.81244, which is above that target but closer to it than in the untransformed case, so the chosen predictors fit reasonably well.
## Start: AIC=-344.83
## LogNassim ~ 1
##
## Df Sum of Sq RSS AIC
## + Cassim 1 51.793 12.441 -758.13
## + Intake 1 51.343 12.891 -749.15
## + Instar 1 49.423 14.811 -714.03
## + DryFrass 1 45.252 18.982 -651.25
## + Nfrass 1 38.414 25.820 -573.41
## + WetFrass 1 35.542 28.692 -546.72
## + Mass 1 26.816 37.418 -479.54
## + ActiveFeeding 1 3.785 60.449 -358.19
## <none> 64.234 -344.83
## + Mgp 1 0.457 63.777 -344.63
## + Fgp 1 0.013 64.221 -342.88
##
## Step: AIC=-758.13
## LogNassim ~ Cassim
##
## Df Sum of Sq RSS AIC
## + Instar 1 7.0508 5.3905 -967.73
## + WetFrass 1 0.5231 11.9183 -767.00
## + Nfrass 1 0.4133 12.0280 -764.68
## <none> 12.4413 -758.13
## + Fgp 1 0.0474 12.3939 -757.10
## + ActiveFeeding 1 0.0435 12.3978 -757.02
## + Mass 1 0.0434 12.3979 -757.01
## + Intake 1 0.0255 12.4158 -756.65
## + DryFrass 1 0.0005 12.4408 -756.14
## + Mgp 1 0.0001 12.4412 -756.13
##
## Step: AIC=-967.73
## LogNassim ~ Cassim + Instar
##
## Df Sum of Sq RSS AIC
## + WetFrass 1 0.83640 4.5541 -1008.39
## + Nfrass 1 0.82454 4.5660 -1007.73
## + ActiveFeeding 1 0.76451 4.6260 -1004.43
## + DryFrass 1 0.46343 4.9271 -988.48
## + Mass 1 0.40932 4.9812 -985.71
## + Fgp 1 0.39412 4.9964 -984.94
## + Intake 1 0.29274 5.0978 -979.86
## + Mgp 1 0.21690 5.1736 -976.12
## <none> 5.3905 -967.73
##
## Step: AIC=-1008.39
## LogNassim ~ Cassim + Instar + WetFrass
##
## Df Sum of Sq RSS AIC
## + ActiveFeeding 1 0.34919 4.2049 -1026.6
## + Intake 1 0.08084 4.4733 -1010.9
## + Mass 1 0.07747 4.4766 -1010.7
## + DryFrass 1 0.07457 4.4795 -1010.6
## + Mgp 1 0.06495 4.4892 -1010.0
## + Fgp 1 0.06049 4.4936 -1009.8
## <none> 4.5541 -1008.4
## + Nfrass 1 0.01054 4.5436 -1007.0
##
## Step: AIC=-1026.58
## LogNassim ~ Cassim + Instar + WetFrass + ActiveFeeding
##
## Df Sum of Sq RSS AIC
## + Mass 1 0.220727 3.9842 -1038.2
## + DryFrass 1 0.056339 4.1486 -1028.0
## + Intake 1 0.044955 4.1600 -1027.3
## + Mgp 1 0.044135 4.1608 -1027.2
## <none> 4.2049 -1026.6
## + Nfrass 1 0.003303 4.2016 -1024.8
## + Fgp 1 0.000029 4.2049 -1024.6
##
## Step: AIC=-1038.22
## LogNassim ~ Cassim + Instar + WetFrass + ActiveFeeding + Mass
##
## Df Sum of Sq RSS AIC
## + Intake 1 0.076640 3.9075 -1041.1
## + DryFrass 1 0.055300 3.9289 -1039.8
## + Mgp 1 0.038576 3.9456 -1038.7
## <none> 3.9842 -1038.2
## + Nfrass 1 0.011178 3.9730 -1036.9
## + Fgp 1 0.004124 3.9801 -1036.5
##
## Step: AIC=-1041.13
## LogNassim ~ Cassim + Instar + WetFrass + ActiveFeeding + Mass +
## Intake
##
## Df Sum of Sq RSS AIC
## + Nfrass 1 0.077230 3.8303 -1044.2
## + Mgp 1 0.040835 3.8667 -1041.8
## <none> 3.9075 -1041.1
## + DryFrass 1 0.000736 3.9068 -1039.2
## + Fgp 1 0.000022 3.9075 -1039.1
##
## Step: AIC=-1044.18
## LogNassim ~ Cassim + Instar + WetFrass + ActiveFeeding + Mass +
## Intake + Nfrass
##
## Df Sum of Sq RSS AIC
## + Mgp 1 0.033554 3.7968 -1044.4
## <none> 3.8303 -1044.2
## + DryFrass 1 0.007178 3.8231 -1042.7
## + Fgp 1 0.000313 3.8300 -1042.2
##
## Step: AIC=-1044.41
## LogNassim ~ Cassim + Instar + WetFrass + ActiveFeeding + Mass +
## Intake + Nfrass + Mgp
##
## Df Sum of Sq RSS AIC
## <none> 3.7968 -1044.4
## + Fgp 1 0.013163 3.7836 -1043.3
## + DryFrass 1 0.011691 3.7851 -1043.2
##
## Call:
## lm(formula = LogNassim ~ Cassim + Instar + WetFrass + ActiveFeeding +
## Mass + Intake + Nfrass + Mgp, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77533 -0.05999 0.01044 0.07452 0.24687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.025013 0.029780 -101.580 < 2e-16 ***
## Cassim 0.619114 0.726558 0.852 0.39498
## Instar 0.161550 0.009955 16.229 < 2e-16 ***
## WetFrass -0.155714 0.072417 -2.150 0.03252 *
## ActiveFeedingY 0.104387 0.020658 5.053 8.53e-07 ***
## Mass 0.032654 0.007570 4.314 2.33e-05 ***
## Intake 0.183970 0.061253 3.003 0.00295 **
## Nfrass -19.120975 9.018416 -2.120 0.03500 *
## MgpY 0.025978 0.017690 1.468 0.14327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1247 on 244 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9409, Adjusted R-squared: 0.939
## F-statistic: 485.5 on 8 and 244 DF, p-value: < 2.2e-16
Forward selection chose 8 predictor variables (Cassim, Instar, WetFrass, ActiveFeeding, Mass, Intake, Nfrass, and Mgp). Using these predictors, the model has an R-squared of 0.9409. While this is slightly lower than the 0.9414 of the full LogNassim model, these variables still provide a very good fit. Note that Cassim and Mgp are not individually significant in this model.
## Start: AIC=-1042.76
## LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + Intake +
## WetFrass + DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Intake 1 0.0061 3.7678 -1044.35
## - DryFrass 1 0.0219 3.7836 -1043.29
## - Fgp 1 0.0234 3.7851 -1043.19
## - Cassim 1 0.0271 3.7887 -1042.94
## <none> 3.7617 -1042.76
## - Mgp 1 0.0602 3.8219 -1040.74
## - WetFrass 1 0.0789 3.8405 -1039.51
## - Nfrass 1 0.0943 3.8559 -1038.49
## - Mass 1 0.2183 3.9800 -1030.48
## - ActiveFeeding 1 0.4106 4.1723 -1018.55
## - Instar 1 3.8712 7.6329 -865.73
##
## Step: AIC=-1044.35
## LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + WetFrass +
## DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Fgp 1 0.0270 3.7947 -1044.54
## <none> 3.7678 -1044.35
## - Mgp 1 0.0683 3.8361 -1041.80
## - WetFrass 1 0.0804 3.8482 -1041.01
## - Nfrass 1 0.1007 3.8685 -1039.67
## - DryFrass 1 0.1690 3.9368 -1035.25
## - Mass 1 0.2167 3.9845 -1032.20
## - ActiveFeeding 1 0.4230 4.1908 -1019.43
## - Cassim 1 2.2823 6.0501 -926.53
## - Instar 1 3.8975 7.6653 -866.66
##
## Step: AIC=-1044.54
## LogNassim ~ Instar + ActiveFeeding + Mgp + Mass + WetFrass +
## DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## <none> 3.7947 -1044.54
## - Mgp 1 0.0431 3.8379 -1043.68
## - WetFrass 1 0.0735 3.8683 -1041.69
## - Nfrass 1 0.0874 3.8821 -1040.78
## - DryFrass 1 0.1424 3.9371 -1037.22
## - Mass 1 0.2449 4.0396 -1030.72
## - ActiveFeeding 1 0.4017 4.1964 -1021.09
## - Cassim 1 2.3977 6.1924 -922.65
## - Instar 1 3.9705 7.7652 -865.39
##
## Call:
## lm(formula = LogNassim ~ Instar + ActiveFeeding + Mgp + Mass +
## WetFrass + DryFrass + Cassim + Nfrass, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78035 -0.05863 0.00678 0.07494 0.25192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.021182 0.030145 -100.221 < 2e-16 ***
## Instar 0.160675 0.010056 15.978 < 2e-16 ***
## ActiveFeedingY 0.104865 0.020633 5.082 7.42e-07 ***
## MgpY 0.029488 0.017705 1.666 0.09709 .
## Mass 0.029448 0.007421 3.968 9.53e-05 ***
## WetFrass -0.157402 0.072381 -2.175 0.03062 *
## DryFrass 1.278422 0.422482 3.026 0.00274 **
## Cassim 2.497754 0.201163 12.417 < 2e-16 ***
## Nfrass -22.962052 9.684892 -2.371 0.01852 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1247 on 244 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9409, Adjusted R-squared: 0.939
## F-statistic: 485.8 on 8 and 244 DF, p-value: < 2.2e-16
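Backward elimination also selected 8 predictor variables, but not exactly the same set as forward selection: it kept DryFrass and dropped Intake, whereas forward selection included Intake and not DryFrass. The R-squared is the same to four decimal places (0.9409), so it is an equally good linear model.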
## Start: AIC=-344.83
## LogNassim ~ 1
##
## Call:
## lm(formula = LogNassim ~ 1, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9611 -0.4284 -0.1305 0.3669 0.9611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.15381 0.03174 -67.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5049 on 252 degrees of freedom
## (14 observations deleted due to missingness)
## Start: AIC=-1042.76
## LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + Intake +
## WetFrass + DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Intake 1 0.0061 3.7678 -1044.35
## - DryFrass 1 0.0219 3.7836 -1043.29
## - Fgp 1 0.0234 3.7851 -1043.19
## - Cassim 1 0.0271 3.7887 -1042.94
## <none> 3.7617 -1042.76
## - Mgp 1 0.0602 3.8219 -1040.74
## - WetFrass 1 0.0789 3.8405 -1039.51
## - Nfrass 1 0.0943 3.8559 -1038.49
## - Mass 1 0.2183 3.9800 -1030.48
## - ActiveFeeding 1 0.4106 4.1723 -1018.55
## - Instar 1 3.8712 7.6329 -865.73
##
## Step: AIC=-1044.35
## LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + WetFrass +
## DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## - Fgp 1 0.0270 3.7947 -1044.54
## <none> 3.7678 -1044.35
## + Intake 1 0.0061 3.7617 -1042.76
## - Mgp 1 0.0683 3.8361 -1041.80
## - WetFrass 1 0.0804 3.8482 -1041.01
## - Nfrass 1 0.1007 3.8685 -1039.67
## - DryFrass 1 0.1690 3.9368 -1035.25
## - Mass 1 0.2167 3.9845 -1032.20
## - ActiveFeeding 1 0.4230 4.1908 -1019.43
## - Cassim 1 2.2823 6.0501 -926.53
## - Instar 1 3.8975 7.6653 -866.66
##
## Step: AIC=-1044.54
## LogNassim ~ Instar + ActiveFeeding + Mgp + Mass + WetFrass +
## DryFrass + Cassim + Nfrass
##
## Df Sum of Sq RSS AIC
## <none> 3.7947 -1044.54
## + Fgp 1 0.0270 3.7678 -1044.35
## - Mgp 1 0.0431 3.8379 -1043.68
## + Intake 1 0.0097 3.7851 -1043.19
## - WetFrass 1 0.0735 3.8683 -1041.69
## - Nfrass 1 0.0874 3.8821 -1040.78
## - DryFrass 1 0.1424 3.9371 -1037.22
## - Mass 1 0.2449 4.0396 -1030.72
## - ActiveFeeding 1 0.4017 4.1964 -1021.09
## - Cassim 1 2.3977 6.1924 -922.65
## - Instar 1 3.9705 7.7652 -865.39
##
## Call:
## lm(formula = LogNassim ~ Instar + ActiveFeeding + Mgp + Mass +
## WetFrass + DryFrass + Cassim + Nfrass, data = caterpillars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78035 -0.05863 0.00678 0.07494 0.25192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.021182 0.030145 -100.221 < 2e-16 ***
## Instar 0.160675 0.010056 15.978 < 2e-16 ***
## ActiveFeedingY 0.104865 0.020633 5.082 7.42e-07 ***
## MgpY 0.029488 0.017705 1.666 0.09709 .
## Mass 0.029448 0.007421 3.968 9.53e-05 ***
## WetFrass -0.157402 0.072381 -2.175 0.03062 *
## DryFrass 1.278422 0.422482 3.026 0.00274 **
## Cassim 2.497754 0.201163 12.417 < 2e-16 ***
## Nfrass -22.962052 9.684892 -2.371 0.01852 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1247 on 244 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9409, Adjusted R-squared: 0.939
## F-statistic: 485.8 on 8 and 244 DF, p-value: < 2.2e-16
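Stepwise selection starting from the full model selected the same predictor variables as backward elimination (Instar, ActiveFeeding, Mgp, Mass, WetFrass, DryFrass, Cassim, and Nfrass). This differs from the forward selection model, which has Intake in place of DryFrass. The R-squared is still 0.9409.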
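Because the three LogNassim selection procedures all end with 8 predictors and the same rounded R-squared, it is easy to miss where they differ. The short Python sketch below (illustrative only) compares the predictor sets copied from the final `lm` formulas in the output. The variable names are my own.

```python
# Predictor sets taken from the final lm() formulas for LogNassim
forward = {"Cassim", "Instar", "WetFrass", "ActiveFeeding",
           "Mass", "Intake", "Nfrass", "Mgp"}
backward = {"Instar", "ActiveFeeding", "Mgp", "Mass",
            "WetFrass", "DryFrass", "Cassim", "Nfrass"}
stepwise = {"Instar", "ActiveFeeding", "Mgp", "Mass",
            "WetFrass", "DryFrass", "Cassim", "Nfrass"}

only_forward = forward - backward   # chosen by forward selection only
only_backward = backward - forward  # kept by backward elimination only

print("Only in forward: ", sorted(only_forward))
print("Only in backward:", sorted(only_backward))
print("Stepwise equals backward:", stepwise == backward)
```

The comparison shows that the procedures share 7 predictors and differ in one: forward selection includes Intake, while backward elimination and stepwise selection include DryFrass.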