Comparing Regressions of Caterpillar Data

Introduction

We summarize and compare several multiple regression models for predicting the “Nassim” variable in the caterpillars data set: the full model with every predictor, a model whose predictors are chosen by step-wise selection, and a model whose predictors are chosen by best-subsets selection.

Full Regression

As a benchmark, we fit a full multiple regression model for the response variable “Nassim” using all 17 explanatory variables. The model, as well as its residual plots, is shown below:

## Coefficients:

##    (Intercept)         Instar ActiveFeedingY           FgpY           MgpY 
##   1.639127e-02  -9.304661e-05   4.698194e-05  -5.283075e-06   8.976834e-05 
##           Mass        LogMass         Intake      LogIntake       WetFrass 
##   2.023160e-04  -6.015621e-05  -6.182131e-03  -2.049434e-03  -1.782492e-03 
##    LogWetFrass       DryFrass    LogDryFrass         Cassim      LogCassim 
##  -8.957653e-05   8.005301e-02   3.510954e-04   1.913683e-01  -1.163670e-02 
##         Nfrass      LogNfrass      LogNassim 
##  -8.117942e-01  -5.172215e-04   1.502860e-02

## R^2: 0.9986702

## Adjusted R^2: 0.998574

The adjusted \(R^2\) is already very high, about \(0.99857\). A fit this close to perfect is suspicious, particularly because the predictors include LogNassim, a log transform of the response itself. The Residuals vs. Fitted plot suggests heteroscedasticity, and the Q-Q plot suggests non-normality of the residuals. We now compare this benchmark with the models produced by the two more selective methods of choosing predictor variables.

Step-Wise Regression

## Coefficients:

##   (Intercept)        Instar          Mass        Intake     LogIntake 
##  0.0184393124 -0.0002659334  0.0001919871 -0.0060236145 -0.0027401603 
##      WetFrass      DryFrass        Cassim     LogCassim        Nfrass 
## -0.0017780093  0.0796411148  0.1901236405 -0.0107774810 -0.8270508772 
##     LogNassim 
##  0.0146487557

## R^2: 0.9986522

## Adjusted R^2: 0.9985965

These metrics are very similar to those of the full model, even though this model keeps only 10 explanatory variables (the other 7 were eliminated in the step-wise process). The adjusted \(R^2\) is slightly higher, about \(0.99860\).
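The relationship between \(R^2\) and adjusted \(R^2\) can be checked by hand. The sketch below, written in Python, applies the usual formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\) to the \(R^2\) values reported so far. The sample size is not stated in this report; `N_OBS = 253` is an assumption, chosen only because it reproduces the reported adjusted values, and the function name `adjusted_r2` is likewise illustrative.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumption: the number of observations is NOT given in the report.
# 253 is inferred because it matches the reported adjusted R^2 values.
N_OBS = 253

# (reported R^2, number of explanatory variables) for each model so far
models = {
    "full": (0.9986702, 17),
    "step-wise": (0.9986522, 10),
}

for name, (r2, p) in models.items():
    print(name, round(adjusted_r2(r2, N_OBS, p), 7))
```

Because the penalty factor \((n - 1)/(n - p - 1)\) shrinks when predictors are dropped, a model with a slightly lower \(R^2\) can still end up with a slightly higher adjusted \(R^2\), which is the pattern seen here.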

Best-Subsets Regression

## Coefficients:

##       (Intercept)    catData$Instar      catData$Mass    catData$Intake 
##      0.0359740398     -0.0002621775      0.0005760971      0.0111901241 
## catData$LogIntake  catData$WetFrass  catData$DryFrass catData$LogCassim 
##     -0.0101161564     -0.0017476348     -0.0127869055     -0.0142286595 
##    catData$Nfrass catData$LogNfrass catData$LogNassim 
##     -0.6487169513      0.0002282162      0.0251371255

## R^2: 0.9939156

## Adjusted R^2: 0.9936642

Lastly, the output above is the regression on the predictors chosen by best-subsets selection. Its adjusted \(R^2\) is about \(0.99366\), the lowest so far. This model also has 10 explanatory variables.

Both of these models also show signs of heteroscedasticity and non-normality of residuals, so we will instead log-transform the response variable “Nassim” to (hopefully) normalize the residuals and remove the heteroscedasticity.

Log-Transformed Full Regression

## Coefficients:

##    (Intercept)         Instar ActiveFeedingY           FgpY           MgpY 
##   8.294449e-07  -2.604299e-08  -9.959367e-09  -1.042860e-08  -1.049890e-08 
##           Mass        LogMass         Intake      LogIntake       WetFrass 
##   1.676669e-10  -2.644552e-08   4.790745e-08  -5.678284e-07   1.235749e-08 
##    LogWetFrass       DryFrass    LogDryFrass         Cassim      LogCassim 
##   4.240415e-08   7.294222e-08  -3.336659e-08  -6.949477e-07   7.740038e-07 
##         Nfrass      LogNfrass      LogNassim 
##  -6.375277e-06   9.396602e-08   2.302585e+00

## R^2: 1

## Adjusted R^2: 1

Log-Transformed Step-Wise Regression

## Coefficients:

##   (Intercept)        Instar     LogIntake        Cassim     LogCassim 
##  7.517063e-07 -2.713820e-08 -5.022659e-07 -1.851571e-07  6.949363e-07 
##     LogNfrass     LogNassim 
##  8.287052e-08  2.302585e+00

## R^2: 1

## Adjusted R^2: 1

Log-Transformed Best-Subsets Regression

## Coefficients:

##       (Intercept)    catData$Instar      catData$MgpY catData$LogIntake 
##      8.512167e-07     -3.238548e-08     -1.443589e-08     -5.286853e-07 
##    catData$Cassim catData$LogCassim catData$LogNfrass catData$LogNassim 
##     -1.853543e-07      6.974869e-07      9.023190e-08      2.302585e+00

## R^2: 1

## Adjusted R^2: 1
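
One possible reading of the three outputs above: the LogNassim coefficient is 2.302585 in every model, which matches \(\ln 10\), while all other coefficients are close to zero. That is the pattern expected if LogNassim is the base-10 log of Nassim and the transformed response is its natural log. The Python sketch below illustrates the effect with a one-predictor least-squares fit; the `nassim` values are hypothetical, not taken from the caterpillars data, and `simple_ols` is an illustrative helper.

```python
import math


def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical response values (assumption: any positive numbers would do).
nassim = [0.0005, 0.002, 0.004, 0.011, 0.025, 0.06]

log10_nassim = [math.log10(v) for v in nassim]  # plays the role of the predictor LogNassim
ln_nassim = [math.log(v) for v in nassim]       # plays the role of the log-transformed response

intercept, slope, r2 = simple_ols(log10_nassim, ln_nassim)
print(slope, math.log(10), r2)
```

Since \(\ln x = \ln(10)\,\log_{10} x\) exactly, the fitted slope agrees with \(\ln 10\) and \(R^2\) is 1 up to floating-point error, regardless of the values used.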

Conclusion

The \(R^2\) and adjusted \(R^2\) values of all three log-transformed models are exactly 1, which is not a believable fit. In each of them the LogNassim coefficient is 2.302585, which equals \(\ln 10\), while every other coefficient is near zero, so the log-transformed response appears to be reproduced almost entirely from its own base-10 log. Together with the high \(R^2\) of the untransformed models, this suggests a very high degree of multicollinearity in the data, which would need to be addressed to create a reliable regression for predicting “Nassim”. The log-transformed models are slightly better in terms of normality of the residuals and heteroscedasticity, but none of the models presented is likely to be appropriate, and further correlation analysis should be done on the data.
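
As a starting point for that correlation analysis, the Python sketch below flags pairs of columns whose Pearson correlation exceeds a threshold. The column values are hypothetical stand-ins rather than the caterpillars data, and the helper names `pearson` and `high_correlation_pairs` and the 0.9 threshold are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def high_correlation_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair of columns with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical values (assumption): NOT the real caterpillars measurements.
mass = [0.002, 0.01, 0.05, 0.3, 1.5, 6.0]
columns = {
    "Mass": mass,
    "LogMass": [math.log10(m) for m in mass],
    "Intake": [0.01, 0.03, 0.2, 0.9, 3.8, 12.5],
    "Instar": [1, 2, 3, 4, 5, 5],
}

for a, b, r in high_correlation_pairs(columns, threshold=0.9):
    print(a, b, round(r, 3))
```

Pairwise correlations only catch two-variable relationships; a predictor that is nearly a combination of several others would need a different diagnostic, such as variance inflation factors.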