We summarize and compare several multiple regression models for predicting the “Nassim” variable from the caterpillars data set: the full-predictor model, a model with predictors chosen step-wise, and a model with predictors chosen by best subsets.
As a benchmark, we first fit a full multiple regression model for the response variable “Nassim” using all 17 explanatory variables. The model and its residual plots are shown below:
## Coefficients:
## (Intercept) Instar ActiveFeedingY FgpY MgpY
## 1.639127e-02 -9.304661e-05 4.698194e-05 -5.283075e-06 8.976834e-05
## Mass LogMass Intake LogIntake WetFrass
## 2.023160e-04 -6.015621e-05 -6.182131e-03 -2.049434e-03 -1.782492e-03
## LogWetFrass DryFrass LogDryFrass Cassim LogCassim
## -8.957653e-05 8.005301e-02 3.510954e-04 1.913683e-01 -1.163670e-02
## Nfrass LogNfrass LogNassim
## -8.117942e-01 -5.172215e-04 1.502860e-02
## R^2: 0.9986702
## Adjusted R^2: 0.998574
The full model already has a very high adjusted \(R^2\) of about \(0.99857\), which is suspicious. The Residuals vs. Fitted plot suggests heteroscedasticity, and the Q-Q plot suggests non-normality of the residuals. We now compare this to the models produced by our more selective methods of choosing predictor variables, starting with the step-wise model.
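For reference, a full-model fit of the kind described above can be sketched in base R as follows. This is a minimal sketch, not the analysis code itself: the data frame `sim` is synthetic and only borrows a few column names from the caterpillars data, and the coefficients used to generate it are made up for illustration.

```r
# Synthetic stand-in for the caterpillars data (assumption: the real data
# frame is not reproduced here; only a few column names are borrowed).
set.seed(1)
n <- 200
Mass   <- rexp(n) + 0.01
Intake <- 0.5 * Mass + rexp(n, rate = 4) + 0.01
sim <- data.frame(
  Mass = Mass, LogMass = log10(Mass),
  Intake = Intake, LogIntake = log10(Intake)
)
# Hypothetical response with multiplicative noise, so its spread grows
# with its mean on the raw scale.
sim$Nassim <- 0.01 * sim$Intake * exp(rnorm(n, sd = 0.2))

# Full model: regress the response on every other column.
full_fit <- lm(Nassim ~ ., data = sim)
fit_summary <- summary(full_fit)

print(coef(full_fit))
cat("R^2:", fit_summary$r.squared, "\n")
cat("Adjusted R^2:", fit_summary$adj.r.squared, "\n")

# Diagnostic plots: which = 1 is Residuals vs Fitted, which = 2 is Normal Q-Q.
if (interactive()) {
  par(mfrow = c(1, 2))
  plot(full_fit, which = 1:2)
}
```

The output that follows comes from the original analysis, not from this sketch.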
## Coefficients:
## (Intercept) Instar Mass Intake LogIntake
## 0.0184393124 -0.0002659334 0.0001919871 -0.0060236145 -0.0027401603
## WetFrass DryFrass Cassim LogCassim Nfrass
## -0.0017780093 0.0796411148 0.1901236405 -0.0107774810 -0.8270508772
## LogNassim
## 0.0146487557
## R^2: 0.9986522
## Adjusted R^2: 0.9985965
The metrics of the step-wise model look very similar to those of the full model, although it has fewer explanatory variables (10 rather than 17, as some were eliminated in the step-wise process). The adjusted \(R^2\) is slightly higher, about \(0.99860\).
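The step-wise procedure is not shown in the report, so the following is only a sketch of one common way to do it, using `step()` from base R on synthetic data. The direction (`"both"`) and the default AIC criterion are assumptions, as are the data frame `sim` and its generating coefficients.

```r
# Synthetic data: two informative predictors and two pure-noise columns
# (hypothetical; not the caterpillars data).
set.seed(2)
n <- 200
sim <- data.frame(
  Mass   = rexp(n) + 0.01,
  Intake = rexp(n) + 0.01,
  Noise1 = rnorm(n),
  Noise2 = rnorm(n)
)
sim$Nassim <- 0.02 * sim$Mass + 0.05 * sim$Intake + rnorm(n, sd = 0.01)

# Start from the full model and let step() add or drop terms by AIC.
full_fit <- lm(Nassim ~ ., data = sim)
step_fit <- step(full_fit, direction = "both", trace = 0)

print(coef(step_fit))
cat("Adjusted R^2 (full):     ", summary(full_fit)$adj.r.squared, "\n")
cat("Adjusted R^2 (step-wise):", summary(step_fit)$adj.r.squared, "\n")
```

Because `step()` only drops a term when doing so lowers AIC, the selected model keeps the informative predictors and usually ends up with an adjusted \(R^2\) very close to that of the full model, which is the pattern reported above. The output that follows comes from the original analysis, not from this sketch.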
## Coefficients:
## (Intercept) catData$Instar catData$Mass catData$Intake
## 0.0359740398 -0.0002621775 0.0005760971 0.0111901241
## catData$LogIntake catData$WetFrass catData$DryFrass catData$LogCassim
## -0.0101161564 -0.0017476348 -0.0127869055 -0.0142286595
## catData$Nfrass catData$LogNfrass catData$LogNassim
## -0.6487169513 0.0002282162 0.0251371255
## R^2: 0.9939156
## Adjusted R^2: 0.9936642
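Lastly, the output above is the regression on the predictors chosen by best subsets. Its adjusted \(R^2\) is about \(0.99366\), the lowest so far, and the model has 10 explanatory variables. Since an exhaustive search over the same candidate predictors should find a 10-variable model with an \(R^2\) at least as high as the 10-variable step-wise model, this lower value suggests that the best-subsets search was restricted in some way or used a different candidate set.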
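To make the idea of best-subsets selection concrete, here is a small exhaustive search written in base R on synthetic data. It is a sketch under stated assumptions: the report does not say which function or criterion was used (the `leaps` package's `regsubsets()` is a common choice), so ranking by adjusted \(R^2\) here is an illustrative assumption, and `sim` is hypothetical data.

```r
# Synthetic data with four candidate predictors (hypothetical).
set.seed(3)
n <- 150
sim <- data.frame(
  Mass     = rexp(n) + 0.01,
  Intake   = rexp(n) + 0.01,
  WetFrass = rexp(n) + 0.01,
  Noise    = rnorm(n)
)
sim$Nassim <- 0.02 * sim$Mass + 0.05 * sim$Intake + rnorm(n, sd = 0.01)

predictors <- setdiff(names(sim), "Nassim")

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset and record its adjusted R^2.
adj_r2 <- vapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Nassim"), data = sim)
  summary(fit)$adj.r.squared
}, numeric(1))

# Keep the subset with the highest adjusted R^2 and refit it.
best_vars <- subsets[[which.max(adj_r2)]]
best_fit  <- lm(reformulate(best_vars, response = "Nassim"), data = sim)

cat("Best subset:", paste(best_vars, collapse = ", "), "\n")
cat("Adjusted R^2:", summary(best_fit)$adj.r.squared, "\n")
```

An exhaustive search like this fits \(2^p - 1\) models for \(p\) candidates, which is 15 here but 131,071 for 17 predictors, so dedicated implementations use branch-and-bound rather than brute force.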
Both of these models also suggest heteroscedasticity and non-normality of the residuals, as the full model did, so we instead log-transform the response variable “Nassim” to (hopefully) normalize the residuals and remove the heteroscedasticity. The three models are refit below in the same order: full, step-wise, and best subsets.
## Coefficients:
## (Intercept) Instar ActiveFeedingY FgpY MgpY
## 8.294449e-07 -2.604299e-08 -9.959367e-09 -1.042860e-08 -1.049890e-08
## Mass LogMass Intake LogIntake WetFrass
## 1.676669e-10 -2.644552e-08 4.790745e-08 -5.678284e-07 1.235749e-08
## LogWetFrass DryFrass LogDryFrass Cassim LogCassim
## 4.240415e-08 7.294222e-08 -3.336659e-08 -6.949477e-07 7.740038e-07
## Nfrass LogNfrass LogNassim
## -6.375277e-06 9.396602e-08 2.302585e+00
## R^2: 1
## Adjusted R^2: 1
## Coefficients:
## (Intercept) Instar LogIntake Cassim LogCassim
## 7.517063e-07 -2.713820e-08 -5.022659e-07 -1.851571e-07 6.949363e-07
## LogNfrass LogNassim
## 8.287052e-08 2.302585e+00
## R^2: 1
## Adjusted R^2: 1
## Coefficients:
## (Intercept) catData$Instar catData$MgpY catData$LogIntake
## 8.512167e-07 -3.238548e-08 -1.443589e-08 -5.286853e-07
## catData$Cassim catData$LogCassim catData$LogNfrass catData$LogNassim
## -1.853543e-07 6.974869e-07 9.023190e-08 2.302585e+00
## R^2: 1
## Adjusted R^2: 1
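The \(R^2\) and adjusted \(R^2\) values of all three log-transformed models are reported as 1, which is unreasonable for real data. In each of them the coefficient on LogNassim is \(2.302585\), which is \(\ln 10\), while every other coefficient is close to zero, so the log-transformed response appears to be reproduced almost exactly from LogNassim (apparently the base-10 log of “Nassim”). Together with the high \(R^2\) of the previous models, this suggests a very high degree of multicollinearity in the data, including predictors derived from the response itself. This would need to be addressed to create a reliable regression for predicting “Nassim”. The log-transformed models are, however, slightly better in terms of both normality of the residuals and heteroscedasticity. As presented, the models would likely not be appropriate, and further correlation analysis should be done on the data.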
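In the log-transformed output above, the LogNassim coefficient is \(2.302585\), which equals \(\ln 10\). The sketch below reproduces that pattern on synthetic data, assuming (this is not confirmed by the report) that LogNassim is the base-10 logarithm of Nassim; the data frame `sim` and the extra predictor are hypothetical.

```r
# Hypothetical positive response and a predictor that is its base-10 log.
set.seed(4)
n <- 100
Nassim <- rlnorm(n, meanlog = -4, sdlog = 1)
sim <- data.frame(
  Nassim    = Nassim,
  LogNassim = log10(Nassim),   # assumption: base-10 log of the response
  Intake    = rlnorm(n)        # unrelated predictor
)

# Regress the natural log of the response on predictors that include
# its own base-10 log: log(y) = log(10) * log10(y) exactly.
fit <- lm(log(Nassim) ~ LogNassim + Intake, data = sim)

# summary() warns about an essentially perfect fit; silence it here.
fit_summary <- suppressWarnings(summary(fit))

print(coef(fit))
cat("log(10):", log(10), "\n")
cat("R^2:", fit_summary$r.squared, "\n")

# A correlation check flags the problem before any model is fit.
cat("cor(log(Nassim), LogNassim):",
    cor(log(sim$Nassim), sim$LogNassim), "\n")
```

If this is what happened in the analysis, dropping LogNassim (and any other predictor computed from the response) before refitting would be the first step of the correlation analysis recommended above.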