Multiple Linear Regression and Model Selection for Nassim and Log(Nassim) in the Caterpillars Dataset
Author
Sandeep
Introduction
The Caterpillars dataset contains biological and feeding-related measurements of caterpillars. Understanding how these variables relate to nitrogen assimilation (Nassim) is important for studying caterpillar growth and metabolism.
This project examines how multiple explanatory variables such as feeding behavior, mass, intake, and frass measurements affect Nassim. Using multiple linear regression and model selection methods from Chapter 4, we aim to identify the best subset of predictors.
library(readr)
library(dplyr)
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ActiveFeeding, Fgp, Mgp
dbl (15): Instar, Mass, LogMass, Intake, LogIntake, WetFrass, LogWetFrass, D...
The Caterpillars dataset contains 253 observations and 12 variables. Nassim is the response variable, and LogNassim is its transformed version. The dataset includes both numerical and categorical predictors, which were used to analyze factors affecting nitrogen assimilation.
The results show that Intake, WetFrass, DryFrass, Cassim, and Nfrass are statistically significant predictors of Nassim. The model has a very high fit (Adjusted R-squared = 0.9981), meaning it explains almost all of the variation in Nassim.
Across the candidate subsets, adding predictors improves the fit, with Adjusted R-squared reaching 0.9981. Mallows’ Cp is lowest at around 5 to 6 predictors, so the 5-variable model was selected as the best balance between simplicity and accuracy.
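The comparison of Adjusted R-squared and Mallows’ Cp across candidate models can be sketched in base R as follows. This is a minimal illustration, not the analysis code used for this report: the data frame `sim`, its columns `x1`, `x2`, `x3`, `y`, and the helper `mallows_cp()` are all hypothetical stand-ins, and the data are simulated rather than taken from the Caterpillars dataset.

```r
# Simulated stand-in data; NOT the Caterpillars measurements.
set.seed(1)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.8 * sim$x2 + rnorm(n, sd = 0.5)  # x3 is pure noise

# The full model supplies the error-variance estimate used by Mallows' Cp.
full <- lm(y ~ x1 + x2 + x3, data = sim)
mse_full <- sum(resid(full)^2) / df.residual(full)

# Cp = SSE_m / MSE_full + 2 * (m + 1) - n, where m is the number of predictors.
mallows_cp <- function(fit) {
  m <- length(coef(fit)) - 1
  sum(resid(fit)^2) / mse_full + 2 * (m + 1) - n
}

# A few hand-picked candidate models (hypothetical).
candidates <- list(
  "x1"           = lm(y ~ x1, data = sim),
  "x1 + x2"      = lm(y ~ x1 + x2, data = sim),
  "x1 + x2 + x3" = full
)

comparison <- data.frame(
  model  = names(candidates),
  adj_r2 = sapply(candidates, function(f) summary(f)$adj.r.squared),
  cp     = sapply(candidates, mallows_cp),
  row.names = NULL
)
print(comparison)
```

By construction, Cp for the full model equals its number of coefficients (here 4), so the useful comparison is among the smaller models: a candidate whose Cp is small and close to its own number of coefficients shows little evidence of omitted predictors.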
The final model for Nassim includes Intake, WetFrass, DryFrass, Cassim, and Nfrass. It was selected because it gives a very high fit while using fewer predictors. All five variables are statistically significant: DryFrass and Cassim have positive coefficients, while Intake, WetFrass, and Nfrass have negative coefficients.
The model fits extremely well, with Adjusted R-squared = 0.9981.
Diagnostics
plot(raw_bic_model, which = 1)  # Residuals vs Fitted
plot(raw_bic_model, which = 2)  # Normal Q-Q
plot(raw_bic_model, which = 3)  # Scale-Location
plot(raw_bic_model, which = 4)  # Cook's distance
Residuals vs Fitted
The points are mostly scattered around zero with no clear pattern. This suggests that the linearity assumption is reasonable.
Q-Q Plot
Most points lie close to the straight line, but a few points deviate at the ends. This means the residuals are approximately normal, with some slight outliers.
Scale-Location Plot
The spread of points increases slightly as the fitted values increase. This suggests mildly non-constant variance, but the departure is small enough to be acceptable.
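A visual reading of the Scale-Location plot can be backed up with a numeric check. The sketch below is one hedged option, not part of the original analysis: it regresses the squared residuals on the fitted values and uses n times the R-squared of that auxiliary regression as a chi-square statistic with 1 degree of freedom (a studentized Breusch-Pagan-style check against fitted values). The stand-in model uses R's built-in `mtcars` data, and the names `fit`, `aux_data`, and `aux` are hypothetical; in practice `raw_bic_model` would take the place of `fit`.

```r
# Stand-in model on built-in data; substitute the fitted model of interest.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: do squared residuals trend with the fitted values?
aux_data <- data.frame(sq_resid = resid(fit)^2, fitted = fitted(fit))
aux <- lm(sq_resid ~ fitted, data = aux_data)

# n * R^2 is compared with a chi-square distribution on 1 degree of freedom.
stat <- nrow(aux_data) * summary(aux)$r.squared
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

cat("statistic:", round(stat, 3), " p-value:", round(p_value, 3), "\n")
```

A small p-value would point toward variance that changes with the fitted values; a large one is consistent with roughly constant variance, though it does not prove it.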
Cook’s Distance
A few observations (94, 169, and 241) have the largest Cook’s distances, but most points have low influence. No single observation appears to dominate the fitted model.
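The points labelled on the Cook’s distance plot can also be listed numerically. The following is a minimal sketch under stated assumptions: it uses a stand-in model fitted to R's built-in `mtcars` data (the object `fit` is hypothetical, and `raw_bic_model` would replace it in practice), and the cutoffs 0.5 and 1 are common rules of thumb rather than formal tests.

```r
# Stand-in model on built-in data; substitute the fitted model of interest.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One Cook's distance per observation.
cd <- cooks.distance(fit)

# The three largest values: the points plot(fit, which = 4) labels by default.
top3 <- sort(cd, decreasing = TRUE)[1:3]
print(round(top3, 3))

# Rule-of-thumb flags (conventions, not strict limits).
moderate <- names(cd)[cd > 0.5]
strong   <- names(cd)[cd > 1]
print(moderate)
print(strong)
```

Refitting the model without the flagged observations and comparing the coefficients is a simple follow-up when any point exceeds these cutoffs.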
The full log(Nassim) model shows that Instar, ActiveFeeding, Mass, WetFrass, and Nfrass are significant predictors. The model fits well, with Adjusted R-squared = 0.939, explaining most of the variation in log(Nassim).
The results show that Adjusted R-squared rises to about 0.939 as predictors are added, while Mallows’ Cp is lowest at around 6 to 7 predictors. A simpler 5-predictor model was therefore chosen as the best balance between accuracy and simplicity.
The selected model includes Instar, ActiveFeeding, Mass, Intake, and Nfrass. It keeps a strong fit while using fewer predictors than the full model.
Call:
lm(formula = LogNassim ~ Instar + ActiveFeeding + Mass + Intake +
    Nfrass, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-0.77994 -0.06252  0.00544  0.07396  0.28969

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)
(Intercept)     -3.014412   0.028548 -105.592  < 2e-16 ***
Instar           0.161217   0.009425   17.104  < 2e-16 ***
ActiveFeedingY   0.102743   0.020712    4.961 1.31e-06 ***
Mass             0.028634   0.007212    3.971 9.41e-05 ***
Intake           0.240280   0.015419   15.583  < 2e-16 ***
Nfrass         -37.805541   4.749233   -7.960 6.24e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1259 on 247 degrees of freedom
Multiple R-squared:  0.9391,    Adjusted R-squared:  0.9378
F-statistic: 761.3 on 5 and 247 DF,  p-value: < 2.2e-16
The final model for Log(Nassim) includes Instar, ActiveFeeding, Mass, Intake, and Nfrass. All variables are statistically significant. Instar, ActiveFeeding, Mass, and Intake have positive effects, while Nfrass has a negative effect.
The model fits well with Adjusted R-squared = 0.9378.
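Because the response is on a log scale, each coefficient describes a multiplicative change in Nassim rather than an additive one. The sketch below back-transforms the coefficients reported in the summary output. The base of the logarithm behind LogNassim is not stated in this report, so both base 10 and the natural log are shown as assumptions; the object names (`b`, `effects`) are illustrative only.

```r
# Coefficients copied from the summary output of the LogNassim model.
b <- c(Instar = 0.161217, ActiveFeedingY = 0.102743, Mass = 0.028634,
       Intake = 0.240280, Nfrass = -37.805541)

# Multiplicative change in Nassim for a one-unit increase in a predictor,
# holding the others fixed. Which column applies depends on the log base,
# which is an assumption here.
mult_base10  <- 10^b     # if LogNassim = log10(Nassim)
mult_natural <- exp(b)   # if LogNassim = log(Nassim), natural log

effects <- data.frame(
  coefficient     = b,
  factor_if_log10 = round(mult_base10, 3),
  factor_if_ln    = round(mult_natural, 3)
)
print(effects)
```

Factors above 1 correspond to the positive coefficients and factors below 1 to the negative one. A full one-unit change may be much larger than the range a predictor actually covers in the data, so a smaller increment can give a more meaningful reading for a variable such as Nfrass.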
Diagnostics
plot(log_bic_model, which =1)
plot(log_bic_model, which =2)
plot(log_bic_model, which =3)
plot(log_bic_model, which =4)
Residuals vs Fitted
The points are fairly scattered around zero with a slight curve, indicating the model is mostly appropriate but may have a small non-linear pattern.
Q-Q Plot
Most points lie close to the straight line, suggesting the residuals are approximately normal, with a few outliers at the ends.
Scale-Location Plot
The spread of points is fairly constant, showing that the variance is reasonably stable.
Cook’s Distance
A few points (such as 42, 59, and 63) show higher influence, but most observations have low influence.
Conclusion
This project used multiple linear regression and model selection techniques to identify the factors affecting Nassim in caterpillars.
The results showed that Intake, WetFrass, DryFrass, Cassim, and Nfrass are the most important predictors of Nassim. The final model provided an excellent fit with Adjusted R-squared = 0.9981, explaining almost all the variation in the data.
When using log(Nassim), the best model included Instar, ActiveFeeding, Mass, Intake, and Nfrass, and also showed a strong fit with Adjusted R-squared = 0.9378.
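Overall, both models performed well. The Nassim model has the higher Adjusted R-squared, although the two values are measured on different response scales and are not directly comparable. The log-transformed model provides an alternative way to understand the relationships.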