Multiple Linear Regression and Model Selection for Nassim and Log(Nassim) in the Caterpillars Dataset

Author

Sandeep

Introduction

The Caterpillars dataset contains biological and feeding-related measurements of caterpillars. Understanding how these variables relate to nitrogen assimilation (Nassim) is important for studying caterpillar growth and metabolism.

This project examines how multiple explanatory variables such as feeding behavior, mass, intake, and frass measurements affect Nassim. Using multiple linear regression and model selection methods from Chapter 4, we aim to identify the best subset of predictors.

library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(leaps)
library(ggplot2)
caterpillars <- read_csv("Caterpillars.csv")
Rows: 267 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): ActiveFeeding, Fgp, Mgp
dbl (15): Instar, Mass, LogMass, Intake, LogIntake, WetFrass, LogWetFrass, D...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dat <- caterpillars %>%
  select(Nassim, LogNassim, Instar, ActiveFeeding, Fgp, Mgp,
         Mass, Intake, WetFrass, DryFrass, Cassim, Nfrass) %>%
  na.omit()

dat$ActiveFeeding <- as.factor(dat$ActiveFeeding)
dat$Fgp <- as.factor(dat$Fgp)
dat$Mgp <- as.factor(dat$Mgp)

str(dat)
tibble [253 × 12] (S3: tbl_df/tbl/data.frame)
 $ Nassim       : num [1:253] 0.00186 0.00227 0.0023 0.00304 0.00279 ...
 $ LogNassim    : num [1:253] -2.73 -2.64 -2.64 -2.52 -2.55 ...
 $ Instar       : num [1:253] 1 1 2 2 2 3 3 4 4 4 ...
 $ ActiveFeeding: Factor w/ 2 levels "N","Y": 2 2 1 2 1 2 1 2 2 1 ...
 $ Fgp          : Factor w/ 2 levels "N","Y": 2 1 2 1 2 2 1 2 2 1 ...
 $ Mgp          : Factor w/ 2 levels "N","Y": 2 1 1 1 2 1 1 2 1 1 ...
 $ Mass         : num [1:253] 0.00206 0.00519 0.0056 0.0193 0.0293 ...
 $ Intake       : num [1:253] 0.165 0.201 0.189 0.283 0.26 ...
 $ WetFrass     : num [1:253] 0.000241 0.000063 0.001401 0.002045 0.005377 ...
 $ DryFrass     : num [1:253] 0.000208 0.000061 0.000969 0.001834 0.003523 ...
 $ Cassim       : num [1:253] 0.0142 0.0174 0.0164 0.0239 0.0212 ...
 $ Nfrass       : num [1:253] 6.61e-06 1.03e-06 2.78e-05 4.64e-05 9.97e-05 ...
 - attr(*, "na.action")= 'omit' Named int [1:14] 15 17 69 158 174 175 188 190 193 208 ...
  ..- attr(*, "names")= chr [1:14] "15" "17" "69" "158" ...

The Caterpillars dataset contains 253 observations and 12 variables. Nassim is the response variable, and LogNassim is its transformed version. The dataset includes both numerical and categorical predictors, which were used to analyze factors affecting nitrogen assimilation.

Analysis 1: Nassim as Response

full_raw <- lm(Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass +
                 Intake + WetFrass + DryFrass + Cassim + Nfrass,
               data = dat)

summary(full_raw)

Call:
lm(formula = Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + 
    Intake + WetFrass + DryFrass + Cassim + Nfrass, data = dat)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0027429 -0.0001910 -0.0000570  0.0000934  0.0043961 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -4.988e-05  1.881e-04  -0.265    0.791    
Instar          6.792e-06  5.935e-05   0.114    0.909    
ActiveFeedingY  1.069e-04  1.305e-04   0.819    0.414    
FgpY           -7.619e-05  1.478e-04  -0.516    0.607    
MgpY            9.561e-05  1.208e-04   0.791    0.429    
Mass            6.482e-05  4.627e-05   1.401    0.163    
Intake         -6.837e-03  7.108e-04  -9.620  < 2e-16 ***
WetFrass       -1.766e-03  4.270e-04  -4.136 4.88e-05 ***
DryFrass        8.516e-02  5.086e-03  16.744  < 2e-16 ***
Cassim          2.052e-01  7.463e-03  27.496  < 2e-16 ***
Nfrass         -9.392e-01  5.794e-02 -16.210  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0007329 on 242 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981 
F-statistic: 1.291e+04 on 10 and 242 DF,  p-value: < 2.2e-16

The results show that Intake, WetFrass, DryFrass, Cassim, and Nfrass significantly affect Nassim. The model has a very high fit (Adjusted R-squared = 0.9981 ), meaning it explains almost all the variation in the data.

Model Selection

raw_subsets <- regsubsets(
  Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass +
    Intake + WetFrass + DryFrass + Cassim + Nfrass,
  data = dat,
  nvmax = 10,
  method = "exhaustive"
)

raw_sum <- summary(raw_subsets)

Model Comparison

raw_results <- data.frame(
  Predictors = 1:10,
  Adjusted_R2 = raw_sum$adjr2,
  Cp = raw_sum$cp,
  BIC = raw_sum$bic
)

raw_results
   Predictors Adjusted_R2          Cp       BIC
1           1   0.9850428 1678.017232 -1053.187
2           2   0.9939210  533.067921 -1276.454
3           3   0.9971301  121.796724 -1461.827
4           4   0.9979521   17.693159 -1542.684
5           5   0.9980623    4.668112 -1552.170
6           6   0.9980741    4.174843 -1549.217
7           7   0.9980704    5.653076 -1544.227
8           8   0.9980654    7.288447 -1539.074
9           9   0.9980597    9.013098 -1533.828
10         10   0.9980518   11.000000 -1528.309

The results show that models with more predictors improve the fit, with Adjusted R-squared = 0.9981. The lowest Mallows’ Cp statistic occurs around 5–6 predictors, so the 5-variable model was selected as the best balance between simplicity and accuracy.

Best Model

best_raw_bic <- which.min(raw_sum$bic)
coef(raw_subsets, best_raw_bic)
  (Intercept)        Intake      WetFrass      DryFrass        Cassim 
 4.809826e-05 -7.066343e-03 -1.567513e-03  8.649181e-02  2.073566e-01 
       Nfrass 
-9.356879e-01 

This model was selected because it gives a very high fit while using fewer predictors. It shows that DryFrass and Cassim have positive effects on Nassim, while Intake, WetFrass, and Nfrass have negative effects.

Final Model

raw_bic_model <- lm(Nassim ~ Intake + WetFrass + DryFrass + Cassim + Nfrass,
                    data = dat)

summary(raw_bic_model)

Call:
lm(formula = Nassim ~ Intake + WetFrass + DryFrass + Cassim + 
    Nfrass, data = dat)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0027395 -0.0001748 -0.0000427  0.0000884  0.0045196 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.810e-05  6.105e-05   0.788 0.431529    
Intake      -7.066e-03  6.600e-04 -10.706  < 2e-16 ***
WetFrass    -1.568e-03  4.033e-04  -3.887 0.000131 ***
DryFrass     8.649e-02  4.483e-03  19.293  < 2e-16 ***
Cassim       2.074e-01  7.130e-03  29.083  < 2e-16 ***
Nfrass      -9.357e-01  5.556e-02 -16.842  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0007309 on 247 degrees of freedom
Multiple R-squared:  0.9981,    Adjusted R-squared:  0.9981 
F-statistic: 2.596e+04 on 5 and 247 DF,  p-value: < 2.2e-16

The final model for Nassim includes Intake, WetFrass, DryFrass, Cassim, and Nfrass. All variables are statistically significant. DryFrass and Cassim have positive effects, while Intake, WetFrass, and Nfrass have negative effects.

The model fits extremely well with Adjusted R-squared = 0.9981.

Diagnostics

plot(raw_bic_model, which = 1)

plot(raw_bic_model, which = 2)

plot(raw_bic_model, which = 3)

plot(raw_bic_model, which = 4)

Residuals vs Fitted

The points are mostly scattered around zero with no clear pattern. This shows that the model fits the data well and the linearity assumption is reasonable.

Q-Q Plot

Most points lie close to the straight line, but a few points deviate at the ends. This means the residuals are approximately normal, with some slight outliers.

Scale-Location Plot

The spread of points slightly increases as fitted values increase. This suggests a small change in variance, but overall it is acceptable.

Cook’s Distance

A few points (like 94, 169, and 241) have higher influence, but most points are within limits. This means there are no major influential outliers affecting the model.

Analysis 2: Log(Nassim)

Full Model

full_log <- lm(LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass +
                 Intake + WetFrass + DryFrass + Cassim + Nfrass,
               data = dat)

summary(full_log)

Call:
lm(formula = LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + 
    Mass + Intake + WetFrass + DryFrass + Cassim + Nfrass, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78898 -0.06022  0.00886  0.06858  0.24441 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -3.007731   0.031993 -94.012  < 2e-16 ***
Instar           0.159333   0.010096  15.781  < 2e-16 ***
ActiveFeedingY   0.114118   0.022204   5.140 5.68e-07 ***
FgpY            -0.030847   0.025140  -1.227 0.221004    
MgpY             0.040446   0.020550   1.968 0.050186 .  
Mass             0.029500   0.007872   3.748 0.000223 ***
Intake           0.075744   0.120911   0.626 0.531612    
WetFrass        -0.163631   0.072638  -2.253 0.025176 *  
DryFrass         1.027704   0.865211   1.188 0.236074    
Cassim           1.675558   1.269629   1.320 0.188175    
Nfrass         -24.273257   9.856176  -2.463 0.014485 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1247 on 242 degrees of freedom
Multiple R-squared:  0.9414,    Adjusted R-squared:  0.939 
F-statistic:   389 on 10 and 242 DF,  p-value: < 2.2e-16

The full log(Nassim) model shows that Instar, ActiveFeeding, Mass, WetFrass, and Nfrass are significant predictors. The model fits well with an Adjusted R-squared =0.939, explaining most of the variation.

Model Selection

log_subsets <- regsubsets(
  LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass +
    Intake + WetFrass + DryFrass + Cassim + Nfrass,
  data = dat,
  nvmax = 10,
  method = "exhaustive"
)

log_sum <- summary(log_subsets)

Model Comparison

log_results <- data.frame(
  Predictors = 1:10,
  Adjusted_R2 = log_sum$adjr2,
  Cp = log_sum$cp,
  BIC = log_sum$bic
)

log_results
   Predictors Adjusted_R2         Cp       BIC
1           1   0.8055405 551.389649 -404.2347
2           2   0.9154087  99.788061 -610.3066
3           3   0.9301890  40.050671 -654.3731
4           4   0.9341285  24.884686 -664.5536
5           5   0.9378299  10.812441 -674.6739
6           6   0.9387062   8.257992 -673.7583
7           7   0.9390219   7.984582 -670.5619
8           8   0.9390721   8.783896 -666.2715
9           9   0.9391705   9.392436 -662.1863
10         10   0.9390180  11.000000 -657.0629

The results show that Adjusted R-squared increases up to about 0.939, while Mallows’ Cp is lowest around 6–7 predictors. Therefore, a simpler model was chosen as the best balance between accuracy and simplicity

Best Model

best_log_bic <- which.min(log_sum$bic)
coef(log_subsets, best_log_bic)
   (Intercept)         Instar ActiveFeedingY           Mass         Intake 
   -3.01441156     0.16121734     0.10274326     0.02863367     0.24027952 
        Nfrass 
  -37.80554125 

The best model includes Instar, ActiveFeeding, Mass, Intake, and Nfrass. It shows a strong fit and was selected as the best balance between simplicity and accuracy.

Final Model

log_bic_model <- lm(LogNassim ~ Instar + ActiveFeeding + Mass + Intake + Nfrass,
                    data = dat)

summary(log_bic_model)

Call:
lm(formula = LogNassim ~ Instar + ActiveFeeding + Mass + Intake + 
    Nfrass, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77994 -0.06252  0.00544  0.07396  0.28969 

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)     -3.014412   0.028548 -105.592  < 2e-16 ***
Instar           0.161217   0.009425   17.104  < 2e-16 ***
ActiveFeedingY   0.102743   0.020712    4.961 1.31e-06 ***
Mass             0.028634   0.007212    3.971 9.41e-05 ***
Intake           0.240280   0.015419   15.583  < 2e-16 ***
Nfrass         -37.805541   4.749233   -7.960 6.24e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1259 on 247 degrees of freedom
Multiple R-squared:  0.9391,    Adjusted R-squared:  0.9378 
F-statistic: 761.3 on 5 and 247 DF,  p-value: < 2.2e-16

The final model for Log(Nassim) includes Instar, ActiveFeeding, Mass, Intake, and Nfrass. All variables are statistically significant. Instar, ActiveFeeding, Mass, and Intake have positive effects, while Nfrass has a negative effect.

The model fits well with Adjusted R-squared = 0.9378.

Diagnostics

plot(log_bic_model, which = 1)

plot(log_bic_model, which = 2)

plot(log_bic_model, which = 3)

plot(log_bic_model, which = 4)

Residuals vs Fitted

The points are fairly scattered around zero with a slight curve, indicating the model is mostly appropriate but may have a small non-linear pattern.

Q-Q Plot

Most points lie close to the straight line, suggesting the residuals are approximately normal, with a few outliers at the ends.

Scale-Location Plot

The spread of points is fairly constant, showing that the variance is reasonably stable.

Cook’s Distance

A few points (such as 42, 59, and 63) show higher influence, but most observations have low influence.

Conclusion

This project used multiple linear regression and model selection techniques to identify the factors affecting Nassim in caterpillars.

The results showed that Intake, WetFrass, DryFrass, Cassim, and Nfrass are the most important predictors of Nassim. The final model provided an excellent fit with Adjusted R-squared = 0.9981, explaining almost all the variation in the data.

When using log(Nassim), the best model included Instar, ActiveFeeding, Mass, Intake, and Nfrass, and also showed a strong fit with Adjusted R-squared = 0.9378.

Overall, both models performed well, but the Nassim model gave a better fit, while the log-transformed model provided an alternative way to understand the relationships.