Introduction

We analyze the Caterpillars dataset, which contains measurements of caterpillar growth, feeding, and related conditions. The goal of this project is to model Nassim with multiple linear regression and to determine which variables are most useful for predicting it. The analysis has four parts:

  1. Fit a multiple linear regression model using several predictor variables.

  2. Select models with different combinations of explanatory variables and compare them.

  3. Identify a suitable model based on statistical measures and interpret the results.

  4. Repeat the analysis using a natural-log transformation of Nassim and compare the results.

Analysis

We will explore the questions above in detail.

Caterpillars <- read.csv("https://www.stat2.org/datasets/Caterpillars.csv")
head(Caterpillars)
##   Instar ActiveFeeding Fgp Mgp     Mass   LogMass   Intake  LogIntake WetFrass
## 1      1             Y   Y   Y 0.002064 -2.685290 0.165118 -0.7822056 0.000241
## 2      1             Y   N   N 0.005191 -2.284749 0.201008 -0.6967867 0.000063
## 3      2             N   Y   N 0.005603 -2.251579 0.189125 -0.7232511 0.001401
## 4      2             Y   N   N 0.019300 -1.714443 0.283280 -0.5477841 0.002045
## 5      2             N   Y   Y 0.029300 -1.533132 0.259569 -0.5857472 0.005377
## 6      3             Y   Y   N 0.062600 -1.203426 0.327864 -0.4843063 0.029500
##   LogWetFrass DryFrass LogDryFrass     Cassim LogCassim   Nfrass LogNfrass
## 1   -3.617983 0.000208   -3.681937 0.01422378 -1.846985 6.61e-06 -5.179510
## 2   -4.200659 0.000061   -4.214670 0.01739189 -1.759653 1.03e-06 -5.986783
## 3   -2.853562 0.000969   -3.013676 0.01639923 -1.785177 2.78e-05 -4.555794
## 4   -2.689307 0.001834   -2.736601 0.02392468 -1.621154 4.64e-05 -4.333480
## 5   -2.269460 0.003523   -2.453087 0.02122857 -1.673079 9.97e-05 -4.001301
## 6   -1.530178 0.000789   -3.102923 0.02836365 -1.547238 1.84e-05 -4.735567
##        Nassim LogNassim
## 1 0.001858999 -2.730721
## 2 0.002270091 -2.643957
## 3 0.002302210 -2.637855
## 4 0.003041352 -2.516933
## 5 0.002791898 -2.554100
## 6 0.003627464 -2.440397

Question a.

Fit a multiple linear regression model using several predictor variables.

full_model <- lm(Nassim ~ Instar + ActiveFeeding + Fgp + Mgp +
                 Mass + Intake + WetFrass,
                 data = Caterpillars)

summary(full_model)
## 
## Call:
## lm(formula = Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + 
##     Intake + WetFrass, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0094457 -0.0002801 -0.0000176  0.0002447  0.0076763 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.436e-04  4.498e-04   0.542    0.589    
## Instar         -9.872e-05  1.439e-04  -0.686    0.493    
## ActiveFeedingY  2.132e-04  3.297e-04   0.647    0.519    
## FgpY           -3.030e-04  3.498e-04  -0.866    0.387    
## MgpY            2.496e-04  2.938e-04   0.850    0.396    
## Mass            9.070e-05  1.101e-04   0.824    0.411    
## Intake          1.089e-02  2.024e-04  53.815   <2e-16 ***
## WetFrass       -8.711e-03  6.235e-04 -13.971   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001869 on 246 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9873 
## F-statistic:  2816 on 7 and 246 DF,  p-value: < 2.2e-16

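A multiple linear regression model was fit with seven predictors: Instar, ActiveFeeding, Fgp, Mgp, Mass, Intake, and WetFrass. The remaining columns in the dataset (such as DryFrass, Cassim, and Nfrass) were not included. R dropped 13 observations with missing values, leaving 254 for the fit. The model explains most of the variation in Nassim (adjusted R-squared 0.9873), but only Intake and WetFrass have statistically significant coefficients.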
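The summary reports that 13 observations were deleted due to missingness. Because AIC comparisons are only meaningful when every model is fit to the same rows, it can help to check which rows lm() keeps. The sketch below shows one way to do that; the data frame `toy` and its values are hypothetical stand-ins, not the Caterpillars data.

```r
# Hypothetical toy data (NOT the real Caterpillars values), with two missing entries.
toy <- data.frame(
  Nassim   = c(0.0019, 0.0023, NA, 0.0030, 0.0028, 0.0036),
  Intake   = c(0.165, 0.201, 0.189, 0.283, NA, 0.328),
  WetFrass = c(0.0002, 0.0001, 0.0014, 0.0020, 0.0054, 0.0295)
)

# Variables that appear in the model formula
vars <- all.vars(Nassim ~ Intake + WetFrass)

# Rows with no missing value in any model variable; under R's default
# na.action (na.omit) these are the rows lm() uses
keep <- complete.cases(toy[, vars])
n_dropped <- sum(!keep)

# Number of missing values per variable
na_counts <- colSums(is.na(toy[, vars]))

fit <- lm(Nassim ~ Intake + WetFrass, data = toy)

print(n_dropped)   # rows excluded from the fit
print(na_counts)   # where the missing values are
print(nobs(fit))   # rows actually used by lm()
```

Applied to the real data, the same steps would use `Caterpillars` in place of `toy` and the variables of the full model.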

Question b.

Select models with different combinations of explanatory variables and compare them.

model1 <- lm(Nassim ~ Instar + Mass + Intake, data = Caterpillars)

model2 <- lm(Nassim ~ Mass + Intake, data = Caterpillars)

model3 <- lm(Nassim ~ Instar + Intake, data = Caterpillars)

summary(model1)
## 
## Call:
## lm(formula = Nassim ~ Instar + Mass + Intake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0110114 -0.0008831 -0.0002138  0.0005345  0.0105857 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.288e-04  4.665e-04  -1.562     0.12    
## Instar       7.615e-04  1.702e-04   4.474 1.16e-05 ***
## Mass        -1.173e-03  7.453e-05 -15.736  < 2e-16 ***
## Intake       8.599e-03  1.258e-04  68.349  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002503 on 250 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9775, Adjusted R-squared:  0.9773 
## F-statistic:  3628 on 3 and 250 DF,  p-value: < 2.2e-16
summary(model2)
## 
## Call:
## lm(formula = Nassim ~ Mass + Intake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0117517 -0.0007233 -0.0004883  0.0005720  0.0104027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.159e-03  2.068e-04   5.603 5.53e-08 ***
## Mass        -1.088e-03  7.473e-05 -14.553  < 2e-16 ***
## Intake       8.876e-03  1.136e-04  78.105  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002596 on 251 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9756 
## F-statistic:  5050 on 2 and 251 DF,  p-value: < 2.2e-16
summary(model3)
## 
## Call:
## lm(formula = Nassim ~ Instar + Intake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0142301 -0.0004055 -0.0001883  0.0008779  0.0087938 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.177e-04  6.402e-04   1.434    0.153    
## Instar      7.618e-05  2.317e-04   0.329    0.743    
## Intake      7.623e-03  1.541e-04  49.463   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003524 on 251 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9553, Adjusted R-squared:  0.955 
## F-statistic:  2683 on 2 and 251 DF,  p-value: < 2.2e-16
AIC(full_model, model1, model2, model3)
##            df       AIC
## full_model  9 -2460.670
## model1      5 -2316.349
## model2      4 -2298.782
## model3      4 -2143.495

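Three reduced models were fit using subsets of the predictors and compared with the full model on adjusted R-squared and AIC. The full model has adjusted R-squared 0.9873 and AIC -2460.67; model1 has 0.9773 and -2316.35; model2 has 0.9756 and -2298.78; model3 has 0.955 and -2143.50. A lower AIC indicates a better trade-off between fit and complexity, so the full model ranks first and model3 last. Note that none of the reduced models includes WetFrass.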
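The df column in the AIC table counts the estimated parameters: the regression coefficients plus one for the error variance (for example, 8 coefficients + 1 = 9 for the full model). The sketch below reproduces the value returned by AIC() for a linear model from its residual sum of squares. It uses simulated data with hypothetical variable names (`x1`, `x2`, `y`), so the numbers are illustrative only.

```r
# Simulated data for illustration only (not the Caterpillars data)
set.seed(1)
n <- 50
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 0.5 * toy$x2 + rnorm(n, sd = 0.3)

fit <- lm(y ~ x1 + x2, data = toy)

# Residual sum of squares
rss <- sum(residuals(fit)^2)

# Parameter count: regression coefficients plus the error variance
k <- length(coef(fit)) + 1

# Gaussian log-likelihood evaluated at the maximum-likelihood estimates
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

# AIC = -2 * log-likelihood + 2 * number of parameters
aic_manual <- -2 * loglik + 2 * k

print(c(manual = aic_manual, builtin = AIC(fit)))
```

The two printed values should agree, which shows that for models fit to the same response and the same rows, AIC differences come from the residual sum of squares and the parameter count alone.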

Question c.

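Identify a suitable model based on statistical measures and interpret the results.

The full model was selected as the preferred model because it has the lowest AIC (-2460.67) and the highest adjusted R-squared (0.9873) among the four models considered.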

From the model output, Intake and WetFrass are significant predictors of Nassim, as indicated by their extremely small p-values. These variables have a strong relationship with the response variable.

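The other variables (Instar, ActiveFeeding, Fgp, Mgp, and Mass) have large p-values in the full model, so there is little evidence that they contribute once Intake and WetFrass are included. The reduced models did have worse AIC, but each of them also omitted WetFrass, which is highly significant, so this comparison does not by itself show that the nonsignificant predictors are needed. They were retained in the final model, although a model containing only Intake and WetFrass was not fit and might perform comparably.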

Overall, among the candidates considered, the full model provides the best fit and the lowest AIC.
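One way to check whether predictors with large p-values are needed is to drop them and compare the nested models with an F-test and AIC. The sketch below illustrates the idea on simulated data in which `Mass` has no effect by construction. The data frame `toy` and all of its values are hypothetical and only borrow column names from the report.

```r
# Simulated stand-in data (NOT the Caterpillars data); Mass has no effect on Nassim here
set.seed(42)
n <- 200
toy <- data.frame(
  Intake   = runif(n, 0.1, 3),
  WetFrass = runif(n, 0, 1),
  Mass     = runif(n, 0, 5)
)
toy$Nassim <- 0.0002 + 0.011 * toy$Intake - 0.009 * toy$WetFrass +
  rnorm(n, sd = 0.002)

# Larger model, and a nested model with Mass removed
full    <- lm(Nassim ~ Intake + WetFrass + Mass, data = toy)
reduced <- update(full, . ~ . - Mass)

# Partial F-test: does adding Mass significantly reduce the residual sum of squares?
cmp <- anova(reduced, full)
print(cmp)

# AIC for the same two models (both fit to the same rows)
print(AIC(reduced, full))
```

With the real data, the analogous comparison would be between the full model and a model containing only Intake and WetFrass, fit to the same complete rows.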

Question d.

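Repeat the analysis using a natural-log transformation of Nassim and compare the results.

The full log-transformed model shows that LogIntake, LogMass, and WetFrass are significant predictors of the transformed response, log(Nassim + 1), as indicated by their small p-values (all below 0.01). The remaining predictors are not significant at the 0.05 level.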

To compare models, the Akaike Information Criterion (AIC) was used. The full log model has a lower AIC (-1892.66) than the reduced log model (-1857.49), indicating that the full log model provides the better trade-off between fit and complexity of the two.

A log transformation is intended to stabilize variance, but here it did not improve the fit: the full log model has adjusted R-squared 0.8752, compared with 0.9873 for the full untransformed model. Among the log-transformed models, the full model provides the better fit.

Caterpillars$logNassim <- log(Caterpillars$Nassim + 1)

log_full <- lm(logNassim ~ Instar + ActiveFeeding + Fgp + Mgp +
               LogMass + LogIntake + WetFrass,
               data = Caterpillars)

summary(log_full)
## 
## Call:
## lm(formula = logNassim ~ Instar + ActiveFeeding + Fgp + Mgp + 
##     LogMass + LogIntake + WetFrass, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0181002 -0.0028335  0.0000734  0.0026394  0.0179192 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.015e-02  5.691e-03   1.784  0.07567 .  
## Instar          1.789e-05  1.449e-03   0.012  0.99016    
## ActiveFeedingY  9.133e-04  1.063e-03   0.859  0.39131    
## FgpY           -3.987e-04  1.465e-03  -0.272  0.78572    
## MgpY            1.821e-03  1.015e-03   1.794  0.07398 .  
## LogMass        -5.833e-03  2.066e-03  -2.823  0.00515 ** 
## LogIntake       3.109e-02  2.653e-03  11.717  < 2e-16 ***
## WetFrass        4.737e-03  7.926e-04   5.977 7.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005718 on 246 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.8787, Adjusted R-squared:  0.8752 
## F-statistic: 254.5 on 7 and 246 DF,  p-value: < 2.2e-16
log_model1 <- lm(logNassim ~ LogMass + LogIntake, data = Caterpillars)

summary(log_model1)
## 
## Call:
## lm(formula = logNassim ~ LogMass + LogIntake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0147867 -0.0038577 -0.0000844  0.0029250  0.0216721 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0139273  0.0005286  26.347   <2e-16 ***
## LogMass     -0.0068461  0.0007625  -8.979   <2e-16 ***
## LogIntake    0.0393263  0.0016248  24.203   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.006188 on 251 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.8539 
## F-statistic: 740.2 on 2 and 251 DF,  p-value: < 2.2e-16
AIC(log_full, log_model1)
##            df       AIC
## log_full    9 -1892.660
## log_model1  4 -1857.487

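Two details of this fit affect its interpretation. First, the response is log(Nassim + 1) rather than log(Nassim); for values as small as those in the first rows of the data (about 0.002 to 0.004), log(Nassim + 1) is almost equal to Nassim itself, so the transformation changes the response very little. Second, the LogMass and LogIntake columns supplied with the dataset are base-10 logarithms (for example, Mass 0.002064 has LogMass -2.685290), not natural logarithms. In addition, the AIC values of the log models cannot be compared directly with those of the untransformed models, because the two sets of models have different response variables.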
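A sketch of how an untransformed model and a natural-log model could be compared on a common scale is given below. When the response is log(y), adding 2 * sum(log(y)) to the log model's AIC (the Jacobian of the transformation) expresses it on the scale of y. The data are simulated with hypothetical values, and the models here are not the ones fit in this report.

```r
# Simulated positive response with multiplicative noise (NOT the Caterpillars data)
set.seed(7)
n <- 120
toy <- data.frame(Intake = runif(n, 0.1, 3))
toy$Nassim <- 0.003 * toy$Intake^0.8 * exp(rnorm(n, sd = 0.2))

# Model on the original scale and model on the natural-log scale
raw_fit <- lm(Nassim ~ Intake, data = toy)
log_fit <- lm(log(Nassim) ~ log(Intake), data = toy)

# Express the log model's AIC on the scale of Nassim by adding the Jacobian term
y <- toy$Nassim
aic_log_on_raw_scale <- AIC(log_fit) + 2 * sum(log(y))

print(c(raw = AIC(raw_fit), log_adjusted = aic_log_on_raw_scale))

# For small y, log(y + 1) is nearly y itself, so it barely transforms the response
print(max(abs(log(y + 1) - y)))
```

After the adjustment both AIC values refer to the same response, so the smaller one indicates the preferred model. The same adjustment would apply to log(Nassim + 1) with sum(log(y + 1)) in place of sum(log(y)).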

Appendix

The following R code was used for the data analysis in this report.

Caterpillars <- read.csv("https://www.stat2.org/datasets/Caterpillars.csv")
head(Caterpillars)

full_model <- lm(Nassim ~ Instar + ActiveFeeding + Fgp + Mgp +
                 Mass + Intake + WetFrass,
                 data = Caterpillars)
summary(full_model)

model1 <- lm(Nassim ~ Instar + Mass + Intake, data = Caterpillars)
model2 <- lm(Nassim ~ Mass + Intake, data = Caterpillars)
model3 <- lm(Nassim ~ Instar + Intake, data = Caterpillars)

summary(model1)
summary(model2)
summary(model3)

AIC(full_model, model1, model2, model3)

Caterpillars$logNassim <- log(Caterpillars$Nassim + 1)

log_full <- lm(logNassim ~ Instar + ActiveFeeding + Fgp + Mgp +
               LogMass + LogIntake + WetFrass,
               data = Caterpillars)

summary(log_full)

log_model1 <- lm(logNassim ~ LogMass + LogIntake, data = Caterpillars)

summary(log_model1)
AIC(log_full, log_model1)