STAT 321-40 Project 3: Multiple Linear Regression

Analysis

We explore each question in detail, starting by loading the Caterpillars dataset.

# Load the Caterpillars dataset and preview the first six rows
Caterpillars <- read.csv("https://www.stat2.org/datasets/Caterpillars.csv")
head(Caterpillars)

##   Instar ActiveFeeding Fgp Mgp     Mass   LogMass   Intake  LogIntake WetFrass
## 1      1             Y   Y   Y 0.002064 -2.685290 0.165118 -0.7822056 0.000241
## 2      1             Y   N   N 0.005191 -2.284749 0.201008 -0.6967867 0.000063
## 3      2             N   Y   N 0.005603 -2.251579 0.189125 -0.7232511 0.001401
## 4      2             Y   N   N 0.019300 -1.714443 0.283280 -0.5477841 0.002045
## 5      2             N   Y   Y 0.029300 -1.533132 0.259569 -0.5857472 0.005377
## 6      3             Y   Y   N 0.062600 -1.203426 0.327864 -0.4843063 0.029500
##   LogWetFrass DryFrass LogDryFrass     Cassim LogCassim   Nfrass LogNfrass
## 1   -3.617983 0.000208   -3.681937 0.01422378 -1.846985 6.61e-06 -5.179510
## 2   -4.200659 0.000061   -4.214670 0.01739189 -1.759653 1.03e-06 -5.986783
## 3   -2.853562 0.000969   -3.013676 0.01639923 -1.785177 2.78e-05 -4.555794
## 4   -2.689307 0.001834   -2.736601 0.02392468 -1.621154 4.64e-05 -4.333480
## 5   -2.269460 0.003523   -2.453087 0.02122857 -1.673079 9.97e-05 -4.001301
## 6   -1.530178 0.000789   -3.102923 0.02836365 -1.547238 1.84e-05 -4.735567
##        Nassim LogNassim
## 1 0.001858999 -2.730721
## 2 0.002270091 -2.643957
## 3 0.002302210 -2.637855
## 4 0.003041352 -2.516933
## 5 0.002791898 -2.554100
## 6 0.003627464 -2.440397

Question a.

Fit a multiple linear regression model using several predictor variables.

full_model <- lm(Nassim ~ Instar + ActiveFeeding + Fgp + Mgp +
                 Mass + Intake + WetFrass,
                 data = Caterpillars)

summary(full_model)

## 
## Call:
## lm(formula = Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + 
##     Intake + WetFrass, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0094457 -0.0002801 -0.0000176  0.0002447  0.0076763 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.436e-04  4.498e-04   0.542    0.589    
## Instar         -9.872e-05  1.439e-04  -0.686    0.493    
## ActiveFeedingY  2.132e-04  3.297e-04   0.647    0.519    
## FgpY           -3.030e-04  3.498e-04  -0.866    0.387    
## MgpY            2.496e-04  2.938e-04   0.850    0.396    
## Mass            9.070e-05  1.101e-04   0.824    0.411    
## Intake          1.089e-02  2.024e-04  53.815   <2e-16 ***
## WetFrass       -8.711e-03  6.235e-04 -13.971   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001869 on 246 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9873 
## F-statistic:  2816 on 7 and 246 DF,  p-value: < 2.2e-16

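A multiple linear regression model for Nassim was fit with seven predictors: Instar, ActiveFeeding, Fgp, Mgp, Mass, Intake, and WetFrass. These are a subset of the available columns, not all of them; the log-transformed columns and the remaining frass and assimilation measures were left out. The model was fit on 254 observations after 13 were dropped for missing values, and it explains most of the variation in Nassim (adjusted R-squared 0.9873).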
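As a rough illustration of how significant predictors can be read off a fitted model programmatically, the sketch below pulls the p-value column from the coefficient table. The helper name `significant_terms` and the 0.05 cutoff are assumptions made for illustration, and the built-in `mtcars` data stands in for the Caterpillars data so the example runs without a download.

```r
# Hypothetical helper: names of the coefficients whose p-value is below alpha.
significant_terms <- function(model, alpha = 0.05) {
  coefs <- coef(summary(model))   # matrix with Estimate, Std. Error, t value, Pr(>|t|)
  pvals <- coefs[, "Pr(>|t|)"]
  names(pvals)[pvals < alpha]
}

# Stand-in example on built-in data (not the Caterpillars data).
demo_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
demo_terms <- significant_terms(demo_fit)
print(demo_terms)
```

Applied to the model above, `significant_terms(full_model)` would be expected to return the Intake and WetFrass terms, in line with the summary table.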

Question b.

Select models with different combinations of explanatory variables and compare them.

# Reduced models: subsets of the full model's predictors (all three omit WetFrass)
model1 <- lm(Nassim ~ Instar + Mass + Intake, data = Caterpillars)

model2 <- lm(Nassim ~ Mass + Intake, data = Caterpillars)

model3 <- lm(Nassim ~ Instar + Intake, data = Caterpillars)

summary(model1)

## 
## Call:
## lm(formula = Nassim ~ Instar + Mass + Intake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0110114 -0.0008831 -0.0002138  0.0005345  0.0105857 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.288e-04  4.665e-04  -1.562     0.12    
## Instar       7.615e-04  1.702e-04   4.474 1.16e-05 ***
## Mass        -1.173e-03  7.453e-05 -15.736  < 2e-16 ***
## Intake       8.599e-03  1.258e-04  68.349  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002503 on 250 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9775, Adjusted R-squared:  0.9773 
## F-statistic:  3628 on 3 and 250 DF,  p-value: < 2.2e-16

summary(model2)

## 
## Call:
## lm(formula = Nassim ~ Mass + Intake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0117517 -0.0007233 -0.0004883  0.0005720  0.0104027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.159e-03  2.068e-04   5.603 5.53e-08 ***
## Mass        -1.088e-03  7.473e-05 -14.553  < 2e-16 ***
## Intake       8.876e-03  1.136e-04  78.105  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002596 on 251 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9756 
## F-statistic:  5050 on 2 and 251 DF,  p-value: < 2.2e-16

summary(model3)

## 
## Call:
## lm(formula = Nassim ~ Instar + Intake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0142301 -0.0004055 -0.0001883  0.0008779  0.0087938 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.177e-04  6.402e-04   1.434    0.153    
## Instar      7.618e-05  2.317e-04   0.329    0.743    
## Intake      7.623e-03  1.541e-04  49.463   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003524 on 251 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9553, Adjusted R-squared:  0.955 
## F-statistic:  2683 on 2 and 251 DF,  p-value: < 2.2e-16

AIC(full_model, model1, model2, model3)

##            df       AIC
## full_model  9 -2460.670
## model1      5 -2316.349
## model2      4 -2298.782
## model3      4 -2143.495

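Three reduced models were fit using subsets of the predictors. They are compared with the full model using adjusted R-squared, coefficient p-values, and AIC, where a lower AIC indicates a better trade-off between fit and complexity. Every reduced model uses a subset of the full model's variables and keeps the same 254 observations, so all four models are fit on the same rows and their AIC values are comparable. Adjusted R-squared is 0.9873 for the full model, 0.9773 for model1, 0.9756 for model2, and 0.955 for model3; the AIC values rank the models in the same order.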
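The comparison described here can be collected into one table. The sketch below is a minimal version of that idea: `compare_models` is a hypothetical helper, not part of the original analysis, and `mtcars` is used as stand-in data. It reports the sample size alongside adjusted R-squared and AIC, since AIC values are only comparable when the models share the same observations.

```r
# Hypothetical helper: one row per model with sample size, adjusted R-squared, and AIC.
compare_models <- function(models) {
  data.frame(
    model  = names(models),
    n      = vapply(models, nobs, numeric(1)),
    adj_r2 = vapply(models, function(m) summary(m)$adj.r.squared, numeric(1)),
    aic    = vapply(models, AIC, numeric(1)),
    row.names = NULL
  )
}

# Stand-in example on built-in data (not the Caterpillars data).
demo_models <- list(
  full    = lm(mpg ~ wt + hp + qsec, data = mtcars),
  reduced = lm(mpg ~ wt, data = mtcars)
)
comparison <- compare_models(demo_models)
print(comparison[order(comparison$aic), ])   # lowest AIC first
```

For the models in this report, the analogous call would be `compare_models(list(full_model = full_model, model1 = model1, model2 = model2, model3 = model3))`.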

Question c.

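The full model was selected as the preferred model because it has the lowest AIC (-2460.670) of the four models considered.

From the model output, Intake and WetFrass are significant predictors of Nassim (p < 2e-16 for both). Holding the other predictors fixed, Nassim increases with Intake (estimate 1.089e-02) and decreases with WetFrass (estimate -8.711e-03).

The other variables (Instar, ActiveFeeding, Fgp, Mgp, and Mass) have p-values between 0.387 and 0.519, so none is individually significant in the full model. They were retained because every reduced model considered had a higher AIC. That comparison has a limit: all three reduced models also omit WetFrass, a highly significant predictor, so the AIC gap cannot be attributed to these five variables alone. A model containing only Intake and WetFrass was not fit. Instar and Mass are significant in model1 but not in the full model, which suggests that they overlap with WetFrass in the information they carry.

Overall, among the models considered, the full model provides the best balance between fit and complexity as measured by AIC.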
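One way to check whether a group of individually nonsignificant predictors contributes jointly is a partial F-test on nested models. The sketch below shows the mechanics with `anova()` on the built-in `mtcars` data; the model names and the choice of variables are assumptions for illustration and say nothing about the Caterpillars results.

```r
# Stand-in nested models on built-in data (not the Caterpillars data).
reduced_fit <- lm(mpg ~ wt + hp, data = mtcars)
full_fit    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Partial F-test: do qsec and drat jointly improve on the reduced model?
# Both models must be fit on the same rows for the test to be valid.
f_test <- anova(reduced_fit, full_fit)
print(f_test)

# p-value for the two added terms (second row of the anova table).
p_value <- f_test[2, "Pr(>F)"]
```

In this report, the corresponding check would compare `lm(Nassim ~ Intake + WetFrass, data = Caterpillars)` against `full_model`, which tests the five remaining predictors as a group.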

Question d.

The log-transformed model shows that LogIntake (p < 2e-16), WetFrass (p = 7.96e-09), and LogMass (p = 0.00515) are significant predictors of the response, log(Nassim + 1). Mgp is marginal (p = 0.07398), and Instar, ActiveFeeding, and Fgp are not significant.

To compare the log models, the Akaike Information Criterion (AIC) was used. The full log model has a lower AIC (-1892.660) than the reduced log model with only LogMass and LogIntake (-1857.487), indicating that the full log model provides the better fit of the two. Its adjusted R-squared is also higher (0.8752 versus 0.8539).

The log transformation is intended to stabilize variance, but the output does not show an improvement in fit from it. Because the Nassim values are small (about 0.002 to 0.004 in the rows printed above), log(Nassim + 1) is nearly equal to Nassim itself, so the response is changed very little. Among the log-scale models, the full log model is preferred.
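A short numerical check makes the effect of adding 1 before taking the log concrete. The sketch below uses three Nassim values copied from the printed rows of the dataset and compares log(x + 1) with x itself. It also compares `log10(x)` with the printed LogNassim values; that the LogNassim column is a base-10 log is an inference from those printed rows, not something the source states.

```r
# Nassim values copied from rows 1, 2, and 6 of the head() output.
x <- c(0.001858999, 0.002270091, 0.003627464)

# log(x + 1) is the response used in the log models (natural log).
log_plus_one <- log(x + 1)

# Relative difference between x and log(x + 1); roughly x / 2 for small x.
rel_diff <- (x - log_plus_one) / x
print(rel_diff)

# For comparison: base-10 log of the same values next to the printed LogNassim column.
printed_log_nassim <- c(-2.730721, -2.643957, -2.440397)
print(cbind(log10_x = log10(x), printed_log_nassim))
```

For values this small the two responses differ by well under one percent, so a model for log(Nassim + 1) behaves much like a model for Nassim on these rows.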

# Response for the log models: natural log of (Nassim + 1)
Caterpillars$logNassim <- log(Caterpillars$Nassim + 1)

log_full <- lm(logNassim ~ Instar + ActiveFeeding + Fgp + Mgp +
               LogMass + LogIntake + WetFrass,
               data = Caterpillars)

summary(log_full)

## 
## Call:
## lm(formula = logNassim ~ Instar + ActiveFeeding + Fgp + Mgp + 
##     LogMass + LogIntake + WetFrass, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0181002 -0.0028335  0.0000734  0.0026394  0.0179192 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.015e-02  5.691e-03   1.784  0.07567 .  
## Instar          1.789e-05  1.449e-03   0.012  0.99016    
## ActiveFeedingY  9.133e-04  1.063e-03   0.859  0.39131    
## FgpY           -3.987e-04  1.465e-03  -0.272  0.78572    
## MgpY            1.821e-03  1.015e-03   1.794  0.07398 .  
## LogMass        -5.833e-03  2.066e-03  -2.823  0.00515 ** 
## LogIntake       3.109e-02  2.653e-03  11.717  < 2e-16 ***
## WetFrass        4.737e-03  7.926e-04   5.977 7.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005718 on 246 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.8787, Adjusted R-squared:  0.8752 
## F-statistic: 254.5 on 7 and 246 DF,  p-value: < 2.2e-16

# Reduced log model: log-scale mass and intake only
log_model1 <- lm(logNassim ~ LogMass + LogIntake, data = Caterpillars)

summary(log_model1)

## 
## Call:
## lm(formula = logNassim ~ LogMass + LogIntake, data = Caterpillars)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0147867 -0.0038577 -0.0000844  0.0029250  0.0216721 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0139273  0.0005286  26.347   <2e-16 ***
## LogMass     -0.0068461  0.0007625  -8.979   <2e-16 ***
## LogIntake    0.0393263  0.0016248  24.203   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.006188 on 251 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.855,  Adjusted R-squared:  0.8539 
## F-statistic: 740.2 on 2 and 251 DF,  p-value: < 2.2e-16

AIC(log_full, log_model1)

##            df       AIC
## log_full    9 -1892.660
## log_model1  4 -1857.487

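The log transformation is used to stabilize variance. Compared with the original full model, the full log model explains less of the variation in its response (adjusted R-squared 0.8752 versus 0.9873). The AIC values of the original and log models cannot be compared directly, because the two responses are on different scales. These results therefore do not show that the transformation provides a better model than the original full model.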
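AIC values from models with different responses can be put on a common scale by adding the Jacobian of the transformation to the log-likelihood. For a response z = log(y + 1), the adjustment works out to adding 2 * sum(z) to the AIC of the log model. The sketch below applies this to the built-in `mtcars` data as a stand-in; the variable choices are assumptions for illustration, and both models must be fit on the same rows.

```r
# Stand-in models on built-in data (not the Caterpillars data).
raw_fit <- lm(mpg ~ wt + hp, data = mtcars)
log_fit <- lm(log(mpg + 1) ~ wt + hp, data = mtcars)

# Transformed response actually used in the fit (respects any dropped rows).
z <- model.response(model.frame(log_fit))

# Since d/dy log(y + 1) = 1 / (y + 1), the log-likelihood on the original
# scale is logLik(log_fit) - sum(log(y + 1)) = logLik(log_fit) - sum(z).
aic_log_on_raw_scale <- AIC(log_fit) + 2 * sum(z)

print(c(raw = AIC(raw_fit), log_adjusted = aic_log_on_raw_scale))
```

For this report the same adjustment would be `AIC(log_full) + 2 * sum(model.response(model.frame(log_full)))`, which could then be set against `AIC(full_model)`.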

Zoljargal Enkhbayar

2026-03-17
