1. Abstract

This project compares two methods of fitting a multiple regression model to the caterpillar data, in both its default (untransformed) and natural-log-transformed forms. Using R, we select predictors for each form of the data with best-subsets selection scored by Mallows’ Cp and with stepwise selection. Comparing the two methods within each form of the data lets us assess how a natural log transformation affects the fitted model.

2. Introduction

When fitting a multiple regression model, it is crucial to build candidate models with more than one method and compare them to find the best one. This project uses R to find the best subset of predictors by Mallows’ Cp for the default and log-transformed data. We also fit models for both forms of the data using stepwise selection. We use Mallows’ Cp together with R squared and adjusted R squared to assess each model’s fit. We also discuss whether a log transformation of the data benefits the model fitting process.

3. Methodology

Data Source: Stat2 Models for a World of Data. Tools: R programming language, base packages, leaps, and MASS.

4. Results

4.1 Models for Default Data

To create our first multiple regression model for the default data, we drop the log-transformed columns from the dataset, select Nassim as the response variable, and calculate the best subset of predictors using Mallows’ Cp.

## The best subset is:
##  Instar, Mass, Cassim, Nfrass
## with Mallow's Cp =3.270158.
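
To make the selection criterion concrete, the following is a minimal sketch of an exhaustive best-subsets search scored by Mallows’ Cp, written in base R. It is an illustration only: the built-in mtcars data and the four candidate predictors are stand-ins chosen for the example, not the caterpillar variables, and the helper `cp_of` is a hypothetical name. The analysis itself relies on `regsubsets` from the leaps package.

```r
# Exhaustive best-subsets search scored by Mallows' Cp, base R only.
# mtcars and the candidate predictors are stand-ins, not the caterpillar data.
data(mtcars)
response   <- "mpg"
candidates <- c("wt", "hp", "disp", "qsec")

n <- nrow(mtcars)
full_fit <- lm(reformulate(candidates, response), data = mtcars)
# Error variance estimated from the full model (its mean squared error).
sigma2_full <- sum(resid(full_fit)^2) / df.residual(full_fit)

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Cp = SSE_p / sigma2_full - n + 2p, where p counts the intercept.
cp_of <- function(vars) {
  fit <- lm(reformulate(vars, response), data = mtcars)
  p <- length(coef(fit))
  sum(resid(fit)^2) / sigma2_full - n + 2 * p
}

cp_table <- data.frame(
  model = vapply(subsets, paste, character(1), collapse = " + "),
  p     = lengths(subsets) + 1,
  cp    = vapply(subsets, cp_of, numeric(1))
)
cp_table <- cp_table[order(cp_table$cp), ]  # smallest Cp first
print(head(cp_table, 3))
```

By construction the full model always has Cp equal to its own p, so the criterion is informative only for the smaller subsets.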

Then, we fit a multiple regression model for the default data using the same response variable, but this time with stepwise selection methods.

## 
## Call:
## lm(formula = Nassim ~ Mass + Intake + WetFrass + DryFrass + Cassim + 
##     Nfrass, data = caterpillar_data_no_log)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0027704 -0.0001662 -0.0000398  0.0001088  0.0045810 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.027e-06  6.633e-05  -0.015 0.987660    
## Mass         5.521e-05  3.925e-05   1.406 0.160839    
## Intake      -6.650e-03  6.955e-04  -9.562  < 2e-16 ***
## WetFrass    -1.522e-03  3.941e-04  -3.862 0.000144 ***
## DryFrass     8.374e-02  4.738e-03  17.676  < 2e-16 ***
## Cassim       2.040e-01  7.375e-03  27.658  < 2e-16 ***
## Nfrass      -9.622e-01  5.348e-02 -17.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0007308 on 247 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.9981 
## F-statistic: 2.172e+04 on 6 and 247 DF,  p-value: < 2.2e-16

The “Call” line shows the model retained at the end of stepwise selection.
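
As a small illustration of how such a model is produced, here is a sketch of stepwise selection with `step()` from the stats package. The mtcars data and the five starting predictors are assumptions made for the example. Note that `step()` ranks models by AIC by default rather than by Mallows’ Cp.

```r
# Stepwise selection with stats::step(); mtcars is a stand-in dataset.
data(mtcars)
full_fit <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)

# direction = "both" allows a term to be dropped and later re-added.
# With no scope given, the starting model is the largest model considered.
# trace = 0 suppresses the step-by-step log.
step_fit <- step(full_fit, direction = "both", trace = 0)

print(formula(step_fit))                 # formula shown on the "Call" line
print(summary(step_fit)$adj.r.squared)   # adjusted R squared of the result
```

Calling `summary(step_fit)` prints the same kind of table as above, with the retained formula on the “Call” line.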

4.2 Models for Log Transformed Data

To create our first multiple regression model for the log-transformed data, we keep only the categorical variables and the log-transformed columns, select LogNassim as the response variable, and calculate the best subset of predictors using Mallows’ Cp.

## The best subset of log-transformed variables is:
##  (Intercept), Instar3, Instar4, Instar5, ActiveFeeding, Fgp, LogWetFrass, LogDryFrass, LogCassim, LogNfrass
## with Mallow's Cp =8.857443.

Then we fit a multiple regression model for the log-transformed data with the same response variable (LogNassim), this time using stepwise selection.

## 
## Call:
## lm(formula = LogNassim ~ Instar + ActiveFeeding + Fgp + LogWetFrass + 
##     LogDryFrass + LogCassim + LogNfrass, data = log_transformed_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.235665 -0.016527  0.000552  0.018205  0.196748 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.1057848  0.0536852 -20.598  < 2e-16 ***
## Instar2        0.0007799  0.0099939   0.078 0.937867    
## Instar3        0.0366515  0.0143461   2.555 0.011238 *  
## Instar4        0.0693167  0.0198037   3.500 0.000553 ***
## Instar5        0.0468268  0.0286687   1.633 0.103690    
## ActiveFeeding  0.0125323  0.0061912   2.024 0.044046 *  
## Fgp            0.0120804  0.0061343   1.969 0.050057 .  
## LogWetFrass   -0.0458619  0.0174903  -2.622 0.009292 ** 
## LogDryFrass    0.1614493  0.0381606   4.231 3.31e-05 ***
## LogCassim      0.9682161  0.0127379  76.011  < 2e-16 ***
## LogNfrass     -0.1086667  0.0326295  -3.330 0.001003 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03491 on 242 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.9954, Adjusted R-squared:  0.9952 
## F-statistic:  5247 on 10 and 242 DF,  p-value: < 2.2e-16

Once again, the “Call” line shows the model retained at the end of stepwise selection.

5. Discussion

For the first multiple regression model with the default data, Mallows’ Cp is 3.27 for a subset of four predictors. Cp is judged against the number of fitted parameters p (here five, counting the intercept), and a value at or below p indicates little bias from omitted variables, so the model is reasonably well fitted. Stepwise selection nonetheless retains two additional predictors, which suggests the model could benefit from more variables. The stepwise model has an R squared of 0.9981. R squared never decreases as predictors are added, whereas adjusted R squared penalizes each added predictor and can fall. The adjusted R squared is also 0.9981, which suggests that this is a very well-fitting model.
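
For reference, the two criteria used in this comparison can be written out as follows. These are the standard textbook definitions, which we assume match the quantities reported in the R output. Here n is the number of observations used, p is the number of fitted parameters including the intercept, SSE_p is the residual sum of squares of the candidate model, and the error variance is estimated from the full model.

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2_{\text{full}}} - n + 2p,
\qquad
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p}

% Check against the stepwise model for the default data:
% 247 residual degrees of freedom and 7 coefficients give n = 254, p = 7.
R^2_{\text{adj}} \approx 1 - (1 - 0.9981)\,\frac{253}{247} \approx 0.9981
```

A model with little bias has Cp close to p, and the check shows why R squared and adjusted R squared agree to four decimals here: with n = 254 the penalty factor 253/247 is very close to 1.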

For the first multiple regression model with the log-transformed data, Mallows’ Cp is 8.86 for a subset with ten parameters: nine predictor terms plus the intercept. This comparison suggests the model is relatively well fitted, but it retains more terms than the best subset for the default data. Stepwise selection on the log-transformed data keeps the same seven variables (Instar, ActiveFeeding, Fgp, LogWetFrass, LogDryFrass, LogCassim, and LogNfrass). Because it treats Instar as a single factor, it estimates all four Instar indicator coefficients, including Instar2, which the best subset omits. The stepwise model has an R squared of 0.9954 and an adjusted R squared of 0.9952, suggesting a very well-fitted model that is nearly identical to the one selected by Mallows’ Cp.
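
One detail behind the counts above is that a factor such as Instar enters a model as several indicator columns, one per level beyond the baseline. The sketch below shows this with a small hypothetical data frame (the values are made up for illustration and are not caterpillar measurements).

```r
# How one factor contributes several coefficients; the toy data are hypothetical.
toy <- data.frame(
  y      = c(1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.2, 8.8, 10.1),
  instar = factor(rep(1:5, each = 2)),
  x      = c(0.5, 0.7, 1.1, 1.6, 1.8, 2.4, 2.9, 3.1, 3.6, 4.0)
)

fit <- lm(y ~ instar + x, data = toy)

term_labels <- attr(terms(fit), "term.labels")  # variables named in the formula
coef_names  <- names(coef(fit))                 # coefficients actually estimated

print(term_labels)
print(coef_names)
```

The formula names two variables, but the fit estimates six coefficients: the intercept, four indicators for levels 2 to 5 of the factor, and the slope for x. Stepwise selection adds or drops the factor as a whole, while a best-subsets search over the expanded columns can keep some indicators and drop others.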

Based on this analysis, the natural log transformation appears to complicate the model fitting process rather than simplify it: both selection methods retain more terms for the log-transformed data than for the default data. This is the opposite of what a data transformation is usually intended to do.

6. Conclusion

In this project, we used two methods to fit multiple regression models to the default and natural log-transformed data. We compared the models using Mallows’ Cp, R squared, and adjusted R squared, and determined that a natural log transformation was not beneficial to the model fitting process in this case.

7. References

Caterpillar Dataset: [https://www.stat2.org/datasets/Caterpillars.csv]

8. Appendices

8.1 Setup Code and Reading Data

knitr::opts_chunk$set(echo = TRUE)
library(leaps)
library(MASS)

caterpillar_data <- read.csv('Caterpillars.csv')

8.2 Models for Default Data

8.2.1 Data Wrangling and Mallows’ Cp

caterpillar_data$Instar <- as.factor(caterpillar_data$Instar)

caterpillar_data$ActiveFeeding <- ifelse(caterpillar_data$ActiveFeeding == "Y", 1, 0)
caterpillar_data$Fgp <- ifelse(caterpillar_data$Fgp == "Y", 1, 0)
caterpillar_data$Mgp <- ifelse(caterpillar_data$Mgp == "Y", 1, 0)

caterpillar_data_no_log <- caterpillar_data[, !names(caterpillar_data) %in% 
                                              c("LogMass", "LogIntake", "LogWetFrass", 
                                                "LogDryFrass", "LogCassim", "LogNfrass", "LogNassim")]

best_subsets <- regsubsets(Nassim ~ Instar + ActiveFeeding + Fgp + Mgp + Mass + Intake + WetFrass + DryFrass + Cassim + Nfrass, 
                           data = caterpillar_data_no_log, nvmax = 10)

best_subsets_summary <- summary(best_subsets)

cp_values <- best_subsets_summary$cp

min_cp_index <- which.min(cp_values)

best_subset <- best_subsets_summary$which[min_cp_index, ]

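# Take the names from the `which` row itself. regsubsets adds an (Intercept)
# column and expands the factor Instar into indicator columns, so its column
# positions do not line up with the columns of caterpillar_data_no_log.
best_predictors <- names(best_subset)[best_subset]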

valid_predictors <- best_predictors[!is.na(best_predictors)]

cat("The best subset is:\n ", 
    paste(valid_predictors[valid_predictors != "Nassim"], collapse = ", "),
    "\nwith Mallow's Cp =", cp_values[min_cp_index], ".", sep = "")

8.2.2 Stepwise Selection

lm_model_full <- lm(Nassim ~ ., data = caterpillar_data_no_log)

stepwise_method <- step(lm_model_full, direction = "both")

summary(stepwise_method)

8.3 Models for Log Transformed Data

8.3.1 Mallows’ Cp (log)

log_transformed_data <- caterpillar_data[, c("Instar", "ActiveFeeding", "Fgp", "Mgp", "LogMass", "LogIntake", "LogWetFrass", "LogDryFrass", "LogCassim", "LogNfrass", "LogNassim")]

best_subsets_log <- regsubsets(LogNassim ~ Instar + ActiveFeeding + Fgp + Mgp + 
                                 LogMass + LogIntake + LogWetFrass + 
                                 LogDryFrass + LogCassim + LogNfrass, 
                               data = log_transformed_data, nvmax = ncol(log_transformed_data) - 1)

best_subsets_log_summary <- summary(best_subsets_log)

cp_values_log <- best_subsets_log_summary$cp

min_cp_index_log <- which.min(cp_values_log)

best_predictors_log <- best_subsets_log_summary$which[min_cp_index_log, ]

valid_predictors_log <- names(best_predictors_log)[best_predictors_log]

cat("The best subset of log-transformed variables is:\n ", 
    paste(valid_predictors_log, collapse = ", "),
    "\nwith Mallow's Cp =", cp_values_log[min_cp_index_log], ".", sep = "")

8.3.2 Stepwise Selection (log)

lm_model_full_log <- lm(LogNassim ~ ., data = log_transformed_data)

stepwise_model_log <- step(lm_model_full_log, direction = "both")

summary(stepwise_model_log)