Final Exam Regression Analysis

Through the quizzes in our course, we are familiar with the selling price of fish in Animal Crossing: New Horizons. Now, we will explore the selling price of insects. You should explore the data to understand what the variables mean prior to you answering questions. Throughout this test, use α=0.05.

1. Construct the Full Model

Construct the model that examines selling price as a function of if the insect is flying, whether the insect can be caught in the rain, the number of catches required, and all possible two-way interactions.

model1 <- lm(Sell ~ Flying + Rain + Catches + Flying:Rain + Flying:Catches + Rain:Catches, data = insects)
summary(model1)
## 
## Call:
## lm(formula = Sell ~ Flying + Rain + Catches + Flying:Rain + Flying:Catches + 
##     Rain:Catches, data = insects)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5285.8  -755.3  -243.9   340.0  6994.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      283.750    661.297   0.429    0.669
## Flying           195.192    824.567   0.237    0.814
## Rain             721.585    739.462   0.976    0.332
## Catches           88.813     53.315   1.666    0.100
## Flying:Rain    -1140.527   2070.727  -0.551    0.583
## Flying:Catches   -10.364     58.128  -0.178    0.859
## Rain:Catches      -3.003     53.860  -0.056    0.956
## 
## Residual standard error: 1870 on 73 degrees of freedom
## Multiple R-squared:  0.6861, Adjusted R-squared:  0.6603 
## F-statistic:  26.6 on 6 and 73 DF,  p-value: < 2.2e-16
summary_m1 <- summary(model1)
library(tibble)
model_c1 <- coefficients(model1)
model_s1 <- summary(model1)
model_ci1 <- as_tibble(confint(model1, level=0.95))
model_t1 <- as_tibble(model_s1[[4]])
p.value.string = function(p.value){
  p.value <- round(p.value, digits=4)
  if (p.value == 0) {
    return("p < 0.0001")
  } else {
    return(paste0("p = ", format(p.value, scientific = F)))
  }
}
almost_sas <- function(aov.results){
  par(mfrow=c(2,2))
  plot(aov.results, which=1)
  plot(aov.results, which=2)
  aov_residuals <- residuals(aov.results)
  plot(density(aov_residuals))
  hist(aov_residuals)
}

The regression line is as follows, such that \(Y={\mbox{Selling Price}}\) is: \[ \hat{Y} = 283.75 + 195.19X_{\mbox{Flying}} + 721.59X_{\mbox{Rain}} + 88.81X_{\mbox{Catches}} - 1140.53X_{\mbox{Flying:Rain}} - 10.36X_{\mbox{Flying:Catches}} - 3X_{\mbox{Rain:Catches}}\]

The new model indicates that for every 1 Bell increase in Selling Price, Flying increases by 195.19, Rain increases by 721.59, Catches increases by 88.81, the interaction between Flying and Rain decreases by 1140.53, the interaction between Flying and Catches decreases by 10.36, and the interaction between Rain and Catches decreases by 3.00.

2. Testing Interactions

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
anova(model1)
## Analysis of Variance Table
## 
## Response: Sell
##                Df    Sum Sq   Mean Sq  F value    Pr(>F)    
## Flying          1  22922423  22922423   6.5521  0.012546 *  
## Rain            1  33705852  33705852   9.6343  0.002717 ** 
## Catches         1 500410829 500410829 143.0354 < 2.2e-16 ***
## Flying:Rain     1    948154    948154   0.2710  0.604226    
## Flying:Catches  1    324673    324673   0.0928  0.761510    
## Rain:Catches    1     10876     10876   0.0031  0.955688    
## Residuals      73 255391173   3498509                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Construct the appropriate F-test to determine if we can remove all interaction terms at the same time.

Hypotheses

   \(H_0: \ \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0\)
   \(H_1:\) at least one \(\beta_i \ne 0\)

Test Statistic

   \(F_0 = 26.6\).

p-value

   \(p < 0.0001\).

Rejection Region

   Reject if \(p < \alpha\), where \(\alpha=0.05\).

Conclusion and Interpretation

   Reject \(H_0\). There is sufficient evidence to suggest that the regression line is significant. However, according to the ANOVA results, the interactions are not significant (each \(p_{\mbox{Interaction}}>\alpha=0.05\)).

3. Report Final Model

What is the final model after your test in Q2?

model2 <- lm(Sell ~ Flying + Rain + Catches)
summary(model2)
## 
## Call:
## lm(formula = Sell ~ Flying + Rain + Catches)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5258.1  -741.9  -191.9   310.9  7008.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  378.659    496.849   0.762    0.448    
## Flying       -27.024    612.380  -0.044    0.965    
## Rain         613.248    560.868   1.093    0.278    
## Catches       85.523      7.026  12.172   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1838 on 76 degrees of freedom
## Multiple R-squared:  0.6846, Adjusted R-squared:  0.6721 
## F-statistic: 54.98 on 3 and 76 DF,  p-value: < 2.2e-16
summary_m2 <- summary(model2)
library(tibble)
model_c2 <- coefficients(model2)
model_s2 <- summary(model2)
model_ci2 <- as_tibble(confint(model2, level=0.95))
model_t2 <- as_tibble(model_s2[[4]])

The resulting model, such that \(Y={\mbox{Selling Price}}\), is:

\[ \hat{Y} = 378.66 - 27.02X_{\mbox{Flying}} + 613.25X_{\mbox{Rain}} + 88.81X_{\mbox{Catches}}\]

4. Model Comparison

Find the R2adj for the model in Q1 and the model in Q3. Give interpretations for your R2adj values. Which model fits better and why? Based on these results, are you confident in your final model from Q3?

Model 1

\(R^2 = 0.6861\) and \(R^2_{\mbox{adj}} = 0.6603\)

According to the original model, 66.03% of the variance is explained by the model.

Model 2

\(R^2 = 0.6846\) and \(R^2_{\mbox{adj}} = 0.6721\)

According to the original model, 67.21% of the variance is explained by the model.

Since the second model explains more of the variance than the first model, the second model is likely more accurate, indicating greater confidence in Model 2 than Model 1.

5. Table for Final Model Results

Provide a table with point estimates and 95% confidence intervals on β as well as the p-value for the corresponding t-tests.

Predictor Estimate of \(\beta\) 95% CI for \(\beta\) p-value
Flying -27.02 (-1246.69, 1192.64) p = 0.9649
Rain 613.25 (-503.82, 1730.31) p = 0.2777
Catches 85.52 (71.53, 99.52) p < 0.0001

6. Interpret Final Model

Provide interpretations for each of the predictors and indicate whether or not they are significant predictors of selling price of insects.

According to Model 2, for every 1 Bell increase in Selling Price, Flying decreases by 27.02, Rain increases by 613.25, and Catches increases by 85.52.

However, according to the previous results from the predictors table, only Catches is a significant predictor (\(p<\alpha=0.05\)).

The model for predicting Selling Price based on Catches only is:

model4 <- lm(Sell ~ Catches)
model_c4 <- coefficients(model4)

\[ \hat{Y} = 686.18 + 87.68X_{\mbox{Catches}}\]

7. Visualization

Construct a graph for flying insects that allows the number of catches to vary (x-axis). Have one line for when it rains and one line for when it does not rain.

library(sjPlot)
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(sjmisc)
## 
## Attaching package: 'sjmisc'
## The following object is masked from 'package:purrr':
## 
##     is_empty
## The following object is masked from 'package:tidyr':
## 
##     replace_na
## The following object is masked from 'package:tibble':
## 
##     add_case
library(ggplot2)
data(insects)
## Warning in data(insects): data set 'insects' not found
theme_set(theme_sjplot())
insects$rainy_fac <- as.factor(insects$Rain)
insects$pred_rainy <- model_c2[[1]] + model_c2[[2]]*insects$Rain + model_c2[[3]]
insects$pred_notrainy <- model_c2[[1]] + model_c2[[2]]*insects$Rain
fit <- lm(Flying ~ Catches*Rain)
plot_model(fit, type="int")

It appears that there is no interaction effect regarding Rain.

8. Model Diagnostics

Check for outliers and, separately, any influence/leverage points. Do we need to perform sensitivity analysis?

Check for Outliers: Studentized Residuals

library(pander)
pander(rstandard(model4), style='rmarkdown')

According to the Studentized R results, observations 10 (-2.788), 28 (2.717), and 66 (2.717) appear be outliers, and observations 67 and 71 appear to be extreme outliers (both 4.990).

Check for Influence/Leverage Points: Cook’s Distance

library(lindia)
gg_cooksd(model2)+theme_bw()

According to the Cook’s Distance plot, observations 10, 28, 30, 31, 40, 66, 67, and 71 appear to have significant influence/leverage in the model.

Yes, sensitivity analysis should be conducted.

9. Sensitivity Analysis

If you determined that sensitivity analysis was needed in Q8, perform it here.

According to both the Studentized Residuals results and Cook’s Distance plot, observations, 10, 28, 66, 67, and 71 should be removed from the model.

Compare the two models: the model with the points in question and the model without the points in question. Is there a large difference between the two models?

model3 = lm(Sell[c(-10, -28, -66, -67, -71)] ~ Catches[c(-10, -28, -66, -67, -71)])
summary(model3)
## 
## Call:
## lm(formula = Sell[c(-10, -28, -66, -67, -71)] ~ Catches[c(-10, 
##     -28, -66, -67, -71)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2852.2  -276.5  -106.5   315.2  2702.1 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          406.503    136.292   2.983  0.00388 ** 
## Catches[c(-10, -28, -66, -67, -71)]   88.914      3.968  22.410  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1032 on 73 degrees of freedom
## Multiple R-squared:  0.8731, Adjusted R-squared:  0.8714 
## F-statistic: 502.2 on 1 and 73 DF,  p-value: < 2.2e-16
model_c3 <- coefficients(model3)
model_s3 <- summary(model3)
model_ci3 <- as_tibble(confint(model3, level=0.95))
model_t3 <- as_tibble(model_s3[[4]])

The resulting model, such that \(Y={\mbox{Selling Price}}\), is:

\[ \hat{Y} = 406.5 + 88.91X_{\mbox{Catches}}\]

The original model (Model 2), for comparison, is: \[ \hat{Y} = 378.66 - 27.02X_{\mbox{Flying}} + 613.25X_{\mbox{Rain}} + 88.81X_{\mbox{Catches}}\]

The two models differ quite significantly, especially regarding the removal of Flying and Rain because they were deemed insignificant in Q6. In the second model, \(\beta_{\mbox{Catches}}\) is nearly the same as it was in the previous model, but the intercept increased.

10. Model Assumptions

Assess the model assumptions graphically. Is our model valid?

almost_sas(model4)

According to the normality plots, it appears that the data are somewhat normally distributed, thereby meeting the normality and variance assumptions. The data do not perfectly match the desired plots (see the Normal Q-Q plot and histogram of residuals), but the data follow the required predictions (such as the density plot) quite closely. Therefore, the model assumptions are met, and the model is valid.