Problem

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

crb <- read.csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/CrabAgePredict.csv")
str(crb)
## 'data.frame':    3893 obs. of  9 variables:
##  $ Sex           : chr  "F" "M" "I" "F" ...
##  $ Length        : num  1.438 0.887 1.038 1.175 0.887 ...
##  $ Diameter      : num  1.175 0.65 0.775 0.887 0.662 ...
##  $ Height        : num  0.412 0.212 0.25 0.25 0.212 ...
##  $ Weight        : num  24.64 5.4 7.95 13.48 6.9 ...
##  $ Shucked.Weight: num  12.33 2.3 3.23 4.75 3.46 ...
##  $ Viscera.Weight: num  5.58 1.37 1.6 2.28 1.49 ...
##  $ Shell.Weight  : num  6.75 1.56 2.76 5.24 1.7 ...
##  $ Age           : int  9 6 6 10 6 8 15 10 13 7 ...

Create a Quadratic term and Quantitative interaction

crb <- crb %>% mutate(WeightSQ = (Weight)**2,#Quadratic term
                      Age_Category = ifelse(Age > mean(Age), 1.0, 0.0)) #quantitative interaction term  

Scatter plot without outliers

ggplot(data=crb,mapping=aes((Height+Age_Category*Diameter),Weight)) + geom_point() 

Model 1 (Model with Outliers and no Transformation Applied)

mod1 <- lm(WeightSQ~Height+Age_Category*Diameter,data=crb)
summary(mod1)
## 
## Call:
## lm(formula = WeightSQ ~ Height + Age_Category * Diameter, data = crb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3107.7  -214.8   -87.9   130.8  3805.0 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -958.54      34.23  -28.00   <2e-16 ***
## Height                 1190.91     109.36   10.89   <2e-16 ***
## Age_Category          -2842.96      69.41  -40.96   <2e-16 ***
## Diameter               1110.39      51.73   21.46   <2e-16 ***
## Age_Category:Diameter  2740.36      63.81   42.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 396.8 on 3888 degrees of freedom
## Multiple R-squared:  0.7516, Adjusted R-squared:  0.7514 
## F-statistic:  2941 on 4 and 3888 DF,  p-value: < 2.2e-16
plot(mod1)

Remove Outliers

#create id column
crb$id <- as.integer(rownames(crb))
#head(crb)

#remove outliers
crb2 <- crb[crb$id[-c(773,2317,2897)], ]
#head(crb2)

Scatter plot after removing initial outliers

ggplot(data=crb2,mapping=aes((Height+Age_Category*Diameter),Weight)) + geom_point() 

Model 2 (model after removing initial outliers)

mod2 <- lm((Weight)~Height+Age_Category*Diameter,data=crb2)
summary(mod2)
## 
## Call:
## lm(formula = (Weight) ~ Height + Age_Category * Diameter, data = crb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.349  -2.747  -0.777   2.070  24.249 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -20.6244     0.3757  -54.90   <2e-16 ***
## Height                 17.8030     1.1996   14.84   <2e-16 ***
## Age_Category          -26.4108     0.7625  -34.63   <2e-16 ***
## Diameter               35.5149     0.5677   62.56   <2e-16 ***
## Age_Category:Diameter  25.8650     0.7011   36.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.351 on 3885 degrees of freedom
## Multiple R-squared:  0.901,  Adjusted R-squared:  0.9009 
## F-statistic:  8838 on 4 and 3885 DF,  p-value: < 2.2e-16
plot(mod2)

Apply Box-cox Transformation

bc<-boxcox(mod1,seq(-3,3))

Best Lambda \(\lambda\) value

best <- bc$x[which(bc$y==max(bc$y))]
best
## [1] 0.2727273

The lambda \(\lambda\) value is approximately 0.3. That means that the relationship between Weight and the explanatory variables of crabs can be express as thus:

\(\sqrt[3]{Weight} =\beta*Height + \beta*Age_Category + \beta*Diameter + \beta*Age_Category*Diameter + \epsilon\)

Model 3 (After removing Initial outliers and Applying Box-cox Transformation

mod3 <- lm((Weight)^0.3~Height+Age_Category*Diameter,data=crb2)
summary(mod3)
## 
## Call:
## lm(formula = (Weight)^0.3 ~ Height + Age_Category * Diameter, 
##     data = crb2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24479 -0.05990 -0.00625  0.05157  0.70405 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.313337   0.009038  34.669  < 2e-16 ***
## Height                 0.555024   0.028859  19.232  < 2e-16 ***
## Age_Category           0.080187   0.018344   4.371 1.27e-05 ***
## Diameter               1.911472   0.013657 139.962  < 2e-16 ***
## Age_Category:Diameter -0.066227   0.016866  -3.927 8.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1047 on 3885 degrees of freedom
## Multiple R-squared:  0.9611, Adjusted R-squared:  0.9611 
## F-statistic: 2.403e+04 on 4 and 3885 DF,  p-value: < 2.2e-16
plot(mod3)

Model 3 represent a data without extreme outliers with a box-cox transformation.The model has a miniature p-values for all the slopes for all variables shows that the variables are statistically significant and also the R-squared \(R^2\) value of 0.9611 tells that approximately 96% variability of weight of crabs being explained independent variables. The data is normalized and the variability is constant as a result of the box-cox transformation or cube-root transformation of the weight variable. So, it is wise to conclude that the linear model is appropriate after the transformation. The model can be summarized as thus:

\(\sqrt[3]{Weight} = 0.31 + 0.56*Height + 0.08*Age_Category + 1.91*Diameter - 0.07*Age_Category*Diameter\)