Problem
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
crb <- read.csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/CrabAgePredict.csv")
str(crb)
## 'data.frame': 3893 obs. of 9 variables:
## $ Sex : chr "F" "M" "I" "F" ...
## $ Length : num 1.438 0.887 1.038 1.175 0.887 ...
## $ Diameter : num 1.175 0.65 0.775 0.887 0.662 ...
## $ Height : num 0.412 0.212 0.25 0.25 0.212 ...
## $ Weight : num 24.64 5.4 7.95 13.48 6.9 ...
## $ Shucked.Weight: num 12.33 2.3 3.23 4.75 3.46 ...
## $ Viscera.Weight: num 5.58 1.37 1.6 2.28 1.49 ...
## $ Shell.Weight : num 6.75 1.56 2.76 5.24 1.7 ...
## $ Age : int 9 6 6 10 6 8 15 10 13 7 ...
Create a Quadratic term and Quantitative interaction
crb <- crb %>% mutate(WeightSQ = (Weight)**2,#Quadratic term
Age_Category = ifelse(Age > mean(Age), 1.0, 0.0)) #quantitative interaction term
Scatter plot without outliers
ggplot(data=crb,mapping=aes((Height+Age_Category*Diameter),Weight)) + geom_point()
Model 1 (Model with Outliers and no Transformation Applied)
mod1 <- lm(WeightSQ~Height+Age_Category*Diameter,data=crb)
summary(mod1)
##
## Call:
## lm(formula = WeightSQ ~ Height + Age_Category * Diameter, data = crb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3107.7 -214.8 -87.9 130.8 3805.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -958.54 34.23 -28.00 <2e-16 ***
## Height 1190.91 109.36 10.89 <2e-16 ***
## Age_Category -2842.96 69.41 -40.96 <2e-16 ***
## Diameter 1110.39 51.73 21.46 <2e-16 ***
## Age_Category:Diameter 2740.36 63.81 42.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 396.8 on 3888 degrees of freedom
## Multiple R-squared: 0.7516, Adjusted R-squared: 0.7514
## F-statistic: 2941 on 4 and 3888 DF, p-value: < 2.2e-16
plot(mod1)
Remove Outliers
#create id column
crb$id <- as.integer(rownames(crb))
#head(crb)
#remove outliers
crb2 <- crb[crb$id[-c(773,2317,2897)], ]
#head(crb2)
Scatter plot after removing initial outliers
ggplot(data=crb2,mapping=aes((Height+Age_Category*Diameter),Weight)) + geom_point()
Model 2 (model after removing initial outliers)
mod2 <- lm((Weight)~Height+Age_Category*Diameter,data=crb2)
summary(mod2)
##
## Call:
## lm(formula = (Weight) ~ Height + Age_Category * Diameter, data = crb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.349 -2.747 -0.777 2.070 24.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.6244 0.3757 -54.90 <2e-16 ***
## Height 17.8030 1.1996 14.84 <2e-16 ***
## Age_Category -26.4108 0.7625 -34.63 <2e-16 ***
## Diameter 35.5149 0.5677 62.56 <2e-16 ***
## Age_Category:Diameter 25.8650 0.7011 36.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.351 on 3885 degrees of freedom
## Multiple R-squared: 0.901, Adjusted R-squared: 0.9009
## F-statistic: 8838 on 4 and 3885 DF, p-value: < 2.2e-16
plot(mod2)
Apply Box-cox Transformation
bc<-boxcox(mod1,seq(-3,3))
Best Lambda \(\lambda\) value
best <- bc$x[which(bc$y==max(bc$y))]
best
## [1] 0.2727273
The lambda \(\lambda\) value is approximately 0.3. That means that the relationship between Weight and the explanatory variables of crabs can be express as thus:
\(\sqrt[3]{Weight} =\beta*Height + \beta*Age_Category + \beta*Diameter + \beta*Age_Category*Diameter + \epsilon\)
Model 3 (After removing Initial outliers and Applying Box-cox Transformation
mod3 <- lm((Weight)^0.3~Height+Age_Category*Diameter,data=crb2)
summary(mod3)
##
## Call:
## lm(formula = (Weight)^0.3 ~ Height + Age_Category * Diameter,
## data = crb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24479 -0.05990 -0.00625 0.05157 0.70405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.313337 0.009038 34.669 < 2e-16 ***
## Height 0.555024 0.028859 19.232 < 2e-16 ***
## Age_Category 0.080187 0.018344 4.371 1.27e-05 ***
## Diameter 1.911472 0.013657 139.962 < 2e-16 ***
## Age_Category:Diameter -0.066227 0.016866 -3.927 8.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1047 on 3885 degrees of freedom
## Multiple R-squared: 0.9611, Adjusted R-squared: 0.9611
## F-statistic: 2.403e+04 on 4 and 3885 DF, p-value: < 2.2e-16
plot(mod3)
Model 3 represent a data without extreme outliers with a box-cox transformation.The model has a miniature p-values for all the slopes for all variables shows that the variables are statistically significant and also the R-squared \(R^2\) value of 0.9611 tells that approximately 96% variability of weight of crabs being explained independent variables. The data is normalized and the variability is constant as a result of the box-cox transformation or cube-root transformation of the weight variable. So, it is wise to conclude that the linear model is appropriate after the transformation. The model can be summarized as thus:
\(\sqrt[3]{Weight} = 0.31 + 0.56*Height + 0.08*Age_Category + 1.91*Diameter - 0.07*Age_Category*Diameter\)