crb <- read.csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/CrabAgePredict.csv")
str(crb)
## 'data.frame':    3893 obs. of  9 variables:
##  $ Sex           : chr  "F" "M" "I" "F" ...
##  $ Length        : num  1.438 0.887 1.038 1.175 0.887 ...
##  $ Diameter      : num  1.175 0.65 0.775 0.887 0.662 ...
##  $ Height        : num  0.412 0.212 0.25 0.25 0.212 ...
##  $ Weight        : num  24.64 5.4 7.95 13.48 6.9 ...
##  $ Shucked.Weight: num  12.33 2.3 3.23 4.75 3.46 ...
##  $ Viscera.Weight: num  5.58 1.37 1.6 2.28 1.49 ...
##  $ Shell.Weight  : num  6.75 1.56 2.76 5.24 1.7 ...
##  $ Age           : int  9 6 6 10 6 8 15 10 13 7 ...

Here’s a breakdown of the features:

create scatter plot with line of best fit

#First 5 observations
head(crb);
##   Sex Length Diameter Height    Weight Shucked.Weight Viscera.Weight
## 1   F 1.4375   1.1750 0.4125 24.635715      12.332033       5.584852
## 2   M 0.8875   0.6500 0.2125  5.400580       2.296310       1.374951
## 3   I 1.0375   0.7750 0.2500  7.952035       3.231843       1.601747
## 4   F 1.1750   0.8875 0.2500 13.480187       4.748541       2.282135
## 5   I 0.8875   0.6625 0.2125  6.903103       3.458639       1.488349
## 6   F 1.5500   1.1625 0.3500 28.661344      13.579410       6.761356
##   Shell.Weight Age
## 1     6.747181   9
## 2     1.559222   6
## 3     2.764076   6
## 4     5.244657  10
## 5     1.700970   6
## 6     7.229122   8
#Dimension of dataset
dim(crb)
## [1] 3893    9
ggplot(data=crb,mapping=aes(Height,Weight)) + geom_point() + 
  geom_smooth(method=lm, se=FALSE)

Model 1

mod1 <- lm(Weight~Height,data=crb)
summary(mod1)
## 
## Call:
## lm(formula = Weight ~ Height, data = crb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -273.521   -4.240   -0.994    3.278   42.829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.0840     0.4491  -31.36   <2e-16 ***
## Height      107.7679     1.2310   87.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.062 on 3891 degrees of freedom
## Multiple R-squared:  0.6633, Adjusted R-squared:  0.6632 
## F-statistic:  7664 on 1 and 3891 DF,  p-value: < 2.2e-16
plot(mod1)

The analysis of the Residuals in Model 1 indicates extreme outliers in the dataset. However, the main assumptions of a linear regression model are not satisfied as the residual data points is bell shaped indicating non-linearity and heteroskedasticity (no constatnt variability)

Remove Outliers

#create id column
crb$id <- as.integer(rownames(crb))
#head(crb)

#remove outliers
crb2 <- crb[crb$id[-c(749,2257,3543)], ]
#crb2

Scatter plot without outliers

ggplot(data=crb2,mapping=aes(Height,Weight)) + geom_point() + 
  geom_smooth(method=lm, se=FALSE)

Model 2 (model without outliers)

mod2 <- lm((Weight)~Height,data=crb2)
summary(mod2)
## 
## Call:
## lm(formula = (Weight) ~ Height, data = crb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.506  -3.760  -0.487   3.210  33.438 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.3045     0.3837  -55.52   <2e-16 ***
## Height      128.6979     1.0614  121.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.349 on 3888 degrees of freedom
## Multiple R-squared:  0.7909, Adjusted R-squared:  0.7908 
## F-statistic: 1.47e+04 on 1 and 3888 DF,  p-value: < 2.2e-16
plot(mod2)

The analysis of the residuals indicates no constant variability and non-linearity. Though, the data distribution is normal with some outliers in the far end.

Apply Box-cox Transformation

bc<-boxcox(mod2,seq(-3,3))

Best Lambda \(lamb\) value

best <- bc$x[which(bc$y==max(bc$y))]
best
## [1] 0.4545455

The lambda value is approximately 0.5. That means that the relationship between Weight and Height of crabs can be express as thus:

\(\sqrt{Weight} =\beta*Height + \epsilon\)

ggplot(data=crb2,mapping=aes(Height,sqrt(Weight))) + geom_point() + 
  geom_smooth(method=lm, se=FALSE)

Model 3

mod3 <- lm(sqrt(Weight)~Height,data=crb2)
summary(mod3)
## 
## Call:
## lm(formula = sqrt(Weight) ~ Height, data = crb2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2625 -0.3645 -0.0121  0.3499  4.0419 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.55856    0.03585  -15.58   <2e-16 ***
## Height      14.80768    0.09916  149.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5932 on 3888 degrees of freedom
## Multiple R-squared:  0.8515, Adjusted R-squared:  0.8515 
## F-statistic: 2.23e+04 on 1 and 3888 DF,  p-value: < 2.2e-16
plot(mod3)

Model 3 represent a data without extreme outliers with a box-cox transformation.The model has a miniature p-value of <2e-16 and an R-squared \(R^2\) value of 0.8515 showing approximately 85% variabilty of weight of crabs being explained by their height. The data is normalized and the variability is constant as a result of the box-cox transformation or square-root transformation of the weight variable. So, it is wise to conclude that the linear model is appropriate after the transformation.