crb <- read.csv("https://raw.githubusercontent.com/nnaemeka-git/global-datasets/main/CrabAgePredict.csv")
str(crb)
## 'data.frame': 3893 obs. of 9 variables:
## $ Sex : chr "F" "M" "I" "F" ...
## $ Length : num 1.438 0.887 1.038 1.175 0.887 ...
## $ Diameter : num 1.175 0.65 0.775 0.887 0.662 ...
## $ Height : num 0.412 0.212 0.25 0.25 0.212 ...
## $ Weight : num 24.64 5.4 7.95 13.48 6.9 ...
## $ Shucked.Weight: num 12.33 2.3 3.23 4.75 3.46 ...
## $ Viscera.Weight: num 5.58 1.37 1.6 2.28 1.49 ...
## $ Shell.Weight : num 6.75 1.56 2.76 5.24 1.7 ...
## $ Age : int 9 6 6 10 6 8 15 10 13 7 ...
Here’s a breakdown of the features:
Sex: The gender of the entity, represented by ‘F’ for female, ‘M’ for male, and ‘I’ for infant.
Length: The length of the crab.
Diameter: The diameter of the crab.
Height: The height of the crab.
Weight: The overall weight of the crab.
Shucked Weight: The weight of the edible portion (shucked meat) of the crab.
Viscera Weight: The weight of the gut or internal organs of the crab.
Shell Weight: The weight of the shell of the crab.
Age: The age of the crab.
create scatter plot with line of best fit
#First 5 observations
head(crb);
## Sex Length Diameter Height Weight Shucked.Weight Viscera.Weight
## 1 F 1.4375 1.1750 0.4125 24.635715 12.332033 5.584852
## 2 M 0.8875 0.6500 0.2125 5.400580 2.296310 1.374951
## 3 I 1.0375 0.7750 0.2500 7.952035 3.231843 1.601747
## 4 F 1.1750 0.8875 0.2500 13.480187 4.748541 2.282135
## 5 I 0.8875 0.6625 0.2125 6.903103 3.458639 1.488349
## 6 F 1.5500 1.1625 0.3500 28.661344 13.579410 6.761356
## Shell.Weight Age
## 1 6.747181 9
## 2 1.559222 6
## 3 2.764076 6
## 4 5.244657 10
## 5 1.700970 6
## 6 7.229122 8
#Dimension of dataset
dim(crb)
## [1] 3893 9
ggplot(data=crb,mapping=aes(Height,Weight)) + geom_point() +
geom_smooth(method=lm, se=FALSE)
Model 1
mod1 <- lm(Weight~Height,data=crb)
summary(mod1)
##
## Call:
## lm(formula = Weight ~ Height, data = crb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -273.521 -4.240 -0.994 3.278 42.829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.0840 0.4491 -31.36 <2e-16 ***
## Height 107.7679 1.2310 87.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.062 on 3891 degrees of freedom
## Multiple R-squared: 0.6633, Adjusted R-squared: 0.6632
## F-statistic: 7664 on 1 and 3891 DF, p-value: < 2.2e-16
plot(mod1)
The analysis of the Residuals in Model 1 indicates extreme outliers in the dataset. However, the main assumptions of a linear regression model are not satisfied as the residual data points is bell shaped indicating non-linearity and heteroskedasticity (no constatnt variability)
Remove Outliers
#create id column
crb$id <- as.integer(rownames(crb))
#head(crb)
#remove outliers
crb2 <- crb[crb$id[-c(749,2257,3543)], ]
#crb2
Scatter plot without outliers
ggplot(data=crb2,mapping=aes(Height,Weight)) + geom_point() +
geom_smooth(method=lm, se=FALSE)
Model 2 (model without outliers)
mod2 <- lm((Weight)~Height,data=crb2)
summary(mod2)
##
## Call:
## lm(formula = (Weight) ~ Height, data = crb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.506 -3.760 -0.487 3.210 33.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.3045 0.3837 -55.52 <2e-16 ***
## Height 128.6979 1.0614 121.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.349 on 3888 degrees of freedom
## Multiple R-squared: 0.7909, Adjusted R-squared: 0.7908
## F-statistic: 1.47e+04 on 1 and 3888 DF, p-value: < 2.2e-16
plot(mod2)
The analysis of the residuals indicates no constant variability and non-linearity. Though, the data distribution is normal with some outliers in the far end.
Apply Box-cox Transformation
bc<-boxcox(mod2,seq(-3,3))
Best Lambda \(lamb\) value
best <- bc$x[which(bc$y==max(bc$y))]
best
## [1] 0.4545455
The lambda value is approximately 0.5. That means that the relationship between Weight and Height of crabs can be express as thus:
\(\sqrt{Weight} =\beta*Height + \epsilon\)
ggplot(data=crb2,mapping=aes(Height,sqrt(Weight))) + geom_point() +
geom_smooth(method=lm, se=FALSE)
Model 3
mod3 <- lm(sqrt(Weight)~Height,data=crb2)
summary(mod3)
##
## Call:
## lm(formula = sqrt(Weight) ~ Height, data = crb2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2625 -0.3645 -0.0121 0.3499 4.0419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.55856 0.03585 -15.58 <2e-16 ***
## Height 14.80768 0.09916 149.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5932 on 3888 degrees of freedom
## Multiple R-squared: 0.8515, Adjusted R-squared: 0.8515
## F-statistic: 2.23e+04 on 1 and 3888 DF, p-value: < 2.2e-16
plot(mod3)
Model 3 represent a data without extreme outliers with a box-cox transformation.The model has a miniature p-value of <2e-16 and an R-squared \(R^2\) value of 0.8515 showing approximately 85% variabilty of weight of crabs being explained by their height. The data is normalized and the variability is constant as a result of the box-cox transformation or square-root transformation of the weight variable. So, it is wise to conclude that the linear model is appropriate after the transformation.