This dataset comes from a study of female horseshoe crabs on an island in the Gulf of Mexico. I seek to predict the carapace width ‘W’ in (cm) based on a quadratic of the weight in kg, and the spine condition (1, both good; 2, one worn or broken; 3, both worn or broken), as well as an interaction between the spine conditions and weights.
Crabs <- read.table("Crabs.dat", header=T)
attach(Crabs)
head(Crabs)
## crab y weight width color spine
## 1 1 8 3.05 28.3 2 3
## 2 2 0 1.55 22.5 3 3
## 3 3 9 2.30 26.0 1 1
## 4 4 0 2.10 24.8 3 3
## 5 5 4 2.60 26.0 3 3
## 6 6 0 2.10 23.8 2 3
str(Crabs)
## 'data.frame': 173 obs. of 6 variables:
## $ crab : int 1 2 3 4 5 6 7 8 9 10 ...
## $ y : int 8 0 9 0 4 0 0 0 0 0 ...
## $ weight: num 3.05 1.55 2.3 2.1 2.6 2.1 2.35 1.9 1.95 2.15 ...
## $ width : num 28.3 22.5 26 24.8 26 23.8 26.5 24.7 23.7 25.6 ...
## $ color : int 2 3 1 3 3 2 1 3 2 3 ...
## $ spine : int 3 3 1 3 3 3 1 2 1 3 ...
mod1 <- lm(width~I(weight^2) + weight*factor(spine), Crabs)
summary(mod1)
##
## Call:
## lm(formula = width ~ I(weight^2) + weight * factor(spine), data = Crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6999 -0.5675 -0.0188 0.4717 3.4986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.3001 1.3742 14.772 <2e-16 ***
## I(weight^2) 0.1224 0.1452 0.843 0.4007
## weight 2.1905 0.9045 2.422 0.0165 *
## factor(spine)2 -2.2341 1.4360 -1.556 0.1217
## factor(spine)3 -1.3739 0.8599 -1.598 0.1120
## weight:factor(spine)2 0.6284 0.6275 1.002 0.3180
## weight:factor(spine)3 0.5544 0.3329 1.665 0.0977 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9613 on 166 degrees of freedom
## Multiple R-squared: 0.7995, Adjusted R-squared: 0.7923
## F-statistic: 110.3 on 6 and 166 DF, p-value: < 2.2e-16
From the model the weight itself is the only explanatory variable with any statistical significance.
cor(width, weight)
## [1] 0.8868715
Indeed there is a strong positive correlation between weight and width of 0.8868715, as would be expected.
The dichotomous spine condition predictors have a negative coefficient, also expected as spine = 2 and =3 refer to poor condition and lesser health.
The quadratic has the highest p-value, and is thus the least significant, as it likely exhibits multicollinearity with weight.
The Multiple R-squared of this HW model is 0.7995, which is slightly better than a univariate model that considers weight only (Multiple R-squared: 0.7865). Nevertheless a parsimonious linear model with only weight is still much more reasonable:
mod2 <- lm(width~weight, Crabs)
summary(mod2)
##
## Call:
## lm(formula = width ~ weight, data = Crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3954 -0.5817 -0.0370 0.4942 3.8874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3985 0.3234 56.89 <2e-16 ***
## weight 3.2416 0.1291 25.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9773 on 171 degrees of freedom
## Multiple R-squared: 0.7865, Adjusted R-squared: 0.7853
## F-statistic: 630.1 on 1 and 171 DF, p-value: < 2.2e-16
The residuals appear good but for three outlying narrow crabs.
hist(resid(mod1))
The residuals indeed seem more or less normal despite the arbitrary nature of this model.