Crab Mating
1(a) Create a histogram of the number of satellite crabs surrounding female crabs. Interpret in the context of the data.
hist(crabdat$satell)
Most of the female crabs have 2 or less satellite crabs surrounding them. There are very few female crabs with greater than 5 satellite crabs.
1(b) Create side-by-side boxplots of the response by color. Interpret.
boxplot(crabdat$satell ~ crabdat$color)
Light colored females have the highest median number of satellite crabs, followed by medium, dark, and darker. However, there are some medium, dark, and darker females that have a very high number of satellite crabs, exceeding the highest number of satellite crabs from a light colored female. Many darker colored females have 0 satellite crabs, but there are some outliers that have upwards of 5 satellite crabs.
1(c) Create side-by-side boxplots of the response by spine. Interpret.
boxplot(crabdat$satell ~ crabdat$spine)
Females with good spine condition have the highest median number of satellite crabs, followed by females with both spines broken (bad) and females with one spine broken (middle). However, females with both spines broken have the highest maximum number of satellite crabs, as well as outliers upwards of 10 satellite crabs. Females with one spine broken tend have a median of 0 satellite crabs and a maximum of around 7 satellite crabs.
1(d) Create a scatterplot of the log of the response by width. Interpret.
plot(log(satell) ~ width, data = crabdat)
There does not appear to be a relationship between female carapace width and the log number of satellites.
1(e) Use the plots you’ve created as evidence to argue that a Poisson regression is appropriate.
Width does not appear to have a linear relationship with the log number of satellites, implying that linear regression may not be appropriate. The histogram of satellites is right skewed, suggesting that the number of satellites can be reasonably modeled with a Poisson distribution. Additionally, the boxplots are mainly skewed right, which shows that satellites continues to follow a Poisson distribution with predictors color and spine condition.
1(f) Before doing any further analysis, which predictors do you think will have the strongest relationship with the mean number of satellite crabs that a female has?
I think color will have the strongest relationship with the mean number of satellite crabs a female has because there is a large difference in medians in the boxplot.
mod1 = glm(satell ~ color, family = poisson, data = crabdat)
summary(mod1)
##
## Call:
## glm(formula = satell ~ color, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8577 -2.1106 -0.1649 0.8721 4.7491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80078 0.10102 7.927 2.24e-15 ***
## colordarker -0.08516 0.18007 -0.473 0.636279
## colorlight 0.60614 0.17496 3.464 0.000532 ***
## colormedium 0.39155 0.11575 3.383 0.000718 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 6
exp(-0.08516) #darker
## [1] 0.9183653
exp(0.60614) #light
## [1] 1.833341
exp(0.39155) #medium
## [1] 1.479272
There is a significant difference in number of satellites between females with dark coloring and females with light or medium coloring. Darker females tend to have 0.9183653 times as many satellites as dark females. Light females tend to have 1.833341 times as many satellites as dark females. Medium females tend to have 1.479272 times as many satellites as dark females.
2(b) Just spine
mod2 = glm(satell ~ spine, family = poisson, data = crabdat)
summary(mod2)
##
## Call:
## glm(formula = satell ~ spine, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7014 -2.3706 -0.5097 1.1252 5.0859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03316 0.05423 19.051 <2e-16 ***
## spinegood 0.26120 0.10173 2.568 0.0102 *
## spinemiddle -0.34001 0.19045 -1.785 0.0742 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 621.16 on 170 degrees of freedom
## AIC: 982.46
##
## Number of Fisher Scoring iterations: 5
exp(0.2612) #good
## [1] 1.298487
exp(-0.34001) #middle
## [1] 0.7117632
There is a significant difference in number of satellites between females with good spine condition and females with both spines broken. Females with good spine condition tend to have 1.298487 times as many satellites as females with both spines broken. Females with one spine broken tend to have 0.7117632 times as many satellites as females with both spines broken.
2(c) Just width
mod3 = glm(satell ~ width, family = poisson, data = crabdat)
summary(mod3)
##
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
exp(0.16405)
## [1] 1.178273
An increase of 1 cm in a female’s carapace width is associated with multiplicative change of 1.178273 in the mean number satellites.
AIC(mod1)
## [1] 972.4368
AIC(mod2)
## [1] 982.4582
AIC(mod3)
## [1] 927.1762
The AIC for mod3 is lower than the AICs for mod1 and mod2, so mod3 is a better model. This suggests that carapace width is an important predictor for the number of satellites.
halfnorm(residuals(mod3)) #test for outliers
dp1 = sum(residuals(mod3, type="pearson")^2 / mod3$df.res) #test for overdispersion
dp1
## [1] 3.182205
There appear to be about 4 outliers, shown by the points not following the trend in the half-norm plot. According to the dispersion parameter, the variance is about 3x bigger than the mean.
bigmod = glm(satell ~ width + color + spine, family = poisson, data = crabdat) #model with all 3 predictors
mod0 = glm(satell ~ 1, family = poisson, data = crabdat)
fwd.model = stepAIC(mod0, direction='forward', scope=(~ crabdat$width + crabdat$spine + crabdat$color)) #forward selection using AIC suggests that spine should be dropped from the model
## Start: AIC=990.09
## satell ~ 1
##
## Df Deviance AIC
## + crabdat$width 1 567.88 927.18
## + crabdat$color 3 609.14 972.44
## + crabdat$spine 2 621.16 982.46
## <none> 632.79 990.09
##
## Step: AIC=927.18
## satell ~ crabdat$width
##
## Df Deviance AIC
## + crabdat$color 3 559.34 924.64
## <none> 567.88 927.18
## + crabdat$spine 2 566.60 929.90
##
## Step: AIC=924.64
## satell ~ crabdat$width + crabdat$color
##
## Df Deviance AIC
## <none> 559.34 924.64
## + crabdat$spine 2 558.63 927.93
anova(bigmod, glm(satell ~ width + color, family = poisson, data = crabdat), test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + color + spine
## Model 2: satell ~ width + color
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 166 558.63
## 2 168 559.34 -2 -0.7153 0.6993
#p-value is large so spine should be dropped from the model
anova(mod3, glm(satell ~ width + color, family = poisson, data = crabdat), test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width
## Model 2: satell ~ width + color
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 168 559.34 3 8.5338 0.03618 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#p-value is small so width should not be dropped from the model
anova(mod1, glm(satell ~ width + color, family = poisson, data = crabdat), test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ color
## Model 2: satell ~ width + color
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 169 609.14
## 2 168 559.34 1 49.794 1.707e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#p-value is small so color should not be dropped from the model
finalmod = glm(satell ~ width + color, family = poisson, data = crabdat)
summary(finalmod)
##
## Call:
## glm(formula = satell ~ width + color, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0415 -1.9581 -0.5575 0.9830 4.7523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.08640 0.55750 -5.536 3.09e-08 ***
## width 0.14934 0.02084 7.166 7.73e-13 ***
## colordarker -0.01100 0.18041 -0.061 0.9514
## colorlight 0.43636 0.17636 2.474 0.0133 *
## colormedium 0.23668 0.11815 2.003 0.0452 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
exp(0.14934) #width
## [1] 1.161068
exp(-0.01100) #darker color
## [1] 0.9890603
exp(0.43636) #light color
## [1] 1.547066
exp(0.23668) #medium color
## [1] 1.267036
An increase of 1 cm in a female’s carapace width is associated with multiplicative change of 1.161068 in the mean number satellites, holding color constant. Females with darker coloring tend to have 0.9890603 times as many satellites as dark females, light females tend to have 1.547066 times as many satellites as dark females, and medium females tend to have 1.267036 times as many satellites as dark females, holding width constant.
halfnorm(residuals(finalmod))
All points on the half-norm plot seem to be following the trend, suggesting that there are no outliers.
dp2 = sum(residuals(finalmod, type="pearson")^2 / finalmod$df.res)
dp2
## [1] 3.233625
Elephant mating (a) Create a histogram of MATINGS. Is there preliminary evidence that number of matings could be modeled as a Poisson response? Explain.
hist(eleph$MATINGS)
The histogram is skewed right, following a Poisson distribution, suggesting that Poisson regression may be appropriate.
plot(MATINGS ~ AGE, data = eleph)
abline(lm(MATINGS ~ AGE, data = eleph))
The data doesn’t seem to fit the least squares line very well, suggsting that linear regression might not be appropriate.
mmean = c(0, 1.5, 1, 1, 2, 3, 1.75, 5.5, 2.6666, 2, 1, 3, 4, 3.6, 3, 5, 7, 2, 9)
age = c(27, 28, 29, 30, 32, 33, 34, 36, 37, 38, 39, 41, 42, 43, 44, 45, 47, 48, 52)
plot(age, log(mmean))
There does not appear to be a curvilinear relationship between age and the log of the mean number of matings. This suggests that a quadratic term would not appropriate.
pmod = glm(MATINGS ~ AGE, family = poisson, data = eleph)
summary(pmod)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
exp(0.06869)
## [1] 1.071104
An increase of 1 year in age is associated with multiplicative change of 1.071104 in the mean number of elephant matings.
exp(confint(pmod, level = 0.95))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.0694813 0.5892357
## AGE 1.0425585 1.1003602
An increase of 1 year in age is associated with multiplicative change between 1.0425585 and 1.10003602 in the mean number of elephant matings.
summary(pmod) #Wald test
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
mod.0 = glm(MATINGS ~ 1, family = poisson, data = eleph)
anova(mod.0, pmod, test = "Chisq") #drop in deviance test
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 40 75.372
## 2 39 51.012 1 24.36 7.991e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for age in the Wald test is much less than 0.05, so the number of matings is significantly related to age. In the drop in deviance test, the p-value is very small so age should not be dropped from the model.
eleph = eleph %>% mutate(AGE2 = AGE*AGE) #new quadratic variable
pmod.quad = glm(MATINGS ~ AGE + AGE2, family = poisson, data = eleph) #model with quadratic term
summary(pmod.quad) #Wald test
##
## Call:
## glm(formula = MATINGS ~ AGE + AGE2, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.0356383 -0.941 0.347
## AGE 0.1359544 0.1580095 0.860 0.390
## AGE2 -0.0008595 0.0020124 -0.427 0.669
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
anova(pmod, pmod.quad, test = "Chisq") #drop in deviance
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AGE2
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 39 51.012
## 2 38 50.826 1 0.18544 0.6667
The p-value for the quadratic age variable is > 0.05, suggesting that the linear model is better. The p-value for the drop in deviance test is also large, so we should drop the quadratic age term.
1-pchisq(pmod$deviance, pmod$df.residual)
## [1] 0.09426231
The p-value is pretty small, indicating a lack of fit.
Estimate the dispersion parameter and use it to redo f, g, h
dp3 = sum(residuals(pmod, type="pearson")^2 / pmod$df.res)
dp3
## [1] 1.157334
summary(pmod, dispersion = dp3) #with dispersion parameter
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.58590 -2.700 0.00693 **
## AGE 0.06869 0.01479 4.645 3.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
anova(mod.0, pmod, test = "Chisq", dispersion = dp3)
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 40 75.372
## 2 39 51.012 1 24.36 4.478e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for age in the Wald test is much less than 0.05, so the number of matings is significantly related to age. In the drop in deviance test, the p-value is very small so age should not be dropped from the model.
summary(pmod.quad, dispersion = dp3) #Wald test
##
## Call:
## glm(formula = MATINGS ~ AGE + AGE2, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.2657237 -0.875 0.382
## AGE 0.1359544 0.1699858 0.800 0.424
## AGE2 -0.0008595 0.0021649 -0.397 0.691
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
anova(pmod, pmod.quad, test = "Chisq", dispersion = dp3) #drop in deviance
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AGE2
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 39 51.012
## 2 38 50.826 1 0.18544 0.6889
The p-value for the quadratic age variable is > 0.05, suggesting that the linear model is better. The p-value for the drop in deviance test is also large, so we should drop the quadratic age term.
1-pchisq(dp3, pmod$df.residual)
## [1] 1
The p-value indicates a better fit than the model with the assumed deviance.