Poisson HW

Crab Mating

1(a) Create a histogram of the number of satellite crabs surrounding female crabs. Interpret in the context of the data.

hist(crabdat$satell)

Most of the female crabs have 2 or less satellite crabs surrounding them. There are very few female crabs with greater than 5 satellite crabs.

1(b) Create side-by-side boxplots of the response by color. Interpret.

boxplot(crabdat$satell ~ crabdat$color)

Light colored females have the highest median number of satellite crabs, followed by medium, dark, and darker. However, there are some medium, dark, and darker females that have a very high number of satellite crabs, exceeding the highest number of satellite crabs from a light colored female. Many darker colored females have 0 satellite crabs, but there are some outliers that have upwards of 5 satellite crabs.

1(c) Create side-by-side boxplots of the response by spine. Interpret.

boxplot(crabdat$satell ~ crabdat$spine)

Females with good spine condition have the highest median number of satellite crabs, followed by females with both spines broken (bad) and females with one spine broken (middle). However, females with both spines broken have the highest maximum number of satellite crabs, as well as outliers upwards of 10 satellite crabs. Females with one spine broken tend have a median of 0 satellite crabs and a maximum of around 7 satellite crabs.

1(d) Create a scatterplot of the log of the response by width. Interpret.

plot(log(satell) ~ width, data = crabdat)

There does not appear to be a relationship between female carapace width and the log number of satellites.

1(e) Use the plots you’ve created as evidence to argue that a Poisson regression is appropriate.

Width does not appear to have a linear relationship with the log number of satellites, implying that linear regression may not be appropriate. The histogram of satellites is right skewed, suggesting that the number of satellites can be reasonably modeled with a Poisson distribution. Additionally, the boxplots are mainly skewed right, which shows that satellites continues to follow a Poisson distribution with predictors color and spine condition.

1(f) Before doing any further analysis, which predictors do you think will have the strongest relationship with the mean number of satellite crabs that a female has?

I think color will have the strongest relationship with the mean number of satellite crabs a female has because there is a large difference in medians in the boxplot.

Create and interpret models with the following predictors.

Just color

mod1 = glm(satell ~ color, family = poisson, data = crabdat)
summary(mod1)

## 
## Call:
## glm(formula = satell ~ color, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8577  -2.1106  -0.1649   0.8721   4.7491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.80078    0.10102   7.927 2.24e-15 ***
## colordarker -0.08516    0.18007  -0.473 0.636279    
## colorlight   0.60614    0.17496   3.464 0.000532 ***
## colormedium  0.39155    0.11575   3.383 0.000718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 609.14  on 169  degrees of freedom
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 6

exp(-0.08516) #darker

## [1] 0.9183653

exp(0.60614) #light

## [1] 1.833341

exp(0.39155) #medium

## [1] 1.479272

There is a significant difference in number of satellites between females with dark coloring and females with light or medium coloring. Darker females tend to have 0.9183653 times as many satellites as dark females. Light females tend to have 1.833341 times as many satellites as dark females. Medium females tend to have 1.479272 times as many satellites as dark females.

2(b) Just spine

mod2 = glm(satell ~ spine, family = poisson, data = crabdat)
summary(mod2)

## 
## Call:
## glm(formula = satell ~ spine, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7014  -2.3706  -0.5097   1.1252   5.0859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.03316    0.05423  19.051   <2e-16 ***
## spinegood    0.26120    0.10173   2.568   0.0102 *  
## spinemiddle -0.34001    0.19045  -1.785   0.0742 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 621.16  on 170  degrees of freedom
## AIC: 982.46
## 
## Number of Fisher Scoring iterations: 5

exp(0.2612) #good

## [1] 1.298487

exp(-0.34001) #middle

## [1] 0.7117632

There is a significant difference in number of satellites between females with good spine condition and females with both spines broken. Females with good spine condition tend to have 1.298487 times as many satellites as females with both spines broken. Females with one spine broken tend to have 0.7117632 times as many satellites as females with both spines broken.

2(c) Just width

mod3 = glm(satell ~ width, family = poisson, data = crabdat)
summary(mod3)

## 
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

exp(0.16405)

## [1] 1.178273

An increase of 1 cm in a female’s carapace width is associated with multiplicative change of 1.178273 in the mean number satellites.

Use AIC to compare the three models you’ve created. Does the AIC recommendation line up with your prediction from the previous problem?

AIC(mod1)

## [1] 972.4368

AIC(mod2)

## [1] 982.4582

AIC(mod3)

## [1] 927.1762

The AIC for mod3 is lower than the AICs for mod1 and mod2, so mod3 is a better model. This suggests that carapace width is an important predictor for the number of satellites.

Test for lack-of-fit for the model that you chose using AIC.

halfnorm(residuals(mod3)) #test for outliers

dp1 = sum(residuals(mod3, type="pearson")^2 / mod3$df.res) #test for overdispersion
dp1

## [1] 3.182205

There appear to be about 4 outliers, shown by the points not following the trend in the half-norm plot. According to the dispersion parameter, the variance is about 3x bigger than the mean.

Use forward-selection and the drop-in-deviance test to add more predictors to your model. Write the regression equation and interpret each coefficient of your final model. (Do not allow for an overdispersion parameter yet)

bigmod = glm(satell ~ width + color + spine, family = poisson, data = crabdat) #model with all 3 predictors
mod0 = glm(satell ~ 1, family = poisson, data = crabdat)
fwd.model = stepAIC(mod0, direction='forward', scope=(~ crabdat$width + crabdat$spine + crabdat$color)) #forward selection using AIC suggests that spine should be dropped from the model

## Start:  AIC=990.09
## satell ~ 1
## 
##                 Df Deviance    AIC
## + crabdat$width  1   567.88 927.18
## + crabdat$color  3   609.14 972.44
## + crabdat$spine  2   621.16 982.46
## <none>               632.79 990.09
## 
## Step:  AIC=927.18
## satell ~ crabdat$width
## 
##                 Df Deviance    AIC
## + crabdat$color  3   559.34 924.64
## <none>               567.88 927.18
## + crabdat$spine  2   566.60 929.90
## 
## Step:  AIC=924.64
## satell ~ crabdat$width + crabdat$color
## 
##                 Df Deviance    AIC
## <none>               559.34 924.64
## + crabdat$spine  2   558.63 927.93

anova(bigmod, glm(satell ~ width + color, family = poisson, data = crabdat), test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width + color + spine
## Model 2: satell ~ width + color
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       166     558.63                     
## 2       168     559.34 -2  -0.7153   0.6993

#p-value is large so spine should be dropped from the model

anova(mod3, glm(satell ~ width + color, family = poisson, data = crabdat), test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width
## Model 2: satell ~ width + color
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       171     567.88                       
## 2       168     559.34  3   8.5338  0.03618 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#p-value is small so width should not be dropped from the model

anova(mod1, glm(satell ~ width + color, family = poisson, data = crabdat), test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ color
## Model 2: satell ~ width + color
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       169     609.14                          
## 2       168     559.34  1   49.794 1.707e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#p-value is small so color should not be dropped from the model

finalmod = glm(satell ~ width + color, family = poisson, data = crabdat)
summary(finalmod)

## 
## Call:
## glm(formula = satell ~ width + color, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0415  -1.9581  -0.5575   0.9830   4.7523  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.08640    0.55750  -5.536 3.09e-08 ***
## width        0.14934    0.02084   7.166 7.73e-13 ***
## colordarker -0.01100    0.18041  -0.061   0.9514    
## colorlight   0.43636    0.17636   2.474   0.0133 *  
## colormedium  0.23668    0.11815   2.003   0.0452 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6

exp(0.14934) #width

## [1] 1.161068

exp(-0.01100) #darker color

## [1] 0.9890603

exp(0.43636) #light color

## [1] 1.547066

exp(0.23668) #medium color

## [1] 1.267036

An increase of 1 cm in a female’s carapace width is associated with multiplicative change of 1.161068 in the mean number satellites, holding color constant. Females with darker coloring tend to have 0.9890603 times as many satellites as dark females, light females tend to have 1.547066 times as many satellites as dark females, and medium females tend to have 1.267036 times as many satellites as dark females, holding width constant.

Check for outliers using a half-norm plot.

halfnorm(residuals(finalmod))

All points on the half-norm plot seem to be following the trend, suggesting that there are no outliers.

Estimate the dispersion parameter for the model produced by forward selection.

dp2 = sum(residuals(finalmod, type="pearson")^2 / finalmod$df.res)
dp2

## [1] 3.233625

Elephant mating (a) Create a histogram of MATINGS. Is there preliminary evidence that number of matings could be modeled as a Poisson response? Explain.

hist(eleph$MATINGS)

The histogram is skewed right, following a Poisson distribution, suggesting that Poisson regression may be appropriate.

Plot MATINGS by AGE. Add a least squares line. Is there evidence that modeling matings using a linear regression with age might not be appropriate? Explain.

plot(MATINGS ~ AGE, data = eleph)
abline(lm(MATINGS ~ AGE, data = eleph))

The data doesn’t seem to fit the least squares line very well, suggsting that linear regression might not be appropriate.

For each age, calculate the mean number of matings. Take the log of each mean and plot it by AGE. What assumption can be assessed with this plot? Is there evidence of a quadratic trend on this plot?

mmean = c(0, 1.5, 1, 1, 2, 3, 1.75, 5.5, 2.6666, 2, 1, 3, 4, 3.6, 3, 5, 7, 2, 9)
age = c(27, 28, 29, 30, 32, 33, 34, 36, 37, 38, 39, 41, 42, 43, 44, 45, 47, 48, 52)

plot(age, log(mmean))

There does not appear to be a curvilinear relationship between age and the log of the mean number of matings. This suggests that a quadratic term would not appropriate.

Fit a Poisson regression model with a linear term for AGE. Exponentiate and then interpret the coefficient for AGE.

pmod = glm(MATINGS ~ AGE, family = poisson, data = eleph)
summary(pmod)

## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

exp(0.06869)

## [1] 1.071104

An increase of 1 year in age is associated with multiplicative change of 1.071104 in the mean number of elephant matings.

Construct a 95% confidence interval for the slope and interpret in context (you may want to exponentiate endpoints)

exp(confint(pmod, level = 0.95))

## Waiting for profiling to be done...

##                 2.5 %    97.5 %
## (Intercept) 0.0694813 0.5892357
## AGE         1.0425585 1.1003602

An increase of 1 year in age is associated with multiplicative change between 1.0425585 and 1.10003602 in the mean number of elephant matings.

Are the number of matings significantly related to age? Test with a Wald test and a drop in deviance test.

summary(pmod) #Wald test

## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

mod.0 = glm(MATINGS ~ 1, family = poisson, data = eleph)
anova(mod.0, pmod, test = "Chisq") #drop in deviance test

## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        40     75.372                          
## 2        39     51.012  1    24.36 7.991e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for age in the Wald test is much less than 0.05, so the number of matings is significantly related to age. In the drop in deviance test, the p-value is very small so age should not be dropped from the model.

Add a quadratic term in AGE to determine whether there is a maximum age for the number of matings for elephants. Is a quadratic model preferred to a linear model? To investigate this question, use a Wald test and a drop in deviance test.

eleph = eleph %>% mutate(AGE2 = AGE*AGE) #new quadratic variable
pmod.quad = glm(MATINGS ~ AGE + AGE2, family = poisson, data = eleph) #model with quadratic term
summary(pmod.quad) #Wald test

## 
## Call:
## glm(formula = MATINGS ~ AGE + AGE2, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.0356383  -0.941    0.347
## AGE          0.1359544  0.1580095   0.860    0.390
## AGE2        -0.0008595  0.0020124  -0.427    0.669
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5

anova(pmod, pmod.quad, test = "Chisq") #drop in deviance

## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AGE2
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        39     51.012                     
## 2        38     50.826  1  0.18544   0.6667

The p-value for the quadratic age variable is > 0.05, suggesting that the linear model is better. The p-value for the drop in deviance test is also large, so we should drop the quadratic age term.

What can we say about the goodness-of-fit of the model with age as the sole predictor? Compare the residual deviance for the linear model to a chi-square distribution with the residual model degrees of freedom.

1-pchisq(pmod$deviance, pmod$df.residual)

## [1] 0.09426231

The p-value is pretty small, indicating a lack of fit.

Estimate the dispersion parameter and use it to redo f, g, h

dp3 = sum(residuals(pmod, type="pearson")^2 / pmod$df.res)
dp3

## [1] 1.157334

with dp3

summary(pmod, dispersion = dp3) #with dispersion parameter

## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.58590  -2.700  0.00693 ** 
## AGE          0.06869    0.01479   4.645  3.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1.157334)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

anova(mod.0, pmod, test = "Chisq", dispersion = dp3)

## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        40     75.372                          
## 2        39     51.012  1    24.36 4.478e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

with dp3

summary(pmod.quad, dispersion = dp3) #Wald test

## 
## Call:
## glm(formula = MATINGS ~ AGE + AGE2, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.2657237  -0.875    0.382
## AGE          0.1359544  0.1699858   0.800    0.424
## AGE2        -0.0008595  0.0021649  -0.397    0.691
## 
## (Dispersion parameter for poisson family taken to be 1.157334)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5

anova(pmod, pmod.quad, test = "Chisq", dispersion = dp3) #drop in deviance

## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AGE2
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        39     51.012                     
## 2        38     50.826  1  0.18544   0.6889

The p-value for the quadratic age variable is > 0.05, suggesting that the linear model is better. The p-value for the drop in deviance test is also large, so we should drop the quadratic age term.

with dp3

1-pchisq(dp3, pmod$df.residual)

## [1] 1

The p-value indicates a better fit than the model with the assumed deviance.