Homework: https://www.dropbox.com/s/vaktl73l6tknvqe/PoissonHW.pdf?dl=0

crabdat <-read.csv("http://www.cknudson.com/data/crabs.csv")

1A

hist(crabdat$satell)

There tend to be fewer satellite crabs surrounding females. Looks approximately Poisson distributed

1B

boxplot(crabdat$satell~crabdat$color)

Dark and darker colored female crabs tend to have fewer satellite crabs than those who are light or medium color

1C

boxplot(crabdat$satell~crabdat$spine)

Female crabs with good spines are more likely to have many satellites than bad or middle spines; however, bad spines have more variability in the number of satellites than the other two.

1D

plot(crabdat$satell~crabdat$width)

plot(log(crabdat$satell)~crabdat$width)

There appears to be no relationship between the log of satellites and width

1E

The satellite histogram looks very Poisson with it being strictly non-negative and a heavy right tail. Fitting with the log response would allow us to fit a line better without consistently over or under predicting.

1F

I think color will be the strongest because there seems to be the most differences in the medians of satellites based on the colors.

2A.

colormod<- glm(satell~color, family = poisson, data = crabdat)
summary(colormod)

## 
## Call:
## glm(formula = satell ~ color, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8577  -2.1106  -0.1649   0.8721   4.7491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.80078    0.10102   7.927 2.24e-15 ***
## colordarker -0.08516    0.18007  -0.473 0.636279    
## colorlight   0.60614    0.17496   3.464 0.000532 ***
## colormedium  0.39155    0.11575   3.383 0.000718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 609.14  on 169  degrees of freedom
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 6

2B.

spinemod<-glm(satell~spine, family =poisson, data= crabdat)
summary(spinemod)

## 
## Call:
## glm(formula = satell ~ spine, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7014  -2.3706  -0.5097   1.1252   5.0859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.03316    0.05423  19.051   <2e-16 ***
## spinegood    0.26120    0.10173   2.568   0.0102 *  
## spinemiddle -0.34001    0.19045  -1.785   0.0742 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 621.16  on 170  degrees of freedom
## AIC: 982.46
## 
## Number of Fisher Scoring iterations: 5

2C.

widthmod<- glm(satell~width, family = poisson, data = crabdat)
summary(widthmod)

## 
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

widthmod has the lowest residual deviance of the three models. colormod is the only model to have a coeffiecient that is not significant at any level. widthmod also has all of its cofficients being significant at the 0 level(***).

3.

widthmod has the lowest AIC of the three models which does not line up with my prediction from problem 1F. It does match my analysis in 2.

4.

widthdp<- sum(residuals(widthmod, type="pearson")^2 / widthmod$df.res)
widthdp

## [1] 3.182205

The dispersion parameter is not 1, meaning that this is not a perfect Poisson model.

plot(residuals(widthmod)~predict(widthmod,type="link"),xlab="log lambda", ylab="deviance residuals")

The residuals are more positive than they are negative, so that means we are underpredicting the log number of satellites

library(faraway)
halfnorm(residuals(widthmod))

We seem to have at least 4 observations that are definitely outliers on our model.

5.

Forward selection based on an AIC drop of at least 10

basemod<- glm(satell~width, data= crabdat, family = poisson)

mod1<-glm(satell~width + color, data= crabdat, family = poisson)
mod2<-glm(satell~width + spine, data= crabdat, family = poisson)
mod3<-glm(satell~width + weight, data= crabdat, family = poisson)

summary(basemod)

## 
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

summary(mod1)

## 
## Call:
## glm(formula = satell ~ width + color, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0415  -1.9581  -0.5575   0.9830   4.7523  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.08640    0.55750  -5.536 3.09e-08 ***
## width        0.14934    0.02084   7.166 7.73e-13 ***
## colordarker -0.01100    0.18041  -0.061   0.9514    
## colorlight   0.43636    0.17636   2.474   0.0133 *  
## colormedium  0.23668    0.11815   2.003   0.0452 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6

summary(mod2)

## 
## Call:
## glm(formula = satell ~ width + spine, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9509  -1.9740  -0.4963   1.0832   4.7173  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.12469    0.56497  -5.531 3.19e-08 ***
## width        0.15668    0.02096   7.476 7.69e-14 ***
## spinegood    0.09899    0.10490   0.944    0.345    
## spinemiddle -0.10033    0.19309  -0.520    0.603    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 566.60  on 169  degrees of freedom
## AIC: 929.9
## 
## Number of Fisher Scoring iterations: 6

summary(mod3)

## 
## Call:
## glm(formula = satell ~ width + weight, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9309  -1.9702  -0.5478   0.9700   4.9904  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.2952111  0.8988960  -1.441  0.14962   
## width        0.0460765  0.0467497   0.986  0.32433   
## weight       0.0004470  0.0001586   2.818  0.00483 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.90  on 170  degrees of freedom
## AIC: 921.2
## 
## Number of Fisher Scoring iterations: 6

None of these models have a sufficient drop in AIC so we will select the current basemod as the final model Final model: glm(satell~ width, family = poisson, data=crabdat)

Forward selection: Drop in Deviance (requiring significance at the alpha = 0.05 level) Pretend mean=var for now

#anova(mod1,mod2, test ="Chisq")
anova(basemod,mod1, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width
## Model 2: satell ~ width + color
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       171     567.88                       
## 2       168     559.34  3   8.5338  0.03618 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(basemod,mod2, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width
## Model 2: satell ~ width + spine
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       171     567.88                     
## 2       169     566.60  2   1.2737    0.529

anova(basemod,mod3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width
## Model 2: satell ~ width + weight
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       171     567.88                        
## 2       170     559.90  1    7.978 0.004735 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Mod3 has the most significant drop in deviance so we use that as our new basemod (basemod2)

basemod2<-mod3
modA<-glm(satell~width + weight + color, data= crabdat, family = poisson)
modB<-glm(satell~width + weight+ spine, data= crabdat, family = poisson)

anova(basemod2,modA,test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + color
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       170     559.90                       
## 2       167     551.38  3   8.5203   0.0364 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(basemod2,modB,test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + spine
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       170     559.90                     
## 2       168     558.83  2   1.0745   0.5844

Model A still shows a significant drop in deviance so it becomes our new basemod (basemod3)

basemod3<- modA
modC<-glm(satell~width + weight + color+spine, data= crabdat, family = poisson)
anova(basemod3,modC,test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: satell ~ width + weight + color
## Model 2: satell ~ width + weight + color + spine
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       167     551.38                     
## 2       165     549.59  2   1.7947   0.4076

Model C does not have a significant drop in deviance, so we will select basemod3 as our final model. Final model: glm(satell~width + weight + color, data= crabdat, family = poisson)

6.

halfnorm(residuals(basemod3))

The residuals seem to follow on trend, the biggest residuals do not seem to deviate from the trend by much, although 15 and 149 show up as outliers.

7.

dp <- sum(residuals(basemod3, type="pearson")^2 / basemod3$df.res) 
dp

## [1] 3.204836

The dispersion parameter is 3.204836, so this is not a perfect poisson since mean=/=var

Elephant Mating

elephant<-read.csv("http://www.cknudson.com/data/elephant.csv")
attach(elephant)

A

hist(MATINGS)

There appears to be evidence that MATINGS could be modeled as a Poisson response since the values are non-negative and heavily skewed to the right #B

plot(MATINGS~AGE, main="Elephant Matings by Age")
modelage<-lm(MATINGS~AGE)
abline(modelage)

plot(modelage)

This does not appear to be a good model. There are several outliers among the residuals and there does not appear to be equal variance for the residuals #C

meanmate<-aggregate(MATINGS~AGE,elephant,mean)
plot(log(meanmate$MATINGS),meanmate$AGE, main="Log of average mating by age", xlab="Age",ylab="Log(Average Number of Mating)")

The data now looks like we could fit a line better to this data. This means that we could try to use Poisson Regression for this data. I do not see evidence of a quadratic trend in this plot

D.

model1<-glm(MATINGS~AGE, family = poisson, data= elephant)
summary(model1)

## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = elephant)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

exp(coef(model1)[2])

##      AGE 
## 1.071107

For each increase of 1 year in age, we expect matings to increase by a scale of 1.07 ( a multiplicative change)

E.

exp(confint(model1, level =0.95)[2,])

## Waiting for profiling to be done...

##    2.5 %   97.5 % 
## 1.042558 1.100360

We are 95% confident that the true multiplicative impact of age on average number of matings per year is between 1.04 and 1.1

F.

Wald test<- Yes it is significant since the pvalue for the test stat for age is 5.81*10^-7 which is basically 0

summary(model1)

## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = elephant)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5

Drop in Dev (assuming mean = var, which it doesn't):

model0<-glm(MATINGS~1, family=poisson, data=elephant)
anova(model0,model1, test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        40     75.372                          
## 2        39     51.012  1    24.36 7.991e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is a significant drop in deviance when we add AGE, so we should keep age in the model.

G

Wald Test: pval for both test stats are nowhere near significant. Adding Age^2 makes our model much worse, so we should not include it.

AgeSq<-elephant$AGE^2
quadmod<- glm(MATINGS~AGE+AgeSq, data = elephant, family = poisson)
summary(quadmod)

## 
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = elephant)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.0356383  -0.941    0.347
## AGE          0.1359544  0.1580095   0.860    0.390
## AgeSq       -0.0008595  0.0020124  -0.427    0.669
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5

Drop in dev (assuming mean=var)

anova(model1,quadmod, test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AgeSq
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        39     51.012                     
## 2        38     50.826  1  0.18544   0.6667

Again, pval is not significant so we should not keep age^2 in the model

H.

teststat<-sum(residuals(model1)^2)
df<-41-3 #n-(k+1)
pchisq(teststat, df, lower.tail = FALSE)

## [1] 0.07717098

The pval is 0.07717098 which is not significant at the 0.05 level so we assume that this model is predicting well enough The pval is on the edge of being significant (it would be at the alpha =0.1 level), and this would mean that the model is not predicting well

F(again)

dp<-sum(residuals(model1, type="pearson")^2 / model1$df.res) 
dp
summary(model1, dispersion=dp)
summary(model1)

The inclusion of the dispersion parameter lowered the pval of Age in the model, although it is still significant Drop of Dev

DropinD<-deviance(model0)-deviance(model1)
teststat2<- DropinD/(dp*1)
pf(teststat2,df1=1,df2=38,lower.tail=FALSE)

## [1] 0.008914323

The pval of this test stat is 4.767*10^-5, it is a little bigger than the Drop in dev pval w/o the dispersion parameter (which is what we expect), but it is still significant so we should keep age as a variable

G (again)

dp2<-sum(residuals(quadmod, type="pearson")^2 / quadmod$df.res) 
dp2

## [1] 1.175194

summary(quadmod, dispersion=dp2)

## 
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = elephant)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.2908256  -0.868    0.385
## AGE          0.1359544  0.1712924   0.794    0.427
## AgeSq       -0.0008595  0.0021816  -0.394    0.694
## 
## (Dispersion parameter for poisson family taken to be 1.175194)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5

According to the Wald Test, even after accounting for the dispersion, the Age^2 variable does not have a significant pvalue and should be left out of the model Drop in dev

DropinD2<-deviance(model1)-deviance(quadmod)
teststat3<- DropinD2/(dp*1)
pf(teststat3,df1=1,df2=37,lower.tail=FALSE)

## [1] 0.8112323

Since the pval > 0.05, there is not a significant drop in deviance by adding age^2, so we should not keep it in the model

H (again)

The GoF test does not seem to depend on the dispersion parameter at all, so the result would be the same as the original part H.

Poisson Regression HW

Cheyanne Simpson

3/4/2021

1A

1B

1C

1D

1E

1F

2A.

2B.

2C.

3.

4.

5.

6.

7.

Elephant Mating

A

D.

E.

F.

G

H.

F(again)

G (again)

H (again)