Homework: https://www.dropbox.com/s/vaktl73l6tknvqe/PoissonHW.pdf?dl=0
crabdat <-read.csv("http://www.cknudson.com/data/crabs.csv")
hist(crabdat$satell)
There tend to be fewer satellite crabs surrounding females. Looks approximately Poisson distributed
boxplot(crabdat$satell~crabdat$color)
Dark and darker colored female crabs tend to have fewer satellite crabs than those who are light or medium color
boxplot(crabdat$satell~crabdat$spine)
Female crabs with good spines are more likely to have many satellites than bad or middle spines; however, bad spines have more variability in the number of satellites than the other two.
plot(crabdat$satell~crabdat$width)
plot(log(crabdat$satell)~crabdat$width)
There appears to be no relationship between the log of satellites and width
The satellite histogram looks very Poisson with it being strictly non-negative and a heavy right tail. Fitting with the log response would allow us to fit a line better without consistently over or under predicting.
I think color will be the strongest because there seems to be the most differences in the medians of satellites based on the colors.
colormod<- glm(satell~color, family = poisson, data = crabdat)
summary(colormod)
##
## Call:
## glm(formula = satell ~ color, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8577 -2.1106 -0.1649 0.8721 4.7491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80078 0.10102 7.927 2.24e-15 ***
## colordarker -0.08516 0.18007 -0.473 0.636279
## colorlight 0.60614 0.17496 3.464 0.000532 ***
## colormedium 0.39155 0.11575 3.383 0.000718 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 6
spinemod<-glm(satell~spine, family =poisson, data= crabdat)
summary(spinemod)
##
## Call:
## glm(formula = satell ~ spine, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7014 -2.3706 -0.5097 1.1252 5.0859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03316 0.05423 19.051 <2e-16 ***
## spinegood 0.26120 0.10173 2.568 0.0102 *
## spinemiddle -0.34001 0.19045 -1.785 0.0742 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 621.16 on 170 degrees of freedom
## AIC: 982.46
##
## Number of Fisher Scoring iterations: 5
widthmod<- glm(satell~width, family = poisson, data = crabdat)
summary(widthmod)
##
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
widthmod has the lowest residual deviance of the three models. colormod is the only model to have a coeffiecient that is not significant at any level. widthmod also has all of its cofficients being significant at the 0 level(***).
widthmod has the lowest AIC of the three models which does not line up with my prediction from problem 1F. It does match my analysis in 2.
widthdp<- sum(residuals(widthmod, type="pearson")^2 / widthmod$df.res)
widthdp
## [1] 3.182205
The dispersion parameter is not 1, meaning that this is not a perfect Poisson model.
plot(residuals(widthmod)~predict(widthmod,type="link"),xlab="log lambda", ylab="deviance residuals")
The residuals are more positive than they are negative, so that means we are underpredicting the log number of satellites
library(faraway)
halfnorm(residuals(widthmod))
We seem to have at least 4 observations that are definitely outliers on our model.
Forward selection based on an AIC drop of at least 10
basemod<- glm(satell~width, data= crabdat, family = poisson)
mod1<-glm(satell~width + color, data= crabdat, family = poisson)
mod2<-glm(satell~width + spine, data= crabdat, family = poisson)
mod3<-glm(satell~width + weight, data= crabdat, family = poisson)
summary(basemod)
##
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
summary(mod1)
##
## Call:
## glm(formula = satell ~ width + color, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0415 -1.9581 -0.5575 0.9830 4.7523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.08640 0.55750 -5.536 3.09e-08 ***
## width 0.14934 0.02084 7.166 7.73e-13 ***
## colordarker -0.01100 0.18041 -0.061 0.9514
## colorlight 0.43636 0.17636 2.474 0.0133 *
## colormedium 0.23668 0.11815 2.003 0.0452 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
summary(mod2)
##
## Call:
## glm(formula = satell ~ width + spine, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9509 -1.9740 -0.4963 1.0832 4.7173
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.12469 0.56497 -5.531 3.19e-08 ***
## width 0.15668 0.02096 7.476 7.69e-14 ***
## spinegood 0.09899 0.10490 0.944 0.345
## spinemiddle -0.10033 0.19309 -0.520 0.603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 566.60 on 169 degrees of freedom
## AIC: 929.9
##
## Number of Fisher Scoring iterations: 6
summary(mod3)
##
## Call:
## glm(formula = satell ~ width + weight, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9309 -1.9702 -0.5478 0.9700 4.9904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2952111 0.8988960 -1.441 0.14962
## width 0.0460765 0.0467497 0.986 0.32433
## weight 0.0004470 0.0001586 2.818 0.00483 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.90 on 170 degrees of freedom
## AIC: 921.2
##
## Number of Fisher Scoring iterations: 6
None of these models have a sufficient drop in AIC so we will select the current basemod as the final model Final model: glm(satell~ width, family = poisson, data=crabdat)
Forward selection: Drop in Deviance (requiring significance at the alpha = 0.05 level) Pretend mean=var for now
#anova(mod1,mod2, test ="Chisq")
anova(basemod,mod1, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width
## Model 2: satell ~ width + color
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 168 559.34 3 8.5338 0.03618 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(basemod,mod2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width
## Model 2: satell ~ width + spine
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 169 566.60 2 1.2737 0.529
anova(basemod,mod3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width
## Model 2: satell ~ width + weight
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 170 559.90 1 7.978 0.004735 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Mod3 has the most significant drop in deviance so we use that as our new basemod (basemod2)
basemod2<-mod3
modA<-glm(satell~width + weight + color, data= crabdat, family = poisson)
modB<-glm(satell~width + weight+ spine, data= crabdat, family = poisson)
anova(basemod2,modA,test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + color
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 170 559.90
## 2 167 551.38 3 8.5203 0.0364 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(basemod2,modB,test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + spine
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 170 559.90
## 2 168 558.83 2 1.0745 0.5844
Model A still shows a significant drop in deviance so it becomes our new basemod (basemod3)
basemod3<- modA
modC<-glm(satell~width + weight + color+spine, data= crabdat, family = poisson)
anova(basemod3,modC,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + weight + color
## Model 2: satell ~ width + weight + color + spine
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 167 551.38
## 2 165 549.59 2 1.7947 0.4076
Model C does not have a significant drop in deviance, so we will select basemod3 as our final model. Final model: glm(satell~width + weight + color, data= crabdat, family = poisson)
halfnorm(residuals(basemod3))
The residuals seem to follow on trend, the biggest residuals do not seem to deviate from the trend by much, although 15 and 149 show up as outliers.
dp <- sum(residuals(basemod3, type="pearson")^2 / basemod3$df.res)
dp
## [1] 3.204836
The dispersion parameter is 3.204836, so this is not a perfect poisson since mean=/=var
elephant<-read.csv("http://www.cknudson.com/data/elephant.csv")
attach(elephant)
hist(MATINGS)
There appears to be evidence that MATINGS could be modeled as a Poisson response since the values are non-negative and heavily skewed to the right #B
plot(MATINGS~AGE, main="Elephant Matings by Age")
modelage<-lm(MATINGS~AGE)
abline(modelage)
plot(modelage)
This does not appear to be a good model. There are several outliers among the residuals and there does not appear to be equal variance for the residuals #C
meanmate<-aggregate(MATINGS~AGE,elephant,mean)
plot(log(meanmate$MATINGS),meanmate$AGE, main="Log of average mating by age", xlab="Age",ylab="Log(Average Number of Mating)")
The data now looks like we could fit a line better to this data. This means that we could try to use Poisson Regression for this data. I do not see evidence of a quadratic trend in this plot
model1<-glm(MATINGS~AGE, family = poisson, data= elephant)
summary(model1)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = elephant)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
exp(coef(model1)[2])
## AGE
## 1.071107
For each increase of 1 year in age, we expect matings to increase by a scale of 1.07 ( a multiplicative change)
exp(confint(model1, level =0.95)[2,])
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 1.042558 1.100360
We are 95% confident that the true multiplicative impact of age on average number of matings per year is between 1.04 and 1.1
Wald test<- Yes it is significant since the pvalue for the test stat for age is 5.81*10^-7 which is basically 0
summary(model1)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = elephant)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
Drop in Dev (assuming mean = var, which it doesn't):
model0<-glm(MATINGS~1, family=poisson, data=elephant)
anova(model0,model1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 40 75.372
## 2 39 51.012 1 24.36 7.991e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a significant drop in deviance when we add AGE, so we should keep age in the model.
Wald Test: pval for both test stats are nowhere near significant. Adding Age^2 makes our model much worse, so we should not include it.
AgeSq<-elephant$AGE^2
quadmod<- glm(MATINGS~AGE+AgeSq, data = elephant, family = poisson)
summary(quadmod)
##
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = elephant)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.0356383 -0.941 0.347
## AGE 0.1359544 0.1580095 0.860 0.390
## AgeSq -0.0008595 0.0020124 -0.427 0.669
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
Drop in dev (assuming mean=var)
anova(model1,quadmod, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AgeSq
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 39 51.012
## 2 38 50.826 1 0.18544 0.6667
Again, pval is not significant so we should not keep age^2 in the model
teststat<-sum(residuals(model1)^2)
df<-41-3 #n-(k+1)
pchisq(teststat, df, lower.tail = FALSE)
## [1] 0.07717098
The pval is 0.07717098 which is not significant at the 0.05 level so we assume that this model is predicting well enough The pval is on the edge of being significant (it would be at the alpha =0.1 level), and this would mean that the model is not predicting well
dp<-sum(residuals(model1, type="pearson")^2 / model1$df.res)
dp
summary(model1, dispersion=dp)
summary(model1)
The inclusion of the dispersion parameter lowered the pval of Age in the model, although it is still significant Drop of Dev
DropinD<-deviance(model0)-deviance(model1)
teststat2<- DropinD/(dp*1)
pf(teststat2,df1=1,df2=38,lower.tail=FALSE)
## [1] 0.008914323
The pval of this test stat is 4.767*10^-5, it is a little bigger than the Drop in dev pval w/o the dispersion parameter (which is what we expect), but it is still significant so we should keep age as a variable
dp2<-sum(residuals(quadmod, type="pearson")^2 / quadmod$df.res)
dp2
## [1] 1.175194
summary(quadmod, dispersion=dp2)
##
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = elephant)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.2908256 -0.868 0.385
## AGE 0.1359544 0.1712924 0.794 0.427
## AgeSq -0.0008595 0.0021816 -0.394 0.694
##
## (Dispersion parameter for poisson family taken to be 1.175194)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
According to the Wald Test, even after accounting for the dispersion, the Age^2 variable does not have a significant pvalue and should be left out of the model Drop in dev
DropinD2<-deviance(model1)-deviance(quadmod)
teststat3<- DropinD2/(dp*1)
pf(teststat3,df1=1,df2=37,lower.tail=FALSE)
## [1] 0.8112323
Since the pval > 0.05, there is not a significant drop in deviance by adding age^2, so we should not keep it in the model
The GoF test does not seem to depend on the dispersion parameter at all, so the result would be the same as the original part H.