Satellite Crabs

crabdat<- read.csv("http://www.cknudson.com/data/crabs.csv")
library(faraway)
attach(crabdat)
hist(satell)

The histogram appears to be right skewed and only values greater than or equal to zero. This data visually looks like it might fit a Poisson distribution.

boxplot(satell~color)

This may come into play as an important variable when it comes to modeling because it visually appears that on average lighter crabs have more satellites and darker crabs have less.

boxplot(satell~spine)

This spine variable might not be useful as all the means fall within each other CIs. If I had to guess I would say this will not be a significant variable when we make a model.

logsat <- log(satell)
plot(logsat~width)

This width variable might not be all that great in modeling as there is no explicit pattern. It does look like the variables might be positively related however.

I think a Poisson regression is appropriate because of the range of our response type and the shape of the distribution, which is shown pretty explicitly in the histogram. It also appears that variables like color and width may affect the number of satellites.

I think the lighter colors will have the strongest relationship with number of satellites. Width is our second guess but the log satellite plot is not entirely reassuring.

modcolor <- glm(satell ~ color, family = poisson)
summary(modcolor)
## 
## Call:
## glm(formula = satell ~ color, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8577  -2.1106  -0.1649   0.8721   4.7491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.80078    0.10102   7.927 2.24e-15 ***
## colordarker -0.08516    0.18007  -0.473 0.636279    
## colorlight   0.60614    0.17496   3.464 0.000532 ***
## colormedium  0.39155    0.11575   3.383 0.000718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 609.14  on 169  degrees of freedom
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 6
modspine <- glm(satell ~ spine, family = poisson)
summary(modspine)
## 
## Call:
## glm(formula = satell ~ spine, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7014  -2.3706  -0.5097   1.1252   5.0859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.03316    0.05423  19.051   <2e-16 ***
## spinegood    0.26120    0.10173   2.568   0.0102 *  
## spinemiddle -0.34001    0.19045  -1.785   0.0742 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 621.16  on 170  degrees of freedom
## AIC: 982.46
## 
## Number of Fisher Scoring iterations: 5
modwidth <- glm(satell ~ width, family = poisson)
summary(modwidth)
## 
## Call:
## glm(formula = satell ~ width, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

A classification of a female crab into the category of ‘colorlight’ is associated with a 0.606 increase in the number of satellite crabs. Etc…

A classification of a female crab into the category of ‘spinegood’ is associated with a .261 increase in the number of satellite crabs. Etc…

An increase of one cm in the female crabs carapace width is associated with a .164 increase in the number of satellite crabs.

AIC(modcolor)
## [1] 972.4368
AIC(modspine)
## [1] 982.4582
AIC(modwidth)
## [1] 927.1762

My prediction was a little off. I thought color might play a larger role than width but I was wrong. The variable Width is the best predictor of these three options.

plot(modwidth, which = c(1,2))

The residual vs predicted plot and the qq-plot show that we do have few outliers and maybe some missing predictors that would help the fit of the model as the red line of the residual plot looks slightly banana shaped.

dispsat <- sum(residuals(modwidth, type ="pearson")^2 / modwidth$df.residual)
dispsat
## [1] 3.182205

This can be interpreted as our variance is about 3.18 times larger than the mean. This is not ideal. We want our dispersion to be closer to 1.

modbase <- glm(satell ~ 1, family = poisson)
modweight <- glm(satell ~ weight, family = poisson)

AIC(modbase)
## [1] 990.0893
AIC(modcolor)
## [1] 972.4368
AIC(modspine)
## [1] 982.4582
AIC(modwidth)
## [1] 927.1762
AIC(modweight)
## [1] 920.1641

The model with weight as a predictor is the best in terms of AIC so we move forward with modweight.

modWtCl <- glm(satell ~ weight + color, family = poisson)
modWtSp <- glm(satell ~ weight + spine, family = poisson)
modWtWd <- glm(satell ~ weight + width, family = poisson)

AIC(modWtCl)
## [1] 917.1026
AIC(modWtSp)
## [1] 922.7853
AIC(modWtWd)
## [1] 921.1983

None of these models AIC’s decrease by over 10 so we should stop forward-selection

Let’s see if the drop-in-deviance test agrees with this conclusion.

droptest <- deviance(modweight) - deviance(modWtCl)
pchisq(droptest, df = 3, lower.tail =  FALSE)
## [1] 0.02848457

This p-value is significant however so we should really continue our forward-selection despite the AIC test.

modWtClSp <- glm(satell ~ weight + color + spine, family = poisson)
modWtSpWd <- glm(satell ~ weight + spine + width, family = poisson)

AIC(modWtClSp)
## [1] 919.0001
AIC(modWtSpWd)
## [1] 924.1238

Again the AIC forward-selection tells us to stop but lets make sure the drop-in-deviance agrees.

droptest <- deviance(modWtCl) - deviance(modWtClSp)
pchisq(droptest, df = 2, lower.tail =  FALSE)
## [1] 0.3495059

Here the drop-in-deviance confirms our forward-selection stoppage.

summary(modWtCl)
## 
## Call:
## glm(formula = satell ~ weight + color, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9833  -1.9272  -0.5553   0.8646   4.8270  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.996e-01  1.959e-01  -2.551   0.0108 *  
## weight       5.462e-04  6.811e-05   8.019 1.07e-15 ***
## colordarker -2.253e-03  1.805e-01  -0.012   0.9900    
## colorlight   4.498e-01  1.757e-01   2.560   0.0105 *  
## colormedium  2.447e-01  1.177e-01   2.079   0.0376 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 551.80  on 168  degrees of freedom
## AIC: 917.1
## 
## Number of Fisher Scoring iterations: 6

An increase of one gram of female crab weight is associated with a 5.462*10^-4 increase in the number of satellite crabs.

A female crab identified as ‘darker’ is associated with a decrease of 2.253*10^-3 in the number of satellite crabs.

Etc…

The key for the interpretation above of this model is the Estimate column in the Coefficients portion of the summary of the model.

halfnorm(residuals(modWtCl))

The top right of the halfnorm plot shows that our outliers are still don’t fit very well in our model which is something we could address if we had more time.

finaldisp <- sum(residuals(modWtCl, type ="pearson")^2 / modWtCl$df.residual)

This dispersion parameter is very close to our first calculated dispersion but it is slightly larger. Regardless, our variance is still more than 3 times greater than our mean.

Elephant Mating

eleph <- read.csv("http://www.cknudson.com/data/elephant.csv")
attach(eleph)
hist(MATINGS)

There is evidence because the distribution is right-skewed and we have values for MATINGS that are only greater than or equal to zero.

linearmod <- lm(MATINGS ~ AGE)
plot(MATINGS~AGE)
abline(linearmod)

plot(linearmod, which = 1)

It looks like there are quite a few outliers visible from these graphs and some heteroscedasticity in the residual plot. It also implies that 20 year old elephants have a negative number of matings per year which doesn’t make sense.

library(plyr)
meandata <- ddply(eleph, "AGE", summarise, mean= mean(MATINGS))
logmeans <- log(meandata$mean)
plot(meandata$AGE, log(meandata$mean), main = "Log average mating by age" , xlab = "Age", ylab = "Log Average")

It appears the relationship between the log mean of mating occurrences and age is linear. As an elephant increases in age the more mating sessions they participate in. This satisfies the linearity assumption.

I would say there is little to no quadratic trend on this graph. There is a slight banana curve up shape to the plot but it looks much more linear than quadratic.

Pmod <- glm(MATINGS ~ AGE, family = poisson)
summary(Pmod)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
Pmod$coef[2]
##        AGE 
## 0.06869281
exp(.06869281)
## [1] 1.071107

A one increase in age of an elephant is associated with a 7.11% increase in the number of MATINGS.

exp(1)^confint(Pmod, 'AGE', level=.95)
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 1.042558 1.100360

We are 95% certain that the multiplicative factor of an increase of AGE of one year will be between these two values.

summary(Pmod)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
basemod <- glm(MATINGS ~ 1, family = poisson)
anova(basemod, Pmod, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        40     75.372                          
## 2        39     51.012  1    24.36 7.991e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Wald Test and drop-in-deviance test have significant p-values which both mean that the variable AGE is important to the model and should not be dropped in favor of the basemod.

AGEsq <- AGE^2
Pmodsq <- glm(MATINGS ~ AGE + AGEsq, family = poisson)
summary(Pmodsq)
## 
## Call:
## glm(formula = MATINGS ~ AGE + AGEsq, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.0356383  -0.941    0.347
## AGE          0.1359544  0.1580095   0.860    0.390
## AGEsq       -0.0008595  0.0020124  -0.427    0.669
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5
anova(Pmod, Pmodsq, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AGEsq
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        39     51.012                     
## 2        38     50.826  1  0.18544   0.6667

Both the Wald Test and drop-in-deviance test tell me that the additional variable of AGE squared is not significantly useful. The linear (or simpler) model is preferred in this scenario.

pchisq(Pmod$deviance, Pmod$df.residual, lower.tail = FALSE)
## [1] 0.09426231

This p-value that is greater than our alpha of .05 shows that we do not have evidence of lack-of-fit.

Now we will repeat the last three steps with the dispersion accounted for…

disp <- sum(residuals(Pmod, type ="pearson")^2 / Pmod$df.residual)
disp
## [1] 1.157334

This implies that our variance is 1.15 times larger than our mean.

summary(Pmod, dispersion = disp)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.58590  -2.700  0.00693 ** 
## AGE          0.06869    0.01479   4.645  3.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1.157334)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
dropinD <- deviance(basemod) - deviance(Pmod)
teststat <- dropinD/(disp*1)
pf(teststat, df1 = 1, 38, lower.tail =  FALSE)
## [1] 4.767203e-05

The Wald Test and drop-in-deviance test again confirm that AGE is still a significant variable to include even after accounting for dispersion.

summary(Pmodsq)
## 
## Call:
## glm(formula = MATINGS ~ AGE + AGEsq, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.0356383  -0.941    0.347
## AGE          0.1359544  0.1580095   0.860    0.390
## AGEsq       -0.0008595  0.0020124  -0.427    0.669
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5
disp2 <- sum(residuals(Pmodsq, type ="pearson")^2 / Pmodsq$df.residual)

dropinD <- deviance(Pmod) - deviance(Pmodsq)
teststat2 <- dropinD/(disp2*1)
pf(teststat2, df1 = 1, 38, lower.tail =  FALSE)
## [1] 0.6934135

The Wald Test and drop-in-deviance test agree here that the addition of AGE squared is not significant to the fit of the model and it can therefore be dropped. This was the same result when we didn’t account for dispersion.

summary(Pmod, dispersion = disp)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.58590  -2.700  0.00693 ** 
## AGE          0.06869    0.01479   4.645  3.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1.157334)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
pchisq(51.012, 39, lower.tail = FALSE)
## [1] 0.09425638

This p-value that is still greater than our alpha of .05, despite dispersion, so that means we do not have evidence of lack-of-fit.