RMD 4 takes a deeper look into Poisson Regression and how to select models using this type of regression. We spent the week going through our chapter 4 homework.

crabs <- read.csv("http://www.cknudson.com/data/crabs.csv")
head(crabs)
##    color spine width satell weight y
## 1 medium   bad  28.3      8   3050 1
## 2   dark   bad  22.5      0   1550 0
## 3  light  good  26.0      9   2300 1
## 4   dark   bad  24.8      0   2100 0
## 5   dark   bad  26.0      4   2600 1
## 6 medium   bad  23.8      0   2100 0
summary(crabs)
##     color              spine               width          satell      
##  Length:173         Length:173         Min.   :21.0   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.:24.9   1st Qu.: 0.000  
##  Mode  :character   Mode  :character   Median :26.1   Median : 2.000  
##                                        Mean   :26.3   Mean   : 2.919  
##                                        3rd Qu.:27.7   3rd Qu.: 5.000  
##                                        Max.   :33.5   Max.   :15.000  
##      weight           y         
##  Min.   :1200   Min.   :0.0000  
##  1st Qu.:2000   1st Qu.:0.0000  
##  Median :2350   Median :1.0000  
##  Mean   :2437   Mean   :0.6416  
##  3rd Qu.:2850   3rd Qu.:1.0000  
##  Max.   :5200   Max.   :1.0000

A good first step in model selection is to use histograms, boxplots, plots, etc. to analyze the data. In this case, color and spine have categorical variables that aren’t in order. By using factor we can re order the levels.

crabs$color <- factor(crabs$color, levels = c("darker","dark","medium","light"))
crabs$spine <- factor(crabs$spine, levels = c("bad","middle","good"))

In this datset we are looking for the number of satellite crabs sorrounding female crabs. Lets look at the data.

After we have looked a few plots and have a good feel for the data, we can create preliminary models. Note: Why are we using poisson regression instead of linear regression? We are using poisson because we are looking for the number of satellite crabs sorrounding female crabs. Since this response is a count, PR would be ideal.

PmodC <- glm(satell~color, data = crabs, family = poisson)
coef(PmodC) #Lets interpret our model with color
## (Intercept)   colordark colormedium  colorlight 
##  0.71562004  0.08515781  0.47670626  0.69129361
colorMeans <- round(exp(coef(PmodC)), digits = 4) #This takes the exponent of our coefficents, so we can interpret our predictor
colorMeans
## (Intercept)   colordark colormedium  colorlight 
##      2.0455      1.0889      1.6108      1.9963

Lets interpret: The darkest crabs have 2.0455 satellite crabs on average. Dark crabs have 3.1344 satellites, on average. Medium colored satellites have 3.6563 on average. Light crabs have 4.0418 satellites on average.

PmodS <- glm(satell~spine, data = crabs, family = poisson)
spineMeans <- round(exp(coef(PmodS)), digits = 4)

PmodW <- glm(satell~width, data = crabs, family = poisson)
widthMeans <- round(exp(coef(PmodW)), digits = 4)

Interpretation of PmodW: A width increase of 1 unit is associated with a multiplicative increase of exp(coef(PmodW)[2]) in the number of satellites the crab has.

One way to compare models is through using AIC:

AIC(PmodC)
## [1] 972.4368
AIC(PmodS)
## [1] 982.4582
AIC(PmodW)
## [1] 927.1762

Our lowest AIC is our width model, so lets start there. We can use forward selection and a drop in deviance test from there to test other predictors.

mod1 <- glm(satell~width+color, data = crabs, family = poisson)
mod2 <- glm(satell~width+spine, data = crabs, family = poisson)

anova(mod1, PmodW, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width + color
## Model 2: satell ~ width
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       168     559.34                       
## 2       171     567.88 -3  -8.5338  0.03618 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod2, PmodW, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width + spine
## Model 2: satell ~ width
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       169     566.60                     
## 2       171     567.88 -2  -1.2737    0.529

Look at the p values of the anova tests. Adding spine does not significantly improve the model while adding color does (p value = .03618). We can now conclude that our best model is the one with color and width. Consider adding interaction terms, squared and square root predictors, etc.

One way to check the deviance in our model is to use a half norm plot:

library(faraway)
halfnorm(residuals(mod1))

We have 2 outliers which are likely driving up deviance. This means there’s something about the number of satellite crabs we can’t explain with our predictors.

A good last step is to calculate the dispersion parameter for your model.

dp <- sum(residuals(mod1, type = "pearson")^2/mod1$df.residual)

sqrt(dp) #Interpret by multiplying the standard errors by 1.817. Overdispersion does not look too bad.
## [1] 1.798228