RMD 4 takes a deeper look into Poisson Regression and how to select models using this type of regression. We spent the week going through our chapter 4 homework.
crabs <- read.csv("http://www.cknudson.com/data/crabs.csv")
head(crabs)
## color spine width satell weight y
## 1 medium bad 28.3 8 3050 1
## 2 dark bad 22.5 0 1550 0
## 3 light good 26.0 9 2300 1
## 4 dark bad 24.8 0 2100 0
## 5 dark bad 26.0 4 2600 1
## 6 medium bad 23.8 0 2100 0
summary(crabs)
## color spine width satell
## Length:173 Length:173 Min. :21.0 Min. : 0.000
## Class :character Class :character 1st Qu.:24.9 1st Qu.: 0.000
## Mode :character Mode :character Median :26.1 Median : 2.000
## Mean :26.3 Mean : 2.919
## 3rd Qu.:27.7 3rd Qu.: 5.000
## Max. :33.5 Max. :15.000
## weight y
## Min. :1200 Min. :0.0000
## 1st Qu.:2000 1st Qu.:0.0000
## Median :2350 Median :1.0000
## Mean :2437 Mean :0.6416
## 3rd Qu.:2850 3rd Qu.:1.0000
## Max. :5200 Max. :1.0000
A good first step in model selection is to use histograms, boxplots, plots, etc. to analyze the data. In this case, color and spine have categorical variables that aren’t in order. By using factor we can re order the levels.
crabs$color <- factor(crabs$color, levels = c("darker","dark","medium","light"))
crabs$spine <- factor(crabs$spine, levels = c("bad","middle","good"))
In this datset we are looking for the number of satellite crabs sorrounding female crabs. Lets look at the data.
After we have looked a few plots and have a good feel for the data, we can create preliminary models. Note: Why are we using poisson regression instead of linear regression? We are using poisson because we are looking for the number of satellite crabs sorrounding female crabs. Since this response is a count, PR would be ideal.
PmodC <- glm(satell~color, data = crabs, family = poisson)
coef(PmodC) #Lets interpret our model with color
## (Intercept) colordark colormedium colorlight
## 0.71562004 0.08515781 0.47670626 0.69129361
colorMeans <- round(exp(coef(PmodC)), digits = 4) #This takes the exponent of our coefficents, so we can interpret our predictor
colorMeans
## (Intercept) colordark colormedium colorlight
## 2.0455 1.0889 1.6108 1.9963
Lets interpret: The darkest crabs have 2.0455 satellite crabs on average. Dark crabs have 3.1344 satellites, on average. Medium colored satellites have 3.6563 on average. Light crabs have 4.0418 satellites on average.
PmodS <- glm(satell~spine, data = crabs, family = poisson)
spineMeans <- round(exp(coef(PmodS)), digits = 4)
PmodW <- glm(satell~width, data = crabs, family = poisson)
widthMeans <- round(exp(coef(PmodW)), digits = 4)
Interpretation of PmodW: A width increase of 1 unit is associated with a multiplicative increase of exp(coef(PmodW)[2]) in the number of satellites the crab has.
One way to compare models is through using AIC:
AIC(PmodC)
## [1] 972.4368
AIC(PmodS)
## [1] 982.4582
AIC(PmodW)
## [1] 927.1762
Our lowest AIC is our width model, so lets start there. We can use forward selection and a drop in deviance test from there to test other predictors.
mod1 <- glm(satell~width+color, data = crabs, family = poisson)
mod2 <- glm(satell~width+spine, data = crabs, family = poisson)
anova(mod1, PmodW, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + color
## Model 2: satell ~ width
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 168 559.34
## 2 171 567.88 -3 -8.5338 0.03618 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mod2, PmodW, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + spine
## Model 2: satell ~ width
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 169 566.60
## 2 171 567.88 -2 -1.2737 0.529
Look at the p values of the anova tests. Adding spine does not significantly improve the model while adding color does (p value = .03618). We can now conclude that our best model is the one with color and width. Consider adding interaction terms, squared and square root predictors, etc.
One way to check the deviance in our model is to use a half norm plot:
library(faraway)
halfnorm(residuals(mod1))
We have 2 outliers which are likely driving up deviance. This means there’s something about the number of satellite crabs we can’t explain with our predictors.
A good last step is to calculate the dispersion parameter for your model.
dp <- sum(residuals(mod1, type = "pearson")^2/mod1$df.residual)
sqrt(dp) #Interpret by multiplying the standard errors by 1.817. Overdispersion does not look too bad.
## [1] 1.798228