crabdat<- read.csv("http://www.cknudson.com/data/crabs.csv")
library(faraway)
attach(crabdat)
hist(satell)
The histogram appears to be right skewed and only values greater than or equal to zero. This data visually looks like it might fit a Poisson distribution.
boxplot(satell~color)
This may come into play as an important variable when it comes to modeling because it visually appears that on average lighter crabs have more satellites and darker crabs have less.
boxplot(satell~spine)
This spine variable might not be useful as all the means fall within each other CIs. If I had to guess I would say this will not be a significant variable when we make a model.
logsat <- log(satell)
plot(logsat~width)
This width variable might not be all that great in modeling as there is no explicit pattern. It does look like the variables might be positively related however.
I think a Poisson regression is appropriate because of the range of our response type and the shape of the distribution, which is shown pretty explicitly in the histogram. It also appears that variables like color and width may affect the number of satellites.
I think the lighter colors will have the strongest relationship with number of satellites. Width is our second guess but the log satellite plot is not entirely reassuring.
modcolor <- glm(satell ~ color, family = poisson)
summary(modcolor)
##
## Call:
## glm(formula = satell ~ color, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8577 -2.1106 -0.1649 0.8721 4.7491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80078 0.10102 7.927 2.24e-15 ***
## colordarker -0.08516 0.18007 -0.473 0.636279
## colorlight 0.60614 0.17496 3.464 0.000532 ***
## colormedium 0.39155 0.11575 3.383 0.000718 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 6
modspine <- glm(satell ~ spine, family = poisson)
summary(modspine)
##
## Call:
## glm(formula = satell ~ spine, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7014 -2.3706 -0.5097 1.1252 5.0859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03316 0.05423 19.051 <2e-16 ***
## spinegood 0.26120 0.10173 2.568 0.0102 *
## spinemiddle -0.34001 0.19045 -1.785 0.0742 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 621.16 on 170 degrees of freedom
## AIC: 982.46
##
## Number of Fisher Scoring iterations: 5
modwidth <- glm(satell ~ width, family = poisson)
summary(modwidth)
##
## Call:
## glm(formula = satell ~ width, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
A classification of a female crab into the category of ‘colorlight’ is associated with a 0.606 increase in the number of satellite crabs. Etc…
A classification of a female crab into the category of ‘spinegood’ is associated with a .261 increase in the number of satellite crabs. Etc…
An increase of one cm in the female crabs carapace width is associated with a .164 increase in the number of satellite crabs.
AIC(modcolor)
## [1] 972.4368
AIC(modspine)
## [1] 982.4582
AIC(modwidth)
## [1] 927.1762
My prediction was a little off. I thought color might play a larger role than width but I was wrong. The variable Width is the best predictor of these three options.
plot(modwidth, which = c(1,2))
The residual vs predicted plot and the qq-plot show that we do have few outliers and maybe some missing predictors that would help the fit of the model as the red line of the residual plot looks slightly banana shaped.
dispsat <- sum(residuals(modwidth, type ="pearson")^2 / modwidth$df.residual)
dispsat
## [1] 3.182205
This can be interpreted as our variance is about 3.18 times larger than the mean. This is not ideal. We want our dispersion to be closer to 1.
modbase <- glm(satell ~ 1, family = poisson)
modweight <- glm(satell ~ weight, family = poisson)
AIC(modbase)
## [1] 990.0893
AIC(modcolor)
## [1] 972.4368
AIC(modspine)
## [1] 982.4582
AIC(modwidth)
## [1] 927.1762
AIC(modweight)
## [1] 920.1641
The model with weight as a predictor is the best in terms of AIC so we move forward with modweight.
modWtCl <- glm(satell ~ weight + color, family = poisson)
modWtSp <- glm(satell ~ weight + spine, family = poisson)
modWtWd <- glm(satell ~ weight + width, family = poisson)
AIC(modWtCl)
## [1] 917.1026
AIC(modWtSp)
## [1] 922.7853
AIC(modWtWd)
## [1] 921.1983
None of these models AIC’s decrease by over 10 so we should stop forward-selection
Let’s see if the drop-in-deviance test agrees with this conclusion.
droptest <- deviance(modweight) - deviance(modWtCl)
pchisq(droptest, df = 3, lower.tail = FALSE)
## [1] 0.02848457
This p-value is significant however so we should really continue our forward-selection despite the AIC test.
modWtClSp <- glm(satell ~ weight + color + spine, family = poisson)
modWtSpWd <- glm(satell ~ weight + spine + width, family = poisson)
AIC(modWtClSp)
## [1] 919.0001
AIC(modWtSpWd)
## [1] 924.1238
Again the AIC forward-selection tells us to stop but lets make sure the drop-in-deviance agrees.
droptest <- deviance(modWtCl) - deviance(modWtClSp)
pchisq(droptest, df = 2, lower.tail = FALSE)
## [1] 0.3495059
Here the drop-in-deviance confirms our forward-selection stoppage.
summary(modWtCl)
##
## Call:
## glm(formula = satell ~ weight + color, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9833 -1.9272 -0.5553 0.8646 4.8270
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.996e-01 1.959e-01 -2.551 0.0108 *
## weight 5.462e-04 6.811e-05 8.019 1.07e-15 ***
## colordarker -2.253e-03 1.805e-01 -0.012 0.9900
## colorlight 4.498e-01 1.757e-01 2.560 0.0105 *
## colormedium 2.447e-01 1.177e-01 2.079 0.0376 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 551.80 on 168 degrees of freedom
## AIC: 917.1
##
## Number of Fisher Scoring iterations: 6
An increase of one gram of female crab weight is associated with a 5.462*10^-4 increase in the number of satellite crabs.
A female crab identified as ‘darker’ is associated with a decrease of 2.253*10^-3 in the number of satellite crabs.
Etc…
The key for the interpretation above of this model is the Estimate column in the Coefficients portion of the summary of the model.
halfnorm(residuals(modWtCl))
The top right of the halfnorm plot shows that our outliers are still don’t fit very well in our model which is something we could address if we had more time.
finaldisp <- sum(residuals(modWtCl, type ="pearson")^2 / modWtCl$df.residual)
This dispersion parameter is very close to our first calculated dispersion but it is slightly larger. Regardless, our variance is still more than 3 times greater than our mean.
eleph <- read.csv("http://www.cknudson.com/data/elephant.csv")
attach(eleph)
hist(MATINGS)
There is evidence because the distribution is right-skewed and we have values for MATINGS that are only greater than or equal to zero.
linearmod <- lm(MATINGS ~ AGE)
plot(MATINGS~AGE)
abline(linearmod)
plot(linearmod, which = 1)
It looks like there are quite a few outliers visible from these graphs and some heteroscedasticity in the residual plot. It also implies that 20 year old elephants have a negative number of matings per year which doesn’t make sense.
library(plyr)
meandata <- ddply(eleph, "AGE", summarise, mean= mean(MATINGS))
logmeans <- log(meandata$mean)
plot(meandata$AGE, log(meandata$mean), main = "Log average mating by age" , xlab = "Age", ylab = "Log Average")
It appears the relationship between the log mean of mating occurrences and age is linear. As an elephant increases in age the more mating sessions they participate in. This satisfies the linearity assumption.
I would say there is little to no quadratic trend on this graph. There is a slight banana curve up shape to the plot but it looks much more linear than quadratic.
Pmod <- glm(MATINGS ~ AGE, family = poisson)
summary(Pmod)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
Pmod$coef[2]
## AGE
## 0.06869281
exp(.06869281)
## [1] 1.071107
A one increase in age of an elephant is associated with a 7.11% increase in the number of MATINGS.
exp(1)^confint(Pmod, 'AGE', level=.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 1.042558 1.100360
We are 95% certain that the multiplicative factor of an increase of AGE of one year will be between these two values.
summary(Pmod)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
basemod <- glm(MATINGS ~ 1, family = poisson)
anova(basemod, Pmod, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 40 75.372
## 2 39 51.012 1 24.36 7.991e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Wald Test and drop-in-deviance test have significant p-values which both mean that the variable AGE is important to the model and should not be dropped in favor of the basemod.
AGEsq <- AGE^2
Pmodsq <- glm(MATINGS ~ AGE + AGEsq, family = poisson)
summary(Pmodsq)
##
## Call:
## glm(formula = MATINGS ~ AGE + AGEsq, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.0356383 -0.941 0.347
## AGE 0.1359544 0.1580095 0.860 0.390
## AGEsq -0.0008595 0.0020124 -0.427 0.669
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
anova(Pmod, Pmodsq, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AGEsq
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 39 51.012
## 2 38 50.826 1 0.18544 0.6667
Both the Wald Test and drop-in-deviance test tell me that the additional variable of AGE squared is not significantly useful. The linear (or simpler) model is preferred in this scenario.
pchisq(Pmod$deviance, Pmod$df.residual, lower.tail = FALSE)
## [1] 0.09426231
This p-value that is greater than our alpha of .05 shows that we do not have evidence of lack-of-fit.
Now we will repeat the last three steps with the dispersion accounted for…
disp <- sum(residuals(Pmod, type ="pearson")^2 / Pmod$df.residual)
disp
## [1] 1.157334
This implies that our variance is 1.15 times larger than our mean.
summary(Pmod, dispersion = disp)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.58590 -2.700 0.00693 **
## AGE 0.06869 0.01479 4.645 3.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
dropinD <- deviance(basemod) - deviance(Pmod)
teststat <- dropinD/(disp*1)
pf(teststat, df1 = 1, 38, lower.tail = FALSE)
## [1] 4.767203e-05
The Wald Test and drop-in-deviance test again confirm that AGE is still a significant variable to include even after accounting for dispersion.
summary(Pmodsq)
##
## Call:
## glm(formula = MATINGS ~ AGE + AGEsq, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.0356383 -0.941 0.347
## AGE 0.1359544 0.1580095 0.860 0.390
## AGEsq -0.0008595 0.0020124 -0.427 0.669
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
disp2 <- sum(residuals(Pmodsq, type ="pearson")^2 / Pmodsq$df.residual)
dropinD <- deviance(Pmod) - deviance(Pmodsq)
teststat2 <- dropinD/(disp2*1)
pf(teststat2, df1 = 1, 38, lower.tail = FALSE)
## [1] 0.6934135
The Wald Test and drop-in-deviance test agree here that the addition of AGE squared is not significant to the fit of the model and it can therefore be dropped. This was the same result when we didn’t account for dispersion.
summary(Pmod, dispersion = disp)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.58590 -2.700 0.00693 **
## AGE 0.06869 0.01479 4.645 3.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
pchisq(51.012, 39, lower.tail = FALSE)
## [1] 0.09425638
This p-value that is still greater than our alpha of .05, despite dispersion, so that means we do not have evidence of lack-of-fit.