crabby <- read.csv("http://www.cknudson.com/data/crabs.csv")
summary(crabby)
## color spine width satell
## Length:173 Length:173 Min. :21.0 Min. : 0.000
## Class :character Class :character 1st Qu.:24.9 1st Qu.: 0.000
## Mode :character Mode :character Median :26.1 Median : 2.000
## Mean :26.3 Mean : 2.919
## 3rd Qu.:27.7 3rd Qu.: 5.000
## Max. :33.5 Max. :15.000
## weight y
## Min. :1200 Min. :0.0000
## 1st Qu.:2000 1st Qu.:0.0000
## Median :2350 Median :1.0000
## Mean :2437 Mean :0.6416
## 3rd Qu.:2850 3rd Qu.:1.0000
## Max. :5200 Max. :1.0000
library(ggplot2)
attach(crabby)
library(remotes)
library(cats)
Question 1:
ggplot(data = crabby, aes(satell)) + add_cat() + geom_histogram(bins = 30)
crabby$color <- factor(crabby$color, levels = c("darker", "dark", "medium", "light"))
ggplot(data = crabby, aes(x = satell, y = color, color = color)) + add_cat() + geom_boxplot(outlier.color = "red", outlier.size = 4)
crabby$spine <- factor(crabby$spine, levels = c("bad", "middle", "good"))
ggplot(data = crabby, aes(x = satell, y = spine, color = spine), fill = spine) + add_cat() + geom_boxplot(outlier.color = "red", outlier.size = 4)
ggplot(data = crabby, aes(x = satell, y = width, color = width)) + add_cat() + geom_point(outlier.color = "red", outlier.size = 4)
## Warning: Ignoring unknown parameters: outlier.colour, outlier.size
Our first indicator of poisson regression is that the data we are dealing with uses counts. Seen in part a, the data is right skewed has the general shape of a low lambda poisson distribution. Another argument for poission is from the heteroscedasticity seen in the plot in part d. Our final indicator is that the homework is called “Poisson Regression”… low hanging fruit, come on. Hahaha.
Based on our graphs, we think that color will have the strongest relationship with the number of satellite crabs. Then we think width will be a good predictor somehow because of the variance relationship.
Question 2:
mod1 <- glm(satell ~ factor(color), data = crabby, family = "poisson")
summary(mod1)
##
## Call:
## glm(formula = satell ~ factor(color), family = "poisson", data = crabby)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8577 -2.1106 -0.1649 0.8721 4.7491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.71562 0.14907 4.801 1.58e-06 ***
## factor(color)dark 0.08516 0.18007 0.473 0.636279
## factor(color)medium 0.47671 0.15943 2.990 0.002789 **
## factor(color)light 0.69129 0.20647 3.348 0.000814 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 6
mod2 <- glm(satell ~ factor(spine), data = crabby, family = "poisson")
summary(mod2)
##
## Call:
## glm(formula = satell ~ factor(spine), family = "poisson", data = crabby)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7014 -2.3706 -0.5097 1.1252 5.0859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03316 0.05423 19.051 <2e-16 ***
## factor(spine)middle -0.34001 0.19045 -1.785 0.0742 .
## factor(spine)good 0.26120 0.10173 2.568 0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 621.16 on 170 degrees of freedom
## AIC: 982.46
##
## Number of Fisher Scoring iterations: 5
mod3 <- glm(satell ~ width, data = crabby, family = "poisson")
summary(mod3)
##
## Call:
## glm(formula = satell ~ width, family = "poisson", data = crabby)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
Question 3:
If we were to choose models based on AIC, the model with the width only would be substantially the best, while the model with the spines would be the worst. We will have to see how this changes when we alter the dispersion parameter. We were correct that the model with the spine would be the worst, but we thought color would be the best. This might be because the AIC for color is penalized by having more variables since color is a factor.
Question 4:
As you can see below, the residuals for our model have a pattern, which is not good. There also are a few outliers at 149, 15, and 56 which could be having a bad effect. Also, the qqnorm plot is not great at all :/ Our dispersion factor is only about 3.18, so that is not that bad. This makes sense since the variance increases as the numbers of crabs increases, so variance may match the mean. The relationship looks fairly linear, which isn’t too bad.
plot(mod3)
sum(residuals(mod3, type = "pearson")^2/mod3$df.residual)
## [1] 3.182205
xvar <- 0:18
ypreds <- -3.30476 + .16405*xvar
with(crabby, (plot(satell, width)))
## NULL
lines(xvar, ypreds)
Question 5:
This first part finds the best model using forward selection. It chooses the model with width and color.
library(MASS)
mod0 <- glm(satell~1, data = crabby, family = "poisson")
mod4 <- stepAIC(mod3, direction = "forward", scope = list(upper = ~width + factor(spine) + factor(color), lower = ~1))
## Start: AIC=927.18
## satell ~ width
##
## Df Deviance AIC
## + factor(color) 3 559.34 924.64
## <none> 567.88 927.18
## + factor(spine) 2 566.60 929.90
##
## Step: AIC=924.64
## satell ~ width + factor(color)
##
## Df Deviance AIC
## <none> 559.34 924.64
## + factor(spine) 2 558.63 927.93
summary(mod4)
##
## Call:
## glm(formula = satell ~ width + factor(color), family = "poisson",
## data = crabby)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0415 -1.9581 -0.5575 0.9830 4.7523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.09740 0.55755 -5.555 2.77e-08 ***
## width 0.14934 0.02084 7.166 7.73e-13 ***
## factor(color)dark 0.01100 0.18041 0.061 0.9514
## factor(color)medium 0.24767 0.16316 1.518 0.1290
## factor(color)light 0.44736 0.20912 2.139 0.0324 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
Below we do the drop in deviance test with a model adding the variable weight and we find the pvalue to be .8279, so it is not worth it to add weight
mod5 <- glm(satell~width + factor(color) + weight, data = crabby, family = "poisson")
summary(mod5)
##
## Call:
## glm(formula = satell ~ width + factor(color) + weight, family = "poisson",
## data = crabby)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9806 -1.8987 -0.5392 0.9015 4.8274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0787808 0.9106215 -1.185 0.23615
## width 0.0309119 0.0473407 0.653 0.51378
## factor(color)dark -0.0001622 0.1805138 -0.001 0.99928
## factor(color)medium 0.2375227 0.1635298 1.452 0.14637
## factor(color)light 0.4420867 0.2090142 2.115 0.03442 *
## weight 0.0004516 0.0001602 2.818 0.00483 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 551.38 on 167 degrees of freedom
## AIC: 918.68
##
## Number of Fisher Scoring iterations: 6
teststat <- (559.34 - 551.38)/(nrow(crabby)- 5)
pf(teststat, df1 = 1, df2 = (nrow(crabby)- 5), lower.tail = FALSE)
## [1] 0.8279491
Below we do the drop in deviance test with a model adding the variable spine and we find the pvalue to be .9481, so it is not worth it to add spine
mod6 <- glm(satell~width + factor(color) + factor(spine), data = crabby, family = "poisson")
summary(mod6)
##
## Call:
## glm(formula = satell ~ width + factor(color) + factor(spine),
## family = "poisson", data = crabby)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0591 -1.9468 -0.4864 0.9620 4.7608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.00566 0.58360 -5.150 2.60e-07 ***
## width 0.14596 0.02189 6.669 2.58e-11 ***
## factor(color)dark 0.02508 0.18106 0.139 0.8898
## factor(color)medium 0.26386 0.16513 1.598 0.1101
## factor(color)light 0.48544 0.22824 2.127 0.0334 *
## factor(spine)middle -0.16241 0.19655 -0.826 0.4086
## factor(spine)good -0.02363 0.11729 -0.201 0.8403
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 558.63 on 166 degrees of freedom
## AIC: 927.93
##
## Number of Fisher Scoring iterations: 6
teststat <- (559.34 - 558.63)/(nrow(crabby)- 6)
pf(teststat, df1 = 1, df2 = (nrow(crabby)- 6), lower.tail = FALSE)
## [1] 0.94809
After doing the drop-in deviance test, we stick with our initial model from forward selection. Our regression equation is: sattel = -3.0864 + .14934(width) - .011(darker) + .43636(light) + .23668(medium) This equation tells us that if the female crab is dark and has a width of 0 (which is impossible), it would attract -3.0864 sattelite crabs. Obviously our width will be much higher than zero as a minimum, so it will never be negative. An increase of one centimeter of width a female crab is, about an additional .15 crabs are attracted, holding all else constant. Additionally, if a crab is darker, the log mean number of crabs decreases by .011. If the crab is medium, the log mean number of satellite crabs increases by .23668. If a crab is light, this increases the log mean number of satellite crabs the most (excusing width) by .43636. Yes, satellite crabs are racist and this is not okay >:(
Question 6:
As we can see from the plot below, our residuals are not in a line, but isn’t what we look for. Unfortunately, our biggest residuals do not follow the line. Before we had outliers, 149, 15, and 56, but now 56 follows the trend, so this is an improvement. But still, not great.
library(faraway)
halfnorm(residuals(mod4))
Question 7:
The dispersion parameter is calculated below and is found to be 3.233625
sum(residuals(mod4, type = "pearson")^2/mod4$df.residual)
## [1] 3.233625
sexyelephants <- read.csv("http://www.cknudson.com/data/elephant.csv")
summary(sexyelephants)
## AGE MATINGS
## Min. :27.00 Min. :0.000
## 1st Qu.:29.00 1st Qu.:1.000
## Median :34.00 Median :2.000
## Mean :35.85 Mean :2.683
## 3rd Qu.:42.00 3rd Qu.:3.000
## Max. :52.00 Max. :9.000
attach(sexyelephants)
Part a:
From looking at the histogram below, there is a potential that this could be modeled as a Poisson response. It uses a count, is right skewed, and the peak of the histogram is at approximately the mean. The variance is about double the mean, which is not great evidence, but mayeb everything else outweighs it.
ggplot(sexyelephants, aes(x = MATINGS, fill = "red")) + add_cat() + geom_histogram(bins = 30)
mean(sexyelephants$MATINGS)
## [1] 2.682927
var(sexyelephants$MATINGS)
## [1] 5.071951
Part b:
As we can see by our plots below, it doesn’t look like linear regression will be the way to go. There is heteroscedasticity in out plot, and the variance increases and AGE increases, which points to poisson. Additionally, our residual plots follow a smiley shape, which points to poisson with a quadratic factor.
mod <- lm(MATINGS ~ AGE)
plot(AGE, MATINGS)
abline(coefficients(mod), col = "red", lwd = 2, lty = 2)
plot(mod)
Part c:
This plot has a more even variance throughout opposed to our previous plot. We do not see any evidence of a quadratic trend, it seems to be fairly linear.
arrmeans <- c()
ages <- unique(AGE)
for(age in ages)
{
arrmeans <- c(arrmeans, mean(subset(sexyelephants, subset = AGE == age)$MATINGS))
}
logmeans <- log(arrmeans)
plot(ages, logmeans)
Part d:
After we exponentiate, we can interpret that for each unit increase of one year in AGE, the mean number of MATINGS increases by 1.071104. This makes sense, because of course older elephants have mated more often, but we would expect this to level off and head to zero the older the elephant gets.
mod1 <- glm(MATINGS ~ AGE, data = sexyelephants, family = "poisson")
summary(mod1)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = "poisson", data = sexyelephants)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
exp(.06869)
## [1] 1.071104
Part e:
Our confidence interval is pretty tight, ranging from about 1.04 to 1.1 for our coefficient prediction of AGE.
exp(.06869 + c(-1, 1)*.01375*1.96)
## [1] 1.042623 1.100363
Part f:
summary(mod1)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = "poisson", data = sexyelephants)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
mod0 <- glm(MATINGS ~ 1, data = sexyelephants, family = "poisson")
summary(mod0)
##
## Call:
## glm(formula = MATINGS ~ 1, family = "poisson", data = sexyelephants)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3164 -1.1799 -0.4368 0.1899 3.0252
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.98691 0.09535 10.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 75.372 on 40 degrees of freedom
## AIC: 178.82
##
## Number of Fisher Scoring iterations: 5
teststat <- 75.372 - 51.012
pchisq(teststat, 1, lower.tail = FALSE)
## [1] 7.991084e-07
Part g:
It does not appear that a quadratic model is preffered over a linear model. Let’s run some tests. i) From the Wald test, we find the pvalue for our quadratic term to be .669 (nice), which means it is NOT significant all. ii) The drop in deviance test finds a pvalue of .666 (nice), which indicates that the quadratic term pretty much sucks.
agesq <- AGE^2
mod2 <- glm(MATINGS ~ AGE + agesq, data = sexyelephants, family = "poisson")
summary(mod2)
##
## Call:
## glm(formula = MATINGS ~ AGE + agesq, family = "poisson", data = sexyelephants)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.0356383 -0.941 0.347
## AGE 0.1359544 0.1580095 0.860 0.390
## agesq -0.0008595 0.0020124 -0.427 0.669
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
teststat <- 51.012 - 50.826
pchisq(teststat, 1, lower.tail = FALSE)
## [1] 0.6662668
Part h:
Honestly, this model with age as the only predictor seems pretty good. It has a low pvalue and a decent AIC, comparatively.
We find our dispersion parameter below to be about 1.157334
dp <- sum(residuals(mod1, type = "pearson")^2/mod1$df.residual)
Part f dispersion:
mod3 <- glm(MATINGS ~ AGE, data = sexyelephants, family = "poisson")
mod4 <- glm(MATINGS ~ 1, data = sexyelephants, family = "poisson")
summary(mod3, dispersion = dp)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = "poisson", data = sexyelephants)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.58590 -2.700 0.00693 **
## AGE 0.06869 0.01479 4.645 3.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
summary(mod4, dispersion = dp)
##
## Call:
## glm(formula = MATINGS ~ 1, family = "poisson", data = sexyelephants)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3164 -1.1799 -0.4368 0.1899 3.0252
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9869 0.1026 9.622 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 75.372 on 40 degrees of freedom
## AIC: 178.82
##
## Number of Fisher Scoring iterations: 5
teststat <- 75.372 - 51.012
pchisq(teststat, 1, lower.tail = FALSE)
## [1] 7.991084e-07
Part g dispersion:
It does not appear that a quadratic model is preffered over a linear model. Let’s run some tests. i) From the Wald test, we find the pvalue for our quadratic term to be .691 (nice), which means it is NOT significant all. It increased from before, but not by much. ii) The drop in deviance test finds a pvalue of .666 (nice), which indicates that the quadratic term pretty much sucks. The dispersion parameter caused very little change.
agesq <- AGE^2
mod5 <- glm(MATINGS ~ AGE + agesq, data = sexyelephants, family = "poisson")
summary(mod5, dispersion = dp)
##
## Call:
## glm(formula = MATINGS ~ AGE + agesq, family = "poisson", data = sexyelephants)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.2657237 -0.875 0.382
## AGE 0.1359544 0.1699858 0.800 0.424
## agesq -0.0008595 0.0021649 -0.397 0.691
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
teststat <- 51.012 - 50.826
pchisq(teststat, 1, lower.tail = FALSE)
## [1] 0.6662668
Part h dispersion:
The dispersion parameter was small and caused pretty much no difference. The model with age as the only predictor seems pretty good. It has a low pvalue and a decent AIC, comparatively.