This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(rmarkdown); library(knitr)
library(dplyr)
library(psych); library(corrplot); library(GPArotation)
library(lavaan); library(semPlot); library(moments)
crabdat <- read.csv("http://www.cknudson.com/data/crabs.csv")
###1a Create a histogram of the number of satellite crabs surrounding female crabs. Interpret in the context of the data
hist(crabdat$satell)
#The histogram showing that there are fewer satellite crabs surrounding female crabs.approximately Poisson distributed
###1b Create side-by-side boxplots of the response by color. Interpret.
boxplot(crabdat$satell ~ crabdat$color)
#From the boxplot, we can see that dark carb and darker crab have fewer satelitte than a light color crab and medium color crab
###1c Create side-by-side boxplots of the response by spine. Interpret.
boxplot(crabdat$satell ~ crabdat$spine)
#From the boxplot, good spines crab are have more satellite than others crab.
#Bad spine have the most variability in the number of satellite
###1d Create a scatterplot of the log of the response by width. Interpret.
plot(crabdat$satell~crabdat$width)
plot(log(crabdat$satell) ~ crabdat$width)
#From the scatter plot, we can see there are no relationship between log of the response by width
###1e Use the plots you’ve created as evidence to argue that a Poisson regression is appropriate.
#The satelitte histogram lookspoisson with strictly non negative and a heavy right tail. Fitting with the log response will give us the better fit line.
###1f Before doing any further analysis, which predictors do you think will have the strongest relationship with the mean number of satellite crabs that a female has?
#In my opinion, I thinks the color of crab will have the strongest relationship with the mean number of satellite crabs. Because I can clearly see the differences in the satellite based on color of crab
###2
attach(crabdat)
mod1 <- glm(satell~color,family=poisson)
mod2 <- glm(satell~spine,family=poisson)
mod3 <- glm(satell~width,family=poisson)
summary(mod1)
##
## Call:
## glm(formula = satell ~ color, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8577 -2.1106 -0.1649 0.8721 4.7491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.80078 0.10102 7.927 2.24e-15 ***
## colordarker -0.08516 0.18007 -0.473 0.636279
## colorlight 0.60614 0.17496 3.464 0.000532 ***
## colormedium 0.39155 0.11575 3.383 0.000718 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 6
summary(mod2)
##
## Call:
## glm(formula = satell ~ spine, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7014 -2.3706 -0.5097 1.1252 5.0859
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03316 0.05423 19.051 <2e-16 ***
## spinegood 0.26120 0.10173 2.568 0.0102 *
## spinemiddle -0.34001 0.19045 -1.785 0.0742 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 621.16 on 170 degrees of freedom
## AIC: 982.46
##
## Number of Fisher Scoring iterations: 5
summary(mod3)
##
## Call:
## glm(formula = satell ~ width, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
AIC(mod1,mod2,mod3)
## df AIC
## mod1 4 972.4368
## mod2 3 982.4582
## mod3 2 927.1762
#Mod3 has the lowest residual deviance of the three models. There is only mod1 have a coeffiencient that is not significant at any level. Mod3 also has all of coeffcients being significant at 0
###3
#After producing our AIC values for all three models, we see that model3, the model with width as a predictor for satellite crabs had the lowest AIC value meaning that this is the best model that we have produced. This does not match up with our previous prediction.
###4
widthdp<- sum(residuals(mod3, type="pearson")^2 / mod3$df.res)
widthdp
## [1] 3.182205
#The dispersion parameter is not equal to 1, meaning that this is not a perfect Poisson model.
plot(residuals(mod3)~predict(mod3,type="link"),xlab="log lambda", ylab="deviance residuals")
#From the plot we can see that are more positive than the negative so its mean the log number of satellites are under-predicting
library(faraway)
## Warning: package 'faraway' was built under R version 4.0.3
##
## Attaching package: 'faraway'
## The following object is masked from 'package:psych':
##
## logit
halfnorm(residuals(mod3))
# We can see there are at least 4 outlier on the graph
###5
modebase<- glm(satell~width, data= crabdat, family = poisson)
mod4<-glm(satell~width + color, data= crabdat, family = poisson)
mod5<-glm(satell~width + spine, data= crabdat, family = poisson)
mod6<-glm(satell~width + weight, data= crabdat, family = poisson)
summary(modebase)
##
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
summary(mod4)
##
## Call:
## glm(formula = satell ~ width + color, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0415 -1.9581 -0.5575 0.9830 4.7523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.08640 0.55750 -5.536 3.09e-08 ***
## width 0.14934 0.02084 7.166 7.73e-13 ***
## colordarker -0.01100 0.18041 -0.061 0.9514
## colorlight 0.43636 0.17636 2.474 0.0133 *
## colormedium 0.23668 0.11815 2.003 0.0452 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
summary(mod5)
##
## Call:
## glm(formula = satell ~ width + spine, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9509 -1.9740 -0.4963 1.0832 4.7173
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.12469 0.56497 -5.531 3.19e-08 ***
## width 0.15668 0.02096 7.476 7.69e-14 ***
## spinegood 0.09899 0.10490 0.944 0.345
## spinemiddle -0.10033 0.19309 -0.520 0.603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 566.60 on 169 degrees of freedom
## AIC: 929.9
##
## Number of Fisher Scoring iterations: 6
summary(mod6)
##
## Call:
## glm(formula = satell ~ width + weight, family = poisson, data = crabdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9309 -1.9702 -0.5478 0.9700 4.9904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2952111 0.8988960 -1.441 0.14962
## width 0.0460765 0.0467497 0.986 0.32433
## weight 0.0004470 0.0001586 2.818 0.00483 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.90 on 170 degrees of freedom
## AIC: 921.2
##
## Number of Fisher Scoring iterations: 6
# Looks like non of the models have sufficient drop in AIC so we will chose the currently modbase
#Final mode is : glm(satell~ width, family = poisson, data=crabdat)
anova(modebase,mod4, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width
## Model 2: satell ~ width + color
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 168 559.34 3 8.5338 0.03618 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(modebase,mod5, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width
## Model 2: satell ~ width + spine
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 169 566.60 2 1.2737 0.529
anova(modebase,mod6, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width
## Model 2: satell ~ width + weight
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 171 567.88
## 2 170 559.90 1 7.978 0.004735 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Mod6 was the most significant drop in deviance so we use that as our new modbase2
modbase2 <-mod6
mod7<-glm(satell~width + weight + color, data= crabdat, family = poisson)
mod8<-glm(satell~width + weight+ spine, data= crabdat, family = poisson)
anova(modbase2,mod7,test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + color
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 170 559.90
## 2 167 551.38 3 8.5203 0.0364 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(modbase2,mod8,test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + spine
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 170 559.90
## 2 168 558.83 2 1.0745 0.5844
# looks like model 7 shows significant drop so this will be our new model- modbase3
modbase3 <- mod7
mod9<-glm(satell~width + weight + color+spine, data= crabdat, family = poisson)
anova(modbase3,mod9,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: satell ~ width + weight + color
## Model 2: satell ~ width + weight + color + spine
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 167 551.38
## 2 165 549.59 2 1.7947 0.4076
#Since model 9 do not show any significant drop in deviance . This my final model
###6
halfnorm(residuals(modbase3))
# From the graph, the residual follow on trend. 15 and 149 show as outlier
###7
dp <- sum(residuals(modbase3, type="pearson")^2 / modbase3$df.res)
dp
## [1] 3.204836
# The dipersion parameter is 3.204836 - not a perfect poisson because of the difference in mean and variance
Disperson parameter of the larger one
###Elephant Mating
eleph <-read.csv("http://www.cknudson.com/data/elephant.csv")
attach(eleph)
###A.
hist(MATINGS)
###B.
plot(MATINGS ~ AGE, main = "Elephants matting by ages")
modeleph <- lm(MATINGS ~ AGE)
abline(modeleph)
plot(modeleph)
#this model appear not to be a great model . There are a several outlier among the residuals
###C.
meanmate <- aggregate(MATINGS ~ AGE, eleph,mean)
plot(log(meanmate$MATINGS), meanmate$AGE, main = "Log average mating by age" , xlab = "Age", ylab = "Log Average")
# The data look like we could fit a line better to this data . We could try to use Poisson Regression
# I can not see the evidence of quadratic trend in the plot
###D.
modelA <- glm(MATINGS ~ AGE, family = poisson, data = eleph)
summary(modelA)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
exp(coef(modelA)[2])
## AGE
## 1.071107
# For each increse in 1 years of age, we expec matings to increse by 1.07 (multiplicative change)
###E.
exp(confint(modelA, level =0.95)[2,])
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 1.042558 1.100360
# From the results above, we are 95 % confident that the true multiplicative impact of age on average number of matings per years is between 1.04 and 1.10
###F.
# For Wald Test, it is significant since p-value for the test is 5.81*10^-7 which is really close to 0
summary(modelA)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
modelB<-glm(MATINGS~1, family=poisson, data=eleph)
anova(modelB,modelA, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 40 75.372
## 2 39 51.012 1 24.36 7.991e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Looks like there is no significant drop in deviance when we add age . We will keep age in the model
###G.
AgeSq<-eleph$AGE^2
quadmod<- glm(MATINGS~AGE+AgeSq, data = eleph, family = poisson)
summary(quadmod)
##
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.0356383 -0.941 0.347
## AGE 0.1359544 0.1580095 0.860 0.390
## AgeSq -0.0008595 0.0020124 -0.427 0.669
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
# The deviance is drop in this model
anova(modelA,quadmod, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AgeSq
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 39 51.012
## 2 38 50.826 1 0.18544 0.6667
# We see that the P-value is not significant so we should remove age^2 from the model
###H.
teststat<-sum(residuals(modelA)^2)
df<- 41-3
pchisq(teststat, df, lower.tail = FALSE)
## [1] 0.07717098
# The P- value is 0.07718 is not significant at .05 level so this model is well predicting. The p-value id on the edga of being significants (at alpha = .1 level) .This would mean the model is not predicting well
###F(redo)
dp <-sum (residuals(modelA, type="pearson")^2 / modelA$df.res)
dp
## [1] 1.157334
summary(modelA, dispersion=dp)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.58590 -2.700 0.00693 **
## AGE 0.06869 0.01479 4.645 3.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1.157334)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
summary(modelA)
##
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80798 -0.86137 -0.08629 0.60087 2.17777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58201 0.54462 -2.905 0.00368 **
## AGE 0.06869 0.01375 4.997 5.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 51.012 on 39 degrees of freedom
## AIC: 156.46
##
## Number of Fisher Scoring iterations: 5
DropDev<-deviance(modelB)-deviance(modelA)
teststat2<- DropDev/(dp*1)
pf(teststat2,df1=1,df2=38,lower.tail=FALSE)
## [1] 4.767203e-05
# The p-value is 4.767203e-05. It is still not significant so we will keep age as a variable
###G(redo)
dp2<-sum(residuals(quadmod, type="pearson")^2 / quadmod$df.res)
dp2
## [1] 1.175194
summary(quadmod, dispersion=dp2)
##
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = eleph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8470 -0.8848 -0.1122 0.6580 2.1134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060 3.2908256 -0.868 0.385
## AGE 0.1359544 0.1712924 0.794 0.427
## AgeSq -0.0008595 0.0021816 -0.394 0.694
##
## (Dispersion parameter for poisson family taken to be 1.175194)
##
## Null deviance: 75.372 on 40 degrees of freedom
## Residual deviance: 50.826 on 38 degrees of freedom
## AIC: 158.27
##
## Number of Fisher Scoring iterations: 5
#From the Wald Test, the age^2 still does not have a significant p-value and should be left out of the model drop in deviance
DropDev2<-deviance(modelA)-deviance(quadmod)
teststat3<- DropDev2/(dp*1)
pf(teststat3,df1=1,df2=37,lower.tail=FALSE)
## [1] 0.6912428
# The p-value > .05 mean there is no significant drop in deviance by adding age^2, we shoul remove age^2 from the model
###H(redo)
# Because the dispersion parameter is not depend on the Goodness of fit test . We will see the result as part H
aggregate(MATINGS ~ AGE, eleph, mean)
We are 95 % confident that the true multipicative is between … and …
```
if the residuals are high, you are under predicted
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.