R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(rmarkdown); library(knitr)

library(dplyr)
library(psych); library(corrplot); library(GPArotation)
library(lavaan); library(semPlot); library(moments)
crabdat <- read.csv("http://www.cknudson.com/data/crabs.csv")

###1a Create a histogram of the number of satellite crabs surrounding female crabs. Interpret in the context of the data

hist(crabdat$satell)

#The histogram showing that there are fewer satellite crabs surrounding female crabs.approximately Poisson distributed

###1b Create side-by-side boxplots of the response by color. Interpret.

boxplot(crabdat$satell ~ crabdat$color)

#From the boxplot, we can see that dark carb and darker crab have fewer satelitte than a light color crab and medium color crab

###1c Create side-by-side boxplots of the response by spine. Interpret.

boxplot(crabdat$satell ~ crabdat$spine)

#From the boxplot, good spines crab are have more satellite than others crab. 
#Bad spine have the most variability in the number of satellite

###1d Create a scatterplot of the log of the response by width. Interpret.

plot(crabdat$satell~crabdat$width)

plot(log(crabdat$satell) ~ crabdat$width)

#From the scatter plot, we can see there are no relationship between log of the response by width

###1e Use the plots you’ve created as evidence to argue that a Poisson regression is appropriate.

#The satelitte histogram lookspoisson with strictly non negative and a heavy right tail. Fitting with the log response will give us the better fit line. 

###1f Before doing any further analysis, which predictors do you think will have the strongest relationship with the mean number of satellite crabs that a female has?

#In my opinion,  I thinks the color of crab will have the strongest relationship with the mean number of satellite crabs. Because I can clearly see the differences in the satellite based on color of crab

###2

attach(crabdat)
mod1 <- glm(satell~color,family=poisson)
mod2 <- glm(satell~spine,family=poisson)
mod3 <- glm(satell~width,family=poisson)

summary(mod1)
## 
## Call:
## glm(formula = satell ~ color, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8577  -2.1106  -0.1649   0.8721   4.7491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.80078    0.10102   7.927 2.24e-15 ***
## colordarker -0.08516    0.18007  -0.473 0.636279    
## colorlight   0.60614    0.17496   3.464 0.000532 ***
## colormedium  0.39155    0.11575   3.383 0.000718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 609.14  on 169  degrees of freedom
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 6
summary(mod2)
## 
## Call:
## glm(formula = satell ~ spine, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7014  -2.3706  -0.5097   1.1252   5.0859  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.03316    0.05423  19.051   <2e-16 ***
## spinegood    0.26120    0.10173   2.568   0.0102 *  
## spinemiddle -0.34001    0.19045  -1.785   0.0742 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 621.16  on 170  degrees of freedom
## AIC: 982.46
## 
## Number of Fisher Scoring iterations: 5
summary(mod3)
## 
## Call:
## glm(formula = satell ~ width, family = poisson)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6
AIC(mod1,mod2,mod3)
##      df      AIC
## mod1  4 972.4368
## mod2  3 982.4582
## mod3  2 927.1762
#Mod3 has the lowest residual deviance of the three models. There is only mod1 have a coeffiencient that is not significant at any level. Mod3 also has all of coeffcients being significant at 0

###3

#After producing our AIC values for all three models, we see that model3, the model with width as a predictor for satellite crabs had the lowest AIC value meaning that this is the best model that we have produced. This does not match up with our previous prediction.

###4

widthdp<- sum(residuals(mod3, type="pearson")^2 / mod3$df.res)
widthdp
## [1] 3.182205
#The dispersion parameter is not equal to 1, meaning that this is not a perfect Poisson model.
plot(residuals(mod3)~predict(mod3,type="link"),xlab="log lambda", ylab="deviance residuals")

#From the plot we can see that are more positive than the negative so its mean the log number of satellites are under-predicting
library(faraway)
## Warning: package 'faraway' was built under R version 4.0.3
## 
## Attaching package: 'faraway'
## The following object is masked from 'package:psych':
## 
##     logit
halfnorm(residuals(mod3))

# We can see there are at least 4 outlier on the graph

###5

modebase<- glm(satell~width, data= crabdat, family = poisson)

mod4<-glm(satell~width + color, data= crabdat, family = poisson)
mod5<-glm(satell~width + spine, data= crabdat, family = poisson)
mod6<-glm(satell~width + weight, data= crabdat, family = poisson)

summary(modebase)
## 
## Call:
## glm(formula = satell ~ width, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6
summary(mod4)
## 
## Call:
## glm(formula = satell ~ width + color, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0415  -1.9581  -0.5575   0.9830   4.7523  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.08640    0.55750  -5.536 3.09e-08 ***
## width        0.14934    0.02084   7.166 7.73e-13 ***
## colordarker -0.01100    0.18041  -0.061   0.9514    
## colorlight   0.43636    0.17636   2.474   0.0133 *  
## colormedium  0.23668    0.11815   2.003   0.0452 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6
summary(mod5)
## 
## Call:
## glm(formula = satell ~ width + spine, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9509  -1.9740  -0.4963   1.0832   4.7173  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.12469    0.56497  -5.531 3.19e-08 ***
## width        0.15668    0.02096   7.476 7.69e-14 ***
## spinegood    0.09899    0.10490   0.944    0.345    
## spinemiddle -0.10033    0.19309  -0.520    0.603    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 566.60  on 169  degrees of freedom
## AIC: 929.9
## 
## Number of Fisher Scoring iterations: 6
summary(mod6)
## 
## Call:
## glm(formula = satell ~ width + weight, family = poisson, data = crabdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9309  -1.9702  -0.5478   0.9700   4.9904  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.2952111  0.8988960  -1.441  0.14962   
## width        0.0460765  0.0467497   0.986  0.32433   
## weight       0.0004470  0.0001586   2.818  0.00483 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.90  on 170  degrees of freedom
## AIC: 921.2
## 
## Number of Fisher Scoring iterations: 6
# Looks like non of the models have sufficient drop in AIC so we will chose the currently modbase
#Final mode is :  glm(satell~ width, family = poisson, data=crabdat)
anova(modebase,mod4, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width
## Model 2: satell ~ width + color
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       171     567.88                       
## 2       168     559.34  3   8.5338  0.03618 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(modebase,mod5, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width
## Model 2: satell ~ width + spine
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       171     567.88                     
## 2       169     566.60  2   1.2737    0.529
anova(modebase,mod6, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width
## Model 2: satell ~ width + weight
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       171     567.88                        
## 2       170     559.90  1    7.978 0.004735 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Mod6 was the most significant drop in deviance so we use that as our new modbase2
modbase2 <-mod6
mod7<-glm(satell~width + weight + color, data= crabdat, family = poisson)
mod8<-glm(satell~width + weight+ spine, data= crabdat, family = poisson)
anova(modbase2,mod7,test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + color
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       170     559.90                       
## 2       167     551.38  3   8.5203   0.0364 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(modbase2,mod8,test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width + weight
## Model 2: satell ~ width + weight + spine
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       170     559.90                     
## 2       168     558.83  2   1.0745   0.5844
# looks like model 7 shows significant drop so this will be our new model- modbase3
modbase3 <- mod7
mod9<-glm(satell~width + weight + color+spine, data= crabdat, family = poisson)
anova(modbase3,mod9,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: satell ~ width + weight + color
## Model 2: satell ~ width + weight + color + spine
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       167     551.38                     
## 2       165     549.59  2   1.7947   0.4076
#Since model 9 do not show any significant drop in deviance . This my final model

###6

halfnorm(residuals(modbase3))

# From the graph, the residual follow on trend. 15 and 149 show as outlier

###7

dp <- sum(residuals(modbase3, type="pearson")^2 / modbase3$df.res) 
dp
## [1] 3.204836
# The dipersion parameter is 3.204836 - not a perfect poisson because of the difference in mean and variance

Disperson parameter of the larger one

###Elephant Mating

eleph <-read.csv("http://www.cknudson.com/data/elephant.csv")
attach(eleph)

###A.

hist(MATINGS)

###B.

plot(MATINGS ~ AGE, main = "Elephants matting by ages")
modeleph <- lm(MATINGS ~ AGE)
abline(modeleph)

plot(modeleph)

#this model appear not to be a great model . There are a several outlier among the residuals

###C.

meanmate <- aggregate(MATINGS ~ AGE, eleph,mean)
plot(log(meanmate$MATINGS), meanmate$AGE, main = "Log average mating by age" , xlab = "Age", ylab = "Log Average")

# The data look like we could fit a line better to this data . We could try to use Poisson Regression
# I can not see the evidence of quadratic trend in the plot

###D.

modelA <- glm(MATINGS ~ AGE, family = poisson, data = eleph)
summary(modelA)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
exp(coef(modelA)[2])
##      AGE 
## 1.071107
# For each increse in 1 years of age, we expec matings to increse by 1.07 (multiplicative change)

###E.

exp(confint(modelA, level =0.95)[2,])
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 1.042558 1.100360
# From the results above, we are 95 % confident that the true multiplicative impact of age on average number of matings per years is between 1.04 and 1.10

###F.

# For Wald Test, it is significant since p-value for the test is 5.81*10^-7 which is really close to 0
summary(modelA)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
modelB<-glm(MATINGS~1, family=poisson, data=eleph)
anova(modelB,modelA, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ 1
## Model 2: MATINGS ~ AGE
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        40     75.372                          
## 2        39     51.012  1    24.36 7.991e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Looks like there is no significant drop in deviance when we add age . We will keep age in the model

###G.

AgeSq<-eleph$AGE^2
quadmod<- glm(MATINGS~AGE+AgeSq, data = eleph, family = poisson)
summary(quadmod)
## 
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.0356383  -0.941    0.347
## AGE          0.1359544  0.1580095   0.860    0.390
## AgeSq       -0.0008595  0.0020124  -0.427    0.669
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5
# The deviance is drop in this model
anova(modelA,quadmod, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: MATINGS ~ AGE
## Model 2: MATINGS ~ AGE + AgeSq
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        39     51.012                     
## 2        38     50.826  1  0.18544   0.6667
# We see that the P-value is not significant so we should remove age^2 from the model

###H.

teststat<-sum(residuals(modelA)^2)
df<- 41-3 
pchisq(teststat, df, lower.tail = FALSE)
## [1] 0.07717098
# The P- value is 0.07718 is not significant at .05 level so this model is well predicting. The p-value id on the edga of being significants (at alpha = .1 level) .This would mean the model is not predicting well 

###F(redo)

dp <-sum (residuals(modelA, type="pearson")^2 / modelA$df.res) 
dp
## [1] 1.157334
summary(modelA, dispersion=dp)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.58590  -2.700  0.00693 ** 
## AGE          0.06869    0.01479   4.645  3.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1.157334)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
summary(modelA)
## 
## Call:
## glm(formula = MATINGS ~ AGE, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80798  -0.86137  -0.08629   0.60087   2.17777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.58201    0.54462  -2.905  0.00368 ** 
## AGE          0.06869    0.01375   4.997 5.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 51.012  on 39  degrees of freedom
## AIC: 156.46
## 
## Number of Fisher Scoring iterations: 5
DropDev<-deviance(modelB)-deviance(modelA)
teststat2<- DropDev/(dp*1)
pf(teststat2,df1=1,df2=38,lower.tail=FALSE)
## [1] 4.767203e-05
# The p-value is 4.767203e-05. It is still not significant so we will keep age as a variable

###G(redo)

dp2<-sum(residuals(quadmod, type="pearson")^2 / quadmod$df.res) 
dp2
## [1] 1.175194
summary(quadmod, dispersion=dp2)
## 
## Call:
## glm(formula = MATINGS ~ AGE + AgeSq, family = poisson, data = eleph)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8470  -0.8848  -0.1122   0.6580   2.1134  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8574060  3.2908256  -0.868    0.385
## AGE          0.1359544  0.1712924   0.794    0.427
## AgeSq       -0.0008595  0.0021816  -0.394    0.694
## 
## (Dispersion parameter for poisson family taken to be 1.175194)
## 
##     Null deviance: 75.372  on 40  degrees of freedom
## Residual deviance: 50.826  on 38  degrees of freedom
## AIC: 158.27
## 
## Number of Fisher Scoring iterations: 5
#From the Wald Test, the age^2 still does not have a significant p-value and should be left out of the model drop in deviance 
DropDev2<-deviance(modelA)-deviance(quadmod)
teststat3<- DropDev2/(dp*1)
pf(teststat3,df1=1,df2=37,lower.tail=FALSE)
## [1] 0.6912428
# The p-value > .05 mean there is no significant drop in deviance by adding age^2, we shoul remove age^2 from the model

###H(redo)

# Because the dispersion parameter is not depend on the Goodness of fit test . We will see the result as part H

aggregate(MATINGS ~ AGE, eleph, mean)

We are 95 % confident that the true multipicative is between … and …

```

if the residuals are high, you are under predicted

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.