setwd ("C:/doug/classes/geog6000")
birds <- read.csv("island2.csv", na.strings="")
mycol <- heat.colors(5)
summary(birds)
##    incidence         area         isolation        quality    
##  Min.   :0.00   Min.   :0.153   Min.   :2.023   Min.   :1.00  
##  1st Qu.:0.00   1st Qu.:2.248   1st Qu.:4.823   1st Qu.:3.00  
##  Median :1.00   Median :4.170   Median :5.801   Median :4.50  
##  Mean   :0.58   Mean   :4.319   Mean   :5.856   Mean   :4.66  
##  3rd Qu.:1.00   3rd Qu.:6.431   3rd Qu.:7.191   3rd Qu.:7.00  
##  Max.   :1.00   Max.   :9.269   Max.   :9.577   Max.   :9.00

The ‘incidence’ variable contains information of presence/absence. Make boxplots of other variables to see which have a relationship with ‘incidence’. Using this, state which variables appear to be related the presence/absence of the species

boxplot(area ~ incidence, data=birds, main = 'Box Plot of Area against Incidence (presence or absence)')

boxplot(isolation ~ incidence, data=birds, main = 'Box Plot of Isolation Metric against Incidence (presence or absence)')

boxplot(quality ~ incidence, data=birds, main = 'Box Plot of Quality Metric against Incidence (presence or absence)')

# There is nice separation between presence or absence on both the area and isolation variables, as revealed by the box plots. However, the quality metric appears to show the same IQR for both those islands that have and do not have the birds, so there appears to be no defining difference on that metric #

The two main explanatory variables are island area and island isolation. Using the glm() function, build a generalized linear model of the presence of bird species as explained by these variables. Report the code you used. Use the summary() function to obtain the coefficients, their significance and the AIC score

attach(birds)
birds$area.c = area - mean(area)
birds$isolation.c = isolation - mean(isolation)
bird_presence_glm1 <- glm(incidence ~ area.c + isolation.c, data = birds, family = binomial(link='logit'))
summary(bird_presence_glm1)
## 
## Call:
## glm(formula = incidence ~ area.c + isolation.c, family = binomial(link = "logit"), 
##     data = birds)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8189  -0.3089   0.0490   0.3635   2.1192  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   1.1154     0.5877   1.898  0.05770 . 
## area.c        0.5807     0.2478   2.344  0.01909 * 
## isolation.c  -1.3719     0.4769  -2.877  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.029  on 49  degrees of freedom
## Residual deviance: 28.402  on 47  degrees of freedom
## AIC: 34.402
## 
## Number of Fisher Scoring iterations: 6
(coef(bird_presence_glm1))
## (Intercept)      area.c isolation.c 
##   1.1154028   0.5807241  -1.3719411
exp(coef(bird_presence_glm1)) # convert back to odds
## (Intercept)      area.c isolation.c 
##   3.0507967   1.7873322   0.2536142

Finally, use the model to predict the probability of presence of the species on a new island with an area of 5 and an isolation distance of 6. You will need to build a new dataframe for this island. You can either modify the approach used in the last exercise or directly make a new dataframe with these variables and values. Use the predict() function to make the prediction. Note that you will need to include a parameter (type=‘response’), otherwise the predicted values will not be transformed back into a 0-1 scale. Give the predicted value and its standard error (consult the help page for predict.glm() to do this)

#?predict.glm()
newisland <- data.frame(area.c = 0.681, isolation.c = 0.144)
# trying other values newisland <- data.frame(area.c = -1, isolation.c = -44)
# this won't work newisland$area.c <- 0.681
# neither will this newisland$isolation.c <- 0.144
# trying other methods NIpredict <- predict.glm(bird_presence_glm1,type='response', se.fit=TRUE, newisland[1,])
#summary(NIpredict)
#predict(irished.glm1, newdata=newDVRT, type='response', se.fit=TRUE)
predict(bird_presence_glm1, newisland,
        type='response', se.fit=TRUE)
## $fit
##         1 
## 0.7880676 
## 
## $se.fit
##         1 
## 0.1125223 
## 
## $residual.scale
## [1] 1

the predicted value is 1 (present - odds are 0.79) and the standard error is 0.113

  1. The file tsuga.csv has estimates of the abundance of Hemlock trees from a set of plots in the Smokey Mountain national park (data from Jason Fridley, Syracuse University). The abundance values are in classes from 0 to 10, and these follow a Poisson distribution (discrete values, zero-bounded). Use this data to make a Poisson regression model of the abundance (‘cover’), using both distance to stream and elevation as explanatory variables. • Give the code you used to build the model • Using the summary() function, report the coefficients as log-values and their significance and the model AIC
tsuga <- read.csv("tsuga.csv")
summary(tsuga)
##           plotID            date        plotsize           spcode   
##  UFRL-01-0047:  4   00-00-1978:161   Min.   :   10.0   TSUGCAN:746  
##  UFRL-01-0048:  4   00-00-1977: 83   1st Qu.: 1000.0                
##  UFRL-01-0090:  4   00-00-1982: 78   Median : 1000.0                
##  UFRL-01-0093:  4   00-00-1972: 31   Mean   :  988.4                
##  UFRL-01-0156:  4   00-00-1979: 29   3rd Qu.: 1000.0                
##  UFRL-02-0003:  4   07-10-2002:  8   Max.   :10000.0                
##  (Other)     :722   (Other)   :356                                  
##              species        cover            elev             tci        
##  Tsuga canadensis:746   Min.   : 1.00   Min.   : 294.0   Min.   : 2.610  
##                         1st Qu.: 3.00   1st Qu.: 607.2   1st Qu.: 4.629  
##                         Median : 4.00   Median : 841.6   Median : 5.371  
##                         Mean   : 4.66   Mean   : 886.7   Mean   : 6.010  
##                         3rd Qu.: 7.00   3rd Qu.:1159.8   3rd Qu.: 6.669  
##                         Max.   :10.00   Max.   :1621.0   Max.   :25.000  
##                                                                          
##    streamdist    
##  Min.   :  0.00  
##  1st Qu.: 56.57  
##  Median :152.60  
##  Mean   :192.18  
##  3rd Qu.:299.70  
##  Max.   :764.90  
##  NA's   :1
plot(elev ~ cover, data=tsuga)

plot(streamdist ~ cover , data=tsuga)

tsuga.glm <- glm(cover ~ elev + streamdist,
               data=tsuga, family=poisson(link='log'))
summary(tsuga.glm)
## 
## Call:
## glm(formula = cover ~ elev + streamdist, family = poisson(link = "log"), 
##     data = tsuga)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.31395  -0.82155  -0.07929   0.71900   2.62316  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.622e+00  5.226e-02  31.047  < 2e-16 ***
## elev         8.901e-05  5.653e-05   1.575    0.115    
## streamdist  -8.963e-04  1.173e-04  -7.641 2.15e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 748.23  on 744  degrees of freedom
## Residual deviance: 687.10  on 742  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 3150.2
## 
## Number of Fisher Scoring iterations: 4

• Give a brief interpretation of the model: Are the explanatory variables useful? What does the sign of the coefficients tell you about the relationship between Hemlock abundance and elevation and/or stream distance. # as the plots showed, there is an obvious (negative) relationship between cover and distance to stream, but no real relationship with elevation. # • Transform the cofficients to the original (non-log) scale

exp(coef(tsuga.glm))
## (Intercept)        elev  streamdist 
##   5.0652901   1.0000890   0.9991041

not a hugely negative relationship, though significant. every additional unit of distrance from a stream lead to a 0.0009 decrease in expected hemlock trees, no?