Lab 05 Exercises

Eric Goodwin || October 7, 2021

Exercise 1

The file island2.csv contains information about the presence or absence of a particular species of bird across a set of islands in the Mediterranean sea. The format of the file is given below. Use this file to build a model relating species presence to the island characteristics. As the response (y) variable consists of presence/absences, you should use a binomial model, with the logit link function.

The ‘incidence’ variable contains information of presence/absence. Make boxplots of other variables to see which have a relationship with ‘incidence’. Using this, state which variables appear to be related the presence/absence of the species

  • Getting the data prepped:
setwd("N:/Projects/geog6000/lab05")
island2 = read.csv("../datafiles/island2.csv")
island2$incidence = factor(island2$incidence,
                           levels = c(0,1),
                           labels = c("absence", "presence"))
  • Let’s see what looks related to presence of our bird, shall we?
boxplot(area ~ incidence, data = island2)

  • It looks like the presence of the bird is associated with greater island area. Interesting.
boxplot(isolation ~ incidence, data = island2)

  • And it definitely looks like fewer of our study species are showing up in areas with high isolation.
boxplot(quality ~ incidence, data = island2)

  • It’s pretty clear there’s no distinct change in presence/absence of our bird based on the quality variable.

The two main explanatory variables are island area and island isolation. Using the glm() function, build a generalized linear model of the presence of bird species as explained by these variables. Report the code you used. Use the summary() function to obtain the coefficients, their significance and the AIC score

island2$area.cen = island2$area - mean(island2$area)
island2$isolation.cen = island2$isolation - mean(island2$isolation)
island2.glm1 = glm(incidence ~ area.cen + isolation.cen,
                   data = island2,
                   family = binomial(link = 'logit'))
summary(island2.glm1)
## 
## Call:
## glm(formula = incidence ~ area.cen + isolation.cen, family = binomial(link = "logit"), 
##     data = island2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8189  -0.3089   0.0490   0.3635   2.1192  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     1.1154     0.5877   1.898  0.05770 . 
## area.cen        0.5807     0.2478   2.344  0.01909 * 
## isolation.cen  -1.3719     0.4769  -2.877  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.029  on 49  degrees of freedom
## Residual deviance: 28.402  on 47  degrees of freedom
## AIC: 34.402
## 
## Number of Fisher Scoring iterations: 6
  • I’m going to transform the log-odds coefficients for better interpretation.
exp(coef(island2.glm1))
##   (Intercept)      area.cen isolation.cen 
##     3.0507967     1.7873322     0.2536142
  • Given average area and isolation, the bird will be present 3 times for every 1 time it’s absent.

Finally, use the model to predict the probability of presence of the species on a new island with an area of 5 and an isolation distance of 6. You will need to build a new dataframe for this island. You can either modify the approach used in the last exercise or directly make a new dataframe with these variables and values. Use the predict() function to make the prediction. Note that you will need to include a parameter (type=‘response’), otherwise the predicted values will not be transformed back into a 0-1 scale. Give the predicted value and its standard error (consult the help page for predict.glm() to do this)

newisland = data.frame(area.cen = 5 - mean(island2$area),
                       isolation.cen = 6 - mean(island2$isolation))
predict(island2.glm1,
        newdata = newisland,
        type = 'response',
        se.fit = TRUE)
## $fit
##         1 
## 0.7881208 
## 
## $se.fit
##         1 
## 0.1125028 
## 
## $residual.scale
## [1] 1
  • There is about a 78.8% chance this new island will have our species of bird present.

Exercise 2

The file tsuga.csv has estimates of the abundance of Hemlock trees from a set of plots in the Smoky Mountain national park (data from Jason Fridley, Syracuse University). The abundance values are in classes from 0 to 10, and these follow a Poisson distribution (discrete values, zero-bounded). Use this data to make a Poisson regression model of the abundance (‘cover’), using both distance to stream and elevation as explanatory variables.

Give the code you used to build the model

tsuga = read.csv("../datafiles/tsuga.csv")
tsuga.glm = glm(cover ~ streamdist + elev,
                data = tsuga,
                family = poisson(link = 'log'))

Using the summary() function, report the coefficients as log-values and their significance and the model AIC

summary(tsuga.glm)
## 
## Call:
## glm(formula = cover ~ streamdist + elev, family = poisson(link = "log"), 
##     data = tsuga)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.31395  -0.82155  -0.07929   0.71900   2.62316  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.622e+00  5.226e-02  31.047  < 2e-16 ***
## streamdist  -8.963e-04  1.173e-04  -7.641 2.15e-14 ***
## elev         8.901e-05  5.653e-05   1.575    0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 748.23  on 744  degrees of freedom
## Residual deviance: 687.10  on 742  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 3150.2
## 
## Number of Fisher Scoring iterations: 4

Transform the coefficients to the original (non-log) scale

exp(coef(tsuga.glm))
## (Intercept)  streamdist        elev 
##   5.0652901   0.9991041   1.0000890

Give a brief interpretation of the model: Are the explanatory variables useful? What does the sign of the coefficients tell you about the relationship between Hemlock abundance and elevation and/or stream distance.

  • The model shows us that while distance from stream is a significant predictor for Hemlock abundance (very low p-value), elevation is probably not a good predictor (p-value of 0.115). As distance from stream increases by 1 from the intercept (where distance from stream and elevation = 0), the Hemlock abundance value is reduced by 5.0652901*0.9991041 = 5.06075211.

  • Thus, by using a Poisson regression, we have seen how increasing distance from the stream reduces the abundance values of the Hemlock in the Great Smoky Mountains National Park

  • P.S. I am from Tennessee, so I was happy to work with this dataset =)