birds <- read.csv("island2.csv")
trees <- read.csv("tsuga.csv")

Exercise 1.

Instructions: The file island2.csv contains information about the presence or absence of a particular species of bird across a set of islands in the Mediterranean sea. The format of the file is given below. Use this file to build a model relating species presence to the island characteristics. As the response (y) variable consists of presence/absences, you should use a binomial model, with the logit link function.

  1. The incidence variable contains information of presence/absence. Make boxplots of other variables to see which have a relationship with incidence. Using this, state which variables appear to be related the presence/absence of the species.
names(birds)
## [1] "incidence" "area"      "isolation" "quality"
birds$incidence <- factor(birds$incidence,
                          levels = c(0, 1),
                          labels = c("Absent", "Present"))
par(mfrow = c(1, 3))
boxplot(area ~ incidence,
        data = birds,
        ylab = "Area",
        xlab = "",
        ylim = c(0, 10),
        border = "gray75",
        col = c("slategrey", "goldenrod3"))
boxplot(quality ~ incidence,
        data = birds,
        ylab = "Quality",
        xlab = "Incidence",
        ylim = c(0, 10),
        border = "gray75",
        col = c("slategrey", "goldenrod3"))
boxplot(isolation ~ incidence,
        data = birds,
        ylab = "Isolation",
        xlab = "",
        ylim = c(0, 10),
        border = "gray75",
        col = c("slategrey", "goldenrod3"))

Answer: It seems that larger area and less isolation are related to presence of a species, and higher quality is probably unrelated to presence of a species.

  1. The two main explanatory variables are island area and island isolation. Using the glm() function, build a generalized linear model of the presence of bird species as explained by these variables. Report the code you used. Use the summary() function to obtain the coefficients, their significance and the AIC score.
summary(birds_glm <- glm(incidence ~ area + isolation,
                         data = birds,
                         family = binomial(link = 'logit')))
## 
## Call:
## glm(formula = incidence ~ area + isolation, family = binomial(link = "logit"), 
##     data = birds)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8189  -0.3089   0.0490   0.3635   2.1192  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   6.6417     2.9218   2.273  0.02302 * 
## area          0.5807     0.2478   2.344  0.01909 * 
## isolation    -1.3719     0.4769  -2.877  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.029  on 49  degrees of freedom
## Residual deviance: 28.402  on 47  degrees of freedom
## AIC: 34.402
## 
## Number of Fisher Scoring iterations: 6
  1. Finally, use the model to predict the probability of presence of the species on a new island with an area of 5 and an isolation distance of 6. You will need to build a new dataframe for this island. You can either modify the approach used in the last exercise or directly make a new dataframe with these variables and values. Use the predict() function to make the prediction. Note that you will need to include a parameter (type=‘response’), otherwise the predicted values will not be transformed back into a 0-1 scale. Give the predicted value and its standard error (consult the help page for predict.glm() to do this)
new_island <- data.frame("area" = 5, "isolation" = 6)

pred_new_island <- predict.glm(birds_glm,
                               new_island,
                               type = 'response',
                               se.fit = TRUE)
pred_new_island$fit
##         1 
## 0.7881208
pred_new_island$se.fit
##         1 
## 0.1125028

Answer: For a place with an area of 5 and an isolation distance of 6, the predicted incidence of birds is 0.788, with a standard error of 0.113.

Exercise 2

Instructions: The file tsuga.csv has estimates of the abundance of Hemlock trees from a set of plots in the Smoky Mountain national park (data from Jason Fridley, Syracuse University). The abundance values are in classes from 0 to 10, and these follow a Poisson distribution (discrete values, zero-bounded). Use this data to make a Poisson regression model of the abundance (cover), using both distance to stream and elevation as explanatory variables.

  1. Give the code you used to build the model.
  2. Using the summary() function, report the coefficients as log-values and their significance and the model AIC
  3. Transform the coefficients to the original (non-log) scale
  4. Give a brief interpretation of the model: Are the explanatory variables useful? What does the sign of the coefficients tell you about the relationship between Hemlock abundance and elevation and/or stream distance.
summary(trees_glm <- glm(cover ~ streamdist + elev,
                 data = trees,
                 family = poisson(link = "log")))
## 
## Call:
## glm(formula = cover ~ streamdist + elev, family = poisson(link = "log"), 
##     data = trees)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.31395  -0.82155  -0.07929   0.71900   2.62316  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.622e+00  5.226e-02  31.047  < 2e-16 ***
## streamdist  -8.963e-04  1.173e-04  -7.641 2.15e-14 ***
## elev         8.901e-05  5.653e-05   1.575    0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 748.23  on 744  degrees of freedom
## Residual deviance: 687.10  on 742  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 3150.2
## 
## Number of Fisher Scoring iterations: 4
exp(coef(trees_glm))
## (Intercept)  streamdist        elev 
##   5.0652901   0.9991041   1.0000890

Answer: AIC: 3150.2 The intercept is 5.065 (log: 1.622e+00, p<2e-16), with a stream distance coefficient of 0.999 (log: -8.963e-04, p = 2.15e-14) and an elevation coefficient of 1.000 (log: 8.901e-05, p = 0.115).

Interpretation: The expected abundance of trees in an area that is 0 units from a stream and at an elevation of 0 is about 5. With elevation constant, for every unit further from a stream, the amount of tree cover is expected to reduce by about 0.01%. The elevation coefficient is not significant, with almost no expected change in tree cover for a unit increase in elevation.

I created a model without elevation and found that, while the AIC score is slightly higher, it is only a difference of 0.5. Given that elevation is not significant, and the addition only slightly reduces the AIC, I would feel comfortable modeling cover without elevation.

In the model of cover by distance, I scaled the distance by a factor of 100. This was done to increase interpretability. While I’m not sure what the units are (feet, meters, etc.), I can now say that for every 100 units further from a stream, the percent of cover is expected to reduce by about 8.2%.

#checking that cover of 5 @ 0 distance sounds right
stream_dist_sum <- summary(trees$streamdist) #Q1: 56.57
near_stream <- trees[which(trees$streamdist < 50),]
mean(near_stream$cover) #5.599
## [1] 5.598726
#scaling distance and modeling
trees$streamdist_scaled <- trees$streamdist/100

summary(trees_glm2 <- glm(cover ~ streamdist_scaled,
                 data = trees,
                 family = poisson(link = "log")))
## 
## Call:
## glm(formula = cover ~ streamdist_scaled, family = poisson(link = "log"), 
##     data = trees)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3432  -0.8483  -0.1126   0.7154   2.6662  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.69354    0.02592  65.342  < 2e-16 ***
## streamdist_scaled -0.08540    0.01139  -7.499 6.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 748.23  on 744  degrees of freedom
## Residual deviance: 689.58  on 743  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 3150.7
## 
## Number of Fisher Scoring iterations: 4
exp(coef(trees_glm2))
##       (Intercept) streamdist_scaled 
##         5.4387095         0.9181495