Exercise 1

The file island2.csv contains information about the presence or absence of a particular species of bird across a set of islands in the Mediterranean sea. The format of the file is given below. Use this file to build a model relating species presence to the island characteristics. As the response (\(y\)) variable consists of presence/absences, you should use a binomial model, with the logit link function.

Reading in the data

island2 <- read.csv("../datafiles/island2-1.csv")

# change incidence variable to factor
island2$incidence <- factor(island2$incidence, levels = c(0, 1),
                            labels = c("Absence", "Presence"))

The ‘incidence’ variable contains information of presence/absence. Make boxplots of other variables to see which have a relationship with ‘incidence’. Using this, state which variables appear to be related the presence/absence of the species and whether the relationship is positive (probability of presence increases as the variable increases) or negative (probability of presence decreases as the variable increases)

library(ggplot2)

bird_colors = c("#756bb1", "#1C9099")

ggplot(island2) +
  geom_boxplot(aes(x = incidence, y = area, fill = incidence),
               show.legend = F) +
  labs(title = "Area Data by Incedence",
       x = "Incidence",
       y = "Area") +
  scale_fill_manual(values = bird_colors) +
  theme_bw()

Comparing the area data by incidence, it appears that there is a positive relationship. The areas with an incidence presence are typically larger than those without an incidence.

ggplot(island2) +
  geom_boxplot(aes(x = incidence, y = isolation, fill = incidence),
               show.legend = F) +
  labs(title = "Isolation Data by Incedence",
       x = "Incidence",
       y = "Isolation") +
  scale_fill_manual(values = bird_colors) +
  theme_bw()

The isolation data also has a relationship with incidence. Isolation is higher where the incident is absent verses when it is present.

ggplot(island2) +
  geom_boxplot(aes(x = incidence, y = quality, fill = incidence),
               show.legend = F) +
  labs(title = "Quality Data by Incedence",
       x = "Incidence",
       y = "Quality") +
  scale_fill_manual(values = bird_colors) +
  theme_bw()

There looks like there is a difference between the mean value of quality between the absence or presence of incidence, however this may not be statistically significant.

The two main explanatory variables are island area and island isolation. Using the glm() function, build a generalized linear model of the presence of bird species as explained by these variables. Report the code you used. Use the summary() function to obtain the coefficients, their significance and the AIC score

bird_model <- glm(incidence ~ area + isolation, data = island2,
                  family = binomial(link = 'logit'))

summary(bird_model)
## 
## Call:
## glm(formula = incidence ~ area + isolation, family = binomial(link = "logit"), 
##     data = island2)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   6.6417     2.9218   2.273  0.02302 * 
## area          0.5807     0.2478   2.344  0.01909 * 
## isolation    -1.3719     0.4769  -2.877  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.029  on 49  degrees of freedom
## Residual deviance: 28.402  on 47  degrees of freedom
## AIC: 34.402
## 
## Number of Fisher Scoring iterations: 6

The coefficent values are (Intercept) = 6.6417, (area) = 0.5807, (isolation) = (-1.3719). The intercept and area coefficients are significant at 95% confidence, the isolation coefficient is significant at 99% confidence. The AIC for the model is 34.402.

Give a brief interpretation of the coefficients

exp(coef(bird_model))
## (Intercept)        area   isolation 
## 766.3669575   1.7873322   0.2536142

The intercept (\(\beta_0\)) is the odds of a species being present with an average area of 766. The area coefficient is the rate at which the odds of incidence change 766x1.79 = 1,371.74 for every unit increase in area higher than the average area. The isolation coefficient is the rate at which the odds change 766*0.25 = 191.5 for every unit increase in isolation.

Finally, use the model to predict the probability of presence of the species on a new island with an area of 5 and an isolation distance of 6. You will need to build a new dataframe for this island. You can either modify the approach used in the last exercise or directly make a new dataframe with these variables and values. Use the predict() function to make the prediction. Note that you will need to include a parameter (type='response'), otherwise the predicted values will not be transformed back into a 0-1 scale. Give the predicted value and its standard error (consult the help page for predict.glm() to do this)

new_island <- data.frame(area = 5, isolation = 6)

predict.glm(bird_model, newdata = new_island, type = "response", se.fit = T)
## $fit
##         1 
## 0.7881208 
## 
## $se.fit
##         1 
## 0.1125028 
## 
## $residual.scale
## [1] 1

The predicted value for the new island is 1 (Present), with an estimated standard error of 0.788.


Exercise 2

The file tsuga.csv has estimates of the abundance of Hemlock trees from a set of plots in the Smoky Mountain national park (data from Jason Fridley, Syracuse University). The abundance values are in classes from 0 to 10, and these follow a Poisson distribution (discrete values, zero-bounded). Use this data to make a Poisson regression model of the abundance (‘cover’), using both distance to stream and elevation as explanatory variables.

tsuga <- read.csv("../datafiles/tsuga.csv")

#tsuga$cover <- factor(tsuga$cover, levels = c(1:10))

Lets take a look at the data.

summary(tsuga)
##     plotID              date              plotsize          spcode         
##  Length:746         Length:746         Min.   :   10.0   Length:746        
##  Class :character   Class :character   1st Qu.: 1000.0   Class :character  
##  Mode  :character   Mode  :character   Median : 1000.0   Mode  :character  
##                                        Mean   :  988.4                     
##                                        3rd Qu.: 1000.0                     
##                                        Max.   :10000.0                     
##                                                                            
##    species              cover            elev             tci        
##  Length:746         Min.   : 1.00   Min.   : 294.0   Min.   : 2.610  
##  Class :character   1st Qu.: 3.00   1st Qu.: 607.2   1st Qu.: 4.629  
##  Mode  :character   Median : 4.00   Median : 841.6   Median : 5.371  
##                     Mean   : 4.66   Mean   : 886.7   Mean   : 6.010  
##                     3rd Qu.: 7.00   3rd Qu.:1159.8   3rd Qu.: 6.669  
##                     Max.   :10.00   Max.   :1621.0   Max.   :25.000  
##                                                                      
##    streamdist    
##  Min.   :  0.00  
##  1st Qu.: 56.57  
##  Median :152.60  
##  Mean   :192.18  
##  3rd Qu.:299.70  
##  Max.   :764.90  
##  NA's   :1

There’s 1 missing data in streamdist. Lets omit that row.

tsuga <- na.omit(tsuga)

summary(tsuga)
##     plotID              date              plotsize          spcode         
##  Length:745         Length:745         Min.   :   10.0   Length:745        
##  Class :character   Class :character   1st Qu.: 1000.0   Class :character  
##  Mode  :character   Mode  :character   Median : 1000.0   Mode  :character  
##                                        Mean   :  976.3                     
##                                        3rd Qu.: 1000.0                     
##                                        Max.   :10000.0                     
##    species              cover             elev             tci        
##  Length:745         Min.   : 1.000   Min.   : 294.0   Min.   : 2.610  
##  Class :character   1st Qu.: 3.000   1st Qu.: 607.0   1st Qu.: 4.635  
##  Mode  :character   Median : 4.000   Median : 841.6   Median : 5.371  
##                     Mean   : 4.656   Mean   : 886.1   Mean   : 6.012  
##                     3rd Qu.: 7.000   3rd Qu.:1159.0   3rd Qu.: 6.669  
##                     Max.   :10.000   Max.   :1621.0   Max.   :25.000  
##    streamdist    
##  Min.   :  0.00  
##  1st Qu.: 56.57  
##  Median :152.60  
##  Mean   :192.18  
##  3rd Qu.:299.70  
##  Max.   :764.90

There are ID, date, and species name and code variables. The average plot size is 976.3, elevation ranges from 294 to 1621. The average tci is 6.012 and the average stream distance is 192.18. Cover is a categorical variable, I will change that to a factor now.

Next step is to look at the distribution of the variable values.

ggplot(tsuga) +
  geom_histogram(aes(x = elev)) +
  labs(title = "Distribution of Elevation Values",
       x = "Elevation",
       y = "Count") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It looks like there are many samples around elevations around 500 to 600, and less samples at the elevation extremes.

ggplot(tsuga) +
  geom_histogram(aes(x = streamdist)) +
  labs(title = "Distribution of Stream Distance Values",
       x = "Distance",
       y = "Count") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The stream distance value has a strong negative exponential distribution. There are many more samples close to the stream than far away from the stream.

tsuga$coverType <- factor(tsuga$cover)

ggplot(tsuga) +
  geom_boxplot(aes(y = coverType, x = elev, fill = coverType),
               show.legend = F) +
  labs(title = "Cover ~ Elev",
       y = "Cover",
       x = "Elevation") +
  theme_bw()

It doesn’t look like there is a strong relationship between Cover and Elevation. Some cover types like 2, 8, and 9 have higher mean elevation, than other cover types, but most are around 800-900.

ggplot(tsuga) +
  geom_boxplot(aes(x = coverType, y = streamdist, fill = coverType),
               show.legend = F) +
  labs(title = "Cover ~ Streamdistance",
       x = "Cover",
       y = "Stream Distance") +
  theme_bw()

It looks like there is a stronger relationship between cover type and stream distance, with the lower the cover type number, the higher the distance from the stream.

ggplot(tsuga) +
  geom_point(aes(x = elev, y = streamdist, colour = coverType),
             size = 3, stroke = 0) +
  labs(title = "Cover ~ Streamdistance + Elevation",
       x = "Elevation",
       y = "Stream Distance",
       colour = "Cover") +
  theme_bw()

Using both elevation and stream distance, there doesn’t appear to be much relationship. There is no clear trend between the three variables.

Give the code you used to build the model

tree_model <- glm(cover ~ streamdist + elev, data = tsuga,
                  family = poisson(link = 'log'))
summary(tree_model)
## 
## Call:
## glm(formula = cover ~ streamdist + elev, family = poisson(link = "log"), 
##     data = tsuga)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.622e+00  5.226e-02  31.047  < 2e-16 ***
## streamdist  -8.963e-04  1.173e-04  -7.641 2.15e-14 ***
## elev         8.901e-05  5.653e-05   1.575    0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 748.23  on 744  degrees of freedom
## Residual deviance: 687.10  on 742  degrees of freedom
## AIC: 3150.2
## 
## Number of Fisher Scoring iterations: 4

Using the summary() function, report the coefficients as log-values and their significance and the model AIC

The model AIC is 3150.2, which seems high. The streamdist variable is very significant, but the elevation is not. As we saw earlier there isn’t much of a relationship between cover and elevation.

coef(tree_model)
##   (Intercept)    streamdist          elev 
##  1.622411e+00 -8.963237e-04  8.901257e-05

The coefficients for the log-values are shown above.

Transform the cofficients to the original (non-log) scale

exp(coef(tree_model))
## (Intercept)  streamdist        elev 
##   5.0652901   0.9991041   1.0000890

The values (non-log) for the coefficients are shown above.

Give a brief interpretation of the model: Are the explanatory variables useful? What does the sign of the coefficients tell you about the relationship between Hemlock abundance and elevation and/or stream distance.

The model is:

\[ y_{cover} = \beta_0 + \beta_1x_{streamdist} + \beta_2x_{elevation} \]

where

\[ \beta_0 = 5.065 (Intercept),\] \[\beta_1 = 0.999 (stream distance rate),\] \[\beta_2 = 1.000 (elevation rate).\] Because the values of \(\beta_0\) and \(\beta_1\) are close to 1.0 we can say that for a one-unit increase in either of those values, the cover value will also increase.