birds <- read.csv("island2.csv")
trees <- read.csv("tsuga.csv")
Instructions: The file island2.csv contains information about the presence or absence of a particular species of bird across a set of islands in the Mediterranean sea. The format of the file is given below. Use this file to build a model relating species presence to the island characteristics. As the response (y) variable consists of presence/absences, you should use a binomial model, with the logit link function.
incidence variable contains information of presence/absence. Make boxplots of other variables to see which have a relationship with incidence. Using this, state which variables appear to be related the presence/absence of the species.names(birds)
## [1] "incidence" "area" "isolation" "quality"
birds$incidence <- factor(birds$incidence,
levels = c(0, 1),
labels = c("Absent", "Present"))
par(mfrow = c(1, 3))
boxplot(area ~ incidence,
data = birds,
ylab = "Area",
xlab = "",
ylim = c(0, 10),
border = "gray75",
col = c("slategrey", "goldenrod3"))
boxplot(quality ~ incidence,
data = birds,
ylab = "Quality",
xlab = "Incidence",
ylim = c(0, 10),
border = "gray75",
col = c("slategrey", "goldenrod3"))
boxplot(isolation ~ incidence,
data = birds,
ylab = "Isolation",
xlab = "",
ylim = c(0, 10),
border = "gray75",
col = c("slategrey", "goldenrod3"))
Answer: It seems that larger area and less isolation are related to presence of a species, and higher quality is probably unrelated to presence of a species.
summary(birds_glm <- glm(incidence ~ area + isolation,
data = birds,
family = binomial(link = 'logit')))
##
## Call:
## glm(formula = incidence ~ area + isolation, family = binomial(link = "logit"),
## data = birds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8189 -0.3089 0.0490 0.3635 2.1192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.6417 2.9218 2.273 0.02302 *
## area 0.5807 0.2478 2.344 0.01909 *
## isolation -1.3719 0.4769 -2.877 0.00401 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.029 on 49 degrees of freedom
## Residual deviance: 28.402 on 47 degrees of freedom
## AIC: 34.402
##
## Number of Fisher Scoring iterations: 6
new_island <- data.frame("area" = 5, "isolation" = 6)
pred_new_island <- predict.glm(birds_glm,
new_island,
type = 'response',
se.fit = TRUE)
pred_new_island$fit
## 1
## 0.7881208
pred_new_island$se.fit
## 1
## 0.1125028
Answer: For a place with an area of 5 and an isolation distance of 6, the predicted incidence of birds is 0.788, with a standard error of 0.113.
Instructions: The file tsuga.csv has estimates of the abundance of Hemlock trees from a set of plots in the Smoky Mountain national park (data from Jason Fridley, Syracuse University). The abundance values are in classes from 0 to 10, and these follow a Poisson distribution (discrete values, zero-bounded). Use this data to make a Poisson regression model of the abundance (cover), using both distance to stream and elevation as explanatory variables.
summary(trees_glm <- glm(cover ~ streamdist + elev,
data = trees,
family = poisson(link = "log")))
##
## Call:
## glm(formula = cover ~ streamdist + elev, family = poisson(link = "log"),
## data = trees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31395 -0.82155 -0.07929 0.71900 2.62316
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.622e+00 5.226e-02 31.047 < 2e-16 ***
## streamdist -8.963e-04 1.173e-04 -7.641 2.15e-14 ***
## elev 8.901e-05 5.653e-05 1.575 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 748.23 on 744 degrees of freedom
## Residual deviance: 687.10 on 742 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 3150.2
##
## Number of Fisher Scoring iterations: 4
exp(coef(trees_glm))
## (Intercept) streamdist elev
## 5.0652901 0.9991041 1.0000890
Answer: AIC: 3150.2 The intercept is 5.065 (log: 1.622e+00, p<2e-16), with a stream distance coefficient of 0.999 (log: -8.963e-04, p = 2.15e-14) and an elevation coefficient of 1.000 (log: 8.901e-05, p = 0.115).
Interpretation: The expected abundance of trees in an area that is 0 units from a stream and at an elevation of 0 is about 5. With elevation constant, for every unit further from a stream, the amount of tree cover is expected to reduce by about 0.01%. The elevation coefficient is not significant, with almost no expected change in tree cover for a unit increase in elevation.
I created a model without elevation and found that, while the AIC score is slightly higher, it is only a difference of 0.5. Given that elevation is not significant, and the addition only slightly reduces the AIC, I would feel comfortable modeling cover without elevation.
In the model of cover by distance, I scaled the distance by a factor of 100. This was done to increase interpretability. While I’m not sure what the units are (feet, meters, etc.), I can now say that for every 100 units further from a stream, the percent of cover is expected to reduce by about 8.2%.
#checking that cover of 5 @ 0 distance sounds right
stream_dist_sum <- summary(trees$streamdist) #Q1: 56.57
near_stream <- trees[which(trees$streamdist < 50),]
mean(near_stream$cover) #5.599
## [1] 5.598726
#scaling distance and modeling
trees$streamdist_scaled <- trees$streamdist/100
summary(trees_glm2 <- glm(cover ~ streamdist_scaled,
data = trees,
family = poisson(link = "log")))
##
## Call:
## glm(formula = cover ~ streamdist_scaled, family = poisson(link = "log"),
## data = trees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3432 -0.8483 -0.1126 0.7154 2.6662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.69354 0.02592 65.342 < 2e-16 ***
## streamdist_scaled -0.08540 0.01139 -7.499 6.43e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 748.23 on 744 degrees of freedom
## Residual deviance: 689.58 on 743 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 3150.7
##
## Number of Fisher Scoring iterations: 4
exp(coef(trees_glm2))
## (Intercept) streamdist_scaled
## 5.4387095 0.9181495