Ecologists are concerned about growing numbers of mountain pine beetles, which in the past few years have been causing excessive damage to forests in areas such as Colorado. They are interested in knowing what factors most affect pine beetle infestation, and want predictions in order to know how to best combat the infestation. The goal of inference is to learn which variables most affect whether an area will have pine beetle infestation or not. The goal of prediction is to predict the probability of an area having pine beetle infestation given a certain set of explanatory variables.
library(tidyverse)
knitr::opts_chunk$set(message = FALSE)
df <- read_csv("Pine_Beetle_Data.csv") %>%
mutate(Infested = ifelse(Infested == "Yes", 1, 0)) %>%
mutate_if(is.character, as.factor)
Data was collected on several environmental variables for different regions in colorado along with whether or not that region had pine beetle infestation. Following are scatterplots overlaid with smooth curves for each one of the quantitative explanatory variables.
df %>%
pivot_longer(c(January, August_max, Slope, Elev, Precip)) %>%
ggplot(aes(x = value, y = Infested)) +
geom_point() +
geom_smooth(se = FALSE) +
facet_wrap(~name, scales = "free")
Most of these smooth curves are fairly monotonic, meaning that they either constantly increase or decrease, but August looks like it goes up and then back down, so it might not meet the assumption of monotonicity. However, we will continue with thie study.
Because we are interested in predicting whether an area is infested, we cannot use multiple linear regression to achieve our above goals, since our response variable is categorical. Since an area being infested follows a Bernoulli distribution, we will use logistic regression. Logistic regression will return a probability between zero and one of an area being infested.
In order to identify which variables to include in our model, we ran an exhaustive test of all possible subsets of our explanatory variables and compared the AIC of each of them, and then selected the model with the lowest AIC. We decided to run an exhaustive test so that we would pick the very best model possible. We chose to use AIC as the metric of choice because prediction seems to be an important goal for this study. We learned that the important variables to include in our model are January, August, Slope, Elevation, Precipitation, North Central, Southeast, and Southwest.
best_model <- df %>%
as.data.frame %>%
bestglm::bestglm(
family = binomial,
IC = "AIC",
method = "exhaustive"
)
glm <- glm(
Infested ~ January + August_max + Slope + Elev + Precip + NC + SE + SW,
family = binomial,
data = df
)
Following is our statistical model:
\[Y_i \sim^{iid} Bern(p_i) \]
\[log(p_i/(1 - p_i)) = \beta_0 + \beta_1 January_i + \beta_2 August_i + \beta_3 Slope_i + \beta_4 Elev_i + \beta_5 Precip_i + \beta_6 NC_i + \beta_7 SE_i + \beta_8 SW_i\]
When average January temperature increases by one, the probability of the area being infested is multiplied by \(e^{\beta_1}\). If an area hypothetically became a southeast region instead of a region outside of the southeast, southwest, and north central regions, the probability of the area being infested is multiplied by \(e^{\beta_7}\).
When we perform logistic regression, we assume that the data is monotonic and that the observations are independent. We can test the assumption of monotonicity by plotting scatterplots of each quantitative explanatory variable with the response variable along with a smooth curve, and seeing if the smooth curve is strictly increasing or decreasing. Following are scatterplots with smooth curves for each quantitative explanatory variable in our model.
df %>%
pivot_longer(c(January, August_max, Slope, Elev, Precip)) %>%
ggplot(aes(x = value, y = Infested)) +
geom_point() +
geom_smooth(se = FALSE) +
facet_wrap(~name, scales = "free")
All of these plots look fairly monotonic, except for the August_max plot. It seems that for the August_max plot, the smooth curve increases for the first half, and then decreases for the second half, which breaks the assumption of monotonicity. However, we will continue with our study.
When we fit our model, we get the following table of estimates for our model parameters. There are point estimates for each effect, along with lower and upper bounds for a 95% confidence interval.
glm %>%
confint %>%
as.data.frame %>%
rownames_to_column("Variable") %>%
transmute(
Variable = Variable,
Lower = `2.5 %`,
Point = glm$coefficients,
Upper = `97.5 %`
) %>%
knitr::kable(
digits = 4,
caption = "95% confidence intervals for model parameters"
)
| Variable | Lower | Point | Upper |
|---|---|---|---|
| (Intercept) | -4.9937 | -2.2897 | 0.3958 |
| January | -0.2192 | -0.1646 | -0.1107 |
| August_max | -0.1333 | -0.0861 | -0.0391 |
| Slope | -0.0130 | 0.0551 | 0.1243 |
| Elev | 0.0000 | 0.0002 | 0.0004 |
| Precip | 0.0022 | 0.0030 | 0.0038 |
| NCYes | -1.7209 | -1.3290 | -0.9421 |
| SEYes | -1.0649 | -0.7301 | -0.3954 |
| SWYes | 0.0764 | 0.4975 | 0.9285 |
jan_effect <- tibble(
lwr = confint(glm)["January", "2.5 %"],
point = coefficients(glm)["January"],
upr = confint(glm)["January", "97.5 %"]
) %>%
mutate_all(exp) %>%
mutate_all(round, 4)
One example of interpreting these parameters is that when average january minimum temperature increases by one, the probability of the area being infested is multiplied by 0.8483, and that we are 95% confident that it will be multiplied by between 0.8032 and 0.8952.
fitted <- predict.glm(glm, type="response")
roc <- pROC::roc(df$Infested, fitted)
auc <- roc$auc %>%
round(4)
thresh <- seq(from=0, to=1, length=1000)
misclass <- rep(NA,length=length(thresh))
for(i in 1:length(thresh)) {
my.classification <- ifelse(fitted>thresh[i], 1, 0)
misclass[i] <- mean(my.classification!=df$Infested)
}
thresh_point <- thresh[which.min(misclass)]
We selected a cutoff classification probability by finding the classification threshold that would give us the highest proportion of correctly predicted observations. In this example, our classification threshold was 0.4874875. The ROC curve can be graphed as follows:
ggplot() +
geom_line(aes(x = 1 - roc$specificities, y = roc$sensitivities)) +
geom_abline(slope = 1, intercept = 0) +
labs(
title = "ROC curve",
x = "1 - Specificity",
y = "Sensitivity"
)
The area under the curve is 0.7555. A confusion matrix for our data calculated with the above cutoff threshold is as follows:
table(
df$Infested,
ifelse(fitted > thresh_point, 1, 0)
) %>%
addmargins %>%
`row.names<-`(c("Actual No", "Actual Yes", "Sum")) %>%
knitr::kable(
col.names = c("Predicted No", "Predicted Yes", "Sum"),
caption = "Confusion Matrix"
)
| Predicted No | Predicted Yes | Sum | |
|---|---|---|---|
| Actual No | 245 | 402 | 647 |
| Actual Yes | 74 | 1589 | 1663 |
| Sum | 319 | 1991 | 2310 |
From this confusion matrix, we can calculate sensitivity, specificity, positive predicted value, and negative predicted value, which are metrics we can use to see how well our model fits our data. Our sensitivity is 0.9555, which is good since it means that out of the areas that were infested, we predicted that they were infested 95.55% of the time. Our specificity is 0.3787, which means that out of the areas that were not infested, we predicted that they were not infested 37.87% of the time. Our positive predicted value is 0.7981, which means that when we predicted that an area was infested, it was infested 79.81% of the time. Our negative predicted value is 0.7680, which means that when we predicted that an area was not infested, it was not infested 76.80% of the time. These metrics showed that our model did a very good job at predicting areas that were infested, but did not do a great job at predicting areas that were not infested.
In order to see how well our model does at predicting new locations where pine beetle infestations will occur, we run a cross validation study on our data. We ran one hundred cross validations of our data, fit our model, and calculated accuracy metrics.
## Choose number of CV studies to run in a loop & test set size
n.cv <- 100
n.test <- round(.1*nrow(df))
## Initialize matrices to hold CV results
sens <- rep(NA, n.cv)
spec <- rep(NA, n.cv)
ppv <- rep(NA, n.cv)
npv <- rep(NA, n.cv)
auc <- rep(NA, n.cv)
## Begin for loop
for(cv in 1:n.cv){
## Separate into test and training sets
test.obs <- sample(1:nrow(df), n.test)
test.set <- df[test.obs,]
train.set <- df[-test.obs,]
## Fit best model to training set
train.model <- glm(
Infested ~ January + August_max + Slope + Elev + Precip + NC + SE,
data=train.set,
family=binomial
)
## Use fitted model to predict test set
pred.probs <- predict.glm(train.model,newdata=test.set, type="response") #response gives probabilities
## Classify according to threshold
test.class <- ifelse(pred.probs>thresh_point, 1, 0)
## Create a confusion matrix
conf.mat <- addmargins(table(factor(test.set$Infested), factor(test.class)))
## Pull of sensitivity, specificity, PPV and NPV using bracket notation
sens[cv] <- conf.mat[2,2]/conf.mat[2,3]
spec[cv] <- conf.mat[1,1]/conf.mat[1,3]
ppv[cv] <- conf.mat[2,2]/conf.mat[3,2]
npv[cv] <- conf.mat[1,1]/conf.mat[3,1]
## Calculate AUC
auc[cv] <- pROC::auc(pROC::roc(test.set$Infested, pred.probs))
}
cv_df <- tibble(
Sensitivity = sens,
Specificity = spec,
`Positive Predicted Value` = ppv,
`Negative Predicted Value` = npv,
`Area under curve` = auc
)
cv_df %>%
pivot_longer(everything()) %>%
ggplot() +
geom_density(aes(x = value)) +
facet_wrap(~name, ncol = 2, scales = "free")
cv_df %>%
summarize_all(mean) %>%
round(4) %>%
t %>%
knitr::kable(
col.names = "Measure",
caption = "Cross Validation Metrics"
)
| Measure | |
|---|---|
| Sensitivity | 0.9486 |
| Specificity | 0.3576 |
| Positive Predicted Value | 0.7940 |
| Negative Predicted Value | 0.7288 |
| Area under curve | 0.7535 |
From the above table and graphs, we can see that our model does a good job at predicting when an area is infested, but does not do great at predicting when an area is not infested.
The probabilities of the region being infested each year are displayed in the following table:
future <- tibble(
January = c(-13.98, -17.80, -17.27, -12.52, -15.99, -11.97, -15.75, -16.19, -17.87, -12.44),
August_max = c(15.89, 18.07, 16.74, 18.06, 18.23, 15.81, 16.85, 16.51, 17.84, 16.96),
Slope = 18.07,
Elev = 1901.95,
Precip = c(771.13, 788.54, 677.63, 522.77, 732.32, 615.96, 805.90, 714.57, 740.50, 801.22),
NC = factor("No"),
NW = factor("No"),
EC = factor("No"),
WC = factor("No"),
SE = factor("Yes"),
SC = factor("No"),
SW = factor("No")
)
predict.glm(glm, newdata = future, type="response") %>%
enframe() %>%
transmute(
Year = 2018:2027,
Prediction = round(value, 4)
) %>%
knitr::kable(
caption = "Predictions"
)
| Year | Prediction |
|---|---|
| 2018 | 0.8390 |
| 2019 | 0.8951 |
| 2020 | 0.8629 |
| 2021 | 0.6178 |
| 2022 | 0.8408 |
| 2023 | 0.7032 |
| 2024 | 0.8769 |
| 2025 | 0.8572 |
| 2026 | 0.8841 |
| 2027 | 0.8015 |
According to the predictions, it seems likely that the area will be infested over each of the next ten years. Since each of the probabilities is higher than our cutoff threshold probability, we would predict that it will be infested each year. Therefore, I would suggest that the forest service concentrate on this area in order to prevent infestation.
The most important takeaways from this study are that the most important variables for predicting whether or not an area is infested with beetles are minimum January temperature, maximum August temperature, Slope, Elevation, Precipitation, North Central, Southwest, and Southeast region. We also specified a statistical model with which we can create predictions for if an area will become infested or not. These predictions will help ecologists know which regions most need help in order to avoid pine beetle infestations.
Another important factor that could affect whether an area has pine beetle infestations would be the location of the area. If there are nearby areas with pine beetle infestation, that area is much more likely to also have pine beetle infestation. Another important variable to consider would be how pine beetle infestations have changed over time. If repeated measures of the same area are recorded, we can see how infestations are changing over time in order to better predict what will happen in the future.