Section 1: Introduction and Problem Background

Ecologists are concerned about growing numbers of mountain pine beetles, which in the past few years have been causing excessive damage to forests in areas such as Colorado. They are interested in knowing what factors most affect pine beetle infestation, and want predictions in order to know how to best combat the infestation. The goal of inference is to learn which variables most affect whether an area will have pine beetle infestation or not. The goal of prediction is to predict the probability of an area having pine beetle infestation given a certain set of explanatory variables.

library(tidyverse)
knitr::opts_chunk$set(message = FALSE)

df <- read_csv("Pine_Beetle_Data.csv") %>% 
  mutate(Infested = ifelse(Infested == "Yes", 1, 0)) %>% 
  mutate_if(is.character, as.factor)

Data was collected on several environmental variables for different regions in colorado along with whether or not that region had pine beetle infestation. Following are scatterplots overlaid with smooth curves for each one of the quantitative explanatory variables.

df %>% 
  pivot_longer(c(January, August_max, Slope, Elev, Precip)) %>% 
  ggplot(aes(x = value, y = Infested)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~name, scales = "free")

Most of these smooth curves are fairly monotonic, meaning that they either constantly increase or decrease, but August looks like it goes up and then back down, so it might not meet the assumption of monotonicity. However, we will continue with thie study.

Because we are interested in predicting whether an area is infested, we cannot use multiple linear regression to achieve our above goals, since our response variable is categorical. Since an area being infested follows a Bernoulli distribution, we will use logistic regression. Logistic regression will return a probability between zero and one of an area being infested.

Section 2: Statistical Modeling

In order to identify which variables to include in our model, we ran an exhaustive test of all possible subsets of our explanatory variables and compared the AIC of each of them, and then selected the model with the lowest AIC. We decided to run an exhaustive test so that we would pick the very best model possible. We chose to use AIC as the metric of choice because prediction seems to be an important goal for this study. We learned that the important variables to include in our model are January, August, Slope, Elevation, Precipitation, North Central, Southeast, and Southwest.

best_model <- df %>% 
  as.data.frame %>% 
  bestglm::bestglm(
    family = binomial,
    IC = "AIC",
    method = "exhaustive"
  )

glm <- glm(
  Infested ~ January + August_max + Slope + Elev + Precip + NC + SE + SW,
  family = binomial,
  data = df
)

Following is our statistical model:

\[Y_i \sim^{iid} Bern(p_i) \]

\[log(p_i/(1 - p_i)) = \beta_0 + \beta_1 January_i + \beta_2 August_i + \beta_3 Slope_i + \beta_4 Elev_i + \beta_5 Precip_i + \beta_6 NC_i + \beta_7 SE_i + \beta_8 SW_i\]

When average January temperature increases by one, the probability of the area being infested is multiplied by \(e^{\beta_1}\). If an area hypothetically became a southeast region instead of a region outside of the southeast, southwest, and north central regions, the probability of the area being infested is multiplied by \(e^{\beta_7}\).

When we perform logistic regression, we assume that the data is monotonic and that the observations are independent. We can test the assumption of monotonicity by plotting scatterplots of each quantitative explanatory variable with the response variable along with a smooth curve, and seeing if the smooth curve is strictly increasing or decreasing. Following are scatterplots with smooth curves for each quantitative explanatory variable in our model.

df %>% 
  pivot_longer(c(January, August_max, Slope, Elev, Precip)) %>% 
  ggplot(aes(x = value, y = Infested)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~name, scales = "free")

All of these plots look fairly monotonic, except for the August_max plot. It seems that for the August_max plot, the smooth curve increases for the first half, and then decreases for the second half, which breaks the assumption of monotonicity. However, we will continue with our study.

Section 3: Results

When we fit our model, we get the following table of estimates for our model parameters. There are point estimates for each effect, along with lower and upper bounds for a 95% confidence interval.

glm %>% 
  confint %>% 
  as.data.frame %>% 
  rownames_to_column("Variable") %>% 
  transmute(
    Variable = Variable,
    Lower = `2.5 %`,
    Point = glm$coefficients,
    Upper = `97.5 %`
  ) %>% 
  knitr::kable(
    digits = 4,
    caption = "95% confidence intervals for model parameters"
  )
95% confidence intervals for model parameters
Variable Lower Point Upper
(Intercept) -4.9937 -2.2897 0.3958
January -0.2192 -0.1646 -0.1107
August_max -0.1333 -0.0861 -0.0391
Slope -0.0130 0.0551 0.1243
Elev 0.0000 0.0002 0.0004
Precip 0.0022 0.0030 0.0038
NCYes -1.7209 -1.3290 -0.9421
SEYes -1.0649 -0.7301 -0.3954
SWYes 0.0764 0.4975 0.9285
jan_effect <- tibble(
  lwr = confint(glm)["January", "2.5 %"],
  point = coefficients(glm)["January"],
  upr = confint(glm)["January", "97.5 %"]
) %>% 
  mutate_all(exp) %>% 
  mutate_all(round, 4)

One example of interpreting these parameters is that when average january minimum temperature increases by one, the probability of the area being infested is multiplied by 0.8483, and that we are 95% confident that it will be multiplied by between 0.8032 and 0.8952.

fitted <- predict.glm(glm, type="response")
roc <- pROC::roc(df$Infested, fitted)
auc <- roc$auc %>% 
  round(4)
thresh <- seq(from=0, to=1, length=1000) 
misclass <- rep(NA,length=length(thresh))
for(i in 1:length(thresh)) {
  my.classification <- ifelse(fitted>thresh[i], 1, 0)
  misclass[i] <- mean(my.classification!=df$Infested)
}
thresh_point <- thresh[which.min(misclass)]

We selected a cutoff classification probability by finding the classification threshold that would give us the highest proportion of correctly predicted observations. In this example, our classification threshold was 0.4874875. The ROC curve can be graphed as follows:

ggplot() + 
  geom_line(aes(x = 1 - roc$specificities, y = roc$sensitivities)) + 
  geom_abline(slope = 1, intercept = 0) +
  labs(
    title = "ROC curve",
    x = "1 - Specificity",
    y = "Sensitivity"
  )

The area under the curve is 0.7555. A confusion matrix for our data calculated with the above cutoff threshold is as follows:

table(
  df$Infested,
  ifelse(fitted > thresh_point, 1, 0)
) %>% 
  addmargins %>% 
  `row.names<-`(c("Actual No", "Actual Yes", "Sum")) %>% 
  knitr::kable(
    col.names = c("Predicted No", "Predicted Yes", "Sum"),
    caption = "Confusion Matrix"
  )
Confusion Matrix
Predicted No Predicted Yes Sum
Actual No 245 402 647
Actual Yes 74 1589 1663
Sum 319 1991 2310

From this confusion matrix, we can calculate sensitivity, specificity, positive predicted value, and negative predicted value, which are metrics we can use to see how well our model fits our data. Our sensitivity is 0.9555, which is good since it means that out of the areas that were infested, we predicted that they were infested 95.55% of the time. Our specificity is 0.3787, which means that out of the areas that were not infested, we predicted that they were not infested 37.87% of the time. Our positive predicted value is 0.7981, which means that when we predicted that an area was infested, it was infested 79.81% of the time. Our negative predicted value is 0.7680, which means that when we predicted that an area was not infested, it was not infested 76.80% of the time. These metrics showed that our model did a very good job at predicting areas that were infested, but did not do a great job at predicting areas that were not infested.

In order to see how well our model does at predicting new locations where pine beetle infestations will occur, we run a cross validation study on our data. We ran one hundred cross validations of our data, fit our model, and calculated accuracy metrics.

## Choose number of CV studies to run in a loop & test set size
n.cv <- 100
n.test <- round(.1*nrow(df))

## Initialize matrices to hold CV results
sens <- rep(NA, n.cv)
spec <- rep(NA, n.cv)
ppv <- rep(NA, n.cv)
npv <- rep(NA, n.cv)
auc <- rep(NA, n.cv)

## Begin for loop
for(cv in 1:n.cv){
  ## Separate into test and training sets
  test.obs <- sample(1:nrow(df), n.test)
  
  test.set <- df[test.obs,]
  train.set <- df[-test.obs,]
  
  ## Fit best model to training set
  train.model <- glm(
    Infested ~ January + August_max + Slope + Elev + Precip + NC + SE, 
    data=train.set, 
    family=binomial
  )
  
  ## Use fitted model to predict test set
  pred.probs <- predict.glm(train.model,newdata=test.set, type="response") #response gives probabilities
  
  ## Classify according to threshold
  test.class <- ifelse(pred.probs>thresh_point, 1, 0)
  
  ## Create a confusion matrix
  conf.mat <- addmargins(table(factor(test.set$Infested), factor(test.class)))
  
  ## Pull of sensitivity, specificity, PPV and NPV using bracket notation
  sens[cv] <- conf.mat[2,2]/conf.mat[2,3]
  spec[cv] <- conf.mat[1,1]/conf.mat[1,3]
  ppv[cv] <- conf.mat[2,2]/conf.mat[3,2]
  npv[cv] <- conf.mat[1,1]/conf.mat[3,1]
  
  ## Calculate AUC
  auc[cv] <- pROC::auc(pROC::roc(test.set$Infested, pred.probs))
}

cv_df <- tibble(
  Sensitivity = sens,
  Specificity = spec,
  `Positive Predicted Value` = ppv,
  `Negative Predicted Value` = npv,
  `Area under curve` = auc
)
  
cv_df %>% 
  pivot_longer(everything()) %>% 
  ggplot() +
  geom_density(aes(x = value)) +
  facet_wrap(~name, ncol = 2, scales = "free")

cv_df %>% 
  summarize_all(mean) %>% 
  round(4) %>% 
  t %>% 
  knitr::kable(
    col.names = "Measure",
    caption = "Cross Validation Metrics"
  )
Cross Validation Metrics
Measure
Sensitivity 0.9486
Specificity 0.3576
Positive Predicted Value 0.7940
Negative Predicted Value 0.7288
Area under curve 0.7535

From the above table and graphs, we can see that our model does a good job at predicting when an area is infested, but does not do great at predicting when an area is not infested.

The probabilities of the region being infested each year are displayed in the following table:

future <- tibble(
  January = c(-13.98, -17.80, -17.27, -12.52, -15.99, -11.97, -15.75, -16.19, -17.87, -12.44),
  August_max = c(15.89, 18.07, 16.74, 18.06, 18.23, 15.81, 16.85, 16.51, 17.84, 16.96),
  Slope = 18.07,
  Elev = 1901.95,
  Precip = c(771.13, 788.54, 677.63, 522.77, 732.32, 615.96, 805.90, 714.57, 740.50, 801.22),
  NC = factor("No"),
  NW = factor("No"),
  EC = factor("No"),
  WC = factor("No"),
  SE = factor("Yes"),
  SC = factor("No"),
  SW = factor("No")
)

predict.glm(glm, newdata = future, type="response") %>% 
  enframe() %>% 
  transmute(
    Year = 2018:2027,
    Prediction = round(value, 4)
  ) %>% 
  knitr::kable(
    caption = "Predictions"
  )
Predictions
Year Prediction
2018 0.8390
2019 0.8951
2020 0.8629
2021 0.6178
2022 0.8408
2023 0.7032
2024 0.8769
2025 0.8572
2026 0.8841
2027 0.8015

According to the predictions, it seems likely that the area will be infested over each of the next ten years. Since each of the probabilities is higher than our cutoff threshold probability, we would predict that it will be infested each year. Therefore, I would suggest that the forest service concentrate on this area in order to prevent infestation.

Section 4: Conclusions

The most important takeaways from this study are that the most important variables for predicting whether or not an area is infested with beetles are minimum January temperature, maximum August temperature, Slope, Elevation, Precipitation, North Central, Southwest, and Southeast region. We also specified a statistical model with which we can create predictions for if an area will become infested or not. These predictions will help ecologists know which regions most need help in order to avoid pine beetle infestations.

Another important factor that could affect whether an area has pine beetle infestations would be the location of the area. If there are nearby areas with pine beetle infestation, that area is much more likely to also have pine beetle infestation. Another important variable to consider would be how pine beetle infestations have changed over time. If repeated measures of the same area are recorded, we can see how infestations are changing over time in order to better predict what will happen in the future.