Welcome to the Lake Tahoe Basin Jeffrey pine beetle study dashboard!
The data were collected during the Lake Tahoe Basin Jeffrey pine beetle outbreak in 1991-1996 from a 60-acre study area with 10,722 trees followed annually.The dataset includes 9687 rows and 13 columns.We will use two analytic approaches to find out which factors affect chances of infestation: logistic regression and Random Forest algorithm.
| Variable Names | Var Type | Description |
|---|---|---|
| DeadDist | outcome | Minimum linear distance to nearest brood tree |
| TreeDiam | predictor | Tree diameter |
| Infest_Serv1 | predictor | Infestation severity nearest to response tree |
| Ind_DeadDist | predictor | Indicator if nearest brood tree is within 50m effective distance found |
| SDI_20th | predictor | Stand Density Index at 1/20th-acre neighborhood surrounding response tree |
| BA_20th | predictor | Basal Area at 1/20th-acre neighborhood surrounding response tree |
From the map below, we can see that some of the areas were severely infested by pine beetles, while others remained relatively lightly infested.
| term | estimate | penalty |
|---|---|---|
| (Intercept) | 70.0368339 | 0.001 |
| TreeDiam | -0.0366437 | 0.001 |
| Infest_Serv1 | -0.0531255 | 0.001 |
| Ind_DeadDist | -39.7364718 | 0.001 |
| SDI_20th | -0.1558299 | 0.001 |
| BA_20th | -0.2956895 | 0.001 |
---
title: "Pine Beetle Statistical Models"
output:
flexdashboard::flex_dashboard:
theme: flatly
orientation: columns
vertical_layout: fill
source_code: embed
logo: beetle-007.jpg
---
```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(kableExtra)
library(tidyverse)
library(readxl)
library(broom)
library(car)
library(ggfortify)
library(tidymodels)
library(vip)
library(performance)
pine_tbl <- read_excel("Data_1993.xlsx", sheet = 1)
# Split data for training/testing
set.seed(7)
# Put 3/4 of the data into the training set
data_split <- initial_split(pine_tbl, prop = 0.75)
# Create data frames for the two sets:
beetle_train_tbl <- training(data_split)
beetle_test_tbl <- testing(data_split)
```
Study Overview
=======================================================================
Column {data-width=250}
-----------------------------------------------------------------------
### **Description**
**Welcome to the Lake Tahoe Basin Jeffrey pine beetle study dashboard!**
The data were collected during the Lake Tahoe Basin Jeffrey pine beetle outbreak in 1991-1996 from a 60-acre study area with 10,722 trees followed annually.The dataset includes 9687 rows and 13 columns.We will use two analytic approaches to find out which factors affect chances of infestation: logistic regression and Random Forest algorithm.
### **These tiny guys can destroy large forest areas**
```{r beetle-image, fig.width = 4, fig.height = 4}
include_graphics("beetle-003.jpg")
```
### **Data Dictionary for the selected variables**
```{r}
tibble(Variables = c("DeadDist", "TreeDiam", "Infest_Serv1",
"Ind_DeadDist", "SDI_20th", "BA_20th"),
Role = c("outcome", "predictor", "predictor", "predictor", "predictor", "predictor"),
Description = c("Minimum linear distance to nearest brood tree",
"Tree diameter",
"Infestation severity nearest to response tree",
"Indicator if nearest brood tree is within 50m effective distance found",
"Stand Density Index at 1/20th-acre neighborhood surrounding response tree",
"Basal Area at 1/20th-acre neighborhood surrounding response tree")) %>%
kable(col.names = c("Variable Names", "Var Type", "Description")) %>%
kable_styling(bootstrap_options = "striped")
```
Column {data-width=450}
-----------------------------------------------------------------------
### **Jeffrey Pine Beetle Infestation Map in Lake Tahoe Basin in 1993**
From the map below, we can see that some of the areas were severely infested by pine beetles, while others remained relatively lightly infested.
```{r fig.align='center', fig.height= 7, fig.width=12}
ggplot(data = pine_tbl,
aes(x = Easting, y = Northing)) +
geom_point(aes(color = Infest_Serv1),
alpha = 0.9) +
scale_color_gradient("",
low = "#009E73",
high = "red") +
theme_gray() +
xlab("UTM X") +
ylab("UTM Y") +
ggtitle("Infestation Map") +
theme(text = element_text(size = 15)) +
ggplot2::theme_bw()
```
Exploratory Analysis
==========================================================================
Column {data-width=300}
-----------------------------------------------------------------------
### **Tree density And beetle population**
```{r pop-density, fig.width = 13, fig.height = 10}
pine_tbl %>%
mutate(
dens_groups = case_when(SDI_20th < mean(SDI_20th) - sd(SDI_20th) ~ 1,
SDI_20th > mean(SDI_20th) + sd(SDI_20th) ~ 3,
TRUE ~ 2),
dens_groups = factor(dens_groups,
labels = c("Mean - SD",
"Mean",
"Mean + SD"))
) %>%
ggplot(aes(x = dens_groups)) +
geom_boxplot(aes(y = log(BA_Inf_20th - 0.001))) +
geom_jitter(aes(y = log(BA_Inf_20th - 0.001)),
alpha = 0.9, width = 0.25, height = 0.1,
color = "#009E73") +
labs(x = "Tree density", y = "Beetle population (log)") +
theme_bw() +
theme(text = element_text(size = 30))
```
Column {data-width=300}
-----------------------------------------------------------------------
### **Tree Diameter Histogram**
```{r}
pine_tbl |>
ggplot(aes(x = TreeDiam)) +
geom_histogram(aes(y = after_stat(density)),
binwidth = 2, color = "white", fill = "darkseagreen") +
facet_wrap(~Ind_DeadDist, nrow = 1) +
ylab("") +
ggtitle("Tree diagram",
subtitle = "By indicator if nearest brood tree is within 50m effective distance found") +
theme_bw()
```
Column {data-width=300}
-----------------------------------------------------------------------
### **DeadDist vs TreeDiam**
```{r}
pine_tbl |>
mutate(Ind_DeadDist = as.factor(Ind_DeadDist)) |>
filter(TreeDiam < 50) |>
ggplot(aes(x = TreeDiam, y = DeadDist, group = Ind_DeadDist)) +
geom_point(aes(color = Ind_DeadDist)) +
geom_smooth(method = "lm") +
scale_color_manual(values = c("#009E73", "darkolivegreen")) +
theme_bw()
```
Statistical Models
=======================================================================
Column {data-width=300}
-----------------------------------------------------------------------
### **Random Forest Variable Importance**
```{r rf-vip, fig.align='center', fig.height= 5, fig.width=8}
# Model specification
rf_mod <- rand_forest(mode = "regression", mtry = 3, trees = 500) %>%
set_engine("ranger", importance = "impurity_corrected")
# Create recipe
beetle_rec_rf <-
recipe(DeadDist ~ TreeDiam + Infest_Serv1 + Ind_DeadDist + SDI_20th + BA_20th,
data = beetle_train_tbl)
# Create workflow
beetle_wflow_rf <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(beetle_rec_rf)
# Fit the model
beetle_fit_rf <-
beetle_wflow_rf %>%
fit(data = beetle_train_tbl)
# variable importance
var_imp <- beetle_fit_rf %>%
extract_fit_engine() %>%
ranger::importance() %>%
sort(decreasing = TRUE)
tibble(variable = as.factor(names(var_imp)), importance = var_imp) %>%
ggplot(aes(y = forcats::fct_reorder(.f = variable, importance),
x = importance)) +
geom_col(fill = "#009E73") +
ylab("")
```
### **Model Performance: R-squared ($R^2$)**
```{r rf-gauge}
# Gauge R squared
rf_metrics_tbl <- last_fit(
beetle_wflow_rf,
data_split
) %>%
collect_metrics()
rf_rsq <- rf_metrics_tbl %>% filter(.metric == "rsq") %>% pull(.estimate)
gauge(value = rf_rsq, min = 0, max = 1, label = "R squared")
```
### **RMSE: Root Mean Square Error**
```{r}
rf_rmse <- rf_metrics_tbl %>% filter(.metric == "rmse") %>% pull(.estimate)
gauge(value = rf_rmse, min = 0, max = 20, label = "RMSE")
```
Column {data-width=300}
-----------------------------------------------------------------------
### **Ridge Regression Model**
```{r rr-fit}
# Model specification
rr_mod <- linear_reg(penalty = 0.001, mixture = 0) %>%
set_engine("glmnet")
# Create recipe
beetle_rec_rr <-
recipe(DeadDist ~ TreeDiam + Infest_Serv1 + Ind_DeadDist + SDI_20th + BA_20th,
data = beetle_train_tbl)
# Create workflow
beetle_wflow_rr <-
workflow() %>%
add_model(rr_mod) %>%
add_recipe(beetle_rec_rr)
# Fit the model
beetle_fit_rr <-
beetle_wflow_rr %>%
fit(data = beetle_train_tbl)
beetle_fit_rr %>%
tidy() %>%
kable() %>%
kable_classic()
```
### **Model Performance: R-squared ($R^2$)**
```{r rr-gauge}
# Gauge R squared
rr_metrics_tbl <- last_fit(
beetle_wflow_rr,
data_split
) %>%
collect_metrics()
rr_rsq <- rr_metrics_tbl %>% filter(.metric == "rsq") %>% pull(.estimate)
gauge(value = rr_rsq, min = 0, max = 1, label = "R squared")
```
### **RMSE: Root Mean Square Error**
```{r}
rr_rmse <- rr_metrics_tbl %>% filter(.metric == "rmse") %>% pull(.estimate)
gauge(value = rr_rmse, min = 0, max = 20, label = "RMSE")
```
Column {data-width=300}
-----------------------------------------------------------------------
### **Performance Random Forest**
``` {r rf-test, fig.align='center', fig.height= 5, fig.width=8}
# Predict values
beetle_pred_rf <-
predict(beetle_fit_rf, beetle_test_tbl) %>%
bind_cols(beetle_test_tbl %>% select(DeadDist))
beetle_pred_rf %>%
rename(prediction = .pred, actual = DeadDist) %>%
ggplot(aes(x = prediction, y = actual)) +
geom_point(color = "#009E73") +
geom_abline(slope = 1, intercept = 0,
color = "orangered",
linetype = "dotted", linewidth = 1.5) +
ggtitle("Random Rorest model prediction vs actual values") +
theme_bw()
```
### **Performance Ridge Regression**
``` {r rr-test, fig.align='center', fig.height= 5, fig.width=8}
# Predict values
beetle_pred_rr <-
predict(beetle_fit_rr, beetle_test_tbl) %>%
bind_cols(beetle_test_tbl %>% select(DeadDist))
beetle_pred_rr %>%
rename(prediction = .pred, actual = DeadDist) %>%
ggplot(aes(x = prediction, y = actual)) +
geom_point(color = "#009E73") +
geom_abline(slope = 1, intercept = 0,
color = "orangered",
linetype = "dotted", linewidth = 1.5) +
ggtitle("Ridge Regression model prediction vs actual values") +
theme_bw()
```