Study Overview

Column

Description

Welcome to the Lake Tahoe Basin Jeffrey pine beetle study dashboard!

The data were collected during the Lake Tahoe Basin Jeffrey pine beetle outbreak in 1991-1996 from a 60-acre study area with 10,722 trees followed annually.The dataset includes 9687 rows and 13 columns.We will use two analytic approaches to find out which factors affect chances of infestation: logistic regression and Random Forest algorithm.

These tiny guys can destroy large forest areas

Data Dictionary for the selected variables

Variable Names Var Type Description
DeadDist outcome Minimum linear distance to nearest brood tree
TreeDiam predictor Tree diameter
Infest_Serv1 predictor Infestation severity nearest to response tree
Ind_DeadDist predictor Indicator if nearest brood tree is within 50m effective distance found
SDI_20th predictor Stand Density Index at 1/20th-acre neighborhood surrounding response tree
BA_20th predictor Basal Area at 1/20th-acre neighborhood surrounding response tree

Column

Jeffrey Pine Beetle Infestation Map in Lake Tahoe Basin in 1993

From the map below, we can see that some of the areas were severely infested by pine beetles, while others remained relatively lightly infested.

Exploratory Analysis

Column

Tree density And beetle population

Column

Tree Diameter Histogram

Column

DeadDist vs TreeDiam

Statistical Models

Column

Random Forest Variable Importance

Model Performance: R-squared (\(R^2\))

RMSE: Root Mean Square Error

Column

Ridge Regression Model

term estimate penalty
(Intercept) 70.0368339 0.001
TreeDiam -0.0366437 0.001
Infest_Serv1 -0.0531255 0.001
Ind_DeadDist -39.7364718 0.001
SDI_20th -0.1558299 0.001
BA_20th -0.2956895 0.001

Model Performance: R-squared (\(R^2\))

RMSE: Root Mean Square Error

Column

Performance Random Forest

Performance Ridge Regression

---
title: "Pine Beetle Statistical Models"

output: 
  flexdashboard::flex_dashboard:
    theme: flatly
    orientation: columns
    vertical_layout: fill
    source_code: embed
    logo: beetle-007.jpg
---

```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(kableExtra)
library(tidyverse)
library(readxl)
library(broom)
library(car)
library(ggfortify)
library(tidymodels)
library(vip)
library(performance)

pine_tbl <- read_excel("Data_1993.xlsx", sheet = 1)

# Split data for training/testing
set.seed(7)

# Put 3/4 of the data into the training set 
data_split <- initial_split(pine_tbl, prop = 0.75)

# Create data frames for the two sets:
beetle_train_tbl <- training(data_split) 
beetle_test_tbl  <- testing(data_split)
```

Study Overview
=======================================================================

Column {data-width=250}
-----------------------------------------------------------------------

### **Description**

**Welcome to the Lake Tahoe Basin Jeffrey pine beetle study dashboard!**

The data were collected during the Lake Tahoe Basin Jeffrey pine beetle outbreak in 1991-1996 from a 60-acre study area with 10,722 trees followed annually.The dataset includes 9687 rows and 13 columns.We will use two analytic approaches to find out which factors affect chances of infestation: logistic regression and Random Forest algorithm.

### **These tiny guys can destroy large forest areas**

```{r beetle-image, fig.width = 4, fig.height = 4}


include_graphics("beetle-003.jpg")


```

### **Data Dictionary for the selected variables**

```{r}
tibble(Variables = c("DeadDist", "TreeDiam", "Infest_Serv1", 
                     "Ind_DeadDist", "SDI_20th", "BA_20th"), 
       Role = c("outcome", "predictor", "predictor", "predictor", "predictor", "predictor"),
           Description = c("Minimum linear distance to nearest brood tree",
                           "Tree diameter", 
                          "Infestation severity nearest to response tree", 
                          "Indicator if nearest brood tree is within 50m effective distance found",
                          "Stand Density Index at 1/20th-acre neighborhood surrounding response tree",
                          "Basal Area at 1/20th-acre neighborhood surrounding response tree")) %>% 
  kable(col.names = c("Variable Names", "Var Type", "Description")) %>% 
  kable_styling(bootstrap_options = "striped")

```

Column {data-width=450}
-----------------------------------------------------------------------

### **Jeffrey Pine Beetle Infestation Map in Lake Tahoe Basin in 1993**

From the map below, we can see that some of the areas were severely infested by pine beetles, while others remained relatively lightly infested. 

```{r  fig.align='center', fig.height= 7, fig.width=12}
ggplot(data = pine_tbl, 
       aes(x = Easting, y = Northing)) +
  geom_point(aes(color = Infest_Serv1),
                 alpha = 0.9) +
  scale_color_gradient("",
                       low = "#009E73",
                       high = "red") +
  theme_gray() +
  xlab("UTM X") +
  ylab("UTM Y") +
  ggtitle("Infestation Map") +
  theme(text = element_text(size = 15)) +
  ggplot2::theme_bw()
```

Exploratory Analysis
==========================================================================

Column {data-width=300}
-----------------------------------------------------------------------

### **Tree density And beetle population**

```{r pop-density, fig.width = 13, fig.height = 10}

pine_tbl %>%
  mutate(
    dens_groups = case_when(SDI_20th < mean(SDI_20th) - sd(SDI_20th) ~ 1,
                            SDI_20th > mean(SDI_20th) + sd(SDI_20th) ~ 3,
                            TRUE ~ 2),
    dens_groups = factor(dens_groups,
                         labels = c("Mean - SD",
                                    "Mean",
                                    "Mean + SD"))
    ) %>%
  ggplot(aes(x = dens_groups)) +
  geom_boxplot(aes(y = log(BA_Inf_20th - 0.001))) +
  geom_jitter(aes(y = log(BA_Inf_20th - 0.001)),
              alpha = 0.9, width = 0.25, height = 0.1,
              color = "#009E73") +
  labs(x = "Tree density", y = "Beetle population (log)") +
  theme_bw() +
  theme(text = element_text(size = 30))

```


Column {data-width=300}
-----------------------------------------------------------------------
### **Tree Diameter Histogram**

```{r}
pine_tbl |>
  ggplot(aes(x = TreeDiam)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 2, color = "white", fill = "darkseagreen") +
  facet_wrap(~Ind_DeadDist, nrow = 1) +
  ylab("") +
  ggtitle("Tree diagram", 
          subtitle = "By indicator if nearest brood tree is within 50m effective distance found") +
  theme_bw()
```

Column {data-width=300}
-----------------------------------------------------------------------

### **DeadDist vs TreeDiam**
```{r}
pine_tbl |>
  mutate(Ind_DeadDist = as.factor(Ind_DeadDist)) |>
  filter(TreeDiam < 50) |>
  ggplot(aes(x = TreeDiam, y = DeadDist, group = Ind_DeadDist)) +
  geom_point(aes(color = Ind_DeadDist)) +
  geom_smooth(method = "lm") +
  scale_color_manual(values = c("#009E73", "darkolivegreen")) +
  theme_bw()
```



Statistical Models
=======================================================================

Column {data-width=300}
-----------------------------------------------------------------------

### **Random Forest Variable Importance**

```{r rf-vip, fig.align='center', fig.height= 5, fig.width=8}

# Model specification
rf_mod <- rand_forest(mode = "regression", mtry = 3, trees = 500) %>%
  set_engine("ranger", importance = "impurity_corrected")

# Create recipe
beetle_rec_rf <-
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + Ind_DeadDist + SDI_20th + BA_20th,
         data = beetle_train_tbl) 

# Create workflow
beetle_wflow_rf <- 
  workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(beetle_rec_rf)

# Fit the model
beetle_fit_rf <- 
  beetle_wflow_rf %>% 
  fit(data = beetle_train_tbl)

# variable importance
var_imp <- beetle_fit_rf %>% 
  extract_fit_engine() %>% 
  ranger::importance() %>% 
  sort(decreasing = TRUE) 

tibble(variable = as.factor(names(var_imp)), importance = var_imp) %>% 
  ggplot(aes(y = forcats::fct_reorder(.f = variable, importance), 
             x = importance)) +
  geom_col(fill = "#009E73") +
  ylab("")
```


### **Model Performance: R-squared ($R^2$)**

```{r rf-gauge}

# Gauge R squared

rf_metrics_tbl <- last_fit(
  beetle_wflow_rf,
  data_split
) %>%
  collect_metrics()

rf_rsq <- rf_metrics_tbl %>% filter(.metric == "rsq") %>% pull(.estimate)

gauge(value = rf_rsq, min = 0, max = 1, label = "R squared")

```

### **RMSE: Root Mean Square Error**
```{r}
rf_rmse <- rf_metrics_tbl %>% filter(.metric == "rmse") %>% pull(.estimate)
gauge(value = rf_rmse, min = 0, max = 20, label = "RMSE")
```



Column {data-width=300}
-----------------------------------------------------------------------

### **Ridge Regression  Model**

```{r rr-fit}

# Model specification
rr_mod <- linear_reg(penalty = 0.001, mixture = 0) %>% 
  set_engine("glmnet")

# Create recipe
beetle_rec_rr <-
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + Ind_DeadDist + SDI_20th + BA_20th,
         data = beetle_train_tbl) 

# Create workflow
beetle_wflow_rr <- 
  workflow() %>% 
  add_model(rr_mod) %>% 
  add_recipe(beetle_rec_rr)

# Fit the model
beetle_fit_rr <- 
  beetle_wflow_rr %>% 
  fit(data = beetle_train_tbl)

beetle_fit_rr %>% 
  tidy() %>% 
  kable() %>% 
  kable_classic()


```

### **Model Performance: R-squared ($R^2$)**

```{r rr-gauge}


# Gauge R squared

rr_metrics_tbl <- last_fit(
  beetle_wflow_rr,
  data_split
) %>%
  collect_metrics()

rr_rsq <- rr_metrics_tbl %>% filter(.metric == "rsq") %>% pull(.estimate)

gauge(value = rr_rsq, min = 0, max = 1, label = "R squared")

```

### **RMSE: Root Mean Square Error** 

```{r}
rr_rmse <- rr_metrics_tbl %>% filter(.metric == "rmse") %>% pull(.estimate)
gauge(value = rr_rmse, min = 0, max = 20, label = "RMSE")
```



Column {data-width=300}
-----------------------------------------------------------------------

### **Performance Random Forest**

``` {r rf-test, fig.align='center', fig.height= 5, fig.width=8}

# Predict values
beetle_pred_rf <- 
  predict(beetle_fit_rf, beetle_test_tbl) %>%
  bind_cols(beetle_test_tbl %>% select(DeadDist))

beetle_pred_rf %>% 
  rename(prediction = .pred, actual = DeadDist) %>% 
  ggplot(aes(x = prediction, y = actual)) +
  geom_point(color = "#009E73") +
  geom_abline(slope = 1, intercept = 0, 
              color = "orangered", 
              linetype = "dotted", linewidth = 1.5) +
  ggtitle("Random Rorest model prediction vs actual values") +
  theme_bw()


```

### **Performance Ridge Regression**

``` {r rr-test, fig.align='center', fig.height= 5, fig.width=8}

# Predict values
beetle_pred_rr <- 
  predict(beetle_fit_rr, beetle_test_tbl) %>%
  bind_cols(beetle_test_tbl %>% select(DeadDist))

beetle_pred_rr %>% 
  rename(prediction = .pred, actual = DeadDist) %>% 
  ggplot(aes(x = prediction, y = actual)) +
  geom_point(color = "#009E73") +
  geom_abline(slope = 1, intercept = 0, 
              color = "orangered", 
              linetype = "dotted", linewidth = 1.5) +
  ggtitle("Ridge Regression model prediction vs actual values") +
  theme_bw()


```