Introduction to Project and Dataset

Column

Project Overview

This project uses the data collected at Lake Tahoe Basin regarding the Jeffrey pine beetle outbreak between 1991-1996. From 1991 to 1996, Jeffrey pine beetles (JPB) caused tree mortality throughout the Lake Tahoe Basin during a severe drought. The data set describes the dynamics within the Lake Tahoe Basin of a 60-acre study area with 10,722 trees followed annually and assesses patterns of JPB-caused mortality. This project uses the information in the ‘pine beetle’ dataset to analyze and predict the minimum linear distance to the nearest brood tree (DeadDist). To do this we will use both linear and ridge regression models in order to determine the ‘DeadDist’ outcome as well as to determine regression model variations.

The predictors used for these models were infestation severity closest to the response tree (Infest_Serv1), the basal area total for all infested trees within 1/2 acre neighborhood (BA_Infest_1/2th), stand density index @ 1/20th-acre neighborhood surrounding response tree (SDI_20th), the basal area total summed for all trees within 1-acre neighborhood of response tree (Neigh_1), and the indicator of any infested trees within 1.5-acre neighborhood of response tree (IND_BA_infest_1.5).

Jeffery Pine Beetle

Column

Data Variable Explanation

Variables Description
TreeDiam Tree diameter/size
Infest_sever1 Infestation severity nearest to response tree
Invest_sever2 Infestation severity nearest to response tree
Ind_DeadDist Indicator if nearest brood tree is within 50m effective distance found
DeadDist Minimum linear distance to nearest brood tree
SDI_20th Stand Density Index @ 1/20th-acre neighborhood surrounding response tree
Neigh_SDI_1/4th Stand Density Index @ 1/4th-acre neighborhood surrounding response tree
BA_20th Basal Area @ 1/20th-acre neighborhood surrounding response tree
Neigh_1/4th Basal area total summed for all trees within 1/4th-acre neighborhood of response tree
Neigh_1/2th Basal area total summed for all trees within 1/2-acre neighborhood of response tree
Neigh_1 Basal area total summed for all trees within 1-acre neighborhood of response tree
Neigh_1.5 Basal area total summed for all trees within 1.5-acre neighborhood of response tree
BA_Inf_20th Basal area total for all infested trees within 1/20th-acre neighborhood
BA_infest_1/4th Basal area total for all infested trees within 1/4th-acre neighborhood
BA_infest_1/2th Basal area total for all infested trees within 1/2-acre neighborhood
BA_infest_1 Basal area total for all infested trees within 1-acre neighborhood
BA_infest_1.5 Basal area total for all infested trees within 1.5-acre neighborhood
IND_BA_Infest_20th Binary indicator for if a response tree has any infested trees within neigborhood
IND_BA_infest_1/4th Indicator of any infested trees within 1/4th-acre neighborhood of response tree
IND_BA_infest_1/2th Indicator of any infested trees within 1/2-acre neighborhood of response tree
IND_BA_infest_1 Indicator of any infested trees within 1-acre neighborhood of response tree
IND_BA_infest_1.5 Indicator of any infested trees within 1.5-acre neighborhood of response tree

Selected Predictors Correlation

term DeadDist Infest_Serv1 BA_Infest_1.2th SDI_20th IND_BA_Infest_1.5 Neigh_1
DeadDist -.03 -.45 -.35 -.76 -.58
Infest_Serv1 -.03 .47 -.12 -.01 -.17
BA_Infest_1/2th -.45 .47 .17 .22 .26
SDI_20th -.35 -.12 .17 .18 .51
IND_BA_Infest_1.5 -.76 -.01 .22 .18 .41
Neigh_1 -.58 -.17 .26 .51 .41

Predictor Overview

Regression Models

Column

Tidy Model Fit Table (Linear)

# A tibble: 6 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        8.79     0.0452       194.  0        
2 `BA_Infest_1/2th` -0.0746   0.00139      -53.5 0        
3 Infest_Serv1       0.00733  0.000726      10.1 7.24e- 24
4 SDI_20th          -0.0197   0.00101      -19.5 1.93e- 83
5 Neigh_1           -0.00530  0.000212     -25.0 1.22e-133
6 IND_BA_Infest_1.5 -2.61     0.0346       -75.3 0        

VIP Plot (Linear)

Column

Tidy Model Fit Table (Ridge)

# A tibble: 6 × 3
  term              estimate penalty
  <chr>                <dbl>   <dbl>
1 (Intercept)         4.51     0.163
2 BA_Infest_1/2th    -0.665    0.163
3 Infest_Serv1        0.0897   0.163
4 SDI_20th           -0.247    0.163
5 Neigh_1            -0.382    0.163
6 IND_BA_Infest_1.5  -0.845    0.163

VIP Plot (Ridge)

Column

R^2 Evaluation (Linear)

R^2 Evaluation (Ridge)

Conclusion: Regression Outcomes

When comparing the liner and ridge regressions we can see that similar outcomes are provided. The VIP plots show similar variable importance distributions, and the R^2 values are very similar.

The R^2 values from the models indicate that ~68% of the variability in ‘DeadDist’ is explained by the model. From this we can say that the model moderately fits the data, but it can be improved. To increase the fit of the model, more predictors can be added or the lower importance predictors can be switched for higher importance predictors.

---
title: "Predicting 'DeadDist' Using Regression Models"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: 
      version: 4
      bootswatch: minty
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(DT)
library(ggplot2)
library(plotly)
library(corrr)
library(emo)
library(tidyverse)
library(readxl)
library(broom)
library(car)
library(ggfortify)
library(tidymodels)
library(vip)
library(performance)
library(GGally)
library(corrr)

pine_tbl <- read_excel("Data/Data_1993.xlsx", sheet = 1)

my_theme <- theme(axis.text.x = element_text(size = 24),
                  axis.text.y = element_text(size = 24),
                  axis.title.x = element_text(size = 24))

```

Introduction to Project and Dataset 
========================================

Column {data-width=375} 
-----------------------------------------------------------------------

### **Project Overview** 

This project uses the data collected at Lake Tahoe Basin regarding the Jeffrey 
pine beetle outbreak between 1991-1996. From 1991 to 1996, Jeffrey pine beetles 
(JPB) caused tree mortality throughout the Lake Tahoe Basin during a severe drought. 
The data set describes the dynamics within the Lake Tahoe Basin of a 60-acre study 
area with 10,722 trees followed annually and assesses patterns of JPB-caused mortality. 
This project uses the information in the 'pine beetle' dataset to analyze and predict 
the minimum linear distance to the nearest brood tree (DeadDist). To do this we will 
use both linear and ridge regression models in order to determine the 'DeadDist' outcome
as well as to determine regression model variations. 

The predictors used for these models were infestation severity 
closest to the response tree (Infest_Serv1), the basal area total for all 
infested trees within 1/2 acre neighborhood (BA_Infest_1/2th), stand density 
index @ 1/20th-acre neighborhood surrounding response tree (SDI_20th), 
the basal area total summed for all trees within 1-acre neighborhood of response 
tree (Neigh_1), and the indicator of any infested trees within 1.5-acre neighborhood 
of response tree (IND_BA_infest_1.5).

### **Jeffery Pine Beetle**

```{r photo input, out.width='100%'}
knitr::include_graphics("~/Desktop/HGEN_612_Project2/Dendroctonus_ponderosae.jpg")

```



Column {data-width=625} 
-----------------------------------------------------------------------

### **Data Variable Explanation** 

```{r variable input, out.width='100%'}

data.frame(Variables = c("TreeDiam", "Infest_sever1", "Invest_sever2", "Ind_DeadDist", 
                         "DeadDist", "SDI_20th", "Neigh_SDI_1/4th", "BA_20th", 
                         "Neigh_1/4th", "Neigh_1/2th", "Neigh_1", "Neigh_1.5", 
                         "BA_Inf_20th", "BA_infest_1/4th", "BA_infest_1/2th", "BA_infest_1", 
                         "BA_infest_1.5", "IND_BA_Infest_20th", "IND_BA_infest_1/4th", 
                         "IND_BA_infest_1/2th", "IND_BA_infest_1", "IND_BA_infest_1.5"),
           
           Description = c("Tree diameter/size", 
                           "Infestation severity nearest to response tree", 
                           "Infestation severity nearest to response tree", 
                           "Indicator if nearest brood tree is within 50m effective distance found", 
                           "Minimum linear distance to nearest brood tree",
                           "Stand Density Index @ 1/20th-acre neighborhood surrounding response tree", 
                           "Stand Density Index @ 1/4th-acre neighborhood surrounding response tree",
                           "Basal Area @ 1/20th-acre neighborhood surrounding response tree", 
                           "Basal area total summed for all trees within 1/4th-acre neighborhood of response tree",
                           "Basal area total summed for all trees within 1/2-acre neighborhood of response tree", 
                           "Basal area total summed for all trees within 1-acre neighborhood of response tree", 
                           "Basal area total summed for all trees within 1.5-acre neighborhood of response tree", 
                           "Basal area total for all infested trees within 1/20th-acre neighborhood", 
                           "Basal area total for all infested trees within 1/4th-acre neighborhood", 
                           "Basal area total for all infested trees within 1/2-acre neighborhood", 
                           "Basal area total for all infested trees within 1-acre neighborhood", 
                           "Basal area total for all infested trees within 1.5-acre neighborhood", 
                           "Binary indicator for if a response tree has any infested trees within neigborhood", 
                           "Indicator of any infested trees within 1/4th-acre neighborhood of response tree", 
                           "Indicator of any infested trees within 1/2-acre neighborhood of response tree", 
                           "Indicator of any infested trees within 1-acre neighborhood of response tree", 
                           "Indicator of any infested trees within 1.5-acre neighborhood of response tree")) %>% 
  knitr::kable()
```

### **Selected Predictors Correlation** 

```{r predictors correlation, out.width='100%'}

pine_tbl_select <- pine_tbl %>% 
  select("DeadDist", "BA_Infest_1/2th", "Infest_Serv1", "SDI_20th", "Neigh_1", "IND_BA_Infest_1.5")

pine_tbl_select_cor <- correlate(pine_tbl_select) 


pine_tbl_select_cor %>% 
  rearrange() %>% 
  fashion() %>% 
  knitr::kable()

```


Predictor Overview
========================================
``` {r predictor overview}
ggpairs(pine_tbl_select)
```


Regression Models
========================================

Column{data-width=375} 
-----------------------------------------------------------------------

### **Tidy Model Fit Table (Linear)**
``` {r Linear Regression model build out}

pine_tbl_recipe <- 
  recipe(DeadDist ~ ., data = pine_tbl_select) %>% 
  step_sqrt(all_outcomes()) %>% 
  step_corr(all_predictors()) 

lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

pine_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(pine_tbl_recipe)

pine_fit <- 
  pine_wflow %>% 
  fit(data = pine_tbl_select)

lm_rsq_pine_fit = pine_fit %>% 
  extract_fit_parsnip() %>% 
  glance() %>% 
  select(r.squared) %>% 
  round(., 3)

pine_fit %>% 
  extract_fit_parsnip() %>% 
  tidy()


```

### **VIP Plot (Linear)**
``` {r VIP Plot, fig.width = 12, fig.height= 7}

vip_lm <- pine_fit %>% 
  extract_fit_parsnip() %>% 
  vip::vip()

vip_lm + my_theme

```

Column {data-width=375} 
-----------------------------------------------------------------------
``` {r ridge set up, include=FALSE}

pine_split <- initial_split(pine_tbl_select)
pine_train <- training(pine_split)
pine_test <- testing(pine_split)

ridge_mod <-
  linear_reg(mixture = 0, penalty = 0.1629751) %>%  
  set_engine("glmnet")

ridge_mod %>% 
  translate()
  
pine_rec <- pine_train %>% 
  recipe(DeadDist ~ ., data = pine_tbl_select) %>% 
  step_sqrt(all_outcomes()) %>% 
  step_corr(all_predictors()) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_zv(all_numeric(), -all_outcomes()) #%>% 


pine_ridge_wflow <- 
  workflow() %>% 
  add_model(ridge_mod) %>% 
  add_recipe(pine_rec)

pine_ridge_wflow

pine_ridge_fit <- 
  pine_ridge_wflow %>% 
  fit(data = pine_train)

pine_ridge_evaluation <- last_fit(pine_ridge_wflow, pine_split) %>%
  collect_metrics()

pine_ridge_rsq = round(pine_ridge_evaluation[2,]$.estimate,3)

```


### **Tidy Model Fit Table (Ridge)**

``` {r tidy table ridge}

pine_ridge_fit %>% 
  extract_fit_parsnip() %>% 
  tidy()

```

### **VIP Plot (Ridge)**
``` {r VIP Plot ridge, fig.width = 12, fig.height= 7}

vip_ridge <- pine_ridge_fit %>% 
  extract_fit_parsnip() %>% 
  vip::vip()

vip_ridge+ my_theme
```

Column {data-width=250} 
-----------------------------------------------------------------------
### **R^2 Evaluation (Linear)**

```{r lm r^2 gauge}

percent_lm_rsq <- lm_rsq_pine_fit$r.squared*100

gauge(value = percent_lm_rsq, min = 0, max = 100, symbol = "%", gaugeSectors(colors = "darkcyan"))
```

### **R^2 Evaluation (Ridge)**

```{r ridge r^2 gauge}

percent_ridge_rsq <- pine_ridge_rsq*100

gauge(value = percent_ridge_rsq, min = 0, max = 100, symbol = "%", gaugeSectors(colors = "darkcyan"))

```

### **Conclusion: Regression Outcomes**
When comparing the liner and ridge regressions we can see that similar outcomes are 
provided. The VIP plots show similar variable importance distributions, and the R^2 
values are very similar. 

The R^2 values from the models indicate that ~68% of the variability in 'DeadDist' 
is explained by the model. From this we can say that the model moderately fits the data, 
but it can be improved. To increase the fit of the model, more predictors can be 
added or the lower importance predictors can be switched for higher importance predictors.