This project uses the data collected at Lake Tahoe Basin regarding the Jeffrey pine beetle outbreak between 1991-1996. From 1991 to 1996, Jeffrey pine beetles (JPB) caused tree mortality throughout the Lake Tahoe Basin during a severe drought. The data set describes the dynamics within the Lake Tahoe Basin of a 60-acre study area with 10,722 trees followed annually and assesses patterns of JPB-caused mortality. This project uses the information in the ‘pine beetle’ dataset to analyze and predict the minimum linear distance to the nearest brood tree (DeadDist). To do this we will use both linear and ridge regression models in order to determine the ‘DeadDist’ outcome as well as to determine regression model variations.
The predictors used for these models were infestation severity closest to the response tree (Infest_Serv1), the basal area total for all infested trees within 1/2 acre neighborhood (BA_Infest_1/2th), stand density index @ 1/20th-acre neighborhood surrounding response tree (SDI_20th), the basal area total summed for all trees within 1-acre neighborhood of response tree (Neigh_1), and the indicator of any infested trees within 1.5-acre neighborhood of response tree (IND_BA_infest_1.5).
| Variables | Description |
|---|---|
| TreeDiam | Tree diameter/size |
| Infest_sever1 | Infestation severity nearest to response tree |
| Invest_sever2 | Infestation severity nearest to response tree |
| Ind_DeadDist | Indicator if nearest brood tree is within 50m effective distance found |
| DeadDist | Minimum linear distance to nearest brood tree |
| SDI_20th | Stand Density Index @ 1/20th-acre neighborhood surrounding response tree |
| Neigh_SDI_1/4th | Stand Density Index @ 1/4th-acre neighborhood surrounding response tree |
| BA_20th | Basal Area @ 1/20th-acre neighborhood surrounding response tree |
| Neigh_1/4th | Basal area total summed for all trees within 1/4th-acre neighborhood of response tree |
| Neigh_1/2th | Basal area total summed for all trees within 1/2-acre neighborhood of response tree |
| Neigh_1 | Basal area total summed for all trees within 1-acre neighborhood of response tree |
| Neigh_1.5 | Basal area total summed for all trees within 1.5-acre neighborhood of response tree |
| BA_Inf_20th | Basal area total for all infested trees within 1/20th-acre neighborhood |
| BA_infest_1/4th | Basal area total for all infested trees within 1/4th-acre neighborhood |
| BA_infest_1/2th | Basal area total for all infested trees within 1/2-acre neighborhood |
| BA_infest_1 | Basal area total for all infested trees within 1-acre neighborhood |
| BA_infest_1.5 | Basal area total for all infested trees within 1.5-acre neighborhood |
| IND_BA_Infest_20th | Binary indicator for if a response tree has any infested trees within neigborhood |
| IND_BA_infest_1/4th | Indicator of any infested trees within 1/4th-acre neighborhood of response tree |
| IND_BA_infest_1/2th | Indicator of any infested trees within 1/2-acre neighborhood of response tree |
| IND_BA_infest_1 | Indicator of any infested trees within 1-acre neighborhood of response tree |
| IND_BA_infest_1.5 | Indicator of any infested trees within 1.5-acre neighborhood of response tree |
| term | DeadDist | Infest_Serv1 | BA_Infest_1.2th | SDI_20th | IND_BA_Infest_1.5 | Neigh_1 |
|---|---|---|---|---|---|---|
| DeadDist | -.03 | -.45 | -.35 | -.76 | -.58 | |
| Infest_Serv1 | -.03 | .47 | -.12 | -.01 | -.17 | |
| BA_Infest_1/2th | -.45 | .47 | .17 | .22 | .26 | |
| SDI_20th | -.35 | -.12 | .17 | .18 | .51 | |
| IND_BA_Infest_1.5 | -.76 | -.01 | .22 | .18 | .41 | |
| Neigh_1 | -.58 | -.17 | .26 | .51 | .41 |
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8.79 0.0452 194. 0
2 `BA_Infest_1/2th` -0.0746 0.00139 -53.5 0
3 Infest_Serv1 0.00733 0.000726 10.1 7.24e- 24
4 SDI_20th -0.0197 0.00101 -19.5 1.93e- 83
5 Neigh_1 -0.00530 0.000212 -25.0 1.22e-133
6 IND_BA_Infest_1.5 -2.61 0.0346 -75.3 0
# A tibble: 6 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 4.51 0.163
2 BA_Infest_1/2th -0.665 0.163
3 Infest_Serv1 0.0897 0.163
4 SDI_20th -0.247 0.163
5 Neigh_1 -0.382 0.163
6 IND_BA_Infest_1.5 -0.845 0.163
When comparing the liner and ridge regressions we can see that similar outcomes are provided. The VIP plots show similar variable importance distributions, and the R^2 values are very similar.
The R^2 values from the models indicate that ~68% of the variability in ‘DeadDist’ is explained by the model. From this we can say that the model moderately fits the data, but it can be improved. To increase the fit of the model, more predictors can be added or the lower importance predictors can be switched for higher importance predictors.
---
title: "Predicting 'DeadDist' Using Regression Models"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
theme:
version: 4
bootswatch: minty
source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(DT)
library(ggplot2)
library(plotly)
library(corrr)
library(emo)
library(tidyverse)
library(readxl)
library(broom)
library(car)
library(ggfortify)
library(tidymodels)
library(vip)
library(performance)
library(GGally)
library(corrr)
pine_tbl <- read_excel("Data/Data_1993.xlsx", sheet = 1)
my_theme <- theme(axis.text.x = element_text(size = 24),
axis.text.y = element_text(size = 24),
axis.title.x = element_text(size = 24))
```
Introduction to Project and Dataset
========================================
Column {data-width=375}
-----------------------------------------------------------------------
### **Project Overview**
This project uses the data collected at Lake Tahoe Basin regarding the Jeffrey
pine beetle outbreak between 1991-1996. From 1991 to 1996, Jeffrey pine beetles
(JPB) caused tree mortality throughout the Lake Tahoe Basin during a severe drought.
The data set describes the dynamics within the Lake Tahoe Basin of a 60-acre study
area with 10,722 trees followed annually and assesses patterns of JPB-caused mortality.
This project uses the information in the 'pine beetle' dataset to analyze and predict
the minimum linear distance to the nearest brood tree (DeadDist). To do this we will
use both linear and ridge regression models in order to determine the 'DeadDist' outcome
as well as to determine regression model variations.
The predictors used for these models were infestation severity
closest to the response tree (Infest_Serv1), the basal area total for all
infested trees within 1/2 acre neighborhood (BA_Infest_1/2th), stand density
index @ 1/20th-acre neighborhood surrounding response tree (SDI_20th),
the basal area total summed for all trees within 1-acre neighborhood of response
tree (Neigh_1), and the indicator of any infested trees within 1.5-acre neighborhood
of response tree (IND_BA_infest_1.5).
### **Jeffery Pine Beetle**
```{r photo input, out.width='100%'}
knitr::include_graphics("~/Desktop/HGEN_612_Project2/Dendroctonus_ponderosae.jpg")
```
Column {data-width=625}
-----------------------------------------------------------------------
### **Data Variable Explanation**
```{r variable input, out.width='100%'}
data.frame(Variables = c("TreeDiam", "Infest_sever1", "Invest_sever2", "Ind_DeadDist",
"DeadDist", "SDI_20th", "Neigh_SDI_1/4th", "BA_20th",
"Neigh_1/4th", "Neigh_1/2th", "Neigh_1", "Neigh_1.5",
"BA_Inf_20th", "BA_infest_1/4th", "BA_infest_1/2th", "BA_infest_1",
"BA_infest_1.5", "IND_BA_Infest_20th", "IND_BA_infest_1/4th",
"IND_BA_infest_1/2th", "IND_BA_infest_1", "IND_BA_infest_1.5"),
Description = c("Tree diameter/size",
"Infestation severity nearest to response tree",
"Infestation severity nearest to response tree",
"Indicator if nearest brood tree is within 50m effective distance found",
"Minimum linear distance to nearest brood tree",
"Stand Density Index @ 1/20th-acre neighborhood surrounding response tree",
"Stand Density Index @ 1/4th-acre neighborhood surrounding response tree",
"Basal Area @ 1/20th-acre neighborhood surrounding response tree",
"Basal area total summed for all trees within 1/4th-acre neighborhood of response tree",
"Basal area total summed for all trees within 1/2-acre neighborhood of response tree",
"Basal area total summed for all trees within 1-acre neighborhood of response tree",
"Basal area total summed for all trees within 1.5-acre neighborhood of response tree",
"Basal area total for all infested trees within 1/20th-acre neighborhood",
"Basal area total for all infested trees within 1/4th-acre neighborhood",
"Basal area total for all infested trees within 1/2-acre neighborhood",
"Basal area total for all infested trees within 1-acre neighborhood",
"Basal area total for all infested trees within 1.5-acre neighborhood",
"Binary indicator for if a response tree has any infested trees within neigborhood",
"Indicator of any infested trees within 1/4th-acre neighborhood of response tree",
"Indicator of any infested trees within 1/2-acre neighborhood of response tree",
"Indicator of any infested trees within 1-acre neighborhood of response tree",
"Indicator of any infested trees within 1.5-acre neighborhood of response tree")) %>%
knitr::kable()
```
### **Selected Predictors Correlation**
```{r predictors correlation, out.width='100%'}
pine_tbl_select <- pine_tbl %>%
select("DeadDist", "BA_Infest_1/2th", "Infest_Serv1", "SDI_20th", "Neigh_1", "IND_BA_Infest_1.5")
pine_tbl_select_cor <- correlate(pine_tbl_select)
pine_tbl_select_cor %>%
rearrange() %>%
fashion() %>%
knitr::kable()
```
Predictor Overview
========================================
``` {r predictor overview}
ggpairs(pine_tbl_select)
```
Regression Models
========================================
Column{data-width=375}
-----------------------------------------------------------------------
### **Tidy Model Fit Table (Linear)**
``` {r Linear Regression model build out}
pine_tbl_recipe <-
recipe(DeadDist ~ ., data = pine_tbl_select) %>%
step_sqrt(all_outcomes()) %>%
step_corr(all_predictors())
lm_model <-
linear_reg() %>%
set_engine("lm")
pine_wflow <-
workflow() %>%
add_model(lm_model) %>%
add_recipe(pine_tbl_recipe)
pine_fit <-
pine_wflow %>%
fit(data = pine_tbl_select)
lm_rsq_pine_fit = pine_fit %>%
extract_fit_parsnip() %>%
glance() %>%
select(r.squared) %>%
round(., 3)
pine_fit %>%
extract_fit_parsnip() %>%
tidy()
```
### **VIP Plot (Linear)**
``` {r VIP Plot, fig.width = 12, fig.height= 7}
vip_lm <- pine_fit %>%
extract_fit_parsnip() %>%
vip::vip()
vip_lm + my_theme
```
Column {data-width=375}
-----------------------------------------------------------------------
``` {r ridge set up, include=FALSE}
pine_split <- initial_split(pine_tbl_select)
pine_train <- training(pine_split)
pine_test <- testing(pine_split)
ridge_mod <-
linear_reg(mixture = 0, penalty = 0.1629751) %>%
set_engine("glmnet")
ridge_mod %>%
translate()
pine_rec <- pine_train %>%
recipe(DeadDist ~ ., data = pine_tbl_select) %>%
step_sqrt(all_outcomes()) %>%
step_corr(all_predictors()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_zv(all_numeric(), -all_outcomes()) #%>%
pine_ridge_wflow <-
workflow() %>%
add_model(ridge_mod) %>%
add_recipe(pine_rec)
pine_ridge_wflow
pine_ridge_fit <-
pine_ridge_wflow %>%
fit(data = pine_train)
pine_ridge_evaluation <- last_fit(pine_ridge_wflow, pine_split) %>%
collect_metrics()
pine_ridge_rsq = round(pine_ridge_evaluation[2,]$.estimate,3)
```
### **Tidy Model Fit Table (Ridge)**
``` {r tidy table ridge}
pine_ridge_fit %>%
extract_fit_parsnip() %>%
tidy()
```
### **VIP Plot (Ridge)**
``` {r VIP Plot ridge, fig.width = 12, fig.height= 7}
vip_ridge <- pine_ridge_fit %>%
extract_fit_parsnip() %>%
vip::vip()
vip_ridge+ my_theme
```
Column {data-width=250}
-----------------------------------------------------------------------
### **R^2 Evaluation (Linear)**
```{r lm r^2 gauge}
percent_lm_rsq <- lm_rsq_pine_fit$r.squared*100
gauge(value = percent_lm_rsq, min = 0, max = 100, symbol = "%", gaugeSectors(colors = "darkcyan"))
```
### **R^2 Evaluation (Ridge)**
```{r ridge r^2 gauge}
percent_ridge_rsq <- pine_ridge_rsq*100
gauge(value = percent_ridge_rsq, min = 0, max = 100, symbol = "%", gaugeSectors(colors = "darkcyan"))
```
### **Conclusion: Regression Outcomes**
When comparing the liner and ridge regressions we can see that similar outcomes are
provided. The VIP plots show similar variable importance distributions, and the R^2
values are very similar.
The R^2 values from the models indicate that ~68% of the variability in 'DeadDist'
is explained by the model. From this we can say that the model moderately fits the data,
but it can be improved. To increase the fit of the model, more predictors can be
added or the lower importance predictors can be switched for higher importance predictors.