---
title: "Project 1 - Rowan O'Hara"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: lumen
    logo: ../img/pine-beetle-logo.png
    source_code: embed
---

```{r setup, include=FALSE}
library(tidyverse)
library(flexdashboard)
library(readxl)
library(broom)
library(car)
library(ggfortify)
library(tidymodels)
library(vip)
library(performance)
library(glmnet)
library(GGally)
library(ggpmisc)
library(caret)


data_beetles <- read_excel("../data/Data_1993.xlsx")

```

Introduction
================================================================================

Column {data-width=600 .tabset .tabset-fade}
--------------------------------------------------------------------------------

### Infestation

```{r dead_map}
# Map of surveyed trees, colored by whether each tree was attacked by the Jeffrey pine beetle (JPB)
ggplot(data = data_beetles, aes(x = Easting, y = Northing)) +
  geom_point(aes(color = factor(Response), alpha = factor(Response))) +
  scale_alpha_discrete(range = c(0.5, 1), guide = "none") +
  scale_color_manual("", values = c("#778A35", "#FF5C5C"), labels = c("Alive", "JPB-attacked")) +
  theme_minimal(base_size = 14) +
  xlab("Easting Coordinate") +
  ylab("Northing Coordinate") +
  theme(legend.position = c(0, 1), legend.text = element_text(size = 15))
  
```

### Correlation Matrix

```{r correlation_matrix}
ggpairs(data_beetles[, c("DeadDist", "TreeDiam", "SDI_20th")],
        diag = list(continuous = "barDiag", discrete = "barDiag", na = "naDiag")) +
  theme_minimal(base_size = 14)
```


Column {data-width=400 .tabset .tabset-fade}
-------------------------------------------------------------------------------

### Data Dictionary

This dashboard showcases data from the 2003 Jeffrey pine beetle outbreak in the Lake Tahoe Basin. The models below predict the minimum linear distance to the nearest brood tree (`DeadDist`). The *predictors* chosen for these models are `TreeDiam` and `SDI_20th`.
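
Before the data dictionary below, a quick numeric look at the response and the two predictors. This is a sketch only; it is not evaluated in the dashboard, and the `summary()` call is purely illustrative.

```r
# Sketch only (not evaluated): range and spread of the modeling variables
# defined in the data dictionary below.
data_beetles %>%
  select(DeadDist, TreeDiam, SDI_20th) %>%
  summary()
```
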

| variable   | description                                                               |
|:-----------|:--------------------------------------------------------------------------|
| `DeadDist` | Minimum linear distance to nearest brood tree                             |
| `TreeDiam` | Tree diameter/size                                                        |
| `SDI_20th` | Stand Density Index @ 1/20th-acre neighborhood surrounding response tree  |

### Base LM Coefficients

`DeadDist` ~ `TreeDiam` + `SDI_20th`
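
The chunk below fits the reduced model on the full dataset and reports its variance inflation factors; the note that follows it says collinearity dropped once `BA_20th` was removed. For reference, a minimal, non-evaluated sketch of that comparison is shown first; it assumes the spreadsheet still contains a numeric `BA_20th` column, which is not otherwise used in this dashboard.

```r
# Sketch only (not evaluated): VIFs for the fuller model that still includes
# BA_20th, for comparison with the reduced model fit in the next chunk.
# Assumes data_beetles has a BA_20th column, as the note about removing it implies.
lm_fit_full <- lm(DeadDist ~ TreeDiam + SDI_20th + BA_20th, data = data_beetles)

car::vif(lm_fit_full) %>%
  knitr::kable(col.names = "variance inflation factor")
```
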
```{r fitted_coefficients}
lm_fit2 <- lm(DeadDist ~ TreeDiam + SDI_20th, data = data_beetles)

lm_fit2 %>%
  tidy() %>%
  knitr::kable()

car::vif(lm_fit2) %>%
  knitr::kable(col.names = "variance inflation factor")
```

*The variance inflation factors are much lower after removing the `BA_20th` variable.*

Linear Regression
================================================================================

Column {data-width=600 .tabset .tabset-fade}
--------------------------------------------------------------------------------

### Model

```{r lin-reg-model}
# This source was very helpful for this portion as well as the tidymodels recode lecture
# https://www.gmudatamining.com/lesson-10-r-tutorial.html

set.seed(1234)

pine_split <- initial_split(data_beetles, prop = 0.7)
pine_train <- pine_split %>% training()
pine_test <- pine_split %>% testing()

pine_rec <- data_beetles %>%
  recipe(DeadDist ~ TreeDiam + SDI_20th) %>%
  step_sqrt(all_outcomes()) %>%
  step_corr(all_predictors()) %>%
  prep()

lm_mod <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pine_wflow <- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(pine_rec)

pine_fit <- pine_wflow %>%
  last_fit(split = pine_split)

pine_results <- pine_fit %>%
  collect_predictions()

ggplot(data = pine_results, mapping = aes(x = .pred, y = DeadDist)) +
  geom_point(color = '#778A35', alpha = 0.25) +
  geom_abline(intercept = 0, slope = 1, color = 'black') +
  stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..eq.label.., ..rr.label.., sep = "*plain(\",\")~")),
               parse = TRUE) + # https://stackoverflow.com/questions/7549694/add-regression-line-equation-and-r2-on-graph
  labs(title = 'Multi-Linear Regression Results',
       x = 'Predicted DeadDist',
       y = 'Actual DeadDist') +
  theme_minimal(base_size = 14)
```

### Predictor Importance

```{r predictors}
pine_fit %>%
  extract_fit_parsnip() %>%
  vip::vip(aesthetics = list(size = 4, fill = "#778A35")) +
  theme_minimal(base_size = 14)
```

### Check Model

```{r check-linear-model}
pine_fit %>%
  extract_fit_parsnip() %>%
  performance::check_model()

# Could not get this to work
#autoplot(pine_fit, which = 1, label.size = 2) +
#  theme_bw()
```

Column {data-width=400 .tabset .tabset-fade}
--------------------------------------------------------------------------------

### Fitted Coefficients

`DeadDist` ~ `TreeDiam` + `SDI_20th`
```{r fitted-linear-model}
pine_fit %>% # fitted coefficients
  extract_fit_parsnip() %>%
  tidy() %>%
  knitr::kable()
```

### Evaluation

```{r evaluate-linear}
pine_fit2 <- pine_wflow %>%
  fit(data = pine_train)

pine_fit2_extract <- pine_fit2 %>%
  extract_fit_parsnip()

pine_test_results <- predict(pine_fit2_extract, new_data = pine_test) %>%
  bind_cols(pine_test)

pine_fit2 %>%
  glance() %>%
  knitr::kable()

yardstick::rmse(pine_test_results, truth = DeadDist, estimate = .pred) %>%
  knitr::kable()

yardstick::rsq(pine_test_results, truth = DeadDist, estimate = .pred) %>%
  knitr::kable()
```

With a high `rmse` value and a low `rsq` value, this model is not a good fit.

Ridge Regression
================================================================================

Column {data-width=600 .tabset .tabset-fade}
--------------------------------------------------------------------------------

### Model

```{r ridge-reg-model}
# picking Dr. Smirnova's best lambda estimate; can estimate with tune() - see below
ridge_mod <- linear_reg(mixture = 0, penalty = 0.1629751) %>% # validation sample or resampling can estimate this
  set_engine("glmnet")

pine_ridge_rec <- pine_train %>%
  recipe(DeadDist ~ TreeDiam + SDI_20th) %>%
  step_sqrt(all_outcomes()) %>%
  step_corr(all_predictors()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  prep()

pine_ridge_wflow <- workflow() %>%
  add_model(ridge_mod) %>%
  add_recipe(pine_ridge_rec)

pine_ridge_fit <- pine_ridge_wflow %>%
  last_fit(split = pine_split)

pine_ridge_results <- pine_ridge_fit %>%
  collect_predictions()

ggplot(data = pine_ridge_results, mapping = aes(x = .pred, y = DeadDist)) +
  geom_point(color = '#778A35', alpha = 0.25) +
  geom_abline(intercept = 0, slope = 1, color = 'black') +
  stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..eq.label.., ..rr.label.., sep = "*plain(\",\")~")),
               parse = TRUE) + # https://stackoverflow.com/questions/7549694/add-regression-line-equation-and-r2-on-graph
  labs(title = 'Ridge Regression Results',
       x = 'Predicted DeadDist',
       y = 'Actual DeadDist') +
  theme_minimal(base_size = 14)
```

### Predictor Importance

```{r ridge-predictors}
pine_ridge_fit %>%
  extract_fit_parsnip() %>%
  vip::vip(aesthetics = list(size = 4, fill = "#778A35")) +
  theme_minimal(base_size = 14)
```

Column {data-width=400 .tabset .tabset-fade}
--------------------------------------------------------------------------------

### Fitted Coefficients

`DeadDist` ~ `TreeDiam` + `SDI_20th`
```{r fitted-ridge-model}
pine_ridge_fit %>% # fitted coefficients
  extract_fit_parsnip() %>%
  tidy() %>%
  knitr::kable()
```

### Evaluation

```{r evaluate-ridge}
pine_ridge_fit2 <- pine_ridge_wflow %>%
  fit(data = pine_train)

pine_ridge_fit2_extract <- pine_ridge_fit2 %>%
  extract_fit_parsnip()

pine_ridge_test_results <- predict(pine_ridge_fit2_extract, new_data = pine_test) %>%
  bind_cols(pine_test)

yardstick::rmse(pine_ridge_test_results, truth = DeadDist, estimate = .pred) %>%
  knitr::kable()

yardstick::rsq(pine_ridge_test_results, truth = DeadDist, estimate = .pred) %>%
  knitr::kable()
```

With a high `rmse` value and a low `rsq` value, this model is not a good fit.
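
### Penalty Tuning (sketch)

The ridge chunk above hard-codes `penalty = 0.1629751` and notes that the value can be estimated with `tune()`. The block below is a minimal, non-evaluated sketch of that tuning step; the 10-fold cross-validation, the penalty grid, and the object names (`ridge_tune_rec`, `ridge_tune_mod`, `ridge_tune_res`) are illustrative assumptions, not settings used elsewhere in this project.

```r
# Sketch only (not evaluated): estimate the ridge penalty by resampling
# instead of fixing it at 0.1629751. Mirrors pine_ridge_rec above, but the
# recipe is left un-prepped so the workflow trains it on each resample.
ridge_tune_rec <- recipe(DeadDist ~ TreeDiam + SDI_20th, data = pine_train) %>%
  step_sqrt(all_outcomes()) %>%
  step_corr(all_predictors()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_zv(all_numeric(), -all_outcomes())

ridge_tune_mod <- linear_reg(mixture = 0, penalty = tune()) %>%
  set_engine("glmnet")

ridge_tune_res <- workflow() %>%
  add_model(ridge_tune_mod) %>%
  add_recipe(ridge_tune_rec) %>%
  tune_grid(resamples = vfold_cv(pine_train, v = 10),                     # assumed fold count
            grid = grid_regular(penalty(range = c(-3, 1)), levels = 50))  # assumed penalty grid

# Penalty with the lowest cross-validated RMSE
select_best(ridge_tune_res, metric = "rmse")
```

If this step were run, the selected penalty could replace the hard-coded value by finalizing the workflow with `finalize_workflow()` before refitting.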