Page 1

The goal is to use data collected in 1993 from the Lake Tahoe Basin (a closely monitored 60-acre study area with 10,722 trees) to predict the minimum linear distance to the nearest brood (or dead) tree. See Page 3 for detailed descriptions of the variable abbreviations.

Correlation heatmap of all measured variables

Scatterplot of Neigh_1.5, a significant predictor, vs. distance to nearest dead tree

Feature Importance (Random Forest Model, top 10 shown)

Key for Model Summary Evaluation Metrics (Right)

  • RMSE (Root Mean Squared Error): The square root of the average squared difference between predicted and observed values, in the same units as the response (m). Large errors are penalized more heavily than in MAE. Lower RMSE values indicate a better fit.

  • R-Squared (R²): Proportion of variance in the dependent variable that can be predicted from the independent variables. R² values closer to 1 mean a better model fit, 1 being a perfect fit.

  • MAE (Mean Absolute Error): Measures the average absolute difference between the predicted and observed values. Lower MAE values indicate better model performance, with 0 indicating no error.
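
The three metrics in the key above can be computed by hand. The following is a minimal sketch in base R on made-up values (the `observed` and `predicted` vectors are illustrative and are not taken from the 1993 dataset). R² is computed here as the squared correlation between observed and predicted values, which is the definition `yardstick::rsq()` uses.

```r
# Toy observed and predicted distances (m); made-up values for illustration only
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

errors <- predicted - observed

# RMSE: square root of the mean squared error
rmse <- sqrt(mean(errors^2))

# MAE: mean of the absolute errors
mae <- mean(abs(errors))

# R-squared as the squared correlation between observed and predicted values
rsq <- cor(observed, predicted)^2

c(rmse = rmse, mae = mae, rsq = rsq)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions. A large gap between the two suggests a few large misses.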

Linear Model Summary (Page 2 for Interpretation)

Model Evaluation Metrics (Linear Model)

Metric   Estimator   Estimate
rmse     standard    5.3051223
rsq      standard    0.9261439
mae      standard    2.9899526

Random Forest (Ranger) Model Summary (Page 2 for Interpretation)

Model Evaluation Metrics (Random Forest Ranger)

Metric   Estimator   Estimate
rmse     standard    2.3598864
rsq      standard    0.9855771
mae      standard    1.6804356

Page 2

Evaluating the models

Overall, Random Forest performs better, with lower RMSE and MAE (less error) and higher R² (better fit). The Linear Model is weaker, with higher errors and a lower R².
This ultimately means that the Random Forest model is better at predicting the minimum linear distance to the nearest brood tree (m) based on this 1993 dataset.

The predictive variables used are:
-  TreeNum
-  Response
-  Easting
-  Northing
-  TreeDiam
-  Infest_Serv1
-  Ind_DeadDist
-  Neigh_SDI_1/4th
-  BA_20th
-  Neigh_1.5
-  BA_Inf_20th
-  BA_Infest_1/4th
-  BA_Infest_1.5
-  IND_BA_Infest_20th
-  IND_BA_Infest_1/4th
-  IND_BA_Infest_1/2th
-  IND_BA_Infest_1
-  IND_BA_Infest_1.5
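
To put the gap between the two models in concrete terms, the test-set metrics reported on Page 1 can be turned into relative differences. This is a minimal sketch in base R using the reported values. The names `lm_metrics`, `rf_metrics`, `error_reduction`, and `rsq_gain` are illustrative and are not part of the dashboard code.

```r
# Test-set metrics as reported on Page 1
lm_metrics <- c(rmse = 5.3051223, rsq = 0.9261439, mae = 2.9899526)
rf_metrics <- c(rmse = 2.3598864, rsq = 0.9855771, mae = 1.6804356)

# Relative reduction in each error metric when moving from the
# linear model to the random forest
error_reduction <- 1 - rf_metrics[c("rmse", "mae")] / lm_metrics[c("rmse", "mae")]

# Absolute gain in R-squared
rsq_gain <- rf_metrics[["rsq"]] - lm_metrics[["rsq"]]

round(100 * error_reduction, 1)  # percent reduction in RMSE and MAE
round(rsq_gain, 3)
```

On these values the random forest's RMSE is about 56% lower and its MAE about 44% lower than the linear model's, while R² is about 0.06 higher.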

Model metrics comparison from Page 1

Page 3

Below are all variables and their descriptions.

---
title: "Pine Beetles Analysis"
subtitle: "Minimum linear distance to the nearest brood tree"
author: "RL"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    theme: cerulean
    logo: "images/three_trees.png"
    favicon: "images/three_trees.png"
    source_code: embed
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE, echo=FALSE}
library(flexdashboard)
library(tidymodels)
library(tidyverse)
library(rsample)
library(yardstick)
library(recipes)
library(dr)
library(ggplot2)
library(knitr)
library(plotly)
library(corrr)
library(emo)
library(ggfortify)
library(vip)
library(GGally)
library(readxl)
library(kableExtra)
library(corrplot)

```


```{r data stuff, echo=FALSE, include=FALSE}
#Load data
getwd()
pine_beetles_data_path <- file.path("data", "Data_1993.xlsx")
beetles_data <- read_xlsx(pine_beetles_data_path)

#Data structure
dim(beetles_data)
rm(pine_beetles_data_path)
str(beetles_data)
summary(beetles_data)

# Pairwise relationships among the response and selected predictors
ggpairs(beetles_data[, c("DeadDist", "Neigh_SDI_1/4th", "BA_Infest_1/2th", "Neigh_1.5")],
        diag = list(continuous = "barDiag", discrete = "barDiag", na = "naDiag"))

# Preview each column's type and first few values
glimpse(beetles_data)

```

```{r processing, echo=FALSE, include=FALSE}
# Train-test data split
set.seed(123)

beetles_split <- 
  initial_split(beetles_data, prop = 0.80)
beetles_split

beetles_split %>%
  training() %>%
  glimpse()


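# Data pre-processing: estimate all steps on the training set only
beetles_recipe <- 
  training(beetles_split) %>%
  recipe(DeadDist ~ .) %>%
  # Drop predictors that are highly correlated with other predictors
  step_corr(all_predictors()) %>%
  # Center to mean 0 and scale to standard deviation 1
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes()) %>%
  prep()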

beetles_training <- 
  bake(beetles_recipe, new_data = NULL)
glimpse(beetles_training)
glimpse(beetles_data)
names(beetles_training)

beetles_testing <- 
  beetles_recipe %>%
  bake(testing(beetles_split)) 
glimpse(beetles_testing)

beetles_recipe

#Ranger Model
beetles_ranger <- 
  rand_forest(trees = 200, mode = "regression") %>%
  set_engine("ranger", importance = "impurity") %>%
  fit(DeadDist ~ ., data = beetles_training)


# Predict on the held-out test set once and keep the predictions alongside the truth.
# Despite the name, beetles_probs holds regression predictions, not probabilities.
beetles_probs <- 
  beetles_ranger %>%
  predict(beetles_testing) %>%
  bind_cols(beetles_testing)
beetles_probs

# Test-set RMSE, R-squared, and MAE for the random forest
beetles_probs %>%
  metrics(truth = DeadDist, estimate = .pred)

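# Linear modeling
# Note: beetles_training was already baked with beetles_recipe above, so the
# steps below are re-applied to data that are already filtered, centered, and
# scaled; they are largely redundant here.
beetles_recipe_1 <- 
  recipe(DeadDist ~ ., data = beetles_training) %>%
  step_corr(all_predictors()) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes())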

#Linear reg mod 
beetles_lm_model <-
  linear_reg() %>%
  set_engine("lm")

beetles_lm_workflow <- 
  workflow() %>%
  add_recipe(beetles_recipe_1) %>%
  add_model(beetles_lm_model)

beetles_lm_fit <- 
  beetles_lm_workflow %>%
  fit(data = beetles_training)

beetles_lm_fit %>%
  predict(beetles_testing) %>%
  bind_cols(beetles_testing) %>%
  metrics(truth = DeadDist, estimate = .pred)



```


Page 1
===================================== 
Row {data-height=25}
-----------------------------------------------------------------------
<h3 style="font-size: 20px; margin: 0; padding: 0">The goal is to use data collected in 1993 from the Lake Tahoe Basin (a closely monitored 60-acre study area with 10,722 trees) to predict the minimum linear distance to the nearest brood (or dead) tree. See Page 3 for detailed descriptions of the variable abbreviations.</h3>

Row {data-height=500}
-----------------------------------------------------------------------
### Correlation heatmap of all measured variables
```{r data summary, echo=FALSE}
cor_matrix <- 
  cor(beetles_data %>% select_if(is.numeric), use = "complete.obs")

corrplot(cor_matrix, method = "circle", type = "upper", tl.col = "black", tl.cex = 0.8)

```


### Scatterplot of Neigh_1.5, a significant predictor, vs. distance to nearest dead tree

```{r scatterplot, echo=FALSE}
ggplot(beetles_data, aes(x = Neigh_1.5, y = DeadDist)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  xlab("1.5-Acre Neighborhood Basal Area Sum (m²)") +
  ylab("Dead Distance (m)") +
  theme_bw(base_size = 18)

```


### Feature Importance (Random Forest Model, top 10 shown)

```{r 2, echo=FALSE}
beetles_ranger %>%
  vip::vip(
    num_features = 10,
    geom = "point",
    aesthetics = list(
      size = 4,
      color = "blue"
    )
  ) +
  theme_bw(base_size = 18) +
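  xlab("Variables of Interest") +
  # Axis labels are divided by 100,000 below, so state the unit in the title
  ylab("Variable Importance (impurity, in units of 100,000)") +
  scale_y_continuous(labels = label_number(accuracy = 1, scale = 1/100000, suffix = "", big.mark = ","))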

  

```

Row {data-height=300}
-----------------------------------------------------------------------
### Key for Model Summary Evaluation Metrics (Right)
- **RMSE (Root Mean Squared Error):** The square root of the average squared difference between predicted and observed values, in the same units as the response (m). Large errors are penalized more heavily than in MAE. Lower RMSE values indicate a better fit.
  
- **R-Squared (R²):** Proportion of variance in the dependent variable that can be predicted from the independent variables. R² values closer to 1 mean a better model fit, 1 being a perfect fit.

- **MAE (Mean Absolute Error):** Measures the average absolute difference between the predicted and observed values. Lower MAE values indicate better model performance, with 0 indicating no error.



### Linear Model Summary (Page 2 for Interpretation)


```{r 3}
beetles_lm_fit %>%
  predict(beetles_testing) %>%
  bind_cols(beetles_testing) %>%
  metrics(truth = DeadDist, estimate = .pred) %>%
  rename(
    Metric = .metric,
    Estimator = .estimator,
    Estimate = .estimate
  ) %>%
  kable("html", caption = "Model Evaluation Metrics (Linear Model)") %>%
  kable_styling(
    full_width = FALSE, 
    position = "center", 
    font_size = 14,
    bootstrap_options = c("striped", "hover", "condensed")
  ) %>%
  column_spec(1, width = "10em") %>%
  column_spec(2, width = "10em") %>%
  column_spec(3, width = "10em") %>%
  row_spec(0, bold = TRUE, background = "#f39c12")



```

### Random Forest (Ranger) Model Summary (Page 2 for Interpretation)


```{r 4}
beetles_ranger %>%
  predict(beetles_testing) %>%
  bind_cols(beetles_testing) %>%
  metrics(truth = DeadDist, estimate = .pred) %>%
  rename(
    Metric = .metric,
    Estimator = .estimator,
    Estimate = .estimate
  ) %>%
  kable("html", caption = "Model Evaluation Metrics (Random Forest Ranger)") %>%
  kable_styling(
    full_width = FALSE, 
    position = "center", 
    font_size = 14,
    bootstrap_options = c("striped", "hover", "condensed")
  ) %>%
  column_spec(1, width = "10em") %>%
  column_spec(2, width = "10em") %>%
  column_spec(3, width = "10em") %>%
  row_spec(0, bold = TRUE, background = "#18bc9c")



```


Page 2
===================================== 

Row {data-height=600}
-----------------------------------------------------------------------
### Evaluating the models 
```{r evaluating models, echo=FALSE}
variables <- c("TreeNum", "Response", "Easting", "Northing", "TreeDiam", 
               "Infest_Serv1", "Ind_DeadDist", "Neigh_SDI_1/4th", "BA_20th", 
               "Neigh_1.5", "BA_Inf_20th", "BA_Infest_1/4th", "BA_Infest_1.5", 
               "IND_BA_Infest_20th", "IND_BA_Infest_1/4th", "IND_BA_Infest_1/2th", 
               "IND_BA_Infest_1", "IND_BA_Infest_1.5")

# sep = "" prevents cat() from adding a leading space before each later argument
cat("Overall, Random Forest performs better, with lower RMSE and MAE (less error) and higher R² (better fit). The Linear Model is weaker, with higher errors and a lower R².\nThis ultimately means that the Random Forest model is better at predicting the minimum linear distance to the nearest brood tree (m) based on this 1993 dataset.\n\n",
    "The predictive variables used are:\n",
    paste("- ", variables, collapse = "\n"),
    sep = "")

```

### Model metrics comparison from Page 1

```{r comparisons, echo=FALSE}
# Test-set metrics copied by hand (rounded) from the Page 1 tables;
# update these values if the models or data change
model_comparison <- tibble(
  Model = c("Random Forest", "Random Forest", "Random Forest", "Linear Model", "Linear Model", "Linear Model"),
  Metric = c("RMSE", "R²", "MAE", "RMSE", "R²", "MAE"),
  Value = c(2.36, 0.986, 1.68, 5.31, 0.926, 2.99)
)

ggplot(model_comparison, aes(x = Model, y = Value, color = Model, group = Model)) +
  geom_point(size = 6) + 
  geom_line(aes(group = Metric), linetype = "dashed", color = "gray") + 
  facet_wrap(~Metric, scales = "free_y") + 
  labs(title = "Model Comparison") +
  scale_color_manual(values = c("Random Forest" = "#18bc9c", "Linear Model" = "#f39c12")) +
  theme_bw(base_size = 16) + 
  theme(
    strip.background = element_rect(),
    strip.text = element_text(size = 16, face = "bold"),
    axis.title = element_blank(),  
    axis.text.x = element_blank(),  
    axis.ticks.x = element_blank(),  
    legend.title = element_blank(),  
    legend.position = "top" 
  )


```


Page 3
===================================== 

Row {data-height=25}
-----------------------------------------------------------------------
<h3 style="font-size: 20px; margin: 0; padding: 0">Below are all variables and their descriptions.</h3>

Row {data-height=500}
-----------------------------------------------------------------------
```{r embed-image, echo=FALSE, out.width='95%', out.height='80%'}
knitr::include_graphics("images/variables_descriptions.png")

```