---
title: "Pine Beetle"
author: "Jerry John Rawlings Mensah"
date: "`r Sys.Date()`"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(readxl)
library(skimr)
library(tidyverse)
library(tidymodels)
library(ranger)
library(xgboost)
library(hardhat)
library(vip)
library(plotly)

data <- read_excel("Data_1993.xlsx",
                   sheet = 1)
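
#optional quick look at the raw data (skimr is loaded above for this purpose)
#skim(data)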

#split data
set.seed(1010)

data_split <- initial_split(data = data, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

#create cross-validation folds
cv_folds <- vfold_cv(train_data, v = 5)

#preprocess data
data_recipe <- recipe(DeadDist ~ ., data = train_data) %>% 
  step_rm(Response, Ind_DeadDist) %>% #drop variables that would leak the outcome
  update_role(TreeNum, new_role = "id variable") %>% #make TreeNum an id variable
  step_corr(all_numeric_predictors()) %>% #remove highly correlated predictors
  step_nzv(all_predictors()) %>% #remove near-zero-variance predictors
  step_normalize(all_numeric()) #normalize all numeric variables, including the
                                #outcome, so RMSE is on the standardized scale
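
#optional sanity check: preview the preprocessed training set
#prep(data_recipe) %>% bake(new_data = NULL) %>% glimpse()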
  
#specify models
xgb_spec <- boost_tree(trees = tune(),
                  tree_depth = tune(),
                  learn_rate = tune(),
                  mtry = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

rf_spec <- rand_forest(mtry = tune(),
                       trees = tune(),
                       min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

#specify workflows
xgb_wf  <- workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(data_recipe)

rf_wf <- workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(data_recipe)

#set up hyperparameters for tuning
xgb_params <- extract_parameter_set_dials(xgb_spec) %>%
  update(trees = trees(c(100, 1000)),
         tree_depth = tree_depth(c(2, 10)),
         learn_rate = learn_rate(c(-4, 0)), #log10 scale: 1e-4 to 1
         mtry = mtry(c(7, 21)))

rf_params <- extract_parameter_set_dials(rf_spec) %>% 
  update(mtry = mtry(c(7, 14)), #range given directly, so finalize() is unnecessary
         trees = trees(c(100, 1000)),
         min_n = min_n(c(2, 20)))

#tune model hyperparameters
xgb_tune <- tune_bayes(xgb_wf,
                       resamples = cv_folds,
                       param_info = xgb_params,
                       initial = 10,
                       iter = 20,
                       metrics = metric_set(rmse),
                       control = control_bayes(no_improve = 3,
                                               verbose = FALSE,
                                               save_pred = FALSE,
                                               parallel_over = "everything"))



rf_tune <- tune_bayes(rf_wf,
                      resamples = cv_folds,
                      param_info = rf_params,
                      initial = 10,
                      iter = 20,
                      metrics = metric_set(rmse),
                      control = control_bayes(no_improve = 3,
                                              verbose = FALSE,
                                              save_pred = FALSE,
                                              parallel_over = "everything"))

#select best models
best_xgb_model <- select_best(xgb_tune, metric = "rmse")

best_rf_model <- select_best(rf_tune, metric = "rmse")
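
#optional: inspect the top candidates from each Bayesian search
#show_best(xgb_tune, metric = "rmse")
#show_best(rf_tune, metric = "rmse")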

#finalize workflow
best_xgb_wf <- finalize_workflow(xgb_wf, best_xgb_model)

best_rf_wf <- finalize_workflow(rf_wf, best_rf_model)


#fit and predict
xgb <- last_fit(best_xgb_wf,
                split = data_split,
                metrics = metric_set(rmse))

rf <- last_fit(best_rf_wf,
               split = data_split,
               metrics = metric_set(rmse))
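
#optional: view the held-out test RMSE collected by last_fit()
#collect_metrics(xgb)
#collect_metrics(rf)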

```

Project Overview
======================================================================

This is a test project comparing the performance of two models, XGBoost and 
Random Forest, on a regression task.

- Data

The data are sourced from the 1993 sample of the 1991-1996 Jeffrey pine beetle 
outbreak in the Lake Tahoe Basin. See Egan et al. (2016), “Multi-temporal 
ecological analysis of Jeffrey pine beetle outbreak dynamics within the Lake 
Tahoe Basin”.

The outcome of interest is the minimum linear distance to the nearest brood 
tree (`DeadDist`). 

- Model Performance Metrics

The models were compared using the Root Mean Squared Error (RMSE) metric. Both 
models performed similarly, with an RMSE of roughly 0.1. Note that the recipe 
normalizes all numeric variables, including the outcome, so this RMSE is on 
the standardized scale. 
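
For reference, RMSE is the square root of the average squared difference 
between the true values $y_i$ and the predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$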


The next page contains a visualization of the pine beetle infestation, a chart 
of the outcome distribution, variable-importance charts for both models, and 
plots of predicted versus true minimum distance to the nearest brood tree.

Charts
======================================================================

Column {data-width=350}
-----------------------------------------------------------------------

### PINE BEETLE OUTBREAK

```{r data-viz}
a <- data %>% 
  ggplot(aes(x = Easting, y = Northing)) +
  geom_point(aes(color = factor(Response))) +
  scale_color_manual("", values = c("green", "red"),
                     labels = c("Alive", "Infested")) + 
  labs(title = "Map of Jeffrey Pine Beetle Outbreak") +
  theme_bw()

ggplotly(a)
```

### OUTCOME

```{r outcome-viz}
b <- data %>% 
  ggplot(aes(x = DeadDist)) +
  geom_histogram(bins = 30) + #explicit bin count; the default emits a message
  labs(title = "Histogram of Minimum Linear Distance to Nearest Brood Tree") +
  xlab("Distance") +
  theme_bw()

ggplotly(b)
```

Column {data-width=350}
-----------------------------------------------------------------------


### XGBOOST FEATURE IMPORTANCE

```{r xgboost}
c <- xgb %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 14,
      aesthetics = list(fill = "blue")) +
  theme_classic() +
  labs(title = "XGBoost Variable Importance",
       x = "Importance",
       y = "Feature")

ggplotly(c) 
```

### RANDOM FOREST FEATURE IMPORTANCE

```{r random-forest}
d <- rf %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 14,
      aesthetics = list(fill = "yellowgreen")) +
  theme_classic() +
  labs(title = "Random Forest Variable Importance",
       x = "Importance",
       y = "Feature")

ggplotly(d)
```

Column {data-width=300}
-----------------------------------------------------------------------

### XGBOOST MODEL ASSESSMENT

```{r xgb-assess}
e <- xgb %>% 
  collect_predictions() %>% 
  ggplot(aes(x = .pred, y = DeadDist)) +
  geom_point() +
  theme_classic() +
  labs(title = "XGBOOST TRUE vs PREDICTED",
       x = "Predicted",
       y = "True Value") +
  annotate("text", 
           x = 4, y = 0, 
           label = paste("RMSE:", round(collect_metrics(xgb)$.estimate, 3)),
           fontface = "bold.italic")

ggplotly(e)
```

### RANDOM FOREST MODEL ASSESSMENT

```{r rf-assess}
f <- rf %>% 
  collect_predictions() %>% 
  ggplot(aes(x = .pred, y = DeadDist)) +
  geom_point() +
  theme_classic() +
  labs(title = "RANDOM FOREST TRUE vs PREDICTED",
       x = "Predicted",
       y = "True Value") +
  annotate("text", 
           x = 4, y = 0, 
           label = paste("RMSE:", round(collect_metrics(rf)$.estimate, 3)),
           fontface = "bold.italic")

ggplotly(f)
```