This Project is test project to compare the performance of two models: XGBoost and Random Forest in a regression task.
the Data is sourced from the 1993 sample of the Lake Tahoe Basin Jeffrey pine beetle outbreak from 1991-1996. Check out Egan et al, 2016 “Multi-temporal ecological analysis of Jeffrey pine beetle outbreak dynamics within the Lake Tahoe Basin”.
The outcome of note is the Minimum linear distance to nearest brood tree.
The models were compared tusing the Root Mean Squared Error (RMSE) metric. Both models seemed to perform similarly with an RMSE ~0.1.
The next page contains charts of the outcome distribution, visualization of the pine beetle infestation, charts of variable importance for the models and plots of predicted to true minimum distance to the nearest brood tree.
---
title: "Pine Beetle"
author: "Jerry John Rawlings Mensah"
date: "`r Sys.Date()`"
output:
flexdashboard::flex_dashboard:
vertical_layout: fill
source_code: embed
---
```{r setup, include=FALSE}
library(readxl)
library(skimr)
library(tidyverse)
library(tidymodels)
library(ranger)
library(xgboost)
library(hardhat)
library(vip)
library(plotly)
data <- read_excel("Data_1993.xlsx",
sheet = 1)
#split data
set.seed(1010)
data_split <- initial_split(data = data, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
#create cross-validation folds
cv_folds <- vfold_cv(train_data, v = 5)
#preprocess data
data_recipe <- recipe(DeadDist ~ ., data = train_data) %>%
step_select(!c("Response", "Ind_DeadDist")) %>% #prevent data leakage variables
update_role(TreeNum, new_role = "id variable") %>% #make TreeNum an id variable
step_corr(all_numeric_predictors()) %>% #remove highly correlated variables
step_nzv(all_predictors()) %>% #remove variables with mostly one variable
step_normalize(all_numeric()) #normalize all numeric variables
#specify models
xgb_spec <- boost_tree(trees = tune(),
tree_depth = tune(),
learn_rate = tune(),
mtry = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
rf_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
#specify workflows
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_recipe(data_recipe)
rf_wf <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(data_recipe)
#set up hyperparameters for tuning
xgb_params <- extract_parameter_set_dials(xgb_spec) %>%
update(trees = trees(c(100, 1000)),
tree_depth = tree_depth(c(2, 10)),
learn_rate = learn_rate(c(-4, 0)),
mtry = mtry(c(7, 21)))
rf_params <- extract_parameter_set_dials(rf_spec) %>%
update(mtry = finalize(mtry(c(7, 14))),
trees = trees(c(100, 1000)),
min_n = min_n(c(2, 20)))
#tune model hyperparameters
xgb_tune <- tune_bayes(xgb_wf,
resamples = cv_folds,
param_info = xgb_params,
initial = 10,
iter = 20,
metrics = metric_set(rmse),
control = control_bayes(no_improve = 3,
verbose = FALSE,
save_pred = FALSE,
parallel_over = "everything"))
rf_tune <- tune_bayes(rf_wf,
resamples = cv_folds,
param_info = rf_params,
initial = 10,
iter = 20,
metrics = metric_set(rmse),
control = control_bayes(no_improve = 3,
verbose = FALSE,
save_pred = FALSE,
parallel_over = "everything"))
#select best models
best_xgb_model <- select_best(xgb_tune, metric = "rmse")
best_rf_model <- select_best(rf_tune, metric = "rmse")
#finalize workflow
best_xgb_wf <- finalize_workflow(xgb_wf, best_xgb_model)
best_rf_wf <- finalize_workflow(rf_wf, best_rf_model)
#fit and predict
xgb <- last_fit(best_xgb_wf,
split = data_split,
metrics = metric_set(rmse))
rf <- last_fit(best_rf_wf,
split = data_split,
metrics = metric_set(rmse))
```
Project Overview
======================================================================
This Project is test project to compare the performance of two models: XGBoost
and Random Forest in a regression task.
- Data
the Data is sourced from the 1993 sample of the Lake Tahoe Basin Jeffrey pine
beetle outbreak from 1991-1996. Check out Egan et al, 2016 “Multi-temporal
ecological analysis of Jeffrey pine beetle outbreak dynamics within the Lake
Tahoe Basin”.
The outcome of note is the Minimum linear distance to nearest brood tree.
- Model Performance Metrics
The models were compared tusing the Root Mean Squared Error (RMSE) metric. Both
models seemed to perform similarly with an RMSE ~0.1.
The next page contains charts of the outcome distribution, visualization of the
pine beetle infestation, charts of variable importance for the models and plots
of predicted to true minimum distance to the nearest brood tree.
Charts
======================================================================
Column {data-width=350}
-----------------------------------------------------------------------
### PINE BEETLE OUTBREAK
```{r data viz}
a <- data %>%
ggplot(aes(x = Easting, y = Northing)) +
geom_point(aes(color = factor(Response))) +
scale_color_manual("",values=c("green", "red"),
labels = c("Alive", "Infested")) +
labs(title = "Map of Jeffrey Pine Beetle Outbreak") +
theme_bw()
ggplotly(a)
```
### OUTCOME
```{r outcome viz}
b <- data %>%
ggplot(aes(x = DeadDist)) +
geom_histogram() +
labs(title = "Histogram of Minimum Linear Distance to Nearest Brood Tree") +
xlab("Distance") +
theme_bw()
ggplotly(b)
```
Column {data-width=350}
-----------------------------------------------------------------------
### XGBOOST FEATURE IMPORTANCE
```{r xgboost}
c <- xgb %>%
extract_fit_parsnip() %>%
vip(num_features = 14,
aesthetics = list(fill = "blue")) +
theme_classic() +
labs(title = "XGBoost Variable Importance",
x = "Importance",
y = "Feature")
ggplotly(c)
```
### RANDOM FOREST FEATURE IMPORTANCE
```{r random forest}
d <- rf %>%
extract_fit_parsnip() %>%
vip(num_features = 14,
aesthetics = list(fill = "yellowgreen")) +
theme_classic() +
labs(title = "Random Forest Variable Importance",
x = "Importance",
y = "Feature")
ggplotly(d)
```
Column {data-width=300}
-----------------------------------------------------------------------
### XGBOOST MODEL ASSESSMENT
```{r xgb assess}
e <- xgb %>%
collect_predictions() %>%
ggplot(aes(x = .pred, y = DeadDist)) +
geom_point() +
theme_classic() +
labs(title = "XGBOOST TRUE vs PREDICTED",
x = "Predicted",
y = "True Value") +
annotate("text",
x = 4, y = 0,
label = paste("RMSE:", round(xgb$.metrics[[1]][[3]], 3)),
fontface = "bold.italic")
ggplotly(e)
```
### RANDOM FOREST MODEL ASSESSMENT
```{r rf assess}
f <- rf %>%
collect_predictions() %>%
ggplot(aes(x = .pred, y = DeadDist)) +
geom_point() +
theme_classic() +
labs(title = "RANDOM FOREST TRUE vs PREDICTED",
x = "Predicted",
y = "True Value") +
annotate("text",
x = 4, y = 0,
label = paste("RMSE:", round(rf$.metrics[[1]][[3]], 3)),
fontface = "bold.italic")
ggplotly(f)
```