---
title: "Estimating Dead Distance with Pine Beetle Dataset"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    logo: "C:/Users/dcoat/OneDrive/Desktop/black-beetle-logo.png"
    theme: journal
    source_code: embed
---
```{r setup, include=FALSE}
### LOADING THE NECESSARY LIBRARIES -----
library(flexdashboard)
library(tidymodels)
library(tidyverse)
library(readxl)
library(performance)
library(yardstick)
library(DT)
library(ggplot2)
library(plotly)
```
```{r read-data}
### READING PINE BEETLE DATA -----
pine_tbl <- read_excel("C:/Users/dcoat/OneDrive/Desktop/Data_1993.xlsx", sheet = 1)
```
```{r linear-workflow, include = FALSE}
### LINEAR REGRESSION WORKFLOW -----
# Splitting data
set.seed(123) # For reproducibility
data_split <- initial_split(pine_tbl, prop = 3/4)
train_tbl <- training(data_split)
test_tbl <- testing(data_split)
# Recipe: drop highly correlated predictors, then center and scale the rest
pine_rec <- train_tbl %>%
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + BA_20th + Neigh_1.5 + Easting + Northing) %>%
  step_corr(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# Prep and bake on the training data to inspect the processed predictors
pine_rec %>%
  prep() %>%
  bake(new_data = NULL)
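# Model specification: ordinary least squares via lm()
lm_mod <-
  linear_reg() %>%
  set_engine("lm")

# Workflow: bundle the model with the preprocessing recipe
pine_wflow <-
  workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(pine_rec)

# Fit on the training split only, so test_tbl stays unseen until evaluation
pine_fit <-
  pine_wflow %>%
  fit(data = train_tbl)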
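# Predict on the held-out test set and keep the observed values alongside
pine_pred <-
  predict(pine_fit, test_tbl) %>%
  bind_cols(test_tbl %>% select(DeadDist, TreeDiam, Infest_Serv1, BA_20th, Neigh_1.5, Easting, Northing))

# Evaluate performance (named metrics_lm so the yardstick function metrics() is not shadowed)
metrics_lm <- pine_pred %>%
  metrics(truth = DeadDist, estimate = .pred)
# R-squared from the recorded run: 0.3868 (38.68%); re-check whenever the model is re-fit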
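
# A minimal sketch for reading R-squared from a yardstick metrics table instead of
# copying it by hand into a comment or gauge. `rsq_percent` is a hypothetical helper,
# not a yardstick function; it assumes the usual `.metric` / `.estimate` columns.
rsq_percent <- function(metric_tbl, digits = 2) {
  rsq <- metric_tbl$.estimate[metric_tbl$.metric == "rsq"]
  stopifnot(length(rsq) == 1) # expect exactly one R-squared row
  round(100 * rsq, digits)
}
# Possible usage: rsq_percent(metrics(pine_pred, truth = DeadDist, estimate = .pred))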
```
```{r ridge-workflow, include = FALSE}
### RIDGE REGRESSION WORKFLOW -----
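# Create training/testing data (a separate random split from the linear workflow,
# using the default 3/4 training proportion)
pine_split <- initial_split(pine_tbl)
pine_train <- training(pine_split)
pine_test <- testing(pine_split)

# Ridge specification: mixture = 0 selects a pure L2 penalty;
# the penalty (lambda) is Dr. Smirnova's best estimate
ridge_mod <-
  linear_reg(mixture = 0, penalty = 0.1629751) %>%
  set_engine("glmnet")

# Verify how parsnip translates the specification into a glmnet call
ridge_mod %>%
  translate()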
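# Create a new recipe on the ridge training split (this overwrites the linear recipe object).
# Zero-variance columns are removed before normalizing, since they cannot be scaled.
pine_rec <- pine_train %>%
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + BA_20th + Neigh_1.5 + Easting + Northing) %>%
  step_corr(all_predictors()) %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())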
# Workflow: bundle the ridge model with its recipe
pine_ridge_wflow <-
  workflow() %>%
  add_model(ridge_mod) %>%
  add_recipe(pine_rec)

# Fit on the ridge training split
pine_ridge_fit <-
  pine_ridge_wflow %>%
  fit(data = pine_train)

# Predict on the held-out test set and keep the observed values alongside
pine_pred_ridge <-
  predict(pine_ridge_fit, pine_test) %>%
  bind_cols(pine_test %>% select(DeadDist, TreeDiam, Infest_Serv1, BA_20th, Neigh_1.5, Easting, Northing))

# Evaluate performance
metrics_ridge <- pine_pred_ridge %>%
  metrics(truth = DeadDist, estimate = .pred)
#R-squared = 40.71% or 0.4071
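
# A hedged sketch, not the method used above: this chunk fixes the penalty at a supplied
# value, but it could also be tuned by cross-validation. `tune_ridge_penalty` and its
# arguments are hypothetical names; the sketch assumes tidymodels is attached and glmnet
# is installed, as in the setup chunk.
tune_ridge_penalty <- function(rec, train_data, v = 5, levels = 30) {
  # Same ridge specification, with the penalty left as a tuning parameter
  tune_mod <- linear_reg(mixture = 0, penalty = tune()) %>%
    set_engine("glmnet")
  tune_wflow <- workflow() %>%
    add_model(tune_mod) %>%
    add_recipe(rec)
  folds <- vfold_cv(train_data, v = v)             # v-fold cross-validation splits
  grid <- grid_regular(penalty(), levels = levels) # log-spaced candidate penalties
  tune_res <- tune_grid(tune_wflow, resamples = folds, grid = grid)
  select_best(tune_res, metric = "rmse")           # one-row tibble with the chosen penalty
}
# Possible usage: tune_ridge_penalty(pine_rec, pine_train)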
```
Introduction
=====================================
Column {data-width=350}
-----------------------------------------------------------------------
### Objective
The objective of this project was to predict **'DeadDist'**, the minimum linear distance to the nearest brood tree, in the **1993 Pine Beetle Dataset**. To do so, I used the 'tidymodels' package to fit a **linear regression model** and a **ridge regression model**. Both models use the predictors TreeDiam, Infest_Serv1, BA_20th, Neigh_1.5, Easting, and Northing. This dashboard compares the two models' predictions, coefficients, and R-squared values.

### Pine Beetle Raw Data
```{r rawdata-table}
##### SHOWCASING DATA -----
#Slice first 10 rows
filtered_pine_tbl <- pine_tbl |> slice(1:10)
#Showcase data using the DT package
datatable(filtered_pine_tbl, options = list(pageLength = 5, autoWidth = TRUE))
```
Column {data-width=650}
-----------------------------------------------------------------------
### Spatial Distribution of Pine Beetle Dataset
```{r main-plot}
###### MAIN SPATIAL DISTRIBUTION SCATTERPLOT (RAW DATA) ------
main_scatterplot <- plot_ly(
  data = pine_tbl,
  x = ~Easting,
  y = ~Northing,
  type = 'scatter',
  mode = 'markers',
  marker = list(
    size = 8,
    opacity = 0.7
  ),
  color = ~factor(Response, labels = c("Alive", "JPB")),
  colors = c("darkblue", "orange")
) %>%
  layout(
    title = list(
      text = 'Comparison of Alive vs Japanese Pine Beetle-attacked Sites',
      y = 0.95,
      x = 0.5,
      xanchor = 'center',
      yanchor = 'top'
    ),
    plot_bgcolor = "white",
    xaxis = list(title = 'UTM X (Easting in meters)'),
    yaxis = list(title = 'UTM Y (Northing in meters)'),
    legend = list(
      title = list(text = 'Response'),
      itemsizing = 'constant'
    )
  )
main_scatterplot
```
Linear Regression
=====================================
Column {data-width=650}
-----------------------------------------------------------------------
### Actual values vs. predicted values
```{r actvspredlinear-plot}
#### ACTUAL VS PREDICTED VALS PLOT (Linear regression) ----
#Visualize predictions
actvspredlinear_plot <- pine_pred %>%
  ggplot(aes(x = DeadDist, y = .pred)) +
  geom_point(color = "gray") +
  geom_abline(color = "darkred", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Observed vs. Predicted",
    x = "Observed DeadDist",
    y = "Predicted DeadDist"
  ) +
  theme_minimal() +
  theme(
    text = element_text(color = "black", size = 14),
    plot.title = element_text(size = 18, hjust = 0.5)
  ) +
  scale_x_continuous(limits = c(-10, 150)) +
  scale_y_continuous(limits = c(-10, 150))
actvspredlinear_plot
```
Column {data-width=350}
-----------------------------------------------------------------------
### Predictor Importance in Linear Regression
```{r coefflinear-plot}
####PLOTTING COEFFICIENTS (Linear Regression) ----
#Coefficient data without the intercept (grabbed from tidy(pine_fit))
coefficients <- data.frame(
  term = c("TreeDiam", "Infest_Serv1", "BA_20th", "Neigh_1.5", "Easting", "Northing"),
  estimate = c(0.1306964, -1.1245499, -6.7868673, -11.3101040, 3.2208172, 0.8725219)
)
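
# A minimal sketch for deriving this table from the fitted model rather than typing the
# estimates by hand. `coef_without_intercept` is a hypothetical helper, not a broom or
# tidymodels function; it assumes a tidy()-style table with `term` and `estimate` columns.
coef_without_intercept <- function(coef_tbl) {
  coef_tbl[coef_tbl$term != "(Intercept)", c("term", "estimate")]
}
# Possible usage (assumes pine_fit from the linear-workflow chunk):
# coefficients <- coef_without_intercept(tidy(pine_fit))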
# Plot
coeflinear_plot <- ggplot(coefficients, aes(x = estimate, y = term)) +
  geom_point(size = 4, color = "darkred") +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_vline(xintercept = 0, colour = "black", linetype = "dashed", linewidth = 1.5) +
  theme_bw(base_size = 14) +
  theme(
    axis.text = element_text(color = "black", size = 16),
    axis.title = element_text(color = "black", size = 18),
    plot.title = element_text(color = "black", size = 20)
  ) +
  labs(
    title = "",
    x = "Coefficient estimate",
    y = ""
  )
coeflinear_plot
```
### How Much Variance did this Model Explain? (R-squared)
```{r gauge-linear}
#### R-SQUARED GAUGE (Linear)----
gauge(
  38.68, min = 0, max = 100, symbol = '%',
  sectors = gaugeSectors(
    # Contiguous ranges so non-integer values (e.g. 39.5) still fall in a sector
    success = c(80, 100), warning = c(40, 80), danger = c(0, 40)
  )
)
```
Ridge Regression
=====================================
Column {data-width=650}
-----------------------------------------------------------------------
### Actual values vs. Predicted values
```{r actvspredridge-plot}
#### ACTUAL VS PREDICTED VALS PLOT (Ridge regression) ----
#Visualize predictions
actvspredridge_plot <- pine_pred_ridge %>%
  ggplot(aes(x = DeadDist, y = .pred)) +
  geom_point(color = "gray") +
  geom_abline(color = "darkred", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Observed vs. Predicted",
    x = "Observed DeadDist",
    y = "Predicted DeadDist"
  ) +
  theme_minimal() +
  theme(
    text = element_text(color = "black", size = 14),
    plot.title = element_text(size = 18, hjust = 0.5)
  ) +
  scale_x_continuous(limits = c(-10, 150)) +
  scale_y_continuous(limits = c(-10, 150))
actvspredridge_plot
```
Column {data-width=350}
-----------------------------------------------------------------------
### Predictor Importance in Ridge Regression
```{r coeffridge-plot}
####PLOTTING COEFFICIENTS (Ridge Regression) ----
#Coefficient data without the intercept (grabbed from tidy(pine_ridge_fit))
coefficients_ridge <- data.frame(
  term = c("TreeDiam", "Infest_Serv1", "BA_20th", "Neigh_1.5", "Easting", "Northing"),
  estimate = c(0.02687227, -0.15686679, -0.68597558, -10.5029791, 2.8389821, 0.4529027)
)
# Plot
coefridge_plot <- ggplot(coefficients_ridge, aes(x = estimate, y = term)) +
  geom_point(size = 4, color = "darkred") +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_vline(xintercept = 0, colour = "black", linetype = "dashed", linewidth = 1.5) +
  theme_bw(base_size = 14) +
  theme(
    axis.text = element_text(color = "black", size = 16),
    axis.title = element_text(color = "black", size = 18),
    plot.title = element_text(color = "black", size = 20)
  ) +
  labs(
    title = "",
    x = "Coefficient estimate",
    y = ""
  )
coefridge_plot
```
### How Much Variance did this Model Explain? (R-squared)
```{r gauge-ridge}
#### R-SQUARED GAUGE (Ridge) ----
gauge(
  40.71, min = 0, max = 100, symbol = '%',
  sectors = gaugeSectors(
    # Contiguous ranges so non-integer values (e.g. 39.5) still fall in a sector
    success = c(80, 100), warning = c(40, 80), danger = c(0, 40)
  )
)
```