---
title: "Estimating Dead Distance with Pine Beetle Dataset"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    logo: "C:/Users/dcoat/OneDrive/Desktop/black-beetle-logo.png"
    theme: journal
    source_code: embed
---

```{r setup, include=FALSE}
### LOADING THE NECESSARY LIBRARIES -----
library(flexdashboard)  # dashboard layout and gauge()
library(tidymodels)     # recipes, parsnip, workflows, tune, yardstick
library(tidyverse)      # data wrangling; also attaches ggplot2
library(readxl)         # read the Excel data file
library(performance)    # model diagnostics
library(yardstick)      # attached by tidymodels; kept for explicitness
library(DT)             # interactive data tables
library(ggplot2)        # attached by tidyverse; kept for explicitness
library(plotly)         # interactive plots
```


```{r read-data}

### READING PINE BEETLE DATA -----
pine_tbl <- read_excel("C:/Users/dcoat/OneDrive/Desktop/Data_1993.xlsx", sheet = 1)
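
# Hedged note: the absolute path above is machine-specific. A relative path,
# e.g. read_excel("data/Data_1993.xlsx", sheet = 1) (an assumed project
# layout), would make the dashboard portable.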
```

```{r linear-workflow, include = FALSE}
### LINEAR REGRESSION WORKFLOW -----

# Splitting data
set.seed(123)  # For reproducibility
data_split <- initial_split(pine_tbl, prop = 3/4)
train_tbl <- training(data_split)  
test_tbl <- testing(data_split)    


# Recipe: drop highly correlated predictors, then center and scale
pine_rec <- train_tbl %>% 
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + BA_20th + Neigh_1.5 + Easting + Northing) %>% 
  step_corr(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# Sanity check: prep the recipe and bake the training data to confirm the
# preprocessing steps run cleanly
pine_rec %>% 
  prep() %>%
  bake(new_data = NULL)
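
# Hedged check (illustrative): the baked column names show whether step_corr
# dropped any predictor for excessive pairwise correlation.
pine_rec %>% prep() %>% bake(new_data = NULL) %>% names()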

# Model specification: ordinary least squares via the lm engine
lm_mod <- 
  linear_reg() %>% 
  set_engine("lm")

# Workflow: bundle the model and the recipe
pine_wflow <- 
  workflow() %>% 
  add_model(lm_mod) %>% 
  add_recipe(pine_rec)

# Fit the workflow on the training data only, so the test split stays unseen
pine_fit <- 
  pine_wflow %>% 
  fit(data = train_tbl)

# Predict with the trained workflow on the held-out test set
pine_pred <- 
  predict(pine_fit, test_tbl) %>% 
  bind_cols(test_tbl %>% select(DeadDist, TreeDiam, Infest_Serv1, BA_20th, Neigh_1.5, Easting, Northing))


# Evaluate performance (RMSE, R-squared, MAE)
metrics <- pine_pred %>% 
  metrics(truth = DeadDist, estimate = .pred)
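
# Hedged convenience (illustrative): pull the R-squared estimate that feeds
# the gauge on the Linear Regression page; metrics() returns the columns
# .metric, .estimator, and .estimate.
rsq_lm <- metrics %>% 
  filter(.metric == "rsq") %>% 
  pull(.estimate)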

# R-squared = 0.3868 (38.68%), reported in the gauge on the Linear Regression page


```

```{r ridge-workflow, include = FALSE}
### RIDGE REGRESSION WORKFLOW -----


# Create training/testing data (default prop = 3/4)
set.seed(123)  # for reproducibility, as in the linear workflow
pine_split <- initial_split(pine_tbl)
pine_train <- training(pine_split)
pine_test <- testing(pine_split)


# Fix the penalty at Dr. Smirnova's best lambda estimate; mixture = 0 gives
# pure ridge (L2) regularization
ridge_mod <-
  linear_reg(mixture = 0, penalty = 0.1629751) %>%  
  set_engine("glmnet")

# Inspect the underlying glmnet call
ridge_mod %>% 
  translate()

# Recipe for the ridge model: drop highly correlated predictors, normalize,
# and remove zero-variance columns (the workflow preps the recipe at fit time)
pine_rec <- pine_train %>% 
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + BA_20th + Neigh_1.5 + Easting + Northing) %>% 
  step_corr(all_predictors()) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_zv(all_numeric(), -all_outcomes())
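
# Hedged alternative (a sketch, not the method used here): the penalty could
# instead be chosen by 5-fold cross-validation on the training data;
# ridge_tune_res is an illustrative name and this block only informs the
# comparison with the fixed estimate above.
ridge_tune_res <- 
  workflow() %>% 
  add_model(linear_reg(mixture = 0, penalty = tune()) %>% set_engine("glmnet")) %>% 
  add_recipe(pine_rec) %>% 
  tune_grid(
    resamples = vfold_cv(pine_train, v = 5),
    grid = grid_regular(penalty(), levels = 20)
  )

# Compare the CV-selected penalty with the fixed estimate above
select_best(ridge_tune_res, metric = "rmse")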


pine_ridge_wflow <- 
  workflow() %>% 
  add_model(ridge_mod) %>% 
  add_recipe(pine_rec)


pine_ridge_fit <- 
  pine_ridge_wflow %>% 
  fit(data = pine_train)


# Predict with the trained workflow on the held-out test set
pine_pred_ridge <- 
  predict(pine_ridge_fit, pine_test) %>% 
  bind_cols(pine_test %>% select(DeadDist, TreeDiam, Infest_Serv1, BA_20th, Neigh_1.5, Easting, Northing))


# Evaluate performance (RMSE, R-squared, MAE)
metrics_ridge <- pine_pred_ridge %>% 
  metrics(truth = DeadDist, estimate = .pred)
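
# Hedged convenience (illustrative): pull the R-squared estimate that feeds
# the gauge on the Ridge Regression page.
rsq_ridge <- metrics_ridge %>% 
  filter(.metric == "rsq") %>% 
  pull(.estimate)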

# R-squared = 0.4071 (40.71%), reported in the gauge on the Ridge Regression page


```


Introduction
=====================================
Column {data-width=350}
-----------------------------------------------------------------------

### Objective

The objective of this project was to predict the **'DeadDist'** variable, the minimum linear distance to the nearest brood tree, in the **1993 Pine Beetle Dataset**. To accomplish this, I used the 'tidymodels' package to fit a **linear regression model** and a **ridge regression model**. Predictors for both models were TreeDiam, Infest_Serv1, BA_20th, Neigh_1.5, Easting, and Northing. This dashboard presents a comparative analysis of the two models.



### Pine Beetle Raw Data

```{r rawdata-table}
##### SHOWCASING DATA -----

# Keep the first 10 rows for display
filtered_pine_tbl <- pine_tbl %>% slice(1:10)
# Render an interactive table with the DT package
datatable(filtered_pine_tbl, options = list(pageLength = 5, autoWidth = TRUE))
```


Column {data-width=650}
-----------------------------------------------------------------------

### Spatial Distribution of Pine Beetle Dataset

```{r main-plot}
###### MAIN SPATIAL DISTRIBUTION SCATTERPLOT (RAW DATA) ------
main_scatterplot <- plot_ly(
  data = pine_tbl, 
  x = ~Easting, 
  y = ~Northing, 
  type = 'scatter', 
  mode = 'markers', 
  marker = list(
    size = 8, 
    opacity = 0.7 
  ),
  color = ~factor(Response, labels = c("Alive", "JPB")), 
  colors = c("darkblue", "orange") 
) %>%
  layout(
    title = list(
      text = 'Comparison of Alive vs Japanese Pine Beetle-attacked Sites', 
      y = 0.95, 
      x = 0.5, 
      xanchor = 'center', 
      yanchor = 'top'
    ),
    plot_bgcolor = "white", 
    xaxis = list(title = 'UTM X (Easting in meters)'), 
    yaxis = list(title = 'UTM Y (Northing in meters)'), 
    legend = list(
      title = list(text = 'Response'), 
      itemsizing = 'constant'
    )
  )


main_scatterplot


  
```



Linear Regression
=====================================  
Column {data-width=650}
-----------------------------------------------------------------------

### Actual vs. Predicted Values
```{r actvspredlinear-plot}
#### ACTUAL VS PREDICTED VALS PLOT (Linear regression) ----
# Visualize predictions against the observed values
actvspredlinear_plot <- pine_pred %>% 
  ggplot(aes(x = DeadDist, y = .pred)) +
  geom_point(color = "gray") +  
  geom_abline(color = "darkred", linetype = "dashed", linewidth = 1) +  
  labs(title = "Observed vs. Predicted", 
       x = "Observed DeadDist", 
       y = "Predicted DeadDist") +
  theme_minimal() +  
  theme(
    text = element_text(color = "black", size = 14),  
    plot.title = element_text(size = 18, hjust = 0.5) 
  ) +
  scale_x_continuous(limits = c(-10, 150)) +
  scale_y_continuous(limits = c(-10, 150))

actvspredlinear_plot
```

Column {data-width=350}
-----------------------------------------------------------------------
### Predictor Importance in Linear Regression

```{r coefflinear-plot}
#### PLOTTING COEFFICIENTS (Linear Regression) ----
# Coefficient estimates without the intercept (taken from tidy(pine_fit))
coefficients <- data.frame(
  term = c("TreeDiam", "Infest_Serv1", "BA_20th", "Neigh_1.5", "Easting", "Northing"),
  estimate = c(0.1306964, -1.1245499, -6.7868673, -11.3101040, 3.2208172, 0.8725219))
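
# Hedged cross-check (assumes the linear-workflow chunk has run): the same
# table can be rebuilt from the fitted workflow, as the comment above notes;
# coefficients_check is an illustrative name and is not used by the plot.
coefficients_check <- tidy(pine_fit) %>% 
  filter(term != "(Intercept)") %>% 
  select(term, estimate)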

# Plot
coeflinear_plot <- ggplot(coefficients, aes(x = estimate, y = term)) +
  geom_point(size = 4, color = "darkred") +                         
  geom_vline(xintercept = 0, color = "black", linetype = "dashed", linewidth = 1.5) +  
  theme_bw(base_size = 14) +                                        
  theme(
    axis.text = element_text(color = "black", size = 16),
    axis.title = element_text(color = "black", size = 18),
    plot.title = element_text(color = "black", size = 20)
  ) +
  labs(
    title = "",
    x = "",
    y = ""
  )

coeflinear_plot
```

### How Much Variance did this Model Explain? (R-squared)

```{r gauge-linear}
#### R-SQUARED GAUGE (Linear)----
gauge(38.68, min = 0, max = 100, symbol = '%', gaugeSectors(
  success = c(80, 100), warning = c(40, 79), danger = c(0, 39)
))
```

Ridge Regression
=====================================  
Column {data-width=650}
-----------------------------------------------------------------------

### Actual vs. Predicted Values
```{r actvspredridge-plot}
#### ACTUAL VS PREDICTED VALS PLOT (Ridge regression) ----
# Visualize predictions against the observed values
actvspredridge_plot <- pine_pred_ridge %>% 
  ggplot(aes(x = DeadDist, y = .pred)) +
  geom_point(color = "gray") +  
  geom_abline(color = "darkred", linetype = "dashed", linewidth = 1) +  
  labs(title = "Observed vs. Predicted", 
       x = "Observed DeadDist", 
       y = "Predicted DeadDist") +
  theme_minimal() +  
  theme(
    text = element_text(color = "black", size = 14),  
    plot.title = element_text(size = 18, hjust = 0.5) 
  ) +
  scale_x_continuous(limits = c(-10, 150)) +
  scale_y_continuous(limits = c(-10, 150))

actvspredridge_plot
```

Column {data-width=350}
-----------------------------------------------------------------------
### Predictor Importance in Ridge Regression

```{r coeffridge-plot}

#### PLOTTING COEFFICIENTS (Ridge Regression) ----
# Coefficient estimates without the intercept (taken from tidy(pine_ridge_fit))
coefficients_ridge <- data.frame(
  term = c("TreeDiam", "Infest_Serv1", "BA_20th", "Neigh_1.5", "Easting", "Northing"),
  estimate = c(0.02687227, -0.15686679, -0.68597558, -10.5029791, 2.8389821, 0.4529027))
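
# Hedged cross-check (assumes the ridge-workflow chunk has run): rebuild the
# table via tidy(pine_ridge_fit), per the comment above;
# coefficients_ridge_check is an illustrative name and is not used by the plot.
coefficients_ridge_check <- tidy(pine_ridge_fit) %>% 
  filter(term != "(Intercept)") %>% 
  select(term, estimate)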


coefridge_plot <- ggplot(coefficients_ridge, aes(x = estimate, y = term)) +
  geom_point(size = 4, color = "darkred") +                          
  geom_vline(xintercept = 0, color = "black", linetype = "dashed", linewidth = 1.5) +  
  theme_bw(base_size = 14) +                                        
  theme(
    axis.text = element_text(color = "black", size = 16),
    axis.title = element_text(color = "black", size = 18),
    plot.title = element_text(color = "black", size = 20)
  ) +
  labs(
    title = "",
    x = "",
    y = ""
  )

coefridge_plot

```

### How Much Variance did this Model Explain? (R-squared)

```{r gauge-ridge}
#### R-SQUARED GAUGE (Ridge) ----
gauge(40.71, min = 0, max = 100, symbol = '%', gaugeSectors(
  success = c(80, 100), warning = c(40, 79), danger = c(0, 39)
))
```