---
title: "Project 2"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    theme: flatly
    logo: 
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(readxl)
library(broom)
library(car)
library(ggfortify)
library(tidymodels)
library(vip)
library(performance)
library(plotly)
library(GGally)
library(corrr)
library(DT)

pine_trees <- read_xlsx("hgen-612/data/Data_1993.xlsx")
```

Overview
=================

Row {data-width=500}
-----------------------------------------------------------------------

### Overview
The objective of this analysis was to find the best model for predicting the
minimum linear distance to the nearest brood tree (DeadDist) from the predictors
described below.

```{r}

options(knitr.kable.NA = '')
data.frame(Predictors = c("TreeDiam", "Infest_Serv1", "SDI_20th", "BA_20th" ),
           Description = c("The diameter/size of the tree.", 
           "Infestation severity nearest to response tree.", 
           "Stand Density Index 1/20th-acre neighborhood surrounding response tree.", 
           "Basal Area 1/20th-acre neighborhood surrounding response tree.")) |>
  knitr::kable()

#I made this table with the help of Stack Overflow https://stackoverflow.com/questions/76996026/manual-table-in-r-markdown

```
After fitting both a Ridge and a Lasso Regression, the predictors that were most
important for predicting DeadDist were TreeDiam, Infest_Serv1, and BA_20th.

Additionally, both models estimated similar associations between these predictors
and DeadDist.

Row {data-width=350}
-----------------------------------------------------------------------

### Distance to the Nearest Brood Tree

```{r}
ggplot(pine_trees, aes(DeadDist)) +
  geom_histogram(bins = 30, color = "black", fill = "#5f71cd") +
  theme_bw() +
  xlab("Distance to Nearest Brood Tree") +
  ylab("Number of Trees") +
  ggtitle("Distribution of the Distance to the Nearest Brood Tree")
```

```{r model codes, include=FALSE}

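#Ridge
# Seed the random train/test split so results are reproducible (seed value is arbitrary)
set.seed(1234)
pine_tree_split <- initial_split(pine_trees)
pine_tree_train <- training(pine_tree_split)
pine_tree_test <- testing(pine_tree_split)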

ridge_model <-
  linear_reg(mixture = 0, penalty = 0.1629751) %>% 
  set_engine("glmnet")

ridge_model %>% 
  translate()

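pine_tree_recipe <- pine_tree_train %>% 
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + SDI_20th + BA_20th) %>% 
  # Model the square root of DeadDist, so metrics are on that scale
  step_sqrt(all_outcomes()) %>% 
  # Drop zero-variance predictors before normalizing (avoids division by zero)
  step_zv(all_numeric_predictors()) %>% 
  # Drop one predictor from each highly correlated pair
  step_corr(all_numeric_predictors()) %>% 
  step_normalize(all_numeric_predictors())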
  
ridge_workflow <-
  workflow() %>% 
  add_model(ridge_model) %>% 
  add_recipe(pine_tree_recipe)

ridge_fit <-
  ridge_workflow %>% 
  fit(data = pine_tree_train)

ridge_fit %>% 
  extract_fit_parsnip() %>% 
  tidy()

ridge_fit %>% 
  extract_preprocessor()

ridge_fit %>% 
  extract_spec_parsnip()

last_fit(ridge_workflow, pine_tree_split) %>% 
  collect_metrics()

#Lasso
set.seed(1234)
pine_tree_boot <- bootstraps(pine_tree_train)

lamda_grid <- grid_regular(penalty(), levels = 50)

lasso_model <- 
  linear_reg(mixture = 1, penalty = tune()) %>% 
  set_engine("glmnet")

lasso_model %>% 
  translate()

lasso_workflow <-
  workflow() %>% 
  add_model(lasso_model) %>% 
  add_recipe(pine_tree_recipe)

set.seed(2026)
lasso_grid <- tune_grid(lasso_workflow, 
                        resamples = pine_tree_boot, 
                        grid = lamda_grid)

lasso_grid %>% 
  collect_metrics()

lowest_rmse <- lasso_grid %>% 
  select_best(metric = "rmse")

final_model <- finalize_workflow(lasso_workflow,
                                 lowest_rmse)

final_model %>% 
  fit(pine_tree_train) %>% 
  extract_fit_parsnip() %>% 
  tidy()

last_fit(final_model,
         pine_tree_split) %>%
  collect_metrics()

```

### Model Fits

```{r}

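# collect_metrics() returns one row per metric (RMSE and RSQ); keep only RMSE
rmse_ridge <- last_fit(ridge_workflow, pine_tree_split) %>% 
  collect_metrics() %>% 
  dplyr::filter(.metric == "rmse") %>% 
  pull(.estimate)

gauge(round(rmse_ridge, 2), min = 0, max = 1.8, label = "Ridge RMSE")

# Same extraction for the finalized Lasso workflow
rmse_lasso <- last_fit(final_model, pine_tree_split) %>%
  collect_metrics() %>% 
  dplyr::filter(.metric == "rmse") %>% 
  pull(.estimate)

gauge(round(rmse_lasso, 2), min = 0, max = 1.8, label = "Lasso RMSE")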

```
Both models had nearly identical RMSE scores, so they perform about equally well
at predicting the distance to the nearest brood tree. Because the recipe takes the
square root of DeadDist, RMSE is reported on the square-root scale.

Ridge
=================

Row {data-width=650}
-----------------------------------------------------------------------

### Model Description

Ridge Regression is a multiple linear regression model that adds a penalty
proportional to the sum of the squared coefficients of the predictors.
The penalty shrinks all of the coefficients towards zero, and the coefficients of
correlated predictors towards each other, which can stabilize poorly determined
coefficients.
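
As a rough illustration of this shrinkage, the sketch below solves ridge regression in closed form on simulated data. It is not the glmnet fit used in this dashboard: the data, the `ridge_coef()` helper, and the penalty values are all hypothetical, and glmnet scales its penalty differently, so the `lambda` here is not comparable to the `penalty` value in the model code.

```r
# Hypothetical simulated data: two strongly correlated predictors
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
X  <- scale(cbind(x1, x2))         # standardize the predictors
y  <- 2 * x1 - 1 * x2 + rnorm(n)
y  <- y - mean(y)                  # center the outcome so no intercept is needed

# Closed-form ridge solution: (X'X + lambda * I)^-1 X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

ols   <- ridge_coef(X, y, lambda = 0)    # no penalty: ordinary least squares
ridge <- ridge_coef(X, y, lambda = 50)   # penalized fit

# The penalized coefficients have a smaller overall size than the OLS ones
sum(ols^2)
sum(ridge^2)
```

With correlated predictors like these, the unpenalized coefficients can be large and unstable, while the penalized ones are pulled closer to zero.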

```{r, include=FALSE}

###Definitions come from Dr. Smirnova's lectures on Multiple regression
```

### Important Predictors
```{r}

# ridge_fit is already a fitted workflow, so it does not need to be refit
ridge_fit %>% 
  extract_fit_parsnip() %>% 
  vip::vip()

```


Row {data-width=350}
----------------------------------------------------------------------

### Model Evaluation

```{r}
options(knitr.kable.NA = '')
data.frame(Metric = c("RMSE", "RSQ"),
           Estimate = c("1.74", 
           "0.144")) |>
  knitr::kable()
```
This model was evaluated with two metrics: root mean squared error (RMSE) and
R-squared (RSQ). Goodness of fit is judged primarily by RMSE, where lower values
indicate a better fit.
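
For reference, both metrics can be computed by hand. The sketch below uses hypothetical observed and predicted values, not the pine tree data. It assumes RSQ is the squared correlation between observed and predicted values, which is how the yardstick `rsq()` metric is defined.

```r
# Hypothetical observed and predicted values (not from the pine tree data)
truth    <- c(1, 2, 3, 4)
estimate <- c(1.5, 2.5, 2.5, 3.5)

# RMSE: square root of the mean squared prediction error
rmse_value <- sqrt(mean((truth - estimate)^2))

# RSQ: squared correlation between observed and predicted values
rsq_value <- cor(truth, estimate)^2

rmse_value
rsq_value
```

RMSE is in the units of the modeled outcome, while RSQ is unitless and ranges from 0 to 1.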

### Model Results

```{r}

options(knitr.kable.NA = '')
data.frame(Predictor = c("TreeDiam", "Infest_Serv1", "BA_20th" ),
           Estimate = c("0.0139", 
           "-0.147", 
           "-0.712")) |>
  knitr::kable()


```
The Ridge Regression coefficients show that BA_20th and Infest_Serv1 had a
negative association with DeadDist, while TreeDiam had a positive association.

Lasso
=================

Row {data-width=650}
-----------------------------------------------------------------------

### Model Description
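Lasso Regression is similar to Ridge Regression in that it also penalizes the
predictors' coefficients. However, the Lasso penalty is based on the absolute
values of the coefficients, so it can shrink some coefficients exactly to zero,
which tends to produce a more streamlined model with fewer predictors.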
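
One way to see why Lasso can drop predictors is the soft-thresholding rule, which gives the Lasso solution in the idealized case of orthonormal predictors. The sketch below is a hypothetical base R illustration with made-up coefficients; the `soft_threshold()` helper and the `lambda` value are assumptions for illustration, not part of the fitted model.

```r
# Soft-thresholding: shrink each value towards zero by lambda, and set it
# to exactly zero when its absolute value is at most lambda
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical least-squares coefficients (not from the pine tree data)
ols_coef <- c(large = 3, small = -0.5, medium = 1.5)

lasso_coef <- soft_threshold(ols_coef, lambda = 1)
lasso_coef
```

Here the small coefficient is removed entirely, while the larger ones are kept but reduced in size.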

This model was evaluated with two metrics: root mean squared error (RMSE) and
R-squared (RSQ). Goodness of fit is judged primarily by RMSE.
The Model Evaluation figure shows the effect of the Lasso penalty on both the
RMSE and the RSQ.
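
The penalty axis in that figure is drawn on a log scale because the tuning grid is spaced evenly in log10 units. As a sketch, assuming the default `penalty()` range in dials (1e-10 to 1), the 50 candidate values produced by `grid_regular()` can be reproduced in base R roughly as follows.

```r
# 50 candidate penalty values, evenly spaced on the log10 scale
# Assumes the dials default range for penalty(): 1e-10 to 1
penalty_values <- 10^seq(-10, 0, length.out = 50)

range(penalty_values)
length(penalty_values)
```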


```{r, include=FALSE}

###Definitions come from Dr. Smirnova's lectures on Multiple regression
```

### Important Predictors
```{r}
final_model %>% 
  fit(pine_tree_train) %>% 
  extract_fit_parsnip() %>% 
  vip::vip()
```

Row {data-width=350}
-----------------------------------------------------------------------

### Model Evaluation
```{r}
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(linewidth = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")
```

### Model Results

```{r}

options(knitr.kable.NA = '')
data.frame(Predictor = c("TreeDiam", "Infest_Serv1", "BA_20th" ),
           Estimate = c("0.0167", 
           "-0.161", 
           "-0.771")) |>
  knitr::kable()

```
The Lasso Regression coefficients show that BA_20th and Infest_Serv1 had a
negative association with DeadDist, while TreeDiam had a positive association.