---
title: "Project 2"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    theme: flatly
    logo:
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(readxl)
library(broom)
library(car)
library(ggfortify)
library(tidymodels)
library(vip)
library(performance)
library(plotly)
library(GGally)
library(corrr)
library(DT)
pine_trees <- read_xlsx("hgen-612/data/Data_1993.xlsx")
```
Overview
=================
Row {data-width=500}
-----------------------------------------------------------------------
### Overview
Using the predictors listed and described below, the objective of this analysis was to
find the best model for predicting the minimum linear distance to the nearest brood tree
(DeadDist).
```{r}
options(knitr.kable.NA = '')
# Static table describing each candidate predictor
data.frame(
  Predictors = c("TreeDiam", "Infest_Serv1", "SDI_20th", "BA_20th"),
  Description = c(
    "The diameter (size) of the tree.",
    "Infestation severity nearest to the response tree.",
    "Stand Density Index in the 1/20th-acre neighborhood surrounding the response tree.",
    "Basal Area in the 1/20th-acre neighborhood surrounding the response tree."
  )
) |>
  knitr::kable()
# I made this table with the help of Stack Overflow: https://stackoverflow.com/questions/76996026/manual-table-in-r-markdown
```
After fitting both a ridge and a lasso regression, the most important predictors
of DeadDist were TreeDiam, Infest_Serv1, and BA_20th.
Both models also estimated associations of the same direction and similar size
between these predictors and DeadDist.

Row {data-width=350}
-----------------------------------------------------------------------
### Distance to the Nearest Brood Tree
```{r}
ggplot(pine_trees, aes(DeadDist)) +
  # bins = 30 is the ggplot2 default; stating it avoids the stat_bin() message
  geom_histogram(bins = 30, color = "black", fill = "#5f71cd") +
  theme_bw() +
  xlab("Distance to Nearest Brood Tree") +
  ylab("Number of Trees") +
  ggtitle("Distribution of the Distance to the Nearest Brood Tree")
```
```{r model codes, include=FALSE}
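# Ridge
# Seed added so the train/test split is reproducible (the value is arbitrary)
set.seed(1234)
pine_tree_split <- initial_split(pine_trees)
pine_tree_train <- training(pine_tree_split)
pine_tree_test <- testing(pine_tree_split)
# mixture = 0 requests a pure ridge (L2) penalty; the penalty value is fixed
ridge_model <-
  linear_reg(mixture = 0, penalty = 0.1629751) %>%
  set_engine("glmnet")
ridge_model %>%
  translate()
# Preprocessing recipe shared by the ridge and lasso workflows
pine_tree_recipe <- pine_tree_train %>%
  recipe(DeadDist ~ TreeDiam + Infest_Serv1 + SDI_20th + BA_20th) %>%
  step_sqrt(all_outcomes()) %>%      # square-root transform DeadDist
  step_corr(all_predictors()) %>%    # drop highly correlated predictors
  # remove zero-variance columns before normalizing, so no column is divided by sd = 0
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())
ridge_workflow <-
  workflow() %>%
  add_model(ridge_model) %>%
  add_recipe(pine_tree_recipe)
ridge_fit <-
  ridge_workflow %>%
  fit(data = pine_tree_train)
ridge_fit %>%
  extract_fit_parsnip() %>%
  tidy()
ridge_fit %>%
  extract_preprocessor()
ridge_fit %>%
  extract_spec_parsnip()
# Refit on the training set and evaluate on the test set
last_fit(ridge_workflow, pine_tree_split) %>%
  collect_metrics()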
# Lasso
set.seed(1234)
pine_tree_boot <- bootstraps(pine_tree_train)
# 50 candidate penalties, evenly spaced on the log10 scale
lambda_grid <- grid_regular(penalty(), levels = 50)
# mixture = 1 requests a pure lasso (L1) penalty; the penalty is tuned below
lasso_model <-
  linear_reg(mixture = 1, penalty = tune()) %>%
  set_engine("glmnet")
lasso_model %>%
  translate()
lasso_workflow <-
  workflow() %>%
  add_model(lasso_model) %>%
  add_recipe(pine_tree_recipe)
set.seed(2026)
lasso_grid <- tune_grid(lasso_workflow,
                        resamples = pine_tree_boot,
                        grid = lambda_grid)
lasso_grid %>%
  collect_metrics()
# Keep the penalty with the lowest resampled RMSE
lowest_rmse <- lasso_grid %>%
  select_best(metric = "rmse")
final_model <- finalize_workflow(lasso_workflow,
                                 lowest_rmse)
final_model %>%
  fit(pine_tree_train) %>%
  extract_fit_parsnip() %>%
  tidy()
# Refit on the training set and evaluate on the test set
last_fit(final_model,
         pine_tree_split) %>%
  collect_metrics()
```
### Model Fits
```{r}
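# Test-set RMSE for the ridge model (collect_metrics() also returns rsq,
# so keep only the rmse row before passing a single value to gauge())
rmse_ridge <- last_fit(ridge_workflow, pine_tree_split) %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)
gauge(round(rmse_ridge, 2), min = 0, max = 1.8)
# Test-set RMSE for the tuned lasso model
rmse_lasso <- last_fit(final_model, pine_tree_split) %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)
gauge(round(rmse_lasso, 2), min = 0, max = 1.8)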
```
The two models had nearly identical RMSE values, so either one is a reasonable
choice for predicting the distance to the nearest brood tree.

Ridge
=================
Row {data-width=650}
-----------------------------------------------------------------------
### Model Description
Ridge regression is a multiple linear regression model that penalizes the
predictors' coefficients according to their size (the sum of their squares).
The coefficients are shrunk toward zero and toward each other, which stabilizes
coefficients that are poorly determined, for example when predictors are correlated.
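
As a rough illustration of this shrinkage, here is a minimal base R sketch of the closed-form ridge estimate. It uses simulated toy data rather than the pine tree data, and `ridge_coef` and `lambda` are names I made up for the example. The `glmnet` engine scales the loss and the penalty differently, so `lambda` here is not directly comparable to the `penalty` value used in this dashboard.

```r
# Hypothetical toy data (not the pine tree data): two nearly collinear predictors
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 1 + 2 * x1 + rnorm(n)

# Standardize the predictors and center the outcome so no intercept is needed
X  <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
# (illustrative helper, not part of any package)
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

ols   <- ridge_coef(X, yc, lambda = 0)   # no penalty: ordinary least squares
ridge <- ridge_coef(X, yc, lambda = 50)  # penalized fit

# Compare the two sets of coefficients side by side
rbind(ols = ols, ridge = ridge)
```

With a positive `lambda`, the overall size of the coefficient vector is smaller than in the unpenalized fit, which is the shrinkage described above.
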
```{r, include=FALSE}
###Definitions come from Dr. Smirnova's lectures on Multiple regression
```
### Important Predictors
```{r}
# ridge_fit is already fitted on the training data, so no refit is needed
ridge_fit %>%
  extract_fit_parsnip() %>%
  vip::vip()
```
Row {data-width=350}
----------------------------------------------------------------------
### Model Evaluation
```{r}
options(knitr.kable.NA = '')
data.frame(Metric = c("RMSE", "RSQ"),
Estimate = c("1.74",
"0.144")) |>
knitr::kable()
```
This model was evaluated with two metrics: RMSE and R-squared (RSQ). Goodness of
fit is judged by RMSE.
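
For readers unfamiliar with these metrics, the sketch below computes both by hand in base R. The `observed` and `predicted` vectors are invented for illustration and are not results from this analysis. To my understanding, tidymodels reports `rsq` as the squared correlation between observed and predicted values, which is what the sketch mirrors.

```r
# Hypothetical observed and predicted values (illustration only)
observed  <- c(2.0, 3.5, 1.0, 4.2, 2.8)
predicted <- c(2.4, 3.0, 1.6, 3.9, 2.5)

# RMSE: square root of the mean squared prediction error (lower is better)
rmse <- sqrt(mean((observed - predicted)^2))

# RSQ: squared correlation between observed and predicted values (higher is better)
rsq <- cor(observed, predicted)^2

c(rmse = rmse, rsq = rsq)
```

RMSE is expressed in the units of the modeled outcome, while RSQ is unitless and lies between 0 and 1.
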
### Model Results
```{r}
options(knitr.kable.NA = '')
data.frame(Predictor = c("TreeDiam", "Infest_Serv1", "BA_20th" ),
Estimate = c("0.0139",
"-0.147",
"-0.712")) |>
knitr::kable()
```
The ridge regression results showed that BA_20th and Infest_Serv1 were negatively
associated with DeadDist, while TreeDiam was positively associated with it.

Lasso
=================
Row {data-width=650}
-----------------------------------------------------------------------
### Model Description
Lasso regression is similar to ridge regression in that it also penalizes the
predictors' coefficients. However, the lasso penalty can shrink some coefficients
exactly to zero, which removes those predictors and tends to produce a more streamlined model.
This model was evaluated with two metrics: RMSE and R-squared (RSQ). Goodness of
fit is judged by RMSE.
The Model Evaluation figure shows how the size of the lasso penalty affects
both RMSE and RSQ.
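
To show why lasso can give a more streamlined model than ridge, here is a minimal base R sketch of soft-thresholding, which is the lasso solution in the simplified case of standardized, uncorrelated predictors. The coefficient values and the `soft_threshold` helper are made up for the example, and this is not how `glmnet` is called in this dashboard.

```r
# Soft-thresholding: shrink each coefficient toward zero by lambda and set it
# to exactly zero once its absolute value falls below lambda
# (illustrative helper, not part of any package)
soft_threshold <- function(b, lambda) {
  sign(b) * pmax(abs(b) - lambda, 0)
}

# Hypothetical least-squares coefficients (illustration only)
b_ols <- c(large = 0.80, medium = -0.30, small = 0.05)

# Lasso-style shrinkage: the small coefficient becomes exactly 0
lasso_like <- soft_threshold(b_ols, lambda = 0.10)

# Ridge-style proportional shrinkage for comparison: never exactly 0
ridge_like <- b_ols / (1 + 0.10)

rbind(ols = b_ols, lasso = lasso_like, ridge = ridge_like)
```

Under these assumptions the lasso drops the weakest predictor entirely, while the ridge-style shrinkage keeps all three with smaller coefficients.
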
```{r, include=FALSE}
###Definitions come from Dr. Smirnova's lectures on Multiple regression
```
### Important Predictors
```{r}
final_model %>%
fit(pine_tree_train) %>%
extract_fit_parsnip() %>%
vip::vip()
```
Row {data-width=350}
-----------------------------------------------------------------------
### Model Evaluation
```{r}
# Mean resampled RMSE and RSQ (with standard-error bars) across the penalty grid
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(
    aes(
      ymin = mean - std_err,
      ymax = mean + std_err
    ),
    alpha = 0.5
  ) +
  geom_line(linewidth = 1.5) +  # `size` is deprecated for lines in ggplot2 >= 3.4.0
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")
```
### Model Results
```{r}
options(knitr.kable.NA = '')
data.frame(Predictor = c("TreeDiam", "Infest_Serv1", "BA_20th" ),
Estimate = c("0.0167",
"-0.161",
"-0.771")) |>
knitr::kable()
```
The lasso regression results likewise showed that BA_20th and Infest_Serv1 were
negatively associated with DeadDist, while TreeDiam was positively associated with it.