This R Markdown document is designed to guide the development and implementation of Species Distribution Models (SDMs) for invasive species in Vermont. Leveraging environmental data alongside recorded occurrences, we aim to predict the potential distribution of these species under current and future environmental conditions. This work is pivotal for informing strategic conservation efforts and managing invasive species effectively.
Following a comprehensive exploratory data analysis that included spatial patterning and hotspot identification, our analysis now incorporates predictive modeling techniques. We use raster-based climatic data, including variables such as temperature and precipitation, which are crucial for understanding species-environment interactions.
The predictive models developed will help delineate areas potentially vulnerable to invasive species under varying climatic scenarios. These models not only serve to enhance our ecological understanding but also support policymakers and conservationists in devising effective management strategies.
Visual Design: - The color spectrum in the visualization transitions smoothly from delicate whites and pinks to vibrant oranges, vivid yellows, and deep greens. - This palette represents a gradient of probabilities, indicating the likelihood of species presence across different areas.
Functional Insight: - The regions highlighted in bright green suggest a higher probability of species presence, pointing conservationists to potential hotspots. - While higher probabilities highlighted by green tones do not guarantee the presence of species, they provide essential clues for further scientific investigation and resource allocation.
Strategic Application: - This color-coded approach serves as a critical tool for ecological analysis and conservation planning, acting like a treasure map that guides conservation efforts to areas where they are most needed.
Process Explanation: - This script prepares and analyzes data that combine actual sightings (occurrences) of invasive species with environmental conditions. This integration helps to understand where these species might thrive.
Impact on Conservation: - Different species require different conditions to flourish. By mapping where these conditions align with sightings, we can predict where invasive species might spread. This predictive insight is crucial for managing and potentially mitigating their impact.
Data Representation: - The plotted environmental data provide a visual representation of the climate across Vermont. - Each plot corresponds to a different environmental variable, offering insights into the diverse conditions within the state.
Strategic Development: - This preliminary analysis sets the stage for deeper exploration into how environmental factors correlate with the locations of invasive species. - This understanding is vital for developing strategies to control these species based on predicted changes in climate or habitat suitability.
Predictive Insight: - The data not only reveals where invasive species are currently found but also where they might appear next based on environmental conditions. This insight is fundamental for ecological management and conservation planning in Vermont.
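Note on setup: the objects `vermont_data` (a multi-layer climate raster clipped to Vermont) and `occurrence_data` (a table of sightings with `longitude` and `latitude` columns) are created in earlier chunks of this document. For orientation, a minimal sketch of an equivalent setup follows; the WorldClim source and the CSV file name are assumptions for illustration, not the workflow's actual inputs.
# Minimal setup sketch (assumed inputs; earlier chunks build the real objects)
library(raster)
# Bioclim layers, e.g. from WorldClim (bio1 = annual mean temperature, bio12 = annual precipitation)
worldclim <- raster::getData("worldclim", var = "bio", res = 10)
# Crop to an approximate Vermont bounding box (xmin, xmax, ymin, ymax)
vermont_extent <- extent(-73.437, -71.465, 42.73, 45.016)
vermont_data <- crop(worldclim, vermont_extent)
# Hypothetical occurrence file with 'longitude' and 'latitude' columns
occurrence_data <- read.csv("vermont_invasive_occurrences.csv")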
# --------------------- Step 6: Clean Occurrence Data ------------------------
# Step 6.1: Check the structure of the occurrence_data to understand its format and variables.
str(occurrence_data)
# Step 6.2: Extract longitude and latitude directly from the occurrence_data for further processing.
longitude <- occurrence_data$longitude
latitude <- occurrence_data$latitude
# Step 6.3: Combine longitude and latitude into a matrix for spatial operations.
occurrence_points_matrix <- cbind(longitude, latitude)
# Step 6.4: Verify the creation of the matrix by inspecting its initial rows.
head(occurrence_points_matrix)
# Step 6.5: Extract environmental values at occurrence points from the provided raster data.
env_values_at_points <- raster::extract(vermont_data, occurrence_points_matrix)
# Step 6.6: Check if the extraction resulted in a list and merge it into a matrix if necessary.
if (is.list(env_values_at_points)) {
env_values_matrix <- do.call(rbind, env_values_at_points)
} else {
env_values_matrix <- env_values_at_points
}
# Step 6.7: Filter out points with NA environmental values to ensure data integrity.
valid_points <- complete.cases(env_values_matrix)
occurrence_points_matrix_clean <- occurrence_points_matrix[valid_points, ]
# Step 6.8: Display the dimensions of the cleaned occurrence points matrix to verify the cleaning process.
dim(occurrence_points_matrix_clean)
Initial Assessment: - Puzzle Analogy: It’s essential to understand what data you have—like checking all the pieces of a puzzle before starting. This step ensures that all necessary information, especially geographic locations, is accessible and correctly formatted.
Data Integration: - Synthesis: By mapping occurrence points to environmental conditions, the script effectively merges biological observations with physical data. This synthesis is crucial for understanding the interactions between species and their habitats.
Data Integrity: - Cleaning Process: Cleaning the data by removing incomplete records ensures that the subsequent analysis is based on reliable and comprehensive information. It’s similar to filtering out noisy or unclear data in an experiment to focus on the results that provide clear insights.
Foundation for SDM: - Dataset Readiness: The clean, comprehensive dataset serves as a solid foundation for the next steps in the SDM process, where these points will be used to model potential distributions based on both biological occurrences and environmental factors.
Strategic Importance: - Predictive Modeling: In essence, this process prepares a dataset that not only maps where invasive species have been observed but also contextualizes these locations within the broader environmental landscape of Vermont. - Conservation Impact: This preparation is critical for accurately predicting where these species might thrive under current and future conditions, thereby informing conservation strategies and management decisions.
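One additional cleaning step worth considering (a hedged suggestion, not part of the workflow above): records that fall at exactly the same coordinates can inflate apparent sampling intensity and bias the model toward heavily surveyed sites. A minimal sketch:
# Optional (assumed extension): drop duplicate coordinate pairs
dups <- duplicated(occurrence_points_matrix_clean)
occurrence_points_matrix_clean <- occurrence_points_matrix_clean[!dups, ]
cat("Occurrence points after removing duplicates:", nrow(occurrence_points_matrix_clean), "\n")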
# --------------------- Step 7: Generate Random Background Points ------------
# Confirm the object's class
print(class(vermont_data))
# Step 7.1: Set seed for reproducibility
set.seed(123)
# Step 7.2: Calculate the number of non-NA cells and define max sample size
non_na_cells <- sum(!is.na(values(vermont_data[[1]])))
max_sample_size <- min(10000, non_na_cells)
# Generate random background points and keep only the XY coordinates
background_points <- raster::sampleRandom(vermont_data[[1]], size = max_sample_size, xy = TRUE)
background_points <- background_points[, 1:2] # Subset to x and y columns immediately
# Check what's inside background_points after subsetting
print(head(background_points))
# Step 7.3: Extract environmental values at the confirmed XY coordinates
bg_env_values <- raster::extract(vermont_data[[1]], background_points)
# Print the results to confirm successful extraction
print(bg_env_values)
# Step 7.4: Identify and retain valid background points (non-NA values)
valid_bg_points <- !is.na(bg_env_values)
# Step 7.5: Output the count of valid background points
cat("Number of valid background points:", sum(valid_bg_points), "\n")
# Step 7.6: Check the resolution and dimensions of your environmental data
cat("Resolution of raster data:", res(vermont_data), "\n")
cat("Dimensions of raster data:", dim(vermont_data), "\n")
# Step 7.7: Print the extent to see the covered area
print(vermont_extent)
# OR to print using cat:
cat("Extent of Vermont data: xmin =", vermont_extent@xmin,
"xmax =", vermont_extent@xmax,
"ymin =", vermont_extent@ymin,
"ymax =", vermont_extent@ymax, "\n")
# Step 7.8: Calculate the width and height in degrees directly from the extent object
width_degrees <- abs(vermont_extent@xmax - vermont_extent@xmin)
height_degrees <- abs(vermont_extent@ymax - vermont_extent@ymin)
# Step 7.9: Calculate the number of cells that can fit within the extent
cells_horizontal <- width_degrees / res(vermont_data)[1]
cells_vertical <- height_degrees / res(vermont_data)[2]
# Step 7.10: Calculate total cells
total_cells <- cells_horizontal * cells_vertical
# Step 7.11: Print calculated values
cat("Width in degrees:", width_degrees, "\n")
cat("Height in degrees:", height_degrees, "\n")
cat("Cells horizontally:", cells_horizontal, "\n")
cat("Cells vertically:", cells_vertical, "\n")
cat("Total cells in extent:", total_cells, "\n")
# Step 7.12: Compare to the actual grid dimensions
actual_grid_cells <- prod(dim(vermont_data[[1]]))
cat("Actual grid cells in raster data:", actual_grid_cells, "\n")
# Step 7.13: Explanation and action based on comparison
if (total_cells > actual_grid_cells) {
cat("The extent's calculated cell capacity exceeds the raster grid cells. Consider adjusting the extent or resolution.\n")
} else {
cat("The extent's calculated cell capacity is within the raster grid's limits.\n")
}
# Step 7.14: Generate random background points without exceeding the raster grid's capacity
background_points <- randomPoints(vermont_data[[1]], n = min(10000, actual_grid_cells))
# Step 7.14.1: Check if the output is a matrix and convert it to SpatialPoints
if (is.matrix(background_points)) {
background_points <- as.data.frame(background_points)
names(background_points) <- c("x", "y")
# Convert data frame to SpatialPoints
background_points <- SpatialPoints(background_points, proj4string = crs(vermont_data))
}
# Step 7.14.2: Bind the SpatialPoints with an empty data frame to create a SpatialPointsDataFrame
background_points_df <- SpatialPointsDataFrame(background_points,
data.frame(row.names = row.names(background_points)))
# Ensure background_points_df is now a SpatialPointsDataFrame
if (!inherits(background_points_df, "SpatialPointsDataFrame")) {
stop("background_points is not a SpatialPointsDataFrame")
}
# Step 7.15: Proceed with extraction and modeling as planned
# Extract environmental values using SpatialPointsDataFrame
bg_env_values <- raster::extract(vermont_data, background_points_df)
# Step 7.16: Identify and retain valid background points (rows with no NA values across layers)
valid_bg_points <- complete.cases(bg_env_values)
# Step 7.17: Output the final count of valid background points
cat("Final count of valid background points:", sum(valid_bg_points), "\n")
Importance of Comparison: - Comparative Analysis: In modeling species distributions, it’s crucial to compare places where the species is observed with a random sample of other locations. This comparison helps determine if the environmental conditions where the species occur are significantly different from the general environment.
Ecological Insights: - Data Extraction: By extracting environmental values at these points, we gain insights into the broader ecological backdrop of Vermont. This information is crucial for understanding the potential drivers behind species distributions.
Ensuring Accuracy: - Data Validation: Ensuring that all data points used in the analysis are valid (i.e., have complete environmental data) is essential for the accuracy of the model. It guarantees that the predictions made by the Species Distribution Model (SDM) are based on reliable and comprehensive environmental information.
Data Precision: - Resolution and Extent Checks: Checking the resolution and extent of the data against the generated points helps ensure that the environmental data used is appropriate and accurate for the region’s analysis. This step is akin to ensuring you have a detailed and accurate map before planning a route.
Setup Complete: - Ready for Visualization: Now that all data points have been validated and set up, we transition to actual visualization of the data. This next phase involves deploying the MaxEnt Model to graphically represent the ecological insights derived from our analysis.
Model Application: - Purpose: The MaxEnt Model will be used to visualize the potential distribution of invasive species across Vermont, based on the environmental data and observations gathered. This model is key to predicting and understanding the spread of these species under current and future environmental scenarios.
# --------------------- Step 8: Fit MaxEnt Model ------------------------------
# Step 8.1: Fit a MaxEnt model using the provided data.
maxent_model <- dismo::maxent(x = vermont_data, p = occurrence_points_matrix_clean, a = background_points)
# Step 8.2: Print a summary of the fitted MaxEnt model to review its performance and parameters.
summary(maxent_model)
# Step 8.3: Plot the MaxEnt model, including a title for clarity.
plot(maxent_model, main = "MaxEnt Model Summary and Response Curve", col = "black") # Plot dots with black color
# Step 8.4: Add a custom legend to the plot with unfilled circles for different contribution levels.
legend("bottomright", # Adjust position as needed
legend = c("High Contribution (>30%)", "Medium Contribution (15-30%)", "Low Contribution (<15%)"),
pch = 21, # Use unfilled circle for legend
pt.bg = "white", # No fill for circles in legend
text.col = "black", # Text color for legend
bty = "n", # No box around the legend
cex = 0.8) # Adjust text size accordingly
Figure 18: MaxEnt Model
# Description:
# MaxEnt Model Response Curve Plot:
# - This plot visualizes the response curve of the MaxEnt model.
# - The x-axis typically represents environmental variables or predictors used in the model.
# - The y-axis represents the contribution or importance of each variable in predicting species occurrence.
# - Each dot on the plot corresponds to a predictor variable, and its position indicates the contribution level.
# - Higher positions on the y-axis indicate higher contributions to the model's predictions.
# Custom Legend:
# - The legend provides additional information about the contribution levels of predictor variables.
# - It categorizes predictor variables into three groups based on their contribution levels:
# - High Contribution (>30%): Variables with a significant impact on species occurrence predictions.
# - Medium Contribution (15-30%): Variables with a moderate impact on predictions.
# - Low Contribution (<15%): Variables with a minimal impact on predictions.
# - The legend helps interpret the importance of predictor variables in the MaxEnt model.
The code section fits a MaxEnt model to the provided data and evaluates its performance through visualization. It involves training the MaxEnt model, summarizing its characteristics, plotting the model’s response curve, and adding a custom legend to interpret the contribution levels of predictor variables.
This section provides a summary of the MaxEnt model, including its length, class, and mode. - Length: The number of elements in the MaxEnt model object. - Class: Indicates the class of the object, in this case, “MaxEnt”. - Mode: Describes how the object is stored or represented.
The MaxEnt model helps predict where invasive species might be found based on where they’ve been observed and the environmental conditions at those locations. It assumes that species distributions are spread as widely as possible (maximum entropy) while still aligning with observed data.
By fitting this model, we can identify areas likely to be suitable for invasive species, not just where they have currently been observed. This predictive capability is crucial for preemptive conservation and management efforts.
The visualization and summary of the model provide insights into which environmental factors are most important for the species’ distribution. For instance, factors in the “High Contribution” category are likely crucial for the species’ presence in Vermont.
For conservationists and environmental managers, understanding these key contributing factors allows for more targeted actions, such as habitat management or monitoring efforts, especially in areas predicted as suitable but currently unoccupied by the species.
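To tie the legend's contribution categories back to actual numbers, the percent contribution of each predictor can be read from the fitted object. A short sketch, assuming the standard dismo MaxEnt object structure (row names depend on the layer names in `vermont_data`):
# Sketch: percent contribution per predictor from the dismo MaxEnt results slot
res_table <- maxent_model@results
contrib <- res_table[grep("\\.contribution$", rownames(res_table)), , drop = FALSE]
# Sort from highest to lowest contribution for easy comparison with the legend
print(contrib[order(contrib[, 1], decreasing = TRUE), , drop = FALSE])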
# --------------------- Step 9: Occurrence Data Cleaning and Validation Points ---------------------
# --------- THIS SECTION OF THE CODE HAS A LOT OF TROUBLESHOOTING ----------------------------------
# --------- I COULD NOT WORK OUT HOW TO DO THIS ANY OTHER WAY --------------------------------------
# Check the structure of the background_points object to ensure data integrity.
str(background_points)
# Check the dimensions of the background_points object to verify the size of the dataset.
dim(background_points)
# Extract longitude and latitude coordinates from background_points and create a new SpatialPoints object
background_points_coords <- SpatialPoints(coords = background_points@coords)
# Step 9.1: Create Data Frame for Presence Points
# Create a data frame for presence points using longitude, latitude, and presence indicators.
presence_df <- data.frame(
longitude = occurrence_points_matrix_clean[, "longitude"],
latitude = occurrence_points_matrix_clean[, "latitude"],
presence = 1
)
# Step 9.2: Create Data Frame for Absence Points
# Create a data frame for absence points using longitude, latitude, and absence indicators.
absence_df <- data.frame(
longitude = coordinates(background_points_coords)[, 1], # Extracting longitude
latitude = coordinates(background_points_coords)[, 2], # Extracting latitude
presence = 0
)
# Step 9.3: Combine Presence and Absence Data Frames
# Merge the presence and absence data frames to create a unified validation data set.
validation_points <- rbind(presence_df, absence_df)
# Step 9.4: Check Structure and Summary of Validation Points
# Inspect the structure and summary statistics of the validation points data frame.
str(validation_points)
summary(validation_points)
# Define predictor variables and calculate the expected number of columns.
predictor_variables <- c("bio1", "bio2", "bio3", "bio4", "bio5")
expected_columns <- length(predictor_variables) + 3 # Add 3 columns for longitude, latitude, and presence.
# Print the expected number of columns
cat("Expected number of columns:", expected_columns, "\n")
# Print the actual number of columns
cat("Actual number of columns:", ncol(validation_points), "\n")
# Print the column names
cat("Column names:", colnames(validation_points), "\n")
# Step 9.5: Add predictor variables to the validation_points data frame
validation_points[, predictor_variables] <- NA
# Step 9.6: Check if Predictor Variables were Successfully Added
if (ncol(validation_points) != expected_columns) {
stop("Predictor variables were not successfully added to the validation_points data frame.")
} else {
cat("Predictor variables were successfully added to the validation_points data frame.\n")
}
# Print column names and check the structure of validation_points.
colnames(validation_points)
ncol(validation_points)
str(validation_points)
# Step 9.6.2: Convert the RasterBrick to a SpatRaster object
library(terra) # provides rast(), vect(), and terra::extract() used below
vermont_data_spat <- rast(vermont_data)
# Step 9.6.3: Convert validation_points SpatialPointsDataFrame to a terra SpatVector object
validation_sp <- vect(validation_points, geom = c("longitude", "latitude"), crs = crs(vermont_data_spat))
# Step 9.7: Extract Environmental Data for Validation Points using the 'terra' package
# Make sure to use terra::extract to specify the correct package
validation_env_values <- terra::extract(vermont_data_spat, validation_sp)
# Check the structure of the extracted data to confirm it was successful
str(validation_env_values)
# Step 9.7 (alternative, sp-based): Create a SpatialPoints object from the validation coordinates
validation_coords <- cbind(validation_points$longitude, validation_points$latitude)
validation_sp <- SpatialPoints(validation_coords, proj4string = crs(vermont_data))
# Step 9.8: Create a SpatialPointsDataFrame by binding the SpatialPoints with the presence/absence data
validation_spdf <- SpatialPointsDataFrame(validation_sp, validation_points[, c("presence", "bio1", "bio2", "bio3", "bio4", "bio5")])
# Step 9.9: Extract Environmental Data for Validation Points
# Convert SpatialPointsDataFrame to SpatVector
validation_spatvec <- vect(validation_spdf)
# Extract environmental data for validation points
validation_env_values <- terra::extract(vermont_data_spat, validation_spatvec)
# Now, you can check the structure of the extracted data to confirm it was successful
str(validation_env_values)
# Step 9.7.1: Check if the Result is a List
# Check if the result of extraction is a list and explore its structure.
if (is.list(validation_env_values)) {
lapply(validation_env_values, str)
} else {
str(validation_env_values)
}
# Step 9.7.2: Bind Environmental Data to a Matrix
# Bind extracted environmental data into a matrix for consistency.
if (is.list(validation_env_values)) {
column_check <- sapply(validation_env_values, function(x) length(x))
if (all(column_check == column_check[1])) {
validation_env_matrix <- do.call(rbind, validation_env_values)
} else {
stop("Not all data elements have the same number of columns.")
}
} else {
validation_env_matrix <- validation_env_values
}
# Step 9.7.3: Set Column Names of Environmental Data
# Set column names for the environmental data matrix if dimensions match.
if (ncol(validation_env_matrix) == length(predictor_variables)) {
colnames(validation_env_matrix) <- predictor_variables
} else {
cat("Mismatch in the number of columns and predictor variables\n")
cat("Number of columns in validation_env_matrix:", ncol(validation_env_matrix), "\n")
cat("Number of predictor variables:", length(predictor_variables), "\n")
}
# Step 9.7.4: Debugging Output
# Print debugging information if there's a mismatch in column numbers.
if (ncol(validation_env_matrix) != length(predictor_variables)) {
print("Mismatch detected:")
print(paste("Expected columns:", length(predictor_variables), "but found:", ncol(validation_env_matrix)))
}
# Step 9.8: Verify the Updated Structure of Validation Points
# Check if the result is not a list and use the matrix if the structure is correct.
if (!is.list(validation_env_values)) {
validation_env_matrix <- validation_env_values
if (is.null(colnames(validation_env_matrix))) {
colnames(validation_env_matrix) <- paste("bio", 1:ncol(validation_env_matrix), sep="")
}
}
nrow(validation_env_matrix)
nrow(validation_points)
# Step 9.9 (re-run): Extract Environmental Data for Validation Points after the checks above
# Convert SpatialPointsDataFrame to SpatVector
validation_spatvec <- vect(validation_spdf)
# Extract environmental data for validation points
validation_env_values <- terra::extract(vermont_data_spat, validation_spatvec)
# Check if the number of rows in validation_env_values matches validation_points
if (nrow(validation_env_values) != nrow(validation_points)) {
stop("Number of rows in validation_env_values does not match validation_points.")
}
# Combine environmental data with existing validation_points data frame
validation_points <- cbind(validation_points[, c("longitude", "latitude", "presence")], validation_env_values)
# Verify the updated structure of validation_points data frame
str(validation_points)
# Optionally, review the first few rows of the updated data frame
head(validation_points)
# Step 9.10: Check for Missing Values
# Count the number of missing values in each column of the validation_points data frame.
na_count <- sapply(validation_points, function(x) sum(is.na(x)))
cat("Number of NAs in each column:\n")
print(na_count)
# Step 9.11: Summary Statistics for Environmental Variables
# Compute summary statistics for the environmental variables in the validation_points data frame.
# (terra::extract prepends an ID column, so everything from column 4 onward is summarized here.)
summary(validation_points[, 4:ncol(validation_points)])
# Step 9.12: Basic Histograms for Environmental Variables
# Generate histograms to visualize the distribution of environmental variables.
hist(validation_points$bio1,
main = "Histogram of BIO1 (Annual Mean Temperature)",
xlab = "BIO1 Values (Temperature)",
ylab = "Frequency",
col = "blue", # Color to differentiate bins
border = "black", # Color of bin borders
breaks = 30) # Adjust the number of bins if necessary
Figure 19: Histogram of BIO1 (Annual Mean Temperature)
Overview: This histogram visualizes the distribution of annual mean temperatures across Vermont, allowing for an analysis of temperature patterns within the study area.
Details: - X-axis: Represents temperature in degrees Celsius. - Y-axis: Represents the frequency of observations. - Bar Representation: Each bar in the histogram corresponds to a range of temperatures. The height of each bar indicates the number of observations falling within that specific temperature range.
Purpose of the Histogram: - The histogram helps in understanding how temperatures are distributed across Vermont. It provides a clear visual representation of the thermal environment, which is essential for ecological and climatological studies.
Utility: - Pattern Recognition: By examining the histogram, researchers can identify any patterns or anomalies in temperature distribution. This is crucial for assessing climate variability and potential impacts on local ecosystems.
Implications: - Data-Driven Decisions: The insights from the histogram can aid in making informed decisions related to environmental management and conservation planning, especially in the context of climate change adaptation strategies.
hist(validation_points$bio12,
main = "Histogram of BIO12 (Annual Precipitation)",
xlab = "BIO12 Values (Annual Precipitation in mm)",
ylab = "Frequency",
col = "green", # Color to differentiate bins
border = "black", # Color of bin borders
breaks = 30) # Adjust the number of bins if necessary
Figure 20: Histogram of BIO12 (Annual Precipitation)
Overview: This histogram visualizes the distribution of annual precipitation amounts across Vermont, providing insights into the variability and overall patterns of precipitation within the study area.
Details: - X-axis: Represents precipitation in millimeters. - Y-axis: Represents the frequency of observations. - Bar Representation: Each bar in the histogram corresponds to a range of precipitation values. The height of each bar indicates the number of observations falling within that specific range.
Purpose of the Histogram: - The histogram is crucial for understanding the distribution and variability of precipitation levels across Vermont. It offers a quantitative view that is essential for water resource management and ecological assessments.
Utility: - Pattern Recognition: By examining the histogram, researchers can identify patterns in precipitation distribution, which is vital for predicting water availability and managing flood risks. - Anomaly Detection: Helps in identifying unusual precipitation patterns that may indicate climatic shifts or anomalies.
Implications: - Data-Driven Decisions: Insights from the histogram can guide decisions in sectors dependent on water resources, such as agriculture, forestry, and urban planning. - Climate Adaptation Strategies: Understanding precipitation patterns assists in developing strategies to cope with potential climate change impacts, ensuring sustainable management of natural resources.
Overview: These histograms provide visual insights into the distributions of temperature and precipitation data across Vermont, highlighting key environmental factors that affect ecological dynamics.
Purpose of Integration: - Comprehensive Analysis: By creating a dataset that includes observations of where the species is and isn’t found, alongside environmental data, researchers can identify unique conditions that might favor the species’ presence. - Enhanced Predictive Power: This integration enhances the predictive power of the species distribution model (SDM), allowing for more accurate forecasting of species movements.
Utility of Predictors: - Critical Analysis: These predictors are essential for understanding which aspects of the environment, like temperature or precipitation, are most influential in species distribution. - Data Extraction: Initially, placeholders (NA) are used until specific environmental data is extracted, ensuring each site’s conditions are accurately represented.
Ensuring Accuracy: - Data Validation: It’s crucial to ensure the dataset is error-free and fully populated with environmental data. This guarantees that the modeling is based on comprehensive and accurate information. - Reliable Predictions: Accurate data is essential for reliable predictions, forming the backbone of effective ecological modeling.
Readiness for Analysis: - Advanced Techniques: The meticulously prepared dataset, now complete with environmental variables, is ready for advanced statistical techniques. These techniques will predict potential distribution areas for invasive species, enhancing our understanding of ecological threats.
Strategic Use: - Conservation Insights: With a robust SDM, conservationists can better understand and predict the spread of invasive species. - Effective Management Strategies: This understanding enables more targeted and effective management strategies to protect Vermont’s ecosystems, ensuring that conservation efforts are well-informed and strategically implemented.
Addressing Missing Data: - Necessity: In data analysis, especially in environmental sciences, missing data can skew results and lead to inaccurate conclusions. - Solution: Imputation helps fill these gaps, ensuring that each data point contributes to the overall analysis without introducing bias.
Robustness Against Outliers: - Advantages: The median is robust against outliers, making it a preferred choice for imputation in environmental data where extreme values can disproportionately affect the mean.
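A two-line illustration of that robustness, with made-up numbers:
# Illustrative values only: one extreme outlier among five observations
x <- c(10, 11, 12, 13, 500)
mean(x)   # 109.2 - dragged toward the outlier
median(x) # 12    - essentially unaffected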
Maintaining Data Integrity: - Consistency: Post-imputation, the dataset’s integrity is maintained, as indicated by the consistency in the statistical summaries. - Implication: This suggests that the imputation has neither distorted the data distribution nor introduced any bias that could affect subsequent analyses.
Challenges: - Data Gaps: The dataset possibly contained missing values which could lead to incomplete analyses or biased results due to insufficient data.
Improvements: - Data Completeness: All missing values are filled, ensuring that the dataset is complete. This allows for more reliable and robust statistical analysis and modeling, providing a solid foundation for any ecological or geographical inferences to be drawn.
Purpose and Benefits: - Step 9: Ensures that the data is correctly structured and integrated with necessary environmental variables. - Step 10: Improves the dataset’s completeness and usability for downstream analyses.
Strategic Importance: - Both steps are integral to data preparation but serve different purposes, highlighting the importance of a methodical approach to maintaining data quality throughout the analysis process.
# -------------------- Step 10: Implementing Data Imputation ----------------------------
# Simple median imputation for missing values (assumes columns 4 onward are the numeric predictor columns)
for(i in 4:ncol(validation_points)) {
validation_points[is.na(validation_points[,i]), i] <- median(validation_points[,i], na.rm = TRUE)
}
# Check the results after imputation
summary(validation_points)
# Recreate histograms with the imputed data
# Step 10.1: Histogram for BIO1 (Annual Mean Temperature)
# Display the distribution of the annual mean temperature (BIO1) after imputation of missing data.
hist(validation_points$bio1,
main = "Histogram of BIO1 (Annual Mean Temperature)",
xlab = "Temperature (deg C)", # Simple text replacement
ylab = "Frequency",
col = "blue",
border = "black",
breaks = 30)
# Step 10.2: Histogram for BIO12 (Annual Precipitation)
# Show the distribution of annual precipitation amounts (BIO12) after imputation.
hist(validation_points$bio12,
main = "Histogram of BIO12 (Annual Precipitation)",
xlab = "Precipitation (mm)",
ylab = "Frequency of Observations",
col = "green", # Color to differentiate bins
border = "black", # Color of bin borders
breaks = 30) # Adjust the number of bins if necessary
Figure 21: Histograms and Data Imputation
Overview: - Purpose: This histogram displays the distribution of annual mean temperatures across Vermont, providing a visual representation of thermal conditions. - X-axis: Temperature in degrees Celsius. - Y-axis: Frequency of observations. - Bar Details: Each bar represents a range of temperatures, with the height indicating the number of observations within that range.
Analytical Insights: - Temperature Distribution: By examining the histogram, researchers can understand how temperatures are distributed across the study area and identify any patterns or anomalies.
Overview: - Purpose: This histogram illustrates the distribution of annual precipitation amounts across Vermont. - X-axis: Precipitation in millimeters. - Y-axis: Frequency of observations. - Bar Details: Similar to the first histogram, each bar represents a range of precipitation values, with the height indicating the number of observations within that range.
Analytical Insights: - Precipitation Variability: Analyzing this histogram helps in understanding the variability and distribution of precipitation levels across the study area.
Integrated Insights: - These histograms provide visual insights into the distributions of temperature and precipitation data in Vermont. - Methodology: After missing values have been imputed using the simple median imputation method, these histograms offer a clearer and more accurate depiction of environmental conditions.
Implications for Research: - Enhanced Understanding: The visual analysis of these histograms aids researchers and decision-makers in understanding climatic trends and anomalies in Vermont. - Support for Ecological Studies: The insights gained are essential for ecological research, conservation planning, and preparing for climatic changes.
# --------------------- Step 11: Train Logistic Regression Model (GLM) ------------------------------
# glm() comes from the 'stats' package, which is part of base R and attached by default,
# so no installation step is needed.
library(stats)
# Step 11.1: Define the Model Formula
# Define the formula for logistic regression model using 'presence' as the binary outcome and environmental variables (bio1 to bio5) as predictors.
model_formula <- presence ~ bio1 + bio2 + bio3 + bio4 + bio5
# Step 11.2: Train the Logistic Regression Model
# Train the logistic regression model using the defined formula and the dataset.
model <- glm(model_formula, data=validation_points, family=binomial())
# Step 11.3: Check Model Summary
# After training the model, examine the summary to understand the significance of predictors and model fit.
summary(model)
# Step 11.4: Diagnostics and Validation
# Perform diagnostic checks to validate the model, which may include examining residuals and assessing model goodness of fit.
plot(model)
Figure 22: Train Logistic Regression Model (GLM) - Diagnostic Plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage)
# Step 11.5: Saving the Model
# Set the working directory where you want to save the model.
setwd("D:/GEOG_588/SDM_Invasive_species")
# Check to confirm the current working directory.
getwd()
# Save the model in the current working directory.
save(model, file="logistic_model.RData")
# Define the full path to the model file.
model_path <- "D:/GEOG_588/SDM_Invasive_species/logistic_model.RData"
# Check if the model file exists at the specified path.
if (file.exists(model_path)) {
print(paste("Model file saved successfully at:", model_path))
} else {
print("Failed to save the model file at the specified path.")
}
Figure 22: Train Logistic Regression Model (GLM) - Overview of Environmental Influences on Species Distribution
This figure illustrates the impact of various environmental variables on species distribution within Vermont, highlighting key factors that contribute to the presence or absence of species across different regions.
# Understanding Logistic Regression in Ecological Modeling
Purpose and Fit: - Binary Outcome Prediction: Logistic regression is chosen because it’s ideal for predicting a binary outcome—like the presence or absence of a species—based on multiple influencing factors, which are environmental variables in this context. - Decision Utility: This model helps understand which conditions are likely to favor or inhibit the presence of the species, making it invaluable for ecological decision-making.
Interpreting Coefficients: - Positive Coefficients (e.g., bio5): Variables like bio5 with positive coefficients suggest that increasing values of these conditions increase the probability of species presence, indicating favorable environments. - Negative Coefficients (e.g., bio1, bio2): Conversely, variables with negative coefficients imply that as these conditions increase, the probability of species presence decreases, indicating less favorable or even detrimental conditions.
Statistical Significance and Model Fit: - Significance Values (Pr(>|z|)): These values tell us whether the effects of environmental variables on species presence are statistically significant. For instance, variables like bio1 and bio2 are highly significant, implying strong evidence that these factors influence the species’ distribution. - Model Fit Metrics: The model’s residual deviance and AIC (Akaike Information Criterion) provide measures of how well the model explains the observed data compared to a null model (one with no predictors).
Ensuring Reliability: - Diagnostic Plots: By examining diagnostic plots, researchers can check for any anomalies like outliers or patterns in residuals that might suggest the model does not fit the data well. These checks are crucial for confirming the model’s reliability.
Application and Future Use: - Model Storage: Storing the model allows for its application in future studies or for predictive purposes, such as anticipating changes in species distribution due to environmental changes.
Model Overview: - This summary explains how the logistic regression model is structured, trained, and evaluated. It highlights the implications of its findings in a way that emphasizes both the technical process and its relevance to ecological research and management, providing a comprehensive understanding of the model’s utility and application in ecological contexts.
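As a concrete illustration of that future use, the saved model can be reloaded in a later session and applied to new sites. A minimal sketch; the predictor values below are invented placeholders, and the new data must carry the same column names (bio1 to bio5) used in the training formula:
# Sketch: reload the saved GLM and predict presence probability for a new site
load("logistic_model.RData") # restores the object 'model'
new_site <- data.frame(bio1 = 80, bio2 = 110, bio3 = 30, bio4 = 8000, bio5 = 250) # hypothetical values
predict(model, newdata = new_site, type = "response") # probability between 0 and 1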
# ------------------------ Step 12: AUC Calculation and Visualization -------------------------------
# ------------------------ SAME FOR HERE ... THIS SECTION HAS LOTS OF TROUBLESHOOTING ---------------
# ------------------------ I COULD NOT WORK OUT HOW TO MAKE THIS WORK ANY OTHER WAY -----------------
# Step 12.1: Data Types and Content Check
# Load pROC, which provides the roc() and auc() functions used below.
library(pROC)
# Examine the structure and summary of the presence and predicted_prob columns
str(validation_points$presence)
str(validation_points$predicted_prob)
summary(validation_points$presence)
summary(validation_points$predicted_prob)
# Step 12.2: Removal of NA Values
# Determine the count of rows with non-NA values for both presence and predicted_prob
valid_data_count <- sum(complete.cases(validation_points$presence, validation_points$predicted_prob))
print(valid_data_count)
# If valid data count is low, further investigation is needed
if (valid_data_count == 0) {
print("All data points have NAs after NA removal or there is only one class present. Check data preparation steps.")
}
# Step 12.3: Ensure Presence Data Contains Two Classes
# Check the unique values in the presence column
print(unique(validation_points$presence))
# Step 12.4: Validate Predicted_prob Values
# Verify if predicted_prob values fall within the 0-1 range
if (any(validation_points$predicted_prob < 0 | validation_points$predicted_prob > 1, na.rm = TRUE)) {
print("Some predicted_prob values are outside the range of 0 to 1. Review your prediction step.")
}
# Step 12.5: ROC Calculation
# If issues are addressed, recalculate ROC
if (valid_data_count > 0 && length(unique(validation_points$presence)) > 1) {
# Check if the predicted_prob column is NULL or contains only NA values
if (is.null(validation_points$predicted_prob) || all(is.na(validation_points$predicted_prob))) {
print("The predicted_prob column is NULL or contains only NA values. Check data preparation steps.")
} else {
# Check if any predicted_prob values are outside the range of 0 to 1
if (any(validation_points$predicted_prob < 0 | validation_points$predicted_prob > 1, na.rm = TRUE)) {
print("Some predicted_prob values are outside the range of 0 to 1. Review your prediction step.")
} else {
# Perform ROC calculation
roc_result <- roc(validation_points$presence, validation_points$predicted_prob, na.rm = TRUE)
auc_value <- auc(roc_result)
print(paste("AUC value:", auc_value))
plot(roc_result, main="ROC Curve", col="#1c61b6")
}
}
} else {
print("Cannot compute ROC due to insufficient or invalid data.")
}
# Step 12.6: Predictions Generation
# Ensure logistic regression model is loaded and accurate
if (exists("model")) {
# Generate predicted probabilities using the logistic regression model
validation_points$predicted_prob <- predict(model, newdata=validation_points, type="response")
# Check if predictions were added successfully
if (is.null(validation_points$predicted_prob)) {
stop("Failed to generate predictions. Check model and data compatibility.")
} else {
print("Predictions generated successfully.")
}
} else {
stop("Model not found. Ensure your logistic regression model is loaded correctly.")
}
# Step 12.7: ROC Calculation Retry
# Retry ROC calculation
roc_result <- roc(validation_points$presence, validation_points$predicted_prob, na.rm = TRUE)
auc_value <- auc(roc_result)
# Print the AUC value
print(paste("AUC value:", auc_value))
# Plot the ROC curve
plot(roc_result, main="ROC Curve", col="#1c61b6", print.auc=TRUE)
legend("bottomright", legend=c(paste("ROC Curve (AUC = ", round(auc_value, 2), ")")), col="#1c61b6", lty=1, cex=0.8)
Figure 23: AUC Calculation and Visualization
# Interpret AUC Value
# AUC Value of 0.771: This value is closer to 1 than to 0.5, indicating that the model generally has a good measure of separability. It is capable of differentiating between the positive and negative classes effectively.
# AUC Values: Generally, an AUC of 0.5 suggests no discrimination (no better than random chance), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and above 0.9 is outstanding.
Figure 24: ROC Curve Analysis for Species Distribution Model
This figure displays the ROC curve, illustrating the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) across various threshold settings. An AUC value of 0.771 indicates good model separability, suggesting the model's effectiveness in distinguishing between presence and absence of species. The ROC curve's proximity to the top left corner of the plot reflects a higher accuracy of the test.
Overview: - AUC (Area Under the Curve): Measures the ability of the model to correctly predict higher scores for actual positive occurrences than for negatives. It quantifies the overall ability of the model to discriminate between positive and negative classes. - ROC Curve (Receiver Operating Characteristic Curve): Visualizes how the true positive rate (sensitivity) and the false positive rate (1 - specificity) relate at various threshold settings. This curve helps in assessing the trade-offs between sensitivity and specificity under different thresholds.
Significance of AUC: - Reliability of Predictions: A good AUC value indicates that the model predictions are reliable and can effectively differentiate between sites with and without invasive species based on the environmental conditions modeled. It signifies the model’s effectiveness in distinguishing between the classes.
Role of ROC and AUC in Conservation: - Assessment Tool: The ROC and AUC are critical tools that help assess how well the environmental factors used in the model work together to predict species presence. - Conservation Planning: This analysis is crucial for conservation planning and management as it helps identify areas at high risk of invasion. Understanding these risks aids in prioritizing areas for monitoring and intervention, ensuring that conservation efforts are directed where they are most needed.
Summary: - These metrics not only highlight the model’s performance but also its practical implications in real-world ecological management and conservation strategies. The insights gained from ROC and AUC analysis guide decision-making in ecological conservation, enhancing the strategic allocation of resources and efforts in managing invasive species threats.
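One practical follow-up (a sketch, not part of the workflow above): the ROC curve also identifies a usable classification threshold. pROC's coords() can return the threshold that maximizes sensitivity plus specificity (Youden's J), which can then turn the predicted probabilities into presence/absence calls:
# Sketch: pick a threshold from the ROC curve and build a confusion table
best <- pROC::coords(roc_result, "best", ret = c("threshold", "sensitivity", "specificity"))
print(best)
thr <- as.numeric(unlist(best))[1] # threshold is the first value returned
predicted_class <- as.integer(validation_points$predicted_prob >= thr)
table(observed = validation_points$presence, predicted = predicted_class)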
# ---------------- Step 13 Extract and Verify validation_data ----------
# ----------------- TROUBLESHOOTING INSIDE THIS SECTION ----------------
# Step 13.1: Extract Environmental Data
# Extract environmental data from the raster based on the validation points.
# Namespace the call: terra (attached above) also exports extract(), but it cannot handle Raster* objects.
validation_env_values <- raster::extract(vermont_data, validation_coords)
# Check extraction output
print(head(validation_env_values))
# Step 13.2: Correct Format of validation_points
# Extract the coordinates (longitude and latitude) from validation_points
coords <- validation_points[, 1:2]
# Now use these coordinates to extract environmental data
validation_env_values <- raster::extract(vermont_data, coords)
# Step 13.3: Verify the Extraction
# Check the head of the extracted environmental values
print(head(validation_env_values))
# Step 13.4: Combine Data for Validation
# Combine the extracted environmental values with the presence/absence data
validation_data <- cbind(as.data.frame(validation_env_values), presence = validation_points$presence)
# Ensure the data frame is properly formatted for prediction
print(head(validation_data))
# What’s Happening?: Environmental characteristics (like temperature, precipitation, etc.) at specific geographic points are being matched with the locations where species have
# been observed or are expected to be absent. This combination creates a dataset that feeds into a predictive model, helping to understand where the species might thrive based on environmental conditions.
# Why It Matters?: For SDM, having accurate environmental data alongside occurrence data is crucial. The model relies on these inputs to discern patterns and predict
# species distributions effectively. Errors in data extraction or formatting can lead to incorrect predictions, which could misguide conservation efforts.
# Practical Implications:
# Ensuring Data Integrity: The troubleshooting steps undertaken are essential in SDM workflows to ensure that the input data does not contain errors or inconsistencies, which could lead to flawed outputs.
# Readiness for Model Application: The cleaned and verified dataset is now ready for advanced analytical processes, such as fitting a MaxEnt model or other statistical models, to predict species distributions.
# This step is a bridge between raw data collection and actionable ecological insights.
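# Data integrity check (a minimal sketch, assuming the validation_data frame
# built above): extraction returns NA for points that fall outside the raster
# or on masked cells, and such rows should be removed before modeling.
na_rows <- !complete.cases(validation_data)
cat("Rows with missing environmental values:", sum(na_rows), "\n")
validation_data <- validation_data[!na_rows, ]  # drop incomplete rows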
# ---------------- Step 14 MaxEnt Model Predicted Probabilities ----------------
# Step 14.1: Predict Using the MaxEnt Model
# Predict using the MaxEnt model
predictions <- predict(maxent_model, vermont_data) # This should create a raster layer of probabilities
# Plot the predicted probabilities raster
library(rasterVis)
levelplot(predictions,
          main = "MaxEnt Model Predicted Probabilities",
          col.regions = rev(terrain.colors(255)),
          xlab = "Longitude", ylab = "Latitude",
          colorkey = list(space = "top",
                          labels = list(at = seq(0, 1, by = 0.2), col = "black")),
          legend.width = 0.8, margin = FALSE)
Figure 25: Predicted Probabilities Raster (Levelplot) for Species Occurrence in Vermont
# Step 14.2: Extract Coordinate Data
# Extract the coordinates from validation_points
coords <- validation_points[, c("longitude", "latitude")]
# Promote the data frame to a spatial object; because both columns are consumed
# as coordinates, the result is a SpatialPoints object (sp package)
coordinates(coords) <- ~longitude + latitude
# Set the CRS if known; here's an example using WGS84
proj4string(coords) <- CRS("+proj=longlat +datum=WGS84")
# Step 14.3: Verify the Spatial Object
# Check the class and structure to confirm the conversion (expect a SpatialPoints object)
print(class(coords))
str(coords)
Overview: This levelplot illustrates the predicted probabilities of species occurrence across Vermont, providing a detailed visual analysis of the geographical likelihood of species presence.
Visualization Details: - Raster Cells: Each cell in the raster represents a specific geographic location. - Color Indicators: The color of each cell indicates the probability of species occurrence. - Color Scale: With the reversed terrain palette used here, pale whites and pinks denote low probabilities, grading through oranges and yellows to deep greens at the highest probabilities, marking areas with the most favorable conditions for occurrence. - Legend and Scale: The color scale in the legend quantifies these probabilities, offering a clear visual guide to interpreting the spatial distribution of potential species habitats within the study area.
Implications for Conservation: - This visual tool assists in identifying critical areas for conservation efforts, focusing resources on regions with higher probabilities of species presence.
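If the probability surface needs to be shared with GIS users outside R, one option (a sketch, not part of the original workflow; the filename is illustrative) is to export it as a GeoTIFF with the raster package's writeRaster():
# Persist the prediction raster for use in external GIS software
writeRaster(predictions, filename = "maxent_predicted_probabilities.tif",
            format = "GTiff", overwrite = TRUE)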
# Now use these spatial points to extract the probability data from the raster
predicted_probs <- extract(predictions, coords)
# Ensure the extracted data is in the correct format
predicted_probs <- as.numeric(predicted_probs) # Convert list or matrix to numeric vector if necessary
# Step 14.4: Plot the Distribution of Predicted Probabilities
# Check the distribution of predicted probabilities
hist(predicted_probs,
     main = "Distribution of Predicted Probabilities",
     xlab = "Probabilities", ylab = "Frequency",
     breaks = 30, col = "lightblue", border = "black",
     xlim = c(0, 1), ylim = c(0, 500))
# Note: main, xlab, and ylab are supplied directly to hist(), so separate
# title() and mtext() calls are unnecessary and would overplot the labels.
Figure 26: Predicted Probabilities (Histogram)
# Predicted Probabilities Raster (Levelplot):
# The levelplot function creates a visual representation of the predicted probabilities across the study area (Vermont in this case).
# Each cell in the raster represents a geographic location, and the color of the cell indicates the probability of species occurrence at that location.
# With the reversed terrain palette used here, pale whites and pinks mark low probabilities and deep greens mark the highest.
# The legend at the top of the plot provides a color scale, indicating the probability values corresponding to each color.
# Distribution of Predicted Probabilities (Histogram):
# The histogram represents the distribution of predicted probabilities across all the sampled locations (or points) in Vermont.
# The x-axis of the histogram represents the range of predicted probabilities, typically from 0 to 1, where 0 indicates very low probability and 1 indicates very high probability.
# The y-axis represents the frequency or count of occurrences within each probability range.
# The histogram provides insight into the variability and spread of predicted probabilities across the study area.
# For example, if the histogram is skewed towards higher probabilities, it suggests that the model predicts a higher likelihood of species occurrence in most locations. Conversely, if it's skewed towards lower probabilities, it suggests a lower likelihood of occurrence.
Overview:
This histogram represents the distribution of predicted probabilities of
species occurrence across all sampled locations in Vermont. The
visualization provides a clear view of the probability ranges, offering
insights into the likelihood of species presence across different
areas.
Visualization Details: - X-axis: Categorizes the range of predicted probabilities from 0 (indicating very low probability) to 1 (indicating very high probability). - Y-axis: Shows the frequency or count of occurrences within each probability range. - Color Coding: The bars are drawn in a single light blue; the white-to-green probability gradient described above applies to the levelplot, not to this histogram.
Implications for Conservation: - Analytical Insight: The histogram helps in understanding the variability and spread of predicted probabilities, offering clues about areas that may require more focused conservation efforts. - Strategic Planning: Identifying areas with higher probabilities can guide conservationists and resource managers in prioritizing regions for intervention and monitoring, enhancing the effectiveness of conservation strategies.
Geographical Visualization: - Levelplot Usage: The levelplot provides a geographical visualization where each pixel or color intensity on the map correlates with the probability of species occurrence predicted by the model. It translates complex model outputs into an easily interpretable format, making it accessible for stakeholders to understand ecological risks.
Histogram Insights: - Distribution Analysis: After extracting probabilities, they were plotted in a histogram to highlight how these probabilities are distributed across the region. This analysis is crucial for understanding whether the model generally predicts a higher or lower likelihood of occurrence. - Implications: This allows researchers and conservationists to gauge the overall effectiveness of the model in capturing the reality of species distribution across different environments.
Conservation Impact: - Direct Insight: For conservationists, these visualizations and statistical summaries offer direct insight into areas that might require more attention, either because they are likely habitats for invasive species (indicated by higher probabilities) or areas that currently pose less of a risk (indicated by lower probabilities).
Model Evaluation and Conservation Planning: - Model Performance: By visually and statistically analyzing the predicted probabilities, we can assess the MaxEnt model’s performance. A well-performing model should ideally show higher probabilities in known areas of species presence and lower probabilities elsewhere. - Strategic Conservation Planning: The spatial distribution map (levelplot) and the histogram together inform where invasive species might spread. This critical information helps in planning control measures, monitoring programs, and conservation strategies more effectively, targeting areas predicted to have higher probabilities of species presence.
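As a numeric complement to the two plots, a few base-R summaries (a quick sketch, assuming the predicted_probs vector from Step 14) quantify the distribution's center, spread, and the share of locations above a nominal 0.5 cutoff:
# Five-number summary plus mean of the predicted probabilities
print(summary(predicted_probs))
# Selected quantiles of the distribution
print(quantile(predicted_probs, probs = c(0.05, 0.5, 0.95), na.rm = TRUE))
# Proportion of validation locations with predicted probability above 0.5
cat("Share above 0.5:", mean(predicted_probs > 0.5, na.rm = TRUE), "\n")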
# ------------------------- Step 15 TSS --------------------------------
# True Skill Statistic (TSS): TSS = sensitivity + specificity - 1
# (TSS ranges from -1 to 1; 0 indicates no skill beyond chance)
# 'vermont_data' contains the data used to train the MaxEnt model:
model_vars <- names(vermont_data)
# Print the names of the variables used in the model
print(model_vars)
# Step 15.1 Data Verification and Preparation
# Extract the presence data from the validation_points dataframe
# We expect 'presence' to be a column in validation_points, with 1 indicating presence and 0 indicating absence
presence_data <- validation_points$presence
# Verify the presence data
print(head(presence_data))
# Step 15.2 Sensitivity and Specificity Calculation
# Define a function to calculate sensitivity and specificity
calculate_stats <- function(threshold, probs, actual) {
  predicted <- ifelse(probs > threshold, 1, 0)
  cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
              Actual = factor(actual, levels = c(0, 1)))
  # If a class is absent from 'actual', fall back to 0 instead of NaN:
  # sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
  sensitivity <- ifelse(sum(actual == 1) > 0, cm["1", "1"] / sum(cm[, "1"]), 0)
  specificity <- ifelse(sum(actual == 0) > 0, cm["0", "0"] / sum(cm[, "0"]), 0)
  return(c(sensitivity = sensitivity, specificity = specificity))
}
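# A quick single-threshold check of the helper before scanning all thresholds
# (uses the predicted_probs and presence_data objects defined above):
calculate_stats(0.5, probs = predicted_probs, actual = presence_data)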
# Step 15.3 Sensitivity and Specificity Calculation Across Thresholds
# Evaluate sensitivity and specificity for different threshold values
thresholds <- seq(0.03, 1, by = 0.01)
stats <- sapply(thresholds, calculate_stats, probs = predicted_probs, actual = presence_data)
# Find the threshold that maximizes the sum of sensitivity and specificity.
# sapply() returns a 2-row matrix with one column per threshold, so the
# per-threshold totals come from colSums(), not rowSums().
max_index <- which.max(colSums(stats))
optimal_threshold <- thresholds[max_index]
# Print the results
cat("Optimal threshold:", optimal_threshold, "\n")
cat("Sensitivity at optimal threshold:", stats[1, max_index], "\n")
cat("Specificity at optimal threshold:", stats[2, max_index], "\n")
# Step 15.4 Visualization of Sensitivity and Specificity
plot(thresholds, stats[1,], type='l', col='blue', xlab='Threshold', ylab='Metric Value', main='Sensitivity and Specificity by Threshold')
lines(thresholds, stats[2,], col='red')
legend("bottomright", legend=c("Sensitivity", "Specificity"), col=c("blue", "red"), lty=1, title="Metrics")
Figure 27: Sensitivity and Specificity Calculation Across Thresholds
# Note: The plotted graph will visually guide the selection of a new threshold value by showing
# the trade-off between sensitivity and specificity across a range of possible values.
# Interpretation of Results and Considerations for Next Steps
# Sensitivity of 0:
# This result indicates that the model, at the threshold of 0.03, fails to correctly identify any true positives.
# Essentially, the model is not predicting the presence of the condition correctly at all. This lack of sensitivity
# suggests that the model is overly conservative, potentially classifying nearly all test cases as negative.
# Specificity of 1:
# This outcome shows that the model perfectly identifies all true negatives at this threshold. It correctly predicts
# all absence cases without any false positives, indicating high specificity.
# Understanding Model Performance: The process revolves around evaluating how well the MaxEnt model distinguishes between areas where species are likely to be present versus areas where they are not.
# Sensitivity measures the model's ability to identify true presences, while specificity measures its ability to recognize true absences.
# Optimal Threshold Identification: The threshold scan is meant to pinpoint the value offering the best trade-off between sensitivity and specificity.
# Here, however, the reported optimum produced a sensitivity of 0 and a specificity of 1: the model, as thresholded, detects no presences at all, even though it never raises a false positive.
# Practical Implications and Adjustments: A sensitivity of 0 means the model is under-predicting presence, which risks leaving genuinely at-risk areas unmonitored.
# Such a degenerate optimum can also signal a problem in the threshold search itself (see the colSums() correction in Step 15.3); the scan should be re-run after that fix, and the threshold adjusted until a more balanced outcome is achieved.
# Logical Interpretations:
# Model Calibration Needs: The results prompt a reconsideration of the model or its application thresholds to ensure that predictions are not only specific but also sensitive.
# This involves potentially retraining the model or re-deriving the threshold so that true presences are detected reliably while the false-positive rate remains acceptable.
Overview:
This figure illustrates the trade-off between sensitivity and
specificity across a range of threshold values for the MaxEnt model used
in predicting species presence. The graph guides the selection of an
optimal threshold by visually representing the balance between
identifying true positives and true negatives.
Analysis Highlights: - Sensitivity Analysis: At a threshold of 0.03, the sensitivity is 0, indicating the model’s failure to correctly identify any true positives, suggesting an overly conservative approach that might result in most conditions being predicted as absent. - Specificity Analysis: Conversely, the specificity at this threshold is 1, demonstrating that the model perfectly identifies all true negatives without any false positives, highlighting its accuracy in predicting non-occurrences.
Implications for Model Adjustment: - Model Calibration: The depicted sensitivity-specificity trade-off is critical for calibrating the model, as it assists in selecting a threshold that optimally balances the detection of true positives and negatives. - Decision Support: This visualization is instrumental for researchers and practitioners in refining the predictive model, ensuring that it provides reliable and actionable insights for ecological management and conservation planning.
Model’s Discriminatory Power: - Purpose: The evaluation revolves around assessing how well the MaxEnt model distinguishes between areas where species are likely to be present versus areas where they are not. - Key Metrics: - Sensitivity: Measures the model’s ability to correctly identify true presences of species. - Specificity: Measures the model’s ability to correctly identify true absences.
Threshold Analysis: - Optimal Trade-off: The analysis aimed to identify a threshold that offers the best balance between sensitivity and specificity. - Observations: At the reported optimum, sensitivity was 0 and specificity was 1, indicating that the model, as thresholded, detects no presences at all even though it never falsely flags an absence.
Adjustment Needs: - Under-Prediction Issue: Zero sensitivity at the selected threshold means the model is under-predicting the presence of species, which could leave areas genuinely at risk of invasion without monitoring or intervention. - Balancing Act: Adjustments in the model or its threshold settings might be necessary to achieve a more balanced outcome, recovering detection of true presences while keeping false positives low.
Calibration Requirements: - Reevaluation: The results prompt a reconsideration of the model or its application thresholds to ensure that predictions are not only specific but also sensitive. - Model Retraining: This may involve retraining the model or re-deriving the threshold so that true presences are detected reliably without an unacceptable rise in false positives.
Strategic Impact: - Conservation Strategy Optimization: Accurate model calibration is crucial for optimizing conservation strategies, ensuring that efforts are targeted and effective, thereby enhancing ecological management and planning.
This section presents the findings from the spatial analysis and species distribution modeling (SDM) I conducted for invasive species across Vermont. I discuss each result in relation to the research objectives set at the beginning of this project.
I successfully mapped the distribution of various invasive species across Vermont, focusing on key species such as the Emerald Ash Borer and Hemlock Woolly Adelgid. Each species was marked with distinct colors, aiding in rapid visual assessment (Figure 2).
The visualization in Figure 2 effectively met my objective of identifying the spread and density of invasive species, allowing for a clear understanding of high-risk areas. This aids in ecological management by highlighting critical areas needing immediate attention, thus supporting my decision-making processes regarding conservation strategies.
My focused spatial analysis within Niquette Bay State Park indicated an absence of recorded invasive species, consistent with effective conservation efforts (Figure 5).
Figure 5’s targeted analysis provided essential insights into the effectiveness of the conservation strategies employed within Niquette Bay State Park. This meets the objective of offering a localized ecological assessment crucial for ongoing park management and preventive strategies against potential future invasions.
The expansion to a statewide analysis highlighted the presence of invasive species across Vermont, with detailed mappings of their specific locations and concentrations (Figure 8).
The comprehensive statewide visualization provided in Figure 8 aligns with my research objective to assess the broader impact of invasive species across Vermont. It underscores the need for strategic planning and targeted interventions, offering a macroscopic view essential for policymaking and resource allocation.
Spatial Clustering with DBSCAN and K-Means (Figures 12 and 13): These techniques helped me identify high-density clusters of invasive species occurrences. Figure 12 presents the results of DBSCAN clustering, showing concentrated areas of invasive species, while Figure 13 uses K-Means to further delineate these clusters into distinct regions based on species concentrations.
Elbow Method (Figure 14): I employed this method to determine the optimal number of clusters for the K-Means algorithm. Figure 14 plots within-cluster variance against the number of clusters; the "elbow," where adding clusters yields diminishing reductions in variance, marks the appropriate choice.
This figure illustrates the processed climatic data cropped to the geographical extents of Vermont, showing key environmental variables that affect species distribution.
The successful integration of these datasets allowed me to develop a comprehensive environmental profile for Vermont. This profile is crucial for understanding how different environmental conditions influence the distribution of invasive species. The results confirm that my methodology is effective in merging diverse data types to prepare for more detailed SDM, thereby meeting the first objective.
This figure demonstrates the logistic regression model’s ability to predict species presence based on environmental factors, highlighting the influence of various climatic variables on species distribution.
The development and rigorous validation of SDMs signify that the models are robust and reliable for forecasting species distributions under current and future environmental scenarios. This aligns with my second objective, ensuring that the models are both scientifically sound and practically relevant.
The levelplot provides a clear, color-coded representation of the likelihood of species occurrence, which helps in visualizing the geographical spread and concentration areas of invasive species.
The visual outputs have been instrumental in interpreting the complex results of the SDMs, providing accessible and actionable insights into potential species spread. These visualizations directly support conservation efforts and policy-making by highlighting critical areas for intervention. The success in creating these interpretable maps and charts confirms that I have achieved my third objective.
This study has demonstrated how the integration of spatial analysis and species distribution modeling can address the challenge of managing invasive species across Vermont. My mapping and clustering analyses met the research objectives and provided a detailed picture of the spatial distribution of invasive species within the state, with the referenced figures substantiating the textual findings. These insights support the development of informed conservation strategies by highlighting critical areas for targeted intervention, and the maps and predictive models produced here can serve as practical tools for environmental managers and policymakers deciding how to mitigate the impact of invasive species. By achieving these research objectives, this work lays a foundation for ongoing research and practical conservation initiatives and offers a template for similar ecological studies elsewhere, supporting the long-term health and resilience of Vermont's ecosystems.
Reflecting on the “Spatial Analysis of Invasive Species in Vermont” project, I recognize a mix of strengths and weaknesses that shaped the outcomes of this endeavor.
Looking back, if I were to tackle this project again, I would take a slightly different approach:
The “Spatial Analysis of Invasive Species in Vermont” project was both challenging and enlightening. It provided valuable insights into ecological modeling and the complexity of managing diverse data sets. By applying the lessons learned to future projects, I aim to refine my approach, ensuring more structured and efficient analysis and making even more meaningful ecological contributions.