Title of Project: Spatial Analysis of Invasive Species Across Vermont

Abstract

Initially intended to examine invasive species within Niquette Bay State Park, this study expanded to encompass all of Vermont because of the limited availability of observational data. A Species Distribution Model (SDM) was developed from point data on species occurrences and WorldClim climate rasters to assess the potential distribution of key invasive species across the state under prevailing environmental conditions. Complementing this, high-resolution weather data from OpenWeather for Vermont supported detailed hotspot detection and cluster analysis. This integrative approach revealed significant spatial trends and ecological impacts of invasive species, identifying critical areas requiring targeted management interventions. The study illustrates the value of adaptive research frameworks in ecological studies, demonstrating how expansive data analysis can provide comprehensive insights into species dynamics and inform effective conservation strategies. Through rigorous analysis, the project not only maps current distributions but also predicts future spread, serving as a crucial tool for ecological management and decision-making in Vermont.

Introduction and Background

Introduction

Dealing with invasive species is a critical ecological challenge in Vermont, a region known for its rich biodiversity but increasingly threatened by non-native species. This project initially focused on the impacts of invasive species in Niquette Bay State Park and was then expanded to the entire state to support a more comprehensive analysis. The presence of invasive species in Vermont disrupts local habitats, competes with native flora and fauna, and necessitates urgent and effective management strategies. The urgency and relevance of this study are underscored by the pioneering work of Jane Elith and her colleagues, who have emphasized the critical role of accurate Species Distribution Modeling (SDM) for conservation and ecological management (Elith et al., 2011). The expansion of the project is driven by the need to understand not only local impacts but also the broader ecological consequences across Vermont.

Background

Vermont’s ecosystems are at great risk from invasive species that destabilize local ecological networks and reduce biodiversity. Vermont Emergency Management reports from 2018 show that nearly one-third of Vermont’s plant species are classified as invasive. Such statistics underscore the urgent need to develop effective strategies to curb the spread and impact of these species. To accomplish this, the study uses advanced SDM techniques that integrate environmental variables with species occurrence data to provide a detailed prediction of their potential spread. This approach is consistent with methods discussed in the literature by Elith and others (Elith et al., 2010) and provides a robust framework for strategic conservation planning.

Research Question

This study is guided by a central question that addresses both the ecological and management dimensions of Vermont’s invasive species problem: “What is the potential distribution of key invasive species in Vermont under current environmental conditions?” This question directs the study toward a detailed examination of geographic and environmental data to identify the areas where invasive species are most likely to spread. The answer will help formulate strategic conservation plans that can be implemented to manage, and potentially mitigate, the impacts of these invasive species throughout the state.

Data Overview

This section provides a detailed description of the data used in the study, organized into the categories listed below. Each category plays a critical role in analyzing and modeling the distribution and impacts of invasive species throughout Vermont.

Species Observation Data

iNaturalist Vermont: Provides geo-tagged observations of invasive species. This data underpins the spatial analysis to determine current distribution patterns.
- iNaturalist Vermont
Vermont Invasive Species Database & Vermont Open Geodata Portal: Offers data on the geographic spread and management status of invasive species across Vermont.
- Vermont Invasive Species
- Vermont Open Geodata Portal - Invasive Species

Environmental Variables Data

OpenWeather: Utilized for obtaining real-time and historical weather data, integrating this with species data to examine correlations with invasion events.
- OpenWeather - Weather maps 2.0
WorldClim: Provides historical climate data essential for modeling the potential future spread of species under different climate scenarios.
- WorldClim

Data Preparation and Analysis Tools

R Packages: Used for comprehensive data manipulation, statistical analysis, and spatial modeling, crucial for SDM.
- CRAN R Project

Additional Resources for Validation and Enhancement

US Geological Survey (USGS): Supplements the primary data with additional environmental data for a broader ecological assessment.
- USGS
National Oceanic and Atmospheric Administration (NOAA): Provides long-term climate data to understand trends and forecast future ecological changes.
- NOAA

Methodology Overview

Figure 1: Methodology Flowchart

Methodology Approach Overview

This research is centered around the exploration of spatial dynamics of invasive species, with a particular focus on Agrilus planipennis (emerald ash borer) and Adelges tsugae (hemlock woolly adelgid), across the Vermont region. The goal is to use spatial analysis and Species Distribution Modeling (SDM) to understand the ecological factors that drive the spread of these species and to develop effective management and conservation strategies.
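As a rough illustration of the SDM idea described above, the sketch below fits a presence/background model to synthetic data. The raster layers (`temp`, `precip`), the simulated points, and the use of a logistic regression as the modeling algorithm are all assumptions made for illustration; they are not the data or the model used in this study.

```r
library(terra)

set.seed(42)

# Two synthetic environmental layers on a grid roughly covering Vermont
# (layer names and values are invented for illustration)
env <- rast(nrows = 50, ncols = 50, nlyrs = 2,
            xmin = -73.5, xmax = -71.4, ymin = 42.7, ymax = 45.1,
            crs = "EPSG:4326")
cell_xy <- xyFromCell(env, 1:ncell(env))
values(env) <- cbind(12 - 2.5 * (cell_xy[, "y"] - 42.7),   # cooler toward the north
                     900 + 100 * (cell_xy[, "x"] + 73.5))  # wetter toward the east
names(env) <- c("temp", "precip")

# Simulated presences drawn from the warmer (southern) cells, plus random background points
warm_cells <- which(cell_xy[, "y"] < 43.9)
pres_xy <- cell_xy[sample(warm_cells, 60), ]
back_xy <- cell_xy[sample(nrow(cell_xy), 300), ]

# Extract the environmental values at every point and label presence vs background
pts_xy <- rbind(pres_xy, back_xy)
model_data <- as.data.frame(terra::extract(env, pts_xy))[, c("temp", "precip")]
model_data$present <- c(rep(1, nrow(pres_xy)), rep(0, nrow(back_xy)))

# Logistic regression as a simple stand-in for the SDM algorithm
sdm_fit <- glm(present ~ temp + precip, data = model_data, family = binomial())

# Predict relative suitability (between 0 and 1) for every grid cell
suitability <- predict(env, sdm_fit, type = "response")
```

Calling `plot(suitability)` would map the predicted surface. With real data, the synthetic layers would be replaced by climate rasters and the simulated points by cleaned occurrence records.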

The methodology integrates the use of RStudio and GIS for the following key processes:

Tools and Processes

Data Acquisition and Preparation: Collection of extensive species occurrence and environmental data.
Exploratory Analysis: Initial analysis to identify patterns and trends.
Spatial Clustering: Identification of clusters of invasive species observations across the study area.
Species Distribution Modeling (SDM): Species distribution modeling based on environmental variables.
Visualization: Creation of visual data representations to illustrate the results.
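The spatial clustering step listed above could be sketched with the `dbscan` package, which this project loads. The coordinates below are synthetic and expressed in meters, and the `eps` and `minPts` values are illustrative assumptions rather than the settings used in the study.

```r
library(dbscan)

set.seed(1)

# Synthetic observation coordinates in meters (hypothetical); real longitude/latitude
# data would first be projected so that eps is a ground distance
cluster_a <- cbind(x = rnorm(40, mean = 0,     sd = 200), y = rnorm(40, mean = 0,     sd = 200))
cluster_b <- cbind(x = rnorm(40, mean = 10000, sd = 200), y = rnorm(40, mean = 10000, sd = 200))
scattered <- cbind(x = runif(10, -20000, 30000), y = runif(10, -20000, 30000))
obs_xy <- rbind(cluster_a, cluster_b, scattered)

# eps: neighborhood radius in meters; minPts: minimum points needed to form a dense core
db <- dbscan(obs_xy, eps = 500, minPts = 5)

# Cluster label 0 marks noise points that belong to no cluster
cluster_sizes <- table(db$cluster)
print(cluster_sizes)
```

Density-based clustering does not need the number of clusters in advance, which suits occurrence data where the number of hotspots is unknown.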

Recognized Limitations

Data dependence: The depth of the study depends on the availability of data and the accuracy of the models.
Ecosystem dynamics: Continuous data collection and model refinement are necessary to adapt to ecological changes.
Conservation flexibility: Management strategies need to be adapted as new knowledge emerges and environmental conditions change.

# ----------------------------------------------------------------------
## Step 1: Set up Environment for Spatial Analysis
# ----------------------------------------------------------------------

# Set system environment variables for the 'sf' package. These variables specify the directories 
# where the GDAL and PROJ data files are located, which are necessary for spatial data operations in R.
Sys.setenv(GDAL_DATA = "C:/OSGeo4W/share/gdal")   # Location of GDAL data files
Sys.setenv(PROJ_LIB = "C:/OSGeo4W/share/proj")    # Location of PROJ data files
Sys.setenv(PATH = paste("C:/OSGeo4W/bin", Sys.getenv("PATH"), sep=";"))  # Add GDAL binaries to system PATH

# Check the versions of GDAL, GEOS, and PROJ used by the 'sf' package. This command outputs the 
# versions of these libraries to ensure they are correctly loaded and compatible.
sf::sf_extSoftVersion()

# ----------------------------------------------------------------------
## Step 2: Load Necessary Libraries
# ----------------------------------------------------------------------

# Function to check and install any missing packages
install_if_missing <- function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
}

# List of required packages
packages <- c("dismo", "raster", "sp", "readr", "dplyr", "ggplot2", 
              "terra", "sf", "tmap", "tmaptools", "lubridate", "stringr", 
              "rasterVis", "RANN", "rJava", "predicts", "akima", "gstat",
              "tidyverse", "dbscan", "maps", "plotly", "rnaturalearth")

# Load all required libraries
invisible(sapply(packages, install_if_missing))

# Message to confirm loading
message("All required libraries are loaded successfully!")

library(dismo)      # Tools for species distribution modeling
library(raster)     # Provides an interface for handling and analyzing raster data
library(sp)         # Handles spatial data frames
library(readr)      # Efficient reading and writing of data files
library(dplyr)      # Data manipulation within the tidyverse ecosystem
library(ggplot2)    # Creation of complex plots from data frames
library(terra)      # Enhanced raster data analysis and manipulation methods
library(sf)         # Handling of spatial data frames, integrating powerful libraries like GDAL and PROJ
library(tmap)       # Thematic maps designed for spatial data visualization
library(tmaptools)  # Additional tools for working with 'tmap' package
library(lubridate)  # Date-time data manipulation
library(stringr)    # Manipulation of strings
library(rasterVis)  # Visualization tools for raster data
library(RANN)       # Nearest neighbor search and classification
library(rJava)      # Integration of Java within R, enabling Java-based operations
library(predicts)   # Spatial prediction and species distribution modeling tools built on 'terra'
library(akima)      # For interpolation of irregularly spaced data
library(gstat)      # For spatial data analysis and geostatistics
library(tidyverse)  # Comprehensive suite of packages for data manipulation and visualization
library(dbscan)     # Implements the DBSCAN clustering algorithm for spatial data analysis
library(maps)       # For creating geographical maps
library(plotly)     # Interactive plotting and graphical tools
library(rnaturalearth)  # Tools for accessing natural earth map data


# ----------------------------------------------------------------------
## Step 3: Define Base Path for Data Files
# ----------------------------------------------------------------------

# Set the base directory
base_directory <- "D:/GEOG_588/SDM_Invasive_species"

# Set the base directory as the working directory
setwd(base_directory)

# ----------------------------------------------------------------------
## Step 4: Read Species and Observation Data
# ----------------------------------------------------------------------

# Reading in species and observation data
species_data <- read_csv("cleaned_invasive_species.csv")
observation_data <- read_csv("cleaned_Observation_VT.csv")

# ----------------------------------------------------------------------
## Step 5: Convert Data to Spatial Objects and Perform Spatial Join
# ----------------------------------------------------------------------

# Convert data to spatial objects while preserving original longitude and latitude
species_data_sf <- st_as_sf(species_data, coords = c("longitude", "latitude"), crs = 4326, remove = FALSE)
observation_data_sf <- st_as_sf(observation_data, coords = c("longitude", "latitude"), crs = 4326, remove = FALSE)

# Spatial join of datasets with clear suffixes for overlapping columns
master_observation_list_sf <- st_join(observation_data_sf, species_data_sf, join = st_nearest_feature, left = TRUE, suffix = c(".obs", ".spec"))
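
A nearest-feature join always returns a match, however far away the nearest record is, so it can help to record the distance to each match. The sketch below shows one way to do this on two toy point layers. The coordinates, the `obs_id` column, and the 5 km threshold are hypothetical and only illustrate the idea.

```r
library(sf)

# Two toy point layers (hypothetical coordinates and column names)
obs <- st_as_sf(data.frame(obs_id = 1:3,
                           longitude = c(-73.19, -72.58, -72.90),
                           latitude  = c(44.59, 44.26, 43.61)),
                coords = c("longitude", "latitude"), crs = 4326)
spp <- st_as_sf(data.frame(invasive_name = c("Emerald Ash Borer", "Hemlock Woolly Adelgid"),
                           longitude = c(-73.20, -72.60),
                           latitude  = c(44.60, 43.00)),
                coords = c("longitude", "latitude"), crs = 4326)

# Index of the nearest species record for every observation
nearest_idx <- st_nearest_feature(obs, spp)

# Attach the matched attributes, then the distance in meters to each match
joined <- st_join(obs, spp, join = st_nearest_feature)
joined$dist_m <- as.numeric(st_distance(obs, spp[nearest_idx, ], by_element = TRUE))

# Flag matches beyond an illustrative 5 km threshold as unreliable
joined$match_ok <- joined$dist_m <= 5000
print(joined)
```

Rows with `match_ok == FALSE` could then be reviewed or dropped before further analysis.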

# ----------------------------------------------------------------------
## Step 6: Rename Columns and Convert Spatial Object to Data Frame
# ----------------------------------------------------------------------

# After join, rename columns to clearly indicate their source
master_observation_list_sf <- master_observation_list_sf %>%
  rename(
    latitude_obs = latitude.obs,
    longitude_obs = longitude.obs,
    latitude_spec = latitude.spec,
    longitude_spec = longitude.spec
    # Add other renames as necessary
  )

# Extract longitude and latitude from the geometry column (coordinates computed once)
obs_coords <- st_coordinates(master_observation_list_sf)
master_observation_list_sf$longitude <- obs_coords[, 1]
master_observation_list_sf$latitude <- obs_coords[, 2]

# Convert spatial object back to a data frame; longitude, latitude, and geometry are retained
master_observation_list <- as.data.frame(master_observation_list_sf)

# ----------------------------------------------------------------------
## Step 7: Save Spatial Data as GeoPackage and Regular Data as CSV
# ----------------------------------------------------------------------

# Make sure the output directory exists, then save the spatial data with geometry as a GeoPackage
dir.create("output", showWarnings = FALSE)
write_sf(master_observation_list_sf, "output/Master_Observation_List_with_NA.gpkg", layer = "observations", driver = "GPKG")

# Convert the spatial data frame to a regular data frame for saving as CSV
master_observation_list <- as.data.frame(master_observation_list_sf)

# Remove geometry column for the non-spatial CSV version
master_observation_list <- master_observation_list %>%
  dplyr::select(-geometry)

# Save as CSV without the geometry column
write_csv(master_observation_list, "output/Master_Observation_List_with_NA.csv")

# ----------------------------------------------------------------------
## Step 8: Loading the Geometry Master List
# ----------------------------------------------------------------------

# 8.1: Load Spatial Dataframe
master_observation_list_sf <- read_sf("output/Master_Observation_List_with_NA.gpkg")

# 8.2: Check for Missing Geometry Data (st_is_empty() flags features with empty geometries)
if (any(st_is_empty(master_observation_list_sf))) {
  stop("Geometry data missing in master_observation_list_sf")
}

# 8.3: Load Species Data
species_data <- read_csv("cleaned_invasive_species.csv")

# 8.4: Convert Species Data to Spatial Data Frame
species_data_sf <- st_as_sf(species_data, coords = c("longitude", "latitude"), crs = 4326)

# Load the required packages
library(maps)

# 8.5: Fetch Vermont map data
vermont_map <- map_data("state", region = "vermont")

# Check the structure of the map data
print(head(vermont_map))

# Convert map data to data frame
vermont_df <- as.data.frame(vermont_map)

# Print the structure of the data frame
print(head(vermont_df))

# 8.6: Convert Data Frame to Spatial Format (sf object)
vermont_sf <- st_as_sf(vermont_df, coords = c("long", "lat"), crs = 4326) %>%
  st_combine() %>%
  st_sfc() %>%
  st_cast("POLYGON") %>%
  st_cast("MULTIPOLYGON")

# 8.7: Create a Single sf Object Containing Vermont's Geometry
vermont_sf <- st_sf(geometry = vermont_sf)
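
As an alternative sketch for Steps 8.5 to 8.7, the state polygon can be built directly from the `maps` database, because `sf` provides an `st_as_sf()` method for `map` objects. The object name `vermont_alt_sf` is hypothetical, and this is not the approach used in the code above.

```r
library(sf)
library(maps)

# fill = TRUE returns the outline as a closed polygon; maps::map is called with its
# namespace because purrr (loaded by tidyverse) also exports a function named map
vt_map <- maps::map("state", regions = "vermont", plot = FALSE, fill = TRUE)

# Convert the 'map' object to an sf object holding the state polygon
vermont_alt_sf <- st_as_sf(vt_map)

# Basic checks on the result
print(st_geometry_type(vermont_alt_sf))
print(st_bbox(vermont_alt_sf))
```

This avoids casting a sequence of points into a polygon by hand, which relies on the boundary vertices arriving in the right order.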

Spatial Distribution of Invasive Species Across Vermont

Overview: The map presented below illustrates the spatial distribution of various invasive species across Vermont, clearly delineated by the state’s geographic boundaries.

Features:
- Color-Coding: Each invasive species is marked with a distinct color, providing a rapid visual reference that aids in assessing biodiversity issues related to invasive species.
- Geographic Clarity: The state boundaries are prominently displayed, ensuring that the spatial context of the data is immediately apparent.

Purpose: This visualization is designed to aid in the quick assessment of biodiversity issues by providing an easily interpretable graphical representation of data, which is crucial for ecological management and decision-making processes.

# 8.8: Plot Vermont's Boundary
vermont_plot <- ggplot(vermont_sf) +
  geom_sf(fill = "lightblue", color = "darkblue") +
  ggtitle("Boundary of Vermont")
print(vermont_plot)  # Display the boundary plot

# 8.9: Visualize Distribution of Invasive Species
if ("invasive_name" %in% colnames(species_data)) {
  species_map <- ggplot(data = species_data_sf) +
    geom_sf(aes(color = invasive_name)) +
    geom_sf(data = vermont_sf, fill = NA, color = "black") +
    labs(title = "Distribution of Invasive Species",
         subtitle = "Spatial Distribution of Invasive Species Occurrences",
         x = "Longitude", y = "Latitude") +
    theme_minimal() +
    theme(legend.position = "right", 
          legend.justification = "top") # Adjusting legend position
  print(species_map)
} else {
  cat("Column 'invasive_name' does not exist in the dataset. Check column names with colnames(species_data_sf).\n")
}

Figure 2: Spatial distribution of invasive species in Vermont.

# Boundary of Vermont Plot: This is a straightforward representation of Vermont’s state boundary. It helps one understand the physical outline and geographical location of Vermont within a larger map or context.
# Distribution of Invasive Species Plot: This plot provides insight into where different invasive species have been identified within Vermont. 
# By differentiating species by color, it allows for a quick visual assessment of biodiversity concerns and can help in environmental management and conservation efforts. 
# This plot not only shows where invasive species are located but also indicates areas potentially at risk of ecological impacts due to these species.


# 8.10: Print Column Names from Master Observation List
print(colnames(master_observation_list_sf))

Figure 2: Spatial Analysis of Invasive Species Spread in Vermont

Overview: This map visualizes the spatial distribution and density of invasive species across Vermont from 2010 to 2020. Data points, differentiated by color, represent specific locations where various species categories have been observed. The data, sourced from iNaturalist Vermont, the Vermont Invasive Species Database, and the Vermont Open Geodata Portal, highlight areas with rising occurrences and potential hotspots.

Summary of Spatial Analysis Workflow

Introduction: This R script provides a comprehensive approach to preparing and analyzing geographical data, targeting the spread of invasive species in Vermont. Each step builds logically on the previous, ensuring a thorough analysis.

Setting up the Environment for Spatial Analysis:
Configures the environment to recognize essential geographical data libraries such as GDAL and PROJ, crucial for enabling subsequent handling of spatial data.
Loading Necessary Libraries:
Loads various R packages that support data manipulation, visualization, and spatial analysis, ensuring all necessary tools are available.
Defining the Base Path for Data Files:
Establishes a base directory for the project’s data files, streamlining access and clarifying their locations within the system.
Reading Species and Observation Data:
Imports key data on invasive species occurrences and observations from CSV files into R, forming the basis for detailed analysis.
Converting Data to Spatial Objects and Performing Spatial Join:
Converts the imported data into spatial objects, enabling manipulation of their geographical properties. Performs a spatial join, merging data based on geographic proximity.
Renaming Columns and Converting Spatial Object to Data Frame:
Refines column names in the joined data for clarity, and converts the spatial object into a data frame for easier manipulation in tabular form.
Saving Spatial Data as GeoPackage and Regular Data as CSV:
Saves the processed spatial data in both GeoPackage and CSV formats to accommodate different analysis needs and software compatibility.
Loading the Geometry Master List:
Loads the previously saved GeoPackage, checking for completeness and integrity of the spatial data, essential for accurate analysis.
Visualizing and Checking Data:
Generates a visual representation of Vermont’s map and its boundaries to confirm the accuracy of the spatial data setup.
Conclusion:
This structured approach facilitates analysis of invasive species distribution and aids in projecting their future spread, providing critical insights for ecological management strategies.

Exploratory Data Analysis (EDA)

From Data Acquisition to Exploratory Data Analysis (EDA)

Transition: Having successfully collected and prepared the dataset, the exploratory data analysis (EDA) phase begins, aiming to provide initial insights that can guide further analysis.

Purpose of EDA:
- Understanding the Dataset: Crucial for revealing the structure, patterns, and potential relationships within the data.
- Visualization Techniques: Focuses on summarizing the data and creating various visualizations, such as maps, to explore geographical and spatial distributions, identify trends, detect anomalies, and investigate areas requiring further exploration.

Impact of EDA:
- Systematic Interpretation: Ensures systematic interpretation of the rich information contained in the dataset, enhancing understanding and decision-making processes.
- Foundation for Ongoing Research: The insights gained are critical to ongoing research and analysis, particularly in understanding the spatial dynamics of invasive species.
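
A first EDA summary of the kind described above might tabulate record counts and latitude ranges per species. The sketch below uses a small invented table; in practice the input would be the species data with its `invasive_name`, `latitude`, and `longitude` columns.

```r
library(dplyr)

# Hypothetical records standing in for the real species data
toy_species <- tibble(
  invasive_name = c("Emerald Ash Borer", "Emerald Ash Borer", "Hemlock Woolly Adelgid",
                    "Emerald Ash Borer", "Hemlock Woolly Adelgid"),
  latitude  = c(44.59, 44.26, 42.85, 44.48, 43.00),
  longitude = c(-73.19, -72.58, -72.56, -73.21, -72.60)
)

# Number of records and latitudinal range per species, most frequent first
species_summary <- toy_species %>%
  group_by(invasive_name) %>%
  summarise(n_records = n(),
            lat_min = min(latitude),
            lat_max = max(latitude),
            .groups = "drop") %>%
  arrange(desc(n_records))

print(species_summary)
```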

Filter species data for specific invasive species and plot their distribution within Vermont

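# ---------------------------------------------------------------------------------------------------------
# Step 9: Filtering Species Data and Visualizing Geographic Features
# ---------------------------------------------------------------------------------------------------------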

# Filter species data for specific invasive species and plot their distribution within Vermont
target_species <- c("Emerald Ash Borer", "Hemlock Woolly Adelgid")
filtered_species_data_sf <- species_data_sf %>%
  filter(invasive_name %in% target_species)

if (nrow(filtered_species_data_sf) > 0) {
  species_map_filtered <- ggplot(data = filtered_species_data_sf) +
    geom_sf(aes(color = invasive_name)) +
    geom_sf(data = vermont_sf, fill = NA, color = "black") +
    labs(title = "Target Invasive Species Distribution",
         subtitle = "Emerald Ash Borer and Hemlock Woolly Adelgid",
         x = "Longitude", y = "Latitude") +
    scale_color_manual(values = c("Emerald Ash Borer" = "darkgreen", "Hemlock Woolly Adelgid" = "darkmagenta")) +
    theme_minimal()
  print(species_map_filtered)
} else {
  cat("No data found for the specified invasive species.\n")
}

Figure 3: Filtered distribution of specific invasive species in Vermont.

Figure 3: Geographical Distribution of Emerald Ash Borer and Hemlock Woolly Adelgid in Vermont

Overview: This map visualizes the geographical spread of two invasive species, the Emerald Ash Borer and Hemlock Woolly Adelgid, across Vermont. Observation locations are marked with distinguishing colors to underscore areas currently impacted or at risk.

Utility:
- Ecological Management: Facilitates the identification of hotspots for targeted intervention.
- Stakeholder Engagement: Serves as a clear, accessible representation of data for stakeholders involved in Vermont’s environmental health.

Design Effectiveness:
- Communication: Ensures that essential information is communicated effectively, promoting understanding and engagement in the state’s conservation strategies.

What This Figure Represents:

Description: This visualization depicts the geographical spread of the Emerald Ash Borer and Hemlock Woolly Adelgid in Vermont.

Details:
- Observation Marking: The map distinctly marks locations where these species have been observed, emphasizing regions currently impacted or vulnerable to future invasions.

Why This Is Important:

Significance: The map’s utility lies in its ability to aid environmental conservation and resource management.

Impact:
- Forest Health and Biodiversity: Both invasive species pose significant threats to forest health and biodiversity.
- Proactive Measures: By charting their presence, this tool supports proactive measures in controlling the infestation and provides valuable insights for ongoing ecological monitoring efforts.

Distribution of Emerald Ash Borer and Hemlock Woolly Adelgid in Vermont

Focus: Advanced Geographic Filtering and Visualization

Objective: Implement advanced geographic filtering to refine visualization and focus on critical areas more effectively.

Outcome:
- Enhanced Clarity: Improves the visualization of impacted regions, allowing for more precise management actions and resource allocation.

# ---------------------------------------------------------------------------------------------------------
# Step 10: Advanced Geographic Filtering and Visualization
# ---------------------------------------------------------------------------------------------------------

# Print new column names to confirm
print(colnames(species_data))

# Define target species for detailed analysis
target_species <- c("Emerald Ash Borer", "Hemlock Woolly Adelgid")

# Filter the dataset to include only the specified invasive species
filtered_species_data_sf <- species_data_sf %>%
  filter(invasive_name %in% target_species)

# Check if any records are found for the specified species and plot their spatial distribution
if (nrow(filtered_species_data_sf) > 0) {
  species_map_filtered <- ggplot(data = filtered_species_data_sf) +
    geom_sf(aes(color = invasive_name)) +  # Ensure this matches the actual column name
    geom_sf(data = vermont_sf, fill = NA, color = "black") +  # Overlay Vermont state boundary for reference
    labs(title = "Target Invasive Species Distribution",
         subtitle = "Emerald Ash Borer and Hemlock Woolly Adelgid",
         x = "Longitude", y = "Latitude") +
    scale_color_manual(values = c("Emerald Ash Borer" = "darkgreen", "Hemlock Woolly Adelgid" = "darkmagenta")) +
    theme_minimal()
  
  print(species_map_filtered)
} else {
  cat("No data found for the specified invasive species.\n")
}

Figure 4: Distribution of Emerald Ash Borer and Hemlock Woolly Adelgid in Vermont.

# Specific location - Niquette Bay State Park

# Define the coordinates for Niquette Bay State Park
park_lat <- 44.5919
park_lon <- -73.1900

# Define a buffer around the park coordinates to create a more accurate representation of the park's area
buffer_deg_lat <- 0.0078  # Latitude offset; about 0.87 km at roughly 111 km per degree
buffer_deg_lon <- 0.0109  # Longitude offset; about 0.86 km at this latitude (roughly 79 km per degree)

# Create a matrix defining the park's boundary polygon based on the buffered coordinates
park_polygon_coords <- matrix(
  c(park_lon - buffer_deg_lon, park_lat - buffer_deg_lat,
    park_lon + buffer_deg_lon, park_lat - buffer_deg_lat,
    park_lon + buffer_deg_lon, park_lat + buffer_deg_lat,
    park_lon - buffer_deg_lon, park_lat + buffer_deg_lat,
    park_lon - buffer_deg_lon, park_lat - buffer_deg_lat),  # Close the polygon by repeating the first point
  ncol = 2, byrow = TRUE)

# Convert the polygon matrix into a spatial object (sf) for use in geographic operations
study_area_sf <- st_sf(geometry = st_sfc(st_polygon(list(park_polygon_coords)), crs = st_crs(4326)))

# Filter the species dataset to only include observations within the park's boundary
filtered_species_in_park <- st_intersection(species_data_sf, study_area_sf)

# Plot the filtered dataset 
if (nrow(filtered_species_in_park) > 0) {
  park_species_map <- ggplot(data = filtered_species_in_park) +
    geom_sf(aes(color = invasive_name)) +  # Same species column as in Steps 8.9 and 9
    labs(title = "Invasive Species within Niquette Bay State Park",
         subtitle = "Filtered to park boundaries",
         x = "Longitude", y = "Latitude") +
    theme_minimal()
  
  print(park_species_map)
} else {
  message("No invasive species data found within the park boundaries.")
}


# Advanced Geographic Filtering and Visualization
# Summary: Here, the code delves deeper into Niquette Bay State Park, filtering invasive species data specifically for this location. 
# By focusing on the park's boundaries, it offers insights crucial for park management and conservation strategies. 
# The plotted distribution of invasive species within the park aids in understanding local ecological impacts and informs conservation decisions.

# While both steps involve filtering species data and visualizing their distribution, 
# Step 10 provides a more detailed analysis by focusing on a specific location, Niquette Bay State Park. Unlike Step 9, which maps the distribution of the selected invasive species across Vermont as a whole, 
# Step 10 zooms in on the park's boundaries for a more localized perspective. 
# This allows for a targeted assessment of invasive species presence within the park, offering insights tailored to park management and conservation efforts.
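
The size of the park rectangle can be checked by converting the degree offsets to kilometers. The sketch below assumes a spherical Earth with a mean radius of 6371 km, which is an approximation introduced here for the check.

```r
# Park latitude and degree offsets as used in the park definition
park_lat <- 44.5919
buffer_deg_lat <- 0.0078
buffer_deg_lon <- 0.0109

# Approximate kilometers per degree (spherical Earth, mean radius 6371 km; an assumption)
km_per_deg_lat <- 2 * pi * 6371 / 360
km_per_deg_lon <- km_per_deg_lat * cos(park_lat * pi / 180)  # shrinks with latitude

# Half-height and half-width of the rectangle in kilometers
half_height_km <- buffer_deg_lat * km_per_deg_lat
half_width_km  <- buffer_deg_lon * km_per_deg_lon

print(round(c(half_height_km = half_height_km, half_width_km = half_width_km), 2))
```

Both offsets come out at a little under 0.9 km, so the rectangle is roughly 1.7 km on each side.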

Figure 4: Invasive Species Observations within Niquette Bay State Park

Overview: This detailed map highlights the filtered observations of invasive species specifically within the boundaries of Niquette Bay State Park. The visualization focuses on the distribution of the Emerald Ash Borer and Hemlock Woolly Adelgid.

Features:
- Color Coding: Each observation is color-coded to distinguish between the species, facilitating immediate visual identification of the areas most affected.
- Targeted Assessment: Provides a refined view that is critical for local ecological assessment and park management.

Utility:
- Conservation Strategy Development: Instrumental for park authorities in developing precise conservation strategies.
- Priority Setting: Helps prioritize efforts to control infestations, thereby protecting the park’s natural biodiversity and forest health.

Comparison with Previous Step (Step 9)

Objective: Delineate the distinctions between the previous step and the current mapping efforts.

Key Differences:
- Focus of Analysis: While both steps involve filtering species data and visualizing their distribution, Step 10 offers a more detailed analysis by concentrating specifically on Niquette Bay State Park.
- Scope of Visualization: Unlike Step 9, which mapped the distribution of selected invasive species across Vermont as a whole, Step 10 zooms in on the park’s boundaries for a more localized perspective.

Implications:
- Targeted Assessment: Allows for a targeted assessment of invasive species presence within the park, offering insights tailored to park management and conservation efforts.

Mapping Invasive Species Observations within the Park Area

Park Area Definition and Spatial Analysis Setup:

Objective: Define the park area using spatial data tools and set up the analysis for mapping invasive species observations within Niquette Bay State Park.

Approach:
- Spatial Definition: Creates a spatial feature object to define the park area with precise boundaries.
- Data Filtering: Filters invasive species observations to those specifically within the park, enhancing the relevance of the data for local management decisions.
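
One possible variant of the spatial definition and filtering steps is to buffer the park point in a projected coordinate system, so the radius is given in meters, and then keep only the observations that intersect the buffer. In the sketch below, the UTM zone 18N projection (EPSG:32618), the 1 km radius, and the toy observations are illustrative assumptions, not part of the analysis in this report.

```r
library(sf)

# Park center used in this report (WGS84 longitude/latitude)
park_pt <- st_sfc(st_point(c(-73.1900, 44.5919)), crs = 4326)

# Buffer in a projected CRS so that dist is in meters, then return to WGS84
park_utm    <- st_transform(park_pt, 32618)
park_buffer <- st_transform(st_buffer(park_utm, dist = 1000), 4326)

# Hypothetical observations: one near the park center, one far to the southeast
toy_obs <- st_as_sf(data.frame(id = 1:2,
                               lon = c(-73.1895, -72.5754),
                               lat = c(44.5921, 44.2601)),
                    coords = c("lon", "lat"), crs = 4326)

# Keep only the observations that intersect the buffer
obs_in_park <- st_filter(toy_obs, park_buffer)
print(obs_in_park)
```

A true park boundary polygon, where available, would be preferable to either a rectangle or a circular buffer.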

# Description:
# This section defines the park area as a spatial object using specified boundaries and a coordinate reference system (CRS).
# It transforms the CRS of the species data to match that of Vermont and the park boundary for accurate spatial analysis.
# Finally, it creates a ggplot map displaying the Vermont state boundary, the park area, and species observations for context.

# Define the park area as an sf object with specified boundaries and coordinate reference system
park_area_sf <- st_sf(geometry = st_sfc(st_polygon(list(park_polygon_coords)), crs = st_crs(4326)))

# Transform the CRS of the species data to match that of Vermont and park boundary for accurate spatial analysis
species_data_sf <- st_transform(species_data_sf, st_crs(vermont_sf))
park_area_sf <- st_transform(park_area_sf, st_crs(vermont_sf))

# Create a ggplot map displaying the Vermont state boundary, the park area, and species observations for context
species_v2_map <- ggplot() +
  geom_sf(data = vermont_sf, fill = NA, color = "black", size = 0.5) +
  geom_sf(data = park_area_sf, fill = NA, color = "red", size = 0.5) +
  geom_sf(data = species_data_sf, color = "blue", size = 1.5, alpha = 0.6) +
  labs(title = "Invasive Species Observations in Niquette Bay State Park",
       subtitle = "A Geographic Overlay with Vermont State Boundary",
       x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Display the map to show results
print(species_v2_map)

Figure 5: Invasive species observations across Vermont with the Niquette Bay State Park boundary.

# What It Shows: This map indicates where invasive species have been found across Vermont. It is designed to particularly highlight the presence or absence of these species within the boundaries of Niquette Bay State Park.
# Key Takeaway: Despite the broader issue of invasive species across the state, as indicated by the numerous blue dots, Niquette Bay State Park appears to be free from these observations, 
# which is implied by the absence of blue dots within the park's red boundary. This could suggest that conservation efforts within the park are effective, or that further monitoring is needed to confirm these findings.
# Why This Matters: Invasive species can significantly impact native ecosystems and biodiversity. This map is a tool for ecologists, conservationists, and park managers to understand where to focus their efforts to control and monitor invasive species.

Figure 5: Invasive Species Observations in Niquette Bay State Park

Overview: This map overlays invasive species observations within the state of Vermont, focusing specifically on Niquette Bay State Park. The state boundary is outlined for context, and observations of the Emerald Ash Borer and Hemlock Woolly Adelgid are marked.

Insights:
- Local vs. State Challenges: Offers a visual representation of the ecological challenges at both the state and local park level.
- Conservation Effectiveness: The absence of marked observations within the red boundary of the park may reflect effective conservation efforts, but it may equally reflect limited survey coverage, so further ecological monitoring is needed.

Utility:
- Tool for Action: This visualization helps direct conservation and control efforts to where they are most needed, with the aim of safeguarding Vermont’s native ecosystems and biodiversity.

Key Takeaway:
- Despite the widespread presence of invasive species across the state, indicated by numerous blue dots, no observations fall within the park’s red boundary. This is consistent with effective conservation efforts but does not establish them; continued monitoring is needed to confirm the finding.

Why This Matters:
- Invasive species significantly impact native ecosystems and biodiversity. This map helps ecologists, conservationists, and park managers focus control and monitoring efforts where they are most needed.

Filtering and Mapping Invasive Species Observations within the Park Area

Park Area Definition and Spatial Filtering:

Objective:
- This section creates a new sf (spatial feature) object to define the park area with updated boundaries.

Process:
- Spatial Intersection: Performs a spatial intersection to filter observations within the park.
- Visualization: Checks and plots the filtered observations if available, showing whether any recorded observations fall inside the park.
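
The filtering step can be sketched on toy data as follows. The square "park" and the three observations are invented for illustration, and st_filter() is shown as one possible alternative to st_intersection(); it is not the call used in this project's code.

```r
library(sf)

# Hypothetical square "park" and three toy observations, all in EPSG:4326.
park <- st_sf(geometry = st_sfc(st_polygon(list(rbind(
  c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)
))), crs = 4326))

obs <- st_as_sf(
  data.frame(invasive_name = c("A", "B", "C"),
             lon = c(0.5, 0.2, 2.0),
             lat = c(0.5, 0.8, 2.0)),
  coords = c("lon", "lat"), crs = 4326
)

# Both layers must share a CRS before a spatial predicate is evaluated.
stopifnot(st_crs(obs) == st_crs(park))

# st_filter() keeps the rows of `obs` that intersect `park` and leaves their
# columns unchanged; st_intersection() would also append the polygon's attributes.
obs_in_park <- st_filter(obs, park)

print(obs_in_park$invasive_name)
```

For point-in-polygon selection the two functions keep the same points; the practical difference is that st_filter() returns the observations without any columns from the park layer.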

# ---------------------------------------------------------------------------------------------------------
# Step 12: Filtering and Mapping Invasive Species Observations within the Park Area
# ---------------------------------------------------------------------------------------------------------

# Description:
# This section creates a new sf object to define the park area with updated boundaries.
# It performs spatial intersection to filter observations within the park.
# Finally, it checks and plots the filtered observations if available.

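# Create a new sf object for the park area. The coordinates are longitude/latitude (EPSG:4326, as in the earlier definition),
# so declare that CRS first and then reproject to match the filtered species data
park_area_sf <- st_sf(geometry = st_sfc(st_polygon(list(park_polygon_coords)), crs = st_crs(4326)))
park_area_sf <- st_transform(park_area_sf, st_crs(filtered_species_data_sf))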

# Perform spatial intersection to filter observations within the park
filtered_species_in_park <- st_intersection(filtered_species_data_sf, park_area_sf)

# Check and plot filtered observations if available
if (nrow(filtered_species_in_park) > 0) {
  park_species_map <- ggplot(data = filtered_species_in_park) +
    geom_sf(aes(color = invasive_name)) +
    labs(title = "Invasive Species within Niquette Bay State Park",
         subtitle = "Filtered to park boundaries",
         x = "Longitude", y = "Latitude") +
    theme_minimal()
  
  print(park_species_map)
} else {
  message("No invasive species data found within the park boundaries.")
}

# Understanding the Results: The spatial analysis aimed to identify and map invasive species specifically within Niquette Bay State Park. The code was set up to visualize this information.
# Key Message: The analysis concludes that no invasive species have been detected within the park boundaries according to the available data.
# Significance: The absence of invasive species within the park could indicate effective conservation efforts. However, it's also a reminder that monitoring and data collection are essential for accurate ecological assessments.
# Next Steps: The result calls for ongoing ecological monitoring to ensure that invasive species have not been overlooked and to maintain the health of the park's ecosystem.

Analysis Summary for Filtering and Mapping Invasive Species Observations within the Park Area

Overview: The spatial analysis was conducted to identify and map the presence of invasive species specifically within Niquette Bay State Park.

Key Findings:
- Result: The analysis concludes that no invasive species have been detected within the park boundaries according to the available data.
- Significance: This outcome may reflect effective conservation efforts. Nonetheless, it underscores the necessity of continued monitoring and comprehensive data collection to maintain accurate ecological assessments.

Next Steps:
- Ongoing Monitoring: Calls for persistent ecological monitoring to verify that invasive species have not been overlooked and to sustain the health of the park’s ecosystem.

Analyzing Invasive Species Distribution within Park Boundaries

Objective:
- This section aims to identify and analyze the presence of invasive species within the specified boundaries of Niquette Bay State Park, focusing on assessing their density and distribution.

Approach:
- Visualization: Includes visualizing invasive species data within the park boundary to provide clear management insights.
- Expansion of Analysis: If no observations of invasive species are found within the park, the analysis is extended to encompass the entire state of Vermont, ensuring a comprehensive assessment of the ecological threat posed by invasive species.

Management Implications:
- Data-Driven Decisions: The findings aid in making informed decisions about the conservation and management strategies necessary to protect the park and potentially similar ecosystems throughout Vermont.
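
The park-then-state fallback described in the approach could be expressed as a small helper, sketched below. The function select_scope(), the square() constructor, and all geometries are hypothetical and exist only for this illustration.

```r
library(sf)

# Hypothetical helper: return the park-level subset when it has observations,
# otherwise fall back to the statewide subset and record which scope was used.
select_scope <- function(obs, park, state) {
  in_park <- st_filter(obs, park)
  if (nrow(in_park) > 0) {
    return(list(scope = "park", data = in_park))
  }
  list(scope = "state", data = st_filter(obs, state))
}

# Toy square polygons in EPSG:4326 (not real boundaries).
square <- function(xmin, ymin, xmax, ymax) {
  st_sf(geometry = st_sfc(st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  ))), crs = 4326))
}
state <- square(0, 0, 10, 10)
park  <- square(1, 1, 2, 2)

# Two toy observations, both outside the park but inside the state.
obs <- st_as_sf(data.frame(lon = c(5, 7), lat = c(5, 3)),
                coords = c("lon", "lat"), crs = 4326)

result <- select_scope(obs, park, state)
print(result$scope)
print(nrow(result$data))
```

Returning the scope alongside the data keeps later plot titles and summaries honest about whether they describe the park or the whole state.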

# ---------------------------------------------------------------------------------------------------------
# Step 13: Analyzing Invasive Species Distribution within Park Boundaries
# ---------------------------------------------------------------------------------------------------------

# Description:
# This section identifies and analyzes invasive species within the specified park boundaries.
# It assesses the density and distribution of invasive species within the park.
# Additionally, it visualizes invasive species data within the park boundary and provides management insights.
# If no observations of invasive species are found within the park, it expands the analysis to the entire state of Vermont.

# Identify and analyze invasive species within the specified park boundaries.
species_in_park <- st_intersection(species_data_sf, park_area_sf)

# Assess the density and distribution of invasive species within park boundaries.
if (nrow(species_in_park) > 0) {
  # Extract longitude and latitude from the geometry column for plotting.
  species_in_park$longitude <- st_coordinates(species_in_park)[, 1]
  species_in_park$latitude <- st_coordinates(species_in_park)[, 2]
  
  # Visualize the density and specific locations of invasive species using a 2D density plot combined with point data.
  ggplot(data = species_in_park) +
    geom_density_2d(aes(x = longitude, y = latitude, color = invasive_name)) +
    geom_point(aes(x = longitude, y = latitude, color = invasive_name), size = 1, alpha = 0.6) +
    labs(title = "Density and Spread of Invasive Species within Niquette Bay State Park") +
    theme_minimal()
} else {
  # Display a message and an empty plot if no invasive species are found within the park.
  message_text <- "No invasive species found within Niquette Bay State Park."
  ggplot() +
    annotate("text", x = 0.5, y = 0.5, label = message_text, size = 6, color = "red", hjust = 0.5, vjust = 0.5) +
    labs(title = "No Invasive Species Found") +
    theme_void()
}

Figure 6: Density and Spread of Invasive Species within Niquette Bay State Park

# Final Confirmation: This step acts as a final confirmation that there are indeed no invasive species detected within the boundaries of Niquette Bay State Park based on the data provided.
# Visualization of Absence: Rather than leaving the result implicit, the code explicitly visualizes the absence of data through an annotated message within a plot. 
# This reinforces the finding visually, which can be particularly impactful for stakeholders.
# Implications for Park Management: The repeated absence of invasive species findings is a positive sign for the park's ecosystem but also suggests that ongoing monitoring is essential. 
# It indicates that either the park's environmental management practices are effective or that there might be gaps in data collection.
# Cautious Interpretation: A novice reader is reminded that absence of evidence is not necessarily evidence of absence. 
# It’s crucial to maintain vigilance and continue regular monitoring to ensure the park remains free of invasive species and to verify that the current results are not due to insufficient data.

Figure 6: Analysis of Invasive Species Distribution within Niquette Bay State Park

Overview: This figure presents the investigation into the presence of invasive species within Niquette Bay State Park. When observations are present, a density plot shows their distribution within the park’s boundaries; when none are found, as with the current data, the figure displays an explicit "No Invasive Species Found" message instead.

Key Points:
- Visualization Purpose: When no observations are present, the visualization states this explicitly rather than leaving the result implicit.
- Management Tool: The result is a resource for environmental management, indicating either the success of protective measures or the need for enhanced monitoring.

Final Confirmation: Absence of Invasive Species

Cautious Interpretation:
- Important Reminder: Absence of evidence is not necessarily evidence of absence.
- Recommended Action: It’s crucial to maintain vigilance and continue regular monitoring to ensure the park remains free of invasive species and to verify that the current results are not due to insufficient data collection.

# Management Efforts: Overlay the park boundary and locations of invasive species for management planning.
ggplot() +
  geom_sf(data = species_in_park, aes(color = invasive_name)) +
  geom_sf(data = park_area_sf, fill = NA, color = "black", size = 0.5) +
  labs(title = "Invasive Species within Niquette Bay State Park",
       subtitle = "Mapped with park boundary overlay",
       x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "bottom")

Figure 7: Invasive Species Management Planning within Niquette Bay State Park

# Intended Data Check: This step was conducted as a precautionary measure to ensure that no invasive species were overlooked in previous analyses within the boundaries of Niquette Bay State Park.
# Visualization Outcome: The plot does not display any invasive species data, which indicates that after a thorough check, no invasive species have been recorded within the park according to the dataset being used.
# Why This Step Is Important: Performing such checks is crucial in environmental management. It helps confirm the effectiveness of existing conservation measures or highlights areas that may require additional surveying and protection efforts.
# Management Implications: The absence of data points on the map indicates that, in the current dataset, no invasive species have been recorded within the park's bounds.
# This information can be used for future park management and conservation planning to maintain the integrity of the park's ecosystem.
# In essence, the map shows no recorded invasive species in the park, but it also underlines the necessity for ongoing vigilance in environmental monitoring and data collection.

Figure 7: Invasive Species Management Planning within Niquette Bay State Park

Overview: This map supports conservation planning by depicting invasive species observations relative to the boundaries of Niquette Bay State Park. Notably, the map shows no detected invasive species, which may reflect the effectiveness of current conservation efforts or gaps in data collection.

Key Points:
- Visualization: Provides a clear visual representation of the park area, in which no invasive species observations are recorded.
- Conservation Success: The lack of recorded invasive species may indicate successful conservation practices within the park.

Data Check for Invasive Species, Figure 7 above

Objective: Conduct a precautionary check to ensure no invasive species were overlooked in Niquette Bay State Park.

Importance:
- Environmental Management: Crucial for confirming the effectiveness of existing conservation measures and identifying any areas that may require further surveying and protection efforts.
- Outcome: Indicates that, based on current data, no invasive species have been recorded in the park, which supports ongoing management and conservation planning.

Conclusion:
- Current Status: Suggests the park is in a good state concerning invasive species management.
- Ongoing Vigilance: Underlines the necessity for continuous environmental monitoring and data collection to maintain the integrity of the park’s ecosystem.

Analyzing Invasive Species Distribution in Vermont

Expansion to Statewide Analysis:

Objective: Broaden the scope of the analysis to encompass the entire state of Vermont, examining the distribution of all invasive species using spatial data.

Purpose:
- Statewide Overview: Checks and visualizes the distribution of invasive species across Vermont, offering insights into broader ecological challenges.
- Data Utilization: Employs spatial data tools to provide a comprehensive overview of invasive species presence and impact statewide.

# If no observations of invasive species are found within the park, expand the analysis to the entire state of Vermont.
species_in_vermont <- st_intersection(species_data_sf, vermont_sf)

# Check and print invasive species data for Vermont, handling cases where no data is found.
if (nrow(species_in_vermont) > 0) {
  print("Invasive species data within Vermont:")
  print(species_in_vermont)
} else {
  message("No invasive species data found within Vermont boundaries.")
}

# Visualize the distribution of all invasive species in Vermont using spatial data.
species_map_v3 <- ggplot(species_in_vermont) +
  geom_sf(aes(color = invasive_name)) +  # Use color to differentiate species
  geom_sf(data = vermont_sf, fill = NA, color = "black") +  # Outline Vermont state boundary
  labs(title = "Distribution of Invasive Species in Vermont",
       subtitle = "Spatial distribution of invasive species occurrences",
       x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "left")
print(species_map_v3)

Figure 8: Distribution of Invasive Species in Vermont

Overview: This map showcases the localized occurrences of invasive species within the boundaries of Vermont, emphasizing specific areas of detection.

Purpose:
- Highlight Critical Areas: Identifies locations that may require immediate attention and intervention due to the presence of invasive species.
- Visualization Aid: Assists in understanding the distribution patterns within the state, crucial for targeted ecological management and effective resource allocation.

# Display a map of all recorded invasive species observations across Vermont for comprehensive insight.
all_species_vermont_map_v4 <- ggplot(data = species_data_sf) +
  geom_sf(aes(color = invasive_name)) +
  geom_sf(data = vermont_sf, fill = NA, color = "black") +
  labs(title = "Distribution of Invasive Species Across Vermont",
       subtitle = "All Recorded Species Observations",
       x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "left")
print(all_species_vermont_map_v4)

Figure 9: Analyzing Invasive Species Distribution in Vermont

# The two sections of code within Step 14 differ as follows:

# Checking and Visualizing Invasive Species Data within Vermont Boundaries:
# This section intersects the species data with the boundaries of Vermont to check for the presence of invasive species within the state.
# If invasive species data is found within Vermont, it prints the data.
# It then visualizes the distribution of all invasive species within Vermont boundaries using spatial data.
# Visualizing the Distribution of All Recorded Invasive Species Observations Across Vermont:
# This section plots every record in species_data_sf directly, without intersecting it with the Vermont boundary.
# Any observation that falls outside the state outline therefore still appears on this map, giving a view of all recorded occurrences.


# Contextual Expansion: After confirming that Niquette Bay State Park did not show invasive species within its boundaries, the analysis was expanded to see if this was also the case across Vermont. 
# The results confirm the presence of invasive species elsewhere in the state.
# Comprehensive View: The visualization of this expanded dataset will illustrate the extent of the invasive species issue across Vermont, providing a visual and data-driven narrative of the ecological challenges the state faces.
# Implications for Policy and Management: Understanding the distribution of invasive species at the state level allows policymakers and environmental managers to allocate resources effectively, prioritize areas for treatment, and track the success of conservation efforts.
# Importance of Repeated Analysis: Reiterating the analysis at different scales (statewide vs. park-level) is crucial for accuracy in environmental monitoring. It ensures that broader patterns are not missed when focusing on smaller, specific areas like Niquette Bay State Park.

Figure 9: Spatial Distribution of All Recorded Invasive Species Observations Across Vermont

Overview: This comprehensive map presents all recorded observations of invasive species across Vermont, providing a holistic view of the ecological challenges posed by these species throughout the state.

Purpose:
- State-wide Assessment: Essential for informing policymakers and environmental managers about the widespread nature of these threats.
- Strategic Planning: Aids in strategic planning for conservation efforts, offering a broad perspective crucial for effective ecological management.

Summary of Analysis of Invasive Species in Vermont

Checking and Visualizing Invasive Species Data within Vermont Boundaries

Objective: Detect and visualize the presence of invasive species confined within Vermont’s geographic boundaries using spatial data tools.

Process:
- Spatial Intersection: Involves the intersection of species data with Vermont’s geographic boundaries.
- Visualization: Detected occurrences are displayed to assess the distribution of invasive species within the state.

Visualizing the Distribution of All Recorded Invasive Species Observations Across Vermont

Comparison:
- Unrestricted Data: Unlike previous analyses, this segment visualizes all recorded observations without restricting the dataset to state boundaries.
- Broad Overview: Provides an extensive look at the ecological challenges Vermont faces, underscoring the importance of comprehensive assessments.
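
One way to quantify the difference between the restricted and unrestricted views is to flag which observations fall outside the boundary. The sketch below uses an invented square "state" and four toy points; none of the values come from the project data.

```r
library(sf)

# Hypothetical square "state" boundary in EPSG:4326.
state <- st_sf(geometry = st_sfc(st_polygon(list(rbind(
  c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)
))), crs = 4326))

# Four toy observations; the last one lies east of the boundary.
obs <- st_as_sf(
  data.frame(id = 1:4, lon = c(2, 5, 8, 12), lat = c(2, 5, 8, 5)),
  coords = c("lon", "lat"), crs = 4326
)

# st_intersects() returns, for each observation, the indices of the polygons it
# intersects; a length of zero means the point lies outside the boundary.
inside <- lengths(st_intersects(obs, state)) > 0

print(sum(inside))      # observations kept by the boundary-restricted map
print(obs$id[!inside])  # observations that only the unrestricted map shows
```

Reporting the number of excluded records alongside the two maps makes it clear whether the maps differ in practice or only in how they were constructed.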

Contextual Expansion of Analysis Beyond Niquette Bay State Park

Initial Findings:
- No Invasive Species Detected: Initial analysis at Niquette Bay State Park showed no recorded invasive species observations.
- State-wide Expansion: The scope was then broadened to include the entire state, which confirmed the presence of invasive species at various locations statewide.

Implications for Policy and Management

Strategic Importance:
- Resource Allocation: Crucial for efficient resource allocation and prioritization of treatment areas.
- Policy Making: Supports the evaluation of conservation effort effectiveness, guiding policy adjustments.

Importance of Repeated Analysis

Methodology:
- Repeated Analyses: Essential for ensuring precise environmental monitoring by performing analyses at various scales, from specific locales to statewide assessments.
- Ecological Patterns: Helps identify and address larger ecological patterns, preventing oversight that can occur by focusing solely on localized areas.

Filtering and Analyzing Target Invasive Species Data Across Vermont

# ----------------------------------------------------------------------------------
# Step 16: Filtering and Analyzing Target Invasive Species Data Across Vermont
# ----------------------------------------------------------------------------------

# Description:
# This step filters the species data to include only the target invasive species across Vermont.
# It focuses on the distribution of the Emerald Ash Borer and Hemlock Woolly Adelgid observations.
# The filtered data is visualized to analyze the spatial distribution of these target invasive species across Vermont.

# Filter and analyze target invasive species data across Vermont.
target_species <- c("Emerald Ash Borer", "Hemlock Woolly Adelgid")
filtered_species_vermont <- species_data_sf %>%
  filter(invasive_name %in% target_species)

# Visualize the distribution of targeted invasive species across Vermont.
filtered_species_vermont_map <- ggplot(data = filtered_species_vermont) +
  geom_sf(aes(color = invasive_name)) +
  geom_sf(data = vermont_sf, fill = NA, color = "black") +
  labs(title = "Distribution of Target Invasive Species Across Vermont",
       subtitle = "Emerald Ash Borer and Hemlock Woolly Adelgid Observations",
       x = "Longitude", y = "Latitude") +
  scale_color_manual(values = c("Emerald Ash Borer" = "darkgreen",
                                "Hemlock Woolly Adelgid" = "darkmagenta")) +
  theme_minimal()

print(filtered_species_vermont_map)

Figure 10: Filtering and Analyzing Invasive Species Data Across Vermont

# Narrowed Focus: After a broader examination of all invasive species in Vermont, this analysis narrows the focus to two critical species to understand their specific distribution patterns.
# Color-Coded Clarity: The map's use of distinct colors allows anyone, regardless of their expertise, to see where each of these species has been observed. This clarity is essential for communicating complex data simply.
# Environmental Insight: The visual distribution offers immediate insights into which areas may be more affected or at risk. This information is vital for planning targeted responses to mitigate the impact of these invasive species.
# Management Strategy Development: The map aids in identifying 'hotspots' of activity for each invasive species, allowing for more efficient allocation of resources for management and control efforts.
# In summary, the data tells us where we need to pay attention and potentially direct our conservation efforts to address the specific threats posed by the Emerald Ash Borer and Hemlock Woolly Adelgid in Vermont.

Figure 10: Impact of Invasive Species Distribution in Vermont

Objective: This figure illustrates the distribution of two significant invasive species in Vermont: the Emerald Ash Borer and the Hemlock Woolly Adelgid, highlighting critical hotspots where these species pose substantial threats to biodiversity and forest health.

Action Required:
- Immediate Management: Targeted management actions are necessary to effectively address the challenges posed by these hotspots.

Discussion on Figure 10: Strategic Implications

Strategic Allocation:
- Resource Optimization: The map aids in strategically allocating conservation resources by identifying regions requiring urgent intervention.
- Effort Focusing: Focusing efforts on these high-risk areas enables environmental managers to optimize strategies for controlling the spread of invasive species, enhancing the effectiveness of conservation efforts.

Discussion on Figure 10: Future Considerations

Adaptive Management:
- Continuous Monitoring: Continuous monitoring and periodic analysis updates are crucial as environmental conditions and species behaviors evolve.
- Strategy Adjustment: Expanding monitoring efforts to include emerging hotspots and adjusting management strategies accordingly are essential for staying ahead of new challenges.

Outcome:
- Sustainable Practices: This proactive approach ensures the long-term effectiveness of conservation efforts, adapting to new challenges and promoting sustainable ecological practices across similar settings.

Comprehensive Analysis of Invasive Species Distribution and Environmental Impacts in Vermont

# ------------------------------------------------------------------------------------------------
# Step 17: Analyze Species Occurrences
# ------------------------------------------------------------------------------------------------

# Description:
# In this step, we calculate the count of occurrences for each invasive species, focusing on identifying the most prevalent ones based on the provided species data.

# Calculate the count of occurrences for each species.
species_count <- species_data_sf %>%
  group_by(invasive_name) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Display the count of observations per species.
print(species_count)

## Simple feature collection with 34 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -73.41649 ymin: 42.72899 xmax: -71.62245 ymax: 45.01408
## Geodetic CRS:  WGS 84
## # A tibble: 34 × 3
##    invasive_name          count                                         geometry
##    <chr>                  <int>                                 <MULTIPOINT [°]>
##  1 Hemlock Woolly Adelgid   331 ((-72.23133 43.7532), (-72.22273 43.82385), (-7…
##  2 Tatarian honeysuckle      84 ((-71.62476 44.68753), (-71.62245 44.688), (-71…
##  3 Common buckthorn          66 ((-72.2355 43.8118), (-72.614 43.7345), (-72.63…
##  4 Emerald Ash Borer         59 ((-72.13964 44.41226), (-72.15172 44.41078), (-…
##  5 Common reed               57 ((-71.63021 44.71069), (-71.67049 44.68651), (-…
##  6 Japanese honeysuckle      49 ((-72.06293 44.65791), (-72.18078 44.5078), (-7…
##  7 Elongate Hemlock Scale    48 ((-72.23122 43.75284), (-72.22223 43.83185), (-…
##  8 Japanese knotweed         46 ((-72.06206 44.65957), (-71.79214 44.53043), (-…
##  9 Goutweed                  39 ((-72.2366 43.8112), (-72.4068 43.6402), (-72.6…
## 10 Oriental bittersweet      37 ((-71.79917 44.5807), (-72.30571 44.22848), (-7…
## # ℹ 24 more rows

# Prevalence Ranking: The output is a simple feature collection listing each invasive species name with its occurrence count. Because summarise() is applied to an sf object, each species' points are also combined into a single MULTIPOINT geometry. The Hemlock Woolly Adelgid has the highest number of recorded occurrences (331), followed by Tatarian honeysuckle (84), Common buckthorn (66), and the Emerald Ash Borer (59).
# Most Prevalent Species: The Hemlock Woolly Adelgid appears as the most prevalent invasive species within the observed data, signaling a potentially significant threat to local ecosystems.
# Potential Actions: This data can inform conservationists and policymakers where to focus efforts for surveys, control, and prevention measures. Species with higher counts may require urgent action to prevent further spread and mitigate ecological impact.
# In essence, this analysis helps to clarify the magnitude of invasive species issues by quantifying observations, thereby guiding decision-making processes for ecological management and resource allocation.
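
When only the counts are needed, the tally can be computed on the attribute table alone, which avoids building a MULTIPOINT geometry per species. The sketch below is one possible variant rather than the method used above: species_tbl is toy data standing in for the result of st_drop_geometry(species_data_sf), and the share column is an added illustration.

```r
library(dplyr)

# Toy attribute table standing in for st_drop_geometry(species_data_sf).
species_tbl <- data.frame(
  invasive_name = c("Hemlock Woolly Adelgid", "Hemlock Woolly Adelgid",
                    "Hemlock Woolly Adelgid", "Tatarian honeysuckle",
                    "Tatarian honeysuckle", "Emerald Ash Borer")
)

# count() groups, tallies, and sorts in one call; share is each species'
# fraction of all observations in the table.
species_count_tbl <- species_tbl %>%
  count(invasive_name, name = "count", sort = TRUE) %>%
  mutate(share = count / sum(count))

print(species_count_tbl)
```

Expressing counts as shares makes it easier to compare prevalence across datasets of different sizes.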

Comprehensive Analysis of Invasive Species Distribution and Environmental Impacts in Vermont

Objective: This section synthesizes data from various sources to visualize the current distribution and potential future spread of invasive species across Vermont.

Data Sources:
- iNaturalist Vermont: Provides geo-tagged observations of invasive species.
- Vermont Invasive Species Database & Vermont Open Geodata Portal: Offer spatial data crucial for identifying areas with significant invasive activity.
- OpenWeather and WorldClim: Supply environmental variables and historical climate data to assess correlations between weather patterns and invasive species occurrences.

Insights:
- Predictive Analysis: Utilizes predictive insights to forecast future threats under varying climate scenarios.
- Strategic Planning: Supports strategic conservation planning with a nuanced understanding of ecological dynamics.

Discussion on Detailed Analysis of Species Prevalence and Impact

Focus: This analysis highlights the prevalence and geographic distribution of key invasive species within Vermont, particularly the Emerald Ash Borer and Hemlock Woolly Adelgid.

Key Points:
- Hotspot Identification: Identifies specific areas heavily affected by these invasive species to target management actions.
- Resource Allocation: Facilitates the efficient allocation of resources and planning of containment and eradication strategies.
- Adaptive Management: Emphasizes the importance of continual monitoring and updating of analyses to respond to changing environmental conditions and adaptive invasive species.

Outcome:
- Enhanced Management Efficiency: Ensures that management efforts are effective and future strategies are informed by accurate and timely data.

Identify and Visualize Top Two Most Common Invasive Species

#---------------------------------------------------------------------------------
# Step 18: Identify and Visualize Top Two Most Common Invasive Species
#---------------------------------------------------------------------------------

# Description:
# In this step, we identify and visualize the distribution of the top two most common invasive species in Vermont based on the occurrence counts calculated in the previous step.

# Identify the names of the top two most common invasive species.
top_species_names <- head(species_count$invasive_name, 2)

# Filter the species data to include only observations of the top two species.
top_species_data <- species_data_sf %>%
  filter(invasive_name %in% top_species_names)

# Create a ggplot map to visualize the distribution of the top two species across Vermont.
top_species_map <- ggplot(data = top_species_data) +
  geom_sf(aes(color = invasive_name)) +
  geom_sf(data = vermont_sf, fill = NA, color = "black") +
  labs(title = "Distribution of the Top Two Most Common Invasive Species in Vermont",
       subtitle = "Observations of the Two Most Prevalent Species",
       x = "Longitude", y = "Latitude") +
  scale_color_manual(values = setNames(rainbow(2), top_species_names)) +
  theme_minimal()

# Display the map.
print(top_species_map)

Figure 11: Identify and Visualize Top Two Most Common Invasive Species

# Unexpected Findings: Contrary to initial reports and publications that indicated the Emerald Ash Borer and Hemlock Woolly Adelgid were the most invasive, 
# the data reveals a different story. The actual most common species based on observed occurrences are Hemlock Woolly Adelgid, with the highest count, and another species that was not initially expected to be as prevalent.
# Visual Evidence: The ggplot map provides visual evidence of the distribution of these two species, emphasizing the actual impact as reflected by the data.
# Implications for Research and Management: This discrepancy between expectation and data highlights the importance of direct data analysis in ecological studies. While prior research and reports are valuable, 
# empirical data can sometimes tell a different story, which can lead to updated priorities and strategies in managing invasive species.
# Effective Communication: The data underscore that while certain species may be expected to predominate because of their reputation or impact in other areas, local data analysis is essential for accurate ecological assessment and effective resource management.
# The map generated in this step tells the factual story of Vermont's current situation regarding invasive species, based on actual data, allowing for a technical and logical approach to addressing the ecological challenges presented by these species.

Figure 11: Distribution Discrepancy Analysis of Invasive Species in Vermont

Objective: Present an analysis of observed occurrences of invasive species in Vermont, focusing on unexpected findings regarding the prevalence of certain species.

Details:
- Unexpected Findings: Initial reports pointed to the Emerald Ash Borer and Hemlock Woolly Adelgid as the leading invasive species. In the observation data, Hemlock Woolly Adelgid has the highest count, but the second most common species is one that was not expected to be as prevalent.
- Visualization: Uses a ggplot map to represent the distribution of these species, providing empirical evidence that challenges prior assumptions.
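The ranking behind this finding comes down to a frequency count of species names. The sketch below illustrates the idea in base R; `toy_names` and its values are hypothetical and stand in for the study's observation records.

```r
# Hypothetical observation records; names and counts are illustrative only.
toy_names <- c("Species A", "Species B", "Species A", "Species C",
               "Species B", "Species A", "Species C", "Species C", "Species C")

# Count occurrences per species and sort from most to least common.
toy_counts <- sort(table(toy_names), decreasing = TRUE)

# Keep the names of the two most common species.
toy_top_two <- names(toy_counts)[1:2]
print(toy_top_two)
```

Sorting explicitly before taking the first two entries avoids relying on the count table already being in descending order.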

Implications for Research and Management

Key Insight: The discrepancy between expectation and empirical data underscores the importance of direct data analysis in ecological studies.

Management Strategy:
- Updated Priorities: Empirical data may lead to updated priorities and strategies in managing invasive species.
- Communication: Communicating these findings clearly is crucial so that local data analysis can guide accurate ecological assessment and resource management.

Conclusion

Map Analysis: The map generated from this analysis provides a factual representation of Vermont’s current invasive species situation, facilitating a technical and logical approach to addressing these ecological challenges and guiding future research and management efforts.

Transition from Hotspot Identification to Clustering with DBSCAN, Elbow, and K-means

From Hotspot Identification:

The analysis has successfully pinpointed regions in Vermont where invasive species are most prevalent, setting the stage for a deeper exploration of spatial distribution patterns.

Introduction to Clustering Techniques:

The analysis now moves to clustering: DBSCAN and K-means group the observations, and the Elbow method guides the choice of the number of K-means clusters. Together they help identify underlying patterns and potential ecological drivers.

DBSCAN:

DBSCAN groups closely packed points into clusters and marks points in low-density areas as outliers, helping identify dense regions of invasive species occurrences.
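The density idea described here can be sketched in a few lines of base R. This is a minimal illustration written for this explanation, not the package implementation used in the analysis; `toy_dbscan`, `toy_xy`, and the parameter values are hypothetical.

```r
# Minimal DBSCAN sketch in base R (illustrative only).
toy_dbscan <- function(xy, eps, min_pts) {
  d <- as.matrix(dist(xy))            # pairwise Euclidean distances
  nbrs <- d <= eps                    # neighbourhood matrix (each point counts itself)
  core <- rowSums(nbrs) >= min_pts    # core points have enough neighbours
  cluster <- integer(nrow(xy))        # 0 = noise or not yet assigned
  current <- 0L
  for (i in which(core)) {
    if (cluster[i] != 0L) next        # already absorbed into an earlier cluster
    current <- current + 1L
    queue <- i
    while (length(queue) > 0) {
      p <- queue[1]
      queue <- queue[-1]
      if (cluster[p] != 0L) next
      cluster[p] <- current
      # Only core points expand the cluster; border points simply join it.
      if (core[p]) queue <- c(queue, which(nbrs[p, ] & cluster == 0L))
    }
  }
  cluster
}

# Two tight toy groups plus one isolated point (longitude, latitude).
toy_xy <- rbind(
  cbind(c(-73.200, -73.201, -73.199, -73.2005), c(44.500, 44.501, 44.499, 44.5005)),
  cbind(c(-72.600, -72.601, -72.599, -72.6005), c(43.900, 43.901, 43.899, 43.9005)),
  c(-72.000, 45.000)
)
toy_clusters <- toy_dbscan(toy_xy, eps = 0.01, min_pts = 3)
print(toy_clusters)
```

The isolated point keeps label 0, which is how DBSCAN marks outliers, while each tight group receives its own positive label.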

Elbow Method:

This heuristic helps determine the optimal number of clusters by identifying the “elbow point” in a plot of the within-cluster sum of squares (WSS) against the number of clusters.
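To make the quantity on the vertical axis concrete, the sketch below recomputes the within-cluster sum of squares by hand and compares it with the value reported by `kmeans()`. The four toy coordinates are hypothetical and chosen only for illustration.

```r
# Four hypothetical points (longitude, latitude).
toy_xy <- cbind(lon = c(-73.2, -73.1, -72.5, -72.4),
                lat = c(44.5, 44.4, 43.9, 44.0))

set.seed(1)  # kmeans() picks random starting centroids
fit <- kmeans(toy_xy, centers = 2, nstart = 5)

# WSS: squared distance from each point to the centroid of its own cluster, summed.
manual_wss <- sum((toy_xy - fit$centers[fit$cluster, ])^2)

print(c(manual = manual_wss, reported = fit$tot.withinss))
```

The two numbers agree, which confirms that `tot.withinss` is the quantity the elbow plot tracks as the number of clusters grows.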

K-means Clustering:

K-means partitions the dataset into distinct, non-overlapping clusters, iteratively assigning data points to the nearest cluster centroid and updating centroids based on the assigned points’ mean.
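The assign-then-update loop described here can be written out directly. The following is a minimal base R sketch on hypothetical toy coordinates with a deliberately poor starting guess; it is meant to show the mechanics and is not a replacement for `kmeans()`.

```r
# Four hypothetical points (longitude, latitude).
toy_xy <- cbind(lon = c(-73.2, -73.1, -72.5, -72.4),
                lat = c(44.5, 44.4, 43.9, 44.0))

# Poor starting guess on purpose: both centroids come from the western pair.
centroids <- toy_xy[c(1, 2), , drop = FALSE]

for (iter in 1:10) {
  # Assignment step: squared distance from every point to every centroid.
  d2 <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(toy_xy, 2, centroids[j, ])^2)
  })
  assignment <- max.col(-d2, ties.method = "first")  # index of the nearest centroid

  # Update step: each centroid moves to the mean of its assigned points.
  new_centroids <- t(sapply(seq_len(nrow(centroids)), function(j) {
    colMeans(toy_xy[assignment == j, , drop = FALSE])
  }))

  # Stop once the centroids no longer move.
  if (isTRUE(all.equal(new_centroids, centroids, check.attributes = FALSE))) break
  centroids <- new_centroids
}

print(assignment)
print(centroids)
```

Even from the poor start, the loop settles on the western and eastern pairs after a few passes. This sketch does not handle empty clusters, which a production implementation must.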

Rationale for Transition:

Moving from hotspot identification to clustering provides insights beyond high-density areas, uncovering complex spatial patterns and groupings of invasive species occurrences, informing targeted conservation and management strategies.

Spatial Clustering Analysis

Objective: Apply spatial clustering algorithms to identify clusters, or hotspots, of invasive species occurrences.

Method:
- Technique Used: Spatial clustering algorithms group spatially close observations into clusters based on their geographical proximity.
- Outcome: The resulting clusters reveal insights into the spatial distribution patterns of invasive species.
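Because the clustering in this section operates on longitude and latitude, "proximity" is measured in decimal degrees. The sketch below gives a rough sense of what the `eps = 0.01` used in Step 19 means on the ground. The central latitude of 44 degrees and the figure of about 111.32 km per degree are approximations assumed for illustration.

```r
eps_deg <- 0.01            # neighbourhood radius used in Step 19, in degrees
lat_deg <- 44              # approximate central latitude of Vermont (assumption)
km_per_deg <- 111.32       # approximate length of one degree of latitude, in km

ns_km <- eps_deg * km_per_deg                            # north-south extent
ew_km <- eps_deg * km_per_deg * cos(lat_deg * pi / 180)  # east-west extent shrinks with latitude

print(round(c(north_south_km = ns_km, east_west_km = ew_km), 2))
```

The two extents differ, so a radius in degrees is slightly elliptical on the ground. Reprojecting to a metric coordinate system before clustering is one way to avoid this.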

DBSCAN Clustering of Observations

# ----------------------------------------------------------------------------

# Spatial Clustering Analysis

# ----------------------------------------------------------------------------

# This step involves conducting spatial clustering analysis on the species data to identify clusters or 
# hotspots of invasive species occurrences. It employs spatial clustering algorithms to group spatially close observations into clusters based on their geographical proximity. 
# The resulting clusters provide insights into the spatial distribution patterns of invasive species.

## --------------------------------------------------------------------------------
## Step 19: DBSCAN Clustering Analysis
## --------------------------------------------------------------------------------

# Setting Up Environment
# Ensure the 'sf' package is available, install it if not
if (!requireNamespace("sf", quietly = TRUE)) {
  install.packages("sf")
}

# Extracting Coordinates
# Extract geographical coordinates (longitude, latitude) from the 'master_observation_list_sf' spatial dataframe
example_coords <- st_coordinates(master_observation_list_sf)

# Perform DBSCAN Clustering
# Apply the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm to the extracted coordinates.
# Adjust 'eps' (maximum distance between two points in the same neighborhood) and 'MinPts' (minimum number of points required to form a dense region) based on data characteristics.
# Note: 'eps' is in the units of the coordinates, so for longitude/latitude data it is in decimal degrees.
# Note: 'MinPts' is the argument name used by fpc::dbscan(); dbscan::dbscan() calls the same parameter 'minPts'.
db_clusters <- dbscan(example_coords, eps = 0.01, MinPts = 5)

# Incorporate clustering results into the original spatial dataframe
# Create a new column 'cluster' as a factor to represent the clusters identified by DBSCAN (cluster 0 holds the noise points)
master_observation_list_sf$cluster <- as.factor(db_clusters$cluster)

# Visualization of DBSCAN Clustering Results

# Create a map visualizing the results of DBSCAN clustering, with different colors representing different clusters:

cluster_map <- ggplot() +
  geom_sf(data = master_observation_list_sf, aes(color = cluster)) +  # Plot clusters using ggplot2
  labs(title = "DBSCAN Clustering of Observations",
       subtitle = "Clusters based on geographical coordinates",
       x = "Longitude", y = "Latitude") +  # Add titles and labels
  theme_minimal() +  # Apply minimal theme for visualization aesthetics
  theme(legend.position = "right")  # Place the legend to the right of the map (a numeric x of 2.1 falls outside the plot area)
print(cluster_map)

Figure 12: DBSCAN Clustering of Observations

# Visualization and Data Story:
# Cluster Map Creation: A map is generated to visualize these clusters, with different colors representing each cluster, clearly distinguishing the groups of observations.
# Clustering Insights: The visualization shows where invasive species occurrences are not random but instead clustered in certain areas – these are the hotspots where species are more densely located.
# Management Implications: Understanding these clusters can help in identifying areas that might be at higher risk of invasion and thus could benefit from focused management efforts.
# Summary for Novices:
# What is DBSCAN?: DBSCAN is a method that finds neighbors that are closely packed together and groups them into clusters. It helps us see if there are any 'hotspots' where invasive species are especially numerous.
# The Plot’s Message: The clusters on the map show us where the invasive species are not just scattered randomly but are concentrated in specific areas. Each color on the map represents a different cluster, or a 'neighborhood,' of invasive species.
# Why This Matters: By knowing where these clusters are, we can better manage invasive species because these are the areas where they're most likely to cause problems. We might need to take extra care in these hotspots to protect the local environment.
# Practicality of Findings: The analysis gives us a practical way to use limited resources more effectively by targeting the areas where the need is greatest.
# In essence, the DBSCAN clustering tells us a story about where the invasive species are gathering in groups across the landscape, which can be crucial information for making smart decisions about environmental management.

Figure 12: Visualization of DBSCAN Clustering Results

Objective: Display the results of DBSCAN clustering on a map, each cluster represented by a different color for clear differentiation.

Details:
- Visualization: The map shows the spatial distribution of invasive species occurrences, with each cluster differentiated by color.
- Insights: Reveals that invasive species occurrences are clustered in certain areas, representing hotspots of activity.

Spatial Clustering Analysis of Invasive Species Occurrences in Vermont

Objective

Present a clear visual representation using DBSCAN clustering to identify where invasive species occurrences are concentrated across Vermont.

Cluster Map Creation

Visualization Method: Utilize varied colors on the map to represent distinct clusters, enhancing the clarity of spatially clustered observations.

Clustering Insights

Distribution Pattern: The clustering shows that invasive species observations are not spread evenly but are concentrated in specific areas.
Identification of Hotspots: These clusters highlight areas with dense populations of invasive species.
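One simple way to turn cluster labels into a ranked list of hotspots is to count the points in each cluster and set the noise points aside. The sketch below assumes labels in the usual DBSCAN convention (0 for noise); `toy_labels` is hypothetical and not taken from the study data.

```r
# Hypothetical DBSCAN labels: 0 marks noise, positive integers mark clusters.
toy_labels <- c(1, 1, 1, 2, 2, 0, 1, 3, 3, 3, 3, 0, 1)

# Rank clusters by size, ignoring noise.
clustered <- toy_labels[toy_labels != 0]
cluster_sizes <- sort(table(clustered), decreasing = TRUE)

# Share of observations that DBSCAN left unclustered.
noise_share <- mean(toy_labels == 0)

print(cluster_sizes)
print(round(noise_share, 3))
```

The largest clusters are natural first candidates for field checks, and a high noise share suggests that `eps` or the minimum point count may need adjusting.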

Management Implications

Importance of Recognizing Clusters

Understanding these clusters is vital for pinpointing areas potentially at greater risk of invasion, guiding targeted management efforts effectively.

Summary

What is DBSCAN?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method that groups closely packed points into clusters, effectively identifying dense ‘hotspots’ of activity.

The Plot’s Message

Spatial Distribution: The map demonstrates that invasive species are not randomly dispersed but are notably concentrated in specific areas.
Cluster Delineation: Each cluster is marked by a different color, distinguishing various ‘neighborhoods’ of species.

Why This Matters

Identifying these clusters enables more effective management of invasive species by focusing resources and conservation efforts on the most impacted areas.

Practicality of Findings

The insights from the clustering analysis offer a strategic approach to allocate resources efficiently, concentrating on areas with the most urgent needs.

In summary, DBSCAN clustering provides valuable insights into the distribution of invasive species across Vermont, serving as a critical tool for informed environmental management decisions.

K-Means Clustering Analysis

## ----------------------------------------------------------------------------------------------------------------
## Step 20: K-Means Clustering Analysis
## ----------------------------------------------------------------------------------------------------------------

# Convert 'master_observation_list' to a spatial dataframe (sf object) with geographical coordinates
master_observation_list_sf <- st_as_sf(master_observation_list, coords = c("longitude", "latitude"), crs = 4326)

# K-Means Clustering:

# Perform K-Means clustering on the spatial coordinates
set.seed(123)  # Ensure reproducibility
k_means_result <- kmeans(st_coordinates(master_observation_list_sf), centers = 5)

# Add the K-Means cluster assignments to the spatial dataframe
master_observation_list_sf$cluster_kmeans <- as.factor(k_means_result$cluster)

# Optional: keep a plain data frame copy of the results (the map below plots the sf object directly)
master_observation_list_df <- as.data.frame(master_observation_list_sf)

# Plot K-Means clusters with ggplot2
k_means_map <- ggplot() +
  geom_sf(data = master_observation_list_sf, aes(color = cluster_kmeans)) +  # Plot clusters using ggplot2
  labs(title = "K-Means Clustering of Observations",
       subtitle = "Clusters based on geographical coordinates",
       x = "Longitude", y = "Latitude") +  # Add titles and labels
  theme_minimal()  # Apply minimal theme for visualization aesthetics

# Display the K-Means cluster map
print(k_means_map)

Figure 13: K-Means Clustering Analysis

# Visualization and Data Story:
# Map Generation: The ggplot2 package creates a map that displays these clusters, each marked with a different color for easy differentiation.
# Interpretation: The map illustrates how the observations are not randomly scattered but tend to group together in certain areas—these are the clusters identified by the algorithm.
# Insights for Management: Identifying these clusters helps focus conservation efforts where they're most needed. If a cluster corresponds to a critical habitat or a highly affected area, it might require more intensive management actions.
# Summary for Novices:
# What is K-Means?: K-Means is a way to organize scattered data into groups (clusters). It's like organizing scattered points on a map into five different regions based on their location.
# Map's Role: The map with different colored points shows the 'regions' or clusters where similar observations are grouped, helping to visualize patterns in the distribution of invasive species.
# Management Relevance: Knowing these patterns helps us decide where to act to control invasive species. It's a strategy to use resources wisely, by concentrating on areas with the most observations.
# Data-Driven Decisions: This step illustrates the importance of using data to make informed decisions in managing the environment. It suggests that invasive species observations follow a spatial pattern, which we can target for better ecological outcomes.
# This K-Means clustering thus tells a data-driven story about how and where invasive species congregate within Vermont, guiding future action to address these ecological concerns in an efficient manner.

Figure 13: K-Means Clustering of Invasive Species Distribution

Objective: Demonstrate the results of K-Means clustering on the spatial distribution of invasive species using the ggplot2 package in R.

Details:
- Visualization: Each cluster is depicted in a unique color to simplify the differentiation between groups of spatially aggregated observations.
- Insights: The map shows that invasive species observations form distinct spatial groups in certain areas, indicating potential hotspots.
- Application: This information is crucial for directing targeted management strategies, particularly in critical habitats or heavily impacted areas.

Impact:
- Conservation Priorities: Aids in understanding spatial patterns to inform conservation priorities.
- Resource Allocation: Facilitates strategic deployment of resources to mitigate the impact of invasive species.

Understanding K-Means Clustering for Ecological Management

What is K-Means?

K-Means clustering is a method used to organize scattered data into meaningful groups or clusters. In the context of ecological management, it organizes observations of invasive species into distinct regions based on geographical distribution.

Role of Mapping

Visualization: Clusters are visualized on a map, each represented by a different color.
Pattern Recognition: Helps ecologists and environmental managers identify patterns in the distribution of invasive species.

Relevance in Management

Understanding these patterns is essential for making informed decisions about managing invasive species, focusing resources and efforts on areas with the highest concentration of observations to combat invasive species strategically.
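A compact way to see which K-means regions hold the most observations is to tabulate cluster sizes next to their centroids. The sketch below uses simulated toy coordinates (three hypothetical groups of 30, 15, and 5 points) and hand-picked starting centroids, one per group, so that the result is deterministic. None of the values come from the study data.

```r
set.seed(42)  # reproducible toy coordinates

# Three hypothetical groups of observations (longitude, latitude).
toy_xy <- rbind(
  cbind(rnorm(30, -73.2, 0.02), rnorm(30, 44.5, 0.02)),
  cbind(rnorm(15, -72.6, 0.02), rnorm(15, 43.9, 0.02)),
  cbind(rnorm(5,  -72.0, 0.02), rnorm(5,  44.9, 0.02))
)
colnames(toy_xy) <- c("lon", "lat")

# Start from one point in each toy group (rows 1, 31, 46) to keep the example deterministic.
fit <- kmeans(toy_xy, centers = toy_xy[c(1, 31, 46), ])

# One row per cluster: size and centroid, largest cluster first.
summary_tbl <- data.frame(cluster = seq_len(3), n = fit$size, fit$centers)
summary_tbl <- summary_tbl[order(-summary_tbl$n), ]
print(summary_tbl)
```

With real data the starting centroids are normally random, so `set.seed()` and a larger `nstart` are the usual way to get stable results.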

Importance of Data-Driven Decisions

Decision Making: Emphasizes the importance of using data to drive decisions in environmental management.
Efficiency: Recognizing and targeting patterns in the spread of invasive species leads to more efficient and effective ecological outcomes.

Conclusion

K-Means clustering provides insights into how and where invasive species aggregate within a specific area, such as Vermont. These insights guide future actions and interventions aimed at addressing ecological concerns in a manner that optimizes resource allocation and conservation efforts.

Elbow Method for Optimal Cluster Count Selection

# -----------------------------------------------------------------------------------------------------------------------------------
# Step 21: Elbow Method for Optimal Cluster Count Selection
# -----------------------------------------------------------------------------------------------------------------------------------
# Description:
# The Elbow Method is employed to determine the optimal number of clusters for the K-Means clustering algorithm. By plotting the total within-cluster sum of squares against the number of clusters (k), this method helps identify the point where the rate of decrease in within-cluster variance slows down, indicating the optimal number of clusters.

# Perform the Elbow Method to compute the total within-cluster sum of squares for different cluster counts (k).
wss <- map_dbl(1:10, function(k) {
  kmeans(example_coords, centers = k, nstart = 10)$tot.withinss
})

# Visualize the Elbow plot to identify the optimal number of clusters (k).
elbow_plot <- data.frame(k = 1:10, wss = wss) %>%
  ggplot(aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method for Choosing k",
       x = "Number of Clusters (k)",
       y = "Total Within-Cluster Sum of Squares")

print(elbow_plot)

Figure 14: Elbow Method for Optimal Cluster Count Selection

# Visualization and Interpretation:
# The Plot: On the Elbow plot, the x-axis represents the number of clusters, and the y-axis shows the WSS. Points on the plot show the WSS for each k.
# Finding the Elbow: The "elbow" is the point where increases in k result in smaller reductions in WSS. It suggests adding more clusters doesn't provide a much better fit.
# Optimal Clusters: The k at the elbow is considered the optimal number of clusters because it's a good trade-off between compactness (low WSS) and the number of clusters.
# Summary for Novices:
# What's the Elbow Method?: It's like trying to find the right place to cut a tree branch—the spot where you get the best cut with the least effort. Here, we look for the point where adding more clusters doesn't make our model much better.
# What Does the Plot Show?: The graph helps us see how well our data fits into a certain number of groups. At first, adding more groups (clusters) really helps, but after a certain point, it doesn't improve much.
# Why Is This Useful?: This method helps us avoid two things: having too many groups, which is unnecessary, or too few, which might lump different observations together.
# Decisions Based on Data: The Elbow Method ensures that the choice of how many groups to divide the data into is not random but based on actual trends in the data.
# In this way, the Elbow Method tells a story about finding balance—enough clusters to accurately reflect the data without overcomplicating the model. It guides us to a logical decision on the number of clusters to use for our invasive species analysis.

Figure 14: Elbow Plot for Determining Optimal Number of Clusters

Objective: Illustrate the relationship between the number of clusters (k) and the Within-Cluster Sum of Squares (WSS) to aid in determining the optimal number of clusters for the K-Means algorithm.

Details:
- X-axis: Number of clusters (k).
- Y-axis: Corresponding WSS values.
- Data Points: Each point on the plot represents the WSS for a specific value of k.

Interpreting the Elbow Plot:

Finding the Elbow: Identify the “elbow” point where increases in k result in diminishing reductions in WSS, depicting the trade-off between the number of clusters and the compactness of the data.
Determining Optimal Clusters: The k value at the elbow is considered optimal, balancing low WSS (indicative of compact clusters) and minimized complexity from additional clusters.
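The elbow can also be read numerically by looking at how much each extra cluster reduces WSS. The sketch below uses made-up WSS values (not results from this study) and a simple largest-bend rule based on second differences, which is only one of several possible heuristics.

```r
# Hypothetical WSS values for k = 1..6 (illustrative numbers only).
wss_example <- c(100, 60, 25, 22, 20, 19)

# Reduction in WSS gained by each additional cluster (k = 2..6).
wss_drop <- -diff(wss_example)

# Second differences measure how sharply the curve bends at k = 2..5.
bend <- diff(wss_example, differences = 2)

# The bend values are centred on k = 2, 3, ..., so shift the index by one.
elbow_k <- which.max(bend) + 1

print(wss_drop)
print(elbow_k)
```

Here the drop falls from 35 to 3 after the third cluster, so the rule selects k = 3. On real curves the bend is often less sharp, and the plot should still be inspected by eye.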

Understanding the Elbow Method for Cluster Analysis

What is the Elbow Method?

The Elbow Method is a technique akin to finding the optimal place to prune a tree branch—it helps identify the point where adding more clusters does not significantly improve the model’s fit.

Insights from the Plot

Initial Improvement: Adding more clusters initially enhances the fit of the model.
Diminishing Returns: Beyond a certain point, the improvement in fit becomes marginal, indicating the optimal clustering threshold.

Utility of the Elbow Method

Balance Between Clusters: This method helps strike a balance between too many clusters, which can introduce unnecessary complexity, and too few clusters, which may merge distinct observations.
Data-Driven Decision Making: Ensures that the choice of the number of clusters is informed by the underlying trends in the data, rather than being arbitrary.

Conclusion

The Elbow Method provides a narrative of finding equilibrium—selecting an appropriate number of clusters that accurately represent the data without overly complicating the model. It serves as a guide for making logical and data-driven decisions in determining the number of clusters for our analysis of invasive species.

Spatial Distribution of Invasive Species

# ------------------------------------------------------------------------------------
# Step 22: Spatial Distribution of Invasive Species 
# ------------------------------------------------------------------------------------

# top_species_data

# Check if top_species_data exists; if not, create a mock data frame for demonstration.
# The mock uses the same 'invasive_name' column that the later steps expect.
if (!exists("top_species_data")) {
  message("top_species_data does not exist, creating a mock data frame for demonstration.")
  top_species_data <- data.frame(
    invasive_name = sample(c("Species1", "Species2", "Species3"), 415, replace = TRUE),
    longitude = runif(415, min = -73.5, max = -71.5),
    latitude = runif(415, min = 43.5, max = 45.5)
  )
}

# Convert data frame to an sf object while retaining original longitude and latitude columns
top_species_data_sf <- top_species_data %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, remove = FALSE)  # Set remove = FALSE to keep the original columns

# Inspect the structure to verify that the geometry column has been added and original columns are retained
# (str() prints its output directly, so it does not need to be wrapped in print())
str(top_species_data_sf)

## sf [415 × 26] (S3: sf/tbl_df/tbl/data.frame)
##  $ x.id                      : num [1:415] 454405 454462 454516 454470 454317 ...
##  $ y.id                      : num [1:415] 156319 156518 156218 156452 156364 ...
##  $ site_name                 : chr [1:415] "Branbury S.P." "Branbury S.P." "Branbury S.P." "Branbury S.P." ...
##  $ observation_id            : num [1:415] 103 104 105 106 107 108 109 110 120 135 ...
##  $ invasive_name             : chr [1:415] "Tatarian honeysuckle" "Tatarian honeysuckle" "Tatarian honeysuckle" "Tatarian honeysuckle" ...
##  $ observation_date          : chr [1:415] "09/06/2010" "10/06/2010" "09/06/2010" "09/06/2010" ...
##  $ observer                  : chr [1:415] "Tess Greaves" "Tess Greaves" "Tess Greaves" "Tess Greaves" ...
##  $ survey_type               : chr [1:415] "Contractor" "Contractor" "Contractor" "Contractor" ...
##  $ town                      : chr [1:415] "Salisbury" "Salisbury" "Salisbury" "Salisbury" ...
##  $ plant_description         : chr [1:415] NA NA NA NA ...
##  $ distribution_name         : chr [1:415] NA "Scattered Plants or Clumps" NA NA ...
##  $ assessmen_date            : chr [1:415] "Jun  9 2010" "Jun 10 2010" "Jun  9 2010" "Jun  9 2010" ...
##  $ treatment_date            : chr [1:415] NA NA NA NA ...
##  $ treatment_effectiveness   : chr [1:415] NA NA NA NA ...
##  $ treatment_type            : chr [1:415] NA NA NA NA ...
##  $ treatment_person          : chr [1:415] "Tess Greaves" "Tess Greaves" "Tess Greaves" "Tess Greaves" ...
##  $ treatment_assessment_date : chr [1:415] NA NA NA NA ...
##  $ assessor                  : chr [1:415] NA NA NA NA ...
##  $ treatment_assessment_notes: chr [1:415] NA NA NA NA ...
##  $ chemical_name             : chr [1:415] NA NA NA NA ...
##  $ chemical_consentration    : chr [1:415] NA NA NA NA ...
##  $ chemical_ounces           : chr [1:415] NA NA NA NA ...
##  $ application_method        : chr [1:415] NA NA NA NA ...
##  $ certified_applicator      : chr [1:415] NA NA NA NA ...
##  $ eparegistation_number     : chr [1:415] NA NA NA NA ...
##  $ geometry                  :sfc_POINT of length 415; first list element:  'XY' num [1:2] -73.1 43.9
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:25] "x.id" "y.id" "site_name" "observation_id" ...

# Data Integrity: The first action, checking if data exists and creating mock data if necessary, shows an important step in data analysis—making sure that you have the data you need to work with, and if not, how to create a stand-in for demonstration.
# Spatial Analysis Readiness: By converting the data to a spatial format while keeping original columns, the data is now ready for spatial analysis, like creating maps or conducting other geospatial computations.
# Geospatial Visualization: This step is foundational for creating a map that will visually show where the most invasive species are located in Vermont. These maps can help us understand how invasive species are spread across the landscape.

Data Preparation for Spatial Analysis

This section outlines the steps for preparing and analyzing spatial data, focusing on the analysis of invasive species in Vermont.

Ensuring Data Integrity

Objective: Verify the availability of essential data and generate mock data if necessary.

Details:
- Data Verification: Check the availability of the required datasets for the analysis.
- Mock Data Creation: If the original data is unavailable, create a substitute dataset to ensure that the analysis can proceed without interruptions. This mock dataset will simulate the characteristics of the expected real data.

Readiness for Spatial Analysis

Objective: Convert data into a spatially compatible format while preserving the integrity of the original columns.

Details:
- Data Transformation: Convert the dataset into a format suitable for spatial analysis (here, an sf object in R).
- Column Preservation: Ensure that all original data columns are retained during the conversion process to maintain data integrity.

Geospatial Visualization

Objective: Create visual representations of the data to illustrate the spatial distribution of invasive species across Vermont.

Details:
- Map Creation: Develop maps that display the geographical spread of invasive species.
- Analysis Tool: Utilize these maps as analytical tools to aid in decision-making and to inform ecological management strategies.

Spatial Distribution of Invasive Species

# ------------------------------------------------------------------------
#  Step 23: Spatial Distribution of Invasive Species 
# ------------------------------------------------------------------------

#  'top_species_data_sf' is your spatial dataframe and includes 'invasive_name'
# Check if invasive_name column exists to avoid runtime errors
if ("invasive_name" %in% colnames(top_species_data_sf)) {
  # Prepare for interactive visualization
  ggplot_data <- ggplot(top_species_data_sf) +
    geom_sf(aes(color = invasive_name, text = paste("Invasive Species:", invasive_name)), size = 3) +
    labs(title = "Spatial Distribution of Invasive Species",
         x = "Longitude", y = "Latitude") +
    theme_minimal()
  
  # Convert to Plotly for interactivity
  ggplotly_obj <- ggplotly(ggplot_data, tooltip = "text") %>%
    layout(legend = list(orientation = "h", y = -0.3))
  
  # Print the interactive plot
  ggplotly_obj
} else {
  "Column 'invasive_name' does not exist in the dataset. Please check the dataset."
}

Figure 15: Spatial Distribution of Top Invasive Species

# What's Happening?: Think of the dataset like a guest list for an event, where the 'invasive_name' column is the name of each guest. The code first checks to make sure the guest list is there. Then it creates a map to show where guests are seated, 
# with different colors for different guests, and labels to identify them.
# Visualization Purpose: By turning the list into a colorful map, it becomes much easier to see patterns—like if certain guests are grouped together, which in our case, would mean certain invasive species are found more in some areas than others.
# Why Interactive?: The interactive map allows you to get more information by moving your cursor over the points. It's like walking around the event and getting to know the guests by having a quick chat with them.
# Understanding the Message: The data tells us how these unwanted 'guests' (invasive species) are spread out through our 'event' (the region). It helps us understand where we need to focus our efforts to manage these species.

Figure 15: Spatial Distribution of Invasive Species Across Vermont

Objective: This map illustrates the spatial distribution of various invasive species across Vermont, clearly delineated by the state’s geographic boundaries.

Details:
- Color-Coding: Each invasive species is marked with a distinct color, aiding in rapid assessment of biodiversity issues.
- Geographic Clarity: The prominent display of state boundaries ensures immediate spatial context.

Purpose: This visualization provides an easily interpretable graphical representation of data, crucial for ecological management and decision-making processes.

What’s Happening?

Analogy Explanation: Think of the dataset like a guest list for an event, where the ‘invasive_name’ column represents the name of each guest. The code first ensures that the guest list exists, then creates a map to display where guests are seated, using different colors and labels to identify them.

Visualization Purpose

Key Functions: - Data Representation: The visualization transforms the guest list into a colorful map, simplifying the identification of patterns. For example, certain guests (invasive species) may be grouped together, indicating higher concentrations in specific areas.

Why Interactive?

Engagement Strategy: - User Interaction: The interactive map enhances user engagement by allowing them to obtain additional information by hovering over data points. This feature simulates the experience of walking around the event and engaging with guests (invasive species) for deeper insights.

Understanding the Message

Data Insights: - The dataset reveals the spatial distribution of invasive species throughout Vermont. Analyzing this data helps prioritize management efforts by identifying areas with higher concentrations of invasive species.

Spatial Clustering Analysis of Top Invasive Species

# ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Step 24: Spatial Clustering Analysis of Top Invasive Species
# ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# Assuming 'top_species_data_sf' is already created and contains the necessary geographic data
# Determine the number of clusters
num_clusters <- 3  # Adjust this number based on your specific needs or analysis

# Perform k-means clustering on coordinates
set.seed(123)  # Fix the random starting centroids so the cluster labels are reproducible
coordinates <- st_coordinates(top_species_data_sf)
kmeans_result <- kmeans(coordinates, centers = num_clusters)
top_species_data_sf$cluster <- as.factor(kmeans_result$cluster)  # Add cluster results to your sf object

# Optional: Check the structure and clustering results
print(table(top_species_data_sf$cluster))

# Prepare for interactive visualization
# Map each cluster to a color and use invasive species names in the hover text
ggplot_data <- ggplot(top_species_data_sf) +
  geom_sf(aes(color = cluster, text = paste("Invasive Species:", invasive_name, "<br>Cluster:", cluster)), size = 3) +
  labs(title = "Spatial Distribution of Invasive Species by Cluster",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

# Convert to Plotly for interactivity
ggplotly_obj <- ggplotly(ggplot_data, tooltip = "text") %>%
  layout(legend = list(orientation = "h", y = -0.3))

# Display the interactive plot
ggplotly_obj

Figure 16: Spatial Distribution of Top Invasive Species by Regions

Overview: This figure depicts the spatial distribution of top invasive species across different regions, identified through clustering analysis. Each region is represented by a distinct color, enhancing the visualization of spatial patterns in invasive species distribution.

Features: - Interactive Tooltips: Provide additional information about specific invasive species and their respective clusters, enabling a more interactive and informative user experience.

What Does This Analysis Show?

Clustering Analysis: - The k-means clustering partitions the observations into three groups based on the geographic proximity of invasive species records. Each group marks a region where observations lie close together. Comparing the number of points per cluster can point to regions of concern, although k-means by itself does not measure density.
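
The number of clusters (three) is set by hand in the script. One common way to sanity-check such a choice is the elbow heuristic: run k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below is only an illustration. It uses synthetic longitude/latitude points (an assumption, standing in for the coordinates of top_species_data_sf) and base R only.

```r
# Elbow heuristic for choosing k in k-means (illustrative sketch).
# The coordinates are synthetic stand-ins, not the project's observations.
set.seed(42)
coords <- rbind(
  cbind(rnorm(50, mean = -73.2, sd = 0.05), rnorm(50, mean = 44.5, sd = 0.05)),
  cbind(rnorm(50, mean = -72.6, sd = 0.05), rnorm(50, mean = 44.3, sd = 0.05)),
  cbind(rnorm(50, mean = -72.9, sd = 0.05), rnorm(50, mean = 43.0, sd = 0.05))
)
colnames(coords) <- c("X", "Y")

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(coords, centers = k, nstart = 25)$tot.withinss
})

# A visible "elbow" suggests a reasonable number of clusters
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot for k-means")
```

With real observations, the same loop could be run on the matrix returned by st_coordinates() before fixing num_clusters.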

Purpose of Interactive Visualization

Interactive Elements: - Stakeholder Engagement: The interactive elements allow stakeholders, such as conservation managers or educational groups, to explore the data more deeply. By hovering over points, they can obtain specific information about the species at each location and identify the cluster each point belongs to.

Understanding the Outcome

Visual Clarity: - The colors represent different clusters, making it visually apparent which areas are grouped together in the analysis. This clarity aids in planning targeted management actions or further ecological studies.

Practical Use

Application: - This visualization provides a clear picture of where invasive species are most prevalent, and helps in understanding how these species are grouped geographically. This insight is crucial for effective environmental management and resource allocation.

Summary

Comprehensive View: - This step leverages both clustering analysis and interactive visualization to provide a comprehensive view of invasive species distribution, enhancing understanding and facilitating more informed decision-making.

Spatial Analysis and Transition to Species Distribution Modeling (SDM)

Overview

Context: Having completed an extensive exploratory data analysis, clustering, and hotspot identification, we now have a nuanced understanding of the spatial patterns and prevalence of invasive species across Vermont. This foundational knowledge sets the stage for more advanced ecological assessments.

Advancing to Species Distribution Modeling (SDM)

Transition to SDM

Objective: As we transition into species distribution modeling (SDM), our focus will shift to integrating raster-based weather data. This integration will enrich our models by incorporating crucial environmental variables such as temperature and precipitation, which play a significant role in predicting species distributions under varying climatic conditions.

Significance of the Transition

Enhanced Predictive Models: This step marks a sophisticated evolution in our analytical approach. By synthesizing climatic factors with biological data, we aim to enhance the accuracy and applicability of our predictive models, providing more precise insights into future species movements and potential new hotspots.

Expectations from SDM

Improved Outcomes: - Model Enrichment: The inclusion of environmental variables is expected to provide a richer, more contextually accurate framework for our species distribution models. - Decision Support: Enhanced predictive models will support more informed decision-making in ecological management and conservation planning.

Conclusion

Next Steps: - The transition to species distribution modeling represents a pivotal phase in our ongoing research. It promises to bring deeper insights into invasive species behavior and adaptation under changing climatic scenarios, ensuring that our conservation strategies are as effective and forward-looking as possible.

Spatial Analysis and Transition to Species Distribution Modeling

This R Markdown document is designed to guide the development and implementation of Species Distribution Models (SDMs) for invasive species in Vermont. Leveraging environmental data alongside recorded occurrences, we aim to predict the potential distribution of these species under current and future environmental conditions. This work is pivotal for informing strategic conservation efforts and managing invasive species effectively.

Background

Following a comprehensive exploratory data analysis which included spatial patterning and hotspot identification, our analysis now incorporates sophisticated modeling techniques. We utilize raster-based climatic data, which includes variables like temperature and precipitation, crucial for understanding species-environment interactions.

Data Acquisition and Preparation

# SDM_MaxEnt_LogisticRegression_Script.R


# --------------------- Step 1: List of Necessary Libraries -------------------

# Ensure required packages are installed and loaded
packages <- c("dismo", "pROC", "raster", "sp", "readr", "dplyr", "ggplot2", "terra", "sf", 
              "tmap", "tmaptools", "lubridate", "stringr", "rasterVis", "RANN", "predicts", 
              "akima", "gstat", "rJava", "stats")

ensure_packages <- function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

sapply(packages, ensure_packages)
message("All required libraries are loaded successfully!")

# --------------------- Step 2: Set Working Directory -------------------------

setwd("D:/GEOG_588/SDM_Invasive_species")

# --------------------- Step 3: Load Occurrence Data -------------------------

occurrence_file <- "Cleaned_Observation_Data.csv"
occurrence_data <- read.csv(occurrence_file)

# Convert Occurrence Data to SpatialPointsDataFrame
coordinates(occurrence_data) <- ~longitude + latitude
proj4string(occurrence_data) <- CRS("+proj=longlat +datum=WGS84")

# --------------------- Step 4: Download Environmental Data ------------------

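# res = 10 requests the 10 arc-minute WorldClim grids (19 bioclimatic layers).
# Note: raster::getData() is deprecated in recent versions of raster, which point to the geodata package instead.
bioclim_layers <- raster::getData('worldclim', var = 'bio', res = 10)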

Figure 17: Cropped Environmental Data for Vermont

# --------------------- Step 5: Crop Environmental Data to Vermont Extent -----

# Define the spatial extent for Vermont
vermont_extent <- extent(-73.437, -71.465, 42.730, 45.016)

# Crop the bioclimatic layers to the Vermont extent to focus the environmental data
# on the region of interest.
vermont_data <- crop(bioclim_layers, vermont_extent)

# Define layout dimensions
n_layers <- nlayers(vermont_data)
nrow <- ceiling(sqrt(n_layers))
ncol <- ceiling(n_layers / nrow)

# Set up the plot layout and adjust margins
# Reduce margins slightly but keep them larger than the very tight defaults
par(mfrow=c(nrow, ncol), mar=c(2, 2, 2, 2))  # Adjusted margins to 2 on all sides

# Loop through each layer to plot
for (i in 1:n_layers) {
  plot(vermont_data[[i]], main = names(vermont_data)[i])
}

# Reset the plotting parameters to default after plotting
par(mfrow=c(1, 1), mar=c(5.1, 4.1, 4.1, 2.1))

Figure 17: Cropped Environmental Data for Vermont

# Data Source: The underlying environmental data is derived from the WorldClim database 
# (https://www.worldclim.com/bioclim), a widely recognized repository for global 
# climate layers, commonly known as "bioclimatic variables."

Overview: This array of plots illustrates bioclimatic layers cropped to Vermont’s spatial extent. Each subplot represents a distinct environmental variable from the WorldClim dataset, covering factors like temperature and precipitation. These refined datasets are crucial inputs for species distribution modeling (SDM) and ecological assessments within the Vermont region.

Ecological Visualization for Conservation Planning

Visual Design: - The color spectrum in the visualization transitions smoothly from delicate whites and pinks to vibrant oranges, vivid yellows, and deep greens. - In the Figure 17 panels, this palette represents the gradient of values for each bioclimatic variable, from low (whites and pinks) to high (greens). When the same palette is applied to a model prediction map, it represents a gradient of probabilities, indicating the likelihood of species presence across different areas.

Functional Insight: - On a prediction map, the regions highlighted in bright green suggest a higher probability of species presence, pointing conservationists to potential hotspots. - While higher probabilities highlighted by green tones do not guarantee the presence of species, they provide essential clues for further scientific investigation and resource allocation.

Strategic Application: - This color-coded approach serves as a critical tool for ecological analysis and conservation planning, acting like a treasure map that guides conservation efforts to areas where they are most needed.

What’s Being Done?

Process Explanation: - This script prepares and analyzes data that combine actual sightings (occurrences) of invasive species with environmental conditions. This integration helps to understand where these species might thrive.

Why It Matters?

Impact on Conservation: - Different species require different conditions to flourish. By mapping where these conditions align with sightings, we can predict where invasive species might spread. This predictive insight is crucial for managing and potentially mitigating their impact.

Visualization Importance

Data Representation: - The plotted environmental data provide a visual representation of the climate across Vermont. - Each plot corresponds to a different environmental variable, offering insights into the diverse conditions within the state.
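
Because the panels are titled with raw layer names, a lookup of the standard WorldClim bioclimatic variable definitions can make the plots easier to read. The sketch below builds such a lookup in base R. The helper name label_bioclim() is hypothetical, and the sketch assumes the layers are named bio1 to bio19, which may differ depending on how the data were downloaded.

```r
# Standard WorldClim bioclimatic variable definitions, keyed by assumed layer name
bioclim_labels <- c(
  bio1  = "Annual Mean Temperature",
  bio2  = "Mean Diurnal Range",
  bio3  = "Isothermality",
  bio4  = "Temperature Seasonality",
  bio5  = "Max Temperature of Warmest Month",
  bio6  = "Min Temperature of Coldest Month",
  bio7  = "Temperature Annual Range",
  bio8  = "Mean Temperature of Wettest Quarter",
  bio9  = "Mean Temperature of Driest Quarter",
  bio10 = "Mean Temperature of Warmest Quarter",
  bio11 = "Mean Temperature of Coldest Quarter",
  bio12 = "Annual Precipitation",
  bio13 = "Precipitation of Wettest Month",
  bio14 = "Precipitation of Driest Month",
  bio15 = "Precipitation Seasonality",
  bio16 = "Precipitation of Wettest Quarter",
  bio17 = "Precipitation of Driest Quarter",
  bio18 = "Precipitation of Warmest Quarter",
  bio19 = "Precipitation of Coldest Quarter"
)

# Hypothetical helper: translate layer names into readable titles,
# falling back to the raw name when no label is known
label_bioclim <- function(layer_names) {
  labels <- bioclim_labels[layer_names]
  missing_label <- is.na(labels)
  labels[missing_label] <- layer_names[missing_label]
  unname(labels)
}

print(label_bioclim(c("bio1", "bio12", "elev")))
```

In the plotting loop, the panel title could then be set with main = label_bioclim(names(vermont_data)[i]).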

Outcome

Strategic Development: - This preliminary analysis sets the stage for deeper exploration into how environmental factors correlate with the locations of invasive species. - This understanding is vital for developing strategies to control these species based on predicted changes in climate or habitat suitability.

Predictive Insight: - The data not only reveals where invasive species are currently found but also where they might appear next based on environmental conditions. This insight is fundamental for ecological management and conservation planning in Vermont.

Clean Occurrence Data

# --------------------- Step 6: Clean Occurrence Data ------------------------

# Step 6.1: Check the structure of the occurrence_data to understand its format and variables.
str(occurrence_data)

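# Step 6.2: Extract longitude and latitude from the occurrence_data for further processing.
# occurrence_data became a SpatialPointsDataFrame in Step 3, so the coordinate columns are no
# longer in its attribute table; coordinates() returns them as a matrix instead.
occurrence_coords <- coordinates(occurrence_data)
longitude <- occurrence_coords[, "longitude"]
latitude <- occurrence_coords[, "latitude"]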

# Step 6.3: Combine longitude and latitude into a matrix for spatial operations.
occurrence_points_matrix <- cbind(longitude, latitude)

# Step 6.4: Verify the creation of the matrix by inspecting its initial rows.
head(occurrence_points_matrix)

# Step 6.5: Extract environmental values at occurrence points from the provided raster data.
env_values_at_points <- raster::extract(vermont_data, occurrence_data)

# Step 6.6: Check if the extraction resulted in a list and merge it into a matrix if necessary.
if (is.list(env_values_at_points)) {
  env_values_matrix <- do.call(rbind, env_values_at_points)
} else {
  env_values_matrix <- env_values_at_points
}

# Step 6.7: Filter out points with NA environmental values to ensure data integrity.
valid_points <- complete.cases(env_values_matrix)
occurrence_points_matrix_clean <- occurrence_points_matrix[valid_points, ]

# Step 6.8: Display the dimensions of the cleaned occurrence points matrix to verify the cleaning process.
dim(occurrence_points_matrix_clean)

Understanding Data Structure

Initial Assessment: - Puzzle Analogy: It’s essential to understand what data you have—like checking all the pieces of a puzzle before starting. This step ensures that all necessary information, especially geographic locations, is accessible and correctly formatted.

Geographic and Environmental Synthesis

Data Integration: - Synthesis: By mapping occurrence points to environmental conditions, the script effectively merges biological observations with physical data. This synthesis is crucial for understanding the interactions between species and their habitats.

Focus on Data Quality

Data Integrity: - Cleaning Process: Cleaning the data by removing incomplete records ensures that the subsequent analysis is based on reliable and comprehensive information. It’s similar to filtering out noisy or unclear data in an experiment to focus on the results that provide clear insights.
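
The filtering step can be illustrated on a toy example. The sketch below uses made-up coordinates and environmental values (all numbers are hypothetical) to show how complete.cases() flags rows with no missing values, and how that flag is used to subset the coordinate matrix.

```r
# Toy coordinates and extracted environmental values (hypothetical numbers)
pts <- cbind(longitude = c(-73.1, -72.8, -72.5, -72.2),
             latitude  = c(44.4, 44.1, 43.8, 43.5))
env <- cbind(bio1  = c(65, NA, 58, 61),
             bio12 = c(950, 1010, NA, 1100))

# TRUE only for rows with no NA in any environmental column
keep <- complete.cases(env)

# drop = FALSE keeps the matrix shape even if a single row survives
pts_clean <- pts[keep, , drop = FALSE]
env_clean <- env[keep, , drop = FALSE]

cat("Kept", sum(keep), "of", nrow(pts), "points\n")
```

Here the second and third points are discarded because one of their environmental values is missing, which mirrors what Step 6.7 does with the real occurrence records.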

Preparation for Modeling

Foundation for SDM: - Dataset Readiness: The clean, comprehensive dataset serves as a solid foundation for the next steps in the SDM process, where these points will be used to model potential distributions based on both biological occurrences and environmental factors.

Strategic Importance: - Predictive Modeling: In essence, this process prepares a dataset that not only maps where invasive species have been observed but also contextualizes these locations within the broader environmental landscape of Vermont. - Conservation Impact: This preparation is critical for accurately predicting where these species might thrive under current and future conditions, thereby informing conservation strategies and management decisions.

# --------------------- Step 7: Generate Random Background Points ------------
# Confirm the object's class
print(class(vermont_data))
# Step 7.1: Set seed for reproducibility
set.seed(123)

# Step 7.2: Calculate the number of non-NA cells and define max sample size
non_na_cells <- sum(!is.na(values(vermont_data[[1]])))
max_sample_size <- min(10000, non_na_cells)

# Generate random background points and keep only the XY coordinates
background_points <- raster::sampleRandom(vermont_data[[1]], size = max_sample_size, xy = TRUE)
background_points <- background_points[, 1:2]  # Subset to x and y columns immediately

# Check what's inside background_points after subsetting
print(head(background_points))

# Step 7.3: Extract environmental values at the confirmed XY coordinates
bg_env_values <- raster::extract(vermont_data[[1]], background_points)

# Print the results to confirm successful extraction
print(bg_env_values)

# Step 7.4: Identify and retain valid background points (non-NA values)
valid_bg_points <- !is.na(bg_env_values)

# Step 7.5: Output the count of valid background points
cat("Number of valid background points:", sum(valid_bg_points), "\n")

# Step 7.6: Check the resolution and dimensions of your environmental data
cat("Resolution of raster data:", res(vermont_data), "\n")
cat("Dimensions of raster data:", dim(vermont_data), "\n")

# Step 7.7: Print the extent to see the covered area
print(vermont_extent)
# OR to print using cat:
cat("Extent of Vermont data: xmin =", vermont_extent@xmin, 
    "xmax =", vermont_extent@xmax, 
    "ymin =", vermont_extent@ymin, 
    "ymax =", vermont_extent@ymax, "\n")


# Step 7.8: Calculate the width and height in degrees from the extent defined in Step 5
width_degrees <- vermont_extent@xmax - vermont_extent@xmin
height_degrees <- vermont_extent@ymax - vermont_extent@ymin

# Step 7.9: Calculate the number of cells that can fit within the extent
cells_horizontal <- width_degrees / res(vermont_data)[1]
cells_vertical <- height_degrees / res(vermont_data)[2]

# Step 7.10: Calculate total cells
total_cells <- cells_horizontal * cells_vertical

# Step 7.11: Print calculated values
cat("Width in degrees:", width_degrees, "\n")
cat("Height in degrees:", height_degrees, "\n")
cat("Cells horizontally:", cells_horizontal, "\n")
cat("Cells vertically:", cells_vertical, "\n")
cat("Total cells in extent:", total_cells, "\n")

# Step 7.12: Compare to the actual grid dimensions
actual_grid_cells <- prod(dim(vermont_data[[1]]))
cat("Actual grid cells in raster data:", actual_grid_cells, "\n")

# Step 7.13: Explanation and action based on comparison
if (total_cells > actual_grid_cells) {
  cat("The extent's calculated cell capacity exceeds the raster grid cells. Consider adjusting the extent or resolution.\n")
} else {
  cat("The extent's calculated cell capacity is within the raster grid's limits.\n")
}

# Step 7.14: Generate random background points without exceeding the raster grid's capacity
background_points <- randomPoints(vermont_data[[1]], n = min(10000, actual_grid_cells))

# Step 7.14.1: Check if the output is a matrix and convert it to SpatialPoints
if (is.matrix(background_points)) {
  background_points <- as.data.frame(background_points)
  names(background_points) <- c("x", "y")
  # Convert data frame to SpatialPoints
  background_points <- SpatialPoints(background_points, proj4string = crs(vermont_data))
}

# Step 7.14.2: Bind the SpatialPoints with an empty data frame to create a SpatialPointsDataFrame
background_points_df <- SpatialPointsDataFrame(background_points, 
                                               data.frame(row.names = row.names(background_points)))

# Ensure background_points_df is now a SpatialPointsDataFrame
if (!inherits(background_points_df, "SpatialPointsDataFrame")) {
  stop("background_points is not a SpatialPointsDataFrame")
}

# Step 7.15: Proceed with extraction and modeling as planned
# Extract environmental values using SpatialPointsDataFrame
bg_env_values <- raster::extract(vermont_data, background_points_df)

# Step 7.16: Identify and retain valid background points (no NA in any layer)
# bg_env_values is a matrix with one column per layer, so whole rows are tested
valid_bg_points <- complete.cases(bg_env_values)

# Step 7.17: Output the final count of valid background points
cat("Final count of valid background points:", sum(valid_bg_points), "\n")


Why Random Points?

Importance of Comparison: - Comparative Analysis: In modeling species distributions, it’s crucial to compare places where the species is observed with a random sample of other locations. This comparison helps determine if the environmental conditions where the species occur are significantly different from the general environment.

Environmental Context

Ecological Insights: - Data Extraction: By extracting environmental values at these points, we gain insights into the broader ecological backdrop of Vermont. This information is crucial for understanding the potential drivers behind species distributions.

Data Integrity

Ensuring Accuracy: - Data Validation: Ensuring that all data points used in the analysis are valid (i.e., have complete environmental data) is essential for the accuracy of the model. It guarantees that the predictions made by the Species Distribution Model (SDM) are based on reliable and comprehensive environmental information.

Spatial Accuracy

Data Precision: - Resolution and Extent Checks: Checking the resolution and extent of the data against the generated points helps ensure that the environmental data used is appropriate and accurate for the region’s analysis. This step is akin to ensuring you have a detailed and accurate map before planning a route.
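
The resolution check can be worked through by hand. The sketch below repeats the arithmetic from Step 7 in base R, under the assumption that res = 10 in the download step means 10 arc-minutes, that is 1/6 of a degree per cell. It suggests that the Vermont bounding box holds only on the order of 160 cells at this resolution, which is why a request for 10,000 distinct background cells has to be capped.

```r
# Vermont bounding box used in the script (degrees)
xmin <- -73.437; xmax <- -71.465
ymin <- 42.730;  ymax <- 45.016

# 10 arc-minutes expressed in degrees (assumed meaning of res = 10)
cell_size <- 10 / 60

# Number of cells that fit across and down the extent
cells_horizontal <- (xmax - xmin) / cell_size
cells_vertical   <- (ymax - ymin) / cell_size
total_cells      <- cells_horizontal * cells_vertical

cat("Cells horizontally:", round(cells_horizontal, 2), "\n")
cat("Cells vertically:", round(cells_vertical, 2), "\n")
cat("Approximate total cells:", round(total_cells), "\n")

# Compare with the number of background points requested in the script
requested_points <- 10000
cat("Request exceeds available cells:", requested_points > total_cells, "\n")
```

The cropped raster may contain slightly more cells than this estimate because cropping snaps to whole cells, and some of those cells can be NA.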

Transitioning to Visualization

Setup Complete: - Ready for Visualization: Now that all data points have been validated and set up, we transition to actual visualization of the data. This next phase involves deploying the MaxEnt Model to graphically represent the ecological insights derived from our analysis.

MaxEnt Model

Model Application: - Purpose: The MaxEnt Model will be used to visualize the potential distribution of invasive species across Vermont, based on the environmental data and observations gathered. This model is key to predicting and understanding the spread of these species under current and future environmental scenarios.

# --------------------- Step 8: Fit MaxEnt Model ------------------------------

# Step 8.1: Fit a MaxEnt model using the provided data.
maxent_model <- dismo::maxent(x = vermont_data, p = occurrence_points_matrix_clean, a = background_points)

# Step 8.2: Print a summary of the fitted MaxEnt model to review its performance and parameters.
summary(maxent_model)

# Step 8.3: Plot the MaxEnt model (a dot chart of variable contributions), including a title for clarity.
plot(maxent_model, main = "MaxEnt Model: Variable Contribution", col = "black")  # Plot dots with black color

# Step 8.4: Add a custom legend to the plot with unfilled circles for different contribution levels.
legend("bottomright",  # Adjust position as needed
       legend = c("High Contribution (>30%)", "Medium Contribution (15-30%)", "Low Contribution (<15%)"),
       pch = 21,  # Use unfilled circle for legend
       pt.bg = "white",  # No fill for circles in legend
       text.col = "black",  # Text color for legend
       bty = "n",  # No box around the legend
       cex = 0.8)  # Adjust text size accordingly

Figure 18: MaxEnt Model

The code section fits a MaxEnt model to the provided data and reviews it through visualization. It involves training the MaxEnt model, summarizing the fitted object, plotting the contribution of each predictor variable, and adding a custom legend to interpret the contribution levels.

MaxEnt Model Summary

This section provides a summary of the MaxEnt model, including its length, class, and mode.
- Length: The number of elements in the MaxEnt model object.
- Class: Indicates the class of the object, in this case, “MaxEnt”.
- Mode: Describes how the object is stored or represented.

MaxEnt Model Variable Contribution Plot

This plot visualizes the contribution of each predictor variable to the MaxEnt model. Calling plot() on a dismo MaxEnt object draws a dot chart of variable contributions; response curves are produced separately with dismo::response().
The y-axis lists the environmental variables or predictors used in the model.
The x-axis represents the percentage contribution of each variable in predicting species occurrence.
Each dot on the plot corresponds to a predictor variable, and its horizontal position indicates the contribution level.
Dots farther to the right indicate higher contributions to the model’s predictions.

Custom Legend

The legend provides additional information about the contribution levels of predictor variables.
It categorizes predictor variables into three groups based on their contribution levels:
- High Contribution (>30%): Variables with a significant impact on species occurrence predictions.
- Medium Contribution (15-30%): Variables with a moderate impact on predictions.
- Low Contribution (<15%): Variables with a minimal impact on predictions.
The legend helps interpret the importance of predictor variables in the MaxEnt model.

Understanding MaxEnt

The MaxEnt model helps predict where invasive species might be found based on where they’ve been observed and the environmental conditions at those locations. It assumes that species distributions are spread as widely as possible (maximum entropy) while still aligning with observed data.

Model’s Role

By fitting this model, we can identify areas likely to be suitable for invasive species, not just where they have currently been observed. This predictive capability is crucial for preemptive conservation and management efforts.

Interpreting Outputs

The visualization and summary of the model provide insights into which environmental factors are most important for the species’ distribution. For instance, factors in the “High Contribution” category are likely crucial for the species’ presence in Vermont.
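
The three legend categories can also be assigned programmatically, which avoids reading them off the plot by eye. The sketch below is a minimal illustration in base R. The contribution values are hypothetical, and classify_contribution() is a made-up helper name. With a fitted model, the percentages would have to be taken from the model's results, which is not shown here.

```r
# Hypothetical percent contributions; real values would come from the fitted model
contribution <- c(bio1 = 42.5, bio12 = 21.0, bio4 = 18.3, bio15 = 9.7, bio6 = 8.5)

# Hypothetical helper matching the legend thresholds:
# High (>30%), Medium (15-30%), Low (<15%)
classify_contribution <- function(pct) {
  ifelse(pct > 30, "High", ifelse(pct >= 15, "Medium", "Low"))
}

contribution_table <- data.frame(
  variable = names(contribution),
  percent = unname(contribution),
  level = classify_contribution(unname(contribution)),
  stringsAsFactors = FALSE
)

# Sort from most to least influential predictor
contribution_table <- contribution_table[order(-contribution_table$percent), ]
print(contribution_table)
```

A table like this makes it straightforward to report which predictors fall into the “High Contribution” group for a given species.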

Practical Implications

For conservationists and environmental managers, understanding these key contributing factors allows for more targeted actions, such as habitat management or monitoring efforts, especially in areas predicted as suitable but currently unoccupied by the species.

# --------------------- Step 9: Occurrence Data Cleaning and Validation Point  ---------------------
# --------- THIS SECTION OF THE CODE HAS A LOT OF TROUBLESHOOTING ----------------------------------
# --------- I COULD NOT WORK OUT HOW TO DO THIS ANY OTHER WAY --------------------------------------

# Check the structure of the background_points object to ensure data integrity.
str(background_points)

# Check the dimensions of the background_points object to verify the size of the dataset.
dim(background_points)

# Extract longitude and latitude coordinates from background_points and create a new SpatialPoints object
background_points_coords <- SpatialPoints(coords = background_points@coords)

# Step 9.1: Create Data Frame for Presence Points
# Create a data frame for presence points using longitude, latitude, and presence indicators.
presence_df <- data.frame(
  longitude = occurrence_points_matrix_clean[, "longitude"],
  latitude = occurrence_points_matrix_clean[, "latitude"],
  presence = 1
)

# Step 9.2: Create Data Frame for Absence Points
# Background points stand in for absences (pseudo-absences), so they are coded presence = 0.
absence_df <- data.frame(
  longitude = coordinates(background_points_coords)[, 1],  # Extracting longitude
  latitude = coordinates(background_points_coords)[, 2],   # Extracting latitude
  presence = 0
)

# Step 9.3: Combine Presence and Absence Data Frames
# Merge the presence and absence data frames to create a unified validation data set.
validation_points <- rbind(presence_df, absence_df)

# Step 9.4: Check Structure and Summary of Validation Points
# Inspect the structure and summary statistics of the validation points data frame.
str(validation_points)
summary(validation_points)

# Define predictor variables and calculate the expected number of columns.
predictor_variables <- c("bio1", "bio2", "bio3", "bio4", "bio5")
expected_columns <- length(predictor_variables) + 3  # Add 3 columns for longitude, latitude, and presence.

# Print the expected number of columns
cat("Expected number of columns:", expected_columns, "\n")

# Print the actual number of columns
cat("Actual number of columns:", ncol(validation_points), "\n")

# Print the column names
cat("Column names:", colnames(validation_points), "\n")

# Step 9.5: Add predictor variables to the validation_points data frame
validation_points[, predictor_variables] <- NA

# Step 9.6: Check if Predictor Variables were Successfully Added
if (ncol(validation_points) != expected_columns) {
  stop("Predictor variables were not successfully added to the validation_points data frame.")
} else {
  cat("Predictor variables were successfully added to the validation_points data frame.\n")
}

# Print column names and check the structure of validation_points.
colnames(validation_points)
ncol(validation_points)
str(validation_points)

# Step 9.6.2: Convert the RasterBrick to a SpatRaster object
vermont_data_spat <- rast(vermont_data)

# Step 9.6.3: Convert the validation_points data frame to a terra SpatVector object
validation_sp <- vect(validation_points, geom = c("longitude", "latitude"), crs = crs(vermont_data_spat))

# Step 9.7: Extract Environmental Data for Validation Points using the 'terra' package
# Make sure to use terra::extract to specify the correct package
validation_env_values <- terra::extract(vermont_data_spat, validation_sp)

# Check the structure of the extracted data to confirm it was successful
str(validation_env_values)

# Step 9.7: Create a SpatialPoints object from the validation coordinates
validation_coords <- cbind(validation_points$longitude, validation_points$latitude)
validation_sp <- SpatialPoints(validation_coords, proj4string = crs(vermont_data))

# Step 9.8: Create a SpatialPointsDataFrame by binding the SpatialPoints with the presence/absence data
validation_spdf <- SpatialPointsDataFrame(validation_sp, validation_points[, c("presence", "bio1", "bio2", "bio3", "bio4", "bio5")])

# Step 9.9: Extract Environmental Data for Validation Points
# Convert SpatialPointsDataFrame to SpatVector
validation_spatvec <- vect(validation_spdf)

# Extract environmental data for validation points
validation_env_values <- terra::extract(vermont_data_spat, validation_spatvec)

# Now, you can check the structure of the extracted data to confirm it was successful
str(validation_env_values)

# Step 9.7.1: Check if the Result is a List
# terra::extract returns a data.frame, and is.list() is also TRUE for data frames,
# so test explicitly for a plain list before exploring it element by element.
is_plain_list <- is.list(validation_env_values) && !is.data.frame(validation_env_values)
if (is_plain_list) {
  lapply(validation_env_values, str)
} else {
  str(validation_env_values)
}

# Step 9.7.2: Bind Environmental Data to a Matrix
# Row-bind only a plain list of per-point records; a data.frame is used as-is.
if (is_plain_list) {
  column_check <- lengths(validation_env_values)
  if (all(column_check == column_check[1])) {
    validation_env_matrix <- do.call(rbind, validation_env_values)
  } else {
    stop("Not all data elements have the same number of columns.")
  }
} else {
  validation_env_matrix <- validation_env_values
}

# Step 9.7.3: Set Column Names of Environmental Data
# Set column names for the environmental data matrix if dimensions match.
if (ncol(validation_env_matrix) == length(predictor_variables)) {
  colnames(validation_env_matrix) <- predictor_variables
} else {
  cat("Mismatch in the number of columns and predictor variables\n")
  cat("Number of columns in validation_env_matrix:", ncol(validation_env_matrix), "\n")
  cat("Number of predictor variables:", length(predictor_variables), "\n")
}

# Step 9.7.4: Debugging Output
# Print debugging information if there's a mismatch in column numbers.
if (ncol(validation_env_matrix) != length(predictor_variables)) {
  print("Mismatch detected:")
  print(paste("Expected columns:", length(predictor_variables), "but found:", ncol(validation_env_matrix)))
}

# Step 9.8: Verify the Updated Structure of Validation Points
# If extraction returned a data.frame or matrix (not a plain list), use it directly.
if (is.data.frame(validation_env_values) || !is.list(validation_env_values)) {
  validation_env_matrix <- validation_env_values
  if (is.null(colnames(validation_env_matrix))) {
    colnames(validation_env_matrix) <- paste0("bio", seq_len(ncol(validation_env_matrix)))
  }
}

nrow(validation_env_matrix)
nrow(validation_points)

# Check if the number of rows in validation_env_values matches validation_points
if (nrow(validation_env_values) != nrow(validation_points)) {
  stop("Number of rows in validation_env_values does not match validation_points.")
}

# Combine environmental data with existing validation_points data frame
validation_points <- cbind(validation_points[, c("longitude", "latitude", "presence")], validation_env_values)

# Verify the updated structure of validation_points data frame
str(validation_points)

# Optionally, review the first few rows of the updated data frame
head(validation_points)

# Step 9.10: Check for Missing Values
# Count the number of missing values in each column of the validation_points data frame.
na_count <- sapply(validation_points, function(x) sum(is.na(x)))
cat("Number of NAs in each column:\n")
print(na_count)

# Step 9.11: Summary Statistics for Environmental Variables
# Compute summary statistics for the environmental variables in the validation_points data frame.
# Select the bioclimatic columns by name rather than by position, because terra::extract also returns an ID column.
summary(validation_points[, grep("^bio", names(validation_points)), drop = FALSE])

Histogram of BIO1 (Annual Mean Temperature)

# Step 9.12: Basic Histograms for Environmental Variables
# Generate histograms to visualize the distribution of environmental variables.
hist(validation_points$bio1, 
     main = "Histogram of BIO1 (Annual Mean Temperature)", 
     xlab = "BIO1 Values (Temperature)", 
     ylab = "Frequency", 
     col = "blue",  # Color to differentiate bins
     border = "black",  # Color of bin borders
     breaks = 30)  # Adjust the number of bins if necessary

Figure 19: Histogram of BIO1 (Annual Mean Temperature)

Figure 19: Histogram Analysis: BIO1 (Annual Mean Temperature)

Description of Histogram

Overview: This histogram visualizes the distribution of annual mean temperatures across Vermont, allowing for an analysis of temperature patterns within the study area.

Details:
- X-axis: Represents temperature in degrees Celsius.
- Y-axis: Represents the frequency of observations.
- Bar Representation: Each bar in the histogram corresponds to a range of temperatures. The height of each bar indicates the number of observations falling within that specific temperature range.

Analytical Insights

Purpose of the Histogram: The histogram helps in understanding how temperatures are distributed across Vermont. It provides a clear visual representation of the thermal environment, which is essential for ecological and climatological studies.

Utility:
- Pattern Recognition: By examining the histogram, researchers can identify any patterns or anomalies in temperature distribution. This is crucial for assessing climate variability and potential impacts on local ecosystems.

Implications:
- Data-Driven Decisions: The insights from the histogram can aid in making informed decisions related to environmental management and conservation planning, especially in the context of climate change adaptation strategies.

# Histogram of BIO12 (Annual Precipitation); the variable plotted must match the title and Figure 20.
hist(validation_points$bio12,
     main = "Histogram of BIO12 (Annual Precipitation)",
     xlab = "BIO12 Values (Precipitation, mm)",
     ylab = "Frequency",
     col = "green",  # Color to differentiate bins
     border = "black",  # Color of bin borders
     breaks = 30)  # Adjust the number of bins if necessary

Figure 20: Histogram of BIO12 (Annual Precipitation)

Figure 20: Histogram Analysis: BIO12 (Annual Precipitation)

Description of Histogram

Overview: This histogram visualizes the distribution of annual precipitation amounts across Vermont, providing insights into the variability and overall patterns of precipitation within the study area.

Details:
- X-axis: Represents precipitation in millimeters.
- Y-axis: Represents the frequency of observations.
- Bar Representation: Each bar in the histogram corresponds to a range of precipitation values. The height of each bar indicates the number of observations falling within that specific range.

Analytical Insights

Purpose of the Histogram: The histogram is crucial for understanding the distribution and variability of precipitation levels across Vermont. It offers a quantitative view that is essential for water resource management and ecological assessments.

Utility:
- Pattern Recognition: By examining the histogram, researchers can identify patterns in precipitation distribution, which is vital for predicting water availability and managing flood risks.
- Anomaly Detection: Helps in identifying unusual precipitation patterns that may indicate climatic shifts or anomalies.

Implications:
- Data-Driven Decisions: Insights from the histogram can guide decisions in sectors dependent on water resources, such as agriculture, forestry, and urban planning.
- Climate Adaptation Strategies: Understanding precipitation patterns assists in developing strategies to cope with potential climate change impacts, ensuring sustainable management of natural resources.

Summary of Data Analysis: Temperature and Precipitation Distributions

Overview: These histograms provide visual insights into the distributions of temperature and precipitation data across Vermont, highlighting key environmental factors that affect ecological dynamics.

Why Combined Data Frames?

Purpose of Integration:
- Comprehensive Analysis: By creating a dataset that includes locations where the species was observed and background locations where it was not, alongside environmental data, researchers can identify conditions that might favor the species’ presence.
- Enhanced Predictive Power: This integration gives the species distribution model (SDM) the contrast it needs between occupied and unoccupied conditions, supporting more accurate predictions of where the species may occur.
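As a minimal sketch of this integration step, the example below stacks a small presence table and a small background table into one data frame and counts the rows in each class. The coordinates are invented for illustration, and background_df is a hypothetical name standing in for the background (pseudo-absence) points.

```r
# Toy coordinates, invented for illustration (roughly within Vermont's extent)
presence_df <- data.frame(
  longitude = c(-73.2, -72.6, -72.9),
  latitude  = c(44.5, 44.3, 43.6),
  presence  = 1
)
background_df <- data.frame(
  longitude = c(-72.1, -73.0, -72.4, -71.9),
  latitude  = c(44.9, 43.1, 44.0, 44.7),
  presence  = 0
)

# Stack the two sets; the column names must match for rbind()
combined <- rbind(presence_df, background_df)

# Count rows per class to confirm that both classes are present
class_counts <- table(combined$presence)
print(class_counts)
```

Checking the class counts at this point is a cheap safeguard, since a logistic regression or an ROC curve cannot be computed if only one class survives the data preparation.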

Role of Environmental Predictors

Utility of Predictors:
- Critical Analysis: These predictors are essential for understanding which aspects of the environment, like temperature or precipitation, are most influential in species distribution.
- Data Extraction: Initially, placeholders (NA) are used until specific environmental data is extracted, ensuring each site’s conditions are accurately represented.

Importance of Data Integrity and Validation

Ensuring Accuracy:
- Data Validation: It’s crucial to check that the dataset is free of structural errors and fully populated with environmental data, so that the modeling is based on comprehensive and accurate information.
- Reliable Predictions: Accurate data is essential for reliable predictions, forming the backbone of effective ecological modeling.

Logical Implications

Preparation for Predictive Modeling

Readiness for Analysis: - Advanced Techniques: The meticulously prepared dataset, now complete with environmental variables, is ready for advanced statistical techniques. These techniques will predict potential distribution areas for invasive species, enhancing our understanding of ecological threats.

Conservation and Management Applications

Strategic Use:
- Conservation Insights: With a robust SDM, conservationists can better understand and predict the spread of invasive species.
- Effective Management Strategies: This understanding enables more targeted and effective management strategies to protect Vermont’s ecosystems, ensuring that conservation efforts are well-informed and strategically implemented.

Implementing Data Imputation

Understanding the Role of Data Imputation in Environmental Sciences

Why Imputation?

Addressing Missing Data:
- Necessity: In data analysis, especially in environmental sciences, missing data can skew results and lead to inaccurate conclusions.
- Solution: Imputation fills these gaps so that every data point can be used in the analysis. Simple median imputation does reduce the variability of the imputed variable, so it works best when only a small share of values is missing.

Choice of Median for Imputation

Robustness Against Outliers: - Advantages: The median is robust against outliers, making it a preferred choice for imputation in environmental data where extreme values can disproportionately affect the mean.
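The effect of an extreme value on the mean and the median can be seen in a small example. The values below are invented for illustration and are not taken from the WorldClim data.

```r
# Invented temperature-like values with one missing entry and one extreme value
x <- c(5.1, 5.4, 5.8, 6.0, NA, 6.3, 40.0)

# The extreme value pulls the mean upward but barely affects the median
mean_value   <- mean(x, na.rm = TRUE)
median_value <- median(x, na.rm = TRUE)
print(c(mean = mean_value, median = median_value))

# Fill the missing entry with the median
x_imputed <- x
x_imputed[is.na(x_imputed)] <- median_value
print(x_imputed)
```

Here the median stays close to the bulk of the values, so the filled entry looks like a typical observation, whereas the mean would have placed it well above all but one of them.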

Effectiveness of Imputation

Maintaining Data Integrity:
- Consistency: Post-imputation, the statistical summaries remain consistent with those computed before imputation.
- Implication: This suggests that the imputation has not materially distorted the data distribution, although filling gaps with a single value always narrows the spread slightly.
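One way to check this consistency is to compare summary statistics before and after imputation. The sketch below uses simulated data (not the study's validation points) with 20% of values removed at random; all object names are hypothetical.

```r
set.seed(1)

# Simulated variable with 20% of values removed at random (illustrative only)
original  <- rnorm(200, mean = 6, sd = 1.5)
with_gaps <- original
with_gaps[sample(200, 40)] <- NA

# Median imputation
imputed <- with_gaps
imputed[is.na(imputed)] <- median(with_gaps, na.rm = TRUE)

# Compare the center and spread of the observed values with the imputed series
comparison <- data.frame(
  stage  = c("observed only", "after imputation"),
  median = c(median(with_gaps, na.rm = TRUE), median(imputed)),
  sd     = c(sd(with_gaps, na.rm = TRUE), sd(imputed))
)
print(comparison)
```

In this setup the median is unchanged while the standard deviation shrinks, which is the expected side effect of filling every gap with the same value.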

Comparison of Steps Before and After Imputation

Before Imputation (Step 9)

Challenges: - Data Gaps: The dataset possibly contained missing values which could lead to incomplete analyses or biased results due to insufficient data.

After Imputation (Step 10)

Improvements: - Data Completeness: All missing values are filled, ensuring that the dataset is complete. This allows for more reliable and robust statistical analysis and modeling, providing a solid foundation for any ecological or geographical inferences to be drawn.

Conclusion: Integration and Enhancement

Purpose and Benefits:
- Step 9: Ensures that the data is correctly structured and integrated with necessary environmental variables.
- Step 10: Improves the dataset’s completeness and usability for downstream analyses.

Strategic Importance: - Both steps are integral to data preparation but serve different purposes, highlighting the importance of a methodical approach to maintaining data quality throughout the analysis process.

# -------------------- Step 10: Implementing Data Imputation ----------------------------

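# Simple median imputation for missing values
# Select the bioclimatic columns by name so that non-environmental columns (e.g. the ID column) are left untouched.
bio_cols <- grep("^bio", names(validation_points), value = TRUE)
for (col in bio_cols) {
  missing_rows <- is.na(validation_points[[col]])
  validation_points[missing_rows, col] <- median(validation_points[[col]], na.rm = TRUE)
}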

# Check the results after imputation
summary(validation_points)

# Recreate histograms with the imputed data

# Step 10.1: Histogram for BIO1 (Annual Mean Temperature)
# Display the distribution of the annual mean temperature (BIO1) after imputation of missing data.
hist(validation_points$bio1, 
     main = "Histogram of BIO1 (Annual Mean Temperature)", 
     xlab = "Temperature (deg C)",  # Simple text replacement
     ylab = "Frequency", 
     col = "blue",  
     border = "black",  
     breaks = 30)

Figure 21: Histograms and Data Imputation

# Step 10.2: Histogram for BIO12 (Annual Precipitation)
# Show the distribution of annual precipitation amounts (BIO12) after imputation.
hist(validation_points$bio12,
     main = "Histogram of BIO12 (Annual Precipitation)",
     xlab = "Precipitation (mm)",
     ylab = "Frequency of Observations",
     col = "green",  # Color to differentiate bins
     border = "black",  # Color of bin borders
     breaks = 30)  # Adjust the number of bins if necessary

Figure 21: Histograms and Data Imputation



Figure 21: Environmental Data Distributions in Vermont

Histogram Analysis for BIO1 (Annual Mean Temperature)

Overview:
- Purpose: This histogram displays the distribution of annual mean temperatures across Vermont, providing a visual representation of thermal conditions.
- X-axis: Temperature in degrees Celsius.
- Y-axis: Frequency of observations.
- Bar Details: Each bar represents a range of temperatures, with the height indicating the number of observations within that range.

Analytical Insights:
- Temperature Distribution: By examining the histogram, researchers can understand how temperatures are distributed across the study area and identify any patterns or anomalies.

Histogram Analysis for BIO12 (Annual Precipitation)

Overview:
- Purpose: This histogram illustrates the distribution of annual precipitation amounts across Vermont.
- X-axis: Precipitation in millimeters.
- Y-axis: Frequency of observations.
- Bar Details: Similar to the first histogram, each bar represents a range of precipitation values, with the height indicating the number of observations within that range.

Analytical Insights:
- Precipitation Variability: Analyzing this histogram helps in understanding the variability and distribution of precipitation levels across the study area.

Summary of Histogram Analyses

Integrated Insights:
- These histograms provide visual insights into the distributions of temperature and precipitation data in Vermont.
- Methodology: Because missing values were filled using simple median imputation, the histograms cover every validation point; the imputed values appear as additional counts in the bin that contains the median.

Implications for Research:
- Enhanced Understanding: The visual analysis of these histograms aids researchers and decision-makers in understanding climatic trends and anomalies in Vermont.
- Support for Ecological Studies: The insights gained are essential for ecological research, conservation planning, and preparing for climatic changes.

Train Logistic Regression Model- GLM

# --------------------- Step 11: Train Logistic Regression Model- GLM  ------------------------------

# glm() is part of the stats package, which ships with base R and is attached by default,
# so no installation or library() call is needed.

# Step 11.1: Define the Model Formula
# Define the formula for logistic regression model using 'presence' as the binary outcome and environmental variables (bio1 to bio5) as predictors.
model_formula <- presence ~ bio1 + bio2 + bio3 + bio4 + bio5

# Step 11.2: Train the Logistic Regression Model
# Train the logistic regression model using the defined formula and the dataset.
model <- glm(model_formula, data=validation_points, family=binomial())

# Step 11.3: Check Model Summary
# After training the model, examine the summary to understand the significance of predictors and model fit.
summary(model)

# Step 11.4: Diagnostics and Validation
# Perform diagnostic checks to validate the model, which may include examining residuals and assessing model goodness of fit.
plot(model)

Figure 22: Train Logistic Regression Model- GLM

# Step 11.5: Saving the Model
# Set the working directory where you want to save the model.
setwd("D:/GEOG_588/SDM_Invasive_species")

# Check to confirm the current working directory.
getwd()

# Save the model in the current working directory.
save(model, file="logistic_model.RData")

# Define the full path to the model file.
model_path <- "D:/GEOG_588/SDM_Invasive_species/logistic_model.RData"

# Check if the model file exists at the specified path.
if (file.exists(model_path)) {
  print(paste("Model file saved successfully at:", model_path))
} else {
  print("Failed to save the model file at the specified path.")
}


Figure 22: Train Logistic Regression Model- GLM - Overview of Environmental Influences on Species Distribution
This figure shows the diagnostic plots produced by plot(model) for the logistic regression of species presence on the environmental variables. They are used to check for residual patterns and influential observations before the coefficients are interpreted.

Understanding Logistic Regression in Ecological Modeling

Why Use Logistic Regression?

Purpose and Fit:
- Binary Outcome Prediction: Logistic regression is chosen because it’s ideal for predicting a binary outcome, such as the presence or absence of a species, based on multiple influencing factors, which are environmental variables in this context.
- Decision Utility: This model helps understand which conditions are likely to favor or inhibit the presence of the species, making it invaluable for ecological decision-making.

Significance of Environmental Predictors

Interpreting Coefficients:
- Positive Coefficients (e.g., bio5): Variables like bio5 with positive coefficients suggest that, with the other predictors held constant, increasing values raise the probability of species presence, indicating favorable environments.
- Negative Coefficients (e.g., bio1, bio2): Conversely, variables with negative coefficients imply that as these conditions increase, the probability of species presence decreases, indicating less favorable or even detrimental conditions.
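The sign interpretation can be illustrated on simulated data where the true effects are known. This is a sketch under stated assumptions: the data are simulated, and the predictor names x_pos and x_neg are hypothetical stand-ins for environmental variables, not the study's bioclimatic layers.

```r
set.seed(42)
n <- 500

# Simulated predictors; the names x_pos and x_neg are hypothetical
sim <- data.frame(x_pos = rnorm(n), x_neg = rnorm(n))

# True model: presence becomes more likely as x_pos rises and less likely as x_neg rises
linear_predictor <- -0.5 + 1.2 * sim$x_pos - 0.8 * sim$x_neg
sim$presence <- rbinom(n, size = 1, prob = plogis(linear_predictor))

fit <- glm(presence ~ x_pos + x_neg, data = sim, family = binomial())

# exp(coefficient) is the multiplicative change in the odds of presence
# for a one-unit increase in the predictor, holding the other predictor constant
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 corresponds to a positive coefficient and an odds ratio below 1 to a negative one, which is often easier to communicate than the raw log-odds scale.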

Model Evaluation

Statistical Significance and Model Fit:
- Significance Values (Pr(>|z|)): These values tell us whether the effects of environmental variables on species presence are statistically significant. For instance, variables like bio1 and bio2 are highly significant, implying strong evidence that these factors influence the species’ distribution.
- Model Fit Metrics: The residual deviance, compared with the null deviance (from a model with no predictors), shows how much the predictors improve the fit. The AIC (Akaike Information Criterion) allows comparison between candidate models, with lower values preferred.
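The comparison against a null model can be made explicit by fitting an intercept-only model alongside the full one. The sketch below uses simulated data with a single hypothetical predictor x; it is not the study's model.

```r
set.seed(7)
n <- 400

# Simulated data with one informative predictor (illustrative only)
sim <- data.frame(x = rnorm(n))
sim$presence <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.0 * sim$x))

null_fit <- glm(presence ~ 1, data = sim, family = binomial())  # no predictors
full_fit <- glm(presence ~ x, data = sim, family = binomial())

# Lower residual deviance and lower AIC indicate a better fit than the null model
fit_table <- data.frame(
  model    = c("null", "full"),
  deviance = c(deviance(null_fit), deviance(full_fit)),
  AIC      = c(AIC(null_fit), AIC(full_fit))
)
print(fit_table)
```

The null deviance reported by summary() for the full model is the residual deviance of this intercept-only fit, so the two rows of the table correspond to the two deviance lines in the summary output.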

Model Diagnostics

Ensuring Reliability:
- Diagnostic Plots: By examining diagnostic plots, researchers can check for anomalies such as outliers or patterns in residuals that might suggest the model does not fit the data well. These checks are crucial for confirming the model’s reliability.

Saving and Utilizing the Model

Application and Future Use:
- Model Storage: Storing the model allows for its application in future studies or for predictive purposes, such as anticipating changes in species distribution due to environmental changes.
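A stored model can later be reloaded and applied to new sites. The sketch below uses saveRDS() and readRDS() as an alternative to save(), because readRDS() returns the object so it can be assigned to any name; the toy model, the temporary file path, and the new_sites values are all assumptions made for illustration.

```r
set.seed(3)

# Toy model on simulated data (illustrative only)
sim <- data.frame(x = rnorm(100))
sim$presence <- rbinom(100, size = 1, prob = plogis(sim$x))
fit <- glm(presence ~ x, data = sim, family = binomial())

# Temporary file used for illustration; a project would use its own path
model_file <- tempfile(fileext = ".rds")
saveRDS(fit, model_file)

# Reload under a new name and predict for new predictor values
reloaded  <- readRDS(model_file)
new_sites <- data.frame(x = c(-1, 0, 1))
predicted <- predict(reloaded, newdata = new_sites, type = "response")
print(round(predicted, 3))
```

The new data must contain columns with the same names as the predictors used when the model was fitted, otherwise predict() cannot match them.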

Summary

Model Overview: - This summary explains how the logistic regression model is structured, trained, and evaluated. It highlights the implications of its findings in a way that emphasizes both the technical process and its relevance to ecological research and management, providing a comprehensive understanding of the model’s utility and application in ecological contexts.

AUC Calculation and Visualization

# ------------------------ Step 12: AUC Calculation and Visualization -------------------------------
# ------------------------ SAME FOR HERE ... THIS SECTION HAS LOTS OF TROUBLESHOOTING ---------------
# ------------------------ I COULD NOT WORK OUT HOW TO MAKE THIS WORK ANY OTHER WAY -----------------

# Step 12.1: Data Types and Content Check
# Examine the structure and summary of the presence and predicted_prob columns
str(validation_points$presence)
str(validation_points$predicted_prob)
summary(validation_points$presence)
summary(validation_points$predicted_prob)

# Step 12.2: Removal of NA Values
# Determine the count of rows with non-NA values for both presence and predicted_prob
valid_data_count <- sum(complete.cases(validation_points$presence, validation_points$predicted_prob))
print(valid_data_count)

# If valid data count is low, further investigation is needed
if (valid_data_count == 0) {
  print("All data points have NAs after NA removal or there is only one class present. Check data preparation steps.")
}

# Step 12.3: Ensure Presence Data Contains Two Classes
# Check the unique values in the presence column
print(unique(validation_points$presence))

# Step 12.4: Validate Predicted_prob Values
# Verify if predicted_prob values fall within the 0-1 range
if (any(validation_points$predicted_prob < 0 | validation_points$predicted_prob > 1, na.rm = TRUE)) {
  print("Some predicted_prob values are outside the range of 0 to 1. Review your prediction step.")
}

# Step 12.5: ROC Calculation
# If issues are addressed, recalculate ROC
if (valid_data_count > 0 && length(unique(validation_points$presence)) > 1) {
  # Check if the predicted_prob column is NULL or contains only NA values
  if (is.null(validation_points$predicted_prob) || all(is.na(validation_points$predicted_prob))) {
    print("The predicted_prob column is NULL or contains only NA values. Check data preparation steps.")
  } else {
    # Check if any predicted_prob values are outside the range of 0 to 1
    if (any(validation_points$predicted_prob < 0 | validation_points$predicted_prob > 1, na.rm = TRUE)) {
      print("Some predicted_prob values are outside the range of 0 to 1. Review your prediction step.")
    } else {
      # Perform ROC calculation
      roc_result <- roc(validation_points$presence, validation_points$predicted_prob, na.rm = TRUE)
      auc_value <- auc(roc_result)
      print(paste("AUC value:", auc_value))
      plot(roc_result, main="ROC Curve", col="#1c61b6")
    }
  }
} else {
  print("Cannot compute ROC due to insufficient or invalid data.")
}

# Step 12.6: Predictions Generation
# Ensure logistic regression model is loaded and accurate
if (exists("model")) {
  # Generate predicted probabilities using the logistic regression model
  validation_points$predicted_prob <- predict(model, newdata=validation_points, type="response")
  
  # Check if predictions were added successfully
  if (is.null(validation_points$predicted_prob)) {
    stop("Failed to generate predictions. Check model and data compatibility.")
  } else {
    print("Predictions generated successfully.")
  }
} else {
  stop("Model not found. Ensure your logistic regression model is loaded correctly.")
}

# Step 12.7: ROC Calculation Retry
# Retry ROC calculation
roc_result <- roc(validation_points$presence, validation_points$predicted_prob, na.rm = TRUE)
auc_value <- auc(roc_result)

# Print the AUC value
print(paste("AUC value:", auc_value))

# Plot the ROC curve
plot(roc_result, main="ROC Curve", col="#1c61b6", print.auc=TRUE)
legend("bottomright", legend=paste0("ROC Curve (AUC = ", round(as.numeric(auc_value), 2), ")"), col="#1c61b6", lty=1, cex=0.8)

Figure 23: AUC Calculation and Visualization

# Interpret AUC Value
# AUC Value of 0.771: This value is closer to 1 than to 0.5 and falls in the range generally considered acceptable, indicating that the model separates the positive and negative classes reasonably well.
# AUC Values: Generally, an AUC of 0.5 suggests no discrimination (no better than random chance), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and above 0.9 is outstanding.

# Reviewing the ROC Curve
# The ROC curve you plotted provides a visual representation of the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) at various threshold settings.
# The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.

# What is AUC and ROC?: The AUC measures the ability of the model to predict higher scores for actual positive occurrences than for negatives. The ROC curve helps visualize this by showing how the true positive rate 
# and false positive rate relate at various threshold settings.
# Model Performance: A good AUC value indicates that the model predictions are reliable and can effectively differentiate between sites with and without invasive species based on the environmental conditions modeled.
# Practical Application: The ROC and AUC are tools that help assess how well the environmental factors used in the model work together to predict species presence. This is crucial for conservation planning and management, 
# as it helps identify areas at high risk of invasion and aids in prioritizing areas for monitoring and intervention.

Figure 24: ROC Curve Analysis for Species Distribution Model
This figure displays the ROC curve, illustrating the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) across various threshold settings. An AUC value of 0.771 falls within the 0.7 to 0.8 range generally considered acceptable, suggesting the model distinguishes reasonably well between presence and absence of species. The closer the ROC curve lies to the top left corner of the plot, the better the model’s discrimination.

Understanding AUC and ROC in Model Evaluation

What is AUC and ROC?

Overview:
- AUC (Area Under the Curve): Measures the ability of the model to assign higher scores to actual positive occurrences than to negatives. It quantifies the overall ability of the model to discriminate between positive and negative classes.
- ROC Curve (Receiver Operating Characteristic Curve): Visualizes how the true positive rate (sensitivity) and the false positive rate (1 - specificity) relate at various threshold settings. This curve helps in assessing the trade-offs between sensitivity and specificity in different threshold scenarios.
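The ranking interpretation of AUC can be illustrated with a minimal base R sketch. The function name auc_rank and the toy labels and scores are hypothetical and are not taken from the project data. The sketch only shows that AUC equals the share of presence/absence pairs in which the presence point receives the higher score, with ties counted as half.

```r
# Hypothetical helper: AUC as the proportion of (presence, absence) pairs
# in which the presence point gets the higher score (ties count as 0.5)
auc_rank <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  pairwise <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(pairwise)
}

# Toy data (assumed values, for illustration only)
toy_labels <- c(1, 1, 1, 0, 0, 0)
toy_scores <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2)

toy_auc <- auc_rank(toy_labels, toy_scores)
print(toy_auc)
```

On real validation data, this pairwise value should agree with the AUC returned by the roc() and auc() calls in Step 12, since both measure the same ranking ability.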

Model Performance

Significance of AUC: - Reliability of Predictions: A good AUC value indicates that the model predictions are reliable and can effectively differentiate between sites with and without invasive species based on the environmental conditions modeled. It signifies the model’s effectiveness in distinguishing between the classes.

Practical Application

Role of ROC and AUC in Conservation:
- Assessment Tool: The ROC and AUC are critical tools that help assess how well the environmental factors used in the model work together to predict species presence.
- Conservation Planning: This analysis is crucial for conservation planning and management as it helps identify areas at high risk of invasion. Understanding these risks aids in prioritizing areas for monitoring and intervention, ensuring that conservation efforts are directed where they are most needed.

Summary: - These metrics not only highlight the model’s performance but also its practical implications in real-world ecological management and conservation strategies. The insights gained from ROC and AUC analysis guide decision-making in ecological conservation, enhancing the strategic allocation of resources and efforts in managing invasive species threats.

Extracting and Verifying validation_data

# ---------------- Step 13 Extract and Verify validation_data ---------- 
# ----------------- TROUBLESHOOTING INSIDE THIS SECTION ----------------

# Step 13.1: Extract Environmental Data
# Check if vermont_data is a RasterBrick
if (!inherits(vermont_data, "RasterBrick")) {
  stop("vermont_data is not a RasterBrick object")
}

# Check if validation_coords is correctly formatted
if (!is.matrix(validation_coords) && !inherits(validation_coords, "SpatialPoints")) {
  stop("validation_coords should be a matrix or SpatialPoints object")
}

# Extract environmental data from the raster based on the validation points using the raster package
validation_env_values <- raster::extract(vermont_data, validation_coords)

# Check extraction output
print(head(validation_env_values))

# Step 13.2: Correct Format of validation_points
# Extract the coordinates (longitude and latitude) from validation_points
coords <- validation_points[, 1:2]

# Now use these coordinates to extract environmental data using the raster package
validation_env_values <- raster::extract(vermont_data, coords)

# Step 13.3: Verify the Extraction
# Check the head of the extracted environmental values
print(head(validation_env_values))

# Step 13.4: Combine Data for Validation
# Combine the extracted environmental values with the presence/absence data
validation_data <- cbind(as.data.frame(validation_env_values), presence = validation_points$presence)

# Ensure the data frame is properly formatted for prediction
print(head(validation_data))




# What’s Happening?: Environmental characteristics (like temperature, precipitation, etc.) at specific geographic points are being matched with the locations where species have
# been observed or are expected to be absent. This combination creates a dataset that feeds into a predictive model, helping to understand where the species might thrive based on environmental conditions.
# Why It Matters?: For SDM, having accurate environmental data alongside occurrence data is crucial. The model relies on these inputs to discern patterns and predict 
# species distributions effectively. Errors in data extraction or formatting can lead to incorrect predictions, which could misguide conservation efforts.

# Practical Implications:
# Ensuring Data Integrity: The troubleshooting steps undertaken are essential in SDM workflows to ensure that the input data does not contain errors or inconsistencies, which could lead to flawed outputs.
# Readiness for Model Application: The cleaned and verified dataset is now ready for advanced analytical processes, such as fitting a MaxEnt model or other statistical models, to predict species distributions. 
# This step is a bridge between raw data collection and actionable ecological insights.
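The integrity check described in these notes can be sketched in base R as follows. The column names bio1 and bio12 and all values are assumptions chosen for illustration, not the project's extracted data. The sketch combines extracted values with presence labels, drops rows with missing environmental values, and confirms that both classes survive.

```r
# Hypothetical extracted environmental values; NA marks points that fell
# outside the raster or on cells without data
env_values <- data.frame(
  bio1  = c(6.1, NA, 7.4, 5.8),
  bio12 = c(1010, 980, NA, 1120)
)
presence <- c(1, 0, 1, 0)

# Combine predictors with the presence/absence labels
toy_validation <- cbind(env_values, presence = presence)

# Keep only rows where every column has a value
keep <- complete.cases(toy_validation)
toy_clean <- toy_validation[keep, ]

# Report how many rows were dropped and whether both classes remain
cat("Rows dropped:", sum(!keep), "\n")
cat("Classes remaining:", sort(unique(toy_clean$presence)), "\n")
```

Checking the class count after removing NAs matters because ROC and sensitivity/specificity calculations need at least one presence and one absence record.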

Predicted Probabilities Raster (Levelplot) for Species Occurrence in Vermont

# ---------------- Step 14 MaxEnt Model Predicted Probabilities ---------------- 

# Step 14.1: Predict Using the MaxEnt Model
# Predict using the MaxEnt model
predictions <- predict(maxent_model, vermont_data)  # This should create a raster layer of probabilities

# Plot the predicted probabilities raster
library(rasterVis)
levelplot(predictions, main = "MaxEnt Model Predicted Probabilities", col.regions = rev(terrain.colors(255)),
          xlab = "Longitude", ylab = "Latitude", colorkey = list(space = "top", 
                                                                 labels = list(at = seq(0, 1, by = 0.2),
                                                                               col = "black")),
          legend.width = 0.8, margin = FALSE)

Figure 25: Predicted Probabilities Raster (Levelplot) for Species Occurrence in Vermont

# Step 14.2: Extract Coordinate Data
# Extract the coordinates from validation_points
coords <- validation_points[, c("longitude", "latitude")]

# Convert the dataframe to a SpatialPointsDataFrame
coordinates(coords) <- ~longitude + latitude

# Set the CRS if known; here's an example using WGS84
proj4string(coords) <- CRS("+proj=longlat +datum=WGS84")

# Step 14.3: Verify the Spatial Object
# Check the structure to ensure it's now a SpatialPointsDataFrame
print(class(coords))
str(coords)

Figure 25: Predicted Probabilities Raster (Levelplot) for Species Occurrence in Vermont

Overview: This levelplot illustrates the predicted probabilities of species occurrence across Vermont, providing a detailed visual analysis of the geographical likelihood of species presence.

Visualization Details:
- Raster Cells: Each cell in the raster represents a specific geographic location.
- Color Indicators: The color of each cell indicates the probability of species occurrence. With the palette used in the plot (rev(terrain.colors(255))), green shades denote higher probabilities, suggesting areas with favorable conditions for species occurrence, while pale, near-white shades indicate lower probabilities, suggesting areas less likely to support the species.
- Legend and Scale: The color scale in the legend quantifies these probabilities, offering a clear visual guide to interpreting the spatial distribution of potential species habitats within the study area.

Implications for Conservation: - This visual tool assists in identifying critical areas for conservation efforts, focusing resources on regions with higher probabilities of species presence.
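One way to turn predicted probabilities into management categories is to bin them into risk classes. The sketch below is a minimal illustration using base R; the break points (0.33 and 0.66), the class labels, and the probability values are assumptions, not thresholds derived from this study.

```r
# Hypothetical predicted probabilities at five locations
toy_risk_probs <- c(0.05, 0.32, 0.58, 0.81, 0.97)

# Assumed break points separating low, medium, and high risk
risk_breaks <- c(0, 0.33, 0.66, 1)
risk_labels <- c("low", "medium", "high")

# cut() assigns each probability to its interval; include.lowest keeps 0 in "low"
risk_class <- cut(toy_risk_probs, breaks = risk_breaks,
                  labels = risk_labels, include.lowest = TRUE)

# Count locations per risk class
risk_counts <- table(risk_class)
print(risk_counts)
```

The same break points could be applied to the full prediction raster with a reclassification step, so that the map shows discrete risk zones instead of a continuous scale.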

Predicted Probabilities (Histogram)

# Load necessary library
library(raster)

# Now use these spatial points to extract the probability data from the raster
predicted_probs <- raster::extract(predictions, coords)

# Ensure the extracted data is in the correct format
predicted_probs <- as.numeric(predicted_probs)  # Convert list or matrix to numeric vector if necessary

# Step 14.4: Plot the Distribution of Predicted Probabilities
# Check the distribution of predicted probabilities.
# hist() sets the title and axis labels itself, so no separate title() or mtext() calls are needed.
hist(predicted_probs, main = "Distribution of Predicted Probabilities", xlab = "Probabilities", breaks = 30,
     col = "lightblue", border = "black", xlim = c(0, 1), ylim = c(0, 500), ylab = "Frequency")

Figure 26: Predicted Probabilities (Histogram)

# Predicted Probabilities Raster (Levelplot):
# The levelplot function creates a visual representation of the predicted probabilities across the study area (Vermont in this case).
# Each cell in the raster represents a geographic location, and the color of the cell indicates the probability of species occurrence at that location.
# Warmer colors typically represent higher probabilities, while cooler colors represent lower probabilities.
# The legend at the top of the plot provides a color scale, indicating the probability values corresponding to each color.

# Distribution of Predicted Probabilities (Histogram):
# The histogram represents the distribution of predicted probabilities across all the sampled locations (or points) in Vermont.
# The x-axis of the histogram represents the range of predicted probabilities, typically from 0 to 1, where 0 indicates very low probability and 1 indicates very high probability.
# The y-axis represents the frequency or count of occurrences within each probability range.
# The histogram provides insight into the variability and spread of predicted probabilities across the study area.
# For example, if the histogram is skewed towards higher probabilities, it suggests that the model predicts a higher likelihood of species occurrence in most locations. Conversely, if it's skewed towards lower probabilities, it suggests a lower likelihood of occurrence.

# Understanding Predictions: The levelplot provides a geographical visualization where each pixel/color intensity on the map correlates with the probability of species occurrence predicted by the model. It translates complex model outputs into an easily interpretable format.
# Analysis of Probability Values: The extracted probabilities were then plotted in a histogram, highlighting how these probabilities are distributed. This is crucial for understanding whether the model generally predicts a higher or lower likelihood of occurrence across the region.
# Practical Implications: For conservationists, these visualizations and statistical summaries offer a direct insight into areas that might require more attention, either because they are likely habitats for invasive species (higher probabilities) or areas that currently pose less of a risk (lower probabilities).
# Logical Interpretations:
# Model Evaluation: By visually and statistically analyzing the predicted probabilities, we can assess the MaxEnt model's performance. For instance, a well-performing model should ideally show higher probabilities in known areas of species presence and lower probabilities elsewhere.
# Conservation Planning: The spatial distribution map (levelplot) and the histogram together inform where invasive species might spread. This helps in planning control measures, monitoring programs, and conservation strategies more effectively, targeting areas predicted to have higher probabilities of species presence.

Figure 26: Distribution of Predicted Probabilities (Histogram) for Species Occurrence in Vermont

Overview:
This histogram represents the distribution of predicted probabilities of species occurrence across all sampled locations in Vermont. The visualization provides a clear view of the probability ranges, offering insights into the likelihood of species presence across different areas.

Visualization Details:
- X-axis: Categorizes the range of predicted probabilities from 0 (indicating very low probability) to 1 (indicating very high probability).
- Y-axis: Shows the frequency or count of occurrences within each probability range.
- Color Coding: All bars share a single fill color (light blue); color does not encode probability in this plot.

Implications for Conservation:
- Analytical Insight: The histogram helps in understanding the variability and spread of predicted probabilities, offering clues about areas that may require more focused conservation efforts.
- Strategic Planning: Identifying areas with higher probabilities can guide conservationists and resource managers in prioritizing regions for intervention and monitoring, enhancing the effectiveness of conservation strategies.

In-Depth Analysis of Species Distribution Predictions

Understanding Predictions

Geographical Visualization: - Levelplot Usage: The levelplot provides a geographical visualization where each pixel or color intensity on the map correlates with the probability of species occurrence predicted by the model. It translates complex model outputs into an easily interpretable format, making it accessible for stakeholders to understand ecological risks.

Analysis of Probability Values

Histogram Insights:
- Distribution Analysis: The extracted probabilities were plotted in a histogram to highlight how they are distributed across the region. This analysis is crucial for understanding whether the model generally predicts a higher or lower likelihood of occurrence.
- Implications: This allows researchers and conservationists to gauge the overall effectiveness of the model in capturing the reality of species distribution across different environments.
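The histogram reading can be backed by a few summary numbers. The following is a minimal sketch with hypothetical probabilities. Comparing the mean with the median is used here only as a rough indicator of skew, and the 0.5 cut-off is an assumption.

```r
# Hypothetical predicted probabilities at eight sampled locations
toy_hist_probs <- c(0.02, 0.05, 0.08, 0.10, 0.15, 0.40, 0.75, 0.90)

# Central tendency of the predictions
prob_mean <- mean(toy_hist_probs)
prob_median <- median(toy_hist_probs)

# Share of locations predicted above an assumed cut-off of 0.5
share_above_half <- mean(toy_hist_probs > 0.5)

# A mean above the median is a rough sign that most values are low,
# with a tail of high-probability locations
mostly_low <- prob_mean > prob_median

cat("Mean:", prob_mean, " Median:", prob_median, "\n")
cat("Share above 0.5:", share_above_half, "\n")
```

Reporting these values alongside the histogram makes it easier to compare model runs without relying on visual inspection alone.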

Practical Implications

Conservation Impact: - Direct Insight: For conservationists, these visualizations and statistical summaries offer direct insight into areas that might require more attention, either because they are likely habitats for invasive species (indicated by higher probabilities) or areas that currently pose less of a risk (indicated by lower probabilities).

Logical Interpretations

Model Evaluation and Conservation Planning:
- Model Performance: By visually and statistically analyzing the predicted probabilities, we can assess the MaxEnt model’s performance. A well-performing model should ideally show higher probabilities in known areas of species presence and lower probabilities elsewhere.
- Strategic Conservation Planning: The spatial distribution map (levelplot) and the histogram together inform where invasive species might spread. This critical information helps in planning control measures, monitoring programs, and conservation strategies more effectively, targeting areas predicted to have higher probabilities of species presence.
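The expectation that a well-performing model assigns higher probabilities at known presences than elsewhere can be checked directly. This is a minimal sketch with a hypothetical data frame (check_df) and made-up values; it compares the mean predicted probability between presence and absence points.

```r
# Hypothetical validation table: presence labels and predicted probabilities
check_df <- data.frame(
  presence       = c(1, 1, 1, 0, 0, 0),
  predicted_prob = c(0.82, 0.67, 0.45, 0.38, 0.21, 0.10)
)

# Mean predicted probability per class; names are "0" (absence) and "1" (presence)
group_means <- tapply(check_df$predicted_prob, check_df$presence, mean)
print(group_means)

# TRUE when presences receive higher predictions on average
presence_higher <- group_means[["1"]] > group_means[["0"]]
print(presence_higher)
```

A small or negative gap between the two means would be an early warning that the predictions and the validation points are misaligned, before any threshold analysis is run.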

True Skill Statistics (TSS) and Sensitivity and Specificity Analysis of the MaxEnt Model

# ------------------------- Step 15 TSS --------------------------------
# True Skill Statistics (TSS)

# 'vermont_data' contains the data used to train the MaxEnt model:
model_vars <- names(vermont_data)

# Print the names of the variables used in the model
print(model_vars)


# Step 15.1 Data Verification and Preparation
# Extract the presence data from the validation_points dataframe
# We expect 'presence' to be a column in validation_points, with 1 indicating presence and 0 indicating absence
presence_data <- validation_points$presence

# Verify the presence data
print(head(presence_data))

# Step 15.2 Sensitivity and Specificity Calculation
# Define a function to calculate sensitivity and specificity
calculate_stats <- function(threshold, probs, actual) {
  predicted <- ifelse(probs > threshold, 1, 0)
  cm <- table(Predicted = factor(predicted, levels = c(0, 1)), 
              Actual = factor(actual, levels = c(0, 1)))
  
  # Calculate sensitivity and specificity with safety checks
  sensitivity <- ifelse(sum(actual == 1) > 0, cm["1", "1"] / sum(cm[, "1"]), 0)
  specificity <- ifelse(sum(actual == 0) > 0, cm["0", "0"] / sum(cm[, "0"]), 0)
  
  return(c(sensitivity = sensitivity, specificity = specificity))
}

# Step 15.3 Sensitivity and Specificity Calculation Across Thresholds

# Evaluate sensitivity and specificity for different threshold values
thresholds <- seq(0.03, 1, by = 0.01)
stats <- sapply(thresholds, calculate_stats, probs = predicted_probs, actual = presence_data)

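# Find the threshold that maximizes the sum of sensitivity and specificity.
# stats has one column per threshold, so the metrics are summed down each column (colSums);
# rowSums would instead add each metric across all thresholds and return index 1 or 2.
max_index <- which.max(colSums(stats))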
optimal_threshold <- thresholds[max_index]

# Print the results
cat("Optimal threshold:", optimal_threshold, "\n")
cat("Sensitivity at optimal threshold:", stats[1, max_index], "\n")
cat("Specificity at optimal threshold:", stats[2, max_index], "\n")

# Step 15.4 Visualization of Sensitivity and Specificity
plot(thresholds, stats[1,], type='l', col='blue', xlab='Threshold', ylab='Metric Value', main='Sensitivity and Specificity by Threshold')
lines(thresholds, stats[2,], col='red')
legend("bottomright", legend=c("Sensitivity", "Specificity"), col=c("blue", "red"), lty=1, title="Metrics")

Figure 27: Sensitivity and Specificity Calculation Across Thresholds

#  Note: The plotted graph will visually guide the selection of a new threshold value by showing
# the trade-off between sensitivity and specificity across a range of possible values.

# Interpretation of Results and Considerations for Next Steps

# Sensitivity of 0:
# This result indicates that the model, at the threshold of 0.03, fails to correctly identify any true positives.
# Essentially, no presence point in the validation data is predicted as present. Because 0.03 is the lowest threshold tested
# and sensitivity normally rises as the threshold is lowered, this suggests that the predicted probabilities at presence points
# are all at or below 0.03 (or missing), so the extraction and prediction steps should be rechecked.

# Specificity of 1:
# This outcome shows that the model perfectly identifies all true negatives at this threshold. It correctly predicts
# all absence cases without any false positives, indicating high specificity.

# Understanding Model Performance: The process revolves around evaluating how well the MaxEnt model distinguishes between areas where species are likely to be present versus areas where they are not. 
# Sensitivity measures the model's ability to identify true presences, while specificity measures its ability to recognize true absences.
# Optimal Threshold Identification: The analysis aimed to pinpoint a threshold that offers the best trade-off between sensitivity and specificity. 
# However, the reported values at this threshold (sensitivity of 0, specificity of 1) show that while the model correctly identifies absences, it fails to detect presences.
# Practical Implications and Adjustments: A sensitivity of 0 with a specificity of 1 suggests that the model is under-predicting the presence of species at this threshold. 
# This could leave areas at real risk of invasion unflagged. Adjustments in the model or its threshold setting might be necessary to achieve a more balanced outcome.
# Logical Interpretations:
# Model Calibration Needs: The results prompt a reconsideration of the model or its application thresholds to ensure that predictions are not only specific but also sensitive. 
# This involves potentially retraining the model or adjusting the threshold so that predictions balance detection of presences against false alarms.

Figure 27: Trade-off Between Sensitivity and Specificity at Various Thresholds

Overview:
This figure illustrates the trade-off between sensitivity and specificity across a range of threshold values for the MaxEnt model used in predicting species presence. The graph guides the selection of an optimal threshold by visually representing the balance between identifying true positives and true negatives.

Analysis Highlights:
- Sensitivity Analysis: At a threshold of 0.03, the sensitivity is 0, indicating the model’s failure to correctly identify any true positives, suggesting an overly conservative approach that results in most sites being predicted as absent.
- Specificity Analysis: Conversely, the specificity at this threshold is 1, demonstrating that the model identifies all true negatives without any false positives, highlighting its accuracy in predicting non-occurrences.

Implications for Model Adjustment:
- Model Calibration: The depicted sensitivity-specificity trade-off is critical for calibrating the model, as it assists in selecting a threshold that optimally balances the detection of true positives and negatives.
- Decision Support: This visualization is instrumental for researchers and practitioners in refining the predictive model, ensuring that it provides reliable and actionable insights for ecological management and conservation planning.

Understanding Model Performance in MaxEnt Modeling

Evaluating Sensitivity and Specificity

Model’s Discriminatory Power:
- Purpose: The evaluation revolves around assessing how well the MaxEnt model distinguishes between areas where species are likely to be present versus areas where they are not.
- Key Metrics:
  - Sensitivity: Measures the model’s ability to correctly identify true presences of species.
  - Specificity: Measures the model’s ability to correctly identify true absences.

Optimal Threshold Identification

Threshold Analysis:
- Optimal Trade-off: The analysis aimed to identify a threshold that offers the best balance between sensitivity and specificity.
- Observations: The reported values at this threshold (sensitivity of 0 and specificity of 1) indicate that the model, while effective at identifying absences, fails to detect presences.
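The True Skill Statistic combines the two metrics as TSS = sensitivity + specificity - 1, so the threshold that maximizes the sum of sensitivity and specificity also maximizes TSS. The sketch below illustrates the calculation on hypothetical labels and probabilities. The names sens_spec, actual_toy, probs_toy, and thresholds_toy are illustrative and not part of the project code.

```r
# Hypothetical validation labels (1 = presence, 0 = absence) and predictions
actual_toy <- c(1, 1, 1, 1, 0, 0, 0, 0)
probs_toy  <- c(0.9, 0.8, 0.6, 0.5, 0.7, 0.4, 0.2, 0.1)

# Sensitivity and specificity at a single threshold
sens_spec <- function(threshold, probs, actual) {
  predicted <- as.integer(probs > threshold)
  c(sensitivity = mean(predicted[actual == 1] == 1),
    specificity = mean(predicted[actual == 0] == 0))
}

# One column of results per threshold (rows: sensitivity, specificity)
thresholds_toy <- seq(0.05, 0.95, by = 0.1)
stats_toy <- sapply(thresholds_toy, sens_spec,
                    probs = probs_toy, actual = actual_toy)

# TSS per threshold: sum down each column, then subtract 1
tss <- colSums(stats_toy) - 1
best <- which.max(tss)

cat("Best threshold:", thresholds_toy[best], "\n")
cat("TSS at best threshold:", tss[best], "\n")
```

TSS ranges from -1 to 1, with values near 0 indicating performance no better than chance, which makes it a convenient single number to report next to the sensitivity and specificity curves.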

Practical Implications and Adjustments

Adjustment Needs:
- Under-Prediction Issue: A sensitivity of 0 coupled with a specificity of 1 at the selected threshold suggests that the model may be under-predicting the presence of species. This could leave areas at real risk of invasion without monitoring or intervention.
- Balancing Act: Adjustments in the model or its threshold settings might be necessary to achieve a more balanced outcome, improving the detection of species presence while keeping false alarms low.

Logical Interpretations and Model Calibration

Calibration Requirements:
- Reevaluation: The results prompt a reconsideration of the model or its application thresholds to ensure that predictions are both sensitive and specific.
- Model Retraining: This may involve retraining the model or adjusting the threshold so that predictions balance the detection of presences against false positives.

Strategic Impact: - Conservation Strategy Optimization: Accurate model calibration is crucial for optimizing conservation strategies, ensuring that efforts are targeted and effective, thereby enhancing ecological management and planning.

Results and Limitations

Results (Highlights from both RMarkdown Reports)

Results and Discussion From the Spatial Distribution of Invasive Species in Vermont

Results

I successfully mapped the distribution of various invasive species across Vermont, focusing on key species such as the Emerald Ash Borer and Hemlock Woolly Adelgid. Each species was marked with distinct colors, aiding in rapid visual assessment (Figure 2).

Examples from Figure 2:

Emerald Ash Borer: Dense clusters observed in the northeastern regions of Vermont.
Hemlock Woolly Adelgid: Predominantly found in the central and southern parts of the state.

Discussion

The visualization in Figure 2 effectively met my objective of identifying the spread and density of invasive species, allowing for a clear understanding of high-risk areas. This aids in ecological management by highlighting critical areas needing immediate attention, thus supporting my decision-making processes regarding conservation strategies.

Research Objective: Detailed Spatial Analysis within Niquette Bay State Park

Results

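My focused spatial analysis within Niquette Bay State Park found no recorded occurrences of invasive species (Figure 5). This is consistent with effective conservation efforts, although the limited observational data available for the park means that sparse sampling cannot be ruled out.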

Example from Figure 5:

Park Health: The map detailed park boundaries and used different color codes to indicate the absence of key invasive species, providing clear visual confirmation of the park’s ecological health.

Discussion

Figure 5’s targeted analysis provided essential insights into the effectiveness of the conservation strategies employed within Niquette Bay State Park. This meets the objective of offering a localized ecological assessment crucial for ongoing park management and preventive strategies against potential future invasions.

Research Objective: Statewide Analysis of Invasive Species

Results

The expansion to a statewide analysis highlighted the presence of invasive species across Vermont, with detailed mappings of their specific locations and concentrations (Figure 8).

Examples from Figure 8:

Emerald Ash Borer: The map illustrates various hotspots around major forested areas.
Hemlock Woolly Adelgid: Clusters identified, particularly near water bodies.

Discussion

The comprehensive statewide visualization provided in Figure 8 aligns with my research objective to assess the broader impact of invasive species across Vermont. It underscores the need for strategic planning and targeted interventions, offering a macroscopic view essential for policymaking and resource allocation.

Tools and Methods Used

Spatial Clustering with DBSCAN and K-Means (Figures 12 and 13): These techniques helped me identify high-density clusters of invasive species occurrences. Figure 12 presents the results of DBSCAN clustering, showing concentrated areas of invasive species, while Figure 13 uses K-Means to further delineate these clusters into distinct regions based on species concentrations.

Elbow Method (Figure 14): I employed this method to determine the optimal number of clusters for the K-Means algorithm. The elbow plot in Figure 14 shows how within-cluster variation falls as the number of clusters increases; the point where the decrease levels off (the “elbow”) indicates a suitable number of clusters.
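The elbow method can be sketched with base R's kmeans() as follows. The simulated coordinates, the three cluster centers, and the range of k values are assumptions for illustration, not the project's occurrence data. Treating longitude and latitude as planar coordinates is also a simplification.

```r
set.seed(42)

# Simulate three hypothetical occurrence clusters (longitude, latitude)
toy_centers <- rbind(c(-73.2, 44.5), c(-72.6, 43.9), c(-72.0, 44.8))
toy_points <- do.call(rbind, lapply(1:3, function(i) {
  cbind(lon = rnorm(20, mean = toy_centers[i, 1], sd = 0.05),
        lat = rnorm(20, mean = toy_centers[i, 2], sd = 0.05))
}))

# Total within-cluster sum of squares for k = 1 to 6;
# nstart = 10 repeats each fit to reduce the effect of random starts
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_points, centers = k, nstart = 10)$tot.withinss
})

# The drop in wss flattens once k reaches the true number of clusters
print(round(wss, 3))
```

Plotting wss against k_values (for example with plot(k_values, wss, type = "b")) gives the elbow curve; in this simulated case the bend should appear at k = 3.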

Results and Discussion From the Spatial Analysis and Transition to Species Distribution Modeling

This section presents the findings from the spatial analysis and species distribution modeling (SDM) I conducted for invasive species across Vermont. I discuss each result in relation to the research objectives set at the beginning of this project.

Research Objective: Integrate Environmental and Biological Data

Results

I successfully integrated environmental data with the occurrence records of invasive species. Environmental layers from the WorldClim database, including temperature and precipitation, were aligned with the geo-tagged occurrences of species such as the Emerald Ash Borer and Hemlock Woolly Adelgid.

Figure 17: Cropped Environmental Data for Vermont

This figure illustrates the processed climatic data cropped to the geographical extents of Vermont, showing key environmental variables that affect species distribution.

Discussion

The successful integration of these datasets allowed me to develop a comprehensive environmental profile for Vermont. This profile is crucial for understanding how different environmental conditions influence the distribution of invasive species. The results confirm that my methodology is effective in merging diverse data types to prepare for more detailed SDM, thereby meeting the first objective.

Research Objective: Model Development and Validation

Results

I developed Species Distribution Models, using MaxEnt and logistic regression, to predict potential distributions of key invasive species. Validation with Receiver Operating Characteristic (ROC) curves gave an Area Under the Curve (AUC) of 0.771, which falls within the 0.7 to 0.8 range generally considered acceptable.

Figure 22: Train Logistic Regression Model- GLM

This figure demonstrates the logistic regression model’s ability to predict species presence based on environmental factors, highlighting the influence of various climatic variables on species distribution.

Discussion

The development and validation of the SDMs indicate that the models provide a usable basis for forecasting species distributions under current and future environmental scenarios. This aligns with my second objective of producing models that are both scientifically sound and practically relevant.

Research Objective: Visualization and Interpretation

Results

I visualized the final SDMs to show the probability of species occurrence across Vermont, identifying potential hotspots and areas of low risk.

Figure 25: Predicted Probabilities Raster (Levelplot) for Species Occurrence in Vermont

The levelplot provides a clear, color-coded representation of the likelihood of species occurrence, which helps in visualizing the geographical spread and concentration areas of invasive species.

Discussion

The visual outputs have been instrumental in interpreting the complex results of my SDMs, providing accessible and actionable insights into potential species spread. These visualizations directly support conservation efforts and policy-making by highlighting critical areas for intervention. The success in creating these interpretable maps and charts confirms that I have achieved my third objective.

Conclusion

This study has demonstrated how integrating spatial analysis with species distribution modeling can address the challenge of managing invasive species across Vermont. The mapping and clustering analyses met the research objectives and provided a detailed picture of the spatial distribution of invasive species within the state. The referenced figures substantiate the textual findings and make the outcomes easier to interpret. These insights support informed conservation strategies by highlighting critical areas for targeted intervention, which is essential for effective ecological management and resource allocation. The maps and predictive models developed through this research can serve as tools for environmental managers and policymakers deciding how to mitigate the impact of invasive species. By achieving these objectives, I have laid a foundation for ongoing research and practical conservation initiatives and established a benchmark for similar ecological studies, supporting the long-term health and resilience of Vermont’s ecosystems.

Limitations

Data Quality and Completeness

Challenges

Scope Expansion: The original scope of the project was expanded to include the entire state of Vermont due to limited availability of observational data from Niquette Bay State Park.
Data Sources: Reliability and completeness of existing datasets from sources such as iNaturalist Vermont and the Vermont Invasive Species Database significantly impact the accuracy and depth of the analysis.

Impact

Data Integrity Issues: Inaccurate, incomplete, or outdated data can skew results, limit the effectiveness of species distribution models (SDMs), and lead to potentially unreliable conclusions.

Model Accuracy and Assumptions

Technical Limitations

Dependency on Data Quality: The success of SDMs, such as MaxEnt and logistic regression, depends heavily on both the quality of the input data and the accurate setting of model parameters.
Risks of Inaccuracy: Incorrect parameter settings or faulty assumptions can lead to inaccurate predictions of species distributions.

Practical Implications

Impact on Management Strategies: Misestimations or oversimplifications in the model’s assumptions can mislead management strategies and conservation actions.

Dynamic and Complex Ecosystems

Ecosystem Variability

Static Model Limitations: The study’s static models may not fully capture the dynamic and fluctuating nature of ecosystems, including seasonal variation and species interactions.

Long-term Predictions

Challenges in Predictions: Because static models cannot capture ongoing ecological dynamics, long-term predictions and management strategies are difficult to formulate.

Generalization of Results

Local vs. Statewide Application

Context-Specific Accuracy: The results of the statewide analysis may not accurately reflect the unique ecological conditions of specific areas in Vermont, which could limit how well the results apply at the local scale.

Scalability

Adjustments Needed: The unique conditions of Niquette Bay State Park, on which the study originally focused, may not be representative of other regions, so adjustments may be needed to apply the results elsewhere.

Scalability and Adaptability of Management Strategies

Strategy Adaptation

Translating Predictions: Translating model predictions into actionable conservation strategies is challenging and requires flexibility to adapt to new science and changing ecological conditions.

Allocation of Resources

Implementation Challenges: Effective implementation of management strategies depends on the practical applicability of recommendations, which is influenced by logistical, financial, and political factors.

Technological and Methodological Constraints

Tools and Techniques

Technical Limitations: The use of advanced statistical modeling and spatial data analysis tools such as RStudio and GIS entails limitations related to their technical capabilities and the methodological approaches used.

Data Handling

Processing Limitations: There may be limitations in data processing capabilities and the choice of clustering algorithms or other spatial analysis techniques that affect the results of the study.

Continuous Data Collection and Model Updating

Ongoing Monitoring

Dynamic Nature: The dynamic nature of invasive species distribution requires continuous data collection and regular model updates to ensure the relevance and accuracy of predictions.

Resource Intensive

Funding and Expertise: Updating models and ongoing monitoring can be resource-intensive and require ongoing funding and technical expertise.

Sensitivity and Specificity of Models

Model Calibration

Balancing Challenges: Achieving an optimal balance between sensitivity (correct detection of occurrences) and specificity (correct detection of absences) in model predictions is challenging. The trade-offs involved can lead to over- or under-prediction, which affects resource allocation and conservation efforts.
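
The trade-off described above can be illustrated with a minimal sketch that converts predicted probabilities into presence or absence calls at several cut-offs and reports sensitivity and specificity for each. It is written in plain Python for illustration only and is not the R code used in this study. The function name, the example data, and the three thresholds are hypothetical assumptions, not values from the Vermont models.

```python
def sensitivity_specificity(labels, probs, threshold):
    """Return (sensitivity, specificity) for a given probability cut-off.

    labels: 1 for observed presence, 0 for observed absence.
    probs: predicted probability of presence for each record.
    A record is classified as a presence when its probability >= threshold.
    """
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probs):
        predicted_presence = p >= threshold
        if y == 1 and predicted_presence:
            tp += 1  # presence correctly detected
        elif y == 1:
            fn += 1  # presence missed
        elif predicted_presence:
            fp += 1  # absence wrongly flagged as presence
        else:
            tn += 1  # absence correctly detected
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one presence and one absence record")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical records: 4 presences followed by 4 absences.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(labels, probs, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

In this toy data set, lowering the cut-off raises sensitivity at the cost of specificity (over-prediction), while raising it does the reverse (under-prediction), which is the balance a manager has to weigh when allocating survey or treatment resources.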

Project Reflection: Spatial Analysis of Invasive Species in Vermont

Reflecting on the “Spatial Analysis of Invasive Species in Vermont” project, I recognize a mix of strengths and weaknesses that shaped the outcomes of this endeavor.

Project Strengths

Comprehensive Scope

Broad Analysis: The initial focus on Niquette Bay State Park, followed by an expansion to the entire state of Vermont, allowed for a comprehensive analysis.
Diverse Data Sources and Techniques: By utilizing various data sources and modeling techniques, including Species Distribution Models (SDM) and clustering algorithms, I was able to create robust models. The integration of different types of data, from iNaturalist observations to WorldClim grids, enabled me to make insightful ecological predictions.

Challenges Encountered

Data Management

Complexity in Documentation: Managing the volume and complexity of data within a single RMarkdown document was challenging. This highlighted the need for better data management strategies or more dynamic and scalable tools. For instance, linking R scripts to external RMarkdown files for reporting could be an alternative, although I am still exploring whether there is a more efficient solution than creating two separate RMarkdown documents.

Geospatial Data Handling

Technical Difficulties: I faced initial difficulties in processing raster data, which underscored the importance of familiarizing myself with geospatial data handling and troubleshooting techniques early in the project.

Lessons Learned and Future Improvements

Looking back, if I were to tackle this project again, I would take a slightly different approach:

Define Project Scope Clearly

Focused Approach: Clearly defining the scope from the beginning would help prevent being overwhelmed by the data and analysis possibilities. A more focused approach would ensure that each section of analysis within the project is thoroughly explored and managed more effectively.

Enhanced Data Management

Tool Optimization: Exploring more sophisticated data management and analysis tools could streamline the process, possibly integrating advanced software solutions or modularizing the analysis using scripts that feed into a main reporting document.

Strengthen Technical Skills

Geospatial Proficiency: Investing time to enhance skills in geospatial data analysis before initiating the project would likely reduce technical challenges and improve the efficiency of data processing.

Conclusion

The “Spatial Analysis of Invasive Species in Vermont” project was both challenging and enlightening. It provided valuable insights into ecological modeling and the complexity of managing diverse data sets. By applying the lessons learned to future projects, I aim to refine my approach, ensuring more structured and efficient analysis and making even more meaningful ecological contributions.

Sources

Audubon Vermont. Audubon Vermont. Accessed 8 Apr. 2024.
Bivand, R., Pebesma, E., & Gomez-Rubio, V. (2013). Applied Spatial Data Analysis with R, Second edition. Springer, NY. https://asdar-book.org/.
Dowle, M., & Srinivasan, A. (2019). data.table: Extension of data.frame. R package version 1.12.2. CRAN.
Earth Science Data Systems, NASA. (2021, October 21). Find Data. NASA. NASA Earthdata.
Elith, J., & Leathwick, J. R. (2009). Species Distribution Models: Ecological Explanation and Prediction Across Space and Time. Annual Review of Ecology, Evolution, and Systematics, 40, 677-697. DOI.
Franklin, J. (2009). Mapping Species Distributions: Spatial Inference and Prediction. Cambridge University Press.
GBIF. Global Biodiversity Information Facility. Accessed 8 Apr. 2024.
Hijmans, R.J., & Elith, J. (n.d.). Species distribution modeling with R. R Spatial. R Spatial.
Integrated Development Environment. RStudio. RStudio IDE, 26 Feb. 2018.
Invasive Species. Vermont Open Geodata Portal. Vermont Invasive Species. Accessed 8 Apr. 2024.
National Centers for Environmental Information (NCEI). “Climate Data Online.” NCEI CDO Web. Accessed 8 Apr. 2024.
National Invasive Species Information Center (NISIC). Invasive Species Info. Accessed 8 Apr. 2024.
National Oceanic and Atmospheric Administration. NOAA. Accessed 8 Apr. 2024.
Oliver, J. (2023). A Very Brief Introduction to Species Distribution Models in R. [Tutorial document].
Pebesma, E. (2018). sf: Simple Features for R. R package version 0.7-4. CRAN sf.
RStudio Team. (2020). RStudio: Integrated Development Environment for R. RStudio, PBC. http://www.rstudio.com/.
The Comprehensive R Archive Network. CRAN. Accessed 8 Apr. 2024.
Tidyverse. “Tidyverse/DPLYR: Dplyr: A Grammar of Data Manipulation.” GitHub. GitHub Dplyr. Accessed 8 Apr. 2024.
Unlocking the Power of Science to Guide Biodiversity Conservation. NatureServe. NatureServe. Accessed 8 Apr. 2024.
Vermont Open Geodata Portal Your Source for Geospatial Data. Vermont Geodata. Accessed 8 Apr. 2024.
Vermont, US. iNaturalist. iNaturalist Vermont. Accessed 8 Apr. 2024.
Wickham, H., et al. (2019). dplyr: A Grammar of Data Manipulation. R package version 0.8.3. CRAN Dplyr.
Wickham, H. (2019). tidyr: Tidy Messy Data. R package version 1.0.0. CRAN tidyr.
WorldClim. WorldClim. Accessed 8 Apr. 2024.

Addressed Comments from Rough Draft Review

Research Questions and Objectives

Enhanced Clarity: Initially, my study objectives were somewhat dispersed throughout the document. Based on feedback, I consolidated them and clearly stated them in the Introduction to enhance clarity and focus.

Data Section Improvements

Detailed Data Description: Feedback indicated that the description of data sources in my rough draft lacked clarity, particularly regarding when and how certain datasets were used. To address this, I included a more detailed breakdown of the data sources, elaborating on their relevance and specific use cases within the study.

Methodological Clarity

Narrative Methodological Descriptions: The feedback pointed out that the methodological steps needed clearer explanations. I transformed what were previously bullet points into narrative descriptions, providing a detailed account of each methodological stage. This not only makes the document more engaging but also more informative.

Visualization and Tools

Specifying Tools: There was confusion about the use of the term “visual tools” in the rough draft. In the final draft, I specified the tools used—GIS and RStudio—and detailed their application in the analysis, directly addressing the need for specificity as suggested in the feedback.

Writing and Structural Adjustments

Improved Logical Flow: The feedback highlighted issues with repetitive statements and unclear connections between sections. I streamlined the writing to eliminate redundancy and improve the logical flow, ensuring each section builds upon the previous one.

Inclusion of Previous Studies

Integration of Relevant Literature: To address feedback suggesting the inclusion of references to previous studies, I incorporated citations of relevant literature and studies, particularly those utilizing similar methodologies or investigating similar ecological phenomena. This helps establish a research gap and positions my study within the broader academic context.

Figure Captions and Discussion

Enhanced Visual Descriptions: The lack of informative figure captions and adequate discussion of figures and maps in the text was highlighted in the feedback. I addressed this by including detailed captions for all visuals and integrating these figures into the text discussion to effectively support the narrative.

Addressing Peer and Proposal Comments

Comprehensive Response to Feedback: The importance of addressing all peer and proposal comments was emphasized in the feedback. I took this into consideration and integrated changes and justifications into the text where necessary, ensuring that all comments were thoughtfully addressed.

Spatial Analysis of Invasive Species Across Vermont

M.Golub

2024-04-13

Title of Project: Spatial Analysis of Invasive Species Across Vermont

Abstract

Introduction and Background

Introduction

Background

Research Question

Data Overview

Species Observation Data

Environmental Variables Data

Data Preparation and Analysis Tools

Additional Resources for Validation and Enhancement

Methodology Overview

Methodology Approach Overview

Tools and Processes

Recognize Limitations

Spatial Distribution of Invasive Species Across Vermont

Figure 2: Spatial Analysis of Invasive Species Spread in Vermont

Summary of Spatial Analysis Workflow

Exploratory Data Analysis (EDA)

From Data Acquisition to Exploratory Data Analysis (EDA)

Filter species data for specific invasive species and plot their distribution within Vermont

Figure 3: Geographical Distribution of Emerald Ash Borer and Hemlock Woolly Adelgid in Vermont

What This Figure Represents:

Why This Is Important:

Distribution of Emerald Ash Borer and Hemlock Woolly Adelgid in Vermont

Figure 4: Invasive Species Observations within Niquette Bay State Park

Comparison with Previous Step (Step 9)

Mapping Invasive Species Observations within the Park Area

Figure 5: Invasive Species Observations in Niquette Bay State Park

Filtering and Mapping Invasive Species Observations within the Park Area

Analysis Summary for Filtering and Mapping Invasive Species Observations within the Park Area

Analyzing Invasive Species Distribution within Park Boundaries

Figure 6: Analysis of Invasive Species Distribution within Niquette Bay State Park

Final Confirmation: Absence of Invasive Species

Figure 7: Invasive Species Management Planning within Niquette Bay State Park

Data Check for Invasive Species, Figure 7 above

Analyzing Invasive Species Distribution in Vermont

Figure 8: Distribution of Invasive Species in Vermont

Figure 9: Spatial Distribution of All Recorded Invasive Species Observations Across Vermont

Summary of Analysis of Invasive Species in Vermont

Checking and Visualizing Invasive Species Data within Vermont Boundaries

Visualizing the Distribution of All Recorded Invasive Species Observations Across Vermont

Contextual Expansion of Analysis Beyond Niquette Bay State Park

Implications for Policy and Management

Importance of Repeated Analysis

Filtering and Analyzing Target Invasive Species Data Across Vermont

Figure 10: Impact of Invasive Species Distribution in Vermont

Discussion on Figure 10: Strategic Implications

Discussion on Figure 10: Future Considerations

Comprehensive Analysis of Invasive Species Distribution and Environmental Impacts in Vermont

Discussion on Detailed Analysis of Species Prevalence and Impact

Identify and Visualize Top Two Most Common Invasive Species

Figure 11: Distribution Discrepancy Analysis of Invasive Species in Vermont

Implications for Research and Management

Conclusion

Transition from Hotspot Identification to Clustering with DBSCAN, Elbow, and K-means

From Hotspot Identification:

Introduction to Clustering Techniques:

DBSCAN:

Elbow Method:

K-means Clustering:

Rationale for Transition:

Spatial Clustering Analysis

DBSCAN Clustering of Observations

Figure 12: Visualization of DBSCAN Clustering Results

Spatial Clustering Analysis of Invasive Species Occurrences in Vermont

Objective

Cluster Map Creation

Clustering Insights

Management Implications

Importance of Recognizing Clusters

Summary

What is DBSCAN?

The Plot’s Message

Why This Matters

Practicality of Findings