This R Markdown document is designed to guide the development and implementation of Species Distribution Models (SDMs) for invasive species in Vermont. Leveraging environmental data alongside recorded occurrences, we aim to predict the potential distribution of these species under current and future environmental conditions. This work is pivotal for informing strategic conservation efforts and managing invasive species effectively.
Following a comprehensive exploratory data analysis that included spatial patterning and hotspot identification, our analysis now incorporates predictive modeling techniques. We use raster-based climatic data, including variables such as temperature and precipitation, which are crucial for understanding species-environment interactions.
The predictive models developed will help delineate areas potentially vulnerable to invasive species under varying climatic scenarios. These models not only serve to enhance our ecological understanding but also support policymakers and conservationists in devising effective management strategies.
Visual Design: - The color spectrum in the visualization transitions smoothly from delicate whites and pinks to vibrant oranges, vivid yellows, and deep greens. - This palette represents a gradient of probabilities, indicating the likelihood of species presence across different areas.
Functional Insight: - The regions highlighted in bright green suggest a higher probability of species presence, pointing conservationists to potential hotspots. - While higher probabilities highlighted by green tones do not guarantee the presence of species, they provide essential clues for further scientific investigation and resource allocation.
Strategic Application: - This color-coded approach serves as a critical tool for ecological analysis and conservation planning, acting like a treasure map that guides conservation efforts to areas where they are most needed.
Process Explanation: - This script prepares and analyzes data that combine actual sightings (occurrences) of invasive species with environmental conditions. This integration helps to understand where these species might thrive.
Impact on Conservation: - Different species require different conditions to flourish. By mapping where these conditions align with sightings, we can predict where invasive species might spread. This predictive insight is crucial for managing and potentially mitigating their impact.
Data Representation: - The plotted environmental data provide a visual representation of the climate across Vermont. - Each plot corresponds to a different environmental variable, offering insights into the diverse conditions within the state.
Strategic Development: - This preliminary analysis sets the stage for deeper exploration into how environmental factors correlate with the locations of invasive species. - This understanding is vital for developing strategies to control these species based on predicted changes in climate or habitat suitability.
Predictive Insight: - The data not only reveals where invasive species are currently found but also where they might appear next based on environmental conditions. This insight is fundamental for ecological management and conservation planning in Vermont.
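Note on setup: the objects `vermont_data` (a multi-layer climate raster clipped to Vermont) and `occurrence_data` (a table of sightings with `longitude` and `latitude` columns) are created in earlier chunks of this document. For orientation, a minimal sketch of an equivalent setup follows; the WorldClim source and the CSV file name are assumptions for illustration, not the workflow's actual inputs.
# Minimal setup sketch (assumed inputs; earlier chunks build the real objects)
library(raster)
# Bioclim layers, e.g. from WorldClim (bio1 = annual mean temperature, bio12 = annual precipitation)
worldclim <- raster::getData("worldclim", var = "bio", res = 10)
# Crop to an approximate Vermont bounding box (xmin, xmax, ymin, ymax)
vermont_extent <- extent(-73.437, -71.465, 42.73, 45.016)
vermont_data <- crop(worldclim, vermont_extent)
# Hypothetical occurrence file with 'longitude' and 'latitude' columns
occurrence_data <- read.csv("vermont_invasive_occurrences.csv")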
# --------------------- Step 6: Clean Occurrence Data ------------------------
# Step 6.1: Check the structure of the occurrence_data to understand its format and variables.
str(occurrence_data)
# Step 6.2: Extract longitude and latitude directly from the occurrence_data for further processing.
longitude <- occurrence_data$longitude
latitude <- occurrence_data$latitude
# Step 6.3: Combine longitude and latitude into a matrix for spatial operations.
occurrence_points_matrix <- cbind(longitude, latitude)
# Step 6.4: Verify the creation of the matrix by inspecting its initial rows.
head(occurrence_points_matrix)
# Step 6.5: Extract environmental values at occurrence points from the provided raster data.
env_values_at_points <- raster::extract(vermont_data, occurrence_points_matrix)
# Step 6.6: Check if the extraction resulted in a list and merge it into a matrix if necessary.
if (is.list(env_values_at_points)) {
env_values_matrix <- do.call(rbind, env_values_at_points)
} else {
env_values_matrix <- env_values_at_points
}
# Step 6.7: Filter out points with NA environmental values to ensure data integrity.
valid_points <- complete.cases(env_values_matrix)
occurrence_points_matrix_clean <- occurrence_points_matrix[valid_points, ]
# Step 6.8: Display the dimensions of the cleaned occurrence points matrix to verify the cleaning process.
dim(occurrence_points_matrix_clean)
Initial Assessment: - Puzzle Analogy: It’s essential to understand what data you have—like checking all the pieces of a puzzle before starting. This step ensures that all necessary information, especially geographic locations, is accessible and correctly formatted.
Data Integration: - Synthesis: By mapping occurrence points to environmental conditions, the script effectively merges biological observations with physical data. This synthesis is crucial for understanding the interactions between species and their habitats.
Data Integrity: - Cleaning Process: Cleaning the data by removing incomplete records ensures that the subsequent analysis is based on reliable and comprehensive information. It’s similar to filtering out noisy or unclear data in an experiment to focus on the results that provide clear insights.
Foundation for SDM: - Dataset Readiness: The clean, comprehensive dataset serves as a solid foundation for the next steps in the SDM process, where these points will be used to model potential distributions based on both biological occurrences and environmental factors.
Strategic Importance: - Predictive Modeling: In essence, this process prepares a dataset that not only maps where invasive species have been observed but also contextualizes these locations within the broader environmental landscape of Vermont. - Conservation Impact: This preparation is critical for accurately predicting where these species might thrive under current and future conditions, thereby informing conservation strategies and management decisions.
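One additional cleaning step worth considering (a hedged suggestion, not part of the workflow above): records that fall at exactly the same coordinates can inflate apparent sampling intensity and bias the model toward heavily surveyed sites. A minimal sketch:
# Optional (assumed extension): drop duplicate coordinate pairs
dups <- duplicated(occurrence_points_matrix_clean)
occurrence_points_matrix_clean <- occurrence_points_matrix_clean[!dups, ]
cat("Occurrence points after removing duplicates:", nrow(occurrence_points_matrix_clean), "\n")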
# --------------------- Step 7: Generate Random Background Points ------------
# Confirm the object's class
print(class(vermont_data))
# Step 7.1: Set seed for reproducibility
set.seed(123)
# Step 7.2: Calculate the number of non-NA cells and define max sample size
non_na_cells <- sum(!is.na(values(vermont_data[[1]])))
max_sample_size <- min(10000, non_na_cells)
# Generate random background points and keep only the XY coordinates
background_points <- raster::sampleRandom(vermont_data[[1]], size = max_sample_size, xy = TRUE)
background_points <- background_points[, 1:2] # Subset to x and y columns immediately
# Check what's inside background_points after subsetting
print(head(background_points))
# Step 7.3: Extract environmental values at the confirmed XY coordinates
bg_env_values <- raster::extract(vermont_data[[1]], background_points)
# Print the results to confirm successful extraction
print(bg_env_values)
# Step 7.4: Identify and retain valid background points (non-NA values)
valid_bg_points <- !is.na(bg_env_values)
# Step 7.5: Output the count of valid background points
cat("Number of valid background points:", sum(valid_bg_points), "\n")
# Step 7.6: Check the resolution and dimensions of your environmental data
cat("Resolution of raster data:", res(vermont_data), "\n")
cat("Dimensions of raster data:", dim(vermont_data), "\n")
# Step 7.7: Print the extent to see the covered area
print(vermont_extent)
# OR to print using cat:
cat("Extent of Vermont data: xmin =", vermont_extent@xmin,
"xmax =", vermont_extent@xmax,
"ymin =", vermont_extent@ymin,
"ymax =", vermont_extent@ymax, "\n")
# Step 7.8: Calculate the width and height in degrees directly from the extent object
width_degrees <- abs(vermont_extent@xmax - vermont_extent@xmin)
height_degrees <- abs(vermont_extent@ymax - vermont_extent@ymin)
# Step 7.9: Calculate the number of cells that can fit within the extent
cells_horizontal <- width_degrees / res(vermont_data)[1]
cells_vertical <- height_degrees / res(vermont_data)[2]
# Step 7.10: Calculate total cells
total_cells <- cells_horizontal * cells_vertical
# Step 7.11: Print calculated values
cat("Width in degrees:", width_degrees, "\n")
cat("Height in degrees:", height_degrees, "\n")
cat("Cells horizontally:", cells_horizontal, "\n")
cat("Cells vertically:", cells_vertical, "\n")
cat("Total cells in extent:", total_cells, "\n")
# Step 7.12: Compare to the actual grid dimensions
actual_grid_cells <- prod(dim(vermont_data[[1]]))
cat("Actual grid cells in raster data:", actual_grid_cells, "\n")
# Step 7.13: Explanation and action based on comparison
if (total_cells > actual_grid_cells) {
cat("The extent's calculated cell capacity exceeds the raster grid cells. Consider adjusting the extent or resolution.\n")
} else {
cat("The extent's calculated cell capacity is within the raster grid's limits.\n")
}
# Step 7.14: Generate random background points without exceeding the raster grid's capacity
background_points <- randomPoints(vermont_data[[1]], n = min(10000, actual_grid_cells))
# Step 7.14.1: Check if the output is a matrix and convert it to SpatialPoints
if (is.matrix(background_points)) {
background_points <- as.data.frame(background_points)
names(background_points) <- c("x", "y")
# Convert data frame to SpatialPoints
background_points <- SpatialPoints(background_points, proj4string = crs(vermont_data))
}
# Step 7.14.2: Bind the SpatialPoints with an empty data frame to create a SpatialPointsDataFrame
background_points_df <- SpatialPointsDataFrame(background_points,
data.frame(row.names = row.names(background_points)))
# Ensure background_points_df is now a SpatialPointsDataFrame
if (!inherits(background_points_df, "SpatialPointsDataFrame")) {
stop("background_points is not a SpatialPointsDataFrame")
}
# Step 7.15: Proceed with extraction and modeling as planned
# Extract environmental values using SpatialPointsDataFrame
bg_env_values <- raster::extract(vermont_data, background_points_df)
# Step 7.16: Identify and retain valid background points (rows with no NA values across layers)
valid_bg_points <- complete.cases(bg_env_values)
# Step 7.17: Output the final count of valid background points
cat("Final count of valid background points:", sum(valid_bg_points), "\n")
Importance of Comparison: - Comparative Analysis: In modeling species distributions, it’s crucial to compare places where the species is observed with a random sample of other locations. This comparison helps determine if the environmental conditions where the species occur are significantly different from the general environment.
Ecological Insights: - Data Extraction: By extracting environmental values at these points, we gain insights into the broader ecological backdrop of Vermont. This information is crucial for understanding the potential drivers behind species distributions.
Ensuring Accuracy: - Data Validation: Ensuring that all data points used in the analysis are valid (i.e., have complete environmental data) is essential for the accuracy of the model. It guarantees that the predictions made by the Species Distribution Model (SDM) are based on reliable and comprehensive environmental information.
Data Precision: - Resolution and Extent Checks: Checking the resolution and extent of the data against the generated points helps ensure that the environmental data used is appropriate and accurate for the region’s analysis. This step is akin to ensuring you have a detailed and accurate map before planning a route.
Setup Complete: - Ready for Visualization: Now that all data points have been validated and set up, we transition to actual visualization of the data. This next phase involves deploying the MaxEnt Model to graphically represent the ecological insights derived from our analysis.
Model Application: - Purpose: The MaxEnt Model will be used to visualize the potential distribution of invasive species across Vermont, based on the environmental data and observations gathered. This model is key to predicting and understanding the spread of these species under current and future environmental scenarios.
# --------------------- Step 8: Fit MaxEnt Model ------------------------------
# Step 8.1: Fit a MaxEnt model using the provided data.
maxent_model <- dismo::maxent(x = vermont_data, p = occurrence_points_matrix_clean, a = background_points)
# Step 8.2: Print a summary of the fitted MaxEnt model to review its performance and parameters.
summary(maxent_model)
# Step 8.3: Plot the MaxEnt model, including a title for clarity.
plot(maxent_model, main = "MaxEnt Model Summary and Response Curve", col = "black") # Plot dots with black color
# Step 8.4: Add a custom legend to the plot with unfilled circles for different contribution levels.
legend("bottomright", # Adjust position as needed
legend = c("High Contribution (>30%)", "Medium Contribution (15-30%)", "Low Contribution (<15%)"),
pch = 21, # Use unfilled circle for legend
pt.bg = "white", # No fill for circles in legend
text.col = "black", # Text color for legend
bty = "n", # No box around the legend
cex = 0.8) # Adjust text size accordingly
Figure 18: MaxEnt Model
# Description:
# MaxEnt Model Response Curve Plot:
# - This plot visualizes the response curve of the MaxEnt model.
# - The x-axis typically represents environmental variables or predictors used in the model.
# - The y-axis represents the contribution or importance of each variable in predicting species occurrence.
# - Each dot on the plot corresponds to a predictor variable, and its position indicates the contribution level.
# - Higher positions on the y-axis indicate higher contributions to the model's predictions.
# Custom Legend:
# - The legend provides additional information about the contribution levels of predictor variables.
# - It categorizes predictor variables into three groups based on their contribution levels:
# - High Contribution (>30%): Variables with a significant impact on species occurrence predictions.
# - Medium Contribution (15-30%): Variables with a moderate impact on predictions.
# - Low Contribution (<15%): Variables with a minimal impact on predictions.
# - The legend helps interpret the importance of predictor variables in the MaxEnt model.
The code section fits a MaxEnt model to the provided data and evaluates its performance through visualization. It involves training the MaxEnt model, summarizing its characteristics, plotting the model’s response curve, and adding a custom legend to interpret the contribution levels of predictor variables.
This section provides a summary of the MaxEnt model, including its length, class, and mode. - Length: The number of elements in the MaxEnt model object. - Class: Indicates the class of the object, in this case, “MaxEnt”. - Mode: Describes how the object is stored or represented.
The MaxEnt model helps predict where invasive species might be found based on where they’ve been observed and the environmental conditions at those locations. It assumes that species distributions are spread as widely as possible (maximum entropy) while still aligning with observed data.
By fitting this model, we can identify areas likely to be suitable for invasive species, not just where they have currently been observed. This predictive capability is crucial for preemptive conservation and management efforts.
The visualization and summary of the model provide insights into which environmental factors are most important for the species’ distribution. For instance, factors in the “High Contribution” category are likely crucial for the species’ presence in Vermont.
For conservationists and environmental managers, understanding these key contributing factors allows for more targeted actions, such as habitat management or monitoring efforts, especially in areas predicted as suitable but currently unoccupied by the species.
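To tie the legend's contribution categories back to actual numbers, the percent contribution of each predictor can be read from the fitted object. A short sketch, assuming the standard dismo MaxEnt object structure (row names depend on the layer names in `vermont_data`):
# Sketch: percent contribution per predictor from the dismo MaxEnt results slot
res_table <- maxent_model@results
contrib <- res_table[grep("\\.contribution$", rownames(res_table)), , drop = FALSE]
# Sort from highest to lowest contribution for easy comparison with the legend
print(contrib[order(contrib[, 1], decreasing = TRUE), , drop = FALSE])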
# --------------------- Step 9: Occurrence Data Cleaning and Validation Points ---------------------
# --------- THIS SECTION OF THE CODE HAS A LOT OF TROUBLESHOOTING ----------------------------------
# --------- I COULD NOT WORK OUT HOW TO DO THIS ANY OTHER WAY --------------------------------------
# Check the structure of the background_points object to ensure data integrity.
str(background_points)
# Check the dimensions of the background_points object to verify the size of the dataset.
dim(background_points)
# Extract longitude and latitude coordinates from background_points and create a new SpatialPoints object
background_points_coords <- SpatialPoints(coords = background_points@coords)
# Step 9.1: Create Data Frame for Presence Points
# Create a data frame for presence points using longitude, latitude, and presence indicators.
presence_df <- data.frame(
longitude = occurrence_points_matrix_clean[, "longitude"],
latitude = occurrence_points_matrix_clean[, "latitude"],
presence = 1
)
# Step 9.2: Create Data Frame for Absence Points
# Create a data frame for absence points using longitude, latitude, and absence indicators.
absence_df <- data.frame(
longitude = coordinates(background_points_coords)[, 1], # Extracting longitude
latitude = coordinates(background_points_coords)[, 2], # Extracting latitude
presence = 0
)
# Step 9.3: Combine Presence and Absence Data Frames
# Merge the presence and absence data frames to create a unified validation data set.
validation_points <- rbind(presence_df, absence_df)
# Step 9.4: Check Structure and Summary of Validation Points
# Inspect the structure and summary statistics of the validation points data frame.
str(validation_points)
summary(validation_points)
# Define predictor variables and calculate the expected number of columns.
predictor_variables <- c("bio1", "bio2", "bio3", "bio4", "bio5")
expected_columns <- length(predictor_variables) + 3 # Add 3 columns for longitude, latitude, and presence.
# Print the expected number of columns
cat("Expected number of columns:", expected_columns, "\n")
# Print the actual number of columns
cat("Actual number of columns:", ncol(validation_points), "\n")
# Print the column names
cat("Column names:", colnames(validation_points), "\n")
# Step 9.5: Add predictor variables to the validation_points data frame
validation_points[, predictor_variables] <- NA
# Step 9.6: Check if Predictor Variables were Successfully Added
if (ncol(validation_points) != expected_columns) {
stop("Predictor variables were not successfully added to the validation_points data frame.")
} else {
cat("Predictor variables were successfully added to the validation_points data frame.\n")
}
# Print column names and check the structure of validation_points.
colnames(validation_points)
ncol(validation_points)
str(validation_points)
# Step 9.6.2: Convert the RasterBrick to a SpatRaster object
library(terra) # provides rast(), vect(), and terra::extract() used below
vermont_data_spat <- rast(vermont_data)
# Step 9.6.3: Convert validation_points SpatialPointsDataFrame to a terra SpatVector object
validation_sp <- vect(validation_points, geom = c("longitude", "latitude"), crs = crs(vermont_data_spat))
# Step 9.7: Extract Environmental Data for Validation Points using the 'terra' package
# Make sure to use terra::extract to specify the correct package
validation_env_values <- terra::extract(vermont_data_spat, validation_sp)
# Check the structure of the extracted data to confirm it was successful
str(validation_env_values)
# Step 9.7 (alternative, sp-based): Create a SpatialPoints object from the validation coordinates
validation_coords <- cbind(validation_points$longitude, validation_points$latitude)
validation_sp <- SpatialPoints(validation_coords, proj4string = crs(vermont_data))
# Step 9.8: Create a SpatialPointsDataFrame by binding the SpatialPoints with the presence/absence data
validation_spdf <- SpatialPointsDataFrame(validation_sp, validation_points[, c("presence", "bio1", "bio2", "bio3", "bio4", "bio5")])
# Step 9.9: Extract Environmental Data for Validation Points
# Convert SpatialPointsDataFrame to SpatVector
validation_spatvec <- vect(validation_spdf)
# Extract environmental data for validation points
validation_env_values <- terra::extract(vermont_data_spat, validation_spatvec)
# Now, you can check the structure of the extracted data to confirm it was successful
str(validation_env_values)
# Step 9.7.1: Check if the Result is a List
# Check if the result of extraction is a list and explore its structure.
if (is.list(validation_env_values)) {
lapply(validation_env_values, str)
} else {
str(validation_env_values)
}
# Step 9.7.2: Bind Environmental Data to a Matrix
# Bind extracted environmental data into a matrix for consistency.
if (is.list(validation_env_values)) {
column_check <- sapply(validation_env_values, function(x) length(x))
if (all(column_check == column_check[1])) {
validation_env_matrix <- do.call(rbind, validation_env_values)
} else {
stop("Not all data elements have the same number of columns.")
}
} else {
validation_env_matrix <- validation_env_values
}
# Step 9.7.3: Set Column Names of Environmental Data
# Set column names for the environmental data matrix if dimensions match.
if (ncol(validation_env_matrix) == length(predictor_variables)) {
colnames(validation_env_matrix) <- predictor_variables
} else {
cat("Mismatch in the number of columns and predictor variables\n")
cat("Number of columns in validation_env_matrix:", ncol(validation_env_matrix), "\n")
cat("Number of predictor variables:", length(predictor_variables), "\n")
}
# Step 9.7.4: Debugging Output
# Print debugging information if there's a mismatch in column numbers.
if (ncol(validation_env_matrix) != length(predictor_variables)) {
print("Mismatch detected:")
print(paste("Expected columns:", length(predictor_variables), "but found:", ncol(validation_env_matrix)))
}
# Step 9.8: Verify the Updated Structure of Validation Points
# Check if the result is not a list and use the matrix if the structure is correct.
if (!is.list(validation_env_values)) {
validation_env_matrix <- validation_env_values
if (is.null(colnames(validation_env_matrix))) {
colnames(validation_env_matrix) <- paste("bio", 1:ncol(validation_env_matrix), sep="")
}
}
nrow(validation_env_matrix)
nrow(validation_points)
# Step 9.9 (re-run): Extract Environmental Data for Validation Points after the checks above
# Convert SpatialPointsDataFrame to SpatVector
validation_spatvec <- vect(validation_spdf)
# Extract environmental data for validation points
validation_env_values <- terra::extract(vermont_data_spat, validation_spatvec)
# Check if the number of rows in validation_env_values matches validation_points
if (nrow(validation_env_values) != nrow(validation_points)) {
stop("Number of rows in validation_env_values does not match validation_points.")
}
# Combine environmental data with existing validation_points data frame
validation_points <- cbind(validation_points[, c("longitude", "latitude", "presence")], validation_env_values)
# Verify the updated structure of validation_points data frame
str(validation_points)
# Optionally, review the first few rows of the updated data frame
head(validation_points)
# Step 9.10: Check for Missing Values
# Count the number of missing values in each column of the validation_points data frame.
na_count <- sapply(validation_points, function(x) sum(is.na(x)))
cat("Number of NAs in each column:\n")
print(na_count)
# Step 9.11: Summary Statistics for Environmental Variables
# Compute summary statistics for the environmental variables in the validation_points data frame.
# (terra::extract prepends an ID column, so everything from column 4 onward is summarized here.)
summary(validation_points[, 4:ncol(validation_points)])
# Step 9.12: Basic Histograms for Environmental Variables
# Generate histograms to visualize the distribution of environmental variables.
hist(validation_points$bio1,
main = "Histogram of BIO1 (Annual Mean Temperature)",
xlab = "BIO1 Values (Temperature)",
ylab = "Frequency",
col = "blue", # Color to differentiate bins
border = "black", # Color of bin borders
breaks = 30) # Adjust the number of bins if necessary
Figure 19: Histogram of BIO1 (Annual Mean Temperature)
Overview: This histogram visualizes the distribution of annual mean temperatures across Vermont, allowing for an analysis of temperature patterns within the study area.
Details: - X-axis: Represents temperature in degrees Celsius. - Y-axis: Represents the frequency of observations. - Bar Representation: Each bar in the histogram corresponds to a range of temperatures. The height of each bar indicates the number of observations falling within that specific temperature range.
Purpose of the Histogram: - The histogram helps in understanding how temperatures are distributed across Vermont. It provides a clear visual representation of the thermal environment, which is essential for ecological and climatological studies.
Utility: - Pattern Recognition: By examining the histogram, researchers can identify any patterns or anomalies in temperature distribution. This is crucial for assessing climate variability and potential impacts on local ecosystems.
Implications: - Data-Driven Decisions: The insights from the histogram can aid in making informed decisions related to environmental management and conservation planning, especially in the context of climate change adaptation strategies.
hist(validation_points$bio12,
main = "Histogram of BIO12 (Annual Precipitation)",
xlab = "BIO12 Values (Annual Precipitation in mm)",
ylab = "Frequency",
col = "green", # Color to differentiate bins
border = "black", # Color of bin borders
breaks = 30) # Adjust the number of bins if necessary
Figure 20: Histogram of BIO12 (Annual Precipitation)
Overview: This histogram visualizes the distribution of annual precipitation amounts across Vermont, providing insights into the variability and overall patterns of precipitation within the study area.
Details: - X-axis: Represents precipitation in millimeters. - Y-axis: Represents the frequency of observations. - Bar Representation: Each bar in the histogram corresponds to a range of precipitation values. The height of each bar indicates the number of observations falling within that specific range.
Purpose of the Histogram: - The histogram is crucial for understanding the distribution and variability of precipitation levels across Vermont. It offers a quantitative view that is essential for water resource management and ecological assessments.
Utility: - Pattern Recognition: By examining the histogram, researchers can identify patterns in precipitation distribution, which is vital for predicting water availability and managing flood risks. - Anomaly Detection: Helps in identifying unusual precipitation patterns that may indicate climatic shifts or anomalies.
Implications: - Data-Driven Decisions: Insights from the histogram can guide decisions in sectors dependent on water resources, such as agriculture, forestry, and urban planning. - Climate Adaptation Strategies: Understanding precipitation patterns assists in developing strategies to cope with potential climate change impacts, ensuring sustainable management of natural resources.
Overview: These histograms provide visual insights into the distributions of temperature and precipitation data across Vermont, highlighting key environmental factors that affect ecological dynamics.
Purpose of Integration: - Comprehensive Analysis: By creating a dataset that includes observations of where the species is and isn’t found, alongside environmental data, researchers can identify unique conditions that might favor the species’ presence. - Enhanced Predictive Power: This integration enhances the predictive power of the species distribution model (SDM), allowing for more accurate forecasting of species movements.
Utility of Predictors: - Critical Analysis: These predictors are essential for understanding which aspects of the environment, like temperature or precipitation, are most influential in species distribution. - Data Extraction: Initially, placeholders (NA) are used until specific environmental data is extracted, ensuring each site’s conditions are accurately represented.
Ensuring Accuracy: - Data Validation: It’s crucial to ensure the dataset is error-free and fully populated with environmental data. This guarantees that the modeling is based on comprehensive and accurate information. - Reliable Predictions: Accurate data is essential for reliable predictions, forming the backbone of effective ecological modeling.
Readiness for Analysis: - Advanced Techniques: The meticulously prepared dataset, now complete with environmental variables, is ready for advanced statistical techniques. These techniques will predict potential distribution areas for invasive species, enhancing our understanding of ecological threats.
Strategic Use: - Conservation Insights: With a robust SDM, conservationists can better understand and predict the spread of invasive species. - Effective Management Strategies: This understanding enables more targeted and effective management strategies to protect Vermont’s ecosystems, ensuring that conservation efforts are well-informed and strategically implemented.
Addressing Missing Data: - Necessity: In data analysis, especially in environmental sciences, missing data can skew results and lead to inaccurate conclusions. - Solution: Imputation helps fill these gaps, ensuring that each data point contributes to the overall analysis without introducing bias.
Robustness Against Outliers: - Advantages: The median is robust against outliers, making it a preferred choice for imputation in environmental data where extreme values can disproportionately affect the mean.
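A two-line illustration of that robustness, with made-up numbers:
# Illustrative values only: one extreme outlier among five observations
x <- c(10, 11, 12, 13, 500)
mean(x)   # 109.2 - dragged toward the outlier
median(x) # 12    - essentially unaffected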
Maintaining Data Integrity: - Consistency: Post-imputation, the dataset’s integrity is maintained, as indicated by the consistency in the statistical summaries. - Implication: This suggests that the imputation has neither distorted the data distribution nor introduced any bias that could affect subsequent analyses.
Challenges: - Data Gaps: The dataset possibly contained missing values which could lead to incomplete analyses or biased results due to insufficient data.
Improvements: - Data Completeness: All missing values are filled, ensuring that the dataset is complete. This allows for more reliable and robust statistical analysis and modeling, providing a solid foundation for any ecological or geographical inferences to be drawn.
Purpose and Benefits: - Step 9: Ensures that the data is correctly structured and integrated with necessary environmental variables. - Step 10: Improves the dataset’s completeness and usability for downstream analyses.
Strategic Importance: - Both steps are integral to data preparation but serve different purposes, highlighting the importance of a methodical approach to maintaining data quality throughout the analysis process.
# -------------------- Step 10: Implementing Data Imputation ----------------------------
# Simple median imputation for missing values (assumes columns 4 onward are the numeric predictor columns)
for(i in 4:ncol(validation_points)) {
validation_points[is.na(validation_points[,i]), i] <- median(validation_points[,i], na.rm = TRUE)
}
# Check the results after imputation
summary(validation_points)
# Recreate histograms with the imputed data
# Step 10.1: Histogram for BIO1 (Annual Mean Temperature)
# Display the distribution of the annual mean temperature (BIO1) after imputation of missing data.
hist(validation_points$bio1,
main = "Histogram of BIO1 (Annual Mean Temperature)",
xlab = "Temperature (deg C)", # Simple text replacement
ylab = "Frequency",
col = "blue",
border = "black",
breaks = 30)
# Step 10.2: Histogram for BIO12 (Annual Precipitation)
# Show the distribution of annual precipitation amounts (BIO12) after imputation.
hist(validation_points$bio12,
main = "Histogram of BIO12 (Annual Precipitation)",
xlab = "Precipitation (mm)",
ylab = "Frequency of Observations",
col = "green", # Color to differentiate bins
border = "black", # Color of bin borders
breaks = 30) # Adjust the number of bins if necessary
Figure 21: Histograms and Data Imputation
Overview: - Purpose: This histogram displays the distribution of annual mean temperatures across Vermont, providing a visual representation of thermal conditions. - X-axis: Temperature in degrees Celsius. - Y-axis: Frequency of observations. - Bar Details: Each bar represents a range of temperatures, with the height indicating the number of observations within that range.
Analytical Insights: - Temperature Distribution: By examining the histogram, researchers can understand how temperatures are distributed across the study area and identify any patterns or anomalies.
Overview: - Purpose: This histogram illustrates the distribution of annual precipitation amounts across Vermont. - X-axis: Precipitation in millimeters. - Y-axis: Frequency of observations. - Bar Details: Similar to the first histogram, each bar represents a range of precipitation values, with the height indicating the number of observations within that range.
Analytical Insights: - Precipitation Variability: Analyzing this histogram helps in understanding the variability and distribution of precipitation levels across the study area.
Integrated Insights: - These histograms provide visual insights into the distributions of temperature and precipitation data in Vermont. - Methodology: After missing values have been imputed using the simple median imputation method, these histograms offer a clearer and more accurate depiction of environmental conditions.
Implications for Research: - Enhanced Understanding: The visual analysis of these histograms aids researchers and decision-makers in understanding climatic trends and anomalies in Vermont. - Support for Ecological Studies: The insights gained are essential for ecological research, conservation planning, and preparing for climatic changes.
# --------------------- Step 11: Train Logistic Regression Model (GLM) ------------------------------
# glm() comes from the 'stats' package, which is part of base R and attached by default,
# so no installation step is needed.
library(stats)
# Step 11.1: Define the Model Formula
# Define the formula for logistic regression model using 'presence' as the binary outcome and environmental variables (bio1 to bio5) as predictors.
model_formula <- presence ~ bio1 + bio2 + bio3 + bio4 + bio5
# Step 11.2: Train the Logistic Regression Model
# Train the logistic regression model using the defined formula and the dataset.
model <- glm(model_formula, data=validation_points, family=binomial())
# Step 11.3: Check Model Summary
# After training the model, examine the summary to understand the significance of predictors and model fit.
summary(model)
# Step 11.4: Diagnostics and Validation
# Perform diagnostic checks to validate the model, which may include examining residuals and assessing model goodness of fit.
plot(model)
Figure 22: Train Logistic Regression Model (GLM) - Diagnostic Plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage)
# Step 11.5: Saving the Model
# Set the working directory where you want to save the model.
setwd("D:/GEOG_588/SDM_Invasive_species")
# Check to confirm the current working directory.
getwd()
# Save the model in the current working directory.
save(model, file="logistic_model.RData")
# Define the full path to the model file.
model_path <- "D:/GEOG_588/SDM_Invasive_species/logistic_model.RData"
# Check if the model file exists at the specified path.
if (file.exists(model_path)) {
print(paste("Model file saved successfully at:", model_path))
} else {
print("Failed to save the model file at the specified path.")
}
Figure 22: Train Logistic Regression Model (GLM) - Overview of Environmental Influences on Species Distribution
This figure illustrates the impact of various environmental variables on species distribution within Vermont, highlighting key factors that contribute to the presence or absence of species across different regions.
# Understanding Logistic Regression in Ecological Modeling
Purpose and Fit: - Binary Outcome Prediction: Logistic regression is chosen because it’s ideal for predicting a binary outcome—like the presence or absence of a species—based on multiple influencing factors, which are environmental variables in this context. - Decision Utility: This model helps understand which conditions are likely to favor or inhibit the presence of the species, making it invaluable for ecological decision-making.
Interpreting Coefficients: - Positive Coefficients (e.g., bio5): Variables like bio5 with positive coefficients suggest that increasing values of these conditions increase the probability of species presence, indicating favorable environments. - Negative Coefficients (e.g., bio1, bio2): Conversely, variables with negative coefficients imply that as these conditions increase, the probability of species presence decreases, indicating less favorable or even detrimental conditions.
Statistical Significance and Model Fit: - Significance Values (Pr(>|z|)): These values tell us whether the effects of environmental variables on species presence are statistically significant. For instance, variables like bio1 and bio2 are highly significant, implying strong evidence that these factors influence the species’ distribution. - Model Fit Metrics: The model’s residual deviance and AIC (Akaike Information Criterion) provide measures of how well the model explains the observed data compared to a null model (one with no predictors).
Ensuring Reliability: - Diagnostic Plots: By examining diagnostic plots, researchers can check for any anomalies like outliers or patterns in residuals that might suggest the model does not fit the data well. These checks are crucial for confirming the model’s reliability.
Application and Future Use: - Model Storage: Storing the model allows for its application in future studies or for predictive purposes, such as anticipating changes in species distribution due to environmental changes.
Model Overview: - This summary explains how the logistic regression model is structured, trained, and evaluated. It highlights the implications of its findings in a way that emphasizes both the technical process and its relevance to ecological research and management, providing a comprehensive understanding of the model’s utility and application in ecological contexts.
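As a concrete illustration of that future use, the saved model can be reloaded in a later session and applied to new sites. A minimal sketch; the predictor values below are invented placeholders, and the new data must carry the same column names (bio1 to bio5) used in the training formula:
# Sketch: reload the saved GLM and predict presence probability for a new site
load("logistic_model.RData") # restores the object 'model'
new_site <- data.frame(bio1 = 80, bio2 = 110, bio3 = 30, bio4 = 8000, bio5 = 250) # hypothetical values
predict(model, newdata = new_site, type = "response") # probability between 0 and 1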
# ------------------------ Step 12: AUC Calculation and Visualization -------------------------------
# ------------------------ SAME FOR HERE ... THIS SECTION HAS LOTS OF TROUBLESHOOTING ---------------
# ------------------------ I COULD NOT WORK OUT HOW TO MAKE THIS WORK ANY OTHER WAY -----------------
# Step 12.1: Data Types and Content Check
# Load pROC, which provides the roc() and auc() functions used below.
library(pROC)
# Examine the structure and summary of the presence and predicted_prob columns
str(validation_points$presence)
str(validation_points$predicted_prob)
summary(validation_points$presence)
summary(validation_points$predicted_prob)
# Step 12.2: Removal of NA Values
# Determine the count of rows with non-NA values for both presence and predicted_prob
valid_data_count <- sum(complete.cases(validation_points$presence, validation_points$predicted_prob))
print(valid_data_count)
# If valid data count is low, further investigation is needed
if (valid_data_count == 0) {
print("All data points have NAs after NA removal or there is only one class present. Check data preparation steps.")
}
# Step 12.3: Ensure Presence Data Contains Two Classes
# Check the unique values in the presence column
print(unique(validation_points$presence))
# Step 12.4: Validate Predicted_prob Values
# Verify if predicted_prob values fall within the 0-1 range
if (any(validation_points$predicted_prob < 0 | validation_points$predicted_prob > 1, na.rm = TRUE)) {
print("Some predicted_prob values are outside the range of 0 to 1. Review your prediction step.")
}
# Step 12.5: ROC Calculation
# If issues are addressed, recalculate ROC
if (valid_data_count > 0 && length(unique(validation_points$presence)) > 1) {
# Check if the predicted_prob column is NULL or contains only NA values
if (is.null(validation_points$predicted_prob) || all(is.na(validation_points$predicted_prob))) {
print("The predicted_prob column is NULL or contains only NA values. Check data preparation steps.")
} else {
# Check if any predicted_prob values are outside the range of 0 to 1
if (any(validation_points$predicted_prob < 0 | validation_points$predicted_prob > 1, na.rm = TRUE)) {
print("Some predicted_prob values are outside the range of 0 to 1. Review your prediction step.")
} else {
# Perform ROC calculation
roc_result <- roc(validation_points$presence, validation_points$predicted_prob, na.rm = TRUE)
auc_value <- auc(roc_result)
print(paste("AUC value:", auc_value))
plot(roc_result, main="ROC Curve", col="#1c61b6")
}
}
} else {
print("Cannot compute ROC due to insufficient or invalid data.")
}
# Step 12.6: Predictions Generation
# Ensure logistic regression model is loaded and accurate
if (exists("model")) {
# Generate predicted probabilities using the logistic regression model
validation_points$predicted_prob <- predict(model, newdata=validation_points, type="response")
# Check if predictions were added successfully
if (is.null(validation_points$predicted_prob)) {
stop("Failed to generate predictions. Check model and data compatibility.")
} else {
print("Predictions generated successfully.")
}
} else {
stop("Model not found. Ensure your logistic regression model is loaded correctly.")
}
# Step 12.7: ROC Calculation Retry
# Retry ROC calculation
roc_result <- roc(validation_points$presence, validation_points$predicted_prob, na.rm = TRUE)
auc_value <- auc(roc_result)
# Print the AUC value
print(paste("AUC value:", auc_value))
# Plot the ROC curve
plot(roc_result, main="ROC Curve", col="#1c61b6", print.auc=TRUE)
legend("bottomright", legend=c(paste("ROC Curve (AUC = ", round(auc_value, 2), ")")), col="#1c61b6", lty=1, cex=0.8)
Figure 23: AUC Calculation and Visualization
# Interpret AUC Value
# AUC Value of 0.771: This value is closer to 1 than to 0.5, indicating that the model generally has a good measure of separability. It is capable of differentiating between the positive and negative classes effectively.
# AUC Values: Generally, an AUC of 0.5 suggests no discrimination (no better than random chance), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and above 0.9 is outstanding.
Figure 24: ROC Curve Analysis for Species Distribution Model
This figure displays the ROC curve, illustrating the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) across various threshold settings. An AUC value of 0.771 indicates good model separability, suggesting the model's effectiveness in distinguishing between presence and absence of species. The ROC curve's proximity to the top left corner of the plot reflects a higher accuracy of the test.
Overview: - AUC (Area Under the Curve): Measures the ability of the model to correctly predict higher scores for actual positive occurrences than for negatives. It quantifies the overall ability of the model to discriminate between positive and negative classes. - ROC Curve (Receiver Operating Characteristic Curve): Visualizes how the true positive rate (sensitivity) and the false positive rate (1 - specificity) relate at various threshold settings. This curve helps in assessing the trade-offs between sensitivity and specificity under different thresholds.
Significance of AUC: - Reliability of Predictions: A good AUC value indicates that the model predictions are reliable and can effectively differentiate between sites with and without invasive species based on the environmental conditions modeled. It signifies the model’s effectiveness in distinguishing between the classes.
Role of ROC and AUC in Conservation: - Assessment Tool: The ROC and AUC are critical tools that help assess how well the environmental factors used in the model work together to predict species presence. - Conservation Planning: This analysis is crucial for conservation planning and management as it helps identify areas at high risk of invasion. Understanding these risks aids in prioritizing areas for monitoring and intervention, ensuring that conservation efforts are directed where they are most needed.
Summary: - These metrics not only highlight the model’s performance but also its practical implications in real-world ecological management and conservation strategies. The insights gained from ROC and AUC analysis guide decision-making in ecological conservation, enhancing the strategic allocation of resources and efforts in managing invasive species threats.
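One practical follow-up (a sketch, not part of the workflow above): the ROC curve also identifies a usable classification threshold. pROC's coords() can return the threshold that maximizes sensitivity plus specificity (Youden's J), which can then turn the predicted probabilities into presence/absence calls:
# Sketch: pick a threshold from the ROC curve and build a confusion table
best <- pROC::coords(roc_result, "best", ret = c("threshold", "sensitivity", "specificity"))
print(best)
thr <- as.numeric(unlist(best))[1] # threshold is the first value returned
predicted_class <- as.integer(validation_points$predicted_prob >= thr)
table(observed = validation_points$presence, predicted = predicted_class)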
# ---------------- Step 13 Extract and Verify validation_data ----------
# ----------------- TROUBLESHOOTING INSIDE THIS SECTION ----------------
# Step 13.1: Extract Environmental Data
# Extract environmental data from the raster based on the validation points.
# Namespace the call: terra (attached above) also exports extract(), but it cannot handle Raster* objects.
validation_env_values <- raster::extract(vermont_data, validation_coords)
# Check extraction output
print(head(validation_env_values))
# Step 13.2: Correct Format of validation_points
# Extract the coordinates (longitude and latitude) from validation_points
coords <- validation_points[, 1:2]
# Now use these coordinates to extract environmental data
validation_env_values <- raster::extract(vermont_data, coords)
# Step 13.3: Verify the Extraction
# Check the head of the extracted environmental values
print(head(validation_env_values))
# Step 13.4: Combine Data for Validation
# Combine the extracted environmental values with the presence/absence data
validation_data <- cbind(as.data.frame(validation_env_values), presence = validation_points$presence)
# Ensure the data frame is properly formatted for prediction
print(head(validation_data))
# What’s Happening?: Environmental characteristics (like temperature, precipitation, etc.) at specific geographic points are being matched with the locations where species have
# been observed or are expected to be absent. This combination creates a dataset that feeds into a predictive model, helping to understand where the species might thrive based on environmental conditions.
# Why It Matters?: For SDM, having accurate environmental data alongside occurrence data is crucial. The model relies on these inputs to discern patterns and predict
# species distributions effectively. Errors in data extraction or formatting can lead to incorrect predictions, which could misguide conservation efforts.
# Practical Implications:
# Ensuring Data Integrity: The troubleshooting steps undertaken are essential in SDM workflows to ensure that the input data does not contain errors or inconsistencies, which could lead to flawed outputs.
# Readiness for Model Application: The cleaned and verified dataset is now ready for advanced analytical processes, such as fitting a MaxEnt model or other statistical models, to predict species distributions.
# This step is a bridge between raw data collection and actionable ecological insights.
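# Data integrity check (a minimal sketch, assuming the validation_data frame
# built above): extraction returns NA for points that fall outside the raster
# or on masked cells, and such rows should be removed before modeling.
na_rows <- !complete.cases(validation_data)
cat("Rows with missing environmental values:", sum(na_rows), "\n")
validation_data <- validation_data[!na_rows, ]  # drop incomplete rows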
# ---------------- Step 14 MaxEnt Model Predicted Probabilities ----------------
# Step 14.1: Predict Using the MaxEnt Model
# Predict using the MaxEnt model
predictions <- predict(maxent_model, vermont_data) # This should create a raster layer of probabilities
# Plot the predicted probabilities raster
library(rasterVis)
levelplot(predictions,
          main = "MaxEnt Model Predicted Probabilities",
          col.regions = rev(terrain.colors(255)),
          xlab = "Longitude", ylab = "Latitude",
          colorkey = list(space = "top",
                          labels = list(at = seq(0, 1, by = 0.2), col = "black")),
          legend.width = 0.8, margin = FALSE)
Figure 25: Predicted Probabilities Raster (Levelplot) for Species Occurrence in Vermont
# Step 14.2: Extract Coordinate Data
# Extract the coordinates from validation_points
coords <- validation_points[, c("longitude", "latitude")]
# Promote the data frame to a spatial object; because both columns are consumed
# as coordinates, the result is a SpatialPoints object (sp package)
coordinates(coords) <- ~longitude + latitude
# Set the CRS if known; here's an example using WGS84
proj4string(coords) <- CRS("+proj=longlat +datum=WGS84")
# Step 14.3: Verify the Spatial Object
# Check the class and structure to confirm the conversion (expect a SpatialPoints object)
print(class(coords))
str(coords)
Overview: This levelplot illustrates the predicted probabilities of species occurrence across Vermont, providing a detailed visual analysis of the geographical likelihood of species presence.
Visualization Details: - Raster Cells: Each cell in the raster represents a specific geographic location. - Color Indicators: The color of each cell indicates the probability of species occurrence. - Color Scale: With the reversed terrain palette used here, pale whites and pinks denote low probabilities, grading through oranges and yellows to deep greens at the highest probabilities, marking areas with the most favorable conditions for occurrence. - Legend and Scale: The color scale in the legend quantifies these probabilities, offering a clear visual guide to interpreting the spatial distribution of potential species habitats within the study area.
Implications for Conservation: - This visual tool assists in identifying critical areas for conservation efforts, focusing resources on regions with higher probabilities of species presence.
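If the probability surface needs to be shared with GIS users outside R, one option (a sketch, not part of the original workflow; the filename is illustrative) is to export it as a GeoTIFF with the raster package's writeRaster():
# Persist the prediction raster for use in external GIS software
writeRaster(predictions, filename = "maxent_predicted_probabilities.tif",
            format = "GTiff", overwrite = TRUE)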
# Now use these spatial points to extract the probability data from the raster
predicted_probs <- extract(predictions, coords)
# Ensure the extracted data is in the correct format
predicted_probs <- as.numeric(predicted_probs) # Convert list or matrix to numeric vector if necessary
# Step 14.4: Plot the Distribution of Predicted Probabilities
# Check the distribution of predicted probabilities
hist(predicted_probs,
     main = "Distribution of Predicted Probabilities",
     xlab = "Probabilities", ylab = "Frequency",
     breaks = 30, col = "lightblue", border = "black",
     xlim = c(0, 1), ylim = c(0, 500))
# Note: main, xlab, and ylab are supplied directly to hist(), so separate
# title() and mtext() calls are unnecessary and would overplot the labels.
Figure 26: Predicted Probabilities (Histogram)
# Predicted Probabilities Raster (Levelplot):
# The levelplot function creates a visual representation of the predicted probabilities across the study area (Vermont in this case).
# Each cell in the raster represents a geographic location, and the color of the cell indicates the probability of species occurrence at that location.
# With the reversed terrain palette used here, pale whites and pinks mark low probabilities and deep greens mark the highest.
# The legend at the top of the plot provides a color scale, indicating the probability values corresponding to each color.
# Distribution of Predicted Probabilities (Histogram):
# The histogram represents the distribution of predicted probabilities across all the sampled locations (or points) in Vermont.
# The x-axis of the histogram represents the range of predicted probabilities, typically from 0 to 1, where 0 indicates very low probability and 1 indicates very high probability.
# The y-axis represents the frequency or count of occurrences within each probability range.
# The histogram provides insight into the variability and spread of predicted probabilities across the study area.
# For example, if the histogram is skewed towards higher probabilities, it suggests that the model predicts a higher likelihood of species occurrence in most locations. Conversely, if it's skewed towards lower probabilities, it suggests a lower likelihood of occurrence.
Overview:
This histogram represents the distribution of predicted probabilities of
species occurrence across all sampled locations in Vermont. The
visualization provides a clear view of the probability ranges, offering
insights into the likelihood of species presence across different
areas.
Visualization Details: - X-axis: Categorizes the range of predicted probabilities from 0 (indicating very low probability) to 1 (indicating very high probability). - Y-axis: Shows the frequency or count of occurrences within each probability range. - Color Coding: The bars are drawn in a single light blue; the white-to-green probability gradient described above applies to the levelplot, not to this histogram.
Implications for Conservation: - Analytical Insight: The histogram helps in understanding the variability and spread of predicted probabilities, offering clues about areas that may require more focused conservation efforts. - Strategic Planning: Identifying areas with higher probabilities can guide conservationists and resource managers in prioritizing regions for intervention and monitoring, enhancing the effectiveness of conservation strategies.
Geographical Visualization: - Levelplot Usage: The levelplot provides a geographical visualization where each pixel or color intensity on the map correlates with the probability of species occurrence predicted by the model. It translates complex model outputs into an easily interpretable format, making it accessible for stakeholders to understand ecological risks.
Histogram Insights: - Distribution Analysis: After extracting probabilities, they were plotted in a histogram to highlight how these probabilities are distributed across the region. This analysis is crucial for understanding whether the model generally predicts a higher or lower likelihood of occurrence. - Implications: This allows researchers and conservationists to gauge the overall effectiveness of the model in capturing the reality of species distribution across different environments.
Conservation Impact: - Direct Insight: For conservationists, these visualizations and statistical summaries offer direct insight into areas that might require more attention, either because they are likely habitats for invasive species (indicated by higher probabilities) or areas that currently pose less of a risk (indicated by lower probabilities).
Model Evaluation and Conservation Planning: - Model Performance: By visually and statistically analyzing the predicted probabilities, we can assess the MaxEnt model’s performance. A well-performing model should ideally show higher probabilities in known areas of species presence and lower probabilities elsewhere. - Strategic Conservation Planning: The spatial distribution map (levelplot) and the histogram together inform where invasive species might spread. This critical information helps in planning control measures, monitoring programs, and conservation strategies more effectively, targeting areas predicted to have higher probabilities of species presence.
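As a numeric complement to the two plots, a few base-R summaries (a quick sketch, assuming the predicted_probs vector from Step 14) quantify the distribution's center, spread, and the share of locations above a nominal 0.5 cutoff:
# Five-number summary plus mean of the predicted probabilities
print(summary(predicted_probs))
# Selected quantiles of the distribution
print(quantile(predicted_probs, probs = c(0.05, 0.5, 0.95), na.rm = TRUE))
# Proportion of validation locations with predicted probability above 0.5
cat("Share above 0.5:", mean(predicted_probs > 0.5, na.rm = TRUE), "\n")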
# ------------------------- Step 15 TSS --------------------------------
# True Skill Statistic (TSS): TSS = sensitivity + specificity - 1
# (TSS ranges from -1 to 1; 0 indicates no skill beyond chance)
# 'vermont_data' contains the data used to train the MaxEnt model:
model_vars <- names(vermont_data)
# Print the names of the variables used in the model
print(model_vars)
# Step 15.1 Data Verification and Preparation
# Extract the presence data from the validation_points dataframe
# We expect 'presence' to be a column in validation_points, with 1 indicating presence and 0 indicating absence
presence_data <- validation_points$presence
# Verify the presence data
print(head(presence_data))
# Step 15.2 Sensitivity and Specificity Calculation
# Define a function to calculate sensitivity and specificity
calculate_stats <- function(threshold, probs, actual) {
  predicted <- ifelse(probs > threshold, 1, 0)
  cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
              Actual = factor(actual, levels = c(0, 1)))
  # If a class is absent from 'actual', fall back to 0 instead of NaN:
  # sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
  sensitivity <- ifelse(sum(actual == 1) > 0, cm["1", "1"] / sum(cm[, "1"]), 0)
  specificity <- ifelse(sum(actual == 0) > 0, cm["0", "0"] / sum(cm[, "0"]), 0)
  return(c(sensitivity = sensitivity, specificity = specificity))
}
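# A quick single-threshold check of the helper before scanning all thresholds
# (uses the predicted_probs and presence_data objects defined above):
calculate_stats(0.5, probs = predicted_probs, actual = presence_data)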
# Step 15.3 Sensitivity and Specificity Calculation Across Thresholds
# Evaluate sensitivity and specificity for different threshold values
thresholds <- seq(0.03, 1, by = 0.01)
stats <- sapply(thresholds, calculate_stats, probs = predicted_probs, actual = presence_data)
# Find the threshold that maximizes the sum of sensitivity and specificity.
# sapply() returns a 2-row matrix with one column per threshold, so the
# per-threshold totals come from colSums(), not rowSums().
max_index <- which.max(colSums(stats))
optimal_threshold <- thresholds[max_index]
# Print the results
cat("Optimal threshold:", optimal_threshold, "\n")
cat("Sensitivity at optimal threshold:", stats[1, max_index], "\n")
cat("Specificity at optimal threshold:", stats[2, max_index], "\n")
# Step 15.4 Visualization of Sensitivity and Specificity
plot(thresholds, stats[1,], type='l', col='blue', xlab='Threshold', ylab='Metric Value', main='Sensitivity and Specificity by Threshold')
lines(thresholds, stats[2,], col='red')
legend("bottomright", legend=c("Sensitivity", "Specificity"), col=c("blue", "red"), lty=1, title="Metrics")
Figure 27: Sensitivity and Specificity Calculation Across Thresholds
# Note: The plotted graph will visually guide the selection of a new threshold value by showing
# the trade-off between sensitivity and specificity across a range of possible values.
# Interpretation of Results and Considerations for Next Steps
# Sensitivity of 0:
# This result indicates that the model, at the threshold of 0.03, fails to correctly identify any true positives.
# Essentially, the model is not predicting the presence of the condition correctly at all. This lack of sensitivity
# suggests that the model is overly conservative, potentially classifying nearly all test cases as negative.
# Specificity of 1:
# This outcome shows that the model perfectly identifies all true negatives at this threshold. It correctly predicts
# all absence cases without any false positives, indicating high specificity.
# Understanding Model Performance: The process revolves around evaluating how well the MaxEnt model distinguishes between areas where species are likely to be present versus areas where they are not.
# Sensitivity measures the model's ability to identify true presences, while specificity measures its ability to recognize true absences.
# Optimal Threshold Identification: The threshold scan is meant to pinpoint the value offering the best trade-off between sensitivity and specificity.
# Here, however, the reported optimum produced a sensitivity of 0 and a specificity of 1: the model, as thresholded, detects no presences at all, even though it never raises a false positive.
# Practical Implications and Adjustments: A sensitivity of 0 means the model is under-predicting presence, which risks leaving genuinely at-risk areas unmonitored.
# Such a degenerate optimum can also signal a problem in the threshold search itself (see the colSums() correction in Step 15.3); the scan should be re-run after that fix, and the threshold adjusted until a more balanced outcome is achieved.
# Logical Interpretations:
# Model Calibration Needs: The results prompt a reconsideration of the model or its application thresholds to ensure that predictions are not only specific but also sensitive.
# This involves potentially retraining the model or re-deriving the threshold so that true presences are detected reliably while the false-positive rate remains acceptable.
Overview:
This figure illustrates the trade-off between sensitivity and
specificity across a range of threshold values for the MaxEnt model used
in predicting species presence. The graph guides the selection of an
optimal threshold by visually representing the balance between
identifying true positives and true negatives.
Analysis Highlights: - Sensitivity Analysis: At a threshold of 0.03, the sensitivity is 0, indicating the model’s failure to correctly identify any true positives, suggesting an overly conservative approach that might result in most conditions being predicted as absent. - Specificity Analysis: Conversely, the specificity at this threshold is 1, demonstrating that the model perfectly identifies all true negatives without any false positives, highlighting its accuracy in predicting non-occurrences.
Implications for Model Adjustment: - Model Calibration: The depicted sensitivity-specificity trade-off is critical for calibrating the model, as it assists in selecting a threshold that optimally balances the detection of true positives and negatives. - Decision Support: This visualization is instrumental for researchers and practitioners in refining the predictive model, ensuring that it provides reliable and actionable insights for ecological management and conservation planning.
Model’s Discriminatory Power: - Purpose: The evaluation revolves around assessing how well the MaxEnt model distinguishes between areas where species are likely to be present versus areas where they are not. - Key Metrics: - Sensitivity: Measures the model’s ability to correctly identify true presences of species. - Specificity: Measures the model’s ability to correctly identify true absences.
Threshold Analysis: - Optimal Trade-off: The analysis aimed to identify a threshold that offers the best balance between sensitivity and specificity. - Observations: At the reported optimum, sensitivity was 0 and specificity was 1, indicating that the model, as thresholded, detects no presences at all even though it never falsely flags an absence.
Adjustment Needs: - Under-Prediction Issue: Zero sensitivity at the selected threshold means the model is under-predicting the presence of species, which could leave areas genuinely at risk of invasion without monitoring or intervention. - Balancing Act: Adjustments in the model or its threshold settings might be necessary to achieve a more balanced outcome, recovering detection of true presences while keeping false positives low.
Calibration Requirements: - Reevaluation: The results prompt a reconsideration of the model or its application thresholds to ensure that predictions are not only specific but also sensitive. - Model Retraining: This may involve retraining the model or re-deriving the threshold so that true presences are detected reliably without an unacceptable rise in false positives.
Strategic Impact: - Conservation Strategy Optimization: Accurate model calibration is crucial for optimizing conservation strategies, ensuring that efforts are targeted and effective, thereby enhancing ecological management and planning.
This section presents the findings from the spatial analysis and species distribution modeling (SDM) I conducted for invasive species across Vermont. I discuss each result in relation to the research objectives set at the beginning of this project.
I successfully mapped the distribution of various invasive species across Vermont, focusing on key species such as the Emerald Ash Borer and Hemlock Woolly Adelgid. Each species was marked with distinct colors, aiding in rapid visual assessment (Figure 2).
The visualization in Figure 2 effectively met my objective of identifying the spread and density of invasive species, allowing for a clear understanding of high-risk areas. This aids in ecological management by highlighting critical areas needing immediate attention, thus supporting my decision-making processes regarding conservation strategies.
My focused spatial analysis within Niquette Bay State Park indicated an absence of recorded invasive species, consistent with effective conservation efforts (Figure 5).
Figure 5’s targeted analysis provided essential insights into the effectiveness of the conservation strategies employed within Niquette Bay State Park. This meets the objective of offering a localized ecological assessment crucial for ongoing park management and preventive strategies against potential future invasions.
The expansion to a statewide analysis highlighted the presence of invasive species across Vermont, with detailed mappings of their specific locations and concentrations (Figure 8).
The comprehensive statewide visualization provided in Figure 8 aligns with my research objective to assess the broader impact of invasive species across Vermont. It underscores the need for strategic planning and targeted interventions, offering a macroscopic view essential for policymaking and resource allocation.
Spatial Clustering with DBSCAN and K-Means (Figures 12 and 13): These techniques helped me identify high-density clusters of invasive species occurrences. Figure 12 presents the results of DBSCAN clustering, showing concentrated areas of invasive species, while Figure 13 uses K-Means to further delineate these clusters into distinct regions based on species concentrations.
Elbow Method (Figure 14): I employed this method to determine the optimal number of clusters for the K-Means algorithm. Figure 14 plots within-cluster variance against the number of clusters; the "elbow," where adding clusters yields diminishing reductions in variance, marks the appropriate choice.
This figure illustrates the processed climatic data cropped to the geographical extents of Vermont, showing key environmental variables that affect species distribution.
The successful integration of these datasets allowed me to develop a comprehensive environmental profile for Vermont. This profile is crucial for understanding how different environmental conditions influence the distribution of invasive species. The results confirm that my methodology is effective in merging diverse data types to prepare for more detailed SDM, thereby meeting the first objective.
This figure demonstrates the logistic regression model’s ability to predict species presence based on environmental factors, highlighting the influence of various climatic variables on species distribution.
The development and rigorous validation of SDMs signify that the models are robust and reliable for forecasting species distributions under current and future environmental scenarios. This aligns with my second objective, ensuring that the models are both scientifically sound and practically relevant.
The levelplot provides a clear, color-coded representation of the likelihood of species occurrence, which helps in visualizing the geographical spread and concentration areas of invasive species.
The visual outputs have been instrumental in interpreting the complex results of the SDMs, providing accessible and actionable insights into potential species spread. These visualizations directly support conservation efforts and policy-making by highlighting critical areas for intervention. The success in creating these interpretable maps and charts confirms that I have achieved my third objective.
This study has demonstrated how the integration of spatial analysis and species distribution modeling can address the challenge of managing invasive species across Vermont. My mapping and clustering analyses met the research objectives and provided a detailed picture of the spatial distribution of invasive species within the state, with the referenced figures substantiating the textual findings. These insights support the development of informed conservation strategies by highlighting critical areas for targeted intervention, and the maps and predictive models produced here can serve as practical tools for environmental managers and policymakers deciding how to mitigate the impact of invasive species. By achieving these research objectives, this work lays a foundation for ongoing research and practical conservation initiatives and offers a template for similar ecological studies elsewhere, supporting the long-term health and resilience of Vermont's ecosystems.
Reflecting on the “Spatial Analysis of Invasive Species in Vermont” project, I recognize a mix of strengths and weaknesses that shaped the outcomes of this endeavor.
Looking back, if I were to tackle this project again, I would take a slightly different approach:
The “Spatial Analysis of Invasive Species in Vermont” project was both challenging and enlightening. It provided valuable insights into ecological modeling and the complexity of managing diverse data sets. By applying the lessons learned to future projects, I aim to refine my approach, ensuring more structured and efficient analysis and making even more meaningful ecological contributions.