Data Exploration of eDNA data for model-building

This dataset contains presence/absence data for 186 freshwater invertebrate taxa, from eDNA samples collected in watersheds across the Atlantic Region of Canada. Sites are split into reference sites and test sites: reference sites are areas with lower human impact, while test sites are more human-influenced, with a combination of land uses that could drive ecosystem change, such as nearby agriculture, forestry, or roads. The goal for this dataset is to build a predictive model that estimates watershed health for new sites within the region.

The data were explored here to identify patterns in taxa richness and the percent of EPT taxa, and to examine how environmental stressors relate to species communities at test and reference sites.

Five Excel files were combined for this analysis. Sites were aligned by sample number, and large river sites were removed.

Part 1: Exploring Taxa Richness

Basic summary statistics to explore taxa richness

At first glance, the two groups look very similar, with similar means, medians, and standard errors. Since the mean and median are so close within each group, the distributions are roughly symmetric, which is consistent with approximately normally distributed data.

Summary statistic    Reference sites    Test sites
kurtosis             0.1                -0.4
max                  65                 68
mean                 34.7               36
median               33                 36.5
min                  2                  13
n                    222                208
range                63                 55
sd                   11.7               10.7
se                   0.8                0.7
skew                 0.5                -0.1
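As a rough illustration of how the statistics in this table can be computed, here is a minimal sketch using only the Python standard library. The function name `summarize` and the example counts are hypothetical and are not taken from the study. Skew and kurtosis here are plain moment-based estimates, so other software that applies small-sample corrections may report slightly different values.

```python
import math
import statistics


def summarize(richness):
    """Return summary statistics for a list of per-site taxa richness counts."""
    n = len(richness)
    mean = statistics.fmean(richness)
    sd = statistics.stdev(richness)  # sample standard deviation (n - 1)
    # Central moments, used for moment-based skewness and excess kurtosis.
    m2 = sum((x - mean) ** 2 for x in richness) / n
    m3 = sum((x - mean) ** 3 for x in richness) / n
    m4 = sum((x - mean) ** 4 for x in richness) / n
    return {
        "n": n,
        "min": min(richness),
        "max": max(richness),
        "range": max(richness) - min(richness),
        "mean": mean,
        "median": statistics.median(richness),
        "sd": sd,
        "se": sd / math.sqrt(n),      # standard error of the mean
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,  # excess kurtosis (0 for a normal distribution)
    }


# Hypothetical richness counts, not values from the study.
example = [2, 20, 28, 33, 35, 41, 47, 65]
stats = summarize(example)
```

Running `summarize` once on the reference-site counts and once on the test-site counts would give the two columns of a table like the one above.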

Count of Taxa Richness for Reference & Test Sites

Visualizing the distribution of taxa richness supports this: the data appear approximately normally distributed, without large differences between test and reference sites.

However, there could be more differences beneath the surface. There are 186 different taxa in this study, and over 200 sites in each category (test and reference). Do the same species appear at all test sites? What about all reference sites? And what about groups of species and how they cluster together (termed “species communities” in ecology)? Are some groups of species more sensitive to human influence? These questions are explored in the rest of the analysis.

Part 2: Exploring the species that make up taxa richness

The most and least common species (taken from the 95th and 5th percentiles) are graphed below for reference and test sites.
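One way this selection could be done is sketched below. It assumes the percentiles are taken over the number of sites at which each taxon was detected, and that the data are held as a mapping from site ID to the set of taxa detected there. Both are assumptions for illustration, as are the function name and the placeholder taxon names.

```python
import statistics
from collections import Counter


def common_and_rare_taxa(site_taxa):
    """Split taxa into the most and least common by occurrence count.

    site_taxa maps a site ID to the set of taxa detected at that site.
    """
    # Number of sites at which each taxon was detected.
    counts = Counter(taxon for taxa in site_taxa.values() for taxon in taxa)
    # 19 cut points at 5%, 10%, ..., 95%; keep the first and the last.
    cuts = statistics.quantiles(counts.values(), n=20, method="inclusive")
    low, high = cuts[0], cuts[-1]
    most = sorted(t for t, c in counts.items() if c >= high)
    least = sorted(t for t, c in counts.items() if c <= low)
    return most, least


# Hypothetical presence data, not values from the study.
sites = {
    "S1": {"taxon_a", "taxon_b", "taxon_c", "taxon_d"},
    "S2": {"taxon_a", "taxon_b", "taxon_c"},
    "S3": {"taxon_a", "taxon_b"},
    "S4": {"taxon_a"},
}
most_common, least_common = common_and_rare_taxa(sites)
```

In practice this would be run separately on the reference sites and the test sites, so that the two sets of graphs can be compared.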

So far, nothing jumps out: no clear pattern separates the most and least common species at test and reference sites. Some species appear in the graphs for both site types, but it’s not clear whether this is a meaningful pattern.

The percent of “EPT” taxa for test and reference sites was assessed, to begin to see if there are any obvious differences in species communities. EPT taxa are those that belong to the orders Ephemeroptera, Plecoptera, and Trichoptera. They were chosen because these are often useful bioindicator taxa, meaning that they can be more sensitive to human influence. These were further broken down by looking specifically at the percent of taxa that belong to the Plecoptera order, as this order contains even more sensitive species.
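The sketch below shows one way the percent EPT and percent Plecoptera metrics could be computed for a single site. It assumes the percentage is taken over the taxa detected at that site (the data are presence/absence), and that a lookup from taxon to order is available. The function name, the lookup table, and the example site are hypothetical.

```python
EPT_ORDERS = {"Ephemeroptera", "Plecoptera", "Trichoptera"}


def percent_in_orders(site_taxa, taxon_order, orders):
    """Percent of a site's detected taxa that belong to the given orders."""
    if not site_taxa:
        return 0.0
    hits = sum(1 for taxon in site_taxa if taxon_order.get(taxon) in orders)
    return 100.0 * hits / len(site_taxa)


# Hypothetical lookup table and site, for illustration only.
taxon_order = {
    "Baetis": "Ephemeroptera",
    "Leuctra": "Plecoptera",
    "Hydropsyche": "Trichoptera",
    "Chironomus": "Diptera",
}
site = {"Baetis", "Leuctra", "Hydropsyche", "Chironomus"}

pct_ept = percent_in_orders(site, taxon_order, EPT_ORDERS)
pct_plecoptera = percent_in_orders(site, taxon_order, {"Plecoptera"})
```

Using one function for both metrics keeps the definitions consistent: only the set of orders changes.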

There does appear to be a pattern here: although percent EPT does not vary between test and reference sites, percent Plecoptera does. A Welch’s two-sample t-test on percent Plecoptera was significant at alpha = .05, so it is probable that the mean percent Plecoptera differs between reference and test sites.
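For readers unfamiliar with Welch’s t-test, here is a minimal standard-library sketch of the test statistic and its Welch-Satterthwaite degrees of freedom. Unlike the classic two-sample t-test, it does not assume equal variances in the two groups. The function name and the example values are hypothetical and are not the study data.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group mean.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical percent Plecoptera values for two small groups of sites.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(reference, test)
```

The p-value then comes from a t distribution with `dof` degrees of freedom. In practice a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.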

Part 3: Exploring taxa richness and environmental stressors

It’s important to look at how the environment might be affecting these species, especially if we want to build a model that predicts whether a site is a test or reference site.

Watershed stressor scores were obtained using ArcGIS. These scores represent land uses or stressors within the watershed that could drive ecological change, such as road density, agriculture, forestry, or other industrial uses.

We compared watershed stressor scores with taxa richness, percent EPT, and percent Plecoptera.

There appear to be some interesting patterns here. These will be explored further using ANCOVAs to see whether percent EPT varies between test and reference sites.

Watershed Stressor Score and Total Taxa Abundance

Next, the watershed stressor scores were divided into six ranges, and the total taxa abundance was summed within each range. The following tile plots show the total taxa abundance for each species, across all test and reference sites. These plots are another way to see whether the species, or groups of species, that appear at reference sites differ from those at test sites.
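The binning step can be sketched as follows. This assumes six equal-width ranges between the lowest and highest stressor score, and counts detections per taxon in each range as a stand-in for abundance (the data are presence/absence). The actual range boundaries used for the plots are not stated here, so treat the function names and the binning rule as illustrative assumptions.

```python
from collections import defaultdict


def bin_index(score, lo, hi, n_bins=6):
    """Index (0 to n_bins - 1) of the equal-width bin containing score."""
    if hi == lo:
        return 0
    idx = int((score - lo) * n_bins / (hi - lo))
    return min(idx, n_bins - 1)  # the maximum score falls in the last bin


def detections_by_bin(sites, n_bins=6):
    """Count detections of each taxon within each stressor-score bin.

    sites is a list of (stressor_score, set_of_taxa) pairs.
    Returns {taxon: [count in bin 0, ..., count in bin n_bins - 1]}.
    """
    scores = [score for score, _ in sites]
    lo, hi = min(scores), max(scores)
    totals = defaultdict(lambda: [0] * n_bins)
    for score, taxa in sites:
        b = bin_index(score, lo, hi, n_bins)
        for taxon in taxa:
            totals[taxon][b] += 1
    return dict(totals)


# Hypothetical sites, not values from the study.
example_sites = [
    (0.0, {"taxon_a", "taxon_b"}),
    (1.5, {"taxon_a"}),
    (3.5, {"taxon_b"}),
    (6.0, {"taxon_a", "taxon_b"}),
]
tile_counts = detections_by_bin(example_sites)
```

Each list in `tile_counts` corresponds to one row of a tile plot, with one cell per stressor-score range.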

Part 4: Distribution of Environmental Variables

It’s also important to understand whether environmental variables differ between test and reference sites, to make sure we are modeling differences in species communities rather than differences in environmental variables.

The distribution of each environmental variable was graphed in reference and test sites to see if there are differences in environmental variables. Environmental variables were also graphed by latitude and longitude to assess spatial variation.

No major differences in environmental variables are apparent between test and reference sites. There is some natural variability, but this makes sense in a complex environmental system. As the main boxes of the boxplots (the .25-.75 quantiles) overlap with one another, it’s not likely that there are large differences here that might affect the modeling process.
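The visual check on the boxplots can also be done numerically. The sketch below computes the .25 and .75 quantiles for each group and tests whether the two boxes overlap. The function names and example values are hypothetical, and the `inclusive` quantile method is an assumption, so the boxes may differ slightly from those drawn by a plotting library.

```python
import statistics


def iqr_bounds(values):
    """Return the (.25, .75) quantiles, i.e. the edges of the boxplot box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def boxes_overlap(a, b):
    """True if the interquartile ranges of the two groups overlap."""
    a1, a3 = iqr_bounds(a)
    b1, b3 = iqr_bounds(b)
    return a1 <= b3 and b1 <= a3


# Hypothetical values of one environmental variable, for illustration only.
reference_values = [1, 2, 3, 4, 5]
test_values = [3, 4, 5, 6, 7]
distant_values = [10, 11, 12, 13, 14]

similar = boxes_overlap(reference_values, test_values)
different = boxes_overlap(reference_values, distant_values)
```

Overlapping boxes are only a rough screen: they suggest the groups are broadly comparable, but they are not a formal test of difference.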

And finally, there do not appear to be any obvious groupings or relationships between lat/long and the different variables. The orange and blue dots seem to follow the same patterns, so there shouldn’t be large differences in these variables that might affect the modeling process.