I wanted to organize a living document of my workflow for data manipulation, analyses, and visualization on the Illinois Waterway Round goby project. I find these RMarkdown files help organize my own thoughts in narrative form and are convenient ways of commuincating modeling and visualization results.
We seem to have distilled our research questions into the following:
1. Are goby associated with hard substrate as compared to soft substrate?
2. Are goby associated with certain strata, for example main-channel as compared to backwater?
3. Do our data support that goby are present and/or established in all pools of the Illinois Waterway?
4. Do our data support the hypothesis that body-size decreases along an upstream-to-downstream invasion front?
Let’s start with some basic, descriptive looks to contextualize our substrate question.
As a reminder, the LTRM protocols code the substrate values accordingly:
1. Silt
very fine and very soft sediments that may contain highly hydrated [very soft] clay; sand lacking
2. Silt/Clay/Little Sand
fine and soft sediments dominated by silt but usually containing little fine sand, with perhaps dehydrated [firm] clay pellets or moderately hydrated clay with little fine sand
3. Sand/Mostly Sand
firm to very firm, fine to coarse sediments with sand dominant, or entirely sand
4. Gravel/Rock/Hard Clay
hard substrate consisting of dehydrated [firm] clay, gravel, rock, bedrock, or concrete
So there are far more hard-substrate sites with goby records than soft-susbrate sites with goby records, but that doesn’t account for the prevalence of hard- vs. soft- substrate sites in the survey design.
Let’s look at the percentage of hard- and soft-substrate sites with goby records. We’ll break it down by pool, while we’re at it. In this case, I’m going to say a substrate code of 4 is hard, and codes of 1-3 are soft. The numbers next to each point indicate the sample size of sites for this category across all years (2019-2021).
For the sake of the following analyses, I’m combining minifykes and e-fishing samples, the two gears with which we caught goby during the study. I’m sure you could make an argument that they should be separated because they have different probabilities of capture given occupancy, but I seem to think they’re similar enough in their strengths and weaknesses in detecting goby to combine them. I’m more comfortable working with presence/absence data if we’re combining gears, so let’s start there.
This gives us a better idea of how substrate is affecting catch. There is clearly a substantial association between substrate and goby capture. It seems like the impact might vary by pool, but there are so few sites with gobies overall in the downstream reaches that it’s hard to be certain if the effect disappears or there are just far fewer goby. Notice there is no catch in Alton.
Another thing that stands out to me is those sample size annotations. I was thinking that the CAWS reaches had a higher ratio of hard to soft substrate, but this indicates it’s somwhere between 1:1 and 1:2. It’s the middle reaches – MA and ST – that have a 3:2 ratio of hard to soft in the study design. Maybe I’ve been thinking about it wrong so far.
We can also look at mean CPUE instead of presence/absence. Again, we might be wary about combining gears here, so I’ll just show Minifykes. For what it’s worth, these CPUE figures look very similar if you instead plot total catch or MPUE, so I won’t bother plotting total catch or MPUE here, but know we have those options.
Minifyke catch accounts for 74% of the sites at which we catch round goby, and 89% of the total number of round goby individuals caught. The remainder are from e-fishing.
Wow, the gap between hard and soft has pretty much disappeared except in the upper river, where the trend has flipped.
In Lockport, soft substrate seems to support higher CPUE. However, after looking at the data, this soft-substrate mean is pulled way upward by one large value (103 total goby, 115 CPUE). Eliminating that value, I’m not sure this visualization tells us a whole lot about differences in CPUE.
These CPUE numbers are awful low in many places, mostly because they’re being influenced by a very high proportion of zero CPUE values (i.e. goby were absent at a large majority of sites in many pools). What if we looked at CPUE, but only at sites at which goby were captured? In other words, when & where goby were captured, does substrate influence how many were captured? This ignores the difference between 0 and 1 goby, which is a large part of the story, but I’m sort of interested in pulling out the captures and examining those by substrate.
It’s hard to say definitively if substrate effects CPUE at sites where goby are captured. Because of a reduced sample size, that one high value in Lockport makes an even greater difference between soft and hard, in favor of soft. I’m not sure that’s all that relevant, knowing it’s driven by one minifyke. Perhaps this isn’t as interesting a way to slice the data as I thought…
So, that’s by goby substrate type, but another question we had was how does stratum effect our detection/abundnace of goby? We can look at the same types of figures, but replacing substrate types with strata.
Again, we see goby captures predominantly where we’d expect to: in the main channel. But what about after controlling for the prevalence of each strata?
It seems like Main Channel is definitely a hotter area for goby capture than the backwater in the upper river, but that gap narrows as you move down the river. Side channel areas are not as prevalent in the upper river, so it’s hard to say how different it is than main channel, but it’s the same or a little less than main channel in terms of % of sites with goby capture in the middle and lower reaches.
How about CPUE at all sites?
Again, Main Channel (and to an extent Side Channel) seem to have higher CPUE than backwater in the upper and middle river, but the sample size of backwater locations is so low in the upper river it’s hard to truly compare. In the middle and lower river, the gap is pretty small.
Now let’s look at CPUE but only at sites where goby were captured.
Again, it’s hard to glean much from this figure. The sample sizes are very small, particularly in the lower river. But in the lower river where the occupancy gap was narrow, the CPUE gap at occupied sites seems to widen in favor of the Main Channel. Maybe that’s something? I’d be interested in your thoughts on the relevance of looking at CPUE only where goby were captured.
To formally test the effect of substrate (and strata, if we wish), we can build a generalized linear model that predicts the presence or absence (perhaps better defined as capture or non-capture) of round goby as a binomial logistic regression. I’d like to have a model that fits the data well, which means we should account for lots of other factors potentially at play, so we’ll include lots of other temporal, spatial, and environmental variables.
However, many of the environmental variables are not complete throughout the dataset. Let’s take a look at the variables I’m considering for the model, and what percentage of them are ‘NA’ in the dataset.
.| Variable | PctMissing |
|---|---|
| WATER_TEMP_NAs | 0.00 |
| WATER_VEL_NAs | 68.17 |
| WATER_DEPTH_NAs | 4.50 |
| SECCHI_DISK_NAs | 1.29 |
| DISSOLVED_OXYGEN_NAs | 1.93 |
| SPECIFIC_CONDUCTIVITY_NAs | 0.00 |
| Turbidity_NAs | 17.68 |
| river_km_NAs | 0.00 |
| DENSITY_EMERGENT_SUBMERSED_VEG_NAs | 51.13 |
| EMERGENT_SUBMERSED_AQUATIC_VEG_PRECENT_NAs | 3.86 |
| snag_NAs | 0.00 |
| wingdyke_NAs | 0.00 |
| trib_NAs | 0.00 |
| riprap_NAs | 0.00 |
| inout_NAs | 0.00 |
| closing_NAs | 0.00 |
| flooded_NAs | 0.00 |
| othrstrc_NAs | 0.00 |
.
.
.
Most of these are nearly complete, with the exception of WATER_VEL. 68% is far too much to reliably impute, so we’ll omit WATER_VEL from any model.
DENSITY_EMERGENT_SUBMERSED_VEG is also highly incomplete (51%), but this is misleading. This column should always be 0 when EMERGENT_SUBMERSED_AQUATIC_VEG_PRECENT is 0, but often it’s NA. After making that change, the number of NAs in the DENSITY column is closer to 9%. This isn’t ideal, but it’s not enough to throw this out. We’ll impute.
Turbidity is missing quite a bit, too (18%). These NAs tend to be clumped in time and space, which could prove problematic. We’ll impute but if it turns out to be important in the model, we’ll have to remember to discuss this.
The remainder of the columns have a modest percentage of NAs. For the remaining NAs, we will impute them using the median value of all non-NA observations for that value from samples in the same year, sample time period, pool, and major habitat class. On the rare occasion that there are no other observations from the same grouping from which to pull a median, I widen the search to just the same year and pool.
So it’s time to build the model(s). We’ll start with a full model including year, pool, major stratum, sample time period, gear type, substrate type, and all of the environmental variables above except the two that we eliminated for having too high a proportion of NAs (water velocity). I decided to just use one of the aquatic veg variables, the percentage one. I’m also adding an indicator of discharge, which is a 7-day lagged mean of discharge from the nearest USGS guage. We’ll call this the ‘full’ model, then remove some non-significant variables and examine some ‘reduced’ models.
| Model | AIC | BIC | AUC |
|---|---|---|---|
| Full | 1478.885 | 1692.824 | 88.60 |
| Reduced, Substrate Factor | 1471.692 | 1593.942 | 92.77 |
| Reduced, Substrate Continuous | 1468.085 | 1578.111 | 92.75 |
| Reduced, Substrate Binary | 1469.474 | 1579.500 | 91.86 |
| Reduced, No Substrate | 1482.971 | 1586.884 | 91.62 |
Eliminating several non-significant variables from the full model vastly improves the BIC and AIC metrics. These improved ‘reduced’ models still retain year, pool/reach, major stratum, sample time period, gear type, water temperature, and water depth, in addition to the substrate type. Here, we compare three ‘reduced’ models with different encodings for the substrate type variable, our primary variable of interest.
Factor: 1, 2, 3, and 4 as unrelated factors
Continuous: 1-4 as a continuous numeric value
Binary: 1-3 as ‘Soft’ substrate, encoded as a ‘0’. 4 as ‘Hard’ substrate, encoded as a ‘1’
Finally, we included one ‘reduced’ model that removes substrate from the model entirely, which performs very poorly in all metrics.
Of the three ‘reduced’ models with substrate encoding variants, the Continuous and Binary models perform best, and are almost indistinguishable from eachother (within 2 AIC and BIC units). This is intuitive to me, because they way they’re encoded imply a gradient from soft to hard as you move from 1 to 4, just in different ways.
In addition to AIC and BIC, we can train each model on a subset (87.5%) of the data and test its ability to predict goby capture/non-capture at each site of the remainder (12.5%) of the data. We can plot Receiver Operator Curves for each model and compare the shape and area under each curve (AUC) as a way of testing specificity and sensitivity of the model for predicting goby capture/non-capture. All ‘reduced’ models perform similarly in AUC. Below, we’ll visualize the ROCs for the Binary Substrate model and the Continuous Substrate model. The shape and area underneath are nearly identical.
Because one of our main research questions is whether or not rocky areas are highly associated with goby capture, I tend to prefer the Binary Substrate model over the rest. It performs as well or better than every other model, and it’s a simpler, more direct test of our research question. The Factor Substrate model, though not as well-fit as the Binary Substrate model, can provide some insight here. In the Factor model (summarised in the table below), model coefficient values for SUBSTRATE_CODE code increase when comparing 1 to 2, then 1 to 3, and 1 to 4, suggesting there is some impact on goby capture/non-capture as we move along the scale numerically. This is also supported by the Continuous Substrate model having a positive, significant coefficient for continuous substrate from 1-4. However, in the Factor model, the model coefficients comparing 1 to 2 and 1 to 3 are not significant. Only comparing 1 to 4 is significant. Using pairwise comparisons of their estimated marginal means, we can see the only significant comparison is 1 to 4 (4 - 1 p.value = 0.0004).
## contrast estimate SE df z.ratio p.value
## 2 - 1 0.214 0.242 Inf 0.886 0.8119
## 3 - 1 0.333 0.264 Inf 1.260 0.5888
## 3 - 2 0.118 0.284 Inf 0.416 0.9757
## 4 - 1 0.717 0.179 Inf 4.004 0.0004
## 4 - 2 0.503 0.232 Inf 2.172 0.1311
## 4 - 3 0.385 0.241 Inf 1.597 0.3804
##
## Results are averaged over the levels of: YEAR, POOL_REACH, HABITAT_CLASS_major_corrected, SAMPLE_TIME_PERIOD, GEAR_CODE
## Results are given on the log odds ratio (not the response) scale.
## P value adjustment: tukey method for comparing a family of 4 estimates
The z statistic in the above Factor Substrate model contrast of 1-4 (z = 4.0) is similar to that of the contrast between soft and hard in the Binary Substrate model contrast below (z = 3.9).
## contrast estimate SE df z.ratio p.value
## 1 - 0 0.595 0.151 Inf 3.944 0.0001
##
## Results are averaged over the levels of: YEAR, POOL_REACH, HABITAT_CLASS_major_corrected, SAMPLE_TIME_PERIOD, GEAR_CODE
## Results are given on the log odds ratio (not the response) scale.
In fact, if we look at the Factor Substrate model summary (below), the model coefficient comparing 1 to 4 (.717) is over twice as large as the coefficient comparing 1 to 3 (.214). This suggests that, though moving from 1 to 2 to 3 may improve our chances of goby capture, the most important difference between any two SUBSTRATE_CODE values is moving from ‘Soft’ (1-3) to ‘Hard’ (4), as in the Binary Substrate model. See summary for the Factor Substrate model below:
##
## Call:
## glm(formula = RDGY_PRESENCE ~ YEAR + POOL_REACH + HABITAT_CLASS_major_corrected +
## SAMPLE_TIME_PERIOD + GEAR_CODE + as.factor(SUBSTRATE_CODE) +
## WATER_TEMP + WATER_DEPTH, family = binomial(link = "logit"),
## data = SITE_RDGYall2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.86974 -0.38972 -0.20289 -0.06551 3.15666
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.67293 0.69200 -0.972 0.33083
## YEAR2020 0.38203 0.18575 2.057 0.03971 *
## YEAR2021 0.56120 0.18720 2.998 0.00272 **
## POOL_REACHBN 1.52904 0.28783 5.312 0.0000001082 ***
## POOL_REACHDR 0.17635 0.27213 0.648 0.51697
## POOL_REACHMA -0.40475 0.29127 -1.390 0.16464
## POOL_REACHST -1.32447 0.33027 -4.010 0.0000606545 ***
## POOL_REACHPE -1.01606 0.31490 -3.227 0.00125 **
## POOL_REACHLG -3.10952 0.58152 -5.347 0.0000000893 ***
## POOL_REACHAL -16.82576 320.90988 -0.052 0.95818
## HABITAT_CLASS_major_correctedMCB 1.06810 0.19845 5.382 0.0000000736 ***
## HABITAT_CLASS_major_correctedSCB 0.88521 0.22263 3.976 0.0000700175 ***
## SAMPLE_TIME_PERIOD2 -0.13612 0.16357 -0.832 0.40531
## SAMPLE_TIME_PERIOD3 -1.07391 0.26194 -4.100 0.0000413520 ***
## GEAR_CODED -1.60862 0.16374 -9.824 < 0.0000000000000002 ***
## as.factor(SUBSTRATE_CODE)2 0.21440 0.24185 0.886 0.37535
## as.factor(SUBSTRATE_CODE)3 0.33255 0.26402 1.260 0.20782
## as.factor(SUBSTRATE_CODE)4 0.71745 0.17920 4.004 0.0000623881 ***
## WATER_TEMP -0.04811 0.02490 -1.932 0.05335 .
## WATER_DEPTH -0.16349 0.05912 -2.765 0.00569 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2063.3 on 3335 degrees of freedom
## Residual deviance: 1431.7 on 3316 degrees of freedom
## AIC: 1471.7
##
## Number of Fisher Scoring iterations: 17
So, we’ve selected a model in the Binary Substrate model, which also includes year, pool/reach, major stratum, sample time period, gear type, substrate type, water temp, and water depth. We also tried but eventually eliminated other environmental variables including Secchi depth, DO, specific conductivity, turbidity, aquatic vegetation, a 7-day average of nearby discharge, river kilometer, and presence/absence of woody debris, wingdams/dykes, tributary mouths, revetment, inlet/outlet channels, low-head dam closing structures, flooded terrestrial areas, and other structures.
Let’s take a look at the summary of this best model and its results.
##
## Call:
## glm(formula = RDGY_PRESENCE ~ YEAR + POOL_REACH + HABITAT_CLASS_major_corrected +
## SAMPLE_TIME_PERIOD + GEAR_CODE + as.factor(SUBSTRATE_CODE_BINARY) +
## WATER_TEMP + WATER_DEPTH, family = binomial(link = "logit"),
## data = SITE_RDGYall3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.84995 -0.38603 -0.20669 -0.06491 3.12502
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.62261 0.68849 -0.904 0.365833
## YEAR2020 0.37029 0.18412 2.011 0.044303 *
## YEAR2021 0.53827 0.18404 2.925 0.003448 **
## POOL_REACHBN 1.50938 0.28584 5.280 0.0000001288 ***
## POOL_REACHDR 0.19935 0.27156 0.734 0.462892
## POOL_REACHMA -0.31642 0.28280 -1.119 0.263194
## POOL_REACHST -1.21838 0.31882 -3.822 0.000133 ***
## POOL_REACHPE -0.96236 0.31144 -3.090 0.002001 **
## POOL_REACHLG -2.96656 0.56135 -5.285 0.0000001259 ***
## POOL_REACHAL -16.78238 318.87190 -0.053 0.958026
## HABITAT_CLASS_major_correctedMCB 1.09073 0.19817 5.504 0.0000000371 ***
## HABITAT_CLASS_major_correctedSCB 0.88523 0.22204 3.987 0.0000669575 ***
## SAMPLE_TIME_PERIOD2 -0.14625 0.16259 -0.899 0.368392
## SAMPLE_TIME_PERIOD3 -1.06900 0.26101 -4.096 0.0000420943 ***
## GEAR_CODED -1.64416 0.16132 -10.192 < 0.0000000000000002 ***
## as.factor(SUBSTRATE_CODE_BINARY)1 0.59497 0.15086 3.944 0.0000802156 ***
## WATER_TEMP -0.04635 0.02484 -1.866 0.062039 .
## WATER_DEPTH -0.16199 0.05886 -2.752 0.005918 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2063.3 on 3335 degrees of freedom
## Residual deviance: 1433.5 on 3318 degrees of freedom
## AIC: 1469.5
##
## Number of Fisher Scoring iterations: 17
We have a lot of factor variables in this model, so it’s important to remember that the model coefficients for these variables are as compared to the ‘reference level.’ The reference levels for this model are as follows:
YEAR: 2019
POOL_REACH: Lockport (LP)
HABITAT_CLASS_major_corrected: Backwater (BWC)
SAMPLE_TIME_PERIOD: 1
GEAR_CODE: Minifyke (M)
SUBSTRATE_CODE_BINARY: Soft (1-3, encoded as ‘0’)
We can see that for our two main variables of interest, the model coefficients are significant (p < 0.05).
For major stratum (HABITAT_CLASS_major_corrected), the MCB and SCB levels both have positive coefficient estimates, with very low p-values, meaning we are significantly more likely to catch goby in the main and side channels as compared to the backwater. Exactly how much more likely depends on what level the rest of our factor and continuous variables are at in the model, but we can calculate an Odds Ratio that gives us a % change in our odds of capturing goby when all other variables are held fixed and we move from Backwater to Main Channel or Backwater to Side Channel. For strata, those odds ratios look like this:
## predictor oddsratio ci_low (2.5) ci_high (97.5) increment
## 1 YEAR2020 1.448 1.011 2.083 Indicator variable
## 2 YEAR2021 1.713 1.198 2.466 Indicator variable
## 3 POOL_REACHBN 4.524 2.606 8.007 Indicator variable
## 4 POOL_REACHDR 1.221 0.721 2.095 Indicator variable
## 5 POOL_REACHMA 0.729 0.420 1.276 Indicator variable
## 6 POOL_REACHST 0.296 0.157 0.551 Indicator variable
## 7 POOL_REACHPE 0.382 0.207 0.704 Indicator variable
## 8 POOL_REACHLG 0.051 0.015 0.140 Indicator variable
## 9 POOL_REACHAL 0.000 0.000 0.000 Indicator variable
## 10 HABITAT_CLASS_major_correctedMCB 2.976 2.032 4.425 Indicator variable
## 11 HABITAT_CLASS_major_correctedSCB 2.424 1.572 3.760 Indicator variable
## 12 SAMPLE_TIME_PERIOD2 0.864 0.628 1.188 Indicator variable
## 13 SAMPLE_TIME_PERIOD3 0.343 0.205 0.570 Indicator variable
## 14 GEAR_CODED 0.193 0.140 0.264 Indicator variable
## 15 as.factor(SUBSTRATE_CODE_BINARY)1 1.813 1.349 2.438 Indicator variable
## 16 WATER_TEMP 0.955 0.909 1.003 1
## 17 WATER_DEPTH 0.850 0.754 0.950 1
The odds ratio tells us by what factor we increase our odds of finding goby at a site when altering the predictor variable by one unit, or from its reference level to another level. If an odds ratio is > 1, that variable increases our odds of capturing goby as its value increases or changes from the reference level to the level of interest. If an odds ratio is < 1, it decreases our odds.
Here we can see that moving from Backwater (our reference level for HABITAT_CLASS_major_corrected) to Main Channel increases our odds of capturing goby by a factor of 2.976, in other words +197.6%. From Backwater to Sidechannel, our odds ratio is 2.424, implying an increase in our odds of capturing goby by +142.4%.
That seems like a lot, and indeed it’s significant in the model.
For substrate type, we just have one row because we’re comparing ‘Soft’ (1-3) to ‘Hard’ (4) substrate. Remember, we’ve encoded Soft as 0 and Hard as 1, with 0 as our reference level. As you can see below, our odds ratio comparing 0 to 1 is 1.813, implying an +81.3% increase in our odds of capturing goby on hard substrate as opposed to soft. Another convincing result in favor of our hypothesis that goby are highly associated with hard substrate.
We can visualize all the odds ratios and their confidence intervals, too. Just a warning, the confidence intervals for Lagrange and Alton are sort of nonsensically large because there were so few goby caught there as compared to Lockport. I’ve removed them to rectify axes scaling issues, which I feel OK about because the means are not misleadingly low and they really aren’t factor levels of interest.
It’s important to remember these are all compared to our reference levels for each factor variable. The larger vertical white line that occurs at an x-value (Odds ratio) of 1 indicates where a variable would have 0% increase/decrease on the odds of capturing a goby. The red dots to the left indicate variables or factor levels that decrease our odds of catching goby, and the blue dots to the right indicate variables or factor levels that increase our odds of catching goby. I like visualizing the model results this way because it allows for a comparison of effect size between variables and among levels of a given variable. The further away from the zero line an odds ratio is, the more important that variable is in predicting goby capture/non-capture.
We can see that most of the Lower Illinois pools provide lower odds of catching goby as compared to Lockport, the reference pool. But the odds of catching goby are by far the highest in Brandon Road. In fact we’re almost 5 times as likely to catch a goby there than in Lockport. Because Dresden’s and Marseille’s odds ratio error bars overlap with the 1 line, we can say the odds of catching goby in Dresden or Marseilles aren’t significantly different from catching goby in Lockport. The odds ratio of Alton Pool is 0 as compared to Lockport, because no goby were ever caught in Alton. The odds ratio of LaGrange is 0.051, meaning we’re 95% less likely to catch goby here than in LaGrange, holding all other variables constant.
The variables with the largest significant effects are the pool and major stratum in which we’re sampling. That makes a lot of sense to me, as these are the highest-level spatial organizations of the data. Year is a strong predictor, too, with more goby caught in 2020 and 2021 than in 2019. 2020 and 2021 are about the same. Sample time period is also important, with most goby caught in period 1. Gear is also importnat, with goby far less likely (-80%) to be caught e-fishing than in minifykes. The effect of substrate on goby capture/non-capture (i.e. its odds ratio) is about as large as the effect of year, and slightly smaller than that of major stratum.
The odds ratios for water temperature and depth appear small, but it’s important to remember the context that this is the increase in odds for every 1-unit change in those variables. The odds of catching goby decrease 5% for each degree Celsius increase in temp and decrease 15% for each meter in depth.
We can visualize RDGY capture probabilities in a few ways. First, let’s look at our actual data, broken down in a way where it’s easy to see how the model contrasts are performed, with Substrate on the x-axis, RDGY capture probability on the y-axis, Major stratum as colors, and the pools as facets.
We can summarize and simplify the trends that the model has identified by plotting the predictions of RDGY capture probability given the data in our actual observations. The statistically significant points of our model are a bit more easily visualized like this.
A third way to visualize this is to duplicate the data in our observations, but to change a variable of interest in the duplicated half to its opposite value. For instance, if we want to focus on the effect of Substrate Type, we can generate all of our data twice, once with a Substrate value of 0 and once with a Substrate value of 1, regardless of its original value. This gives equal sample sizes of data to both ends of each facet’s x-axis (Substrate Type), making the plots look slightly different.
This slightly different way of using the model for prediction allows us to look at each observation’s predicted probability of RDGY capture if its substrate value was 0, and compare it to its predicted probability if we changed its substrate value to 1. If we plot those two predictions against eachother and throw in a 1:1 line, we can see the influence of hard substrate across the entire dataset.
Here we see that, for observations where RDGY capture would be near zero with soft substrate (the x-axis), for example in the Alton pool or a particularly unsuitable backwater, changing substrate to hard (the y-axis) does not impact the model’s predicted probability of capture very much. However, as we approach probabilities of .25 on the x-axis, changing substrate from soft to hard has a noticeable depature from the 1:1 line, indicating an increase in the probability that we’d capture goby at that site if it were hard. The points don’t reach the far end of either axis because we have so few points where round goby capture probability is > .75 in our data, according to the model. This is just another way to visualize that the model tells us there’s a significant impact of hard substrate on the probability of RDGY capture.
So, the take-home here is that we have strong evidence that both of our hypotheses are supported by the model, with major habitat associations for goby in the Main Channel (and to an extent, Side Channel) and hard substrates.
We can probably write a whole paper just on habitat associations and the implications of those associations for the impact of round goby in the IWW or in other systems with lots of hard or soft substrate. But it may also be interesting to explore distribution patterns along the upstream-to-downstream invasion front. In invasive species study and management, it’s often of interest to describe where a species is merely ‘present’ and where it is ‘established,’ albeit those definitions are sort of tricky. Our question here is can we use aspects of our data to draw inferences on where goby are present or established in the IWW?
The results of the generalized linear model suggest that there are significant differences in our probability of capturing goby depending on several factors, including which pool in which we’re sampling. Fully recognizing that our methods did not properly target round goby, perhaps we can use the likelihood of capture as a proxy for abundnace in describing the distribution of round goby along an invasion gradient from upstream to downstream. If we take another look at the model summary, you can see significant decreses in capture in Starved Rock, Peoria, and LaGrange as compared to Lockport. And we know we caught no goby in Alton.
##
## Call:
## glm(formula = RDGY_PRESENCE ~ YEAR + POOL_REACH + HABITAT_CLASS_major_corrected +
## SAMPLE_TIME_PERIOD + GEAR_CODE + as.factor(SUBSTRATE_CODE_BINARY) +
## WATER_TEMP + WATER_DEPTH, family = binomial(link = "logit"),
## data = SITE_RDGYall3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.84995 -0.38603 -0.20669 -0.06491 3.12502
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.62261 0.68849 -0.904 0.365833
## YEAR2020 0.37029 0.18412 2.011 0.044303 *
## YEAR2021 0.53827 0.18404 2.925 0.003448 **
## POOL_REACHBN 1.50938 0.28584 5.280 0.0000001288 ***
## POOL_REACHDR 0.19935 0.27156 0.734 0.462892
## POOL_REACHMA -0.31642 0.28280 -1.119 0.263194
## POOL_REACHST -1.21838 0.31882 -3.822 0.000133 ***
## POOL_REACHPE -0.96236 0.31144 -3.090 0.002001 **
## POOL_REACHLG -2.96656 0.56135 -5.285 0.0000001259 ***
## POOL_REACHAL -16.78238 318.87190 -0.053 0.958026
## HABITAT_CLASS_major_correctedMCB 1.09073 0.19817 5.504 0.0000000371 ***
## HABITAT_CLASS_major_correctedSCB 0.88523 0.22204 3.987 0.0000669575 ***
## SAMPLE_TIME_PERIOD2 -0.14625 0.16259 -0.899 0.368392
## SAMPLE_TIME_PERIOD3 -1.06900 0.26101 -4.096 0.0000420943 ***
## GEAR_CODED -1.64416 0.16132 -10.192 < 0.0000000000000002 ***
## as.factor(SUBSTRATE_CODE_BINARY)1 0.59497 0.15086 3.944 0.0000802156 ***
## WATER_TEMP -0.04635 0.02484 -1.866 0.062039 .
## WATER_DEPTH -0.16199 0.05886 -2.752 0.005918 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2063.3 on 3335 degrees of freedom
## Residual deviance: 1433.5 on 3318 degrees of freedom
## AIC: 1469.5
##
## Number of Fisher Scoring iterations: 17
That’s some evidence right there that there may be decreased abundance as we move downstream. This could be driven by a lot of factors, but we know there is a strong upstream to downstream gradient in available substrate types and that substrate type is highly predictive of round goby capture.
Another way to infer establishment is the presence of more than one life stage. We didn’t specifically set out to catch goby and we don’t have larval fish data, so this is tricky to answer. But we do have lengths for quite a few goby in our study, and perhaps we can use some statistics to infer which ones are ‘age-0’ and which ones are ‘age-1 +’ fish. Demonstrating we have more than one age cohort at a site could help solidify our argument that goby are ‘established’ there.
Let’s take a quick look at the distribution of body sizes using total length (mm).
When looking for differences in lengths, I find it’s helpful to look at log-lengths. Let’s change that x-axis to emphasize those small changes.
It’s pretty hard to see, but what we’re looking for is two humps in the distribution, theoretically one for age-0 fish and one for age-1+ fish. You can sort of see an initial peak around 20-25mm, and a second around 40. But you’d have to squint.
Luckily, we don’t have to rely on squinting. I had talked with Kristen Bouska a while back about using Gaussian Mixture Models for exactly this sort of thing in order to come up with young-of-year length cutoffs that better match our data compared to the published LTRM cutoff values. This is pretty easy to do in R using Hartigans’ dip test for unimodality / multimodality from the package ‘diptest’ and the normalmixEM function from the ‘mixtools’ package. We’ll use log10 of the lengths for this so it will be normal enough for the Gaussian Mixture analysis. Below you can see the output from the Hartigans’ dip test.
##
## Hartigans' dip test for unimodality / multimodality
##
## data: lengthgrp
## D = 0.018545, p-value = 0.01039
## alternative hypothesis: non-unimodal, i.e., at least bimodal
## number of iterations= 544
## summary of normalmixEM object:
## comp 1 comp 2
## lambda 0.638796 0.3612038
## mu 1.559570 1.7803444
## sigma 0.155091 0.0935697
## loglik at estimate: 403.9552
Here we see the p value is low enough to assume that our distribution of goby lengths is actually made up of two normal distributions.
The results of Gaussian Mixture analysis are also in an above table. ‘comp 1’ corresponds to smaller fish, and ‘comp 2’ corresponds to larger fish. Lambda describes the proportion of data belonging to each distribution. We can see there are nearly twice as many fish belonging to what we’ll describe as the ‘age-0’ class. Mu gives the (log) values of the mean of each distribution, and Sigma gives the (log) values of the standard deviation.
With a little math, we can tell that the ‘age-0’ class has a mean length of 10^1.56, or 36mm. The ‘age-1+’ class has a mean length of 10^1.78, or 60mm.
Even though this might not be the most precise language, I’m going to refer to these two size groups as YOY and Adult from here on out.
The model predicts a probability that any given fish belongs to each distribution. That’s what the two distribution curves plotted above visualize. There are a few different ways of assigning cutoffs or probability thresholds to categorize fish based on the model.
The default option in the ‘mixtools’ package is to place the threshold at the maximum distance between the means of the two distributions (ie their average). Because these are normal distributions, the means are the peaks. Using this method, the YOY cutoff would be 47mm.
A more liberal method (predicting more YOYs and less Adults) would be to place all fish with a probability of belonging to the YOY group greater than 50% in the YOY, and all other fish in the Adult group. This cutoff would be 52mm.
A more conservative method (predicting less YOYs and more Adults) would be to take the mean of the adult distribution (mu of comp 2 from above table) minus 2 standard deviations of the adult distribution (sigma of comp 2 from above table). That cutoff would be 39mm.
All three of these seem reasonable based on the literature I’ve reviewed. One way to view the predicted probabilities of belonging to the YOY group alongside these cutoffs is to plot the goby lengths by day-of-year. This isn’t as informative for goby as for some species because goby are likely reproducing throughout the summer, and so we can’t see clear entry and growth of the YOYs. For that reason, my attempt to create three Gaussian Mixture models for each time period was not significant. But let’s take a look nonetheless.
We’ll pot the conservative cutoff in blue, the medium in black, and the liberal in red.
## number of iterations= 566
## Error in `[.data.frame`(new_m_stats_row, , order(names(m_stats))): undefined columns selected
## [1] 46.77351
And here’s how the cutoffs look against the traditional histogram.
None of the thresholds stand out to me as particularly good or bad, so I’m inclined to trust the default cutoff value of 47mm (furthest point between both distribution means). Using that’s threshold, we can predict which fish are ‘YOY’ and which are ‘Adult’ and take a look at where each group is present spatially.
Just like the generalized linear modeling, I’m considering both e-fishing and minifykes here. Because there are so many sampling sites, I’m going to try and aggregate the data a bit for mapping. I’m grouping together sites every 10 kilometers and plotting them by their most upriver coordinates.
There are no aggregated sites that only have adults present. We either have only YOY, both YOY and Adults, or None.
Each pool has aggregated sites with Both, except Alton which contains no captures. LaGrange has long stretches with no goby captures, a couple with YOY-only, and a couple with Both.
If your definition for ‘established’ is having any stretches of the river with both YOY and Adults, there is evidence here that they are ‘established’ in every pool except Alton, including LaGrange. Another definition might include in which pools we have multiple year of detecting a given cohort. If we aggregate up to the pool-scale, we can see all pools have at least 2 out of 3 years in which we captured Adult goby, except Alton.
So far we’ve only focused on presence/absence (or more precisely, capture/non-capture). But we can also look at CPUE. For this, I think we should only look at one gear, and a large majority of goby were caught in minifykes.
There’s no great way of looking at this data, in my opinion. I don’t like aggregating it by river mile because it aggregates across a lot of potentially interesting variation, but not doing so leaves things cluttered. In this HTML format, you can zoom in and pan around to overcome the clutter. Obviously that won’t work for a paper. Let’s look at it non-aggregated, then aggregated by river km. Note: the color scales aren’t exactly linear. I chose custom breaks that I felt represented the data distributions well.
Scrolling around, the CPUE results are intuitive. We have high CPUE in Lockport and Brandon Road, with pockets of high CPUE throughout the Upper River. CPUE (as well as detections) drop off after Peoria. The small black dots indicate minifyke sites with no goby captures. You can toggle on and off the ‘present’ and ‘absent’ dots using the menu in the top right of each map.
I know we talked about spatially interpolating into more of a raster format. I was hoping to have that done this week but I got caught up in the different R packages for interpreting the model predictions. We can keep it on the to-do list if you want to include CPUE as an analysis in the paper. However, I’m not really sure how much stock I put in the CPUE data. It mirrors the detection data in upstream to downstream trends, but because minifykes are a flawed way of sampling goby, I think it’s more intuitive to look at capture vs. non-capture as opposed to CPUE. That’s up for debate though and we can disuss that. I haven’t modeled CPUE at all but the CPUE by substrate descriptive figures we plotted at the top of this document suggest it wouldn’t be as clear-cut a story as the capture/non-capture angle. The CPUE stuff is statistically a little messier to model because of all the zeros and effort differences. Not impossible, but I think the binomial logistic regression is more intuitive and ‘clean.’
Our final hypothesis has to do with body size as a continuous variable rather than YOY vs. Adult. There are a few papers predicting that, for an invasion playing out from upstream to downstream, fish body size tends to decrease toward the invasion front because the invasion front is generally advanced by smaller bodied individuals being transported downstream by flow. This is as opposed to invasions that work downstream to upstream, wherein large bodied individuals must fight against the flow to push the upstream invasion-front forward.
We’re dealing with an upstream to downstream invasion here. Of course there’s the wrinkle that goby may have sort of hopscotched past the Alton pool and found suitable habitat in the Mississippi River, which could in turn provide a downstream-to-upstream source of invasion. But I think we can generally agree that this invasion is playing out upstream to downstream.
To reduce noise and boost signal, we can aggregate our data into three major ecohydrological regions: the CAWS (Lockport, Brandon Road), the Upper Illinois (Dresden, Marseilles, Starved Rock), and the Lower Illinois (Peoria, LaGrange, and Alton). Plotting (modeled) total length by major Reach, we see the hypothesized pattern of larger fish upstream and smaller fish downstream. I’ll plot it with and without the log-scaled x-axis we were using earlier for the YOY cutoff analysis.
You can see I’ve put that YOY cutoff of 47mm on here. Now, one criticism we could anticipate is that if body size varies by reach, how do we know our YOY cutoff analysis isn’t just driven by that upstream-to-downstream body size gradient as opposed to purely ontogeny. In reality, both gradients are probably baked in. There’s probably proportionally more YOY in the lower reaches, but the overall body size downstream may also be smaller. It seems impossible to pull those two things apart without otolith data, but I don’t think it invalidates the results, especially because the YOY cutoff seems reasonable in light of the literature.
Either way, I think we can say with good evidence that the body size hypothesis is supported by the data. Body size appears to decrease along the upstream-to-downstream invasion gradient.
Statistically, we can test to see if these three Major Reach length distributions are different from eachother with pairwise Kolmogorov-Smirnov tests. Below, I’ve plotted cumulative distribution curves for the modeled total length (mm) of each goby in each major reach. The K-S test p value is printed along with each plot. For each pairwise comparison, the test is statistically significant, indicating the length distributions of the goby caught in those reaches are different from eachother. In each case, the more downstream reach accumulates fish at a faster rate at smaller lengths (the left side of the plot), indicating the downstream reach is more dominated by small fish than the more upstream reach. This further supports the visual inspection of the by-reach distributions in favor of our hypothesis that the further downstream you go in a downstream invasion, the smaller-bodied the fish are.
Overall, I think we find strong evidence for our hypotheses that goby capture is highly associated with hard substrate and main/side-channel habitats as compared to soft substrate and backwater habitat. Furthermore, we find evidence that the downstream invasion front tends to support smaller-bodied fish than the larger-bodied, upstream source population.
As far as describing the Illinois Waterway round goby population in invasion biology terms of ‘present’ and ‘established,’ evidence (multiple age cohorts captured in multiple years) suggests that goby are established in all pools, with the exception of Alton. I don’t think we can say they are not established in Alton, just that we don’t have enough evidence to say that they are established in Alton.
Another question that interests me: Are there so few captures / low CPUE in the lower river because of a natural gradient along a shifting invasion front from upstream to downstream? In other words, this lack of fish downriver is temporary, and will be overcome with time? Or, is this invasion gradient confounded by a strong gradient in an important habitat feature, namely hard substrate, because the prevalence of hard substrate also decreases from upstream to downstream? At the top of the ‘Presence vs. Establishment’ section, the ‘Substrate availability by pool’ illustrates this gradient (or rather step-change) in hard substrate as a proportion of all sites. But what stands out to me is that LaGrange has almost no hard substrate, even though it has multiple years with multiple sites hosting multiple age cohorts. Contrast that against Alton, which has no goby detected throughout the study, but has a relatively higher proportion of hard substrate as compared to LaGrange. Clearly, substrate alone doesn’t explain why goby are caught in LaGrange (albeit infrequently) but not in Alton. Is the explanation the natural invasion front that might be overcome with time? Or some other limiting factor, like depth?
Hopefully laying out the analysis workflow here can provide a basic structure for a manuscript’s methods and results sections. The data underlying these figures is in our shared Box folder if you want to play with it yourselves. Notably, it doesn’t include the MWRD data from Austin Happel. I think we have enough here to go forward without it because I’m not sure I like the upstream/downstream invasion confusion it introduces. But I’m happy to continue to develop the hypotheses and analyses with you both and work on fitting the spatial data into a more publishable format.