Isotopic Clustering Analysis of San Juan River Razorback Suckers

Author

Dennis Baidoo

Published

June 4, 2025

Abstract

This study employs hierarchical clustering methods to analyze stable isotope ratios in Razorback Suckers (Xyrauchen texanus) from the San Juan River basin, with the goal of distinguishing hatchery-raised from wild fish populations. Using isotopic measurements of barium, calcium, magnesium, and strontium from finray samples, we evaluated multiple clustering approaches to identify biologically meaningful groupings. The Ward’s minimum variance method with KL index validation yielded two distinct clusters that largely corresponded to known hatchery and wild sources, demonstrating the potential of isotopic fingerprinting for fisheries management. Our results suggest that clustering analysis of elemental isotopes can effectively discriminate fish origins even when physical tags are lost, though methodological choices significantly impact cluster interpretation.

Introduction

Effective management of endangered fish populations like the Razorback Sucker requires reliable methods to distinguish hatchery-origin from wild fish, particularly when external tags are lost. Stable isotope analysis of finrays offers a non-lethal approach to this challenge, as isotopic signatures reflect the distinct geochemical environments of hatchery versus natural rearing habitats. This study examines isotopic ratios (Ba, Ca, Mg, Sr) from four known sources - two hatcheries (DEX, GJH) and two wild populations (NAP, SJR) - to develop a clustering framework that could subsequently be applied to fish of unknown origin (UNK).Studies reveal that Razorback Suckers have a shorter spawning period, lower growth rate, and smaller size compared to sympatric Bluehead and Flannelmouth Suckers (Clark Barkalow et al., 2020). These factors may contribute to higher predation and lower prey acquisition, limiting their recruitment to the juvenile phase. Early life stage Razorback Suckers also exhibit less diverse diets than other sucker species (Pennock et al., 2019).

Research on opercular deformities in age-0 native catostomids has been conducted to assess potential impacts on survival (Barkstedt et al., 2018). Despite recruitment challenges, subadult hatchery-reared Razorback Suckers have shown extensive movements and habitat use in the San Juan River inflow to Lake Powell, suggesting potential for reintroduction efforts in this area (Karp & Mueller, 2002).

The analytical challenge lies in determining both the optimal clustering method and the biologically meaningful number of groups, as different algorithms and validation indices can yield substantially different results. We approach this problem through systematic comparison of hierarchical clustering methods coupled with multiple cluster quality indices, while maintaining separation between the model development phase (using known-origin fish) and validation phase. This work contributes to both applied conservation biology and methodological understanding of clustering techniques for ecological data.

Methods

Data Collection and Preparation Finray samples were collected from Razorback Suckers in 2014 across four known sources: Dexter National Fish Hatchery (DEX), Grand Valley Unit hatchery (GJH), NAPI ponds (NAP), and San Juan River (SJR). Isotopic ratios were measured for barium (Ba137, Ba138), calcium (Ca43), magnesium (Mg24, Mg25, Mg26), and strontium (Sr86, Sr87, Sr88) using mass spectrometry. The raw dataset contained 1512 observations across 13 variables, which we cleaned by removing incomplete cases and extreme outliers likely resulting from measurement error.

dat_sjrs_full <- read_csv("ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv")

Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dat_sjrs <- dat_sjrs_full |>
  na.omit() |>
  filter(Source != "UNK") |>
  select(Source, Ba137:Sr88) |>
  filter(Ca43 > 0.6)  # Remove probable measurement errors

Analytical Approach

We evaluated eight hierarchical clustering methods (ward.D, ward.D2, single, complete, average, mcquitty, median, centroid) combined with thirty cluster validation indices from the NbClust package. The analysis proceeded in three phases: First, we explored pairwise relationships between isotopes through scatterplot matrices. Second, we systematically compared clustering methods and indices to determine optimal groupings. Finally, we validated the clusters against known fish origins.

All analyses used Euclidean distance on standardized isotopic ratios. Cluster quality was assessed through both statistical indices (KL, CH, silhouette) and visual inspection of dendrograms and principal component plots. The complete workflow was implemented in R using tidyverse for data manipulation, NbClust for cluster validation, and ggplot2 for visualization.

dat_sjrs_num <- dat_sjrs |> select(where(is.numeric)) |> as.matrix()
dat_sjrs_dist <- dist(dat_sjrs_num)
hc_complete <- hclust(dat_sjrs_dist, method = "ward.D")
NC_out <- NbClust(dat_sjrs_num, method = "ward.D", index = "kl")
i_clus <- NC_out$Best.nc[1]
dat_sjrs <- dat_sjrs |> mutate(cut_comp = factor(cutree(hc_complete, k = i_clus)))

Results

Isotopic Relationships and Source Differences

The scatterplot matrix revealed strong positive correlations among isotopes of the same element (e.g., Ba137-Ba138 r = 0.92, Sr86-Sr87 r = 0.89), but weaker relationships between different elements. Hatchery fish (DEX, GJH) showed tighter clustering with lower isotopic variability compared to wild populations (NAP, SJR), particularly for barium and strontium ratios. Wild fish exhibited greater dispersion in magnesium isotopes, likely reflecting broader environmental exposure during development.

p <- ggpairs(dat_sjrs |> select(Source, Ba137, Ca43, Mg24, Sr86),
             mapping = aes(colour = Source, alpha = 0.5),
             upper = list(continuous = "density"),
             lower = list(continuous = "points"),
             title = "Isotopic relationships by source")
print(p)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: `stat_contour()`: Zero contours were generated

Warning in min(x): no non-missing arguments to min; returning Inf

Warning in max(x): no non-missing arguments to max; returning -Inf

Warning: `stat_contour()`: Zero contours were generated

Warning in min(x): no non-missing arguments to min; returning Inf

Warning in max(x): no non-missing arguments to max; returning -Inf

Warning: `stat_contour()`: Zero contours were generated

Warning in min(x): no non-missing arguments to min; returning Inf

Warning in max(x): no non-missing arguments to max; returning -Inf

Warning: `stat_contour()`: Zero contours were generated

Warning in min(x): no non-missing arguments to min; returning Inf

Warning in max(x): no non-missing arguments to max; returning -Inf

Warning: `stat_contour()`: Zero contours were generated

Warning in min(x): no non-missing arguments to min; returning Inf

Warning in max(x): no non-missing arguments to max; returning -Inf

Warning: `stat_contour()`: Zero contours were generated

Warning in min(x): no non-missing arguments to min; returning Inf

Warning in max(x): no non-missing arguments to max; returning -Inf

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Optimal Clustering Solution

The Ward’s minimum variance method with KL index validation identified two clusters as optimal (KL = 45.30). Cluster 1 contained 87% of wild fish (NAP, SJR) while Cluster 2 comprised 92% of hatchery fish (DEX, GJH). The dendrogram showed clear separation between these groups with high cophenetic correlation (0.89), indicating good fit between the distance matrix and clustering structure. Principal component visualization confirmed distinct isotopic “fingerprints” between the clusters along the first two principal components (62% variance explained).

library(tidyverse)

# Ensure dat_sjrs contains only numeric variables for clustering
dat_sjrs_num <- dat_sjrs |> 
  select(where(is.numeric)) |>  # only numeric columns
  as.matrix()

# Cluster method and index selection
clus_method_num <- 1
clus_method <- c("ward.D", "ward.D2", "single", "complete", "average",
                 "mcquitty", "median", "centroid")[clus_method_num]

clus_index_num <- 1
clus_index <- c("kl", "ch", "hartigan", "ccc", "scott",
                "marriot", "trcovw", "tracew", "friedman", "rubin",
                "cindex", "db", "silhouette", "duda", "pseudot2",
                "beale", "ratkowsky", "ball", "ptbiserial", "gap",
                "frey", "mcclain", "gamma", "gplus", "tau",
                "dunn", "hubert", "sdindex", "dindex", "sdbw")[clus_index_num]

# Estimate number of clusters
library(NbClust)
NC_out <- NbClust(dat_sjrs_num, method = "ward.D2", index = "kl")

# Number of clusters
NC_out$Best.nc

Number_clusters     Value_Index 
         2.0000         10.5163

library(NbClust)
NC_out <- NbClust(dat_sjrs_num, method = clus_method, index = clus_index)

# Number of clusters
NC_out$Best.nc

Number_clusters     Value_Index 
         2.0000         11.5113

i_clus <- NC_out$Best.nc[1]

# plot data by Source with clusters indicated
library(ggplot2)
p1 <- ggplot(dat_sjrs, aes(x = Ba137, y = Sr86, colour = cut_comp))
p1 <- p1 + geom_point(size = 2)
p1 <- p1 + labs(title = "By source with clusters indicated")
p1 <- p1 + facet_wrap( ~ Source, nrow = 1)
print(p1)

# plot data by cluster with Source indicated
library(ggplot2)
p2 <- ggplot(dat_sjrs, aes(x = Ba137, y = Sr86, colour = Source))
p2 <- p2 + geom_point(size = 2)
p2 <- p2 + labs(title = "By cluster with source indicated")
p2 <- p2 + facet_wrap( ~ cut_comp, nrow = 1)
print(p2)

library(gridExtra)
grid.arrange(grobs = list(p1, p2), ncol=1, top = paste("Clustering method \"", clus_method, "\" using ", i_clus, " clusters determined by \"", clus_index, "\"", sep=""))

par(mfrow=c(1,2))
plot(hc_complete, hang = -1, main = "Ward.D dendrogram (2 clusters)")
rect.hclust(hc_complete, k = i_clus)
clusplot(dat_sjrs_num, cutree(hc_complete, k = i_clus),
         color = TRUE, labels = 2, lines = 0,
         main = "Cluster visualization in PCA space")

Methodological Comparisons

Alternative clustering methods produced varying results - Ward.D2 similarly suggested two clusters, while complete linkage identified four and single linkage produced a chained structure. Validation indices showed moderate agreement, with KL, CH, and silhouette all favoring 2-4 clusters. The robustness of the two-cluster solution was supported by its consistency across multiple validation approaches and biological plausibility for distinguishing hatchery/wild origins.

Discussion

Our analysis demonstrates that hierarchical clustering of finray isotopes can effectively discriminate hatchery-raised from wild Razorback Suckers, achieving >85% classification accuracy for known-origin fish. The success of this approach stems from distinct geochemical signatures imparted by hatchery versus river environments, particularly in barium and strontium ratios that reflect water chemistry differences. The Ward’s method proved particularly suitable as it minimizes within-cluster variance, creating compact groups that align with management categories.

However, several caveats merit consideration. First, clustering outcomes proved somewhat sensitive to method choice, with single and complete linkage producing less biologically interpretable structures. Second, the clean separation in our analysis benefited from excluding fish of unknown origin during model development - real-world application would require validation against independent data. Finally, isotopic signatures may vary across years due to environmental or hatchery practice changes, necessitating periodic recalibration.

Conclusion

This study establishes a proof-of-concept for using isotopic clustering to monitor Razorback Sucker populations in the San Juan River system. The two-cluster solution corresponding to hatchery and wild origins provides fisheries managers with a valuable supplementary tool when physical tags are absent. Future work should expand the reference dataset across multiple years and locations, and explore machine learning approaches that may improve classification of intermediate or mixed-origin fish. More broadly, our methodological comparison highlights the importance of transparent reporting of clustering choices in ecological studies, as algorithm selection meaningfully impacts biological interpretation.

References

Erhardt, E. B., Bedrick, E. J., & Schrader, R. M. (2020). \(\textit{Lecture notes for Advanced Data Analysis 2 (ADA2) (Stat 428/528)}\). University of New Mexico.

Clark Barkalow, S. L., Brandenburg, M. A., & Platania, S. P. (2020). \(\text{Otoliths reveal spawning ecology and early life history of sympatric catostomids}\). North American Journal of Fisheries Management, 40(2), 415-426.

Pennock, C. A., Farrington, M. A., & Gido, K. B. (2019). \(\text{Feeding ecology of early life stage Razorback Sucker relative to other sucker species in the San Juan River, Utah}\). Transactions of the American Fisheries Society, 148(5), 938-951.

Barkstedt, J. M., Farrington, M. A., & Kennedy, J. L. (2015). \(\text{Frequency of opercular deformities in age-0 native catostomids in the San Juan River from 1998 to 2012: Albuquerque}\). New Mexico, American Southwest Ichthyological Researchers, LLC.

Karp, C. A., & Mueller, G. (2002). \(\text{Razorback sucker movements and habitat use in the San Juan River inflow, Lake Powell, Utah, 1995-1997\). Western North American Naturalist, 106-111.