Classification of Razorback Sucker Origins Using Stable Isotope Ratios and Discriminant Analysis

Author

Dennis Baidoo

Published

June 21, 2025

Abstract

This study employs quadratic discriminant analysis (QDA) to classify Razorback Suckers (Xyrauchen texanus) from the San Juan River basin into their putative origins (hatchery vs. wild) using stable isotope ratios (Ba, Mg, Sr). After evaluating model assumptions and performing stepwise variable selection, we developed a classification model achieving 86.8% cross-validated accuracy on known-origin fish (n=719). The final model, based on Ba137, Mg25, and Sr88 isotopes, showed excellent discrimination for hatchery fish (DEX/GJH: 99.8-100% accuracy) but lower performance for wild NAP populations (35.8% sensitivity). Application to untagged “NEW” specimens (n=80) revealed 90% accuracy, with most classified as wild San Juan River (SJR) origin. These results demonstrate how stable isotopes can inform conservation management of endangered fish populations when morphological tags are absent.

1. Introduction

The Razorback Sucker, an endangered freshwater fish native to the Colorado River basin, faces conservation challenges from habitat alteration and non-native species. While hatchery programs support wild populations, the loss of physical tags necessitates alternative methods for origin identification. Stable isotope analysis of fin rays offers a non-lethal approach, as isotopic signatures reflect natal environments. This study addresses two key questions: (1) Can discriminant analysis reliably classify fish origins using isotopic ratios? (2) What is the likely origin of untagged specimens in the San Juan River? Our work builds upon ecological applications of discriminant analysis while addressing practical challenges in endangered species management.

2. Methods

2.1 Data Collection

We analyzed 1512 fin ray samples from Razorback Suckers collected in 2014, focusing on four known sources: two hatcheries (DEX, GJH) and two wild populations (NAP, SJR). After quality control (removing samples with missing isotopes and unusual Ca43 values), 719 known-origin and 80 untagged (“NEW”) specimens were retained. Key isotopes included Ba137, Ba138, Ca43, Mg24-26, and Sr86-88, with Ba137 and Ba138 log10-transformed to improve separation.

2.2 Analytical Approach

Assumption checking via Q-Q plots and correlation matrices revealed non-uniform covariance structures across groups, justifying QDA over LDA. We performed both forward and backward stepwise selection using 10-fold cross-validation, with model performance evaluated via confusion matrices and balanced accuracy metrics. The final model was applied to classify NEW specimens, with posterior probabilities quantifying classification certainty.

3. Results

3.1 Model Development

Forward and backward selection converged on a three-predictor model (Ba137, Mg25, Sr88) with nearly identical cross-validated accuracy (87.06% vs. 87.07%). The final QDA model showed strong separation between hatchery and wild groups (Figure 1), with posterior probabilities >0.95 for 89% of DEX and GJH specimens. Group means revealed distinct isotopic fingerprints: hatchery fish exhibited higher Ba137 (-3.39 DEX, -2.80 GJH) versus wild (-2.16 NAP, -2.14 SJR), while Sr88 differentiated hatcheries (2.28 DEX, 1.31 GJH).

library(erikmisc)
library(tidyverse)
ggplot2::theme_set(ggplot2::theme_bw())  # set theme_bw for all plots

# First, download the data to your computer,
#   save in the same folder as this qmd file.

# read the data
dat_sjrs_full <-
  read_csv(
    "ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  )

Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dim(dat_sjrs_full)

[1] 1512   13

# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
  dat_sjrs_full |>
  na.omit()

dim(dat_sjrs)

[1] 1447   13

# no missing values
dat_sjrs |>
  is.na() |>
  sum()

[1] 0

#str(dat_sjrs)

dat_sjrs <-
  dat_sjrs |>
  select(
    Source
  , Ba137:Sr88
  ) |>
  filter(
    # Exclude unknown sources
    Source != "UNK"
  ) |>
  mutate(
    Source_org = factor(Source) # original source
    # every 10th observation, assign to "NEW"
  , Source = ifelse((1:n() %% 10) == 0, "NEW", Source)
  , Source = factor(Source)
    # transforming the Ba values separates the tight clustering on the boundary
  , Ba137 = log10(Ba137)
  , Ba138 = log10(Ba138)
  )
names(dat_sjrs)

 [1] "Source"     "Ba137"      "Ba138"      "Ca43"       "Mg24"      
 [6] "Mg25"       "Mg26"       "Sr86"       "Sr87"       "Sr88"      
[11] "Source_org"

# 1/10 of the sources have been assigned to "NEW"
dat_sjrs |> pull(Source_org) |> table()


DEX GJH NAP SJR 
224 199 133 244

dat_sjrs |> pull(Source    ) |> table()


DEX GJH NAP NEW SJR 
201 180 120  80 219

## NOTE HERE
# NEW group to predict
dat_sjrs_new <-
  dat_sjrs |>
  filter(
    Source == "NEW"
  )
# Known groups
dat_sjrs <-
  dat_sjrs |>
  filter(
    Source != "NEW"
  ) |>
  filter(
    # There are a few unual observations, remove those assuming measurement errors
    # remove two small Ca43 values
    Ca43 > 0.5
  ) |>
  mutate(
    Source = factor(Source)  # to remove unused levels
  )

# data sizes
dat_sjrs_new |> dim()

[1] 80 11

dat_sjrs     |> dim()

[1] 719  11

# Scatterplot matrix
library(ggplot2)
library(GGally)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

p <-
  ggpairs(
    dat_sjrs
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "Original data by source"
  )
print(p)

3.2 Classification Performance

The confusion matrix for known specimens showed perfect classification for GJH (100% sensitivity/specificity) and near-perfect performance for DEX (99.5% sensitivity). Wild populations showed asymmetric performance: SJR achieved 92.2% sensitivity but NAP only 35.8%, with frequent misclassification as SJR (17/60 errors). Overall accuracy was 86.8% (95% CI: 84.1-89.2%), with kappa=0.82 indicating substantial agreement.

dat_sjrs_d <- dat_sjrs |> select(Ba137:Sr88) # the data
dat_sjrs_c <- dat_sjrs |> pull(Source)       # the classes

# start random number generator in same place for everyone
# and so that random partitions are the same each time code is run
set.seed(7)

#library(klaR)  # don't run this since it does library(MASS) and breaks select() from dplyr
# Backward
step_sjrs_b <-
  klaR::stepclass(
    dat_sjrs_d
  , dat_sjrs_c
  , method = "qda"  #qda is used here since correlation matrices also vary across groups, indicating unequal covariance structures.
  , improvement = 0.001 # stop criterion: improvement less than
  , direction = "backward"
  , start.vars = colnames(dat_sjrs_d)
  )

 `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.

719 observations of 9 variables in 4 classes; direction: backward

stop criterion: improvement less than 0.1%.

correctness rate: 0.8429;  starting variables (9): Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88 
correctness rate: 0.85264;  out: "Ca43";  variables (8): Ba137, Ba138, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88 
correctness rate: 0.85816;  out: "Ba137";  variables (7): Ba138, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88 
correctness rate: 0.86234;  out: "Sr86";  variables (6): Ba138, Mg24, Mg25, Mg26, Sr87, Sr88 
correctness rate: 0.86653;  out: "Mg26";  variables (5): Ba138, Mg24, Mg25, Sr87, Sr88 
correctness rate: 0.86929;  out: "Sr87";  variables (4): Ba138, Mg24, Mg25, Sr88 
correctness rate: 0.87068;  out: "Mg24";  variables (3): Ba138, Mg25, Sr88 

 hr.elapsed min.elapsed sec.elapsed 
       0.00        0.00        3.36

## NOTE HERE
step_sjrs_b$formula

dat_sjrs_c ~ Ba138 + Mg25 + Sr88
<environment: 0x000001fc24460c08>

# estimated correct/error rates
step_sjrs_b$result.pm

crossval.rate      apparent 
    0.8706768     0.1279555

# Forward
step_sjrs_f <-
  klaR::stepclass(
    dat_sjrs_d
  , dat_sjrs_c
  , method = "qda"  # qda is used here since correlation matrices also vary across groups, indicating unequal covariance structures.
  , improvement = 0.001 # stop criterion: improvement less than
  , direction = "forward"
  , start.vars = ""
  )

 `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.

719 observations of 9 variables in 4 classes; direction: forward

stop criterion: improvement less than 0.1%.

Warning in cv.rate(vars = start.vars, data = data, grouping = grouping, :
error(s) in modeling/prediction step

correctness rate: 0;  starting variables (0): ,  
correctness rate: 0.82197;  in: "Ba137";  variables (1): Ba137 
correctness rate: 0.84834;  in: "Sr88";  variables (2): Ba137, Sr88 
correctness rate: 0.87062;  in: "Mg25";  variables (3): Ba137, Sr88, Mg25 

 hr.elapsed min.elapsed sec.elapsed 
       0.00        0.00        1.84

## NOTE HERE
step_sjrs_f$formula

dat_sjrs_c ~ Ba137 + Mg25 + Sr88
<environment: 0x000001fc2440d700>

# estimated correct/error rates
step_sjrs_f$result.pm

crossval.rate      apparent 
    0.8706182     0.1307371

op <- par(no.readonly = TRUE) # the whole list of settable par's.
  # make wider left margin to fit contrast labels
  par(mfrow = c(1,2), mar = 0*rep(1, 4)) # order is c(bottom, left, top, right)
  plot(step_sjrs_f, ylim = c(0, 1), main = "empty model, forward")
  plot(step_sjrs_b, ylim = c(0, 1), main = "full model, backward")

par(op) # reset plotting options


## NOTE HERE
# set the formula you're using here, then it will be used throughout the rest
sjrs_formula <- step_sjrs_f

# Select and print the final model
#library(MASS)  # don't run library(MASS) because it breaks select() from dplyr
qda_sjrs_final <-
  MASS::qda(
    grouping = dat_sjrs_c
  , x = dat_sjrs_d |> select(sjrs_formula$model$name)
  )
qda_sjrs_final

Call:
qda(select(dat_sjrs_d, sjrs_formula$model$name), grouping = dat_sjrs_c)

Prior probabilities of groups:
      DEX       GJH       NAP       SJR 
0.2795549 0.2503477 0.1668985 0.3031989 

Group means:
        Ba137      Mg25     Sr88
DEX -3.391341 0.2777136 2.278532
GJH -2.804993 0.2859888 1.314173
NAP -2.163112 0.3037372 1.306017
SJR -2.135268 0.3044489 1.271622

# CV = TRUE does jackknife (leave-one-out) crossvalidation
#library(MASS)  # don't run library(MASS) because it breaks select() from dplyr
qda_sjrs_cv <-
  MASS::qda(
    grouping = dat_sjrs_c
  , x = dat_sjrs_d |> select(sjrs_formula$model$name)
  , CV = TRUE
  )
#lda_sjrs_cv

# Create a table of classification and posterior probabilities for each observation
classify_sjrs <-
  data.frame(
    Source = dat_sjrs$Source
  , class = qda_sjrs_cv$class
  , error = ""
  , round(qda_sjrs_cv$posterior, 3)
  )
colnames(classify_sjrs) <-
  c(
    "Source"
  , "class"
  , "error"
  , paste("post", colnames(qda_sjrs_cv$posterior), sep="_")
  )

# error column
classify_sjrs$error <-
  as.character(classify_sjrs$error)
classify_agree <-
  as.character(as.numeric(dat_sjrs$Source) - as.numeric(qda_sjrs_cv$class))
classify_sjrs$error[!(classify_agree == 0)] <-
  classify_agree[!(classify_agree == 0)]
# print table
#classify_sjrs

# A list of classification statistics
library(caret)

Loading required package: lattice


Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

confusionMatrix(
    data      = qda_sjrs_cv$class # predictions
  , reference = dat_sjrs$Source   # true labels
  , mode      = "sens_spec"       # restrict output to relevant summaries
)

Confusion Matrix and Statistics

          Reference
Prediction DEX GJH NAP SJR
       DEX 200   0   0   0
       GJH   0 180   0   0
       NAP   1   0  43  17
       SJR   0   0  77 201

Overall Statistics
                                          
               Accuracy : 0.8679          
                 95% CI : (0.8409, 0.8918)
    No Information Rate : 0.3032          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8185          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity              0.9950     1.0000    0.35833     0.9220
Specificity              1.0000     1.0000    0.96995     0.8463
Pos Pred Value           1.0000     1.0000    0.70492     0.7230
Neg Pred Value           0.9981     1.0000    0.88298     0.9615
Prevalence               0.2796     0.2503    0.16690     0.3032
Detection Rate           0.2782     0.2503    0.05981     0.2796
Detection Prevalence     0.2782     0.2503    0.08484     0.3866
Balanced Accuracy        0.9975     1.0000    0.66414     0.8842

klaR::partimat(
    grouping    = dat_sjrs_c
  , x           = dat_sjrs_d |> select(sjrs_formula$model$name)
  , plot.matrix = FALSE
  , method      = "qda"           # or "qda"
  , main        = "QDA partition" # or "QDA partition"
  )

# new observations to classify
summary(dat_sjrs_new$Source)

DEX GJH NAP NEW SJR 
  0   0   0  80   0

# predict the NEW data from the training data LDFs
pred_sjrs_new <-
  predict(
    qda_sjrs_final
  , newdata = dat_sjrs_new |> select(sjrs_formula$model$name)
  )

# Create a table of classification and posterior probabilities for each observation
classify_sjrs_new <-
  data.frame(
    Source = dat_sjrs_new$Source_org
  , class = pred_sjrs_new$class
  , error = ""
  , round(pred_sjrs_new$posterior,3)
  )
colnames(classify_sjrs_new) <-
  c(
    "Source"
  , "class"
  , "error"
  , paste("post", colnames(pred_sjrs_new$posterior), sep="_")
  )

# error column
classify_sjrs_new$error <-
  as.character(classify_sjrs_new$error)
classify_agree_new <-
  as.character(as.numeric(dat_sjrs_new$Source_org) - as.numeric(pred_sjrs_new$class))
classify_sjrs_new$error[!(classify_agree_new == 0)] <-
  classify_agree_new[!(classify_agree_new == 0)]
# print table
#classify_sjrs_new

# A list of classification statistics
library(caret)
confusionMatrix(
    data      = pred_sjrs_new$class # predictions
  , reference = dat_sjrs_new$Source_org   # true labels
  , mode      = "sens_spec"       # restrict output to relevant summaries
)

Confusion Matrix and Statistics

          Reference
Prediction DEX GJH NAP SJR
       DEX  23   0   0   0
       GJH   0  19   0   0
       NAP   0   0   6   1
       SJR   0   0   7  24

Overall Statistics
                                          
               Accuracy : 0.9             
                 95% CI : (0.8124, 0.9558)
    No Information Rate : 0.3125          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8622          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity              1.0000     1.0000     0.4615     0.9600
Specificity              1.0000     1.0000     0.9851     0.8727
Pos Pred Value           1.0000     1.0000     0.8571     0.7742
Neg Pred Value           1.0000     1.0000     0.9041     0.9796
Prevalence               0.2875     0.2375     0.1625     0.3125
Detection Rate           0.2875     0.2375     0.0750     0.3000
Detection Prevalence     0.2875     0.2375     0.0875     0.3875
Balanced Accuracy        1.0000     1.0000     0.7233     0.9164

# update unknown NEW with the class prediction
dat_sjrs_new$Class <- pred_sjrs_new$class

dat_sjrs_new_pred <- cbind(dat_sjrs_new, pred_sjrs_new, classify_sjrs_new)

dat_sjrs$Class <- dat_sjrs$Source
dat_sjrs    $Label <- "Known"
dat_sjrs_new$Label <- "NEW"
dat_sjrs_all <- rbind(dat_sjrs, dat_sjrs_new)

# plot data by Source with clusters indicated
library(ggplot2)
p1 <- ggplot(dat_sjrs_all, aes(x = Ba138, y = Sr87, colour = Class))
p1 <- p1 + geom_point()#size = 2)
p1 <- p1 + labs(title = "Known and new observations by source classification")
p1 <- p1 + facet_grid(Label ~ Source_org)
print(p1)

3.3 Application to Untagged Fish

Classification of NEW specimens achieved 90% accuracy (72/80 correct), with most (24/30) true SJR fish correctly identified. Misclassifications primarily involved NAP→SJR (7/13 errors), mirroring patterns in the training data. Posterior probabilities exceeded 0.85 for 88% of NEW classifications, suggesting high confidence.

4. Discussion

4.1 Ecological Implications

The isotopic signatures clearly distinguish hatchery-reared fish, likely reflecting artificial feed composition and water chemistry. The poorer NAP classification may stem from environmental similarity to SJR or greater natural variability in this pond population. The high SJR assignment rate for NEW fish suggests most untagged specimens are wild-origin, supporting natural reproduction in the San Juan River.

4.2 Methodological Considerations

While QDA accommodated unequal group covariances, the persistent NAP-SJR confusion indicates limitations when isotopic niches overlap. The 90% accuracy for NEW specimens—higher than cross-validated training performance—may reflect their predominantly SJR composition, the best-classified wild group. Future studies could incorporate additional isotopes (δ13C, δ15N) or geometric morphometrics to improve wild population discrimination.

4.3 Conservation Applications

This approach provides resource managers with a viable tool for monitoring hatchery-wild proportions when physical tags are lost. The minimal misclassification between hatcheries (DEX/GJH) ensures reliable evaluation of stocking program contributions. Ongoing validation should track classified fish via independent methods (e.g., genetics) to verify isotopic signatures remain diagnostic.

5. Conclusion

Our QDA model effectively classified Razorback Sucker origins using three stable isotopes, achieving >86% accuracy for known-origin fish and 90% for untagged specimens. The technique offers a practical, non-lethal method for assessing conservation program effectiveness, particularly for distinguishing hatchery versus wild fish. While wild population discrimination requires refinement, this approach demonstrates how biogeochemical tracers can address critical management questions for endangered species.

References

Erhardt, E. B., Bedrick, E. J., & Schrader, R. M. (2020). \(\textit{Lecture notes for Advanced Data Analysis 2 (ADA2) (Stat 428/528)}\). University of New Mexico.