library(erikmisc)
library(tidyverse)
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw for all plotsClassification of Razorback Sucker Origins Using Stable Isotope Ratios and Discriminant Analysis
Abstract
This study employs quadratic discriminant analysis (QDA) to classify Razorback Suckers (Xyrauchen texanus) from the San Juan River basin into their putative origins (hatchery vs. wild) using stable isotope ratios (Ba, Mg, Sr). After evaluating model assumptions and performing stepwise variable selection, we developed a classification model achieving 86.8% cross-validated accuracy on known-origin fish (n=719). The final model, based on Ba137, Mg25, and Sr88 isotopes, showed excellent discrimination for hatchery fish (DEX/GJH: 99.8-100% accuracy) but lower performance for wild NAP populations (35.8% sensitivity). Application to untagged “NEW” specimens (n=80) revealed 90% accuracy, with most classified as wild San Juan River (SJR) origin. These results demonstrate how stable isotopes can inform conservation management of endangered fish populations when morphological tags are absent.
1. Introduction
The Razorback Sucker, an endangered freshwater fish native to the Colorado River basin, faces conservation challenges from habitat alteration and non-native species. While hatchery programs support wild populations, the loss of physical tags necessitates alternative methods for origin identification. Stable isotope analysis of fin rays offers a non-lethal approach, as isotopic signatures reflect natal environments. This study addresses two key questions: (1) Can discriminant analysis reliably classify fish origins using isotopic ratios? (2) What is the likely origin of untagged specimens in the San Juan River? Our work builds upon ecological applications of discriminant analysis while addressing practical challenges in endangered species management.
2. Methods
2.1 Data Collection
We analyzed 1512 fin ray samples from Razorback Suckers collected in 2014, focusing on four known sources: two hatcheries (DEX, GJH) and two wild populations (NAP, SJR). After quality control (removing samples with missing isotopes and unusual Ca43 values), 719 known-origin and 80 untagged (“NEW”) specimens were retained. Key isotopes included Ba137, Ba138, Ca43, Mg24-26, and Sr86-88, with Ba137 and Ba138 log10-transformed to improve separation.
2.2 Analytical Approach
Assumption checking via Q-Q plots and correlation matrices revealed non-uniform covariance structures across groups, justifying QDA over LDA. We performed both forward and backward stepwise selection using 10-fold cross-validation, with model performance evaluated via confusion matrices and balanced accuracy metrics. The final model was applied to classify NEW specimens, with posterior probabilities quantifying classification certainty.
3. Results
3.1 Model Development
Forward and backward selection converged on a three-predictor model (Ba137, Mg25, Sr88) with nearly identical cross-validated accuracy (87.06% vs. 87.07%). The final QDA model showed strong separation between hatchery and wild groups (Figure 1), with posterior probabilities >0.95 for 89% of DEX and GJH specimens. Group means revealed distinct isotopic fingerprints: hatchery fish exhibited higher Ba137 (-3.39 DEX, -2.80 GJH) versus wild (-2.16 NAP, -2.14 SJR), while Sr88 differentiated hatcheries (2.28 DEX, 1.31 GJH).
# First, download the data to your computer,
# save in the same folder as this qmd file.
# read the data
dat_sjrs_full <-
read_csv(
"ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv"
)Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(dat_sjrs_full)[1] 1512 13
# the last set of observations only have a few isotopes, so exclude
dat_sjrs <-
dat_sjrs_full |>
na.omit()
dim(dat_sjrs)[1] 1447 13
# no missing values
dat_sjrs |>
is.na() |>
sum()[1] 0
#str(dat_sjrs)
dat_sjrs <-
dat_sjrs |>
select(
Source
, Ba137:Sr88
) |>
filter(
# Exclude unknown sources
Source != "UNK"
) |>
mutate(
Source_org = factor(Source) # original source
# every 10th observation, assign to "NEW"
, Source = ifelse((1:n() %% 10) == 0, "NEW", Source)
, Source = factor(Source)
# transforming the Ba values separates the tight clustering on the boundary
, Ba137 = log10(Ba137)
, Ba138 = log10(Ba138)
)
names(dat_sjrs) [1] "Source" "Ba137" "Ba138" "Ca43" "Mg24"
[6] "Mg25" "Mg26" "Sr86" "Sr87" "Sr88"
[11] "Source_org"
# 1/10 of the sources have been assigned to "NEW"
dat_sjrs |> pull(Source_org) |> table()
DEX GJH NAP SJR
224 199 133 244
dat_sjrs |> pull(Source ) |> table()
DEX GJH NAP NEW SJR
201 180 120 80 219
## NOTE HERE
# NEW group to predict
dat_sjrs_new <-
dat_sjrs |>
filter(
Source == "NEW"
)
# Known groups
dat_sjrs <-
dat_sjrs |>
filter(
Source != "NEW"
) |>
filter(
# There are a few unual observations, remove those assuming measurement errors
# remove two small Ca43 values
Ca43 > 0.5
) |>
mutate(
Source = factor(Source) # to remove unused levels
)
# data sizes
dat_sjrs_new |> dim()[1] 80 11
dat_sjrs |> dim()[1] 719 11
# Scatterplot matrix
library(ggplot2)
library(GGally)Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
p <-
ggpairs(
dat_sjrs
, mapping = ggplot2::aes(colour = Source, alpha = 0.5)
, upper = list(continuous = "density", combo = "box")
, lower = list(continuous = "points", combo = "dot")
#, lower = list(continuous = "cor")
, title = "Original data by source"
)
print(p)3.2 Classification Performance
The confusion matrix for known specimens showed perfect classification for GJH (100% sensitivity/specificity) and near-perfect performance for DEX (99.5% sensitivity). Wild populations showed asymmetric performance: SJR achieved 92.2% sensitivity but NAP only 35.8%, with frequent misclassification as SJR (17/60 errors). Overall accuracy was 86.8% (95% CI: 84.1-89.2%), with kappa=0.82 indicating substantial agreement.
dat_sjrs_d <- dat_sjrs |> select(Ba137:Sr88) # the data
dat_sjrs_c <- dat_sjrs |> pull(Source) # the classes
# start random number generator in same place for everyone
# and so that random partitions are the same each time code is run
set.seed(7)
#library(klaR) # don't run this since it does library(MASS) and breaks select() from dplyr
# Backward
step_sjrs_b <-
klaR::stepclass(
dat_sjrs_d
, dat_sjrs_c
, method = "qda" #qda is used here since correlation matrices also vary across groups, indicating unequal covariance structures.
, improvement = 0.001 # stop criterion: improvement less than
, direction = "backward"
, start.vars = colnames(dat_sjrs_d)
) `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
719 observations of 9 variables in 4 classes; direction: backward
stop criterion: improvement less than 0.1%.
correctness rate: 0.8429; starting variables (9): Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88
correctness rate: 0.85264; out: "Ca43"; variables (8): Ba137, Ba138, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88
correctness rate: 0.85816; out: "Ba137"; variables (7): Ba138, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88
correctness rate: 0.86234; out: "Sr86"; variables (6): Ba138, Mg24, Mg25, Mg26, Sr87, Sr88
correctness rate: 0.86653; out: "Mg26"; variables (5): Ba138, Mg24, Mg25, Sr87, Sr88
correctness rate: 0.86929; out: "Sr87"; variables (4): Ba138, Mg24, Mg25, Sr88
correctness rate: 0.87068; out: "Mg24"; variables (3): Ba138, Mg25, Sr88
hr.elapsed min.elapsed sec.elapsed
0.00 0.00 3.36
## NOTE HERE
step_sjrs_b$formuladat_sjrs_c ~ Ba138 + Mg25 + Sr88
<environment: 0x000001fc24460c08>
# estimated correct/error rates
step_sjrs_b$result.pmcrossval.rate apparent
0.8706768 0.1279555
# Forward
step_sjrs_f <-
klaR::stepclass(
dat_sjrs_d
, dat_sjrs_c
, method = "qda" # qda is used here since correlation matrices also vary across groups, indicating unequal covariance structures.
, improvement = 0.001 # stop criterion: improvement less than
, direction = "forward"
, start.vars = ""
) `stepwise classification', using 10-fold cross-validated correctness rate of method qda'.
719 observations of 9 variables in 4 classes; direction: forward
stop criterion: improvement less than 0.1%.
Warning in cv.rate(vars = start.vars, data = data, grouping = grouping, :
error(s) in modeling/prediction step
correctness rate: 0; starting variables (0): ,
correctness rate: 0.82197; in: "Ba137"; variables (1): Ba137
correctness rate: 0.84834; in: "Sr88"; variables (2): Ba137, Sr88
correctness rate: 0.87062; in: "Mg25"; variables (3): Ba137, Sr88, Mg25
hr.elapsed min.elapsed sec.elapsed
0.00 0.00 1.84
## NOTE HERE
step_sjrs_f$formuladat_sjrs_c ~ Ba137 + Mg25 + Sr88
<environment: 0x000001fc2440d700>
# estimated correct/error rates
step_sjrs_f$result.pmcrossval.rate apparent
0.8706182 0.1307371
op <- par(no.readonly = TRUE) # the whole list of settable par's.
# make wider left margin to fit contrast labels
par(mfrow = c(1,2), mar = 0*rep(1, 4)) # order is c(bottom, left, top, right)
plot(step_sjrs_f, ylim = c(0, 1), main = "empty model, forward")
plot(step_sjrs_b, ylim = c(0, 1), main = "full model, backward")par(op) # reset plotting options
## NOTE HERE
# set the formula you're using here, then it will be used throughout the rest
sjrs_formula <- step_sjrs_f
# Select and print the final model
#library(MASS) # don't run library(MASS) because it breaks select() from dplyr
qda_sjrs_final <-
MASS::qda(
grouping = dat_sjrs_c
, x = dat_sjrs_d |> select(sjrs_formula$model$name)
)
qda_sjrs_finalCall:
qda(select(dat_sjrs_d, sjrs_formula$model$name), grouping = dat_sjrs_c)
Prior probabilities of groups:
DEX GJH NAP SJR
0.2795549 0.2503477 0.1668985 0.3031989
Group means:
Ba137 Mg25 Sr88
DEX -3.391341 0.2777136 2.278532
GJH -2.804993 0.2859888 1.314173
NAP -2.163112 0.3037372 1.306017
SJR -2.135268 0.3044489 1.271622
# CV = TRUE does jackknife (leave-one-out) crossvalidation
#library(MASS) # don't run library(MASS) because it breaks select() from dplyr
qda_sjrs_cv <-
MASS::qda(
grouping = dat_sjrs_c
, x = dat_sjrs_d |> select(sjrs_formula$model$name)
, CV = TRUE
)
#lda_sjrs_cv
# Create a table of classification and posterior probabilities for each observation
classify_sjrs <-
data.frame(
Source = dat_sjrs$Source
, class = qda_sjrs_cv$class
, error = ""
, round(qda_sjrs_cv$posterior, 3)
)
colnames(classify_sjrs) <-
c(
"Source"
, "class"
, "error"
, paste("post", colnames(qda_sjrs_cv$posterior), sep="_")
)
# error column
classify_sjrs$error <-
as.character(classify_sjrs$error)
classify_agree <-
as.character(as.numeric(dat_sjrs$Source) - as.numeric(qda_sjrs_cv$class))
classify_sjrs$error[!(classify_agree == 0)] <-
classify_agree[!(classify_agree == 0)]
# print table
#classify_sjrs
# A list of classification statistics
library(caret)Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
confusionMatrix(
data = qda_sjrs_cv$class # predictions
, reference = dat_sjrs$Source # true labels
, mode = "sens_spec" # restrict output to relevant summaries
)Confusion Matrix and Statistics
Reference
Prediction DEX GJH NAP SJR
DEX 200 0 0 0
GJH 0 180 0 0
NAP 1 0 43 17
SJR 0 0 77 201
Overall Statistics
Accuracy : 0.8679
95% CI : (0.8409, 0.8918)
No Information Rate : 0.3032
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8185
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity 0.9950 1.0000 0.35833 0.9220
Specificity 1.0000 1.0000 0.96995 0.8463
Pos Pred Value 1.0000 1.0000 0.70492 0.7230
Neg Pred Value 0.9981 1.0000 0.88298 0.9615
Prevalence 0.2796 0.2503 0.16690 0.3032
Detection Rate 0.2782 0.2503 0.05981 0.2796
Detection Prevalence 0.2782 0.2503 0.08484 0.3866
Balanced Accuracy 0.9975 1.0000 0.66414 0.8842
klaR::partimat(
grouping = dat_sjrs_c
, x = dat_sjrs_d |> select(sjrs_formula$model$name)
, plot.matrix = FALSE
, method = "qda" # or "qda"
, main = "QDA partition" # or "QDA partition"
)# new observations to classify
summary(dat_sjrs_new$Source)DEX GJH NAP NEW SJR
0 0 0 80 0
# predict the NEW data from the training data LDFs
pred_sjrs_new <-
predict(
qda_sjrs_final
, newdata = dat_sjrs_new |> select(sjrs_formula$model$name)
)
# Create a table of classification and posterior probabilities for each observation
classify_sjrs_new <-
data.frame(
Source = dat_sjrs_new$Source_org
, class = pred_sjrs_new$class
, error = ""
, round(pred_sjrs_new$posterior,3)
)
colnames(classify_sjrs_new) <-
c(
"Source"
, "class"
, "error"
, paste("post", colnames(pred_sjrs_new$posterior), sep="_")
)
# error column
classify_sjrs_new$error <-
as.character(classify_sjrs_new$error)
classify_agree_new <-
as.character(as.numeric(dat_sjrs_new$Source_org) - as.numeric(pred_sjrs_new$class))
classify_sjrs_new$error[!(classify_agree_new == 0)] <-
classify_agree_new[!(classify_agree_new == 0)]
# print table
#classify_sjrs_new
# A list of classification statistics
library(caret)
confusionMatrix(
data = pred_sjrs_new$class # predictions
, reference = dat_sjrs_new$Source_org # true labels
, mode = "sens_spec" # restrict output to relevant summaries
)Confusion Matrix and Statistics
Reference
Prediction DEX GJH NAP SJR
DEX 23 0 0 0
GJH 0 19 0 0
NAP 0 0 6 1
SJR 0 0 7 24
Overall Statistics
Accuracy : 0.9
95% CI : (0.8124, 0.9558)
No Information Rate : 0.3125
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8622
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: DEX Class: GJH Class: NAP Class: SJR
Sensitivity 1.0000 1.0000 0.4615 0.9600
Specificity 1.0000 1.0000 0.9851 0.8727
Pos Pred Value 1.0000 1.0000 0.8571 0.7742
Neg Pred Value 1.0000 1.0000 0.9041 0.9796
Prevalence 0.2875 0.2375 0.1625 0.3125
Detection Rate 0.2875 0.2375 0.0750 0.3000
Detection Prevalence 0.2875 0.2375 0.0875 0.3875
Balanced Accuracy 1.0000 1.0000 0.7233 0.9164
# update unknown NEW with the class prediction
dat_sjrs_new$Class <- pred_sjrs_new$class
dat_sjrs_new_pred <- cbind(dat_sjrs_new, pred_sjrs_new, classify_sjrs_new)dat_sjrs$Class <- dat_sjrs$Source
dat_sjrs $Label <- "Known"
dat_sjrs_new$Label <- "NEW"
dat_sjrs_all <- rbind(dat_sjrs, dat_sjrs_new)
# plot data by Source with clusters indicated
library(ggplot2)
p1 <- ggplot(dat_sjrs_all, aes(x = Ba138, y = Sr87, colour = Class))
p1 <- p1 + geom_point()#size = 2)
p1 <- p1 + labs(title = "Known and new observations by source classification")
p1 <- p1 + facet_grid(Label ~ Source_org)
print(p1)3.3 Application to Untagged Fish
Classification of NEW specimens achieved 90% accuracy (72/80 correct), with most (24/30) true SJR fish correctly identified. Misclassifications primarily involved NAP→SJR (7/13 errors), mirroring patterns in the training data. Posterior probabilities exceeded 0.85 for 88% of NEW classifications, suggesting high confidence.
4. Discussion
4.1 Ecological Implications
The isotopic signatures clearly distinguish hatchery-reared fish, likely reflecting artificial feed composition and water chemistry. The poorer NAP classification may stem from environmental similarity to SJR or greater natural variability in this pond population. The high SJR assignment rate for NEW fish suggests most untagged specimens are wild-origin, supporting natural reproduction in the San Juan River.
4.2 Methodological Considerations
While QDA accommodated unequal group covariances, the persistent NAP-SJR confusion indicates limitations when isotopic niches overlap. The 90% accuracy for NEW specimens—higher than cross-validated training performance—may reflect their predominantly SJR composition, the best-classified wild group. Future studies could incorporate additional isotopes (δ13C, δ15N) or geometric morphometrics to improve wild population discrimination.
4.3 Conservation Applications
This approach provides resource managers with a viable tool for monitoring hatchery-wild proportions when physical tags are lost. The minimal misclassification between hatcheries (DEX/GJH) ensures reliable evaluation of stocking program contributions. Ongoing validation should track classified fish via independent methods (e.g., genetics) to verify isotopic signatures remain diagnostic.
5. Conclusion
Our QDA model effectively classified Razorback Sucker origins using three stable isotopes, achieving >86% accuracy for known-origin fish and 90% for untagged specimens. The technique offers a practical, non-lethal method for assessing conservation program effectiveness, particularly for distinguishing hatchery versus wild fish. While wild population discrimination requires refinement, this approach demonstrates how biogeochemical tracers can address critical management questions for endangered species.
References
Erhardt, E. B., Bedrick, E. J., & Schrader, R. M. (2020). \(\textit{Lecture notes for Advanced Data Analysis 2 (ADA2) (Stat 428/528)}\). University of New Mexico.