Classification of Wild Razorback Sucker Origins Using PCA and Logistic Regression

Author

Dennis Baidoo

Published

June 21, 2025

Abstract

This study develops a classification framework to distinguish wild Razorback Sucker (Xyrauchen texanus) populations originating from ponds (NAP) versus rivers (SJR) using stable isotope ratios. Principal Component Analysis (PCA) reduced nine isotopic variables to four composite features (the first principal component of the Ba, Ca, Mg, and Sr isotopes), each explaining >99% of its element's variance. Logistic regression with quadratic and interaction terms achieved 83% sensitivity and 65% specificity on training data (n=275) and outperformed Quadratic Discriminant Analysis (QDA) in test-set balanced accuracy (70% vs. 65%). Test set validation (n=100) confirmed the model's robustness (67% accuracy), though overlapping isotopic signatures between habitats posed challenges. The approach demonstrates how dimensionality reduction and careful model selection can address complex ecological classification problems even for biologically similar groups.

1. Introduction

Accurately distinguishing fish populations by origin is critical for conservation management, particularly for endangered species like the Razorback Sucker. While hatchery versus wild distinctions are well-studied, discriminating between wild subpopulations (pond vs. river) remains challenging due to subtle environmental signatures in isotopic data. This study addresses two key gaps: (1) developing a parsimonious model using PCA to handle multicollinearity among isotopes, and (2) comparing logistic regression against traditional discriminant analysis for classifying ecologically similar groups. Our work builds upon ecological applications of stable isotopes while introducing methodological innovations for habitat-specific classification.

2. Methods

2.1 Data Collection

Fin ray samples from wild Razorback Suckers were analyzed for nine isotopes of Ba, Ca, Mg, and Sr. Ba isotope values were log10-transformed to spread tight clustering near the boundary, and two observations with anomalously small Ca43 values (Ca43 ≤ 0.5) were removed as presumed measurement errors, leaving 375 fish (NAP = 133, SJR = 242). A randomly selected test set (n = 100) was withheld for validation.

2.2 Analytical Approach

  • Dimensionality Reduction: PCA was applied separately to the isotopes of each element, yielding four first-principal-component features (PC1_Ba, PC1_Ca, PC1_Mg, PC1_Sr), each explaining >99% of its element's variance.

  • Model Development: Logistic regression with quadratic and interaction terms was fit to the training data, then reduced by stepwise AIC selection. Performance was compared against QDA via ROC curves and confusion matrices.

  • Validation: Test set predictions were projected into the PCA space using training set loadings and centers to ensure consistency, as formalized in the equation below.
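
Concretely, for a test observation \(x\) on one element's isotopes, with training mean vector \(\bar{x}_{\text{train}}\) and first-component loading vector \(w_{1,\text{train}}\), the projected score is

\[
\mathrm{PC1}(x) = \left(x - \bar{x}_{\text{train}}\right)^{\top} w_{1,\text{train}},
\]

which the f_pca_pred() function in the Results section implements.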

3. Results

3.1 PCA Performance

The first principal components captured near-complete variance for each element: Ba (99.8%), Ca (100%; with a single isotope, PC1 is simply the centered Ca43 values), Mg (99.8%), and Sr (99.9%). PC1_Ba showed the strongest separation between pond and river fish (Figure 1), while PC1_Ca provided minimal discriminative power.

library(erikmisc)
library(tidyverse)
ggplot2::theme_set(ggplot2::theme_bw())  # set theme_bw for all plots
# First, download the data to your computer,
#   save in the same folder as this qmd file.

# read the data
dat_sjrs <-
  read_csv(
    "ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv"
  ) |>
  # the last set of observations only have a few isotopes, so exclude
  na.omit() |>
  # select the columns to use for analysis
  select(
    Source, Ba137:Sr88
  ) |>
  # include only the Wild groups
  filter(
    Source %in% c("NAP", "SJR")
    ## There are a few unusual observations; remove those, assuming measurement errors
    # remove two small Ca43 values
  , Ca43 > 0.5
  ) |>
  mutate(
    # change to character so we can easily change the labels
    Source = Source |> as.character()
    # Simplify Source to be "Pond" and "River"
  , Source =
      case_when(
        Source == "NAP" ~ "Pond"
      , Source == "SJR" ~ "River"
      )
    # refactor with new labels
  , Source = factor(Source, levels = c("River", "Pond"))
    # transforming the Ba values separates the tight clustering on the boundary
  , Ba137 = log10(Ba137)
  , Ba138 = log10(Ba138)
  )
Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88

names(dat_sjrs)
 [1] "Source" "Ba137"  "Ba138"  "Ca43"   "Mg24"   "Mg25"   "Mg26"   "Sr86"  
 [9] "Sr87"   "Sr88"  
dat_sjrs |> dim()
[1] 375  10
## NOTE HERE
## Subset for classification later
# Start random number generator in same place for everyone
#   and so that random partitions are the same each time code is run
set.seed(3)
# sample a subset of observation indices to predict
ind_pred <-
  sample.int(
    nrow(dat_sjrs)
  , size = 100
  ) |>
  sort()
ind_pred
  [1]   4   9  10  12  15  16  19  22  28  29  33  36  37  40  47  48  50  62
 [19]  65  66  68  70  71  73  75  88  91  99 101 102 104 105 108 114 117 123
 [37] 127 128 131 136 137 138 139 140 145 161 164 165 166 168 171 176 182 183
 [55] 185 186 192 195 197 198 204 206 218 225 233 237 241 245 247 256 258 261
 [73] 262 265 266 274 275 276 293 296 297 302 309 317 330 333 334 341 343 344
 [91] 347 350 363 368 370 371 372 373 374 375
# prediction subset
dat_sjrs_pred <-
  dat_sjrs |>
  slice(
    ind_pred
  )
# remove observations to predict from data to develop the model
dat_sjrs <-
  dat_sjrs |>
  slice(
    -ind_pred
  )

# data sizes
dat_sjrs      |> dim()
[1] 275  10
dat_sjrs_pred |> dim()
[1] 100  10
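As a quick added check of the split (not in the original analysis), the class balance in the training and test sets can be tabulated:

# (Added check) class balance in the training and test sets
dat_sjrs      |> count(Source)
dat_sjrs_pred |> count(Source)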
# Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
  ggpairs(
    dat_sjrs
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "Original data by source"
  )
print(p)

Ba variables:

pca_Ba <-
  princomp(
    ~ Ba137 + Ba138
  , data = dat_sjrs
  , cor  = FALSE
  )
pca_Ba |> summary()
Importance of components:
                          Comp.1      Comp.2
Standard deviation     0.2064719 0.009226185
Proportion of Variance 0.9980072 0.001992764
Cumulative Proportion  0.9980072 1.000000000
pca_Ba |> loadings() |> print(cutoff = 0)

Loadings:
      Comp.1 Comp.2
Ba137  0.714  0.700
Ba138  0.700 -0.714

               Comp.1 Comp.2
SS loadings       1.0    1.0
Proportion Var    0.5    0.5
Cumulative Var    0.5    1.0
# If the loadings for Comp.1 are negative,
#   then switch the signs of the scores (observations on the PCA scale)
#   so that positive still indicates larger values.
# For Ba, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
  dat_sjrs |>
  mutate(
    PC1_Ba = pca_Ba$scores[, "Comp.1"] |> as.numeric()
  )
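
A quick added sanity check of the sign convention: PC1_Ba should correlate positively with the raw (log10) Ba137 values, so larger scores indicate larger Ba values.

# (Added check) a positive correlation confirms no sign flip is needed
cor(dat_sjrs$PC1_Ba, dat_sjrs$Ba137)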

Ca variables:

pca_Ca <-
  princomp(
    ~ Ca43
  , data = dat_sjrs
  , cor  = FALSE
  )
pca_Ca |> summary()
Importance of components:
                           Comp.1
Standard deviation     0.01323349
Proportion of Variance 1.00000000
Cumulative Proportion  1.00000000
pca_Ca |> loadings() |> print(cutoff = 0)

Loadings:
     Comp.1
Ca43 1     

               Comp.1
SS loadings         1
Proportion Var      1
# If the loadings for Comp.1 are negative,
#   then switch the signs of the scores (observations on the PCA scale)
#   so that positive still indicates larger values.
# For Ca, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
  dat_sjrs |>
  mutate(
    PC1_Ca = pca_Ca$scores[, "Comp.1"] |> as.numeric()
  )

Mg variables:

pca_Mg <-
  princomp(
    ~ Mg24 + Mg25 + Mg26
  , data = dat_sjrs
  , cor  = FALSE
  )
pca_Mg |> summary()
Importance of components:
                          Comp.1      Comp.2       Comp.3
Standard deviation     0.2185024 0.008117256 0.0036336476
Proportion of Variance 0.9983461 0.001377803 0.0002760922
Cumulative Proportion  0.9983461 0.999723908 1.0000000000
pca_Mg |> loadings() |> print(cutoff = 0)

Loadings:
     Comp.1 Comp.2 Comp.3
Mg24  0.982  0.183  0.043
Mg25  0.125 -0.465 -0.876
Mg26  0.140 -0.866  0.480

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000
# If the loadings for Comp.1 are negative,
#   then switch the signs of the scores (observations on the PCA scale)
#   so that positive still indicates larger values.
# For Mg, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
  dat_sjrs |>
  mutate(
    PC1_Mg = pca_Mg$scores[, "Comp.1"] |> as.numeric()
  )

Sr variables:

pca_Sr <-
  princomp(
    ~ Sr86 + Sr87 + Sr88
  , data = dat_sjrs
  , cor  = FALSE
  )
pca_Sr |> summary()
Importance of components:
                          Comp.1       Comp.2       Comp.3
Standard deviation     0.1447757 0.0040974588 0.0021799911
Proportion of Variance 0.9989733 0.0008001878 0.0002265021
Cumulative Proportion  0.9989733 0.9997734979 1.0000000000
pca_Sr |> loadings() |> print(cutoff = 0)

Loadings:
     Comp.1 Comp.2 Comp.3
Sr86  0.115  0.870  0.479
Sr87  0.084  0.472 -0.878
Sr88  0.990 -0.141  0.019

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000
# If the loadings for Comp.1 are negative,
#   then switch the signs of the scores (observations on the PCA scale)
#   so that positive still indicates larger values.
# For Sr, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
  dat_sjrs |>
  mutate(
    PC1_Sr = pca_Sr$scores[, "Comp.1"] |> as.numeric()
  )
library(ggplot2)
library(GGally)
p <- 
  ggpairs(
    dat_sjrs |> select(Source, PC1_Ba, PC1_Ca, PC1_Mg, PC1_Sr)
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "PCA features by source"
  )

print(p)

# response variable indicating "Success"
dat_sjrs <-
  dat_sjrs |>
  mutate(
    Pond = (Source == "Pond")
  )
glm_pond <-
  glm(
    cbind(Pond, 1 - Pond) ~ (PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr)^2 + I(PC1_Ba^2) + I(PC1_Ca^2) + I(PC1_Mg^2) + I(PC1_Sr^2)
  , family = binomial
  , data = dat_sjrs
  )
summary(glm_pond)

Call:
glm(formula = cbind(Pond, 1 - Pond) ~ (PC1_Ba + PC1_Ca + PC1_Mg + 
    PC1_Sr)^2 + I(PC1_Ba^2) + I(PC1_Ca^2) + I(PC1_Mg^2) + I(PC1_Sr^2), 
    family = binomial, data = dat_sjrs)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.1221     0.2424  -4.630 3.66e-06 ***
PC1_Ba           -2.6628     0.9471  -2.812 0.004931 ** 
PC1_Ca           -8.5693    14.4326  -0.594 0.552683    
PC1_Mg            0.8605     0.7456   1.154 0.248490    
PC1_Sr           -1.0775     1.4295  -0.754 0.450989    
I(PC1_Ba^2)      14.7934     3.4313   4.311 1.62e-05 ***
I(PC1_Ca^2)   -1285.9413   834.5667  -1.541 0.123354    
I(PC1_Mg^2)       9.2976     2.8711   3.238 0.001202 ** 
I(PC1_Sr^2)      -9.0598     8.6924  -1.042 0.297291    
PC1_Ba:PC1_Ca    53.9046    75.4845   0.714 0.475156    
PC1_Ba:PC1_Mg   -11.4488     4.3839  -2.612 0.009014 ** 
PC1_Ba:PC1_Sr   -24.1416     6.8542  -3.522 0.000428 ***
PC1_Ca:PC1_Mg   -42.3096    64.7136  -0.654 0.513243    
PC1_Ca:PC1_Sr  -213.6667   106.7181  -2.002 0.045268 *  
PC1_Mg:PC1_Sr    21.3953     7.2313   2.959 0.003089 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 353.23  on 274  degrees of freedom
Residual deviance: 277.85  on 260  degrees of freedom
AIC: 307.85

Number of Fisher Scoring iterations: 6
# Test residual deviance for lack-of-fit (p-value > 0.10 indicates little-to-no lack-of-fit)
dev_p_val <- 1 - pchisq(glm_pond$deviance, glm_pond$df.residual)
dev_p_val
[1] 0.2134165
# option: trace = 0 doesn't show each step of the automated selection
glm_pond_red_AIC <-
  step(
    glm_pond
  , direction = "both"
  , trace = 0
  )

# the anova object provides a summary of the selection steps in order
glm_pond_red_AIC$anova
             Step Df  Deviance Resid. Df Resid. Dev      AIC
1                 NA        NA       260   277.8500 307.8500
2 - PC1_Ca:PC1_Mg  1 0.4399870       261   278.2900 306.2900
3 - PC1_Ba:PC1_Ca  1 0.2766048       262   278.5666 304.5666
4   - I(PC1_Sr^2)  1 1.3592915       263   279.9259 303.9259
summary(glm_pond_red_AIC)

Call:
glm(formula = cbind(Pond, 1 - Pond) ~ PC1_Ba + PC1_Ca + PC1_Mg + 
    PC1_Sr + I(PC1_Ba^2) + I(PC1_Ca^2) + I(PC1_Mg^2) + PC1_Ba:PC1_Mg + 
    PC1_Ba:PC1_Sr + PC1_Ca:PC1_Sr + PC1_Mg:PC1_Sr, family = binomial, 
    data = dat_sjrs)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.1967     0.2215  -5.403 6.54e-08 ***
PC1_Ba           -2.3780     0.8901  -2.672  0.00754 ** 
PC1_Ca           -7.5707    13.8722  -0.546  0.58524    
PC1_Mg            0.9510     0.7367   1.291  0.19674    
PC1_Sr           -0.8160     1.2967  -0.629  0.52915    
I(PC1_Ba^2)      14.3291     3.3917   4.225 2.39e-05 ***
I(PC1_Ca^2)   -1389.4392   756.1096  -1.838  0.06612 .  
I(PC1_Mg^2)       8.6428     2.7105   3.189  0.00143 ** 
PC1_Ba:PC1_Mg   -11.7130     4.2375  -2.764  0.00571 ** 
PC1_Ba:PC1_Sr   -27.0048     6.4690  -4.175 2.99e-05 ***
PC1_Ca:PC1_Sr  -167.1053    90.0588  -1.856  0.06352 .  
PC1_Mg:PC1_Sr    20.0692     6.8840   2.915  0.00355 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 353.23  on 274  degrees of freedom
Residual deviance: 279.93  on 263  degrees of freedom
AIC: 303.93

Number of Fisher Scoring iterations: 6
# Test residual deviance for lack-of-fit (p-value > 0.10 indicates little-to-no lack-of-fit)
dev_p_val <- 1 - pchisq(glm_pond_red_AIC$deviance, glm_pond_red_AIC$df.residual)
dev_p_val
[1] 0.2261273
glm_roc <-
  e_plot_roc(
    labels_true     = dat_sjrs$Pond
  , pred_values_pos = glm_pond_red_AIC$fitted.values
  , label_neg_pos   = c(FALSE, TRUE)
  , sw_plot         = TRUE
  , cm_mode         = c("sens_spec", "prec_recall", "everything")[1]
  )

glm_roc$roc_curve_best |> print(width = Inf)
# A tibble: 1 × 17
   Sens  Spec thresh  dist   AUC Sensitivity Specificity `Pos Pred Value`
  <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>       <dbl>            <dbl>
1 0.830 0.652  0.259 0.387 0.801       0.830       0.652            0.553
  `Neg Pred Value` Precision Recall    F1 Prevalence `Detection Rate`
             <dbl>     <dbl>  <dbl> <dbl>      <dbl>            <dbl>
1            0.881     0.553  0.830 0.664      0.342            0.284
  `Detection Prevalence` `Balanced Accuracy` `Geom Mean`
                   <dbl>               <dbl>       <dbl>
1                  0.513               0.741       0.736
glm_roc$plot_roc

glm_roc$confusion_matrix
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE   118   16
     TRUE     63   78
                                          
               Accuracy : 0.7127          
                 95% CI : (0.6553, 0.7655)
    No Information Rate : 0.6582          
    P-Value [Acc > NIR] : 0.03133         
                                          
                  Kappa : 0.43            
                                          
 Mcnemar's Test P-Value : 2.274e-07       
                                          
            Sensitivity : 0.8298          
            Specificity : 0.6519          
         Pos Pred Value : 0.5532          
         Neg Pred Value : 0.8806          
             Prevalence : 0.3418          
         Detection Rate : 0.2836          
   Detection Prevalence : 0.5127          
      Balanced Accuracy : 0.7409          
                                          
       'Positive' Class : TRUE            
                                          
## The equation first subtracts the mean of the variables (in $center),
##   then calculates the linear combination for PC1 via matrix multiplication (%*%).

# A function to perform the transformation
f_pca_pred <-
  function(
    dat
  , var_list
  , pca_obj
  ) {
  ## TESTING ## dat = dat_sjrs_pred; var_list = c("Ba137", "Ba138"); pca_obj = pca_Ba
  out <-
    (as.matrix(dat |> select(all_of(var_list))) -
    matrix(rep(pca_obj$center, nrow(dat)), byrow = TRUE, nrow = nrow(dat), ncol = length(pca_obj$center))) %*%
    matrix(pca_obj$loadings[, 1], ncol = 1) |>
    as.numeric()
  return(out)
}
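
As an added sanity check, the projection should reproduce the princomp scores when applied back to the training data, e.g., for Ba:

# (Added check) should be TRUE up to floating-point tolerance
all.equal(
    f_pca_pred(dat_sjrs, c("Ba137", "Ba138"), pca_Ba)
  , dat_sjrs$PC1_Ba
  )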

# Do the transformation for each element
dat_sjrs_pred$PC1_Ba <- f_pca_pred(dat_sjrs_pred, c("Ba137", "Ba138")      , pca_Ba)
dat_sjrs_pred$PC1_Ca <- f_pca_pred(dat_sjrs_pred, c("Ca43")                , pca_Ca)
dat_sjrs_pred$PC1_Mg <- f_pca_pred(dat_sjrs_pred, c("Mg24", "Mg25", "Mg26"), pca_Mg)
dat_sjrs_pred$PC1_Sr <- f_pca_pred(dat_sjrs_pred, c("Sr86", "Sr87", "Sr88"), pca_Sr)
# Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
  ggpairs(
    dat_sjrs_pred |> select(Source, PC1_Ba, PC1_Ca, PC1_Mg, PC1_Sr)
  , mapping = ggplot2::aes(colour = Source, alpha = 0.5)
  , upper = list(continuous = "density", combo = "box")
  , lower = list(continuous = "points", combo = "dot")
  #, lower = list(continuous = "cor")
  , title = "PCA features by source, Test Prediction data"
  )
print(p)

# classifications in the test (prediction) set
dat_sjrs_pred <-
  dat_sjrs_pred |>
  mutate(
    class =
      ifelse(
        (predict(glm_pond_red_AIC, newdata = dat_sjrs_pred, type = "response") |> as.numeric() >= glm_roc$roc_curve_best$thresh)
      , "Pond"
      , "River"
      ) |> factor(levels = c("River", "Pond"))
  )

glm_roc_pred <-
  e_plot_roc(
    labels_true     = dat_sjrs_pred$Source
  , pred_values_pos = dat_sjrs_pred$class
  , label_neg_pos   = c("River", "Pond")
  , sw_plot         = TRUE
  , cm_mode         = c("sens_spec", "prec_recall", "everything")[1]
  )

glm_roc_pred$roc_curve_best |> print(width = Inf)
# A tibble: 1 × 17
   Sens  Spec thresh  dist   AUC Sensitivity Specificity `Pos Pred Value`
  <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>       <dbl>            <dbl>
1 0.846 0.557      2 0.469 0.702       0.846       0.557             0.55
  `Neg Pred Value` Precision Recall    F1 Prevalence `Detection Rate`
             <dbl>     <dbl>  <dbl> <dbl>      <dbl>            <dbl>
1             0.85      0.55  0.846 0.667       0.39             0.33
  `Detection Prevalence` `Balanced Accuracy` `Geom Mean`
                   <dbl>               <dbl>       <dbl>
1                    0.6               0.702       0.687
glm_roc_pred$plot_roc

glm_roc_pred$confusion_matrix
Confusion Matrix and Statistics

          Reference
Prediction River Pond
     River    34    6
     Pond     27   33
                                          
               Accuracy : 0.67            
                 95% CI : (0.5688, 0.7608)
    No Information Rate : 0.61            
    P-Value [Acc > NIR] : 0.1292264       
                                          
                  Kappa : 0.3678          
                                          
 Mcnemar's Test P-Value : 0.0004985       
                                          
            Sensitivity : 0.8462          
            Specificity : 0.5574          
         Pos Pred Value : 0.5500          
         Neg Pred Value : 0.8500          
             Prevalence : 0.3900          
         Detection Rate : 0.3300          
   Detection Prevalence : 0.6000          
      Balanced Accuracy : 0.7018          
                                          
       'Positive' Class : Pond            
                                          
#library(MASS)
qda_sjrs <-
  MASS::qda(
    Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
  , data = dat_sjrs
  )
#qda_sjrs
# CV = TRUE does jackknife (leave-one-out) crossvalidation
#qda_sjrs.cv <- qda(Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
#                 , data = dat_sjrs, CV = TRUE)

# predict the test data from the training-data QDA fit
qda_sjrs_pred <-
  predict(
    qda_sjrs
  , newdata = dat_sjrs_pred
  )
qda_sjrs_pred_class <-
  data.frame(
    Source = dat_sjrs_pred$Source
  , class = qda_sjrs_pred$class
  #, error = ""
  , round(qda_sjrs_pred$posterior,3)
  )
colnames(qda_sjrs_pred_class) <-
  c(
    "Source"
  , "class"
  #, "error"
  , paste("post", colnames(qda_sjrs_pred$posterior), sep="")
  )

qda_roc_pred <-
  e_plot_roc(
    labels_true     = qda_sjrs_pred_class$Source
  , pred_values_pos = qda_sjrs_pred_class$class
  , label_neg_pos   = c("River", "Pond")
  , sw_plot         = TRUE
  , cm_mode         = c("sens_spec", "prec_recall", "everything")[1]
  )

qda_roc_pred$roc_curve_best |> print(width = Inf)
# A tibble: 1 × 17
   Sens  Spec thresh  dist   AUC Sensitivity Specificity `Pos Pred Value`
  <dbl> <dbl>  <dbl> <dbl> <dbl>       <dbl>       <dbl>            <dbl>
1 0.333 0.967      2 0.667 0.650       0.333       0.967            0.867
  `Neg Pred Value` Precision Recall    F1 Prevalence `Detection Rate`
             <dbl>     <dbl>  <dbl> <dbl>      <dbl>            <dbl>
1            0.694     0.867  0.333 0.481       0.39             0.13
  `Detection Prevalence` `Balanced Accuracy` `Geom Mean`
                   <dbl>               <dbl>       <dbl>
1                   0.15               0.650       0.568
qda_roc_pred$plot_roc

qda_roc_pred$confusion_matrix
Confusion Matrix and Statistics

          Reference
Prediction River Pond
     River    59   26
     Pond      2   13
                                          
               Accuracy : 0.72            
                 95% CI : (0.6213, 0.8052)
    No Information Rate : 0.61            
    P-Value [Acc > NIR] : 0.0143          
                                          
                  Kappa : 0.3381          
                                          
 Mcnemar's Test P-Value : 1.383e-05       
                                          
            Sensitivity : 0.3333          
            Specificity : 0.9672          
         Pos Pred Value : 0.8667          
         Neg Pred Value : 0.6941          
             Prevalence : 0.3900          
         Detection Rate : 0.1300          
   Detection Prevalence : 0.1500          
      Balanced Accuracy : 0.6503          
                                          
       'Positive' Class : Pond            
                                          

3.2 Model Comparisons

The final logistic regression model retained quadratic terms for PC1_Ba (β = 14.33, p < 0.001) and PC1_Mg (β = 8.64, p = 0.001), plus key interactions (e.g., PC1_Ba×PC1_Sr: β = -27.00, p < 0.001). It achieved:

  • Training AUC: 0.80 (95% CI: 0.75–0.85)

  • Test sensitivity/specificity: 84.6%/55.7%

  • Balanced accuracy: 70.2%

QDA showed higher specificity (96.7%) but poor sensitivity (33.3%), reflecting a bias toward river classifications. Balanced accuracy, the mean of sensitivity and specificity, was (0.846 + 0.557)/2 ≈ 70% for logistic regression versus (0.333 + 0.967)/2 ≈ 65% for QDA; this, together with logistic regression's higher negative predictive value (85%), supported its preference despite QDA's slightly better overall accuracy (72% vs. 67%).

3.3 Classification Challenges

Misclassifications primarily involved river fish labeled as pond origin (44% of river fish in the logistic regression test set), suggesting environmental overlap in isotopic signatures. The optimal probability threshold (0.26), selected to minimize the ROC distance to perfect sensitivity and specificity, traded specificity for sensitivity.
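
For intuition, the trade-off can be seen by re-classifying the test set at a different cutoff. The sketch below is added for illustration (the 0.5 cutoff is hypothetical); relative to the 0.26 threshold, it would be expected to trade false pond calls for false river calls.

# (Added sketch) re-classify the test set at a hypothetical 0.5 cutoff
p_hat <- predict(glm_pond_red_AIC, newdata = dat_sjrs_pred, type = "response")
table(
    Prediction = ifelse(p_hat >= 0.5, "Pond", "River")
  , Reference  = dat_sjrs_pred$Source
  )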

4. Discussion

4.1 Ecological Implications

The model’s performance reflects known ecological similarities between NAP ponds and the SJR river environment. PC1_Ba’s importance aligns with barium’s role as a tracer of sedimentary inputs, which may differ less between stagnant ponds and slow-moving river habitats than between hatchery and wild settings. The residual misclassification between pond and river fish suggests supplemental biomarkers (e.g., δ13C) could improve discrimination.

4.2 Methodological Insights

Logistic regression’s flexibility with quadratic and interaction terms proved advantageous over QDA for this classification problem. While QDA’s assumption of multivariate normality within each group appeared violated (particularly for PC1_Ca), logistic regression makes no distributional assumptions about the predictors and accommodated these deviations. The PCA approach successfully addressed multicollinearity while preserving >99% of isotopic variance per element.
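
The per-group normality claim can be probed informally; the following is a minimal added sketch using univariate Shapiro-Wilk tests on each PC1 feature within each Source group (small p-values suggest non-normality):

# (Added sketch) per-group univariate normality checks for the PC1 features
dat_sjrs |>
  group_by(Source) |>
  summarize(across(starts_with("PC1_"), ~ shapiro.test(.x)$p.value))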

4.3 Limitations and Future Directions

Sample size constraints for pond-origin fish (n=133) may limit generalizability. Future studies could:

  • Incorporate spatial autocorrelation metrics to account for geographic clustering

  • Apply machine learning ensembles (e.g., random forests; a minimal sketch follows this list) to capture nonlinear relationships

  • Expand isotopic panels to include organic compounds (δ15N, δ13C)
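
As one concrete direction, a random-forest baseline on the same four PCA features might look as follows (an added sketch, assuming the randomForest package is installed; not part of the original analysis):

# (Added sketch) random-forest baseline on the PCA features
set.seed(3)
rf_pond <-
  randomForest::randomForest(
    Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
  , data = dat_sjrs
  , ntree = 500
  )
# test-set confusion table, comparable to the logistic regression and QDA tables above
table(
    Prediction = predict(rf_pond, newdata = dat_sjrs_pred)
  , Reference  = dat_sjrs_pred$Source
  )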

5. Conclusion

This study demonstrates that PCA-coupled logistic regression can effectively classify Razorback Sucker origins even with biologically overlapping groups, achieving 70% balanced accuracy. While challenges remain in distinguishing pond versus river habitats, the framework provides conservation managers with a statistically robust tool for population monitoring. The methodological pipeline—combining dimensionality reduction, careful model selection, and threshold optimization—offers a template for addressing similar classification problems in ecological and environmental sciences.
