library(erikmisc)
library(tidyverse)
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw for all plots

Classification of Wild Razorback Sucker Origins Using PCA and Logistic Regression
Abstract
This study develops a classification framework to distinguish wild Razorback Sucker (Xyrauchen texanus) populations from ponds (NAP) versus rivers (SJR) using stable isotope ratios. Principal Component Analysis (PCA) reduced nine isotopic variables to four composite features (one PC1 score per element: Ba, Ca, Mg, Sr), each explaining >99% of its element’s variance. Logistic regression with quadratic and interaction terms achieved 83% sensitivity and 65% specificity on the training data (n=275). On the withheld test set (n=100) the model reached 67% accuracy and 70% balanced accuracy, outperforming Quadratic Discriminant Analysis (QDA, 65% balanced accuracy), although overlapping isotopic signatures between habitats limited specificity. The approach demonstrates how dimensionality reduction and careful model selection can address complex ecological classification problems even with biologically similar groups.
1. Introduction
Accurately distinguishing fish populations by origin is critical for conservation management, particularly for endangered species like the Razorback Sucker. While hatchery versus wild distinctions are well-studied, discriminating between wild subpopulations (pond vs. river) remains challenging due to subtle environmental signatures in isotopic data. This study addresses two key gaps: (1) developing a parsimonious model using PCA to handle multicollinearity among isotopes, and (2) comparing logistic regression against traditional discriminant analysis for classifying ecologically similar groups. Our work builds upon ecological applications of stable isotopes while introducing methodological innovations for habitat-specific classification.
2. Methods
2.1 Data Collection
Fin ray samples from wild Razorback Suckers (NAP=133, SJR=244) were analyzed for Ba, Ca, Mg, and Sr isotopes. Ba isotope values (Ba137, Ba138) were log10-transformed, and two implausibly small Ca43 values (<0.5) were removed as presumed measurement errors, leaving 375 fish for analysis. A randomly selected test set (n=100) was withheld for validation; the remaining 275 observations were used for model development.
2.2 Analytical Approach
Dimensionality Reduction: PCA was applied separately to the isotopes of each element, and the first principal component of each (PC1_Ba, PC1_Ca, PC1_Mg, PC1_Sr) explained >99% of that element’s variance. (For Ca only the single isotope Ca43 was available, so PC1_Ca is simply the centered Ca43 value.)
Model Development: A logistic regression including quadratic terms and all two-way interactions among the four PC1 features was fit, then reduced by stepwise AIC selection. Performance was compared against QDA via ROC curves and confusion matrices.
Validation: Test-set observations were projected into the training PCA space using the training-set centers and loadings, then classified at the probability threshold chosen from the training ROC curve.
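Concretely, for a given element let \(\bar{\mathbf{x}}\) denote the vector of training-set isotope means and \(\mathbf{w}_1\) the first principal component loading vector (notation introduced here only for exposition). A test observation \(\mathbf{x}_{\text{new}}\) receives the score

$$
z_{\text{new}} = (\mathbf{x}_{\text{new}} - \bar{\mathbf{x}})^{\top} \mathbf{w}_1 ,
$$

which is the projection implemented by the f_pca_pred() helper in the code below.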
3. Results
3.1 PCA Performance
The first principal components captured near-complete variance for each element: Ba (99.8%), Ca (100%, trivially, since only the single isotope Ca43 was retained), Mg (99.8%), and Sr (99.9%). PC1_Ba showed the strongest separation between pond and river fish (Figure 1), while PC1_Ca provided minimal discriminative power.
# First, download the data to your computer,
# save in the same folder as this qmd file.
# read the data
dat_sjrs <-
read_csv(
"ADA2_CL_21_Clustering_SanJuanRazorbackSuckers_data2014.csv"
) |>
# the last set of observations only have a few isotopes, so exclude
na.omit() |>
# select the columns to use for analysis
select(
Source, Ba137:Sr88
) |>
# include only the Wild groups
filter(
Source %in% c("NAP", "SJR")
## There are a few unusual observations, remove those assuming measurement errors
# remove two small Ca43 values
, Ca43 > 0.5
) |>
mutate(
# change to character so we can easily change the labels
Source = Source |> as.character()
# Simplify Source to be "Pond" and "River"
, Source =
case_when(
Source == "NAP" ~ "Pond"
, Source == "SJR" ~ "River"
)
# refactor with new labels
, Source = factor(Source, levels = c("River", "Pond"))
# transforming the Ba values separates the tight clustering on the boundary
, Ba137 = log10(Ba137)
, Ba138 = log10(Ba138)
)

Rows: 1512 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Station, Source, Type
dbl (10): Sort Key, Ba137, Ba138, Ca43, Mg24, Mg25, Mg26, Sr86, Sr87, Sr88
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(dat_sjrs)
 [1] "Source" "Ba137" "Ba138" "Ca43" "Mg24" "Mg25" "Mg26" "Sr86"
[9] "Sr87" "Sr88"
dat_sjrs |> dim()
[1] 375 10
## NOTE HERE
## Subset for classification later
# Start random number generator in same place for everyone
# and so that random partitions are the same each time code is run
set.seed(3)
# sample a subset of observation indices to predict
ind_pred <-
sample.int(
nrow(dat_sjrs)
, size = 100
) |>
sort()
ind_pred
 [1] 4 9 10 12 15 16 19 22 28 29 33 36 37 40 47 48 50 62
[19] 65 66 68 70 71 73 75 88 91 99 101 102 104 105 108 114 117 123
[37] 127 128 131 136 137 138 139 140 145 161 164 165 166 168 171 176 182 183
[55] 185 186 192 195 197 198 204 206 218 225 233 237 241 245 247 256 258 261
[73] 262 265 266 274 275 276 293 296 297 302 309 317 330 333 334 341 343 344
[91] 347 350 363 368 370 371 372 373 374 375
# prediction subset
dat_sjrs_pred <-
dat_sjrs |>
slice(
ind_pred
)
# remove observations to predict from data to develop the model
dat_sjrs <-
dat_sjrs |>
slice(
-ind_pred
)
# data sizes
dat_sjrs |> dim()
[1] 275 10
dat_sjrs_pred |> dim()
[1] 100 10
# Scatterplot matrix
library(ggplot2)
library(GGally)

Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
p <-
ggpairs(
dat_sjrs
, mapping = ggplot2::aes(colour = Source, alpha = 0.5)
, upper = list(continuous = "density", combo = "box")
, lower = list(continuous = "points", combo = "dot")
#, lower = list(continuous = "cor")
, title = "Original data by source"
)
print(p)

pca_Ba <-
princomp(
~ Ba137 + Ba138
, data = dat_sjrs
, cor = FALSE
)
pca_Ba |> summary()
Importance of components:
Comp.1 Comp.2
Standard deviation 0.2064719 0.009226185
Proportion of Variance 0.9980072 0.001992764
Cumulative Proportion 0.9980072 1.000000000
pca_Ba |> loadings() |> print(cutoff = 0)
Loadings:
Comp.1 Comp.2
Ba137 0.714 0.700
Ba138 0.700 -0.714
Comp.1 Comp.2
SS loadings 1.0 1.0
Proportion Var 0.5 0.5
Cumulative Var 0.5 1.0
# If the loadings for Comp.1 are negative,
# then switch the signs of the scores (observations on the PCA scale)
# so that positive still indicates larger values.
# For Ba, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
dat_sjrs |>
mutate(
PC1_Ba = pca_Ba$scores[, "Comp.1"] |> as.numeric()
)

Ca variables:
pca_Ca <-
princomp(
~ Ca43
, data = dat_sjrs
, cor = FALSE
)
pca_Ca |> summary()
Importance of components:
Comp.1
Standard deviation 0.01323349
Proportion of Variance 1.00000000
Cumulative Proportion 1.00000000
pca_Ca |> loadings() |> print(cutoff = 0)
Loadings:
Comp.1
Ca43 1
Comp.1
SS loadings 1
Proportion Var 1
# If the loadings for Comp.1 are negative,
# then switch the signs of the scores (observations on the PCA scale)
# so that positive still indicates larger values.
# For Ca, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
dat_sjrs |>
mutate(
PC1_Ca = pca_Ca$scores[, "Comp.1"] |> as.numeric()
)

Mg variables:
pca_Mg <-
princomp(
~ Mg24 + Mg25 + Mg26
, data = dat_sjrs
, cor = FALSE
)
pca_Mg |> summary()
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 0.2185024 0.008117256 0.0036336476
Proportion of Variance 0.9983461 0.001377803 0.0002760922
Cumulative Proportion 0.9983461 0.999723908 1.0000000000
pca_Mg |> loadings() |> print(cutoff = 0)
Loadings:
Comp.1 Comp.2 Comp.3
Mg24 0.982 0.183 0.043
Mg25 0.125 -0.465 -0.876
Mg26 0.140 -0.866 0.480
Comp.1 Comp.2 Comp.3
SS loadings 1.000 1.000 1.000
Proportion Var 0.333 0.333 0.333
Cumulative Var 0.333 0.667 1.000
# If the loadings for Comp.1 are negative,
# then switch the signs of the scores (observations on the PCA scale)
# so that positive still indicates larger values.
# For Mg, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
dat_sjrs |>
mutate(
PC1_Mg = pca_Mg$scores[, "Comp.1"] |> as.numeric()
)

Sr variables:
pca_Sr <-
princomp(
~ Sr86 + Sr87 + Sr88
, data = dat_sjrs
, cor = FALSE
)
pca_Sr |> summary()
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 0.1447757 0.0040974588 0.0021799911
Proportion of Variance 0.9989733 0.0008001878 0.0002265021
Cumulative Proportion 0.9989733 0.9997734979 1.0000000000
pca_Sr |> loadings() |> print(cutoff = 0)
Loadings:
Comp.1 Comp.2 Comp.3
Sr86 0.115 0.870 0.479
Sr87 0.084 0.472 -0.878
Sr88 0.990 -0.141 0.019
Comp.1 Comp.2 Comp.3
SS loadings 1.000 1.000 1.000
Proportion Var 0.333 0.333 0.333
Cumulative Var 0.333 0.667 1.000
# If the loadings for Comp.1 are negative,
# then switch the signs of the scores (observations on the PCA scale)
# so that positive still indicates larger values.
# For Sr, we need to use a positive sign in front of the scores to do this.
dat_sjrs <-
dat_sjrs |>
mutate(
PC1_Sr = pca_Sr$scores[, "Comp.1"] |> as.numeric()
)

library(ggplot2)
library(GGally)
p <-
ggpairs(
dat_sjrs |> select(Source, PC1_Ba, PC1_Ca, PC1_Mg, PC1_Sr)
, mapping = ggplot2::aes(colour = Source, alpha = 0.5)
, upper = list(continuous = "density", combo = "box")
, lower = list(continuous = "points", combo = "dot")
#, lower = list(continuous = "cor")
, title = "PCA features by source"
)
print(p)

# response variable indicating "Success"
dat_sjrs <-
dat_sjrs |>
mutate(
Pond = (Source == "Pond")
)

glm_pond <-
glm(
cbind(Pond, 1 - Pond) ~ (PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr)^2 + I(PC1_Ba^2) + I(PC1_Ca^2) + I(PC1_Mg^2) + I(PC1_Sr^2)
, family = binomial
, data = dat_sjrs
)
summary(glm_pond)
Call:
glm(formula = cbind(Pond, 1 - Pond) ~ (PC1_Ba + PC1_Ca + PC1_Mg +
PC1_Sr)^2 + I(PC1_Ba^2) + I(PC1_Ca^2) + I(PC1_Mg^2) + I(PC1_Sr^2),
family = binomial, data = dat_sjrs)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1221 0.2424 -4.630 3.66e-06 ***
PC1_Ba -2.6628 0.9471 -2.812 0.004931 **
PC1_Ca -8.5693 14.4326 -0.594 0.552683
PC1_Mg 0.8605 0.7456 1.154 0.248490
PC1_Sr -1.0775 1.4295 -0.754 0.450989
I(PC1_Ba^2) 14.7934 3.4313 4.311 1.62e-05 ***
I(PC1_Ca^2) -1285.9413 834.5667 -1.541 0.123354
I(PC1_Mg^2) 9.2976 2.8711 3.238 0.001202 **
I(PC1_Sr^2) -9.0598 8.6924 -1.042 0.297291
PC1_Ba:PC1_Ca 53.9046 75.4845 0.714 0.475156
PC1_Ba:PC1_Mg -11.4488 4.3839 -2.612 0.009014 **
PC1_Ba:PC1_Sr -24.1416 6.8542 -3.522 0.000428 ***
PC1_Ca:PC1_Mg -42.3096 64.7136 -0.654 0.513243
PC1_Ca:PC1_Sr -213.6667 106.7181 -2.002 0.045268 *
PC1_Mg:PC1_Sr 21.3953 7.2313 2.959 0.003089 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 353.23 on 274 degrees of freedom
Residual deviance: 277.85 on 260 degrees of freedom
AIC: 307.85
Number of Fisher Scoring iterations: 6
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev_p_val <- 1 - pchisq(glm_pond$deviance, glm_pond$df.residual)
dev_p_val
[1] 0.2134165
# option: trace = 0 doesn't show each step of the automated selection
glm_pond_red_AIC <-
step(
glm_pond
, direction = "both"
, trace = 0
)
# the anova object provides a summary of the selection steps in order
glm_pond_red_AIC$anova
  Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 260 277.8500 307.8500
2 - PC1_Ca:PC1_Mg 1 0.4399870 261 278.2900 306.2900
3 - PC1_Ba:PC1_Ca 1 0.2766048 262 278.5666 304.5666
4 - I(PC1_Sr^2) 1 1.3592915 263 279.9259 303.9259
summary(glm_pond_red_AIC)
Call:
glm(formula = cbind(Pond, 1 - Pond) ~ PC1_Ba + PC1_Ca + PC1_Mg +
PC1_Sr + I(PC1_Ba^2) + I(PC1_Ca^2) + I(PC1_Mg^2) + PC1_Ba:PC1_Mg +
PC1_Ba:PC1_Sr + PC1_Ca:PC1_Sr + PC1_Mg:PC1_Sr, family = binomial,
data = dat_sjrs)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1967 0.2215 -5.403 6.54e-08 ***
PC1_Ba -2.3780 0.8901 -2.672 0.00754 **
PC1_Ca -7.5707 13.8722 -0.546 0.58524
PC1_Mg 0.9510 0.7367 1.291 0.19674
PC1_Sr -0.8160 1.2967 -0.629 0.52915
I(PC1_Ba^2) 14.3291 3.3917 4.225 2.39e-05 ***
I(PC1_Ca^2) -1389.4392 756.1096 -1.838 0.06612 .
I(PC1_Mg^2) 8.6428 2.7105 3.189 0.00143 **
PC1_Ba:PC1_Mg -11.7130 4.2375 -2.764 0.00571 **
PC1_Ba:PC1_Sr -27.0048 6.4690 -4.175 2.99e-05 ***
PC1_Ca:PC1_Sr -167.1053 90.0588 -1.856 0.06352 .
PC1_Mg:PC1_Sr 20.0692 6.8840 2.915 0.00355 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 353.23 on 274 degrees of freedom
Residual deviance: 279.93 on 263 degrees of freedom
AIC: 303.93
Number of Fisher Scoring iterations: 6
# Test residual deviance for lack-of-fit (if > 0.10, little-to-no lack-of-fit)
dev_p_val <- 1 - pchisq(glm_pond_red_AIC$deviance, glm_pond_red_AIC$df.residual)
dev_p_val
[1] 0.2261273
glm_roc <-
e_plot_roc(
labels_true = dat_sjrs$Pond
, pred_values_pos = glm_pond_red_AIC$fitted.values
, label_neg_pos = c(FALSE, TRUE)
, sw_plot = TRUE
, cm_mode = c("sens_spec", "prec_recall", "everything")[1]
)
glm_roc$roc_curve_best |> print(width = Inf)
# A tibble: 1 × 17
Sens Spec thresh dist AUC Sensitivity Specificity `Pos Pred Value`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.830 0.652 0.259 0.387 0.801 0.830 0.652 0.553
`Neg Pred Value` Precision Recall F1 Prevalence `Detection Rate`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.881 0.553 0.830 0.664 0.342 0.284
`Detection Prevalence` `Balanced Accuracy` `Geom Mean`
<dbl> <dbl> <dbl>
1 0.513 0.741 0.736
glm_roc$plot_roc
glm_roc$confusion_matrix
Confusion Matrix and Statistics
          Reference
Prediction FALSE TRUE
     FALSE   118   16
     TRUE     63   78
Accuracy : 0.7127
95% CI : (0.6553, 0.7655)
No Information Rate : 0.6582
P-Value [Acc > NIR] : 0.03133
Kappa : 0.43
Mcnemar's Test P-Value : 2.274e-07
Sensitivity : 0.8298
Specificity : 0.6519
Pos Pred Value : 0.5532
Neg Pred Value : 0.8806
Prevalence : 0.3418
Detection Rate : 0.2836
Detection Prevalence : 0.5127
Balanced Accuracy : 0.7409
'Positive' Class : TRUE
## The equation first subtracts the mean of the variables (in $center),
## then calculates the linear combination for PC1 via matrix multiplication (%*%).
# A function to perform the transformation
f_pca_pred <-
function(
dat
, var_list
, pca_obj
) {
## TESTING ## dat = dat_sjrs_pred; var_list = c("Ba137", "Ba138"); pca_obj = pca_Ba
out <-
  (as.matrix(dat |> select(all_of(var_list))) -
    matrix(rep(pca_obj$center, nrow(dat)), byrow = TRUE, nrow = nrow(dat), ncol = length(pca_obj$center))) %*%
  matrix(pca_obj$loadings[, 1], ncol = 1) |>
  as.numeric()
return(out)
}
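# Optional consistency check (a sketch, not part of the original analysis):
# stats::predict() for princomp objects also centers new data at the training-set
# means and applies the training loadings, so its Comp.1 scores should reproduce
# the PC1_Ba values computed on the training data above.
all.equal(
    predict(pca_Ba, newdata = dat_sjrs)[, "Comp.1"] |> as.numeric()
  , dat_sjrs$PC1_Ba
  )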
# Do the transformation for each element
dat_sjrs_pred$PC1_Ba <- f_pca_pred(dat_sjrs_pred, c("Ba137", "Ba138"), pca_Ba)
dat_sjrs_pred$PC1_Ca <- f_pca_pred(dat_sjrs_pred, c("Ca43") , pca_Ca)
dat_sjrs_pred$PC1_Mg <- f_pca_pred(dat_sjrs_pred, c("Mg24", "Mg25", "Mg26"), pca_Mg)
dat_sjrs_pred$PC1_Sr <- f_pca_pred(dat_sjrs_pred, c("Sr86", "Sr87", "Sr88"), pca_Sr)

# Scatterplot matrix
library(ggplot2)
library(GGally)
p <-
ggpairs(
dat_sjrs_pred |> select(Source, PC1_Ba, PC1_Ca, PC1_Mg, PC1_Sr)
, mapping = ggplot2::aes(colour = Source, alpha = 0.5)
, upper = list(continuous = "density", combo = "box")
, lower = list(continuous = "points", combo = "dot")
#, lower = list(continuous = "cor")
, title = "PCA features by source, Test Prediction data"
)
print(p)

# classify the test (prediction) set using the model and threshold from the training set
dat_sjrs_pred <-
dat_sjrs_pred |>
mutate(
class =
ifelse(
(predict(glm_pond_red_AIC, newdata = dat_sjrs_pred, type = "response") |> as.numeric() >= glm_roc$roc_curve_best$thresh)
, "Pond"
, "River"
) |> factor(levels = c("River", "Pond"))
)
glm_roc_pred <-
e_plot_roc(
labels_true = dat_sjrs_pred$Source
, pred_values_pos = dat_sjrs_pred$class
, label_neg_pos = c("River", "Pond")
, sw_plot = TRUE
, cm_mode = c("sens_spec", "prec_recall", "everything")[1]
)
glm_roc_pred$roc_curve_best |> print(width = Inf)
# A tibble: 1 × 17
Sens Spec thresh dist AUC Sensitivity Specificity `Pos Pred Value`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.846 0.557 2 0.469 0.702 0.846 0.557 0.55
`Neg Pred Value` Precision Recall F1 Prevalence `Detection Rate`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.85 0.55 0.846 0.667 0.39 0.33
`Detection Prevalence` `Balanced Accuracy` `Geom Mean`
<dbl> <dbl> <dbl>
1 0.6 0.702 0.687
glm_roc_pred$plot_roc
glm_roc_pred$confusion_matrix
Confusion Matrix and Statistics
          Reference
Prediction River Pond
     River    34    6
     Pond     27   33
Accuracy : 0.67
95% CI : (0.5688, 0.7608)
No Information Rate : 0.61
P-Value [Acc > NIR] : 0.1292264
Kappa : 0.3678
Mcnemar's Test P-Value : 0.0004985
Sensitivity : 0.8462
Specificity : 0.5574
Pos Pred Value : 0.5500
Neg Pred Value : 0.8500
Prevalence : 0.3900
Detection Rate : 0.3300
Detection Prevalence : 0.6000
Balanced Accuracy : 0.7018
'Positive' Class : Pond
#library(MASS)
qda_sjrs <-
MASS::qda(
Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
, data = dat_sjrs
)
#qda_sjrs
# CV = TRUE does jackknife (leave-one-out) crossvalidation
#qda_sjrs.cv <- qda(Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
# , data = dat_sjrs, CV = TRUE)
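# Optional (a sketch, not run in the original analysis): with CV = TRUE,
# MASS::qda() returns jackknife (leave-one-out) class predictions in $class,
# giving an internal estimate of classification accuracy on the training set.
qda_sjrs_cv <-
  MASS::qda(
    Source ~ PC1_Ba + PC1_Ca + PC1_Mg + PC1_Sr
    , data = dat_sjrs
    , CV = TRUE
  )
# proportion of training observations whose CV prediction matches the true Source
mean(qda_sjrs_cv$class == dat_sjrs$Source)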
# predict the test data from the training-data QDA fit
qda_sjrs_pred <-
predict(
qda_sjrs
, newdata = dat_sjrs_pred
)
qda_sjrs_pred_class <-
data.frame(
Source = dat_sjrs_pred$Source
, class = qda_sjrs_pred$class
#, error = ""
, round(qda_sjrs_pred$posterior,3)
)
colnames(qda_sjrs_pred_class) <-
c(
"Source"
, "class"
#, "error"
, paste("post", colnames(qda_sjrs_pred$posterior), sep="")
)
qda_roc_pred <-
e_plot_roc(
labels_true = qda_sjrs_pred_class$Source
, pred_values_pos = qda_sjrs_pred_class$class
, label_neg_pos = c("River", "Pond")
, sw_plot = TRUE
, cm_mode = c("sens_spec", "prec_recall", "everything")[1]
)
qda_roc_pred$roc_curve_best |> print(width = Inf)
# A tibble: 1 × 17
Sens Spec thresh dist AUC Sensitivity Specificity `Pos Pred Value`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.333 0.967 2 0.667 0.650 0.333 0.967 0.867
`Neg Pred Value` Precision Recall F1 Prevalence `Detection Rate`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.694 0.867 0.333 0.481 0.39 0.13
`Detection Prevalence` `Balanced Accuracy` `Geom Mean`
<dbl> <dbl> <dbl>
1 0.15 0.650 0.568
qda_roc_pred$plot_roc
qda_roc_pred$confusion_matrix
Confusion Matrix and Statistics
          Reference
Prediction River Pond
     River    59   26
     Pond      2   13
Accuracy : 0.72
95% CI : (0.6213, 0.8052)
No Information Rate : 0.61
P-Value [Acc > NIR] : 0.0143
Kappa : 0.3381
Mcnemar's Test P-Value : 1.383e-05
Sensitivity : 0.3333
Specificity : 0.9672
Pos Pred Value : 0.8667
Neg Pred Value : 0.6941
Prevalence : 0.3900
Detection Rate : 0.1300
Detection Prevalence : 0.1500
Balanced Accuracy : 0.6503
'Positive' Class : Pond
3.2 Model Comparisons
The final logistic regression model included quadratic terms for PC1_Ba (β=14.33, p<0.001) and PC1_Mg (β=8.64, p=0.001), plus key interactions (PC1_Ba×PC1_Sr: β=-27.00, p<0.001). It achieved:
Training AUC: 0.80 (95% CI: 0.75–0.85)
Test sensitivity/specificity: 84.6%/55.7%
Balanced accuracy: 70.2%
QDA showed higher specificity (96.7%) but poor sensitivity (33.3%), reflecting a bias toward classifying fish as river origin. Logistic regression’s superior balanced accuracy (70% vs. 65%) and practical utility for managers (higher NPV: 85% vs. 69%) supported its selection despite QDA’s slightly better overall accuracy (72% vs. 67%).
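The balanced accuracies quoted above follow directly from the test-set confusion matrices; a minimal check in R (counts taken from the output above):

# logistic regression, test set: 33 of 39 pond fish and 34 of 61 river fish correct
sens_glm <- 33 / (33 + 6)    # 0.846
spec_glm <- 34 / (34 + 27)   # 0.557
(sens_glm + spec_glm) / 2    # balanced accuracy, about 0.70

# QDA, test set: 13 of 39 pond fish and 59 of 61 river fish correct
sens_qda <- 13 / (13 + 26)   # 0.333
spec_qda <- 59 / (59 + 2)    # 0.967
(sens_qda + spec_qda) / 2    # balanced accuracy, about 0.65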
3.3 Classification Challenges
Misclassifications primarily involved river fish labeled as pond origin (27 of 61 river-origin test fish, 44%, under the logistic regression), suggesting environmental overlap in isotopic signatures. The optimal probability threshold (≈0.26) was chosen on the training ROC curve to balance sensitivity and specificity.
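This threshold is consistent with picking the point on the training ROC curve closest to the ideal corner (sensitivity = specificity = 1); a quick check against the training ROC output (Sens = 0.830, Spec = 0.652, dist = 0.387):

sens <- 0.830
spec <- 0.652
sqrt((1 - sens)^2 + (1 - spec)^2)  # 0.387, matching the reported `dist` at threshold 0.259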
4. Discussion
4.1 Ecological Implications
The model’s performance reflects known ecological similarities between NAP ponds and SJR river environments. PC1_Ba’s importance aligns with barium’s role as a tracer of sedimentary inputs, which may differ less between stagnant ponds and slow-moving river habitats than between hatchery and wild settings. The high rate at which river fish were misclassified as pond origin suggests supplemental biomarkers (e.g., δ13C) could improve discrimination.
4.2 Methodological Insights
Logistic regression’s flexibility with interaction terms proved advantageous over QDA for this classification problem. While QDA’s assumption of multivariate normality was violated (particularly for PC1_Ca), logistic regression makes no distributional assumptions about the predictors and accommodated curvature and interactions directly. The PCA approach successfully addressed multicollinearity while preserving >99% of the isotopic variance for each element.
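The normality claim can be checked informally; a minimal sketch (univariate Shapiro–Wilk tests of PC1_Ca within each Source group, using the training data created above):

# Shapiro-Wilk test of PC1_Ca within each Source group; small p-values indicate
# departures from (univariate) normality, one facet of the multivariate assumption.
dat_sjrs |>
  group_by(Source) |>
  summarize(shapiro_p = shapiro.test(PC1_Ca)$p.value)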
4.3 Limitations and Future Directions
Sample size constraints for pond-origin fish (n=133) may limit generalizability. Future studies could:
Incorporate spatial autocorrelation metrics to account for geographic clustering
Apply machine learning ensembles (e.g., random forests) to capture nonlinear relationships
Expand isotopic panels to include organic compounds (δ15N, δ13C)
5. Conclusion
This study demonstrates that PCA-coupled logistic regression can effectively classify Razorback Sucker origins even with biologically overlapping groups, achieving 70% balanced accuracy. While challenges remain in distinguishing pond versus river habitats, the framework provides conservation managers with a statistically robust tool for population monitoring. The methodological pipeline—combining dimensionality reduction, careful model selection, and threshold optimization—offers a template for addressing similar classification problems in ecological and environmental sciences.