Catch Per Unit Effort (CPUE) is one of the most commonly used indices of fish abundance in fisheries science. The basic idea is simple: if fish are more abundant, fishers should catch more fish for the same amount of effort.
\[\text{CPUE} = \frac{\text{Catch}}{\text{Effort}}\]
In longline fisheries, effort is typically measured as the number of hooks deployed. We often express CPUE as “number of fish per 1000 hooks” for easier interpretation.
Nominal (raw) CPUE can be misleading because it is affected by many factors beyond fish abundance:
CPUE standardization uses statistical models to remove these confounding effects, providing a cleaner signal of true abundance changes over time.
By the end of this tutorial, you will be able to:
We will use the following R packages:
library(dplyr) # Data manipulation
library(tidyr) # Data reshaping
library(ggplot2) # Plotting
library(glmmTMB) # Generalized Linear Mixed Models
library(emmeans) # Estimated Marginal Means (for standardized indices)
library(DHARMa) # Model diagnostics via simulation
library(readxl) # Load data from Excel
# Set seed for reproducibility
set.seed(123)Before any analysis, we must explore and understand our data structure and quality.
## tibble [5,913 × 10] (S3: tbl_df/tbl/data.frame)
## $ year : chr [1:5913] "1986" "1986" "1986" "1986" ...
## $ month : chr [1:5913] "10" "10" "11" "10" ...
## $ lat : num [1:5913] 26.5 26.5 27.5 41.5 28.5 27.5 37.5 27.5 26.5 26.5 ...
## $ lon : num [1:5913] -76.5 -75.5 -88.5 -65.5 -85.5 -86.5 -67.5 -87.5 -76.5 -76.5 ...
## $ light_color: chr [1:5913] "3" "3" "2" "3" ...
## $ hook_type : chr [1:5913] "4" "4" "4" "4" ...
## $ bait_type : chr [1:5913] "4" "4" "4" "4" ...
## $ hbf : chr [1:5913] "2" "2" "3" "3" ...
## $ hooks : num [1:5913] 246 242 420 420 552 423 300 648 246 254 ...
## $ catch_swo : num [1:5913] 9 9 4 7 0 3 9 5 7 5 ...
## year month lat lon
## Length:5913 Length:5913 Min. :-29.50 Min. :-92.50
## Class :character Class :character 1st Qu.: 27.50 1st Qu.:-78.50
## Mode :character Mode :character Median : 29.50 Median :-71.50
## Mean : 32.17 Mean :-70.73
## 3rd Qu.: 38.50 3rd Qu.:-66.50
## Max. : 51.50 Max. : 14.50
## light_color hook_type bait_type hbf
## Length:5913 Length:5913 Length:5913 Length:5913
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## hooks catch_swo
## Min. : 110.0 Min. : 0.000
## 1st Qu.: 415.0 1st Qu.: 1.000
## Median : 630.0 Median : 3.000
## Mean : 625.1 Mean : 3.317
## 3rd Qu.: 816.0 3rd Qu.: 5.000
## Max. :1662.0 Max. :64.000
| Variable | Description |
|---|---|
year |
Year of the fishing operation |
month |
Month (1-12) |
lat, lon |
Geographic coordinates |
light_color |
Light stick color used (categorical) |
hook_type |
Type of hook (categorical) |
bait_type |
Type of bait (categorical) |
hbf |
Hooks Between Floats - indicates fishing depth |
hooks |
Number of hooks deployed (effort) |
catch_swo |
Number of swordfish caught |
Data preparation is crucial for reliable analysis. We will process the data step by step.
In R, factors are used for categorical variables in statistical models. This tells R that these are discrete categories, not continuous numbers.
dat <- dat %>%
mutate(
year = factor(year),
month = factor(month),
light_color = factor(light_color),
hook_type = factor(hook_type),
bait_type = factor(bait_type)
)Let’s check the factor levels and their frequencies:
## N
## 1986 27
## 1987 210
## 1988 245
## 1989 274
## 1990 259
## 1991 253
## 1992 245
## 1993 241
## 1994 256
## 1995 275
## 1996 278
## 1997 257
## 1998 217
## 1999 206
## 2000 202
## 2001 193
## 2002 169
## 2003 167
## 2004 166
## 2005 140
## 2006 138
## 2007 153
## 2008 164
## 2009 169
## 2010 144
## 2011 152
## 2012 196
## 2013 191
## 2014 183
## 2015 143
## N
## 1 442
## 10 591
## 11 416
## 12 326
## 2 356
## 3 377
## 4 448
## 5 484
## 6 552
## 7 682
## 8 658
## 9 581
## N
## 1 259
## 2 1508
## 3 3408
## 4 738
Quarter groups months into seasons, capturing seasonal patterns with fewer parameters than using all 12 months. As this dataset is limited, we will start with Quarters on a simpler model
dat <- dat %>%
mutate(
month_num = as.numeric(as.character(month)),
quarter = factor(ceiling(month_num / 3))
)
# Verify quarter assignment
print(table(Month = dat$month, Quarter = dat$quarter))## Quarter
## Month 1 2 3 4
## 1 442 0 0 0
## 10 0 0 0 591
## 11 0 0 0 416
## 12 0 0 0 326
## 2 356 0 0 0
## 3 377 0 0 0
## 4 0 448 0 0
## 5 0 484 0 0
## 6 0 552 0 0
## 7 0 0 682 0
## 8 0 0 658 0
## 9 0 0 581 0
HBF (Hooks Between Floats) is an important variable in longline fisheries that indicates fishing depth:
We treat HBF as categorical variable (allows non-linear effects).
dat <- dat %>%
mutate(hbf_cat = factor(hbf))
# see HBF distribution:
print(cbind(N = table(dat$hbf_cat)))## N
## 2 748
## 3 1064
## 4 2214
## 5 1386
## 6 501
We aggregate geographic positions into 5×5 degree cells. This:
dat <- dat %>%
mutate(
# Floor to get lower-left corner, add 2.5 to get cell center
lat5 = floor(lat / 5) * 5 + 2.5,
lon5 = floor(lon / 5) * 5 + 2.5,
# Create unique cell identifier
cell5 = interaction(lat5, lon5, drop = TRUE)
)
cat("Number of unique 5×5° cells:", nlevels(dat$cell5), "\n")## Number of unique 5×5° cells: 77
CPUE = Catch Per Unit Effort
We standardize CPUE to “fish per 1000 hooks” for easier interpretation. This is what is usually called nominal CPUE
dat <- dat %>%
mutate(cpue_nominal = (catch_swo / hooks) * 1000)
# Nominal CPUE summary (N per 1000 hooks):
summary(dat$cpue_nominal)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.562 4.167 6.258 8.333 64.433
## Total records: 5913
## Year range: 1986 - 2015
cat("Records with zero catch:", sum(dat$catch_swo == 0),
"(", round(mean(dat$catch_swo == 0) * 100, 1), "%)\n")## Records with zero catch: 1113 ( 18.8 %)
Note: A high percentage of zeros is common in fisheries data. This is why we use appropriate statistical distributions (like the Negative Binomial) that can handle zeros. In a real and full work, other distributions should be tested and compared
Exploratory analysis helps us understand patterns in the data before modeling.
p_catch_hist <- ggplot(dat, aes(x = catch_swo)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
theme_bw(base_size = 12) +
labs(
title = "Distribution of Swordfish Catch per Set",
x = "Catch (number of SWO)",
y = "Frequency"
)
print(p_catch_hist)Observation: The catch distribution is highly right-skewed with many zeros. It is typical for fisheries data. This suggests we need a count-based model (not normal regression).
p_hooks_hist <- ggplot(dat, aes(x = hooks)) +
geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
theme_bw(base_size = 12) +
labs(
title = "Distribution of Fishing Effort",
x = "Number of Hooks per Set",
y = "Frequency"
)
print(p_hooks_hist)This is our unstandardized=nominal trend, what we see without any statistical adjustment.
nominal_by_year <- dat %>%
group_by(year) %>%
summarise(
mean_cpue = mean(cpue_nominal, na.rm = TRUE),
sd_cpue = sd(cpue_nominal, na.rm = TRUE),
se_cpue = sd_cpue / sqrt(n()),
median_cpue = median(cpue_nominal, na.rm = TRUE),
n_sets = n(),
total_catch = sum(catch_swo),
total_hooks = sum(hooks),
.groups = "drop"
)
print(nominal_by_year)## # A tibble: 30 × 8
## year mean_cpue sd_cpue se_cpue median_cpue n_sets total_catch total_hooks
## <fct> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1986 17.5 14.1 2.72 16.3 27 150 9968
## 2 1987 14.9 13.3 0.919 11.6 210 1237 99731
## 3 1988 17.0 13.0 0.830 15 245 1637 104081
## 4 1989 13.4 8.45 0.510 12.5 274 1540 117251
## 5 1990 10.7 7.15 0.444 10 259 1185 111114
## 6 1991 10.0 6.48 0.407 9.74 253 1120 116195
## 7 1992 8.72 6.94 0.443 7.67 245 1116 136163
## 8 1993 7.37 5.60 0.361 6.25 241 895 137208
## 9 1994 6.46 5.67 0.354 5.17 256 869 150317
## 10 1995 5.03 4.15 0.250 4 275 803 177673
## # ℹ 20 more rows
p_nominal_year <- ggplot(nominal_by_year,
aes(x = as.numeric(as.character(year)), y = mean_cpue)) +
geom_ribbon(aes(ymin = mean_cpue - 1.96 * se_cpue,
ymax = mean_cpue + 1.96 * se_cpue),
alpha = 0.2, fill = "blue") +
geom_line(linewidth = 1, color = "blue") +
geom_point(size = 2.5, color = "blue") +
theme_bw(base_size = 12) +
labs(
title = "Nominal SWO CPUE by Year",
subtitle = "Mean ± 95% CI (unstandardized)",
x = "Year",
y = "CPUE (N / 1000 hooks)"
)
print(p_nominal_year)Question: Is this trend reflecting true abundance changes, or could it be influenced by other factors?
p_month <- ggplot(dat, aes(x = month, y = cpue_nominal, fill = month)) +
geom_boxplot(outlier.alpha = 0.3) +
theme_bw(base_size = 12) +
labs(
title = "Nominal SWO CPUE by Month",
x = "Month",
y = "CPUE (N / 1000 hooks)"
) +
theme(legend.position = "none")
print(p_month)p_hbf <- ggplot(dat, aes(x = hbf_cat, y = cpue_nominal, fill = hbf_cat)) +
geom_boxplot(outlier.alpha = 0.3) +
theme_bw(base_size = 12) +
labs(
title = "Nominal SWO CPUE by Hooks Between Floats (HBF)",
subtitle = "Lower HBF = Shallower fishing",
x = "HBF Category",
y = "CPUE (N / 1000 hooks)"
) +
theme(legend.position = "none")
print(p_hbf)Think about it: Why do lower HBF values show higher swordfish CPUE? Consider swordfish vertical habitat preferences.
Create similar boxplots for:
light_colorhook_typebait_typeWhat patterns do you observe?
It’s important to check if we have data coverage across all factor combinations.
year_quarter_tab <- dat %>%
count(year, quarter) %>%
pivot_wider(names_from = quarter, values_from = n, values_fill = 0)
# Number of sets by Year and Quarter:
print(year_quarter_tab, n = 50)## # A tibble: 30 × 5
## year `4` `1` `2` `3`
## <fct> <int> <int> <int> <int>
## 1 1986 27 0 0 0
## 2 1987 43 36 53 78
## 3 1988 55 63 51 76
## 4 1989 69 56 63 86
## 5 1990 60 71 63 65
## 6 1991 64 45 59 85
## 7 1992 47 54 59 85
## 8 1993 35 61 53 92
## 9 1994 70 36 60 90
## 10 1995 55 53 78 89
## 11 1996 58 56 77 87
## 12 1997 43 56 66 92
## 13 1998 54 33 46 84
## 14 1999 59 34 58 55
## 15 2000 55 41 52 54
## 16 2001 41 29 59 64
## 17 2002 42 31 48 48
## 18 2003 39 41 47 40
## 19 2004 45 31 58 32
## 20 2005 25 29 47 39
## 21 2006 34 31 32 41
## 22 2007 31 31 30 61
## 23 2008 32 32 54 46
## 24 2009 24 30 49 66
## 25 2010 33 26 32 53
## 26 2011 39 28 36 49
## 27 2012 46 44 42 64
## 28 2013 39 37 37 78
## 29 2014 40 28 45 70
## 30 2015 29 32 30 52
Important: Gaps in data coverage can cause problems for standardization. We need sufficient data across factor combinations. In this case, 1986 might be a problem. We will proceed, but keep this in mind…
grid_summary <- dat %>%
group_by(lat5, lon5) %>%
summarise(
n_sets = n(),
mean_cpue = mean(cpue_nominal, na.rm = TRUE),
total_hooks = sum(hooks, na.rm = TRUE),
.groups = "drop"
)p_spatial_cpue <- ggplot(grid_summary, aes(x = lon5, y = lat5)) +
geom_tile(aes(fill = mean_cpue), width = 5, height = 5) +
scale_fill_viridis_c(option = "plasma", name = "Mean CPUE\n(N/1000 hooks)") +
borders("world", colour = "gray50", fill = "gray80") +
coord_quickmap(xlim = range(grid_summary$lon5) + c(-5, 5),
ylim = range(grid_summary$lat5) + c(-5, 5)) +
theme_bw(base_size = 12) +
labs(
title = "Spatial Distribution of Nominal SWO CPUE",
x = "Longitude",
y = "Latitude"
)
print(p_spatial_cpue)p_spatial_effort <- ggplot(grid_summary, aes(x = lon5, y = lat5)) +
geom_tile(aes(fill = n_sets), width = 5, height = 5) +
scale_fill_viridis_c(option = "viridis", name = "Number\nof Sets") +
borders("world", colour = "gray50", fill = "gray80") +
coord_quickmap(xlim = range(grid_summary$lon5) + c(-5, 5),
ylim = range(grid_summary$lat5) + c(-5, 5)) +
theme_bw(base_size = 12) +
labs(
title = "Spatial Distribution of Fishing Effort",
x = "Longitude",
y = "Latitude"
)
print(p_spatial_effort)Traditional linear regression assumes normally distributed residuals with constant variance. Fisheries catch data violate these assumptions because:
Generalized Linear Models (GLMs) with appropriate distributions handle these issues. Generalized Linear Mixed Models (GLMMs) extend GLMs and allow random variables, which can be quite usefull for some variables (we will test with one).
We use the Negative Binomial distribution because:
The model has the form:
\[\log(\mu_i) = \beta_0 + \beta_1 \cdot \text{Year}_i + \beta_2 \cdot \text{Quarter}_i + ... + \log(\text{hooks}_i)\]
The \(\log(\text{hooks})\) term is called an offset: This adjusts the expected count based on effort, effectively modeling CPUE.
model_data <- dat %>%
filter(
!is.na(catch_swo),
!is.na(hooks),
hooks > 0
) %>%
droplevels()
cat("Data ready for modeling:", nrow(model_data), "records\n")## Data ready for modeling: 5913 records
We build models sequentially, adding one term at a time. This helps us understand the contribution of each factor.
The simplest model - only intercept and effort offset.
This is the factor we’re most interested in - it represents the abundance trend.
Accounts for seasonal variation.
Accounts for targeting behavior (fishing depth).
Accounts for gear configuration differences.
Accounts for spatial variation using a random effect for 5×5° cells.
Let’s examine the structure of our most complete model:
## Family: nbinom1 ( log )
## Formula:
## catch_swo ~ year + quarter + hbf_cat + light_color + offset(log(hooks)) +
## (1 | cell5)
## Data: model_data
##
## AIC BIC logLik -2*log(L) df.resid
## 23580.9 23861.7 -11748.5 23496.9 5871
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## cell5 (Intercept) 0.3048 0.5521
## Number of obs: 5913, groups: cell5, 77
##
## Dispersion parameter for nbinom1 family (): 0.579
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.31558 0.13618 -31.69 < 2e-16 ***
## year1987 -0.08913 0.11171 -0.80 0.424948
## year1988 -0.05842 0.11012 -0.53 0.595756
## year1989 -0.26772 0.11012 -2.43 0.015046 *
## year1990 -0.41419 0.11148 -3.72 0.000203 ***
## year1991 -0.42364 0.11198 -3.78 0.000155 ***
## year1992 -0.40242 0.11253 -3.58 0.000349 ***
## year1993 -0.55437 0.11474 -4.83 1.36e-06 ***
## year1994 -0.51323 0.11518 -4.46 8.35e-06 ***
## year1995 -0.67136 0.11604 -5.79 7.24e-09 ***
## year1996 -0.77739 0.11757 -6.61 3.78e-11 ***
## year1997 -0.71832 0.11797 -6.09 1.14e-09 ***
## year1998 -0.60760 0.11731 -5.18 2.22e-07 ***
## year1999 -0.68549 0.12029 -5.70 1.21e-08 ***
## year2000 -0.60041 0.11990 -5.01 5.51e-07 ***
## year2001 -0.69861 0.12095 -5.78 7.66e-09 ***
## year2002 -0.82379 0.12530 -6.57 4.88e-11 ***
## year2003 -0.86107 0.12781 -6.74 1.61e-11 ***
## year2004 -0.75475 0.12492 -6.04 1.52e-09 ***
## year2005 -0.85829 0.12905 -6.65 2.91e-11 ***
## year2006 -0.83997 0.12737 -6.59 4.26e-11 ***
## year2007 -0.86342 0.12741 -6.78 1.23e-11 ***
## year2008 -0.88622 0.12623 -7.02 2.20e-12 ***
## year2009 -1.02581 0.12641 -8.11 4.87e-16 ***
## year2010 -0.83826 0.12541 -6.68 2.32e-11 ***
## year2011 -0.89195 0.12605 -7.08 1.48e-12 ***
## year2012 -0.90043 0.12356 -7.29 3.16e-13 ***
## year2013 -0.95720 0.12653 -7.56 3.88e-14 ***
## year2014 -0.84742 0.12700 -6.67 2.51e-11 ***
## year2015 -0.89871 0.12856 -6.99 2.73e-12 ***
## quarter2 0.04923 0.03065 1.61 0.108160
## quarter3 0.20340 0.03076 6.61 3.79e-11 ***
## quarter4 0.01234 0.03146 0.39 0.694928
## hbf_cat3 -0.28441 0.02889 -9.85 < 2e-16 ***
## hbf_cat4 -0.77149 0.03704 -20.83 < 2e-16 ***
## hbf_cat5 -0.71557 0.04080 -17.54 < 2e-16 ***
## hbf_cat6 -1.11789 0.05128 -21.80 < 2e-16 ***
## light_color2 -0.32072 0.05371 -5.97 2.35e-09 ***
## light_color3 0.34567 0.04750 7.28 3.41e-13 ***
## light_color4 0.12425 0.05104 2.43 0.014910 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We compare models using AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). Lower values indicate better fit (accounting for model complexity).
model_comparison <- data.frame(
Model = c("M0: Null",
"M1: +Year",
"M2: +Quarter",
"M3: +HBF",
"M4: +Light",
"M5: +Cell5 RE"),
df = c(attr(logLik(m0), "df"),
attr(logLik(m1), "df"),
attr(logLik(m2), "df"),
attr(logLik(m3), "df"),
attr(logLik(m4), "df"),
attr(logLik(m5), "df")),
logLik = round(c(as.numeric(logLik(m0)),
as.numeric(logLik(m1)),
as.numeric(logLik(m2)),
as.numeric(logLik(m3)),
as.numeric(logLik(m4)),
as.numeric(logLik(m5))), 1),
AIC = round(c(AIC(m0), AIC(m1), AIC(m2), AIC(m3), AIC(m4), AIC(m5)), 1),
BIC = round(c(BIC(m0), BIC(m1), BIC(m2), BIC(m3), BIC(m4), BIC(m5)), 1)
) %>%
mutate(
deltaAIC = round(AIC - min(AIC), 1),
deltaBIC = round(BIC - min(BIC), 1)
)
print(model_comparison)## Model df logLik AIC BIC deltaAIC deltaBIC
## 1 M0: Null 2 -14252.9 28509.8 28523.1 4928.9 4661.4
## 2 M1: +Year 31 -13344.9 26751.9 26959.1 3171.0 3097.4
## 3 M2: +Quarter 34 -13297.4 26662.8 26890.1 3081.9 3028.4
## 4 M3: +HBF 38 -12807.4 25690.8 25944.8 2109.9 2083.1
## 5 M4: +Light 41 -12438.8 24959.5 25233.6 1378.6 1371.9
## 6 M5: +Cell5 RE 42 -11748.5 23580.9 23861.7 0.0 0.0
Interpretation: The model with the lowest AIC/BIC is preferred. Each additional term should substantially improve fit to justify the added complexity.
We can formally test with LRT (for nested models) whether each term significantly improves the model:
Interpretation: A significant p-value (< 0.05) indicates that the additional term significantly improves model fit.
Before trusting our results, we must verify that the model assumptions are met.
The DHARMa package uses simulation to create standardized residuals that should be uniformly distributed if the model is correct.
# Simulate residuals
# Note: this may take a few minutes. Reduce to n=200 if it takes too long
sim_resid <- simulateResiduals(final_model, n = 500)Left panel (QQ plot): Points should follow the diagonal line if residuals are uniformly distributed.
Right panel (Residuals vs. Predicted): Should show no patterns; residuals should be scattered uniformly.
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.060771, p-value < 2.2e-16
## alternative hypothesis: two-sided
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.60859, p-value = 0.12
## alternative hypothesis: two.sided
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.0617, p-value = 0.756
## alternative hypothesis: two.sided
Interpretation: Non-significant p-values (> 0.05) indicate no evidence of problems. If tests are significant, we may need to adjust our model.
The emmeans package calculates Estimated
Marginal Means - the predicted values for each year while
averaging over all other factors.
# Reference effort for standardization
E0 <- 1000 # hooks
# Get estimated marginal means for year
std_index <- emmeans(
final_model,
~ year,
type = "response", # Back-transform to count scale
re.form = NA, # Marginalize over random effects
at = list(hooks = E0) # Reference effort = 1000 hooks
) %>%
as.data.frame() %>%
transmute(
year = as.integer(as.character(year)),
std_cpue = response, # Standardized CPUE (per 1000 hooks)
std_lwr = asymp.LCL, # Lower 95% CI
std_upr = asymp.UCL # Upper 95% CI
) %>%
arrange(year)
# Standardized SWO CPUE Index (per 1000 hooks):
print(std_index)## year std_cpue std_lwr std_upr
## 1 1986 8.313219 6.455522 10.705504
## 2 1987 7.604302 6.480699 8.922712
## 3 1988 7.841478 6.703350 9.172844
## 4 1989 6.360600 5.431274 7.448941
## 5 1990 5.494021 4.680375 6.449113
## 6 1991 5.442321 4.636630 6.388014
## 7 1992 5.559047 4.738968 6.521042
## 8 1993 4.775410 4.058781 5.618569
## 9 1994 4.975948 4.230268 5.853070
## 10 1995 4.248175 3.610557 4.998396
## 11 1996 3.820784 3.237555 4.509078
## 12 1997 4.053298 3.431123 4.788292
## 13 1998 4.527854 3.838706 5.340723
## 14 1999 4.188557 3.520684 4.983126
## 15 2000 4.560521 3.844840 5.409419
## 16 2001 4.133954 3.475413 4.917278
## 17 2002 3.647546 3.028086 4.393730
## 18 2003 3.514070 2.903261 4.253387
## 19 2004 3.908291 3.246156 4.705485
## 20 2005 3.523855 2.898033 4.284822
## 21 2006 3.588999 2.962101 4.348574
## 22 2007 3.505844 2.894787 4.245887
## 23 2008 3.426813 2.836995 4.139255
## 24 2009 2.980328 2.466252 3.601562
## 25 2010 3.595171 2.982998 4.332975
## 26 2011 3.407206 2.823588 4.111453
## 27 2012 3.378461 2.817694 4.050830
## 28 2013 3.192012 2.642974 3.855104
## 29 2014 3.562357 2.946236 4.307323
## 30 2015 3.384255 2.789418 4.105939
# Combine standardized and nominal indices
comparison_data <- std_index %>%
left_join(
nominal_by_year %>%
mutate(year = as.integer(as.character(year))) %>%
select(year, nom_cpue = mean_cpue, n_sets),
by = "year"
)
# Scale both series to mean = 1 for easier comparison
mean_std <- mean(comparison_data$std_cpue, na.rm = TRUE)
mean_nom <- mean(comparison_data$nom_cpue, na.rm = TRUE)
comparison_data <- comparison_data %>%
mutate(
std_index = std_cpue / mean_std,
std_lwr_index = std_lwr / mean_std,
std_upr_index = std_upr / mean_std,
nom_index = nom_cpue / mean_nom
)
# Relative Indices (scaled to mean = 1):
print(comparison_data %>%
select(year, nom_index, std_index, std_lwr_index, std_upr_index, n_sets))## year nom_index std_index std_lwr_index std_upr_index n_sets
## 1 1986 2.8764921 1.8540471 1.4397360 2.3875839 27
## 2 1987 2.4513239 1.6959415 1.4453511 1.9899784 210
## 3 1988 2.7902330 1.7488375 1.4950076 2.0457640 245
## 4 1989 2.1925962 1.4185662 1.2113042 1.6612922 274
## 5 1990 1.7643476 1.2252983 1.0438357 1.4383068 259
## 6 1991 1.6484135 1.2137679 1.0340795 1.4246801 253
## 7 1992 1.4327506 1.2398007 1.0569033 1.4543485 245
## 8 1993 1.2097666 1.0650309 0.9052055 1.2530755 241
## 9 1994 1.0605448 1.1097556 0.9434511 1.3053749 256
## 10 1995 0.8255221 0.9474448 0.8052407 1.1147620 275
## 11 1996 0.7214637 0.8521263 0.7220523 1.0056325 278
## 12 1997 0.7266626 0.9039825 0.7652227 1.0679039 257
## 13 1998 0.9355129 1.0098200 0.8561234 1.1911092 217
## 14 1999 0.7002450 0.9341486 0.7851968 1.1113566 206
## 15 2000 0.7209616 1.0171053 0.8574914 1.2064299 202
## 16 2001 0.6597572 0.9219707 0.7751004 1.0966708 193
## 17 2002 0.5748648 0.8134902 0.6753358 0.9799070 169
## 18 2003 0.4846928 0.7837219 0.6474967 0.9486071 167
## 19 2004 0.4986325 0.8716425 0.7239706 1.0494358 166
## 20 2005 0.5324449 0.7859042 0.6463308 0.9556180 140
## 21 2006 0.5458242 0.8004328 0.6606194 0.9698362 138
## 22 2007 0.5464681 0.7818871 0.6456069 0.9469345 153
## 23 2008 0.5215654 0.7642613 0.6327177 0.9231531 164
## 24 2009 0.4811761 0.6646846 0.5500332 0.8032346 169
## 25 2010 0.5622652 0.8018093 0.6652800 0.9663573 144
## 26 2011 0.5692866 0.7598884 0.6297277 0.9169526 152
## 27 2012 0.5005349 0.7534778 0.6284133 0.9034321 196
## 28 2013 0.4510023 0.7118951 0.5894465 0.8597806 191
## 29 2014 0.5013742 0.7944909 0.6570811 0.9606361 183
## 30 2015 0.5132747 0.7547699 0.6221070 0.9157227 143
p_cpue_abs <- ggplot(comparison_data, aes(x = year)) +
geom_ribbon(aes(ymin = std_lwr, ymax = std_upr),
alpha = 0.2, fill = "blue") +
geom_line(aes(y = std_cpue, color = "Standardized"), linewidth = 1.2) +
geom_point(aes(y = std_cpue, color = "Standardized"), size = 3) +
geom_line(aes(y = nom_cpue, color = "Nominal"),
linewidth = 1, linetype = "dashed") +
geom_point(aes(y = nom_cpue, color = "Nominal"),
size = 2, shape = 21, fill = "white") +
scale_color_manual(
values = c("Standardized" = "blue", "Nominal" = "red"),
name = "CPUE Type"
) +
theme_bw(base_size = 12) +
labs(
title = "SWO CPUE: Nominal vs. Standardized",
x = "Year",
y = "CPUE (N / 1000 hooks)"
) +
theme(legend.position = "bottom")
print(p_cpue_abs)Scaling both indices to mean = 1 makes trends easier to compare:
p_index <- ggplot(comparison_data, aes(x = year)) +
geom_ribbon(aes(ymin = std_lwr_index, ymax = std_upr_index),
alpha = 0.2, fill = "blue") +
geom_line(aes(y = std_index, color = "Standardized"), linewidth = 1.2) +
geom_point(aes(y = std_index, color = "Standardized"), size = 3) +
geom_line(aes(y = nom_index, color = "Nominal"),
linewidth = 1, linetype = "dashed") +
geom_point(aes(y = nom_index, color = "Nominal"),
size = 2, shape = 21, fill = "white") +
geom_hline(yintercept = 1, linetype = "dotted", color = "gray50") +
scale_color_manual(
values = c("Standardized" = "blue", "Nominal" = "red"),
name = "Index Type"
) +
theme_bw(base_size = 12) +
labs(
title = "SWO Abundance Index: Nominal vs. Standardized",
subtitle = "Both scaled to mean = 1",
x = "Year",
y = "Relative Index"
) +
theme(legend.position = "bottom")
print(p_index)Prepared the data: Converted variables to appropriate types, created derived variables (quarter, spatial cells), calculated nominal CPUE
Explored the data: Examined distributions, temporal patterns, spatial patterns, and relationships with explanatory variables
Built statistical models: Used Negative Binomial GLMs and GLMMs with an effort offset, sequentially adding terms
Selected the best model: Used AIC/BIC and likelihood ratio tests
Diagnosed the model: Verified assumptions with residual analysis
Extracted standardized indices: Used emmeans (marginal means) to get year effects while controlling (i.e., standardizing) for other factors
Our final model accounted for:
Create boxplots showing CPUE relationships with
light_color, hook_type, and
bait_type. What patterns do you observe?
What does the AIC comparison tell us about model fit? Why does adding each term improve (or not improve) the model?
Try adding hook_type and bait_type to the
model. Does it improve the fit? Does it change the standardized index?
Does it create any issues?
Replace quarter with month in the model.
How does this affect the results? Are there any trade-offs?