1 Introduction

1.1 What is CPUE?

Catch Per Unit Effort (CPUE) is one of the most commonly used indices of fish abundance in fisheries science. The basic idea is simple: if fish are more abundant, fishers should catch more fish for the same amount of effort.

\[\text{CPUE} = \frac{\text{Catch}}{\text{Effort}}\]

In longline fisheries, effort is typically measured as the number of hooks deployed. We often express CPUE as “number of fish per 1000 hooks” for easier interpretation.

1.2 Why Standardize CPUE?

Nominal (raw) CPUE can be misleading because it is affected by many factors beyond fish abundance:

Temporal factors: Season, month, time of day
Spatial factors: Fishing location, depth
Gear factors: Hook type, bait type, line configuration
Targeting behavior: What species fishers are trying to catch
Vessel effects: Skipper skill, vessel characteristics

CPUE standardization uses statistical models to remove these confounding effects, providing a cleaner signal of true abundance changes over time.

1.3 Learning Objectives

By the end of this tutorial, you will be able to:

Prepare fisheries data for CPUE analysis
Conduct exploratory data analysis
Fit Generalized Linear Models (GLMs) and Generalized Linear Mixed Models (GLMs) for count data
Perform model selection and diagnostics
Extract and interpret standardized abundance indices
Compare nominal vs. standardized CPUE trends

2 Setup

2.1 Load Required Packages

We will use the following R packages:

library(dplyr)      # Data manipulation
library(tidyr)      # Data reshaping
library(ggplot2)    # Plotting
library(glmmTMB)    # Generalized Linear Mixed Models
library(emmeans)    # Estimated Marginal Means (for standardized indices)
library(DHARMa)     # Model diagnostics via simulation
library(readxl)     # Load data from Excel

# Set seed for reproducibility
set.seed(123)

2.2 Load the Data

We will work with a simulated longline fisheries dataset containing swordfish SWO (Xiphias gladius) catch data.

# Load your data - adjust the path as needed
dat <- read_excel("LLsim_2.xlsx")

3 Data Exploration

Before any analysis, we must explore and understand our data structure and quality.

3.1 Data Structure

# View data structure
str(dat)

## tibble [5,913 × 10] (S3: tbl_df/tbl/data.frame)
##  $ year       : chr [1:5913] "1986" "1986" "1986" "1986" ...
##  $ month      : chr [1:5913] "10" "10" "11" "10" ...
##  $ lat        : num [1:5913] 26.5 26.5 27.5 41.5 28.5 27.5 37.5 27.5 26.5 26.5 ...
##  $ lon        : num [1:5913] -76.5 -75.5 -88.5 -65.5 -85.5 -86.5 -67.5 -87.5 -76.5 -76.5 ...
##  $ light_color: chr [1:5913] "3" "3" "2" "3" ...
##  $ hook_type  : chr [1:5913] "4" "4" "4" "4" ...
##  $ bait_type  : chr [1:5913] "4" "4" "4" "4" ...
##  $ hbf        : chr [1:5913] "2" "2" "3" "3" ...
##  $ hooks      : num [1:5913] 246 242 420 420 552 423 300 648 246 254 ...
##  $ catch_swo  : num [1:5913] 9 9 4 7 0 3 9 5 7 5 ...

# First few rows
head(dat, 10)

# Summary statistics
summary(dat)

##      year              month                lat              lon        
##  Length:5913        Length:5913        Min.   :-29.50   Min.   :-92.50  
##  Class :character   Class :character   1st Qu.: 27.50   1st Qu.:-78.50  
##  Mode  :character   Mode  :character   Median : 29.50   Median :-71.50  
##                                        Mean   : 32.17   Mean   :-70.73  
##                                        3rd Qu.: 38.50   3rd Qu.:-66.50  
##                                        Max.   : 51.50   Max.   : 14.50  
##  light_color         hook_type          bait_type             hbf           
##  Length:5913        Length:5913        Length:5913        Length:5913       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      hooks          catch_swo     
##  Min.   : 110.0   Min.   : 0.000  
##  1st Qu.: 415.0   1st Qu.: 1.000  
##  Median : 630.0   Median : 3.000  
##  Mean   : 625.1   Mean   : 3.317  
##  3rd Qu.: 816.0   3rd Qu.: 5.000  
##  Max.   :1662.0   Max.   :64.000

3.1.1 Understanding the Variables

Variable	Description
`year`	Year of the fishing operation
`month`	Month (1-12)
`lat`, `lon`	Geographic coordinates
`light_color`	Light stick color used (categorical)
`hook_type`	Type of hook (categorical)
`bait_type`	Type of bait (categorical)
`hbf`	Hooks Between Floats - indicates fishing depth
`hooks`	Number of hooks deployed (effort)
`catch_swo`	Number of swordfish caught

4 Data Preparation

Data preparation is crucial for reliable analysis. We will process the data step by step.

4.1 Step A: Convert Categorical Variables to Factors

In R, factors are used for categorical variables in statistical models. This tells R that these are discrete categories, not continuous numbers.

dat <- dat %>%
  mutate(
    year = factor(year),
    month = factor(month),
    light_color = factor(light_color),
    hook_type = factor(hook_type),
    bait_type = factor(bait_type)
  )

Let’s check the factor levels and their frequencies:

# Years in the dataset
print(cbind(N = table(dat$year)))

##        N
## 1986  27
## 1987 210
## 1988 245
## 1989 274
## 1990 259
## 1991 253
## 1992 245
## 1993 241
## 1994 256
## 1995 275
## 1996 278
## 1997 257
## 1998 217
## 1999 206
## 2000 202
## 2001 193
## 2002 169
## 2003 167
## 2004 166
## 2005 140
## 2006 138
## 2007 153
## 2008 164
## 2009 169
## 2010 144
## 2011 152
## 2012 196
## 2013 191
## 2014 183
## 2015 143

# Months
print(cbind(N = table(dat$month)))

##      N
## 1  442
## 10 591
## 11 416
## 12 326
## 2  356
## 3  377
## 4  448
## 5  484
## 6  552
## 7  682
## 8  658
## 9  581

# Light colors
print(cbind(N = table(dat$light_color)))

##      N
## 1  259
## 2 1508
## 3 3408
## 4  738

4.2 Step B: Create Derived Temporal Variables

Quarter groups months into seasons, capturing seasonal patterns with fewer parameters than using all 12 months. As this dataset is limited, we will start with Quarters on a simpler model

Q1: January - March (months 1-3)
Q2: April - June (months 4-6)
Q3: July - September (months 7-9)
Q4: October - December (months 10-12)

dat <- dat %>%
  mutate(
    month_num = as.numeric(as.character(month)),
    quarter = factor(ceiling(month_num / 3))
  )

# Verify quarter assignment
print(table(Month = dat$month, Quarter = dat$quarter))

##      Quarter
## Month   1   2   3   4
##    1  442   0   0   0
##    10   0   0   0 591
##    11   0   0   0 416
##    12   0   0   0 326
##    2  356   0   0   0
##    3  377   0   0   0
##    4    0 448   0   0
##    5    0 484   0   0
##    6    0 552   0   0
##    7    0   0 682   0
##    8    0   0 658   0
##    9    0   0 581   0

4.3 Step C: Create HBF Category

HBF (Hooks Between Floats) is an important variable in longline fisheries that indicates fishing depth:

Lower HBF = Shallower fishing = Usually more swordfish (SWO found in shallower waters at night, when this fishery operates)
Higher HBF = Deeper fishing = Usually more tunas

We treat HBF as categorical variable (allows non-linear effects).

dat <- dat %>%
  mutate(hbf_cat = factor(hbf))

# see HBF distribution:
print(cbind(N = table(dat$hbf_cat)))

##      N
## 2  748
## 3 1064
## 4 2214
## 5 1386
## 6  501

4.4 Step D: Create Spatial Grid Cells

We aggregate geographic positions into 5×5 degree cells. This:

Reduces spatial noise
Allows us to model spatial effects
Is a common resolution in fisheries analyses

dat <- dat %>%
  mutate(
    # Floor to get lower-left corner, add 2.5 to get cell center
    lat5 = floor(lat / 5) * 5 + 2.5,
    lon5 = floor(lon / 5) * 5 + 2.5,
    # Create unique cell identifier
    cell5 = interaction(lat5, lon5, drop = TRUE)
  )

cat("Number of unique 5×5° cells:", nlevels(dat$cell5), "\n")

## Number of unique 5×5° cells: 77

4.5 Step E: Calculate Nominal CPUE

CPUE = Catch Per Unit Effort

We standardize CPUE to “fish per 1000 hooks” for easier interpretation. This is what is usually called nominal CPUE

dat <- dat %>%
  mutate(cpue_nominal = (catch_swo / hooks) * 1000)

# Nominal CPUE summary (N per 1000 hooks):
summary(dat$cpue_nominal)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.562   4.167   6.258   8.333  64.433

4.6 Data Summary

cat("Total records:", nrow(dat), "\n")

## Total records: 5913

cat("Year range:", paste(range(as.numeric(as.character(dat$year))), collapse = " - "), "\n")

## Year range: 1986 - 2015

cat("Records with zero catch:", sum(dat$catch_swo == 0), 
    "(", round(mean(dat$catch_swo == 0) * 100, 1), "%)\n")

## Records with zero catch: 1113 ( 18.8 %)

Note: A high percentage of zeros is common in fisheries data. This is why we use appropriate statistical distributions (like the Negative Binomial) that can handle zeros. In a real and full work, other distributions should be tested and compared

5 Exploratory Data Analysis

Exploratory analysis helps us understand patterns in the data before modeling.

5.1 Distribution of Catch

p_catch_hist <- ggplot(dat, aes(x = catch_swo)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  theme_bw(base_size = 12) +
  labs(
    title = "Distribution of Swordfish Catch per Set",
    x = "Catch (number of SWO)", 
    y = "Frequency"
  )
print(p_catch_hist)

Observation: The catch distribution is highly right-skewed with many zeros. It is typical for fisheries data. This suggests we need a count-based model (not normal regression).

5.2 Distribution of Effort

p_hooks_hist <- ggplot(dat, aes(x = hooks)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  theme_bw(base_size = 12) +
  labs(
    title = "Distribution of Fishing Effort",
    x = "Number of Hooks per Set", 
    y = "Frequency"
  )
print(p_hooks_hist)

5.3 Nominal CPUE by Year

This is our unstandardized=nominal trend, what we see without any statistical adjustment.

nominal_by_year <- dat %>%
  group_by(year) %>%
  summarise(
    mean_cpue = mean(cpue_nominal, na.rm = TRUE),
    sd_cpue = sd(cpue_nominal, na.rm = TRUE),
    se_cpue = sd_cpue / sqrt(n()),
    median_cpue = median(cpue_nominal, na.rm = TRUE),
    n_sets = n(),
    total_catch = sum(catch_swo),
    total_hooks = sum(hooks),
    .groups = "drop"
  )

print(nominal_by_year)

## # A tibble: 30 × 8
##    year  mean_cpue sd_cpue se_cpue median_cpue n_sets total_catch total_hooks
##    <fct>     <dbl>   <dbl>   <dbl>       <dbl>  <int>       <dbl>       <dbl>
##  1 1986      17.5    14.1    2.72        16.3      27         150        9968
##  2 1987      14.9    13.3    0.919       11.6     210        1237       99731
##  3 1988      17.0    13.0    0.830       15       245        1637      104081
##  4 1989      13.4     8.45   0.510       12.5     274        1540      117251
##  5 1990      10.7     7.15   0.444       10       259        1185      111114
##  6 1991      10.0     6.48   0.407        9.74    253        1120      116195
##  7 1992       8.72    6.94   0.443        7.67    245        1116      136163
##  8 1993       7.37    5.60   0.361        6.25    241         895      137208
##  9 1994       6.46    5.67   0.354        5.17    256         869      150317
## 10 1995       5.03    4.15   0.250        4       275         803      177673
## # ℹ 20 more rows

p_nominal_year <- ggplot(nominal_by_year, 
                         aes(x = as.numeric(as.character(year)), y = mean_cpue)) +
  geom_ribbon(aes(ymin = mean_cpue - 1.96 * se_cpue, 
                  ymax = mean_cpue + 1.96 * se_cpue), 
              alpha = 0.2, fill = "blue") +
  geom_line(linewidth = 1, color = "blue") +
  geom_point(size = 2.5, color = "blue") +
  theme_bw(base_size = 12) +
  labs(
    title = "Nominal SWO CPUE by Year",
    subtitle = "Mean ± 95% CI (unstandardized)",
    x = "Year", 
    y = "CPUE (N / 1000 hooks)"
  )
print(p_nominal_year)

Question: Is this trend reflecting true abundance changes, or could it be influenced by other factors?

5.4 CPUE by Month

p_month <- ggplot(dat, aes(x = month, y = cpue_nominal, fill = month)) +
  geom_boxplot(outlier.alpha = 0.3) +
  theme_bw(base_size = 12) +
  labs(
    title = "Nominal SWO CPUE by Month",
    x = "Month", 
    y = "CPUE (N / 1000 hooks)"
  ) +
  theme(legend.position = "none")
print(p_month)

5.5 CPUE by Hooks Between Floats (HBF)

p_hbf <- ggplot(dat, aes(x = hbf_cat, y = cpue_nominal, fill = hbf_cat)) +
  geom_boxplot(outlier.alpha = 0.3) +
  theme_bw(base_size = 12) +
  labs(
    title = "Nominal SWO CPUE by Hooks Between Floats (HBF)",
    subtitle = "Lower HBF = Shallower fishing",
    x = "HBF Category", 
    y = "CPUE (N / 1000 hooks)"
  ) +
  theme(legend.position = "none")
print(p_hbf)

Think about it: Why do lower HBF values show higher swordfish CPUE? Consider swordfish vertical habitat preferences.

5.5.1 Exercise 1

Create similar boxplots for:

CPUE by light_color
CPUE by hook_type
CPUE by bait_type

What patterns do you observe?

5.6 Data Availability

It’s important to check if we have data coverage across all factor combinations.

year_quarter_tab <- dat %>%
  count(year, quarter) %>%
  pivot_wider(names_from = quarter, values_from = n, values_fill = 0)

# Number of sets by Year and Quarter:
print(year_quarter_tab, n = 50)

## # A tibble: 30 × 5
##    year    `4`   `1`   `2`   `3`
##    <fct> <int> <int> <int> <int>
##  1 1986     27     0     0     0
##  2 1987     43    36    53    78
##  3 1988     55    63    51    76
##  4 1989     69    56    63    86
##  5 1990     60    71    63    65
##  6 1991     64    45    59    85
##  7 1992     47    54    59    85
##  8 1993     35    61    53    92
##  9 1994     70    36    60    90
## 10 1995     55    53    78    89
## 11 1996     58    56    77    87
## 12 1997     43    56    66    92
## 13 1998     54    33    46    84
## 14 1999     59    34    58    55
## 15 2000     55    41    52    54
## 16 2001     41    29    59    64
## 17 2002     42    31    48    48
## 18 2003     39    41    47    40
## 19 2004     45    31    58    32
## 20 2005     25    29    47    39
## 21 2006     34    31    32    41
## 22 2007     31    31    30    61
## 23 2008     32    32    54    46
## 24 2009     24    30    49    66
## 25 2010     33    26    32    53
## 26 2011     39    28    36    49
## 27 2012     46    44    42    64
## 28 2013     39    37    37    78
## 29 2014     40    28    45    70
## 30 2015     29    32    30    52

Important: Gaps in data coverage can cause problems for standardization. We need sufficient data across factor combinations. In this case, 1986 might be a problem. We will proceed, but keep this in mind…

5.7 Spatial Distribution

grid_summary <- dat %>%
  group_by(lat5, lon5) %>%
  summarise(
    n_sets = n(),
    mean_cpue = mean(cpue_nominal, na.rm = TRUE),
    total_hooks = sum(hooks, na.rm = TRUE),
    .groups = "drop"
  )

5.7.1 Spatial Distribution of CPUE

p_spatial_cpue <- ggplot(grid_summary, aes(x = lon5, y = lat5)) +
  geom_tile(aes(fill = mean_cpue), width = 5, height = 5) +
  scale_fill_viridis_c(option = "plasma", name = "Mean CPUE\n(N/1000 hooks)") +
  borders("world", colour = "gray50", fill = "gray80") +
  coord_quickmap(xlim = range(grid_summary$lon5) + c(-5, 5),
                 ylim = range(grid_summary$lat5) + c(-5, 5)) +
  theme_bw(base_size = 12) +
  labs(
    title = "Spatial Distribution of Nominal SWO CPUE",
    x = "Longitude", 
    y = "Latitude"
  )
print(p_spatial_cpue)

5.7.2 Spatial Distribution of Effort

p_spatial_effort <- ggplot(grid_summary, aes(x = lon5, y = lat5)) +
  geom_tile(aes(fill = n_sets), width = 5, height = 5) +
  scale_fill_viridis_c(option = "viridis", name = "Number\nof Sets") +
  borders("world", colour = "gray50", fill = "gray80") +
  coord_quickmap(xlim = range(grid_summary$lon5) + c(-5, 5),
                 ylim = range(grid_summary$lat5) + c(-5, 5)) +
  theme_bw(base_size = 12) +
  labs(
    title = "Spatial Distribution of Fishing Effort",
    x = "Longitude", 
    y = "Latitude"
  )
print(p_spatial_effort)

6 Statistical Modeling

6.1 Why Use GLMs (or GLMMs) for CPUE?

Traditional linear regression assumes normally distributed residuals with constant variance. Fisheries catch data violate these assumptions because:

Catches are counts (non-negative integers)
Many zeros (sets with no catch)
Overdispersion (variance > mean)
Effort varies between sets

Generalized Linear Models (GLMs) with appropriate distributions handle these issues. Generalized Linear Mixed Models (GLMMs) extend GLMs and allow random variables, which can be quite usefull for some variables (we will test with one).

6.2 The Negative Binomial Distribution

We use the Negative Binomial distribution because:

It’s designed for count data
It handles overdispersion (extra variance)
It can accommodate many zeros

The model has the form:

\[\log(\mu_i) = \beta_0 + \beta_1 \cdot \text{Year}_i + \beta_2 \cdot \text{Quarter}_i + ... + \log(\text{hooks}_i)\]

The \(\log(\text{hooks})\) term is called an offset: This adjusts the expected count based on effort, effectively modeling CPUE.

6.3 Prepare Data for Modeling

model_data <- dat %>%
  filter(
    !is.na(catch_swo),
    !is.na(hooks), 
    hooks > 0
  ) %>%
  droplevels()

cat("Data ready for modeling:", nrow(model_data), "records\n")

## Data ready for modeling: 5913 records

6.4 Model Building Strategy

We build models sequentially, adding one term at a time. This helps us understand the contribution of each factor.

6.4.1 Model 0: Null Model

The simplest model - only intercept and effort offset.

m0 <- glmmTMB(
  catch_swo ~ 1 + offset(log(hooks)),
  family = nbinom1(link = "log"),
  data = model_data
)

6.4.2 Model 1: Add Year Effect

This is the factor we’re most interested in - it represents the abundance trend.

m1 <- glmmTMB(
  catch_swo ~ year + offset(log(hooks)),
  family = nbinom1(link = "log"),
  data = model_data
)

6.4.3 Model 2: Add Quarter Effect

Accounts for seasonal variation.

m2 <- glmmTMB(
  catch_swo ~ year + quarter + offset(log(hooks)),
  family = nbinom1(link = "log"),
  data = model_data
)

6.4.4 Model 3: Add HBF Effect

Accounts for targeting behavior (fishing depth).

m3 <- glmmTMB(
  catch_swo ~ year + quarter + hbf_cat + offset(log(hooks)),
  family = nbinom1(link = "log"),
  data = model_data
)

6.4.5 Model 4: Add Light Color Effect

Accounts for gear configuration differences.

m4 <- glmmTMB(
  catch_swo ~ year + quarter + hbf_cat + light_color + offset(log(hooks)),
  family = nbinom1(link = "log"),
  data = model_data
)

6.4.6 Model 5: Add Spatial Random Effect

Accounts for spatial variation using a random effect for 5×5° cells.

# This may take a little longer to run...
m5 <- glmmTMB(
  catch_swo ~ year + quarter + hbf_cat + light_color + offset(log(hooks)) + (1 | cell5),
  family = nbinom1(link = "log"),
  data = model_data
)

6.5 Model Summary

Let’s examine the structure of our most complete model:

summary(m5)

##  Family: nbinom1  ( log )
## Formula:          
## catch_swo ~ year + quarter + hbf_cat + light_color + offset(log(hooks)) +  
##     (1 | cell5)
## Data: model_data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##   23580.9   23861.7  -11748.5   23496.9      5871 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  cell5  (Intercept) 0.3048   0.5521  
## Number of obs: 5913, groups:  cell5, 77
## 
## Dispersion parameter for nbinom1 family (): 0.579 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.31558    0.13618  -31.69  < 2e-16 ***
## year1987     -0.08913    0.11171   -0.80 0.424948    
## year1988     -0.05842    0.11012   -0.53 0.595756    
## year1989     -0.26772    0.11012   -2.43 0.015046 *  
## year1990     -0.41419    0.11148   -3.72 0.000203 ***
## year1991     -0.42364    0.11198   -3.78 0.000155 ***
## year1992     -0.40242    0.11253   -3.58 0.000349 ***
## year1993     -0.55437    0.11474   -4.83 1.36e-06 ***
## year1994     -0.51323    0.11518   -4.46 8.35e-06 ***
## year1995     -0.67136    0.11604   -5.79 7.24e-09 ***
## year1996     -0.77739    0.11757   -6.61 3.78e-11 ***
## year1997     -0.71832    0.11797   -6.09 1.14e-09 ***
## year1998     -0.60760    0.11731   -5.18 2.22e-07 ***
## year1999     -0.68549    0.12029   -5.70 1.21e-08 ***
## year2000     -0.60041    0.11990   -5.01 5.51e-07 ***
## year2001     -0.69861    0.12095   -5.78 7.66e-09 ***
## year2002     -0.82379    0.12530   -6.57 4.88e-11 ***
## year2003     -0.86107    0.12781   -6.74 1.61e-11 ***
## year2004     -0.75475    0.12492   -6.04 1.52e-09 ***
## year2005     -0.85829    0.12905   -6.65 2.91e-11 ***
## year2006     -0.83997    0.12737   -6.59 4.26e-11 ***
## year2007     -0.86342    0.12741   -6.78 1.23e-11 ***
## year2008     -0.88622    0.12623   -7.02 2.20e-12 ***
## year2009     -1.02581    0.12641   -8.11 4.87e-16 ***
## year2010     -0.83826    0.12541   -6.68 2.32e-11 ***
## year2011     -0.89195    0.12605   -7.08 1.48e-12 ***
## year2012     -0.90043    0.12356   -7.29 3.16e-13 ***
## year2013     -0.95720    0.12653   -7.56 3.88e-14 ***
## year2014     -0.84742    0.12700   -6.67 2.51e-11 ***
## year2015     -0.89871    0.12856   -6.99 2.73e-12 ***
## quarter2      0.04923    0.03065    1.61 0.108160    
## quarter3      0.20340    0.03076    6.61 3.79e-11 ***
## quarter4      0.01234    0.03146    0.39 0.694928    
## hbf_cat3     -0.28441    0.02889   -9.85  < 2e-16 ***
## hbf_cat4     -0.77149    0.03704  -20.83  < 2e-16 ***
## hbf_cat5     -0.71557    0.04080  -17.54  < 2e-16 ***
## hbf_cat6     -1.11789    0.05128  -21.80  < 2e-16 ***
## light_color2 -0.32072    0.05371   -5.97 2.35e-09 ***
## light_color3  0.34567    0.04750    7.28 3.41e-13 ***
## light_color4  0.12425    0.05104    2.43 0.014910 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.6 Model Comparison

We compare models using AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). Lower values indicate better fit (accounting for model complexity).

model_comparison <- data.frame(
  Model = c("M0: Null", 
            "M1: +Year", 
            "M2: +Quarter", 
            "M3: +HBF",
            "M4: +Light", 
            "M5: +Cell5 RE"),
  df = c(attr(logLik(m0), "df"),
         attr(logLik(m1), "df"),
         attr(logLik(m2), "df"),
         attr(logLik(m3), "df"),
         attr(logLik(m4), "df"),
         attr(logLik(m5), "df")),
  logLik = round(c(as.numeric(logLik(m0)),
                   as.numeric(logLik(m1)),
                   as.numeric(logLik(m2)),
                   as.numeric(logLik(m3)),
                   as.numeric(logLik(m4)),
                   as.numeric(logLik(m5))), 1),
  AIC = round(c(AIC(m0), AIC(m1), AIC(m2), AIC(m3), AIC(m4), AIC(m5)), 1),
  BIC = round(c(BIC(m0), BIC(m1), BIC(m2), BIC(m3), BIC(m4), BIC(m5)), 1)
) %>%
  mutate(
    deltaAIC = round(AIC - min(AIC), 1),
    deltaBIC = round(BIC - min(BIC), 1)
  )

print(model_comparison)

##           Model df   logLik     AIC     BIC deltaAIC deltaBIC
## 1      M0: Null  2 -14252.9 28509.8 28523.1   4928.9   4661.4
## 2     M1: +Year 31 -13344.9 26751.9 26959.1   3171.0   3097.4
## 3  M2: +Quarter 34 -13297.4 26662.8 26890.1   3081.9   3028.4
## 4      M3: +HBF 38 -12807.4 25690.8 25944.8   2109.9   2083.1
## 5    M4: +Light 41 -12438.8 24959.5 25233.6   1378.6   1371.9
## 6 M5: +Cell5 RE 42 -11748.5 23580.9 23861.7      0.0      0.0

Interpretation: The model with the lowest AIC/BIC is preferred. Each additional term should substantially improve fit to justify the added complexity.

6.7 Likelihood Ratio Tests

We can formally test with LRT (for nested models) whether each term significantly improves the model:

anova(m0, m1, m2, m3, m4, m5)

Interpretation: A significant p-value (< 0.05) indicates that the additional term significantly improves model fit.

6.8 Select Final Model

Based on AIC, BIC and LRT, we select our final model:

final_model <- m5
cat("Final model selected: M5 (Year + Quarter + HBF + Light + Spatial RE)\n")

## Final model selected: M5 (Year + Quarter + HBF + Light + Spatial RE)

7 Model Diagnostics

Before trusting our results, we must verify that the model assumptions are met.

7.1 DHARMa Residual Diagnostics

The DHARMa package uses simulation to create standardized residuals that should be uniformly distributed if the model is correct.

# Simulate residuals
# Note: this may take a few minutes. Reduce to n=200 if it takes too long
sim_resid <- simulateResiduals(final_model, n = 500)

7.1.1 Overall Diagnostic Plot

plot(sim_resid, main = "DHARMa Residual Diagnostics")

Left panel (QQ plot): Points should follow the diagonal line if residuals are uniformly distributed.

Right panel (Residuals vs. Predicted): Should show no patterns; residuals should be scattered uniformly.

7.1.2 Formal Statistical Tests

# Uniformity test
print(testUniformity(sim_resid))

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  simulationOutput$scaledResiduals
## D = 0.060771, p-value < 2.2e-16
## alternative hypothesis: two-sided

# Dispersion test
print(testDispersion(sim_resid))

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.60859, p-value = 0.12
## alternative hypothesis: two.sided

# Zero-inflation test
print(testZeroInflation(sim_resid))

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.0617, p-value = 0.756
## alternative hypothesis: two.sided

Interpretation: Non-significant p-values (> 0.05) indicate no evidence of problems. If tests are significant, we may need to adjust our model.

8 Extract Standardized Index

8.1 Using emmeans

The emmeans package calculates Estimated Marginal Means - the predicted values for each year while averaging over all other factors.

# Reference effort for standardization
E0 <- 1000  # hooks

# Get estimated marginal means for year
std_index <- emmeans(
  final_model, 
  ~ year,
  type = "response",       # Back-transform to count scale
  re.form = NA,            # Marginalize over random effects
  at = list(hooks = E0)    # Reference effort = 1000 hooks
) %>%
  as.data.frame() %>%
  transmute(
    year = as.integer(as.character(year)),
    std_cpue = response,      # Standardized CPUE (per 1000 hooks)
    std_lwr = asymp.LCL,      # Lower 95% CI
    std_upr = asymp.UCL       # Upper 95% CI
  ) %>%
  arrange(year)

# Standardized SWO CPUE Index (per 1000 hooks):
print(std_index)

##    year std_cpue  std_lwr   std_upr
## 1  1986 8.313219 6.455522 10.705504
## 2  1987 7.604302 6.480699  8.922712
## 3  1988 7.841478 6.703350  9.172844
## 4  1989 6.360600 5.431274  7.448941
## 5  1990 5.494021 4.680375  6.449113
## 6  1991 5.442321 4.636630  6.388014
## 7  1992 5.559047 4.738968  6.521042
## 8  1993 4.775410 4.058781  5.618569
## 9  1994 4.975948 4.230268  5.853070
## 10 1995 4.248175 3.610557  4.998396
## 11 1996 3.820784 3.237555  4.509078
## 12 1997 4.053298 3.431123  4.788292
## 13 1998 4.527854 3.838706  5.340723
## 14 1999 4.188557 3.520684  4.983126
## 15 2000 4.560521 3.844840  5.409419
## 16 2001 4.133954 3.475413  4.917278
## 17 2002 3.647546 3.028086  4.393730
## 18 2003 3.514070 2.903261  4.253387
## 19 2004 3.908291 3.246156  4.705485
## 20 2005 3.523855 2.898033  4.284822
## 21 2006 3.588999 2.962101  4.348574
## 22 2007 3.505844 2.894787  4.245887
## 23 2008 3.426813 2.836995  4.139255
## 24 2009 2.980328 2.466252  3.601562
## 25 2010 3.595171 2.982998  4.332975
## 26 2011 3.407206 2.823588  4.111453
## 27 2012 3.378461 2.817694  4.050830
## 28 2013 3.192012 2.642974  3.855104
## 29 2014 3.562357 2.946236  4.307323
## 30 2015 3.384255 2.789418  4.105939

8.2 Compare with Nominal CPUE

# Combine standardized and nominal indices
comparison_data <- std_index %>%
  left_join(
    nominal_by_year %>%
      mutate(year = as.integer(as.character(year))) %>%
      select(year, nom_cpue = mean_cpue, n_sets),
    by = "year"
  )

# Scale both series to mean = 1 for easier comparison
mean_std <- mean(comparison_data$std_cpue, na.rm = TRUE)
mean_nom <- mean(comparison_data$nom_cpue, na.rm = TRUE)

comparison_data <- comparison_data %>%
  mutate(
    std_index = std_cpue / mean_std,
    std_lwr_index = std_lwr / mean_std,
    std_upr_index = std_upr / mean_std,
    nom_index = nom_cpue / mean_nom
  )

# Relative Indices (scaled to mean = 1):
print(comparison_data %>% 
        select(year, nom_index, std_index, std_lwr_index, std_upr_index, n_sets))

##    year nom_index std_index std_lwr_index std_upr_index n_sets
## 1  1986 2.8764921 1.8540471     1.4397360     2.3875839     27
## 2  1987 2.4513239 1.6959415     1.4453511     1.9899784    210
## 3  1988 2.7902330 1.7488375     1.4950076     2.0457640    245
## 4  1989 2.1925962 1.4185662     1.2113042     1.6612922    274
## 5  1990 1.7643476 1.2252983     1.0438357     1.4383068    259
## 6  1991 1.6484135 1.2137679     1.0340795     1.4246801    253
## 7  1992 1.4327506 1.2398007     1.0569033     1.4543485    245
## 8  1993 1.2097666 1.0650309     0.9052055     1.2530755    241
## 9  1994 1.0605448 1.1097556     0.9434511     1.3053749    256
## 10 1995 0.8255221 0.9474448     0.8052407     1.1147620    275
## 11 1996 0.7214637 0.8521263     0.7220523     1.0056325    278
## 12 1997 0.7266626 0.9039825     0.7652227     1.0679039    257
## 13 1998 0.9355129 1.0098200     0.8561234     1.1911092    217
## 14 1999 0.7002450 0.9341486     0.7851968     1.1113566    206
## 15 2000 0.7209616 1.0171053     0.8574914     1.2064299    202
## 16 2001 0.6597572 0.9219707     0.7751004     1.0966708    193
## 17 2002 0.5748648 0.8134902     0.6753358     0.9799070    169
## 18 2003 0.4846928 0.7837219     0.6474967     0.9486071    167
## 19 2004 0.4986325 0.8716425     0.7239706     1.0494358    166
## 20 2005 0.5324449 0.7859042     0.6463308     0.9556180    140
## 21 2006 0.5458242 0.8004328     0.6606194     0.9698362    138
## 22 2007 0.5464681 0.7818871     0.6456069     0.9469345    153
## 23 2008 0.5215654 0.7642613     0.6327177     0.9231531    164
## 24 2009 0.4811761 0.6646846     0.5500332     0.8032346    169
## 25 2010 0.5622652 0.8018093     0.6652800     0.9663573    144
## 26 2011 0.5692866 0.7598884     0.6297277     0.9169526    152
## 27 2012 0.5005349 0.7534778     0.6284133     0.9034321    196
## 28 2013 0.4510023 0.7118951     0.5894465     0.8597806    191
## 29 2014 0.5013742 0.7944909     0.6570811     0.9606361    183
## 30 2015 0.5132747 0.7547699     0.6221070     0.9157227    143

9 Results Visualization

9.1 Nominal vs. Standardized CPUE (Absolute Values)

p_cpue_abs <- ggplot(comparison_data, aes(x = year)) +
  geom_ribbon(aes(ymin = std_lwr, ymax = std_upr), 
              alpha = 0.2, fill = "blue") +
  geom_line(aes(y = std_cpue, color = "Standardized"), linewidth = 1.2) +
  geom_point(aes(y = std_cpue, color = "Standardized"), size = 3) +
  geom_line(aes(y = nom_cpue, color = "Nominal"), 
            linewidth = 1, linetype = "dashed") +
  geom_point(aes(y = nom_cpue, color = "Nominal"), 
             size = 2, shape = 21, fill = "white") +
  scale_color_manual(
    values = c("Standardized" = "blue", "Nominal" = "red"),
    name = "CPUE Type"
  ) +
  theme_bw(base_size = 12) +
  labs(
    title = "SWO CPUE: Nominal vs. Standardized",
    x = "Year",
    y = "CPUE (N / 1000 hooks)"
  ) +
  theme(legend.position = "bottom")
print(p_cpue_abs)

9.2 Relative Index Comparison (Mean = 1)

Scaling both indices to mean = 1 makes trends easier to compare:

p_index <- ggplot(comparison_data, aes(x = year)) +
  geom_ribbon(aes(ymin = std_lwr_index, ymax = std_upr_index), 
              alpha = 0.2, fill = "blue") +
  geom_line(aes(y = std_index, color = "Standardized"), linewidth = 1.2) +
  geom_point(aes(y = std_index, color = "Standardized"), size = 3) +
  geom_line(aes(y = nom_index, color = "Nominal"), 
            linewidth = 1, linetype = "dashed") +
  geom_point(aes(y = nom_index, color = "Nominal"), 
             size = 2, shape = 21, fill = "white") +
  geom_hline(yintercept = 1, linetype = "dotted", color = "gray50") +
  scale_color_manual(
    values = c("Standardized" = "blue", "Nominal" = "red"),
    name = "Index Type"
  ) +
  theme_bw(base_size = 12) +
  labs(
    title = "SWO Abundance Index: Nominal vs. Standardized",
    subtitle = "Both scaled to mean = 1",
    x = "Year",
    y = "Relative Index"
  ) +
  theme(legend.position = "bottom")
print(p_index)

10 Summary

10.1 What We Did

Prepared the data: Converted variables to appropriate types, created derived variables (quarter, spatial cells), calculated nominal CPUE
Explored the data: Examined distributions, temporal patterns, spatial patterns, and relationships with explanatory variables
Built statistical models: Used Negative Binomial GLMs and GLMMs with an effort offset, sequentially adding terms
Selected the best model: Used AIC/BIC and likelihood ratio tests
Diagnosed the model: Verified assumptions with residual analysis
Extracted standardized indices: Used emmeans (marginal means) to get year effects while controlling (i.e., standardizing) for other factors

10.2 Factors Standardized

Our final model accounted for:

Quarter (seasonality)
HBF (hooks between floats - targeting/depth)
Light color (gear configuration)
Spatial cell (5×5 degree - as random effect)

10.3 Key Interpretation

Differences between nominal and standardized indices indicate important confounding factors
The standardized index better represents true abundance changes
The standardized index is more appropriate for stock assessment

11 Exercises

11.0.1 Exercise 1: Additional Exploratory Plots

Create boxplots showing CPUE relationships with light_color, hook_type, and bait_type. What patterns do you observe?

11.0.2 Exercise 2: Model Comparison

What does the AIC comparison tell us about model fit? Why does adding each term improve (or not improve) the model?

11.0.3 Exercise 3: Alternative Variables

Try adding hook_type and bait_type to the model. Does it improve the fit? Does it change the standardized index? Does it create any issues?

11.0.4 Exercise 4: Month vs. Quarter

Replace quarter with month in the model. How does this affect the results? Are there any trade-offs?

11.0.5 Exercise 5: Interpretation

Why do the nominal and standardized indices differ?

cat("\n Congratulations! You have completed your first CPUE standardization analysis.\n")

## 
##  Congratulations! You have completed your first CPUE standardization analysis.

TP - CPUE Standardization

Longline Fisheries Example (simulated data)

Rui Coelho

Jan 2026