Model Diagnostics in Integrated Stock Assessments

Hindcasting

Laurence Kell, Henning Winker, Massimiliano Cardinale, Rishi Sharma, Iago Mosqueira, Toshihide Kitakado

Jan 31-Feb 3 2022

What diagnostics should be defaults in assessment reports?

Cookbook (Carvalho et al. 2021)

ICES

ICCAT & IOTC

Therefore the focus is on

GFCM Example

Sicilian Hake

Alternative scenarios

Diagnostics

IOTC

Albacore

Model Error v Estimation Error

1440 scenarios

Should diagnostics be used to eliminate models, weight models, or identify and fix model misspecification?

AIC weighting
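
If models are weighted by AIC, the usual Akaike-weight construction is as follows (a reminder of the general formula, not tied to any particular model set discussed here):

\[
w_i = \frac{\exp\left(-\tfrac{1}{2}\Delta_i\right)}{\sum_{j=1}^{M}\exp\left(-\tfrac{1}{2}\Delta_j\right)},
\qquad \Delta_i = \mathrm{AIC}_i - \min_{j}\mathrm{AIC}_j
\]

where \(M\) is the number of candidate models; AIC weights are only comparable across models fitted to the same data.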

Mohn’s \(\rho\) Retrospective Weighting
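
For reference, Mohn's \(\rho\) is conventionally computed from \(P\) retrospective peels as the mean relative difference between the peeled and full-series estimates in the terminal year of each peel (standard formulation; the quantity \(\hat{X}\), e.g. SSB or F, depends on the assessment):

\[
\rho = \frac{1}{P}\sum_{p=1}^{P}\frac{\hat{X}_{T-p,\,p} - \hat{X}_{T-p,\,\mathrm{full}}}{\hat{X}_{T-p,\,\mathrm{full}}}
\]

where \(T\) is the terminal year of the full assessment and \(\hat{X}_{T-p,\,p}\) is the estimate for year \(T-p\) from the fit with the last \(p\) years of data removed.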

Comparing Multiple Models

The World Conference on Stock Assessment Methods used self- and cross-tests to compare models (Deroba et al. 2015).

However

Cross-validation

Used to determine how well a predictive model will perform in practice.

Hindcast

Validation requires that the system is observable and measurable, and so observations should be used rather than model-based quantities, unless the latter are well known (Hodges and Dewar 1992).

Time series cross-validation.

The training set consists of observations that occurred prior to the observation that forms the test set, so no future observations are used in constructing the forecast.

Prediction skill is computed by averaging over the test sets. This procedure is also known as evaluation on a rolling forecasting origin, because the origin on which the forecast is based rolls forward in time.
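
A minimal sketch of evaluation on a rolling forecasting origin, using a naive random-walk forecaster on a generic index series; the function name, the toy series and the choice of predictor are purely illustrative and not taken from any assessment package:

```python
import numpy as np

def rolling_origin_errors(y, min_train=5, horizon=1):
    """Rolling-forecast-origin cross-validation.

    At each origin the training set contains only observations that
    occurred before the test observation; the forecast here is the
    naive (random-walk) prediction, i.e. the last training value.
    """
    y = np.asarray(y, dtype=float)
    errors = []
    for origin in range(min_train, len(y) - horizon + 1):
        train = y[:origin]                  # no future observations used
        y_hat = train[-1]                   # naive h-step-ahead forecast
        y_obs = y[origin + horizon - 1]     # observation h steps after the origin
        errors.append(abs(y_obs - y_hat))
    return np.array(errors)

# Prediction skill is summarised by averaging over the test sets
index = [1.0, 1.2, 0.9, 1.1, 1.3, 1.4, 1.2, 1.0, 0.8, 0.9]
print(rolling_origin_errors(index, horizon=1).mean())
```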

Multi-step forecasts

Multi-step forecasts may be preferable if assessment advice covers multiple years or if benchmarks are conducted every four years.

In this case, the cross-validation procedure based on a rolling forecasting origin can be modified so that multi-step errors are used; for 4-step-ahead forecasts, each prediction is evaluated four steps after its origin.
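
In the sketch above (hypothetical `rolling_origin_errors` helper), the same loop yields multi-step errors by setting `horizon=4`, so that each forecast is scored four steps ahead of its origin rather than one.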

Hindcast

Mean Absolute Scaled Error
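
One common definition, following the scaled error of Hyndman and Koehler with a naive random-walk forecast as the benchmark (stated here as a general formula rather than the exact convention of any one package), is:

\[
\mathrm{MASE} = \frac{\frac{1}{h}\sum_{t \in \mathrm{test}}\left|y_t - \hat{y}_t\right|}
{\frac{1}{n-1}\sum_{t=2}^{n}\left|y_t - y_{t-1}\right|}
\]

where \(y_t\) are observations, \(\hat{y}_t\) the hindcast predictions, \(h\) the number of test observations and \(n\) the length of the series; values below 1 indicate the model predicts the hold-out observations better than the naive baseline.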

Simple Skill Weighting

Using MASE

Weighting Metrics

Albacore Example


Regression Tree for Mohn’s \(\rho\)


MASE

Production functions

Kobe Phase plots

Emergent Properties

If it looks like a duck, walks like a duck and quacks like a duck, then it is a duck.

Production functions

Process Error: Recruitment

Management Strategy Evaluation

Is it possible to automate the acceptance-rejection of models for use with large ensembles?

True Skill Score

The true skill score (TSS) is the true positive rate less the false positive rate. A perfect prediction receives a score of 1, random predictions receive a score of 0, and predictions worse than random receive a negative score.
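
A minimal sketch of the computation from paired binary predictions and outcomes; the function name and toy inputs are illustrative only:

```python
import numpy as np

def true_skill_score(y_true, y_pred):
    """TSS = true positive rate minus false positive rate.

    y_true, y_pred: binary arrays (e.g. 1 = stock flagged as overfished).
    Returns 1 for perfect predictions, 0 for random ones and a
    negative value for predictions worse than random.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tpr = (y_pred & y_true).sum() / max(y_true.sum(), 1)
    fpr = (y_pred & ~y_true).sum() / max((~y_true).sum(), 1)
    return tpr - fpr

print(true_skill_score([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```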

Receiver Operating Characteristics

The area under the receiver operating characteristic curve is a performance measure for machine learning algorithms.

ROC curves plot the true positive rate against the false positive rate, and the area under the curve (AUC) measures how well predictions rank the outcomes. For example, a coin toss would produce a curve falling along the \(y=x\) line, with an area under the curve of 0.5; the closer the area under the curve is to 1, the better the indicator is at ranking.
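
As an illustration of the mechanics, the curve and its area can be computed from binary outcomes and model-derived scores; the data below are made up, and scikit-learn is used only as a convenient, widely available implementation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative data: 1 = the state of interest occurred, score = predicted probability
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
score  = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, score)  # points on the ROC curve
auc = roc_auc_score(y_true, score)               # area under the curve

print(auc)  # 0.5 corresponds to a coin toss; values near 1 indicate good ranking
```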

Is there a management strategy that relates closely to the type of data we observe, and for which predictions can then be tested?

Hindcasting evaluates the model’s ability to predict observed data, for example in a one-step-ahead approach. This is very useful if the observed data are directly related to the management objective, but management quantities (e.g. depletion relative to the level associated with MSY) are usually quite different from the observed data (catch, relative indices of abundance, or catch composition). It might therefore be useful to modify management quantities and objectives so that they relate more closely to the observations. For example, management could set the catch under a given (e.g. historically observed) effort level, or the catch that would increase the relative index by a given percentage.

Conclusions

References

Deroba, JJ, Doug S Butterworth, RD Methot Jr, JAA De Oliveira, C Fernandez, Anders Nielsen, SX Cadrin, et al. 2015. “Simulation Testing the Robustness of Stock Assessment Models to Error: Some Results from the ICES Strategic Initiative on Stock Assessment Methods.” ICES Journal of Marine Science 72 (1). Oxford University Press: 19–30.

Hodges, James S, and James A Dewar. 1992. Is It You or Your Model Talking? A Framework for Model Validation. Santa Monica, CA: RAND, Arroyo Center.

Kell, Laurence T, Rishi Sharma, Toshihide Kitakado, Henning Winker, Iago Mosqueira, Massimiliano Cardinale, and Dan Fu. 2021. “Validation of Stock Assessment Methods: Is It Me or My Model Talking?” ICES Journal of Marine Science.

Saltelli, A, D Mayo, R Pielke Jr, T Portaluri, TM Porter, A Puy, I Rafols, et al. 2020. “Five Ways to Ensure That Models Serve Society: A Manifesto.” Nature 582 (7813). Springer Nature.

Thygesen, Uffe Høgsbro, Christoffer Moesgaard Albertsen, Casper Willestofte Berg, Kasper Kristensen, and Anders Nielsen. 2017. “Validation of Ecological State Space Models Using the Laplace Approximation.” Environmental and Ecological Statistics 24 (2). Springer: 317–39.