Habitat Selection Model Validation

# Habitat Selection Model Validation
## "Is my model any good?"

Brian J. Smith  
22 April 2026

*Analysis of Animal Movement Data in R*  
J. Signer & B. Smith

---
class: center, middle

# Concepts

---

## Model validation

> "Model validation is the process of assessing agreement between observations and fitted or predicted values."

&#8212; Fieberg et al. (2018)

???

In other words, "How well does my model do its job?"

Note that this implies you have defined your model's job. We can use models for many purposes, so your model validation needs to be tailored to the objectives of your study.

---

## Calibration

> "When there is close agreement between observed and fitted/predicted values, we say the model is well calibrated; calibration therefore refers to steps taken to improve agree- ment between observed and predicted values."

&#8212; Fieberg et al. (2018)

---

## Discrimination

> "Discrimination ... describes a model’s ability to rank sample units in terms of their likely outcomes."

&#8212; Fieberg et al. (2018)

---

## Calibration vs. discrimination

If you take care to fit a "good" model, you can wind up with a model that is *both* well-calibrated and able to discriminate. However, there are situations where this is not the case.

---

## Calibration vs. discrimination

### Well-calibrated, poor discrimination

Ellner et al. (2002) showed that models used for population viability analyses (PVA) are often too imprecise to rank individual populations in terms of extinction risk (*i.e.*, discriminate), even though on average the ensemble of models is unbiased (*i.e.*, well-calibrated).

---

.center[ 
![Ellner et al. 2002 Fig. 1](figs/Ellner_etal_2002_Fig1.png) 
.note[Fig. 1 from Ellner et al. (2002)]
]

???

Left-panel shows accuracy of the ensemble predictions -- there is good agreement between PVA-predicted probability of decline and the proportion of populations actually declining. *I.e.*, the model is well-calibrated.

Right-panel shows the lack of precision for individual population predictions. Even though this model is well-calibrated, you cannot rank the individual populations in terms of extinction probability. *I.e.*, the model has poor discrimination.

---

## Calibration vs. discrimination

### Poor calibration, good discrimination

Counts of species, uncorrected for imperfect detection, are known to be
biased low. However, these population indices might rank sites well in terms of true abundance, as long as detection probability is similar among sites.

---

.note[This population index is a biased estimate of abundance, but sites are still ranked correctly.]
]

???

Blue line is 1:1 line. If these estimates were unbiased, points would be scattered about the line.

---

## Goodness-of-fit

In general, how well a model fits the data.

Often measured with R2 (for ordinary linear models) or pseudo-R2 (for generalized linear models).

Generally interpreted as the percentage of variation in the data explained by the model.

???

The general interpretation is true for ordinary linear regression and Gaussian GLMs. For other GLMs, residuals do not have the same interpretation as in the Gaussian case, and instead model fitting is based on likelihood.

We use deviance as the generalization of residuals in GLMs, and so pseduo-R2 is sometimes interpreted as % of deviance explained.

Note that not all pseudo-R2 metrics range from 0 to 1. Some have a maximum possible value < 1.

---

## Internal vs. external validity

- Internal validity: measured by comparing the fitted values of the model to the data, *i.e.*, validation using the same data used to fit the model.
  - In-sample prediction

- External validity: measured by comparing model predictions to observed values in a new dataset, *i.e.*, validation using an independent dataset.
  - Out-of-sample prediction

---
class: center, middle
# Methods for HSFs

---

## Methods for HSFs

- Pseudo-R2

- Area Under the Curve (ROC AUC)

- "Boyce method" .note[<a href = "https://doi.org/10.1016/S0304-3800(02)00200-4">(Boyce et al. 2002)</a>]

- Used-habitat calibration plots .note[[(Fieberg et al. 2018)](https://doi.org/10.1111/ecog.03123)]

---

## Pseudo-R2

- Calculated using the likelihood of the fitted model compared with the likelihood of a null (intercept-only) model.

- Measure of internal validity.

- Easy to calculate, lots of R packages with various versions.

- Makes a lot of sense for a model fit using Poisson regression.

- Large body of literature about this for logistic regression, but that's not really the model we want to evaluate.

---

## ROC AUC

- AUC = "Area Under the Curve"

- ROC = "Receiver-Operator Curve"

- Measures the trade-off between sensitivity and specificity for a binary predictor.

- Can be used as a measure of internal validity or external validity.

- Good for evaluating logistic regression, but again, that's not really the model we want to fit.

- Typically bound between 0.5 (random guess) and 1 (perfect prediction).

---

## ROC AUC

An intuitive interpretation of AUC for HSFs is:

1. Randomly select 1 used point from the dataset `\((x_u)\)`.
2. Randomly select 1 available point from the dataset `\((x_a)\)`.
3. Calculate value of `\(w(x)\)` for each point.
4. `\(AUC = Pr(w(x_u) > w(x_a))\)`

*I.e.*, it is the probability under our model that our used location is in "better" habitat (according to our model) than what is available.

---

## Boyce method

---

## Boyce method

More directly related to the IPP because it evaluates density of points rather than a binary classifier.

- Breaks habitats into bins and estimates `\(w(x)\)` for the mean values of the bin.

- Calculates correlation between number of used locations in each bin `\(w(x)\)` for that bin.

- Uses Spearman's (rank) correlation.

---

## Used-habitat calibration plots

- Graphical method for calibration.

- Good for identifying missing covariates, non-linearity, and multicollinearity.

- Much more detail of this method later.

---
class: center, middle

# Methods for iSSFs

---

## Methods

- Pseudo-R2

- Other methods for conditional logistic regression/Cox PH

- Concordance
  - Measure of Explained Randomness .note[[(O'Quigley et al. 2005)](https://doi.org/10.1002/sim.1946)]

- "Modified Boyce method" .note[[(Fortin et al. 2009)](https://doi.org/10.1890/08-0345.1)]

- Potts Method .note[[(Potts et al. 2022)](https://doi.org/10.1111/2041-210X.13904)]
  
- Used-habitat calibration plots .note[[(Fieberg et al. 2018)](https://doi.org/10.1111/ecog.03123)]

- The lineup protocol .note[[(Fieberg et al. 2024)](https://doi.org/10.1111/2041-210X.14336)]

---
## Pseudo-R2

- Calculated using the likelihood of the fitted model compared with the likelihood of a null (intercept-only) model.

- Not clear that the pseudo-R2 of a Cox proportional hazards model is interpretable as the % deviance explained by an (i)SSF.

- Values tend to be low (e.g., < 0.3) even under true generating model.
 - Many readers expect R2-statistics close to 1.

---

## Concordance

- Generalization of AUC.

- Most used measure of goodness-of-fit for survival models.

- Easy to calculate (`summary()` or `survival::concordance()`).

---

## Concordance

- For each stratum:
  - Rank all steps using the fitted model "risk" prediction.
  - Divide rank of used step by total steps per stratum.  
  
- Average across all strata.

- Measures how well your model ranks the used step.

---

## Concordance

**Example**

- Sample 99 available steps for each used step.
  - 100 total steps per stratum

- If used step ranks __ on average:
  - 100 (highest); concordance = 1
  - 1 (lowest); concordance = 0.01

---

## Measure of Explained Randomness

- Cox-Snell pseudo-R2, **but** ... 
 
- N is total of only used steps.
 - Only used steps are actually data.
 
- Typically higher than an ordinary pseudo-R2.

- Can calculate with `survMisc::rsq()`

---

## Modified Boyce method

---
## Modified Boyce method

- Fortin et al. (2009) were evaluating RSFs fit with conditional logistic regression (with a single day of locations comprising a stratum).

- Similar to an SSF
  
- Built the RSF for 80% of strata, withholding 20% for out-of-sample validation.

- Ranked all observations within a stratum by predicted `\(w(x)\)`, then grouped the observed locations according to their rank.

- Calculated Spearman's (rank) correlation between the ranking and the number of observed locations in the ranking.

---

## Potts Method

## Potts Method

- Fit occurrence distribution (OD) to raw data.
  - `amt::hr_od()`

- Simulate new tracks from fitted iSSA and calculate OD.
  - `amt::simulate_path()`

- Compare using Bhattacharyya's Affinity. 
  - `amt::hr_overlap(..., type = 'ba')`
  
- Very computationally intensive!

---

## Used-habitat calibration plots

- Can be used for both HSFs and (i)SSFs.

---
class: center, middle
# Used-habitat Calibration Plots

---

## Used-habitat calibration plots

---

## Used-habitat calibration plots

How well does our model predict the characteristics of used locations?

---
class: center, middle

![Fieberg et al. 2018 Fig. 3](figs/Fieberg_etal_2018_Fig3.png)

---

## Used-habitat calibration plots

First, split data into training and testing datasets.

- Summarize used and available distributions of each environmental variable.

- Fit a candidate model. Extract the `\(\beta\)`s and the variance-covariance matrix `\((\boldsymbol\Sigma)\)`.

...

---

## Used-habitat calibration plots

- For many (bootstrap) iterations:

- Draw new values of all `\(\beta\)`s (don't forget the covariance).
  - Calculate `\(w(x)\)` for the testing data with the sampled `\(\beta\)`s.
  - Sample `\(n_u^{test}\)` used locations from the test data, with probability proportional to `\(w(x)\)` calculated in previous.
  - Summarize predicted distribution of covariates at used locations.
  
- Compare observed and predicted distributions.

---

## Used-habitat calibration plots

UHC plots can help identify missing covariates.

???

If black line (observed distribution) falls outside of gray envelope (predicted distribution), then the model needs calibrating.

In this figure, the correct formula is `~ elev + precip`. Left panel shows model missing precipitation and corresponding incorrect predicted distribution. Right panel shows correct model.

Red line is available distribution.

---

## Used-habitat calibration plots

UHC plots can help identify incorrect functional form.

???

Correct model should have a quadratic term (left).

---

## Lineup protocol

---

## Lineup protocol

- Graphical method for evaluating goodness of fit.

- Takes advantage of the ability of our eyes to decipher patterns.

- Simulate some quantity of interest from fitted model many times, plot the real data and the simulations in a random order, and see if you can identify the real one.

- Protocol can be used to estimate a p-value for the null hypothesis that the data are consistent with the fitted model.

???

Useful for any model you can simulate from, not just SSFs.

---

## Lineup protocol

???

Real data in panel 6

---

## Lineup protocol

???

Real data in panel 11

---

## Lineup protocol

???

Real data in panel 10

---

## Lineup protocol

- See appendix of the preprint for a vignette on creating lineups using `amt`

---

# Questions?

---

# References

<a name=bib-Boyce2002></a>[Boyce, M. S., P. R. Vernier, S. E. Nielsen,
et al.](#cite-Boyce2002) (2002). "Evaluating resource selection
functions". In: _Ecological Modelling_ 157. ISBN: 7804929234, pp.
281-300. DOI:
[10.1016/S0304-3800(02)00200-4](https://doi.org/10.1016%2FS0304-3800%2802%2900200-4).

<a name=bib-Ellner2002></a>[Ellner, S. P., J. Fieberg, D. Ludwig, et
al.](#cite-Ellner2002) (2002). "Precision of Population Viability
Analysis". In: _Conservation Biology_ 16.1, pp. 258-261.

<a name=bib-Fieberg2018></a>[Fieberg, J. R., J. D. Forester, G. M.
Street, et al.](#cite-Fieberg2018) (2018). "Used-habitat calibration
plots: A new procedure for validating species distribution, resource
selection, and step-selection models". In: _Ecography_ 41. ISBN:
1600-0587, pp. 737-752. DOI:
[10.1111/ecog.03123](https://doi.org/10.1111%2Fecog.03123).

<a name=bib-Fieberg2024></a>[Fieberg, J. R., S. Freeman, and J.
Signer](#cite-Fieberg2024) (2024). "Using lineups to evaluate goodness
of fit of animal movement models". In: _Methods in Ecology and
Evolution_ 15 (6), pp. 1048-1059. DOI:
[10.1111/2041-210X.14336](https://doi.org/10.1111%2F2041-210X.14336).

<a name=bib-Daniel2009></a>[Fortin, D., M. E. Fortin, H. L. Beyer, et
al.](#cite-Daniel2009) (2009). "Group-size-mediated habitat selection
and group fusion-fission dynamics of bison under predation risk". In:
_Ecology_ 90.9. ISBN: 0012-9658, pp. 2480-2490. DOI:
[10.1890/08-0345.1](https://doi.org/10.1890%2F08-0345.1).

<a name=bib-oquigley_explained_2005></a>[O'Quigley, J., R. Xu, and J.
Stare](#cite-oquigley_explained_2005) (2005). "Explained randomness in
proportional hazards models". In: _Statistics in Medicine_ 24.3, pp.
479-489. DOI: [10.1002/sim.1946](https://doi.org/10.1002%2Fsim.1946).

<a name=bib-potts_assessing_2022></a>[Potts, J. R., L. Börger, B. K.
Strickland, et al.](#cite-potts_assessing_2022) (2022). "Assessing the
predictive power of step selection functions: how social and
environmental interactions affect animal space use". In: _Methods in
Ecology and Evolution_, pp. 1-39. DOI:
[10.1111/2041-210x.13904](https://doi.org/10.1111%2F2041-210x.13904).