Integrated Population Models

Alejandro Molina Moctezuma

Before we start

Packages we’ll use:

  • ASMbook

    • Applied Statistical Modeling for Ecologists book –> Package contains both datasets and useful examples
  • rstan

Before we start

Brief introductions 🐟

Before we start

  • IPM book by Schaub and Kery

This talk

  • Integrated models

  • Integrated population models

  • Discussion

  • Examples and code?

  • Make it successful! 😄 Participate and discuss

  • If you are interested… you should keep reviewing this!

What is an Integrated Model?

  • What is a model?

  • Integrated Population Model

  • What is an Integrated Model?

Similar to hierarchical models

  • In hierarchical models, we can combine two or more GLM’s in a sequence

  • \[ Profit_i = \alpha_0 + \alpha_1 revenue_i - \alpha_2cost_i + \epsilon_{Profit, i} \]

  • \[ Revenue_i = \beta_0 + \beta_1Yield_i+\epsilon_{Revenue,i} \]

  • \[ Cost_i = \gamma_0 + \gamma_1Labor_i + \epsilon_{Cost,i} \]

  • \[ Yield = \zeta_0 + \zeta_1 Labor_i + \epsilon_{Yield_i} \]

  • \[ Labor_i = \delta_0 + \delta_1 Population_i + \epsilon_{labor_i} \]

  • \[ Population \sim Poisson(\lambda) \]

What is an Integrated Model?

  • What?

  • Why?

  • How?

  • When to use?/When to use not?

Reasons to use them?

  • A survival dataset with low sample size → noisy estimates

  • A count dataset with imperfect detection → biased estimates

  • A reproduction dataset that lacks survival context

  • We can join all together!

  • We could also (potentially) estimate new parameters!

flowchart LR
  A[Survival] --> B[Abundance]
  C[Reproduction] --> B

What are IM’s?

Statistical method to share information about some underlying process by combining data sets

  • How: Form joint likelihood with shared parameters among likelihoods of each data set

  • NOT ONLY WAY! (but we will focus on this)

  • Joint likelihood for two or more “disparate” data sets, with at least one shared parameter

IM’s have a shared process

flowchart LR
  A[$$\theta$$] --$$\omega_1$$ --> B($$data_1$$)
  A[$$\theta$$] --$$\omega_2$$ --> C($$data_2$$)

flowchart LR
  subgraph Hidden
    A[$$\theta$$]
    B[$$\gamma$$]
    C[$$\lambda$$]
  end
  Hidden --$$\omega_1$$ --> E($$data_1$$)
  A--$$\omega_2$$ --> D($$data_2$$)

    style Hidden fill:#bdf1f1,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5 

Integrated models

IPM’s

  • Special case of IM’s: estimate abundance and demographic rates
  • Statistical method to share information about some underlying processes (abundance and demographic rates) by combining data sets
  1. Abundance data (counts)
  2. Productivity data (reproduction)
  3. Capture-reapture data (survival)
  • Often times rely on age or stage structure matrices

Exercise

  1. Abundance data (counts)
  2. Productivity data (reproduction)
  3. Capture-reapture data (survival)

flowchart LR
  subgraph Hidden
    A[$$\theta$$]
    B[$$\gamma$$]
    C[$$\lambda$$]
  end
  Hidden --$$\omega_1$$ --> E($$data_1$$)
  A--$$\omega_2$$ --> D($$data_2$$)

    style Hidden fill:#bdf1f1,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5 

Exercise

CJS model:

flowchart LR
  A($$\phi$$) ---> C[y]
  D(p) ---> C
  

  • \[ z_{i,f_i} = 1 \]

  • \[ z_{i,t+1} | z_{i,t} \sim Bernoulli(z_{i,t}\phi_{i,t}) \]

  • \[ y_{i,t} | z_{i,t} \sim Bernoulli(z_{i,t}p_{i,t}) \]

Exercise

  1. Abundance data (counts) –>y
  2. Productivity data (reproduction; counts of eggs, counts of newborns) –> R and J
  3. Capture-reapture data (survival): CJS –> m
  • Make a diagram
  • Trying to estimate abundance (N) and other demographic rates
  • There can be as many processes as you want (parameters)
  • What are the shared processes?

Exercise

Potential result

Why and how?

How: Formulate the joint density of all observed data (and latent variables)

  • More data is better! Higher precision, more parameters
  • Larger datasets
  • New parameters may become possible to estimate

When to run an IPM?

  • Should you always run an IPM when you have the available data? If it has so many benefits!

  • What should your philosophy be?

  • When should you NOT run an IPM?

When not to run one

  • When data sources violate independence

  • When parameters become non-identifiable

How to run?

Stack all models inside the same stan model statement

  • Important part: Choose the same names for shared parameters

  • Likelihood portion for each model

  • That’s it!

  • Important: You need to understand each model you run

What you should know before running an IPM

  1. Models for population size surveys
  2. Models for “productivity”
  3. Models for survival
  4. Matrix population models

Most complicated aspect of a IPM

  • Understand each individual model

  • Understanding “shared” parameters

  • Theoretical acyclical graph

Things to remember before running a model!

  • Separate data, parameters, model, transformed parameters (generated quantities) clearly

  • Build each model separately, then join

Example of an IM

Let’s go back to IM’s and run an SDM

Example in R

Common swifts in Switzerland

Example

We have count data for common swifts at different altitudes.

Example in R

  • Dataset: countdata1

What model should we run?

We are just wondering what the effect of elevation is on abundance (counts)

What model?

  • A GLM! Very easy to run using the glm function

GLM

summary(fm1 <- glm(counts ~ selev, data= countdata1, family = poisson(link = "log")))

Call:
glm(formula = counts ~ selev, family = poisson(link = "log"), 
    data = countdata1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.69493    0.03668   18.94   <2e-16 ***
selev       -1.96130    0.06700  -29.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1718.8  on 499  degrees of freedom
Residual deviance:  578.4  on 498  degrees of freedom
AIC: 1615.4

Number of Fisher Scoring iterations: 5

Definition of negative log-likelihood for Poisson

library(ASMbook)
NLL1 <- function(param, y, x) {
  alpha <- param[1]
  beta <- param[2]
  lambda <- exp(alpha + beta * x)
  LL <- dpois(y, lambda, log = TRUE)                   # LL contribution for each datum
  return(-sum(LL))                                     # NLL for all data
}

# Minimize NLL
inits <- c(alpha = 0, beta = 0)                        # Need to provide initial values
sol1 <- optim(inits, NLL1, y = countdata1$counts, x = countdata1$selev, hessian = TRUE, method = 'BFGS')
tmp1 <- get_MLE(sol1, 4)
          MLE     ASE LCL.95  UCL.95
alpha  0.6949 0.03668  0.623  0.7668
beta  -1.9613 0.06700 -2.093 -1.8300

We are lucky, and have a secondary (independent) dataset

  • Dataset of “birdwatchers”

  • Pretty reliable

  • They do not record zeroes!

  • What type of model?

  • A zero-truncated GLM! VGAM –> we won’t run the actual model, just maximizing likelihood

Definition of negative log-likelihood (NLL) for the ztPoisson GLM

NLL2 <- function(param, y, x) {
  alpha <- param[1]
  beta <- param[2]
  lambda <- exp(alpha + beta * x)
  L <- dpois(y, lambda)/(1-ppois(0, lambda))           # L. contribution for each datum
  return(-sum(log(L)))                                 # NLL for all data
}

# Minimize NLL
inits <- c(alpha = 0, beta = 0)                        # Need to provide initial values
sol2 <- optim(inits, NLL2, y = countdata2$counts, x = countdata2$selev, hessian = TRUE, method = 'BFGS')
tmp2 <- get_MLE(sol2, 4)
          MLE     ASE  LCL.95  UCL.95
alpha  0.6461 0.03852  0.5706  0.7216
beta  -2.0351 0.07089 -2.1740 -1.8961

We are lucky, and have a third (independent) dataset

  • detdata

  • Detection/nondetection data

  • NOT COUNT DATA!

  • But, underlying abundance process affects the probability of detection!

  • Elevation affects the abundance! Abundance affects detection

  • What model?

  • GLM glm. Binomial! What type of link function?

  • clogclog

  • \[ p(detect) = 1-e^{-\lambda r} \]

  • r is probability of detection of each individual

  • \[ log(-log(1-p))=log(\lambda r) \]

Fitting cloglog Bernoulli GLM

summary(fm3 <- glm(y ~ selev,data=detdata, family = binomial(link = "cloglog")))

Call:
glm(formula = y ~ selev, family = binomial(link = "cloglog"), 
    data = detdata)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.71081    0.04169   17.05   <2e-16 ***
selev       -1.98144    0.09499  -20.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2308.8  on 1999  degrees of freedom
Residual deviance: 1569.8  on 1998  degrees of freedom
AIC: 1573.8

Number of Fisher Scoring iterations: 6

Definition of negative log-likelihood for Bernoulli cloglog

NLL3 <- function(param, y, x) {
  alpha <- param[1]
  beta <- param[2]
  lambda <- exp(alpha + beta * x)
  psi <- 1-exp(-lambda)
  LL <- dbinom(y, 1, psi, log = TRUE)                  # L. contribution for each datum
  return(-sum(LL))                                     # NLL for all data
}

# Minimize NLL
inits <- c(alpha = 0, beta = 0)
sol3 <- optim(inits, NLL3, y = detdata$y, x = detdata$selev, hessian = TRUE, method = 'BFGS')
tmp3 <- get_MLE(sol3, 4)
          MLE     ASE  LCL.95  UCL.95
alpha  0.7108 0.04214  0.6282  0.7934
beta  -1.9815 0.09682 -2.1712 -1.7917

Fitting the IM with a joint likelihood

We are assuming the data are independent (BIG ASSUMPTION! Read about it!)

  • What is the shared process/parameter?
  • \[ L_{joint} = L_1L_2L_3 \]

  • \[NLL_{joint}= - LL_1 - LL_2 -LL_3\]

NLLjoint <- function(param, y1, x1, y2, x2, y3, x3) {
  # Definition of elements in param vector (shared between data sets)
  alpha <- param[1]                                    # log-linear intercept
  beta <- param[2]                                     # log-linear slope
  # Likelihood for the Poisson GLM for data set 1 (y1, x1)
  lambda1 <- exp(alpha + beta * x1)
  L1 <- dpois(y1, lambda1)
  # Likelihood for the ztPoisson GLM for data set 2 (y2, x2)
  lambda2 <- exp(alpha + beta * x2)
  L2 <- dpois(y2, lambda2)/(1-ppois(0, lambda2))
  # Likelihood for the cloglog Bernoulli GLM for data set 3 (y3, x3)
  lambda3 <- exp(alpha + beta * x3)
  psi <- 1-exp(-lambda3)
  L3 <- dbinom(y3, 1, psi)
  # Joint log-likelihood and joint NLL: here you can see that sum!
  JointLL <- sum(log(L1)) + sum(log(L2)) + sum(log(L3)) # Joint LL
  return(-JointLL)                                      # Return joint NLL
}

# Minimize NLLjoint
inits <- c(alpha = 0, beta = 0)
solJoint <- optim(inits, NLLjoint, y1 = countdata1$counts, x1 = countdata1$selev, y2 = countdata2$counts, x2 = countdata2$selev,
  y3 =detdata$y, x3 = detdata$selev, hessian = TRUE, method = 'BFGS')

# Get MLE and asymptotic SE and print and save
(tmp4 <- get_MLE(solJoint, 4))
          MLE     ASE  LCL.95  UCL.95
alpha  0.6861 0.01852  0.6498  0.7224
beta  -1.9704 0.03522 -2.0394 -1.9013
             MLE        ASE     LCL.95     UCL.95
alpha  0.6860979 0.01851556  0.6498081  0.7223878
beta  -1.9703732 0.03522001 -2.0394032 -1.9013432
diy_est <- tmp4[,1]

Not too different from an IPM!