flowchart LR
A[Survival] --> B[Abundance]
C[Reproduction] --> B
Packages we’ll use:
ASMbook
rstan
Brief introductions 🐟
Integrated models
Integrated population models
Discussion
Examples and code?
Make it successful! 😄 Participate and discuss
If you are interested… you should keep reviewing this!
What is a model?
Integrated Population Model
What is an Integrated Model?
In hierarchical models, we can combine two or more GLMs in a sequence:
\[ Profit_i = \alpha_0 + \alpha_1 Revenue_i - \alpha_2 Cost_i + \epsilon_{Profit,i} \]
\[ Revenue_i = \beta_0 + \beta_1 Yield_i + \epsilon_{Revenue,i} \]
\[ Cost_i = \gamma_0 + \gamma_1 Labor_i + \epsilon_{Cost,i} \]
\[ Yield_i = \zeta_0 + \zeta_1 Labor_i + \epsilon_{Yield,i} \]
\[ Labor_i = \delta_0 + \delta_1 Population_i + \epsilon_{Labor,i} \]
\[ Population_i \sim Poisson(\lambda) \]
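As a toy illustration, here is a minimal R sketch that simulates this chain from the bottom up (every coefficient value is hypothetical, chosen only to show how each response feeds into the next):

# Simulate the sequence of GLMs above; all coefficients are made up
set.seed(1)
n <- 200
Population <- rpois(n, lambda = 50)                        # Population_i ~ Poisson(lambda)
Labor      <- 2.0 + 0.5 * Population + rnorm(n)            # Labor_i
Yield      <- 1.0 + 1.2 * Labor      + rnorm(n)            # Yield_i
Cost       <- 3.0 + 0.8 * Labor      + rnorm(n)            # Cost_i
Revenue    <- 0.5 + 2.0 * Yield      + rnorm(n)            # Revenue_i
Profit     <- 1.0 + 1.0 * Revenue - 1.0 * Cost + rnorm(n)  # Profit_i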
What?
Why?
How?
When to use? / When not to use?
A survival dataset with low sample size → noisy estimates
A count dataset with imperfect detection → biased estimates
A reproduction dataset that lacks survival context
We can join them all together!
We could also (potentially) estimate new parameters!
flowchart LR
A[Survival] --> B[Abundance]
C[Reproduction] --> B
A statistical method for sharing information about some underlying process by combining data sets
How: form a joint likelihood with parameters shared among the likelihoods of the individual data sets
NOT THE ONLY WAY! (but the one we will focus on)
Joint likelihood for two or more “disparate” data sets, with at least one shared parameter
flowchart LR
A[$$\theta$$] --$$\omega_1$$ --> B($$data_1$$)
A --$$\omega_2$$ --> C($$data_2$$)
flowchart LR
subgraph Hidden
A[$$\theta$$]
B[$$\gamma$$]
C[$$\lambda$$]
end
Hidden --$$\omega_1$$ --> E($$data_1$$)
A--$$\omega_2$$ --> D($$data_2$$)
style Hidden fill:#bdf1f1,stroke:#f66,stroke-width:2px,color:#fff,stroke-dasharray: 5 5
CJS (Cormack-Jolly-Seber) model, where z_{i,t} is the latent alive state of individual i at occasion t, y_{i,t} the detection record, f_i the occasion of first capture, φ the survival probability, and p the detection probability:
flowchart LR
A($$\phi$$) ---> C[y]
D(p) ---> C
\[ z_{i,f_i} = 1 \]
\[ z_{i,t+1} | z_{i,t} \sim Bernoulli(z_{i,t}\phi_{i,t}) \]
\[ y_{i,t} | z_{i,t} \sim Bernoulli(z_{i,t}p_{i,t}) \]
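A minimal R sketch simulating this state-space formulation, with hypothetical constant φ and p (n, T, phi and p are made-up values):

# Simulate the CJS model above
set.seed(1)
n <- 100; T <- 5
phi <- 0.8; p <- 0.6
f <- sample(1:(T - 1), n, replace = TRUE)        # first-capture occasion f_i
z <- y <- matrix(0, n, T)
for (i in 1:n) {
  z[i, f[i]] <- y[i, f[i]] <- 1                  # z_{i,f_i} = 1 (alive at first capture)
  for (t in f[i]:(T - 1)) {
    z[i, t + 1] <- rbinom(1, 1, z[i, t] * phi)   # state (survival) process
    y[i, t + 1] <- rbinom(1, 1, z[i, t + 1] * p) # observation process
  }
}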
How: Formulate the joint density of all observed data (and latent variables)
Should you always run an IPM when you have the data available, given all these benefits?
What should your philosophy be?
When should you NOT run an IPM?
When data sources violate independence
When parameters become non-identifiable
Stack all models inside the same Stan model statement
Important part: Choose the same names for shared parameters
Likelihood portion for each model
That’s it!
Important: You need to understand each model you run
Understand each individual model
Understanding “shared” parameters
Separate data, parameters, model, transformed parameters (generated quantities) clearly
Build each model separately, then join
Let’s go back to IMs and run an SDM (species distribution model)
Common swifts in Switzerland
We have count data for common swifts at different altitudes.
We are just wondering what the effect of elevation is on abundance (counts)
What model?
glm function
Call:
glm(formula = counts ~ selev, family = poisson(link = "log"),
data = countdata1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.69493 0.03668 18.94 <2e-16 ***
selev -1.96130 0.06700 -29.27 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1718.8 on 499 degrees of freedom
Residual deviance: 578.4 on 498 degrees of freedom
AIC: 1615.4
Number of Fisher Scoring iterations: 5
library(ASMbook)
NLL1 <- function(param, y, x) {
alpha <- param[1]
beta <- param[2]
lambda <- exp(alpha + beta * x)
LL <- dpois(y, lambda, log = TRUE) # LL contribution for each datum
return(-sum(LL)) # NLL for all data
}
# Minimize NLL
inits <- c(alpha = 0, beta = 0) # Need to provide initial values
sol1 <- optim(inits, NLL1, y = countdata1$counts, x = countdata1$selev, hessian = TRUE, method = 'BFGS')
tmp1 <- get_MLE(sol1, 4)
          MLE     ASE LCL.95  UCL.95
alpha  0.6949 0.03668  0.623  0.7668
beta  -1.9613 0.06700 -2.093 -1.8300
Dataset of “birdwatchers”
Pretty reliable
They do not record zeroes!
What type of model?
A zero-truncated Poisson GLM! (e.g., via the VGAM package) → we won’t run the actual model, we’ll just maximize the likelihood directly
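For reference, the zero-truncated Poisson probability of an observed count y > 0 is just the Poisson probability renormalized by the probability of a nonzero count:

\[ P(y \mid y > 0) = \frac{e^{-\lambda}\lambda^{y}/y!}{1 - e^{-\lambda}} \]

which is exactly what dpois(y, lambda)/(1 - ppois(0, lambda)) computes in NLL2 below.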
NLL2 <- function(param, y, x) {
alpha <- param[1]
beta <- param[2]
lambda <- exp(alpha + beta * x)
L <- dpois(y, lambda)/(1-ppois(0, lambda)) # L. contribution for each datum
return(-sum(log(L))) # NLL for all data
}
# Minimize NLL
inits <- c(alpha = 0, beta = 0) # Need to provide initial values
sol2 <- optim(inits, NLL2, y = countdata2$counts, x = countdata2$selev, hessian = TRUE, method = 'BFGS')
tmp2 <- get_MLE(sol2, 4)
          MLE     ASE  LCL.95  UCL.95
alpha  0.6461 0.03852  0.5706  0.7216
beta  -2.0351 0.07089 -2.1740 -1.8961
detdata
Detection/nondetection data
NOT COUNT DATA!
But, underlying abundance process affects the probability of detection!
Elevation affects the abundance! Abundance affects detection
What model?
A binomial GLM (glm)! What type of link function?
cloglog (the complementary log-log link)
\[ p(detect) = 1-e^{-\lambda r} \]
where r is the detection probability of each individual
\[ log(-log(1-p))=log(\lambda r) \]
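A quick numerical sanity check of this identity in R (the values of lambda and r are arbitrary):

# Both lines print the same number, confirming the cloglog identity above
lambda <- 2; r <- 0.3
p <- 1 - exp(-lambda * r)
log(-log(1 - p))   # cloglog(p)
log(lambda * r)    # log(lambda * r)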
Call:
glm(formula = y ~ selev, family = binomial(link = "cloglog"),
data = detdata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.71081 0.04169 17.05 <2e-16 ***
selev -1.98144 0.09499 -20.86 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2308.8 on 1999 degrees of freedom
Residual deviance: 1569.8 on 1998 degrees of freedom
AIC: 1573.8
Number of Fisher Scoring iterations: 6
NLL3 <- function(param, y, x) {
alpha <- param[1]
beta <- param[2]
lambda <- exp(alpha + beta * x)
psi <- 1-exp(-lambda) # detection prob. (any per-individual r is absorbed into alpha)
LL <- dbinom(y, 1, psi, log = TRUE) # LL contribution for each datum
return(-sum(LL)) # NLL for all data
}
# Minimize NLL
inits <- c(alpha = 0, beta = 0)
sol3 <- optim(inits, NLL3, y = detdata$y, x = detdata$selev, hessian = TRUE, method = 'BFGS')
tmp3 <- get_MLE(sol3, 4)
          MLE     ASE  LCL.95  UCL.95
alpha  0.7108 0.04214  0.6282  0.7934
beta  -1.9815 0.09682 -2.1712 -1.7917
We are assuming the data are independent (BIG ASSUMPTION! Read about it!)
\[ L_{joint} = L_1L_2L_3 \]
\[NLL_{joint}= - LL_1 - LL_2 -LL_3\]
NLLjoint <- function(param, y1, x1, y2, x2, y3, x3) {
# Definition of elements in param vector (shared between data sets)
alpha <- param[1] # log-linear intercept
beta <- param[2] # log-linear slope
# Likelihood for the Poisson GLM for data set 1 (y1, x1)
lambda1 <- exp(alpha + beta * x1)
L1 <- dpois(y1, lambda1)
# Likelihood for the ztPoisson GLM for data set 2 (y2, x2)
lambda2 <- exp(alpha + beta * x2)
L2 <- dpois(y2, lambda2)/(1-ppois(0, lambda2))
# Likelihood for the cloglog Bernoulli GLM for data set 3 (y3, x3)
lambda3 <- exp(alpha + beta * x3)
psi <- 1-exp(-lambda3)
L3 <- dbinom(y3, 1, psi)
# Joint log-likelihood and joint NLL: here you can see that sum!
JointLL <- sum(log(L1)) + sum(log(L2)) + sum(log(L3)) # Joint LL
return(-JointLL) # Return joint NLL
}
# Minimize NLLjoint
inits <- c(alpha = 0, beta = 0)
solJoint <- optim(inits, NLLjoint, y1 = countdata1$counts, x1 = countdata1$selev, y2 = countdata2$counts, x2 = countdata2$selev,
                  y3 = detdata$y, x3 = detdata$selev, hessian = TRUE, method = 'BFGS')
# Get MLE and asymptotic SE and print and save
(tmp4 <- get_MLE(solJoint, 4))
          MLE     ASE  LCL.95  UCL.95
alpha  0.6861 0.01852  0.6498  0.7224
beta  -1.9704 0.03522 -2.0394 -1.9013
You have the data
Challenge: Run a single integrated model in Stan
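If you want a starting point, here is a minimal rstan sketch of one way to set this up. This is a sketch under assumptions, not a reference solution: the Stan program and the data-list names are ours, and it assumes a Stan version that supports the array syntax.

library(rstan)

# Hypothetical skeleton: one shared alpha/beta, and one likelihood portion
# per data set (Poisson, zero-truncated Poisson, cloglog Bernoulli)
im_code <- "
data {
  int<lower=1> n1;  array[n1] int<lower=0> y1;           vector[n1] x1;
  int<lower=1> n2;  array[n2] int<lower=1> y2;           vector[n2] x2;
  int<lower=1> n3;  array[n3] int<lower=0, upper=1> y3;  vector[n3] x3;
}
parameters {
  real alpha;  // shared log-linear intercept
  real beta;   // shared log-linear slope
}
model {
  // Data set 1: Poisson GLM
  y1 ~ poisson_log(alpha + beta * x1);
  // Data set 2: zero-truncated Poisson GLM
  for (i in 1:n2) {
    real lam = exp(alpha + beta * x2[i]);
    target += poisson_lpmf(y2[i] | lam) - log1m_exp(-lam);
  }
  // Data set 3: Bernoulli GLM with cloglog link (psi = 1 - exp(-lambda))
  for (i in 1:n3)
    y3[i] ~ bernoulli(1 - exp(-exp(alpha + beta * x3[i])));
}
"
fit <- stan(model_code = im_code,
            data = list(n1 = nrow(countdata1), y1 = countdata1$counts, x1 = countdata1$selev,
                        n2 = nrow(countdata2), y2 = countdata2$counts, x2 = countdata2$selev,
                        n3 = nrow(detdata),    y3 = detdata$y,         x3 = detdata$selev))

With Stan’s default flat priors, the posterior summaries should land close to the joint MLEs from NLLjoint above.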