Introduction

Introduction

In real-world scenarios, it is challenging to detect all individuals in a population.

Imperfect detection!
Imperfect detection!

Factors like habitat, behavior, and observer error affect detection rates.

Accurate population estimates require models that account for detection imperfections.

Imperfect detection
Imperfect detection

N-mixture model

A statistical model used to estimate abundance of species from repeated count data while accounting for detection probability. (Royle, 2004)

  • Abundance: Ni~ Poisson (λ) (Ecological or state process)
  • Detection: 𝐶ij ~Binomial(𝑁i, 𝑝) (Observation process)

Where,

  • Ni is the True abundance at site i.
  • λ is the expected abundance
  • 𝐶ij is the count at site i during survey j
  • p is the per-individual detection probability

Model assumptions

Model assumptions

  • Homogeneity of detection among Ni individuals
  • Closed population assumption (No births, deaths, immigration, or emigration during the study period!)
  • Independence detection
  • Parametric modeling assumptions
  • No false positive error (Avoid counting non-target species!No recounting!)

What data do we need?

  1. Repeated count data

  2. Site information

  3. Covariates

Measurement error or detection error!
Study area True abundance First visit Second visit Third visit
A 3 2 1 2
B 4 2 0 3

Some terminology in N-mixture model

  • Occasion: Specific period of a count at a site.

  • Occasion length: How long for single occasion! (3 days, 7 days, etc.)

  • Detection History: Record of detections across occasions.

  • NA: Not available (No survey during that period)

Pros and cons of N-mixture model

Pros and cons of N-mixture model

Pros and cons of N-mixture model
Pros Cons
Accounts for detection probability Strict model assumptions
Count data is cheap to collect Count data is less information than mark-recapture data
Does not require auxillary information (eg. Distance, double observer etc.) Require lots of replications
Analysis is straight forward. Overdispersion issues
Handles missing data (NA) Some limitations in camera trap study
Only needs repeated count data
Extensibel with covariates
Assessible software
Implemented in R packages (eg. unmarked, ubms)

How many sample size do we need?

  1. Sites (M),
  2. Occasion (J) ### Source - AHM book

Source - AHM Book

N-mixture model syntax

N-mixture model syntax

Beware when we use NB!
Beware when we use NB!

Good fit / bad prediction dilemma

  • NB can handles overdispersion and sometime shows lower AIC and good fit.
  • But it can sometimes produce unrealistic abundance values.
  • NB can overfit. Don’t solely rely on NB estimates, check residual plot.

Source - AHM Book

Modelling procedure

Modelling procedure

R Demo

R Demo

Install pacakges

library(unmarked)
library(AICcmodavg)
library(ggplot2)

Set working directory and load data

Check the data (Optional)

head(dataraw)
  sites N occ1 occ2 occ3 veght1 veght2 veght3 elevation   eu_road location
1     1 2    0    2    0    212    248    216       218  310.0000        A
2     2 5    1    2    5    217    238    238       209  630.0793        A
3     3 3    0    3    1    227    225    242       225 1100.0000        A
4     4 2    0    1    2    235    224    183       208 1315.7507        A
5     5 2    1    2    2    242    248    201       187 1175.9678        A
6     6 3    1    1    3    239    263    239       185  233.4524        A
summary(dataraw)
     sites             N              occ1          occ2       
 Min.   : 1.00   Min.   :0.000   Min.   :0.0   Min.   :0.0000  
 1st Qu.:18.25   1st Qu.:1.250   1st Qu.:0.0   1st Qu.:0.0000  
 Median :35.50   Median :2.000   Median :0.0   Median :0.5000  
 Mean   :35.50   Mean   :2.186   Mean   :0.3   Mean   :0.6714  
 3rd Qu.:52.75   3rd Qu.:3.000   3rd Qu.:1.0   3rd Qu.:1.0000  
 Max.   :70.00   Max.   :5.000   Max.   :1.0   Max.   :3.0000  
      occ3            veght1          veght2        veght3      elevation    
 Min.   :0.0000   Min.   : 18.0   Min.   : 16   Min.   :162   Min.   :165.0  
 1st Qu.:0.0000   1st Qu.:189.0   1st Qu.:199   1st Qu.:210   1st Qu.:198.2  
 Median :0.0000   Median :213.0   Median :232   Median :241   Median :208.0  
 Mean   :0.6286   Mean   :201.9   Mean   :205   Mean   :239   Mean   :209.6  
 3rd Qu.:1.0000   3rd Qu.:237.2   3rd Qu.:239   3rd Qu.:269   3rd Qu.:218.8  
 Max.   :5.0000   Max.   :286.0   Max.   :263   Max.   :305   Max.   :275.0  
    eu_road          location        
 Min.   :  14.14   Length:70         
 1st Qu.: 442.50   Class :character  
 Median : 768.16   Mode  :character  
 Mean   : 901.44                     
 3rd Qu.:1159.66                     
 Max.   :2575.62                     

Look at the data for true abundance (N)

table(dataraw$N) #True abundance distribution

 0  1  2  3  4  5 
 7 11 24 19  8  1 
sum(dataraw$N) #153, Ture total population size at M sites
[1] 153
sum(dataraw$N>0) #63, True number of occupied sites
[1] 63
mean(dataraw$N) #2.185, True mean abundance (estimate of lambda)
[1] 2.185714

Look at the data for observation

table(apply(dataraw[,c(3:5)],1,max)) #Observed abundance distribution (max count)

 0  1  2  3  5 
26 28 13  2  1 
sum(apply(dataraw[,c(3:5)],1,max)) #Observed total population size at M sites
[1] 65
sum(apply(dataraw[,c(3:5)],1,max)>0) #Observed number of occupied sites
[1] 44
mean(apply(dataraw[,c(3:5)],1,max)) #Observed mean 'relative abundance'
[1] 0.9285714

Look overall summary with table format

result <- data.frame(
  Abundance = c("True abundance (N)", "Observed abundance"),
  Total = c(sum(dataraw$N), sum(apply(dataraw[, 3:5], 1, max))),
  Occupied_sites = c(sum(dataraw$N > 0), sum(apply(dataraw[, 3:5], 1, max) > 0)),
  Mean = c(mean(dataraw$N), mean(apply(dataraw[, 3:5], 1, max)))
)

# Remove row names by setting them to NULL
row.names(result) <- NULL

# Print the result
print(result)
           Abundance Total Occupied_sites      Mean
1 True abundance (N)   153             63 2.1857143
2 Observed abundance    65             44 0.9285714
  • Until here is just checking our dataset information.

  • In real world ecological dataset, it is not possible to check actual abundance (N)

Fitting N-mixture model

1. Format data

y <- dataraw[,c(3:5)] ##Count Matrix for repeated count
siteCovs <- dataraw[,c('elevation','eu_road','location')] ##Site covariates for abundance
obsCovs=list(veght=dataraw[,c(6:8)]) ##Observation covariates for detection

2. Create unmarked dataframe

umf <- unmarkedFramePCount(y = y,   ##Count Matrix for repeated count
                           siteCovs = siteCovs,   ##Site covariates for abundance
                           obsCovs = obsCovs ) ##Observation covariates for detection

2.1 Check our unmarked frame

summary(umf)
unmarkedFrame Object

70 sites
Maximum number of observations per site: 3 
Mean number of observations per site: 3 
Sites with at least one detection: 44 

Tabulation of y observations:
  0   1   2   3   5 
124  65  18   2   1 

Site-level covariates:
   elevation        eu_road        location
 Min.   :165.0   Min.   :  14.14   A:30    
 1st Qu.:198.2   1st Qu.: 442.50   B:40    
 Median :208.0   Median : 768.16           
 Mean   :209.6   Mean   : 901.44           
 3rd Qu.:218.8   3rd Qu.:1159.66           
 Max.   :275.0   Max.   :2575.62           

Observation-level covariates:
     veght      
 Min.   : 16.0  
 1st Qu.:198.2  
 Median :226.0  
 Mean   :215.3  
 3rd Qu.:244.0  
 Max.   :305.0  

Plot

plot(umf)

2.2 Testing K value with null model

Generally, there has two ways to decide K value

  • First approach - According to Kery et al. (2018), we can define K value as K=max(Ci,t) + 100
max(y,na.rm = TRUE) #5 is maximum K value (max(Ci,t)), then add 100
[1] 5
  • Use K value 105 for the model
null_p1=pcount(~1~1,K=105,data=umf) #According to (Kery et al., 2018)
summary(null_p1)  #AIC = 388.2797 

Call:
pcount(formula = ~1 ~ 1, data = umf, K = 105)

Abundance (log-scale):
 Estimate    SE    z P(>|z|)
    0.392 0.207 1.89  0.0585

Detection (logit-scale):
 Estimate    SE     z P(>|z|)
   -0.573 0.308 -1.86   0.063

AIC: 388.2797 
Number of sites: 70
null_p2=pcount(~1~1,K=200,data=umf) #Test another K value > 105 to make                                      sure whether AIC value is stable or not
summary(null_p2)  #AIC = 388.2797    

Call:
pcount(formula = ~1 ~ 1, data = umf, K = 200)

Abundance (log-scale):
 Estimate    SE    z P(>|z|)
    0.392 0.207 1.89  0.0585

Detection (logit-scale):
 Estimate    SE     z P(>|z|)
   -0.573 0.308 -1.86   0.063

AIC: 388.2797 
Number of sites: 70
  • AIC = 388.2797 for both null_p1 and null_p2

  • Same AIC value with different K value, it means we can use K value 105

  • If AIC is not stable, your model has identifiability issues

  • Then try another distributions OR add some covariates and then check the best model

2.2 Second approach to check K value

test_K <- function(K){
  fit <- suppressWarnings(pcount(~1~1, umf, K=K))
  exp(coef(fit)[1])
}

kvals <- c(8,10,15,20,25,30)
lams <- sapply(kvals, test_K)
plot(kvals, lams, type='o', pch=19, xlim=c(5, 30),
     xlab="K", ylab="lambda estimate")
abline(v=15, col='red', lty=2)

  • Based on second method, I will try K > 15.

  • And then, I will test again for AIC

null_p1=pcount(~1~1,K=20,data=umf)
summary(null_p1)  #AIC = 388.2797 

Call:
pcount(formula = ~1 ~ 1, data = umf, K = 20)

Abundance (log-scale):
 Estimate    SE    z P(>|z|)
    0.392 0.207 1.89  0.0585

Detection (logit-scale):
 Estimate    SE     z P(>|z|)
   -0.573 0.308 -1.86   0.063

AIC: 388.2797 
Number of sites: 70
null_p2=pcount(~1~1,K=50,data=umf)
summary(null_p2)  #AIC = 388.2797 

Call:
pcount(formula = ~1 ~ 1, data = umf, K = 50)

Abundance (log-scale):
 Estimate    SE    z P(>|z|)
    0.392 0.207 1.89  0.0585

Detection (logit-scale):
 Estimate    SE     z P(>|z|)
   -0.573 0.308 -1.86   0.063

AIC: 388.2797 
Number of sites: 70
  • AIC value doesn’t change. So I will choose K=20 for this tutorial

  • You can choose either first or second approach to decide your K value in your model as long as your model AIC is stable

  • Higer K value will impact your computation time

  • Now let’s look at our null model estimation

  • N-mixture used log-scale for abundance and logit scale for detection

  • So we need to backtransform to get actual mean abundance and detection

lambda=backTransform(null_p1,type='state') ##state for abundance process
lambda
Backtransformed linear combination(s) of Abundance estimate(s)

 Estimate    SE LinComb (Intercept)
     1.48 0.306   0.392           1

Transformation: exp 
confint(lambda) #to get 95% confident interval
     0.025    0.975
 0.9860346 2.220318
detection=backTransform(null_p1,type='det') ##det for detection process
detection
Backtransformed linear combination(s) of Detection estimate(s)

 Estimate     SE LinComb (Intercept)
     0.36 0.0711  -0.573           1

Transformation: logistic 
confint(detection) #to get 95% confident interval
     0.025     0.975
 0.2353875 0.5077896
  • Compare with original dataset (Recall the original dataset)
print(result)
           Abundance Total Occupied_sites      Mean
1 True abundance (N)   153             63 2.1857143
2 Observed abundance    65             44 0.9285714
  • Our null model estimate:

  • For abundance 1.48 (0.986 Lower CI, 2.22 Upper CI)

  • For detection 0.36 (0.235 Lower CI, 0.507 Upper CI)

  • Close to true abundance!

  • Start fitting different models with covariates

3. Fit Model with pcount

  • First for detection, abundance as constant (i.e., 1)
det_full=pcount(~scale(veght)+scale(elevation)~1,
                K=20,mixture='P',data=umf) #all detection variables
summary(det_full)

Call:
pcount(formula = ~scale(veght) + scale(elevation) ~ 1, data = umf, 
    K = 20, mixture = "P")

Abundance (log-scale):
 Estimate    SE    z P(>|z|)
     0.38 0.193 1.98  0.0482

Detection (logit-scale):
                 Estimate    SE      z P(>|z|)
(Intercept)       -0.5635 0.292 -1.932  0.0533
scale(veght)       0.0234 0.138  0.169  0.8657
scale(elevation)  -0.4410 0.200 -2.201  0.0277

AIC: 387.277 
Number of sites: 70
det_veght=pcount(~scale(veght)~1,
                K=20,mixture='P',data=umf) #only veght variable
summary(det_veght)

Call:
pcount(formula = ~scale(veght) ~ 1, data = umf, K = 20, mixture = "P")

Abundance (log-scale):
 Estimate    SE    z P(>|z|)
    0.399 0.209 1.91  0.0565

Detection (logit-scale):
             Estimate    SE      z P(>|z|)
(Intercept)   -0.5861 0.311 -1.885  0.0595
scale(veght)   0.0573 0.137  0.418  0.6759

AIC: 390.1048 
Number of sites: 70
det_elevation=pcount(~scale(elevation)~1,
                     K=20,mixture='P',data=umf) #only elevation variable
summary(det_elevation)

Call:
pcount(formula = ~scale(elevation) ~ 1, data = umf, K = 20, mixture = "P")

Abundance (log-scale):
 Estimate    SE    z P(>|z|)
    0.378 0.192 1.97  0.0483

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.559 0.290 -1.93  0.0536
scale(elevation)   -0.446 0.198 -2.25  0.0246

AIC: 385.3056 
Number of sites: 70

4. Let’s select the best detection model

Model selection for detection process

Listing models

Modeles <- list(det_full,det_veght,det_elevation)

Names   <- c("det_full","det_veght","det_elevation")

aictab(cand.set = Modeles, modnames = Names)

Model selection based on AICc:

              K   AICc Delta_AICc AICcWt Cum.Wt      LL
det_elevation 3 385.67       0.00   0.70   0.70 -189.65
det_full      4 387.89       2.22   0.23   0.94 -189.64
det_veght     3 390.47       4.80   0.06   1.00 -192.05
  • det_elevation is the best model, then fit for abundance
(mfull =pcount(~scale(elevation)~scale(elevation)+scale(eu_road)+location,
               K=20,mixture='P',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(elevation) + scale(eu_road) + 
    location, data = umf, K = 20, mixture = "P")

Abundance (log-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)         0.872 0.275  3.17 0.00154
scale(elevation)    0.542 0.284  1.91 0.05632
scale(eu_road)     -0.120 0.154 -0.78 0.43538
locationB          -0.716 0.279 -2.56 0.01036

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.708 0.342 -2.07  0.0382
scale(elevation)   -0.762 0.305 -2.50  0.0125

AIC: 383.0385 
Number of sites: 70
(m1 =pcount(~scale(elevation)~scale(elevation)+scale(eu_road),
            K=20,mixture='P',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(elevation) + scale(eu_road), 
    data = umf, K = 20, mixture = "P")

Abundance (log-scale):
                 Estimate    SE      z P(>|z|)
(Intercept)        0.4868 0.240  2.024  0.0429
scale(elevation)   0.3089 0.267  1.155  0.2479
scale(eu_road)    -0.0911 0.150 -0.607  0.5437

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.686 0.334 -2.06  0.0397
scale(elevation)   -0.716 0.307 -2.33  0.0196

AIC: 387.7889 
Number of sites: 70
(m2 =pcount(~scale(elevation)~scale(elevation)+location,
            K=20,mixture='P',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(elevation) + location, 
    data = umf, K = 20, mixture = "P")

Abundance (log-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)         0.864 0.274  3.15 0.00161
scale(elevation)    0.470 0.269  1.74 0.08103
locationB          -0.702 0.279 -2.52 0.01186

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.705 0.341 -2.07  0.0386
scale(elevation)   -0.756 0.304 -2.48  0.0130

AIC: 381.6681 
Number of sites: 70
(m3 =pcount(~scale(elevation)~scale(eu_road)+location,
            K=20,mixture='P',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(eu_road) + location, 
    data = umf, K = 20, mixture = "P")

Abundance (log-scale):
               Estimate    SE       z P(>|z|)
(Intercept)      0.6291 0.236  2.6615 0.00778
scale(eu_road)  -0.0106 0.145 -0.0732 0.94164
locationB       -0.5289 0.266 -1.9913 0.04644

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.551 0.311 -1.77  0.0769
scale(elevation)   -0.283 0.253 -1.12  0.2637

AIC: 385.1921 
Number of sites: 70
(m4 =pcount(~scale(elevation)~scale(elevation),
            K=20,mixture='P',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(elevation), data = umf, 
    K = 20, mixture = "P")

Abundance (log-scale):
                 Estimate    SE    z P(>|z|)
(Intercept)         0.488 0.240 2.03  0.0423
scale(elevation)    0.255 0.253 1.01  0.3132

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.685 0.334 -2.05  0.0403
scale(elevation)   -0.712 0.307 -2.32  0.0203

AIC: 386.1667 
Number of sites: 70
(m5 =pcount(~scale(elevation)~scale(eu_road),
            K=20,mixture='P',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(eu_road), data = umf, 
    K = 20, mixture = "P")

Abundance (log-scale):
               Estimate    SE      z P(>|z|)
(Intercept)       0.374 0.193  1.932  0.0533
scale(eu_road)   -0.027 0.141 -0.192  0.8476

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.556 0.291 -1.91  0.0565
scale(elevation)   -0.429 0.218 -1.97  0.0487

AIC: 387.2684 
Number of sites: 70
(m6 =pcount(~scale(elevation)~location,
            K=20,mixture='P',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ location, data = umf, K = 20, 
    mixture = "P")

Abundance (log-scale):
            Estimate    SE     z P(>|z|)
(Intercept)     0.63 0.235  2.68 0.00733
locationB      -0.53 0.265 -2.00 0.04562

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.549 0.309 -1.78  0.0757
scale(elevation)   -0.291 0.232 -1.25  0.2111

AIC: 383.1975 
Number of sites: 70
  • Let’s select again for the best abundance model

Model selection for abundance process

Listing models

Modeles <- list(mfull,m1,m2,m3,m4,m5,m6)

Names   <- c("mfull","m1","m2","m3","m4","m5","m6")

aictab(cand.set = Modeles, modnames = Names)

Model selection based on AICc:

      K   AICc Delta_AICc AICcWt Cum.Wt      LL
m2    5 382.61       0.00   0.42   0.42 -185.83
m6    4 383.81       1.21   0.23   0.65 -187.60
mfull 6 384.37       1.77   0.17   0.83 -185.52
m3    5 386.13       3.52   0.07   0.90 -187.60
m4    4 386.78       4.18   0.05   0.95 -189.08
m5    4 387.88       5.28   0.03   0.98 -189.63
m1    5 388.73       6.12   0.02   1.00 -188.89
  • Finally, m2 is the best model for both detection and abundance

Caution!

  • There has multiple ways for model selection procedure and you can choose

  • approapriate method based on your experience and literature review

  • In our tutorial, m2 is the best model

(m2 =pcount(~scale(elevation)~scale(elevation)+location,
            K=20, mixture='P',data = umf)) #Best model

Call:
pcount(formula = ~scale(elevation) ~ scale(elevation) + location, 
    data = umf, K = 20, mixture = "P")

Abundance (log-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)         0.864 0.274  3.15 0.00161
scale(elevation)    0.470 0.269  1.74 0.08103
locationB          -0.702 0.279 -2.52 0.01186

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.705 0.341 -2.07  0.0386
scale(elevation)   -0.756 0.304 -2.48  0.0130

AIC: 381.6681 
Number of sites: 70
  • Check another two distributions: ZIP and NB
(m2_zip =pcount(~scale(elevation)~scale(elevation)+location,
                K=20,mixture='ZIP',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(elevation) + location, 
    data = umf, K = 20, mixture = "ZIP")

Abundance (log-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)         1.069 0.434  2.46  0.0138
scale(elevation)    0.540 0.322  1.68  0.0934
locationB          -0.705 0.288 -2.45  0.0145

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.855 0.464 -1.84  0.0653
scale(elevation)   -0.812 0.346 -2.34  0.0190

Zero-inflation (logit-scale):
 Estimate   SE     z P(>|z|)
    -2.31 1.48 -1.56   0.119

AIC: 383.1792 
Number of sites: 70
(m2_nb =pcount(~scale(elevation)~scale(elevation)+location,
               K=20,mixture='NB',data = umf))

Call:
pcount(formula = ~scale(elevation) ~ scale(elevation) + location, 
    data = umf, K = 20, mixture = "NB")

Abundance (log-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)         0.876 0.293  2.99  0.0028
scale(elevation)    0.475 0.277  1.72  0.0863
locationB          -0.700 0.282 -2.48  0.0131

Detection (logit-scale):
                 Estimate    SE     z P(>|z|)
(Intercept)        -0.724 0.372 -1.95  0.0517
scale(elevation)   -0.762 0.312 -2.45  0.0145

Dispersion (log-scale):
 Estimate   SE     z P(>|z|)
     3.68 7.04 0.523   0.601

AIC: 383.6468 
Number of sites: 70

Model selection again between different mixtures

Modeles <- list(m2,m2_zip,m2_nb)

Names   <- c("m2","m2_zip","m2_nb")

aictab(cand.set = Modeles, modnames = Names)

Model selection based on AICc:

       K   AICc Delta_AICc AICcWt Cum.Wt      LL
m2     5 382.61       0.00   0.59   0.59 -185.83
m2_zip 6 384.51       1.91   0.23   0.82 -185.59
m2_nb  6 384.98       2.37   0.18   1.00 -185.82
  • Poisson mixture won among three mixtures

  • Sometime NB shows lower AIC than other two mixtures

  • In this case, it would be nice to check residual plot because of good fit/ bad prediction dilemma,

  • Don’t solely rely on NB estimates, as they can sometimes provide unrealistic abundance values.

5. Goodness of fit test

gof=Nmix.gof.test(m2,nsim=100)  #Use 1000 nsim for real data

gof

Chi-square goodness-of-fit for N-mixture model of 'unmarkedFitPCount' class

Observed chi-square statistic = 206.785 
Number of bootstrap samples = 100
P-value = 0.45

Quantiles of bootstrapped statistics:
  0%  25%  50%  75% 100% 
 158  188  202  223  332 

Estimate of c-hat = 1 
  • We got c-hat value 1. It means no overdispersion, Good fit

  • If c.hat value > 1,our model has overdispersion problem

  • If c.hat value < 1, our model has underdispersion problem

  • Then you need to take into account c-hat value in your prediction

Check three results and compare with true mean

cbind(rbind("Poisson" = exp(coef(m2)[1]), 
            "NegBin" = exp(coef(m2_nb)[1]), 
            "ZIP"= exp(coef(m2_zip)[1])))
        lam(Int)
Poisson 2.373612
NegBin  2.401943
ZIP     2.911520

Print the original dataset

print(result)
           Abundance Total Occupied_sites      Mean
1 True abundance (N)   153             63 2.1857143
2 Observed abundance    65             44 0.9285714
  • Pretty close estimation!

6. Prediction

For Detection

  • Elevation variable

  • Hold everything constant except for elevation

newDat <- data.frame(elevation = seq(min(dataraw$elevation), max(dataraw$elevation), length=100))
Epred=modavgPred(cand.set = list(m2), newdata=newDat, parm.type = "detect", type= "response")

newDat$mod_avg_pred <- Epred$mod.avg.pred
newDat$uncond_se <- Epred$uncond.se
newDat$low95 <- Epred$lower.CL
newDat$upp95 <- Epred$upper.CL

###Plotting the result for elevation

ggplot(newDat, aes(x = elevation)) + 
  geom_line(aes(y = mod_avg_pred), colour = "blue") + 
  geom_ribbon(aes(ymin = low95, ymax = upp95), alpha = 0.2) +
  ylim(0, 1) +  
  labs(y = expression("Detection probability"), 
       x = "Elevation (m)") +
  theme_classic()

For Abundance

  • Location variable

  • Hold everything constant except for location

newDat <- data.frame(location=factor(c("A", "B")), elevation = mean(dataraw$elevation))
Epred=modavgPred(cand.set = list(m2), newdata=newDat, parm.type = "lambda", type = "response")

newDat$mod_avg_pred <- Epred$mod.avg.pred
newDat$uncond_se <- Epred$uncond.se
newDat$low95 <- Epred$lower.CL
newDat$upp95 <- Epred$upper.CL

###Plotting the result for eu_water
ggplot(newDat, aes(location,mod_avg_pred)) +
  geom_col(aes(fill=location), color = "black", width = 0.85) +
  ylim(0,10) +
  geom_errorbar(aes(ymin = low95,
                    ymax = upp95),
                color = "#22292F",
                width = .1)+
  labs(y = expression("Predicted abundance"), 
       x = "Location of camera traps")+
  theme_classic()

  • We do not need to draw a plot for elevation in abundance process becasue that variable is not significant in m2

Thank you very much :)

References

  1. M. Kéry y J. A. Royle. 2016. Applied hierarchical modeling in ecology: analysis of distribution, abundance and species richness in R and BUGS. Vol. 1 Prelude and static models. Academic Press, London, UK. 783 Pp. [ISBN: 978-0-12-801378-6].

  2. Quantitative methods for population dynamics. Two day workshop by Olivier Gimenez, Aurélien Besnard and Sarah Cubaynes. https://oliviergimenez.github.io/popdyn-workshop/

  3. Kéry, M. 2018. Identifiability in N‐mixture models: a large‐scale screening test with bird data. Ecology 99:281–288.

---
title: "N-mixture model"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll
    social: menu
    source_code: embed
    self_contained: true
    code_folding: hide
---

```{r setup, include=FALSE}
library(flexdashboard)
```

# Sidebar {.sidebar}

### **Estimating animal abundance by accounting for imperfect detecting**

**Author: Ye Htet Lwin**

**Date: 2025-12-25**

## **🐾 Content**

**Introduction**

**Model assumption**

**Pros and Cons of N-mixture model**

**Model syntax**

**Modelling procedure**

**Pros and Cons of N-mixture model**

**R demo**

**References**

### **Acknowledgments**

All the code used in this tutorial is based on the work of various authors, contributors, and library developers. All credit goes to them for their invaluable contributions and dedication to the programming community. Thank you for your hard work and for sharing your knowledge and tools.

🐅🐻 \*\* Thank you very much\*\* 🦌🐒

# Introduction

### **Introduction**

In real-world scenarios, it is challenging to detect all individuals in a population.

![Imperfect detection!](images/Screenshot%202025-12-25%20163328.png)

Factors like habitat, behavior, and observer error affect detection rates.

Accurate population estimates require models that account for detection imperfections.

![Imperfect detection](images/Screenshot%202025-12-25%20164039.png)

### **N-mixture model**

A statistical model used to estimate abundance of species from repeated count data while accounting for detection probability. (Royle, 2004)

-   **Abundance**: Ni\~ Poisson (λ) (Ecological or state process)
-   **Detection**: 𝐶ij \~Binomial(𝑁i, 𝑝) (Observation process)

### Where,

-   Ni is the True abundance at site i.
-   λ is the expected abundance
-   𝐶ij is the count at site i during survey j
-   p is the per-individual detection probability

# Model assumptions

### **Model assumptions**

-   Homogeneity of detection among Ni individuals
-   Closed population assumption (No births, deaths, immigration, or emigration during the study period!)
-   Independence detection
-   Parametric modeling assumptions
-   No false positive error (Avoid counting non-target species!No recounting!)

### **What data do we need?**

1.  Repeated count data

2.  Site information

3.  Covariates

### ![](images/Screenshot%202025-12-25%20164908.png)

| Study area | True abundance | First visit | Second visit | Third visit |
|------------|----------------|-------------|--------------|-------------|
| A          | 3              | 2           | 1            | 2           |
| B          | 4              | 2           | 0            | 3           |

: Measurement error or detection error!

### **Some terminology in N-mixture model**

-   **Occasion**: Specific period of a count at a site.

-   **Occasion length**: How long for single occasion! (3 days, 7 days, etc.)

-   **Detection History**: Record of detections across occasions.

-   **NA**: Not available (No survey during that period)

### ![](images/Screenshot%202025-12-25%20165525.png)

# Pros and cons of N-mixture model

### **Pros and cons of N-mixture model**

| Pros | Cons |
|------------------------------------|------------------------------------|
| Accounts for **detection probability** | **Strict** model assumptions |
| Count data is **cheap** to collect | Count data is **less information** than **mark-recapture** data |
| Does **not require** auxillary information (eg. Distance, double observer etc.) | Require **lots of replications** |
| Analysis is **straight forward**. | **Overdispersion** issues |
| Handles **missing data (NA)** | Some limitations in **camera trap** study |
| Only needs **repeated count data** |  |
| **Extensibel** with covariates |  |
| **Assessible software** |  |
| Implemented in **R packages** (eg. unmarked, ubms) |  |

: Pros and cons of N-mixture model

### **How many sample size do we need?**

1.  Sites (M),
2.  Occasion (J) \### ![Source - AHM book](images/Screenshot%202025-12-25%20170330.png)

### ![Source - AHM Book](images/Screenshot%202025-12-25%20170416.png)

# N-mixture model syntax

### **N-mixture model syntax**

![Beware when we use NB!](images/Screenshot%202025-12-25%20170526.png)

### **Good fit / bad prediction dilemma**

-   NB can handles overdispersion and sometime shows lower AIC and good fit.
-   But it can sometimes produce unrealistic abundance values.
-   NB can overfit. Don't solely rely on NB estimates, check residual plot.

### ![Source - AHM Book](images/Screenshot%202025-12-25%20170649.png)

# Modelling procedure

### **Modelling procedure**

### ![](images/Screenshot%202025-12-25%20170807.png)

# R Demo

## **R Demo**

**Install pacakges**

```{r, echo = TRUE}
library(unmarked)
library(AICcmodavg)
library(ggplot2)
```

**Set working directory and load data**

```{r, include=FALSE}
setwd("C:/Users/yehte/Downloads/Nmix (Demo)")
dataraw=read.csv("demo.csv")
```

**Check the data (Optional)**

```{r, echo = TRUE}
head(dataraw)
summary(dataraw)
```

**Look at the data for true abundance (N)**

```{r, echo = TRUE}
table(dataraw$N) #True abundance distribution

sum(dataraw$N) #153, Ture total population size at M sites

sum(dataraw$N>0) #63, True number of occupied sites

mean(dataraw$N) #2.185, True mean abundance (estimate of lambda)

```

**Look at the data for observation**

```{r, echo = TRUE}
table(apply(dataraw[,c(3:5)],1,max)) #Observed abundance distribution (max count)

sum(apply(dataraw[,c(3:5)],1,max)) #Observed total population size at M sites

sum(apply(dataraw[,c(3:5)],1,max)>0) #Observed number of occupied sites

mean(apply(dataraw[,c(3:5)],1,max)) #Observed mean 'relative abundance'

```

**Look overall summary with table format**

```{r, echo = TRUE}
result <- data.frame(
  Abundance = c("True abundance (N)", "Observed abundance"),
  Total = c(sum(dataraw$N), sum(apply(dataraw[, 3:5], 1, max))),
  Occupied_sites = c(sum(dataraw$N > 0), sum(apply(dataraw[, 3:5], 1, max) > 0)),
  Mean = c(mean(dataraw$N), mean(apply(dataraw[, 3:5], 1, max)))
)

# Remove row names by setting them to NULL
row.names(result) <- NULL

# Print the result
print(result)
```

-   Until here is just checking our dataset information.

-   In real world ecological dataset, it is not possible to check actual abundance (N)

**Fitting N-mixture model**

**1. Format data**

```{r, echo = TRUE}
y <- dataraw[,c(3:5)] ##Count Matrix for repeated count
siteCovs <- dataraw[,c('elevation','eu_road','location')] ##Site covariates for abundance
obsCovs=list(veght=dataraw[,c(6:8)]) ##Observation covariates for detection
```

**2. Create unmarked dataframe**

```{r, echo = TRUE}
umf <- unmarkedFramePCount(y = y,   ##Count Matrix for repeated count
                           siteCovs = siteCovs,   ##Site covariates for abundance
                           obsCovs = obsCovs ) ##Observation covariates for detection

```

**2.1 Check our unmarked frame**


```{r, echo = TRUE}
summary(umf)

```

**Plot**

```{r, echo = TRUE}
plot(umf)

```

**2.2 Testing K value with null model**

**Generally, there has two ways to decide K value**

-   First approach - According to Kery et al. (2018), we can define K value as K=max(Ci,t) + 100

```{r, echo = TRUE}
max(y,na.rm = TRUE) #5 is maximum K value (max(Ci,t)), then add 100

```

-   Use K value 105 for the model

```{r, echo = TRUE}
null_p1=pcount(~1~1,K=105,data=umf) #According to (Kery et al., 2018)
summary(null_p1)  #AIC = 388.2797 
null_p2=pcount(~1~1,K=200,data=umf) #Test another K value > 105 to make                                      sure whether AIC value is stable or not
summary(null_p2)  #AIC = 388.2797    
```

-   AIC = 388.2797 for both null_p1 and null_p2

-   Same AIC value with different K value, it means we can use K value 105

-   If AIC is not stable, your model has identifiability issues

-   Then try another distributions OR add some covariates and then check the best model

**2.2 Second approach to check K value**

```{r, echo = TRUE}
test_K <- function(K){
  fit <- suppressWarnings(pcount(~1~1, umf, K=K))
  exp(coef(fit)[1])
}

kvals <- c(8,10,15,20,25,30)
lams <- sapply(kvals, test_K)
plot(kvals, lams, type='o', pch=19, xlim=c(5, 30),
     xlab="K", ylab="lambda estimate")
abline(v=15, col='red', lty=2)

```

-   Based on second method, I will try K \> 15.

-   And then, I will test again for AIC

```{r, echo = TRUE}
null_p1=pcount(~1~1,K=20,data=umf)
summary(null_p1)  #AIC = 388.2797 
null_p2=pcount(~1~1,K=50,data=umf)
summary(null_p2)  #AIC = 388.2797 
```

-   AIC value doesn't change. So I will choose K=20 for this tutorial

-   You can choose either first or second approach to decide your K value in your model as long as your model AIC is stable

-   Higer K value will impact your computation time

-   Now let's look at our null model estimation

-   N-mixture used log-scale for abundance and logit scale for detection

-   So we need to backtransform to get actual mean abundance and detection

```{r, echo = TRUE}
lambda=backTransform(null_p1,type='state') ##state for abundance process
lambda
confint(lambda) #to get 95% confident interval

detection=backTransform(null_p1,type='det') ##det for detection process
detection
confint(detection) #to get 95% confident interval

```

-   Compare with original dataset (Recall the original dataset)

```{r, echo = TRUE}
print(result)
```

-   Our null model estimate:

-   For abundance 1.48 (0.986 Lower CI, 2.22 Upper CI)

-   For detection 0.36 (0.235 Lower CI, 0.507 Upper CI)

-   Close to true abundance!

-   Start fitting different models with covariates

**3. Fit Model with pcount**

-   First for detection, abundance as constant (i.e., 1)

```{r, echo = TRUE}
det_full=pcount(~scale(veght)+scale(elevation)~1,
                K=20,mixture='P',data=umf) #all detection variables
summary(det_full)

det_veght=pcount(~scale(veght)~1,
                K=20,mixture='P',data=umf) #only veght variable
summary(det_veght)

det_elevation=pcount(~scale(elevation)~1,
                     K=20,mixture='P',data=umf) #only elevation variable
summary(det_elevation)
```

**4. Let's select the best detection model**

**Model selection for detection process**

**Listing models**

```{r, echo = TRUE}
Modeles <- list(det_full,det_veght,det_elevation)

Names   <- c("det_full","det_veght","det_elevation")

aictab(cand.set = Modeles, modnames = Names)

```

-   det_elevation is the best model, then fit for abundance

```{r, echo = TRUE}
(mfull =pcount(~scale(elevation)~scale(elevation)+scale(eu_road)+location,
               K=20,mixture='P',data = umf))

(m1 =pcount(~scale(elevation)~scale(elevation)+scale(eu_road),
            K=20,mixture='P',data = umf))

(m2 =pcount(~scale(elevation)~scale(elevation)+location,
            K=20,mixture='P',data = umf))

(m3 =pcount(~scale(elevation)~scale(eu_road)+location,
            K=20,mixture='P',data = umf))

(m4 =pcount(~scale(elevation)~scale(elevation),
            K=20,mixture='P',data = umf))

(m5 =pcount(~scale(elevation)~scale(eu_road),
            K=20,mixture='P',data = umf))

(m6 =pcount(~scale(elevation)~location,
            K=20,mixture='P',data = umf))

```

-   Let's select again for the best abundance model

**Model selection for abundance process**

**Listing models**

```{r, echo = TRUE}
Modeles <- list(mfull,m1,m2,m3,m4,m5,m6)

Names   <- c("mfull","m1","m2","m3","m4","m5","m6")

aictab(cand.set = Modeles, modnames = Names)
```

-   Finally, m2 is the best model for both detection and abundance

**Caution!**

-   There has multiple ways for model selection procedure and you can choose

-   approapriate method based on your experience and literature review

-   In our tutorial, m2 is the best model

```{r, echo = TRUE}
(m2 =pcount(~scale(elevation)~scale(elevation)+location,
            K=20, mixture='P',data = umf)) #Best model
```

-   Check another two distributions: ZIP and NB

```{r, echo = TRUE}
(m2_zip =pcount(~scale(elevation)~scale(elevation)+location,
                K=20,mixture='ZIP',data = umf))

(m2_nb =pcount(~scale(elevation)~scale(elevation)+location,
               K=20,mixture='NB',data = umf))
```

**Model selection again between different mixtures**

```{r, echo = TRUE}
Modeles <- list(m2,m2_zip,m2_nb)

Names   <- c("m2","m2_zip","m2_nb")

aictab(cand.set = Modeles, modnames = Names)

```

-   Poisson mixture won among three mixtures

-   Sometime NB shows lower AIC than other two mixtures

-   In this case, it would be nice to check residual plot because of good fit/ bad prediction dilemma,

-   Don't solely rely on NB estimates, as they can sometimes provide unrealistic abundance values.

**5. Goodness of fit test**

```{r, echo = TRUE}
gof=Nmix.gof.test(m2,nsim=100)  #Use 1000 nsim for real data
gof

```

-   We got c-hat value 1. It means no overdispersion, Good fit

-   If c.hat value \> 1,our model has overdispersion problem

-   If c.hat value \< 1, our model has underdispersion problem

-   Then you need to take into account c-hat value in your prediction

**Check three results and compare with true mean**

```{r, echo = TRUE}
cbind(rbind("Poisson" = exp(coef(m2)[1]), 
            "NegBin" = exp(coef(m2_nb)[1]), 
            "ZIP"= exp(coef(m2_zip)[1])))
```

**Print the original dataset**

```{r, echo = TRUE}
print(result)
```

-   Pretty close estimation!

**6. Prediction**

**For Detection**

-   Elevation variable

-   Hold everything constant except for elevation

```{r, echo = TRUE}
newDat <- data.frame(elevation = seq(min(dataraw$elevation), max(dataraw$elevation), length=100))
Epred=modavgPred(cand.set = list(m2), newdata=newDat, parm.type = "detect", type= "response")

newDat$mod_avg_pred <- Epred$mod.avg.pred
newDat$uncond_se <- Epred$uncond.se
newDat$low95 <- Epred$lower.CL
newDat$upp95 <- Epred$upper.CL

###Plotting the result for elevation

ggplot(newDat, aes(x = elevation)) + 
  geom_line(aes(y = mod_avg_pred), colour = "blue") + 
  geom_ribbon(aes(ymin = low95, ymax = upp95), alpha = 0.2) +
  ylim(0, 1) +  
  labs(y = expression("Detection probability"), 
       x = "Elevation (m)") +
  theme_classic()

```

**For Abundance**

-   Location variable

-   Hold everything constant except for location

```{r, echo = TRUE}
newDat <- data.frame(location=factor(c("A", "B")), elevation = mean(dataraw$elevation))
Epred=modavgPred(cand.set = list(m2), newdata=newDat, parm.type = "lambda", type = "response")

newDat$mod_avg_pred <- Epred$mod.avg.pred
newDat$uncond_se <- Epred$uncond.se
newDat$low95 <- Epred$lower.CL
newDat$upp95 <- Epred$upper.CL

###Plotting the result for eu_water
ggplot(newDat, aes(location,mod_avg_pred)) +
  geom_col(aes(fill=location), color = "black", width = 0.85) +
  ylim(0,10) +
  geom_errorbar(aes(ymin = low95,
                    ymax = upp95),
                color = "#22292F",
                width = .1)+
  labs(y = expression("Predicted abundance"), 
       x = "Location of camera traps")+
  theme_classic()
```

-   We do not need to draw a plot for elevation in abundance process becasue that variable is not significant in m2

**Thank you very much :)**

**References**

1.  M. Kéry y J. A. Royle. 2016. Applied hierarchical modeling in ecology: analysis of distribution, abundance and species richness in R and BUGS. Vol. 1 Prelude and static models. Academic Press, London, UK. 783 Pp. [ISBN: 978-0-12-801378-6].

2.  Quantitative methods for population dynamics. Two day workshop by Olivier Gimenez, Aurélien Besnard and Sarah Cubaynes. <https://oliviergimenez.github.io/popdyn-workshop/>

3.  Kéry, M. 2018. Identifiability in N‐mixture models: a large‐scale screening test with bird data. Ecology 99:281–288.