Single Season Occupancy Model: Exercise in R ‘Unmarked’

Compiled by: Subhasish Arandhara & Samrat Sengupta

$$$$

Occupancy models

Occupancy models estimate where species occur (a biological process) while accounting for imperfect detection (an observation process).

Background: Most mobile species are difficult to detect, and detection is rarely 100% certain; even indirect signs such as tracks, scats, or nests can be missed. Occupancy models account for the proportion of surveys in which the species goes undetected, helping us judge whether the species is truly absent from a site or present but missed.

A basic occupancy model estimates the following quantities:

Naive Occupancy: Counting the number of sites/cells where at least one presence was recorded during the repeat visits and dividing by the total number of sites surveyed.

Detection probability (p): the probability of detecting the species on a single survey, given that it is present at the site.

Occupancy (psi or Ψ): the probability that the species occurs at a site.

Assumptions

  • True detections: There are no false detections (where an observer believed that they saw the species in question when in reality it was a different species).

  • Population closure: The area being studied is closed to changes in species occupancy between sampling times.

  • Survey Independence: The surveys and the units are independent (i.e. whether or not a species was previously identified at a site has no impact on the current survey detection at that site).

  • Constant detection and occupancy probability: Detection and occupancy probabilities are assumed constant across sites. This assumption can be relaxed, provided the heterogeneity in occupancy (or detection) across sites is described and modeled with the appropriate covariates.

Single Season Models

A single-season design involves surveying a species at distinct sample units (sites), with each site surveyed on a number of occasions within a defined sampling period (e.g., the winter months).

Assumes that occupancy status (presence or absence) does not change during the study. That is, a species is either always present or always absent at each site, even if the species is not always detected.

The species is never falsely detected when the species is absent from a site, i.e., there are no false positive species detections. 

When the species is present at a site, the species may be detected (with probability p) or may not be detected (with probability 1 − p). 
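
These assumptions lead to a simple probability statement for each detection history. Below is a minimal sketch in R, using purely illustrative values of psi and p (not estimates from the bear data), of the probability of a particular history under the single-season model.

# Illustrative values only -- not estimates from the bear data
psi <- 0.6    # probability that a site is occupied
p   <- 0.45   # per-survey detection probability
K   <- 10     # number of survey occasions

# History with detections on 2 of the 10 occasions (e.g. 0 1 0 0 1 0 0 0 0 0):
# the site must be occupied, detected twice and missed eight times
psi * p^2 * (1 - p)^(K - 2)

# All-zero history: either the site is unoccupied,
# or it is occupied but the species was never detected
(1 - psi) + psi * (1 - p)^K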

$$$$

Installation and setup

Install R and RStudio

Install R and RStudio from https://posit.co/download/rstudio-desktop/

$$$$

Installing packages

list.of.packages <- c("ggplot2", "corrplot", "gridExtra", "dplyr", "unmarked", "lubridate", "tibble", "sf", "MuMIn", "AICcmodavg")

new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages, require, character.only = TRUE)

$$$$

The detection data

Our example data come from a portion of an Asiatic black bear survey in Pakke Tiger Reserve, Arunachal Pradesh. Surveys were conducted during the winter season in 511 cells, each a 2 km × 2 km quadrat, across the reserve. We will use 100 of these cells as sample data for this single-season model tutorial.

$$$$

Loading detection history

det.his <- read.csv("data/detectBear.csv", row.names = "X") 
head(det.his)
##   event.1 event.2 event.3 event.4 event.5 event.6 event.7 event.8 event.9
## 1       0       0       0       0       0       0       0       0       0
## 2       1       0       1       0       0       1       1       1       0
## 3       1       0       0       0       1       1       0       1       0
## 4       1       1       1       1       0       1       0       0       0
## 5       0       0       0       0       0       0       0       0       0
## 6       0       0       0       0       0       0       0       0       0
##   event.10
## 1        0
## 2        1
## 3        0
## 4        1
## 5        0
## 6        0

$$$$

The unmarked framework

The unmarked package organises detection data in its own style of data frame (an unmarkedFrame). To begin, we need the detection/non-detection matrix y, indicating for each site and survey occasion whether the species was detected.

unm_det.his <- unmarkedFrameOccu(y = as.matrix(det.his)) 

summary(unm_det.his)
## unmarkedFrame Object
## 
## 100 sites
## Maximum number of observations per site: 10 
## Mean number of observations per site: 10 
## Sites with at least one detection: 62 
## 
## Tabulation of y observations:
##   0   1 
## 710 290

$$$$

Naive occupancy estimate

naive_occ <- sum(ifelse(rowSums(det.his, na.rm=T)>0,1,0))/nrow(det.his)
naive_occ
## [1] 0.62
Naive occupancy: counting the number of sites/cells where at least one presence was recorded during the repeat visits and dividing by the total number of sites surveyed

$$$$

The null model

To begin, let's fit a "null" model that includes no covariates on detection or occupancy. First make sure the data are properly "framed" for unmarked modelling, then run the occu() function. occu() takes a double right-hand-side formula: the detection formula comes first and the occupancy formula second, followed by the data.

null.mod <- occu(formula = ~1 # detection formula first
                          ~1, # occupancy formula second, 
                  data = unm_det.his)

summary(null.mod) 
## 
## Call:
## occu(formula = ~1 ~ 1, data = unm_det.his)
## 
## Occupancy (logit-scale):
##  Estimate    SE    z P(>|z|)
##     0.494 0.207 2.39  0.0167
## 
## Detection (logit-scale):
##  Estimate     SE     z P(>|z|)
##    -0.133 0.0811 -1.64   0.102
## 
## AIC: 993.5046 
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 18 
## Bootstrap iterations: 0

The null model returns one estimate for detection and one for occupancy, both on the logit scale.
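
If desired, the logit-scale estimates and their 95% Wald confidence intervals can also be extracted directly from the fitted model; a minimal sketch:

# Logit-scale coefficients for each submodel
coef(null.mod, type = "state")   # occupancy intercept
coef(null.mod, type = "det")     # detection intercept

# 95% confidence intervals, still on the logit scale
confint(null.mod, type = "state")
confint(null.mod, type = "det")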

$$$$

Transformed detection probability

The occupancy and detection estimates above are reported on the logit (log-odds) scale. They can be converted to probabilities with the backTransform() function.

p<-backTransform(null.mod, type = "det")
p
## Backtransformed linear combination(s) of Detection estimate(s)
## 
##  Estimate     SE LinComb (Intercept)
##     0.467 0.0202  -0.133           1
## 
## Transformation: logistic
Detection probability p: probability that the species will be detected given that it already occurs at a site

For detection probability we use "det" as the transformation type: the probability of detecting a bear on any one of the 10 survey occasions, given that it is present at the site, is about 0.47 (SE = 0.02).
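
The same back-transformation can be reproduced by hand with plogis(), the inverse-logit (logistic) function; a quick check:

# Inverse-logit of the detection intercept reproduces backTransform()
plogis(coef(null.mod, type = "det"))   # roughly 0.47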

$$$$

Transformed occupancy probability

psi<-backTransform(null.mod, type = "state")
psi
## Backtransformed linear combination(s) of Occupancy estimate(s)
## 
##  Estimate     SE LinComb (Intercept)
##     0.621 0.0486   0.494           1
## 
## Transformation: logistic
Occupancy psi(Ψ) probability of a species occurring within a spatial unit (or “site”) during the sampling session

For occupancy, the transformation type is "state": there is roughly a 0.62 (SE = 0.05), or 62%, probability that a bear occupies a given survey site.
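
An approximate 95% confidence interval for occupancy can be obtained by computing a Wald interval on the logit scale and back-transforming it; a minimal sketch:

# Wald 95% CI on the logit scale, back-transformed to the probability scale
plogis(confint(null.mod, type = "state"))

# Point estimate, for comparison
plogis(coef(null.mod, type = "state"))   # roughly 0.62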


$$$$

The covariates

$$$$

Loading site covariates

site_cov <- read.csv("data/site.covBear.csv",row.names = "X") 

$$$$

Loading observer covariates

effort <- read.csv("data/effortBear.csv", row.names = "X") 
observers <- read.csv("data/obserBear.csv", row.names = "X") 

$$$$

Standardizing the covariates

site_cov <- stdize(site_cov)
observers <- stdize(observers)
effort <- stdize(effort)

$$$$

Unmarked framework with covariates

To fit models with covariates, we again build the unmarked data frame with unmarkedFrameOccu(), this time supplying the site-level covariates (siteCovs) and the observation-level covariates (obsCovs).

unm_cov <- unmarkedFrameOccu(y = as.matrix(det.his),
                             obsCovs = list(effort = effort, observers = observers),
                             siteCovs = site_cov)


summary(unm_cov)
## unmarkedFrame Object
## 
## 100 sites
## Maximum number of observations per site: 10 
## Mean number of observations per site: 10 
## Sites with at least one detection: 62 
## 
## Tabulation of y observations:
##   0   1 
## 710 290 
## 
## Site-level covariates:
##   z.canopy.cov      z.grassland.    
##  Min.   :-1.1264   Min.   :-1.3107  
##  1st Qu.:-0.7924   1st Qu.:-0.8541  
##  Median :-0.2536   Median :-0.2692  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6600   3rd Qu.: 0.7558  
##  Max.   : 2.7634   Max.   : 2.4958  
## 
## Observation-level covariates:
##      effort           observers      
##  Min.   :-2.92431   Min.   :-1.4434  
##  1st Qu.:-0.63566   1st Qu.:-0.8222  
##  Median : 0.01198   Median :-0.2416  
##  Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.66591   3rd Qu.: 0.5698  
##  Max.   : 2.92976   Max.   : 3.6037

$$$$

Multicollinearity

Many statistical models cannot estimate stable parameters when two or more predictor variables are highly correlated, a problem known as multicollinearity. So, before we add multiple covariates to our models, we need to check that they are not collinear.

Various methods are available for measuring and eliminating collinearity. Here, we present a straightforward and effective approach that relies on pairwise correlation.

First, we examine the pairwise correlations between the predictors. Typically, correlations of |r| < 0.7 are considered unproblematic (or |r| < 0.5 as a more cautious threshold).

m2 <- cor(site_cov, method='spearman')
corrplot.mixed(m2, upper = "ellipse", lower = "number",
               tl.pos = "lt", tl.col = "black", tl.offset=1, tl.srt = 0)

Correlation (r): identify pairs of variables with |r| > 0.7 and remove the less important variable of each pair.
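
A quick way to flag such pairs programmatically from the correlation matrix computed above (a sketch; with only two site covariates here the table may well be empty):

# Covariate pairs whose |Spearman r| exceeds the 0.7 threshold
high_cor <- which(abs(m2) > 0.7 & upper.tri(m2), arr.ind = TRUE)
data.frame(var1 = rownames(m2)[high_cor[, 1]],
           var2 = colnames(m2)[high_cor[, 2]],
           r    = m2[high_cor])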

$$$$

The models

Prepare global model

glob.mod <- occu(formula = ~ effort + observers # detection formula first
                           ~ z.canopy.cov + z.grassland., # occupancy formula second
                 data = unm_cov)

summary(glob.mod)
## 
## Call:
## occu(formula = ~effort + observers ~ z.canopy.cov + z.grassland., 
##     data = unm_cov)
## 
## Occupancy (logit-scale):
##              Estimate    SE      z P(>|z|)
## (Intercept)     0.503 0.208  2.416  0.0157
## z.canopy.cov    0.177 0.251  0.707  0.4794
## z.grassland.   -0.145 0.241 -0.602  0.5474
## 
## Detection (logit-scale):
##             Estimate     SE      z  P(>|z|)
## (Intercept)  -0.1807 0.0971 -1.860 6.29e-02
## effort        1.1920 0.1151 10.355 3.97e-25
## observers     0.0927 0.1121  0.827 4.08e-01
## 
## AIC: 839.7471 
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 32 
## Bootstrap iterations: 0
Global model: model with all the covariates used

AIC

How do we decide which model best explains the observed data? The Akaike information criterion (AIC), reported in the output of unmarked models, is derived from the model deviance and summarizes how well the model fits the data.

The value is computed from the maximized log-likelihood L, with a penalty for the number of estimated parameters:

AIC = -2L + 2K,

where L is the log-likelihood and K is the total number of estimated parameters (the intercepts and regression coefficients of the detection and occupancy submodels).


AIC therefore balances fit against model complexity; lower values are preferred. AIC has no absolute interpretation and cannot be compared across different data sets, but it offers a sound way to evaluate and contrast candidate models fitted to the same data.

The AIC can be extracted directly from the model objects.

glob.mod@AIC
## [1] 839.7471
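
For a quick side-by-side AIC comparison within unmarked itself, the fitList() and modSel() functions can be used; a minimal sketch comparing the null and global models:

# AIC comparison of the null and global models within unmarked
fl <- fitList(fits = list(null = null.mod, global = glob.mod))
modSel(fl)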

$$$$

Plot the influence of the covariates on occupancy

The easiest way to understand the results is to plot them. We will start with a plot of how predicted occupancy varies with canopy cover, followed by a plot against grassland percentage.

predict_canopy.cov <- cbind(
  predict(glob.mod,
          newdata = data.frame(z.canopy.cov = seq(min(site_cov$z.canopy.cov, na.rm = TRUE),
                                                  max(site_cov$z.canopy.cov, na.rm = TRUE),
                                                  by = 0.01),
                               z.grassland. = mean(site_cov$z.grassland.)),
          type = "state"),
  data.frame(z.canopy.cov = seq(min(site_cov$z.canopy.cov, na.rm = TRUE),
                                max(site_cov$z.canopy.cov, na.rm = TRUE),
                                by = 0.01),
             z.grassland. = mean(site_cov$z.grassland.)))


ggplot(data = predict_canopy.cov, aes(x = z.canopy.cov, y = Predicted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue") +
  stat_smooth(method = "loess", col = "black", se = FALSE) +
  labs(x = "Canopy cover (standardized)", y = "Predicted Occupancy") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

predict_grassland. <- cbind(
  predict(glob.mod,
          newdata = data.frame(z.grassland. = seq(min(site_cov$z.grassland., na.rm = TRUE),
                                                  max(site_cov$z.grassland., na.rm = TRUE),
                                                  by = 0.01),
                               z.canopy.cov = mean(site_cov$z.canopy.cov)),
          type = "state"),
  data.frame(z.grassland. = seq(min(site_cov$z.grassland., na.rm = TRUE),
                                max(site_cov$z.grassland., na.rm = TRUE),
                                by = 0.01),
             z.canopy.cov = mean(site_cov$z.canopy.cov)))


ggplot(data = predict_grassland., aes(x = z.grassland., y = Predicted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "pink") +
  stat_smooth(method = "loess", col = "black", se = FALSE) +
  labs(x = "Grassland % (standardized)", y = "Predicted Occupancy") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

$$$$

Model comparison

The MuMIn package provides the dredge() function, which performs AIC-based model selection across subsets of the global model, giving a comprehensive set of candidate models for comparison. We will systematically compare the predictor variables to gauge their influence on occupancy and detection; the null (intercept-only) model is included in the comparison as well.

mod.sel<-dredge(glob.mod)
## Fixed terms are "p(Int)" and "psi(Int)"
mod.sel[1:3,]
## Global model call: occu(formula = ~effort + observers ~ z.canopy.cov + z.grassland., 
##     data = unm_cov)
## ---
## Model selection table 
##    p(Int) psi(Int) p(eff)  p(obs) psi(z.cnp.cov) df   logLik  AICc delta weight
## 2 -0.2041   0.4989  1.210                         3 -414.516 835.3  0.00  0.539
## 4 -0.1808   0.4992  1.192 0.09278                 4 -414.175 836.8  1.49  0.256
## 6 -0.2038   0.4998  1.210                  0.103  4 -414.396 837.2  1.93  0.205
## Models ranked by AICc(x)
AIC (Akaike information criterion): conveys how closely the model fits the data; lower values of AIC are preferable.

ΔAIC (delta AIC): the difference in AIC between each model and the best (lowest-AIC) model. ΔAIC between 0 and 2 indicates substantial support for a candidate model (it is essentially as good as the best model), while values of 4-7 indicate considerably less support compared to the best model.

AICwt (Akaike weight): relative likelihood of a model compared to all other models in the candidate model set
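
If only the well-supported models (e.g., ΔAICc < 2) are to be carried forward from the dredge table, MuMIn's get.models() can extract them; a sketch:

# Keep the models within 2 AICc units of the best model
top_models <- get.models(mod.sel, subset = delta < 2)
length(top_models)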

$$$$

The best model

occu.mb <- occu(formula = ~1~z.canopy.cov+z.grassland.,  data = unm_cov)

summary(occu.mb)
## 
## Call:
## occu(formula = ~1 ~ z.canopy.cov + z.grassland., data = unm_cov)
## 
## Occupancy (logit-scale):
##              Estimate    SE      z P(>|z|)
## (Intercept)     0.498 0.208  2.401  0.0164
## z.canopy.cov    0.181 0.251  0.723  0.4694
## z.grassland.   -0.143 0.241 -0.592  0.5538
## 
## Detection (logit-scale):
##  Estimate     SE     z P(>|z|)
##    -0.133 0.0811 -1.64   0.102
## 
## AIC: 996.888 
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 29 
## Bootstrap iterations: 0
backTransform(occu.mb, type = "det")
## Backtransformed linear combination(s) of Detection estimate(s)
## 
##  Estimate     SE LinComb (Intercept)
##     0.467 0.0202  -0.133           1
## 
## Transformation: logistic

$$$$

Predicting the best model

occu.psiPr <- predict(occu.mb, type = "state", appendData = TRUE)
head(occu.psiPr)

Predicting site-wise occupancy

$$$$

Subsequent best model

occu.psi <- occu(formula = ~effort  ~z.canopy.cov, data = unm_cov)
summary(occu.psi)
## 
## Call:
## occu(formula = ~effort ~ z.canopy.cov, data = unm_cov)
## 
## Occupancy (logit-scale):
##              Estimate    SE     z P(>|z|)
## (Intercept)     0.500 0.207 2.409   0.016
## z.canopy.cov    0.103 0.212 0.487   0.626
## 
## Detection (logit-scale):
##             Estimate     SE     z  P(>|z|)
## (Intercept)   -0.204 0.0933 -2.18 2.90e-02
## effort         1.210 0.1145 10.57 3.99e-26
## 
## AIC: 836.7915 
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 23 
## Bootstrap iterations: 0

$$$$

Predicting subsequent best model

occu.psiPr <- predict(occu.psi,type = "state", appendData = TRUE)
head(occu.psiPr) 

$$$$

Model-averaging

There is not always a single best model; sometimes several models appear equally plausible. In such situations, when no model stands out and the objective is to predict occupancy probability at each site while accounting for the uncertainty across models, predictions can be computed by model-averaging.

$$$$

Make model list

occ_list <- list(glob.mod, occu.psi, null.mod, occu.mb)
occ_list

The function modavgPred() automatically assigns weights to the predictions based on the relative level of support for each model.

occu_modavg <- modavgPred(occ_list, parm.type = "psi",
                          newdata = unm_cov@siteCovs)[c("mod.avg.pred", "lower.CL", "upper.CL")]
## Warning in modavgPred.AICunmarkedFitOccu(occ_list, parm.type = "psi", newdata = unm_cov@siteCovs): 
## Model names have been supplied automatically in the table
occu_modavg_df <- data.frame(Predicted = occu_modavg$mod.avg.pred,
                                    lower = occu_modavg$lower.CL,
                                    upper = occu_modavg$upper.CL,
                                    site_cov)
head(occu_modavg_df)
##   Predicted     lower     upper z.canopy.cov z.grassland.
## 1 0.6217598 0.5024774 0.7281198   -0.2731279   -1.2893142
## 2 0.6387277 0.5087606 0.7514366    0.3813072   -1.2145536
## 3 0.6940040 0.3880447 0.8908250    2.6335655   -0.9290180
## 4 0.6862851 0.4101824 0.8736160    2.3288233   -0.8411718
## 5 0.6959380 0.3827938 0.8945975    2.7634368   -0.6466903
## 6 0.6827117 0.4205370 0.8648763    2.2207047   -0.6247310
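
The model-averaged predictions can be plotted in the same style as the single-model predictions above, for example against standardized canopy cover; a sketch:

ggplot(occu_modavg_df, aes(x = z.canopy.cov, y = Predicted)) +
  geom_pointrange(aes(ymin = lower, ymax = upper), colour = "grey40") +
  labs(x = "Canopy cover (standardized)", y = "Model-averaged occupancy") +
  theme_classic()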

$$$$

Assessing modelling fit

Goodness-of-fit tests are important for detecting and accounting for overdispersion in model selection. MacKenzie and Bailey (2004) used the Pearson chi-square statistic to assess goodness of fit and overdispersion in single-species, single-season occupancy models. The approach simulates sets of detection histories under the assumption that the fitted detection model is adequate, and then compares the distribution of the statistic from the simulated data with the value from the actual data.

The procedure provides the following:

  • The variance inflation factor, ĉ ("c-hat").

  • A chi-square hypothesis test of whether the detection data show more extra-binomial noise (overdispersion) than expected under the fitted model.

  • The sites that contribute most to the chi-square statistic, which helps identify where the data violate assumptions (e.g., unmodelled variability in detection probability).

Goodness-of-fit test

occu.gof<- mb.gof.test(occu.psi, nsim = 50)

head(occu.gof)
Goodness-of-fit test: how accurately does the fitted model represent the data

In this example, the detection model appears adequate for the data: the chi-square test indicates that the variance inflation factor is not significantly different from 1.0.
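
Had the test suggested overdispersion (ĉ noticeably above 1), the estimated ĉ could be carried into model selection, for example via QAICc in AICcmodavg; a hedged sketch (this assumes the estimate is stored in the c.hat.est element returned by mb.gof.test):

# Estimated overdispersion (c-hat) from the MacKenzie-Bailey test
c_hat <- occu.gof$c.hat.est
c_hat

# If c-hat exceeded 1, rank the candidate models with QAICc instead of AICc
aictab(cand.set = occ_list, c.hat = max(1, c_hat))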

Plotting the predictions

At this point, we have an estimate of the probability that the target species occupies each of the study sites. Mapping these predictions, using spatial packages in R or the QGIS application together with the grid IDs, is particularly helpful for interpreting them visually.
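
One simple route is to export the model-averaged predictions to a CSV keyed by cell ID and join it to the 2-km grid layer in QGIS; a sketch (the output path is illustrative, and the cell IDs are assumed to be carried in the row names of the prediction data frame):

# Attach the cell IDs (assumed to be the row names) and write out for joining in QGIS
pred_out <- data.frame(gridID = rownames(occu_modavg_df), occu_modavg_df)
write.csv(pred_out, "output/bear_occupancy_predictions.csv", row.names = FALSE)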

References

  1. MacKenzie, D. I., Nichols, J. D., Royle, J. A., Pollock, K. H., Bailey, L., & Hines, J. E. (2017). Occupancy estimation and modeling: inferring patterns and dynamics of species occurrence. Elsevier.

  2. unmarked Package Vignette https://cran.r-project.org/web/packages/unmarked/vignettes/unmarked.pdf

  3. AICcmodavg Package documentation https://cran.r-project.org/web/packages/AICcmodavg/index.html

  4. Morin et al. 2020. Is your ad hoc model selection strategy affecting your multimodel inference? Ecosphere 11(1): e02997 https://doi.org/10.1002/ecs2.2997

  5. Bartoń, Kamil. 2022. MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.

  6. Burnham, Kenneth P., and David R. Anderson. 2002. Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. 2nd ed. Springer.

  7. Dormann, Carsten F., Justin M. Calabrese, Gurutzeta Guillera-Arroita, Eleni Matechou, Volker Bahn, Kamil Bartoń, Colin M. Beale, et al. 2018. “Model Averaging in Ecology: A Review of Bayesian, Information-Theoretic, and Tactical Approaches for Predictive Inference.” Ecological Monographs 88 (4): 485–504. https://doi.org/10.1002/ecm.1309.

  8. Dormann, Carsten F., Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carré, Jaime R. García Marquéz, et al. 2013. “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” Ecography 36 (1): 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x.

  9. Fiske, Ian, and Richard Chandler. 2011. “unmarked: An R Package for Fitting Hierarchical Models of Wildlife Occurrence and Abundance.” Journal of Statistical Software 43 (10): 1–23. https://www.jstatsoft.org/v43/i10/.

  10. https://jamesepaterson.weebly.com/blog/creating-capture-histories-in-r

  11. Guillera-Arroita, Gurutzeta. 2017. “Modelling of Species Distributions, Range Dynamics and Communities Under Imperfect Detection: Advances, Challenges and Opportunities.” Ecography 40 (2): 281–95. https://doi.org/10.1111/ecog.02445.

  12. Guisan, Antoine, Wilfried Thuiller, and Niklaus E. Zimmermann. 2017. Habitat Suitability and Distribution Models with Applications in R. Cambridge University Press.

  13. Hosmer, David W., and Stanley Lemeshow. 2013. Applied Logistic Regression. 3rd ed. John Wiley & Sons, Inc.

  14. Kéry, Marc, and J. Andrew Royle. 2015. Applied Hierarchical Modeling in Ecology: Analysis of Distribution, Abundance and Species Richness in R and BUGS. Elsevier.

  15. Mazerolle, Marc J. 2023. AICcmodavg: Model Selection and Multimodel Inference Based on (q)AIC(c). https://cran.r-project.org/package=AICcmodavg.

  16. Nagelkerke, N. J. D. 1991. “A Note on a General Definition of the Coefficient of Determination.” Biometrika 78: 691–92.

  17. https://bcss.org.my/tut/bayes-with-jags-a-tutorial-for-wildlife-researchers/occupancy-modelling/occupancy-with-site-covariates/

  18. https://damariszurell.github.io/EEC-QCB/Occ4_SimpleOccModel.html