Single Season Occupancy Model: Exercise in R ‘Unmarked’
Compiled by: Subhasish Arandhara & Samrat Sengupta
Occupancy models estimate where species occur (a biological process) while accounting for imperfect detection (an observation process).
Background: Detection of most mobile species is imperfect, and even indirect signs such as tracks, scats, or nests can be missed. Occupancy models account for how often the species goes undetected, helping us distinguish sites where the species is truly absent from sites where it is present but was not detected.
A basic occupancy model estimates the following quantities:
Naive occupancy: the number of sites/cells where at least one presence was recorded during the repeat visits, divided by the total number of sites surveyed.
Detection probability (p): the probability of detecting the species on a given survey, given that it is present at the site.
Occupancy (psi or Ψ): the probability that the species occurs at a site.
Assumptions
True detections: There are no false detections (where an observer believed that they saw the species in question when in reality it was a different species).
Population closure: The area being studied is closed to changes in species occupancy between sampling times.
Survey independence: The surveys and the units are independent (i.e. whether or not a species was previously detected at a site has no impact on the current survey detection at that site).
Constant detection and occupancy probability: The probability of detection and of occupancy is constant across sites. This assumption can be relaxed provided the heterogeneity in occupancy probability across sites is described and modelled with the appropriate covariates.
Single Season Models
Involves surveys for a species at distinct sample units, with each site surveyed on a number of occasions within a defined sampling period (e.g. a winter month).
Assumes that occupancy status (presence or absence) does not change during the study. That is, a species is either always present or always absent at each site, even if the species is not always detected.
The species is never falsely detected when the species is absent from a site, i.e., there are no false positive species detections.
When the species is present at a site, the species may be detected (with probability p) or may not be detected (with probability 1 − p).
# Install any packages that are not already present, then load them all
list.of.packages <- c("ggplot2", "corrplot", "gridExtra", "dplyr", "unmarked",
                      "lubridate", "tibble", "sf", "MuMIn", "AICcmodavg")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages)) install.packages(new.packages)
lapply(list.of.packages, require, character.only = TRUE)
Users can fit models that account for the imperfect detection of unmarked individuals with the help of the unmarked R package, which was developed by Ian Fiske and Richard Chandler with contributions from Marc Kéry, Andy Royle, David Miller, and others.
Our example data come from a selected portion of an Asiatic black bear survey in Pakke Tiger Reserve, Arunachal Pradesh. Surveys were conducted during the winter season in 511 cells, each a 2-km square quadrat, across the reserve. We will use 100 of these cells as sample data for this single-season model tutorial.
y: A data.frame of presence-absence records. Rows are sites, columns are repeat visits.

## event.1 event.2 event.3 event.4 event.5 event.6 event.7 event.8 event.9
## 1 0 0 0 0 0 0 0 0 0
## 2 1 0 1 0 0 1 1 1 0
## 3 1 0 0 0 1 1 0 1 0
## 4 1 1 1 1 0 1 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0
## event.10
## 1 0
## 2 1
## 3 0
## 4 1
## 5 0
## 6 0
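The chunk that reads these records is not shown in this excerpt. A minimal sketch, assuming the detection history is stored in a CSV file (the file name data/detBear.csv is hypothetical, following the pattern of the covariate files used later):

# Read the detection history: one row per site, one column per survey occasion
det.his <- read.csv("data/detBear.csv", row.names = "X")  # hypothetical file name
head(det.his)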
The unmarked package organises detection data in its own style of data frame. To begin, we need the detection records indicating which surveys detected the species; the sample y detection data above will be used.
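The chunk that builds this object is not shown in this excerpt. A minimal sketch consistent with the summary below and with the model call that follows (which refers to the object as unm_det.his):

# Detection-only unmarked frame for a single-season occupancy model
unm_det.his <- unmarkedFrameOccu(y = as.matrix(det.his))
summary(unm_det.his)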
## unmarkedFrame Object
##
## 100 sites
## Maximum number of observations per site: 10
## Mean number of observations per site: 10
## Sites with at least one detection: 62
##
## Tabulation of y observations:
## 0 1
## 710 290
To begin fitting the occupancy model, let’s create a “null” model that does not include any detection or occupancy predictors. First ensure that the object is properly “framed” for unmarked modelling, then run the occu() function. unmarked requires a double right-hand-side formula: the first part models detection and the second part models occupancy.
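The fitting chunk is not shown; a minimal sketch consistent with the output below (the object name null.mod is our choice):

# Null model: constant detection (~1) and constant occupancy (~1)
null.mod <- occu(formula = ~1 ~ 1, data = unm_det.his)
summary(null.mod)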
##
## Call:
## occu(formula = ~1 ~ 1, data = unm_det.his)
##
## Occupancy (logit-scale):
## Estimate SE z P(>|z|)
## 0.494 0.207 2.39 0.0167
##
## Detection (logit-scale):
## Estimate SE z P(>|z|)
## -0.133 0.0811 -1.64 0.102
##
## AIC: 993.5046
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 18
## Bootstrap iterations: 0
The null model returns one estimate for detection and one for occupancy.
Both the occupancy and detection probability estimates are on the logit (log-odds) scale. The probabilities can be obtained with the backTransform() function.
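Assuming the null model above was stored as null.mod (our name), the detection estimate is back-transformed with:

# Back-transform the detection intercept to the probability scale
backTransform(null.mod, type = "det")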
## Backtransformed linear combination(s) of Detection estimate(s)
##
## Estimate SE LinComb (Intercept)
## 0.467 0.0202 -0.133 1
##
## Transformation: logistic
| Detection probability p: | probability that the species will be detected given that it already occurs at a site |
For the probability of detection we use “det” as the transformation type: the probability of spotting a bear on a given survey occasion (10 occasions here), assuming it is present and available for detection, is about 0.47 (SE = 0.02).
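Similarly for occupancy (again assuming the fitted null model is named null.mod):

# Back-transform the occupancy intercept to the probability scale
backTransform(null.mod, type = "state")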
## Backtransformed linear combination(s) of Occupancy estimate(s)
##
## Estimate SE LinComb (Intercept)
## 0.621 0.0486 0.494 1
##
## Transformation: logistic
| Occupancy psi(Ψ) | probability of a species occurring within a spatial unit (or “site”) during the sampling session |
For occupancy the transformation type is “state”: there is roughly a 0.62 (SE = 0.05), or 62%, probability that a bear occupies a survey site.
siteCovs: A data.frame of the site-level covariates. These are things that don’t change between visits, like elevation, annual rainfall, or distance from roads. One column per covariate.
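The chunk reading the site covariates is not shown in this excerpt. A minimal sketch, assuming they are stored in a CSV (the file name data/siteCovBear.csv is hypothetical; the object name site_cov is taken from the later code):

# Site-level covariates: one row per site, one column per covariate
site_cov <- read.csv("data/siteCovBear.csv", row.names = "X")  # hypothetical file name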
obsCovs: A list of data.frames for the observation-level covariates. Each covariate is its own data.frame with rows as sites and columns as repeat visits. These are things that can change between visits: environmental conditions such as daily temperature or cloud cover, or methodological variables such as survey method (spotlighting, pitfall traps, etc.) or an observer ID.

# Observation-level covariates: survey effort and observer identity
effort <- read.csv("data/effortBear.csv", row.names = "X")
observers <- read.csv("data/obserBear.csv", row.names = "X")
To fit models with covariates, we again build the unmarked data frame with unmarkedFrameOccu(), this time supplying the site-level and observation-level covariates.
# Unmarked frame with the detection history plus site- and observation-level covariates
unm_cov <- unmarkedFrameOccu(y = as.matrix(det.his),
                             obsCovs = list(effort = effort, observers = observers),
                             siteCovs = site_cov)
summary(unm_cov)

## unmarkedFrame Object
##
## 100 sites
## Maximum number of observations per site: 10
## Mean number of observations per site: 10
## Sites with at least one detection: 62
##
## Tabulation of y observations:
## 0 1
## 710 290
##
## Site-level covariates:
## z.canopy.cov z.grassland.
## Min. :-1.1264 Min. :-1.3107
## 1st Qu.:-0.7924 1st Qu.:-0.8541
## Median :-0.2536 Median :-0.2692
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6600 3rd Qu.: 0.7558
## Max. : 2.7634 Max. : 2.4958
##
## Observation-level covariates:
## effort observers
## Min. :-2.92431 Min. :-1.4434
## 1st Qu.:-0.63566 1st Qu.:-0.8222
## Median : 0.01198 Median :-0.2416
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.66591 3rd Qu.: 0.5698
## Max. : 2.92976 Max. : 3.6037
Many statistical models cannot estimate stable parameters when two or more predictor variables are highly correlated, a problem known as multicollinearity. Before adding multiple covariates to our models, we therefore check that they are not collinear.
Various methods are available for measuring and dealing with collinearity. Here we use a straightforward and effective approach based on pairwise correlation.
First, we examine the pairwise correlations between the predictors. Typically, correlations with |r| < 0.7 are considered unproblematic (|r| < 0.5 is a more cautious threshold).
# Spearman correlation matrix of the site covariates
m2 <- cor(site_cov, method = 'spearman')
corrplot.mixed(m2, upper = "ellipse", lower = "number",
               tl.pos = "lt", tl.col = "black", tl.offset = 1, tl.srt = 0)

| Correlation (r): | identify pairs of variables with correlation |r|>0.7 and remove the less important variable of each pair. |
# Global model: all detection and occupancy covariates
glob.mod <- occu(formula = ~ effort + observers ~ z.canopy.cov + z.grassland.,
                 data = unm_cov)
summary(glob.mod)

##
## Call:
## occu(formula = ~effort + observers ~ z.canopy.cov + z.grassland.,
## data = unm_cov)
##
## Occupancy (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) 0.503 0.208 2.416 0.0157
## z.canopy.cov 0.177 0.251 0.707 0.4794
## z.grassland. -0.145 0.241 -0.602 0.5474
##
## Detection (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) -0.1807 0.0971 -1.860 6.29e-02
## effort 1.1920 0.1151 10.355 3.97e-25
## observers 0.0927 0.1121 0.827 4.08e-01
##
## AIC: 839.7471
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 32
## Bootstrap iterations: 0
| Global model: | model with all the covariates used |
AIC
How can one determine which model provides the most accurate
explanation for the observed data? The AIC, or Akaike information
criterion, is a valuable statistic that is included in the output of
unmarked models. It is derived from the deviance and provides
information about the degree of fit between the model and the
data.
The value is computed from the log-likelihood L and includes a penalty for the number of parameters in the model. The formula for AIC is

AIC = -2L + 2(p + 1),

where L is the log-likelihood and p is the number of estimated regression coefficients.
Therefore, AIC considers the level of complexity in the model.
Typically, it is more desirable to have lower values of AIC. However, it
is important to note that AIC cannot be interpreted in a definitive
manner and cannot be directly compared across different sets of data.
Nevertheless, it can offer a significant method to evaluate and contrast
various candidate models for a given data set.
The AIC can be extracted directly from the model objects.
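The extraction chunk is not shown; one way, using the AIC slot of the fitted global model object:

# AIC stored in the fitted unmarkedFit object
glob.mod@AIC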
## [1] 839.7471
The best way to understand the results is to plot them. We begin with a plot of how occupancy varies with canopy cover at the sites, followed by one for grassland percentage.
# Predicted occupancy across the observed range of canopy cover, holding grassland at its mean
predict_canopy.cov <- cbind(
  predict(glob.mod,
          newdata = data.frame(z.canopy.cov = seq(min(site_cov$z.canopy.cov, na.rm = TRUE),
                                                  max(site_cov$z.canopy.cov, na.rm = TRUE), by = 0.01),
                               z.grassland. = mean(site_cov$z.grassland.)),
          type = "state"),
  data.frame(z.canopy.cov = seq(min(site_cov$z.canopy.cov, na.rm = TRUE),
                                max(site_cov$z.canopy.cov, na.rm = TRUE), by = 0.01),
             z.grassland. = mean(site_cov$z.grassland.)))

ggplot(data = predict_canopy.cov, aes(x = z.canopy.cov, y = Predicted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "skyblue") +
  stat_smooth(method = "loess", col = "black", se = FALSE) +
  labs(x = "Canopy cover (standardized)", y = "Predicted Occupancy") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

# Predicted occupancy across the observed range of grassland percentage, holding canopy at its mean
predict_grassland. <- cbind(
  predict(glob.mod,
          newdata = data.frame(z.grassland. = seq(min(site_cov$z.grassland., na.rm = TRUE),
                                                  max(site_cov$z.grassland., na.rm = TRUE), by = 0.01),
                               z.canopy.cov = mean(site_cov$z.canopy.cov)),
          type = "state"),
  data.frame(z.grassland. = seq(min(site_cov$z.grassland., na.rm = TRUE),
                                max(site_cov$z.grassland., na.rm = TRUE), by = 0.01),
             z.canopy.cov = mean(site_cov$z.canopy.cov)))

ggplot(data = predict_grassland., aes(x = z.grassland., y = Predicted)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "pink") +
  stat_smooth(method = "loess", col = "black", se = FALSE) +
  labs(x = "Grassland % (standardized)", y = "Predicted Occupancy") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'
The MuMIn package provides the dredge() function, which performs AIC-based model selection by fitting and comparing all subsets of the global model. We use it to systematically compare the candidate predictor variables and assess their influence on occupancy and detection; the null model is part of this comparison.
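The dredge call itself is not shown in this excerpt; a minimal sketch (the object name occu_dredge is ours, and the printed table may be truncated to the top-ranked models):

# All-subsets model selection over the global model's terms, ranked by AICc
occu_dredge <- dredge(glob.mod)
occu_dredge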
## Fixed terms are "p(Int)" and "psi(Int)"
## Global model call: occu(formula = ~effort + observers ~ z.canopy.cov + z.grassland.,
## data = unm_cov)
## ---
## Model selection table
## p(Int) psi(Int) p(eff) p(obs) psi(z.cnp.cov) df logLik AICc delta weight
## 2 -0.2041 0.4989 1.210 3 -414.516 835.3 0.00 0.539
## 4 -0.1808 0.4992 1.192 0.09278 4 -414.175 836.8 1.49 0.256
## 6 -0.2038 0.4998 1.210 0.103 4 -414.396 837.2 1.93 0.205
## Models ranked by AICc(x)
| AIC (Akaike information criterion): | conveys information about how closely the model fits the data, lower values of AIC are preferable. |
| ΔAIC (delta AIC): | the difference in AIC between each model and the best model; ΔAIC between 0 and 2 indicates substantial support for a candidate model (it is essentially as good as the best model), while values of about 4-7 indicate considerably less support relative to the best model |
| AICwt (Akaike weight): | relative likelihood of a model compared to all other models in the candidate model set |
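The chunk producing the next output is not shown; it corresponds to a model with covariates on occupancy only and constant detection (the object name psi.mod is ours):

# Occupancy modelled with site covariates, detection held constant
psi.mod <- occu(formula = ~1 ~ z.canopy.cov + z.grassland., data = unm_cov)
summary(psi.mod)
backTransform(psi.mod, type = "det")   # detection probability on the probability scale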
##
## Call:
## occu(formula = ~1 ~ z.canopy.cov + z.grassland., data = unm_cov)
##
## Occupancy (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) 0.498 0.208 2.401 0.0164
## z.canopy.cov 0.181 0.251 0.723 0.4694
## z.grassland. -0.143 0.241 -0.592 0.5538
##
## Detection (logit-scale):
## Estimate SE z P(>|z|)
## -0.133 0.0811 -1.64 0.102
##
## AIC: 996.888
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 29
## Bootstrap iterations: 0
## Backtransformed linear combination(s) of Detection estimate(s)
##
## Estimate SE LinComb (Intercept)
## 0.467 0.0202 -0.133 1
##
## Transformation: logistic
Predicting Site-wise Occupancy
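The output below comes from the model later used for prediction (referred to as occu.mb in the empirical Bayes step); a sketch of the call consistent with that output:

# Detection modelled with survey effort, occupancy with canopy cover
occu.mb <- occu(formula = ~ effort ~ z.canopy.cov, data = unm_cov)
summary(occu.mb)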
##
## Call:
## occu(formula = ~effort ~ z.canopy.cov, data = unm_cov)
##
## Occupancy (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) 0.500 0.207 2.409 0.016
## z.canopy.cov 0.103 0.212 0.487 0.626
##
## Detection (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) -0.204 0.0933 -2.18 2.90e-02
## effort 1.210 0.1145 10.57 3.99e-26
##
## AIC: 836.7915
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 23
## Bootstrap iterations: 0
Imperfect detection can make naive occupancy estimates misleading. To account for it, we estimate the true proportion of area occupied (PAO) by our target species using empirical Bayes methods to obtain the posterior distributions of the latent occupancy states.
AICbest <- occu.mb                    # best-supported model
re <- ranef(AICbest)                  # empirical Bayes estimates of the latent occupancy states
EBUP <- bup(re, stat = "mean")        # posterior mean occupancy state for each site
PAO <- c(Estimate = sum(EBUP) / 100)  # proportion of the 100 sites estimated to be occupied
PAO

## Estimate
## 0.6211539
We obtain a proportion of area occupied of about 62%. The difference between this estimate and the naive estimate can be substantially larger depending on how difficult it is to detect a particular species with a particular survey method.
If we make the simplistic assumption of perfect detection, the proportion of area occupied is simply the number of sites where the species was observed divided by the total number of sites surveyed (the naive occupancy).
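The chunk is not shown; a minimal sketch using the detection history read in earlier:

# Naive occupancy: sites with at least one detection / total sites surveyed
sum(rowSums(det.his, na.rm = TRUE) > 0) / nrow(det.his)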
## [1] 0.62
There is not always a single best model; sometimes several models seem equally reasonable. When there is no definitive model and the objective is to predict the probability of occupancy at each location while accounting for the uncertainty across models, predictions can be calculated by model averaging.
The function modavgPred() automatically assigns weights to the predictions based on the relative level of support for each model.
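modavgPred() takes a list of fitted candidate models. The chunk creating occ_list is not shown in this excerpt; a sketch, with the exact membership of the list being our assumption:

# Candidate models to be averaged (membership assumed for illustration)
occ_list <- list(glob.mod, psi.mod, occu.mb)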
occu_modavg <- modavgPred(occ_list, parm.type = "psi",
                          newdata = unm_cov@siteCovs)[c("mod.avg.pred", "lower.CL", "upper.CL")]

## Warning in modavgPred.AICunmarkedFitOccu(occ_list, parm.type = "psi", newdata = unm_cov@siteCovs):
## Model names have been supplied automatically in the table

# Collect the model-averaged predictions and confidence limits for each site
occu_modavg_df <- data.frame(Predicted = occu_modavg$mod.avg.pred,
                             lower = occu_modavg$lower.CL,
                             upper = occu_modavg$upper.CL,
                             site_cov)
head(occu_modavg_df)

## Predicted lower upper z.canopy.cov z.grassland.
## 1 0.6217598 0.5024774 0.7281198 -0.2731279 -1.2893142
## 2 0.6387277 0.5087606 0.7514366 0.3813072 -1.2145536
## 3 0.6940040 0.3880447 0.8908250 2.6335655 -0.9290180
## 4 0.6862851 0.4101824 0.8736160 2.3288233 -0.8411718
## 5 0.6959380 0.3827938 0.8945975 2.7634368 -0.6466903
## 6 0.6827117 0.4205370 0.8648763 2.2207047 -0.6247310
Goodness-of-fit tests are crucial for detecting and accounting for overdispersion in model selection. MacKenzie and Bailey (2004) used the Pearson chi-square statistic to assess goodness of fit and overdispersion in single-species, single-season occupancy models. The approach simulates sets of detection histories under the fitted model and compares the statistic computed from the simulated data with that from the observed data.
The procedure provides the following:
An estimate of the variance inflation factor, ĉ ("c-hat").
A hypothesis test, based on the chi-square statistic, of whether the detection data show more extra-binomial noise than expected under the model.
The sites that contribute most to the chi-square statistic, which helps identify where the data violate assumptions (e.g. is there unaccounted-for variability in detection probability?).
| Goodness-of-fit test: | how accurately does the fitted model represent the data |
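The goodness-of-fit chunk is not shown in this excerpt. A sketch using the MacKenzie-Bailey test implemented in AICcmodavg (the choice of model and number of simulations are our assumptions):

# Parametric bootstrap goodness-of-fit test (MacKenzie & Bailey 2004);
# use a larger nsim (e.g. 1000) for a final analysis
gof <- mb.gof.test(occu.mb, nsim = 100)
gof   # reports the chi-square statistic, bootstrap p-value, and c-hat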
In this example the detection model appears adequate for the data: the chi-square test is not significant and the estimated variance inflation factor does not differ meaningfully from 1.0.
At this point we have an estimate of the probability that the target species occupies each study site. Mapping these predictions with spatial packages or in QGIS, using the grid IDs, is particularly useful for visual interpretation.
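One possible sketch of exporting the predictions for mapping, assuming a grid shapefile whose rows are in the same order as the modelled sites (the file name data/grid2km.shp and this workflow are hypothetical):

# Attach model-averaged occupancy to the survey grid and export for QGIS
grid <- st_read("data/grid2km.shp")        # hypothetical grid of surveyed cells
grid$psi_hat <- occu_modavg_df$Predicted   # assumes identical row order to the modelled sites
st_write(grid, "bear_occupancy_grid.gpkg")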
References
MacKenzie, D. I., Nichols, J. D., Royle, J. A., Pollock, K. H., Bailey, L., & Hines, J. E. (2017). Occupancy estimation and modeling: inferring patterns and dynamics of species occurrence. Elsevier.
unmarked Package Vignette https://cran.r-project.org/web/packages/unmarked/vignettes/unmarked.pdf
AICcmodavg Package documentation https://cran.r-project.org/web/packages/AICcmodavg/index.html
Morin et al. 2020. Is your ad hoc model selection strategy affecting your multimodel inference? Ecosphere 11(1): e02997 https://doi.org/10.1002/ecs2.2997
Bartoń, Kamil. 2022. MuMIn: Multi-Model Inference. https://CRAN.R-project.org/package=MuMIn.
Burnham, Kenneth P., and David R. Anderson. 2002. Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. 2nd ed. Springer.
Dormann, Carsten F., Justin M. Calabrese, Gurutzeta Guillera-Arroita, Eleni Matechou, Volker Bahn, Kamil Bartoń, Colin M. Beale, et al. 2018. “Model Averaging in Ecology: A Review of Bayesian, Information-Theoretic, and Tactical Approaches for Predictive Inference.” Ecological Monographs 88 (4): 485–504. https://doi.org/10.1002/ecm.1309.
Dormann, Carsten F., Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carré, Jaime R. García Marquéz, et al. 2013. “Collinearity: A Review of Methods to Deal with It and a Simulation Study Evaluating Their Performance.” Ecography 36 (1): 27–46. https://doi.org/10.1111/j.1600-0587.2012.07348.x.
Fiske, Ian, and Richard Chandler. 2011. “unmarked: An R Package for Fitting Hierarchical Models of Wildlife Occurrence and Abundance.” Journal of Statistical Software 43 (10): 1–23. https://www.jstatsoft.org/v43/i10/.
https://jamesepaterson.weebly.com/blog/creating-capture-histories-in-r
Guillera-Arroita, Gurutzeta. 2017. “Modelling of Species Distributions, Range Dynamics and Communities Under Imperfect Detection: Advances, Challenges and Opportunities.” Ecography 40 (2): 281–95. https://doi.org/10.1111/ecog.02445.
Guisan, Antoine, Wilfried Thuiller, and Niklaus E. Zimmermann. 2017. Habitat Suitability and Distribution Models with Applications in R. Cambridge University Press.
Hosmer, David W., and Stanley Lemeshow. 2013. Applied Logistic Regression. 3rd ed. John Wiley & Sons, Inc.
Kéry, Marc, and J. Andrew Royle. 2015. Applied Hierarchical Modeling in Ecology: Analysis of Distribution, Abundance and Species Richness in R and BUGS. Elsevier.
Mazerolle, Marc J. 2023. AICcmodavg: Model Selection and Multimodel Inference Based on (q)AIC(c). https://cran.r-project.org/package=AICcmodavg.
Nagelkerke, N. J. D. 1991. “A Note on a General Definition of the Coefficient of Determination.” Biometrika 78: 691–92.
https://damariszurell.github.io/EEC-QCB/Occ4_SimpleOccModel.html