Occupancy Analysis (single species and season)

Author

Antonio Uzal

Published

October 1, 2024

Occupancy Analysis

Firstly, install required packages, if they are not already in your computer, and load them:

#Load Packages
list.of.packages <- c(
                      "unmarked",       # occupancy modeling packate
                      "tidyr",         # A package for data manipulation
                      "dplyr",         # A package for data manipulation
                      "vegan",         # tools for descriptive parameters in ecology
                      "AICcmodavg",    # package to implement data averaging
                      "readr",          # package to read tables
                      "ggplot2",       # visusalisations
                      "gridExtra",     # multiple grid based
                      "kableExtra",     # HTML tables
                      "knitr")         # dynamic reporting (nice tables) 
                              

# Check you have them in your library
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
# install any missing packages and load them
if(length(new.packages)) install.packages(new.packages,repos = "http://cran.us.r-project.org")
lapply(list.of.packages, require, character.only = TRUE)

Step 1: Read detection history & covariates

We continue using the example described here. We already have 2 dataframes:

  • sheep_detection contains detection history and detection/site covariates. This has already been filtered only to include summer deployments and saved as sheep_detection.csv.

  • effort_df contains the effort covariate (detection covariate); saved as effort_data.csv.

sheep_detection <- read.csv("sheep_detection.csv",
                 header = TRUE, 
                 sep = ",", 
                 stringsAsFactors = FALSE) 

effort_df <- read.csv("effort_data.csv",
                 header = TRUE, 
                 sep = ",", 
                 stringsAsFactors = FALSE)  

# Match effort_df to sheep_detection based on deployment_id, to make sure detections, covariates and effort are in the same order and match
effort_aligned <- effort_df[match(sheep_detection$deployment_id, effort_df$deployment_id), ]

# Create data frame with detection histories columns 7 -> 17 (SO1-SO11)
y <- sheep_detection[,7:17]

# Create data frame with site covariates, which are found in column 4-6
siteCovs <- sheep_detection[,4:6]

# elevation and ruggedness need to be standardised
siteCovs$elevation <- scale(siteCovs$elevation)
siteCovs$ruggedness <- scale(siteCovs$ruggedness)

# Read in & create data frame with effort, which is found in column 2-6

effort <-effort_aligned[,2:12]

# the dimensions (number of rows and columns) of effort and detections should be the same
dim(y)
[1] 70 11
dim(effort)
[1] 70 11
Note

We standardise variables for a variety of reasons. For me the most important are:

  • Easier comparison

    Standardization makes it easier to compare the relative magnitudes of different regression coefficients.

  • Ensures equal contribution

    Standardisation ensures that all variables contribute equally to a scale when items are added together. Standardization is often used as a preprocessing step in statistical analyses to ensure that all variables have the same scale and importance in the model.

  • Reduces multicollinearity

    Standardizing variables can help reduce multicollinearity, which can cause problems like hiding statistically significant terms. 

The scale() function in R standardises the data by applying the formula:

Where:

  • X: The original values in siteCovs$elevation.

  • μ: The mean of siteCovs$elevation.

  • σ: The standard deviation of siteCovs$elevation.

The result is a vector where the mean is0 and the standard deviation is 1.

Step 2: create & visualise unmarked data frame

Join all dataframes in an single unmarked dataframe

# create single unmarked data frame with detection histories, site covs and sample covs #

sheep <- unmarkedFrameOccu(y = y, siteCovs = siteCovs,obsCovs = list(effort=effort))

# get a breakdown of no. of sites, no. of detection/non-detections,
# site specific details etc.

summary(sheep)
unmarkedFrame Object

70 sites
Maximum number of observations per site: 11 
Mean number of observations per site: 3.83 
Sites with at least one detection: 15 

Tabulation of y observations:
   0    1 <NA> 
 244   24  502 

Site-level covariates:
     feature_type     elevation.V1        ruggedness.V1    
 Road dirt :45    Min.   :-1.5823629   Min.   :-1.9245071  
 Trail game:25    1st Qu.:-1.0802551   1st Qu.:-0.6111238  
                  Median : 0.1946443   Median : 0.0392450  
                  Mean   : 0.0000000   Mean   : 0.0000000  
                  3rd Qu.: 0.8889169   3rd Qu.: 0.6019574  
                  Max.   : 1.6327803   Max.   : 2.1095067  

Observation-level covariates:
     effort      
 Min.   : 1.000  
 1st Qu.: 5.000  
 Median :10.000  
 Mean   : 7.675  
 3rd Qu.:10.000  
 Max.   :10.000  
 NA's   :502     

Output tell us that over the 70 deployments, 28 had detections of wolves.

Plotting detections vs non detections

# plot of detection/non-detection

plot(sheep)

This outputs show us the detection history per deployment (1 = detection; 0 = no detection, blank = no data).

Calculate naive detection and occupancy

  • This does not consider detection or occupancy covariates (i.e. null model)
# run null model to get naive estimate of detection and occupancy
# i.e. one without covariates of either detection or occupancy
# note that the first ~1 is detection and the second ~1 is occupancy

ele_null <- occu(~1 ~1, sheep)

# backtransform to get estimates

#detection
p_det <- backTransform(ele_null, "det")
p_det
Backtransformed linear combination(s) of Detection estimate(s)

 Estimate     SE LinComb (Intercept)
    0.267 0.0682   -1.01           1

Transformation: logistic 
# detection probability = 0.267 (+/- SE 0.068)

# confidence intervals for detection
detCI <- confint(p_det)
detCI
     0.025     0.975
 0.1550488 0.4185526
# CI 95% 0.155 - 0.418

# occupancy
psi <- backTransform(ele_null, "state")
psi
Backtransformed linear combination(s) of Occupancy estimate(s)

 Estimate     SE LinComb (Intercept)
    0.325 0.0891  -0.731           1

Transformation: logistic 
# naive occupancy = 0.325 (+/- SE 0.089) 

psiCI <- confint(psi)
psiCI
     0.025     0.975
 0.1784229 0.5161171
# CI 95% 0.178 - 0.516
  • Detection probability = 0.267 (+/- SE 0.068), CI 95% 0.155 - 0.418.

  • Naive probability of occupancy = 0.325 (+/- SE 0.089), CI 95% 0.178 - 0.516

Step 3: Check for correlation between covariates

cor.test(siteCovs$elevation, siteCovs$ruggedness)

    Pearson's product-moment correlation

data:  siteCovs$elevation and siteCovs$ruggedness
t = 7.0296, df = 68, p-value = 1.254e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4881816 0.7668211
sample estimates:
      cor 
0.6487388 

The correlation between elevation and ruggedness is close to the threshold of 0.7. We will keep both for now, although you will find in the literature that others might decide removing one of them.

Step 4: Fitting Occupancy Model combinations

## null model:
null <- occu(~1 ~1, sheep)

# we take the variables selected in the most parismonious detection model from above (i.e. effort )
# first we use each of the site covariates
A1 <- occu(~effort~elevation, sheep)
A2 <- occu(~effort~ruggedness, sheep)

# And then we combine both site covariates
B1 <- occu(~effort~elevation+ruggedness, sheep)

# there are no other possible combinations

Step 5: Rank occupancy model combinations

# Make a list of all the models that need to be ranked

AllModels<-fitList(Null = null,A1=A1,A2=A2,B1=B1)

# Assess candidate models:

models<-modSel(AllModels, nullmod="Null")
models
     nPars    AIC delta  AICwt cumltvWt  Rsq
A2       4 140.89  0.00 0.5484     0.55 0.23
B1       5 141.76  0.87 0.3545     0.90 0.25
A1       4 144.38  3.49 0.0958     1.00 0.19
Null     2 152.99 12.10 0.0013     1.00 0.00

The most parsimonious model is A2 (i.e. lower AIC)

summary(A2) 

Call:
occu(formula = ~effort ~ ruggedness, data = sheep)

Occupancy (logit-scale):
            Estimate    SE     z P(>|z|)
(Intercept)   -0.994 0.479 -2.07 0.03801
ruggedness     1.433 0.533  2.69 0.00723

Detection (logit-scale):
            Estimate    SE     z P(>|z|)
(Intercept)   -2.569 0.973 -2.64 0.00824
effort         0.188 0.105  1.80 0.07231

AIC: 140.8918 
Number of sites: 70
optim convergence code: 0
optim iterations: 21 
Bootstrap iterations: 0 

Following (Anderson and Burnham 2002) all models with AIC < 2 have substantial level of support. In this case A2 and B1 are additional model with substantial level of support.

FinalModels<-fitList(Null = null,A2=A2, B1=B1)

final_models<-modSel(FinalModels, nullmod="Null")
final_models
     nPars    AIC delta  AICwt cumltvWt  Rsq
A2       4 140.89  0.00 0.6065     0.61 0.23
B1       5 141.76  0.87 0.3921     1.00 0.25
Null     2 152.99 12.10 0.0014     1.00 0.00

Step 6: Produce model-averaged coefficients

Only if more than one model are within deltaAIC<2.

library(MuMIn)

## ALL MODELS deltaAIC < 2:

best<-list(A2, B1)
avgmod<-model.avg(best,fit=T)
summary(avgmod)

Call:
model.avg(object = best, fit = T)

Component model call: 
occu(formula = <2 unique values>, data = sheep)

Component models: 
    df logLik   AICc delta weight
13   4 -66.45 141.51  0.00   0.65
123  5 -65.88 142.70  1.19   0.35

Term codes: 
      p(effort)  psi(elevation) psi(ruggedness) 
              1               2               3 

Model-averaged coefficients:  
(full average) 
                Estimate Std. Error z value Pr(>|z|)   
psi(Int)         -1.0290     0.4938   2.084  0.03718 * 
psi(ruggedness)   1.3167     0.5673   2.321  0.02029 * 
p(Int)           -2.5679     0.9742   2.636  0.00839 **
p(effort)         0.1881     0.1049   1.793  0.07293 . 
psi(elevation)    0.2006     0.4243   0.473  0.63644   
 
(conditional average) 
                Estimate Std. Error z value Pr(>|z|)   
psi(Int)         -1.0290     0.4938   2.084  0.03718 * 
psi(ruggedness)   1.3167     0.5673   2.321  0.02029 * 
p(Int)           -2.5679     0.9742   2.636  0.00839 **
p(effort)         0.1881     0.1049   1.793  0.07293 . 
psi(elevation)    0.5651     0.5489   1.029  0.30327   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# for 95% confidence intervals of beta coefficients 

confint(avgmod, level = 0.95)
                      2.5 %      97.5 %
psi(Int)        -1.99677985 -0.06114378
psi(ruggedness)  0.20480106  2.42853636
p(Int)          -4.47723826 -0.65860429
p(effort)       -0.01748701  0.39377211
psi(elevation)  -0.51077988  1.64094613
# sum of model weights
sw(best)
                     p(effort) psi(ruggedness) psi(elevation)
Sum of weights:      1.00      1.00            0.35          
N containing models:    2         2               1          

This output is telling us that we now have an overall model including effort as detection covariates, and elevation + ruggedness as site covariates. The model indicates that effort is marginally significant predictor (very close to p~0.5) of detection probability, and ruggedness is a significant predictor of occupancy.

Caution

The model seems to indicate that there is a higher probability of summer occupancy in the ridges than in the valleys.

Step 7: Goodness of Fit test

In the context of occupancy modeling, mb.gof.test can be used for assessing the goodness-of-fit of models where the goal is to estimate the probability that a site is occupied by a species, while accounting for imperfect detection. The Goodness of Fit test checks whether the observed detection/non-detection data match the expected probabilities derived from the model.

  • P-values help indicate whether the model fits well. A p-value greater than 0.05 suggests that the fit is acceptable, while a smaller value points to lack of fit.

  • c-hat (Variance Inflation Factor) measures overdispersion in the data. In occupancy models, a c-hat significantly greater than 1 suggests the model doesn’t account for all the variability, perhaps due to unmodeled heterogeneity in occupancy or detection. C-hat should be close to 1, anything over 1.1 would be overdispersed (observed variance is higher than expected), whilst a low value shows underdispersion or lack of fit of the model

# 1000 simulations used as example here (will take a few minutes)
obs.boot <- mb.gof.test(A2, nsim=1000) # 1000 bootstrapped samples

obs.boot

MacKenzie and Bailey goodness-of-fit for single-season occupancy model

Pearson chi-square table:

            Cohort Observed Expected Chi-square
000........      1       11     9.53       0.23
00.........      2        1     0.83       0.04
..00000000.      3        3     2.38       0.16
..00000....      4        1     2.07       0.56
..01000....      4        2     0.07      49.63
..0000.....      5        2     2.91       0.29
..0100.....      5        1     0.28       1.88
..0111.....      5        1     0.02      41.98
..000......      6        6     5.30       0.09
..00.......      7        8     7.75       0.01
..01.......      7        1     0.84       0.03
...00000...      8        5     4.75       0.01
...01010...      8        1     0.15       4.63
...01011...      8        1     0.04      22.43
...0000....      9        1     0.65       0.19
....000000.     10        1     1.30       0.07
....010000.     10        1     0.04      24.39
....000....     11        4     4.09       0.00
....010....     11        1     0.47       0.61
.....010000     12        1     0.07      13.19
.....110000     12        1     0.02      50.87
.....00000.     13        2     1.59       0.11
.....100...     14        1     0.03      33.11
.....00....     15        1     1.85       0.39
.....11....     15        1     0.02      47.92
......00000     16        2     2.16       0.01
......11010     16        1     0.01     152.63
......0000.     17        7     6.06       0.14
......0100.     17        1     0.38       1.00

Chi-square statistic = 460.9306 
Number of bootstrap samples = 1000
P-value = 0.332

Quantiles of bootstrapped statistics:
   0%   25%   50%   75%  100% 
   52   253   354   560 18539 

Estimate of c-hat = 0.87 

The output shows a P-value ~0.323, which is a good outcome and seems to indicate the model is a good fit. The c-hat is close to 1, as desired, so that indicates that there is no overdispersion.

What is bootstrapping?

Bootstrapping is a resampling method in statistics used to estimate the properties (e.g., mean, standard error, confidence intervals) of a dataset by repeatedly sampling, with replacement, from the original data.

This technique does not assume a specific distribution for the data, making it a powerful non-parametric tool.

Bootstrapping provides confidence intervals for parameters like means, medians, or regression coefficients without assuming normality.

Example:

Suppose you have a dataset of 10 values:

Original Data: {2,4,5,7,9,10,15,18,20,25}

  1. Generate a bootstrap sample: Bootstrap Sample 1: {7,9,5,25,18,7,4,20,15,2}

  2. Compute the mean of this sample:

    Mean of Sample 1: 11.2

  3. Repeat this process (e.g., 1,000 times), obtaining a distribution of sample means.

  4. Use the distribution to estimate the mean’s confidence interval.

Step 8: Plot predictions for the best model

# As an example, let's plot Occupancy and ruggedness
plot_model <- occu(~effort~ruggedness, sheep)

# Create dummy dataframe
range(siteCovs$ruggedness)
[1] -1.924507  2.109507
#range for SiteCovs -1.924507  2.109507

# create dummy dataframe, using the range of values for ruggedness, with the same number of data points as the detection dataset

MyData1 <- expand.grid(
  ruggedness = seq(-1.924507 , 2.109507, length = 70))

# Fit model predictions to this dummy dataframe

Predicted_ruggedness<-predict(plot_model, MyData1, type="state",se=TRUE)

P_ruggedness<-cbind(MyData1,Predicted_ruggedness)

# Plot the predictions

f1 <- ggplot(data=P_ruggedness, aes(x=ruggedness, y=Predicted))
f1 + geom_line(colour = "blue") + 
  geom_ribbon(aes(ymin = lower, ymax = upper ), alpha=0.2, colour =NA) +
  labs(x = "Standardised ruggedness", y = "Predicted sheep occupancy") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"),
        axis.title.y = element_text(margin = unit(c(0, 4, 0, 0), "mm"))) +
  theme(axis.text=element_text(color = "black", size=11),axis.line=element_line(colour="black"))+
  theme(axis.title.x = element_text(color = "black", face = "bold", size = 12)) +
  theme(axis.title.y = element_text(color = "black", face = "bold", size = 12)) +
  ylim(0,1) +
  theme(plot.margin = unit(c(0.25,0.25,0.25,0.25), "cm"))

References

Anderson, David R., and Kenneth P. Burnham. 2002. “Avoiding Pitfalls When Using Information-Theoretic Methods.” The Journal of Wildlife Management 66 (3): 912. https://doi.org/10.2307/3803155.