Variance Estimation for Density Surface Models

Ben Best, but really just grok'ing David L. Miller (dill)
2014-02-06

Flow of Density Surface Model and DSM Variations

DSM flow

Density surface models (DSMs) come in two main variants: some consist of two steps (correction for imperfect detection, then spatial modelling), while others jointly estimate the relevant parameters.

  1. Two-stage. A detection function is fitted to the distance data to obtain detection probabilities for clusters or individuals. Counts are then summarised per segment. A generalised additive model (GAM) is then constructed with the per-segment counts as the response, with either the counts or the segment areas corrected for detectability (sketched in R after this list).

  2. One-stage. Estimate parameters of detection and spatial models simultaneously.
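
A minimal sketch of the two-stage flow in R using the Distance and dsm packages (a fully worked example on the Gulf of Mexico data appears later in this document); the data frame names here are placeholders.

library(Distance)
library(dsm)

# stage 1: fit a detection function to the observed distances
df.ht <- ds(distdata, truncation = max(distdata$distance), key = "hr")

# stage 2: fit a spatial GAM to per-segment counts, with detectability
# entering through the offset
mod <- dsm(N ~ s(x, y), df.ht, segdata, obsdata)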

Comparison:

  • Generally, very little information is lost by taking a two-stage approach.
  • One-stage models are more difficult both to estimate and to check, as both steps occur at once.
  • Two-stage models have the disadvantage that, to accurately quantify model uncertainty, one must appropriately combine the uncertainty from the detection function and spatial models.

Source: http://dill.github.io/papers/dsm-paper.pdf

Variance Estimation

  • incorporate detection + spatial model variance
  • abundances in adjacent segments are likely to be correlated (failure to account for this spatial autocorrelation will lead to artificially low variance estimates and hence misleadingly narrow confidence intervals)


Software Install

Install the Distance and dsm (density surface model) R packages from David L. Miller (dill) like so:

library(devtools)
install_github('dill/Distance')
install_github('dill/dsm')

This will install the latest packages from the source code at http://github.com/dill/dsm and http://github.com/dill/Distance, along with the R packages upon which they depend, assuming you already have the devtools library installed.

Detection Fitting, single observer: dsm.fit()

Detection component: ddf.ds()

  1. no covariates. The detection function is simply evaluated between the left and right truncation points.

  2. with covariates. The detection function is evaluated at a series of distances (\( y_t \)) between the left and right truncation points (say seq(left,width,100)), for each observed covariate combination (\( \mathbf{z}_i \)). The values at each distance are then averaged, weighted by the fitted probabilities of detection (\( p_i \)). For a given \( y_t \), we evaluate: \[ \frac{\left(\sum_{i=1}^n\frac{ g(y_t, \mathbf{z}_i,\mathbf{\theta})}{p_i} \right) }{ \left( \sum_{i=1}^n\frac{1}{p_i} \right)} \]
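
A sketch of this weighted average in R; the helper g_fitted() (evaluating the fitted detection function at given distances for one covariate combination), the covariate table z, the fitted probabilities p, and left/width are all placeholders:

# distances at which to evaluate the averaged detection function
y.t <- seq(left, width, length.out = 100)

# g.mat[t, i] = g(y_t, z_i, theta): one column per observed covariate combination
g.mat <- sapply(seq_len(nrow(z)), function(i) g_fitted(y.t, z[i, ]))

# average over observations at each distance, weighting each by 1/p_i
g.avg <- as.vector(g.mat %*% (1 / p)) / sum(1 / p)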

Source: http://distancedevelopment.github.io/mrds/Plotting.html

Detection Prediction: dsm.predict()

Detection component: predict.ds()

  1. the predicted average probability of detection for a given set of covariates \[ \hat{p}(\mathbf{z}; \mathbf{\theta}) = \int_0^w g(y, \mathbf{z}; \mathbf{\theta}) \pi(y) \text{d}y, \] where for line transects \( \pi(y) = \frac{1}{w} \)

  2. the predicted effective strip width for a given set of covariates \[ \hat{\mu}(\mathbf{z}; \mathbf{\theta}) = \int_0^w g(y, \mathbf{z}; \mathbf{\theta}) \text{d}y. \] Giving either of these quantities a subscript \( i \) denotes that they are fitted values (i.e. predictions at the observed data): \( \hat{p}_i = \hat{p}(\mathbf{z}_i; \mathbf{\theta}) \) or \( \hat{\mu}_i = \hat{\mu}(\mathbf{z}_i; \mathbf{\theta}) \).
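
A minimal numerical sketch of both quantities, assuming a half-normal detection function with a fixed scale sigma (an illustrative assumption; predict.ds() performs this calculation for the fitted model):

w <- 8000                                    # right truncation distance (m)
sigma <- 2000                                # assumed half-normal scale for a given z
g <- function(y) exp(-y^2 / (2 * sigma^2))   # detection function g(y, z; theta)

# effective strip width: mu = integral of g from 0 to w
mu.hat <- integrate(g, lower = 0, upper = w)$value

# average detection probability: p = mu / w, since pi(y) = 1/w for line transects
p.hat <- mu.hat / w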

Source: http://distancedevelopment.github.io/mrds/Prediction.html

Variance for Detection: function()? "sandwich estimator"

Let's take their word for it…

Obtaining the variance of the average detectability requires a little work.

\[ \hat{P_a} = \frac{1}{w} \int_0^w g(x, \mathbf{z}; \hat{\mathbf{\theta}}) \text{d}x \]

for a line transect, with left truncation point 0 and right truncation point \( w \). We might refer to the integral as \( \hat{\mu} \) and therefore think of \( \hat{\mu} \) as a function of \( \hat{\mathbf{\theta}} \): \( \hat{\mu}(\hat{\mathbf{\theta}}) \). With this in mind we can use the “sandwich estimator” as in Borchers et al. (2002) to calculate the variance of a function of the maximum likelihood estimate \( \hat{\mathbf{\theta}} \):

\[ \text{Var}(\hat{P_a}) = \frac{1}{w^2} \text{Var}(\hat{\mu}(\hat{\mathbf{\theta}})) \]

Now applying the sandwich estimator:

\[ \text{Var}(\hat{P_a}) = \frac{1}{w^2} \left[ \frac{\partial \hat{\mu}(\mathbf{\theta})}{\partial \mathbf{\theta}} {\Bigg|}_{\mathbf{\theta}=\hat{\mathbf{\theta}}}\right]^T (-H)^{-1} \left[ \frac{\partial \hat{\mu}(\mathbf{\theta})}{\partial \mathbf{\theta}} {\Bigg|}_{\mathbf{\theta}=\hat{\mathbf{\theta}}}\right] \]

where \( {\big|}_{\mathbf{\theta}=\hat{\mathbf{\theta}}} \) indicates that the derivatives are evaluated at the maximum likelihood estimate and \( H \) is the Hessian of the log-likelihood evaluated at \( \hat{\mathbf{\theta}} \).

The derivatives of the effective strip width can be tricky to calculate analytically, so it may be necessary to use finite differences to approximate them.
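
A sketch of that computation with central finite differences; mu() (the effective strip width as a function of the detection parameters), theta.hat (the MLE), H (the Hessian of the log-likelihood at theta.hat) and w are all placeholders here:

# central finite-difference gradient of mu(theta), evaluated at theta.hat
eps <- 1e-6
grad.mu <- sapply(seq_along(theta.hat), function(j) {
  d <- replace(numeric(length(theta.hat)), j, eps)
  (mu(theta.hat + d) - mu(theta.hat - d)) / (2 * eps)
})

# sandwich estimator: Var(P_a) = grad^T (-H)^{-1} grad / w^2
var.Pa <- as.numeric(t(grad.mu) %*% solve(-H) %*% grad.mu) / w^2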

Source: http://distancedevelopment.github.io/mrds/Variance.html#toc_4

Variance for DSM: dsm.var.movblk()

dsm.var.movblk() Variance estimation via a parametric moving block bootstrap. Estimates the variance in abundance over an area using a moving block bootstrap; two procedures are implemented, one incorporating detection function uncertainty and one not (a toy sketch of the moving block idea follows).
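
The idea in miniature: resample overlapping blocks of block.size consecutive segments, preserving short-range spatial autocorrelation within each block, until a transect's worth of segments is rebuilt. A toy index-level sketch of that resampling (not the dsm implementation):

block.size <- 3
n.seg <- 20                           # segments in one transect (toy value)
starts <- 1:(n.seg - block.size + 1)  # every possible (overlapping) block start

# sample block starts with replacement, expand to segment indices, trim to n.seg
boot.idx <- unlist(lapply(sample(starts, ceiling(n.seg / block.size), replace = TRUE),
                          function(s) s:(s + block.size - 1)))[1:n.seg]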

Other functions to estimate variance:

  • dsm.var.gam() Variance estimation via standard GAM theory. Uses results from the Bayesian interpretation of the GAM to obtain uncertainty estimates, combining the detection function and GAM uncertainties via the delta method. See Wood (2006).

  • dsm.var.prop() Variance propagation for DSM models. Rather than use a bootstrap to calculate the variance in a dsm model, use the variance propagation trick from Williams et al. (2011), due to M.V. Bravington (MVB). The idea is to refit the spatial model, but including the Hessian of the offset as an extra term; variance estimates using this new model can then be used to calculate the variance of abundance estimates in a way that incorporates detection function uncertainty. Further mathematical details are given in the paper referenced below.

Williams, R., Hedley, S.L., Branch, T.A., Bravington, M.V., Zerbini, A.N. and Findlay, K.P. (2011). Chilean Blue Whales as a Case Study to Illustrate Methods to Estimate Abundance and Evaluate Conservation Status of Rare Species. Conservation Biology 25(3), 526-535.

Source: http://distancedevelopment.github.io/dsm/dsm113to215.html

DSM Data Format

Two data.frames must be provided to dsm. They are referred to as observation.data and segment.data (for observation and segment data, respectively).

observation.data

The observation data frame must have the following columns:

  • object – unique object identifier.
  • Sample.Label – the identifier for the segment where the observation occurred.
  • size – the size of each observed group (i.e. 1 for individuals).
  • distance – perpendicular/radial distance to observation.

segment.data

The segment data frame must have the following columns:

  • Effort – the effort (in terms of length of the segment). Note this is not necessary if the segment.area argument is given. See main dsm documentation.
  • Sample.Label – identifier for the segment (unique!).
  • ??? – environmental covariates, for example x and y for geographical coordinates.
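
A minimal sketch of the two data frames with made-up values (one sighting in one segment), matching the column requirements above:

observation.data <- data.frame(
  object       = 45,            # unique observation identifier
  Sample.Label = "19960417-1",  # segment in which the sighting occurred
  size         = 21,            # observed group size
  distance     = 3297           # perpendicular distance (m)
)

segment.data <- data.frame(
  Sample.Label = "19960417-1",  # unique segment identifier
  Effort       = 13800,         # segment length (m)
  x            = 134159,        # environmental covariates, here coordinates
  y            = 325561
)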

Source: https://github.com/dill/dsm/wiki/DataFormat

Load Gulf of Mexico Dataset

Load the pantropical dolphin data from the Gulf of Mexico, available at http://seamap.env.duke.edu/dataset/25.

library(Distance)
library(dsm)
library(plyr)
library(ggplot2)

# load the Gulf of Mexico dolphin data (see ?mexdolphins)
data(mexdolphins)
attach(mexdolphins)

llply(mexdolphins, function(x) head(x, 1))
$obsdata
   object Sample.Label size distance Effort
45     45   19960421-9   21     3297  36300

$preddata
  latitude longitude depth     x      y width height
1    30.08    -87.58    35 70832 341079 32072  37065

$segdata
  latitude longitude Effort Transect.Label Sample.Label depth      x
1    29.94    -86.93  13800       19960417   19960417-1   135 134159
       y
1 325561

$distdata
   object size distance Effort detected beaufort latitude longitude      x
45     45   21     3297  36300        1        4    27.73       -86 228139
       y
45 79258

$survey.area
  longitude latitude
1    -87.39    25.78

Plot Dataset


# calculate x, y as distance from centroid
lon0 <- -88.31951
lat0 <- 27.01594
sa.tmp <- latlong2km(survey.area$longitude, survey.area$latitude, lon0=lon0, lat0=lat0)
survey.area <- data.frame(x=1000*sa.tmp$km.e, y=1000*sa.tmp$km.n)

# plot options: blank out the grid (gg.opts as defined in the dsm Mexico vignette)
gg.opts <- theme(panel.grid.major = element_blank(),
                 panel.grid.minor = element_blank(),
                 panel.background = element_blank())

# plot the survey area and transect lines
p <- qplot(data=survey.area, x=x, y=y, geom="polygon", fill=I("lightblue"), ylab="y", xlab="x", alpha=I(0.7))
p <- p + coord_equal()
p <- p + gg.opts
p <- p + geom_line(aes(x,y,group=Transect.Label),data=segdata)
print(p)

The survey area with transect lines.

Fit Detection Function

# fit a detection function and look at summary
hr.model <- ds(mexdolphins$distdata, max(mexdolphins$distdata$distance),
               key = "hr", adjustment = NULL)
summary(hr.model)

Summary for distance analysis 
Number of observations :  47 
Distance range         :  0  -  7847 

Model : Hazard-rate key function 
AIC   : 841.3 

Detection function parameters
Scale Coefficients:  
            estimate     se
(Intercept)    7.983 0.9532

Shape parameters:  
            estimate     se
(Intercept)        0 0.7835

                    Estimate      SE     CV
Average p             0.5913  0.2224 0.3762
N in covered region  79.4879 30.8073 0.3876

# plot
plot(hr.model)

The fitted hazard-rate detection function.

Fit Spatial Model

# fit a simple smooth of x and y
mod1 <- dsm(N ~ s(x,y), hr.model, mexdolphins$segdata, mexdolphins$obsdata)
summary(mod1)

Family: quasipoisson 
Link function: log 

Formula:
N ~ s(x, y) + offset(off.set)
<environment: 0x10b8aa350>

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -18.409      0.394   -46.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df    F p-value    
s(x,y) 26.1   28.2 5.61  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.113   Deviance explained =   44%
GCV score = 42.985  Scale est. = 37.611    n = 387

Predict

# create an offset (in square metres); each prediction cell is 444 km^2
off.set <- 444*1000*1000

# predict over a grid
mod1.pred <- predict(mod1, mexdolphins$preddata, off.set)

# calculate the predicted abundance over the grid
sum(mod1.pred)
[1] 47034

# plot the smooth
plot(mod1)

The fitted smooth s(x,y).

Calculate Variance for DSM: dsm.var.movblk() without DS Uncertainty


# calculate the variance by 500 moving block bootstraps
system.time({

  mod1.movblk.uF <- dsm.var.movblk(mod1, mexdolphins$preddata, n.boot=500,
     block.size = 3, samp.unit.name = "Transect.Label", off.set = off.set,
     bar = TRUE, bs.file = "mexico-bs.csv", 
     ds.uncertainty=FALSE)

})
#    user  system elapsed 
# 305.118   5.807 310.626
# 310.626/60 = 5.1771 minutes
summary(mod1.movblk.uF)

Calculate Variance for DSM: dsm.var.movblk() with DS Uncertainty

system.time({

  mod1.movblk.uT <- dsm.var.movblk(mod1, mexdolphins$preddata, n.boot=500,
     block.size = 3, samp.unit.name = "Transect.Label", off.set = off.set,
     bar = TRUE, bs.file = "mexico-bs.csv", 
     ds.uncertainty=TRUE)

})
# ** Warning: Problems with fitting data. Did not converge**
# Error in detfct.fit.opt(ddfobj, optim.options, bounds, misc.options) : 
#   No convergence.
#
#    user  system elapsed 
# 903.071  13.191 916.000 
# 916.00/60 = 15.26667 minutes

summary(mod1.movblk.uT)

plot(mod1.movblk.uT)

Calculate Variance for DSM: dsm.var.prop()


# Calculate the variance
system.time({

  mod1.varprop <- dsm.var.prop(mod1, mexdolphins$preddata, off.set)

})
   user  system elapsed 
  0.395   0.114   8.952 
#  user  system elapsed 
# 0.306   0.015   0.320
# 0.320/60 = 0.005333333 minutes

# summary over the whole area in mexdolphins$preddata
summary(mod1.varprop)
Summary of uncertainty in a density surface model calculated
 by variance propagation.

Quantiles of differences between fitted model and variance model
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-3.52e-12 -2.71e-13 -3.10e-14 -2.22e-13  0.00e+00  1.66e-13 

Approximate asymptotic confidence interval:
   5%  Mean   95% 
29962 47034 73832 
(Using delta method)

Point estimate                 : 47034 
Standard error                 : 10966 
Coefficient of variation       : 0.2331 

# Plot a map of the CV
#   need to format the prediction data with split
mod1.varprop.map <- dsm.var.prop(mod1, split(mexdolphins$preddata, 1:nrow(mexdolphins$preddata)), off.set)
plot(mod1.varprop.map)

Map of the coefficient of variation of predicted abundance.

Calculate Variance for DSM: dsm.var.gam()


# Calculate the variance
system.time({

  mod1.vargam <- dsm.var.gam(mod1, mexdolphins$preddata, off.set)

})
   user  system elapsed 
  0.137   0.005   0.142 
#  user  system elapsed 
# 0.138   0.013   0.151 
# 0.151/60 = 0.002516667 minutes

# summary over the whole area in mexdolphins$preddata
summary(mod1.vargam)
Summary of uncertainty in a density surface model calculated
 analytically for GAM, with delta method

Approximate asymptotic confidence interval:
   5%  Mean   95% 
22467 47034 98465 
(Using delta method)

Point estimate                 : 47034 
Standard error                 : 4971 
CV of detection function       : 0.3762 
CV from GAM                    : 0.1057 
Total coefficient of variation : 0.3908 
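
The total CV combines the detection function CV and the GAM CV via the delta method, so the squared CVs add; a quick check against the summary above:

sqrt(0.3762^2 + 0.1057^2)
# [1] 0.3908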

Others to Explore

mrds variance functions:

  • covn(): computes empirical variance of encounter rate
  • create.varstructure(): creates structures needed to compute abundance and variance

other packages from DistanceDevelopment

  • DSsim Distance Sampling Simulation R Package
  • mads Multi-Analysis Distance Sampling. Deals with unidentified sightings, covariate uncertainty and model uncertainty in Distance sampling
