Ben Best, but really just grok'ing David L. Miller (dill)
2014-02-06
Variation in density surface models (DSM) consist of two steps (correction for imperfect detection, then spatial modelling), while others jointly estimate the relevant parameters.
Two-stage. A detection function is fitted to the distance data to obtain detection probabilities for clusters or individuals. Counts are then summarised per segment. A generalised additive model (GAM) is then constructed with the per-segment counts as the response with either counts or segment areas corrected for detectability.
One-stage. Estimate parameters of detection and spatial models simultaneously.
Comparison:
Install the Distance and density surface model (dsm) R package from Dave L. Miller (dill) like so:
library(devtools)
install_github('dill/Distance')
install_github('dill/dsm')
This will install the latest R package from source code at http://github.com/dill/dsm, and all R packages upon which it depends, assuming you have the library devtools already installed.
Detection component: ddf.ds()
no covariates Detection function is simply evaluated between left and right truncations.
with covariates
Detection function is evaluated at a series of distances (\( y_t \)) between left and right truncations (say seq(left,width,100)), and evaluated for each observed covariate combination (\( \mathbf{z}_i \)). Values are averaged at each distance, weighted by the fitted probabilities of detection (\( p_i \)). For a given \( y_t \), we evaluate:
\[
\frac{\left(\sum_{i=1}^n\frac{ g(y_t, \mathbf{z}_i,\mathbf{\theta})}{p_i} \right) }{ \left( \sum_{i=1}^n\frac{1}{p_i} \right)}
\]
Source: http://distancedevelopment.github.io/mrds/Plotting.html
Detection component: predict.ds()
the predicted average probability of detection for a given set of covariates \[ \hat{p}(\mathbf{z}; \mathbf{\theta}) = \int_0^w g(y, \mathbf{z}; \mathbf{\theta}) \pi(y) \text{d}y, \] where for line transects \( \pi(y) = \frac{1}{w} \)
the predicted effective strip width for a given set of covariates \[ \hat{\mu}(\mathbf{z}; \mathbf{\theta}) = \int_0^w g(y, \mathbf{z}; \mathbf{\theta}) \text{d}y, \] Giving either of these quantities a subscript \( i \) will denote that they are fitted values (i.e. predictions at the observed data): \( \hat{p}_i(\mathbf{z}; \mathbf{\theta}) \) or \( \hat{\mu}_i(\mathbf{z}; \mathbf{\theta}) \).
Source: http://distancedevelopment.github.io/mrds/Prediction.html
Let's take their word for it…
Obtaining the variance of the average detectability requires a little work.
\[ \hat{P_a} = \frac{1}{w} \int_0^w g(x, \mathbf{z}; \hat{\mathbf{\theta}}) \text{d}x \]
for a line transect, with left truncation point 0 and right truncation point \( w \). We might refer to the integral as \( \hat{\mu} \) and therefore think of \( \hat{\mu} \) as a function of \( \hat{\mathbf{\theta}} \): \( \hat{\mu}(\hat{\mathbf{\theta}}) \). With this in mind we can use the “sandwich estimator” as in Borchers et al (2002) to calculate the variance of a function of the maximum likelihood estimate of \( \hat{\mathbf{\theta}} \)
\[ \text{Var}(P_a) = \frac{1}{w^2} \text{Var}(\hat{\mu}(\hat{\mathbf{\theta}})) \]
Now applying the sandwich estimator:
\[ \hat{P_a} = \frac{1}{w^2} \left[ \frac{\partial \hat{\mu}(\mathbf{\theta})}{\partial \mathbf{\theta}} {\Bigg|}_{\mathbf{\theta}=\hat{\mathbf{\theta}}}\right]^T (-H)^{-1} \left[ \frac{\partial \hat{\mu}(\mathbf{\theta})}{\partial \mathbf{\theta}} {\Bigg|}_{\mathbf{\theta}=\hat{\mathbf{\theta}}}\right] \]
where \( {\big|}_{\mathbf{\theta}=\hat{\mathbf{\theta}}} \) indicates that the derivates are evaluated at their maximum likelihood estimates.
The derivatives of the effective strip width can be tricky to calculate analytically, so it may be necessary to use finite differences to approximate them.
Source: http://distancedevelopment.github.io/mrds/Variance.html#toc_4
dsm.var.movblk() Variance estimation via parametric moving block bootstrap Estimate the variance in abundance over an area using a moving block bootstrap. Two procedures are implemented, one incorporating detection function uncertainty, one not.
Other functions to estimate variance:
dsm.var.gam() Variance estimation via Bayesian results by standard GAM theory (using the delta method to combine detection function and GAM uncertainties). Use results from the Bayesian interpretation of the GAM to obtain uncertainty estimates. See Wood (2006).
dsm.var.prop() Variance propogation for DSM models Rather than use a bootstrap to calculate the variance in a dsm model, use the clever variance propogation trick from Williams et al. (2011) by MVB. The idea is to refit the spatial model but including the Hessian of the offset as an extra term. Variance estimates using this new model can then be used to calculate the variance of abundance estimates which incorporate detection function uncertainty. Further mathematical details are given in the paper in the references below.
Williams, R., Hedley, S.L., Branch, T.A., Bravington, M.V., Zerbini, A.N. and Findlay, K.P. (2011). Chilean Blue Whales as a Case Study to Illustrate Methods to Estimate Abundance and Evaluate Conservation Status of Rare Species. Conservation Biology 25(3), 526-535.
Source: http://distancedevelopment.github.io/dsm/dsm113to215.html
Two data.frames must be provided to dsm. They are referred to as observation.data and segment.data (for observation and segment data, respectively).
observation.dataThe observation data frame must have the following columns:
object – unique object identifier.Sample.Label – the identifier for the segment where the observation occurred.size – the size of each observed group (i.e. 1 for individuals).distance – perpendicular/radial distance to observation.segment.dataThe segment data frame must have the following columns:
Effort – the effort (in terms of length of the segment). Note this is not necessary if the segment.area argument is given. See main dsm documentation.Sample.Label – identifier for the segment (unique!).??? – environmental covariates, for example x and y for geographical coordinates.pantropical dolphins in the Gulf of Mexico from http://seamap.env.duke.edu/dataset/25.
library(Distance)
library(dsm)
library(plyr)
library(ggplot2)
# load the Gulf of Mexico dolphin data (see ?mexdolphins)
data(mexdolphins)
attach(mexdolphins)
llply(mexdolphins, function(x) head(x, 1))
$obsdata
object Sample.Label size distance Effort
45 45 19960421-9 21 3297 36300
$preddata
latitude longitude depth x y width height
1 30.08 -87.58 35 70832 341079 32072 37065
$segdata
latitude longitude Effort Transect.Label Sample.Label depth x
1 29.94 -86.93 13800 19960417 19960417-1 135 134159
y
1 325561
$distdata
object size distance Effort detected beaufort latitude longitude x
45 45 21 3297 36300 1 4 27.73 -86 228139
y
45 79258
$survey.area
longitude latitude
1 -87.39 25.78
data(mexdolphins)
attach(mexdolphins)
The following objects are masked from mexdolphins (position 3):
distdata, obsdata, preddata, segdata, survey.area
# calculate x, y as distance from centroid
lon0 <- -88.31951
lat0 <- 27.01594
sa.tmp <- latlong2km(survey.area$longitude, survey.area$latitude, lon0=lon0, lat0=lat0)
survey.area <- data.frame(x=1000*sa.tmp$km.e, y=1000*sa.tmp$km.n)
# plot
p <- qplot(data=survey.area,x=x, y=y, geom="polygon", fill=I("lightblue"), ylab="y", xlab="x", alpha=I(0.7))
p <- p + coord_equal()
p <- p+gg.opts
p <- p + geom_line(aes(x,y,group=Transect.Label),data=segdata)
print(p)
# fit a detection function and look at summary
hr.model <- ds(mexdolphins$distdata, max(mexdolphins$distdata$distance),
key = "hr", adjustment = NULL)
summary(hr.model)
Summary for distance analysis
Number of observations : 47
Distance range : 0 - 7847
Model : Hazard-rate key function
AIC : 841.3
Detection function parameters
Scale Coefficients:
estimate se
(Intercept) 7.983 0.9532
Shape parameters:
estimate se
(Intercept) 0 0.7835
Estimate SE CV
Average p 0.5913 0.2224 0.3762
N in covered region 79.4879 30.8073 0.3876
# plot
plot(hr.model)
# fit a simple smooth of x and y
mod1 <- dsm(N ~ s(x,y), hr.model, mexdolphins$segdata, mexdolphins$obsdata)
summary(mod1)
Family: quasipoisson
Link function: log
Formula:
N ~ s(x, y) + offset(off.set)
<environment: 0x10b8aa350>
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.409 0.394 -46.7 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x,y) 26.1 28.2 5.61 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.113 Deviance explained = 44%
GCV score = 42.985 Scale est. = 37.611 n = 387
# create an offset (in metres), each prediction cell is 444km2
off.set <- 444*1000*1000
# predict over a grid
mod1.pred <- predict(mod1, mexdolphins$preddata, off.set)
# calculate the predicted abundance over the grid
sum(mod1.pred)
[1] 47034
# plot the smooth
plot(mod1)
# calculate the variance by 500 moving block bootstraps
system.time({
mod1.movblk.uF <- dsm.var.movblk(mod1, mexdolphins$preddata, n.boot=500,
block.size = 3, samp.unit.name = "Transect.Label", off.set = off.set,
bar = TRUE, bs.file = "mexico-bs.csv",
ds.uncertainty=F)
})
# user system elapsed
# 305.118 5.807 310.626
# 5.1771/60 = 5.1771 minutes
summary(mod1.movblk.uF)
system.time({
mod1.movblk.uT <- dsm.var.movblk(mod1, mexdolphins$preddata, n.boot=500,
block.size = 3, samp.unit.name = "Transect.Label", off.set = off.set,
bar = TRUE, bs.file = "mexico-bs.csv",
ds.uncertainty=T)
})
# ** Warning: Problems with fitting data. Did not converge**
# Error in detfct.fit.opt(ddfobj, optim.options, bounds, misc.options) :
# No convergence.
#
# user system elapsed
# 903.071 13.191 916.000
# 916.00/60 = 15.26667 minutes
summary(mod1.movblk.uT)
plot(mod1.movblk.uT)
# Calculate the variance
system.time({
mod1.varprop <- dsm.var.prop(mod1, mexdolphins$pred, off.set)
})
user system elapsed
0.395 0.114 8.952
# user system elapsed
# 0.306 0.015 0.320
# 0.320/60 = 0.005333333 minutes
# summary over the whole area in mexdolphins$pred
summary(mod1.varprop)
Summary of uncertainty in a density surface model calculated
by variance propagation.
Quantiles of differences between fitted model and variance model
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.52e-12 -2.71e-13 -3.10e-14 -2.22e-13 0.00e+00 1.66e-13
Approximate asymptotic confidence interval:
5% Mean 95%
29962 47034 73832
(Using delta method)
Point estimate : 47034
Standard error : 10966
Coefficient of variation : 0.2331
# Plot a map of the CV
# need to format the prediction data with split
mod1.varprop.map <- dsm.var.prop(mod1, split(mexdolphins$pred,1:nrow(mexdolphins$pred)), off.set)
plot( mod1.varprop.map)
# Calculate the variance
system.time({
mod1.vargam <- dsm.var.gam(mod1, mexdolphins$pred, off.set)
})
user system elapsed
0.137 0.005 0.142
# user system elapsed
# 0.138 0.013 0.151
# 0.151/60 = 0.002516667 minutes
# summary over the whole area in mexdolphins$pred
summary(mod1.vargam)
Summary of uncertainty in a density surface model calculated
analytically for GAM, with delta method
Approximate asymptotic confidence interval:
5% Mean 95%
22467 47034 98465
(Using delta method)
Point estimate : 47034
Standard error : 4971
CV of detection function : 0.3762
CV from GAM : 0.1057
Total coefficient of variation : 0.3908
guides:
faqs:
segment.area argument in dsm() function mrds variance functions:
covn(): computes empirical variance of encounter ratecreate.varstructure(): creates structures needed to compute abundance and varianceother packages from DistanceDevelopment
papers: