Forest inventory data can offer huge numbers of plots, which pose challenges not encountered with traditional data. In this vignette we explore how models fitted to large networks of many small plots make noisy predictions, yet can hold as much information as smaller networks of a few large plots.
Today
Semester projects
review last vignette
this vignette
Learning goals
Reading:
Data files
xdataCluster.rdata, densPlus15cluster.rdata (on Sakai/Resources/data)
Software
Get the updated clarkFunctions2024.R file on Sakai:
source('clarkFunctions2024.R')
Longitudinal studies of trees have been implemented all over the globe, by ecologists and managers alike, to monitor changes in forest resources and biodiversity. The term longitudinal refers to studies where subjects are followed over time. Trees lend themselves to permanent plot studies because individuals can be identified by tags, and they are easy to map and (at least for diameter) to measure.
Ecological studies range from individual plots to networks like NEON and ForestGEO. Some of these longitudinal studies are experimental, including Duke University’s Free-Air CO2 Enrichment (FACE) experiment. The USDA Forest Inventory and Analysis (FIA) program is one of many national forest inventories that can address biogeographic questions.
More than 200,000 FIA plots vary geographically in size and sample years. Shown here are young stands in the South and East (light shading at left) and dry site classes in mountainous sites and southern Plains (brown at right).
Background on the FIA program describes the spatial design and sampling. Key features to note include the following:
plots are small, but not uniformly so
protocols have shifted over time, especially in the last two decades
resampling of plots occurs at intervals of several years (up to a decade); intervals vary because the inventory is conducted by states
not all plots have been sampled more than once
for privacy considerations, plot locations are approximate (you cannot find them)
The fourth bullet is important because estimates of change require replication in time, i.e., at least two censuses, but preferably more. For these reasons, any given analysis may use a subset of the full inventory.
Several vignettes explore long-term plot data to illustrate three main concepts related to observational data. First, I look at the signal and noise for spatial data in the context of model fitting and prediction. Next, I use the FACE experiment to show how hierarchical Bayes can change the interpretation of variables responsible for tree growth. Finally, I take up hierarchical spatial models using FIA.
There is nothing discussed here that is unique to FIA. These are general considerations.
Observational data have important limitations. As pointed out in the first vignette, interpreting associations between variables can be challenging. Which variable affects others? Which unmeasured variables have important effects?
Despite the disadvantages, most questions in ecology cannot be answered in the lab or even in small field plots. They engage landscape and atmospheric processes that do not operate at fine scales. Ecologists use experiments where they can. But they also need to learn from observational data.
Parameter estimates and predictions have uncertainty that, ideally, can be expressed with a probability statement. Parameter estimates can help us understand the effects of one process on another. Predictions help us evaluate models, anticipate responses outside the fitted data set, and understand how a process works. Noisy data sets can result in high uncertainty. However, estimates and predictions may only be useful when uncertainty is not too high. Here we consider how a sampling design can affect uncertainty differently for parameter estimates and predictions.
Several years ago some of us were developing models to estimate the effects of climate and habitat on tree recruitment (rates of seedling establishment in small plots) using the FIA. We typically use prediction as one of several model-testing techniques, and predictions for this data set were bad (see figure below left). We assumed that poor predictions could be blamed on the small sizes of seedling plots in FIA data. How could a small number of predictors, say temperature, precipitation, and soils, possibly be enough to predict the number of seedlings that would be found on such a tiny plot? We decided to aggregate seedling plots from similar climates and habitats (we were not the first to think about aggregating FIA plots; see Iverson et al., 1998).
Predicted seedling recruitment at fine (left) and coarse (right) spatial scales obtained by aggregating small plots on the basis of similar conditions. (from Zhu et al., 2015)
The effect of aggregation on predictions was dramatic. The left panel of this figure shows the prediction we expect for noise. Regardless of the observed number of seedlings (horizontal axis), the model everywhere predicts the mean number of seedlings for the entire sample. Said another way, the model seems to learn nothing from the observation at a given plot. If an observation was having any impact, wouldn’t we expect that the model prediction would show some evidence of that? By contrast, the model fitted at right would be judged as useful, because it is capable of predicting the data; clearly, there is signal.
Recall that the same data are used in both of these examples. If there is any variation between plots, it might degrade the fit for aggregated plots, i.e., a blurring of covariate information that occurs when aggregating plots with similar (but not identical) climate and soils. Of course, if there is a huge number of plots (as in this example), then this information loss by aggregation should be small. So if we want to use prediction for model-checking, we want to understand why simple aggregation finds signal where raw data do not.
Prediction is one of several model-checking tools, one that takes place in data space. There are other ways to check and compare models in parameter space, which has the disadvantage that models of different structures often cannot readily be compared in this way. A capacity to predict the data used to fit the model might be compared across diverse models.
A few more words on parameters and predictions are needed before we proceed further. A fitted model can provide estimates of relationships (parameters) and predictions out-of-sample for times and places other than when/where the data were collected. Out-of-sample and in-sample prediction (predicting the data used to fit the model) are used for model checking. If I want to learn about relationships, then a model fit is useful if the parameter uncertainty is not too big. If I want to predict data, then a model fit is useful if the prediction uncertainty is not too big. Sometimes parameter uncertainty is acceptably low where prediction uncertainty is not. Whether or not a model fit is useful may depend on whether I am after parameters or predictions.
Large prediction error is often taken as a sign that a model is not useful. Poor prediction might be judged by a large root mean square prediction error (rmspe) and/or an \(r^2\) near zero. Recall from introductory statistics that \(r^2\) compares the scatter of predictions around two reference lines: the 1:1 line of observed values (dashed in the figure) versus the sample mean (horizontal).
The two fits being compared here include the same model and the same data, yet they contrast in prediction performance.
Whether or not I am interested in parameter estimates, I need to understand how they affect prediction uncertainty.
Parameter estimates have uncertainty that is determined by how well the model “explains” the data and the sample size or number of observations \(n\). The parameter uncertainty is translated into prediction uncertainty, which is represented by the vertical lines for each observation in the figure above. To see the direct role played by sample size, recall that the traditional standard error for an estimate of a sample mean has the factor \(\sqrt{n}\) in the denominator,
\[SE( \hat{{\mu}} ) = \frac{\sigma}{\sqrt{n}} \tag{1}\] The derivation of this result for Poisson intensity from a Bayesian perspective is given in the Appendix.
The fact that I am dividing by \(\sqrt{n}\) means that the standard errors for parameter estimates will be tiny when the sample size is huge (as it is here). In other words, all parameters are ‘significant’. Poor prediction in this example cannot be blamed on large parameter error or small \(n\). Tang et al. show that the Poisson result also applies to the estimate of the mean of the normal distribution.
The variance of the parameter estimate is the square of the standard error,
\[Var( \hat{\mu} ) = \frac{\sigma^2}{n} \] Because a prediction from the model is a convolution of two independent normal distributions (one for the data, \(N(y | \mu, \sigma^2)\), and another for the posterior estimate of the mean, which is asymptotically \(N(\hat{\mu} | \bar{y}, \sigma^2/n)\)), the prediction variance is the sum of the two variances,
\[Var( \hat{y} ) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2 \left( 1 + \frac{1}{n} \right) \] Unlike the variance of the parameter estimate, this prediction variance is only affected by sample size \(n\) when \(n\) is small. Even as uncertainty in \(\hat{\mu}\) continues to decline with increasing observations, the predictions can never improve beyond (drop below) the residual variance from model fitting.
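To see this numerically, here is a small simulation (my own sketch, not part of the vignette code) comparing the standard error of \(\hat{\mu}\) with the predictive standard deviation as \(n\) grows:
set.seed(1)
mu <- 5; sigma <- 2                      # arbitrary values for illustration
for( n in c(10, 100, 10000) ){
  y      <- rnorm( n, mu, sigma )
  seMu   <- sd( y )/sqrt( n )            # parameter SE, shrinks with n
  sdPred <- sd( y )*sqrt( 1 + 1/n )      # prediction SD, cannot drop below sigma
  cat( 'n =', n, ': SE(mu) =', signif( seMu, 3 ),
       ', SD(prediction) =', signif( sdPred, 3 ), '\n' )
}
The parameter SE falls roughly tenfold for each 100-fold increase in \(n\), while the predictive SD stays near \(\sigma = 2\).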
Aggregation in the seedling example did not add any information that could explain the differences in plots, but it did change the scale of the data and, thus, the model. In the current example, the spatial scale is set by plot area. The response variable is a count (number of recruits) per plot, and the plot area may have been assigned simply by convenience. For example, seedling plots in many ecological studies are 1 m\(^2\), perhaps because plots of this size can be accessed without disturbance by trampling (reach in, rather than step in). This scale effect is implicit, because it disappears when I multiply the factors on the right,
\[\frac{\mbox{count}}{ \mbox{plot} } = \frac{\mbox{count}}{ \mbox{area} }\times \frac{\mbox{area}}{\mbox{plot}}\] I will refer to the second factor on the right, in this case area per plot, as the observation effort devoted to each count. More generally, it can be the time spent counting birds (e.g., minutes), the number of hooks in longline fisheries, or the number of days that camera traps are deployed. I’ll refer to this type of effort per observation with \(E\), and I will have to qualify it with units (e.g., time, area).
The first factor on the right is the quantity we want to model. In longline fisheries, it would be catch per unit effort (CPUE). In our example it is recruits per ha.
The effort afforded to each observation sums to the sample effort. The number of observations represents the number of times I recorded what I saw; this is the previously defined sample size \(n\). The total work I do on the sample is \(nE\). In other words,
\[\mbox{observation effort} = E\] and
\[\mbox{sample effort} = nE = S\] Of course, effort can differ for each observation. If so, I include an index for observation \(i\), which has observation effort \(E_i\) and sample effort \(\sum_{i=1}^n E_i\). For now, I will assume consistent effort (no subscript). For this example, sample effort is the total area sampled in the study.
If the resources that I have available to do a study limit sample effort, I might think about how much I can learn for a given \(n\) and \(E\). How I organize sampling is called design. [FIA sampling is a stratified random design, where stratification is by area and plots are randomized within areas.] I mention this here only to point out that design considerations can include how to allocate total effort \(nE\) into many observations, each with low effort, or vice versa.
To make this simple, I want to estimate the mean intensity \(\lambda\) for Poisson counts \(y_i\), where \(i = 1, \dots, n\) is a label for an observation. I use this notation,
\[y_i \sim Poi( \lambda ) \tag{2}\]
The Poisson distribution is used for count data. It has a single intensity parameter \(\lambda > 0\) that describes the number of independent, identically distributed events expected for a defined effort of, say, space or time. The intensity can be modeled with predictors. Linear models for \(\lambda\) typically use a log link function that ensures positive values, e.g., \(\log \lambda = \mathbf{x}'\boldsymbol{\beta}\).
This model estimates an intensity parameter \(\lambda\) for observed counts per plot. Counts per plot is a dimensionless quantity. However, plots can vary in size, and large plots are expected to yield large counts. The next section brings effort \(E\) into the model.
Simulating data lets me check whether the parameters used to generate the data can be recovered by the fitted model. It can also show how the data-generating mechanism affects uncertainty in estimates.
The answer to the aggregation paradox can be shown with a simple simulation. Here is an example using the R function glm to fit a simulated data set. To simulate data I assume a sample size n and a parameter lambda, and I draw n ‘observations’ y at random from the Poisson distribution. This function estimates the Poisson parameter on the log scale, so I define it as \(\mu = \log{\lambda}\).
n <- 100 # sample size
mu <- 2.8 # an arbitrary log lambda
lambda <- exp(mu)
y1 <- rpois( n, lambda ) # random observations
fit1 <- glm( y1 ~ 1, "poisson" ) # estimate parameter
est1 <- summary(fit1)$coefficients
The estimate in est1 is 2.81 +/- 0.0246. Check help( glm ) to find arguments and syntax for this function.

Note that the model estimates mu close to the value used to simulate the data, 2.8. The standard error (SE) represents parameter uncertainty. Below is a histogram of the simulated values and the lambda estimate \(\pm 1\) SE.
hist(y1, nclass = 15, col = 'brown', border = 'brown' )
abline(v = exp( est1[1]), lwd = 2 ) # mean estimate
abline(v = exp( est1[1] + c(-1,1)*est1[2] ), lty=2) # +/- one SE
Histogram of counts with mean estimate (solid) +/- 1 SE.
The tight dashed lines indicate confidence in the estimate of mu.
If I want parameter estimates on an area basis I could include observation effort,
\[y_i \sim Poi( E \lambda )\]
If I define \(E\) in terms of area, then I am really estimating density (e.g., plants per m\(^2\)).
I can repeat the analysis using an area basis, i.e., a different value for \(E\),
E <- 10
fit2 <- glm( y1 ~ 1, "poisson", offset = rep(log(E), n) ) # offset E for log link
est2 <- summary(fit2)$coefficients
ftab <- rbind(est1, est2)
rownames(ftab) <- paste( 'fit', c(1:2) )
The glm function with a Poisson likelihood defaults to a log link. For this reason, the offset is log(E). On the log scale, the mean is \(\log(E) + \log(\lambda)\). ftab looks like this:
|  | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| fit 1 | 2.807 | 0.025 | 114.228 | 0 |
| fit 2 | 0.504 | 0.025 | 20.526 | 0 |
Note that the mean estimates differ in this table, but the standard errors have not changed. They are identical on the log (proportionate) basis, because all I have done is change the units from per-plot to per-one-tenth-of-a-plot. To see that this is nothing more than a change of units I can overlay them on the same plot using this code:
abline(v = E*exp( est2[1]), lwd = 2, col=3 ) # mean estimate
abline(v = E*exp( est2[1] + c(-1,1)*est2[2] ), lty=2, col=4) # +/- one SE
So by introducing \(E\), I changed the units without actually changing the observation effort or the parameter uncertainty. To increase the actual effort, I need to increase the area of sampled plots.
Variation in count data depends on scale (e.g., plot area), which represents observation effort. Parameter estimates improve with observation effort.
I will repeat the analysis of simulated data, but this time with increased effort per observation. This means that I expect to observe \(E\) times as many recruits on a plot,
y3 <- rpois( n, E*lambda ) # sampled with new effort
fit3 <- glm( y3 ~ 1, "poisson", offset = rep(log(E), n) )
est3 <- summary(fit3)$coefficients
|  | n | E | S | offset | Estimate | Std. Error |
|---|---|---|---|---|---|---|
| fit 1 | 100 | 1 | 100 | 1 | 2.807 | 0.025 |
| fit 2 | 100 | 1 | 100 | 10 | 0.504 | 0.025 |
| fit 3 | 100 | 10 | 1000 | 10 | 2.807 | 0.008 |
By contrast with fit2, fit3 benefits from increased effort, with the result that the SE is reduced. The mean estimate for fit3 agrees with fit1, because the offset passed to glm is the same as the effort used to simulate the data.
The reduction in parameter uncertainty resulted not from an increase in sample size, \(n\), which was the same for all of these model fits. It came instead from an increase in observation effort per sample. This difference in effort can also be viewed in terms of the total number of recruits that were counted, sum(y1) = 1656 and sum(y3) = 16554.
Although the first model has the dimensions of ‘per plot’ and the second has dimensions ‘per area’, they both have the same scale, determined by the area of the plot observed. In other words, if plots are 1 m\(^2\) in area and I choose to report parameters on the scale of hectares (easily done by setting \(1/10000\) ha per m\(^2\)), I have not changed the fact that parameter estimates are determined by variation that occurs at the m\(^2\) scale, which explains why they have the same proportionate SE.
The third model had increased effort, because I used \(E\) to simulate more counts per observation. Without increasing sample size \(n\), I increased sample effort \(S = nE\) by increasing \(E\). The aggregation paradox depends on scales dominated by signal and noise, but implications differ for parameters and predictions.
Total sample effort combines \(n\) and \(E\).
With this background, I can reconsider the initial example as one in which sample effort \(S\) is held constant while being redistributed over a different sample size \(n\) and observation effort \(E\). When we aggregated seedling data we changed the observation effort \(E\) without changing the sample effort \(S = nE\). Sample effort remained constant because we reduced \(n\) by the same factor that we increased \(E\). The aggregation simply gathered the same counts into a smaller number of observations. In this section I show that this redistribution of effort had almost no effect on parameter estimates, despite its devastating effect on prediction. These results have implications for the role of prediction, which I take up at the end of this section.
The model can be expanded to include a predictor variable, written as
\[ \begin{aligned} \log \lambda_i &= \mathbf{x}_i' \boldsymbol{\beta} \\ &= \beta_0 + \beta_1 x_i \end{aligned} \] where \(\mathbf{x}'_i = (1, x_i)\) is a length-2 vector holding the value of the covariate in the second position, and \(\boldsymbol{\beta}' = (\beta_0, \beta_1)\) is the corresponding coefficient vector holding the intercept and slope, respectively. Simulated data must now include these two vectors.
Here I write a function poisReg in clarkFunctions2024.R to simulate data, fit a glm, and generate plots. I do this so I can easily change \(n\) and \(E\) to explore their effects on parameter estimates and prediction.
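poisReg is supplied in clarkFunctions2024.R and not reproduced in this vignette; as a rough guide, a function like it could be sketched as below (argument names and the returned out matrix are guessed from how the function is used later, so the real implementation may differ):
poisReg <- function(n, E, slope = .5, intercept = 0, col = 1, PLOT = TRUE){
  x2     <- rnorm( n )                                  # covariate
  lambda <- exp( intercept + slope*x2 )                 # intensity per unit effort
  y      <- rpois( n, E*lambda )                        # counts observed with effort E
  fit    <- glm( y ~ x2, family = poisson, offset = rep( log(E), n ) )
  pr     <- predict( fit, type = 'response', se.fit = TRUE )
  yse    <- sqrt( pr$se.fit^2 + pr$fit )                # parameter + Poisson sampling error
  out    <- cbind( x2, y, yhat = pr$fit, yse = yse )
  if(PLOT){
    z <- jitter( y )
    plot( z, pr$fit, col = col, pch = 16, cex = .4,
          xlab = 'counts', ylab = 'predicted' )
    segments( z, pr$fit - yse, z, pr$fit + yse, col = col )
    abline( 0, 1, lty = 2 ); abline( h = mean( y ), lty = 2 )
    title( paste( 'rmspe =', signif( sqrt( mean( (y - pr$fit)^2 ) ), 3 ) ) )
  }
  invisible( list( coefficients = summary( fit )$coefficients, out = out ) )
}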
Here is an example using this code assuming a slope parameter \(\beta_1 = 0.5\),
par( bty='n' )
slope <- .5
fit1 <- poisReg(n = 100, E = 10, slope = slope)
Simulated data with predictions \(\pm 1\) SE versus observed counts.
The figure shows observed counts with predictive intervals (\(\pm 1\) predictive standard error) for observations. These intervals were obtained by taking the standard errors for coefficients \(\boldsymbol{\beta}\) and combining them with the Poisson sampling error. The 1:1 line is the line of agreement. The horizontal line is set at the mean count, i.e., the expected prediction for pure noise. Clearly, there is signal in this simulated data set.
So the question becomes, can I learn more about relationships (\(\boldsymbol{\beta}\)) and make better predictions (\(\{ \hat{y_i} \}\)) if I redistribute my total effort?
I can redistribute sample effort to more plots, each with lower sampling effort, by increasing \(n\) and decreasing \(E\) both by a factor of 10:
fit2 <- poisReg(n = 1000, E = 1, slope = slope, PLOT = F)
The table shows comparable mean estimates and SEs for fit1 and fit2. Recall that the value used to simulate data was slope = .5.
|  | n | E | estimate |
|---|---|---|---|
| fit 1 (few large) | 100 | 10 | 0.535 +/- 0.0317 |
| fit 2 (many small) | 1000 | 1 | 0.473 +/- 0.0271 |
Prediction uncertainty changes with \(E\) but not \(n\) (rmspe shown in the figure). To see this, consider fit3, which has the same sample size used for fit1 but with the observation effort (per sample) used for fit2. Prediction comparisons for the three models tell us that effort per sample \(E\) is the key to good predictions in this simple example, not \(n\) and not \(nE\).
lc <- c('#1c9099', '#636363', '#f03b20')
par(mfrow=c(1,3), bty='n')
f1 <- poisReg(100, 100, col = lc[1]); text(200, 50, 'few large', cex = 1.3 )
f2 <- poisReg(10000, 1, col = lc[2]); text( 6, .3, 'many small', cex = 1.3)
f3 <- poisReg(100, 1, col = lc[3]) ; text( 4, .8, 'few small', cex = 1.3)
Predictive means +/- 1 standard deviation for fit 1, 2, and 3. Counts are jittered to reduce overlap. Colors indicate models in figure 5.
Whereas prediction changes with \(E\) but not \(n\), confidence in parameter estimates depends on \(nE\). The figure shows agreement in parameter estimates when sample size is reduced by a factor of 10 provided it is offset by a corresponding increase in effort per observation (blue versus grey). By contrast, reducing sample size without increasing observation effort results in a large increase in uncertainty (red).
Parameter estimates for values of (n, E): blue = (100, 100), grey = (10000, 1), and red = (100, 1).
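To check this pattern without relying on the internals of poisReg, the three (n, E) designs can be refit directly with glm. Here is a minimal sketch (my own code, with an arbitrary seed and the same slope used above):
set.seed(2)
slope <- .5
fitNE <- function(n, E){                                 # fit one (n, E) design
  x <- rnorm( n )
  y <- rpois( n, E*exp( slope*x ) )
  summary( glm( y ~ x, family = poisson,
                offset = rep( log(E), n ) ) )$coefficients['x', 1:2]
}
rbind( fewLarge  = fitNE(100, 100),
       manySmall = fitNE(10000, 1),
       fewSmall  = fitNE(100, 1) )                       # slope estimate and SE
The slope SEs for the first two designs are similar (both have \(nE = 10^4\)), while the third is much larger.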
If there is enough information to provide good parameter estimates, and predictions come from parameters, then shouldn’t I be able to recover useful predictions from fit2 (“many small”) when aggregating to the “few-large” scale?
Below is code that aggregates predictions (and their standard errors). In this code I bin \(n = 1000\) observations into 100 aggregated observations (the sample size for fit1). They are binned to have similar covariate values x2. I find the mean for the aggregated versions of \(x\), \(y\), and \(\hat{y}\) and the prediction error for aggregated \(\hat{y}\). The predictions now look like those for fit1.
x <- f2$out[,'x2'] # covariate values
y <- f2$out[,'y'] # counts
b <- quantile(x, seq(0, 1, by = .01) ) # intervals of x2
i <- findInterval(x, b, all.inside=T) # bin index for each x
X <- tapply( x, i, mean) # mean x2 for each bin
Y <- tapply( y, i, sum) # sum counts y by bin
ymu <- tapply( f2$out[,'yhat'], i, sum) # sum predictions by x2
yse <- sqrt( tapply( f2$out[,'yse']^2, i, sum) ) # if iid: var(sum) = sum(var)
z <- jitter(Y)
par(bty='n')
plot(z, ymu, xlab = 'counts', ylab = 'predicted', pch = 14, cex=.3)
abline(0, 1, lty=2)
abline( h = mean(Y), lty = 2 )
segments( z, ymu - yse, z, ymu + yse )
title( 'Aggregate scale (n = 100, E = 100)' )
Aggregated predictions from fit2 are as good as those for fit1.
Apparently the information is still there, but at the scale of individual small plots it is too granular to see; prediction works at a coarser scale than that of the observed plot area.
The foregoing application to discrete counts can be generalized to continuous responses. Consider the common practice of compositing soil, air, or water samples to reduce the noise in concentrations, densities, or mineralization rates and to avoid the costs and processing time of multiple samples. If samples are composited, then how many? How does this affect parameter estimates and prediction? In the appendix I show that parameter uncertainty declines in direct proportion to sample effort, just as we saw for discrete counts.
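A small simulation (my own sketch, not from the vignette code) makes the compositing trade-off concrete: averaging \(m\) subsamples per site raises observation effort \(E = m/s^2\) and lowers parameter uncertainty, even though the number of composite observations \(n\) stays the same:
set.seed(3)
mu <- 10; tau <- .2; s <- 2                     # between-site SD tau, within-site SD s
nSite <- 50                                     # number of composite observations
for( m in c(1, 4, 16) ){                        # subsamples composited per site
  siteMean <- rnorm( nSite, mu, tau )
  y  <- sapply( siteMean, function(z) mean( rnorm( m, z, s ) ) )  # composite values
  se <- sqrt( ( tau^2 + s^2/m )/nSite )         # SE of mu.hat declines with effort m/s^2
  cat( 'm =', m, ': mean =', round( mean( y ), 2 ), ', SE =', round( se, 3 ), '\n' )
}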
Could the concept of sample effort help us optimize inventory design?
The findings that i) parameter estimates do not suffer from aggregation and that ii) aggregate predictions can be useful, even when disaggregated predictions are not, show that aggregation can be a sensible strategy.
To reduce > 200,000 plots to a manageable size for this unit I generated an aggregated data set that uses a k-means clustering of predictors (climate, soils) together with location. This is the same technique used to aggregate plots by Zhu et al. Here are clustered data, with locations and covariates in xdataCluster.rdata and counts of trees > 15 cm in diameter by species in densPlus15cluster.rdata.
load( 'xdataCluster.rdata' )
load( 'densPlus15cluster.rdata' )
The matrix densPlus15cluster holds plots (rows) by species (columns). Here is a map of total Quercus basal area on clustered plots:
w <- grep( 'quercus', colnames(densPlus15cluster) ) # quercus in column names
quercus <- rowSums( densPlus15cluster[,w], na.rm=T) # total ba per ha
y <- data.frame( cbind(xdataCluster[,c('lon','lat')], quercus) ) # include location
ramp <- c('#edf8fb','#b2e2e2','#66c2a4','#2ca25f','#006d2c') # color ramp low to high ba
ncols <- 20
grFun <- colorRampPalette( ramp ) # function generates ramp
cseq <- seq(0, max(quercus), length = ncols ) # partition ba scale
i <- findInterval( quercus, cseq, all.inside = T) # assign ba to partition
g <- grFun(ncols) # ramp of length ncols
col <- g[ i ] # assign colors
This block of code generates a map, with dark colors indicating high basal area:
library( ggplot2 )
library( raster )
usa <- map_data("usa")
ggplot() +
geom_polygon(data = usa, aes(x=long, y = lat, group = group),
fill = '#f7fcfd', color = 'tan') +
geom_point(data = y, aes(x = lon, y = lat), color = col, size = .2) +
coord_fixed(xlim = c(-124, -68), ylim = c(26, 50), ratio = 1.2)
Quercus species in aggregated FIA plots show a tendency for the genus to dominate at mid latitudes over a wide range of moisture conditions.
The call to ggplot draws the boundaries (geom_polygon) and the points (geom_point). Note that this aggregated version of the data, with 13880 plots, gives good geographic coverage.
Exercise 1. Within each of the routes we examined for BBS data there are actually 50 stops. The routes are approximately 40 km long. Counts last 3 minutes per stop.
Provide thoughts on the following:
Exercise 2. My budget can no longer support the costs of monitoring a full network. I can reduce costs by limiting the number of plots, the sizes of plots, or both. For this study it is important that the network can provide estimates of the effects of predictors on tree abundance. Using the FIA aggregated data as a model, determine the impact of sample size \(n\) versus effort \(E\) (plot area) on estimates of effects from standAge, meanTemp, and annualPrec in xdataCluster for a species of choice.
Fit the model with glm using family = poisson; see the code block below.
To reduce sample size \(n\), use a fraction of the plots. I could select them randomly using sample( n, n*fraction ), where n is the number of plots. If you do this, remember that \(x\) and \(y\) must maintain their alignment.
To reduce effort \(E\), thin the counts with the function rbinom( n, y, fraction ), where y is the count for each plot (see below).
y <- round( densPlus15cluster[,'pinusTaeda'] )
xdata <- data.frame( xdataCluster )
full <- glm( y ~ standAge + meanTemp + moist + annualDef, data = xdata, family = poisson )
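As a starting point (one sketch among many ways to do this), plots can be subsampled to reduce \(n\) and counts thinned with rbinom to mimic smaller plots (smaller \(E\)), refitting and comparing standard errors each time; the predictors simply reuse the full model above:
fraction <- .1                                     # keep 10% of plots, or 10% of each count
n        <- length( y )

# reduce sample size n: keep a random fraction of plots (x and y stay aligned)
wk     <- sample( n, round( n*fraction ) )
smalln <- glm( y[wk] ~ standAge + meanTemp + moist + annualDef,
               data = xdata[wk, ], family = poisson )

# reduce effort E: thin each count, keeping each tree with probability = fraction
ythin  <- rbinom( n, y, fraction )
smallE <- glm( ythin ~ standAge + meanTemp + moist + annualDef,
               data = xdata, family = poisson )

# compare standard errors for effects across the three fits
round( cbind( full   = summary( full )$coefficients[ , 2],
              smalln = summary( smalln )$coefficients[ , 2],
              smallE = summary( smallE )$coefficients[ , 2] ), 4 )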
When it comes to parameter uncertainty, observation effort can make up for sample size; \(S = nE\) is the important consideration. The aggregation that we used for FIA would not be sensible for many studies, particularly for those with smaller sample sizes. By aggregating we are essentially exploiting redundancy in the 10s of thousands of plots, many of which have near identical covariate values. Whether to design for big \(n\) and small \(E\) (like FIA), or vice versa, is an issue that emerged with big data.
The big data example causes us to rethink basic ideas like sample size and the role of prediction. Parameter estimates may produce predictions that are deemed useful or not, depending on design considerations that are not immediately obvious. Despite the useless predictions for FIA recruitment plots, the fitted model contains as much information on effects as a fit that generates good predictions at a different scale.
To expose the role of sample effort as transparently as possible I have overlooked additional design considerations. For example, I have ignored stratification and the way in which increasing spatial dependence could potentially reduce the effective sample size when more observations are crammed into the same study area. A design aimed at coverage of covariate space might nonetheless benefit from careful thought about the importance of estimating relationships (the coefficients) versus predicting responses.
The size of data sets in environmental science is increasing, holding out promise for new insights while it raises new challenges. Increasingly, models are being applied to structured networks (e.g., FIA, NEON, BBS) and unstructured networks (eBird, iNaturalist). Remote sensing products continue to expand. Additional environmental data include gridded climate, soils, and nutrient data. Tools we consider this semester will help us to learn from large data sets while attempting to avoid pitfalls.
The conjugate Poisson-gamma likelihood-prior pair gives the posterior distribution
\[ [\lambda | \mathbf{y} ] \propto \prod_{i=1}^n Poi(y_i| \lambda) \, gamma(\lambda | a, b) \] The factors involving \(\lambda\) lead to the gamma posterior density
\[ \begin{align} &\propto e^{-n\lambda} \lambda^{n \bar{y}} \times e^{-b \lambda}\lambda^{a-1} \\ &= e^{-\lambda(n + b)} \lambda^{n \bar{y} + a - 1} \\ &= gamma(\lambda | n \bar{y} + a, n + b) \end{align} \] The posterior has variance
\[ V(\hat{\lambda}) = \frac{n \bar{y} + a}{(n + b)^2} \] When the data dominate the prior, the parameter variance we would get from the data is \(\bar{y}/n\), with standard error \(\sqrt{\bar{y}/n}\). Recall that for the Poisson, the sample mean \(\bar{y}\) is an estimate of both the Poisson mean and the Poisson variance, so the standard error can be viewed as \(se(\hat{\lambda}) = \sqrt{V(y)/n}\).
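As a quick numerical check (my own snippet, with an arbitrary weak prior), the gamma posterior variance is essentially \(\bar{y}/n\) when the data dominate:
set.seed(4)
n <- 200; lambda <- 12                         # arbitrary sample size and intensity
a <- b <- .1                                   # weak gamma prior
y <- rpois( n, lambda )
postVar <- ( sum(y) + a )/( n + b )^2          # variance of gamma( n*ybar + a, n + b )
c( posterior = postVar, dataOnly = mean(y)/n ) # nearly identical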
A simple example illustrates that parameter uncertainty is determined by total sample effort, rather than by sample size or observation effort separately. The Poisson likelihood for discrete counts,
\[ [\mathbf{y} | \lambda ] \propto \prod_{i=1}^n (E \lambda)^{y_i} e^{-E \lambda} \] has MLE \(\hat{\lambda} = \bar{y}/E\), obtained by simple optimization, and parameter variance (from the Fisher information) \(var(\hat{\lambda}) = \frac{\hat{\lambda}}{nE}\). If the total area sampled is \(A = nE\), then \(\hat{\lambda} = n \bar{y}/A\) and the parameter variance is \(var(\hat{\lambda}) = \frac{\hat{\lambda}}{A}\); it depends not on \(n\) or \(E\) individually, but on their product.
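For completeness, here is a sketch of the steps behind these expressions (my own working, using the observed information evaluated at the MLE):
\[ \begin{aligned} \frac{d}{d\lambda}\log[\mathbf{y}|\lambda] &= \frac{n\bar{y}}{\lambda} - nE = 0 \;\;\Rightarrow\;\; \hat{\lambda} = \frac{\bar{y}}{E} \\ -\frac{d^2}{d\lambda^2}\log[\mathbf{y}|\lambda] &= \frac{n\bar{y}}{\lambda^2} \;\;\Rightarrow\;\; var(\hat{\lambda}) \approx \frac{\hat{\lambda}^2}{n\bar{y}} = \frac{\bar{y}}{nE^2} = \frac{\hat{\lambda}}{nE} \end{aligned} \]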
This result is relevant for continuous responses, because many site-level variables come from aggregating observations at a site. The Gaussian likelihood for a continuous response variable might have an aggregate variance \(s^2\) obtained from \(m_i\) samples at each location (observation) \(i\). The mean response recorded at location \(i\) is taken to be the observed value \(y_i\). There is an error variance that is typically ignored, \(s^2/m_i\), i.e., the local variance divided by the number of observations used to obtain that mean value. We can think of the reciprocal of this value as the effort
\[ E_i = \frac{m_i}{s^2} \] i.e., the number of observations scaled by the noise at the local (observation) scale, sometimes called a “nugget” variance. There is also a variance between observation locations, call it \(\tau^2\). Focusing on the mean parameter I want to estimate in a Gaussian model, \(\mu\), the log likelihood can be written this way,
\[ \log [\mathbf{y} | \mu ] \propto -\frac{1}{2} \sum_{i=1}^n \frac{(y_i - \mu)^2}{\tau^2 + E_i^{-1}} \] The Fisher Information is now
\[ I = -\frac{d^2}{d\mu^2} \log [\mathbf{y} | \mu ] = \sum_{i=1}^n \frac{1}{\tau^2 + E_i^{-1}}, \qquad var(\hat{\mu}) = I^{-1} = \left[ \sum_{i=1}^n \frac{1}{\tau^2 + E_i^{-1}} \right]^{-1} \] Now note that if there is no error in the \(y_i\) (either an infinite number of samples \(m_i\) or \(s_i^2 = 0\)), then we recover the standard result for the variance, \(var(\hat{\mu}) = \tau^2/n\). Alternatively, if there is just one measurement per observation, \(m_i = 1\), then the error variance is the sum of between-site and local variance,
\[ var(\hat{\mu}) = \frac{\tau^2 + s^2}{n} \]
Finally, if local variance dominates (\(s >> \tau\)), the error variance declines with sample effort. For transparency, assume effort is the same for all observations, \(E_i = E\),
\[ var(\hat{\mu}) = \frac{1}{nE} \] Just as in the case for discrete counts, we have sample effort in the denominator indicating that, all else being equal, total sample effort determines parameter uncertainty.
Tang, B., R. Kamakura, D. Barnett and J. S. Clark. 2023. Learning from monitoring networks: Few-large versus many-small (FLvMS) plots and multi-scale analysis. Frontiers in Ecology and Evolution, 11, DOI = 10.3389/fevo.2023.1114569.
Zhu, K. et al. 2013. Dual impacts of climate change: forest migration and turnover through life history. Global Change Biology 20:251