Forest inventory data can offer huge numbers of plots that pose challenges not encountered with traditional data sets. In this first of three units on forest inventories, we show that models fitted to large networks of many small plots make noisy predictions, yet they can hold as much information as smaller networks of fewer, larger plots.

Resources

Permanent forest plots

Longitudinal studies of trees have been implemented all over the globe, by ecologists and managers alike, to monitor changes in forest resources and biodiversity. The term longitudinal refers to studies where subjects are followed over time. Trees lend themselves to permanent plot studies because individuals can be identified by tags, and they are easy to map and (at least for diameter) to measure.

Ecological studies range from individual plots to networks like NEON and ForestGEO. Some of these longitudinal studies are experimental, including Duke University’s Free-Air CO2 Enrichment (FACE) experiment. The USDA Forest Inventory and Analysis (FIA) program is one of many national inventories that can address biogeographic questions.

More than 200,000 FIA plots vary geographically in size and sample years. Shown here are young stands in the South and East (light shading at left) and dry site classes in mountainous sites and southern Plains (brown at right).

Background on the FIA program describes the spatial design and sampling. Key features to note at the outset include the following:

The last bullet is important because estimates of change require replication in time, i.e., at least two censuses, preferably more. For these reasons, any given analysis may use a subset of the full inventory.

This vignette explores long-term plot data to illustrate three main concepts. In this document, I look at the signal and noise for spatial data in the context of model fitting and prediction. In unit B, I use the FACE experiment to show how hierarchical Bayes can change the interpretation of variables responsible for tree growth. In unit C, I take up hierarchical spatial models using FIA.

Signal and noise

Several years ago some of us were developing models to estimate the effects of climate and habitat on tree recruitment (rates of seedling establishment in small plots) using the FIA. We typically use prediction as one of several model-testing techniques, and predictions for this data set were bad (see figure below left). We assumed that poor predictions could be blamed on the small sizes of seedling plots in FIA data. How could a small number of predictors, say temperature, precipitation, and soils, possibly be enough to predict the number of seedlings that would be found on such a tiny plot? We decided to aggregate seedling plots from similar climates and habitats (we were not the first to think about aggregating FIA plots; see Iverson et al., 1998).

Predicted seedling recruitment at fine (left) and coarse (right) spatial scales obtained by aggregating small plots on the basis of similar conditions. (from Zhu et al., 2015)


The effect of aggregation on predictions was dramatic. The left panel of this figure shows the prediction we expect for noise. Regardless of the observed number of seedlings (horizontal axis), the model everywhere predicts the mean number of seedlings for the entire sample. Said another way, the model seems to learn nothing from the observation at a given plot. If an observation was having any impact, wouldn’t we expect that the model prediction would show some evidence of that? By contrast, the model fit at right would be judged as useful, because it is capable of predicting the data; clearly, there is signal.

Recall that the same data are used in both of these examples. If there is any difference in information between them, it should degrade the fit for aggregated plots, i.e., the blurring of covariate information that occurs when aggregating plots with similar (but not identical) climate and soils. Of course, if there is a huge number of plots (as in this example), then this information loss by aggregation should be small. So if we want to use prediction for model-checking, we want to understand why simple aggregation finds signal where raw data do not.

Big data expose new questions

The term big data references a host of issues that arise when volume, complexity, and/or the pace at which data arrive overwhelm traditional methods for management and analysis.

Just increasing the amount of data can expose relationships that would never be encountered with small data sets. A traditional data set in environmental science might come from a designed experiment or field surveys, structured or not. Sample size \(n\) would range up to hundreds, but rarely thousands. Because the number of observations is limited, models would tend to fit a small number of predictors \(p << n\). The question posed by the FIA example would not emerge, because one would not consider aggregating plots when the sample size is, say, \(n = 100\). We would not do this, because only with massive sample size could we expect that many plots will have essentially the same predictor values. The aggregation paradox emerged when sample sizes got big. I’ll return to this later, after a closer look at the role played by uncertainty.

Prediction for model checking

Prediction is one of several model-checking tools, one that takes place in data space. There are other ways to check and compare models in parameter space, which has the disadvantage that models of different structures often cannot readily be compared in this way. A capacity to predict the data used to fit the model might be compared across diverse models.

A few words on parameters and predictions are needed before we proceed further. A fitted model can provide estimates of relationships (parameters) and predictions out-of-sample for times and places other than when/where the data were collected. Out-of-sample and in-sample prediction (predicting the data used to fit the model) are used for model checking. If I want to learn about relationships, then a model fit is useful if the parameter uncertainty is not too big. If I want to predict data, then a model fit is useful if the prediction uncertainty is not too big. Sometimes parameter uncertainty is acceptably low where prediction uncertainty is not. Whether or not a model fit is useful may depend on whether I am after parameters or predictions.

Large prediction error is taken as a sign that a model is not useful. Poor prediction might be judged by a large root mean square prediction error (rmspe) and/or an \(r^2\) near zero. Recall from introductory statistics that \(r^2\) compares the scatter of predictions around two lines: the 1:1 line of agreement with the observed values (dashed in the figure) and the sample mean (horizontal).
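As a small sketch of these two summaries, using made-up vectors obs and pred (both hypothetical):

obs   <- c(3, 7, 2, 9, 5, 6)                     # hypothetical observed counts
pred  <- c(4, 6, 3, 8, 5, 7)                     # hypothetical model predictions
rmspe <- sqrt( mean( (obs - pred)^2 ) )          # root mean square prediction error
r2    <- 1 - sum( (obs - pred)^2 ) /             # scatter around the 1:1 line, relative
             sum( (obs - mean(obs))^2 )          #   to scatter around the sample mean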

The two fits being compared here include the same model and the same data, yet they contrast in prediction performance.

Parameter estimates

Whether or not I am interested in parameter estimates, I need to understand how they affect prediction uncertainty.

The predictions in this example come from a model that is used to estimate the relationships between predictors (climate and soils) and tree recruitment. The model contains parameters or coefficients that must be estimated. These estimates have uncertainty that is determined by how well the model “explains” the data and by the sample size, or number of observations, \(n\). The parameter uncertainty translates into prediction uncertainty, which is represented by the vertical lines for each observation in the figure above. To see the direct role played by sample size, recall that the traditional standard error for the estimate of a sample mean has the factor \(\sqrt{n}\) in the denominator,

\[SE( \hat{\mu} ) = \frac{\sigma}{\sqrt{n}} \tag{1}\] The derivation of this result for the Poisson intensity from a Bayesian perspective is given in the Appendix.

The fact that I am dividing by \(\sqrt{n}\) means that the standard errors for parameter estimates will be tiny when the sample size is huge (as it is here). In other words, all parameters are ‘significant’. Poor prediction in this example cannot be blamed on large parameter error or small \(n\).
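A quick sketch of this \(\sqrt{n}\) effect, fitting an intercept-only Poisson model (introduced in the simulation below) at increasing sample sizes; the intensity value of 16 is arbitrary:

for( n in c(100, 1000, 10000) ){                 # increasing sample size
  y  <- rpois( n, 16 )                           # counts with arbitrary intensity
  se <- summary( glm( y ~ 1, "poisson" ) )$coefficients[ , 'Std. Error']
  cat( 'n =', n, ' SE =', signif(se, 2), '\n' )  # SE shrinks like 1/sqrt(n)
}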

Scale and effort

Aggregation did not add any information that could explain the differences in seedling plots, but it did change the scale of the data and, thus, the model. In the current example, the spatial scale is set by plot area. The response variable is a count (number of recruits) per plot, and the plot area may have been assigned simply by convenience. For example, seedling plots in many ecological studies are 1 m\(^2\), perhaps because plots of this size can be accessed without disturbance by trampling (reach in, rather than step in). This scale effect is implicit, because it disappears when I multiply the factors on the right,

\[\frac{\mbox{count}}{ \mbox{plot} } = \frac{\mbox{count}}{ \mbox{area} }\times \frac{\mbox{area}}{\mbox{plot}}\] I will refer to the second factor on the right, in this case area per plot, as the observation effort devoted to each count. More generally, it can be the time spent counting birds (e.g., minutes), the number of hooks in longline fisheries, or the number of days that camera traps are deployed. I’ll refer to this type of effort per observation with \(E\), and I will have to qualify it with units (e.g., time, area).

The first factor on the right is the quantity we want to model. In longline fisheries, it would be catch per unit effort (CPUE). In our example it is recruits per ha.
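A toy version of this bookkeeping, with made-up numbers:

countPerPlot <- 12                  # hypothetical seedling count on one 1-m2 plot
E            <- 1e-4                # observation effort: 1 m2 expressed in ha
countPerArea <- countPerPlot / E    # recruits per ha, the quantity to model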

The efforts afforded the individual observations sum to the sample effort. The number of observations is the number of times I recorded what I saw; this is the previously-defined sample size \(n\). The total work I do on the sample is \(nE\). In other words,

\[\mbox{observation effort} = E\] and

\[\mbox{sample effort} = nE\] Of course, effort can differ for each observation. If so, I include an index for observation \(i\), which has observation effort \(E_i\) and sample effort \(\sum_{i=1}^n E_i\). For now, I will assume consistent effort (no subscript). For this example, sample effort is the total area sampled in the study.

If the resources that I have available to do a study limit sample effort, I might think about how much I can learn for a given \(n\) and \(E\). How I organize sampling is called design. [FIA sampling is a stratified random design, where stratification is by area and plots are randomized within areas.] I mention this here only to point out that design considerations can include how to allocate total effort \(nE\) into many observations, each with low effort, or vice versa.

To make this simple, I want to estimate the mean intensity \(\lambda\) for Poisson counts \(y_i\), where \(i = 1, \dots, n\) is a label for an observation. I use this notation,

\[y_i \sim Poi( \lambda ) \tag{2}\]

The Poisson distribution is used for count data. It has a single intensity parameter \(\lambda > 0\) that describes the number of independent, identically distributed events expected in a set interval of, say, space or time. The intensity can be modeled with predictors. Linear models for \(\lambda\) typically use a log link function that ensures positive values, e.g., \(\log \lambda = \mathbf{x}'\boldsymbol{\beta}\).

This model estimates an intensity parameter \(\lambda\) for observed counts per plot. Counts per plot is a dimensionless quantity. However, plots can vary in size, and large plots are expected to yield large counts. The next section brings effort \(E\) into the model.

Simulated data

Simulated data are used to check whether the parameters used to generate them can be recovered by the fitted model. They can also show how the data-generating mechanism affects uncertainty in estimates.

The answer to the aggregation paradox can be shown with a simple simulation. Here is an example using the R function glm to fit a simulated data set. To simulate data I assume a sample size n and a parameter lambda, and I draw random ‘observations’ y from the Poisson distribution. This function estimates the Poisson parameter on the log scale, so I define it as \(\mu = \log{\lambda}\).

n      <- 100                         # sample size
mu     <- 2.8                         # an arbitrary log lambda
lambda <- exp(mu)
y1     <- rpois( n, lambda )         # random observations
fit1   <- glm( y1 ~ 1, "poisson" )   # estimate parameter
est1   <- summary(fit1)$coefficients 

The estimate in est1 is 2.83 +/- 0.0242. Check help( glm ) to find arguments and syntax for this function.

Note that the model estimates mu close to the value used to simulate the data, 2.8 (table above). The standard error (SE) represents parameter uncertainty. Below is a histogram of the simulated values and the lambda estimate \(\pm 1\) SE.

hist(y1, nclass = 20)
abline(v = exp( est1[1]), lwd = 2 )                  # mean estimate
abline(v = exp( est1[1] + c(-1,1)*est1[2] ), lty=2)  # +/- one SE

Histogram of counts with mean estimate (solid) +/- 1 SE.

Changing units

If I want parameter estimates on an area basis I could include observation effort,

\[y_i \sim Poi( E \lambda )\]

If I define \(E\) in terms of area, then I am really estimating density (e.g., plants per m\(^2\)).

I can repeat the analysis using an area basis, i.e., a different value for \(E\),

E        <- 10
fit2     <- glm( y1 ~ 1, "poisson", offset = rep(log(E), n) ) # note offset E for log link
est2     <- summary(fit2)$coefficients
fitTable <- rbind(est1, est2)

The glm function with a Poisson likelihood defaults to a log link. For this reason, the offset is log(E).

Comparison of two fitted models in fitTable

        Estimate  Std. Error
fit 1      2.830      0.0242
fit 2      0.531      0.0242

Note that the mean estimates differ in this table, but the standard errors have not changed. They are identical on the log (proportionate) basis, because all I have done is change the units from per-plot to per-one-tenth-of-a-plot. To see that this is nothing more than a change of units I can overlay them on the same plot using this code:

abline(v = E*exp( est2[1]), lwd = 2, col=3 )                  # mean estimate
abline(v = E*exp( est2[1] + c(-1,1)*est2[2] ), lty=2, col=3)  # +/- one SE
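Equivalently, because the offset subtracts \(\log E\) on the log scale, the two intercepts in the table differ by exactly \(\log E\):

est1[1] - log(E)    # recovers the fit2 intercept: a change of units, nothing more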

So by introducing \(E\), I changed the units without actually changing the observation effort or the parameter uncertainty. To increase the effort, I need to increase the area of sampled plots.

Changing effort

Variation in count data depends on scale (e.g., plot area), which represents observation effort. Parameter estimates improve with observation effort.

I will repeat the analysis of simulated data, but this time with increased effort per observation. This means that I expect to observe \(E\) times as many recruits on a plot,

y3   <- rpois( n, E*lambda )                              # sampled with new effort
fit3 <- glm( y3 ~ 1, "poisson", offset = rep(log(E), n) ) 
est3 <- summary(fit3)$coefficients
Comparison of three fitted models

          n    E  offset  Estimate  Std. Error
fit 1   100    1       1     2.830     0.02420
fit 2   100    1      10     0.531     0.02420
fit 3   100   10      10     2.790     0.00786

By contrast with fit2, here I have actually increased the effort, with the result that the SE is reduced. The mean estimate for fit3 agrees with fit1, because the offset passed to glm is the same as the effort used to simulate the data.

The reduction in parameter uncertainty resulted not from an increase in sample size \(n\), which was the same for all of these model fits. It comes instead from an increase in observation effort per sample. This difference in effort can also be viewed in terms of the total number of recruits that were counted, sum(y1) = 1701 and sum(y3) = 16201.

Although the first model has the dimensions of ‘per plot’ and the second has dimensions ‘per area’, they both have the same scale, determined by the area of the plot observed. In other words, if plots are 1 m\(^2\) in area and I choose to report parameters on the scale of hectares (easily done by setting \(E = 1/10000\) ha per m\(^2\)), I have not changed the fact that parameter estimates are determined by variation that occurs at the m\(^2\) scale, which explains why they have the same proportionate SE.

The third model had increased effort, because I used \(E\) to simulate more counts per observation. Without increasing sample size \(n\), I increased sample effort \(nE\) by increasing \(E\). The aggregation paradox arises because signal and noise dominate at different scales, and the implications differ for parameters and predictions.

Many small versus few large plots?

Total sample effort combines \(n\) and \(E\).

With this background, I can reconsider the initial example as one in which the sample effort is held constant, while observation effort is redistributed over a different sample size. When we aggregated seedling data we changed the observation effort \(E\) without changing the sample effort \(nE\). Sample effort remained constant because we reduced \(n\) by the same factor that we increased \(E\). The aggregation simply gathered the same counts into a smaller number of observations. In this section I show that this redistribution of effort had no effect on parameter estimates, despite its devastating effect on prediction. These results have implications for the role of prediction, which I take up at the end of this section.

The model can be expanded to include a predictor variable, written as

\[ \begin{aligned} \log \lambda_i &= \mathbf{x}_i' \boldsymbol{\beta} \\ &= \beta_0 + \beta_1 x_i \end{aligned} \] where \(\mathbf{x}'_i = (1, x_i)\) is a length-2 vector holding the value of the covariate in the second position and \(\boldsymbol{\beta}' = (\beta_0, \beta_1)\) is the corresponding coefficient vector holding the intercept and slope, respectively. Simulated data must now include these two vectors.

Here I write a function to simulate data, fit a glm, and generate plots. I do this so I can easily change \(n\) and \(E\) to explore their effects on parameter estimates and prediction.

poisReg <- function(n = 100, E = 1, nsim = 1000, slope = .5, PLOT = T, col = 'black'){
  
  # simulate data for intercept and slope with a
  #   Poisson likelihood, log link, and effort E
  beta <- matrix( c(.1, slope), ncol = 1)
  x    <- cbind(1, rnorm(n) )
  mu   <- exp(x%*%beta)
  y    <- rpois( n, mu*E )
  
  # estimate beta, predict counts, note offset E for log link
  fit  <- glm( y ~ x[,2], "poisson", offset = rep(log(E), n) )
  pred <- predict.glm(fit, type = 'response', se.fit = T)      # contains error in lambda
  
  # combine parameter error with Poisson observations
  lambda <- rnorm(n*nsim, pred$fit, pred$se.fit)   # draw lambda with parameter error
  lambda[ lambda < .001 ] <- .001
  py <- rpois( n*nsim, lambda )
  py <- matrix(py, n, nsim)
  pred$se <- apply( py, 1, sd )                                # lambda and sampling error
  
  if(PLOT){
    rmspe <- sqrt( mean( (y - pred$fit)^2 ) )      # prediction error on the count scale
    z <- jitter(y)
    plot(z, pred$fit, xlab = 'counts', ylab = 'predicted', 
         pch = 14, cex=.3, col = col)
    abline(0, 1, lty=2)
    abline( h = mean(y), lty = 2 )
    segments( z, pred$fit - pred$se, 
              z, pred$fit + pred$se, col = col )   # combined parameter + sampling error
    title( paste('rmspe =', round(rmspe, 2)) )
  }
  out <- cbind(x[,2], y, pred$fit, pred$se)
  colnames(out) <- c('x2', 'y', 'yhat','yse')
  invisible( list( fit = fit, out = out) )
}

Here is an example using this code assuming a slope parameter \(\beta_1 = 0.5\),

par( bty='n' )
slope <- .5
fit1 <- poisReg(n = 100, E = 10, slope = slope)

Simulated data: predictions \(\pm 1\) SE versus observed counts.

The figure shows observed counts with predictive intervals (\(\pm 1\) predictive standard error) for observations. These intervals were obtained by taking the standard errors for coefficients \(\boldsymbol{\beta}\) and combining them with the Poisson sampling error. The 1:1 line is the line of agreement. The horizontal line is set at the mean count, i.e., the expected prediction for pure noise. Clearly, there is signal in this simulated data set.

So the question becomes, can I learn more about relationships (\(\boldsymbol{\beta}\)) and make better predictions (\(\{ \hat{y_i} \}\)) if I redistribute my total effort?

I can redistribute sample effort to more plots, each with lower sampling effort, by increasing \(n\) and decreasing \(E\) both by a factor of 10:

fit2 <- poisReg(n = 1000, E = 1, slope = slope, PLOT = F)

The table shows comparable mean estimates and SEs for fit1 and fit2. Recall that the value used to simulate data was slope = .5.

Parameter estimates for equal sample effort.

           n   E  estimate
fit 1    100  10  0.481 +/- 0.0284
fit 2   1000   1  0.523 +/- 0.0284

Prediction uncertainty changes with \(E\) but not \(n\) (rmspe shown in the figure). For comparison, consider fit3, which has the same sample size used for fit1 but with the observation effort (per sample) used for fit2. Prediction comparisons for the three models tell us that effort per sample \(E\) is the key to good predictions in this simple example, not \(n\) and not \(nE\).

lc <- c('#1c9099', '#636363', '#f03b20')

par(mfrow=c(1,3), bty='n')
f1 <- poisReg(100, 100, col = lc[1]) 
f2 <- poisReg(10000, 1, col = lc[2]) 
f3 <- poisReg(100, 1, col = lc[3]) 

Predictive means +/- 1 standard deviation for fit 1, 2, and 3. Counts are jittered to reduce overlap. Colors indicate models in figure 5.

Whereas prediction changes with \(E\) but not \(n\), confidence in parameter estimates depends on \(nE\). The figure shows agreement in parameter estimates when sample size is reduced by a factor of 100, provided it is offset by a corresponding increase in effort per observation (blue versus grey). By contrast, reducing sample size without increasing observation effort results in a large increase in uncertainty (red).

Parameter estimates for values of (n, E): blue = (100, 100), grey = (10000, 1), and red = (100, 1).
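The numbers behind this comparison can be pulled from the fits returned above; a minimal sketch of the slope estimates and their standard errors:

rbind( 'f1 (100, 100)' = summary( f1$fit )$coefficients[2, 1:2],
       'f2 (10000, 1)' = summary( f2$fit )$coefficients[2, 1:2],
       'f3 (100, 1)'   = summary( f3$fit )$coefficients[2, 1:2] )

The red case, with one-hundredth the sample effort of the other two, should show a much larger SE.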

Where did the information go?

If there is enough information to provide good parameter estimates, and predictions come from parameters, then shouldn’t I be able to recover useful predictions from fit2 at the aggregate scale?

Below is code that aggregates predictions (and their standard errors). In this code I bin the \(n = 10000\) observations from f2 into 100 aggregated observations (the sample size for fit1). I compute the mean of \(x\) and the sums of \(y\) and \(\hat{y}\) within each bin, along with the prediction error for the aggregated \(\hat{y}\). The predictions now look like those for fit1.

x <- f2$out[,'x2']                       # covariate values
y <- f2$out[,'y']                        # counts
n <- length(x)
b <- quantile(x, seq(0, 1, by = .01) )   # intervals to bin n = 1000 to 100
i <- findInterval(x, b, all.inside=T)    # bin index for each x
X <- tapply( x, i, mean)                 # mean x by bin
Y <- tapply( y, i, sum)                  # aggregate y by bin

# aggregate yhat to sum and SE for bin
ymu <- tapply( f2$out[,'yhat'], i, sum)
yse <- sqrt( tapply( f2$out[,'yse']^2, i, sum) ) # if iid: var(sum) = sum(var)

z <- jitter(Y)
par(bty='n')
plot(z, ymu, xlab = 'counts', ylab = 'predicted', pch = 14, cex=.3)
abline(0, 1, lty=2)
abline( h = mean(Y), lty = 2 )
segments( z, ymu - yse, z, ymu + yse )
title( 'Aggregate scale (n = 100, E = 100)' )

Aggregating the predictions from fit2 shows them to be as good as those for fit1.

Apparently, the information is still there; the raw plot-level predictions are simply too noisy to reveal it, even though prediction works at a scale coarser than the observed plot area.

Generalization to continuous responses

The foregoing application to discrete counts can be generalized to continuous responses. Consider the common practice of aggregating soil, air, or water samples to reduce the noise in concentrations, densities, or mineralization rates and to avoid the costs and processing time of multiple samples. If samples are aggregated, then how many? How does this affect parameter estimates and prediction? In the appendix I show that parameter uncertainty declines in direct proportion to sample effort, just as we saw for discrete counts.

A map of aggregated FIA

Could the concept of sample effort help us optimize inventory design?

The findings that i) parameter estimates do not suffer from aggregation and ii) aggregate predictions can be useful, even when disaggregated predictions are not, show that aggregated data can still support inference. To reduce > 200,000 plots to a manageable size for this unit I generated an aggregated data set that uses a k-means clustering of predictors (climate, soils) together with location. This is the same technique used to aggregate plots by Zhu et al. Here are the clustered data, with locations and covariates in xdataCluster.rdata and counts of trees > 15 cm in diameter by species in densPlus15cluster.rdata:

load( 'xdataCluster.rdata' )
load( 'densPlus15cluster.rdata' )

The matrix densPlus15cluster holds plots (rows) by species (columns). Here is a map of total Quercus basal area on clustered plots:

w       <- grep( 'quercus', colnames(densPlus15cluster) )     # quercus in column names
quercus <- rowSums( densPlus15cluster[,w], na.rm=T)           # total ba per ha
y       <- data.frame( cbind(xdataCluster[,c('lon','lat')], quercus) ) # include location

cramp <- c('#edf8fb','#b2e2e2','#66c2a4','#2ca25f','#006d2c') # color ramp low to high ba
ncols <- 20
grFun <- colorRampPalette( cramp )                            # function generates ramp
cseq  <- seq(0, max(quercus), length = ncols )                # partition ba scale
i     <- findInterval( quercus, cseq, all.inside = T)         # assign ba to partition
g     <- grFun(ncols)                                         # ramp of length ncols
col   <- g[ i ]                                               # assign colors

This block of code generates a map, with dark colors indicating high basal area:

# map data
library( ggplot2 )
library( raster )
usa    <- map_data("usa")

# map with country boundaries
ggplot() + 
  geom_polygon(data = usa, aes(x=long, y = lat, group = group), 
               fill = '#f7fcfd', color = 'tan') + 
  geom_point(data = y, aes(x = lon, y = lat), color = col, size = .2) +
  coord_fixed(xlim = c(-124, -68),  ylim = c(26, 50), ratio = 1.2)

Quercus species in aggregated FIA plots show a tendency for the genus to dominate at mid latitudes over a wide range of moisture conditions. The call to ggplot draws the boundaries (geom_polygon) and the points (geom_point). Note that this aggregated version of the data with 14129 plots gives good geographic coverage.

Exercise. My budget can no longer support the costs of monitoring a full network. I can reduce costs by limiting the number of plots, the sizes of plots, or both. For this study it is important that the network can provide estimates of the effects of predictors on tree abundance. Using the aggregated FIA data as a model, determine the impact of sample size \(n\) versus effort \(E\) (plot area) on estimates of the standAge, meanTemp, and annualPrec effects in xdataCluster for a species of your choice.

I might do this:
  1. Fit the model with glm using family = poisson, see code block below.
  2. For small \(n\), refit with a fraction of the plots. I could select them randomly using sample( n, n*fraction ), where n is the number of plots. If you do this, remember that \(x\) and \(y\) must maintain their alignment.
  3. For small \(E\), refit the model with a sample of trees from each plot, as would occur if plot areas were reduced. I can do this with the function rbinom( n, y, fraction ), where y is the count for each plot (see below).
y     <- round( densPlus15cluster[,'pinusTaeda'] )
xdata <- data.frame( xdataCluster )
full  <- glm( y ~ standAge + meanTemp + moist + annualDef, data = xdata, family = poisson )
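
Here is a sketch of steps 2 and 3, assuming the objects above are loaded and using an arbitrary fraction; the names fraction, smalln, ysmall, and smallE are placeholders:

fraction <- 0.25                                            # arbitrary reduction

# step 2: small n -- refit with a random subset of plots, keeping x and y aligned
i      <- sample( length(y), round( length(y)*fraction ) )
smalln <- glm( y[i] ~ standAge + meanTemp + moist + annualDef, 
               data = xdata[i, ], family = poisson )

# step 3: small E -- thin every plot count, as if plot areas were reduced
ysmall <- rbinom( length(y), y, fraction )
smallE <- glm( ysmall ~ standAge + meanTemp + moist + annualDef, 
               data = xdata, family = poisson, 
               offset = rep( log(fraction), length(ysmall) ) )

summary( full   )$coefficients                              # compare effects and SEs
summary( smalln )$coefficients
summary( smallE )$coefficients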

Big data and the role of prediction

When it comes to parameter uncertainty, sample effort can make up for sample size; \(nE\) is the important consideration. The aggregation that we used for FIA would not be sensible for many studies, particularly those with smaller sample sizes. By aggregating we are essentially exploiting redundancy in the tens of thousands of plots, many of which have near-identical covariate values. Whether to design for big \(n\) and small \(E\) (like FIA), or vice versa, is an issue that emerged with big data.

The big data example causes us to rethink basic ideas like sample size and the role of prediction. The same parameter estimates for blue and grey in Figure 5 may produce predictions that are deemed useful or not, depending on design considerations that are not immediately obvious. Despite the useless predictions for FIA recruitment plots, the fitted model contains as much information on effects as a fit that generates good predictions at a different scale.

To expose the role of sample effort as transparently as possible I have overlooked additional design considerations. For example, I have ignored stratification and the way in which increasing spatial dependence could potentially reduce the effective sample size when more observations are crammed into the same study area. A design aimed at coverage of covariate space might nonetheless benefit from careful thought about the importance of estimating relationships (the coefficients) versus predicting responses.

The size of data sets in environmental science is increasing, holding out promise for new insights while it raises new challenges. Increasingly, models are being applied to structured networks (e.g., FIA, NEON, BBS) and unstructured networks (eBird, iNaturalist). Remote sensing products continue to expand. Additional environmental data include gridded climate, soils, and nutrient data. Tools we consider this semester will help us to learn from large data sets while attempting to avoid pitfalls.

Appendix

Poisson intensity standard error

The conjugate Poisson-gamma likelihood-prior pair gives the posterior distribution

\[ [\lambda | \mathbf{y} ] \propto \left[ \prod_{i=1}^n Poi(y_i| \lambda) \right] gamma(\lambda | a, b) \] The factors involving \(\lambda\) lead to the gamma posterior density

\[ \begin{align} &\propto e^{-n\lambda} \lambda^{n \bar{y}} \times e^{-b \lambda}\lambda^{a-1} \\ &= e^{-\lambda(n + b)} \lambda^{n \bar{y} + a - 1} \\ &=gam(\lambda | n \bar{y} + a, n + b) \end{align} \] The posterior has variance

\[ V(\hat{\lambda}) = \frac{n \bar{y} + a}{(n + b)^2} \] When data dominate prior, the parameter variance we would get from the data is \(\bar{y}/n\), with standard error \(\sqrt{\bar{y}/n}\). Recall that for the Poisson, the sample mean \(\bar{y}\) is an estimate of both the Poisson mean and the Poisson variance, so the standard error can be viewed as \(se(\hat{\lambda}) = \sqrt{V(y)/n}\).
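Because glm reports the standard error on the log scale, this analytic result can be checked against the earlier intercept-only fit after a delta-method conversion, \(se(\log \hat{\lambda}) \approx se(\hat{\lambda})/\hat{\lambda} = 1/\sqrt{n \bar{y}}\). A quick sketch, assuming y1 and est1 from the simulation above are still in the workspace:

ybar <- mean( y1 )
sqrt( ybar/length(y1) )/ybar    # analytic SE of lambda-hat, moved to the log scale
est1[2]                         # standard error reported by glm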

Sample effort to parameter uncertainty

A simple example illustrates that parameter uncertainty can be directly equated with sample effort, rather than sample size or observation effort. The Poisson likelihood for discrete counts,

\[ [\mathbf{y} | \lambda ] \propto \prod_{i=1}^n (E \lambda)^{y_i} e^{-E \lambda} \] has MLE \(\hat{\lambda} = \bar{y}/E\), obtained by simple optimization, and parameter variance (from the Fisher information) \(var(\hat{\lambda}) = \frac{\lambda}{nE}\). If the total area sampled is \(A = nE\), then \(\hat{\lambda} = n \bar{y}/A\) and the parameter variance is \(var(\hat{\lambda}) = \frac{\lambda}{A}\); it depends not on \(n\) or \(E\) individually, but on their product.

This result is relevant for continuous responses, because many site-level variables come from aggregating observations at a site. The Gaussian likelihood for a continuous response variable might have an aggregate variance \(s^2\) obtained from \(m_i\) samples at each location (observation) \(i\). The mean response recorded at location \(i\) is taken to be the observed value \(y_i\). There is an error variance that is typically ignored, \(s^2/m_i\), i.e., the local variance divided by the number of observations used to obtain that mean value. We can think of the reciprocal of this value as the effort

\[ E_i = \frac{m_i}{s^2} \] i.e., the number of observations scaled by the noise at the local (observation) scale, sometimes called a “nugget” variance. There is also a variance between observation locations, call it \(\tau^2\). Focusing on the mean parameter I want to estimate in a Gaussian model, \(\mu\), the log likelihood can be written this way,

\[ \log [\mathbf{y} | \mu ] \propto -\frac{1}{2} \sum_{i=1}^n \frac{(y_i - \mu)^2}{\tau^2 + E_i^{-1}} \] The inverse of the Fisher information gives the parameter variance,

\[ var(\hat{\mu}) = \left[ -\frac{d^2}{d\mu^2} \log [\mathbf{y} | \mu ] \right]^{-1} = \left[ \sum_{i=1}^n \frac{1}{\tau^2 + E_i^{-1}} \right]^{-1} \] Now note that if there is no error in the \(y_i\) (either an infinite number of samples \(m_i\) or \(s_i^2 = 0\)), then we recover the standard result \(var(\hat{\mu}) = \tau^2/n\). Alternatively, if there is just one measurement per observation, \(m_i = 1\), then the error variance is the sum of between-site and local variance,

\[ var(\hat{\mu}) = \frac{\tau^2 + s^2}{n} \]

Finally, if local variance dominates (\(s >> \tau\)), the error variance declines with sample effort. For transparency, assume effort is the same for all observations, \(E_i = E\),

\[ var(\hat{\mu}) = \frac{1}{nE} \] Just as in the case for discrete counts, we have sample effort in the denominator indicating that, all else being equal, total sample effort determines parameter uncertainty.
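A small simulation is consistent with this result; it averages \(m\) noisy measurements at each of \(n\) locations and compares the simulated standard error of \(\hat{\mu}\) with the analytic values (all values below are arbitrary):

n   <- 200                                  # locations
m   <- 5                                    # measurements per location
s   <- 4                                    # local (nugget) SD, large relative to tau
tau <- 0.1                                  # SD among locations
E   <- m/s^2                                # observation effort per location

# each y_i is the mean of m measurements at a location with true mean mu = 10
muHat <- replicate( 2000, mean( rnorm( n, 10, tau ) + rnorm( n, 0, s/sqrt(m) ) ) )

sd( muHat )                                 # simulated SE of mu-hat
sqrt( (tau^2 + 1/E)/n )                     # variance formula above, as an SE
1/sqrt( n*E )                               # approximation when s >> tau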