The emergence of large data sets in environmental science has changed statistical analysis: more focus on data wrangling and algorithmic approaches, especially Bayes. Bayesian analysis departs from alternative methods in its application of probability to all aspects of model fitting with side benefits of simplifying interpretation. The elements of a Bayesian analysis include distributions for prior, likelihood, and posterior. Hierarchical models emerge naturally in the Bayesian framework as a means for analyzing high-dimensional problems without requiring a change in approach. Graphs help to organize hierarchical modeling. Basic concepts are introduced with the Forest Inventory and Analysis (FIA) monitoring network.

Logistics

For next time

Background video: Biodiversity confronts climate change in the big-data era: promise and pitfalls for understanding and anticipating change

Discussion reading:

Questions for discussion: questions1 on Sakai Assignments

Today’s plan

  • Breakout: Identify groups and moderator for Discussion

  • Jim: Foundation materials in Unit 1

Objectives:

  1. Recognize and generate notation for a simple model
  2. Identify the basic elements of a Bayesian model
  3. Identify the deterministic vs stochastic variables in a model
  4. Interpret a hierarchical model
    • construct a simple graphical model
    • assemble parameters, process, and data as a hierarchy
    • describe a regression model with notation and graph
  5. Articulate advantages and disadvantages of observational and experimental evidence
  6. Define Simpson’s Paradox and identify when it could be operating.


Two themes will repeat throughout this course: big data and Bayesian analysis. Statistical analysis traditionally focused on canned software applied to simple (and small) data sets containing hundreds of observations and several variables. Linear regression and ANOVAs led to \(P\)-values that determined whether or not a study was publishable. Large observational data sets are now commonly wrangled from the internet and used to ask questions that require new tools for data manipulation and hierarchical model building. Hierarchical models offer the potential to integrate multiple types of data, while preserving a focus on science–How does the process work? Can we estimate the relationships and predict them? Theory is needed for model building. Algorithms are the basic tools needed both for data manipulation and model evaluation. I start with a case study that illustrates important concepts.


Goals of a Bayesian analysis

A Bayesian analysis serves several purposes: estimating unknowns, predicting processes and data, and evaluating models.

In Bayesian analysis, all of these are based on, and expressed as, probability distributions.

What makes it Bayes?

In Bayesian analysis, all quantities are assigned probability distributions. Data analysis involves i) a process, represented by a model (that summarizes how it works), ii) evidence, which comes in the form of observations, or data, and additional prior information that needs consideration, and iii) unknowns, or parameters, that must be estimated by combining all of the above.

All of you will have taken a traditional statistics course, often termed frequentist statistics, but many of you will not have had exposure to a Bayesian perspective. Because traditional statistics did not derive a probability distribution for unknown parameters, it developed a ‘frequency’ interpretation for point estimates–there is one best value for a given data set, but I could obtain a distribution of these point estimates if I were to repeat an identical experiment many times. In other words, I might do an end-run around the need for probability with a concept of frequency. The probabilistic basis for Bayesian analysis makes outputs intuitive and defensible. In the next section I introduce some of the basics and connect them to some of the more familiar topics you know from an intro statistics course.

Structure leads to hierarchical modeling

Structure in data is common and can be exploited.

The expansion of modern Bayes has gone hand-in-hand with hierarchical modeling. Hierarchical models simplify the synthesis of observations with processes and parameters. They don’t have to be Bayesian, but they usually are, which is why I introduce these concepts together.

Environmental processes and observations can often be summarized hierarchically. Where processes are not strictly hierarchical, a scientist may see advantages to imposing structure. Structure in time can include hours within days within seasons within years. Time can be structured by ‘active periods’ for insects or by census intervals for trees or humans. Spatial structure can be defined by watersheds, estuaries, soil types, and land cover types. Voting districts, zip codes, or states can be used. Even where space and time are treated as continuous variables, they may be structured within discrete units.

Process models and data can refer to discrete groups. There can be species within communities, individuals within flocks, sample plots, or political parties. Sample plots can be assigned to blocks (stratification) and political parties to countries. Hierarchies can be a natural way to organize data and models.

Structure in data and processes can not only facilitate, but also complicate, interpretation. Simpson’s Paradox and the related ecological fallacy are important considerations for any analysis that involves multiple scales. They can occur when fitting a model at one scale and predicting at another, and they can arise when unmeasured variables affect observations, which is ubiquitous in environmental problems. Examples will arise throughout the semester. In the next section I discuss how not only variables, but also model analyses, can be hierarchical–some parameters generate a model, and other parameters, together with that model, generate data.

Structured variables can be misaligned, such as species distributions and communities, trophic levels (e.g., omnivory), race and political affiliation, and zip codes and voting precincts. Misalignment introduces challenges for hierarchical modeling.

Many problems are set up with continuous structure. For example, a simple advection model describes how change in concentration relates to a concentration gradient in space,

\[ \frac{\partial{c}}{\partial{t}} = -u \frac{\partial{c}}{\partial{x}} \] where \(x\) is location, \(t\) is time, and \(u\) is velocity (distance/time). Because computation requires discrete values, continuously structured problems are discretized. There is a discrete time interval \(\Delta_t\) and a discrete distance \(\Delta_x\) such that \(\partial{c}/\partial{t}\) is approximated by \(\left( c(t + \Delta_t,x) - c(t, x) \right)/\Delta_t\). Discretization often introduces instabilities and requires careful thought on issues of rates and scale.
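
Here is a minimal R sketch of this discretization, using an upwind finite difference; the grid spacing, velocity, and initial concentration pulse are arbitrary values chosen for illustration, not part of the text.

# upwind finite-difference sketch of dc/dt = -u dc/dx (illustrative values only)
u  <- 0.5                          # velocity (distance/time)
dx <- 1                            # discrete distance Delta_x
dt <- 1                            # discrete time interval Delta_t (stable here because u*dt/dx <= 1)
x  <- seq(0, 100, by = dx)         # spatial grid
cmat <- matrix(NA, 50, length(x))  # concentration over 50 time steps
cmat[1, ] <- exp( -(x - 20)^2/20 ) # initial concentration pulse
for(tt in 2:50){
  dcdx <- c( 0, diff( cmat[tt-1, ] ) )/dx          # backward difference for dc/dx
  cmat[tt, ] <- cmat[tt-1, ] - u*dt*dcdx           # c(t + dt, x) = c(t, x) - u dt dc/dx
}
matplot( x, t( cmat[c(1, 25, 50), ] ), type = 'l', lty = 1,
         xlab = 'x', ylab = 'concentration' )      # the pulse advects to the right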

Hierarchical modeling exploits structure in data and in how processes operate. A submodel can be specified for a process, while another submodel describes the connections from state variables to data. Hierarchical model building uses structure, which enters as the distributions introduced in the next section.

Notation for a distribution

In Bayesian analysis, components are specified by probability distributions that contain parameters. If I write the “probability of \(y\) given parameters \(\theta\)”, then I have to know or assume \(\theta\) in order to construct a distribution for \(y\). Because \(y\) is described by a distribution (rather than a specific value), its value is unknown. I can use the bracket notation for a distribution,

\[[\mbox{unknowns | knowns}] = [y | \theta]\] The vertical bar separates the random or stochastic \(y\) on the left from the fixed parameters \(\theta\) on the right.

The terms random, unknown, and stochastic refer to variables having probability distributions. They appear to the left of the vertical bar. The terms fixed, known, or deterministic refer to quantities that appear to the right of the vertical bar. A variable that is fixed in one distribution can be random in another.

This generic notation says nothing about the form of the distribution or the parameters needed to specify it. I use the generic notation when speaking of relationships that do not depend on the specific distribution.

To get specific, a common model for count data is the Poisson distribution, where the response is a non-negative integer \(y = 0, 1, \dots\) that depends on an intensity parameter \(\lambda > 0\). The generic and specific specifications for this distribution are

\[[y | \lambda ] = Poi(y | \lambda)\] The left-hand side could refer to any distribution, while the right-hand side is specific.

The familiar normal distribution has a continuous response, \(y \in (-\infty, \infty)\), and two parameters, one corresponding to the mean (and mode) of the distribution, and another for the variance,

\[[y | \mu, \sigma^2 ] = N(y | \mu, \sigma^2)\] In this notation, \(y\) is random (its specific value is unknown), and the parameters are fixed–they must be known in order to define the distribution for \(y\).
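
These specific distributions are easy to evaluate and simulate in R; the parameter values below are arbitrary choices used only to illustrate the notation.

# Poisson density Poi(y | lambda) for counts y = 0, 1, ...
dpois( 0:5, lambda = 3 )

# normal density N(y | mu, sigma^2); note that dnorm takes the standard deviation, not the variance
dnorm( c(-1, 0, 1), mean = 0, sd = sqrt(2) )

# random draws from each distribution
rpois( 5, lambda = 3 )
rnorm( 5, mean = 0, sd = sqrt(2) )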

Jointly distributed random variables

When more than one variable is unknown there can be a joint distribution. Suppose I have two observations represented by the vector \(\mathbf{y} = [y_1, y_2]\). If the observations are generated independently from the same normal distribution, then there is a joint distribution

\[[y_1, y_2 | \mu, \sigma^2] = N(y_1 | \mu, \sigma^2) \times N(y_2 | \mu, \sigma^2)\] Because I have assumed that they are drawn independently of one another, the probability of both \(y_1\) and \(y_2\) occurring together is equal to the product of the probabilities of observing them separately.
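
A quick numerical check of this independence product, using arbitrary values for \(\mu\), \(\sigma\), \(y_1\), and \(y_2\):

mu <- 0; sigma <- 1                                # hypothetical parameter values
y1 <- 0.4; y2 <- -1.2                              # two hypothetical observations
dnorm( y1, mu, sigma )*dnorm( y2, mu, sigma )      # product of the two densities
prod( dnorm( c(y1, y2), mu, sigma ) )              # same joint density written as a product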

A joint distribution has more than one variable to the left of the bar. Variables are separated by commas, i.e., \([y_1, \dots, y_n]\), or they are represented by a vector \([\mathbf{y}]\) holding the values \([y_1, \dots, y_n]\).

Elements of Bayes

The basic elements of a Bayesian analysis are prior, likelihood, and posterior. Bayesian analysis specifies these elements as probability distributions.

In data analysis, we commonly speak of responses that are observed and parameters to be estimated. The prior distribution describes what is known about the parameters from external information. The likelihood describes how the observations are generated from parameters. Because data are unknown before they are observed and they are generated by parameters, the likelihood is a distribution conditional on parameters. The posterior distribution is the inverse of the likelihood: it is the probability for parameters after data are observed. Before saying more about the Bayesian model specifically, I introduce the concept of hierarchical modeling, which can be viewed as a general framework for Bayes.

Hierarchical models organize distributions for prior, likelihood, and posterior into levels or stages. These are the knowns (data and priors) and the unknowns (latent processes and parameter values). It is natural to think of inference this way:

\[[\mbox{unknowns | knowns}] = [\mbox{process, parameters | data, priors}]\]

In the generic notation for a distribution, \([A]\) is the probability or density of event \(A\), \([A, B]\) is the joint probability of \(A\) and \(B\), and \([A|B]\) is the conditional probability of \(A\) given \(B\). Again, anything to the left of the vertical bar is random. Anything to the right is fixed. If there is no vertical bar, everything is random. The Appendix summarizes notation.

This structure from Mark Berliner (example here) comes naturally to a Bayesian. It can be unwieldy to the non-Bayesian and can involve improper distributions (e.g., when omitting prior distributions). A hierarchical model typically breaks this down as

\[[\mbox{process, parameters | data, priors}] \propto [\mbox{data | process, parameters}] \; [\mbox{process | parameters}] \; [\mbox{parameters | priors}]\]

The left hand side is the joint distribution of unknowns to be estimated; it is the posterior distribution. This hierarchical structure is amenable to simulation, using Markov Chain Monte Carlo (MCMC). Gibbs sampling is a common MCMC technique, because the posterior can be factored into smaller pieces that can be dealt with one at a time, as simpler conditional distributions.

The elements of the hierarchical model relate directly to Bayes’ theorem. To simplify, consider a joint distribution of data \(D\) and parameters \(P\),

\[ [D, P] = [D | P][P] = [P | D] [D] \] Setting the two expressions on the right-hand side equal to one another and rearranging, I obtain Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] which can be read as “the probability of parameters given data”.

Exercise: I am tested for covid and am diagnosed as not-infected. Use Bayes’ theorem to write down the probability, in generic form, that I have covid.

Steps:
  1. Decide a) what is observed and what is unknown, and b) what is random and what is fixed
  2. Write a generic distribution for each component
  3. Make appropriate substitutions in Bayes’ theorem

Prior to posterior

I would like to estimate the mean \(\mu\) of a set of \(n\) measurements \(y_i, i = 1, \dots, n\), using a Bayesian analysis assuming that observations are independent and normally distributed. The variance in observations \(\sigma^2\) is known from previous studies (e.g., instrument error).

I can think about this problem as a graph:

A graph for the model having known observations \(y_i\), a known parameter \(\sigma^2\), known prior parameter values \(m\) and \(M\), and unknown \(\mu\) that I want to estimate.

The likelihood for the observations \(\mathbf{y}\) is the conditional distribution \([D | P]\). For this normal distribution I write it as

\[[D|P] = [\mathbf{y} | \mu, \sigma^2] = \prod_i^n N(y_i | \mu, \sigma^2)\] I am assuming the observations to be independent and identically distributed. They are independent, because I am assuming that the joint distribution \([y_1, \dots, y_n]\) is the same as the product of probabilities taken independently,

\[\begin{align} [y_1, \dots, y_n] &= [y_1] \times \dots \times [y_n] \\ &= \prod_{i=1}^n[y_i] \end{align}\] They are identically distributed, because I assume that each has the same parameter values for \(\mu, \sigma^2\).

Now from the last section I have Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I see that I also need a prior distribution for the mean, which I also take to be a normal distribution,

\[ [P] = N(\mu | m, M) \] For the moment I overlook the factor \([D]\) in the denominator of Bayes’ theorem, because we are focused on \(\mu\), which does not appear in \([D]\). So for now, I write

\[ \begin{align} [P|D] = [\mu | \mathbf{y}, \sigma^2] &\propto [D|P][P] \\ &=\prod_i^n N(y_i | \mu, \sigma^2) \times N(\mu | m, M) \end{align} \] For reasons we will see later, this posterior distribution is also normal,

\[ [\mu | \mathbf{y}, \sigma^2] = N \left( \mu | vV, V \right) \] where the parameters are given by

\[ \begin{align} v &= \frac{n }{\sigma^2}\bar{y} + \frac{m}{M} \\ V &= \left( \frac{n }{\sigma^2} + \frac{1}{M} \right)^{-1} \end{align} \] You might recognize the posterior mean \(vV\) to be a weighted average of the data, with mean \(\bar{y}\), and the prior, with mean \(m\). The weights are the inverse variances, the data precision \(n/\sigma^2\) and the prior precision \(1/M\). In fact, as \(n\) gets large (many observations) and \(M\) gets large (uncertain prior mean), the posterior mean converges to \(vV \rightarrow \bar{y}\) and the posterior variance to \(V \rightarrow \sigma^2/n\). These are the traditional estimates of the mean and the standard error, i.e., the square root of the posterior variance \(\rightarrow \sigma/\sqrt{n}\). Here is a plot of prior and posterior:

n  <- 5                      # number of observations
mu <- 1.8                    # true mean used to simulate data
m  <- 0                      # prior mean
M  <- 1                      # prior variance
sigma <- 2                   # known observation SD
mseq  <- mu + seq(-5, 5, length=100)   # grid of mu values for plotting

y      <- rnorm(n, mu, sigma)          # simulate observations
my     <- mean(y)
postVr <- 1/(n/sigma^2 + 1/M)          # posterior variance V
postSd <- sqrt( postVr )
postMu <- (my*n/sigma^2 + m/M)*postVr  # posterior mean vV

prior     <- dnorm( mseq, m, sqrt(M) )       # prior density N(m, M)
posterior <- dnorm( mseq, postMu, postSd )   # posterior density

plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
      ylab = 'Density (1/mu)', col = 'brown' )
lines( mseq, prior, col = 'grey', lwd = 2 )
abline( v = my, lty = 2 )                                                     # data mean
segments( m, 0, m, dnorm(m, m, sd = sqrt(M) ), col = 'grey' )                 # prior mean
segments( postMu, 0, postMu, dnorm(postMu, postMu, postSd ), col = 'brown' )  # post mean

Note how the posterior mean lies between the prior mean and the data mean, mean(y) = 1.88.

The uncertainty in a parameter estimate is often summarized by a credible interval defined as values that bound a proportion of the posterior distribution. Credible intervals are often defined for 95%, i.e., at probabilities bounded by quantiles \((0.025, 0.975)\).

The 95% credible interval for \(\mu\) is obtained from the posterior distribution, which in this case is normal,

ciProb <- c(0.025, 0.975)                                            # probabilities
ci     <- qnorm( ciProb, postMu, postSd )                            # quantile values
plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
      ylab = 'Density (1/mu)' )
segments( ci, 0, ci, dnorm(ci, postMu, sd = postSd ), lty=2 )

This normal distribution has 2.5% in each of the two tails, bounded by two values in the vector ci = (-0.264, 2.35).

Exercise: A weighted average for two values of a variable \(x\) is \(\bar{x} = (w_1x_1 + w_2x_2)/(w_1 + w_2)\), where \(w_i\) is the weight for the \(i^{th}\) value of \(x\). In the posterior distribution for the mean estimate \(N \left( \mu | vV, V \right)\), identify \(x_1, x_2, w_1\) and \(w_2\) in the equations for \(v\) and \(V\). What is the effect of increasing \(n\) on the posterior mean and variance in the above example and why?

Exercise: If the data are noisy (large \(\sigma^2\)), is it still possible to obtain a precise estimate of \(\mu\)? If so, how?

If Bayes were always as simple as this problem suggests it would have captured much of the modeling space decades ago. I previously skipped over a sticky detail. Returning to Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I discussed everything but the denominator, a marginal distribution of the data, which is the source of all sorrow in Bayesian analysis. For this expression to be true, this distribution must involve integration,

\[ [D] = \int [D | P][P] dP \] Recall that multiplication is easy, and division is hard. In calculus, the tables are turned. The calculus counterpart of division is differentiation, and the counterpart to multiplication is integration. However, differentiation is much easier than integration. This integral is known for the normal likelihood \(\times\) normal prior example above, but it is not known for most likelihood-prior combinations.
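
For this normal-normal example the integral can also be approximated numerically; here is a minimal sketch using integrate(), reusing the simulated y and the values of sigma, m, and M from the code block above.

# marginal [D]: integrate likelihood x prior over mu (normal-normal example above)
likeTimesPrior <- function(mu){
  sapply( mu, function(m0) prod( dnorm( y, m0, sigma ) )*dnorm( m0, m, sqrt(M) ) )
}
Dmarg <- integrate( likeTimesPrior, lower = -10, upper = 10 )$value
Dmarg                                                                # the normalizing constant [D]
integrate( function(mu) likeTimesPrior(mu)/Dmarg, -10, 10 )$value    # posterior integrates to 1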

Posterior simulation techniques like Markov chain Monte Carlo (MCMC) offered ways to approximate this integral and more. Beyond resolving the integration problem, Gibbs sampling (Gelfand and Smith 1990) allowed for hierarchical expansion of models. After gaining familiarity with simple Bayes in the first few chapters, we’ll devote most of the semester to hierarchical problems.

Bayes factor

The Bayes factor (BF) is used to evaluate models relative to one another. From the posterior distribution we can evaluate a model probability conditional on the data, \([M | D]\), and compare it with the prior probability for the model, \([M]\). The ratio of posterior to prior odds for two models is

\[ BF = \frac{[M_1 | D] \, / \, [M_1]}{[M_2 | D] \, / \, [M_2 ]} = \frac{[D | M_1]}{[D | M_2]} \]

A BF value much greater than 1 finds support for model 1 and vice versa. Obtaining \([M | D]\) requires evaluating an integral. We discuss this further later in the semester.
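
As a small numerical illustration (a sketch, not part of the text), the marginal likelihoods can be approximated for two hypothetical models of the simulated y above: model 1 with the N(m, M) prior on \(\mu\), and model 2 that fixes \(\mu = 0\).

# BF as a ratio of marginal likelihoods for two hypothetical models of y
margM1 <- Dmarg                              # [D | M1]: mu unknown with prior N(m, M), from above
margM2 <- prod( dnorm( y, 0, sigma ) )       # [D | M2]: a competing model that fixes mu = 0
margM1/margM2                                # BF >> 1 supports model 1; BF << 1 supports model 2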

Regression as an example of simple Bayes

A regression model contains a response, predictors, parameters, and residual (‘unexplained’) variation. Least squares minimizes error, maximum likelihood finds parameters that make the data most likely, and Bayes evaluates a probability distribution of parameters.

This is a good time to read this section of the appendix. I introduce a subscript \(i = 1, \dots, n\) to indicate the \(n\) observations of a response \(y_i\) with corresponding predictors \(\mathbf{x}_i\). The index \(i\) is exchangeable, meaning that the order of observations doesn’t matter. The responses make up a length-\(n\) vector \(\mathbf{y}\). The predictors make up an \(n \times p\) matrix \(\mathbf{X}\), where \(p\) is the number of parameters to be estimated in a vector \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'\).

A traditional regression model looks like this:

\[ \begin{aligned} y_i &= \mu_i + \epsilon_i \\ \mu_i &= \beta_1 + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} = \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] (again, notation is summarized in an Appendix.) There is an observed response \(y_i\) and predictors in a length-\(p\) design vector \(\mathbf{x}_i\). Here is an example of a design vector for an intercept and two slope variables,

\[ \mathbf{x}_i = (1, x_{i,2}, x_{i,3})' \]

Note that the first element of the design vector is \(x_{i,1} = 1\), corresponding to the intercept for the model. The error has an expectation \(E[\epsilon] = 0\) and variance \(Var[\epsilon] = \sigma^2\). This is called a linear model, because it is a linear function of the parameters \(\boldsymbol{\beta}\).

Typically, I want to estimate the parameters in the model. To do this, I have to decide on a criterion that constitutes a ‘good fit’. The simplest assumption could be to find parameter values that minimize the left over (‘residual’) variation. Without specifying a distribution for \(\epsilon\), I could minimize the expected value of the squared error,

\[mse = \min_{\boldsymbol{\beta}} E[\epsilon^2] = \min_{\boldsymbol{\beta}} E\left[ \sum_i^n (y_i - \mathbf{x}'_i \boldsymbol{\beta})^2 \right]\]

with respect to parameters \(\boldsymbol{\beta}\). Within the brackets I am summing the squared difference between the observed \(y_i\) and the value that would be predicted by the model, \(\mathbf{x}'_i\boldsymbol{\beta}\). By taking derivatives with respect to \(\boldsymbol{\beta}\), setting them equal to zero, and solving for \(\boldsymbol{\beta}\), I obtain the least-squares estimates \(\mathbf{\hat{\boldsymbol{\beta}}}\). This is the method of least-squares.
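
Here is a minimal sketch of the least-squares solution in R, using a small simulated data set; the sample size, design, and coefficient values are arbitrary choices for illustration.

# simulate a small regression data set (illustrative values)
n    <- 100
X    <- cbind( 1, rnorm(n), rnorm(n) )             # design matrix: intercept plus two predictors
beta <- c(2, 1, -0.5)                              # coefficients used to simulate the data
y    <- X %*% beta + rnorm( n, 0, 1 )              # response with N(0, 1) error

bHat <- solve( t(X) %*% X ) %*% t(X) %*% y         # least-squares estimates
cbind( bHat, coef( lm( y ~ X[ ,2] + X[ ,3] ) ) )   # agrees with lm()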

Rather than minimizing residual error, maximum likelihood looks for a set of parameters that make the observations most likely given the model. I say most likely rather than most probable, because the parameters are not specified with a probability distribution (that would be Bayesian). The likelihood is a probability model for the observations, not for the parameters.

If I specify a normal distribution for errors, \(\epsilon \sim N(0, \sigma^2)\), I can write a likelihood function for the observations. Here again, \(N\) indicates a normal distribution, with mean zero and variance \(\sigma^2\). Due to special properties of the normal distribution, if \(\epsilon \sim N(0, \sigma^2)\) and \(y = \mu + \epsilon\), then \(y \sim N(\mu, \sigma^2)\), where parameters are \(\mu\) (mean) and \(\sigma^2\) (variance).

Because the regression model assumes independent observations, I can write the likelihood function as a simple product over the observations:

\[L(\boldsymbol{\beta}, \sigma^2) = \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] Once again, the independence assumption means that the probability of all \(n\) observations is equal to the product of the probabilities of each individual observation,

\[\prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) = N(y_1 | \mathbf{x}'_1 \boldsymbol{\beta}, \sigma^2) \times \dots \times N(y_n |\mathbf{x}'_n \boldsymbol{\beta},\sigma^2)\]

The maximum likelihood estimate comes from maximizing the likelihood function with respect to parameters. I obtain the same estimates as I did with least squares. This is true if \(\epsilon\) is normally distributed. For least squares and this maximum likelihood approach, the likelihood is treated as a function of the parameters. Again, the parameters are not defined probabilistically.
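
Maximizing this likelihood numerically gives essentially the same coefficients as least squares; here is a minimal sketch using optim() on the negative log likelihood, continuing with the simulated X and y from the least-squares sketch above.

# negative log likelihood for (beta, log sigma), using the simulated X and y above
negLogLik <- function(par){
  b  <- par[1:3]
  sd <- exp( par[4] )                               # log scale keeps the SD positive
  -sum( dnorm( y, X %*% b, sd, log = TRUE ) )
}
fit <- optim( c(0, 0, 0, 0), negLogLik )
fit$par[1:3]                                        # close to the least-squares bHat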

The Bayesian view differs from traditional approaches by introducing a prior distribution that I now consider in the context of the regression model. The prior distribution has two important functions:

  • It allows for information outside the data at hand.

  • It allows us to derive a posterior distribution, i.e., a probability distribution for parameters.

Recall Bayes’ theorem:

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] where \(D\) represented data, and \(P\) represented parameters. For the regression model the parameters are \(P = (\boldsymbol{\beta}, \sigma^2)\), and the data are \(D = (\mathbf{y}, \mathbf{X})\). Although both predictors \(\mathbf{X}\) and responses \(\mathbf{y}\) are observed, the regression model assumes that \(\mathbf{X}\) is part of a fixed design–it is not random. For this reason, we do not write a distribution for \(\mathbf{X}\), but rather we ‘condition’ on the fixed values in \(\mathbf{X}\). Making these substitutions, Bayes’ theorem becomes

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}] = \frac{[\mathbf{y} | \boldsymbol{\beta}, \sigma^2] [\boldsymbol{\beta}, \sigma^2]}{[\mathbf{y}]} \]

Again, the posterior distribution of parameters is on the LHS. The denominator is the marginal distribution of the data, \([\mathbf{y}]\). As in the example for the simple mean, I’ll ignore it for now.

As before, the left-hand side is “parameters given data”, or the posterior distribution, and the right-hand side has likelihood \(\times\) prior in the numerator. In Bayesian analysis the likelihood is treated not as a function to be optimized, but rather as a distribution for the data that must be combined with the prior to yield a distribution for parameters (the posterior). The second factor on the right indicates an (unspecified) prior distribution for parameters, this time for the regression coefficients. Technical details follow later; for now I focus on general concepts.

Substituting for the likelihood and leaving the prior distribution unspecified (for now), I have

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}] \propto \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) [\boldsymbol{\beta}, \sigma^2] \]

If the prior distribution is flat, the posterior distribution for \(\boldsymbol{\beta}\) will have the same shape as the likelihood. Thus, parameter estimates will have the same marginal means and standard errors as I obtained by maximum likelihood. The impact of the prior distribution becomes important when it is made informative and as models increase in size.

Model graph

Models can be represented as graphs. Model elements are connected by arrows that establish the relationships between variables that guide model building.

Below are three graphs, each describing a different model. The Bayesian regression model is shown in (a).

Three regression variations in graphical form. Circles are random. Squares ‘anchor’ the fit–observed variables \(X\) and \(Y\) and prior parameter values are known. a) In traditional regression, there are unknown parameters, but they are not given a probability model. b) Simple Bayesian regression has a probability model for parameters. Once response \(Y\) is observed, it is fixed. c) The hierarchical model has an additional stochastic stage.


In this diagram, there are boxes, circles, dashed lines, and solid lines. The graph provides a visual interpretation of the model. I discuss symbolism in the next sections.

Known and unknown variables: Variables are nodes in the graph. The known variables are boxes, including data and prior parameter values. They anchor the model fit, because they are fixed. The unknown variables and parameters are circles. In the regression example, these are the regression coefficients, the length-\(p\) vector \(\boldsymbol{\beta}\), and the residual variance \(\sigma^2\).

Stochastic and deterministic variables: Both the response \(y_i\) and the predictors \(\mathbf{x}_i\) are known. However, the predictors are treated as deterministic, whereas the response is treated as stochastic before it is observed. Specifically, \(y_i\) is assigned a distribution, the likelihood. The vector \(\mathbf{x}_i\) is not assigned a distribution. In observational studies it is not always clear what should be a predictor and what should be a response.

Stochastic and deterministic arrows: Solid arrows in the graph indicate stochastic relationships, meaning that knowledge of a parent node does not tell us the value at the child node (or vice versa). Deterministic relationships can often be omitted from the graph. One deterministic relationship is shown in part (b) as a dashed line. This example will be discussed below.

Some additional background concepts

In this section, I introduce some concepts that will come up again this semester.

Terms used to classify models

The terms process model and statistical model represent a false dichotomy.

Hierarchical models typically include both elements. They are used for inference and to predict processes and data, with uncertainty. To take a simple recent example, development of spring phenological state could be written as a rate equation,

\[\frac{dh}{dt} = r(t; \mathbf{x}(t)) \left(1 - h(t) \right)\]

for development rate \(r(t)\) that changes over time, in part due to environmental changes contained in \(\mathbf{x}(t)\). Most ecologists would call this a ‘process model’. However, when discretized, written in terms of predictors, and combined with a data model for ordinal observations, it’s used for inference (Clark et al. 2014). Hierarchical modeling has made the distinction between statistical and process models a false dichotomy.

A generative model goes forward and backward; it predicts the data that fitted it.

Prediction and simulation are the forward direction: start with parameters and generate data. Model fitting is the inverse direction. Generative models are needed if the model is to be evaluated with prediction–can model fitting recover the parameters used to simulate the data?

Many models are not generative. When they are not, evaluation is difficult. In this course prediction plays a large role in model checking, so we focus on generative models.

Observation and experiment

Design determines the information that data have to offer model fitting and lends its name to the design matrix in ANOVA and regression. The common elements of a study design include replication, stratification, and randomization (RSR).

Information can be evaluated in many ways, such as the precision a model provides on parameter estimates or predictions. There is a large literature on ‘optimal design’ and how it benefits from RSR.

Studies can be observational (variables controlled only to the extent possible from sample deployment) or experimental (variable manipulation). The model for experimental data may be a model for the design (e.g., ANOVA), but it doesn’t have to be. Experiments may be established to evaluate many processes, each analyzed with a different model.

Observational data have an advantage in being available at scales that could not be studied with controlled experiments. Large observational networks, some global, are examples. Weather station data, now combined with satellite imagery, are the basis for modern understanding of ocean and atmospheric circulation. Epidemiological evidence for public health trends is widely used in the social sciences, as are polling data in public policy. Disease surveillance can be active or passive, the latter being easier to get, but harder to use.

The big concern with observational data is how to attribute cause and effect. As many variables change, interacting with one another as well as with unmeasured variables, there may be no way to identify what’s affecting what. Are people healthy because they exercise, or does being healthy motivate exercise?

It can be hard to even decide what goes on the right and the left side of the model. Does high mineralization rate promote N-demanding plants and microbes, or vice versa? What do I do when I know that both are true?

Experiments benefit from manipulation and control. Treatment effects might be isolated when other variables are controlled. Control is critical to inference in many settings.

Experimental results can be difficult to extrapolate outside the experimental frame, which could be the laboratory, growth chamber, greenhouse, or small field plot. Only a fraction of important questions are amenable to experimentation. Many ecologists and social scientists look for ‘natural experiments’.

Model fitting is a balance between data and model.

Many arguments about climate change focus on global mean temperatures, obtained by averaging station data that go back over a century. This seems to be a data-driven debate (“what do the data say?”)–until we confront the fact that the mean of the stations is not the global mean, because stations are not uniform over the globe. Models are used to translate unevenly distributed data to a less biased estimate of regional climate. The PRISM data are an example.

Dependence in data

I previously said that traditional data modeling assumes that observations are i.i.d. (independent and identically distributed). Many samples do not contain independent observations. For example, eBird counts in Sandy Creek, NC and Alexandria, VA might reasonably offer independent information. However, repeated eBird counts from Sandy Creek, NC last Saturday might be largely redundant–each observation does not offer new information. Likewise, if birds are migrating through an area, observations obtained in different weeks might offer more information than observations taken on successive days.

The i.i.d. assumption overestimates the information content in observations when those observations depend on one another.

Spatial data may depend on nearby observations. Note that arrows point in both directions.

There is often serial dependence in time-series data.
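
A quick simulation of serial dependence (an illustrative sketch, not from the text), using an AR(1) process with an arbitrary autocorrelation of 0.8:

# simulate serially dependent observations: AR(1) with autocorrelation 0.8
w <- arima.sim( model = list(ar = 0.8), n = 200 )
acf( w, main = 'serial dependence in a simulated time series' )   # correlation decays slowly with lag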

Building dependence into traditional models is hard. Hierarchical modeling solves this problem, because dependence can come simply from adding a stage, as shown in the graphical model for regression.

Multiplicity corrections

Multiple testing refers to questions that depend on more than one hypothesis test.

I might want to know if the abundances of microbial taxa decline with salinity in estuaries, so I fit regressions to concentrations of each of thousands of OTUs (operational taxonomic units). We expect to reject the null at \(\alpha = 0.05\) for 5% of tests where the null hypothesis is true. A traditional Bonferroni correction divides the \(\alpha\)-value threshold by the number of tests, based on the assumption that tests are independent. By building models hierarchically, Bayesian analysis sidesteps this problem–the dependence between ‘tests’ is built into the model.
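
A quick illustration of the multiplicity problem and the Bonferroni correction, using simulated tests of a true null hypothesis (values chosen for illustration):

# 1000 t-tests of a true null: about 5% are 'significant' at alpha = 0.05
pvals <- replicate( 1000, t.test( rnorm(20) )$p.value )
mean( pvals < 0.05 )                                       # roughly 0.05 false positives
mean( p.adjust( pvals, method = 'bonferroni' ) < 0.05 )    # Bonferroni: essentially none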

Ecological fallacy/Simpson’s paradox

The ecological fallacy and Simpson’s paradox refer to the fact that inference at one scale or level of organization may not represent relationships at another.

It’s not hard to find examples where individual and group behavior appear to show different relationships. In public policy, we now know that rich states vote Democratic, while rich people vote Republican (Gelman 2009). In the last few decades, the mean household income has increased, while the income of the mean household has declined. Ecological communities (species distributions and abundances) are poorly predicted by aggregating predictions for individual species (Clark et al. 2011). These are examples of ‘Simpson’s Paradox’.

Conversely, models of total numbers of species do not predict extinction threat, which operates at the scale of individual species, an example of the ‘ecological fallacy’. The ecological fallacy refers to the fact that group behavior does not predict individuals within groups (rich people do not vote like rich states). Gerrymandered voting precincts exploit this fact.
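
A small simulation of Simpson’s Paradox (an illustrative sketch with made-up group structure): within each of three hypothetical groups the relationship between x and y is negative, yet the pooled relationship is positive because group means differ.

# Simpson's Paradox: negative slope within groups, positive slope when groups are pooled
groupMean <- c(0, 3, 6)                            # group-level means for x (and y)
g <- rep( 1:3, each = 50 )                         # group membership
x <- groupMean[g] + rnorm( 150 )
y <- groupMean[g] - 0.5*( x - groupMean[g] ) + rnorm( 150, 0, 0.3 )

coef( lm( y ~ x ) )['x']                           # pooled slope: positive
sapply( 1:3, function(k) coef( lm( y[g == k] ~ x[g == k] ) )[2] )   # within-group slopes: near -0.5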

Exercise. What are the three elements of a design? What does each contribute?

Recap

Am I a Bayesian? Many who make use of Bayesian models and software would not necessarily see themselves as particularly Bayesian. Models and software may be selected for convenience or other considerations that have little to do with philosophy. Avowed non-Bayesians may exploit MCMC methods, while making a point of omitting priors. Staunch Bayesians may sometimes omit priors where they do not affect results or interpretation, but include them elsewhere. Users of packages like INLA may not know that it uses prior distributions (it’s Bayesian). Many big data applications bridge machine-learning and statistical worlds, exploiting approximations of many types. Bayes has expanded rapidly in large part for pragmatic reasons of simplicity and capacity to address large problems.

Bayesian analysis combines a prior distribution with a data model, or likelihood, to arrive at a posterior distribution of unknown parameters and latent variables. Deterministic variables are constant, whereas stochastic variables are defined by distributions. Known variables are constant, but they can be stochastic and, thus, can be defined by distributions, such as the response \(Y\) in the regression model. Graphs are used to communicate model structure and organize algorithms. They provide a visual representation of hierarchical models. A common hierarchical structure is [data | process, parameters][process | parameters][parameters].

Basic regression models provide an opportunity to introduce similarities and differences. Least-squares, maximum likelihood, and Bayes share some common features at this simplistic level, but diverge as models increase in size.

Both experimental and observational data are analyzed in environmental sciences. They differ in design, control, and, often, scale. Fundamental design considerations include randomization, stratification, and replication.

Model fitting and prediction are ‘backward’ and ‘forward’ views. Generative models do both and, thus, can be fully evaluated (in both directions).


Appendix

Note about R environments

To fully understand the block of code for the lm fit, I need to know that the variable biomass is defined in my global environment (see previous code block). I can enter length(biomass) and get an answer, because biomass exists in the working environment. I have not assigned the variables deficit and moisture in my global environment. They only exist within the data.frame data. When I call the function lm it knows to look for variables in my global environment or in data, because I have passed data as an argument. The functions plot and points do not look for variables in this way. When I call them, I must specify the data.frame with the variable name, data$deficit. Using R is the subject of Unit 2.

Notation

Here are some notation conventions used in these vignettes.

notation example interpretation
italic \(x\), \(X\) scalar quantity, known
greek \(\alpha, \beta, \dots\) stochastic (fitted) variable, unknown
parentheses \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) parametric function/density
curly brackets \(\{0, 1, \dots\}\) a set of objects
closed interval \((-\infty,0]\), \([0, \infty)\) intervals include zero
open interval \((-\infty,0)\), \((0, \infty)\) exclude zero
distributed as \(x \sim N(\mu, \sigma^2)\) distribution or density
expectation \(E[\epsilon] = 0\) expected value of a variable
variance \(Var[\epsilon] = \sigma^2\) variance of a variable
bracket, distribution \([A, B, C]\) unspecified density or distribution
proportional \(f(x) \propto g(x)\) differ by a constant, \(f(x) = c g(x)\)
approximate \(\pi \approx 3.14159\) approximately equal
real numbers \(\mathbb{R} = (-\infty, \infty)\) note: also positive real \(\mathbb{R}_+\)
is an element of \(\pi \in \mathbb{R}\) \(\pi\) is a real number
subset \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) in the set
proper subset \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) cannot include all elements
union \(a \cup b\) either \(a\) or \(b\)
intersection \(a \cap b\) both \(a\) and \(b\)
sum \(\sum_{i=1}^n x_i\) \(x_1 + \dots + x_n\)
product \(\prod_{i=1}^n x_i\) \(x_1 \times \dots \times x_n\)
exponentiate \(e^x\), \(exp(x)\) \(e \approx 2.71828\), inverse of \(ln(x)\)
natural logarithm \(ln(x)\) inverse of \(e^x\)

Matrices

notation example interpretation
bold, l.c. \(\mathbf{x}\), \(\mathbf{x}'\) column and row vectors, respectively
bold, u.c. \(\mathbf{X}\) matrix
dimensions \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) \(n\) rows, \(q\) columns
subscript \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) element of an array
row vector \(\mathbf{X}_{i\cdot}\) row \(i\)
column vector \(\mathbf{X}_{\cdot j}\) column \(j\)
transpose \(\mathbf{X}'\) rows become columns
matrix inverse \(\mathbf{A}^{-1}\) solve systems of linear equations
identity matrix \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) \(p \times p\) with 1’s on the diagonal, zeros elsewhere
matrix determinant \(det\left( \mathbf{A} \right)\) obtain for a square matrix, e.g., covariance
kronecker product \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) \(np \times mq\) matrix, defined in text

logs and exponentiation:

  • these are equivalent: \(exp(x) = e^x\)

  • \(\log( e^x ) = x\) and \(e^{ \log x } = x\)

  • \(exp(y_1) exp(y_2) = exp(y_1 + y_2)\)

  • \(\prod_{i=1}^n exp( y_i ) = exp \left( \sum_{i=1}^n y_i \right)\)

  • \(log( a/b ) = log(a) - log(b)\)

Simple derivatives

  • \(\frac{d}{dx} \left( x^a \right) = a x^{a-1}\)

  • \(\frac{d}{dx} \left( e^x \right) = e^x\)

  • \(\frac{d}{dx} \left( e^{f(x)} \right) = \frac{df}{dx} e^{f(x)}\)

Some sample statistics

The notation \(E(y)\) refers to the expectation of \(y\). For random data it is the sample mean. We do not simply refer to it as a mean, because it is more general: a probability distribution has an expectation, even if there are no data.

  • mean: \(E(y) = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\)

  • variance: \(Var(y) = E(y^2) - E^2(y) =\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^n y^2_i - \bar{y}^2\)

  • covariance: \(Cov(x,y) = E(xy) - E(x)E(y) = \frac{1}{n} \sum_{i=1}^n x_i y_i - \bar{x} \bar{y}\)
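
These sample formulas can be checked directly in R; note that the built-in var() and cov() divide by \(n - 1\) rather than \(n\), so the rescaling below matches the formulas above.

# check the sample formulas (var() and cov() in R use n - 1, not n)
x <- rnorm( 10 ); y <- rnorm( 10 ); n <- length( y )
c( mean( y ), sum( y )/n )
c( var( y )*(n - 1)/n, mean( y^2 ) - mean( y )^2 )
c( cov( x, y )*(n - 1)/n, mean( x*y ) - mean( x )*mean( y ) )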

Models we will encounter

Univariate response models include the following:

Name response additional attributes
linear (LM) normal linear in parameters
linear mixed model (LMM) normal LM with random effects
generalized linear model (GLM) discrete linear in parameters on link scale
logistic regression binomial GLM includes logit link
probit regression binomial GLM includes probit link
Poisson regression Poisson GLM typically with log link
mixed GLM (GLMM) binomial, Poisson, … GLM with random effects

Multivariate response models have a vector of responses:

Name response additional attributes
linear (LM) MVN linear in parameters
categorical multinomial multiple classes, one outcome, MV logit or probit link
multinomial multinomial multiple classes, multiple outcomes
generalized joint attribute model (GJAM) all types linear in parameters

Time series have dependence in time. They can be handled with state-space models, normal or not, for continuous states. The Kalman filter is a simple (normal-normal) state-space model. Hidden Markov model is a term most often applied to discrete states. Autoregressive models are normal and have dependence on \(p\) previous times, e.g., AR(\(p\)).

Spatial models have dependence in space. They can be LM or GLM with \(n = 1\) and an \(m \times m\) covariance matrix \(\Sigma\), where \(m\) is the number of locations. Kriging is a traditional spatial model for continuous space. Spatial autoregressive models are used where space is viewed as discrete blocks (e.g., census tracts, counties, …).