The emergence of large data sets in environmental science has changed statistical analysis: more focus on data wrangling and algorithmic approaches, especially Bayes. Bayesian analysis departs from alternative methods in its application of probability to all aspects of model fitting with side benefits of simplifying interpretation. The elements of a Bayesian analysis include distributions for prior, likelihood, and posterior. Hierarchical models emerge naturally in the Bayesian framework as a means for analyzing high-dimensional problems without requiring a change in approach. Graphs help to organize hierarchical modeling. Basic concepts are introduced with the Forest Inventory and Analysis (FIA) monitoring network.

Logistics

For next time

Background video: Biodiversity confronts climate change in the big-data era: promise and pitfalls for understanding and anticipating change

Discussion reading:

Questions for discussion: questions1 on Sakai Assignments

Today’s plan

  • Breakout: Identify groups and moderator for Discussion

  • Jim: Foundation materials in Unit 1

Objectives:

  1. Recognize and generate notation for a simple model
  2. Identify the basic elements of a Bayesian model
  3. Identify the deterministic vs stochastic variables in a model
  4. Interpret a hierarchical model
    • construct a simple graphical model
    • assemble parameters, process, and data as a hierarchy
    • describe a regression model with notation and graph
  5. Articulate advantages and disadvantages of observational and experimental evidence
  6. Define Simpson’s Paradox and identify when it could be operating.


Two themes will repeat throughout this course: big data and Bayesian analysis. Statistical analysis traditionally focused on canned software applied to simple (and small) data sets containing hundreds of observations and several variables. Linear regression and ANOVAs led to \(P\)-values that determined whether or not a study was publishable. Large observational data sets are now commonly wrangled from the internet and used to ask questions that require new tools for data manipulation and hierarchical model building. Hierarchical models offer the potential to integrate multiple types of data, while preserving a focus on science–How does the process work? Can we estimate the relationships and predict them? Theory is needed for model building. Algorithms are the basic tools needed both for data manipulation and model evaluation. I start with a case study that illustrates important concepts.


Goals of a Bayesian analysis

A Bayesian analysis serves several purposes: estimating unknowns, predicting processes and data, and evaluating models.

In Bayesian analysis, all of these are based on, and expressed as, probability distributions.

What makes it Bayes?

In Bayesian analysis, all quantities are assigned probability distributions. Data analysis involves i) a process, represented by a model (that summarizes how it works), ii) evidence, which comes in the form of observations, or data, and additional prior information that needs consideration, and iii) unknowns, or parameters, that must be estimated by combining all of the above.

All of you will have taken a traditional statistics course, often termed frequentist statistics, but many of you will not have had exposure to a Bayesian perspective. Because traditional statistics did not derive a probability distribution for unknown parameters, it developed a ‘frequency’ interpretation for point estimates–there is one best value for a given data set, but I could obtain a distribution of these point estimates if I were to repeat an identical experiment many times. In other words, I might do an end-run around the need for probability with a concept of frequency. The probabilistic basis for Bayesian analysis makes outputs intuitive and defensible. In the next section I introduce some of the basics and connect them to some of the more familiar topics you know from an intro statistics course.

Structure leads to hierarchical modeling

Structure in data is common and can be exploited.

The expansion of modern Bayes has gone hand-in-hand with hierarchical modeling. Hierarchical models simplify the synthesis of observations with processes and parameters. They don’t have to be Bayesian, but they usually are, which is why I introduce these concepts together.

Environmental processes and observations can often be summarized hierarchically. Where processes are not strictly hierarchical, a scientist may see advantages to imposing structure. Structure in time can include hours within days within seasons within years. Time can be structured by ‘active periods’ for insects or by census intervals for trees or humans. Spatial structure can be defined by watersheds, estuaries, soil types, and land cover types. Voting districts, zip codes, or states can be used. Even where space and time are treated as continuous variables, they may be structured within discrete units.

Process models and data can refer to discrete groups. There can be species within communities, individuals within flocks, sample plots, or political parties. Sample plots can be assigned to blocks (stratification) and political parties to countries. Hierarchies can be a natural way to organize data and models.

Structure in data and processes can not only facilitate, but also complicate, interpretation. Simpson’s Paradox and the related ecological fallacy are important considerations for any analysis that involves multiple scales. They can occur when fitting a model at one scale and predicting at another, and they can arise when unmeasured variables affect observations, which is ubiquitous in environmental problems. Examples will arise throughout the semester. In the next section I discuss how not only variables, but also model analyses, can be hierarchical–some parameters generate a model, and other parameters, together with that model, generate data.

Structured variables can be misaligned, such as species distributions and communities, trophic levels (e.g., omnivory), race and political affiliation, and zip codes and voting precincts. Misalignment introduces challenges for hierarchical modeling.

Many problems are set up with continuous structure. For example, a simple advection model describes how change in concentration relates to a concentration gradient in space,

\[ \frac{\partial{c}}{\partial{t}} = -u \frac{\partial{c}}{\partial{x}} \] where \(x\) is location, \(t\) is time, and \(u\) is velocity (distance/time). Because computation requires discrete values, continuously structured problems are discretized. There is a discrete time interval \(\Delta_t\) and a discrete distance \(\Delta_x\) such that \(\partial{c}/\partial{t}\) is approximated by \(\left( c(t + \Delta_t,x) - c(t, x) \right)/\Delta_t\). Discretization often introduces instabilities and requires careful thought on issues of rates and scale.
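
Here is a minimal R sketch of this discretization, using an upwind finite difference; the grid spacing, velocity, and initial concentration pulse are arbitrary values chosen for illustration, not part of the text.

# upwind finite-difference sketch of dc/dt = -u dc/dx (illustrative values only)
u  <- 0.5                          # velocity (distance/time)
dx <- 1                            # discrete distance Delta_x
dt <- 1                            # discrete time interval Delta_t (stable here because u*dt/dx <= 1)
x  <- seq(0, 100, by = dx)         # spatial grid
cmat <- matrix(NA, 50, length(x))  # concentration over 50 time steps
cmat[1, ] <- exp( -(x - 20)^2/20 ) # initial concentration pulse
for(tt in 2:50){
  dcdx <- c( 0, diff( cmat[tt-1, ] ) )/dx          # backward difference for dc/dx
  cmat[tt, ] <- cmat[tt-1, ] - u*dt*dcdx           # c(t + dt, x) = c(t, x) - u dt dc/dx
}
matplot( x, t( cmat[c(1, 25, 50), ] ), type = 'l', lty = 1,
         xlab = 'x', ylab = 'concentration' )      # the pulse advects to the right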

Hierarchical modeling exploits structure in data and in how processes operate. A submodel can be specified for a process, while another submodel describes the connections from state variables to data. Hierarchical model building uses structure, which enters as the distributions introduced in the next section.

Notation for a distribution

In Bayesian analysis, components are specified by probability distributions that contain parameters. If I write the “probability of \(y\) given parameters \(\theta\)”, then I have to know or assume \(\theta\) in order to construct a distribution for \(y\). Because \(y\) is described by a distribution (rather than a specific value), its value is unknown. I can use the bracket notation for a distribution,

\[[\mbox{unknowns | knowns}] = [y | \theta]\] The vertical bar separates the random or stochastic \(y\) on the left from the fixed parameters \(\theta\) on the right.

The terms random, unknown, and stochastic refer to variables having probability distributions. They appear to the left of the vertical bar. The terms fixed, known, or deterministic refer to quantities that appear to the right of the vertical bar. A variable that is fixed in one distribution can be random in another.

This generic notation says nothing about the form of the distribution or the parameters needed to specify it. I use the generic notation when speaking of relationships that do not depend on the specific distribution.

To get specific, a common model for count data is the Poisson distribution, where the response is a non-negative integer \(y = 0, 1, \dots\) that depends on an intensity parameter \(\lambda > 0\). The generic and specific specifications for this distribution are

\[[y | \lambda ] = Poi(y | \lambda)\] The left-hand side could refer to any distribution, while the right-hand side is specific.

The familiar normal distribution has a continuous response, \(y \in (-\infty, \infty)\), and two parameters, one corresponding to the mean (and mode) of the distribution, and another for the variance,

\[[y | \mu, \sigma^2 ] = N(y | \mu, \sigma^2)\] In this notation, \(y\) is random (its specific value is unknown), and the parameters are fixed–they must be known in order to define the distribution for \(y\).
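
These specific distributions are easy to evaluate and simulate in R; the parameter values below are arbitrary choices used only to illustrate the notation.

# Poisson density Poi(y | lambda) for counts y = 0, 1, ...
dpois( 0:5, lambda = 3 )

# normal density N(y | mu, sigma^2); note that dnorm takes the standard deviation, not the variance
dnorm( c(-1, 0, 1), mean = 0, sd = sqrt(2) )

# random draws from each distribution
rpois( 5, lambda = 3 )
rnorm( 5, mean = 0, sd = sqrt(2) )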

Jointly distributed random variables

When more than one variable is unknown there can be a joint distribution. Suppose I have two observations represented by the vector \(\mathbf{y} = [y_1, y_2]\). If the observations are generated independently from the same normal distribution, then there is a joint distribution

\[[y_1, y_2 | \mu, \sigma^2] = N(y_1 | \mu, \sigma^2) \times N(y_2 | \mu, \sigma^2)\] Because I have assumed that they are drawn independently of one another, the probability of both \(y_1\) and \(y_2\) occurring together is equal to the product of the probabilities of observing them separately.
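
A quick numerical check of this independence product, using arbitrary values for \(\mu\), \(\sigma\), \(y_1\), and \(y_2\):

mu <- 0; sigma <- 1                                # hypothetical parameter values
y1 <- 0.4; y2 <- -1.2                              # two hypothetical observations
dnorm( y1, mu, sigma )*dnorm( y2, mu, sigma )      # product of the two densities
prod( dnorm( c(y1, y2), mu, sigma ) )              # same joint density written as a product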

A joint distribution has more than one variable to the left of the bar. Variables are separated by commas, i.e., \([y_1, \dots, y_n]\), or they are represented by a vector \([\mathbf{y}]\) holding the values \([y_1, \dots, y_n]\).

Elements of Bayes

The basic elements of a Bayesian analysis are prior, likelihood, and posterior. Bayesian analysis specifies these elements as probability distributions.

In data analysis, we commonly speak of responses that are observed and parameters to be estimated. The prior distribution describes what is known about the parameters from external information. The likelihood describes how the observations are generated from parameters. Because data are unknown before they are observed and they are generated by parameters, the likelihood is a distribution conditional on parameters. The posterior distribution is the inverse of the likelihood: it is the probability for parameters after data are observed. Before saying more about the Bayesian model specifically, I introduce the concept of hierarchical modeling, which can be viewed as a general framework for Bayes.

Hierarchical models organize distributions for prior, likelihood, and posterior into levels or stages. These are the knowns (data and priors) and the unknowns (latent processes and parameter values). It is natural to think of inference this way:

\[[\mbox{unknowns | knowns}] = [\mbox{process, parameters | data, priors}]\]

In the generic notation for a distribution, \([A]\) is the probability or density of event \(A\), \([A, B]\) is the joint probability of \(A\) and \(B\), and \([A|B]\) is the conditional probability of \(A\) given \(B\). Again, anything to the left of the vertical bar is random. Anything to the right is fixed. If there is no vertical bar, everything is random. The Appendix summarizes notation.

This structure from Mark Berliner (example here) comes naturally to a Bayesian. It can be unwieldy to the non-Bayesian and can involve improper distributions (e.g., when omitting prior distributions). A hierarchical model typically breaks this down as

\[[\mbox{process, parameters | data, priors}] \propto [\mbox{data | process, parameters}] \; [\mbox{process | parameters}] \; [\mbox{parameters | priors}]\]

The left hand side is the joint distribution of unknowns to be estimated; it is the posterior distribution. This hierarchical structure is amenable to simulation, using Markov Chain Monte Carlo (MCMC). Gibbs sampling is a common MCMC technique, because the posterior can be factored into smaller pieces that can be dealt with one at a time, as simpler conditional distributions.

The elements of the hierarchical model relate directly to Bayes’ theorem. To simplify, consider a joint distribution of data \(D\) and parameters \(P\),

\[ [D, P] = [D | P][P] = [P | D] [D] \] Setting the two expressions on the right-hand side equal to one another and rearranging, I obtain Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] which can be read as “the probability of parameters given data”.

Exercise: I am tested for covid and am diagnosed as not-infected. Use Bayes’ theorem to write down the probability, in generic form, that I have covid.

Steps:
  1. Decide a) what is observed and what is unknown, and b) what is random and what is fixed
  2. Write a generic distribution for each component
  3. Make appropriate substitutions in Bayes’ theorem

Prior to posterior

I would like to estimate the mean \(\mu\) of a set of \(n\) measurements \(y_i, i = 1, \dots, n\), using a Bayesian analysis assuming that observations are independent and normally distributed. The variance in observations \(\sigma^2\) is known from previous studies (e.g., instrument error).

I can think about this problem as a graph:

A graph for the model having known observations \(y_i\), a known parameter \(\sigma^2\), known prior parameter values \(m\) and \(M\), and unknown \(\mu\) that I want to estimate.

The likelihood for the observations \(\mathbf{y}\) is the conditional distribution \([D | P]\). For this normal distribution I write it as

\[[D|P] = [\mathbf{y} | \mu, \sigma^2] = \prod_i^n N(y_i | \mu, \sigma^2)\] I am assuming the observations to be independent and identically distributed. They are independent, because I am assuming that the joint distribution \([y_1, \dots, y_n]\) is the same as the product of probabilities taken independently,

\[\begin{align} [y_1, \dots, y_n] &= [y_1] \times \dots \times [y_n] \\ &= \prod_{i=1}^n[y_i] \end{align}\] They are identically distributed, because I assume that each has the same parameter values for \(\mu, \sigma^2\).

Now from the last section I have Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I see that I also need a prior distribution for the mean, which I also take to be a normal distribution,

\[ [P] = N(\mu | m, M) \] For the moment I overlook the factor \([D]\) in the denominator of Bayes’ theorem, because we are focused on \(\mu\), which does not appear in \([D]\). So for now, I write

\[ \begin{align} [P|D] = [\mu | \mathbf{y}, \sigma^2] &\propto [D|P][P] \\ &=\prod_i^n N(y_i | \mu, \sigma^2) \times N(\mu | m, M) \end{align} \] For reasons we will see later, this posterior distribution is also normal,

\[ [\mu | \mathbf{y}, \sigma^2] = N \left( \mu | vV, V \right) \] where the parameters are given by

\[ \begin{align} v &= \frac{n }{\sigma^2}\bar{y} + \frac{m}{M} \\ V &= \left( \frac{n }{\sigma^2} + \frac{1}{M} \right)^{-1} \end{align} \] You might recognize the posterior mean \(vV\) to be a weighted average of the data, with mean \(\bar{y}\), and the prior, with mean \(m\). The weights are the inverse variances, the data precision \(n/\sigma^2\) and the prior precision \(1/M\). In fact, as \(n\) gets large (many observations) and \(M\) gets large (uncertain prior mean), the posterior mean converges to \(vV \rightarrow \bar{y}\) and the posterior variance to \(V \rightarrow \sigma^2/n\). These are the traditional estimates of the mean and the standard error, i.e., the square root of the posterior variance \(\rightarrow \sigma/\sqrt{n}\). Here is a plot of prior and posterior:

n  <- 5                      # number of observations
mu <- 1.8                    # true mean used to simulate data
m  <- 0                      # prior mean
M  <- 1                      # prior variance
sigma <- 2                   # known observation SD
mseq  <- mu + seq(-5, 5, length=100)   # grid of mu values for plotting

y      <- rnorm(n, mu, sigma)          # simulate observations
my     <- mean(y)
postVr <- 1/(n/sigma^2 + 1/M)          # posterior variance V
postSd <- sqrt( postVr )
postMu <- (my*n/sigma^2 + m/M)*postVr  # posterior mean vV

prior     <- dnorm( mseq, m, sqrt(M) )       # prior density N(m, M)
posterior <- dnorm( mseq, postMu, postSd )   # posterior density

plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
      ylab = 'Density (1/mu)', col = 'brown' )
lines( mseq, prior, col = 'grey', lwd = 2 )
abline( v = my, lty = 2 )                                                     # data mean
segments( m, 0, m, dnorm(m, m, sd = sqrt(M) ), col = 'grey' )                 # prior mean
segments( postMu, 0, postMu, dnorm(postMu, postMu, postSd ), col = 'brown' )  # post mean

Note how the posterior mean lies between the prior mean and the data mean, mean(y) = 1.88.

The uncertainty in a parameter estimate is often summarized by a credible interval defined as values that bound a proportion of the posterior distribution. Credible intervals are often defined for 95%, i.e., at probabilities bounded by quantiles \((0.025, 0.975)\).

The 95% credible interval for \(\mu\) is obtained from the posterior distribution, which in this case is normal,

ciProb <- c(0.025, 0.975)                                            # probabilities
ci     <- qnorm( ciProb, postMu, postSd )                            # quantile values
plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
      ylab = 'Density (1/mu)' )
segments( ci, 0, ci, dnorm(ci, postMu, sd = postSd ), lty=2 )

This normal distribution has 2.5% in each of the two tails, bounded by two values in the vector ci = (-0.264, 2.35).

Exercise: A weighted average for two values of a variable \(x\) is \(\bar{x} = (w_1x_1 + w_2x_2)/(w_1 + w_2)\), where \(w_i\) is the weight for the \(i^{th}\) value of \(x\). In the posterior distribution for the mean estimate \(N \left( \mu | vV, V \right)\), identify \(x_1, x_2, w_1\) and \(w_2\) in the equations for \(v\) and \(V\). What is the effect of increasing \(n\) on the posterior mean and variance in the above example and why?

Exercise: If the data are noisy (large \(\sigma^2\)), is it still possible to obtain a precise estimate of \(\mu\)? If so, how?

If Bayes were always as simple as this problem suggests it would have captured much of the modeling space decades ago. I previously skipped over a sticky detail. Returning to Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I discussed everything but the denominator, a marginal distribution of the data, which is the source of all sorrow in Bayesian analysis. For this expression to be true, this distribution must involve integration,

\[ [D] = \int [D | P][P] dP \] Recall that multiplication is easy, and division is hard. In calculus, the tables are turned. The calculus counterpart of division is differentiation, and the counterpart to multiplication is integration. However, differentiation is much easier than integration. This integral is known for the normal likelihood \(\times\) normal prior example above, but it is not known for most likelihood-prior combinations.
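
For this normal-normal example the integral can also be approximated numerically; here is a minimal sketch using integrate(), reusing the simulated y and the values of sigma, m, and M from the code block above.

# marginal [D]: integrate likelihood x prior over mu (normal-normal example above)
likeTimesPrior <- function(mu){
  sapply( mu, function(m0) prod( dnorm( y, m0, sigma ) )*dnorm( m0, m, sqrt(M) ) )
}
Dmarg <- integrate( likeTimesPrior, lower = -10, upper = 10 )$value
Dmarg                                                                # the normalizing constant [D]
integrate( function(mu) likeTimesPrior(mu)/Dmarg, -10, 10 )$value    # posterior integrates to 1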

Posterior simulation techniques like Markov chain Monte Carlo (MCMC) offered ways to approximate this integral and more. Beyond resolving the integration problem, Gibbs sampling (Gelfand and Smith 1990) allowed for hierarchical expansion of models. After gaining familiarity with simple Bayes in the first few chapters, we’ll devote most of the semester to hierarchical problems.

Bayes factor

The Bayes factor (BF) is used to evaluate models relative to one another. From the posterior distribution we can evaluate a model probability conditional on the data, \([M | D]\), and compare it with the prior probability for the model, \([M]\). The ratio of posterior to prior odds for two models is

\[ BF = \frac{[M_1 | D] \, / \, [M_1]}{[M_2 | D] \, / \, [M_2 ]} = \frac{[D | M_1]}{[D | M_2]} \]

A BF value much greater than 1 finds support for model 1 and vice versa. Obtaining \([M | D]\) requires evaluating an integral. We discuss this further later in the semester.
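
As a small numerical illustration (a sketch, not part of the text), the marginal likelihoods can be approximated for two hypothetical models of the simulated y above: model 1 with the N(m, M) prior on \(\mu\), and model 2 that fixes \(\mu = 0\).

# BF as a ratio of marginal likelihoods for two hypothetical models of y
margM1 <- Dmarg                              # [D | M1]: mu unknown with prior N(m, M), from above
margM2 <- prod( dnorm( y, 0, sigma ) )       # [D | M2]: a competing model that fixes mu = 0
margM1/margM2                                # BF >> 1 supports model 1; BF << 1 supports model 2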

Regression as an example of simple Bayes

A regression model contains a response, predictors, parameters, and residual (‘unexplained’) variation. Least squares minimizes error, maximum likelihood finds parameters that make the data most likely, and Bayes evaluates a probability distribution of parameters.

This is a good time to read this section of the appendix. I introduce a subscript \(i = 1, \dots, n\) to indicate the \(n\) observations of a response \(y_i\) with corresponding predictors \(\mathbf{x}_i\). The index \(i\) is exchangeable, meaning that the order of observations doesn’t matter. The responses make up a length-\(n\) vector \(\mathbf{y}\). The predictors make up an \(n \times p\) matrix \(\mathbf{X}\), where \(p\) is the number of parameters to be estimated in a vector \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'\).

A traditional regression model looks like this:

\[ \begin{aligned} y_i &= \mu_i + \epsilon_i \\ \mu_i &= \beta_1 + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} = \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] (again, notation is summarized in an Appendix.) There is an observed response \(y_i\) and predictors in a length-\(p\) design vector \(\mathbf{x}_i\). Here is an example of a design vector for an intercept and two slope variables,

\[ \mathbf{x}_i = (1, x_{i,2}, x_{i,3})' \]

Note that the first element of the design vector is \(x_{i,1} = 1\), corresponding to the intercept for the model. The error has an expectation \(E[\epsilon] = 0\) and variance \(Var[\epsilon] = \sigma^2\). This is called a linear model, because it is a linear function of the parameters \(\boldsymbol{\beta}\).

Typically, I want to estimate the parameters in the model. To do this, I have to decide on a criterion that constitutes a ‘good fit’. The simplest assumption could be to find parameter values that minimize the left over (‘residual’) variation. Without specifying a distribution for \(\epsilon\), I could minimize the expected value of the squared error,

\[mse = \min_{\boldsymbol{\beta}} E[\epsilon^2] = \min_{\boldsymbol{\beta}} E\left[ \sum_i^n (y_i - \mathbf{x}'_i \boldsymbol{\beta})^2 \right]\]

with respect to parameters \(\boldsymbol{\beta}\). Within the brackets I am summing the squared difference between the observed \(y_i\) and the value that would be predicted by the model, \(\mathbf{x}'_i\boldsymbol{\beta}\). By taking derivatives with respect to \(\boldsymbol{\beta}\), setting them equal to zero, and solving for \(\boldsymbol{\beta}\), I obtain the least-squares estimates \(\mathbf{\hat{\boldsymbol{\beta}}}\). This is the method of least-squares.
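
Here is a minimal sketch of the least-squares solution in R, using a small simulated data set; the sample size, design, and coefficient values are arbitrary choices for illustration.

# simulate a small regression data set (illustrative values)
n    <- 100
X    <- cbind( 1, rnorm(n), rnorm(n) )             # design matrix: intercept plus two predictors
beta <- c(2, 1, -0.5)                              # coefficients used to simulate the data
y    <- X %*% beta + rnorm( n, 0, 1 )              # response with N(0, 1) error

bHat <- solve( t(X) %*% X ) %*% t(X) %*% y         # least-squares estimates
cbind( bHat, coef( lm( y ~ X[ ,2] + X[ ,3] ) ) )   # agrees with lm()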

Rather than minimizing residual error, maximum likelihood looks for a set of parameters that make the observations most likely given the model. I say most likely rather than most probable, because the parameters are not specified with a probability distribution (that would be Bayesian). The likelihood is a probability model for the observations, not for the parameters.

If I specify a normal distribution for errors, \(\epsilon \sim N(0, \sigma^2)\), I can write a likelihood function for the observations. Here again, \(N\) indicates a normal distribution, with mean zero and variance \(\sigma^2\). Due to special properties of the normal distribution, if \(\epsilon \sim N(0, \sigma^2)\) and \(y = \mu + \epsilon\), then \(y \sim N(\mu, \sigma^2)\), where parameters are \(\mu\) (mean) and \(\sigma^2\) (variance).

Because the regression model assumes independent observations, I can write the likelihood function as a simple product over the observations:

\[L(\boldsymbol{\beta}, \sigma^2) = \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] Once again, the independence assumption means that the probability of all \(n\) observations is equal to the product of the probabilities of each individual observation,

\[\prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) = N(y_1 | \mathbf{x}'_1 \boldsymbol{\beta}, \sigma^2) \times \dots \times N(y_n |\mathbf{x}'_n \boldsymbol{\beta},\sigma^2)\]

The maximum likelihood estimate comes from maximizing the likelihood function with respect to parameters. I obtain the same estimates as I did with least squares. This is true if \(\epsilon\) is normally distributed. For least squares and this maximum likelihood approach, the likelihood is treated as a function of the parameters. Again, the parameters are not defined probabilistically.
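
Maximizing this likelihood numerically gives essentially the same coefficients as least squares; here is a minimal sketch using optim() on the negative log likelihood, continuing with the simulated X and y from the least-squares sketch above.

# negative log likelihood for (beta, log sigma), using the simulated X and y above
negLogLik <- function(par){
  b  <- par[1:3]
  sd <- exp( par[4] )                               # log scale keeps the SD positive
  -sum( dnorm( y, X %*% b, sd, log = TRUE ) )
}
fit <- optim( c(0, 0, 0, 0), negLogLik )
fit$par[1:3]                                        # close to the least-squares bHat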

The Bayesian view differs from traditional approaches by introducing a prior distribution that I now consider in the context of the regression model. The prior distribution has two important functions:

  • It allows for information outside the data at hand.

  • It allows us to derive a posterior distribution, i.e., a probability distribution for parameters.

Recall Bayes’ theorem:

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] where \(D\) represented data, and \(P\) represented parameters. For the regression model the parameters are \(P = (\boldsymbol{\beta}, \sigma^2)\), and the data are \(D = (\mathbf{y}, \mathbf{X})\). Although both predictors \(\mathbf{X}\) and responses \(\mathbf{y}\) are observed, the regression model assumes that \(\mathbf{X}\) is part of a fixed design–it is not random. For this reason, we do not write a distribution for \(\mathbf{X}\), but rather we ‘condition’ on the fixed values in \(\mathbf{X}\). Making these substitutions, Bayes’ theorem becomes

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}] = \frac{[\mathbf{y} | \boldsymbol{\beta}, \sigma^2] [\boldsymbol{\beta}, \sigma^2]}{[\mathbf{y}]} \]

Again, the posterior distribution of parameters is on the LHS. The denominator is the marginal distribution of the data, \([\mathbf{y}]\). As in the example for the simple mean, I’ll ignore it for now.

As before, the left-hand side is “parameters given data”, or the posterior distribution, and the right-hand side has likelihood \(\times\) prior in the numerator. In Bayesian analysis the likelihood is treated not as a function to be optimized, but rather as a distribution for the data that must be combined with the prior to yield a distribution for parameters (the posterior). The second factor on the right indicates an (unspecified) prior distribution for parameters, this time for the regression coefficients. Technical details follow later; for now I focus on general concepts.

Substituting for the likelihood and leaving the prior distribution unspecified (for now), I have

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}] \propto \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) [\boldsymbol{\beta}, \sigma^2] \]

If the prior distribution is flat, the posterior distribution for \(\boldsymbol{\beta}\) will have the same shape as the likelihood. Thus, parameter estimates will have the same marginal means and standard errors as I obtained by maximum likelihood. The impact of the prior distribution becomes important when it is made informative and as models increase in size.

Model graph

Models can be represented as graphs. Model elements are connected by arrows that establish the relationships between variables that guide model building.

Below are three graphs, each describing a different model. The Bayesian regression model is shown in (a).

Three regression variations in graphical form. Circles are random. Squares ‘anchor’ the fit–observed variables \(X\) and \(Y\) and prior parameter values are known. a) In traditional regression, there are unknown parameters, but they are not given a probability model. b) Simple Bayesian regression has a probability model for parameters. Once response \(Y\) is observed, it is fixed. c) The hierarchical model has an additional stochastic stage.


In this diagram, there are boxes, circles, dashed lines, and solid lines. The graph provides a visual interpretation of the model. I discuss symbolism in the next sections.

Known and unknown variables: Variables are nodes in the graph. The known variables are boxes, including data and prior parameter values. They anchor the model fit, because they are fixed. The unknown variables and parameters are circles. In the regression example, these are the regression coefficients, the length-\(p\) vector \(\boldsymbol{\beta}\), and the residual variance \(\sigma^2\).

Stochastic and deterministic variables: Both the response \(y_i\) and the predictors \(\mathbf{x}_i\) are known. However, the predictors are treated as deterministic, whereas the response is treated as stochastic before it is observed. Specifically, \(y_i\) is assigned a distribution, the likelihood. The vector \(\mathbf{x}_i\) is not assigned a distribution. In observational studies it is not always clear what should be a predictor and what should be a response.

Stochastic and deterministic arrows: Solid arrows in the graph indicate stochastic relationships, meaning that knowledge of a parent node does not tell us the value at the child node (or vice versa). Deterministic relationships can often be omitted from the graph. One deterministic relationship is shown in part (b) as a dashed line. This example will be discussed below.

Some additional background concepts

In this section, I introduce some concepts that will come up again this semester.

Terms used to classify models

The terms process model and statistical model represent a false dichotomy.

Hierarchical models typically include both elements. They are used for inference and to predict processes and data, with uncertainty. To take a simple recent example, development of spring phenological state could be written as a rate equation,

\[\frac{dh}{dt} = r(t; \mathbf{x}(t)) \left(1 - h(t) \right)\]

for development rate \(r(t)\) that changes over time, in part due to environmental changes contained in \(\mathbf{x}(t)\). Most ecologists would call this a ‘process model’. However, when discretized, written in terms of predictors, and combined with a data model for ordinal observations, it’s used for inference (Clark et al. 2014). Hierarchical modeling has made the distinction between statistical and process models a false dichotomy.

A generative model goes forward and backward; it predicts the data that fitted it.

Prediction and simulation are the forward direction: start with parameters and generate data. Model fitting is the inverse direction. Generative models are needed if the model is to be evaluated with prediction–can model fitting recover the parameters used to simulate the data?

Many models are not generative. When they are not, evaluation is difficult. In this course prediction plays a large role in model checking, so we focus on generative models.

Observation and experiment

Design determines the information that data have to offer model fitting and lends its name to the design matrix in ANOVA and regression. The common elements of a study design include replication, stratification, and randomization (RSR).

Information can be evaluated in many ways, such as the precision a model provides on parameter estimates or predictions. There is a large literature on ‘optimal design’ and how it benefits from RSR.

Studies can be observational (variables controlled only to the extent possible from sample deployment) or experimental (variable manipulation). The model for experimental data may be a model for the design (e.g., ANOVA), but it doesn’t have to be. Experiments may be established to evaluate many processes, each analyzed with a different model.

Observational data have an advantage in being available at scales that could not be studied with controlled experiments. Large observational networks, some global, are examples. Weather station data, now combined with satellite imagery, are the basis for modern understanding of ocean and atmospheric circulation. Epidemiological evidence for public health trends is widely used in the social sciences, as are polling data in public policy. Disease surveillance can be active or passive, the latter being easier to get, but harder to use.

The big concern with observational data is how to attribute cause and effect. As many variables change, interacting with one another as well as with unmeasured variables, there may be no way to identify what’s affecting what. Are people healthy because they exercise, or does being healthy motivate exercise?

It can be hard to even decide what goes on the right and the left side of the model. Does high mineralization rate promote N-demanding plants and microbes, or vice versa? What do I do when I know that both are true?

Experiments benefit from manipulation and control. Treatment effects might be isolated when other variables are controlled. Control is critical to inference in many settings.

Experimental results can be difficult to extrapolate outside the experimental frame, which could be the laboratory, growth chamber, greenhouse, or small field plot. Only a fraction of important questions are amenable to experimentation. Many ecologists and social scientists look for ‘natural experiments’.

Model fitting is a balance between data and model.

Many arguments about climate change focus on global mean temperatures, obtained by averaging station data that go back over a century. This seems to be a data-driven debate (“what do the data say?”)–until we confront the fact that the mean of the stations is not the global mean, because stations are not uniform over the globe. Models are used to translate unevenly distributed data to a less biased estimate of regional climate. The PRISM data are an example.

Dependence in data

I previously said that traditional data modeling assumes that observations are i.i.d. (independent and identically distributed). Many samples do not contain independent observations. For example, eBird counts in Sandy Creek, NC and Alexandria, VA might reasonably offer independent information. However, repeated eBird counts from Sandy Creek, NC last Saturday might be largely redundant–each observation does not offer new information. Likewise, if birds are migrating through an area, observations obtained in different weeks might offer more information than observations taken on successive days.

The i.i.d. assumption overestimates the information content in observations when those observations depend on one another.

Spatial data may depend on nearby observations. Note that arrows point in both directions.

There is often serial dependence in time-series data.
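
A quick simulation of serial dependence (an illustrative sketch, not from the text), using an AR(1) process with an arbitrary autocorrelation of 0.8:

# simulate serially dependent observations: AR(1) with autocorrelation 0.8
w <- arima.sim( model = list(ar = 0.8), n = 200 )
acf( w, main = 'serial dependence in a simulated time series' )   # correlation decays slowly with lag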

Building dependence into traditional models is hard. Hierarchical modeling solves this problem, because dependence can come simply from adding a stage, as shown in the graphical model for regression.

Multiplicity corrections

Multiple testing refers to questions that depend on more than one hypothesis test.

I might want to know if the abundances of microbial taxa decline with salinity in estuaries, so I fit regressions to concentrations of each of thousands of OTUs (operational taxonomic units). We expect to reject the null at \(\alpha = 0.05\) for 5% of tests where the null hypothesis is true. A traditional Bonferroni correction divides the \(\alpha\)-value threshold by the number of tests, based on the assumption that tests are independent. By building models hierarchically, Bayesian analysis sidesteps this problem–the dependence between ‘tests’ is built into the model.
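
A quick illustration of the multiplicity problem and the Bonferroni correction, using simulated tests of a true null hypothesis (values chosen for illustration):

# 1000 t-tests of a true null: about 5% are 'significant' at alpha = 0.05
pvals <- replicate( 1000, t.test( rnorm(20) )$p.value )
mean( pvals < 0.05 )                                       # roughly 0.05 false positives
mean( p.adjust( pvals, method = 'bonferroni' ) < 0.05 )    # Bonferroni: essentially none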

Ecological fallacy/Simpson’s paradox

The ecological fallacy and Simpson’s paradox refer to the fact that inference at one scale or level of organization may not represent relationships at another.

It’s not hard to find examples where individual and group behavior appear to show different relationships. In public policy, we now know that rich states vote Democratic, while rich people vote Republican (Gelman 2009). In the last few decades, the mean household income has increased, while the income of the mean household has declined. Ecological communities (species distributions and abundances) are poorly predicted by aggregating predictions for individual species (Clark et al. 2011). These are examples of ‘Simpson’s Paradox’.

Conversely, models of total numbers of species do not predict extinction threat, which operates at the scale of individual species, an example of the ‘ecological fallacy’. The ecological fallacy refers to the fact that group behavior does not predict individuals within groups (rich people do not vote like rich states). Gerrymandered voting precincts exploit this fact.
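
A small simulation of Simpson’s Paradox (an illustrative sketch with made-up group structure): within each of three hypothetical groups the relationship between x and y is negative, yet the pooled relationship is positive because group means differ.

# Simpson's Paradox: negative slope within groups, positive slope when groups are pooled
groupMean <- c(0, 3, 6)                            # group-level means for x (and y)
g <- rep( 1:3, each = 50 )                         # group membership
x <- groupMean[g] + rnorm( 150 )
y <- groupMean[g] - 0.5*( x - groupMean[g] ) + rnorm( 150, 0, 0.3 )

coef( lm( y ~ x ) )['x']                           # pooled slope: positive
sapply( 1:3, function(k) coef( lm( y[g == k] ~ x[g == k] ) )[2] )   # within-group slopes: near -0.5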

Exercise. What are the three elements of a design? What does each contribute?

Recap

Am I a Bayesian? Many who make use of Bayesian models and software would not necessarily see themselves as particularly Bayesian. Models and software may be selected for convenience or other considerations that have little to do with philosophy. Avowed non-Bayesians may exploit MCMC methods, while making a point of omitting priors. Staunch Bayesians may sometimes omit priors where they do not affect results or interpretation, but include them elsewhere. Users of packages like INLA may not know that it uses prior distributions (it’s Bayesian). Many big data applications bridge machine-learning and statistical worlds, exploiting approximations of many types. Bayes has expanded rapidly in large part for pragmatic reasons of simplicity and capacity to address large problems.

Bayesian analysis combines a prior distribution with a data model, or likelihood, to arrive at a posterior distribution of unknown parameters and latent variables. Deterministic variables are constant, whereas stochastic variables are defined by distributions. Known variables are constant, but they can be stochastic and, thus, can be defined by distributions, such as the response \(Y\) in the regression model. Graphs are used to communicate model structure and organize algorithms. They provide a visual representation of hierarchical models. A common hierarchical structure is [data | process, parameters][process | parameters][parameters].

Basic regression models provide an opportunity to introduce similarities and differences. Least-squares, maximum likelihood, and Bayes share some common features at this simplistic level, but diverge as models increase in size.

Both experimental and observational data are analyzed in environmental sciences. They differ in design, control, and, often, scale. Fundamental design considerations include randomization, stratification, and replication.

Model fitting and prediction are ‘backward’ and ‘forward’ views. Generative models do both and, thus, can be fully evaluated (in both directions).


Appendix

Note about R environments

To fully understand the block of code for the lm fit, I need to know that the variable biomass is defined in my global environment (see previous code block). I can enter length(biomass) and get an answer, because biomass exists in the working environment. I have not assigned the variables deficit and moisture in my global environment. They only exist within the data.frame data. When I call the function lm it knows to look for variables in my global environment or in data, because I have passed data as an argument. The functions plot and points do not look for variables in this way. When I call them, I must specify the data.frame with the variable name, data$deficit. Using R is the subject of Unit 2.

Notation

Here are some notation conventions used in these vignettes.

notation example interpretation
italic \(x\), \(X\) scalar quantity, known
greek \(\alpha, \beta, \dots\) stochastic (fitted) variable, unknown
parentheses \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) parametric function/density
curly brackets \(\{0, 1, \dots\}\) a set of objects
closed interval \((-\infty,0]\), \([0, \infty)\) intervals include zero
open interval \((-\infty,0)\), \((0, \infty)\) exclude zero
distributed as \(x \sim N(\mu, \sigma^2)\) distribution or density
expectation \(E[\epsilon] = 0\) expected value of a variable
variance \(Var[\epsilon] = \sigma^2\) variance of a variable
bracket, distribution \([A, B, C]\) unspecified density or distribution
proportional \(f(x) \propto g(x)\) differ by a constant, \(f(x) = c g(x)\)
approximate \(\pi \approx 3.14159\) approximately equal
real numbers \(\mathbb{R} = (-\infty, \infty)\) note: also positive real \(\mathbb{R}_+\)
is an element of \(\pi \in \mathbb{R}\) \(\pi\) is a real number
subset \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) in the set
proper subset \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) cannot include all elements
union \(a \cup b\) either \(a\) or \(b\)
intersection \(a \cap b\) both \(a\) and \(b\)
sum \(\sum_{i=1}^n x_i\) \(x_1 + \dots + x_n\)
product \(\prod_{i=1}^n x_i\) \(x_1 \times \dots \times x_n\)
exponentiate \(e^x\), \(exp(x)\) \(e \approx 2.71828\), inverse of \(ln(x)\)
natural logarithm \(ln(x)\) inverse of \(e^x\)

Matrices

notation example interpretation
bold, l.c. \(\mathbf{x}\), \(\mathbf{x}'\) column and row vectors, respectively
bold, u.c. \(\mathbf{X}\) matrix
dimensions \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) \(n\) rows, \(q\) columns
subscript \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) element of an array
row vector \(\mathbf{X}_{i\cdot}\) row \(i\)
column vector \(\mathbf{X}_{\cdot j}\) column \(j\)
transpose \(\mathbf{X}'\) rows become columns
matrix inverse \(\mathbf{A}^{-1}\) solve systems of linear equations
identity matrix \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) \(p \times p\) with 1’s on the diagonal, zeros elsewhere
matrix determinant \(det\left( \mathbf{A} \right)\) obtain for a square matrix, e.g., covariance
kronecker product \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) \(np \times mq\) matrix, defined in text

logs and exponentiation:

  • these are equivalent: \(exp(x) = e^x\)

  • \(\log( e^x ) = x\) and \(e^{ \log x } = x\)

  • \(exp(y_1) exp(y_2) = exp(y_1 + y_2)\)

  • \(\prod_{i=1}^n exp( y_i ) = exp \left( \sum_{i=1}^n y_i \right)\)

  • \(log( a/b ) = log(a) - log(b)\)

Simple derivatives

  • \(\frac{d}{dx} \left( x^a \right) = a x^{a-1}\)

  • \(\frac{d}{dx} \left( e^x \right) = e^x\)

  • \(\frac{d}{dx} \left( e^{f(x)} \right) = \frac{df}{dx} e^{f(x)}\)

Some sample statistics

The notation \(E(y)\) refers to the expectation of \(y\). For random data it is the sample mean. We do not simply refer to it as a mean, because it is more general: a probability distribution has an expectation, even if there are no data.

  • mean: \(E(y) = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\)

  • variance: \(Var(y) = E(y^2) - E^2(y) =\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^n y^2_i - \bar{y}^2\)

  • covariance: \(Cov(x,y) = E(xy) - E(x)E(y) = \frac{1}{n} \sum_{i=1}^n x_i y_i - \bar{x} \bar{y}\)
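
These sample formulas can be checked directly in R; note that the built-in var() and cov() divide by \(n - 1\) rather than \(n\), so the rescaling below matches the formulas above.

# check the sample formulas (var() and cov() in R use n - 1, not n)
x <- rnorm( 10 ); y <- rnorm( 10 ); n <- length( y )
c( mean( y ), sum( y )/n )
c( var( y )*(n - 1)/n, mean( y^2 ) - mean( y )^2 )
c( cov( x, y )*(n - 1)/n, mean( x*y ) - mean( x )*mean( y ) )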

Models we will encounter

Univariate response models include the following:

Name response additional attributes
linear (LM) normal linear in parameters
linear mixed model (LMM) normal LM with random effects
generalized linear model (GLM) discrete linear in parameters on link scale
logistic regression binomial GLM includes logit link
probit regression binomial GLM includes probit link
Poisson regression Poisson GLM typically with log link
mixed GLM (GLMM) binomial, Poisson, … GLM with random effects

Multivariate response models have a vector of responses:

Name response additional attributes
linear (LM) MVN linear in parameters
categorical multinomial multiple classes, one outcome, MV logit or probit link
multinomial multinomial multiple classes, multiple outcomes
generalized joint attribute model (GJAM) all types linear in parameters

Time series have dependence in time. They can be handled with state-space models, normal or not, for continuous states. The Kalman filter is a simple (normal-normal) state-space model. Hidden Markov model is a term most often applied to discrete states. Autoregressive models are normal and have dependence on \(p\) previous times, e.g., AR(\(p\)).

Spatial models have dependence in space. They can be LM or GLM with \(n = 1\) and an \(m \times m\) covariance matrix \(\Sigma\), where \(m\) is the number of locations. Kriging is a traditional spatial model for continuous space. Spatial autoregressive models are used where space is viewed as discrete blocks (e.g., census tracts, counties, …).