The emergence of large data sets in environmental science has changed statistical analysis, placing more focus on data wrangling and algorithmic approaches, especially Bayes. Bayesian analysis departs from alternative methods in its application of probability to all aspects of model fitting, with side benefits of simplifying interpretation. The elements of a Bayesian analysis include distributions for prior, likelihood, and posterior. Hierarchical models emerge naturally in the Bayesian framework as a means for analyzing high-dimensional problems without requiring a change in approach. Graphs help to organize hierarchical modeling. Basic concepts are introduced using regression.

Logistics

For next time

Discussion reading: select two papers and come prepared to discuss them. Post questions for discussion on sakai resources/groupDocs.

Today’s plan

  • Breakout: Identify groups and moderator for Discussion next time

  • Plenary: Foundation materials in Unit 1

Objectives:

  1. Recognize and generate notation for a simple model
  2. Identify the basic elements of a Bayesian model and its probability interpretation
  3. Identify the deterministic vs stochastic variables in a model
  4. Interpret a hierarchical model
    • construct a simple graphical model
    • assemble parameters, process, and data as a hierarchy
    • describe a regression model with notation and graph
  5. Articulate advantages and disadvantages of observational and experimental evidence
  6. Define Simpson’s Paradox and identify when it could be operating
  7. Identify key advantages and limitations of artificial intelligence


Two themes will repeat throughout this course: big data and Bayesian analysis. Statistical analysis traditionally focused on canned software applied to simple (and small) data sets containing tens to hundreds of observations and several variables. Linear regression and ANOVAs led to \(P\)-values that determined whether or not a study was publishable. Large observational data sets are now commonly wrangled from the internet and used to ask questions that require new tools for data manipulation and hierarchical model building. Increasingly, machine learning methods process large data files, but often without clear process interpretation or tractable treatment of uncertainty. Hierarchical models offer the potential to integrate multiple types of data, while preserving a focus on science–How does the process work? Can we estimate the relationships and predict them? Can we assign a probability statement to the result? Theory is needed for model building. Algorithms are the basic tools needed both for data manipulation and model evaluation.


Goals of a Bayesian analysis

A Bayesian analysis serves several purposes:

In Bayesian analysis, all of these are based on, and expressed as, probability distributions.

Why probability?

Uncertainty is one of the most important concepts in science, and probability is one way to communicate uncertainty. It is closely linked to the notion of reproducibility in science. How confident am I that the outcome of an experiment or observational study would agree with future studies? Probability, a value between zero and one, expresses confidence or “belief”. Maximum uncertainty (ignorance) is expressed by a probability of 1/2, e.g., a coin toss.

Probability has an advantage over other ways of describing uncertainty, because it can be quantified and tracked through combinations of events, e.g., “will it rain tomorrow” and, if so, “will I remember to carry an umbrella”? Distribution theory allows us to put these events together.

What makes it Bayes?

In Bayesian analysis, all quantities are assigned probability distributions. Data analysis involves i) a process, represented by a model (that summarizes how it works), ii) evidence, which comes in the form of observations, or data, and additional prior information that needs consideration, and iii) unknowns, or parameters, that must be estimated by combining all of the above.

All of us have taken a traditional statistics course, often termed frequentist statistics, but not everyone has been exposed to a Bayesian perspective. Because traditional statistics did not derive a probability distribution for unknown parameters, it developed a ‘frequency’ interpretation for point estimates–I could have a distribution of point estimates if I were to repeat an identical experiment many times. In other words, I might do an end-run around the need for probability with a concept of frequency. This approach gets harder to apply as problems become more complex. The methods can also become unstable. The probabilistic basis for Bayesian analysis makes outputs intuitive and defensible. Complexity is added hierarchically. In the next section I introduce some of the basics and connect them to some of the more familiar topics you learned about in an intro statistics course.

Structure leads to hierarchical models

Structure in data is common and can be exploited.

The expansion of modern Bayes has gone hand-in-hand with hierarchical modeling. Hierarchical models simplify the synthesis of observations with processes and parameters. They commonly exploit the Bayesian paradigm. Hierarchical models don’t have to be Bayesian, but they usually are. After defining a few terms I introduce these concepts together.

Variables are continuous or discrete. Time is a continuous variable, because it can be measured to a degree of precision that depends on the measurement device. Temperature is likewise continuous. Population size is a discrete variable, limited to non-negative integer values \(0, 1, \dots\).

State variables describe how a process develops in space, time, or both. Population size, degree days, and energy expenditure are examples of state variables. Like time and space, state variables can be continuous or discrete. Again, population size is discrete, taking values on the non-negative integers.

Continuous and discrete variables are not always treated that way in models. For example, I can choose to reference continuous variables with discrete values, such as hour, day, month, year, or epoch. Likewise, I can choose to reference discrete variables with continuous values, such as population density (number/area). Population density is a continuous variable, taking values on the non-negative real numbers, but with a discrete zero, \(\{ 0 \} \cup \mathbb{R}^+\). In this notation, \(\mathbb{R}\) refers to the (continuous) real numbers; in practice, recorded values are truncated to a specific number of significant figures. The value of \(\pi\) truncated to six significant figures is 3.14159.

A state variable can be structured into continuous (size or age) or discrete (size or age class) states. Age is often taken to the nearest (discrete) year. Relatedly, developmental stage can summarize discrete structure, such as young-of-the-year, juvenile, and adult; or seedling, sapling, codominant, and dominant. Structure in time can be represented as days within seasons within years. Time can be structured by ‘active periods’ for insects or census intervals for trees or humans–the US census has occurred every 10 yr since 1790. Spatial structure can be discrete, as defined by boundaries of watersheds, estuaries, soil types, and land cover types. Voting districts, zip codes, and states are discrete structures. Even where space and time are treated as continuous variables, they may be structured within discrete units.

When I want to refer to continuous structure, I put the continuous variable in parentheses, such as abundance as a function of the continuous variable time \(t\), \(y(t)\), or location \(x\), \(y(x)\). To reference discrete structure I use a subscript, such as abundance in year \(t\), \(y_t\), or on plot \(x\), \(y_x\). I can have multiple discrete classes or continuous variables, such as discrete plot and year \(y_{x,t}\) or continuous location and time \(y(x,t)\). Structure can be mixed, such as continuous population growth within a discrete habitat, \(y_x(t)\). There can be species within communities or individuals within flocks, sample plots, or political parties. Sample plots can be assigned to blocks (stratification).

Structure in the data and in the processes of interest requires proper interpretation. Simpson’s Paradox and the related ecological fallacy are important considerations for any analysis that involves multiple scales. Scale is a consideration when data are aggregated, e.g., individuals to plots, zip codes, precincts, or watersheds. It arises when attempting to fit a model at one scale and predict at another scale. It can arise when there are unmeasured variables that affect observations. I may be interested in a “species response” to fertility or climate, but the data I have are the abundances of individuals on sample plots across a climate gradient. We return to this issue throughout the semester.

Structured variables can be misaligned, e.g., when some variables are referenced to zip codes, while others are referenced to voting precincts. Forest inventory plots are not aligned with weather station data used to assign climate variables to the plot. These misaligned variables introduce challenges that can be addressed with hierarchical modeling.

Consider a simple example with continuous structure. A population growth model describes how the change in abundance \(z\) relates to population size and its rate of increase in continuous time,

\[ \frac{dz}{dt} = r z \] where \(z\) is population density, \(t\) is continuous time, and \(r\) is the per capita rate of increase (1/time). I can solve this model for abundance at an arbitrary time, \(z(t) = z(0)e^{rt}\). However, I do not observe the population continuously. Instead I count the number of individuals \(y_{j,k}\) at specific times \(k = 1, \dots, K\) on one or more sample plots, \(j = 1, \dots, J\). The counts might have a Poisson distribution,

\[\begin{equation} \tag{1.1} y_{j,k} \sim Poi(A_j z(t_k) ) \end{equation}\] where \(A_j\) is the area of sample plot \(j\), and \(z(t_k)\) is the population size when observation \(k\) was obtained. The unknown population density has units of animals per ha, while the counts \(y_{j,k}\) are dimensionless. The full model can be written as two parts:

\[\begin{align} y_{j,k} &\sim Poi(A_j z(t_k) ) \\ z(t) &= z(0)e^{rt} \end{align}\]
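
Here is a minimal simulation sketch of this two-part model in R, with assumed parameter values (not from the text): the latent density grows exponentially, and counts are Poisson with expectation \(A_j z(t_k)\).

set.seed(1)
r  <- 0.3                       # per capita rate of increase (1/time)
z0 <- 2                         # initial density (animals/ha)
J  <- 3; K <- 10                # number of plots and observation times
A  <- c(0.5, 1, 2)              # plot areas (ha)
tk <- 1:K                       # observation times

z <- z0*exp(r*tk)                              # latent density z(t_k)
y <- matrix( rpois(J*K, outer(A, z)), J, K )   # counts y[j,k] ~ Poi(A_j z(t_k))
y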

Because computation requires discretization, continuously structured problems are often discretized in space, time, or both. I can solve the growth equation above for arbitrary values of time \(t\). However, a differential equation that cannot be solved can be approximated for discrete time increments as a difference equation. There is a discrete time interval \(\Delta_k\) and a corresponding discrete growth increment \(\Delta_z = z(t + \Delta_k) - z(t)\), such that \(dz/dt\) can be approximated by \(\Delta_z/\Delta_k\). Discretization can introduce instabilities in non-linear models and requires careful thought on issues of rates and time scale (e.g., \(\Delta_k\)).

We will consider how environmental processes and observations can often be summarized hierarchically. Hierarchies can be a natural way to organize data and models. A submodel can be specified for a process, while another submodel describes the connections between state variables and data. Hierarchical models use structure, which enters as distributions. In the next section I discuss why model analyses can be hierarchical–some parameters generate a biophysical process (process A affects a process or state variable B), while other parameters generate the data (B can only be partially observed).

Notation for a distribution

In Bayesian analysis, components are specified by probability distributions that contain parameters. If I write the “probability of \(y\) given parameters \(\theta\)”, then I have to know or assume a value for \(\theta\) in order to construct a distribution for \(y\). Because \(y\) is described by a distribution of values (rather than a specific value), its value is said to be “unknown”. The unknowns in a model are the variables that will be estimated. I can use the bracket notation for a distribution,

\[[\mbox{unknowns | knowns}] = [y | \theta]\] The vertical bar separates the random or stochastic \(y\) on the left from the fixed parameter \(\theta\) on the right.

The terms random, unknown, and stochastic refer to variables having probability distributions. They can appear to the left of a vertical bar, which indicates a conditional relationship. The terms fixed, known, or deterministic refer to quantities that appear to the right of the vertical bar. A variable that is fixed in one distribution can be random in another. Bayes’ theorem is an example.

This generic notation says nothing about the form of the distribution or the parameters needed to specify it. I use the generic notation when speaking of relationships that do not depend on the specific distribution.

To get specific, a common model for count data is the Poisson distribution mentioned above, where the response is a non-negative integer \(y = 0, 1, \dots\). This response depends on an intensity parameter \(\lambda > 0\). This equation says that \(y\) given \(\lambda\) is a Poisson distribution:

\[[y | \lambda ] = Poi(y | \lambda)\] The left-hand side could refer to any distribution having one parameter, while the right-hand side is specific.

The familiar normal distribution has a continuous response on the real numbers, \(y \in (-\infty, \infty)\) or \(y \in \mathbb{R}\), and two parameters, one corresponding to the mean (and mode) of the distribution, and another for the variance,

\[[y | \mu, \sigma^2 ] = N(y | \mu, \sigma^2)\] In this notation, \(y\) is random (its specific value is unknown), and the parameters are fixed–they must be known in order to define the normal distribution for \(y\).
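
To make the notation concrete, here is a brief sketch (assumed values, not from the text) evaluating these densities in R; note that dnorm takes the standard deviation rather than the variance.

dpois( 3, lambda = 2.5 )               # [y = 3 | lambda = 2.5], Poisson
dnorm( 1.2, mean = 0, sd = sqrt(2) )   # [y = 1.2 | mu = 0, sigma^2 = 2], normal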

Jointly distributed random variables

When more than one variable is unknown there can be a joint distribution. Suppose I have two observations represented by the vector \(\mathbf{y} = [y_1, y_2]\). If the observations are generated independently from the same normal distribution, then there is a joint distribution represented by the product of their (independent) distributions,

\[[y_1, y_2 | \mu, \sigma^2] = N(y_1 | \mu, \sigma^2) \times N(y_2 | \mu, \sigma^2)\] Because I have assumed that they are drawn independently of one another, the probability of both \(y_1\) and \(y_2\) occurring together is equal to the product of observing them separately.

A joint distribution has more than one variable to the left of the bar. Variables are separated by commas, i.e., \([y_1, \dots, y_n]\), or they are represented by a vector \([\mathbf{y}]\) holding the values \([y_1, \dots, y_n]\).
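
A small numerical check (assumed values): for independent draws, the joint density is the product of the individual densities.

mu <- 0; sigma <- 1
y  <- c(-0.3, 1.1)
prod( dnorm(y, mu, sigma) )                      # [y1, y2 | mu, sigma^2]
dnorm(y[1], mu, sigma)*dnorm(y[2], mu, sigma)    # same value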

Elements of Bayes

The basic elements of a Bayesian analysis are prior, likelihood, and posterior. Bayesian analysis specifies these elements as probability distributions.

In data analysis, we commonly speak of responses that are observed and parameters to be estimated (they are “unknown”). The prior distribution describes what is known about the parameters from external information. The likelihood describes how the observations are generated from parameters. Because data are unknown before they are observed, and they are generated by parameters, the likelihood is a distribution conditional on parameters. The posterior distribution is the inverse of the likelihood: it is the probability for parameters after data are observed. Before saying more about the Bayesian model specifically, I introduce the concept of hierarchical modeling, which can be viewed as a general framework for Bayes.

Hierarchical models organize distributions for prior, likelihood, and posterior into levels or stages. These are the knowns (data and priors) and the unknowns (latent processes and parameter values). It is natural to think of inference this way:

\[[\mbox{unknowns | knowns}] = [\mbox{process, parameters | data, priors}]\]

In the generic notation for a distribution, \([A]\) is the probability or density of event \(A\), \([A, B]\) is the joint probability of \(A\) and \(B\), and \([A|B]\) is the conditional probability of \(A\) given \(B\). Again, anything to the left of the vertical bar is random. Anything to the right is fixed. If there is no vertical bar, everything is random. The Appendix summarizes notation.

This structure from Mark Berliner (example here) comes naturally to a Bayesian. It can be unwieldy to the non-Bayesian and can involve improper distributions (e.g., when omitting prior distributions). A hierarchical model typically breaks this down as

\[[\mbox{process, parameters | data, priors}] \propto [\mbox{data | process, parameters}] \, [\mbox{process | parameters}] \, [\mbox{parameters | priors}]\]

The left hand side is the joint distribution of unknowns to be estimated; it is the posterior distribution. This hierarchical structure is amenable to simulation, using Markov Chain Monte Carlo (MCMC). Gibbs sampling is a common MCMC technique that exploits the fact that the posterior can be factored into smaller pieces that can be dealt with one at a time, as simpler conditional distributions. We’ll get back to this.

The elements of the hierarchical model relate directly to Bayes’ theorem. To simplify, consider a joint distribution of data \(D\) and parameters \(P\), which can be factored in two ways,

\[ [D, P] = [D | P][P] = [P | D] [D] \] Here I have factored the joint distribution into two equivalent products of a conditional and a marginal distribution (\([P]\) or \([D]\)).

Setting the two expressions on the right-hand side equal to one another and rearranging, I obtain Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] which can be read as “the probability of parameters given data”.

Exercise: I am tested for covid and am diagnosed as not-infected. Use Bayes’ theorem to write down the probability, in generic form, that I have covid.

Steps:
  1. Decide a) what are observed vs unknown and b) what is random vs fixed
  2. Write a generic distribution for each component
  3. Make appropriate substitutions in Bayes’ theorem

Prior to posterior

I would like to estimate the mean \(\mu\) of a set of \(n\) measurements \(y_i, i = 1, \dots, n\), using a Bayesian analysis assuming that observations are independent and normally distributed. The variance in observations \(\sigma^2\) is known from previous studies (e.g., instrument error).

I can think about this problem as a graph:

A graph for the model having known observations \(y_1, y_2\), a known parameter \(\sigma^2\), known prior parameter values \(m\) and \(M\), and unknown \(\mu\) that I want to estimate.

The likelihood for the observations \(\mathbf{y}\) is the conditional distribution \([D | P]\). For this normal distribution I write it as

\[[D|P] = [\mathbf{y} | \mu, \sigma^2] = \prod_i^n N(y_i | \mu, \sigma^2)\] I am assuming the observations to be independent and identically distributed. They are independent, because I am assuming that the joint distribution \([y_1, \dots, y_n]\) is the same as the product of probabilities taken independently,

\[\begin{align} [y_1, \dots, y_n] &= [y_1] \times \dots \times [y_n] \\ &= \prod_{i=1}^n[y_i] \end{align}\] They are identically distributed, because I assume that each has the same parameter values for \(\mu, \sigma^2\).

Now from the last section I have Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I see that I also need a prior distribution for the mean, which I also take to be a normal distribution,

\[ [P] = N(\mu | m, M) \] This says that I start with a distribution for \(\mu\), before observing data, consistent with my belief that it is centered on a mean value \(m\). If I have strong prior belief in the value \(m\), then the prior variance \(M\) is small. If I have no idea, then \(M\) is big. We’ll see that ‘small’ and ‘big’ in this context are relative to information that will come from data.

For the moment I overlook the factor \([D]\) in the denominator of Bayes’ theorem, because I am interested in \(\mu\), which does not appear in \([D]\). So for now, I write

\[ \begin{align} [P|D] = [\mu | \mathbf{y}] & \propto [D|P][P] \\ &=\prod_i^n N(y_i | \mu, \sigma^2) \times N(\mu | m, M) \end{align} \] For reasons we will see later, this posterior distribution is also normal,

\[ [\mu | \mathbf{y}, \sigma^2] = N \left( \mu | vV, V \right) \] where the parameters are given by

\[ \begin{align} v &= \frac{n }{\sigma^2}\bar{y} + \frac{m}{M} \\ V &= \left( \frac{n }{\sigma^2} + \frac{1}{M} \right)^{-1} \end{align} \] You might recognize this to be a weighted average of the data, with mean \(\bar{y}\), and prior, with mean \(m\). The weights for the observations (first term) and the prior (second term) both include a variance in the denominator. So the bigger the variance, the smaller the weight contributed by each term. The weight for the observations includes \(n\) in the numerator. So the larger the sample size, the greater the weight of the data. In fact, as \(n\) gets large (many observations) and \(M\) gets large (uncertain prior mean), the data dominate the prior. In this case, the posterior mean converges to \(vV \rightarrow \bar{y}\) and the posterior variance to \(V \rightarrow \sigma^2/n\). These are the traditional estimates of the sample mean and its standard error, i.e., the square root of the posterior variance \(\rightarrow \sigma/\sqrt{n}\). Here is a plot of prior and posterior:

n  <- 5                                    # specify sample size and parameter values
mu <- 1.8                                  # true, but unknown value
m  <- 0
M  <- 1
sigma <- 2                                 # fixed standard deviation
mseq  <- m + seq(-5, 5, length=100)        # sequence of mean values centered on prior mean

y      <- rnorm(n, mu, sigma)              # draw a random sample
my     <- mean(y) 
postVr <- 1/(n/sigma^2 + 1/M)              # terms for posterior distribution
postSd <- sqrt( postVr )
postMu <- (my*n/sigma^2 + m/M)*postVr

prior     <- dnorm( mseq, m, sqrt(M) )     # evaluate prior and posterior
posterior <- dnorm( mseq, postMu, postSd )

plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
      ylab = 'Density (1/mu)', col = 'brown' )
lines( mseq, prior, col = 'grey', lwd = 2 )
abline( v = my, lty = 2 )                                                     # data mean
segments( m, 0, m, dnorm(m, m, sd = sqrt(M) ), col = 'grey' )                 # prior mean
segments( postMu, 0, postMu, dnorm(postMu, postMu, postSd ), col = 'brown' )  # post mean

Note how the posterior mean (brown) lies between the prior mean (grey) and the data mean, mean(y) = 1.57, shown by the dashed line.

The uncertainty in a parameter estimate is often summarized by a credible interval defined as values that bound a proportion of the posterior distribution. Credible intervals are often defined for 95%, i.e., at probabilities bounded by quantiles \((0.025, 0.975)\).

The 95% credible interval for \(\mu\) is obtained from the posterior distribution, which in this case is normal (“Gaussian”),

ci <- qnorm( c(0.025, 0.975), postMu, postSd )                  # quantile values
plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
      ylab = 'Density (1/mu)' )
segments( ci, 0, ci, dnorm(ci, postMu, sd = postSd ), lty=2 )

This normal distribution has 2.5% in each of the two tails, bounded by two values in the vector ci = (-0.435, 2.18).

Exercise: A weighted average for two values of a variable \(x\) is \(\bar{x} = (w_1x_1 + w_2x_2)/(w_1 + w_2)\), where \(w_i\) is the weight for the \(i^{th}\) value of \(x\). In the posterior distribution for the mean estimate \(N \left( \mu | vV, V \right)\), identify \(x_1, x_2, w_1\) and \(w_2\) in the equations for \(v\) and \(V\). What is the effect of increasing \(n\) on the posterior mean and variance in the above example and why?

Exercise: If the data are noisy (large \(\sigma^2\)), is it still possible to obtain a precise estimate of \(\mu\)? Why or why not?

If Bayes were always as simple as this problem suggests, it would have captured much of the modeling space decades ago. I previously skipped over a sticky detail. Returning to Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I discussed everything but the denominator, a marginal distribution of the data, which is the source of aggravation in Bayesian analysis. For this expression to be true, this distribution must involve integration,

\[ [D] = \int [D | P][P] dP \] Recall that multiplication is easy, and division is hard. In calculus, the tables are turned. The calculus counterpart of division is differentiation, and the counterpart to multiplication is integration. However, differentiation is much easier than integration. This integral is known for the normal likelihood \(\times\) normal prior example above, but it is not known for most likelihood-prior combinations.
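
For the normal-normal example above, this integral can be checked numerically. Here is a minimal sketch (mirroring the assumed values used earlier): evaluate likelihood \(\times\) prior as a function of \(\mu\), integrate to obtain \([D]\), and confirm that normalizing recovers the analytic posterior \(N(\mu | vV, V)\).

set.seed(1)
n <- 5; sigma <- 2; m <- 0; M <- 1
y <- rnorm(n, 1.8, sigma)                      # hypothetical data

likeXprior <- function(mu)                     # [D|P][P] as a function of mu
  sapply( mu, function(u) prod( dnorm(y, u, sigma) )*dnorm(u, m, sqrt(M)) )

D <- integrate( likeXprior, -Inf, Inf )$value  # [D], integrating out mu

V <- 1/(n/sigma^2 + 1/M)                       # analytic posterior parameters
v <- sum(y)/sigma^2 + m/M
c( numerical = likeXprior(.5)/D, analytic = dnorm(.5, v*V, sqrt(V)) )  # agree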

Posterior simulation techniques like Markov chain Monte Carlo (MCMC) offered ways to approximate this integral and more. Beyond resolving the integration problem, Gibbs sampling (Gelfand and Smith 1990) allowed for hierarchical expansion of models. After gaining familiarity with simple Bayes in the first few chapters, we’ll devote most of the semester to hierarchical problems.

Bayes factor

The Bayes factor (BF) is used to evaluate models relative to one another. Bayes’ theorem gives the posterior probability of a model conditional on the data, \([M | D]\), by combining the marginal likelihood of the data under that model, \([D | M]\), with a prior probability for the model, \([M]\). For two models, the Bayes factor is the ratio of marginal likelihoods,

\[ BF = \frac{[D | M_1]}{[D | M_2]} = \frac{[M_1 | D] \, [M_2]}{[M_2 | D] \, [M_1]} \]

A BF value much greater than 1 finds support for model 1 and vice versa. Obtaining \([D | M]\) requires evaluating an integral. We discuss this further later in the semester.

Regression as an example of simple Bayes

A regression model contains a response, predictors, parameters, and residual (‘unexplained’) variation. The predictors and parameters explain part of the response variation. The residual variation is left over once we have fitted the parameters. Least squares minimizes error, maximum likelihood finds parameters that make the data most likely, and Bayes evaluates a probability distribution of parameters.

This is a good time to read this section of the appendix. Here again, the subscript \(i = 1, \dots, n\) indicates the \(n\) observations of a response \(y_i\) with corresponding predictors \(\mathbf{x}_i\). The index \(i\) is exchangeable, meaning that the order of observations doesn’t matter. The responses make up a length-\(n\) vector \(\mathbf{y} = (y_1, \dots, y_n)\). The predictors make up an \(n \times p\) matrix \(\mathbf{X}\), where \(p\) is the number of parameters to be estimated in a vector \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'\).

A traditional regression model looks like this:

\[ \begin{aligned} y_i &= \mu_i + \epsilon_i \\ \mu_i &= \beta_1 + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} = \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] (again, notation is summarized in an Appendix.) There is an observed response \(y_i\) and predictors in a length-\(p\) design vector \(\mathbf{x}_i\). Here is an example of a design vector for an intercept and two slope variables,

\[ \mathbf{x}_i = (1, x_{i2}, x_{i3})' \] The first element of the design vector is \(x_{i,1} = 1\), corresponding to the intercept for the model. In this notation, \(x_{i2}\) is the predictor that will be multiplied by the first slope coefficient, and \(x_{i3}\) is the predictor that will be multiplied by the second slope coefficient. The error has an expectation \(E[\epsilon] = 0\) and variance \(Var[\epsilon] = \sigma^2\). This is called a linear model, because it is a linear function of the parameters \(\boldsymbol{\beta}\).

Typically, I want to estimate the parameters in the model. To do this, I have to decide on a criterion that constitutes a ‘good fit’. The simplest assumption could be to find parameter values that minimize the leftover (‘residual’) variation. Without specifying a distribution for \(\epsilon\), I could minimize the expected value (\(E[\cdot]\)) of the squared error,

\[\min_{\boldsymbol{\beta}} E[\epsilon^2] = \min_{\boldsymbol{\beta}} E\left[ \sum_i^n (y_i - \mathbf{x}'_i \boldsymbol{\beta})^2 \right]\]

with respect to parameters in the vector \(\boldsymbol{\beta}\). Within the brackets I am summing the squared difference between the observed \(y_i\) and the value that would be predicted by the model, \(\mathbf{x}'_i\boldsymbol{\beta}\). By taking derivatives with respect to \(\boldsymbol{\beta}\), setting them equal to zero, and solving for \(\boldsymbol{\beta}\), I obtain the least-squares estimates \(\mathbf{\hat{\boldsymbol{\beta}}}\). This is the method of least-squares.
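
Here is a minimal sketch with simulated data (the design and parameter values are assumptions, not from the text): the least-squares estimates solve the normal equations \(\mathbf{X}'\mathbf{X} \boldsymbol{\beta} = \mathbf{X}'\mathbf{y}\), and they match the output of lm().

set.seed(1)
n  <- 100
x2 <- rnorm(n); x3 <- rnorm(n)
X  <- cbind(1, x2, x3)                          # design matrix with intercept
y  <- drop( X%*%c(1, 2, -0.5) + rnorm(n) )      # simulate the response

betaHat <- solve( t(X)%*%X, t(X)%*%y )          # least-squares estimates
cbind( betaHat, lm = coef( lm(y ~ x2 + x3) ) )  # same values from lm()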

Rather than minimizing residual error, maximum likelihood looks for a set of parameters that make the observations most likely given the model. I say most likely rather than most probable, because the parameters are not specified with a probability distribution (that would be Bayesian). The likelihood is a probability model for the observations, not for the parameters.

If I specify a normal distribution for errors, \(\epsilon \sim N(0, \sigma^2)\), I can write a likelihood function for the observations. Here again, \(N\) indicates a normal distribution, with mean zero and variance \(\sigma^2\). Due to properties of the normal distribution, if \(\epsilon \sim N(0, \sigma^2)\) and \(y = \mu + \epsilon\), then \(y \sim N(\mu, \sigma^2)\), where parameters are \(\mu\) (mean) and \(\sigma^2\) (variance).

Because the regression model assumes independent observations, I can write the likelihood function as a simple product over the observations:

\[L(\boldsymbol{\beta}, \sigma^2) = \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] Once again, the independence assumption means that the probability of all \(n\) observations is equal to the product of the probabilities of each individual observation,

\[\prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) = N(y_1 | \mathbf{x}'_1 \boldsymbol{\beta}, \sigma^2) \times \dots \times N(y_n |\mathbf{x}'_n \boldsymbol{\beta},\sigma^2)\]

The maximum likelihood estimate comes from maximizing the likelihood function with respect to parameters. I obtain the same estimates as I did with least squares; this is true when \(\epsilon\) is normally distributed. In both cases, the criterion (squared error or likelihood) is treated as a function of the parameters to be optimized. Again, the parameters are not defined probabilistically.
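
A minimal sketch with simulated data (values are assumptions): maximize the normal log likelihood numerically with optim() and compare with the least-squares solution.

set.seed(1)
n  <- 100
x2 <- rnorm(n); x3 <- rnorm(n)
X  <- cbind(1, x2, x3)
y  <- drop( X%*%c(1, 2, -0.5) + rnorm(n) )

negLogLik <- function(par){                     # negative log likelihood
  beta  <- par[1:3]
  sigma <- exp(par[4])                          # keep sigma > 0
  -sum( dnorm(y, X%*%beta, sigma, log = TRUE) )
}
fit <- optim( rep(0, 4), negLogLik )
rbind( ML = fit$par[1:3],
       LS = drop( solve( t(X)%*%X, t(X)%*%y ) ) )   # nearly identical estimates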

The Bayesian view differs from traditional approaches by introducing a prior distribution that I now consider in the context of the regression model. The prior distribution has two important functions:

  • It allows for information outside the data at hand.

  • It allows us to derive a posterior distribution, i.e., a probability distribution for parameters.

Recall Bayes’ theorem:

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] where \(D\) represented data, and \(P\) represented parameters. For the regression model the parameters are \(P = (\boldsymbol{\beta}, \sigma^2)\), and the data are \(D = (\mathbf{y}, \mathbf{X})\). Although both predictors \(\mathbf{X}\) and responses \(\mathbf{y}\) are observed, the regression model assumes that \(\mathbf{X}\) is part of a fixed design–it is not random. For this reason, we do not write a distribution for \(\mathbf{X}\), but rather we ‘condition’ on the fixed values in \(\mathbf{X}\). Making these substitutions, Bayes’ theorem becomes

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}] = \frac{[\mathbf{y} | \boldsymbol{\beta}, \sigma^2, \mathbf{X}] [\boldsymbol{\beta}, \sigma^2]}{[\mathbf{y}]} \]

Again, the posterior distribution of parameters is on the LHS. The denominator is the marginal distribution of the data, \([\mathbf{y}]\). As in the example for the simple mean, I’ll ignore it for now.

As before, the left-hand side is “parameters given data”, or the posterior distribution, and the right-hand side has likelihood \(\times\) prior in the numerator. In Bayesian analysis the likelihood is treated not as a function to be optimized, but rather as a distribution for the data that must be combined with the prior to yield a distribution for parameters (the posterior). The second factor on the right indicates the as-yet unspecified prior distribution for parameters, this time for the regression coefficients. Technical details follow later; for now I focus on general concepts.

Substituting for the likelihood and leaving the prior distribution unspecified (for now), I have

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}] \propto \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) [\boldsymbol{\beta}, \sigma^2] \]

If the prior distribution is flat, the posterior distribution for \(\boldsymbol{\beta}\) will have the same shape as the likelihood. Thus, parameter estimates will have the same marginal means and standard errors as I obtained by maximum likelihood. The impact of the prior distribution becomes important when it is made informative and as models increase in size.
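
A minimal sketch of this point (simulated data, with \(\sigma^2\) treated as known; both are assumptions for illustration): under a flat prior for \(\boldsymbol{\beta}\), the posterior is centered on the least-squares estimates with covariance \(\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\).

set.seed(1)
n  <- 100
x2 <- rnorm(n); x3 <- rnorm(n)
X  <- cbind(1, x2, x3)
sigma <- 1                                            # known error SD (assumption)
y  <- drop( X%*%c(1, 2, -0.5) + rnorm(n, 0, sigma) )

betaHat <- drop( solve( t(X)%*%X, t(X)%*%y ) )        # posterior mean = LS estimate
postVar <- sigma^2*solve( t(X)%*%X )                  # posterior covariance
cbind( postMean = betaHat, postSd = sqrt( diag(postVar) ) )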

Model graph

Models can be represented as graphs. Model elements are connected by arrows that establish the relationships between variables that guide model building.

Below are three graphs, each describing a different model. In this diagram, there are boxes, circles, dashed lines, and solid lines. The graph provides a visual interpretation of the model. The traditional regression model is shown in (a). I discuss symbolism in the next sections.

Three variations on the regression model in graphical form. Circles are random–they must be estimated. Squares ‘anchor’ the fit; these observed variables \(X\) and \(Y\) and prior parameter values are known. a) In traditional regression, there are unknown parameters, but they are not given a probability model. b) Simple Bayesian regression has a probability model for parameters. Once response \(Y\) is observed, it is fixed. c) The hierarchical model has an additional stochastic stage.


Known and unknown variables: Variables are nodes in the graph. The known variables are boxes, including data and prior parameter values. They anchor the model fit, because they are fixed. The unknown variables and parameters are circles. In the regression example, these are the regression coefficients, the length-\(p\) vector \(\boldsymbol{\beta}\), and the residual variance \(\sigma^2\).

Stochastic and deterministic variables: Both the response \(y_i\) and the predictors \(\mathbf{x}_i\) are known. However, the predictors are treated as deterministic, whereas the response is treated as stochastic before it is observed. Specifically, \(y_i\) is assigned a distribution, the likelihood. The vector \(\mathbf{x}_i\) is not assigned a distribution. In observational studies it is not always clear what should be a predictor and what should be a response.

Stochastic and deterministic arrows: Solid arrows in the graph indicate stochastic relationships, meaning that knowledge of a parent node does not tell us the value at the child node (or vice versa). Dashed arrows indicate deterministic relationships, which can often be omitted from the graph. This example will be discussed in chapters that follow. For now, I introduce several more concepts.

Some additional background concepts

In this section, I introduce some concepts that will come up again this semester.

Terms used to classify models

The terms process model and statistical model represent a false dichotomy.

Hierarchical models typically include both elements. They are used for inference and to predict processes and data, with uncertainty. To take a simple recent example, development of spring phenological state could be written as a rate equation,

\[\frac{dh}{dt} = r(t; \mathbf{x}(t)) \left(1 - h(t) \right)\]

for development rate \(r(t)\) that changes over time, in part due to environmental changes contained in \(\mathbf{x}(t)\). Most ecologists would call this a ‘process model’. However, when discretized, written in terms of predictors, and combined with a data model for ordinal observations, it’s used for inference (Clark et al. 2014). By allowing for a biologically reasonable process at one stage and a relevant data model at another, the estimates of the process itself are more accurate than they would be otherwise. Hierarchical modeling has made the distinction between statistical and process models a false dichotomy.

A generative model can be used both forward and backward; it predicts the data that were used to fit it.

Prediction and simulation is the forward direction: start with parameters and generate data. This means moving from bottom to top in the graphical model that I drew for the regression. Model fitting is the inverse direction, i.e., top to bottom in the model graph. Generative models are important for prediction–can model fitting recover the parameters used to simulate the data?

Many models are not generative. When they are not, evaluation is difficult. In this course prediction plays a large role in model checking, so we focus on generative models.

Continuous to discrete

Often models are specified as rate equations (as above), and we need to fit them to data. For a simple example, consider the logistic equation,

\[ \frac{dn}{dt} = rn \left( 1 - n/k \right) \] If we want process error (the process is not exactly logistic), this must be stochasticized. Data are also likely obtained at discrete time points. For model fitting it can be a good idea to start by multiplying both sides by \(dt\):

\[ dn = rn \left( 1 - n/k \right) \times dt \] A discretized version of the model is specified in time increments:

\[ n_{t + dt} = n_t + dn = n_t + r n_t \left( 1 - n_t/k \right) \times dt \] To include process error we add noise:

\[ n_{t + dt} = n_t + r n_t \left( 1 - n_t/k \right) \times dt + \epsilon_t \] Now we just need a distribution for \(\epsilon_t\), and we have a process model with error. The noise term absorbs both model misspecification and the error of integration. Non-linear models like this can be unstable, but model fitting is stabilized by the data. This approach works for systems of ODEs and PDEs.
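
Here is a minimal simulation sketch of this discretization in R (parameter values and the noise scale are assumptions chosen for illustration):

set.seed(1)
r  <- 0.4; k <- 100                         # growth rate and carrying capacity
dt <- 0.1                                   # discrete time increment
nt <- 200                                   # number of time steps
sigmaP <- 0.5                               # process error SD (one simple choice)

n    <- numeric(nt)
n[1] <- 5                                   # initial population size
for(t in 2:nt){
  dn   <- r*n[t-1]*(1 - n[t-1]/k)*dt        # deterministic increment
  n[t] <- n[t-1] + dn + rnorm(1, 0, sigmaP) # add process error
}
plot( (1:nt)*dt, n, type = 'l', xlab = 'Time', ylab = 'n(t)' )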

Observation and experiment

Design determines the information that data have to offer for model fitting. It lends its name to the design matrix in anova and regression. The common elements of a study design include replication, stratification, and randomization (RSR).

Information can be evaluated in several ways, such as the precision a model provides on parameter estimates or predictions. There is a large literature on ‘optimal design’ and how it benefits from RSR.

Studies can be observational (control is limited to when/where/how samples are obtained) or experimental (variable manipulation). The model for experimental data may be a model for the design (e.g., ANOVA), but it doesn’t have to be. Experiments may be established to evaluate many processes, each analyzed with a different model.

Observational data have an advantage in being available at scales that could not be studied with controlled experiments. Large observational networks, some global, are examples. Weather station data, now combined with satellite imagery, are the basis for modern understanding of ocean and atmospheric circulation. Epidemiological evidence for public health trends is widely used in the social sciences, as are polling data in public policy. Disease surveillance can be active or passive, the latter being easier to get, but harder to use in models. For example, how do we interpret prevalence of a disease in a city where only symptomatic individuals get tested? Relatedly, polls are notoriously hard to interpret due to the fact that methods used to reach potential voters bias the outcomes.

Observational data can be highly unbalanced in space and time; because the distribution of observations usually determines the distribution of predictors, all interpretations have to consider how the distribution of data affects estimates and predictions (Tang et al., 2023). Like FIA, many national forest inventories strive for balanced spatial coverage. However, they are concentrated in recent years, and sample years are misaligned between states. Citizen science platforms like eBird and iNaturalist have become so dense as to cover many habitats and geographic regions. However, both fall off rapidly with time before present, making it hard to estimate trends. The coverage is uneven across habitats and species, the latter depending on behavior, crypsis, and degree of observer interest (Tang et al. 2022, Scher and Clark 2023).

A big concern with observational data is how to attribute cause and effect. As many variables change, interacting with one another as well as with unmeasured variables, there may be no way to identify what’s affecting what. Are people healthy because they exercise, or does being healthy motivate exercise?

It can be hard to even decide what goes on the right versus the left side of the model. Does high mineralization rate promote N-demanding plants and microbes, or vice versa? What do I do when I know that both are true?

Experiments benefit from manipulation and control. Treatment effects might be isolated when other variables are controlled. Control is critical to inference in many settings. However, experiments are limited in scale, often conducted in the laboratory, growth chamber, greenhouse, or small field plot. This scale mismatch means that a variable’s ‘effect’ on the response can be unrealistic. Because many ecological processes play out over multiple scales, experimental results can be difficult to extrapolate outside the experimental frame. Only a fraction of important questions are amenable to experimentation. Many ecologists and social scientists look for ‘natural experiments’.

Model fitting is a balance between data and model.

Many arguments about climate change focus on global mean temperatures, obtained by averaging station data that go back over a century. This seems to be a data-driven debate (“what do the data say?”)–until we confront the fact that the mean of the stations is not the global mean, because stations are not uniformly distributed over the globe. Models are used to translate unevenly distributed data to a less biased estimate of regional climate. The PRISM data are an example.

Exercise. What are the elements of a design? What does each contribute?

Dependence in data

I previously said that traditional data modeling assumes that observations are i.i.d. (independent and identically distributed). Many samples do not contain independent observations. For example, eBird counts in Sandy Creek, NC and Alexandria, VA might appear to offer independent information (Tang et al. 2022). However, the eBird counts from Sandy Creek, NC last Saturday contain substantial redundancy–each observation does not offer new information if volunteers observe the same birds repeatedly. Likewise, if birds are migrating through an area, observations obtained in different weeks offer different information from observations taken on successive days.

The i.i.d assumption overestimates the information content in observations when those observations depend on one another.

Spatial data may depend on nearby observations. Note that arrows point in both directions.

There is often serial dependence in time-series data.

Building dependence into traditional models is hard. Hierarchical modeling solves this problem, because dependence can come simply from adding a stage, as shown in the graphical model for regression.

Multiplicity corrections

Multiple testing refers to interpretations that rely on more than one hypothesis test.

I might want to know if the abundances of microbial taxa decline with salinity in estuaries, so I fit regressions to concentrations of each of thousands of OTUs (operational taxonomic units). I expect to reject the null at \(\alpha < 0.05\) for 5% of tests where the null hypothesis is true. A traditional Bonferroni correction divides the \(\alpha\)-value threshold by the number of tests based on the assumption that tests are independent. By building models hierarchically, Bayesian analysis sidesteps this problem–the dependence between ‘tests’ is built into the model.
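
A minimal simulation sketch (assumed setup, not from the text): when every null hypothesis is true, roughly 5% of regressions reject at \(\alpha = 0.05\); the Bonferroni correction divides that threshold by the number of tests.

set.seed(1)
ntest <- 1000
pvals <- replicate( ntest, {
  x <- rnorm(20); y <- rnorm(20)                # no true relationship
  summary( lm(y ~ x) )$coefficients['x', 4]     # p-value for the slope
} )

mean( pvals < 0.05 )                            # close to 0.05
mean( pvals < 0.05/ntest )                      # Bonferroni threshold: near zero
mean( p.adjust( pvals, 'bonferroni' ) < 0.05 )  # equivalent, using p.adjust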

Ecological fallacy/Simpson’s paradox

The ecological fallacy and Simpson’s paradox refer to the fact that inference at one scale or level of organization may not represent relationships at another.

It’s not hard to find examples where individual and group behavior appear to show different relationships. In public policy, we now know that rich states vote democratic, while rich people vote republican (Gelman 2009). In the last few decades, the mean household income has increased, while the income of the mean household has declined. Ecological communities (species distributions and abundances) are poorly predicted by aggregating predictions for individual species (Clark et al. 2011). These are examples of ‘Simpson’s Paradox’.

Conversely, models of total numbers of species do not predict extinction threat, which operates at the scale of each individual species, an example of the ‘ecological fallacy’. The ecological fallacy refers to the fact that group behavior does not predict individuals within groups (rich people do not vote like rich states, a species’ extinction risk cannot be interpreted from variation in number of species). Gerrymandered voting precincts exploit this fact.
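
Here is a minimal simulation sketch of this kind of scale mismatch (the groups and parameter values are assumptions): the relationship is negative within each group, but positive when the groups are aggregated.

set.seed(1)
ng <- 3; nper <- 50                                     # groups, observations per group
group <- rep(1:ng, each = nper)
x <- rnorm(ng*nper, mean = 2*group)                     # group means increase in x
y <- 3*group - (x - 2*group) + rnorm(ng*nper, 0, 0.5)   # negative slope within groups

coef( lm(y ~ x) )['x']                                  # pooled slope is positive
coef( lm(y ~ x + factor(group)) )['x']                  # within-group slope is negative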

Recap

Am I a Bayesian? Many who make use of Bayesian models and software would not necessarily see themselves as particularly Bayesian. Models and software may be selected for convenience or for other considerations that have little to do with philosophy. Avowed non-Bayesians may exploit MCMC methods, while making a point of omitting priors. Staunch Bayesians may sometimes omit priors where they do not affect results or interpretation, but include them elsewhere. Users of packages like INLA may not know that it uses prior distributions (it’s Bayesian). Many big data applications bridge machine-learning and statistical worlds, exploiting approximations of many types. Bayes has expanded rapidly in large part for pragmatic reasons of simplicity and capacity to address large problems.

Bayesian analysis combines a prior distribution with a data model, or likelihood, to arrive at a posterior distribution of unknown parameters and latent variables. Deterministic variables are constant, whereas stochastic variables are defined by distributions. Known variables are fixed once observed, but they can be stochastic and, thus, defined by distributions, such as the response \(Y\) in the regression model. Graphs are used to communicate model structure and organize algorithms. They provide a visual representation of hierarchical models. A common hierarchical structure is [data|process, parameters][process|parameters][parameters].

Basic regression models provide an opportunity to introduce similarities and differences. Least-squares, maximum likelihood, and Bayes share some common features at this simplistic level, but diverge as models increase in size.

Both experimental and observational data are analyzed in environmental sciences. They differ in design, control, and, often, scale. Fundamental design considerations include randomization, stratification, and replication.

Model fitting and prediction are ‘backward’ and ‘forward’ views. Generative models do both and, thus, can be fully evaluated (in both directions).

In this course, Bayesian modeling and computation are presented together as means for describing a process, while allowing for how the data are obtained.


Appendix

Notation

Here are some notation conventions used in these vignettes.

notation example interpretation
italic \(x\), \(X\) scalar quantity, known
greek \(\alpha, \beta, \dots\) stochastic (fitted) variable, unknown
parentheses \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) parametric function/density
curly brackets \(\{0, 1, \dots\}\) a set of objects
closed interval \((-\infty,0]\), \([0, \infty)\) intervals include zero
open interval \((-\infty,0)\), \((0, \infty)\) exclude zero
distributed as \(x \sim N(\mu, \sigma^2)\) distribution or density
expectation \(E[\epsilon] = 0\) expected value of a variable
variance \(Var[\epsilon] = \sigma^2\) variance of a variable
bracket, distribution \([A, B, C]\) unspecified density or distribution
proportional \(f(x) \propto g(x)\) differ by a constant, \(f(x) = c g(x)\)
approximate \(\pi \approx 3.14159\) approximately equal
real numbers \(\mathbb{R} = (-\infty, \infty)\) note: also positive real \(\mathbb{R}_+\)
is an element of \(\pi \in \mathbb{R}\) \(\pi\) is a real number
subset \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) in the set
proper subset \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) cannot include all elements
union \(a \cup b\) either \(a\) or \(b\)
intersection \(a \cap b\) both \(a\) and \(b\)
sum \(\sum_{i=1}^n x_i\) \(x_1 + \dots + x_n\)
product \(\prod_{i=1}^n x_i\) \(x_1 \times \dots \times x_n\)
exponentiate \(e^x\), \(exp(x)\) \(e \approx 2.71828\), inverse of \(ln(x)\)
natural logarithm \(ln(x)\) inverse of \(e^x\)

Matrices

notation example interpretation
bold, l.c. \(\mathbf{x}\), \(\mathbf{x}'\) column and row vectors, respectively
bold, u.c. \(\mathbf{X}\) matrix
dimensions \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) \(n\) rows, \(q\) columns
subscript \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) element of an array
row vector \(\mathbf{X}_{i\cdot}\) row \(i\)
column vector \(\mathbf{X}_{\cdot j}\) column \(j\)
transpose \(\mathbf{X}'\) rows become columns
matrix inverse \(\mathbf{A}^{-1}\) solve systems of linear equations
identity matrix \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) \(p \times p\) with 1’s on the diagonal, zeros elsewhere
matrix determinant \(det\left( \mathbf{A} \right)\) obtain for a square matrix, e.g., covariance
kronecker product \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) \(np \times mq\) matrix, defined in text

logs and exponentiation:

  • these are equivalent: \(exp(x) = e^x\)

  • \(\log( e^x ) = e^{ \log x } = x\)

  • \(exp(y_1) exp(y_2) = exp(y_1 + y_2)\)

  • \(\prod_{i=1}^n exp( y_i ) = exp \left( \sum_{i=1}^n y_i \right)\)

  • \(log( a/b ) = log(a) - log(b)\)

Simple derivatives

  • \(\frac{d}{dx} \left( x^a \right) = a x^{a-1}\)

  • \(\frac{d}{dx} \left( e^x \right) = e^x\)

  • \(\frac{d}{dx} \left( e^{f(x)} \right) = \frac{df}{dx} e^{f(x)}\)

Some sample statistics

The notation \(E(y)\) refers to the expectation of \(y\). For random data it is the sample mean. We do not simply refer to it as a mean, because it is more general: a probability distribution has an expectation, even if there are no data.

  • mean: \(E(y) = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\)

  • variance: \(Var(y) = E(y^2) - E^2(y) =\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^n y^2_i - \bar{y}^2\)

  • covariance: \(Cov(x,y) = E(xy) - E(x)E(y) = \frac{1}{n} \sum_{i=1}^n x_i y_i - \bar{x} \bar{y}\)
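
A quick numerical check of these identities (assumed values), using the \(1/n\) definitions given above; note that R’s var() and cov() divide by \(n - 1\) instead.

y <- c(2, 4, 7, 1)
x <- c(1, 3, 5, 2)
mean(y^2) - mean(y)^2                # variance: E(y^2) - E^2(y)
mean( (y - mean(y))^2 )              # same value
mean(x*y) - mean(x)*mean(y)          # covariance: E(xy) - E(x)E(y)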

Models we will encounter

Univariate response models include the following:

Name response additional attributes
linear (LM) normal linear in parameters
linear mixed model (LMM) normal LM with random effects
generalized linear model (GLM) discrete linear in parameters on link scale
logistic regression binomial GLM includes logit link
probit regression binomial GLM includes probit link
Poisson regression Poisson GLM typically with log link
mixed GLM (GLMM) binomial, Poisson, … GLM with random effects

Multivariate response models have a vector of responses:

Name response additional attributes
linear (LM) MVN linear in parameters
categorical multinomial multiple classes, one outcome, MV logit or probit link
multinomial multinomial multiple classes, multiple outcomes
generalized joint attribute model (GJAM) all types linear in parameters

Time series have dependence in time. They can be state-space models that are normal or not for continuous states. The Kalman filter is a simple (normal-normal) state-space model. Hidden Markov model is a term most often applied to discrete states. Autoregressive models are normal and have dependence on \(p\) previous times, e.g., AR(\(p\)).

Spatial models have dependence in space. They can be LM or GLM with \(n = 1\) and an \(m \times m\) covariance matrix \(\Sigma\), where \(m\) is the number of locations. Kriging is a traditional spatial model for continuous space. Spatial autoregressive models are used where space is viewed as discrete blocks (e.g., census tracts, counties, …).