The emergence of large data sets in environmental science has changed statistical analysis, with more focus on data wrangling and on algorithmic approaches, especially Bayes. Bayesian analysis departs from alternative methods in its application of probability to all aspects of model fitting, with the side benefit of simplifying interpretation. The elements of a Bayesian analysis are the prior, likelihood, and posterior distributions. Hierarchical models emerge naturally in the Bayesian framework as a means for analyzing high-dimensional problems without requiring a change in approach. Graphs help to organize hierarchical modeling. Basic concepts are introduced with the Forest Inventory and Analysis (FIA) monitoring network.

Logistics

Resources

For next time

  • Discussion from reading:

    • Articulate the following: \(P\) value, multiple testing, \(P\)-hacking, Bayes Factor, prior odds

    • What is a false-positive in this context?

    • Why is 0.005 suggested as an alternative to 0.05?

    • Why might some disciplines have adopted \(P\)-value thresholds as low as \(< 10^{-5}\)?

  • Exercises from this document due one week from today using exerciseTemplate.Rmd

Today’s plan

  • exerciseTemplate.Rmd for assignments, installing R packages

  • Breakout: Identify groups and moderator for Discussion paper

  • Jim: Foundation materials

Objectives:

  1. Recognize and generate proper notation for a simple model
  2. Identify the basic elements of a Bayesian model
  3. Identify the deterministic vs stochastic variables in a model
  4. Interpret a hierarchical model
    • construct a simple graphical model
    • assemble parameters, process, and data as a hierarchy
    • describe a regression model with notation and graph
  5. Articulate advantages and disadvantages of observational and experimental evidence
  6. Define Simpson’s Paradox and identify when it could be operating.


Once you have downloaded the file clarkFunctions2021.r from Sakai, move it to your current directory and source it this way:

source('clarkFunctions2021.r')


Two themes will repeat throughout this course: big data and Bayesian analysis. Statistical analysis traditionally focused on canned software applied to simple (and small) data sets containing hundreds of observations and several variables. Linear regression and ANOVAs led to P values that determined whether or not a study was publishable. Large observational data sets are now commonly wrangled from the internet and used to ask questions that require new tools for data manipulation and hierarchical model building. Hierarchical models offer the potential to integrate multiple types of data, while preserving a focus on science: How does the process work? Can we estimate the relationships and predict them? Theory is needed for model building. Algorithms are the basic tools needed both for data manipulation and model evaluation. I start with a case study that illustrates important concepts.


Case study: USDA’s Forest Inventory and Analysis (FIA) program

Monitoring networks are increasingly used to evaluate geographic patterns in biodiversity and ecosystem properties, how they are changing, and why. The USDA Forest Inventory and Analysis (FIA) program is one of many national inventory programs.

The FIA network provides an opportunity to illustrate some of the concepts that are important for this course. It is too big to manipulate with the standard 8GB or 16GB of RAM installed in most modern laptops (or desktops). [If your machine can expand to 32GB you should do this: it is not expensive and can make a big difference.] Every tree is measured on > \(10^5\) plots, shown below, distributed across the country. Data include > 300 tree species. The latest update includes > 20M rows for tree-years and > 200 variables. The plots themselves are deployed as a stratified-random design, with the stratification being over a uniform geographic grid. Plots are remeasured at intervals of several years, depending on state. Data include tree diameter, crown class, and a large number of codes related to individual condition. There are small seedling plots and larger tree plots, but even the latter are small, approximately 1/14th of a hectare in the eastern US and larger in the West. Environmental variables include slope, aspect, estimated stand age, and site moisture status.

200,000 FIA plots support 20M sampled trees, here showing young stands in the South and East (yellow at left) and dry site classes in mountainous sites and southern Plains (brown at right).

Although we cannot manipulate the entire data set in class, we can examine a subset of it. To do this, I clustered plots based on their environmental characteristics to obtain approximately 1-ha aggregate plots. To manipulate FIA data I use built-in functions and packages. A package is a bundle of functions that together offer tools for a related set of problems. To use a function that is contained in a package, I need to install the package. Here is the installation of two packages:

install.packages('repmis')  # read from github
install.packages('gjam')    # extract file

Once a package is installed, I can make all of its functions available in my workspace with the library function. Alternatively, I can call a single function without attaching the whole package using the syntax package::function.

In the following code I first get the data from a remote repository using the function source_data(fileName) in the package repmis, then set a variable biomass to the column for Pinus taeda:

d <- "https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True"
repmis::source_data(d)
## [1] "forestTraits"
data    <- forestTraits$xdata                         # compressed
tmp     <- gjam::gjamReZero(forestTraits$treesDeZero) # decompress
biomass <- tmp[,"pinuTaed"]                           # response
data    <- cbind(biomass, data)                       # all variables

The function gjamReZero, from the package gjam, decompresses an object containing mostly zeros.

The data.frame data includes both continuous variables and factors. It is formatted as observations (rows) by variables (columns). I’ll say more about formats as we go.

As always, I start with some simple exploratory data analysis (EDA). First I plot some variables in data using the R function plot. The variable soil is a factor in the model, meaning that it is categorical: it is stored as integer codes, but it takes only discrete values (levels). It has five levels, one being a reference type. In these first plots I assign colors to soil types.

# symbol color from soil type
col <- match( data$soil, levels(data$soil) )

# symbol size is response:biomass
cex <- .1 + biomass
cex <- sqrt( cex/max(cex) )

par(mfrow=c(1,2),bty='n', mar=c(4,4,3,1))
plot(data$temp, data$deficit, xlab='Winter temperature',  # climate deficit
     ylab='Moisture deficit', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
        bty='n', cex = .7)
plot(data$temp, data$moisture, xlab='Winter temperature', # local moisture level
     ylab='Site moisture', cex=cex, col = col)

Some variables in the data.frame data

The syntax data$temp indicates that data is a type of R object termed a list. In this instance, data is a specific type of list called a data.frame. The $temp part of this name extracts the element of the list named temp (here, a column). Use the help(plot) page to interpret the arguments to the function plot.
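
As a quick check of this structure, here is a minimal sketch using the data object loaded above:

class(data)              # "data.frame", a special kind of list
is.list(data)            # TRUE: a data.frame is a list of columns
names(data)              # element (column) names, including 'temp'
data$temp[1:3]           # first values of the element named temp
data[['temp']][1:3]      # the same values, extracted list-style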

par(mfrow=c(2,1),bty='n', mar=c(4,4,1,3))

plot(data$deficit, data$moisture, xlab='Deficit', 
     ylab='Site moisture', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
        bty='n', cex = .7)
plot(data$soil, biomass)

Additional variables, including a factor soil.

I’ll return to this example after further discussion of some basic concepts.

Exercise 1: Based on these plots does it appear useful to include these predictors in a model for biomass of this species? Do both variables bring information?

Before going further with FIA I introduce background concepts on Bayesian analysis and regression.

Goals of a Bayesian analysis

A Bayesian analysis serves several purposes, including estimating unknowns, predicting processes and data, and quantifying uncertainty. In Bayesian analysis, all of these are based on, and expressed as, probability distributions.

What makes it Bayes?

In Bayesian analysis, all quantities are assigned probability distributions. Data analysis involves i) a process, represented by a model (that summarizes how it works), ii) evidence, which comes in the form of observations, or data, plus additional prior information that needs consideration, and iii) unknowns, or parameters, that must be estimated by combining all of the above.

All of you will have taken a traditional statistics course, often termed frequentist statistics, but many of you will not have had exposure to a Bayesian perspective. Because traditional statistics did not derive a probability distribution for unknown parameters, it developed a 'frequency' interpretation for point estimates: there is one best value for a given data set, but I could imagine a distribution of these point estimates if I were to repeat an identical experiment many times. In other words, I do an end-run around the need for probability with a concept of frequency. The probabilistic basis for Bayesian analysis makes outputs intuitive and defensible. In the next section I introduce some of the basics and connect them to some of the more familiar topics from an intro statistics course.

Structure leads to hierarchical modeling

Structure in data is common and can be exploited.

The expansion of modern Bayes has gone hand-in-hand with hierarchical modeling. Hierarchical models simplify the synthesis of observations with processes and parameters. They don't have to be Bayesian, but they usually are, because they exploit the Bayesian paradigm so naturally. I introduce these concepts together.

Environmental processes and observations can often be summarized hierarchically. Where processes are not strictly hierarchical, a scientist may see advantages to imposing structure. Structure in time can include hours within days within seasons within years. Time can be structured by 'active periods' for insects or by census intervals for trees or humans. Spatial structure can be defined by watersheds, estuaries, soil types, and land cover types. Voting districts, zip codes, or states can be used. Even where space and time are treated as continuous variables, they may be structured within discrete units.

Process models and data can refer to discrete groups. There can be species within communities, individuals within flocks, sample plots, or political parties. Sample plots can be assigned to blocks (stratification) and political parties to countries. Hierarchies can be a natural way to organize data and models.

Structure in data and processes can not only facilitate, but also complicate, interpretation. Simpson's Paradox and the related ecological fallacy are important considerations for any analysis that involves multiple scales. They can occur when attempting to fit a model at one scale and predict at another. They can also arise when there are unmeasured variables that affect observations, which is ubiquitous in environmental problems. Examples will arise throughout the semester. In the next section I discuss how not only variables, but also model analyses, can be hierarchical: some parameters generate a process, and the process and other parameters together generate data.

Structured variables can be misaligned, examples being species distributions and communities, trophic levels (e.g., omnivory), race and political affiliation, and zip codes and voting precincts. Misalignment introduces challenges for hierarchical modeling.

Many problems are set up with continuous structure. For example, a simple advection model describes how the change in concentration relates to the concentration gradient in space,

\[ \frac{\partial{c}}{\partial{t}} = -u \frac{\partial{c}}{\partial{x}} \] where \(x\) is location, \(t\) is time, and \(u\) is velocity (distance/time). Because computation requires it, continuously structured problems are discretized: there is a discrete time interval \(\Delta_t\) and a discrete distance \(\Delta_x\) such that \(\partial{c}/\partial{t}\) is approximated by \(\left( c(t + \Delta_t,x) - c(t, x) \right)/\Delta_t\). Discretization often introduces instabilities and requires careful thought about rates and scale.
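
To make the idea concrete, here is a minimal sketch of an upwind finite-difference version of this advection equation; the velocity, step sizes, grid length, and initial pulse are arbitrary values chosen for illustration:

u  <- 0.5                            # velocity (distance/time)
dx <- 1; dt <- 1                     # discrete steps Delta_x and Delta_t
nx <- 100; nt <- 50                  # number of space and time steps
c0 <- exp( -(1:nx - 20)^2/20 )       # initial concentration pulse
cmat <- matrix(NA, nt, nx)
cmat[1, ] <- c0
for(t in 2:nt){                      # upwind scheme, stable when u*dt/dx <= 1
  cNow <- cmat[t-1, ]
  cmat[t, ] <- cNow - u*dt/dx*( cNow - c(cNow[1], cNow[-nx]) )
}
plot(1:nx, cmat[1, ], type = 'l', xlab = 'Location x', ylab = 'Concentration c')
lines(1:nx, cmat[nt, ], col = 'grey')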

Hierarchical modeling exploits structure in data and in how processes operate. A submodel can be specified for a process, while another submodel describes the connections from state variables to data. Hierarchical model building uses structure, which enters as the distributions introduced in the next section.

Elements of Bayes

The basic elements of a Bayesian analysis are prior, likelihood, and posterior. Bayesian analysis specifies these elements as probability distributions.

Hierarchical models organize these pieces into levels or stages. These are the knowns (data and priors) and the unknowns (latent processes and parameter values). It is natural to think of inference this way:

\[[\mbox{unknowns | knowns}] = [\mbox{process, parameters | data, priors}]\]

If notation is unfamiliar: \([A]\) is the probability or density of event \(A\), \([A, B]\) is the joint probability of \(A\) and \(B\), and \([A|B]\) is the conditional probability of \(A\) given \(B\). It is important to remember that anything to the left of the vertical bar is random. Anything to the right is fixed. If there is no vertical bar, everything is random. The Appendix summarizes notation.

This structure from Mark Berliner (example here) comes naturally to a Bayesian. It can be unwieldy to the non-Bayesian and can involve improper distributions (e.g., when omitting prior distributions). A hierarchical model typically breaks this down as

\[[\mbox{process, parameters | data, priors}] \propto \] \[[\mbox{data | process, parameters}] [\mbox{process | parameters}] [\mbox{parameters | priors}]\]

The left hand side is the joint distribution of unknowns to be estimated, called the posterior distribution. This structure is amenable to simulation, using Markov Chain Monte Carlo (MCMC). Gibbs sampling is a common MCMC technique, because the posterior can be factored into smaller pieces that can be dealt with one at a time, as simpler conditional distributions.

The elements of the hierarchical model relate directly to Bayes’ theorem. To simplify, consider a joint distribution of data \(D\) and parameters \(P\),

\[ [D, P] = [D | P][P] = [P | D] [D] \] Setting the two expressions on the right-hand side equal to one another and rearranging, I obtain Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] which can be read as 'the probability of parameters given data'.

Exercise 2. I am tested for covid and am diagnosed as not-infected. Use Bayes’ theorem to write down the probability that I have covid.

Steps:
  1. Decide what is observed and what is unknown
  2. Fill in the equation

Exercise 3. What is random in the posterior distribution? What is random in the likelihood? Why are they the same or different?

Prior to posterior

I would like to estimate the mean \(\mu\) of a set of \(n\) measurements \(y_i, i = 1, \dots, n\), using a Bayesian analysis assuming that observations are independent and normally distributed. The variance in observations \(\sigma^2\) is known from previous studies (e.g., instrument error).

The likelihood for the observations \(\mathbf{y}\) is the conditional distribution \([D | P]\). For this normal distribution I write it as

\[[D|P] = [\mathbf{y} | \mu, \sigma^2] = \prod_i^n N(y_i | \mu, \sigma^2)\] From the last section,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I see that I also need a prior distribution for the mean, which I also take to be a normal distribution,

\[ [P] = [\mu] = N(\mu | m, M) \] where \(m\) is the prior mean and \(M\) is the prior variance. The factor \([D]\) in the denominator is simple to obtain in this example, but a distraction for the moment: we are focused on \(\mu\) for now, which is a parameter \(P\) and does not appear in \([D]\). So for now, I write

\[ \begin{align} [P|D] = [\mu | \mathbf{y}, \sigma^2] &\propto [D|P][P] \\ &=\prod_i^n N(y_i | \mu, \sigma^2) \times N(\mu | m, M) \end{align} \] For reasons we will see later, this posterior distribution is also normal,

\[ [\mu | \mathbf{y}, \sigma^2] = N \left( vV, V \right) \] where the parameters are defined by

\[ \begin{align} v &= \frac{n \bar{y}}{\sigma^2} + \frac{m}{M} \\ V &= \left( \frac{n }{\sigma^2} + \frac{1}{M} \right)^{-1} \end{align} \] You might recognize this to be a weighted average of the data, with mean \(\bar{y}\), and the prior, with mean \(m\). The weights are the inverse variances (precisions). In fact, as \(n\) gets large (many observations) and \(M\) gets large (uncertain prior mean), the posterior mean converges to \(vV \rightarrow \bar{y}\) and the posterior variance to \(V \rightarrow \sigma^2/n\). These are the traditional estimates of the mean and the standard error, i.e., the square root of the posterior variance \(\rightarrow \sigma/\sqrt{n}\). Here is a plot of prior and posterior:

n  <- 5                              # sample size
mu <- 1.8                            # true mean used to simulate the data
M  <- 1                              # prior variance
sigma <- 2                           # known observation standard deviation
y  <- rnorm(n, mu, sigma)            # simulated observations
m  <- 0                              # prior mean
mseq <- mu + seq(-5, 5, length=100)  # grid of mu values for plotting

postVr <- 1/(n/sigma^2 + 1/M)               # posterior variance V
postMu <- (n*mean(y)/sigma^2 + m/M)*postVr  # posterior mean vV

prior     <- dnorm( mseq, m, sqrt(M) )           # prior density for mu
posterior <- dnorm( mseq, postMu, sqrt(postVr) ) # posterior density for mu

plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
      ylab = 'Density (1/mu)' )
lines( mseq, prior, col = 'grey', lwd = 2)
abline( v = mean(y) )
abline( v = m, col = 'grey')

Note how the posterior mean lies between the prior mean and the sample mean mean(y) (rendered here as -0.566; your value will differ because y is simulated without a set seed).

Exercise 4: What is the effect of increasing \(n\) on posterior mean and variance in the above example and why?

Exercise 5: Can you change the parameters and data to obtain a posterior variance larger than the variance in the data? Why would the answer to this question be important?

If Bayes were as simple as this problem suggests it would have captured much of the modeling space decades ago. I previously skipped over a sticky detail. Returning to Bayes’ theorem,

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I discussed everything but the denominator, a marginal distribution of the data, which is the source of all sorrow in Bayesian analysis. For this expression to be true, this distribution must involve integration,

\[ [D] = \int [D | P][P] dP \] Recall that multiplication is easy, and division is hard. In calculus, the tables are turned. The calculus counterpart of division is differentiation, and the counterpart to multiplication is integration. However, differentiation is much easier than integration. This integral is known for the normal likelihood \(\times\) normal prior example above, but it is not known for most likelihood-prior combinations.

Posterior simulation techniques like Markov chain Monte Carlo (MCMC) offered ways to approximate this integral and more. Beyond resolving the integration problem, Gibbs sampling (Gelfand and Smith 1990) allowed for hierarchical expansion of models. After gaining familiarity with simple Bayes in the first few chapters, we’ll devote most of the semester to hierarchical problems.
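
To give a flavor of posterior simulation, here is a minimal Metropolis sampler for the normal-mean example above; it reuses y, sigma, m, and M from the earlier code block, and the proposal standard deviation of 0.3 is an arbitrary choice:

ng    <- 5000                             # number of MCMC iterations
chain <- numeric(ng)                      # stored samples of mu, starting at 0
logPost <- function(mu)                   # log likelihood + log prior (unnormalized)
  sum( dnorm(y, mu, sigma, log = TRUE) ) + dnorm(mu, m, sqrt(M), log = TRUE)
for(g in 2:ng){
  prop <- rnorm(1, chain[g-1], .3)        # propose a value near the current one
  a    <- exp( logPost(prop) - logPost(chain[g-1]) )    # acceptance ratio
  chain[g] <- ifelse( runif(1) < a, prop, chain[g-1] )  # accept or keep current
}
mean(chain); var(chain)                   # compare with postMu and postVr above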

Regression as an example of simple Bayes

A regression model contains a response, predictors, parameters, and residual ('unexplained') variation. Least squares minimizes error, maximum likelihood finds the parameters that make the data most likely, and Bayes evaluates a probability distribution of parameters.

This is a good time to read this section of the appendix. First, I introduce a subscript \(i = 1, \dots, n\) to indicate the \(n\) observations of a response \(y_i\) with corresponding predictors \(\mathbf{x}_i\). The index \(i\) is exchangeable, meaning that the order of observations doesn't matter. The responses make up a length-\(n\) vector \(\mathbf{y}\). The predictors make up an \(n \times p\) matrix \(\mathbf{X}\).

A traditional regression model looks like this:

\[ \begin{aligned} y_i &= \mu_i + \epsilon_i \\ \mu_i &= \beta_1 + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} = \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] (again, notation is summarized in an Appendix.) There is an observed response \(y_i\) and predictors in a length-\(p\) design vector \(\mathbf{x}_i\). Here is an example of a design vector for an intercept and two slope variables,

\[ \mathbf{x}_i = (1, x_{i2}, x_{i3})' \]

Note that the first element of the design vector is \(x_{i,1} = 1\), corresponding to the intercept for the model. The error has an expectation \(E[\epsilon] = 0\) and variance \(Var[\epsilon] = \sigma^2\). This is called a linear model, because it is a linear function of the parameters \(\boldsymbol{\beta}\).

Typically, I want to estimate the parameters in the model. To do this, I have to decide on a criterion that constitutes a 'good fit'. The simplest criterion could be to find parameter values that minimize the leftover ('residual') variation. Without specifying a distribution for \(\epsilon\), I could minimize the expected value of the squared error,

\[mse = \min_{\boldsymbol{\beta}} E[\epsilon^2] = \min_{\boldsymbol{\beta}} E\left[ \sum_i^n (y_i - \mathbf{x}'_i \boldsymbol{\beta})^2 \right]\]

with respect to parameters \(\boldsymbol{\beta}\). Within the brackets I am summing the squared difference between the observed \(y_i\) and the value that would be predicted by the model, \(\mathbf{x}'_i\boldsymbol{\beta}\). By taking derivatives with respect to \(\boldsymbol{\beta}\), setting them equal to zero, and solving for \(\boldsymbol{\beta}\), I obtain the least-squares estimates \(\mathbf{\hat{\boldsymbol{\beta}}}\). This is the method of least-squares.
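
As a sketch of this result, the least-squares solution can be computed directly from the normal equations, \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\); the simulated design, coefficients, and error variance below are arbitrary:

nSim  <- 100
xSim  <- cbind(1, rnorm(nSim), rnorm(nSim))              # design matrix with intercept
bTrue <- c(1, 2, -1)                                     # true coefficients
ySim  <- c( xSim %*% bTrue ) + rnorm(nSim, 0, .5)        # response with error
bHat  <- solve( t(xSim) %*% xSim ) %*% t(xSim) %*% ySim  # least-squares estimates
cbind( bHat, coef( lm(ySim ~ xSim[ ,2] + xSim[ ,3]) ) )  # agrees with lm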

Rather than minimizing residual error, maximum likelihood looks for a set of parameters that make the observations most likely given the model. I say most likely rather than most probable, because the parameters are not specified with a probability distribution (that would be Bayesian).

If I specify a normal distribution for errors, \(\epsilon \sim N(0, \sigma^2)\), I can write a likelihood function for the observations, where \(N\) indicates a normal distribution, here with mean zero and variance \(\sigma^2\). Due to special properties of the normal distribution, if \(\epsilon \sim N(0, \sigma^2)\) and \(y = \mu + \epsilon\), then \(y \sim N(\mu, \sigma^2)\), where the parameters are \(\mu\) (mean) and \(\sigma^2\) (variance).

Because the regression model assumes independent observations, I can write the likelihood function as a simple product over the observations:

\[L(\boldsymbol{\beta}, \sigma^2) = \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] The independence assumption means that the probability of all \(n\) observations is equal to the product of the probabilities of each individual observation,

\[\prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) = N(y_1 | \mathbf{x}'_1 \boldsymbol{\beta}, \sigma^2) \times \dots \times N(y_n |\mathbf{x}'_n \boldsymbol{\beta},\sigma^2)\]

The maximum likelihood estimate comes from maximizing the likelihood function with respect to parameters. I obtain the same estimates as I did with least squares. This is true if \(\epsilon\) is normally distributed. For least squares and this maximum likelihood approach, the likelihood is treated as a function of the parameters. Again, the parameters are not defined probabilistically.
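
A numerical sketch of the same idea: minimize the negative log likelihood with optim and confirm that the estimates agree with least squares (this reuses xSim, ySim, and bHat from the sketch above):

negLogLik <- function(pars){                  # pars = (beta1, beta2, beta3, log sigma)
  mu <- xSim %*% pars[1:3]
  -sum( dnorm(ySim, mu, exp(pars[4]), log = TRUE) )
}
fitML <- optim( rep(0, 4), negLogLik )        # numerical minimization
cbind( fitML$par[1:3], bHat )                 # ML and least-squares estimates approximately agree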

A Bayesian view differs from traditional approaches by introducing a prior distribution, which has two important functions: it allows the parameters themselves to be treated probabilistically, and it provides a way to bring additional prior information into the analysis.

Recall Bayes’ theorem:

\[ [P | D] = \frac{[D | P] [P]}{[D]} \] where \(D\) represented data, and \(P\) represented parameters. For the regression model the parameters are \(P = (\boldsymbol{\beta}, \sigma^2)\), and the data are \(D = (\mathbf{y}, \mathbf{X})\). Although both predictors \(\mathbf{X}\) and responses \(\mathbf{y}\) are observed, the regression model assumes that \(\mathbf{X}\) is part of a fixed design–it is not random. For this reason, we do not write a distribution for \(\mathbf{X}\), but rather we ‘condition’ on the fixed values in \(\mathbf{X}\). Making these substitutions, Bayes’ theorem becomes

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}] = \frac{[\mathbf{y} | \boldsymbol{\beta}, \sigma^2] [\boldsymbol{\beta}, \sigma^2]}{[\mathbf{y}]} \]

Again, the posterior distribution of parameters is on the LHS. The denominator is the marginal distribution of the data, \([\mathbf{y}]\). Again, I’ll say more about this later, but ignore it for now.

As before, the left-hand side is "parameters given data", or the posterior distribution. In the numerator on the right, I have the likelihood times prior. In Bayesian analysis the likelihood is treated not as a function to be optimized, but rather as a distribution for the data that must be combined with the prior to yield a distribution for parameters (the posterior). The second factor on the right indicates an (unspecified) prior distribution for parameters, this time for the regression coefficients and residual variance. Technical details follow later; for now I focus on general concepts.

Substituting for the likelihood and leaving the prior distribution unspecified (for now), I have

\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}] \propto \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) [\boldsymbol{\beta}, \sigma^2] \]

There are similarities and differences in these approaches. If the prior distribution is flat, the posterior distribution for this regression model has marginal means and standard deviations that match the point estimates and standard errors I obtain by traditional methods. The impact of the prior distribution becomes important when it is made informative and as models increase in size.
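
A minimal sketch of this correspondence, using the simulated xSim and ySim from above, a flat prior, and \(\sigma^2\) fixed at its residual estimate: the posterior for \(\boldsymbol{\beta}\) is then normal with mean \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\) and covariance \(\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\), matching the lm estimates and standard errors:

s2    <- sum( (ySim - xSim %*% bHat)^2 )/(nSim - 3)      # residual variance estimate
Vbeta <- s2 * solve( t(xSim) %*% xSim )                  # posterior covariance, flat prior
cbind( bHat, sqrt( diag(Vbeta) ) )                       # posterior mean and standard deviation
summary( lm(ySim ~ xSim[ ,2] + xSim[ ,3]) )$coefficients[ , 1:2]  # same values from lm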

Model graph

Models can be represented as graphs. Model elements are connected by arrows that establish the relationships between variables that guide model building.

Below are three graphs, each describing a different model. The Bayesian regression model is shown in (a).

Three regression variations in graphical form. Solid arrows are stochastic. Circles are unknown (estimated). Symbols in (a) emphasize variables that 'anchor' the fit–observed variables and prior parameter values are known, whereas others are estimated. a) Unknown parameters and known predictors \(X\) generate response \(Y\). b) The Tobit model could be viewed as hierarchical or not. \(Y\) is a deterministic function of \(W\) (see text). c) The hierarchical model has an additional stochastic stage.


In this diagram, there are boxes, circles, dashed lines, and solid lines. The graph provides a visual interpretation of the model. I discuss symbolism in the next sections.

Known and unknown variables: Variables are nodes in the graph. The known variables are boxes, including data and prior parameter values. They anchor the model fit, because they are constant. The unknown variables and parameters are circles. In the regression example, these are the regression coefficients, the length-\(p\) vector \(\boldsymbol{\beta}\), and the residual variance \(\sigma^2\).

Stochastic and deterministic variables: Both the response \(y_i\) and the predictors \(\mathbf{x}_i\) are known. However, the predictors are treated as deterministic, whereas the response is treated as stochastic before it is observed. Specifically, \(y_i\) is assigned a distribution, the likelihood. The vector \(\mathbf{x}_i\) is not assigned a distribution. In observational studies it is not always clear what should be a predictor and what should be a response.

Stochastic and deterministic arrows: Solid arrows in the graph indicate stochastic relationships, meaning that knowledge of a parent node does not tell us the value at the child node (or vice versa). Deterministic relationships can often be omitted from the graph. One deterministic relationship is shown in part (b) as a dashed line. This example will be discussed below.

An application to FIA

I use an application to illustrate some differences between a traditional and Bayesian analysis. Here are some R objects I use in this example:

R packages: gjam, repmis

R objects: list, data.frame, numeric, numeric vector, numeric matrix, source_data

R functions: as.formula, attr, cbind, lm, names, par, plot, points, predict, summary

To learn about a function, I use the help page:

help(lm)

In addition to these objects, I introduce the connection between a variable's subscript and its location in an array, which can be a vector or a matrix.

A traditional regression looks like the graph in part (a), but it does not include the prior distribution for parameters. Here is a linear regression fitted to the abundance of loblolly pine (Pinus taeda) in the USDA Forest Inventory and Analysis (FIA) program data. I use the same file loaded at the beginning of this unit.


To determine the relationship between abundance and environmental variables, I want a model for the response biomass with predictors. Here are the names of variables in data:

names(data)
## [1] "biomass"  "temp"     "deficit"  "moisture" "u1"       "u2"       "u3"      
## [8] "stdage"   "soil"

These are the column headings in data. To see this, I can look at the first two rows:

 data[1:2,]

With the exception of "soil", these are continuous variables. Again, the variable soil is a factor with these levels:

attr(data$soil,'levels')
## [1] "EntVert"   "Mol"       "reference" "SpodHist"  "UltKan"

These are soil ‘orders’, based on parent material.

I start with the standard function lm in R for linear regression. In the formula below I specify a model containing moisture with a quadratic effect, written as I(moisture^2). Wrapping a term in I() tells R to evaluate the expression arithmetically, as written, rather than interpret it as formula syntax.

High correlation in predictor variables would mean that I cannot discriminate their contributions. These variables are ok. A call to pairs(data, cex=.1) will show all pairwise combinations.
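
In addition to pairs, a quick check is the correlation matrix for the continuous predictors used below:

round( cor( data[ , c('temp', 'deficit', 'moisture', 'stdage')] ), 2 )   # pairwise correlations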

I can fit a linear regression with response biomass to several combinations of variables using lm and plot the fitted responses.

par(mfrow=c(1,2),bty='n', mar=c(4,4,1,1))
plot( data$moisture, biomass, cex=.2)                # wet sites
fit1 <- lm(biomass ~ moisture, data)   
p1   <- predict(fit1)
fit2 <- lm(biomass ~ moisture + I(moisture^2), data) # quadratic
p2   <- predict(fit2)                                # predictive mean
points(data$moisture,p1, col='green', cex=.2)        # add to plot
points(data$moisture,p2, col='blue', cex=.2)   

#repeat with additional variables
plot( data$deficit, biomass, cex=.2)

Biomass fitted to moisture and climate deficit showing observed and predicted.

form <- as.formula(biomass ~ soil + moisture*stdage + temp + deficit)
fit3 <- lm(form, data)  
p3   <- predict(fit3)
plot(biomass, p3, cex=.2, xlab = 'Observed', ylab = 'Predicted')
abline(0, 1, lty=2)

Observed versus predicted biomass for the larger model (fit3).


For more on the use of variable names in the above block of code, see the note about R environments in the Appendix.

Plots show the predictions for a linear and a quadratic model (fit1 and fit2) and for a larger model with main effects, an interaction moisture*stdage, and the factor soil.

I can use the function summary to look at the estimates in a fitted object, e.g.,

summary(fit3)
## 
## Call:
## lm(formula = form, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.026  -4.051  -0.579   1.673  61.399 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.0033     0.7137   2.807  0.00506 ** 
## soilMol          -0.9777     1.3819  -0.708  0.47934    
## soilreference     1.1961     0.7473   1.601  0.10966    
## soilSpodHist      2.2552     0.8486   2.658  0.00795 ** 
## soilUltKan        7.2434     1.2289   5.894 4.58e-09 ***
## moisture         -0.4506     0.1837  -2.453  0.01427 *  
## stdage           -1.4494     0.1887  -7.682 2.70e-14 ***
## temp              3.8207     0.2339  16.334  < 2e-16 ***
## deficit           0.4002     0.2167   1.847  0.06491 .  
## moisture:stdage   0.9336     0.1980   4.715 2.62e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.308 on 1607 degrees of freedom
## Multiple R-squared:  0.3031, Adjusted R-squared:  0.2992 
## F-statistic: 77.66 on 9 and 1607 DF,  p-value: < 2.2e-16

Standard products of a traditional regression include a table of point estimates for the coefficients in \(\boldsymbol{\beta}\), with their standard errors, which are a measure of uncertainty. There is a Student's t statistic for each estimate, a measure of how far the estimate is from a hypothesized value of zero. For each estimate there is a P value, the probability of obtaining a value of t at least this large under the hypothesis that the coefficient is equal to zero. There is an F statistic and P value for the entire model. The degrees of freedom is the sample size \(n\) minus the number of fitted coefficients (here, \(1617 - 10 = 1607\)). The R-squared is interpreted as the 'variance explained' by the model. The residual standard error is the estimate of the parameter \(\sigma = \sqrt{\sigma^2}\). The 95% confidence intervals for these estimates would be approximately the point estimates \(\pm 1.96\) times the standard errors, or:

##                 estimate   0.025   0.975
## (Intercept)        2.000  0.6080  3.4000
## soilMol           -0.978 -3.6800  1.7200
## soilreference      1.200 -0.2650  2.6600
## soilSpodHist       2.260  0.5960  3.9100
## soilUltKan         7.240  4.8400  9.6500
## moisture          -0.451 -0.8100 -0.0915
## stdage            -1.450 -1.8200 -1.0800
## temp               3.820  3.3600  4.2800
## deficit            0.400 -0.0234  0.8240
## moisture:stdage    0.934  0.5470  1.3200
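
Here is a sketch of how a table like this could be assembled from the fitted object; confint(fit3) gives the exact t-based intervals:

b  <- summary(fit3)$coefficients                      # estimates and standard errors
ci <- cbind( estimate = b[ ,1],
             '0.025'  = b[ ,1] - 1.96*b[ ,2],
             '0.975'  = b[ ,1] + 1.96*b[ ,2] )        # approximate 95% intervals
signif(ci, 3)                                         # compare with confint(fit3)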

Together, this combination of outputs for the linear regression would contribute to an interpretation of how abundance of this species responds to climate and habitat.

Exercise 6. Repeat this analysis using function lm with a different species as a response. Interpret the analysis.

A Bayesian approach

Leaving technical detail for subsequent units, this section compares the previous result with a Bayesian analysis incorporating a non-informative prior distribution. As mentioned above, a Bayesian analysis combines the likelihood with a prior distribution. In this analysis, the prior distribution is taken to be non-informative for the coefficients \(\boldsymbol{\beta}\) and the residual variance \(\sigma^2\). Here is a Bayesian analysis to compare with fit3:

fitb <- bayesReg(form, data)
## 
## Coefficients:
##                  median std error     0.025   0.975 not zero
## intercept         1.996    0.7245    0.5601   3.381        *
## soilMol           -1.02      1.39     -3.62   1.717         
## soilreference     1.193    0.7572   -0.2457   2.684         
## soilSpodHist      2.261    0.8384     0.644   3.918        *
## soilUltKan        7.262     1.243     4.763   9.682        *
## moisture        -0.4547    0.1805   -0.8034 -0.0959        *
## stdage           -1.445    0.1887     -1.81  -1.075        *
## temp              3.822    0.2349     3.371   4.279        *
## deficit          0.3997    0.2152 -0.004583  0.8248         
## moisture:stdage  0.9293    0.1955    0.5416   1.302        *
## 
##  * indicates that 95% predictive interval does not include zero
## 
## Residual standard error 7.308, with 1607 degrees of freedom, 
##  root mean sq prediction error 7.183.

The function bayesReg organizes output a bit differently from lm, reporting 95% credible intervals [the 0.025 and 0.975 quantiles]. The output is simpler than that generated by lm–there are not a lot of statistics. However, I see that point estimates and standard errors for coefficients are nearly the same as I obtained with lm. I also see that the coefficients having credible intervals that do not span zero in the Bayesian analysis are the same coefficients that lm flagged as 'significant'.

Same estimates from both methods.

An earlier section asked 'What makes it Bayes?'. The shift from a classical to a Bayesian analysis did not change how I interpret effects of climate and habitat on biomass of loblolly pine (Pinus taeda). However, the Bayesian model can be extended in a number of ways. In the next section I illustrate a hierarchical version having an additional stage.

At least two problems with both the traditional and Bayesian regressions are that the normal distribution i) does not allow any discrete values, including zero, and ii) puts positive probability on negative values, whereas biomass can only be non-negative. Biomass data diverge from the assumptions of the model in that observations \(Y \in [0, \infty)\)–there is point mass at zero. Recall that the probability of any single value is zero for a continuous variable. Positive biomass values are continuous, but zero is discrete. Here the data are dominated by zeros:

hist(biomass, nclass=50)                            # distribution

Response has many zeros.

length(which(biomass == 0))/length(biomass)
## [1] 0.733457


If there were only a few zeros, with most values bounded away from zero, I might argue that it's close enough. That's not the case here. Standard diagnostic plots, such as plot(fit1), make this clear.

If these were discrete data, I might turn to a zero-inflated Poisson or negative binomial model. These are examples of generalized linear models (GLMs) to be discussed in later units. These models sometimes work ok, provided there are not too many zeros, but here zeros dominate. Regardless, these GLMs describe discrete data, so I cannot use either.

I cannot add a small amount to each observation and then transform the data to a log scale. If I do that, every coefficient I estimate depends on the arbitrary value I used.

A Tobit regression allows for continuous responses with zeros, and it is readily implemented as a Bayesian model. Part (b) of the model graph includes an additional stage for a variable \(W\). This variable has a normal distribution; it is defined on \((-\infty, \infty)\). The observed \(Y\) is a censored version of \(W\): it is equal to \(W\) whenever \(W > 0\), and when \(Y = 0\) the latent variable \(W\) is non-positive:

\[ y_i = \begin{cases} w_i & \quad w_i > 0\\ 0 & \quad w_i \leq 0\\ \end{cases} \]

Treating \(Y\) as a censored version of \(W\) allows me to combine continuous and censored data without changing the scale. In a Tobit model the censored value is zero, but it also works with other values.

With this model the regression moves to the latent \(W\),

\[w_i \sim N(\mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] The model has become hierarchical. I fit this model in a Bayesian setting, again, requiring a prior distribution.
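
Because the Tobit model is generative, a minimal sketch can simulate censored data in the forward direction; the sample size and coefficient values below are arbitrary:

nTob <- 1000
xTob <- cbind(1, rnorm(nTob))             # design: intercept and one predictor
bTob <- c(-1, 2)                          # true coefficients
wTob <- rnorm(nTob, xTob %*% bTob, 2)     # latent w on (-Inf, Inf)
yTob <- pmax(wTob, 0)                     # observed y, censored at zero
mean(yTob == 0)                           # fraction of zeros in the simulated response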

Now allowing for the zeros in \(Y\), the fitted coefficients differ substantially from both previous models:

fitTobit <- bayesReg(form, data, TOBIT=T)
## fitted as Tobit model
## 
## Coefficients:
##                   median std error  0.025  0.975 not zero
## intercept         -27.96     3.074 -33.48 -21.96        *
## soilMol           -8.337      7.02 -23.39  4.459         
## soilreference      6.517     2.161   2.22  10.78        *
## soilSpodHist      -3.945     3.392 -10.96  2.325         
## soilUltKan         15.24     2.854  9.762  20.87        *
## moisture        -0.03353     0.562 -1.177  1.057         
## stdage            -4.146    0.6453 -5.393 -2.859        *
## temp               23.84     1.865  19.98  26.85        *
## deficit           -1.102    0.6886 -2.431 0.3052         
## moisture:stdage    1.348    0.6052 0.1422  2.502        *
## 
##  * indicates that 95% predictive interval does not include zero
## 
## Residual standard error 14.2, with 1607 degrees of freedom, 
##  root mean sq prediction error 6.892.
par(mfrow=c(1,2),bty='n')
plot(biomass, p3, cex=.2, ylab='Predicted values')
points(biomass,fitTobit$predictY[,2], col=2, cex=.2)
abline(0,1,lty=2)
abline(h=0)
plot(summary(fit3)$coefficients[,1],fitTobit$coeff[,1],
     xlab='Traditional regression',ylab='Tobit')
abline(0,1,lty=2)

Prediction from linear regression (black) and the Tobit model (red). Mean parameter estimates at right show large differences.


Exercise 7. Using a different species and predictors, interpret differences between estimates for the Tobit and standard regression model.

I called the Tobit model 'hierarchical', but some could see it differently. The \(W\) stage in the model is 'partially known'. Note the dashed line in part (b) of the graphical model. If I know the value of \(W\), then I also know the value of \(Y\). So in the 'generative direction' from \(W\) to \(Y\) I can view their relationship as deterministic. However, knowing \(Y\) does not necessarily mean that I know \(W\): if \(Y = 0\), then \(W\) is stochastic.

In summary, not only is the Tobit defensible as a valid model for continuous data with zeros–it also finds more effects and better predicts data than the traditional regression. Many value the hierarchical framework for its flexibility to admit additional stages.

Background concepts

In this section, I introduce some concepts that will come up again this semester.

The terms process model and statistical model represent a false dichotomy.

Hierarchical models typically include both elements. They are used for inference and to predict processes and data, with uncertainty. To take a simple recent example, development of spring phenological state could be written as a rate equation,

\[\frac{dh}{dt} = r(t; \mathbf{x}(t)) \left(1 - h(t) \right)\]

for development rate \(r(t)\) that changes over time, in part due to environmental changes contained in \(\mathbf{x}(t)\). Most ecologists would call this a 'process model'. However, when discretized, written in terms of predictors, and combined with a data model for ordinal observations, it's used for inference (Clark et al. 2014). Hierarchical modeling has made the distinction between statistical and process models a false dichotomy.
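
As a minimal sketch of the discretized process, here the rate is held constant for illustration (the application above lets \(r(t)\) depend on predictors \(\mathbf{x}(t)\)):

nt <- 100; dt <- 1
r  <- .05                                # development rate, here held constant
h  <- numeric(nt)                        # phenological state, starting at zero
for(t in 2:nt) h[t] <- h[t-1] + r*(1 - h[t-1])*dt   # Euler discretization of dh/dt
plot(1:nt, h, type = 'l', xlab = 'Time', ylab = 'Phenological state h(t)')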

Design determines the information that data have to offer model fitting, and it lends its name to the design matrix in ANOVA and regression. The common elements of a study design include replication, stratification, and randomization (RSR).

Information can be evaluated in many ways, such as the precision a model provides on parameter estimates or predictions. There is a large literature on ‘optimal design’ and how it benefits from RSR.

Studies can be observational (variables controlled only to the extent possible from sample deployment) or experimental (variable manipulation). The model for experimental data may be a model for the design (e.g., ANOVA), but it doesn’t have to be. Experiments may be established to evaluate many processes, each analyzed with a different model.

Observational data have an advantage in being available at scales that could not be studied with controlled experiments. Large observational networks, some global, are examples. Weather station data, now combined with satellite imagery, are the basis for modern understanding of ocean and atmospheric circulation. Epidemiological evidence for public health trends is widely used in the social sciences, as are polling data in public policy. Disease surveillance can be active or passive, the latter being easier to get, but harder to use.

The big concern with observational data is how to attribute cause and effect. When many variables change, interacting with one another as well as with unmeasured variables, there may be no way to identify what's affecting what. Are people healthy because they exercise, or does being healthy motivate exercise?

It can be hard to even decide what goes on the right and the left side of the model. Does high mineralization rate promote N-demanding plants and microbes, or vice versa? What do I do when I know that both are true?

Experiments benefit from manipulation and control. Treatment effects might be isolated when other variables are controlled. Control is critical to inference in many settings.

Experimental results can be difficult to extrapolate outside the experimental frame, which could be the laboratory, growth chamber, greenhouse, or small field plot. Only a fraction of important questions are amenable to experimentation. Many ecologists and social scientists look for ‘natural experiments’.

Model fitting is a balance between data and model.

Many arguments about climate change focus on global mean temperatures, obtained by averaging station data that go back over a century. This seems to be a data-driven debate ('What do the data say?') until we confront the fact that the mean of the stations is not the global mean, because stations are not uniformly distributed over the globe. Models are used to translate unevenly distributed data into a less biased estimate of regional climate. The PRISM data are an example.

Multiple testing refers to questions that depend on more than one hypothesis test.

I might want to know if the abundances of microbial taxa decline with salinity in estuaries, so I fit regressions to concentrations of each of thousands of OTUs (operational taxonomic units). We expect to reject the null at \(\alpha = 0.05\) for 5% of tests where the null hypothesis is true. A traditional Bonferroni correction divides the \(\alpha\)-value threshold by the number of tests, based on the assumption that tests are independent. By building models hierarchically, Bayesian analysis sidesteps this problem–the dependence between 'tests' is built into the model.
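
A minimal sketch of the multiple-testing problem, fitting many regressions of pure noise on a single predictor so that every null hypothesis is true; the number of 'taxa' is arbitrary:

nTest <- 1000; nObs <- 50
salinity <- rnorm(nObs)
pvals <- replicate( nTest,                             # regressions of pure noise on salinity
  summary( lm( rnorm(nObs) ~ salinity ) )$coefficients[2, 4] )
mean( pvals < .05 )                                    # about 5% false positives
mean( p.adjust(pvals, 'bonferroni') < .05 )            # few or none after Bonferroni correction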

The ecological fallacy and Simpson’s paradox refer to the fact that inference at one scale or level of organization may not represent relationships at another.

It’s not hard to find examples where individual and group behavior appear to show different relationships. In public policy, we now know that rich states vote Democratic, while rich people vote Republican (Gelman 2009). In the last few decades, the mean household income has increased, while the income of the mean household has declined. Ecological communities (species distributions and abundances) are poorly predicted by aggregating predictions for individual species (Clark et al. 2011). These are examples of 'Simpson's Paradox'.

Conversely, models of total numbers of species do not predict extinction threat, which operates at the scale of individual species, an example of the ‘ecological fallacy’. The ecological fallacy refers to the fact that group behavior does not predict individuals within groups (rich people do not vote like rich states). Gerrymandered voting precincts exploit this fact.

A generative model goes forward and backward; it predicts the data that fitted it.

Prediction and simulation are the forward direction: start with parameters and generate data. Model fitting is the inverse direction. Generative models are needed if the model is to be evaluated with prediction–can model fitting recover the parameters used to simulate the data?

Many models are not generative. When they are not, evaluation is difficult. In this course prediction plays a large role in model checking, so we focus on generative models.
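
As a sketch of the forward and inverse directions, simulate data from known parameters and check whether fitting recovers them; the true values below are arbitrary:

bGen <- c(2, -1)                                # known parameters
xGen <- rnorm(200)
yGen <- rnorm( 200, bGen[1] + bGen[2]*xGen, 1 ) # forward: generate data from the model
coef( lm(yGen ~ xGen) )                         # inverse: estimates should be close to bGen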

Exercise 8. What are the three elements of a design? What does each contribute?

Recap

Am I a Bayesian? Many who make use of Bayesian models and software would not necessarily see themselves as particularly Bayesian. Models and software may be selected for convenience or other considerations that have little to do with philosophy. Avowed non-Bayesians may exploit MCMC methods, while making a point of omitting priors. Staunch Bayesians may sometimes omit priors where they do not affect results or interpretation, but include them elsewhere. Users of packages like INLA may not know that it uses prior distributions (it's Bayesian). Many big data applications bridge machine-learning and statistical worlds, exploiting approximations of many types. Bayes has expanded rapidly in large part for pragmatic reasons: simplicity and the capacity to address large problems.

Bayesian analysis combines a prior distribution with a data model, or likelihood, to arrive at a posterior distribution of unknown parameters and latent variables. Deterministic variables are constant, whereas stochastic variables are defined by distributions. Known variables are constant once observed, but they can still be treated as stochastic and, thus, defined by distributions, such as the response \(Y\) in the regression model. Graphs are used to communicate model structure and organize algorithms. They provide a visual representation of hierarchical models. A common hierarchical structure is [data | process, parameters][process | parameters][parameters].

Basic regression models provide an opportunity to introduce similarities and differences. Least-squares, maximum likelihood, and Bayes share some common features at this simplistic level, but diverge as models increase in size.

Both experimental and observational data are analyzed in environmental sciences. They differ in design, control, and, often, scale. Fundamental design considerations include randomization, stratification, and replication.

Model fitting and prediction are ‘backward’ and ‘forward’ views. Generative models do both and, thus, can be fully evaluated (in both directions).


Appendix

Note about R environments

To fully understand the block of code for the lm fit, I need to know that the variable biomass is defined in my global environment (see previous code block). I can enter length(biomass) and get an answer, because biomass exists in the working environment. I have not assigned the variables deficit and moisture in my global environment. They only exist within the data.frame data. When I call the function lm it knows to look for variables in my global environment or in data, because I have passed data as an argument. The functions plot and points do not look for variables in this way. When I call them, I must specify the data.frame with the variable name, data$deficit. Using R is the subject of Unit 2.
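
A minimal sketch of this scoping behavior, using the objects already created above:

fitEnv <- lm(biomass ~ moisture, data)    # lm finds 'moisture' inside data
length(biomass)                           # works: biomass is in the global environment
# length(moisture)                        # would fail: 'moisture' exists only inside data
plot(data$moisture, biomass, cex = .2)    # plot needs the data$ prefix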

Notation

Here are some notation conventions used in these vignettes.

notation example interpretation
italic \(x\), \(X\) scalar quantity, known
greek \(\alpha, \beta, \dots\) stochastic (fitted) variable, unknown
parentheses \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) parametric function/density
curly brackets \(\{0, 1, \dots\}\) a set of objects
closed interval \((-\infty,0]\), \([0, \infty)\) intervals include zero
open interval \((-\infty,0)\), \((0, \infty)\) exclude zero
distributed as \(x \sim N(\mu, \sigma^2)\) distribution or density
expectation \(E[\epsilon] = 0\) expected value of a variable
variance \(Var[\epsilon] = \sigma^2\) variance of a variable
bracket, distribution \([A, B, C]\) unspecified density or distribution
proportional \(f(x) \propto g(x)\) differ by a constant, \(f(x) = c g(x)\)
approximate \(\pi \approx 3.14159\) approximately equal
real numbers \(\mathbb{R} = (-\infty, \infty)\) note: also positive real \(\mathbb{R}_+\)
is an element of \(\pi \in \mathbb{R}\) \(\pi\) is a real number
subset \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) in the set
proper subset \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) cannot include all elements
union \(a \cup b\) either \(a\) or \(b\)
intersection \(a \cap b\) both \(a\) and \(b\)
sum \(\sum_{i=1}^n x_i\) \(x_1 + \dots + x_n\)
product \(\prod_{i=1}^n x_i\) \(x_1 \times \dots \times x_n\)
exponentiate \(e^x\), \(exp(x)\) \(e \approx 2.71828\), inverse of \(ln(x)\)
natural logarithm \(ln(x)\) inverse of \(e^x\)

Matrices

notation example interpretation
bold, l.c. \(\mathbf{x}\), \(\mathbf{x}'\) column and row vectors, respectively
bold, u.c. \(\mathbf{X}\) matrix
dimensions \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) \(n\) rows, \(q\) columns
subscript \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) element of an array
row vector \(\mathbf{X}_{i\cdot}\) row \(i\)
column vector \(\mathbf{X}_{\cdot j}\) column \(j\)
transpose \(\mathbf{X}'\) rows become columns
matrix inverse \(\mathbf{A}^{-1}\) solve systems of linear equations
identity matrix \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) \(p \times p\) with 1’s on the diagonal, zeros elsewhere
matrix determinant \(det\left( \mathbf{A} \right)\) obtain for a square matrix, e.g., covariance
kronecker product \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) \(np \times mq\) matrix, defined in text