The emergence of large data sets in environmental science has changed statistical analysis, shifting emphasis toward data wrangling and algorithmic approaches, especially Bayesian methods. Bayesian analysis departs from alternative methods in its application of probability to all aspects of model fitting, with the side benefit of simplifying interpretation. The elements of a Bayesian analysis include distributions for prior, likelihood, and posterior. Hierarchical models emerge naturally in the Bayesian framework as a means for analyzing high-dimensional problems without requiring a change in approach. Graphs help to organize hierarchical modeling. Basic concepts are introduced using regression.
Discussion reading: select two papers and come prepared to discuss them. Post questions for discussion on sakai resources/groupDocs.
Reproducibility (in The Atlantic and NY Times) is intimately linked to the notion of uncertainty and, thus, probability. Statistics are used both to communicate uncertainty and for forensic analyses of questionable results.
Redefining statistical significance, motivated in part by the reproducibility crisis in psychology, classical and Bayesian statisticians compromise; if \(P\) values are to be used, then \(P = 0.05\) is far too high, Nature.
Why Big Data Could Be a Big Fail, Jordan on potential and limitations of Big Data (misleading title).
Why environmental scientists are becoming Bayesians, why the proliferation of Bayes in environmental science, Ecol Letters.
Breakout: Identify groups and moderator for Discussion next time
Plenary: Foundation materials in Unit 1
Objectives:
Two themes will repeat throughout this course: big data and Bayesian analysis. Statistical analysis traditionally focused on canned software applied to simple (and small) data sets containing tens to hundreds of observations and several variables. Linear regression and ANOVAs led to \(P\)-values that determined whether or not a study was publishable. Large observational data sets are now commonly wrangled from the internet and used to ask questions that require new tools for data manipulation and hierarchical model building. Increasingly, machine learning methods process large data files, but often without clear process interpretation or tractable treatment of uncertainty. Hierarchical models offer the potential to integrate multiple types of data, while preserving a focus on science: How does the process work? Can we estimate the relationships and predict them? Can we assign a probability statement to the result? Theory is needed for model building. Algorithms are the basic tools needed both for data manipulation and model evaluation.
A Bayesian analysis serves several purposes:
Estimation to learn about parameters.
Model evaluation to determine whether or not the model describes the data well enough to be useful.
Variable selection to evaluate how adding/subtracting variables may improve the model.
Prediction to anticipate processes or data, for model evaluation, or for understanding. Predictions can be ‘in-sample’ or ‘out-of-sample’, the latter referring to times/locations other than where data were collected. They can be functions of outputs, such as sensitivity.
In Bayesian analysis, all are based on/expressed as probability distributions.
Uncertainty is one of the most important concepts in science, and probability is one way to communicate uncertainty. It is closely linked to the notion of reproducibility in science. How confident am I that the outcome of an experiment or observational study would agree with future studies? Probability, a value between zero and one, expresses confidence or “belief”. Maximum uncertainty (ignorance) is expressed by a probability of 1/2, e.g., a coin toss.
Probability has an advantage over other ways of describing uncertainty, because it can be quantified and tracked through combinations of events, e.g., “will it rain tomorrow” and, if so, “will I remember to carry an umbrella”? Distribution theory allows us to put these events together.
In Bayesian analysis, all quantities are assigned probability distributions. Data analysis involves i) a process, represented by a model (that summarizes how it works), ii) evidence, which comes in the form of observations, or data, and additional prior information that needs consideration, and iii) unknowns, or parameters, that must be estimated by combining all of the above.
All of us have taken a traditional statistics course, often termed frequentist statistics, but not everyone has been exposed to a Bayesian perspective. Because traditional statistics did not derive a probability distribution for unknown parameters, it developed a ‘frequency’ interpretation for point estimates–I could have a distribution of point estimates if I were to repeat an identical experiment many times. In other words, I might do an end-run around the need for probability with a concept of frequency. This approach gets harder to apply as problems become more complex. The methods can also become unstable. The probabilistic basis for Bayesian analysis makes outputs intuitive and defensible. Complexity is added hierarchically. In the next section I introduce some of the basics and connect them to some of the more familiar topics from an intro statistics course.
Structure in data is common and can be exploited.
The expansion of modern Bayes has gone hand-in-hand with hierarchical modeling. Hierarchical models simplify the synthesis of observations with processes and parameters. They commonly exploit the Bayesian paradigm. Hierarchical models don’t have to be Bayesian, but they usually are. After defining a few terms I introduce these concepts together.
Variables are continuous or discrete. Time is a continuous variable, because it can be measured to a degree of precision that depends on the measurement device. Temperature is likewise continuous. Population size is a discrete variable, limited to non-negative integer values \(0, 1, \dots\).
State variables describe how a process develops in space, time, or both. Population size, degree days, and energy expenditure are examples of state variables. Like time and space, state variables can be continuous or discrete. Again, population size is discrete, taking values on the non-negative integers.
Continuous and discrete variables are not always treated that way in models. For example, I can choose to reference continuous variables with discrete values, such as hour, day, month, year, or epoch. Likewise, I can choose to reference discrete variables with continuous values, such as population density (number/area). Population density is a continuous variable, taking values on the non-negative real numbers, but with a discrete zero, \(\{ 0, \mathbb{R}^+ \}\). In this notation, \(\mathbb{R}^+\) refers to the positive (continuous) real numbers. In practice, continuous values are reported to a finite precision; for example, the value of \(\pi\) truncated to six significant figures is 3.14159.
A state variable can be structured into continuous (size or age) or discrete (size or age class) states. Age is often taken to the nearest (discrete) year. Relatedly, developmental stage can summarize discrete structure, such as young-of-the-year, juvenile, and adult; or seedling, sapling, codominant, and dominant. Structure in time can be represented as days within seasons within years. Time can be structured by ‘active periods’ for insects or census intervals for trees or humans–the US census has occurred every 10 yr since 1790. Spatial structure can be discrete, as defined by boundaries of watersheds, estuaries, soil types, and land cover types. Voting districts, zip codes, and states are discrete structures. Even where space and time are treated as continuous variables, they may be structured within discrete units.
When I want to refer to continuous structure, I put the continuous variable in parentheses, such as abundance as a function of the continuous variable time \(t\), \(y(t)\), or location \(x\), \(y(x)\). To reference discrete structure I use a subscript, such as abundance in year \(t\), \(y_t\), or on plot \(x\), \(y_x\). I can have multiple discrete classes or continuous variables, such as discrete plot and year \(y_{x,t}\) or continuous location and time \(y(x,t)\). Structure can be mixed, such as continuous population growth within a discrete habitat, \(y_x(t)\). There can be species within communities or individuals within flocks, sample plots, or political parties. Sample plots can be assigned to blocks (stratification).
Structure in data and the processes of interest requires proper interpretation. Simpson’s Paradox and the related ecological fallacy are important considerations for any analysis that involves multiple scales. Scale is a consideration when data are aggregated, e.g., individuals to plots, zip codes, precincts, or watersheds. It arises when attempting to fit a model at one scale and predict at another scale. It can arise when there are unmeasured variables that affect observations. I may be interested in a “species response” to fertility or climate, but the data I have are abundances of individuals on sample plots across a climate gradient. We return to this issue throughout the semester.
Structured variables can be misaligned, e.g., when some variables are referenced to zip codes, while others are referenced to voting precincts. Forest inventory plots are not aligned with weather station data used to assign climate variables to the plot. These misaligned variables introduce challenges that can be addressed with hierarchical modeling.
Consider a simple example with continuous structure. A simple population growth model describes how change in abundance \(z\) relates to the population size and its rate of increase in continuous time,
\[ \frac{dz}{dt} = r z \] where \(z\) is population density, \(t\) is continuous time, and \(r\) is the per capita rate of increase (1/time). I can solve this model for abundance at an arbitrary time, \(z(t) = z(0)e^{rt}\). However, I do not observe the population continuously. Instead I count the number of individuals \(y_{j,k}\) at specific times \(k = 1, \dots, K\) on one or more sample plots, \(j = 1, \dots, J\). The counts might have a Poisson distribution,
\[\begin{equation} \tag{1.1} y_{j,k} \sim Poi(A_j z(t_k) ) \end{equation}\] where \(A_j\) is the area of sample plot \(j\), and \(z(t_k)\) is the population size when observation \(k\) was obtained. The unknown population density has units of animals per ha, while the counts \(y_{j,k}\) are dimensionless. The full model can be written as two parts:
\[\begin{align} y_{j,k} &\sim Poi(A_j z(t_k) ) \\ z(t) &= z(0)e^{rt} \end{align}\]
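As a quick illustration, here is a minimal R sketch that simulates data from this two-part model; the parameter values (\(r\), \(z(0)\), plot areas, observation times) are arbitrary choices for illustration, not estimates.

set.seed(1)
r  <- 0.1                      # per capita rate of increase (1/yr), assumed
z0 <- 2                        # initial density z(0) (animals/ha), assumed
A  <- c(0.5, 1, 2)             # areas of J = 3 sample plots (ha)
tk <- 0:10                     # K = 11 observation times (yr)
z  <- z0*exp(r*tk)             # latent density z(t_k) from the process model
y  <- sapply(A, function(a) rpois(length(tk), a*z))  # K x J counts, y[k,j] ~ Poi(A_j z(t_k))
matplot( tk, y, pch = 16, xlab = 'Time (yr)', ylab = 'Count' )
lines( tk, z, lwd = 2 )        # expected count for a 1-ha plot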
Because computation requires discretizing, continuously structured problems are often discretized in space, time, or both. I can solve eqn (1.1) for arbitrary values of time \(t\). However, a differential equation that cannot be solved can be approximated for discrete time increments as a difference equation. Over a discrete time interval \(\Delta_k\), the derivative \(dz/dt\) can be approximated by the difference quotient \(\left( z(t + \Delta_k) - z(t) \right)/\Delta_k\). Discretization can introduce instabilities in non-linear models and requires careful thought on issues of rates and time scale (e.g., \(\Delta_k\)).
We will consider how environmental processes and observations can often be summarized hierarchically. Hierarchies can be a natural way to organize data and models. A submodel can be specified for a process, while another submodel describes the connections between state variables and data. Hierarchical models use structure, which enters as distributions. In the next section I discuss why model analyses can be hierarchical–some parameters generate a biophysical process (process A affects a process or state variable B), while other parameters generate the data (B can only be partially observed).
In Bayesian analysis, components are specified by probability distributions that contain parameters. If I write the “probability of \(y\) given parameters \(\theta\)”, then I have to know or assume a value for \(\theta\) in order to construct a distribution for \(y\). Because \(y\) is described by a distribution of values (rather than a specific value), its value is said to be “unknown”. The unknowns in a model are the variables that will be estimated. I can use the bracket notation for a distribution,
\[[\mbox{unknowns | knowns}] = [y | \theta]\] The vertical bar separates the random or stochastic \(y\) on the left from the fixed parameter \(\theta\) on the right.
The terms random, unknown, and stochastic refer to variables having probability distributions. They can appear to the left of a vertical bar, which indicates a conditional relationship. The terms fixed, known, or deterministic refer to quantities that appear to the right of the vertical bar. A variable that is fixed in one distribution can be random in another. Bayes’ theorem is an example.
This generic notation says nothing about the form of the distribution or the parameters needed to specify it. I use the generic notation when speaking of relationships that do not depend on the specific distribution.
To get specific, a common model for count data is the Poisson distribution mentioned above, where the response is a non-negative integer \(y = 0, 1, \dots\). This response depends on an intensity parameter \(\lambda > 0\). This equation says that \(y\) given \(\lambda\) is a Poisson distribution:
\[[y | \lambda ] = Poi(y | \lambda)\] The left-hand side could refer to any distribution having one parameter, while the right-hand side is specific.
The familiar normal distribution has a continuous response on the real numbers, \(y \in (-\infty, \infty)\) or \(y \in \mathbb{R}\), and two parameters, one corresponding to the mean (and mode) of the distribution, and another for the variance,
\[[y | \mu, \sigma^2 ] = N(y | \mu, \sigma^2)\] In this notation, \(y\) is random (its specific value is unknown), and the parameters are fixed–they must be known in order to define the normal distribution for \(y\).
When more than one variable is unknown there can be a joint distribution. Suppose I have two observations represented by the vector \(\mathbf{y} = [y_1, y_2]\). If the observations are generated independently from the same normal distribution, then there is a joint distribution represented by the product of their (independent) distributions,
\[[y_1, y_2 | \mu, \sigma^2] = N(y_1 | \mu, \sigma^2) \times N(y_2 | \mu, \sigma^2)\] Because I have assumed that they are drawn independently of one another, the probability of both \(y_1\) and \(y_2\) occurring together is equal to the product of observing them separately.
A joint distribution has more than one variable to the left of the bar. Variables are separated by commas, i.e., \([y_1, \dots, y_n]\), or they are represented by a vector \([\mathbf{y}]\) holding the values \([y_1, \dots, y_n]\).
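A quick numerical check of the independence assumption in R (observation and parameter values are assumed, chosen only for illustration): the joint density of two independent observations equals the product of their individual densities.

mu <- 0; sigma2 <- 1                 # assumed parameter values
y  <- c(-0.3, 1.2)                   # two hypothetical observations
dnorm(y[1], mu, sqrt(sigma2))*dnorm(y[2], mu, sqrt(sigma2))   # product of the two densities
prod( dnorm(y, mu, sqrt(sigma2)) )                            # same value, as a product over the vector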
The basic elements of a Bayesian analysis are prior, likelihood, and posterior. Bayesian analysis specifies these elements as probability distributions.
In data analysis, we commonly speak of responses that are observed and parameters to be estimated (they are “unknown”). The prior distribution describes what is known about the parameters from external information. The likelihood describes how the observations are generated from parameters. Because data are unknown before they are observed, and they are generated by parameters, the likelihood is a distribution conditional on parameters. The posterior distribution is the inverse of the likelihood: it is the probability for parameters after data are observed. Before saying more about the Bayesian model specifically, I introduce the concept of hierarchical modeling, which can be viewed as a general framework for Bayes.
Hierarchical models organize distributions for prior, likelihood, and posterior into levels or stages. These are the knowns (data and priors) and the unknowns (latent processes and parameter values). It is natural to think of inference this way:
\[[\mbox{unknowns | knowns}] = [\mbox{process, parameters | data, priors}]\]
In the generic notation for a distribution, \([A]\) is the probability or density of event \(A\), \([A, B]\) is the joint probability of \(A\) and \(B\), and \([A|B]\) is the conditional probability of \(A\) given \(B\). Again, anything to the left of the vertical bar is random. Anything to the right is fixed. If there is no vertical bar, everything is random. The Appendix summarizes notation.
This structure from Mark Berliner (example here) comes naturally to a Bayesian. It can be unwieldy to the non-Bayesian and can involve improper distributions (e.g., when omitting prior distributions). A hierarchical model typically breaks this down as
\[[\mbox{process, parameters | data, priors}] \propto [\mbox{data | process, parameters}] \times [\mbox{process | parameters}] \times [\mbox{parameters | priors}]\]
The left hand side is the joint distribution of unknowns to be estimated; it is the posterior distribution. This hierarchical structure is amenable to simulation, using Markov Chain Monte Carlo (MCMC). Gibbs sampling is a common MCMC technique that exploits the fact that the posterior can be factored into smaller pieces that can be dealt with one at a time, as simpler conditional distributions. We’ll get back to this.
The elements of the hierarchical model relate directly to Bayes’ theorem. To simplify, consider a joint distribution of data \(D\) and parameters \(P\), which can be factored in two ways,
\[ [D, P] = [D | P][P] = [P | D] [D] \] Here I have factored the joint distribution into two equivalent products of a conditional and a marginal distribution (\([P]\) or \([D]\)).
Setting the two expressions on the right-hand side equal to one another and rearranging, I obtain Bayes’ theorem,
\[ [P | D] = \frac{[D | P] [P]}{[D]} \] which can be read as “the probability of parameters given data”.
Exercise: I am tested for covid and am diagnosed as not-infected. Use Bayes’ theorem to write down the probability, in generic form, that I have covid.
I would like to estimate the mean \(\mu\) of a set of \(n\) measurements \(y_i, i = 1, \dots, n\), using a Bayesian analysis assuming that observations are independent and normally distributed. The variance in observations \(\sigma^2\) is known from previous studies (e.g., instrument error).
I can think about this problem as a graph:
A graph for the model having known observations \(y_1, y_2\), a known parameter \(\sigma^2\), known prior parameter values \(m\) and \(M\), and unknown \(\mu\) that I want to estimate.
The likelihood for the observations \(\mathbf{y}\) is the conditional distribution \([D | P]\). For this normal distribution I write it as
\[[D|P] = [\mathbf{y} | \mu, \sigma^2] = \prod_i^n N(y_i | \mu, \sigma^2)\] I am assuming the observations to be independent and identically distributed. They are independent, because I am assuming that the joint distribution \([y_1, \dots, y_n]\) is the same as the product of probabilities taken independently,
\[\begin{align} [y_1, \dots, y_n] &= [y_1] \times \dots \times [y_n] \\ &= \prod_{i=1}^n[y_i] \end{align}\] They are identically distributed, because I assume that each has the same parameter values for \(\mu, \sigma^2\).
Now from the last section I have Bayes’ theorem,
\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I see that I also need a prior distribution for the mean, which I also take to be a normal distribution,
\[ [P] = N(\mu | m, M) \] This says that I start with a distribution for \(\mu\), before observing data, consistent with my belief that it is centered on a mean value \(m\). If I have strong prior belief in the value \(m\), then the prior variance \(M\) is small. If I have no idea, then \(M\) is big. We’ll see that ‘small’ and ‘big’ in this context is relative to information that will come from data.
For the moment I overlook the factor \([D]\) in the denominator of Bayes theorem, because I am interested in \(\mu\), which does not appear in \([D]\). So for now, I write
\[ \begin{align} [P|D] = [\mu | \mathbf{y}] & \propto [D|P][P] \\ &=\prod_i^n N(y_i | \mu, \sigma^2) \times N(\mu | m, M) \end{align} \] For reasons we will see later, this posterior distribution is also normal,
\[ [\mu | \mathbf{y}, \sigma^2] = N \left( vV, V \right) \] where the parameters are given by
\[ \begin{align} v &= \frac{n }{\sigma^2}\bar{y} + \frac{m}{M} \\ V &= \left( \frac{n }{\sigma^2} + \frac{1}{M} \right)^{-1} \end{align} \] You might recognize this to be a weighted average of the data, with mean \(\bar{y}\), and prior, with mean \(m\). The weights for observations (first term) and the prior (second term) both include a variance in the denominator. So the bigger the variance, the smaller the weight contributed by each term. The weight for the observations includes \(n\) in the numerator. So the larger the sample size, the greater the weight of the data. In fact, as \(n\) gets large (many observations) and \(M\) gets large (uncertain prior mean), the data dominate the prior. In this case, the posterior mean converges to \(vV \rightarrow \bar{y}\) and the posterior variance to \(V \rightarrow \sigma^2/n\). These are the traditional estimates of the sample mean and its standard error, i.e., the square root of the posterior variance \(\rightarrow \sigma/\sqrt{n}\). Here is a plot of prior and posterior:
n <- 5 # specify sample size and parameter values
mu <- 1.8 # true, but unknown value
m <- 0
M <- 1
sigma <- 2 # fixed standard deviation
mseq <- m + seq(-5, 5, length=100) # sequence of mean values centered on prior mean
y <- rnorm(n, mu, sigma) # draw a random sample
my <- mean(y)
postVr <- 1/(n/sigma^2 + 1/M) # terms for posterior distribution
postSd <- sqrt( postVr )
postMu <- (my*n/sigma^2 + m/M)*postVr
prior <- dnorm( mseq, m, sqrt(M) ) # evaluate prior and posterior densities
posterior <- dnorm( mseq, postMu, postSd )
plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
ylab = 'Density (1/mu)', col = 'brown' )
lines( mseq, prior, col = 'grey', lwd = 2 )
abline( v = my, lty = 2 ) # data mean
segments( m, 0, m, dnorm(m, m, sd = sqrt(M) ), col = 'grey' ) # prior mean
segments( postMu, 0, postMu, dnorm(postMu, postMu, postSd ), col = 'brown' ) # post mean
Note how the posterior mean (brown) lies between the prior mean (grey) and the data mean \(\bar{y}\) (dashed line), which was 1.57 for this random draw.
The uncertainty in a parameter estimate is often summarized by a credible interval defined as values that bound a proportion of the posterior distribution. Credible intervals are often defined for 95%, i.e., at probabilities bounded by quantiles \((0.025, 0.975)\).
The 95% credible interval for \(\mu\) is obtained from the posterior distribution, which in this case is normal (“Gaussian”),
ci <- qnorm( c(0.025, 0.975), postMu, postSd ) # quantile values
plot( mseq, posterior, lwd = 2, type = 'l', xlab='Estimate of mu',
ylab = 'Density (1/mu)' )
segments( ci, 0, ci, dnorm(ci, postMu, sd = postSd ), lty=2 )
This normal distribution has 2.5% in each of the two tails, bounded by the two values in the vector ci, which were (-0.435, 2.18) for this random draw.
Exercise: A weighted average for two values of a variable \(x\) is \(\bar{x} = (w_1x_1 + w_2x_2)/(w_1 + w_2)\), where \(w_i\) is the weight for the \(i^{th}\) value of \(x\). In the posterior distribution for the mean estimate \(N \left( \mu | vV, V \right)\), identify \(x_1, x_2, w_1\) and \(w_2\) in the equations for \(v\) and \(V\). What is the effect of increasing \(n\) on the posterior mean and variance in the above example and why?
Exercise: If the data are noisy (large \(\sigma^2\)), is it still possible to obtain a precise estimate of \(\mu\)? Why or why not?
If Bayes were always as simple as this problem suggests it would have captured much of the modeling space decades ago. I previously skipped over a sticky detail. Returning to Bayes’ theorem,
\[ [P | D] = \frac{[D | P] [P]}{[D]} \] I discussed everything but the denominator, a marginal distribution of the data, which is the source of aggravation in Bayesian analysis. For this expression to be true, this distribution must involve integration,
\[ [D] = \int [D | P][P] dP \] Recall that multiplication is easy, and division is hard. In calculus, the tables are turned. The calculus counterpart of division is differentiation, and the counterpart to multiplication is integration. However, differentiation is much easier than integration. This integral is known for the normal likelihood \(\times\) normal prior example above, but it is not known for most likelihood-prior combinations.
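For the normal-normal case this integral can be checked numerically. The sketch below (with arbitrary values assumed for \(y\), \(\sigma^2\), \(m\), \(M\)) integrates likelihood \(\times\) prior for a single observation and compares it with the known closed form for this conjugate case, \(N(y | m, \sigma^2 + M)\).

y <- 1.3; sigma2 <- 2; m <- 0; M <- 1                  # assumed values
integrate( function(mu) dnorm(y, mu, sqrt(sigma2))*dnorm(mu, m, sqrt(M)),
           lower = -Inf, upper = Inf )$value           # numerical value of [D]
dnorm( y, m, sqrt(sigma2 + M) )                        # analytic [D] for the normal-normal case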
Posterior simulation techniques like Markov chain Monte Carlo (MCMC) offered ways to approximate this integral and more. Beyond resolving the integration problem, Gibbs sampling (Gelfand and Smith 1990) allowed for hierarchical expansion of models. After gaining familiarity with simple Bayes in the first few chapters, we’ll devote most of the semester to hierarchical problems.
The Bayes factor (BF) is used to evaluate models relative to one another. For each model we need the marginal likelihood of the data given that model, \([D | M]\); combined with a prior probability for the model, \([M]\), it yields the posterior model probability \([M | D]\). The Bayes factor for two models is the ratio of their marginal likelihoods,
\[ BF = \frac{[D | M_1]}{[D | M_2]} \] and the posterior odds are the Bayes factor times the prior odds, \(\frac{[M_1 | D]}{[M_2 | D]} = BF \times \frac{[M_1]}{[M_2]}\).
A BF value much greater than 1 finds support for model 1 and vice versa. Obtaining the marginal likelihood \([D | M]\) requires evaluating an integral. We discuss this further later in the semester.
A regression model contains a response, predictors, parameters, and residual (‘unexplained’) variation. The predictors and parameters explain part of the response variation. The residual variation is left over once we have fitted the parameters. Least squares minimizes error, maximum likelihood finds parameters that make the data most likely, and Bayes evaluates a probability distribution of parameters.
This is a good time to read this section of the appendix. Here again, the subscript \(i = 1, \dots, n\) indicates the \(n\) observations of a response \(y_i\) with corresponding predictors \(\mathbf{x}_i\). The index \(i\) is exchangeable, meaning that the order of observations doesn’t matter. The responses make up a length-\(n\) vector \(\mathbf{y} = (y_1, \dots, y_n)\). The predictors make up a \(n \times p\) matrix \(\mathbf{X}\), where \(p\) is the number of parameters to be estimated in a vector \(\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'\).
A traditional regression model looks like this:
\[ \begin{aligned} y_i &= \mu_i + \epsilon_i \\ \mu_i &= \beta_1 + \beta_2 x_{i,2} + \dots + \beta_p x_{i,p} = \mathbf{x}'_i \boldsymbol{\beta} \end{aligned} \] (again, notation is summarized in an Appendix.) There is an observed response \(y_i\) and predictors in a length-\(p\) design vector \(\mathbf{x}_i\). Here is an example of a design vector for an intercept and two slope variables,
\[ \mathbf{x}_i = (1, x_{i,2}, x_{i,3})' \] The first element of the design vector is \(x_{i,1} = 1\), corresponding to the intercept for the model. In this notation, \(x_{i,2}\) is the predictor that will be multiplied by the first slope coefficient, and \(x_{i,3}\) is the predictor that will be multiplied by the second slope coefficient. The error has expectation \(E[\epsilon] = 0\) and variance \(Var[\epsilon] = \sigma^2\). This is called a linear model, because it is a linear function of the parameters \(\boldsymbol{\beta}\).
Typically, I want to estimate the parameters in the model. To do this, I have to decide on a criterion that constitutes a ‘good fit’. The simplest assumption could be to find parameter values that minimize the left over (‘residual’) variation. Without specifying a distribution for \(\epsilon\), I could minimize the expected value (\(E[\cdot]\)) of the squared error,
\[\min_{\boldsymbol{\beta}} E[\epsilon^2] = \min_{\boldsymbol{\beta}} E\left[ \sum_i^n (y_i - \mathbf{x}'_i \boldsymbol{\beta})^2 \right]\]
with respect to parameters in the vector \(\boldsymbol{\beta}\). Within the brackets I am summing the squared difference between the observed \(y_i\) and the value that would be predicted by the model, \(\mathbf{x}'_i\boldsymbol{\beta}\). By taking derivatives with respect to \(\boldsymbol{\beta}\), setting them equal to zero, and solving for \(\boldsymbol{\beta}\), I obtain the least-squares estimates \(\mathbf{\hat{\boldsymbol{\beta}}}\). This is the method of least-squares.
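Here is a minimal R sketch of least squares on simulated data (the ‘true’ coefficients and error variance are arbitrary choices). Solving the normal equations, \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\), gives the same estimates as lm().

set.seed(2)
n    <- 100
x2   <- runif(n); x3 <- runif(n)
X    <- cbind(1, x2, x3)                         # n x p design matrix
beta <- c(1, 2, -1)                              # 'true' coefficients, assumed
y    <- drop( X %*% beta + rnorm(n, 0, .5) )     # simulated response
bHat <- solve( crossprod(X), crossprod(X, y) )   # (X'X)^{-1} X'y
cbind( bHat, coef( lm(y ~ x2 + x3) ) )           # same estimates either way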
Rather than minimizing residual error, maximum likelihood looks for a set of parameters that make the observations most likely given the model. I say most likely rather than most probable, because the parameters are not specified with a probability distribution (that would be Bayesian). The likelihood is a probability model for the observations, not for the parameters.
If I specify a normal distribution for errors, \(\epsilon \sim N(0, \sigma^2)\), I can write a likelihood function for the observations. Here again, \(N\) indicates a normal distribution, with mean zero and variance \(\sigma^2\). Due to properties of the normal distribution, if \(\epsilon \sim N(0, \sigma^2)\) and \(y = \mu + \epsilon\), then \(y \sim N(\mu, \sigma^2)\), where parameters are \(\mu\) (mean) and \(\sigma^2\) (variance).
Because the regression model assumes independent observations, I can write the likelihood function as a simple product over the observations:
\[L(\boldsymbol{\beta}, \sigma^2) = \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] Once again, the independence assumption means that the probability of all \(n\) observations is equal to the product of the probabilities of each individual observation,
\[\prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) = N(y_1 | \mathbf{x}'_1 \boldsymbol{\beta}, \sigma^2) \times \dots \times N(y_n |\mathbf{x}'_n \boldsymbol{\beta},\sigma^2)\]
The maximum likelihood estimate comes from maximizing the likelihood function with respect to parameters. I obtain the same estimates as I did with least squares. This is true if \(\epsilon\) is normally distributed. For least squares and this maximum likelihood approach, the likelihood is treated as a function of the parameters. Again, the parameters are not defined probabilistically.
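The same estimates can be obtained by numerically minimizing the negative log likelihood. This sketch reuses X and y from the least-squares example above and is only illustrative.

negLogLik <- function(p){                        # p holds (beta_1, beta_2, beta_3, log sigma)
  -sum( dnorm(y, X %*% p[1:3], exp(p[4]), log = TRUE) )
}
fit <- optim( c(0, 0, 0, 0), negLogLik )         # numerical maximization of the likelihood
fit$par[1:3]                                     # essentially the least-squares estimates
exp( fit$par[4] )                                # ML estimate of sigma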
The Bayesian view differs from traditional approaches by introducing a prior distribution that I now consider in the context of the regression model. The prior distribution has two important functions:
It allows for information outside the data at hand.
It allows us to derive a posterior distribution, i.e., a probability distribution for parameters.
Recall Bayes’ theorem:
\[ [P | D] = \frac{[D | P] [P]}{[D]} \] where \(D\) represented data, and \(P\) represented parameters. For the regression model the parameters are \(P = (\boldsymbol{\beta}, \sigma^2)\), and the data are \(D = (\mathbf{y}, \mathbf{X})\). Although both predictors \(\mathbf{X}\) and responses \(\mathbf{y}\) are observed, the regression model assumes that \(\mathbf{X}\) is part of a fixed design–it is not random. For this reason, we do not write a distribution for \(\mathbf{X}\), but rather we ‘condition’ on the fixed values in \(\mathbf{X}\). Making these substitutions, Bayes’ theorem becomes
\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}] = \frac{[\mathbf{y} | \boldsymbol{\beta}, \sigma^2, \mathbf{X}] [\boldsymbol{\beta}, \sigma^2]}{[\mathbf{y}]} \]
Again, the posterior distribution of parameters is on the LHS. The denominator is the marginal distribution of the data, \([\mathbf{y}]\). As in the example for the simple mean, I’ll ignore it for now.
As before, the left-hand side is “parameters given data”, or the posterior distribution, and the right-hand side has likelihood \(\times\) prior in the numerator. In Bayesian analysis the likelihood is treated not as a function to be optimized, but rather as a distribution for the data that must be combined with the prior to yield a distribution for parameters (the posterior). The second factor on the right indicates the as-yet unspecified prior distribution for parameters, this time for the regression coefficients. Technical details follow later; for now I focus on general concepts.
Substituting for the likelihood and leaving the prior distribution unspecified (for now), I have
\[ [\boldsymbol{\beta}, \sigma^2 | \mathbf{y}, \mathbf{X}] \propto \prod_i^n N(y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2) [\boldsymbol{\beta}, \sigma^2] \]
If the prior distribution is flat, the posterior distribution for \(\boldsymbol{\beta}\) will have the same shape as the likelihood. Thus, parameter estimates will have the same marginal means and standard errors as I obtained by maximum likelihood. The impact of the prior distribution becomes important when it is made informative and as models increase in size.
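For the conjugate case with a normal prior on \(\boldsymbol{\beta}\) and \(\sigma^2\) treated as known, the posterior has a closed form. This sketch (continuing the simulated example above, with \(\sigma^2\) fixed at the value used to simulate the data) shows that a vague prior returns essentially the maximum likelihood estimates.

sigma2 <- 0.25                                        # error variance treated as known here
b  <- rep(0, 3); B <- diag(1000, 3)                   # vague prior, beta ~ N(b, B)
V  <- solve( crossprod(X)/sigma2 + solve(B) )         # posterior covariance
bPost <- V %*% ( crossprod(X, y)/sigma2 + solve(B) %*% b )   # posterior mean
cbind( bPost, coef( lm(y ~ x2 + x3) ) )               # nearly identical with a vague prior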
Models can be represented as graphs. Model elements are connected by arrows that establish the relationships between variables that guide model building.
Below are three graphs, each describing a different model. In this diagram, there are boxes, circles, dashed lines, and solid lines. The graph provides a visual interpretation of the model. The traditional regression model is shown in (a). I discuss symbolism in the next sections.
Three variations on the regression model in graphical form. Circles are random–they must be estimated. Squares ‘anchor’ the fit; these observed variables \(X\) and \(Y\) and prior parameter values are known. a) In traditional regression, there are unknown parameters, but they are not given a probability model. b) Simple Bayesian regression has a probability model for parameters. Once response \(Y\) is observed, it is fixed. c) The hierarchical model has an additional stochastic stage.
Known and unknown variables: Variables are nodes in the graph. The known variables are boxes, including data and prior parameter values. They anchor the model fit, because they are fixed. The unknown variables and parameters are circles. In the regression example, these are the regression coefficients, the length-\(p\) vector \(\boldsymbol{\beta}\), and the residual variance \(\sigma^2\).
Stochastic and deterministic variables: Both the response \(y_i\) and the predictors \(\mathbf{x}_i\) are known. However, the predictors are treated as deterministic, whereas the response is treated as stochastic before it is observed. Specifically, \(y_i\) is assigned a distribution, the likelihood. The vector \(\mathbf{x}_i\) is not assigned a distribution. In observational studies it is not always clear what should be a predictor and what should be a response.
Stochastic and deterministic arrows: Solid arrows in the graph indicate stochastic relationships, meaning that knowledge of a parent node does not tell us the value at the child node (or vice versa). Deterministic relationships can often be omitted from the graph. This example will be discussed in chapters that follow. For now, I introduce several more concepts.
In this section, I introduce some concepts that will come up again this semester.
The terms process model and statistical model represent a false dichotomy.
Hierarchical models typically include both elements. They are used for inference and to predict processes and data, with uncertainty. To take a simple recent example, development of spring phenological state could be written as a rate equation,
\[\frac{dh}{dt} = r(t; \mathbf{x}(t)) \left(1 - h(t) \right)\]
for development rate \(r(t)\) that changes over time, in part due to environmental changes contained in \(\mathbf{x}(t)\). Most ecologists would call this a ‘process model’. However, when discretized, written in terms of predictors, and combined with a data model for ordinal observations, it’s used for inference (Clark et al. 2014). By allowing for a biologically reasonable process at one stage and a relevant data model at another, the estimates of the process itself are more accurate than they would be otherwise. Hierarchical modeling has made the distinction between statistical and process models a false dichotomy.
A generative model can be used both forward and backward; it predicts the data that were used to fit it.
Prediction and simulation represent the forward direction: start with parameters and generate data. This means moving from bottom to top in the graphical model that I drew for the regression. Model fitting is the inverse direction, i.e., top to bottom in the model graph. Generative models are important for prediction–can model fitting recover the parameters used to simulate the data?
Many models are not generative. When they are not, evaluation is difficult. In this course prediction plays a large role in model checking, so we focus on generative models.
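A generative check can be as simple as simulating data from known parameter values and asking whether fitting recovers them. Here is a minimal sketch using a Poisson regression; all values are assumed for illustration.

set.seed(3)
ns     <- 500                                     # sample size
x      <- runif(ns)
beta   <- c(0.5, 1.2)                             # 'true' values used to generate data
lambda <- exp( beta[1] + beta[2]*x )              # Poisson intensity, log link
y      <- rpois(ns, lambda)                       # forward direction: parameters -> data
coef( glm(y ~ x, family = poisson) )              # inverse direction: estimates close to beta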
Often models are specified as rate equations (as above), and we need to fit them to data. For a simple example, consider the logistic equation,
\[ \frac{dn}{dt} = rn \left( 1 - n/k \right) \] If we want process error (the process is not exactly logistic), this must be stochasticized. Data are also likely obtained at discrete time points. For model fitting it can be a good idea to start by multiplying both sides by \(dt\):
\[ dn = rn \left( 1 - n/k \right) \times dt \] A discretized version of the model is specified in time increments:
\[ n_{t + dt} = n_t + dn = n_t + r n_t \left( 1 - n_t/k \right) \times dt \] To include process error we add noise:
\[ n_{t + dt} = n_t + r n_t \left( 1 - n_t/k \right) \times dt + \epsilon_t \] Now we just need a distribution for \(\epsilon_t\), and we have a process model with error. The noise term takes up both the model misspecification and error of integration. Non-linear models like this can be unstable, but model fitting is stabilized by the data. This approach works for systems of ODEs and PDEs.
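Here is a minimal simulation of the discretized logistic with additive process error; the values of \(r\), \(k\), \(dt\), and the noise standard deviation are arbitrary, chosen only to show the behavior.

r  <- 0.3; k <- 100; dt <- 0.1                   # assumed parameter values
nt <- 300                                        # number of time steps
n  <- numeric(nt); n[1] <- 5                     # initial abundance
for(t in 1:(nt - 1)){
  n[t + 1] <- n[t] + r*n[t]*(1 - n[t]/k)*dt + rnorm(1, 0, 0.5)   # growth increment + process error
}
plot( (1:nt)*dt, n, type = 'l', xlab = 'Time', ylab = 'n' )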
Design determines the information that data have to offer for model fitting. It lends its name to the design matrix in ANOVA and regression. The common elements of a study design include replication, stratification, and randomization (RSR).
Information can be evaluated in several ways, such as the precision a model provides on parameter estimates or predictions. There is a large literature on ‘optimal design’ and how it benefits from RSR.
Studies can be observational (control is limited to when/where/how samples are obtained) or experimental (variable manipulation). The model for experimental data may be a model for the design (e.g., ANOVA), but it doesn’t have to be. Experiments may be established to evaluate many processes, each analyzed with a different model.
Observational data have an advantage in being available at scales that could not be studied with controlled experiments. Large observational networks, some global, are examples. Weather station data, now combined with satellite imagery, are the basis for modern understanding of ocean and atmospheric circulation. Epidemiological evidence for public health trends is widely used in the social sciences, as are polling data in public policy. Disease surveillance can be active or passive, the latter being easier to get, but harder to use in models. For example, how do we interpret prevalence of a disease in a city where only symptomatic individuals get tested? Relatedly, polls are notoriously hard to interpret due to the fact that methods used to reach potential voters bias the outcomes.
Observational data can be highly unbalanced in space and time; because the distribution of observations usually determines the distribution of predictors, all interpretations have to consider how the distribution of data affects estimates and predictions (Tang et al., 2023). Like FIA, many national forest inventories strive for balanced spatial coverage. However, they are concentrated in recent years, and sample years are misaligned between states. Citizen science platforms like eBird and iNaturalist have become so dense as to cover many habitats and geographic regions. However, both fall off rapidly with time before present, making it hard to estimate trends. The coverage is uneven across habitats and species, the latter depending on behavior, crypsis, and degree of observer interest (Tang et al. 2022, Scher and Clark 2023).
A big concern with observational data is how to attribute cause and effect. As many variables change, interacting with one another as well as with unmeasured variables, there may be no way to identify what’s affecting what. Are people healthy because they exercise, or does being healthy motivate exercise?
It can be hard to even decide what goes on the right versus the left side of the model. Does high mineralization rate promote N-demanding plants and microbes, or vice versa? What do I do when I know that both are true?
Experiments benefit from manipulation and control. Treatment effects might be isolated when other variables are controlled. Control is critical to inference in many settings. However, experiments are limited in scale, often conducted in the laboratory, growth chamber, greenhouse, or small field plot. This scale mismatch means that a variable’s ‘effect’ on the response can be unrealistic. Because many ecological processes play out over multiple scales, experimental results can be difficult to extrapolate outside the experimental frame. Only a fraction of important questions are amenable to experimentation. Many ecologists and social scientists look for ‘natural experiments’.
Model fitting is a balance between data and model.
Many arguments about climate change focus on global mean temperatures, obtained by averaging station data that go back over a century. This seems to be a data-driven debate (“what do the data say?”)–until we confront the fact that the mean of the stations is not the global mean, because stations are not uniform over the globe. Models are used to translate unevenly distributed data to a less biased estimate of regional climate. The PRISM data are an example.
Exercise. What are the elements of a design? What does each contribute?
I previously said that traditional data modeling assumes that observations are i.i.d. (independent and identically distributed). Many samples do not contain independent observations. For example, eBird counts in Sandy Creek, NC and Alexandria, VA might appear to offer independent information (Tang et al. 2022). However, the eBird counts from Sandy Creek, NC last Saturday contain substantial redundancy–each observation does not offer new information if volunteers observe the same birds repeatedly. Likewise, if birds are migrating through an area, observations obtained in different weeks offer different information from observations taken on successive days.
The i.i.d assumption overestimates the information content in observations when those observations depend on one another.
Spatial data may depend on nearby observations. Note that arrows point in both directions.
There is often serial dependence in time-series data.
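A minimal sketch of serial dependence: an AR(1) series in which each value depends on the previous one through a correlation \(\rho\) (the value of \(\rho\) is assumed). Nearby observations are clearly not independent.

set.seed(4)
nt  <- 200
rho <- 0.8                                       # assumed serial correlation
w   <- rnorm(nt)
x   <- numeric(nt); x[1] <- w[1]
for(t in 2:nt) x[t] <- rho*x[t - 1] + w[t]       # each value depends on the previous one
acf(x)                                           # autocorrelation decays slowly with lag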
Building dependence into traditional models is hard. Hierarchical modeling solves this problem, because dependence can come simply from adding a stage, as shown in the graphical model for regression.
Multiple testing refers to interpretations that rely on more than one hypothesis test.
I might want to know if the abundances of microbial taxa decline with salinity in estuaries, so I fit regressions to concentrations of each of thousands of OTUs (operational taxonomic units). I expect to reject the null at \(\alpha < 0.05\) for 5% of tests where the null hypothesis is true. A traditional Bonferroni correction divides the \(\alpha\)-value threshold by the number of tests based on the assumption that tests are independent. By building models hierarchically, Bayesian analysis sidesteps this problem–the dependence between ‘tests’ is built into the model.
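A quick simulation of the multiple-testing problem: when the null hypothesis is true for every one of many tests, about 5% are still ‘significant’ at \(\alpha = 0.05\), while the Bonferroni threshold rejects almost none (the number of tests and sample size are arbitrary).

set.seed(5)
ntest <- 1000
pvals <- replicate( ntest, t.test( rnorm(20) )$p.value )   # null is true in every test
mean( pvals < 0.05 )                                       # roughly 0.05 rejected
mean( pvals < 0.05/ntest )                                 # Bonferroni: essentially none rejected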
The ecological fallacy and Simpson’s paradox refer to the fact that inference at one scale or level of organization may not represent relationships at another.
It’s not hard to find examples where individual and group behavior appear to show different relationships. In public policy, we now know that rich states vote Democratic, while rich people vote Republican (Gelman 2009). In the last few decades, mean household income has increased, while the income of the typical (median) household has declined. Ecological communities (species distributions and abundances) are poorly predicted by aggregating predictions for individual species (Clark et al. 2011). These are examples of ‘Simpson’s Paradox’.
Conversely, models of total numbers of species do not predict extinction threat, which operates at the scale of each individual species, an example of the ‘ecological fallacy’. The ecological fallacy refers to the fact that group behavior does not predict individuals within groups (rich people do not vote like rich states, a species’ extinction risk cannot be interpreted from variation in number of species). Gerrymandered voting precincts exploit this fact.
Models can be referenced by such terms as traditional, algorithmic, machine learning/artificial intelligence (ML/AI), and Bayesian. A model might be described by more than one (e.g., modern Bayes and AI are highly algorithmic).
Recent decades have witnessed a revolution in the volumes of data, the rates at which they accumulate, and the tools available to process and learn from them. As computation moved from the domain of a few experts to society at large, data analysis has expanded on several fronts. Bayesian analysis permeated the natural and social sciences within two decades of the benchmark 1990 paper of Gelfand and Smith, which applied a method called Gibbs sampling as a single framework for a wide range of high-dimensional problems (analyses involving large numbers of observations, variables, or both). Meanwhile, machine learning methods increasingly found ways to improve and automate pattern recognition and generate fast predictions.
Types of data have likewise transformed from the traditional small experiment with a few observations and variables to remote sensing, monitoring networks, and citizen science. Not only are the volumes of data new, but so is the capacity to place them in the context of space, time, and other variables.
The current challenges concern how to learn from large and heterogeneous information, with valid uncertainty estimates.
Recent decades have seen a shift in data types (more networks and citizen science) and methods (hierarchical Bayes and machine learning). Unfamiliar terms in this figure will come up later in the text.
I’ll mention some highlights of the rocky transition from simple applications of Bayes’ theorem to hierarchical Bayes. By the mid-20th century Bayesian analysis was embraced, fostered, and promoted by statisticians who wanted methods that were axiomatic. Unlike other branches of mathematics, statistical methods did not follow from a set of foundational axioms, i.e., accepted principles. In the absence of a unifying framework, statisticians and consumers of their methods could not agree on concepts important for interpretation, like confidence intervals and predictive intervals. This interview with Dennis Lindley includes his first-hand recollections and contributions to the Bayesian revolution that gained steam in the mid 20th century. There are many review papers and excellent books on the history of Bayesian analysis.
As Lindley points out, the early books written about Bayesian statistics had titles like “Theory of Probability”, reflecting the view that these developments were really about a coherent (axiomatic) framework that could be defended on basic principles. A fundamental departure from what is now sometimes called ‘traditional’ or ‘conventional’ statistics is the prior distribution, which, in turn, emphasizes a conceptual challenge with the notion of “probability” itself. Subjective probability refers to the fact that you and I often assign different probabilities to the same events. Probabilities are not “real”, but rather describe degree of belief. For example, the statement “20% chance of rain tomorrow” is a belief, but it can never be a reality. A frequentist idea that I might repeat “tomorrow” thousands of times and count those with rain is not only impossible; it could never be more than a concept. If I could precisely repeat tomorrow, it would rain every time or not at all. Beyond conceptual differences on what probability means, the early ‘Bayesians’ could agree that subjective belief can be coherent if it follows the rules of probability. Coherent inference requires a prior belief that is updated by data.
Broad dissemination of Bayesian methods required computational developments, especially Markov chain Monte Carlo (MCMC), which traces its roots at least to the Manhattan Project. Inroads into statistics during the 1980s presaged the benchmark 1990 publication of Gelfand and Smith’s application of Gibbs sampling, highlighting its widespread potential. Robert and Casella use the term ‘epiphany’ for its implementation in Bayesian analysis. Gibbs sampling is a technique now extensively used to estimate a posterior distribution. Essentially one framework could admit nearly all high-dimensional problems. It contributed to the modern potential of ‘big data’, particularly where estimates of uncertainty are needed. In modern Bayes, modeling and computation are intimately connected.
Practitioners from many disciplines find Bayes attractive, because probability theory provides a conceptual foundation that all of us can apply to notions like uncertainty. Engineers, computer scientists, social scientists, and natural scientists may not agree on or appreciate the myriad of “tests” that make up the practice of traditional statistics. There can be a lot of confusion about what a P value means and whether or not it is even a valid way to interpret evidence. Probability theory is not burdened by the same confusion–the laws of probability offer a common language.
Some ecologists express concerns about the use of prior distributions. Lele and Dennis worry that it could be used to ‘cook’ the results of a study. However, many others find unconvincing the complaint that Bayes is especially susceptible to ethical transgression. We mostly have to trust one another that data are authentic. By contrast, the prior (like the rest of the model) is transparent. Cooking the evidence with a warped prior is as dumb as robbing banks while wearing a name tag. Indeed, a number of recent efforts to falsify scientific evidence have involved changing the data, not the prior distributions.
Part of the confusion comes with a belief that an informative prior facilitates cheating to “improve the fit”, which is the opposite of the truth. The best fit to the data has the weakest possible prior. The role of the prior is not to improve the fit; the prior is given weight to assure a balance of information, quite the opposite of improving the fit.
[Still less transparent than data and models is computer code. As computation plays an increasing role in science, finding ways to evaluate analyses of high dimensional problems has become a challenge. Simply making code available may not help at a time of increasing difficulty finding reviewers to even read manuscripts, let alone work through code.]
Bayesian analysis has proliferated in environmental science because it emphasizes simplicity, transparency, and coherency. The simplicity I speak of here refers to the fact that Bayes’ theorem leads to a posterior distribution for many types of problems, where a classical approach would require a multitude of different tests, often triggered by small changes in assumptions. The advantages of Bayes increase with the size of the model. As the number of variables and types of data increase, the hierarchical structure built on conditional distributions can provide a way forward where other methods bog down.
Where does Bayesian analysis fit in the era of expanding artificial intelligence? The desire to rapidly extract pattern from big data has led to many important contributions of machine learning and artificial intelligence. I would not want to run MCMC on my phone to anticipate freeway traffic ahead. The uncertainty estimates that a Bayesian analysis might offer would only clutter the picture I need when attempting to navigate traffic. I do not see a need for Bayesian analysis here.
Another example from Jacob Clark of Tesla: consider AI used to drive cars. The information comes from multiple sensors, including cameras, and a driver’s responses. A neural network holds these responses in the form of a large number of nodes (or neurons) and the connections between. Something like the nervous system. As the driver responds to multiple inputs, nodes are added, and the connections between them are modified. The fitted model represents learning about how the driver responds to the many situations that occur in the input data stream. Learning would be different with another set of inputs (routes, parking lots, encounters with public transportation, bicycles, pedestrians, emergency vehicles, …) and a different driver.
“Fixing” the code does not involve changing lines in a script. The fitted neural network is not readable. Instead, developers study problem footage and advise drivers to focus their training sessions where similar difficult situations could arise. The model may not expose new insights on how the process works (how the vehicle operates, how the driver thinks, …), but that may not be the goal.
Unlike the foregoing example, when I do science, I want to learn about a process, and I need uncertainty that can be expressed in a general way. There are already many productive applications of AI, notably pattern recognition and prediction. However, in most environmental applications, a point estimate is of little use without an understanding of how much better it is than other values. Hierarchical modeling lets me describe how a process might work, estimate the parameters I need to understand its importance, and assign confidence to the result. Bayesian computation is a powerful vehicle for this analysis.
There are additional issues to think about in the big data era, some of which arise in this course. Big data have amplified issues with confirmation bias. P-hacking is one of many terms applied to the growing practice of re-analyzing data sets until significant results emerge, often to confirm pre-existing biases. An overfitted model is one that predicts the data well simply due to its complexity, which makes it capable of fitting essentially any scatter well. One guard against this practice is out-of-sample prediction. Michael Jordan’s reading for this section points out the fact that, with enough predictors, you can predict anything. The “reproducibility crisis” in some fields engages some of these issues, such as P-hacking.
Am I a Bayesian? Many who make use of Bayesian models and software would not necessarily see themselves as particularly Bayesian. Models and software may be selected for convenience or for other considerations that have little to do with philosophy. Avowed non-Bayesians may exploit MCMC methods, while making a point of omitting priors. Staunch Bayesians may sometimes omit priors where they do not affect results or interpretation, but include them elsewhere. Users of packages like INLA may not know that it uses prior distributions (it’s Bayesian). Many big data applications bridge machine-learning and statistical worlds, exploiting approximations of many types. Bayes has expanded rapidly in large part for pragmatic reasons of simplicity and capacity to address large problems.
Bayesian analysis combines a prior distribution with a data model, or likelihood, to arrive at a posterior distribution of unknown parameters and latent variables. Deterministic variables are constant, whereas stochastic variables are defined by distributions. Known variables are constant, but they can be stochastic and, thus, can be defined by distributions, such as the response \(Y\) in the regression model. Graphs are used to communicate model structure and organize algorithms. They provide a visual representation of hierarchical models. A common hierarchical structure is [data|process, parameters][process|parameters][parameters].
Basic regression models provide an opportunity to introduce similarities and differences. Least-squares, maximum likelihood, and Bayes share some common features at this simplistic level, but diverge as models increase in size.
Both experimental and observational data are analyzed in environmental sciences. They differ in design, control, and, often, scale. Fundamental design considerations include randomization, stratification, and replication.
Model fitting and prediction are ‘backward’ and ‘forward’ views. Generative models do both and, thus, can be fully evaluated (in both directions).
In this course, Bayesian modeling and computation are presented together as means for describing a process, while allowing for how the data are obtained.
Here are some notation conventions used in these vignettes.
| notation | example | interpretation |
|---|---|---|
| italic | \(x\), \(X\) | scalar quantity, known |
| greek | \(\alpha, \beta, \dots\) | stochastic (fitted) variable, unknown |
| parentheses | \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) | parametric function/density |
| curly brackets | \(\{0, 1, \dots\}\) | a set of objects |
| closed interval | \((-\infty,0]\), \([0, \infty)\) | the endpoint zero is included |
| open interval | \((-\infty,0)\), \((0, \infty)\) | the endpoint zero is excluded |
| distributed as | \(x \sim N(\mu, \sigma^2)\) | distribution or density |
| expectation | \(E[\epsilon] = 0\) | expected value of a variable |
| variance | \(Var[\epsilon] = \sigma^2\) | variance of a variable |
| bracket, distribution | \([A, B, C]\) | unspecified density or distribution |
| proportional | \(f(x) \propto g(x)\) | differ by a constant, \(f(x) = c g(x)\) |
| approximate | \(\pi \approx 3.14159\) | approximately equal |
| real numbers | \(\mathbb{R} = (-\infty, \infty)\) | note: also positive real \(\mathbb{R}_+\) |
| is an element of | \(\pi \in \mathbb{R}\) | \(\pi\) is a real number |
| subset | \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) | in the set |
| proper subset | \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) | cannot include all elements |
| union | \(a \cup b\) | either \(a\) or \(b\) |
| intersection | \(a \cap b\) | both \(a\) and \(b\) |
| sum | \(\sum_{i=1}^n x_i\) | \(x_1 + \dots + x_n\) |
| product | \(\prod_{i=1}^n x_i\) | \(x_1 \times \dots \times x_n\) |
| exponentiate | \(e^x\), \(exp(x)\) | \(e \approx 2.71828\), inverse of \(ln(x)\) |
| natural logarithm | \(ln(x)\) | inverse of \(e^x\) |
| notation | example | interpretation |
|---|---|---|
| bold, l.c. | \(\mathbf{x}\), \(\mathbf{x}'\) | column and row vectors, respectively |
| bold, u.c. | \(\mathbf{X}\) | matrix |
| dimensions | \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) | \(n\) rows, \(q\) columns |
| subscript | \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) | element of an array |
| row vector | \(\mathbf{X}_{i\cdot}\) | row \(i\) |
| column vector | \(\mathbf{X}_{\cdot j}\) | column \(j\) |
| transpose | \(\mathbf{X}'\) | rows become columns |
| matrix inverse | \(\mathbf{A}^{-1}\) | solve systems of linear equations |
| identity matrix | \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) | \(p \times p\) with 1’s on the diagonal, zeros elsewhere |
| matrix determinant | \(det\left( \mathbf{A} \right)\) | obtain for a square matrix, e.g., covariance |
| kronecker product | \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) | \(np \times mq\) matrix, defined in text |
These are equivalent: \(exp(x) = e^x\)
\(\log( e^x ) = e^{ \log x } = x\)
\(exp(y_1) exp(y_2) = exp(y_1 + y_2)\)
\(\prod_{i=1}^n exp( y_i ) = exp \left( \sum_{i=1}^n y_i \right)\)
\(log( a/b ) = log(a) - log(b)\)
\(\frac{d}{dx} \left( x^a \right) = a x^{a-1}\)
\(\frac{d}{dx} \left( e^x \right) = e^x\)
\(\frac{d}{dx} \left( e^{f(x)} \right) = \frac{df}{dx} e^{f(x)}\)
The notation \(E(y)\) refers to the expectation of \(y\). For random data it is estimated by the sample mean. We do not simply refer to it as a mean, because it is more general: a probability distribution has an expectation, even if there are no data.
mean: \(E(y) = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\)
variance: \(Var(y) = E(y^2) - E^2(y) =\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^n y^2_i - \bar{y}^2\)
covariance: \(Cov(x,y) = E(xy) - E(x)E(y) = \frac{1}{n} \sum_{i=1}^n x_i y_i - \bar{x} \bar{y}\)
Univariate response models include the following:
| Name | response | additional attributes |
|---|---|---|
| linear (LM) | normal | linear in parameters |
| linear mixed model (LMM) | normal | LM with random effects |
| generalized linear model (GLM) | binomial, Poisson, … | linear in parameters on link scale |
| logistic regression | binomial | GLM includes logit link |
| probit regression | binomial | GLM includes probit link |
| Poisson regression | Poisson | GLM typically with log link |
| mixed GLM (GLMM) | binomial, Poisson, … | GLM with random effects |
Multivariate response models have a vector of responses:
| Name | response | additional attributes |
|---|---|---|
| linear (LM) | MVN | linear in parameters |
| categorical | multinomial | multiple classes, one outcome, MV logit or probit link |
| multinomial | multinomial | multiple classes, multiple outcomes |
| generalized joint attribute model (GJAM) | all types | linear in parameters |
Time series have dependence in time. They can be cast as state-space models, which may or may not be normal for continuous states. The Kalman filter is a simple (normal-normal) state-space model. Hidden Markov model is a term most often applied to discrete states. Autoregressive models are normal and have dependence on \(p\) previous times, e.g., AR(\(p\)).
Spatial models have dependence in space. They can be LM or GLM with \(n = 1\) and an \(m \times m\) covariance matrix \(\Sigma\), where \(m\) is the number of locations. Kriging is a traditional spatial model for continuous space. Spatial autoregressive models are used where space is viewed as discrete blocks (e.g., census tracts, counties, …).