There are deep philosophical questions about what we mean by probability, but it has been defined as the long-term frequency of an event occurring and also as our strength of our belief that an event could have occurred or will occur. Regardless, we might say that probability comes down to counting, either the number of successful events in an infinite series, or the hypothetical ways one unique event might come to pass.
Imagine you want to estimate the proportion of a 1 x 1 m sample plot that is covered by herbaceous vegetation. What do you know before you even start? What is your a priori expectation?
Imagine that you are in a forest – you know that proportional cover will be between zero and one, and probability greater than zero and less than one, but neither extreme would be a great surprise. So, let’s say you have no expectation - all values are equally probable.
To estimate proportion cover, you use a pin frame, in which you drop a steel pin through a frame at eight randomized points. Every time the pin touches vegetation before it hits the ground, you score that 1, and it is otherwise 0. Here are your data.
veg_counts <- c(1, 0, 0, 1, 1, 1, 0, 1 )
Every time you drop a pin and it touches or doesn’t touch vegetation, you learn more about your sample plot (Fig. @ref(fig:plot_p)). You start by assuming any value between zero and 1 is equally likely (upper left subfigure in the figure below. With that as your prior expectation (dotted line, upper middle figure), you collect one datum, and learn that zero is no longer a possibility. Combining your prior expectation with your datum, your new belief is expressed (or should be) as the solid, slanted line in the upper middle figure.
Before you drop your second pin, your new belief is now your prior expectation for your second sample (dotted line, upper right subfigure). Once you drop the second pin, you learn that a proportion of 1 (100%) is no longer a possibility, and your new belief is now expressed with the symmetric hump-shaped solid line in the upper right subfigure.
## State Proportion Belief Datum
## 1 Prior 0.00 NA zeroth
## 2 Prior 0.01 NA zeroth
## 3 Prior 0.02 NA zeroth
## 4 Prior 0.03 NA zeroth
## 5 Prior 0.04 NA zeroth
## 6 Prior 0.05 NA zeroth
## Warning: Removed 101 row(s) containing missing values (geom_path).
Our prior (dotted lines) and posterior (solid lines) bleiefs about p, the proportion of our plot covered with vegetation, as we drop successively more pins on the plot.
Each time you drop a pin, you gain more knowledge about your plot, and your belief becomes more refined and stronger. This is how organisms learn about the world, and is how we do Bayesian statistics.
Details The lines in Fig. @ref(fig:plot_p) are probability densities of the beta distribution. The beta distribution is a continuous probability distribution, just like the normal distribution. Maximum entropy theory tells us the beta distribution is the maximally uncertain and unbiased description of such data, and therefore the least likely to lead us astray.
Most people want to know the probability that their hypothesis (or model) is true, given their data. In 2021, our usual statistical analyses can’t tell us that. These usual analyses are what is referred to as frequentist analysis, often in which we test a null hypothesis. Such an analysis tells us how often our sampling and analysis procedure would give us data and a result as extreme or more extreme than what we saw if we repeated our procedures a huge number of times. It answers the question, how weird are our data if our null hypothesis is true and how weird are data that are even more extreme. When we calculate a frequentist 95% confidence interval, the procedure is designed to describe a region that is likely to include the true mean 95% of the time if we repeated the experiment a huge number of times.
A frequentist analysis does not incorporate any learning. It does take into account any previous experience or knowledge that we learned in the past. It assumes that the the answer could be anything at all. In this sense, it is a very unscientific procedure.
Bayesian analysis is different. A Bayesian analysis incorporates previous knowledge. It assumes that some outcomes are more likely than others, and the scientist can guess at which these are. Because of this, a Bayesian analysis is able to generate the answer we want, the probability that our hypothesis is true.
Bayesian analysis follows the scientific process:
The answer we get is proportional to the summed probabilities of the data when we assume the hypothesis is true (‘likelihood’) times our best guess probability of the hypothesis before we collected the data (‘a prior distribution’ or ‘priors’). We write this as
\[\mathrm{Pr}(H|D) \propto \mathrm{Pr}(D|H)\mathrm{Pr}(H)\] where \(H\) is our hypothesis, and \(D\) is our observed data. These are “proportional”; if we wanted equality, we would divide the right hand side (RHS) by the total probability of the data. When we do a Bayesian analysis on a computer, we typically don’t need it.
Next, realize that each of these probabilities is a distribution rather than a point estimate. So, our answer (the left hand side) is the posterior distribution. The right hand side is the product of the likelihood distribution and the prior distribution. Thus, both sides have uncertainty built into them because they are distributions rather than point estimates.
We write code for these elements in a Bayesian analysis.
Consistent with our desire to know the probability that something is true, we will estimate the probability that the average seedling density is within a some interval. We cannot do that with a frequentist confidence interval. We will do it with a Bayesian credible interval.
Imagine we have the following vector of seedling densities.
# number per sq meter
n <- 7
df <- data.frame(
y = c(5.6, 6.3, 5.2, 2.0, 8.1, 3.8, 4.8),
x = rep('a', length(n))
)
ggplot(data=df, aes(x=x, y=y)) +
geom_dotplot(binaxis='y', stackdir = 'center') +
labs(y="Number of seedlings per sq. m.", x="Sample") + theme_bw()
## Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
The core of a Bayesian analysis is the model. It comprises the priors, the likelihood, and the data. Here we write a model to estimate the mean of a vector \(y\). We assume that y is distributed as a normal random variable, which is characterized by a mean (\(\mu\)) and uncertainty that we characterize with the standard deviation (\(\sigma\)). We write the following, and save it as an R script named my_jags_model.R
## Put this in an R script called 'my_jags_model.R'
model{
#############
### Priors
## my a priori guess of the mean
## mean of five, and a low precision of 0.01.
mu ~ dnorm(5, 0.01)
## standard deviation is between 0.1 and 5
sigma ~ dunif(0.1, 5)
## BUGS and JAGS use 'precision', tau, which is defined as 1/variance
tau <- 1/sigma^2
############
### Likelihood
for(i in 1:N){ # for each obs, from the first to the N-th,
## multiply the probabilities of the data,
## given an estimate of mu and sigma
y[i] ~ dnorm(mu, tau)
}
### If we wanted to, we could embed the data, but we won't.
### It would look like this:
########## Data
# data{
# y <- c(2, 3.7, 8.6, 5.2, 5.1, 3.2, 4.7)
# }
}
Next we
jags.model(..., adapt=1000)).update()).coda.samples()).We will organize our data using lists.
## number of observations
N <- nrow(df)
## All data into one list
myData <- list(y = df$y, N = N)
Here are the numbers of iterations we use in each of the following steps.
n.adapt = 1000
n.update = 10000
n.iter = 10000
Create, initialize starting points, and adapt the model.
### Just for repeatability
set.seed(1)
## We give the procedure a starting point for each parameter
initial_start <- list( mu = 5 )
## Create computer object and adapt its sampling procedures
jagsModel <- jags.model(file="my_jags_model.R", data=myData, inits=initial_start,
n.adapt=n.adapt, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 7
## Unobserved stochastic nodes: 2
## Total graph size: 17
##
## Initializing model
Next we warm up (or ‘burn in’) the model, and finally sample from the posterior distributions of each parameter.
## Warm up
update(jagsModel, n.iter=n.update)
## sample
codaSamples <- coda.samples(jagsModel,
variable.names=c("mu", "sigma"),
n.iter=n.iter, n.thin = 1)
Did it work? The trace is the Markov chain - its guesses. The densities and the posterior distributions of the parameters. It all looks typical for a robust estimate. The chains bounce around the mean, \(\mu\) looks normally distributed, and the variance estimates are greater than zero and right-skewed. Thumbs up!
## plot(codaSamples) # in rjags
MCMCtrace(codaSamples, pdf=FALSE) # in MCMCvis
Can we diagnosis it? Unsurprisingly, there are no problems. This output gives us two iportant number, R-hat (\(\hat{R}\)) and the number of samples from the posterior distribution that are effectively independent from each other (n.eff). The above code created three separate MCMC chains; R-hat is the ratio of the variances between vs. within chains. It should be \(\hat{R} < 1.1\), so \(\hat{R} = 1\) is excellent. N.eff > 10000 is great too - it is overkill for this example.
MCMCsummary(codaSamples)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## mu 5.114001 0.9313431 3.218877 5.113374 6.999635 1 30535
## sigma 2.378079 0.7703463 1.308226 2.216744 4.326868 1 10106
What’s the answer?? The above analysis shows us that we can be 95% sure that the underlying mean is between about 2.6 and 6.7. More on this to come.
Small samples can’t be trusted. When samples include fewer than 30 replicates, we usually describe these with the Student’s t-distribution, which has fatter tails than a normal distribution. We will estimate fatness of tails these tails, so this comes at the cost of an extra parameter. As our sample size increases, than this fatness thins out and the t-distribution is virtually indistinguishable from the normal by the time \(N=30\).
Here is John Kruschke’s approach to a JAGS model of the t-distribution. Using our prior knowledge, we assume that the mean is distributed normally, that the standard deviation is approximated safely with a uniform distribution, and that tail thinness (\(\nu\), nu) is distributed exponentially, with more weight toward smaller values.
{par(mfrow=c(1,3), mar=c(2,2,1,1), mgp=c(1,0,0))
curve(dexp(x, rate=1/30), 0,60, axes=F, ylab="Pr", xlab="nu"); box()
curve(dnorm(x, 5, 1), 3,7, axes=F, ylab="Pr", xlab="mu"); box()
curve(dunif(x, 0.1,10), 0, 10.1, axes=F, ylab="Pr", xlab="sigma", n=1001, ylim=c(0,0.13)); box()
}
Prior distributions for the tail thinness (nu), mean (mu), and standard deviation (sigma) of the t-distribution.
The model is then the following.
## Put this in an R script called 'jags_t1.R'
model{
#############
### Priors
## my a priori guess of the mean
## mean of five, and a low precision of 0.01.
mu ~ dnorm(5, 0.01)
## standard deviation is between 0.1 and 5
sigma ~ dunif(0.1, 5)
## thinness is not. Let nu = 1/n
nu ~ dexp(1/7)
## BUGS and JAGS use 'precision', tau, which is defined as 1/variance
tau <- 1/sigma^2
############
### Likelihood
for(i in 1:N){ # for each obs, from the first to the N-th,
## multiply the probabilities of the data,
## given an estimate of mu and sigma
y[i] ~ dt(mu, tau, nu)
}
### If we wanted to, we could embed the data, but we won't.
### It would look like this:
########## Data
# data{
# y <- c(2, 3.7, 8.6, 5.2, 5.1, 3.2, 4.7)
# }
}
Now we use it.
## We will organize our data using lists.
## number of observations
N <- nrow(df)
## All data into one list
myData <- list(y = df$y, N = N)
n.adapt = 1000
n.update = 10000
n.iter = 10000
### Just for repeatability
set.seed(1)
## We give the procedure a starting point for each parameter
initial_start <- list(
list( mu = 1, sigma=9, nu=1 ),
list( mu = 5, sigma=9, nu=5 ),
list( mu = 10, sigma=9, nu=15 ))
#initial_start <- list( mu = 5, sigma=3, nu=7 )
## Create computer object and adapt its sampling procedures
jagsModel_t <- jags.model(file="jags_t.R", data=myData, inits=initial_start,
n.adapt=n.adapt, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 7
## Unobserved stochastic nodes: 3
## Total graph size: 21
##
## Initializing model
## Warm up
update(jagsModel_t, n.iter=n.update)
## sample
codaSamples_t <- coda.samples(jagsModel_t,
variable.names=c("mu", "sigma", "nu"),
n.iter=n.iter, n.thin = 1)
MCMCtrace(codaSamples_t, pdf=FALSE)
MCMCsummary(codaSamples_t)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## mu 5.141125 0.9365031 3.2300139 5.148315 7.028937 1 14796
## nu 8.643353 7.0519522 1.0996277 6.612103 27.382071 1 6265
## sigma 2.165768 0.9703596 0.8715478 1.976407 4.613030 1 6374
flips <- c(1,1,1,1,0,0,0)
## number of observations
N <- length(flips)
## All data into one list
myData <- list(y = flips, N = N)
n.adapt = 1000
n.update = 10000
n.iter = 10000
### Just for repeatability
set.seed(1)
## We give the procedure a starting point for each parameter
initial_start <- list( theta = 0.5)
## Create computer object and adapt its sampling procedures
jagsModel_b <- jags.model(file="jags_bern.R", data=myData, inits=initial_start,
n.adapt=n.adapt, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 7
## Unobserved stochastic nodes: 1
## Total graph size: 10
##
## Initializing model
## Warm up
update(jagsModel_b, n.iter=n.update)
## sample
codaSamples_b <- coda.samples(jagsModel_b,
variable.names=c("theta"),
n.iter=n.iter, n.thin = 1)
MCMCtrace(codaSamples_b, pdf=FALSE)
MCMCsummary(codaSamples_b)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## theta 0.5551795 0.1559487 0.2478539 0.5603438 0.8408986 1 28813
## We give the procedure a starting point for each parameter
initial_start <- list( b0 = 0.5)
## Create computer object and adapt its sampling procedures
jagsModel_b <- jags.model(file="jags_bern2.R", data=myData, inits=initial_start,
n.adapt=n.adapt, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 7
## Unobserved stochastic nodes: 1
## Total graph size: 15
##
## Initializing model
## Warm up
update(jagsModel_b, n.iter=n.update)
## sample
codaSamples_b <- coda.samples(jagsModel_b,
variable.names=c("b0"),
n.iter=n.iter, n.thin = 1)
MCMCtrace(codaSamples_b, pdf=FALSE)
MCMCsummary(codaSamples_b)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 0.2847718 0.7523531 -1.192674 0.2824394 1.778051 1 18071
flips <- c(1,1,1,1,1,1,0,
1,0,0,0,0,0,0)
## number of observations
N <- length(flips)
x <- c( rep(0,7), rep(1,7))
## All data into one list
myData <- list(y = flips, x=x, N = N)
n.adapt = 1000
n.update = 10000
n.iter = 10000
### Just for repeatability
set.seed(1)
## We give the procedure a starting point for each parameter
initial_start <- list( b0 = 1, b1 = -1)
## Create computer object and adapt its sampling procedures
jagsModel_b <- jags.model(file="jags_bern_2grps.R", data=myData, inits=initial_start,
n.adapt=n.adapt, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 14
## Unobserved stochastic nodes: 2
## Total graph size: 42
##
## Initializing model
## Warm up
update(jagsModel_b, n.iter=n.update)
## sample
codaSamples_b <- coda.samples(jagsModel_b,
variable.names=c("b0", "b1"),
n.iter=n.iter, n.thin = 1)
MCMCtrace(codaSamples_b, pdf=FALSE)
MCMCsummary(codaSamples_b)
## mean sd 2.5% 50% 97.5% Rhat n.eff
## b0 1.108157 0.7865483 -0.3481021 1.076995 2.7479261 1 9189
## b1 -2.499682 1.0818191 -4.7603828 -2.461772 -0.4854842 1 9101
df <- MCMCchains(codaSamples_b)
HPDinterval(as.mcmc(df[,1]-df[,2]))
## lower upper
## var1 0.4666805 6.975999
## attr(,"Probability")
## [1] 0.95