I’m a graduate student at McGill University studying marine ecology, and I have spent much of the past couple of years learning how to build Bayesian models. I am not a statistician by training, so be warned: although I have done my best to provide accurate and useful information (and to direct the reader to more authoritative sources), I am not an “expert.” Also note that this is a living document that I may update in the future.
Feedback is appreciated (constructive, please!). I can be reached at nicole.knight@mail.mcgill.ca
If you haven’t worked with generalized linear models before, you might wonder why we have to use the beta distribution at all: why can’t we just model all data with a friendly neighbourhood normal (Gaussian) distribution? After all, if you could just use a normal linear regression, you wouldn’t have to trawl the internet for blog posts written by graduate students who should really be working on their thesis (if one of my supervisors reads this, please know I’m working on this while I wait for some other code to finish running).
The general motivation for using the beta distribution and other generalized linear models (GLMs) is that the normal distribution carries some assumptions that aren’t true for all datasets. For example, the normal distribution assumes that any value between −∞ and ∞ is possible, that non-integer values are possible, and that your data are symmetrical about the mean (after accounting for predictors). But what if we know that one or more of these assumptions is wrong? There are plenty of examples of non-normal datasets in ecology. Negative values might be impossible in your biomass dataset. You could be working with discrete count data. Perhaps your data are skewed. In all of these cases, you would want to use a generalized linear model, which, in a nutshell, allows you to model data using probability distributions that have different sets of assumptions than the normal distribution.
One situation that commonly occurs in ecology is that you have data that are bound between certain values. For example, when we measure percent cover data, we can’t have values less than 0% or greater than 100% (unless we’re measuring across multiple layers of vegetation). Simpson’s index, which describes the probability that two randomly selected individuals from a sample belong to the same species, can only produce values between 0 and 1. In cases like these, if we use the normal distribution to model these data, the model will generate impossible predictions and won’t do a good job describing the data. Enter the beta distribution!
Figure 2.1: An example of bounded data. Intertidal surveys are often conducted by calculating percent cover, i.e., what percentage of the quadrat is taken up by each species. Photo credit: Kylla Benes.
Note: for an excellent and accessible introduction to generalized linear models, I recommend Chapters 10 and 11 of Statistical Rethinking (McElreath 2020).
The beta distribution is used to describe data that fall between 0 and 1 (non-inclusive), e.g., proportion, probability or percent data. So if we have some random variable bound between 0 and 1, θ, we can describe that random variable using the probability density function:
$$p(\theta \mid a, b) = \mathrm{beta}(\theta \mid a, b) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}$$
In plain(ish) English, a probability density function (PDF) describes the relative likelihood of observing our random variable (θ) within some range of values. For example, you can use this function to calculate the probability that the “true” Simpson’s diversity of your site is between 0.1 and 0.3, or 0.01 and 0.03, or 0.1 and 0.9. You might notice a function on the bottom of our PDF, B(a,b). This is the beta function1, which serves to normalize the distribution, i.e., to make sure the total probability (the area under the curve) integrates to one. We won’t worry about this function, because you don’t really need to understand it to use the beta distribution in practice.
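For instance, here is a minimal sketch of that kind of calculation using R’s built-in pbeta() (the beta cumulative distribution function); the shape parameters are made up purely for illustration:

#-Probability that a beta(2, 5) random variable falls between 0.1 and 0.3-#
pbeta(0.3, shape1 = 2, shape2 = 5) - pbeta(0.1, shape1 = 2, shape2 = 5)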
What we will focus on are the two parameters that control the shape of the PDF, a and b. Together, these parameters describe the mean and spread of the probability mass; neither parameter tells you the mean or the spread on its own. We can see this by building plots using the dbeta() function and the lovely PNWColors package:
dbeta(x, shape1, shape2)
where shape1 is a, and shape2 is b.
library(ggplot2)
library(tidyverse)
library(PNWColors)
x <- seq(0.001, 0.999, by = 0.001) #generate a series of values at which to evaluate the beta distribution
a_values <- seq(1, 5, by = 1)
b_values <- seq(1, 5, by = 1)
beta_df <- expand.grid(a_values, b_values, x)
colnames(beta_df) <- c("a", "b", "x")
beta_df$beta <- dbeta(beta_df$x,
                      beta_df$a,
                      beta_df$b)
beta_df$ID <- paste(beta_df$a, ", ", beta_df$b, sep = "")
beta_df$a_label <- paste("a = ", beta_df$a, sep = "")
beta_df$b_label <- paste("b = ", beta_df$b, sep = "")
colour_scale <- pnw_palette(name = "Starfish", length(unique(beta_df$a_label)))

beta_df %>%
  ggplot(aes(x = x, y = beta)) +
  geom_point(aes(colour = a_label), size = 1) +
  theme_light() +
  scale_colour_manual(values = colour_scale) +
  facet_grid(vars(b_label), vars(a_label)) +
  ylab("Beta(x)") +
  theme(legend.position = "none",
        strip.text = element_text(size = 12, colour = "black"),
        strip.background = element_blank())
Figure 3.1: Beta distributions parameterized with a and b.
Right now, you might be thinking that a and b are annoying, non-intuitive ways to parameterize the beta distribution. Luckily, there is a second way to parameterize the beta distribution, using parameters called μ and ϕ. As mentioned above, the effect of a on the distribution of probability mass depends on the value of b, and vice-versa. So, μ and ϕ combine a and b into two more interpretable quantities: μ is the mean of the beta distribution, and ϕ is a precision (or “concentration”) parameter that controls the spread, with larger values of ϕ corresponding to less variance. μ and ϕ are related to a and b by the following equations:
$$\mu = \frac{a}{a+b}$$
and
$$\phi = a + b$$
Note that μ must be greater than 0 and less than 1 (as the beta distribution cannot accept values ≤ 0 or ≥ 1), and ϕ must be positive (since a and b are both positive; you can’t have a negative precision).
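As a quick numerical check (the values here are arbitrary), a = 2 and b = 3 give μ = 2/(2 + 3) = 0.4 and ϕ = 2 + 3 = 5, and the mean of a large sample of beta(2, 3) draws should land close to 0.4:

a <- 2
b <- 3
a / (a + b)             #-mu = 0.4-#
a + b                   #-phi = 5-#
mean(rbeta(1e5, a, b))  #-should be close to 0.4-#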
Now we can try building these plots again, but describing our probability distributions using μ and ϕ:
#-Generate values of a and b from mu and phi-#
mu <- c(0.1, 0.2, 0.4, 0.6, 0.8)
phi <- c(1, 5, 10, 20, 50)
par_values <- expand.grid(mu, phi)
colnames(par_values) <- c("mu", "phi")
#-We can back-calculate a and b from mu and phi with a little algebra-#
par_values$a <- par_values$mu * par_values$phi
par_values$b <- par_values$phi - par_values$mu * par_values$phi
#-Create labels for the plot-#
par_values$mu_label <- paste("mu = ", par_values$mu, sep = "")
par_values$phi_label <- paste("phi = ", par_values$phi, sep = "")
par_values$ID <- paste(par_values$a, ", ", par_values$b, sep = "")
beta_df2 <- expand.grid(par_values$ID, x)
colnames(beta_df2) <- c("ID", "x")
beta_df2 <- full_join(beta_df2, par_values, by = "ID")
#-Calculate probability estimates-#
beta_df2$beta <- dbeta(beta_df2$x, beta_df2$a, beta_df2$b)
#-Build our plot-#
beta_df2$phi_label <- factor(beta_df2$phi_label,
                             levels = c('phi = 1', 'phi = 5', 'phi = 10', 'phi = 20', 'phi = 50'))
colour_scale2 <- pnw_palette("Sailboat", length(unique(beta_df2$mu_label)))
beta_df2 %>%
  ggplot(aes(x = x, y = beta)) +
  geom_point(aes(colour = mu_label), size = 1) +
  theme_light() +
  scale_colour_manual(values = colour_scale2) +
  facet_grid(vars(phi_label), vars(mu_label), scales = "free") +
  ylab("Beta(x)") +
  theme(legend.position = "none",
        strip.text = element_text(size = 12, colour = "black"),
        strip.background = element_blank())
Figure 3.2: Beta distributions created using different values of mu and phi.
From Figure 3.2, it’s much easier to see that as μ increases (left to right), the mean increases, and as ϕ increases (top to bottom), the variation decreases. This is a much more intuitive way to think about our data, and if we model our data in the Bayesian package brms, we will be estimating the values of μ and ϕ, not a and b.
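For reference, under this parameterization the variance of a beta-distributed variable y works out to

$$\mathrm{Var}(y) = \frac{\mu(1-\mu)}{1+\phi}$$

so for a fixed mean, a larger ϕ always means a smaller variance, which is why ϕ is usually described as a precision (or concentration) parameter rather than a variance.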
Note: for a great, thorough introduction to the beta distribution, check out Chapter 6 of Doing Bayesian Data Analysis (Kruschke 2015).
So at this point, we understand a bit about the beta distribution (it describes data bounded between 0 and 1 using the parameters μ = mean and ϕ = precision, where larger ϕ means less spread), and we know we can use it to describe our data by building a GLM. So how do we build a GLM?
Let’s go back to the normal distribution. Remember that if we were working with a normally distributed variable, we could describe it as
$$y \sim \mathrm{Normal}(\mu, \sigma)$$
This just says that y is some random variable that we think is normally distributed with a mean μ and a standard deviation σ. We can estimate these parameters directly, but often we want to test whether μ is associated with other variables, i.e., predictor variables. So we can build a model that allows different means for different groups (i):
$$y_i \sim \mathrm{Normal}(\mu_i, \sigma)$$
For example, we might want to know whether maximum daily temperature in the intertidal (response variable y_i) depends on microhabitat (predictor variable x_i). To describe this relationship, we can use a simple linear equation:
$$\mu_i = \alpha + \beta x_i$$
Note that now we aren’t estimating μ_i directly, but are instead estimating the values of two new parameters, α (the intercept) and β (the slope). Say we are estimating the (normally distributed) maximum daily temperature in two microhabitats, “exposed” (x_i = 0) and “shaded” (x_i = 1). In the context of our model, the α parameter would tell us the temperature of our “reference” habitat, “exposed”, and the β parameter would tell us the difference in temperature between “exposed” and “shaded” habitats. To estimate μ_i for different habitats, we set x_i to 0 or 1: i.e., for exposed habitats, we end up with μ_i = α + β × 0 (intercept but no slope). For shaded habitats, we get μ_i = α + β × 1 (intercept + slope).
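If it helps to see how that 0/1 coding arises in practice, here is a tiny sketch using a hypothetical two-level habitat factor (the variable name is made up); model.matrix() shows the design matrix R builds behind the scenes, with a column of 1s for α and a 0/1 column for β:

habitat <- factor(c("exposed", "exposed", "shaded", "shaded"))
model.matrix(~ habitat)  #-an "(Intercept)" column of 1s plus a 0/1 "habitatshaded" column-#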
The problem is that in the equation above, μ_i can take on any real value. But, when we’re working with beta-distributed data, we know that μ_i can only take on values greater than 0 and less than 1. We can resolve this by using something called a link function.
A link function is an equation that takes a constrained quantity (here, the mean μ_i, which is bounded between 0 and 1; for other distributions it might be restricted to positive values or integers) and maps it onto an unbounded, continuous scale. For the beta distribution, we typically use the logit link, which takes values constrained between 0 and 1 and transforms them into unbounded, continuous values (Figure 4.1). The logit link is defined as the log-odds:
$$\mathrm{logit}(\mu_i) = \log\left(\frac{\mu_i}{1-\mu_i}\right)$$
We can plot out this relationship:
mu <- seq(0.001, 0.999, by = 0.001)
logit_mu <- log(mu/(1-mu))
mu_tbl <- bind_cols(mu = mu, logit_mu = logit_mu)
mu_tbl %>%
  ggplot(aes(x = mu, y = logit_mu)) +
  geom_point() +
  theme_light(base_size = 16) +
  ylab("logit(mu)")
Figure 4.1: The relationship between mu and logit(mu).
We can (sort of) see from Figure 4.1 that logit(μ_i) is unbounded and continuous. As μ approaches 0 or 1, logit(μ) keeps decreasing (towards −∞) or increasing (towards +∞), respectively. So now we have a continuous, unbounded quantity that we can model as:
$$\mathrm{logit}(\mu_i) = \alpha + \beta x_i$$
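To move back from the model (logit) scale to the proportion scale, we apply the inverse of the link function, the inverse logit:

$$\mu_i = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}$$

Base R actually ships this pair of functions as qlogis() (the logit) and plogis() (the inverse logit), which can save you from writing your own helpers; a quick sketch:

qlogis(0.25)            #-logit(0.25), i.e., log(0.25/0.75)-#
plogis(qlogis(0.25))    #-back-transforms to 0.25-#
plogis(-5)              #-any real value maps back into (0, 1)-#
plogis(5)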
Let’s try implementing these ideas. The example is built in brms, my preferred modelling package. It’s Bayesian, so it might be unfamiliar to some readers, but hopefully some additional explanations will allow you to follow along.
Let’s say you’re studying yellow band disease in the Caribbean. You have data from two locations, Papaya Reef and Mango Reef. You suspect that Papaya Reef was exposed to yellow band disease before Mango Reef, and you want to compare the progression of the disease on each reef. To do this, you quantify the proportion of healthy and diseased tissue on 100 coral heads at each reef. Since we’re working with proportion data (a coral head can’t have less than 0% or more than 100% of its tissue infected), we know off the bat that the beta distribution is a natural choice for these data, and so we need to use a GLM.
For now, we’ll assume that every coral head you look at has at least some diseased and some healthy tissue (to avoid dealing with 0s and 1s, which cannot be accommodated by a basic beta distribution).2
First, let’s generate some fake data to work with using the rbeta() function:
For Papaya Reef, we’ll generate data for 100 coral heads with an average (μ) of 80% diseased tissue. For Mango Reef, we’ll generate data for another 100 coral heads with an average of 45% diseased tissue. We’ll use the same precision (ϕ) at both sites.
set.seed(3714)
n_corals <- 100
#-Convert our values of mu and phi into a and b-#
#-So that we can use the rbeta function-#
calculate_ab <- function(mu, phi) {
  a <- mu * phi
  b <- phi - mu * phi
  ab <- c(a, b)
  return(ab)
}
ab_PYA <- calculate_ab(0.8, 15)
ab_MGO <- calculate_ab(0.45, 15)
#-Generate random beta-distributed data for each reef-#
DT_PYA <- tibble(diseased_tissue = rbeta(n_corals, ab_PYA[1], ab_PYA[2]),
                 reefID = "Papaya Reef")
DT_MGO <- tibble(diseased_tissue = rbeta(n_corals, ab_MGO[1], ab_MGO[2]),
                 reefID = "Mango Reef")
DT_dat <- bind_rows(DT_PYA, DT_MGO)
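Before plotting, a quick sanity check (just the sample mean by reef) to confirm the simulated data landed roughly where we put them:

DT_dat %>%
  group_by(reefID) %>%
  summarise(mean_DT = mean(diseased_tissue),
            n = n())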
#-Plot histograms of simulated data-#
DT_dat %>%
  ggplot(aes(x = diseased_tissue)) +
  geom_histogram(bins = 20, alpha = 0.75, aes(y = ..density.., fill = reefID)) +
  geom_density(size = 2, aes(colour = reefID)) +
  facet_grid(rows = vars(reefID)) +
  theme_classic(base_size = 16) +
  scale_fill_manual(values = pnw_palette("Sunset", n = 2)) +
  scale_colour_manual(values = pnw_palette("Sunset", n = 2)) +
  xlab("Diseased tissue (DT)") +
  ylab("Density") +
  theme(legend.position = "none")
Figure 5.1: Our randomly simulated diseased tissue data.
We can see from Figure 5.1 that corals at Papaya Reef seem to have more diseased tissue than those at Mango Reef, but there’s still a lot of variation. Now that we have our simulated data, let’s describe our model:
$$\mathrm{DT}_{reefID} \sim \mathrm{Beta}(\mu_{reefID}, \phi)$$
All this means is that we have data describing the proportion of diseased tissue (DT_reefID) on coral heads, which we hypothesize is beta-distributed with a mean (μ_reefID) that differs across reefs and a precision (ϕ) that is the same across reefs. The model we build will estimate the parameters we need to calculate μ_reefID and ϕ.
Let’s start with μ_reefID. We know that the mean proportion of diseased tissue is determined by the reefID predictor, since we coded it that way, but how do we express that? We use a linear equation, like we would for the normal distribution. The only difference now is that the mean is modelled on the logit scale, as discussed above, so we get:
$$\mathrm{logit}(\mu_{reefID}) = \alpha + \beta_{reefID} \cdot \mathrm{reefID}$$
Since logit(μ_reefID) can take on any real value, we can estimate the parameters that describe it just as we would for a model built using the normal distribution. But notice that, just like any normal model built with predictors, brms will not be estimating a single parameter to determine μ. Here, α is an intercept describing the (logit-scale) proportion of diseased tissue at our reference reef (here Mango Reef, because brms chooses the reference level alphabetically), and β_reefID is the difference between the proportion of diseased tissue at Mango Reef and Papaya Reef, the only other level of our “reefID” predictor (but you could have many levels).
But what about ϕ? ϕ is relatively straightforward because we haven’t coded it to differ between reefs, so we will ask brms to estimate it directly.3 In principle, ϕ also needs a link function, because it must be positive; when ϕ is given its own predictors, brms uses a log link by default:

$$\log(\phi) = \alpha_{\phi}$$

Because we aren’t modelling ϕ with any predictors here, though, brms simply estimates ϕ on its natural (identity) scale, which is why the model summary below reports phi = identity.
Because we’re building a Bayesian model, we need to choose some priors. If you aren’t familiar with priors, they are probability distributions describing which parameter values we consider plausible before seeing the data; they are combined with the likelihood to produce the posterior, and in practice they also tell the sampling algorithm roughly where to look. You have to be careful when assigning priors: badly placed or overly confident priors can lead to bad parameter estimates, especially at small sample sizes. Here we’ll just calculate the group means of our randomly generated data (on the logit scale) and centre wide priors on them for the intercept and slope:
#-Build a function to transform our data-#
logit <- function(mu) {
  logit_mu <- log(mu / (1 - mu))
  return(logit_mu)
}
DT_means <- aggregate(DT_dat$diseased_tissue, by = list(DT_dat$reefID), mean) %>%
  as_tibble()
colnames(DT_means) <- c("ReefID", "mean")
DT_means$logit_mean <- logit(DT_means$mean)
print(paste("The logit-transformed intercept is", round(DT_means$logit_mean[1], 2)))
## [1] "The logit-transformed intercept is -0.18"
print(paste("The logit-transformed slope is", round((DT_means$logit_mean[2] - DT_means$logit_mean[1]), 2)))
## [1] "The logit-transformed slope is 1.5"
We’ll use these values to create priors for our brms models using the prior() function:
library(brms)
DT_priors <- c(prior(normal(-0.2, 4), class = "Intercept"),
               prior(normal(1.6, 4), class = "b"),
               prior(gamma(0.01, 0.01), class = "phi"))
Notice above that I have set the phi (ϕ) prior to gamma(0.01,0.01). This is the default prior brms uses for ϕ, and in my experience it works quite well. You never have to use the defaults, but we will here.
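If you ever want to double-check what brms would use by default, get_prior() lists the default prior for every parameter class of a given model specification before you fit anything:

#-Inspect brms' default priors for this model specification-#
get_prior(diseased_tissue ~ reefID,
          data = DT_dat,
          family = Beta(link = "logit", link_phi = "log"))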
library(parallel)
n_cores <- parallel::detectCores() #-use all computer cores-#
#-Build our model-#
DT_fit1 <- brm(diseased_tissue ~ reefID,
               data = DT_dat,
               family = Beta(link = "logit", link_phi = "log"),
               prior = DT_priors,
               cores = n_cores)
DT_fit1
## Family: beta
## Links: mu = logit; phi = identity
## Formula: diseased_tissue ~ reefID
## Data: DT_dat (Number of observations: 200)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -0.18 0.05 -0.28 -0.08 1.00 4054 3134
## reefIDPapayaReef 1.48 0.08 1.32 1.65 1.00 2794 2486
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi 14.62 1.43 11.99 17.56 1.00 3491 3069
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Now, let’s take a look at our model output. Notice that the parameter estimates for Intercept and reefIDPapayaReef are clearly on the logit scale: the beta distribution only allows values where 0 < x < 1, yet the Intercept estimate is negative. Eventually, when we’re reporting our results, we will back-transform our parameter estimates so that they are easier to interpret. But let’s start by checking how well the model ran:
plot(DT_fit1)
First, we’ll look at the density (left) and trace (right) plots. The density plots show us which (logit-transformed) parameter values are most probable, and the trace plots show us whether the model converged properly. Here, we can see the model has converged because the lines from each chain are mixed and distributed around the same mean value.
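If you prefer numbers to eyeballing trace plots, the same information is available numerically; a small sketch, assuming your brms version exposes the rhat() and neff_ratio() helpers (re-exported from the bayesplot/posterior ecosystem):

rhat(DT_fit1)        #-potential scale reduction factors; values near 1 indicate convergence-#
neff_ratio(DT_fit1)  #-effective sample size as a fraction of the total post-warmup draws-#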
Next, we can look at some posterior checks:
pp_check(DT_fit1, "dens_overlay_grouped", nsamples = 1000, group = "reefID")
There are a multitude of posterior predictive checks that you can run when you build a model, but here I’ve chosen to use just one. pp_check("dens_overlay_grouped") compares the distributions of the original data (dark blue lines; here separated by reefID) with 1000 datasets simulated from the posterior predictive distribution (light blue lines). It looks like our model is making predictions that are consistent with our original data, which is what you want. Notice that here the output is not logit-transformed.
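Just to illustrate that there are other options (the exact argument names can vary a little between brms versions), here is one more check I find useful: it compares the observed mean in each group to the distribution of means across the posterior-predictive datasets:

pp_check(DT_fit1, type = "stat_grouped", stat = "mean",
         group = "reefID", nsamples = 1000)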
It’s clear from the posterior predictive check that our parameter estimates should match up well with the values that we initially coded, but we can back-transform the parameter estimates just to be sure.
library(tidybayes)
#-Extract (transformed) parameter estimates from our posterior distributions-#
par_draws <- posterior_samples(DT_fit1) %>%
  as_tibble()
par_draws
## # A tibble: 4,000 x 4
## b_Intercept b_reefIDPapayaReef phi lp__
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.0882 1.40 13.2 145.
## 2 -0.264 1.55 16.5 145.
## 3 -0.266 1.63 16.4 145.
## 4 -0.213 1.54 14.5 147.
## 5 -0.178 1.41 14.2 147.
## 6 -0.218 1.48 15.0 147.
## 7 -0.210 1.46 14.1 147.
## 8 -0.155 1.49 15.3 147.
## 9 -0.154 1.53 18.4 144.
## 10 -0.136 1.32 15.3 145.
## # ... with 3,990 more rows
#-Write function to back-transform these estimates-#
inv_logit <- function(logit_mu) {
  mu <- exp(logit_mu) / (1 + exp(logit_mu))
  return(mu)
}
backtransformed_parameters <- tibble("Mango_Reef" = inv_logit(par_draws$b_Intercept),
                                     "Papaya_Reef" = inv_logit(par_draws$b_Intercept + par_draws$b_reefIDPapayaReef),
                                     "phi" = par_draws$phi) #-Don't need to back-transform phi-#
#-Wrangle the backtransformed parameters to create a nice plot-#
library(reshape2)
backtransformed_parameters <- melt(backtransformed_parameters) %>%
as_tibble()
backtransformed_parameters
## # A tibble: 12,000 x 2
## variable value
## <fct> <dbl>
## 1 Mango_Reef 0.478
## 2 Mango_Reef 0.434
## 3 Mango_Reef 0.434
## 4 Mango_Reef 0.447
## 5 Mango_Reef 0.456
## 6 Mango_Reef 0.446
## 7 Mango_Reef 0.448
## 8 Mango_Reef 0.461
## 9 Mango_Reef 0.462
## 10 Mango_Reef 0.466
## # ... with 11,990 more rows
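As an aside, since the tidyverse is already loaded, tidyr’s pivot_longer() would do the same reshaping job as reshape2::melt(); a sketch of the equivalent step (the object name backtransformed_long is just for illustration):

backtransformed_long <- tibble("Mango_Reef" = inv_logit(par_draws$b_Intercept),
                               "Papaya_Reef" = inv_logit(par_draws$b_Intercept + par_draws$b_reefIDPapayaReef),
                               "phi" = par_draws$phi) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")
#-Unlike melt(), pivot_longer() returns "variable" as a character column,-#
#-but the factor() call below would still work on it as written-#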
backtransformed_parameters$variable <- factor(backtransformed_parameters$variable,
                                              levels = c("phi", "Mango_Reef", "Papaya_Reef"))
#-Calculate the mean of each back-transformed posterior distribution-#
par_means <- aggregate(backtransformed_parameters$value, by = list(backtransformed_parameters$variable), mean)
#-Add a column with the "actual" values: the realized sample mean of each reef's simulated data, and the phi we coded into rbeta()-#
backtransformed_parameters$actual <- mean(DT_MGO$diseased_tissue)
backtransformed_parameters$actual[which(backtransformed_parameters$variable == "Papaya_Reef")] <- mean(DT_PYA$diseased_tissue)
backtransformed_parameters$actual[which(backtransformed_parameters$variable == "phi")] <- 15
backtransformed_parameters %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20, alpha = 0.75, aes(y = ..density.., fill = variable)) +
  geom_vline(aes(xintercept = actual, colour = variable),
             size = 2, linetype = "dashed") +
  geom_density(size = 2, aes(colour = variable)) +
  facet_wrap(facets = vars(variable), scales = "free") +
  theme_classic(base_size = 16) +
  scale_fill_manual(values = pnw_palette("Sunset", n = 3)) +
  scale_colour_manual(values = pnw_palette("Sunset", n = 3)) +
  xlab("Diseased tissue (DT)") +
  ylab("Density") +
  theme(legend.position = "none")
Looking good! Our estimates are pretty close to the actual mean values of our data.
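If you wanted to report these back-transformed estimates, one option (a sketch using only the objects created above) is to summarise each posterior with its mean and a 95% quantile interval:

backtransformed_parameters %>%
  group_by(variable) %>%
  summarise(estimate = mean(value),
            lower_95 = quantile(value, 0.025),
            upper_95 = quantile(value, 0.975))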