brms, continuous predictor variables, and the Poisson distribution
We will begin by fitting a model in brms. The goal of the code below is to demonstrate what the entire workflow might look like for fitting one of these models!
fig_visitors=read.csv("fig_visitors.csv")
library(brms)
As always, plot the data:
plot(Bat~FIG.DBH,xlab="Diameter at Breast Height of sacred fig trees (cm)",
ylab="Counts of bats that visited trees",pch=19,col="purple4",data=fig_visitors)
It is usually a good idea to do a histogram of the data as well:
hist(fig_visitors$Bat,xlab="Counts of bats")
Now, let's fit a brms model.
Bat_model<-brm(Bat~FIG.DBH,family="poisson",data=fig_visitors)
Note that this model is equivalent to: Bat_model<-brm(Bat~FIG.DBH,family=poisson(link="log"),data=fig_visitors). The default link for the Poisson family is the log link: the model works with log(mu), and predictions are back-transformed to the count scale with exp().
You may see some warnings here about low effective sample size, chain mixing, and Rhat. Without getting into too much detail about these warnings (we will cover them later!), let's discuss one reason why they are occurring. The Poisson GLM uses a log link, meaning the expected count is exp(intercept + slope*x). Our predictor variable, fig tree DBH, has big values (up to 500!). What happens if you take exp(500)? You get a huge number, and huge numbers break the algorithm.
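To make this concrete, here is a quick base R check (purely illustrative, not part of the model-fitting code) of what the inverse link does to numbers on the scale of our raw DBH values:
exp(5)    # a perfectly reasonable expected count
exp(500)  # an astronomically large number (roughly 10^217)
A simple solution is to put the predictor on the log scale: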
fig_visitors$log_DBH<-log(fig_visitors$FIG.DBH)
hist(fig_visitors$log_DBH)
This simple transformation will help the model fit, at the cost of some interpretability; we will cover other solutions to this problem later in class.
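As a quick sanity check (again just illustrative), compare the range of the predictor before and after the transformation; on the log scale the values are small enough that exp() of the linear predictor stays well-behaved:
range(fig_visitors$FIG.DBH)  # raw DBH, values up to a few hundred cm
range(fig_visitors$log_DBH)  # log scale: much smaller values
With the predictor on this more manageable scale, let's refit the model: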
Bat_model_log<-brm(Bat~log_DBH,family="poisson",data=fig_visitors)
First, we can look at the summary:
summary(Bat_model_log)
This summary provides a lot of information; we will go through it in more detail as class progresses. What I want you to take away are the estimate and the lower and upper 95% CI.
## Family: poisson
## Links: mu = log
## Formula: Bat ~ log_DBH
## Data: fig_visitors (Number of observations: 22)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Regression Coefficients:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -3.27      0.33    -3.92    -2.62 1.00     1589     2127
## log_DBH       1.26      0.06     1.14     1.38 1.00     1717     2314
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The summary gives you estimates of the effect sizes, as well as 95% Credible Intervals.
Note: Bayesians call these Credible Intervals to differentiate them from Confidence Intervals. The abbreviation (CI) is the same for both.
We can also extract the effects directly:
fixef(Bat_model_log)
##            Estimate  Est.Error      Q2.5     Q97.5
## Intercept -3.270446 0.33163155 -3.916658 -2.621115
## log_DBH    1.264471 0.06180139  1.140710  1.384329
The real action is in the posterior samples! Let's get 'em! You can explore this data frame exactly the same way you would any data frame in R. I always like to look at the first six rows, using the R function head:
pos_samples=data.frame(Bat_model_log)
head(pos_samples)
## b_Intercept b_log_DBH Intercept lprior lp__
## 1 -3.285678 1.261975 2.762494 -1.950647 -306.7637
## 2 -3.224140 1.262241 2.825308 -1.958458 -306.7887
## 3 -2.768159 1.186100 2.916374 -1.971184 -309.1169
## 4 -2.755452 1.172846 2.865561 -1.963880 -307.7804
## 5 -3.633050 1.329632 2.739376 -1.947973 -307.1281
## 6 -3.513822 1.309657 2.762871 -1.950692 -306.7879
Each row represents one sample from the posterior distribution of the model parameters. b_Intercept and b_log_DBH are posterior samples of the intercept and the slope for log_DBH, respectively (on the model’s link scale). These are the important parts that I want you to focus on!
Less important: lprior is the log prior density for that draw. lp__ is the total log posterior density (log likelihood + log prior) for that sample, used internally by Stan for sampling diagnostics.
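If you want to double-check what you are working with, a couple of base R calls (nothing brms-specific) show the size and column names of this data frame; the row count should match the 4,000 post-warmup draws reported in the summary:
nrow(pos_samples)      # 4000 draws: 4 chains x 1000 post-warmup iterations
colnames(pos_samples)  # "b_Intercept" "b_log_DBH" "Intercept" "lprior" "lp__"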
Note that each row here represents one plausible set of parameter values given your data and priors, and together these samples characterize the full posterior distribution of your bat model. We can start exploring these values by plotting.
Here are two different ways of doing the same thing. Note that because there are 4,000 posterior draws, I am using more histogram breaks than usual (100 breaks, rather than R's default).
hist(pos_samples$b_log_DBH,breaks=100,col="green",xlab="plausible values of effect of log(DBH) on bat visits")
plot(density(pos_samples$b_log_DBH),main="Plausible values for slope parameter")
polygon(density(pos_samples$b_log_DBH),col="green2")
You can choose whichever plot you want! The important thing to note about these particular results is that we can be very certain that the value of the slope is somewhere between 1.1 and 1.5. This conclusion is based on the histogram (or the density plot, whichever you prefer).
Now, let’s explore the slope values a bit more here. What is the probability that increasing tree size has a positive impact on bat visits? That is, more bats visit bigger trees. Remember: Bayesian statistics is COUNTING!
Let's count the posterior samples that are positive. The R code below asks: how many samples are greater than 0? The which function returns the positions of the values that are >0, and the length function counts how many elements are in that result.
length(which(pos_samples$b_log_DBH>0))
The R code above gives us a count, but we want a probability (or a proportion). Specifically, we want to know the probability that the effect of tree size on bat visits is positive.
We will get this by dividing the number of positive samples by the total length of all the samples:
length(which(pos_samples$b_log_DBH>0))/length(pos_samples$b_log_DBH)
The value from the statement above is 100%: there are no posterior samples for the slope that are <0. This makes biological sense: bats like visiting bigger fruiting trees! How would we write this in a paper? Well, we never want to say we are "100% certain" about something, so I would write something like: "There was near-certainty that bats were more likely to visit larger trees (probability > 99.99%)."
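As an aside (this is just base R and gives the identical answer), because R treats TRUE as 1 and FALSE as 0, you can get the same proportion in one step by taking the mean of the logical comparison:
mean(pos_samples$b_log_DBH>0)  # proportion of posterior draws with a positive slope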
I want to emphasize that the posterior draws are just columns of numbers. You can treat them like you would any other column of numbers! Here is code for getting the mean and standard deviation of the effect. You could present these in a paper as a summary of the effect; it is up to you whether you think this is easier to understand than quantiles.
mean(pos_samples$b_log_DBH)
sd(pos_samples$b_log_DBH)
The summary of the model object already reports the 95% quantiles. But what if we want 80% quantiles?
quantile(pos_samples$b_log_DBH,c(0.1,0.9))
In my view, it is good to think critically about what you want these quantiles to be. If you are building a nuclear power plant, you probably want a probability of failure below 0.01% (or some other very small number). However, for messy ecology datasets, like counts of bats, a value of 95% seems a bit high.
## 10% 90%
## 1.185225 1.343142
If we wanted to describe this verbally, we could say there is an 80% probability that the effect of log_DBH on bat visits is between 1.19 and 1.34.
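The same counting logic works for any threshold you care about, not just zero (the value of 1 below is an arbitrary example, not something from the original analysis):
mean(pos_samples$b_log_DBH>1)  # probability that the slope exceeds 1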
Of course, we should also look at the effect graphically. We can use curve() to draw the median effect, but a single line is kind of boring because it shows no uncertainty, ignoring the rich information the posterior samples give us. You can instead show all the samples at once. Here's the code I usually use: a big, clunky for loop.
plot(Bat~log_DBH,xlab="log(Diameter at Breast Height of sacred fig trees (cm))",
ylab="Counts of bats that visited trees",pch=19,col="purple4",data=fig_visitors)
median_slope<-median(pos_samples$b_log_DBH)
median_intercept<-median(pos_samples$b_Intercept)
curve(exp(median_intercept+median_slope*x),add=T,col="tomato",lwd=5)
for(i in 1:length(pos_samples$b_log_DBH)){
curve(exp(pos_samples$b_Intercept[i]+pos_samples$b_log_DBH[i]*x),add=T,lwd=0.5,col=rgb(0.8, 0, 0.1, alpha = 0.4))
}
curve(exp(median_intercept+median_slope*x),add=T,col="tomato",lwd=5)
If you are fancy, you can make the same plot with tidybayes, using tidyverse techniques. Is this easier than doing a for loop? That is a question you must answer on your own.
library(tidybayes)
library("tibble")
library(ggplot2)
library(dplyr)
x_grid <- tibble(log_DBH = seq(min(fig_visitors$log_DBH),
max(fig_visitors$log_DBH),
length.out = 100))
plot(Bat ~ log_DBH, data = fig_visitors,
xlab = "log(Diameter at Breast Height of sacred fig trees (cm))",
ylab = "Counts of bats that visited trees",
pch = 19, col = "purple4")
Bat_model_log %>%
add_linpred_draws(newdata = x_grid, ndraws = 200) %>% # linpred = on link scale
mutate(mu = exp(.linpred)) %>% # convert to counts (log link)
split(.$.draw) %>%
lapply(function(d)
lines(d$log_DBH, d$mu, col = rgb(0.8, 0, 0.1, 0.25), lwd = 0.7)
)
Bat_model_log %>%
add_linpred_draws(newdata = x_grid, ndraws = 2000) %>%
group_by(log_DBH) %>%
summarise(mu_med = median(exp(.linpred)), .groups = "drop") %>%
with(lines(log_DBH, mu_med, col = "tomato", lwd = 5))
Finally, here is the published paper from these data: https://www.jstor.org/stable/24084509. This was my first paper! 😍 My statistical expertise has evolved a lot since then!