brms, categorical predictor variables, and the poisson distribution
This guide was created by Trevor Caughlin for EEB622 in Spring 2026.
How does canopy temperature relate to plant height?
thermal <- read.csv("thermal_data.csv")
plot(thermal$plant_height, thermal$temperature, xlab="Plant Height (m)", ylab=expression("Temperature ("*degree*C*")"), col="#9EA000")
Figure 1a: Data showing temperature of plants predicted by plant height. You can see a general trend that as plants increase in height, temperature of the plant decreases.
hist(thermal$temperature, col="pink", border="white", xlab=expression("Continous Temperature Data ("*degree*C*")"), main="Hist. response, assume Gaussian")
Figure 1b: Distribution of response variable to
determine the model family. Here we are assuming normality so we will
use Gaussian (the default distribution for the brm
function.
We are assuming the histogram of the above response variable (Temperature) is normal.
temp_model <- brm(temperature~plant_height, data=thermal, silent=T, refresh=0)
## Running "C:/PROGRA~1/R/R-45~1.2/bin/x64/Rcmd.exe" SHLIB foo.c
## using C compiler: 'gcc.exe (GCC) 14.3.0'
## gcc -I"C:/PROGRA~1/R/R-45~1.2/include" -DNDEBUG -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/Rcpp/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/unsupported" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/BH/include" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/src/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppParallel/include/" -DRCPP_PARALLEL_USE_TBB=1 -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DUSE_STANC3 -DSTRICT_R_HEADERS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -D_HAS_AUTO_PTR_ETC=0 -include "C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp" -std=c++1y -I"C:/rtools45/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -std=gnu2x -mfpmath=sse -msse2 -mstackrealign -c foo.c -o foo.o
## cc1.exe: warning: command-line option '-std=c++14' is valid for C++/ObjC++ but not for C
## In file included from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Core:19,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Dense:1,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
## from <command-line>:
## C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
## 679 | #include <cmath>
## | ^~~~~~~
## compilation terminated.
## make: *** [C:/PROGRA~1/R/R-45~1.2/etc/x64/Makeconf:289: foo.o] Error 1
Summary gives you estimates of the effect sizes as well
as confidence intervals for uncertainty. fixef will do
exactly the same thing!
fixef(temp_model)
## Estimate Est.Error Q2.5 Q97.5
## Intercept 44.184472 0.3915403 43.420134 44.937630
## plant_height -8.169058 0.8138395 -9.745344 -6.588254
summary(temp_model)
## Family: gaussian
## Links: mu = identity
## Formula: temperature ~ plant_height
## Data: thermal (Number of observations: 207)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 44.18 0.39 43.42 44.94 1.00 4271 3144
## plant_height -8.17 0.81 -9.75 -6.59 1.00 4337 3112
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 2.16 0.11 1.96 2.38 1.00 3609 2974
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
#put the model into a data frame so you can use it
post_thermal <- data.frame(temp_model)
Baseline temperature is 44 \(^\circ\)C, for every 1 unit plant growth, temperature decreases by 8 \(^\circ\)C.
There are a lot of questions we can ask/answer using the posterior distribution below:
plot(density(post_thermal$b_plant_height))
polygon(density(post_thermal$b_plant_height), col="#00B9BE", main="Plausible values for slope parameter")
Figure 2: Histogram of posterior slope estimates from the Bayesian regression of temperature on plant height. The posterior mass is entirely negative, showing near-certain evidence that taller plants are associated with lower canopy temperatures. After collecting all values of posterior slope, pulling out only those that are less than or equal to -10 and finding the mean, gives the following information: The probability that canopy temperature decreases by at least 10 °C per meter of plant height is approximately 0.9%
Here you can see that our slope value is almost CERTAINLY between -11 and -5 and does not cross 0. So we would want to ask a question about negative direction (see 2 below).
Is the effect positive or negative? How certain? \(P (\beta<0)\) or \(P (\beta>0)\)
We went with negative \(P
(\beta>0)\) because of the posterior distribution
above, all x values are in the negative.
mean(post_thermal$b_plant_height < 0)
## [1] 1
## 1
Answer: There was near-certainty (>99.9% probability) that canopy temperature decreases as plant height increases.
How likely is the effect at least [x] big?
What is the probability canopy temperature drops at least 10°C per meter of plant height?
\(P (\beta < -10)\)
prob_drop10 <- mean(post_thermal$b_plant_height <= -10)
prob_drop10
## [1] 0.0135
Answer: There was only about a 1.3% probability that canopy temperature decreases by at least 10 °C per meter of plant height, indicating the cooling effect is unlikely to be that strong.
What range of parameter values is plausible?
quantile(post_thermal$b_plant_height, c(0.25, 0.5, 0.975).
This is basically what fixef() summarizes.
What is the temperature change between two plant heights? The thermal
data has heights ranging from 0.897332 to
0.01694893.
We can use these to answer a question like,
What is the temperature change between plants that are 0.1 m tall and plants that are 0.8 m tall?
max(thermal$plant_height)
## [1] 0.897332
min(thermal$plant_height)
## [1] 0.01694893
h1 <- 0.10
h2 <- 0.80
changeinTemp <- post_thermal$b_plant_height * (h2-h1) #the change in heights are multiplied by each posterior slope here
quantile(changeinTemp) #let's change the % ranges
## 0% 25% 50% 75% 100%
## -7.815580 -6.091488 -5.721048 -5.335805 -3.843027
quantile(changeinTemp, c(0.25, 0.50, 0.975))
## 25% 50% 97.5%
## -6.091488 -5.721048 -4.611778
mean(changeinTemp < 0)
## [1] 1
Answer: Plants 0.8 m tall were predicted to be about 5.7 °C cooler than plants 0.1 m tall, with a plausible range of roughly 4.6 to 6.1 °C cooler and essentially 100% probability of cooling.
You can ask questions about predictions too. Like,
What temperature do we expect at a given height (with uncertainty [x])?
h <- 0.50
T_hat <- post_thermal$b_Intercept + post_thermal$b_plant_height * h
quantile(T_hat, c(.025,.5,.975))
## 2.5% 50% 97.5%
## 39.78876 40.09958 40.40063
Answer: A plant 0.5 m tall is predicted to have a canopy temperature of approximately 40.1 °C, with a 95% credible interval of about 39.8–40.4 °C.
There are so many useful ecological questions we can answer
using these posterior samples. Posterior samples are just a
huge (4000 row) data frame of plausible worlds. And once
you have them, there are so many questions you can ask over and over in
ANY model.
ce <- conditional_effects(temp_model,method ="posterior_predict",prob = 0.90) #here the predictive interval of 90 was requested
p <- plot(ce, points = TRUE)[[1]] #how to pull out the desired plot from the conditional_effects function
Figure 3a & b: Conditional_effects
function to show the posterior mean relationship between plant height
and temperature and a credible interval band (Default to the
conditional_effects function). Doesn’t show the shape of uncertainty the
way Fig. 5 does because it is an interval for the mean and not the full
predictive interval unless specifically requested. Unfortunately this
function is clunky and annoying and printed out two versions of itself
with no real reason found as to why.
p + geom_point(color = "#E068EB", size = 2, alpha = 0.8)+
labs(title="Predicted temperature with 90% CI", x="Plant Height (m)", y=expression("Temperature ("*degree*C*")")) #make it pretty
Figure 3a & b: Conditional_effects
function to show the posterior mean relationship between plant height
and temperature and a credible interval band (Default to the
conditional_effects function). Doesn’t show the shape of uncertainty the
way Fig. 5 does because it is an interval for the mean and not the full
predictive interval unless specifically requested. Unfortunately this
function is clunky and annoying and printed out two versions of itself
with no real reason found as to why.
plot(temperature~plant_height, xlab="Plant Height (m)", ylab=expression("Temperature (" *degree*C*")"), pch=19, col="#00B9BE", data=thermal)
median_slope <- median(post_thermal$b_plant_height)
median_intercept<-median(post_thermal$b_Intercept)
for(i in 1:length(post_thermal$b_plant_height)){
curve((post_thermal$b_Intercept[i]+post_thermal$b_plant_height[i]*x),add=T,lwd=0.5,col=rgb(0.45, 0.47, 0.55, 0.05))
}
curve((median_intercept+median_slope*x),add=T,col="black",lwd=2)
Figure 4: Relationship between plant height and canopy temperature. Points show observed data. Thin grey lines represent posterior regression draws from the Bayesian model, illustrating uncertainty in the estimated relationship. The black line shows the posterior mean prediction, indicating that taller plants are associated with lower temperatures. This plot is showing you where the posterior density is, which can be more useful sometimes than Figure 3. Effect size Every meter that plants grow in height, there is a decrease in temperature by 8 °C. Uncertainty The true cooling rate is very likely between roughly 6.6 and 9.7 °C per meter (95% credible interval).
figs <- read.csv("fig_visitors.csv")
plot(Bat~FIG.DBH,xlab="Diameter at Breast Height of sacred fig trees (cm)",
ylab="Counts of bats that visited trees",pch=19,col="purple4",data=figs)
hist(figs$Bat,xlab="Counts of bats")
# note this model will give errors or warnings because the exp of some of these values is HUGE
#Bat_model <- brm(Bat~FIG.DBH, family = "poisson", data=figs, silent=T, refresh=0)
#SO, lets add column of log DBH
figs$log_DBH <- log(figs$FIG.DBH)
Bat_model <- brm(Bat~log_DBH, family = "poisson", data=figs, silent=T, refresh=0)
## Running "C:/PROGRA~1/R/R-45~1.2/bin/x64/Rcmd.exe" SHLIB foo.c
## using C compiler: 'gcc.exe (GCC) 14.3.0'
## gcc -I"C:/PROGRA~1/R/R-45~1.2/include" -DNDEBUG -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/Rcpp/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/unsupported" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/BH/include" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/src/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppParallel/include/" -DRCPP_PARALLEL_USE_TBB=1 -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DUSE_STANC3 -DSTRICT_R_HEADERS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -D_HAS_AUTO_PTR_ETC=0 -include "C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp" -std=c++1y -I"C:/rtools45/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -std=gnu2x -mfpmath=sse -msse2 -mstackrealign -c foo.c -o foo.o
## cc1.exe: warning: command-line option '-std=c++14' is valid for C++/ObjC++ but not for C
## In file included from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Core:19,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Dense:1,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
## from <command-line>:
## C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
## 679 | #include <cmath>
## | ^~~~~~~
## compilation terminated.
## make: *** [C:/PROGRA~1/R/R-45~1.2/etc/x64/Makeconf:289: foo.o] Error 1
fixef#check out the summary
summary(Bat_model)
## Family: poisson
## Links: mu = log
## Formula: Bat ~ log_DBH
## Data: figs (Number of observations: 22)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -3.28 0.34 -3.94 -2.62 1.00 1482 1860
## log_DBH 1.27 0.06 1.14 1.39 1.00 1580 1986
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Notice in the summary(Bat_model) table
contains the median estimates for slope and intercept. It also
has the CREDIBLE INTERVALS (CI) of
l-95% CI and u-95% CI. Those are going to come
in handy.
Now lefts extract the effects directly:
fixef(Bat_model)
## Estimate Est.Error Q2.5 Q97.5
## Intercept -3.277950 0.34269780 -3.936452 -2.615018
## log_DBH 1.265819 0.06373246 1.144000 1.388716
But these don’t contain “the real action”! The posterior samples are where it’s at.
posterior_samples <- data.frame(Bat_model)
head(posterior_samples)
## b_Intercept b_log_DBH Intercept lprior lp__
## 1 -3.284170 1.273675 2.820080 -1.957778 -306.8077
## 2 -3.436360 1.305167 2.818817 -1.957614 -307.3043
## 3 -3.283822 1.267373 2.790223 -1.953997 -306.5366
## 4 -3.021571 1.217529 2.813588 -1.956941 -306.8171
## 5 -3.097356 1.233123 2.812540 -1.956806 -306.6679
## 6 -3.478449 1.308692 2.793620 -1.954418 -306.9219
In the data above, each row represents one sample from the posterior
distribution of the model parameters. b_Intercept and
b_log_DBH are posterior samples of the intercept and the
slope for log_DBH, respectively (on the model’s link
scale). These are the important parts that I want you to focus on!
Less important: lprior is the log prior density for that
draw. lp__ is the total log posterior density (log
likelihood + log prior) for that sample, used internally by Stan for
sampling diagnostics.
Note that each row here represents one plausible set of parameter values given your data and priors, and together these samples characterize the full posterior distribution of your bat model. we can start exploring these values by plotting.
Here are two different ways of doing the same thing. Note that because there are 4000 values of the posterior draws, I am making the histogram breaks larger (100 breaks, rather than the default of 10).
hist(posterior_samples$b_log_DBH,breaks=100,col="pink",xlab="plausible values of effect of log(DBH) on bat visits")
plot(density(posterior_samples$b_log_DBH))
polygon(density(posterior_samples$b_log_DBH),col="pink",main="Plausible values for slope parameter")
Choose whichever plot you want above for posterior distribution. The most important thing about these plots is that our slope value is almost CERTAINLY between 1.1 and 1.5. This conclusion is based on the histogram (or density plot) above.
Now, let’s explore the slope values a bit more here.
What is the probability that increasing tree size has a positive impact on bat visits?
That is, more bats visit bigger trees. Remember: Bayesian statistics is COUNTING!
Let’s count the posterior samples that are positive. The R code below
is asking, how many samples are greater than 0? The length
function measures how many elements are in a vector. The
which statement is subsetting the vector to only those
values that are >0.
posterior_slopes <- posterior_samples$b_log_DBH
mean(posterior_slopes>0)
## [1] 1
This gives us the answer to the following question:
What is the probability that the effect of tree size on bat visits is positive?
100%
There are no posterior samples for the slope that are less than 0 here. This makes biological sense too, bats like visiting bigger fruiting trees. How would we write this in a paper?
We would never want to say that we are 100% certain. That’s nutty in science. Here is a better way to phrase it:
There was near-certainty that bats were more likely to visit larger trees (P > 99.99%).
First, quickly get the slope means for the posterior values effects. You could present these in a paper as a summary of the effect. It is up to you whether you think this is easier to understand than quantiles.
mean(posterior_samples$b_log_DBH)
## [1] 1.265819
sd(posterior_samples$b_log_DBH)
## [1] 0.06373246
And if you want to get the 80% quantiles instead of the model output values of 95%? Use this code:
quantile(posterior_samples$b_log_DBH, c(0.2, 0.8))
## 20% 80%
## 1.209796 1.320589
In my view, it is good to think critically about what you want the quantiles to be here. If you are building a nuclear power plant, you probably want a probability of failure <0.01% (or some very small number). However, for messy ecology datasets, like counts of bats, a value of 95% seems a bit high.
We can curve out the median effect. But a single line is kind of boring, because there’s no uncertainty. This is ignoring the rich information the posterior samples give us. You can always show all the samples all at once. Here’s the code I usually use, a big clunky for loop.
library(tidybayes)
library(tibble)
library(ggplot2)
library(dplyr)
library(ggplot2)
# Create a sequence of x values for the regression line
x_seq <- seq(min(figs$log_DBH), max(figs$log_DBH), length.out = 100)
# 1. Calculate Median Line Data
median_slope <- median(posterior_samples$b_log_DBH)
median_intercept <- median(posterior_samples$b_Intercept)
median_line <- data.frame(
log_DBH = x_seq,
Bat = exp(median_intercept + median_slope * x_seq)
)
# 2. Calculate Individual Posterior Samples (Spaghetti Plot Data)
# Note: For large samples, subsetting (e.g., 1:100) is recommended for speed
sample_lines <- expand.grid(log_DBH = x_seq, i = 1:length(posterior_samples$b_log_DBH))
sample_lines$Bat <- exp(posterior_samples$b_Intercept[sample_lines$i] +
posterior_samples$b_log_DBH[sample_lines$i] * sample_lines$log_DBH)
#create ggplot
ggplot(figs, aes(x = log_DBH, y = Bat)) +
# Add individual posterior sample lines (spaghetti plot)
geom_line(data = sample_lines, aes(group = i),
color = "#F0726A", alpha = 0.05, size = 0.1) +
# Add raw data points
geom_point(color = "#00B9BE", size = 2, alpha = 0.7) +
# Add median regression line
geom_line(data = median_line, color = "#F0726A", size = 1.5) +
# Labels and themes
labs(x = "log(Diameter at Breast Height of sacred fig trees (cm))",
y = "Counts of bats that visited trees") +
theme_minimal()