🗺️ ATLAS OF THE GLM

in progress: updates still being made

DECISIONS

N O T E

This markdown (will be) a map for generative modeling [currently, completed tabs are denoted with an emoji]. The goal is to use the same 6 steps listed under F L O W on all distributions for fluidity’s sake. I used data provided in Spring 2026 EEB622 by Dr. T Caughlin under the Binomial & Poisson tabs. I provided data for the Negative Binomial flow. All other data was taken from Analysis of Biological Data site, and links to data used are provided under each tab.

F L O W

Load Data & define biological question
Define Variables plot it, decide x & y, determine what the intercept represents
Set Priors plausible worlds, plausible parameters
Fit Bayesian Model GLM: brm()
Posterior Predictions/Draws: posterior_epred()
Ecological Inference translate it

A decision about likelihood can be made by asking: WHAT TYPE OF VARIABLE IS Y?

🧭 C H A R T

Gaussian → Y is continuous, ranges from −∞ to +∞ | e.g. temperature, height, measurement error | link: identity

Bernoulli → Y is binary (0 or 1) | e.g. presence/absence, alive/dead, detected/not detected | link: logit

Binomial → Y is discrete proportion (can't have half an egg) | e.g. hatched/laid, survival/total | link: logit

Poisson → Y is discrete counts of events (0, 1, 2, 3...) | e.g. fish counts, insects trapped | link: log

Negative Binomial → Y is discrete counts where variance > mean, many 0's, big values | e.g. phytoplankton counts | link: log

Gamma → Y is continuous but ONLY positive (> 0), right-skewed | e.g. biomass, productivity, time | link: log

L I B R A R I E S

brms, ggplot2, dplyr, tidyr, gt, tidybayes

GAUSSIAN

D A T A

bluetits.csv. These data can be downloaded by clicking the link and is called Figure 17.5-4 left Cap color on the Analysis of Biological Data site. The scientific paper for these data can be found here.

J. Bell

C H E C K L I S T

Load Data
Define Variables
Set Priors
Fit Bayesian Model: brm()
Posterior Predictions: posterior_epred()
Ecological Inference

Q U E S T I O N
??

1. Load Data

bluetits <- read.csv("bluetits.csv")

BERNOULLI

D A T A

guppy.csv. These data can be downloaded by clicking the link and is called Figure 17.9-1 Guppy temperature mortality on the Analysis of Biological Data site. The scientific paper for these data can be found here.

R. G. Pitkow

C H E C K L I S T

Load Data
Define Variables
Set Priors
Fit Bayesian Model: brm()
Posterior Predictions: posterior_epred()
Ecological Inference

Q U E S T I O N
??

1. Load Data

guppy <- read.csv("guppy.csv")

BINOMIAL

D A T A

seeds.csv. These data can be downloaded on our Slack Channel and belongs to Dr. T Caughlin, whose work on these data can be found here.

J. Elliot/Flickr

C H E C K L I S T

Load Data
Define Variables
Set Priors
Fit Bayesian Model: brm()
Posterior Predictions: posterior_epred()
Ecological Inference

Q U E S T I O N
??

1. Load Data

seeds <- read.csv("seeds.csv")

POISSON 🐠

D A T A

Anableps_anableps.csv. These data can be downloaded on our Slack Channel and belongs to Dr. T Caughlin, whose work on these data can be found here.

California Academy of Sciences

C H E C K L I S T

Load Data
Define Variables
Set Priors
Fit Bayesian Model: brm()
Posterior Predictions: posterior_epred()
Ecological Inference

Q U E S T I O N
How does restoration treatment impact four-eyed fish abundance?

1. Load Data

anableps <- read.csv("anableps_anableps.csv")

#give the data a look


anableps_dt <- gt(anableps)|> #everything below this point is just for aesthetics. gt(dataframe) is enough to create a table of your data
  opt_table_font(google_font("Caveat"))|> 
  opt_table_outline(style="solid")|>
  tab_header(
    title = "Anableps Data")|>
  tab_options(table.align='left') 

opt_interactive(anableps_dt) #makes your datatable interactive

Anableps Data

Check out data types:

str(anableps) #this shows that our x Treatment data are "chr: characters" and we want them to be "factors with 3 levels" so...this will be delt with in step 2.

## 'data.frame':    27 obs. of  4 variables:
##  $ Treatment : chr  "Degraded" "Degraded" "Degraded" "Degraded" ...
##  $ Site.ID   : chr  "Degraded1" "Degraded2" "Degraded3" "Degraded4" ...
##  $ Species   : chr  "Anableps anableps" "Anableps anableps" "Anableps anableps" "Anableps anableps" ...
##  $ TotalCatch: int  28 0 0 0 0 0 8 7 2 6 ...

2. Define Variables

Variables were already defined by Dr. Caughlin’s question. In this step, I have cleaned up the data and set the reference condition to Intact (instead of Active Restoration, the default).

Response: TotalCatch (count data)
Predictor: Treatment (multiple categories)

# ensure categorical predictor
anableps$Treatment <- as.factor(anableps$Treatment) #here, as.factor forces the chr data into factors which was previously a character

# set ecological reference condition
anableps$Treatment <- relevel(anableps$Treatment, ref = "Intact")

Purpose

Define the biological comparison:

Intercept = expected abundance in intact habitat
Slopes = differences relative to intact habitat

Plot

ggplot(anableps, aes(Treatment, TotalCatch, fill=Treatment))+
  geom_boxplot()+ 
  labs(title="Fish catch between treatments", x="", y="Total Fish Catch")+ 
  theme(legend.position="none")+ #I take off the redundant legend
  scale_fill_manual(values=c("#00B9BE", "#E665E6", "#78A900"))

3. Set Priors (plausible worlds)

#intercept mean 

mean(anableps$TotalCatch)

## [1] 27.55556

priors <- c(prior(normal(log(27), 1), class = "Intercept"), prior(normal(0, 1), class = "b"))  #log 27 was chosen by looking at the mean of catch using the above code

Interpretation

Baseline fish abundance expected near observed mean.
Treatment effects assumed moderate on log scale.

This represents knowledge before seeing data.

4. Fit Bayesian Model (updating step)

anableps_model <- brm(
  TotalCatch ~ Treatment,
  family = poisson(), #you can also write "poisson" here instead of poisson()
  data = anableps,
  prior = priors, #priors were explicitly created in last step
  chains = 2, #rest of code is just to get the model to run faster for homework quickness
  iter = 1000,
  warmup = 500,
  backend = "cmdstanr",
  refresh = 0)

## Start sampling

## Running MCMC with 2 sequential chains...
## 
## Chain 1 finished in 0.0 seconds.
## Chain 2 finished in 0.0 seconds.
## 
## Both chains finished successfully.
## Mean chain execution time: 0.0 seconds.
## Total execution time: 0.3 seconds.

## Loading required namespace: rstan

This step performs Bayesian updating

\[ \color{#00B9BE}{Posterior} \propto \color{#CA75FD}{Likelihood} * \color{#78A900}{Prior} \]

Output = distribution of plausible parameter values.

5. Posterior Predictions (`epred`)

posterior_draws <- as.data.frame(anableps_model)

## Loading required namespace: rstan

head(posterior_draws)

##   b_Intercept b_TreatmentDegraded b_TreatmentIntact Intercept    lprior
## 1    4.235709           -2.639014         -2.031395  2.678906 -8.492599
## 2    4.197850           -2.553858         -1.912751  2.708980 -8.019421
## 3    4.167447           -2.270784         -1.941889  2.763223 -7.362352
## 4    4.175955           -2.581140         -2.094368  2.617452 -8.511248
## 5    4.278789           -2.762184         -2.020183  2.684667 -8.798981
## 6    4.273027           -2.427569         -1.970966  2.806848 -7.765270
##        lp__
## 1 -566.2531
## 2 -566.5310
## 3 -568.4309
## 4 -567.5790
## 5 -567.6665
## 6 -567.9873

Each row represents one plausible ecological reality consistent with observed data.

newdata <- data.frame(Treatment = c("Intact", "Degraded", "Active Restoration")) 

#the above code just sets up an empty dataframe with better column names than v1, v2, v3

ep <- posterior_epred(anableps_model, newdata = newdata)

#posterior_epred draws parameters from the Expected Value of the Posterior Predictive Distribution

head(ep)

##          [,1]     [,2]     [,3]
## [1,] 9.064032 4.936691 69.11069
## [2,] 9.826652 5.175787 66.54308
## [3,] 9.258650 6.663620 64.55046
## [4,] 8.017182 4.927417 65.10196
## [5,] 9.569740 4.556728 72.15302
## [6,] 9.994754 6.330997 71.73844

posterior_epred() converts parameter uncertainty into predicted fish abundance for each treatment.

Rows = posterior draws
Columns = treatments

PLOTS

ep_df <- as.data.frame(ep)
colnames(ep_df) <- c("Intact", "Degraded", "Active Restoration")

ep_df |>
  pivot_longer(everything(), names_to = "Treatment", values_to = "lambda") |>
  ggplot(aes(lambda, fill = Treatment)) +
  geom_density() +
  labs(title = "Posterior Predicted Fish Abundance",x = "Expected Fish Count",y = "Density") +
  scale_fill_manual(values=c("#E665E6", "#78A900", "#00B9BE"))+
  theme_minimal()

# Posterior expected predictions on response scale
# (epred = expected mean, not noisy simulated observations)
ep_draws <- add_epred_draws(anableps_model, newdata = newdata, ndraws = 1000)

ggplot() +
  geom_jitter(data = anableps,  aes(Treatment, TotalCatch, color=Treatment),width = 0.1) +
  stat_pointinterval(data= ep_draws, aes(Treatment, .epred), point_interval = mean_qi, .width = 0.89) +
  labs(title = "Fish abundance by treatment", subtitle = "Observed data and posterior mean of the expected abundance (black dot)",x = "", y = "Expected fish abundance") +
  scale_color_manual(values=c("#00B9BE", "#E665E6", "#78A900"))+
  theme_minimal()

6. Ecological Inference

Difference Between Treatments

rest_vs_deg <- ep[,3] - ep[,2] #take column 3 from ep data and subtract column 2 from ep data from each row

mean(rest_vs_deg)

## [1] 63.03343

PLOT

diff <- data.frame(rest_vs_deg)


data.frame(diff = rest_vs_deg) |>
  ggplot(aes(diff)) +
  geom_density(fill="#F0726A") +
  geom_vline(xintercept=0, linetype="dashed") +
  labs(title="Posterior distribution of difference",x="Restored − Degraded (fish)", y="Density") +
  theme_minimal()

Across the posterior distribution, restored sites are expected to have about 63 more fish than degraded sites, on average.

quantile(rest_vs_deg, c(.055, .945))

##     5.5%    94.5% 
## 58.53613 67.48714

With 89% (McElreath’s fav number) of the credible interval laying between 58-67 fish.

mean(rest_vs_deg > 0)

## [1] 1

#what if you want to know the difference between degraded and control (aka Intact)? 
#deg_vs_cont <- ep[,2]- ep[,1] these columns follow the setup made earlier: newdata <- data.frame(Treatment = c("Intact"[,1], "Degraded[,2]", "Active Restoration[,3]"))

Probability that restoration increases fish abundance \(\approx\) 100%. The direction is positive.

Interpretation

Mean difference in expected fish abundance
Credible interval
Probability restoration improves habitat

Multiplicative Effect (Ratio)

rest_vs_deg_ratio <- ep[,3] / ep[,2]

mean(rest_vs_deg_ratio)

## [1] 13.06362

quantile(rest_vs_deg_ratio, c(.055, .945))

##     5.5%    94.5% 
## 10.18799 16.60354

mean(rest_vs_deg_ratio > 1)

## [1] 1

Interpretation

This gives the expected proportional change in fish abundance between restoration and degraded habitats.

- BINOMIAL

D A T A

halosphaera.csv. These data can be downloaded by clicking the following link and were collected via phytoplankton tows of Bellingham Bay, WA during the winter months of 2024. More on this unfinished project can be found here.

J. Bell

C H E C K L I S T

Load Data
Define Variables
Set Priors
Fit Bayesian Model: brm()
Posterior Predictions: posterior_epred()
Ecological Inference

Q U E S T I O N
??

1. Load Data

halosphaera <- read.csv("halosphaera.csv")

GAMMA

D A T A

pondplants.csv. These data can be downloaded by clicking the link and is called Figure 17.8-2 Plant species and productivity on the Analysis of Biological Data site. The scientific paper for these data can be found here.

MPCA Photos/Flickr

C H E C K L I S T

Load Data
Define Variables
Set Priors
Fit Bayesian Model: brm()
Posterior Predictions: posterior_epred()
Ecological Inference

Q U E S T I O N
??

1. Load Data

pondplants <- read.csv("pondplants.csv")

🗺️ ATLAS OF THE GLM

DECISIONS

GAUSSIAN

1. Load Data

BERNOULLI

1. Load Data

BINOMIAL

1. Load Data

POISSON 🐠

1. Load Data

2. Define Variables

3. Set Priors (plausible worlds)

4. Fit Bayesian Model (updating step)

5. Posterior Predictions (epred)

6. Ecological Inference

- BINOMIAL

1. Load Data

GAMMA

1. Load Data

5. Posterior Predictions (`epred`)