This worksheet was created by Dr. Trevor Caughlin for EEB622 in Spring 2026.
Categorical predictors (y) are really common in environmental datasets. These explanatory variables represent group membership (categories) rather than numbers. They classify observations into discrete levels such as habitat type (shrubland, forest, water) or treatment (burned, unburned) or species identity.
Categorical data can be nominal, meaning there is no ranking system or rhyme to their order or ordinal, meaning the order matters like the grading system (A, B, C, D, F). There is a good tutorial on ordinal predictors here.
When we add a nominal variable to a model in R, it automatically converts it into 0s and 1s and adds it to a matrix. This matrix of 0s and 1s is called a design matrix.
#let's create 9 observations that will be used as nominal categorical predictor variables
variable <- as.factor(rep(c("Grassland", "Shrubland", "Forest"), 3))
Now let’s see how R turns the variable above
containing nominal categorical data into a design matrix.
model_designmatrix <- model.matrix(~variable)
head(model_designmatrix)
## (Intercept) variableGrassland variableShrubland
## 1 1 1 0
## 2 1 0 1
## 3 1 0 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 0
This code is choosing a level to be the baseline (in this case
Forests because that is alphabetically first) and using it to compare
the other levels to it. Fprest became a reference level and is converted
to all 0s. Because regression models cannot include all 3 variables,
because that would cause perfect multicollinearity it
drops one of the categories. And forest is represented only when both
Grassland and Shrubland are 0.
Suppose the following model:
\[ y = { \color{#CA75FD}{\beta_0} + \color{#78A900}{\beta_1Grassland} + }{ \color{#00B9BE}{\beta_2 Shrubland} } \]
Habitat | Predicted Value
Forest | \(\beta_0\)
Grassland | \(\beta_0 + \beta_1\)
Shrubland | \(\beta_0+\beta_2\)
So
\(\beta_0\) = mean of Forest
\(\beta_1\) = difference: Grassland − Forest
\(\beta_2\) = difference: Shrubland − Forest
That’s what it means when it says that forest became the anchor the shrubland and grasslands were being compared to.
The model is asking:
How different are Grassland and Shrubland from Forest?
You can change the baseline comparison if you would like:
variable_changebaseline <- relevel(variable, ref="Grassland")
head(model.matrix(~variable_changebaseline))
## (Intercept) variable_changebaselineForest variable_changebaselineShrubland
## 1 1 0 0
## 2 1 0 1
## 3 1 1 0
## 4 1 0 0
## 5 1 0 1
## 6 1 1 0
Intercept <- 5
Grassland.effect <- 10
Shrubland.effect <- 5
sample.size <- nrow(model_designmatrix)
bodyweight_mean <- Intercept+Grassland.effect*model_designmatrix[,2]+Shrubland.effect*model_designmatrix[,3]
bodyweight_sampled=rnorm(n=sample.size,mean= bodyweight_mean, sd=3)
library(ggplot2)
data <- data.frame(variable, bodyweight_sampled)
ggplot(data, aes(variable, bodyweight_sampled, fill=variable))+
geom_boxplot()+
scale_fill_manual(values=c("#CA75FD","#78A900", "#00B9BE"))+
theme(legend.position = "none")+
labs(y="Bird Body Weight", x="Nominal Categories")
The important code here is:
bodyweight_mean <- Intercept+Grassland.effect*model_designmatrix[,2]+Shrubland.effect*model_designmatrix[,3]
This code is saying that we are going to multiply the effect of being a particular habitat (which we made up since this is simulated data) by either one or zero. If the effect is multiplied by a 1, that row of dataset represents that habitat, and the other habitats are multiplied by zero. Note that we do not have a term in this code for forest. That is because we typically need one level to serve as a baseline. Other levels are then compared to this baseline.
library(brms)
bird_data<-data.frame(weight=bodyweight_sampled, habitat=variable)
bird_model<-brm(weight~habitat,data=bird_data, silent=2, refresh=0)
## Running "C:/PROGRA~1/R/R-45~1.2/bin/x64/Rcmd.exe" SHLIB foo.c
## using C compiler: 'gcc.exe (GCC) 14.3.0'
## gcc -I"C:/PROGRA~1/R/R-45~1.2/include" -DNDEBUG -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/Rcpp/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/unsupported" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/BH/include" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/src/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppParallel/include/" -DRCPP_PARALLEL_USE_TBB=1 -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DUSE_STANC3 -DSTRICT_R_HEADERS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -D_HAS_AUTO_PTR_ETC=0 -include "C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp" -std=c++1y -I"C:/rtools45/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -std=gnu2x -mfpmath=sse -msse2 -mstackrealign -c foo.c -o foo.o
## cc1.exe: warning: command-line option '-std=c++14' is valid for C++/ObjC++ but not for C
## In file included from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Core:19,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Dense:1,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
## from <command-line>:
## C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
## 679 | #include <cmath>
## | ^~~~~~~
## compilation terminated.
## make: *** [C:/PROGRA~1/R/R-45~1.2/etc/x64/Makeconf:289: foo.o] Error 1
summary(bird_model)
## Family: gaussian
## Links: mu = identity
## Formula: weight ~ habitat
## Data: bird_data (Number of observations: 9)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 4.53 1.50 1.58 7.57 1.00 2765 2485
## habitatGrassland 6.42 2.20 1.86 10.78 1.00 2809 2487
## habitatShrubland 5.15 2.23 0.61 9.48 1.00 2831 2364
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 2.63 0.78 1.55 4.62 1.00 2431 2671
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
looks like our model did a pretty good job of recapturing our parameters! (or at least, indicating that grassland is better than shrubland, and both habitats are better than forest for bird weight).
We have used 3 link functions in class and each depends on the
distribution family (for response variable). The link
function connects the distribution’s mean (\(\mu\)) to the linear predictor which
enables a generalized linear model. The brm function
automatically uses these depending on the family you choose. If you want
to transform the linear predictor (\(x\beta\)) back into the scale of the
original data, you take the inverse of the
link. This conversion allows you to interpret results more
easily.
logit | inverse logit: for things between 0 and 1, binomial and beta (continuous for variables between 0 & 1).
log | inverse exp: for probability distribution functions that are greater than 0, poisson, negative binomial, and gamma (continuous version of poisson).
identity | inverse identity (linear regression with a normal distribution), any real number.
Poisson regression with categorical predictor:
Here we have the predictor variables (categorical)
zonetype and we have the response variables (count)
eggs. Because we have the response as count data, we will
use a Poisson distribution.
Predictor zonetype
Response eggs
\(Eggs_i{\sim}Poisson(\lambda_i)\) Here each nest has its own expected mean.
and
\(log(\lambda) = a + b * Commercial\)
and then the dummy variable becomes:
\(log(\lambda) = a + b * Commercial_i\)
Ag sites
Commercial = 0
\(\lambda_{Ag} = exp(a)\)
Com sites
Commercial = 1
\(\lambda_{Com} = exp(a+b)\)
and btw, b is
\(b=log(\frac{\lambda_{Com}}{\lambda{Ag}})\)
This is why we take the exp of \(b\), Because the Poisson GLM uses a log
link, coefficients are on the log scale. Taking exp(b) converts the
coefficient into a multiplicative change in expected
counts.
eggs <- c(3, 2, 1, 1, 0, 3, 2) #count data RESPONSE
zonetype <- c("Com", "Ag", "Ag", "Ag", "Com", "Ag", "Com") #categories PREDICTOR
dummyvariable_commercial <- ifelse(zonetype == "Com", 1, 0) #this is the dummy variable and it will do this automatically in the brm function for these data. You can see exactly what the brm function will do to these data if you peer into the design.matrix function.
data <- data.frame(eggs, zonetype)
design.matrix <- model.matrix(~zonetype)
head(design.matrix)
## (Intercept) zonetypeCom
## 1 1 1
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 1
## 6 1 0
Above we have 1 set to Commercial and 0
set to Agriculture for our zonetype data.
What is the average number of eggs in nests in the commercial areas?
\(exp(a+b*(Commercial))\rightarrow exp(a+b*1)\)
What is the average number of eggs in nests in the agricultural areas?
\(exp(a+b*(Agricultural))\rightarrow exp(a+b*0)\)
The dummy variable is essentially turning the effect of commercial zoning on or off.
What is the effect size of commercial zoning?
egg_model <- brm(eggs~zonetype, data=data, family="poisson", refresh=0)
## Running "C:/PROGRA~1/R/R-45~1.2/bin/x64/Rcmd.exe" SHLIB foo.c
## using C compiler: 'gcc.exe (GCC) 14.3.0'
## gcc -I"C:/PROGRA~1/R/R-45~1.2/include" -DNDEBUG -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/Rcpp/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/unsupported" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/BH/include" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/src/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppParallel/include/" -DRCPP_PARALLEL_USE_TBB=1 -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DUSE_STANC3 -DSTRICT_R_HEADERS -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION -D_HAS_AUTO_PTR_ETC=0 -include "C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp" -std=c++1y -I"C:/rtools45/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -std=gnu2x -mfpmath=sse -msse2 -mstackrealign -c foo.c -o foo.o
## cc1.exe: warning: command-line option '-std=c++14' is valid for C++/ObjC++ but not for C
## In file included from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Core:19,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Dense:1,
## from C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
## from <command-line>:
## C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
## 679 | #include <cmath>
## | ^~~~~~~
## compilation terminated.
## make: *** [C:/PROGRA~1/R/R-45~1.2/etc/x64/Makeconf:289: foo.o] Error 1
egg_modeldf <- as.data.frame(egg_model)
mean(exp(egg_modeldf$b_Intercept+egg_modeldf$b_zonetypeCom)-exp(egg_modeldf$b_Intercept))
## [1] -0.08802431
meanintercept <- mean(egg_modeldf$b_Intercept)
meanslope <- mean(egg_modeldf$b_zonetypeCom)
meanintercept
## [1] 0.5026682
meanslope
## [1] -0.07695086
#now we can solve the Agricultural mean eggs:
print(c("Ag Zones mean:", exp(meanintercept))) #approximately 2 eggs in the ag zones
## [1] "Ag Zones mean:" "1.65312634301219"
print(c("Com Zones mean:", exp(meanintercept+meanslope))) #approximately 1.5 egg in the commercial zones
## [1] "Com Zones mean:" "1.53068812097647"
##EFFECT SIZE##
#exp(b)=
print(c("Effect Size:", exp(meanslope)))
## [1] "Effect Size:" "0.925935351188812"
What is the average number of eggs in nests in the commercial areas?
On average there are ~ 1.49 eggs found in each nest in commercial areas.
What is the average number of eggs in nests in the agricultural areas?
On average there are ~ 1.64 more eggs found in each nest in agricultural areas.
What is the effect size of commercial zoning?
exp(meanslope) tells us that Commercial areas have 0.91
times as many eggs as Agricultural areas. That is about 9% less eggs are
found (\(1-0.91=0.09\), \(9%\)) per nest in Commercial areas.