Lecture11_categoricalresponsevariables.knit

Categorical variables

This worksheet was created by Dr. Trevor Caughlin for EEB622 in Spring 2026.

Categorical predictors (y) are really common in environmental datasets. These explanatory variables represent group membership (categories) rather than numbers. They classify observations into discrete levels such as habitat type (shrubland, forest, water) or treatment (burned, unburned) or species identity.

Categorical data can be nominal, meaning there is no ranking system or rhyme to their order or ordinal, meaning the order matters like the grading system (A, B, C, D, F). There is a good tutorial on ordinal predictors here.

Design Matrix

When we add a nominal variable to a model in R, it automatically converts it into 0s and 1s and adds it to a matrix. This matrix of 0s and 1s is called a design matrix.

#let's create 9 observations that will be used as nominal categorical predictor variables

variable <- as.factor(rep(c("Grassland", "Shrubland", "Forest"), 3))

Now let’s see how R turns the variable above containing nominal categorical data into a design matrix.

model_designmatrix <- model.matrix(~variable)

head(model_designmatrix)

##   (Intercept) variableGrassland variableShrubland
## 1           1                 1                 0
## 2           1                 0                 1
## 3           1                 0                 0
## 4           1                 1                 0
## 5           1                 0                 1
## 6           1                 0                 0

This code is choosing a level to be the baseline (in this case Forests because that is alphabetically first) and using it to compare the other levels to it. Fprest became a reference level and is converted to all 0s. Because regression models cannot include all 3 variables, because that would cause perfect multicollinearity it drops one of the categories. And forest is represented only when both Grassland and Shrubland are 0.

Suppose the following model:

\[ y = { \color{#CA75FD}{\beta_0} + \color{#78A900}{\beta_1Grassland} + }{ \color{#00B9BE}{\beta_2 Shrubland} } \]

Habitat | Predicted Value

Forest | \(\beta_0\)

Grassland | \(\beta_0 + \beta_1\)

Shrubland | \(\beta_0+\beta_2\)

\(\beta_0\) = mean of Forest
\(\beta_1\) = difference: Grassland − Forest
\(\beta_2\) = difference: Shrubland − Forest

That’s what it means when it says that forest became the anchor the shrubland and grasslands were being compared to.

The model is asking:

How different are Grassland and Shrubland from Forest?

You can change the baseline comparison if you would like:

variable_changebaseline <- relevel(variable, ref="Grassland")

head(model.matrix(~variable_changebaseline))

##   (Intercept) variable_changebaselineForest variable_changebaselineShrubland
## 1           1                             0                                0
## 2           1                             0                                1
## 3           1                             1                                0
## 4           1                             0                                0
## 5           1                             0                                1
## 6           1                             1                                0

Habitats, 🦤 Body Mass

Intercept <- 5
Grassland.effect <- 10
Shrubland.effect <- 5

sample.size <- nrow(model_designmatrix)

bodyweight_mean <- Intercept+Grassland.effect*model_designmatrix[,2]+Shrubland.effect*model_designmatrix[,3]

bodyweight_sampled=rnorm(n=sample.size,mean= bodyweight_mean, sd=3)

library(ggplot2)

data <- data.frame(variable, bodyweight_sampled)

ggplot(data, aes(variable, bodyweight_sampled, fill=variable))+
  geom_boxplot()+
  scale_fill_manual(values=c("#CA75FD","#78A900", "#00B9BE"))+
  theme(legend.position = "none")+
  labs(y="Bird Body Weight", x="Nominal Categories")

The important code here is:

bodyweight_mean <- Intercept+Grassland.effect*model_designmatrix[,2]+Shrubland.effect*model_designmatrix[,3]

This code is saying that we are going to multiply the effect of being a particular habitat (which we made up since this is simulated data) by either one or zero. If the effect is multiplied by a 1, that row of dataset represents that habitat, and the other habitats are multiplied by zero. Note that we do not have a term in this code for forest. That is because we typically need one level to serve as a baseline. Other levels are then compared to this baseline.

Make a model

library(brms)

bird_data<-data.frame(weight=bodyweight_sampled, habitat=variable)

bird_model<-brm(weight~habitat,data=bird_data, silent=2, refresh=0)

## Running "C:/PROGRA~1/R/R-45~1.2/bin/x64/Rcmd.exe" SHLIB foo.c
## using C compiler: 'gcc.exe (GCC) 14.3.0'
## gcc  -I"C:/PROGRA~1/R/R-45~1.2/include" -DNDEBUG   -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/Rcpp/include/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/unsupported"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/BH/include" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/src/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppParallel/include/" -DRCPP_PARALLEL_USE_TBB=1 -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -D_HAS_AUTO_PTR_ETC=0  -include "C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp"  -std=c++1y    -I"C:/rtools45/x86_64-w64-mingw32.static.posix/include"      -O2 -Wall -std=gnu2x  -mfpmath=sse -msse2 -mstackrealign   -c foo.c -o foo.o
## cc1.exe: warning: command-line option '-std=c++14' is valid for C++/ObjC++ but not for C
## In file included from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Core:19,
##                  from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Dense:1,
##                  from C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
##                  from <command-line>:
## C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
##   679 | #include <cmath>
##       |          ^~~~~~~
## compilation terminated.
## make: *** [C:/PROGRA~1/R/R-45~1.2/etc/x64/Makeconf:289: foo.o] Error 1

summary(bird_model)

##  Family: gaussian 
##   Links: mu = identity 
## Formula: weight ~ habitat 
##    Data: bird_data (Number of observations: 9) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Regression Coefficients:
##                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept            4.53      1.50     1.58     7.57 1.00     2765     2485
## habitatGrassland     6.42      2.20     1.86    10.78 1.00     2809     2487
## habitatShrubland     5.15      2.23     0.61     9.48 1.00     2831     2364
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     2.63      0.78     1.55     4.62 1.00     2431     2671
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

looks like our model did a pretty good job of recapturing our parameters! (or at least, indicating that grassland is better than shrubland, and both habitats are better than forest for bird weight).

Class notes

We have used 3 link functions in class and each depends on the distribution family (for response variable). The link function connects the distribution’s mean (\(\mu\)) to the linear predictor which enables a generalized linear model. The brm function automatically uses these depending on the family you choose. If you want to transform the linear predictor (\(x\beta\)) back into the scale of the original data, you take the inverse of the link. This conversion allows you to interpret results more easily.

logit | inverse logit: for things between 0 and 1, binomial and beta (continuous for variables between 0 & 1).
log | inverse exp: for probability distribution functions that are greater than 0, poisson, negative binomial, and gamma (continuous version of poisson).
identity | inverse identity (linear regression with a normal distribution), any real number.

Eggs & Zone Type

Poisson regression with categorical predictor:

Here we have the predictor variables (categorical) zonetype and we have the response variables (count) eggs. Because we have the response as count data, we will use a Poisson distribution.

Predictor zonetype

Response eggs

\(Eggs_i{\sim}Poisson(\lambda_i)\) Here each nest has its own expected mean.

and

\(log(\lambda) = a + b * Commercial\)

and then the dummy variable becomes:

\(log(\lambda) = a + b * Commercial_i\)

Ag sites

Commercial = 0

\(\lambda_{Ag} = exp(a)\)

Com sites

Commercial = 1

\(\lambda_{Com} = exp(a+b)\)

and btw, b is

\(b=log(\frac{\lambda_{Com}}{\lambda{Ag}})\)

This is why we take the exp of \(b\), Because the Poisson GLM uses a log link, coefficients are on the log scale. Taking exp(b) converts the coefficient into a multiplicative change in expected counts.

eggs <- c(3, 2, 1, 1, 0, 3, 2) #count data RESPONSE
zonetype <- c("Com", "Ag", "Ag", "Ag", "Com", "Ag", "Com") #categories PREDICTOR

dummyvariable_commercial <- ifelse(zonetype == "Com", 1, 0) #this is the dummy variable and it will do this automatically in the brm function for these data. You can see exactly what the brm function will do to these data if you peer into the design.matrix function. 

data <- data.frame(eggs, zonetype)

design.matrix <- model.matrix(~zonetype)
head(design.matrix)

##   (Intercept) zonetypeCom
## 1           1           1
## 2           1           0
## 3           1           0
## 4           1           0
## 5           1           1
## 6           1           0

Above we have 1 set to Commercial and 0 set to Agriculture for our zonetype data.

What is the average number of eggs in nests in the commercial areas?

\(exp(a+b*(Commercial))\rightarrow exp(a+b*1)\)

What is the average number of eggs in nests in the agricultural areas?

\(exp(a+b*(Agricultural))\rightarrow exp(a+b*0)\)

The dummy variable is essentially turning the effect of commercial zoning on or off.

What is the effect size of commercial zoning?

Model it

egg_model <- brm(eggs~zonetype, data=data, family="poisson", refresh=0)

## Running "C:/PROGRA~1/R/R-45~1.2/bin/x64/Rcmd.exe" SHLIB foo.c
## using C compiler: 'gcc.exe (GCC) 14.3.0'
## gcc  -I"C:/PROGRA~1/R/R-45~1.2/include" -DNDEBUG   -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/Rcpp/include/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/unsupported"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/BH/include" -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/src/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/"  -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppParallel/include/" -DRCPP_PARALLEL_USE_TBB=1 -I"C:/Users/jessi/AppData/Local/R/win-library/4.5/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -D_HAS_AUTO_PTR_ETC=0  -include "C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp"  -std=c++1y    -I"C:/rtools45/x86_64-w64-mingw32.static.posix/include"      -O2 -Wall -std=gnu2x  -mfpmath=sse -msse2 -mstackrealign   -c foo.c -o foo.o
## cc1.exe: warning: command-line option '-std=c++14' is valid for C++/ObjC++ but not for C
## In file included from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Core:19,
##                  from C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/Dense:1,
##                  from C:/Users/jessi/AppData/Local/R/win-library/4.5/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
##                  from <command-line>:
## C:/Users/jessi/AppData/Local/R/win-library/4.5/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
##   679 | #include <cmath>
##       |          ^~~~~~~
## compilation terminated.
## make: *** [C:/PROGRA~1/R/R-45~1.2/etc/x64/Makeconf:289: foo.o] Error 1

egg_modeldf <- as.data.frame(egg_model)

mean(exp(egg_modeldf$b_Intercept+egg_modeldf$b_zonetypeCom)-exp(egg_modeldf$b_Intercept))

## [1] -0.08802431

meanintercept <- mean(egg_modeldf$b_Intercept)

meanslope <- mean(egg_modeldf$b_zonetypeCom)

meanintercept

## [1] 0.5026682

meanslope

## [1] -0.07695086

#now we can solve the Agricultural mean eggs: 

print(c("Ag Zones mean:", exp(meanintercept))) #approximately 2 eggs in the ag zones

## [1] "Ag Zones mean:"   "1.65312634301219"

print(c("Com Zones mean:", exp(meanintercept+meanslope))) #approximately 1.5 egg in the commercial zones

## [1] "Com Zones mean:"  "1.53068812097647"

##EFFECT SIZE##
#exp(b)= 

print(c("Effect Size:", exp(meanslope)))

## [1] "Effect Size:"      "0.925935351188812"

Questions revisited

What is the average number of eggs in nests in the commercial areas?

On average there are ~ 1.49 eggs found in each nest in commercial areas.

What is the average number of eggs in nests in the agricultural areas?

On average there are ~ 1.64 more eggs found in each nest in agricultural areas.

What is the effect size of commercial zoning?

exp(meanslope) tells us that Commercial areas have 0.91 times as many eggs as Agricultural areas. That is about 9% less eggs are found (\(1-0.91=0.09\), \(9%\)) per nest in Commercial areas.