categorical predictor variables

Author

Dr. T

Categorical predictor variables

Categorical predictor variables are very common in environmental datasets. These explanatory variables represent group membership rather than numeric values: they classify observations into discrete levels such as habitat type (shrubland, forest, water), treatment (burned vs. unburned), or species identity. We distinguish between unordered (nominal) categorical predictors and ordered (ordinal) categorical predictors. Nominal predictors have levels with no inherent ranking, and they are the focus of this document. In contrast, ordinal predictors have ranked levels (for example, bad, medium, and good). For a good tutorial on ordinal predictors, see: https://cran.r-project.org/web/packages/brms/vignettes/brms_monotonic.html
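
As a quick illustration (a small sketch that is not part of the original code, using made-up names), here is how the two kinds of categorical variables are typically represented in R:

# A nominal factor: levels have no inherent ranking
habitat_type <- factor(c("Forest", "Shrubland", "Grassland"))

# An ordinal (ordered) factor: levels are ranked from worst to best
site_quality <- factor(c("bad", "medium", "good"),
                       levels = c("bad", "medium", "good"),
                       ordered = TRUE)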

The design matrix

When we include a nominal variable in a statistical model in R, the model converts it “under the hood” into a matrix of 1s and 0s called a design matrix. Below is a typical categorical predictor variable.

variable <- as.factor(rep(c("Grassland", "Shrubland", "Forest"), 3))
print(variable)
[1] Grassland Shrubland Forest    Grassland Shrubland Forest    Grassland
[8] Shrubland Forest   
Levels: Forest Grassland Shrubland

Let’s look at the design matrix for this variable:

mm <- model.matrix(~ variable)
head(mm)
  (Intercept) variableGrassland variableShrubland
1           1                 1                 0
2           1                 0                 1
3           1                 0                 0
4           1                 1                 0
5           1                 0                 1
6           1                 0                 0
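
Notice which level does not get its own column. A quick check (a small aside, not part of the original code):

levels(variable)   # "Forest" "Grassland" "Shrubland"
colnames(mm)       # "(Intercept)" "variableGrassland" "variableShrubland" -- no Forest column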

The model.matrix() call chooses one level to serve as the baseline (the alphabetically first level, which here is Forest). The other levels are then compared to this baseline. Let’s look at how this plays out with simulated data. In this case, the data represent the body mass of a bird species in different habitats.

Intercept <- 5
Grassland.effect <- 10
Shrubland.effect <- 5
sample.size <- nrow(mm)
# Mean body weight for each row: the baseline plus each habitat effect
# multiplied by the 0/1 columns of the design matrix
Body_weight_mean <- Intercept + Grassland.effect * mm[, 2] + Shrubland.effect * mm[, 3]
# Add sampling noise around each mean
Body_weight_sampled <- rnorm(n = sample.size, mean = Body_weight_mean, sd = 3)
plot(Body_weight_sampled ~ variable)
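
A quick way to check the simulation (a small sketch I am adding, not part of the original code) is to compare the sampled group means with the means we simulated from:

# Sample mean body weight by habitat; with only 3 birds per habitat and sd = 3,
# these will be roughly, but not exactly, the simulated means
tapply(Body_weight_sampled, variable, mean)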

There is a lot going on in the simulation code above! The most important line is:

Body_weight_mean <- Intercept + Grassland.effect * mm[, 2] + Shrubland.effect * mm[, 3]

This code multiplies the effect of being in a particular habitat (an effect we made up, since this is simulated data) by either a one or a zero. If the effect is multiplied by a 1, that row of the dataset represents that habitat, and the effects for the other habitats are multiplied by zero. Note that there is no term in this code for “Forest.” That is because we typically need one level to serve as a baseline, and the other levels are then compared to it. Here, the Intercept (5) is the mean body weight in forest, so the true habitat means are 5 for forest, 5 + 10 = 15 for grassland, and 5 + 5 = 10 for shrubland.
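
Another way to see the same calculation (a sketch added for illustration; betas is a name introduced here, not from the original code) is to multiply the whole design matrix by a vector of coefficients, with the baseline intercept first:

# Same group means via matrix multiplication of the design matrix
betas <- c(Intercept, Grassland.effect, Shrubland.effect)
as.vector(mm %*% betas)   # identical to Body_weight_mean

This is exactly what a linear model does internally: the design matrix turns group membership into arithmetic.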

To test our understanding, we will run a model in brms to see how well we can capture our parameters (that we made up, so we know exactly what they are):

library(brms)
Loading required package: Rcpp
Loading 'brms' package (version 2.23.0). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').

Attaching package: 'brms'
The following object is masked from 'package:stats':

    ar
bird_data <- data.frame(weight = Body_weight_sampled, habitat = variable)
bird_model <- brm(weight ~ habitat, data = bird_data)
Compiling Stan program...
Start sampling

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000122 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.22 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.056 seconds (Warm-up)
Chain 1:                0.045 seconds (Sampling)
Chain 1:                0.101 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.2e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.053 seconds (Warm-up)
Chain 2:                0.049 seconds (Sampling)
Chain 2:                0.102 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 9e-06 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.05 seconds (Warm-up)
Chain 3:                0.044 seconds (Sampling)
Chain 3:                0.094 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 1.1e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.068 seconds (Warm-up)
Chain 4:                0.048 seconds (Sampling)
Chain 4:                0.116 seconds (Total)
Chain 4: 
summary(bird_model)
 Family: gaussian 
  Links: mu = identity 
Formula: weight ~ habitat 
   Data: bird_data (Number of observations: 9) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            5.36      1.90     1.63     9.20 1.00     2467     1952
habitatGrassland    10.29      2.60     4.86    15.59 1.00     2501     2029
habitatShrubland     6.34      2.58     0.93    11.58 1.00     2514     2253

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     3.05      1.17     1.61     6.00 1.00     1701     2127

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Looks like our model did a pretty good job of recovering our parameters! The estimated Intercept (about 5.4) is close to the true forest mean of 5, the grassland effect (about 10.3) is close to the true value of 10, and the shrubland effect (about 6.3) is in the neighborhood of the true value of 5, although the credible intervals are wide with only 9 observations. At the very least, the model indicates that birds are heavier in grassland than in shrubland, and heavier in both of those habitats than in forest.
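
If you want the estimated mean weight in each habitat, rather than differences from the forest baseline, brms can compute those for you. A minimal sketch, assuming the fitted bird_model object from above:

# Posterior summaries of the regression coefficients (differences from Forest)
fixef(bird_model)

# Estimated mean weight in each habitat, with 95% credible intervals
conditional_effects(bird_model, effects = "habitat")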