Setup

This section contains the setup and the various utility functions used throughout.

Libraries used:

library(data.table)
library(ggplot2)
library(ggrepel)
library(scico)
library(cmdstanr)
# library(survival)
library("loo")
sl <- scico(palette = "lajolla", n = 9)

# Set the relative path prefix depending on whether the working directory is the rpubs folder.
tok <- unlist(tstrsplit(getwd(), "/"))
path_pref <- if (tok[length(tok)] == "rpubs") "." else ".."

Introduction

Assume we have some categorical outcome variable of interest. For example, a 7-point ordinal scale assessed on day 14, ranging from (1) not hospitalised with no limitations on activities to (7) death, where the complete set of categories is:

  1. Not hospitalised, no limitations on activities
  2. Not hospitalised, limitation on activities
  3. Hospitalised, not requiring supplemental oxygen
  4. Hospitalised, requiring supplemental oxygen
  5. Hospitalised, on non-invasive ventilation or high flow oxygen devices
  6. Hospitalised, on invasive mechanical ventilation or extracorporeal membrane oxygenation (ECMO)
  7. Death

Assume that, empirically, the probability of being in each category maps to a latent continuous score as follows.

p_tru <- c(0.03,0.28,0.34,0.155,0.048,0.04,0.107)
d1 <- data.table(cat = 1:6, cuts = qlogis(cumsum(p_tru[1:6])))
ggplot(d1, aes(x = cuts)) +
  geom_vline(aes(xintercept = cuts), lwd = 0.3, lty = 2) +
  geom_text_repel(aes(x = cuts, y = 0.1, label = paste0(cat)),
                   box.padding = 2.5,
                    max.overlaps = Inf,
                  nudge_x = 0.2,
                  ylim = c(0.1, Inf), col = 2,
                    size = 3, inherit.aes = F) +
  stat_function(fun = dlogis, n = 101)  +
  scale_x_continuous("log-odds", lim = c(-5, 5)) +
  scale_y_continuous("Density") +
  theme_bw()  
Figure 1: Location of cutpoints on the logistic density based on the probability of score at day 14

There are a variety of models that can be applied, but the cumulative model is probably the most common. It assumes that the observed category originates from some underlying latent continuous random variable. For example, a discrete happiness score from 1 to 10 might be assumed to map to an underlying continuous state of happiness. The latent variable is usually assumed to have a normal or logistic distribution (the logistic is used here).

The cumulative logit approach splits the probability density function into sections where the area enclosed by the boundaries corresponds to the probability of observing a particular score. We construct our model to try to estimate the cutpoints. Two things need to be kept in mind:

  1. there are only \(J - 1\) cutpoints and
  2. without some constraints the model may struggle to identify the parameters

The probability that an observation equals a particular category can be found with:

\[ \begin{equation} \begin{split} \mathsf{Pr}(Y = j) = F(\alpha_j) - F(\alpha_{j-1}) \end{split} \tag{1} \end{equation} \]

where \(F\) is the CDF of the distribution we have chosen to model the underlying latent variable and \(\alpha_j\) denotes the upper boundary (cutpoint) of the \(j^{th}\) category (with \(\alpha_0 = -\infty\) and \(\alpha_J = +\infty\) so that the probabilities sum to one).
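As a quick numerical check, the cutpoints in d1 can be converted back into category probabilities using exactly this relationship. A minimal sketch using the p_tru and d1 objects defined above (plogis is the logistic CDF):

F_alpha <- c(plogis(d1$cuts), 1)            # F(alpha_j), with F(alpha_J) = 1
p_cat <- F_alpha - c(0, plogis(d1$cuts))    # Pr(Y = j) = F(alpha_j) - F(alpha_{j-1})
round(p_cat, 3)
all.equal(p_cat, p_tru)                     # TRUE: the cutpoints reproduce p_tru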

We can introduce a linear predictor as follows:

\[ \begin{equation} \begin{split} \mathsf{Pr}(Y = j|\eta) = F(\alpha_j - \eta) - F(\alpha_{j-1} - \eta) \end{split} \tag{2} \end{equation} \]

and so the effect of \(\eta\) is to shift the cutpoints. Additionally, we have:

\[ \begin{equation} \begin{split} \mathsf{Pr}(Y \le j|\eta) &= \mathsf{Pr}(Y^* \le \alpha_j|\eta) \\ &= F(\alpha_j - \eta) \end{split} \tag{3} \end{equation} \]

where \(Y^*\) is the latent random variable.
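Equation (3) also suggests a direct way to simulate the outcome: draw the latent \(Y^*\) and cut it at the cutpoints. A minimal sketch, using the d1 cutpoints defined above and \(\eta = 0\):

set.seed(1)
ystar <- rlogis(1e5)                                    # latent Y* with eta = 0
y <- cut(ystar, c(-Inf, d1$cuts, Inf), labels = FALSE)  # assign categories via the cutpoints
round(prop.table(table(y)), 3)                          # approximately p_tru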

We assume the linear predictor is constant across the response categories. This is known as the proportional odds assumption.

To increase the probability of being in the lowest category, the upper boundary (cutpoint) for this category needs to increase. We usually like to have odds ratios less than one to imply that treatment reduces the odds of the worst condition. The easiest way to achieve this is to parameterise the model for the current scenario as

\[ \begin{equation} \begin{split} \mathsf{logit}(\mathsf{Pr}(y_i \le j)) = \alpha_j - \beta x_i \end{split} \tag{4} \end{equation} \]

where \(\beta\) represents a treatment effect (log-odds ratio), constant across all response categories. So, if \(\beta\) is negative (implying an OR < 1), the cutpoints are effectively increased and we are reducing the odds of higher scores (i.e. scores towards death).
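To see the proportional odds property numerically: under this parameterisation, the log cumulative odds ratio comparing treatment to control is the same at every split of the scale. A small check using the d1 cutpoints and a treatment effect of b = -0.5:

b <- -0.5
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))          # control probabilities
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))  # treated probabilities
# log odds ratio for Y > j (treatment vs control) at each of the 6 splits
sapply(1:6, function(j) {
  log((sum(p1[(j+1):7]) / sum(p1[1:j])) / (sum(p0[(j+1):7]) / sum(p0[1:j])))
})
# each equals b = -0.5, i.e. a single odds ratio of exp(-0.5) ~ 0.61 for the worse outcomes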

The proportional odds assumption isn’t always valid. For example, an intervention could be potentially useful for healthier patients, but harmful to patients that are already very sick.

The following shows a summary of simulated data when the proportional odds assumption holds.

N <- 1000
b <- -0.5
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))

y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)

d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))

d_tbl <- d2[, .(p_obs = round(.N/N, 2)), keyby = .(trt, y)]
d_tbl[, p_tru := as.numeric(c(p0 = p0, p1 = p1))]

d_tbl <- dcast(d_tbl, y ~ trt, value.var = list("p_obs", "p_tru"))

Initial cutpoints are defined along with a linear predictor that contains a single categorical variable for treatment status. The treatment effect was set at -0.5, which corresponds to an odds ratio of 0.61. Categorical data are generated based on the implied probabilities for each group, and the observed proportions for each treatment and category are shown alongside the true probabilities used to simulate the data. We see that the effect of treatment is to increase the probability of observing lower values of y and to lower the probability of observing higher values of y.

knitr::kable(d_tbl, digits = 2)
y p_obs_0 p_obs_1 p_tru_0 p_tru_1
1 0.02 0.04 0.03 0.05
2 0.30 0.40 0.28 0.38
3 0.32 0.31 0.34 0.33
4 0.15 0.12 0.16 0.12
5 0.04 0.04 0.05 0.03
6 0.04 0.02 0.04 0.03
7 0.12 0.08 0.11 0.07

And the following summary shows what happens when the proportional odds assumption does not hold.

N <- 1000
b <- c(-0.5, -0.3, -0.2, 0.2, 0.3, 0.4)
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))

y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)

d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))

d_tbl <- d2[, .(p_obs = round(.N/N, 2)), keyby = .(trt, y)]
d_tbl[, p_tru := as.numeric(c(p0 = p0, p1 = p1))]

d_tbl <- dcast(d_tbl, y ~ trt, value.var = list("p_obs", "p_tru"))

This time, there is a distinct treatment effect at each cutpoint. Here we see that the treatment results in a simultaneous increase in the probabilities of both the lowest and highest scores. To achieve this, the probabilities for the mid-categories have all been reduced.

knitr::kable(d_tbl, digits = 3)
y p_obs_0 p_obs_1 p_tru_0 p_tru_1
1 0.02 0.06 0.030 0.049
2 0.27 0.33 0.280 0.329
3 0.37 0.31 0.340 0.317
4 0.14 0.07 0.155 0.078
5 0.05 0.04 0.048 0.040
6 0.04 0.04 0.040 0.037
7 0.11 0.16 0.107 0.152

One option to evaluate whether there are category-specific effects is to fit separate logistic regression models to a set of collapsed comparisons. For example, you could look at the relative odds of having a score of 1 vs 2:7 under treatment, the relative odds of having a score of 1:2 vs 3:7 under treatment, and so on (a sketch follows below). However, this becomes burdensome when there are many categories.
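A minimal sketch of the collapsed-comparisons idea, applied to the non-proportional-odds data in d2 generated above. If proportional odds held, the estimated log-odds ratios would be roughly constant across the splits:

# one logistic regression per split: y > j vs y <= j
sapply(1:6, function(j) {
  coef(glm(I(y > j) ~ trt, data = d2, family = binomial))["trt"]
})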

Another option is to use a different model, known as the adjacent category model (ACM). This approach emphasises comparisons between adjacent categories and is motivated more by statistical convenience than by any real-world construct.

Under the ACM approach, we conceptualise a latent variable for each category, each with a single threshold \(\alpha_j\):

\[ \begin{equation} \begin{split} \mathsf{Pr}(Y = j | Y \in \{j, j+1\}, \eta) = F(\alpha_j - \eta) \end{split} \tag{5} \end{equation} \]
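Equivalently, the category probabilities can be built up by accumulating the adjacent-category log-odds and normalising. A minimal sketch of this mapping, mirroring what the generated quantities block of the adjacent category Stan model (shown later) computes, with b0 playing the role of the thresholds and b the category-specific effects; the parameter values passed in are hypothetical:

acm_probs <- function(b0, b = rep(0, length(b0))) {
  lp <- c(0, cumsum(b - b0))   # unnormalised log-probabilities; first category fixed at 0
  exp(lp) / sum(exp(lp))       # softmax over the ncat categories
}
# e.g. acm_probs(b0) gives control probabilities, acm_probs(b0, b) gives treated probabilities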

Implementation

A Stan implementation of the cumulative ordinal proportional odds model follows. It includes a single variable in the linear predictor for treatment status.

Note that the log_lik variable in the generated quantities block is used for Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO) for purposes of model checking and model comparison, see https://mc-stan.org/loo/articles/loo2-example.html.

mod1 <- cmdstanr::cmdstan_model(paste0(path_pref, "/stan/ordinal1.stan"))
mod1
## data {
##   int<lower=1> N;  // total number of observations
##   array[N] int y; 
##   vector[N] x; 
##   int<lower=2> ncat;  // number of categories 
##   int prior_only; 
## }
## parameters {
##   simplex[ncat] a0; 
##   real b;  
## }
## transformed parameters {
##   vector[ncat-1] cuts = logit(cumulative_sum(a0[1:(ncat-1)]));
## }
## model {
##   if (!prior_only) {
##     vector[N] eta = b * x;
##     target += ordered_logistic_lpmf(y | eta, cuts);
##   }
##   target += normal_lpdf(b | 0, 4);
##   target += dirichlet_lpdf(a0 | rep_vector(1, ncat));
## }
## generated quantities {
##   simplex[ncat] a1; 
##   vector[N] log_lik;
##   a1[1] = inv_logit(cuts[1] - b);
##   for(i in 2:(ncat-1)){
##     a1[i] = inv_logit(cuts[i] - b) - inv_logit(cuts[i-1] - b);
##   }
##   a1[ncat] = 1 - sum(a1[1:(ncat-1)]);
##   for(i in 1:N){
##     log_lik[i] = ordered_logistic_lpmf(y[i] | x[i]*b, cuts); 
##   } 
## }

First, fit the model to data generated so that proportional odds holds:

set.seed(5)
N <- 2500
b <- -0.5
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))

y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)

d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))
ld <- list(N = nrow(d2), y = as.integer(d2$y), 
           ncat = 7, x = d2$trt,
           prior_only = 0)

prop.table(table(d2$trt, d2$y))
##    
##          1      2      3      4      5      6      7
##   0 0.0160 0.1350 0.1686 0.0834 0.0250 0.0232 0.0488
##   1 0.0258 0.1860 0.1650 0.0588 0.0146 0.0162 0.0336
f1 <- mod1$sample(data = ld,
                  iter_warmup = 1000, iter_sampling = 2000,
                  parallel_chains = 3, chains = 3,
                  refresh = 0, show_exceptions = F)
## Running MCMC with 3 parallel chains...
## 
## Chain 1 finished in 10.1 seconds.
## Chain 2 finished in 10.4 seconds.
## Chain 3 finished in 10.6 seconds.
## 
## All 3 chains finished successfully.
## Mean chain execution time: 10.4 seconds.
## Total execution time: 10.8 seconds.

d_smpl1 <- data.table(f1$draws(variables = c("a0", "a1"), format = "matrix"))

d_smpl1 <- melt(d_smpl1, measure.vars = colnames(d_smpl1))
d_smpl1[, trt := as.integer(substr(variable, 2, 2))]
d_smpl1[, cat := as.integer(substr(variable, 4, 4))]

d_tbl <- d_smpl1[, .(p_mu = mean(value)), keyby = .(trt, cat)]
d_tbl[, p_obs := d2[, .(p_obs = .N/N), keyby = .(trt, y)]$p_obs]
setcolorder(d_tbl, c("trt", "cat", "p_obs", "p_mu"))

d_tbl <- dcast(d_tbl, cat ~ trt, value.var = list("p_obs", "p_mu"))

knitr::kable(d_tbl, digits = c(0, rep(3, 4)))
cat p_obs_0 p_obs_1 p_mu_0 p_mu_1
1 0.032 0.052 0.032 0.052
2 0.270 0.372 0.271 0.371
3 0.337 0.330 0.339 0.328
4 0.167 0.118 0.162 0.122
5 0.050 0.029 0.047 0.032
6 0.046 0.032 0.048 0.031
7 0.098 0.067 0.102 0.063

And now fit the same model to data where proportional odds is violated by construction.

set.seed(5)
N <- 2500
b <- c(-0.5, -0.3, -0.2, 0.2, 0.3, 0.4)
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))

y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)

d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))
ld <- list(N = nrow(d2), y = as.integer(d2$y), 
           ncat = 7, x = d2$trt,
           prior_only = 0)
f2 <- mod1$sample(data = ld,
                  iter_warmup = 1000, iter_sampling = 2000,
                  parallel_chains = 3, chains = 3,
                  refresh = 0, show_exceptions = F)
## Running MCMC with 3 parallel chains...
## 
## Chain 2 finished in 10.3 seconds.
## Chain 3 finished in 10.3 seconds.
## Chain 1 finished in 10.4 seconds.
## 
## All 3 chains finished successfully.
## Mean chain execution time: 10.3 seconds.
## Total execution time: 10.5 seconds.
d_smpl2 <- data.table(f2$draws(variables = c("a0", "a1"), format = "matrix"))

d_smpl2 <- melt(d_smpl2, measure.vars = colnames(d_smpl2))
d_smpl2[, trt := as.integer(substr(variable, 2, 2))]
d_smpl2[, cat := as.integer(substr(variable, 4, 4))]

d_tbl <- d_smpl2[, .(p_mu = mean(value)), keyby = .(trt, cat)]
d_tbl[, p_obs := d2[, .(p_obs = .N/N), keyby = .(trt, y)]$p_obs]
setcolorder(d_tbl, c("trt", "cat", "p_obs", "p_mu"))

d_tbl <- dcast(d_tbl, cat ~ trt, value.var = list("p_obs", "p_mu"))

knitr::kable(d_tbl, digits = c(0, rep(3, 4)))
cat p_obs_0 p_obs_1 p_mu_0 p_mu_1
1 0.032 0.045 0.035 0.042
2 0.270 0.324 0.281 0.313
3 0.337 0.324 0.331 0.331
4 0.167 0.082 0.129 0.119
5 0.050 0.040 0.047 0.043
6 0.046 0.042 0.047 0.041
7 0.098 0.144 0.130 0.112

The observed proportions suggest that the probabilities of categories 1 and 7 have both increased in the treatment group relative to the control group. However, the model has done a fairly poor job of representing the observed proportions in both treatment groups. In particular, the modelled probability of observing category 1 has increased in the treatment group while the modelled probability of observing category 7 has decreased, the opposite of what was observed for category 7; the single proportional odds parameter can only shift probability mass towards one end of the scale.

This setting can be modelled with an adjacent category model. A Stan implementation of the adjacent category ordinal model with category-specific treatment effects follows.

mod2 <- cmdstanr::cmdstan_model(paste0(path_pref, "/stan/ordinal2.stan"))
mod2
## data {
##   int<lower=1> N;  
##   array[N] int y;  
##   int<lower=2> ncat;  
##   vector[N] x; 
##   int prior_only;  
## }
## parameters {
##   vector[ncat-1] b0;
##   // Category specific treatment effects.
##   row_vector[ncat-1] b;  
## }
## model {
##   if (!prior_only) {
##     matrix[N, ncat-1] eta = x * b;
##     vector[ncat] lp; 
##     for (i in 1:N) {
##       // For explanation of likelihood, see: 
##       // https://en.wikipedia.org/wiki/Polytomous_Rasch_model#The_Polytomous_Rasch_Model
##       // Note the zero is chosen for convenience, see last line from the reference. 
##       lp = append_row(0, cumulative_sum(transpose(eta[i]) - b0));    
##       target += lp[y[i]] - log_sum_exp(lp);
##     }
##   }
##   target += normal_lpdf(b | 0, 4);
##   target += normal_lpdf(b0 | 0, 3);
## }
## generated quantities{  
##   simplex[ncat] a0;
##   simplex[ncat] a1;
##   vector[N] log_lik;
##   {
##     vector[ncat] lp0; 
##     vector[ncat] lp1; 
##     vector[ncat] lp; 
##     real lse0;
##     real lse1;
##     lp0 = append_row(0, cumulative_sum(-b0));
##     lp1 = append_row(0, cumulative_sum(transpose(b) - b0));
##     lse0 = log_sum_exp(lp0);
##     lse1 = log_sum_exp(lp1);
##     for(i in 1:ncat){
##       a0[i] = exp(lp0[i] - lse0);
##       a1[i] = exp(lp1[i] - lse1);
##     }
##     for(i in 1:N){
##       lp = append_row(0, cumulative_sum(transpose(x[i]*b) - b0));
##       log_lik[i] = lp[y[i]] - log_sum_exp(lp);
##     }
##   }
## }

It isn’t perfect, but the results track the observed data far more closely.

f3 <- mod2$sample(data = ld,
                  iter_warmup = 1000, iter_sampling = 2000,
                  parallel_chains = 3, chains = 3,
                  refresh = 0, show_exceptions = F)
## Running MCMC with 3 parallel chains...
## 
## Chain 3 finished in 209.9 seconds.
## Chain 1 finished in 213.0 seconds.
## Chain 2 finished in 225.3 seconds.
## 
## All 3 chains finished successfully.
## Mean chain execution time: 216.1 seconds.
## Total execution time: 225.4 seconds.
d_smpl3 <- data.table(f3$draws(variables = c("a0", "a1"), format = "matrix"))

d_smpl3 <- melt(d_smpl3, measure.vars = colnames(d_smpl3))
d_smpl3[, trt := as.integer(substr(variable, 2, 2))]
d_smpl3[, cat := as.integer(substr(variable, 4, 4))]

d_tbl <- d_smpl3[, .(p_mu = mean(value)), keyby = .(trt, cat)]
d_tbl[, p_obs := d2[, .(p_obs = .N/N), keyby = .(trt, y)]$p_obs]
setcolorder(d_tbl, c("trt", "cat", "p_obs", "p_mu"))

d_tbl <- dcast(d_tbl, cat ~ trt, value.var = list("p_obs", "p_mu"))

knitr::kable(d_tbl, digits = c(0, rep(3, 4)))
cat p_obs_0 p_obs_1 p_mu_0 p_mu_1
1 0.032 0.045 0.032 0.045
2 0.270 0.324 0.270 0.323
3 0.337 0.324 0.337 0.324
4 0.167 0.082 0.167 0.082
5 0.050 0.040 0.050 0.040
6 0.046 0.042 0.046 0.042
7 0.098 0.144 0.098 0.144

The parameter estimates for the cutpoints and the category specific treatment effects are shown below:

f3$print(variables = c("b0"))
##  variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
##     b0[1] -2.13  -2.13 0.12 0.11 -2.32 -1.94 1.00     4013     4230
##     b0[2] -0.22  -0.22 0.05 0.05 -0.31 -0.14 1.00     3645     4020
##     b0[3]  0.70   0.70 0.06 0.06  0.61  0.80 1.00     3612     3511
##     b0[4]  1.21   1.20 0.10 0.10  1.04  1.38 1.00     3172     3855
##     b0[5]  0.08   0.08 0.13 0.13 -0.13  0.29 1.00     2921     3912
##     b0[6] -0.75  -0.75 0.11 0.11 -0.93 -0.57 1.00     3334     4532
f3$print(variables = c("b"))
##  variable  mean median   sd  mad    q5   q95 rhat ess_bulk ess_tail
##      b[1] -0.15  -0.15 0.15 0.15 -0.41  0.10 1.00     4290     4383
##      b[2] -0.22  -0.22 0.07 0.07 -0.34 -0.11 1.00     3660     3949
##      b[3] -0.68  -0.67 0.10 0.10 -0.84 -0.52 1.00     3441     3785
##      b[4]  0.49   0.49 0.16 0.16  0.24  0.75 1.00     3043     4072
##      b[5]  0.12   0.12 0.19 0.19 -0.20  0.43 1.00     3049     3904
##      b[6]  0.50   0.50 0.15 0.15  0.24  0.75 1.00     3696     4137

which can be compared to the values used to simulate the data: -0.5, -0.3, -0.2, 0.2, 0.3, 0.4. Note that an exact match is not expected, since the data were generated by shifting the cumulative cutpoints whereas the ACM effects are adjacent-category log-odds ratios.

Leave-one-out cross validation can be used to compare the cumulative ordinal and adjacent category models when fit to data where the proportional odds assumption is violated. The first column shows the difference in expected log predictive density relative to the model with the largest ELPD. The results suggest that the ACM (model 2) has better fit.

loo1 <- f2$loo()
loo2 <- f3$loo()
# The adjacent category models looks like it fits better
loo::loo_compare(list(`Cumulative logit` = loo1, `Adjacent category` = loo2))
##                   elpd_diff se_diff
## Adjacent category   0.0       0.0  
## Cumulative logit  -48.9      10.3

Conclusions

Commonly used models for ordinal response variables include the cumulative logit and the adjacent categories model. The latter is useful when category-specific effects are of interest. LOO-CV can be used to compare the fit of models when the proportional odds assumption is suspected not to hold. There are several other models applicable to ordinal data, among them the sequential model, in which a response category can only be reached after passing through all lower categories (a sketch of its category probabilities is given below). This model is useful in settings where we want to model discrete durations such as the number of years married.
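A minimal sketch of how a sequential (stopping) model assigns category probabilities, assuming a logistic link; alpha denotes hypothetical thresholds and eta a linear predictor, neither of which is estimated anywhere in this document:

seq_probs <- function(alpha, eta = 0) {
  J <- length(alpha) + 1
  stop_p <- plogis(alpha - eta)     # probability of stopping at each category
  p <- numeric(J)
  cont <- 1                         # probability of having passed all lower categories
  for (j in 1:(J - 1)) {
    p[j] <- stop_p[j] * cont        # stop at category j
    cont <- cont * (1 - stop_p[j])  # continue beyond category j
  }
  p[J] <- cont                      # reached the final category
  p
}
# e.g. seq_probs(alpha = d1$cuts) returns 7 probabilities that sum to one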