Introduction

This post considers ordinal data, for example, the number of days where patients did not take their medication (grouped into bins):

  1. less than 7 days
  2. 7-13 days
  3. 14-20 days
  4. 21-27 days
  5. 28+ days

where the actual number of days that someone did not take medication could be 1.63, 22.7, etc.

The most common approach to this is via a cumulative logit model, for which I provided an earlier example; this post gives a little more detail. Under this model, the category value that we observe (1 to 5 in the above) is assumed to originate from a continuous latent variable. To recap, we assume that there is a series of cutpoints dividing a continuous distribution, where the enclosed areas correspond to the probabilities that our observed value takes on any given category.

In the above example, there are five categories, so there will be four cutpoints dividing a continuous distribution into five segments. The probability of being in each category is calculated as the area under the curve for the relevant segment. Therefore, for the observed value \(y\), the probability of category \(k = 1, \dots, K\) is

\[ p(y = k) = F(c_k) - F(c_{k-1}) \]

where \(c_k\) denotes the cutpoint/segment boundary value at index \(k = 0, \dots, K\) (with \(c_0 = -\infty\) and \(c_K = \infty\)) and \(F\) is the CDF of the distribution chosen to represent the latent variable.

For the first category, the lower segment boundary is at \(c_0 = -\infty\), so we have:

\[ p(y = 1) = F(c_1) - 0 \]

and for the last category, the upper segment boundary is at \(c_K = \infty\), so:

\[ p(y = K) = 1 - F(c_{K-1}) \]

and the cumulative probability that \(y\) is less than or equal to \(k\) is the sum of the corresponding areas, where \(\pi_k = Pr(y = k)\):

\[ p(y \le k) = \pi_1 + \dots + \pi_k \]
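To make the mapping concrete, here is a minimal sketch in R, assuming a hypothetical set of interior cutpoints and taking the standard logistic CDF for \(F\) (plogis):

# minimal sketch: K - 1 interior cutpoints (hypothetical values) plus the
# -Inf/Inf boundaries give the K category probabilities under a logistic F
c_cut <- c(-2.0, -0.5, 0.5, 2.5)   # c_1, ..., c_{K-1}
F_all <- c(0, plogis(c_cut), 1)    # F(-Inf), F(c_1), ..., F(c_{K-1}), F(Inf)
p_k <- diff(F_all)                 # p(y = k) = F(c_k) - F(c_{k-1})
sum(p_k)                           # the K areas sum to 1
cumsum(p_k)                        # p(y <= k)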

The approach models the cutpoints and then derives the probability of each category for \(y\). Often the logistic distribution is assumed for \(F\), but the normal CDF is also used (giving a probit formulation). For the logistic distribution, the CDF is

\[ F(z) = \frac{1}{1 + \exp(-(z - \mu)/\sigma)} \]

where \(\mu\) and \(\sigma\) are the location and scale parameters (by default 0 and 1), which give a mean equal to \(\mu\) and a variance equal to \(\sigma^2 \pi^2/3\). Here’s what the CDF looks like, with black showing the output from my implementation and the red dashes showing the built-in plogis function.

library(data.table)
library(ggplot2)

# logistic CDF with location mu and scale sigma
F_logistic <- function(x, mu = 0, sigma = 1){
  p <- 1/(1 + exp(-(x - mu)/sigma))
  p
}

x <- seq(-5, 5, len = 500)
d_fig <- data.table(
  x = x, 
  p1 = F_logistic(x, 0, 1),
  p2 = plogis(x, 0, 1)
)

ggplot(d_fig, aes(x = x, y = p1))+
  geom_line(col = 1, lwd = 2) +
  geom_line(aes(y = p2), col = 2, lty = 2, lwd = 1) +
  scale_x_continuous("x") +
  scale_y_continuous("F(x)", breaks = seq(0, 1, by = 0.1)) +
  theme_bw()

When a logistic distribution is assumed for the latent variable, the cutpoints will be on the log-odds scale (the logit function just converts probabilities to log-odds).

\[ logit(p(y \le j)) = \log \left[ \frac{p(y \le j)}{1 - p(y \le j)} \right] = \log \left[ \frac{\pi_1 + \dots + \pi_j}{\pi_{j+1} + \dots + \pi_K} \right] = c_j \]

for \(j = 1,...,K-1\).
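A quick numerical illustration (with made-up category probabilities) shows that the logit of the cumulative probabilities gives the cutpoints directly:

# made-up category probabilities for illustration
pi_k <- c(0.2, 0.3, 0.3, 0.15, 0.05)
# cumulative probabilities p(y <= j) for j = 1, ..., K-1
cum_p <- cumsum(pi_k)[1:4]
# the cutpoints c_j on the log-odds scale, two equivalent ways
qlogis(cum_p)
log(cum_p / (1 - cum_p))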

The above isn’t much use as a regression model yet. To incorporate covariates, a linear predictor is defined (note the absence of an intercept, since the cutpoints play that role)

\[ \eta = \beta_1 x_1 + \beta_2 x_2 + \dots \]

which is then incorporated into the cumulative logits to adjust the cutpoints conditional on the parameters

\[ logit(p(y \le j)) = c_j + \eta \]

and therefore the \(c_j\) become the intercepts and the \(\beta\)’s are assumed to be the same for all cutpoints (the proportional odds assumption). Under this parameterisation, when \(\eta > 0\) the probability that \(y\) is in the first category increases, because every cumulative probability shifts up (equivalently, the cutpoints effectively shift to the right relative to the latent distribution). Accordingly, the probability that \(y\) is in the last category decreases. Sometimes you will see a different setup, e.g. for \(logit(p(y \ge j))\), and sometimes with the \(\eta\) subtracted rather than added; this just changes the interpretation of the parameters.
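As a small check of that claim (with hypothetical cutpoints, and using the \(c_j + \eta\) parameterisation above), moving \(\eta\) above zero shifts probability mass towards the lower categories:

# hypothetical cutpoints for a 5-category outcome
c_j <- qlogis(cumsum(c(0.1, 0.3, 0.2, 0.35, 0.05))[1:4])
# category probabilities under logit(p(y <= j)) = c_j + eta
p_cat_eta <- function(eta) diff(c(0, plogis(c_j + eta), 1))
# with eta > 0, p(y = 1) increases and p(y = 5) decreases
rbind(eta_0 = p_cat_eta(0), eta_1 = p_cat_eta(1))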

For simplicity, say we had a linear predictor \(\eta = \beta x\) where \(x\) was some binary indicator of treatment status (control vs active). For this case, you would have:

\[ logit(p(y \le j | x = 1)) = \log \left[ \frac{p(y \le j | x = 1)}{p(y > j | x = 1)} \right] = c_j + \beta \]

for the treatment group and

\[ logit(p(y \le j | x = 0)) = \log \left[ \frac{p(y \le j | x = 0)}{p(y > j | x = 0)} \right] = c_j \]

for the control group. Exponentiating

\[ p(y \le j | x = 1)/p(y > j | x = 1) = \exp(c_j + \beta) \]

and

\[ p(y \le j | x = 0)/p(y > j | x = 0) = \exp(c_j) \]

Then we have the cumulative odds ratio

\[ \frac{p(y \le j | x = 1)/p(y > j | x = 1)}{p(y \le j | x = 0)/p(y > j | x = 0)} = \frac{\exp(c_j + \beta)}{\exp(c_j)} = \exp(\beta) \]

The interpretation would be that, for any fixed category \(j\), the odds that a patient under active treatment has an outcome towards the lower number of missed days rather than the higher (i.e. \(y \le j\) vs. \(y > j\)) is \(\exp(\beta)\) times the odds for a patient under the control treatment. This means that if \(\exp(\beta)\) is greater than 1 then the odds favour the treatment group having fewer missed days.
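A short numerical check (again with hypothetical cutpoints) confirms that, under proportional odds, this ratio equals \(\exp(\beta)\) at every cutpoint:

# hypothetical cutpoints and treatment effect
c_j <- qlogis(cumsum(c(0.1, 0.3, 0.2, 0.35, 0.05))[1:4])
beta <- 0.5
# cumulative odds for control and treatment under logit(p(y <= j)) = c_j + beta * x
odds_ctl <- exp(c_j)
odds_trt <- exp(c_j + beta)
# constant cumulative odds ratio, equal to exp(beta) at every j
odds_trt / odds_ctl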

As an aside, the ordered_logistic_lpmf in stan is an example of the variation in parameterisations. In stan-land they define

\[ \begin{aligned} p(y = 1 | \eta) &= 1 - logit^{-1}(\eta - c_1) \\ p(y = j | \eta) &= logit^{-1}(\eta - c_{j-1}) - logit^{-1}(\eta - c_j) \quad 1 < j < K \\ p(y = K | \eta) &= logit^{-1}(\eta - c_{K-1}) - 0 \end{aligned} \]

for \(K > 2\) and with \(c_k < c_{k+1}\).
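To make that parameterisation concrete, here is a small R helper (my own sketch, not part of stan) that reproduces the same probabilities:

# sketch of the ordered_logistic parameterisation in R
p_ordered_logistic <- function(eta, c){
  # p(y <= k | eta) = inv_logit(c_k - eta); add the 0 and 1 boundaries
  p_le <- c(0, plogis(c - eta), 1)
  # differencing gives p(y = k | eta) for k = 1, ..., K
  diff(p_le)
}

As a check, p_ordered_logistic(0, l0) and p_ordered_logistic(b, l0) should reproduce the p0 and p1 vectors constructed in the next section.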

Data

Generate some data for an ordinal outcome that adheres to proportional odds and that aligns with the parameterisation used by stan. First set up the control group cutpoints, corresponding to a set of probabilities of observing each category. Let’s build the cumulative probabilities and do a check to make sure that I haven’t stuffed up.

K <- 5
# probability of being in each category
p0 <- c(0.10, 0.30, 0.2, 0.35, 0.05)
# guard against the rounding errors that you will sometimes see in R
p0[K] <- 1 - sum(p0[1:(K-1)])
# cumulative probabilities
pc0 <- cumsum(p0)
# cumulative logits - these are the cutpoints for the control group
# note that the lower (-Inf) and upper (Inf) boundaries are not required
l0 <- qlogis(pc0[-K])
# should recover the same category probabilities
rbind(p0, round(c(1, plogis(-l0)) - c(plogis(-l0), 0), 3))
##    [,1] [,2] [,3] [,4] [,5]
## p0  0.1  0.3  0.2 0.35 0.05
##     0.1  0.3  0.2 0.35 0.05

Assume that the active treatment increases the odds of a lower number of missed days.

# log cumulative odds ratio
b <- -0.75
# eta - c_j for the treatment group (stan parameterisation)
l1 <- b - l0
# exceedance probabilities p(y > j | x = 1)
pc1 <- plogis(l1)
# differences give the category probabilities
p1 <- c(1, pc1) - c(pc1, 0)
rbind(p0, p1)
##         [,1]      [,2]     [,3]     [,4]      [,5]
## p0 0.1000000 0.3000000 0.200000 0.350000 0.0500000
## p1 0.1904291 0.3948626 0.175216 0.215234 0.0242583

Simulate some data from these probabilities

N <- 500
y0 <- sample(1:K, size = N, replace = T, prob = p0)
y1 <- sample(1:K, size = N, replace = T, prob = p1)
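As a quick sanity check, compare the empirical proportions to the generating probabilities (exact values will vary between runs since no seed is set):

# empirical category proportions vs the generating probabilities
obs0 <- as.numeric(prop.table(table(factor(y0, levels = 1:K))))
obs1 <- as.numeric(prop.table(table(factor(y1, levels = 1:K))))
round(rbind(p0, obs0, p1, obs1), 3)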

Model

Say that we have the same linear predictor \(\eta = \beta x\) with the single parameter \(\beta\). For the first category we have

\[ p(y = 1 | x = 1) = 1 - logit^{-1}(\beta - c_1) \]

for the active treatment group and

\[ p(y = 1 | x = 0) = 1 - logit^{-1}(- c_1) \]

for the control group.

If \(\beta = -0.75\) and the first cutpoint is at -2.197 then \(1 - logit^{-1}(\beta - c_1) = 0.19\) is the probability of the first category in the treatment group. For the control group, the \(\beta\) term drops out and we have \(1 - logit^{-1}(0 - c_1) = 0.1\).
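Those numbers can be reproduced directly (a quick check using the first control group cutpoint from above):

# reproduce the numbers above
c_1 <- qlogis(0.1)          # first cutpoint, approx -2.197
1 - plogis(-0.75 - c_1)     # p(y = 1 | x = 1), approx 0.19
1 - plogis(0 - c_1)         # p(y = 1 | x = 0), equals 0.1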

So, under the stan implementation for ordered_logistic_lpmf, when \(\beta\) is less than zero, the probability of observing a lower category is increased in the treatment group (assuming the treatment group is indicated by one and the control group indicated by a zero).

m1 <- cmdstanr::cmdstan_model(paste0(path_pref, "/stan/cumulative_logit3.stan"))
m1
## data{
##   int N;
##   int K;
##   array[N] int y;
##   array[N] int x;
##   int prior_only;
##   real pri_b_s;
##   real pri_c_s;
## }
## parameters {
##   real b;
##   ordered[K - 1] c;
## }
## transformed parameters{
## }
## model {
##   target += normal_lpdf(b | 0, pri_b_s);
##   target += normal_lpdf(c | 0, pri_c_s);
##   if(!prior_only){
##     for (i in 1:N) {
##       target += ordered_logistic_lpmf(y[i] | x[i] * b, c);
##     }
##   }
## }
## generated quantities{
##   matrix[K,2] p;
##   p[1,1] = 1 - inv_logit(-c[1]);  
##   p[1,2] = 1 - inv_logit(b-c[1]);  
##   p[K,1] = inv_logit(-c[K-1]);  
##   p[K,2] = inv_logit(b-c[K-1]);  
##   for(j in 2:(K-1)){
##     p[j,1] = inv_logit(-c[j-1]) - inv_logit(-c[j]);
##     p[j,2] = inv_logit(b-c[j-1]) - inv_logit(b-c[j]);
##   }
## }

Priors are applied to the cutpoints and the treatment effect. These are all set on the log-odds scale. We can see what they imply on the probability scale by running the model without the likelihood part.

lsd <- list(
  N = 2*N,
  K = K,
  y = c(y0, y1),
  x = rep(0:1, each = N),
  prior_only = 1, pri_b_s = 1, pri_c_s = 1.8
)

f1 <- m1$sample(lsd,
    iter_warmup = 1000,
    iter_sampling = 2000,
    parallel_chains = 3,
    chains = 3,
    refresh = 0)
## Running MCMC with 3 parallel chains...
## 
## Chain 1 finished in 0.1 seconds.
## Chain 2 finished in 0.1 seconds.
## Chain 3 finished in 0.1 seconds.
## 
## All 3 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.2 seconds.

This gives us the following, which suggests that the priors assign similar belief to each category. There may be instances where we do not want to assign similar beliefs to the occurrence of every category, but for this example these will do. Note how the means are all centred around 0.2.

post_p <- data.table(f1$draws(variables = "p", format = 'matrix'))
post_p <- melt(post_p, measure.vars = names(post_p))
post_p[, k := gsub("p[", "", variable, fixed = T)]
post_p[, k := gsub(",.*$", "", k)]
post_p[, trt := gsub("^p\\[.,", "", variable)]
post_p[, trt := gsub("]", "", trt, fixed = T)]
post_p[trt == 1, grp := "Control"]
post_p[trt == 2, grp := "Treatment"]

dfig <- post_p[, .(p_mu = mean(value)), keyby = .(grp, k)]

ggplot(post_p, aes(x = value, group = grp, col = grp)) +
  geom_density() +
  geom_vline(data = dfig, aes(xintercept = p_mu, col = grp),
             lwd = 0.2) +
  scale_color_discrete("") +
  scale_x_continuous("Probability", breaks = seq(0, 1, by = 0.2)) +
  facet_wrap(~k, scales = "free") +
  theme(legend.position = "bottom")

Run the model again, this time updating based on the simulated data, and see what we get for the beta term.

lsd$prior_only <- 0
f1 <- m1$sample(lsd,
    iter_warmup = 1000,
    iter_sampling = 2000,
    parallel_chains = 3,
    chains = 3,
    refresh = 0, show_exceptions = F)
## Running MCMC with 3 parallel chains...
## 
## Chain 2 finished in 15.0 seconds.
## Chain 3 finished in 15.2 seconds.
## Chain 1 finished in 16.7 seconds.
## 
## All 3 chains finished successfully.
## Mean chain execution time: 15.6 seconds.
## Total execution time: 16.8 seconds.
f1$summary(variables = "b")
## # A tibble: 1 × 10
##   variable   mean median    sd   mad     q5    q95  rhat ess_bulk ess_tail
##   <chr>     <num>  <num> <num> <num>  <num>  <num> <num>    <num>    <num>
## 1 b        -0.676 -0.676 0.117 0.116 -0.870 -0.486  1.00    3979.    3913.
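If the effect is wanted on the cumulative odds ratio scale, the posterior draws for b can simply be exponentiated (a quick summary; the exact values will differ between runs):

# posterior summary for the cumulative odds ratio exp(b)
b_draws <- as.numeric(f1$draws(variables = "b", format = "matrix"))
round(quantile(exp(b_draws), probs = c(0.05, 0.5, 0.95)), 3)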

Below are the model posterior means for the probability of each category along with the true values that were used to simulate the data.

post_p <- data.table(f1$draws(variables = "p", format = 'matrix'))
post_p <- melt(post_p, measure.vars = names(post_p))
post_p[, k := gsub("p[", "", variable, fixed = T)]
post_p[, k := gsub(",.*$", "", k)]
post_p[, trt := gsub("^p\\[.,", "", variable)]
post_p[, trt := gsub("]", "", trt, fixed = T)]
post_p[trt == 1, grp := "Control"]
post_p[trt == 2, grp := "Treatment"]

d_tbl <- post_p[, .(p_mu = mean(value)), keyby = .(grp, k)]
d_tbl[, p_tru := c(p0, p1)]

d_tbl <- melt(d_tbl, id.vars = c("grp", "k"))

d_tbl <- dcast(
  d_tbl, grp + variable ~ k, value.var = "value"
)
# compare to the truth
knitr::kable(d_tbl, format = "simple", digits = 3)
grp         variable        1       2       3       4       5
----------  ---------  ------  ------  ------  ------  ------
Control     p_mu        0.095   0.298   0.217   0.343   0.047
Control     p_tru       0.100   0.300   0.200   0.350   0.050
Treatment   p_mu        0.171   0.389   0.194   0.221   0.025
Treatment   p_tru       0.190   0.395   0.175   0.215   0.024

And finally, here is a plot showing the posterior distribution for each category and in each group.

ggplot(post_p, aes(x = value, group = grp, col = grp)) +
  geom_density() +
  scale_color_discrete("") +
  scale_x_continuous("Probability", breaks = seq(0, 1, by = 0.2)) +
  facet_wrap(~k, scales = "free") +
  theme(legend.position = "bottom")