This section contains the setup and the various utility functions used throughout.
Libraries used:
library(data.table)
library(ggplot2)
library(ggrepel)
library(scico)
library(cmdstanr)
# library(survival)
library("loo")
sl <- scico(palette = "lajolla", n = 9)
tok <- unlist(tstrsplit(getwd(), "/"))
if(tok[length(tok)] == 'rpubs'){path_pref = "."} else{path_pref = ".."}
Assume we have some categorical outcome variable of interest: for example, a 7-point ordinal scale assessed on day 14, ranging from (1) not hospitalised and no limitations through to (7) death.
Assume that empirically the probability of being in each group maps to some latent continuous score as follows.
# True probabilities of each of the 7 categories in the control condition
p_tru <- c(0.03,0.28,0.34,0.155,0.048,0.04,0.107)
# Cutpoints on the log-odds scale implied by the cumulative probabilities
d1 <- data.table(cat = 1:6, cuts = qlogis(cumsum(p_tru[1:6])))
ggplot(d1, aes(x = cuts)) +
geom_vline(aes(xintercept = cuts), lwd = 0.3, lty = 2) +
geom_text_repel(aes(x = cuts, y = 0.1, label = paste0(cat)),
box.padding = 2.5,
max.overlaps = Inf,
nudge_x = 0.2,
ylim = c(0.1, Inf), col = 2,
size = 3, inherit.aes = F) +
stat_function(fun = dlogis, n = 101) +
scale_x_continuous("log-odds", lim = c(-5, 5)) +
scale_y_continuous("Density") +
theme_bw()
Figure 1: Location of cutpoints on the logistic density, based on the probability of each score at day 14
There are a variety of models that can be applied, but the cumulative model is probably the most common. It assumes that the observed category originates from an underlying latent continuous random variable. For example, a discrete happiness score from 1 to 10 might be assumed to map to a continuous underlying state of happiness. The latent variable is usually assumed to follow a normal or logistic distribution (the logistic form is used here).
The cumulative logit approach splits the probability density function into sections, where the area enclosed by the boundaries corresponds to the probability of observing a particular score. We construct our model to estimate these cutpoints. Two things need to be kept in mind.
First, the probability that an observation equals a particular category can be found with:
\[ \begin{equation} \begin{split} \mathsf{Pr}(Y = j) = F(\alpha_j) - F(\alpha_{j-1}) \end{split} \tag{1} \end{equation} \]
where \(F\) is the cumulative distribution function (CDF) of the distribution chosen to model the underlying latent variable and \(\alpha_j\) denotes the upper bound (cutpoint) of the \(j^{th}\) category.
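As a quick sanity check of equation (1) (a minimal sketch using the p_tru and d1 objects defined above), differencing the logistic CDF at the cutpoints recovers the category probabilities we started with:
p_chk <- diff(c(0, plogis(d1$cuts), 1))  # F(alpha_j) - F(alpha_{j-1}) for j = 1..7
all.equal(p_chk, p_tru)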
Second, we can introduce a linear predictor as follows:
\[ \begin{equation} \begin{split} \mathsf{Pr}(Y = j|\eta) = F(\alpha_j - \eta) - F(\alpha_{j-1} - \eta) \end{split} \tag{2} \end{equation} \]
and so the effect of \(\eta\) is to shift the cutpoints. Additionally, we have:
\[ \begin{equation} \begin{split} \mathsf{Pr}(Y \le j|\eta) &= \mathsf{Pr}(Y^* \le \alpha_j|\eta) \\ &= F(\alpha_j - \eta) \end{split} \tag{3} \end{equation} \]
where \(Y^*\) is the latent random variable.
We assume the linear predictor is constant across the response categories; this is known as the proportional odds assumption.
To increase the probability of being in the lowest category, the upper boundary (cutpoint) for that category needs to increase. We usually prefer odds ratios less than one to indicate that treatment reduces the odds of the worst outcomes. The easiest way to achieve this is to parameterise the model for the current scenario as
\[ \begin{equation} \begin{split} \mathsf{logit}(\mathsf{Pr}(y_i \le j)) = \alpha_j - \beta x_i \end{split} \tag{4} \end{equation} \]
where \(\beta\) represents the treatment effect (a log-odds ratio), constant across all response categories. So, if \(\beta\) is negative (implying an OR < 1), the cutpoints are effectively increased and we reduce the odds of higher scores (i.e. scores towards death).
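To make the direction concrete (a small numeric sketch using the cutpoints in d1, with the effect size of -0.5 that is also used in the simulations below): exp(-0.5) is roughly 0.61, the odds ratio for the worse outcomes, and subtracting a negative \(\beta\) raises every cumulative probability \(\mathsf{Pr}(Y \le j)\), i.e. shifts probability mass towards the better categories.
# Odds ratio for the worse outcomes implied by a treatment effect of -0.5
exp(-0.5)
# Each cumulative probability P(Y <= j) increases when we subtract beta = -0.5
round(plogis(d1$cuts + 0.5) - plogis(d1$cuts), 3)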
The proportional odds assumption isn’t always valid. For example, an intervention might be beneficial for healthier patients but harmful to patients who are already very sick.
The following shows a summary of simulated data when the proportional odds assumption holds.
N <- 1000
b <- -0.5
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))
y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)
d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))
d_tbl <- d2[, .(p_obs = round(.N/N, 2)), keyby = .(trt, y)]
d_tbl[, p_tru := as.numeric(c(p0 = p0, p1 = p1))]
d_tbl <- dcast(d_tbl, y ~ trt, value.var = list("p_obs", "p_tru"))
Initial cutpoints are defined along with a linear predictor containing a single categorical variable for treatment status. The treatment effect was set to -0.5, which corresponds to an odds ratio of 0.61. Categorical data are generated from the implied probabilities for each group, and the observed proportions for each treatment and category are shown alongside the true probabilities used to simulate the data. We see that the effect of treatment is to increase the probability of observing lower values of y and to decrease the probability of higher values.
knitr::kable(d_tbl, digits = 2)
y | p_obs_0 | p_obs_1 | p_tru_0 | p_tru_1 |
---|---|---|---|---|
1 | 0.02 | 0.04 | 0.03 | 0.05 |
2 | 0.30 | 0.40 | 0.28 | 0.38 |
3 | 0.32 | 0.31 | 0.34 | 0.33 |
4 | 0.15 | 0.12 | 0.16 | 0.12 |
5 | 0.04 | 0.04 | 0.05 | 0.03 |
6 | 0.04 | 0.02 | 0.04 | 0.03 |
7 | 0.12 | 0.08 | 0.11 | 0.07 |
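A quick empirical check of the proportional odds assumption (a sketch using the y0 and y1 vectors simulated above): the difference in cumulative log-odds between the two arms should be roughly constant, close to -b = 0.5, at every cutpoint.
# Empirical cumulative proportions in each arm at the first six cutpoints
cum0 <- cumsum(prop.table(table(factor(y0, levels = 1:7))))[1:6]
cum1 <- cumsum(prop.table(table(factor(y1, levels = 1:7))))[1:6]
# Differences in cumulative log-odds; approximately constant when proportional odds holds
round(qlogis(cum1) - qlogis(cum0), 2)
Repeating the same check on the next simulation, where the effect differs by category, produces clearly non-constant differences.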
And the following summary shows what happens when the proportional odds assumption does not hold.
N <- 1000
b <- c(-0.5, -0.3, -0.2, 0.2, 0.3, 0.4)
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))
y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)
d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))
d_tbl <- d2[, .(p_obs = round(.N/N, 2)), keyby = .(trt, y)]
d_tbl[, p_tru := as.numeric(c(p0 = p0, p1 = p1))]
d_tbl <- dcast(d_tbl, y ~ trt, value.var = list("p_obs", "p_tru"))
This time, there is a distinct treatment effect for each category boundary. Here we see that treatment results in a simultaneous increase in the probabilities of both the lowest and highest scores; to achieve this, the probabilities for the mid-range categories have all been reduced.
knitr::kable(d_tbl, digits = 3)
y | p_obs_0 | p_obs_1 | p_tru_0 | p_tru_1 |
---|---|---|---|---|
1 | 0.02 | 0.06 | 0.030 | 0.049 |
2 | 0.27 | 0.33 | 0.280 | 0.329 |
3 | 0.37 | 0.31 | 0.340 | 0.317 |
4 | 0.14 | 0.07 | 0.155 | 0.078 |
5 | 0.05 | 0.04 | 0.048 | 0.040 |
6 | 0.04 | 0.04 | 0.040 | 0.037 |
7 | 0.11 | 0.16 | 0.107 | 0.152 |
One option for evaluating whether there are category-specific effects is to fit separate logistic regression models to a set of collapsed comparisons. For example, you could look at the relative odds of having a score of 1 versus 2:7 under treatment, the relative odds of having a score of 1:2 versus 3:7 under treatment, and so on. However, this becomes burdensome when there are many categories.
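As a sketch of what those collapsed comparisons look like (using the d2 data simulated above under the violated assumption and base R's glm), each dichotomisation gives its own odds ratio for treatment:
# Odds ratio for scoring above category j under treatment, for each split j;
# under proportional odds these would all be roughly equal.
or_j <- sapply(1:6, function(j) {
  fit <- glm(I(y > j) ~ trt, data = d2, family = binomial)
  exp(coef(fit)[["trt"]])
})
round(or_j, 2)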
Another option is to use a different model, known as the adjacent category model (ACM). This approach emphasises comparisons between adjacent categories and is motivated more by statistical convenience than by any real-world construct.
Under the ACM, we conceptualise a latent variable for each category, each with a single threshold \(\alpha_j\):
\[ \begin{equation} \begin{split} \mathsf{Pr}(Y = j | Y \in \{j, j+1\}, \eta) = F(\alpha_j - \eta) \end{split} \tag{5} \end{equation} \]
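For intuition (a small sketch using the control-arm probabilities p_tru defined earlier, i.e. with \(\eta = 0\)), the adjacent-category conditional probabilities and the thresholds they imply are:
# Pr(Y = j | Y in {j, j+1}) for the control arm, and the implied thresholds alpha_j
p_adj <- p_tru[1:6] / (p_tru[1:6] + p_tru[2:7])
round(qlogis(p_adj), 2)
These thresholds play the role of the b0 parameters in the adjacent category Stan model further below, so the estimates there should roughly track these values.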
A Stan implementation of the cumulative ordinal proportional odds model follows. It includes a single variable in the linear predictor for treatment status. Note that the log_lik variable in the generated quantities block is used for Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO) for model checking and model comparison; see https://mc-stan.org/loo/articles/loo2-example.html.
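The fit objects created below (f1, f2, f3) expose a loo() convenience method, which is what is used later on; a roughly equivalent manual call with the loo package, assuming it accepts the three-dimensional draws array returned by cmdstanr (iterations x chains x observations), is sketched here:
# Pointwise log-likelihood draws: iterations x chains x N
ll <- f1$draws("log_lik")
# Relative efficiency of the (exponentiated) draws, then PSIS-LOO
r_eff <- loo::relative_eff(exp(ll))
loo_manual <- loo::loo(ll, r_eff = r_eff)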
mod1 <- cmdstanr::cmdstan_model(paste0(path_pref, "/stan/ordinal1.stan"))
mod1
## data {
## int<lower=1> N; // total number of observations
## array[N] int y;
## vector[N] x;
## int<lower=2> ncat; // number of categories
## int prior_only;
## }
## parameters {
## simplex[ncat] a0;
## real b;
## }
## transformed parameters {
## vector[ncat-1] cuts = logit(cumulative_sum(a0[1:(ncat-1)]));
## }
## model {
## if (!prior_only) {
## vector[N] eta = b * x;
## target += ordered_logistic_lpmf(y | eta, cuts);
## }
## target += normal_lpdf(b | 0, 4);
## target += dirichlet_lpdf(a0 | rep_vector(1, ncat));
## }
## generated quantities {
## simplex[ncat] a1;
## vector[N] log_lik;
## a1[1] = inv_logit(cuts[1] - b);
## for(i in 2:(ncat-1)){
## a1[i] = inv_logit(cuts[i] - b) - inv_logit(cuts[i-1] - b);
## }
## a1[ncat] = 1 - sum(a1[1:(ncat-1)]);
## for(i in 1:N){
## log_lik[i] = ordered_logistic_lpmf(y[i] | x[i]*b, cuts);
## }
## }
First fit the model based on data generated so that proportional odds holds:
set.seed(5)
N <- 2500
b <- -0.5
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))
y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)
d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))
ld <- list(N = nrow(d2), y = as.integer(d2$y),
ncat = 7, x = d2$trt,
prior_only = 0)
prop.table(table(d2$trt, d2$y))
f1 <- mod1$sample(data = ld,
iter_warmup = 1000, iter_sampling = 2000,
parallel_chains = 3, chains = 3,
refresh = 0, show_exceptions = F)
d_smpl1 <- data.table(f1$draws(variables = c("a0", "a1"), format = "matrix"))
d_smpl1 <- melt(d_smpl1, measure.vars = colnames(d_smpl1))
d_smpl1[, trt := as.integer(substr(variable, 2, 2))]
d_smpl1[, cat := as.integer(substr(variable, 4, 4))]
d_tbl <- d_smpl1[, .(p_mu = mean(value)), keyby = .(trt, cat)]
d_tbl[, p_obs := d2[, .(p_obs = .N/N), keyby = .(trt, y)]$p_obs]
setcolorder(d_tbl, c("trt", "cat", "p_obs", "p_mu"))
d_tbl <- dcast(d_tbl, cat ~ trt, value.var = list("p_obs", "p_mu"))
knitr::kable(d_tbl, digits = c(0, rep(3, 4)))
##
## 1 2 3 4 5 6 7
## 0 0.0160 0.1350 0.1686 0.0834 0.0250 0.0232 0.0488
## 1 0.0258 0.1860 0.1650 0.0588 0.0146 0.0162 0.0336
## Running MCMC with 3 parallel chains...
##
## Chain 1 finished in 10.1 seconds.
## Chain 2 finished in 10.4 seconds.
## Chain 3 finished in 10.6 seconds.
##
## All 3 chains finished successfully.
## Mean chain execution time: 10.4 seconds.
## Total execution time: 10.8 seconds.
cat | p_obs_0 | p_obs_1 | p_mu_0 | p_mu_1 |
---|---|---|---|---|
1 | 0.032 | 0.052 | 0.032 | 0.052 |
2 | 0.270 | 0.372 | 0.271 | 0.371 |
3 | 0.337 | 0.330 | 0.339 | 0.328 |
4 | 0.167 | 0.118 | 0.162 | 0.122 |
5 | 0.050 | 0.029 | 0.047 | 0.032 |
6 | 0.046 | 0.032 | 0.048 | 0.031 |
7 | 0.098 | 0.067 | 0.102 | 0.063 |
And now fit the same model to data where proportional odds is violated by construction.
set.seed(5)
N <- 2500
b <- c(-0.5, -0.3, -0.2, 0.2, 0.3, 0.4)
p0 <- c(plogis(d1$cuts), 1) - c(0, plogis(d1$cuts))
p1 <- c(plogis(d1$cuts - b), 1) - c(0, plogis(d1$cuts - b))
y0 <- sample(1:7, N, replace = TRUE, prob = p0)
y1 <- sample(1:7, N, replace = TRUE, prob = p1)
d2 <- data.table(trt = rep(0:1, each = N), y = c(y0, y1))
ld <- list(N = nrow(d2), y = as.integer(d2$y),
ncat = 7, x = d2$trt,
prior_only = 0)
f2 <- mod1$sample(data = ld,
iter_warmup = 1000, iter_sampling = 2000,
parallel_chains = 3, chains = 3,
refresh = 0, show_exceptions = F)
## Running MCMC with 3 parallel chains...
##
## Chain 2 finished in 10.3 seconds.
## Chain 3 finished in 10.3 seconds.
## Chain 1 finished in 10.4 seconds.
##
## All 3 chains finished successfully.
## Mean chain execution time: 10.3 seconds.
## Total execution time: 10.5 seconds.
d_smpl2 <- data.table(f2$draws(variables = c("a0", "a1"), format = "matrix"))
d_smpl2 <- melt(d_smpl2, measure.vars = colnames(d_smpl2))
d_smpl2[, trt := as.integer(substr(variable, 2, 2))]
d_smpl2[, cat := as.integer(substr(variable, 4, 4))]
d_tbl <- d_smpl2[, .(p_mu = mean(value)), keyby = .(trt, cat)]
d_tbl[, p_obs := d2[, .(p_obs = .N/N), keyby = .(trt, y)]$p_obs]
setcolorder(d_tbl, c("trt", "cat", "p_obs", "p_mu"))
d_tbl <- dcast(d_tbl, cat ~ trt, value.var = list("p_obs", "p_mu"))
knitr::kable(d_tbl, digits = c(0, rep(3, 4)))
cat | p_obs_0 | p_obs_1 | p_mu_0 | p_mu_1 |
---|---|---|---|---|
1 | 0.032 | 0.045 | 0.035 | 0.042 |
2 | 0.270 | 0.324 | 0.281 | 0.313 |
3 | 0.337 | 0.324 | 0.331 | 0.331 |
4 | 0.167 | 0.082 | 0.129 | 0.119 |
5 | 0.050 | 0.040 | 0.047 | 0.043 |
6 | 0.046 | 0.042 | 0.047 | 0.041 |
7 | 0.098 | 0.144 | 0.130 | 0.112 |
The observed proportions suggest that the probabilities of categories 1 and 7 have both increased in the treatment group relative to the control group. However, the model does a fairly poor job of reproducing these observed proportions: under the fitted proportional odds model, the probability of observing category 1 increases in the treatment group while the probability of observing category 7 decreases, even though the observed proportion of category 7 was higher under treatment.
This setting can be modelled with an adjacent category model. A Stan implementation of the adjacent category ordinal model with category-specific treatment effects follows.
mod2 <- cmdstanr::cmdstan_model(paste0(path_pref, "/stan/ordinal2.stan"))
mod2
## data {
## int<lower=1> N;
## array[N] int y;
## int<lower=2> ncat;
## vector[N] x;
## int prior_only;
## }
## parameters {
## vector[ncat-1] b0;
## // Category specific treatment effects.
## row_vector[ncat-1] b;
## }
## model {
## if (!prior_only) {
## matrix[N, ncat-1] eta = x * b;
## vector[ncat] lp;
## for (i in 1:N) {
## // For explanation of likelihood, see:
## // https://en.wikipedia.org/wiki/Polytomous_Rasch_model#The_Polytomous_Rasch_Model
## // Note the zero is chosen for convenience, see last line from the reference.
## lp = append_row(0, cumulative_sum(transpose(eta[i]) - b0));
## target += lp[y[i]] - log_sum_exp(lp);
## }
## }
## target += normal_lpdf(b | 0, 4);
## target += normal_lpdf(b0 | 0, 3);
## }
## generated quantities{
## simplex[ncat] a0;
## simplex[ncat] a1;
## vector[N] log_lik;
## {
## vector[ncat] lp0;
## vector[ncat] lp1;
## vector[ncat] lp;
## real lse0;
## real lse1;
## lp0 = append_row(0, cumulative_sum(-b0));
## lp1 = append_row(0, cumulative_sum(transpose(b) - b0));
## lse0 = log_sum_exp(lp0);
## lse1 = log_sum_exp(lp1);
## for(i in 1:ncat){
## a0[i] = exp(lp0[i] - lse0);
## a1[i] = exp(lp1[i] - lse1);
## }
## for(i in 1:N){
## lp = append_row(0, cumulative_sum(transpose(x[i]*b) - b0));
## log_lik[i] = lp[y[i]] - log_sum_exp(lp);
## }
## }
## }
The fit is not perfect, but the results are far more consistent with the observed data.
f3 <- mod2$sample(data = ld,
iter_warmup = 1000, iter_sampling = 2000,
parallel_chains = 3, chains = 3,
refresh = 0, show_exceptions = F)
## Running MCMC with 3 parallel chains...
##
## Chain 3 finished in 209.9 seconds.
## Chain 1 finished in 213.0 seconds.
## Chain 2 finished in 225.3 seconds.
##
## All 3 chains finished successfully.
## Mean chain execution time: 216.1 seconds.
## Total execution time: 225.4 seconds.
d_smpl3 <- data.table(f3$draws(variables = c("a0", "a1"), format = "matrix"))
d_smpl3 <- melt(d_smpl3, measure.vars = colnames(d_smpl3))
d_smpl3[, trt := as.integer(substr(variable, 2, 2))]
d_smpl3[, cat := as.integer(substr(variable, 4, 4))]
d_tbl <- d_smpl3[, .(p_mu = mean(value)), keyby = .(trt, cat)]
d_tbl[, p_obs := d2[, .(p_obs = .N/N), keyby = .(trt, y)]$p_obs]
setcolorder(d_tbl, c("trt", "cat", "p_obs", "p_mu"))
d_tbl <- dcast(d_tbl, cat ~ trt, value.var = list("p_obs", "p_mu"))
knitr::kable(d_tbl, digits = c(0, rep(3, 4)))
cat | p_obs_0 | p_obs_1 | p_mu_0 | p_mu_1 |
---|---|---|---|---|
1 | 0.032 | 0.045 | 0.032 | 0.045 |
2 | 0.270 | 0.324 | 0.270 | 0.323 |
3 | 0.337 | 0.324 | 0.337 | 0.324 |
4 | 0.167 | 0.082 | 0.167 | 0.082 |
5 | 0.050 | 0.040 | 0.050 | 0.040 |
6 | 0.046 | 0.042 | 0.046 | 0.042 |
7 | 0.098 | 0.144 | 0.098 | 0.144 |
The parameter estimates for the cutpoints and the category specific treatment effects are shown below:
f3$print(variables = c("b0"))
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## b0[1] -2.13 -2.13 0.12 0.11 -2.32 -1.94 1.00 4013 4230
## b0[2] -0.22 -0.22 0.05 0.05 -0.31 -0.14 1.00 3645 4020
## b0[3] 0.70 0.70 0.06 0.06 0.61 0.80 1.00 3612 3511
## b0[4] 1.21 1.20 0.10 0.10 1.04 1.38 1.00 3172 3855
## b0[5] 0.08 0.08 0.13 0.13 -0.13 0.29 1.00 2921 3912
## b0[6] -0.75 -0.75 0.11 0.11 -0.93 -0.57 1.00 3334 4532
f3$print(variables = c("b"))
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## b[1] -0.15 -0.15 0.15 0.15 -0.41 0.10 1.00 4290 4383
## b[2] -0.22 -0.22 0.07 0.07 -0.34 -0.11 1.00 3660 3949
## b[3] -0.68 -0.67 0.10 0.10 -0.84 -0.52 1.00 3441 3785
## b[4] 0.49 0.49 0.16 0.16 0.24 0.75 1.00 3043 4072
## b[5] 0.12 0.12 0.19 0.19 -0.20 0.43 1.00 3049 3904
## b[6] 0.50 0.50 0.15 0.15 0.24 0.75 1.00 3696 4137
which can be compared to the values used to simulate the data: -0.5, -0.3, -0.2, 0.2, 0.3, 0.4. Note, though, that these were applied as category-specific shifts on the cumulative-logit scale, so the adjacent-category estimates are not expected to reproduce them exactly.
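Since the generative effects were applied on the cumulative-logit scale, a more direct benchmark for the ACM estimates is the adjacent-category log-odds ratio implied by the true probabilities (a sketch using the p0 and p1 vectors from the simulation above):
# Change, under treatment, in log[ Pr(Y = j + 1) / Pr(Y = j) ] at each boundary j
round(log(p1[2:7] / p1[1:6]) - log(p0[2:7] / p0[1:6]), 2)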
Leave-one-out cross-validation can be used to compare the cumulative ordinal and adjacent category models when fit to the data where the proportional odds assumption is violated. The first column shows the difference in expected log predictive density (ELPD) relative to the model with the largest ELPD. The results suggest that the ACM (model 2) has the better fit.
loo1 <- f2$loo()
loo2 <- f3$loo()
# The adjacent category model looks like it fits better
loo::loo_compare(list(`Cumulative logit` = loo1, `Adjacent category` = loo2))
## elpd_diff se_diff
## Adjacent category 0.0 0.0
## Cumulative logit -48.9 10.3
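Before relying on this comparison, it is worth checking the diagnostics that the loo package reports; printing a loo object shows the ELPD estimates along with counts of the Pareto k values, which flag observations where the importance-sampling approximation may be unreliable.
# PSIS diagnostics for the adjacent category model
print(loo2)
# plot(loo2)  # optional diagnostic plot of the Pareto k values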
Commonly used models for ordinal response variables include the cumulative logit model and the adjacent categories model; the latter is useful when category-specific effects are of interest. LOO cross-validation can be used to compare the fit of models when the proportional odds assumption is suspected not to hold. Several other models are applicable to ordinal data, among them the sequential model, in which all lower categories must be passed through before a given response category can be reached. This model is useful in settings where we want to compare discrete durations, such as the number of years married.