This clinical trial was a small (N=10) sleep study comparing two drugs (Cushny and Peebles, 1905).
library(tidyr)         # data wrangling; re-exports the pipe (%>%)
library(rstan)         # interface to Stan (also attaches ggplot2)
library(bayesplot)     # plotting functions for posterior draws
library(knitr)         # kable() for tables
library(directlabels)  # geom_dl() for labeling lines directly
Ten people were given one drug to help them sleep, and then the same ten people were given a second drug. The researchers recorded the number of extra hours of sleep each participant got while taking each drug, compared to a baseline measurement. The data is very small, in the sense that we have low n and just one predictor variable (the treatment). Luckily, we know a lot about the design of the experiment, which warrants a hierarchical model with informative priors. So we’ll use Stan, a platform for fully Bayesian statistical modeling. The goal is to express the hierarchical structure of the data in the model, incorporate sensible priors, and get a realistic quantification of the (very high) uncertainty in the parameter estimates.
Let’s load the data and take a look.
data("sleep")
sleep$ID <- as.numeric(sleep$ID) # ID was saved as a factor
data <- sleep
kable(data, align = "c")
| extra | group | ID |
|---|---|---|
| 0.7 | 1 | 1 |
| -1.6 | 1 | 2 |
| -0.2 | 1 | 3 |
| -1.2 | 1 | 4 |
| -0.1 | 1 | 5 |
| 3.4 | 1 | 6 |
| 3.7 | 1 | 7 |
| 0.8 | 1 | 8 |
| 0.0 | 1 | 9 |
| 2.0 | 1 | 10 |
| 1.9 | 2 | 1 |
| 0.8 | 2 | 2 |
| 1.1 | 2 | 3 |
| 0.1 | 2 | 4 |
| -0.1 | 2 | 5 |
| 4.4 | 2 | 6 |
| 5.5 | 2 | 7 |
| 1.6 | 2 | 8 |
| 4.6 | 2 | 9 |
| 3.4 | 2 | 10 |
Raw data is hard to read, so here’s a plot of the same information. We label each line by subject ID. It looks like treatment 1 may have had some effect, and that treatment 2 was generally better.
data %>%
ggplot() +
aes(x = group, y = extra, group=ID) +
xlab("group (aka treatment condition)") +
ylab("extra (aka gains in sleep)") +
geom_line(alpha=.8) +
geom_dl(aes(label=ID), method="last.points")
First we split the data into a training set consisting of the first 8 subjects, and a test set of the remaining 2 subjects.
train <- subset(data, ID <= 8)
test <- subset(data, ID > 8)
We feed the data into Stan as a named list; the names must match the variables declared in the program’s data block.
N1 <- nrow(train) # number of observations in training dataset
N2 <- nrow(test) # number of observations in test dataset
J1 <- length(unique(train$ID)) # number of participants in training dataset
J2 <- length(unique(test$ID)) # number of participants in test dataset
Y1 <- train$extra # observed outcome for training dataset
Y2 <- test$extra # observed outcome for test dataset
treatment1 <- ifelse(train$group == 2, 1, 0) # create a binary variable indicating whether the person was in group 2 or not
treatment2 <- ifelse(test$group == 2, 1, 0)
id1 <- train$ID # id indicator for training dataset
id2 <- test$ID - J1 # id indicator for test dataset. We subtract J1 because stan needs this value to count up from 1
stan_dat <- list(N1 = N1, Y1 = Y1, treatment1 = treatment1, J1 = J1, id1 = id1,
                 N2 = N2, Y2 = Y2, treatment2 = treatment2, J2 = J2, id2 = id2)
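A quick sanity check that the names and dimensions line up with what the Stan program will expect in its data block:
str(stan_dat)  # each element should be named and have the expected length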
A full Stan program is hard to read in one go, so I’m going to present it in chunks. First the data block, where we declare the data that we’re going to feed into the program.
data_block <- "
data {
// training set
int<lower=1> N1;
int<lower=1> J1;
vector[N1] Y1;
real treatment1[N1]; // treatment group (1 if in group 2, else 0)
int<lower=1,upper=J1> id1[N1]; // id indicator
// test set
int<lower=1> N2;
int<lower=1> J2;
real treatment2[N2];
int<lower=1,upper=J2> id2[N2];
vector[N2] Y2;
}
"
Then the parameters block, where we specify the parameters that we’re going to estimate via Hamiltonian Monte Carlo. Notice that we’re already putting prior information into the model in the form of hard constraints on parameter estimates: given the design of the experiment, it is impossible for the difference between treatments to be larger than +/- 24 hours (although such a difference is so unlikely that this constraint has basically no effect on the parameter estimates).
parameters_block <- "
parameters {
vector[J1] id_intercept; // varying intercept for IDs in training set
real mu; // mu for varying intercepts
  real<lower=0> tau; // standard deviation of the varying intercepts
real<lower=0> sigma; // global sigma
real<lower=-24, upper=24> treatment_effect; // beta for treatment effect
}
"
The model block is for, uh, specifying the model. We fit a multilevel linear regression, with varying intercepts for subjects, and estimate a coefficient for the treatment effect (i.e., being in group 2). We set fairly weak priors on the parameters in the model, based on common sense and prior predictive simulation (i.e., simulating from the priors without conditioning on the data). For example, we put most of the prior probability mass for the treatment effect around a gain of zero hours, while leaving a substantial amount of density as high as +/- 10 hours of extra sleep and very little density at +/- 24 hours of sleep. Such a prior may be described as regularizing or very weakly informative.
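To make that concrete, here is the kind of prior predictive simulation mentioned above, run on the treatment-effect prior alone:
# Simulate draws from the normal(0, 5) treatment-effect prior to see
# where its probability mass lies
prior_draws <- rnorm(1e5, mean = 0, sd = 5)
mean(abs(prior_draws) > 10)  # ~0.05: some mass at extreme-but-possible effects
mean(abs(prior_draws) > 24)  # ~0: essentially no mass at impossible effects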
model_block <- "
model {
mu ~ normal(0,2.5);
tau ~ normal(0,2.5); // implicit half-normal due to lower bound specified in param block
id_intercept ~ normal(mu, tau);
sigma ~ normal(0,5); // implicit half-normal
treatment_effect ~ normal(0,5);
for (n in 1:N1){
Y1[n] ~ normal(id_intercept[id1[n]] + treatment_effect * treatment1[n], sigma);
}
}
"
Finally, we use the generated quantities block to generate predictions for the test dataset and calculate prediction error as Root Mean Squared Difference (RMSD).
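In symbols, with \(\tilde{Y}_n\) a posterior predictive draw for test observation \(n\):
\[
\text{RMSD} = \sqrt{\frac{1}{N_2} \sum_{n=1}^{N_2} \left( \tilde{Y}_n - Y_n \right)^2}
\]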
generated_quantities_block <- "
generated quantities{
real RMSD; // root mean squared difference
{
vector[N2] Y_pred;
vector[N2] SD; // squared difference
vector[J2] id_intercept_pred;
for (j2 in 1:J2){
id_intercept_pred[j2] = normal_rng(mu, tau);
}
for (n2 in 1:N2){
Y_pred[n2] = normal_rng(id_intercept_pred[id2[n2]] + treatment_effect * treatment2[n2], sigma);
SD[n2] = (Y_pred[n2] - Y2[n2])^2;
}
RMSD = sqrt(mean(SD));
}
}
"
Concatenate all the blocks to form the Stan code (rstan accepts the resulting character vector and collapses it into a single program).
model_code <- c(data_block, parameters_block, model_block, generated_quantities_block)
Run the Stan model. We raise adapt_delta above its default of 0.8 to reduce divergent transitions, which are common in hierarchical models with few groups.
fit <- stan(model_code = model_code, data = stan_dat, control = list(adapt_delta = 0.95), cores = 4)
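Before interpreting the results, it’s worth checking the sampler diagnostics. Assuming a reasonably recent rstan, this is a one-liner:
check_hmc_diagnostics(fit)  # reports divergences, max treedepth hits, and E-BFMI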
And check the basic results.
print(fit)
## Inference for Stan model: c5cb2d2d52450509475c21ce8a6f4435.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## id_intercept[1] 0.68 0.01 0.47 -0.28 0.39 0.68 0.97 1.60 4000
## id_intercept[2] -0.92 0.01 0.49 -1.79 -1.21 -0.95 -0.64 0.18 3057
## id_intercept[3] -0.12 0.01 0.47 -1.00 -0.41 -0.13 0.16 0.85 3470
## id_intercept[4] -1.05 0.01 0.48 -1.96 -1.36 -1.07 -0.76 -0.05 4000
## id_intercept[5] -0.62 0.01 0.47 -1.52 -0.91 -0.63 -0.35 0.35 3490
## id_intercept[6] 3.13 0.01 0.53 1.95 2.86 3.17 3.47 4.09 2625
## id_intercept[7] 3.79 0.01 0.53 2.57 3.53 3.85 4.12 4.69 2290
## id_intercept[8] 0.59 0.01 0.47 -0.37 0.30 0.59 0.86 1.53 4000
## mu 0.62 0.01 0.75 -0.88 0.15 0.63 1.10 2.05 3051
## tau 2.09 0.01 0.60 1.20 1.67 2.00 2.41 3.54 3083
## sigma 0.62 0.01 0.22 0.35 0.47 0.57 0.71 1.21 898
## treatment_effect 1.22 0.01 0.33 0.58 1.02 1.22 1.41 1.88 2649
## RMSD 2.74 0.02 1.19 1.14 1.88 2.51 3.36 5.65 3463
## lp__ -6.38 0.14 3.77 -15.20 -8.54 -5.83 -3.57 -0.62 742
## Rhat
## id_intercept[1] 1.00
## id_intercept[2] 1.00
## id_intercept[3] 1.00
## id_intercept[4] 1.00
## id_intercept[5] 1.00
## id_intercept[6] 1.00
## id_intercept[7] 1.00
## id_intercept[8] 1.00
## mu 1.00
## tau 1.00
## sigma 1.01
## treatment_effect 1.00
## RMSD 1.00
## lp__ 1.01
##
## Samples were drawn using NUTS(diag_e) at Tue Jan 1 23:48:09 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
First I want to look at the model estimates for each subject and compare them to that subject’s measured values, just to get an idea of how well the model fits the data.
We extract the relevant posterior samples.
id_intercepts <- as.data.frame(fit, pars="id_intercept")
treatment_effect <- as.data.frame(fit, pars="treatment_effect")
We plot each subject’s posterior distribution under treatment 1, with the measured value as a vertical line. The hierarchical structure partially pools information across subjects, pulling relative outliers toward the population mean, which helps reduce overfitting. Notice that subject 5 appears to be an exception: this was the only participant who didn’t do any better in treatment 2 than in treatment 1, so our model (which assumes a common treatment effect) underestimates their treatment 1 score (and overestimates their treatment 2 score).
for (i in 1:ncol(id_intercepts)){
true.value <- subset(data, group==1 & ID == i)$extra
plot(mcmc_dens(data.frame(id_intercepts[,i])) +
vline_at(v = true.value) +
xlim(c(-4,6.5)) +
xlab(paste("subject",i)))
}
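To see the shrinkage numerically, we can compare each subject’s raw treatment 1 score with the posterior mean of their intercept. (Keep in mind that each intercept pools both of a subject’s observations, with the treatment effect subtracted from the group 2 score, before shrinking toward \(\mu\), so the posterior means won’t simply sit between the raw scores and \(\mu\).)
# Posterior-mean intercepts vs. raw treatment 1 scores for the training subjects
raw <- subset(data, group == 1 & ID <= 8)$extra
shrunk <- round(colMeans(id_intercepts), 2)
cbind(raw, shrunk)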
And now the same subjects, but for treatment 2.
# Add the treatment-effect draws to each subject's intercept draws
# (the vector recycles down each column of the data frame)
id_intercepts_treated <- id_intercepts + treatment_effect$treatment_effect
for (i in 1:ncol(id_intercepts_treated)){
true.value <- subset(data, group==2 & ID == i)$extra
plot(mcmc_dens(data.frame(id_intercepts_treated[,i])) +
vline_at(v = true.value) +
xlim(c(-4,6.5)) +
xlab(paste("subject",i)))
}
Things look okay. Just eyeballing it, roughly 90% of the measured values seem to fall within the 90% credible intervals.
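Rather than eyeballing, we can compute the empirical coverage directly from the posterior draws (a quick sketch using central 90% intervals rather than highest-density intervals, for simplicity):
# Check how many of the 16 training measurements fall inside their
# central 90% posterior interval
covered <- sapply(1:8, function(i) {
  obs1 <- subset(data, group == 1 & ID == i)$extra
  obs2 <- subset(data, group == 2 & ID == i)$extra
  q1 <- quantile(id_intercepts[, i], c(.05, .95))
  q2 <- quantile(id_intercepts_treated[, i], c(.05, .95))
  c(obs1 >= q1[1] & obs1 <= q1[2], obs2 >= q2[1] & obs2 <= q2[2])
})
mean(covered)  # proportion of measurements covered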
We can see that the sample gives us limited evidence for the efficacy of the first drug on sleep: the posterior interval for \(\mu\) (the population mean under treatment 1) comfortably includes zero.
mcmc_intervals(as.array(fit, pars="mu")) + scale_y_discrete(labels="treatment 1 effect")
But the second drug did measurably better than the first.
mcmc_intervals(treatment_effect) + scale_y_discrete(labels="difference between treatments")
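To put a number on “measurably better,” we can compute the posterior probability that the difference between treatments is positive, directly from the draws:
mean(treatment_effect$treatment_effect > 0)  # posterior P(difference > 0); close to 1 here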
The second drug was associated with about 2 hours of extra sleep per night.
mu_draws <- as.data.frame(fit, pars = "mu")
mcmc_intervals(mu_draws + treatment_effect) + scale_y_discrete(labels = "treatment 2 effect")
This is a hierarchical model, in that we (very roughly) estimated the population distribution \(\text{normal}(\mu,\tau)\) from which the participants were drawn. Here is the estimate for \(\tau\), which quantifies the between-subject variation.
mcmc_dens(as.array(fit, pars="tau"))
To estimate the out-of-sample prediction error, we can look at the RMSD between the held-out test observations and the predictions from the fitted model. The predictions were off by about 2 to 3 hours on average, an excellent match for the model-estimated variation (\(\tau\) and \(\sigma\)). It’s a decent result, especially considering that the trajectory of participant 9, who was left out of the training dataset, was so peculiar.
mcmc_dens(as.array(fit, pars="RMSD"))
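As a rough consistency check (a sketch using draws already in the fit): for a brand-new subject, the model’s predictive standard deviation is \(\sqrt{\tau^2 + \sigma^2}\), which should be in the same ballpark as the RMSD.
# Compare the implied predictive SD for a new subject with the RMSD draws
draws <- as.data.frame(fit, pars = c("tau", "sigma", "RMSD"))
mean(sqrt(draws$tau^2 + draws$sigma^2))  # implied predictive SD
mean(draws$RMSD)                         # posterior mean RMSD (~2.7 per the summary)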