This section contains the setup and the various utility functions used throughout.
Libraries used:
library(data.table)
library(ggplot2)
Compiled code (any models used are shown later):
Non-inferiority randomised controlled trials generally have the goal of establishing whether a proposed intervention is no worse than an existing treatment.
For example, denote a new and existing treatment as \(A\) and \(B\) respectively and assume that we are treating a condition where we get complete or no response, i.e. the patient recovers or not. Denote the response of a patient as \(y\) with 1 and 0 coding recovery or not and take \(\pi_A\) and \(\pi_B\) to be the propensity for recovery (equivalently proportion recovered) under each of the treatments with higher values indicating a better chance of recovery.
For a non-inferiority trial, the goal would be to evaluate whether \(\pi_A\) is not substantially lower than \(\pi_B\) (statisticians love double negatives for some reason - it probably has something to do with Karl Popper, but that is for another time). The concept of “substantially lower” is defined on a case-by-case basis but here could be represented in terms of (1) an absolute difference between the two propensities for recovery (equivalently, proportions recovered), (2) a ratio of the two response propensities, or (3) a difference in derived quantities, such as an odds ratio.
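To make those three scales concrete, here is a minimal sketch of how each could be summarised from posterior draws (the draws below are purely hypothetical placeholders; the actual model and data come later):
# purely illustrative posterior draws for the two recovery propensities
pi_A <- rbeta(10000, 50, 52)
pi_B <- rbeta(10000, 51, 51)
# (1) absolute difference, (2) ratio of propensities, (3) odds ratio
delta <- pi_A - pi_B
ratio <- pi_A / pi_B
or <- (pi_A / (1 - pi_A)) / (pi_B / (1 - pi_B))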
Usually, clinical expertise is coerced (joke-ish) into providing a non-inferiority margin \(\omega\) for each study that defines how much worse the new treatment can be while still being acceptable as an adequate clinical response. So, let’s say that 49 people out of 100 recovered under treatment \(A\) and 50/100 recovered under treatment \(B\) and we had chosen a non-inferiority margin of 0.05. Might we then conclude that \(A\) was an acceptable alternative?
Well, if we assume a Binomial likelihood and a (non-informative) Beta prior (see https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions) then we get a closed-form solution for the posterior distributions of \(\pi_A\) and \(\pi_B\). We can easily sample from these in order to summarise and conduct inference:
nsmpl <- 100000
# posterior for each arm: Beta(1 + recoveries, 1 + non-recoveries), i.e. a
# Beta(1, 1) prior updated with the observed data
pi_A <- rbeta(nsmpl, 1 + 49, 1 + 100 - 49)
pi_B <- rbeta(nsmpl, 1 + 50, 1 + 100 - 50)
d <- data.table(pi_A, pi_B)
d <- melt(d, measure.vars = names(d))
ggplot(d, aes(x = value, group = variable, col = variable)) +
geom_density() +
theme_bw() +
scale_x_continuous(expression(pi)) +
scale_y_continuous("Density") +
scale_colour_discrete("Group") +
theme(legend.position = "right")
Figure 1: Posterior distributions of the response rates in each arm
We can use the posterior draws to look at the difference in the propensity for response in the two groups as shown below:
# posterior draws for the difference in response propensities
d <- data.table(delta = pi_A - pi_B)
omega <- 0.05
# density estimate restricted to the non-inferiority region (delta > -omega),
# used to shade that area in the plot
dd <- density(d$delta)
dd <- data.table(x = dd$x, y = dd$y)
dd <- dd[x >= -omega, ]
ggplot(d, aes(x = delta)) +
geom_ribbon(data = dd,
aes(x=x, ymax=y),
ymin=0, fill="red", alpha=0.5)+
geom_density() +
geom_vline(xintercept = 0) +
geom_vline(xintercept = -omega, lty = 2, lwd = 0.3) +
theme_bw() +
scale_x_continuous(expression(paste(delta, "=", pi[A], "-", pi[B]))) +
scale_y_continuous("Density")
Figure 2: Posterior distribution for the difference between response rates
The difference between the two arms is centred just below zero and the probability that recovery with treatment \(A\) is non-inferior to recovery with \(B\) (the region where \(\delta > -\omega\)) is shaded in red. We are interested in whether the red shaded area gives us sufficient certainty to justify adopting treatment \(A\) as an alternative to \(B\), i.e. something like:
\[ \mathsf{Pr}(\pi_A - \pi_B > -0.05) > 0.975 \tag{1} \]
By non-inferiority, all we are saying here is that the propensity for recovery under treatment \(A\) is not more than the NI margin (\(\omega = 0.05\)) below the propensity for recovery with treatment \(B\). Based on the data and model used in this example, we have
mean(pi_A - pi_B > -omega)
## [1] 0.71968
which is below the 0.975 threshold that I included in (1) above and so, on the basis of this evidence, we would not declare treatment \(A\) to be non-inferior. However, if we increased the sample size to 1000 per arm and observed that 490/1000 recovered under \(A\) and 500/1000 recovered under \(B\), then we have
nsmpl <- 100000
pi_A <- rbeta(nsmpl, 1 + 490, 1 + 1000 - 490)
pi_B <- rbeta(nsmpl, 1 + 500, 1 + 1000 - 500)
d <- data.table(delta = pi_A - pi_B)
dd <- density(d$delta)
dd <- data.table(x = dd$x, y = dd$y)
dd <- dd[x >= -omega, ]
ggplot(d, aes(x = delta)) +
geom_ribbon(data = dd,
aes(x=x, ymax=y),
ymin=0, fill="red", alpha=0.5)+
geom_density() +
geom_vline(xintercept = 0) +
geom_vline(xintercept = -omega, lty = 2, lwd = 0.3) +
theme_bw() +
scale_x_continuous(expression(paste(delta, "=", pi[A], "-", pi[B]))) +
scale_y_continuous("Density")
Figure 3: Posterior distribution for the difference between response rates
and the revised probability that \(A\) is non-inferior to \(B\) is
mean(pi_A - pi_B > -omega)
## [1] 0.99092
which is materially above the 0.975 decision probability that I nominated by Fisherian convention.
There are plenty of sample size calculators that will compute the number of participants per arm for a non-inferiority trial comparing proportions, for example the Sealed Envelope calculator used below. These use a normal approximation to derive an analytic estimate of the sample size. However, simulation is more flexible and not much more difficult.
As an alternative example for a sample size calculation, consider a trial where we want to evaluate whether the reactogenicity of concurrent administration of COVID19 vaccines with a flu vaccine is non-inferior to delivering a COVID19 vaccine with a saline placebo in place of the flu vaccine. For this example, let \(y\) denote the occurrence of an adverse reaction (1 = AE, 0 = no AE), so higher values of \(\pi_j\), \(j \in \{A, B\}\) are now considered bad. Also assume that the propensity for adverse reactions when administering the COVID19 vaccine by itself is 50% and that we will tolerate a whopping non-inferiority margin of 25%.
Entering these values, along with frequentist type I and type II errors of \(\alpha = 0.05\) and \(\beta = 0.1\) (90% power), into the online calculator at Sealed Envelope indicates that 69 people are required in each arm.
Figure 4: Sealed Envelope sample size calculator results
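For reference, the normal approximation can be written down directly. The sketch below is the standard textbook formula for non-inferiority of two proportions with a one-sided \(\alpha\); I have not checked which exact variant Sealed Envelope implements, and the helper name n_ni is just illustrative.
# normal-approximation sample size per arm for non-inferiority of proportions
# (one-sided alpha, power 1 - beta); assumes higher proportions are worse,
# as in the reactogenicity example
n_ni <- function(p_A, p_B, omega, alpha = 0.05, beta = 0.1) {
  z <- qnorm(1 - alpha) + qnorm(1 - beta)
  num <- z^2 * (p_A * (1 - p_A) + p_B * (1 - p_B))
  den <- (omega - (p_A - p_B))^2
  ceiling(num / den)
}
n_ni(p_A = 0.5, p_B = 0.5, omega = 0.25)
## [1] 69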
A simulation-based approach to the same calculation might look something like this:
N <- 69
# simulate 1000 trials with a true AE rate of 0.5 in both arms
y_A <- rbinom(1000, N, 0.5)
y_B <- rbinom(1000, N, 0.5)
p_decision <- 0.95
omega <- 0.25
win <- lapply(1:1000, function(i){
  # Beta posterior for the AE rate in each arm
  pi_A <- rbeta(10000, 1 + y_A[i], 1 + N - y_A[i])
  pi_B <- rbeta(10000, 1 + y_B[i], 1 + N - y_B[i])
  res <- c(mu_pi_A = mean(pi_A),
           mu_pi_B = mean(pi_B),
           # declare non-inferiority if Pr(pi_A < pi_B + omega) > p_decision
           win = mean(pi_A - omega < pi_B) > p_decision
  )
  return(res)
})
pwr <- data.table(do.call(rbind, win))
# proportion of simulated trials in which non-inferiority was declared
pwr[, .(pwr = mean(win))]
## pwr
## 1: 0.911
which is about 90% power, as suggested by the sample size formula (note that I write the inequality as pi_A - omega < pi_B because it makes it clearer for me). In practice, you would just adjust the value of the N variable until you reached the implied frequentist power that you are trying to obtain at a given decision threshold, as sketched below.
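For example, a minimal sketch of that tuning loop might be (the helper name power_at_n and the candidate values of N are just illustrative):
# estimate power for a candidate per-arm sample size by simulation
power_at_n <- function(N, nsim = 1000, p_true = 0.5,
                       omega = 0.25, p_decision = 0.95) {
  y_A <- rbinom(nsim, N, p_true)
  y_B <- rbinom(nsim, N, p_true)
  win <- sapply(1:nsim, function(i){
    pi_A <- rbeta(10000, 1 + y_A[i], 1 + N - y_A[i])
    pi_B <- rbeta(10000, 1 + y_B[i], 1 + N - y_B[i])
    mean(pi_A - omega < pi_B) > p_decision
  })
  mean(win)
}
# evaluate a small grid of candidate sample sizes
sapply(c(50, 60, 69, 80), power_at_n)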
Now, if you wanted to extend the study to look at multiple COVID19 brands and multiple varieties of flu vaccine, you could simply scale up the sample size. For example, if you were considering the second dose of AZ and PF along with 3 brands of flu vaccine then there are 6 co-administration cohorts to consider and therefore 12 groups of 69, i.e. \(69 \times 12 = 828\) participants required (6 groups under co-administration with flu, 6 under co-administration with placebo). However, it is possible to have a common control within each COVID19 brand and thereby save on the number of participants required, which I illustrate below by defining a shared control arm of 130 participants per brand (130 for AZ, 130 for PF), giving \(2 \times (3 \times 69 + 130) = 674\) participants in total.
N <- 69
N_PBO <- 130
# simulate 1000 trials assuming a true AE rate of 0.5 in every arm
y_AZ_FLU1 <- rbinom(1000, N, 0.5)
y_AZ_FLU2 <- rbinom(1000, N, 0.5)
y_AZ_FLU3 <- rbinom(1000, N, 0.5)
y_AZ_PBO <- rbinom(1000, N_PBO, 0.5)
y_PF_FLU1 <- rbinom(1000, N, 0.5)
y_PF_FLU2 <- rbinom(1000, N, 0.5)
y_PF_FLU3 <- rbinom(1000, N, 0.5)
y_PF_PBO <- rbinom(1000, N_PBO, 0.5)
p_decision <- 0.99
omega <- 0.25
win <- lapply(1:1000, function(i){
  # Beta posterior for the AE rate in each arm
  pi_AZ_FLU1 <- rbeta(10000, 1 + y_AZ_FLU1[i], 1 + N - y_AZ_FLU1[i])
  pi_AZ_FLU2 <- rbeta(10000, 1 + y_AZ_FLU2[i], 1 + N - y_AZ_FLU2[i])
  pi_AZ_FLU3 <- rbeta(10000, 1 + y_AZ_FLU3[i], 1 + N - y_AZ_FLU3[i])
  pi_AZ_PBO <- rbeta(10000, 1 + y_AZ_PBO[i], 1 + N_PBO - y_AZ_PBO[i])
  pi_PF_FLU1 <- rbeta(10000, 1 + y_PF_FLU1[i], 1 + N - y_PF_FLU1[i])
  pi_PF_FLU2 <- rbeta(10000, 1 + y_PF_FLU2[i], 1 + N - y_PF_FLU2[i])
  pi_PF_FLU3 <- rbeta(10000, 1 + y_PF_FLU3[i], 1 + N - y_PF_FLU3[i])
  pi_PF_PBO <- rbeta(10000, 1 + y_PF_PBO[i], 1 + N_PBO - y_PF_PBO[i])
  # compare each co-administration arm to the shared placebo arm for its brand
  res <- c(
    winA1 = mean(pi_AZ_FLU1 - omega < pi_AZ_PBO) > p_decision,
    winA2 = mean(pi_AZ_FLU2 - omega < pi_AZ_PBO) > p_decision,
    winA3 = mean(pi_AZ_FLU3 - omega < pi_AZ_PBO) > p_decision,
    winP1 = mean(pi_PF_FLU1 - omega < pi_PF_PBO) > p_decision,
    winP2 = mean(pi_PF_FLU2 - omega < pi_PF_PBO) > p_decision,
    winP3 = mean(pi_PF_FLU3 - omega < pi_PF_PBO) > p_decision
  )
  # family records whether any of the six comparisons was declared non-inferior
  fam <- any(res)
  return(c(res, family = fam))
})
pwr <- data.table(do.call(rbind, win))
colMeans(pwr)
## winA1 winA2 winA3 winP1 winP2 winP3 family
## 0.855 0.858 0.877 0.887 0.859 0.876 0.999
A nice thing about a simulation approach is that you can explicitly test the false positive rate (along with any variations in assumptions you would like to explore) by simulating the AE rate in the treatment groups exactly on the NI margin.
You will have noted in the above code that I increased the decision threshold to 99%. I did this because, when I used a 95% threshold, the type I error (as computed below) was well above the desired one-sided 0.05.
N <- 69
N_PBO <- 130
p_decision <- 0.99
omega <- 0.25
# simulate with the true AE rate in each co-administration arm sitting exactly
# on the non-inferiority margin (0.5 + omega = 0.75) to assess the type I error
y_AZ_FLU1 <- rbinom(1000, N, 0.5 + omega)
y_AZ_FLU2 <- rbinom(1000, N, 0.5 + omega)
y_AZ_FLU3 <- rbinom(1000, N, 0.5 + omega)
y_AZ_PBO <- rbinom(1000, N_PBO, 0.5)
y_PF_FLU1 <- rbinom(1000, N, 0.5 + omega)
y_PF_FLU2 <- rbinom(1000, N, 0.5 + omega)
y_PF_FLU3 <- rbinom(1000, N, 0.5 + omega)
y_PF_PBO <- rbinom(1000, N_PBO, 0.5)
win <- lapply(1:1000, function(i){
  pi_AZ_FLU1 <- rbeta(10000, 1 + y_AZ_FLU1[i], 1 + N - y_AZ_FLU1[i])
  pi_AZ_FLU2 <- rbeta(10000, 1 + y_AZ_FLU2[i], 1 + N - y_AZ_FLU2[i])
  pi_AZ_FLU3 <- rbeta(10000, 1 + y_AZ_FLU3[i], 1 + N - y_AZ_FLU3[i])
  pi_AZ_PBO <- rbeta(10000, 1 + y_AZ_PBO[i], 1 + N_PBO - y_AZ_PBO[i])
  pi_PF_FLU1 <- rbeta(10000, 1 + y_PF_FLU1[i], 1 + N - y_PF_FLU1[i])
  pi_PF_FLU2 <- rbeta(10000, 1 + y_PF_FLU2[i], 1 + N - y_PF_FLU2[i])
  pi_PF_FLU3 <- rbeta(10000, 1 + y_PF_FLU3[i], 1 + N - y_PF_FLU3[i])
  pi_PF_PBO <- rbeta(10000, 1 + y_PF_PBO[i], 1 + N_PBO - y_PF_PBO[i])
  res <- c(
    winA1 = mean(pi_AZ_FLU1 - omega < pi_AZ_PBO) > p_decision,
    winA2 = mean(pi_AZ_FLU2 - omega < pi_AZ_PBO) > p_decision,
    winA3 = mean(pi_AZ_FLU3 - omega < pi_AZ_PBO) > p_decision,
    winP1 = mean(pi_PF_FLU1 - omega < pi_PF_PBO) > p_decision,
    winP2 = mean(pi_PF_FLU2 - omega < pi_PF_PBO) > p_decision,
    winP3 = mean(pi_PF_FLU3 - omega < pi_PF_PBO) > p_decision
  )
  fam <- any(res)
  return(c(res, family = fam))
})
pwr <- data.table(do.call(rbind, win))
colMeans(pwr)
## winA1 winA2 winA3 winP1 winP2 winP3 family
## 0.018 0.008 0.010 0.009 0.005 0.013 0.059
Anyway, for those still conscious, that's an introduction to non-inferiority trials and their sample size calculations.