Sampling from truncated distributions

Introduction

In some instances it might be necessary to simulate data from a non-standard distribution. For example, you might have time to event data with the origin being the date of birth and you want to sample from that same distribution when participants enter a trial 24 months of age. We can generate the distribution from the time of birth easily, which is shown on the LHS of the plot below. However, what we want is the distribution on the RHS, which is only non-zero from the entry date (and I have set this to 104 weeks).

## Loading required package: survival

There are a few ways to achieve this goal. I will demonstrate two.

Rejection sampling

Rejection sampling is a very generic but an inefficient way of sampling from arbitrary distributions. It works if you can evaluate the target distribution density and you can find a proposal distribution that envelopes the target distribution. For example, say we want a left truncated weibull distribution, which is what is shown on the RHS figure above.

\[ \begin{aligned} \pi(t) &\sim \begin{cases} \mathcal{W}(\alpha, \lambda) & t > \tau \\ 0 & \text{otherwise} \end{cases} \end{aligned} \]

For the proposal distribution we could use a vanilla non-truncated weibull distribution with the relevant parameters. We then apply the following algorithm.

Algorithm 1: Rejection sampling procedure

\[ \begin{aligned} \text{Generate proposal : }& \quad \quad Y \sim \mathcal{W}(\alpha, \lambda) \\ \text{Generate standard uniform samples : }& \quad \quad U \sim \mathcal{U}() \\ \text{Evaluate the density at the proposal : }& \quad \quad \pi(y) = \pi(y) > \tau \text{ ? } \pi(y) \text{ : } 0 \\ \text{Decide if to accept the proposal : }& \quad \quad A = u \le \pi(y) / w(y) \end{aligned} \]

If you accept then \(y\) is a draw from the target distribution. If not, then start the procedure again.

Implementation

A basic version of this procedure in R might look something like this.

rweibullPHLT1 <- function(n, alpha, lambda, lb = 0, nsimstop = 20) {
  
  y <- u <- pi <- numeric(n)
  accepted <- rep(FALSE, n)
  rejected <- n;  i <- 1
  while(rejected>0 & i < nsimstop){
    
    # distribution of y envelops the target dist
    y[!accepted] <- flexsurv::rweibullPH(rejected, alpha, lambda)
    u[!accepted] <- runif(rejected)
    
    # target dist
    pi[!accepted] <- flexsurv::dweibullPH(y[!accepted], alpha, lambda)
    pi[y <= lb] <- 0
    
    # decision
    accepted <- u <= (pi / flexsurv::dweibullPH(y, alpha, lambda))
    table(accepted)
    rejected <- sum(!accepted)
    i <- i + 1
  }
  y
}

However, generating a large number of samples and comparing the empirical distribution to the standard weibull distribution shows they are not exactly what is expected. The density of the truncated distribution seems elongated.

y0a <- flexsurv::rweibullPH(n, alpha, exp(logb0))
y0b <- rweibullPHLT1(n, alpha, exp(logb0), lb = 104)
 
hgA <- density(y0a)
hgB <- density(y0b) 

plot(1, type = "n", xlim = range(c(hgA$x, hgB$x)), ylim = range(c(hgA$y, hgB$y)))
lines(hgA$x, hgA$y)
lines(hgB$x, hgB$y, col = 2)

The reason for this follows from the definition of a valid probability distribution, i.e. it must integrate to unity. If the density of the target distribution is scale by the probability mass to the right of 104 in the proposal distribution then the distributions align.

qc <- flexsurv::pweibullPH(104, alpha, exp(logb0), lower.tail = F)
plot(1, type = "n", xlim = range(c(hgA$x, hgB$x)), ylim = range(c(hgA$y, hgB$y)))
lines(hgA$x, hgA$y)
lines(hgB$x, hgB$y*qc, col = 2)

Another approach

Here is another approach. This samples directly by using a uniform sample from a distribution constrained by the lower bound implied by lb. The uniform sample is then used to create a draw from the target distribution by using the inverse cdf approach.

Algorithm 2: Inverse cdf

\[ \begin{aligned} \text{Evaluate probability mass to LHS : }& \quad \quad p = \int_0^{lb} f(t) dt \\ \text{Generate truncated uniform : }& \quad \quad U \sim \mathcal{U}(p, 1) \\ \text{Generate sample by inverse cdf : }& \quad \quad y = cdf^{-1}(u) \end{aligned} \]

Here is an implementation.

rweibullPHLT2 <- function(n, alpha, lambda, lb = 0) {
  p <- flexsurv::pweibullPH(lb, alpha, exp(logb0))
  if(is.nan(p)) p = 1 - 1e-08;
  if (p >= 1) p = 1e-08;
  if (p <= 0) p = 1e-08;
  u = runif(n, p, 1);      
  z = (-log(1-u)/lambda)^(1/alpha);
  z
}

y0c <- rweibullPHLT2(n, alpha, exp(logb0), lb = 104)
hgC <- density(y0c) 

plot(1, type = "n", xlim = range(c(hgA$x, hgB$x)), ylim = range(c(hgA$y, hgB$y)))
lines(hgA$x, hgA$y)
lines(hgB$x, hgB$y, col = 2)
lines(hgC$x, hgC$y, col = 3)