Realisations of random variables

A random variable \(X\) is a quantity that can take any of a range of values, with some values of \(X\) more likely to be observed than others. That is, \(X\) has a probability distribution defined over its possible values. We are probably all familiar with the common example of \(X \sim Normal(0, 1)\): the classic bell curve centred on zero with a standard deviation of 1. In this case, values of \(X\) around 0 are much more likely than extreme positive or negative values.

x <- seq(-4, 4, 0.1)
# Plot the standard normal density over a grid of values
plot(x, dnorm(x), xlab = "X ~ Normal(0, 1)", ylab = "Density", type = "l")

Simulating values from distributions

Drawing values from a distribution such as the standard normal is easy: there are efficient algorithms and analytical expressions for doing so, and R exposes them through built-in functions. These functions normally begin with an r followed by an abbreviated form of the distribution's name. For example, we can use rnorm and rbinom to sample from the normal and binomial distributions.

# Set seed
set.seed(1997)
# The standard normal 
x <- rnorm(1e6, 0, 1)
plot(density(x), main = "")

# The binomial distribution
x <- rbinom(1e6, 10, 0.3)
hist(x, freq = FALSE, main = "")
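As a quick check of my own (not part of the original walkthrough), we can set the empirical frequencies of the binomial draws against the exact probabilities from dbinom; with a million draws the two columns should agree to a few decimal places.

# Compare empirical frequencies with the exact binomial pmf
empirical <- table(factor(x, levels = 0:10)) / length(x)
exact <- dbinom(0:10, size = 10, prob = 0.3)
round(cbind(empirical = as.numeric(empirical), exact = exact), 4)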

Distributions where no R function exists

An analytical approach

There are many valid probability distributions that have no R function to sample from them. If you need to sample from such a distribution, you will have to build the sampler yourself.

One strategy is to exploit the cumulative distribution function (CDF) to analytically derive a scheme for simulating values; this is known as inverse transform sampling. The example that’s often used is the exponential distribution. It has the probability density function \(f(t) = \lambda\exp{(-\lambda t)}\) over the support \(0 \leq t < \infty\). Its CDF can be derived as

\[\begin{align} F(\tau) &= \int_0^\tau f(t) ~dt, ~ \tau < \infty\\ &= \lambda\int_0^\tau \exp{(-\lambda t)} ~ dt\\ &= \lambda\left[-\frac{1}{\lambda}\exp{(-\lambda t)}\right]^\tau_0\\ &= \lambda\left(-\frac{1}{\lambda}\exp{(-\lambda\tau)} + \frac{1}{\lambda}\right)\\ &= 1 - \exp{(-\lambda\tau)} \end{align}\]

The trick here is to note that, whatever the value of \(\tau\), the CDF returns a value \(u = 1 - \exp{(-\lambda\tau)}\) with \(0 < u < 1\). We can then manipulate this expression to isolate \(\tau\), which leads us to

\[\begin{align} u &= 1 - \exp{(-\lambda\tau)}\\ 1 - u &= \exp{(-\lambda\tau)}\\ \log(1 - u) &= -\lambda\tau\\ -\frac{1}{\lambda}\log(1 - u) &= \tau \end{align}\]

Now we can simulate values from the exponential distribution by drawing random values from the uniform distribution on the interval (0, 1) and plugging them in for \(u\) in the equation above.

If you don’t believe me, we can show that the expression we just derived behaves the same way as the standard R function rexp.

# Define our custom function (the inverse CDF derived above)
r_exp <- function(u, lambda) -1/lambda * log(1 - u)
# Simulate one million values from a uniform distribution
u <- runif(1e6)
# Then from the exponential (set lambda to 0.5)
t_custom <- r_exp(u, 0.5)
# Simulate from the standard R function
t_standard <- rexp(1e6, 0.5)
# Plot results side-by-side
par(mfrow = c(1, 2))
hist(
  t_custom, 
  breaks = seq(0, 30, 0.5),
  main = "Homebrew R function"
)
hist(
  t_standard, 
  breaks = seq(0, 30, 0.5),
  main = "Standard R function"
)
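
Histograms are a fairly blunt comparison, so here is an extra check of my own: a quantile-quantile plot of the two samples. If both samplers target the same distribution, the points should fall along the identity line.

# Q-Q plot: agreement shows as points on the identity line
par(mfrow = c(1, 1))
qqplot(t_custom, t_standard, xlab = "Homebrew quantiles", ylab = "rexp quantiles")
abline(0, 1)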

# Check that quantiles are the same
r_exp(0.5, 0.5) # this is the median
## [1] 1.386294
qexp(0.5, 0.5) # this is the median from the standard r function
## [1] 1.386294
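
The same agreement holds across a whole grid of probabilities, not just the median (this check is my addition, comparing our inverse CDF directly against qexp).

# Compare quantiles across a grid of probabilities
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)
cbind(p, homebrew = r_exp(p, 0.5), standard = qexp(p, 0.5))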

The computational approach

Of course, the analytical approach has drawbacks. It is not always possible to derive the expression required to link a uniform random variable to quantiles of the target variable, and even when it is possible, the derivation can be hard, or just boring. It is often much easier to get approximate results by discretising the probability density function and using the sample() function to draw values from a fine grid over the support.

To walk you through my approach, let’s consider the log-Cauchy distribution. For \(x > 0\), it has the probability density function \[\begin{align} f(x) &= \left(\pi\sigma x \left(1 + \left(\frac{\log(x) - \mu}{\sigma}\right)^2\right)\right)^{-1} \end{align}\]

It is not impossible to integrate this, but it looks painful. So, what do we do?

Firstly, we define the support and probability density function in R.

# Define the support
support <- seq(0.01, 15, 0.01)
# Define the probability density function
pdf <- function(x, mu, sigma){
  denominator <- x * pi * sigma * (1 + ((log(x) - mu)/sigma)^2)
  1/denominator
}
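
Before going further, a sanity check of my own: the density times the grid spacing should roughly integrate to one. For a heavy-tailed distribution like the log-Cauchy, the sum will fall short of 1, because the mass beyond the truncation point at 15 is missing; the normalisation in the next step silently redistributes that missing mass over the grid.

# Riemann-sum check: density times grid width approximates the total mass
sum(pdf(support, 1, 0.5)) * 0.01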

Secondly, we compute the density at each point in our support. This effectively discretises the probability density function: the support is split into a finite grid, and the density values give the relative probability of observing each point on that grid.

# Find the probability densities over the support
probs <- pdf(support, 1, 0.5)
probs <- probs/sum(probs) # normalise so the probabilities sum to 1

Here is our discrete approximation to the log-Cauchy distribution.

plot(support, probs, type = "l", xlab = "Support", ylab = "Probability")

From here, the only remaining step is to define a function that samples from the support, with the probability of selecting each point in the support being proportional to the probs we calculated previously. Thankfully, this process is simplified by the sample function in R.

samp_fun <- function(n) sample(
  support, 
  size = n,        # n draws
  replace = TRUE,  # with replacement
  prob = probs     # using these probabilities
)

Now, we can do approximate simulations from the log-Cauchy distribution.

# One million values from our grid
simulated <- samp_fun(1e6)
# Plot simulated values
hist(
  simulated, 
  breaks = seq(0, 15, 0.25),
  freq = FALSE,
  main = "log-Cauchy (1, .5)",
  xlab = "Support"
)
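
One more check of my own: since \(\log(X)\) is Cauchy with location \(\mu\), the median of the log-Cauchy(\(\mu\), \(\sigma\)) distribution is \(\exp(\mu)\), so the sample median of our draws should sit near \(\exp(1) \approx 2.72\).

# The log-Cauchy median is exp(mu); here mu = 1
median(simulated)
exp(1)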

Summary

R has functions to simulate values from many common random variables. However, if you need to simulate values from a random variable that has no corresponding sampling function in R, you may need to

  1. put pen to paper to link a uniform random variable to quantiles of your target variable, or
  2. use a grid-based approach to sampling the distribution based on its probability density function.

The second is usually the more convenient approach.
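
To close, here is a minimal sketch of the grid-based approach wrapped into a reusable helper. The name make_grid_sampler is mine, not a standard R function; it simply packages the steps above (discretise, normalise, sample) and reuses the pdf function defined earlier.

# A hypothetical helper generalising the grid-based approach:
# given a density function and a grid, return a sampling function
make_grid_sampler <- function(density_fun, support, ...) {
  probs <- density_fun(support, ...)
  probs <- probs / sum(probs)  # normalise over the grid
  function(n) sample(support, size = n, replace = TRUE, prob = probs)
}

# Rebuild the log-Cauchy sampler from above in one line
r_log_cauchy <- make_grid_sampler(pdf, seq(0.01, 15, 0.01), 1, 0.5)
hist(r_log_cauchy(1e5), breaks = seq(0, 15, 0.25), freq = FALSE, main = "")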