“Facts are stubborn things, but statistics are pliable.”
“The beta distribution is a continuous probability distribution that models random variables with values falling inside a finite interval. Use it to model subject areas with both an upper and lower bound for possible values. Analysts commonly use it to model the time to complete a task, the distribution of order statistics, and the prior distribution for binomial proportions in Bayesian analysis.”
“The standard beta distribution uses the interval [0,1]. This range is ideal for modeling probabilities, particularly for experiments with only two outcomes. However, other intervals are possible.”
“Unlike other distributions with shape and scale parameters, the beta distribution has two shape parameters, α and β. Both parameters must be positive values.”
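For reference (not part of the quoted passage), the density of the standard beta distribution on \([0, 1]\) is \[\begin{align*} f(x; \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}, \qquad 0 \le x \le 1, \end{align*}\] where \(B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha + \beta)\) is the beta function.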
The parameters control the shape of the curve. We can plot the densities that result from varying \(\alpha\) and \(\beta\) to illustrate how changes in \(\alpha\) and \(\beta\) affect \(\text{Beta}(\alpha, \beta)\).
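A minimal sketch of such a plot; the parameter pairs below are illustrative choices, not values from the original analysis.
## Illustrative (alpha, beta) pairs; any positive values could be used.
params <- list(c(0.5, 0.5), c(2, 2), c(2, 5), c(5, 2))
colors <- c("black", "blue", "darkgreen", "red")
curve(dbeta(x, params[[1]][1], params[[1]][2]), from = 0, to = 1,
      ylim = c(0, 3), ylab = "density", col = colors[1])
for (j in 2:length(params)) {
  curve(dbeta(x, params[[j]][1], params[[j]][2]), add = TRUE, col = colors[j])
}
legend("top", legend = sapply(params, paste, collapse = ", "), col = colors, lty = 1)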
We want to sample from Beta(\(a, b\)), but suppose we only know how to draw random variables from Uniform(0, 1). Since \(-\frac{1}{\lambda}\log(U)\) is Exponential(\(\lambda\)) when \(U \sim \text{Uniform}(0,1)\), and the sum of \(n\) independent Exponential(\(\lambda\)) variables is Gamma(\(n, \lambda\)), we can obtain a Gamma(\(n, \lambda\)) random variable (for integer \(n\)) with the transformation below,
sample.from.gamma <- function(n, lambda) {
  ## Sum of n independent Exponential(lambda) draws is Gamma(n, lambda);
  ## n must be a positive integer here.
  u <- runif(n, min = 0, max = 1)
  e <- -1 / lambda * log(u)
  return(sum(e))
}
which we can then transform into a Beta(\(a, b\)) random variable using the transformation below: if \(X \sim \text{Gamma}(a, \lambda)\) and \(Y \sim \text{Gamma}(b, \lambda)\) are independent, then \(X / (X + Y) \sim \text{Beta}(a, b)\).
sample.from.beta <- function(a, b, lambda) {
  #' Sample from the Beta(a, b) distribution.
  #'
  #' @description Draws one Beta(a, b) variate as X / (X + Y), where X and Y
  #'   are Gamma(a, lambda) and Gamma(b, lambda) variables built from uniforms.
  #' @param a numeric, first shape parameter (a positive integer here).
  #' @param b numeric, second shape parameter (a positive integer here).
  #' @param lambda numeric, rate used for the intermediate Gamma draws; it
  #'   cancels out of the ratio, so any positive value works.
  #' @return A single numeric value in [0, 1].
  #' @examples
  #' sample.from.beta(2, 5, lambda = 1)
  x <- sample.from.gamma(a, lambda)
  y <- sample.from.gamma(b, lambda)
  return(x / (x + y))
}
Next, we can draw some samples and plot the results of our transformations with the true distribution included as a red overlay.
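For example, a sketch using the arbitrary choices \(a = 2\) and \(b = 5\) (they must be positive integers here, since sample.from.gamma sums whole numbers of exponentials):
n <- 10000
a <- 2
b <- 5
beta.samples <- replicate(n, sample.from.beta(a, b, lambda = 1))
plot(density(beta.samples), main = "Sampled (black) vs. true Beta(2, 5) (red)")
curve(dbeta(x, a, b), add = TRUE, col = "red")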
If \(X \sim Beta({d_1 \over 2}, {d_2 \over 2})\), then \({d_2 X \over d_1(1-X)} \sim F(d_1, d_2)\) where \(d_1\) and \(d_2\) are the degrees of freedom parameters for the F distribution.
sample.from.F <- function(X, d1, d2) {
return(d2 * X / (d1 * (1 - X)))
}
And like before, we can draw some samples and plot them.
n <- 100
d1 <- 100
d2 <- 100
f.samples <- rep(0, n)
for (i in 1:n) {
  ## The transformation requires X ~ Beta(d1/2, d2/2).
  X <- sample.from.beta(d1 / 2, d2 / 2, lambda = 1)
  f.samples[i] <- sample.from.F(X, d1, d2)
}
We first sample from the Normal distribution using the Box-Muller transformation.
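Concretely, the transformation maps two independent Uniform(0, 1) draws \(U_1\) and \(U_2\) to a pair of independent standard normal variates, \[\begin{align*} Z_1 = \sqrt{-2 \ln U_1} \cos(2 \pi U_2), \qquad Z_2 = \sqrt{-2 \ln U_1} \sin(2 \pi U_2), \end{align*}\] which the functions below scale by \(\sigma\) and shift by \(\mu\).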
box.muller.transform1 <- function(n, mu, sigma) {
U1 <- runif(n)
U2 <- runif(n)
return(sigma * sqrt(-2 * log(U1)) * cos(2 * pi * U2) + mu)
}
box.muller.transform2 <- function(n, mu, sigma) {
U1 <- runif(n)
U2 <- runif(n)
return(sigma * sqrt(-2 * log(U1)) * sin(2 * pi * U2) + mu)
}
We can plot these transformations; they look normal!
plot(density(box.muller.transform2(100000, 10, 2)))
curve(dnorm(x, 10, 2), add=T, col="red")
plot(density(box.muller.transform1(100000, 2, 3)))
curve(dnorm(x, 2, 3), add = T, col = "red")
We are going to use an accept-reject method to sample from \(\mathcal{N}(0,1)\), using the Laplace distribution as the proposal distribution and envelope. To sample from the Laplace distribution we will use the following fact: given \(U \sim \text{Uniform}(-1/2, 1/2)\), then \(X = \mu - b \cdot \text{sgn}(U) \ln (1 - 2 \cdot |U|)\) has a Laplace distribution with parameters \(\mu\) and \(b\). We then write the following function to sample from the Laplace distribution.
rlaplace <- function(n, mu, b) {
U <- runif(n, min = -0.5, max = 0.5)
return(mu - b * sign(U) * log(1 - 2 * abs(U)))
}
plot(density(rlaplace(1000, 0, 1)))
And we will need the Laplace density.
dlaplace <- function(x, mu, b) 1 / (2 * b) * exp(-1 / b * abs(x - mu))
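The envelope constant used below is \(c = \sqrt{2e/\pi}\), the maximum of the ratio \(\phi(x) / f_{\text{Laplace}}(x \mid 0, 1)\), attained at \(|x| = 1\). As a quick numerical sanity check (not in the original write-up):
c.envelope <- sqrt(2 * exp(1) / pi)
x.grid <- seq(-10, 10, by = 0.01)
max(dnorm(x.grid) / dlaplace(x.grid, 0, 1))  ## largest ratio on the grid
c.envelope                                   ## should match (about 1.3155)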
The following function performs the accept-reject algorithm; it generates standard normal samples and then scales them by \(\sigma\) and shifts them by \(\mu\).
reject.normal <- function(n, mu, sigma) {
  ## Envelope constant: the maximum of dnorm(x) / dlaplace(x, 0, 1).
  c <- sqrt(2 * exp(1) / pi)
  sample <- vector(length = n)
  for (i in 1:n) {
    accept <- FALSE
    while (!accept) {
      u <- runif(1)
      x <- rlaplace(1, 0, 1)
      ## Accept x with probability dnorm(x) / (c * dlaplace(x, 0, 1)).
      if (u <= (dnorm(x, 0, 1) / (c * dlaplace(x, 0, 1)))) {
        accept <- TRUE
        sample[i] <- x
      }
    }
  }
  ## Scale and shift the standard normal draws to N(mu, sigma^2).
  return(sigma * sample + mu)
}
Here we run the algorithm and plot the results.
mu <- 2
sigma <- 2
samples <- reject.normal(100, mu, sigma)
plot(density(samples),
xlab = "x",
ylab = "f(x)",
main="Density estimate (black) and true (red)"
)
curve(dnorm(x, mu, sigma), from = -10, to = 10, add = T, col = "red")
Below we use the microbenchmark package to compare the execution times of the Box-Muller method and the accept-reject method. The results show that the Box-Muller method is much faster (notice the log scale).
library(microbenchmark)  # may already be loaded earlier in the document
library(ggplot2)         # provides the autoplot method used below
mbm1 <- microbenchmark("Accept-Reject" = {b <- reject.normal(100, mu, sigma)},
                       "Box-Muller-2" = {b <- box.muller.transform2(100, 10, 2)},
                       "Box-Muller-1" = {b <- box.muller.transform1(100, 10, 2)}
)
autoplot(mbm1)
This took a long time to figure out. I don’t even know if this is the EM algorithm. I just wrote down what I understood EM to be. First I loaded the data.
bivariate.normal <- read.csv(file = './bivariatenormal.txt', sep = ' ', header = T)
col1 <- bivariate.normal[, 1]
col2 <- bivariate.normal[, 2]
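The imputation step in the function below relies on the fact that, for a bivariate normal, the conditional mean of one coordinate given the other is linear: \[\begin{align*} E[X_1 \mid X_2 = x_2] &= \mu_1 + \frac{\sigma_{12}}{\sigma_{22}}\left( x_2 - \mu_2 \right), \\ E[X_2 \mid X_1 = x_1] &= \mu_2 + \frac{\sigma_{12}}{\sigma_{11}}\left( x_1 - \mu_1 \right). \end{align*}\] In the code these appear as the intercepts and slopes b10, b11, b20, and b21.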
And the following is my function,
my.em <- function(col1, col2) {
mean1 <- 0
mean2 <- 0
sigma11 <- 0
sigma22 <- 0
sigma12 <- 0
max.iterations <- 100
epsilon <- 1e-12
n <- length(col1)
## Save the indices of the missing data
index1 <- c()
index2 <- c()
## Expectation:
## The first Expectation step is simply imputing the missing values with naive
## means of the columns
for (i in 1:n) {
if (is.na(col1[i])) {
col1[i] <- mean(col1, na.rm = T)
index1 <- c(index1, i)
}
if (is.na(col2[i])){
col2[i] <- mean(col2, na.rm = T)
index2 <- c(index2, i)
}
}
i <- 0
convergence <- FALSE
while ((i < max.iterations) && (convergence == FALSE)) {
## Maximization:
s1 <- sum(col1)
s2 <- sum(col2)
s11 <- sum(col1 * col1)
s22 <- sum(col2 * col2)
s12 <- sum(col1 * col2)
## Save the old parameters before we update them
old.parameters <- c(mean1, mean2, sigma11, sigma12, sigma22)
## Update the parameters
mean1 <- s1 / n
mean2 <- s2 / n
sigma11 <- s11 / n - mean1 ** 2
sigma22 <- s22 / n - mean2 ** 2
sigma12 <- sigma21 <- s12 / n - mean1 * mean2
## Expectation:
## Regression coefficients for the conditional means E(X1 | X2) and E(X2 | X1)
b11 <- sigma12 / sigma22
b10 <- mean1 - b11 * mean2
b21 <- sigma12 / sigma11
b20 <- mean2 - b21 * mean1
## Replace the originally missing entries (their positions are saved in
## index1 and index2) with their conditional expectations given the other column:
col1[index1] <- b10 + col2[index1] * b11
col2[index2] <- b20 + col1[index2] * b21
## Package the parameters
parameters <- c(mean1, mean2, sigma11, sigma12, sigma22)
## Check for convergence across all parameters
if (max(abs(parameters - old.parameters)) < epsilon) {
convergence <- TRUE
}
i <- i + 1
}
return(list("Mean1" = mean1,
"Mean2" = mean2,
"Covariance" = matrix(c(sigma11, sigma12,
sigma12, sigma22), ncol=2, byrow = TRUE),
"Column1" = col1,
"Column2" = col2
)
)
}
em <- my.em(col1, col2)
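As a quick sanity check (not part of the original analysis), the EM means can be compared with the naive complete-case column means; they should be close but not identical.
rbind(EM = c(em$Mean1, em$Mean2),
      naive = c(mean(col1, na.rm = TRUE), mean(col2, na.rm = TRUE)))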
## I also plotted a normal distribution using the parameters I found.
library(MASS)  # provides mvrnorm; may already be loaded earlier
bivn <- mvrnorm(100, mu = c(em$Mean1, em$Mean2),
Sigma = em$Covariance)
plot(bivn, col="grey")
I compared my.em to the Amelia package for data imputation, and from the plot below you can see that my imputations (in red) are close to forming a straight line through the distribution, which makes sense since they are linear regression predictions from the observed coordinate.
library(Amelia)  # may already be loaded earlier in the document
a <- amelia(bivariate.normal)
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7
##
## -- Imputation 2 --
##
## 1 2 3 4
##
## -- Imputation 3 --
##
## 1 2 3 4 5
##
## -- Imputation 4 --
##
## 1 2 3 4
##
## -- Imputation 5 --
##
## 1 2 3 4 5
a2 <- a$imputations$imp5[, 2]
a1 <- a$imputations$imp5[, 1]
plot(em$Column1, em$Column2, col = "red",
     xlab = "Column 1", ylab = "Column 2")
points(a1, a2, col = "blue")
For this section we'll work with a data set on oil spills in US waters between 1974 and 1999. Each record includes the year and the following measurements for that year: the number of spills and the amounts of oil imported and exported. We expect the amount of oil shipped in a year to be related to the number of oil spills, so we fit the following model:
\[\begin{align*} N_{i} \sim Poisson(\lambda_{i} ) \end{align*}\]
where
\[\begin{align*} \lambda_{i} = \alpha_{1} X_{i,1} + \alpha_{2} X_{i,2} \end{align*}\]
First I'll derive and implement a Newton-Raphson method to find the maximum likelihood estimates (MLEs) of \(\alpha_1\) and \(\alpha_2\), then derive a Fisher scoring scheme for the same problem, compare the ease of implementation and performance of the two methods, and estimate the standard errors of the MLEs of \(\alpha_1\) and \(\alpha_2\).
A subset of our Oil Spills dataset.
|   | year | N | X1 | X2 |
|---|---|---|---|---|
| 21 | 1994 | 0 | 1.290 | 0.65 |
| 22 | 1995 | 1 | 1.235 | 0.59 |
| 23 | 1996 | 0 | 1.340 | 0.56 |
| 24 | 1997 | 0 | 1.440 | 0.51 |
| 25 | 1998 | 0 | 1.450 | 0.42 |
| 26 | 1999 | 0 | 1.510 | 0.44 |
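Writing \(k_i\) for the observed count \(N_i\) (to match the notation in the updates below), the Poisson log-likelihood, up to an additive constant, and its first derivatives are \[\begin{align*} \ell(\boldsymbol{\alpha}) &= \sum_{i=1}^n \left[ k_i \log(\alpha_1x_{i,1} + \alpha_2x_{i,2}) - (\alpha_1x_{i,1} + \alpha_2x_{i,2}) \right], \\ \frac{\partial \ell}{\partial \alpha_j} &= \sum_{i=1}^n x_{i,j}\left[ k_i(\alpha_1x_{i,1} + \alpha_2x_{i,2})^{-1} - 1 \right], \quad j = 1, 2. \end{align*}\]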
We assume \(N_i \sim \text{Poisson}(\lambda_i)\) with \(\lambda_i = \alpha_1x_{i,1} + \alpha_2x_{i,2}\), and we want the maximum likelihood estimate of \(\boldsymbol{\alpha} = \left( \alpha_1, \alpha_2 \right)\); we are going to use the Newton-Raphson method. Differentiating the log-likelihood a second time and applying Newton-Raphson gives the update \[\begin{align*} \boldsymbol{\alpha}^{t+1} = \boldsymbol{\alpha}^{t} - \begin{bmatrix} \sum_{i=1}^n {-k_ix_{i,1}^2 \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})^2} & \sum_{i=1}^n {-k_ix_{i,1}x_{i,2} \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})^2} \\ \sum_{i=1}^n {-k_ix_{i,1}x_{i,2} \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})^2} & \sum_{i=1}^n {-k_ix_{i,2}^2 \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})^2} \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^n x_{i,1}\left[ k_i(\alpha_1x_{i,1} + \alpha_2x_{i,2})^{-1} - 1 \right] \\ \sum_{i=1}^n x_{i,2}\left[ k_i(\alpha_1x_{i,1} + \alpha_2x_{i,2})^{-1} - 1 \right] \end{bmatrix}. \end{align*}\]
The Fisher scoring version replaces each \(k_i\) in the Hessian with its expectation \(E[k_i] = \lambda_i = \alpha_1x_{i,1} + \alpha_2x_{i,2}\), which gives:
\[\begin{align*} \boldsymbol{\alpha}^{t+1} = \boldsymbol{\alpha}^{t} - \begin{bmatrix} \sum_{i=1}^n {-x_{i,1}^2 \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})} & \sum_{i=1}^n {-x_{i,1}x_{i,2} \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})} \\ \sum_{i=1}^n {-x_{i,1}x_{i,2} \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})} & \sum_{i=1}^n {-x_{i,2}^2 \over (\alpha_1x_{i,1} + \alpha_2x_{i,2})} \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^n x_{i,1}\left[ k_i(\alpha_1x_{i,1} + \alpha_2x_{i,2})^{-1} - 1 \right] \\ \sum_{i=1}^n x_{i,2}\left[ k_i(\alpha_1x_{i,1} + \alpha_2x_{i,2})^{-1} - 1 \right] \end{bmatrix}. \end{align*}\]
The following program computes the MLE of the parameters. The fisher argument selects between Fisher scoring and Newton-Raphson (NR).
newton.raphson <- function(x1, x2, k, alpha,
epsilon = 10 ^ (-6),
maxit = 1000,
fisher = FALSE) {
## Gradient of the log-likelihood
lprime1 <- function(x1, x2, k, alpha) {
rbind(sum(x1 * (k / (alpha[1] * x1 + alpha[2] * x2) - 1)),
sum(x2 * (k / (alpha[1] * x1 + alpha[2] * x2) - 1)))
}
## Hessian of the log-likelihood (used by Newton-Raphson)
lprime2 <- function(x1, x2, k, alpha) {
nw <- sum(-k * x1 ** 2 / (alpha[1] * x1 + alpha[2] * x2) ** 2)
ne <- sw <- sum(-k * x1 * x2 / (alpha[1] * x1 + alpha[2] * x2) ** 2)
se <- sum(-k * x2 ** 2 / (alpha[1] * x1 + alpha[2] * x2) ** 2)
matrix(c(nw, ne, sw, se), byrow = T, ncol = 2)
}
## Expected Hessian (the negative of the Fisher information), used by Fisher scoring
fisher.information.matrix <- function(x1, x2, alpha) {
nw <- sum(-x1 ** 2 / (alpha[1] * x1 + alpha[2] * x2))
ne <- sw <- sum(-x1 * x2 / (alpha[1] * x1 + alpha[2] * x2))
se <- sum(-x2 ** 2 / (alpha[1] * x1 + alpha[2] * x2))
matrix(c(nw, ne, sw, se), byrow = T, ncol = 2)
}
i <- 1
convergence <- F
new.alpha <- c(0,0)
while ((i < maxit) && (convergence == F)) {
if (fisher == TRUE) {
new.alpha <- alpha - solve(fisher.information.matrix(x1, x2, alpha)) %*% lprime1(x1, x2, k, alpha)
} else {
new.alpha <- alpha - solve(lprime2(x1, x2, k, alpha)) %*% lprime1(x1, x2, k, alpha)
}
if (abs(new.alpha[1] - alpha[1]) < epsilon) {
convergence <- T
} else {
alpha <- new.alpha
i <- i + 1
}
}
cat(i, "iterations until convergence.")
return(alpha)
}
I call the function below, setting the initial alpha parameters to random values. The Fisher scoring method handles this without trouble,
alpha0 <- rnorm(2, 1, 20)
newton.raphson(x1 = oil.spills[, 3],
x2 = oil.spills[, 4],
k = oil.spills[, 2],
alpha = alpha0,
epsilon = 0.00001,
maxit = 100,
fisher = TRUE
)
## 13 iterations until convergence.
## [,1]
## [1,] 1.0971584
## [2,] 0.9375457
but for the NR method, I have to be careful about my initial choice.
alpha0 <- c(1, 1)
newton.raphson(x1 = oil.spills[, 3],
x2 = oil.spills[, 4],
k = oil.spills[, 2],
alpha = alpha0,
epsilon = 0.00001,
maxit = 100,
fisher = FALSE
)
## 4 iterations until convergence.
## [,1]
## [1,] 1.0971525
## [2,] 0.9375546
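Before comparing the two methods further, one way to estimate the standard errors (item (d) above) is to invert the Fisher information at the MLE and take the square roots of its diagonal. The sketch below reuses the MLE reported above (rounded) and the same design columns; the information entries mirror those of fisher.information.matrix inside newton.raphson, with the sign flipped.
x1 <- oil.spills[, 3]
x2 <- oil.spills[, 4]
alpha.hat <- c(1.0972, 0.9375)  # MLE copied (rounded) from the output above
lambda.hat <- alpha.hat[1] * x1 + alpha.hat[2] * x2
info <- matrix(c(sum(x1 ^ 2 / lambda.hat), sum(x1 * x2 / lambda.hat),
                 sum(x1 * x2 / lambda.hat), sum(x2 ^ 2 / lambda.hat)),
               ncol = 2, byrow = TRUE)
sqrt(diag(solve(info)))  # approximate standard errors of alpha1 and alpha2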
The Fisher scoring method is far more robust. The NR method often fails because the Hessian is frequently singular (or nearly so), and it can converge to different values depending on the starting alpha. The performance comparison is found below.