W. Q. Meeker, L. A. Escobar, and J. K. Freels
03 August 2018
The use of
Methods for generating bootstrap samples
How to obtain and interpret simulation-based parametric pointwise bootstrap confidence intervals
How to obtain and interpret simulation-based nonparametric pointwise bootstrap confidence intervals
CI’s based on the normal approximation
Are adequate for informal analyses
Have good coverage probabilities when the sample size is large
CI’s based on the relative likelihood
Have coverage probabilities closer to \(1-\alpha\) than CI’s based on normal approximation
Can be unreasonably computationally intensive
CI’s based on simulation (bootstrapping)
Have better coverage probabilities than those based on the normal approximation
Have coverage probabilities that can be competitive with likelihood-based methods
Objective:
Want to base the CI on the true distribution of the estimator, not a large sample approximation
Problem:
Observed sample data provides limited information on the overall population (a snapshot)
Solution:
If the sample data was produced by the same process governing the overall population we can…
Repeatedly simulate the sampling process that generated the original data (create many snapshots)
Use the collective information from these snapshots to compute a confidence interval
This collection of snapshots is called a bootstrap sample
Suppose we’re interested in computing a \(100(1-\alpha)\%\) CI for some value \(\theta \in \mathbb{R}^+\)
Since \(\theta\) is strictly positive, a CI based on the normal approximation assumes that
\[ \Big[\underset{\sim}{\theta}, \overset{\sim}{\theta}\Big] = \Big[\widehat{\theta}/w,\widehat{\theta}\times w\Big] \]
where \(w = \exp\big(z_{1-\alpha/2}\widehat{se}_{\log_{\theta}}/\widehat{\theta}\big)\).
However, this interval is based on the assumption that
\[ Z_{\log(\hat{\theta)}}=\frac{\log(\hat{\theta})-\log(\theta)}{\widehat{se}_{\log(\hat{\theta})}}\sim NOR(0,1) \]
But, what if this assumption was not valid - what if the true distribution of \(Z_{\log(\widehat\theta)}\) is positively or negatively-skewed?
Simulation allows us to base the CI on the actual distribution of \(Z_{\log(\hat{\theta)}}\)
We’ll focus on two of them
Parametric bootstrap sampling
Nonparametric bootstrap sampling
First, let’s make sure that we’ve defined all our terms
\(\;\;F(t|\theta) \rightarrow\) The population cdf based on the true, but unknown value of \(\theta\)
\(\;\;\;\;\;\;\;\;\;n \rightarrow\) The size of our observed data (the initial snapshot)
\(\;F(t|\hat{\theta})\rightarrow\) The population cdf based on our estimate of \(\theta\)
\(F(t|\hat{\theta}^*_i)\rightarrow\) The population cdf based on the \(i^{th}\) bootstrap estimate of \(\theta\)
\(\;\;\;\;\;\;\;\;\;B \rightarrow\) The size of the bootstrap sample (number of simulations)
Observe \(n\) observations generated from some process with unknown distrubution \(F(t|\mathbf{\theta})\)
Use maximum likelihood to estimate the form of \(F(t)\) and the value(s) of \(\hat{\theta}\)
Generate bootstrap samples of size \(n\) from the estimated distribution
For bootstrap sample \(i = 1\) simulate \(n\) observations from \(F(t|\mathbf{\hat{\theta}})\)
Use ML to estimate \(\hat{\theta}^*_1\) for bootstrap sample \(1\)
Repeat this process of sampling & estimating until \(B\) samples have been generated and \(B\) estimates of \(\hat{\theta}^*_i, i = 1,...,B\) have been computed
\(B\) should be a large number
Order the estimated values \(\hat{\theta}^*_i, i = 1,...,B\) from smallest to largest
The upper and lower endpoints of the \(95\%\) CI for \(\theta\) using the bootstrap method are found as
The lower endpoint is the value of \(\hat{\theta}^*_i\) that is the \(B \times 2.5%\) smallest
The upper endpoint is the value of \(\hat{\theta}^*_i\) that is the \(B \times 97.5%\) smallest
Background
This example follows Example 7.3 to construct a \(95\%\) CI for \(\theta\) using Berkson’s \(\alpha\)-particle dataset \(n=20\)
In this example, we construct a CI using bootstrap sampling
Data
The data used in this example is contained in the SMRD package as berkson20
The data are shown in the table below
Analysis (First bootstrap sample)
To compute the bootstrap CI, we need to follow the steps outlined above
The code in the chunk below walks through the process to generate the first bootstrap sample
library(SMRD)
## Step 1 - observe data (already done)
## Step 2 - Compute ML estimate for parameters based on assumed model
berk20.ld <- frame.to.ld(berkson20,
response.column = c(1,2),
censor.column = 3,
case.weight.column = 4)
berk20.ml <- mlest(berk20.ld, distribution = 'exponential')
theta_ml <- print(berk20.ml)$mle[3,1]
theta_ml[1] 440.1711
## Step 3 - Generate first set of n bootstrap observations from ML distribution
berk20_boot <- rexp(20, rate = 1/theta_ml)
# enforce censoring by cutting bootstrap samples into observation intervals
breaks <- c(0,unique(c(berkson20[[1]], berkson20[[2]])),Inf)
Cut <- cut(berk20_boot,
breaks = breaks,
ordered_result = T)
# build data.frame or bootstrap data
berkson20_boot <- cbind(berkson20[,-4], as.numeric(table(Cut)))
# use this new data set to generate the first bootstrap sample for theta
berk20.ld_boot <- frame.to.ld(berkson20_boot,
response.column = c(1,2),
censor.column = 3,
case.weight.column = 4)
berk20.ml_boot <- mlest(berk20.ld_boot, distribution = 'exponential')
theta_ml_boot <- print(berk20.ml_boot)$mle[3,1]
theta_ml_boot[1] 477.402
Analysis (Remaining bootstrap samples)
To compute the bootstrap CI, we need to follow the steps outlined above
The code in the chunk below walks through the process of generating \(B\) bootstrap samples of \(\theta^{*}\)
library(SMRD)
## Step 1 - observe data (already done)
## Step 2 - Compute ML estimate for parameters based on assumed model
berk20.ld <- frame.to.ld(berkson20,
response.column = c(1,2),
censor.column = 3,
case.weight.column = 4)
berk20.ml <- mlest(berk20.ld, distribution = 'exponential')
theta_ml <- print(berk20.ml)
# Set number of desired bootstrap samples
B = 2000
# Pre-allocate matrix to contain results
theta_star <- matrix(NA, nrow = B, ncol = 5)
colnames(theta_star) <- c('theta*',
'log(theta*)',
'se_log(theta*)',
'Z_theta*',
'Z_log(theta*)')
for(i in 1:B) {
# Generate n bootstrap observations from ML distribution
berk20_boot <- rexp(20, rate = 1/theta_ml$mle[3,1])
# enforce censoring by cutting bootstrap samples into observation intervals
breaks <- c(0,unique(c(berkson20[[1]], berkson20[[2]])),Inf)
Cut <- cut(berk20_boot,
breaks = breaks,
ordered_result = T)
# build data.frame or bootstrap data
berkson20_boot <- cbind(berkson20[,-4], as.numeric(table(Cut)))
# use this new data set to generate the first bootstrap sample for theta
berk20.ld_boot <- frame.to.ld(berkson20_boot,
response.column = c(1,2),
censor.column = 3,
case.weight.column = 4)
berk20.ml_boot <- print(mlest(berk20.ld_boot, distribution = 'exponential'))
theta_star[i,1] <- berk20.ml_boot$mle[3,1]
theta_star[i,2] <- log(berk20.ml_boot$mle[3,1])
theta_star[i,3] <- berk20.ml_boot$mle[3,2]
theta_star[i,4] <- (berk20.ml_boot$mle[3,1] - theta_ml$mle[3,1]) / berk20.ml_boot$mle[3,2]
theta_star[i,5] <- (log(berk20.ml_boot$mle[3,1]) - log(theta_ml$mle[3,1])) / (berk20.ml_boot$mle[3,2] / berk20.ml_boot$mle[3,1])
}First derivatives of the loglikelihood= 0,0
Start values
[1] 6.774224 1.000000
[1] 6.774224 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.937314 1.000000
[1] 6.937314 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.038784 1.000000
[1] 7.038784 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.349231 1.000000
[1] 7.349231 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.922644 1.000000
[1] 6.922644 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.247793 1.000000
[1] 7.247793 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.856462 1.000000
[1] 6.856462 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.951772 1.000000
[1] 6.951772 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.247793 1.000000
[1] 7.247793 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.902743 1.000000
[1] 6.902743 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.984716 1.000000
[1] 6.984716 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.200425 1.000000
[1] 7.200425 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.003065 1.000000
[1] 7.003065 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.951772 1.000000
[1] 6.951772 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.802395 1.000000
[1] 6.802395 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.507278 1.000000
[1] 6.507278 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.04316 1.00000
[1] 7.04316 1.00000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.077498 1.000000
[1] 7.077498 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.007601 1.000000
[1] 7.007601 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.756932 1.000000
[1] 6.756932 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.84588 1.00000
[1] 6.84588 1.00000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.142827 1.000000
[1] 7.142827 1.000000
First derivatives of the loglikelihood= 0.333147794008255,0
Start values
[1] 6.59987 1.00000
[1] 6.59987 1.00000
First derivatives of the loglikelihood= 0,0
Start values
[1] 7.077498 1.000000
[1] 7.077498 1.000000
First derivatives of the loglikelihood= 0,0
Start values
[1] 6.807935 1.000000
[1] 6.807935 1.000000
par(mfrow = c(1,2))
hist(theta_star[,4],
breaks = 50,
col = scales::alpha('red', 0.35),
border = 'white',
las = 1,
xlab = expression(Z[theta^'*']),
main = expression('Histogram of '*Z[theta^'*']))
hist(theta_star[,5],
breaks = 50,
col = scales::alpha('blue', 0.35),
border = 'white',
las = 1,
xlab = expression(Z[log(theta^'*')]),
main = expression('Histogram of '*Z[log(theta^'*')]))Plots
exp_plot <- function(theta, p = seq(0.01, 0.99, 0.01)) { -log(1 - p) * theta }
xmat <- matrix(NA, nrow = B, ncol = length(seq(0.01, 0.99, 0.01)))
ymat <- matrix(NA, nrow = B, ncol = length(seq(0.01, 0.99, 0.01)))
# fill ymat
for(i in 1:B) {
ymat[i,] <- -log(1-seq(0.01, 0.99, 0.01))*theta_star[i,1]
xmat[i,] <- exp_plot(theta = theta_star[i,1])
}
matplot(x = xmat, y = ymat, type = 'l')