Estimating a proportion \(\theta\) and its relationship with the number of Monte Carlo simulations

In simulation studies, a crucial choice is the number of Monte Carlo simulations. This number is usually chosen by trying to balance the accuracy of the estimates of interest against the computing (CPU) time. When the quantity being estimated is a proportion (for instance, in coverage studies, where the coverage is the proportion of intervals that include the true value of the parameter), the accuracy of the estimate can be quantified using a confidence interval for the estimated proportion. There is a mystic air around running \(N=1,000\) simulations or more, but the basis for this choice is often unclear.

If we are estimating a proportion \(\theta\), based on a sample of Bernoulli trials \({\bf x} = x_1,\dots,x_N\), an asymptotic \(100(1-\alpha)\%\) confidence interval for \(\theta\) is

\[\widehat{\theta} \pm Z_{1-\alpha/2}\sqrt{\dfrac{\widehat{\theta}\left(1-\widehat{\theta} \right)}{N}},\] where \(\widehat{\theta} = \bar{{\bf x}} = \frac{1}{N}\sum_{i=1}^N x_i\), and \(Z_{1-\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution. Thus, increasing \(N\) improves the precision only at a square-root rate, and this precision (understood as the length of the confidence interval) also depends on the estimated value \(\widehat{\theta}\). The computing time, on the other hand, typically increases linearly with the number of Monte Carlo iterations \(N\). The following plots show how the length of the interval decreases as a function of \(N\), for different values of \(\widehat{\theta}\). Since \(\widehat{\theta}(1-\widehat{\theta})\) is symmetric about \(\widehat{\theta} = 0.5\), the values \(\widehat{\theta}\) and \(1-\widehat{\theta}\) give intervals of the same length, so we only present the analysis for \(\widehat{\theta} \leq 0.5\).

rm(list = ls())

# Estimated proportions to compare
p.hat <- seq(0.1, 0.5, by = 0.05)

# Grid of Monte Carlo sample sizes
N <- seq(10, 1000, by = 10)

# Length of the asymptotic 95% CI for the first value of p.hat
L <- 2 * 1.96 * sqrt(p.hat[1] * (1 - p.hat[1]) / N)
plot(N, L, ylim = c(0, 0.65), type = "l", col = gray.colors(length(p.hat))[1],
     xlab = "N", ylab = "Length of CIs", lwd = 2)

# Remaining curves
for (i in 2:length(p.hat)) {
  L <- 2 * 1.96 * sqrt(p.hat[i] * (1 - p.hat[i]) / N)
  lines(N, L, col = gray.colors(length(p.hat))[i], lwd = 2)
}

# Add a legend
legend(800, 0.6, legend = p.hat,
       col = gray.colors(length(p.hat)), lty = 1, cex = 1, lwd = 2)
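
For a concrete sense of scale, the interval is easy to compute directly from the formula above. The sketch below uses a made-up coverage estimate of \(\widehat{\theta} = 0.93\) and two illustrative values of \(N\); these numbers are only for illustration, not part of the analysis.

# Hypothetical example (made-up numbers): an estimated coverage of 0.93
theta.hat <- 0.93
N <- c(250, 1000)
se <- sqrt(theta.hat * (1 - theta.hat) / N)
ci <- cbind(lower = theta.hat - 1.96 * se, upper = theta.hat + 1.96 * se)
rownames(ci) <- paste("N =", N)
round(ci, 3)

Quadrupling \(N\) (here, from 250 to 1,000) halves the length of the interval, as expected from the square-root rate.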

Another way to visualise the effect of \(\widehat{\theta}\) and \(N\) is by looking at the standard error: \[SE = \sqrt{\dfrac{\widehat{\theta}\left(1-\widehat{\theta} \right)}{N}}.\]

rm(list = ls())

# Estimated proportions to compare
p.hat <- seq(0.1, 0.5, by = 0.05)

# Grid of Monte Carlo sample sizes
N <- seq(10, 1000, by = 10)

# Standard error for the first value of p.hat
SE <- sqrt(p.hat[1] * (1 - p.hat[1]) / N)
plot(N, SE, ylim = c(0, 0.15), type = "l", col = gray.colors(length(p.hat))[1],
     xlab = "N", ylab = "SEs", lwd = 2)

# Remaining curves
for (i in 2:length(p.hat)) {
  SE <- sqrt(p.hat[i] * (1 - p.hat[i]) / N)
  lines(N, SE, col = gray.colors(length(p.hat))[i], lwd = 2)
}

# Add a legend
legend(800, 0.125, legend = p.hat,
       col = gray.colors(length(p.hat)), lty = 1, cex = 1, lwd = 2)

Thus, for \(N \geq 250\), the gain in accuracy from increasing \(N\) further is not that appealing when \(\widehat{\theta}\leq 0.25\) or \(\widehat{\theta}\geq 0.75\). For \(N\geq 500\), all of the intervals have similar accuracy. Consequently, if you are dealing with a computationally expensive model, you may want to assess the gain in accuracy obtained by increasing the number of Monte Carlo iterations, and select \(N\) on that basis. Of course, if the model of interest is computationally cheap, the more, the merrier: one can use \(N = 1,000\) or even \(N = 10^9\).
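
To put some numbers on this, the snippet below simply evaluates the length formula on a small grid of illustrative values of \(\widehat{\theta}\) and \(N\) (the particular grid is chosen only for illustration):

# Length of the asymptotic 95% CI at a few values of p.hat and N
p.hat <- c(0.1, 0.25, 0.5)
N <- c(100, 250, 500, 1000)
L <- outer(p.hat, N, function(p, n) 2 * 1.96 * sqrt(p * (1 - p) / n))
dimnames(L) <- list(paste("p.hat =", p.hat), paste("N =", N))
round(L, 3)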

When \(\widehat{\theta}\) is close to \(0\) or \(1\) and \(N\) is small, the asymptotic confidence interval discussed here may spill outside the interval \((0,1)\). Several corrections have been proposed for such cases. Another alternative is to construct the confidence interval on the logit scale and transform it back to \((0,1)\). See also: Binomial proportion confidence interval.
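
As a rough sketch of the logit-scale alternative (this is a delta-method interval, one of several possible corrections; the function name logit_ci and the example counts are hypothetical, and base R's binom.test is shown only as one possible comparison):

# 95% CI built on the logit scale and transformed back to (0, 1) (delta method)
logit_ci <- function(x, N, alpha = 0.05) {
  theta.hat <- x / N
  lo <- qlogis(theta.hat)                              # logit of the estimate
  se.lo <- 1 / sqrt(N * theta.hat * (1 - theta.hat))   # delta-method SE on the logit scale
  z <- qnorm(1 - alpha / 2)
  plogis(lo + c(-1, 1) * z * se.lo)                    # back-transform; stays inside (0, 1)
}

# Illustrative (made-up) counts: 97 "successes" out of N = 100 simulations
logit_ci(97, 100)
# Compare with an exact (Clopper-Pearson) interval from base R
binom.test(97, 100)$conf.int

With these made-up counts, the plain asymptotic interval would extend above \(1\), whereas the logit-scale (and exact) intervals remain inside \((0,1)\).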