A simulation study to check the performance of asymptotic normal CIs for the log-odds

``Knew that we ventured on such dangerous seas That if we wrought out life ‘twas ten to one’’ –William Shakespeare,

Let \(X\) be a random variable with binomial distribution with \(n\) trials and \(\theta\in(0,1)\) probability of success. The MLE of \(\theta\), \(\widehat{\theta} = \dfrac{X}{n}\), \(E[\widehat{\theta}] = \theta\), \(Var[\widehat{\theta}] = \frac{\theta(1-\theta)}{n}\). We know that the MLE is consistent and asymptotically normal.

A parameter of interest is the log-odds, which is the logarithm of the odds: \[\varphi(\theta) = \log\left( \dfrac{\theta}{1-\theta} \right).\] This function is also known as the function. The aim now is to construct a \(100\gamma\%\) CI for \(\varphi(\theta)\), using the consistent estimator \(\varphi(\widehat{\theta})\) and the plug-in principle. It is easy to check that \(g'(\widehat{\theta}) = \dfrac{1}{\widehat{\theta}(1-\widehat{\theta})}\). Then, a \(100\gamma\%\) confidence interval for \(\varphi(\theta)\) can be obtained as \[ \varphi(\widehat{\theta}_n) - \dfrac{Z_{\frac{1+\gamma}{2}}}{\sqrt{n \widehat{\theta}(1-\widehat{\theta})}} < \varphi(\theta) < \varphi(\widehat{\theta}_n) + \dfrac{Z_{\frac{1+\gamma}{2}}}{\sqrt{n \widehat{\theta}(1-\widehat{\theta})}}. \] The following R code presents a simulation study to assess the confidence level of asymptotic normal confidence intervals of level \(95\%\). The idea is, for \(i=1,\dots,N\) 1. Simulate a binomial sample \(x_i\) with \(n\) trials and \(\theta_0\) probability of success, 2. Calculate the confidence interval for the log-odds for this sample, 3. Check whether or not that confidence interval contains the true value of the parameter \(\theta_0\).

Finally, check the proportion of confidence intervals that contain the true value of the parameter. This is a Monte Carlo approximation of the confidence of those confidence intervals (for that sample size \(n\) and parameter \(\theta_0\)). We will do the simulation study for \(\theta_0=0.1, 0.2,\dots,0.9\) and \(n=5,10,30,50,100,250,1000\).

Recall that, the closer the inclusion proportion to the nominal value (\(0.95\)), the better the normal approximation. We also show the proportion of intervals that are undefined (\(\widehat{\theta}=0,1\)) for each sample size.

Simulation Study I: Excluding undefined intervals and counting the proportion of undefined intervals

# Clear memory
rm(list = ls())
# True value of the parameter
theta <- seq(0.1, 0.9, by = 0.1)
# Sample sizes
ns <- c(5,10,30,50,100,250,500,1000)
# Matrix of proportions of true inclusions
prop.mc <- matrix(0, nrow = length(theta), ncol = length(ns))
# Matrix of proportions of undefined intervals
prop.na <- matrix(0, nrow = length(theta), ncol = length(ns))
# Z_{0.975} 
Z <- qnorm(0.975)
# logit function
logit <- Vectorize(function(t)  log(t/(1-t)))

# Simulation study
N <- 20000 # Monte Carlo iterations


for(j in 1:length(theta)){
  for(k in 1:length(ns)){
    inc <- vector() # temporal vector
    for(i in 1:N){
      set.seed(i)
      x <- rbinom(n = 1, size = ns[k], prob = theta[j])
      theta.hat <- x/ns[k]
      if(theta.hat>0 & theta.hat < 1){
        L <- logit(theta.hat) - Z/( sqrt(ns[k]*theta.hat*(1-theta.hat)))
        U <- logit(theta.hat) + Z/( sqrt(ns[k]*theta.hat*(1-theta.hat)))
        true.logit <- logit(theta[j])
        inc[i] <- ifelse(L<= true.logit & U>=true.logit,1,0)
      }
      if(theta.hat==0 | theta.hat == 1){
        inc[i] <- NA
      }
    }
    prop.mc[j,k] <- mean(inc,na.rm = TRUE)
    prop.na[j,k] <- mean(is.na(inc))
  }
}


# Required package
library(knitr)

# Displaying the proportion of intervals that include the true value
colnames(prop.mc) <- ns
rownames(prop.mc) <- theta

kable(prop.mc, digits = 4)
5 10 30 50 100 250 500 1000
0.1 0.8038 0.9811 0.9719 0.9742 0.9512 0.9548 0.9560 0.9497
0.2 0.9143 0.9628 0.9733 0.9501 0.9412 0.9544 0.9506 0.9479
0.3 0.9659 0.9893 0.9732 0.9558 0.9609 0.9555 0.9538 0.9504
0.4 1.0000 0.9879 0.9604 0.9406 0.9492 0.9543 0.9520 0.9491
0.5 1.0000 0.9801 0.9567 0.9665 0.9471 0.9507 0.9466 0.9521
0.6 1.0000 0.9879 0.9604 0.9406 0.9492 0.9543 0.9520 0.9491
0.7 0.9659 0.9893 0.9732 0.9558 0.9614 0.9555 0.9538 0.9504
0.8 0.9143 0.9628 0.9733 0.9501 0.9412 0.9544 0.9506 0.9479
0.9 0.8038 0.9811 0.9719 0.9742 0.9512 0.9548 0.9560 0.9497
# Displaying the proportion of intervals with undefined normal approximation
colnames(prop.na) <- ns
rownames(prop.na) <- theta

kable(prop.na, digits = 4)
5 10 30 50 100 250 500 1000
0.1 0.5901 0.3420 0.0424 0.0062 1e-04 0 0 0
0.2 0.3211 0.1059 0.0012 0.0001 0e+00 0 0 0
0.3 0.1696 0.0273 0.0001 0.0000 0e+00 0 0 0
0.4 0.0862 0.0071 0.0000 0.0000 0e+00 0 0 0
0.5 0.0624 0.0020 0.0000 0.0000 0e+00 0 0 0
0.6 0.0862 0.0071 0.0000 0.0000 0e+00 0 0 0
0.7 0.1696 0.0273 0.0001 0.0000 0e+00 0 0 0
0.8 0.3211 0.1059 0.0012 0.0001 0e+00 0 0 0
0.9 0.5901 0.3420 0.0424 0.0062 1e-04 0 0 0
# This table is, in fact, related to the probability of observing 0 and 1 for each scenario
probs <- matrix(0, nrow = length(theta), ncol = length(ns))

for(j in 1:length(theta)){
  for(k in 1:length(ns)){
    probs[j,k] <- dbinom(x = 0, size = ns[k], prob = theta[j]) + dbinom(x = ns[k], size = ns[k], prob = theta[j])
  }}
colnames(probs) <- ns
rownames(probs) <- theta

kable(probs, digits = 4)
5 10 30 50 100 250 500 1000
0.1 0.5905 0.3487 0.0424 0.0052 0 0 0 0
0.2 0.3280 0.1074 0.0012 0.0000 0 0 0 0
0.3 0.1705 0.0283 0.0000 0.0000 0 0 0 0
0.4 0.0880 0.0062 0.0000 0.0000 0 0 0 0
0.5 0.0625 0.0020 0.0000 0.0000 0 0 0 0
0.6 0.0880 0.0062 0.0000 0.0000 0 0 0 0
0.7 0.1705 0.0283 0.0000 0.0000 0 0 0 0
0.8 0.3280 0.1074 0.0012 0.0000 0 0 0 0
0.9 0.5905 0.3487 0.0424 0.0052 0 0 0 0

Simulation Study II: Assigning 0 inclusion of the true value to undefined intervals

# Clear memory
rm(list = ls())
# True value of the parameter
theta <- seq(0.1, 0.9, by = 0.1)
# Sample sizes
ns <- c(5,10,30,50,100,250,500,1000)
# Matrix of proportions of true inclusions
prop.mc <- matrix(0, nrow = length(theta), ncol = length(ns))
# Matrix of proportions of undefined intervals
prop.na <- matrix(0, nrow = length(theta), ncol = length(ns))
# Z_{0.975} 
Z <- qnorm(0.975)
# logit function
logit <- Vectorize(function(t)  log(t/(1-t)))

# Simulation study
N <- 20000 # Monte Carlo iterations


for(j in 1:length(theta)){
  for(k in 1:length(ns)){
    inc <- vector() # temporal vector
    for(i in 1:N){
      set.seed(i)
      x <- rbinom(n = 1, size = ns[k], prob = theta[j])
      theta.hat <- x/ns[k]
      if(theta.hat>0 & theta.hat < 1){
      L <- logit(theta.hat) - Z/( sqrt(ns[k]*theta.hat*(1-theta.hat)))
      U <- logit(theta.hat) + Z/( sqrt(ns[k]*theta.hat*(1-theta.hat)))
      true.logit <- logit(theta[j])
      inc[i] <- ifelse(L<= true.logit & U>=true.logit,1,0)
      }
      if(theta.hat==0 | theta.hat == 1){
        inc[i] <- 0
      }
    }
    prop.mc[j,k] <- mean(inc,na.rm = TRUE)
    prop.na[j,k] <- mean(is.na(inc))
  }
}

# Required package
library(knitr)

# Displaying the proportion of intervals that include the true value
colnames(prop.mc) <- ns
rownames(prop.mc) <- theta

kable(prop.mc, digits = 4)
5 10 30 50 100 250 500 1000
0.1 0.3294 0.6455 0.9307 0.9682 0.9512 0.9548 0.9560 0.9497
0.2 0.6207 0.8608 0.9722 0.9500 0.9412 0.9544 0.9506 0.9479
0.3 0.8020 0.9622 0.9732 0.9558 0.9609 0.9555 0.9538 0.9504
0.4 0.9138 0.9809 0.9604 0.9406 0.9492 0.9543 0.9520 0.9491
0.5 0.9376 0.9781 0.9567 0.9665 0.9471 0.9507 0.9466 0.9521
0.6 0.9138 0.9809 0.9604 0.9406 0.9492 0.9543 0.9520 0.9491
0.7 0.8020 0.9622 0.9732 0.9558 0.9614 0.9555 0.9538 0.9504
0.8 0.6207 0.8608 0.9722 0.9500 0.9412 0.9544 0.9506 0.9479
0.9 0.3294 0.6455 0.9307 0.9682 0.9512 0.9548 0.9560 0.9497