"Knew that we ventured on such dangerous seas / That if we wrought out life 'twas ten to one" – William Shakespeare, *Henry IV, Part 2*
Let \(X\) be a random variable with a binomial distribution with \(n\) trials and success probability \(\theta\in(0,1)\). The MLE of \(\theta\) is \(\widehat{\theta} = \dfrac{X}{n}\), which satisfies \(E[\widehat{\theta}] = \theta\) and \(Var[\widehat{\theta}] = \dfrac{\theta(1-\theta)}{n}\). We know that the MLE is consistent and asymptotically normal.
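As a quick reminder, the MLE follows from maximizing the binomial log-likelihood (a standard derivation, sketched here): \[ \ell(\theta) = \log\binom{n}{x} + x\log\theta + (n-x)\log(1-\theta), \qquad \ell'(\theta) = \dfrac{x}{\theta} - \dfrac{n-x}{1-\theta} = 0 \;\Longrightarrow\; \widehat{\theta} = \dfrac{x}{n}. \] In particular, by the central limit theorem, \(\sqrt{n}\,(\widehat{\theta}-\theta) \xrightarrow{d} N\big(0,\theta(1-\theta)\big)\), which is the asymptotic normality used below.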
A parameter of interest is the log-odds, which is the logarithm of the odds: \[\varphi(\theta) = \log\left( \dfrac{\theta}{1-\theta} \right).\] This function is also known as the logit function. The aim now is to construct a \(100\gamma\%\) CI for \(\varphi(\theta)\), using the consistent estimator \(\varphi(\widehat{\theta})\) and the plug-in principle. It is easy to check that \(\varphi'(\theta) = \dfrac{1}{\theta(1-\theta)}\), so by the delta method \(\sqrt{n}\left(\varphi(\widehat{\theta}) - \varphi(\theta)\right) \xrightarrow{d} N\!\left(0, \dfrac{1}{\theta(1-\theta)}\right)\). Then, plugging in \(\widehat{\theta}\) for \(\theta\) in the asymptotic variance, a \(100\gamma\%\) confidence interval for \(\varphi(\theta)\) can be obtained as \[ \varphi(\widehat{\theta}) - \dfrac{Z_{\frac{1+\gamma}{2}}}{\sqrt{n \widehat{\theta}(1-\widehat{\theta})}} < \varphi(\theta) < \varphi(\widehat{\theta}) + \dfrac{Z_{\frac{1+\gamma}{2}}}{\sqrt{n \widehat{\theta}(1-\widehat{\theta})}}. \] (A single-sample computation of this interval is sketched just before the simulation code below.)

The following R code presents a simulation study to assess the actual coverage of these asymptotic normal confidence intervals of nominal level \(95\%\). The idea is, for \(i=1,\dots,N\):

1. Simulate a binomial sample \(x_i\) with \(n\) trials and success probability \(\theta_0\).
2. Calculate the confidence interval for the log-odds from this sample.
3. Check whether or not that confidence interval contains the true value \(\varphi(\theta_0)\).
Finally, check the proportion of confidence intervals that contain the true value of the parameter. This is a Monte Carlo approximation of the coverage of those confidence intervals (for that sample size \(n\) and parameter \(\theta_0\)). We will run the simulation study for \(\theta_0=0.1, 0.2,\dots,0.9\) and \(n=5,10,30,50,100,250,500,1000\).
Recall that the closer the coverage proportion is to the nominal value (\(0.95\)), the better the normal approximation. We also show the proportion of intervals that are undefined (those with \(\widehat{\theta}=0\) or \(\widehat{\theta}=1\)) for each sample size.
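Before the full study, here is a minimal sketch of the interval computation for a single sample; the values x = 13 and n = 30 are hypothetical, chosen only for illustration.

# Hypothetical single-sample example (illustrative values only)
x <- 13 # assumed number of successes
n <- 30 # assumed number of trials
gamma <- 0.95
theta.hat <- x/n
Z <- qnorm((1 + gamma)/2)
half.width <- Z/sqrt(n*theta.hat*(1 - theta.hat))
ci.logit <- c(log(theta.hat/(1 - theta.hat)) - half.width,
              log(theta.hat/(1 - theta.hat)) + half.width)
ci.logit
# Since the logit is monotone, the inverse transform gives a CI for theta itself
exp(ci.logit)/(1 + exp(ci.logit))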
# Clear memory
rm(list = ls())
# True value of the parameter
theta <- seq(0.1, 0.9, by = 0.1)
# Sample sizes
ns <- c(5,10,30,50,100,250,500,1000)
# Matrix of proportions of true inclusions
prop.mc <- matrix(0, nrow = length(theta), ncol = length(ns))
# Matrix of proportions of undefined intervals
prop.na <- matrix(0, nrow = length(theta), ncol = length(ns))
# Z_{0.975}
Z <- qnorm(0.975)
# logit function
logit <- Vectorize(function(t) log(t/(1-t)))
# Simulation study
N <- 20000 # Monte Carlo iterations
for(j in 1:length(theta)){
  for(k in 1:length(ns)){
    inc <- vector() # temporary vector of inclusion indicators
    for(i in 1:N){
      set.seed(i)
      x <- rbinom(n = 1, size = ns[k], prob = theta[j])
      theta.hat <- x/ns[k]
      if(theta.hat > 0 & theta.hat < 1){
        L <- logit(theta.hat) - Z/sqrt(ns[k]*theta.hat*(1-theta.hat))
        U <- logit(theta.hat) + Z/sqrt(ns[k]*theta.hat*(1-theta.hat))
        true.logit <- logit(theta[j])
        inc[i] <- ifelse(L <= true.logit & U >= true.logit, 1, 0)
      }
      if(theta.hat == 0 | theta.hat == 1){
        inc[i] <- NA # interval undefined when theta.hat is 0 or 1
      }
    }
    prop.mc[j,k] <- mean(inc, na.rm = TRUE) # coverage among defined intervals
    prop.na[j,k] <- mean(is.na(inc))        # proportion of undefined intervals
  }
}
# Required package
library(knitr)
# Displaying the proportion of intervals that include the true value
colnames(prop.mc) <- ns
rownames(prop.mc) <- theta
kable(prop.mc, digits = 4)
| \(\theta_0\) | 5 | 10 | 30 | 50 | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.8038 | 0.9811 | 0.9719 | 0.9742 | 0.9512 | 0.9548 | 0.9560 | 0.9497 |
| 0.2 | 0.9143 | 0.9628 | 0.9733 | 0.9501 | 0.9412 | 0.9544 | 0.9506 | 0.9479 |
| 0.3 | 0.9659 | 0.9893 | 0.9732 | 0.9558 | 0.9609 | 0.9555 | 0.9538 | 0.9504 |
| 0.4 | 1.0000 | 0.9879 | 0.9604 | 0.9406 | 0.9492 | 0.9543 | 0.9520 | 0.9491 |
| 0.5 | 1.0000 | 0.9801 | 0.9567 | 0.9665 | 0.9471 | 0.9507 | 0.9466 | 0.9521 |
| 0.6 | 1.0000 | 0.9879 | 0.9604 | 0.9406 | 0.9492 | 0.9543 | 0.9520 | 0.9491 |
| 0.7 | 0.9659 | 0.9893 | 0.9732 | 0.9558 | 0.9614 | 0.9555 | 0.9538 | 0.9504 |
| 0.8 | 0.9143 | 0.9628 | 0.9733 | 0.9501 | 0.9412 | 0.9544 | 0.9506 | 0.9479 |
| 0.9 | 0.8038 | 0.9811 | 0.9719 | 0.9742 | 0.9512 | 0.9548 | 0.9560 | 0.9497 |
# Displaying the proportion of intervals with undefined normal approximation
colnames(prop.na) <- ns
rownames(prop.na) <- theta
kable(prop.na, digits = 4)
| \(\theta_0\) | 5 | 10 | 30 | 50 | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.5901 | 0.3420 | 0.0424 | 0.0062 | 0.0001 | 0.0000 | 0.0000 | 0.0000 |
| 0.2 | 0.3211 | 0.1059 | 0.0012 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.3 | 0.1696 | 0.0273 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.4 | 0.0862 | 0.0071 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.5 | 0.0624 | 0.0020 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.6 | 0.0862 | 0.0071 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.7 | 0.1696 | 0.0273 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.8 | 0.3211 | 0.1059 | 0.0012 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.9 | 0.5901 | 0.3420 | 0.0424 | 0.0062 | 0.0001 | 0.0000 | 0.0000 | 0.0000 |
# The undefined-interval proportions above are, in fact, Monte Carlo approximations
# of the probability of observing x = 0 or x = n in each scenario
probs <- matrix(0, nrow = length(theta), ncol = length(ns))
for(j in 1:length(theta)){
  for(k in 1:length(ns)){
    probs[j,k] <- dbinom(x = 0, size = ns[k], prob = theta[j]) +
      dbinom(x = ns[k], size = ns[k], prob = theta[j])
  }
}
colnames(probs) <- ns
rownames(probs) <- theta
kable(probs, digits = 4)
| \(\theta_0\) | 5 | 10 | 30 | 50 | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.5905 | 0.3487 | 0.0424 | 0.0052 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.2 | 0.3280 | 0.1074 | 0.0012 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.3 | 0.1705 | 0.0283 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.4 | 0.0880 | 0.0062 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.5 | 0.0625 | 0.0020 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.6 | 0.0880 | 0.0062 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.7 | 0.1705 | 0.0283 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.8 | 0.3280 | 0.1074 | 0.0012 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 0.9 | 0.5905 | 0.3487 | 0.0424 | 0.0052 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
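For reference, the entries of this last table are exactly \[ P\big(\widehat{\theta}\in\{0,1\}\big) = P(X=0) + P(X=n) = (1-\theta_0)^n + \theta_0^n. \] For instance, with \(\theta_0=0.1\) and \(n=5\), \((0.9)^5 + (0.1)^5 \approx 0.5905\), which the Monte Carlo proportion of undefined intervals (0.5901) approximates closely.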
We now repeat the simulation study, this time counting an undefined interval (\(\widehat{\theta}=0\) or \(1\)) as not containing the true value instead of discarding it; the only change in the code is that these cases contribute a 0 rather than an NA.

# Clear memory
rm(list = ls())
# True value of the parameter
theta <- seq(0.1, 0.9, by = 0.1)
# Sample sizes
ns <- c(5,10,30,50,100,250,500,1000)
# Matrix of proportions of true inclusions
prop.mc <- matrix(0, nrow = length(theta), ncol = length(ns))
# Matrix of proportions of undefined intervals
prop.na <- matrix(0, nrow = length(theta), ncol = length(ns))
# Z_{0.975}
Z <- qnorm(0.975)
# logit function
logit <- Vectorize(function(t) log(t/(1-t)))
# Simulation study
N <- 20000 # Monte Carlo iterations
for(j in 1:length(theta)){
  for(k in 1:length(ns)){
    inc <- vector() # temporary vector of inclusion indicators
    for(i in 1:N){
      set.seed(i)
      x <- rbinom(n = 1, size = ns[k], prob = theta[j])
      theta.hat <- x/ns[k]
      if(theta.hat > 0 & theta.hat < 1){
        L <- logit(theta.hat) - Z/sqrt(ns[k]*theta.hat*(1-theta.hat))
        U <- logit(theta.hat) + Z/sqrt(ns[k]*theta.hat*(1-theta.hat))
        true.logit <- logit(theta[j])
        inc[i] <- ifelse(L <= true.logit & U >= true.logit, 1, 0)
      }
      if(theta.hat == 0 | theta.hat == 1){
        inc[i] <- 0 # undefined interval now counted as not containing the true value
      }
    }
    prop.mc[j,k] <- mean(inc, na.rm = TRUE) # no NA is produced, so this is the overall coverage
    prop.na[j,k] <- mean(is.na(inc))        # always 0 under this convention
  }
}
# Required package
library(knitr)
# Displaying the proportion of intervals that include the true value
colnames(prop.mc) <- ns
rownames(prop.mc) <- theta
kable(prop.mc, digits = 4)
| \(\theta_0\) | 5 | 10 | 30 | 50 | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.3294 | 0.6455 | 0.9307 | 0.9682 | 0.9512 | 0.9548 | 0.9560 | 0.9497 |
| 0.2 | 0.6207 | 0.8608 | 0.9722 | 0.9500 | 0.9412 | 0.9544 | 0.9506 | 0.9479 |
| 0.3 | 0.8020 | 0.9622 | 0.9732 | 0.9558 | 0.9609 | 0.9555 | 0.9538 | 0.9504 |
| 0.4 | 0.9138 | 0.9809 | 0.9604 | 0.9406 | 0.9492 | 0.9543 | 0.9520 | 0.9491 |
| 0.5 | 0.9376 | 0.9781 | 0.9567 | 0.9665 | 0.9471 | 0.9507 | 0.9466 | 0.9521 |
| 0.6 | 0.9138 | 0.9809 | 0.9604 | 0.9406 | 0.9492 | 0.9543 | 0.9520 | 0.9491 |
| 0.7 | 0.8020 | 0.9622 | 0.9732 | 0.9558 | 0.9614 | 0.9555 | 0.9538 | 0.9504 |
| 0.8 | 0.6207 | 0.8608 | 0.9722 | 0.9500 | 0.9412 | 0.9544 | 0.9506 | 0.9479 |
| 0.9 | 0.3294 | 0.6455 | 0.9307 | 0.9682 | 0.9512 | 0.9548 | 0.9560 | 0.9497 |
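As a consistency check between the two coverage tables (this relies on `set.seed(i)` reusing the same seeds in both runs, so the simulated samples are identical): each entry here equals the coverage among defined intervals multiplied by the proportion of defined intervals, \[ \text{coverage}_{\text{all}} = \text{coverage}_{\text{defined}} \times (1 - \text{prop.na}). \] For example, for \(\theta_0=0.1\) and \(n=5\), \(0.8038 \times (1 - 0.5901) \approx 0.3294\), matching the table up to rounding.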