Interval Estimation for a Binomial Proportion

The R package binom makes it extremely easy to compute interval estimators for a binomial proportion.

library(binom)
## Loading required package: lattice
binom.confint(x = 5, n = 50)
##           method x  n   mean   lower  upper
## 1  agresti-coull 5 50 0.1000 0.03914 0.2179
## 2     asymptotic 5 50 0.1000 0.01685 0.1832
## 3          bayes 5 50 0.1078 0.03919 0.2054
## 4        cloglog 5 50 0.1000 0.03673 0.2010
## 5          exact 5 50 0.1000 0.03328 0.2181
## 6          logit 5 50 0.1000 0.04224 0.2187
## 7         probit 5 50 0.1000 0.03960 0.2096
## 8        profile 5 50 0.1000 0.03707 0.2027
## 9            lrt 5 50 0.1000 0.03705 0.2027
## 10     prop.test 5 50 0.1000 0.03740 0.2259
## 11        wilson 5 50 0.1000 0.04348 0.2136

Many well-known intervals (e.g., 'asymptotic', 'cloglog', 'logit', 'probit') can be obtained in a unified way by fitting an intercept-only Binomial generalized linear model (GLM) with the corresponding link function and back-transforming the Wald interval for the intercept.

binom.ci <- function(x, n, alpha = 0.05, link = "logit") {
    z <- qnorm(1 - alpha/2)
    family <- binomial(link = link)
    ## intercept-only binomial GLM: the intercept estimates g(p) for the chosen link
    fit <- summary(glm(cbind(x, n - x) ~ 1, family = family))
    est <- fit$coef[, "Estimate"]
    se <- fit$coef[, "Std. Error"]
    ## Wald interval on the link scale, back-transformed to the probability scale
    family$linkinv(est + c(-z, z) * se)
}
binom.ci(x = 5, n = 50, link = "identity")
## [1] 0.01685 0.18315
binom.ci(x = 5, n = 50, link = "cloglog")
## [1] 0.04289 0.22371
binom.ci(x = 5, n = 50, link = "logit")
## [1] 0.04224 0.21869
binom.ci(x = 5, n = 50, link = "probit")
## [1] 0.0396 0.2096

Note that different results are obtained for the 'cloglog' interval! The following is an excerpt from the code of the 'binom.confint' function:

if (any(method == "cloglog") || all.methods) {
    if (any(method != "exact")) {
        x1 <- x == 0
        x2 <- x == n
    }
    inner <- !x1 & !x2
    log.mu <- sd <- lcl <- ucl <- rep(NA, length(x))
    log.mu[inner] <- log(-log(p[inner]))
    sd[inner] <- sqrt(var.cloglog(p[inner], n[inner]))
    lcl[inner] <- exp(-exp(log.mu[inner] + z[inner] * sd[inner]))
    ucl[inner] <- exp(-exp(log.mu[inner] - z[inner] * sd[inner]))
    lcl[x1] <- rep(0, sum(x1))
    lcl[x2] <- alpha2[x2]^(1/n[x2])
    ucl[x1] <- 1 - alpha2[x1]^(1/n[x1])
    ucl[x2] <- rep(1, sum(x2))
    res.cloglog <- data.frame(method = rep("cloglog", NROW(x)), xn, mean = p, 
        lower = lcl, upper = ucl)
    res <- if (is.null(res)) 
        res.cloglog else rbind(res, res.cloglog)
}

It appears that the 'binom.confint' function is using the 'loglog' link \( g(p) = \log\{-\log(p)\} \) instead of the 'cloglog' link \( g(p) = \log\{-\log(1 - p)\} \), so the 'cloglog' interval it returns is misleading. This is confirmed below: the delta-method interval based on \( \log\{-\log(p)\} \) reproduces the row labelled 'cloglog' in the 'binom.confint' output above, whereas the interval based on the true cloglog link matches binom.ci(x = 5, n = 50, link = "cloglog").

For many common intervals, fitting a Binomial GLM is not always necessary because analytic formulae can be derived. For a reasonably large sample size \( n \), the Central Limit Theorem (CLT) indicates that \[ \sqrt{n}(\hat{p} - p) \overset{d}{\to} N(0, p(1 - p)). \] The above CLT allows us to construct the 'asymptotic' (Wald) interval \( \hat{p} \pm z_{1 - \alpha / 2}\sqrt{\hat{p} (1 - \hat{p}) / n} \):

x <- 5
n <- 50
p <- x/n
alpha <- 0.05
z <- qnorm(1 - alpha/2)
p + c(-z, z) * sqrt(p * (1 - p)/n)
## [1] 0.01685 0.18315

However, the approximation may be very unsatisfactory when the sample size is small and/or \( p \) is near the boundaries of the parameter space (i.e., 0 or 1)! To resolve this issue, we can apply the delta method to achieve a better approximation: \[ \sqrt{n}\{g(\hat{p}) - g(p)\} \overset{d}{\to} N\left(0, \sigma^{2}(p)\right), \] where \( \sigma^{2}(p) = \{g'(p)\}^{2}p(1 - p) \). Based on this result, one can first construct an approximate interval for \( \eta = g(p) \): \[ g(\hat{p}) \pm z_{1 - \alpha / 2} \sigma(\hat{p}) / \sqrt{n}, \] and then transform the interval back to the original scale to obtain an approximate interval for \( p \): \[ g^{-1}\{g(\hat{p}) \pm z_{1 - \alpha / 2} \sigma(\hat{p}) / \sqrt{n}\}. \]
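
This recipe is easy to code generically. The following is a minimal sketch (the helper delta.ci and its arguments g, ginv, and gprime are ours, not part of the binom package); called with the logit link, it should agree with the 'logit' interval computed below.

delta.ci <- function(x, n, g, ginv, gprime, alpha = 0.05) {
    p <- x/n
    z <- qnorm(1 - alpha/2)
    ## sigma(p-hat)/sqrt(n) from the delta method
    se <- abs(gprime(p)) * sqrt(p * (1 - p)/n)
    ## back-transform; sort() keeps (lower, upper) in order even when g is decreasing
    sort(ginv(g(p) + c(-z, z) * se))
}
delta.ci(x = 5, n = 50,
         g = function(p) log(p/(1 - p)),
         ginv = function(eta) exp(eta)/(1 + exp(eta)),
         gprime = function(p) 1/(p * (1 - p)))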

For \( g(p) = \log\{-\log(p)\} \), we have \( \sigma^{2}(p) = (1 - p) / [p\{\log(p)\}^{2}] \) and \( g^{-1}(\eta) = \exp\{-\exp(\eta)\} \). Note that \( g(p) = \log\{-\log(p)\} \) is a decreasing function of \( p \)!

est.loglog <- log(-log(p))
var.loglog <- (1 - p)/(p * (log(p))^2)
eta.loglog <- est.loglog + c(z, -z) * sqrt(var.loglog)/sqrt(n)
exp(-exp(eta.loglog))
## [1] 0.03673 0.20096

For \( g(p) = \log\{-\log(1 - p)\} \), we have \( \sigma^{2}(p) = p / [(1 - p)\{\log(1 - p)\}^{2}] \) and \( g^{-1}(\eta) = 1 - \exp\{-\exp(\eta)\} \).

est.cloglog <- log(-log(1 - p))
var.cloglog <- p/((1 - p) * (log(1 - p))^2)
eta.cloglog <- est.cloglog + c(-z, z) * sqrt(var.cloglog)/sqrt(n)
1 - exp(-exp(eta.cloglog))
## [1] 0.04289 0.22371

For \( g(p) = \log\{p / (1 - p)\} \), we have \( \sigma^{2}(p) = 1 / \{p (1 - p)\} \) and \( g^{-1}(\eta) = \exp(\eta) / \{1 + \exp(\eta)\} \).

est.logit <- log(p/(1 - p))
var.logit <- 1/(p * (1 - p))
eta.logit <- est.logit + c(-z, z) * sqrt(var.logit)/sqrt(n)
exp(eta.logit)/(1 + exp(eta.logit))
## [1] 0.04224 0.21869

According to Wikipedia, the Clopper-Pearson (exact) interval can be expressed in terms of Beta quantiles as \[ \left(B_{\alpha/2}(x, n - x + 1),\; B_{1 - \alpha/2}(x + 1, n - x)\right), \] where \( B_{q}(a, b) \) denotes the \( q \)-th quantile of the \( \mathrm{Beta}(a, b) \) distribution:

c(qbeta(alpha/2, x, n - x + 1), qbeta(1 - alpha/2, x + 1, n - x))
## [1] 0.03328 0.21814
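
As a cross-check, base R's binom.test() also computes the Clopper-Pearson interval, so its confidence interval should match the one obtained above:

binom.test(x = 5, n = 50)$conf.int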

Similarly, the Wilson (score) interval can be expressed as \[ \frac{1}{1 + z^{2} / n}\left\{\hat{p} + z^{2} / (2n) \pm z \sqrt{\hat{p}(1 - \hat{p}) / n + z^{2} / (4n^{2})}\right\}. \]

(p + z^2/(2 * n) + c(-z, z) * sqrt(p * (1 - p)/n + z^2/(4 * n^2)))/(1 + z^2/n)
## [1] 0.04348 0.21360
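
Finally, the Wilson interval is what base R's prop.test() returns when its continuity correction is turned off; with the default correction it gives a wider interval, which appears to correspond to the 'prop.test' row of the binom.confint output above:

prop.test(x = 5, n = 50, correct = FALSE)$conf.int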