How Unpopular is Jeremy Hunt?

Very, Apparently

The recently replaced Culture Secretary in the current (September 2012) UK Government achieved something fairly remarkable in this poll. In a reasonably sized opinion poll, it is quite unlikely to get a count of zero ‘yes’ responses to a ‘yes/no’ question. The poll is reported in Museums Journal, who do not state the number of responses, but assuming there are a reasonable number (ie more than one or two, and not including Jeremy Hunt’s mother) this is really quite an extreme result. However, it leads to an interesting question. Generally, when analysing binary response questions, it is useful to supply a confidence interval, in addition to an estimate of the proportion of ‘yes’ answers. The de facto way of obtaining a \(100 \alpha\) per cent confidence interval is by using the Wald approximation: \[ \left[\hat{\theta}-z_{\alpha/2} \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{n}},\hat{\theta}+z_{\alpha/2} \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{n}} \right] \] where the number of ‘yes’ responses (\(m\), say) are assumed to follow a binomial distribution where \(n\) is the number of responses, \(\theta\) is the probability of a ‘yes’ response, its estimate \(\hat{\theta} = m/n\) is the proportion of ‘yes’ responses in the survey, and \(z_{q}\) is the \(q\) th quantile of the standard normal distribution. The problem here is that \(m\), and hence \(\hat{\theta}\) is zero. This gives a confidence interval of \([0,0]\) - that is, an implied certainty that \(\theta\) is zero, regardless of \(\alpha\) - or arguably a degenerate confidence interval. Even for the Culture Secretary in question, this is at best unihelpful - there will be some non-zero values of \(\theta\) - albeit rather small - for which this survey outcome is still a reasonable possibility. So, are there better ways of generating a confidence interval when we get a result like this? That is the subject of this post. There are many problems associated with Jeremy Hunt, but this is the only statistical one I know of…

Methods Of Computing Binomial Confidence Intervals

Although the Wald method is not much help here, it can be coded in R as below:

# Function to compute Wald Confidence interval: 
# m - number of 'yes' responses;
# n - Total responses;
# alpha - confidence interval (specified as a percentage)
# Returns: a two element array with lower and upper CI limits
wald.binom.ci <- function(m,n,alpha) {
  theta.hat <- m/n
  theta.se <- sqrt(theta.hat*(1 - theta.hat)/n)
  z.val <- qnorm(1-(1 - alpha/100)/2)
  return(c(theta.hat-theta.se*z.val,theta.hat+theta.se*z.val)) }

If there is a low - but non-zero - response rate, it gives something at least plausible. For example, suppose we have 4 ‘yes’ responses out of 100, and want a 95% confidence interval:

wald.binom.ci(4,100,95)
## [1] 0.001592707 0.078407293

and also, sensibly it gives a tighter confidence interval for a larger survey with the same proportion of ‘yes’ responses:

wald.binom.ci(40,1000,95)
## [1] 0.02785455 0.05214545

However, it does give the above mentioned unhelpful output for a situation in which there are zero ‘yes’ responses:

wald.binom.ci(0,100,95)
## [1] 0 0

it also gives some counter-intuitive results for very small ‘yes’ response counts:

wald.binom.ci(2,100,95)
## [1] -0.007439496  0.047439496

This highlights a second problem with this formula - the lower limit of the confidence interval given here is negative, but since \(\theta\) is a probability, \(\theta \in [0,1]\).

What Went Wrong?

With the confidence interval, that is, rather than with Jeremy Hunt - I’m trying to keep these posts reasonably short. To answer this, it is helpful to recall what a confidence interval is. Essentially an \(\alpha\) % confidence interval for a parameter \(\theta\) is an interval of real numbers whose upper and lower limits are computed from some sample from a probability distribution, in a way such that the probability that \(\theta\) is inside this interval is \(\alpha / 100\). This is a commonly used idea in classical statistical inference - and it is worth reinforcing that the random quantities here are the upper and lower limits of the CI, given a sample data set. \(\theta\) itself is assumed to be deterministic, but not directly observable. However, to obtain a confidence interval, we need to make some kind of distributional assumption for the data - basically so that we are able to come up with formulae for upper and lower limits of a confidence interval that do indeed have a \(\alpha/100\) probability of containing \(\theta\). However sometimes the assumptions are based on approximation. There are often situations where these approximations do not work very well – two examples are given in the previous section.

The negative lower limit on the confidence interval is fairly easy to correct. Since we know that, by definition, \(\theta \in [0,1]\), the probability that the CI contains \(\theta\) will not change if we replace a negative lower limit by zero (similarly, we could use the same argument to replace an upper limit that is greater than unity, by unity). We can re-write the wald.binom.ci function to handle this - with the help of a trim.ends function that forces the confidence interval limts to remain within \([0,1]\).

trim.ends <- function(x) return(pmax(pmin(x,1),0))

wald.binom.ci <- function(m,n,alpha) {
  theta.hat <- m/n
  theta.se <- sqrt(theta.hat*(1 - theta.hat)/n)
  z.val <- qnorm(1-(1 - alpha/100)/2)
  return(trim.ends(c(theta.hat-theta.se*z.val,theta.hat+theta.se*z.val))) }

wald.binom.ci(2,100,95)
## [1] 0.0000000 0.0474395

However, this hasn’t really solved the problem - firstly it can be checked that for a Jeremy Hunt style zero ‘yes’ count we still get the \([0,0]\) confidence interval. Secondly, if the lower confidence limit is negative, clearly the results here are suspect - and it is a fair possibility that the upper limit is also unreliable. Perhaps we need to think further about the formula for the confidence interval - how it was derived, and when the approximation may not be reliable.

Shortcomings of the Approximation

The confidence interval used above is derived via the Wald large-sample Normal test for general maximum-likelihood estimators: \[ \left| \frac{\hat{\theta} - \theta}{\hat{\textrm{se}}(\hat{\theta})} \right| \leq z_{\alpha/2} \] where \(\hat{\theta}\) is the maximum likelihood estimate of \(\theta\) and \(\hat{\textrm{se}}(\hat{\theta})\) is the estimate of the standard error obtained by ‘plugging in’ \(\hat\theta\) to the asymptotic formula obtained from likelihood theory. There are a number of situations where this approximation breaks down: it is assumed that the left hand side of the Wald inequality within the modulus follows a Normal distribution. A normal distribution is defined for a contiuous random variate in the range \((-\infty,-\infty)\) - but for the binomial model here, \(\hat\theta\) will always be in the range \([0,1]\). Also, since the sample size is an integer, as is the number of ‘yes’ responses, \(\hat\theta\) will always be a rational number. For fixed \(n\) it can take exactly \(n+1\) possible values - for example if \(n=100\) then \(\hat\theta \in \{0,0.01,0.02,...,0.99,1 \}\). Sometimes these qualitative discrepancies are very small - in particular if \(\theta\) is not close to zero or one, and \(n\) is large. However on other occasions (such as the example here), there are clearly problems. Since, for the binomial, \(\hat{\textrm{se}}(\hat\theta) = n^{-1/2}\left[\hat\theta(1 - \hat\theta)\right]^{1/2}\) is zero for zero ‘yes’ responses, the approximation pretty much breaks down entirely.

Some authors have offered advice as to when a Wald confidencde interval is trustworthy - but a key problem here is that they offer very different advice! One such guideline is that \(n\theta(1-\theta) \geq 10\) - although other texts put 5 on the right hand side instead of 10. Also many texts just use terms like ‘unless \(\theta\) is very small’ or ‘for large \(n\)’ without specifying quantities. Also, even the more practical guideline quoted requires \(\theta\) - the true parameter - and not \(\hat\theta\) the estimate we have obtained. If we knew \(\theta\) there would be no need for the survey in the first place! Also, in this case, if we did substitute \(\hat\theta\) - equal to zero - then the conditions where the approximation were appropriate would never apply. Thus, we know that the Wald approximation isn’t going to work well here, so we need to consider alternatives.

Alternative Binomial Confidence Intervals

Fortunately, others have identified this problem and have suggested alternative confidence interval formulae. One is the Agresti-Coull interval. Although the justification (Agresti and Coull, 1998) is omitted here, this interval is obtained by adding \(z_{\alpha/2}^2\) to \(n\), and \(z_{\alpha/2} ^2/2\) to \(m\) and then plugging these into the formulae for the Wald interval. For 95% intervals, \(z_{\alpha/2}\) is close to 2, so the rule here is sometimes simplified to ‘add two ’yes’ responses and two ‘no’ responses to the data, and then compute a Wald interval’.

ac.binom.ci <- function(m,n,alpha) {
  adjust <- qnorm(1-(1 - alpha/100)/2)^2
  return(wald.binom.ci(m+adjust/2,n+adjust,alpha)) }

ac.binom.ci(0,100,95)
## [1] 0.00000000 0.04441205

Having defined this approach in R and tested it on a zero ‘yes’ response result, the output is at least plausible. Clearly the estimated proportion of ‘yes’ responses in the population is low, but not fixed at zero. We have an upper 95% CI of about 4%.

Another alternative is the Wilson confidence interval. This finds the set of values for \(\theta^*\) such that an \(\alpha\)% significance test of the hypothesis \(H_0: \theta=\theta^*\) would fail to be rejected given the observed data, if the Wald Normal approximation for the distribution of \((\hat\theta - \theta)/\hat{\textrm{se}}(\hat{\theta})\) is assumed. This method of deriving confidence intervals is sometimes called ‘inverting the hypothesis test’. The formula here is

\[ \left[ \frac{m + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2} - \frac{z_{\alpha/2}n^{1/2}}{n + z_{\alpha/2}^2}\left(\hat\theta(1-\hat\theta)+z_{\alpha/2}^2/(4n) \right)^{1/2},\frac{m + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2} + \frac{z_{\alpha/2}n^{1/2}}{n + z_{\alpha/2}^2}\left(\hat\theta(1-\hat\theta)+z_{\alpha/2}^2/(4n) \right)^{1/2} \right] \]

This confidence interval has the same centre as the Agresti-Coull interval, but generally is not as wide. It is of course a lot more complicated to remember! Here it is in R:

wilson.binom.ci <- function(m,n,alpha) {
  theta.hat <- m/n
  z.val <- qnorm(1-(1 - alpha/100)/2)
  z.val.sq <- z.val * z.val
  mid <- (m + z.val.sq/2)/(n + z.val.sq) 
  gap <- z.val*sqrt(n)/(n + z.val.sq)
  gap <- gap*sqrt(theta.hat*(1 - theta.hat) + z.val.sq/(4*n))
  return(trim.ends(c(mid - gap,mid + gap))) }

wilson.binom.ci(0,100,95)
## [1] 0.0000000 0.0369935

For the same data, the upper limit is somewhat lower than with the Agresti-Coull interval, as said earlier. However, both of these approaches still assume a Normal distribution in some way - and so are still approximations. How do we decide which is best?

Assessing Confidence Interval Formulae

One way of assessing a confidence interval is by computing is coverage. This is defined to be the probability that \(\theta\) falls within the confidence interval. This sounds rather strange, hadn’t this already been defined to be \(\alpha\), the confidence level? Well, the difference is that coverage is meant to be the exact probability, even if the confidence interval itself is approximate. Coverage is also useful even if the confidence interval is exact - if the distribution assumptions made do not apply exactly. For example, one might wish to compute confidence intervals for a mean, using a Normal distribution assumption, but the true distribution may be slightly skewed. If that were the case, what would the probability of the confidence interval containing the mean now become - and would this lead to major problems?

So here, it would be useful to compute the coverage for the Agresti-Coull and Wilson approaches. Another complication is that in most cases, th coverage may depend on \(\theta\) - so it is necessary to supply a value of \(\theta\) to obtain the coverage - or to compute coverage over a range of \(\theta\) values to obtain a range of possible values.

For results based on binomial data, with a given \(n\) and \(\theta\), there are exactly \(n+1\) possible confidence intervals (since the only variable quantity in the formula is \(m\) under these conditions, and it can only take values \(\{0,1,...,n\}\)).Thus, a brute force search can be used to find the coverage - simply find the values of \(m\) whose associated confidence intervals contain \(\theta\) and add up their probabilities as given by the binomial distribution.

Here is some code to do this:

# returns coverage probability,  given:
# n - the sample size;
# theta - the binomial success probability;
# alpha - the level of the confidence interval;
# ci.fun - the function used to compute the confidence interval.  Any of the ones defined above will work.

coverage <- function(n,theta,alpha,ci.fun) {
  p.success <- 0
  for (m in 0:n) {
    ci <- ci.fun(m,n,alpha)
    if (ci[1] < theta && theta < ci[2]) { p.success <- p.success + dbinom(m,n,theta) }
    }
  return(p.success)}

# Try this on a sample of size 100 if the population probability of a 'yes' response is 0.02,  for Agresti Coull,  and Wilson intervals at 95% 

coverage(100,0.02,95,ac.binom.ci)
## [1] 0.9845164
coverage(100,0.02,95,wilson.binom.ci)
## [1] 0.9491696

It seems from this that both are reasonably close approximations if \(\theta = 0.02\) - although it seems in this case that the Wilson interval is slightly more ‘risky’ than the Agesti-Coull, in that although we were hoping for at least 95% probability of coverage, the reality is slightly less than this. Interestingly, if \(\theta = 0.01\) the situation worsens:

coverage(100,0.01,95,ac.binom.ci)
## [1] 0.981626
coverage(100,0.01,95,wilson.binom.ci)
## [1] 0.9206268

Here, the coverage of the Wilson interval is notably below 95%. This does need to be offset - although the Agresti-Coull interval has a better coverage it is always a wider interval - in the two examples above it is more like a 98% confidence interval in practice. To see how the two approaches perform for a given \(n\) as \(\theta\) varies, graphs may be drawn in R:

lseq=501
theta.seq <- seq(0,1,l=lseq)
theta.seq <- theta.seq[-c(1,lseq)]
covg.w <- covg.ac <- theta.seq * 0.0


for (i in 1:length(theta.seq)) {
  covg.ac[i] <- coverage(100,theta.seq[i],95,ac.binom.ci)
  covg.w[i]  <- coverage(100,theta.seq[i],95,wilson.binom.ci)
}

par(mfrow=c(1,2))
plot(theta.seq,covg.ac,xlab=expression(theta),ylab='coverage (Agesti-Coull)',type='l',ylim=c(0.9,1))
abline(h=0.95,lty=2,col='red')
plot(theta.seq,covg.w,xlab=expression(theta),ylab='coverage (Wilson)',type='l',ylim=c(0.9,1))
abline(h=0.95,lty=2,col='red')

This is quite a surprise - at different \(\theta\) values there are quite different coverages, with a notable degree of oscillation for both kinds of confidence interval. For low values of \(\theta\) it seems that the Agresti-Coull interval is the best in terms of ensuring 95% coverage, even if it is a larger interval. Experimentation with different values of \(n\) seems to lead to similar conclusions.

One possible explanation for the rather strange shape of these curves is that as a binomial distribution for a given \(n\) has a finite number of outcomes, there are – as stated earlier – a finite number of confidence intervals, each associated with a single probability. So the possible values of coverage probability is only a fixed number of points in the range \([0,1]\). There may not actually be an interval with a coverage probability of 0.95 - and a small change in \(\theta\) on some occasions leads the coverage probability to make a ‘quantum leap’. However, these leaps depend on the finite binomial probability values of each outcome, and do not appear to happen in a simple pattern. Both the Wilson and Agresti-Coull intervals have this characteristic. However, whereas the Wilson interval seems to fluctuate around the desired 95% coverage level wildly when \(\theta\) is close to zero or one, the Agresti-Coull interval seems to have an increasing coverage probability in these areas. To me, it looks as though the jumping effect has another U-shaped curve effect superimposed - and this is helpful if we require a conservative interval, and \(\theta\) is quite low (say less than about 0.05). Obviously we don’t know the true value of \(\theta\), but if we did get a survey with a reasonably size \(n\) and no ‘yes’ responses, thats a pretty strong hint that \(\theta\) is in that range - and that if you want to avoid the risk of an unaccepably low coverage, then you should use the Agresti-Coull formula.

Epilogue

So we now know that estimation of Jeremy Hunt’s unpopularity turns outr to be quite a complicated statistical issue. It is quiye interesting to note that this very common statistical problem (estimating a population prortion from a survey) has some quite surprising proporties if you look a little beneath the surface. Of course we havn’t quite been able to estimate a confidence interval for the survey result shown in the introduction, as unfortunately the value of \(n\) is not given. However I will contact Museums Journal and attempt to obtain this figure. With this, I will be able to compute a confidence interval which will, I expect answer this post’s key question by providing an interval that suggests his unpopularity is somewhere between universally reviled and just extremely unpopular. He has just been appointed as Secretary of State for Health, and has already made a few ‘interesting’ statements - so it is possible this topic may be revisted.

Reference

A lot of the statistical ideas in this paper stem from

Interval Estimation for a Binomial Proportion Statistical Science 2001, 16(2), pp101-133

which cites the references for Wilson and Agresti-Coull. My main contribution has been to put these ideas in context, and provide some practical examples of computation using R.