In Bayesian analysis, uncertainty is communicated with standard errors, credible intervals, and predictive intervals. We discuss how to evaluate them and their relationships with traditional methods, such as Fisher Information, likelihood profiles, likelihood ratio tests, and p-values. Examples include the normal and Poisson distributions, and linear and non-linear regression.
Today
Discussion from last time:
Exercise. Do increasers and decreasers tend to have positive or negative `htimepost:dist2shore` interactions? If so, why or why not? Check this for seabird and non-seabird species. Where and why might GLM and GJAM results disagree?
You might try this:
The objects `bm` and `bs` include one column for each species. They hold the posterior mean and standard error, respectively, for each predictor-species combination. Find the increasers and decreasers, based on the variable `htimepost`, using the approximate rule that zero is above `bm + 1.96*bs` (a decreaser) or below `bm - 1.96*bs` (an increaser). Determine which of these are seabirds using the vector of genera. You might summarize coefficients with biplots that compare the main effects (`htimepost`, `dist2shore`) with the interaction term.
Review of factors, interactions, MVN distributions.
This vignette.
Objectives
- articulate a frequency versus a subjective probability interpretation of uncertainty
- recognize the asymptotic Gaussian approximation and the role of the standard error in classical interval estimation
- relate the likelihood function to Fisher Information and the classical standard error
- relate the likelihood profile to classical hypothesis testing
- implement a simple bootstrap
- find the posterior distribution and Bayesian standard error for a conjugate likelihood-prior pair
Classical interval estimation
Uncertainty can be quantified with intervals estimated from data. The maximum likelihood estimate (MLE) of a parameter \(\theta\) is defined as the value finding most support from the data, obtained by optimizing the likelihood function \(L(\theta)\). But how much better is the MLE than some other value? As data accumulate, confidence in the MLE should increase. The shape of the likelihood function contains additional information about the parameter estimate (Fisher 1959); increasing curvature in -\(\log L\) with accumulation of data describes increasing confidence in the ML estimate.
Both classical and Bayesian approaches assume some underlying true value for the parameter that is fixed, and both produce estimates that are, in different senses, ‘random’. A frequentist confidence interval is interpreted as the fraction of intervals estimated from a large number of idealized data sets generated by the same process that would include the true parameter value. From each of these data sets we would obtain a different estimate. The true (but unknown) parameter value is viewed as being fixed and the data as being random. In the nonparametric bootstrap, the frequentist concept is simulated by re-sampling from the sample itself, generating random data sets, and building up a frequency distribution of estimates. Because the data are viewed as random, so too is the confidence interval evaluated from them.
The frequentist confidence interval is based on the idea of “coverage”, whereas a Bayesian credible interval comes from integrating a likelihood (times prior) as a parameter density. Although Bayesians speak of a parameter as random, the ‘randomness’ describes degree of belief in parameter values. For example, based on forecasts, I may believe with 70% certainty that I will see rain tomorrow. My tomorrow will not happen 100 times, so it could not rain 70 of those 100 times. It will happen once, and it will rain or not. So the frequency interpretation is, at best, a device for thinking about uncertainty, but it does not describe the problem. The Bayesian approach uses a probability model for the parameter value to quantify uncertainty prior to data collection, which is updated by data. The posterior is conditioned on the data actually collected. If I had used a model to predict rainfall tomorrow, I could update it tomorrow. This concept of updating my understanding with data does not have a place in the frequentist approach.
In both the Bayesian and frequentist cases, the confidence limits represent uncertainty as to the ‘true’ value. In this sense, Bayesian and classical approaches are not as different as they first appear.
In this unit I compare different types of “confidence envelopes”—a term I use to include both confidence and credible intervals—but focus first on a classical notion of a confidence interval.
A confidence interval from introductory statistics class
A brief example outlines the frequentist notion of a confidence interval. Consider a random variate \(y_i\) drawn from a normal distribution,
\[ y_i \sim N(\mu, \sigma^2) \] with parameters \(\mu\) and \(\sigma^2\). For the moment, suppose the variance parameter \(\sigma^2\) is known, as would be the case if \(y_i\) were a measurement obtained on an instrument used previously. Based on this one observation, uncertainty as to the true value of \(\mu\) could be represented by a density. This density is centered on the sample value \(y_i\), because \(y_i\) is the best (ML) estimate of \(\mu\). The normal density function is centered on the value \(\mu\),
\[ N(y_i | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma} exp \left[- \frac{1}{2 \sigma^2} (y_i -\mu)^2 \right] \]
The “most likely” value of \(\mu\) is that which yields the highest probability of the data, i.e., the ML estimate \(\hat{\mu}\). To demonstrate that \(y_i\) is the ML estimate of \(\mu\), recall the log likelihood (including only terms involving \(\mu\)),
\[ \log L \propto - \frac{1}{2 \sigma^2} (y_i -\mu)^2 \] Differentiating, setting the derivative equal to zero, and solving for \(\mu\) gives the value \(\hat{\mu}\), which is equal to \(y_i\). Because \(y_i\) is the only observation, a density centered at \(y_i\) is more likely than a density centered on any other value of \(y\). To estimate uncertainty in this estimate I need multiple observations.
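A quick numerical confirmation: maximizing the log likelihood of a single (illustrative) observation recovers that observation as the estimate of \(\mu\). A minimal sketch, assuming \(\sigma = 1\):

y1 <- 3.2                                        # a single illustrative observation
optimize(function(m) dnorm(y1, m, 1, log=TRUE),  # log likelihood as a function of mu
         interval=c(-10, 10), maximum=TRUE)      # maximum occurs at m = y1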
The classical confidence interval can be calculated or approximated in several ways. I introduce the topic with a review of the standard error. Three methods for constructing confidence intervals that follow all involve the likelihood function. The first technique, the likelihood profile, focuses on likelihood shape. I follow with Fisher Information, which summarizes that shape in terms of its curvature (second derivative) at the MLE. Next is a numerical method, the bootstrap, which yields a frequency distribution of estimates. All of these methods approximate the frequentist idea of coverage (the true parameter value lies within a fraction \((1 - \alpha)\) of the CIs that would be constructed from similar experiments). A fourth method, integrating the likelihood, only makes sense in the context of Bayes’ theorem, assuming that the prior is so flat as to be ignored. Last I describe a Bayesian credible interval, which comes from combining the likelihood with a prior distribution.
The standard error as basis for a confidence interval
The standard error expresses uncertainty in an estimate. This example concerns the standard error for a mean parameter of a normal distribution \(\mu\), which is estimated by taking the average over \(n\) observations, i.e., the sample mean
\[ \hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \] This is the maximum likelihood estimate of \(\mu\). [It is also the least-squares estimate.] Confidence in this estimate increases with sample size \(n\). The residual variance is the estimate of \(\sigma^2\),
\[ \hat{\sigma}^2 = var[y] = \frac{1}{n-1} \sum_{i=1}^n (y_i - \hat{\mu})^2 \] The standard error summarizes support from the data for a particular value of \(\mu\). It is the square root of the variance estimate divided by the sample size,
\[ se_{\hat{\mu}} = \sqrt{ \frac{\hat{\sigma}^2}{n} } \]
From the formula for the standard error it is clear that the width of the confidence interval must decrease with increasing sample size.
Recall from introductory statistics that \(68\%\) of the intervals spanning 1 se on either side of the mean estimate are expected to contain the true value. Ninety five percent of the intervals spanning 1.96 standard errors of the mean estimate should contain the true value. This is equivalent to saying that the frequency distribution of parameter estimates we would obtain from repeating the experiment can be approximated by a normal distribution
\[ \hat{\mu} \sim N \left(\bar{y}, se^2_{\hat{\mu}} \right) \] Here is a random sample of \(n = 10\) observations drawn from the normal density with mean \(\mu = 100\) and standard deviation \(\sigma = 20\):
n <- 10                              # sample size
mu <- 100                            # true mean
sd <- 20                             # true standard deviation
yseq <- seq(0, 2*mu, by=1)           # histogram breaks
y <- rnorm(n, mu, sd)                # simulated sample
hist(y, breaks=yseq, probability=T)
se <- sqrt(var(y)/n)                 # standard error of the mean
mle <- mean(y)                       # ML estimate of mu
ci <- mle + c(-1,1)*1.96*se          # approximate 95% CI
abline(v=ci, lty=2)

n observations drawn from a normal distribution. The CI around the mean is shown with vertical dashed lines.
The sample mean \(\hat{\mu}\) will be close to the true population mean \(\mu\). So with just a few observations I have a reasonable estimate of the mean. Ninety five percent of the confidence intervals obtained from samples of size \(n = 10\), calculated as 1.96 standard errors of sample means, should contain the true \(\mu\). The 95% confidence interval I obtain for this experiment is ci = 91.264, 118.534.
I could check this result using qnorm to find the values where the lower and upper tails of the distribution equal \(\alpha/2\), where \(\alpha = 0.05\),
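A one-line version of that check, using mle and se from the code above, might be:

qnorm(c(.025, .975), mle, se)    # 2.5% and 97.5% quantiles of N(mle, se^2)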
## [1] 91.26382 118.53384
If I could repeat the experiment 1000 times, a histogram of CIs might be represented like this:
rep <- 1000              # number of simulated experiments
ci <- matrix(0,rep,2)    # holds lower and upper limits
for(i in 1:rep){
y <- rnorm(n,mu,sd)
se <- sqrt(var(y)/n)
mle <- mean(y)
ci[i,] <- c(mle - 1.96*se,mle + 1.96*se)
}
clo <- hist(ci[,1],breaks=yseq,plot=F) #lower conf limit
chi <- hist(ci[,2],breaks=yseq,plot=F) #upper conf limit
plot(clo$mids, clo$density, type = 's' )
lines(chi$mids, chi$density, type='s' )

Random lower and upper confidence intervals generated from a frequentist concept.
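The coverage fraction can be computed from these simulated limits; a minimal sketch, assuming ci and mu from the loop above:

cover <- mean(ci[,1] < mu & ci[,2] > mu)   # fraction of intervals containing mu
cover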
The value cover = 0.907 is the fraction of confidence intervals that include the true value of the parameter. (Coverage falls somewhat short of 0.95 here because the standard error is itself estimated from only \(n = 10\) observations.) Because the likelihood is Gaussian, these two densities yield an answer not too different from what I would obtain if I simply integrated the tails of the single density \(N \left(\hat{\mu} | \bar{y}, se^2_{\hat{\mu}} \right)\) centered on the sample estimate \(\hat{\mu} = \bar{y}\).
Confidence interval based on the standard error of the mean.
This example illustrates the random character of a classical confidence interval. Any given sample from the same population would produce a different confidence interval, referenced by the estimate and its standard error. A sample of size \(n = 50\) results in an estimate \(\hat{\mu}\) that is much closer to the true value than I obtain with \(n = 10\). The increased information available from a sample of size \(n = 50\) over one of size \(n = 10\) is reflected in a narrower \(95\%\) CI. Thus, a confidence interval can be calculated from a standard error. When the likelihood is not normal, the confidence interval is approximate.
Likelihood profile
The shape of the likelihood function contains information about uncertainty. In the absence of a probability model, the likelihood shape is used to describe confidence.
An absolute probability cannot be assigned to a particular parameter estimate or to a model. We can think of the likelihood of a particular parameter value only in the context of some other value. The likelihood function is used for such comparisons. Contrary to intuition, classical methods do not permit integrating the likelihood function. Instead, confidence intervals can be derived from ratios of likelihoods, using a likelihood ratio test (LRT).
Let the vector \(\mathbf{y} = [y_1, \dots ,y_n]'\) represent \(n\) observations and \(\mu\) be one or more (a vector of) parameters about which we wish to draw inference. Ideas about the value(s) of \(\mu\) can take the form of hypotheses. A probability statement is made in the context of two competing hypotheses, e.g., a “null” (say, \(\mu = \mu_0\)) and an alternative, such as \(\mu \ne \mu_0\) or \(\mu = \mu_1\). The standard \(F\), \(t\), and \(\chi^2\) tests all arise in the context of classical methods for relating likelihoods of two hypotheses based on the view that both values of \(\mu\) are fixed, and the data are random. The probability involves two likelihoods, both of which assume the respective hypothesis to be true. I begin with a sample mean.
Likelihood ratio and deviance
A confidence interval based on the shape of the likelihood function can use a \(\chi^2\) test of the ratio of the likelihood at a candidate parameter value to the likelihood at the MLE.
Suppose I want to make a probability statement about the value of \(\mu\) finding most support from a data set, i.e., \(\hat{\mu}\), when the likelihood is normal. The alternative hypothesis concerns a rival value of \(\mu\), call it \(\mu_0\). The test statistic for this comparison involves a likelihood ratio
\[ R = \frac{L(\mathbf{y}; \mu_0)}{L(\mathbf{y}; \hat{\mu})} \] This is the ratio of two likelihoods taken over the same data \(\mathbf{y}\), but assuming different values of \(\mu\). The likelihood ratio can be evaluated for any two values of \(\mu\). When the denominator is taken at the MLE (as in this case), the LR is termed the normed likelihood function (Lindsey 1999) and has range \([0, 1]\). The test statistic is sometimes called a deviance,
\[ D = -2 \log R \] and is distributed as \(\chi^2\) with, in this case, 1 degree of freedom, because there is only one parameter at issue.
Consider again the likelihood for the mean of the normal distribution,
\[ \begin{aligned} R &= \frac{L(\mathbf{y}; \mu_0)}{L(\mathbf{y}; \hat{\mu})} \\ &= \frac{\sigma_0^{-n} exp \left[ -\frac{1}{2\sigma_0^2} \sum_i (y_i - \mu_0)^2 \right]} {\hat{\sigma}^{-n} exp \left[ -\frac{1}{2\hat{\sigma}^2} \sum_i (y_i - \hat{\mu})^2 \right]} \\ &= \left( \frac{ \hat{\sigma} }{\sigma_0} \right)^n \end{aligned} \] with the deviance
\[ D = -2 \log R = -2n \left(\log \hat{\sigma} - \log \sigma_0 \right) \] The likelihood ratio test provides a confidence interval that is identical to that found by the integral method. Because the sampling distribution is normal, the log likelihood is a parabola, symmetric about its maximum at \(\hat{\mu}\). If \(\mu_0\) is taken to be the value of \(\mu\) where the log likelihood function is 1/2 unit below the value at the maximum, then \(D = 1\), and the probability that \(D\) (\(\chi^2\) with 1 df) does not exceed this value is 0.68. A value of \(D = 3.84\) corresponds to a tail probability of 0.05.
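These \(\chi^2\) probabilities are easy to verify in R:

pchisq(1, 1)          # ~0.68: probability that D <= 1
1 - pchisq(3.84, 1)   # ~0.05: tail probability at D = 3.84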
Likelihood ratio test to profile
Using the Poisson distribution for count data, I want to estimate the Poisson mean. A likelihood profile considers the likelihood function at a range of parameter values.
Here is the Poisson likelihood I will use:
\[ L(\mathbf{y};\lambda) = \prod_i^n Poi(y_i| \lambda) = \prod_i^n \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} \propto \lambda^{\sum_i y_i} e^{-n \lambda} \] The log likelihood is
\[ \log L \propto n \bar{y}\log \lambda -n \lambda \]
I first find the MLE by differentiating and solving for \(\hat{\lambda}\):
\[ \frac{\partial{\log L}}{\partial \lambda} = \frac{n \bar{y}}{\lambda} - n \] The solution is \(\hat{\lambda} = \bar{y}\).
Substituting for the MLE, \(\hat{\lambda} = \bar{y}\), the likelihood ratio is
\[ \begin{aligned} \log R &= \log \left( \frac{ \lambda_0^{n \bar{y}} e^{-n \lambda_0 }}{\bar{y}^{n \bar{y}} e^{-n \bar{y}} } \right) \\ &= n \left[ \bar{y}(\log \lambda_0 - \log \bar{y} + 1) - \lambda_0 \right] \end{aligned} \]
Here is a simulated data set that I use to draw the likelihood:
n <- 10
y <- rpois(n,4.7) # simulated counts
ybar <- mean(y) # MLE
rseq <- seq(.2,3,length=100)*ybar # sequence for plotting
R <- n*(ybar*( log(rseq) - log(ybar) + 1) - rseq ) # normed likelihood
D <- -2*R # Deviance
P <- 1 - pchisq(D,1) # P value
CI <- range(rseq[P > .05]) # CI from P value
par(mfrow=c(3,1), mar=c(4,5,2,4), bty='n')
plot(rseq, R, type='l', xlab='')
abline(v=ybar,lty=2)
abline(v=CI, lty=2)
title ('log R')
plot(rseq,D,type='l', xlab='')
abline(v=ybar,lty=2)
abline(h=3.84,lty=2) # value of D at P = 0.05
abline(v=CI, lty=2)
title('Deviance')
plot(rseq, P, type='l', xlab='lambda_0')
abline(v=ybar,lty=2)
abline(h=.05, lty=2)
abline(v=CI, lty=2)
title('p value')

Relationships between likelihood ratio, Deviance, and P value for a Poisson example.
I used the R function pchisq for the \(\chi^2\) distribution with one degree of freedom to find the area in the tail, beyond the value of \(D\) evaluated for this data set.
Repeat this experiment using a sample size of 100.
Unlike the Gaussian example, the CI for the Poisson distribution is not symmetric.
The relationship between likelihood and deviance is apparent from a plot of the normed likelihood function \(R\), where the likelihood in the denominator is taken to be the ML estimate. The normed likelihood thus has a maximum value of 1 (\(\log(1) = 0\)) at the MLE and declines on either side. At an \(\alpha\) level of, say, 0.05 the CI can be constructed by finding the values of \(D\) that yield \((1 - \alpha) = 0.95\) probability from the \(\chi^2\) test. At the MLE, \(R = 1\), \(D = 0\), and \(P = 1\), because the ratio is taken for two equivalent values finding equal support from the data. The profile itself is constructed by calculating \(R\) for a range of parameter values \(\lambda_0\) in the neighborhood of the MLE. Deviances increase on either side of the MLE, whereas the associated \(P\) values decline.
The likelihood profile is based on the distance below the maximum likelihood, and not on integrating the likelihood.
The likelihood ratio provides an economical means for comparing parameter values, because all coefficients that do not contribute drop out. The normal example collapsed to a ratio of standard deviations, because other coefficients were redundant between the two models being compared. We can ignore some coefficients provided the comparison involves the same functional form. Model comparisons that involve different functional forms can still be accomplished using likelihoods, but all coefficients must be retained.
Approximate CIs from Fisher information
Fisher Information uses the curvature of the log likelihood function to estimate a variance for the parameter error distribution. This differs from the likelihood profile, which makes use of the decline in the normed likelihood function on either side of the MLE.
The curvature is actually a quadratic approximation (2nd derivative) of the log likelihood function in the neighborhood of the MLE. If this curvature is slight, then the data do not contain much information about the parameter, and the standard error is large, and vice versa. The quadratic is exact for a normal sampling distribution, because the log likelihood is quadratic, but it can be a poor approximation for asymmetric likelihoods when sample sizes are small.
Fisher Information is inversely related to the curvature of the likelihood function. This curvature is found by approximating the log likelihood function with a polynomial, called a Taylor expansion. This is shown in the Appendix.
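As a concrete check, the curvature-based (Wald) interval can be compared with the profile interval for the Poisson example above; a minimal sketch, assuming n, ybar, and CI from that example are still in memory:

seWald <- sqrt(ybar/n)                  # 1/sqrt(I); I = n/ybar at the MLE
ciWald <- ybar + c(-1,1)*1.96*seWald    # quadratic (Wald) approximation
rbind(wald = ciWald, profile = CI)      # compare with the profile CI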
The same approach works in multiple dimensions. If I have the regression model \(y_i = \mathbf{x}'_i \boldsymbol{\beta} + \epsilon_i\), then the covariance matrix on the coefficients is one that we’ve seen before, giving Fisher Information
\[ I = \frac{1}{ \sigma^2}\mathbf{X}'\mathbf{X} \] and covariance matrix
\[ cov(\hat{\boldsymbol{\beta}}) = \left( I \right)^{-1} = \sigma^2 \left( \mathbf{X}'\mathbf{X} \right)^{-1} \] (Appendix).
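A quick simulation check of this covariance in R: vcov applied to an lm fit returns \(\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}\), which can also be assembled directly. Variable names here are illustrative:

n <- 50
x <- rnorm(n)
X <- cbind(1, x)                          # design matrix with intercept
y <- drop(X %*% c(1, 2)) + rnorm(n, 0, .5)
fit <- lm(y ~ x)
vcov(fit)                                 # sigma^2 (X'X)^{-1} from lm
summary(fit)$sigma^2 * solve(t(X) %*% X)  # same matrix, assembled directly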
Bootstrap
The bootstrap is a numerical method that can be used to obtain confidence intervals (Efron and Tibshirani 1993). The basic methodology can be extended for propagation of error for purposes of prediction, sometimes termed Monte Carlo simulation.
Unlike other methods, bootstrapping does not require much math. If a point estimate is available, then so too is the bootstrap. There are two types of bootstraps to consider, non-parametric and parametric.
The nonparametric bootstrap produces estimates based on resampling the data. The resampling procedure involves sampling from the data ‘with replacement’. The sample contains \(n\) observations. Sampling with replacement means that once an observation has been included in the sample, it can be included again. This procedure simulates the frequentist concept of obtaining estimates from repeated similar experiments. It substitutes resampling of one data set for repeated experiments.
Simple recipe
The following five steps can be used to simultaneously estimate standard errors and confidence intervals (a sketch in R follows the list). The extension to multiple parameters is straightforward. This algorithm assumes a sample of size \(n\) for which I can estimate the parameter \(\theta\) using an estimator function (e.g., a likelihood function).

1. Draw with replacement a ‘resample’ of size \(n\) from the original sample.
2. Estimate \(\theta\) from this resample. Let \(\theta_b\) represent this \(b^{th}\) estimate of \(\theta\).
3. Repeat this procedure \(B\) times. For a standard error estimate, \(B\) might be as low as 50. For a \(95\%\) confidence interval, \(B\) might be more like 2000. (Smaller \(\alpha\) values require larger samples.) There are now \(B\) estimates of the parameter, one for each bootstrap resample.
4. Estimate the parameter standard error as the standard deviation of the \(B\) replicates.
5. Estimate the confidence interval as the interior \(100(1 - \alpha)\%\) of the bootstrapped estimates.
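Here is a minimal sketch of this recipe as a generic function; `estimator` can be any function of a sample (names are illustrative):

bootCI <- function(y, estimator = mean, B = 2000, alpha = .05){
  theta <- numeric(B)
  for(b in 1:B){
    yb       <- sample(y, length(y), replace=T)  # step 1: resample with replacement
    theta[b] <- estimator(yb)                    # step 2: estimate theta_b
  }                                              # step 3: repeat B times
  list(se = sd(theta),                           # step 4: sd of replicates
       ci = quantile(theta, c(alpha/2, 1 - alpha/2)))  # step 5: interior 95%
}
bootCI(rnorm(30, 100, 20))    # example: bootstrap se and CI for a sample mean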
Application
I want to estimate how cone production increases with tree size. For counts, I use a Poisson likelihood, assuming that cone production increases with the square of diameter \(x_i\),
\[ L = \prod_{i=1}^n Poi(y_i | \beta x_i^2 ) \propto \beta^{n \bar{y}} exp[-\beta n \bar{x^2} ] \] The log of this function is
\[ \log(L) = n \bar{y} \log(\beta) - \beta n \bar{x^2} \]
I get the MLE from
\[ \frac{\partial \log L}{\partial \beta} = \frac{n \bar{y}}{\beta} - n \bar{x^2} \] having the solution \(\hat{\beta} = \bar{y}/\bar{x^2}\). To find an approximate standard error I differentiate again
\[ \frac{\partial^2 \log L}{\partial \beta^2} = -\frac{n \bar{y}}{\beta^2} = -\frac{n }{\bar{y}}\left(\bar{x^2} \right)^2 \] to obtain \(se_{\hat{\beta}} = \frac{\sqrt{\bar{y}}}{\bar{x^2} \sqrt{n}}\). For the last step I substituted the MLE for \(\beta\). Here is the estimate and standard error for the cone example at ambient CO2 in R:
filename <- ('FACEtrees.txt')
data <- read.table(filename, header=T)
y <- data[,'cones']                       # cone counts
x <- data[,'diam']                        # tree diameters
w <- is.finite(x) & is.finite(y)          # complete observations
ambient <- which(data[,'trt'] == 0 & w)   # ambient CO2 trees
elevated <- which(data[,'trt'] == 1 & w)  # elevated CO2 trees
Y <- mean(y[ambient])
X2 <- mean(x[ambient]^2)
nlo <- length(ambient)
bmuA <- Y/X2                              # MLE for ambient
bseA <- sqrt(Y/nlo)/X2                    # its standard error
Y <- mean(y[elevated])
X2 <- mean(x[elevated]^2)
nhi <- length(elevated)
bmuE <- Y/X2
bseE <- sqrt(Y/nhi)/X2
estimates <- signif( matrix( cbind(c(bmuA, bseA),c(bmuE, bseE)), 2, 2), 3 )
rownames(estimates) <- c('mean','se')
colnames(estimates) <- c('ambient','elevated')
knitr::kable(estimates)

|      | ambient  | elevated |
|------|----------|----------|
| mean | 3.22e-03 | 6.23e-03 |
| se   | 5.38e-05 | 7.42e-05 |
I could draw the normal densities approximated by these means and standard errors like this:
bseq <- seq(.002,.007,length=1000)
plot(bseq,dnorm(bseq,estimates['mean',1],estimates['se',1]),type='l',
xlab='beta',ylab='density')
lines(bseq,dnorm(bseq,estimates['mean',2],estimates['se',2]),col=2)

Normal densities centered on MLEs and SEs for ambient and elevated CO2.
For comparison here is a bootstrap. I begin by defining the number of estimates that will be obtained and a vector to store them:
nboot <- 2000 # no. bootstrap estimates
bvals <- matrix(0, nboot, 2) # matrix to hold estimates
colnames(bvals) <- c('ambient','elevated')

The bootstrap is done in a loop. At each iteration, a new sample of observation indices, bindex, is drawn. The corresponding values of y and x are then used to calculate the MLE for the parameter.
for(b in 1:nboot){
bindex <- sample(ambient, nlo, replace=T) # sample with replacement
Y <- mean(y[bindex])
X2 <- mean(x[bindex]^2)
bvals[b, 1] <- Y/X2
bindex <- sample(elevated, nhi, replace=T) # sample with replacement
Y <- mean(y[bindex])
X2 <- mean(x[bindex]^2)
bvals[b, 2] <- Y/X2
}

From this sample I determine the standard deviation, which is an estimate of the standard error, and quantiles, as an estimate of the confidence interval:
se <- apply(bvals, 2, sd)
hist(bvals[,1],freq=F,nclass=20,xlim=c(0,.01))
lines(density(bvals[,1]))
ci1 <- quantile(bvals[,1],c(0.025,.975))
abline(v=ci1,lty=2)
elev <- hist(bvals[,2],nclass=20, plot=F)
lines(elev$mids, elev$density, type='s')
lines(density(bvals[,2]))
ci2 <- quantile(bvals[,2],c(0.025,.975))
abline(v=ci2,lty=2)

Distributions of estimates for the bootstrap.
The two methods yield quite different confidence intervals, most likely because of the large number of zeros in the data. In the next example, I consider a classical analysis of these data that might be done with standard software, followed by one where we explicitly model the zeros as a separate process.
Bayesian credible intervals for conjugate prior-likelihood pairs
The Bayesian credible interval can be obtained directly by integrating the posterior.
This posterior is like the density I constructed with a bootstrap, and so returns us to the way we typically think about confidence intervals: that intuition corresponds to a Bayesian credible interval with a “uniform” prior. So far, I have talked about posterior distributions that have a parametric form, i.e., a standard distribution that can be described by parameters. For a \(95\%\) credible interval, I seek the upper and lower interval limits such that
\[ 0.95 = \int_{\theta_l}^{\theta_u} [\theta | \mathbf{y} ] d\theta \] or, if the parameter \(\theta\) assumes discrete values,
\[ 0.95 \leq \sum_{\theta_l \leq \theta \leq \theta_u} [\theta | \mathbf{y} ] \] This intuitive definition represents the subjective probability that the true value of \(\theta\) lies between \(\theta_l\) and \(\theta_u\). This differs from the frequentist concept of coverage, which invokes the hypothetical idea of repeating the experiment many times and determining the fraction of confidence intervals that contain \(\theta\). Of course, the values of \(\theta_l\) and \(\theta_u\) that satisfy these equalities are not unique, and I would like the interval to be as narrow as possible. The highest posterior density (HPD) interval is defined by a horizontal line drawn through the density. This line intersects the posterior density in two places, thus defining two tails. The line positioned such that the fraction of the density in the two tails is \(\alpha\) (a \(95\%\) credible interval has \(\alpha = 0.05\)) defines the HPD interval. The areas in the two tails can differ by this method. Much easier to compute is the equal-tail interval, where each tail holds the fraction \(\alpha/2\) of the density; it is used hereafter. The two are equivalent for symmetric, unimodal densities. Otherwise, the equal-tail interval is slightly wider.
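The two intervals can be compared with a short grid approximation; a sketch for a gamma density with illustrative shape and rate values:

a <- 12; b <- 3                             # illustrative shape and rate
qgamma(c(.025, .975), a, b)                 # equal-tail interval
lseq <- seq(0, 15, length=1e5)              # grid over the parameter
dens <- dgamma(lseq, a, b)
ord  <- order(dens, decreasing=T)           # highest density first
keep <- ord[cumsum(dens[ord])/sum(dens) <= .95]
range(lseq[keep])                           # approximate 95% HPD interval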
A Poisson-gamma pair
Recall the conjugate Poisson likelihood/gamma prior. The likelihood is
\[ [\mathbf{y} | \lambda] = \prod_i^n Poi(y_i| \lambda) = \prod_i^n \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = \frac{\lambda^{n \bar{y}} e^{-n \lambda}}{\prod_i^n y_i!} \] With a gamma prior distribution for \(\lambda\) this becomes
\[ \prod_i^n Poi(y_i| \lambda) gam(\lambda | a, b) \propto \lambda^{n \bar{y}} e^{-n \lambda} \times b^a \lambda^{a-1} e^{-b \lambda} \] I simplify this to
\[ \lambda^{n \bar{y} + a - 1} e^{-\lambda (n + b)} \] Note that this is a new gamma density, \(gam(\lambda | n \bar{y} + a, n + b)\).
Together, the Poisson likelihood with gamma prior distribution yield a gamma posterior having parameters that sum the contributions of prior and likelihood. For conjugate likelihood-prior pairs I can skip the integration of the denominator, because the normalizer is already known. To find the posterior, I follow these simple steps:
1. Ignore the factors that do not include the parameter of interest, writing the posterior as the product of those that do contain the parameter.
2. Collect terms.
3. Identify the parameters for the posterior based on comparison with the prior.
Standard error and credible interval for the Poisson-gamma
Because the gamma distribution has a parametric form I can easily find a standard error as the square root of the variance. The variance of the distribution \(gam(\lambda | a, b)\) is \(a/b^2\). Thus, the Bayesian standard error for the previous example is
\[
se_{\hat{\lambda}} = \sqrt{ \frac{n \bar{y} + a}{(n + b)^2} }
\] The credible interval can be found using the qgamma function in R.
n <- 4
y <- rpois(n,4.7) #simulated counts
ybar <- mean(y)
rseq <- seq(.3,1.6,length=100)*ybar
a <- b <- 1
plot(rseq,dgamma(rseq, a, b), type='l', lty=2, ylim=c(0,.6),
xlab = 'lambda', ylab = 'density')
A <- sum(y) + a    # posterior shape parameter
B <- n + b         # posterior rate parameter
lines(rseq,dgamma(rseq, A, B), type='l')
ci <- qgamma(c(.025,.975), A, B)
segments(ci[1], 0, ci[1], dgamma(ci[1],A,B))
segments(ci[2], 0, ci[2], dgamma(ci[2],A,B))

Group exercise
Summarize in a few sentences the differences between a Bayesian view of uncertainty and a frequentist view.
Why do 1.96 standard errors approximate a 95% CI and when might this approximation be poor?
What role does a P value play in the likelihood profile method for a CI? Why might a Bayesian object to it?
How does Fisher Information differ from the likelihood profile?
Why does the bootstrap conform to a frequentist perspective, and what is the key assumption it uses?
Which approach involves integrating a distribution?
Recap
Classical methods and Bayes often give similar estimates for small problems, but this is not the case for large models. The classical summaries reviewed here highlight some of the conceptual differences. Some advocates of classical philosophy combined with modern computation implement large models without prior distributions, apparently to preserve some purity of data uncorrupted by a prior. Bayesians would respond that the underlying lack of an axiomatic foundation is the larger conceptual issue. Pragmatically, the prior distribution also stabilizes large models, while bringing in information beyond what is contained in the data. As models increase in size, not only are the conceptual differences large, but results can differ substantially.
Appendix
Fisher information in one dimension
Consider the likelihood function \(L(\theta)\) for a parameter \(\theta\). There is a Taylor series for the log likelihood, a polynomial whose terms involve derivatives of increasing order,
\[ \log L(\theta) = \sum_{k=0}^{\infty} \frac{(\theta - \hat{\theta})^k}{k!} \times \frac{d^k \log L}{d \theta^k}\bigg\rvert_{\hat{\theta}} \] This works because a polynomial with enough terms can follow the shape of any function. The zero-order term is just the function itself at the MLE, \(\log L(\hat{\theta})\). The first-order term involves the slope,
\[ \log L(\theta) = \log L(\hat{\theta}) + (\theta - \hat{\theta}) \frac{d \log L(\hat{\theta})}{d \theta} + \dots \] This term is the distance from the MLE multiplied by the slope at the MLE, which is, of course, equal to zero, so the term disappears.
Including up through quadratic terms this series is
\[ \log L(\theta) = \log L(\hat{\theta}) + 0 + \frac{ (\theta - \hat{\theta})^2}{2} \frac{d^2 \log L(\hat{\theta})}{d \theta^2} + \dots \] The terms contribute increasingly less to the estimate with increasing order. To simplify notation, let
\[ I = - \frac{d^2 \log L(\theta)}{d \theta^2}\bigg\rvert_{\hat{\theta}} \] and write the series as
\[ \log L(\theta) = \log L(\hat{\theta}) - \frac{ I(\theta - \hat{\theta})^2}{2} \] or, equivalently,
\[ \log L(\theta) - \log L(\hat{\theta}) = -\frac{ I(\theta - \hat{\theta})^2}{2} \] The LHS of this expression is just the log of the normed likelihood, so
\[ R(\theta) = exp \left[ - \frac{ I(\theta - \hat{\theta})^2}{2} \right] \] This has the same shape as (is proportional to) a normal density with a mean parameter of \(\hat{\theta}\) and a variance parameter of \(I^{-1}\), i.e.,
\[ R(\theta) \propto N(\theta | \hat{\theta},I^{-1}) \]
This approximate proportionality is the basis for the standard error, calculated in two steps: 1. Determine the curvature of \(-\log L\) near the MLE. For one parameter, Fisher Information is \[ I = - \frac{d^2 \log L(\theta)}{d \theta^2}\bigg\rvert_{\hat{\theta}} \] 2. Estimate the standard error as \(se_{\hat{\theta}} = 1/ \sqrt{I}\). \(I\) is known as the ‘observed information’, because it is obtained by plugging in the MLE obtained from the data.
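This approximation can be drawn against the exact normed likelihood for the Poisson profile example; a sketch, assuming n, ybar, rseq, and R (the log normed likelihood) from that example are still in memory:

Iobs <- n/ybar                                    # observed information at the MLE
plot(rseq, exp(R), type='l',
     xlab='lambda', ylab='normed likelihood')     # exact normed likelihood
lines(rseq, exp(-Iobs*(rseq - ybar)^2/2), lty=2)  # quadratic (Gaussian) approximation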
Fisher information in multiple dimensions
This works for multiple dimensions. Take a regression model, where the likelihood is
\[ \log L \propto -\frac{1}{2 \sigma^2}(\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta) \] Here are derivatives:
\[ \begin{aligned} \frac{\partial \log L}{\partial \beta} &= \frac{1}{ \sigma^2}(\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\beta) \\ \frac{\partial^2 \log L}{\partial \beta^2} &= -\frac{1}{ \sigma^2}\mathbf{X}'\mathbf{X} \end{aligned} \] The MLE is the vector \(\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}\), with parameter covariance \(cov(\hat{\beta}) = \mathbf{I}^{-1}= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\). The standard errors are the square roots of the diagonal of this matrix.