Suppose you see someone coughing, and three hypotheses \(H\) occur to you: \(h_1\) is “they have a cold”, \(h_2\) is “they have a stomachache”, and \(h_3\) is “they have lung cancer”. Let your estimates of the prior probabilities of these events - i.e., your degrees of belief about this person’s state of health before you ran into them - be \[ \begin{aligned} P(\mathbf{cold}) &= .1\\ P(\mathbf{stomachache}) &= .1\\ P(\mathbf{lung\ cancer}) &= .001 \end{aligned} \] Furthermore, let your estimates of the probability that someone would be coughing if they had each of these conditions be \[ \begin{aligned} P(\mathbf{coughing}\ |\ \mathbf{cold}) &= .9\\ P(\mathbf{coughing}\ |\ \mathbf{stomachache}) &= .01\\ P(\mathbf{coughing}\ |\ \mathbf{lung\ cancer}) &= 1 \end{aligned} \] That is, colds almost always cause coughing, lung cancer always does, and stomachaches almost never do. Then we can use Bayes’ rule to calculate the non-normalized posterior probability of each hypothesis. \[ \begin{aligned} P(\mathbf{cold}\ |\ \mathbf{coughing}) &= \frac{P(\mathbf{coughing}\ |\ \mathbf{cold}) \times P(\mathbf{cold})}{P(\mathbf{coughing})}\\ &= \frac{.9 \times .1}{P(\mathbf{coughing})} = \frac{.09}{P(\mathbf{coughing})}\\ P(\mathbf{stomachache}\ |\ \mathbf{coughing}) &= \frac{P(\mathbf{coughing}\ |\ \mathbf{stomachache}) \times P(\mathbf{stomachache})}{P(\mathbf{coughing})}\\ &= \frac{.01 \times .1}{P(\mathbf{coughing})} = \frac{.001}{P(\mathbf{coughing})}\\ P(\mathbf{lung\ cancer}\ |\ \mathbf{coughing}) &= \frac{P(\mathbf{coughing}\ |\ \mathbf{lung\ cancer}) \times P(\mathbf{lung\ cancer})}{P(\mathbf{coughing})}\\ &= \frac{1 \times .001}{P(\mathbf{coughing})} = \frac{.001}{P(\mathbf{coughing})} \end{aligned} \]
If our three hypotheses were an exhaustive and mutually exclusive list of possible causes, we could use them to calculate the normalizing constant.
\[ \begin{aligned} P(\mathbf{coughing}) &= \sum_{h \in H} P(\mathbf{coughing}, h)\\ &= \sum_{h \in H} P(\mathbf{coughing}|h) \times P(h)\\ &= .9 \times .1 + .01 \times .1 + 1 \times .001\\ &= .09 + .001 + .001\\ &= .092 \end{aligned} \]
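In R, the same calculations look like this (a minimal sketch, using only the numbers assumed above):
prior      = c(cold=.1, stomachache=.1, lung.cancer=.001)
likelihood = c(cold=.9, stomachache=.01, lung.cancer=1)   # P(coughing | h)
non.normalized.posterior = likelihood * prior             # .09, .001, .001
non.normalized.posterior
# the would-be normalizing constant P(coughing) = .092 --
# using it is only justified if the hypotheses are exhaustive and mutually exclusive
sum(non.normalized.posterior)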
However, in this context it is clearly not true that these hypotheses are either exhaustive or mutually exclusive! Still, we can calculate useful statistics from the non-normalized posteriors: the ratio of the posterior probabilities of any two hypotheses, for example, doesn’t depend on this normalizing constant. This is because the denominator is the same for every hypothesis, so it drops out of the ratio.
\[ \begin{aligned} \frac{P(h_i|\mathbf{coughing})}{P(h_j|\mathbf{coughing})} &= \frac{\frac{P(\mathbf{coughing}|h_i) \times P(h_i)}{P(\mathbf{coughing})}}{\frac{P(\mathbf{coughing}|h_j) \times P(h_j)}{P(\mathbf{coughing})}}\\ &= \frac{P(\mathbf{coughing}|h_i) \times P(h_i)}{P(\mathbf{coughing})} \times \frac{P(\mathbf{coughing})}{P(\mathbf{coughing}|h_j) \times P(h_j)}\\ &= \frac{P(\mathbf{coughing}|h_i) \times P(h_i)}{P(\mathbf{coughing}|h_j) \times P(h_j)} \end{aligned} \]
For \(\mathbf{cold}\) vs. \(\mathbf{stomachache}\), for example, this ratio is \(.09/.001 = 90\) - cold is 90 times more likely. This result is driven entirely by the difference in likelihoods: the prior probabilities were the same, but the conditional probability of the symptom given the disease is 90 times greater for cold.
On the other hand, the ratio of posteriors for cold vs. lung cancer is also 90. This result is driven almost entirely by the difference in priors: even though the symptom is slightly more likely under the lung cancer hypothesis, it’s just a lot more likely in the first place that the person has a cold.
A conceptually useful (though, again, mathematically trivial) point is that we can factorize the ratio of posteriors into two terms: the prior odds and the Bayes factor, where the latter is just a fancy name for the ratio of likelihood terms. \[ \begin{aligned} \frac{P(h_i|\mathbf{coughing})}{P(h_j|\mathbf{coughing})} &= \frac{P(\mathbf{coughing}|h_i) \times P(h_i)}{P(\mathbf{coughing}|h_j) \times P(h_j)}\\ &= \underbrace{\frac{P(\mathbf{coughing}|h_i)}{P(\mathbf{coughing}|h_j)}}_{\mathbf{Bayes\ factor}} \times \underbrace{\frac{P(h_i)}{P(h_j)}}_{\mathbf{prior\ odds}} \end{aligned} \] The prior odds tell us how strongly we favored \(h_i\) over \(h_j\) before observing the symptom, and the Bayes factor tells us how much the observation itself favors \(h_i\) over \(h_j\). For instance, in the competition between stomachache and cold, we have \[ \begin{aligned} \frac{P(\mathbf{cold}\ |\ \mathbf{coughing})}{P(\mathbf{stomachache}\ |\ \mathbf{coughing})} &= \frac{P(\mathbf{coughing}\ |\ \mathbf{cold})}{P(\mathbf{coughing}\ |\ \mathbf{stomachache})} \times \frac{P(\mathbf{cold})}{P(\mathbf{stomachache})}\\ &= \frac{.9}{.01} \times \frac{.1}{.1}\\ &= \underbrace{90}_{\mathbf{Bayes\ factor}} \times \underbrace{1}_{\mathbf{prior\ odds}} \end{aligned} \] This may help to clarify the sense in which the inference that cold is more likely than stomachache is “driven entirely by the likelihood”, as opposed to the inference that cold is more likely than lung cancer, which is driven almost entirely by the prior.
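The same decomposition is easy to compute directly (again, just a sketch with the numbers assumed above):
bayes.factor = .9 / .01   # P(coughing | cold) / P(coughing | stomachache)
prior.odds   = .1 / .1    # P(cold) / P(stomachache)
bayes.factor * prior.odds # 90: the inference is driven by the likelihood

bayes.factor = .9 / 1     # P(coughing | cold) / P(coughing | lung cancer)
prior.odds   = .1 / .001  # P(cold) / P(lung cancer)
bayes.factor * prior.odds # also 90, but driven almost entirely by the prior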
Bayes factors are commonly used in Bayesian hypothesis testing. Generally, a large Bayes factor indicates strong support for the hypothesis in the numerator. How large is “large” is, of course, something that requires deep thought and domain knowledge, not something that we should ask a statistician to decide.
In the disease-rate estimation example studied in class a few weeks ago, we can use Bayes’ rule to invert the likelihood, deriving a posterior probability over possible infection rates \(\pi\) from the observed data \(D\): the number of infected individuals \(I\) in a sample of size \(N\). Bayes’ rule tells us that \[ \begin{aligned} P(\pi|D) &= \frac{P(D|\pi) \times P(\pi)}{P(D)}\\ &= \frac{P(D|\pi) \times P(\pi)}{\int_0^1\ P(D|\pi') \times P(\pi')\, d\pi'} \end{aligned} \]
Let’s do inference using grid approximation, pretending that the only possible infection rates are in \(\{0, .01, .02, ..., .99, 1\}\). Then we can substitute a discrete form of Bayes’ rule \[ \begin{aligned} P(\pi|D) &= \frac{P(D|\pi) \times P(\pi)}{P(D)}\\ &= \frac{P(D|\pi) \times P(\pi)}{\sum_{\pi' \in \{0, .01, ..., .99, 1\}}\ P(D|\pi') \times P(\pi')} \end{aligned} \] Now we can calculate the posterior probability of each parameter value \(\pi\), given the data \(D\), by finding the non-normalized probability \[ P(\pi|D) \propto P(D|\pi) \times P(\pi) \] and then dividing all of the values so derived by the sum \[ P(D) = \sum_{\pi' \in \{0, .01, ..., .99, 1\}} P(D|\pi') \times P(\pi') \] To do this we have to assign a prior probability to each \(\pi' \in \{0, .01, ..., .99, 1\}\); let’s make them all equal, i.e., for all \(\pi'\): \(P(\pi') = 1/|\{0, .01, ..., .99, 1\}| = 1/101 \approx .0099\).
Fixing \(D\), we can calculate these values using R. Suppose we sampled 30 individuals and found that 5 of them had the disease:
parameters = seq(from=0, to=1, by=.01) # our grid
total.sampled = 30
num.infected = 5
prior = rep(1/length(parameters), length(parameters)) # uniform prior
prior
## [1] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [7] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [13] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [19] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [25] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [31] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [37] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [43] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [49] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [55] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [61] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [67] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [73] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [79] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [85] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [91] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
## [97] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
likelihood = sapply(parameters, function(p) dbinom(num.infected, size=total.sampled, prob=p))
likelihood
## [1] 0.000000e+00 1.108442e-05 2.751915e-04 1.617085e-03 5.259130e-03
## [6] 1.235302e-02 2.359314e-02 3.902987e-02 5.807360e-02 7.963069e-02
## [11] 1.023048e-01 1.246082e-01 1.451464e-01 1.627538e-01 1.765772e-01
## [16] 1.861069e-01 1.911680e-01 1.918802e-01 1.886001e-01 1.818556e-01
## [21] 1.722792e-01 1.605487e-01 1.473369e-01 1.332725e-01 1.189130e-01
## [26] 1.047285e-01 9.109465e-02 7.829310e-02 6.651747e-02 5.588328e-02
## [31] 4.643981e-02 3.818290e-02 3.106747e-02 2.501917e-02 1.994464e-02
## [36] 1.574009e-02 1.229830e-02 9.513833e-03 7.286909e-03 5.525888e-03
## [41] 4.148722e-03 3.083562e-03 2.268712e-03 1.652140e-03 1.190686e-03
## [46] 8.491080e-04 5.990545e-04 4.180413e-04 2.884848e-04 1.968194e-04
## [51] 1.327191e-04 8.842723e-05 5.819435e-05 3.781455e-05 2.425201e-05
## [56] 1.534474e-05 9.573929e-06 5.887336e-06 3.566197e-06 2.126619e-06
## [61] 1.247640e-06 7.196105e-07 4.077383e-07 2.267666e-07 1.236789e-07
## [66] 6.608472e-08 3.455618e-08 1.766264e-08 8.813069e-09 4.286679e-09
## [71] 2.029340e-09 9.334151e-10 4.163383e-10 1.796973e-10 7.487304e-11
## [76] 3.003583e-11 1.156597e-11 4.260717e-12 1.495799e-12 4.982542e-13
## [81] 1.566870e-13 4.624857e-14 1.272668e-14 3.239266e-15 7.554907e-16
## [86] 1.596648e-16 3.016660e-17 5.012025e-18 7.174269e-19 8.621847e-20
## [91] 8.414837e-21 6.384147e-22 3.548313e-23 1.329532e-24 2.973402e-26
## [96] 3.286255e-28 1.308245e-30 1.036867e-33 4.322285e-38 1.355218e-45
## [101] 0.000000e+00
non.normalized.posterior = prior * likelihood # pointwise vector multiplication
non.normalized.posterior
## [1] 0.000000e+00 1.097467e-07 2.724668e-06 1.601074e-05 5.207060e-05
## [6] 1.223072e-04 2.335954e-04 3.864343e-04 5.749862e-04 7.884227e-04
## [11] 1.012919e-03 1.233745e-03 1.437093e-03 1.611424e-03 1.748289e-03
## [16] 1.842643e-03 1.892753e-03 1.899804e-03 1.867328e-03 1.800550e-03
## [21] 1.705734e-03 1.589592e-03 1.458782e-03 1.319529e-03 1.177356e-03
## [26] 1.036916e-03 9.019273e-04 7.751792e-04 6.585888e-04 5.532998e-04
## [31] 4.598001e-04 3.780485e-04 3.075987e-04 2.477146e-04 1.974716e-04
## [36] 1.558425e-04 1.217653e-04 9.419636e-05 7.214762e-05 5.471176e-05
## [41] 4.107646e-05 3.053032e-05 2.246250e-05 1.635782e-05 1.178897e-05
## [46] 8.407010e-06 5.931232e-06 4.139023e-06 2.856285e-06 1.948707e-06
## [51] 1.314050e-06 8.755171e-07 5.761817e-07 3.744015e-07 2.401189e-07
## [56] 1.519281e-07 9.479138e-08 5.829045e-08 3.530889e-08 2.105563e-08
## [61] 1.235287e-08 7.124857e-09 4.037013e-09 2.245213e-09 1.224543e-09
## [66] 6.543041e-10 3.421404e-10 1.748776e-10 8.725811e-11 4.244236e-11
## [71] 2.009247e-11 9.241734e-12 4.122162e-12 1.779181e-12 7.413172e-13
## [76] 2.973845e-13 1.145146e-13 4.218532e-14 1.480989e-14 4.933210e-15
## [81] 1.551356e-15 4.579066e-16 1.260067e-16 3.207194e-17 7.480106e-18
## [86] 1.580840e-18 2.986792e-19 4.962401e-20 7.103237e-21 8.536482e-22
## [91] 8.331522e-23 6.320937e-24 3.513182e-25 1.316369e-26 2.943963e-28
## [96] 3.253718e-30 1.295292e-32 1.026601e-35 4.279490e-40 1.341800e-47
## [101] 0.000000e+00
posterior = non.normalized.posterior/sum(non.normalized.posterior)
# justified ONLY if hypotheses are exhaustive and mutually exclusive
par(mfrow=c(1,2))
plot(parameters, non.normalized.posterior, pch=20, xlab='parameter', ylab='posterior probability', main='Non-normalized posterior')
plot(parameters, posterior, pch=20, xlab='parameter', ylab='posterior probability', main='Posterior')
points(parameters, prior, pch=20, col='red')
The normalized and non-normalized posteriors differ only in scale: the normalized values are just divided by a constant.
With 5 of 30 individuals infected, we find that the posterior (black dots) is shifted substantially from our flat prior (red) to favor fairly low, but non-zero, infection rates. This is because low parameter values do the best job of explaining the data that we saw. Rates of 0 and 1 are impossible, since the sample contains both infected and non-infected individuals. A rate of .9, while possible, is vanishingly unlikely to be right because if it were, we would almost certainly have seen many more than 5 infected in our random sample. Rates of .1 and .2, on the other hand, make the probability of the data much higher.
dbinom(num.infected, total.sampled, prob=0)
## [1] 0
dbinom(num.infected, total.sampled, prob=.9)
## [1] 8.414837e-21
dbinom(num.infected, total.sampled, prob=.2)
## [1] 0.1722792
dbinom(num.infected, total.sampled, prob=1)
## [1] 0
On the other hand, if we had observed 0 infected, we would assume that the true rate is low, but not necessarily zero. This is because with a low but non-zero rate it’s quite plausible that a small sample simply happens not to contain any of the people with the disease.
parameters = seq(from=0, to=1, by=.01) # our grid
total.sampled = 30
num.infected = 0
prior = rep(1/length(parameters), length(parameters)) # uniform prior
likelihood = sapply(parameters, function(p) dbinom(num.infected, total.sampled, prob=p))
non.normalized.prob = prior * likelihood # pointwise vector multiplication
posterior = non.normalized.prob/sum(non.normalized.prob)
plot(parameters, posterior, pch=20, xlab='proportion', ylab='posterior probability', main='')
points(parameters, prior, pch=20, col='red')
If we had examined 200 people and found zero infections, though, we should be more confident that the true rate is at or close to 0. This is because, as the number of people sampled increases, it becomes less and less likely that we would by chance have failed to include any infected individuals in our sample.
parameters = seq(from=0, to=1, by=.01) # our grid
total.sampled = 200
num.infected = 0
prior = rep(1/length(parameters), length(parameters)) # uniform prior
# binomial pmf written out by hand; equivalent to dbinom(diseased, total, prob=parameter)
binomial = function(total, diseased, parameter) {
  return(choose(total, diseased) * parameter^diseased * (1-parameter)^(total-diseased))
}
likelihood = sapply(parameters, function(candidate.prop) {
  return(binomial(total.sampled, num.infected, parameter=candidate.prop))
})
non.normalized.prob = prior * likelihood # pointwise vector multiplication
posterior = non.normalized.prob/sum(non.normalized.prob)
plot(parameters, posterior, pch=20, xlab='proportion', ylab='posterior probability', main='')
points(parameters, prior, pch=20, col='red')
OK, now we have a posterior distribution on model parameters, given some data. What does this give us in terms of parameter estimates? Here are some options.
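For instance, two common point summaries of the posterior are its mode (the MAP estimate) and its mean. Here is a minimal sketch, recomputing the grid posterior from the 5-of-30 example above:
parameters = seq(from=0, to=1, by=.01)
prior      = rep(1/length(parameters), length(parameters))
likelihood = sapply(parameters, function(p) dbinom(5, size=30, prob=p))
posterior  = prior * likelihood / sum(prior * likelihood)
parameters[which.max(posterior)]  # posterior mode (MAP estimate)
sum(parameters * posterior)       # posterior mean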
A lot of work in Bayesian statistics makes use of conjugate priors for specific probability distributions. Levy section 4.4.2-4.4.3 goes through an example using the beta distribution, which is conjugate for the binomial distribution. The reason it is called “conjugate” is that the posterior belongs to the same family as the prior and has a simple closed form, even though Bayesian posteriors generally cannot be calculated analytically. For example, if we have a Beta(1,1) prior - which is uniform on \([0,1]\) - and then observe 6 successes out of 10 trials, our posterior will be Beta(7,5): we just add the numbers of successes and failures to the prior’s parameters. On the other hand, we might choose to enter this inference problem by encoding the prior belief, based on our experience as linguists and English speakers, that most sentences are not passives. We could incorporate this information by setting the prior to a different value - say, Beta(5,20).
curve(dbeta(x, 5, 20), col='red', xlab='Parameter', ylab='density')
curve(dbeta(x, 1, 1), col='blue', add=T)
curve(dbeta(x, 5, 7), col='purple', add=T)
curve(dbeta(x, 20, 5), col='green', add=T)
legend(.4, 5, c('Beta(1,1)', 'Beta(5,20)', 'Beta(20,5)', 'Beta(5,7)'), lty=1, col=c('blue', 'red', 'green','purple'), text.col=c('blue', 'red', 'green','purple'))
A Beta(5,20) distribution is also what we’d get if we started with a Beta(1,1) prior and then observed 4 passive and 19 non-passive sentences. In general, when they’re serving as priors for the binomial, we can think of the parameters of the Beta distribution as “pseudo-counts”: choosing a Beta(5,20) prior is a way of encoding that our prior information is as if we had started with an uninformative prior and then observed 4 passives and 19 non-passives.
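As a quick check on the conjugacy claim (a sketch, reusing the 5-of-30 disease example from the grid approximation above): with a uniform Beta(1,1) prior and 5 infected out of 30, the conjugate posterior is Beta(1+5, 1+25) = Beta(6,26), and the grid posterior should closely approximate its density.
parameters = seq(from=0, to=1, by=.01)
prior      = rep(1/length(parameters), length(parameters))
likelihood = sapply(parameters, function(p) dbinom(5, size=30, prob=p))
grid.posterior = prior * likelihood / sum(prior * likelihood)
plot(parameters, grid.posterior / .01, pch=20,  # divide by the grid spacing to put the grid posterior on the density scale
     xlab='parameter', ylab='density')
curve(dbeta(x, 6, 26), col='red', add=TRUE)     # analytic Beta(6,26) posterior density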
In Bayesian stats, a 95% confidence interval (often called a credible interval) for a parameter is an interval \([a,b]\) such that the probability that the parameter lies in the interval is 95%: \(P(\theta \in [a,b]) = .95\). A common (but not the only) way to calculate this is to find a symmetric (equal-tailed) 95% CI. That is, we just shave off 2.5% from each tail, and infer with 95% confidence that (given the prior, model, and evidence) the parameter lies in the remainder.
For example, suppose we have an uninformative Beta(1,1) prior and then observe 4 passive and 6 non-passive sentences. This gives us a Beta(5,7) posterior, and we can calculate the symmetric 95% CI as follows:
qbeta(p=c(.025, .975), 5, 7)
## [1] 0.1674881 0.6920953
curve(dbeta(x, 5, 7), main="", col='blue')
abline(v=qbeta(.025, 5, 7), col='red')
abline(v=qbeta(.975, 5, 7), col='red')
We can infer with 95% confidence that the true proportion lies between \(.167\) and \(.692\). That’s not that great, but hey, we didn’t have much data to go on. If we had observed 40 passive and 60 non-passive, or 400 passive and 600 non-passive, this interval would be much smaller - even though the observed rate is the same in the data.
qbeta(p=c(.025, .975), 41, 61)
## [1] 0.3093085 0.4982559
curve(dbeta(x, 41, 61), main="", col='blue')
abline(v=qbeta(.025, 41, 61), col='red')
abline(v=qbeta(.975, 41, 61), col='red')
qbeta(p=c(.025, .975), 401, 601)
## [1] 0.3700744 0.4307022
curve(dbeta(x, 401, 601), main="", col='blue')
abline(v=qbeta(.025, 401, 601), col='red')
abline(v=qbeta(.975, 401, 601), col='red')
We might also do this using rejection sampling: again with a uniform prior,
n.observations = 10
passives.in.observation = 4
accepted.parameters = c()
while (length(accepted.parameters) < 500) {
  # make sure to get 500 samples from the conditional distribution
  this.sim.param = runif(1, 0, 1)   # propose a parameter from the uniform prior
  passives.in.simulated.data = rbinom(n=1, size=n.observations, prob=this.sim.param)
  if (passives.in.simulated.data == passives.in.observation) {
    # keep the proposal only if the simulated data match the observed data
    accepted.parameters = c(accepted.parameters, this.sim.param)
  }
}
accepted.parameters[1:20]
## [1] 0.3721239 0.3323947 0.3253522 0.6427955 0.5076418 0.3354875 0.3376153
## [8] 0.3082923 0.4527201 0.5783539 0.4985145 0.1314165 0.4615518 0.7621515
## [15] 0.2828840 0.4639624 0.4202159 0.3483021 0.1850700 0.3080524
quantile(accepted.parameters, c(.025, .975))
## 2.5% 97.5%
## 0.1682732 0.7226147
This is very close to the 95% CI that we got above by calling qbeta(p=c(.025, .975), 5, 7): 0.1674881, 0.6920953.
plot(density(accepted.parameters), col='blue', main="")
abline(v=quantile(accepted.parameters, .025), col='red')
abline(v=quantile(accepted.parameters, .975), col='red')
Also note how close the posterior derived by rejection sampling is to the one we got from the analytic solution with a Beta(1,1) prior:
plot(density(accepted.parameters), col='blue', main="")
curve(dbeta(x, 5, 7), col='red', lty=4, add=T)
legend('topright', c('Posterior=Beta(5,7)', 'Approx. by rejection sampling'),
       lty=c(4,1), col=c('red', 'blue'), text.col=c('red','blue'))