Probability & Statistics for Linguists: Class 6(Jan 26, 2015)

Philosophical differences between Bayesian and frequentist stats

There’s an important difference between frequentist and Bayesian statistics that will continue to dog us through the course (and, probably, for the rest of our careers).

In Bayesian statistics, model parameters are random variables like any other, which represent degrees of belief that the world is such-and-such a way. As a result, it makes sense to ask about their conditional probability. One of the primary quantities of interest is Bayesian statistical inference is \(P(\Theta|\mathcal{D})\), the probability distribution over parameter values given some observed data.
In frequentist statistics, the model parameters are fixed features of the world. They are unique, unrepeatable events, like the outcome of the 2016 US Presidential election. Depending on your philosophical outlook, this either means that
- it does not make sense to ask what their probability is (unconditionally, given some data, or whatever else); or
- you can ask this question, but the answer is always 1 or 0, depending on how the world really is.

In either case, it’s either nonsensical or pointless for frequentists to ask about \(P(\Theta|\mathcal{D})\). What we do instead is construct estimators, and then evaluate their performance on real or imagined data to find out whether they do a reasonable job. This usually involves considering the distribution of data that we might get when sampling from the true distribution, and what the distribution is over estimates of the parameters is - \(P(\hat{\Theta}|\mathcal{D})\), which is a theoretically acceptable RV in frequentist statistics.

Parameter estimation

A lot of statistical inference takes the following form: assume some model, using a combination of intuition and trial-and-error. The model will have some free parameters - for instance, a normal model will be parametrized by \(\mu\) and \(\sigma\), and a binomial model by \(\pi\). Then we collect some data and use the data to estimate the values of model parameters.

Let’s think about this in terms of a generic set of \(n\) model parameters \(\Theta = \{\theta_1, ..., \theta_n\}\). How do we estimate \(\hat{\Theta}\), a set of particular estimates for \(\{\hat{\theta_1}, ..., \hat{\theta_n}\}\)?

Example (Levy 52-53): As a speaker of American English, you know (let’s suppose) that fraction \(\pi_{\mathbf{US}}\) of sentences are passivized. Now, given \(m\) sentences of Singaporean English of which \(n\) are passivized, you could estimate \(\pi_{\mathbf{Sing}}\) - the probability that a sentence is passive in Singaporean English - in either of the following two ways (and many more).

Fixed guess: \(\hat{\pi}_{\mathbf{Sing}} = \pi_{\mathbf{US}}\) - i.e., just ignore the data and assume that Singaporean and American English are the same.
Relative frequency: \(\hat{\pi}_{\mathbf{Sing}} = m/n\), with no reference to information that does not live in the sample.

We want to find estimators that lack the failings of both of these extreme cases - ideally, onces that find a good balance of the following considerations.

Consistency: We want the value of \(\hat{\Theta}\) to get ever-closer to the true values \(\Theta\) as the amount of data that you have increases.
- Are either or both of the estimators considered above consistent?
Low bias: We want the expected value of \(\hat{\Theta}\) to be as close as possible to \(\Theta\).
- What is the bias of each estimator above?
Low variance: we want an estimator whose value depends as little as possible on random fluctuations of the sampling process, all else being equal. (That ceveat is important!) Most of the time variance gets lower with larger sample sizes, but not always at the same rate.
- The variance of the relative frequency estimator is inversely proportional to the number of samples taken: \(\mathbf{Var}(\hat{\pi}) = \pi(1 - \pi)/n\).
- What is the variance of the Fixed-guess estimator? What does this tell us?

n.sims = 1000
US.parameter = .2
Singapore.parameter = .4
number.of.examples = 10
simulated.data = rbinom(n.sims, number.of.examples, prob=Singapore.parameter)
relative.frequency.estimate = sapply(simulated.data, function(s) return(s/number.of.examples))
fixed.guess.estimate = sapply(simulated.data, function(s) return(US.parameter))
plot(table(relative.frequency.estimate), xlab='Parameter', col='blue', main='2 estimators', ylim=c(0,1000))
lines(table(fixed.guess.estimate), col='red')
legend('topright', c('relative freq.', 'fixed guess'), lty=1, col=c('blue', 'red'), text.col=c('blue', 'red'))

mean(relative.frequency.estimate) - Singapore.parameter # approx 0 if unbiased

## [1] 0.0023

mean(fixed.guess.estimate) - Singapore.parameter

## [1] -0.2

var(relative.frequency.estimate)

## [1] 0.02420892

var(fixed.guess.estimate) # lower = better??

## [1] 0

Frequentist parameter estimation

The most important form of parameter estimation in frequentist statistics is the maximum likelihood estimate (MLE). In general, the likelihood of the data as parametrized by \(\Theta\) is \[ \mathbf{Lik}(\mathcal{D}; \Theta) = P_\Theta(\mathcal{D}), \] where \(P_\Theta(\mathcal{D})\) is the probability of getting the data under a probability distribution in which \(\Theta\) gives the true parameter values. Note the semicolon (read “as parametrized by”) as opposed to a bar (“conditioned on”). This is because we aren’t allowed to condition on model parameters when doing this kind of stats.

The idea behind MLE that our best guess about \(\theta\), given some data, is the value of \(\theta\) which - if we imagine that it’s the true value - maximizes the probability of observing what we did in fact observe. \[ \theta_{\mathbf{MLE}} = \text{arg max}_\theta\ \mathbf{Lik}(\mathcal{D}; \theta) \]

For binomially or multinomially distibuted data (e.g., counts), the MLE estimate is the relative frequency estimate. Here’s an example using grid approximation:

parameters = seq(from = 0, to=1, length.out = 1001)
number.of.examples = 27
number.of.successes = 17
lik = function(param) dbinom(number.of.successes, number.of.examples, param)
likelihood = sapply(parameters, FUN=lik)
relative.frequency = number.of.successes/number.of.examples
relative.frequency

## [1] 0.6296296

parameters[which(likelihood == max(likelihood))]

## [1] 0.63

plot(parameters, likelihood, main='MLE',
     pch=20, col='blue', xlab=expression(pi), ylab='data likelihood')
abline(v=relative.frequency, col='red')

Limitations of MLE

MLE can be problematic for linguistic data because so many of the events we are interested in have very low probabilities, and so very small, or zero, examples in a corpus. When \(\pi\) is very low the estimate \(\hat{\pi}\) will frequently be zero, since no examples of the phenomenon of interest happen to appear in our corpus.

n.sims = 1000
Singapore.parameter = .05
number.of.examples = 10
simulated.data = rbinom(n.sims, number.of.examples, prob=Singapore.parameter)
relative.frequency.estimate = sapply(simulated.data, function(s) return(s/number.of.examples))
summary(relative.frequency.estimate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0483  0.1000  0.3000

plot(table(relative.frequency.estimate), col='blue', ylab='frequency', main='Low-prob, n=10')

var(relative.frequency.estimate)

## [1] 0.004601712

If we had 1000 examples instead of 10, though, MLE would do much better.

n.sims = 1000
Singapore.parameter = .05
number.of.examples = 1000
simulated.data = rbinom(n.sims, number.of.examples, prob=Singapore.parameter)
relative.frequency.estimate = sapply(simulated.data, function(s) return(s/number.of.examples))
summary(relative.frequency.estimate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02600 0.04500 0.05000 0.04997 0.05500 0.07200

plot(table(relative.frequency.estimate), col='blue', ylab='frequency', main='Low-prob, n=1000')

var(relative.frequency.estimate)

## [1] 4.953758e-05

Unfortunately, a lot of linguistic data are sparse in this way.

A second problem is bias: while MLE is unbiased for many estimators, it’s biased for some of linguistic interest, such as estimating the first temporal occurrence of an expression from temporally labeled occurrences in a corpus. Basically the problem is that MLE will always assign greatest likelihood to the smallest possible interval, i.e., it will always lead you to infer that the first occurrence of an expression is the first attested occurrence. But we know that this is generally not the case! Usually, a linguistic feature is around for some time before it happens to be recorded in a form that appears in the fragmentary records that we have access to. See Levy section 4.3.3 for details.

origin = 1823
now = 2015
n.examples = 23
sim = function(i) {
    dates.of.attestation = runif(n.examples, min=origin, max=now)
    first.attestation = min(dates.of.attestation)
    return(first.attestation)   
}
simulated.first.attestation = sapply(1:1000, FUN=sim)
summary(simulated.first.attestation)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1823    1825    1829    1831    1834    1892

plot(density(simulated.first.attestation), xlab='year', ylab='density',
     col='blue', main='MLE estimate of origin of an event that actually originated in 1823')

Bayesian parameter estimation

As a linguist, here’s what you’d should be doing if you found yourself in a fieldwork situation like the Singaporean English aseize your prior knowledge of other varieties of the language to construct a prior distribution which incorporates what you know about these varieties and how closely they are related to the variety being studied. Then, use some conceptually well-motivated procedure to incorporate the data you acquired in the current investigation into your best-guess.

This is how Bayesian inference would work here. You set up a prior distribution on model parameters, which should generally reflect relevant background knowledge. Then you make inferences by conditionalizing the model on the data that you have access to. If you care about the value of \(\pi\) for theoretical reasons - which you may or may not - you could then estimate \(\hat{\pi}\) as \(\mathbb{E}_P(\pi|\mathcal{D})\) (posterior mean) or as \(\text{arg max}_\pi\ P(\pi|\mathcal{D})\) (maximum a posteriori), among other options.

Reminder: interpretation of conditional probability.

Imagine you have a distribution \(P(\cdot)\), and then you learn that \(A\) is definitely true. Then, you can discard from consideration all possible outcomes/worlds in which \(A\) is false. If this is all that you have learned, then - for any two events \(B\) and \(B'\) which are contained in \(A\) - the ratio \(P(B)/P(B')\) should remain the same, since you haven’t learned anything that would justify modifying their relative probabilities. For events that partially overlap \(A\), you have to remove their not-\(A\) parts and consider only the remaining, uneliminated possibilities. Finally, given that probabilities must sum to 1, you have to ‘re-normalize’ the probabilities of all events by dividing by \(P(A)\), since you (normally will have) discarded some of the original probability mass when you eliminated not-\(A\). So, the conditional probability of any \(B\), given \(A\), is \[ P(B|A) = \frac{P(A\ \&\ B)}{P(A)} \]

Bayes’ rule is just another way to calculate conditional probabilities, which is more useful if the information you have happens to come in the form of a mixture of conditional and unconditional probabilities (as it frequently will). \[ \begin{aligned} P(B|A) =& \frac{P(A|B) \times P(B)}{P(A)}\\ =& \frac{P(A|B) \times P(B)}{\sum_{B'} P(A \ \& \ B')}\\ =& \frac{P(A|B) \times P(B)}{\sum_{B'} P(A|B') \times P(B')}, \end{aligned} \] where the \(B'\) values range over all of the alternatives to \(B\) in some partition of \(\Omega\). This is pretty abstract, but it’s clear if you take the partition \(\{B, \text{not-}B\}\): then we have \[ \begin{aligned} P(B|A) =& \frac{P(A|B) \times P(B)}{P(A)}\\ =& \frac{P(A|B) \times P(B)}{P(A|B) \times P(B) + P(A|\text{not-}B) \times P(\text{not-}B)}, \end{aligned} \] Call this ‘Inference by inversion’: ‘The probability that \(B\) is the case, given \(A\), is the probability that you would have seen \(A\) if \(B\) were the case, multiplied by the probability that \(B\) is the case; then you normalize the whole thing by finding the probabilities of alternative explanations for \(A\)’s being the case, and divide by their sum.’

Example. In the disease-rate estimation example studied in class a few weeks ago, we can use Bayes’ rule to invert the equation, deriving a posterior probability for various rates of the disease from the observed number of infected individuals \(I\) in your sample of size \(N\). Bayes’ rule tells us that \[ \begin{aligned} P(\pi|D) =& \frac{P(D|\pi) \times P(\pi)}{P(D)}\\ =& \frac{P(D|\pi) \times P(\pi)}{\sum_{\pi'} P(D|\pi') \times P(\pi')} \end{aligned} \] \(P(\pi|D)\), the posterior probability of rate \(\pi\) given data \(D\), is controlled by two main factors: the prior probability of \(\pi\), and the probability that you would have seen data \(D\) if \(\pi\) were the true rate.

This is the essence of Bayesian inference. We can do this inference in numerous ways.

Exact mathematical solution (only possible with some kinds of priors and likelihoods)
Grid approximation (OK for low-dimensional distributions, and if the distribution is not too “wiggly”)
Monte Carlo approximation (rejection sampling, MCMC, Gibbs sampling, particle filters, …)

Let’s start by using grid approximation, pretending that the only possible infection rates are in \(\Pi = \{0, .01, .02, ..., .99, 1\}\). Then Bayes’ rule tells us that we can calculate the posterior probability of each parameter \(\pi\), given the data \(D\), by finding the non-normalized probability \[ P(\pi|D) \propto P(D|\pi) \times P(\pi) \] and then dividing all of the values so derived by the sum \[ P(D) = \sum_{\pi' \in \Pi} P(D|\pi') \times P(\pi') \] To do this we have to assign a prior probability to each \(\pi \in \mathbf{H}\); let’s make them all equal, i.e., for all \(\pi \in \Pi\): \(P(\pi) = 1/|\Pi| = 1/101 \approx .0099\).

Fixing \(D\), we can calculate these values using R. Suppose we sampled 30 individuals and found that 5 of them had the disease:

parameters = seq(from=0, to=1, by=.01) # our grid
total.sampled = 30
num.infected = 5
prior = rep(1/length(parameters), length(parameters)) # uniform prior
prior

##   [1] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##   [7] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [13] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [19] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [25] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [31] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [37] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [43] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [49] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [55] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [61] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [67] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [73] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [79] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [85] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [91] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099
##  [97] 0.00990099 0.00990099 0.00990099 0.00990099 0.00990099

binomial = function(total, diseased, parameter) {
  not.diseased = total - diseased
  return(choose(total, diseased) * parameter^diseased * (1-parameter)^(not.diseased))
}
likelihood = sapply(parameters, function(candidate.prop) {
  return(binomial(total.sampled, num.infected, parameter=candidate.prop))
})
likelihood

##   [1] 0.000000e+00 1.108442e-05 2.751915e-04 1.617085e-03 5.259130e-03
##   [6] 1.235302e-02 2.359314e-02 3.902987e-02 5.807360e-02 7.963069e-02
##  [11] 1.023048e-01 1.246082e-01 1.451464e-01 1.627538e-01 1.765772e-01
##  [16] 1.861069e-01 1.911680e-01 1.918802e-01 1.886001e-01 1.818556e-01
##  [21] 1.722792e-01 1.605487e-01 1.473369e-01 1.332725e-01 1.189130e-01
##  [26] 1.047285e-01 9.109465e-02 7.829310e-02 6.651747e-02 5.588328e-02
##  [31] 4.643981e-02 3.818290e-02 3.106747e-02 2.501917e-02 1.994464e-02
##  [36] 1.574009e-02 1.229830e-02 9.513833e-03 7.286909e-03 5.525888e-03
##  [41] 4.148722e-03 3.083562e-03 2.268712e-03 1.652140e-03 1.190686e-03
##  [46] 8.491080e-04 5.990545e-04 4.180413e-04 2.884848e-04 1.968194e-04
##  [51] 1.327191e-04 8.842723e-05 5.819435e-05 3.781455e-05 2.425201e-05
##  [56] 1.534474e-05 9.573929e-06 5.887336e-06 3.566197e-06 2.126619e-06
##  [61] 1.247640e-06 7.196105e-07 4.077383e-07 2.267666e-07 1.236789e-07
##  [66] 6.608472e-08 3.455618e-08 1.766264e-08 8.813069e-09 4.286679e-09
##  [71] 2.029340e-09 9.334151e-10 4.163383e-10 1.796973e-10 7.487304e-11
##  [76] 3.003583e-11 1.156597e-11 4.260717e-12 1.495799e-12 4.982542e-13
##  [81] 1.566870e-13 4.624857e-14 1.272668e-14 3.239266e-15 7.554907e-16
##  [86] 1.596648e-16 3.016660e-17 5.012025e-18 7.174269e-19 8.621847e-20
##  [91] 8.414837e-21 6.384147e-22 3.548313e-23 1.329532e-24 2.973402e-26
##  [96] 3.286255e-28 1.308245e-30 1.036867e-33 4.322285e-38 1.355218e-45
## [101] 0.000000e+00

non.normalized.posterior = prior * likelihood # pointwise vector multiplication
posterior = non.normalized.posterior/sum(non.normalized.posterior) 
  # justified ONLY if parameters is exhaustive
par(mfrow=c(1,2))
plot(parameters, non.normalized.posterior, pch=20, xlab='parameter', ylab='posterior probability', main='Non-normalized posterior')
plot(parameters, posterior, pch=20, xlab='parameter', ylab='posterior probability', main='Posterior')
points(parameters, prior, pch=20, col='red')

The normalized and non-normalized posterior differ only in scale: the normalized values are divided by \(P(D) = \sum_{\pi \in \Pi} P(D|\pi) \times P(\pi)\). Here this value is given by 1/sum(non.normalized.posterior), which is equal to 31.3100005.

With 5 of 30 individuals infected, we find that the posterior (black dots) is shifted substantially from our previous flat prior (red) to favor fairly low, but non-zero, infection rates. This is because low parameter values do the best job of explaining the data that we saw. 0 and 1 are impossible, since there are both infected and non-infected individuals. Parameter .9, while possible, is vanishingly unlikely to be right because if it were, we would almost certainly have seen more than 5 infected in our random sample. Parameters .1 and .2, on the other hand, make the probability of the data much higher.

binomial(total.sampled, num.infected, parameter=0)

## [1] 0

binomial(total.sampled, num.infected, parameter=1)

## [1] 0

binomial(total.sampled, num.infected, parameter=.9)

## [1] 8.414837e-21

binomial(total.sampled, num.infected, parameter=.2)

## [1] 0.1722792

binomial(total.sampled, num.infected, parameter=.1)

## [1] 0.1023048

On the other hand, if we had observed 0 infected, we would assume that the true rate is low, but not necessarily zero. (Intuitively, why?)

parameters = seq(from=0, to=1, by=.01) # our grid
total.sampled = 30
num.infected = 0
prior = rep(1/length(parameters), length(parameters)) # uniform prior
binomial = function(total, diseased, parameter) {
  return(choose(total, diseased) * parameter^diseased * (1-parameter)^(total-diseased))
}
likelihood = sapply(parameters, function(candidate.prop) {
  return(binomial(total.sampled, num.infected, parameter=candidate.prop))
})
non.normalized.prob = prior * likelihood # pointwise vector multiplication
posterior = non.normalized.prob/sum(non.normalized.prob)
plot(parameters, posterior, pch=20, xlab='proportion', ylab='posterior probability', main='')
points(parameters, prior, pch=20, col='red')

If we had examined 200 people and found zero infections, though, we should be more confident that the true rate is at or close to 0. This is because, as the number of people sampled increases, it becomes less likely that we have by chance failed to pick any up in our sample.

parameters = seq(from=0, to=1, by=.01) # our grid
total.sampled = 200
num.infected = 0
prior = rep(1/length(parameters), length(parameters)) # uniform prior
binomial = function(total, diseased, parameter) {
  return(choose(total, diseased) * parameter^diseased * (1-parameter)^(total-diseased))
}
likelihood = sapply(parameters, function(candidate.prop) {
  return(binomial(total.sampled, num.infected, parameter=candidate.prop))
})
non.normalized.prob = prior * likelihood # pointwise vector multiplication
posterior = non.normalized.prob/sum(non.normalized.prob)
plot(parameters, posterior, pch=20, xlab='proportion', ylab='posterior probability', main='')
points(parameters, prior, pch=20, col='red')

OK, now we have a posterior distribution on model parameters, given some data. What do we do with it? Here are some options.

Maximum a posteriori: \(\hat{\Theta} = \text{arg max}_\theta\ P(\Theta|\mathcal{D})\).
Posterior mean: \(\hat{\Theta} = \mathbb{E}_P(\Theta|\mathcal{D})\).
Bayesian density estimation: if we want to know the probability that a parameter falls into some range, say \([a,b]\), we can use various exact or approximate inference methods to approximate the posterior probability \(P(\pi \in [a,b]|\mathcal{D})\).
Maintain full uncertainty: This is the preferred Bayesian approach, unless you have some specific reason to need a point estimate. That is, any of the previous methods are throwing away information by compressing a probability distribution into a single point, and this will usually negatively affect your model’s ability to predict future data. Frequently we want to predict future data, and then the best thing to do is to compute \[ \begin{aligned} P(\mathbf{second\ batch}\ |\ \mathbf{first\ batch}) &= \sum_\Theta\ P(\Theta, \mathbf{second\ batch}\ |\ \mathbf{first\ batch})\\ &= \sum_\Theta\ P(\mathbf{second\ batch}\ |\ \Theta, \mathbf{first\ batch}) \times P(\Theta\ |\ \mathbf{first\ batch})\\ &= \sum_\Theta P(\mathbf{second\ batch}\ |\ \Theta) \times P(\Theta\ |\ \mathbf{first\ batch}), \end{aligned} \] where the second line follows from Bayes’ rule, and the third because the second batch is conditionally independent of the first given the value of the parameter. This method allows us to make full use of the uncertainty in our estimate of \(\Theta\), instead of throwing out information by using a point estimate of \(\Theta\).