R is a functional programming language that is specialized for statistical data analysis and graphical exploration of data. It’s not the best-designed language ever, and it’s frequently slow, but it’s incredibly useful nonetheless. It also has an enormous number of packages that you can install and use for a variety of purposes. For example, I’m writing these notes in R (more specifically, in RStudio) using R Markdown and the knitr package. You’ll do the same with your homework for the course. A number of people have written entire books in knitr.
Because I cruelly made you do a homework on the topic for today, you should have the basics of R down already. First, though, any burning questions that you have after doing the homework?
General advice on learning and using R:
Don’t be shy about asking me or others for advice. I’m here to help. (I’m a nice guy, and it’s also my job.)
R has a very large and helpful user community. For almost any question, you should be able to find an answer by typing a description of your issue into Google.
Rather than giving you a blow-by-blow on R syntax and usage up front and risking losing you to boredom, I’m going to dive into the course content, introducing R features as they come up. I hope this will work! But since I’m imperfect, and I don’t know exactly what you do and don’t already know, the only way this can work is for you to be very, very proactive.
If you see a piece of code that you don’t understand, or even think you might not understand, speak up! Most likely there are others in the class who have the same question.
Suppose we have a coin with weight \(\pi\). If we flip it 4 times, what’s the probability of getting the sequence \(HHTH\)?
We’re assuming that the flips are independent of each other, so, letting \(H_i\) and \(H_j\) represent the outcomes of getting heads on flips \(i\) and \(j\) respectively, we have \(P(H_i, H_j)\) [i.e., \(P(H_i \& H_j)\)] equal to \(P(H_i) \times P(H_j)\) for any \(i \neq j\). As a result: \[ \begin{aligned} P(HHTH) =& P(H) \times P(H) \times P(T) \times P(H)\\ =& P(H)^3 \times P(T)\\ =& \pi^3 \times (1-\pi) \end{aligned} \]
More generally, for any sequence \(S\) containing \(N_H\) heads and \(N_T\) tails, the probability of that sequence is \[ P(S) = \pi^{N_H} \times (1-\pi)^{N_T} \]
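To check the formula numerically, here’s a minimal sketch in R. (The function seq.prob and the value .6 are just for illustration; I use p rather than pi for the argument name so as not to shadow R’s built-in constant pi.)

seq.prob = function(s, p) {
  flips = strsplit(s, '')[[1]] # split 'HHTH' into c('H', 'H', 'T', 'H')
  p^sum(flips == 'H') * (1-p)^sum(flips == 'T')
}
seq.prob('HHTH', p=.6) # same as .6^3 * .4

## [1] 0.0864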
If we flip a coin 3 times, the sample space is \(\Omega=\{HHH, THH, HTH, HHT, TTH, THT, HTT, TTT\}\) - a set with \(2^3=8\) elements. What is the probability of each?
\[ \begin{aligned} P(HHH) =& \pi^3\\ P(THH) =& \pi^2\times (1-\pi)\\ P(HTH) =& \pi^2\times (1-\pi)\\ P(HHT) =& \pi^2\times (1-\pi)\\ P(TTH) =& \pi \times (1-\pi)^2\\ P(THT) =& \pi \times (1-\pi)^2\\ P(HTT) =& \pi \times (1-\pi)^2\\ P(TTT) =& (1-\pi)^3 \end{aligned} \]
Note the greater incidence of repeated values in the middle range: there are simply more ways to get 1 or 2 heads in 3 flips than to get 0 or 3. Suppose that, instead of asking about the probability of each possible outcome, we ask about the variable ‘number of tails in 3 flips’ (or, the question ‘How many tails in 3 flips?’). Then the probability will be greater for 1 or 2 than for 0 or 3, because of this combinatorial fact.
\[ \begin{aligned} P(N_T = 0) =& 1 \times \pi^3\\ P(N_T = 1) =& 3 \times (\pi^2\times (1-\pi))\\ P(N_T = 2) =& 3 \times (\pi \times (1-\pi)^2)\\ P(N_T = 3) =& 1 \times (1-\pi)^3 \end{aligned} \]
When we ask ‘How many tails?’, we don’t care about the order in which heads and tails occur, and so we’re interested in how many ways there are to permute a sequence of length 3 with \(N_T\) tails in it. In mathematical notation we write \[ {n \choose r} = \frac{n!}{(n-r)!r!}, \] where \(n\) is the length of the sequence, \(r\) is the number of items of interest, and \(x! = x \times (x-1) \times ... \times 1\) [except that \(0! = 1\)].
For each sequence above, the relevant count is \({3 \choose N_T} = 3!/[(3-N_T)!N_T!]\), i.e.: \[ \begin{aligned} {3 \choose 0} =& \frac{3!}{(3-0)!0!} = \frac{6}{6 \times 1} = 1\\ {3 \choose 1} =& \frac{3!}{(3-1)!1!} = \frac{6}{2 \times 1} = 3\\ {3 \choose 2} =& \frac{3!}{(3-2)!2!} = \frac{6}{1 \times 2} = 3\\ {3 \choose 3} =& \frac{3!}{(3-3)!3!} = \frac{6}{1 \times 6} = 1 \end{aligned} \] Because the flips are independent, all permutations of a sequence with a given number of tails have equal probability. Thus the probability we derive for each \(N_T\) is the probability of any single such sequence, multiplied by the number of unique permutations of that sequence.
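R’s choose() computes these counts directly, so we can verify the arithmetic; the value \(\pi = .5\) below is purely illustrative.

choose(3, 0:3) # number of length-3 sequences with 0, 1, 2, 3 tails
## [1] 1 3 3 1
p = .5 # illustrative value of pi
choose(3, 0:3) * p^(3 - 0:3) * (1-p)^(0:3) # P(N_T = 0), ..., P(N_T = 3)
## [1] 0.125 0.375 0.375 0.125

Note that the four probabilities sum to 1, as probabilities over a sample space must.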
Notice that I used LaTeX to insert some nice-looking math above. You can do this in familiar ways in R Markdown + knitr, if you’re a LaTeX user: a single $ on each side of an expression for inline math, and double $$ for math typeset on a separate line. For example, the formula for counting permutations was typeset as follows: $$ {n \choose r} = \frac{n!}{(n-r)!r!}, $$
Aligned multiline equations are typeset by enclosing them in $$\begin{aligned}...\end{aligned}$$, with the usual LaTeX conventions (\\ for a line break, & for alignment location).
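For instance, the source for the first two lines of the \(N_T\) distribution above looks like this: $$ \begin{aligned} P(N_T = 0) =& 1 \times \pi^3\\ P(N_T = 1) =& 3 \times (\pi^2\times (1-\pi)) \end{aligned} $$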
See this article for more details on math in R Markdown.
Suppose you want to estimate the prevalence of a disease in a certain city. (Example from Hoff, A First Course in Bayesian Statistical Methods.) You select 30 people from the city at random and test them. Assume the test is perfectly accurate, and let the true rate of the disease be \(\pi\). How many of the people you test do you expect to have the disease?
If you’re really sampling randomly from the population, and the true rate of infection is \(\pi\), then the distribution of \(T^+\) - the number of positive test results in your sample - is \[ P(T^+|\pi) = {30 \choose T^+} \pi^{T^+} (1-\pi)^{30-T^+} \] In R:
total.sampled = 30
num.infected = 0:30 # vector listing possible numbers of positive test results in sample
true.rate = .3
binomial = function(total, diseased, proportion) {
  return(choose(total, diseased) * proportion^diseased * (1-proportion)^(total-diseased))
}
probs = sapply(num.infected, function(t.plus) {
  binomial(total.sampled, t.plus, proportion=true.rate)
})
probs
## [1] 2.253934e-05 2.897915e-04 1.800847e-03 7.203389e-03 2.083838e-02
## [6] 4.643981e-02 8.292823e-02 1.218537e-01 1.501412e-01 1.572908e-01
## [11] 1.415617e-01 1.103078e-01 7.485173e-02 4.441751e-02 2.311524e-02
## [16] 1.056697e-02 4.245656e-03 1.498467e-03 4.638111e-04 1.255429e-04
## [21] 2.959225e-05 6.039234e-06 1.058827e-06 1.578375e-07 1.972969e-08
## [26] 2.029340e-09 1.672533e-10 1.061925e-11 4.876188e-13 1.441238e-14
## [31] 2.058911e-16
plot(num.infected, probs, pch=20, xlab='Number infected in sample, out of 30', ylab='Probability of this number', main='')
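As an aside, base R provides dbinom() for exactly this probability mass function, so we didn’t strictly need to write binomial() ourselves. A quick consistency check:

all.equal(probs, dbinom(num.infected, size=total.sampled, prob=true.rate))
## [1] TRUE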
On the other hand, we could have used sampling to get an approximate picture. There will be more on this below, but to preview, here’s a way to use rbinom() to randomly generate data from a specified distribution, which we can then inspect.
set.seed(0)
rbinom(n=1, size=30, prob=.3) # a single data set we could have collected
## [1] 12
rbinom(n=1, size=30, prob=.3) # another one
## [1] 7
rbinom(n=20, size=30, prob=.3) # 20 such data sets
## [1] 8 9 12 7 12 13 10 10 5 7 7 10 8 11 9 10 15 8 11 13
par(mfrow=c(2,3))
hist(rbinom(n=20, size=30, prob=.3), xlab='number infected', main='20 sims', xlim=c(0,30))
hist(rbinom(n=200, size=30, prob=.3), xlab='number infected', main='200 sims', xlim=c(0,30))
hist(rbinom(n=2000, size=30, prob=.3), xlab='number infected', main='2,000 sims', xlim=c(0,30))
hist(rbinom(n=20000, size=30, prob=.3), xlab='number infected', main='20,000 sims', xlim=c(0,30))
hist(rbinom(n=200000, size=30, prob=.3), xlab='number infected', main='200,000 sims', xlim=c(0,30))
two.million.samples = rbinom(n=2000000, size=30, prob=.3)
hist(two.million.samples, xlab='number infected', main='2,000,000 sims', xlim=c(0,30))
summary(two.million.samples)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.000 9.000 8.998 11.000 23.000
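As a sanity check of my own (not from Hoff): the theoretical mean of this binomial is \(30 \times .3 = 9\) and the theoretical variance is \(30 \times .3 \times .7 = 6.3\), so the simulated values should land very close to these. (The summary above already shows the simulated mean, 8.998, within rounding of 9.)

mean(two.million.samples) # theoretical value: 30 * .3 = 9
var(two.million.samples) # theoretical value: 30 * .3 * .7 = 6.3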
With a large number of simulated data sets, the histogram of the sample distribution gives a very close approximation to the shape of the true distribution. However, notice that there are zero samples with \(T^+ > 22\) even in 2 million samples. This is not because \(P(T^+ > 22 \mid \pi) = 0\), but simply because \(P(T^+ > 22 \mid \pi)\) is so tiny that even 2 million draws weren’t enough to turn up a single such sample! Specifically, this probability is less than \(2 \times 10^{-7}\), or 2 in 10 million.
which(num.infected > 22) # make sure you understand which() usage
## [1] 24 25 26 27 28 29 30 31
probs[which(num.infected > 22)]
## [1] 1.578375e-07 1.972969e-08 2.029340e-09 1.672533e-10 1.061925e-11
## [6] 4.876188e-13 1.441238e-14 2.058911e-16
sum(probs[which(num.infected > 22)])
## [1] 1.797749e-07
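The built-in cumulative distribution function pbinom() gives the same tail probability directly; lower.tail=FALSE asks for \(P(T^+ > 22)\) rather than \(P(T^+ \le 22)\).

pbinom(22, size=30, prob=.3, lower.tail=FALSE)
## [1] 1.797749e-07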