Grid Approximation
I will use grid approximation to estimate the posterior distribution under several different priors.
I’ll start with a uniform prior and 20 points.
# define grid
p_grid <- seq(from = 0, to = 1, length.out = 20)
# define prior
prior <- rep(1, 20)
# compute likelihood at each value in grid
likelihood <- dbinom(6, size = 9, prob = p_grid)
# compute product of likelihood and prior
unstd.posterior <- likelihood * prior
# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("20 points")
Let’s see what 100 points does.
# define grid
p_grid <- seq(from = 0, to = 1, length.out = 100)
# define prior
prior <- rep(1, 100)
# compute likelihood at each value in grid
likelihood <- dbinom(6, size = 9, prob = p_grid)
# compute product of likelihood and prior
unstd.posterior <- likelihood * prior
# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("100 points")
Instead of the uniform prior (which contains no information), we can try two other priors.
The first is a step function: zero for values of p below 0.5 and one for values at or above 0.5. The second is a negative exponential of the distance from 0.5, so it peaks at the center of the distribution. Let’s graph them for fun.
# First prior: a step function
plot(p_grid, ifelse(p_grid < 0.5, 0, 1), type = "l", xlab = "p", ylab = "prior")
mtext("'Step' Prior Distribution")
# Second prior: a peaked exponential
plot(p_grid, exp(-5 * abs(p_grid - 0.5)), type = "l", xlab = "p", ylab = "prior")
mtext("'Exponential' Prior Distribution")
Now let’s plug them into our grid approximation. The p_grid and likelihood have already been defined above.
# define first prior
prior <- ifelse(p_grid < 0.5, 0, 1)
# compute product of likelihood and prior
unstd.posterior <- likelihood * prior
# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("First 'Step' Prior")
# define second prior
prior <- exp(-5 * abs(p_grid - 0.5))
# compute product of likelihood and prior
unstd.posterior <- likelihood * prior
# standardize the posterior, so it sums to 1
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("Second 'Exponential' Prior")
As we can see, the prior definitely affects the posterior, although not as much as I would have thought.
The posterior has the most spread when the uniform prior is used, which makes sense since a uniform prior carries the least information.
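As a quick check on that claim, we can compute the posterior standard deviation under each of the three priors. This is my own sketch: it reuses the 100-point p_grid defined above, and posterior_sd is just a helper name I made up.
# helper: grid-approximate posterior standard deviation for a given prior
posterior_sd <- function(prior) {
  unstd <- dbinom(6, size = 9, prob = p_grid) * prior
  post <- unstd / sum(unstd)
  mu <- sum(post * p_grid)
  sqrt(sum(post * (p_grid - mu)^2))
}
posterior_sd(rep(1, length(p_grid)))       # uniform prior
posterior_sd(ifelse(p_grid < 0.5, 0, 1))   # step prior
posterior_sd(exp(-5 * abs(p_grid - 0.5)))  # peaked exponential prior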
Grid approximation works well here, but the grid has to cover every combination of parameter values, so it needs to be recomputed as parameters are added: with 10 parameters on a grid of 100 points each, that’s \(100^{10}\) values to compute. That’s why we’ll want a more efficient way of approximating the posterior: quadratic approximation.
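Just to see how quickly that blows up, a back-of-the-envelope calculation (the parameter counts here are arbitrary examples):
# number of grid evaluations needed for a joint grid over several parameters
grid_points <- 100
n_params <- c(1, 2, 5, 10)
grid_points^n_params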
Quadratic Approximation
This method assumes that the region near the peak of the posterior will be nearly Gaussian (normal). We can use what we know about Gaussian distributions to approximate the posterior.
The log of a Gaussian is a parabola, so the quadratic approximation represents the log-posterior with a parabola.
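Concretely, writing out the Gaussian log-density shows it is quadratic (a parabola) in its argument:
\[ \log \mathcal{N}(p \mid \mu, \sigma) = -\frac{(p - \mu)^2}{2\sigma^2} - \log\left(\sigma \sqrt{2\pi}\right) \]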
Let’s try the quadratic approximation on the globe-tossing model.
W = number of tosses that land on water
L = number of tosses that land on land
library(rethinking) # quap() and precis() come from the rethinking package
globe.qa <- quap(
alist(
W ~ dbinom(W + L, p), # binomial likelihood
p ~ dunif(0, 1) # uniform prior
),
data = list(W = 6, L = 3)
)
# display summary of quadratic approximation
precis(globe.qa)
## mean sd 5.5% 94.5%
## p 0.6666668 0.1571337 0.4155367 0.9177969
# analytical calculation
W <- 6
L <- 3
curve(dbeta(x, W + 1, L + 1), from = 0, to = 1)
# quadratic approximation
curve(dnorm(x, 0.67, 0.16), lty = 2, add = TRUE)
The black curve is the analytical posterior and the dashed curve is the quadratic approx.
As we get more data, the quadratic approximation gets closer to the analytical posterior. With complex models it’s often not possible to derive the analytical posterior at all, so we have to rely on the approximation, which is another reason more data == more better.
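As a rough illustration of that point (my own sketch, not code from the chapter), we can keep the same 2:1 proportion of water but double and quadruple the number of tosses, then overlay the quadratic approximation (dashed) on the exact Beta posterior (solid):
par(mfrow = c(1, 3))
for (k in c(1, 2, 4)) {
  W <- 6 * k
  L <- 3 * k
  fit <- quap(
    alist(
      W ~ dbinom(W + L, p), # binomial likelihood
      p ~ dunif(0, 1)       # uniform prior
    ),
    data = list(W = W, L = L)
  )
  mu <- coef(fit)["p"]               # posterior mode
  sigma <- sqrt(vcov(fit)["p", "p"]) # curvature-based standard deviation
  curve(dbeta(x, W + 1, L + 1), from = 0, to = 1,
        xlab = "probability of water", ylab = "density")
  curve(dnorm(x, mu, sigma), lty = 2, add = TRUE)
  mtext(paste(W + L, "tosses"))
}
par(mfrow = c(1, 1))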
Markov Chain Monte Carlo
“The conceptual challenge with MCMC lies in its highly non-obvious strategy. Instead of attempting to compute or approximate the posterior distribution directly, MCMC techniques merely draw samples from the posterior. You end up with a collection of parameter values, and the frequencies of these values correspond to the posterior plausibilities.” (pg. 45)
Here’s a fun example of using MCMC with the globe toss example.
n_samples <- 1000
p <- rep(NA, n_samples)
p[1] <- 0.5
W <- 6
L <- 3
for (i in 2:n_samples) {
  # propose a new value near the current one
  p_new <- rnorm(1, p[i - 1], 0.1)
  # reflect proposals that land outside [0, 1] back into the interval
  if (p_new < 0) p_new <- abs(p_new)
  if (p_new > 1) p_new <- 2 - p_new
  # likelihood of the data at the current and proposed values
  q0 <- dbinom(W, W + L, p[i - 1])
  q1 <- dbinom(W, W + L, p_new)
  # accept the proposal with probability q1/q0, otherwise keep the current value
  p[i] <- ifelse(runif(1) < q1 / q0, p_new, p[i - 1])
}
dens(p, xlim = c(0, 1))
curve(dbeta(x, W + 1, L + 1), lty = 2, add = TRUE)
The MCMC posterior (solid) and the analytical posterior (dashed) after 1000 samples.
Summary
“The target of inference in Bayesian inference is a posterior probability distribution.” (pg. 46)
The probability of the data is the likelihood.
Practice
Easy
2E1. Which of the expressions below correspond to the statement: the probability of rain on Monday?
- Pr(rain)
- Pr(rain|Monday)
- Pr(Monday|rain)
- Pr(rain, Monday)/ Pr(Monday)
2E2. Which of the following statements corresponds to the expression: Pr(Monday|rain)?
- The probability of rain on Monday.
- The probability of rain, given that it is Monday.
- The probability that it is Monday, given that it is raining.
- The probability that it is Monday and that it is raining.
2E3. Which of the expressions below correspond to the statement: the probability that it is Monday, given that it is raining?
- Pr(Monday|rain)
- Pr(rain|Monday)
- Pr(rain|Monday) Pr(Monday)
- Pr(rain|Monday) Pr(Monday)/ Pr(rain)
- Pr(Monday|rain) Pr(rain)/ Pr(Monday)
2E4. The Bayesian statistician Bruno de Finetti (1906–1985) began his book on probability theory with the declaration: “PROBABILITY DOES NOT EXIST.” The capitals appeared in the original, so I imagine de Finetti wanted us to shout this statement. What he meant is that probability is a device for describing uncertainty from the perspective of an observer with limited knowledge; it has no objective reality. Discuss the globe tossing example from the chapter, in light of this statement. What does it mean to say “the probability of water is 0.7”?
It means that there’s a 70% chance that your finger will land on water if you toss the globe in the air, catch it, and put your finger on a random spot. Of course, we know how much of the Earth is covered by water. But if we didn’t, the probability would really be a probability distribution over the proportion of water, not a single value.
Medium
2M1. Recall the globe tossing model from the chapter. Compute and plot the grid approximate posterior distribution for each of the following sets of observations. In each case, assume a uniform prior for p.
- W, W, W
- W, W, W, L
- L, W, W, L, W, W, W
# W, W, W
p_grid <- seq(0, 1, length.out = 100)
prior <- rep(1, length(p_grid))
likelihood <- dbinom(3, 3, prob = p_grid)
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("W, W, W")
# W, W, W, L
p_grid <- seq(0, 1, length.out = 100)
prior <- rep(1, length(p_grid))
likelihood <- dbinom(3, 4, prob = p_grid)
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("W, W, W, L")
# L, W, W, L, W, W, W
p_grid <- seq(0, 1, length.out = 100)
prior <- rep(1, length(p_grid))
likelihood <- dbinom(5, 7, prob = p_grid)
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("L, W, W, L, W, W, W")
2M2. Now assume a prior for p that is equal to zero when p < 0.5 and is a positive constant when p ≥ 0.5. Again compute and plot the grid approximate posterior distribution for each of the sets of observations in the problem just above.
# W, W, W
p_grid <- seq(0, 1, length.out = 100)
prior <- ifelse(p_grid < 0.5, 0, 1)
likelihood <- dbinom(3, 3, prob = p_grid)
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("W, W, W")
# W, W, W, L
p_grid <- seq(0, 1, length.out = 100)
prior <- ifelse(p_grid < 0.5, 0, 1)
likelihood <- dbinom(3, 4, prob = p_grid)
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("W, W, W, L")
# L, W, W, L, W, W, W
p_grid <- seq(0, 1, length.out = 100)
prior <- ifelse(p_grid < 0.5, 0, 1)
likelihood <- dbinom(5, 7, prob = p_grid)
unstd.posterior <- likelihood * prior
posterior <- unstd.posterior / sum(unstd.posterior)
plot(p_grid, posterior,
type = "b",
xlab = "probability of water", ylab = "posterior probability"
)
mtext("L, W, W, L, W, W, W")
2M3. Suppose there are two globes, one for Earth and one for Mars. The Earth globe is 70% covered in water. The Mars globe is 100% land. Further suppose that one of these globes—you don’t know which—was tossed in the air and produced a “land” observation. Assume that each globe was equally likely to be tossed. Show that the posterior probability that the globe was the Earth, conditional on seeing “land” (Pr(Earth|land)), is 0.23.
What we know from the problem:
\(\text{Pr(land|Earth)} = 1 - 0.7 = 0.3\)
\(\text{Pr(land|Mars)} = 1\)
Priors:
\(\text{Pr(Earth)} = 0.5\)
\(\text{Pr(Mars)} = 0.5\)
We want to calculate Pr(Earth|land). For this we can use Bayes Rule:
\(\text{Pr(Earth|land)} = \dfrac{\text{Pr(land|Earth)Pr(Earth)}}{\text{Pr(land)}}\)
\(\text{Pr(Earth|land)} = \dfrac{\text{Pr(land|Earth)Pr(Earth)}}{\text{Pr(land|Earth)Pr(Earth) + Pr(land|Mars)Pr(Mars)}}\)
\(\text{Pr(Earth|land)} = \dfrac{0.3 \cdot 0.5}{(0.3 \cdot 0.5) + (1 \cdot 0.5)}\)
0.3 * 0.5 / (0.3*0.5 + 0.5)
## [1] 0.2307692
2M4. Suppose you have a deck with only three cards. Each card has two sides, and each side is either black or white. One card has two black sides. The second card has one black and one white side. The third card has two white sides. Now suppose all three cards are placed in a bag and shuffled. Someone reaches into the bag and pulls out a card and places it flat on a table. A black side is shown facing up, but you don’t know the color of the side facing down. Show that the probability that the other side is also black is 2/3. Use the counting method (Section 2 of the chapter) to approach this problem. This means counting up the ways that each card could produce the observed data (a black side facing up on the table).
Let’s do some counting. There are three cards: BB, BW, WW.
If we observe a black side, we know it can only be the first two cards (BB, BW).
Then, we can make a tree of the possible options for the other side.
If we have the BB card, then there are two possible ways for it to show a black side up, and in both of those ways the other side is black.
If we have the BW card, there’s only one way for it to show a black side up, and then the other side is white.
Adding up these counts, there are two ways for the hidden side to be black and one way for it to be white. Pr(black when flipped) \(= \dfrac{2}{3}\).
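The same count can be done mechanically in R (a small sketch of my own; the card encodings are just illustrative):
# one entry per card, listing its two sides
cards <- list(BB = c("B", "B"), BW = c("B", "W"), WW = c("W", "W"))
# ways each card can produce the observation "black side facing up"
ways_black_up <- sapply(cards, function(sides) sum(sides == "B"))
ways_black_up
# only the BB card has black on the hidden side
unname(ways_black_up["BB"] / sum(ways_black_up))
## [1] 0.6666667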
2M5. Now suppose there are four cards: B/B, B/W, W/W, and another B/B. Again suppose a card is drawn from the bag and a black side appears face up. Again calculate the probability that the other side is black.
Two BB cards and one BW card means there are \(4\) possibilities to get a black side when we flip it over and 1 possibility to get a white side. So the Pr(black when flipped) \(= \dfrac{4}{5}\).
2M6. Imagine that black ink is heavy, and so cards with black sides are heavier than cards with white sides. As a result, it’s less likely that a card with black sides is pulled from the bag. So again assume there are three cards: B/B, B/W, and W/W. After experimenting a number of times, you conclude that for every way to pull the B/B card from the bag, there are 2 ways to pull the B/W card and 3 ways to pull the W/W card. Again suppose that a card is pulled and a black side appears face up. Show that the probability the other side is black is now 0.5. Use the counting method, as before.
We can treat the pull weights as if the bag held one BB card, two BW cards, and three WW cards. Counting the ways to see a black side up: the BB card contributes \(2 \times 1 = 2\) ways, all with black on the other side, and the BW card contributes \(1 \times 2 = 2\) ways, all with white on the other side. So Pr(black when flipped) \(= \dfrac{2}{2 + 2} = 0.5\).
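And the same counting sketch with the pull weights from the problem folded in (again my own illustration):
cards <- list(BB = c("B", "B"), BW = c("B", "W"), WW = c("W", "W"))
ways_black_up <- sapply(cards, function(sides) sum(sides == "B"))
# ways to pull each card from the bag, as given in the problem
pull_ways <- c(BB = 1, BW = 2, WW = 3)
weighted_ways <- ways_black_up * pull_ways
# probability the hidden side is black, given a black side up
unname(weighted_ways["BB"] / sum(weighted_ways))
## [1] 0.5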
2M7. Assume again the original card problem, with a single card showing a black side face up. Before looking at the other side, we draw another card from the bag and lay it face up on the table. The face that is shown on the new card is white. Show that the probability that the first card, the one showing a black side, has black on its other side is now 0.75. Use the counting method, if you can. Hint: Treat this like the sequence of globe tosses, counting all the ways to see each observation, for each possible first card.
If the first card is the BB card, then there are two ways for it to show the black side up. When we draw from the other two cards, there are three possible ways for them to show white side up. BW has one way and the WW has two ways. So joint number of ways \(= 2 \cdot 3 = 6\).
If the first card is the BW card, there is one way for it to show black up. The BB can’t show white up, and there are two ways for the WW to show white side up. So joint number of ways \(= 1 \cdot 2 = 2\).
Add the number of ways together to get the total number of ways to produce the B, W sequence. \[ 6 + 2 = 8 \]
Then we find the number of ways of the first card being the BB card and divide. \[ \frac{6}{8} = 0.75 \]
Pr(first card is black/black) \(= 0.75\)
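The arithmetic for this one fits in a couple of lines (my own check of the counts above):
# first card BB: 2 ways to show black up, times 3 ways for another card
# (BW or WW) to show white up
ways_first_BB <- 2 * (1 + 2)
# first card BW: 1 way to show black up, times 2 ways for WW to show white up
ways_first_BW <- 1 * (0 + 2)
ways_first_BB / (ways_first_BB + ways_first_BW)
## [1] 0.75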
Hard
2H1. Suppose there are two species of panda bear. Both are equally common in the wild and live in the same places. They look exactly alike and eat the same food, and there is yet no genetic assay capable of telling them apart. They differ however in their family sizes. Species A gives birth to twins 10% of the time, otherwise birthing a single infant. Species B births twins 20% of the time, otherwise birthing singleton infants. Assume these numbers are known with certainty, from many years of field research.
Now suppose you are managing a captive panda breeding program. You have a new female panda of unknown species, and she has just given birth to twins. What is the probability that her next birth will also be twins?
We are looking for the conditional probability that the panda gives birth to twins again, given that she has already given birth to twins.
\(\text{Pr(twins}_2|\text{twins}_1\text{)}\)
Bayes Theorem
\(\text{Pr(twins}_2|\text{twins}_1\text{)} = \dfrac{\text{Pr(twins}_1, \text{twins}_2\text{)}}{\text{Pr(twins)}}\)
Pr(twins) can be thought of as the average probability of unconditionally having twins. \[ \text{Pr(twins)} = \underbrace{\frac{1}{2}(0.1)}_{\text{Species A}} + \underbrace{\frac{1}{2}(0.2)}_{\text{Species B}} = 0.15 \]
\(\text{Pr(twins}_1, \text{twins}_2\text{)}\) is the average probability that a female has twins in both of her first two births. We determine this probability for each species separately and then average, weighting by the prior probability of each species. In other words, this is the average joint probability of two sets of twins.
\[ \text{Pr(twins}_1, \text{twins}_2\text{)} = \frac{1}{2}(0.01) + \frac{1}{2}(0.04) = 0.025 \]
Combining these values in Bayes Theorem:
\[ \text{Pr(twins}_2|\text{twins}_1\text{)} = \frac{0.025}{0.15} = \frac{1}{6} \approx 0.17 \]
One interesting thing about this posterior is that it is higher than Pr(twins), because the first twin birth carries information about the species of the panda: after seeing twins, it’s more likely that the female is species B.
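A quick numeric check of the arithmetic:
p_twins_A <- 0.1
p_twins_B <- 0.2
# average (unconditional) probability of twins
p_twins <- 0.5 * p_twins_A + 0.5 * p_twins_B
# average probability of twins in both of the first two births
p_twins_twice <- 0.5 * p_twins_A^2 + 0.5 * p_twins_B^2
p_twins_twice / p_twins
## [1] 0.1666667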
2H2. Recall all the facts from the problem above. Now compute the probability that the panda we have is from species A, assuming we have observed only the first birth and that it was twins.
We are looking for the following conditional probability: \[ \text{Pr(A|twins}_1\text{)} \]
Again, Bayes Theorem comes to the rescue.
\[ \text{Pr(A|twins}_1\text{)} = \frac{\text{Pr(twins}_1|\text{A)} \text{Pr(A)}}{\text{Pr(twins)}} \]
\(\text{Pr(twins)} = 0.15\) from previous problem
\(\text{Pr(A)} = 0.5\) since both species are equally common and distributed.
\(\text{Pr(twins}_1|\text{A)} = 0.1\) as given in the previous problem.
\[ \text{Pr(A|twins}_1\text{)} = \frac{0.1 \cdot 0.5}{0.15} = \frac{1}{3} \]
This means that the posterior probability of species A, after observing the female having twins, falls to \(\dfrac{1}{3}\), down from \(\dfrac{1}{2}\). This implies that the posterior probability of our female being species B after having twins is \(\dfrac{2}{3}\).
2H3. Continuing on from the previous problem, suppose the same panda mother has a second birth and that it is not twins, but a singleton infant. Compute the posterior probability that this panda is species A.
Ah yes, the immense power of Bayes Theorem comes into use. Now, we can update our beliefs!!
\[ \text{Pr(A|singleton)} = \frac{\text{Pr(singleton|A)Pr(A)}}{\text{Pr(singleton)}} \]
\(\text{Pr(A)} = \frac{1}{3}\) — this is the posterior from the previous question
\(\text{Pr(singleton|A)} = 1 - 0.1 = 0.9\) — the complement of species A having twins
\[ \begin{align*} \text{Pr(singleton)} &= \text{Pr(singleton|A)Pr(A)} + \text{Pr(singleton|B)Pr(B)} \\ &=(0.9)\frac{1}{3} + (0.8)\frac{2}{3} = \frac{5}{6} \end{align*} \]
Now we can plug these values into our Bayes Theorem equation.
\[ \text{Pr(A|singleton)} = \frac{0.9 \cdot \frac{1}{3}}{\frac{5}{6}} = 0.36 \]
2H4. A common boast of Bayesian statisticians is that Bayesian inference makes it easy to use all of the data, even if the data are of different types. So suppose now that a veterinarian comes along who has a new genetic test that she claims can identify the species of our mother panda. But the test, like all tests, is imperfect. This is the information you have about the test:
- The probability it correctly identifies a species A panda is 0.8.
- The probability it correctly identifies a species B panda is 0.65.
The vet administers the test to your panda and tells you that the test is positive for species A. First ignore your previous information from the births and compute the posterior probability that your panda is species A. Then redo your calculation, now using the birth data as well.
Let’s first compute the posterior probability that our panda is species A given a positive test for A, ignoring the birth data.
\[ \text{Pr(A|testA)} = \frac{\text{Pr(testA|A)Pr(A)}}{\text{Pr(testA)}} \]
Prior
\(\text{Pr(A)} = 0.5\) since both species are equally common and distributed
What we know about the test:
\(\text{Pr(testA|A)} = 0.8\) since that is the true positive rate
\(\text{Pr(testA|B)} = 1 - 0.65 = 0.35\) — probability of incorrectly identifying species B as species A
Average probability of getting a positive test for species A:
\[ \text{Pr(testA)} = \text{Pr(testA|A)Pr(A)} + \text{Pr(testA|B)Pr(B)} = 0.8(0.5) + 0.35(0.5) = 0.575 \]
Putting it all together in the original Bayes Theorem equation:
\[ \text{Pr(A|testA)} = \frac{0.8 \cdot 0.5}{0.575} \approx 0.7 \]
The test has updated our belief about the probability of our female being species A: it has increased from 0.5 to 0.7.
To use the birth data as well, we plug in the posterior from the previous problem, 0.36, as the prior Pr(A) in the same equation, since it already contains the information from the two births (so Pr(B) \(= 1 - 0.36 = 0.64\)).
\[ \begin{align*} \text{Pr(A|testA, twins, singleton)} &= \frac{\text{Pr(testA|A)Pr(A)}}{\text{Pr(testA|A)} \cdot \text{Pr(A)} + \text{Pr(testA|B)} \cdot \text{Pr(B)}} \\ \text{Pr(A|testA, twins, singleton)} &= \frac{0.8 \cdot 0.36}{0.8 \cdot 0.36 + 0.35 \cdot 0.64} \\ &= 0.5625 \end{align*} \]
This means that the probability of our female being species A, given the positive test, the twins, and then the singleton birth, is 0.5625.
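As a final sanity check (my own sketch), the whole chain of updates from 2H2 through 2H4 can be written as three applications of Bayes Rule, each using the previous posterior as the next prior:
prior_A <- 0.5
# update on the first birth being twins (2H2)
post_twins <- (0.1 * prior_A) / (0.1 * prior_A + 0.2 * (1 - prior_A))
# update on the second birth being a singleton (2H3)
post_singleton <- (0.9 * post_twins) /
  (0.9 * post_twins + 0.8 * (1 - post_twins))
# update on the test coming back positive for species A (2H4)
post_test <- (0.8 * post_singleton) /
  (0.8 * post_singleton + 0.35 * (1 - post_singleton))
c(post_twins, post_singleton, post_test)
## [1] 0.3333333 0.3600000 0.5625000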