library(ggplot2)
In ThinkBayes, the author uses a Gaussian prior for the hockey score
example. The gamma prior is more traditional, conjuate, and doesn't
allow negative values. Therefore seems to be a better fit. Let's first
evaluate that.
The parameters for the original distribution:
hockey.prior.mean <- 2.7
hockey.prior.sd <- 0.3
hockey.prior.var <- hockey.prior.sd * hockey.prior.sd
Reparameterize for \( Gamma(\alpha, \beta) \):
hockey.prior.rate <- hockey.prior.mean/hockey.prior.var
hockey.prior.shape <- hockey.prior.mean * hockey.prior.rate
Thus the original $N(2.7
, 0.3
)$ becomes
$Gamma(81
, 30
)$.
Let's quick make sure that this distribution gives the right mean
and variance.
gamma.rates <- rgamma(1e+05, hockey.prior.shape, rate = hockey.prior.rate)
These samples have a mean of 2.7003
and standard deviation
of 0.2999
. Pretty close.
First, we need to set up a data frame.
rates <- seq(0, 6, 0.01)
hockey.prior.frame <- rbind(data.frame(rate = rates, dist = "Normal", dens = dnorm(rates,
hockey.prior.mean, hockey.prior.sd), cum = pnorm(rates, hockey.prior.mean,
hockey.prior.sd)), data.frame(rate = rates, dist = "Gamma", dens = dgamma(rates,
hockey.prior.shape, rate = hockey.prior.rate), cum = pgamma(rates, hockey.prior.shape,
rate = hockey.prior.rate)))
With this in place, we can plot the PDF:
ggplot(hockey.prior.frame) + aes(x = rate, y = dens, color = dist, linetype = dist) +
ggtitle("PDF") + geom_line()
ggplot(hockey.prior.frame) + aes(x = rate, y = cum, color = dist, linetype = dist) +
ggtitle("CDF") + geom_line()
The gamma and normal produce very similar distributions. The gamma has the advantages
of being more correct (non-negative) and conjugate.
There are four hockey games, each of which has Bruins goals and Canucks goals.
hockey.data <- data.frame(bruins = c(0, 2, 8, 4), canucks = c(1, 3, 1, 0))
To do a posterior inference, we want to compute the shape and rate
parameters of the posterior gamma distribution for each.
bruins.post <- list(shape = hockey.prior.shape + sum(hockey.data$bruins), rate = hockey.prior.rate +
nrow(hockey.data))
canucks.post <- list(shape = hockey.prior.shape + sum(hockey.data$canucks),
rate = hockey.prior.rate + nrow(hockey.data))
The Bruins therefore have an analytic posterior distribution of
$Gamma(95
, )$, and the
Canucks have a posterior of $Gamma(86
, 34
)$.
Let's simulate these distributions!
bruins.rates <- rgamma(10000, bruins.post$shape, bruins.post$rate)
canucks.rates <- rgamma(10000, canucks.post$shape, canucks.post$rate)
So the Bruins have an expected score of 2.7931
, and the
Canucks have an expected score of 2.5264
.
Let's simulate a game for each of these posterior draws:
bruins.scores <- sapply(bruins.rates, function(r) {
rpois(1, r)
})
canucks.scores <- sapply(canucks.rates, function(r) {
rpois(1, r)
})
What do these posterior distributions look like? Like usual, we'll start
by building a data frame.
post.frame <- with(new.env(), {
rates <- seq(0, 6, 0.01)
rbind(data.frame(rate = rates, team = "Bruins", dens = dgamma(rates, bruins.post$shape,
bruins.post$rate)), data.frame(rate = rates, team = "Canucks", dens = dgamma(rates,
canucks.post$shape, canucks.post$rate)))
})
And plot the PDFs:
ggplot(post.frame) + aes(x = rate, y = dens, color = team, linetype = team) +
ggtitle("Posterior PDFs") + geom_line()
Let's plot the simulated scores:
scores.data <- rbind(data.frame(team = "Bruins", score = bruins.scores), data.frame(team = "Canucks",
score = canucks.scores))
ggplot(scores.data) + aes(x = score, fill = team, color = team) + stat_bin(binwidth = 1,
geom = "line")
## ymax not defined: adjusting position using y instead
There's something weird with this plot.
How often to the Bruins win?
sum(bruins.scores > canucks.scores)/length(bruins.scores)
## [1] 0.4597
How often do the Canucks win?
sum(bruins.scores < canucks.scores)/length(bruins.scores)
## [1] 0.3694
So the Bruins are more likely to win.