Notes 19

Introduction to Bayesian Statistics | Part 1: Introduction

Setup

Install the LearnBayes package in RStudio through Tools > Install Packages.
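Alternatively, the package can be installed once from the R console (this uses base R's install.packages(), not anything specific to these notes):

install.packages("LearnBayes")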

Loading packages

library(tidyverse)
library(LearnBayes)

Frequentist and Bayesian Statistics

Frequentist statistics might be defined as “traditional” statistics. Most of the ideas we’ve talked about so far, like maximum likelihood estimation, hypothesis testing, and confidence intervals, fall under frequentist statistics.

Frequentist and Bayesian statistics fundamentally differ in how they view parameters.

  • Frequentist statistics views parameters as fixed but unknown constants. It uses the data to find estimates of these parameters.

  • Bayesian statistics views parameters as random variables. That means we assign a distribution to the parameter.

Prior and Posterior distributions

Another difference is that the Bayesian approach allows the statistician to include their prior knowledge about the parameter(s), formed before collecting the data, in the analysis. This knowledge is specified through a prior distribution.

Bayesian analysis then combines the information in the prior distribution with the information about the parameter in the data (via the likelihood function) to obtain the posterior distribution of the parameter.

Main idea:

  • prior distribution: reflects understanding about parameter before data collection

  • posterior distribution: reflects understanding about parameter after data collection

The posterior distribution is obtained through Bayes’ theorem. Without notation, it says:

The posterior distribution is proportional to the prior distribution times the likelihood function

With notation, we have

\[f(\theta|y) \propto \pi(\theta) L(\theta|y)\]
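When the parameter takes only finitely many values (as in the example below), the constant of proportionality is simply the sum of prior times likelihood over all of those values:

\[f(\theta_i|y) = \frac{\pi(\theta_i)\, L(\theta_i|y)}{\sum_j \pi(\theta_j)\, L(\theta_j|y)}\]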

Example

From Chapter 2 of Bayesian Computation with R, starting at Section 2.2.

  • Population: all American college students

  • Parameter of interest: proportion \(p\) of this population who sleep at least 8 hours (on a typical night during the week).

Some steps:

  • First, the researcher does some initial research to help in constructing a prior distribution. (See the book.)

  • A random sample of 27 students is taken. Of this group, 11 reported that they had at least 8 hours of sleep the previous night.

  • In addition, the researcher is interested in predicting the number of students that get at least 8 hours of sleep if a new sample of 20 students is taken.

In this class, by comparison:

3 of ~20 students slept at least 8 hours

Specifying the prior distribution

The book goes through three ways of specifying a prior distribution for \(p\):

  • Using a discrete prior (Section 2.3) - notes 19

  • Using a beta distribution prior (Section 2.4) - notes 20

  • Using a histogram prior (Section 2.5) - notes 21

These notes

Focus on calculating the posterior distribution of \(p\) under the first approach: using a discrete prior.

The likelihood function

If we treat the students as independent trials with a “success” getting at least 8 hours of sleep the previous night, we can use the likelihood function from the binomial distribution.

Thus, for \(s\) successes and \(f\) failures, the likelihood function is

\[L(p) \propto p^{s} (1-p)^{f}\]

Here \(s = 11\) and \(f = 16\), so

\[L(p) \propto p^{11} (1-p)^{16}\]
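As a quick sanity check (base R, not from the book): this likelihood is the binomial probability of 11 successes in 27 trials up to the constant \(\binom{27}{11}\), so dbinom() reproduces it once we account for that factor.

p <- 0.35                            # any trial value of p
choose(27, 11) * p^11 * (1 - p)^16   # likelihood times the binomial constant
dbinom(11, size = 27, prob = p)      # same value, computed by base R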

Using a Discrete Prior

(Section 2.3) A simple approach for specifying a prior distribution for \(p\) is to write down a list of plausible proportion values and then assign weights to these values. According to the book, the researcher assigns:

Proportion   Weight
  0.05        1
  0.15        5.2
  0.25        8
  0.35        7.2
  0.45        4.6
  0.55        2.1
  0.65        0.7
  0.75        0.1
  0.85        0
  0.95        0

We can convert these to prior probabilities by dividing each weight by the sum.
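For example, the weights sum to \(1 + 5.2 + 8 + 7.2 + 4.6 + 2.1 + 0.7 + 0.1 + 0 + 0 = 28.9\), so the prior probability at \(p = 0.25\) is \(8 / 28.9 \approx 0.277\).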

sleep8 <- tibble(
  p = c(.05, .15, .25, .35, .45, .55, .65, .75, .85, .95),
  wt = c(1, 5.2, 8, 7.2, 4.6, 2.1, .7, .1, 0, 0)
) |> 
  mutate(
    prior = wt / sum(wt)  # convert weights to probabilities that sum to 1
  )
# sleep8

Thus a plot of the prior distribution is

ggplot(data = sleep8) + 
  geom_segment(
    mapping = aes(x=p, xend=p, y=0, yend=prior), linewidth=2
  ) + 
  labs(
    x = "p (Proportion sleeping at least 8 hrs)", 
    y = "Prior probability"
  )

Posterior proportional to prior times likelihood

Likelihood function: based on the random sample of 27 students, 11 of whom reported sleeping at least 8 hours. So \(L(p) \propto p^{11}(1-p)^{16}\).

Then the posterior is proportional to the prior times the likelihood.

sleep8 <- sleep8 |> 
  mutate(
    L = p^11 * (1-p)^16,  # likelihood at each support point
    post = prior * L      # prior times likelihood (unnormalized posterior)
  )
sleep8
## # A tibble: 10 × 5
##        p    wt   prior        L     post
##    <dbl> <dbl>   <dbl>    <dbl>    <dbl>
##  1  0.05   1   0.0346  2.15e-15 7.44e-17
##  2  0.15   5.2 0.180   6.42e-11 1.16e-11
##  3  0.25   8   0.277   2.39e- 9 6.61e-10
##  4  0.35   7.2 0.249   9.80e- 9 2.44e- 9
##  5  0.45   4.6 0.159   1.07e- 8 1.71e- 9
##  6  0.55   2.1 0.0727  3.94e- 9 2.86e-10
##  7  0.65   0.7 0.0242  4.44e-10 1.07e-11
##  8  0.75   0.1 0.00346 9.83e-12 3.40e-14
##  9  0.85   0   0       1.10e-14 0       
## 10  0.95   0   0       8.68e-22 0

The “proportional” part means the post column is not yet a probability distribution: we have to divide each value by the sum of all the values to obtain valid probabilities (that sum to one).

sleep8 <- sleep8 |> 
  mutate(posterior = post / sum(post))
sleep8 |>
  select(-post)
## # A tibble: 10 × 5
##        p    wt   prior        L    posterior
##    <dbl> <dbl>   <dbl>    <dbl>        <dbl>
##  1  0.05   1   0.0346  2.15e-15 0.0000000145
##  2  0.15   5.2 0.180   6.42e-11 0.00226     
##  3  0.25   8   0.277   2.39e- 9 0.129       
##  4  0.35   7.2 0.249   9.80e- 9 0.477       
##  5  0.45   4.6 0.159   1.07e- 8 0.334       
##  6  0.55   2.1 0.0727  3.94e- 9 0.0559      
##  7  0.65   0.7 0.0242  4.44e-10 0.00210     
##  8  0.75   0.1 0.00346 9.83e-12 0.00000664  
##  9  0.85   0   0       1.10e-14 0           
## 10  0.95   0   0       8.68e-22 0
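As a check on the hand computation, the LearnBayes package loaded at the top provides pdisc(), which takes the support points, the prior probabilities, and a vector of (successes, failures), and returns the normalized posterior directly. It should match the posterior column above up to rounding.

pdisc(sleep8$p, sleep8$prior, c(11, 16))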

Comparing the prior distribution and the posterior distribution:

sleep8 |> 
  select(p, prior, posterior) |> 
  pivot_longer(cols = c(prior, posterior), names_to = "Distribution", values_to = "Probability") |> 
  mutate(Distribution = factor(Distribution, levels = c("prior", "posterior"))) |>
  ggplot() + 
  geom_segment(
    mapping = aes(x = p, xend=p, y = Probability, yend = 0), 
    linewidth = 2
  ) + 
  facet_wrap(~Distribution, nrow = 2)

Why is the posterior shifted to the right relative to the prior? The prior distribution is centered around 0.3, while the data suggest a larger proportion.
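For a discrete distribution, the mean is the probability-weighted sum of the support points, which is what the next chunk computes:

\[E(p) = \sum_i p_i \, \pi(p_i)\]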

sleep8 |>
  summarize(
    sum(p * prior)
  )
## # A tibble: 1 × 1
##   `sum(p * prior)`
##              <dbl>
## 1            0.315

The estimate of \(p\) based on the data alone is \(11/27\):

11/27
## [1] 0.4074074

The posterior mean is a compromise between the prior mean and the data estimate:

sleep8 |>
  summarize(
    sum(p * posterior)
  )
## # A tibble: 1 × 1
##   `sum(p * posterior)`
##                  <dbl>
## 1                0.382

Why is the posterior more concentrated on a few values of \(p\)? The posterior is based on more information than the prior, because it also incorporates the data through the likelihood.