1 Initialize notebook

Load required packages.

options(repos=c(getOption('repos'), rethinking='http://xcelab.net/R'))
library( tidyverse )
library( tidybayes )
library( tidyr )
library( rethinking )
library( gridExtra )

Set up the workspace, i.e., remove all existing data from working memory, initialize the random number generator, turn off scientific notation for large numbers, and set a standard theme for plotting.

knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
rm(list=ls())
set.seed(42)
options(scipen=9999)

2 The Need for Bayesian Statistics

2.1 Scenario: Estimating the Average Height of Adult Males

You are tasked with estimating the average height of adult males in a country. You collect a small sample of 3 adult males and measure their heights:

  • 195 cm, 192 cm, and 189 cm.

The goal is to estimate the population’s average height based on this small sample.

2.2 Frequentist Approach

Using the frequentist approach, you estimate the population mean by calculating the sample mean.

2.2.1 Step 1: Compute the Sample Mean
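The hidden chunk presumably just takes the mean of the three observed heights; a minimal sketch:

heights <- c(195, 192, 189)  # observed sample of adult male heights (cm)
mean(heights)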

## [1] 192

The frequentist estimate of the population mean is 192 cm.

2.2.2 Why This Might Be Misleading

  • Small sample size: The sample consists of only 3 individuals, which may not be representative of the whole population.

  • No prior knowledge: The frequentist approach does not incorporate any prior knowledge about human height (e.g., that the global average is around 175 cm). It treats this small sample as fully representative of the population.

Because the sample size is very small, this estimate could easily be biased. It may not provide a realistic estimate of the true average height of the population.

2.3 Bayesian Approach

In the Bayesian approach, we can incorporate prior knowledge (such as global height averages) along with the observed data to get a more reasonable estimate.

2.3.1 Step 1: Define a Prior

Let’s start by defining a prior distribution for the average height of adult males. Based on global data, we assume that the average height is approximately 175 cm with a standard deviation of 5 cm.

This prior reflects our belief that the average height is likely around 175 cm, but with some uncertainty.

2.3.2 Step 2: Observe the Data (Likelihood)

The data we have are the heights of 3 individuals (195 cm, 192 cm, and 189 cm). This gives us the likelihood of observing the data, which we will use to update our prior belief.

2.3.3 Step 3: Compute the Posterior

Using Bayes’ Theorem, we can now update our prior belief based on the observed data. The posterior distribution combines our prior and the sample data to give us an updated estimate of the population mean.
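The numbers below can be reproduced with a conjugate normal-normal update. Note one assumption made here to reproduce the output (the hidden chunk may differ): the measurement standard deviation is treated as known and equal to 5 cm.

heights    <- c(195, 192, 189)  # observed sample (cm)
prior_mean <- 175               # prior mean (global average)
prior_sd   <- 5                 # prior standard deviation
data_sd    <- 5                 # ASSUMED known measurement sd
n          <- length(heights)

# Conjugate normal-normal update for the mean
post_var  <- 1 / (1 / prior_sd^2 + n / data_sd^2)
post_mean <- post_var * (prior_mean / prior_sd^2 + n * mean(heights) / data_sd^2)
post_mean        # posterior mean
sqrt(post_var)   # posterior standard deviation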

## [1] 187.75
## [1] 2.5

The posterior mean is approximately 187.75 cm, and the posterior standard deviation is around 2.5 cm.

2.3.4 Step 4: Interpret the Posterior

The posterior mean of 187.75 cm gives us a more reasonable estimate than the frequentist mean of 192 cm. It accounts for both the prior knowledge about typical human heights and the new data we collected. Additionally, the posterior standard deviation of 2.5 cm quantifies our uncertainty around this estimate.

2.4 Why Bayesian Statistics Can Be Superior to Frequentist Statistics: 3 Key Reasons

2.4.0.1 1. Incorporation of Prior Knowledge

  • Bayesian Approach: Bayesian statistics allows for the incorporation of prior knowledge or beliefs through the use of prior distributions. This is particularly useful when there is historical data or expert opinion available, allowing for a more informed and realistic model.
  • Frequentist Approach: Frequentist statistics relies solely on the current dataset, without accounting for prior information. This can be limiting in situations where previous knowledge would improve decision-making or model performance.

2.4.0.2 2. Probabilistic Interpretation

  • Bayesian Approach: Bayesian statistics provides probabilistic statements about parameters or hypotheses, making it more intuitive and easier to interpret. For example, instead of simply rejecting or failing to reject a hypothesis, Bayesian methods give a probability that a parameter lies within a particular range.
  • Frequentist Approach: Frequentist statistics frames conclusions in terms of long-run frequencies (e.g., confidence intervals), which are often misunderstood. The frequentist approach doesn’t allow for direct probability statements about parameters.

2.4.0.3 3. Handling of Complex Models

  • Bayesian Approach: Bayesian methods are powerful for modeling complex or hierarchical structures. Through the use of computational techniques like Markov Chain Monte Carlo (MCMC), Bayesian statistics can effectively model multi-level data structures and quantify uncertainty.
  • Frequentist Approach: Frequentist methods can struggle with complex models, especially when it comes to incorporating uncertainty in a flexible way. They rely on asymptotic properties and approximations, which may not be robust for small sample sizes or intricate models.

2.5 How Bayes Works: Prior, Likelihood, and Posterior

  • Prior Probability
    Bayesian methods start with a prior – an initial belief or distribution about the parameter before seeing the data.

  • Likelihood
    The likelihood function measures how well the data fits different parameter values, given a specific model.

  • Posterior Probability
    Using Bayes’ Theorem, we combine the prior and the likelihood to compute the posterior, the updated belief about the parameter after observing the data:

\[P(\theta|D) = \frac{P(D|\theta) P(\theta)}{P(D)}\]

This allows us to continuously update our understanding of the parameter as new data arrives.

3 Extreme Case Example: Rare Disease Testing

3.1 Problem: Rare Disease Testing

  • Scenario:
    A rare disease affects 0.1% of the population. A diagnostic test for the disease has:
    • Sensitivity (true positive rate): 99%
    • Specificity (true negative rate): 95%
  • You test positive: What is the probability that you actually have the disease?

3.2 Frequentist Approach

  • Frequentist Thinking:
    The test has a 99% sensitivity, so a positive result strongly suggests the presence of the disease.

  • Frequentist Limitation:
    Frequentist methods cannot formally account for how rare the disease is (i.e., the base rate of 0.1%); they simply focus on the test’s properties (sensitivity and specificity).

3.3 Bayesian Approach

  • Bayes’ Theorem Applied:
    We can combine the test’s sensitivity, specificity, and the prior probability (0.1%) to compute the posterior probability of having the disease given a positive result:

\[P(\text{Disease} | \text{Positive}) = \frac{P(\text{Positive} | \text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive})}\]

Where:

The prior probability is

\[P(\text{Disease}) = 0.001\]

The likelihood is:

\[P(\text{Positive} | \text{Disease}) = 0.99\]

The standardizing factor accounts for both true positives and false positives:

\[P(\text{Positive}) = P(\text{Positive} | \text{Disease}) \cdot P(\text{Disease})+P(\text{Positive} | \text{No Disease}) \cdot P(\text{No Disease})\]

Plugging in the values:

\[P(\text{Positive}) = (0.99 \cdot 0.001) + (0.05 \cdot 0.999) = 0.00099 + 0.04995 = 0.05094\]

  • Bayesian Solution:

\[P(\text{Disease} | \text{Positive}) = \frac{0.99 \cdot 0.001}{0.05094} = 0.0194\]

After applying Bayes’ theorem, the probability of actually having the disease is only 1.94%, despite testing positive!
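A quick sketch of this calculation in base R, mirroring the formulas above:

prior       <- 0.001  # P(Disease)
sensitivity <- 0.99   # P(Positive | Disease)
specificity <- 0.95   # P(Negative | No Disease)

# Standardizing factor: true positives + false positives
p_positive <- sensitivity * prior + (1 - specificity) * (1 - prior)
sensitivity * prior / p_positive  # ~ 0.0194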

3.4 Conclusion: Why Frequentist Methods Fall Short

  • Lack of Prior Knowledge
    Frequentist methods do not incorporate prior information or expert knowledge. Every analysis starts “from scratch,” which can be inefficient, especially in cases where historical data or prior beliefs could improve estimates.

  • P-Value Misinterpretation
    Frequentist inference relies heavily on p-values for hypothesis testing. However, p-values are often misinterpreted. They only tell us how extreme the data is under a null hypothesis but do not provide the probability that a hypothesis is true or false.

  • Confidence Intervals
    Frequentist confidence intervals are often misunderstood. A 95% confidence interval does not mean there is a 95% chance the true parameter lies within the interval; rather, it means that in 95% of repeated samples, the interval will contain the true parameter, assuming the model is correct.

  • Limited in Small Sample Sizes
    Frequentist methods tend to require large sample sizes to provide reliable estimates. With small datasets, the conclusions can be unstable, whereas Bayesian methods often provide more robust estimates by incorporating prior information.

4 Concept of the day

4.1 Example: Penalty kicks (Bayesian updating)

In 1990, Andreas Brehme scored the game-deciding penalty kick in the 85th minute of the World Cup final.

Let’s assume we want to study the proportion of goals scored in penalty kicks (i.e., the probability of a goal).

From previous trials, we have observed the following sequence: G M G G G M G M G, i.e., 6 out of 9 (G = goal; M = miss).

How plausible is it to observe such a sequence?

4.2 Concept of the day: The Forking Garden

Simple example (4 penalty trials): We want to know a player’s true performance (goal probability) after observing a sequence of trials. How plausible is it that this player scores 0, 1, 2, 3, or 4 goals out of 4 trials?

The Forking Garden
Here, a player has 4 penalty trials. After 3 trials, we observe 2 goals and 1 miss.

(Simplified assumption: penalty trials are independent, “drawing with replacement”)

What is the most likely conjecture (1-5) given the observed data? (blue = goal; white = miss)

Let’s start with conjecture 2.

Each circle marks a trial.

For conjecture 2, we count the paths/ways being consistent with the data. Here, there are 3 ways.

(Given the observed sequence, it is impossible that a player missed 3 penalties.)

Now, we do this for all conjectures and count the ways each could have produced the data (multiplying the counts across trials). That’s it.

Finally, we can calculate the plausibility of each event. Remember, plausibility is probability: set of non-negative real numbers that sum to one.

  • What we do here: we calculate how plausible a certain probability is (be careful!).
  • Plausibility sums to 1 (relative ways to produce data).
  • This means standardizing, e.g., 3 / 20 ways = 0.15
  • Interpretation: Given the observed sequence (G M G), the plausibility represents how plausible each event (“way”) is.

Belief updating: When a player scores another goal in a penalty trial (3 out of 4), we can simply update our belief.

Bayesian Data Analysis
For each possible explanation of the sample, count all the ways the sample could happen. Explanations with more ways to produce the sample are more plausible.

5 Building a Model

5.1 Create Penalty Data

We create a dataset (tibble) with penalty trials, now with 9 trials, where “G” = goal and “M” = miss.

We use the sequence of penalty trials to decide among different possible proportions of goals/misses. Each proportion may be more or less plausible, given the evidence (G M G G G M G M G).
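A minimal sketch of the hidden chunk that builds this tibble:

d <- tibble(trial = c("G", "M", "G", "G", "G", "M", "G", "M", "G"))
d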

## # A tibble: 9 × 1
##   trial
##   <chr>
## 1 G    
## 2 M    
## 3 G    
## 4 G    
## 5 G    
## 6 M    
## 7 G    
## 8 M    
## 9 G

5.2 Bayesian Updating

Let’s first have a look at belief updating in the context of a penalty shootout.

Therefore, we create a figure (Figure 2.5 from the book; don’t worry, you don’t need to understand the code yet).

(The dashed line represents t−1; the solid line t.)

Recall the sequence: G M G G G M G M G

In each new penalty trial, the plausibility of the previous trial becomes the prior of the current trial.

We initially (n=0; no penalty trial) assign the same plausibility to each possible value of p, i.e., goal probability.

For trial:

  • n = 1 (G): p = 0 is impossible
  • n = 2 (M): p = 0.5 is most plausible
  • n = 3 (G): p > 0.5 is more plausible

Each new sample updates our prior belief:

  • Data order is NOT relevant
  • Every posterior is a prior for the next observation
  • Every prior is the posterior of some other inference
  • Sample size is embodied in the posterior

6 Components of a model

Let’s build a model. How?

  1. Design the model (data story: how is the data distributed?), e.g., height = normal; penalty (yes/no) = binomial; number of goals = Poisson.
  2. Condition on data (update).
  3. Evaluate the model (critique).

We need:

  1. Variables: observed (such as the counts of G and M) and unobserved, such as p (which we usually call parameters; they need to be inferred from other variables).
  2. Likelihood: a function assigned to an observed variable.

6.1 Likelihood

What’s the relative number of ways to see goals (G) for given N=3 and p?

Sample sequence: G M G

It is the probability of scoring (p) x not scoring (1-p) x scoring (p)

\[ p \cdot (1-p) \cdot p \]

More general:

\[ p^G \cdot (1-p)^M \]

Now, we normalize this to probability.

We define how plausible any combination of G and M would be, for a specific value of p.

Here, we use the binomial distribution: The counts of “goal” G and “miss” M are distributed binomially, with probability p of “goal” on each trial (N).

\[ \text{Posterior probability of}\ p = \frac{(G+M+1)!}{G!M!} p^G (1 - p)^{M} \]

This term makes sure that the probabilities add up to 1:

\[ \frac{(G+M+1)!}{G!M!} \]

The binomial distribution is implemented in R. Given a probability of .5 (i.e., goals and misses are equally likely), the binomial likelihood of scoring 6 goals out of 9 trials is:

The arguments are G (number of goals), N (number of trials), and p (goal probability); in R’s dbinom(), these are x, size, and prob.
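So the call that produces the output below is presumably:

dbinom(x = 6, size = 9, prob = 0.5)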

## [1] 0.1640625

Interpretation: Given a true goal probability of 50%, the likelihood of scoring 6 goals out of 9 trials is 0.164 (see output above).

Recall, the sequence of events doesn’t matter!

Exercise
> How does the likelihood change when we assume that the true probability of goals is higher (say, 60%)?
> How does the likelihood change when we assume that Andreas Brehme scored in all of his last 9 trials?
> What is the likelihood if Brehme scored in all of his trials and the probability of scoring a goal is 100%?
> What is the likelihood if Brehme scored in 8 of his 9 trials and the probability of scoring a goal is 100%?

Now let’s plot the binomial likelihood. Which probability values are most plausible?
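A sketch of how such a plot can be built: the grid of probabilities matches the output below; the plotting details are an assumption.

d <- tibble(probabilities = seq(from = 0, to = 1, by = 0.01))
d
d %>%
  mutate(likelihood = dbinom(x = 6, size = 9, prob = probabilities)) %>%
  ggplot(aes(x = probabilities, y = likelihood)) +
  geom_line()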

## # A tibble: 101 × 1
##    probabilities
##            <dbl>
##  1          0   
##  2          0.01
##  3          0.02
##  4          0.03
##  5          0.04
##  6          0.05
##  7          0.06
##  8          0.07
##  9          0.08
## 10          0.09
## # ℹ 91 more rows

We see, a goal probability of around .65 is most plausible.

Exercise
> Change the number of goals (x) and see what happens to the distribution.
> Change the number of trials (size) and see what happens to the distribution.

6.2 The joint model

The observed variables G and M are given relative counts through the binomial distribution.

\[ G \sim Binomial(N,p) \\ p \sim Uniform(0,1) \]

How to read it:

  • First line: Scoring a goal follows a binomial distribution with a given N (number of trials) and an unknown probability p.
  • Second line: The probability p has a uniform prior distribution (each value is equally plausible).

N = G + M. The probability p is the unobserved parameter.

Excursus: The distribution zoo: https://ben18785.shinyapps.io/distribution-zoo/

7 Running a model

7.1 Bayes Theorem

  • A Bayesian “estimate” is always the posterior distribution over parameters: Pr(parameters|data)
  • Here: Pr(p|G,N)
  • Bayes theorem is essentially “likelihood of data x prior”:

\[ \text{Posterior distribution}=\frac{(\text{Probability of the observed data}) \cdot (\text{Prior})}{\text{Average probability of the data}} \]

(The average probability of the data is often called the “evidence” or the “average likelihood”; we’ll get a sense of what that means as we go along.)

Using Bayes’ theorem in the context of scoring goals:

\[ \operatorname{Pr}(p | G, M) = \frac{\operatorname{Pr}(G, M | p) \cdot \operatorname{Pr}(p )}{\operatorname{Pr}(G, M)}. \]

  • Posterior: is the plausibility (Pr) of a goal probability given G and M.
  • Likelihood: is the plausibility (Pr) of an observed goal/miss sequence, given p.
  • Prior: is the plausibility (Pr) of a goal probability before observing the data.

Let’s use the formula to explain the figure from above (solid lines = posterior; sequence: G M G G G M G M G). The posterior identifies the most plausible values.

A note on different distributions
Let’s check different combinations of priors and likelihoods. (Creating Figure 2.6: prior, likelihood, and posterior.)

The posterior distribution is a product of the prior distribution and the likelihood:

  • Top: a flat prior is proportional to the likelihood.
  • Middle: a step prior, assigning zero probability to all values below 0.5, results in a truncated posterior.
  • Bottom: a peaked prior shifts and skews the posterior.

7.2 Producing a posterior

We condition the prior on the data. However, there are different ways to do it (i.e., different conditioning techniques):

  • Grid approximation
  • Quadratic approximation
  • Markov chain Monte Carlo (MCMC)

7.3 Grid approximation

Here, we consider only a finite grid of parameter values.

How to do it (see the code example below):

  1. Define the grid: how many points to use in estimating the posterior.
  2. Compute the value of the prior at each parameter value on the grid.
  3. Compute the likelihood at each parameter value.
  4. Compute the unstandardized posterior at each parameter value, by multiplying the prior by the likelihood.
  5. Standardize the posterior, by dividing each value by the sum of all values.

Here, we use grid approximation for our penalty example:

  • p_grid = goal probability
  • prior = prior probability (here, ground truth) of scoring a penalty
  • likelihood = plausibility of seeing such a sequence

\[ \operatorname{Posterior} = \frac{\operatorname{Likelihood} \cdot \operatorname{Prior}}{\operatorname{Standardizing factor}} = \frac{\operatorname{Pr}(G, M | p) \cdot \operatorname{Pr}(p )}{\operatorname{Pr}(G, M)}. \]
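A sketch of the five steps for our penalty data (6 goals in 9 trials); a 10-point grid with a flat prior of 0.5 reproduces the table below:

d <- tibble(p_grid = seq(from = 0, to = 1, length.out = 10),       # 1. define the grid
            prior  = 0.5) %>%                                      # 2. prior at each value
  mutate(likelihood      = dbinom(6, size = 9, prob = p_grid),     # 3. likelihood
         unstd_posterior = likelihood * prior,                     # 4. prior x likelihood
         posterior       = unstd_posterior / sum(unstd_posterior)) # 5. standardize
d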

## # A tibble: 10 × 5
##    p_grid prior likelihood unstd_posterior posterior
##     <dbl> <dbl>      <dbl>           <dbl>     <dbl>
##  1  0       0.5   0              0          0       
##  2  0.111   0.5   0.000111       0.0000555  0.000123
##  3  0.222   0.5   0.00476        0.00238    0.00528 
##  4  0.333   0.5   0.0341         0.0171     0.0379  
##  5  0.444   0.5   0.111          0.0555     0.123   
##  6  0.556   0.5   0.217          0.108      0.241   
##  7  0.667   0.5   0.273          0.137      0.303   
##  8  0.778   0.5   0.204          0.102      0.227   
##  9  0.889   0.5   0.0568         0.0284     0.0631  
## 10  1       0.5   0              0          0

(The standardizing factor makes sure that the posterior adds up to 1. The table shows how plausible each value of p in the grid is.)

Plot this data.

Exercise:
> What happens if you create a sparser grid (say, 5 points)?
> What happens if you create a denser grid (say, 1,000 points)?

7.4 Quadratic approximation

Quadratic approximation assumes that the shape of the posterior distribution near the peak is Gaussian (normal). This is convenient, as a Gaussian distribution can be described by its location (i.e., the mean) and its spread (i.e., the variance).

  1. Find the posterior mode. This is usually done by an optimization algorithm that tries to find the peak (checking the slope).
  2. Estimate the curvature near the peak.
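With the rethinking package, both steps are handled by quap(); a sketch for our data (6 goals in 9 trials, uniform prior), which reproduces the precis() summary below:

qa <- quap(
  alist(
    G ~ dbinom(9, p),  # binomial likelihood, N = 9 trials
    p ~ dunif(0, 1)    # uniform prior on the goal probability
  ),
  data = list(G = 6)
)
precis(qa)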
##        mean        sd      5.5%     94.5%
## p 0.6666664 0.1571338 0.4155362 0.9177966

Interpretation: assuming the posterior is Gaussian, it is maximized at 0.67, and its standard deviation is 0.16.

Now, let’s plot the posterior using different N’s. (Comparing grid posterior – black – and quadratic approximation posterior – grey)

Interpretation: with more observations we are more certain in our posterior.

8 Let’s experiment!

Suppose the penalty kick data had turned out to be 8 goals in 15 trials. Construct the posterior distribution, using grid approximation. Use the same flat prior as before.

8.1 Playground
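A possible solution, following the same recipe as above, now with 8 goals in 15 trials on a 20-point grid:

d <- tibble(p_grid = seq(from = 0, to = 1, length.out = 20),
            prior  = 0.5) %>%
  mutate(likelihood      = dbinom(8, size = 15, prob = p_grid),
         unstd_posterior = likelihood * prior,
         posterior       = unstd_posterior / sum(unstd_posterior))
d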

## # A tibble: 20 × 5
##    p_grid prior  likelihood unstd_posterior   posterior
##     <dbl> <dbl>       <dbl>           <dbl>       <dbl>
##  1 0        0.5 0               0           0          
##  2 0.0526   0.5 0.000000260     0.000000130 0.000000219
##  3 0.105    0.5 0.0000445       0.0000223   0.0000375  
##  4 0.158    0.5 0.000747        0.000373    0.000629   
##  5 0.211    0.5 0.00475         0.00237     0.00400    
##  6 0.263    0.5 0.0175          0.00873     0.0147     
##  7 0.316    0.5 0.0447          0.0223      0.0376     
##  8 0.368    0.5 0.0876          0.0438      0.0737     
##  9 0.421    0.5 0.139           0.0693      0.117      
## 10 0.474    0.5 0.182           0.0912      0.154      
## 11 0.526    0.5 0.203           0.101       0.171      
## 12 0.579    0.5 0.191           0.0953      0.160      
## 13 0.632    0.5 0.150           0.0751      0.126      
## 14 0.684    0.5 0.0968          0.0484      0.0815     
## 15 0.737    0.5 0.0489          0.0244      0.0412     
## 16 0.789    0.5 0.0178          0.00890     0.0150     
## 17 0.842    0.5 0.00398         0.00199     0.00335    
## 18 0.895    0.5 0.000378        0.000189    0.000319   
## 19 0.947    0.5 0.00000467      0.00000234  0.00000393 
## 20 1        0.5 0               0           0

9 Sampling

9.1 Sampling from the posterior

Sampling: draw samples (i.e., observations) from a distribution/bucket (with replacement).

Why? In many situations, working with the full posterior is infeasible.

Here, we work with samples from the posterior, which help us to (1) visualize uncertainty, (2) compute credible intervals, and (3) simulate observations. From a practical perspective, it is often easier to work with samples.

Note: some procedures to compute the posterior, like MCMC, produce nothing but samples.

Procedure:

  1. Compute or approximate the posterior.
  2. Sample with replacement from the posterior.
  3. Compute quantities of interest from the samples.

Here we use grid approximation to generate samples for our penalty example.
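A sketch of the grid (1,000 points, flat prior), matching the output below:

d <- tibble(p_grid = seq(from = 0, to = 1, length.out = 1000),
            prior  = 0.5) %>%
  mutate(likelihood = dbinom(6, size = 9, prob = p_grid),
         posterior  = likelihood * prior / sum(likelihood * prior))
d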

## # A tibble: 1,000 × 4
##     p_grid prior likelihood posterior
##      <dbl> <dbl>      <dbl>     <dbl>
##  1 0         0.5   0         0       
##  2 0.00100   0.5   8.43e-17  8.43e-19
##  3 0.00200   0.5   5.38e-15  5.38e-17
##  4 0.00300   0.5   6.11e-14  6.11e-16
##  5 0.00400   0.5   3.42e-13  3.42e-15
##  6 0.00501   0.5   1.30e-12  1.30e-14
##  7 0.00601   0.5   3.87e-12  3.88e-14
##  8 0.00701   0.5   9.73e-12  9.74e-14
##  9 0.00801   0.5   2.16e-11  2.16e-13
## 10 0.00901   0.5   4.37e-11  4.38e-13
## # ℹ 990 more rows

Let’s create some samples from the posterior. It’s like drawing from a bucket, where some values are more likely (i.e., see code: weight = posterior).
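A sketch of the sampling step, using dplyr’s sample_n() with weight = posterior (10,000 draws with replacement):

samples <- d %>%
  sample_n(size = 1e4, weight = posterior, replace = TRUE)
glimpse(samples)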

## Rows: 10,000
## Columns: 4
## $ p_grid     <dbl> 0.5805806, 0.7537538, 0.4704705, 0.6726727, 0.6016016, 0.66…
## $ prior      <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,…
## $ likelihood <dbl> 0.23735584, 0.23001972, 0.13525211, 0.27292844, 0.25181752,…
## $ posterior  <dbl> 0.0023759343, 0.0023024997, 0.0013538750, 0.0027320164, 0.0…

Now, we plot the samples:

  • The first plot shows the goal probability for each sample.
  • The second plot rotates the first plot by 90 degrees and shows the density.

The estimated density is very similar to the ideal posterior that we computed via grid approximation. It is a crudely replicated version of the posterior density.

9.2 Summarize the samples

“Once your model produces a posterior distribution, the model’s work is done. But your work has just begun. It is necessary to summarize and interpret the posterior distribution. Exactly how it is summarized depends upon your purpose” (p. 53).

Some sample questions (summary tasks):

  • How much posterior probability lies below/above/between specified parameter values?
  • Which parameter values contain 50%/80%/95% of the posterior probability (“confidence” intervals)?
  • Which parameter value maximizes the posterior probability? (point estimates)

9.2.1 Intervals

Let’s start.

What is the posterior probability that the goal probability is less than 50%?

(Here you find more information on “summarise”: https://r4ds.had.co.nz/transform.html#grouped-summaries-with-summarise)
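One way to answer this with the samples, counting the share of draws below 0.5:

samples %>%
  filter(p_grid < 0.5) %>%
  summarise(sum = n() / 1e4)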

## # A tibble: 1 × 1
##     sum
##   <dbl>
## 1 0.172

What is the posterior probability that the goal probability is more than 50% but less than 75%?
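Analogously, filtering for the interval:

samples %>%
  filter(p_grid > 0.5 & p_grid < 0.75) %>%
  summarise(sum = n() / 1e4)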

## # A tibble: 1 × 1
##     sum
##   <dbl>
## 1 0.604

Next, we plot the previous “questions”.

Here, we plot the lower and middle 80% quantile.

What is the value for the middle 80% quantile of goal probability?
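With tidybayes, a sketch:

samples %>%
  median_qi(p_grid, .width = 0.8)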

## # A tibble: 1 × 6
##   p_grid .lower .upper .width .point .interval
##    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1  0.646  0.450  0.810    0.8 median qi

Let’s introduce two common types of intervals:

  • Percentile intervals (PI): equal area in each tail.
  • Highest posterior density interval (HPDI): the narrowest interval containing the specified mass (always includes the highest point).

Let’s compare PI and HPDI.

Percentile intervals: What are the values for the middle 50%, 80%, and 99% quantiles of goal probability?

## # A tibble: 3 × 6
##   p_grid .lower .upper .width .point .interval
##    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1  0.646  0.544  0.739   0.5  median qi       
## 2  0.646  0.450  0.810   0.8  median qi       
## 3  0.646  0.273  0.918   0.99 median qi

HPDI: What is the 50% highest posterior density interval (HPDI) for goal probability?
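Again with tidybayes, presumably:

samples %>%
  mode_hdi(p_grid, .width = 0.5)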

## # A tibble: 1 × 6
##   p_grid .lower .upper .width .point .interval
##    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1  0.672  0.571  0.764    0.5 mode   hdi

9.2.2 Point estimates

Here, we are interested in values for central tendency (mean, mode, median).

In tidybayes, this is straightforward, using the syntax tendency_interval(), e.g., mean_qi(), median_qi(), or mode_hdi().

For example, this is how we can get the mean.
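A sketch; the bare number in the last output line is presumably the mode:

samples %>% mean_hdi(p_grid)  # mean with 95% HPDI
samples %>% mean_qi(p_grid)   # mean with 95% percentile interval
Mode(samples$p_grid)          # mode (assumption: tidybayes::Mode)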

## # A tibble: 1 × 6
##   p_grid .lower .upper .width .point .interval
##    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1  0.636  0.363  0.883   0.95 mean   hdi
## # A tibble: 1 × 6
##   p_grid .lower .upper .width .point .interval
##    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1  0.636  0.349  0.874   0.95 mean   qi
## [1] 0.6724069

Let’s check mean and median.
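Presumably via a simple summarise():

samples %>%
  summarise(mean = mean(p_grid), median = median(p_grid))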

## # A tibble: 1 × 2
##    mean median
##   <dbl>  <dbl>
## 1 0.636  0.646

9.3 Simulate observations (Predictive checks)

Why?

  • Model design: sample from prior (not from posterior)
  • Model checking: investigate model behavior
  • Software validation: make sure that there is no software error
  • Research design: simulate observations from your hypothesis
  • Forecasting: simulate new predictions

First, we create some dummy data from the binomial likelihood, with 2 trials (we use dbinom).

Given a true goal probability of 0.7 and 2 trials, what’s the likelihood of seeing 0, 1, or 2 goals?

\[\operatorname{Pr} (G|n, p) = \frac{n!}{G!(n - G)!} p^G (1 - p)^{n - G}\]

Letting n = 2, p(G) = .7, and G_observed = 0 through 2, the densities are:
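A sketch reproducing this table with dbinom():

tibble(n           = 2,
       probability = 0.7,
       G_observed  = 0:2) %>%
  mutate(likelihood = dbinom(G_observed, size = n, prob = probability))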

## # A tibble: 3 × 4
##       n probability G_observed likelihood
##   <dbl>       <dbl>      <int>      <dbl>
## 1     2         0.7          0       0.09
## 2     2         0.7          1       0.42
## 3     2         0.7          2       0.49

There is a 9% chance of seeing G = 0, a 42% chance of seeing G = 1, and a 49% chance of seeing G = 2.

Now, let’s sample from this distribution. We could also use n_samples from above, but R has a built-in sampling function (rbinom).

We generate a number of random dummy observations, which we expect to appear in proportion to their likelihoods.
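One way to do this (draws, counts, and proportions; the exact grouping in the hidden chunk may differ):

tibble(draws = rbinom(1e4, size = 2, prob = 0.7)) %>%  # 10,000 dummy observations
  count(draws) %>%                                     # count each outcome
  mutate(goal_probability = n / 1e4)                   # convert to proportions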

## # A tibble: 3 × 3
## # Groups:   draws [3]
##   draws     n goal_probability
##   <int> <int>            <dbl>
## 1     0    85            0.085
## 2     1   420            0.42 
## 3     2   495            0.495

Let’s plot n = 10 penalty trials (we assume that the true goal probability is 0.7).

Exercise:
> How does the distribution change if you change n, size, or prob?

10 Let’s experiment

  1. Suppose the penalty kick data had turned out to be 4 goals in 15 trials. Construct the posterior distribution, using grid approximation. Use the same flat prior as before. (you can copy the code from above)

  2. Draw 10,000 samples from the grid approximation from above. Then use the samples to calculate the 90% HPDI for p.

  3. Construct a posterior predictive check for this model and data.

10.1 Playground

  1. Create the dataset
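A possible solution (4 goals in 15 trials, flat prior of 1, 20-point grid):

d <- tibble(p_grid = seq(from = 0, to = 1, length.out = 20),
            prior  = 1) %>%
  mutate(likelihood      = dbinom(4, size = 15, prob = p_grid),
         unstd_posterior = likelihood * prior,
         posterior       = unstd_posterior / sum(unstd_posterior))
d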
## # A tibble: 20 × 5
##    p_grid prior likelihood unstd_posterior posterior
##     <dbl> <dbl>      <dbl>           <dbl>     <dbl>
##  1 0          1   0               0         0       
##  2 0.0526     1   5.78e- 3        5.78e- 3  4.87e- 3
##  3 0.105      1   4.93e- 2        4.93e- 2  4.15e- 2
##  4 0.158      1   1.28e- 1        1.28e- 1  1.08e- 1
##  5 0.211      1   1.99e- 1        1.99e- 1  1.68e- 1
##  6 0.263      1   2.28e- 1        2.28e- 1  1.92e- 1
##  7 0.316      1   2.09e- 1        2.09e- 1  1.76e- 1
##  8 0.368      1   1.60e- 1        1.60e- 1  1.35e- 1
##  9 0.421      1   1.05e- 1        1.05e- 1  8.85e- 2
## 10 0.474      1   5.90e- 2        5.90e- 2  4.97e- 2
## 11 0.526      1   2.82e- 2        2.82e- 2  2.38e- 2
## 12 0.579      1   1.13e- 2        1.13e- 2  9.52e- 3
## 13 0.632      1   3.69e- 3        3.69e- 3  3.10e- 3
## 14 0.684      1   9.32e- 4        9.32e- 4  7.85e- 4
## 15 0.737      1   1.69e- 4        1.69e- 4  1.42e- 4
## 16 0.789      1   1.91e- 5        1.91e- 5  1.61e- 5
## 17 0.842      1   1.04e- 6        1.04e- 6  8.79e- 7
## 18 0.895      1   1.54e- 8        1.54e- 8  1.30e- 8
## 19 0.947      1   9.44e-12        9.44e-12  7.95e-12
## 20 1          1   0               0         0

  1. Do the grid approximation, draw samples, & 90% HPDI
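A possible solution: a 10,000-point grid, then 10,000 weighted samples, then the 90% HPDI via tidybayes:

d <- tibble(p_grid = seq(from = 0, to = 1, length.out = 1e4),
            prior  = 1) %>%
  mutate(likelihood = dbinom(4, size = 15, prob = p_grid),
         posterior  = likelihood * prior / sum(likelihood * prior))
d
samples <- d %>% sample_n(size = 1e4, weight = posterior, replace = TRUE)
samples %>% mode_hdi(p_grid, .width = 0.9)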
## # A tibble: 10,000 × 4
##      p_grid prior likelihood posterior
##       <dbl> <dbl>      <dbl>     <dbl>
##  1 0            1   0         0       
##  2 0.000100     1   1.36e-13  2.18e-16
##  3 0.000200     1   2.18e-12  3.49e-15
##  4 0.000300     1   1.10e-11  1.76e-14
##  5 0.000400     1   3.48e-11  5.57e-14
##  6 0.000500     1   8.49e-11  1.36e-13
##  7 0.000600     1   1.76e-10  2.81e-13
##  8 0.000700     1   3.25e-10  5.21e-13
##  9 0.000800     1   5.54e-10  8.87e-13
## 10 0.000900     1   8.87e-10  1.42e-12
## # ℹ 9,990 more rows
## # A tibble: 1 × 6
##   p_grid .lower .upper .width .point .interval
##    <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1  0.275  0.124  0.475    0.9 mode   hdi
  1. Posterior predictive check