1 Bayes Theorem

Conditional probability: for two events \(A\) and \(B\), \[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(AB)}{P(B)}\] From the definition it follows that \[ P(AB) = P(B) P(A \mid B) = P(A) P(B \mid A)\]

and if \(\bar{A}\) is the complementary event to \(A\) (\(P(A \cup \bar{A}) = 1\)) then \[ P(A \mid B) = \frac{P(A) P(B \mid A)}{P(B)} = \frac{P(A) P(B \mid A)}{P(A) P(B \mid A) + P(\bar{A}) P(B \mid \bar{A})}\]

Fact: Bayes Theorem. If \(H_1, ..., H_k\) are such that \(P(H_1 \cup ... \cup H_k) = 1\) and \(H_i \cap H_j = \emptyset\) for \(i \ne j\), then \(P(H_i \mid B) = \frac{P(H_i) P(B \mid H_i)}{\sum_{j=1}^k P(H_j) P(B \mid H_j)}\)

1.1 Bayes Theorem in Statistics

\[ P(\theta \mid Data) = \frac{P(Data\mid \theta)P(\theta)}{P(Data)} \]

i.e.,

\[ Posterior \ Distribution \ = \ \frac{Likelihood \ \times \ Prior}{Evidence}\]

where the evidence is the marginal probability of the data, \(P(Data)\), from the denominator above.

1.2 Steps of Bayesian Statistics:

Steps of Bayesian Statistics Visualization
1. Specify the model \(M_{\theta}\)
2. Define the prior \(F_{\theta}(x)\)
3. From the \(Data\) and the model \(M_{\theta}\), find the likelihood
4. Find the denominator by integrating (summing) the numerator over all \(\theta\)
5. Find the posterior distribution \(F_{\theta \mid Data}\)

1.3 Computations in Bayesian analysis

  • The model in the Bayes Theorem (above) is a condition in all terms! Thus, we can rewrite \[ P(\theta \mid Data, Model) = \frac{P(Data\mid \theta, Model) P(\theta\mid Model)}{P(Data\mid Model)}\]
  • Numerator
    • The numerator in the Bayes theorem expression for statistical analysis consists of two terms:
      • \(P(Data\mid \theta, Model)\): the likelihood function. For each value of the parameter \(\theta\), it shows the probability of the observed data
      • \(P(\theta\mid Model)\): the marginal distribution of the parameter. It is known before the analysis, and even before the experiment is conducted \(\rightarrow\) the prior distribution

      Both terms can only be calculated after the model generating the data is specified. For example, in the mammography experiment the model is binomial: the data are generated by binomial distribution with parameter \(\theta\) = \(p\) (equal to probability of disease)

  • Denominator: the trickiest part to calculate is the denominator – handling it is the foundation of all modern Bayesian computational methods
    • \(P(Data\mid Model) = \sum_{i=1}^n P(\theta = \theta_i\mid Model) P(Data\mid\theta=\theta_i, Model)\), i.e. it is the sum (integral) of the numerator values over all possible values of the parameter \(\theta\)
    • Fortunately, this denominator is the same regardless of the choice of \(\theta\). This fact is reflected in the common proportionality notation \(P(\theta \mid Data, Model) \propto P(Data\mid \theta, Model) P(\theta \mid Model)\); see the sketch below
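
The sketch below is a minimal illustration of this point and is not part of the original notes; the 11-point grid, the uniform prior, and the data (7 successes in 10 trials) are assumptions chosen only for the example. It shows that normalizing the numerator over the grid gives the posterior, because the denominator is a single constant.

theta <- seq(0, 1, length=11)                  # grid of possible parameter values
prior <- rep(1/length(theta), length(theta))   # P(theta | Model): uniform prior over the grid
lik   <- dbinom(7, size=10, prob=theta)        # P(Data | theta, Model): 7 successes in 10 trials
pData <- sum(prior*lik)                        # denominator P(Data | Model): one number, same for all theta
posterior <- prior*lik/pData                   # Bayes theorem on the grid
sum(posterior)                                 # equals 1: the denominator only normalizes the numerator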

1.4 General Framework of Bayesian Analysis

  • Previously, the use of Bayesian analysis was limited to only a few possible ways of calculating the posterior distribution with special conjugate pairs of prior and likelihood, e.g. the conjugate families normal-normal, beta-binomial, gamma-Poisson, etc.
  • Since technology and efficient algorithms for sampling from posterior distributions were developed, the use of the Bayesian approach has expanded
    • The basic idea of sampling (sketched in R code after this list) is:
      • Simulate values of the parameter \(\theta\) from the prior distribution \(F_{\theta}(x) = P (\theta \le x)\)
      • Using the sample, calculate the likelihood and the numerator for each simulated \(\theta\). This makes a sample of the numerator
      • From the sample of the numerator estimate the posterior distribution
      • If new data arrive the posterior distribution is used as a new prior and the whole sequence of steps is repeated again
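
A toy version of this sampling scheme, not from the original notes: the binomial data (7 successes in 10 trials) and the uniform prior are assumptions used only for illustration.

set.seed(1)
thetaSim <- runif(5000)                        # simulate theta from the (uniform) prior
num <- dbinom(7, size=10, prob=thetaSim)       # likelihood for each simulated theta; since the prior
                                               # density is constant, this is proportional to the numerator
w <- num/sum(num)                              # normalized numerator values
postSim <- sample(thetaSim, 5000, replace=TRUE, prob=w)  # resample to estimate the posterior
hist(postSim, breaks=50, main="Estimated posterior of theta")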

2 Example: Mammography Screening Controversy

2.1 Problem

According to the CDC, the probability of breast cancer among women aged 50-60 years is 2.28%.

Screening using mammography has sensitivity (i.e. the probability of detecting the disease correctly) of 75%-94%. Specificity of the test (i.e. the probability of a correct “no disease” conclusion) is 83%-98%. Assume the most optimistic parameters of the test (94% and 98%, respectively, for sensitivity and specificity). What is the probability that a randomly selected woman with a positive test result has the disease?

\(\Longrightarrow\) Need to calculate: \(P(\text{Disease} \mid \text{Test Positive})\) or \(P(\theta_1 \mid \tau_1) = \frac{P(\tau_1 \mid \theta_1)P(\theta_1)}{P(\tau_1 \mid \theta_1)P(\theta_1) + P(\tau_1 \mid \theta_0)P(\theta_0)}\)

#Prior p(theta1)
p.theta1 <- .0228

#Sensitivity: p(tau1|theta1)
p.tau1.theta1 <- .94

#Specificity: p(tau0|theta0)
p.tau0.theta0 <- .98

#Complementary probabilities
p.theta0 <- 1-p.theta1
p.tau1.theta0 <- 1-p.tau0.theta0

#Bayes
(p.theta1.tau1 <- p.tau1.theta1*p.theta1/(p.tau1.theta1*p.theta1+p.tau1.theta0*p.theta0))
## [1] 0.5230379

Assume that the randomly selected woman tested positive and was then retested with a negative result.

After both tests what is the probability of disease?

\(\Longrightarrow\) Need to calculate: \(P(\text{Disease} \mid \text{Test Negative})\), where the prior \(P(\text{Disease})\) is now \(P(\text{Disease} \mid \text{Test Positive})\) from the previous calculation. That is, \(P(\theta_1^{\prime} \mid \tau_0) = \frac{P(\tau_0 \mid \theta_1^{\prime}) P(\theta_1^{\prime})}{P(\tau_0 \mid \theta_1^{\prime}) P(\theta_1^{\prime}) + P(\tau_0 \mid \theta_0^{\prime}) P(\theta_0^{\prime})}\), where \(P(\theta_1^{\prime}) = P(\theta_1 \mid \tau_1)\) and \(P(\theta_0^{\prime}) = 1 - P(\theta_1^{\prime})\)

# Set the prior to the new probability of having the disease:
pDisease <- p.theta1.tau1

# Bayes rule for second, negative test:
(p.theta1.tau0 <- ( (1.0-p.tau1.theta1) * pDisease /
   ( (1.0-p.tau1.theta1) * pDisease + (1.0-p.tau1.theta0) * (1.0-pDisease) ) ))
## [1] 0.06291489
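
The "posterior becomes the new prior" logic of the two calculations above can be wrapped into a small helper function. This is only an illustrative sketch; bayesUpdate() is not part of the original notes.

# Illustrative helper: one Bayes update for a binary diagnostic test.
# result = 1 means a positive test, result = 0 a negative test.
bayesUpdate <- function(prior, sens, spec, result) {
  pPositive <- sens*prior + (1-spec)*(1-prior)    # P(test positive)
  if (result == 1) {
    sens*prior/pPositive                          # P(disease | positive test)
  } else {
    (1-sens)*prior/(1-pPositive)                  # P(disease | negative test)
  }
}
(afterPositive <- bayesUpdate(.0228, .94, .98, 1))          # ~0.523, matches p.theta1.tau1
(afterNegative <- bayesUpdate(afterPositive, .94, .98, 0))  # ~0.063, matches p.theta1.tau0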

2.2 Comparison with FNP Statistical Approach

library(faraway)
data(babyfood)
babyfood

Select rows 1 and 3: boys bottle fed or breast fed.

(data <- babyfood[c(1,3),c(1:2,4)])

2.2.1 FNP-Approach

The simplest way to analyze this, as we learned, is to use the proportion test prop.test().

It means:

  • Creating two proportions:
    1. 77 disease cases out of 458 bottle-fed baby boys
    2. 47 disease cases out of 494 breast-fed baby boys
  • Comparing the proportions and testing \(H_0: p_1=p_2\)

prop.test(as.matrix(data[,1:2]))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  as.matrix(data[, 1:2])
## X-squared = 10.539, df = 1, p-value = 0.001169
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.02795236 0.11800878
## sample estimates:
##    prop 1    prop 2 
## 0.1681223 0.0951417

The null hypothesis is rejected.

Then we use logistic regression to find out how much the odds of having the disease change between bottle-fed and breast-fed babies.

mdl <- glm(cbind(disease,nondisease) ~ food, family=binomial, data)  
summary(mdl)  
## 
## Call:
## glm(formula = cbind(disease, nondisease) ~ food, family = binomial, 
##     data = data)
## 
## Deviance Residuals: 
## [1]  0  0
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.5990     0.1249 -12.797  < 2e-16 ***
## foodBreast   -0.6534     0.1978  -3.303 0.000955 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.1235e+01  on 1  degrees of freedom
## Residual deviance: 1.1680e-13  on 0  degrees of freedom
## AIC: 15.591
## 
## Number of Fisher Scoring iterations: 3

Exponentiate the coefficients to express the change as odds ratios.

exp(mdl$coefficients)
## (Intercept)  foodBreast 
##   0.2020997   0.5202650

Look at the confidence intervals to assess the significance and the precision of our conclusions.

library(MASS)
exp(confint(mdl))
##                 2.5 %    97.5 %
## (Intercept) 0.1570976 0.2565573
## foodBreast  0.3511082 0.7636948

2.2.2 B-Approach (Bayesian approach)

2.2.2.1 Not a true B-approach

First, we apply Bayes theorem to address the same question, but not exactly within Bayesian approach.

Create joint distribution of two binary variables: (“disease”,“nondisease”) and (“bottle”,“breast”).

joint.dist<-data
joint.dist[1:2,1:2] <- joint.dist[1:2,1:2]/sum(joint.dist[1:2,1:2])
joint.dist
sum(joint.dist[1:2,1:2])
## [1] 1

Find the marginal probabilities of breast feeding and of the disease.

(p.breast<-sum(joint.dist[2,-3]))
## [1] 0.5189076
(p.disease<-sum(joint.dist[,1]))
## [1] 0.1302521

Find the conditional probability of breastfeeding, given that the baby got the disease

\[ P(\text{breast} \mid \text{disease}) = \frac{P(\text{breast}, \text{disease})}{P(\text{disease})} \]

(p.breast.disease<-joint.dist[2,1]/p.disease)
## [1] 0.3790323

Finally, use Bayes theorem.

(p.disease.breast<-p.breast.disease*p.disease/p.breast)
## [1] 0.0951417

Could we calculate the same probability directly, not using Bayes theorem?

(p.disease.breast<-joint.dist[2,1]/p.breast)
## [1] 0.0951417

Why would one prefer using Bayes theorem rather than calculating the probability directly? Following the flow from prior to conditional probabilities can feel more logical and easier to understand.

This is not a Bayesian approach: even though we applied the basic Bayes Theorem, we still used frequentist probabilities to calculate the results.

2.2.2.2 True B-approach but for a different problem

A true Bayesian approach to the baby food problem would be a logistic regression with the food type as a predictor and a prior distribution for the parameters.

We will look at this model later.

Now use the true B-approach to estimate the probability parameter of a binomial distribution.

The data come from observing 952 baby boys, of which 124 got the disease: this is a sequence of 124 ones and 828 zeros.

(Data<-c(rep(0,828),rep(1,124)))
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## ... (828 zeros followed by 124 ones) ...
## [926] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Probability of success is \(p\).
In the FNP-approach its estimate is \(\hat{p}\).
In the B-approach it is a random variable with prior distribution density \(f_p(x)\).
Then the B-approach means using Bayes theorem to find the posterior density \(f_p(p \mid Data)\).
We will practice this approach in the following section.

3 Example of Using Bayes Theorem in Statistics

The following example uses script BernGrid.R from the book by John K. Kruschke.

Source the scripts DBDA2E-utilities.R and BernGrid.R available on the book site.

source("DBDA2E-utilities.R")
## 
## *********************************************************************
## Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
## A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
## *********************************************************************
source("BernGrid.R")

3.1 Binomial model, triangular prior with 5 values

  • Set the parameters for Graph 5.1 on page 111 in [K]. Vector Theta holds the values of \(\theta\), the probability of success in Bernoulli trials; here it takes 5 values from 0 to 1.

Excerpt from Graph 5.1 on page 111 in [K]

  • Vector pTheta is the prior distribution of triangular shape.
  • Variable Data is the observed number of successes \(\gamma\). Start with one flip in which \(\gamma\) = 1.
  • Likelihood function for Bernoulli trials is \(p(\gamma \mid \theta)=\theta^{\gamma} (1-\theta)^{(1-\gamma)}\). Posterior distribution is obtained from Bayes theorem.
    Numerator in the expression is the product of pTheta and likelihood.
  • The simplest way of finding the denominator is to sum the numerator vector. This will make posterior probabilities add up to 1.
  • Calculate the values of the likelihood and the posterior distribution post, then run BernGrid() and compare the results.

Set \(\theta\), the prior and the data.

(Theta = seq(0 ,1 ,length=5 ))  # Sparse teeth for Theta.
## [1] 0.00 0.25 0.50 0.75 1.00
pTheta = pmin(Theta ,1-Theta ) # Triangular shape for pTheta.
(pTheta = pTheta/sum(pTheta)) # Make pTheta sum to 1.0
## [1] 0.00 0.25 0.50 0.25 0.00
Data = c(rep(0,0),rep(1,1)) # Single flip with 1 head

Function `pmin()` returns the parallel minimum, i.e. the element-wise minimum of the two vectors `Theta` and `1-Theta`.

  • Calculate likelihood and posterior.

Likelihood: \(p(\gamma \mid \theta)=\theta^{\gamma} (1-\theta)^{(1-\gamma)}\)

Prior: \(p(\theta) \propto \min(\theta, 1-\theta)\) (the triangular pTheta)
Evidence: \(p(\gamma) = \sum_{\theta^{*}} p(\gamma \mid \theta^{*}) p(\theta^{*})\)

So, Post: \(p(\theta \mid \gamma) = \frac{p(\gamma \mid \theta) p(\theta)}{\sum_{\theta^{*}} p(\gamma \mid \theta^{*}) p(\theta^{*})}\)

likelihood=Theta^Data * (1-Theta)^(1-Data)
post=pTheta*likelihood
post=post/sum(post)

Compare the results, make the plot.

likelihood
## [1] 0.00 0.25 0.50 0.75 1.00
post
## [1] 0.000 0.125 0.500 0.375 0.000
(posterior = BernGrid(Theta, pTheta, Data, plotType="Bars", 
                      showCentTend="None" , showHDI=FALSE , showpD=FALSE ))

## [1] 0.000 0.125 0.500 0.375 0.000

3.2 Binomial model, triangular prior with 11 values

Increase the number of values taken by \(\theta\) to 11.

Theta = seq( 0 , 1 , length=11 )  # Sparse teeth for Theta.
pTheta = pmin( Theta , 1-Theta ) # Triangular shape for pTheta.
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,0),rep(1,1))      # Single flip with 1 head
posterior = BernGrid(Theta, pTheta, Data, plotType="Bars", 
                     showCentTend="None", showHDI=FALSE, showpD=FALSE )

Which of the functions has more influence on the posterior: prior or likelihood?
- With only a single flip, the prior has more influence on the posterior than the likelihood.

3.3 Binomial model, triangular prior with 1001 values, 4 observations

Make prior with triangular shape.
Make the number of Bernoulli trials equal to 4, of which one outcome is Head (\(\gamma\)=1) and three outcomes are Tails.
Calculate likelihood and posterior yourself, compare with BernGrid().

Theta = seq(0, 1, length=1001)  # fine grid for Theta.
pTheta = pmin(Theta, 1-Theta) # Triangular shape for pTheta.
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,3),rep(1,1))      # 25% heads, N=4
likelihood=Theta^sum(Data) * (1-Theta)^(4-sum(Data))
post=pTheta*likelihood
post=post/sum(post)
plot(Theta,pTheta)

plot(Theta,likelihood)

plot(Theta,post)

posterior = BernGrid(Theta, pTheta, Data, plotType="Bars", 
                     showCentTend="Mode" , showHDI=TRUE, showpD=FALSE)

3.4 Binomial model, triangular prior, 40 observations

Increase the number of Bernoulli trials to 40. Keep the proportion of heads to tails in the data as 3/1.

Theta = seq( 0 , 1 , length=1001 )  # fine grid for Theta.
pTheta = pmin( Theta , 1-Theta ) # Triangular shape for pTheta.
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,30),rep(1,10))    # 25% heads, N=40

posterior = BernGrid( Theta, pTheta , Data , plotType="Bars" , 
                      showCentTend="Mode" , showHDI=TRUE , showpD=FALSE )

3.5 How did the influence of the prior and the likelihood change?

  • For the small sample size (Binomial model, N = 4) the mode (peak) of the posterior distribution is at θ = 0.40, which is closer to the mode of the prior distribution (at 0.5) than the mode of the likelihood function. For the larger sample size (N = 40), the mode of the posterior distribution is at θ = 0.268, which is close to the mode of the likelihood function.
  • In both cases, the mode of the posterior distribution is between the mode of the prior distribution and the mode of the likelihood function, but the posterior mode is closer to the likelihood mode for larger sample sizes.
  • The width of the posterior highest density intervals (HDI) is smaller for the larger sample size. Even though both samples of data showed 25% heads, the larger sample implied a smaller range of credible underlying biases in the coin. A helper for computing such HDIs from a grid posterior is sketched after this list.
  • In general, the more data we have, the more precise is the estimate of the parameter(s) in the model. Larger sample sizes yield greater precision or certainty of estimation.
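
The HDI widths discussed above can be computed from any grid posterior. The helper below is only a rough sketch added here (it is not the HDI utility from [K]); it approximates the interval by accumulating the most probable grid points until 95% of the mass is covered.

gridHDI <- function(theta, post, mass=0.95) {
  ord  <- order(post, decreasing=TRUE)     # most credible grid points first
  keep <- ord[cumsum(post[ord]) <= mass]   # accumulate probability until the mass is covered
  range(theta[keep])                       # approximate HDI limits
}
# Example with the triangular prior and the N = 4 data from above:
Theta  <- seq(0, 1, length=1001)
pTheta <- pmin(Theta, 1-Theta); pTheta <- pTheta/sum(pTheta)
post4  <- pTheta*Theta^1*(1-Theta)^3; post4 <- post4/sum(post4)
gridHDI(Theta, post4)                      # noticeably wider than the interval for N = 40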

3.6 Binomial model, uniform prior with 1001 values

Increase the number of values of \(\theta\) to 1001. Make prior uniform on [0,1].

Theta = seq( 0 , 1 , length=1001 ) # fine grid for Theta.
pTheta = rep(1,length(Theta))      # Uniform (horizontal) shape for pTheta.
pTheta = pTheta/sum(pTheta)        # Make pTheta sum to 1.0
Data = c(rep(0,0),rep(1,1))        # Single flip with 1 head
likelihood=Theta^Data * (1-Theta)^(1-Data)
post=pTheta*likelihood
post=post/sum(post)
plot(Theta,pTheta, pch=20, lwd=.25)

plot(Theta, likelihood, type="l", lwd=4)

plot(Theta, post, type="l", lwd=4)

Which of the functions has more influence on the posterior: prior or likelihood?

  • That would be the likelihood in this case.

posterior = BernGrid( Theta, pTheta , Data , plotType="Bars" , 
                      showCentTend="None" , showHDI=FALSE , showpD=FALSE )

3.7 Binomial model, dispersed prior, 4 observations

Smear the prior to almost flat shape by taking pTheta to the power of 0.1.

Theta = seq(0, 1, length=1001)  # fine grid for Theta.
pTheta = pmin(Theta , 1-Theta) # Triangular shape for pTheta.
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
pTheta = pTheta^0.1              # Flatten pTheta !
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,3),rep(1,1))      # 25% heads, N=4

posterior = BernGrid(Theta, pTheta, Data, plotType="Bars",
                     showCentTend="Mode", showHDI=TRUE,
                     showpD=FALSE )

Which of the functions has more influence on the posterior: prior or likelihood?

  • In general, whenever the prior distribution is relatively broad compared with the likelihood function, the prior has fairly little influence on the posterior.

3.8 Binomial model, flat prior, 40 observations

Use the same flattened prior, now with the larger data set (N = 40, 25% heads).

Theta = seq(0, 1, length=1001)  # fine grid for Theta.
pTheta = pmin(Theta, 1-Theta) # Triangular shape for pTheta.
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
pTheta = pTheta^0.1              # Flatten pTheta !
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,30),rep(1,10))    # 25% heads, N=40

posterior = BernGrid(Theta, pTheta, Data, plotType="Bars", 
                      showCentTend="Mode", showHDI=TRUE, showpD=FALSE)

3.9 Binomial model, concentrated prior, 4 observations

Make the prior concentrated at 0.5 by taking it to the power of 10.

Theta = seq( 0 , 1 , length=1001 )  # fine grid for Theta.
pTheta = pmin( Theta , 1-Theta ) # Triangular shape for pTheta.
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
pTheta = pTheta^10               # Sharpen pTheta !
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,3),rep(1,1))      # 25% heads, N=4

posterior = BernGrid( Theta, pTheta , Data , plotType="Bars" , 
                      showCentTend="Mode" , showHDI=TRUE , showpD=FALSE )

Which of the functions has more influence on the posterior: prior or likelihood?

In this case, the answer is the prior.

3.10 Binomial model, concentrated prior, 40 observations

With the same number and ratio of observations (N = 40, 25% heads), make the prior concentrated around 0.5.

Theta = seq( 0 , 1 , length=1001 )  # fine grid for Theta.
pTheta = pmin( Theta , 1-Theta ) # Triangular shape for pTheta.
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
pTheta = pTheta^10               # Sharpen pTheta !
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,30),rep(1,10))    # 25% heads, N=40

posterior = BernGrid( Theta, pTheta , Data , plotType="Bars" , 
                      showCentTend="Mode" , showHDI=TRUE , showpD=FALSE )

  • In this case, despite the fact that the sample has a larger size of N = 40, the prior is so sharp that the posterior distribution is noticeably influenced by the prior.
  • In real applications, this is a reasonable and intuitive inference, because in real applications a sharp prior has been informed by genuine prior knowledge that we would be reluctant to move away from without a lot of contrary data.

3.11 Binomial model, binary prior with values 0, 1

Make a prior distribution that puts probability 0.5 on each of the two grid values of \(\theta\) next to the extremes (\(\theta \approx 0\) and \(\theta \approx 1\)) and probability 0 on all other values of \(\theta\).

Theta = seq(0 ,1 , length=1001) # fine grid for Theta.
pTheta = rep(0, length(Theta))      # Only extremes are possible!
pTheta[2] = 1                      # Only extremes are possible!
pTheta[length(pTheta)-1] = 1       
pTheta = pTheta/sum(pTheta)        # Make pTheta sum to 1.0
Data = c(rep(0,0),rep(1,1))        # Single flip with 1 head
 
posterior = BernGrid(Theta, pTheta, Data, plotType="Bars", 
                     showCentTend="None", showHDI=FALSE, showpD=FALSE)

The compromise between the prior and the likelihood results in the posterior concentrating at the prior value closest to the observed data (\(\theta \approx 1\), near the maximum likelihood value), as checked below.
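
A quick manual check of this compromise, using the grid defined in the code above (an illustration added here, not part of the original notes):

# Posterior mass at the two admissible grid points after a single head:
thetaLow  <- Theta[2]                      # 0.001
thetaHigh <- Theta[length(Theta)-1]        # 0.999
num <- c(0.5*thetaLow, 0.5*thetaHigh)      # prior * likelihood of one head at each point
num/sum(num)                               # ~0.001 and ~0.999: nearly all mass at theta near 1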

3.12 Binomial model, bimodal prior, 27 observations

Make the prior bimodal; the data are generated by 27 Bernoulli trials with 13 tails and 14 heads.

Theta = seq( 0 , 1 , length=1000 )  # fine grid for Theta.
# Two triangular peaks on a small non-zero floor:
pTheta = c( rep(1,200),seq(1,100,length=50),seq(100,1,length=50),rep(1,200) , 
            rep(1,200),seq(1,100,length=50),seq(100,1,length=50),rep(1,200) )
pTheta = pTheta/sum(pTheta)      # Make pTheta sum to 1.0
Data = c(rep(0,13),rep(1,14)) 


posterior = BernGrid( Theta, pTheta , Data , plotType="Bars" , 
                      showCentTend="None" , showHDI=FALSE , showpD=FALSE )

Bayesian inference is intuitively rational: With a strongly informed prior that uses a lot of previous data to put high credibility over a narrow range of parameter values, it takes a lot of novel contrary data to budge beliefs away from the prior. But with a weakly informed prior that spreads credibility over a wide range of parameter values, it takes relatively little data to shift the peak of the posterior distribution toward the data (although the posterior will be relatively wide and uncertain).

Note: the paragraph above is not my words; it is quoted from the source. The point is that in real applications a sharp prior is informed by genuine prior knowledge that we would be loath to move away from without a lot of contrary data.

4 Posterior distribution by simulation. Updating prior and adding new data

Look again at the example in the section Binomial model, uniform prior with 1001 values.

Follow the steps of obtaining posterior distribution by Monte Carlo.

Define likelihood function for binomial distribution.

likeli <- function(par, data){
  sdata <- sum(data)
  ldata <- length(data)
  return(par^sdata*(1-par)^(ldata-sdata))
}

Define values of parameter \(\theta\) and prior distribution.

  • fine grid for \(\theta\).
  • Uniform (horizontal) shape for pTheta

Theta = seq(.00001, 1-.00001, length=1001) # fine grid for theta
pTheta = rep(1,length(Theta))      # Uniform (horizontal) shape for pTheta.
pTheta = pTheta/sum(pTheta)        # Make pTheta sum to 1.0
plot(Theta,pTheta)

Create data of length 5 generated by the model with parameter 0.84.

set.seed(15)
(Data<-rbinom(5,size=1,prob=.84))
## [1] 1 1 0 1 1

Create sample of \(\theta\) generated from the prior distribution.

set.seed(15)
priorInd <- sample(1:length(Theta),500,replace = T)
priorSample <- cbind(Theta=Theta[priorInd],Prob=pTheta[priorInd])
priorSample <- rbind(priorSample,
                     c(head(Theta,1),head(pTheta,1)),
                     c(tail(Theta,1),tail(pTheta,1)))
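
The remaining steps of the framework can be sketched as follows: weight each sampled value of \(\theta\) by its numerator (prior probability times likelihood) and estimate the posterior from the weighted sample. This completion is illustrative only, not the original code.

num <- priorSample[,"Prob"]*likeli(priorSample[,"Theta"], Data)   # numerator for each sampled theta
w   <- num/sum(num)                                               # normalized weights
set.seed(15)
postSample <- sample(priorSample[,"Theta"], 10000, replace=TRUE, prob=w)  # resample by weight
plot(density(postSample), main="Simulated posterior of theta")
mean(postSample)   # posterior mean estimate; the data contained 4 successes out of 5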

5 Example: Expressing prior knowledge as beta distribution

We will work through an example from Section 6.4.1 of [K].
Load the definition of the BernBeta() function.

source("BernBeta.R") 

5.1 Situation 1

A coin believed to be regular is flipped 20 times, of which 17, or 85%, are heads.
Despite this, there is a strong prior belief that the coin is fair, with a concentration equivalent to 500 observations. What is your posterior belief about the probability of heads?

Express prior belief in terms of parameters of beta distribution.

The prior is very strong and is characterized by mode \(\omega = 0.5\) and concentration \(\kappa = 500\).

Use the formulas given on my website to convert the mode and concentration into the shape parameters of the beta distribution.
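
The conversion implemented in the code below is \[ a = \omega(\kappa - 2) + 1, \qquad b = (1 - \omega)(\kappa - 2) + 1,\] where \(\omega\) is the mode and \(\kappa\) is the concentration.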

# Specify the prior:
t = 0.5             # Specify the prior MODE.
n = 500               # Specify the effective prior sample size.
a = t*(n-2) + 1      # Convert to beta shape parameter a.
b = (1-t)*(n-2) + 1  # Convert to beta shape parameter b.

Form the vector of prior shape parameters.

(Prior = c(a,b))       # Specify Prior as vector with the two shape parameters.
## [1] 250 250

Define the data of size 20 and number of heads 17. Note that Data should be a sequence of zeros and ones.

# Specify the data:
N = 20                         # The total number of flips.
z = 17                         # The number of heads.
(Data = c(rep(0,N-z),rep(1,z))) # Convert N and z into vector of 0's and 1's.
##  [1] 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Use the same function for calculating the binomial likelihood as in the previous section.

likeli <- function(par,data){
  sdata <- sum(data)
  ldata <- length(data)
  return(par^sdata*(1-par)^(ldata-sdata))
}

Define the grid of values for \(\theta\).
Define the prior as beta distribution with the shape parameters in Prior. Plot prior.

Theta <- seq(0, 1, length=1001)  # Make grid of Theta.
pTheta <- dbeta(Theta, shape1=Prior[1], shape2=Prior[2]) # Beta prior distribution.
plot(Theta, dbeta(Theta, shape1=Prior[1], shape2=Prior[2]), ylab="Prior")

Calculate likelihood for Data, given values in Theta.

likelihood<-likeli(Theta,Data)
plot(Theta,likelihood)

Calculate the posterior distribution of \(\theta\) using grid approximation. Normalize the posterior so that it adds up to 1, then multiply by 1000 (the number of grid steps) to put it on the density scale, comparable with the beta density used below.

post <- pTheta*likelihood
post <- post/sum(post)*1000

Then, plot the posterior.

plot(Theta, post)

Find the mode of the posterior.

(mode<-Theta[which.max(post)])
## [1] 0.514

Calculate posterior using the analytical formula.

postBeta <- dbeta(Theta, shape1 = z+Prior[1], shape2 = N-z+Prior[2])
plot(Theta, postBeta)

(mode<-Theta[which.max(postBeta)])
## [1] 0.514
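
This matches the beta-binomial conjugate update: with a \(Beta(a, b)\) prior and \(z\) heads in \(N\) flips, the posterior is \(Beta(a + z, \; b + N - z)\), here \(Beta(250 + 17, \; 250 + 3) = Beta(267, 253)\), whose mode \((267-1)/(267+253-2) \approx 0.514\) agrees with the grid result.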

We can also run the script from [K].

posterior <- BernBeta(priorBetaAB=Prior, Data=Data, plotType="Bars", 
                      showCentTend="Mode", showHDI=TRUE, showpD=FALSE)

Conclusion.
Very strong prior knowledge, expressed by a mode of 0.5 and a concentration of 500, is barely affected by a sample of 20 coin flips, even though the estimated frequency of heads in these data is 85%.

5.2 Situation 2

A professional basketball player makes 20 free throws.

We know that professionals tend to make 75% of free throws, majority makes 50% or more, but very few make more than 90%.

This knowledge is reflected in a beta distribution with mode \(\omega = 0.75\) and concentration \(\kappa = 25\).

Such distribution has 95% highest density interval (HDI) approximately between 0.5 and 0.9.

Think how you would define concentration to satisfy the requirement that 95% HDI equals [0.5,0.9].

\(\Rightarrow\) What is your posterior belief this time?

Follow the same steps as in Situation 1.
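
A sketch of those steps for Situation 2 (added here for illustration): the number of made free throws is not given above, so z below is a hypothetical placeholder to be replaced by the observed count.

# Prior from mode omega = 0.75 and concentration kappa = 25:
t = 0.75              # prior mode
n = 25                # prior concentration (effective sample size)
a = t*(n-2) + 1       # beta shape a = 18.25
b = (1-t)*(n-2) + 1   # beta shape b = 6.75
(Prior = c(a,b))
qbeta(c(.025,.975), a, b)   # equal-tailed 95% interval, a rough check against the quoted HDI
# z is a hypothetical placeholder: replace with the observed number of made throws.
N = 20; z = 15
Data = c(rep(0,N-z), rep(1,z))
posterior = BernBeta(priorBetaAB=Prior, Data=Data, plotType="Bars",
                     showCentTend="Mode", showHDI=TRUE, showpD=FALSE)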

5.3 Situation 3

5.4 Exercise 6.2 from [K]

6 Predicting binomial probability and Bernoulli outcome

7 References

  • Adapted from lecture notes: UC Bayesian Methods
  • Gelman, A., et al. (2013). Bayesian Data Analysis, Third Edition. CRC Press.
  • [K] Kruschke, J. K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, Second Edition. Academic Press / Elsevier.