
Bayesian inference

Let Y be a random variable representing the data. We have a (parametric) model for Y with Y∼f(y∣θ), θ∈Θ, where Θ is the parameter space. Here f(y∣θ) is the probability density function of Y; viewed as a function of θ for a fixed observation y, it is the likelihood function. We also have a prior distribution for θ, represented by π(θ). Given an observation of Y, the posterior distribution of θ is

\[ f(\theta \mid y) =\frac{f(y\mid\theta) \pi(\theta)}{\int_\Theta f(y\mid\theta) \pi(\theta)\; d\theta} \]

The posterior is proportional to the likelihood times the prior, since the denominator \(f(y)\) is a constant that does not depend on \(\theta\).

Continuous distributions, estimating one parameter

The goal is to find the value of μ with the highest posterior probability, given some observations.

The prior

Assign μ an arbitrary prior distribution (it could also be estimated, e.g. by maximum likelihood from a sample). To keep things simple and intuitive, we assess our distributions (prior, likelihood, posterior) at equally spaced points. For each point we approximate the probability that μ lies within a narrow interval (width 0.01) around it, effectively treating the continuous distributions as discrete (categorical) distributions over the grid.

\[ \int_{-\infty}^{\infty} f(y\mid\theta)\; d\theta = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y-\theta)^2}\; d\theta = 1 \]

res <- 1 / 100              # Set resolution for the sequence of potential mu-values, 0.01 interval
mu.h <- seq(-10, 10, res)   # Define values at which to assess the probability density

density <- dnorm(mu.h, mean = -1, sd = 1.5) # Get the density for each value in mu.h
prior <- density * res      # Multiply with resolution to get probability per interval

length(mu.h)
## [1] 2001
head(mu.h)
## [1] -10.00  -9.99  -9.98  -9.97  -9.96  -9.95
sum(prior)
## [1] 1
length(prior)
## [1] 2001

Individual observations

set.seed(4)                 # Set seed for reproducibility
n <- 100                      # Set number of observations
y <- rnorm(n, mean = 1, sd = 2) # Generate data
y                           # Take a look at the data
##   [1]  1.43350973 -0.08498514  2.78228929  2.19196115  4.27123600  2.37855088
##   [7] -1.56249326  0.57371096  4.79307974  4.55372643  2.13320900  1.03143891
##  [13]  1.76611468  0.90972577  1.06870381  1.33805355  3.33005368  0.91159201
##  [19]  0.79926311  0.43311086  4.08162996  1.33033804  3.61524472  3.57651376
##  [25]  2.18579388  0.43411263  3.51176805  2.81967830 -0.85605621  3.48036168
##  [31]  1.30692836  3.10386516 -0.50842243 -1.96437824  2.72226374  0.19096034
##  [37]  0.54518917  2.86819234  0.06820824 -0.27508700  3.68741725  1.36307077
##  [43]  3.58502467 -2.37609715 -0.64198716 -0.72429229  1.19768738  0.24868971
##  [49]  2.44780831 -2.59476404 -0.32748628 -0.24745298  0.84073514  1.87124953
##  [55]  4.94180194 -0.19351735 -0.10501442  2.39193327  0.68867207  3.69779639
##  [61] -1.13704614  3.12890149 -1.62544353  5.12738940  1.26276602  0.53662310
##  [67]  0.20528895  2.77886416  2.05233808  0.65745351  1.31735379  0.02866987
##  [73] -0.91781215  1.36103458  2.44346857  0.26091904  1.47507663 -0.33184422
##  [79] -0.59361502  0.89660614  3.57385667  0.57170067 -0.14949093 -1.94145409
##  [85] -1.06547687 -1.61304971 -0.67650481 -1.26130736  1.73749635  0.59639396
##  [91] -1.55531981 -0.59602496  1.31816485  2.22959527  2.37589592  0.90589798
##  [97]  5.66064336 -0.15513198  2.93695827  0.44492875

The likelihood

Suppose we know the standard deviation is 2. For a single observation, \(y_i\), the likelihood is then: \(p\left(y_i \mid \mu\right)=N\left(y_i \mid \mu, 2^2\right)\). The likelihood for the whole dataset is the product of all the individual likelihoods:

\[ p(y \mid \mu)=\prod_{i=1}^n N\left(y_i \mid \mu, 2^2\right) \]

We can then calculate the likelihood across the set of potential values for \(\mu\).

likelihood <- vector()      # Create an empty vector

for (s in mu.h) {           # Loop over each of the 2001 values in mu.h
  temp.lik <- 1             # Create a temporary likelihood scalar
  
  for (i in 1:n) {          # Loop over each of the n = 100 data points
    temp.lik <- temp.lik * dnorm(y[i], mean = s, sd = 2) # Multiply individual likelihoods
  }
  likelihood <- c(likelihood, temp.lik) # Store the likelihood for each value in mu.h
}

# output likelihood value
head(likelihood)
## [1] 0 0 0 0 0 0
length(likelihood)
## [1] 2001
# normalized.likelihood <- likelihood / sum(likelihood)
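Note that head(likelihood) shows 0 simply because the product of 100 densities is astronomically small for μ-values as far from the data as -10 and underflows. A numerically safer alternative is to sum log-densities and exponentiate after subtracting the maximum; this is only a minimal sketch, not part of the original loop (log.likelihood and likelihood.stable are illustrative names), and since the result is proportional to the likelihood, the posterior normalization below is unaffected.

log.likelihood <- sapply(mu.h, function(s) sum(dnorm(y, mean = s, sd = 2, log = TRUE))) # Log-likelihood at each mu
likelihood.stable <- exp(log.likelihood - max(log.likelihood)) # Proportional to the likelihood; no underflow to 0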

The posterior

As we know from Bayes’ theorem, we can obtain a distribution that is proportional ("\(\propto\)") to the posterior by multiplying the likelihood and the prior:

\[ p(\mu \mid y) \propto p(y \mid \mu) p(\mu) \]

unnormalized.posterior <- likelihood * prior

  • Then divide by the normalizing constant p(y):

normalizing.constant <- sum(likelihood * prior) # p(y)
normalizing.constant        # The sum over the grid approximates the integral
## [1] 5.001066e-90
# head(unnormalized.posterior)  # not yet normalized

posterior <- unnormalized.posterior / normalizing.constant

Determine which value of mu has the maximum posterior probability

which.max(posterior)
## [1] 1116
mu.h[which.max(posterior)]
## [1] 1.15
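Besides the maximum (MAP) value, the discretized posterior gives other summaries directly. A minimal sketch of the posterior mean and an approximate 95% credible interval (post.mean, cum.post and ci.95 are illustrative names, not from the original code):

post.mean <- sum(mu.h * posterior)                  # Posterior mean of mu (probability-weighted average)
cum.post  <- cumsum(posterior)                      # Cumulative posterior probability along the grid
ci.95 <- c(mu.h[min(which(cum.post >= 0.025))],     # Approximate 2.5% posterior quantile
           mu.h[min(which(cum.post >= 0.975))])     # Approximate 97.5% posterior quantile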

Plot

Prior and posterior probability densities (dividing by res converts the grid probabilities back to densities)

plot(mu.h, posterior / res, type = "l", ylab = "")
# lines(mu.h, normalized.likelihood / res, col = "red")
lines(mu.h, prior / res, col = "green")

# Both curves are normal densities; the posterior is much more concentrated than the prior (green) and centred near the sample mean

Continuous distributions, estimating two parameters

Let us now try to estimate both the mean, μ, and the standard deviation, σ. Since the standard deviation must be positive, we give σ a flat prior over the range from 0 to 10, using a uniform distribution: σ ∼ U(0, 10).

The joint prior

Assuming the priors for μ and σ are independent, the joint prior factorizes:

\[ p(\mu, \sigma)=p(\mu) p(\sigma) \]

mu prior

res <- 1 / 100              # Set resolution for the sequence of potential mu-values, 0.01 interval
mu.h <- seq(-10, 10, res)   # Define values at which to assess the probability density

density <- dnorm(mu.h, mean = -1, sd = 1.5) # Get the density for each value in mu.h
prior <- density * res      # Multiply with resolution to get probability per interval

length(mu.h)
## [1] 2001
head(mu.h)
## [1] -10.00  -9.99  -9.98  -9.97  -9.96  -9.95
sum(prior)
## [1] 1
length(prior)
## [1] 2001
mu.prior <- prior           

σ prior

sigma.h <- seq(0, 10, res)    # Define values at which to assess the probabilities for sigma
sigma.density <- dunif(sigma.h, min(sigma.h), max(sigma.h)) # Density-values for a flat prior
sigma.prior <- sigma.density * res # Multiply with res to get probability per interval


length(sigma.h)
## [1] 1001
head(sigma.h)
## [1] 0.00 0.01 0.02 0.03 0.04 0.05
sum(sigma.prior)
## [1] 1.001
length(sigma.prior)
## [1] 1001

Now that we have specified priors for μ and σ, we can also specify their joint prior distribution, p(μ,σ).

joint.prior <- sapply(mu.prior, function(mu.s) { # Nested sapply() calls instead of loops; result is a 1001 x 2001 matrix (rows: sigma, columns: mu)
                sapply(sigma.prior, function(sigma.s) { mu.s * sigma.s })}) 

This gives us a two-dimensional grid of probabilities, representing all possible combinations of μ and σ values that we are considering.

joint.prior[1:5,1:5]  
##              [,1]         [,2]        [,3]         [,4]        [,5]
## [1,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [2,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [3,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [4,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
## [5,] 4.050589e-14 4.215803e-14 4.38756e-14 4.566113e-14 4.75172e-14
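Since the joint prior is just the outer product of the two marginal prior vectors, the nested sapply() can equivalently be written with base R's outer(); a minimal sketch (joint.prior.alt is an illustrative name):

joint.prior.alt <- outer(sigma.prior, mu.prior) # Same 1001 x 2001 matrix: rows index sigma.h, columns index mu.h
# all.equal(joint.prior, joint.prior.alt)       # should be TRUE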

Individual observations

set.seed(4)                 # Set seed for reproducibility
n <- 200                      # Set number of observations
y <- rnorm(n, mean = 1, sd = 2) # Generate data

The joint likelihood

We calculate the likelihood across all possible combinations of values for μ and σ:

\[ p(y \mid \mu,\sigma)=\prod_{i=1}^n N\left(y_i \mid \mu, \sigma^2\right) \]

joint.likelihood <- sapply(mu.h, function(mu.s) { # Nested sapply() calls instead of loops
                      sapply(sigma.h, function(sigma.s) { prod(dnorm(y, mu.s, sigma.s)) })})
# normalized.joint.likelihood <- joint.likelihood / sum(joint.likelihood) # Normalize 
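As in the one-parameter case, the product of 200 densities can underflow for poorly fitting combinations of μ and σ. A minimal log-scale sketch over the same grid (log.joint.lik and joint.likelihood.stable are illustrative names; at σ = 0 the log-density is -Inf, which simply exponentiates back to 0):

log.joint.lik <- sapply(mu.h, function(mu.s) {   # Log-likelihood for each (mu, sigma) combination
                   sapply(sigma.h, function(sigma.s) { sum(dnorm(y, mu.s, sigma.s, log = TRUE)) })})
joint.likelihood.stable <- exp(log.joint.lik - max(log.joint.lik)) # Proportional to the joint likelihood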

The joint posterior

The posterior distribution is now a joint distribution for our two parameters.

\[ p(\mu, \sigma \mid y) \propto p(y \mid \mu, \sigma) p(\mu, \sigma) \]

unnormalized.joint.posterior <- joint.likelihood * joint.prior
normalizing.constant <- 1 / sum(joint.likelihood * joint.prior)

joint.posterior <- unnormalized.joint.posterior * normalizing.constant  # Has the same maximum as unnormalized.joint.posterior

Determine which combination of mu and sigma has the maximum posterior probability

# Find the position of the maximum in the posterior matrix
df <- matrix(joint.posterior, nrow = 1001, byrow = FALSE) # rows index sigma.h, columns index mu.h
dim(df)
## [1] 1001 2001
position <- which(df == max(joint.posterior), arr.ind = TRUE)
position
##      row  col
## [1,] 194 1101
# Then look up the corresponding mu and sigma values by position
mu.h[position[2]]    # the column index corresponds to mu.h
## [1] 1
sigma.h[position[1]] # the row index corresponds to sigma.h
## [1] 1.93

Plot

contour(mu.h,sigma.h,t(joint.posterior))

Marginal distributions

We often want to draw conclusions about a given parameter without reference to the other parameter(s) in the model. A marginal posterior distribution is obtained by averaging the joint distribution over the other parameter(s).

To get the marginal posterior \(p(\mu \mid y)\), we integrate the joint posterior over σ; on our grid this amounts to summing the posterior probabilities over all potential values of σ.

\[ p(\mu \mid y)=\int p(\mu, \sigma \mid y)\, d\sigma \]

marginal posterior for mu

mu.marginal.posterior <- colSums(df) # Sum over sigma (the grid analogue of integrating over sigma)

which.max(mu.marginal.posterior)
## [1] 1101
mu.h[which.max(mu.marginal.posterior)]
## [1] 1
  • Plot
plot(mu.h,mu.marginal.posterior/res,type = "l"); lines(mu.h,mu.prior/res,col=2)

marginal posterior for sigma

sigma.marginal.posterior <- rowSums(df) # Sum over mu (the grid analogue of integrating over mu)

which.max(sigma.marginal.posterior )
## [1] 194
sigma.h[which.max(sigma.marginal.posterior )]
## [1] 1.93
  • Plot
plot(sigma.h,sigma.marginal.posterior/res,type = "l"); lines(sigma.h,sigma.prior/res,col=2)
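The marginal posteriors also give posterior expectations directly, by weighting each grid value by its marginal posterior probability; a minimal sketch:

sum(mu.h * mu.marginal.posterior)        # Approximate posterior mean E(mu | y)
sum(sigma.h * sigma.marginal.posterior)  # Approximate posterior mean E(sigma | y)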

This grid approach does not scale well: the number of grid elements grows exponentially with the number of parameters. For higher-dimensional problems, we would normally either sample from the posterior distribution using a form of Markov chain Monte Carlo (MCMC) or solve the posterior analytically.
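As an illustration of the sampling-based alternative, below is a minimal random-walk Metropolis sketch for the two-parameter model above. The proposal standard deviation (0.2), starting values, number of iterations, and burn-in length are arbitrary tuning choices, not part of the grid example.

log.post <- function(mu, sigma) {               # Unnormalized log-posterior under the same priors as above
  if (sigma <= 0 || sigma > 10) return(-Inf)    # Outside the support of the uniform prior on sigma
  sum(dnorm(y, mu, sigma, log = TRUE)) +        # Log-likelihood
    dnorm(mu, mean = -1, sd = 1.5, log = TRUE)  # Log-prior for mu (the uniform log-prior for sigma is constant)
}
n.iter <- 10000
draws <- matrix(NA, n.iter, 2, dimnames = list(NULL, c("mu", "sigma")))
current <- c(0, 1)                              # Arbitrary starting values
for (t in 1:n.iter) {
  proposal <- current + rnorm(2, 0, 0.2)        # Random-walk proposal for (mu, sigma)
  log.accept <- log.post(proposal[1], proposal[2]) - log.post(current[1], current[2])
  if (log(runif(1)) < log.accept) current <- proposal # Accept with the Metropolis probability
  draws[t, ] <- current
}
colMeans(draws[-(1:1000), ])                    # Posterior means after discarding burn-in draws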

Bayes’ theorem

The prevalence of a disease is 2.28%. Assume the test has a sensitivity of 94% and a specificity of 98%. What is the probability that a person with a positive test result has the disease?

# d is disease status and t is test result

# Prior: p(d = 1), the prevalence
p.d1 <- .0228
p.d0 <- 1 - p.d1

# Sensitivity: p(t = 1 | d = 1)
p.t1.d1 <- .94

# Specificity: p(t = 0 | d = 0)
p.t0.d0 <- .98
p.t1.d0 <- 1 - p.t0.d0

# Bayes' theorem: probability of disease given a positive test
(p.d1.t1 <- p.t1.d1 * p.d1 / (p.t1.d1 * p.d1 + p.t1.d0 * p.d0))
## [1] 0.5230379
# PPV = (sensitivity x prevalence) / [ (sensitivity x prevalence) + ((1 – specificity) x (1 – prevalence)) ]
(p.t1.d1 * p.d1) / ( (p.t1.d1 * p.d1) + (1 - p.t0.d0) * (1 - p.d1) )
## [1] 0.5230379
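As a quick sanity check, we can simulate a large population with this prevalence and these test characteristics and look at the fraction of diseased individuals among the simulated positives; the population size (1e6) and seed are arbitrary choices for this sketch.

set.seed(1)
N <- 1e6                                     # Size of the simulated population (arbitrary)
disease <- rbinom(N, 1, p.d1)                # True disease status
test <- ifelse(disease == 1,
               rbinom(N, 1, p.t1.d1),        # Diseased: positive with probability = sensitivity
               rbinom(N, 1, p.t1.d0))        # Healthy: positive with probability = 1 - specificity
mean(disease[test == 1])                     # Empirical P(disease | positive test); should be close to 0.523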

Binomial distribution

\[ P(x ; p, n)=\left(\begin{array}{c} n \\ x \end{array}\right)(p)^x(1-p)^{(n-x)} \quad \text { for } x=0,1,2, \cdots, n \]

The prior

We use a normal distribution (mean -1, sd 1.5) as the prior for p, evaluated only on the grid over [0, 1]. Because the normal is effectively truncated to [0, 1] here, the discretized prior does not sum to 1, but the posterior normalization below corrects for this.

\[ \int_{-\infty}^{\infty} f(y\mid\theta)\; d\theta = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y-\theta)^2}\; d\theta = 1 \]

res <- 1 / 100              # Set resolution for the sequence of potential p-values, 0.01 interval
mu.h <- seq(0, 1, res)      # Grid of potential values for p (the variable keeps the name mu.h)

density <- dnorm(mu.h, mean = -1, sd = 1.5) # Normal prior density evaluated at each grid value
prior <- density * res      # Multiply with resolution to get probability per interval

length(mu.h)
## [1] 101
head(mu.h)
## [1] 0.00 0.01 0.02 0.03 0.04 0.05
sum(prior)
## [1] 0.1628928
length(prior)
## [1] 101

Individual observations

set.seed(4)                 # Set seed for reproducibility
n <- 500                      # Set number of observations
y <- rbinom(n, size = 1, prob = 0.26) # Generate data
y                                     # Take a look at the data
##   [1] 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0
##  [38] 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0
##  [75] 1 0 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0
## [112] 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1
## [149] 1 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [186] 0 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0
## [223] 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0
## [260] 1 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
## [297] 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0
## [334] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [371] 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1
## [408] 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0
## [445] 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
## [482] 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0

The likelihood

We use the binomial likelihood function:

\[ p(y \mid p)=\prod_{i=1}^n B\left(y_i \mid p\right) \]

where each term \(B(y_i \mid p)\) is a binomial with size 1, i.e. a single Bernoulli trial.

joint.likelihood <- sapply(mu.h, function(s) prod(dbinom(y, 1, s))) # Likelihood at each grid value of p
# normalized.joint.likelihood <- joint.likelihood / sum(joint.likelihood) # Normalize 
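Since each \(y_i\) is a Bernoulli draw, the product of the individual likelihoods equals \(p^k(1-p)^{n-k}\) with \(k=\sum_i y_i\); this is the binomial likelihood of the total count of successes without the constant \(\binom{n}{k}\), which cancels when the posterior is normalized. A minimal check sketch (k and likelihood.alt are illustrative names):

k <- sum(y)                                   # Total number of successes
likelihood.alt <- mu.h^k * (1 - mu.h)^(n - k) # Closed-form version of the likelihood over the grid
# all.equal(joint.likelihood, likelihood.alt) # should be TRUE up to floating-point error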

The posterior

\[ p(p \mid y) \propto p(y \mid p) p(p) \]

unnormalized.joint.posterior <- joint.likelihood * prior
normalizing.constant <- 1 / sum(joint.likelihood * prior)

joint.posterior <- unnormalized.joint.posterior * normalizing.constant  # Has the same maximum as unnormalized.joint.posterior

Determine which value of p has the maximum posterior probability

# Find the grid value of p with the maximum posterior probability
mu.h[which.max(joint.posterior)]
## [1] 0.27
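As a sanity check, the grid MAP can be compared with the sample proportion of successes, which is the maximum-likelihood estimate of p; because the prior density is decreasing over [0, 1], it can only pull the posterior mode toward smaller values of p (or leave it at the same grid point). A minimal sketch:

mean(y)                           # Sample proportion of successes (the maximum-likelihood estimate)
mu.h[which.max(joint.likelihood)] # Grid value of p maximizing the likelihood alone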

Plot

plot(mu.h, joint.posterior / res, type = "l", ylab = "")
# lines(mu.h, normalized.likelihood / res, col = "red")
lines(mu.h, prior / res, col = "green")