Bayesian methods for Health Researchers : Assignment 2

Author
Affiliation

Bongani Ncube

University of the Witwatersrand (School of Public Health)

Published

May 22, 2025

Keywords

Posterior, Prior, Estimation, Bayesian Statistics, Biostatistics

Question 1: Poisson Distribution with Chi-Square Prior

a) What is the posterior distribution of λ, after observing a single value of x?

Step 1: Likelihood Function

If \(X \sim \text{Poisson}(\lambda)\), then the likelihood of observing \(x\) is:

\[ P(X = x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \]

Step 2: Prior Distribution

Given \(\lambda \sim \chi^2_4\), we can express this as a Gamma distribution:

\[ \lambda \sim \text{Gamma}\left(\alpha = 2, \beta = \tfrac{1}{2}\right) \] where \(\beta\) is the rate parameter.

Recall that a Chi-square distribution with \(k\) degrees of freedom is equivalent to:

\[ \chi^2_k \equiv \text{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right) \]

So the prior density is:

\[ \begin{align} \pi(\lambda) &= \frac{\beta^{\alpha}}{\Gamma(\alpha) } \lambda^{\alpha-1} e^{-\beta\lambda}\\ &=\frac{(1/2)^2}{\Gamma(2) } \lambda^{2-1} e^{-\lambda/2}\\ &= \frac{1}{4} \lambda e^{-\lambda/2} \end{align} \] since \(\Gamma(2)=1\)
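As a quick numerical check of this prior specification (a sketch using base R density functions), the \(\chi^2_4\) density and the Gamma(shape = 2, rate = 1/2) density agree on a grid:

# Sketch: check that the chi-square(4) prior equals the Gamma(shape = 2, rate = 1/2) density
lambda_grid <- seq(0.1, 20, by = 0.1)
max(abs(dchisq(lambda_grid, df = 4) - dgamma(lambda_grid, shape = 2, rate = 1/2)))
# expected to be essentially zero (floating-point error only)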

Step 3: Posterior Distribution

Using Bayes’ theorem, the unnormalized posterior is:

\[ \pi(\lambda \mid x) \propto P(x \mid \lambda) \cdot \pi(\lambda) = \left( \frac{\lambda^x e^{-\lambda}}{x!} \right) \cdot \left( \frac{1}{4} \lambda e^{-\lambda/2} \right) \]

Simplifying:

\[\begin{aligned} \pi(\lambda \mid x) &\propto\lambda^{x + 1} e^{- \frac{3}{2} \lambda}\\ &\propto\lambda^{x + 2-1} e^{- \frac{3}{2} \lambda}\\ \end{aligned} \]

This is the kernel of a Gamma distribution:

\[ \lambda \mid x \sim \text{Gamma}(x + 2, \text{rate} = \frac{3}{2}) \]

Final Answer

The posterior distribution of \(\lambda\), after observing \(x\), is:

\[ \boxed{\lambda \mid x \sim \text{Gamma}(x + 2, \text{rate} = \frac{3}{2})} \]

where \[\alpha^* = x+2 , \quad \beta^*=3/2\]
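As a sanity check (a sketch, not part of the derivation), the conjugate result can be verified numerically by normalizing likelihood times prior on a grid and comparing with the Gamma(x + 2, rate = 3/2) density; x = 5, the value observed in part b), is used purely for illustration:

# Sketch: numerical check of the Gamma(x + 2, rate = 3/2) posterior
x_obs  <- 5
lambda <- seq(0.001, 30, by = 0.001)
unnorm <- dpois(x_obs, lambda) * dchisq(lambda, df = 4)  # likelihood x prior
post   <- unnorm / sum(unnorm * 0.001)                   # normalize on the grid
max(abs(post - dgamma(lambda, shape = x_obs + 2, rate = 3/2)))
# expected to be small (grid approximation error only)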

b) Given the observed x = 5, compute the mean, median, and mode, and the 95% credible interval

\[ \begin{aligned} \lambda \mid x &\sim \text{Gamma}(5 + 2, \text{rate} = \frac{3}{2})\\ \lambda \mid x&\sim \text{Gamma}(7,\frac{3}{2}) \end{aligned} \]

\[\begin{aligned} Mean&=E(X)\\ &=\frac{\alpha^{*}}{\beta^{*}}\\ &=\frac{7}{\frac{3}{2}}\\ &\approx 4.67 \end{aligned}\]

\[\begin{aligned} Mode&=\frac{\alpha^{*}-1}{\beta^{*}}\\ &=\frac{7-1}{\frac{3}{2}}\\ &=4 \end{aligned}\]

Median using the approximation \(\text{median} \approx \frac{\alpha^{*}-1/3}{\beta^{*}}\):

\[\begin{aligned} median&=\frac{\alpha^{*}-\frac{1}{3}}{\beta^{*}}\\ &=\frac{7-\frac{1}{3}}{\frac{3}{2}}\\ &=4.444444 \end{aligned}\]

median using software:

qgamma(p = 0.5, shape = 7, rate = 1.5)
#> [1] 4.446425
library(bayesrules)
plot_gamma_poisson(shape = 2, rate = 1/2, sum_y = 5, n = 1)


# Summarizing the posterior (and the prior):

summarize_gamma_poisson(shape = 2, rate = 1/2, sum_y = 5, n = 1)
#> model     shape rate     mean mode      var       sd
#> prior         2  0.5 4.000000    2 8.000000 2.828427
#> posterior     7  1.5 4.666667    4 3.111111 1.763834
######################
## Credible Intervals
######################

# A 95% quantile-based credible interval :
round(qgamma(c(0.025,0.975), shape=7, rate=3/2),3)
#> [1] 1.876 8.706

Markov Chain Monte Carlo in JAGS (via rjags)

library(rjags)
library(coda)
library(ggplot2)
library(ggmcmc)  # For converting MCMC samples to tidy format

# Run your JAGS model (from previous code)
model <- textConnection("
  model {
    lambda ~ dgamma(2, 0.5)  # Prior: Gamma(2, 0.5) = Chi-square(4)
    X ~ dpois(lambda)         # Likelihood
  }
")
data <- list(X = 5)
jm <- jags.model(model, data = data, n.chains = 3)
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 1
#>    Unobserved stochastic nodes: 1
#>    Total graph size: 4
#> 
#> Initializing model
update(jm, 1000)
samples <- coda.samples(jm, variable.names = "lambda", n.iter = 10000)
summary(samples)
#> 
#> Iterations = 1001:11000
#> Thinning interval = 1 
#> Number of chains = 3 
#> Sample size per chain = 10000 
#> 
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#> 
#>           Mean             SD       Naive SE Time-series SE 
#>        4.67630        1.77569        0.01025        0.01007 
#> 
#> 2. Quantiles for each variable:
#> 
#>  2.5%   25%   50%   75% 97.5% 
#> 1.910 3.403 4.434 5.703 8.778
quantile(samples[[1]], c(0.025, 0.5, 0.975))
#>     2.5%      50%    97.5% 
#> 1.901777 4.434241 8.740291
# Convert to ggmcmc object

# Trace plot and diagnostics
plot(samples)  # Trace plot and density

gelman.diag(samples)  # R-hat
#> Potential scale reduction factors:
#> 
#>        Point est. Upper C.I.
#> lambda          1          1

Autocorrelation plots

samples_ggs <- ggs(samples)

autocorr_plot <- ggs_autocorrelation(samples_ggs, family = "lambda") +
  geom_hline(yintercept = 0, color = "darkred", linetype = "dashed") +
  labs(title = "Autocorrelation for λ",
       subtitle = "Ideal: Fast decay to zero (lag < 10)",
       x = "Lag",
       y = "Autocorrelation") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())

print(autocorr_plot)

Trace Plot

library(ggplot2)
samples_df <- as.data.frame(samples[[1]])
ggplot(samples_df, aes(x = 1:10000, y = lambda)) +
  geom_line(color = "blue") +
  labs(title = "Trace Plot for λ", x = "Iteration", y = "λ") +
  theme_minimal()

Question 2

(a) Explain what is meant by a conjugate prior distribution

Given a likelihood function \(p(y \mid \theta)\), if the posterior distribution \(p(\theta \mid y)\) belongs to the same family of probability distributions as the prior distribution \(p(\theta)\), then the prior and posterior are referred to as conjugate distributions with respect to that likelihood function. In this case, the prior is called a conjugate prior for the likelihood function \(p(y \mid \theta)\).

(b) Derive the posterior distribution for beliefs about p.

Let \(X_1, X_2,\cdots, X_n\) be iid Bernoulli(\(p\)). Then \(Y = \sum_{i=1}^n X_i\) is binomial(\(n,p\)). We assume the prior distribution of \(p\) is Beta(\(\alpha, \beta\)). From here we note that:

\[Prior: ~p\sim Beta(\alpha,\beta)\] so that \(f(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1} (1 - p)^{\beta-1}\)

also :

\[Likelihood~:Y|p \sim Binomial(n,p)\]

\[\begin{eqnarray*} % \nonumber to remove numbering (before each equation) f(y,p) &=& \left[\binom{n}{y}p^y(1-p)^{n-y}\right]\left[\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}\right] \\ &=& \binom{n}{y}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}p^{y+\alpha-1}(1-p)^{n-y+\beta -1}. \end{eqnarray*}\]

The marginal pdf of \(Y\) is

\[\begin{equation*} f(y) = \int_0^1f(y,p)\mathrm{d}p = \binom{n}{y}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}\frac{\Gamma(y + \alpha) \Gamma(n-y+\beta)}{\Gamma(n + \alpha + \beta)}, \end{equation*}\]

which is the beta-binomial distribution. The posterior distribution of \(p\) is given as follows;

\[\begin{equation*} \pi(p|y) = \frac{f(y,p)}{f(y)} = \frac{\Gamma(n + \alpha + \beta)}{\Gamma(y + \alpha) \Gamma(n-y+\beta)}p^{y+\alpha-1}(1-p)^{n-y+\beta -1}, \end{equation*}\]

which is \(Beta(y+\alpha, n-y + \beta)\).

By inspection one could easily note that the posterior can easily be derived by inspecting the function derived from the joint distribution:

\[ \begin{aligned} {\color{red}{Pr(p \mid k)}} & \propto {\color{blue}{\binom{n}{k}p^k(1-p)^{n-k}}} \; {\color{green}{p^{\alpha-1} (1 - p)^{\beta-1}}}\\ & \propto {p^{(\alpha+k)-1}} {(1-p)^{(\beta+n-k)-1}} \end{aligned} \]

That is:

\[ p \mid k \sim Beta(\alpha+k,\beta+n-k)\]
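The conjugate update can also be checked numerically (a sketch; the values of \(\alpha, \beta, n, k\) below are those used in part d)):

# Sketch: grid check of Beta-binomial conjugacy
alpha <- 3; beta <- 3; n <- 10; k <- 2
p_grid <- seq(0.001, 0.999, by = 0.001)
unnorm <- dbinom(k, n, p_grid) * dbeta(p_grid, alpha, beta)  # likelihood x prior
post   <- unnorm / sum(unnorm * 0.001)                       # normalize on the grid
max(abs(post - dbeta(p_grid, alpha + k, beta + n - k)))
# expected to be small, confirming p | k ~ Beta(alpha + k, beta + n - k)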

c) Show that if \(X \sim Beta(\alpha,\beta)\) with \(\alpha > 1\) then

\[E(\frac{1}{X})=\frac{\alpha+\beta-1}{\alpha-1}\]

Proof

\[ \begin{align} \text{Given that} \quad f(x) &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1} (1 - x)^{\beta-1}, \quad 0<x<1,\\ E\left(\frac{1}{X}\right)&= \int^{1}_{0}\frac{1}{x}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1} (1 - x)^{\beta-1}\,dx\\ &=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int^{1}_{0}x^{\alpha-1-1} (1 - x)^{\beta-1}\,dx\\ &=\frac{(\alpha+\beta-1)\Gamma(\alpha+\beta-1)}{(\alpha-1)\Gamma(\alpha-1)\Gamma(\beta)}\int^{1}_{0}x^{\alpha-1-1} (1 - x)^{\beta-1}\,dx \quad \text{since}\ \Gamma(n+1)=n\Gamma(n)\\ &=\frac{\alpha+\beta-1}{\alpha-1}\int^{1}_{0}\frac{\Gamma(\alpha-1+\beta)}{\Gamma(\alpha-1)\Gamma(\beta)}x^{(\alpha-1)-1} (1 - x)^{\beta-1}\,dx\\ &=\frac{\alpha+\beta-1}{\alpha-1}\int^{1}_{0}\frac{\Gamma(\alpha^{*}+\beta)}{\Gamma(\alpha^{*})\Gamma(\beta)}x^{\alpha^{*}-1} (1 - x)^{\beta-1}\,dx \quad \text{where}\ \alpha^{*} = \alpha-1\\ &=\frac{\alpha+\beta-1}{\alpha-1} \quad \text{since}\ \int^{1}_{0}\frac{\Gamma(\alpha^{*}+\beta)}{\Gamma(\alpha^{*})\Gamma(\beta)}x^{\alpha^{*}-1} (1 - x)^{\beta-1}\,dx =1 \end{align}\]
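A quick Monte Carlo check of this identity (a sketch; the shape values \(\alpha = 4\), \(\beta = 2\) are chosen purely for illustration):

# Sketch: simulate E(1/X) for X ~ Beta(4, 2) and compare with (alpha + beta - 1)/(alpha - 1)
set.seed(1)
alpha <- 4; beta <- 2
mean(1 / rbeta(1e6, alpha, beta))   # simulated expectation
(alpha + beta - 1) / (alpha - 1)    # analytic value = 5/3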

d) Compute Bayes’ posterior mean estimate in the specific case where \(α = β = 3, k = 2\),and \(n =10\) from your derivations from b):

i) manually and find the percentage contribution of the sample (credibility factor Z as a percentage)

The posterior mean for \(Beta(\alpha^{*},\beta^{*})\) is given by

\[ \begin{align} \hat{p}_B &=\frac{\alpha^{*}}{\alpha^{*}+\beta^{*}}\\ &= \frac{k+\alpha}{(\alpha + k)+(\beta + n-k)}\\ &= \frac{k+\alpha}{\alpha +\beta + n}. \end{align} \] We can write \(\hat{p}_B\) in the following way:

\[\hat{p}_B = \left(\frac{n}{\alpha +\beta +n}\right)\left(\frac{k}{n}\right) + \left(\frac{\alpha + \beta}{\alpha + \beta + n}\right)\left(\frac{\alpha}{\alpha + \beta}\right).\]

Thus \(\hat{p}_B\) is a linear combination of the prior mean and the sample mean. The weights are determined by the values of \(\alpha\), \(\beta\) and \(n\).

\[Z=\frac{n}{\alpha + \beta + n} ,\quad 1-Z= \frac{\alpha + \beta}{\alpha +\beta +n}\] \[Z=\frac{n}{\alpha + \beta + n}=\frac{10}{3+3+10} \times 100\%=62.5\%\]

alpha <- 3; beta <- 3; k <- 2; n <- 10

# Posterior mean
post_mean <- (alpha + k) / (alpha + beta + n)  # 0.3125

# Credibility factor
(Z <- n / (n + alpha + beta) * 100) # 62.5%
## [1] 62.5

Model specification

library(rjags)
library(coda)
library(bayesplot)

# Data
data_list <- list(n = 10, k = 2)

# Model 1: Original Prior (Beta(3,3))
model1 <- textConnection("
model {
  p ~ dbeta(3, 3)       # Original prior
  k ~ dbin(p, n)        # Likelihood
}
")

# Model 2: Expert Prior (Beta(2.25,6.75) for p=0.25)
model2 <- textConnection("
model {
  p ~ dbeta(2.25, 6.75) # Expert prior (mean=0.25)
  k ~ dbin(p, n)
}
")

Model fitting

# Fit Model 1
jm1 <- jags.model(model1, data = data_list, n.chains = 4)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 1
##    Total graph size: 4
## 
## Initializing model
update(jm1, 1000)  # Burn-in
samples1 <- coda.samples(jm1, variable.names = "p", n.iter = 10000)

# Fit Model 2
jm2 <- jags.model(model2, data = data_list, n.chains = 4)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 1
##    Total graph size: 5
## 
## Initializing model
update(jm2, 1000)
samples2 <- coda.samples(jm2, variable.names = "p", n.iter = 10000)
# Posterior summaries
summary(samples1)  # Original prior
## 
## Iterations = 2001:12000
## Thinning interval = 1 
## Number of chains = 4 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      0.3129095      0.1120814      0.0005604      0.0007275 
## 
## 2. Quantiles for each variable:
## 
##   2.5%    25%    50%    75%  97.5% 
## 0.1203 0.2303 0.3049 0.3873 0.5513
summary(samples2)  # Expert prior
## 
## Iterations = 2001:12000
## Thinning interval = 1 
## Number of chains = 4 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      0.2233034      0.0931844      0.0004659      0.0006328 
## 
## 2. Quantiles for each variable:
## 
##    2.5%     25%     50%     75%   97.5% 
## 0.07146 0.15410 0.21385 0.28219 0.42968

# WAIC comparison (requires the loo package)
# Note: for rjags output the pointwise log-likelihood matrix is built manually.
# There is a single (aggregated) binomial observation here, so the matrix has
# one column and one row per posterior draw.
loglik1 <- matrix(dbinom(data_list$k, data_list$n,
                         as.numeric(samples1[[1]][, 1]), log = TRUE), ncol = 1)
loglik2 <- matrix(dbinom(data_list$k, data_list$n,
                         as.numeric(samples2[[1]][, 1]), log = TRUE), ncol = 1)

library(loo)
waic1 <- waic(loglik1)
waic2 <- waic(loglik2)
loo_compare(waic1, waic2)

Compare convergence

# Trace plots
color_scheme_set("blue")
mcmc_trace(samples1) + ggtitle("Original Prior")


color_scheme_set("red")
mcmc_trace(samples2) + ggtitle("Expert Prior")


# Autocorrelation
mcmc_acf(samples1) + ggtitle("ACF: Original Prior")

mcmc_acf(samples2) + ggtitle("ACF: Expert Prior")

gelman.diag(samples1)
## Potential scale reduction factors:
## 
##   Point est. Upper C.I.
## p          1          1
effectiveSize(samples1)
##       p 
## 23777.5

Question 3

a) Use examples to demonstrate the inverse transformation method and also the acceptance rejection random number generation approach.

Inverse Transformation Method

  • Suppose we want to simulate a variable \(X\) with some distribution. We denote its cumulative distribution function (CDF) as: \[F_{X}(x) = \mathbb{P}( X \leq x) \]
  • The random variable \(Y\) defined as \(Y=F_{X}(X)\) has a \(U(0,1)\) distribution (the probability integral transform).
  • Conversely, if \(U \sim U(0,1)\), then \(F_{X}^{-1}(U)\) has the same distribution as \(X\).
Generating random numbers
  • If we know the inverse CDF of a continuous random variable \(X\), then we can generate random variates from the distribution of \(X\) by generating \(U(0,1)\) variates and transforming them using the inverse CDF.

Let’s see how we can use this method with a simple example. If we have \(X \sim Exp(\lambda)\) (an exponential distribution with parameter \(\lambda\)) we have that the cumulative function is: \[F(x) = 1 - e^{-\lambda x}\]

By inverting this function with some basic algebra we get: \[F^{-1}(u) = \frac{-1}{\lambda} ln(1-u) \]

if \(\lambda=3\) we have \[F^{-1}(u) = \frac{-1}{3} ln(1-u) \]

We can note that \(U\) and \(1-U\) have the same \(U(0,1)\) distribution, and so: \[ \frac{-1}{3} \ln(U) \sim F \]

Example

# Set parameters
N <- 10000    # Number of simulations
lam <- 3     # Rate parameter

# Set seed for reproducibility
set.seed(123)

# Sample uniform random variates
U <- runif(N)

# Apply probability integral transform
X <- -log(U)/lam

# Create analytic result
x_a <- seq(0, 2.5, length.out = 50)
fx_a <- lam * exp(-lam * x_a)

# Plot histogram of samples
hist(X, freq = FALSE, breaks = 50, col = "grey", 
     xlab = "x", ylab = "f(x)", 
     main = "Empirical Simulated Exponential Distribution Compared to Analytic Result",
     xlim = c(0, 2.5), ylim = c(0, 3))
lines(x_a, fx_a, col = "red", lwd = 2.5)
legend("topright", legend = c("Empirical", "Analytic"), 
       col = c("grey", "red"), lwd = c(NA, 2.5), 
       fill = c("grey", NA), border = c("black", NA))

Acceptance-Rejection Methods

  • There are many distributions for which the inverse transform method, and even other general transformations, fail to generate the required random variables.
  • Hence we turn to another class of methods: the acceptance-rejection methods. Again, these rely on taking (pseudo-random) uniform variates as a starting point.
  • The general concept is to “look” at each uniform pseudo-random number and decide whether we believe this could have been generated by the distribution we wish to sample from (and accept it) or not (and reject it).
Note

The Fundamental Theorem of Simulation:

  • Suppose we wish to sample from a density \(f(x)\) that is difficult to sample from directly, but we can sample easily from a density \(g(x)\) which envelopes \(f(x)\), i.e. \[f(x)\leq Mg(x)\] for all \(x\), where \(M\) is a constant.
  • Here we are saying that the height of \(g(x)\), or some scaled version of it \(Mg(x)\), must be at least that of \(f(x)\) everywhere.

Choose \(g(x)\) which is close to \(f(x)\), easy to sample from, and has the same support as the target \(f(x)\).

  1. Sample \(Y \sim g\) and \(U \sim U(0,1)\).
  2. Compute the acceptance ratio \(\frac{f(Y)}{Mg(Y)}\).
  3. Accept \(X=Y\) if \(U \leq \frac{f(Y)}{Mg(Y)}\); otherwise reject \(Y\).
  4. Return to step 1.

In general, accept \(X_{i}\) if \(U_i \leq \frac{f(x_i)}{Mg(x_i)}\) and reject otherwise; the accepted values then satisfy \(\underline{X}\sim f(x)\).

Example without code
  • Bongani wants to draw samples from a \(Beta(2,2)\) distribution, but his package does not have a beta random number generator; it does, however, have a uniform number generator.
  • He knows that for the beta distribution:

\[f(\pi)\propto \pi^{\alpha-1}(1-\pi)^{\beta-1}\]

  • We can analyse the structure of this function (find the maximum turning point) so that we know where to put our envelope.

\[ \begin{align} f(\pi)&= \pi^{\alpha-1}(1-\pi)^{\beta-1}\\ &= \pi^{2-1}(1-\pi)^{2-1}\\ &= \pi(1-\pi)\\ &= \pi-\pi^2\\ &differentiate~with~respect~to~\pi\\ f'(\pi)&= 1-2\pi \\ &equate~to~zero\\ 1-2\pi&=0\\ \pi&=0.5\\ \end{align}\]

  • The value of the function at its maximum, \(\pi=0.5\), is \(f(0.5)=0.25\). The image below shows what an envelope would look like:

We shall generate from an actual Beta distribution. This has density function: \[ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}} {B(\alpha,\beta)} \]

Where \(B\) is the Beta function defined: \[ B(\alpha ,\beta )={\frac {\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )}} \]

The Gamma function \(\Gamma\) being: \[ \Gamma (z)=\int _{0}^{\infty }x^{z-1}e^{-x}\,dx \]

Through differentiation we find the maximum value attained by this function is: \[ max_x (f(x)) = \frac{\left(\frac{\alpha-1}{\alpha+\beta-2}\right)^{\alpha-1}\left(1-\frac{\alpha-1}{\alpha+\beta-2}\right)^{\beta-1}} {B(\alpha,\beta)} = M \]

Assuming \(\alpha + \beta > 2\)

We can use this to sample from the Beta distribution using acceptance/rejection:

Summary

Practical idea

# Set parameters
ap <- 2
bt <- 2

# Define beta PDF function
beta_pdf <- function(x, a = ap, b = bt) {
  (x^(a-1) * (1-x)^(b-1)) / beta(a, b)
}

# Calculate maximum value
x_max <- (ap - 1) / (ap + bt - 2)
M <- beta_pdf(x_max)

# Number of samples
N <- 1000

# Generate samples
set.seed(123)  # For reproducibility
Y <- runif(N)
U <- runif(N, 0, M)

# Acceptance/rejection masks
mask_acc <- U < beta_pdf(Y)
mask_rej <- U >= beta_pdf(Y)  # Changed to >= for consistency

# Accepted and rejected samples
Y_acc <- Y[mask_acc]
U_acc <- U[mask_acc]

Y_rej <- Y[mask_rej]
U_rej <- U[mask_rej]

# Create analytic PDF
X <- seq(0, 1, length.out = 100)
PDF <- beta_pdf(X)

# Create plots
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Plot 1: Acceptance-Rejection Sampling
plot(Y_acc, U_acc, col = "pink", pch = 16, 
     xlab = "x", ylab = "f(x)", 
     main = "Acceptance-Rejection Sampling",
     xlim = c(0, 1), ylim = c(0, M + 0.05))
points(Y_rej, U_rej, col = "grey", pch = 16)
lines(X, PDF, col = "red", lwd = 2)
legend("topright", legend = c("Accepted", "Rejected", "PDF"), 
       col = c("pink", "grey", "red"), pch = c(16, 16, NA), lty = c(NA, NA, 1))

# Plot 2: Empirical Distribution
hist(Y_acc, freq = FALSE, col = "grey", 
     xlab = "x", ylab = "f(x)", 
     main = "Empirical Distribution",
     xlim = c(0, 1), ylim = c(0, M + 0.05))
lines(X, PDF, col = "red", lwd = 2)


# Reset plotting parameters
par(mfrow = c(1, 1))

(b) Gibbs sampling

Gibbs

  • The essence of the Gibbs algorithm is the sampling of parameters conditioned on the other parameters: \[P(\theta_1 \mid \theta_2, \dots, \theta_p)\]
  • For example, in a two-parameter situation \(\{\theta_1, \theta_2\}\) we may have independent prior distributions \(p(\theta_1, \theta_2) = p(\theta_1)p(\theta_2)\).
  1. Start at an initial point \(\theta^{(0)}\)
  2. For \(t= 1,\dots, T\) repeat the following steps
  3. For \(j=1,2,3,\dots,d\), update \(\theta_j\) by drawing \(\theta_j^{(t)} \sim f(\theta_j \mid \theta_{-j}, y)\), where \(\theta_{-j}\) holds the most recent values of the other components

More generally, define an initial vector \(\boldsymbol{\theta}^{(0)} \in \mathbb{R}^p\) such that \(P\left(\boldsymbol{\theta}^{(0)} \mid \mathbf{y} \right) > 0\). Then for \(t = 1, 2, \dots\) assign:

\[ \boldsymbol{\theta}^{(t)} = \begin{cases} \theta^{(t)}_1 & \sim P \left(\theta_1 \mid \theta^{(t-1)}_2, \dots, \theta^{(t-1)}_p \right) \\ \theta^{(t)}_2 & \sim P \left(\theta_2 \mid \theta^{(t)}_1, \theta^{(t-1)}_3, \dots, \theta^{(t-1)}_p \right) \\ & \vdots \\ \theta^{(t)}_p & \sim P \left(\theta_p \mid \theta^{(t)}_1, \dots, \theta^{(t)}_{p-1} \right) \end{cases} \]

Example

Suppose we have:

  • Data: Observations \(y_1, \dots, y_n\) from \(\mathcal{N}(\mu, \sigma^2)\) (with \(\sigma^2\) initially known)

  • Prior: \(\mu \sim \mathcal{N}(\mu_0, \tau_0^2)\)
  • Goal: Estimate the posterior \(p(\mu \mid \mathbf{y})\)

For Gibbs sampling, we’ll estimate both \(\mu\) and \(\sigma^2\) (now unknown).

Gibbs Sampling Steps

1. Define Conditional Distributions

  1. Conditional for \(\mu\) (given \(\sigma^2\)):

\[ \mu \mid \sigma^2, \mathbf{y} \sim \mathcal{N}\left( \frac{\frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\tau_0^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}}, \left( \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \right)^{-1} \right) \]

  2. Conditional for \(\sigma^2\) (given \(\mu\)): \[ \sigma^2 \mid \mu, \mathbf{y} \sim \text{Inverse-Gamma}\left( \frac{n}{2} + \alpha, \frac{\sum (y_i - \mu)^2}{2} + \beta \right) \] where \(\alpha, \beta\) are hyperparameters.

2. Gibbs Sampling Algorithm

  1. Initialize \(\mu^{(0)}\) and \(\sigma^{2(0)}\)
  2. For \(t = 1\) to \(T\):
    • Sample \(\mu^{(t)}\) from \(p(\mu \mid \sigma^{2(t-1)}, \mathbf{y})\)
    • Sample \(\sigma^{2(t)}\) from \(p(\sigma^2 \mid \mu^{(t)}, \mathbf{y})\)
  3. Discard burn-in samples
# Simulate data
set.seed(123)
n <- 50
true_mu <- 2
true_sigma2 <- 4
y <- rnorm(n, true_mu, sqrt(true_sigma2))

# Hyperparameters
mu0 <- 0      # Prior mean for mu
tau02 <- 100  # Prior variance for mu
alpha <- 1    # Prior shape for sigma2
beta <- 1     # Prior scale for sigma2

# Gibbs sampling
T <- 10000
mu_samples <- numeric(T)
sigma2_samples <- numeric(T)

# Initialize
mu_samples[1] <- mean(y)
sigma2_samples[1] <- var(y)

for (t in 2:T) {
  # Sample mu given sigma2
  post_var_mu <- 1 / (n/sigma2_samples[t-1] + 1/tau02)
  post_mean_mu <- post_var_mu * (n*mean(y)/sigma2_samples[t-1] + mu0/tau02)
  mu_samples[t] <- rnorm(1, post_mean_mu, sqrt(post_var_mu))
  
  # Sample sigma2 given mu
  post_alpha <- alpha + n/2
  post_beta <- beta + sum((y - mu_samples[t])^2)/2
  sigma2_samples[t] <- 1/rgamma(1, post_alpha, post_beta)
}

# Discard burn-in (first 20%)
burnin <- T/5
mu_final <- mu_samples[-(1:burnin)]
sigma2_final <- sigma2_samples[-(1:burnin)]
library(ggplot2)
ggplot(data.frame(iteration = 1:T, mu = mu_samples), aes(iteration, mu)) +
  geom_line(color = "blue") +
  labs(title = "Trace Plot for μ")



acf(mu_final, main = "Autocorrelation for μ")


mean(mu_final)    # Posterior mean for μ
## [1] 2.063447
quantile(mu_final, c(0.025, 0.975))  # 95% credible interval
##     2.5%    97.5% 
## 1.556372 2.578244

c) Proofs for any two of the following. If the random variables \(X_i\) are iid \(Exp(1)\), then three standard distributions can be derived, as shown below.

  • Gamma distribution: if \(X_1,X_2, \dots ,X_n\) are iid exponential random variables with parameter \(\lambda=1\), then: \[Y=\sum_{i=1}^n X_i\sim \Gamma{(n,1)}\]

Proof

MGF for a gamma is given by:

\[ \begin{align} M_X(t) &= E(e^{tX})\\ &=\int^{\infty}_{0}e^{tx}\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\,dx\\ &=\int^{\infty}_{0}\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-(\beta -t)x}\,dx\\ &=\frac{\beta^{\alpha}}{(\beta-t)^{\alpha}}\int^{\infty}_{0}\frac{(\beta-t)^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-(\beta -t)x}\,dx\\ &=\frac{\beta^{\alpha}}{(\beta-t)^{\alpha}} \quad \text{for } t<\beta\\ &=\left(\frac{\beta}{\beta-t}\right)^{\alpha} \end{align} \] If \(X\sim \exp(\lambda)\), then its MGF is given as follows:

\[ \begin{align} M_X(t)&=E(e^{tX})\\ &=\int^{\infty}_{0}e^{tx}\,\lambda e^{-\lambda x}\,dx\\ &=\int^{\infty}_{0}\lambda e^{-(\lambda-t)x}\,dx\\ &=\frac{\lambda}{\lambda-t}\int^{\infty}_{0}(\lambda-t) e^{-(\lambda-t)x}\,dx\\ &=\frac{\lambda}{\lambda-t} \quad \text{for } t<\lambda \end{align} \] Given that \(S_n=X_1+X_2+\cdots+X_n\), we show that the MGF of \(S_n\) is that of a gamma distribution:

\[ \begin{align} M_{S_n}(t)&=M_{X_1+X_2+\cdots +X_n}(t)\\ &=E\left(e^{t(X_1+X_2+\cdots +X_n)}\right)\\ &=E(e^{tX_1})\,E(e^{tX_2})\cdots E(e^{tX_n})\\ &= M_{X_1}(t)\,M_{X_2}(t)\cdots M_{X_n}(t)\\ &=[M_{X}(t)]^n \quad \text{due to the iid property}\\ &=\left(\frac{\lambda}{\lambda-t} \right)^n \end{align} \] For \(\lambda=1\):

\[ \begin{align} M_{S_n}(t)&=\left(\frac{1}{1-t} \right)^n \end{align} \]


Letting \(\beta=1\) and \(\alpha=n\), this matches the gamma MGF derived above, and thus \(S_n \sim \Gamma(n,1)\).
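A small simulation check of this result (a sketch; \(n = 5\) and 10,000 replicates are illustrative choices):

# Sketch: sums of n iid Exp(1) variables compared against the Gamma(n, 1) distribution
set.seed(42)
n <- 5
sums <- replicate(10000, sum(rexp(n, rate = 1)))
round(quantile(sums, c(0.25, 0.5, 0.75)), 2)                 # simulated quartiles
round(qgamma(c(0.25, 0.5, 0.75), shape = n, rate = 1), 2)    # theoretical Gamma(n, 1) quartiles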

  • Exponential Distribution: If \(X_1,X_2,...,X_n\) are iid exponential random variables with parameter \(\lambda = 1\), show that the minimum \(W = min(X_1,X_2,...,X_n)\) follows an exponential distribution with parameter \(n\)

We now derive the pdf of the minimum order statistic, given that \(f(x)=e^{-x}\).

  • First we derive the cdf of the minimum order statistic.

\[ \begin{aligned} F_{X_{(1)}}(x)&=P(X_{(1)}\le x)\\ &=1-P(X_{(1)}> x)\\ &=1-P(X_1>x,X_2>x,X_3>x,\cdots,X_n>x)\\ &=1-[P(X> x)]^n \quad \text{(iid assumption)}\\ &=1-[1-P(X\le x)]^n\\ &=1-[1-F(x)]^n \end{aligned} \]

  • for an exponential distribution with parameter \(\lambda\) we can find the cumulative distribution function as follows:

\[ \begin{aligned} F(x) &= \int^{x}_{-\infty}f(t)dt\\ &= \int^{x}_{0}e^{-t}dt\\ &=[-e^{-t}]^{x}_{0}\\ &=[-e^{-x}]-[-e^{-0}]\\ &=1-e^{-x} \end{aligned} \]

hence we plug that in the cdf for the minimum order statistics:

\[ \begin{aligned} F_{X_{(1)}}(x)&=1-[1-(1-e^{-x})]^n\\ &=1-e^{-nx}\\ f_{X_{(1)}}(x)&=\frac{d}{dx}F_{X_{(1)}}(x)\\ &=\frac{d}{dx}(1-e^{-nx})\\ &=0-(-ne^{-nx})\\ &=ne^{-nx} \quad \text{hence} \quad X_{(1)}\sim \exp(n) \end{aligned} \]

Recall that if \(X\sim \exp(\lambda)\) then \(f(x)=\lambda e^{-\lambda x}\)
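Again, a small simulation check (a sketch; \(n = 5\) and 10,000 replicates are illustrative choices):

# Sketch: minima of n iid Exp(1) variables compared against the Exp(n) distribution
set.seed(42)
n <- 5
mins <- replicate(10000, min(rexp(n, rate = 1)))
mean(mins)                                                   # expected to be close to 1/n = 0.2
round(quantile(mins, c(0.25, 0.5, 0.75)), 3)                 # simulated quartiles
round(qexp(c(0.25, 0.5, 0.75), rate = n), 3)                 # theoretical Exp(n) quartiles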

Question 4

a) Write down the posterior resulting from a Normal-Gamma prior and the resulting conjugate prior. Show the likelihood and prior and all the parameters that lead to the final posterior distribution.

\[Posterior:\ \sigma^2\mid\mu,X \sim \Gamma \left(\frac{n}{2}+\alpha,\ \beta+\frac{\sum(x_i-\mu)^2}{2}\right)\]

\[Prior:f(\sigma^2)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\]

\[Likelihood:L(X|\mu,\sigma^2)= \prod^{n}_{i=1}\frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x_i-\mu)^2}{2\sigma^2}}\]

deriving the posterior distribution:

\[ \begin{aligned} p(\mu,\sigma^2|X) &\propto L(X|\mu,\sigma^2)\times f(\sigma^2)\\ &\propto \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^ne^{\frac{-\sum(x_i-\mu)^2}{2\sigma^2}}\times\frac{\beta^{\alpha}}{\Gamma(\alpha)}(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\\ &\propto (\sigma^2)^{\frac{-n}{2}}\,e^{\frac{-\sum(x_i-\mu)^2}{2\sigma^2}}\,(\sigma^2)^{-(\alpha+1)}\,e^{-\beta/\sigma^2}\\ &\propto (\sigma^2)^{-(\frac{n}{2}+\alpha+1)}\,e^{-\frac{1}{\sigma^2}\left(\beta+\frac{\sum(x_i-\mu)^2}{2}\right)} \end{aligned} \]

b(i) Derive the marginal posterior distribution of the variance for the claims in general. Define all your terms before finally substituting the given values. For Normal-Gamma conjugacy, we have that:

1. the likelihood of observing x is given by:

\[ \begin{aligned} L(X|\mu,\sigma^2)&= \prod^{n}_{i=1}\frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x_i-\mu)^2}{2\sigma^2}}\\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^ne^{\frac{-\sum(x_i-\mu)^2}{2\sigma^2}}\\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^ne^{\frac{-\sum(x_i-\bar{x}+\bar{x}-\mu)^2}{2\sigma^2}}\\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^ne^{\frac{-\left[\sum(x_i-\bar{x})^2+2(\bar{x}-\mu)\sum(x_i-\bar{x})+n(\bar{x}-\mu)^2\right]}{2\sigma^2}} \\ &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^ne^{\frac{-\left[\sum(x_i-\bar{x})^2+n(\bar{x}-\mu)^2\right]}{2\sigma^2}} \quad \text{since}\ \sum(x_i-\bar{x})=0\\ \end{aligned} \]

2. Prior knowledge about \(\sigma^2\) is available, i.e. \(\sigma^2 \sim \Gamma(\alpha,\beta)\):

\[f(\sigma^2)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\] Deriving the joint posterior first:

\[ \begin{aligned} p(\mu,\sigma^2|X) &\propto L(X|\mu,\sigma^2)\times f(\sigma^2)\\ &\propto \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^ne^{\frac{-\left[\sum(x_i-\bar{x})^2+n(\bar{x}-\mu)^2\right]}{2\sigma^2}}\times\frac{\beta^{\alpha}}{\Gamma(\alpha)}(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\\ &\propto (\sigma^2)^{\frac{-n}{2}}\,(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\,e^{\frac{-\sum(x_i-\bar{x})^2}{2\sigma^2}}\,e^{\frac{-n(\bar{x}-\mu)^2}{2\sigma^2}}\\ \end{aligned} \]

Next, to find the marginal posterior for the variance, we integrate with respect to \(\mu\) (implicitly assuming a flat prior on \(\mu\)):

\[ \begin{aligned} p(\sigma^2|X)&= \int p(\mu,\sigma^2|X)\, d\mu\\ &\propto (\sigma^2)^{\frac{-n}{2}}\,(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\,e^{\frac{-\sum(x_i-\bar{x})^2}{2\sigma^2}}\int e^{\frac{-(\bar{x}-\mu)^2}{2\sigma^2/n}}\,d\mu\\ &\propto (\sigma^2)^{\frac{-n}{2}}\,(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\,e^{\frac{-\sum(x_i-\bar{x})^2}{2\sigma^2}}\,\sigma\sqrt{2\pi/n} \quad \text{since}\ \int e^{\frac{-(\bar{x}-\mu)^2}{2\sigma^2/n}}\,d\mu =\sqrt{2\pi\sigma^2/n}\\ &\propto (\sigma^2)^{\frac{-n}{2}}\,(\sigma^2)^{\frac{1}{2}}\,(\sigma^2)^{-(\alpha+1)}e^{-\beta/\sigma^2}\,e^{\frac{-(n-1)s^2}{2\sigma^2}} \quad \text{since}\ \sum(x_i-\bar{x})^2=(n-1)s^2\\ &\propto(\sigma^2)^{-\left(\frac{n-1}{2}+\alpha+1\right)}\,e^{-\frac{\beta+\frac{(n-1)s^2}{2}}{\sigma^2}} \end{aligned} \]

hence the marginal posterior for variance is given by:

\[\sigma^2|X \sim \Gamma \left(\frac{n-1}{2}+\alpha,\beta+\frac{(n-1)s^2}{2}\right)\]

Definition of Terms

  • \(\sigma^2\): Population variance (unknown parameter)
  • \(X\): Observed data \(\{x_1, \ldots, x_n\}\)
  • \(n\): Sample size
  • \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\): Sample variance
  • \(\alpha, \beta\): Parameters of the inverse-gamma prior (equivalent to \(\Gamma\) prior on \(1/\sigma^2\))

Plugging in the given prior values \(\alpha=200\) and \(\beta=300\):

\[\sigma^2|X \sim \Gamma \left(\frac{n-1}{2}+200,300+\frac{(n-1)s^2}{2}\right)\]
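A minimal sketch of how these posterior parameters could be summarized in R, using the fact (noted above) that this is an inverse-gamma posterior on \(\sigma^2\), i.e. a gamma posterior on \(1/\sigma^2\). The sample size n and sample variance s2 below are illustrative placeholders only; the actual values from the claims data should be substituted:

# Sketch: posterior summaries for sigma^2 with the plugged-in prior values
a0 <- 200; b0 <- 300                 # prior parameters given in the question
n  <- 30;  s2 <- 1.5                 # placeholder data values (substitute the real ones)
a_post <- (n - 1) / 2 + a0           # posterior shape
b_post <- b0 + (n - 1) * s2 / 2      # posterior rate (scale of the inverse-gamma)
b_post / (a_post - 1)                                        # posterior mean of sigma^2
1 / qgamma(c(0.975, 0.025), shape = a_post, rate = b_post)   # 95% credible interval for sigma^2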

Question 5

Bayesian model in Stata

use poissondata

*Log of person-years for offset*
gen log_py = log(personyears)

* Run Bayesian Poisson regression
bayes: poisson deaths smoke i.age, offset(log_py)

bayes, ///
    prior({deaths:_cons}, normal(0, 100)) ///
    prior({deaths:smoke}, normal(0, 100)) ///
    prior({deaths:age}, normal(0, 100)) ///
: poisson deaths smoke i.age, offset(log_py)

bayes, ///
    rseed(123) ///
    burnin(1000) ///
    mcmcsize(4000) ///
    nchains(4) ///
: poisson deaths smoke i.age, offset(log_py)

bayesstats summary

* Trace and density plots
bayesgraph diagnostics {deaths:smoke} 
graph export post1.png,replace

bayesgraph diagnostics {deaths: 2.age 3.age 4.age 5.age}
graph export post3.png,replace

levelsof age, local(levels)
foreach l of local levels {
    if `l' != 1 {  // Skip base category (1.age)
        bayesgraph diagnostics {deaths:`l'.age}, name(diag_`l', replace)
        graph export diag_age`l'.png, replace
    }
}
## Burn-in ...
## Simulation ...
## 
## Model summary
## ------------------------------------------------------------------------------
## Likelihood: 
##   deaths ~ poisson(xb_deaths)
## 
## Prior: 
##   {deaths:smoke i.age _cons} ~ normal(0,10000)                             (1)
## ------------------------------------------------------------------------------
## (1) Parameters are elements of the linear form xb_deaths.
## 
## Bayesian Poisson regression                      MCMC iterations  =     12,500
## Random-walk Metropolis–Hastings sampling         Burn-in          =      2,500
##                                                  MCMC sample size =     10,000
##                                                  Number of obs    =         10
##                                                  Acceptance rate  =      .2464
##                                                  Efficiency:  min =     .01924
##                                                               avg =     .02552
## Log marginal-likelihood = -75.624571                          max =     .03181
##  
## ------------------------------------------------------------------------------
##              |                                                Equal-tailed
##       deaths |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
## -------------+----------------------------------------------------------------
##        smoke |  .3521283   .1084005   .006078   .3546537   .1489718   .5658104
##              |
##          age |
##       45-54  |   1.49149   .1946732   .013833   1.484139   1.131913   1.887716
##       55-64  |  2.647907   .1827394   .013174    2.64017   2.307437   3.051508
##       65-74  |  3.368698   .1864861   .012197   3.362579   3.025836   3.769044
##       75-84  |  3.712312   .1973451   .011946   3.711851   3.338273   4.134095
##              |
##        _cons | -8.287494   .2669905   .015025  -8.291325  -8.821961  -7.780908
## ------------------------------------------------------------------------------
## Note: Variable log_py is included in the model as the offset.
## Note: Default priors are used for model parameters.
## 
##   
## Burn-in ...
## Simulation ...
## 
## Model summary
## ------------------------------------------------------------------------------
## Likelihood: 
##   deaths ~ poisson(xb_deaths)
## 
## Priors: 
##   {deaths:_cons smoke} ~ normal(0,100)                                     (1)
##         {deaths:i.age} ~ normal(0,10000)                                   (1)
##           {deaths:age} ~ normal(0,100)
## ------------------------------------------------------------------------------
## (1) Parameters are elements of the linear form xb_deaths.
## 
## Bayesian Poisson regression                      MCMC iterations  =     12,500
## Random-walk Metropolis–Hastings sampling         Burn-in          =      2,500
##                                                  MCMC sample size =     10,000
##                                                  Number of obs    =         10
##                                                  Acceptance rate  =      .2818
##                                                  Efficiency:  min =    .005653
##                                                               avg =     .01223
## Log marginal-likelihood = -71.441567                          max =     .03652
##  
## ------------------------------------------------------------------------------
##              |                                                Equal-tailed
##       deaths |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
## -------------+----------------------------------------------------------------
##        smoke |  .3462018   .1091205   .009213   .3437231   .1406685   .5704558
##              |
##          age |
##       45-54  |  1.376838   .1818841   .022073   1.371683   1.007062   1.733001
##       55-64  |  2.595201   .1714622   .022806   2.584071   2.281989   2.949362
##       65-74  |  3.259247   .1717166   .021148   3.247875   2.960456   3.617496
##       75-84  |  3.671238   .1836642   .021791   3.672013   3.292856   4.017376
##              |
##        _cons | -8.202599   .2510759   .026623    -8.1901  -8.706632  -7.756934
##          age |   .690597   9.899977   .518051   .5576908  -18.79259   20.23102
## ------------------------------------------------------------------------------
## Note: Variable log_py is included in the model as the offset.
## Note: Default priors are used for some model parameters.
## 
##   
## Chain 1
##   Burn-in ...
##   Simulation ...
##   
## Chain 2
##   Burn-in ...
##   Simulation ...
##   
## Chain 3
##   Burn-in ...
##   Simulation ...
##   
## Chain 4
##   Burn-in ...
##   Simulation ...
## 
## Model summary
## ------------------------------------------------------------------------------
## Likelihood: 
##   deaths ~ poisson(xb_deaths)
## 
## Prior: 
##   {deaths:smoke i.age _cons} ~ normal(0,10000)                             (1)
## ------------------------------------------------------------------------------
## (1) Parameters are elements of the linear form xb_deaths.
## 
## Bayesian Poisson regression                   Number of chains    =          4
## Random-walk Metropolis–Hastings sampling      Per MCMC chain:
##                                                   Iterations      =      5,000
##                                                   Burn-in         =      1,000
##                                                   Sample size     =      4,000
##                                               Number of obs       =         10
##                                               Avg acceptance rate =      .2262
##                                               Avg efficiency: min =    .009659
##                                                               avg =     .01542
##                                                               max =     .02122
## Avg log marginal-likelihood = -82.938672      Max Gelman–Rubin Rc =      7.182
##  
## ------------------------------------------------------------------------------
##              |                                                Equal-tailed
##       deaths |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
## -------------+----------------------------------------------------------------
##        smoke |  .4303658   .1391096   .007549   .4289995   .1707217   .6865947
##              |
##          age |
##       45-54  |  1.998429    .999939   .059009   1.577177   1.131391   3.676881
##       55-64  |  3.094335   .9058428   .051194   2.725766   2.301714   4.662266
##       65-74  |  3.864905   .9757828   .069524   3.459085   3.051721   5.506548
##       75-84  |  4.183364   .9209916   .066947   3.823151   3.328629   5.750522
##              |
##        _cons | -8.904458   1.051759   .084603  -8.509256   -10.7048  -7.873415
## ------------------------------------------------------------------------------
## Note: Variable log_py is included in the model as the offset.
## Note: Default priors are used for model parameters.
## Note: Default initial values are used for multiple chains.
## Note: Adaptation continues during simulation.
## Note: There is a high autocorrelation after 500 lags in at least one of the
##       chains.
## 
## 
## Posterior summary statistics                      Number of chains =         4
##                                                   MCMC sample size =    16,000
##  
## ------------------------------------------------------------------------------
##              |                                                Equal-tailed
##       deaths |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
## -------------+----------------------------------------------------------------
##        smoke |  .4303658   .1391096   .007549   .4289995   .1707217   .6865947
##              |
##          age |
##       45-54  |  1.998429    .999939   .059009   1.577177   1.131391   3.676881
##       55-64  |  3.094335   .9058428   .051194   2.725766   2.301714   4.662266
##       65-74  |  3.864905   .9757828   .069524   3.459085   3.051721   5.506548
##       75-84  |  4.183364   .9209916   .066947   3.823151   3.328629   5.750522
##              |
##        _cons | -8.904458   1.051759   .084603  -8.509256   -10.7048  -7.873415
## ------------------------------------------------------------------------------
## 
## 
## file post1.png saved as PNG format
## 
## 
## file post3.png saved as PNG format
## 
## 1 2 3 4 5
## 
## file diag_age2.png saved as PNG format
## file diag_age3.png saved as PNG format
## file diag_age4.png saved as PNG format
## file diag_age5.png saved as PNG format

Fitting the models in Stan and JAGS

Poisson Model dataset

library(dplyr)  # needed for mutate()
(data <- MLGdata::Britishdoc |>
  mutate(person.years = ifelse(person.years == 18793, 18790, person.years)))
age smoke person.years deaths
35-44 n 18790 2
35-44 y 52407 32
45-54 n 10673 12
45-54 y 43248 104
55-64 n 5710 28
55-64 y 28612 206
65-74 n 2585 28
65-74 y 12663 186
75-84 n 1462 31
75-84 y 5317 102

Poisson Model STAN

  • The brms package uses the Stan Bayesian software in the backend
library(brms)

# Convert age to factor if not already
data$age <- factor(data$age)

# Fit model
brm_fit <- brm(
  deaths ~ smoke + age + offset(log(person.years)),
  data = data,
  family = poisson(),
  prior = c(
    # note: in brms/Stan, normal(mean, sd) is parameterized by the standard
    # deviation (unlike JAGS dnorm, which uses precision), so normal(0, 0.04)
    # is a very tight prior
    set_prior("normal(0, 0.04)", class = "Intercept"),
    set_prior("normal(0, 0.04)", class = "b")
  ),
  chains = 4,
  iter = 4000,
  warmup = 1000,
  thin = 1, seed = 678,
  control = list(adapt_delta = 0.95)
)
## Running "C:/PROGRA~1/R/R-44~1.3/bin/x64/Rcmd.exe" SHLIB foo.c
## using C compiler: 'gcc.exe (GCC) 13.3.0'
## gcc  -I"C:/PROGRA~1/R/R-44~1.3/include" -DNDEBUG   -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/Rcpp/include/"  -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/RcppEigen/include/"  -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/RcppEigen/include/unsupported"  -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/BH/include" -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/StanHeaders/include/src/"  -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/StanHeaders/include/"  -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/RcppParallel/include/" -DRCPP_PARALLEL_USE_TBB=1 -I"C:/Users/r1953/AppData/Local/R/win-library/4.4/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -D_HAS_AUTO_PTR_ETC=0  -include "C:/Users/r1953/AppData/Local/R/win-library/4.4/StanHeaders/include/stan/math/prim/fun/Eigen.hpp"  -std=c++1y    -I"C:/RBuildTools/4.4/x86_64-w64-mingw32.static.posix/include"     -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign  -c foo.c -o foo.o
## cc1.exe: warning: command-line option '-std=c++14' is valid for C++/ObjC++ but not for C
## In file included from C:/Users/r1953/AppData/Local/R/win-library/4.4/RcppEigen/include/Eigen/Core:19,
##                  from C:/Users/r1953/AppData/Local/R/win-library/4.4/RcppEigen/include/Eigen/Dense:1,
##                  from C:/Users/r1953/AppData/Local/R/win-library/4.4/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22,
##                  from <command-line>:
## C:/Users/r1953/AppData/Local/R/win-library/4.4/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: cmath: No such file or directory
##   679 | #include <cmath>
##       |          ^~~~~~~
## compilation terminated.
## make: *** [C:/PROGRA~1/R/R-44~1.3/etc/x64/Makeconf:289: foo.o] Error 1
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 4e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.4 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 1: Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 1: Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 1: Iteration: 1001 / 4000 [ 25%]  (Sampling)
## Chain 1: Iteration: 1400 / 4000 [ 35%]  (Sampling)
## Chain 1: Iteration: 1800 / 4000 [ 45%]  (Sampling)
## Chain 1: Iteration: 2200 / 4000 [ 55%]  (Sampling)
## Chain 1: Iteration: 2600 / 4000 [ 65%]  (Sampling)
## Chain 1: Iteration: 3000 / 4000 [ 75%]  (Sampling)
## Chain 1: Iteration: 3400 / 4000 [ 85%]  (Sampling)
## Chain 1: Iteration: 3800 / 4000 [ 95%]  (Sampling)
## Chain 1: Iteration: 4000 / 4000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.033 seconds (Warm-up)
## Chain 1:                0.091 seconds (Sampling)
## Chain 1:                0.124 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 6e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 2: Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 2: Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 2: Iteration: 1001 / 4000 [ 25%]  (Sampling)
## Chain 2: Iteration: 1400 / 4000 [ 35%]  (Sampling)
## Chain 2: Iteration: 1800 / 4000 [ 45%]  (Sampling)
## Chain 2: Iteration: 2200 / 4000 [ 55%]  (Sampling)
## Chain 2: Iteration: 2600 / 4000 [ 65%]  (Sampling)
## Chain 2: Iteration: 3000 / 4000 [ 75%]  (Sampling)
## Chain 2: Iteration: 3400 / 4000 [ 85%]  (Sampling)
## Chain 2: Iteration: 3800 / 4000 [ 95%]  (Sampling)
## Chain 2: Iteration: 4000 / 4000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.036 seconds (Warm-up)
## Chain 2:                0.093 seconds (Sampling)
## Chain 2:                0.129 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 6e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 3: Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 3: Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 3: Iteration: 1001 / 4000 [ 25%]  (Sampling)
## Chain 3: Iteration: 1400 / 4000 [ 35%]  (Sampling)
## Chain 3: Iteration: 1800 / 4000 [ 45%]  (Sampling)
## Chain 3: Iteration: 2200 / 4000 [ 55%]  (Sampling)
## Chain 3: Iteration: 2600 / 4000 [ 65%]  (Sampling)
## Chain 3: Iteration: 3000 / 4000 [ 75%]  (Sampling)
## Chain 3: Iteration: 3400 / 4000 [ 85%]  (Sampling)
## Chain 3: Iteration: 3800 / 4000 [ 95%]  (Sampling)
## Chain 3: Iteration: 4000 / 4000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.035 seconds (Warm-up)
## Chain 3:                0.094 seconds (Sampling)
## Chain 3:                0.129 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 6e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.06 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 4: Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 4: Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 4: Iteration: 1001 / 4000 [ 25%]  (Sampling)
## Chain 4: Iteration: 1400 / 4000 [ 35%]  (Sampling)
## Chain 4: Iteration: 1800 / 4000 [ 45%]  (Sampling)
## Chain 4: Iteration: 2200 / 4000 [ 55%]  (Sampling)
## Chain 4: Iteration: 2600 / 4000 [ 65%]  (Sampling)
## Chain 4: Iteration: 3000 / 4000 [ 75%]  (Sampling)
## Chain 4: Iteration: 3400 / 4000 [ 85%]  (Sampling)
## Chain 4: Iteration: 3800 / 4000 [ 95%]  (Sampling)
## Chain 4: Iteration: 4000 / 4000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.033 seconds (Warm-up)
## Chain 4:                0.097 seconds (Sampling)
## Chain 4:                0.13 seconds (Total)
## Chain 4:

# Summary and diagnostics
summary(brm_fit)
##  Family: poisson 
##   Links: mu = log 
## Formula: deaths ~ smoke + age + offset(log(person.years)) 
##    Data: data (Number of observations: 10) 
##   Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
##          total post-warmup draws = 12000
## 
## Regression Coefficients:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -3.75      0.03    -3.81    -3.70 1.00    14341     8536
## smokey       -0.51      0.03    -0.57    -0.46 1.00    14190    10102
## age45M54     -0.19      0.03    -0.25    -0.13 1.00    17261     8728
## age55M64      0.14      0.03     0.08     0.20 1.00    17112     8427
## age65M74      0.48      0.03     0.42     0.55 1.00    16093     9092
## age75M84      0.64      0.03     0.57     0.71 1.00    15625     9183
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(brm_fit)

pp_check(brm_fit)

Poisson Model JAGS

  • The rjags package uses the JAGS Bayesian software in the backend
library(rjags)
jags_code <- "
model {
  for (i in 1:N) {
    deaths[i] ~ dpois(mu[i])
    log(mu[i]) <- log(person_years[i]) + beta0 + beta_smoke*smoke[i] + beta_age[age[i]]
  }
  
  # Reference group (age category 1)
  beta_age[1] <- 0
  
  # Priors
  beta0 ~ dnorm(0, 0.01)
  beta_smoke ~ dnorm(0, 0.04)
  for (k in 2:5) {
    beta_age[k] ~ dnorm(0, 0.04)
  }
  
  # Derived quantities
  IRR_smoke <- exp(beta_smoke)
  for (k in 2:5) {
    IRR_age[k] <- exp(beta_age[k])
  }
}
"

jags_data <- list(
  N = nrow(data),
  deaths = data$deaths,
  person_years = data$person.years,
  smoke = as.numeric(data$smoke == "y"),   # JAGS needs numeric data: 0 = non-smoker, 1 = smoker
  age = as.integer(factor(data$age))       # integer index 1-5 used to select beta_age[age[i]]
)

jags_fit <- jags.model(textConnection(jags_code), 
                      data = jags_data,
                      n.chains = 4, 
                      n.adapt = 1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 10
##    Unobserved stochastic nodes: 6
##    Total graph size: 87
## 
## Initializing model
update(jags_fit, 3000)
samples <- coda.samples(jags_fit, 
                       c("beta_smoke","beta_age","IRR_smoke","IRR_age"), 
                       n.iter = 10000, 
                       thin = 2)
library(coda)
plot(samples,
     density = FALSE)

par(mfrow = c(1,2))
autocorr.plot(samples)

summary(samples)
## 
## Iterations = 4002:14000
## Thinning interval = 2 
## Number of chains = 4 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                Mean     SD Naive SE Time-series SE
## IRR_age[2]   4.4520 0.8676 0.006135       0.029480
## IRR_age[3]  13.9745 2.5551 0.018067       0.094903
## IRR_age[4]  28.7685 5.2726 0.037283       0.192853
## IRR_age[5]  40.7859 7.7873 0.055065       0.271373
## IRR_smoke    1.4377 0.1519 0.001074       0.006562
## beta_age[1]  0.0000 0.0000 0.000000       0.000000
## beta_age[2]  1.4750 0.1906 0.001348       0.006409
## beta_age[3]  2.6212 0.1783 0.001261       0.006399
## beta_age[4]  3.3431 0.1786 0.001263       0.006420
## beta_age[5]  3.6908 0.1865 0.001319       0.006395
## beta_smoke   0.3575 0.1052 0.000744       0.004567
## 
## 2. Quantiles for each variable:
## 
##                2.5%     25%    50%     75%   97.5%
## IRR_age[2]   3.0607  3.8348  4.354  4.9646  6.4050
## IRR_age[3]   9.8505 12.1452 13.671 15.4678 19.7707
## IRR_age[4]  20.3394 24.9984 28.155 31.8616 40.5896
## IRR_age[5]  28.2888 35.2270 39.850 45.3397 58.2964
## IRR_smoke    1.1679  1.3325  1.426  1.5346  1.7570
## beta_age[1]  0.0000  0.0000  0.000  0.0000  0.0000
## beta_age[2]  1.1186  1.3441  1.471  1.6023  1.8571
## beta_age[3]  2.2875  2.4969  2.615  2.7388  2.9842
## beta_age[4]  3.0126  3.2188  3.338  3.4614  3.7035
## beta_age[5]  3.3425  3.5618  3.685  3.8142  4.0655
## beta_smoke   0.1552  0.2871  0.355  0.4283  0.5636

(c) Bayesian Formulation

Model Specification

The Poisson regression model has the following structure:

\[\begin{align*} \text{Deaths}_i &\sim \text{Poisson}(\lambda_i) \\ \log(\lambda_i) &= \beta_0 + \beta_{\text{smoke}} \text{smoke}_i + \sum_{k=1}^K \beta_{\text{age}_k} \text{age}_{ik} + \log(\text{person.years}_i) \end{align*}\]

where:

  • \(\text{age}_{ik}\) is an indicator for the \(k\)-th age category
  • \(K\) is the number of age categories
  • \(\text{person.years}_i\) is the offset

Likelihood and Priors

The complete Bayesian formulation is:

\[\begin{align*} \text{Likelihood:} & \\ &\prod_{i=1}^n \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \quad y_i = \text{deaths}_i \\ \\ \text{Priors:} & \\ \beta_0 &\sim \mathcal{N}(0, 0.04) \\ \beta_{\text{smoke}} &\sim \mathcal{N}(0, 0.04) \\ \beta_{\text{age}_k} &\sim \mathcal{N}(0, 0.04) \quad \text{for } k = 1,\ldots,K \\ \end{align*}\]

Joint Distribution Factorization

The joint posterior distribution factors as:

\[\begin{align*} p(\beta_0, \beta_{\text{smoke}}, \boldsymbol{\beta}_{\text{age}} \mid \mathbf{y}) &\propto \left[\prod_{i=1}^n \text{Poisson}(y_i \mid \lambda_i)\right] \\ &\times \mathcal{N}(\beta_0 \mid 0, 0.04) \\ &\times \mathcal{N}(\beta_{\text{smoke}} \mid 0, 0.04) \\ &\times \prod_{k=1}^K \mathcal{N}(\beta_{\text{age}_k} \mid 0, 0.04) \end{align*}\]

Conjugacy Analysis

Non-conjugate components:

  • The Poisson likelihood with normal priors on the log-linear parameters is not conjugate

Potentially conjugate components:

  • If using Gamma priors for the Poisson rate parameters, we would have conjugacy
  • In this model with normal priors on the coefficients, all full conditionals require Metropolis-Hastings or Hamiltonian Monte Carlo updates

Full Conditionals

The model requires MCMC sampling as no closed-form conditionals exist:

\[ \begin{align} p(\beta_0 \mid \cdot) &\propto \exp\left\{ \sum_{i=1}^n \left( y_i\eta_i - e^{\eta_i} \right) - \frac{\beta_0^2}{0.08} \right\} \\ p(\beta_{\text{smoke}} \mid \cdot) &\propto \exp\left\{ \sum_{i=1}^n \left( y_i\eta_i - e^{\eta_i} \right) - \frac{\beta_{\text{smoke}}^2}{0.08} \right\} \\ p(\beta_{\text{age}_k} \mid \cdot) &\propto \exp\left\{ \sum_{i:x_{\text{age}_k,i}=1} \left( y_i\eta_i - e^{\eta_i} \right) - \frac{\beta_{\text{age}_k}^2}{0.08} \right\} \end{align} \]
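To make this concrete, below is a minimal sketch of a random-walk Metropolis update for one of these full conditionals (\(\beta_{\text{smoke}}\)), holding the other coefficients fixed, using the data from the table above. The step size, starting value, and the fixed values of the other coefficients are illustrative assumptions only, not part of the fitted models above:

# Sketch: random-walk Metropolis update for beta_smoke, other coefficients held fixed
deaths <- c(2, 32, 12, 104, 28, 206, 28, 186, 31, 102)
smoke  <- rep(c(0, 1), 5)                      # n = 0, y = 1 within each age band
age    <- rep(1:5, each = 2)                   # age band index 1-5
py     <- c(18790, 52407, 10673, 43248, 5710, 28612, 2585, 12663, 1462, 5317)

log_cond <- function(b_smoke, b0, b_age) {
  eta <- b0 + b_smoke * smoke + b_age[age] + log(py)      # linear predictor with offset
  sum(deaths * eta - exp(eta)) - b_smoke^2 / (2 * 0.04)   # Poisson log-likelihood + normal(0, 0.04) log-prior
}

set.seed(1)
b0 <- -8; b_age <- c(0, 1.5, 2.6, 3.4, 3.7)   # other coefficients fixed at illustrative values
b  <- numeric(5000); b[1] <- 0
for (t in 2:5000) {
  prop  <- b[t - 1] + rnorm(1, 0, 0.1)                     # random-walk proposal
  log_r <- log_cond(prop, b0, b_age) - log_cond(b[t - 1], b0, b_age)
  b[t]  <- if (log(runif(1)) < log_r) prop else b[t - 1]   # Metropolis accept/reject
}
mean(b[-(1:1000)])   # approximate conditional posterior mean of beta_smoke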