A parametric family of probability distributions.
Linear form \(\beta^T X\) and mean response \(\theta\) are linked.
Link function has a closed form and is invertible.
Solve nonlinear optimization problem for \(\hat{\beta}\).
12/02/2018
A parametric family of probability distributions.
Linear form \(\beta^T X\) and mean response \(\theta\) are linked.
Link function has a closed form and is invertible.
Solve nonlinear optimization problem for \(\hat{\beta}\).
An exponential family is a family of probability density functions.
The density functions have the form \[p(y|\theta) = h(y)\exp\left\lbrace g(\theta)^{T}\phi(y)-\tilde{A}(\theta)\right\rbrace,\] where \(h(y)\) is the ancillary statistic, \(\tilde{A}(\theta)\) is a normalizer \[\exp\{\tilde{A}(\theta)\}=\int_{\Omega}h(y)\exp\left\lbrace g(\theta)^{T}\phi(y)\right\rbrace dy\] \(\phi(y)\) is the sufficient statistic such that conditioning on \(\phi(y)\), \(\theta\) is independent of \(Y\), or \(\theta\) and \(Y \mid\phi(y)\) are independent.
All GLM distributions are in (a subset of) the exponential family.
The canonical exponential family:\[ p(y|\eta)=h(y)\exp(\eta^{T}\phi(y)-A(\eta))= \frac{h(y) \exp(\eta^{T}\phi(y))}{\int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy}.\] where \(\exp\{A(\eta)\}=\int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy.\)
Link function and canonical exponential family: \[\eta = g(\theta) \mbox{ and } \theta = f(\eta).\]
A Poisson distribution: \(p(y|\theta) = \frac{\theta^{y}e^{-\theta}}{y!}\) has its expression \[p(y|\theta) = \frac{1}{y!} \exp\{y\log\theta -\theta\}\] where \(\eta = \log \theta\) , \(\phi(y)=y\), \(A(\eta) = \theta = e^{\eta}\), \(h(y)=(y!)^{-1}\).
The inverse link function is indeed \[\theta = e^{\eta}.\]
A Bernoulli distribution: \(p(y|\theta)=\theta^{y}(1-\theta)^{1-y}\) has its expression \[\exp\left\{y \log\left( \frac{\theta}{1-\theta} \right) + \log(1-\theta)\right\}\] where \(\eta = \log \frac{\theta}{1-\theta}\), \(\phi(y) = y\), \(A(\eta) = \log(1-\theta) = \log(1+e^{\eta})\), \(h(y)=1\).
The inverse link function is \[\theta = \frac{1}{1 + e^{-\eta}}.\]
A multinomial distribution has the form \[\begin{aligned}p(\mathbf{y}|\theta)&= \prod_{k=1}^{K}\theta_{k}^{y_{k}}=\exp\Big(\sum_{k=1}^{K}y_{k}\log\theta_{k}\Big) =\exp\Big(\sum_{k=1}^{K}y_{k}\eta_{k}\Big)=e^{\langle y,\eta\rangle}\end{aligned}.\]
By using the constraint \(\sum_{k=1}^{K}\theta_{k}=1\), we have \[\exp\{A(\eta)\}=\sum_{y\in\{0,1\}^{K}}\exp\Big(\sum_{k=1}^{K}y_{k}\eta_{k}\Big)=\sum_{k=1}^{K}e^{\eta_{k}}=1\] so it gives the usual form of multinomial \[\theta_{k}=\frac{e^{\eta_{k}}}{\sum_{k=1}^{K}e^{\eta_{k}}}\] which is known as the multinomial logit or softmax function.
The exponential family has appealing computational properties.
Mean and variance are the first and second cumulants of a distribution. In the exponential family, they can be obtained by calculating derivatives of \(A(\eta)\).
\(A(\eta)\) is the cumulant function, \[\begin{aligned}\frac{\partial A(\eta)}{\partial\eta^{T}} & =\frac{\partial}{\partial \eta^{T} }\left\{ \log \int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy\right\} \\ & = \frac{\int\phi(y)h(y)\exp\{\eta^T\phi(y)\}dy}{\int h(y) \exp\{\eta^T\phi(y)\}dx}\\ & =\int\phi(y) h(y) \exp\left\lbrace \eta^{T}\phi(y)-A(\eta)\right\rbrace dy= \mathbb{E}[\phi(Y)]. \end{aligned}\]
Second derivative of \(A(\eta)\): \[\begin{aligned}\frac{\partial^{2} A(\eta)}{\partial\eta\eta^{T}} & =\int \phi(y) \left(\phi(y)-\frac{\partial A(\eta)}{\partial \eta^{T}}\right)^{T} \exp\left\lbrace \eta^{T}\phi(y)-A(\eta)\right\rbrace h(y)dy \\ & =\mathbb{E}[\phi(Y)\phi(Y)^T] - \mathbb{E}[\phi(Y)]\mathbb{E}[\phi(Y)]^T = \mbox{Cov}[\phi(Y)]. \end{aligned}\]
Higher-order cumulants of \(\phi(y)\) can be obtained by taking higher-order derivatives of \(A(\eta)\).
\(A(\eta)\) is convex as \(\partial^{2} A(\eta)/\partial \eta \eta^T = \mbox{Cov}[\phi(Y)]\) is positive definite (given \(\eta\) is in a convex set).
The likelihood of canonical exponential family is convex \[\ell(\eta) =\sum_{i=1}^{N}\log p(y_i | \eta) = \sum_{i=1}^{N} [\log h(y_i) + \eta^T\phi(y_i) - A(\eta)].\]
Convexity: unique global ML solution for \(\eta\). MLE satisfies \[ \ell'(\hat{\eta}) = \sum_{i=1}^{N} \phi(y_i) - N A'(\hat{\eta}) = 0\] or \(A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\). Optimization is feasible.
The standard Netwon's optimizier is feasible for \[A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\] by using additional information (2nd order derivative - Hessian) of the likelihood. It will improve the computational efficiency.
Statistics \(\mbox{Cov}[\phi(Y)]\) assist to attain the Hessian directly.
MLE intends to set up the equivalence between a finite sample average and population mean which in principle is only asymptotically valid.
Result: MLE does not provide a good estimate \(\hat{\eta}\) for \(\eta\).
The estimate error is propogating when one estimates \(\beta\) using \(\hat{\eta}\).
\(X\in\mathbb{R}^{d}\)
\(Y\in\lbrace0,1\rbrace\) and \(Y\sim \mbox{Ber}(\theta)\).
Logistic link function:\[\theta= f(\eta) = \frac{1}{1+e^{-\eta}}.\]
Conditional distribution: \[\Pr(Y=y|X=\mathbf{x})=\theta^{y}(1-\theta)^{1-y}=f(\beta^{T}\mathbf{x})^{y}\times f(-\beta^{T}\mathbf{x})^{1-y}.\]
Given \(\mathcal{D}=\lbrace(\mathbf{x}_{1},y_{1}),\cdots,(\mathbf{x}_{n},y_{n})\rbrace\) of i.i.d. random variables, the log-likelihood \[\ell(\beta)=\sum_{i=1}^{n}y_{i}\log f(\beta^{T}\mathbf{x}_{i})+(1-y_{i})\log f(-\beta^{T}\mathbf{x}_{i}).\]
Let \(f_{i}=f(\beta^{T}\mathbf{x}_{i})\). Minimization: \[\begin{align}\nabla_{\beta}\ell(\beta)=&\sum_{i=1}^{n}y_{i}\mathbf{x}_{i}\frac{f(\beta^{T}\mathbf{x}_{i})f(-\beta^{T}\mathbf{x}_{i})}{f(\beta^{T}\mathbf{x}_{i})} \\ &-(1-y_{i})\mathbf{x}_{i}\frac{f(\beta^{T}\mathbf{x}_{i})f(-\beta^{T}\mathbf{x}_{i})}{f(-\beta^{T}\mathbf{x}_{i})}\\ =&\sum_{i=1}^{n}\mathbf{x}_{i}(y_{i}-f_{i}).\end{align}\]
Note \(\sum_{i=1}^{n}(y_{i}-f(\eta_i))\) is \(A'(\eta)\). \(\hat{\beta}\) should solve the nonlinear problem: \[\sum_{i=1}^{n}\mathbf{x}_{i}(y_{i}-f(\hat{\beta}^{T}\mathbf{x}_{i}))=0.\]
nyears = 36; beta0= 0.2; beta1 = -0.1; beta2 = -0.9; year = 1:nyears; x = (year-round(nyears/2)) / (nyears / 2) eta = beta0 + beta1 * x + beta2 * x^2 pi = plogis(eta) y = rbinom(nyears, 1, pi)
counts = glm(y ~ x + I(x^2), family=binomial); summary(counts)$coefficients
## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.7352476 0.5322872 -1.381299 0.1671871 ## x 0.9048698 0.6274104 1.442229 0.1492377 ## I(x^2) 1.4733067 1.2350338 1.192928 0.2328975
\(36\)-years is not enough for a consistent estimate.
nyears=10^4; year = 1:nyears; x =(year-round(nyears/2))/(nyears/2) eta = beta0 + beta1 * x + beta2 * x^2 pi = plogis(eta); y = rbinom(nyears, 1, pi) counts = glm(y ~ x + I(x^2), family=binomial); summary(counts)$coefficients
## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.1570019 0.03016450 5.204855 1.941482e-07 ## x -0.1497764 0.03549839 -4.219244 2.451232e-05 ## I(x^2) -0.8739806 0.06891199 -12.682563 7.387640e-37
A solution for finite-age persons.
Rule for joint probability has the expression \[P(B) \times P(A|B) = P(A) \times P(B|A)\] which we can rearrange to get \[P(A|B) = \frac {P(A) \times P(B|A)}{P(B)}\] which is the Bayes' rule.
Replace \(A\) and \(B\) by model parameters \(\theta\) and the data \(y\) to get \[p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)}\] where \(p(y|\theta)\) is the likelihood, \(p(\theta)\) is the prior, \(p(\theta|y)\) is the posterior, \(p(y)=\int p(y|\theta)p(\theta)d\theta\) is the horrible thing.
To avoid seeing the horrible thing: \[p(\theta|y) \propto p(\theta) \times p(y|\theta) \]
In most cases we can't calculate \(p(y)\). But we can calculate the ratio of \(p(\theta_1|y)\) and \(p(\theta_2|y)\): \[\frac{p(\theta_1|y) }{ p(\theta_2|y)}=\frac{p(\theta_1) \times p(y|\theta_1)}{p(\theta_2) \times p(y|\theta_2)} = \alpha\]
One can use the ratio \(\frac{p(\theta_1|y) }{ p(\theta_2|y)}\) to sample from the posterior distribution by a numerical sampling algorithm called Markov Chain Monte Carlo (MCMC).
One can uses the idea that in certain types of models \(p(\theta|y)\) can be aproximated by attaintable distributions e.g. normal distribution \(\mathcal{N}(\mu(\theta),\sigma^{2}(\theta))\) that makes the expression \(p(y|\theta) \mathcal{N}(\theta)\) integrable. This is called Approximate Bayesian Computation (ABC).
BUGS (Bayesian inference Using Gibbs Sampling), JAGS (Just Another Gibbs Sampler)
OpenBUGS www.openbugs.net
JAGS mcmc-jags.sourceforge.net/
INLA www.r-inla.org
We will dump the model to a file using cat("", file="")
dataset = list(T=length(x), year=x, y=as.numeric(y))
cat("model
{ # priors
beta0 ~ dnorm(0,0.05)
beta1 ~ dnorm(0,0.05)
beta2 ~ dnorm(0,0.05)
# likelihood
for(t in 1:T)
{
y[t] ~ dbin(pi[t], 1)
logit(pi[t]) = beta0 + beta1*year[t] + beta2*pow(year[t],2)
}
}", file="model_single.bug")
library(R2jags)
params = c("beta0","beta1","beta2", "pi")
counts.bayes = jags(data=dataset, model.file="model_single.bug",
parameters.to.save=params, n.iter=2000,
n.burnin=1000, n.chains=3)
## Compiling model graph ## Resolving undeclared variables ## Allocating nodes ## Graph information: ## Observed stochastic nodes: 36 ## Unobserved stochastic nodes: 3 ## Total graph size: 389 ## ## Initializing model
pi.b = counts.bayes$BUGSoutput$median$pi residuals = y - pi.b acf(residuals)
head(print(counts.bayes))
## Inference for Bugs model at "model_single.bug", fit using jags, ## 3 chains, each with 2000 iterations (first 1000 discarded) ## n.sims = 3000 iterations saved ## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff ## beta0 0.691 0.512 -0.286 0.336 0.693 1.040 1.713 1.005 470 ## beta1 -0.431 0.619 -1.633 -0.848 -0.419 -0.006 0.783 1.005 460 ## beta2 -0.978 1.179 -3.287 -1.775 -1.010 -0.172 1.292 1.004 590 ## pi[1] 0.545 0.206 0.150 0.388 0.548 0.705 0.911 1.003 920 ## pi[2] 0.562 0.189 0.189 0.423 0.567 0.706 0.899 1.003 840 ## pi[3] 0.580 0.172 0.236 0.456 0.587 0.707 0.882 1.003 750 ## pi[4] 0.595 0.156 0.277 0.484 0.605 0.709 0.868 1.004 660 ## pi[5] 0.610 0.142 0.316 0.511 0.619 0.713 0.857 1.004 580 ## pi[6] 0.623 0.130 0.353 0.533 0.632 0.717 0.847 1.004 510 ## pi[7] 0.634 0.119 0.384 0.553 0.642 0.721 0.845 1.005 450 ## pi[8] 0.644 0.112 0.409 0.569 0.650 0.726 0.840 1.005 400 ## pi[9] 0.652 0.107 0.429 0.581 0.658 0.731 0.839 1.006 370 ## pi[10] 0.658 0.104 0.439 0.589 0.665 0.734 0.837 1.006 360 ## pi[11] 0.663 0.102 0.447 0.594 0.671 0.738 0.841 1.006 350 ## pi[12] 0.666 0.102 0.454 0.597 0.675 0.743 0.845 1.006 360 ## pi[13] 0.668 0.103 0.460 0.596 0.677 0.744 0.846 1.006 370 ## pi[14] 0.668 0.104 0.454 0.596 0.677 0.745 0.849 1.006 390 ## pi[15] 0.667 0.106 0.449 0.594 0.678 0.746 0.851 1.005 420 ## pi[16] 0.665 0.107 0.444 0.592 0.675 0.746 0.849 1.005 450 ## pi[17] 0.662 0.108 0.435 0.589 0.671 0.743 0.850 1.005 490 ## pi[18] 0.657 0.109 0.429 0.583 0.667 0.739 0.847 1.004 550 ## pi[19] 0.652 0.110 0.424 0.578 0.659 0.732 0.843 1.004 620 ## pi[20] 0.645 0.110 0.418 0.571 0.653 0.726 0.837 1.003 710 ## pi[21] 0.636 0.109 0.409 0.563 0.644 0.716 0.830 1.003 850 ## pi[22] 0.627 0.108 0.402 0.554 0.632 0.705 0.822 1.002 1100 ## pi[23] 0.616 0.107 0.398 0.544 0.621 0.693 0.811 1.002 1500 ## pi[24] 0.603 0.106 0.389 0.531 0.607 0.678 0.799 1.001 2200 ## pi[25] 0.588 0.106 0.377 0.516 0.593 0.664 0.787 1.001 3000 ## pi[26] 0.572 0.106 0.364 0.500 0.576 0.646 0.773 1.001 3000 ## pi[27] 0.555 0.108 0.342 0.481 0.560 0.631 0.761 1.001 3000 ## pi[28] 0.536 0.111 0.316 0.458 0.541 0.612 0.747 1.001 3000 ## pi[29] 0.515 0.117 0.284 0.434 0.520 0.597 0.736 1.001 3000 ## pi[30] 0.493 0.125 0.246 0.405 0.497 0.582 0.732 1.001 2200 ## pi[31] 0.470 0.135 0.209 0.374 0.472 0.565 0.732 1.002 1400 ## pi[32] 0.447 0.146 0.172 0.340 0.447 0.549 0.734 1.002 1100 ## pi[33] 0.424 0.159 0.134 0.304 0.418 0.534 0.738 1.003 860 ## pi[34] 0.401 0.171 0.102 0.269 0.390 0.521 0.744 1.003 750 ## pi[35] 0.379 0.184 0.078 0.235 0.362 0.508 0.755 1.003 680 ## pi[36] 0.358 0.196 0.055 0.202 0.333 0.497 0.772 1.004 630 ## deviance 50.638 2.464 47.849 48.883 49.974 51.683 56.962 1.002 1600 ## ## For each parameter, n.eff is a crude measure of effective sample size, ## and Rhat is the potential scale reduction factor (at convergence, Rhat=1). ## ## DIC info (using the rule, pD = var(deviance)/2) ## pD = 3.0 and DIC = 53.7 ## DIC is an estimate of expected predictive error (lower deviance is better).
## $n.chains ## [1] 3 ## ## $n.iter ## [1] 2000 ## ## $n.burnin ## [1] 1000 ## ## $n.thin ## [1] 1 ## ## $n.keep ## [1] 1000 ## ## $n.sims ## [1] 3000
plot(counts.bayes)
plot(year, y, type="b") lines(pi,col="blue")
plot(as.mcmc(counts.bayes)[[1]][,1]) # 1st chain, beta0
nyears = 10^3; beta0= 0.2; beta1 = -0.1; beta2 = -0.9;
year = 1:nyears; x = (year-round(nyears/2)) / (nyears / 2)
eta = beta0 + beta1 * x + beta2 * x^2
pi = plogis(eta); y = rbinom(nyears, 1, pi)
dataset = list(T=length(x), year=x, y=as.numeric(y))
params = c("beta0","beta1","beta2")
counts.bayes = jags(data=dataset, model.file="model_single.bug",
parameters.to.save=params, n.iter=2000,
n.burnin=1000, n.chains=3)
## Compiling model graph ## Resolving undeclared variables ## Allocating nodes ## Graph information: ## Observed stochastic nodes: 1000 ## Unobserved stochastic nodes: 3 ## Total graph size: 10511 ## ## Initializing model
plot(counts.bayes)
Bernoulli distribution: \(p(x_i|\theta) = \theta^{y_i} (1-\theta)^{1-y_i}\) for \(y_i=0\) or \(y_i=1\) \[p(\mathbf{y}|\theta) = \prod_{i=1}^n \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{\sum_{i=1}^n y_i} (1-\theta)^{n-\sum_{i=1}^n y_i}.\]
Prior: a Beta distribution: Beta\((\alpha_1,\alpha_2)\) \[ p(\theta) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1}\] \(\Gamma(x)\) is the usual Gamma function:\(\Gamma(\alpha)=\int_{0}^{\infty}t^{\alpha-1}e^{-t}dt\).
Posterior: Beta\((\alpha_1^\prime,\alpha_2^\prime)\), with \(\alpha_1^\prime = \alpha_1 + \sum_{i=1}^n y_i\), \(\alpha_2^\prime = \alpha_2 + n-\sum_{i=1}^n y_i\).
\[ \begin{eqnarray*} p(\theta|\mathbf{y}) & \propto & p(\mathbf{y}|\theta)~p(\theta) \\ && \\ &=& \theta^{\sum_{i=1}^n y_i} (1-\theta)^{n-\sum_{i=1}^n y_i} ~ \frac{\Gamma(\alpha_1 +\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1} \\ && \\ & \propto & \theta^{(\alpha_1 + \sum_{i=1}^n y_i) - 1} (1-\theta)^{(\alpha_2 + n-\sum_{i=1}^n y_i) - 1}. \\ \end{eqnarray*} \] The posterior is still a beta distribution with updated parameters become \(\alpha_1^\prime = \alpha_1 + \sum_{i=1}^n y_i\), \(\alpha_2^\prime = \alpha_2 + n-\sum_{i=1}^n y_i\).
Voting model. \(n\) randomly selected registered voters express opinions (yes or no). Voter are identical with the probability \(\theta\) in favor of the proposal.
Initial guess \(330\) persons are in favor and \(270\) not.
If \(\sum_{i=1}^n y_i = 600\) of the \(n = 1000\) respondents favor the vote, then the posterior has a beta distribution with parameters \(\alpha^\prime =\alpha +\sum_{i=1}^nx_i = 930\) and \(\beta^\prime = \beta + n -\sum_{i=1}^nx_i = 670\).
x = seq(.45, 1, .005); prior = dbeta(x, 330, 270);
post = dbeta(x, 930, 670); infinite = dbeta(x,10^4,270);
plot(x, post, type="l", col="blue", ylim=c(0, 50), lwd=2,
xlab="Proportion in Favor", ylab="Density")
lines(x, prior, col="red"); lines(x,infinite, col="black")
Grid Approximation
x = seq(.45, 1, .005); prior = dbeta(x, 330, 270); likelihood = dbinom(1000,600,x); post.com = likelihood*prior; post.com = post.com/sum(post.com)
The prior comes \(\theta^{\alpha-1}\) \[p(\theta|\alpha)\propto\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}\] have the normalizing denominator \[\int_{\theta\in\Delta}\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}d\theta=\frac{\Pi_{i=1}^{k}\Gamma(\alpha_{i})}{\Gamma(\sum_{i=1}^{k}\alpha_{i})}=B(\alpha)\] where \(\Delta\) means a simplex \(\left\{ \sum_{i=1}^{k}\theta_{i}=1,\theta_{i}\geq0\right\}\),
The distribution is called Dirichlet distribution denoted as \(\mbox{Dir}(\alpha)\): \[\frac{\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}}{\int_{\theta\in\Theta}\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}d\theta}=\frac{\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}}{B(\alpha)}.\]
The Beta distribution is a special case of the Dirichlet distribution with \(k=2\): \[\int_{0}^{1}\theta^{\alpha_{1}-1}(1-\theta)^{\alpha_{2}-1}d\theta=\frac{\Gamma(\alpha_{1})\Gamma(\alpha_{2})}{\Gamma(\alpha_{1}+\alpha_{2})}.\]
Conjugation is mutual
The conjugation: \[ p(\theta | y,\alpha) \propto p(y|\theta) p(\theta|\alpha).\] \(p(\theta | y,\alpha)\) and \(p(\theta | \alpha)\) are Dirichlet, then \(p(y|\theta)\) is Multinomial.
Mutuality: \(p(\theta | y,\alpha)\) and \(p(\theta | \alpha)\) are multinomial then \(p(y|\theta)\) is Dirichlet.
Products of exponential family distributions are exponential family distribution
The integral of the Gamma distribution can be computed by change of variables, giving \[\Gamma(\alpha)=\beta^{\alpha}\int_{0}^{\infty}\theta^{\alpha-1}e^{-\beta\theta}d\theta.\]
We have the Gamma distribution: \(\mbox{Gamma}(\alpha,\beta)\) \[\frac{\beta^{\alpha}\theta^{\alpha-1}e^{-\beta\theta}}{\Gamma(\alpha)}.\]
Consider \(Y|\theta\) has a Poisson \(\mbox{Poi}(\theta)\) distribution: \[p(y|\theta)=\prod_{j=1}^{n}\frac{\theta^{y_{j}}e^{-\theta}}{y_{j}!}\propto\theta^{\sum_{j=1}^{n}y_{j}}e^{-n\theta}.\]
Its conjugation prior is the \(\mbox{Gamma}(\alpha,\beta)\) \[p(\theta|\alpha,\beta)\propto\theta^{\alpha-1}e^{-\beta\theta}\] with normalizing constant \(\beta^{\alpha}/\Gamma(\alpha)\).
The posterior is a Gamma distribution \(\mbox{Gamma}(\sum_{j}y_{j}+\alpha,\beta+n)\): \[\theta^{\sum_{j=1}^{n}y_{j}+\alpha-1}e^{-(\beta+n)\theta}.\]
One reason that statistical models routinely correspond to many different detailed process is because they belong to the exponential family.
All of the exponential family distributions are maximum entropy distributions.
One can no more infer evolutionary process by these statistical distributions than its underlying mechanistic process.
On the other hand, maximum entropy means we can use them to do useful work without fully understanding the underyling nature.
Bayesian method can be considered as a stochastic optimizier for a global optimization problem.
The maximization problem \(\max_{\theta \in \Theta} U(\theta)\) has its stochastic version:\[P(U(\theta^{*}))\geq P(U(\theta)) \mbox{ for } \theta\neq\theta^{*}.\] If the probability of simulating in regions where \(U\) is large, we increase this probability and decrease this probability in regions where it is small.
A density sharing maxima with \(U(\theta^{*})\) such that \[ P(U(\theta))= \frac{\exp(U(\theta)) p(\mathbf{y}|\theta)}{\int \exp(U(\theta)) p(\mathbf{y}|\theta)d\theta }\] \(P(U(\theta^{*}))\geq P(U(\theta))\) if \(U(\theta^{*})\geq U(\theta)\).
The likelihood is free to calibrate so that the denominator can be chosen toward accelerating convergence or avoiding local maxima.