Topic in Econometrics (2018)

12/02/2018

GLM

A parametric family of probability distributions.
Linear form \(\beta^T X\) and mean response \(\theta\) are linked.
Link function has a closed form and is invertible.
Solve nonlinear optimization problem for \(\hat{\beta}\).

Exponential Family

An exponential family is a family of probability density functions.
The density functions have the form \[p(y|\theta) = h(y)\exp\left\lbrace g(\theta)^{T}\phi(y)-\tilde{A}(\theta)\right\rbrace,\] where \(h(y)\) is the ancillary statistic, \(\tilde{A}(\theta)\) is a normalizer \[\exp\{\tilde{A}(\theta)\}=\int_{\Omega}h(y)\exp\left\lbrace g(\theta)^{T}\phi(y)\right\rbrace dy\] \(\phi(y)\) is the sufficient statistic such that conditioning on \(\phi(y)\), \(\theta\) is independent of \(Y\), or \(\theta\) and \(Y \mid\phi(y)\) are independent.

Exponential Family

All GLM distributions are in (a subset of) the exponential family.
The canonical exponential family:\[ p(y|\eta)=h(y)\exp(\eta^{T}\phi(y)-A(\eta))= \frac{h(y) \exp(\eta^{T}\phi(y))}{\int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy}.\] where \(\exp\{A(\eta)\}=\int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy.\)
Link function and canonical exponential family: \[\eta = g(\theta) \mbox{ and } \theta = f(\eta).\]

Exponential Family

A Poisson distribution: \(p(y|\theta) = \frac{\theta^{y}e^{-\theta}}{y!}\) has its expression \[p(y|\theta) = \frac{1}{y!} \exp\{y\log\theta -\theta\}\] where \(\eta = \log \theta\) , \(\phi(y)=y\), \(A(\eta) = \theta = e^{\eta}\), \(h(y)=(y!)^{-1}\).
The inverse link function is indeed \[\theta = e^{\eta}.\]

Exponential Family

A Bernoulli distribution: \(p(y|\theta)=\theta^{y}(1-\theta)^{1-y}\) has its expression \[\exp\left\{y \log\left( \frac{\theta}{1-\theta} \right) + \log(1-\theta)\right\}\] where \(\eta = \log \frac{\theta}{1-\theta}\), \(\phi(y) = y\), \(A(\eta) = \log(1-\theta) = \log(1+e^{\eta})\), \(h(y)=1\).
The inverse link function is \[\theta = \frac{1}{1 + e^{-\eta}}.\]

Exponential Family

A Gaussian distribution, the canonical exponential family representation is: \[\begin{aligned}p(y|\mu,\sigma^{2}) & =\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y-\mu)^{2}}{2\sigma^{2}}}\\ & =\exp\left\lbrace y^{2}\left(\frac{-1}{2\sigma^{2}}\right)+y\frac{\mu}{\sigma^{2}}-\left[\frac{\mu^{2}}{2\sigma^{2}}+\frac{1}{2}\log(2\pi\sigma^{2})\right]\right\rbrace \end{aligned}\] where \(\eta=(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}})^{T}\), \(\phi(y)=(y, y^2)^T\), \(A(\eta)=\frac{1}{2}\log\left(-\frac{2\pi}{2\eta_{2}}\right)-\frac{\eta_{1}^{2}}{4\eta_{2}}\) and \(h(y)=1/\sqrt{2\pi}\).

Exponential Family

A multinomial distribution has the form \[\begin{aligned}p(\mathbf{y}|\theta)&= \prod_{k=1}^{K}\theta_{k}^{y_{k}}=\exp\Big(\sum_{k=1}^{K}y_{k}\log\theta_{k}\Big) =\exp\Big(\sum_{k=1}^{K}y_{k}\eta_{k}\Big)=e^{\langle y,\eta\rangle}\end{aligned}.\]
By using the constraint \(\sum_{k=1}^{K}\theta_{k}=1\), we have \[\exp\{A(\eta)\}=\sum_{y\in\{0,1\}^{K}}\exp\Big(\sum_{k=1}^{K}y_{k}\eta_{k}\Big)=\sum_{k=1}^{K}e^{\eta_{k}}=1\] so it gives the usual form of multinomial \[\theta_{k}=\frac{e^{\eta_{k}}}{\sum_{k=1}^{K}e^{\eta_{k}}}\] which is known as the multinomial logit or softmax function.

Exponential Family and Optimization

The exponential family has appealing computational properties.
Mean and variance are the first and second cumulants of a distribution. In the exponential family, they can be obtained by calculating derivatives of \(A(\eta)\).
\(A(\eta)\) is the cumulant function, \[\begin{aligned}\frac{\partial A(\eta)}{\partial\eta^{T}} & =\frac{\partial}{\partial \eta^{T} }\left\{ \log \int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy\right\} \\ & = \frac{\int\phi(y)h(y)\exp\{\eta^T\phi(y)\}dy}{\int h(y) \exp\{\eta^T\phi(y)\}dx}\\ & =\int\phi(y) h(y) \exp\left\lbrace \eta^{T}\phi(y)-A(\eta)\right\rbrace dy= \mathbb{E}[\phi(Y)]. \end{aligned}\]

Exponential Family and Optimization

Second derivative of \(A(\eta)\): \[\begin{aligned}\frac{\partial^{2} A(\eta)}{\partial\eta\eta^{T}} & =\int \phi(y) \left(\phi(y)-\frac{\partial A(\eta)}{\partial \eta^{T}}\right)^{T} \exp\left\lbrace \eta^{T}\phi(y)-A(\eta)\right\rbrace h(y)dy \\ & =\mathbb{E}[\phi(Y)\phi(Y)^T] - \mathbb{E}[\phi(Y)]\mathbb{E}[\phi(Y)]^T = \mbox{Cov}[\phi(Y)]. \end{aligned}\]
Higher-order cumulants of \(\phi(y)\) can be obtained by taking higher-order derivatives of \(A(\eta)\).

Exponential Family and Optimization

\(A(\eta)\) is convex as \(\partial^{2} A(\eta)/\partial \eta \eta^T = \mbox{Cov}[\phi(Y)]\) is positive definite (given \(\eta\) is in a convex set).
The likelihood of canonical exponential family is convex \[\ell(\eta) =\sum_{i=1}^{N}\log p(y_i | \eta) = \sum_{i=1}^{N} [\log h(y_i) + \eta^T\phi(y_i) - A(\eta)].\]
Convexity: unique global ML solution for \(\eta\). MLE satisfies \[ \ell'(\hat{\eta}) = \sum_{i=1}^{N} \phi(y_i) - N A'(\hat{\eta}) = 0\] or \(A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\). Optimization is feasible.

Remarks

The standard Netwon's optimizier is feasible for \[A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\] by using additional information (2nd order derivative - Hessian) of the likelihood. It will improve the computational efficiency.
Statistics \(\mbox{Cov}[\phi(Y)]\) assist to attain the Hessian directly.

Remarks

When \(\phi(y)=y\), e.g. in the Poisson and Bernoulli case, \(A(\eta)\) is the \(\log\) of Laplace transform of \(h(y)\): \[e^{A(\eta)} = \int h(y) e^{\eta^T \phi(y)}dy.\]

Problem of MLE

The solution of \[A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\] depends on the sample size \(N\). Given a small number of observations, \(N^{-1}\sum_{i=1}^{N} \phi(y_i)\) is likely far from population mean (\(\mathbb{E}[\phi(Y)]\))!

Problem of MLE

MLE intends to set up the equivalence between a finite sample average and population mean which in principle is only asymptotically valid.
Result: MLE does not provide a good estimate \(\hat{\eta}\) for \(\eta\).
The estimate error is propogating when one estimates \(\beta\) using \(\hat{\eta}\).

Nonlinear Regression

\(X\in\mathbb{R}^{d}\)
\(Y\in\lbrace0,1\rbrace\) and \(Y\sim \mbox{Ber}(\theta)\).
Logistic link function:\[\theta= f(\eta) = \frac{1}{1+e^{-\eta}}.\]

Nonlinear Regression

Conditional distribution: \[\Pr(Y=y|X=\mathbf{x})=\theta^{y}(1-\theta)^{1-y}=f(\beta^{T}\mathbf{x})^{y}\times f(-\beta^{T}\mathbf{x})^{1-y}.\]
Given \(\mathcal{D}=\lbrace(\mathbf{x}_{1},y_{1}),\cdots,(\mathbf{x}_{n},y_{n})\rbrace\) of i.i.d. random variables, the log-likelihood \[\ell(\beta)=\sum_{i=1}^{n}y_{i}\log f(\beta^{T}\mathbf{x}_{i})+(1-y_{i})\log f(-\beta^{T}\mathbf{x}_{i}).\]

Nonlinear Regression

Let \(f_{i}=f(\beta^{T}\mathbf{x}_{i})\). Minimization: \[\begin{align}\nabla_{\beta}\ell(\beta)=&\sum_{i=1}^{n}y_{i}\mathbf{x}_{i}\frac{f(\beta^{T}\mathbf{x}_{i})f(-\beta^{T}\mathbf{x}_{i})}{f(\beta^{T}\mathbf{x}_{i})} \\ &-(1-y_{i})\mathbf{x}_{i}\frac{f(\beta^{T}\mathbf{x}_{i})f(-\beta^{T}\mathbf{x}_{i})}{f(-\beta^{T}\mathbf{x}_{i})}\\ =&\sum_{i=1}^{n}\mathbf{x}_{i}(y_{i}-f_{i}).\end{align}\]
Note \(\sum_{i=1}^{n}(y_{i}-f(\eta_i))\) is \(A'(\eta)\). \(\hat{\beta}\) should solve the nonlinear problem: \[\sum_{i=1}^{n}\mathbf{x}_{i}(y_{i}-f(\hat{\beta}^{T}\mathbf{x}_{i}))=0.\]

One Person Model

nyears = 36; beta0= 0.2; beta1 = -0.1; beta2 = -0.9;
year = 1:nyears;  x = (year-round(nyears/2)) / (nyears / 2)
eta = beta0 + beta1 * x + beta2 * x^2 
pi = plogis(eta)
y = rbinom(nyears, 1, pi)

Estimation

counts = glm(y ~ x + I(x^2), family=binomial); 
summary(counts)$coefficients

##               Estimate Std. Error   z value  Pr(>|z|)
## (Intercept) -0.7352476  0.5322872 -1.381299 0.1671871
## x            0.9048698  0.6274104  1.442229 0.1492377
## I(x^2)       1.4733067  1.2350338  1.192928 0.2328975

\(36\)-years is not enough for a consistent estimate.

One Person with a Long Life

nyears=10^4; year = 1:nyears; x =(year-round(nyears/2))/(nyears/2)
eta = beta0 + beta1 * x + beta2 * x^2 
pi = plogis(eta); y = rbinom(nyears, 1, pi)
counts = glm(y ~ x + I(x^2), family=binomial); 
summary(counts)$coefficients

##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.1570019 0.03016450   5.204855 1.941482e-07
## x           -0.1497764 0.03549839  -4.219244 2.451232e-05
## I(x^2)      -0.8739806 0.06891199 -12.682563 7.387640e-37

Bayesian Inferences

A solution for finite-age persons.
Rule for joint probability has the expression \[P(B) \times P(A|B) = P(A) \times P(B|A)\] which we can rearrange to get \[P(A|B) = \frac {P(A) \times P(B|A)}{P(B)}\] which is the Bayes' rule.

Bayesian Inferences

Replace \(A\) and \(B\) by model parameters \(\theta\) and the data \(y\) to get \[p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)}\] where \(p(y|\theta)\) is the likelihood, \(p(\theta)\) is the prior, \(p(\theta|y)\) is the posterior, \(p(y)=\int p(y|\theta)p(\theta)d\theta\) is the horrible thing.
To avoid seeing the horrible thing: \[p(\theta|y) \propto p(\theta) \times p(y|\theta) \]
In most cases we can't calculate \(p(y)\). But we can calculate the ratio of \(p(\theta_1|y)\) and \(p(\theta_2|y)\): \[\frac{p(\theta_1|y) }{ p(\theta_2|y)}=\frac{p(\theta_1) \times p(y|\theta_1)}{p(\theta_2) \times p(y|\theta_2)} = \alpha\]

Computation in Bayesian Models

One can use the ratio \(\frac{p(\theta_1|y) }{ p(\theta_2|y)}\) to sample from the posterior distribution by a numerical sampling algorithm called Markov Chain Monte Carlo (MCMC).
One can uses the idea that in certain types of models \(p(\theta|y)\) can be aproximated by attaintable distributions e.g. normal distribution \(\mathcal{N}(\mu(\theta),\sigma^{2}(\theta))\) that makes the expression \(p(y|\theta) \mathcal{N}(\theta)\) integrable. This is called Approximate Bayesian Computation (ABC).

Bayesian Models

The most common MCMC samplers: Metropolis-Hastings algorithm, Gibbs algorithm

BUGS (Bayesian inference Using Gibbs Sampling), JAGS (Just Another Gibbs Sampler)

OpenBUGS www.openbugs.net

JAGS mcmc-jags.sourceforge.net/

ABC method: Integrated Nested Laplace Approximation (INLA)

INLA www.r-inla.org

All available in R.

Bayesian Simple Model

We will dump the model to a file using cat("", file="")

dataset = list(T=length(x), year=x, y=as.numeric(y))
cat("model
    { # priors
      beta0 ~ dnorm(0,0.05)
      beta1 ~ dnorm(0,0.05)
      beta2 ~ dnorm(0,0.05)
      # likelihood
      for(t in 1:T)
      {
        y[t] ~ dbin(pi[t], 1)
        logit(pi[t]) = beta0 + beta1*year[t] + beta2*pow(year[t],2)
      }
    }", file="model_single.bug")

Bayesian Simple Model

library(R2jags)
params = c("beta0","beta1","beta2", "pi")
counts.bayes = jags(data=dataset, model.file="model_single.bug", 
                    parameters.to.save=params, n.iter=2000, 
                    n.burnin=1000, n.chains=3)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 36
##    Unobserved stochastic nodes: 3
##    Total graph size: 389
## 
## Initializing model

Bayesian Simple Model

pi.b = counts.bayes$BUGSoutput$median$pi
residuals = y - pi.b
acf(residuals)

Bayesian Simple Model

head(print(counts.bayes))

## Inference for Bugs model at "model_single.bug", fit using jags,
##  3 chains, each with 2000 iterations (first 1000 discarded)
##  n.sims = 3000 iterations saved
##          mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
## beta0      0.691   0.512 -0.286  0.336  0.693  1.040  1.713 1.005   470
## beta1     -0.431   0.619 -1.633 -0.848 -0.419 -0.006  0.783 1.005   460
## beta2     -0.978   1.179 -3.287 -1.775 -1.010 -0.172  1.292 1.004   590
## pi[1]      0.545   0.206  0.150  0.388  0.548  0.705  0.911 1.003   920
## pi[2]      0.562   0.189  0.189  0.423  0.567  0.706  0.899 1.003   840
## pi[3]      0.580   0.172  0.236  0.456  0.587  0.707  0.882 1.003   750
## pi[4]      0.595   0.156  0.277  0.484  0.605  0.709  0.868 1.004   660
## pi[5]      0.610   0.142  0.316  0.511  0.619  0.713  0.857 1.004   580
## pi[6]      0.623   0.130  0.353  0.533  0.632  0.717  0.847 1.004   510
## pi[7]      0.634   0.119  0.384  0.553  0.642  0.721  0.845 1.005   450
## pi[8]      0.644   0.112  0.409  0.569  0.650  0.726  0.840 1.005   400
## pi[9]      0.652   0.107  0.429  0.581  0.658  0.731  0.839 1.006   370
## pi[10]     0.658   0.104  0.439  0.589  0.665  0.734  0.837 1.006   360
## pi[11]     0.663   0.102  0.447  0.594  0.671  0.738  0.841 1.006   350
## pi[12]     0.666   0.102  0.454  0.597  0.675  0.743  0.845 1.006   360
## pi[13]     0.668   0.103  0.460  0.596  0.677  0.744  0.846 1.006   370
## pi[14]     0.668   0.104  0.454  0.596  0.677  0.745  0.849 1.006   390
## pi[15]     0.667   0.106  0.449  0.594  0.678  0.746  0.851 1.005   420
## pi[16]     0.665   0.107  0.444  0.592  0.675  0.746  0.849 1.005   450
## pi[17]     0.662   0.108  0.435  0.589  0.671  0.743  0.850 1.005   490
## pi[18]     0.657   0.109  0.429  0.583  0.667  0.739  0.847 1.004   550
## pi[19]     0.652   0.110  0.424  0.578  0.659  0.732  0.843 1.004   620
## pi[20]     0.645   0.110  0.418  0.571  0.653  0.726  0.837 1.003   710
## pi[21]     0.636   0.109  0.409  0.563  0.644  0.716  0.830 1.003   850
## pi[22]     0.627   0.108  0.402  0.554  0.632  0.705  0.822 1.002  1100
## pi[23]     0.616   0.107  0.398  0.544  0.621  0.693  0.811 1.002  1500
## pi[24]     0.603   0.106  0.389  0.531  0.607  0.678  0.799 1.001  2200
## pi[25]     0.588   0.106  0.377  0.516  0.593  0.664  0.787 1.001  3000
## pi[26]     0.572   0.106  0.364  0.500  0.576  0.646  0.773 1.001  3000
## pi[27]     0.555   0.108  0.342  0.481  0.560  0.631  0.761 1.001  3000
## pi[28]     0.536   0.111  0.316  0.458  0.541  0.612  0.747 1.001  3000
## pi[29]     0.515   0.117  0.284  0.434  0.520  0.597  0.736 1.001  3000
## pi[30]     0.493   0.125  0.246  0.405  0.497  0.582  0.732 1.001  2200
## pi[31]     0.470   0.135  0.209  0.374  0.472  0.565  0.732 1.002  1400
## pi[32]     0.447   0.146  0.172  0.340  0.447  0.549  0.734 1.002  1100
## pi[33]     0.424   0.159  0.134  0.304  0.418  0.534  0.738 1.003   860
## pi[34]     0.401   0.171  0.102  0.269  0.390  0.521  0.744 1.003   750
## pi[35]     0.379   0.184  0.078  0.235  0.362  0.508  0.755 1.003   680
## pi[36]     0.358   0.196  0.055  0.202  0.333  0.497  0.772 1.004   630
## deviance  50.638   2.464 47.849 48.883 49.974 51.683 56.962 1.002  1600
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 3.0 and DIC = 53.7
## DIC is an estimate of expected predictive error (lower deviance is better).

## $n.chains
## [1] 3
## 
## $n.iter
## [1] 2000
## 
## $n.burnin
## [1] 1000
## 
## $n.thin
## [1] 1
## 
## $n.keep
## [1] 1000
## 
## $n.sims
## [1] 3000

Bayesian Simple Model

plot(counts.bayes)

Bayesian Simple Model

plot(year, y, type="b")
lines(pi,col="blue")

Bayesian Simple Model

plot(as.mcmc(counts.bayes)[[1]][,1]) # 1st chain, beta0

Bayesian Long Life Model

nyears = 10^3; beta0= 0.2; beta1 = -0.1; beta2 = -0.9;
year = 1:nyears;  x = (year-round(nyears/2)) / (nyears / 2)
eta = beta0 + beta1 * x + beta2 * x^2 
pi = plogis(eta); y = rbinom(nyears, 1, pi)
dataset = list(T=length(x), year=x, y=as.numeric(y))
params = c("beta0","beta1","beta2")
counts.bayes = jags(data=dataset, model.file="model_single.bug", 
                    parameters.to.save=params, n.iter=2000, 
                    n.burnin=1000, n.chains=3)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1000
##    Unobserved stochastic nodes: 3
##    Total graph size: 10511
## 
## Initializing model

Bayesian Long Life Model

plot(counts.bayes)

MCMC Animation

Bayesian Computation by Hands

Bernoulli distribution: \(p(x_i|\theta) = \theta^{y_i} (1-\theta)^{1-y_i}\) for \(y_i=0\) or \(y_i=1\) \[p(\mathbf{y}|\theta) = \prod_{i=1}^n \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{\sum_{i=1}^n y_i} (1-\theta)^{n-\sum_{i=1}^n y_i}.\]
Prior: a Beta distribution: Beta\((\alpha_1,\alpha_2)\) \[ p(\theta) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1}\] \(\Gamma(x)\) is the usual Gamma function:\(\Gamma(\alpha)=\int_{0}^{\infty}t^{\alpha-1}e^{-t}dt\).
Posterior: Beta\((\alpha_1^\prime,\alpha_2^\prime)\), with \(\alpha_1^\prime = \alpha_1 + \sum_{i=1}^n y_i\), \(\alpha_2^\prime = \alpha_2 + n-\sum_{i=1}^n y_i\).

Bayesian Computation by Hands

\[ \begin{eqnarray*} p(\theta|\mathbf{y}) & \propto & p(\mathbf{y}|\theta)~p(\theta) \\ && \\ &=& \theta^{\sum_{i=1}^n y_i} (1-\theta)^{n-\sum_{i=1}^n y_i} ~ \frac{\Gamma(\alpha_1 +\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1} \\ && \\ & \propto & \theta^{(\alpha_1 + \sum_{i=1}^n y_i) - 1} (1-\theta)^{(\alpha_2 + n-\sum_{i=1}^n y_i) - 1}. \\ \end{eqnarray*} \] The posterior is still a beta distribution with updated parameters become \(\alpha_1^\prime = \alpha_1 + \sum_{i=1}^n y_i\), \(\alpha_2^\prime = \alpha_2 + n-\sum_{i=1}^n y_i\).

Conjugate priors: A family of priors such that, upon being multiplied by the likelihood, yields a posterior in the same family.

Bayesian Computation by Hands

Voting model. \(n\) randomly selected registered voters express opinions (yes or no). Voter are identical with the probability \(\theta\) in favor of the proposal.
Initial guess \(330\) persons are in favor and \(270\) not.
Prior of beta distribution: \(\alpha_1 = 330\) and \(\alpha_2 =270\).
If \(\sum_{i=1}^n y_i = 600\) of the \(n = 1000\) respondents favor the vote, then the posterior has a beta distribution with parameters \(\alpha^\prime =\alpha +\sum_{i=1}^nx_i = 930\) and \(\beta^\prime = \beta + n -\sum_{i=1}^nx_i = 670\).

Bayesian Computation by Hands

x = seq(.45, 1, .005); prior = dbeta(x, 330, 270); 
post = dbeta(x, 930, 670); infinite = dbeta(x,10^4,270);
plot(x, post, type="l", col="blue", ylim=c(0, 50), lwd=2,
         xlab="Proportion in Favor", ylab="Density")
lines(x, prior, col="red"); lines(x,infinite, col="black")

Bayesian Computation by Computer

Grid Approximation

x = seq(.45, 1, .005); prior = dbeta(x, 330, 270); 
likelihood = dbinom(1000,600,x);
post.com = likelihood*prior; post.com = post.com/sum(post.com)

Conjugation

Multinomial distribution of \(n\) trials for \(k\) categories and \(y_{i}=1\) trials showed up in category \(i\) \[ p(\theta | y )=\left(\begin{array}{c} n\\ y_{1},\dots,y_{k} \end{array}\right)\theta_{1}^{y_{1}}\theta_{2}^{y_{2}}\cdots\theta_{k}^{y_{k}}.\] If we choose a prior like \(\theta^{\alpha-1}\) \[p(\theta|\alpha)\propto\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}.\]
The posterior and the product of the likelihood and the prior looks like \[p(\theta |y, \alpha) \propto p(y| \theta)p(\theta | \alpha) \propto\theta_{1}^{y_{1}+\alpha_{1}-1}\theta_{2}^{y_{2}+\alpha_{2}-1}\cdots\theta_{k}^{y_{k}+\alpha_{k}-1}.\]

Conjugation

The prior comes \(\theta^{\alpha-1}\) \[p(\theta|\alpha)\propto\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}\] have the normalizing denominator \[\int_{\theta\in\Delta}\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}d\theta=\frac{\Pi_{i=1}^{k}\Gamma(\alpha_{i})}{\Gamma(\sum_{i=1}^{k}\alpha_{i})}=B(\alpha)\] where \(\Delta\) means a simplex \(\left\{ \sum_{i=1}^{k}\theta_{i}=1,\theta_{i}\geq0\right\}\),
The distribution is called Dirichlet distribution denoted as \(\mbox{Dir}(\alpha)\): \[\frac{\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}}{\int_{\theta\in\Theta}\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}d\theta}=\frac{\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}}{B(\alpha)}.\]
The Beta distribution is a special case of the Dirichlet distribution with \(k=2\): \[\int_{0}^{1}\theta^{\alpha_{1}-1}(1-\theta)^{\alpha_{2}-1}d\theta=\frac{\Gamma(\alpha_{1})\Gamma(\alpha_{2})}{\Gamma(\alpha_{1}+\alpha_{2})}.\]

Conjugation

Conjugation is mutual
The conjugation: \[ p(\theta | y,\alpha) \propto p(y|\theta) p(\theta|\alpha).\] \(p(\theta | y,\alpha)\) and \(p(\theta | \alpha)\) are Dirichlet, then \(p(y|\theta)\) is Multinomial.
Mutuality: \(p(\theta | y,\alpha)\) and \(p(\theta | \alpha)\) are multinomial then \(p(y|\theta)\) is Dirichlet.

Exponential Family and Conjugation

Products of exponential family distributions are exponential family distribution
The integral of the Gamma distribution can be computed by change of variables, giving \[\Gamma(\alpha)=\beta^{\alpha}\int_{0}^{\infty}\theta^{\alpha-1}e^{-\beta\theta}d\theta.\]
We have the Gamma distribution: \(\mbox{Gamma}(\alpha,\beta)\) \[\frac{\beta^{\alpha}\theta^{\alpha-1}e^{-\beta\theta}}{\Gamma(\alpha)}.\]

Exponential Family and Conjugation

Consider \(Y|\theta\) has a Poisson \(\mbox{Poi}(\theta)\) distribution: \[p(y|\theta)=\prod_{j=1}^{n}\frac{\theta^{y_{j}}e^{-\theta}}{y_{j}!}\propto\theta^{\sum_{j=1}^{n}y_{j}}e^{-n\theta}.\]
Its conjugation prior is the \(\mbox{Gamma}(\alpha,\beta)\) \[p(\theta|\alpha,\beta)\propto\theta^{\alpha-1}e^{-\beta\theta}\] with normalizing constant \(\beta^{\alpha}/\Gamma(\alpha)\).
The posterior is a Gamma distribution \(\mbox{Gamma}(\sum_{j}y_{j}+\alpha,\beta+n)\): \[\theta^{\sum_{j=1}^{n}y_{j}+\alpha-1}e^{-(\beta+n)\theta}.\]

Truth about Exponential Family

One reason that statistical models routinely correspond to many different detailed process is because they belong to the exponential family.
All of the exponential family distributions are maximum entropy distributions.
One can no more infer evolutionary process by these statistical distributions than its underlying mechanistic process.
On the other hand, maximum entropy means we can use them to do useful work without fully understanding the underyling nature.

Truth about Bayesian Methods

Bayesian method can be considered as a stochastic optimizier for a global optimization problem.
The maximization problem \(\max_{\theta \in \Theta} U(\theta)\) has its stochastic version:\[P(U(\theta^{*}))\geq P(U(\theta)) \mbox{ for } \theta\neq\theta^{*}.\] If the probability of simulating in regions where \(U\) is large, we increase this probability and decrease this probability in regions where it is small.

Truth about Bayesian Methods

A density sharing maxima with \(U(\theta^{*})\) such that \[ P(U(\theta))= \frac{\exp(U(\theta)) p(\mathbf{y}|\theta)}{\int \exp(U(\theta)) p(\mathbf{y}|\theta)d\theta }\] \(P(U(\theta^{*}))\geq P(U(\theta))\) if \(U(\theta^{*})\geq U(\theta)\).
The likelihood is free to calibrate so that the denominator can be chosen toward accelerating convergence or avoiding local maxima.