12/02/2018

GLM

  • A parametric family of probability distributions.

  • Linear form \(\beta^T X\) and mean response \(\theta\) are linked.

  • Link function has a closed form and is invertible.

  • Solve nonlinear optimization problem for \(\hat{\beta}\).

Exponential Family

  • An exponential family is a family of probability density functions.

  • The density functions have the form \[p(y|\theta) = h(y)\exp\left\lbrace g(\theta)^{T}\phi(y)-\tilde{A}(\theta)\right\rbrace,\] where \(h(y)\) is the base measure (sometimes called the ancillary statistic), \(\tilde{A}(\theta)\) is the normalizer \[\exp\{\tilde{A}(\theta)\}=\int_{\Omega}h(y)\exp\left\lbrace g(\theta)^{T}\phi(y)\right\rbrace dy,\] and \(\phi(y)\) is the sufficient statistic: conditional on \(\phi(y)\), the data \(Y\) carry no further information about \(\theta\), i.e. \(\theta\) and \(Y\) are independent given \(\phi(y)\).

Exponential Family

  • All GLM distributions are in (a subset of) the exponential family.

  • The canonical exponential family:\[ p(y|\eta)=h(y)\exp(\eta^{T}\phi(y)-A(\eta))= \frac{h(y) \exp(\eta^{T}\phi(y))}{\int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy}.\] where \(\exp\{A(\eta)\}=\int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy.\)

  • Link function and canonical exponential family: \[\eta = g(\theta) \mbox{ and } \theta = f(\eta).\]

Exponential Family

  • A Poisson distribution: \(p(y|\theta) = \frac{\theta^{y}e^{-\theta}}{y!}\) has its expression \[p(y|\theta) = \frac{1}{y!} \exp\{y\log\theta -\theta\}\] where \(\eta = \log \theta\) , \(\phi(y)=y\), \(A(\eta) = \theta = e^{\eta}\), \(h(y)=(y!)^{-1}\).

  • The inverse link function is indeed \[\theta = e^{\eta}.\]
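
A quick numerical check of this representation in R (a minimal sketch; theta and the range of y are arbitrary illustrative values):

theta = 3.5; y = 0:10; eta = log(theta)           # canonical parameter eta = log(theta)
dens = (1/factorial(y)) * exp(eta*y - exp(eta))   # h(y) exp{eta*phi(y) - A(eta)}
all.equal(dens, dpois(y, theta))                  # TRUE up to floating point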

Exponential Family

  • A Bernoulli distribution: \(p(y|\theta)=\theta^{y}(1-\theta)^{1-y}\) has its expression \[\exp\left\{y \log\left( \frac{\theta}{1-\theta} \right) + \log(1-\theta)\right\}\] where \(\eta = \log \frac{\theta}{1-\theta}\), \(\phi(y) = y\), \(A(\eta) = -\log(1-\theta) = \log(1+e^{\eta})\), \(h(y)=1\).

  • The inverse link function is \[\theta = \frac{1}{1 + e^{-\eta}}.\]
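
A matching check for the Bernoulli representation (a minimal sketch; theta is an arbitrary illustrative value):

theta = 0.3; y = c(0, 1); eta = qlogis(theta)     # eta = log(theta/(1-theta))
dens = exp(eta*y - log(1 + exp(eta)))             # h(y)=1, A(eta)=log(1+e^eta)
all.equal(dens, dbinom(y, 1, theta))              # TRUE up to floating point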

Exponential Family

  • For a Gaussian distribution, the canonical exponential family representation is: \[\begin{aligned}p(y|\mu,\sigma^{2}) & =\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(y-\mu)^{2}}{2\sigma^{2}}}\\ & =\exp\left\lbrace y^{2}\left(\frac{-1}{2\sigma^{2}}\right)+y\frac{\mu}{\sigma^{2}}-\left[\frac{\mu^{2}}{2\sigma^{2}}+\frac{1}{2}\log(2\pi\sigma^{2})\right]\right\rbrace \end{aligned}\] where \(\eta=(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}})^{T}\), \(\phi(y)=(y, y^2)^T\), \(A(\eta)=\frac{1}{2}\log\left(-\frac{2\pi}{2\eta_{2}}\right)-\frac{\eta_{1}^{2}}{4\eta_{2}}\) and \(h(y)=1\).

Exponential Family

  • A multinomial distribution (a single trial, so \(\mathbf{y}\) is a one-of-\(K\) indicator vector with \(\sum_{k}y_{k}=1\)) has the form \[\begin{aligned}p(\mathbf{y}|\theta)&= \prod_{k=1}^{K}\theta_{k}^{y_{k}}=\exp\Big(\sum_{k=1}^{K}y_{k}\log\theta_{k}\Big) =\exp\Big(\sum_{k=1}^{K}y_{k}\eta_{k}\Big)=e^{\langle y,\eta\rangle}.\end{aligned}\]

  • Summing over the \(K\) admissible one-hot vectors gives the normalizer \[\exp\{A(\eta)\}=\sum_{y:\sum_{k}y_{k}=1}\exp\Big(\sum_{k=1}^{K}y_{k}\eta_{k}\Big)=\sum_{k=1}^{K}e^{\eta_{k}},\] which equals \(1\) under the canonical choice \(\eta_{k}=\log\theta_{k}\) and the constraint \(\sum_{k=1}^{K}\theta_{k}=1\). In general it gives \[\theta_{k}=e^{\eta_{k}-A(\eta)}=\frac{e^{\eta_{k}}}{\sum_{j=1}^{K}e^{\eta_{j}}},\] which is known as the multinomial logit or softmax function.
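
The softmax map from \(\eta\) to \(\theta\) is a one-liner in R (a minimal sketch; eta is an arbitrary illustrative vector):

eta = c(1.2, -0.3, 0.5)                 # natural parameters eta_k
theta = exp(eta) / sum(exp(eta))        # softmax: theta_k = e^{eta_k} / sum_j e^{eta_j}
sum(theta)                              # the theta_k sum to 1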

Exponential Family and Optimization

  • The exponential family has appealing computational properties.

  • Mean and variance are the first and second cumulants of a distribution. In the exponential family, they can be obtained by calculating derivatives of \(A(\eta)\).

  • \(A(\eta)\) is the cumulant function, \[\begin{aligned}\frac{\partial A(\eta)}{\partial\eta^{T}} & =\frac{\partial}{\partial \eta^{T} }\left\{ \log \int h(y)\exp\left\lbrace \eta^{T}\phi(y)\right\rbrace dy\right\} \\ & = \frac{\int\phi(y)h(y)\exp\{\eta^T\phi(y)\}dy}{\int h(y) \exp\{\eta^T\phi(y)\}dy}\\ & =\int\phi(y) h(y) \exp\left\lbrace \eta^{T}\phi(y)-A(\eta)\right\rbrace dy= \mathbb{E}[\phi(Y)]. \end{aligned}\]

Exponential Family and Optimization

  • Second derivative of \(A(\eta)\): \[\begin{aligned}\frac{\partial^{2} A(\eta)}{\partial\eta\,\partial\eta^{T}} & =\int \phi(y) \left(\phi(y)-\frac{\partial A(\eta)}{\partial \eta^{T}}\right)^{T} \exp\left\lbrace \eta^{T}\phi(y)-A(\eta)\right\rbrace h(y)dy \\ & =\mathbb{E}[\phi(Y)\phi(Y)^T] - \mathbb{E}[\phi(Y)]\mathbb{E}[\phi(Y)]^T = \mbox{Cov}[\phi(Y)]. \end{aligned}\]

  • Higher-order cumulants of \(\phi(y)\) can be obtained by taking higher-order derivatives of \(A(\eta)\).
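
  • As a worked example, for the Poisson representation above \(A(\eta)=e^{\eta}\), so \[A'(\eta)=e^{\eta}=\theta=\mathbb{E}[Y] \quad\mbox{and}\quad A''(\eta)=e^{\eta}=\theta=\mbox{Var}[Y],\] recovering the familiar fact that the Poisson mean equals its variance.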

Exponential Family and Optimization

  • \(A(\eta)\) is convex: \(\partial^{2} A(\eta)/\partial \eta\,\partial\eta^T = \mbox{Cov}[\phi(Y)]\) is positive semidefinite, and the natural parameter space on which \(A(\eta)\) is defined is a convex set.

  • The log-likelihood of the canonical exponential family is concave (linear in \(\eta\) minus the convex function \(A(\eta)\)): \[\ell(\eta) =\sum_{i=1}^{N}\log p(y_i | \eta) = \sum_{i=1}^{N} [\log h(y_i) + \eta^T\phi(y_i) - A(\eta)].\]

  • Concavity: a unique global ML solution for \(\eta\). The MLE satisfies \[ \ell'(\hat{\eta}) = \sum_{i=1}^{N} \phi(y_i) - N A'(\hat{\eta}) = 0,\] or \(A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\). Optimization is feasible.
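
  • For instance, in the Bernoulli case \(A'(\eta)=e^{\eta}/(1+e^{\eta})=\theta\), so \(A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} y_i\) gives \(\hat{\theta}=\bar{y}\) and \(\hat{\eta}=\log\{\bar{y}/(1-\bar{y})\}\).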

Remarks

  • The standard Newton optimizer is well suited to solving \[A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\] because it exploits second-order information (the Hessian) of the log-likelihood, which improves computational efficiency.

  • The statistic \(\mbox{Cov}[\phi(Y)]\) supplies the Hessian directly; a minimal Newton sketch follows.
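
A minimal Newton sketch for the Bernoulli natural parameter (illustrative only; \(A'(\eta)\) and \(A''(\eta)\) are the logistic mean and variance from the derivations above):

set.seed(1)
y = rbinom(50, 1, 0.3)                  # simulated Bernoulli data
eta = 0                                 # starting value
for (iter in 1:20) {
  mu  = plogis(eta)                     # A'(eta)  = E[phi(Y)]
  v   = mu * (1 - mu)                   # A''(eta) = Var[phi(Y)]
  eta = eta - (mu - mean(y)) / v        # Newton step for A'(eta) = mean(y)
}
c(eta, qlogis(mean(y)))                 # matches the closed-form MLE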

Remarks

  • When \(\phi(y)=y\), e.g. in the Poisson and Bernoulli cases, \(A(\eta)\) is the logarithm of the (two-sided) Laplace transform of \(h(y)\), evaluated at \(-\eta\): \[e^{A(\eta)} = \int h(y) e^{\eta^T y}dy.\]

Problem of MLE

  • The solution of \[A'(\hat{\eta}) = N^{-1}\sum_{i=1}^{N} \phi(y_i)\] depends on the sample size \(N\). Given a small number of observations, \(N^{-1}\sum_{i=1}^{N} \phi(y_i)\) can be far from the population mean \(\mathbb{E}[\phi(Y)]\)!

Problem of MLE

  • MLE equates a finite-sample average with the population mean, an identification that is in principle only asymptotically valid.

  • Result: MLE does not provide a good estimate \(\hat{\eta}\) for \(\eta\).

  • The estimation error propagates when one estimates \(\beta\) from \(\hat{\eta}\).

Nonlinear Regression

  • \(X\in\mathbb{R}^{d}\)

  • \(Y\in\lbrace0,1\rbrace\) and \(Y\sim \mbox{Ber}(\theta)\).

  • Logistic link function:\[\theta= f(\eta) = \frac{1}{1+e^{-\eta}}.\]

Nonlinear Regression

  • Conditional distribution: \[\Pr(Y=y|X=\mathbf{x})=\theta^{y}(1-\theta)^{1-y}=f(\beta^{T}\mathbf{x})^{y}\times f(-\beta^{T}\mathbf{x})^{1-y}.\]

  • Given \(\mathcal{D}=\lbrace(\mathbf{x}_{1},y_{1}),\cdots,(\mathbf{x}_{n},y_{n})\rbrace\) of i.i.d. random variables, the log-likelihood \[\ell(\beta)=\sum_{i=1}^{n}y_{i}\log f(\beta^{T}\mathbf{x}_{i})+(1-y_{i})\log f(-\beta^{T}\mathbf{x}_{i}).\]

Nonlinear Regression

  • Let \(f_{i}=f(\beta^{T}\mathbf{x}_{i})\). Maximization: using \(f'(\eta)=f(\eta)f(-\eta)\), the gradient of the log-likelihood is \[\begin{align}\nabla_{\beta}\ell(\beta)=&\sum_{i=1}^{n}y_{i}\mathbf{x}_{i}\frac{f(\beta^{T}\mathbf{x}_{i})f(-\beta^{T}\mathbf{x}_{i})}{f(\beta^{T}\mathbf{x}_{i})} \\ &-(1-y_{i})\mathbf{x}_{i}\frac{f(\beta^{T}\mathbf{x}_{i})f(-\beta^{T}\mathbf{x}_{i})}{f(-\beta^{T}\mathbf{x}_{i})}\\ =&\sum_{i=1}^{n}\mathbf{x}_{i}(y_{i}-f_{i}).\end{align}\]

  • Note that \(f(\eta_i)=A'(\eta_i)\) is the conditional mean of \(Y_i\), so the score again matches sufficient statistics to their expectations. \(\hat{\beta}\) must solve the nonlinear system \[\sum_{i=1}^{n}\mathbf{x}_{i}(y_{i}-f(\hat{\beta}^{T}\mathbf{x}_{i}))=0,\] as sketched below.
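
A minimal Newton-Raphson sketch for this system (illustrative; the simulated X and the glm call are assumptions added only as a cross-check, not part of the derivation above):

set.seed(1)
n = 200; X = cbind(1, rnorm(n)); beta.true = c(-0.5, 1)
y = rbinom(n, 1, plogis(X %*% beta.true))
beta = rep(0, ncol(X))
for (iter in 1:25) {
  f = as.vector(plogis(X %*% beta))     # f_i = f(beta^T x_i)
  W = f * (1 - f)                       # Var(Y_i | x_i)
  score = t(X) %*% (y - f)              # sum_i x_i (y_i - f_i)
  hess  = t(X) %*% (X * W)              # sum_i f_i (1 - f_i) x_i x_i^T
  beta  = beta + solve(hess, score)     # Newton update
}
cbind(newton = beta, glm = coef(glm(y ~ X - 1, family = binomial)))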

One Person Model

nyears = 36; beta0 = 0.2; beta1 = -0.1; beta2 = -0.9          # true coefficients
year = 1:nyears;  x = (year-round(nyears/2)) / (nyears / 2)   # standardized year
eta = beta0 + beta1 * x + beta2 * x^2                         # linear predictor
pi = plogis(eta)                                              # inverse logit link
y = rbinom(nyears, 1, pi)                                     # Bernoulli observations

Estimation

counts = glm(y ~ x + I(x^2), family=binomial); 
summary(counts)$coefficients
##               Estimate Std. Error   z value  Pr(>|z|)
## (Intercept) -0.7352476  0.5322872 -1.381299 0.1671871
## x            0.9048698  0.6274104  1.442229 0.1492377
## I(x^2)       1.4733067  1.2350338  1.192928 0.2328975

A record of \(36\) years is not enough for the MLE to be close to the true values; consistency only holds asymptotically.

One Person with a Long Life

nyears=10^4; year = 1:nyears; x =(year-round(nyears/2))/(nyears/2)
eta = beta0 + beta1 * x + beta2 * x^2 
pi = plogis(eta); y = rbinom(nyears, 1, pi)
counts = glm(y ~ x + I(x^2), family=binomial); 
summary(counts)$coefficients
##               Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)  0.1570019 0.03016450   5.204855 1.941482e-07
## x           -0.1497764 0.03549839  -4.219244 2.451232e-05
## I(x^2)      -0.8739806 0.06891199 -12.682563 7.387640e-37

Bayesian Inferences

  • A solution for finite-age persons.

  • The product rule of probability gives \[P(B) \times P(A|B) = P(A) \times P(B|A),\] which we can rearrange to get \[P(A|B) = \frac {P(A) \times P(B|A)}{P(B)},\] which is Bayes' rule.

Bayesian Inferences

  • Replace \(A\) and \(B\) by model parameters \(\theta\) and the data \(y\) to get \[p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)}\] where \(p(y|\theta)\) is the likelihood, \(p(\theta)\) is the prior, \(p(\theta|y)\) is the posterior, \(p(y)=\int p(y|\theta)p(\theta)d\theta\) is the horrible thing.

  • To avoid seeing the horrible thing: \[p(\theta|y) \propto p(\theta) \times p(y|\theta) \]

  • In most cases we can't calculate \(p(y)\). But we can calculate the ratio of \(p(\theta_1|y)\) and \(p(\theta_2|y)\): \[\frac{p(\theta_1|y) }{ p(\theta_2|y)}=\frac{p(\theta_1) \times p(y|\theta_1)}{p(\theta_2) \times p(y|\theta_2)} = \alpha\]

Computation in Bayesian Models

  • One can use the ratio \(\frac{p(\theta_1|y) }{ p(\theta_2|y)}\) to sample from the posterior distribution with a numerical sampling algorithm called Markov Chain Monte Carlo (MCMC); a minimal sketch follows this list.

  • One can also use the fact that, in certain types of models, \(p(\theta|y)\) can be approximated by tractable distributions, e.g. a normal distribution \(\mathcal{N}(\mu(\theta),\sigma^{2}(\theta))\), which makes the expression \(p(y|\theta) \mathcal{N}(\theta)\) integrable. This is the idea behind approximation methods such as the Integrated Nested Laplace Approximation (INLA).
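
A minimal Metropolis sketch built only on that posterior ratio, for a Beta-Bernoulli model (illustrative; the Beta(1, 1) prior, the proposal standard deviation 0.1 and the simulated data are arbitrary choices):

set.seed(1)
y = rbinom(50, 1, 0.3)
log.post = function(th) dbeta(th, 1, 1, log=TRUE) + sum(dbinom(y, 1, th, log=TRUE))
theta = numeric(5000); theta[1] = 0.5
for (s in 2:5000) {
  prop = theta[s-1] + rnorm(1, 0, 0.1)                    # symmetric random-walk proposal
  ratio = if (prop > 0 && prop < 1) exp(log.post(prop) - log.post(theta[s-1])) else 0
  theta[s] = if (runif(1) < ratio) prop else theta[s-1]   # accept with probability min(1, ratio)
}
quantile(theta[-(1:1000)], c(.025, .5, .975))             # posterior summary after burn-in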

Bayesian Models

  • The most common MCMC samplers: the Metropolis-Hastings algorithm and the Gibbs sampler

BUGS (Bayesian inference Using Gibbs Sampling), JAGS (Just Another Gibbs Sampler)

OpenBUGS www.openbugs.net

JAGS mcmc-jags.sourceforge.net/

  • Approximation method: Integrated Nested Laplace Approximation (INLA)

INLA www.r-inla.org

  • All available in R.

Bayesian Simple Model

We will dump the model to a file using cat("", file="")

dataset = list(T=length(x), year=x, y=as.numeric(y))
cat("model
    { # priors
      beta0 ~ dnorm(0,0.05)
      beta1 ~ dnorm(0,0.05)
      beta2 ~ dnorm(0,0.05)
      # likelihood
      for(t in 1:T)
      {
        y[t] ~ dbin(pi[t], 1)
        logit(pi[t]) = beta0 + beta1*year[t] + beta2*pow(year[t],2)
      }
    }", file="model_single.bug")

Bayesian Simple Model

library(R2jags)
params = c("beta0","beta1","beta2", "pi")
counts.bayes = jags(data=dataset, model.file="model_single.bug", 
                    parameters.to.save=params, n.iter=2000, 
                    n.burnin=1000, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 36
##    Unobserved stochastic nodes: 3
##    Total graph size: 389
## 
## Initializing model

Bayesian Simple Model

pi.b = counts.bayes$BUGSoutput$median$pi
residuals = y - pi.b
acf(residuals)

Bayesian Simple Model

head(print(counts.bayes))
## Inference for Bugs model at "model_single.bug", fit using jags,
##  3 chains, each with 2000 iterations (first 1000 discarded)
##  n.sims = 3000 iterations saved
##          mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
## beta0      0.691   0.512 -0.286  0.336  0.693  1.040  1.713 1.005   470
## beta1     -0.431   0.619 -1.633 -0.848 -0.419 -0.006  0.783 1.005   460
## beta2     -0.978   1.179 -3.287 -1.775 -1.010 -0.172  1.292 1.004   590
## pi[1]      0.545   0.206  0.150  0.388  0.548  0.705  0.911 1.003   920
## pi[2]      0.562   0.189  0.189  0.423  0.567  0.706  0.899 1.003   840
## pi[3]      0.580   0.172  0.236  0.456  0.587  0.707  0.882 1.003   750
## pi[4]      0.595   0.156  0.277  0.484  0.605  0.709  0.868 1.004   660
## pi[5]      0.610   0.142  0.316  0.511  0.619  0.713  0.857 1.004   580
## pi[6]      0.623   0.130  0.353  0.533  0.632  0.717  0.847 1.004   510
## pi[7]      0.634   0.119  0.384  0.553  0.642  0.721  0.845 1.005   450
## pi[8]      0.644   0.112  0.409  0.569  0.650  0.726  0.840 1.005   400
## pi[9]      0.652   0.107  0.429  0.581  0.658  0.731  0.839 1.006   370
## pi[10]     0.658   0.104  0.439  0.589  0.665  0.734  0.837 1.006   360
## pi[11]     0.663   0.102  0.447  0.594  0.671  0.738  0.841 1.006   350
## pi[12]     0.666   0.102  0.454  0.597  0.675  0.743  0.845 1.006   360
## pi[13]     0.668   0.103  0.460  0.596  0.677  0.744  0.846 1.006   370
## pi[14]     0.668   0.104  0.454  0.596  0.677  0.745  0.849 1.006   390
## pi[15]     0.667   0.106  0.449  0.594  0.678  0.746  0.851 1.005   420
## pi[16]     0.665   0.107  0.444  0.592  0.675  0.746  0.849 1.005   450
## pi[17]     0.662   0.108  0.435  0.589  0.671  0.743  0.850 1.005   490
## pi[18]     0.657   0.109  0.429  0.583  0.667  0.739  0.847 1.004   550
## pi[19]     0.652   0.110  0.424  0.578  0.659  0.732  0.843 1.004   620
## pi[20]     0.645   0.110  0.418  0.571  0.653  0.726  0.837 1.003   710
## pi[21]     0.636   0.109  0.409  0.563  0.644  0.716  0.830 1.003   850
## pi[22]     0.627   0.108  0.402  0.554  0.632  0.705  0.822 1.002  1100
## pi[23]     0.616   0.107  0.398  0.544  0.621  0.693  0.811 1.002  1500
## pi[24]     0.603   0.106  0.389  0.531  0.607  0.678  0.799 1.001  2200
## pi[25]     0.588   0.106  0.377  0.516  0.593  0.664  0.787 1.001  3000
## pi[26]     0.572   0.106  0.364  0.500  0.576  0.646  0.773 1.001  3000
## pi[27]     0.555   0.108  0.342  0.481  0.560  0.631  0.761 1.001  3000
## pi[28]     0.536   0.111  0.316  0.458  0.541  0.612  0.747 1.001  3000
## pi[29]     0.515   0.117  0.284  0.434  0.520  0.597  0.736 1.001  3000
## pi[30]     0.493   0.125  0.246  0.405  0.497  0.582  0.732 1.001  2200
## pi[31]     0.470   0.135  0.209  0.374  0.472  0.565  0.732 1.002  1400
## pi[32]     0.447   0.146  0.172  0.340  0.447  0.549  0.734 1.002  1100
## pi[33]     0.424   0.159  0.134  0.304  0.418  0.534  0.738 1.003   860
## pi[34]     0.401   0.171  0.102  0.269  0.390  0.521  0.744 1.003   750
## pi[35]     0.379   0.184  0.078  0.235  0.362  0.508  0.755 1.003   680
## pi[36]     0.358   0.196  0.055  0.202  0.333  0.497  0.772 1.004   630
## deviance  50.638   2.464 47.849 48.883 49.974 51.683 56.962 1.002  1600
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 3.0 and DIC = 53.7
## DIC is an estimate of expected predictive error (lower deviance is better).
## $n.chains
## [1] 3
## 
## $n.iter
## [1] 2000
## 
## $n.burnin
## [1] 1000
## 
## $n.thin
## [1] 1
## 
## $n.keep
## [1] 1000
## 
## $n.sims
## [1] 3000

Bayesian Simple Model

plot(counts.bayes)

Bayesian Simple Model

plot(year, y, type="b")
lines(pi,col="blue")

Bayesian Simple Model

plot(as.mcmc(counts.bayes)[[1]][,1]) # 1st chain, beta0

Bayesian Long Life Model

nyears = 10^3; beta0= 0.2; beta1 = -0.1; beta2 = -0.9;
year = 1:nyears;  x = (year-round(nyears/2)) / (nyears / 2)
eta = beta0 + beta1 * x + beta2 * x^2 
pi = plogis(eta); y = rbinom(nyears, 1, pi)
dataset = list(T=length(x), year=x, y=as.numeric(y))
params = c("beta0","beta1","beta2")
counts.bayes = jags(data=dataset, model.file="model_single.bug", 
                    parameters.to.save=params, n.iter=2000, 
                    n.burnin=1000, n.chains=3)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1000
##    Unobserved stochastic nodes: 3
##    Total graph size: 10511
## 
## Initializing model

Bayesian Long Life Model

plot(counts.bayes)

MCMC Animation

Bayesian Computation by Hand

  • Bernoulli distribution: \(p(y_i|\theta) = \theta^{y_i} (1-\theta)^{1-y_i}\) for \(y_i=0\) or \(y_i=1\), so \[p(\mathbf{y}|\theta) = \prod_{i=1}^n \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{\sum_{i=1}^n y_i} (1-\theta)^{n-\sum_{i=1}^n y_i}.\]

  • Prior: a Beta distribution Beta\((\alpha_1,\alpha_2)\): \[ p(\theta) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1},\] where \(\Gamma(\cdot)\) is the usual Gamma function \(\Gamma(\alpha)=\int_{0}^{\infty}t^{\alpha-1}e^{-t}dt\).

  • Posterior: Beta\((\alpha_1^\prime,\alpha_2^\prime)\), with \(\alpha_1^\prime = \alpha_1 + \sum_{i=1}^n y_i\), \(\alpha_2^\prime = \alpha_2 + n-\sum_{i=1}^n y_i\).

Bayesian Computation by Hand

\[ \begin{eqnarray*} p(\theta|\mathbf{y}) & \propto & p(\mathbf{y}|\theta)~p(\theta) \\ && \\ &=& \theta^{\sum_{i=1}^n y_i} (1-\theta)^{n-\sum_{i=1}^n y_i} ~ \frac{\Gamma(\alpha_1 +\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} \theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1} \\ && \\ & \propto & \theta^{(\alpha_1 + \sum_{i=1}^n y_i) - 1} (1-\theta)^{(\alpha_2 + n-\sum_{i=1}^n y_i) - 1}. \\ \end{eqnarray*} \] The posterior is still a beta distribution, with updated parameters \(\alpha_1^\prime = \alpha_1 + \sum_{i=1}^n y_i\) and \(\alpha_2^\prime = \alpha_2 + n-\sum_{i=1}^n y_i\).

  • Conjugate priors: A family of priors such that, upon being multiplied by the likelihood, yields a posterior in the same family.

Bayesian Computation by Hand

  • Voting model: \(n\) randomly selected registered voters express opinions (yes or no). Voters are assumed exchangeable, each favoring the proposal with probability \(\theta\).

  • Initial guess: \(330\) persons in favor and \(270\) against.

  • Prior: a beta distribution with \(\alpha_1 = 330\) and \(\alpha_2 =270\).

  • If \(\sum_{i=1}^n y_i = 600\) of the \(n = 1000\) respondents favor the vote, then the posterior is a beta distribution with parameters \(\alpha_1^\prime =\alpha_1 +\sum_{i=1}^n y_i = 930\) and \(\alpha_2^\prime = \alpha_2 + n -\sum_{i=1}^n y_i = 670\).

Bayesian Computation by Hand

x = seq(.45, 1, .005); prior = dbeta(x, 330, 270); 
post = dbeta(x, 930, 670); infinite = dbeta(x,10^4,270);
plot(x, post, type="l", col="blue", ylim=c(0, 50), lwd=2,
         xlab="Proportion in Favor", ylab="Density")
lines(x, prior, col="red"); lines(x,infinite, col="black")

Bayesian Computation by Computer

Grid Approximation

x = seq(.45, 1, .005); prior = dbeta(x, 330, 270); 
likelihood = dbinom(600, 1000, x)   # probability of 600 "yes" responses in 1000 trials
post.com = likelihood*prior; post.com = post.com/sum(post.com)
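
The grid result can be checked against the conjugate Beta(930, 670) posterior derived above (a minimal sketch; dividing by the grid step rescales the normalized weights to an approximate density):

post.den = post.com / .005                        # grid step is .005
plot(x, dbeta(x, 930, 670), type="l", col="blue", ylab="Density")
lines(x, post.den, col="red", lty=2)              # should lie on top of the conjugate posterior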

Conjugation

  • Multinomial distribution of \(n\) trials over \(k\) categories, with \(y_{i}\) trials falling in category \(i\): \[ p(y | \theta )=\left(\begin{array}{c} n\\ y_{1},\dots,y_{k} \end{array}\right)\theta_{1}^{y_{1}}\theta_{2}^{y_{2}}\cdots\theta_{k}^{y_{k}}.\] If we choose a prior of the form \(\theta^{\alpha-1}\), \[p(\theta|\alpha)\propto\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}.\]

  • The posterior, proportional to the product of the likelihood and the prior, is \[p(\theta |y, \alpha) \propto p(y| \theta)p(\theta | \alpha) \propto\theta_{1}^{y_{1}+\alpha_{1}-1}\theta_{2}^{y_{2}+\alpha_{2}-1}\cdots\theta_{k}^{y_{k}+\alpha_{k}-1}.\]

Conjugation

  • A prior of the form \(\theta^{\alpha-1}\), \[p(\theta|\alpha)\propto\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1},\] has the normalizing denominator \[\int_{\theta\in\Delta}\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}d\theta=\frac{\prod_{i=1}^{k}\Gamma(\alpha_{i})}{\Gamma(\sum_{i=1}^{k}\alpha_{i})}=B(\alpha),\] where \(\Delta\) denotes the simplex \(\left\{ \sum_{i=1}^{k}\theta_{i}=1,\theta_{i}\geq0\right\}\).

  • The resulting distribution is called the Dirichlet distribution, denoted \(\mbox{Dir}(\alpha)\): \[\frac{\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}}{\int_{\theta\in\Delta}\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}d\theta}=\frac{\theta_{1}^{\alpha_{1}-1}\theta_{2}^{\alpha_{2}-1}\cdots\theta_{k}^{\alpha_{k}-1}}{B(\alpha)}.\]

  • The Beta distribution is a special case of the Dirichlet distribution with \(k=2\): \[\int_{0}^{1}\theta^{\alpha_{1}-1}(1-\theta)^{\alpha_{2}-1}d\theta=\frac{\Gamma(\alpha_{1})\Gamma(\alpha_{2})}{\Gamma(\alpha_{1}+\alpha_{2})}.\]
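
Base R has no Dirichlet sampler, but a draw from \(\mbox{Dir}(\alpha)\) can be obtained by normalizing independent \(\mbox{Gamma}(\alpha_i, 1)\) variables (a minimal sketch; the chosen \(\alpha\) is illustrative):

rdirichlet1 = function(alpha) { g = rgamma(length(alpha), shape=alpha); g / sum(g) }
set.seed(1)
draws = t(replicate(5000, rdirichlet1(c(2, 3, 5))))   # 5000 draws from Dir(2, 3, 5)
colMeans(draws)                                       # close to alpha/sum(alpha) = (0.2, 0.3, 0.5)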

Conjugation

  • Conjugation is mutual

  • The conjugation: \[ p(\theta | y,\alpha) \propto p(y|\theta) p(\theta|\alpha).\] If \(p(\theta | y,\alpha)\) and \(p(\theta | \alpha)\) are Dirichlet, then \(p(y|\theta)\) is multinomial.

  • Mutuality: if \(p(\theta | y,\alpha)\) and \(p(\theta | \alpha)\) are multinomial, then \(p(y|\theta)\) is Dirichlet.

Exponential Family and Conjugation

  • Products of exponential family distributions are exponential family distributions.

  • By the change of variables \(t=\beta\theta\) in the Gamma integral, \[\Gamma(\alpha)=\beta^{\alpha}\int_{0}^{\infty}\theta^{\alpha-1}e^{-\beta\theta}d\theta.\]

  • We have the Gamma distribution: \(\mbox{Gamma}(\alpha,\beta)\) \[\frac{\beta^{\alpha}\theta^{\alpha-1}e^{-\beta\theta}}{\Gamma(\alpha)}.\]

Exponential Family and Conjugation

  • Consider \(n\) i.i.d. observations with \(Y_j|\theta \sim \mbox{Poi}(\theta)\): \[p(y|\theta)=\prod_{j=1}^{n}\frac{\theta^{y_{j}}e^{-\theta}}{y_{j}!}\propto\theta^{\sum_{j=1}^{n}y_{j}}e^{-n\theta}.\]

  • Its conjugate prior is \(\mbox{Gamma}(\alpha,\beta)\): \[p(\theta|\alpha,\beta)\propto\theta^{\alpha-1}e^{-\beta\theta}\] with normalizing constant \(\beta^{\alpha}/\Gamma(\alpha)\).

  • The posterior is a Gamma distribution \(\mbox{Gamma}(\sum_{j}y_{j}+\alpha,\beta+n)\), proportional to \[\theta^{\sum_{j=1}^{n}y_{j}+\alpha-1}e^{-(\beta+n)\theta}.\]
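
A minimal numerical check of this update (illustrative; the Gamma(2, 1) prior and the simulated Poisson data are arbitrary choices):

set.seed(1)
y = rpois(30, 4); alpha = 2; beta = 1
theta = seq(0, 10, .01)
post.grid = dgamma(theta, alpha, beta) * sapply(theta, function(t) prod(dpois(y, t)))
post.grid = post.grid / sum(post.grid) / .01               # rescale grid weights to a density
plot(theta, dgamma(theta, alpha + sum(y), beta + length(y)), type="l", col="blue")
lines(theta, post.grid, col="red", lty=2)                  # should coincide with Gamma(sum(y)+alpha, beta+n)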

Truth about Exponential Family

  • One reason that statistical models routinely correspond to many different detailed processes is that they belong to the exponential family.

  • All of the exponential family distributions are maximum entropy distributions.

  • One can no more infer the underlying mechanistic process from these statistical distributions than one can infer an evolutionary process from them.

  • On the other hand, maximum entropy means we can use them to do useful work without fully understanding the underlying nature.

Truth about Bayesian Methods

  • The Bayesian method can be viewed as a stochastic optimizer for a global optimization problem.

  • The maximization problem \(\max_{\theta \in \Theta} U(\theta)\) has a stochastic version: find a density such that \[P(U(\theta^{*}))\geq P(U(\theta)) \mbox{ for } \theta\neq\theta^{*}.\] If we simulate from a density that puts high probability on regions where \(U\) is large and low probability where it is small, the draws concentrate around the maximizer.

Truth about Bayesian Methods

  • A density sharing its maxima with \(U\) is \[ P(U(\theta))= \frac{\exp(U(\theta)) p(\mathbf{y}|\theta)}{\int \exp(U(\theta)) p(\mathbf{y}|\theta)d\theta },\] for which \(P(U(\theta^{*}))\geq P(U(\theta))\) whenever \(U(\theta^{*})\geq U(\theta)\).

  • The likelihood factor is free to calibrate, so the denominator can be chosen to accelerate convergence or to avoid local maxima.