Concept of Bayesian data analysis with a coin example and rejection sampling

1. Frequentist view

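Suppose you have a coin and want to know the probability that it lands heads when flipped.
If you flip it 10 times and get 8 heads, what would you say the probability of heads is?
From a frequentist point of view, you would probably say it is 0.8: \(\text{Pr}(H)=\frac{8}{10}=0.8\) (where \(H\) = heads).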

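Explaining results by looking at the frequencies in the data we have observed is what we can call the frequentist approach. In other words, from the frequentist viewpoint, the true parameter (\(\theta\)) that represents the probability of heads is fixed at a single value (even though we do not know it), and the data we observed are just one part of the data that this true parameter could generate.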

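Let us explain this with the coin example. First, write the probability that the coin lands heads as \(\text{Pr}(H)=\theta\). It may be hard to accept, but here \(\theta\) can be seen as a specific value that determines the probability of heads or tails for our coin. We do not know its value, but every flip produces heads or tails according to \(\text{Pr}(H)=\theta\), and so far we have witnessed 10 of those flips.
So what is the frequentist's goal? It is to use the 10 observed flips to find the most plausible \(\hat{\theta}\), that is, the one that best explains the data.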

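Let us make this concrete with formulas. If we record 1 for heads and 0 for tails, each flip is a Bernoulli trial and the number of heads in \(n\) flips follows a binomial distribution. The probability mass function of the binomial distribution is: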

\[\text{Pr}(\textbf{y}|\theta)=\binom{n}{k}\theta^{k}(1-\theta)^{n-k}\] where \(\theta\) is \(\text{Pr}(H)\) for the population, \(n\) is the total number of flips, \(k\) is the number of observed heads, and \(\textbf{y}\) is our observation (e.g., \(\textbf{y}=(H,H,H,T,H,T,H,...)\)).

From the frequentist viewpoint, \(\theta\) is a single fixed value, and the goal is to find its best estimate, that is, the \(\hat{\theta}_{MLE}\) that maximizes \(\text{Pr}(\textbf{y}|\theta)\). This is called Maximum Likelihood Estimation (MLE). To say a bit more, the likelihood is the probability of the observed data under the model we chose (binomial distribution plus \(\hat{\theta}\)), so the likelihood goes up when the model fits the data well. \[\hat{\theta}_{MLE}=\arg\max_\theta\text{Pr}( \textbf{y}|\theta)\] Solving this analytically gives \(\hat{\theta}_{MLE}=\frac{k}{n}=0.8\). But let us find it in a simple way that follows the idea described above.

We will now evaluate \(\text{Pr}(\textbf{y}|\hat{\theta})\) while varying \(\hat{\theta}\). The value of \(\text{Pr}(\textbf{y}|\hat{\theta})\) should be largest at \(\hat{\theta}_{MLE}\).

theta_hat_grid <- seq(0, 1, length.out = 101)
# Binomial likelihood Pr(y | theta) for k heads in n flips
pr_y_theta_likelihood <- function(theta, n, k) {
    likelihood <- choose(n, k) * theta^k * (1 - theta)^(n - k)
    # For large n it is safer to work on the log scale to avoid numerical errors:
    # log_lik <- lchoose(n, k) + k * log(theta) + (n - k) * log(1 - theta)
    # likelihood <- exp(log_lik)
    return(likelihood)
}


n = 10  # number of flips
k = 8  # number of heads

likelihood_grid <- pr_y_theta_likelihood(theta_hat_grid, n, k)  # likelihood for each hat(theta)
mle_index <- which.max(likelihood_grid)  # find MLE index of theta (which maximizes likelihood)

plot(theta_hat_grid, likelihood_grid, ylab = expression("Likelihood [Pr(y|" * 
    theta * ")]"), xlab = expression(hat(theta)), type = "l", ylim = c(0, 0.35), 
    main = "Red x is MLE point")
points(theta_hat_grid[mle_index], likelihood_grid[mle_index], pch = 4, col = "red")
text(0.9, 0.32, "MLE solution")

print(paste0("MLE solution of theta is ", theta_hat_grid[mle_index]))

## [1] "MLE solution of theta is 0.8"

As expected, \(\hat{\theta}_{MLE}\) is 0.8, which is simply \(\frac{\text{Observed heads}}{\text{Total flips}}\).
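The grid search can be cross-checked with a numerical optimizer. The following is a minimal sketch, assuming the same binomial model; it uses base R's `dbinom` and `optimize` instead of the hand-written likelihood, and the names `log_lik`, `theta_mle_numeric` and `theta_mle_closed` are only illustrative.

```r
# Log-likelihood of theta for k heads in n flips (binomial model).
log_lik <- function(theta, n, k) {
    dbinom(k, size = n, prob = theta, log = TRUE)
}

n <- 10  # number of flips
k <- 8   # number of heads

# Search the interval (0, 1) for the theta with the largest log-likelihood.
fit <- optimize(log_lik, interval = c(0, 1), n = n, k = k, maximum = TRUE)
theta_mle_numeric <- fit$maximum
theta_mle_closed <- k / n  # heads / flips

print(c(numeric = theta_mle_numeric, closed_form = theta_mle_closed))
```

The optimizer should agree with the grid result up to its numerical tolerance, without being limited to the 101 grid points.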

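Now a problem arises.
If you had to tell someone the probability of heads, would you say 0.8?
Or would you compute a confidence interval (say 0.7-0.9) and say that \(\theta\) is in 0.7-0.9 with 95% probability?
(Please avoid that. A confidence interval means something slightly different. As said above, \(\theta\) is fixed at one value, and if you repeat the experiment and compute a confidence interval each time, 95% of those intervals will contain \(\theta\). So it is not that \(\theta\) ranges over 0.7-0.9; it is the interval-building procedure that captures \(\theta\) 95% of the time. This may sound like the same thing, but if we observed another data set the interval could be 0.4-0.6. If we keep collecting data and computing confidence intervals in this way, \(\theta\) lies inside them 95% of the time.)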
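The repeated-experiment reading of a confidence interval can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the text above: the "true" value `theta_true = 0.6` is made up for the simulation, and the interval is the exact (Clopper-Pearson) one returned by base R's `binom.test`.

```r
set.seed(1)
theta_true <- 0.6      # assumed true value, chosen only for this simulation
n <- 10                # flips per experiment
n_experiments <- 2000  # number of repeated experiments

# Number of heads observed in each repeated experiment.
heads <- rbinom(n_experiments, size = n, prob = theta_true)

# For each experiment, build a 95% interval and check whether it contains theta_true.
covered <- vapply(heads, function(k) {
    ci <- binom.test(k, n, conf.level = 0.95)$conf.int
    ci[1] <= theta_true && theta_true <= ci[2]
}, logical(1))

coverage <- mean(covered)
print(coverage)
```

Each individual interval either contains `theta_true` or it does not; the 95% refers to the fraction of intervals that do. The exact interval is conservative, so with only 10 flips the simulated coverage comes out at or above 95%.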

What if you had flipped the coin only once? Would you say the probability of heads is 0 or 1?

Consider another situation.
Someone else comes along and flips the same coin 10 more times, getting 1 head and 9 tails.
Would you then pool their flips with yours and say the probability of heads is \(\frac{8+1}{20}=0.45\)?

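This last question (another person's flips) is not the goal of this post, but it shows that to answer it we would need more people making more flips. If we had 1000000 people flip this same coin 100000 times each, we could probably talk about the probability of heads with some confidence.
In this problem, however, a single person flipped the coin only 10 times. That person's way of flipping may be biased, or the coin itself may not be fair. And heads may simply have come up unusually often by chance.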

2. Bayesian view

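Now let us look at the Bayesian view. From the Bayesian viewpoint, the parameter (\(\theta\)) is random (it has a distribution) and the data are fixed.
In other words, \(\theta\) is not pinned to one value; you can picture it as being generated from some distribution, with the heads you observe determined by the \(\theta\) at hand. Before, we observed data generated at random from a fixed \(\theta\); now \(\theta\) itself varies and we observe the data that follow from it. (This picture is an intuition: in practice the distribution expresses our uncertainty about \(\theta\).) In Bayesian modeling we cannot say that \(\theta\) is exactly some particular value, but we can find out its distribution. First, let us write down Bayes' rule.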

\[\text{Pr}(\theta|\textbf{y})=\frac{\text{Pr}(\textbf{y}|\theta)\text{Pr}(\theta)}{\text{Pr}(\textbf{y})}=\frac{\text{Likelihood}\: \times\: \text{Prior}}{\text{Evidence}}\]

We use the formula above to obtain the distribution \(\text{Pr}(\theta|\textbf{y})\) mentioned earlier. That is, once we know every term on the right-hand side we can compute \(\text{Pr}(\theta|\textbf{y})\), which gives us the distribution of \(\theta\).

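You may wonder why getting a distribution matters. Looking at the formula, one advantage is that combining the likelihood with our belief (the prior) makes the result less skewed toward the data. On top of that, having the distribution itself lets us simulate what values we would get if we kept flipping the coin, and lets us compute the range of uncertainty to judge whether our data are sufficient to draw a conclusion. Finally, it lets us make the kind of statement mentioned earlier: "with 95% probability, this coin's probability of heads is between 0.7 and 0.9."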

(More on this later, but one more thing to note: the evidence, \(\text{Pr}(\textbf{y})=\sum_\theta\text{Pr}(\textbf{y}|\theta)\text{Pr}(\theta)\), does not depend on \(\theta\), so it is a constant once the data are fixed. Bayesian analysis has therefore developed methods that obtain \(\text{Pr}(\theta|\textbf{y})\) without computing it. As a formula:)

\[\text{Pr}(\theta|\textbf{y})\propto\text{Pr}(\textbf{y}|\theta)\text{Pr}(\theta)\]

We already covered the likelihood (\(\text{Pr}(\textbf{y}|\theta)\)) in the previous analysis. So if we multiply it by our prior knowledge (our belief about the distribution of \(\theta\) before seeing the data, i.e., \(\text{Pr}(\theta)\)), we get the unnormalized posterior (\(\text{Pr}(\textbf{y}|\theta)\text{Pr}(\theta)\)). From this unnormalized posterior we can then obtain the posterior.

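In general we use some specific distribution for \(\text{Pr}(\theta)\) (one with a reasonable justification for the population).
The posterior can sometimes be derived analytically, but in many cases that is hard, so sampling methods are used instead. That is, we draw samples with some technique, and once we have collected enough of them we treat them as an approximation of the posterior.
In this example a reasonable prior is one with \(\mathbb{E}[\theta]\approx0.5\), because for an ordinary coin \(\text{Pr}(H)\) is close to 0.5. Before looking at the posterior, let us first look at rejection sampling.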

3. Rejection sampling

Rejection sampling is one way (the simplest) to draw samples from a distribution. For example, suppose we know some distribution, say the normal distribution. Its pdf is \(\text{Pr}(y|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp[-\frac{(y-\mu)^2}{2\sigma^2}]\). But how do we actually draw samples \(y\sim \text{N}(\mu,\sigma^2)\) from it? In R you can just use the rnorm function. But what if there were no such function? That is where methods like rejection sampling come in.

Consider the pdf of the following normal distribution (\(\text{N}(\mu=3,\sigma=1.5)\)).

mu = 3
sigma = 1.5
y_grid <- seq(-3, 9, length.out = 1201)  # for plotting
y_norm <- dnorm(y_grid, mean = mu, sd = sigma)
plot(y_grid, y_norm, type = "l", ylab = "density", xlab = "y")

If \(y\) is sampled well, the histogram of the samples should look like the density curve.

set.seed(1235)
y_sample <- rnorm(1e+05, mu, sigma)

hist(y_sample, freq = F, breaks = 50, xlab = "y", main = "Red line is density (dnorm)")
points(y_grid, y_norm, type = "l", xlab = "y", col = "red")

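The basic idea of rejection sampling is to define a sampling region that is easy to sample from and that completely covers the distribution we are interested in (nqy), and then to throw away the samples that fall outside the pdf we want to sample from. (No math will be covered here.) For now, let us assume that sampling from the normal distribution is hard.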

# Assume y grid [-3,9]
y_grid <- seq(-3, 9, length.out = 1201)

# To cover the target we define a constant * q(y) function, called nqy().
# n * q(y) must lie above the whole normal density curve. Here q(y) is a
# uniform distribution.
nqy <- function(y) {
    # The plain uniform density on [-3, 9] is 1/12, which is below the peak of
    # the normal density (about 0.27), so we multiply by constant_n to lift
    # the uniform density to 0.28, above the whole normal curve.
    constant_n <- 0.28/(1/(9 - (-3)))
    return(constant_n * dunif(y, min = -3, max = 9))
}

plot(y_grid, y_norm, type = "l", ylab = "density", xlab = "y", ylim = c(0, 0.3), 
    main = "Red line is nqy's density")
points(y_grid, nqy(y_grid), type = "l", col = "red")

Let us draw samples that fill the area under the nqy density. To fill it, we simply multiply the density by a random number in [0,1].

number_sampling <- 10000
set.seed(1235)
sample_nqy <- sample(y_grid, number_sampling, replace = T, prob = nqy(y_grid))  # sample y based on nqy (actually uniform)
sampling <- nqy(sample_nqy) * runif(number_sampling, min = 0, max = 1)

plot(sample_nqy, sampling, main = "Red line is nqy, blue line is target (normal)", 
    xlab = "y")
points(y_grid, nqy(y_grid), type = "l", col = "red", lwd = 3)
points(y_grid, y_norm, type = "l", col = "blue", lwd = 3)

Now let us discard the nqy samples that fall outside the target distribution (normal, blue).

final_sample <- sample_nqy[sampling <= dnorm(sample_nqy, mu, sigma)]  # keep only the samples whose height is at or below the normal density

hist(final_sample, freq = F, breaks = 50, xlab = "y", main = "Red line is density (dnorm)")
points(y_grid, y_norm, type = "l", xlab = "y", col = "red")

The histogram is not as good as the one from rnorm, but we can see that it is similar (and it gets closer with more samples). So this kind of method lets us draw samples from a given distribution.
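The same steps can be wrapped in a small function that also reports how many proposals survive. This is a minimal sketch, not the code used above: it draws proposals with `runif` on the continuous interval instead of sampling from `y_grid`, and the names `rejection_sample`, `envelope_height` and `expected_rate` are made up for illustration.

```r
# Rejection sampling with a flat envelope of height envelope_height on [lower, upper].
rejection_sample <- function(n_proposals, target_density, lower, upper, envelope_height) {
    proposal <- runif(n_proposals, min = lower, max = upper)     # horizontal position
    height <- runif(n_proposals, min = 0, max = envelope_height) # vertical position
    accepted <- proposal[height <= target_density(proposal)]     # keep points under the target
    list(sample = accepted, acceptance_rate = length(accepted) / n_proposals)
}

set.seed(1235)
mu <- 3
sigma <- 1.5
result <- rejection_sample(1e+05, function(y) dnorm(y, mu, sigma),
                           lower = -3, upper = 9, envelope_height = 0.28)

# Area under the target on [-3, 9] divided by the area of the envelope box.
expected_rate <- (pnorm(9, mu, sigma) - pnorm(-3, mu, sigma)) / (0.28 * (9 - (-3)))

print(c(observed = result$acceptance_rate, expected = expected_rate))
print(c(mean = mean(result$sample), sd = sd(result$sample)))
```

The acceptance rate is the ratio of the two areas, so a tighter envelope wastes fewer proposals. This is one reason a flat envelope becomes inefficient for sharply peaked targets.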

4. Rejection sampling for unnormalized posterior

Now let us return to the unnormalized posterior (the right-hand side of the equation below).

\[\text{Pr}(\theta|\textbf{y})\propto\text{Pr}(\textbf{y}|\theta)\text{Pr}(\theta)\]

To check the rejection sampler and wrap up our analysis, assume the following: \(k|\theta\sim\text{Binomial}(n,\theta)\) and \(\theta\sim\text{Beta}(\alpha=2,\beta=2)\).
We chose \(\theta\sim\text{Beta}(2,2)\) because \(\mathbb{E}[\theta]=0.5\). The choice of prior is not really the main point here.
First, let us compute the unnormalized posterior.

# n=10, k=8 likelihood
lik_hood = function(theta, n, k) {
    # likelihood<-factorial(n)/factorial(k)/factorial(n-k)*theta^k*(1-theta)^(n-k)
    # Actually it is better to use loglikelihood to prevent numerical errors.
    log_lik <- sum(log(1:n)) - sum(log(1:k)) - sum(log(1:(n - k))) + k * log(theta) + 
        (n - k) * log(1 - theta)
    likelihood <- exp(log_lik)
    return(likelihood)
}
prior <- function(theta) {
    return(dbeta(theta, 2, 2))
}
unnorm_post_density <- function(theta, n, k) {
    unnorm_posterior <- lik_hood(theta, n, k) * prior(theta)
    return(unnorm_posterior)
}
theta_grid <- seq(0, 1, length.out = 101)
n = 10  # total flips
k = 8  # observed heads
unnorm_post <- unnorm_post_density(theta_grid, n, k)

plot(theta_grid, unnorm_post, type = "l", ylab = "unnormalized density", xlab = expression(theta * 
    "|" * y))

As mentioned above, this is the unnormalized posterior. In other words, integrating it over \(\theta\) does not give 1. That is, \[\int_{\theta}\text{Pr}(\textbf{y}|\theta)\text{Pr}(\theta)d\theta\neq1\]

Normalizing it here is not hard, because integrating over a single parameter \(\theta\) is not a difficult problem. We can even do it very easily on a discrete grid. Let us see it in code.

delta_theta <- theta_grid[2] - theta_grid[1]  # grid resolution

normalization_const <- sum(unnorm_post * delta_theta)

plot(theta_grid, unnorm_post/normalization_const, type = "l", ylab = "Normalized density", 
    xlab = expression(theta * "|" * y))
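The grid-based normalization constant can be checked against two other routes. This is a minimal sketch under the same Binomial(10, theta) likelihood and Beta(2, 2) prior; the closed form \(\binom{n}{k}B(\alpha+k,\beta+n-k)/B(\alpha,\beta)\) is the standard Beta-binomial identity rather than something derived in this post, and the names `unnorm_post_fn`, `const_grid`, `const_integrate` and `const_closed` are only illustrative.

```r
n <- 10  # total flips
k <- 8   # observed heads

# Unnormalized posterior: binomial likelihood times Beta(2, 2) prior.
unnorm_post_fn <- function(theta) {
    dbinom(k, size = n, prob = theta) * dbeta(theta, 2, 2)
}

# 1) Riemann sum on a grid, as in the code above.
theta_grid_check <- seq(0, 1, length.out = 101)
delta <- theta_grid_check[2] - theta_grid_check[1]
const_grid <- sum(unnorm_post_fn(theta_grid_check) * delta)

# 2) Numerical integration with base R's integrate().
const_integrate <- integrate(unnorm_post_fn, lower = 0, upper = 1)$value

# 3) Closed form using the Beta function (assumes the Beta(2, 2) prior).
const_closed <- choose(n, k) * beta(2 + k, 2 + n - k) / beta(2, 2)

print(c(grid = const_grid, integrate = const_integrate, closed_form = const_closed))
```

All three should agree closely, which supports the point that a one-dimensional normalization is easy. The difficulty only appears when \(\theta\) has many dimensions.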

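Now we have the normalized posterior distribution. But how do we sample from it? We can use rejection sampling. Looking closely, though, rejection sampling does not actually need the normalized posterior. Whatever constant we multiply the density by, the shape of the histogram of the accepted samples stays the same, as long as the envelope still covers the scaled curve. (Only the absolute height of the density curve differs.)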
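The claim that the constant does not matter can be tried out directly. This is a rough sketch under the same likelihood and prior, with proposals drawn by `runif` rather than from a grid; the helper `draw` and its `scale` argument are hypothetical names introduced only for this illustration.

```r
set.seed(1235)
n <- 10  # total flips
k <- 8   # observed heads

# Unnormalized posterior: binomial likelihood times Beta(2, 2) prior.
target <- function(theta) dbinom(k, size = n, prob = theta) * dbeta(theta, 2, 2)
envelope_height <- 0.35  # flat envelope above the peak of the target

# Rejection sampling against scale * target, always under the same envelope.
draw <- function(scale, n_proposals = 50000) {
    theta <- runif(n_proposals, min = 0, max = 1)
    height <- runif(n_proposals, min = 0, max = envelope_height)
    theta[height <= scale * target(theta)]
}

sample_full <- draw(scale = 1)
sample_half <- draw(scale = 0.5)  # same shape, half the height

print(rbind(
    full = c(n = length(sample_full), mean = mean(sample_full), sd = sd(sample_full)),
    half = c(n = length(sample_half), mean = mean(sample_half), sd = sd(sample_half))
))
```

Halving the target keeps the mean and spread of the accepted samples about the same and only lowers the number of accepted samples. The scaling constant therefore affects efficiency, not the shape of the result.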

Let us do rejection sampling in the same way as before.

# Assume theta grid [0,1]
theta_grid <- seq(0, 1, length.out = 1001)

# To cover the target we define a constant * q(theta) function, called
# nq_theta(). n * q(theta) must lie above the whole unnormalized posterior
# curve. Here q(theta) is a uniform distribution.
nq_theta <- function(y) {
    # The uniform density on [0, 1] is 1, so multiplying by constant_n gives a
    # flat envelope of height 0.35.
    constant_n <- 0.35  # since the peak of unnorm_posterior is less than 0.35
    return(constant_n * dunif(y, min = 0, max = 1))
}
unnorm_post_theta <- unnorm_post_density(theta_grid, n, k)
plot(theta_grid, unnorm_post_theta, type = "l", ylab = "density", xlab = expression(theta), 
    ylim = c(0, 0.4), main = "Red line is nq_theta's density")
points(theta_grid, nq_theta(theta_grid), type = "l", col = "red")

Let us fill the area under the nq_theta density curve with samples. To fill it, we simply multiply the density by a random number in [0,1].

number_sampling <- 50000
set.seed(1235)
sample_nq_theta <- sample(theta_grid, number_sampling, replace = T, prob = nq_theta(theta_grid))  # sample theta based on nq_theta (actually uniform)
sampling <- nq_theta(sample_nq_theta) * runif(number_sampling, min = 0, max = 1)

plot(sample_nq_theta, sampling, main = "Red line is nq_theta, blue line is unnorm_post", 
    xlab = expression(theta), cex = 0.3)
points(theta_grid, nq_theta(theta_grid), type = "l", col = "red", lwd = 3)
points(theta_grid, unnorm_post_theta, type = "l", col = "blue", lwd = 3)

Now let us discard the nq_theta samples whose height is above the unnormalized posterior (blue, target) curve.

final_sample <- sample_nq_theta[sampling <= unnorm_post_density(sample_nq_theta, 
    n, k)]  # keep only the samples whose height is at or below the unnormalized posterior

hist(final_sample, freq = F, breaks = 50, xlab = expression(theta), main = "Red: unnorm_post, Blue: norm_post")
points(theta_grid, unnorm_post_theta, type = "l", col = "red")
points(theta_grid, unnorm_post_theta/normalization_const, type = "l", col = "blue")

As the graph above shows, the rejection sampler recovers the normalized posterior distribution from the unnormalized density curve. We said that computing the normalization constant was easy here, but in general it is hard, which is why sampling methods such as MCMC are used.

There is in fact another way, which uses a conjugate prior. If you use a conjugate prior, you can work out the posterior distribution analytically.
For example, if you multiply a Beta distribution by a binomial likelihood, the result is proportional to another Beta distribution (with different \(\alpha\) and \(\beta\) values). In other words,

\[\text{Binomial}(k|n,\theta)\times \text{Beta}(\theta|\alpha,\beta)\propto\text{Beta}(\theta|\alpha+k,\beta+n-k)\]

plot(theta_grid, dbeta(theta_grid, 10, 4), type = "l", lwd = 10, col = "black", 
    main = "Black: Beta(10, 4), Grey: norm_posterior", ylab = "density", xlab = expression(theta * 
        "|" * y))
points(theta_grid, unnorm_post_theta/normalization_const, col = "grey", pch = 4, 
    cex = 0.2)
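The conjugacy can be seen by multiplying the two kernels and dropping every factor that does not depend on \(\theta\). This is a short derivation sketch that uses the standard Beta density \(\text{Pr}(\theta)\propto\theta^{\alpha-1}(1-\theta)^{\beta-1}\):

\[\text{Pr}(\theta|\textbf{y})\propto\underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}}\times\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}}=\theta^{(\alpha+k)-1}(1-\theta)^{(\beta+n-k)-1}\]

With \(\alpha=\beta=2\), \(n=10\) and \(k=8\) this is \(\theta^{9}(1-\theta)^{3}\), the kernel of \(\text{Beta}(10,4)\), which is why the plot uses dbeta(theta_grid, 10, 4).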

5. Analysis

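Let us go back to the analysis. We can no longer state a single value as the probability of heads. Our result is a distribution.
Because the prior influences the result, we cannot claim this is the truth. But your result will be less skewed toward the small number of flips you observed (provided your prior was reasonable).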

quantiles <- quantile(final_sample, c(0.025, 0.5, 0.975))
print(quantiles)

##     2.5%      50%    97.5% 
## 0.463000 0.724000 0.909475

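Let us compute quantiles to summarize the samples. From this result we can say "the probability that my coin lands heads is in [0.463, 0.9095] with 95% probability." This is usually called a credible interval, and it differs from the confidence interval of the frequentist approach. A confidence interval does not allow this interpretation (even if it was obtained by bootstrapping).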
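Because the conjugate result gives the posterior as Beta(10, 4), the same interval can also be read off analytically. This is a small sketch that uses base R's `qbeta`; the variable names are only illustrative.

```r
# Posterior parameters from the conjugate update: Beta(2 + k, 2 + n - k).
alpha_post <- 2 + 8
beta_post <- 2 + (10 - 8)

# Analytic 2.5%, 50% and 97.5% quantiles of the Beta(10, 4) posterior.
analytic_quantiles <- qbeta(c(0.025, 0.5, 0.975), alpha_post, beta_post)
names(analytic_quantiles) <- c("2.5%", "50%", "97.5%")
print(analytic_quantiles)
```

These values should be close to the quantiles of the rejection samples, with the remaining gap coming from sampling noise and the grid resolution.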

plot(theta_grid, unnorm_post_theta/normalization_const, type = "l", lwd = 1, 
    col = "black", main = "Between two lines are credible interval", ylab = "density", 
    xlab = expression(theta * "|" * y))
abline(v = c(quantiles[1], quantiles[3]), col = "red")

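I am wrapping up the results very briefly here, but in practice many more analyses are possible. Since we know the distribution of the parameter, we can simulate scenarios in which we obtain new data \(\tilde{y}\). This kind of simulation is the basis of what is called a posterior predictive check. Also, if you get more data, you can use the current result as the prior and keep updating it (and as data accumulate the result is increasingly dominated by the likelihood). Finally, the model can be extended to a hierarchical Bayesian model (or multilevel model), in which the prior is itself given another prior. This may not be intuitive, so take an example. Suppose 10 other people flipped your coin. The posteriors of the 10 people may differ. But since they all used the same coin, their results should be related. This idea can be written as follows.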

\[\textbf{y}_{\text{person}[i]}|\theta_{\text{person}[i]}\sim\text{Pr}(y|\theta,\alpha,\beta)\]
\[\theta_{\text{person}[i]}|\alpha,\beta\sim\text{Pr}(\theta|\alpha,\beta)\]
\[\alpha,\beta\sim\text{Pr}(\alpha,\beta)\]
To summarize, \[\text{Pr}(\theta,\alpha,\beta|\textbf{y})\propto\text{Pr}(\textbf{y}|\theta,\alpha,\beta)\text{Pr}(\theta|\alpha,\beta)\text{Pr}(\alpha,\beta)\]
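Going back to the simpler single-coin model, the two follow-up ideas mentioned in this section (simulating new data and updating with more data) can be sketched as below. This is only an illustration: it draws \(\theta\) directly from the conjugate Beta(10, 4) posterior with `rbeta` instead of reusing the rejection samples, and it treats the other person's 1 head and 9 tails from the first section as the new data.

```r
set.seed(1235)

# Posterior predictive simulation: draw theta from the posterior, then
# simulate 10 new flips for each draw.
theta_draws <- rbeta(10000, 10, 4)
heads_new <- rbinom(length(theta_draws), size = 10, prob = theta_draws)

# Predicted probability of seeing 0, 1, ..., 10 heads in 10 new flips.
pred_table <- table(factor(heads_new, levels = 0:10)) / length(heads_new)
print(round(pred_table, 3))

# Sequential update: the current posterior Beta(10, 4) becomes the prior,
# and the new data are 1 head and 9 tails.
alpha_updated <- 10 + 1
beta_updated <- 4 + 9
posterior_mean_updated <- alpha_updated / (alpha_updated + beta_updated)
print(posterior_mean_updated)
```

The predictive distribution is wider than a binomial with \(\theta\) fixed at 0.8, because it also carries the uncertainty about \(\theta\). The updated posterior mean is the same whether all 20 flips are used at once or in two steps.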

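The purpose of this post is to introduce Bayesian analysis through a simple example. The analysis is therefore limited, and there is much more that needs discussing. In any case, I hope the post helps you understand the concept.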