Stat 136 (Bayesian Statistics)

Lesson 3.1: Bayesian Inference for the Mean of the Normal Distribution

Author

Norberto E. Milla, Jr.

Published

April 11, 2023

Introduction: The Normal Distribution

  • Many random variables seem to follow the normal distribution, at least approximately

  • The normal distribution is a symmetric, bell-shaped distribution

  • It has two parameters: mean \mu and variance \sigma^2

  • If Y \sim N(\mu, \sigma^2), then f_Y(y)= \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y-\mu)^2}{2\sigma^2}}
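
  • As a quick check of this density formula, the short sketch below (using illustrative values not taken from the lesson) compares the formula written out by hand with R's built-in dnorm()

Code
# Quick numerical check of the normal density formula (illustrative values only)
y0 <- 1.2; mu0 <- 0; sigma0 <- 1.5
dens_formula <- (1/(sigma0*sqrt(2*pi))) * exp(-(y0 - mu0)^2/(2*sigma0^2))
dens_builtin <- dnorm(y0, mean = mu0, sd = sigma0)
c(dens_formula, dens_builtin)   # the two values agree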

Bayes inference for the normal mean using discrete prior

  • Suppose there are m possible values of \mu: \mu_1, \mu_2, \cdots, \mu_m

  • Choose a discrete prior probability distribution, f(\mu), which summarizes our prior belief about the parameter

  • Suppose a single observation Y \sim N(\mu, \sigma^2), with \sigma^2 known

  • For a single observation, the likelihood is given by

\begin{align} L(y|\mu) &= \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y-\mu)^2}{2\sigma^2}} \notag \\ &\propto e^{-\frac{(y-\mu)^2}{2\sigma^2}} \notag \\ &= \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right] \tag{1} \end{align}

  • For a random sample of observations y_1, y_2, \cdots, y_n, the joint likelihood is given by

\begin{align} L(y_1, y_2, \cdots, y_n|\mu) &= \prod_{i=1}^n L(y_i | \mu) \notag \\ &= \left(\frac{1}{\sigma \sqrt{2\pi}}\right)^n e^{-\frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^2}} \notag \\ &\propto \exp\left[-\frac{n(\overline{y}-\mu)^2}{2\sigma^2}\right] \tag{2} \\ &\propto L(\overline{y}| \mu, \sigma^2) \notag \end{align}
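
  • The proportionality in (2) can be verified numerically: the sketch below (using the data that will appear in Example 2 and \sigma = 1) shows that the full joint likelihood and the kernel based only on \bar{y} differ by a factor that does not depend on \mu

Code
# Numerical check of (2): the ratio of the full likelihood to the ybar-kernel
# is constant in mu
y <- c(3.2, 2.2, 3.6, 4.1)   # illustrative observations (see Example 2)
n <- length(y)
sigma <- 1
mu_grid <- seq(2, 4, by = 0.5)
full_lik <- sapply(mu_grid, function(m) prod(dnorm(y, m, sigma)))  # product of densities
kernel <- exp(-n*(mean(y) - mu_grid)^2/(2*sigma^2))                # kernel from (2)
full_lik/kernel   # constant in mu, confirming proportionality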

Example 1:

Suppose Y \sim N(\mu,1). Suppose further that there are only five possible values for \mu: 2.0, 2.5, 3.0, 3.5, and 4.0, each equally likely. Suppose we observe y = 3.2. Using (1), we obtain the likelihood of observing y = 3.2 for each value of \mu.

Code
mu <- c(2.0, 2.5, 3.0, 3.5, 4.0)          # possible values of mu
pr <- c(1/5, 1/5, 1/5, 1/5, 1/5)          # discrete uniform prior
lik <- round(exp(-(3.2 - mu)^2/2), 4)     # likelihood of y = 3.2 at each mu, from (1)
PrxL <- round(pr*lik, 4)                  # prior x likelihood
post <- round(PrxL/sum(PrxL), 4)          # normalize to obtain the posterior
knitr::kable(cbind(mu, pr, lik, PrxL, post),
             col.names = c("mu", "Prior", "Likelihood", "Prior x Likelihood", "Posterior"))
mu    Prior   Likelihood   Prior x Likelihood   Posterior
2.0   0.2     0.4868       0.0974               0.1239
2.5   0.2     0.7827       0.1565               0.1990
3.0   0.2     0.9802       0.1960               0.2493
3.5   0.2     0.9560       0.1912               0.2432
4.0   0.2     0.7261       0.1452               0.1847

Example 2:

  • Suppose we take a random sample of 4 observations from a normal distribution having mean \mu and variance \sigma^2 = 1

  • Let the observations be 3.2, 2.2, 3.6, and 4.1.

  • Consider the same values of \mu and the same discrete prior probabilities as in Example 1

  • The likelihood is obtained using (2)

Code
yobs <- c(3.2, 2.2, 3.6, 4.1)                      # observed sample
n <- length(yobs)
# mu and pr are as defined in Example 1
lik2 <- round(exp(-n*(mean(yobs) - mu)^2/2), 4)    # likelihood from (2), with sigma^2 = 1
PrxL2 <- round(pr*lik2, 4)                         # prior x likelihood
post2 <- round(PrxL2/sum(PrxL2), 4)                # posterior
table2 <- as.data.frame(cbind(mu, pr, lik2, PrxL2, post2))
knitr::kable(table2,
             col.names = c("mu", "Prior", "Likelihood", "Prior x Likelihood", "Posterior"))
mu    Prior   Likelihood   Prior x Likelihood   Posterior
2.0   0.2     0.0387       0.0077               0.0157
2.5   0.2     0.3008       0.0602               0.1228
3.0   0.2     0.8596       0.1719               0.3505
3.5   0.2     0.9037       0.1807               0.3685
4.0   0.2     0.3495       0.0699               0.1425
  • Posterior summaries are computed as before
Code
library(dplyr)                                       # for the pipe, filter(), and select()
postmean <- round(sum(mu*post2), 4)                  # posterior mean
postvar <- round(sum(mu^2*post2) - postmean^2, 4)    # posterior variance
pmulessthan3 <- table2 %>% 
  filter(mu < 3) %>% 
  select(post2) %>% 
  sum()                                              # P(mu < 3)
knitr::kable(cbind(postmean, postvar, pmulessthan3),
             col.names = c("Posterior Mean", "Posterior variance", "P(mu < 3)"))
Posterior Mean   Posterior variance   P(mu < 3)
3.2496           0.219                0.1385
  • Bayesian inferences such as hypothesis tests and credible interval construction are based on the posterior distribution

Bayes inference for the normal mean using continuous prior

  • Suppose we have a random sample Y_1, Y_2, \cdots , Y_n from N(\mu, \sigma^2), where \sigma^2 is known

  • Some possible prior distributions are:

    • Jeffreys’ uniform prior: a flat (non-informative) prior, which is also an improper prior, f(\mu) = 1

    • Normal prior: a conjugate prior (Normal-Normal conjugate pair)

Jeffreys’ uniform prior for the normal mean

  • Prior: f(\mu) = 1

  • Likelihood: see (2)

  • Posterior: proportional to the likelihood

    • the posterior distribution is N(\bar{y}, \frac{\sigma^2}{n})

Normal prior density for the normal mean

  • Suppose we consider a normal density as the prior for \mu, say N(\mu_p, \sigma_p^2)

  • Hence, f(\mu) \propto \exp \left[ - \frac{(\mu-\mu_p)^2}{2\sigma_p^2} \right]

  • The posterior distribution of \mu is \begin{align} f(\mu| \bar{y}) &\propto f(\mu)L(\bar{y}|\mu, \sigma^2) \notag \\ &\propto \exp \left[ - \frac{(\mu-\mu_p)^2}{2\sigma_p^2} \right] \exp \left[- \frac{(\bar{y}-\mu)^2}{2(\frac{\sigma^2}{n})} \right] \notag \\ &\vdots \notag \\ &\propto \exp \left[ -\frac{1}{2 \left(\frac{\sigma_p^2 \sigma^2}{\sigma^2 + n \sigma_p^2} \right)} \left(\mu - \frac{\sigma^2 \mu_p + n \sigma_p^2 \bar{y}}{\sigma^2 + n \sigma_p^2} \right)^2 \right] \notag \end{align}

  • This means that \mu is distributed as N(\mu^*, \sigma^{2*}), where:

\mu^* = \frac{\sigma^2 \mu_p + n \sigma_p^2 \bar{y}}{\sigma^2 + n \sigma_p^2} and

\sigma^{2*} = \frac{\sigma_p^2 \sigma^2}{\sigma^2 + n \sigma_p^2} \implies \frac{1}{\sigma^{2*}} = \frac{n}{\sigma^2} + \frac{1}{\sigma_p^2}

  • Alternatively, we can also express the posterior mean as

\mu^* = \left(\frac{\frac{1}{\sigma_p^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_p^2}} \right)\mu_p + \left(\frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2} + \frac{1}{\sigma_p^2}} \right)\bar{y}

  • Thus, the posterior mean \mu^* is a weighted average of the prior mean \mu_p and \bar{y}, where the weights are the prior precision and the data precision, each expressed as a proportion of the posterior precision (see the R sketch after this list)

  • When \sigma is unknown, we estimate it using the sample standard deviation, s, and recalculate \mu^* and \sigma^{2*} with \sigma replaced by s
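
  • The update formulas above translate directly into a few lines of R; the sketch below is a minimal illustration (normal_posterior() is a hypothetical helper name, not a function from any package), assuming the data variance sigma2 is known

Code
# Minimal sketch of the Normal-Normal update formulas above
normal_posterior <- function(ybar, n, sigma2, mu_p, sigma2_p) {
  post_var  <- (sigma2_p * sigma2) / (sigma2 + n * sigma2_p)
  post_mean <- (sigma2 * mu_p + n * sigma2_p * ybar) / (sigma2 + n * sigma2_p)
  # equivalently, in precision form: 1/post_var = n/sigma2 + 1/sigma2_p
  list(mean = post_mean, var = post_var)
}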

Example 3:

Arnie and Marie are going to estimate the mean length of tilapia in a fishpond. Previous studies in other fishponds have shown the length of tilapia to be normally distributed with a known standard deviation of 2 cm. Arnie decides his prior mean is 30 cm. He does not believe it is possible for a tilapia to be less than 18 cm or greater than 42 cm, so his prior standard deviation is approximately 4 cm. Thus, he will use a N(30, 16) prior. On the other hand, Marie does not know anything about tilapia, so she decides to use Jeffreys’ prior.

  • Arnie: f(\mu) \propto \exp \left[-\frac{(\mu-30)^2}{2(16)} \right]

  • Marie: f(\mu) =1

  • They take a random sample of 12 tilapia from the pond and calculate the sample mean to be \bar{y}=32 cm

  • The likelihood: y_i \sim N(\mu, 2^2) \implies \bar{y} \sim N(\mu, \frac{2^2}{12})

  • For Marie, since she used Jeffreys’ flat prior, the posterior distribution is proportional to the likelihood, that is, \mu|\bar{y} \sim N(\mu^*, \sigma^{2*}), where

\mu^* = \bar{y}=32

and

\sigma^{2*} = \frac{\sigma^2}{n} = \frac{2^2}{12} = \frac{1}{3}

  • For Arnie, the posterior distribution is also N(\mu^*, \sigma^{2*}), where the posterior mean and variance are, respectively,

\mu^* = \frac{\sigma^2 \mu_p + n \sigma_p^2 \bar{y}}{\sigma^2 + n \sigma_p^2} = \frac{2^2(30)+12(16)(32)}{2^2+12(16)} \approx 31.96

and

\sigma^{2*} = \frac{\sigma^2 \sigma_p^2}{\sigma^2 + n \sigma_p^2} = \frac{2^2(16)}{2^2+12(16)} \approx 0.3265
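
  • The hand calculations above can be reproduced in a few lines of base R; the sketch below is just an arithmetic check using the quantities from the example

Code
# Checking the posterior calculations (sigma^2 = 4, n = 12, ybar = 32)
sigma2 <- 4; n <- 12; ybar <- 32
c(ybar, sigma2/n)                          # Marie: posterior mean and variance
mu_p <- 30; sigma2_p <- 16                 # Arnie's N(30, 16) prior
c((sigma2*mu_p + n*sigma2_p*ybar)/(sigma2 + n*sigma2_p),   # approx 31.96
  (sigma2_p*sigma2)/(sigma2 + n*sigma2_p))                 # approx 0.3265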

  • There are a few functions in the bayesrules package which can facilitate the above calculations

    • For example, the plot_normal_normal() and summarize_normal_normal() functions generate the distribution plot and summary statistics of the posterior, respectively
Code
plot_normal_normal(mean = 30, sd = 4, sigma = 2,
                   y_bar = 32, n = 12)

Code
summarize_normal_normal(mean = 30, sd = 4, sigma = 2,
                   y_bar = 32, n = 12)
      model     mean     mode        var        sd
1     prior 30.00000 30.00000 16.0000000 4.0000000
2 posterior 31.95918 31.95918  0.3265306 0.5714286

Choosing a normal prior

  • How does one, in practice, choose a Normal prior for \mu that reflects prior beliefs about the location of this parameter?

  • One indirect strategy for selecting values of the prior parameters \mu_p and \sigma_p^2 is based on the specification of quantiles

  • On the basis of one’s prior beliefs, one specifies two quantiles of the Normal density

  • Then, the Normal parameters are found by matching these two quantiles to a particular Normal curve

  • The matching is performed by the R function normal.select() in the LearnBayes package

  • The two quantiles are input via list() statements, and the output is the mean and standard deviation of the Normal prior

  • For example, suppose P_{50} = 30 and P_{90} = 32

Code
normal.select(list(p=0.5,x=30),list(p=0.9,x=32))
$mu
[1] 30

$sigma
[1] 1.560608
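
  • The answer can be checked directly: the median fixes the prior mean, and the 90th percentile then determines the prior standard deviation; the one-line check below is just this arithmetic

Code
# Direct check of the quantile matching: sigma = (P90 - P50)/z_0.90
(32 - 30)/qnorm(0.90)   # approximately 1.5606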
  • In practice we perform several checks to see if this Normal prior makes sense

  • Several functions are available to help in this prior checking

Code
qnorm(0.25, 30, 1.56)   # 25th percentile of the proposed N(30, 1.56^2) prior
[1] 28.9478
  • Bayes’ theorem allows us to update our prior information once data become available

  • Without going through the calculations demonstrated earlier, we can perform the updating with the R function normal_update() in the ProbBayes package

  • One inputs two vectors: prior is a vector of the prior mean and standard deviation and data is a vector of the sample mean and standard error

  • The output is a vector of the posterior mean and posterior standard deviation

Code
prior <- c(30, 4)       # prior mean and standard deviation
data <- c(32, 0.58)     # sample mean and standard error (2/sqrt(12) is approximately 0.58)
normal_update(prior, data)
[1] 31.9588159  0.5739972
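
  • The same answer follows from the precision-weighting form of the posterior derived earlier, with the data precision equal to n/\sigma^2 = 1/SE^2; the short check below uses the same inputs as normal_update()

Code
# Manual check of normal_update() via precision weighting
prior_prec <- 1/4^2            # prior precision, 1/sigma_p^2
data_prec  <- 1/0.58^2         # data precision, n/sigma^2 = 1/SE^2
post_var   <- 1/(prior_prec + data_prec)
post_mean  <- post_var*(prior_prec*30 + data_prec*32)
c(post_mean, sqrt(post_var))   # matches the output above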

Bayesian credible interval for the normal mean

  • Recall that the posterior distribution for \mu is normal with mean \mu^* and standard deviation \sigma^*

  • A (1-\alpha) \times100% credible interval for \mu is given by

\mu^* \pm z_{\frac{\alpha}{2}}\sigma^*

  • When \sigma is unknown, we estimate it using the sample standard deviation (s) and the (1 - \alpha) \times 100% credible interval for \mu is given by

\mu^* \pm t_{\frac{\alpha}{2}, n-1}\,\sigma^*
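
  • A minimal R sketch of the z-based interval is given below; cred_int() is a hypothetical helper, not a function from any package

Code
# z-based credible interval: posterior mean +/- z * posterior sd
cred_int <- function(post_mean, post_sd, level = 0.95) {
  z <- qnorm(1 - (1 - level)/2)
  post_mean + c(-1, 1)*z*post_sd
}
# e.g. cred_int(32, 0.5774) reproduces Marie's interval in the example below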

Example 3 (continued):

  • For Marie, her 95% credible interval for \mu is 32 \pm 1.96(0.5774) \implies (30.87,33.13)

    • There is a 95% probability that (30.87,33.13) contains \mu
  • For Arnie, his 95% credible interval for \mu is 31.96 \pm 1.96(0.5714) \implies (30.84,33.08)

    • There is a 95% probability that (30.84, 33.08) contains \mu
  • Simulation-based:

Code
m <- rnorm(100000, 32, 0.5774)
a <- rnorm(100000, 31.96, 0.5714)
quantile(m,c(0.025,0.975)) #Marie
    2.5%    97.5% 
30.86534 33.13507 
Code
quantile(a,c(0.025,0.975)) #Arnie
    2.5%    97.5% 
30.83543 33.07889 

Bayesian one-sided hypothesis test about \mu

  • The posterior distribution f(\mu|\bar{y}) summarizes our entire belief about the parameter, after viewing the data

  • Suppose we want to test H_0 : \mu \leq \mu_0 against H_1 : \mu > \mu_0

  • Testing a one-sided hypothesis in Bayesian statistics is done by calculating the posterior probability of the null hypothesis, P(H_0|\mu^*, \sigma^{2*})

\begin{align} P(H_0 : \mu \leq \mu_0 | \mu^*, \sigma^{2*}) &= \int_{-\infty}^{\mu_0} f(\mu|\bar{y}) d\mu \notag \\ &= P\left( Z \leq \frac{\mu_0 - \mu^*}{\sigma^*} \right) \notag \end{align}

  • If this probability is less than the \alpha-level of significance, H_0 is rejected
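
  • In R this posterior probability is a single normal CDF evaluation; the sketch below is a minimal illustration (post_prob_H0() is a hypothetical helper name), assuming a N(\mu^*, \sigma^{2*}) posterior

Code
# Posterior probability of H0: mu <= mu0
post_prob_H0 <- function(mu0, mu_star, sd_star) {
  pnorm((mu0 - mu_star)/sd_star)   # same as pnorm(mu0, mu_star, sd_star)
}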

Example 3 (continued):

  • For example, Marie wants to test H_0: \mu \leq 31 versus H_1: \mu > 31

\begin{align} P(\mu \leq 31) &= P \left( Z \leq \frac{31-32}{0.5774} \right) \notag \\ &=P(Z \leq -1.73) \notag \\ &\approx 0.0418 \notag \end{align}

  • Thus, we reject H_0 at \alpha = 0.05

  • Simulation-based:

Code
samples <- 100000
m <- rnorm(samples,32, 0.5774)
sum(m<=31)/samples
[1] 0.04186
  • Let us test H_0: \mu \leq 31 versus H_1: \mu > 31 based on the posterior distribution of Arnie

\begin{align} P(\mu \leq 31) &= P \left( Z \leq \frac{31-31.96}{0.5714} \right) \notag \\ &=P(Z \leq -1.68) \notag \\ &\approx 0.0465 \notag \end{align}

  • Likewise, we reject H_0 at \alpha = 0.05

  • Simulation-based:

Code
samples <- 100000
m <- rnorm(samples,31.96, 0.5714)
sum(m<=31)/samples
[1] 0.04531

Bayesian two-sided hypothesis test about \mu

  • Suppose we want to test H_0: \mu = \mu_0 against H_1: \mu \neq \mu_0

  • Recall that the posterior is continuous; since the probability that a continuous random variable equals any specific value is 0, we cannot base the test on P(\mu = \mu_0) directly

  • Construct the (1 - \alpha) \times 100% credible interval for \mu

  • If \mu_0 is contained in the credible interval, we do not reject H_0 and conclude that \mu_0 still has credibility as a possible value for \mu

  • For example, suppose we test H_0: \mu = 35 against H_1: \mu \neq 35 using Arnie’s posterior distribution at \alpha = 0.05

  • Recall that the 95% credible interval for \mu based on Arnie’s posterior distribution is (30.84, 33.08), which does not include \mu=35; hence, we reject H_0
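
  • This check is easy to script; the sketch below simply recomputes Arnie’s interval and tests whether \mu_0 = 35 falls inside it

Code
# Two-sided test via the credible interval
ci <- qnorm(c(0.025, 0.975), 31.96, 0.5714)   # approximately (30.84, 33.08)
mu0 <- 35
mu0 >= ci[1] & mu0 <= ci[2]                   # FALSE, so H_0: mu = 35 is rejected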

Bayesian prediction

  • Suppose we wanted to predict the length of a tilapia

  • Two-step procedure:

    • Sample a value of \mu from its posterior distribution
    • Sample a new observation \widetilde{Y} from the data model (i.e. a prediction)
Code
mu_sim <- rnorm(100000, 31.96, 0.5714)   # step 1: draw mu from Arnie's posterior
y_sim <- rnorm(100000, mu_sim, 2)        # step 2: draw a new length for each drawn mu
round(mean(y_sim), 0)
[1] 32
  • Based on this Bayesian model, the expected length of a new tilapia is about 32 cm.
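
  • As a follow-up check (not part of the original notes), the spread of the simulated predictions combines the posterior uncertainty about \mu with the sampling variability of an individual fish, so the predictive variance is roughly \sigma^{2*} + \sigma^2

Code
# Predictive spread: posterior uncertainty about mu plus individual variability
sd(y_sim)                # simulation-based predictive standard deviation
sqrt(0.5714^2 + 2^2)     # approximately 2.08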