In this publication I want to show you how to perform Bayesian Analysis in R. We are going to use the example 3.1 from this book: Introducción a la estadÃstica Bayesiana. I hope you like soccer because this example is about the number of goals in Colombian professional soccer tournament.
First, let’s think that the number of goals that are scored in a Colombian professional soccer match come from a distribution \(Poisson(\theta)\), where \(\theta\) is the average number of goals that are scored. Now suposse that we use the opinion of a soccer expert to elicit the average number of goals that are scored in a soccer match, that is, parameter \(\theta\), and we obtain: \(\theta \sim N(2.5, (0.2)^{2})\).
curve(dnorm(x, 2.5, 0.2), from = -2, to = 8,
ylab="Density", main="Normal Prior Distribution",
xlab=expression(theta), col="mediumorchid4", las=1, ylim = c(0, 2), lwd = 2)
In this case, we want to know what the posterior distribution of \(\theta\) looks like and what is the mean of this distribution. In order to do that, we are going to work with three scenarios:
Here, I want to show you how the Bayesian Analysis works. First we have a likelihood function from a popultion with a Poisson distribution with unknown parameter \(\theta\):
\[
L(\theta)=p(x|\theta)=\prod_{i=1}^{n}\frac{\exp(-\theta)\times\theta^{x}}{x!}
\]
As I mentioned before, after an elicitation process, we know that prior distribution \(p(\theta)\) of the parameter \(\theta\) is given by:
\[ p(\theta)=\frac{1}{\sqrt{2\pi(0.2)^2}}\exp\left(\frac{-(\theta-2.5)^{2}}{2(0.2)^{2}}\right) \]
Finally, the posterior distribution of \(\theta\) is:
\[
p(\theta|x)=\frac{p(x|\theta)p(\theta)}{C},
\]
where the constant \(C\) is calculated as follows:
\[
C=\int_{-\infty}^\infty p(x|\theta)p(\theta)\,
\mathrm{d}\theta
\]
And the mean of the posterior distribution \(E(\theta|x)\) is given by:
\[ E(\theta|x)=\frac{\int_{-\infty}^\infty \theta p(x|\theta)p(\theta)\, \mathrm{d}\theta}{C} \]
Here you are going to learn how to use Monte Carlo simulation in R to answer the concerns presented above. For the three scenarios you will follow the next steps:
First, you need to define the data depending on the scenario:
x <- 1 #First scenario
Now, you will calculate the integral using Monte Carlo simulation. For this, it is necessary to generate \(N=10000\) values \(\theta_i\) from the prior distribution and evaluate them in the likelihood function \(p(x|\theta_i)\). Finally, to obtain C these values are averaged. The code in R is as follows:
N <- 100000 # number of simulated values
thetas <- rnorm(n=N, mean = 2.5, sd = 0.2) #prior distribution
fvero <- function(theta) prod(dpois(x=x, lambda = theta)) #likelihood function
fvero <- Vectorize(fvero)
C <- mean(fvero(thetas))
C
## [1] 0.2060313
After calculating C, you can obtain the posterior distribution as follows:
posterior <- function(theta) {
fvero(theta) * dnorm(x=theta, 2.5, 0.2) / C
}
Finally you can obtain the mean of the posterior distribution using Monte Carlo simulation to calculate the integral:
aux <- thetas * fvero(thetas)
integral <- mean(aux)
mposterior <- integral/C
mposterior
## [1] 2.476012
As mentioned before, the code presented above is used in all three scenarios and the only thing that changes depending on the scenario is \(x\). In this section, we will present a graph for each scenario that contains the prior and the posterior distribution of \(\theta\), the mean of the posterior distribution (blue dotted line) and the observations (pink dots).
From the results we can conclude that when we have few observations as in scenario 1 and 2, the posterior distribution will tend to resemble the prior distribution due to the lack of sample evidence. On the contrary, when we have a large number of observations as in scenario 3, the posterior distribution will move away from the prior distribution since the data will have a greater influence.