METROPOLIS ALGORITHM APPLIED TO A NORMAL DISTRIBUTION ASSUMING UNKNOWN MEAN AND VARIANCE

Professor: Dr. Joseph Skufca

By Cássio Rampinelli

October 21st, 2018

Note: This script was written in the R programming language.

__________________________________________________________________________________________________________________________

INTRODUCTION

The goal of this script is to demonstrate how the Metropolis (M) algorithm works by applying it to a case where our posterior distribution (or target distribution) is actually known. In this example we use a normal distribution with mean 3 and variance 1: \(N(\mu=3,\ \sigma^2=1)\).

 

We will sample from the Bayesian posterior distribution and check whether our sample really approximates the original normal distribution with \(\mu=3\) and \(\sigma^2=1\), which is our known target distribution.

 

For our example, let's pretend that, although we don't know the mean and variance of this distribution, we have a sample of \(n\) elements from it, recorded in a vector named \(y\): \(y_1, y_2, \ldots, y_n\).

 

In R we can draw a sample from a normal distribution by running the code below. (For this example, let's consider that we have 20 samples from this distribution.)

y=rnorm(20, mean=3, sd=1)
y
##  [1] 2.633200 4.225387 3.350086 2.784806 2.361114 2.885058 3.707478
##  [8] 2.505103 1.303057 3.296337 2.658003 3.949087 4.271872 3.280822
## [15] 3.753841 3.971929 1.996949 3.391142 1.459014 3.976890
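As a side note, the draw above used no fixed random seed, so re-running the script produces a different sample and slightly different results throughout. A minimal sketch for a reproducible draw (the seed value 123 is an arbitrary illustration, not from the original script):

#Fix the random number generator state so the same 20 values are drawn every run
set.seed(123)
y=rnorm(20, mean=3, sd=1)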

 

WHAT ARE WE SEEKING?

We are seeking the joint posterior distribution of the unknown parameters \(\mu\) and \(\sigma^2\) of the normal distribution that we assume describes the available data. Writing Bayes' rule, we have the following expression:

\[ p(\mu,\sigma^2|y) = \frac{p(y|\mu,\sigma^2)\cdot p(\mu,\sigma^2)}{\int\int p(y|\mu,\sigma^2)\cdot p(\mu,\sigma^2)\cdot d\mu \cdot d\sigma^2} \]  

Omitting the denominator, which is a constant, and using the proportionality symbol:

\[ p(\mu,\sigma^2|y) \propto p(y|\mu,\sigma^2)\cdot p(\mu,\sigma^2) \]

We can also examine the marginal posterior distribution of each unknown parameter instead of representing the joint posterior.

\[ p(\mu|y) = Marginal\ posterior\ of\ \mu \] \[ p(\sigma^2|y) = Marginal\ posterior\ of\ \sigma^2 \]

 

where

\[ p(\mu,\sigma^2|y) = Joint \ posterior\ distribution \] \[ p(y|\mu,\sigma^2)= Likelihood \] \[ p(\mu,\sigma^2)= Prior \]  

Now let’s specify each component of that expression.

 

Note: When applying MCMC (Markov chain Monte Carlo) methods to obtain the target distribution, we do not need to worry much about finding conjugate priors, because we do not need a closed form for the posterior: it will be computed numerically.  
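As a preview of that numerical view, for a two-parameter problem like this one we can simply evaluate the unnormalized posterior on a grid and draw its contours. A minimal sketch, assuming the data vector y created above (with the flat priors used below, the log prior only adds a constant inside its support):

#Unnormalized log posterior of (mu, sigma) on a grid; flat priors only shift it by a constant
mu.grid=seq(from=1,to=5,length.out=100)
sigma.grid=seq(from=0.5,to=3,length.out=100)
logpost=outer(mu.grid,sigma.grid,
              Vectorize(function(m,s) sum(dnorm(y,mean=m,sd=s,log=TRUE))))

#Rescale before exponentiating to avoid underflow, then draw the contours
contour(mu.grid,sigma.grid,exp(logpost-max(logpost)),
        xlab=expression(mu),ylab=expression(sigma))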

PRIORS

Assuming that \(\mu\) and \(\sigma^2\) are independent, we can rewrite the joint prior as the product of two independent priors:

\[ p(\mu,\sigma^2)= p(\mu)\cdot p(\sigma^2) \]  

Let's first assume simple uniform distributions as priors for both parameters \(\mu\) and \(\sigma^2\), considering some reasonable intervals. Let's check the expressions for each one of them.

 

\[ p(\mu)= Unif.(-50,50) \]

\[ p(\sigma^2)= Unif.(0,100) \]

Now let's create an R function to compute the prior densities for \(\mu\) and \(\sigma^2\). As will be demonstrated for the likelihood expression later, the density values become very small, so it is convenient to work with logarithms. The functions below are therefore set to return the log of the density.

 

#Log prior density for mu: Uniform(-50, 50)
priorMu <- function(mu){
  
  return(dunif(mu,min=-50,max=50,log=TRUE))
  
}

#Log prior density for sigma^2: Uniform(0, 100)
priorSigma2 <- function(sigma2){
  
  return(dunif(sigma2,min=0,max=100,log=TRUE))
  
}

 

And take a look at both (log) density plots:

#Create a window for 2 plots
par(mfrow=c(1,2))

mus=seq(from=-50,to=50,by=0.1)
sigmas2=seq(from=0,to=100,by=0.1)

#Mu log prior density
plot(mus,priorMu(mus),type='l',col="red",xlab=expression(mu),ylab="Log-density")

#Sigma2 log prior density
plot(sigmas2,priorSigma2(sigmas2),type='l',col="red",xlab=expression(sigma^2),ylab="Log-density")

 

LIKELIHOOD

The likelihood of the n observed data points is given by the following equation:

\[ p(y|\mu,\sigma^2)=\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(y_i-\mu)^2}{2\sigma^2}} \]  

Let's check how the likelihood behaves for our data. For instance, let's assume a variance of \(\sigma^2=4\) and vary \(\mu\) over the interval from -5 to 10. The code below shows how to compute the product in the likelihood expression for the data and plot the results.

#Create the likelihood function: product of normal densities over the data
likelihood<-function(data,Mu,Sigma){
  
  likelihoodValue=prod(dnorm(data,mean=Mu,sd=Sigma))
  
  return(likelihoodValue)
  
}
 
#Applying the likelihood function to the data over a range of
#possible values of the mean
mus=seq(from=-5,to=10,by=0.1)
Sd=sqrt(4)
data=y
likelihoodVector=numeric(length(mus))

for(i in 1:length(mus)){
  
  likelihoodVector[i]=likelihood(data=y,Mu=mus[i],Sigma=Sd)
}

#Plotting the results
plot(mus,likelihoodVector,type='l',col="red",xlab=expression(mu),ylab="Likelihood")

 

The likelihood plot shows that the magnitude of the values is very small. To avoid numerical problems when storing extremely small values, we will work with the natural logarithm (log) of the values. So let's rewrite the product in the normal likelihood as a sum of terms, set the normal density function to return the log of the values, and check the plot again.

#Create the log-likelihood function: sum of log normal densities over the data
loglikelihood<-function(data,Mu,Sigma){
  
  loglikelihoodValue=sum(dnorm(data,mean=Mu,sd=Sigma,log=TRUE))
  
  return(loglikelihoodValue)
  
}
 
#Applying the log-likelihood function to the data over a range of
#possible values of the mean
mus=seq(from=-5,to=10,by=0.1)
Sd=sqrt(4)
data=y
loglikelihoodVector=numeric(length(mus))

for(i in 1:length(mus)){
  
  loglikelihoodVector[i]=loglikelihood(data=y,Mu=mus[i],Sigma=Sd)
}

#Plotting the results
plot(mus,loglikelihoodVector,type='l',col="red",xlab=expression(mu),ylab="Log-Likelihood")
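As a quick check on the vectors just computed, the grid value of \(\mu\) that maximizes the log-likelihood should land close to the sample mean of the data:

#The maximizing mu on the grid should be close to the sample mean
mus[which.max(loglikelihoodVector)]
mean(y)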

 

POSTERIOR AND THE METROPOLIS ALGORITHM (M)

Now we are ready to apply the Metropolis algorithm so that we can sample from the joint posterior distribution (or the marginal posterior distributions) of the parameters \(\mu\) and \(\sigma^2\). Figure 1 below gives a general overview of the idea behind the MCMC method.

 

Figure 1: General overview of the Bayesian approach

 

To simplify the notation, instead of referring to the variance we will refer to the standard deviation. Basically, we make proposal values for our unknown parameters (we use the index p to refer to the proposal values: \(\mu^p\) and \(\sigma^p\)), and those values may or may not be accepted, based on an acceptance ratio (r) that is computed at each step (i). We use the notation \(\theta^i=(\mu^i, \sigma^i)\) for the vector that records the parameters kept at each step of the iteration: either the proposal values (in case they are accepted) or the parameters from the previous step. The algorithm is described as follows:

 

a) Initialize i=1 and set the number of iterations;

b) Set initial values for the parameter vector: \(\theta^0=(\mu^0,\sigma^0)\);

c) Sample proposal values \(\mu^p\) and \(\sigma^p\) from a proposal distribution;

d) Calculate the ratio (r) of the posterior densities;

\[ r=\frac{p(\mu^p,\sigma^p|y)}{p(\mu^{i-1},\sigma^{i-1}|y)}=\frac{p(y|\mu^p,\sigma^p)p(\mu^p)p(\sigma^p)}{p(y|\mu^{i-1},\sigma^{i-1})p(\mu^{i-1})p(\sigma^{i-1})} \]  

If we consider the log of the ratio:

\[ log(r)=log\Bigg(\frac{p(\mu^p,\sigma^p|y)}{p(\mu^{i-1},\sigma^{i-1}|y)}\Bigg)=log\bigg(p(y|\mu^p,\sigma^p)p(\mu^p)p(\sigma^p)\bigg)-log\bigg(p(y|\mu^{i-1},\sigma^{i-1})p(\mu^{i-1})p(\sigma^{i-1}) \bigg) \]

\[ log(r)=\Bigg(log\bigg(p(y|\mu^p,\sigma^p)\bigg)+log\bigg(p(\mu^p)\bigg)+log\bigg(p(\sigma^p)\bigg)\Bigg)-\Bigg(log\bigg(p(y|\mu^{i-1},\sigma^{i-1})\bigg)+log\bigg(p(\mu^{i-1})\bigg)+log\bigg(p(\sigma^{i-1})\bigg)\Bigg) \]  

To make the code clearer, we create a function that computes the log of the joint posterior, as it appears in both the numerator and the denominator of the last equation. To compute the numerator of the ratio we pass the proposal values \((\mu^{p},\ \sigma^{p})\) as arguments; to compute the denominator we pass the parameters of the previous step \((\mu^{i-1},\ \sigma^{i-1})\).

############################
##Joint
###########################
##Logarithm of the joint distribution.
##Note: the function receives the standard deviation (Sigma) and evaluates the
##variance prior at Sigma^2; the data vector y is read from the global environment.

joint <- function(Mu,Sigma){
  
  data=y
  return (loglikelihood(data,Mu,Sigma) + priorMu(Mu)+priorSigma2(Sigma^2))
}
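A quick sanity check of this function (assuming the data vector y from above): the log joint evaluated near the true parameters should exceed its value far from them.

joint(3,1) #near the true parameters: larger log joint
joint(0,1) #far from the true mean: much smaller log joint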

 

e) Sample \(u\) from a uniform distribution;

\[ u \sim Unif.(0,1) \]  

f) If \(u < min(1,r)\), then:

\[ \theta^{i}=\theta^{p} \] Otherwise:

\[ \theta^{i}=\theta^{i-1} \]

g) Set i=i+1 (then return to step c and repeat the process until the number of iterations is reached).

 

For the proposal value of \(\mu\) we will use a normal distribution with mean equal to the value of \(\mu\) at the previous step of the chain and standard deviation 1.6, and for the proposal value of the standard deviation \(\sigma\) we will use its previous value in the chain plus a random draw from a uniform distribution between -1/2 and 1/2. Both proposals are symmetric, which is what the plain Metropolis ratio (with no Hastings correction) requires.

 

Now we present the Metropolis algorithm as a function in R. Comments are provided along the code to make each step easier to follow.

#Defining the Metropolis algorithm as a function.
#The arguments are the initial values for mu and sigma, passed as a
#vector (startvalue), and the total number of iterations.
run_Metropolis <- function(startvalue, iterations){
  
  #Matrix to save the 2 parameters of the chain:
  #number of rows = number of iterations + 1; number of columns = number of parameters (2).
  chain = array(dim = c(iterations+1,2))
  
  #Initialize the first row of the chain matrix with the initial parameter values.
  chain[1,] = startvalue
  
  #Variables that count the number of accepted and rejected proposals
  naceito=0
  nrejeitado=0
  
  #Loop with the Metropolis algorithm
  for (i in 1:iterations){
    
    #Proposal value for Mu: a sample from a normal distribution with mean equal
    #to the Mu value at the previous step of the chain and standard deviation 1.6.
    proposalMu=rnorm(1,mean= chain[i,1],sd=1.6)
    
    #Proposal value for Sigma: the previous value of sigma in the chain plus a
    #sample from a uniform distribution on the range -0.5 to 0.5.
    proposalSigma=chain[i,2]+runif(1,min=-0.5,max=0.5)
    
    #Log of the acceptance ratio for the Metropolis algorithm
    razao=joint(proposalMu,proposalSigma)-joint(chain[i,1],chain[i,2])
    
    #Error treatment: if the ratio is not a number (e.g., for a negative sigma
    #proposal), keep the current values for the next step of the chain.
    if (is.nan(exp(razao))){
      
      chain[i+1,]=chain[i,]
      nrejeitado=nrejeitado+1
      
    #Draw a number between 0 and 1 and compare it with the acceptance ratio.
    #Since we used the log of the ratio, we exponentiate to recover the real scale.
    } else if(runif(1,0,1)<min(1,exp(razao))){
      
      chain[i+1,]=c(proposalMu,proposalSigma)
      
      #Counting the acceptance
      naceito=naceito+1
      
    } else {
      
      #Rejecting the proposal and keeping the current parameter values
      #for the next step of the chain.
      chain[i+1,] = chain[i,]
      nrejeitado=nrejeitado+1
      
    }
  }
  
  #Creates and returns a list with the chain and the numbers of accepted and rejected proposals
  lista=list(chain,naceito,nrejeitado)
  
  return(lista)
}

 

RUNNING THE METROPOLIS ALGORITHM

 

The code below runs the Metropolis algorithm and reports the acceptance rate. Comments are provided in the code lines, and the graphs are shown in sequence.

#Running the model
#Setting the number of iterations
iterations=20000
#Setting the initial values for mu and sigma
startvalue=c(2,6)

#The data vector y sampled at the beginning of the script is used here
#(the joint function reads it from the global environment)

#Calling the Metropolis function and saving the output in the resultsM variable
resultsM=run_Metropolis(startvalue,iterations)

#Computing the acceptance rate: accepted proposals divided by total iterations
acptRate=resultsM[[2]]/(resultsM[[2]]+resultsM[[3]])
acptRate
## [1] 0.09455
#Recording the chain for mu and sigma
chainM=resultsM[[1]]
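An acceptance rate near 10% is on the low side for a random-walk Metropolis sampler; rules of thumb often target roughly 20-40%. As a side experiment (the name run_Metropolis_scaled and the arguments sdMu and widthSigma are illustrative assumptions, not part of the original script), a copy of the sampler with the proposal scales exposed as arguments makes this easy to explore:

#A hypothetical variant of the sampler with user-chosen proposal scales
run_Metropolis_scaled <- function(startvalue, iterations, sdMu, widthSigma){
  
  chain = array(dim = c(iterations+1,2))
  chain[1,] = startvalue
  naceito=0
  
  for (i in 1:iterations){
    
    #Random-walk proposals with the chosen scales
    proposalMu=rnorm(1,mean=chain[i,1],sd=sdMu)
    proposalSigma=chain[i,2]+runif(1,min=-widthSigma,max=widthSigma)
    
    razao=joint(proposalMu,proposalSigma)-joint(chain[i,1],chain[i,2])
    
    #Accept only when the ratio is a valid number and the uniform draw falls below it
    if (!is.nan(exp(razao)) && runif(1,0,1)<min(1,exp(razao))){
      chain[i+1,]=c(proposalMu,proposalSigma)
      naceito=naceito+1
    } else {
      chain[i+1,]=chain[i,]
    }
  }
  
  return(list(chain=chain,acceptanceRate=naceito/iterations))
}

#Example with illustrative scales: smaller steps usually raise the acceptance rate
#run_Metropolis_scaled(c(2,6),5000,sdMu=0.5,widthSigma=0.25)$acceptanceRate

Smaller steps are accepted more often but move through the posterior more slowly, so the proposal scale is a trade-off between acceptance and mixing.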

 

CHECKING THE RESULTS: CHAINS AND MARGINAL POSTERIORS

We present the chains and the densities (marginal posteriors) for each parameter, comparing the means of the chains (discarding the initial 5000 values as burn-in) with the true values that generated the data.

 

par(mfrow=c(2,2))

#Indices of the burn-in samples to discard
burn=1:5000

plot(chainM[,1],type="l",ylab=expression(mu),xlab="Iterations")
abline(h=mean(chainM[-burn,1]),col="red")
abline(h=3,col="blue",lty=2)


plot(chainM[,2],type="l",ylab=expression(sigma),xlab="Iterations")
abline(h=mean(chainM[-burn,2]),col="red")
abline(h=1,col="blue",lty=2)
legend(8000, 5.6,c("chain mean","real mean"),col=c("red","blue"),lty=c(1,2),cex=1,border="black",bty="o",box.lwd = 1,box.col = "black",bg = "white")

plot(density(chainM[-burn,1],bw=0.05),type="l",xlab=expression(mu),main="")
abline(v=mean(chainM[-burn,1]),col="red",lty=1)
abline(v=3,col="blue",lty=2)

plot(density(chainM[-burn,2]),type="l",xlab=expression(sigma),main="")
abline(v=mean(chainM[-burn,2]),col="red",lty=1)
abline(v=1,col="blue",lty=2)

#Mu mean
mean(chainM[-burn,1])
## [1] 3.084854
#Sigma mean
mean(chainM[-burn,2])
## [1] 0.9505381
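As a final check (using the chainM matrix and the burn index defined above), we can also summarize the retained samples with 95% credible intervals, which should cover the true values \(\mu=3\) and \(\sigma=1\):

#95% credible intervals from the post-burn-in samples
quantile(chainM[-burn,1],probs=c(0.025,0.975)) #interval for mu
quantile(chainM[-burn,2],probs=c(0.025,0.975)) #interval for sigma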