In this post we will work with a changepoint regression model from a Bayesian approach and estimate its parameters with the Metropolis sampler in R. So let’s get started…
The Metropolis sampler generates a Markov Chain \(\{X_t|t=0,1,2,...\}\) as follows:
Choose a proposal distribution \(g(.|X_t)\). Before choosing it, check the book referenced below to learn more about the desirable features of this function.
Generate \(X_0\) from the proposal distribution \(g\).
Repeat until the chain has converged to a stationary distribution:
Generate Y from \(g(.|X_t)\).
Generate U from Uniform(0, 1).
If \(U \leq \frac{f(Y)}{f(X_t)}\), accept \(Y\) and set \(X_{t+1}=Y\), otherwise set \(X_{t+1}=X_t\). This means that the candidate point Y is accepted with probability \(\alpha(X_t, Y)=\min\left(1, \frac{f(Y)}{f(X_t)}\right)\).
Increment \(t\).
To learn more about the Metropolis sampler, check this book: Statistical Computing with R.
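To make these steps concrete, here is a minimal, generic one-dimensional Metropolis sketch in R. It is only illustrative: the target f, the proposal standard deviation and the helper name metropolis1d are arbitrary choices for this example, not part of the changepoint model we fit below.
metropolis1d <- function(f, x0, iterations, proposal.sd = 1){
  chain <- numeric(iterations + 1)
  chain[1] <- x0 # X_0
  for (t in 1:iterations){
    y <- rnorm(1, mean = chain[t], sd = proposal.sd) # candidate Y from g(.|X_t)
    # Accept Y with probability min(1, f(Y)/f(X_t)); otherwise keep the current state
    if (runif(1) <= f(y) / f(chain[t])){
      chain[t + 1] <- y
    } else {
      chain[t + 1] <- chain[t]
    }
  }
  return(chain)
}
# Example: sampling from a standard normal target
out <- metropolis1d(dnorm, x0 = 0, iterations = 1000)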
In this case, we are going to sample from the following changepoint model:
\[ Y_i = \alpha + \beta_1 x_i + \beta_2 I_i(\theta) x_i + \varepsilon_i, \\ I_i(\theta)=\left\{ \begin{array}{ll} 0 & i < \theta \\ 1 & i \geq \theta \end{array} \right., \]
where \(\alpha\), \(\beta_1\) and \(\beta_2\) are unknown parameters, \(\theta\) is the changepoint, \(\varepsilon_i \sim \mathcal{N}(0, \sigma^{2}_\varepsilon)\) and \(i=1,2,...,n\). To estimate the parameters in a Bayesian analysis we need the likelihood function, which for this example is given by:
\[ L(X,Y|\alpha, \beta_1, \beta_2, \sigma, \theta )=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-(Y_i-(\alpha + \beta_1 X_i + \beta_2 I_i(\theta) X_i))^{2}}{2\sigma^2}\right) \]
Now we have to specify a prior distribution for each parameter \(\alpha\), \(\beta_1\), \(\beta_2\), \(\sigma\) and \(\theta\). In this case, we will use vague normal distributions (parameterized here by mean and variance) and uniform distributions for the parameters:
\[\begin{align*} \alpha &\sim \mathcal{N}(1, 100),\\ \beta_1 &\sim \mathcal{N}(1, 100),\\ \beta_2 &\sim \mathcal{N}(4, 100),\\ \sigma &\sim Uniform(4, 15),\\ \theta &\sim Uniform\{10, 11,12,...,50\}. \end{align*}\]
Finally, we define the posterior distribution as proportional to the product of the likelihood and the prior distributions:
\[ p(\alpha, \beta_1, \beta_2, \sigma, \theta | X, Y) \propto L(X,Y|\alpha, \beta_1, \beta_2, \sigma, \theta )\times \pi(\alpha)\times \pi(\beta_1)\times \pi(\beta_2)\times \pi(\sigma)\times \pi(\theta). \]
The posterior distribution will be the target distribution when using the Metropolis sampler.
First, we generate the data from the changepoint regression model presented above:
set.seed(5674)
# True parameter values used to simulate the data
alpha.init <- 0.5
beta1.init <- 0.5
beta2.init <- 2
sigma.init <- 7
theta.init <- 30 # true changepoint
x <- 1:60
# Mean of the response: the beta2 term only applies from the changepoint onwards
mean.init <- alpha.init + (x * beta1.init) + ifelse(x >= theta.init, x * beta2.init, 0)
y <- rnorm(60, mean = mean.init, sd = sigma.init)
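Optionally, a quick base-R plot of the simulated data (using the objects x, y and theta.init created above) helps to see the changepoint; this is just a visual check, not part of the estimation:
plot(x, y, pch = 19, main = "Simulated changepoint data")
abline(v = theta.init, col = "red", lty = 2) # true changepoint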
Now we define the likelihood function. In this case we will work with its logarithm, which is highly recommended to avoid numerical problems (underflow) while running the algorithm:
LikelihoodFunction <- function(param){
  alpha <- param[1]
  beta1 <- param[2]
  beta2 <- param[3]
  sigma <- param[4]
  theta <- param[5]
  # Mean of the model; as in the data generation, the beta2 term applies when x >= theta
  mu <- alpha + (x * beta1) + ifelse(x >= theta, x * beta2, 0)
  # Sum of the log-densities (log = TRUE must be passed to dnorm, not to sum)
  loglikelihoods <- sum(dnorm(y, mu, sigma, log = TRUE))
  return(loglikelihoods)
}
Next we define the prior distributions for the parameters \(\alpha\), \(\beta_1\), \(\beta_2\), \(\sigma\) and \(\theta\). As with the likelihood function, we are going to use the logarithm of the prior distributions:
LogPriorFunction <- function(param){
  alpha <- param[1]
  beta1 <- param[2]
  beta2 <- param[3]
  sigma <- param[4]
  theta <- param[5]
  alphaprior <- dnorm(alpha, 1, 10, log = TRUE)
  beta1prior <- dnorm(beta1, 1, 10, log = TRUE)
  beta2prior <- dnorm(beta2, 4, 10, log = TRUE)
  sigmaprior <- dunif(sigma, 4, 15, log = TRUE)
  # Discrete uniform prior on {10, 11, ..., 50}: 41 equally likely values
  thetaprior <- ifelse(theta %in% 10:50, log(1/41), -Inf)
  return(alphaprior + beta1prior + beta2prior + sigmaprior + thetaprior)
}
Since we are working with logarithms, we define the log-posterior as the sum of the log-likelihood and the log-priors. Remember that this function is our target function \(f(.)\) from which we want to sample.
PosteriorFunction <- function(param){
  return(LikelihoodFunction(param) + LogPriorFunction(param))
}
Next, we define the proposal distribution \(g(.|X_t)\). Since we are using the Metropolis sampler, the proposal distribution must be symmetric and depend on the current state of the chain. For \(\alpha\), \(\beta_1\), \(\beta_2\) and \(\sigma\) we use a normal distribution centered at the current value of each parameter, and for \(\theta\) we add a symmetric discrete uniform step (an integer between -3 and 3) to the current value.
ProposalFunction <- function(param){
  alpha <- rnorm(1, param[1], sd = 1)
  beta1 <- rnorm(1, param[2], sd = 1)
  beta2 <- rnorm(1, param[3], sd = 1)
  sigma <- rnorm(1, param[4], sd = 0.5)
  # Symmetric discrete step of at most 3 around the current changepoint
  theta <- sample(-3:3, 1) + param[5]
  return(c(alpha, beta1, beta2, sigma, theta))
}
Finally, we write the code that runs the Metropolis sampler. Since we are working with logarithms, the probability that the candidate point \(Y\) is accepted becomes:
\[ \alpha(X_t, Y)=\min\left(1, \exp\left(\log f(Y)-\log f(X_t)\right)\right). \]
RunMetropolisMCMC <- function(startvalue, iterations){
  # Create an array to save the chain values
  chain <- array(dim = c(iterations + 1, 5))
  chain[1, ] <- startvalue # Define the start value of the chain
  for (i in 1:iterations){
    # Generate Y from the proposal function
    Y <- ProposalFunction(chain[i, ])
    # Probability that the candidate point is accepted
    probability <- exp(PosteriorFunction(Y) - PosteriorFunction(chain[i, ]))
    # Decision criteria to accept or reject Y
    if (runif(1) < probability) {
      chain[i + 1, ] <- Y
    } else {
      chain[i + 1, ] <- chain[i, ]
    }
  }
  return(chain)
}
Because the MCMC chain is strongly autocorrelated, we will thin it, keeping only every nth value of the chain to reduce autocorrelation. In this case, we keep one value for our final chain every 50 iterations of the algorithm. To learn more about this method check this publication: Thinning to reduce autocorrelation: Rarely useful.
startvalue <- c(0, 0, 0, 5, 40) # Define the start value of the chain
iterations <- 50 * 10000
chain <- RunMetropolisMCMC(startvalue, iterations)
# Select values every 50 iterations for the final (thinned) chain
cfinal <- matrix(NA, ncol = 5, nrow = 10000)
for (i in 1:10000){
  cfinal[i, ] <- chain[i * 50, ]
}
# The first 8000 values of the thinned chain will be discarded as burn-in
burnIn <- 8000
Here, you can observe the ACF plot, which shows the autocorrelation of a series with its lagged values. In this case, we present the ACF plot of the parameter \(\theta\) for the initial MCMC chain and for the final chain after thinning. From the graphs we can conclude that the procedure used actually manages to decrease the autocorrelation significantly. You can build these graphs for the other parameters and observe the same behavior.
To learn more about ACF check this publication: Significance of ACF and PACF Plots In Time Series Analysis.
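If you want to reproduce these graphs, a rough sketch using base R's acf function on the fifth column of the chain (the samples of \(\theta\)) could look like this; the layout and titles are arbitrary choices:
par(mfrow = c(1, 2))
acf(chain[, 5], main = "ACF of theta (raw chain)")
acf(cfinal[, 5], main = "ACF of theta (thinned chain)")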
In this section, we present the chain generated by the Metropolis sampler and its distribution for the parameters \(\alpha\), \(\beta_1\), \(\beta_2\), \(\sigma\) and \(\theta\). The true value of each parameter is represented by a red line.
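These trace plots and posterior histograms can be drawn along the following lines; the labels, layout and the vectors param.names and true.values are choices made here for illustration, using the cfinal and burnIn objects defined above:
param.names <- c("alpha", "beta1", "beta2", "sigma", "theta")
true.values <- c(alpha.init, beta1.init, beta2.init, sigma.init, theta.init)
par(mfrow = c(5, 2), mar = c(4, 4, 2, 1))
for (j in 1:5){
  # Posterior histogram after burn-in, with the true value marked in red
  hist(cfinal[-(1:burnIn), j], main = paste("Posterior of", param.names[j]), xlab = "")
  abline(v = true.values[j], col = "red", lwd = 2)
  # Trace plot of the thinned chain after burn-in
  plot(cfinal[-(1:burnIn), j], type = "l", xlab = "Iteration", ylab = "",
       main = paste("Chain of", param.names[j]))
  abline(h = true.values[j], col = "red", lwd = 2)
}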
The following table presents the true value of the parameters and the posterior means obtained using the Metropolis sampler:
res <- colMeans(cfinal[-(1:burnIn),])
real <- c(alpha.init, beta1.init, beta2.init, sigma.init, theta.init)
resfinal <- cbind(real, res)
colnames(resfinal) <- c('True value', 'Mean MCMC')
rownames(resfinal) <- c('alpha', 'beta1', 'beta2', 'sigma', 'theta')
resfinal
## True value Mean MCMC
## alpha 0.5 0.2793430
## beta1 0.5 0.4340764
## beta2 2.0 5.0349229
## sigma 7.0 9.5907309
## theta 30.0 29.3975000
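Beyond the posterior means, you could also report 95% credible intervals from the same thinned, burned-in chain; this is an optional summary, not part of the original table:
# 2.5% and 97.5% posterior quantiles for each parameter
cred <- apply(cfinal[-(1:burnIn), ], 2, quantile, probs = c(0.025, 0.975))
colnames(cred) <- c('alpha', 'beta1', 'beta2', 'sigma', 'theta')
cred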
From the results, we can see that the estimates of \(\alpha\), \(\beta_2\) and \(\sigma\) obtained with the Metropolis sampler are noticeably different from the true values, while the estimates of \(\beta_1\) and \(\theta\) are closer to the true values of these two parameters. It is important to realize that the results are strongly influenced by the choice of the prior distribution for each parameter and of the initial values of the chain: when these are close to the true parameter values, the estimates improve, which is a disadvantage of this sampling method.