Introduction

Here I am going to give you a starting point for fitting a Bayesian errors-in-variables, change-point regression model for unevenly spaced time series data, assuming a single change point and measurement errors. We’ll start by simulating some data with a known change point and then we’ll see if the change-point model can detect it.

Install packages and JAGS

Firstly, the package we’re using requires the installation of JAGS (Just Another Gibbs Sampler) to do the MCMC sampling. If you don’t have it then you will need to install it (http://mcmc-jags.sourceforge.net).

Next, install the package EIVmodels from github.

devtools::install_github("ncahill89/EIVmodels")
library(EIVmodels)

Later we will use the ggplot and tidybayes packages, so make sure to load those too.

library(tidybayes)
library(ggplot2)

Model specification

When it comes to the model specification recipe, the ingredients are

  1. The process model, which relates to the expectation of what’s underlying the observed data. The process model will be governed by a set of parameters that need to be estimated.

  2. The data model, which contains assumptions about how the data are generated and incorporates data uncertainty. The data model links the observed data to the underlying process.

  3. Priors, which contain assumptions about the parameters we are estimating. Priors can be used to impose constraints on model parameters based on apriori information.

process model

For our specification, we’ll assume that the expected value of our observed outcome, \(y\), has a changing linear relationship with the expected x, \(\mu_x\), such that

\(\mu_y = \alpha + \beta_1(\mu_x - cp) \hspace{2em} \text{for } \mu_x < cp\)

\(\mu_y = \alpha + \beta_2(\mu_x - cp) \hspace{2em} \text{for } \mu_x \geq cp\)

In other words, the linear relationship between \(\mu_x\) (the expected value of x) and \(mu_y\) changes at some time point (cp). The measurements of \(x\) are related to \(\mu_x\) via a normal distribution

\(x \sim N(\mu_x,x_{err}^2)\)

where \(x_{err}\) are the observed measurement errors.

data model

We’ll link the observations to the process through a normal data model.

\(y \sim N(\mu_y, \sigma^2)\)

If we have measurement error for the ys here, which we’ll call \(y_{err}\), then we can update the data model to be

\(y \sim N(\mu_y, \sigma^2 + y_{err}^2)\)

priors

We need priors for all unknown parameters. The prior on the change point should constrain the change point to occur somewhere within the range of the observation years.

If you are interested in digging deeper in to the model specification, here is the JAGS specification for this particular model, otherwise feel free to ignore this next part and move straight to the Simulate data section.

cp_model <-
'model{

  ## data model
    for(j in 1:n_obs)
  {
  y[j]~dnorm(mu_y[j],tau[j])
  mu_y[j] <- alpha + beta[C[j]]*(mu_x[j]-cp)
  C[j] <- 1+step(mu_x[j]-cp)
  
  x[j] ~ dnorm(mu_x[j],x_err[j]^-2)
  mu_x[j] ~ dnorm(0,0.5^-2)
  
  tau[j] <- (y_err[j]^2 + sigma^2)^-1
  }

  ##Priors
  alpha[1] ~ dnorm(0.0,10^-2)

  beta[1]~dnorm(0.0,10^-2)
  beta[2]~dnorm(0.0,10^-2)

  sigma ~ dt(0,4^-2,1)T(0,)

  cp ~ dunif(x_min,x_max)

  for(i in 1:n_pred)
  {
  mu_pred[i] <-alpha + beta[Cstar[i]]*(x_pred[i]-cp)
  Cstar[i] <- 1+step(x_pred[i]-cp)
  }
}
'

Notes on the model setup:

Simulate data

One of the best ways to get a feel for a model and it’s limitations is to use simulations. We can use the EIVmodels package to simulate data, with measurement errors, that has a single change point. Below I’m simulating 50 data points where x ranges from 0 to 2 (by default), the change point occurs at a value of \(x = 1\), the slope = 1.1 before the change point (\(\beta_1\)) and 0.1 after the change point (\(\beta_2\)) . The measurement errors for \(y\) and \(x\) are set to 0.1 and 0.2 respectively.

cp_dat <- sim_cp(n_sim = 50,
                 alpha = 0,
                 beta = c(1.1,0.1),
                 cp = 1,
                 y_err = 0.1,
                 x_err = 0.2)
cp_dat
## # A tibble: 50 × 6
##        x x_err       y y_err   true_y true_x
##    <dbl> <dbl>   <dbl> <dbl>    <dbl>  <dbl>
##  1 1.72    0.2  0.0246   0.1  0.0668   1.67 
##  2 0.983   0.2 -0.122    0.1 -0.124    0.887
##  3 0.439   0.2 -0.511    0.1 -0.626    0.431
##  4 2.35    0.2  0.151    0.1  0.0837   1.84 
##  5 1.23    0.2  0.0602   0.1  0.0231   1.23 
##  6 1.76    0.2  0.240    0.1  0.0804   1.80 
##  7 0.212   0.2 -0.767    0.1 -0.910    0.173
##  8 0.922   0.2 -0.242    0.1 -0.199    0.819
##  9 1.19    0.2  0.134    0.1  0.00413  1.04 
## 10 0.982   0.2  0.0603   0.1  0.0328   1.33 
## # … with 40 more rows

The output data will contain the simulated \(x\), \(y\) data as well as the measurement errors that were supplied. It will also contain the “true” underlying process, which can be used for comparison purposes later. Here’s where you can play around with the simulation by changing the parameter values to get different datasets with varying characteristics.

Now, plot the simulated data and add the true underlying process that generated \(y\). When we fit the change-point regression model we are trying to estimate the “Truth” line.

plot_dat(cp_dat,
         add_truth = TRUE)

Run the EIV 1-change-point regression model

Now we’re going to pretend that our simulated data is real life data (i.e., we don’t know the parameter values). So we want to run the model and estimate the parameters. We can then compare the true values of the parameters to the posterior distribution for the parameters to see how the model performs.

We can use the EIVmodels package to fit an EIV change-point regression model using the run_mod function. Specify the model argument as `model = “cp”.

mod_eiv_cp <- run_mod(cp_dat,
                      model = "cp")
## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 55
##    Total graph size: 924
## 
## Initializing model
## No convergence issues detected

Results

We can plot the results (and add the true line) using the plot_res function as follows:

plot_res(mod_eiv_cp,
         add_truth = TRUE)

To access the parameter summaries from the model we need to run the par_est function.

mod_cp_res <- par_est(mod = mod_eiv_cp)
mod_cp_res$par_summary
## # A tibble: 5 × 10
##   variable    mean  median     sd    mad      q5   q95  rhat ess_bulk ess_tail
##   <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 alpha    -0.0784 -0.0775 0.128  0.135  -0.291  0.128 1.00     1038.    1017.
## 2 beta[1]   1.17    1.09   0.361  0.256   0.785  1.83  0.999    1024.    1026.
## 3 beta[2]   0.181   0.205  0.137  0.116  -0.0872 0.362 1.00     1072.    1026.
## 4 cp        0.751   0.737  0.232  0.243   0.386  1.15  1.00     1000.    1017.
## 5 sigma     0.0688  0.0696 0.0336 0.0346  0.0112 0.123 0.999     994.     918.

In this summary, q5 and q95 provide the upper and lower bounds of the 90% uncertainty intervals. Note here whether or not the 90% uncertainty intervals contain the true parameter values.

If you would like to get a more detailed look at the posterior distributions for the parameters you can pull the posterior samples from the model object created above and we can use the tidybayes package to create some plots.

First, get the sample_draws which contains the posterior samples for the parameters and the model-based estimates for the expected value of y.

mod_eiv_cp$sample_draws
## # A tibble: 3,000 × 59
##    .chain .iteration .draw   alpha `beta[1]` `beta[2]`    cp deviance
##     <int>      <int> <int>   <dbl>     <dbl>     <dbl> <dbl>    <dbl>
##  1      1          1     1  0.0468     0.836    0.0687 0.994    -76.7
##  2      1          2     2 -0.108      1.11     0.238  0.756    -63.7
##  3      1          3     3 -0.133      1.69     0.178  0.462    -60.7
##  4      1          4     4  0.0682     0.930    0.0877 1.02     -47.0
##  5      1          5     5 -0.0663     0.902    0.199  0.913    -58.1
##  6      1          6     6 -0.387      4.44     0.352  0.173    -44.0
##  7      1          7     7 -0.178      1.29     0.316  0.611    -51.6
##  8      1          8     8 -0.169      1.37     0.247  0.584    -62.7
##  9      1          9     9 -0.0500     1.31     0.126  0.615    -40.7
## 10      1         10    10 -0.0961     1.21     0.168  0.622    -41.5
## # … with 2,990 more rows, and 51 more variables: `mu_pred[1]` <dbl>,
## #   `mu_pred[2]` <dbl>, `mu_pred[3]` <dbl>, `mu_pred[4]` <dbl>,
## #   `mu_pred[5]` <dbl>, `mu_pred[6]` <dbl>, `mu_pred[7]` <dbl>,
## #   `mu_pred[8]` <dbl>, `mu_pred[9]` <dbl>, `mu_pred[10]` <dbl>,
## #   `mu_pred[11]` <dbl>, `mu_pred[12]` <dbl>, `mu_pred[13]` <dbl>,
## #   `mu_pred[14]` <dbl>, `mu_pred[15]` <dbl>, `mu_pred[16]` <dbl>,
## #   `mu_pred[17]` <dbl>, `mu_pred[18]` <dbl>, `mu_pred[19]` <dbl>, …

Now let’s look at the posterior distributions for \(\beta_1\) and \(\beta_2\) and the change-point (cp). We’ll use the stat_halfeye() function from the tidybayes package.

ggplot(mod_eiv_cp$sample_draws, aes(x = `beta[1]`)) +
  tidybayes::stat_halfeye() 

ggplot(mod_eiv_cp$sample_draws, aes(x = `beta[2]`)) +
  tidybayes::stat_halfeye() 

ggplot(mod_eiv_cp$sample_draws, aes(x = cp)) +
  tidybayes::stat_halfeye() 

What if we didn’t assume the EIV component in the model?

If you wish the ignore the EIV component in the change-point model then you can use the run_mod function from the EIVmodels package and specify EIV = FALSE.

mod_cp <- run_mod(cp_dat,
                  model = "cp",
                  EIV = FALSE)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 50
##    Unobserved stochastic nodes: 5
##    Total graph size: 809
## 
## Initializing model
## No convergence issues detected

We can then compare both models to the truth.

plot_res(mod_cp,
         mod_eiv_cp,
         add_truth = TRUE)

Summary

This is a starting point for developing Bayesian change-point regression models for time dependent data with measurement error. Using simulated data can be helpful for checking if the model is doing what you expect it to do.