Inferring Causal Impact Using Bayesian Structural Time-Series Models

Introduction

Brodersen et al. (2013) suggest a fully Bayesian way to infer causal impact in a time-series when there is no straightforward candidate for the counterfactual. An important example for such a situation is the evaluation of a designed market intervention which affects the whole market and there is no way to form a control group for the counterfactual. They propose a flexible method based on a standard state-space model which could accommodate multiple sources of variation (e.g. seasonality, contemporaneous covariates, etc.) and can be used to predict the counterfactual market response that would have occurred had no intervention taken place.

In what follows I try to replicate their method in a slightly simplified way. I evaluate the accuracy of its prediction based on simulated data (which is more or less the replication of their section 3). The task was implemented in R.

Model

The problem is modeled by a basic state-space model which can be written as \[ \begin{align} y_t &= Z_t^T \mathbf z_t + \epsilon_t \\ \mathbf z_{t+1} &= T_t \mathbf z_t + R_t \eta_t \end{align} \]

where \(\epsilon_t \sim N(0, \sigma_t^2)\) and \(\eta_t \sim N(0,Q_t)\). The first equation is the observation or measurement equation which links the observed variable \(y_t\) to the unobserved \(z_t\) (we used to denote the latent variable as \(x_t\) but I follow the notation of Brodersen et al., 2013). The second equation is the transition equation which models the evolution of the hidden state.

State components. This model is flexible enough to accommodate multiple sources of variation. The authors list four possible components of the state:

Local linear trend defined by pair of equations \[ \begin{align} \mu_{t+1} &= \mu_t + \delta_t + \eta_{\mu,t} \\ \delta_{t+1} &= \delta_t + \eta_{\delta,t} \end{align} \] where \(\eta_{\mu,t} \sim N(0,\sigma_{\mu}^2)\) and \(\eta_{\delta,t} \sim N(0,\sigma_{\delta}^2)\). The term \(\mu\) is the value of the trend at time \(t\) whereas \(\delta\) can be thought of as the slope.
Seasonality defined by \[ \gamma_{t+1} = - \sum_{s=0}^{S-2}\gamma_{t-s} + \eta_{\gamma,t} \] where \(S\) represents the number of seasons and \(\gamma_t\) denotes their joint contribution to the observed response \(y_t\). The mean of \(\gamma_{t+1}\) is such that the total seasonal effect is zero when summed over \(S\) seasons. The transition matrix \(T_t\) for the seasonal model is an \(S-1 \times S-1\) matrix with -1’s along the top row, 1’s along the subdiagonal and 0’s elsewhere.
Contemporaneous covariates with static coefficients that allow counterfactual predictions to be informed by synthetic controls (e.g. observed responses from markets that were not treated – note that this assumes the absence of any spill-over effect). It could be defined as \[ Z_t = \beta^T \mathbf{x}_t \] setting \(z_t = 1\) with \(T_t=1\) and \(Q_t=0\). If the number of control series is very large, we can impose sparsity using a spike-and-slab prior on the coefficients (which is a type of ridge regression).
Contemporaneous covariates with dynamic coefficients which involve significantly more computational burden than static ones, but this way we can account for time-varying relationships. Given covariates \(j=1,...,J\), this introduces the dynamic regression component \[ \begin{align} \mathbf{x}_t \beta_t &= \sum_{j=1}^{J} x_{j,t} \beta_{j,t} \\ \beta_{j,t+1} &= \beta_{j,t} + \eta_{\beta,j,t} \end{align} \] where \(\eta_{\beta,j,t} \sim N(0,\sigma_{\beta_j}^2)\). Here, \(\beta_{j,t}\) is the coefficient for the \(j^{\text{th}}\) control series at time \(t\) and \(\sigma_{\beta_j}\) is the standard deviation of its associated random walk. We can write this component in state-space form by setting \(z_t=\beta_t\), \(Z_t = \mathbf{x}_t\), with \(T_t = I\) and \(Q_t = \mathrm{diag}(\sigma_{\beta_j}^2)\).

For this project I will only incorporate (1) and (4) into my model, and make further simplifications as follows: I omit the spike-and-slab part of the covariates and assume instead that we only have two control series, which are simple sinusoids with wavelength of 90 and 360 days, respectively. I also assume that \(\delta_0=0\) and \(\eta_{\delta,t}=0\) as well. These simplifications are made by Brodersen et al. (2013) for their section 3. Therefore, the terms of my model looks like the following: \[ \mathbf z_t = \begin{pmatrix} \mu_t \\ \beta_{1,t} \\ \beta_{2,t} \end{pmatrix}, \quad Z_t = \begin{pmatrix} 1 \\ x_{1,t} \\ x_{2,t} \end{pmatrix}, \quad T_t = I_3, \quad R_t = I_3, \quad Q_t=Q=\begin{pmatrix} \sigma_{\mu}^2 & 0 & 0 \\ 0 & \sigma_{\beta_1}^2 & 0 \\ 0 & 0 & \sigma_{\beta_2}^2 \end{pmatrix}. \] I will also assume that the variance of the observation is stable in time, so that \(\sigma_t^2=\sigma^2\).

Inference

The posterior inference of the paper can be broken down into three pieces. First, they simulate draws of the model parameters \(\theta\) and the state vector \(\mathbf{z}\) given the observed data \(y_t\) in the training period. Second, they use the posterior simulations to simulate from the posterior predictive distribution \(p(\mathbf{\tilde y} | \mathbf{y})\) over the counterfactual time series \(\mathbf{\tilde y}\) given the observed pre-intervention activity \(\mathbf{y}\). Third, they use the posterior predictive samples to compute the posterior distribution of the pointwise impact \(y_t - \tilde y_t\) for each \(t\).

For the first step, they use the posterior simulation algorithm from Durbin and Koopman (2002), which consists of a Gibbs sampler to simulate from a Markov chain whose stationary distribution is \(p(\theta,\mathbf{z}|\mathbf{y})\). How the sampler alternates between a data-augmentation step that simulates from \(p(\mathbf{z} | \mathbf{y}, \theta)\); and a parameter-simulation step that simulates from \(p(\theta|\mathbf{y},\mathbf{z})\) is detailed in Durbin and Koopman (2002).

For this project, I will choose a simpler but less flexible method. As we have a linear Gaussian state-space model, I can apply a simple Kalman-filter to back up the hidden states from the observed \(\mathbf{y}\), and, for the sake of simplicity, I will assume that we know the values of the parameters, \(Q_t\) and \(\sigma^2\). Then, I can use the uncovered states and the parameters to push the system forward into the future and simulate \(\mathbf{\tilde y}\). In that way I can simulate the distribution of counterfactual observations given the observations from the training period. Comparing this distribution to the actual observations I can calculate a distribution for the impact of the program by defining \[ \phi^{(\tau)}_t := y_t - \tilde y_t^{(\tau)} \] as the effect at time \(t\) for the draw \(\tau\). Following Brodersen et al. (2013), I conclude that an impact was present if and only if the central 95% posterior probability interval of the cumulative effect excluded zero.

Simulation

Simulation of the data. First, I simulate the series of observations following section 3 of Brodersen et al. (2013). I consider a period of 500 days, with a treatment happening at the 334^th day. To simulate the effect of the market intervention, the post-treatment period of the series is multiplied by \((1+e)\) where \(e\) is the size of the program. For this illustration, I choose \(e=0.25\), that is the intervention has a constant 25% effect in each period.

library(MASS)
set.seed(20140504)
# length of the series
len <- 500
treatment <- ceiling(2*len/3) #beginning of the treatment
postperiod <- len-treatment
# parameters of the state
statedim <- 3
T <- diag(statedim) #parameter matrix of state
Q <- 0.01*diag(statedim) #variance of state
x1 <- sin(2*pi*seq_len(len)/90)
x2 <- sin(2*pi*seq_len(len)/360)
# simulation of the state
z0 <- c(5,0,0) #initial conditions: mu_0=5, beta_10=0, beta_20=0
z <- matrix(ncol=statedim, nrow=len)
z[1,] <- z0%*%T + mvrnorm(1,rep(0,statedim),Q)
for (i in 2:len) {
    z[i,] <- z[(i-1),]%*%T + mvrnorm(1,rep(0,statedim),Q)
    }
# observations
sigmasq <- 0.01 
Z <- matrix(ncol=statedim,nrow=len) 
Z[,1] <- rep(1,len)
Z[,2] <- x1
Z[,3] <- x2

e <- 0.25 #effect size
y <- vector(length=len)
for (i in 1:len){
    y[i] <- Z[i,]%*%z[i,] + sqrt(sigmasq)*rnorm(1)
}
ye <- y
ye[treatment:len] <- y[treatment:len]*(1+e)

The figure below shows the simulated observations (blue) along with the true counterfactual (gray). We can clearly see the effect of the program (that, of course, highly depends on \(\sigma^2\): higher measurement variation might hide a smaller effect).

plot(y, type="l",
     col="gray",
     ylim=c(min(ye,y),max(ye,y)),
     xlab="time", ylab="",
     main="")
lines(ye, lwd=2, col="blue")
abline(v=treatment)
legend("topleft",
       c("true counterfactual","observation"),
       lty=c(1,1), lwd=c(1,2),col=c("gray","blue"))

Kalman-filter. I apply Kalman-filter for the pre-treatment period of observations to filter out the hidden state. The following function accomplishes the filtering (the notation is based on what we had in class). Note that it works well for other-dimensional states, and for arbitrary transition matrices as well.

KF <- function(y,A,H,R,Q,x10,V10) {
    len <- length(y)
    statedim <- length(x10)
    #initialization of the variance V_{t|t-1}
    Vttm1 <- V10
    #collect the state vectors x_{t|t-1} and x_{t|t}
    xttm1 <- matrix(nrow=len+1, ncol=statedim)
    xtt <- xttm1[-1, , drop=F]
    xttm1[1,] <- x10
    #start KF recursions
    for (i in 1:len) {
        #Var(y_k|y_{1:k-1})
        Vy <- H[i,,drop=F]%*%Vttm1%*%t(H[i,,drop=F]) + R
        #forecast error (out of sample)
        epsilon <- y[i] - H[i,]%*%xttm1[i,]
        #update x_{t|t} and x_{t+1|t}
        xtt[i,] <- xttm1[i,] + Vttm1%*%t(H[i,, drop=F])%*%solve(Vy)%*%epsilon
        xttm1[i+1,] <- xtt[i,]
        #update V_{t|t}
        Vtt <- Vttm1 - Vttm1%*%t(H[i,,drop=F])%*%solve(Vy)%*%H[i,,drop=F]%*%Vttm1
        #update V_{t+1|t}
        Vttm1  <- A%*%Vtt%*%t(A) + Q 
    }
    return(xtt)
}

On the figure below we can see that the filtering works quite well. Although it is clear that filtering a multi-dimensional state is less precise than in the uni-dimensional case. The reason is that for the multi-dimensional state, \(Q\) is relatively higher compared to \(\sigma^2\). See the figures below for illustration.

#diffuse prior
V10 <- 10000*diag(statedim)
x10 <- rep(0,statedim)
zsim  <- KF(y[1:treatment],A=T,H=Z[1:treatment,, drop=F],R=sigmasq,Q,x10,V10)

#plot the result of the filter (burn the first 20 observations)
burn <- 20
ztrue <- z[(burn+1):treatment,]
zsim0 <- zsim[(burn+1):treatment,]
matplot(ztrue, type="l",
     col="gray",
     ylim=c(min(ztrue,zsim0),max(ztrue,zsim0)),
     xlab="time", ylab="",
     main="")
matlines(zsim0, col="blue")
legend("topleft",
       c("true (unobserved) states","filtered states"),
       lty=c(1,1),col=c("gray","blue"))

#
#works much better if just one-dimensional state (only local trend)
T1 <- 1
Q1 <- 0.01
Z1 <- Z[,1,drop=F]
z1 <- z[,1,drop=F]
e <- 0.1 #effect size
y1 <- vector(length=len)
for (i in 1:len){
    y1[i] <- Z1[i,]%*%z1[i,] + sqrt(sigmasq)*rnorm(1)
}
ye1 <- y1
ye1[treatment:len] <- y1[treatment:len]*(1+e)

#diffuse prior
statedim <- dim(z1)[[2]]
V10 <- 10000*diag(statedim)
x10 <- rep(0,statedim)
zsim1  <- KF(y=y1[1:treatment],A=T1,H=Z1[1:treatment,, drop=F],R=sigmasq,Q1,x10,V10)

#plot the result of the filter (burn the first 20 observations)
ztrue <- z1[20:treatment,]
zsim1 <- zsim1[20:treatment,]
matplot(ztrue, type="l",
     col="gray",
     ylim=c(min(ztrue,zsim1),max(ztrue,zsim1)),
     xlab="time", ylab="",
     main="")
matlines(zsim1, col="blue")
legend("topleft",
       c("true (unobserved) state","filtered state"),
       lty=c(1,1),col=c("gray","blue"))

As I said before, the extent to which the Kalman-filter can filter out the hidden state hinges upon the relative magnitude of variance \(\sigma^2\) to the variance of state \(Q\). My choice for the variance parameters follows that of Brodersen et al. (2013). Next, I should simulate from the posterior predictive distribution \(p(\mathbf{\tilde y} | \mathbf{y})\).

Posterior predictive simulation. For simulating from the posterior predictive distribution \(p(\mathbf{\tilde y} | \mathbf{y})\), I wrote a function which takes the last value of the filtered state and pushes forward the state using the (now known) parameters. From the series of uncovered state variables in the pre-treatment period, it simulates a draw for the counterfactual series for the post-treatment period.

pps <- function(i,zn,Z,T,sigmasq,Q,postperiod) {
    print(paste("prediction: ",i,sep=""))
    statedim <- length(zn)
    zsim <- matrix(nrow=postperiod,ncol=statedim)
    #simulate the state 
    for (i in 1:postperiod) {
        zsim[i,] <- zn%*%T + mvrnorm(1,rep(0,statedim),Q)  
        zn <- zsim[i,] 
    }
    #simulate y in the future
    ysim <- vector(length=postperiod)
    for (i in 1:postperiod){
        ysim[i] <- Z[treatment+i,]%*%zsim[i,] + sqrt(sigmasq)*rnorm(1)
    }
    return(list(y=ysim,z=zsim))
}

I draw 1000 series to simulate the distribution of the counterfactual series.

zn <- zsim[treatment,]
sim_series <- lapply(1:1000,pps,zn,Z=Z,T=T,sigmasq=sigmasq,Q=Q,postperiod=postperiod)

The figure below shows the true counterfactual along with the distribution (mean, 2.5% and 97.5% percentile) of the simulated counterfactual. We can see that the observed series is not contained by the 95%-credible interval for most of the time which suggests that the market intervention had a positive impact.

#plot the simulated series
y_sims <- do.call(rbind,lapply(sim_series,function(x) x$y))
yh <- yl <- ysim <- y 
ysim[(treatment+1):len] <- apply(y_sims,2,mean)
yh[(treatment+1):len] <- apply(y_sims,2,quantile,0.975)
yl[(treatment+1):len] <- apply(y_sims,2,quantile,0.025)

plot(y, type="l",
     col="gray",
     ylim=c(min(y,ye,yl),max(y,ye,yh)),
     xlab="time",ylab="",
     main="Measurement, true and predicted counterfactual")
lines(ysim, col="red")
lines(yh, col="red", lty=2)
lines(yl, col="red", lty=2)
lines(ye, col="blue", lwd=2)
abline(v=treatment)
legend("topleft",
       c("observed","true (unobserved) counterfactual", 
         "predicted counterfactual (mean)", "95% credible interval"),
       lty=c(rep(1,3),2,2), lwd=c(2,rep(1,3)),
       col=c("blue","gray",rep("red",2)))

Impact assessment. To be more precise about whether we can conclude that there is an effect, I follow the decision rule of Brodersen et al. (2013). They decide that the intervention had an effect if the 95% credible interval of the cumulative effect at the last period excludes zero. For the previous simulation we can conclude that the program indeed had an effect as the lower bound of the credible interval is positive.

impact <- ye[(treatment+1):len] - y_sims
cumimpact <- t(apply(impact,1,cumsum))
cuml <- apply(cumimpact,2,quantile,0.025)
cuml[length(cuml)]

## [1] 143.6

Following the analysis of Brodersen et al. (2013), I will investigate how the model performs by looking at the sensitivity. That is, if there is an effect what is the probability that the model finds it. I will simulate series with different effect sizes (5, 10, 25, 50, and 100 percent) and look at how frequent the model correctly finds that there is a positive impact. For each effect sizes, I will simulate 256 series, and replicate the analysis what was detailed above. The following function implements one simulation (the parameter nsim controls how many counterfactual series are simulated in the posterior predictive simulation part).

cumimp <- function(j,e,nsim){
    print(paste("---simulation: ",j,sep=""))
    #simulate data as before
    len <- 500
    treatment <- ceiling(2*len/3) #beginning of the treatment
    postperiod <- len-treatment
    # parameters of the state
    statedim <- 3
    T <- diag(statedim) #parameter matrix of state
    Q <- 0.01*diag(statedim) #variance of state
    x1 <- sin(2*pi*seq_len(len)/90)
    x2 <- sin(2*pi*seq_len(len)/360)
    # simulation of the state
    z0 <- c(10,0,0) #initial conditions: mu_0=10, beta_10=0, beta_20=0
    z <- matrix(ncol=statedim, nrow=len)
    z[1,] <- z0%*%T + mvrnorm(1,rep(0,statedim),Q)
    for (i in 2:len) {
        z[i,] <- z[(i-1),]%*%T + mvrnorm(1,rep(0,statedim),Q)
        }
    # observations
    sigmasq <- 0.01 
    Z <- matrix(ncol=statedim,nrow=len) 
    Z[,1] <- rep(1,len)
    Z[,2] <- x1
    Z[,3] <- x2

    y <- vector(length=len)
    for (i in 1:len){
        y[i] <- Z[i,]%*%z[i,] + sqrt(sigmasq)*rnorm(1)
    }
    ye <- y
    ye[treatment:len] <- y[treatment:len]*(1+e)
    
    #KF with diffuse priors
    V10 <- 10000*diag(statedim)
    x10 <- rep(0,statedim)
    zsim  <- KF(y=y[1:treatment],A=T,H=Z[1:treatment,, drop=F],R=sigmasq,Q,x10,V10)
    
    #posterior predictive simulation    
    zn <- zsim[treatment,]
    lot_sim_series <- lapply(1:nsim,pps,
                            zn,Z=Z,T=T,sigmasq=sigmasq,Q=Q,postperiod=postperiod)
    y_sims <- do.call(rbind,lapply(lot_sim_series,function(x) x$y))

    #calculate cumulative impact 
    impact <- ye[(treatment+1):len] - y_sims
    cumimpact <- t(apply(impact,1,cumsum))
    cuml <- apply(cumimpact,2,quantile,0.025)

    return(cuml[length(cuml)])
}

This code implements the simulations for different effect sizes.

#Sensitivity analysis
effect.sizes <- c(0.05, 0.1, 0.25, 0.5, 1)
sensitivity <- vector(length=length(effect.sizes)) 
for (i in 1:length(effect.sizes)) {
    impacts <- lapply(1:256,cumimp,e=effect.sizes[i],nsim=200)
    has.effect <- sapply(impacts,function(x) x>0)
    sensitivity[i] <- mean(has.effect)
}

sensitivity

## [1] 0.2109 0.3008 0.5898 0.9023 0.9805

The result is in line with our expectations. The larger the effect is, the likely the method finds it. For example, a constant 5-percent effect is detected in only one of five cases, whereas an effect which doubles the measurement (e.g. the sales) is detected almost every time (98.05%). We can also see that probability of detection of the effect is non-linear in the effect size: an effect of 50% is found in 90% of the cases that is quite close the 100% case where the effect is double as large, but far better than the 25% case which is only found in 59% of the cases.

Concluding remarks

In this project I demonstrated how the proposed method of Brodersen et al. (2013) works for inferring causal impact for time-series using Bayesian methods. My simplified replication showed that the proposed model might detect causal effects even if there is no candidate for the counterfactual. Naturally, the results clearly depend on the parametrization of the simulated series (the choice of state components, and the relative sizes of the variances), and also on how the effect is defined (one can reasonably argue that a constant effect might be far from real-life situations and a fading-away type effect might better capture the situations we are interested in). However, the method still has a potential, and it might be worth to pursue this line of research given the importance of problems it addresses.