Idea behind SC

There is one treated group and many possible control groups. The SC method computes the counterfactual \(E(Y_{T+1}(0)|D=1)\) as a weighted average of the time-\(T+1\) outcomes of control groups “selected” from the pool of possible controls, that is:

\[E(Y_{T+1}(0)|D=1) = \sum_{j=1}^J w^*_j Y_{j,T+1}\] where the donor pool is formed of \(j=1,2,...,J\) control groups. The treated unit is indexed by \(j=0\), so that \(Y_{0,t}\) is the outcome of the treated at time \(t\) and \(Y_{j,t},j=1,2,...,J\) is the outcome of the \(j\)’th control group at time \(t\).

The weights \(w^*_j\) are computed in a data-driven way. The procedure for selecting the weights is described on page 13 of this document. As noted there, when the covariates are lagged outcomes, the procedure selects the weights that minimize:

\[\sum_{t=1}^T \Big(Y_{0,t} - \sum_{j=1}^J w_j Y_{j,t}\Big)^2\]

As Imbens points out, this is equivalent to solving for \(w_j\) from the following regression equation:

\[Y_{0,t} = \sum_{j=1}^J w_j Y_{j,t} + \epsilon_{0,t}\] This implies that \(Y_{0,t} = Y_{0,t}(0) = f(Y_{1,t},..., Y_{J,t})\) for \(t \leq T\). In particular, the potential outcome of the treated unit is a function only of the contemporaneous outcomes of the controls. The procedure does not allow, say, the past outcome \(Y_{0,t-1}\) to affect \(Y_{0,t}(0)\).
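
As a concrete illustration of the weight selection above, here is a minimal sketch in Python that solves the least-squares problem with scipy.optimize. The function name sc_weights and the data layout are my own choices, and the non-negativity and adding-up constraints on the weights are the ones used in the standard SC method, which are left implicit in the text above.

```python
import numpy as np
from scipy.optimize import minimize

def sc_weights(y_treated, Y_controls):
    """Pick SC weights by minimizing the pre-treatment squared fit.

    y_treated:  (T,) pre-treatment outcomes Y_{0,t} of the treated unit.
    Y_controls: (T, J) pre-treatment outcomes Y_{j,t} of the J controls.
    Weights are restricted to be non-negative and to sum to one.
    """
    T, J = Y_controls.shape

    def objective(w):
        resid = y_treated - Y_controls @ w
        return np.sum(resid ** 2)

    res = minimize(
        objective,
        x0=np.full(J, 1.0 / J),                                        # start from equal weights
        bounds=[(0.0, 1.0)] * J,                                       # w_j >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # weights sum to one
    )
    return res.x

# The counterfactual at T+1 is then the weighted average of the controls'
# outcomes at T+1:  y0_hat = Y_controls_Tplus1 @ w_star
```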

What is not intuitive to me about this specification is why the pre-treatment outcomes of the treated unit are not allowed to depend on its own past outcomes in addition to the (weighted) outcomes of the control groups. For example, at time \(t\) one could specify:

\[\sum_{t=2}^T \Big(Y_{0,t} - \phi Y_{0,t-1} - \sum_{j=1}^J w_j Y_{j,t}\Big)^2\]

By not including the past of the treated unit, the procedure is implicitly saying that, at each point in time, the (weighted) outcomes of the control groups are more informative about the outcome of the treated than its own history. However, when there is considerable persistence in the outcome of the treated, it seems to me that this assumption ignores useful information contained in the treated unit's own past.
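
For comparison, here is a minimal sketch of the alternative specification suggested above, in which the treated unit's own lag enters with coefficient \(\phi\). This is just an unconstrained least-squares illustration of the modification, not an established estimator; the weight constraints from the sketch above could be imposed in the same way.

```python
import numpy as np

def lagged_sc_fit(y_treated, Y_controls):
    """Fit Y_{0,t} = phi * Y_{0,t-1} + sum_j w_j * Y_{j,t} by least squares.

    y_treated:  (T,) pre-treatment outcomes of the treated unit.
    Y_controls: (T, J) pre-treatment outcomes of the J controls.
    The sum over t starts at t = 2 because of the lag.
    """
    y_cur = y_treated[1:]                       # Y_{0,t},   t = 2..T
    y_lag = y_treated[:-1]                      # Y_{0,t-1}
    X = np.column_stack([y_lag, Y_controls[1:, :]])
    coef, *_ = np.linalg.lstsq(X, y_cur, rcond=None)
    return coef[0], coef[1:]                    # (phi, w)
```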

Questions:

  1. The problem with this approach seems to be overfitting (one obtains almost the same results by using the entire past of the treated unit as by using covariates from the controls). If there are \(J\) possible control groups, each satisfying an unconfoundedness assumption, is there a way to determine the optimal number of control groups to use in the weighted average? That is, are all groups in the donor pool used to compute the weights and the weighted average?

The weights change with the forecast horizon: using \(T-h\) (for \(h>0\)) pre-treatment periods produces different weights than using all \(T\) periods, and hence a different forecast (try this on the Basque data; a quick check is sketched after this list). Doesn't this create an ambiguity about what the counterfactual is?

Some studies use only one pre-treatment period to compute the weights. What is the trade-off in terms of fit between the number of periods and the number of groups used?

  2. Since we are uncertain about which control groups to use, it would be nice to make this uncertainty explicit. One suggestion was to determine the weights via a minimax or maximin procedure. In that case, however, wouldn't we be accepting some bias in the estimate of the treatment effect, since the minimax procedure trades off bias against variance?

  3. Which variables should be used to determine the weights: past outcomes, observed covariates, or both? If lagged outcomes are used, how many past periods? Again, the problem as currently formulated seems prone to overfitting.
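
As mentioned in question 1, a quick way to check the horizon-dependence of the weights (for example on the Basque data) is to re-estimate them on a truncated pre-treatment window and compare. A minimal sketch, reusing the hypothetical sc_weights function above and assuming a pre-treatment outcome matrix Y_pre with the treated unit in the first column:

```python
import numpy as np

def weight_sensitivity(Y_pre, h):
    """Compare weights estimated on T pre-treatment periods vs. the first T-h.

    Y_pre: (T, J+1) pre-treatment outcomes, treated unit in column 0.
    h:     number of periods dropped from the end of the window.
    """
    y0, Yc = Y_pre[:, 0], Y_pre[:, 1:]
    w_full  = sc_weights(y0, Yc)                # all T periods
    w_short = sc_weights(y0[:-h], Yc[:-h])      # first T - h periods
    return np.max(np.abs(w_full - w_short))     # largest change in any weight
```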

Article

For now, let’s assume there are \(J\) potential control groups that form the donor pool. The way I see it, the fundamental problem is one of prediction: how best to predict the counterfactual outcome given \(T\) historical observations on \(J\) potential control groups, when the right composition of the donor pool is unknown.

Can this be connected to a forecasting problem in which one needs to determine which variables to include on the right-hand side? The betas would then be the weights.

Regularization seems better suited to situations in which all but a few observed predictors have zero effects on the regression function, while dimension-reduction methods seem more appropriate when the predictors are highly collinear and possibly have a factor structure. A central question in predictive regression is the robustness of such estimates to the choice of predictors. Bai and Ng (2006b) find that targeted predictors generally yield better forecasts, but that the composition of the selected predictors changes with the forecast horizon.
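
To make the connection to regularization concrete, one could replace the adding-up constraint with an L1 penalty so that only a few control units receive non-zero weight. A minimal sketch using scikit-learn's Lasso; this is purely an illustration of the variable-selection analogy, not part of the SC method, and alpha is a tuning parameter introduced only for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_control_weights(y_treated, Y_controls, alpha=0.1):
    """Select a sparse set of control units via an L1-penalized regression.

    Larger alpha drops more controls; the surviving non-zero coefficients
    play the role of the weights w_j (non-negative, but not summing to one).
    """
    model = Lasso(alpha=alpha, positive=True, fit_intercept=False)
    model.fit(Y_controls, y_treated)
    w = model.coef_
    kept = np.flatnonzero(w)                    # controls with non-zero weight
    return w, kept
```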