# Simulated data used throughout this chapter
# (see the Synthetic Synthetic Control Data section below).
set.seed(123456789)
N <- 200 # individuals
T <- 35 # time periods
K <- 3 # factors
sig <- 0.1 # noise
F <- matrix(3*runif(T*K), nrow=T) # factors
Lambda <- matrix(exp(rnorm(N*K)), nrow=K) # weights are positive
Lambda <- t(t(Lambda)/rowSums(t(Lambda))) # sum weights to 1
E <- matrix(rnorm(T*N, mean=0, sd=sig), nrow=T) # unobserved
Ybar <- F%*%Lambda # average outcome
Ys <- Ybar + E # actual outcome
Synthetic Controls
Introduction
By observing the same individual with different levels of exposure to a policy, we can learn how the policy affects that individual. However, there is a problem: “time” is also affecting the individual. Chapter 10 showed how panel data can be used to “control for time” using fixed effects.
In the standard fixed effects model, it is assumed that time has the same effect on all individuals. In particular, time has the same effect for both treated and non-treated groups (or control group). This assumption is sometimes called parallel trends. Abadie, Diamond, and Hainmueller (2010) argue that this assumption can be weakened. The authors argue for choosing the control group more carefully. The authors suggest using the pre-treatment period to estimate the relationship between members of the treated group and members of the control group. Instead of placing equal weights on every member of the control group, members of the treated group should place more weight on members of the control group that are similar.
This chapter introduces the synthetic control estimator. It lays out the general model, discusses the identification problem, and presents the solution provided by Abadie, Diamond, and Hainmueller (2010). The chapter then presents two alternative approaches, the LASSO and factor models. The estimators are used to determine the effect of the federal minimum wage increase analyzed in Chapter 10.
Beyond “Parallel Trends”
The fixed effects model presented in Chapter 10 makes the very strong parallel trends assumption. The impact of time is identical for everyone. This does not seem reasonable. We would like to estimate a model that allows the impact of time to vary, in particular to vary across treated and non-treated groups.
A General Fixed Effects Model
One idea is to allow the fixed effects model to be more general. Below we make two changes to the model. \[ y_{it}(x_{it}) = \alpha_i + \beta_{it} x_{it} + \gamma_{t} + \delta_{it} + \epsilon_{it} \tag{1}\]
where, as before, \(\alpha_i\) is the individual fixed effect, \(\gamma_t\) is the time-effect and \(\beta_{it}\) is the treatment effect. The difference now is that there is an additional time effect, \(\delta_{it}\) that varies over time and across individuals. Similarly, \(\beta_{it}\) measures the treatment effect and may vary across individuals and over time.
These two relatively simple changes to the model have a fundamental impact on our ability to estimate the treatment effect. Under this model, the estimate of the individual treatment effect is given by the following equation.
\[ y_{i1}(1) - \hat{y}_{i1}(0) = \beta_{i1} + \delta_{i1} - \delta_{i0} + \epsilon_{i1} \tag{2}\]
Note that we simplified things and assumed we can accurately estimate the individual and time fixed effects. The value of interest, \(\beta_{i1}\) is confounded by the individual time varying changes represented by \(\delta_{i1} - \delta_{i0}\). Because this time varying effect is individual specific, there is no obvious method for estimating it.
A Slightly Less General Model
Unfortunately, the generalization of the fixed effects model made above means that we can no longer estimate the model. At least we cannot estimate the policy variable of interest. Consider the following slightly less general model.
\[ y_{it}(x_{it}) = \beta_{it} x_{it} + f_{1t} \lambda_{1i} + f_{2t} \lambda_{2i} + \epsilon_{it} \tag{3}\]
where \(f_{kt}\) represents some “factor” that determines the outcome of interest and \(\lambda_{ki}\) is the weight that individual \(i\) places on the factor. If we are considering hours worked by individuals in the NLSY97, these factors may account for employment in the region where they live or the industry they work in. In Equation 3 there are two factors. These factors affect the hours worked for everyone in the data, but each individual may be affected more or less by each factor. For example, someone who lives in Texas may be more affected by a large fall in the price of petroleum than someone in Oregon.
Synthetic Synthetic Control Data
The synthetic data is generated according to a factor model.1 All simulated individuals have their outcome determined by three factor values. However, individuals weight these factor values differently. There are 35 time periods and 200 individuals.
Note there is no policy effect in the simulated data. Each individual’s outcome is determined by their individual weights on three factors, plus an unobserved term.
Figure 1 presents a plot of outcomes for the 200 individuals in periods 1 and 2. Each individual in the data places a weight on the three factor values, which are represented by the triangles. The implication is that while each individual is unique, there are other individuals in the data that have similar outcomes because they weight the factor values in a similar way.
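A plot along the lines of Figure 1 can be reproduced from the simulated objects defined at the start of the chapter. A minimal sketch in base R (the plotting choices here are mine, not the original figure's):

plot(Ys[1,], Ys[2,], col="gray",
     xlab="Outcome in period 1", ylab="Outcome in period 2")
points(F[1,], F[2,], pch=17, cex=1.5) # the three factor values as triangles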
Constructing Synthetic Controls with OLS
The goal is to use the non-treated individuals to construct the “time effect” for the treated individuals. As in Chapter 10, once we have accounted for the time effect, we can back out the treatment effect. But how do we construct this synthetic control?
Perhaps the most straightforward way is to use OLS. We discussed in Chapter 1 that OLS captures the correlation between variables. Let’s say we want to construct a “synthetic” version of individual 1 using data from individuals 2 to 5. We can think of the outcomes for individual 1 as the variable we want to explain, while the outcomes for individuals 2 to 5 are the explanatory variables.
\[ y_{1t} = \omega_0 + \omega_2 y_{2t} + \omega_3 y_{3t} + \omega_4 y_{4t} + \omega_5 y_{5t} + \epsilon_{1t} \tag{4}\]
Equation 4 assumes that the observed outcome for individual 1 in period \(t\) is a weighted average of the outcomes for the other four individuals in the same time period.
We can then write this out in matrix form.
\[ \mathbf{y}_1 = \mathbf{Y}_{-1} \omega + \epsilon \tag{5}\]
where \(\mathbf{y}_1\) is the vector of outcomes for individual 1 across all the time periods, \(\mathbf{Y}_{-1}\) is the matrix of outcomes across all the time periods for every individual other than individual 1, and \(\omega\) is the vector of weights that individual 1 places on those other individuals.
Following the same matrix algebra we used in Chapters 1 and 2, OLS determines the weights we should use for our synthetic control.
\[ \hat{\omega} = (\mathbf{Y}_{-1}' \mathbf{Y}_{-1})^{-1} \mathbf{Y}_{-1}' \mathbf{y}_{1} \tag{6}\]
With these estimated weights the synthetic individual 1’s outcome in some future period is a weighted average of the outcomes for the other synthetic individuals.
\[ \hat{y}_{1(t+s)} = \mathbf{Y}_{-1(t+s)}' \hat{\omega} \tag{7}\]
OLS Weights in R
We can use the first 30 time periods in the synthetic data constructed above to estimate the synthetic control for individual 1. We can use the last 5 time periods to compare our synthetic prediction to the actual outcomes.
lm1 <- lm(Ys[1:30,1] ~ Ys[1:30,2:5])
omega_hat <- lm1$coefficients
y1_hat_ols <- cbind(1,Ys[31:35,2:5])%*%omega_hat
Figure 6 plots the actual outcomes for individual 1 against the predictions from the synthetic control in periods 31 to 35. The 45 degree line represents perfect prediction. The chart shows that our OLS-based synthetic control tracks the actual outcomes pretty well.
The issue with this method is that our choice of the next four individuals is completely arbitrary. Perhaps we can construct a more accurate control by using more individuals.
A “Wide” Data Problem
To understand the issue, consider what happens when we use up to 28 of the other simulated individuals. The loop below calculates the root mean squared error as individuals are added to the regression one at a time.
tab_res <- matrix(NA,28,2)
for (i in 2:29) {
  lm1 <- lm(Ys[1:30,1] ~ Ys[1:30,2:i])
  omega_hat <- lm1$coefficients
  res <- Ys[31:35,1] - cbind(1,Ys[31:35,2:i])%*%omega_hat
  tab_res[i-1,1] <- i
  tab_res[i-1,2] <- mean(res^2)^(0.5)
}
# creates a table of the rmse for different numbers of
# explanatory variables.
Figure 2 presents a measure of the predictive ability of the synthetic control: the root mean squared error of the difference between the actual and predicted outcomes. The chart suggests that the prediction is optimized when 5 to 10 other individuals are used. The prediction gets slowly worse as more individuals are added. Why would that be? Then, at around 25, the prediction starts to go off the rails. Why is that?
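A Figure-2-style chart can be drawn from the tab_res table created above; a minimal sketch (the axis labels are mine):

plot(tab_res[,1]-1, tab_res[,2], type="b",
     xlab="Number of other individuals used", ylab="RMSE")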
We are running into a problem brought up in Chapter 2: collinearity and multicollinearity. Try running the regression that is commented out below. It should produce some sort of error, probably something related to not being able to find the inverse of a matrix. There is a collinearity problem when the number of “explanatory variables” (simulated individual outcomes) equals the number of “observations” (time periods). As we add more individuals, the correlation between individuals in the data increases to the point where the regression starts producing garbage and then finally stops working altogether.
y1 <- Ys[1:30,1]
Y_n1 <- Ys[1:30,-1]
#omega_hat <- solve(t(Y_n1)%*%Y_n1)%*%t(Y_n1)%*%y1
The problem of how to choose between potentially collinear explanatory variables is called a “wide” data problem. Here “wide” refers to the fact that our data matrix is very wide relative to its length (the number of observations, or time periods). In our simulated data we have 199 explanatory variables and 30 observations. The rest of the chapter discusses three solutions to the wide data problem: the Abadie, Diamond, and Hainmueller (2010) estimator, the LASSO, and convex factor models.
Abadie Estimator
Abadie, Diamond, and Hainmueller (2010) provide one popular solution to the wide data problem presented above. Their idea is to restrict the weights. In particular, the authors restrict the weights to be positive and sum to one. The treated unit is assumed to be a convex combination of the non-treated units.
Restricting Weights
Abadie, Diamond, and Hainmueller (2010) present an OLS estimator like the one presented in the previous section. However, the weights are restricted such that each treated individual is a convex combination of the non-treated individuals. We assume the outcomes of individual 1 are a convex combination of the outcomes for the other 199 simulated individuals.
Figure 3 shows that individual 1 sits inside the cloud of simulated outcomes for period 1 and period 2. This means that it is possible to represent individual 1’s outcomes as a convex combination of the outcomes of the other simulated individuals.
The authors assume that \(\omega_{0} = 0\), \(\omega_j \ge 0\) and \(\sum_{j=2}^{N} \omega_{j} = 1\). That is, the constant term in the OLS is zero and the remaining weights are positive and add to 1.
The Abadie, Diamond, and Hainmueller (2010) estimator solves the following constrained least squares problem.
\[ \begin{array}{l} \min_{\omega} \sum_{t=1}^{T_0} (y_{1t} - \sum_{j=2}^N \omega_{j} y_{jt})^2\\ \\ s.t. \; \sum_{j=2}^N \omega_{j} = 1\\ \omega_{j} \ge 0 \mbox{ for all } j \in \{2,...,N\} \end{array} \tag{8}\]
The authors suggest that the restriction can be justified by assuming a factor model is generating the observed data. Note however, that there is no relationship between the assumptions presented here and the assumptions on the factor model presented below.
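In the code that follows, the constraints in Equation 8 are imposed by re-parameterizing the weights with a “softmax” transformation of an unconstrained parameter vector. A quick sketch of the idea (the names par0 and omega0 are mine):

par0 <- rnorm(5)                   # unconstrained parameters
omega0 <- exp(par0)/sum(exp(par0)) # positive and sum to 1 by construction
c(sum(omega0), min(omega0) > 0)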
Synthetic Control Estimator in R
Equation 8 provides the pseudo-code for the estimator in R.
f_sc <- function(par,yt,ynt) {
  yt <- as.matrix(yt)
  ynt <- as.matrix(ynt)
  omega <- exp(par)/(sum(exp(par)))
  # the restriction on the weights.
  sos <- mean((yt - ynt%*%omega)^2)
  # sum of squares
  return(sos)
}
J <- dim(Y_n1)[2]
a <- optim(par=rnorm(J), fn=f_sc, yt=y1, ynt=Y_n1,
           control = list(trace=0,maxit=1000000))
omega_hat <- exp(a$par)/sum(exp(a$par))
y1_hat_sc <- Ys[31:35,-1]%*%omega_hat
Figure 6 shows that the Abadie, Diamond, and Hainmueller (2010) estimator is able to predict pretty well. Like the OLS estimator presented above, the prediction/actual pairs lie along the 45 degree line. The big difference is that the Abadie, Diamond, and Hainmueller (2010) estimator does not need to be restricted to a handful of control individuals, at least not in this case.2
Regularization
The wide data problem is difficult for standard estimators, but it is the type of problem machine learning estimators excel at. This section presents a popular machine learning estimator called the LASSO. This is a regularized estimator. This means that some “cost” is added to the least squares problem. There is an explicit cost or penalty to adding an additional explanatory variable.
Turducken Estimation
A turducken is a dish in which a chicken is stuffed inside of a duck, which is then stuffed inside a turkey. Many machine learning algorithms have this property of estimation within an estimation. The “chicken” layer is the standard estimation problem in which the tuning parameters are taken as given and not estimated. The chicken layer uses some standard criterion such as least squares or maximum likelihood. The “duck” layer takes the results from the chicken layer and determines the appropriate tuning parameter values. The duck layer uses a measure like “root mean squared error” of the model on the test data to evaluate the tuning parameters. There may even be a turkey layer in which different machine learning models are evaluated. This layer may add a criterion such as speed of estimation or accuracy on new test data withheld from the original problem. Here we limit things to “ducken” estimation.
LASSO
The Stanford statistician, Robert Tibshirani, introduced the idea of a regularized OLS estimator he called LASSO for least absolute shrinkage and selection operator (Tibshirani 1996). In the LASSO, the parameters are regularized through a penalty term which is the sum of their absolute values.3 The problem involves solving a constrained optimization problem.
The chicken layer is a standard least squares estimator. Note that the chicken layer is estimated on a subset of the pre-treatment time periods (the first \(T_0 - s\)). This is the training data. The remaining \(s\) time periods are held out and not used in the estimation. The held-out data is the testing data.
\[ \omega_{\theta} = \arg \min_{\omega} \sum_{t=1}^{T_0 - s} \left(y_{1t} - \omega_{0} - \sum_{j=2}^N \omega_{j} y_{jt}\right)^2 + \theta \sum_{j \neq 1} |\omega_{j}| \tag{9}\]
where \(\theta\) is described as a tuning parameter. The higher the value of \(\theta\) the more pressure to reduce the sum of the absolute values of the parameters. This estimator will tend to set “unimportant” weights to zero. It tends to set non-identified weights to zero and also set small but identified weights to zero.4
The duck layer uses the held out sample and chooses \(\theta\) to minimize root mean squared error.
\[ \min_{\theta} \left(\frac{1}{s}\sum_{t=T_0 - s + 1}^{T_0} \left(y_{1t} - \omega_{0 \theta} - \sum_{j=2}^N \omega_{j \theta} y_{jt}\right)^2\right)^{0.5} \tag{10}\]
where \(\omega_{\theta}\) is the vector of weights that is the solution to Equation 9.
One argument for using root mean squared error as the criterion is that it provides a trade-off between two things we may care about in our prediction. It weights both the bias, that is, the extent to which our estimate is wrong in expectation, and the variance, that is, the extent to which our estimate is wrong in fact.
LASSO in R
Using Equation 9 as pseudo-code, we have a simple regularized OLS estimator. The estimator defaults to \(\theta = 0\), but this is changed in the procedure below. Note that the code uses matrix notation because it is a bit more compact.
f_reg <- function(omega,y_t,Y_nt,th1 = 0) {
  y_t <- as.matrix(y_t)
  Y_nt <- as.matrix(Y_nt)
  sos <- (y_t - cbind(1,Y_nt)%*%omega)^2 + th1*sum(abs(omega))
  return(mean(sos, na.rm = TRUE))
}
The code splits the data into a training data set, which is the first 25 periods, and a test data set which is the next 5 periods. We will compare predictions using the last 5 periods.
y1_r <- y1[1:25] # training data
y1_t <- y1[26:30] # test data
Y_n1_r <- Y_n1[1:25,]
Y_n1_t <- Y_n1[26:30,]
The following loop goes through the set of tuning parameters and finds the one that minimizes the root mean squared error.
f_lasso <- function(y_tr,y_tt,Y_ntr,Y_ntt,K=3) {
  y_tr <- as.matrix(y_tr)
  y_tt <- as.matrix(y_tt)
  Y_ntr <- as.matrix(Y_ntr)
  Y_ntt <- as.matrix(Y_ntt)
  set.seed(123456789)
  theta_values <- c(1:K)-1
  J <- dim(Y_ntr)[2]
  omega_th <- NULL
  rmse_old <- 10000 # some large number.
  for (k in 1:K) {
    ai <- optim(par=rep(0,J+1), fn=f_reg, y_t=y_tr,
                Y_nt=Y_ntr, th1 = theta_values[k],
                control = list(maxit=10000))
    omega <- ai$par
    rmse_new <- mean((y_tt - cbind(1,Y_ntt)%*%omega)^2)^(0.5)
    if (rmse_new < rmse_old) {
      omega_th <- omega
      rmse_old <- rmse_new
      #print(rmse_new)
    }
  }
  return(omega_th)
}

omega_hat <- f_lasso(y1_r,y1_t,Y_n1_r,Y_n1_t)
y1_hat_lasso <- cbind(1,Ys[31:35,-1])%*%omega_hat
Testing the LASSO on the synthetic data shows that the estimator does okay. Figure 6 shows that in this example it may have the best predictions of the models tested so far.
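As a cross-check that is not part of the original text, the glmnet package provides a standard LASSO implementation with cross-validated choice of the penalty. A minimal sketch, assuming glmnet is installed; the results need not match the hand-rolled estimator above exactly:

library(glmnet)
cv1 <- cv.glmnet(x = Y_n1[1:25,], y = y1[1:25], alpha = 1, nfolds = 5)
y1_hat_glmnet <- predict(cv1, newx = Ys[31:35,-1], s = "lambda.min")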
Factor Models
We started off the chapter by saying that we could make the fixed effects model more general by adding a factor model. We call this an interactive fixed effects model (Bai 2009). However, we then presented three other methods for estimating the synthetic control weights. We seemed to forget that we had a factor model.
Why didn’t we simply estimate the factor model? The answer is easy. It is hard.
As a reminder, the problem with estimating the synthetic control is that we don’t have enough observations relative to the number of potential variables. This is called a “wide data” problem. Factor models have been suggested as a solution to this problem for close to 100 years.
Factor models date back to at least Hotelling (1933), although he preferred the term “component models.” Hotelling didn’t like the fact that he was using matrix factorization to factor a factor model, the result of which is that today economists use “principal components to factor factor models.” I guess we didn’t read the memo the whole way through.
Hotelling’s contribution was to formalize a method in use for solving a wide-data problem in psychometrics. Researchers had individual subjects take a battery of tests, so they had data with a small number of individuals and a large number of test results.
Variations on the model presented below are popular in machine learning under the names “topic models” or “non-negative matrix factorization.” In machine learning these models are categorized as “unsupervised learning” models, whereas the LASSO is a “supervised learning” model (Blei 2012; Huang 2014).
So why is it hard? Hotelling (1933) pointed out that the matrix factorization problem at the heart of the approach is not identified. There are many non-trivial ways to factor one matrix into two matrices. You may remember that you can factor the number 12 into a number of different factor pairs: 1 and 12, 2 and 6, and 3 and 4. It is similar for matrix factorization. There are many pairs of matrices that can be multiplied together to get the same observed matrix. In trying to solve one identification problem we have run headlong into a new (actually very old) identification problem.
Fu et al. (2016) suggests a solution to the new identification problem. The paper suggests restricting the factor weights to be positive and sum to 1. This is called a convex matrix factorization problem. Fu et al. (2016) and Adams (2018) suggest that this restriction significantly reduces the identification problem. Under certain restrictions, it may even provide a unique solution to the matrix factorization problem.
Matrix Factorization
Suppose we observe some matrix \(\mathbf{Y}\) that is \(T \times N\). This may represent a panel data set like the NLSY97, with each cell representing hours worked for an individual in a particular year. The question is whether we can use math to split this data into a part that varies over time and a part that varies with the individuals, a more sophisticated version of what we did with the fixed effects model.
The simplest case is where there is no error term. This is called an **exact factor model**.
\[ \mathbf{Y} = \mathbf{F} \mathbf{\Lambda} \tag{11}\]
where \(\mathbf{F}\) is a \(T \times K\) matrix and \(\mathbf{\Lambda}\) is a \(K \times N\) matrix. In this model, \(K\) represents the number of factors, \(\mathbf{F}\) is the time-varying factor values and \(\mathbf{\Lambda}\) is the set of factor weights that are unique to each individual.
One way to show that it is *not* possible to find a unique solution to Equation 11 is to insert a square \(K \times K\) matrix, \(\mathbf{A}\),
\[ \mathbf{Y} = \mathbf{F} \mathbf{A}^{-1} \mathbf{A} \mathbf{\Lambda} \tag{12}\]
where the equation holds for any full-rank matrix \(\mathbf{A}\). The implication is that many different factor values and factor weights are consistent with the observed data.
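A quick numerical illustration of Equation 12, using the simulated \(\mathbf{F}\) and \(\mathbf{\Lambda}\) from the start of the chapter (this check is mine, not the original text's):

A <- matrix(rnorm(9), 3, 3)  # (almost surely) full-rank 3 x 3 matrix
F_alt <- F %*% solve(A)
Lambda_alt <- A %*% Lambda
max(abs(F %*% Lambda - F_alt %*% Lambda_alt)) # essentially zero

The rotated pair reproduces exactly the same outcome matrix even though the factor values and weights are completely different.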
Convex Matrix Factorization
Are factor models useless?
Possibly. One solution is to assume that the weighting matrix is convex. That is, assume that all the values in the matrix are positive and rows sum to 1. Each individual in the data places weights on the factor values that are like probabilities. Note there is no compelling reason to make this restriction other than it is intuitive.
Assumption 1 (Convexity). \(\sum_{k=1}^K \lambda_{ik} = 1\) and \(\lambda_{ik} \in [0, 1]\) for all \(k \in \{1, ..., K\}\) and \(i \in \{1, ..., N\}\).
Note that Assumption 1 is different from the convexity assumption proposed by Abadie, Diamond, and Hainmueller (2010). Here it is that each individual’s weights on the factors sum to 1. In Abadie, Diamond, and Hainmueller (2010) it is that the treated individual places convex weights on the other individuals in the data. One does not imply the other.
The result of this restriction is that we can now uniquely find the matrix factors. Moreover, we may be able to determine exactly how many factors there are. The proof is a picture.
Figure 4 presents the outcomes (Ys) for the simulated data presented at the start of the chapter in two of the time periods. The black dots represent the results from an exact factor model; in the simulation this is \(\bar{Y}\). Notice anything? We have plotted out a triangle. If we are able to observe the average outcomes, then we have done more than plot out a triangle; we have identified the factor vectors, which lie at the points of the triangle. We have moved from an intractable problem first pointed out by Hotelling nearly 100 years ago to a problem solved by plotting the observed outcomes for two time periods.
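A Figure-4-style picture can be drawn from the simulated objects; a minimal sketch (the plotting choices are mine):

plot(Ys[1,], Ys[2,], col="gray",
     xlab="Outcome in period 1", ylab="Outcome in period 2")
points(Ybar[1,], Ybar[2,], pch=16)    # exact factor model outcomes
points(F[1,], F[2,], pch=17, cex=1.5) # factor values at the vertices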
Unfortunately, in our problem things are not as simple. Our data is represented by the following approximate matrix factorization problem.
\[ \mathbf{Y} = \mathbf{F} \mathbf{\Lambda} + \mathbf{E} \tag{13}\]
In the figure, the observed outcomes are represented by the gray circles. These make the triangle more rounded and make it difficult to factorize the matrix. Adams (2018) considers a restriction that there exists some independence of unobservables across time. Given this restriction, the paper shows that a mixture model can be used to identify the model.5
Synthetic Controls using Factors
If we can do the matrix factorization, we can create our synthetic controls. To do this, first note that we can write individual 1’s outcomes as a weighted average of the factor values.
\[ \mathbf{y}_1 = \mathbf{F} \lambda_1 + \mathbf{\epsilon}_1 \tag{14}\]
In addition, we can write the potential controls as a different weighted average.
\[ \mathbf{Y}_{-1} = \mathbf{F} \mathbf{\Lambda}_{-1} + \mathbf{E}_{-1} \tag{15}\]
where \(\mathbf{Y}_{-1}\) represents the outcomes for all the individuals that are not individual 1.
Note that both equations have the same matrix of factor values. This means that we can substitute the second into the first. However, first we need to solve for \(\mathbf{F}\). To illustrate, we will drop the unobserved term \(\mathbf{E}\). Because the weighting matrix \(\mathbf{\Lambda}_{-1}\) is not square, we use the same matrix “division” idea presented in Chapter 1.6
\[ \mathbf{F} = \mathbf{Y}_{-1} \mathbf{\Lambda}_{-1}' (\mathbf{\Lambda}_{-1} \mathbf{\Lambda}_{-1}')^{-1} \tag{16}\]
Substituting into Equation 14.
\[ \hat{\mathbf{y}}_1 = \mathbf{Y}_{-1} \mathbf{\Lambda}_{-1}' (\mathbf{\Lambda}_{-1} \mathbf{\Lambda}_{-1}')^{-1} \lambda_1 \tag{17}\]
Thus, we can write the outcomes of individual 1 as a weighted average of the outcomes of the other individuals. This is our synthetic control.
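Equation 16 can be checked numerically on the exact (noise-free) part of the simulated data, where \(\bar{\mathbf{Y}} = \mathbf{F} \mathbf{\Lambda}\) holds exactly. A small sketch (this check is mine):

F_check <- Ybar[,-1] %*% t(Lambda[,-1]) %*% solve(Lambda[,-1] %*% t(Lambda[,-1]))
max(abs(F_check - F)) # essentially zero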
Estimating the Weights
Equation 17 shows that in order to create our synthetic control we need to estimate the factor weight matrix (\(\mathbf{\Lambda}\)). Note that the matrix \(\mathbf{\Lambda}_{-1}\) and the vector \(\lambda_1\) are the constituent columns of the original matrix.
To make things doable we will assume that the unobserved characteristic in Equation 3 is normally distributed, \(\epsilon_{it} \sim \mathcal{N}(0, \sigma^2)\). Given this assumption we can use maximum likelihood to factorize the matrix. We will do that in two steps, presented here in reverse order.
Step 2. Find \(\hat{\mathbf{\Lambda}}\). The second of the two steps is to determine the factor weights after we determine the factor value matrix (\(\hat{\mathbf{F}}\)) and the standard deviation of the unobserved characteristic (\(\hat{\sigma}\)).
\[ \max_{\mathbf{\lambda}_i} \sum_{t=1}^{T} \log \left (\phi \left(\frac{y_{it} - \hat{\mathbf{F}}_t' \mathbf{\lambda}_i}{\hat{\sigma}} \right) \right) \tag{18}\]
Similar to the Abadie, Diamond, and Hainmueller (2010) estimator, this is calculated for each individual \(i\).
Step 1. Solve for \(\hat{\mathbf{F}}\) and \(\hat{\sigma}\). The first step also uses maximum-likelihood. However, things are more complicated because we don’t know the weights. It is assumed these are distributed uniformly.
\[ \max_{\{\mathbf{F}_t,\sigma\}} \sum_{t=1}^{T} \sum_{i=1}^N \log \left(\int_{\mathbf{l} \in \mathcal{L}} \phi\left(\frac{y_{it} - \mathbf{F}_t' l}{\sigma} \right) d \mathbf{l} \right ) - N T \log(\sigma) \tag{19}\]
where \(\mathcal{L} = [0, 1]^K\).
Convex Factor Model Estimator in R
Step 1. Estimate the factor value matrix, \(\mathbf{F}\), and the standard deviation of the error term, \(\sigma\).
f_cmf <- function(par,Y,K,Reps=10) {
  eps <- 1e-20
  set.seed(123456789)
  Y <- as.matrix(Y)
  T <- dim(Y)[1]
  N <- dim(Y)[2]
  sig <- exp(par[1])
  F <- matrix(exp(par[2:(K*T+1)]),nrow=T)
  p <- matrix(0,T,N)
  # Monte Carlo approximation of the integral in Equation 19.
  for (r in 1:Reps) {
    L <- matrix(runif(N*K),nrow=K)
    L <- as.matrix(t(t(L)/rowSums(t(L))))
    p <- p + dnorm((Y - F%*%L)/sig)
  }
  p <- p/Reps
  # small number added because of logging in the next step.
  log_lik <- -(mean(log(p+eps)) - log(sig))
  return(log_lik)
}
We can test the estimator on the synthetic data described in the introduction. Note we will cheat and set the initial parameter values for the standard deviation equal to its true value. Again it is assumed we know the true number of factors. Note that this is estimated on just the first five time periods.
part <- c(log(.1),rnorm(15,mean=0,sd=1))
a1 <- optim(par=part,fn=f_cmf,Y=Ys[1:5,],K=3,
            control = list(trace = 0,maxit=1000000))
sighat <- exp(a1$par[1])
sighat

[1] 0.2138843

Fhat <- matrix(exp(a1$par[2:length(a1$par)]), nrow=5)
The estimated value for the standard deviation of the error is not too far from its true value of 0.1. Table 1 presents the estimated factor values for the synthetic data. The true factor values lie between 0 and 3, so these estimates are a little off.
Factor 1 | Factor 2 | Factor 3 |
3.595568 | 0.1732047 | 0.0102529 |
2.133545 | 1.9249020 | 2.1255772 |
1.544619 | 0.9729736 | 1.7110443 |
3.550781 | 0.1267980 | 0.6791404 |
2.837622 | 1.9093541 | 1.8518186 |
Step 2. Estimate the weighting matrix, \(\mathbf{\Lambda}\).
f_lambda <- function(par, Y, F, sig) {
  eps <- 1e-20
  Y <- as.matrix(Y)
  F <- as.matrix(F)
  L <- as.matrix(exp(par)/sum(exp(par)))
  log_lik <- -sum(log(dnorm((Y - F%*%L)/sig) + eps))
  return(log_lik)
}
Lhat <- matrix(NA,3,200)
for (i in 1:200) {
  parti <- rnorm(3,mean=0,sd=20)
  ai <- optim(par=parti, fn=f_lambda, Y=Ys[1:5,i],
              F=Fhat, sig=sighat)
  Lhat[,i] <- exp(ai$par)/sum(exp(ai$par))
  #print(i)
}
Figure 5 presents a plot of the factor weights. The true values are more spread out than the estimates presented here.
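A comparison along the lines of Figure 5 can be sketched by plotting two of the three weights for each individual, true values against estimates (the exact axes in the original figure may differ):

plot(Lambda[1,], Lambda[2,], col="gray",
     xlab="Weight on factor 1", ylab="Weight on factor 2")
points(Lhat[1,], Lhat[2,], pch=16) # estimated weights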
Step 3. Once the factor model is estimated, the component parts can be used to determine the counter-factual outcome and the treatment effect.
Ln1 <- Lhat[,2:200]
L1 <- Lhat[,1]
Yn1 <- Ys[31:35,2:200]
y1_hat_fm <- Yn1%*%t(Ln1)%*%solve(Ln1%*%t(Ln1))%*%L1
Again, we can compare the predicted values to the actual outcomes for individual 1. Figure 6 presents the actual and predicted outcomes with perfect prediction on the 45 degree line. Perhaps unsurprisingly, the factor model seems to do the best. It certainly had the advantage that the data actually came from a factor model.
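A Figure-6-style comparison of all four predictions against the actual outcomes can be sketched as follows (the plotting symbols and legend labels are mine):

y1_act <- Ys[31:35,1]
plot(y1_act, y1_hat_ols, pch=1,
     xlab="Actual outcome", ylab="Predicted outcome")
points(y1_act, y1_hat_sc, pch=2)
points(y1_act, y1_hat_lasso, pch=3)
points(y1_act, y1_hat_fm, pch=4)
abline(0, 1, lty=2) # 45 degree line (perfect prediction)
legend("topleft", pch=1:4,
       legend=c("OLS","Abadie et al.","LASSO","Factor model"))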
Returning to Minimum Wage Effects
In Chapter 10 we used a standard fixed effects model to analyze the impact of the 2007-2009 increase in the federal minimum wage on hours worked. The analysis follows the approach of Currie and Fallick (1996) but uses data from NLSY97.
NLSY97 Data
We can now return to the problem of estimating the effect of the federal minimum wage increase. In order to reduce the computational burden, the size of the treated and non-treated groups is substantially restricted.
# using the same data as Chapter 9
N2 <- 600 # restrict for computational reasons.
T2 <- 18
Y2 <- Y[,1:N2]
W2 <- W[,1:N2]
rate_07 <- W2[11,]/Y2[11,]
# calculates the wage rate for each person.
treat1 <- ifelse(rate_07<7.26 | W2[11,]==0 | Y2[11,]==0,1,0)
treat <- matrix(0,T2,N2)
for (i in 11:T2) {
  treat[i,] <- treat1[1:N2]
}
lm1 <- f_fe(Y2, treat)
# fixed effects regression.
Y2 <- matrix(NA,T2,N2)
Y2[-lm1$na.action] <- lm1$residuals
# na.action is an index of the NAs created in lm()
# remember we can index a matrix as if it is a vector.
Y_t <- Y2[,treat1==1]
Y_t <- Y_t[,is.na(colSums(Y_t))==0]
Y_t <- Y_t[,1:5] # restrict for computational reasons.
Y_nt <- Y2[,treat1==0]
Y_nt <- Y_nt[,is.na(colSums(Y_nt))==0]
# split into treated and non-treated groups
# keep only those with data observed in all time-periods.
Synthetic Control Estimates of the Minimum Wage Effect
N_t <- dim(Y_t)[2]
N_nt <- dim(Y_nt)[2]
mat_treat <- matrix(NA,8,N_t)
for (i in 1:N_t) {
  pari <- rnorm(N_nt)
  ai <- optim(par=pari, fn=f_sc, yt=Y_t[1:10,i],
              ynt=Y_nt[1:10,],
              control = list(trace=0,maxit=100000000))
  omega <- exp(ai$par)/sum(exp(ai$par))
  mat_treat[,i] <- Y_t[11:18,i] - Y_nt[11:18,]%*%omega
  #print(i)
}
mat_treat <- mat_treat + lm1$coefficients[2]
# add back in the results of the fixed effects.
plot_res_sc <- matrix(NA,8,3)
for (i in 1:8) {
  plot_res_sc[i,] <- quantile(mat_treat[i,],c(0.2,0.5,0.8),
                              na.rm = TRUE)
}
# quantiles of results at 0.2, 0.5, and 0.8.
The weights are estimated separately for each individual. Then the treatment effect is measured for each individual.
Figure 7 presents the distribution of the treatment effect for the five individuals analyzed. At least for these five people, the effect of the minimum wage increase is not large.
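A Figure-7-style chart can be sketched from the quantiles computed above (the labels are mine):

matplot(1:8, plot_res_sc, type="l", lty=c(2,1,2), col=1,
        xlab="Post-treatment period", ylab="Treatment effect (hours)")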
LASSO Estimates of the Minimum Wage Effect
The estimator goes through a number of different tuning parameters and then presents the results for the one that minimizes root mean squared error on the test data set.
Y_tr <- Y_t[1:8,] # training data
Y_tt <- Y_t[9:10,] # test data
Y_ntr <- Y_nt[1:8,]
Y_ntt <- Y_nt[9:10,]
treat3 <- matrix(NA,8,dim(Y_t)[2])
for (i in 1:dim(Y_t)[2]) {
  omega <- f_lasso(y_tr=Y_tr[,i],y_tt=Y_tt[,i],
                   Y_ntr=Y_ntr,Y_ntt=Y_ntt,K=10)
  treat3[,i] <- Y_t[11:18,i] - cbind(1,Y_nt[11:18,])%*%omega
  #print(i)
}
treat3 <- treat3 + lm1$coefficients[2]

plot_res_ls <- matrix(NA,8,3)
for (i in 1:8) {
  plot_res_ls[i,] <- quantile(treat3[i,],c(0.2,0.5,0.8),
                              na.rm = TRUE)
}
The results with this estimator are similar to those presented above. See Figure 7. You should check to see what happens when the set of tuning parameters is changed. The number of treated and non-treated individuals has also been restricted to reduce computational time.
Factor Model Estimates of the Minimum Wage Effect
The convex factor model approach is slightly different from that above.
Step 0. Set the data up. Note that the panel is aggregated up to half-decades. This is done for two reasons. First, it reduces the computational time for the problem. Second, it accounts for the possibility that there is auto-correlation in the annual data.7 Ruling out auto-correlation is important for the identification argument presented in Adams (2018). It also allows more factors to be observed.8 However, the estimator uses most of the individual observations. We don’t need to restrict it for computational reasons.
Y_1t <- colMeans(Y[1:5,treat0==1], na.rm = TRUE)
Y_2t <- colMeans(Y[6:10,treat0==1], na.rm = TRUE)
Y_3t <- colMeans(Y[11:15,treat0==1], na.rm = TRUE)
Y_1nt <- colMeans(Y[1:5,treat0==0], na.rm = TRUE)
Y_2nt <- colMeans(Y[6:10,treat0==0], na.rm = TRUE)
Y_3nt <- colMeans(Y[11:15,treat0==0], na.rm = TRUE)
Y_t <- rbind(Y_1t,Y_2t,Y_3t)
index_t_na <- is.na(colMeans(Y_t))==0
Y_t <- Y_t[,index_t_na]
Y_nt <- rbind(Y_1nt,Y_2nt,Y_3nt)
Y_nt <- Y_nt[,is.na(colMeans(Y_nt))==0]
Y_fm <- cbind(Y_t,Y_nt)
N_t <- dim(Y_t)[2]
N_nt <- dim(Y_nt)[2]
Step 1. Estimate \(\mathbf{F}\) and \(\sigma\) in the pre-treatment period. Note that we assume that there are three factors.
part <- c(log(400),rnorm(6,mean=0,sd=10))
a <- optim(par=part, fn=f_cmf,
           Y=Y_fm[1:2,], K=3, Reps=100,
           control = list(trace = 0,maxit=1000000))
# sigma
sighat <- exp(a$par[1])
sighat

[1] 411.6859

Fhat <- matrix(exp(a$par[2:length(a$par)]), nrow=2)
Table 2 shows that there is a fair amount of variation in the hours.9 The factors form the points of a triangle: roughly 0 hours in both time periods, 0 hours in period 1 and about 4,000 hours in period 2, and about 1,700 hours in period 1 and 0 hours in period 2.
Factor 1 | Factor 2 | Factor 3 |
0.1023438 | 1737.1381599 | 0.1048798 |
4024.4214607 | 0.0083537 | 0.0000003 |
Step 2. Determine the weights.
Lhat <- matrix(NA,3,dim(Y_fm)[2])
set.seed(123456789)
for (i in 1:dim(Y_fm)[2]) {
  ai <- optim(par=rnorm(3,mean=0,sd=1), fn=f_lambda,
              Y=Y_fm[1:2,i], F=Fhat, sig=sighat)
  Lhat[,i] <- exp(ai$par)/sum(exp(ai$par))
  #print(i)
}
Figure 8 shows that the weights are spread out, perhaps unsurprisingly given that there are over 5,000 individuals.
Lhat_t <- Lhat[,1:N_t]
Lhat_nt <- Lhat[,(N_t+1):dim(Lhat)[2]]
Omega <- t(Lhat_nt)%*%solve(Lhat_nt%*%t(Lhat_nt))%*%Lhat_t
itt <- Y_t[3,] - Y_nt[3,]%*%Omega
summary(as.vector(itt))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2314.8 -1263.9  -648.6  -529.5   124.8  3371.4
Figure 9 presents the individual treatment effect for the treated. It shows that a substantial proportion of individuals work fewer hours than they would have if they had not been treated by the increase in the federal minimum wage. That said, it is not clear that these people are worse off. In fact, the median treated individual had their average income increase by over $7,000 in the 5 years following the minimum wage increase.
W_2t <- colMeans(W[6:10,treat0==1], na.rm = TRUE)
W_2t <- W_2t[index_t_na]
W_3t <- colMeans(W[11:15,treat0==1], na.rm = TRUE)
W_3t <- W_3t[index_t_na]
inc <- W_3t - (Y_3t[index_t_na] + itt)*W_2t/Y_2t[index_t_na]
summary(as.vector(inc[is.na(inc)==0 & is.infinite(inc)==0]))

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 -4411593      1760      7630     55703     15793  53364583
Discussion and Further Reading
Panel data can be used to estimate the treatment effect by comparing outcomes before and after the treatment occurs. The problem is that time also affects the outcomes. In Chapter 10 this is solved by assuming that time has the same effect on the treated group as it does on the non-treated group. This chapter considers relaxing the assumption.
A synthetic control aims to construct a prediction of the counter-factual outcome. For the treated units, we observe their outcome after they have been treated, but not the outcome they would have had if they had not been treated. One way to construct the prediction is to use a weighted average of the outcomes for the non-treated units in the treated period. This is, in fact, what was done in Chapter 10. The difference here, is that pre-treatment period is used to find the appropriate weights for each treated unit.
The simplest and most obvious weights are the OLS weights. However, with only a few time periods and a large number of units, these weights are not identified. Abadie, Diamond, and Hainmueller (2010) suggest restricting the weights: the intercept is zero, and the weights are non-negative and add to 1.
An alternative approach to restricting the model is to use a machine learning tool such as regularization. The chapter goes through a popular algorithm called LASSO. This is a least squares algorithm with an extra constraint that restricts the sum of the absolute value of the coefficients. The LASSO model tends to set coefficient values to zero. This leads to biased estimates, but potentially better predictions.
The chapter also shows that instead of making restrictions justified by a particular factor model, we could simply use a factor model. Hotelling (1933) showed that the factor model is also not identified: the set of matrix factors consistent with the data is not unique. However, Fu et al. (2016) shows that by restricting the factor weights to be positive and add to 1 (convex), it is possible to find unique matrix factors. The chapter presents a maximum-likelihood algorithm for finding the synthetic controls based on the convex factor model.
An excellent review of the various synthetic control techniques is Doudchenko and Imbens (2016). Guido Imbens, Susan Athey and co-authors have some review papers on using machine learning techniques in economics including in panel data models (Athey et al. 2019).
References
Footnotes
In particular, it is generated using a convex factor model, which is discussed in more detail below.↩︎
Eventually, the estimator also runs into collinearity and multicollinearity problems (Doudchenko and Imbens 2016).↩︎
Another standard regularization procedure is to have a penalty equal to the square of the values. This is called a ridge regression.↩︎
From Chapter 1 we know that the standard OLS model gives unbiased estimates. If the model presented above is true, then this regularized OLS model must give biased estimates of the true weights. The importance of this is up for debate.↩︎
Mixture models are discussed in Chapter 11.↩︎
Remember that with matrices, order matters.↩︎
Auto-correlation refers to correlation in the unobserved characteristics over time.↩︎
Try re-doing without aggregation. The auto-correlation in the data will tend to imply that there are only two factors.↩︎
One issue with the estimator is that it will tend to run the estimate of the standard deviation down to zero if the starting value is too small.↩︎