Mixture Models

Author

Christopher P. Adams

Introduction

I am a colon cancer survivor. When I was diagnosed, I was told that I had about a 75% probability of surviving 5 years (that was ten years ago!). My high survival probability is due to a combination of surgery and chemotherapy. Prior to the introduction of chemotherapy, the 5-year survival rate for someone with my diagnosis was 50%. The original chemos increased the 5-year survival to 70% and then newer chemos increased it to 75%. Am I alive because I had chemo?

I don’t know.

The chemo may have done nothing. Prior to chemo, half the people with my diagnosis survived. I could have survived without chemo.

The example suggests that the treatment effect of chemotherapy is heterogeneous. For some, it has no effect, while for others it saves their life. There may even be a group of people whom chemo harms. Imagine if we could determine which group each patient is in. Many fewer people would be exposed to the horrors of chemotherapy. Mixture models may be the answer.

Mixture models have become an important part of the econometrician’s tool kit. However, in many cases they are not identified without strong and non-credible assumptions. Moreover, many researchers seem to be unaware that their results are relying on such non-credible assumptions.

This part of the book highlights how repeated measurement can be used to predict the effect of policies. This chapter considers more general cases than the panel data used in Chapters 10 and 11. The chapter uses the mixture model to solve the problem of measurement error. The chapter considers the problem of estimating returns to schooling using data from twins. It also returns to the question of whether an increase in the minimum wage will cause restaurants to decrease employment. The chapter highlights the great promise of mixture models, as well as the identification problems that arise.

Two-Type Mixture Models

Consider the simplest mixture problem. There are two hidden types which we will denote \(\theta_t \in \{1, 2\}\). The observed outcome (\(y_t\)) depends on which is the true type.

\[ y_{t} = \left \{\begin{array}{ll} x_{1t} & \mbox{ if } \theta_t = 1\\ x_{2t} & \mbox{ if } \theta_t = 2 \end{array} \right . \tag{1}\]

The density of \(y_t\) is the following mixture of two component densities.

\[ f(y_t) = \Pr(\theta_t = 1) h_1(x_{1t}) + \Pr(\theta_t = 2) h_2(x_{2t}) \tag{2}\]

The question is whether we can determine the component parts of the mixture by observing the outcomes (\(y_t\)).

In my example above, the hidden types may represent how a particular colon cancer patient is affected by chemotherapy. Can we uncover the true effects (the \(x_t\)’s) of the chemo by observing the aggregated outcomes, i.e. survival (\(y_t\))?

Simulation of Two-Type Mixture

The simulation is based on the model presented above. Note that there are two observed outcomes \(\{y_1, y_2\}\), both generated by an underlying set of hidden types. Also let \(\Pr(\theta_t = 1) = \alpha\).

set.seed(123456789)
require(mvtnorm)
Loading required package: mvtnorm
N <- 10000
mu <- c(0,10,-1,3)
Sig <- diag(1,4)*c(2,1,3,1) 
# diagonal matrix with different variances.
x <- rmvnorm(N,mean=mu,sigma=Sig)
alpha <- 0.4
ranN <- runif(N)
y1 <- ifelse(ranN < alpha,x[,1],x[,2])
y2 <- ifelse(ranN < alpha,x[,3],x[,4])

Looking at Figure 1 and the distribution of \(y_1\), the question is whether we can back out the distribution of the \(x\)’s from observing \(y_1\). The answer appears to be yes! The distribution is bi-modal with the two modes at 0 and 10, which are the means of the distributions for each of the types. How about for the second case, the distribution of \(y_2\)? Can you make out the \(h_1(x_1)\) and \(h_2(x_2)\) densities? Maybe if you squint.
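The plotting code for the figure is not shown; a minimal sketch that would draw similar density plots from the simulated data is the following.

plot(density(y1),main="",xlab="y",lty=1)
lines(density(y2),lty=2)
legend("topright",legend=c("y1","y2"),lty=c(1,2))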

Figure 1: Density plots of two mixture distributions.

Simply observing the outcome may not be enough to discern the underlying distributions. If it were, the chapter would be very short.

Knowing the Component Distributions

What if we knew the underlying type distributions? Could we back out the mixture weights? The answer is yes. To see this, consider the second mixture distribution broken into its component parts. In the simulated data we have the following relationship.

\[ f(y_2) = \alpha h_3(y_2) + (1 - \alpha) h_4(y_2) \tag{3}\]

where \(h_3(y_2) = \frac{1}{\sigma_3} \phi \left(\frac{y_2 - \mu_3}{\sigma_3} \right)\) and \(h_4(y_2) = \frac{1}{\sigma_4} \phi \left(\frac{y_2 - \mu_4}{\sigma_4}\right)\). Noting that \(\mu_3 = -1\), \(\sigma_3^2 = 3\), \(\mu_4 = 3\) and \(\sigma_4^2 = 1\), we can rewrite as follows.

\[ f(y_2) = \frac{\alpha}{3^{0.5}} \phi\left(\frac{y_2 + 1}{3^{0.5}}\right) + (1 - \alpha) \phi(y_2 - 3) \tag{4}\]

As we have a linear equation with one unknown (\(\alpha\)), we can just rearrange to get the answer.

\[ \alpha = \frac{f(y_2) - \phi(y_2 - 3)}{\frac{1}{3^{0.5}} \phi\left(\frac{y_2 + 1}{3^{0.5}}\right) - \phi(y_2 - 3)} \tag{5}\]

At the median, we can calculate the density of the mixture and use the formula presented in Equation 5. We get 0.43, compared to the true value of 0.4.

q_y2 <- quantile(y2,0.5)
a1 <- density(y2)$y[315]  # about the median
b1 <- dnorm(q_y2,mean=3)
c1 <- dnorm(q_y2,mean=-1,sd=3^(0.5))
(a1 - b1)/(c1 - b1)
     50% 
0.425599 

In some cases it is reasonable to assume the component distributions are known. In Appendix A, we analyze a statistical approach called empirical Bayesian estimation. This approach relies on the fact that we know the component distributions under random sampling.

Observing Multiple Signals

What if we don’t know the component distributions? In general, we are back to square one.

We can make progress if we observe more than one outcome being generated by the same process. Consider the simulated data generated above. I didn’t highlight it, but \(y_1\) and \(y_2\) are two signals of the same underlying process. In one state, \(y_1\) is determined by \(x_1\) and \(y_2\) is determined by \(x_3\); in the other state, \(y_1\) is determined by \(x_2\) and \(y_2\) is determined by \(x_4\).

We can back out the underlying mixtures by looking carefully at the distribution of one outcome conditional on the second outcome. To see this, write out the probability of observing a certain value of \(y_1\) given that \(y_2 = a\).

\[ \begin{array}{l} f(y_1 | y_2 = a) = \Pr(\theta=1 |y_2=a) f(y_1 | \theta=1, y_2=a)\\ + \Pr(\theta=2 |y_2=a) f(y_1 | \theta=2, y_2=a)\\ = \Pr(\theta=1 | y_2=a) f(y_1 | \theta=1) + \Pr(\theta=2 | y_2=a) f(y_1 | \theta=2)\\ = \Pr(\theta=1 | y_2=a) h_1(x_1) + \Pr(\theta=2 | y_2=a) h_2(x_2) \end{array} \tag{6}\]

The simulated data satisfies an important assumption. Conditional on knowing the state, the value of \(y_1\) is independent of the value of \(y_2\). This assumption allows us to move from the first line to the second line. In math, we have \(f(y_1 | \theta=1, y_2=a) = f(y_1 | \theta=1)\). Once we know \(\theta\), the value of \(y_2\) provides no additional information. This is a reasonable assumption if, as in the simulated data, the two observed outcomes are being generated by the same underlying process. They are repeated measurements.
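We can check this property in the simulated data: the two outcomes are strongly dependent unconditionally, but within each hidden type they are (approximately) uncorrelated.

cor(y1,y2)                               # unconditional: far from 0
cor(y1[ranN < alpha],y2[ranN < alpha])   # within type 1: near 0
cor(y1[ranN >= alpha],y2[ranN >= alpha]) # within type 2: near 0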

The conditional probability is a mixture over the two component distributions. Importantly, the mixture weights can vary: the value of \(y_2\) depends on the state, so the probability of the state changes conditional on the value of \(y_2\). Like above, we can write the density of \(y_2\) as a mixture of its component densities.

\[ f(y_2 = a) = \alpha h_3(a) + (1 - \alpha) h_4(a) \tag{7}\]

By Bayes’ rule, the conditional probability of the state is a function of objects that vary with \(y_2\).

\[ \Pr(\theta=1 | y_2 = a) = \frac{\alpha h_3(a)}{f(y_2 = a)} \tag{8}\]

If there is a case where \(\Pr(\theta = 1 | y_2 = a)=1\), then from Equation 6, we have \(f(y_1 | y_2 = a) = h_1(x_1)\). That is, we can infer \(h_1(x_1)\). Similarly, if there is a case where \(\Pr(\theta=1 | y_2=a) = 0\) then we can infer \(h_2(x_2)\). Once we know these values, we can use the argument above to determine the mixing probabilities.
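To see how such an extreme weight can arise, we can evaluate Equation 8 deep in the left tail of \(y_2\), say at \(a = -3\). Note that this check uses the known simulation components, which would not be available in practice.

a0 <- -3
h3_a <- dnorm(a0,mean=-1,sd=3^0.5) # type 1 component density at a0
h4_a <- dnorm(a0,mean=3)           # type 2 component density at a0
alpha*h3_a/(alpha*h3_a + (1 - alpha)*h4_a) # essentially 1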

To illustrate, the next loop creates a matrix of conditional probabilities for our two simulated outcomes (\(y_1\) and \(y_2\)). The loop breaks each outcome into a set of buckets; then for each bucket of \(y_2\) it calculates the probability that \(y_1\) falls into each of its buckets.

Fu et al. (2016) point out that if these probabilities are precisely estimated then they will lie on a line between the two true underlying states. Moreover, the observed end points will correspond to the two true states. Identifying the model is simply a matter of plotting the data.

min1 <- min(y1,y2)
max1 <- max(y1,y2)
K1 <- 10
K2 <- 7
eps <- 1e-5
del1 <- (max1 - min1)/K1
del2 <- (max1 - min1)/K2
prob_mat <- matrix(NA,K2+1,K1+1)
# conditional probability matrix
prob_mat_x <- matrix(NA,K2+1,3)
# actual hidden type conditional probability matrix
low1 <- min1 - eps
high1 <- low1 + eps + del1
for (k1 in 1:K1) {
  low2 <- min1 - eps
  high2 <- low2 + eps + del2
  prob_mat[1,1+k1] <- (low1 + high1)/2
  y1_c <- y1[y2 > low1 & y2 < high1] # conditioning
  for (k2 in 1:K2) {
    prob_mat[1+k2,1] <- (low2 + high2)/2
    prob_mat[1+k2,1+k1] <- mean(y1_c > low2 & y1_c < high2)
    # probability conditional on 2nd signal
    prob_mat_x[1+k2,2] <- mean(x[,1] > low2 & x[,1] < high2)
    prob_mat_x[1+k2,3] <- mean(x[,2] > low2 & x[,2] < high2)
    # probabilities conditional on actual hidden type.
    low2 <- high2 - eps
    high2 <- low2 + eps + del2
    #print(k2)
  }
  low1 <- high1 - eps
  high1 <- low1 + eps + del1
  #print(k1)
}
prob_mat_x[,1] <- prob_mat[,1]

Figure 2 plots the probabilities of two outcomes, one in the left tail and one in the right. These probabilities are denoted by the circles. Each probability is calculated given different values for \(y_2\). The triangles are the probabilities derived from the underlying mixture distributions. In general, we don’t observe the triangles, but the figure shows that the observed data enables us to determine where the triangles must lie.

Figure 2: Plot of the probabilities that \(y_1 = -1.15\) (x-axis) and \(y_1 = 12.06\) (y-axis). The circles are the observed probabilities conditional on \(y_2\) and the triangles are the unobserved probabilities conditional on the true types.

Figure 2 shows that we can use the result of Fu et al. (2016) to identify the mixture model. If the conditional probabilities are estimated accurately then they must lie on the line between the two true types. If the conditional probabilities vary enough then the extremes will correspond to the case where only one type is true.
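A sketch of how Figure 2 might be drawn from the matrices above (choosing the \(y_1\) buckets with midpoints closest to -1.15 and 12.06):

r1 <- which.min(abs(prob_mat_x[,1] - (-1.15)))
r2 <- which.min(abs(prob_mat_x[,1] - 12.06))
plot(prob_mat[r1,2:(K1+1)],prob_mat[r2,2:(K1+1)],
     xlab="Pr(y1 = -1.15 | y2)",ylab="Pr(y1 = 12.06 | y2)")
points(prob_mat_x[r1,2:3],prob_mat_x[r2,2:3],pch=2) # triangles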

Two Signal Mixture Models

Below, we consider the measurement error problem associated with self-reported education levels. We are interested in comparing the difference in earnings for twins with different education levels. The problem is that most twins get the same level of education. This means that our analysis may be impacted by measurement error. It may be that all the differences we observe are due to reporting mistakes.

The proposed solution is to use two signals of the education level for each person in the study. Luckily, in the data set the twins were asked about both their own education level and the education level of their sibling. This means that for each person we have two signals of their true education level. One signal is the person’s own report of their own education level. The second signal is the person’s sibling’s report of their education level.

Model of Two Signal Data

Consider a problem in which there are \(K\) hidden types which determine the values of two signals \(x_{1}\) and \(x_2\). In the twins study there are four types and the signals are the self-reported education level and the sibling-reported education level.

For simplicity assume that the type is represented by \(\theta_k\) and the signals are a linear function of that value.

\[ \begin{array}{l} x_{i1} = a_1 + b_1 \theta_k + \epsilon_{i1}\\ \\ x_{i2} = a_2 + b_2 \theta_k + \epsilon_{i2} \end{array} \tag{9}\]

where \(k \in \{1,...,K\}\).

Simulation of Two Signal Data

In the simulated data there are four unknown types, say four education levels.

set.seed(123456789)
N <- 10000
u <- c(1:4)
p <- c(0.2,0.5,.25,0.05)
ranN <- runif(N)
cp <- cumsum(p) # cumulative probabilities for sampling the types
theta <- ifelse(ranN < cp[1], u[1],
               ifelse(ranN < cp[2], u[2],
                      ifelse(ranN < cp[3], u[3], u[4])))
e1 <- rnorm(N,mean=0,sd=0.3)
e2 <- rnorm(N,mean=0,sd=0.3)
x1 <- 1 + .25*theta + e1
x2 <- 0.5 + 0.1*theta + e2
summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2179  1.3847  1.7037  1.6980  2.0207  3.2098 
summary(x2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3676  0.5651  0.7897  0.7848  1.0046  1.9819 

The two simulated signals have overlapping distributions. We can extend the ideas above to use these two signals to determine the unobserved types.

Conditional Independence of Signals

We can write down the joint density by the Law of Total Probability.

\[ f(x_1, x_2) = \int_{\theta} h_1(x_1 | x_2, \theta) h_2(x_2 | \theta) g(\theta) \tag{10}\]

We are interested in estimating the distribution of the hidden types (\(g(\theta)\)) and the conditional distributions \(h_{j}(x_j | \theta)\). However, this time we only observe the joint distribution of the signals (\(f(x_1, x_2)\)). Is this enough? There is no magic here. We cannot determine \(h_1\), \(h_2\) and \(g\) from observing \(f\).

However, if we are willing to assume that \(x_1\) and \(x_2\) are independent conditional on \(\theta\) then we can rewrite Equation 10 in the following way.

\[ f(x_1, x_2) = \int_{\theta} h_1(x_1 | \theta) h_2(x_2 | \theta) g(\theta) \tag{11}\]

It is a subtle but important change. In the first case we cannot determine \(h_1\), \(h_2\) and \(g\) from observing \(f\). Now we can, or at least we can bound these objects, following the arguments in Adams (2018) and Fu et al. (2016).

We have assumed that we have two signals and that these signals are independent of each other conditional on the true underlying type. I think this assumption is reasonable for the twins education-reporting example. It implies that once we know the true education level for a person, observing the sibling’s report provides no additional information about the person’s own report.

Two Signal Estimator

The algorithm for solving the mixture model is suggested by Benaglia, Chauveau, and Hunter (2009). The authors suggest simultaneously solving three equations.

\[ \gamma(\theta | x_1, x_2) = \frac{h_1(x_1 | \theta) h_2(x_2 | \theta) g(\theta)}{\int_{\theta'} h_1(x_1 | \theta') h_2(x_2 | \theta') g(\theta')} \tag{12}\]

The first is the posterior (\(\gamma\)) conditional on the observed outcomes.1 This follows from Bayes’ rule with conditional independence, so that \(h(x_1, x_2 | \theta) = h_1(x_1 | \theta) h_{2}(x_2 | \theta)\). The posterior is a function of the joint likelihood of observing the outcomes conditional on the true state, weighted by the prior.

\[ g(\theta) = \int_{x_1,x_2} \gamma(\theta | x_1, x_2) f(x_1,x_2) \tag{13}\]

The second is the prior distribution, which is calculated as the posterior weighted by the joint density of the two outcomes \(x_1\) and \(x_2\).

\[ h(x_1, x_2 | \theta) = \frac{\int_{x_2} \gamma(\theta | x_1,x_2) f(x_1, x_2)}{\int_{x_1,x_2} \gamma(\theta | x_1, x_2) f(x_1,x_2)} \tag{14}\]

The third is the likelihood function for the outcomes \(x_1\) and \(x_2\). This can be determined using Bayes’ rule given the posterior and the observed distribution of the two signals.

Note that there may be multiple priors that satisfy these equations. There is no guarantee that there will be a unique solution. However, depending on the data, bounds around the solution set may be tight.

Two Signal Estimator Algorithm

Benaglia, Chauveau, and Hunter (2009) suggest the following algorithm for solving the system.

Step 0. Let \(t_{ik} \in \{0,1\}\) with \(\sum_{k=1}^K t_{ik} = 1\) denote the category to which each observation \(i\) is assigned, and let \(N_k = \sum_{i=1}^N t_{ik}\). The initial step assigns individual observations to one of the \(K\) categories. In the estimator, this is done using kmeans().2

Step 1. Solve for the likelihood functions given the categorization.

\[ \hat{h}_j(x | \theta_k) = \frac{1}{N_k} \sum_{i=1}^{N} \mathbb{1}(x - \epsilon < x_{ij} < x + \epsilon) t_{ik} \tag{15}\]

Step 2. Solve for the prior.

\[ \hat{g}(\theta_k) = \frac{\sum_{i}^{N} t_{ik}}{N} \tag{16}\]

Step 3. Solve for the posterior.

\[ \hat{\gamma}(\theta_k | x_1, x_2) = \frac{\hat{h}_1(x_1 | \theta_k) \hat{h}_2(x_2 | \theta_k) \hat{g}(\theta_k)}{\sum_{j=1}^K \hat{h}_1(x_1 | \theta_j) \hat{h}_2(x_2 | \theta_j) \hat{g}(\theta_j)} \tag{17}\]

Step 4. Determine types. Set \(t_{ik} = 1\) if and only if

\[ k = \arg \max_{k'} \hat{\gamma}(\theta_{k'} | x_{1i}, x_{2i}) \tag{18}\]

and \(t_{ik} = 0\) otherwise.

Go to Step 1.

Stop when the estimate of the prior (\(g\)) converges. Note that there is no guarantee that this algorithm will converge.3

Mixture Model Estimator in R

Mixture models can be estimated using npEM() from the package mixtools. The name stands for non-parametric EM estimator, even though it is not actually an EM estimator.4 The estimator below is based on the npEM() code. The original uses calls to C; this version uses R’s built-in density() function.
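For comparison, here is a sketch of the packaged call on the simulated signals. The interface is as I recall it from mixtools (mu0 gives the number of components, or a matrix of starting centers); check ?npEM before relying on it.

require(mixtools)
np1 <- npEM(cbind(x1,x2),mu0=4) # mu0: number of hidden types
summary(np1)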

mixmod_fun <- function(X,centers=2,tol=1e-9,
                            maxiter=500,verb=FALSE) {
  X <- as.matrix(X)
  
  # find the initial set of types using kmeans().
  a1 <- kmeans(X,centers)
  
  # the function accepts either a matrix or just the number
  # of hidden types.
  if (length(centers) > 1) {
    K <- dim(centers)[1]
  } else {
    K <- centers
  }
  t1 <- a1$cluster
  T1 <- matrix(NA,dim(X)[1],K)
  for (k in 1:K) {
    T1[,k] <- ifelse(t1==k,1,0)
  }
  
  # determine the initial prior estimate
  g1 <- colMeans(T1)
  
  # loop to calculate the prior and posterior
  diff <- sum(abs(1 - g1))
  i <- 0
  while (diff > tol & i < maxiter) {
    
    # calculate the posterior
    gamma <- t(matrix(rep(g1,dim(X)[1]), nrow=K))
    
    # calculate the likelihood functions
    for (k in 1:K) {
      for (j in 1:dim(X)[2]) {
        h1 <- density(X[t1==k,j],from=min(X),to=max(X))
        h1$y <- h1$y/sum(h1$y)
        gamma[,k] <- 
          gamma[,k]*sapply(c(1:dim(X)[1]), function(x)
            h1$y[which.min((h1$x - X[x,j])^2)])
      }
    }
    gamma <- gamma/rowSums(gamma)
    
    # update the categorization of types
    t1 <- max.col(gamma)
    
    # update the prior estimate
    g2 <- colMeans(gamma)
    
    # check for convergence
    diff <- sum(abs(g1 - g2))
    g1 <- g2
    i <- i + 1
    if (verb) {
      print(diff)
      print(i)
    }
  }
  return(list(g = g2, gamma=gamma, t = t1))
}


centers <- cbind(1 + 0.25*u, 0.5 + 0.1*u)
# columns are number of signals (2 in this case)
# rows are number of types (4 in this case.)
a <- mixmod_fun(X=cbind(x1,x2),centers=centers)
Figure 3: Density plots of the likelihood functions for \(x_1\) conditional on each of the types.

We can use this model to determine the underlying types from simulated data. Figure 3 shows that the mixture model does a pretty good job of discerning the underlying types.

mean(x1[a$t==1])
[1] 1.190989
mean(x1[a$t==2])
[1] 1.471282
mean(x1[a$t==3])
[1] 1.876143
mean(x1[a$t==4])
[1] 2.135286

The estimator finds hidden-type means of \(\{1.19, 1.47, 1.88, 2.14\}\) for \(x_1\), which compare to the true values of \(\{1.25,1.5,1.75,2\}\).

Twins Reports and Returns to Schooling

The book returns to more or less where it began, with the question of measuring returns to schooling. The bane of the econometrician’s efforts to estimate returns to schooling is “unmeasured ability.” Ashenfelter and Krueger (1994) and Ashenfelter and Rouse (1998) suggest a unique solution - find a set of twins, ask them their education level and their wage and measure the differences in wages associated with the differences in education levels. This first-difference approach assumes that twins have the same unmeasured ability and differences it out.

The problem is measurement error. Most twins have the same education level, so any observed difference in education level may not be real. Observed differences may be due to one or the other of the twins misreporting his education level. Such measurement error is likely to attenuate estimates of the returns to education toward 0.
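A small side simulation illustrates the attenuation. The reporting-error rates below are made up for illustration; the true return is set to 8%.

set.seed(123456789)
n <- 5000
edA <- sample(12:18,n,replace=TRUE) # twin A's true schooling
edB <- edA + sample(0:2,n,replace=TRUE,prob=c(0.7,0.2,0.1))
noisy <- function(ed) ed + sample(c(-1,0,1),n,replace=TRUE,
                                  prob=c(0.1,0.8,0.1))
dlw <- 0.08*(edB - edA) + rnorm(n,sd=0.3)
coef(lm(dlw ~ I(noisy(edB) - noisy(edA)))) # attenuated toward 0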

The section shows how a mixture model estimator can be used to estimate each person’s true education level conditional on his own report and the report of his sibling. These true education levels are then used to estimate returns to schooling.

Mixture Model of Twin Reports

The mixture model approach solves the measurement error problem in two steps. In the first step, each individual’s education level is assumed to be measured by two different signals. The first signal is the individual’s own report of his education level. The second signal is the individual’s sibling’s report of his education level. Note that a sibling’s report is likely to have different characteristics than an individual’s own report.

To use the approach above we need to assume that the individual’s own report is independent of their sibling’s report of their education level, conditional on the true education level. This assumption seems quite reasonable. It implies that the reports differ from the truth only due to idiosyncratic mistakes.

Here, these two signals are used to estimate four “types” of individuals. Each type has a conditional distribution over education levels, where each is labeled for the mode of the conditional distribution of education level - 12, 14, 16 and 18.5 These levels line up with the certificate-granting years - high school diploma, associate’s degree, bachelor’s degree and master’s degree. To be clear, while the number of latent types is imposed on the estimated model, the conditional distributions (and their modes) are determined by the data.

The second step uses the results to estimate returns to schooling. It takes estimates from the mixture model to calculate the posterior probability of being a member of each education level type. These probabilities are used to weight the average education level for each type to give each person’s estimated education level - the “true” education level. Each twin is then matched with his sibling to calculate the difference in estimated true education levels. These differences are used in place of the naive and IV estimated differences in the first difference regression models presented in Ashenfelter and Rouse (1998).

Twin Reports Data

The analysis uses the same data as Ashenfelter and Rouse (1998).6

require(foreign) # package for reading older Stata data sets.
Loading required package: foreign
x <- read.dta("pubtwins.dta")
Warning in read.dta("pubtwins.dta"): cannot read factor labels from Stata 5
files
xf <- x[is.na(x$first)==0,]  # first twin
xs <- x[is.na(x$first)==1,]  # second twin
# educ "Twin1 Education";
# educt "Twin1 Report of Twin2 Education";
# educ_t "Twin2 Education";
# educt_t "Twin2 Report of Twin1 Education";
# the data is written out in pairs for each set of twins.
lm1 <- lm(dlwage ~ deduc, data=xf)
lm2 <- lm(dlwage ~ deduc + duncov + dmaried + dtenure, 
          data=xf)
Figure 4: Wage differences by differences in education reports by twins. The line is the regression from above.

Figure 4 shows two important takeaways from the data. First, for many twins there is no difference in the education levels. Second, there is a distinct positive relationship between differences in education level and income for the twins in the sample.

Ashenfelter and Krueger (1994) and Ashenfelter and Rouse (1998) present evidence that first differencing gives biased estimates. This bias is due to measurement error which is exacerbated by the first differencing. The authors’ solution is to use an IV approach in which alternative reports are used as an instrument for the individual’s own reports. The naive approach suggests an effect of 7% while the IV approach gives an estimate of 8.8%. Can you replicate these IV estimates?
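As a possible starting point for that exercise, one could instrument the difference in own reports with the difference in the siblings’ cross-reports. The variable name deduct below is hypothetical; check names(xf) for the actual name in the data set.

require(AER) # for ivreg()
# deduct is a hypothetical name for the difference in cross-reports
iv1 <- ivreg(dlwage ~ deduc | deduct,data=xf)
summary(iv1)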

Mixture Model Approach to Measurement Error

The IV approach suggested by Ashenfelter and Krueger (1994) makes very strong assumptions about the structure of the measurement error.7 The mixture model approach suggests that the measurement error issue can be addressed without the need for the strong assumptions of IV. Note that both the IV method and the mixture model method assume that reports are conditionally independent.

The first step is to determine the education level associated with each type.
We need to split the data between “first” and “second” twins so that we can make the correct comparison.

set.seed(123456789)
yf <- cbind(xf$educ,xf$educt_t)
ys <- cbind(xs$educ,xs$educt_t)
y1 <- rbind(yf,ys)

The data is recombined to estimate the mixture model.

centers <- matrix(rep(c(12, 14, 16, 18),2), nrow = 4)
b1 <- mixmod_fun(y1,centers,tol=1e-25)
Figure 5: Density plot of the likelihood functions for observing the twin’s own report conditional on their “true” education level.

Figure 5 presents the estimates of the likelihood of observing the twin’s own report conditional on their true education level being 12, 14, 16 or 18. The plots show that for some people who report 13 years, their true education level is 12 years, while for others their true education level is 14 years.

The plots suggest that the model is not point identified. Adams (2016) shows that point identification depends on the existence of cases where the likelihood uniquely determines the type. The plot shows that the likelihoods have tails that always overlap. This means that no likelihood is able to uniquely determine the type. That said, the existence of high likelihood ratios suggests that the bounds on the identified set are tight.
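One rough diagnostic for how tight the bounds might be is the sharpness of the estimated posteriors.

# share of individuals whose posterior puts more than 90% of its
# weight on a single education type
mean(apply(b1$gamma,1,max) > 0.9)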

Estimating Returns to Schooling from Twins

Using the posterior estimate of each twin’s “type” we can assign them and their sibling an education level. Then we can calculate the first-difference.

ed <- c(12,14,16,18)
xf$deduc <- b1$gamma[1:340,]%*%ed - b1$gamma[341:680,]%*%ed
# difference in "true" education levels
# for each twin
lm3 <- lm(dlwage ~ deduc, data = xf)
lm4 <- lm(dlwage ~ deduc + duncov + dmaried + dtenure, data=xf)
Table 1: Results comparing the standard first difference estimates to the mixture model estimates. In each case the second model includes the standard explanatory variables from Ashenfelter and Rouse (1998). Models (1) and (2) are the first difference estimates, while models (3) and (4) are the respective mixture model estimates. The standard errors are reported in parentheses.
              (1)        (2)        (3)        (4)
(Intercept)    0.030      0.026      0.032      0.030
              (0.028)    (0.026)    (0.027)    (0.026)
deduc          0.061      0.073      0.068      0.082
              (0.019)    (0.018)    (0.019)    (0.018)
duncov                    0.086                 0.085
                         (0.056)               (0.055)
dmaried                   0.024                 0.049
                         (0.073)               (0.073)
dtenure                   0.023                 0.024
                         (0.003)               (0.003)
Num.Obs.     340        333        340        333
R2             0.031      0.170      0.036      0.178
R2 Adj.        0.028      0.160      0.033      0.168
AIC          507.5      455.5      505.6      452.2
BIC          518.9      478.3      517.1      475.0
Log.Lik.    -250.727   -221.740   -249.803   -220.085
RMSE           0.51       0.47       0.50       0.47

Table 1 presents the OLS results for log wages on the first-difference in education levels. The standard errors are reported in parentheses. Note that these standard errors do not account for the mixture model estimation used in calculating the first difference in education level. These results can be compared to the first difference estimates in Ashenfelter and Rouse (1998). The mixture model estimates are somewhat higher, suggesting that measurement error is in fact biasing down the first difference estimates. However, they are lower than the IV estimates (not presented).

Revisiting Minimum Wage Effects

I began the chapter by claiming that mixture models could be used to estimate heterogeneous treatment effects. This section tests that idea on Card and Krueger’s controversial minimum wage paper.8 The original paper was controversial for two reasons. First, it purported to show that a bedrock result of economics wasn’t true. Economic theory predicts that a minimum wage increase will lead to a decrease in the demand for labor. The difference in difference results showed that the minimum wage increase in New Jersey caused a (small) increase in the demand for labor. Second, the paper was used to justify an increase in the federal minimum wage, a controversial policy.

A possible explanation for the result is that the change in the minimum wage in New Jersey had heterogeneous effects. Economic theory is clear that an increase in the cost of labor hours will lead to a decrease in demand for labor hours. It is less clear on how that decrease will be achieved. Will restaurants decrease the number of full-time staff or the number of part-time staff or substitute from full-time to part-time or the other way around?

Restaurant Employment


The analysis uses the same data as Chapter 10, but with the mixture model method described above. The analysis uses the pre-treatment period (period 1) to estimate the hidden types. With these in hand, it uses the post-treatment period (period 2) to estimate the difference in difference for each hidden type. Note that the section assumes that the types do not change from period 1 to period 2. The two signals of the restaurant’s hidden type are the number of full-time employees and the number of part-time employees.

Y <- rbind(x3$FT[x3$TIME==1],x3$PT[x3$TIME==1],
           x3$FT[x3$TIME==2],x3$PT[x3$TIME==2])
Y1 <- Y[1:2,is.na(colSums(Y))==0] # period 1
Y2 <- Y[3:4,is.na(colSums(Y))==0] # period 2
treat <- x3$STATE[x3$TIME==1]==1
treat1 <- treat[is.na(colSums(Y))==0]
Table 2: Summary table of the number of full-time and part-time employees in period 1.
          Full-Time   Part-Time
Min.        0.000       0.00
1st Qu.     2.000      11.00
Median      6.000      17.00
Mean        8.202      18.84
3rd Qu.    11.250      25.00
Max.       60.000      60.00

Table 2 presents summary statistics for the number of full-time and part-time employees in the surveyed restaurants. Note that these firms tend to have many fewer full-time employees than part-time employees.

Mixture Model Estimation of Restaurant Type

In the education example above there was a clear measure of the “hidden type,” the years of education, particularly the diploma-granting years. Here there is no clear indication. I proceed by guessing that there are firms that hire small, medium and large numbers of full-time employees and part-time employees. For full-time these are the groups centered at 2, 6 and 11. For part-time they are the groups centered at 11, 17 and 25. Moreover, there are some firms that hire a small number of both, and some that hire a small number of one and a large number of the other. Lastly, through trial and error I pared down the initial groups to 5.9

p1 <- c(2,6,2,2,6)
p2 <- c(11,17,17,25,11)
a <- mixmod_fun(X=t(Y1),centers=cbind(p1,p2),tol=1e-15)
a$g
[1] 0.30724141 0.07859604 0.26042828 0.17348538 0.18024888
Figure 6: Plot of distribution of full-time and part-time employees by hidden type.

Figure 6 shows that there are more or less five types of restaurants. There are small restaurants that hire a small number of both types of employees. There are small restaurants that tend to hire part-time employees and ones that tend to hire full-time employees. Then there are large restaurants that either have more part-time employees or more full-time employees.

Heterogeneous Minimum Wage Effect

The model determines the hidden type for each restaurant. With this information, we can run the difference in difference estimator presented in Chapter 10 for each type.

Figure 7: Plot of distribution of full-time and part-time employees for type 1 (small restaurants) before (left) and after (right) the minimum wage increase in New Jersey.

Figure 7 suggests that there was an effect of the increase in the minimum wage on small restaurants. There seems to be a shift down in the number of full-time and part-time employees that these firms hire after the minimum wage increase. The cloud disperses with some firms increasing employment, but relative to Pennsylvania firms there seem to be more firms with low or zero levels of employment of part-time workers.

To see whether there is an effect for the small type we can estimate the difference in difference model for full-time workers. The estimate suggests that the minimum wage increase reduced full-time employment by 17% for small restaurants.

y_11 <- c(Y1[1,][!treat1 & a$t==1],Y1[1,][treat1 & a$t==1])
# full-time employment for type 1 restaurants 
# prior to the minimum wage increase
y_12 <- c(Y2[1,][!treat1 & a$t==1],Y2[1,][treat1 & a$t==1])
# full-time employment for type 1 restaurants after
# the minimum wage increase.
treat <- c(rep(0,length(Y1[1,][!treat1 & a$t==1])),
           rep(1,length(Y1[1,][treat1 & a$t==1])))
did1 <- f_did(rbind(y_11,y_12),treat)
did1[3,3]/did1[1,1]
[1] -0.1710775

We can do the same analysis for part-time workers. Here the effect of the minimum wage increase is smaller, an 8% decrease in part-time workers for small restaurants.

y_13 <- c(Y1[2,][!treat1 & a$t==1],Y1[2,][treat1 & a$t==1])
y_14 <- c(Y2[2,][!treat1 & a$t==1],Y2[2,][treat1 & a$t==1])
treat <- c(rep(0,length(Y1[2,][!treat1 & a$t==1])),
           rep(1,length(Y1[2,][treat1 & a$t==1])))
did2 <- f_did(rbind(y_13,y_14),treat)
did2[3,3]/did2[1,1]
[1] -0.08023572

The mixture model analysis finds that for type 1 restaurants (small restaurants) the minimum wage increase is associated with a large decrease in the number of full-time employees and a smaller but still notable decrease in the number of part-time employees.

Figure 8: Plot of distribution of full-time and part-time employees for types 2 to 5 after the minimum wage increase in New Jersey.

Figure 8 presents the distribution of part-time and full-time employees for each of the other four hidden types. The figure shows that it is difficult to determine if there is any measurable effect of the minimum wage increase. In each case, the data is spread out in a cloud with no discernible decrease in employment. Can you replicate the difference in difference estimates for these types?
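A possible sketch of that replication, reusing f_did from Chapter 10 (full-time employment shown; use row 2 of Y1 and Y2 for part-time workers):

for (k in 2:5) {
  y_a <- c(Y1[1,][!treat1 & a$t==k],Y1[1,][treat1 & a$t==k])
  y_b <- c(Y2[1,][!treat1 & a$t==k],Y2[1,][treat1 & a$t==k])
  tr <- c(rep(0,sum(!treat1 & a$t==k)),
          rep(1,sum(treat1 & a$t==k)))
  did_k <- f_did(rbind(y_a,y_b),tr)
  print(did_k[3,3]/did_k[1,1]) # proportional effect, as above
}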

Discussion and Further Reading

Mixture models play an important role in microeconometrics. Unfortunately, mixture models are generally not identified without reliance on hard-to-justify parametric restrictions. The chapter considers the case where the econometrician observes multiple signals of the same underlying process. An example is reports of educational attainment. In this case we have two reports, one self-reported and one reported by the person’s sibling. These two signals can be used to bound the distribution of the prior. In some cases the bounds can be tight.

Hu (2017) presents a comprehensive review of the three signal identification result and its applications. Haile and Kitamura (2019) focus on auctions with unobserved heterogeneity, but the paper provides an excellent overview of the various problems that occur in econometrics and a discussion of their solutions. The estimator presented here is based on Benaglia, Chauveau, and Hunter (2009). While very nice to use, it is not guaranteed to converge. Levine, Hunter, and Chauveau (2011) suggest a similar algorithm, a majorization-minimization algorithm, which does have nice convergence properties.

The question of returns to schooling bookends this book. Chapters 1 and 2 presented OLS estimates using data from the NLSM 1966. The analysis suggested that one additional year of schooling increased wages by 7.8%. Card (1995) pointed out that this estimate is biased because there are unobserved characteristics that affect both the amount of schooling an individual receives and their income. The paper uses distance to college as an instrument and estimates that the returns to schooling are closer to 14%. Chapter 3 also presents LATE estimates that vary from 19% to 32%, although those estimates don’t include other explanatory variables. These results suggest that there is a fair amount of heterogeneity in schooling returns. Chapter 6 returns to the question using the Heckman selection model and gets an average effect of 14%. The model estimates that there is variation in schooling returns and in fact returns are not positive for everyone. Chapter 9 uses GMM to re-estimate Card’s IV approach with two instruments. It finds that returns are slightly smaller than suggested by OLS. This chapter uses reports from twins to estimate returns to schooling. After accounting for measurement error with a mixture model, the twins analysis suggests an average return of 8% for each additional year of schooling.

The chapter also revisits Card and Krueger (1994) and their analysis of the minimum wage increase in New Jersey. The results suggest that the minimum wage increase led to a substantial reduction in employment at small restaurants.

References

Adams, Christopher P. 2016. “Finite Mixture Models with One Exclusion Restriction.” The Econometrics Journal 19: 150–65.
———. 2018. “Measuring Treatment Effects with Big n Panels.”
Ashenfelter, Orley, and Alan Krueger. 1994. “Estimates of the Economic Return to Schooling from a New Sample of Twins.” American Economic Review 84 (5).
Ashenfelter, Orley, and Cecilia Rouse. 1998. “Income, Schooling, and Ability: Evidence from a New Sample of Identical Twins.” Quarterly Journal of Economics 113 (1): 253–84.
Benaglia, Tatiana, Didier Chauveau, and David R. Hunter. 2009. “An EM-Like Algorithm for Semi- and Non-Parametric Estimation in Multivariate Mixtures.” Journal of Computational and Graphical Statistics 18 (2): 505–26.
Card, David. 1995. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” In Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, edited by Louis N. Christofides, E. Kenneth Grant, and Robert Swidinsky, 201–22. Toronto: University of Toronto Press.
Card, David, and Alan B. Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” The American Economic Review 84 (4): 772–93.
Fu, Xiao, Kejun Huang, Bo Yang, and Nicholas D. Sidiropoulos. 2016. “Robust Volume Minimization-Based Matrix Factorization for Remote Sensing and Document Clustering.” IEEE Transactions on Signal Processing 64 (23): 6254–68.
Haile, Philip A., and Yuichi Kitamura. 2019. “Unobserved Heterogeneity in Auctions.” The Econometrics Journal 22: C1–19.
Hu, Yingyao. 2017. “The Econometrics of Unobservables: Applications of Measurement Error Models in Empirical Industrial Organization and Labor Economics.” Journal of Econometrics 200 (2): 154–68.
Levine, M., D. R. Hunter, and D. Chauveau. 2011. “Maximum Smoothed Likelihood for Multivariate Mixtures.” Biometrika 98 (2): 403–16.

Footnotes

  1. See Appendix A for a discussion of the use of Bayesian updating.↩︎

  2. kmeans() is a method for grouping observations into clusters based on how close they are to each other.↩︎

  3. Levine, Hunter, and Chauveau (2011) presents a similar algorithm which is guaranteed to converge.↩︎

  4. An expectation-maximization (EM) algorithm is a widely used alternative approach involving parametric assumptions.↩︎

  5. The fact that particular individuals may be poor at estimating both their own and their sibling’s education level will not affect the procedure used here except through sampling correlation. See Ashenfelter and Krueger (1994) for discussion of this issue with regard to the IV approach.↩︎

  6. The data is available here: https://dataspace.princeton.edu/jspui/handle/88435/dsp01rv042t084↩︎

  7. These assumptions are discussed in Chapter 3.↩︎

  8. The ideas in this section are inspired by recent work by Jean-Marc Robin, an econometrician at Sciences Po in Paris.↩︎

  9. I re-ran the mixture model a number of times. I began with all nine types and each time removed the type that was estimated to be the smallest.↩︎