Sample Selection models

Sample selection happens when part of the the outcome of interest is missing non-randomly (Heckman 1976), (Heckman 1979), (Wooldridge 2010). See also Iqbal, Ogundimu, and Rubio (2023) for a recent overview on sample selection models. Sample selection models (also known as Tobit-2 models) with continuous outcomes consist of two equations, one for the outcome and one for the selection processes, as follows,

\[ \begin{split} y^{*}_{i} &= \beta_{0} + \mathbf{x}_{i}^{\top}\mathbf{\beta} + \epsilon_{1,i} \, , \quad \text{(outcome)} \\ s^{*}_{i} &= \alpha_{0} + \mathbf{w}_{i}^{\top}\mathbf{\alpha} + \epsilon_{2,i}\, , \quad \text{(selection)} \end{split} \]

where \(\mathbf{x}_{i} = (x_{i,1},\ldots,x_{i,p})^{\top} \in {\mathbb R}^{p}\) and \(\mathbf{w}_{i} = (w_{i,1},\ldots,w_{i,q})^{\top} \in {\mathbb R}^{q}\), \(i = 1, \ldots, n\), are the corresponding covariates, \(\mathbf{\beta} = (\beta_{1}, \ldots, \beta_{p})^{\top} \in \mathbb{R}^{p}\) and \(\mathbf{\alpha} = (\alpha_{1}, \ldots, \alpha_{q})^{\top} \in \mathbb{R}^{q}\) are the regression coefficients, and \(\beta_0\) and \(\alpha_0\) are the intercepts.

The errors are usually assumed to be distributed according to a bivariate normal distribution

\[ \begin{pmatrix} \epsilon_{1,i} \\ \epsilon_{2,i} \end{pmatrix} \sim N\left( \begin{pmatrix} 0\\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^{2} & \sigma \rho \\ \sigma \rho & 1 \end{pmatrix} \right) \, , \label{corr_matrix} \]

where \(\sigma>0\) and \(\rho \in (-1,1)\).

Let us denote \(\mathbf{\eta} = (\alpha_{0}, \beta_{0}, \mathbf{\alpha}^{\top}, \mathbf{\beta}^{\top}, \tau, \rho)^{\top}\). The log-likelihood of the parameters for this model is (Toomet and Henningsen 2008)

\[ \begin{split} \ell(\alpha_{0}, \beta_{0}, \mathbf{\alpha}, \mathbf{\beta}, \sigma, \rho) &= \sum_{\{s_{i}=0\}} \log\Phi(-\alpha_{0} - \mathbf{w}_{i}^{\top}\mathbf{\alpha})\\ &+ \sum_{\{s_i=1\}} \log\Phi\left( \frac{\alpha_{0} + \mathbf{w}_{i}^{\top}\mathbf{\alpha} + \rho(y_{i} - \beta_{0} - \mathbf{x}_{i}^{\top}\mathbf{\beta})/\sigma}{\sqrt{1-\rho^{2}}}\right)\\ &- \sum_{\{s_i=1\}} \left[ \log(\sigma) + \frac{1}{2}\log(2\pi) + \frac{1}{2}\left(\frac{y_{i} - \beta_{0} - \mathbf{x}_{i}^{\top}\mathbf{\beta}}{\sigma}\right)^{2}\right]. \end{split} \]

It has been shown that the log-likelihood is not globally concave (Toomet and Henningsen 2008). This often induces numerical challenges in the calculation of the maximum likelihood estimator of the parameters. Moreover, this problem is typically associated to the estimation of the parameter \(\rho\). The lack of concavity of the log-likelihood suggests the use of confidence intervals based on asymptotic normality may produce low coverage for some of the parameters, for small or moderate sample sizes. An alternative method method for constructing confidence intervals consists of using the profile likelihood function, which allows for constructing intervals that are not symmetric around the maximum likelihood estimator, as explained next.

Profile likelihood

Let \(\mathbf{x} = (x_1,\dots,x_n)^{\top}\) be a sample from a distribution with density \(f(\cdot;\mathbf{\theta})\), \(\mathbf{\theta} \in \Theta \subset {\mathbb R}^p\). Suppose that \(\mathbf{\theta} = (\mathbf{\delta},\mathbf{\xi})\), where \(\mathbf{\delta} \in {\mathbb R}^{p_1}\) is a parameter of interest, and \(\mathbf{\xi} \in {\mathbb R}^{p_2}\) is a nuisance parameter, where \(p=p_1+p_2\). Let:

\[ L(\mathbf{\theta}\mid \mathbf{x}) = L(\mathbf{\delta},\mathbf{\xi} \mid \mathbf{x}) = \prod_{i=1}^{n} f(x_i;\mathbf{\theta}), \]

be the likelihood function of \(\mathbf{\theta}\). Let \(\widehat{\mathbf{\theta}} = (\widehat{\mathbf{\delta}},\widehat{\mathbf{\xi}})\) be the MLE of \(\mathbf{\theta}\). Then, the profile likelihood (or profile likelihood ratio) of \(\mathbf{\delta}\) is:

\[ R(\mathbf{\delta}\mid \mathbf{x}) = \dfrac{L(\mathbf{\delta},\widehat{\mathbf{\xi}}(\mathbf{\delta}) \mid \mathbf{x})}{L(\widehat{\mathbf{\theta}}\mid \mathbf{x})} = \dfrac{\max_{\mathbf{\xi}} L(\mathbf{\delta},\mathbf{\xi} \mid \mathbf{x}) }{\max_{\mathbf{\theta}}L({\mathbf{\theta}}\mid \mathbf{x})}, \]

where \(\max_{\xi} L(\mathbf{\delta},\mathbf{\xi} \mid \mathbf{x})\) means that we are maximising the likelihood with respect to \(\mathbf{\xi}\), for a fixed value of \(\mathbf{\delta}\). Then, profile likelihood is bounded \(0<R(\mathbf{\delta}\mid \mathbf{x})\leq 1\).

Now, suppose that the dimension of the parameter of interest is one (\(p_1=1\)). Then,

\[ \Lambda(\delta_0) = -2 \log R(\delta_0 \mid \mathbf{x}) \stackrel{\cdot}{\sim} \chi^2_{1}. \]

Thus, \(\Lambda(\delta_0)\) can be seen as an approximate pivotal quantity which may be used to construct confidence intervals for \(\delta_0\).

For example, a \(0.147\) profile-likelihood interval is an approximate \(95\%\) confidence interval. See: Statistical Inference Lecture notes

R code

The following R code provides a direct implementation of the log likelihood function, maximum likelihood estimation, and the profile likelihood. A comparison agains the point estimates obtained using the R packages sampleSelection and ssmodels is also presented. An illustration is presented using the Ambulatory Expenditures Data set.

Required R packages and functions

rm(list=ls())

# Required R packages
library(ssmrob)
library(ssmodels)
## If you have questions, suggestions,
##   or comments regarding the 'ssmodels' package, please contact: fernando.bastos@ufv.br
library(sampleSelection)
## Loading required package: maxLik
## Loading required package: miscTools
## 
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
## 
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
# Maximum likelihood estimation in Tobit-2 models
# sigma is reparameterised using a log link for unrestricted optimisation
# rho is reparameterised using a tanh link for unrestricted optimisation

MLETobit2 <- function(init, out, sel, deso, dess, method = "nlminb", maxit = 100){
  out <- as.vector(out); sel <- as.vector(sel)
  x1 <- as.matrix(cbind(1,deso)); w1 <- as.matrix(cbind(1,dess))
  p <- ncol(x1); q <- ncol(w1)
  
  loglik <- function(par){
    sigma <- exp(par[1]); rho <- tanh(par[2]); beta <- par[3:(p+2)]; alpha <- par[(p+3):(p+q+2)]
    x_beta <- x1%*%beta; w_alpha <- w1%*%alpha 
    inds0 <- (sel==0); inds1 <- (sel==1)
    n1 <- sum(inds1)
    
    ll <- sum( pnorm(-w_alpha[inds0], log.p = TRUE) ) +
      sum( pnorm( (w_alpha[inds1] + rho*(out[inds1] - x_beta[inds1])/sigma)/sqrt(1-rho^2) , log.p = TRUE)  ) -
      0.5*n1*log(2*pi) - n1*log(sigma) - 0.5*sum((out[inds1] - x_beta[inds1])^2/sigma^2)
    
    return(-ll)
  }
  
  if(method == "nlminb") OPT <- nlminb(init, loglik, control = list(iter.max = maxit))
  if(method != "nlminb") OPT <- optim(init, loglik, control = list(maxit = maxit), method = method)
  
  MLE <- c(exp(OPT$par[1]), tanh(OPT$par[2]) , OPT$par[-c(1,2)]) 
  names(MLE) <- c("sigma","rho", "intercept", paste(paste("deso[,",1:(p-1), sep = ""),"]", sep = ""), 
                  "intercept",  paste(paste("dess[,",1:(q-1), sep = ""),"]", sep = ""))
  
  outf <- list(loglik = loglik, OPT = OPT, MLE = MLE)
  return(outf)
}

Data preparation

# Ambulatory Expenditure data set
data(MEPS2001)
attach(MEPS2001)
head(MEPS2001)
##   educ age income female vgood good hospexp totchr ffs dhospexp      age2
## 1   11 3.3 17.472      0     0    1       0      2   1        0 10.889999
## 2   14 2.1 16.920      0     1    0       0      0   0        0  4.409999
## 3   12 4.0 19.620      1     1    0       0      1   0        0 16.000000
## 4   14 5.2  8.563      1     0    1       0      0   0        0 27.039997
## 5   16 5.0 25.360      1     1    0       0      0   0        0 25.000000
## 6   12 3.7 30.000      0     1    0       0      0   1        0 13.690001
##   agefem fairpoor year01 instype ambexp  lambexp blhisp instype_s1 dambexp
## 1    0.0        0      1       0    760 6.633318      0          0       1
## 2    0.0        0      1       1    497 6.208590      1          0       1
## 3    4.0        0      1       1   1002 6.909753      1          0       1
## 4    5.2        0      1       1    745 6.613384      1          1       1
## 5    5.0        0      1       1   2728 7.911324      0          0       1
## 6    0.0        0      1       0    636 6.455199      0          0       1
##     lnambx ins
## 1 6.633318   0
## 2 6.208590   0
## 3 6.909753   0
## 4 6.613384   1
## 5 7.911324   0
## 6 6.455199   0
# Design matrices

# outcome
deso <- cbind(age, female, educ, blhisp, totchr, ins)

# selection
dess <- cbind(age, female, educ, blhisp, totchr, ins, income)

# selection indicator
sel <- as.logical(dambexp)

# outcome 
out <- lnambx

MLE (direct)

# Optimisation step
OPT <-  MLETobit2(init = rep(0,17), out = out, sel = sel, deso = deso, dess = dess, method = "nlminb", maxit = 10000)

# MLE
MLE <- OPT$MLE

MLE
##        sigma          rho    intercept     deso[,1]     deso[,2]     deso[,3] 
##  1.271019107 -0.130599689  5.044053022  0.211974636  0.348144614  0.018716442 
##     deso[,4]     deso[,5]     deso[,6]    intercept     dess[,1]     dess[,2] 
## -0.218568878  0.539918618 -0.029986445 -0.676044223  0.087935840  0.662664455 
##     dess[,3]     dess[,4]     dess[,5]     dess[,6]     dess[,7] 
##  0.061948057 -0.363936181  0.796949467  0.170139544  0.002707656

MLE (R packages)

# Equations

# Selection equation
selectEq <- dambexp ~ age + female + educ + blhisp + totchr + ins + income

# Outcome equation
outcomeEq <- lnambx ~ age + female + educ + blhisp + totchr + ins

# ssmodels R package
HMLE <- HeckmanCL(selectEq, outcomeEq, data = MEPS2001)

summary(HMLE)
## 
## --------------------------------------------------------------
##           Classic Heckman Model (Package: ssmodels)           
## --------------------------------------------------------------
## --------------------------------------------------------------
## Maximum Likelihood estimation 
## optim function with method BFGS-iterations numbers: 22 
## Log-Likelihood: -5836.219 
## AIC: 11706.44 BIC: 11810.31 
## Number of observations: ( 526 censored and 2802 observed ) 
## 17 free parameters ( df= 3311 ) 
## --------------------------------------------------------------
## Probit selection equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.676054   0.194029  -3.484  0.00050 ***
## age          0.087936   0.027421   3.207  0.00135 ** 
## female       0.662665   0.060937  10.875  < 2e-16 ***
## educ         0.061948   0.012029   5.150 2.76e-07 ***
## blhisp      -0.363937   0.061874  -5.882 4.46e-09 ***
## totchr       0.796952   0.071130  11.204  < 2e-16 ***
## ins          0.170137   0.062871   2.706  0.00684 ** 
## income       0.002708   0.001316   2.057  0.03973 *  
## --------------------------------------------------------------
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.04406    0.22813  22.111  < 2e-16 ***
## age          0.21197    0.02301   9.213  < 2e-16 ***
## female       0.34814    0.06011   5.791 7.64e-09 ***
## educ         0.01872    0.01055   1.774 0.076078 .  
## blhisp      -0.21857    0.05967  -3.663 0.000253 ***
## totchr       0.53992    0.03933  13.727  < 2e-16 ***
## ins         -0.02999    0.05109  -0.587 0.557260    
## --------------------------------------------------------------
## Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma  1.27102    0.01838  69.157   <2e-16 ***
## rho   -0.13060    0.14708  -0.888    0.375    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------------------------
# sampleSelection

ssMLE <- selection(selectEq, outcomeEq)
summary(ssMLE) 
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 4 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -5836.219 
## 3328 observations (526 censored and 2802 observed)
## 17 free parameters (df = 3311)
## Probit selection equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.676054   0.194029  -3.484  0.00050 ***
## age          0.087936   0.027421   3.207  0.00135 ** 
## female       0.662665   0.060938  10.874  < 2e-16 ***
## educ         0.061948   0.012029   5.150 2.76e-07 ***
## blhisp      -0.363938   0.061873  -5.882 4.46e-09 ***
## totchr       0.796951   0.071131  11.204  < 2e-16 ***
## ins          0.170137   0.062871   2.706  0.00684 ** 
## income       0.002708   0.001317   2.056  0.03982 *  
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.04406    0.22813  22.111  < 2e-16 ***
## age          0.21197    0.02301   9.213  < 2e-16 ***
## female       0.34814    0.06011   5.791 7.64e-09 ***
## educ         0.01872    0.01055   1.774 0.076078 .  
## blhisp      -0.21857    0.05967  -3.663 0.000253 ***
## totchr       0.53992    0.03933  13.727  < 2e-16 ***
## ins         -0.02999    0.05109  -0.587 0.557261    
##    Error terms:
##       Estimate Std. Error t value Pr(>|t|)    
## sigma  1.27102    0.01838  69.157   <2e-16 ***
## rho   -0.13060    0.14708  -0.888    0.375    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

Profile likelihood

#######################################################################################
# Profile likelihood
#######################################################################################
po <- ncol(deso)
ps <- ncol(dess)

p <- length(OPT$MLE) # number of parameters
ML <- OPT$OPT$objective # Maximum likelihood

# Profile likelihood function for parameter "ind"

# parint : parameter of interest (reparameterised to the real line)
# ind : index of the parameter of interest
# init : type of initial value for the optimisation. "MLE" or "zero".

prof.lik <- function(parint, ind, init = "MLE"){
  
  tempf <- function(par){
    tempv <- rep(0,p)
    tempv <- replace(x = tempv, c(1:p)[-ind] , par)
    tempv <- replace(x = tempv, ind , parint)
    out0 <- OPT$loglik(tempv)
    return(out0)
  } 
  
  if(init == "MLE") out <-  -nlminb(OPT$OPT$par[-ind],tempf, control = list(iter.max = 10000))$objective + ML
  if(init == "zero") out <-  -nlminb(OPT$OPT$par[-ind]*0,tempf, control = list(iter.max = 10000))$objective + ML
  
  return(exp(out))
}


# Some Profile likelihoods

# Profile likelihood of Parameter 1
prof1 <- Vectorize(function(par) prof.lik(parint = log(par), ind = 1, init = "zero"))
curve(prof1,1.2,1.35 , n = 50, lwd = 2, xlab = expression(sigma), ylab = "Profile Likelihood")

# Profile likelihood of Parameter 2
prof2 <- Vectorize(function(par) prof.lik(parint = atanh(par), ind = 2, init = "MLE"))
curve(prof2,-0.6,0.25 , n = 50, lwd = 2, xlab = expression(rho), ylab = "Profile Likelihood")

# Profile likelihood of Parameter 3
prof3 <- Vectorize(function(par) prof.lik(parint = par, ind = 3, init = "MLE"))
curve(prof3,4.25,6 , n = 50, lwd = 2, xlab = expression(beta[0]), ylab = "Profile Likelihood")

# Profile likelihood of Parameter 4
prof4 <- Vectorize(function(par) prof.lik(parint = par, ind = 4, init = "MLE"))
curve(prof4,0.125,0.3 , n = 50, lwd = 2, xlab = expression(beta[1]), ylab = "Profile Likelihood")

Profile-likelihood confidence interval for \(\rho\)

# First, we construct the 0.147 profile likelihood interval for atanh(rho)

intprof2 <- Vectorize(function(rho) prof2(rho) - 0.147 )

# Profile likelihood of rho
curve(prof2,-0.6,0.25 , n = 50, lwd = 2, xlab = expression(rho), ylab = "Profile Likelihood")
abline(h = 0.147, lwd = 2, col = "red")

# Looking at the previous plot helps identifying regions to find the roots

LP <- uniroot(f = intprof2, interval = c(-0.5,-0.4))$root

UP <- uniroot(f = intprof2, interval = c(0.1,0.2))$root

# 95% Confidence interval for atanh(rho)
c(LP, UP)
## [1] -0.4349697  0.1191926
# Now, we transform this interval to the original parameterisation
# to obtain a 95% Confidence interval for rho

tanh(c(LP, UP))
## [1] -0.4094662  0.1186313
Heckman, J. J. 1976. The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models.” In Annals of Economic and Social Measurement, Volume 5, Number 4, 475–92. NBER. http://www.nber.org/chapters/c10491.
———. 1979. Sample Selection Bias as a Specification Error.” Econometrica 47 (1): 153–61. http://www.jstor.org/stable/1912352.
Iqbal, A., E. O. Ogundimu, and F. J. Rubio. 2023. “Bayesian Variable Selection in Sample Selection Models Using Spike-and-Slab Priors.” arXiv Preprint arXiv:2312.03538.
Toomet, O., and A. Henningsen. 2008. “Sample Selection Models in R: Package SampleSelection.” Journal of Statistical Software 27 (7): 1–23. https://doi.org/10.18637/jss.v027.i07.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. Cambridge, Massachusetts: The MIT Press.