Sample selection occurs when part of the outcome of interest is missing non-randomly (Heckman 1976, 1979; Wooldridge 2010); see Iqbal, Ogundimu, and Rubio (2023) for a recent overview of sample selection models. Sample selection models (also known as Tobit-2 models) with continuous outcomes consist of two equations, one for the outcome and one for the selection process, as follows,
\[ \begin{split} y^{*}_{i} &= \beta_{0} + \mathbf{x}_{i}^{\top}\mathbf{\beta} + \epsilon_{1,i} \, , \quad \text{(outcome)} \\ s^{*}_{i} &= \alpha_{0} + \mathbf{w}_{i}^{\top}\mathbf{\alpha} + \epsilon_{2,i}\, , \quad \text{(selection)} \end{split} \]
where \(\mathbf{x}_{i} = (x_{i,1},\ldots,x_{i,p})^{\top} \in {\mathbb R}^{p}\) and \(\mathbf{w}_{i} = (w_{i,1},\ldots,w_{i,q})^{\top} \in {\mathbb R}^{q}\), \(i = 1, \ldots, n\), are the corresponding covariates, \(\mathbf{\beta} = (\beta_{1}, \ldots, \beta_{p})^{\top} \in \mathbb{R}^{p}\) and \(\mathbf{\alpha} = (\alpha_{1}, \ldots, \alpha_{q})^{\top} \in \mathbb{R}^{q}\) are the regression coefficients, and \(\beta_0\) and \(\alpha_0\) are the intercepts.
The errors are usually assumed to be distributed according to a bivariate normal distribution
\[ \begin{pmatrix} \epsilon_{1,i} \\ \epsilon_{2,i} \end{pmatrix} \sim N\left( \begin{pmatrix} 0\\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^{2} & \sigma \rho \\ \sigma \rho & 1 \end{pmatrix} \right) \, , \label{corr_matrix} \]
where \(\sigma>0\) and \(\rho \in (-1,1)\). The outcome is only partially observed: \(y_{i} = y^{*}_{i}\) is recorded when \(s_{i} = I(s^{*}_{i} > 0) = 1\), whereas only the selection indicator is observed when \(s_{i} = 0\).
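To illustrate this data-generating mechanism, the following R sketch simulates from the model; the covariates, parameter values, and variable names are arbitrary choices for illustration and are not used in the analysis below.
# Minimal simulation sketch of the Tobit-2 mechanism (arbitrary parameter values)
set.seed(123)
n <- 1000
x <- rnorm(n)                                  # outcome covariate
w <- rnorm(n)                                  # selection covariate
sigma <- 1; rho <- 0.5
e2 <- rnorm(n)                                 # selection error, variance 1
e1 <- sigma * (rho * e2 + sqrt(1 - rho^2) * rnorm(n))  # outcome error, variance sigma^2, correlation rho
ystar <- 1 + 0.5 * x + e1                      # latent outcome
sstar <- 0.5 + 1.0 * w + e2                    # latent selection
s <- as.numeric(sstar > 0)                     # selection indicator
y <- ifelse(s == 1, ystar, NA)                 # outcome observed only when selected
mean(s)                                        # proportion of observed outcomes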
Let us denote \(\mathbf{\eta} = (\alpha_{0}, \beta_{0}, \mathbf{\alpha}^{\top}, \mathbf{\beta}^{\top}, \sigma, \rho)^{\top}\). The log-likelihood of the parameters for this model is (Toomet and Henningsen 2008)
\[ \begin{split} \ell(\alpha_{0}, \beta_{0}, \mathbf{\alpha}, \mathbf{\beta}, \sigma, \rho) &= \sum_{\{s_{i}=0\}} \log\Phi(-\alpha_{0} - \mathbf{w}_{i}^{\top}\mathbf{\alpha})\\ &+ \sum_{\{s_i=1\}} \log\Phi\left( \frac{\alpha_{0} + \mathbf{w}_{i}^{\top}\mathbf{\alpha} + \rho(y_{i} - \beta_{0} - \mathbf{x}_{i}^{\top}\mathbf{\beta})/\sigma}{\sqrt{1-\rho^{2}}}\right)\\ &- \sum_{\{s_i=1\}} \left[ \log(\sigma) + \frac{1}{2}\log(2\pi) + \frac{1}{2}\left(\frac{y_{i} - \beta_{0} - \mathbf{x}_{i}^{\top}\mathbf{\beta}}{\sigma}\right)^{2}\right]. \end{split} \]
It has been shown that the log-likelihood is not globally concave (Toomet and Henningsen 2008). This often induces numerical challenges in the calculation of the maximum likelihood estimator, and these difficulties are typically associated with the estimation of the parameter \(\rho\). The lack of concavity also suggests that confidence intervals based on asymptotic normality may have low coverage for some of the parameters in small or moderate samples. An alternative method for constructing confidence intervals uses the profile likelihood function, which produces intervals that need not be symmetric around the maximum likelihood estimator, as explained next.
Let \(\mathbf{x} = (x_1,\dots,x_n)^{\top}\) be a sample from a distribution with density \(f(\cdot;\mathbf{\theta})\), \(\mathbf{\theta} \in \Theta \subset {\mathbb R}^p\). Suppose that \(\mathbf{\theta} = (\mathbf{\delta},\mathbf{\xi})\), where \(\mathbf{\delta} \in {\mathbb R}^{p_1}\) is a parameter of interest, and \(\mathbf{\xi} \in {\mathbb R}^{p_2}\) is a nuisance parameter, where \(p=p_1+p_2\). Let:
\[ L(\mathbf{\theta}\mid \mathbf{x}) = L(\mathbf{\delta},\mathbf{\xi} \mid \mathbf{x}) = \prod_{i=1}^{n} f(x_i;\mathbf{\theta}), \]
be the likelihood function of \(\mathbf{\theta}\). Let \(\widehat{\mathbf{\theta}} = (\widehat{\mathbf{\delta}},\widehat{\mathbf{\xi}})\) be the MLE of \(\mathbf{\theta}\). Then, the profile likelihood (or profile likelihood ratio) of \(\mathbf{\delta}\) is:
\[ R(\mathbf{\delta}\mid \mathbf{x}) = \dfrac{L(\mathbf{\delta},\widehat{\mathbf{\xi}}(\mathbf{\delta}) \mid \mathbf{x})}{L(\widehat{\mathbf{\theta}}\mid \mathbf{x})} = \dfrac{\max_{\mathbf{\xi}} L(\mathbf{\delta},\mathbf{\xi} \mid \mathbf{x}) }{\max_{\mathbf{\theta}}L({\mathbf{\theta}}\mid \mathbf{x})}, \]
where \(\max_{\mathbf{\xi}} L(\mathbf{\delta},\mathbf{\xi} \mid \mathbf{x})\) denotes the maximisation of the likelihood with respect to \(\mathbf{\xi}\), for a fixed value of \(\mathbf{\delta}\). The profile likelihood is therefore bounded, \(0<R(\mathbf{\delta}\mid \mathbf{x})\leq 1\).
Now, suppose that the dimension of the parameter of interest is one (\(p_1=1\)). Then,
\[ \Lambda(\delta_0) = -2 \log R(\delta_0 \mid \mathbf{x}) \stackrel{\cdot}{\sim} \chi^2_{1}. \]
Thus, \(\Lambda(\delta_0)\) can be seen as an approximate pivotal quantity which may be used to construct confidence intervals for \(\delta_0\).
Consequently, the set \(\{\delta_{0} : R(\delta_{0}\mid \mathbf{x}) \geq \exp(-\chi^{2}_{1,0.95}/2) \approx 0.147\}\) is an approximate \(95\%\) confidence interval; that is, a \(0.147\) profile-likelihood interval is an approximate \(95\%\) confidence interval. See: Statistical Inference Lecture notes.
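The cutoff can be verified directly in R:
# Relative profile likelihood cutoff corresponding to an approximate 95% confidence level
exp(-qchisq(0.95, df = 1) / 2)   # approximately 0.1465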
The following R code provides a direct implementation of the log-likelihood function, maximum likelihood estimation, and the profile likelihood. A comparison against the point estimates obtained using the R packages sampleSelection and ssmodels is also presented. An illustration is presented using the Ambulatory Expenditures data set (MEPS2001).
rm(list=ls())
# Required R packages
library(ssmrob)
library(ssmodels)
library(sampleSelection)
# Maximum likelihood estimation in Tobit-2 models
# sigma is reparameterised using a log link for unrestricted optimisation
# rho is reparameterised using a tanh link for unrestricted optimisation
# init : initial values on the reparameterised scale
#        (log(sigma), atanh(rho), outcome coefficients, selection coefficients)
# out  : outcome variable; sel : selection indicator
# deso : outcome design matrix (without intercept); dess : selection design matrix (without intercept)
MLETobit2 <- function(init, out, sel, deso, dess, method = "nlminb", maxit = 100){
  out <- as.vector(out); sel <- as.vector(sel)
  # design matrices with intercepts
  x1 <- as.matrix(cbind(1, deso)); w1 <- as.matrix(cbind(1, dess))
  p <- ncol(x1); q <- ncol(w1)
  # negative log-likelihood on the reparameterised scale
  loglik <- function(par){
    sigma <- exp(par[1]); rho <- tanh(par[2])
    beta <- par[3:(p+2)]; alpha <- par[(p+3):(p+q+2)]
    x_beta <- x1 %*% beta; w_alpha <- w1 %*% alpha
    inds0 <- (sel == 0); inds1 <- (sel == 1)
    n1 <- sum(inds1)
    ll <- sum( pnorm(-w_alpha[inds0], log.p = TRUE) ) +
      sum( pnorm( (w_alpha[inds1] + rho*(out[inds1] - x_beta[inds1])/sigma)/sqrt(1 - rho^2), log.p = TRUE) ) -
      0.5*n1*log(2*pi) - n1*log(sigma) - 0.5*sum((out[inds1] - x_beta[inds1])^2/sigma^2)
    return(-ll)
  }
  # minimise the negative log-likelihood
  if(method == "nlminb") OPT <- nlminb(init, loglik, control = list(iter.max = maxit))
  if(method != "nlminb") OPT <- optim(init, loglik, control = list(maxit = maxit), method = method)
  # map the estimates back to the original parameterisation
  MLE <- c(exp(OPT$par[1]), tanh(OPT$par[2]), OPT$par[-c(1,2)])
  names(MLE) <- c("sigma", "rho", "intercept", paste0("deso[,", 1:(p-1), "]"),
                  "intercept", paste0("dess[,", 1:(q-1), "]"))
  outf <- list(loglik = loglik, OPT = OPT, MLE = MLE)
  return(outf)
}
# Ambulatory Expenditure data set
data(MEPS2001)
attach(MEPS2001)
head(MEPS2001)
## educ age income female vgood good hospexp totchr ffs dhospexp age2
## 1 11 3.3 17.472 0 0 1 0 2 1 0 10.889999
## 2 14 2.1 16.920 0 1 0 0 0 0 0 4.409999
## 3 12 4.0 19.620 1 1 0 0 1 0 0 16.000000
## 4 14 5.2 8.563 1 0 1 0 0 0 0 27.039997
## 5 16 5.0 25.360 1 1 0 0 0 0 0 25.000000
## 6 12 3.7 30.000 0 1 0 0 0 1 0 13.690001
## agefem fairpoor year01 instype ambexp lambexp blhisp instype_s1 dambexp
## 1 0.0 0 1 0 760 6.633318 0 0 1
## 2 0.0 0 1 1 497 6.208590 1 0 1
## 3 4.0 0 1 1 1002 6.909753 1 0 1
## 4 5.2 0 1 1 745 6.613384 1 1 1
## 5 5.0 0 1 1 2728 7.911324 0 0 1
## 6 0.0 0 1 0 636 6.455199 0 0 1
## lnambx ins
## 1 6.633318 0
## 2 6.208590 0
## 3 6.909753 0
## 4 6.613384 1
## 5 7.911324 0
## 6 6.455199 0
# Design matrices
# outcome
deso <- cbind(age, female, educ, blhisp, totchr, ins)
# selection
dess <- cbind(age, female, educ, blhisp, totchr, ins, income)
# selection indicator
sel <- as.logical(dambexp)
# outcome
out <- lnambx
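Before the optimisation, it is useful to check the split between selected and non-selected observations; the counts should match the 526 censored and 2802 observed cases reported in the package summaries below.
# Number of non-selected (FALSE) and selected (TRUE) observations
table(sel)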
# Optimisation step
OPT <- MLETobit2(init = rep(0,17), out = out, sel = sel, deso = deso, dess = dess, method = "nlminb", maxit = 10000)
# MLE
MLE <- OPT$MLE
MLE
## sigma rho intercept deso[,1] deso[,2] deso[,3]
## 1.271019107 -0.130599689 5.044053022 0.211974636 0.348144614 0.018716442
## deso[,4] deso[,5] deso[,6] intercept dess[,1] dess[,2]
## -0.218568878 0.539918618 -0.029986445 -0.676044223 0.087935840 0.662664455
## dess[,3] dess[,4] dess[,5] dess[,6] dess[,7]
## 0.061948057 -0.363936181 0.796949467 0.170139544 0.002707656
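As a quick sanity check (output not shown), the maximised log-likelihood of the direct implementation should essentially match the value of -5836.219 reported by both packages below. Approximate standard errors can also be obtained from a numerical Hessian, bearing in mind that the first two entries refer to log(sigma) and atanh(rho) rather than sigma and rho.
# Maximised log-likelihood (nlminb minimises the negative log-likelihood)
-OPT$OPT$objective
# Approximate standard errors from the numerical Hessian of the negative log-likelihood
# (the first two are on the log(sigma) and atanh(rho) scales)
H <- optimHess(OPT$OPT$par, OPT$loglik)
sqrt(diag(solve(H)))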
# Equations
# Selection equation
selectEq <- dambexp ~ age + female + educ + blhisp + totchr + ins + income
# Outcome equation
outcomeEq <- lnambx ~ age + female + educ + blhisp + totchr + ins
# ssmodels R package
HMLE <- HeckmanCL(selectEq, outcomeEq, data = MEPS2001)
summary(HMLE)
##
## --------------------------------------------------------------
## Classic Heckman Model (Package: ssmodels)
## --------------------------------------------------------------
## --------------------------------------------------------------
## Maximum Likelihood estimation
## optim function with method BFGS-iterations numbers: 22
## Log-Likelihood: -5836.219
## AIC: 11706.44 BIC: 11810.31
## Number of observations: ( 526 censored and 2802 observed )
## 17 free parameters ( df= 3311 )
## --------------------------------------------------------------
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.676054 0.194029 -3.484 0.00050 ***
## age 0.087936 0.027421 3.207 0.00135 **
## female 0.662665 0.060937 10.875 < 2e-16 ***
## educ 0.061948 0.012029 5.150 2.76e-07 ***
## blhisp -0.363937 0.061874 -5.882 4.46e-09 ***
## totchr 0.796952 0.071130 11.204 < 2e-16 ***
## ins 0.170137 0.062871 2.706 0.00684 **
## income 0.002708 0.001316 2.057 0.03973 *
## --------------------------------------------------------------
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.04406 0.22813 22.111 < 2e-16 ***
## age 0.21197 0.02301 9.213 < 2e-16 ***
## female 0.34814 0.06011 5.791 7.64e-09 ***
## educ 0.01872 0.01055 1.774 0.076078 .
## blhisp -0.21857 0.05967 -3.663 0.000253 ***
## totchr 0.53992 0.03933 13.727 < 2e-16 ***
## ins -0.02999 0.05109 -0.587 0.557260
## --------------------------------------------------------------
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## sigma 1.27102 0.01838 69.157 <2e-16 ***
## rho -0.13060 0.14708 -0.888 0.375
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------------------------
# sampleSelection
ssMLE <- selection(selectEq, outcomeEq)
summary(ssMLE)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 4 iterations
## Return code 1: gradient close to zero (gradtol)
## Log-Likelihood: -5836.219
## 3328 observations (526 censored and 2802 observed)
## 17 free parameters (df = 3311)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.676054 0.194029 -3.484 0.00050 ***
## age 0.087936 0.027421 3.207 0.00135 **
## female 0.662665 0.060938 10.874 < 2e-16 ***
## educ 0.061948 0.012029 5.150 2.76e-07 ***
## blhisp -0.363938 0.061873 -5.882 4.46e-09 ***
## totchr 0.796951 0.071131 11.204 < 2e-16 ***
## ins 0.170137 0.062871 2.706 0.00684 **
## income 0.002708 0.001317 2.056 0.03982 *
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.04406 0.22813 22.111 < 2e-16 ***
## age 0.21197 0.02301 9.213 < 2e-16 ***
## female 0.34814 0.06011 5.791 7.64e-09 ***
## educ 0.01872 0.01055 1.774 0.076078 .
## blhisp -0.21857 0.05967 -3.663 0.000253 ***
## totchr 0.53992 0.03933 13.727 < 2e-16 ***
## ins -0.02999 0.05109 -0.587 0.557261
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## sigma 1.27102 0.01838 69.157 <2e-16 ***
## rho -0.13060 0.14708 -0.888 0.375
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
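For later comparison with the profile-likelihood interval, a Wald-type 95% confidence interval for \(\rho\) can be computed from the estimate and standard error reported above; this interval relies on the asymptotic normality of the maximum likelihood estimator, which is precisely what may be doubtful here.
# Wald-type 95% confidence interval for rho, based on the reported
# estimate (-0.13060) and standard error (0.14708)
-0.13060 + c(-1, 1) * qnorm(0.975) * 0.14708   # approximately (-0.419, 0.158)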
#######################################################################################
# Profile likelihood
#######################################################################################
po <- ncol(deso)
ps <- ncol(dess)
p <- length(OPT$MLE) # number of parameters
ML <- OPT$OPT$objective # minimised negative log-likelihood (minus the maximum log-likelihood)
# Profile likelihood function for parameter "ind"
# parint : value of the parameter of interest (on the reparameterised scale used in MLETobit2)
# ind    : index of the parameter of interest
# init   : type of initial value for the optimisation, "MLE" or "zero"
prof.lik <- function(parint, ind, init = "MLE"){
  # negative log-likelihood with the parameter of interest fixed at parint
  tempf <- function(par){
    tempv <- rep(0, p)
    tempv <- replace(x = tempv, c(1:p)[-ind], par)
    tempv <- replace(x = tempv, ind, parint)
    return(OPT$loglik(tempv))
  }
  # maximise over the nuisance parameters, starting either at the MLE or at zero
  if(init == "MLE")  out <- -nlminb(OPT$OPT$par[-ind], tempf, control = list(iter.max = 10000))$objective + ML
  if(init == "zero") out <- -nlminb(OPT$OPT$par[-ind]*0, tempf, control = list(iter.max = 10000))$objective + ML
  # relative profile likelihood R(parint)
  return(exp(out))
}
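By construction, the relative profile likelihood evaluated at the maximum likelihood estimate equals one, which gives a quick check of the implementation (output not shown; the value should be numerically close to 1).
# Check: profile likelihood at the MLE of atanh(rho) (parameter 2) should be about 1
prof.lik(parint = OPT$OPT$par[2], ind = 2, init = "MLE")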
# Some Profile likelihoods
# Profile likelihood of Parameter 1
prof1 <- Vectorize(function(par) prof.lik(parint = log(par), ind = 1, init = "zero"))
curve(prof1,1.2,1.35 , n = 50, lwd = 2, xlab = expression(sigma), ylab = "Profile Likelihood")
# Profile likelihood of Parameter 2
prof2 <- Vectorize(function(par) prof.lik(parint = atanh(par), ind = 2, init = "MLE"))
curve(prof2,-0.6,0.25 , n = 50, lwd = 2, xlab = expression(rho), ylab = "Profile Likelihood")
# Profile likelihood of Parameter 3
prof3 <- Vectorize(function(par) prof.lik(parint = par, ind = 3, init = "MLE"))
curve(prof3,4.25,6 , n = 50, lwd = 2, xlab = expression(beta[0]), ylab = "Profile Likelihood")
# Profile likelihood of Parameter 4
prof4 <- Vectorize(function(par) prof.lik(parint = par, ind = 4, init = "MLE"))
curve(prof4,0.125,0.3 , n = 50, lwd = 2, xlab = expression(beta[1]), ylab = "Profile Likelihood")
# We now construct the 0.147 profile likelihood interval for rho
# (prof2 takes rho on its original scale and applies the atanh reparameterisation internally)
intprof2 <- Vectorize(function(rho) prof2(rho) - 0.147 )
# Profile likelihood of rho
curve(prof2,-0.6,0.25 , n = 50, lwd = 2, xlab = expression(rho), ylab = "Profile Likelihood")
abline(h = 0.147, lwd = 2, col = "red")
# Looking at the previous plot helps identifying regions to find the roots
LP <- uniroot(f = intprof2, interval = c(-0.5,-0.4))$root
UP <- uniroot(f = intprof2, interval = c(0.1,0.2))$root
# 95% Confidence interval for rho
c(LP, UP)
## [1] -0.4349697 0.1191926
# Since prof2 is a function of rho on its original scale, LP and UP already form the
# 95% confidence interval for rho and no back-transformation is required.
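The steps above can be wrapped into a small helper that computes a profile-likelihood interval for any parameter of the model. This is a sketch under the same conventions as MLETobit2 and prof.lik (the parameter index and search brackets refer to the reparameterised scale, and the back-transformation is left to the user); for \(\rho\) it should reproduce, up to numerical error, the interval obtained above.
# Sketch of a generic profile-likelihood confidence interval (assumes OPT and prof.lik above)
# ind          : index of the parameter on the reparameterised scale used by MLETobit2
# lower, upper : search brackets on the reparameterised scale, wide enough to bracket the interval
# level        : confidence level
# trans        : back-transformation to the original scale (e.g. exp for sigma, tanh for rho)
prof.ci <- function(ind, lower, upper, level = 0.95, trans = identity){
  cutoff <- exp(-qchisq(level, df = 1) / 2)
  f <- function(val) prof.lik(parint = val, ind = ind, init = "MLE") - cutoff
  mle <- OPT$OPT$par[ind]
  LP <- uniroot(f, interval = c(lower, mle))$root
  UP <- uniroot(f, interval = c(mle, upper))$root
  return(trans(c(LP, UP)))
}
# Example: 95% profile-likelihood interval for rho (parameter 2 is atanh(rho) internally)
# prof.ci(ind = 2, lower = -1, upper = 1, trans = tanh)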