Sample Selection Method

Jackie Finik
11/06/19

Background

  • Developed in econometrics (Heckman, 1979) to address selection bias
  • Originally framed in the context of predicting wages, which are observed only among those 'selected' into the labor force
  • Can be carried out using the original 2-step approach, or simultaneously via maximum likelihood (ML) estimation
  • Other extensions/applications have since been developed

Heckman 'Sample Selection' Model (1979)

  • Proposed a 2-step approach to correct for 'sample selection'

    • \( y_i^* = x'_i\beta + \epsilon_i \) [Outcome Equation]
    • \( d_i^* = z'_i\gamma + \upsilon_i \) [Selection Equation]
      • \( y_i = x'_i\beta + \mu\hat{\lambda}_i + \epsilon_i \) [Final Equation with Heckman Correction]
    • 1) Model selection using probit model with exclusion restriction ('instrument')
      • Estimate \( \gamma \); produce estimate of \( \hat{\lambda} \) ('Inverse Mills Ratio' - IMR)
      • IMR: \( \hat{\lambda}_i = \phi(z'_i\gamma) / \Phi(z'_i\gamma) \)
      • \( \hat{\lambda} \) captures the part of the error term through which selection affects \( y \); the sign of its coefficient \( \mu \) indicates the direction of the correlation between \( \epsilon_i \) and \( \upsilon_i \)
        • If \( \mu \) is negative, then unobserved factors that make selection more likely tend to be associated with lower \( y \)
    • 2) Model outcome of interest with \( \hat{\lambda} \) as a regressor in the model ('control factor')
      • A t-test of \( \mu = 0 \) assesses whether selection bias is present
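The two steps above can be sketched by hand in base R. This is a minimal illustration on simulated data, not the sampleSelection implementation: the variable names (x, z, imr), the true coefficients, and the error correlation of 0.6 are all assumptions chosen for the example.

```r
# Minimal two-step Heckman sketch on simulated data (base R only).
# All names and parameter values below are illustrative assumptions.
set.seed(1)
n   <- 5000
x   <- rnorm(n)   # regressor in both equations
z   <- rnorm(n)   # exclusion restriction: enters the selection equation only
rho <- 0.6        # assumed correlation between the two error terms

u <- rnorm(n)                              # selection error, variance 1
e <- rho * u + sqrt(1 - rho^2) * rnorm(n)  # outcome error, sd 1, cor(e, u) = rho

d <- as.integer(0.5 + x + z + u > 0)       # selection indicator
y <- ifelse(d == 1, 1 + 2 * x + e, NA)     # outcome observed only if selected

# Step 1: probit selection model, then the inverse Mills ratio
probit <- glm(d ~ x + z, family = binomial(link = "probit"))
zg     <- predict(probit, type = "link")   # estimated linear predictor z'gamma
imr    <- dnorm(zg) / pnorm(zg)

# Step 2: outcome regression on the selected sample, with the IMR as a regressor
sel    <- d == 1
naive  <- lm(y ~ x, subset = sel)          # ignores selection
heckit <- lm(y ~ x + imr, subset = sel)    # Heckman-corrected

round(coef(naive), 3)
round(coef(heckit), 3)   # coefficient on imr estimates mu = rho * sigma
```

The standard errors reported by the second-step lm() are not adjusted for the fact that the IMR is itself estimated, so in practice a dedicated routine is preferable for inference.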

Model Assumptions

\( y_i^* = x'_i\beta + \epsilon_i \) [Outcome Equation]

\( d_i^* = z'_i\gamma + \upsilon_i \) [Selection Equation]

\( y_i = x'_i\beta + \mu\hat{\lambda}_i + \epsilon_i \) [Final Equation with Heckman Correction]

Heckman model assumptions

  • 1) Assumes joint normality of errors
  • 2) A suitable exclusion restriction is chosen: at least one factor predicting selection (in \( z \)) does not directly predict \( y \)
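The joint normality assumption is what produces the inverse Mills ratio term in the final equation. A short sketch of the standard derivation follows. It assumes the usual normalisation (not stated on the slide) that \( y_i \) is observed when \( d_i^* > 0 \), with \( \mathrm{Var}(\upsilon_i) = 1 \), \( \mathrm{Var}(\epsilon_i) = \sigma_\epsilon^2 \) and \( \mathrm{Corr}(\epsilon_i, \upsilon_i) = \rho \).

```latex
\begin{aligned}
E[\,y_i \mid d_i^* > 0\,]
  &= x'_i\beta + E[\,\epsilon_i \mid \upsilon_i > -z'_i\gamma\,] \\
  &= x'_i\beta + \rho\,\sigma_\epsilon\,\frac{\phi(z'_i\gamma)}{\Phi(z'_i\gamma)} \\
  &= x'_i\beta + \mu\,\lambda_i, \qquad \mu = \rho\,\sigma_\epsilon
\end{aligned}
```

Under these assumptions \( \mu = 0 \) exactly when \( \rho = 0 \), which is why a test on the coefficient of \( \hat{\lambda} \) is read as a test for selection bias.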

Software

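  • R Packages:
    • 1) sampleSelection (2-step and ML estimation)
    • 2) miceMNAR (Heckman approach combined with multiple imputation)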
  • Stata Functions:
    • 1) heckman built-in function
    • 2) GLLAMM (Rabe-Hesketh) for multilevel selection

Example of implementation in R

  • Greene (2002) example, using the sampleSelection package
  • Female labour supply (OLS, 2-step Heckman, and simultaneous ML estimation with Heckman sample selection correction)
  • Mroz87 data frame contains n=753 married women
  • Data from the “Panel Study of Income Dynamics” (PSID)
  • Of the 753 observations, n=428 are women with positive hours worked in 1975, while n=325 are women who did not work for pay in 1975
  lfp hours kids5 kids618 age educ   wage repwage hushrs husage huseduc
1   1  1610     1       0  32   12 3.3540    2.65   2708     34      12
2   1  1656     0       2  30   12 1.3889    2.65   2310     30       9
3   1  1980     1       3  35   12 4.5455    4.04   3072     40      12
4   1   456     0       3  34   12 1.0965    3.25   1920     53      10
5   1  1568     1       2  31   14 4.5918    3.60   2000     32      12
6   1  2032     0       0  54   12 4.7421    4.70   1040     57      11
  huswage faminc    mtr motheduc fatheduc unem city exper  nwifeinc
1  4.0288  16310 0.7215       12        7  5.0    0    14 10.910060
2  8.4416  21800 0.6615        7        7 11.0    1     5 19.499981
3  3.5807  21040 0.6915       12        7  5.0    0    15 12.039910
4  3.5417   7300 0.7815        7        7  5.0    0     6  6.799996
5 10.0000  27300 0.6215       12       14  9.5    1     7 20.100058
6  6.7106  19495 0.6915       14        7  7.5    1    33  9.859054
  wifecoll huscoll
1    FALSE   FALSE
2    FALSE   FALSE
3    FALSE   FALSE
4    FALSE   FALSE
5     TRUE   FALSE
6    FALSE   FALSE

OLS Regression

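library(sampleSelection)
data("Mroz87")
#indicator for having any children (kids5 or kids618 above zero)
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618 > 0)
#regular OLS model, fit only to women in the labor force (lfp == 1)
ols1 <- lm(wage ~ educ + exper + I(exper^2) + city, data = subset(Mroz87, lfp == 1))
summary(ols1)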

Call:
lm(formula = wage ~ educ + exper + I(exper^2) + city, data = subset(Mroz87, 
    lfp == 1))

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6021 -1.6012 -0.4787  0.8950 21.2762 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.5609920  0.9288390  -2.757  0.00608 ** 
educ         0.4809623  0.0668679   7.193 2.91e-12 ***
exper        0.0324982  0.0615864   0.528  0.59800    
I(exper^2)  -0.0002602  0.0018378  -0.142  0.88747    
city         0.4492741  0.3177735   1.414  0.15815    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.111 on 423 degrees of freedom
Multiple R-squared:  0.1248,    Adjusted R-squared:  0.1165 
F-statistic: 15.08 on 4 and 423 DF,  p-value: 1.569e-11

2-Step Heckman Correction

  • 2 Step Approach
    • 1) Step 1: model labor force participation (selection outcome) ('lfp')
    • 2) Step 2: model wage (outcome of interest)
#estimate the selection model first, then the outcome model
greeneTS <- selection(lfp ~ age + I(age^2) + faminc + kids + educ, wage ~ exper + I(exper^2) + educ + city, data = Mroz87, method = "2step")
#exclusion restriction: variables included in the selection model but not in the outcome model (here age, faminc, kids)
summary(greeneTS)
--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
753 observations (325 censored and 428 observed)
14 free parameters (df = 740)
Probit selection equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.157e+00  1.402e+00  -2.965 0.003127 ** 
age          1.854e-01  6.597e-02   2.810 0.005078 ** 
I(age^2)    -2.426e-03  7.735e-04  -3.136 0.001780 ** 
faminc       4.580e-06  4.206e-06   1.089 0.276544    
kidsTRUE    -4.490e-01  1.309e-01  -3.430 0.000638 ***
educ         9.818e-02  2.298e-02   4.272 2.19e-05 ***
Outcome equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.9712003  2.0593505  -0.472    0.637    
exper        0.0210610  0.0624646   0.337    0.736    
I(exper^2)   0.0001371  0.0018782   0.073    0.942    
educ         0.4170174  0.1002497   4.160 3.56e-05 ***
city         0.4438379  0.3158984   1.405    0.160    
Multiple R-Squared:0.1264,  Adjusted R-Squared:0.116
   Error terms:
              Estimate Std. Error t value Pr(>|t|)
invMillsRatio   -1.098      1.266  -0.867    0.386
sigma            3.200         NA      NA       NA
rho             -0.343         NA      NA       NA
--------------------------------------------
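#rho > 0 would mean observed outcomes are 'better' than average; here rho is negative (-0.343) and the invMillsRatio coefficient is not significant (p = 0.386)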

ML Simultaneous Estimation (with BHHH SE estimation)

  • Berndt-Hall-Hall-Hausman method to obtain SEs (see Greene 2002)
greeneML <- selection(lfp ~ age + I(age^2) + faminc + kids + educ, wage ~ exper + I(exper^2) + educ + city, data = Mroz87, maxMethod = "BHHH")
summary(greeneML)
--------------------------------------------
Tobit 2 model (sample selection model)
Maximum Likelihood estimation
BHHH maximisation, 62 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -1581.259 
753 observations (325 censored and 428 observed)
13 free parameters (df = 740)
Probit selection equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.120e+00  1.410e+00  -2.921  0.00359 ** 
age          1.840e-01  6.584e-02   2.795  0.00532 ** 
I(age^2)    -2.409e-03  7.735e-04  -3.115  0.00191 ** 
faminc       5.676e-06  3.890e-06   1.459  0.14493    
kidsTRUE    -4.507e-01  1.367e-01  -3.298  0.00102 ** 
educ         9.533e-02  2.400e-02   3.973  7.8e-05 ***
Outcome equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.9537242  1.6745690  -1.167    0.244    
exper        0.0284295  0.0753989   0.377    0.706    
I(exper^2)  -0.0001151  0.0023339  -0.049    0.961    
educ         0.4562471  0.0959626   4.754 2.39e-06 ***
city         0.4451424  0.4255420   1.046    0.296    
   Error terms:
      Estimate Std. Error t value Pr(>|t|)    
sigma  3.10350    0.08368  37.088   <2e-16 ***
rho   -0.13328    0.22296  -0.598     0.55    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------
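To show what the simultaneous ML estimation is maximising, here is a minimal base-R sketch of the sample selection log-likelihood, fit with optim() on simulated data. It is an illustration under stated assumptions rather than the sampleSelection code: the data, the true values (rho = 0.5, sigma = 2), the helper name negloglik, and the use of BFGS instead of BHHH are all choices made for this example.

```r
# Sketch: joint ML estimation of a sample selection model (base R only).
# Simulated data and all parameter values are illustrative assumptions.
set.seed(2)
n <- 3000
x <- rnorm(n)
z <- rnorm(n)                       # exclusion restriction
rho_true <- 0.5; sigma_true <- 2
u <- rnorm(n)                       # selection error, variance 1
e <- sigma_true * (rho_true * u + sqrt(1 - rho_true^2) * rnorm(n))
d <- (0.5 + x + z + u) > 0          # TRUE if the outcome is observed
y <- ifelse(d, 1 + 2 * x + e, NA)

# Negative log-likelihood. par = (3 selection coefs, 2 outcome coefs,
# log(sigma), atanh(rho)); the transforms keep sigma > 0 and |rho| < 1.
negloglik <- function(par) {
  g <- par[1:3]; b <- par[4:5]
  sigma <- exp(par[6]); rho <- tanh(par[7])
  zg <- g[1] + g[2] * x + g[3] * z
  r  <- (y[d] - b[1] - b[2] * x[d]) / sigma          # standardised residuals
  ll_obs  <- dnorm(r, log = TRUE) - log(sigma) +
    pnorm((zg[d] + rho * r) / sqrt(1 - rho^2), log.p = TRUE)
  ll_cens <- pnorm(-zg[!d], log.p = TRUE)            # P(not selected)
  -(sum(ll_obs) + sum(ll_cens))
}

# Starting values from a probit and an OLS fit on the selected sample
probit <- glm(d ~ x + z, family = binomial(link = "probit"))
ols    <- lm(y ~ x, subset = d)
start  <- c(coef(probit), coef(ols), log(summary(ols)$sigma), 0)

fit <- optim(start, negloglik, method = "BFGS", control = list(maxit = 1000))
est <- c(beta_x = fit$par[[5]], sigma = exp(fit$par[[6]]), rho = tanh(fit$par[[7]]))
round(est, 3)
```

Unlike the 2-step approach, sigma and rho are estimated directly as parameters here, which is why the ML output reports standard errors for both.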

Debate in the literature

  • Debate regarding 2 step approach vs. ML methods
    • Multivariate selection models (estimated by ML) are more complex and rely on more restrictive distributional assumptions, but are more efficient
    • The 2-step approach is more robust to violations of assumptions and to model misspecification, but can produce inaccurate SEs
    • Some econometricians recommend ML (Nawata 2004), while others argue that the 2-step approach is more realistic for real-world data (Chiburis & Lokshin, 2007)
    • Galimard et al. (2018) found that the ML method, combined with multiple imputation (MI) via Fully Conditional Specification (FCS), provided the least biased estimates

Advantages for handling missing data

  • Specifying individual imputation models is less efficient than a one-step ML estimation with Heckman's correction
  • MI often requires joint models for the incomplete variable and its indicators
  • Galimard et al 2018:
    • Extended a Heckman ML estimation for binary outcomes
    • Combined Heckman imputation of MNAR outcomes with standard MI
      • Included the missing data indicator in all imputation models
    • Only FCS MI with Heckman's correction produced unbiased estimates for MNAR outcomes
      • For MAR predictors, the combined approach outperformed all other methods
  • Combining the Heckman approach with MI is now possible for binary and continuous outcomes via the miceMNAR package in R (supplementary code available in Galimard et al. 2018)

Galimard et al. 2018

Simulation Study (Binary Outcome)

  • HEml: ML estimation with Heckman correction
  • MIHEml: multiple imputation with above ML estimation w/ Heckman correction
    • Y Axis: %Bias [percent relative bias]
    • [Figure: simulation results, binary outcome (Galimard et al. 2018)]

Galimard et al. 2018

Simulation Study (Continuous Outcome)

  • HE2steps: Heckman's 2 step approach
  • MIHE2steps: multiple imputation with Heckman's 2 step approach
    • Y Axis: %Bias [percent relative bias]
    • [Figure: simulation results, continuous outcome (Galimard et al. 2018)]
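For reference, percent relative bias in a simulation study is usually computed as the average deviation of the estimates from the true value, expressed as a percentage of the true value. The helper below is a hypothetical sketch of that usual definition; it is assumed here and not taken from the Galimard et al. code.

```r
# Hypothetical helper: percent relative bias across simulation replicates.
# Assumes the usual definition 100 * (mean(estimate) - truth) / truth.
percent_relative_bias <- function(estimates, truth) {
  100 * (mean(estimates) - truth) / truth
}

# Example: three replicate estimates of a parameter whose true value is 2
prb <- percent_relative_bias(c(1.9, 2.1, 2.3), truth = 2)
prb   # about 5 (percent)
```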

Extended implementations

  • Panel Data:
    • Diggle & Kenward (1994) extended the method to panel data ('outcome-dependent selection models'), where the 'selection equation' models dropout as a function of current (missing) responses and lagged responses
    • ML with the Heckman estimator is preferred, as the 2-step approach does not account for within-panel correlation
    • Random effects model with sample selection for panel data applications only available in Stata GLLAMM
  • Implementation in hierarchical data
    • GLLAMM can be used to extend the Diggle & Kenward method to multilevel models where selection may occur at multiple levels