Sample Selection Method

Jackie Finik
11/06/19

Background

Developed in the field of econometrics (Heckman, 1979) to address selection bias
Original context: predicting wages among those 'selected' into the labor force
The approach can be carried out using the original 2-step procedure, or simultaneously via maximum likelihood (ML) estimation
Other extensions/applications have since been developed

Heckman 'Sample Selection' Model (1979)

Proposed a 2-step approach to correct for 'sample selection'
- \( y_i^* = x'_i\beta + \epsilon_i \) [Outcome Equation]
- \( d_i^* = z'_i\gamma + \upsilon_i \) [Selection Equation]
  - \( y_i = x'_i\beta + \mu\hat{\lambda}_i + \epsilon_i \) [Final Equation with Heckman Correction]
- 1) Model selection using probit model with exclusion restriction ('instrument')
  - Estimate \( \gamma \); produce estimate of \( \hat{\lambda} \) ('Inverse Mills Ratio' - IMR)
  - IMR: \( \hat{\lambda}_i = \phi(z'_i\gamma) / \Phi(z'_i\gamma) \)
  - \( \hat{\lambda} \) captures the part of the error term through which selection affects \( y \); the IMR itself is always positive, so it is the sign of its coefficient \( \mu \) that indicates the sign of the correlation between \( \epsilon_i \) and \( \upsilon_i \)
    - If \( \mu \) is negative, then unobserved factors that make selection more likely tend to be associated with a decrease in \( y \)
- 2) Model outcome of interest with \( \hat{\lambda} \) as a regressor in the model ('control factor')
  - A t-test of \( \mu = 0 \) tests whether selection bias is present
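The two steps above can be sketched by hand in base R. This is a minimal illustration on simulated data, not the author's analysis: the variable names, sample size, coefficients and error correlation (0.7) are all assumptions chosen for the example.

```r
# Simulated data (all values below are illustrative assumptions)
set.seed(1979)
n   <- 5000
x   <- rnorm(n)                               # covariate in both equations
z   <- rnorm(n)                               # exclusion restriction: selection only
rho <- 0.7                                    # assumed corr(epsilon, upsilon)
v   <- rnorm(n)                               # selection error, variance 1
e   <- rho * v + sqrt(1 - rho^2) * rnorm(n)   # outcome error, corr(e, v) = rho
d   <- as.integer(0.5 + x + z + v > 0)        # selection indicator
y   <- ifelse(d == 1, 1 + 2 * x + e, NA)      # outcome observed only if selected
dat <- data.frame(y, x, z, d)

# Step 1: probit selection model, then the inverse Mills ratio
probit  <- glm(d ~ x + z, family = binomial(link = "probit"), data = dat)
index   <- predict(probit, type = "link")     # estimated z'gamma
dat$imr <- dnorm(index) / pnorm(index)

# Step 2: outcome model on the selected sample, with the IMR as a regressor
naive <- lm(y ~ x, data = dat, subset = d == 1)
heck  <- lm(y ~ x + imr, data = dat, subset = d == 1)

round(rbind(naive = c(coef(naive), imr = NA), heckman = coef(heck)), 3)
```

The standard errors that `lm()` reports in step 2 ignore the fact that the IMR is itself estimated, so they should not be used for inference; dedicated routines such as `selection()` in sampleSelection adjust for this.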

Model Assumptions

\( y_i^* = x'_i\beta + \epsilon_i \) [Outcome Equation]

\( d_i^* = z'_i\gamma + \upsilon_i \) [Selection Equation]

\( y_i = x'_i\beta + \mu\hat{\lambda}_i + \epsilon_i \) [Final Equation with Heckman Correction]
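The correction term in the final equation can be motivated by a short derivation. The sketch below assumes that \( y_i \) is observed when \( d_i^* > 0 \), that \( (\epsilon_i, \upsilon_i) \) are bivariate normal with correlation \( \rho \), that \( \upsilon_i \) has unit variance (the usual probit normalization), and that \( \sigma_\epsilon \) denotes the standard deviation of \( \epsilon_i \). The symbols \( \rho \) and \( \sigma_\epsilon \) are introduced here for illustration.

```latex
\begin{align*}
E[y_i \mid d_i^* > 0]
  &= x'_i\beta + E[\epsilon_i \mid \upsilon_i > -z'_i\gamma] \\
  &= x'_i\beta + \rho\,\sigma_\epsilon\,
     \frac{\phi(z'_i\gamma)}{\Phi(z'_i\gamma)} \\
  &= x'_i\beta + \mu\,\lambda_i,
  \qquad \mu = \rho\,\sigma_\epsilon .
\end{align*}
```

Because \( \sigma_\epsilon > 0 \), the sign of \( \mu \) matches the sign of \( \rho \), and \( \mu = 0 \) corresponds to no correlation between the two error terms.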

Heckman model assumptions

1) Joint normality of the errors (\( \epsilon_i \), \( \upsilon_i \))
2) A suitable exclusion restriction: at least one factor predicting selection (in \( z \)) does not directly predict \( y \)
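One way to see why the exclusion restriction matters is to check how closely the IMR tracks a covariate that also appears in the outcome equation. The following is a rough sketch on simulated data; the variable names, coefficients and sample size are assumptions made for the example.

```r
# Illustrative simulation (all names and values are assumptions)
set.seed(42)
n <- 5000
x <- rnorm(n)   # covariate that also enters the outcome equation
z <- rnorm(n)   # hypothetical excluded variable (affects selection only)
v <- rnorm(n)   # selection error

imr <- function(index) dnorm(index) / pnorm(index)

# Case A: no exclusion restriction, selection depends on x alone
d_a   <- as.integer(0.5 + x + v > 0)
fit_a <- glm(d_a ~ x, family = binomial(link = "probit"))
imr_a <- imr(predict(fit_a, type = "link"))

# Case B: selection also depends on the excluded variable z
d_b   <- as.integer(0.5 + x + z + v > 0)
fit_b <- glm(d_b ~ x + z, family = binomial(link = "probit"))
imr_b <- imr(predict(fit_b, type = "link"))

# Correlation between x and the IMR within the selected sample
cor_a <- cor(x[d_a == 1], imr_a[d_a == 1])
cor_b <- cor(x[d_b == 1], imr_b[d_b == 1])
round(c(no_exclusion = cor_a, with_exclusion = cor_b), 3)
```

Without an excluded variable, the IMR is a smooth function of the same covariates used in the outcome model, so it is nearly collinear with them and \( \mu \) is identified only through the nonlinearity of the IMR. Adding a variable that shifts selection but not the outcome weakens that collinearity.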

Software

R Packages:
- 1) sampleSelection (Toomet & Henningsen)
- 2) miceMNAR (Galimard & Resche-Rigon) for MI + Heckman applications
Stata Commands:
- 1) heckman (built-in command)
- 2) GLLAMM (Rabe-Hesketh) for multilevel selection

Example of implementation in R

- Greene (2002) using sampleSelection package
- Female labour supply (OLS, 2 Step Heckman, ML simultaneous estimation using Heckman sample selection correction)
- Mroz87 data frame contains n=753 married women
- Data from the “Panel Study of Income Dynamics” (PSID)
- Of the 753 observations, n=428 are women with positive hours worked in 1975, while n=325 are women who did not work for pay in 1975

  lfp hours kids5 kids618 age educ   wage repwage hushrs husage huseduc
1   1  1610     1       0  32   12 3.3540    2.65   2708     34      12
2   1  1656     0       2  30   12 1.3889    2.65   2310     30       9
3   1  1980     1       3  35   12 4.5455    4.04   3072     40      12
4   1   456     0       3  34   12 1.0965    3.25   1920     53      10
5   1  1568     1       2  31   14 4.5918    3.60   2000     32      12
6   1  2032     0       0  54   12 4.7421    4.70   1040     57      11
  huswage faminc    mtr motheduc fatheduc unem city exper  nwifeinc
1  4.0288  16310 0.7215       12        7  5.0    0    14 10.910060
2  8.4416  21800 0.6615        7        7 11.0    1     5 19.499981
3  3.5807  21040 0.6915       12        7  5.0    0    15 12.039910
4  3.5417   7300 0.7815        7        7  5.0    0     6  6.799996
5 10.0000  27300 0.6215       12       14  9.5    1     7 20.100058
6  6.7106  19495 0.6915       14        7  7.5    1    33  9.859054
  wifecoll huscoll
1    FALSE   FALSE
2    FALSE   FALSE
3    FALSE   FALSE
4    FALSE   FALSE
5     TRUE   FALSE
6    FALSE   FALSE

OLS Regression

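library(sampleSelection)
data("Mroz87")
#indicator for any children (used later in the selection equation)
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618 > 0)
#regular OLS model, fitted only to women in the labor force (lfp == 1)
ols1 <- lm(wage ~ educ + exper + I(exper^2) + city, data = subset(Mroz87, lfp == 1))
summary(ols1)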


Call:
lm(formula = wage ~ educ + exper + I(exper^2) + city, data = subset(Mroz87, 
    lfp == 1))

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6021 -1.6012 -0.4787  0.8950 21.2762 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.5609920  0.9288390  -2.757  0.00608 ** 
educ         0.4809623  0.0668679   7.193 2.91e-12 ***
exper        0.0324982  0.0615864   0.528  0.59800    
I(exper^2)  -0.0002602  0.0018378  -0.142  0.88747    
city         0.4492741  0.3177735   1.414  0.15815    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.111 on 423 degrees of freedom
Multiple R-squared:  0.1248,    Adjusted R-squared:  0.1165 
F-statistic: 15.08 on 4 and 423 DF,  p-value: 1.569e-11

2-Step Heckman Correction

2 Step Approach
- 1) Step 1: model labor force participation (selection outcome) ('lfp')
- 2) Step 2: model wage (outcome of interest)

#estimate the selection equation, followed by the outcome equation
#exclusion restriction: variables included in the selection model but not in the outcome model (here age, faminc, kids)
greeneTS <- selection(lfp ~ age + I(age^2) + faminc + kids + educ,
                      wage ~ exper + I(exper^2) + educ + city,
                      data = Mroz87, method = "2step")
summary(greeneTS)

--------------------------------------------
Tobit 2 model (sample selection model)
2-step Heckman / heckit estimation
753 observations (325 censored and 428 observed)
14 free parameters (df = 740)
Probit selection equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.157e+00  1.402e+00  -2.965 0.003127 ** 
age          1.854e-01  6.597e-02   2.810 0.005078 ** 
I(age^2)    -2.426e-03  7.735e-04  -3.136 0.001780 ** 
faminc       4.580e-06  4.206e-06   1.089 0.276544    
kidsTRUE    -4.490e-01  1.309e-01  -3.430 0.000638 ***
educ         9.818e-02  2.298e-02   4.272 2.19e-05 ***
Outcome equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.9712003  2.0593505  -0.472    0.637    
exper        0.0210610  0.0624646   0.337    0.736    
I(exper^2)   0.0001371  0.0018782   0.073    0.942    
educ         0.4170174  0.1002497   4.160 3.56e-05 ***
city         0.4438379  0.3158984   1.405    0.160    
Multiple R-Squared:0.1264,  Adjusted R-Squared:0.116
   Error terms:
              Estimate Std. Error t value Pr(>|t|)
invMillsRatio   -1.098      1.266  -0.867    0.386
sigma            3.200         NA      NA       NA
rho             -0.343         NA      NA       NA
--------------------------------------------

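#rho > 0 would indicate that observed outcomes are 'better' than average; here rho is negative (-0.343), but the invMillsRatio coefficient is not significantly different from 0 (p = 0.386)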
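The error-term rows of the 2-step output are linked: the coefficient on the inverse Mills ratio equals rho times sigma. A quick arithmetic check, using only the rounded values printed in the summary above (df = 740 is taken from the same output):

```r
# Values copied from the 2-step summary output
sigma    <- 3.200
rho      <- -0.343
imr_coef <- -1.098
imr_se   <- 1.266

rho * sigma                                  # -1.0976, matches invMillsRatio up to rounding

# t statistic and two-sided p-value for the invMillsRatio coefficient
t_stat <- imr_coef / imr_se
p_val  <- 2 * pt(-abs(t_stat), df = 740)
round(c(t = t_stat, p = p_val), 3)
```

The recomputed t statistic and p-value agree with the printed row, so in this example the test gives no strong evidence of selection bias.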

ML Simultaneous Estimation (with BHHH SE estimation)

Berndt-Hall-Hall-Hausman method to obtain SEs (see Greene 2002)

greeneML <- selection(lfp ~ age + I(age^2) + faminc + kids + educ,
                      wage ~ exper + I(exper^2) + educ + city,
                      data = Mroz87, maxMethod = "BHHH")
summary(greeneML)

--------------------------------------------
Tobit 2 model (sample selection model)
Maximum Likelihood estimation
BHHH maximisation, 62 iterations
Return code 2: successive function values within tolerance limit
Log-Likelihood: -1581.259 
753 observations (325 censored and 428 observed)
13 free parameters (df = 740)
Probit selection equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.120e+00  1.410e+00  -2.921  0.00359 ** 
age          1.840e-01  6.584e-02   2.795  0.00532 ** 
I(age^2)    -2.409e-03  7.735e-04  -3.115  0.00191 ** 
faminc       5.676e-06  3.890e-06   1.459  0.14493    
kidsTRUE    -4.507e-01  1.367e-01  -3.298  0.00102 ** 
educ         9.533e-02  2.400e-02   3.973  7.8e-05 ***
Outcome equation:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.9537242  1.6745690  -1.167    0.244    
exper        0.0284295  0.0753989   0.377    0.706    
I(exper^2)  -0.0001151  0.0023339  -0.049    0.961    
educ         0.4562471  0.0959626   4.754 2.39e-06 ***
city         0.4451424  0.4255420   1.046    0.296    
   Error terms:
      Estimate Std. Error t value Pr(>|t|)    
sigma  3.10350    0.08368  37.088   <2e-16 ***
rho   -0.13328    0.22296  -0.598     0.55    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
--------------------------------------------
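To compare the three fits side by side, the education coefficient and its standard error can be collected from the printed summaries. The sketch below copies the numbers by hand and uses approximate normal-theory 95% intervals, which is an assumption made for convenience rather than what the package reports.

```r
# Education coefficient and SE copied from the OLS, 2-step and ML summaries
cmp <- data.frame(
  model    = c("OLS", "Heckman 2-step", "Heckman ML"),
  estimate = c(0.4809623, 0.4170174, 0.4562471),
  se       = c(0.0668679, 0.1002497, 0.0959626)
)
cmp$lower <- cmp$estimate - qnorm(0.975) * cmp$se   # approximate 95% CI
cmp$upper <- cmp$estimate + qnorm(0.975) * cmp$se
print(cmp, digits = 3)

# Test of rho = 0 using the ML estimate and its SE (df from the output)
rho_t <- -0.13328 / 0.22296
rho_p <- 2 * pt(-abs(rho_t), df = 740)
round(c(t = rho_t, p = rho_p), 3)
```

Each interval contains all three point estimates, and the estimate of rho is small relative to its standard error, so in this example the selection correction does not materially change the estimated return to education.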

Debate in the literature

Debate regarding 2 step approach vs. ML methods
- Multivariate selection models add complexity and impose more restrictive distributional assumptions, but increase efficiency
- The 2-step approach is more robust to violations of assumptions and to model misspecification, but can produce inaccurate SEs
- Some econometricians recommend ML (Nawata, 2004), while others argue that the 2-step approach is more realistic for real-world data (Chiburis & Lokshin, 2007)
- Galimard et al. (2018) found that the ML method, combined with multiple imputation (MI) via Fully Conditional Specification (FCS), provided the least biased estimates

Advantages for handling missing data

Specifying individual imputation models is less efficient than a one-step ML estimation w/ Heckman's correction
MI often requires joint models for the incomplete variable and its indicators
Galimard et al 2018:
- Extended a Heckman ML estimation for binary outcomes
- Combined Heckman imputation of MNAR outcomes with standard MI
  - Included the missing data indicator in all imputation models
- Only FCS MI with Heckman's correction produced unbiased estimates for MNAR outcomes
  - For MAR predictors, the combined approach outperformed all other methods
Combining the Heckman approach with MI is now possible via the miceMNAR package in R (supplementary code available via Galimard et al. 2018), for both binary and continuous outcomes

Galimard et al. 2018

Simulation Study (Binary Outcome)

HEml: ML estimation with Heckman correction
MIHEml: multiple imputation with above ML estimation w/ Heckman correction
- Y Axis: %Bias [percent relative bias]

Galimard et al. 2018

Simulation Study (Continuous Outcome)

HE2steps: Heckman's 2 step approach
MIHE2steps: multiple imputation with Heckman's 2 step approach
- Y Axis: %Bias [percent relative bias]

Extended implementations

Panel Data:
- Diggle & Kenward (1994) extended the method to panel data ('outcome-dependent selection models'), where the 'selection equation' models dropout as a function of current (missing) responses and lagged responses
- ML w/ Heckman estimator preferred, as the 2-step approach does not account for within-panel correlation
- Random effects model with sample selection for panel data applications only available in Stata GLLAMM
Implementation in hierarchical data:
- GLLAMM can be used to extend the Diggle & Kenward method to multilevel models where selection may occur at multiple levels