Overview

Sample-selection bias in OLS regressions arises when the process by which observations are selected into the sample depends on the dependent variable in a way that creates correlation between the error term and the regressors. It is one source of endogeneity. This part practises the Heckman sample-selection model. The data used here is Mroz87, which contains 1975 data on married women’s pay and labor-force participation, from a well-known paper by Thomas Mroz (1987). In this setting, people are assumed to work for pay only if their expected wage exceeds some threshold value; otherwise working is not worthwhile for them and they may choose not to work. The sample of workers in the labor force, for whom we observe earnings, is therefore selected from those in the population who would be paid above that threshold. Most highly educated workers would earn above the threshold, so they are well represented in the sample. Less-educated workers usually earn less, so those who do appear in the sample tend to be the ones with unusually large positive error terms. In the selected sample, education is therefore correlated with the error term, which biases the OLS coefficient on education.
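
The threshold mechanism described above can be illustrated with a small simulation. This is only a sketch on made-up data: the true slope of 0.5, the noise level and the threshold of 6 are assumptions chosen for illustration and have nothing to do with the Mroz87 estimates.

```r
# Simulated illustration of selection bias (all numbers are invented for this sketch)
set.seed(1)
n <- 20000
educ <- sample(8:17, n, replace = TRUE)       # years of schooling
wage <- 1 + 0.5 * educ + rnorm(n, sd = 2)     # true slope on educ is 0.5
threshold <- 6                                # hypothetical reservation wage
works <- wage > threshold                     # wage is observed only above the threshold

fit_all <- lm(wage ~ educ)                    # infeasible benchmark: uses everyone
fit_sel <- lm(wage ~ educ, subset = works)    # what we can actually run: workers only

# Compare the two slopes on educ
round(c(full     = unname(coef(fit_all)["educ"]),
        selected = unname(coef(fit_sel)["educ"])), 3)

# Average true error among observed workers, by education level
err <- wage - (1 + 0.5 * educ)
round(tapply(err[works], educ[works], mean), 2)
```

The second table is the point of the exercise: among observed workers the average error is larger at low education levels than at high ones, which is the correlation between education and the error term that biases OLS on the selected sample.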

Wooldridge - Sample selection - Logit and Probit model

This part practises the sample selection model on the Mroz87 data using OLS, Logit and Probit models (Wooldridge, page 593). These models only estimate the labor force participation of married women.

#First, let's have a look at the data
library(sampleSelection)
## Loading required package: maxLik
## Loading required package: miscTools
## 
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
## 
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
data("Mroz87")
head(Mroz87)
##   lfp hours kids5 kids618 age educ   wage repwage hushrs husage huseduc
## 1   1  1610     1       0  32   12 3.3540    2.65   2708     34      12
## 2   1  1656     0       2  30   12 1.3889    2.65   2310     30       9
## 3   1  1980     1       3  35   12 4.5455    4.04   3072     40      12
## 4   1   456     0       3  34   12 1.0965    3.25   1920     53      10
## 5   1  1568     1       2  31   14 4.5918    3.60   2000     32      12
## 6   1  2032     0       0  54   12 4.7421    4.70   1040     57      11
##   huswage faminc    mtr motheduc fatheduc unem city exper  nwifeinc
## 1  4.0288  16310 0.7215       12        7  5.0    0    14 10.910060
## 2  8.4416  21800 0.6615        7        7 11.0    1     5 19.499981
## 3  3.5807  21040 0.6915       12        7  5.0    0    15 12.039910
## 4  3.5417   7300 0.7815        7        7  5.0    0     6  6.799996
## 5 10.0000  27300 0.6215       12       14  9.5    1     7 20.100058
## 6  6.7106  19495 0.6915       14        7  7.5    1    33  9.859054
##   wifecoll huscoll
## 1    FALSE   FALSE
## 2    FALSE   FALSE
## 3    FALSE   FALSE
## 4    FALSE   FALSE
## 5     TRUE   FALSE
## 6    FALSE   FALSE
tail(Mroz87)
##     lfp hours kids5 kids618 age educ wage repwage hushrs husage huseduc
## 748   0     0     0       2  36   12    0       0   3120     39      12
## 749   0     0     0       2  40   13    0       0   3020     43      16
## 750   0     0     2       3  31   12    0       0   2056     33      12
## 751   0     0     0       0  43   12    0       0   2383     43      12
## 752   0     0     0       0  60   12    0       0   1705     55       8
## 753   0     0     0       3  39    9    0       0   3120     48      12
##     huswage faminc    mtr motheduc fatheduc unem city exper nwifeinc
## 748  1.3013   5330 0.7915        7       12 14.0    0     4    5.330
## 749  9.2715  28200 0.6215       10       10  9.5    1     5   28.200
## 750  4.8638  10000 0.7715       12       12  7.5    0    14   10.000
## 751  1.0898   9952 0.7515       10        3  7.5    0     4    9.952
## 752 12.4400  24984 0.6215       12       12 14.0    1    15   24.984
## 753  6.0897  28363 0.6915        7        7 11.0    1    12   28.363
##     wifecoll huscoll
## 748    FALSE   FALSE
## 749     TRUE    TRUE
## 750    FALSE   FALSE
## 751    FALSE   FALSE
## 752    FALSE   FALSE
## 753    FALSE   FALSE
str(Mroz87)
## 'data.frame':    753 obs. of  22 variables:
##  $ lfp     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hours   : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ kids5   : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ kids618 : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ age     : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educ    : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ wage    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ repwage : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hushrs  : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ husage  : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ huseduc : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ huswage : num  4.03 8.44 3.58 3.54 10 ...
##  $ faminc  : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ mtr     : num  0.722 0.661 0.692 0.781 0.622 ...
##  $ motheduc: int  12 7 12 7 12 14 14 3 7 7 ...
##  $ fatheduc: int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unem    : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city    : int  0 1 0 0 1 1 0 0 0 0 ...
##  $ exper   : int  14 5 15 6 7 33 11 35 24 21 ...
##  $ nwifeinc: num  10.9 19.5 12 6.8 20.1 ...
##  $ wifecoll: Factor w/ 2 levels " TRUE","FALSE": 2 2 2 2 1 2 1 2 2 2 ...
##  $ huscoll : Factor w/ 2 levels " TRUE","FALSE": 2 2 2 2 2 2 2 2 2 2 ...
hist(Mroz87$lfp)

hist(Mroz87$hours)

hist(Mroz87$nwifeinc)

hist(Mroz87$wage)

hist(Mroz87$educ)

hist(Mroz87$exper)

hist(Mroz87$kids5)

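#LPM (OLS) was intended here. Note: multinom() fits a logistic regression, so with a binary
#outcome this step reproduces the logit model below rather than a linear probability model.
library(nnet)
library(ggplot2)
library(reshape2)
LPMw <- multinom(lfp ~ nwifeinc + educ + exper + exper^2+age+kids5+kids618, data = Mroz87) # multinomial logistic regression; exper^2 without I() reduces to exper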
## # weights:  8 (7 variable)
## initial  value 521.939827 
## iter  10 value 406.146576
## final  value 406.143184 
## converged
summary(LPMw)
## Call:
## multinom(formula = lfp ~ nwifeinc + educ + exper + exper^2 + 
##     age + kids5 + kids618, data = Mroz87)
## 
## Coefficients:
##                  Values   Std. Err.
## (Intercept)  0.83791334 0.840938455
## nwifeinc    -0.02021676 0.008263717
## educ         0.22697857 0.043295540
## exper        0.11974650 0.013626450
## age         -0.09108890 0.014320723
## kids5       -1.43940849 0.201499775
## kids618      0.05816886 0.073380123
## 
## Residual Deviance: 812.2864 
## AIC: 826.2864
#Logit model (MLE)
logitW <- glm(lfp ~ nwifeinc + educ + exper + exper^2+age+kids5+kids618,
              family = binomial(link = "logit"), data = Mroz87) 
summary(logitW)
## 
## Call:
## glm(formula = lfp ~ nwifeinc + educ + exper + exper^2 + age + 
##     kids5 + kids618, family = binomial(link = "logit"), data = Mroz87)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5261  -0.9223   0.4489   0.8978   2.3170  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.837909   0.840933   0.996   0.3191    
## nwifeinc    -0.020216   0.008264  -2.446   0.0144 *  
## educ         0.226977   0.043295   5.243 1.58e-07 ***
## exper        0.119746   0.013626   8.788  < 2e-16 ***
## age         -0.091088   0.014321  -6.361 2.01e-10 ***
## kids5       -1.439393   0.201498  -7.143 9.10e-13 ***
## kids618      0.058174   0.073380   0.793   0.4279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  812.29  on 746  degrees of freedom
## AIC: 826.29
## 
## Number of Fisher Scoring iterations: 4
#Probit model (MLE)
ProbitW <- glm(lfp ~ nwifeinc + educ + exper + exper^2+age+kids5+kids618,
              family = binomial(link = "probit"), data = Mroz87) 
summary(ProbitW)
## 
## Call:
## glm(formula = lfp ~ nwifeinc + educ + exper + exper^2 + age + 
##     kids5 + kids618, family = binomial(link = "probit"), data = Mroz87)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6047  -0.9283   0.4417   0.9021   2.3471  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.579574   0.495537   1.170   0.2422    
## nwifeinc    -0.011565   0.004858  -2.380   0.0173 *  
## educ         0.133690   0.025254   5.294 1.20e-07 ***
## exper        0.070217   0.007693   9.127  < 2e-16 ***
## age         -0.055555   0.008305  -6.689 2.24e-11 ***
## kids5       -0.874290   0.117359  -7.450 9.35e-14 ***
## kids618      0.034546   0.043376   0.796   0.4258    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  812.44  on 746  degrees of freedom
## AIC: 826.44
## 
## Number of Fisher Scoring iterations: 4
#These models only estimate whether or not married women participate in the labor force.

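### These results are a little different from the results in Wooldridge's book. One likely reason is that exper^2 inside an R formula is treated as a formula operator and reduces to exper, so the squared term is dropped (no exper^2 coefficient appears in the output above); it needs to be written as I(exper^2). In addition, the step labelled LPM was fitted with multinom() rather than lm().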
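
A minimal sketch of how the three participation models could be re-specified so that the squared experience term is actually included. The formula below is an assumption about the intended specification (the same regressors as above, with I(exper^2)); the object names lpm_fix, logit_fix and probit_fix are introduced here for illustration only, and no output is shown because the models were not re-run for this write-up.

```r
# How R parses the two ways of writing the squared term (base R only)
attr(terms(lfp ~ exper + exper^2), "term.labels")     # the squared term collapses into exper
attr(terms(lfp ~ exper + I(exper^2)), "term.labels")  # the squared term is kept

library(sampleSelection)   # provides the Mroz87 data
data("Mroz87")

# Assumed intended specification, with the quadratic protected by I()
f <- lfp ~ nwifeinc + educ + exper + I(exper^2) + age + kids5 + kids618

lpm_fix    <- lm(f, data = Mroz87)                                       # linear probability model (OLS)
logit_fix  <- glm(f, family = binomial(link = "logit"),  data = Mroz87)  # logit (MLE)
probit_fix <- glm(f, family = binomial(link = "probit"), data = Mroz87)  # probit (MLE)

# Coefficients side by side
round(cbind(LPM = coef(lpm_fix), Logit = coef(logit_fix), Probit = coef(probit_fix)), 4)
```

Note that the usual OLS standard errors of a linear probability model are heteroskedastic by construction, so robust standard errors are normally reported with it.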

Thomas Mroz

In the paper published in 1987 in Econometrica, Thomas Mroz undertakes a systematic analysis of several theoretical and statistical assumptions used in many empirical models of female labor supply. In this paper, Mroz uses the Tobit assumption to control for self-selection into the labor force and tests the exogeneity assumptions on the wife’s wage rate and her labor market experience.

In the first stage, Mroz estimates the basic labor supply model with several sets of variables by two-stage least squares. There are two approaches in this stage: the first 4 equations use labor market experience as an instrumental variable for the wife’s wage rate, and the last 4 estimates use age and education as instruments but do not contain measures of the wife’s labor market experience.

In the second part, Mroz uses Hausman tests to test the endogeneity of the variables. The results of these tests suggest that hours of work (hours) and labor market experience (exper) of married women can be treated as endogenous variables when estimating married women’s wages. There is no evidence that non-wife income (nwifeinc) and the number of children (kids5, kids618) are endogenous.
# Estimating model specifications with controls for self-selection bias

In this data, women’s labor market experience is an endogenous variable in the equation for women’s wages. Women who have worked many years in the past tend to have higher wages and to work more in the present, and women’s work experience is clearly correlated with the predicted wage. Experience can therefore be considered an endogenous variable in the wage regression. The number of children a woman has (“kids”) is used as an instrumental variable to address the problem, because “kids” is correlated with women’s work experience.
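
The instrumental-variable idea and the Hausman-type endogeneity test mentioned above can be sketched by hand with lm(). This is an illustration under stated assumptions, not a reproduction of Mroz's specifications: the instrument set (kids5, kids618) simply follows the text and is not validated here, the quadratic in experience is left out to keep a single endogenous regressor, and the names workers, stage1, stage2 and hausman are hypothetical.

```r
library(sampleSelection)   # provides the Mroz87 data
data("Mroz87")
workers <- subset(Mroz87, lfp == 1)   # wages are observed only for participants

# First stage: regress the suspect regressor on the exogenous regressors and the instruments
stage1 <- lm(exper ~ educ + city + kids5 + kids618, data = workers)
workers$exper_hat <- fitted(stage1)   # part of exper explained by the instruments
workers$v_hat     <- resid(stage1)    # part of exper left unexplained

# Second stage: replace exper by its first-stage fitted value (manual 2SLS point estimates)
stage2 <- lm(log(wage) ~ educ + city + exper_hat, data = workers)

# Regression-based Hausman test: add the first-stage residual to the original equation.
# A significant coefficient on v_hat is evidence that exper is endogenous.
hausman <- lm(log(wage) ~ educ + city + exper + v_hat, data = workers)

coef(stage2)
summary(hausman)$coefficients["v_hat", ]
```

The standard errors printed for stage2 are not valid 2SLS standard errors, because lm() does not know that exper_hat was estimated; a dedicated IV routine would be needed for inference.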

Heckman’s original paper

According to Heckman’s original paper, sample selection bias may arise for two different reasons. First, there may be self-selection by the individuals being investigated. In the case of wage data for married women, one observes market wages only for certain women because their productivity in the labor market exceeds their productivity at home. Second, sample selection bias may arise as a direct consequence of actions taken by the analyst. The results from Heckman’s original model suggest that the wife’s labor force experience is an endogenous variable in labor supply equations but not in wage functions. Heckman used “kids” as an instrumental variable in the first-stage equation.

Heckman’s model suggests that the first stage should contain at least one variable that strongly affects the decision process (participation in the labor force) but does not affect the outcome (wage).
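
For reference, the standard textbook form of the selection model can be written as follows. The notation is an assumption introduced here, not taken from the papers discussed: $z_i$ are the selection regressors, $x_i$ the wage regressors, and $s_i$ the participation indicator (lfp).

```latex
\begin{align*}
s_i^{*} &= z_i'\gamma + u_i, \qquad s_i = \mathbf{1}[\,s_i^{*} > 0\,]
  && \text{(selection equation)} \\
y_i &= x_i'\beta + \varepsilon_i, \qquad y_i \text{ observed only if } s_i = 1
  && \text{(outcome equation)} \\
\begin{pmatrix} u_i \\ \varepsilon_i \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^{2} \end{pmatrix} \right) \\
E[\,y_i \mid x_i, z_i, s_i = 1\,] &= x_i'\beta + \rho\sigma\,\lambda(z_i'\gamma),
  \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{align*}
```

The term $\lambda(z_i'\gamma)$ is the inverse Mills ratio. OLS on the selected sample omits it, which is why selection can be treated as an omitted-variable problem, and its coefficient $\rho\sigma$ is zero exactly when there is no selection on unobservables. The requirement that $z_i$ contains at least one variable not in $x_i$ is the exclusion restriction described above.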

Another practice

#This part practises the Heckman selection model using the Mroz87 data. There will be three regressions to estimate the log wage equation. The first one is simply OLS, using only the sample of labor-force participants, for whom we can observe the wage. For this regression, lfp = 1 (labor force participant).
# OLS 
ols1 <- lm(log(wage) ~ educ + exper + I( exper^2 ) + city, data=subset(Mroz87, lfp==1))
summary(ols1)
## 
## Call:
## lm(formula = log(wage) ~ educ + exper + I(exper^2) + city, data = subset(Mroz87, 
##     lfp == 1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.10084 -0.32453  0.05292  0.36261  2.34806 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5308476  0.1990253  -2.667  0.00794 ** 
## educ         0.1057097  0.0143280   7.378 8.58e-13 ***
## exper        0.0410584  0.0131963   3.111  0.00199 ** 
## I(exper^2)  -0.0007973  0.0003938  -2.025  0.04352 *  
## city         0.0542225  0.0680903   0.796  0.42629    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6667 on 423 degrees of freedom
## Multiple R-squared:  0.1581, Adjusted R-squared:  0.1501 
## F-statistic: 19.86 on 4 and 423 DF,  p-value: 5.389e-15
#Now turn to two-step estimation with a labor force selection equation. The model below is the Heckman two-step correction model. Heckman's idea was to treat the selection problem as if it were an omitted variable problem. The first stage is a probit equation that estimates the selection process (who is in the labor force), and the results from that equation are used to construct a variable that captures the selection effect in the wage equation.
#The selection equation should include regressors that are likely to affect the selection process, in this case variables that affect whether or not a married woman is in the labor force. Heckman and Mroz found evidence that labor market experience is an endogenous variable, and "kids5", the number of children aged 5 or younger that a married woman has, is suggested as such a selection variable.
heckvan = heckit( lfp ~ age + I( age^2 ) + kids5 + huswage + educ,
                 log(wage) ~ educ + exper + I( exper^2 ) + city, data=Mroz87 )
summary(heckvan)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 753 observations (325 censored and 428 observed)
## 14 free parameters (df = 740)
## Probit selection equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.6204199  1.5164452  -0.409 0.682564    
## age          0.0100365  0.0689931   0.145 0.884378    
## I(age^2)    -0.0004891  0.0007841  -0.624 0.532954    
## kids5       -0.8546564  0.1153682  -7.408  3.5e-13 ***
## huswage     -0.0421711  0.0124158  -3.397 0.000719 ***
## educ         0.1476737  0.0234280   6.303  5.0e-10 ***
## Outcome equation:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5471531  0.2890638  -1.893  0.05877 .  
## educ         0.1064521  0.0171745   6.198 9.48e-10 ***
## exper        0.0411569  0.0131805   3.123  0.00186 ** 
## I(exper^2)  -0.0008014  0.0003950  -2.029  0.04282 *  
## city         0.0532725  0.0687956   0.774  0.43897    
## Multiple R-Squared:0.1581,   Adjusted R-Squared:0.1481
##    Error terms:
##               Estimate Std. Error t value Pr(>|t|)
## invMillsRatio  0.01173    0.15157   0.077    0.938
## sigma          0.66285         NA      NA       NA
## rho            0.01769         NA      NA       NA
## --------------------------------------------
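
The two steps that heckit() performs can also be written out by hand, which makes the "omitted variable" interpretation concrete. This is a sketch under the same specification as above; the names probit1, imr and step2 are introduced here for illustration. The point estimates should be close to the heckit() output, but the standard errors reported by lm() in the second step are not corrected for the estimated regressor, which heckit() does take care of.

```r
library(sampleSelection)   # provides the Mroz87 data
data("Mroz87")

# Step 1: probit for labor force participation (the selection equation)
probit1 <- glm(lfp ~ age + I(age^2) + kids5 + huswage + educ,
               family = binomial(link = "probit"), data = Mroz87)

# Inverse Mills ratio evaluated at the fitted probit index
index <- predict(probit1, type = "link")
Mroz87$imr <- dnorm(index) / pnorm(index)

# Step 2: wage equation on participants only, with the inverse Mills ratio as extra regressor
step2 <- lm(log(wage) ~ educ + exper + I(exper^2) + city + imr,
            data = subset(Mroz87, lfp == 1))

# The coefficient on imr plays the role of invMillsRatio in the heckit() summary
coef(step2)
```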
# Maximum likelihood estimation of selection model
ml <- selection(lfp~age+I(age^2)+kids5+huswage+educ, 
                log(wage)~educ+exper+I(exper^2)+city, data = Mroz87)

summary(ml)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 2 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -891.1769 
## 753 observations (325 censored and 428 observed)
## 13 free parameters (df = 740)
## Probit selection equation:
##               Estimate Std. error t value  Pr(> t)    
## (Intercept) -0.6153808  1.5178025  -0.405 0.685153    
## age          0.0097374  0.0691127   0.141 0.887956    
## I(age^2)    -0.0004856  0.0007857  -0.618 0.536579    
## kids5       -0.8541359  0.1156288  -7.387  1.5e-13 ***
## huswage     -0.0422970  0.0125412  -3.373 0.000745 ***
## educ         0.1478204  0.0235240   6.284  3.3e-10 ***
## Outcome equation:
##               Estimate Std. error t value  Pr(> t)    
## (Intercept) -0.5440471  0.2721876  -1.999  0.04563 *  
## educ         0.1063107  0.0165931   6.407 1.48e-10 ***
## exper        0.0411377  0.0131664   3.124  0.00178 ** 
## I(exper^2)  -0.0008006  0.0003942  -2.031  0.04225 *  
## city         0.0534526  0.0685656   0.780  0.43564    
##    Error terms:
##       Estimate Std. error t value Pr(> t)    
## sigma  0.66284    0.02268  29.228  <2e-16 ***
## rho    0.01433    0.20297   0.071   0.944    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

womenwk - Prof MJH’s data

library(foreign)
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
womenwkMJH <- read.xlsx("/Users/vancam/Dropbox/Van_Waikato/Rworking/womenwk_MJH.xlsx",1)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
## Generate new variable "lfp", which receives 1 if "wage" > 0 and 0 if "wage" is NA
# lfp = 1 if wage is observed and positive; lfp = 0 if wage is missing
womenwkMJH <- mutate(womenwkMJH, lfp = as.integer(!is.na(wage) & wage > 0))
 
library(sampleSelection)
heckMJH = heckit( lfp ~ married+children+education+ age,
                 wage ~ education + age, data=womenwkMJH )                   

summary(heckMJH)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 2000 observations (657 censored and 1343 observed)
## 11 free parameters (df = 1990)
## Probit selection equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.467365   0.192563 -12.813  < 2e-16 ***
## married      0.430857   0.074208   5.806 7.43e-09 ***
## children     0.447325   0.028742  15.564  < 2e-16 ***
## education    0.058365   0.010974   5.318 1.16e-07 ***
## age          0.034721   0.004229   8.210 3.94e-16 ***
## Outcome equation:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.73404    1.24833   0.588    0.557    
## education    0.98253    0.05388  18.235   <2e-16 ***
## age          0.21187    0.02205   9.608   <2e-16 ***
## Multiple R-Squared:0.2793,   Adjusted R-Squared:0.2777
##    Error terms:
##               Estimate Std. Error t value Pr(>|t|)    
## invMillsRatio   4.0016     0.6065   6.597 5.35e-11 ***
## sigma           5.9474         NA      NA       NA    
## rho             0.6728         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------

These exercises suggest that, to deal with self-selection problems, we need to test the endogeneity and exogeneity of the variables before deciding which should be treated as endogenous and which should serve as instrumental variables to address this endogeneity.