Sample-selection bias in OLS regressions arises when the process by which observations are selected into the sample depends on the dependent variable in a way that creates correlation between the error term and the regressors. It is one source of endogeneity. This part practises the Heckman self-selection model. The data used here is Mroz87, which contains 1975 data on married women’s pay and labor-force participation, from a well-known paper by Thomas Mroz (1987). In these data, it is assumed that people choose to work for pay only if their expected wage exceeds some constant threshold; otherwise, working is not worthwhile for them and they may choose not to work. The sample of workers in the labor force, for whom we observe earnings, is therefore selected from those in the population who would be paid above that threshold. Most highly educated workers would earn above the threshold, so they are well represented in the sample. Less-educated workers, however, usually earn lower pay, so those who do appear in the sample are disproportionately the ones with unusually high error terms. In this case, education is correlated with the error term, which biases the OLS coefficient on education.
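The threshold mechanism described above can be illustrated with a small simulation. This is only a sketch on made-up data: the sample size, the true return to education (0.10), the threshold (1.8) and all variable names are assumptions chosen for illustration and are not taken from Mroz87.

```r
# Simulated illustration of selection on the dependent variable (hypothetical data).
set.seed(42)
n     <- 5000
educ  <- sample(8:17, n, replace = TRUE)   # years of schooling
u     <- rnorm(n)                          # error term, independent of educ
lwage <- 0.5 + 0.10 * educ + u             # true return to education is 0.10
works <- lwage > 1.8                       # wage observed only above the threshold

# OLS on everyone versus OLS on the selected (working) sample
b_full     <- coef(lm(lwage ~ educ))["educ"]
b_selected <- coef(lm(lwage ~ educ, subset = works))["educ"]
print(c(full = unname(b_full), selected = unname(b_selected)))
```

In the selected sample, the coefficient on educ is pulled toward zero, because low-education individuals are observed only when their error term happens to be large.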
This part practises the sample selection model using the Mroz87 data with OLS, logit and probit models (Wooldridge, page 593). These models simply estimate the labor force participation of married women.
#First, let's have a look at the data
library(sampleSelection)
## Loading required package: maxLik
## Loading required package: miscTools
##
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
##
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
data("Mroz87")
head(Mroz87)
## lfp hours kids5 kids618 age educ wage repwage hushrs husage huseduc
## 1 1 1610 1 0 32 12 3.3540 2.65 2708 34 12
## 2 1 1656 0 2 30 12 1.3889 2.65 2310 30 9
## 3 1 1980 1 3 35 12 4.5455 4.04 3072 40 12
## 4 1 456 0 3 34 12 1.0965 3.25 1920 53 10
## 5 1 1568 1 2 31 14 4.5918 3.60 2000 32 12
## 6 1 2032 0 0 54 12 4.7421 4.70 1040 57 11
## huswage faminc mtr motheduc fatheduc unem city exper nwifeinc
## 1 4.0288 16310 0.7215 12 7 5.0 0 14 10.910060
## 2 8.4416 21800 0.6615 7 7 11.0 1 5 19.499981
## 3 3.5807 21040 0.6915 12 7 5.0 0 15 12.039910
## 4 3.5417 7300 0.7815 7 7 5.0 0 6 6.799996
## 5 10.0000 27300 0.6215 12 14 9.5 1 7 20.100058
## 6 6.7106 19495 0.6915 14 7 7.5 1 33 9.859054
## wifecoll huscoll
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE FALSE
## 4 FALSE FALSE
## 5 TRUE FALSE
## 6 FALSE FALSE
tail(Mroz87)
## lfp hours kids5 kids618 age educ wage repwage hushrs husage huseduc
## 748 0 0 0 2 36 12 0 0 3120 39 12
## 749 0 0 0 2 40 13 0 0 3020 43 16
## 750 0 0 2 3 31 12 0 0 2056 33 12
## 751 0 0 0 0 43 12 0 0 2383 43 12
## 752 0 0 0 0 60 12 0 0 1705 55 8
## 753 0 0 0 3 39 9 0 0 3120 48 12
## huswage faminc mtr motheduc fatheduc unem city exper nwifeinc
## 748 1.3013 5330 0.7915 7 12 14.0 0 4 5.330
## 749 9.2715 28200 0.6215 10 10 9.5 1 5 28.200
## 750 4.8638 10000 0.7715 12 12 7.5 0 14 10.000
## 751 1.0898 9952 0.7515 10 3 7.5 0 4 9.952
## 752 12.4400 24984 0.6215 12 12 14.0 1 15 24.984
## 753 6.0897 28363 0.6915 7 7 11.0 1 12 28.363
## wifecoll huscoll
## 748 FALSE FALSE
## 749 TRUE TRUE
## 750 FALSE FALSE
## 751 FALSE FALSE
## 752 FALSE FALSE
## 753 FALSE FALSE
str(Mroz87)
## 'data.frame': 753 obs. of 22 variables:
## $ lfp : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hours : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
## $ kids5 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ kids618 : int 0 2 3 3 2 0 2 0 2 2 ...
## $ age : int 32 30 35 34 31 54 37 54 48 39 ...
## $ educ : int 12 12 12 12 14 12 16 12 12 12 ...
## $ wage : num 3.35 1.39 4.55 1.1 4.59 ...
## $ repwage : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
## $ hushrs : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
## $ husage : int 34 30 40 53 32 57 37 53 52 43 ...
## $ huseduc : int 12 9 12 10 12 11 12 8 4 12 ...
## $ huswage : num 4.03 8.44 3.58 3.54 10 ...
## $ faminc : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
## $ mtr : num 0.722 0.661 0.692 0.781 0.622 ...
## $ motheduc: int 12 7 12 7 12 14 14 3 7 7 ...
## $ fatheduc: int 7 7 7 7 14 7 7 3 7 7 ...
## $ unem : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
## $ city : int 0 1 0 0 1 1 0 0 0 0 ...
## $ exper : int 14 5 15 6 7 33 11 35 24 21 ...
## $ nwifeinc: num 10.9 19.5 12 6.8 20.1 ...
## $ wifecoll: Factor w/ 2 levels " TRUE","FALSE": 2 2 2 2 1 2 1 2 2 2 ...
## $ huscoll : Factor w/ 2 levels " TRUE","FALSE": 2 2 2 2 2 2 2 2 2 2 ...
hist(Mroz87$lfp)
hist(Mroz87$hours)
hist(Mroz87$nwifeinc)
hist(Mroz87$wage)
hist(Mroz87$educ)
hist(Mroz87$exper)
hist(Mroz87$kids5)
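#Binary logit fitted with nnet::multinom (note: this is not a linear probability model; an LPM would be fitted by OLS with lm())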
library(nnet)
library(ggplot2)
library(reshape2)
LPMw <- multinom(lfp ~ nwifeinc + educ + exper + exper^2+age+kids5+kids618, data = Mroz87) # Run Multinomial Logistic Regression.
## # weights: 8 (7 variable)
## initial value 521.939827
## iter 10 value 406.146576
## final value 406.143184
## converged
summary(LPMw)
## Call:
## multinom(formula = lfp ~ nwifeinc + educ + exper + exper^2 +
## age + kids5 + kids618, data = Mroz87)
##
## Coefficients:
## Values Std. Err.
## (Intercept) 0.83791334 0.840938455
## nwifeinc -0.02021676 0.008263717
## educ 0.22697857 0.043295540
## exper 0.11974650 0.013626450
## age -0.09108890 0.014320723
## kids5 -1.43940849 0.201499775
## kids618 0.05816886 0.073380123
##
## Residual Deviance: 812.2864
## AIC: 826.2864
#Logit model (MLE)
logitW <- glm(lfp ~ nwifeinc + educ + exper + exper^2+age+kids5+kids618,
family = binomial(link = "logit"), data = Mroz87)
summary(logitW)
##
## Call:
## glm(formula = lfp ~ nwifeinc + educ + exper + exper^2 + age +
## kids5 + kids618, family = binomial(link = "logit"), data = Mroz87)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5261 -0.9223 0.4489 0.8978 2.3170
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.837909 0.840933 0.996 0.3191
## nwifeinc -0.020216 0.008264 -2.446 0.0144 *
## educ 0.226977 0.043295 5.243 1.58e-07 ***
## exper 0.119746 0.013626 8.788 < 2e-16 ***
## age -0.091088 0.014321 -6.361 2.01e-10 ***
## kids5 -1.439393 0.201498 -7.143 9.10e-13 ***
## kids618 0.058174 0.073380 0.793 0.4279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.75 on 752 degrees of freedom
## Residual deviance: 812.29 on 746 degrees of freedom
## AIC: 826.29
##
## Number of Fisher Scoring iterations: 4
#Probit model (MLE)
ProbitW <- glm(lfp ~ nwifeinc + educ + exper + exper^2+age+kids5+kids618,
family = binomial(link = "probit"), data = Mroz87)
summary(ProbitW)
##
## Call:
## glm(formula = lfp ~ nwifeinc + educ + exper + exper^2 + age +
## kids5 + kids618, family = binomial(link = "probit"), data = Mroz87)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6047 -0.9283 0.4417 0.9021 2.3471
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.579574 0.495537 1.170 0.2422
## nwifeinc -0.011565 0.004858 -2.380 0.0173 *
## educ 0.133690 0.025254 5.294 1.20e-07 ***
## exper 0.070217 0.007693 9.127 < 2e-16 ***
## age -0.055555 0.008305 -6.689 2.24e-11 ***
## kids5 -0.874290 0.117359 -7.450 9.35e-14 ***
## kids618 0.034546 0.043376 0.796 0.4258
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.75 on 752 degrees of freedom
## Residual deviance: 812.44 on 746 degrees of freedom
## AIC: 826.44
##
## Number of Fisher Scoring iterations: 4
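#These models just estimate whether or not married women participate in the labor force.
### These results are a little different from the results in the Wooldridge book. A likely reason: inside an R formula, exper^2 is read as a formula operator and reduces to exper, so the squared experience term is missing from the fits above (there is no exper^2 row in the output). Writing I(exper^2) would include it.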
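The following sketch shows how R expands `exper^2` inside a formula, and how the participation models could be refitted with the quadratic term protected by `I()`. The object names (`labels_raw`, `labels_safe`, `lpm_fix`, `probit_fix`) are hypothetical, and no output is shown because it has not been run as part of this write-up.

```r
# In a formula, ^ means "crossing", so exper^2 collapses to exper.
labels_raw  <- attr(terms(lfp ~ exper + exper^2), "term.labels")
labels_safe <- attr(terms(lfp ~ exper + I(exper^2)), "term.labels")
print(labels_raw)   # only the linear term survives
print(labels_safe)  # linear and squared terms

# Refit with the squared term, if the sampleSelection data are available.
if (requireNamespace("sampleSelection", quietly = TRUE)) {
  data("Mroz87", package = "sampleSelection")
  f <- lfp ~ nwifeinc + educ + exper + I(exper^2) + age + kids5 + kids618
  lpm_fix    <- lm(f, data = Mroz87)                                    # linear probability model (OLS)
  probit_fix <- glm(f, family = binomial(link = "probit"), data = Mroz87)
  print(summary(lpm_fix)$coefficients)
  print(summary(probit_fix)$coefficients)
}
```

With `I(exper^2)` in the formula, the specification matches the one usually reported for this example, so the estimates should be closer to the textbook values.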
In the paper published in 1987 in Econometrica, Thomas Mroz undertakes a systematic analysis of several theoretical and statistical assumptions used in many empirical models of female labor supply. In this paper, Mroz uses the Tobit assumption to control for self-selection into the labor force and tests the exogeneity assumptions on the wife’s wage rate and her labor market experience.
In the first stage, Mroz tests the basic labor supply model with sets of variables using the two-stage least squares method. There are two approaches in this stage: the first 4 equations use labor market experience as an instrumental variable for the wife’s wage rate, and the last 4 estimates use age and education as instruments but do not contain measures of the wife’s labor market experience.
In the second part, Mroz uses the Hausman test to test the endogeneity of the variables. The results of these tests suggest that the hours of work (hours) and labor market experience (exper) of married women can be considered endogenous variables in the estimation of married women’s wages. There is no evidence that non-wife income (nwifeinc) and the number of children (kids5, kids618) are endogenous.
# Estimating model specifications with controls for self-selection bias
In these data, women’s labor market experience is an endogenous variable in the equation for women’s wages. Women who have worked many years in the past tend to have higher wages and to work more in the present, and women’s working experience is clearly correlated with the predicted wage. So experience can be considered an endogenous variable in the wage regression. “Kids” (the number of children a woman has), which correlates with women’s working experience, is used as an instrumental variable to solve the problem.
Heckman’s original paper states that sample selection bias may arise for two different reasons. First, there may be self-selection by the individuals being investigated. In the case of married women’s wage data, one observes market wages for certain women because their productivity in the labor market exceeds their productivity at home. Second, sample selection bias may also arise as a direct consequence of actions taken by the analyst. The results from Heckman’s original model suggest that the wife’s labor force experience is an endogenous variable in labor supply equations but not in wage functions. Heckman used “kids” as an instrumental variable in the first-stage equation.
In this model, Heckman suggests that the first stage should contain at least one variable that affects the decision process (participation in the labor force) but does not affect the outcome (wage).
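The structure behind this requirement can be written compactly. The notation below ($z_i$, $x_i$, $\gamma$, $\beta$, $\lambda$) is standard textbook notation introduced here for illustration; it does not appear in the original text. The vector $z_i$ is assumed to contain at least one variable that is excluded from $x_i$.

```latex
\begin{aligned}
\text{Selection:}\quad & s_i^* = z_i'\gamma + u_i, \qquad s_i = 1[\,s_i^* > 0\,] \\
\text{Outcome:}\quad   & w_i = x_i'\beta + \varepsilon_i, \qquad w_i \text{ observed only if } s_i = 1 \\
& (u_i, \varepsilon_i) \sim N\!\left(0,\ \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{pmatrix}\right) \\
\Rightarrow\quad & E[\,w_i \mid x_i, z_i, s_i = 1\,] = x_i'\beta + \rho\sigma\,\lambda(z_i'\gamma),
\qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{aligned}
```

OLS on the selected sample omits the term $\rho\sigma\,\lambda(z_i'\gamma)$, which is why selection can be treated like an omitted-variable problem. In the two-step output of the sampleSelection package, the coefficient labelled invMillsRatio corresponds to $\rho\sigma$.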
#This part practises the Heckman selection model using the Mroz87 data. There will be three regressions to estimate the log wage equation. The first one is simply OLS, using the sample of labor-force participants, the only ones for whom we can observe the wage. For this regression, lfp = 1 (labor force participant)
# OLS
ols1 <- lm(log(wage) ~ educ + exper + I( exper^2 ) + city, data=subset(Mroz87, lfp==1))
summary(ols1)
##
## Call:
## lm(formula = log(wage) ~ educ + exper + I(exper^2) + city, data = subset(Mroz87,
## lfp == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.10084 -0.32453 0.05292 0.36261 2.34806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5308476 0.1990253 -2.667 0.00794 **
## educ 0.1057097 0.0143280 7.378 8.58e-13 ***
## exper 0.0410584 0.0131963 3.111 0.00199 **
## I(exper^2) -0.0007973 0.0003938 -2.025 0.04352 *
## city 0.0542225 0.0680903 0.796 0.42629
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6667 on 423 degrees of freedom
## Multiple R-squared: 0.1581, Adjusted R-squared: 0.1501
## F-statistic: 19.86 on 4 and 423 DF, p-value: 5.389e-15
#Now turn to two-step estimation with a labor force selection equation. The model below is the Heckman two-step correction model. Heckman's idea was to treat the selection problem as if it were an omitted variable problem. The first stage is a probit equation that estimates the selection process (who is in the labor force), and the results from that equation are used to construct a variable that captures the selection effect in the wage equation.
#The selection equation should include regressors that are likely to affect the selection process; in this case, variables that affect whether or not a married woman is in the labor force. Heckman and Mroz found evidence that labor market experience is an endogenous variable, and "kids5" (the number of kids under 5 years old that a married woman has) is suggested as such a selection variable.
heckvan = heckit( lfp ~ age + I( age^2 ) + kids5 + huswage + educ,
log(wage) ~ educ + exper + I( exper^2 ) + city, data=Mroz87 )
summary(heckvan)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 753 observations (325 censored and 428 observed)
## 14 free parameters (df = 740)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6204199 1.5164452 -0.409 0.682564
## age 0.0100365 0.0689931 0.145 0.884378
## I(age^2) -0.0004891 0.0007841 -0.624 0.532954
## kids5 -0.8546564 0.1153682 -7.408 3.5e-13 ***
## huswage -0.0421711 0.0124158 -3.397 0.000719 ***
## educ 0.1476737 0.0234280 6.303 5.0e-10 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5471531 0.2890638 -1.893 0.05877 .
## educ 0.1064521 0.0171745 6.198 9.48e-10 ***
## exper 0.0411569 0.0131805 3.123 0.00186 **
## I(exper^2) -0.0008014 0.0003950 -2.029 0.04282 *
## city 0.0532725 0.0687956 0.774 0.43897
## Multiple R-Squared:0.1581, Adjusted R-Squared:0.1481
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## invMillsRatio 0.01173 0.15157 0.077 0.938
## sigma 0.66285 NA NA NA
## rho 0.01769 NA NA NA
## --------------------------------------------
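The two-step procedure can also be reproduced by hand, which makes the role of the inverse Mills ratio explicit. This is a minimal sketch, assuming the same selection and outcome specifications as the heckit() call above; the names `d`, `step1`, `step2`, `xb` and `imr` are hypothetical. The standard errors from the plain lm() in step 2 are not corrected for the estimated regressor, so only the point estimates are comparable with the heckit() output.

```r
library(sampleSelection)  # used here only for the Mroz87 data
data("Mroz87")
d <- Mroz87               # work on a copy

# Step 1: probit for labor force participation (selection equation)
step1 <- glm(lfp ~ age + I(age^2) + kids5 + huswage + educ,
             family = binomial(link = "probit"), data = d)

# Inverse Mills ratio evaluated at the fitted probit index
xb    <- predict(step1, type = "link")
d$imr <- dnorm(xb) / pnorm(xb)

# Step 2: OLS wage equation on participants, adding the inverse Mills ratio
workers <- subset(d, lfp == 1)
step2 <- lm(log(wage) ~ educ + exper + I(exper^2) + city + imr, data = workers)
print(coef(step2))
```

The coefficient on `imr` plays the role of invMillsRatio in the heckit() summary; a value close to zero suggests little evidence of selection bias in this specification.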
# Maximum likelihood estimation of the selection model
ml <- selection(lfp~age+I(age^2)+kids5+huswage+educ,
log(wage)~educ+exper+I(exper^2)+city, data = Mroz87)
summary(ml)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 2 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -891.1769
## 753 observations (325 censored and 428 observed)
## 13 free parameters (df = 740)
## Probit selection equation:
## Estimate Std. error t value Pr(> t)
## (Intercept) -0.6153808 1.5178025 -0.405 0.685153
## age 0.0097374 0.0691127 0.141 0.887956
## I(age^2) -0.0004856 0.0007857 -0.618 0.536579
## kids5 -0.8541359 0.1156288 -7.387 1.5e-13 ***
## huswage -0.0422970 0.0125412 -3.373 0.000745 ***
## educ 0.1478204 0.0235240 6.284 3.3e-10 ***
## Outcome equation:
## Estimate Std. error t value Pr(> t)
## (Intercept) -0.5440471 0.2721876 -1.999 0.04563 *
## educ 0.1063107 0.0165931 6.407 1.48e-10 ***
## exper 0.0411377 0.0131664 3.124 0.00178 **
## I(exper^2) -0.0008006 0.0003942 -2.031 0.04225 *
## city 0.0534526 0.0685656 0.780 0.43564
## Error terms:
## Estimate Std. error t value Pr(> t)
## sigma 0.66284 0.02268 29.228 <2e-16 ***
## rho 0.01433 0.20297 0.071 0.944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
library(foreign)
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
womenwkMJH <- read.xlsx("/Users/vancam/Dropbox/Van_Waikato/Rworking/womenwk_MJH.xlsx",1)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
# Generate a new variable "lfp" that equals 1 if "wage" > 0 and 0 otherwise, including when "wage" is missing (NA)
womenwkMJH <- mutate(womenwkMJH, lfp = if_else(!is.na(wage) & wage > 0, 1L, 0L))
library(sampleSelection)
heckMJH = heckit( lfp ~ married+children+education+ age,
wage ~ education + age, data=womenwkMJH )
summary(heckMJH)
## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 2000 observations (657 censored and 1343 observed)
## 11 free parameters (df = 1990)
## Probit selection equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.467365 0.192563 -12.813 < 2e-16 ***
## married 0.430857 0.074208 5.806 7.43e-09 ***
## children 0.447325 0.028742 15.564 < 2e-16 ***
## education 0.058365 0.010974 5.318 1.16e-07 ***
## age 0.034721 0.004229 8.210 3.94e-16 ***
## Outcome equation:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73404 1.24833 0.588 0.557
## education 0.98253 0.05388 18.235 <2e-16 ***
## age 0.21187 0.02205 9.608 <2e-16 ***
## Multiple R-Squared:0.2793, Adjusted R-Squared:0.2777
## Error terms:
## Estimate Std. Error t value Pr(>|t|)
## invMillsRatio 4.0016 0.6065 6.597 5.35e-11 ***
## sigma 5.9474 NA NA NA
## rho 0.6728 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
These exercises suggest that, to deal with self-selection problems, we need to test the endogeneity and exogeneity of the variables before deciding which should be treated as endogenous variables and which should serve as instrumental variables to deal with this endogeneity.
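One common way to carry out such an endogeneity check is the regression-based Hausman (Durbin-Wu-Hausman) test. The sketch below is an assumption-laden illustration, not the specification used by Mroz or Heckman: it treats `age`, `kids5` and `kids618` as instruments for `exper`, uses a wage equation that is linear in experience, and ignores selection into work. The names `first`, `aug` and `v_hat` are hypothetical.

```r
library(sampleSelection)  # used here only for the Mroz87 data
data("Mroz87")
workers <- subset(Mroz87, lfp == 1)

# First stage: suspected endogenous regressor on exogenous variables
# plus the assumed instruments (age, kids5, kids618).
first <- lm(exper ~ educ + city + age + kids5 + kids618, data = workers)
workers$v_hat <- resid(first)

# Augmented wage equation: a significant coefficient on v_hat
# is evidence against the exogeneity of exper.
aug <- lm(log(wage) ~ educ + exper + city + v_hat, data = workers)
print(summary(aug)$coefficients["v_hat", ])
```

The conclusion of this test is only as credible as the instruments: if the assumed instruments themselves affect wages directly, the test is not informative.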