SBS2018 Lec5

Data
- Import data
- New variables
Step 1, 2
- Using ‘heckit’ function
- Self made ‘2-step’
Step 3
Counterfactual simulation
- Labor participation rate
- Visualizing ‘subsidy policy’ impact
Effect on income distribution

※ 補助金政策（子供の人数に応じて保育補助金を給付する政策）が女性の就労に与える影響の構造推定（Roy model）．コードは小西先生の授業 ( https://kdb.tsukuba.ac.jp/syllabi/2018/FH24012/jpn/ ) で使用されているStataプログラムを黒田が自己流でRに書き換えたもの．

Data

library(foreign)  # for reading 'dta' file
library(sampleSelection)  # for heckit estimation

Import data

Using Mroz (1987 ECTA) PSID data [wave = 1975], DL from https://www.uam.es/personal_pdi/economicas/rsmanga/mef_aplications_0809.html .

mroz <- foreign::read.dta(file = "c://sbs2018/mroz.dta")

お行儀が悪いが，以下のように直接アクセスしてimportすることも可．

# url_mroz <- "https://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta"
# mroz <- foreign::read.dta(file = url_mroz)

パッケージ sampleSelection にも用意されているのでそちらを利用してもよい．ただし変数名が異なるので要注意（例：mroz$inlf に対応するのは Mroz87$lfp）．

# data(Mroz87)  # in 'sampleSelection' package

Structure of mroz data frame

str(mroz)

## 'data.frame':    753 obs. of  22 variables:
##  $ inlf    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hours   : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ kidslt6 : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ kidsge6 : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ age     : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educ    : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ wage    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ repwage : num  2.65 2.65 4.04 3.25 3.6 ...
##  $ hushrs  : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ husage  : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ huseduc : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ huswage : num  4.03 8.44 3.58 3.54 10 ...
##  $ faminc  : num  16310 21800 21040 7300 27300 ...
##  $ mtr     : num  0.721 0.661 0.692 0.781 0.622 ...
##  $ motheduc: int  12 7 12 7 12 14 14 3 7 7 ...
##  $ fatheduc: int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unem    : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city    : int  0 1 0 0 1 1 0 0 0 0 ...
##  $ exper   : int  14 5 15 6 7 33 11 35 24 21 ...
##  $ nwifeinc: num  10.9 19.5 12 6.8 20.1 ...
##  $ lwage   : num  1.2102 0.3285 1.5141 0.0921 1.5243 ...
##  $ expersq : int  196 25 225 36 49 1089 121 1225 576 441 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr "10 Feb 2009 01:31"
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  98 105 98 98 98 98 102 102 105 98 ...
##  - attr(*, "val.labels")= chr  "" "" "" "" ...
##  - attr(*, "var.labels")= chr  "=1 if in lab frce, 1975" "hours worked, 1975" "# kids < 6 years" "# kids 6-18" ...
##  - attr(*, "version")= int 7

head(cbind(colnames(mroz), attr(mroz, "var.labels")))  # short description

##      [,1]      [,2]                     
## [1,] "inlf"    "=1 if in lab frce, 1975"
## [2,] "hours"   "hours worked, 1975"     
## [3,] "kidslt6" "# kids < 6 years"       
## [4,] "kidsge6" "# kids 6-18"            
## [5,] "age"     "woman's age in yrs"     
## [6,] "educ"    "years of schooling"

New variables

mroz$winc <- with(mroz, wage * hours)  # annual income
mroz$linc <- log(mroz$winc)  # logarithm

Step 1, 2

Step 1: Heckman’s first stage (probit) \[ \Pr(D_i = 1 | X_i) = \Phi \left( \frac{z_i' \beta - (\mu + \alpha) n_i - x_i' \gamma}{\sigma_\eta} \right), \quad \tilde{\beta} = \frac{\beta}{\sigma_\eta} \]

Step 2: Heckman’s second stage (linear) \[ E [w_i | X_i, D_i = 1] = z_i' \beta + \rho \sigma_\epsilon \lambda (X_i' \tilde{\theta}_V) \]

$z$ = { educ, exper, exper$^2$, unem }
$x$ = { nwifeinc, kidslt6, kidsge6 }
$\lambda$ = inverse Mills ratio

Using ‘heckit’ function

fm_selection <- inlf ~ educ + exper + expersq + unem + nwifeinc + kidslt6 + kidsge6
fm_outcome <- winc ~ educ + exper + expersq + unem
heckit00 <- heckit(selection = fm_selection, outcome = fm_outcome, data = mroz, method = "2step")
summary(heckit00)

## --------------------------------------------
## Tobit 2 model (sample selection model)
## 2-step Heckman / heckit estimation
## 753 observations (325 censored and 428 observed)
## 16 free parameters (df = 738)
## Probit selection equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.070824   0.332255  -6.233 7.71e-10 ***
## educ         0.152485   0.024807   6.147 1.29e-09 ***
## exper        0.126984   0.018440   6.886 1.23e-11 ***
## expersq     -0.002464   0.000594  -4.148 3.75e-05 ***
## unem        -0.026340   0.016335  -1.612 0.107286    
## nwifeinc    -0.016772   0.004751  -3.530 0.000441 ***
## kidslt6     -0.558067   0.103770  -5.378 1.01e-07 ***
## kidsge6      0.132657   0.040041   3.313 0.000968 ***
## Outcome equation:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2547.224   2262.864  -1.126   0.2607    
## educ          505.569    102.217   4.946 9.38e-07 ***
## exper         246.804    111.542   2.213   0.0272 *  
## expersq        -2.869      2.822  -1.017   0.3096    
## unem          -89.686     64.254  -1.396   0.1632    
## Multiple R-Squared:0.1975,   Adjusted R-Squared:0.188
##    Error terms:
##                Estimate Std. Error t value Pr(>|t|)
## invMillsRatio -786.0960  1104.6538  -0.712    0.477
## sigma         3888.8738         NA      NA       NA
## rho             -0.2021         NA      NA       NA
## --------------------------------------------

str(heckit00$coefficients) # first step: probit selection eq.

##  Named num [1:16] -2.07082 0.15248 0.12698 -0.00246 -0.02634 ...
##  - attr(*, "names")= chr [1:16] "(Intercept)" "educ" "exper" "expersq" ...

str(heckit00$lm$coefficients) # second step: outcome (wage) eq.

##  Named num [1:6] -2547.22 505.57 246.8 -2.87 -89.69 ...
##  - attr(*, "names")= chr [1:6] "XO(Intercept)" "XOeduc" "XOexper" "XOexpersq" ...

cf.

Professor William Sundstrom, Santa Clara University, “Models with sample selection”

Self made ‘2-step’

probit_sel <- glm(formula = fm_selection, family = binomial(probit), data = mroz)
summary(probit_sel)

## 
## Call:
## glm(formula = fm_selection, family = binomial(probit), data = mroz)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2936  -0.9601   0.5195   0.8852   2.4616  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.0708235  0.3329048  -6.220 4.96e-10 ***
## educ         0.1524850  0.0249703   6.107 1.02e-09 ***
## exper        0.1269837  0.0183895   6.905 5.01e-12 ***
## expersq     -0.0024639  0.0005907  -4.171 3.03e-05 ***
## unem        -0.0263399  0.0162702  -1.619 0.105468    
## nwifeinc    -0.0167724  0.0048504  -3.458 0.000544 ***
## kidslt6     -0.5580667  0.1031070  -5.412 6.22e-08 ***
## kidsge6      0.1326571  0.0406115   3.266 0.001089 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.7  on 752  degrees of freedom
## Residual deviance:  841.0  on 745  degrees of freedom
## AIC: 857
## 
## Number of Fisher Scoring iterations: 4

\[ \lambda (\cdot) = \frac{\phi(\cdot)}{\Phi(\cdot)} \]

mroz$imr <- dnorm(probit_sel$linear.predictors) / pnorm(probit_sel$linear.predictors)  # inv Mills ratio
# plot(x = mroz$imr, y = sampleSelection::invMillsRatio(probit_sel)$IMR1)  # check
summary(mroz$imr)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05675 0.39292 0.63470 0.73644 0.99112 3.09369

wage_eq <- lm(formula = update.formula(fm_outcome, . ~ . + imr), data = mroz, subset = winc > 0)
summary(wage_eq)

## 
## Call:
## lm(formula = update.formula(fm_outcome, . ~ . + imr), data = mroz, 
##     subset = winc > 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9265  -2649   -460   2409  16382 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2547.239   2270.136  -1.122    0.262    
## educ          505.569    102.444   4.935 1.16e-06 ***
## exper         246.804    111.909   2.205    0.028 *  
## expersq        -2.869      2.830  -1.014    0.311    
## unem          -89.686     64.375  -1.393    0.164    
## imr          -786.088   1109.462  -0.709    0.479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3876 on 422 degrees of freedom
## Multiple R-squared:  0.1975, Adjusted R-squared:  0.188 
## F-statistic: 20.78 on 5 and 422 DF,  p-value: < 2.2e-16

cf.

Dr. Garrett Glasgow, UC Santa Barbara, “R code for selection models” at Political Science 207.

Step 3

\[ \sigma_\eta = \frac{\beta}{\tilde{\beta}} = \frac{\mbox{(coef. of wage eq.)}}{\mbox{(coef. of selection eq.)}} \]

sigma2 <- heckit00$lm$coefficients["XOunem"] / heckit00$coefficients["unem"]  # county unemployment rate [%]
sigma2

##   XOunem 
## 3404.972

Other $\sigma_\eta$ candidates:

heckit00$lm$coefficients["XOeduc"] / heckit00$coefficients["educ"]  # years of schooling

##   XOeduc 
## 3315.533

heckit00$lm$coefficients["XOexper"] / heckit00$coefficients["exper"]  # labor market experience [year]

##  XOexper 
## 1943.587

Counterfactual simulation

\[ \Pr(D_i = 1 | X_i, \mbox{subsidy} q) - \Pr(D_i = 1 | X_i) = \Phi \left( \frac{z_i' \beta - (\mu - q + \alpha) n_i - x_i' \gamma}{\sigma_\eta} \right) - \Phi \left( \frac{z_i' \beta - (\mu + \alpha) n_i - x_i' \gamma}{\sigma_\eta} \right) \]

cf.

1.0 USD in 1975 $\simeq$ 4.75 USD in 2018 $\simeq$ 536 JPY in 2018
- 500 USD in 1975 $\simeq$ 268,000 JPY in 2018
- 1,000 USD in 1975 $\simeq$ 536,000 JPY in 2018
- 1,500 USD in 1975 $\simeq$ 804,000 JPY in 2018

# heckit00$coefficients["kidslt6"] = - (mu + alpha)/sigma_eta
cf1 <- heckit00$coefficients["kidslt6"] + (500 / sigma2)  # q = 500 USD/y per one child (age < 6)
cf2 <- heckit00$coefficients["kidslt6"] + (1000 / sigma2)  # q = 1,000
cf3 <- heckit00$coefficients["kidslt6"] + (1500 / sigma2)  # q = 1,500

heckit00_sel_coef <- heckit00$coefficients[c("(Intercept)", "educ", "exper", "expersq", "unem", "nwifeinc", "kidslt6", "kidsge6")]
heckit00_sel_coef_cf1 <- c(heckit00_sel_coef[1:6], cf1, heckit00_sel_coef[8])
heckit00_sel_coef_cf2 <- c(heckit00_sel_coef[1:6], cf2, heckit00_sel_coef[8])
heckit00_sel_coef_cf3 <- c(heckit00_sel_coef[1:6], cf3, heckit00_sel_coef[8])
mroz_sel <- cbind(1, mroz[, c("educ", "exper", "expersq", "unem", "nwifeinc", "kidslt6", "kidsge6")])
head(t(heckit00_sel_coef * t(mroz_sel)))

##           1     educ     exper     expersq       unem   nwifeinc
## 1 -2.070824 1.829819 1.7777707 -0.48292874 -0.1316986 -0.1829867
## 2 -2.070824 1.829819 0.6349181 -0.06159805 -0.2897369 -0.3270593
## 3 -2.070824 1.829819 1.9047543 -0.55438248 -0.1316986 -0.2019369
## 4 -2.070824 1.829819 0.7619017 -0.08870120 -0.1316986 -0.1140515
## 5 -2.070824 2.134789 0.8888854 -0.12073218 -0.2502274 -0.3371240
## 6 -2.070824 1.829819 4.1904595 -2.68321120 -0.1975479 -0.1653589
##      kidslt6   kidsge6
## 1 -0.5580674 0.0000000
## 2  0.0000000 0.2653132
## 3 -0.5580674 0.3979698
## 4  0.0000000 0.3979698
## 5 -0.5580674 0.2653132
## 6  0.0000000 0.0000000

mroz$pk0 <- pnorm(q = apply(t(heckit00_sel_coef * t(mroz_sel)), 1, sum))
mroz$pk1 <- pnorm(q = apply(t(heckit00_sel_coef_cf1 * t(mroz_sel)), 1, sum))
mroz$pk2 <- pnorm(q = apply(t(heckit00_sel_coef_cf2 * t(mroz_sel)), 1, sum))
mroz$pk3 <- pnorm(q = apply(t(heckit00_sel_coef_cf3 * t(mroz_sel)), 1, sum))

summary(mroz$pk0)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002591 0.385960 0.606166 0.568381 0.770972 0.976533

summary(mroz$pk3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02716 0.44248 0.63576 0.60277 0.78261 0.97653

Labor participation rate

sum(mroz$pk0 > .5) / nrow(mroz)  # without subsidy

## [1] 0.6215139

sum(mroz$pk1 > .5) / nrow(mroz)  # with 0.5K USD/y subsidy

## [1] 0.6334661

sum(mroz$pk2 > .5) / nrow(mroz)  # with 1.0K USD/y subsidy

## [1] 0.6586985

sum(mroz$pk3 > .5) / nrow(mroz)  # with 1.5K USD/y subsidy

## [1] 0.6733068

Visualizing ‘subsidy policy’ impact

# bandwidth
bw0 <- bw.SJ(mroz$pk0)  # Sheather & Jones (1991)
# bw0 <- bw.nrd0(mroz$pk0)  # [default] Silverman's rule of thumb
# bw0 <- bw.nrd(mroz$pk0)  # Scott (1992)
# bw0 <- bw.ucv(mroz$pk0)  # unbiased cross-validation

plot(density(mroz$pk3, bw = bw0), col = 4, lty = 2, axes = F, main = "Density of Pr(working)")
axis(side = 1); axis(side = 2)
lines(density(mroz$pk2, bw = bw0), col = 3, lty = 2)
lines(density(mroz$pk1, bw = bw0), col = 2, lty = 2)
lines(density(mroz$pk0, bw = bw0), col = 1)
legend(x = "topleft", legend = c("CF: 1,500 USD/year", "CF: 1000", "CF: 500", "No subsidy"), col = 4:1, lty = c(2, 2, 2, 1))

Same options as ‘Lec 5.do’, modified version, DL from manaba

plot(density(mroz$pk3, bw = .1, kernel = "epanechnikov"), col = 4, lty = 2, axes = F, main = "Density of Pr(working)")
axis(side = 1); axis(side = 2)
lines(density(mroz$pk2, bw = .1, kernel = "epanechnikov"), col = 3, lty = 2)
lines(density(mroz$pk1, bw = .1, kernel = "epanechnikov"), col = 2, lty = 2)
lines(density(mroz$pk0, bw = .1, kernel = "epanechnikov"), col = 1)
legend(x = "topleft", legend = c("CF: 1,500 USD/year", "CF: 1000", "CF: 500", "No subsidy"), col = 4:1, lty = c(2, 2, 2, 1))

cf.

Stata ‘kdensity’ function: https://www.stata.com/manuals13/rkdensity.pdf
R ‘density’ function: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html

Effect on income distribution

mu0 <- c(0, 0)  # mu
varcov_el <- with(heckit00, c(sigma^2, rep(sigma*rho, 2), 1))
varcov0 <- matrix(varcov_el, nrow = 2)  # variance-covariance matrix
library(MASS)  # for 'mvrnorm' function
eps_eta <- MASS::mvrnorm(n = nrow(mroz), mu = mu0, Sigma = varcov0)
cor(eps_eta)

##            [,1]       [,2]
## [1,]  1.0000000 -0.2098448
## [2,] -0.2098448  1.0000000

mroz$eps0 <- eps_eta[, 1]  # epsilon (error term of wage eq.)
mroz$eta0 <- eps_eta[, 2]  # eta (error term of selection eq.)

$V^*$

mroz$x0 <- apply(t(heckit00_sel_coef * t(mroz_sel)), 1, sum) + mroz$eta0
mroz$x3 <- apply(t(heckit00_sel_coef_cf3 * t(mroz_sel)), 1, sum) + mroz$eta0
# summary(mroz$x0)
# summary(mroz$x3)
table(round(mroz$x3 - mroz$x0, digits = 3))

## 
##     0 0.441 0.881 1.322 
##   606   118    26     3

子供のいる女性に対してのみ効果が生じる（という理論モデルを仮定している）ことに注意．よって，上記（効果の頻度）は以下（子供の人数）と対応している．

table(mroz$kidslt6)

## 
##   0   1   2   3 
## 606 118  26   3

Individual effects

xl <- range(with(mroz, c(x0, x3)))
yl <- range(mroz$eps0)
plot(x =mroz$x0 , y = mroz$eps0, col = 3, pch = 19, xlim = xl, ylim = yl, axes = F, xlab = "V*", ylab = "epsilon")
axis(side = 1); axis(side = 2)
par(new = T)
plot(x = mroz$x3, y = mroz$eps0, col = 2, pch = 3, xlim = xl, ylim = yl, axes = F, ann = F)
legend(x = "topright", legend = c("CF3", "No subsidy"), col = 2:3, pch = c(3, 19))

Income dist.

wage_eq_name <- c("educ", "exper", "expersq", "unem")
mroz$w <- apply(t(heckit00$lm$coefficients[1:5] * t(cbind(1, mroz[, wage_eq_name]))), 1, sum) + mroz$eps0
mroz$w0 <- ifelse(mroz$x0 >= 0, mroz$w, NA)
mroz$w3 <- ifelse(mroz$x3 >= 0, mroz$w, NA)

hist(mroz$w3[mroz$w3 > 0], breaks = 30, xlab = "Income")
hist(mroz$w0[mroz$w0 > 0], breaks = 30, add = T, col = 3)
legend(x = "topright", legend = c("CF3", "No subsidy"), col = 1, pt.bg = c(0, 3), pch = 22)

EOS