Binary Dependent Variables Assignment

Question 1

We will examine a paper by Anastasia Semykina entitled, “Self-employment among women: Do children matter more than we previously thought?”. You are provided the following data.

March CPS white women (NLSY_white_women_JAE.csv): 12,624 women for several years. You observe the following variables.

Variable	definitions:
id	unique individual ID
year	year
working	=1 if working, 0 otherwise
self_empl	=1 if self-employed, 0 otherwise
age	age in years
agesq	age squared
educ	years of schooling, truncated at 20 years
edu_0_11	=1 if has 0-11 years of schooling, 0 otherwise
edu_12	=1 if has 12 years of schooling, 0 otherwise
edu_13_15	=1 if has 13-15 years of schooling, 0 otherwise
edu_16plus	=1 if has 16 or more years of schooling, 0 otherwise
married	=1 if married, 0 if not
d_ch_1_5	=1 if has children ages 0 to 5, 0 otherwise
d_ch_0	=1 if has a newborn (<1 year old), 0 otherwise
d_ch_1_5_alt	=1 if has children ages 1 to 5, 0 otherwise
d_ch_6_17	=1 if has children ages 6 to 17, 0 otherwise
rotter_score	locus of control
sesteem_score1	self esteem score
urban	=1 if urban location, 0 otherwise
afqt_1	AFQT score
south	=1 if South region, 0 otherwise
northeast	=1 if Northeast region, 0 otherwise
northcen	=1 if North Central region, 0 otherwise
west	=1 if West region, 0 otherwise
sp_inc1000	spouse’s income in thousands of dollars
samesex	=1 if the first two children have the same gender, 0 otherwise
policever	=1 if ever stopped by police for other than minor traffic offense in 1980, 0 otherwise
unemp_rate	unemployment rate in percentage points
m_sp_inc1000	individual time mean of sp_inc1000
m_married	individual time mean of married

library(tidyverse)
library(texreg)
library(sampleSelection)
library(readr)
library(mfx)
library(tinytex)
NLSY <- read_csv("NLSY_white_women_JAE.csv")

a) Estimate a linear probability model of self employment and a separate linear probability model of working.

#use kids as instrument, look at lm with dummy variables

ols.work<- lm(working ~ age + agesq + educ + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)

# edu_0_11 is the omitted reference category; including all four schooling dummies with an intercept is perfectly collinear
ols.work2<- lm(working ~ age + agesq + edu_12 + edu_13_15 + edu_16plus + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)


ols.self<- lm(self_empl ~ age + agesq + educ + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)

# edu_0_11 is the omitted reference category; including all four schooling dummies with an intercept is perfectly collinear
ols.self2<- lm(self_empl ~ age + agesq + edu_12 + edu_13_15 + edu_16plus + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)
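
When every category of a set of dummies is entered alongside an intercept, the dummies sum to one and the design matrix loses rank. The following is a minimal sketch on simulated data (the names d1 to d4 and the data are hypothetical, not the NLSY file) of how `lm` behaves in that case: it reports NA for the aliased dummy, which then acts as the reference group.

```r
# Simulated illustration of the dummy-variable trap (hypothetical data).
set.seed(1)
n <- 500
group <- sample(1:4, n, replace = TRUE)   # four exhaustive categories
d1 <- as.numeric(group == 1)
d2 <- as.numeric(group == 2)
d3 <- as.numeric(group == 3)
d4 <- as.numeric(group == 4)
y  <- as.numeric(runif(n) < 0.2 + 0.1 * d2 + 0.2 * d3 + 0.3 * d4)

# All four dummies plus an intercept: the design matrix is rank deficient.
fit_all <- lm(y ~ d1 + d2 + d3 + d4)
print(coef(fit_all))              # the last dummy (d4) is reported as NA

# Dropping one category explicitly makes the reference group a choice.
fit_ref <- lm(y ~ d2 + d3 + d4)   # d1 is the reference category
print(coef(fit_ref))
```

Omitting one category by hand, rather than letting `lm` drop one, keeps the reference group explicit when the coefficients are interpreted.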


#htmlreg(list(ols.work,ols.self), digits = 4)

#these labeling functions work but need to be put in a list 
htmlreg(ols.work, custom.coef.names = c("Intercept", "Age", "Age squared", "Education", "Married", "Has a Child aged 1 to 5", "Has a Newborn", "Has a Child 6 to 17", "Locus of Control score", "Self-esteem score", "Intelligence test score", "Urban location", "South location", "Northeast location", "North Central location", "Spouse's Income", "Ever stopped by police", "Unemployment rate"), custom.model.names = "OLS-Working", digits = 4)

Statistical models
	OLS-Working
Intercept	0.4729^***
	(0.0565)
Age	0.0154^***
	(0.0030)
Age squared	-0.0003^***
	(0.0000)
Education	0.0141^***
	(0.0011)
Married	0.0521^***
	(0.0047)
Has a Child aged 1 to 5	-0.1708^***
	(0.0045)
Has a Newborn	-0.0138
	(0.0074)
Has a Child 6 to 17	-0.0517^***
	(0.0044)
Locus of Control score	0.0013
	(0.0008)
Self-esteem score	0.0023^***
	(0.0005)
Intelligence test score	0.0012^***
	(0.0001)
Urban location	0.0176^***
	(0.0043)
South location	0.0139^*
	(0.0058)
Northeast location	0.0083
	(0.0064)
North Central location	0.0078
	(0.0057)
Spouse’s Income	-0.0016^***
	(0.0001)
Ever stopped by police	-0.0118
	(0.0066)
Unemployment rate	-0.0126^***
	(0.0016)
R²	0.1170
Adj. R²	0.1166
Num. obs.	33365
^***p < 0.001; ^**p < 0.01; ^*p < 0.05

htmlreg(ols.self, custom.coef.names = c("Intercept","Age", "Age squared", "Education", "Married", "Has a Child aged 1 to 5", "Has a Newborn", "Has a Child 6 to 17", "Locus of Control score", "Self-esteem score", "Intelligence test score", "Urban location", "South location", "Northeast location", "North Central location", "Spouse's Income", "Ever stopped by police", "Unemployment rate"), custom.model.names =  "OLS-Self Employed", digits = 4)

Statistical models
	OLS-Self Employed
Intercept	-0.0998^*
	(0.0460)
Age	0.0082^***
	(0.0025)
Age squared	-0.0001^**
	(0.0000)
Education	-0.0016
	(0.0009)
Married	0.0020
	(0.0039)
Has a Child aged 1 to 5	0.0450^***
	(0.0038)
Has a Newborn	-0.0055
	(0.0064)
Has a Child 6 to 17	0.0211^***
	(0.0037)
Locus of Control score	-0.0020^**
	(0.0007)
Self-esteem score	0.0011^**
	(0.0004)
Intelligence test score	0.0001
	(0.0001)
Urban location	0.0034
	(0.0035)
South location	-0.0294^***
	(0.0047)
Northeast location	-0.0304^***
	(0.0052)
North Central location	-0.0283^***
	(0.0047)
Spouse’s Income	0.0006^***
	(0.0001)
Ever stopped by police	0.0272^***
	(0.0054)
Unemployment rate	0.0007
	(0.0013)
R²	0.0254
Adj. R²	0.0248
Num. obs.	28228
^***p < 0.001; ^**p < 0.01; ^*p < 0.05

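# Interpretation: the probability of working rises with age at a decreasing rate (the age-squared coefficient is negative). More education raises the probability of working; its effect on self-employment is negative but not significant in the LPM. Married women are about 5 percentage points more likely to work. Women with children aged 0 to 5 are about 17 points less likely to work, but those who do work are about 4.5 points more likely to be self-employed; the pattern is the same, with smaller magnitudes, for children aged 6 to 17. Having a newborn has a small, insignificant additional negative effect on working. The self-esteem and AFQT scores are positively associated with working; the Rotter score is not significant there. Region effects are relative to the West. A higher spouse's income lowers the probability of working (about 0.16 points per $1,000) and slightly raises the probability of self-employment.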
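
As a rough worked check on the age profile in the working equation, the marginal effect of age in the linear probability model combines the age and age-squared coefficients. The calculation below uses the rounded table values (0.0154 and -0.0003), so the implied turning point is only approximate:

```latex
\frac{\partial \Pr(\text{working}=1)}{\partial\,\text{age}}
  = \hat\beta_{\text{age}} + 2\,\hat\beta_{\text{agesq}}\cdot\text{age}
  \approx 0.0154 - 0.0006\,\text{age},
\qquad
\text{age}^{*} \approx \frac{0.0154}{0.0006} \approx 25.7 .
```

Because the age-squared coefficient is reported to only four decimals, the true peak could lie anywhere from roughly 22 to 31 years; the unrounded coefficients from `coef(ols.work)` would be needed for a precise figure.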

b) Estimate the same two models from a) using either probit or logit and report the marginal effects.

probit.work <- probitmfx(working ~ age + agesq + educ + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)

probit.self <- probitmfx(self_empl ~ age + agesq + educ + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)
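
To make the reported marginal effects concrete, the following is a minimal sketch, on simulated data, of how marginal effects at the sample mean can be computed by hand from a probit fit. It assumes (as we understand the `probitmfx` default) that effects are evaluated at the means of the regressors, with a discrete 0-to-1 change for dummies. The variables x1 and x2 and the simulated coefficients are hypothetical and not from the NLSY file.

```r
# Probit marginal effects at the mean, computed by hand (simulated data).
set.seed(42)
n  <- 5000
x1 <- rnorm(n)                       # hypothetical continuous regressor
x2 <- rbinom(n, 1, 0.4)              # hypothetical dummy regressor
y  <- as.numeric(0.2 + 0.6 * x1 - 0.5 * x2 + rnorm(n) > 0)

fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))
b   <- coef(fit)

# Continuous regressor: normal density at the mean index times the coefficient.
xbar  <- c(1, mean(x1), mean(x2))
me_x1 <- dnorm(sum(xbar * b)) * b["x1"]

# Dummy regressor: change in the predicted probability when x2 goes from 0 to 1,
# holding the other regressor at its mean.
idx0  <- b["(Intercept)"] + b["x1"] * mean(x1)
me_x2 <- pnorm(idx0 + b["x2"]) - pnorm(idx0)

print(c(me_x1 = unname(me_x1), me_x2 = unname(me_x2)))
```

Both numbers are changes in probability, so they are directly comparable to the linear probability model coefficients.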

#htmlreg(list(probit.work,probit.self), digits = 4)


htmlreg(probit.work, custom.coef.names = c( "Age", "Age squared", "Education", "Married", "Has a Child aged 1 to 5", "Has a Newborn", "Has a Child 6 to 17", "Locus of Control score", "Self-esteem score", "Intelligence test score", "Urban location", "South location", "Northeast location", "North Central location", "Spouse's Income", "Ever stopped by police", "Unemployment rate"), custom.model.names = "Probit-Working", digits = 4)

Statistical models
	Probit-Working
Age	0.0107^***
	(0.0029)
Age squared	-0.0002^***
	(0.0000)
Education	0.0163^***
	(0.0011)
Married	0.0239^***
	(0.0047)
Has a Child aged 1 to 5	-0.1730^***
	(0.0050)
Has a Newborn	-0.0166^*
	(0.0065)
Has a Child 6 to 17	-0.0474^***
	(0.0043)
Locus of Control score	0.0011
	(0.0008)
Self-esteem score	0.0020^***
	(0.0005)
Intelligence test score	0.0012^***
	(0.0001)
Urban location	0.0190^***
	(0.0041)
South location	0.0169^**
	(0.0053)
Northeast location	0.0108
	(0.0059)
North Central location	0.0087
	(0.0053)
Spouse’s Income	-0.0012^***
	(0.0001)
Ever stopped by police	-0.0095
	(0.0064)
Unemployment rate	-0.0121^***
	(0.0016)
Num. obs.	33365
Log Likelihood	-12297.3329
Deviance	24594.6659
AIC	24630.6659
BIC	24782.1406
^***p < 0.001; ^**p < 0.01; ^*p < 0.05

htmlreg(probit.self, custom.coef.names = c("Age", "Age squared", "Education", "Married", "Has a Child aged 1 to 5", "Has a Newborn", "Has a Child 6 to 17", "Locus of Control score", "Self-esteem score", "Intelligence test score", "Urban location", "South location", "Northeast location", "North Central location", "Spouse's Income", "Ever stopped by police", "Unemployment rate"), custom.model.names = "Probit-Self-Employed", digits = 4)

Statistical models
	Probit-Self-Employed
Age	0.0101^***
	(0.0024)
Age squared	-0.0001^***
	(0.0000)
Education	-0.0019^*
	(0.0008)
Married	0.0109^**
	(0.0035)
Has a Child aged 1 to 5	0.0437^***
	(0.0040)
Has a Newborn	-0.0017
	(0.0054)
Has a Child 6 to 17	0.0184^***
	(0.0035)
Locus of Control score	-0.0019^**
	(0.0007)
Self-esteem score	0.0011^**
	(0.0004)
Intelligence test score	0.0001
	(0.0001)
Urban location	0.0034
	(0.0033)
South location	-0.0238^***
	(0.0038)
Northeast location	-0.0247^***
	(0.0038)
North Central location	-0.0227^***
	(0.0038)
Spouse’s Income	0.0003^***
	(0.0000)
Ever stopped by police	0.0298^***
	(0.0062)
Unemployment rate	0.0006
	(0.0013)
Num. obs.	28228
Log Likelihood	-6881.4622
Deviance	13762.9243
AIC	13798.9243
BIC	13947.3896
^***p < 0.001; ^**p < 0.01; ^*p < 0.05

c) Are your estimates between parts a and b similar? Please interpret your results.

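The two sets of estimates are broadly similar in sign, magnitude, and significance. The main differences are in marriage, education, and the newborn indicator.

The marginal effect of marriage on working is about half the size in the probit (0.0239) as in the linear probability model (0.0521), while in the self-employment equation marriage is small and insignificant in the LPM (0.0020) but significant in the probit (0.0109). Education has a slightly larger effect on working in the probit (0.0163 versus 0.0141), and its negative effect on self-employment becomes significant at the 5% level. Having a newborn is insignificant in the LPM working equation but significant at the 5% level in the probit. Policever is insignificant in both working equations and positive and significant in both self-employment equations.

All coefficients are changes in probability, so multiplying by 100 gives percentage points. For example, if spouse's income increases by $1,000, the likelihood of working falls by about 0.16 percentage points in the LPM and 0.12 percentage points in the probit.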
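
One reason the two approaches tend to agree is that an LPM slope approximates the average partial effect from a probit. The sketch below illustrates this on simulated data. The regressor x and the coefficients are hypothetical, and the probit effect here is averaged over observations rather than evaluated at the mean, so it is an illustration of the idea and not a replication of the tables above.

```r
# LPM slope versus probit average partial effect on simulated data.
set.seed(7)
n <- 5000
x <- rnorm(n)                                  # hypothetical regressor
y <- as.numeric(0.3 + 0.5 * x + rnorm(n) > 0)  # latent-index binary outcome

lpm    <- lm(y ~ x)
probit <- glm(y ~ x, family = binomial(link = "probit"))

# Average partial effect: mean of the normal density at each fitted index,
# multiplied by the probit coefficient.
index <- predict(probit, type = "link")
ape   <- mean(dnorm(index)) * coef(probit)["x"]

print(c(lpm_slope = unname(coef(lpm)["x"]), probit_ape = unname(ape)))
```

Larger gaps between the two, such as the marriage effect above, are more likely when a regressor is binary or when predicted probabilities sit far from 0.5.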

d) Consider what variables would affect your likelihood of working, but not necessarily your likelihood of becoming self-employed?

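The child indicators are the variables that most clearly affect the likelihood of working: having a child aged 0 to 5 lowers it by about 17 percentage points. Their estimated effects on self-employment are much smaller (about 4 points for children aged 0 to 5), although still statistically significant, so treating them as excluded from the self-employment equation is an assumption. Below we keep the child indicators in the working equation and drop them from the self-employment equation.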

probit.work_children <- probitmfx(working ~ age + agesq + educ + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)

probit.self_nochildren <- probitmfx(self_empl ~ age + agesq + educ + married +  rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate, data=NLSY)

#htmlreg(list(probit.work_children,probit.self_nochildren), digits = 4)


#needs labels here 
htmlreg(probit.work_children, custom.coef.names = c( "Age", "Age squared", "Education", "Married", "Has a Child aged 1 to 5", "Has a Newborn", "Has a Child 6 to 17", "Locus of Control score", "Self-esteem score", "Intelligence test score", "Urban location", "South location", "Northeast location", "North Central location", "Spouse's Income", "Ever stopped by police", "Unemployment rate"), custom.model.names = "Probit-Working with children",digits = 4)

Statistical models
	Probit-Working with children
Age	0.0107^***
	(0.0029)
Age squared	-0.0002^***
	(0.0000)
Education	0.0163^***
	(0.0011)
Married	0.0239^***
	(0.0047)
Has a Child aged 1 to 5	-0.1730^***
	(0.0050)
Has a Newborn	-0.0166^*
	(0.0065)
Has a Child 6 to 17	-0.0474^***
	(0.0043)
Locus of Control score	0.0011
	(0.0008)
Self-esteem score	0.0020^***
	(0.0005)
Intelligence test score	0.0012^***
	(0.0001)
Urban location	0.0190^***
	(0.0041)
South location	0.0169^**
	(0.0053)
Northeast location	0.0108
	(0.0059)
North Central location	0.0087
	(0.0053)
Spouse’s Income	-0.0012^***
	(0.0001)
Ever stopped by police	-0.0095
	(0.0064)
Unemployment rate	-0.0121^***
	(0.0016)
Num. obs.	33365
Log Likelihood	-12297.3329
Deviance	24594.6659
AIC	24630.6659
BIC	24782.1406
^***p < 0.001; ^**p < 0.01; ^*p < 0.05

htmlreg(probit.self_nochildren, custom.coef.names = c( "Age", "Age squared", "Education", "Married", "Locus of Control score", "Self-esteem score", "Intelligence test score", "Urban location", "South location", "Northeast location", "North Central location", "Spouse's Income", "Ever stopped by police", "Unemployment rate"), custom.model.names = "Probit - Self-employed without children", digits = 4)

Statistical models
	Probit - Self-employed without children
Age	0.0167^***
	(0.0023)
Age squared	-0.0002^***
	(0.0000)
Education	-0.0033^***
	(0.0008)
Married	0.0247^***
	(0.0034)
Locus of Control score	-0.0017^**
	(0.0007)
Self-esteem score	0.0010^**
	(0.0004)
Intelligence test score	0.0001
	(0.0001)
Urban location	0.0026
	(0.0033)
South location	-0.0260^***
	(0.0039)
Northeast location	-0.0261^***
	(0.0039)
North Central location	-0.0228^***
	(0.0039)
Spouse’s Income	0.0004^***
	(0.0000)
Ever stopped by police	0.0287^***
	(0.0062)
Unemployment rate	0.0007
	(0.0013)
Num. obs.	28228
Log Likelihood	-6973.1138
Deviance	13946.2277
AIC	13976.2277
BIC	14099.9487
^***p < 0.001; ^**p < 0.01; ^*p < 0.05
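
The self-employment probit without the child indicators is nested in the earlier self-employment probit, and both are fit on the same 28,228 observations, so the reported log-likelihoods allow a quick likelihood-ratio check of whether the three child indicators can be dropped. This is a sketch using only the numbers printed in the tables; the variable names are ours.

```r
# Likelihood-ratio check using the log-likelihoods reported in the two
# self-employment probit tables (same 28,228 observations in both).
ll_with_children    <- -6881.4622   # probit including the three child indicators
ll_without_children <- -6973.1138   # probit dropping d_ch_1_5, d_ch_0, d_ch_6_17

lr_stat <- 2 * (ll_with_children - ll_without_children)
p_value <- pchisq(lr_stat, df = 3, lower.tail = FALSE)

print(lr_stat)   # about 183.3
print(p_value)
```

A statistic of about 183 on 3 degrees of freedom rejects the restriction, which suggests the child indicators do help explain self-employment and that excluding them is a strong assumption.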

e) Now consider a sample selection model where the woman first decides whether or not to work and then decides if she should be self-employed. What variable do you choose as your instrument? That is, what variable affects your decision to work, but not your decision to be self-employed? Provide some reasoning for your answer.

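Having children may affect the decision to work but not the decision to be self-employed. A high unemployment rate may likewise affect the decision to work but not the choice of self-employment. Having been stopped by the police could affect whether an employer will hire you rather than whether you want to work, so it may also matter for self-employment. Candidate instruments are therefore the child indicators (d_ch_1_5, d_ch_0, d_ch_6_17), unemp_rate, and possibly policever. In the model below we exclude the child indicators and unemp_rate from the self-employment equation and keep policever in both equations.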
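
An exclusion restriction has two parts: the excluded variables must matter for the decision to work (testable) and must not affect self-employment directly (an assumption). The sketch below shows one way to check the first part with a likelihood-ratio test on the selection probit. It uses simulated data, and x, z1, z2, and work are hypothetical stand-ins, not columns of the NLSY file.

```r
# Joint relevance of candidate exclusion restrictions in a selection probit
# (simulated data; z1 and z2 stand in for variables such as child indicators).
set.seed(3)
n  <- 4000
x  <- rnorm(n)                    # regressor in both equations
z1 <- rbinom(n, 1, 0.3)           # hypothetical excluded variable 1
z2 <- rnorm(n)                    # hypothetical excluded variable 2
work <- as.numeric(0.5 + 0.4 * x - 0.8 * z1 - 0.3 * z2 + rnorm(n) > 0)

restricted   <- glm(work ~ x,           family = binomial(link = "probit"))
unrestricted <- glm(work ~ x + z1 + z2, family = binomial(link = "probit"))

# Likelihood-ratio test that z1 and z2 are jointly zero in the selection equation.
lr_test <- anova(restricted, unrestricted, test = "Chisq")
print(lr_test)
p_joint <- lr_test[2, "Pr(>Chi)"]
```

A small p-value supports relevance only; it says nothing about whether the excluded variables are truly absent from the outcome equation.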

f) Estimate a Heckman two-step equation to correct for sample selection in your self-employment equation.

heck.1<-heckit(working ~ age + agesq + educ + married + d_ch_1_5 + d_ch_0 + d_ch_6_17 + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever + unemp_rate,self_empl ~ age + agesq + educ + married + rotter_score + sesteem_score1 + afqt_1 + urban + south + northeast + northcen + sp_inc1000 + policever, data = NLSY, method = "2step")

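#s = selection equation, o = outcome equation
#The inverse Mills ratio is significant, meaning there is selection bias in the uncorrected self-employment equation. Its positive sign means the unobservables that raise the probability of working also raise the probability of self-employment; the direction of the bias in each uncorrected beta depends on how that variable enters the selection equation.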
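
For intuition about what the two-step procedure does, the following sketch carries out the steps by hand on simulated data: a probit for selection, the inverse Mills ratio from its fitted index, and then OLS on the selected sample with that ratio added. All variables (x, z, selected, y) and parameter values are hypothetical, and the plain `lm` standard errors in the second step ignore that the inverse Mills ratio is itself estimated, so this is an illustration and not a substitute for `heckit`.

```r
# Heckman two-step by hand on simulated data (hypothetical variables).
set.seed(11)
n   <- 5000
x   <- rnorm(n)                         # regressor in both equations
z   <- rnorm(n)                         # excluded variable (selection only)
u   <- rnorm(n)                         # selection error
rho <- 0.6
e   <- rho * u + sqrt(1 - rho^2) * rnorm(n)   # outcome error, corr(u, e) = rho

selected <- as.numeric(0.3 + 0.5 * x + 1.0 * z + u > 0)
y        <- 1 + 0.5 * x + e             # outcome, used only when selected == 1

# Step 1: probit for selection, then the inverse Mills ratio.
step1 <- glm(selected ~ x + z, family = binomial(link = "probit"))
index <- predict(step1, type = "link")
imr   <- dnorm(index) / pnorm(index)

# Step 2: OLS on the selected sample with the inverse Mills ratio added.
sel   <- selected == 1
naive <- lm(y[sel] ~ x[sel])
step2 <- lm(y[sel] ~ x[sel] + imr[sel])

print(coef(naive))
print(coef(step2))   # last coefficient estimates rho * sd(e), which is 0.6 here
```

In this simulation x raises the probability of selection and the errors are positively correlated, so the uncorrected slope on x is pulled below its true value of 0.5.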

#htmlreg(list(heck.1$lm),digits = 5)

htmlreg(heck.1$lm, custom.coef.names = c("Intercept", "Age", "Age squared", "Education", "Married", "Locus of Control score", "Self-esteem score", "Intelligence test score", "Urban location", "South location", "Northeast location", "North Central location", "Spouse's Income", "Ever stopped by police", "Inverse Mills Ratio"), custom.model.names = "Heckman Model", digits = 4)

Statistical models
	Heckman Model
Intercept	-0.2607^***
	(0.0402)
Age	0.0144^***
	(0.0023)
Age squared	-0.0002^***
	(0.0000)
Education	0.0006
	(0.0009)
Married	0.0149^***
	(0.0037)
Locus of Control score	-0.0017^*
	(0.0007)
Self-esteem score	0.0015^***
	(0.0004)
Intelligence test score	0.0004^***
	(0.0001)
Urban location	0.0066
	(0.0035)
South location	-0.0261^***
	(0.0047)
Northeast location	-0.0277^***
	(0.0051)
North Central location	-0.0257^***
	(0.0047)
Spouse’s Income	0.0002^***
	(0.0001)
Ever stopped by police	0.0251^***
	(0.0054)
Inverse Mills Ratio	0.1354^***
	(0.0116)
R²	0.0928
Adj. R²	0.0923
Num. obs.	28228
^***p < 0.001; ^**p < 0.01; ^*p < 0.05

g) Are there any differences between your results from c) and f)?

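Yes. In the Heckman model the coefficient on the inverse Mills ratio is positive and statistically significant (0.1354), which indicates selection bias: the unobserved factors that make a woman more likely to work also make her more likely to be self-employed, so estimates that ignore selection into work are biased. Compared with the self-employment probit without the child indicators, education changes from negative and significant (-0.0033) to small and insignificant (0.0006), the AFQT score becomes significant (0.0004 versus 0.0001), and the effects of marriage (0.0149 versus 0.0247) and spouse's income (0.0002 versus 0.0004) shrink. The regional and policever effects change little.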
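
The role of the inverse Mills ratio can be summarized with the standard selection-model expression for the outcome among women who work. The notation is ours: y is the self-employment indicator, x the outcome regressors, z the selection regressors with probit coefficients gamma, rho the correlation between the two error terms, and sigma_epsilon the standard deviation of the outcome error.

```latex
E[\,y \mid x,\ \text{working}=1\,]
  = x'\beta + \rho\,\sigma_{\varepsilon}\,\lambda(z'\gamma),
\qquad
\lambda(z'\gamma) = \frac{\phi(z'\gamma)}{\Phi(z'\gamma)} .
```

Under this expression the coefficient on the inverse Mills ratio (0.1354 in the table) estimates the product of rho and sigma_epsilon, so its positive sign points to a positive correlation between the unobservables in the two equations.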

Amy Shah and Jennifer Russo, group project

MSBA Data Analytics III