In this part of the problem set, we are going to replicate part of the results of Joshua Angrist and William Evans’ article “Children and Their Parents’ Labor Supply: Evidence from Exogenous Variation in Family Size.” Here is the abstract of the study:

Research on the labor-supply consequences of childbearing is complicated by the endogeneity of fertility. This study uses parental preferences for a mixed sibling-sex composition to construct instrumental variables (IV) estimates of the effect of childbearing on labor supply. IV estimates for women are significant but smaller than ordinary least-squares estimates. The IV are also smaller for more educated women and show no impact of family size on husbands’ labor supply. A comparison of estimates using sibling-sex composition and twins instruments implies that the impact of a third child disappears when the child reaches age 13. (JEL J13, J22)

The purpose of this exercise is to study how fertility affects female labor supply. In order to do this, we are going to compare female labor supply in households with two children versus households with three children. Since fertility decisions are endogenous, we are going to use two sets of instruments: whether there is a multiple pregnancy in the second pregnancy and sex composition of the first two children. This latter instrument was the one proposed by Angrist & Evans (1998). Intuitively, parents are more likely to have a third child when the first two have the same sex. Assuming that whether the first two children have the same sex is random, we can use this variable as an instrument for the number of children in the household.
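
The logic of the instrument can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names (`u`, `z`, `d`, `y`) and all numbers are assumptions chosen for illustration, not taken from the census extract. An unobserved factor `u` drives both fertility and work, so OLS is biased, while a randomly assigned instrument `z` recovers the true effect.

```r
# Simulated illustration; every name and number here is hypothetical
set.seed(1)
n <- 200000
u <- rnorm(n)                        # unobserved factor affecting fertility and work
z <- rbinom(n, 1, 0.5)               # instrument: random, unrelated to u
d <- as.integer(0.3 * z - 0.5 * u + rnorm(n) > 0.5)  # endogenous "third child" indicator
y <- 1 - 0.10 * d + 0.3 * u + rnorm(n, sd = 0.5)     # outcome; true effect of d is -0.10

ols_est <- unname(coef(lm(y ~ d))["d"])  # biased: d is correlated with u
iv_est  <- cov(y, z) / cov(d, z)         # simple IV (Wald) estimate with one instrument

print(c(ols = ols_est, iv = iv_est))
```

In this setup the OLS estimate overstates the negative effect, while the IV estimate should land close to the true value of -0.10, up to sampling noise.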

We are going to use the census80.csv dataset that corresponds to an extract of the 1980 US Census. It has been restricted to the set of families with two or three children and with mother’s age between 21 and 35 years. The data set contains the following variables:

Setting my working directory and loading the data set.

setwd("~/EDX courses/MicroMaster MIT/14.310x-Data Analysis for Social Scientists/Programs")
mydata <- read.csv("census80.csv")
summary(mydata)
    workedm           weeksm          whitem           blackm      
 Min.   :0.0000   Min.   : 0.00   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.: 0.00   1st Qu.:1.0000   1st Qu.:0.0000  
 Median :1.0000   Median :12.00   Median :1.0000   Median :0.0000  
 Mean   :0.5716   Mean   :20.82   Mean   :0.8314   Mean   :0.1125  
 3rd Qu.:1.0000   3rd Qu.:48.00   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :52.00   Max.   :1.0000   Max.   :1.0000  
                                                                   
     hispm            othracem           sex1st           sex2nd      
 Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.00000   Median :0.00000   Median :0.0000   Median :0.0000  
 Mean   :0.02725   Mean   :0.02886   Mean   :0.4871   Mean   :0.4881  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
                                                                      
    ageq2nd         ageq3rd         numberkids   
 Min.   : 0.00   Min.   : 0.00    Min.   :2.000  
 1st Qu.: 9.00   1st Qu.: 5.00    1st Qu.:2.000  
 Median :19.00   Median :13.00    Median :2.000  
 Mean   :21.75   Mean   :16.59    Mean   :2.286  
 3rd Qu.:33.00   3rd Qu.:26.00    3rd Qu.:3.000  
 Max.   :71.00   Max.   :67.00    Max.   :3.000  
                 NA's   :305132                  
Q.1. Use the command summary to summarize the variables in the data. Using your output, fill in the following information:
  1. Fraction of mothers that work: 0.5716
  2. 3rd quartile of weeks worked: 48
  3. Proportion of Hispanic mothers: 0.02725
  4. Median age of the second child in quarters: 19
Q.2. Use the variable ageq2nd and the variable ageq3rd to construct an indicator variable on whether there was a multiple pregnancy during the mother’s second pregnancy. What is the proportion of households with a multiple pregnancy in the second pregnancy?
#Loading Required Library
library("AER")

#Constructing an indicator variable using the ageq2nd and ageq3rd variables
#A multiple second pregnancy means the second and third children have the same age in quarters
#ageq3rd is NA for two-child families, so those rows are set to 0
mydata$multiple <- as.integer(!is.na(mydata$ageq3rd) & mydata$ageq2nd == mydata$ageq3rd)
summary(mydata$multiple)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.00000 0.00000 0.00729 0.00000 1.00000 

Based on this output, the proportion of households with a multiple pregnancy in the second pregnancy is 0.00729 (about 0.73%).

Q.3. Use the variables sex1st and sex2nd to construct an indicator variable on whether the first and the second born children have the same sex. What is the proportion of households in which the first two children have the same sex?
#Constructing an indicator variable using the sex1st and sex2nd variables
mydata$samesex <- as.integer(mydata$sex1st == mydata$sex2nd) #1 if the first two children have the same sex, 0 otherwise
summary(mydata$samesex)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  1.0000  0.5019  1.0000  1.0000 

The proportion of households in which the first two children have the same sex is 0.5019.

Now let’s set up the model we want to estimate. In particular, we are interested in estimating the following equation:

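$$\text{laborsupply}_h = \alpha_0 + \alpha_1\,\mathbf{1}\{\text{3 children}\}_h + \alpha_2\,\text{blackmother}_h + \alpha_3\,\text{hispanicmother}_h + \alpha_4\,\text{otherrace}_h + \varepsilon_h \quad \text{(equation 1)}$$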

where,

# Creating an indicator variable 'three' for families with three children
mydata$three <- as.integer(mydata$numberkids == 3) # 1 if the household has three children, 0 otherwise
summary(mydata$three)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.2861  1.0000  1.0000 
Q. 4. Run the above model through OLS using whether the mom works and the number of weeks she works as the dependent variables. According to your estimates, which of the following statements are correct? Select all that apply.
# Running the OLS Model
ols1 <- lm(workedm ~ three + blackm + hispm + othracem, data = mydata)

# Creating an empty matrix
OLS <- matrix(ncol = 2, nrow = 2, data = NA) #creating an empty matrix to input my answer

#Inputting the values of my interest in the empty matrix
OLS[1, 1] <- ols1$coefficients[2] #1st value, in first row X first column
pvalue <- summary(ols1)
OLS[2, 1] <- pvalue$coefficients[2, 4]#2nd value in second row X first column

ols2 <- lm(weeksm ~ three + blackm + hispm + othracem, data = mydata)
OLS[1, 2] <- ols2$coefficients[2]# 3rd value in 1st row second column
pvalue <- summary(ols2)
OLS[2, 2] <- pvalue$coefficients[2, 4] #4th value in 2nd row second column
OLS
           [,1]      [,2]
[1,] -0.0839132 -3.940177
[2,]  0.0000000  0.000000

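Based on the results, a third child is associated with the mother being about 8.4 percentage points less likely to work and with about 3.9 fewer weeks worked. Both estimates are statistically significant (the p-values print as zero).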
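
The matrix-filling steps used for the OLS results can be wrapped in a small helper. The sketch below is an illustration only: `extract_effect` is a hypothetical helper name, and it is demonstrated on R's built-in `mtcars` data rather than on census80.csv.

```r
# Hypothetical helper (not part of the original solution): return the estimate
# and p-value of one term from a fitted lm model.
extract_effect <- function(model, term) {
  coefs <- coef(summary(model))  # matrix with Estimate, Std. Error, t value, Pr(>|t|)
  c(estimate = coefs[term, "Estimate"],
    p_value  = coefs[term, "Pr(>|t|)"])
}

# Demonstration on built-in data; with the census data one would call,
# for example, extract_effect(ols1, "three")
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_res <- extract_effect(demo_fit, "wt")
print(demo_res)
```

Returning a named vector keeps the estimate and its p-value labelled, which avoids having to remember which matrix cell holds which number.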

Since fertility is an endogenous variable, we want to use the multiple pregnancy and the same sex variables as instruments for having three children in the household. We are going to estimate the first stage using each variable separately. Run a regression for each of these instruments using the indicator of having three children as the dependent variable and controlling for the race of the mother.

Q. 5. According to your estimates, by having a multiple pregnancy during the second pregnancy, by how many percentage points does the likelihood of having a third child increase?

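Required model:

$$\mathbf{1}\{\text{3 children}\}_h = \beta_0 + \beta_1\,\text{multiple}_h + \beta_2\,\text{blackmother}_h + \beta_3\,\text{hispanicmother}_h + \beta_4\,\text{otherrace}_h + \nu_h \quad \text{(equation 2)}$$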

ols3 <- lm(three ~ multiple + blackm + hispm + othracem, data = mydata)
myanswer <- matrix(ncol = 1, nrow = 2, data = NA)
myanswer[1,1] <- ols3$coefficients[2]
pvalue <- summary(ols3)
myanswer[2,1] <- pvalue$coefficients[2,4]
myanswer
          [,1]
[1,] 0.7179404
[2,] 0.0000000
## Or we can simply print the summary statistics
summary(ols3)

Call:
lm(formula = three ~ multiple + blackm + hispm + othracem, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3528 -0.2710 -0.2710  0.6641  0.7290 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.2710109  0.0007523 360.242  < 2e-16 ***
multiple    0.7179404  0.0080400  89.296  < 2e-16 ***
blackm      0.0648870  0.0021730  29.860  < 2e-16 ***
hispm       0.0817475  0.0042103  19.416  < 2e-16 ***
othracem    0.0115414  0.0040954   2.818  0.00483 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4472 on 427413 degrees of freedom
Multiple R-squared:  0.02108,   Adjusted R-squared:  0.02107 
F-statistic:  2301 on 4 and 427413 DF,  p-value: < 2.2e-16

In either case, we find that having a multiple pregnancy at the second pregnancy increases the likelihood of having a third child by 71.79 percentage points.
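
As a rough check of instrument strength, the first-stage F-statistic for a single instrument equals the square of its t value (under the default homoskedastic standard errors). The short sketch below applies this to the t value reported for `multiple`; the variable names are hypothetical, and the threshold of 10 is a common rule of thumb, not a formal test.

```r
t_multiple <- 89.296          # t value on `multiple` from the regression summary
f_multiple <- t_multiple^2    # with one instrument, first-stage F equals t squared
print(f_multiple)             # roughly 7974
print(f_multiple > 10)        # compare with the usual rule-of-thumb threshold of 10
```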

Q.6. According to your estimates, when the first two children are of the same sex, by how many percentage points does the likelihood of having a third child increase?
ols4 <- lm(three ~ samesex + blackm + hispm + othracem, data = mydata)
myanswer1 <- matrix(ncol = 1, nrow = 2, data = NA)
myanswer1[1,1] <- ols4$coefficients[2]
pvalue <- summary(ols4)
myanswer1[2,1] <- pvalue$coefficients[2,4]
myanswer1
              [,1]
[1,]  4.901816e-02
[2,] 1.669377e-276

Based on the findings, having two children of the same sex increases the likelihood of having a third child by 4.90 percentage points.

Q.7. Now, run the IV regression using whether the mother works as the dependent variable and multiple pregnancy as the instrument. According to this model, by how many percentage points does the likelihood that the mother works change when a third child is born?
iv1 <- ivreg(workedm ~ three + blackm + hispm + othracem | blackm + hispm + othracem + multiple, data = mydata)

iv2 <- ivreg(weeksm ~ three + blackm + hispm + othracem | blackm + hispm + othracem + multiple, data = mydata)

iv <- matrix(ncol=2, nrow=2, data="Not Yet")
iv[1,1] <- iv1$coefficients[2]
pvalue <- summary(iv1)
iv[2,1] <- pvalue$coefficients[2,4]

iv[1,2] <- iv2$coefficients[2]
pvalue1 <- summary(iv2)
iv[2,2] <- pvalue1$coefficients[2,4]
iv
     [,1]                   [,2]                 
[1,] "-0.064125589077173"   "-3.13765416218266"  
[2,] "1.93810863508335e-07" "1.1639267997331e-08"

The results show that having a third child decreases the likelihood that the mother works by 6.41 percentage points when we use the multiple pregnancy variable as the instrument.

Q.8. Now, run the IV regression using whether the mother works as the dependent variable and the same-sex variable as the instrument. According to this model, how does the likelihood that the mother works change when a third child is born?
iv3 <- ivreg(workedm ~ three + blackm + hispm + othracem | blackm + hispm + othracem + samesex, data = mydata)

iv4 <- ivreg(weeksm ~ three + blackm + hispm + othracem | blackm + hispm + othracem + samesex, data = mydata)


iv_1 <- matrix(ncol=2, nrow=2, data="Not Yet")
iv_1[1,1] <- iv3$coefficients[2]
pvalue2 <- summary(iv3)
iv_1[2,1] <- pvalue2$coefficients[2,4]

iv_1[1,2] <- iv4$coefficients[2]
pvalue3 <- summary(iv4)
iv_1[2,2] <- pvalue3$coefficients[2,4]
iv_1
     [,1]                  [,2]                  
[1,] "-0.0982205358125773" "-4.99429876267641"   
[2,] "0.00137589272606202" "0.000268763736611791"

The results show that, if we use the same-sex variable as the instrument, having a third child decreases the likelihood that the mother works by 9.82 percentage points.
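
With one instrument and one endogenous regressor, the IV coefficient equals the reduced-form coefficient divided by the first-stage coefficient. The sketch below checks this on simulated data using base R only; the variables `x`, `z`, `u`, `d`, `y` and all numbers are hypothetical. The manual second stage is shown only to verify the point estimate, since its standard errors are not valid; `ivreg` computes them correctly.

```r
# Simulated data; every name and number here is hypothetical
set.seed(42)
n <- 50000
x <- rbinom(n, 1, 0.2)     # a control variable
z <- rbinom(n, 1, 0.5)     # the instrument
u <- rnorm(n)              # unobserved confounder
d <- as.integer(0.2 * z + 0.1 * x + 0.5 * u + rnorm(n) > 0.6)  # endogenous treatment
y <- 0.6 - 0.08 * d + 0.05 * x - 0.2 * u + rnorm(n)            # outcome

first_stage  <- lm(d ~ z + x)   # effect of the instrument on the treatment
reduced_form <- lm(y ~ z + x)   # effect of the instrument on the outcome
ratio <- unname(coef(reduced_form)["z"] / coef(first_stage)["z"])

# Manual two-stage least squares: replace d with its first-stage fitted values
d_hat <- fitted(first_stage)
tsls  <- unname(coef(lm(y ~ d_hat + x))["d_hat"])

print(c(ratio = ratio, tsls = tsls))  # the two numbers coincide
```

Applied to the estimates above, this relationship implies a reduced-form effect of multiple on working of roughly -0.0641 × 0.7179 ≈ -0.046, and of samesex of roughly -0.0982 × 0.0490 ≈ -0.0048. These are back-of-the-envelope figures implied by the reported coefficients, not separately estimated regressions.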

Q. 9. As you should see, the following relationship holds, in absolute value, between the point estimates of the three strategies that we have used: $|\hat{\alpha}_1^{IV-multiple}| \le |\hat{\alpha}_1^{OLS}| \le |\hat{\alpha}_1^{IV-samesex}|$. Assuming a model of heterogeneous effects, what might explain these differences?

IV estimates are local treatment effects. Thus, we are identifying the effect of fertility on women who have a third child when the relevant instrument changes.

Why?

Under heterogeneous effects, IV estimates correspond to LATE (local average treatment effects). Thus, we identify the average effect for the population that decides to have a third child when the instrument is switched on. This implies that $\hat{\alpha}_1^{IV-multiple} = -0.0641$ is the treatment effect on those who have a third child because of a multiple pregnancy. For most of the population, a multiple pregnancy at the second pregnancy implies having a third child. On the other hand, $\hat{\alpha}_1^{IV-samesex} = -0.0982$ corresponds to the treatment effect on those who decide to have a third child when the first two children have the same sex.
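
The LATE argument can be made concrete with a simulation in which two instruments move different groups. Everything below is hypothetical: the two "types" of mothers, their effect sizes, and the instruments `z1` (which moves almost everyone, loosely like a multiple birth) and `z2` (which moves only one type, loosely like same sex) are assumptions for illustration, not estimates from the census data.

```r
# Simulated data; every name and number here is hypothetical
set.seed(2024)
n <- 400000
type   <- sample(c("small", "large"), n, replace = TRUE)
effect <- ifelse(type == "small", -0.05, -0.15)  # heterogeneous effect of a third child

z1 <- rbinom(n, 1, 0.5)        # instrument that induces a third child for everyone
z2 <- rbinom(n, 1, 0.5)        # instrument that induces one only for the "large" type
baseline <- rbinom(n, 1, 0.2)  # mothers who would have a third child anyway

d <- pmax(baseline, z1, z2 * (type == "large"))  # third-child indicator
y <- 0.6 + effect * d + rnorm(n, sd = 0.5)       # outcome

# Wald estimator: difference in mean outcomes over difference in mean treatment
wald <- function(y, d, z) {
  (mean(y[z == 1]) - mean(y[z == 0])) / (mean(d[z == 1]) - mean(d[z == 0]))
}

late_z1 <- wald(y, d, z1)  # averages over both types of compliers
late_z2 <- wald(y, d, z2)  # compliers are all of the "large" type
print(c(late_z1 = late_z1, late_z2 = late_z2))
```

Both instruments are valid here, yet they give different estimates because each one averages the effect over a different group of compliers.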

Part II

Thinking clearly about experimental design allows us to identify parameters beyond treatment effects, for example, general equilibrium effects as in the French unemployment experiment. Another potential advantage of carefully designing experiments is the identification of potential mechanisms that drive a causal relationship. In this set of questions, we are going to discuss the identification of mechanisms. We are going to study Bursztyn et al.’s (2014) article “Understanding Mechanisms Underlying Peer Effects: Evidence from a Field Experiment on Financial Decisions.”

For now, assume you are interested in establishing whether there is social influence on financial decisions, and that you have the following experimental design:

Using this experimental design, you decide to estimate the following model:

$$\text{decision}_p = \beta_0 + \beta_1\,\text{information}_p + \varepsilon_p \quad \text{(equation 4)}$$

where:
- $\text{decision}_p$ is a dummy variable that indicates whether investor 2 in pair p makes the same decision as her peer;
- $\text{information}_p$ indicates whether pair p belongs to the treatment group, in which investor 2 received information on the decision of investor 1;
- $\varepsilon_p$ is an error term.

Q.10. Does this experimental design allow you to identify the causal effect of what peers do on financial decisions?

Yes. I have conducted an RCT in which I randomized whether an investor learns about the decision of her peer, so the parameter β1 identifies a causal treatment effect. If I see an effect on the decision, it means that her decision was influenced by the knowledge of what investor 1 did.

A researcher points out that equation 4 is not exploiting all the information in the data. She suggests that I can estimate the following model, which will allow me to identify not only the causal effect of knowing the peer’s decision, but also the causal effect of having a peer who doesn’t purchase the asset:

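$$\text{purchase}_{p2} = \beta_0 + \beta_1\,\text{purchase}_{p1} + \beta_2\,\text{information}_p + \beta_3\,\text{purchase}_{p1} \times \text{information}_p + \varepsilon_p \quad \text{(equation 5)}$$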

where:
- $\text{purchase}_{p2}$ is a dummy variable that indicates whether investor 2 in pair p purchased the asset;
- $\text{purchase}_{p1}$ indicates whether investor 1 purchased the asset;
- $\text{information}_p$ indicates whether pair p belongs to the treatment group of sharing information;
- $\text{purchase}_{p1} \times \text{information}_p$ is the interaction;
- $\varepsilon_p$ is an error term.

Q.11. Which parameter allows you to identify the causal effect of having a peer who doesn’t purchase the asset?

It is not possible to tell in this setting.

Why? In this setting, the researcher has randomized whether the second investor knows about the decision of the first one. However, pairs are endogenously formed and thus it is not possible to identify the causal effect of having a peer who declined to purchase the asset.

Q.12. Which parameter allows you to identify heterogeneous effects of social influence by investor 1’s decision (whether she decided to purchase the asset or not)?

β3. The parameter β3 corresponds to a difference-in-differences estimator. In particular, it tells us whether investors who learn that their peers purchased the asset react differently than investors who learn that their peers decided to decline the offer.
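
The difference-in-differences reading of β3 can be checked numerically. The sketch below uses simulated data with hypothetical purchase probabilities. Because the model with the interaction is saturated in two dummies, the interaction coefficient equals exactly the difference in the effect of information between pairs where investor 1 purchased and pairs where she did not.

```r
# Simulated data; the purchase probabilities below are hypothetical
set.seed(3)
n <- 4000
purchase1   <- rbinom(n, 1, 0.5)   # did investor 1 purchase?
information <- rbinom(n, 1, 0.5)   # was investor 2 told about it?
p <- 0.30 + 0.10 * purchase1 + 0.05 * information + 0.20 * purchase1 * information
purchase2 <- rbinom(n, 1, p)       # did investor 2 purchase?

fit <- lm(purchase2 ~ purchase1 * information)
beta3 <- unname(coef(fit)["purchase1:information"])

# Cell means: rows are purchase1 (0/1), columns are information (0/1)
m <- tapply(purchase2, list(purchase1, information), mean)
did <- unname((m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"]))

print(c(beta3 = beta3, did = did))  # identical up to floating-point error
```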

Economic theory has identified two potential mechanisms of social influence on financial decisions. When someone learns that her peers have purchased an asset, she can be influenced via:

  1. Social learning: she learns some information about the asset from the decision of her peers.
  2. Social utility: she is influenced by the fact that her peers hold the asset, even under a setting where information remains constant.

Instead of estimating the model in equation 5, you could use the following one:

$$\text{purchase}_{p2} = \beta_0 + \beta_1\,\text{nopurchase}_{p1} + \beta_2\,\text{information}_p + \varepsilon_p \quad \text{(equation 6)}$$

where $\text{nopurchase}_{p1}$ indicates whether investor 1 of pair p declined to purchase the asset.

Q.13. Would any of the models given by equations 4, 5 or 6 allow you to separately identify the channels of social learning and social utility?

No. None of the models in equations 4, 5, or 6 would allow you to separate those channels. In order to do this, you need to think carefully about the experimental design. This is precisely what Bursztyn et al. (2014) did in their article and what we are going to discuss next.

Bursztyn et al. (2014) conduct an experiment in which they precisely try to separately identify these channels. Figure 1 presents the experimental design of their paper. Here is a brief summary of their experimental design:

  1. Partner with a financial company.

  2. Identify peer-pairs of investors using referrals to a financial company.

  3. Randomize who is investor 1 and investor 2 in each pair.

  4. Offer investor 1 the possibility of participating in a lottery to purchase a new financial asset.

  5. On those pairs in which investor 1 decided to participate in the lottery, randomize whether she can or can’t purchase the asset.

  6. On the pairs in which investor 1 couldn’t purchase the asset, randomize whether investor 2 learns the decision of her peer:

  • No information (group A).
  • Information that individual 1 decided to participate in the lottery and was unsuccessful in purchasing the asset (group B).

  7. On the pairs in which investor 1 could purchase the asset, randomize whether investor 2 learns the decision of her peer:

  • No information (group A).
  • Information that individual 1 decided to participate in the lottery and was successful in purchasing the asset (group C).

  8. Have an additional group of individuals with no information: investors 2 in pairs in which investor 1 declined to purchase the asset (group Aneg).

  9. Their main outcome is whether investor 2 decides to purchase the asset or not.

Q.14. Which comparison between these groups will correspond to the treatment effect of social influence (social learning + social utility) in equation 6?

Group C vs. Group A. In equation 6, the treatment effect of social influence was in the parameter β2. This parameter compares investors 2 who learned that their peer decided to purchase the asset with investors 2 whose peer purchased the asset but who didn’t learn this information. In Bursztyn et al.’s design, Group A consists of investors 2 whose peers were interested in the asset but who never learned this information. This is the control group in the design of equation 6. In the experimental design for equation 6, there is a combined effect of learning that investor 1 was interested in purchasing the asset and knowing that she indeed holds it. The analogous treatment group in Bursztyn et al. is group C.

Q. 15. Which comparison between these groups will correspond to the treatment effect of social learning without social utility?

Group B vs. Group A. Investors 2 in group B learn that their peers were interested in purchasing the asset, but do not hold it. Thus, by comparing them with investors in group A, it is possible to identify the social learning channel.

Q.16. Which comparison between these groups will correspond to the treatment effect of social utility conditional on social learning?

Group C vs. Group B. Investors 2 in group C learn that their peers were interested in purchasing the asset and were able to acquire it. In contrast, investors 2 in group B learn about the interest of their peers and that they were unsuccessful in acquiring it. Thus, the comparison between these two groups identifies social utility after social learning.
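
The three comparisons (C vs. A, B vs. A, and C vs. B) can be read off a single regression of the outcome on group dummies. The sketch below uses simulated data. The group sizes and take-up rates are hypothetical and are not the results reported by Bursztyn et al. (2014).

```r
# Simulated data; group sizes and take-up rates are hypothetical
set.seed(11)
n_per_group <- 2000
group <- factor(rep(c("A", "B", "C"), each = n_per_group), levels = c("A", "B", "C"))
true_rate <- c(A = 0.30, B = 0.50, C = 0.60)
purchase2 <- rbinom(length(group), 1, true_rate[as.character(group)])

fit <- lm(purchase2 ~ group)   # group A is the omitted reference category

learning  <- unname(coef(fit)["groupB"])  # B vs. A: social learning alone
influence <- unname(coef(fit)["groupC"])  # C vs. A: social learning + social utility
utility   <- influence - learning         # C vs. B: social utility given social learning

print(c(learning = learning, influence = influence, utility = utility))
```

Using group A as the reference category makes each coefficient a direct comparison with the no-information group, and the C vs. B comparison is simply the difference between the two coefficients.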

Q.17. Which comparison between these groups will correspond to the treatment effect of social utility without social learning?

It is not possible to tell with this experimental design. In order to identify this effect, we would need an additional group in which investor 1 is forced to hold the asset and investor 2 learns this information, without learning whether her peer is interested in the asset. The comparison with group A would give us the social utility treatment effect without social learning. However, this is very difficult to achieve as part of the experimental design.

Thanks