Basics of Identification

Part 2: Using Twins for Identification: Economic Returns to Schooling

Assignment based on the seminal paper by Orley Ashenfelter and Alan Krueger: Ashenfelter, O. and Krueger, A. (1994). "Estimates of the Economic Return to Schooling from a New Sample of Twins." AER 84(5): 1157-1173.

2.1. Briefly answer these questions:

a. What is the causal link the paper is trying to reveal?

The paper tries to reveal the causal relationship between education and wages: does an additional year of schooling increase a person's wage?

b. What would be the ideal experiment to test this causal link?

The ideal experiment would randomly assign different amounts of schooling to otherwise comparable individuals and then compare their wages. Short of that, the sample should at least be drawn randomly from across the USA rather than collected at a twins fair. Random sampling would have guarded against biases due to:

Self-selection - Since the fair celebrates twins and their similarity, twins who are not similar in education, job or financial standing may have chosen not to attend.
Location effects - Since the biggest twins festival happens at the same place every year, it may be that twins in that area receive specialized treatment during their schooling.

c. What is the identification strategy?

The authors control for unobservables by comparing identical twins, for whom family background and genetic endowment are held constant within each pair. Measurement error is addressed by interviewing the twins separately and asking each one about the other's education and wage.
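The logic of differencing within twin pairs can be illustrated with a small simulation. This is a sketch on made-up data, not the paper's data: the number of pairs, the true return `beta`, and the way `ability` enters schooling and wages are all assumptions chosen for illustration.

```r
# Simulated twin pairs (all numbers here are illustrative assumptions)
set.seed(1)
n    <- 5000   # number of twin pairs
beta <- 0.08   # assumed true return to one year of schooling

ability <- rnorm(n)                      # unobserved, shared within a pair
educ1   <- 12 + 2 * ability + rnorm(n)   # schooling of twin 1
educ2   <- 12 + 2 * ability + rnorm(n)   # schooling of twin 2
lwage1  <- 1 + beta * educ1 + 0.3 * ability + rnorm(n, sd = 0.3)
lwage2  <- 1 + beta * educ2 + 0.3 * ability + rnorm(n, sd = 0.3)

# Pooled OLS omits ability, so the schooling coefficient absorbs part of it
pooled <- data.frame(lwage = c(lwage1, lwage2), educ = c(educ1, educ2))
b_ols  <- coef(lm(lwage ~ educ, data = pooled))[["educ"]]

# Differencing within pairs removes the shared ability term
d_lwage <- lwage1 - lwage2
d_educ  <- educ1 - educ2
b_fd    <- coef(lm(d_lwage ~ d_educ))[["d_educ"]]

c(pooled_ols = b_ols, first_difference = b_fd)
```

In this setup the pooled estimate lands well above `beta`, while the first-difference estimate stays close to it, because `ability` cancels when one twin's wage equation is subtracted from the other's.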

d. What are the assumptions / threats to this identification strategy? (answer specifically with reference to the data the authors are using)

Assumptions are:

Ability, as an unobservable, is accounted for by comparing twins, but will (motivation) still differs from person to person: some are more inclined towards education than others.
Schooling levels within twin pairs are often similar, so the estimate relies on the subset of pairs whose schooling differs.
Everything else is constant within a twin pair (e.g. ability and family background).
The data collection method may introduce bias: since the fair celebrates twins and their similarity, twins who believe they are not similar in education, job or financial standing may have chosen not to attend.
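Because a within-pair comparison can only learn from pairs whose schooling differs, it may help to check how many such pairs there are. The sketch below uses only the six rows printed by head(df,6) in section 2.2, typed in by hand as `twins_head` (a name introduced here for illustration); on the full data the same lines would be applied to `df`.

```r
# The six twin pairs shown by head(df, 6), education columns only
twins_head <- data.frame(
  famid = 1:6,
  educ1 = c(16, 12, 12, 14, 15, 14),
  educ2 = c(16, 19, 12, 14, 13, 12)
)

# Within-pair schooling difference
d_educ <- twins_head$educ1 - twins_head$educ2

share_same    <- mean(d_educ == 0)  # share of pairs with identical schooling
n_informative <- sum(d_educ != 0)   # pairs that contribute within-pair variation

c(share_same = share_same, n_informative = n_informative)
```

In these six rows, three pairs report identical schooling, so only the other three carry information about the within-pair return. Six rows say nothing reliable about the full sample; the point is only the check itself.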

2.2. Replication analysis from Ashenfelter and Krueger AER 1994

a. Load the data from this website. Variable names are self-explanatory

library(haven)  # provides read_dta() for Stata .dta files

df <- read_dta("data/AshenfelterKrueger1994_twins.dta")
head(df,6)

famid	age	educ1	educ2	lwage1	lwage2	male1	male2	white1	white2
1	33.3	16	16	2.16	2.42	0	0	1	1
2	43.6	12	19	2.17	2.89	0	0	1	1
3	31	12	12	2.79	2.8	1	1	1	1
4	34.6	14	14	2.82	2.26	1	1	1	1
5	35	15	13	2.03	3.56	0	0	1	1
6	29.3	14	12	2.71	2.48	1	1	1	1

b. Reproduce the result from table 3 column 5 of the paper

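library(stargazer)  # regression tables

# Within-pair differences (twin 1 minus twin 2)
edu <- df$educ1-df$educ2
lwage <- df$lwage1-df$lwage2

# First-difference regression; edu and lwage are not columns of df,
# so lm() picks them up from the calling environment
model1 <- lm(lwage ~ edu, data=df)

stargazer(model1, header=FALSE, type='text', font.size="small",
            omit.stat=c("adj.rsq", "ser", "f"), title = "Table 3: First Difference estimates of log wage for identical twins",
            covariate.labels= c("Education","Constant"))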

## 
## Table 3: First Difference estimates of log wage for identical twins
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         lwage           
## ----------------------------------------
## Education             0.092***          
##                        (0.024)          
##                                         
## Constant               -0.079*          
##                        (0.045)          
##                                         
## ----------------------------------------
## Observations             149            
## R2                      0.092           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

c. Explain how this coefficient should be interpreted.

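Within a pair of identical twins, one additional year of schooling relative to one's twin is associated with a wage that is about 0.092 log points (roughly 9.2%) higher. Because the twins share family background and genes, this within-pair estimate is not driven by those shared unobservables.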
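Reading a log-wage coefficient as a percentage is an approximation that works well for small values. A minimal sketch of the exact conversion, using the rounded coefficient 0.092 from the table above (the variable names are introduced here for illustration):

```r
b_fd <- 0.092                       # first-difference coefficient, rounded as printed

# Exact percentage change in wage implied by a gap of b_fd log points
pct_exact <- 100 * (exp(b_fd) - 1)

round(pct_exact, 1)
```

The exact figure is slightly above the 9.2% approximation, and the gap between the two grows for larger coefficients.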

d. Reproduce the result in table 3 column 1. You will need to reshape the data first.

First reshape the data to the long form.

df_new <- reshape(df, idvar = c("famid","age"), varying = list(c(3,4),c(5,6),c(7,8),c(9,10)), v.names = c("edu","lwage","male","white"), direction = "long")
head(df_new,6)

famid	age	time	edu	lwage	male	white
1	33.3	1	16	2.16	0	1
2	43.6	1	12	2.17	0	1
3	31	1	12	2.79	1	1
4	34.6	1	14	2.82	1	1
5	35	1	15	2.03	0	1
6	29.3	1	14	2.71	1	1
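The same stacking can be written out by hand, which makes it easier to see what the long form contains. This is a sketch on the six rows printed by head(df,6) above, typed in as `wide`; `wide`, `long` and the helper `stack_twin()` are names introduced here and are not part of the original analysis.

```r
# The six twin pairs shown by head(df, 6)
wide <- data.frame(
  famid  = 1:6,
  age    = c(33.3, 43.6, 31, 34.6, 35, 29.3),
  educ1  = c(16, 12, 12, 14, 15, 14),
  educ2  = c(16, 19, 12, 14, 13, 12),
  lwage1 = c(2.16, 2.17, 2.79, 2.82, 2.03, 2.71),
  lwage2 = c(2.42, 2.89, 2.80, 2.26, 3.56, 2.48),
  male1  = c(0, 0, 1, 1, 0, 1),
  male2  = c(0, 0, 1, 1, 0, 1),
  white1 = rep(1, 6),
  white2 = rep(1, 6)
)

# Pull out the columns of twin k (k = 1 or 2) under common names
stack_twin <- function(d, k) {
  data.frame(
    famid = d$famid,
    age   = d$age,
    time  = k,
    edu   = d[[paste0("educ", k)]],
    lwage = d[[paste0("lwage", k)]],
    male  = d[[paste0("male", k)]],
    white = d[[paste0("white", k)]]
  )
}

# One row per twin: twin 1 rows first, then twin 2 rows
long <- rbind(stack_twin(wide, 1), stack_twin(wide, 2))

# Each family should appear exactly twice
stopifnot(nrow(long) == 2 * nrow(wide), all(table(long$famid) == 2))
```

The column `time` plays the same role as in the reshape() output: it records which twin of the pair a row belongs to.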

Let’s reproduce the result from table 3 column 1 of the paper using the new dataframe

df_new$age_sq <- ((df_new$age^2)/100)

model2 <- lm(lwage ~ edu + age + age_sq + male + white, data=df_new)

stargazer(model2, header=FALSE, type='text', font.size="small",
            omit.stat=c("adj.rsq", "ser", "f"), title = "Table 3: Ordinary Least Square (OLS) estimates of log wage for identical twins",
            covariate.labels= c("Own education","Age","Age Squared (/100)","Male","White"))

## 
## Table 3: Ordinary Least Square (OLS) estimates of log wage for identical twins
## ==============================================
##                        Dependent variable:    
##                    ---------------------------
##                               lwage           
## ----------------------------------------------
## Own education               0.084***          
##                              (0.014)          
##                                               
## Age                         0.088***          
##                              (0.019)          
##                                               
## Age Squared (/100)          -0.087***         
##                              (0.023)          
##                                               
## Male                        0.204***          
##                              (0.063)          
##                                               
## White                       -0.410***         
##                              (0.127)          
##                                               
## Constant                     -0.471           
##                              (0.426)          
##                                               
## ----------------------------------------------
## Observations                   298            
## R2                            0.272           
## ==============================================
## Note:              *p<0.1; **p<0.05; ***p<0.01

Table with both models:

library(huxtable)  # huxreg(), set_caption(), add_footnote()

# Columns of the combined table, labelled as in the paper
model_tbl3 <- list("OLS (i)" = model2, "First difference (v)" = model1)

# Show only the listed coefficients; the intercepts are left out
tbl3 <- huxreg(model_tbl3, number_format = 3,
       coefs = c("Own education"="edu","Age"="age","Age Squared (/100)"="age_sq", "Male"="male","White"="white"),
       statistics = c("Sample size:" = "nobs", "R2" = "r.squared"))
tbl3 <- set_caption(tbl3, "Table 3: Ordinary Least Square (OLS) and First difference estimates of log wage for identical twins")

add_footnote(tbl3,"Each equation also includes an intercept term. Number in parentheses are estimated standard errors.")

Table 3: Ordinary Least Square (OLS) and First difference estimates of log wage for identical twins
	OLS (i)	First difference (v)
Own education	0.084 ***	0.092 ***
	(0.014)	(0.024)
Age	0.088 ***
	(0.019)
Age Squared (/100)	-0.087 ***
	(0.023)
Male	0.204 **
	(0.063)
White	-0.410 **
	(0.127)
Sample size:	298	149
R2	0.272	0.092
*** p < 0.001; ** p < 0.01; * p < 0.05.
Each equation also includes an intercept term. Number in parentheses are estimated standard errors.

e. Explain how the coefficient on education should be interpreted.

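In the stacked (pooled OLS) data, an additional year of schooling is associated with about 8.4% higher wages, holding age, sex and race constant, compared with 9.2% in the first-difference estimate. The within-twin estimate is not smaller than the OLS estimate, so this comparison gives no indication that omitted ability inflates the OLS return.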

f. Explain how the coefficient on the control variables should be interpreted.

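Wages rise with age, but the negative coefficient on Age Squared means the increase slows and, past a certain age, wages start to decline. Male twins earn about 0.204 log points more than female twins, holding the other variables constant. The coefficient on White is -0.410: white twins in this sample earn about 0.41 log points less than non-white twins with the same schooling, age and sex.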
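The control coefficients can be turned into more tangible numbers. The sketch below uses the rounded coefficients from the OLS table above, so the results are approximate; the variable names and the helper `pct()` are introduced here for illustration.

```r
# Rounded coefficients as printed in the OLS column
b_age    <- 0.088
b_age_sq <- -0.087   # coefficient on age^2 / 100
b_male   <- 0.204
b_white  <- -0.410

# lwage = ... + b_age * age + b_age_sq * age^2 / 100
# The slope in age, b_age + 2 * b_age_sq * age / 100, is zero at:
turning_age <- -100 * b_age / (2 * b_age_sq)

# Exact percentage difference implied by a dummy coefficient in a log-wage model
pct <- function(b) 100 * (exp(b) - 1)

round(c(turning_age = turning_age,
        male_pct    = pct(b_male),
        white_pct   = pct(b_white)), 1)
```

With these rounded inputs the age profile peaks at roughly 50 years, the male coefficient corresponds to about 23% higher wages, and the White coefficient to about 34% lower wages, which is noticeably less than the 41% that a direct reading of the coefficient would suggest.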

Cliff

07/02/2022
