Part 1: Paper using randomized data: Impact of Class Size on Learning

Download and go over this seminal paper by Alan Krueger.

Krueger (1999), "Experimental Estimates of Education Production Functions," QJE 114(2): 497-532

1.1. Briefly answer these questions:

c. What is the identification strategy?

d. What are the assumptions / threats to this identification strategy? (answer specifically with reference to the data the authors are using)

Part 2: Using Twins for Identification: Economic Returns to Schooling

Download and go over this seminal paper by Orley Ashenfelter and Alan Krueger.

Ashenfelter and Krueger (1994), "Estimates of the Economic Return to Schooling from a New Sample of Twins," AER 84(5): 1157-1173

2.1. Briefly answer these questions:

c. What is the identification strategy?

d. What are the assumptions / threats to this identification strategy? (answer specifically with reference to the data the authors are using)

2.2. Replication analysis

a. Load the Ashenfelter and Krueger (AER 1994) data

You can load it directly from my website here. Variable names are self-explanatory if you read the paper.
famid      age educ1 educ2   lwage1   lwage2 male1 male2 white1 white2
    1 33.25120    16    16 2.161021 2.420368     0     0      1      1
    2 43.57016    12    19 2.169054 2.890372     0     0      1      1
    3 30.96783    12    12 2.791778 2.803360     1     1      1      1
    4 34.63381    14    14 2.824351 2.263366     1     1      1      1
    5 34.97878    15    13 2.032088 3.555348     0     0      1      1
    6 29.33881    14    12 2.708050 2.484907     1     1      1      1

b. Reproduce the result from Table 3, column 5 of the paper

You will need to create the “difference” variables first.
I use the stargazer package to make ok-looking regression result tables. There are other ways.

## 
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                        lwageD           
## ----------------------------------------
## educD                 0.092***          
##                        (0.024)          
##                                         
## ----------------------------------------
## Observations             149            
## Adjusted R2             0.086           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01
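One possible way to build the difference variables and run the regression, sketched on only the six rows printed in part a (so the numbers will not match the full 149-pair sample). Two things here are assumptions rather than the method used for the table above: the differences are taken as twin 1 minus twin 2, and the intercept is dropped because the table reports no constant. The object names twins and fd_fit are just illustrative.

```r
# Six rows copied from the data preview in part a (not the full sample).
twins <- read.table(header = TRUE, text = "
famid age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
1 33.25120 16 16 2.161021 2.420368 0 0 1 1
2 43.57016 12 19 2.169054 2.890372 0 0 1 1
3 30.96783 12 12 2.791778 2.803360 1 1 1 1
4 34.63381 14 14 2.824351 2.263366 1 1 1 1
5 34.97878 15 13 2.032088 3.555348 0 0 1 1
6 29.33881 14 12 2.708050 2.484907 1 1 1 1
")

# Within-pair differences. Assumed convention: twin 1 minus twin 2.
twins$educD  <- twins$educ1  - twins$educ2
twins$lwageD <- twins$lwage1 - twins$lwage2

# Assumption: no intercept, since the table above lists no constant.
fd_fit <- lm(lwageD ~ 0 + educD, data = twins)

# Coefficient, standard error, t value and p value for educD.
print(coef(summary(fd_fit)))
```

Passing fd_fit to stargazer (with type = "text") should give a table laid out like the one above.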

c. Explain how this coefficient should be interpreted.

d. Reproduce the result in Table 3, column 1

You will need to reshape the data first.
Hint: I used the reshape() function, which ships with base R (the stats package) rather than with reshape2. It required me to rename the variables with a dot, like “educ.1” instead of just “educ1”. Then I just ran reshape(data, direction="long", varying =..., timevar = ...).

There are probably other ways to do it, for example with melt (reshape2) or gather (tidyr).
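A minimal sketch of the wide-to-long step with base R's reshape(), again using only the six rows printed in part a. It is one way to do it, not necessarily the exact call behind the output below: varying is given as an explicit list of column pairs, so the dot renaming from the hint is not needed here, and idvar = "famid" is my choice (the names twins and long are illustrative). Dividing age squared by 100 is inferred from the agesq values shown in the reshaped data.

```r
# Six rows copied from the data preview in part a (not the full sample).
twins <- read.table(header = TRUE, text = "
famid age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
1 33.25120 16 16 2.161021 2.420368 0 0 1 1
2 43.57016 12 19 2.169054 2.890372 0 0 1 1
3 30.96783 12 12 2.791778 2.803360 1 1 1 1
4 34.63381 14 14 2.824351 2.263366 1 1 1 1
5 34.97878 15 13 2.032088 3.555348 0 0 1 1
6 29.33881 14 12 2.708050 2.484907 1 1 1 1
")

# One row per twin: each pair of wide columns becomes a single long column.
long <- reshape(
  twins,
  direction = "long",
  varying = list(c("educ1", "educ2"), c("lwage1", "lwage2"),
                 c("male1", "male2"), c("white1", "white2")),
  v.names = c("educ", "lwage", "male", "white"),
  timevar = "twin",   # new column holding the twin number
  times   = 1:2,
  idvar   = "famid"   # assumption: the family id identifies each wide row
)

# reshape() stacks all twin-1 rows before the twin-2 rows; sort by family.
long <- long[order(long$famid, long$twin), ]

# Age squared, scaled by 100 (inferred from the agesq values in the output).
long$agesq <- long$age^2 / 100

print(long)
```

Unlike the output below, this version has no separate id column, because idvar is supplied explicitly.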

Reshaped data (agesq is age squared divided by 100):
    famid      age twin educ    lwage male white id     agesq
1.1     1 33.25120    1   16 2.161021    0     1  1 11.056424
1.2     1 33.25120    2   16 2.420368    0     1  1 11.056424
2.1     2 43.57016    1   12 2.169054    0     1  2 18.983588
2.2     2 43.57016    2   19 2.890372    0     1  2 18.983588
3.1     3 30.96783    1   12 2.791778    1     1  3  9.590065
3.2     3 30.96783    2   12 2.803360    1     1  3  9.590065

Regression result matches the paper exactly:

## 
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         lwage           
## ----------------------------------------
## educ                  0.084***          
##                        (0.014)          
##                                         
## age                   0.088***          
##                        (0.019)          
##                                         
## agesq                 -0.087***         
##                        (0.023)          
##                                         
## male                  0.204***          
##                        (0.063)          
##                                         
## white                 -0.410***         
##                        (0.127)          
##                                         
## ----------------------------------------
## Observations             298            
## Adjusted R2             0.260           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01
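For reference, a hedged sketch of a pooled regression with the same right-hand-side variables as the table above, run on the six families printed in part a rather than the full sample. Here the long data are built by stacking the two twins with rbind() instead of reshape(), which is just another route to the same layout; the names twins, long and pooled_fit are illustrative. Because all six printed families have white equal to 1, lm() cannot estimate that coefficient on this subset and reports NA for it.

```r
# Six rows copied from the data preview in part a (not the full sample).
twins <- read.table(header = TRUE, text = "
famid age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
1 33.25120 16 16 2.161021 2.420368 0 0 1 1
2 43.57016 12 19 2.169054 2.890372 0 0 1 1
3 30.96783 12 12 2.791778 2.803360 1 1 1 1
4 34.63381 14 14 2.824351 2.263366 1 1 1 1
5 34.97878 15 13 2.032088 3.555348 0 0 1 1
6 29.33881 14 12 2.708050 2.484907 1 1 1 1
")

# Stack twin 1 on top of twin 2 so that each row is one individual.
long <- rbind(
  data.frame(famid = twins$famid, age = twins$age, twin = 1,
             educ = twins$educ1, lwage = twins$lwage1,
             male = twins$male1, white = twins$white1),
  data.frame(famid = twins$famid, age = twins$age, twin = 2,
             educ = twins$educ2, lwage = twins$lwage2,
             male = twins$male2, white = twins$white2)
)

# Age squared, scaled by 100 as in the reshaped data.
long$agesq <- long$age^2 / 100

# Pooled OLS of log wage on education and the controls.
pooled_fit <- lm(lwage ~ educ + age + agesq + male + white, data = long)

# white is constant in these six families, so its coefficient is NA here.
print(coef(pooled_fit))
```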

e. Explain how the coefficient on education should be interpreted.

f. Explain how the coefficients on the control variables should be interpreted.