This homework focuses on two seminal papers that have shaped our discipline in thinking about how to demonstrate causality. You have to download and read over both papers. Read carefully enough to get familiar with the key ideas and identification strategies of the papers. Answer the questions below (briefly – a couple of sentences are enough). For the second paper, you also have a simple replication task to perform in R. You can check elements of answers here, in case you wonder if your results look correct or not. As always: your code needs to be visible, and completely reproducible starting from the data source.

Part 1: Paper using randomized data: Impact of Class Size on Learning

Download and go over this seminal paper by Alan Krueger.

Krueger (1999) Experimental Estimates of Education Production Functions QJE 114 (2) : 497-532

1.1. Briefly answer these questions:

a. What is the causal link the paper is trying to reveal?

The paper examines the causal effect of class size on students' standardized test scores from kindergarten through third grade.

b. What would be the ideal experiment to test this causal link?

In the ideal experiment, a sufficiently large sample of children would be randomly assigned to classes of different sizes, so that personal attributes, parental background, and other unobserved characteristics are unrelated to class size and cannot confound its effect on standardized test performance.

c. What is the identification strategy?

The study uses data from the Tennessee Student/Teacher Achievement Ratio experiment (Project STAR), in which more than 10,000 students, starting in kindergarten, were randomly assigned to one of three class types (small classes, regular-size classes, and regular-size classes with a teacher aide) and were meant to remain in the same type for four years, through third grade. Standardized tests were administered at the end of every school year. Because class size was assigned at random on a large scale, comparing average test scores across class types identifies the effect of class size without omitted variable bias.
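
The logic of random assignment can be illustrated with a small simulation. This is only a sketch on made-up data, not Project STAR: the variable names, the assumed true effect of 5 points, and the strength of the confounder are all hypothetical choices for illustration.

```r
set.seed(1)
n <- 20000

# Hypothetical unobserved family background that raises test scores
background  <- rnorm(n)
true_effect <- 5   # assumed test-score gain from being in a small class

# (1) Self-selection: children with a better background are more likely
#     to end up in a small class
small_selected <- as.integer(background + rnorm(n) > 0)
score_selected <- 50 + true_effect * small_selected + 4 * background + rnorm(n, sd = 5)

# (2) Random assignment: class type is independent of background
small_random <- rbinom(n, size = 1, prob = 0.5)
score_random <- 50 + true_effect * small_random + 4 * background + rnorm(n, sd = 5)

# Simple difference in means, estimated by OLS on the class-type dummy
b_selected <- coef(lm(score_selected ~ small_selected))[["small_selected"]]
b_random   <- coef(lm(score_random ~ small_random))[["small_random"]]

round(c(selected = b_selected, random = b_random), 2)
```

Under self-selection the estimate absorbs part of the background effect and overstates the assumed true effect, while under random assignment it should land close to 5.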

d. What are the assumptions / threats to this identification strategy?

Two features of the data weaken the initial randomization. First, students who had been assigned to regular-size classes were randomly re-assigned between classes with and without a teacher aide on entering first grade. Second, approximately one tenth of students switched between class sizes from one grade to the next. Both phenomena mean that the class type a student actually attended can differ from the one initially assigned.

Part 2: Paper using Twins for Identification: Economic Returns to Schooling

Download and go over this seminal paper by Orley Ashenfelter and Alan Krueger. Ashenfelter and Krueger (1994) Estimates of the Economic Return to Schooling from a New Sample of Twins AER 84(5): 1157-1173

2.1. Briefly answer these questions:

a. What is the causal link?

The paper seeks to identify the causal effect of years of schooling on the wage rate.

b. What would be the ideal experiment to test this causal link?

In the ideal experiment, a sufficiently large sample of individuals would be randomly assigned to different levels of schooling, so that personal skills and other unobserved characteristics are unrelated to schooling and cannot confound its effect on the wage rate.

c. What is the identification strategy?

Twins are well suited to this study because the two siblings in a pair share their family background and are genetically very similar. Relating the within-pair difference in wage rates to the within-pair difference in schooling therefore removes the influence of the characteristics that the two twins have in common. Twins who attended the Twinsburg Festival were given a questionnaire derived from the Current Population Survey (CPS) that asked about their educational background, the wage rates of their recent jobs, the wage rates of their twin siblings, and a series of other personal and family characteristics.
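
Why within-pair differencing helps can be sketched with simulated twin pairs. The data below are made up, and the assumed return of 0.09, the `ability` variable, and all parameter values are hypothetical; the point is only that a component shared by both twins drops out of the difference.

```r
set.seed(2)
n_pairs     <- 5000
true_return <- 0.09   # assumed true return to one year of schooling

# Hypothetical family/ability component shared by both twins in a pair
ability <- rnorm(n_pairs)

# Schooling depends on the shared component plus an individual part
educ1 <- 13 + 1.5 * ability + rnorm(n_pairs)
educ2 <- 13 + 1.5 * ability + rnorm(n_pairs)

# Log wages depend on schooling and on the shared component
lwage1 <- 1 + true_return * educ1 + 0.2 * ability + rnorm(n_pairs, sd = 0.3)
lwage2 <- 1 + true_return * educ2 + 0.2 * ability + rnorm(n_pairs, sd = 0.3)

# Pooled cross-section OLS: biased because ability is omitted
b_ols <- coef(lm(c(lwage1, lwage2) ~ c(educ1, educ2)))[[2]]

# Within-pair differences: the shared ability term cancels
d_lwage  <- lwage1 - lwage2
d_educ   <- educ1 - educ2
b_within <- coef(lm(d_lwage ~ d_educ))[["d_educ"]]

round(c(pooled_ols = b_ols, within_pair = b_within), 3)
```

In this setup the pooled estimate is pushed well above 0.09 by the omitted shared component, whereas the within-pair estimate should be close to it.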

d. What are the assumptions / threats to this identification strategy?

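Two types of measurement error were present in the sample: error in an individual's reported schooling level, which shows up as disagreement between the self-report and the report provided by the twin sibling, and error in the reported schooling levels of the twins' parents. The strategy also assumes that, within a pair, differences in schooling are unrelated to differences in ability or other unobserved determinants of wages.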
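
The consequence of measurement error in schooling can be sketched as follows. This is a simulation on made-up data with hypothetical parameter values: it shows that noise in the reported schooling difference pulls the within-pair estimate toward zero, and that a second, independent report of the same quantity can serve as an instrument. The hand-computed ratio of covariances is the textbook single-instrument IV estimator, used here for illustration rather than as a reproduction of the paper's estimates.

```r
set.seed(3)
n_pairs     <- 5000
true_return <- 0.09   # assumed true return to one year of schooling

# True within-pair schooling difference and the resulting wage difference
d_educ_true <- rnorm(n_pairs, sd = 1.5)
d_lwage     <- true_return * d_educ_true + rnorm(n_pairs, sd = 0.4)

# Two independent noisy reports of the same difference
# (hypothetical: one built from self-reports, one from sibling reports)
d_educ_self    <- d_educ_true + rnorm(n_pairs, sd = 1)
d_educ_sibling <- d_educ_true + rnorm(n_pairs, sd = 1)

# OLS on the noisy self-report: attenuated toward zero
b_noisy <- coef(lm(d_lwage ~ d_educ_self))[["d_educ_self"]]

# IV with the sibling-based report as instrument: cov(y, z) / cov(x, z)
b_iv <- cov(d_lwage, d_educ_sibling) / cov(d_educ_self, d_educ_sibling)

round(c(noisy_ols = b_noisy, iv = b_iv), 3)
```

With these assumed variances the attenuation factor is 2.25 / 3.25, so the noisy estimate should sit near 0.06 while the IV estimate should be close to 0.09.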

2.2. Replication Analysis

a. Load Ashenfelter and Krueger AER 1994 data. You can load it directly from my website here. Variable names should be self-explanatory if you read the paper.

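library(haven)      # read_dta() reads Stata .dta files
library(stargazer)  # formatted regression tables
# The .dta file is assumed to be in the working directory
TwinsEdu <- read_dta("AshenfelterKrueger1994_twins.dta")
View(TwinsEdu)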


b. Reproduce the result from table 3 column 5.

# Calculate the within-pair differences in log wage rates and years of schooling.
dif_lwage <- TwinsEdu$lwage1 - TwinsEdu$lwage2
dif_educ <- TwinsEdu$educ1 - TwinsEdu$educ2

# Fixed-effect estimation: regress the intrapair difference in log wage rates
# on the intrapair difference in years of schooling.

Tab3_Col5 <- lm(dif_lwage ~ dif_educ, data = TwinsEdu)

stargazer(Tab3_Col5, type = "text", title = "Table 3 - FIXED-EFFECT ESTIMATION", align = TRUE,  
          dep.var.labels = c("First Difference (v)"), covariate.labels = c("Own education"),
          keep.stat = c("n", "rsq"), omit = c("Constant", "adj.rsq"), 
          table.layout = "=d=t-s=n")
## 
## Table 3 - FIXED-EFFECT ESTIMATION
## =========================================
##                  First Difference (v)    
## =========================================
## Own education          0.092***          
##                         (0.024)          
##                                          
## -----------------------------------------
## Observations              149            
## R2                       0.092           
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01


c. Explain how this coefficient should be interpreted.

The estimated coefficient implies that, within a pair of identical twins, one additional year of schooling is associated with a wage rate that is approximately 9.2% higher. The estimate is statistically significant at the 1% level.
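
The label "fixed-effect estimation" for a regression in differences can be checked on simulated data: with two observations per pair, the first-difference slope (with an intercept) coincides with the slope from a regression that includes a dummy for every pair plus a twin-order dummy. The sketch below uses made-up data and hypothetical variable names, not the twins sample.

```r
set.seed(4)
n_pairs <- 200
famid   <- rep(seq_len(n_pairs), each = 2)
twin    <- rep(1:2, times = n_pairs)

# Hypothetical family effect shared by both twins in a pair
pair_effect <- rep(rnorm(n_pairs), each = 2)
educ  <- 13 + pair_effect + rnorm(2 * n_pairs)
lwage <- 1 + 0.09 * educ + 0.3 * pair_effect + rnorm(2 * n_pairs, sd = 0.3)

# (1) First differences within each pair (twin 1 minus twin 2)
d_lwage <- lwage[twin == 1] - lwage[twin == 2]
d_educ  <- educ[twin == 1] - educ[twin == 2]
b_fd <- coef(lm(d_lwage ~ d_educ))[["d_educ"]]

# (2) Pair fixed effects (one dummy per pair) plus a twin-order dummy
b_fe <- coef(lm(lwage ~ educ + factor(twin) + factor(famid)))[["educ"]]

# The two slopes agree up to floating-point error
all.equal(b_fd, b_fe)
```

The differenced regression is therefore a compact way to obtain the within-pair estimate without estimating one intercept per family.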

d. Reproduce the result in table 3 column 1. You will need to reshape the data first. Hint: I used the reshape command from the reshape2 package. It likes to have a “.” in variable names so I renamed the variables with “.1” and “.2” instead of just “1” and “2” – but you can avoid that by just setting sep="". There are probably other ways to do it using melt or gather.
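
The `sep = ""` route mentioned in the hint can be sketched on a toy data frame. The values below are made up and are not the Ashenfelter-Krueger sample; the sketch uses base R's `reshape()`, which with `sep = ""` splits names such as `educ1` into the variable name `educ` and the time index `1`.

```r
# Toy wide data: one row per family, one column per twin and variable
wide <- data.frame(
  famid  = 1:3,
  educ1  = c(12, 14, 16), educ2  = c(13, 14, 18),
  lwage1 = c(2.1, 2.5, 2.9), lwage2 = c(2.3, 2.4, 3.1)
)

# Wide to long: sep = "" tells reshape() to split "educ1" into "educ" and "1"
long <- reshape(wide, direction = "long",
                varying = c("educ1", "educ2", "lwage1", "lwage2"),
                sep = "",
                idvar = "famid", timevar = "twin")

# One row per twin, sorted by family
long <- long[order(long$famid, long$twin), ]
long
```

The result has one row per twin, with the twin index stored in the `twin` column and a single `educ` and `lwage` column.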

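# reshape() below is base R's stats::reshape(), so no additional package is needed.
# Passing varying as a list pairs each long-format variable with its two wide
# columns explicitly, so the result does not depend on the ordering of the names.
TwinsEdu_org <- reshape(TwinsEdu, direction = "long",
                v.names = c("educ", "lwage", "white", "male"),
                varying = list(c("educ1", "educ2"), c("lwage1", "lwage2"),
                               c("white1", "white2"), c("male1", "male2")),
                times = c("t_1", "t_2"),
                idvar = c("age", "famid"),
                timevar = "twin")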

TwinsEdu_org1 <- TwinsEdu_org[order(TwinsEdu_org$famid), ]
head(TwinsEdu_org1)
## # A tibble: 6 × 7
##   famid   age twin   educ lwage white  male
##   <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1     1  33.3 t_1      16  2.16     1     0
## 2     1  33.3 t_2      16  2.42     1     0
## 3     2  43.6 t_1      12  2.17     1     0
## 4     2  43.6 t_2      19  2.89     1     0
## 5     3  31.0 t_1      12  2.79     1     1
## 6     3  31.0 t_2      12  2.80     1     1

# Create the age-squared regressor (divided by 100) used in the estimation
age_square <- TwinsEdu_org1$age^2/100

# OLS estimation on the stacked sample of twins:

Tab3_Col1 <- lm(lwage ~ educ + age + age_square + male + white, data = TwinsEdu_org1)

stargazer(Tab3_Col1, type = "text", title = "Table 3 - LOG WAGE ESTIMATION FOR IDENTICAL TWINS", align = TRUE,  
          dep.var.labels = c("OLS (i)"), 
          covariate.labels = c("Own education", "Age", "Age Squared (/100)", "Male", "White"),
          keep.stat = c("n", "rsq"), omit = c("Constant", "adj.rsq")) 
## 
## Table 3 - LOG WAGE ESTIMATION FOR IDENTICAL TWINS
## ==============================================
##                        Dependent variable:    
##                    ---------------------------
##                              OLS (i)          
## ----------------------------------------------
## Own education               0.084***          
##                              (0.014)          
##                                               
## Age                         0.088***          
##                              (0.019)          
##                                               
## Age Squared (/100)          -0.087***         
##                              (0.023)          
##                                               
## Male                        0.204***          
##                              (0.063)          
##                                               
## White                       -0.410***         
##                              (0.127)          
##                                               
## ----------------------------------------------
## Observations                   298            
## R2                            0.272           
## ==============================================
## Note:              *p<0.1; **p<0.05; ***p<0.01


e. Explain how the coefficient on education should be interpreted.

The estimated coefficient on education implies that, holding age, gender, and race constant, one additional year of schooling is associated with a wage rate that is approximately 8.4% higher. The estimate is statistically significant at the 1% level.

f. Explain how the coefficient on the control variables should be interpreted.

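The estimated coefficients on the control variables imply that, on average and holding everything else constant, men earn a wage rate about 0.204 log points (roughly 23%) higher than women, and white individuals earn a wage rate about 0.410 log points (roughly 34%) lower than individuals of other ethnic groups. The positive coefficient on age together with the negative coefficient on age squared implies that wages rise with age at a decreasing rate. All of these coefficients are statistically significant at the 1% level.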
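
Because the dependent variable is in logs, coefficients on dummy variables are log-point differences, and for larger values the exact percentage effect, 100 * (exp(b) - 1), differs noticeably from 100 * b. The sketch below applies this standard conversion to the coefficients copied from the column 1 output above, and computes the age at which the fitted age profile peaks; the variable names are introduced here only for illustration.

```r
# Coefficients copied from the Table 3, column 1 output
b_male  <- 0.204
b_white <- -0.410
b_age   <- 0.088
b_age2  <- -0.087   # coefficient on age^2 / 100

# Exact percentage effect of a dummy in a log-linear model
pct_male  <- 100 * (exp(b_male) - 1)
pct_white <- 100 * (exp(b_white) - 1)

# Age at which predicted log wage peaks:
# solve b_age + 2 * b_age2 * age / 100 = 0 for age
peak_age <- -b_age * 100 / (2 * b_age2)

round(c(pct_male = pct_male, pct_white = pct_white, peak_age = peak_age), 1)
```

The conversion gives roughly 23% for the male coefficient and roughly -34% for the white coefficient, and the fitted wage profile peaks at around age 51.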