Basics of Identification

Part 1: Paper using randomized data: Impact of Class Size on Learning

Krueger (1999) Experimental Estimates of Education Production Functions QJE 114(2): 497-532

a. What is the causal link the paper is trying to reveal?

Krueger (1999) estimates the effect of class size on student performance, measured by test scores.

b. What would be the ideal experiment to test this causal link?

The author argues that most past studies rely on a value-added specification, which points to the need for an appropriate model of student learning. To test this causal effect, the ideal experiment would randomly assign teachers and students to classes of different sizes across schools and test student performance at the end of each school year.

c. What is the identification strategy?

The identification strategy relies on random assignment of students within schools: each school was required to have at least one class of each type (small, regular with aide, and regular without aide). Because randomization was done within schools, the independence between class-size assignment and other variables holds only within schools.
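One way to write this within-school comparison is as a regression with school fixed effects. The notation below is an illustrative sketch rather than the paper's exact specification: \(Y_{ics}\) is the test score of student \(i\) in class \(c\) at school \(s\), \(\mathit{SMALL}_{cs}\) and \(\mathit{AIDE}_{cs}\) are assumed dummies for small classes and regular classes with an aide, and \(\alpha_s\) is a school effect.

\[
Y_{ics} = \beta_0 + \beta_1 \mathit{SMALL}_{cs} + \beta_2 \mathit{AIDE}_{cs} + \alpha_s + \varepsilon_{ics}
\]

With \(\alpha_s\) included, \(\beta_1\) and \(\beta_2\) are identified only from differences between class types inside the same school, which is where random assignment guarantees that class type is independent of \(\varepsilon_{ics}\).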

d. What are the assumptions / threats to this identification strategy?

Krueger (1999) made several assumptions and deviated from the ideal experimental design:

Students were randomly reassigned between regular-size classes (with and without full-time aides) at the beginning of first grade, while students in small classes continued on in small classes, often with the same set of classmates (re-randomization).
Roughly 10% of students were switched between small and regular-size classes because of behavioral problems or parental complaints (nonrandom transitions).

To address this problem, and the variability of class size within a given assignment type, some of the analysis uses initial random assignment as an instrumental variable for actual class size. The paper also addresses the limitation that students and their families relocated during the school year.
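The instrumental-variable idea can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the STAR data or the paper's code: the variable names, the assumed effect of -0.5 points per additional pupil, and the rule by which higher-ability students move into small classes are all assumptions chosen for illustration. Two-stage least squares is done by hand with `lm` so that no extra packages are needed.

```r
# Simulated illustration only; no number here comes from Project STAR.
set.seed(1)
n <- 5000
school <- factor(sample(1:50, n, replace = TRUE))   # hypothetical school IDs
assigned_small <- rbinom(n, 1, 0.3)                 # initial random assignment
ability <- rnorm(n)                                 # unobserved by the analyst

# Nonrandom transitions: some students assigned to regular classes move to
# small ones, and higher-ability students are more likely to move.
moved <- assigned_small == 0 & rbinom(n, 1, plogis(-2 + 1.5 * ability)) == 1
actual_small <- as.integer(assigned_small == 1 | moved)

# Actual class size varies within each class type.
class_size <- ifelse(actual_small == 1, 15, 22) + sample(-2:2, n, replace = TRUE)

true_effect <- -0.5                                 # assumed effect per extra pupil
score <- 50 + true_effect * class_size + 5 * ability + rnorm(n, sd = 5)

# Naive OLS on actual class size is contaminated by the nonrandom moves.
ols_est <- coef(lm(score ~ class_size + school))["class_size"]

# Manual two-stage least squares with school effects in both stages:
# stage 1 predicts actual class size from initial assignment,
# stage 2 regresses the score on that prediction.
size_hat <- fitted(lm(class_size ~ assigned_small + school))
iv_est <- coef(lm(score ~ size_hat + school))["size_hat"]

print(c(OLS = unname(ols_est), IV = unname(iv_est)))
```

In this setup the OLS estimate overstates the benefit of smaller classes because able students sort into them, while the IV estimate should land near the assumed value. The standard errors printed by the second `lm` call are not valid 2SLS standard errors; a dedicated IV routine would be needed for inference.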

Part 2: Paper using Twins for Identification: Economic Returns to Schooling

Ashenfelter and Krueger (1994) Estimates of the Economic Return to Schooling from a New Sample of Twins AER 84(5): 1157-1173

a. What is the causal link the paper is trying to reveal?

Ashenfelter and Krueger (1994) estimate the returns to schooling by contrasting the wage rates of identical twins with different schooling levels.

b. What would be the ideal experiment to test this causal link?

The ideal experiment would randomly assign subjects to different schooling levels, so that all other differences are balanced across groups and differences in wages can be attributed to differences in schooling.

c. What is the identification strategy?

Ashenfelter and Krueger (1994) control for other unobservable factors by assuming that they are identical within a pair of twins, so that within-pair differences in schooling identify the causal effect of schooling on wages.
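The logic can be written out as a short derivation. The notation is an illustrative sketch: \(y_{1i}\) and \(y_{2i}\) are the log wages of the two twins in family \(i\), \(S_{1i}\) and \(S_{2i}\) their schooling, \(X_i\) the characteristics shared by both twins (such as age), and \(\mu_i\) the unobserved family component assumed identical within the pair.

\[
\begin{aligned}
y_{1i} &= \alpha X_i + \beta S_{1i} + \mu_i + \varepsilon_{1i} \\
y_{2i} &= \alpha X_i + \beta S_{2i} + \mu_i + \varepsilon_{2i} \\
y_{1i} - y_{2i} &= \beta \,(S_{1i} - S_{2i}) + (\varepsilon_{1i} - \varepsilon_{2i})
\end{aligned}
\]

Differencing removes both \(\alpha X_i\) and \(\mu_i\), so \(\beta\) can be estimated from a regression of the wage difference on the schooling difference without observing \(\mu_i\).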

d. What are the assumptions / threats to this identification strategy?

Measurement error in schooling, which was not addressed in past studies, could be a threat to identification; this study, however, explicitly accounts for errors in the measurement of schooling. A twin's schooling level may also be associated with family factors (for example, twins who are raised by individual parents) that have their own effect on wages.
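Why measurement error matters more in a twins design can be seen in a small simulation. This is a sketch on made-up data under assumed parameter values (a true return of 0.10, a shared family component, and classical reporting error with standard deviation 0.7); none of it comes from the paper's sample. In this setup the variance of the true schooling difference is 2 and the variance of the reporting-error difference is 0.98, so the expected attenuation factor is 2/2.98, about 0.67.

```r
# Simulated illustration only; all parameter values are assumptions.
set.seed(42)
n <- 20000
beta <- 0.10                                  # assumed true return to schooling
family <- rnorm(n)                            # component shared by both twins

# True schooling: strongly correlated within a twin pair
s1 <- 14 + 2 * family + rnorm(n)
s2 <- 14 + 2 * family + rnorm(n)

# Log wages depend on schooling and on the shared family component
lw1 <- 1 + beta * s1 + 0.3 * family + rnorm(n, sd = 0.5)
lw2 <- 1 + beta * s2 + 0.3 * family + rnorm(n, sd = 0.5)

# Reported schooling = true schooling + classical measurement error
r1 <- s1 + rnorm(n, sd = 0.7)
r2 <- s2 + rnorm(n, sd = 0.7)

dw <- lw1 - lw2
ds_true <- s1 - s2
ds_reported <- r1 - r2

# First-difference estimates with true and with mismeasured schooling
fd_true <- coef(lm(dw ~ ds_true))["ds_true"]
fd_noisy <- coef(lm(dw ~ ds_reported))["ds_reported"]

print(c(true_schooling = unname(fd_true), reported_schooling = unname(fd_noisy)))
```

Differencing removes most of the true variation in schooling but none of the reporting noise, so the estimate based on reported schooling is pulled toward zero even though the family component has been eliminated.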

Replication Analysis

Reproduce the result from table 3 column 5

# Load STATA file using the foreign package, make table using stargazer package, and melt data using reshape package
library(foreign)
library(stargazer)
library(reshape)

# Import dta data
my_data <- read.dta("AshenfelterKrueger1994_twins.dta")
head(my_data)

##   famid      age educ1 educ2   lwage1   lwage2 male1 male2 white1 white2
## 1     1 33.25120    16    16 2.161021 2.420368     0     0      1      1
## 2     2 43.57016    12    19 2.169054 2.890372     0     0      1      1
## 3     3 30.96783    12    12 2.791778 2.803360     1     1      1      1
## 4     4 34.63381    14    14 2.824351 2.263366     1     1      1      1
## 5     5 34.97878    15    13 2.032088 3.555348     0     0      1      1
## 6     6 29.33881    14    12 2.708050 2.484907     1     1      1      1

# Create difference variable for lwage and education
my_data$wage_diff <- my_data$lwage1 - my_data$lwage2
my_data$educ_diff <- my_data$educ1 - my_data$educ2

# Run the first difference model
mod <- lm(wage_diff ~ educ_diff, data = my_data)

# Create a table with stargazer package
stargazer(mod, type = "text", title = "TABLE 3", align = TRUE, keep.stat = c("n","rsq"),
          dep.var.labels = c("First difference"), covariate.labels = c("Own education"),
          omit = c("Constant"))               # Display sample size and R-squared and remove constant

## 
## TABLE 3
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                    First difference      
## -----------------------------------------
## Own education          0.092***          
##                         (0.024)          
##                                          
## -----------------------------------------
## Observations              149            
## R2                       0.092           
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

Interpretation: The estimate is \(\hat{\beta} = 0.092\), which means the wage increases by about 9.2% for each additional year of schooling. The estimate is statistically significant at the 1% significance level.

Reproduce the result from table 3 column 1

# reshape the data (make it long using melt command from the reshape package)
wage <- melt(cbind(my_data$lwage1, my_data$lwage2))
educ <- melt(cbind(my_data$educ1, my_data$educ2))
male <- melt(cbind(my_data$male1, my_data$male2))
white <- melt(cbind(my_data$white1, my_data$white2))
age <- melt(cbind(my_data$age, my_data$age))

# create a new dataset by combining these variables and make it a data frame
my_newdata <- data.frame(cbind(wage[,3], educ[,3], male[,3], white[,3], age[,3]))

# Give variable names to the data frame
colnames(my_newdata) <- c("wage", "educ", "male", "white", "age")

# Then, create new variable age squared
my_newdata$agesq <- ((my_newdata$age)^2) / 100

# Run the model for this new dataset
mod1 <- lm(wage ~ educ + male + white + age + agesq, data=my_newdata)

# Create a table with stargazer package
stargazer(mod1, type = "text", title = "TABLE 3", align = TRUE, keep.stat = c("n","rsq"),
          dep.var.labels = c("OLS"),
          covariate.labels = c("Own education", "Male", "White", "Age", "Age squared / 100"),
          omit = c("Constant"))               # Display sample size and R-squared and remove constant

## 
## TABLE 3
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                               OLS            
## ---------------------------------------------
## Own education              0.084***          
##                             (0.014)          
##                                              
## Male                       0.204***          
##                             (0.063)          
##                                              
## White                      -0.410***         
##                             (0.127)          
##                                              
## Age                        0.088***          
##                             (0.019)          
##                                              
## Age squared / 100          -0.087***         
##                             (0.023)          
##                                              
## ---------------------------------------------
## Observations                  298            
## R2                           0.272           
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

Coefficient on education:
The coefficient on education is 0.084, which means the wage increases by about 8.4% on average for each additional year of schooling. The estimate is statistically significant at the 1% significance level.

Coefficient on other control variables:
The coefficient on male is 0.204, which means the wage of male twins is 22.63% higher than that of female twins on average. The estimate is statistically significant at the 1% significance level.

The coefficient on white is -0.410, which means the wage of white twins is 33.63% lower than that of non-white twins on average. The estimate is statistically significant at the 1% significance level.
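The percentages for the dummy variables come from converting a log-point coefficient \(b\) into an exact percentage change, \(100(e^{b} - 1)\), rather than reading the coefficient directly as a percentage. A minimal sketch of that arithmetic, with the coefficients typed in from the table above (the helper name `pct_effect` is introduced here for illustration only):

```r
# Coefficients as reported in the OLS table above
b_male  <- 0.204
b_white <- -0.410

# Exact percentage change implied by a coefficient in a log-wage regression
pct_effect <- function(b) 100 * (exp(b) - 1)

round(pct_effect(b_male), 2)    # 22.63
round(pct_effect(b_white), 2)   # -33.63
```

For small coefficients, such as the one on education, \(100\,b\) and \(100(e^{b} - 1)\) are close, which is why the education coefficient is read directly as a percentage.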

The coefficient on age is 0.088 and the coefficient on agesq is -0.087. So the marginal effect of age on wage, in percent, is 100(0.088) + 2(-0.087)age. This means that at age 40, the wage increases by 1.84% for an additional year of age. Both age coefficients are statistically significant at the 1% significance level.
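The age calculation can be checked with a few lines of R. This sketch types in the two coefficients from the table above and accounts for the fact that agesq was defined as age squared divided by 100; the function name `age_effect_pct` and the turning-point calculation are additions for illustration.

```r
# Coefficients as reported in the OLS table above
b_age   <- 0.088
b_agesq <- -0.087    # coefficient on age^2 / 100

# Marginal effect of one more year of age on the wage, in percent:
# d(log wage)/d(age) = b_age + 2 * b_agesq * age / 100
age_effect_pct <- function(age) 100 * (b_age + 2 * b_agesq * age / 100)

age_effect_pct(40)   # 1.84

# Age at which the estimated age-wage profile stops rising
peak_age <- -100 * b_age / (2 * b_agesq)
round(peak_age, 2)   # 50.57
```

The marginal effect declines linearly with age and reaches zero at roughly age 50.6 under these point estimates.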

Creation

2/14/2021
