Part 1: Paper using Randomized Data: Impact of Class Size on Learning

Based on the seminal paper by Alan Krueger: Krueger (1999) Experimental Estimates of Education Production Functions QJE 114(2): 497-532

a. What is the causal link the paper is trying to reveal?

Answer: In short, the paper tries to reveal the causal impact of class size on students’ academic performance (test scores). In detail, the author puts forward two hypotheses:

The first hypothesis is of greater interest. Krueger uses a large-scale randomized experiment, the Tennessee Student/Teacher Achievement Ratio (STAR) project, conducted in the United States starting in the 1985-1986 school year in 80 schools with 11,600 students. The author has three related goals: probing the sensitivity of the experimental estimates to flaws in the randomized trial, finding an appropriate functional form for the education production function from the experimental design, and using it to estimate and interpret the effect in a non-experimental setup (one with observational data).

b. What would be the ideal experiment to test this causal link?

Answer: The STAR experiment deviated from the ideal in some respects. For example, students were re-randomized because, after the initial assignment, schools faced complaints from parents. The purely hypothetical but correct experimental setup can be told as a time-machine story. Take a child, assign him to a small class, and test him at the end of the year. Then, assuming a time machine exists, take the same child one year back, assign him to a regular class, and test him again. Compare the two scores: because everything else was exactly the same, the difference in scores is the causal effect of class size. It would be fun to go back and forth in time like this, but unfortunately we do not know which generation will enjoy such a scientific invention.

The best we can do with the available techniques is to ensure that there are no deviations from the ideal situation. For this case, the ideal experiment to test this causal link:

c. What is the identification strategy?

Answer: Ceteris paribus is expected to be ensured by randomization in the original experiment.
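To see why randomization is expected to deliver ceteris paribus, here is a minimal simulated sketch. It does not use the STAR data: the sample size, the 5-point true effect, and all variable names are my own assumptions for illustration.

```r
# Simulated illustration only; all numbers are hypothetical, not STAR data.
set.seed(42)
n <- 10000

# Unobserved ability affects test scores in either class type.
ability <- rnorm(n)

# Potential outcomes: the score a student would get in a regular class (y0)
# and in a small class (y1). The assumed true effect is 5 points.
true_effect <- 5
y0 <- 50 + 10 * ability + rnorm(n, sd = 5)
y1 <- y0 + true_effect

# Random assignment: a coin flip that is independent of ability.
small <- rbinom(n, size = 1, prob = 0.5)

# Only one potential outcome is observed for each student.
score <- ifelse(small == 1, y1, y0)

# The difference in group means estimates the average effect of a small class.
est_effect <- mean(score[small == 1]) - mean(score[small == 0])

# Balance check: average ability should be nearly equal across the two groups.
ability_gap <- mean(ability[small == 1]) - mean(ability[small == 0])

print(c(estimate = est_effect, ability_gap = ability_gap))
```

Because the coin flip is unrelated to ability, the two groups are comparable on average, and the simple difference in means should land close to the assumed effect.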

However, because of the non-ideal situations or threats (described in part d), the author used an instrumental variable approach (with 2SLS) to establish comparability.
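The idea behind a 2SLS correction can be sketched on simulated data. I assume here that the instrument is the initial random assignment and the endogenous regressor is the class actually attended; that is my reading for illustration, not a reproduction of the paper's specification, and every number and name below is hypothetical.

```r
# Simulated illustration only; not STAR data and not the paper's specification.
set.seed(1)
n <- 20000
ability <- rnorm(n)

# z: initial random assignment to a small class (the assumed instrument).
z <- rbinom(n, size = 1, prob = 0.5)

# Non-compliance: some high-ability students assigned to regular classes
# end up in small classes anyway (for example, after parental complaints).
switch_in <- (z == 0) & (ability + rnorm(n) > 1.5)
d <- ifelse(switch_in, 1, z)          # class actually attended

true_effect <- 5
score <- 50 + true_effect * d + 10 * ability + rnorm(n, sd = 5)

# Naive OLS on actual class type is biased because switchers have higher ability.
ols <- unname(coef(lm(score ~ d))["d"])

# Manual 2SLS: first stage predicts actual class from the initial assignment,
# second stage regresses the score on the predicted value.
d_hat <- fitted(lm(d ~ z))
tsls <- unname(coef(lm(score ~ d_hat))["d_hat"])

print(c(OLS = ols, TSLS = tsls))
```

The second-stage standard errors from this manual two-step procedure are not valid; a dedicated IV routine would be needed for inference. The sketch is only meant to show the point estimate moving back toward the assumed effect.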

d. What are the assumptions / threats to this identification strategy?

Answer:

With reference to the data, the above identification strategy (randomization) would not reveal a causal effect if:

Part 2: Paper using Twins for Identification: Economic Returns to Schooling

Based on the seminal paper by Orley Ashenfelter and Alan Krueger: Ashenfelter and Krueger (1994) Estimates of the Economic Return to Schooling from a New Sample of Twins AER 84(5): 1157-1173

2.1. Briefly answer these questions:

a. What is the causal link the paper is trying to reveal?

Answer: The authors are interested in the causal link between the level of schooling and economic returns in terms of wages. Specifically, they test whether one more year of school attendance affects future economic returns, i.e., \(H_0: \beta_{schooling} = 0\) against \(H_a: \beta_{schooling} \neq 0\).

b. What would be the ideal experiment to test this causal link?

Answer: Suppose I completed my undergraduate degree, applied for job A in country X, and received an offer of $3000 per month. Suppose I then declined the job, enrolled in a master’s program instead, and completed the first year. I applied again for job A in country X and, everything else constant, was offered $3300 per month in real terms. The difference between the two offers would be the return to that extra year of schooling. This, again, is an ideal but hypothetical experiment.

The authors’ attempt to find the effect of schooling is very interesting. Monozygotic (identical) twins are genetically identical, so there is less chance of bias due to ability and inborn characteristics. We could also use twins for the ideal experiment (although we do not need twins for this study), but the sample should be drawn randomly. The data collected from twins should be nationally representative and should cover the full diversity of the population. Alternatively, we could randomly assign 5 years versus 6 years of schooling to two groups and then observe their economic returns.

c. What is the identification strategy?

Answer:

Although similar questions have been examined in the past, the sample selection in this study is innovative. The authors used questionnaire survey data from twins surveyed at the 16th Annual Twins Day Festival in Twinsburg, the largest twins festival in the U.S.

Since they focus on monozygotic (identical) twins rather than fraternal twins, the estimates implicitly control for the many unobserved traits and abilities that twins share. The identification also relies on the assumption that twins behave identically.

But again, there are some threats (described in part d). The authors used IV, generalized least squares, and first differencing to control for any remaining confounding (to establish comparability).
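The logic of first differencing within twin pairs can be sketched with simulated data. Everything below is hypothetical (the 0.09 return, the strength of the shared family component, the variable names); it only illustrates how a component shared by both twins drops out of the within-pair difference.

```r
# Simulated illustration only; not the Twinsburg data.
set.seed(123)
n_pairs <- 5000
beta <- 0.09                      # assumed true return to a year of schooling

# Shared family/ability component: identical for both twins in a pair,
# and it raises both schooling and wages.
family <- rnorm(n_pairs)

educ1 <- 13 + 1.5 * family + rnorm(n_pairs)
educ2 <- 13 + 1.5 * family + rnorm(n_pairs)
lwage1 <- 1 + beta * educ1 + 0.3 * family + rnorm(n_pairs, sd = 0.4)
lwage2 <- 1 + beta * educ2 + 0.3 * family + rnorm(n_pairs, sd = 0.4)

# Pooled OLS ignores the family component and is biased upward here.
pooled <- data.frame(lwage = c(lwage1, lwage2), educ = c(educ1, educ2))
b_pooled <- unname(coef(lm(lwage ~ educ, data = pooled))["educ"])

# First difference: the family component cancels within each pair.
diffs <- data.frame(wage_diff = lwage2 - lwage1, educ_diff = educ2 - educ1)
b_fd <- unname(coef(lm(wage_diff ~ educ_diff, data = diffs))["educ_diff"])

print(round(c(pooled_OLS = b_pooled, first_difference = b_fd), 3))
```

In this setup the pooled estimate overstates the return, while the first-difference estimate should sit near the assumed value. With real data the direction of the bias is an empirical question, and differencing does not remove twin-specific differences or measurement error.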

d. What are the assumptions / threats to this identification strategy?

Answer: With reference to the data the authors are using, the threats are:

  • Collecting data at a twins festival may lead to a biased sample. For example, twins who do not earn much and are frustrated at not finding a job may not have attended the festival (which is an entertainment event). Also, jobless twins would not fly from Florida to Ohio just to attend the festival, given their job-search struggles.

  • Twins may not behave identically. They might have grown up in completely different places with different families, and their levels of motivation may differ.

  • Twins can have different abilities. Controlling for genetic makeup helps somewhat, but not completely.

2.2. Replication analysis:

a. Load Ashenfelter and Krueger AER 1994 data. Answer:

setwd("~/OneDrive - University of Georgia/4th Sem PhD/Adv Econometric Applications_Filipski/Assignments/HW4")
library(haven)
twinsData <- read_dta("AshenfelterKrueger1994_twins.dta")

b. Reproduce the result from table 3 column 5.

Answer:

#generate the variables for first-differencing: as mentioned on page 1165, regress the intrapair difference in wage rates on the intrapair difference in schooling levels.
twinsData$wage_diff <- twinsData$lwage2 - twinsData$lwage1
twinsData$educ_diff <- twinsData$educ2 - twinsData$educ1

FD_Table3 <- lm(wage_diff ~ educ_diff, data = twinsData)

#for a table with regression results: install.packages("stargazer") 
library(stargazer)
stargazer(FD_Table3,
           type="text",         #type="html" or "latex" did not work for me
           align=TRUE,
           no.space=TRUE,
           keep.stat = c("n","rsq"),
           column.labels=c("First difference (v)"),
           covariate.labels = "Own education",
           title="TABLE 3: LOG WAGE EQUATIONS FOR IDENTICAL TWINS")
## 
## TABLE 3: LOG WAGE EQUATIONS FOR IDENTICAL TWINS
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                        wage_diff         
##                  First difference (v)    
## -----------------------------------------
## Own education          0.092***          
##                         (0.024)          
## Constant                0.079*           
##                         (0.045)          
## -----------------------------------------
## Observations              149            
## R2                       0.092           
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

c. Explain how this coefficient should be interpreted.

Answer: This suggests that, on average, a one-year increase in the within-pair difference in education raises the within-pair difference in log wages by 0.092, i.e., the wage by approximately 9.2 percent. The result is statistically significant at the 1 percent level.

Remember: this is a log-lin model, so we multiply the coefficient estimate by 100 to get an approximate percentage change.
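The arithmetic behind this interpretation can be checked directly from the reported numbers. The sketch below copies the coefficient, standard error, and number of observations from the regression output above; the comparison with the exact percentage change, 100 * (exp(b) - 1), is my own addition.

```r
# Values copied from the first-difference regression output above.
b  <- 0.092   # coefficient on the schooling difference
se <- 0.024   # its standard error
n  <- 149     # number of twin pairs

# Approximate percentage effect used in the text (coefficient times 100).
approx_pct <- 100 * b

# Exact percentage effect implied by a log-lin model.
exact_pct <- 100 * (exp(b) - 1)

# t statistic and two-sided p-value for H0: beta = 0
# (n - 2 degrees of freedom: intercept and slope are estimated).
t_stat  <- b / se
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(c(approx_pct = approx_pct, exact_pct = exact_pct,
              t_stat = t_stat, p_value = p_value), 4))
```

For a coefficient this small the approximate and exact percentages are close, which is why multiplying by 100 is a reasonable shortcut here.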

d. Reproduce the result in table 3 column 1.

Answer: Reference for the code: Oscar Torres-Reyna (2011)

#reshaping data: to run OLS, we need one row per twin, with the family ID and the twin index arranged in rows.
#reshape() below comes from base R (stats), so no extra package needs to be loaded.
#There are probably other ways to do it using melt or gather.
twinsData$wage_diff <- NULL
twinsData$educ_diff <- NULL
twinsData_long <- reshape(twinsData,
                          idvar= c("famid","age"),
                          varying = c("educ1", "educ2","lwage1", "lwage2","male1","male2", "white1","white2"),
                          sep = "",                  #splits e.g. "educ1" into variable "educ" and twin index 1
                          timevar = "twin",          #new column holding the twin index (1 or 2)
                          direction = "long")        #our data is wide currently, need to reshape to long
twinsData_long$ageSQ <- (twinsData_long$age^2)/100

OLS_Table3 <- lm(lwage ~ educ+ age+ ageSQ+ male+ white, data = twinsData_long)
stargazer(OLS_Table3,
           type="text",         #type="html" or "latex" did not work for me
           align=TRUE,
           no.space=TRUE,
           keep.stat = c("n","rsq"),
           column.labels=c("OLS (i)"),
           covariate.labels = "Own education",
           title="TABLE 3: ORDINARY LEAST SQUARES ESTIMATES OF LOG WAGE EQUATIONS FOR IDENTICAL TWINS")
## 
## TABLE 3: ORDINARY LEAST SQUARES ESTIMATES OF LOG WAGE EQUATIONS FOR IDENTICAL TWINS
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                          lwage           
##                         OLS (i)          
## -----------------------------------------
## Own education          0.084***          
##                         (0.014)          
## age                    0.088***          
##                         (0.019)          
## ageSQ                  -0.087***         
##                         (0.023)          
## male                   0.204***          
##                         (0.063)          
## white                  -0.410***         
##                         (0.127)          
## Constant                -0.471           
##                         (0.426)          
## -----------------------------------------
## Observations              298            
## R2                       0.272           
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01
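The wide-to-long reshaping step used for the OLS regression can be illustrated on a toy data set. The three families and their values below are made up; only the column naming pattern (a variable name followed by the twin index) mirrors the twins data.

```r
# Toy wide data: one row per family, twin-specific columns end in 1 or 2.
wide <- data.frame(
  famid  = 1:3,
  age    = c(30, 42, 55),
  educ1  = c(12, 16, 14),
  educ2  = c(13, 16, 12),
  lwage1 = c(2.1, 2.8, 2.5),
  lwage2 = c(2.3, 2.7, 2.4)
)

# Base R reshape(): with sep = "" the names are split into a variable part
# ("educ", "lwage") and a numeric index, which is stored in the timevar column.
long <- reshape(wide,
                idvar     = "famid",
                varying   = c("educ1", "educ2", "lwage1", "lwage2"),
                sep       = "",
                timevar   = "twin",
                direction = "long")

# Sort so that the two twins of each family sit next to each other.
long <- long[order(long$famid, long$twin), ]
print(long)
```

Each family now contributes two rows, one per twin, which is why the number of observations doubles from 149 pairs to 298 individuals in the OLS output.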

e. Explain how the coefficient on education should be interpreted.

Answer: This means that, holding everything else constant, one more year of education is associated with approximately 8.4 percent higher wages on average. The result is statistically significant at the 1 percent level.

Remember: this is a log-lin model, so we multiply the coefficient estimate by 100 to get an approximate percentage change.

f. Explain how the coefficient on the control variables should be interpreted.

Answer: The control variables should be interpreted as follows:

  • The age variables are interesting. The positive coefficient on age (0.088) and the negative coefficient on ageSQ (-0.087, where ageSQ is age squared divided by 100) imply a concave age-wage profile: holding all else constant, wages rise with age at a decreasing rate and start to fall after a turning point. Setting the derivative 0.088 - 2(0.087)(age/100) to zero puts the turning point at roughly age 51. Both coefficients are statistically significant at the 1 percent level.

  • Holding all else constant, on average, men earn approximately 20.4 percent higher wages than their female counterparts, which is statistically significant at the 1 percent level.

  • Holding all else constant, on average, white individuals earn approximately 41 percent lower wages than their non-white counterparts, which is statistically significant at the 1 percent level.

Remember: this is a log-lin model, so we multiply the coefficient estimate by 100. The approximation becomes less accurate for large coefficients such as the one on white.
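The marginal effect of age, the turning point, and the exact percentage effects for the dummy variables can be computed from the reported coefficients. The sketch below uses the rounded values from the OLS output above, so the results are approximate; the helper function names are my own.

```r
# Rounded coefficients copied from the OLS output above.
b_age   <- 0.088
b_ageSQ <- -0.087    # coefficient on ageSQ = age^2 / 100
b_male  <- 0.204
b_white <- -0.410

# Marginal effect of one more year of age on log wage:
# d/d(age) [ b_age * age + b_ageSQ * age^2 / 100 ]
marginal_age <- function(age) b_age + 2 * b_ageSQ * age / 100

# Age at which the marginal effect is zero (peak of the age-wage profile).
turning_point <- -100 * b_age / (2 * b_ageSQ)

# Exact percentage change implied by a dummy coefficient in a log-lin model.
exact_pct <- function(b) 100 * (exp(b) - 1)

print(round(c(turning_point = turning_point,
              effect_at_30  = 100 * marginal_age(30),
              male_exact    = exact_pct(b_male),
              white_exact   = exact_pct(b_white)), 2))
```

The exact figures for the dummies differ noticeably from the coefficient times 100, which is the usual caveat when the coefficient is not close to zero.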