Part 1: Paper using randomized data: Impact of Class Size on Learning

Download and go over this seminal paper by Alan Krueger. Krueger (1999) Experimental Estimates of Education Production Functions QJE 114 (2) : 497-532

1.1. Briefly answer these questions:

c. What is the identification strategy?

The paper uses a randomized controlled trial (RCT) conducted in the United States: the Tennessee Student/Teacher Achievement Ratio experiment, known as Project STAR. The experiment randomly assigned students and teachers to three types of class: “small classes (13-17 students per teacher), regular-size classes (22-25 students), and regular/aide classes (22-25 students) which also included a full-time teacher’s aide”. Students in each group were given “standardized tests at the end of each school year”. The experiment lasted four years. The author then compares test scores across the class types to estimate the effect of class size on student performance.
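The logic of this comparison can be illustrated with a minimal simulated sketch. Everything in it is an assumption made for illustration: the sample size, the score scale, the `true_small_effect` of 5 points, and the unobserved `ability` term are made up and are not Project STAR data. Because class type is assigned at random, it is independent of ability, so a regression of scores on class-type dummies recovers the assumed effect.

```r
# Simulated illustration only: all numbers are hypothetical, not Project STAR data.
set.seed(42)
n <- 3000
types <- c("regular", "small", "regular_aide")

# Random assignment: class type is drawn independently of student ability
class_type <- factor(sample(types, n, replace = TRUE), levels = types)
ability <- rnorm(n)  # unobserved by the researcher

true_small_effect <- 5  # assumed effect of a small class, in test-score points
score <- 50 + true_small_effect * (class_type == "small") +
  10 * ability + rnorm(n, sd = 5)

# Differences in mean scores relative to regular classes
fit <- lm(score ~ class_type)
small_effect_hat <- unname(coef(fit)["class_typesmall"])
aide_effect_hat <- unname(coef(fit)["class_typeregular_aide"])
print(round(c(small = small_effect_hat, aide = aide_effect_hat), 2))
```

The estimate for small classes should land near the assumed 5 points and the aide estimate near zero, even though ability is omitted from the regression.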

d. What are the assumptions / threats to this identification strategy?

  • assumptions:
    • randomization: students and teachers are assigned to class types at random.
    • controlled environment.
    • similar characteristics of students and teaching environments across the groups.
  • threats (as the author discusses in the introduction):
    • Re-randomization of students between regular-size classes with and without a full-time aide could compromise the experimental results.
    • Other nonrandom transitions: about 10% of students switched between small and regular classes because of parents’ complaints.
    • Class sizes varied more than intended because families relocated.
    • Sample attrition was common.
    • Students may have switched schools nonrandomly.

Part 2: Paper using Twins for Identification: Economic Returns to Schooling

Download and go over this seminal paper by Orley Ashenfelter and Alan Krueger. Ashenfelter and Krueger (1994) Estimates of the Economic Return to Schooling from a New Sample of Twins AER 84(5): 1157-1173

2.1. Briefly answer these questions:

c. What is the identification strategy?

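The authors’ team interviewed twins at the 16th Annual Twins Days Festival in Twinsburg, Ohio, in August 1991. The twins they interviewed are identical twins, which means that they are genetically identical. After collecting the survey data, the authors compare the two twins within each pair: they relate the within-pair difference in wages to the within-pair difference in schooling, so that family background and genetic factors shared by both twins cancel out.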

d. What are the assumptions / threats to this identification strategy? (Answer specifically with reference to the data the authors are using)

  • Threats:
    • Twins may have different abilities even though they are genetically identical.
    • Selection bias:
      • Twins in the sample may be more similar to each other than twins in a random sample, because the authors recruited them at a festival.
      • Twins in this study vary in dimensions in which the twins in other studies do not.
    • Measurement error.

The authors mention two types of threat that the data must deal with:

  • Omitted ability variables
    • The authors want the “coefficients \(\beta\) to measure the structural (or selection-corrected) effect of the observables on earnings”, so that the estimator is not biased by omitted ability.
  • Measurement error
    • Measurement error in reported schooling is revealed by the correlation between the two measures of each twin’s schooling: the estimates of “the reliability ratio for the twins schooling levels in Table 2 are 0.92 and 0.88”. This measurement error biases the estimator (toward zero under classical measurement error).
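Why measurement error matters more in a within-pair regression can be sketched with a short derivation. It assumes classical measurement error (errors uncorrelated with true schooling and across twins); the symbols \(r\) (reliability ratio of reported schooling) and \(\rho\) (correlation between the two twins’ reported schooling) are notation introduced here for illustration.

\[
\operatorname{plim}\,\hat{\beta}_{OLS} = \beta\, r,
\qquad
\operatorname{plim}\,\hat{\beta}_{FD} = \beta\,\frac{r - \rho}{1 - \rho}
\]

Differencing removes the schooling variation that twins share but keeps all of the reporting error, so the attenuation is larger the more similar the twins’ schooling is. As a purely hypothetical illustration, with \(r = 0.9\) and \(\rho = 0.7\) the first-difference estimate would converge to \((0.9 - 0.7)/(1 - 0.7) \approx 0.67\) of the true \(\beta\), compared with \(0.9\) of it in levels.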

2.2. Replication analysis

a. Load Ashenfelter and Krueger AER 1994 data. You can load it directly from my website here. Variable names should be self-explanatory if you read the paper.

library(haven)
d <- read_dta("hw4/AshenfelterKrueger1994_twins.dta")

b. Reproduce the result from table 3 column 5.

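# first difference: within-pair differences in log wage and education
wage_dif <- d$lwage1 - d$lwage2
edu_dif <- d$educ1 - d$educ2
# wage_dif and edu_dif are standalone vectors, so lm() finds them in the global environment
g <- lm(wage_dif ~ edu_dif, data = d)
summary(g)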
## 
## Call:
## lm(formula = wage_dif ~ edu_dif, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03115 -0.20909  0.00722  0.34395  1.15740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07859    0.04547  -1.728 0.086023 .  
## edu_dif      0.09157    0.02371   3.862 0.000168 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5542 on 147 degrees of freedom
## Multiple R-squared:  0.09211,    Adjusted R-squared:  0.08593 
## F-statistic: 14.91 on 1 and 147 DF,  p-value: 0.0001682
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(g,  
          type="text",
          title = "Table 3 column 5",
          dep.var.labels = c("First difference (v)"),
          covariate.labels = c("Own education"))
## 
## Table 3 column 5
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                        First difference (v)    
## -----------------------------------------------
## Own education                0.092***          
##                               (0.024)          
##                                                
## Constant                      -0.079*          
##                               (0.045)          
##                                                
## -----------------------------------------------
## Observations                    149            
## R2                             0.092           
## Adjusted R2                    0.086           
## Residual Std. Error      0.554 (df = 147)      
## F Statistic           14.914*** (df = 1; 147)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

c. Explain how this coefficient should be interpreted.

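The coefficient means that a one-year increase in the within-pair difference in education is associated, on average, with a 0.092 increase in the within-pair difference in log wages. In other words, the twin with one more year of schooling earns roughly 9.2% more than his or her sibling.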
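The first-difference regression behind this coefficient can be motivated with a short derivation. This is a sketch under assumed notation, not the paper’s exact specification: \(y_{ji}\) is the log wage and \(S_{ji}\) the schooling of twin \(j\) in family \(i\), and \(\mu_i\) is a family component (shared genetics and background) common to both twins.

\[
y_{1i} = \alpha + \beta S_{1i} + \mu_i + \varepsilon_{1i},
\qquad
y_{2i} = \alpha + \beta S_{2i} + \mu_i + \varepsilon_{2i}
\]

\[
y_{1i} - y_{2i} = \beta\,(S_{1i} - S_{2i}) + (\varepsilon_{1i} - \varepsilon_{2i})
\]

Subtracting one equation from the other removes \(\mu_i\), so \(\beta\) is estimated only from differences in schooling within a pair of twins.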

d. Reproduce the result in table 3 column 1. You will need to reshape the data first.

Hint: I used the reshape command from the reshape2 package. It likes to have a “.” in variable names, so I renamed the variables with “.1” and “.2” instead of just “1” and “2”, but you can avoid that by just setting sep="". There are probably other ways to do it using melt or gather.
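The wide-to-long step described in the hint can be sketched on a toy data frame. The data frame `wide` and its values are hypothetical; only the column-naming pattern (`lwage1`, `lwage2`, `educ1`, `educ2`) mirrors the twins data. The sketch uses base R's `reshape()` with `sep = ""`, which splits each varying column name into a variable name and a trailing twin number.

```r
# Toy wide data: one row per family, twin-specific columns end in 1 or 2
# (values are made up for illustration)
wide <- data.frame(
  famid  = 1:2,
  age    = c(30, 41),
  lwage1 = c(2.1, 2.5),
  lwage2 = c(2.3, 2.4),
  educ1  = c(12, 16),
  educ2  = c(14, 16)
)

# sep = "" tells reshape() to split "lwage1" into "lwage" and time 1, and so on
long <- reshape(wide,
                direction = "long",
                idvar = "famid",
                timevar = "twin",
                sep = "",
                varying = 3:ncol(wide))

print(long)
```

The result has one row per twin, with a `twin` column identifying the sibling and single `lwage` and `educ` columns, which is the shape needed for the pooled OLS regression.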

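library(reshape2)
# Note: reshape() below is the base R (stats) function; reshape2 itself provides melt() and dcast()
d2 <- reshape(d,
              idvar = c("famid", "age"),
              sep = "",
              timevar = "twin",
              direction = "long",
              varying = 3:ncol(d))  # treats every column from the third onward as twin-specific

# age squared, scaled by 100 to match the "Age squared (/100)" row of the table
d2$age2 <- (d2$age^2) / 100
g2 <- lm(lwage ~ educ + age + age2 + male + white, data = d2)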
library(stargazer)
stargazer(g2,  
          type="text",
          title = "Table 3 column 1",
          dep.var.labels = c("OLS (i)"),
          covariate.labels = c("Own education","Age","Age squared(/100)","Male","White"))
## 
## Table 3 column 1
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               OLS (i)          
## -----------------------------------------------
## Own education                0.084***          
##                               (0.014)          
##                                                
## Age                          0.088***          
##                               (0.019)          
##                                                
## Age squared(/100)            -0.087***         
##                               (0.023)          
##                                                
## Male                         0.204***          
##                               (0.063)          
##                                                
## White                        -0.410***         
##                               (0.127)          
##                                                
## Constant                      -0.471           
##                               (0.426)          
##                                                
## -----------------------------------------------
## Observations                    298            
## R2                             0.272           
## Adjusted R2                    0.260           
## Residual Std. Error      0.532 (df = 292)      
## F Statistic           21.860*** (df = 5; 292)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

e. Explain how the coefficient on education should be interpreted.

If schooling increases by one year, holding the other variables constant, wages increase by about 8.4% on average.

f. Explain how the coefficient on the control variables should be interpreted.

  • Age

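Because the model also includes age squared, the coefficient on age (0.088) is not by itself the effect of being one year older. Holding other variables constant, the marginal effect of one more year of age on log wages is 0.088 − 2 × 0.087 × age/100, so it depends on age: at age 30, for example, it is about 0.088 − 0.052 = 0.036, or roughly 3.6%.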

  • Age squared

\[\ln(wage) = \beta_0 + \beta_1 age + \beta_2 \frac{age^2}{100} + \dots\]

The coefficient on age is positive and the coefficient on age squared is negative, which means that the relationship between age and log wages has an inverted “U” shape. Wages increase with age up to a peak, after which they decrease as age increases.
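The location of the peak can be worked out from the rounded estimates in the table (0.088 on age and −0.087 on age squared divided by 100). This is a back-of-the-envelope calculation, not a result reported in the paper.

\[
\frac{\partial \ln(wage)}{\partial age} = 0.088 - 2 \times 0.087 \times \frac{age}{100} = 0
\quad\Longrightarrow\quad
age^{*} = \frac{100 \times 0.088}{2 \times 0.087} \approx 50.6
\]

Under these estimates, predicted wages rise with age until roughly age 51 and decline afterwards.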

  • Male

Holding other variables constant, male twins earn on average about 0.204 log points more than female twins, which is roughly 20% more (about 22.6% when converted exactly).

  • White

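Holding other variables constant, white twins in this sample earn on average about 0.41 log points less than twins of other races. Because the coefficient is large, the log-point approximation of 41% overstates the gap: the exact difference is about 34% lower wages.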
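The percentage readings of the Table 3 column 1 coefficients can be checked by converting log points into exact percentage changes with \(100 \times (e^{b} - 1)\). The sketch below hard-codes the rounded estimates from the table above; the vector name `log_points` is just a label chosen for this example.

```r
# Rounded coefficients from the replicated Table 3 column 1 (log points)
log_points <- c(educ = 0.084, male = 0.204, white = -0.410)

# Exact percentage change in wages implied by each coefficient
pct_effect <- 100 * (exp(log_points) - 1)
print(round(pct_effect, 1))
```

For small coefficients such as education the two readings nearly coincide (about 8.8% versus 8.4%), while for larger ones such as the race dummy the gap is sizeable (about −33.6% versus −41%).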