Part 1

1.1 Briefly answer this question

What is the identification strategy?

The identification strategy exploits the random assignment of students to classes of different sizes.

d. What are the assumptions / threats to this identification strategy?

  • The biggest threat to identification is reassignment, which is likely to be non-random. If students with higher or lower test scores self-select into reassignment, the final estimates may be biased.

Part 2

2.1 Briefly answer these questions:

c. What is the identification strategy?

The identification strategy uses within-twin-pair variation in schooling to estimate the returns to schooling. Identical twins are similar in many respects, which makes them comparable. In the absence of treatment, the outcomes of the two twins are assumed to be very similar. Although this is not the ideal of comparing a person's actual outcome with his or her own potential outcome, comparing twins comes close to that experiment: because twins are alike in so many respects, one twin can serve as the counterfactual for the other. The argument is therefore that the within-pair difference in outcomes can be attributed to the within-pair difference in treatment (schooling).
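
A minimal sketch of this argument, in notation assumed here for illustration rather than copied from the paper: let \(y_{ij}\) be the log wage of twin \(j\) in family \(i\), \(S_{ij}\) that twin's years of schooling, \(\mu_i\) everything the two twins share, and \(\varepsilon_{ij}\) an idiosyncratic error.

\[
\begin{aligned}
y_{i1} &= \beta S_{i1} + \mu_i + \varepsilon_{i1} \\
y_{i2} &= \beta S_{i2} + \mu_i + \varepsilon_{i2} \\
y_{i1} - y_{i2} &= \beta\,(S_{i1} - S_{i2}) + (\varepsilon_{i1} - \varepsilon_{i2})
\end{aligned}
\]

The shared term \(\mu_i\) drops out of the difference, so \(\beta\) is identified as long as \(\varepsilon_{i1} - \varepsilon_{i2}\) is uncorrelated with \(S_{i1} - S_{i2}\).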

d. What are the assumptions / threats to this identification strategy?

The assumptions/threats to identifications are:

  • Any variable that varies within a twin pair, as education does, may enter the wage equation. For example, a family may invest differently in each twin, and that investment will be correlated with earning ability. The fixed effects regression controls for characteristics that are the same for both twins, but unobservables that differ between the two twins in a pair remain a problem, as shown in equation 6 in the paper.
  • Measurement error in reported schooling will also bias the estimated effect of schooling.
  • The study may lack external validity because the twins in the sample are not representative of the whole population.
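
The measurement-error point can be made precise under textbook assumptions that are not stated in the answer above: reported schooling is \(S_{ij} = S^{*}_{ij} + v_{ij}\), the errors \(v_{ij}\) are classical (mean zero, variance \(\sigma_v^2\), uncorrelated with true schooling, with the wage error, and across twins), and true schooling has variance \(\sigma_{*}^2\) and within-pair correlation \(\rho\). Ignoring every other source of bias, a sketch of the resulting attenuation in the cross-section (OLS) and in within-pair differences (FD) is:

\[
\operatorname{plim}\hat\beta_{OLS} = \beta\,\frac{\sigma_{*}^2}{\sigma_{*}^2 + \sigma_v^2},
\qquad
\operatorname{plim}\hat\beta_{FD} = \beta\,\frac{\sigma_{*}^2\,(1-\rho)}{\sigma_{*}^2\,(1-\rho) + \sigma_v^2}
\]

When twins' schooling is positively correlated (\(\rho > 0\)), differencing removes much of the true variation in schooling but none of the noise, so the bias toward zero is stronger in the within-pair estimate than in the cross-section.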

2.2 Replication analysis

a. Load Ashenfelter and Krueger AER 1994 data. You can load it directly from my website here. Variable names should be self-explanatory if you read the paper.

#install.packages(c("dplyr", "stringr", "ggplot2", "readxl", "stargazer", "plm", "haven"), repos="http://cran.us.r-project.org")
library(plm)
library(stargazer)
library(haven)
AK <- read_dta("../HW4/AshenfelterKrueger1994_twins.dta")

b. Reproduce the result from table 3 column 5.

# within-pair differences in log wages and years of schooling
AK$wagediff <- AK$lwage1 - AK$lwage2
AK$educdiff <- AK$educ1 - AK$educ2
# run the regression
diffReg <- lm(wagediff ~ educdiff, data = AK)
# tabulate the regression
stargazer(diffReg, type = "text", header = FALSE,
          title = "Results from Table 3 column 5", no.space = TRUE,
          omit.stat = c("adj.rsq", "ser", "f"),
          covariate.labels = c("Own Education")) # print the results
## 
## Results from Table 3 column 5
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                        wagediff          
## -----------------------------------------
## Own Education          0.092***          
##                         (0.024)          
## Constant                -0.079*          
##                         (0.045)          
## -----------------------------------------
## Observations              149            
## R2                       0.092           
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

c. Explain how this coefficient should be interpreted.

Because the dependent variable is in logs and schooling is in years (a log-linear model), the coefficient \(\hat\beta = 0.092\) means that, within a twin pair, one additional year of education is associated with approximately 9.2 percent higher wages.
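
The 9.2 percent figure relies on the approximation that a small log difference is close to a proportional change. As a quick check, using the rounded estimate from the table, the exact conversion is:

\[
100\,\bigl(e^{0.092} - 1\bigr) \approx 9.6
\]

The approximation therefore understates the exact percentage change by a bit under half a percentage point.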

d. Reproduce the result in table 3 column 1

# d. Change the column names so each twin-specific variable ends in .1 or .2.
# Only the first ten columns are renamed, because wagediff and educdiff
# were appended to AK in part b.
colnames(AK)[1:10] <- c("famid", "age", "educ.1", "educ.2", "lwage.1", "lwage.2", "male.1", "male.2", "white.1", "white.2")

# use reshape command to convert data to long format (one row per twin)
AKR <- reshape(as.data.frame(AK), direction = "long",
               varying = list(c(3, 4), c(5, 6), c(7, 8), c(9, 10)),
               v.names = c("educ", "lwage", "male", "white"),
               timevar = "twin", times = c("1", "2"),
               idvar = c("famid", "age"))

# generate agesquared
AKR$agesq <- (AKR$age)*(AKR$age)/100

# run regression
tab3col1 <- lm(lwage ~ educ + age + agesq + male + white, data = AKR)
stargazer(tab3col1, type = "text",
          covariate.labels = c("Own Education", "Age", "Age squared / 100", "Male", "White"),
          title = "Table 3 column 1", no.space = TRUE,
          keep.stat = c("n", "adj.rsq")) # print the results
## 
## Table 3 column 1
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                              lwage           
## ---------------------------------------------
## Own Education              0.084***          
##                             (0.014)          
## Age                        0.088***          
##                             (0.019)          
## Age squared / 100          -0.087***         
##                             (0.023)          
## Male                       0.204***          
##                             (0.063)          
## White                      -0.410***         
##                             (0.127)          
## Constant                    -0.471           
##                             (0.426)          
## ---------------------------------------------
## Observations                  298            
## Adjusted R2                  0.260           
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

e. Explain how the coefficient on education should be interpreted.

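The coefficient means that an additional year of schooling is associated with approximately 8.4 percent higher wages, holding age, sex, and race constant. Unlike the first-difference estimate in part b, this pooled OLS estimate does not remove the family effects shared by twins.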

f. Explain how the coefficient on the control variables should be interpreted.

In general, the coefficients of control variables do not have a causal interpretation because the empirical model is not specified to examine causal relationships between the control variables and the dependent variable. They are associations.

  • Age squared: The squared term captures the non-linear association between age and earnings. Including only a linear age term would imply that earnings keep rising with age indefinitely, which is not the case: after a certain age, earnings start to decline. The negative coefficient, together with the positive coefficient on age, indicates a concave profile in which earnings first rise with age and then fall after a threshold.

  • Age: The positive coefficient indicates a positive association between age and wages at young ages. Because age squared is also in the model, 0.088 is not the association at every age: a one-year increase in age is associated with a change in log wages of about \(0.088 - 2(0.087)\,\text{age}/100\), which shrinks as age rises.

  • Male: The positive sign indicates a positive association between wages and being male. On average, being male is associated with wages about 0.204 log points (roughly 20 percent) higher than the reference category, holding the other covariates constant.

  • White: The negative sign indicates that, in this sample, a white person earns less than the reference category. On average, being white is associated with wages about 0.41 log points lower, which is roughly 34 percent lower since \(e^{-0.41} - 1 \approx -0.34\).
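
To make the concave age profile concrete, the threshold can be computed from the rounded estimates in the Table 3 column 1 output. This is a back-of-the-envelope sketch rather than a result reported in the paper; \(a\) denotes age in years.

\[
\begin{aligned}
\frac{\partial\,\text{lwage}}{\partial a} &= 0.088 - \frac{2\,(0.087)}{100}\,a = 0.088 - 0.00174\,a \\
a^{*} &= \frac{0.088}{0.00174} \approx 50.6
\end{aligned}
\]

At age 30 the implied slope is about \(0.036\) log points per year, and at age 60 it is about \(-0.016\), so predicted wages rise until roughly age 51 and decline afterwards.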
