HW4

Part 2

2.1 Briefly answer these questions:

a. What is the causal link the paper is trying to reveal?

The paper tries to estimate the causal effect of schooling on wages, that is, the economic return to an additional year of schooling.

b. What would be the ideal experiment to test this causal link?

The ideal experiment would take a person, send them to school, and record that person's wages once schooling is complete. We would then go back in time, keep the same person out of school, and record their wages again while holding everything else fixed. The causal effect is the difference between the actual outcome and the counterfactual outcome of exactly the same person had they not gone to school. Such an experiment is impossible, so the feasible alternative is to randomly assign levels of schooling to people and compare average wages across the assigned groups.
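The comparison described above can be written in potential-outcomes notation. The symbols \(Y_i(1)\), \(Y_i(0)\), and \(D_i\) are introduced here for illustration and are not taken from the paper; schooling is simplified to a binary treatment.

```latex
\begin{align*}
\tau_i &= Y_i(1) - Y_i(0)
  && \text{(individual effect; only one of the two terms is ever observed)} \\
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
  &= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{effect on the treated}}
   + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{align*}
```

Under random assignment \(D_i\) is independent of the potential outcomes, so the selection-bias term is zero and the simple difference in means recovers the average effect.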

c. What is the identification strategy?

The identification strategy uses within-twin-pair variation in schooling to estimate the return to schooling. Identical twins share genes and family background, which makes them comparable: in the absence of any difference in schooling, their wages are assumed to be very similar. This is not the ideal comparison of one person's actual and counterfactual outcomes, but it comes closer to it than a comparison of unrelated people, because one twin can serve as the counterfactual for the other. The argument is therefore that the within-pair difference in wages can be attributed to the within-pair difference in schooling.
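The logic of differencing within twin pairs can be illustrated with simulated data. This is a minimal sketch and does not use the Ashenfelter-Krueger data: the sample size, the true return `beta_true`, and the way the shared `ability` term enters schooling and wages are all assumptions chosen for illustration.

```r
# Simulated twin pairs (hypothetical data, not the Ashenfelter-Krueger sample)
set.seed(1)
n_pairs   <- 5000   # assumed number of twin pairs
beta_true <- 0.09   # assumed true return to one year of schooling

# Family-level ability shared by both twins (unobserved in practice)
ability <- rnorm(n_pairs)

# Schooling depends on the shared ability plus a twin-specific component
educ1 <- 12 + 1.5 * ability + rnorm(n_pairs)
educ2 <- 12 + 1.5 * ability + rnorm(n_pairs)

# Log wages depend on schooling and on the shared ability
lwage1 <- 1 + beta_true * educ1 + 0.3 * ability + rnorm(n_pairs, sd = 0.3)
lwage2 <- 1 + beta_true * educ2 + 0.3 * ability + rnorm(n_pairs, sd = 0.3)

# Pooled OLS omits ability, so the schooling coefficient absorbs part of it
b_pooled <- unname(coef(lm(c(lwage1, lwage2) ~ c(educ1, educ2)))[2])

# Within-pair differences remove everything the twins share
d_wage   <- lwage1 - lwage2
d_educ   <- educ1 - educ2
b_within <- unname(coef(lm(d_wage ~ d_educ))[2])

print(c(pooled = b_pooled, within = b_within))
```

In this setup the pooled estimate is biased upward because ability raises both schooling and wages, while the within-pair estimate should land close to `beta_true`. The result depends on the assumption that ability is identical within a pair.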

d. What are the assumptions / threats to this identification strategy?

The assumptions of, and threats to, this identification strategy are:

Unobservables that differ within a twin pair and are correlated with schooling may enter the wage equation. For example, a family may invest differently in each twin, and that investment may be correlated with earning ability. The fixed-effects (within-pair) regression controls for characteristics that are the same for both twins, but unobservables that vary within a pair remain a problem, as shown in equation 6 in the paper.
Measurement error in reported schooling biases the estimated effect of schooling, and the bias can be more severe in within-pair differences than in levels.
The study may lack external validity, because the twins in the sample are not representative of the whole population.
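The measurement-error concern can be made concrete with the standard attenuation formula for classical measurement error. The notation below is an assumption for illustration and may not match the paper's: \(S^*\) is true schooling, \(v\) is a reporting error uncorrelated with everything else and across twins, \(\sigma_v^2\) is its variance, and \(\rho\) is the within-pair correlation of true schooling.

```latex
\begin{align*}
\text{Levels:} \quad
\operatorname{plim} \hat\beta_{\text{OLS}}
  &= \beta \, \frac{\sigma_{S^*}^2}{\sigma_{S^*}^2 + \sigma_v^2} \\
\text{Within-pair differences:} \quad
\operatorname{plim} \hat\beta_{\Delta}
  &= \beta \, \frac{\operatorname{Var}(\Delta S^*)}{\operatorname{Var}(\Delta S^*) + \operatorname{Var}(\Delta v)}
   = \beta \, \frac{\sigma_{S^*}^2 (1 - \rho)}{\sigma_{S^*}^2 (1 - \rho) + \sigma_v^2}
\end{align*}
```

Because twins' schooling levels are highly correlated, \(1 - \rho\) is small, so differencing removes much of the true variation in schooling while leaving the error variance in place. The attenuation toward zero is therefore stronger in the differenced regression than in levels (ignoring omitted-ability bias in the levels regression).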

2.2 Replication analysis

a. Load Ashenfelter and Krueger AER 1994 data. You can load it directly from my website here. Variable names should be self-explanatory if you read the paper.

#install.packages(c("dplyr", "stringr", "ggplot2", "readxl", "stargazer", "plm"), repos="http://cran.us.r-project.org")
library(plm)
library(stargazer)

library(haven)
AK <- read_dta("../HW4/AshenfelterKrueger1994_twins.dta")

b. Reproduce the result from table 3 column 5.

# within-pair differences in log wages and in years of education
AK$wagediff <- AK$lwage1 - AK$lwage2
AK$educdiff <- AK$educ1 - AK$educ2
# run the regression
diffReg <- lm(wagediff ~ educdiff, data = AK)
# tabulate the regression
stargazer(diffReg, type="text", header= FALSE, title = "Results from Table 3 column 5" , no.space=TRUE, omit.stat = c("adj.rsq", "ser", "f"), covariate.labels = c("Own Education")) # print the results

## 
## Results from Table 3 column 5
## =========================================
##                   Dependent variable:    
##               ---------------------------
##                        wagediff          
## -----------------------------------------
## Own Education          0.092***          
##                         (0.024)          
## Constant                -0.079*          
##                         (0.045)          
## -----------------------------------------
## Observations              149            
## R2                       0.092           
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

c. Explain how this coefficient should be interpreted.

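Because this is a log-linear model (the difference in log wages regressed on the difference in years of schooling), the coefficient \(\hat\beta = 0.092\) means that, within twin pairs, one additional year of schooling leads to approximately a 9.2 percent increase in wages.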
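The 9.2 percent reading relies on the approximation \(\exp(\beta) - 1 \approx \beta\) for small \(\beta\). As a quick check, the sketch below compares it with the exact percentage change implied by a coefficient on a log outcome; the value 0.092 is copied from the regression output above, and the variable names are illustrative.

```r
# Coefficient on educdiff, copied from the regression table above
beta_hat <- 0.092

# Approximate and exact percentage change in wages per extra year of schooling
approx_pct <- 100 * beta_hat
exact_pct  <- 100 * (exp(beta_hat) - 1)

print(round(c(approx = approx_pct, exact = exact_pct), 2))
```

The exact figure is about 9.64 percent, so the approximation is close at this coefficient size.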

d. Reproduce the result in table 3 column 1

# d. Change the column names

colnames(AK) <- c("famid", "age", "educ.1", "educ.2", "lwage.1", "lwage.2", "male.1", "male.2", "white.1", "white.2")

# use reshape command to convert data to long format
AKR <- reshape(AK, direction = "long",
               varying = list(c(3, 4), c(5, 6), c(7, 8), c(9, 10)),
               v.names = c("educ", "lwage", "male", "white"),
               timevar = "twin", times = c('1', '2'),
               idvar = c("famid", "age"), split = list(regexp = "."))

# generate agesquared
AKR$agesq <- (AKR$age)*(AKR$age)/100

# run regression
tab3col1 <- lm(lwage ~ educ + age + agesq + male + white, data = AKR)
stargazer(tab3col1, type = "text",
          covariate.labels = c("Own Education", "Age", "Age squared / 100", "Male", "White"),
          title = "Table 3 column 1", no.space = TRUE,
          keep.stat = c("n", "adj.rsq")) # print the results

## 
## Table 3 column 1
## =============================================
##                       Dependent variable:    
##                   ---------------------------
##                              lwage           
## ---------------------------------------------
## Own Education              0.084***          
##                             (0.014)          
## Age                        0.088***          
##                             (0.019)          
## Age squared / 100          -0.087***         
##                             (0.023)          
## Male                       0.204***          
##                             (0.063)          
## White                      -0.410***         
##                             (0.127)          
## Constant                    -0.471           
##                             (0.426)          
## ---------------------------------------------
## Observations                  298            
## Adjusted R2                  0.260           
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

e. Explain how the coefficient on education should be interpreted.

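The coefficient means that, holding age, sex, and race fixed, an additional year of schooling is associated with approximately 8.4 percent higher wages. Unlike the within-twin estimate, this pooled OLS estimate does not remove unobserved family background or ability shared by the twins.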

f. Explain how the coefficient on the control variables should be interpreted.

In general, the coefficients of control variables do not have a causal interpretation, because the empirical model is not specified to examine causal relationships between the control variables and the dependent variable. They denote associations.

Age squared: This variable captures the non-linear association between age and earnings. Including age only in linear form would imply that earnings keep rising with age without limit, which is not the case: after a certain age, earnings start to fall. The negative sign indicates that earnings rise with age at first and decline after a threshold, that is, a concave relationship.
Age: The positive coefficient indicates a positive association between age and wages. Because age squared is also in the model, the 8.8 percent figure is the association of one more year of age only at age zero; at a given age the association is roughly \(0.088 - 0.00174 \times \text{age}\), which shrinks as age increases.
Male: The positive sign indicates a positive association between being male and wages. On average, being male is associated with approximately 20.4 percent higher wages than the reference category (female).
White: The negative sign indicates that white workers in this sample earn less than the reference category (non-white). On average, being white is associated with approximately 41 percent (0.41 log points) lower wages.
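The interpretations above can be checked with a little arithmetic on the reported coefficients. The sketch below copies the estimates from the Table 3 column 1 output; the helper names `marginal_age`, `peak_age`, and `pct` are introduced here for illustration only.

```r
# Coefficients copied from the Table 3 column 1 output above
b_age   <- 0.088
b_agesq <- -0.087   # coefficient on age^2 / 100
b_male  <- 0.204
b_white <- -0.410

# Marginal association of one more year of age with log wage at a given age:
# d(lwage)/d(age) = b_age + 2 * b_agesq * age / 100
marginal_age <- function(age) b_age + 2 * b_agesq * age / 100

# Age at which the fitted age-earnings profile peaks (marginal association = 0)
peak_age <- -100 * b_age / (2 * b_agesq)

# Exact percentage difference implied by a dummy coefficient in a log-wage model
pct <- function(b) 100 * (exp(b) - 1)

print(round(c(marginal_at_30 = marginal_age(30), peak_age = peak_age), 3))
print(round(c(male = pct(b_male), white = pct(b_white)), 1))
```

The fitted profile peaks at roughly age 50.6. For the larger dummy coefficients the exact percentage differences (about 22.6 percent for male and about -33.6 percent for white) differ noticeably from the log-point approximation.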


HW4

2/1/2021

Part 1

1.1 Briefly answer this question

a. What is the Causal link the paper is trying to reveal?

b. What would be the ideal experiment to test this causal link?

c. What is the identification strategy?

d. What are the assumptions / threats to this identification strategy?
