The paper is trying to reveal the causal effect of class size on student achievement.
The ideal experiment would compare each student's actual outcome under treatment with the potential outcome the same student would have had without treatment. In other words, it would create a parallel world in which everything except class size is the same and compare test scores across class sizes. For example, first take a student, teach them in a small class, and record the test score. Then go back in time, teach the same student in a class of a different size while keeping everything else (teacher, material, class environment, etc.) the same, and record the test score again. However, this is not possible. Therefore, the next best alternative is to randomly assign students to classes of different sizes and compare the average test scores across class sizes.
The identification strategy exploits the random assignment of students to classes of different sizes.
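In potential-outcomes notation, the argument can be sketched as follows. The symbols are introduced here for illustration and are not taken from the paper: \(Y_i(1)\) and \(Y_i(0)\) denote student \(i\)'s test score in a small and in a regular class, and \(D_i = 1\) indicates assignment to a small class. The individual effect is

\[ \tau_i = Y_i(1) - Y_i(0), \]

but only one of the two potential outcomes is ever observed for a given student. If \(D_i\) is randomly assigned, it is independent of \((Y_i(0), Y_i(1))\), so the difference in group means recovers the average effect:

\[ E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = E[Y_i(1) - Y_i(0)]. \]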
The paper is trying to estimate the causal effect of schooling on wages.
The ideal experiment would be to first take a person, send them to school, and record that person's wages on completion of their schooling. Then one would have to go back in time, tell the same person not to go to school, and record that person's wages while keeping everything else the same. The causal effect is the difference between the actual outcome and the potential outcome of exactly the same person had they not gone to school. However, such an experiment is not possible. The alternative strategy would be to randomly assign levels of schooling to people.
The identification strategy uses within-twin-pair variation in schooling to estimate the returns to schooling. Identical twins are similar in many respects, which makes them comparable: in the absence of treatment, the outcomes of the two twins are assumed to be very similar. While this is not exactly the ideal comparison of a person's actual and potential outcomes, it comes closer to that experiment, because one twin can serve as the counterfactual for the other. The argument is therefore that the within-pair difference in outcomes can be attributed to the within-pair difference in treatment.
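A minimal sketch of why within-twin differences help, assuming a linear model with an additive family component (the notation \(\mu_i\) and \(\varepsilon_{ij}\) is chosen here for illustration and is not taken from the paper). For twin \(j \in \{1, 2\}\) in family \(i\),

\[ \text{lwage}_{ij} = \alpha + \beta\,\text{educ}_{ij} + \mu_i + \varepsilon_{ij}, \]

where \(\mu_i\) collects everything the twins share (genes, family background). Subtracting the equation for twin 2 from the equation for twin 1 removes \(\alpha\) and \(\mu_i\):

\[ \text{lwage}_{i1} - \text{lwage}_{i2} = \beta\,(\text{educ}_{i1} - \text{educ}_{i2}) + (\varepsilon_{i1} - \varepsilon_{i2}). \]

Under this assumption, \(\beta\) is identified from within-pair variation as long as the remaining twin-specific error is unrelated to the within-pair difference in schooling.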
The assumptions/threats to identification are:
#install.packages(c("dplyr", "stringr", "ggplot2", "readxl", "stargazer", "plm", "haven"), repos="http://cran.us.r-project.org")
library(plm)
library(stargazer)
library(haven)
AK <- read_dta("../HW4/AshenfelterKrueger1994_twins.dta")
# within-pair differences in log wages and years of education
AK$wagediff <- AK$lwage1 - AK$lwage2
AK$educdiff <- AK$educ1 - AK$educ2
# run the regression
diffReg <- lm(wagediff ~ educdiff, data = AK)
# tabulate the regression
stargazer(diffReg, type="text", header= FALSE, title = "Results from Table 3 column 5" , no.space=TRUE, omit.stat = c("adj.rsq", "ser", "f"), covariate.labels = c("Own Education")) # print the results
##
## Results from Table 3 column 5
## =========================================
## Dependent variable:
## ---------------------------
## wagediff
## -----------------------------------------
## Own Education 0.092***
## (0.024)
## Constant -0.079*
## (0.045)
## -----------------------------------------
## Observations 149
## R2 0.092
## =========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Because the dependent variable is the difference in log wages, the coefficient \(\hat\beta = 0.092\) is interpreted as follows: within a twin pair, one additional year of education leads to an increase in wages of approximately 9.2 percent.
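The percentage reading treats a change of \(\hat\beta\) in log wages as roughly \(100\hat\beta\) percent. As a sketch, the exact percentage change implied by a log-point coefficient is \(100(e^{\hat\beta} - 1)\). The helper name `pct_change` below is hypothetical and introduced only for illustration; the coefficient is the rounded value from the table above.

```r
# Convert a coefficient from a regression with a log dependent variable
# into the exact percentage change it implies.
# `pct_change` is an illustrative helper, not part of the original analysis.
pct_change <- function(b) 100 * (exp(b) - 1)

beta_hat   <- 0.092             # rounded coefficient on educdiff from the table
approx_pct <- 100 * beta_hat    # log-point approximation
exact_pct  <- pct_change(beta_hat)

cat(sprintf("approximate: %.1f%%, exact: %.1f%%\n", approx_pct, exact_pct))
```

For a coefficient this small the two readings are close (about 9.2 versus 9.6 percent); the gap grows for larger coefficients.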
# d. Change the column names
# (assumes the first ten columns of AK are stored in this order)
colnames(AK)[1:10] <- c("famid", "age", "educ.1", "educ.2", "lwage.1", "lwage.2", "male.1", "male.2", "white.1", "white.2")
# use reshape() to convert the data to long format (one row per twin)
AKR <- reshape(AK, direction = "long",
               varying = list(c("educ.1", "educ.2"), c("lwage.1", "lwage.2"),
                              c("male.1", "male.2"), c("white.1", "white.2")),
               v.names = c("educ", "lwage", "male", "white"),
               timevar = "twin", times = c("1", "2"),
               idvar = c("famid", "age"))
# generate age squared, scaled by 1/100
AKR$agesq <- AKR$age^2 / 100
# run regression
tab3col1 <- lm(lwage ~ educ + age + agesq + male + white, data = AKR)
stargazer(tab3col1, type="text",covariate.labels = c("Own Education", "Age", "Age squared / 100", "Male", "White"), title = "Table 3 column 1", no.space=TRUE, keep.stat = c("n","adj.rsq")) # print the results
##
## Table 3 column 1
## =============================================
## Dependent variable:
## ---------------------------
## lwage
## ---------------------------------------------
## Own Education 0.084***
## (0.014)
## Age 0.088***
## (0.019)
## Age squared / 100 -0.087***
## (0.023)
## Male 0.204***
## (0.063)
## White -0.410***
## (0.127)
## Constant -0.471
## (0.426)
## ---------------------------------------------
## Observations 298
## Adjusted R2 0.260
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The coefficient on own education means that an additional year of schooling is associated with approximately 8.4 percent higher wages, holding age, sex, and race constant.
Age Squared: The age-squared term captures the non-linear association between age and earnings. Including age only in linear form would imply that earnings keep increasing with age no matter what, which is not the case: after a certain age, earnings start to decrease. The negative coefficient indicates a concave relationship in which earnings rise with age at first and fall after a certain threshold.
Age: The coefficient on age indicates a positive association between age and wages. Because age squared is also in the model, 8.8 percent is not the association at every age; the marginal association of a one-year increase in age is \(0.088 - 2(0.087)\,\text{age}/100\), which shrinks as age rises.
Male: The positive sign indicates a positive association between wages and being male. On average, men earn approximately 20.4 percent more than the reference category (women), holding the other variables constant.
White: The negative sign indicates that, in this sample, white individuals earn less than the reference category (non-white individuals). The coefficient of -0.410 corresponds to wages that are about 41 log points, or roughly 34 percent, lower than those of the reference category.
In general, the coefficients of control variables do not have a causal interpretation because the empirical model is not specified to examine causal relationships between the control variables and the dependent variable. Therefore, they denote associations.
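As an illustration of the concave age profile discussed above, the sketch below combines the age and age-squared coefficients to compute the marginal association of age with log wages and the age at which the fitted profile peaks. It uses the rounded coefficients from the Table 3 column 1 output, so the numbers are approximate; the names `marginal_age` and `turning_age` are hypothetical.

```r
# Rounded coefficients copied from the Table 3 column 1 output
b_age   <- 0.088
b_agesq <- -0.087   # coefficient on age^2 / 100

# Marginal association of one more year of age, evaluated at a given age:
# d lwage / d age = b_age + 2 * b_agesq * age / 100
marginal_age <- function(age) b_age + 2 * b_agesq * age / 100

# Age at which the fitted age-earnings profile peaks (marginal association = 0)
turning_age <- -b_age / (2 * b_agesq / 100)

cat(sprintf("marginal association at age 30: %.3f\n", marginal_age(30)))
cat(sprintf("fitted profile peaks at about age %.1f\n", turning_age))
```

With these rounded values, the fitted profile rises until roughly age 50 and declines afterwards, which is the concave shape the negative age-squared coefficient implies.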