#Loading required packages
library(prettydoc) #For the theme used in this document
library(haven) #Stata data loading to R
library(tidyverse) #Required for renaming
library(stargazer) #Nice tables
library(reshape2) #Required for 2.2 part(d)
Part 1: Paper using randomized data: Impact of Class Size on Learning
1.1. Briefly answer these questions:
a. What is the causal link the paper is trying to reveal?
Krueger (1999) studies the effect of class size on students' achievement, measured by performance on standardized tests.
b. What would be the ideal experiment to test this causal link?
Ideally, I would like to have the test scores of students before they are randomly assigned to smaller classrooms. In other words, I would rerun Project STAR, but with a baseline survey that records test scores for the whole population. In this way, I could exploit both cross-sectional and time-series variation to study the effects of class size on performance.
c. What is the identification strategy?
In this paper, the author exploits an experiment called Project STAR, carried out in Tennessee, in which students were randomly assigned to one of three class types: small classes, regular-size classes, and regular-size classes with a teacher's aide. The author uses this exogenous experimental variation to identify the causal effect of interest.
d. What are the assumptions / threats to this identification strategy?
One identifying assumption the author makes is that the control group (students in regular classrooms) is similar in other characteristics to the treated group (students in small classrooms). However, this may fail if, for example, more educated or more motivated parents found a way to get their children admitted to smaller classrooms. Another assumption is that teachers' effort and motivation are the same across groups, which can arguably be contested. Where these assumptions are not plausible, the paper would estimate anything but the causal effect.
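The first assumption is commonly examined with a balance check: regress a pre-treatment characteristic on the treatment indicator and see whether the groups differ. The sketch below uses simulated data only; the variable names `small_class` and `baseline_score` are hypothetical and do not come from the Project STAR files.

```r
#Hypothetical simulated data; these are not the Project STAR records
set.seed(42)
n <- 2000
students <- data.frame(
  small_class    = rbinom(n, 1, 0.5),            #random assignment to a small class
  baseline_score = rnorm(n, mean = 50, sd = 10)  #assumed pre-treatment covariate
)

#Under successful randomization the coefficient on small_class should be close to zero
balance <- lm(baseline_score ~ small_class, data = students)
diff_means <- coef(balance)[["small_class"]]
print(summary(balance)$coefficients)
```

A large and statistically significant difference in a real baseline covariate would suggest that assignment was not effectively random, for instance because of parental sorting.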
Part 2: Using Twins for Identification: Economic Returns to Schooling
2.1. Briefly answer these questions:
a. What is the causal link the paper is trying to reveal?
This paper studies the causal effects of schooling on later-life earnings.
b. What would be the ideal experiment to test this causal link?
Ideally, to study the causal effects of schooling, I would choose a village and randomly assign, at birth, the number of years of schooling to each child born in a given cohort. Then, I would wait until my sample enters the labor market and analyze the effects of assigned schooling on earnings. I assume perfect compliance in this thought experiment to be able to claim causality.
c. What is the identification strategy?
In this paper, the authors study the effects of schooling by exploiting variation in schooling within pairs of twins. The authors argue that, in this way, they are able to shut down concerns about ability bias and thus claim causality.
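The logic of the within-twin comparison can be written out as a short derivation. This is a sketch in my own notation, which may differ from the paper's: $w_{ij}$ is the wage and $S_{ij}$ the schooling of twin $i$ in family $j$, $\mu_j$ is an unobserved component shared by both twins (family background, ability), and $\varepsilon_{ij}$ is an idiosyncratic error.

```latex
\begin{aligned}
\ln w_{1j} &= \alpha + \beta S_{1j} + \mu_j + \varepsilon_{1j}, \\
\ln w_{2j} &= \alpha + \beta S_{2j} + \mu_j + \varepsilon_{2j}, \\
\ln w_{1j} - \ln w_{2j} &= \beta\,(S_{1j} - S_{2j}) + (\varepsilon_{1j} - \varepsilon_{2j}).
\end{aligned}
```

Subtracting the second equation from the first cancels $\mu_j$, so $\beta$ is identified from within-pair differences as long as the remaining error difference is uncorrelated with the schooling difference.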
d. What are the assumptions / threats to this identification strategy?
I think one of the main threats to this identification strategy is that ability is a function of many things, so it can differ even across genetically identical children. Also, twins may grow up in separate environments and attend schools of different quality, which can also affect earnings.
My second concern is regarding the sample. The authors rely on a survey that was administered at a festival. There are two potential issues here: (1) measurement error (which the authors mention) and (2) selection into the sample.
2.2. Replication analysis
a. Load Ashenfelter and Krueger AER 1994 data.
#Setting directory
setwd("D:/UGA Coursework/Second Year/AAEC 8610/HWs/HW4")
df <- read_dta("AshenfelterKrueger1994_twins.dta")
head(df)
## # A tibble: 6 × 10
## famid age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 33.3 16 16 2.16 2.42 0 0 1 1
## 2 2 43.6 12 19 2.17 2.89 0 0 1 1
## 3 3 31.0 12 12 2.79 2.80 1 1 1 1
## 4 4 34.6 14 14 2.82 2.26 1 1 1 1
## 5 5 35.0 15 13 2.03 3.56 0 0 1 1
## 6 6 29.3 14 12 2.71 2.48 1 1 1 1
b. Reproduce the result from table 3 column 5.
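#Within-pair differences (twin 1 minus twin 2); leduc is a difference in years of schooling, not a log
df$lwage <- df$lwage1 - df$lwage2
df$leduc <- df$educ1 - df$educ2
#First-difference regression
regression_fd <- lm(lwage ~ leduc, data = df)
stargazer(regression_fd, type = "text", title = "Regression Results",
          notes = "This table replicates the results from Ashenfelter and Krueger (1994)",
          align = TRUE)
##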
## Regression Results
## =========================================================================================
## Dependent variable:
## ---------------------------------------------------------------------
## lwage
## -----------------------------------------------------------------------------------------
## leduc 0.092***
## (0.024)
##
## Constant -0.079*
## (0.045)
##
## -----------------------------------------------------------------------------------------
## Observations 149
## R2 0.092
## Adjusted R2 0.086
## Residual Std. Error 0.554 (df = 147)
## F Statistic 14.914*** (df = 1; 147)
## =========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## This table replicates the results from Ashenfelter and Krueger (1994)
c. Explain how this coefficient should be interpreted.
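This regression can be regarded as a fixed effects model: differencing within each twin pair removes everything the two twins share, and we regress the within-pair difference in log wages on the within-pair difference in years of schooling. The coefficient of 0.092 implies that, within a pair, the twin with one more year of schooling earns about 9.2 percent more on average.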
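To illustrate why differencing matters, the sketch below simulates twin pairs with a shared unobserved component that raises both schooling and wages. All numbers, including the assumed true return `beta`, are hypothetical and are not estimates from the Ashenfelter and Krueger data.

```r
#Hypothetical simulated twin pairs; these are not the Ashenfelter and Krueger data
set.seed(1)
n_pairs <- 500
beta <- 0.09                      #assumed true return to schooling
ability <- rnorm(n_pairs)         #unobserved component shared within a pair

#Schooling depends on the shared component, which is what biases pooled OLS
educ1 <- 13 + 1.5 * ability + rnorm(n_pairs)
educ2 <- 13 + 1.5 * ability + rnorm(n_pairs)
lwage1 <- 1 + beta * educ1 + 0.3 * ability + rnorm(n_pairs, sd = 0.3)
lwage2 <- 1 + beta * educ2 + 0.3 * ability + rnorm(n_pairs, sd = 0.3)

#Pooled OLS on all twins ignores the shared component
pooled <- data.frame(lwage = c(lwage1, lwage2), educ = c(educ1, educ2))
b_pooled <- coef(lm(lwage ~ educ, data = pooled))[["educ"]]

#First differences within each pair remove the shared component
diffs <- data.frame(d_lwage = lwage1 - lwage2, d_educ = educ1 - educ2)
b_fd <- coef(lm(d_lwage ~ d_educ, data = diffs))[["d_educ"]]

print(c(pooled = b_pooled, first_difference = b_fd))
```

In this setup the pooled estimate is pushed upward by the shared component, while the first-difference estimate stays close to the assumed `beta`. In real data the comparison can go either way, for example when schooling is measured with error.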
d. Reproduce the result in table 3 column 1. You will need to reshape the data first.
#Reshaping the data from wide (one row per twin pair) to long (one row per twin)
dfNEW <- reshape(df,
                 idvar = c("famid", "age"),
                 sep = "", timevar = "twin",
                 direction = "long",
                 varying = 3:10)
#Age squared, divided by 100
dfNEW$agesq <- dfNEW$age^2 / 100
regression_ols <- lm(lwage ~ educ + age + agesq + male + white, data = dfNEW)
stargazer(regression_ols, type = "text", title = "Regression Results",
          notes = "This table replicates the results from Ashenfelter and Krueger (1994)",
          align = TRUE)
##
## Regression Results
## =========================================================================================
## Dependent variable:
## ---------------------------------------------------------------------
## lwage
## -----------------------------------------------------------------------------------------
## educ 0.084***
## (0.014)
##
## age 0.088***
## (0.019)
##
## agesq -0.087***
## (0.023)
##
## male 0.204***
## (0.063)
##
## white -0.410***
## (0.127)
##
## Constant -0.471
## (0.426)
##
## -----------------------------------------------------------------------------------------
## Observations 298
## R2 0.272
## Adjusted R2 0.260
## Residual Std. Error 0.532 (df = 292)
## F Statistic 21.860*** (df = 5; 292)
## =========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## This table replicates the results from Ashenfelter and Krueger (1994)
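The wide-to-long reshape used in part (d) may be easier to follow on a toy example. The data frame `wide` below is made up for illustration, with two families and only the schooling and wage columns; with `sep = ""`, `reshape()` splits names such as `educ1` into the variable `educ` and the twin index `1`.

```r
#Toy wide data: one row per twin pair (values are illustrative only)
wide <- data.frame(famid  = 1:2,
                   age    = c(33, 44),
                   educ1  = c(16, 12), educ2  = c(16, 19),
                   lwage1 = c(2.16, 2.17), lwage2 = c(2.42, 2.89))

#Long data: one row per twin, with a new column `twin` taking values 1 and 2
long <- reshape(wide,
                direction = "long",
                varying = c("educ1", "educ2", "lwage1", "lwage2"),
                sep = "", timevar = "twin",
                idvar = "famid")
print(long)
```

Each family now contributes two rows, which is why the OLS regression above reports twice as many observations as the first-difference regression.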
e. Explain how the coefficient on education should be interpreted.
This is the naive way to estimate the effect of schooling on wages: we regress log wages on years of schooling and a set of control variables, hoping that the controls capture the unobserved confounding factors. In this case, we can interpret the coefficient as follows: holding the control variables fixed, one additional year of schooling is associated with earnings that are, on average, about 8.4 percent higher.
f. Explain how the coefficient on the control variables should be interpreted.
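Because the dependent variable is the log wage, the coefficients are measured in log points. Age enters as a quadratic (with agesq equal to age squared divided by 100), so the effect of one additional year of age is not a constant 8.8 percent: it equals 0.088 - 0.00174 × age, which is positive at younger ages and turns negative at roughly age 51. Being male is associated with wages that are about 0.204 log points (roughly 23 percent) higher. Whites in this sample earn about 0.410 log points (roughly 34 percent) less than nonwhites.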
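The arithmetic behind these interpretations can be sketched as follows. The coefficients are copied by hand from the OLS table above; the helper names `marginal_age`, `peak_age`, and `pct_effect` are my own. Since the outcome is a log wage, a coefficient b corresponds to an exact percentage change of 100 × (exp(b) - 1), which differs noticeably from 100 × b when b is large.

```r
#Coefficients copied from the OLS table above
b_age   <- 0.088
b_agesq <- -0.087   #coefficient on age^2 / 100
b_male  <- 0.204
b_white <- -0.410

#Marginal effect of age: derivative of b_age * age + b_agesq * age^2 / 100
marginal_age <- function(age) b_age + 2 * b_agesq * age / 100

#Age at which the estimated wage-age profile peaks (marginal effect equals zero)
peak_age <- -100 * b_age / (2 * b_agesq)

#Exact percentage change implied by a log-point coefficient
pct_effect <- function(b) 100 * (exp(b) - 1)

print(c(marginal_at_30 = marginal_age(30), peak_age = peak_age))
print(c(male = pct_effect(b_male), white = pct_effect(b_white)))
```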