1.1.a. What is the causal link the paper is trying to reveal?
The paper evaluates the causal impact of class size on student performance. The main hypothesis is that students in smaller classes achieve better learning outcomes than those in larger classes. The paper also pursues three goals: “To probe the sensitivity of the experimental estimates to flaws in the experimental design; to use the experiment to identify an appropriate specification of the education function to estimate nonexperimental data; to use the experimental results to interpret estimates from the literature based on observational data”.
1.1.b. What would be the ideal experiment to test this causal link?
The ideal experiment, which is the one designed and used in the paper, randomly assigns the students in the study to treatment and control groups. The treatment groups are small classes and regular-size classes with teacher aides, while the control group is regular-size classes without aides. The experiment began with the students who entered kindergarten in the 1985-1986 school year and followed them through third grade. At the end of each grade the students took a required test, and the test results were collected over the four-year period. The experiment is known as the Student/Teacher Achievement Ratio (STAR) experiment.
1.1.c. What is the identification strategy?
The author uses a randomized controlled trial of an education intervention that initially assigns kindergarten students in Tennessee to three different study groups: small classes (treatment 1), regular-size classes with teacher aides (treatment 2), and regular-size classes without teacher aides (control). The intervention is implemented for four years. At the end of each year, students take standardized tests, and the performance of the study groups is compared. The effect of class size is evaluated by comparing treatment 1 with the control, while the effect of teacher aides is evaluated by comparing treatment 2 with the control.
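The comparison of study groups described above can be sketched in R. This is a minimal illustration on simulated data: the sample size, the score scale, and the effect sizes (5 points for small classes, none for aides) are made-up assumptions, not values from the STAR experiment.

```r
# Simulated illustration only: the data and effect sizes below are made up,
# not taken from the STAR experiment.
set.seed(1)
n <- 6000

# Randomly assign each student to one of the three study groups
group <- factor(
  sample(c("regular", "small", "regular_aide"), n, replace = TRUE),
  levels = c("regular", "small", "regular_aide")  # "regular" is the control
)

# Hypothetical test scores: +5 points for small classes, no aide effect
score <- 50 + 5 * (group == "small") + rnorm(n, sd = 20)

# With random assignment, each coefficient is a treatment-minus-control
# difference in mean scores
fit <- lm(score ~ group)
effects <- coef(fit)[c("groupsmall", "groupregular_aide")]
print(round(effects, 2))

# The same numbers come from comparing group means directly
means <- tapply(score, group, mean)
mean_diffs <- means[c("small", "regular_aide")] - means["regular"]
print(round(mean_diffs, 2))
```

Because assignment is random, the regression on group dummies and the raw difference in means give the same estimates; the regression form is convenient when further controls are added.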
1.1.d. What are the assumptions / threats to this identification strategy?
There are several threats to the identification strategy. In particular, it would not reveal a causal effect if either of the following two cases holds.
First, students may withdraw from the study at different rates across class types, so that the groups that remain are no longer comparable. The author uses bounding assumptions in the estimation to gauge this problem, e.g. “Suppose that the 2-4 percent extra students who withdrew from regular and regular/aide classes all would have scored in the one-hundredth percentile of the SAT exams. With this intentionally extreme assumption, the average score of students in the regular and regular/aide classes would only have increased by one-two percentile points if the extra students had not withdrawn from kindergarten. At the opposite extreme, if the higher withdrawal rate is due to the lowest achieving students leaving regular-size classes, the regular-size class students would have scored one-two points lower, on average, if they had remained”
Second, some of the students who withdrew from any of the three class groups in the study may have transferred to another public school included in the study.
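The bounding argument in the quoted passage can be written out as a short calculation. As a sketch, let s be the mean percentile score in the regular-size classes and p the share of extra withdrawals; both symbols, and the illustrative value s = 50 used below, are assumptions introduced here rather than figures taken from the paper. If the withdrawn students had all scored at the 100th percentile, the counterfactual mean s* would satisfy
\[ s^{*} = (1 - p)\, s + p \cdot 100 \quad \Rightarrow \quad s^{*} - s = p\,(100 - s). \]
With s = 50 and p between 0.02 and 0.04, the change is between 1 and 2 percentile points, which is the order of magnitude in the quotation. Under the opposite extreme, replacing 100 with 0 in the same formula gives a change of -1 to -2 points.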
2.1.a. What is the causal link the paper is trying to reveal?
The paper evaluates the causal impact of schooling levels on wages. The main hypothesis is that higher schooling levels lead to higher wages.
2.1.b. What would be the ideal experiment to test this causal link?
The experiment used by the authors was to survey identical twins at a twins festival held in Ohio in 1991. To reduce measurement error in the data, the authors surveyed the twins in each pair separately and asked each twin about the other twin's education level, wage, and parents' education level. The responses were then compared within each pair. Wage data were collected because the wage is one of the variables of interest in the estimation. The authors estimated a fixed-effects (within-pair) model, under the assumption that the wage equation is identical for the two twins.
2.1.c. What is the identification strategy?
To identify the causal effect of schooling levels on wages, the authors use a sample of twins (genetically identical, and thus with the same innate ability) to estimate a wage equation with little to no endogeneity problem.
As the authors deal with a sample of twins, errors in wages may arise from the unobserved characteristics of families that are selected into the sample (family bias) and the unobserved characteristics of the twins themselves (individual bias). Therefore, the econometric specification disentangles these biases and isolates them from the desired structural effect of years of schooling on wages. The authors identify the structural effect of individual schooling levels on wages after controlling for the potential structural effects of other individual-level education and job variables and for the family selection effect.
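The logic of the within-pair comparison can be made explicit with a simplified wage equation. The notation below is an assumption made for illustration (it omits the additional controls of the paper's full specification): j = 1, 2 indexes the twins in family i, \mu_i is the family effect shared by both twins (including innate ability), and the intercept \alpha_j is allowed to differ between the first and the second twin.
\[ \log(wage_{ji}) = \alpha_j + \beta \, educ_{ji} + \mu_i + \epsilon_{ji}, \qquad j = 1, 2 \]
Subtracting the equation for twin 2 from the equation for twin 1 gives
\[ \log(wage_{1i}) - \log(wage_{2i}) = (\alpha_1 - \alpha_2) + \beta \, (educ_{1i} - educ_{2i}) + (\epsilon_{1i} - \epsilon_{2i}). \]
The family effect \mu_i cancels, so \beta can be estimated from within-pair differences even if \mu_i is correlated with schooling.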
2.1.d. What are the assumptions / threats to this identification strategy?
There are several threats to the identification strategy. Most importantly, it would not reveal a causal effect if there are sources of endogeneity in the wage regression. The two potential sources applicable here are:
omitted ability variables;
measurement errors (mostly in the determinants of wages).
The authors conduct robustness analyses to address these two issues in the econometric estimations.
Excluding non-white twins from the estimation is a reasonable strategy, because the non-white twins who attended the festival are likely to be medium- or high-income earners, given that low-income non-white twins may not have attended. Including such observations could bias the comparison between white and non-white twins in the data.
2.2.a. Load Ashenfelter and Krueger AER 1994 data.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 3.0.0 ✓ dplyr 1.0.2
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(haven)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
# Load the Ashenfelter and Krueger (1994) dataset
paper2data <- read_dta("/Users/twinkleroy/Downloads/AshenfelterKrueger1994_twins.dta")
paper2data
## # A tibble: 149 x 10
## famid age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 33.3 16 16 2.16 2.42 0 0 1 1
## 2 2 43.6 12 19 2.17 2.89 0 0 1 1
## 3 3 31.0 12 12 2.79 2.80 1 1 1 1
## 4 4 34.6 14 14 2.82 2.26 1 1 1 1
## 5 5 35.0 15 13 2.03 3.56 0 0 1 1
## 6 6 29.3 14 12 2.71 2.48 1 1 1 1
## 7 7 47.6 12 12 2.30 1.83 0 0 1 1
## 8 8 51.9 13 12 2.80 2.85 1 1 1 1
## 9 9 36.1 12 12 2.48 2.77 1 1 1 1
## 10 10 48.1 19 17 3.22 2.75 0 0 1 1
## # … with 139 more rows
2.2.b. Reproduce the result from table 3 column 5.
The equation estimated in column 5 of Table 3 is a first-difference regression of the logarithm of wage on years of schooling (educ), as shown in Equation (1):
\[ \Delta \log(wage) = \Delta \alpha + \beta \, \Delta educ + \Delta \epsilon \]
# Generating first-differencing variables
paper2data$dlwage <- paper2data$lwage1 - paper2data$lwage2
paper2data$deduc <- paper2data$educ1 - paper2data$educ2
# First-difference regression
fdmodel <- lm(dlwage ~ deduc, paper2data)
summary(fdmodel)
##
## Call:
## lm(formula = dlwage ~ deduc, data = paper2data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03115 -0.20909 0.00722 0.34395 1.15740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07859 0.04547 -1.728 0.086023 .
## deduc 0.09157 0.02371 3.862 0.000168 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5542 on 147 degrees of freedom
## Multiple R-squared: 0.09211, Adjusted R-squared: 0.08593
## F-statistic: 14.91 on 1 and 147 DF, p-value: 0.0001682
stargazer(fdmodel, type="text", no.space=TRUE, keep.stat = c("n","rsq"), column.labels=c("First difference"))
##
## ========================================
## Dependent variable:
## ---------------------------
## dlwage
## First difference
## ----------------------------------------
## deduc 0.092***
## (0.024)
## Constant -0.079*
## (0.045)
## ----------------------------------------
## Observations 149
## R2 0.092
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
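To see why differencing within twin pairs matters, the following sketch compares pooled OLS with a first-difference regression on simulated twins. All parameter values (a true return of 0.09, the ability effect of 0.3, the noise levels) are made-up assumptions chosen for illustration; none of them comes from the Ashenfelter and Krueger data.

```r
# Simulated illustration only: all parameter values below are made up.
set.seed(42)
n_pairs <- 5000
beta <- 0.09                      # hypothetical true return to schooling

ability <- rnorm(n_pairs)         # family effect shared by both twins
educ1 <- 12 + ability + rnorm(n_pairs, sd = 1.5)
educ2 <- 12 + ability + rnorm(n_pairs, sd = 1.5)
lwage1 <- 1 + beta * educ1 + 0.3 * ability + rnorm(n_pairs, sd = 0.5)
lwage2 <- 1 + beta * educ2 + 0.3 * ability + rnorm(n_pairs, sd = 0.5)

# Pooled OLS ignores the family effect and picks up ability bias
pooled <- lm(c(lwage1, lwage2) ~ c(educ1, educ2))
b_pooled <- unname(coef(pooled)[2])

# Within-pair differences remove the shared family effect
dlwage <- lwage1 - lwage2
deduc <- educ1 - educ2
fd <- lm(dlwage ~ deduc)
b_fd <- unname(coef(fd)["deduc"])

print(round(c(pooled = b_pooled, first_difference = b_fd), 3))
```

In this simulation the pooled estimate is biased upward because ability raises both schooling and wages, while the first-difference estimate stays close to the assumed true value. The simulation illustrates the mechanism only; the direction and size of the gap in real data depend on the sample.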
2.2.c. Explain how this coefficient should be interpreted.
The regression above is a regression of the intrapair difference in log wage rates on the intrapair difference in schooling levels, so the coefficient is
\[ \beta = \frac{\partial \, \Delta \log(wage)}{\partial \, \Delta educ} \]
Because the dependent variable is in logarithms,
\[ \beta = \frac{\partial \log(wage)}{\partial \, educ} \quad \Rightarrow \quad 100 \beta = \frac{100 \, d(wage)/wage}{d(educ)}. \]
The estimate is
\[ \hat{\beta}=0.092 \]
which means that, within a twin pair, a one-year increase in the intrapair difference in schooling is associated with a wage that is about 9.2% higher, or about 9.6% using the exact conversion 100[exp(0.092)-1].
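The conversion from a log-wage coefficient to a percent change can be checked directly. The helper name `log_points_to_percent` is a hypothetical name introduced for this sketch; the coefficient is the rounded first-difference estimate from the table above.

```r
# Convert a coefficient from a log-wage regression into a percent change
# (log_points_to_percent is a hypothetical helper name used for this sketch)
log_points_to_percent <- function(b) 100 * (exp(b) - 1)

b_fd <- 0.092   # rounded first-difference coefficient on deduc

approx_pct <- 100 * b_fd                   # log-point approximation
exact_pct <- log_points_to_percent(b_fd)   # exact conversion

print(round(c(approximate = approx_pct, exact = exact_pct), 2))
```

The two numbers are close for small coefficients and diverge as the coefficient grows, which is why the exact conversion matters more for the larger dummy-variable coefficients in the pooled regression.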
2.2.d. Reproduce the result from table 3 column 1.
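# Dropping the first-difference variables (columns 11 and 12) created above
paper2data <- paper2data[-c(11, 12)]
# Reshaping the dataset from wide (one row per family) to long (one row per twin)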
paper2dataLong = reshape(data=paper2data, idvar=c("famid","age"), varying = 3:10, sep = "", timevar = "twin", times = c(1, 2), new.row.names= 1:10000, direction = "long")
## Warning: Setting row names on a tibble is deprecated.
## Warning: Setting row names on a tibble is deprecated.
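# Reordering columns so that the family and twin identifiers come first
paper2dataLong <- paper2dataLong[, c(1, 3, 2, 4, 6, 7, 5)]
# Optional: sort rows by family and twin identifiers (not needed for the regression)
#paper2dataLong <- paper2dataLong[order(paper2dataLong$famid, paper2dataLong$twin), ]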
# Generating a new variable for quadratic relationship between age and wage
paper2dataLong$agesqdiv100 <- paper2dataLong$age*paper2dataLong$age/100
paper2dataLong <- paper2dataLong[, c(1, 2, 3, 8, 4, 5, 6, 7)]
# Pooled OLS regression
polsmodel <- lm(lwage ~ educ + age + agesqdiv100 + male + white, paper2dataLong)
summary(polsmodel)
##
## Call:
## lm(formula = lwage ~ educ + age + agesqdiv100 + male + white,
## data = paper2dataLong)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62602 -0.28748 0.00277 0.28474 2.42317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.47061 0.42602 -1.105 0.270210
## educ 0.08387 0.01443 5.814 1.60e-08 ***
## age 0.08782 0.01883 4.663 4.75e-06 ***
## agesqdiv100 -0.08686 0.02335 -3.720 0.000239 ***
## male 0.20403 0.06302 3.237 0.001345 **
## white -0.41047 0.12668 -3.240 0.001333 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5324 on 292 degrees of freedom
## Multiple R-squared: 0.2724, Adjusted R-squared: 0.2599
## F-statistic: 21.86 on 5 and 292 DF, p-value: < 2.2e-16
# Formatting the table
stargazer(polsmodel, type="text", no.space=TRUE, keep.stat = c("n","adj.rsq"), column.labels=c("OLS"))
##
## ========================================
## Dependent variable:
## ---------------------------
## lwage
## OLS
## ----------------------------------------
## educ 0.084***
## (0.014)
## age 0.088***
## (0.019)
## agesqdiv100 -0.087***
## (0.023)
## male 0.204***
## (0.063)
## white -0.410***
## (0.127)
## Constant -0.471
## (0.426)
## ----------------------------------------
## Observations 298
## Adjusted R2 0.260
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The equation estimated in column 1 of Table 3 is a pooled OLS regression of the logarithm of earnings (wage) on years of schooling (educ) and other control variables, as shown below:
\[ \log(wage) = \alpha + \beta_1 educ + \beta_2 age + \beta_3 (age^2/100) + \beta_4 male + \beta_5 white + \epsilon \]
2.2.e. Explain how this coefficient should be interpreted.
The coefficient on education is 0.084, which corresponds to a change of about 8.8% (100[exp(0.084)-1]). One additional year of schooling is therefore associated with wages that are about 8.8% higher, holding the other variables constant.
2.2.f. Explain how the coefficient on the control variables should be interpreted.
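The relationship between age and the logarithm of wage is not linear but quadratic, so the coefficients on age and age squared must be interpreted together. The marginal effect of age on the logarithm of wage can be derived as follows:
\[ \frac{\partial \log(wage)}{\partial \, age} = \beta_2 + 2 \beta_3 \frac{age}{100} \approx 0.088 - 0.00174 \, age \]
The coefficient on age alone (0.088, or about 9.2% using 100[exp(0.088)-1]) therefore gives the effect of one additional year only at age zero; at the ages observed in the sample the effect is smaller.
For age squared, the negative coefficient (-0.087) means that the wage gain from an additional year of age shrinks as the twins get older and turns negative after about age 50.6 (100 × 0.088 / (2 × 0.087)), holding the other variables constant. This concave profile is consistent with life-cycle theory (near retirement age, income/wages decrease compared with the earlier years in the labor force).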
For male, the coefficient implies that wages for male twin pairs are about 22.6% (100[exp(0.204)-1]) higher than for female twin pairs, holding the other variables constant.
\[ \beta_{4} = \mathbb{E}[\log(wage) \mid male, educ, age, white] - \mathbb{E}[\log(wage) \mid female, educ, age, white]. \]
For white, the coefficient implies that the wages of white twin pairs are about 33.6% (100[exp(-0.410)-1]) lower than those of non-white twin pairs in the study, holding the other variables constant.
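The interpretation of the control variables can be checked numerically. The sketch below uses the rounded pooled OLS coefficients from the table above; the ages at which the marginal effect is evaluated are illustrative choices, and the function names are hypothetical.

```r
# Rounded pooled OLS coefficients copied from the regression table above
b_age <- 0.088
b_agesq <- -0.087   # coefficient on age^2 / 100
b_male <- 0.204
b_white <- -0.410

# Marginal effect of one more year of age on log wage:
# b_age + 2 * b_agesq * age / 100
age_effect <- function(age) b_age + 2 * b_agesq * age / 100
ages <- c(25, 35, 45, 55)   # illustrative ages
print(round(100 * age_effect(ages), 2))

# Age at which the estimated wage profile peaks
peak_age <- -100 * b_age / (2 * b_agesq)
print(round(peak_age, 1))

# Exact percent differences implied by the dummy coefficients
pct <- function(b) 100 * (exp(b) - 1)
dummy_pct <- c(male = pct(b_male), white = pct(b_white))
print(round(dummy_pct, 2))
```

Evaluating the marginal effect at several ages makes the concavity visible: the effect is positive for younger twins, declines with age, and changes sign past the peak.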