# Load the required packages
library(haven)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

Part 1: Paper using randomized data: Impact of Class Size on Learning

Reference: Krueger (1999), Experimental Estimates of Education Production Functions, QJE 114(2): 497-532

1.1. Briefly answer these questions:

a. What is the causal link the paper is trying to reveal?

This paper seeks to identify the causal effect of class size on student performance, as measured by standardized test scores.

b. What would be the ideal experiment to test this causal link?

The ideal experiment is a randomized controlled trial in which a large sample of students is randomly assigned to classes of different sizes. The performance of the groups can then be compared using standardized test scores at the end of the trial period.

c. What is the identification strategy?

The identification strategy relies on the random assignment of students to classes of different sizes. Because assignment is random, class size is uncorrelated with omitted student characteristics that would otherwise bias the estimate.
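Why random assignment matters can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the STAR data: the variable names, the true effect of 5 points, and the selection rule (higher-ability students are more likely to end up in small classes) are all assumptions chosen for illustration.

```r
set.seed(1)
n <- 5000
true_effect <- 5                       # assumed effect of a small class on scores

# Unobserved student characteristic that also raises test scores
ability <- rnorm(n)

# Non-random sorting: higher-ability students are more likely to be in small classes
small_obs <- rbinom(n, 1, plogis(ability))
# Random assignment: class type is independent of ability
small_rct <- rbinom(n, 1, 0.5)

score_obs <- 50 + true_effect * small_obs + 10 * ability + rnorm(n, sd = 5)
score_rct <- 50 + true_effect * small_rct + 10 * ability + rnorm(n, sd = 5)

# Ability is omitted from both regressions
b_obs <- coef(lm(score_obs ~ small_obs))[["small_obs"]]
b_rct <- coef(lm(score_rct ~ small_rct))[["small_rct"]]

c(observational = b_obs, randomized = b_rct)
```

Under these assumptions the observational estimate overstates the effect because it picks up the ability gap between the two groups, while the randomized estimate should land close to the assumed value of 5.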

d. What are the assumptions / threats to this identification strategy?

The author notes several limitations of the random assignment. Some students switched between small and regular classes between grades, primarily because of behavioral problems or parental complaints. Actual class sizes also varied because some students left nonrandomly to join another school. These departures from randomization could threaten the identification strategy.

Part 2: Paper using Twins for Identification: Economic Returns to Schooling

Reference: Ashenfelter and Krueger (1994), Estimates of the Economic Return to Schooling from a New Sample of Twins, AER 84(5): 1157-1173

2.1. Briefly answer these questions:

a. What is the causal link the paper is trying to reveal?

Ashenfelter and Krueger (1994) estimate the causal effect of schooling on wages, using identical twins who have different schooling levels.

b. What would be the ideal experiment to test this causal link?

The ideal experiment would be a randomized controlled trial in which individuals are randomly assigned to different numbers of years of schooling. Comparing wages across the groups, for example by regressing wages on years of schooling, would then reveal the causal effect.

c. What is the identification strategy?

The identification strategy of the paper is to compare identical twins: the within-pair difference in wages is related to the within-pair difference in years of schooling completed.

d. What are the assumptions / threats to this identification strategy?

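The authors assume that monozygotic (from the same egg) twins are genetically identical and share similar family backgrounds, so that a within-pair difference in wages can be attributed to a difference in schooling, which in turn arises from differences in individual preferences. The strategy is threatened if twins differ in unobserved ways, such as ability, that affect both schooling and wages. The authors also stress that measurement error in reported schooling biases the differenced estimate toward zero.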
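The logic of differencing within twin pairs can be sketched with simulated data. Everything here is hypothetical: the shared `family` component, the assumed return of 0.09, and the noise levels are illustrative choices, not estimates from the paper.

```r
set.seed(42)
n_pairs <- 2000
beta <- 0.09                           # assumed true return to a year of schooling

# Unobserved component shared by both twins (family background, genes)
family <- rnorm(n_pairs)

# Schooling depends on the shared component plus an individual part
educ1 <- 13 + family + rnorm(n_pairs)
educ2 <- 13 + family + rnorm(n_pairs)

# Log wages depend on schooling and on the same shared component
lwage1 <- 1 + beta * educ1 + 0.3 * family + rnorm(n_pairs, sd = 0.5)
lwage2 <- 1 + beta * educ2 + 0.3 * family + rnorm(n_pairs, sd = 0.5)

# Pooled OLS across all twins: the shared component is an omitted variable
pooled <- data.frame(lwage = c(lwage1, lwage2), educ = c(educ1, educ2))
b_pooled <- coef(lm(lwage ~ educ, data = pooled))[["educ"]]

# First difference within pairs: the shared component cancels out
d_wage <- lwage1 - lwage2
d_educ <- educ1 - educ2
b_fd <- coef(lm(d_wage ~ d_educ))[["d_educ"]]

c(pooled = b_pooled, first_difference = b_fd)
```

In this setup the pooled estimate is pushed upward by the omitted family component, while the within-pair estimate should be close to the assumed 0.09. The sketch does not model measurement error in schooling.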

2.2. Replication analysis

a. Load Ashenfelter and Krueger AER 1994 data.

#loading the dataset
paper2data <- read_dta("AshenfelterKrueger1994_twins.dta")

dim(paper2data)
## [1] 149  10
head(paper2data)
## # A tibble: 6 x 10
##   famid   age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
##   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1     1  33.3    16    16   2.16   2.42     0     0      1      1
## 2     2  43.6    12    19   2.17   2.89     0     0      1      1
## 3     3  31.0    12    12   2.79   2.80     1     1      1      1
## 4     4  34.6    14    14   2.82   2.26     1     1      1      1
## 5     5  35.0    15    13   2.03   3.56     0     0      1      1
## 6     6  29.3    14    12   2.71   2.48     1     1      1      1

b. Reproduce the result from table 3 column 5.

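# Within-pair (twin 1 minus twin 2) differences in schooling and log wages
paper2data$educdiff <- paper2data$educ1 - paper2data$educ2

paper2data$wagediff <- paper2data$lwage1 - paper2data$lwage2

# First-difference regression of the log-wage gap on the schooling gap
model1 <- lm(wagediff ~ educdiff, data = paper2data)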

stargazer(model1, type="text", title="Table 3 Column 5", align=TRUE, dep.var.labels = "First difference (v)",keep.stat = c("n", "rsq"), omit=c("Constant") )
## 
## Table 3 Column 5
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                 First difference (v)    
## ----------------------------------------
## educdiff              0.092***          
##                        (0.024)          
##                                         
## ----------------------------------------
## Observations             149            
## R2                      0.092           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

c. Explain how this coefficient should be interpreted.

Within twin pairs, one additional year of schooling is associated, on average, with a wage rate that is about 9.2% higher (0.092 log points). The estimate is statistically significant at the 1% level.
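Because the dependent variable is a difference in log wages, the coefficient is a log-point effect, and reading it directly as a percentage is an approximation. A short sketch of the exact conversion, using the rounded coefficient from the table above (the variable names are illustrative):

```r
# Rounded coefficient on educdiff from the table above
b_educ <- 0.092

# Exact percentage change implied by a log-point effect
pct_exact <- 100 * (exp(b_educ) - 1)
round(pct_exact, 2)
```

The exact figure is about 9.6%, so the usual "coefficient times 100" reading is close for an effect of this size.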

d. Reproduce the result in table 3 column 1.

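# Reshape from one row per twin pair (wide) to one row per twin (long).
# With a flat `varying` vector, the names must be ordered twin 1 first, then twin 2,
# in the same variable order as `v.names`.
tab3col1 <- reshape(paper2data, varying=c("educ1", "lwage1", "male1", "white1", "educ2", "lwage2", "male2", "white2"), 
                v.names=c("educ", "lwage", "male", "white"),
                timevar = "twin",
                times=c("T1", "T2"),
                idvar = c("famid", "age"),
                direction = "long")

# Sorted copy for inspection only; the regression does not depend on row order
tab3col1.sort <- tab3col1[order(tab3col1$famid), ]

# Age squared, scaled by 100 as in the paper's table
tab3col1$agesq <- tab3col1$age^2/100

# Pooled OLS of log wage on own education and controls
model2 <- lm(lwage~educ+age+agesq+male+white, data=tab3col1)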

stargazer(model2, type="text", title="Table 3 Column 1", align=TRUE, dep.var.labels = "OLS (i)",keep.stat = c("n", "rsq"), covariate.labels=c("Own education" ,"Age", "Age squared (÷ 100)", "Male", "White"), omit=c("Constant") )
## 
## Table 3 Column 1
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               OLS (i)          
## -----------------------------------------------
## Own education                0.084***          
##                               (0.014)          
##                                                
## Age                          0.088***          
##                               (0.019)          
##                                                
## Age squared (÷ 100)          -0.087***         
##                               (0.023)          
##                                                
## Male                         0.204***          
##                               (0.063)          
##                                                
## White                        -0.410***         
##                               (0.127)          
##                                                
## -----------------------------------------------
## Observations                    298            
## R2                             0.272           
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
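The wide-to-long reshape used for this column can be checked on a toy data frame. The sketch below uses made-up values and passes `varying` as an explicit list, one vector per long-format variable, which avoids relying on the ordering of a flat vector. The object names `wide` and `long` are illustrative.

```r
# Two hypothetical twin pairs in wide format (one row per pair)
wide <- data.frame(famid  = c(1, 2),
                   age    = c(33, 44),
                   educ1  = c(16, 12), lwage1 = c(2.2, 2.1),
                   educ2  = c(16, 19), lwage2 = c(2.4, 2.9))

# One row per twin: each list element collects the columns for one variable
long <- reshape(wide,
                varying   = list(c("educ1", "educ2"), c("lwage1", "lwage2")),
                v.names   = c("educ", "lwage"),
                timevar   = "twin",
                times     = c("T1", "T2"),
                idvar     = "famid",
                direction = "long")

long
```

Each pair contributes two rows, which is why the regression above has 298 observations from 149 pairs.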

e. Explain how the coefficient on education should be interpreted.

On average, one additional year of education is associated with an 8.4% increase in the wage rate, holding age, gender, and race constant. The estimate is statistically significant at the 1% level.

f. Explain how the coefficient on the control variables should be interpreted.

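Age has a non-linear relationship with the wage rate: the positive coefficient on age and the negative coefficient on age squared imply that wages rise with age at first and then decline. The coefficient on male indicates that, on average, male twins earn about 0.204 log points more than female twins, which is roughly 20% (closer to 23% when converted exactly). The coefficient on white indicates that white twins earn 0.410 log points less than non-white twins on average; because the log approximation is poor for a coefficient this large, the implied gap is about 34% rather than 41%. All the coefficients on the control variables are statistically significant at the 1% level.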
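The turning point of the age profile and the exact percentage gaps for the dummy variables follow directly from the reported coefficients. A minimal sketch using the rounded values from the table above (variable names are illustrative; the division by 100 undoes the scaling of age squared):

```r
# Rounded coefficients from the Table 3 Column 1 output above
b_age   <- 0.088
b_agesq <- -0.087     # coefficient on age^2 / 100
b_male  <- 0.204
b_white <- -0.410

# lwage = ... + b_age * age + (b_agesq / 100) * age^2, so the profile peaks
# where the derivative b_age + 2 * (b_agesq / 100) * age equals zero
peak_age <- -b_age / (2 * b_agesq / 100)

# Exact percentage differences implied by the log-point coefficients
male_pct  <- 100 * (exp(b_male) - 1)
white_pct <- 100 * (exp(b_white) - 1)

round(c(peak_age = peak_age, male_pct = male_pct, white_pct = white_pct), 1)
```

With these rounded inputs the wage profile peaks at roughly age 50.6, the male premium is about 22.6%, and the gap for white twins is about -33.6%.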