The paper examines the causal effect of class size on student test scores.
The ideal experiment would be to take a student, record their test score after being taught in a small class, then turn back time and record the test score of the same student, on the same material, in a larger class. This is clearly not possible, so the ideal realistic experiment would randomly assign students from all demographic groups to classes of varying sizes, with several classes of each size so that class size is not confounded with the teacher, and would follow several years' worth of students and teachers. The randomization must be thorough for the data to be useful.
The identification strategy in this paper is the random assignment of students, over several years, to three class types (small, regular, and regular with an aide), with students continuing in their assigned class type in subsequent grades. The paper also takes great pains to alleviate concerns about nonrandom attrition and switching between class types, for example by noting that the sample covers rural, urban, and suburban populations and that demographics are comparable across the three groups.
The paper assumes that students who attend a school large enough to be in the sample do not differ from those at schools too small to be included. The authors also treat the fact that half of the students present in kindergarten are missing in at least one subsequent year as a concern, but they do not directly address it. A further concern is that teachers may adjust to their assigned class type as the study continues. Because this would affect teachers across the board, the relationship between students in large and small classes would remain the same, but it could still affect the time-series properties of the data.
The paper seeks to identify the causal effect of schooling on wages later in life.
The ideal unrealistic experiment would be to take one person, give them one level of schooling, and observe their adult wages, then turn back time, give that same person a different level of schooling, and see how their wages change. A realistic experiment would be a massive survey of people, starting at school age, that measures every other determinant of wages (home life, genetic ability, and so on) alongside schooling, and then compares outcomes across a large enough group over a long enough span of time.
The identification strategy in this paper is to exploit the similarity of ‘identical’ twins on many observable characteristics, comparing twins within a pair who have different schooling levels. Using twins should eliminate concerns about differences in household environment and about raw genetic differences between people. The authors also collect several measures of schooling to alleviate concerns about misreporting.
One concern is that the sample consists of twins who attended a ‘twin festival’. These twins may be inherently different from other twins, perhaps because they enjoy being a twin or have a good relationship with their families. Another concern is that twins may differ from non-twins. If having a twin somehow affects the wages a person earns later in life, regardless of schooling, then the data set would only be useful for studying this relationship among twins.
setwd("C:/Users/Julie/Documents/Grad_School/PhD/2.2.p/MA_HW/HW_9")
#loading data in
library(foreign)
AK_94 <- read.dta("AshenfelterKrueger1994_twins.dta")
head(AK_94)
## famid age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
## 1 1 33.25120 16 16 2.161021 2.420368 0 0 1 1
## 2 2 43.57016 12 19 2.169054 2.890372 0 0 1 1
## 3 3 30.96783 12 12 2.791778 2.803360 1 1 1 1
## 4 4 34.63381 14 14 2.824351 2.263366 1 1 1 1
## 5 5 34.97878 15 13 2.032088 3.555348 0 0 1 1
## 6 6 29.33881 14 12 2.708050 2.484907 1 1 1 1
#making Table 3 Col. 5
#need difference in lwage for pair and difference in edu for pair
AK_94_a<- AK_94
library(stargazer)
library(dplyr)
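# within-pair differences (twin 1 minus twin 2) in log wage and years of schooling
Dif <- mutate(AK_94_a, Wage_dif = lwage1 - lwage2, educ_dif = educ1 - educ2)
# first-difference (within-twin) regression of the wage gap on the schooling gap
DiD <- lm(Wage_dif ~ educ_dif, data = Dif)
print(DiD)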
##
## Call:
## lm(formula = Wage_dif ~ educ_dif, data = Dif)
##
## Coefficients:
## (Intercept) educ_dif
## -0.07859 0.09157
stargazer(DiD)
##
## % Table created by stargazer v.5.2.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
## % Date and time: Tue, Apr 14, 2020 - 10:03:08 PM
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
## \cline{2-2}
## \\[-1.8ex] & Wage\_dif \\
## \hline \\[-1.8ex]
## educ\_dif & 0.092$^{***}$ \\
## & (0.024) \\
## & \\
## Constant & $-$0.079$^{*}$ \\
## & (0.045) \\
## & \\
## \hline \\[-1.8ex]
## Observations & 149 \\
## R$^{2}$ & 0.092 \\
## Adjusted R$^{2}$ & 0.086 \\
## Residual Std. Error & 0.554 (df = 147) \\
## F Statistic & 14.914$^{***}$ (df = 1; 147) \\
## \hline
## \hline \\[-1.8ex]
## \textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
## \end{tabular}
## \end{table}
This coefficient implies that each additional year of schooling completed is associated with roughly a 9.2 percent increase in wages (0.092 log points). Because the regression uses within-pair differences, this result controls for family background, unlike column 1. The result generalizes only under the assumption that the return to schooling estimated for twins also applies to out-of-sample (non-twin) wages.
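To see why differencing within a twin pair controls for family background, a minimal sketch of the underlying algebra may help. The notation is assumed here for illustration and is not taken from the paper: $y_{ji}$ is the log wage of twin $j$ in family $i$, $S_{ji}$ is years of schooling, $\mu_i$ is an unobserved family component shared by both twins, and $\varepsilon_{ji}$ is an idiosyncratic error.

```latex
\begin{aligned}
y_{1i} &= \alpha + \beta S_{1i} + \mu_i + \varepsilon_{1i} \\
y_{2i} &= \alpha + \beta S_{2i} + \mu_i + \varepsilon_{2i} \\
y_{1i} - y_{2i} &= \beta\,(S_{1i} - S_{2i}) + (\varepsilon_{1i} - \varepsilon_{2i})
\end{aligned}
```

The family term $\mu_i$ cancels in the last line, so the slope on `educ_dif` estimates $\beta$ using only within-pair variation. Under this sketch the differenced equation has no intercept; the regression above includes one, which allows for a mean wage gap between the twin labeled 1 and the twin labeled 2.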
library(stargazer)
AK_94_b <- AK_94
# stack twin 1 and twin 2 so that each row is one person
educ <- c(AK_94_b$educ1, AK_94_b$educ2)
wage <- c(AK_94_b$lwage1, AK_94_b$lwage2)  # log wage
age <- c(AK_94_b$age, AK_94_b$age)
famid <- c(AK_94_b$famid, AK_94_b$famid)
age_sq <- age^2 / 100  # age squared, scaled by 1/100
male <- c(AK_94_b$male1, AK_94_b$male2)
white <- c(AK_94_b$white1, AK_94_b$white2)
ols_2 <- data.frame(famid, age, age_sq, educ, wage, male, white)
# pooled OLS of log wage on schooling and controls
reg_ols <- lm(wage ~ educ + age + age_sq + male + white, ols_2)
print(reg_ols)
##
## Call:
## lm(formula = wage ~ educ + age + age_sq + male + white, data = ols_2)
##
## Coefficients:
## (Intercept) educ age age_sq male white
## -0.47061 0.08387 0.08782 -0.08686 0.20403 -0.41047
stargazer(reg_ols)
##
## % Table created by stargazer v.5.2.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
## % Date and time: Tue, Apr 14, 2020 - 10:03:08 PM
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
## \cline{2-2}
## \\[-1.8ex] & wage \\
## \hline \\[-1.8ex]
## educ & 0.084$^{***}$ \\
## & (0.014) \\
## & \\
## age & 0.088$^{***}$ \\
## & (0.019) \\
## & \\
## age\_sq & $-$0.087$^{***}$ \\
## & (0.023) \\
## & \\
## male & 0.204$^{***}$ \\
## & (0.063) \\
## & \\
## white & $-$0.410$^{***}$ \\
## & (0.127) \\
## & \\
## Constant & $-$0.471 \\
## & (0.426) \\
## & \\
## \hline \\[-1.8ex]
## Observations & 298 \\
## R$^{2}$ & 0.272 \\
## Adjusted R$^{2}$ & 0.260 \\
## Residual Std. Error & 0.532 (df = 292) \\
## F Statistic & 21.860$^{***}$ (df = 5; 292) \\
## \hline
## \hline \\[-1.8ex]
## \textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
## \end{tabular}
## \end{table}
This result should be interpreted as an 8.4 percent increase in wages per additional year of schooling completed. This pooled OLS regression does not control for family background.
The other coefficients are read similarly, in log points. The age coefficient of 0.088 is not the full effect of a one-year increase in age: the negative coefficient on age squared makes the relationship between log wages and age concave, so the marginal effect of age shrinks as age rises. Being male raises wages by 0.204 log points (about 23 percent), but interestingly being white lowers them by 0.410 log points (about 34 percent), which goes against what is usually assumed.
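Because the dependent variable is a log wage, reading a coefficient directly as a percent is an approximation that becomes loose for large coefficients. The following is a minimal sketch of the exact conversion, using the coefficients printed above as hard-coded values; the names `b`, `pct`, and `age_peak` are introduced here for illustration only.

```r
# coefficients copied from the printed reg_ols output above
b <- c(educ = 0.08387, age = 0.08782, age_sq = -0.08686,
       male = 0.20403, white = -0.41047)

# exact percent change in wages implied by a log-point coefficient
pct <- 100 * (exp(b[c("educ", "male", "white")]) - 1)
print(round(pct, 1))

# age_sq is age^2 / 100, so d(log wage)/d(age) = b_age + 2 * b_age_sq * age / 100;
# setting this derivative to zero gives the age at which the fitted profile peaks
age_peak <- unname(-b["age"] * 100 / (2 * b["age_sq"]))
print(round(age_peak, 1))
```

For a small coefficient such as `educ` the exact and approximate readings nearly coincide; the gap matters mainly for `male` and `white`.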
## Paper 3: Card and Krueger (1994)
### 3.1. Briefly answer these questions:
The paper examines the effect of the minimum wage (specifically, an increase in the minimum wage) on establishment-level employment.
The ideal unrealistic experiment would be to subject a set of establishments to an increase in the minimum wage and measure their employment, then go back in time and measure employment at the same establishments without the increase. The ideal realistic experiment would take similar businesses in two locations that are also similar in demand, population, relevant geographic characteristics, and every other determinant of employment, raise the minimum wage in one location but not the other, and measure the difference.
The identification strategy in this paper uses the April 1992 increase in the New Jersey minimum wage, together with the absence of any change in Pennsylvania’s minimum wage, to compare the two states. The authors compare fast food restaurants because the skills required at one location are the same as at another location of the same chain. They also account for restaurants that closed or simply did not respond to the follow-up survey.
The paper assumes that there are no systematic differences between the two states that would affect employment, although it does account for regional differences within the states. It also acknowledges that there could be non-wage offsets that are not visible in the wage data. The paper does not account for differences in other neighboring states, which could pose a threat to identification.
a. Load data
CK_94 <- read.csv("CardKrueger1994_fastfood.csv")
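CK_94_a <- CK_94
# mean FTE employment in each wave (emptot, emptot2), by state
emp <- aggregate(CK_94_a[, c("emptot", "emptot2")], list(CK_94_a$state), mean, na.rm = TRUE)
# share of stores belonging to each chain, by state
stores <- aggregate(CK_94_a[, c("bk", "kfc", "roys", "wendys")], list(CK_94_a$state), mean)
check <- merge(stores, emp)
print(check)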
## Group.1 bk kfc roys wendys emptot emptot2
## 1 0 0.4430380 0.1518987 0.2151899 0.1898734 23.33117 21.16558
## 2 1 0.4108761 0.2054381 0.2477341 0.1359517 20.43941 21.02743
c. Difference-in-differences by hand
CK_94_b <-data.frame(CK_94$id, CK_94$state, CK_94$emptot, CK_94$emptot2, CK_94$demp)
pt1<-aggregate(CK_94_b[,3:5], list(CK_94_b$CK_94.state), mean, na.rm=TRUE)
rownames(pt1) <- pt1$Group.1
pt1$Group.1 <- NULL
pt1_transpose <- as.data.frame(t(as.matrix(pt1)))
names(pt1_transpose)[1] <- "PA"
names(pt1_transpose)[2] <- "NJ"
# rows are emptot (before), emptot2 (after), and demp (change); mutate() drops the row names
pt2 <- mutate(pt1_transpose, dif = NJ - PA)
pt2
## PA NJ dif
## 1 23.331169 20.4394081 -2.8917607
## 2 21.165584 21.0274295 -0.1381549
## 3 -2.283333 0.4666667 2.7500000
According to these results, NJ has lower average FTE employment than PA both before and after the change. However, average employment per store rose by about 0.47 in NJ while it fell by about 2.28 in PA, so the difference-in-differences estimate is 2.75.
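The difference-in-differences estimate can be recovered from the table in two ways. They differ slightly because `demp` is missing for any store that lacks one of the two employment figures, so the `demp` row averages only stores observed in both waves. Below is a minimal sketch using the printed means as hard-coded values; the object names are hypothetical.

```r
# means copied from the printed pt2 table above
before <- c(PA = 23.331169, NJ = 20.4394081)   # mean emptot
after  <- c(PA = 21.165584, NJ = 21.0274295)   # mean emptot2
demp   <- c(PA = -2.283333, NJ = 0.4666667)    # mean store-level change

# (NJ after - NJ before) - (PA after - PA before), from the cell means
did_from_means <- unname((after["NJ"] - before["NJ"]) - (after["PA"] - before["PA"]))

# same contrast using the mean store-level change in each state
did_from_demp <- unname(demp["NJ"] - demp["PA"])

print(c(from_means = did_from_means, from_demp = did_from_demp))
```

Both versions give an estimate close to 2.75, which is the `dif` entry in the third row of the table.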
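CK_94_c <- CK_94
# both differences are first wave minus second wave
c <- mutate(CK_94_c, wage_dif = wage_st - wage_st2)
c <- mutate(c, emp_dif = emptot - emptot2)
# regress the store-level employment change on the starting-wage change
dif_model <- lm(emp_dif ~ wage_dif, c)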
dif_model
##
## Call:
## lm(formula = emp_dif ~ wage_dif, data = c)
##
## Coefficients:
## (Intercept) wage_dif
## 1.132 2.943
This is close to the answer from part d.
library(tidyr)
CK_94_long <- gather(CK_94_a, condition, measurement, wage_st:wage_st2, factor_key = TRUE)
head(CK_94_long)
## id state emptot emptot2 demp chain bk kfc roys wendys condition
## 1 46 0 40.50 24.0 -16.50 1 1 0 0 0 wage_st
## 2 49 0 13.75 11.5 -2.25 2 0 1 0 0 wage_st
## 3 506 0 8.50 10.5 2.00 2 0 1 0 0 wage_st
## 4 56 0 34.00 20.0 -14.00 4 0 0 0 1 wage_st
## 5 61 0 24.00 35.5 11.50 4 0 0 0 1 wage_st
## 6 62 0 20.50 NA NA 4 0 0 0 1 wage_st
## measurement
## 1 NA
## 2 NA
## 3 NA
## 4 5.0
## 5 5.5
## 6 5.0
It is unclear how the data should be reshaped for this part.
Without a clear long-format layout, I would run the difference-in-differences the same way as before.
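One way the reshaping could be done, offered as a sketch rather than the intended solution: stack the two employment waves so that each store contributes one row per period, then regress employment on state, a post-period indicator, and their interaction. The toy data frame and the names `toy`, `long`, `post`, `did_long`, `did_hat`, `m`, and `did_means` are made up for illustration; only `state`, `emptot`, and `emptot2` come from the data set above.

```r
# toy stand-in for the store-level data (values are invented for illustration)
toy <- data.frame(
  id      = 1:6,
  state   = c(0, 0, 0, 1, 1, 1),          # 0 = PA, 1 = NJ
  emptot  = c(24, 22, 23, 20, 21, 19),    # FTE employment, first wave
  emptot2 = c(21, 20, 22, 21, 22, 20)     # FTE employment, second wave
)

# stack the two waves: one row per store per period
long <- rbind(
  data.frame(id = toy$id, state = toy$state, post = 0, emp = toy$emptot),
  data.frame(id = toy$id, state = toy$state, post = 1, emp = toy$emptot2)
)

# the coefficient on state:post is the difference-in-differences estimate
did_long <- lm(emp ~ state * post, data = long)
did_hat <- unname(coef(did_long)["state:post"])

# the same number computed from the four cell means
m <- tapply(long$emp, list(long$state, long$post), mean)
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

print(c(regression = did_hat, cell_means = did_means))
```

With a balanced panel like this toy example, the interaction coefficient and the cell-mean calculation agree exactly. With missing values in one wave, as in the real data, the two can differ slightly depending on which stores are kept.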
The linear trends assumption for difference-in-difference models is the assumption that, in the absence of the shock, the outcomes of the two groups being studied would have followed the same trend. This allows any divergence between the groups to be attributed to the shock rather than to differences in their underlying slopes.
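Stated formally, as a sketch in potential-outcomes notation that is assumed here rather than taken from the paper: let $Y_t(0)$ denote employment in period $t$ had the minimum wage not increased. The assumption is then

```latex
E\big[\,Y_{\text{after}}(0) - Y_{\text{before}}(0) \mid \text{NJ}\,\big]
  = E\big[\,Y_{\text{after}}(0) - Y_{\text{before}}(0) \mid \text{PA}\,\big]
```

Levels of employment may differ between the two states; only the change that would have occurred without the shock must be the same.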