I used three seminal papers that demonstrate causality. Download three papers:
Krueger (1999) Experimental Estimates of Education Production Functions QJE 114 (2) : 497-532
Ashenfelter and Krueger (1994) Estimates of the Economic Return to Schooling from a New Sample of Twins AER 84(5): 1157-1173
Card and Krueger (1994) Minimum Wages and Employment: A case Study of the Fast-Food Industry in New Jersey and Pennsylvania AER 84(4): 772-793
This paper examines the causal effect of class size on students’ achievement (test scores), especially the impact of small class sizes vs. regular class sizes.
This paper uses the true randomized experiment, the Project STAR class-size experiment. The true randomized experiments are the best way for isolating causal impacts.
The Project STAR is the random assignment experiment, by randomly assigning students to classes within schools.
Classes were divided into three types:
Small classes (13-17),
Regular classes (22-25), and
Regular classes with Aide (22-25).
Students and teachers also were randomly assigned to these classes.
Students stayed in the same type of class for four years.
Students were tested at the end of each year.
Total of 11,600 students and 80 schools were used to analyze the causal relationship.
The paper is trying to reveal the causal effect of education (schooling) on earnings.
The ideal experiment to test this causal link is the random assignment of subjects to schooling levels with controlling for all other differences.
Each twin was asked to report on his or her own schooling level and on his or her sibling’s.
Each twin was asked to report on his or her parents’ schooling level.
The authors surveyed twins.
Differences in schooling across individuals are caused by ability differences. However, when it comes to identical twins, it is not true. As a result, the ability is the same as regressing differences in earnings within twin pairs on differences in their schooling levels.
Multiple measures of schooling were used to address the potentially severe attenuation bias in within-twin estimates of the return to schooling that arise from measurement errors.
library(haven)
twins <- read_dta("AshenfelterKrueger1994_twins.dta")
head(twins)
## # A tibble: 6 x 10
## famid age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 33.3 16 16 2.16 2.42 0 0 1 1
## 2 2 43.6 12 19 2.17 2.89 0 0 1 1
## 3 3 31.0 12 12 2.79 2.80 1 1 1 1
## 4 4 34.6 14 14 2.82 2.26 1 1 1 1
## 5 5 35.0 15 13 2.03 3.56 0 0 1 1
## 6 6 29.3 14 12 2.71 2.48 1 1 1 1
twins$d_educ <- twins$educ1 - twins$educ2
twins$d_lwage <- twins$lwage1 - twins$lwage2
model <- lm(d_lwage ~ d_educ, twins)
summary(model)
##
## Call:
## lm(formula = d_lwage ~ d_educ, data = twins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03115 -0.20909 0.00722 0.34395 1.15740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07859 0.04547 -1.728 0.086023 .
## d_educ 0.09157 0.02371 3.862 0.000168 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5542 on 147 degrees of freedom
## Multiple R-squared: 0.09211, Adjusted R-squared: 0.08593
## F-statistic: 14.91 on 1 and 147 DF, p-value: 0.0001682
library(stargazer)
stargazer(model, type="text",title="Result of Table 3 and Column 5", align=TRUE, header=FALSE, font.size="small", digits=3)
##
## Result of Table 3 and Column 5
## ===============================================
## Dependent variable:
## ---------------------------
## d_lwage
## -----------------------------------------------
## d_educ 0.092***
## (0.024)
##
## Constant -0.079*
## (0.045)
##
## -----------------------------------------------
## Observations 149
## R2 0.092
## Adjusted R2 0.086
## Residual Std. Error 0.554 (df = 147)
## F Statistic 14.914*** (df = 1; 147)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
If the first-difference in schooling levels between twins increase by 1 year, the first-difference in wage rates increase by 9.2%, all other things equal.
This would imply that the bias from measurement error in schooling is likely to increase by forming differences between twins.
library(tidyr)
# wide - long data format
twinwide <- twins[-(11:12)]
twinwide$famid <- factor(twinwide$famid)
twinlong <-gather(twinwide, variables, values, educ1:white2, factor_key=TRUE)
twinlong <- twinlong %>% separate(variables, sep = "(?<=[A-Za-z])(?=[0-9])", c("xvar", "twinid"))
twinlong <- twinlong %>% spread(xvar, values)
#Generate AGE squared / 100
twinlong[order(twinlong$famid), ]
## # A tibble: 298 x 7
## famid age twinid educ lwage male white
## <fct> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 33.3 1 16 2.16 0 1
## 2 1 33.3 2 16 2.42 0 1
## 3 2 43.6 1 12 2.17 0 1
## 4 2 43.6 2 19 2.89 0 1
## 5 3 31.0 1 12 2.79 1 1
## 6 3 31.0 2 12 2.80 1 1
## 7 4 34.6 1 14 2.82 1 1
## 8 4 34.6 2 14 2.26 1 1
## 9 5 35.0 1 15 2.03 0 1
## 10 5 35.0 2 13 3.56 0 1
## # ... with 288 more rows
twinlong$agesq <- twinlong$age ^2 /100
# run OLS regression
model2<- lm(lwage ~ educ+age+agesq+male+white, twinlong)
stargazer(model2, type="text",title="Result of Table 3 and Column 1", align=TRUE, header=FALSE, font.size="small", digits=3)
##
## Result of Table 3 and Column 1
## ===============================================
## Dependent variable:
## ---------------------------
## lwage
## -----------------------------------------------
## educ 0.084***
## (0.014)
##
## age 0.088***
## (0.019)
##
## agesq -0.087***
## (0.023)
##
## male 0.204***
## (0.063)
##
## white -0.410***
## (0.127)
##
## Constant -0.471
## (0.426)
##
## -----------------------------------------------
## Observations 298
## R2 0.272
## Adjusted R2 0.260
## Residual Std. Error 0.532 (df = 292)
## F Statistic 21.860*** (df = 5; 292)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
plot(twins$d_educ, twins$d_lwage, main="FIGURE 1. INTRAPAIR RETURNS TO SCHOOLING, IDENTICAL TWINS",
xlab="Difference in Years of Schooling", ylab="Difference in Log Hourly Wage", pch=19)
abline(lm(d_lwage~d_educ, data=twins), col="orange")
Age: we need to consider both “age” and “agesq”. If the age increases by 1 year, wage rates increase by (0.088 - 20.087 age)%, all other things equal.
Male: Males have wage rates 20.4% more than females, all other things equal.
White: Whites have wage rates 41.0% more than non-whites, all other things equal.
This paper is trying to reveal the causal effect on change in minimum wages on establishment level emplyment outcomes.
The ideal experiment for this causal link would be the truly random experiment.
New Jersey’s minumum wage increased, while Pennsylvania’s minimum wage stayed the same.
Authors collected data on employment at fast food restaurants before and after the policy was effective in the two states.
Treatment group is NJ fast food restaurants and control group is Penn fast-food restaurants.
fastfood <- read.csv("CardKrueger1994_fastfood.csv")
head(fastfood)
## id state emptot emptot2 demp chain bk kfc roys wendys wage_st wage_st2
## 1 46 0 40.50 24.0 -16.50 1 1 0 0 0 NA 4.30
## 2 49 0 13.75 11.5 -2.25 2 0 1 0 0 NA 4.45
## 3 506 0 8.50 10.5 2.00 2 0 1 0 0 NA 5.00
## 4 56 0 34.00 20.0 -14.00 4 0 0 0 1 5.0 5.25
## 5 61 0 24.00 35.5 11.50 4 0 0 0 1 5.5 4.75
## 6 62 0 20.50 NA NA 4 0 0 0 1 5.0 NA
library(dplyr)
count <- fastfood %>%
group_by(state) %>%
summarize(bk = sum(bk), kfc = sum(kfc), roys = sum(roys), wendys = sum(wendys))
count <- as.matrix(t(count[,-1]))
percent <- prop.table(count, margin=2)*100
percent <- percent[,c(2,1)]
colnames(percent) <- c("NJ", "PA")
percent
## NJ PA
## bk 41.08761 44.30380
## kfc 20.54381 15.18987
## roys 24.77341 21.51899
## wendys 13.59517 18.98734
FTE1 <- aggregate(emptot ~ state, fastfood, mean)
nj_FTE1 <- as.numeric(FTE1[2,2])
pa_FTE1 <- as.numeric(FTE1[1,2])
FTE2 <- aggregate(emptot2 ~ state, fastfood, mean)
nj_FTE2 <- as.numeric(FTE2[2,2])
pa_FTE2 <- as.numeric(FTE2[1,2])
NJ <- c(nj_FTE1, nj_FTE2)
PA <- c(pa_FTE1, pa_FTE2)
means <- data.frame(NJ, PA)
rownames(means) <-c("FTE 1", "FTE 2")
means
## NJ PA
## FTE 1 20.43941 23.33117
## FTE 2 21.02743 21.16558
#FTE employment before, all available observations
diff_FTE1 <- nj_FTE1 - pa_FTE1
FTE1_sd <- aggregate(emptot ~ state, fastfood, sd)
FTE1_len <- aggregate(emptot ~ state, fastfood, length)
nj_FTE1_se <- as.numeric(FTE1_sd[2,2] /sqrt(as.numeric(FTE1_len[2,2])))
pa_FTE1_se <- as.numeric(FTE1_sd[1,2] /sqrt(as.numeric(FTE1_len[1,2])))
diff_nj_FTE1_se <- nj_FTE1_se^2
diff_pa_FTE1_se <- pa_FTE1_se^2
diff_FTE1_se <- sqrt(diff_nj_FTE1_se + diff_pa_FTE1_se)
#FTE employment after, all available observations
diff_FTE2 <- nj_FTE2 - pa_FTE2
FTE2_sd <- aggregate(emptot2 ~ state, fastfood, sd)
FTE2_len <- aggregate(emptot2 ~ state, fastfood, length)
nj_FTE2_se <- as.numeric(FTE2_sd[2,2] /sqrt(as.numeric(FTE2_len[2,2])))
pa_FTE2_se <- as.numeric(FTE2_sd[1,2] /sqrt(as.numeric(FTE2_len[1,2])))
diff_nj_FTE2_se <- nj_FTE2_se^2
diff_pa_FTE2_se <- pa_FTE2_se^2
diff_FTE2_se <- sqrt(diff_nj_FTE2_se + diff_pa_FTE2_se)
#Change in mean FTE employment, balanced sample of stores
fastfood$demp <- as.numeric(fastfood$demp)
change_FTE<- aggregate(demp ~ state, fastfood, mean)
nj_change_FTE <- as.numeric(change_FTE[2,2])
pa_change_FTE <- as.numeric(change_FTE[1,2])
diff_change_FTE <- nj_change_FTE - pa_change_FTE
change_FTE_sd<- aggregate(demp ~ state, fastfood, sd)
change_FTE_len <- aggregate(demp ~ state, fastfood, length)
nj_change_FTE_se <- as.numeric(change_FTE_sd[2,2] /sqrt(as.numeric(change_FTE_len[2,2])))
pa_change_FTE_se <- as.numeric(change_FTE_sd[1,2] /sqrt(as.numeric(change_FTE_len[1,2])))
diff_nj_change_FTE_se <- nj_change_FTE_se^2
diff_pa_change_FTE_se <- pa_change_FTE_se^2
diff_change_FTE_se <- sqrt(diff_nj_change_FTE_se + diff_pa_change_FTE_se)
FTE_before <- c(pa_FTE1, nj_FTE1, diff_FTE1)
FTE_before_se <- c(pa_FTE1_se,nj_FTE1_se, diff_FTE1_se)
FTE_after <- c(pa_FTE2, nj_FTE2, diff_FTE2)
FTE_after_se <- c(pa_FTE2_se,nj_FTE2_se, diff_FTE2_se)
Change_FTE <- c(pa_change_FTE, nj_change_FTE, diff_change_FTE)
Change_FTE_se <- c(pa_change_FTE_se, nj_change_FTE_se, diff_change_FTE_se)
t3 <-rbind(FTE_before, FTE_before_se, FTE_after, FTE_after_se, Change_FTE, Change_FTE_se)
colnames(t3) <-c("PA", "NJ", "Difference")
t3
## PA NJ Difference
## FTE_before 23.3311688 20.4394081 -2.8917607
## FTE_before_se 1.3511489 0.5082607 1.4435831
## FTE_after 21.1655844 21.0274295 -0.1381549
## FTE_after_se 0.9432212 0.5203094 1.0772131
## Change_FTE -2.2833333 0.4666667 2.7500000
## Change_FTE_se 1.2532690 0.4808286 1.3423410
The DiD estimator is 2.75.
Mean change in minimum wage from before to after between NJ and PA is 2.75 FTE employees.
OLS <- lm(demp ~ state, fastfood)
summary(OLS)
##
## Call:
## lm(formula = demp ~ state, data = fastfood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.217 -3.967 0.533 4.533 33.533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.283 1.036 -2.205 0.0280 *
## state 2.750 1.154 2.382 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.968 on 382 degrees of freedom
## (26 observations deleted due to missingness)
## Multiple R-squared: 0.01464, Adjusted R-squared: 0.01206
## F-statistic: 5.675 on 1 and 382 DF, p-value: 0.01769
OLS also have the same DiD estimator (state) and its coefficient is 2.75.
FTElong<-reshape(fastfood,
direction = 'long',
varying = c("emptot", "wage_st", "emptot2", "wage_st2"),
v.names = c('emptot', 'wage'),
timevar = "treatment",
times = c(0,1),
idvar="groupid")
head(FTElong)
## id state demp chain bk kfc roys wendys treatment emptot wage groupid
## 1.0 46 0 -16.50 1 1 0 0 0 0 40.50 NA 1
## 2.0 49 0 -2.25 2 0 1 0 0 0 13.75 NA 2
## 3.0 506 0 2.00 2 0 1 0 0 0 8.50 NA 3
## 4.0 56 0 -14.00 4 0 0 0 1 0 34.00 5.0 4
## 5.0 61 0 11.50 4 0 0 0 1 0 24.00 5.5 5
## 6.0 62 0 NA 4 0 0 0 1 0 20.50 5.0 6
#Linear models for panel data
library(plm)
plm <- plm(emptot ~ state + treatment + state*treatment, FTElong, index=c("id"))
summary(plm)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = emptot ~ state + treatment + state * treatment,
## data = FTElong, index = c("id"))
##
## Unbalanced Panel: n = 409, T = 1-4, N = 794
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -19.6083 -1.9833 0.0000 1.9833 19.6083
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## state 0.8750 6.3674 0.1374 0.89077
## treatment -2.2833 1.0355 -2.2050 0.02805 *
## state:treatment 2.7500 1.1544 2.3823 0.01769 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 15595
## Residual Sum of Squares: 15360
## R-Squared: 0.015019
## Adj. R-Squared: -1.0447
## F-statistic: 1.94162 on 3 and 382 DF, p-value: 0.12237
DiD estimator is 2.75 and it is the same as the OLS estimator and the result from Table 3.
Interpretation is the same as the previous one.
The linear trends assumption means that the difference between the treatment and control group is the same over time if there is no treatment.