I used three seminal papers that demonstrate causality. Download three papers:

Part 1: Paper using randomized data: Impact of Class Size on Learning

1-1. Background of the paper

c. What is the identification strategy?

The Project STAR is the random assignment experiment, by randomly assigning students to classes within schools.

Classes were divided into three types:

  1. Small classes (13-17),

  2. Regular classes (22-25), and

  3. Regular classes with Aide (22-25).

Students and teachers also were randomly assigned to these classes.

Students stayed in the same type of class for four years.

Students were tested at the end of each year.

Total of 11,600 students and 80 schools were used to analyze the causal relationship.

d. What are the assumptions / threats to this identification strategy? (answer specifically with reference to the data the authors are using)

There are some limitations of this experiment.

  1. Random assignment after kindergarten was done again for Regular vs. Regular with Aide although Small classes were okay.

  2. 10 % of students switched from Small to Regular or vise versa because of behavioral problems or parental complaints.

  3. Kindergarten was not required. Therefore, many students begin school in their first grade.

They also assumed that initial class assignment only affects current test scores by affecting current class size.

Part 2: Using Twins for Identification: Economic Returns to Schooling

2-1. Background of the paper

c. What is the identification strategy?

The authors surveyed twins.

Differences in schooling across individuals are caused by ability differences. However, when it comes to identical twins, it is not true. As a result, the ability is the same as regressing differences in earnings within twin pairs on differences in their schooling levels.

Multiple measures of schooling were used to address the potentially severe attenuation bias in within-twin estimates of the return to schooling that arise from measurement errors.

d. What are the assumptions / threats to this identification strategy? (answer specifically with reference to the data the authors are using)

There is the problem of measurement error. The reports from twins could be different. For example, twins were asked to report on their parents’ education. Since both twins have the same parents, the answers should be the same. Differences between the two responses represent measurement error.

After receiving two measures of schooling, authors use the first-difference of schooling reported by one of a pair as an instrument for the first-difference reported by the other. If measurement errors are uncorrelated, this allows for a correction of the problem of measurement error in the schooling variable.

2.2. Replication analysis

a. Load Ashenfleter and Krueger AER 1994 data

library(haven)
twins <- read_dta("AshenfelterKrueger1994_twins.dta")
head(twins)
## # A tibble: 6 x 10
##   famid   age educ1 educ2 lwage1 lwage2 male1 male2 white1 white2
##   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1     1  33.3    16    16   2.16   2.42     0     0      1      1
## 2     2  43.6    12    19   2.17   2.89     0     0      1      1
## 3     3  31.0    12    12   2.79   2.80     1     1      1      1
## 4     4  34.6    14    14   2.82   2.26     1     1      1      1
## 5     5  35.0    15    13   2.03   3.56     0     0      1      1
## 6     6  29.3    14    12   2.71   2.48     1     1      1      1

b. Reproduce the result from table 3 column 5 of the paper

twins$d_educ <- twins$educ1 - twins$educ2
twins$d_lwage <- twins$lwage1 - twins$lwage2

model <- lm(d_lwage ~ d_educ, twins)
summary(model)
## 
## Call:
## lm(formula = d_lwage ~ d_educ, data = twins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03115 -0.20909  0.00722  0.34395  1.15740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07859    0.04547  -1.728 0.086023 .  
## d_educ       0.09157    0.02371   3.862 0.000168 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5542 on 147 degrees of freedom
## Multiple R-squared:  0.09211,    Adjusted R-squared:  0.08593 
## F-statistic: 14.91 on 1 and 147 DF,  p-value: 0.0001682
library(stargazer)
stargazer(model, type="text",title="Result of Table 3 and Column 5", align=TRUE, header=FALSE, font.size="small", digits=3)
## 
## Result of Table 3 and Column 5
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               d_lwage          
## -----------------------------------------------
## d_educ                       0.092***          
##                               (0.024)          
##                                                
## Constant                      -0.079*          
##                               (0.045)          
##                                                
## -----------------------------------------------
## Observations                    149            
## R2                             0.092           
## Adjusted R2                    0.086           
## Residual Std. Error      0.554 (df = 147)      
## F Statistic           14.914*** (df = 1; 147)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

c. Explain how this coefficient should be interpreted.

If the first-difference in schooling levels between twins increase by 1 year, the first-difference in wage rates increase by 9.2%, all other things equal.

This would imply that the bias from measurement error in schooling is likely to increase by forming differences between twins.

d. Reproduce the result in table 3 column 1 and Figure 1

library(tidyr)
# wide - long data format
twinwide <- twins[-(11:12)]
twinwide$famid <- factor(twinwide$famid)

twinlong <-gather(twinwide, variables, values, educ1:white2, factor_key=TRUE)

twinlong <- twinlong %>% separate(variables,  sep = "(?<=[A-Za-z])(?=[0-9])", c("xvar", "twinid"))
twinlong <- twinlong %>% spread(xvar, values)

#Generate AGE squared / 100
twinlong[order(twinlong$famid), ]
## # A tibble: 298 x 7
##    famid   age twinid  educ lwage  male white
##    <fct> <dbl> <chr>  <dbl> <dbl> <dbl> <dbl>
##  1 1      33.3 1         16  2.16     0     1
##  2 1      33.3 2         16  2.42     0     1
##  3 2      43.6 1         12  2.17     0     1
##  4 2      43.6 2         19  2.89     0     1
##  5 3      31.0 1         12  2.79     1     1
##  6 3      31.0 2         12  2.80     1     1
##  7 4      34.6 1         14  2.82     1     1
##  8 4      34.6 2         14  2.26     1     1
##  9 5      35.0 1         15  2.03     0     1
## 10 5      35.0 2         13  3.56     0     1
## # ... with 288 more rows
twinlong$agesq <- twinlong$age ^2 /100
        
# run OLS regression
model2<- lm(lwage ~ educ+age+agesq+male+white, twinlong)
stargazer(model2, type="text",title="Result of Table 3 and Column 1", align=TRUE, header=FALSE, font.size="small", digits=3)
## 
## Result of Table 3 and Column 1
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                lwage           
## -----------------------------------------------
## educ                         0.084***          
##                               (0.014)          
##                                                
## age                          0.088***          
##                               (0.019)          
##                                                
## agesq                        -0.087***         
##                               (0.023)          
##                                                
## male                         0.204***          
##                               (0.063)          
##                                                
## white                        -0.410***         
##                               (0.127)          
##                                                
## Constant                      -0.471           
##                               (0.426)          
##                                                
## -----------------------------------------------
## Observations                    298            
## R2                             0.272           
## Adjusted R2                    0.260           
## Residual Std. Error      0.532 (df = 292)      
## F Statistic           21.860*** (df = 5; 292)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
plot(twins$d_educ, twins$d_lwage, main="FIGURE 1. INTRAPAIR RETURNS TO SCHOOLING, IDENTICAL TWINS",
   xlab="Difference in Years of Schooling", ylab="Difference in Log Hourly Wage", pch=19)
abline(lm(d_lwage~d_educ, data=twins), col="orange")

e. Explain how the coefficient on education should be interpreted.

  1. Schooling: If the schooling levels of twins increase by 1 year, wage rates increase by 8.4%, all other things equal.

f. Explain how the coefficient on the control variables should be interpreted.

  1. Age: we need to consider both “age” and “agesq”. If the age increases by 1 year, wage rates increase by (0.088 - 20.087 age)%, all other things equal.

  2. Male: Males have wage rates 20.4% more than females, all other things equal.

  3. White: Whites have wage rates 41.0% more than non-whites, all other things equal.

Part 3: Paper using Difference-in-Differences: Impact of Minimum Wages

3.1. Background of the paper

c. What is the identification strategy?

New Jersey’s minumum wage increased, while Pennsylvania’s minimum wage stayed the same.

Authors collected data on employment at fast food restaurants before and after the policy was effective in the two states.

Treatment group is NJ fast food restaurants and control group is Penn fast-food restaurants.

d. What are the assumptions / threats to this identification strategy? (Answer specifically with reference to the data the authors are using)

  1. NJ and Penn states are close each other. Therefore, they can be affected by the same unobservable shocks.

  2. Authors only looked at fast food restaurants that were already in place in first wave.

  3. There could be measurement errors becasue they collected data from telephone surveys.

3.2. Replication Analysis

a. Load data from Card and Krueger AER 1994. You can load it directly from my website here.

fastfood <- read.csv("CardKrueger1994_fastfood.csv")
head(fastfood)
##    id state emptot emptot2   demp chain bk kfc roys wendys wage_st wage_st2
## 1  46     0  40.50    24.0 -16.50     1  1   0    0      0      NA     4.30
## 2  49     0  13.75    11.5  -2.25     2  0   1    0      0      NA     4.45
## 3 506     0   8.50    10.5   2.00     2  0   1    0      0      NA     5.00
## 4  56     0  34.00    20.0 -14.00     4  0   0    0      1     5.0     5.25
## 5  61     0  24.00    35.5  11.50     4  0   0    0      1     5.5     4.75
## 6  62     0  20.50      NA     NA     4  0   0    0      1     5.0       NA

b. Verify that the data is matching that of the paper.

library(dplyr)
count <- fastfood %>%
  group_by(state) %>%
  summarize(bk = sum(bk), kfc = sum(kfc), roys = sum(roys), wendys = sum(wendys)) 

count <- as.matrix(t(count[,-1]))
percent <- prop.table(count, margin=2)*100
percent <- percent[,c(2,1)]
colnames(percent) <- c("NJ", "PA")
percent
##              NJ       PA
## bk     41.08761 44.30380
## kfc    20.54381 15.18987
## roys   24.77341 21.51899
## wendys 13.59517 18.98734
FTE1 <- aggregate(emptot ~ state, fastfood, mean)
nj_FTE1 <- as.numeric(FTE1[2,2])  
pa_FTE1 <- as.numeric(FTE1[1,2])    

FTE2 <- aggregate(emptot2 ~ state, fastfood, mean)
nj_FTE2 <- as.numeric(FTE2[2,2])  
pa_FTE2 <- as.numeric(FTE2[1,2])    

NJ <- c(nj_FTE1, nj_FTE2)
PA <- c(pa_FTE1, pa_FTE2)
means <- data.frame(NJ, PA)
rownames(means) <-c("FTE 1", "FTE 2")
means
##             NJ       PA
## FTE 1 20.43941 23.33117
## FTE 2 21.02743 21.16558

c. Compute the difference-in-differences estimator “by hand”.

#FTE employment before, all available observations
diff_FTE1 <- nj_FTE1 - pa_FTE1

FTE1_sd <- aggregate(emptot ~ state, fastfood, sd)
FTE1_len <- aggregate(emptot ~ state, fastfood, length)

nj_FTE1_se <- as.numeric(FTE1_sd[2,2] /sqrt(as.numeric(FTE1_len[2,2])))  
pa_FTE1_se <- as.numeric(FTE1_sd[1,2] /sqrt(as.numeric(FTE1_len[1,2])))

diff_nj_FTE1_se <- nj_FTE1_se^2 
diff_pa_FTE1_se <- pa_FTE1_se^2 
diff_FTE1_se <- sqrt(diff_nj_FTE1_se + diff_pa_FTE1_se)

#FTE employment after, all available observations
diff_FTE2 <- nj_FTE2 - pa_FTE2

FTE2_sd <- aggregate(emptot2 ~ state, fastfood, sd)
FTE2_len <- aggregate(emptot2 ~ state, fastfood, length)

nj_FTE2_se <- as.numeric(FTE2_sd[2,2] /sqrt(as.numeric(FTE2_len[2,2])))  
pa_FTE2_se <- as.numeric(FTE2_sd[1,2] /sqrt(as.numeric(FTE2_len[1,2])))

diff_nj_FTE2_se <- nj_FTE2_se^2 
diff_pa_FTE2_se <- pa_FTE2_se^2 
diff_FTE2_se <- sqrt(diff_nj_FTE2_se + diff_pa_FTE2_se)

#Change in mean FTE employment, balanced sample of stores
fastfood$demp <- as.numeric(fastfood$demp)
change_FTE<- aggregate(demp ~ state, fastfood, mean)

nj_change_FTE <- as.numeric(change_FTE[2,2])  
pa_change_FTE <- as.numeric(change_FTE[1,2])    

diff_change_FTE <- nj_change_FTE - pa_change_FTE

change_FTE_sd<- aggregate(demp ~ state, fastfood, sd)
change_FTE_len <- aggregate(demp ~ state, fastfood, length)

nj_change_FTE_se <- as.numeric(change_FTE_sd[2,2] /sqrt(as.numeric(change_FTE_len[2,2])))  
pa_change_FTE_se <- as.numeric(change_FTE_sd[1,2] /sqrt(as.numeric(change_FTE_len[1,2])))

diff_nj_change_FTE_se <- nj_change_FTE_se^2 
diff_pa_change_FTE_se <- pa_change_FTE_se^2 
diff_change_FTE_se <- sqrt(diff_nj_change_FTE_se + diff_pa_change_FTE_se)

FTE_before  <- c(pa_FTE1, nj_FTE1, diff_FTE1)
FTE_before_se <- c(pa_FTE1_se,nj_FTE1_se, diff_FTE1_se)
FTE_after  <- c(pa_FTE2, nj_FTE2, diff_FTE2)
FTE_after_se <- c(pa_FTE2_se,nj_FTE2_se, diff_FTE2_se)
Change_FTE  <- c(pa_change_FTE, nj_change_FTE, diff_change_FTE)
Change_FTE_se <- c(pa_change_FTE_se, nj_change_FTE_se, diff_change_FTE_se)

t3 <-rbind(FTE_before, FTE_before_se, FTE_after, FTE_after_se, Change_FTE, Change_FTE_se)
colnames(t3) <-c("PA", "NJ", "Difference")
t3
##                       PA         NJ Difference
## FTE_before    23.3311688 20.4394081 -2.8917607
## FTE_before_se  1.3511489  0.5082607  1.4435831
## FTE_after     21.1655844 21.0274295 -0.1381549
## FTE_after_se   0.9432212  0.5203094  1.0772131
## Change_FTE    -2.2833333  0.4666667  2.7500000
## Change_FTE_se  1.2532690  0.4808286  1.3423410

d. Interpret the difference-in-differences estimator. Does it (roughly) match the one in the paper?

The DiD estimator is 2.75.

Mean change in minimum wage from before to after between NJ and PA is 2.75 FTE employees.

e. Use OLS to obtain the same Diff-in-diff estimator as you just did.

OLS <- lm(demp ~ state, fastfood)
summary(OLS)
## 
## Call:
## lm(formula = demp ~ state, data = fastfood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.217  -3.967   0.533   4.533  33.533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -2.283      1.036  -2.205   0.0280 *
## state          2.750      1.154   2.382   0.0177 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.968 on 382 degrees of freedom
##   (26 observations deleted due to missingness)
## Multiple R-squared:  0.01464,    Adjusted R-squared:  0.01206 
## F-statistic: 5.675 on 1 and 382 DF,  p-value: 0.01769

OLS also have the same DiD estimator (state) and its coefficient is 2.75.

f. Reshape your data to long form.

FTElong<-reshape(fastfood, 
             direction = 'long', 
             varying = c("emptot", "wage_st", "emptot2", "wage_st2"),
             v.names = c('emptot', 'wage'),
             timevar = "treatment",
             times   = c(0,1),
             idvar="groupid")
head(FTElong)
##      id state   demp chain bk kfc roys wendys treatment emptot wage groupid
## 1.0  46     0 -16.50     1  1   0    0      0         0  40.50   NA       1
## 2.0  49     0  -2.25     2  0   1    0      0         0  13.75   NA       2
## 3.0 506     0   2.00     2  0   1    0      0         0   8.50   NA       3
## 4.0  56     0 -14.00     4  0   0    0      1         0  34.00  5.0       4
## 5.0  61     0  11.50     4  0   0    0      1         0  24.00  5.5       5
## 6.0  62     0     NA     4  0   0    0      1         0  20.50  5.0       6

g. Run the appropriate DiD regression and comment on the result.

#Linear models for panel data
library(plm)
plm <- plm(emptot ~ state + treatment + state*treatment, FTElong, index=c("id"))
summary(plm)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = emptot ~ state + treatment + state * treatment, 
##     data = FTElong, index = c("id"))
## 
## Unbalanced Panel: n = 409, T = 1-4, N = 794
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -19.6083  -1.9833   0.0000   1.9833  19.6083 
## 
## Coefficients:
##                 Estimate Std. Error t-value Pr(>|t|)  
## state             0.8750     6.3674  0.1374  0.89077  
## treatment        -2.2833     1.0355 -2.2050  0.02805 *
## state:treatment   2.7500     1.1544  2.3823  0.01769 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    15595
## Residual Sum of Squares: 15360
## R-Squared:      0.015019
## Adj. R-Squared: -1.0447
## F-statistic: 1.94162 on 3 and 382 DF, p-value: 0.12237

DiD estimator is 2.75 and it is the same as the OLS estimator and the result from Table 3.

Interpretation is the same as the previous one.