Diff-in-Diff

Assignment from the seminal paper by David Card and Alan Krueger. Card and Krueger (1994) Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania AER 84(4): 772-793.

Part 1: Briefly answer these questions:

a. What is the causal link the paper is trying to reveal?

The paper is trying to study the impact of higher minimum wage policy on the employment. The treatment group is New Jersey (NJ) where the law is implemented and Pennsylvania(PA) is the control group.

b. What would be the ideal experiment to test this causal link?

Ideal experiment would be to randomly assign the treatment and control group. Here, one state is treatment and other is control group assuming they are almost identical which might not be the case.

c. What is the identification strategy?

Implementation of the higher minimum wage policy in NJ wage will affect the employment in NJ and no any effect is likely in PA where no such policy is implemented. Before and after comparison is done in both the states to study the impact of policy implementation

d. What are the assumptions / threats to this identification strategy? (answer specifically with reference to the data the authors are using)

Assumptions to the identification strategy are the following:

Employment largely depends on socio-economic condition of the place. It is assumed that NJ and PA are same in terms of socio-economic condition, which might be a concern unless the author took the border of the two state for the survey.
It is also assumed that all other factors are same, which implies no change other than policy implementation in NJ and not in PA.

Part 2: Replication Analysis

a. Load the data

df <- read.csv("data/CardKrueger1994_fastfood.csv")
head(df,6)

id	emptot	emptot2	demp	chain	bk	kfc	wendys	wage_st	wage_st2
46	40.5	24	-16.5	1	1	0	0		4.3
49	13.8	11.5	-2.25	2	0	1	0		4.45
506	8.5	10.5	2	2	0	1	0		5
56	34	20	-14	4	0	0	1	5	5.25
61	24	35.5	11.5	4	0	0	1	5.5	4.75
62	20.5			4	0	0	1	5

b. Verify that the data is correct. Reproduce the % of Burger King, KFC, Roys, and Wendys, as well as the FTE means in the 2 waves

a <- df %>% group_by(state) %>% summarise(mean(bk)*100, mean(kfc)*100, mean(roys)*100, 
                                          mean(wendys)*100, mean(emptot, na.rm=TRUE),mean(emptot2, na.rm=TRUE)) %>% t()
aa <- a[2:7,]

tt_bk = t.test(df$bk[df$state==1],df$bk[df$state==0], var.equal=FALSE)$statistic
tt_kfc = t.test(df$kfc[df$state==1],df$kfc[df$state==0], var.equal=FALSE)$statistic
tt_roys = t.test(df$roys[df$state==1],df$roys[df$state==0], var.equal=FALSE)$statistic 
tt_wendys = t.test(df$wendys[df$state==1],df$wendys[df$state==0], var.equal=FALSE)$statistic 
tt_fte1 = t.test(df$emptot[df$state==1],df$emptot[df$state==0], var.equal=FALSE)$statistic
tt_fte2 = t.test(df$emptot2[df$state==1],df$emptot2[df$state==0], var.equal=FALSE)$statistic
ttest <- data.frame(c(tt_bk,tt_kfc,tt_roys,tt_wendys,tt_fte1,tt_fte2))


tab <- cbind(aa,ttest)
 
colnames(tab) <- c("PA","NJ","t")
tab[,1:2] <- sapply(tab[,1:2],as.numeric)
tab$Var <- c("Burger King","KFC","Roy Rogers","Wendy's","FTE-Wave1","FTE-Wave2")
#Reordering column names
tab2 <- tab[,c(4,2,1,3)]
rownames(tab2) <- NULL
huxtable(tab2)

Var	NJ	PA	t
Burger King	41.1	44.3	-0.515
KFC	20.5	15.2	1.16
Roy Rogers	24.8	21.5	0.623
Wendy's	13.6	19	-1.12
FTE-Wave1	20.4	23.3	-2
FTE-Wave2	21	21.2	-0.128

c. Use OLS to obtain the diff-in-diff

model1 <- lm(demp~state,data=df)

huxtable <- huxreg(model1, coefs = c("State"="state"),
                   statistics = c("N. obs." = "nobs","R squared" = "r.squared"))
huxtable

	(1)
State	2.750 *
	(1.154)
N. obs.	384
R squared	0.015
* p < 0.001; p < 0.01; * p < 0.05.

As per Table 3 in the paper, the coeff is 2.76 whereas with OLS what I get is 2.75

d. The equation of a standard “difference in difference” regression:

\(y_{i,t}\) = \(\alpha\) + \(\beta S_i\) + \(\tau T_t\) + \(\gamma (S_i * T_t)\) + \(\varepsilon_{i,t}\)

where:

\(y_{i,t}\) is the \(i^{th}\) observation at time t,

\(S_i\) is the dummy for state, 1 for NJ and 0 for PA,

\(T_t\) is the dummy for time, 0 for before and 1 is after the implementation of the policy,

\(\varepsilon_{i,t}\) is the error term

Part 3: Optional:

e. To do Diff-in-Diff,First reshape the data to the long form.

df$time_1 <- rep(0,410)
df$time_2 <- rep(1,410)
df1 <- df %>% select(c(1,2,3,4,13,14))

df_new <- reshape(df1, idvar = c("id","state"), varying = list(c("emptot","emptot2"),c("time_1","time_2")), v.names = c("emp_toto","timeofyr"), direction = "long")
head(df_new,6)

id	time	emp_toto
46	1	40.5
49	1	13.8
506	1	8.5
56	1	34
61	1	24
62	1	20.5

df_new[is.na(df_new)]=0
model2 <- lm(emp_toto ~ state, data=df_new)
model3 <- lm(emp_toto ~ state + timeofyr + state*timeofyr, data=df_new)
model4 <- plm(emp_toto ~ state + timeofyr + state*timeofyr, data=df_new,index=c("id"))

huxtable <- huxreg("Simple Reg" = model2,"Diff-in-Diff" = model3,"Diff-in-Diff using PLM" = model4, coefs = c("State"="state","Time" = "timeofyr","State:Time" = "state:timeofyr" ),
                   statistics = c("N. obs." = "nobs","R squared" = "r.squared"))%>% 
  set_caption("Table: Simple regression and the Diff-in-Diff regression")
huxtable

Table: Simple regression and the Diff-in-Diff regression
	Simple Reg	Diff-in-Diff	Diff-in-Diff using PLM
State	-1.642	-2.919 *	0.973
	(0.882)	(1.247)	(7.438)
Time		-2.111	-2.111
		(1.585)	(1.179)
State:Time		2.554	2.554
		(1.764)	(1.312)
N. obs.	820	820	820
R squared	0.004	0.007	0.009
* p < 0.001; p < 0.01; * p < 0.05.

f. Computing DID

DID computed from table3 in the paper:

\(\beta_{NJ,0}\) = 20.44

\(\beta_{NJ,1}\) = 21.03

\(\beta_{PA,0}\) = 23.33

\(\beta_{PA,1}\) = 21.17

\(\beta_{DID}\) = \(\beta_{NJ,1}\) - \(\beta_{NJ,0}\) - (\(\beta_{PA,1}\) - \(\beta_{PA,0}\) )

## [1] 2.75