Part 1: Card and Krueger (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review 84(4): 772-793.

c. What is the identification strategy?

The identification strategy is difference-in-differences. The authors constructed a sample frame of fast-food restaurants in New Jersey and eastern Pennsylvania, conducted a telephone survey before the scheduled increase in New Jersey's minimum wage and a second survey after the increase, and then compared the change in employment in New Jersey (the treated state) with the change in employment in Pennsylvania (the control state), where the minimum wage did not change.

d. What are the assumptions / threats to this identification strategy? (answer specifically with reference to the data the authors are using)

Card and Krueger (1994) use Pennsylvania stores as a control group. The key identifying assumption is that, absent the minimum-wage increase, employment in New Jersey fast-food stores would have followed the same path as employment in the Pennsylvania stores (parallel trends), so that common shocks such as seasonal patterns in fast-food employment are differenced out. The main threat specific to these data is attrition: some stores interviewed in the first wave did not respond in the second wave for various reasons, including store closures. The authors address this by tracking closures and treating permanently closed stores as having zero employment rather than dropping them. A simple tabulation of wave-2 non-response by state is sketched after the data are loaded in Part 2.

Part 2: Replication Analysis

a. Load data from Card and Krueger AER 1994

# Read csv file
my_data <- read.csv("CardKrueger1994_fastfood.csv")
head(my_data)
##    id state emptot emptot2   demp chain bk kfc roys wendys wage_st wage_st2
## 1  46     0  40.50    24.0 -16.50     1  1   0    0      0      NA     4.30
## 2  49     0  13.75    11.5  -2.25     2  0   1    0      0      NA     4.45
## 3 506     0   8.50    10.5   2.00     2  0   1    0      0      NA     5.00
## 4  56     0  34.00    20.0 -14.00     4  0   0    0      1     5.0     5.25
## 5  61     0  24.00    35.5  11.50     4  0   0    0      1     5.5     4.75
## 6  62     0  20.50      NA     NA     4  0   0    0      1     5.0       NA
summary(my_data)         # look at the data summary
##        id            state            emptot         emptot2     
##  Min.   :  1.0   Min.   :0.0000   Min.   : 5.00   Min.   : 0.00  
##  1st Qu.:119.2   1st Qu.:1.0000   1st Qu.:14.56   1st Qu.:14.50  
##  Median :237.5   Median :1.0000   Median :19.50   Median :20.50  
##  Mean   :246.5   Mean   :0.8073   Mean   :21.00   Mean   :21.05  
##  3rd Qu.:371.8   3rd Qu.:1.0000   3rd Qu.:24.50   3rd Qu.:26.50  
##  Max.   :522.0   Max.   :1.0000   Max.   :85.00   Max.   :60.50  
##                                   NA's   :12      NA's   :14     
##       demp               chain             bk              kfc        
##  Min.   :-41.50000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: -4.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :  0.00000   Median :2.000   Median :0.0000   Median :0.0000  
##  Mean   : -0.07044   Mean   :2.117   Mean   :0.4171   Mean   :0.1951  
##  3rd Qu.:  4.00000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   : 34.00000   Max.   :4.000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :26                                                           
##       roys            wendys          wage_st         wage_st2    
##  Min.   :0.0000   Min.   :0.0000   Min.   :4.250   Min.   :4.250  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:4.250   1st Qu.:5.050  
##  Median :0.0000   Median :0.0000   Median :4.500   Median :5.050  
##  Mean   :0.2415   Mean   :0.1463   Mean   :4.616   Mean   :4.996  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:4.950   3rd Qu.:5.050  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.750   Max.   :6.250  
##                                    NA's   :20      NA's   :21
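
As a rough check on the attrition issue discussed in part 1(d), the sketch below counts stores with a missing second-wave employment figure in each state. This assumes that a missing value of emptot2 marks a store without a usable wave-2 employment observation.

# Wave-2 non-response by state (state: 0 = PA, 1 = NJ)
# Assumption: NA in emptot2 indicates a store without a usable wave-2 employment figure
library(dplyr)
my_data %>%
  group_by(state) %>%
  summarise(n_stores      = n(),
            missing_wave2 = sum(is.na(emptot2)),
            share_missing = mean(is.na(emptot2)))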

b. Verify that the data is correct

# Percentage calculation: Use dplyr package
library(dplyr)
stores <- t(my_data %>%     # transpose because we want stores in a row, not column
  group_by(state) %>%       # grouping by states
  summarise(across(c(bk, kfc, roys, wendys), list(mean = mean)))) # gives average of all stores
colnames(stores) <- c("PA", "NJ")         # provide column names    
stores <- round(stores[-1,], 3) * 100     # remove the grouping (state) row and convert shares to percentages
rownames(stores) <- c("Burger King", "KFC", "Roy Rogers", "Wendy's")  # change row names
stores
##               PA   NJ
## Burger King 44.3 41.1
## KFC         15.2 20.5
## Roy Rogers  21.5 24.8
## Wendy's     19.0 13.6
# Mean of FTE 
fte <- t(my_data %>%              # transpose because we want fte in a row, not column
        group_by(state) %>%       # grouping by states
        summarise(across(c(emptot, emptot2), list(mean = ~ mean(.x, na.rm = TRUE)))))   # gives average of FTE, ignoring missing values
colnames(fte) <- c("PA", "NJ")   # provide column names  
fte <- round(fte[-1,], 1)        # remove the grouping (state) row and round to one decimal place
rownames(fte) <- c("FTE employment1", "FTE employment2")  # change row names
fte
##                   PA   NJ
## FTE employment1 23.3 20.4
## FTE employment2 21.2 21.0
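
As an additional check, the number of stores per state can be compared with the published sample: the paper reports 79 Pennsylvania stores and 331 New Jersey stores in the first interview wave.

# Stores per state; should give 79 (PA) and 331 (NJ) if the data match the published sample
table(factor(my_data$state, levels = c(0, 1), labels = c("PA", "NJ")))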

c. Use OLS to obtain their Diff-in-diff estimator

# OLS estimation to obtain DiD estimator
# Regress difference variable with state
mod <- lm(demp ~ state, data = my_data)
# Create a table with stargazer package
library(stargazer)
stargazer(mod, type = "text", title = "TABLE 3 output from OLS", align = TRUE, keep.stat = c("n","rsq"), dep.var.labels = c("Change in FTE employment"), covariate.labels = c("State"))  # Display sample size and R-squared 
## 
## TABLE 3 output from OLS
## ========================================
##                  Dependent variable:    
##              ---------------------------
##               Change in FTE employment  
## ----------------------------------------
## State                  2.750**          
##                        (1.154)          
##                                         
## Constant              -2.283**          
##                        (1.036)          
##                                         
## ----------------------------------------
## Observations             384            
## R2                      0.015           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

The output shows \(\hat{\beta} = 2.75\), which is statistically significant at the 5% level and similar to the corresponding difference-in-differences estimate in Table 3 of the article. Since the dependent variable demp is the change in each store's FTE employment between the two waves, the positive coefficient on state means that full-time-equivalent employment grew in New Jersey relative to Pennsylvania after the minimum-wage increase.
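
The conventional OLS standard error above assumes the same error variance in both states. A minimal sketch of a heteroskedasticity-robust alternative, assuming the sandwich and lmtest packages are installed (the point estimate is unchanged; only the standard error can differ):

# Heteroskedasticity-robust (HC1) standard errors for the first-difference model
library(sandwich)
library(lmtest)
coeftest(mod, vcov = vcovHC(mod, type = "HC1"))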

d. What would be the equation of a standard “difference in difference” regression?

We need a dummy variable indicating the survey wave (0 for observations before the minimum-wage increase, 1 for observations after it); the model can then be written as:
\[ FTEemployment_{it} = \alpha + \beta \, state_i + \gamma \, time_t + \delta \, (state_i \times time_t) + \epsilon_{it} \]
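
Here \(\delta\) is the difference-in-differences parameter: taking conditional expectations of this model and differencing, the state and time effects cancel, leaving
\[ \delta = \big(E[FTEemployment \mid state=1, time=1] - E[FTEemployment \mid state=1, time=0]\big) - \big(E[FTEemployment \mid state=0, time=1] - E[FTEemployment \mid state=0, time=0]\big), \]
i.e. the change in mean FTE employment in New Jersey minus the change in mean FTE employment in Pennsylvania.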

Part 3: Optional Questions

e. Run the regression you wrote up in part d

# DiD regression
# Reshape data first using the melt function from the reshape package
library(reshape)
emptot <- melt(cbind(my_data$emptot, my_data$emptot2))
state <- melt(cbind(my_data$state, my_data$state))
time <- c(rep(0, length(my_data$emptot)), rep(1, length(my_data$emptot2)))  #0 for before treatment and 1 for after treatment

# Create new data
my_newdata <- data.frame(cbind(emptot[,3], state[,3], time))    # create new data frame 

# Give variable names to the data frame
colnames(my_newdata) <- c("emptot", "state", "time")   # renaming the column names

# Run the DiD model for this new dataset
mod1 <- lm(emptot ~ state * time, data = my_newdata)  # state * time expands to state + time + the interaction state:time
stargazer(mod1, type = "text", title = "TABLE 3 output from Difference-In-Differences",
          align = TRUE, keep.stat = c("n","rsq"),
          dep.var.labels = c("Difference, NJ - PA"), covariate.labels = c("State", "Tretament time"))  # Display sample size and R-squared 
## 
## TABLE 3 output from Difference-In-Differences
## ==========================================
##                    Dependent variable:    
##                ---------------------------
##                      FTE employment      
## ------------------------------------------
## State                   -2.892**          
##                          (1.194)          
##                                           
## Treatment time           -2.166           
##                          (1.516)          
##                                           
## state:time                2.754           
##                          (1.688)          
##                                           
## Constant                23.331***         
##                          (1.072)          
##                                           
## ------------------------------------------
## Observations               794            
## R2                        0.007           
## ==========================================
## Note:          *p<0.1; **p<0.05; ***p<0.01

The interaction term gives \(\hat{\delta} = 2.75\), which matches the estimate obtained in part (c) and the corresponding estimate in the article; with these conventional OLS standard errors, however, it is not statistically significant at the 5% level.
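
Because the stacked data contain two observations per store, one refinement is to cluster the standard errors at the store level. A minimal sketch, assuming the sandwich and lmtest packages are installed; the store identifier is rebuilt with rep(), relying on the fact that the wave-1 rows were stacked above the wave-2 rows when the data were reshaped:

# Cluster-robust standard errors at the store level (sketch)
library(sandwich)
library(lmtest)
my_newdata$id <- rep(my_data$id, 2)                        # store id, repeated for the two stacked waves
mod1_cl <- lm(emptot ~ state * time, data = my_newdata)    # same coefficients as mod1
coeftest(mod1_cl, vcov = vcovCL(mod1_cl, cluster = ~ id))  # clustered standard errors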

f. Compute the difference-in-differences estimator “by hand”. Don’t use a regression.

To do this, I reuse the code from part b (verifying the data), which gives the mean FTE employment by state in each wave. I then take the difference in means between the two states within each wave, and finally the difference between those two within-wave gaps, which is the parameter of interest.

# DiD by hand (wo regression)
# Mean of FTE 
fte <- t(my_data %>%              # transpose because we want fte in a row, not column
           group_by(state) %>%    # grouping by states
           summarise(across(c(emptot, emptot2), list(mean = ~ mean(.x, na.rm = TRUE)))))   # gives average of FTE
colnames(fte) <- c("PA", "NJ")                         # provide column names                 
fte <- round(fte[-1,], 2)
rownames(fte) <- c("FTEemployment1", "FTEemployment2")  # change row names
fte
##                   PA    NJ
## FTEemployment1 23.33 20.44
## FTEemployment2 21.17 21.03
# Within-wave gap in mean FTE employment (PA minus NJ)
MeanDiff <- fte[, 1] - fte[, 2]
# Difference-in-differences: wave-1 gap minus wave-2 gap
DiD <- unname(MeanDiff[1] - MeanDiff[2])
DiD
## [1] 2.75

The by-hand calculation gives an estimate of 2.75, which matches the regression estimates above and the estimate reported in the article.
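
An equivalent cross-check computes the double difference directly from the unrounded wave means (this uses every non-missing observation in each wave, so the result can differ slightly in the second decimal from the rounded calculation above and from the regressions, which use somewhat different samples):

# DiD from unrounded means: (NJ after - NJ before) - (PA after - PA before)
with(my_data,
     (mean(emptot2[state == 1], na.rm = TRUE) - mean(emptot[state == 1], na.rm = TRUE)) -
     (mean(emptot2[state == 0], na.rm = TRUE) - mean(emptot[state == 0], na.rm = TRUE)))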