Part 1: Paper using randomized data: Impact of Class Size on Learning

Download and go over this seminal paper by Alan Krueger: Krueger (1999), "Experimental Estimates of Education Production Functions," QJE 114(2): 497-532.

c. What is the identification strategy?

ANSWER: The identification strategy is that any change in educational outcomes comes solely from the random assignment of students into classes of different sizes, and not from any other factor affecting those outcomes.

d. What are the assumptions / threats to this identification strategy? (Answer specifically with reference to the data the authors are using) (For instance: “This identification strategy would not be revealing a causal effect if [insert potential issue]”)

ANSWER: Again, the threat to the identification strategy is contamination of the sample through attrition and the entry of new students into classes of different sizes over the study period. Parents switching their kids into smaller, better-performing classes makes the experimental results tenuous. Furthermore, the initial sorting of kindergartners into these groups is claimed to be random, but student characteristics such as parental education and income status must be compared across groups to verify that the class-size groups do not differ statistically.

Part 2: Paper using Twins for Identification: Economic Returns to Schooling

2.1 Briefly answer these questions:

c. What is the identification strategy?

ANSWER: The identification strategy relies on a purportedly random sample of twins collected at a large, popular festival that draws twins from across the US and the world. Moreover, interviewing each twin separately and asking about both their own and their sibling's education and wages helps address measurement error in self-reported characteristics.

d. What are the assumptions / threats to this identification strategy? (Answer specifically with reference to the data the authors are using)

ANSWER: The threats to the identification are:

  • First, recruiting twins at a twins festival could bias the sample: only those with the resources and interest to attend are captured. This makes the "randomization" tenuous, since it ignores twins who do not attend.

  • Second, since these twins share a similar educational background, it is hard to tell whether the impact on wages really comes from the difference in schooling or from factors such as opportunities, internships, and other scholarly activities driving the wage differential.

2.2. Replication analysis

a. Load Ashenfelter and Krueger AER 1994 data.

setwd("C:\\Users\\Akash\\Dropbox\\UGA\\AAEC8610AdvEcotrix_Filipski\\FilipskiHW9")
library(tidyverse)
library(haven)
library(stargazer)
mydata <- read_dta(file = "AshenfelterKrueger1994_twins.dta")
head(mydata)

b. Reproduce the result from Table 3, column 5.

# Within-pair (first) differences in log wage and years of education
mydata$wagediff = mydata$lwage2 - mydata$lwage1
mydata$educdiff = mydata$educ2 - mydata$educ1

# First-difference regression: Table 3, column 5
firstdiff.model = lm(wagediff ~ educdiff, data = mydata)
stargazer(firstdiff.model,
          header=FALSE, type='html',
          font.size="small", digits=3,
          omit.stat=c("adj.rsq", "ser", "f"),
          title = "Table 3, Column 5")
Table 3, Column 5
=============================================
                      Dependent variable:
                      -----------------------
                            wagediff
---------------------------------------------
educdiff                    0.092***
                            (0.024)
Constant                    0.079*
                            (0.045)
---------------------------------------------
Observations                  149
R2                           0.092
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01
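
As a quick cross-check (my addition, not required by the assignment), the simple-regression slope can be recovered directly from the sample moments:

# Sanity check: for bivariate OLS, slope = cov(y, x) / var(x)
dd <- na.omit(mydata[, c("wagediff", "educdiff")])
with(dd, cov(wagediff, educdiff) / var(educdiff))  # should be ~0.092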

We also replicate Figure 1:

plot(mydata$educdiff, mydata$wagediff, main="FIGURE 1. INTRAPAIR RETURNS TO SCHOOLING, IDENTICAL TWINS",
   xlab="Difference in Years of Schooling", ylab="Difference in Log Hourly Wage", pch=19)
abline(lm(wagediff~educdiff, data=mydata), col="red") # regression line (y~x)

c. Explain how this coefficient should be interpreted.

ANSWER: The first-difference estimate tells us that if the within-pair difference in schooling increases by one year, the within-pair difference in log wages increases by about 9.2 percent, ceteris paribus. Thus the return to schooling is positive and statistically significant.
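
Formally (a sketch of the standard twins model the paper builds on), suppose the log wage of twin i in family f is

\[ \ln w_{if} = \alpha + \beta S_{if} + A_f + \varepsilon_{if}, \]

where \(A_f\) is an unobserved family component (ability, background) shared by both twins. Differencing within the pair removes it:

\[ \ln w_{2f} - \ln w_{1f} = \beta\,(S_{2f} - S_{1f}) + (\varepsilon_{2f} - \varepsilon_{1f}), \]

so the first-difference estimate of \(\beta\) is purged of family-level confounders.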

d. Reproduce the result in Table 3, column 1. You will need to reshape the data first.

library(tidyr)
# Drop columns 11-12 (the difference variables created above) before reshaping
mydata.wide = mydata[-(11:12)]
# Make sure the family identifier is a factor
mydata.wide$famid <- factor(mydata.wide$famid)

# First reshape to long: one row per family-variable-twin combination
mydata.long <- gather(mydata.wide, variables, values, educ1:white2, factor_key=TRUE)
## Warning: attributes are not identical across measure variables;
## they will be dropped
# Split variable names like "educ1" into the variable ("educ") and the twin id ("1")
separate_DF <- mydata.long %>% separate(variables, sep = "(?<=[A-Za-z])(?=[0-9])", c("xvar", "twinid"))
# Spread back out so each row is one twin, with one column per variable
spread_df <- separate_DF %>% spread(xvar, values)
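
A quick structural check on the reshape (my addition): with 149 twin pairs, the long data should contain two rows per family, 298 in total, matching the observation count in the regression below.

# Sanity check the reshape: expect 2 rows per family (298 in total)
dim(spread_df)
head(spread_df)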



## Generate age squared / 100
spread_df$agesq = (spread_df$age*spread_df$age)/100


## Now run your OLS regression


ols.model = lm(lwage ~ educ+age+agesq+male+white, data = spread_df)
##print(summary(ols.model),digits=3)
stargazer(ols.model,
          header=FALSE, type='html',
          font.size="small", digits=3,
          omit.stat=c("rsq", "ser", "f"),
          title = "Table 3, Column 1")
Table 3, Column 1
=============================================
                      Dependent variable:
                      -----------------------
                             lwage
---------------------------------------------
educ                        0.084***
                            (0.014)
age                         0.088***
                            (0.019)
agesq                       -0.087***
                            (0.023)
male                        0.204***
                            (0.063)
white                       -0.410***
                            (0.127)
Constant                    -0.471
                            (0.426)
---------------------------------------------
Observations                  298
Adjusted R2                  0.260
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01
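
One caveat worth noting (my addition, not part of the assignment): each family contributes two observations, so errors are likely correlated within family. A minimal robustness sketch, clustering the standard errors by famid with the sandwich package (the point estimates are unchanged; only the standard errors adjust):

# Robustness sketch: cluster standard errors at the family level,
# since twins share unobserved family-level shocks
library(lmtest)
library(sandwich)
coeftest(ols.model, vcov = vcovCL(ols.model, cluster = ~ famid))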

e. Explain how the coefficient on education should be interpreted.

ANSWER: The OLS estimate of the return to schooling tells us that each additional year of schooling completed is associated with a wage gain of about 8.4 percent, ceteris paribus.

f. Explain how the coefficient on the control variables should be interpreted.

ANSWER:

  • Age: The linear term implies wages rise by roughly 8.8% with each additional year of age, ceteris paribus; the full marginal effect of age also involves the quadratic term below.

  • Age-squared: This variable captures the non-linear effect of age: the negative coefficient means wages rise with age at a decreasing rate and decline after a peak (a concave profile). Since agesq is age squared divided by 100, the peak occurs at \[ -\dfrac{\hat\beta_{age}}{2\,\hat\beta_{agesq}/100} = -\dfrac{0.088}{2\times(-0.087)/100} \approx 50.5 \text{ years} \] (see the short computation after this list).

  • Male: Being male is associated with a 20.4% higher wage, ceteris paribus.

  • White: Being white is associated with a 41.0% lower wage in this sample, ceteris paribus. The authors note this differs from earlier CPS-based studies.
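
The peak-age arithmetic from the age-squared bullet, computed explicitly (my addition; note the division by 100, because agesq was defined as age squared over 100):

# Peak of the concave age profile: set d(lwage)/d(age) = 0 and solve for age
-coef(ols.model)["age"] / (2 * coef(ols.model)["agesq"] / 100)  # ~50.5 years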

Part 3: Paper using Difference-in-Differences: Impact of Minimum Wages.

Reference: Card and Krueger (1994), "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," AER 84(4): 772-793.

3.1. Briefly answer these questions:

c. What is the identification strategy?

ANSWER: The identification strategy relies on two things. First, on the assumption that the employment effect of the minimum-wage rise is not obscured by the economy being in a good period. Second, under the parallel-trends assumption, eastern Pennsylvania fast-food restaurants serve as a control group for the treatment group of New Jersey fast-food restaurants.
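
In equation form, the estimator compares the change in mean FTE employment across the two states:

\[ \hat{\delta}_{DiD} = \left(\bar{Y}_{NJ,\,after} - \bar{Y}_{NJ,\,before}\right) - \left(\bar{Y}_{PA,\,after} - \bar{Y}_{PA,\,before}\right) \]

Under parallel trends, the Pennsylvania term nets out what would have happened to New Jersey employment in the absence of the minimum-wage increase.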

d. What are the assumptions / threats to this identification strategy? (Answer specifically with reference to the data the authors are using)

ANSWER:

    1. Control-group choice: The authors use eastern Pennsylvania fast-food restaurants as the control group for the treated New Jersey restaurants, on the grounds that the two behave similarly. If the characteristics of the two groups differ, the "parallel trends" assumption could be invalidated.
    2. The time lag between the pre-treatment and post-treatment surveys may conceal other developments: the second wave was conducted about eight months after the policy took effect, and in the interim the competitive environment could have changed through the entry of new fast-food or other restaurants.
    3. McDonald's exclusion from the survey could hide part of the policy's true effect on employment, since McDonald's had a large market share in that era.

3.2. Replication Analysis

(Note: for this one, you will not obtain exactly the same results as the paper - but close.)

a. Load data from Card and Krueger AER 1994. You can load it directly from my website here.

ANSWER:

rm(list = ls())  # clear the workspace
setwd("C:\\Users\\Akash\\Dropbox\\UGA\\AAEC8610AdvEcotrix_Filipski\\FilipskiHW9")

# Load the Card-Krueger fast-food data
task3data = read.csv("CardKrueger1994_fastfood.csv", header = TRUE)

head(task3data)
attach(task3data)  # so columns can be referenced directly below

b. Verify that the data is matching that of the paper. Reproduce the % of Burger King, KFC, Roys, and Wendys, as well as the FTE means in the 2 waves (top of Table 2).

ANSWER:

## Computing Table 2 statistics to verify the data
# DISTRIBUTION OF STORES (shares in percent)

### New Jersey (state == 1)
njstore.bk = format(round(mean(bk[state==1], na.rm=TRUE)*100,1), nsmall=1)
njstore.kfc = format(round(mean(kfc[state==1], na.rm=TRUE)*100,1), nsmall=1)
njstore.roys = format(round(mean(roys[state==1], na.rm=TRUE)*100,1), nsmall=1)
njstore.wendys = format(round(mean(wendys[state==1], na.rm=TRUE)*100,1), nsmall=1)

### Pennsylvania (state == 0)
pennstore.bk = format(round(mean(bk[state==0], na.rm=TRUE)*100,1), nsmall=1)
pennstore.kfc = format(round(mean(kfc[state==0], na.rm=TRUE)*100,1), nsmall=1)
pennstore.roys = format(round(mean(roys[state==0], na.rm=TRUE)*100,1), nsmall=1)
pennstore.wendys = format(round(mean(wendys[state==0], na.rm=TRUE)*100,1), nsmall=1)


##### T-TESTS: NJ vs PA store shares (Welch, unequal variances)
tstat.bk = t.test(bk[state==1], bk[state==0], var.equal = FALSE)
tstat.kfc = t.test(kfc[state==1], kfc[state==0], var.equal = FALSE)
tstat.roys = t.test(roys[state==1], roys[state==0], var.equal = FALSE)
tstat.wendys = t.test(wendys[state==1], wendys[state==0], var.equal = FALSE)

# WAVE 1: MEAN FTE
FTE.NJ.wave1.mean = round(mean(emptot[state==1], na.rm=TRUE),1)
FTE.NJ.wave1.sd = sd(emptot[state==1], na.rm=TRUE)
FTE.NJ.wave1.se = round(
  FTE.NJ.wave1.sd/sqrt(length(which(!is.na(emptot[state==1])))), 2
)

FTE.PENN.wave1.mean = round(mean(emptot[state==0], na.rm=TRUE),1)
FTE.PENN.wave1.sd = sd(emptot[state==0], na.rm=TRUE)
FTE.PENN.wave1.se = round(
  FTE.PENN.wave1.sd/sqrt(length(which(!is.na(emptot[state==0])))), 2
)

##### T-TEST: WAVE 1 FTE (NJ vs PA)
tstat.fte.wave1 = t.test(emptot[state==1], emptot[state==0], var.equal = FALSE)

# WAVE 2: MEAN FTE
FTE.NJ.wave2.mean = format(round(mean(emptot2[state==1], na.rm=TRUE),1), nsmall=1)
FTE.NJ.wave2.sd = sd(emptot2[state==1], na.rm=TRUE)
FTE.NJ.wave2.se = round(
  FTE.NJ.wave2.sd/sqrt(length(which(!is.na(emptot2[state==1])))),2
)
FTE.PENN.wave2.mean =format(round(mean(emptot2[state==0], na.rm=TRUE),1), nsmall=1)
FTE.PENN.wave2.sd = sd(emptot2[state==0], na.rm=TRUE)
FTE.PENN.wave2.se = round(
  FTE.PENN.wave2.sd/sqrt(length(which(!is.na(emptot2[state==0])))), 2
)

##### T-TEST: WAVE 2 FTE (NJ vs PA)
tstat.fte.wave2 = t.test(emptot2[state==1], emptot2[state==0], var.equal = FALSE)

This is the reproduction of Table 2; it verifies that our data matches the paper.

##  TABLE 2:

row1 = c(njstore.bk, pennstore.bk, round(tstat.bk$statistic,1) )
row2 = c(njstore.kfc, pennstore.kfc, round(tstat.kfc$statistic,1) )
row3 = c(njstore.roys, pennstore.roys, round(tstat.roys$statistic,1) )
row4 = c(njstore.wendys, pennstore.wendys, round(tstat.wendys$statistic,1) )
row5 = c("","","")
row6 = c(FTE.NJ.wave1.mean, FTE.PENN.wave1.mean, format(round(tstat.fte.wave1$statistic,1), nsmall=1))
row7 = c(FTE.NJ.wave1.se, FTE.PENN.wave1.se,"")
row8 = c("","","")
row9 = c(FTE.NJ.wave2.mean, FTE.PENN.wave2.mean, format(round(tstat.fte.wave2$statistic,1), nsmall=1))
row10 = c(FTE.NJ.wave2.se, FTE.PENN.wave2.se,"")
tab2 <- data.frame(
  rbind(row1, row2, row3, row4, row5, row6, row7, row8, row9, row10)
)
row.names(tab2) <- c("a. Burger King", "b. KFC", "c. Roy Rogers", "d. Wendy's", "Means in Wave 1", "FTE Employment (Wave 1)", "Std. Err. (Wave 1)", "Means in Wave 2", "FTE Employment (Wave 2)", "Std. Err. (Wave 2)")
colnames(tab2) <- c("Stores: NJ", "Stores: PA", "t-statistic")
tab2

c. Compute the difference-in-differences estimator “by hand”.

Don’t use a regression. Reproduce the columns 1,2,3 rows 1,2,4 (not 3) of the top-left corner of Table 3 in the paper. You will not obtain the exact same estimates - but pretty close (differences only on the decimals). You can skip computing the standard errors by hand if you are not sure how to do that. My table is transposed compared to theirs, but the results are the same. If you have time, try to make a table that matches theirs, but that isn’t the point here.

ANSWER:

Let's first get columns 1, 2 and 3 of row 1 of Table 3: FTE employment before the minimum-wage policy change.

### Row 1: FTE employment before the policy change, all available observations

## PENNSYLVANIA (COLUMN 1):
FTE.PENN.before = round(mean(emptot[state==0], na.rm=TRUE),2)
FTE.PENN.before.sd = sd(emptot[state==0], na.rm=TRUE)
FTE.PENN.before.se = round(FTE.PENN.before.sd/sqrt(length(which(!is.na(emptot[state==0])))), 2)


##NEW JERSEY (COLUMN 2):

FTE.NJ.before = round(mean(emptot[state==1], na.rm=TRUE),2)
FTE.NJ.before.sd = sd(emptot[state==1], na.rm=TRUE)
FTE.NJ.before.se = round(FTE.NJ.before.sd/sqrt(length(which(!is.na(emptot[state==1])))), 2)


## DIFFERENCE BETWEEN COLUMN 2 AND COLUMN 1 [NJ - PA]:
## Calculate the difference between the means
diff.means <- FTE.NJ.before - FTE.PENN.before

## GETTING THE STANDARD ERROR OF DIFFERENCE OF MEANS
obs <- c(length(which(!is.na(emptot[state==0]))),length(which(!is.na(emptot[state==1]))))
SDs <- c(FTE.PENN.before.sd, FTE.NJ.before.sd)

# Standard error of difference

se.diff.before = round(sqrt(
  ((SDs[1]^2)/obs[1]) +
    ((SDs[2]^2)/obs[2]) 
), 2
)
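
The standard error just computed is the usual unequal-variance formula for the difference of two independent sample means:

\[ SE_{\bar{Y}_{NJ} - \bar{Y}_{PA}} = \sqrt{\dfrac{s_{NJ}^2}{n_{NJ}} + \dfrac{s_{PA}^2}{n_{PA}}} \]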



## put means & Se into a vector
means.FTE.before<-c(FTE.PENN.before,FTE.NJ.before, diff.means)
se.FTE.before<- c(FTE.PENN.before.se,FTE.NJ.before.se, se.diff.before)

## add means to dataframe
tab3 <- rbind(means.FTE.before)
tab3<- rbind(tab3,se.FTE.before)
row.names(tab3)[1] <- "FTE employment before"
row.names(tab3)[2] <- "Std. Err. (Before)"

Now we calculate columns 1, 2 and 3 of row 2 of Table 3: FTE employment after the minimum-wage policy change.

### Row 2: FTE employment after the policy change, all available observations

## PENNSYLVANIA (COLUMN 1):

FTE.PENN.after = round(mean(emptot2[state==0], na.rm=TRUE),2)
FTE.PENN.after.sd = sd(emptot2[state==0], na.rm=TRUE)
FTE.PENN.after.se = round(FTE.PENN.after.sd/sqrt(length(which(!is.na(emptot2[state==0])))), 2)


##NEW JERSEY (COLUMN 2):

FTE.NJ.after = round(mean(emptot2[state==1], na.rm=TRUE),2)
FTE.NJ.after.sd = sd(emptot2[state==1], na.rm=TRUE)
FTE.NJ.after.se = round(FTE.NJ.after.sd/sqrt(length(which(!is.na(emptot2[state==1])))), 2)

## DIFFERENCE BETWEEN COLUMN 2 AND COLUMN 1 [NJ - PA]:
## Calculate the difference between the means
diff.means.after <- FTE.NJ.after - FTE.PENN.after

## GETTING THE STANDARD ERROR OF DIFFERENCE OF MEANS
obs <- c(length(which(!is.na(emptot2[state==0]))),length(which(!is.na(emptot2[state==1]))))
SDs <- c(FTE.PENN.after.sd, FTE.NJ.after.sd)

# Standard error of difference

se.diff.after = round(sqrt(
  ((SDs[1]^2)/obs[1]) +
    ((SDs[2]^2)/obs[2]) 
), 2
)



## put means & Se into a vector
means.FTE.after<-c(FTE.PENN.after,FTE.NJ.after, diff.means.after)
se.FTE.after<- c(FTE.PENN.after.se,FTE.NJ.after.se, se.diff.after)

## add means to dataframe
tab3 <- rbind(tab3, means.FTE.after)
tab3<- rbind(tab3,se.FTE.after)
row.names(tab3)[3] <- "FTE employment After"
row.names(tab3)[4] <- "Std. Err. (After)"

Now, we calculate row 4 (columns 1, 2, 3) of Table 3. For this we need a balanced sample.

### Row 4: Change in mean FTE employment, balanced sample of stores

# Keep only stores observed in both waves (drop rows with missing values)
smalldf = subset(task3data, select=c(id, state, emptot, emptot2, demp))
smalldf = na.omit(smalldf)

## PENNSYLVANIA

FTE.PENN.before = round(mean(smalldf$emptot[smalldf$state==0], na.rm=TRUE),2)
FTE.PENN.after = round(mean(smalldf$emptot2[smalldf$state==0], na.rm=TRUE),2)
## Calculate the difference between the means
diff.means.PENN <- FTE.PENN.after - FTE.PENN.before
# Standard error of difference

se.diff.PENN = round(
  sqrt(var(smalldf$demp[smalldf$state==0])/length(smalldf$demp[smalldf$state==0])),
  2)

##NEW JERSEY
FTE.NJ.before = round(mean(smalldf$emptot[smalldf$state==1]),2)
FTE.NJ.after = round(mean(smalldf$emptot2[smalldf$state==1]),2)
## Calculate the difference between the means
diff.means.NJ <- FTE.NJ.after - FTE.NJ.before

# Standard error of difference

se.diff.NJ = round(
  sqrt(var(smalldf$demp[smalldf$state==1])/length(smalldf$demp[smalldf$state==1])),
  2)


## DIFFERENCES IN DIFFERENCES
## COLUMN 2-Column 1 (NJ-PA)

diffindiff <- diff.means.NJ-diff.means.PENN 

## Std Err
se.diffindiff = round(sqrt(se.diff.PENN^2+se.diff.NJ^2),2)
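
Since the two state samples are independent, the standard error of the difference-in-differences is obtained by adding the squared standard errors of the two mean changes:

\[ SE_{DiD} = \sqrt{SE_{\Delta NJ}^2 + SE_{\Delta PA}^2} \]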


## put means & Se into a vector
diff.FTE<-c(diff.means.PENN ,diff.means.NJ , diffindiff)
se.FTE.diff<- c(se.diff.PENN,se.diff.NJ, se.diffindiff)

This is the reproduced Table 3:

## add means to dataframe
tab3 <- rbind(tab3, diff.FTE)
tab3<- rbind(tab3,se.FTE.diff)
row.names(tab3)[5] <- "Change in Mean FTE employment, balanced"
row.names(tab3)[6] <- "Std. Err. (Change in Mean FTE)"


tab3.df = data.frame(tab3)
colnames(tab3.df)[1] = "PA"
colnames(tab3.df)[2] = "NJ"
colnames(tab3.df)[3] = "Difference: NJ-PA"



tab3.df

d. Interpret the difference-in-differences estimator. Does it (roughly) match the one in the paper?

ANSWER:

  • The estimates closely match those in the paper (differences only in the decimals).

  • The difference-in-differences estimator tells us whether the mean change in the outcome (here, FTE employment in the fast-food industry) from before to after the introduction of the minimum-wage policy differed between Pennsylvania and New Jersey. We see a "relative gain" in NJ employment of 2.75 FTE workers (statistically significant at the 5% level).

e. Use OLS to obtain the same Diff-in-diff estimator as you just did.

ANSWER:

smalldf = subset(task3data, select=c(id, state, emptot,emptot2, demp))
smalldf = na.omit(smalldf)
lmfit <-lm(demp~state, data = smalldf)
#Load libraries
library("lmtest")
library("sandwich")
# Robust t test
#Using "HC1" will replicate the robust standard errors you would obtain using STATA.
simple.ols<- coeftest(lmfit, vcov = vcovHC(lmfit, type = "HC1"))

Regressing the store-level employment change demp on the state dummy recovers the DiD directly: the constant is Pennsylvania's mean change and the state coefficient is the NJ-PA difference in changes. The results are presented in the table below:

library(stargazer)

stargazer(simple.ols, header=FALSE, type='html', font.size="small", digits=2,
omit.stat=c("adj.rsq", "ser", "f"), title = "OLS to obtain Diff-in-diff")
OLS to obtain Diff-in-diff
=============================================
                      Dependent variable:
                      -----------------------
                              demp
---------------------------------------------
state                        2.75**
                             (1.34)
Constant                     -2.28*
                             (1.25)
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01
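
As a quick cross-check (my addition, assuming the objects from part (c) are still in memory), the coefficient on state should equal the by-hand estimator:

# Cross-check: the state coefficient equals the by-hand diff-in-diff from (c)
coef(lmfit)["state"]  # ~2.75, same as diffindiff computed above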

f. Reshape your data to long form.

ANSWER:

rm(list = ls())  # clear the workspace again
task3data = read.csv("CardKrueger1994_fastfood.csv", header = TRUE)

# Give the two employment waves a numeric suffix for reshaping
task3data$emptot0 <- task3data$emptot
task3data$emptot1 <- task3data$emptot2

library(tidyr)
task3.wide = subset(task3data, select = c("id", "state", "emptot0", "emptot1"))
# Make sure the store identifier is a factor
task3.wide$id <- factor(task3.wide$id)

#First to long
task3.long <- gather(task3.wide, variables, empvalue, emptot0:emptot1, factor_key=TRUE)

# Separate the text from numeric
separate_DF <- task3.long %>% separate(variables,  sep = "(?<=[A-Za-z])(?=[0-9])", c("xvar", "treatdate"))
# Now spread it to look like a long data we need. 
spread_df <- separate_DF %>% spread(xvar, empvalue)

spread_df$state<-as.factor(spread_df$state)
spread_df$treatdate<-as.factor(spread_df$treatdate)

summary(spread_df)
##        id      state   treatdate     emptot     
##  407    :  4   0:158   0:410     Min.   : 0.00  
##  1      :  2   1:662   1:410     1st Qu.:14.50  
##  2      :  2                     Median :20.00  
##  3      :  2                     Mean   :21.03  
##  4      :  2                     3rd Qu.:25.50  
##  5      :  2                     Max.   :85.00  
##  (Other):806                     NA's   :26
head(spread_df)

g. Run the appropriate DiD regression and comment on the result.

ANSWER:

## Panel regression using the plm package
library(plm)

fe <- plm(emptot ~ state + treatdate + state*treatdate,
          data = spread_df, index = c("id"))
fe.robust <- coeftest(fe, vcov = vcovHC(fe, type = "HC1"))
stargazer(fe.robust, header=FALSE, type='html', font.size="small", digits=2,
omit.stat=c("adj.rsq", "ser", "f"), title = "Fixed Effect Diff-In-Diff")
Fixed Effect Diff-In-Diff
=============================================
                      Dependent variable:
                      -----------------------
                             emptot
---------------------------------------------
state1                        0.88
                             (0.67)
treatdate1                   -2.28*
                             (1.25)
state1:treatdate1            2.75**
                             (1.34)
=============================================
Note:             *p<0.1; **p<0.05; ***p<0.01

COMMENT: The regression results exactly match my by-hand Table 3 computations: the diff-in-diff estimator is 2.75 (s.e. 1.34). The interpretation, again, is that after the minimum-wage change, FTE employment in New Jersey rose by 2.75 relative to Pennsylvania. Running a regression is thus an easier way to obtain the diff-in-diff estimator.
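
Why the interaction term is the DiD (a sketch): in the two-group, two-period specification

\[ Y_{it} = \alpha + \beta\,state_i + \gamma\,treatdate_t + \delta\,(state_i \times treatdate_t) + \varepsilon_{it}, \]

taking the four cell means gives

\[ \delta = \left(E[Y \mid NJ, post] - E[Y \mid NJ, pre]\right) - \left(E[Y \mid PA, post] - E[Y \mid PA, pre]\right), \]

which is exactly the quantity computed by hand in 3.2(c).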