Suppose the governor of the state of Georgia has proposed a new labor
policy that seeks to increase the wages of workers in the manufacturing
industry. This is in light with the state of Georgia wanting to reduce
the importation and over reliance of manufactured goods from other
neighboring states. Suppose again that we want to estimate the effect of
this policy on the wages of workers in two industries, manufacturing and
services. We have two years of data before and two years of data after
the policy with a simultaneous treatment on the third year.
We generate fake data to simulate this scenario with the following
code.
set.seed(1234)
n <- 1200 # set number of observations to 1200
year <- rep(c(0, 1, 2, 3), each = n/4) # create the year dummies
industry <- rep(c(0, 1), n/2) # we create the industry dummy with services=0 and manufacturing=1
treatment <- c(rep(0,600), rep(1,600)) # 4 years of data: year 3 is the beginning of the policy
error <- rnorm(n, 0, 10)
wage <- 50 +80*year + 90*industry + 120*treatment*industry + error # the data generating process
data <- data.frame(wage, year, industry, treatment)
Here, year variable is a time variable that equal to
0,1,2,3 for the first, second, third and fourth year respectively.
industry variable is a binary variable indicating whether
the worker is in services industry (0) or the manufacturing industry
(1), and treatment is a binary variable indicating whether
the worker was affected by the policy (1) or not (0). The error term is
normally distributed with a mean of 0 and a standard deviation of 10.
Furthermore we assume the data generating process (DGP) of wages is at
follows: $wage=50 + 80year + 90industry +
120treatmentindustry + error $ With this DGP, our pre-assumed
average treatment effect on the treated (ATT) of the policy is 120.
library(ggplot2)
ggplot(data, aes(x = year, y = wage, color = factor(industry))) +
geom_point() +
ggtitle("Wages by industry and year") +
ylab("Wage") +
xlab("Year") +
scale_color_discrete(name = "Industry")
This produces a scatterplot of wages by year and industry, with the color indicating the industry. We can see that the wages are higher in the manufacturing industry than in the services industry through out the years. Additionally the gap between the wages between these two industries was similar until after the policy was introduced in the third year that we see a discrete jump in wages for the manufacturing workers. With this graph we are convinced that we had a parallel pre-trends assumption between wages before the policy was passed.
To pick up the “true” ATT, i will run the DID regression with Time-Period Fixed Effects. With the true DID regression with Time-Period FE, I specify the model as; \(Wage_{i,t}=\alpha + \beta Industry_{i}+\tau T + \gamma (Industry \times Treatment) + \epsilon_{i,t}\) where, \(Wage_{i,t}\) is the wage for worker \(i\) in time period \(t\), \(Industry_{i}\) is a binary variable indicating whether the worker is in services industry (0) or the manufacturing industry (1). \(Treatment_t\) is a binary variable indicating whether the worker was affected by the policy (1) or not (0). \(T\) is the time dummies and \(\epsilon_{i,t}\) is the error term. Here, I used the plm package for my regression
# DID regression with Tme FE
library(plm)
did_fe <- plm(wage~industry+treatment*industry, data=data, index = "year", model="within")
# Presenting result table
library(stargazer)
stargazer(did_fe, type="text",
no.space=TRUE, keep.stat = c("n","rsq"),
title = " DID REGRESSION WITH TIME FIXED EFFECT",
dep.var.labels = "WAGES")
##
## DID REGRESSION WITH TIME FIXED EFFECT
## ==============================================
## Dependent variable:
## ---------------------------
## WAGES
## ----------------------------------------------
## industry 90.172***
## (0.808)
## industry:treatment 119.223***
## (1.143)
## ----------------------------------------------
## Observations 1,200
## R2 0.985
## ==============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The regression results show that the coefficient of the interaction term between industry and treatment is positive and statistically significant \((\gamma= 119.2, p < 0.001)\). This means that the estimated treatment effect is statistically significant in the third and fourth post-policy-change years, indicating that the policy change had a positive effect on wages for the manufacturing treatment group. As illustrated in part 1 where the pre-assumed ATT was 120, this regression was able to pick up the true ATT as expected.
Here, I was trying to plot the co-efficient estimate from my regression but it was not working out for me. I will be glad with some suggestion on this.
Let’s imagine a scenario in which students only get scholarships if their test scores are above a particular cutoff. For 600 students, we will create fictitious data as follows:
# Generate fake data
n <- 600
test_score <- rnorm(n, mean = 50, sd = 10)
threshold <- 60
treatment <- ifelse(test_score >= threshold, 1, 0)
eps <- rnorm(n, mean = 0, sd = 5)
Y <- 10 + 200*treatment + 5*test_score + eps
# Combine into a data frame
rdddata <- data.frame(Y, test_score, treatment)
Where y is the outcome variable (scholarship),
test_score is a running variable which contains the exam
score of the 600 students. treatment a binary variable
which is equal to 1 if students score is above or equal to the threshold
(60) and 0 otherwise. Suppose that the outcome variable follows a data
generating process as; \(Y_i= 10 +
200*treamtment_i + 5*test_score+ \epsilon_i\) where \(\epsilon_i\) is the error term. Here our
pre-determined LATE is 200.
We can create a scatterplot of test scores vs. outcomes, with a vertical line at the threshold to visually see the discontinuity.
# Plot data
library(ggplot2)
ggplot(rdddata, aes(x = test_score, y = Y)) +
geom_point() +
geom_vline(xintercept = threshold, color = "red")
Here we see that the threshold (60), there is a discontinuity in the
outcome variable for those below and those above. The gap between these
two line depicts the LATE.
In order to pick up the “true” effect, I will run an rdd regression
as follows. $Y_i=+ Treatment_i + Testscore_i + _i $ where all variables
follow the same definition as above and \(\beta\) is the LATE to be estimated. In
addition, we will use the rdd package to estimate a
regression discontinuity model with a linear model on each side of the
threshold.
# Load rdd package
library(rdd)
# Run regression
rdd_model <- RDestimate(Y ~ test_score, data=rdddata, cutpoint=60, kernel="rectangular", bw=2)
# Print results
summary(rdd_model)
##
## Call:
## RDestimate(formula = Y ~ test_score, data = rdddata, cutpoint = 60,
## bw = 2, kernel = "rectangular")
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 2 57 195.9 2.416 81.07 0
## Half-BW 1 29 198.3 3.280 60.46 0
## Double-BW 4 117 197.6 1.746 113.19 0
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 7644 3 53 0
## Half-BW 3485 3 25 0
## Double-BW 18395 3 113 0
On each side of the threshold, we decide to employ a local linear model, with the bandwidth equal to 2 and a triangle kernel. Although there are additional options in the rdd package, these are the options that are frequently used in practice.
# Print results
summary(rdd_model)
##
## Call:
## RDestimate(formula = Y ~ test_score, data = rdddata, cutpoint = 60,
## bw = 2, kernel = "rectangular")
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 2 57 195.9 2.416 81.07 0
## Half-BW 1 29 198.3 3.280 60.46 0
## Double-BW 4 117 197.6 1.746 113.19 0
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 7644 3 53 0
## Half-BW 3485 3 25 0
## Double-BW 18395 3 113 0
The summary output of our regression model shows that the estimated effect of the scholarship on the outcome is 203.3, with a standard error of 2.523 statistically significant at 1% level. This suggests that the scholarship has a positive effect on the outcome. Also, the LATE which is 203.3 was approximately equal to the pre-determined LATE of 200. This implies that the regression was able to pick up the true effect as expected.
We can also plot the estimated treatment effect over a range of test scores to see how it varies around the threshold.
# Plot estimated treatment effect
plot(rdd_model, plot_ci = T)