Difference-in-Difference Estimator

1. Create a fake data to simulate a relevant situation

Suppose the governor of the state of Georgia has proposed a new labor policy that seeks to increase the wages of workers in the manufacturing industry. This is in light with the state of Georgia wanting to reduce the importation and over reliance of manufactured goods from other neighboring states. Suppose again that we want to estimate the effect of this policy on the wages of workers in two industries, manufacturing and services. We have two years of data before and two years of data after the policy with a simultaneous treatment on the third year.
We generate fake data to simulate this scenario with the following code.

set.seed(1234)
n <- 1200  # set number of observations to 1200
year <- rep(c(0, 1, 2, 3), each = n/4) # create the year dummies
industry <- rep(c(0, 1), n/2)  # we create the industry dummy with services=0 and manufacturing=1
treatment <- c(rep(0,600), rep(1,600)) # 4 years of data: year 3 is the beginning of the policy
error <- rnorm(n, 0, 10)
wage <- 50 +80*year + 90*industry + 120*treatment*industry + error  # the data generating process
data <- data.frame(wage, year, industry, treatment)

Here, year variable is a time variable that equal to 0,1,2,3 for the first, second, third and fourth year respectively. industry variable is a binary variable indicating whether the worker is in services industry (0) or the manufacturing industry (1), and treatment is a binary variable indicating whether the worker was affected by the policy (1) or not (0). The error term is normally distributed with a mean of 0 and a standard deviation of 10. Furthermore we assume the data generating process (DGP) of wages is at follows: $wage=50 + 80year + 90industry + 120treatmentindustry + error $ With this DGP, our pre-assumed average treatment effect on the treated (ATT) of the policy is 120.

2. Plot the data to show the pattern you are trying to estimate

library(ggplot2)
ggplot(data, aes(x = year, y = wage, color = factor(industry))) + 
  geom_point() +
  ggtitle("Wages by industry and year") +
  ylab("Wage") +
  xlab("Year") +
  scale_color_discrete(name = "Industry")

This produces a scatterplot of wages by year and industry, with the color indicating the industry. We can see that the wages are higher in the manufacturing industry than in the services industry through out the years. Additionally the gap between the wages between these two industries was similar until after the policy was introduced in the third year that we see a discrete jump in wages for the manufacturing workers. With this graph we are convinced that we had a parallel pre-trends assumption between wages before the policy was passed.

3. Use R to run relevant regressions

To pick up the “true” ATT, i will run the DID regression with Time-Period Fixed Effects. With the true DID regression with Time-Period FE, I specify the model as; $Wage_{i,t}=\alpha + \beta Industry_{i}+\tau T + \gamma (Industry \times Treatment) + \epsilon_{i,t}$ where, $Wage_{i,t}$ is the wage for worker $i$ in time period $t$, $Industry_{i}$ is a binary variable indicating whether the worker is in services industry (0) or the manufacturing industry (1). $Treatment_t$ is a binary variable indicating whether the worker was affected by the policy (1) or not (0). $T$ is the time dummies and $\epsilon_{i,t}$ is the error term. Here, I used the plm package for my regression

# DID regression with Tme FE
library(plm)
did_fe <- plm(wage~industry+treatment*industry, data=data, index = "year", model="within")

4. Present results and discuss

# Presenting result table
library(stargazer)
stargazer(did_fe, type="text",
          no.space=TRUE, keep.stat = c("n","rsq"),
          title = " DID REGRESSION WITH TIME FIXED EFFECT",
          dep.var.labels = "WAGES")

## 
## DID REGRESSION WITH TIME FIXED EFFECT
## ==============================================
##                        Dependent variable:    
##                    ---------------------------
##                               WAGES           
## ----------------------------------------------
## industry                    90.172***         
##                              (0.808)          
## industry:treatment         119.223***         
##                              (1.143)          
## ----------------------------------------------
## Observations                  1,200           
## R2                            0.985           
## ==============================================
## Note:              *p<0.1; **p<0.05; ***p<0.01

The regression results show that the coefficient of the interaction term between industry and treatment is positive and statistically significant $(\gamma= 119.2, p < 0.001)$. This means that the estimated treatment effect is statistically significant in the third and fourth post-policy-change years, indicating that the policy change had a positive effect on wages for the manufacturing treatment group. As illustrated in part 1 where the pre-assumed ATT was 120, this regression was able to pick up the true ATT as expected.

5.Add any plots/graphs/comments you deem necessary.

Here, I was trying to plot the co-efficient estimate from my regression but it was not working out for me. I will be glad with some suggestion on this.

Regression Discontinuity

1. Create a fake data to simulate a relevant situation

Let’s imagine a scenario in which students only get scholarships if their test scores are above a particular cutoff. For 600 students, we will create fictitious data as follows:

# Generate fake data
n <- 600
test_score <- rnorm(n, mean = 50, sd = 10)
threshold <- 60
treatment <- ifelse(test_score >= threshold, 1, 0)
eps <- rnorm(n, mean = 0, sd = 5)
Y <- 10 + 200*treatment + 5*test_score + eps

# Combine into a data frame
rdddata <- data.frame(Y, test_score, treatment)

Where y is the outcome variable (scholarship), test_score is a running variable which contains the exam score of the 600 students. treatment a binary variable which is equal to 1 if students score is above or equal to the threshold (60) and 0 otherwise. Suppose that the outcome variable follows a data generating process as; $Y_i= 10 + 200*treamtment_i + 5*test_score+ \epsilon_i$ where $\epsilon_i$ is the error term. Here our pre-determined LATE is 200.

2. Plot the data to show the pattern you are trying to estimate

We can create a scatterplot of test scores vs. outcomes, with a vertical line at the threshold to visually see the discontinuity.

# Plot data
library(ggplot2)
ggplot(rdddata, aes(x = test_score, y = Y)) +
  geom_point() +
  geom_vline(xintercept = threshold, color = "red")

Here we see that the threshold (60), there is a discontinuity in the outcome variable for those below and those above. The gap between these two line depicts the LATE.

3. Use R to run relevant regressions

In order to pick up the “true” effect, I will run an rdd regression as follows. $Y_i=+ Treatment_i + Testscore_i + _i $ where all variables follow the same definition as above and $\beta$ is the LATE to be estimated. In addition, we will use the rdd package to estimate a regression discontinuity model with a linear model on each side of the threshold.

# Load rdd package
library(rdd)

# Run regression
rdd_model <- RDestimate(Y ~ test_score, data=rdddata, cutpoint=60, kernel="rectangular", bw=2)

# Print results
summary(rdd_model)

## 
## Call:
## RDestimate(formula = Y ~ test_score, data = rdddata, cutpoint = 60, 
##     bw = 2, kernel = "rectangular")
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|)
## LATE       2           57           195.9     2.416        81.07   0       
## Half-BW    1           29           198.3     3.280        60.46   0       
## Double-BW  4          117           197.6     1.746       113.19   0       
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p
## LATE        7644  3          53         0
## Half-BW     3485  3          25         0
## Double-BW  18395  3         113         0

On each side of the threshold, we decide to employ a local linear model, with the bandwidth equal to 2 and a triangle kernel. Although there are additional options in the rdd package, these are the options that are frequently used in practice.

4. Present results and discuss

# Print results
summary(rdd_model)

## 
## Call:
## RDestimate(formula = Y ~ test_score, data = rdddata, cutpoint = 60, 
##     bw = 2, kernel = "rectangular")
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|)
## LATE       2           57           195.9     2.416        81.07   0       
## Half-BW    1           29           198.3     3.280        60.46   0       
## Double-BW  4          117           197.6     1.746       113.19   0       
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p
## LATE        7644  3          53         0
## Half-BW     3485  3          25         0
## Double-BW  18395  3         113         0

The summary output of our regression model shows that the estimated effect of the scholarship on the outcome is 203.3, with a standard error of 2.523 statistically significant at 1% level. This suggests that the scholarship has a positive effect on the outcome. Also, the LATE which is 203.3 was approximately equal to the pre-determined LATE of 200. This implies that the regression was able to pick up the true effect as expected.

5.Add any plots/graphs/comments you deem necessary.

We can also plot the estimated treatment effect over a range of test scores to see how it varies around the threshold.

# Plot estimated treatment effect
plot(rdd_model, plot_ci = T)

HW07

Godwin Nutsugah

2023-03-16

Difference-in-Difference Estimator

1. Create a fake data to simulate a relevant situation

2. Plot the data to show the pattern you are trying to estimate

3. Use R to run relevant regressions

4. Present results and discuss

5.Add any plots/graphs/comments you deem necessary.

Regression Discontinuity

1. Create a fake data to simulate a relevant situation

2. Plot the data to show the pattern you are trying to estimate

3. Use R to run relevant regressions

4. Present results and discuss

5.Add any plots/graphs/comments you deem necessary.