I. Please skim through the entire article first:

A Guide To Using The Difference-In-Differences Regression Model: https://towardsdatascience.com/a-guide-to-using-the-difference-in-differences-regression-model-87cd2fb3224a

A.

In a simple diff-in-diff paper, you are looking for either the 2 X 2 matrix of regression equations (so that you can construct the diff-in-diff estimator) or the following regression form: Y = α + β1(time) + β2(treatment) + β3(time*treatment)

β1 is the expected mean change in outcome from before to after the onset of the intervention era among the control group. It reflects, if you will, the pure effect of the passage of time in the absence of the actual intervention.
β2 (the coefficient on the treatment variable) is the estimated mean difference in Y between the treatment and control groups prior to the intervention: it represents whatever “baseline” differences existed between the groups before the intervention was applied to the treatment group.
β3 by itself is the difference-in-differences estimator. In most contexts, it is β3 that is the focus of interest. It tells us whether the expected mean change in outcome from before to after was different in the two groups. (That would typically be the hallmark of an effective intervention, assuming adequate power, etc.) To get the estimated mean difference in Y between the treatment and control groups after the intervention, you need to look at β2 + β3, because β1 is common to both groups in the post-intervention period and cancels out. It is possible that you will find that β2 + β3 is significantly different from zero, even though neither β2 nor β3 by itself is.
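
To see where these interpretations come from, it can help to write out the expected outcome in each of the four group-by-period cells implied by the regression form above. This is only a sketch, assuming that time and treatment are both 0/1 dummy variables:

```latex
\[
\begin{aligned}
E[Y \mid \text{control, before}]   &= \alpha \\
E[Y \mid \text{control, after}]    &= \alpha + \beta_1 \\
E[Y \mid \text{treatment, before}] &= \alpha + \beta_2 \\
E[Y \mid \text{treatment, after}]  &= \alpha + \beta_1 + \beta_2 + \beta_3
\end{aligned}
\]
\[
\underbrace{\big[(\alpha + \beta_1 + \beta_2 + \beta_3) - (\alpha + \beta_2)\big]}_{\text{change in treatment group}}
- \underbrace{\big[(\alpha + \beta_1) - \alpha\big]}_{\text{change in control group}}
= \beta_3
\]
```

Any other contrast between groups or periods can be read off the same way, by subtracting one cell from another.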

B. Now, you will run a simple linear regression in R on the dataset.

Estimating Equation

\(y_i = \beta_0 + \beta_1 TimePeriod_i + \beta_2 Treatment_i + \beta_3 (TimePeriod_i \times Treatment_i) + \epsilon_i\)

The dependent variable is HPI_CHG, and the independent variables are Time_Period (dummy), Disaster_Affected (dummy), and their interaction term.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

diff_data <- read.csv("E:/RM/us_fred_coastal_us_states_avg_hpi_before_after_2005.csv")

# Creating the interaction term
Interaction_Term <- diff_data$Time_Period * diff_data$Disaster_Affected

# Simple linear regression
model <- lm(HPI_CHG ~ Time_Period + Disaster_Affected + Interaction_Term, data = diff_data)

regression_DID <- summary(model)$coefficients["Interaction_Term", "Estimate"]

# Summary of the regression
summary(model)

## 
## Call:
## lm(formula = HPI_CHG ~ Time_Period + Disaster_Affected + Interaction_Term, 
##     data = diff_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.023081 -0.007610 -0.000171  0.004656  0.035981 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.037090   0.002819  13.157  < 2e-16 ***
## Time_Period       -0.027847   0.003987  -6.985  1.2e-08 ***
## Disaster_Affected -0.013944   0.006176  -2.258   0.0290 *  
## Interaction_Term   0.019739   0.008734   2.260   0.0288 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01229 on 44 degrees of freedom
## Multiple R-squared:  0.5356, Adjusted R-squared:  0.504 
## F-statistic: 16.92 on 3 and 44 DF,  p-value: 1.882e-07

print(regression_DID)

## [1] 0.01973946
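
As a side note, the interaction column does not have to be built by hand: in an R formula, `a * b` expands to `a + b + a:b`. The sketch below illustrates this on simulated data. The data frame `toy`, its 16/8 group split, and the coefficient values used to generate it are hypothetical and only mimic the column names used above; they are not the real dataset.

```r
# Hypothetical data: 24 states observed before and after (NOT the real dataset)
set.seed(1)
toy <- data.frame(
  Time_Period       = rep(c(0, 1), each = 24),
  Disaster_Affected = rep(rep(c(0, 1), times = c(16, 8)), times = 2)
)
toy$HPI_CHG <- 0.04 - 0.03 * toy$Time_Period - 0.01 * toy$Disaster_Affected +
  0.02 * toy$Time_Period * toy$Disaster_Affected + rnorm(nrow(toy), sd = 0.01)

# Manual interaction column, as in the analysis above
toy$Interaction_Term <- toy$Time_Period * toy$Disaster_Affected
fit_manual <- lm(HPI_CHG ~ Time_Period + Disaster_Affected + Interaction_Term, data = toy)

# Formula shorthand: Time_Period * Disaster_Affected adds both main effects
# and the Time_Period:Disaster_Affected interaction
fit_formula <- lm(HPI_CHG ~ Time_Period * Disaster_Affected, data = toy)

did_manual  <- unname(coef(fit_manual)["Interaction_Term"])
did_formula <- unname(coef(fit_formula)["Time_Period:Disaster_Affected"])

# The two specifications give the same difference-in-differences estimate
all.equal(did_manual, did_formula)
```

The shorthand keeps the interaction tied to the data frame passed to `lm()`, which avoids relying on a separate vector in the global environment.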

C. Let’s try to understand and interpret what you have done.

1. What is the control and the control group, and what is the treatment and the treatment group?

The treatment is exposure to the coastal weather events of the 2005 Atlantic hurricane season. The treatment group consists of the states that were heavily impacted by these events. The control is the absence of that exposure, and the control group consists of the states that were not significantly affected by coastal weather events. House price changes are compared before and after the events in both groups.

2. The basic idea is to compare the difference in outcomes between the treatment and control groups before and after the treatment is introduced. By differencing these differences, what do you hope to achieve, and why should it work? Intuitively, why does the methodology work? Describe it in plain, simple English, as if to your grandma.

We’re trying to figure out if hurricanes affect house prices. The Control Group includes states where hurricanes had little impact, so we can see how prices change without hurricanes. The Treatment Group has states hit hard by hurricanes, so we can see how prices change with hurricanes. We want to see if hurricanes really make prices go up or down. To do that, we compare how prices changed before and after hurricanes in both groups. Prices everywhere may change over that time for reasons that have nothing to do with hurricanes, so we subtract the change seen in the Control Group from the change seen in the Treatment Group. It should work because it helps us separate the real effect of hurricanes from other things that might also be changing prices. By comparing the two groups and looking at how prices change before and after the hurricanes, we can be more confident that any big price changes we see are likely caused by the hurricanes themselves.
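
The same idea can be shown with a few lines of arithmetic. The four numbers below are hypothetical, chosen only to make the subtraction easy to follow; they are not taken from the dataset.

```r
# Hypothetical average price growth (in percent), for illustration only
control_before <- 4.0
control_after  <- 1.0
treated_before <- 3.0
treated_after  <- 2.0

# Change in the states without hurricanes: what would have happened anyway
change_control <- control_after - control_before

# Change in the states with hurricanes: the "anyway" change plus the hurricane effect
change_treated <- treated_after - treated_before

# The difference between the two changes is what we attribute to the hurricanes
did <- change_treated - change_control
did
```

Here the control states changed by -3 and the treated states by -1, so the difference in differences is 2: the treated states did 2 points better than they would have if they had simply followed the control states.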

3. Create the 2 X 2 matrix of regression equations with actual values from the data. HINT - take the mean of your y variable, not the count in each cell. Does your difference in difference coefficient in the linear regression above match the difference in difference effect of the two groups from the 2*2 table created? It should match by the way.

mean_control_pre <- mean(diff_data$HPI_CHG[diff_data$Time_Period == 0 & diff_data$Disaster_Affected == 0])
mean_control_post <- mean(diff_data$HPI_CHG[diff_data$Time_Period == 1 & diff_data$Disaster_Affected == 0])
mean_treatment_pre <- mean(diff_data$HPI_CHG[diff_data$Time_Period == 0 & diff_data$Disaster_Affected == 1])
mean_treatment_post <- mean(diff_data$HPI_CHG[diff_data$Time_Period == 1 & diff_data$Disaster_Affected == 1])


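# matrix() fills column by column, so rows are time (pre, post)
# and columns are group (control, treatment)
matrix_2x2 <- matrix(c(mean_control_pre, mean_control_post, mean_treatment_pre, mean_treatment_post), nrow = 2)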

print(matrix_2x2)

##             [,1]       [,2]
## [1,] 0.037090020 0.02314612
## [2,] 0.009242792 0.01503835

DID_coefficient <- (mean_treatment_post - mean_treatment_pre) - (mean_control_post - mean_control_pre)

print(DID_coefficient)

## [1] 0.01973946

# Absolute difference between the two estimates (0 means they agree)
coefficients_match <- abs(DID_coefficient - regression_DID)
coefficients_match

## [1] 0

Therefore, the difference-in-differences coefficient from the linear regression matches the estimate from the 2 X 2 table: the change in the treatment group's mean from the pre-treatment to the post-treatment period, minus the corresponding change in the control group's mean.
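
A labelled version of the table can make the rows and columns easier to read, and comparing the two estimates with a tolerance is more robust to floating-point rounding than checking for an exact zero. The sketch below uses simulated data; the data frame `toy`, the rule that states 1 to 8 are treated, and the generating coefficients are all hypothetical.

```r
# Hypothetical data with the same column names as above (NOT the real dataset)
set.seed(42)
toy <- expand.grid(state = 1:24, Time_Period = c(0, 1))
toy$Disaster_Affected <- as.numeric(toy$state <= 8)
toy$HPI_CHG <- 0.04 - 0.03 * toy$Time_Period - 0.01 * toy$Disaster_Affected +
  0.02 * toy$Time_Period * toy$Disaster_Affected + rnorm(nrow(toy), sd = 0.01)

# 2 x 2 table of cell means: rows are periods, columns are groups
cell_means <- tapply(toy$HPI_CHG,
                     list(toy$Time_Period, toy$Disaster_Affected),
                     mean)
dimnames(cell_means) <- list(period = c("pre", "post"),
                             group  = c("control", "treatment"))
print(cell_means)

# Difference in differences from the table
did_table <- (cell_means["post", "treatment"] - cell_means["pre", "treatment"]) -
             (cell_means["post", "control"]   - cell_means["pre", "control"])

# Difference in differences from the regression
toy$Interaction_Term <- toy$Time_Period * toy$Disaster_Affected
fit <- lm(HPI_CHG ~ Time_Period + Disaster_Affected + Interaction_Term, data = toy)
did_reg <- unname(coef(fit)["Interaction_Term"])

# TRUE if the two estimates agree up to numerical tolerance
isTRUE(all.equal(did_table, did_reg))
```

The two estimates agree because the regression with both dummies and their interaction has one parameter per cell, so its fitted values are exactly the four cell means.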

4. What are the “threats to identification”? In other words, what are the “implicit assumptions” like for simple OLS we have the 5 Gauss Markov Assumptions/conditions? Alternatively, under what conditions can you trust the point estimates, and when would you buy the study results with a grain of salt?

In difference-in-differences analysis, there are several assumptions and potential threats to identification that we should consider. These assumptions are critical for trusting the point estimates and the validity of the study results.

Parallel Trends Assumption: In the absence of the treatment, the two groups (treatment and control) would have followed parallel trends over time. If this assumption is violated, other factors are influencing the outcomes differently in the two groups, making it difficult to attribute changes solely to the treatment. Here, the assumption says that in the absence of coastal weather events, house prices in both groups would have moved in parallel. Similar pre-treatment trends are the usual supporting evidence and serve as the baseline for comparison.
Common Trends: This is closely related to parallel trends, and the two terms are often used interchangeably: both groups should share a common trend before the treatment begins. This is essential for creating a baseline against which to compare post-treatment changes.
Adequate Time Periods: DiD may not work well with very short time periods because there may not be enough data points to capture trends accurately. To address this, we used a dataset with sufficient time periods before and after the coastal weather events. We obtained the target variable data (house prices) for four quarters before and after the hurricane and calculated, state by state, the average quarter-over-quarter fractional change in the house price index over the two sets of quarters.
Large Enough Sample: A sufficiently large sample is essential to detect statistically significant treatment effects. To this end, we used data from 24 coastal states and classified them into treatment and control groups based on the median values of the affected states, to avoid bias due to an imbalanced dataset.
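
One common way to probe the parallel trends assumption is to look only at the pre-treatment periods and test whether the two groups' trends differ. The sketch below illustrates this on a simulated quarterly panel. The panel, its variable names (`state`, `quarter`, `treated`, `hpi_chg`), and the choice of 8 pre-event quarters are all assumptions for illustration; the dataset used above is collapsed to one pre and one post average per state, so this check would need the underlying quarterly data.

```r
# Hypothetical panel: 24 states, 8 pre-event quarters (quarter = -8, ..., -1)
set.seed(7)
panel <- expand.grid(state = 1:24, quarter = -8:-1)
panel$treated <- as.numeric(panel$state <= 8)

# Simulated so that both groups share the same pre-event slope,
# i.e. parallel trends hold by construction
panel$hpi_chg <- 0.03 + 0.001 * panel$quarter - 0.01 * panel$treated +
  rnorm(nrow(panel), sd = 0.005)

# The quarter:treated coefficient measures how much the treated group's
# pre-event slope differs from the control group's slope
pretrend_fit <- lm(hpi_chg ~ quarter * treated, data = panel)
slope_gap <- coef(summary(pretrend_fit))["quarter:treated", c("Estimate", "Pr(>|t|)")]
slope_gap
```

A slope gap close to zero is consistent with parallel pre-trends, although it cannot prove that the trends would have stayed parallel after the event. Note also that plain `lm()` standard errors treat repeated observations of the same state as independent, so the p-value here should be read as a rough guide only.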

II. OPTIONAL: Summarize the Mariel Boatlift paper.

A. ARTICLE SUMMARY

1) What are the authors trying to do? In other words, what is the economic hypothesis that the authors are trying to test for?

The authors of “The Impact of the Mariel Boatlift on the Miami Labor Market” are attempting to test the economic hypothesis that a sudden and significant influx of low-skilled immigrants into the Miami labor market did not significantly harm native workers' employment and wages. They seek to disprove the belief that such an immigration shock would have a negative impact on the labor market.

2) What do the authors find? Do the results make sense under a perfectly competitive labor market?

The paper finds that the Mariel Boatlift had no statistically significant adverse effects on the employment of less-skilled native workers in Miami. In a perfectly competitive labor market, an increase in the labor supply, such as the arrival of a large number of new immigrants, would typically lead to a decrease in wages and potentially some displacement of native workers due to increased competition for jobs. In such a model, one would expect the Mariel Boatlift to result in lower wages and possibly higher unemployment among native workers, so the results do not match what the simple competitive model predicts.

B. METHODOLOGY / CRITICAL ANALYSIS

1) What is the treatment here? What is the treatment group, and what is the control group?

The treatment in this study is the sudden inflow of low-skilled Cuban immigrants due to the Mariel Boatlift. The treatment group consists of Miami where this occurred, and the control group consists of other cities in the United States that did not experience the same immigration shock.

2) Intuitively, why does methodology work? Also, do you buy the study’s results? How convinced are you with the study’s methodology?

The study compares Miami to other cities, which is the conventional way to assess the impact of the Mariel Boatlift on the local labor market. The methodology works because it uses the differences in labor market outcomes between Miami and the other cities both before and after the Boatlift, which allows researchers to isolate the effects of the Boatlift from other factors influencing the labor market. The paper's findings are interesting, but I'm not convinced by the study's methodology: the Boatlift happened in 1980, and the study compares the cities from 1979 to 1985, a relatively short time frame in which the impact of the Boatlift might not be fully realized in the data. Longer-term effects could differ from the short-term findings. It is therefore important to exercise caution, and to consider other relevant studies and factors, before making significant policy decisions based on this single study.

3) What are the “threats to identification”? In other words, what are the “implicit assumptions” like for simple OLS we have the 5 Gauss Markov Assumptions/conditions? Alternatively, under what conditions can you trust the point estimates, and when would you buy the study results with a grain of salt?

The study covers a relatively short time frame. The labor market effects of the Boatlift may not fully manifest within this period. Long-term effects might differ from short-term effects.
The Cuban immigrants who arrived in Miami as a result of the Boatlift might have differed from non-migrants in terms of skills, which may itself affect labor market outcomes. If the migrants were more skilled, the estimated effects cannot be attributed solely to the Boatlift itself; selectivity could be driving the results.
If Miami's labor market was already on a different trajectory from the comparison cities before the Boatlift, the parallel trends assumption is violated.

I would trust the results of the study only if the concerns above are addressed.

III. Optional: Find another good Diff-in-Diff paper (that has the 2*2 table or estimating equation with an interaction term).

“California Paid Family Leave and Parental Time Use” by Samantha Trajkovski

The author used a Diff-in-Diff method to compare parents' time allocation before and after the implementation of the Paid Family Leave policy in California in 2004. The study found that this policy had a positive effect on the time parents spent with their children. It highlights the importance of family-friendly policies in promoting more time with children in their crucial early years. In this paper, the treatment group consists of people residing in California.

https://surface.syr.edu/cpr/249/

Week - 6 Discussion

2023-10-10