Discussion_Diff_in_Diff

Author

Gina Occhipinti

First, I load and view the data.

# load the data 

data <- read.csv("/Users/ginaocchipinti/Documents/Econometrics Course - BC/us_fred_coastal_us_states_avg_hpi_before_after_2005.csv")

View(data)

str(data)

'data.frame':   48 obs. of  6 variables:
 $ STATE            : chr  "GASTHPI_CHG" "NCSTHPI_CHG" "TXSTHPI_CHG" "MASTHPI_CHG" ...
 $ HPI_CHG          : num  0.014 0.0142 0.0102 0.0275 0.0176 ...
 $ Time_Period      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Disaster_Affected: int  0 0 1 0 1 1 0 0 1 0 ...
 $ NUM_DISASTERS    : int  1 3 5 4 4 3 1 5 5 3 ...
 $ NUM_IND_ASSIST   : int  0 0 22 9 14 49 0 6 55 0 ...

Then, I run a simple linear regression where:

the dependent variable is HPI_CHG
the independent variables are Time_Period (dummy), Disaster Affected (dummy), and their interaction term
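
Written out, the specification described above is roughly the following. The subscripts s (state) and t (period) are notation assumed here for illustration; they do not appear in the data.

\[ \text{HPI\_CHG}_{st} = \beta_0 + \beta_1 \, \text{Time\_Period}_{t} + \beta_2 \, \text{Disaster\_Affected}_{s} + \beta_3 \, (\text{Time\_Period}_{t} \times \text{Disaster\_Affected}_{s}) + \epsilon_{st} \]

Under this setup, the coefficient on the interaction term is the difference-in-differences estimate.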

#regression

didreg = lm(HPI_CHG ~ Time_Period * Disaster_Affected, data = data)
summary(didreg)


Call:
lm(formula = HPI_CHG ~ Time_Period * Disaster_Affected, data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.023081 -0.007610 -0.000171  0.004656  0.035981 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    0.037090   0.002819  13.157  < 2e-16 ***
Time_Period                   -0.027847   0.003987  -6.985  1.2e-08 ***
Disaster_Affected             -0.013944   0.006176  -2.258   0.0290 *  
Time_Period:Disaster_Affected  0.019739   0.008734   2.260   0.0288 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01229 on 44 degrees of freedom
Multiple R-squared:  0.5356,    Adjusted R-squared:  0.504 
F-statistic: 16.92 on 3 and 44 DF,  p-value: 1.882e-07

What is the control and the control group, and what is the treatment and the treatment group? 4 lines max.
- The treatment is exposure to the hurricanes. The treatment group is the set of states that were disaster affected (Disaster_Affected = 1), and the control group is the set of states that were not (Disaster_Affected = 0). Both groups are observed in the period before the hurricanes (Time_Period = 0) and in the period after (Time_Period = 1).
The basic idea is to compare the difference in outcomes between the treatment and control groups before and after the treatment is introduced. By differencing these differences, what do you hope to achieve, and why should it work? Intuitively, why does methodology work/describe it in plain simple English to your grandma.
- Intuitively, we want to know what a treatment did. To find out, we follow two groups: one treated and one not treated. With the passing of time, changes are bound to happen to both groups anyway. The untreated group shows us how large those ordinary changes are. If we subtract that change from the change we see in the treated group, what is left over should, in theory, be the effect of the treatment alone. This works as long as the two groups would have changed by the same amount had there been no treatment.
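
The subtraction described above can be written as a formula. Each Y-bar term stands for the average outcome of one group in one period; this notation is an assumption introduced here for illustration.

\[ \widehat{\text{DiD}} = (\bar{Y}_{\text{treat, post}} - \bar{Y}_{\text{treat, pre}}) - (\bar{Y}_{\text{control, post}} - \bar{Y}_{\text{control, pre}}) \]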

Create the 2 X 2 matrix of regression equations with actual values from the data. Does your difference in difference coefficient in the linear regression above match the difference in difference effect of the two groups from the 2*2 table created?

Yes, the DiD coefficient is the same between the lm() regression approach and the matrix calculation approach.

# Mean HPI_CHG for each period-by-group cell
mean_table <- tapply(X = data$HPI_CHG,
                     INDEX = list(Time_Period = ifelse(data$Time_Period == 0, "pre", "post"),
                                  Treatment_Status = ifelse(data$Disaster_Affected == 0, "control", "treatment")),
                     FUN = mean)

# Display the table
print(mean_table)

           Treatment_Status
Time_Period     control  treatment
       post 0.009242792 0.01503835
       pre  0.037090020 0.02314612

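# Calculate DiD effect: (treatment post - pre) minus (control post - pre)
DiD_effect <- (mean_table["post", "treatment"] - mean_table["pre", "treatment"]) -
  (mean_table["post", "control"] - mean_table["pre", "control"])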
print(DiD_effect)

[1] 0.01973946
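
To see why the two approaches agree, the sketch below rebuilds the four cell means from the regression coefficients. The estimates are copied by hand from the summary(didreg) output above, so they are rounded to six decimals; the object names `cells` and `did_from_cells` are illustrative and not part of the original analysis.

```r
# Coefficient estimates copied from the summary(didreg) output (rounded)
b0 <- 0.037090   # (Intercept)
b1 <- -0.027847  # Time_Period
b2 <- -0.013944  # Disaster_Affected
b3 <- 0.019739   # Time_Period:Disaster_Affected

# Predicted mean for each cell of the 2 x 2 design
cells <- matrix(
  c(b0 + b1, b0 + b1 + b2 + b3,   # post: control, treatment
    b0,      b0 + b2),            # pre:  control, treatment
  nrow = 2, byrow = TRUE,
  dimnames = list(Time_Period = c("post", "pre"),
                  Treatment_Status = c("control", "treatment"))
)
print(round(cells, 6))

# Differencing the cells cancels b0, b1, and b2 and leaves only b3
did_from_cells <- (cells["post", "treatment"] - cells["pre", "treatment"]) -
  (cells["post", "control"] - cells["pre", "control"])
print(did_from_cells)
```

Up to rounding, the rebuilt cells match the table of means, and the double difference collapses to the interaction coefficient.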

What are the “threats to identification”? In other words, what are the “implicit assumptions” like for simple OLS we have the 5 Gauss Markov Assumptions/conditions? Alternatively, under what conditions can you trust the point estimates, and when would you buy the study results with a grain of salt?

Threats to identification are violations of the key assumptions behind DinD. The main one is “Parallel Trends”: without the treatment, the treatment and control groups would have followed the same path over time. Pre-existing trends or differential shocks between groups violate this assumption, because an event or an unobservable factor can push one group off that path. Differences post-treatment may then reflect these pre-existing differences rather than the treatment effect itself. For example, say the treatment states were “red” Republican states and the control states were “blue” Democratic states. It’s possible that the red states enacted a policy that influenced the Housing Price Index and the control states did not, causing the two groups to take different paths regardless of the treatment.
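
One common way to probe parallel trends is to compare the two groups' slopes in the pre-treatment years. The data used here are collapsed to a single pre period and a single post period, so this check cannot be run on them. The sketch below therefore uses a simulated panel; every name and number in it (`panel`, `years_since_2001`, the 20 states, the years 2001 to 2004) is hypothetical.

```r
# Hypothetical multi-period panel; none of these values come from the real data
set.seed(42)
panel <- expand.grid(state = 1:20, year = 2001:2004)  # pre-treatment years only
panel$treated <- as.integer(panel$state <= 10)
panel$years_since_2001 <- panel$year - 2001

# Both groups share the same slope, so parallel trends hold by construction
panel$hpi_chg <- 0.03 - 0.01 * panel$treated +
  0.002 * panel$years_since_2001 + rnorm(nrow(panel), sd = 0.001)

# The interaction coefficient measures how much the treated group's
# pre-period slope differs from the control group's slope
pretrend_fit <- lm(hpi_chg ~ treated * years_since_2001, data = panel)
pretrend_gap <- unname(coef(pretrend_fit)["treated:years_since_2001"])
print(pretrend_gap)
```

An interaction estimate close to zero is consistent with parallel pre-trends, although it cannot prove the groups would have stayed parallel after the treatment.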

Another threat is to the “No Anticipation” assumption, which says the treatment group does not alter its behavior or outcomes before the treatment is applied. If anticipation effects exist, the pre-treatment period is no longer a valid baseline for comparison. In our example, it’s possible the treatment group, familiar with hurricanes and expecting an intense season, made changes in their neighborhoods that moved the Housing Price Index before the hurricanes hit.

In the paper that uses triple Diff, what are the three margins? Type out the estimating equation and explain the study design - why does it work (who are we comparing against whom)?

The paper reviews the impact of the Cycle program in Bihar, India, where the government provided bicycles to girls enrolling in secondary school. The program aimed to reduce the gender gap in secondary school enrollment by improving access to school. The researchers use data from a large representative household survey, and employ a triple difference (DDD) approach to estimate the program’s impact. The comparison groups are:

Girls exposed to the program versus girls not exposed (younger vs. older cohorts).
Girls versus boys in the same age cohorts, with boys serving as a control group.
Bihar versus a neighboring state (Jharkhand) that did not have the program.

\[ \begin{aligned} Y_{ihv} = {} & \beta_0 + \beta_1 \times F_{ihv} \times T_{ihv} \times BH_{ihv} + \beta_2 \times F_{ihv} \times BH_{ihv} + \beta_3 \times T_{ihv} \times BH_{ihv} \\ & + \beta_4 \times F_{ihv} \times T_{ihv} + \beta_5 \times F_{ihv} + \beta_6 \times T_{ihv} + \beta_7 \times BH_{ihv} + \epsilon_{ihv} \end{aligned} \]

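The three margins are gender (girls vs. boys), cohort (exposed vs. not exposed to the program), and state (Bihar vs. Jharkhand). These correspond to the indicators F, T, and BH in the equation, and the coefficient on the triple interaction F × T × BH is the triple difference estimate. The design works because each comparison removes a different confounder. Comparing cohorts removes fixed differences between groups. Comparing girls with boys removes changes over time that affected all children in Bihar. Comparing Bihar with Jharkhand removes changes in the gender gap that would have happened even without the program. What remains isolates the impact of the program from other external factors that might have caused the change in results.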
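
The triple difference can be sketched in R in the same way as the 2 x 2 case. The data below are simulated and the names `toy`, `female`, `treat_cohort`, `bihar`, and `enrolled` are hypothetical stand-ins for F, T, BH, and Y; none of the numbers come from the paper.

```r
# Hypothetical data: 8 cells (female x exposed cohort x Bihar), 5 children each
set.seed(1)
toy <- expand.grid(female = 0:1, treat_cohort = 0:1, bihar = 0:1, child = 1:5)
toy$enrolled <- 0.4 + 0.05 * toy$female * toy$treat_cohort * toy$bihar +
  rnorm(nrow(toy), sd = 0.02)

# Fully interacted regression; `*` expands to all main effects and interactions
ddd_fit <- lm(enrolled ~ female * treat_cohort * bihar, data = toy)
ddd_coef <- unname(coef(ddd_fit)["female:treat_cohort:bihar"])

# Same quantity by hand from the 2 x 2 x 2 array of cell means
m <- tapply(toy$enrolled, list(toy$female, toy$treat_cohort, toy$bihar), mean)
did_girls_boys <- function(state) {
  # DiD within one state: (girls exposed - not exposed) - (boys exposed - not exposed)
  (m["1", "1", state] - m["1", "0", state]) -
    (m["0", "1", state] - m["0", "0", state])
}
ddd_manual <- unname(did_girls_boys("1") - did_girls_boys("0"))

print(c(regression = ddd_coef, manual = ddd_manual))
```

Because the model is fully interacted, the triple-interaction coefficient equals the Bihar DiD minus the Jharkhand DiD computed from the cell means.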