Discussion_Diff_in_Diff

Author

Gina Occhipinti

First, I load and view the data.

# load the data 

data <- read.csv("/Users/ginaocchipinti/Documents/Econometrics Course - BC/us_fred_coastal_us_states_avg_hpi_before_after_2005.csv")

View(data)

str(data)
'data.frame':   48 obs. of  6 variables:
 $ STATE            : chr  "GASTHPI_CHG" "NCSTHPI_CHG" "TXSTHPI_CHG" "MASTHPI_CHG" ...
 $ HPI_CHG          : num  0.014 0.0142 0.0102 0.0275 0.0176 ...
 $ Time_Period      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Disaster_Affected: int  0 0 1 0 1 1 0 0 1 0 ...
 $ NUM_DISASTERS    : int  1 3 5 4 4 3 1 5 5 3 ...
 $ NUM_IND_ASSIST   : int  0 0 22 9 14 49 0 6 55 0 ...

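Then, I run a simple linear regression of HPI_CHG on Time_Period, Disaster_Affected, and their interaction. The coefficient on the interaction term is the difference-in-differences estimate: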

#regression

didreg = lm(HPI_CHG ~ Time_Period * Disaster_Affected, data = data)
summary(didreg)

Call:
lm(formula = HPI_CHG ~ Time_Period * Disaster_Affected, data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.023081 -0.007610 -0.000171  0.004656  0.035981 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    0.037090   0.002819  13.157  < 2e-16 ***
Time_Period                   -0.027847   0.003987  -6.985  1.2e-08 ***
Disaster_Affected             -0.013944   0.006176  -2.258   0.0290 *  
Time_Period:Disaster_Affected  0.019739   0.008734   2.260   0.0288 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01229 on 44 degrees of freedom
Multiple R-squared:  0.5356,    Adjusted R-squared:  0.504 
F-statistic: 16.92 on 3 and 44 DF,  p-value: 1.882e-07
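
For reference, the `lm()` formula above can be written out as the model below. The subscripts \(s\) (state) and \(t\) (time period) and the symbols \(\beta_0, \dots, \beta_3\) are notation assumed here for illustration; they do not appear in the original code.

\[ \text{HPI\_CHG}_{st} = \beta_0 + \beta_1 \, \text{Time\_Period}_t + \beta_2 \, \text{Disaster\_Affected}_s + \beta_3 \, (\text{Time\_Period}_t \times \text{Disaster\_Affected}_s) + \epsilon_{st} \]

Under this reading, \(\beta_3\) (estimated at 0.019739 in the output above) is the difference-in-differences coefficient.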
  1. What is the control and the control group, and what is the treatment and the treatment group? 4 lines max.
    • The treatment is exposure to hurricane disasters, and the treatment group is the set of states that were disaster affected (Disaster_Affected = 1). The control condition is no disaster exposure, and the control group is the set of states that were not disaster affected (Disaster_Affected = 0). Both groups are observed in both time periods: before the hurricanes (Time_Period = 0) and after them (Time_Period = 1).
  2. The basic idea is to compare the difference in outcomes between the treatment and control groups before and after the treatment is introduced. By differencing these differences, what do you hope to achieve, and why should it work? Intuitively, why does methodology work/describe it in plain simple English to your grandma.
    • Intuitively, we want to evaluate what a treatment does. To do so, we follow two groups: one that was treated and one that was not. With the passing of time, changes are bound to happen to both groups anyway. The change in the control group tells us how much would have changed without the treatment. If we subtract that general change over time from the change in the treatment group, what is left should, in theory, be just the effect of the treatment. This works as long as the two groups would have changed in the same way had the treatment never happened.
  3. Create the 2 X 2 matrix of regression equations with actual values from the data. Does your difference in difference coefficient in the linear regression above match the difference in difference effect of the two groups from the 2*2 table created?
    • Yes. The interaction coefficient from the lm() regression (0.019739) matches the DiD effect computed from the 2 x 2 table of group means (0.01973946).

      mean_table <- tapply(X = data$HPI_CHG,
                           INDEX = list(Time_Period = ifelse(data$Time_Period == 0, "pre", "post"),
                                        Treatment_Status = ifelse(data$Disaster_Affected == 0, "control", "treatment")),
                           FUN = mean)
      
      # Display the table
      print(mean_table)
                 Treatment_Status
      Time_Period     control  treatment
             post 0.009242792 0.01503835
             pre  0.037090020 0.02314612
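      # Calculate DiD effect: (post - pre) for treatment minus (post - pre) for control.
      # Indexing by name avoids relying on the alphabetical row order ("post" before "pre").
      DiD_effect <- (mean_table["post", "treatment"] - mean_table["pre", "treatment"]) -
        (mean_table["post", "control"] - mean_table["pre", "control"])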
      print(DiD_effect)
      [1] 0.01973946
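
The 2 x 2 table can also be written in terms of the regression coefficients, which shows why the two approaches agree. The sketch below is a minimal illustration that hard-codes the rounded estimates from the summary(didreg) output above; the names b0 to b3, cells, and did_from_cells are introduced here for illustration only. Each cell is the sum of the coefficients that are "switched on" for that group and period.

```r
# Coefficients copied (rounded) from the summary(didreg) output above
b0 <- 0.037090   # (Intercept): control group, pre period
b1 <- -0.027847  # Time_Period: change over time for the control group
b2 <- -0.013944  # Disaster_Affected: pre-period gap, treatment vs control
b3 <- 0.019739   # Time_Period:Disaster_Affected: the DiD coefficient

# 2 x 2 matrix of regression equations, one per group and period
cells <- matrix(
  c(b0,      b0 + b2,
    b0 + b1, b0 + b1 + b2 + b3),
  nrow = 2, byrow = TRUE,
  dimnames = list(Time_Period = c("pre", "post"),
                  Treatment_Status = c("control", "treatment"))
)
print(round(cells, 6))

# Differencing the differences cancels b0, b1 and b2, leaving only b3
did_from_cells <- (cells["post", "treatment"] - cells["pre", "treatment"]) -
  (cells["post", "control"] - cells["pre", "control"])
print(did_from_cells)
```

Up to rounding of the copied coefficients, the four cells reproduce the group means in mean_table, and the double difference collapses to the interaction coefficient.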

What are the “threats to identification”? In other words, what are the “implicit assumptions”, analogous to the 5 Gauss Markov assumptions/conditions we have for simple OLS? Alternatively, under what conditions can you trust the point estimates, and when should you take the study results with a grain of salt?

In the paper that uses triple Diff, what are the three margins? Type out the estimating equation and explain the study design - why does it work (who are we comparing against whom)?

The paper evaluates the impact of the Cycle program in Bihar, India, in which the government provided bicycles to girls enrolling in secondary school. The program aimed to reduce the gender gap in secondary school enrollment by improving access to school. The researchers use data from a large representative household survey and employ a triple difference (DDD) approach to estimate the program’s impact. The estimating equation is:

\[ Y_{ihv} = \beta_0 + \beta_1 \times F_{ihv} \times T_{ihv} \times BH_{ihv} + \beta_2 \times F_{ihv} \times BH_{ihv} + \beta_3 \times T_{ihv} \times BH_{ihv} + \beta_4 \times F_{ihv} \times T_{ihv} + \beta_5 \times F_{ihv} + \beta_6 \times T_{ihv} + \beta_7 \times BH_{ihv} + \epsilon_{ihv} \]

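The three margins are gender (F: girls versus boys), cohort (T: the cohort exposed to the program versus the unexposed cohort), and state (BH: Bihar versus the comparison state). The coefficient of interest is \(\beta_1\) on the triple interaction. It compares the change in the girl-boy gap across cohorts in Bihar with the same change in the comparison state. This approach helps isolate the impact of the program from other external factors, such as changes common to a whole state or a whole cohort, that might otherwise have caused the change in results.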
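
To make the mechanics concrete, here is a minimal sketch of a triple-difference regression in R. It runs on simulated data, not the paper's survey: the data frame sim, the column names female, treated_cohort, bihar, and enrolled, and every number in the simulation (including true_ddd) are hypothetical and chosen only for illustration. The point is that the coefficient on the three-way interaction equals the DiD of the gender gap in one state minus the same DiD in the other.

```r
set.seed(1)
n <- 4000

# Hypothetical indicators standing in for F, T, and BH in the equation above
sim <- data.frame(
  female         = rbinom(n, 1, 0.5),
  treated_cohort = rbinom(n, 1, 0.5),
  bihar          = rbinom(n, 1, 0.5)
)

# Simulated outcome with an assumed triple-difference effect (made-up values)
true_ddd <- 0.05
sim$enrolled <- with(sim,
  0.40 - 0.10 * female + 0.05 * treated_cohort - 0.05 * bihar +
    true_ddd * female * treated_cohort * bihar + rnorm(n, sd = 0.1))

# female * treated_cohort * bihar expands to all main effects,
# all two-way interactions, and the three-way interaction
ddd_fit  <- lm(enrolled ~ female * treated_cohort * bihar, data = sim)
ddd_coef <- coef(ddd_fit)["female:treated_cohort:bihar"]

# Same quantity by hand from the 2 x 2 x 2 table of cell means
# (array dimensions: female, treated_cohort, bihar)
m <- tapply(sim$enrolled,
            list(sim$female, sim$treated_cohort, sim$bihar), mean)
did_bihar <- (m["1", "1", "1"] - m["0", "1", "1"]) -
  (m["1", "0", "1"] - m["0", "0", "1"])
did_other <- (m["1", "1", "0"] - m["0", "1", "0"]) -
  (m["1", "0", "0"] - m["0", "0", "0"])
ddd_manual <- did_bihar - did_other

print(c(regression = unname(ddd_coef), manual = unname(ddd_manual)))
```

Because the model is fully saturated in the three indicators, the regression coefficient and the hand-computed triple difference coincide, in the same way that the DiD coefficient matched the 2 x 2 table earlier.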