Discussion 6

Clear Data

          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  580775 31.1    1324645 70.8         NA   669422 35.8
Vcells 1068540  8.2    8388608 64.0      16384  1851968 14.2

Bringing in data

Rows: 48 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): STATE
dbl (5): HPI_CHG, Time_Period, Disaster_Affected, NUM_DISASTERS, NUM_IND_ASSIST

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Creating interaction term - Diff in Diff

dt$did <- (dt$Time_Period * dt$Disaster_Affected)
head(dt)
# A tibble: 6 × 7
  STATE HPI_CHG Time_Period Disaster_Affected NUM_DISASTERS NUM_IND_ASSIST   did
  <chr>   <dbl>       <dbl>             <dbl>         <dbl>          <dbl> <dbl>
1 GAST…  0.0140           0                 0             1              0     0
2 NCST…  0.0142           0                 0             3              0     0
3 TXST…  0.0102           0                 1             5             22     0
4 MAST…  0.0275           0                 0             4              9     0
5 ALST…  0.0176           0                 1             4             14     0
6 MSST…  0.0133           0                 1             3             49     0
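As a side note, R's formula interface can build the same interaction without a hand-made column. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `y`, `time`, and `treated`; none of these values come from the assignment data) to check that the two approaches give the same coefficients.

```r
# Hypothetical toy data: two observations in each of the four time x group cells
toy <- data.frame(
  y       = c(1.0, 1.2, 2.0, 2.1, 1.5, 1.4, 3.5, 3.7),
  time    = c(0, 0, 1, 1, 0, 0, 1, 1),
  treated = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Approach 1: create the interaction column by hand, as done above for dt$did
toy$did <- toy$time * toy$treated
fit_manual <- lm(y ~ time + treated + did, data = toy)

# Approach 2: let the formula expand time * treated into
# time + treated + time:treated
fit_formula <- lm(y ~ time * treated, data = toy)

# Same estimates; only the name of the interaction coefficient differs
print(coef(fit_manual))
print(coef(fit_formula))
```

Either form works; the hand-made column just makes the Diff in Diff term easier to spot in the output table.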

Part B) Running a Linear Regression

\(Y_i = \beta_0 + \beta_1 \cdot Time_i + \beta_2 \cdot Treated_i + \beta_3 \cdot (Time_i \times Treated_i) + \epsilon_i\)

\(House \ Price \ Change_i = \beta_0 + \beta_1 \cdot Time \ Period_i + \beta_2 \cdot Disaster \ Affected_i + \beta_3 \cdot Difference \ in \ Difference_i + \epsilon_i\)

\(Y_i\) = House Price Change

\(\beta_0\) = Base Line average (B)

\(\beta_1\) = Time Trend in control group (D - B)

\(X_1\) = Time

\(\beta_2\) = Difference between two groups prior to intervention (A - B)

\(X_2\) = Treated

\(\beta_3\) = Difference in change over time [(C - A) - (D - B)]

\(X_3\) = Difference in Difference

\(\epsilon_i\) = Error term representing unexplained variation
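
To see where the letters A, B, C and D come from, it may help to write out the expected outcome in each of the four cells. This is a short derivation under the usual assumption that the error term has mean zero in every cell:

\[ \begin{align*} E[Y_i \mid Time = 0, Treated = 0] &= \beta_0 && (B) \\ E[Y_i \mid Time = 1, Treated = 0] &= \beta_0 + \beta_1 && (D) \\ E[Y_i \mid Time = 0, Treated = 1] &= \beta_0 + \beta_2 && (A) \\ E[Y_i \mid Time = 1, Treated = 1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 && (C) \\ (C - A) - (D - B) &= (\beta_1 + \beta_3) - \beta_1 = \beta_3 \end{align*} \]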

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
lm <- lm(HPI_CHG  ~ Time_Period + Disaster_Affected + did, data = dt)

stargazer(lm, type = "text")

===============================================
                        Dependent variable:    
                    ---------------------------
                              HPI_CHG          
-----------------------------------------------
Time_Period                  -0.028***         
                              (0.004)          
                                               
Disaster_Affected            -0.014**          
                              (0.006)          
                                               
did                           0.020**          
                              (0.009)          
                                               
Constant                     0.037***          
                              (0.003)          
                                               
-----------------------------------------------
Observations                    48             
R2                             0.536           
Adjusted R2                    0.504           
Residual Std. Error       0.012 (df = 44)      
F Statistic           16.916*** (df = 3; 44)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
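
One way to read this table is to rebuild the four group averages from the coefficients. The sketch below uses only the rounded values printed above, so the results are approximate to three decimals; the object names (`b0`, `cells`, and so on) are ours.

```r
# Rounded coefficients copied from the stargazer table above
b0 <- 0.037   # Constant
b1 <- -0.028  # Time_Period
b2 <- -0.014  # Disaster_Affected
b3 <- 0.020   # did

# Implied average HPI_CHG in each time x group cell
cells <- matrix(
  c(b0,      b0 + b2,
    b0 + b1, b0 + b1 + b2 + b3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Before", "After"), c("Control", "Treatment"))
)
print(cells)

# Differencing the two changes over time gives back the did coefficient
did_check <- (cells["After", "Treatment"] - cells["Before", "Treatment"]) -
  (cells["After", "Control"] - cells["Before", "Control"])
print(did_check)
```

Read this way, house price growth fell in both groups after the disaster period, but it fell by about 0.020 less in the disaster-affected group.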

Part C) Understanding and interpreting

1) What is the control and the control group, and what is the treatment and the treatment group?

  • Control group: Areas not affected by the disaster, Disaster_Affected = 0.

  • Treatment group: Areas affected by the disaster, Disaster_Affected = 1.

  • Control period: Before the disaster, Time_Period = 0.

  • Treatment period: After the disaster, Time_Period = 1.

2) The basic idea is to compare the difference in outcomes between the treatment and control groups before and after the treatment is introduced. By differencing these differences, what do you hope to achieve, and why should it work? Intuitively, why does the methodology work? Describe it in plain, simple English, as you would to your grandma.

The goal is to isolate the true effect of the disaster by comparing the change in the outcome for the treatment group before and after the disaster with the change for the control group over the same period.

This works because the control group shows how outcomes change for reasons unrelated to the disaster. Subtracting the control group’s change from the treatment group’s change removes that unrelated part and leaves the part attributable to the disaster.

Simple terms: It's like checking if a plant grows faster because of sunlight instead of just because of time passing. We can do this by comparing it to another plant kept in the shade.

3) Create the 2 X 2 matrix of regression equations with actual values from the data.

Creating the table

# Create the 2x2 matrix of means using tapply
mean_table <- tapply(dt$HPI_CHG, 
                     list(Time_Period = ifelse(dt$Time_Period == 0, "Before (Time = 0)", "After (Time = 1)"),
                          Treatment_Status = ifelse(dt$Disaster_Affected == 0, "Control (Not impacted)", "Treatment (Impacted)")),
                     mean)


print(mean_table)
                   Treatment_Status
Time_Period         Control (Not impacted) Treatment (Impacted)
  After (Time = 1)             0.009242792           0.01503835
  Before (Time = 0)            0.037090020           0.02314612

Calculating the difference in difference - Manually

Y1 <- mean_table["Before (Time = 0)", "Control (Not impacted)"]  # Control - Before 

Y2 <- mean_table["After (Time = 1)", "Control (Not impacted)"] # Control - After 

Y3 <- mean_table["Before (Time = 0)", "Treatment (Impacted)"] # Treatment - Before

Y4 <- mean_table["After (Time = 1)", "Treatment (Impacted)"] # Treatment - After

diff_in_diff <- (Y4 - Y3) - (Y2 - Y1)

Comparing the manual method vs that from the model

# Manual vs  regression
cat("Manual Diff in Diff estimate: ", diff_in_diff, "\n")
Manual Diff in Diff estimate:  0.01973946 
interaction_coeff <- coef(lm)["did"]
cat("Coefficient from the regression: ", interaction_coeff, "\n")
Coefficient from the regression:  0.01973946 

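As we can see above, the two estimates match (0.0197). This is expected: the model includes both indicators and their interaction, so its fitted values are exactly the four group means.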
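
To illustrate why the two numbers agree, the sketch below fits the same specification to just the four group means printed in the 2x2 table (one row per cell) rather than to the full data set. The data frame `cell_means` is our own construction from those printed values, so the last digits are limited by the printed precision.

```r
# The four group means of HPI_CHG, copied from the printed 2x2 table
cell_means <- data.frame(
  Time_Period       = c(0, 1, 0, 1),
  Disaster_Affected = c(0, 0, 1, 1),
  HPI_CHG           = c(0.037090020, 0.009242792, 0.02314612, 0.01503835)
)

# Four parameters for four points: the model passes through every cell mean
fit_cells <- lm(HPI_CHG ~ Time_Period * Disaster_Affected, data = cell_means)

# The interaction coefficient is the difference in differences of the means
did_from_cells <- unname(coef(fit_cells)["Time_Period:Disaster_Affected"])
print(round(did_from_cells, 8))
```

The regression on the full data adds standard errors and significance tests, which the manual calculation from means does not provide.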

Part D) Threats to Identification

The main “threat to identification” is a violation of the parallel trends assumption, which states that the control and treatment groups would have followed the same trend over time in the absence of the “treatment”. For this reason, it is important to analyze how the two groups move relative to each other over multiple periods before and after the treatment. If the assumption does not hold, the treatment effect estimate may be biased.
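
A common way to examine this is to estimate the gap between the groups period by period. The sketch below is only an illustration on simulated data, since the data used above code Time_Period as 0/1; every name in it (`panel`, `period_f`, `gaps`) and every number is hypothetical, and parallel trends hold by construction.

```r
set.seed(1)

# Simulated panel: 100 units, periods -3..2, treatment starts at period 0
panel <- expand.grid(unit = 1:100, period = -3:2)
panel$treated <- as.integer(panel$unit > 50)
panel$post    <- as.integer(panel$period >= 0)

# Both groups share the same time trend; the treated group gains 3 after treatment
panel$y <- 1 + 0.5 * panel$period + 2 * panel$treated +
  3 * panel$treated * panel$post + rnorm(nrow(panel), sd = 0.1)

# Use the last pre-treatment period (-1) as the reference period
panel$period_f <- relevel(factor(panel$period), ref = "-1")
fit_event <- lm(y ~ treated * period_f, data = panel)

# Treated-vs-control gap in each period, relative to period -1
gaps <- coef(fit_event)[grep("^treated:period_f", names(coef(fit_event)))]
print(round(gaps, 3))
```

In this setup the pre-treatment gaps should be close to zero and the post-treatment gaps close to the simulated effect. Pre-treatment gaps far from zero in real data would be a warning sign for the parallel trends assumption.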

Other assumptions include the absence of spillover effects, meaning the treatment has no direct or indirect effect on the control group. Similarly, the homogeneous-effect assumption requires that the treatment effect does not vary systematically across units. A violation of any of these can undermine the validity of the results; if the assumptions hold, the estimate can be interpreted with more confidence.

Part E) Cycling to school

The study uses a triple Diff. What are the three margins? Type out the estimating equation and explain the study design: why does it work (who are we comparing against whom)? 8 lines max.

The three margins are:

  1. T - Indicator for being in a “treated” cohort (aged 14 or 15)
  2. F - Indicator for gender (female)
  3. BH - Indicator for being an observation from Bihar

\[ \begin{align*} Y_{ihv} &= \beta_0 + \beta_1 \cdot \text{T}_{ihv} + \beta_2 \cdot \text{F}_{ihv} + \beta_3 \cdot \text{BH}_{ihv} \\ &\quad + \beta_4 \cdot (\text{T}_{ihv} \times \text{F}_{ihv}) + \beta_5 \cdot (\text{T}_{ihv} \times \text{BH}_{ihv}) \\ &\quad + \beta_6 \cdot (\text{F}_{ihv} \times \text{BH}_{ihv}) + \beta_7 \cdot (\text{T}_{ihv} \times \text{F}_{ihv} \times \text{BH}_{ihv}) \\ &\quad + \epsilon_{ihv} \end{align*} \]

The program aimed to increase secondary school enrollment for girls in Bihar, India, by providing them with bicycles, addressing the gender gap in enrollment through a cost-effective investment. The triple difference compares girls with boys, the treated cohort (aged 14 or 15) with the untreated cohort, and Bihar with the comparison state; \(\beta_7\) captures the program effect. The study also compares students who live far from school with those who live close to it.

The design works because the comparison groups were exposed to the same general conditions but not to the program, so their changes can be differenced out to isolate the treatment effect. Girls living far from school (who had lower enrollment rates) benefited the most, while boys and students living closer to school (the control groups) did not see the same gains.
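
The estimating equation can be sketched in R as a three-way interaction. The data below are simulated purely for illustration and are not the study's data; the names `treat_cohort`, `female`, and `bihar` stand in for T, F, and BH (we avoid `T` and `F` because R treats them as shorthand for TRUE and FALSE), and the effect size `true_ddd` is made up.

```r
set.seed(42)

# Simulated balanced sample: 1000 observations in each of the 8 cells
sim <- expand.grid(id = 1:1000, treat_cohort = 0:1, female = 0:1, bihar = 0:1)

# Hypothetical outcome with a triple-difference effect of 0.05
true_ddd <- 0.05
sim$y <- 0.40 + 0.02 * sim$treat_cohort - 0.10 * sim$female - 0.05 * sim$bihar +
  true_ddd * sim$treat_cohort * sim$female * sim$bihar +
  rnorm(nrow(sim), sd = 0.05)

# a * b * c expands to all main effects, two-way and three-way interactions
fit_ddd <- lm(y ~ treat_cohort * female * bihar, data = sim)
ddd_reg <- unname(coef(fit_ddd)["treat_cohort:female:bihar"])  # plays the role of beta_7

# Same quantity from cell means: the girl-boy cohort gap in Bihar
# minus the girl-boy cohort gap in the comparison state
m <- tapply(sim$y, list(sim$treat_cohort, sim$female, sim$bihar), mean)
did_bihar <- (m["1", "1", "1"] - m["0", "1", "1"]) - (m["1", "0", "1"] - m["0", "0", "1"])
did_other <- (m["1", "1", "0"] - m["0", "1", "0"]) - (m["1", "0", "0"] - m["0", "0", "0"])
ddd_manual <- did_bihar - did_other

print(c(regression = ddd_reg, manual = ddd_manual))
```

As in the two-group case, the regression coefficient and the manual calculation from cell means coincide, because the model has one parameter for each of the eight cells.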