          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  580775 31.1    1324645 70.8         NA   669422 35.8
Vcells 1068540  8.2    8388608 64.0      16384  1851968 14.2
Discussion 6
Clear Data
Bringing in data
Rows: 48 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): STATE
dbl (5): HPI_CHG, Time_Period, Disaster_Affected, NUM_DISASTERS, NUM_IND_ASSIST
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Creating interaction term - Dif in Dif
dt$did <- (dt$Time_Period * dt$Disaster_Affected)
head(dt)
# A tibble: 6 × 7
  STATE HPI_CHG Time_Period Disaster_Affected NUM_DISASTERS NUM_IND_ASSIST   did
  <chr>   <dbl>       <dbl>             <dbl>         <dbl>          <dbl> <dbl>
1 GAST…  0.0140           0                 0             1              0     0
2 NCST…  0.0142           0                 0             3              0     0
3 TXST…  0.0102           0                 1             5             22     0
4 MAST…  0.0275           0                 0             4              9     0
5 ALST…  0.0176           0                 1             4             14     0
6 MSST…  0.0133           0                 1             3             49     0
Part B) Running a Linear Regression
\(Y_i = \beta_0 + \beta_1 \cdot Time_i + \beta_2 \cdot Treated_i + \beta_3 \cdot (Time_i \times Treated_i) + \epsilon_i\)
\(House \ Price \ Change_i = \beta_0 + \beta_1 \cdot Time \ Period_i + \beta_2 \cdot Disaster \ Affected_i + \beta_3 \cdot Difference \ in \ Difference_i + \epsilon_i\)
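\(Y_i\) = House Price Change
\(\beta_0\) = Baseline average (B), the control group before the disaster
\(\beta_1\) = Time trend in the control group (D - B), where D is the control group after the disaster
\(X_1\) = Time
\(\beta_2\) = Difference between the two groups prior to the intervention (A - B), where A is the treated group before the disaster
\(X_2\) = Treated
\(\beta_3\) = Difference in change over time [(C - A) - (D - B)], where C is the treated group after the disaster
\(X_3\) = Difference in Difference (Time × Treated)
\(\epsilon_i\) = Error term representing unexplained variation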
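To make the letters concrete, the four cell means implied by the regression can be written out. This is a sketch of the standard 2 × 2 algebra, assuming the labeling implied by the definitions above (B and D are the control group before and after, A and C are the treated group before and after):
\[ \begin{align*} B &= E[Y_i \mid Time_i = 0, Treated_i = 0] = \beta_0 \\ D &= E[Y_i \mid Time_i = 1, Treated_i = 0] = \beta_0 + \beta_1 \\ A &= E[Y_i \mid Time_i = 0, Treated_i = 1] = \beta_0 + \beta_2 \\ C &= E[Y_i \mid Time_i = 1, Treated_i = 1] = \beta_0 + \beta_1 + \beta_2 + \beta_3 \\ (C - A) - (D - B) &= (\beta_1 + \beta_3) - \beta_1 = \beta_3 \end{align*} \]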
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
lm <- lm(HPI_CHG ~ Time_Period + Disaster_Affected + did, data = dt)
stargazer(lm, type = "text")
===============================================
                        Dependent variable:
                    ---------------------------
                              HPI_CHG
-----------------------------------------------
Time_Period                  -0.028***
                              (0.004)
Disaster_Affected            -0.014**
                              (0.006)
did                           0.020**
                              (0.009)
Constant                     0.037***
                              (0.003)
-----------------------------------------------
Observations                    48
R2                             0.536
Adjusted R2                    0.504
Residual Std. Error      0.012 (df = 44)
F Statistic           16.916*** (df = 3; 44)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
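As a rough check on the significance of the interaction term, a 95% confidence interval can be sketched from the table. The variable names below are illustrative, and the inputs are the rounded values printed by stargazer, so the result is only approximate.

```r
# Values copied from the stargazer table (rounded to three decimals).
did_estimate <- 0.020
did_se       <- 0.009
resid_df     <- 44   # residual degrees of freedom reported in the table

# Two-sided 95% interval based on the t distribution.
t_crit <- qt(0.975, df = resid_df)
did_ci <- did_estimate + c(-1, 1) * t_crit * did_se
print(round(did_ci, 4))
```

The interval lies above zero, which agrees with the ** (p<0.05) marker on `did`. With the fitted model object in memory, calling `confint()` on it would give the interval from the unrounded estimates.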
Part C) Understanding and interpreting
1) What is the control and the control group, and what is the treatment and the treatment group?
Control group: Areas not affected by the disaster, Disaster affected = 0
Treatment group: Areas affected by the disaster, Disaster affected = 1
Control period: Before the disaster, Time Period = 0.
Treatment period: After the disaster, Time Period = 1.
2) The basic idea is to compare the difference in outcomes between the treatment and control groups before and after the treatment is introduced. By differencing these differences, what do you hope to achieve, and why should it work? Intuitively, why does methodology work/describe it in plain simple English to your grandma.
The goal is to isolate the true effect of the disaster by comparing the change in the outcome in the treatment group before and after the disaster with the change in the control group over the same period.
This works because the change in the control group's outcome shows what would have happened without the disaster, which lets us separate what is related to the disaster from what is not. We then subtract that unrelated change from the treatment group's change, and what remains is the effect of the disaster.
Simple terms: It's like checking if a plant grows faster because of sunlight instead of just because of time passing. We can do this by comparing it to another plant kept in the shade.
3) Create the 2 X 2 matrix of regression equations with actual values from the data.
Creating the table
# Create the 2x2 matrix of means using tapply
mean_table <- tapply(dt$HPI_CHG,
                     list(Time_Period = ifelse(dt$Time_Period == 0, "Before (Time = 0)", "After (Time = 1)"),
                          Treatment_Status = ifelse(dt$Disaster_Affected == 0, "Control (Not impacted)", "Treatment (Impacted)")),
                     mean)
print(mean_table)
Treatment_Status
Time_Period Control (Not impacted) Treatment (Impacted)
After (Time = 1) 0.009242792 0.01503835
Before (Time = 0) 0.037090020 0.02314612
Calculating the difference in difference - Manually
Y1 <- mean_table["Before (Time = 0)", "Control (Not impacted)"] # Control - Before
Y2 <- mean_table["After (Time = 1)", "Control (Not impacted)"] # Control - After
Y3 <- mean_table["Before (Time = 0)", "Treatment (Impacted)"] # Treatment - Before
Y4 <- mean_table["After (Time = 1)", "Treatment (Impacted)"] # Treatment - After
diff_in_diff <- (Y4 - Y3) - (Y2 - Y1)
Comparing the manual method vs that from the model
# Manual vs regression
cat("Manual Diff in Diff estimate: ", diff_in_diff, "\n")
Manual Diff in Diff estimate: 0.01973946
interaction_coeff <- coef(lm)["did"]
cat("Coefficient from the regression: ", interaction_coeff, "\n")
Coefficient from the regression: 0.01973946
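As we can see above, both estimates match (0.01973946). This is expected: with only the two indicators and their interaction in the model, the interaction coefficient equals the difference in differences of the four cell means.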
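The same logic recovers every coefficient in the regression table, not only the interaction. The sketch below copies the four cell means from the printed table of means and rebuilds each coefficient from them; the variable names are illustrative and not part of the original analysis.

```r
# Cell means copied from the printed 2 x 2 table of means.
control_before <- 0.037090020
control_after  <- 0.009242792
treated_before <- 0.02314612
treated_after  <- 0.01503835

# Each regression coefficient written as a combination of cell means.
implied <- c(
  Constant          = control_before,
  Time_Period       = control_after - control_before,
  Disaster_Affected = treated_before - control_before,
  did               = (treated_after - treated_before) -
                      (control_after - control_before)
)
print(round(implied, 3))
```

Rounded to three decimals, these values can be compared line by line with the stargazer output.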
Part D) Threats to Identification
The main “threat to identification” is a violation of the parallel trends assumption, which states that the control and treatment groups would have followed the same trend over time in the absence of “treatment”. For this reason it is important to compare how the two groups move over multiple periods before and after the treatment. If the assumption does not hold, the treatment effect estimate may be biased.
Other assumptions include the absence of spillover effects: the treatment must have no direct or indirect effect on the control group. Similarly, the homogeneous effect assumption requires that the treatment effect does not vary systematically across different units. A violation of any of these can undermine the validity of the results; if the assumptions hold, the estimate can be interpreted with more confidence.
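One common way to probe parallel trends is a placebo test on pre-treatment periods. The data used here has only two periods (Time_Period is 0 or 1), so the sketch below uses simulated data instead. Every name and number in it is hypothetical, and parallel trends hold by construction.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Hypothetical panel: periods -1 and 0 are pre-treatment, period 1 is post.
cells <- expand.grid(period = c(-1, 0, 1), treated = c(0, 1))
sim   <- cells[rep(seq_len(nrow(cells)), each = 200), ]

# Both groups share the same time trend; the treated group gets an extra
# 0.02 only in the post period (the assumed true effect).
sim$y <- 0.03 - 0.01 * sim$period - 0.014 * sim$treated +
  0.02 * (sim$period == 1) * sim$treated +
  rnorm(nrow(sim), sd = 0.001)

# Placebo DiD: keep only the pre-treatment periods and pretend that
# period 0 is the "post" period. The interaction should be close to zero.
pre           <- subset(sim, period < 1)
pre$fake_post <- as.numeric(pre$period == 0)
placebo_fit   <- lm(y ~ fake_post * treated, data = pre)
placebo_did   <- unname(coef(placebo_fit)["fake_post:treated"])

# Actual DiD on periods 0 and 1 for comparison.
main     <- subset(sim, period >= 0)
main_fit <- lm(y ~ period * treated, data = main)
main_did <- unname(coef(main_fit)["period:treated"])

print(c(placebo = placebo_did, main = main_did))
```

A placebo estimate far from zero would suggest the groups were already diverging before treatment. A placebo estimate near zero is reassuring but does not prove the assumption holds after treatment.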
Part E) Cycling to school
uses triple Diff, what are the three margins? Type out the estimating equation and explain the study design - why does it work (who are we comparing against whom)? 8 lines max.
The three margins are:
- T - Indicator for being in a “treated” cohort (aged 14 or 15)
- F - Indicator for being female
- BH - Indicator for being an observation from Bihar
\[ \begin{align*} Y_{ihv} &= \beta_0 + \beta_1 \cdot \text{T}_{ihv} + \beta_2 \cdot \text{F}_{ihv} + \beta_3 \cdot \text{BH}_{ihv} \\ &\quad + \beta_4 \cdot (\text{T}_{ihv} \times \text{F}_{ihv}) + \beta_5 \cdot (\text{T}_{ihv} \times \text{BH}_{ihv}) \\ &\quad + \beta_6 \cdot (\text{F}_{ihv} \times \text{BH}_{ihv}) + \beta_7 \cdot (\text{T}_{ihv} \times \text{F}_{ihv} \times \text{BH}_{ihv}) \\ &\quad + \epsilon_{ihv} \end{align*} \]
The study evaluates a program in Bihar, India that provided bicycles to girls in order to increase their school enrollment. The program sought to address the gender gap in secondary school enrollment through a cost-effective investment. The comparison is made between girls and boys, between the treated cohort and older cohorts, and between Bihar and a comparison state, so that gender-specific, cohort-specific, and state-specific differences are netted out. The authors also compare students who live far from and close to school.
The program worked well: girls living far from school, who had lower enrollment rates, benefited the most from the treatment. Boys and students living closer to school (the control groups) did not see the same gains, and this comparison is what isolates the treatment effect.
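The triple-difference equation above can be illustrated on simulated data. This is a minimal sketch, not the paper's estimation: the column names (`treat_cohort`, `female`, `bihar`), the continuous outcome, and all effect sizes are made up. It shows that the coefficient on the three-way interaction (\(\beta_7\)) equals the triple difference of the eight cell means.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Balanced hypothetical sample: 500 observations in each of the 8 cells.
cells <- expand.grid(treat_cohort = 0:1, female = 0:1, bihar = 0:1)
sim   <- cells[rep(seq_len(nrow(cells)), each = 500), ]

# Made-up outcome with an assumed triple-interaction effect of 0.06.
sim$y <- with(sim,
  0.50 + 0.05 * treat_cohort - 0.10 * female - 0.05 * bihar +
  0.02 * treat_cohort * female + 0.01 * treat_cohort * bihar -
  0.03 * female * bihar + 0.06 * treat_cohort * female * bihar) +
  rnorm(nrow(sim), sd = 0.05)

# `a * b * c` expands to all main effects and all interactions.
fit      <- lm(y ~ treat_cohort * female * bihar, data = sim)
ddd_coef <- unname(coef(fit)["treat_cohort:female:bihar"])

# Same quantity by hand: (girls' DiD minus boys' DiD) in Bihar,
# minus the same contrast outside Bihar.
m <- with(sim, tapply(y, list(treat_cohort, female, bihar), mean))
ddd_means <-
  ((m["1", "1", "1"] - m["0", "1", "1"]) - (m["1", "0", "1"] - m["0", "0", "1"])) -
  ((m["1", "1", "0"] - m["0", "1", "0"]) - (m["1", "0", "0"] - m["0", "0", "0"]))

print(c(regression = ddd_coef, cell_means = unname(ddd_means)))
```

The two numbers agree because the model is fully saturated in the three indicators, the same reason the manual and regression estimates matched in the 2 × 2 case.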