#load required packages and import data
library(readr)      #read_csv()
library(stargazer)  #summary and regression tables
df1 <- read_csv("us_fred_coastal_us_states_avg_hpi_before_after_2005.csv", show_col_types = FALSE)
#overview of the data
stargazer(df1)
##
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: Tue, Oct 10, 2023 - 19:59:12
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lccccc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## Statistic & \multicolumn{1}{c}{N} & \multicolumn{1}{c}{Mean} & \multicolumn{1}{c}{St. Dev.} & \multicolumn{1}{c}{Min} & \multicolumn{1}{c}{Max} \\
## \hline \\[-1.8ex]
## \hline \\[-1.8ex]
## \end{tabular}
## \end{table}
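The summary table above comes out empty, most likely because read_csv() returns a tibble, which stargazer does not summarise. A minimal workaround sketch (output not shown) is to coerce to a plain data frame and request text output:
#coerce to a base data.frame so stargazer can compute the summary statistics
stargazer(as.data.frame(df1), type = "text")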
#create did variable
df1$did <- df1$Time_Period*df1$Disaster_Affected
Estimating Equation
\[ y_i = \beta_0 + \beta_1 TimePeriod_i + \beta_2 Treated_i + \beta_3 (TimePeriod_i \times Treated_i) + \epsilon_i \]
#create model
lm_formula <- HPI_CHG ~ Time_Period + Disaster_Affected + did
lm1 <- lm(formula = lm_formula, data = df1)
summary(lm1)
##
## Call:
## lm(formula = lm_formula, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.023081 -0.007610 -0.000171 0.004656 0.035981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.037090 0.002819 13.157 < 2e-16 ***
## Time_Period -0.027847 0.003987 -6.985 1.2e-08 ***
## Disaster_Affected -0.013944 0.006176 -2.258 0.0290 *
## did 0.019739 0.008734 2.260 0.0288 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01229 on 44 degrees of freedom
## Multiple R-squared: 0.5356, Adjusted R-squared: 0.504
## F-statistic: 16.92 on 3 and 44 DF, p-value: 1.882e-07
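Equivalently, the interaction can be specified directly in the formula; a brief sketch that fits the same model as lm1, since R expands the * operator into both main effects plus their interaction:
#same model without the manually created did column
lm2 <- lm(HPI_CHG ~ Time_Period * Disaster_Affected, data = df1)
summary(lm2)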
stargazer(lm1, type = "text", title = "Model Results")
##
## Model Results
## ===============================================
## Dependent variable:
## ---------------------------
## HPI_CHG
## -----------------------------------------------
## Time_Period -0.028***
## (0.004)
##
## Disaster_Affected -0.014**
## (0.006)
##
## did 0.020**
## (0.009)
##
## Constant 0.037***
## (0.003)
##
## -----------------------------------------------
## Observations 48
## R2 0.536
## Adjusted R2 0.504
## Residual Std. Error 0.012 (df = 44)
## F Statistic 16.916*** (df = 3; 44)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
2 X 2 Matrix
As seen above, the diff in diff model can also be laid out as a 2 x 2 matrix. Each cell of the matrix is the expected value of the dependent variable for one combination of group and time period. We therefore first construct a simple table of the mean of HPI_CHG for each combination of Time_Period and Disaster_Affected.
#compute mean
mn <- aggregate(df1$HPI_CHG, FUN=mean,
by=list(df1$Time_Period, df1$Disaster_Affected))
colnames(mn) <- c('Time_Period', 'Disaster_Affected', 'Mean of HPI_CHG') #rename columns
print(mn)
## Time_Period Disaster_Affected Mean of HPI_CHG
## 1 0 0 0.037090020
## 2 1 0 0.009242792
## 3 0 1 0.023146118
## 4 1 1 0.015038346
Calculate Diff in Diff Coefficients using Data
\(\beta_0\) is simply the mean of the control group before treatment, which is the first mean in the table, where Time_Period and Disaster_Affected are both 0.
\(\beta_1\) is the change in the control group's mean between Time_Period 0 and 1, i.e. the second mean in the table minus the first.
\(\beta_2\) is the pre-treatment difference in means between the treatment group (Disaster_Affected = 1) and the control group (Disaster_Affected = 0), i.e. the third mean minus the first.
\(\beta_3\) is the difference-in-differences estimate: the change in the treatment-control gap from before to after treatment, \(E(Y_{Treatment}-Y_{Control}|Time=1)-E(Y_{Treatment}-Y_{Control}|Time=0)\). It is calculated by subtracting the pre-treatment gap (\(\beta_2\)) from the treatment-control gap when Time_Period is 1.
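Plugging the group means from the table above into these definitions gives (rounded to three decimals), matching the regression coefficients:
\[ \beta_0 \approx 0.037, \quad \beta_1 \approx 0.009 - 0.037 = -0.028, \quad \beta_2 \approx 0.023 - 0.037 = -0.014 \]
\[ \beta_3 \approx (0.015 - 0.023) - (0.009 - 0.037) = 0.020 \]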
# calculate coefficients from the group means (signs follow the regression above)
b0 <- signif(mn[1,3], 2) #control mean, pre-period
b1 <- signif((mn[2,3] - mn[1,3]), 2) #control group: post minus pre
b2 <- signif((mn[3,3] - mn[1,3]), 2) #pre-period: treatment minus control
b3 <- signif(((mn[4,3] - mn[3,3]) - (mn[2,3] - mn[1,3])), 2) #difference in differences
b3
## [1] 0.02
Now that we have the coefficients, we can create a complete matrix with the values.
|             | Time_Period = 0                      | Time_Period = 1                                     |
|-------------|--------------------------------------|-----------------------------------------------------|
| Treated = 0 | \(y_i = 0.037 + \epsilon_i\)         | \(y_i = 0.037 - 0.028 + \epsilon_i\)                |
| Treated = 1 | \(y_i = 0.037 - 0.014 + \epsilon_i\) | \(y_i = 0.037 - 0.028 - 0.014 + 0.02 + \epsilon_i\) |
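As an optional check (a minimal sketch, assuming lm1 and df1 as defined above), the fitted values from the regression reproduce these four cells, because the model is fully saturated for the 2 x 2 design:
#build the four group-period combinations and compare fitted values to the cell means
grid <- expand.grid(Time_Period = c(0, 1), Disaster_Affected = c(0, 1))
grid$did <- grid$Time_Period * grid$Disaster_Affected
cbind(grid, fitted = predict(lm1, newdata = grid))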
Interpretation
What is the control and the control group, and what is the treatment and the treatment group?
In this model, the treatment indicator is Disaster_Affected (whose coefficient is \(\beta_2\)), a dummy variable taking the value 0 or 1. An observation with Disaster_Affected = 0 is in the control group, while a value of 1 places it in the treatment group. Time_Period marks when the treatment occurs: 0 indicates the pre-treatment period and 1 the post-treatment period.
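A quick cross-tabulation shows how the observations fall into the four group-period cells (output not shown; this assumes only the two dummy columns described above):
#count observations in each group-period cell
table(Disaster_Affected = df1$Disaster_Affected, Time_Period = df1$Time_Period)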
The basic idea is to compare the difference in outcomes between treatment and control groups before and after treatment is introduced. By differencing these differences, what do you hope to achieve and why should it work? Intuitively, why does this methodology work? Describe it in plain, simple English to your grandma.
One of the main purposes of many studies is to understand the effect of a treatment, but other factors can also affect the outcome. In simple terms, difference in differences isolates the effect of the treatment while accounting for those other factors: anything that changes over time for both groups alike, and any permanent difference between the groups, cancels out when we take the two differences. The control group tells us what would have happened to the treatment group without the treatment, so whatever extra change the treatment group experiences is attributed to the treatment.
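In terms of the estimating equation, the cancellation is explicit: taking the treatment group's change over time and subtracting the control group's change over time leaves only the interaction coefficient.
\[ \big[(\beta_0+\beta_1+\beta_2+\beta_3)-(\beta_0+\beta_2)\big] - \big[(\beta_0+\beta_1)-\beta_0\big] = (\beta_1+\beta_3)-\beta_1 = \beta_3 \]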
Does your difference in differences coefficient in the linear regression above match the difference in differences effect of the two groups from the 2 x 2 table created above?
Yes. The difference in differences effect computed from the 2 x 2 matrix (0.02) matches the did coefficient in the linear regression model (0.020).
What are the “threats to identification”? In other words, what are the “implicit assumptions”, analogous to the five Gauss-Markov assumptions/conditions for simple OLS? Alternatively, under what conditions can you trust the point estimates, and when should you take the study results with a grain of salt?
The data used in the model must be panel data or repeated cross-sectional data, so that the same kinds of units are observed before and after treatment; this avoids potential bias caused by permanent differences between the control and treatment groups.
Exchangeability - the control and treatment groups should be comparable (exchangeable) in the absence of treatment, as they would be under a randomized controlled trial.
Parallel Trend Assumption - in the absence of treatment, the control and treatment groups should follow similar trends over time, i.e. the difference between the groups should stay constant; otherwise the estimate of the causal effect is biased. This can be checked visually by plotting the groups' outcomes over time (see the sketch after this list).
Stable Unit Treatment Values Assumption (SUTVA)
Each observation's potential outcome is not affected by another observation's exposure to treatment (no spillover effects).
The treatment should be consistent across units, i.e. there is only one version of the treatment; otherwise the same nominal treatment could lead to different outcomes.
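With only two periods in this dataset a full pre-trend test is not possible, but as a rough visual check the group means computed earlier can be plotted; a minimal sketch using the mn table from above:
#plot mean HPI_CHG by period, one line per group; roughly parallel lines before
#treatment would support the parallel trend assumption
interaction.plot(x.factor = mn$Time_Period,
                 trace.factor = mn$Disaster_Affected,
                 response = mn$`Mean of HPI_CHG`,
                 xlab = "Time_Period", ylab = "Mean of HPI_CHG",
                 trace.label = "Disaster_Affected")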
Useful Links (Diff-in-Diff)