Regression Discussion 6

Author

Langley Burke


data <- read.csv("/Users/langleyburke/Downloads/1fc451683137398e11c75b2e47031cf1-211bac7f1490d57867d34c1a516617d59a485b21/us_fred_coastal_us_states_avg_hpi_before_after_2005.csv")

Simple Linear Regression

Creating Interaction Term

data$DiD <- data$Time_Period * data$Disaster_Affected

Summary Final

library(stargazer)


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

stargazer(data, type="text")


================================================
Statistic         N  Mean  St. Dev.  Min    Max 
------------------------------------------------
HPI_CHG           48 0.022  0.017   -0.006 0.061
Time_Period       48 0.500  0.505     0      1  
Disaster_Affected 48 0.208  0.410     0      1  
NUM_DISASTERS     48 3.208  2.143     1     10  
NUM_IND_ASSIST    48 8.583  14.946    0     55  
DiD               48 0.104  0.309     0      1  
------------------------------------------------

Equation

\[ HPICHG_i = \beta_0 + \beta_1 TimePeriod_i + \beta_2 DisasterAffected_i + \beta_3 (TimePeriod_i \times DisasterAffected_i) + \epsilon_i \]
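One way to read the coefficients, assuming both regressors are 0/1 indicators (as the summary table suggests) and the error term averages to zero within each cell, is to write out the expected outcome for each of the four group-by-period cells:

\[ \begin{aligned} E[HPICHG_i \mid \text{control, pre}] &= \beta_0 \\ E[HPICHG_i \mid \text{control, post}] &= \beta_0 + \beta_1 \\ E[HPICHG_i \mid \text{treatment, pre}] &= \beta_0 + \beta_2 \\ E[HPICHG_i \mid \text{treatment, post}] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 \end{aligned} \]

Under those assumptions the control group changes by \(\beta_1\) and the treatment group by \(\beta_1 + \beta_3\), so the difference between the two changes is \(\beta_3\), the difference-in-differences effect.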

didmodel <- lm(HPI_CHG ~ Time_Period + Disaster_Affected + DiD, data = data)

summary(didmodel)


Call:
lm(formula = HPI_CHG ~ Time_Period + Disaster_Affected + DiD, 
    data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.023081 -0.007610 -0.000171  0.004656  0.035981 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.037090   0.002819  13.157  < 2e-16 ***
Time_Period       -0.027847   0.003987  -6.985  1.2e-08 ***
Disaster_Affected -0.013944   0.006176  -2.258   0.0290 *  
DiD                0.019739   0.008734   2.260   0.0288 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01229 on 44 degrees of freedom
Multiple R-squared:  0.5356,    Adjusted R-squared:  0.504 
F-statistic: 16.92 on 3 and 44 DF,  p-value: 1.882e-07
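As a side note, the hand-built DiD column is not the only way to specify this model. The sketch below uses a small made-up data set (not the HPI data; the names `toy`, `time`, `treated`, and `y` are hypothetical) to show that R's formula operator `*`, which expands to both main effects plus their interaction, should give the same interaction estimate as multiplying the columns manually.

```r
# Hypothetical toy data (NOT the HPI data): two observations per cell
# of a 2x2 design. All values are invented for illustration.
toy <- data.frame(
  time    = c(0, 0, 0, 0, 1, 1, 1, 1),
  treated = c(0, 0, 1, 1, 0, 0, 1, 1),
  y       = c(0.040, 0.034, 0.025, 0.021, 0.011, 0.007, 0.017, 0.013)
)

# Option 1: build the interaction column by hand, as done above.
toy$did <- toy$time * toy$treated
manual <- lm(y ~ time + treated + did, data = toy)

# Option 2: let the formula create the interaction term.
shorthand <- lm(y ~ time * treated, data = toy)

# The two interaction estimates should agree.
did_manual    <- unname(coef(manual)["did"])
did_shorthand <- unname(coef(shorthand)["time:treated"])
c(manual = did_manual, shorthand = did_shorthand)
```

The shorthand avoids keeping a separate column in sync with its two components; the explicit column makes the term easier to point at in a summary table.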

What is the control and the control group, and what is the treatment and the treatment group? 4 lines max.

The treatment is being affected by a disaster, so the treatment group is the areas that were disaster-affected and the control group is the areas that were not. The control group serves as the baseline: its change from the pre period to the post period shows what would have happened without the disaster. In a diff-in-diff model each group is observed both before and after the disaster, giving four cells: control pre, control post, treatment pre, and treatment post.

The basic idea is to compare the difference in outcomes between the treatment and control groups before and after the treatment is introduced. By differencing these differences, what do you hope to achieve, and why should it work? Intuitively, why does methodology work/describe it in plain simple English to your grandma.

The main idea is that, in the absence of the treatment (the disaster), the two groups would have followed similar trends over time. Comparing the change in the treatment group with the change in the control group, before versus after the treatment, removes both the fixed gap between the groups and the trend they share, so what is left is the effect of the treatment. It works only if the trends were similar beforehand; if they were not, the control group's change would not tell us where the treatment group would have ended up without the disaster.

Create the 2 X 2 matrix of regression equations with actual values from the data. HINT - take the mean of your y variable, not the count in each cell. Does your difference in difference coefficient in the linear regression above match the difference in difference effect of the two groups from the 2*2 table created? It should match by the way.

#Count the observations in each time period by group cell
count_table <- table("Time Period" = ifelse(data$Time_Period == 0, "pre", "post"),
                     "Disaster" = ifelse(data$Disaster_Affected == 0, "control", "treatment"))
count_table

           Disaster
Time Period control treatment
       post      19         5
       pre       19         5

Each time period has 5 treatment observations (disaster-affected areas) and 19 control observations (unaffected areas), so the group sizes are the same in the pre period (Time_Period = 0) and the post period (Time_Period = 1).

2 By 2 Table

#Calculating the mean of Y for each group
mean_table <- tapply(data$HPI_CHG, list("Time Period" = ifelse(data$Time_Period == 0, "pre", "post"),
"Disaster" = ifelse(data$Disaster_Affected == 0, "control", "treatment")), mean)

# Display the table
print(mean_table)

           Disaster
Time Period     control  treatment
       post 0.009242792 0.01503835
       pre  0.037090020 0.02314612

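#Change from pre to post within each group
treat_change <- mean_table["post", "treatment"] - mean_table["pre", "treatment"]

control_change <- mean_table["post", "control"] - mean_table["pre", "control"]

#Difference in Difference - treatment change minus control change

DiDvalue <- treat_change - control_change

DiDvalue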

[1] 0.01973946

This matches the DiD coefficient from the linear regression (0.019739).
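The link between the regression and the 2 by 2 table can also be checked in the other direction. The sketch below takes the four coefficient estimates reported in the regression summary above and rebuilds the cell means from them; the variable names (`b0` to `b3`, `cell_means`) are introduced here only for illustration, and the rebuilt means agree with the table only up to the rounding of the printed coefficients.

```r
# Coefficient estimates copied from the regression summary above.
b0 <- 0.037090   # (Intercept): control group, pre period
b1 <- -0.027847  # Time_Period
b2 <- -0.013944  # Disaster_Affected
b3 <- 0.019739   # DiD

# Rebuild the 2x2 table of cell means implied by the coefficients.
cell_means <- matrix(
  c(b0 + b1, b0 + b1 + b2 + b3,
    b0,      b0 + b2),
  nrow = 2, byrow = TRUE,
  dimnames = list("Time Period" = c("post", "pre"),
                  "Disaster"    = c("control", "treatment"))
)
print(cell_means)

# Treatment-group change minus control-group change recovers b3.
did_from_coefs <- (cell_means["post", "treatment"] - cell_means["pre", "treatment"]) -
  (cell_means["post", "control"] - cell_means["pre", "control"])
did_from_coefs
```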

Assumptions/Threats

Parallel Trends

As mentioned earlier, difference-in-differences is informative only when the trends are parallel: we assume the control and treatment groups would have followed parallel trends had no treatment occurred. That is what makes the comparison meaningful, because the control group's change then shows where the treatment group would have been without the treatment. A threat to this assumption is anything that affects the pre-treatment trend of one group differently from the other. Another threat is too little data in the pre and post periods. There must be enough time periods on both sides to judge whether the trends really are parallel; the more periods we have, the more change over time we can visualize and assess. With only one pre-treatment period, for example, it is hard to tell whether the assumption holds.
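If several pre-treatment periods were available, one rough check would be to compare the pre-period slopes of the two groups. The sketch below is only an illustration on invented numbers: the data frame `pre`, its columns, and the `wiggle` noise are hypothetical and are not part of the data used above. An interaction estimate near zero is consistent with parallel pre-trends, though it cannot prove the assumption holds after treatment.

```r
# Hypothetical pre-treatment group means (NOT from the HPI data):
# four pre-periods for the control group and four for the treatment group.
pre <- data.frame(
  period  = rep(-4:-1, times = 2),
  treated = rep(c(0, 1), each = 4)
)

# Both groups are given the same slope (-0.005 per period); the treated
# group starts lower. The small wiggles are made-up noise.
wiggle <- c(0.0005, -0.0003, 0.0002, -0.0004,
            -0.0002, 0.0004, -0.0005, 0.0003)
pre$y <- 0.030 - 0.005 * pre$period - 0.014 * pre$treated + wiggle

# period:treated estimates the difference in pre-period slopes.
trend_check <- lm(y ~ period * treated, data = pre)
slope_gap <- unname(coef(trend_check)["period:treated"])
slope_gap
```

With real data the same regression would be run on unit-level observations, and the standard error of the interaction term would matter as much as its point estimate.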

No anticipation

The no-anticipation assumption requires that units do not change their behavior before the treatment actually occurs. It matters for causal inference because it helps isolate the effect of the treatment from other influences on the outcome. If people anticipate the treatment, they have time to react, which can shift trends in the pre-treatment period as well as the post-treatment period; this biases the estimated treatment effect and can lead to incorrect conclusions.

Paper

Time: Pre and Post Bicycle Program

Treatment: Giving girls in Bihar access to Bicycles

Control: Boys not gaining access in Bihar and Boys and Girls not gaining access in Jharkhand

\[ \begin{aligned} SecondarySchoolEnrollment_i = {} & \beta_0 + \beta_1 Bihar_i + \beta_2 Female_i + \beta_3 Treat_i \\ & + \beta_4 Female_i \times Bihar_i + \beta_5 Treat_i \times Bihar_i \\ & + \beta_6 Treat_i \times Female_i + \beta_7 Treat_i \times Female_i \times Bihar_i + \epsilon_i \end{aligned} \]

The treatment group is girls in Bihar who were given access to bicycles through the program, with the expectation that this improves access to and enrollment in secondary school. They are compared against two control groups, which makes this a triple difference. The first is boys in Bihar: comparing enrollment within the same state controls for general education trends that could affect both genders, so the remaining difference can be attributed to the treatment. The second is girls and boys in Jharkhand, a state that did not implement the bicycle program. The design relies on parallel trends across time periods, and the extra control group gives a more accurate estimate of the treatment effect. Adding this third difference accounts for other factors that could affect secondary school enrollment.
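To make the triple difference concrete, here is a minimal sketch on invented numbers. The enrollment rates, the data frame `toy`, and the built-in effect of 0.05 are all hypothetical and do not come from the paper; only the indicator names follow the equation above. It shows that the coefficient on the three-way interaction equals the girl-minus-boy change in Bihar minus the same quantity in Jharkhand.

```r
# Every combination of the three 0/1 indicators from the equation above.
cells <- expand.grid(bihar = 0:1, female = 0:1, treat = 0:1)

# Invented enrollment rates (NOT from the paper). The last term builds in
# a triple-difference effect of 0.05 for treated girls in Bihar.
cells$rate <- with(cells,
  0.40 + 0.05 * bihar - 0.10 * female + 0.03 * treat +
  0.02 * bihar * female + 0.01 * bihar * treat - 0.02 * female * treat +
  0.05 * bihar * female * treat)

# Two observations per cell, placed symmetrically around the cell mean.
toy <- rbind(transform(cells, enroll = rate + 0.01),
             transform(cells, enroll = rate - 0.01))

# Regression version: the three-way interaction is the triple difference.
ddd_model <- lm(enroll ~ bihar * female * treat, data = toy)
ddd_coef <- unname(coef(ddd_model)["bihar:female:treat"])

# Cell-mean version: DiD within Bihar minus DiD within Jharkhand.
cell_mean <- function(b, f, tr) {
  with(toy, mean(enroll[bihar == b & female == f & treat == tr]))
}
did_bihar <- (cell_mean(1, 1, 1) - cell_mean(1, 1, 0)) -
  (cell_mean(1, 0, 1) - cell_mean(1, 0, 0))
did_jharkhand <- (cell_mean(0, 1, 1) - cell_mean(0, 1, 0)) -
  (cell_mean(0, 0, 1) - cell_mean(0, 0, 0))
ddd_manual <- did_bihar - did_jharkhand

c(regression = ddd_coef, cell_means = ddd_manual)
```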