class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Difference in Difference I
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

#Let's get ready

```r
#install.packages("did")
library(tidyverse) # Add the tidyverse package to my current library.
library(haven) # Handle labelled data.
library(broom)
library(splitstackshape) #transform wide data (with stacked variables) to long data
library(plm) #linear models for panel data
```

---
#Does partnership make you happier? Evolution of analysis design
.pull-left[
A cross-sectional OLS
<img src="https://github.com/fancycmn/slide12/blob/main/S12_Pic3.PNG?raw=true" width="80%" style="display: block; margin-left:20px;">

]

.pull-right[
A fixed effect model
<img src="https://github.com/fancycmn/slide12/blob/main/S12_Pic4.PNG?raw=true" width="80%" style="display: block; margin-left:10px;">

]

---
#Does partnership make you happier? Fixed effect design

**What is the true partner effect?**

.pull-left[
A fixed effect model looks at the within-person change
<img src="https://github.com/fancycmn/slide12/blob/main/S12_Pic4.PNG?raw=true" width="100%" style="display: block; margin-left:20px;">

]

.pull-right[
What if those who remain single also change?
<img src="https://github.com/fancycmn/slide12/blob/main/S12_Pic2.PNG?raw=true" width="85%" style="display: block; margin-left:10px;">

]

---
#Does partnership make you happier? Difference in Difference 
<img src="https://github.com/fancycmn/slide12/blob/main/S12_Pic5.PNG?raw=true" width="75%" style="display: block; margin-left:10px;">

---
#Difference in Difference 
<img src="https://github.com/fancycmn/slide12/blob/main/S12_Pic6.PNG?raw=true" width="65%" style="display: block; margin-left:10px;">
- Deal with the impact of time by introducing an untreated comparison group
- Isolate the within variation for both the treated group and untreated group.
- Compare the within variation in the treated group to the within variation in the untreated group. 
- Because the within variation in the untreated group is affected by time, doing this comparison controls for time differences.
- **We are looking for how much more the treated group changed than the untreated group when going from before to after.**

---
#Difference in Difference
Assume that the treatment occurs between period 1 ( `\(t_{1}\)`, before) and period 2 ( `\(t_{2}\)`, after). Group B is treated (gets a partner) while Group A is the control (remains single).

`\(δ_{DD}=\)` effect of treatment on the outcome
- First way of understanding
  - `\(δ_{DD}=(\bar{Y_{B,2}}-\bar{Y_{A,2}})-(\bar{Y_{B,1}}-\bar{Y_{A,1}})\)`
  - `\((\bar{Y_{B,2}}-\bar{Y_{A,2}})\)`: difference between the two groups **"After"** treatment
  - `\((\bar{Y_{B,1}}-\bar{Y_{A,1}})\)`: difference between the two groups **"Before"** treatment
- Second way of understanding
  - `\(δ_{DD}=(\bar{Y_{B,2}}-\bar{Y_{B,1}})-(\bar{Y_{A,2}}-\bar{Y_{A,1}})\)`
  - `\((\bar{Y_{B,2}}-\bar{Y_{B,1}})\)`: the change over time in the treated group
  - `\((\bar{Y_{A,2}}-\bar{Y_{A,1}})\)`: the change over time in the control group
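
As a quick numeric check, the two formulas above can be evaluated on four group means. The numbers below are made up for illustration and are not taken from the course data.

```r
# Hypothetical group means (illustrative only, not from the course data)
y_A1 <- 7.0  # control group (A), before treatment
y_A2 <- 7.2  # control group (A), after treatment
y_B1 <- 6.8  # treated group (B), before treatment
y_B2 <- 7.5  # treated group (B), after treatment

# First way: difference between groups after, minus difference between groups before
did_way1 <- (y_B2 - y_A2) - (y_B1 - y_A1)

# Second way: change in the treated group, minus change in the control group
did_way2 <- (y_B2 - y_B1) - (y_A2 - y_A1)

# Both orderings of the subtraction give the same number
print(c(way1 = did_way1, way2 = did_way2))
```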
  
---
#Difference in Difference
Assume that the treatment occurs between period 1 ( `\(t_{1}\)`, before) and period 2 ( `\(t_{2}\)`, after). Group B is treated (gets a partner) while Group A is the control (remains single).
- Third way of understanding: what would the treatment group outcome be post-treatment, if treatment had not occurred?
  - Suppose that, without the treatment, the treated group would have followed the same trend as the control group
  - The counterfactual outcome is then `\(\bar{Y_{B,1}}\)` + `\((\bar{Y_{A,2}}-\bar{Y_{A,1}})\)` (i.e. the pre-treatment level of the treated group plus the change experienced by the control group)
  - Then, the actual outcome after treatment `\(-\)` the counterfactual outcome is the average treatment effect on the treated (ATT)
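
The counterfactual reasoning above can be sketched with the same kind of arithmetic. The group means are hypothetical and chosen only to make the steps easy to follow.

```r
# Hypothetical group means (illustrative only, not from the course data)
y_A1 <- 7.0  # control group (A), before treatment
y_A2 <- 7.2  # control group (A), after treatment
y_B1 <- 6.8  # treated group (B), before treatment
y_B2 <- 7.5  # treated group (B), after treatment

# Counterfactual: the treated group's starting level plus the change
# experienced by the control group
counterfactual_B2 <- y_B1 + (y_A2 - y_A1)

# Effect on the treated: actual outcome minus counterfactual outcome
att <- y_B2 - counterfactual_B2

print(c(counterfactual = counterfactual_B2, att = att))
```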

---
#Difference in Difference: Visualize the three ways of understanding
The following three graphs correspond to the first, second and third way of understanding
<img src="https://github.com/fancycmn/slide12/blob/main/S12_Pic11.PNG?raw=true" width="100%" style="display: block; margin-left:0px;">

---
#Difference in Difference: assumption of this method
  - Parallel trends
    - We assume that, before the treatment takes place, the dependent variable followed the same trend over time in the treated and the untreated group
    - We assume that the trends would have remained parallel if there had been no treatment

If the assumption holds, then any difference in the trends between the two groups can be attributed to the treatment effect.
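
A minimal sketch of what goes wrong when the assumption fails. It pretends we know the trend the treated group would have followed without treatment, which is never observable in practice; all numbers are hypothetical.

```r
# Hypothetical values (illustrative only)
true_effect                <- 0.5  # effect of the treatment on the treated group
control_trend              <- 0.2  # change in the control group, before to after
treated_trend_no_treatment <- 0.3  # change the treated group would have had WITHOUT
                                   # treatment (unobservable in real data)

y_A1 <- 7.0
y_A2 <- y_A1 + control_trend
y_B1 <- 6.8
y_B2 <- y_B1 + treated_trend_no_treatment + true_effect

# Difference-in-difference computed from the four group means
did <- (y_B2 - y_B1) - (y_A2 - y_A1)

# The estimate is off by the gap between the two untreated trends
bias <- did - true_effect
print(c(did = did, bias = bias))
```

When `treated_trend_no_treatment` equals `control_trend`, the bias is zero and the difference-in-difference recovers `true_effect`.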

---
#Difference in Difference with OLS
Let `\(G_{treatid}=1\)` if observation `\(i\)` is in the treatment group, and 0 if observation `\(i\)` is in the control group.

Let `\(T_{treattime}=1\)` if observation `\(i\)` occurs in the post-treatment period, and 0 if observation `\(i\)` occurs in the pre-treatment period.

`\(Y_{i}=b_{0} + δ_{G}G_{treatid} + δ_{T}T_{treattime} + δ_{GT}(G_{treatid}*T_{treattime})\)`

`\(δ_{GT}\)` is the difference-in-difference estimate of the average treatment effect.

--
  - When it is pre-treatment time ( `\(T_{treattime}=0\)` ),
    - the outcome for the control group ( `\(G_{treatid}=0\)` ) is  `\(Y_{i}=b_{0}\)`
    - the outcome for the treated group ( `\(G_{treatid}=1\)` ) is  `\(Y_{i}=b_{0}+δ_{G}\)`
--
  - When it is post-treatment time ( `\(T_{treattime}=1\)` ),
    - the outcome for the control group ( `\(G_{treatid}=0\)` ) is  `\(Y_{i}=b_{0}+δ_{T}\)`
    - the outcome for the treated group ( `\(G_{treatid}=1\)` ) is  `\(Y_{i}=b_{0}+δ_{G}+δ_{T}+δ_{GT}\)`
--

The difference between the two groups before the treatment is `\(δ_{G}\)`.

The difference between the two groups after the treatment is `\(δ_{G}+δ_{GT}\)`.

The difference in difference is `\(δ_{GT}\)`, the average treatment effect.
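
The mapping between the regression coefficients and the group means can be checked on simulated data. This is only a sketch: the data frame `toy`, the cell size `n` and the coefficient values used to generate `sat` are all hypothetical.

```r
set.seed(1)
n <- 200  # hypothetical number of observations in each group-by-period cell

# One row per observation: every combination of group and period, n times
toy <- expand.grid(obs = 1:n, treatgroup = c(0, 1), treattime = c(0, 1))

# Simulated outcome with a made-up interaction effect of 0.5
toy$sat <- 7 - 0.2 * toy$treatgroup + 0.1 * toy$treattime +
  0.5 * toy$treatgroup * toy$treattime + rnorm(nrow(toy))

# OLS with group, period and their interaction
fit <- lm(sat ~ treatgroup * treattime, data = toy)

# Mean outcome in each of the four cells (rows: group, columns: period)
m <- tapply(toy$sat, list(toy$treatgroup, toy$treattime), mean)

# Difference-in-difference computed directly from the four cell means
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The interaction coefficient reproduces the difference-in-difference of means
print(coef(fit)["treatgroup:treattime"])
print(did_means)
```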
---
#Now use a real example: Does partnership make you happier
- [Prepare the data](https://rpubs.com/fancycmn/972520)
- The appropriate way of estimating depends on whether the data are balanced or unbalanced

---
#DID: preparing balanced and unbalanced data

```r
# Create a balanced panel of waves 1 and 2
balanced <- long_data %>% 
  group_by(id) %>% 
  mutate(
    treatgroup=sum(havepartner), # identify the treated group: sum over waves, a 0/1 indicator provided no one has a partner in wave 1
    treattime=case_when(wave==1 ~ 0,
                        wave==2 ~ 1), # identify the post-treatment period (wave 2)
    grouptime=treatgroup*treattime # interaction between treated group and post-treatment period
  ) %>% 
  filter(check=="1-2") %>% # drop individuals who only participate in the first wave. Now the data are balanced.
  ungroup() # remove the grouping so later steps work on the whole data

# Create the unbalanced panel: same variables, but individuals observed only in wave 1 are kept
unbalanced <- long_data %>% 
  group_by(id) %>% 
  mutate(
    treatgroup=sum(havepartner), # identify the treated group (same definition as above)
    treattime=case_when(wave==1 ~ 0,
                        wave==2 ~ 1), # identify the post-treatment period (wave 2)
    grouptime=treatgroup*treattime # interaction between treated group and post-treatment period
  ) %>% 
  ungroup()
```
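
Whether a panel is balanced can be checked by counting the waves observed per individual. The sketch below uses a small made-up data frame `toy` and base R only, so it runs without the course data; in the slides the same information comes from the `check` variable created during data preparation.

```r
# Hypothetical toy panel: ids 1-3 appear in both waves, id 4 only in wave 1
toy <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3, 4),
  wave = c(1, 2, 1, 2, 1, 2, 1)
)

# Number of waves observed for each individual
n_waves <- table(toy$id)

# The panel is balanced only if everyone is observed equally often
is_balanced <- all(n_waves == max(n_waves))

# Keep only individuals observed in both waves
keep_ids     <- names(n_waves)[n_waves == 2]
toy_balanced <- toy[as.character(toy$id) %in% keep_ids, ]

print(is_balanced)
print(toy_balanced)
```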

---
#DID using OLS regression
.pull-left[

```r
didreg1a <-  lm(sat ~ treatgroup + treattime + grouptime, data = balanced)
summary(didreg1a)
```

```
## 
## Call:
## lm(formula = sat ~ treatgroup + treattime + grouptime, data = balanced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6364 -0.6364  0.3636  1.3636  2.6093 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.51713    0.04452 168.855  < 2e-16 ***
## treatgroup  -0.12640    0.09286  -1.361 0.173538    
## treattime    0.11924    0.06296   1.894 0.058314 .  
## grouptime    0.45030    0.13133   3.429 0.000612 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.735 on 3938 degrees of freedom
## Multiple R-squared:  0.007628,   Adjusted R-squared:  0.006872 
## F-statistic: 10.09 on 3 and 3938 DF,  p-value: 1.279e-06
```
]

.pull-right[

```r
didreg1b <- lm(sat ~ treatgroup* treattime, data = balanced)
summary(didreg1b)
```

```
## 
## Call:
## lm(formula = sat ~ treatgroup * treattime, data = balanced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6364 -0.6364  0.3636  1.3636  2.6093 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.51713    0.04452 168.855  < 2e-16 ***
## treatgroup           -0.12640    0.09286  -1.361 0.173538    
## treattime             0.11924    0.06296   1.894 0.058314 .  
## treatgroup:treattime  0.45030    0.13133   3.429 0.000612 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.735 on 3938 degrees of freedom
## Multiple R-squared:  0.007628,   Adjusted R-squared:  0.006872 
## F-statistic: 10.09 on 3 and 3938 DF,  p-value: 1.279e-06
```
]

---
#DID using twoway fixed effect

```r
balanced1 <- pdata.frame(balanced, index=c("id", "wave"))
didreg2 <- plm(sat ~ grouptime, data=balanced1, model="within", effect = "twoway") # "twoway" includes individual and time fixed effects, so treatgroup and treattime need not be added separately
summary(didreg2)
```

```
## Twoways effects Within Model
## 
## Call:
## plm(formula = sat ~ grouptime, data = balanced1, effect = "twoway", 
##     model = "within")
## 
## Balanced Panel: n = 1971, T = 2, N = 3942
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -5.0596e+00 -4.4038e-01 -4.4409e-16  4.4038e-01  5.0596e+00 
## 
## Coefficients:
##           Estimate Std. Error t-value  Pr(>|t|)    
## grouptime 0.450301   0.095662  4.7072 2.686e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    3178.6
## Residual Sum of Squares: 3143.2
## R-Squared:      0.011128
## Adj. R-Squared: -0.97925
## F-statistic: 22.1578 on 1 and 1969 DF, p-value: 2.6864e-06
```
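
One way to see why the two-way fixed effect estimate coincides with the OLS interaction in a balanced two-wave panel: with two periods, both reduce to comparing the average change in the treated group with the average change in the control group. The sketch below illustrates this on simulated data using base R only; the two-way fixed effect model is written with dummy variables in `lm()` rather than with `plm()`, and all names and numbers are hypothetical.

```r
set.seed(2)
n_id <- 300  # hypothetical number of individuals, each observed in both waves

toy <- data.frame(
  id   = rep(1:n_id, each = 2),
  wave = rep(1:2, times = n_id)
)
toy$treatgroup <- rep(rbinom(n_id, 1, 0.3), each = 2)  # time-constant group indicator
toy$treattime  <- as.numeric(toy$wave == 2)
toy$grouptime  <- toy$treatgroup * toy$treattime
alpha          <- rep(rnorm(n_id), each = 2)            # individual fixed effects
toy$sat        <- 7 + alpha + 0.1 * toy$treattime + 0.5 * toy$grouptime + rnorm(nrow(toy))

# Two-way fixed effects written as dummies for individuals and waves
fe_fit <- lm(sat ~ grouptime + factor(id) + factor(wave), data = toy)

# Pooled OLS with the group-by-time interaction
ols_fit <- lm(sat ~ treatgroup * treattime, data = toy)

# First differences: change in sat per individual, regressed on the group indicator
# (rows are ordered by id and wave, so the two subsets line up)
d_sat  <- toy$sat[toy$wave == 2] - toy$sat[toy$wave == 1]
g      <- toy$treatgroup[toy$wave == 2]
fd_fit <- lm(d_sat ~ g)

# All three point estimates agree
print(c(twoway_fe  = unname(coef(fe_fit)["grouptime"]),
        pooled_ols = unname(coef(ols_fit)["treatgroup:treattime"]),
        first_diff = unname(coef(fd_fit)["g"])))
```

The point estimates agree, but the standard errors generally do not, because the models treat the variation between individuals differently.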

---
#DID results: when the data are balanced
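  - When the data are balanced and the treatment time is the same for all individuals, the two ways of estimation give the same estimate of the average treatment effect.
  - The standard errors differ (0.131 for OLS vs 0.096 for two-way fixed effect), because the fixed effect model absorbs stable differences between individuals.
  - In the case of "partnership", the estimated average treatment effect is 0.4503. Under the parallel trends assumption, this means that getting a partner increases life satisfaction by about 0.45 points on average, compared with remaining single.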

---
#DID: when the data are unbalanced
- OLS regression

```r
didreg3 <- lm(sat ~ treatgroup* treattime, data = unbalanced)
summary(didreg3)
```

```
## 
## Call:
## lm(formula = sat ~ treatgroup * treattime, data = unbalanced)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6364 -0.6364  0.3636  1.3636  2.6093 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.42891    0.03830 193.944  < 2e-16 ***
## treatgroup           -0.03818    0.09161  -0.417 0.676885    
## treattime             0.20746    0.05944   3.490 0.000488 ***
## treatgroup:treattime  0.36208    0.13185   2.746 0.006052 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.771 on 4558 degrees of freedom
## Multiple R-squared:  0.009036,   Adjusted R-squared:  0.008383 
## F-statistic: 13.85 on 3 and 4558 DF,  p-value: 5.443e-09
```

---
#DID: when the data are unbalanced
- Twoway fixed effect

```r
unbalanced1 <- pdata.frame(unbalanced, index=c("id", "wave"))
didreg4 <- plm(sat ~ grouptime, data=unbalanced1, effect="twoway", model="within") 
summary(didreg4)
```

```
## Twoways effects Within Model
## 
## Call:
## plm(formula = sat ~ grouptime, data = unbalanced1, effect = "twoway", 
##     model = "within")
## 
## Unbalanced Panel: n = 2591, T = 1-2, N = 4562
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -5.05962 -0.44038  0.00000  0.44038  5.05962 
## 
## Coefficients:
##           Estimate Std. Error t-value  Pr(>|t|)    
## grouptime 0.450301   0.095662  4.7072 2.686e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    3178.6
## Residual Sum of Squares: 3143.2
## R-Squared:      0.011128
## Adj. R-Squared: -1.2906
## F-statistic: 22.1578 on 1 and 1969 DF, p-value: 2.6864e-06
```

---
#DID: when the data are unbalanced
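**Which to choose: OLS vs two-way fixed effect regression**
- The two-way fixed effect estimate (0.450) is the same as in the balanced data, because individuals observed in only one wave contribute no within variation; OLS uses all observations and gives a different estimate (0.362)
- Two-way fixed effect is used more often
- OLS is better, as long as you are sure that selection (who is observed in only one wave) is not affecting the parallel trend assumption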
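
The behaviour described above can be reproduced on simulated data. This is a sketch under stated assumptions: the data frame `toy`, the group sizes and the lower satisfaction of wave-1-only respondents are all made up, and the two-way fixed effect model is written with dummies in base R `lm()` instead of `plm()`.

```r
set.seed(3)
n_both <- 200  # hypothetical individuals observed in both waves
n_w1   <- 80   # hypothetical individuals observed in wave 1 only

both <- data.frame(id = rep(1:n_both, each = 2), wave = rep(1:2, times = n_both))
both$treatgroup <- rep(rbinom(n_both, 1, 0.3), each = 2)
only1 <- data.frame(id = n_both + 1:n_w1, wave = 1, treatgroup = 0)

toy <- rbind(both, only1)
toy$treattime <- as.numeric(toy$wave == 2)
toy$grouptime <- toy$treatgroup * toy$treattime

# Wave-1-only respondents get a lower average satisfaction (made-up selective dropout)
toy$sat <- 7 + 0.1 * toy$treattime + 0.5 * toy$grouptime -
  0.5 * (toy$id > n_both) + rnorm(nrow(toy))

balanced_toy <- toy[toy$id <= n_both, ]

# Two-way fixed effects (dummy-variable form) on unbalanced and balanced data
fe_unbal <- lm(sat ~ grouptime + factor(id) + factor(wave), data = toy)
fe_bal   <- lm(sat ~ grouptime + factor(id) + factor(wave), data = balanced_toy)

# Pooled OLS with interaction on unbalanced and balanced data
ols_unbal <- lm(sat ~ treatgroup * treattime, data = toy)
ols_bal   <- lm(sat ~ treatgroup * treattime, data = balanced_toy)

print(c(fe_unbalanced  = unname(coef(fe_unbal)["grouptime"]),
        fe_balanced    = unname(coef(fe_bal)["grouptime"]),
        ols_unbalanced = unname(coef(ols_unbal)["treatgroup:treattime"]),
        ols_balanced   = unname(coef(ols_bal)["treatgroup:treattime"])))
```

The fixed effect estimate is unchanged by the wave-1-only respondents, while the OLS estimate moves because those respondents shift the pre-treatment mean of the control group.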

---
#Take home
- Understand what the difference-in-difference method is
- Understand why we use difference-in-difference
- Know the assumption behind difference-in-difference
- Know how to do difference-in-difference estimation
- Important code
  - `plm()` to do twoway fixed effect for DID
  - use `att_gt()` (from the `did` package) to do DID

---
class: center, middle
#[Exercise](https://rpubs.com/fancycmn/1121222)