class: center, middle, inverse, title-slide

.title[
# Advanced quantitative data analysis
]
.subtitle[
## Difference in Difference I
]
.author[
### Mengni Chen
]
.institute[
### Department of Sociology, University of Copenhagen
]

---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
  padding: 20px 80px 20px 80px;
}
.remark-code, .remark-inline-code {
  background: #f0f0f0;
}
.remark-code {
  font-size: 14px;
}
</style>

#Let's get ready

```r
library(tidyverse) # Load the tidyverse packages for data wrangling.
library(haven) # Handle labelled data.
library(texreg) # Output regression results
library(splitstackshape) # Transform wide data (with stacked variables) to long data
library(plm) # Linear models for panel data
```

---
#Does partnership make you happier? Evolution of analysis design

.pull-left[
A cross-sectional OLS
<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2001.JPG?raw=true" width="80%" style="display: block; margin-left:20px;">
]

--

.pull-right[
A fixed effect model
<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2002.JPG?raw=true" width="80%" style="display: block; margin-left:10px;">
]

---
#Does partnership make you happier? Fixed effect design

**What is the true partner effect?**

.pull-left[
Fixed effect looks at the within-person change
<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2002.JPG?raw=true" width="100%" style="display: block; margin-left:20px;">
]

--

.pull-right[
What if those who stay single also change?
<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2003.JPG?raw=true" width="100%" style="display: block; margin-left:10px;">
]

---
#Difference in Difference

- Isolate the within variation for both the treated group and the untreated group.
- Compare the within variation in the treated group to the within variation in the untreated group.
- Because the within variation in the untreated group reflects only changes over time (not the treatment), this comparison controls for time effects.
- **We are looking for how much more the treated group changed than the untreated group when going from before to after.**

<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2004a.JPG?raw=true" width="70%" style="display: block; margin-left:10px;">

---
#Difference in Difference

- For demonstration, we now use only 2 waves
- Assume that the treatment event occurs between `\(t_{1}\)` (wave 1) and `\(t_{2}\)` (wave 2). Group B is treated (gets a partner) while Group A is the control (remains single)
  - Individuals who have a partner at wave 2 belong to the treated group
  - Individuals who remain single at waves 1 and 2 belong to the control group
  - At wave 1, everyone is single. By wave 2, the treatment (partnership formation) has happened. So the treatment time is wave 2.
- `\(δ_{DD}=\)` effect of treatment on the outcome, here the effect of partnership on life satisfaction

---
#Difference in Difference

- First way of understanding
  - `\(δ_{DD}=(\bar{Y_{B,2}}-\bar{Y_{A,2}})-(\bar{Y_{B,1}}-\bar{Y_{A,1}})\)`
      - `\((\bar{Y_{B,2}}-\bar{Y_{A,2}})\)`: difference between the two groups **"After"** treatment
      - `\((\bar{Y_{B,1}}-\bar{Y_{A,1}})\)`: difference between the two groups **"Before"** treatment

<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2005.JPG?raw=true" width="40%" style="display: block; margin-left:10px;">

---
#Difference in Difference

- Second way of understanding
  - `\(δ_{DD}=(\bar{Y_{B,2}}-\bar{Y_{B,1}})-(\bar{Y_{A,2}}-\bar{Y_{A,1}})\)`
      - `\((\bar{Y_{B,2}}-\bar{Y_{B,1}})\)`: the change over time (pathway) for the treated group
      - `\((\bar{Y_{A,2}}-\bar{Y_{A,1}})\)`: the change over time (pathway) for the control group

<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2006.JPG?raw=true" width="40%" style="display: block; margin-left:10px;">

---
#Difference in Difference

- Third way of understanding: what would the post-treatment outcome of the treatment group have been if the treatment had not occurred?
  - Suppose that, had the treatment not occurred, the treated group would have followed the same trend as the control group
  - The counterfactual outcome is then `\(\bar{Y_{B,1}}\)` + `\((\bar{Y_{A,2}}-\bar{Y_{A,1}})\)` (i.e. the treated group's pre-treatment mean plus the change experienced by the control group)
  - Then, the actual outcome after treatment `\(-\)` the counterfactual outcome is the average treatment effect on the treated (ATT)

<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2007.JPG?raw=true" width="45%" style="display: block; margin-left:10px;">

---
#Difference in Difference: Visualize the three ways of understanding

The following three graphs correspond to the first, second and third way of understanding

<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2008.JPG?raw=true" width="100%" style="display: block; margin-left:0px;">

---
#Difference in Difference: assumption of this method

- Parallel trends
  - We assume that the trends of the dependent variable over time were identical in the treated and untreated groups before the treatment took place
  - We assume that the trends would have remained parallel if there had been no treatment. If the assumption holds, then any difference in the trends between the two groups can be attributed to the treatment effect.

---
#Difference in Difference using OLS

Let `\(G_{treatid}=1\)` if individual `\(i\)` is in the treatment group, and 0 if individual `\(i\)` is in the control group. Let `\(T_{treattime}=1\)` if observation `\(t\)` occurs in the post-treatment period, and 0 if observation `\(t\)` occurs in the pre-treatment period.

`\(Y_{i,t}=b_{0} + δ_{G}G_{treatid} + δ_{T}T_{treattime} + δ_{GT}(G_{treatid}*T_{treattime})\)`

`\(δ_{GT}\)` is the difference-in-difference estimate of the average treatment effect.
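
A minimal sketch, not taken from the course data: the simulated panel below checks that the interaction coefficient `\(δ_{GT}\)` in the OLS specification equals the difference in differences of the four group-by-period means. The seed, sample size, and the values chosen for `\(b_{0}\)`, `\(δ_{G}\)`, `\(δ_{T}\)` and `\(δ_{GT}\)` are assumptions made up for illustration; only base R is used.

```r
# Simulated two-group, two-period panel (all names and numbers are illustrative).
set.seed(42)
n <- 200                                        # individuals per group (assumed)
panel <- expand.grid(id = 1:(2 * n), treattime = 0:1)
panel$treatgroup <- as.integer(panel$id > n)    # ids 201-400 form the treated group
panel$group_time <- panel$treatgroup * panel$treattime
true_effect <- 0.5                              # assumed value of delta_GT
panel$y <- 7 - 0.2 * panel$treatgroup + 0.1 * panel$treattime +
  true_effect * panel$group_time + rnorm(nrow(panel))

# Four cell means: rows = group (0 control, 1 treated), columns = period (0 before, 1 after)
m <- tapply(panel$y, list(panel$treatgroup, panel$treattime), mean)

# DiD by hand: change in the treated group minus change in the control group
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# DiD as the interaction coefficient of the OLS model
fit <- lm(y ~ treatgroup + treattime + group_time, data = panel)
did_ols <- unname(coef(fit)["group_time"])

all.equal(did_means, did_ols)                   # compares the two estimates
```

Because the model has one parameter for each group-by-period cell, the two numbers should agree up to floating-point error; neither will equal `true_effect` exactly, since both are estimates from a noisy sample.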
--

- When it is pre-treatment time ( `\(T_{treattime}=0\)` ),
  - the outcome for the control group ( `\(G_{treatid}=0\)` ) is `\(Y_{i,t}=b_{0}\)`
  - the outcome for the treated group ( `\(G_{treatid}=1\)` ) is `\(Y_{i,t}=b_{0}+δ_{G}\)`

--

- When it is post-treatment time ( `\(T_{treattime}=1\)` ),
  - the outcome for the control group ( `\(G_{treatid}=0\)` ) is `\(Y_{i,t}=b_{0}+δ_{T}\)`
  - the outcome for the treated group ( `\(G_{treatid}=1\)` ) is `\(Y_{i,t}=b_{0}+δ_{G}+δ_{T}+δ_{GT}\)`

--

The difference between the two groups before the treatment is `\(δ_{G}\)`. The difference between the two groups after the treatment is `\(δ_{G}+δ_{GT}\)`. The difference in difference is `\(δ_{GT}\)`, the average treatment effect.

---
#Difference in Difference using twoway fixed effect

The TWFE model includes:

- Unit Fixed Effects `\(μ_{i}\)`: Captures time-invariant characteristics of each unit (e.g., state, individual, etc.)
- Time Fixed Effects `\(λ_{t}\)`: Captures time-varying factors common to all units (e.g., macroeconomic trends).
- The regression equation is: `\(Y_{i,t}=μ_{i}+λ_{t}+ βTreatment_{i,t}+ ϵ_{i,t}\)`
  - `\(Treatment_{i,t}\)` is a binary indicator equal to 1 if individual `\(i\)` is treated at time `\(t\)`, and 0 otherwise.
  - `\(β\)` is the average treatment effect on the treated (ATT).

---
#Now use a real example: Does partnership make you happier

- [Prepare the data](https://rpubs.com/fancycmn/1249792)
- The appropriate estimation approach depends on whether the panel is balanced or unbalanced

---
#DID: prepare the data

```r
#Create a balanced panel of waves 1 and 2
twowaves_balanced <- twowaves_long %>%
  filter(check=="1-2") %>%
  group_by(id) %>%
  mutate(
    treatgroup=sum(getpartner), #create a variable to identify the treat group
    treattime=case_when(wave==1 ~ 0, wave==2 ~ 1), #create a variable to identify the treatment time
    group_time=treatgroup*treattime #create an interaction between the treat group and treatment time.
  ) #drop individuals who only participate in the first wave. Now the data is balanced.

twowaves_unbalanced <- twowaves_long %>%
  group_by(id) %>%
  mutate(
    treatgroup=sum(getpartner), #create a variable to identify the treat group
    treattime=case_when(wave==1 ~ 0, wave==2 ~ 1), #create a variable to identify the treatment time
    group_time=treatgroup*treattime #create an interaction between the treat group and treatment time.
  )
```

---
#DID: a detailed look at the data

- balanced data
<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2009.JPG?raw=true" width="100%" style="display: block; margin-left:10px;">

- unbalanced data
<img src="https://github.com/fancycmn/2024-Session12/blob/main/Figure%2010.JPG?raw=true" width="100%" style="display: block; margin-left:10px;">

---
#DID using OLS regression

```r
did_ols1 <- lm(sat ~ treatgroup + treattime + group_time, data = twowaves_balanced)
summary(did_ols1)
```

```
##
## Call:
## lm(formula = sat ~ treatgroup + treattime + group_time, data = twowaves_balanced)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.6364 -0.6364  0.3636  1.3636  2.6093
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.51713    0.04452 168.855  < 2e-16 ***
## treatgroup  -0.12640    0.09286  -1.361 0.173538
## treattime    0.11924    0.06296   1.894 0.058314 .
## group_time   0.45030    0.13133   3.429 0.000612 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
##
## Residual standard error: 1.735 on 3938 degrees of freedom
## Multiple R-squared:  0.007628,  Adjusted R-squared:  0.006872
## F-statistic: 10.09 on 3 and 3938 DF,  p-value: 1.279e-06
```

---
#DID using twoway fixed effect

```r
twowaves_balanced_panel <- pdata.frame(twowaves_balanced, index=c("id", "wave"))
did_fixed1 <- plm(sat ~ group_time, data=twowaves_balanced_panel, model="within", effect = "twoway") ##specify effect = "twoway"
summary(did_fixed1)
```

```
## Twoways effects Within Model
##
## Call:
## plm(formula = sat ~ group_time, data = twowaves_balanced_panel,
##     effect = "twoway", model = "within")
##
## Balanced Panel: n = 1971, T = 2, N = 3942
##
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max.
## -5.0596e+00 -4.4038e-01 -4.4409e-16  4.4038e-01  5.0596e+00
##
## Coefficients:
##            Estimate Std. Error t-value  Pr(>|t|)
## group_time 0.450301   0.095662  4.7072 2.686e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares:    3178.6
## Residual Sum of Squares: 3143.2
## R-Squared:      0.011128
## Adj. R-Squared: -0.97925
## F-statistic: 22.1578 on 1 and 1969 DF, p-value: 2.6864e-06
```

---
#DID using twoway fixed effect

```r
table(twowaves_balanced$partner,twowaves_balanced$group_time)
```

```
##
##          0    1
##   No  3489    0
##   Yes    0  453
```

```r
did_fixed1a <- plm(sat ~ partner, data=twowaves_balanced_panel, model="within", effect = "twoway") ##specify effect = "twoway"
summary(did_fixed1a)
```

```
## Twoways effects Within Model
##
## Call:
## plm(formula = sat ~ partner, data = twowaves_balanced_panel,
##     effect = "twoway", model = "within")
##
## Balanced Panel: n = 1971, T = 2, N = 3942
##
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max.
## -5.0596e+00 -4.4038e-01 -4.4409e-16  4.4038e-01  5.0596e+00
##
## Coefficients:
##            Estimate Std. Error t-value  Pr(>|t|)
## partnerYes 0.450301   0.095662  4.7072 2.686e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
##
## Total Sum of Squares:    3178.6
## Residual Sum of Squares: 3143.2
## R-Squared:      0.011128
## Adj. R-Squared: -0.97925
## F-statistic: 22.1578 on 1 and 1969 DF, p-value: 2.6864e-06
```

---
#DID: two waves and balanced data

```r
texreg::screenreg(list(did_ols1, did_fixed1, did_fixed1a),
                  custom.model.names=c("OLS with interaction", "FE", "FE1a"),
                  include.ci = FALSE, single.row = TRUE)
```

```
##
## =========================================================================
##              OLS with interaction  FE                FE1a
## -------------------------------------------------------------------------
## (Intercept)   7.52 (0.04) ***
## treatgroup   -0.13 (0.09)
## treattime     0.12 (0.06)
## group_time    0.45 (0.13) ***       0.45 (0.10) ***
## partnerYes                                          0.45 (0.10) ***
## -------------------------------------------------------------------------
## R^2           0.01                  0.01              0.01
## Adj. R^2      0.01                 -0.98             -0.98
## Num. obs.    3942                  3942              3942
## =========================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

---
#DID: two waves and balanced data

- When the data are balanced and the treatment time is the same for all individuals, OLS with an interaction and two-way fixed effects give the same estimate of the average treatment effect.
  - Put simply: everyone participated in both waves, and the treatment occurs at wave 2 for all treated individuals.
- In the case of "partnership", the estimated average treatment effect is 0.45. This means that getting a partner is estimated to increase life satisfaction by 0.45 points on average, relative to the change among those who stay single.

---
#DID: two waves and unbalanced data

- OLS regression

```r
did_ols2 <- lm(sat ~ treatgroup + treattime + group_time, data = twowaves_unbalanced)
summary(did_ols2)
```

```
##
## Call:
## lm(formula = sat ~ treatgroup + treattime + group_time, data = twowaves_unbalanced)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.6364 -0.6364  0.3636  1.3636  2.6093
##
## Coefficients:
##             Estimate Std.
Error t value Pr(>|t|)
## (Intercept)  7.42891    0.03830 193.944  < 2e-16 ***
## treatgroup  -0.03818    0.09161  -0.417 0.676885
## treattime    0.20746    0.05944   3.490 0.000488 ***
## group_time   0.36208    0.13185   2.746 0.006052 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.771 on 4558 degrees of freedom
## Multiple R-squared:  0.009036,  Adjusted R-squared:  0.008383
## F-statistic: 13.85 on 3 and 4558 DF,  p-value: 5.443e-09
```

---
#DID: when the data are unbalanced

- Twoway fixed effect

```r
twowaves_unbalanced_panel <- pdata.frame(twowaves_unbalanced, index=c("id", "wave"))
did_fixed2 <- plm(sat ~ group_time, data=twowaves_unbalanced_panel, effect="twoway", model="within")
summary(did_fixed2)
```

```
## Twoways effects Within Model
##
## Call:
## plm(formula = sat ~ group_time, data = twowaves_unbalanced_panel,
##     effect = "twoway", model = "within")
##
## Unbalanced Panel: n = 2591, T = 1-2, N = 4562
##
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max.
## -5.05962 -0.44038  0.00000  0.44038  5.05962
##
## Coefficients:
##            Estimate Std. Error t-value  Pr(>|t|)
## group_time 0.450301   0.095662  4.7072 2.686e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares:    3178.6
## Residual Sum of Squares: 3143.2
## R-Squared:      0.011128
## Adj. R-Squared: -1.2906
## F-statistic: 22.1578 on 1 and 1969 DF, p-value: 2.6864e-06
```

---
#DID using twoway fixed effect

```r
table(twowaves_unbalanced$partner,twowaves_unbalanced$group_time)
```

```
##
##          0    1
##   No  4109    0
##   Yes    0  453
```

```r
did_fixed2a <- plm(sat ~ partner, data=twowaves_unbalanced_panel, model="within", effect = "twoway")
summary(did_fixed2a)
```

```
## Twoways effects Within Model
##
## Call:
## plm(formula = sat ~ partner, data = twowaves_unbalanced_panel,
##     effect = "twoway", model = "within")
##
## Unbalanced Panel: n = 2591, T = 1-2, N = 4562
##
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max.
## -5.05962 -0.44038  0.00000  0.44038  5.05962
##
## Coefficients:
##            Estimate Std. Error t-value  Pr(>|t|)
## partnerYes 0.450301   0.095662  4.7072 2.686e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares:    3178.6
## Residual Sum of Squares: 3143.2
## R-Squared:      0.011128
## Adj. R-Squared: -1.2906
## F-statistic: 22.1578 on 1 and 1969 DF, p-value: 2.6864e-06
```

---
#DID: when the data are unbalanced

```r
texreg::screenreg(list(did_ols2, did_fixed2, did_fixed2a),
                  custom.model.names=c("OLS with interaction", "FE", "FE2a"),
                  include.ci = FALSE, single.row = TRUE)
```

```
##
## =========================================================================
##              OLS with interaction  FE                FE2a
## -------------------------------------------------------------------------
## (Intercept)   7.43 (0.04) ***
## treatgroup   -0.04 (0.09)
## treattime     0.21 (0.06) ***
## group_time    0.36 (0.13) **        0.45 (0.10) ***
## partnerYes                                          0.45 (0.10) ***
## -------------------------------------------------------------------------
## R^2           0.01                  0.01              0.01
## Adj. R^2      0.01                 -1.29             -1.29
## Num. obs.
4562                  4562              4562
## =========================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

---
#DID: compare balanced and unbalanced data

```r
texreg::screenreg(list(did_ols1, did_fixed1, did_fixed1a, did_ols2, did_fixed2, did_fixed2a),
                  custom.model.names=c("OLS(balanced)", "FE(balanced)", "FE1a(balanced)",
                                       "OLS(unbalanced)", "FE(unbalanced)", "FE2a(unbalanced)"),
                  include.ci = FALSE, single.row = TRUE)
```

```
##
## ===================================================================================================================================
##              OLS(balanced)     FE(balanced)      FE1a(balanced)    OLS(unbalanced)   FE(unbalanced)    FE2a(unbalanced)
## -----------------------------------------------------------------------------------------------------------------------------------
## (Intercept)   7.52 (0.04) ***                                       7.43 (0.04) ***
## treatgroup   -0.13 (0.09)                                          -0.04 (0.09)
## treattime     0.12 (0.06)                                           0.21 (0.06) ***
## group_time    0.45 (0.13) ***   0.45 (0.10) ***                     0.36 (0.13) **    0.45 (0.10) ***
## partnerYes                                        0.45 (0.10) ***                                       0.45 (0.10) ***
## -----------------------------------------------------------------------------------------------------------------------------------
## R^2           0.01              0.01              0.01              0.01              0.01              0.01
## Adj. R^2      0.01             -0.98             -0.98              0.01             -1.29             -1.29
## Num. obs.    3942              3942              3942              4562              4562              4562
## ===================================================================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
```

---
#DID: ols vs two-way fixed effect regression

- Two-way fixed effect is more often used
  - When there are more than two waves
  - When there are unobserved time-invariant characteristics
  - Automatically controls for any unobserved, time-invariant individual characteristics and time trends that affect all units equally.
- OLS
  - When you have two waves, two groups, and balanced data, both fixed effect and OLS with interaction yield identical estimates of the treatment effect
  - When you have two waves, two groups, and balanced data, OLS with interaction is simpler and more straightforward.

---
#DID: six waves data

```r
sixwaves_long2_panel <- pdata.frame(sixwaves_long2, index=c("id", "wave"))
did_fixed3 <- plm(sat ~ partner, data=sixwaves_long2_panel, model="within", effect = "twoway")
summary(did_fixed3)
```

```
## Twoways effects Within Model
##
## Call:
## plm(formula = sat ~ partner, data = sixwaves_long2_panel, effect = "twoway",
##     model = "within")
##
## Unbalanced Panel: n = 2591, T = 1-6, N = 9409
##
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max.
## -7.74613 -0.49037  0.00000  0.54974  5.69040
##
## Coefficients:
##            Estimate Std. Error t-value  Pr(>|t|)
## partnerYes 0.396628   0.048681  8.1475 4.386e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares:    10762
## Residual Sum of Squares: 10658
## R-Squared:      0.0096507
## Adj. R-Squared: -0.36776
## F-statistic: 66.381 on 1 and 6812 DF, p-value: 4.3856e-16
```

```r
#Interpretation: the estimated average treatment effect is 0.40 (0.396628 rounded).
#This means that getting a partner is estimated to increase life satisfaction by about 0.40 points
```

---
#Take home

- Understand what the difference-in-difference method is
  - Why we use difference-in-difference
  - Know the assumption behind difference-in-difference (parallel trends)
  - Know how to do difference-in-difference estimation
- Important code
  - `plm()` to do twoway fixed effect for DID

---
class: center, middle

#[Exercise](https://rpubs.com/fancycmn/1249788)