\[ Y_{it}=\beta_{0}+\beta_{1}X_{it}+\delta_{2}T_{i,t=2}+\delta_{3}T_{i,t=3}+\ldots+\delta_{t}T_{i,t=t}+u_{it}\]
\(\delta_{2}T_{i,t=2},\delta_{3}T_{i,t=3},\ldots,\delta_{t}T_{i,t=t}\)這些變數可以解釋不同時間點之間的異質性。
假設調查1,000位民眾,\(Y_{it}\)是選後對於民主的支持度,\(T_{i,t=1}\)是選前一個月對於民主的支持度,\(T_{i,t=2}\)是選前三個月對於民主的支持度,\(X_{it}\)是選後的民主價值,可以用以下的固定效果(fixed-effect)模型表示這這幾個變數之間的關係:(出處)
\[ Y_{it}=\beta_{0}+\beta_{1}X_{it}+\delta_{2}T_{i,t=2}+u_{it}\]
資料中有三個時間點\(t,t=1,t=2\),方程式的右邊中只需要1避免共線性。
也可以寫成
\[\begin{align*} Y_{it}=\alpha_{i}+\beta_{t}X_{it}+\epsilon_{it} \tag{1.1} \end{align*}\]
\[\sum_{i=1}^{N}\sum_{t=1}^{T} \{(\{Y_{i}-\bar{Y}) - \beta(X_{it}-\bar{X_{i}})\}^2\]
\[\begin{align*} \bar{Y_{i}}=\alpha_{i}+\beta_{t}\bar{X_{i}}+ \bar{\epsilon_{i}} \tag{1.2} \end{align*}\]
\[Y_{it}-\bar{Y}_{i}=\beta(X_{it}-\bar{X}_{i})+(\epsilon_{it}-\bar{\epsilon}_{i})\]
時間序列的資料也被稱為longitidunal data,可以是觀察一個個人或一個群體在不同時間的變化,例如每天中共戰機來擾台的飛機數,也可以是一群人在不同時間的變化,例如一些國家在不同時間點的GDP成長率。如果加上不同的地點,例如不同國家在不同年度的資料,稱為panel data。
用矩陣表示如下:
\[\begin{align*} \left[ \begin{array}{c} y_{i1} \\ y_{i2} \\ \vdots\\ y_{iT}\\ \end{array} \right] = \left[ \begin{array}{ccc} x_{i11} & x_{i12}&\cdots & x_{i1k} \\ x_{i21} & x_{i22}&\cdots & x_{i2k} \\ \vdots & \vdots & \ddots & \vdots\\ x_{nT1}& x_{nT2}&\cdots & x_{nTk} \\ \end{array} \right] \left[ \begin{array}{c} \beta_{i1} \\ \beta_{i2} \\ \vdots\\ \beta_{ik}\\ \end{array} \right] + \left[ \begin{array}{c} \epsilon_{i1} \\ \epsilon_{i2} \\ \vdots\\ \epsilon_{ik}\\ \end{array} \right] \end{align*}\]
我們假設\(\hat{\delta_{k}}\sim N(\delta_{k}, \sigma^2)\),也就是每個時間的影響應該來自於平均數\(\delta_{k}\)、變異數一致是\(\sigma^2\)的常態分佈。
實務上,可以單純地納入時間點在迴歸模型中,然後估計係數。
Imai and Kim (2019)證明上述的\(\hat{\beta}_{FE}\)有可能不是最佳估計。如果\(X_{i}\in \{0,1\}\),也就是二元變數,類似因果關係裡面的刺激(treatment),\(\hat{\beta}_{FE}\)會與\(X_{i}\)的變異數成比例,也就不是固定不變,違反最小平方法的迴歸模型假設。
第一種方法是在觀察的時間序列之外,收集資料構成對照的時間序列,如果干預或者刺激有效果,應該只會影響觀察到的時間序列,不會影響對照的時間序列。建立因果關係的時間序列分析的模型是為了預測反事實的結果。做法是使用控制組的時間序列預測如果沒有干預的情形下,時間序列的結果是什麼?
例如某家公司的股價可能因為某天宣布裁員而上漲(因為減少人事成本),但是其他公司有可能都因為其他利多都上漲了,所以必須建立一個綜合許多公司的時間序列,才能估計裁員消息與股價的因果關係。這套方法也稱為synthetic control。
已經有CausalImpact
這個套件可以執行以上的估計。
Brodersen et al.有完整的說明
實務上,需要收集三個資料。第一是依變數受到刺激之前的時間序列,第二個是同一時間其他變數受到刺激之前的時間序列,第三則是有關依變數的參數的資訊。這三個資訊用state space model加以結合。
在這張圖3.1中,\(x_{t}\)是觀察不到的潛在變數,\(y_{t}\)則是\(x_{t}\)的函數:
Figure 3.1: State Space Model
\[y_{t}=A_{t} x_{t}+v_{t}\]
\[x_{t}=\Phi x_{t-1}+w_{t}\]
rm(list = ls())
library(ggplot2)
library(patchwork)
library(CausalImpact)
library(assertthat)
library(dplyr)
#as_patchwork
patchwork <- here::here('source', 'add_plot.R')
#source(patchwork)
#plot_annotation
annotation <- here::here('source', 'plot_annotation.R')
source(annotation)
CreateImpactPlot <- here::here('source', 'CreateImpactPlot.R')
source(CreateImpactPlot)
#create data
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
1#causal impact analysis
## [1] 1
pre.period <- c(1, 70)
post.period <- c(71, 100)
impact <- CausalImpact(data, pre.period, post.period)
plot(impact, c("original", "cumulative")) +
plot_annotation(title = "Analysis of click behavior after intervention"
, theme = theme(plot.title = element_text(hjust = 0.5))) &
theme(
panel.background = element_rect(fill = "transparent"), # panel bg
plot.background = element_rect(fill = "transparent", color = NA), # plot bg
panel.grid.major = element_blank(), # get rid of major grid
panel.grid.minor = element_blank())
Figure 3.2: Analysis of click behavior after intervention
參考Nurkholis的說明。
模擬每日交易的消費者數目(DTU),分析交易界面的更新提高交易的程度。
納入5個變數解釋DTU,並且考慮季節的作用,以及先驗的分佈。
首先準備資料:
require(CausalImpact)
require(ggplot2) # Optional, only for visualization purposes
require(timeDate) # Optional, only for simulating the data
# Set iteration counts
niter = 3000
# Set seed to ensure simulation reproducibility, entirely optional
addTaskCallback(function(...) {set.seed(1212);TRUE})
## 1
## 1
# Generate workdays from June 2021 until May 2022, totalling 261 days
time_idx = as.timeDate(seq(from = as.Date('2021-06-01'), to = as.Date('2022-05-31'), by = 'day'))
time_idx = as.Date(time_idx[isBizday(time_idx, holidays = holidayNYSE(), wday = 1:5)])
length(time_idx)
## [1] 261
# Generate response: daily transacting user
dtu = arima.sim(n = 261, list(ar = c(0.5677, -0.132), ma = c(-0.2456)), sd = 180) + seq(from = 2540, to = 2400, length.out = 50) + 1000
dtu[1:220] = dtu[1:220] + (seq(from = 50000, to = 1, length.out = 220))^0.6
dtu[121:145] = dtu[121:145] + (seq(from = 1, to = 20000, length.out = 25))^0.5
dtu[146:170] = dtu[146:170] + (seq(from = 20000, to = 1, length.out = 25))^0.5
# Generate predictors: daily active user
dau = dtu*20 + rnorm(261, sd = 60)
# Generate predictors: push notification ctr
pn = (dtu-mean(dtu))/sd(dtu) + rgamma(n = 261, shape = 2, scale = 0.5)
pn = (pn-min(pn))/(max(pn)-min(pn)) * 0.03
# Generate predictors: email ctr
email = (dtu-mean(dtu))/sd(dtu) + rgamma(n = 261, shape = 1.5, scale = 0.9)
email = (email-min(email))/(max(email)-min(email)) * 0.021
# Generate predictors: Google ads ctr
gads = (dtu-mean(dtu))/sd(dtu) + rgamma(n = 261, shape = 15, scale = 10)
gads = (gads-min(gads))/(max(gads)-min(gads)) * 0.08
# Generate predictors: daily new users
nu = sqrt(dtu*40) - sd(dtu) + rgamma(n = 261, shape = 6, scale = 29)
# Add intervention effect on April 1st 2022. We also keep the true counterfactual.
dtu_counterfactual = dtu
dtu[219:261] = dtu[219:261] + (seq(from = 1, to = 50000, length.out = 43))^0.5
# Finalization
data = zoo((cbind(round(dtu), round(dau), pn, email, gads, round(nu))), time_idx)
colnames(data) = c('DTU', 'DAU', 'PN', 'EMAIL', 'G-ADS', 'NU')
# Define Pre and Post Intervention Periods
pre_intervention = as.Date(c('2021-06-01', '2022-03-31'))
post_intervention = as.Date(c('2022-04-01', '2022-05-31'))
# Compute the standard deviation of response (DTU) in the pre-intervention period
# useful for specifying prior later
sd_dtu = sd(data$DTU[index(data) <= pre_intervention[2]])
# Set the response (DTU) in post-intervention to NA
# We keep the observed post-intervention in a separate object (post_period_dtu)
post_period_dtu = as.vector(data$DTU[index(data) >= post_intervention[1]])
data$DTU[index(data) >= post_intervention[1]] = NA
tail(data) # DTU should be NA
## DTU DAU PN EMAIL G-ADS NU
## 2022-05-24 NA 64916 0.008519 0.009574 0.04661 304
## 2022-05-25 NA 73064 0.008548 0.003735 0.03506 277
## 2022-05-26 NA 68495 0.004606 0.005792 0.02570 141
## 2022-05-27 NA 70408 0.007215 0.005454 0.03054 278
## 2022-05-30 NA 63408 0.002184 0.002497 0.01565 258
## 2022-05-31 NA 61576 0.001674 0.000000 0.03303 182
cat('Intervention of April 1st 2022 is the', which(time(data) == '2022-04-01'),'th data')
## Intervention of April 1st 2022 is the 219 th data
time(data) = seq(1, NROW(data), length.out = NROW(data))
# > Output: Intervention of April 1st 2022 is the 219 th data
ss = list() # Initial
ss = AddLocalLevel(ss, y = data$DTU) # Add Local Level, pass the empty list
#ss = AddSeasonal(ss, y = data$DTU, nseasons = 50) # Add Seasonality, pass the #local level list
ss = AddAr(ss, y = data$DTU) # Add Autoregressive, pass the local level and seasonal
summary(lm(DTU ~. , data = data))
##
## Call:
## lm(formula = DTU ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.552 -1.641 -0.101 1.860 6.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.82e+00 4.62e+00 0.61 0.54
## DAU 5.00e-02 7.25e-05 688.76 <2e-16 ***
## PN 3.51e+01 8.35e+01 0.42 0.67
## EMAIL -4.76e+01 8.24e+01 -0.58 0.56
## `G-ADS` 9.46e+00 1.35e+01 0.70 0.48
## NU -4.45e-04 2.92e-03 -0.15 0.88
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3 on 212 degrees of freedom
## (43 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.94e+05 on 5 and 212 DF, p-value: <2e-16
ss = list()
# Local Linear Trend
ss = AddLocalLinearTrend(ss, y = data$DTU, level.sigma.prior = SdPrior(sigma.guess = sd_dtu, sample.size = 0.01))
# Seasonality
ss = AddSeasonal(ss, y = data$DTU, nseasons = 50, season.duration = 1)
# Static Regression, we only specify the prior
reg_prior = SpikeSlabPrior(x = cbind(1, as.matrix(data[1:218, -1])),
y = as.matrix(data[1:218, 1]),
expected.r2 = 0.9,
prior.df = 25,
expected.model.size = 2)
bsts_model = bsts(DTU~., data = data,
state.specification = ss,
niter = 10,
prior = reg_prior)
## =-=-=-=-= Iteration 0 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 1 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 2 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 3 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 4 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 5 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 6 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 7 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 8 Fri Jun 7 08:39:45 2024 =-=-=-=-=
## =-=-=-=-= Iteration 9 Fri Jun 7 08:39:45 2024 =-=-=-=-=
PlotBstsCoefficients(bsts_model)
result = CausalImpact(bsts.model = bsts_model,
post.period.response = post_period_dtu)
summary(result)
## Posterior inference {CausalImpact}
##
## Average Cumulative
## Actual 3642 156610
## Prediction (s.d.) 3492 (NA) 150141 (NA)
## 95% CI [3495, 3495] [150267, 150267]
##
## Absolute effect (s.d.) 150 (NA) 6469 (NA)
## 95% CI [148, 148] [6343, 6343]
##
## Relative effect (s.d.) 4.2% (NA%) 4.2% (NA%)
## 95% CI [4.2%, 4.2%] [4.2%, 4.2%]
##
## Posterior tail-area probability p: 0.5
## Posterior prob. of a causal effect: 50%
##
## For more details, type: summary(impact, "report")
plot(result)
Figure 3.3: Causal Impact
PanelMatch
這個套件執行以上的方法。PanelMatch
套件顯示資料結構,紅色代表民主的狀態,藍色代表非民主的狀態:library(PanelMatch)
DisplayTreatment(unit.id = "wbcode2",
time.id = "year", legend.position = "none",
xlab = "year", ylab = "Country Code",
treatment = "dem", data = dem)
Figure 4.1: 民主與否的資料
在這個模型中,我們允許\(L=2,\ldots\),代表接收刺激後效果延遲出現的時間點可以至少有2。同時允許\(F=0,1,\ldots\),也就是效果發生後,對於往後幾年的影響。
當政策\(X\)從0變成1時的平均效果ATT的表示方式:
\[ \delta(F, L)=\left\{Y_{i,t+F}\left(X_{it}=1,X_{i,t+1}=0,\{X_{i,t-\ell}\}_{\ell=2}^L\right)-Y_{i,t+F}\left(X_{it}=0,X_{i,t-1}=0,\{X_{i,t-\ell}\}_{\ell=2}^L\right)\mid X_{it}=1,X_{i,t-1}=0\right\} \]
\[ \mathbb{E}[Y_{i,t+F}\left(X_{it}=X_{i,t-1}=1,\{X_{i,t-\ell}\}_{\ell=2}^L\right)-Y_{i,t-1}\mid X_{it}=0,X_{i,t-1}=1,\{X_{i,t-\ell},Y_{i,t-\ell}\}_{\ell=2}^L,\{\textrm{Z}_{i,t-\ell}\}_{\ell=0}^L]\\ =\mathbb{E}[Y_{i,t+F}\left(X_{it}=X_{i,t-1}=1,\{X_{i,t-\ell}\}_{\ell=2}^L\right)-Y_{i,t-1}\mid X_{it}=1,X_{i,t-1}=0,\{X_{i,t-\ell},Y_{i,t-\ell}\}_{\ell=2}^L,\{\textrm{Z}_{i,t-\ell}\}_{\ell=0}^L] \]
過去的結果以及相關特徵成立的條件下,不論過去X是1或0,現在X變成1之後,Y的平均差異相同。
ART表示從刺激的反轉,也就是從民主轉為非民主所造成的平均差異。
\[ \mathcal{M}_{it}=\{\textsf{i}^{\prime}:\textsf{i}^{\prime}\neq \textsf{i}, X_{\textsf{i}^{\prime}t^{\prime}}=0, X_{\textsf{i}^{\prime}t^{\prime}}=X_{it^{\prime}}\hspace{0.5em}\text{for}\hspace{0.5em}\text{all}\hspace{0.5em}t^{\prime}=t-1,\ldots t-L\} \]
Figure 4.2: 配對觀察值(Imai, Kim, and Wang, 2023)
如果找不到相對應的控制觀察值,就剔除實驗的觀察值。
多階段的差異效果估計式(Multi-period DiD estimator):
\[ \hat{\delta}_{att}(F,L)=\frac{1}{\sum_{t=t+1}^{T-F}D_{it}}\sum_{i=1}^{N}\sum_{t=L+1}^{T-F}D_{it}\Big\{(Y_{i,t+F}-Y_{i,t-1})-\sum_{i\in\mathcal{M}_{it}}w_{it}^{\textsf{i}^{\prime}}(Y_{\textsf{i}^{\prime},t+F}-Y_{\textsf{i}^{\prime},t-1})\Big\} \]
# get a matched set without refinement
sets0 <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2",
treatment = "dem", refinement.method = "none",
data = dem, match.missing = FALSE,
covs.formula = ~ I(lag(y, 1:4)) + I(lag(tradewb, 1:4)),
size.match = 5, qoi = "att",
outcome.var = "y",
lead = 0:4, forbid.treatment.reversal = FALSE)
# get a matched set with refinement using CBPS.match, setting the
# size of matched set to 5
sets1 <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2",
treatment = "dem", refinement.method = "mahalanobis",
data = dem, match.missing = FALSE,
covs.formula = ~ I(lag(y, 1:4)) + I(lag(tradewb, 1:4)),
size.match = 5, qoi = "att",
outcome.var = "y",
lead = 0:4, forbid.treatment.reversal = FALSE)
# get another matched set with refinement using CBPS.weight
sets2 <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2",
treatment = "dem", refinement.method = "ps.weight",
data = dem, match.missing = FALSE,
covs.formula = ~ I(lag(y, 1:4)) + I(lag(tradewb, 1:4)),
size.match = 10, qoi = "att",
outcome.var = "y",
lead = 0:4, forbid.treatment.reversal = FALSE)
# use the function to produce the scatter plot
balance_scatter(non_refined_set = sets0$att,
matched_set_list = list(sets1$att, sets2$att),
data = dem,
covariates = c("y", "tradewb"))
# add legend
legend(x = 0, y = 0.8,
legend = c("mahalanobis",
"PS weighting"),
y.intersp = 0.65,
x.intersp = 0.3,
xjust = 0,
pch = c(1, 3), pt.cex = 1,
bty = "n", ncol = 1, cex = 1, bg = "white")
#add some additional data to data set
dem$rdata <- runif(runif(nrow(dem)))
pm.obj <- PanelMatch(lead = 0:3, lag = 4, time.id = "year", unit.id = "wbcode2", treatment = "dem",
outcome.var ="y", refinement.method = "mahalanobis",
data = dem, match.missing = TRUE,
covs.formula = ~ tradewb + rdata + I(lag(tradewb, 1:4)) + I(lag(y, 1:4)),
size.match = 5, qoi = "att")
get_covariate_balance(pm.obj$att, dem, covariates = c("tradewb", "rdata"),
ylim = c(-2,2))
## tradewb rdata
## t_4 0.03667 -0.0009159
## t_3 -0.03548 0.1417359
## t_2 -0.02535 0.0104848
## t_1 0.07292 0.0487455
get_covariate_balance(pm.obj$att, dem, covariates = c("tradewb", "rdata"),
plot = TRUE, ylim = c(-2,2))
PM.results <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2",
treatment = "dem", refinement.method = "mahalanobis",
data = dem, match.missing = TRUE,
covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)),
size.match = 5, qoi = "att",
outcome.var = "y", lead = 0:4, forbid.treatment.reversal = FALSE)
#print(PM.results$att)
plot(PM.results$att)
plot(PM.results$att, include.empty.sets = TRUE)
PM.results <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2",
treatment = "dem", refinement.method = "mahalanobis",
data = dem, match.missing = TRUE,
covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)),
size.match = 5, qoi = "att",
outcome.var = "y", lead = 0:4, forbid.treatment.reversal = TRUE)
PE.results <- PanelEstimate(sets = PM.results, data = dem,
number.iterations = 100)
summary(PE.results)
## Weighted Difference-in-Differences with Mahalanobis Distance
## Matches created with 4 lags
##
## Standard errors computed with 100 Weighted bootstrap samples
##
## Estimate of Average Treatment Effect on the Treated (ATT) by Period:
## $summary
## estimate std.error 2.5% 97.5%
## t+0 -0.9685 0.9545 -2.695 0.9333
## t+1 -1.2142 1.7868 -4.761 2.5191
## t+2 -0.6130 2.2185 -5.102 4.0725
## t+3 0.9730 2.4839 -3.875 6.3687
## t+4 1.4794 2.6956 -3.483 6.5741
##
## $lag
## [1] 4
##
## $iterations
## [1] 100
##
## $qoi
## [1] "att"
plot(PE.results)
Figure 4.3: ATT估計
PM.results <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2",
treatment = "dem", refinement.method = "mahalanobis",
data = dem, match.missing = TRUE,
covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)),
size.match = 10, qoi = "att",
outcome.var = "y", lead = 0:4, forbid.treatment.reversal = TRUE)
PE.results <- PanelEstimate(sets = PM.results, data = dem,
number.iterations = 100)
summary(PE.results)
## Weighted Difference-in-Differences with Mahalanobis Distance
## Matches created with 4 lags
##
## Standard errors computed with 100 Weighted bootstrap samples
##
## Estimate of Average Treatment Effect on the Treated (ATT) by Period:
## $summary
## estimate std.error 2.5% 97.5%
## t+0 -1.3886 1.014 -3.202 0.5747
## t+1 -1.6657 1.640 -4.330 1.6831
## t+2 -0.9682 2.107 -4.255 3.1408
## t+3 0.2769 2.358 -3.727 4.8725
## t+4 0.8332 2.569 -4.237 5.5699
##
## $lag
## [1] 4
##
## $iterations
## [1] 100
##
## $qoi
## [1] "att"
plot(PE.results)
透過具有因果推論基礎的配對,不需要貝氏統計的假設,可以估計時間序列中的因果關係。
\(\texttt{PanelMatch}\)套件操作容易,但是要花時間配對。
## 最後更新時間: 2024-06-07 08:39:56