例如:對一群人隨機抽樣,有人抽到籤去服役,有人不需要去服役,然後在若干年後詢問這些人對於國家的認同感,這些人之間唯一的差別是有沒有去服役,就可以估計出服役與否對國家認同感的作用(Erikson, 2011)。
例如:對叛亂組織轟炸後,如果這些組織存活,轟炸會降低還是提高這些組織的地位?比對有受到與沒受到轟炸的叛亂組織的後續發展,可以估計轟炸的效果(Lyall, 2009)。
例如:在1997,英國太陽報從支持保守黨轉而支持工黨,而在2009,太陽報從支持工黨又轉回保守黨。藉由統計報紙的立場轉變前後的讀者態度,可以估計媒體的影響 (Reeves, McKee, Stuckler, 2016)。
過去統計比較重視預測,但是近年來越來越重視因果關係,因為因果關係的研究可以告訴我們相關背後的機制,進而延伸到其他領域的研究。但是因果關係需要好的研究設計以及適當的資料,有時候可遇不可求。
另一方面,實驗設計雖然可以確認因果關係,但是是不是具有理論意涵?政策意涵?還是只是發現造成差異的條件?需要進一步地研究。
Figure 2.1: 學前教育
Figure 2.2: 餐廳員工數與最低時薪
Figure 2.3: 餐廳員工數與最低時薪
\(D_{i}\): Indicator of treatment for unit \(i\) \[ \mathrm{D}_{i} = \begin{cases} 1, & \text{if unit} \hspace{3pt} i \hspace{3pt} \text{receives treatment} \\ % & is your "\tab"-like command (it's a tab alignment character) 0, & \text{otherwise} \end{cases} \]
\(Y_{i}\): Observed outcome of interest for unit \(i\)
\[ \tag{1} \mathrm{Y}_{di}=\begin {cases} Y_{0i}, & \text{Potential outcome for unit} \hspace{3pt}i\hspace{3pt} \text{without treatment} \\ Y_{1i}, & \text{Potential outcome for unit } \hspace{3pt} i \hspace{3pt} \text{with treatment} \end{cases} \]
\(Y_{i}=D_{i}\cdot Y_{1i}+(1-D_{i})\cdot Y_{0i}\)
\[\begin{equation} \tag{2} \mathrm{Y}_{i}=\begin {cases} Y_{0i}, & \text{if} \hspace{3pt}D_{i}=0 \\ Y_{1i}, & \text{if} \hspace{3pt} D_{i}=1 \end{cases} \end{equation}\]
\[\alpha_{i} = Y_{1i}-Y_{0i}\]
\[ Y_{i}(D_{1}, D_{2},\ldots,D_{n})=Y_{i}(D'_{1}, D'_{2},\ldots,D'_{n})\quad \text{if}\quad D_{i}=D'_{i} \]
\[\begin{align} \tag{3} ATE & = E[Y_{1}-Y_{0}]\\ & =E[Y_{1}]-E[Y_{0}]\\ & =E[Y_{1}\mid D=1]-E[Y_{0}\mid D=0] \end{align}\]
\(i\) | \(Y_{1i}\) | \(Y_{0i}\) | \(D_{i}\) | \(\alpha_{i}\) | \(Y_{i}\) |
---|---|---|---|---|---|
1 | 3 | 0 | 1 | 3 | 3 |
2 | 1 | 1 | 1 | 0 | 1 |
3 | 1 | 0 | 1 | 0 | |
4 | 1 | 1 | 0 | 0 | |
\(E[Y_{1}]\) | 1.5 | ||||
\(E[Y_{0}]\) | 0.5 | ||||
\(E[Y_{1}-Y_{0}]\) |
計算第2個時間點或者假設受到刺激的測量結果的平均值\(E[Y_{1}]\): \[ E[Y_{1}] = \frac{1}{N}\Sigma Y_{1i}=1.5 \]
計算第1個時間點或者假設沒得到刺激的測量結果的平均值\(E[Y_{0}]:\)$ E[Y_{0}] = Y_{0i}=0.5 $$
兩者相減: \[ E[Y_{1}]-E[Y_{0}] = 1 \]
估計平均實驗效果: \[ \alpha_{ATE}=E[Y_{1}-Y_{0}]=\frac{1}{4}\cdot(3+0+1+0)=1 \]
因為 \[Y_{i}=Y_{1}\cdot D+(1-D)\cdot Y_{0}\]
所以\(\sum Y_{i}=3+1+0+0=4\)。\(E[Y_{i}]=1\)。
\[\begin{align*} \text{ATE} & =E[Y_{1}|D=1]-E[Y_{0}|D=0]\\ & = E[Y_{1}|D=1]-E[Y_{0}|D=1] +E[Y_{0}|D=1]-E[Y_{0}|D=0] \end{align*}\]
\(E[Y_{0}|D=1] - E[Y_{0}|D=1]\)其實是0,但是我們刻意加進去。
\(E[Y_{1}|D=1]-E[Y_{0}|D=1]\)被稱為ATT或者ATET, expected treatment effect given the treatment。而 \(E[Y_{0}|D=1]-E[Y_{0}|D=0]\)代表某種誤差。也就是說:
\[ \text{ATE}=\text{ATT}+Bias \]
\(i\) | \(Y_{1i}\) | \(Y_{0i}\) | \(D_{i}\) | \(\alpha_{i}\) | \(Y_{i}\) |
---|---|---|---|---|---|
1 | 3 | 0 | 3 | 3 | |
2 | 1 | 1 | 0 | 1 | |
\(E[Y_{1}]\) | 2 | ||||
\(E[Y_{0}]\) | 0.5 | ||||
\(E[Y_{1}-Y_{0}]\) |
\[ \alpha_{ATT}=E[Y_{1}-Y_{0}|D=1]=\frac{1}{2}\cdot(3+1-0-1)=1.5 \]
\(\alpha_{ATT}\)與\(\alpha_{i}\)相差0.5。
\(\alpha_{ATT}\)與\(\alpha_{i}\)相等的前提是\(E[Y_{0}|D=1]-E[Y_{0}|D=0]= 0\)。也就是說:實際上收到刺激但是「假設」並沒有收到的表現狀況,跟實際上沒收到刺激而且真的沒有收到刺激的表現狀況相等。例如:給植物澆水,但是沒注意到水都澆到盆栽外面去,跟沒澆水的表現狀況相同。或者給頭痛患者服藥,但是因為頭太痛而忘了吃藥,藥效跟沒吃一樣。如果這些情形為真,表示就算我們無法觀察到控制組的受試者「真的」得到刺激的表現,我們可以假設控制組的受試者即使「真的」得到刺激,他們的表現跟沒有得到刺激的受試者相同。
\[ \alpha_{ATT}=E[Y_{1i}]=\frac{1}{2}\cdot(3+0)=1.5 \]
假設有兩群人,一群人吃頭痛藥以減緩頭痛,另一群人不吃頭痛藥。ATT代表吃頭痛藥的這一群人之後的平均頭痛狀況,ATE則代表吃頭痛藥與不吃頭痛藥之後的平均頭痛狀況差異。如果頭痛藥真的對每個人都有效,ATE會大於ATT,這是因為沒吃藥的人的平均頭痛狀況不會減輕,吃藥的人才會減輕,ATE代表的差異就會很明顯。
\[ Y=\beta_{0} +\beta_{1}D \]
\[\begin{align} D= \begin{cases}1\\ 0 \end{cases} \end{align}\]
\[ \hat{\beta}_{OLS}=\frac{\sum_{i=1}^n(Y-\bar{Y})(D-\bar{D})}{\sum_{i=1}^n(D-\bar{D})^2}=\hat{\tau} \]
\[ Y=\beta_{0} +\beta_{1}D+\beta_{2}X \]
star <- read.csv("~/Dropbox/EastAsia2024/data/DSS/STAR.csv")
head(star)
## classtype reading math graduated
## 1 small 578 610 1
## 2 regular 612 612 1
## 3 regular 583 606 1
## 4 small 661 648 1
## 5 small 614 636 1
## 6 regular 610 603 0
star <- star %>% mutate(D = as.factor(classtype)) %>%
mutate(graduated = recode_factor(graduated,
'1'='Yes', '0'='No'))
class(star$D)
## [1] "factor"
levels(star$D)
## [1] "regular" "small"
mean(star$reading[star$D=='small'])
## [1] 632.7
mean(star$reading[star$D=='regular'])
## [1] 625.5
library(dplyr)
star.sta <- star|> group_by (D) |>
summarise(avg.reading = mean(reading)) |>
mutate(avg.reading =round(avg.reading, 3))
p1 <- ggplot2::ggplot(data=star.sta, aes(x=D, y = avg.reading, fill=D)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = avg.reading) )
p1
library(gtsummary)
star %>% tbl_summary(
by = D,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
graduated ~ "{n} / {N} ({p}%)"
),
digits = all_continuous() ~ 2) %>%
add_overall() %>%
add_n() %>%
modify_header(label ~ "**Variable**")
Variable | N | Overall, N = 1,2741 | regular, N = 6891 | small, N = 5851 |
---|---|---|---|---|
classtype | 1,274 | |||
regular | 689 (54%) | 689 (100%) | 0 (0%) | |
small | 585 (46%) | 0 (0%) | 585 (100%) | |
reading | 1,274 | 628.80 (36.73) | 625.49 (35.88) | 632.70 (37.37) |
math | 1,274 | 631.59 (38.84) | 628.84 (37.94) | 634.83 (39.66) |
graduated | 1,274 | 1,108 / 1,274 (87%) | 597 / 689 (87%) | 511 / 585 (87%) |
1 n (%); Mean (SD); n / N (%) |
mean(star$reading[star$D=='small']) - mean(star$reading[star$D=='regular'])
## [1] 7.211
m1 <- lm(reading ~ D, data = star)
# Summarize the model
stargazer::stargazer(m1, title = 'OLS and Causal Inference', type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
Dependent variable: | |
reading | |
Dsmall | 7.211*** |
(2.056) | |
Constant | 625.500*** |
(1.393) | |
Observations | 1,274 |
R2 | 0.010 |
Adjusted R2 | 0.009 |
Residual Std. Error | 36.570 (df = 1272) |
F Statistic | 12.300*** (df = 1; 1272) |
Note: | p<0.1; p<0.05; p<0.01 |
\[ Y_{i}=\beta_{0}+\beta_{1}D_{\text{1}i}+\beta_{2}D_{\text{2}i}+\beta_{3}D_{\text{3}i}+\beta_{4}D_{\text{4}i}+\sum_{k=1}^{K-1}\gamma_{k}C_{ki} \]
\[ \alpha_{ATE(\text{x})}=E[Y_{1}-Y_{0}|X=\text{x}] \]
\[ \alpha_{ATT(\text{x})}=E[Y_{1}-Y_{0}|D=1, X=\text{x}] \]
\[\alpha_{x}=E[Y_{1}-Y_{0}|X = x]\]
\[ Y_{1},Y_{0}\mathrel{\unicode{x2AEB}} T|X \]
\[ Y = \beta_{0}+\beta_{1}\cdot D+\beta_{2}\cdot X+\beta_{3}\cdot DX \]
最小平方法迴歸模型如果有交互作用,可以估計CATE。這是因為自變數X與刺激變數D之間的交互作用,代表自變數X透過D的作用。當D=1時,代表得到刺激,D=0時,代表屬於控制組。如果D=1且X=x時,D的係數大於當D=0的係數,表示D的作用會隨著X=x而不同。
例如以下資料有X變數與刺激D,X的值是1, 2, 3,D則是0, 1。
# Simulate some data
set.seed(02138)
n <- 250
X <- sample(c(1:3), n, replace = T) # Covariate
D <- rbinom(n, 1, 0.55) # Treatment indicator
Y <- 1 + D * 2 + X * 3 + D * X * 2 + rnorm(n) # Outcome
DF = data.frame(Y=Y, Treatment=D, X=X)
\[\begin{align} E[Y_{1}-Y_{0}|X=1] =E[Y_{1}|X=1]-E[Y_{0}|X=1]\\ & = (\beta_{0}+\beta_{1}+\beta_{2}+\beta_{3})-(\beta_{0}+\beta_{2})\\ & = \beta_{1}+\beta_{3} \end{align}\]
R
估計有交互作用的迴歸模型如下:# Fit a linear model with interaction between treatment and covariate
res <- lm(Y ~ D * X, data = DF)
# Summarize the model
stargazer::stargazer(res, title = 'Interaction Model', type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
Dependent variable: | |
Y | |
D | 1.967*** |
(0.344) | |
X | 2.900*** |
(0.118) | |
D:X | 2.028*** |
(0.156) | |
Constant | 1.163*** |
(0.260) | |
Observations | 250 |
R2 | 0.955 |
Adjusted R2 | 0.954 |
Residual Std. Error | 1.001 (df = 246) |
F Statistic | 1,721.000*** (df = 3; 246) |
Note: | p<0.1; p<0.05; p<0.01 |
library(ggplot2)
interplot::interplot(res, var1 = "D", var2 = "X", val2 = 1:3,
, ercolor = "#EE00CC", esize = 1.5)+
labs(x = "Covariate (X)") +
geom_point(size = 2, color = "#2211CC") +
theme_bw()
Figure 5.1: Interacton Plot
hdcate
使用機器學習降低自變數的維度,然後估計LATE:
library(hdcate)
# get simulation data
n_obs <- 300 # Num of observations
n_var <- 50 # Num of observed variables
n_rel_var <- 4 # Num of relevant variables
data <- HDCATE.get_sim_data(n_obs, n_var, n_rel_var)
# conditional expectation model is misspecified
x_formula <- paste(paste0('X', c(2:n_var)), collapse ='+')
# for example, and alternatively, the propensity score model is misspecified
# x_formula <- paste(paste0('X', c(1:(n_var-1))), collapse ='+')
# Example 1: full-sample estimator
# create a new HDCATE model
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
# estimate HDCATE function, inference, and plot
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
HDCATE.fit(model)
HDCATE.inference(model)
HDCATE.plot(model)
例如:政府無法強迫所有長輩都來接受流感疫苗。為了觀察長輩接種疫苗的實驗效果,政府決定送1,000元超商禮券給來打疫苗的長輩。但是為了確認這個政策可行,政府抽出1,000位民眾,其中又分成400位通知來疫苗可以獲得禮券,600位只通知要來打疫苗。被分到實驗組的民眾可能為了禮券而來打疫苗,有些沒被分到實驗組的民眾則是無論有沒有禮券都會來打疫苗,也有些民眾被分到控制組就不會去打疫苗。LATE等於實驗組且有打疫苗跟控制組而沒來打疫苗的比例。
另一個例子是Gerber et al.(2010)隨機打電話給兩群民眾,一群是通知要做好回收,一群是通知要記得去投票。因為有些民眾經常在家而且會接電話,有些民眾則常常不在家,就算在家也不會接電話,如果假設接電話與否是隨機,雖然實務上可能不是,通知去投票而且有接電話以及通知做好回收而且接電話是compliers,這兩群人的投票率差異等於LATE。
\[ \text{LATE}=\frac{E[Y_{1}-Y_{0}|D = 1]}{\text{Pr}(D=1)} \]
如果我們可以假設刺激的效果是因為某些外在的因素而有差異,也就D會被外在的因素影響,造成接收到刺激的差異。我們必須先排除這些外在因素的影響,才能正確估計刺激的效果。因為分成兩個步驟,所以叫做兩階段(two-stage)迴歸模型。
第一階段: \[ D=\pi_{0}+\pi_{1}Z+\nu_{1} \]
第二階段:
\[ Y=\alpha_{0}+\alpha_{1}D+\nu_{2} \]
\[ Y=\gamma_{0}+\gamma_{1} Z +\nu_{3} \]
\[ \alpha_{1}=\frac{\gamma_{1}}{\pi_{1}}=\frac{\texttt{Cov}[Y,Z]}{\texttt{Cov}[D,Z]} \]
實驗組中接收到刺激以及控制組中沒接收到刺激的個案,稱為complier。這兩群人之間的平均效果差異,稱為local average treatment effect (LATE),也稱為complier average causal effect(CACE)。
加入工具變數的迴歸模型可以估計LATE。也就是工具變數(instrumental variable, IV)解釋刺激D以及某一個自變數,但是與最後的表現Y無關。工具變數對於自變數的影響代表對於刺激的接受程度。
經由\(\texttt{ivreg}\)函數,可以估計刺激這一個變數的係數,代表LATE。
# Load necessary packages
#install.packages("AER") # Install AER package if not already installed
library(AER)
# Generate example data
set.seed(123)
n <- 1000 # Sample size
Z <- rnorm(n) # Instrumental variable
X <- rnorm(n) # Exogenous variable
D <- rbinom(n, 1, plogis(0.5 + Z + X)) # Endogenous variable (binary treatment)
Y <- 1 + D * 2 + X * 3 + rnorm(n) # Outcome variable
# Create a data frame
data <- data.frame(Y = Y, D = D, Z = Z, X = X)
# Estimate LATE using IV regression (2SLS)
iv_model <- ivreg(Y ~ D | Z + X, data = data)
# Summarize the model
stargazer::stargazer(iv_model, title = 'Instrumental Variable Model', type=ifelse(knitr::is_latex_output(),"latex","html"),
label=knitr::opts_current$get("label"))
Dependent variable: | |
Y | |
D | 11.500*** |
(0.581) | |
Constant | -4.745*** |
(0.387) | |
Observations | 1,000 |
R2 | -0.507 |
Adjusted R2 | -0.509 |
Residual Std. Error | 4.550 (df = 998) |
Note: | p<0.1; p<0.05; p<0.01 |
# Extract LATE coefficient
LATE <- coef(iv_model)["D"]
# Print the LATE estimate
print(paste("Estimated LATE:", round(LATE, 3)))
[1] “Estimated LATE: 11.503”
#sample gender and interest in study
star<-star %>% mutate(gender=sample(0: 1, 1274, prob = c(0.535, 0.465), replace = T)) %>%
mutate(interest=sample(1:5, 1274, prob = c(0.21,0.17,0.11,0.21,0.3),replace = T))
treat<-subset(star, star$D == 'small')
cont<-subset(star, star$D == 'regular')
roundfnc <- function(x, na.rm = F) round(x, 3)
treat.d <- star %>% select(gender, interest) %>%
dplyr::mutate_if(is.factor, as.numeric) %>%
summarise(across(everything(), list(M=mean, S=sd, max=max, min=min), na.rm=TRUE))
cont.d <- star %>% select(gender, interest) %>%
dplyr::mutate_if(is.factor, as.numeric) %>%
summarise(across(everything(), list(M=mean, S=sd, max=max, min=min), na.rm=TRUE))
treat.d<-roundfnc(treat.d); cont.d<-roundfnc(cont.d)
#png('~/C/Fig/balancestar.png', width=1500, height = 1200, res=200)
par(mfrow=c(1,2))
plot(density(treat$gender, na.rm=TRUE), xlim=c(-0.5,1.5), ylim=c(0,1.6), xlab="Gender", main='')
lines(density(cont$gender, na.rm=TRUE), col="RED")
plot(density(treat$interest, na.rm=TRUE), xlim=c(1,5), ylim=c(0,0.7), xlab="Interest", main='')
lines(density(cont$interest, na.rm=TRUE), col="RED")
估計式除了無偏估計這一個標準外,估計值之間的離散程度應該越小越好。
樣本平均值的差異用\(\alpha=\bar{Y_{1}}-\bar{Y_{0}}\)。該估計值的變異數為\(\text{Var}(\hat{\alpha})=\frac{N}{N-1}\Bigl(\frac{\sigma^2_{Y1}}{n1}+\frac{\sigma^2_{Y0}}{n0}\Bigr)\)
Huber-White estimator如下,但是這不是無偏估計: \[ \hat{V}_{hw}=\frac{\frac{1}{n_{1}}\sum_{i}(Y_{i}-\bar{Y_{1}})^2}{n_{1}}+\frac{\frac{1}{n_{0}}\sum_{i}(Y_{i}-\bar{Y_{0}})^2}{n_{0}} \]
無偏估計應該是下面這個估計式:
\[ \hat{V}_{HC2}=\frac{\frac{1}{n_{1}-1}\sum_{i}(Y_{i}-\bar{Y_{1}})^2}{n_{1}}+\frac{\frac{1}{n_{0}-1}\sum_{i}(Y_{i}-\bar{Y_{0}})^2}{n_{0}} \]
\[\frac{1}{n_{1}-1}\sum_{i}(Y_{i}-\bar{Y_{1}})^2=\sigma_{1}^2 \]
\[\frac{1}{n_{0}-1}\sum_{i}(Y_{i}-\bar{Y_{0}})^2=\sigma_{0}^2\]
sandwich
這個套件裡面的\(\texttt{vcovHC}\)函數,計算ATE的變異數。# Load necessary library
library(sandwich)
# Estimate the variance of ATE using vcovHC
cov_matrix <- vcovHC(lm(reading ~ D, data = star), type = "HC3")
var_ATE <- cov_matrix[2, 2] # Variance of the coefficient for 'group'
# Print the results
#print(paste("Average Treatment Effect (ATE):", round(ATE, 2)))
print(paste("Variance of ATE:", round(var_ATE, 2)))
[1] “Variance of ATE: 4.26”
v.small <- var(star$reading[star$D == 'small']); v.small
## [1] 1396
v.regular <- var(star$reading[star$D == 'regular']); v.regular
## [1] 1287
n.group<- star %>% group_by (D) %>%
summarise(n = n())
n.small<-as.numeric(n.group[2,2]); n.regular<-as.numeric(n.group[1,2])
PoolSE<-sqrt((v.small/n.small)+(v.regular/n.regular))
print(paste("標準誤:", round(PoolSE, 2)))
## [1] "標準誤: 2.06"
\[ \text{SE}(\widehat{ATE})=\sqrt{\frac{1}{N-1} \Bigl\{ \frac{m\times\text{Var_Y0}}{N-m}+\frac{(N-m)\text{Var_Y1}}{m}+2\text{Cov(Y0,Y1)} \Bigr\}} \]
R
語法寫成如下。N <-nrow(star)
m <- n.small #N of treat group
avg.small <- mean(star$reading[star$D=='small'])
avg.regular <- mean(star$reading[star$D=='regular'])
x.small <- star$reading[star$D=='small'] #treated
x.regular <- star$reading[star$D=='regular']
var_Y0 <-sum((x.regular - avg.regular)^2)/(N-m);
var_Y1 <- sum((x.small - avg.small)^2)/(m)
S <- (m * var_Y0 / (N - m)) + (((N - m) * var_Y1) / m) + 2 * (sum((x.small - avg.small) * (x.regular - avg.regular)) / N)
hat.ATE.var <- sqrt((1 / (N - 1)) * S)
print(paste("聯合標準誤:", round(hat.ATE.var, 2)))
library(sandwich)
# Generate example data
set.seed(123) # for reproducibility
n <- 1000 # total sample size
n_treated <- 500 # sample size of treated group
n_control <- n - n_treated # sample size of control group
treated <- rnorm(n_treated, mean = 70, sd = 10) # simulate test scores for treated group
control <- rnorm(n_control, mean = 65, sd = 10) # simulate test scores for control group
group <- c(rep("Treated", n_treated), rep("Control", n_control))
data <- data.frame(group = group, score = c(treated, control))
# Calculate average outcomes for treated and control groups
avg_treated <- mean(data$score[data$group == "Treated"])
avg_control <- mean(data$score[data$group == "Control"])
# Calculate the ATE
ATE <- avg_treated - avg_control
# Estimate the variance of ATE using vcovHC
cov_matrix <- vcovHC(lm(score ~ group, data = data), type = "HC3")
var_ATE <- cov_matrix[2, 2] # Variance of the coefficient for 'group'
# Print the results
print(paste("Average Treatment Effect (ATE):", round(ATE, 2)))
## [1] "Average Treatment Effect (ATE): 5.37"
print(paste("Variance of ATE:", round(var_ATE, 2)))
## [1] "Variance of ATE: 0.39"
print(paste("Standard deviation of ATE:", round(sqrt(var_ATE), 2)))
## [1] "Standard deviation of ATE: 0.63"
N <- 1000
m <- n_treated #N of treat group, 50
avg.treated <- avg_treated
avg.control <- avg_control
x.treated <- treated #treated
x.control <- control
var_Y0 <- sum((x.control - avg.control)^2)/(N-m);
var_Y1 <- sum((x.treated - avg.treated)^2)/m;
S <- (m * var_Y0 / (N - m)) + (((N - m) * var_Y1) / m) + 2 * (sum((x.control - avg.control) * (x.treated - avg.treated)) / N)
hat.ATE.var <- sqrt((1 / (N - 1)) * S)
print(paste("標準差:", round(hat.ATE.var, 2)))
## [1] "標準差: 0.44"
library(gtsummary)
star %>% tbl_summary(
by = D,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"
),
digits = all_continuous() ~ 2) %>%
add_p(pvalue_fun = ~ style_pvalue(.x, digits = 2)) %>%
add_overall() %>%
add_n() %>%
modify_header(label ~ "**Variable**")
Variable | N | Overall, N = 1,2741 | regular, N = 6891 | small, N = 5851 | p-value2 |
---|---|---|---|---|---|
classtype | 1,274 | <0.001 | |||
regular | 689 / 1,274 (54%) | 689 / 689 (100%) | 0 / 585 (0%) | ||
small | 585 / 1,274 (46%) | 0 / 689 (0%) | 585 / 585 (100%) | ||
reading | 1,274 | 628.80 (36.73) | 625.49 (35.88) | 632.70 (37.37) | <0.001 |
math | 1,274 | 631.59 (38.84) | 628.84 (37.94) | 634.83 (39.66) | 0.013 |
graduated | 1,274 | 1,108 / 1,274 (87%) | 597 / 689 (87%) | 511 / 585 (87%) | 0.71 |
gender | 1,274 | 586 / 1,274 (46%) | 307 / 689 (45%) | 279 / 585 (48%) | 0.26 |
interest | 1,274 | 0.91 | |||
1 | 256 / 1,274 (20%) | 134 / 689 (19%) | 122 / 585 (21%) | ||
2 | 214 / 1,274 (17%) | 114 / 689 (17%) | 100 / 585 (17%) | ||
3 | 116 / 1,274 (9.1%) | 64 / 689 (9.3%) | 52 / 585 (8.9%) | ||
4 | 297 / 1,274 (23%) | 167 / 689 (24%) | 130 / 585 (22%) | ||
5 | 391 / 1,274 (31%) | 210 / 689 (30%) | 181 / 585 (31%) | ||
1 n / N (%); Mean (SD) | |||||
2 Pearson’s Chi-squared test; Wilcoxon rank sum test |
\[ SE_{p}=S_{p}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}} \]
R
計算如下:
v.small <- var(star$reading[star$D == 'small']); v.small
## [1] 1396
v.regular <- var(star$reading[star$D == 'regular']); v.regular
## [1] 1287
n.group<- star %>% group_by (D) %>%
summarise(n = n())
n.small<-n.group[2,2]; n.regular<-n.group[1,2]
p1 <- (((n.small-1)*v.small)+((n.regular-1)*v.regular))
PoolSD<-sqrt(p1/(n.small+n.regular-2));
PoolSE <- PoolSD*sqrt((1/n.small)+(1/n.regular));
print(paste("聯合標準誤:", round(PoolSE, 2)))
## [1] "聯合標準誤: 2.06"
library(dplyr)
# Simulate data
set.seed(02138)
N <- 200 # Total students
df <- data.frame(
ID = 1:N,
prior_gpa = rnorm(N, mean = 3.0, sd = 0.5), # Prior performance
treatment = sample(c(0,1), N, replace = TRUE), # Randomly assigned treatment (1 = new method, 0 = traditional)
motivation = rnorm(N, mean = 50, sd = 10) # Random motivation scores
)
# Generate final scores based on treatment effect
df$final_score <- with(df, 75 + 5*treatment + 2*prior_gpa + 0.5*motivation + rnorm(N, mean=0, sd=5))
# Simulate observational study (self-selection)
df_obs <- df %>%
mutate(treatment = ifelse(prior_gpa > 3.2 & motivation > 50, 1, sample(c(0,1), N, replace = TRUE, prob = c(0.5, 0.5))))
# Estimate propensity scores (probability of receiving treatment)
ps_model <- glm(treatment ~ prior_gpa + motivation, data = df_obs, family = binomial)
df_obs$pscore <- predict(ps_model, type = "response")
MatchIt
套件,進行比對:# Load libraries for causal inference
library(MatchIt)
# Perform propensity score matching
matched <- matchit(treatment ~ prior_gpa + motivation, data = df_obs, method = "nearest", distance = "logit")
# Extract matched dataset
df_matched <- match.data(matched)
# Compare final scores in matched dataset
t.test(final_score ~ treatment, data = df_matched)
##
## Welch Two Sample t-test
##
## data: final_score by treatment
## t = -4.1, df = 164, p-value = 6e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -6.918 -2.431
## sample estimates:
## mean in group 0 mean in group 1
## 106.3 110.9
Dependent variable: | ||
final_score | ||
(1) | (2) | |
treatment | 4.674*** | 0.172 |
(1.136) | (1.013) | |
prior_gpa | 2.288** | |
(0.994) | ||
motivation | 0.539*** | |
(0.048) | ||
Constant | 106.300*** | 73.920*** |
(0.803) | (4.078) | |
Observations | 166 | 166 |
R2 | 0.094 | 0.495 |
Adjusted R2 | 0.088 | 0.486 |
Residual Std. Error | 7.320 (df = 164) | 5.498 (df = 162) |
F Statistic | 16.920*** (df = 1; 164) | 52.910*** (df = 3; 162) |
Note: | p<0.1; p<0.05; p<0.01 |
(1) 請以圖形表示受試組與實驗組的次數分佈。
(2) 請問這兩組受訪者的贊成分數的平均數分別是多少?
(3) 請問這兩組受訪者的贊成分數的平均數差異為何?
(4) 請問聯合標準誤為何?
# 1. Simulate Data (Replace with your actual data)
set.seed(02138) # For reproducibility
n <- 1000 # Sample size
age <- sample(20: 70, n, replace = TRUE)
education <- sample(1:4, n, replace = TRUE) # 1=low, 4=high
motivation <- rnorm(n, 0, 1) # Higher values indicate higher motivation
D <- rbinom(n, 1, prob = 0.3 + 0.2 * (education > 2) + 0.1 * motivation) # Treatment assignment depends on covariates
D[is.na(D)]<-1
income <- 20000 + 5000 * education + 3000 * motivation + 8000 * D + rnorm(n, 0, 10000)
df_obs <- data.frame(age, education, motivation, D, income)
最後更新時間: 2025-03-14 22:00:42