課題: Y= working_hour, D= gender, X= age, expとしてDのYについての因果効果の異質性を推定せよ。
授業で提示されたデータをダウンロードする。
変数genderをダミー変数にするため、Femaleを1、Maleを0とする。結果変数としてworking_hour、説明変数としてgender、age、expを選択する。
data <- data_raw %>%
mutate(gender = if_else(gender == "Female", 1, 0)) %>%
select(working_hour, gender, age, exp)データの欠損値、そして基本統計量を確認する。今回のデータに欠損値はなく、tidyなデータであることがわかる。データの偏りについては、女性よりも男性のほうが多く、30代から50代にかけてのデータが多いことがわかる。仕事の経験年数に関して言うと、年齢の平均と経験年数の差が28.3であり、多くの人が社会に出ると考えられる20代前半よりいくらか高いため、全体として年齢層が高い人が抽出されている、もしくは終身雇用がすべてのケースに当てはまらず、転職をしている可能性が推測される。
| Name | data |
| Number of rows | 2129 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| working_hour | 0 | 1 | 47.01 | 9.39 | 35 | 40 | 45 | 50 | 108 | ▇▃▁▁▁ |
| gender | 0 | 1 | 0.33 | 0.47 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▅ |
| age | 0 | 1 | 41.20 | 11.48 | 18 | 32 | 40 | 50 | 65 | ▃▇▇▆▃ |
| exp | 0 | 1 | 12.86 | 11.02 | 0 | 4 | 10 | 20 | 49 | ▇▃▃▂▁ |
変数を設定する。今回は、Yを被説明変数、Dを目的変数、Xをコントロール変数として定義した。それぞれには、週の労働時間、性別、年齢、経験年数を割り当てている。
被説明変数と目的変数のモデルを検証する前に、変数間の関係性を図示する。
# histogram of working hour
data %>%
mutate(gender = if_else(gender == 0, "Male", "Female")) %>%
ggplot(aes(working_hour, fill = gender)) +
geom_histogram() +
scale_x_continuous(breaks = seq(35, 120, by = 5),
expand = c(0, 0)) +
facet_wrap(~gender, ncol = 1) +
labs(
title = "The histogram of working hours per week",
subtitle = str_wrap("There is not significant difference in the distribution of weeky working hours between femal and male workers."),
x = "Working Hour per week",
y = "Count",
) +
theme_bw()# scatter plot for working hour and age
data %>%
mutate(gender = if_else(gender == 0, "Male", "Female")) %>%
ggplot(aes(age, working_hour, color = gender)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", color = "gray70") +
scale_x_continuous(breaks = seq(15, 65, by = 5),
expand = c(0.05, 0.05)) +
facet_wrap(~gender, ncol = 1) +
labs(
title = "The scatter plot on the relation between age and working duration per week",
subtitle = str_wrap("For both female and male workes, there are slightly negative relations between age and weekly working hours. People tend to work less as they become older."),
x = "Age",
y = "Working Hour per week",
) +
theme_bw()# scatter plot for working hour and exp
data %>%
mutate(gender = if_else(gender == 0, "Male", "Female")) %>%
ggplot(aes(exp, working_hour, color = gender)) +
geom_point(alpha = 0.25) +
geom_smooth(method = "lm", color = "gray70") +
facet_wrap(~gender, ncol = 1) +
labs(
title = "The scatter plot on between experience and working duration per week",
subtitle = str_wrap("We can observe a decrease in weekly working hours for male workers as they get more experienced, but it is not the case for female workers, whick seems to have a zero degree of slop between two variables."),
x = "Experience",
y = "Working Hour per week",
) +
theme_bw()CV.SuperLeanerを用いて、性別差の労働時間に対する影響を推定していく。推定に用いる条件は、データの分割回数は10、アルゴリズムはOLS、Random Forest, LASSOとする。
# how many of divisions?
v <- 10
# used libraries
SL.libraries <- c("SL.lm", # OLS
"SL.ranger", # Random Forest
"SL.glmnet") # LASSO
# Fitting
fits <- list()
fits$fit.Y <- CV.SuperLearner(Y = Y,
X = X,
SL.library = SL.libraries,
V = v # how many of divisions?
)
fits$fit.D <- CV.SuperLearner(Y = D,
X = X,
SL.library = SL.libraries,
V = v # how many of divisions?
)
# show
result <- lapply(fits, summary)
result## $fit.Y
##
## Call: CV.SuperLearner(Y = Y, X = X, V = v, SL.library = SL.libraries)
##
## Risk is based on: Mean Squared Error
##
## All risk estimates are based on V = 10
##
## Algorithm Ave se Min Max
## Super Learner 88.113 4.9677 68.295 110.01
## Discrete SL 87.905 4.9624 67.313 110.30
## SL.lm_All 87.938 4.9641 67.313 110.30
## SL.ranger_All 94.698 4.9498 76.175 114.46
## SL.glmnet_All 87.933 4.9665 67.296 110.47
##
## $fit.D
##
## Call: CV.SuperLearner(Y = D, X = X, V = v, SL.library = SL.libraries)
##
## Risk is based on: Mean Squared Error
##
## All risk estimates are based on V = 10
##
## Algorithm Ave se Min Max
## Super Learner 0.21339 0.0036246 0.19250 0.23254
## Discrete SL 0.21615 0.0035169 0.19637 0.23270
## SL.lm_All 0.21608 0.0035356 0.19489 0.23270
## SL.ranger_All 0.22242 0.0045351 0.20250 0.25092
## SL.glmnet_All 0.21586 0.0035058 0.19637 0.23093
結果として、結果変数のMSEが最も低いモデルははOLS(MSE = 87.9378818)で、目的変数genderのMSEが最も低いパラメーターを計算したモデルは、Random ForestとLASSOであった(それぞれ共に、0.2161457)。
週の労働時間に対するgenderのパラメーター\(\hat\tau\)をcausal_forestをつかって推定していく。
# rename the dataset
data_result <- data
# 予測誤差
Y.oht <- fits$fit.Y$SL.predict
D.oht <- fits$fit.D$SL.predict
X.matrix <- model.matrix(~ 0 + .,
data = X)
# prediction for tau of D
fit <- causal_forest(Y = Y, # outcome var: working hour
W = D, # explanatory var: gender
X = X.matrix, # control var: age, exp
Y.hat = Y.oht, # computated value of outcome var
W.hat = D.oht) # computated value of explanatory var
pred <- predict(fit,
estimate.variance = T) # show the estimate variance上記で推定した週の労働時間に対するgenderのパラメーター\(\hat\tau\)を95%信頼区間を使い図示する。
# the new dataset
data_result <- data %>%
mutate(tau = pred$predictions, # tau の予測値を保存する
tau.var = pred$variance.estimates # 予測値の誤差
) %>%
arrange(tau) %>% # order based on
mutate(id = c(1:nrow(data)), # for ggplot2 visualization
se = sqrt(tau.var), # for computing CI
ci.low = tau - 1.96*se,
ci.high = tau + 1.96*se)
# histogram of the predicted value for tau
fit$predictions %>%
as_data_frame() %>%
`names<-`("tau") %>%
ggplot(aes(tau)) +
geom_histogram() +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0.01, 0)) +
labs(
title = "The histogram of the predicted value for tau",
x = "The Value of tau"
) +
theme_bw()# point range plot
ggplot(data_result,
aes(id, tau,
ymin = ci.low,
ymax = ci.high)
) +
geom_pointrange(alpha = 0.1) +
theme_bw() +
geom_hline(yintercept = 0) +
labs(
title = "The point range plot with the parameter of tau"
) +
theme_bw()causal_forestを使い\(\hat \tau\)の推定をしたが、ほかのコントロール変数が複雑に絡み合っているため因果関係を推定することが難しい。だから、次はOLSを回し、変数genderの労働時間への貢献度を推定する。その際、centeringを行うことで、切片がない、以下のモデルの推定する。
\[ \hat Y = \hat \beta_1 + \hat \beta_2 X_1 + \hat \beta_3 X_2 + \hat \beta_4 X_3 \]
where \(X_1\)= gender, \(X_2\)= exp, \(X_3\)= age
Since the intercept is equal to zero, meaning \(\hat \beta_1 = 0\),
\[ \hat Y = \hat \beta_2 X_1 + \hat \beta_3 X_1 + \hat \beta_4 X_1 \]
# for linear projection
Z <-
data %>% # the original data
select(gender, age, exp) %>%
mutate(gender = gender - mean(gender), # demean "gender" (from 1 to 0)
age = age - mean(age),
exp = exp - mean(exp))
# OLS
best_linear_projection(fit, Z)##
## Best linear projection of the conditional average treatment effect.
## Confidence intervals are cluster- and heteroskedasticity-robust (HC3):
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.5993643 0.4203475 -10.9418 < 2.2e-16 ***
## gender 0.4564556 1.1519190 0.3963 0.691955
## age -0.0040605 0.0439702 -0.0923 0.926431
## exp 0.1577138 0.0525245 3.0027 0.002707 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
このテーブルの結果から、平均的に女性であるとであると週に約0.29時間(=17.4分)傾向があることがわかる。P値が\(0.779 \approx 0.78 \leq 0.05\)と非常に高いため、genderの係数である\(\beta_2\)が0であるという帰無仮説を95%信頼区間で棄却しそうになるが、今回はworking_hourが正規分布に従っていないため、判断できない。以下のヒストグラムから、右に歪んだ(right-skewed)の分布の形であることがわかる。
g_0 <- data %>%
ggplot(aes(working_hour)) +
geom_histogram() +
labs(
title = "The histogram of the weekly working hours",
subtitle = str_wrap("We can observe this distribution is right-skewed")) +
theme_bw()
g_1 <- data %>%
mutate(gender = if_else(gender == 1, "Female", "Male")) %>%
ggplot(aes(gender, working_hour)) +
geom_violin() +
labs(
title = "The distributio of the weekly working hours by gender",
subtitle = str_wrap("Male seems to have move variance than female workers.")
) +
theme_bw() +
coord_flip()
library(patchwork)
g_0 / g_1