課題: Y= working_hour, D= gender, X= age, expとしてDのYについての因果効果の異質性を推定せよ。

Load library

pacman::p_load("tidyverse", "skimr", "SuperLearner", "grf", "ranger", "estimatr")

Import data

授業で提示されたデータをダウンロードする。

data_raw <- read_csv("~/R/Musashi_University/2021_second_semester/keiryoukaizaigaku/Data/data.csv")

Tidy data

変数genderをダミー変数にするため、Female1Male0とする。結果変数としてworking_hour、説明変数としてgenderageexpを選択する。

data <- data_raw %>% 
  mutate(gender = if_else(gender == "Female", 1, 0)) %>% 
  select(working_hour, gender, age, exp)

データの欠損値、そして基本統計量を確認する。今回のデータに欠損値はなく、tidyなデータであることがわかる。データの偏りについては、女性よりも男性のほうが多く、30代から50代にかけてのデータが多いことがわかる。仕事の経験年数に関して言うと、年齢の平均と経験年数の差が28.3であり、多くの人が社会に出ると考えられる20代前半よりいくらか高いため、全体として年齢層が高い人が抽出されている、もしくは終身雇用がすべてのケースに当てはまらず、転職をしている可能性が推測される。

# check
skim(data)
Data summary
Name data
Number of rows 2129
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
working_hour 0 1 47.01 9.39 35 40 45 50 108 ▇▃▁▁▁
gender 0 1 0.33 0.47 0 0 0 1 1 ▇▁▁▁▅
age 0 1 41.20 11.48 18 32 40 50 65 ▃▇▇▆▃
exp 0 1 12.86 11.02 0 4 10 20 49 ▇▃▃▂▁

Set variables

変数を設定する。今回は、Yを被説明変数、Dを目的変数、Xをコントロール変数として定義した。それぞれには、週の労働時間、性別、年齢、経験年数を割り当てている。

Y <- data$working_hour # outcome var

D <- data$gender # objective var

X <- select(data, age, exp) # other variables

Visualization

被説明変数と目的変数のモデルを検証する前に、変数間の関係性を図示する。

# histogram of working hour
data %>% 
  mutate(gender = if_else(gender == 0, "Male", "Female")) %>% 
  ggplot(aes(working_hour, fill = gender)) + 
  geom_histogram() +
  scale_x_continuous(breaks = seq(35, 120, by = 5),
                     expand = c(0, 0)) +
  facet_wrap(~gender, ncol = 1) +
  labs(
    title = "The histogram of working hours per week",
    subtitle = str_wrap("There is not significant difference in the distribution of weeky working hours between femal and male workers."),
    x = "Working Hour per week",
    y = "Count",
  ) +
  theme_bw()

# scatter plot for working hour and age
data %>% 
  mutate(gender = if_else(gender == 0, "Male", "Female")) %>% 
  ggplot(aes(age, working_hour, color = gender)) + 
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", color = "gray70") +
  scale_x_continuous(breaks = seq(15, 65, by = 5),
                     expand = c(0.05, 0.05)) +
  facet_wrap(~gender, ncol = 1) +
  labs(
    title = "The scatter plot on the relation between age and working duration per week",
    subtitle = str_wrap("For both female and male workes, there are slightly negative relations between age and weekly working hours. People tend to work less as they become older."),
    x = "Age",
    y = "Working Hour per week",
  ) +
  theme_bw()

# scatter plot for working hour and exp
data %>% 
  mutate(gender = if_else(gender == 0, "Male", "Female")) %>% 
  ggplot(aes(exp, working_hour, color = gender)) + 
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", color = "gray70") + 
  facet_wrap(~gender, ncol = 1) +
  labs(
    title = "The scatter plot on between experience and working duration per week",
    subtitle = str_wrap("We can observe a decrease in weekly working hours for male workers as they get more experienced, but it is not the case for female workers, whick seems to have a zero degree of slop between two variables."),
    x = "Experience",
    y = "Working Hour per week",
  ) +
  theme_bw()

Modeling

CV.SuperLeanerを用いて、性別差の労働時間に対する影響を推定していく。推定に用いる条件は、データの分割回数は10、アルゴリズムはOLS、Random Forest, LASSOとする。

# how many of divisions?
v <- 10

# used libraries
SL.libraries <-  c("SL.lm", # OLS
                   "SL.ranger", # Random Forest
                   "SL.glmnet")  # LASSO

# Fitting
fits <- list()

fits$fit.Y <- CV.SuperLearner(Y = Y,
                              X = X,
                              SL.library = SL.libraries,
                              V = v # how many of divisions?
)

fits$fit.D <- CV.SuperLearner(Y = D,
                              X = X,
                              SL.library = SL.libraries,
                              V = v # how many of divisions?
)

# show
result <- lapply(fits, summary)
result
## $fit.Y
## 
## Call:  CV.SuperLearner(Y = Y, X = X, V = v, SL.library = SL.libraries) 
## 
## Risk is based on: Mean Squared Error
## 
## All risk estimates are based on V =  10 
## 
##      Algorithm    Ave     se    Min    Max
##  Super Learner 88.113 4.9677 68.295 110.01
##    Discrete SL 87.905 4.9624 67.313 110.30
##      SL.lm_All 87.938 4.9641 67.313 110.30
##  SL.ranger_All 94.698 4.9498 76.175 114.46
##  SL.glmnet_All 87.933 4.9665 67.296 110.47
## 
## $fit.D
## 
## Call:  CV.SuperLearner(Y = D, X = X, V = v, SL.library = SL.libraries) 
## 
## Risk is based on: Mean Squared Error
## 
## All risk estimates are based on V =  10 
## 
##      Algorithm     Ave        se     Min     Max
##  Super Learner 0.21339 0.0036246 0.19250 0.23254
##    Discrete SL 0.21615 0.0035169 0.19637 0.23270
##      SL.lm_All 0.21608 0.0035356 0.19489 0.23270
##  SL.ranger_All 0.22242 0.0045351 0.20250 0.25092
##  SL.glmnet_All 0.21586 0.0035058 0.19637 0.23093

結果として、結果変数のMSEが最も低いモデルははOLS(MSE = 87.9378818)で、目的変数genderのMSEが最も低いパラメーターを計算したモデルは、Random ForestとLASSOであった(それぞれ共に、0.2161457)。

Estimation (Parameter Estimation)

週の労働時間に対するgenderのパラメーター\(\hat\tau\)causal_forestをつかって推定していく。

# rename the dataset
data_result <- data

# 予測誤差
Y.oht <- fits$fit.Y$SL.predict
D.oht <- fits$fit.D$SL.predict

X.matrix <- model.matrix(~ 0 + .,
                         data = X)

# prediction for tau of D
fit <- causal_forest(Y = Y, # outcome var: working hour
                     W = D, # explanatory var: gender
                     X = X.matrix, # control var: age, exp
                     Y.hat = Y.oht, # computated value of outcome var
                     W.hat = D.oht) # computated value of explanatory var

pred <- predict(fit,
                estimate.variance = T) # show the estimate variance

Visualization

上記で推定した週の労働時間に対するgenderのパラメーター\(\hat\tau\)を95%信頼区間を使い図示する。

# the new dataset
data_result <- data %>% 
  mutate(tau = pred$predictions, # tau の予測値を保存する
         tau.var = pred$variance.estimates # 予測値の誤差
  ) %>% 
  arrange(tau) %>% # order based on 
  mutate(id = c(1:nrow(data)), # for ggplot2 visualization
         se = sqrt(tau.var), # for computing CI
         ci.low = tau - 1.96*se,
         ci.high = tau + 1.96*se)

# histogram of the predicted value for tau
fit$predictions %>% 
  as_data_frame() %>% 
  `names<-`("tau") %>%  
  ggplot(aes(tau)) + 
  geom_histogram() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0.01, 0)) +
  labs(
    title = "The histogram of the predicted value for tau",
    x = "The Value of tau"
  ) +
  theme_bw()

# point range plot 
ggplot(data_result, 
       aes(id, tau,
           ymin = ci.low,
           ymax = ci.high)
  ) +
  geom_pointrange(alpha = 0.1) +
  theme_bw() +
  geom_hline(yintercept = 0) +
  labs(
    title = "The point range plot with the parameter of tau"
  ) +
  theme_bw()

Linear prediction

causal_forestを使い\(\hat \tau\)の推定をしたが、ほかのコントロール変数が複雑に絡み合っているため因果関係を推定することが難しい。だから、次はOLSを回し、変数genderの労働時間への貢献度を推定する。その際、centeringを行うことで、切片がない、以下のモデルの推定する。

\[ \hat Y = \hat \beta_1 + \hat \beta_2 X_1 + \hat \beta_3 X_2 + \hat \beta_4 X_3 \]

where  \(X_1\)= gender, \(X_2\)= exp, \(X_3\)= age

Since the intercept is equal to zero, meaning \(\hat \beta_1 = 0\),

\[ \hat Y = \hat \beta_2 X_1 + \hat \beta_3 X_1 + \hat \beta_4 X_1 \]

# for linear projection
Z <- 
  data %>% # the original data
  select(gender, age, exp) %>% 
  mutate(gender = gender - mean(gender), # demean "gender" (from 1 to 0)
         age = age - mean(age),
         exp = exp - mean(exp))

# OLS
best_linear_projection(fit, Z)
## 
## Best linear projection of the conditional average treatment effect.
## Confidence intervals are cluster- and heteroskedasticity-robust (HC3):
## 
##               Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) -4.5993643  0.4203475 -10.9418 < 2.2e-16 ***
## gender       0.4564556  1.1519190   0.3963  0.691955    
## age         -0.0040605  0.0439702  -0.0923  0.926431    
## exp          0.1577138  0.0525245   3.0027  0.002707 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

このテーブルの結果から、平均的に女性であるとであると週に約0.29時間(=17.4分)傾向があることがわかる。P値が\(0.779 \approx 0.78 \leq 0.05\)と非常に高いため、genderの係数である\(\beta_2\)が0であるという帰無仮説を95%信頼区間で棄却しそうになるが、今回はworking_hourが正規分布に従っていないため、判断できない。以下のヒストグラムから、右に歪んだ(right-skewed)の分布の形であることがわかる。

g_0 <- data %>% 
  ggplot(aes(working_hour)) +
  geom_histogram() +
  labs(
    title = "The histogram of the weekly working hours",
    subtitle = str_wrap("We can observe this distribution is right-skewed")) +
  theme_bw()

g_1 <- data %>% 
  mutate(gender = if_else(gender == 1, "Female", "Male")) %>% 
  ggplot(aes(gender, working_hour)) +
  geom_violin() +
  labs(
    title = "The distributio of the weekly working hours by gender",
    subtitle = str_wrap("Male seems to have move variance than female workers.")
  ) +
  theme_bw() +
  coord_flip()
  

library(patchwork)

g_0 / g_1

AIPW

最後に、AIPWを使い、個々人のパラメータを推定するのは困難なため、サンプル全体の平均の推定値を推定する。

# AIPWの推定 (データ全体としての予測の正確性についての検証)
average_treatment_effect(fit,
                         target.sample = "overlap")
##  estimate   std.err 
## -4.868205  0.412699
beepr::beep(1)