Introduction

Last time I tried a t test with a Bayesian approach. This time it is regression analysis. Here, regression analysis means ordinary regression in which the parameters are obtained by least squares under the assumption that the errors follow a normal distribution.

Regression analysis

Load the packages we will use.

# get packages
library(tidyverse)
## ─ Attaching packages ──── tidyverse 1.2.1 ─
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ─ Conflicts ────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.19.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(ggmcmc)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
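
As the rstan startup message above suggests, it can help to run the chains in parallel and to cache compiled Stan models. These two optional settings (not used in the session shown here) do exactly that:

# optional settings recommended by the rstan startup message
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)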

Generate the data as follows.

# making data
set.seed(1234)
n <- 1000
X1 <- rnorm(n = n, mean = 50, sd = 10)
X2 <- runif(n = n, min = 0, max = 1) %>% round(0)
X3 <- runif(n = n, min = 1, max = 4) %>% round(0)
Y <- 0.1 * X1 + 0.4 * X2 + 0.2 * X3 + rnorm(n = n, mean = 0, sd = 2) 
df <- tibble(X1 = X1, X2 = X2, X3 = X3, Y = Y)

First, run an ordinary regression.

# ols regression
res <- lm(Y ~ X1 + X2 + X3, data = df)
summary(res)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0446 -1.4074  0.0756  1.3956  5.9744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.199244   0.369042   0.540  0.58939    
## X1          0.103335   0.006423  16.087  < 2e-16 ***
## X2          0.343459   0.128101   2.681  0.00746 ** 
## X3          0.088088   0.066696   1.321  0.18689    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.024 on 996 degrees of freedom
## Multiple R-squared:  0.2123, Adjusted R-squared:  0.2099 
## F-statistic: 89.47 on 3 and 996 DF,  p-value: < 2.2e-16
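
For later comparison with the posterior summaries, the OLS point estimates can also be pulled out on their own (an optional extra step):

# OLS point estimates only, for comparison with the posterior means below
coef(res)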

Next, the Bayesian approach. First, assume that the observed data \(Y\), \(X1\), \(X2\), and \(X3\) are related as follows.

\[ Y_i = \beta_0 + \beta_1 \times X1_i + \beta_2 \times X2_i + \beta_3 \times X3_i + error_i \quad (i = 1, 2, \ldots, N) \] Here \(i\) indexes individuals. The error term \(error\) is assumed to follow a normal distribution with mean 0 and standard deviation \(\sigma\):

\[ error_i \sim Normal(0, \sigma) \] From this, \(Y\) can be regarded as following a normal distribution:

\[ Y_i \sim Normal(\beta_0 + \beta_1 \times X1_i + \beta_2 \times X2_i + \beta_3 \times X3_i, \sigma) \] Based on this, write the following in the Stan file, saved as ols regression model.stan.

// ols regression model
data {
  int N;
  real X1[N];
  int<lower = 0, upper = 1> X2[N];
  int<lower = 0, upper = 4> X3[N];
  real Y[N];
}

parameters {
  real<lower = 0> sigma;
  real beta0;
  real beta1;
  real beta2;
  real beta3;
}

model {
  sigma ~ uniform(0, 100);
  beta0 ~ normal(0, 1000);
  beta1 ~ normal(0, 1000);
  beta2 ~ normal(0, 1000);
  beta3 ~ normal(0, 1000);
  for(i in 1:N)
    Y[i] ~ normal(beta0 + beta1 * X1[i] + beta2 * X2[i] + beta3 * X3[i], sigma);
}

The parameters to be estimated are \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\beta_3\), and \(\sigma\); their priors are set as follows.

\[ \beta_0 \sim Normal(0, 1000) \\ \beta_1 \sim Normal(0, 1000) \\ \beta_2 \sim Normal(0, 1000) \\ \beta_3 \sim Normal(0, 1000) \\ \sigma \sim Uniform(0, 100) \]

Next, generate the MCMC samples and check the results; the samples are also extracted. The code is as follows.

# bayesian regression
data <- list(N = nrow(df), X1 = df$X1, X2 = df$X2, X3 = df$X3, Y = df$Y)
fit <- rstan::stan(file = 'ols regression model.stan', data = data, seed = 1234)
## Warning in readLines(file, warn = TRUE): incomplete final line found on
## '/Users/YONE/Desktop/ols regression model.stan'
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000407 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.07 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 6.15097 seconds (Warm-up)
## Chain 1:                4.68157 seconds (Sampling)
## Chain 1:                10.8325 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0.000153 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.53 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 5.41259 seconds (Warm-up)
## Chain 2:                4.53654 seconds (Sampling)
## Chain 2:                9.94912 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0.000238 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 2.38 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 7.48254 seconds (Warm-up)
## Chain 3:                4.32575 seconds (Sampling)
## Chain 3:                11.8083 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 0.000152 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 1.52 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 6.75802 seconds (Warm-up)
## Chain 4:                4.1117 seconds (Sampling)
## Chain 4:                10.8697 seconds (Total)
## Chain 4:
print(fit)
## Inference for Stan model: ols regression model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean   sd     2.5%      25%      50%      75%    97.5%
## sigma     2.03    0.00 0.05     1.94     1.99     2.03     2.06     2.12
## beta0     0.21    0.01 0.37    -0.50    -0.04     0.19     0.46     0.93
## beta1     0.10    0.00 0.01     0.09     0.10     0.10     0.11     0.12
## beta2     0.34    0.00 0.13     0.10     0.26     0.34     0.43     0.58
## beta3     0.09    0.00 0.07    -0.05     0.04     0.09     0.13     0.22
## lp__  -1205.15    0.04 1.57 -1208.87 -1205.97 -1204.84 -1203.97 -1203.04
##       n_eff Rhat
## sigma  3579    1
## beta0  2146    1
## beta1  2484    1
## beta2  3690    1
## beta3  3011    1
## lp__   1689    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Dec 17 17:00:39 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
sample <- rstan::extract(fit)
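
As a quick optional sanity check, the posterior means computed from the extracted draws should be close to the OLS estimates above, since the priors are essentially flat:

# posterior means from the extracted MCMC draws;
# with these flat priors they should be close to the OLS estimates
c(beta0 = mean(sample$beta0), beta1 = mean(sample$beta1),
  beta2 = mean(sample$beta2), beta3 = mean(sample$beta3),
  sigma = mean(sample$sigma))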

As a check, look at the traceplots to confirm that the chains have converged.

# check convergence
g_mcmc <- fit %>% 
  ggs() %>% 
  ggs_traceplot()
print(g_mcmc)
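
In addition to the visual check, convergence can be assessed numerically. A minimal sketch using the Rhat column of the fit summary (all values should be close to 1, as they are in the print(fit) output above):

# numerical convergence check: every Rhat should be close to 1
all(summary(fit)$summary[, "Rhat"] < 1.1)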

Finally, examine the posterior distributions of the regression coefficients \(\beta_1\), \(\beta_2\), and \(\beta_3\), which are usually the most substantively interpretable, using their MCMC samples. The red vertical lines mark the 2.5th and 97.5th percentiles, the blue vertical lines the 0.5th and 99.5th percentiles, and the black dashed line marks 0. The interpretation is the same as in the previous t test post.

# visualize
df <- tibble(beta1 = sample$beta1, beta2 = sample$beta2, beta3 = sample$beta3)
quantile(df$beta1, c(0.025, 0.975, 0.005, 0.995))
##       2.5%      97.5%       0.5%      99.5% 
## 0.09077166 0.11577081 0.08698145 0.11936291
quantile(df$beta2, c(0.025, 0.975, 0.005, 0.995))
##       2.5%      97.5%       0.5%      99.5% 
## 0.09605344 0.58442292 0.02403559 0.67078405
quantile(df$beta3, c(0.025, 0.975, 0.005, 0.995))
##        2.5%       97.5%        0.5%       99.5% 
## -0.05047175  0.21704949 -0.09046274  0.26873224
# helper: posterior density with 95% (red) and 99% (blue) interval lines
get_posterior_distribution <- function(data, x, xlab = "", colour = "blue"){
  data %>% 
    ggplot() +
    geom_density(aes(x = x), fill = colour, alpha = 0.3) + 
    geom_vline(xintercept = 0, colour = 'black', linetype = 'dashed') + 
    geom_vline(xintercept = quantile(x, 0.975), colour = 'red') + 
    geom_vline(xintercept = quantile(x, 0.025), colour = 'red') + 
    geom_vline(xintercept = quantile(x, 0.995), colour = 'blue') +
    geom_vline(xintercept = quantile(x, 0.005), colour = 'blue') +
    xlab(xlab)
}

g_beta1 <- df %>% 
  get_posterior_distribution(df$beta1, xlab = "beta1", colour = "blue")
g_beta2 <- df %>% 
  get_posterior_distribution(df$beta2, xlab = "beta2", colour = "red")
g_beta3 <- df %>% 
  get_posterior_distribution(df$beta3, xlab = "beta3", colour = "green")

gridExtra::grid.arrange(g_beta1, g_beta2, g_beta3)
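
For reference, because the priors are essentially flat, the 95% credible intervals above should be close to the frequentist 95% confidence intervals from the OLS fit. An optional comparison:

# 95% confidence intervals from the OLS fit, for comparison with
# the 2.5% / 97.5% posterior quantiles above
confint(res, level = 0.95)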