Introduction

Last time I tried a t test with a Bayesian approach. This time it is regression analysis. Here, regression analysis means ordinary regression in which the parameters are obtained by least squares under the assumption that the errors follow a normal distribution.

Regression analysis

Load the packages we will use.

# get packages
library(tidyverse)
## ─ Attaching packages ──── tidyverse 1.2.1 ─
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ─ Conflicts ────── tidyverse_conflicts() ─
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.19.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(ggmcmc)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
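
As the rstan startup message above suggests, it can help to run the chains in parallel and to cache compiled Stan models. These two optional settings (not used in the session shown here) do exactly that:

# optional settings recommended by the rstan startup message
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)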

Generate the data as follows.

# making data
set.seed(1234)
n <- 1000
X1 <- rnorm(n = n, mean = 50, sd = 10)
X2 <- runif(n = n, min = 0, max = 1) %>% round(0)
X3 <- runif(n = n, min = 1, max = 4) %>% round(0)
Y <- 0.1 * X1 + 0.4 * X2 + 0.2 * X3 + rnorm(n = n, mean = 0, sd = 2) 
df <- tibble(X1 = X1, X2 = X2, X3 = X3, Y = Y)

First, run an ordinary regression.

# ols regression
res <- lm(Y ~ X1 + X2 + X3, data = df)
summary(res)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0446 -1.4074  0.0756  1.3956  5.9744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.199244   0.369042   0.540  0.58939    
## X1          0.103335   0.006423  16.087  < 2e-16 ***
## X2          0.343459   0.128101   2.681  0.00746 ** 
## X3          0.088088   0.066696   1.321  0.18689    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.024 on 996 degrees of freedom
## Multiple R-squared:  0.2123, Adjusted R-squared:  0.2099 
## F-statistic: 89.47 on 3 and 996 DF,  p-value: < 2.2e-16
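
For later comparison with the posterior summaries, the OLS point estimates can also be pulled out on their own (an optional extra step):

# OLS point estimates only, for comparison with the posterior means below
coef(res)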

Next, the Bayesian approach. First, assume that the observed data \(Y\), \(X1\), \(X2\), and \(X3\) are related as follows.

\[ Y_i = \beta_0 + \beta_1 \times X1_i + \beta_2 \times X2_i + \beta_3 \times X3_i + error_i \quad (i = 1, 2, \ldots, N) \] Here \(i\) indexes individuals. The error term \(error\) is assumed to follow a normal distribution with mean 0 and standard deviation \(\sigma\):

\[ error_i \sim Normal(0, \sigma) \] From this, \(Y\) can be regarded as following a normal distribution:

\[ Y_i \sim Normal(\beta_0 + \beta_1 \times X1_i + \beta_2 \times X2_i + \beta_3 \times X3_i, \sigma) \] Based on this, write the following in the Stan file, saved as ols regression model.stan.

// ols regression model
data {
  int N;
  real X1[N];
  int<lower = 0, upper = 1> X2[N];
  int<lower = 0, upper = 4> X3[N];
  real Y[N];
}

parameters {
  real<lower = 0> sigma;
  real beta0;
  real beta1;
  real beta2;
  real beta3;
}

model {
  sigma ~ uniform(0, 100);
  beta0 ~ normal(0, 1000);
  beta1 ~ normal(0, 1000);
  beta2 ~ normal(0, 1000);
  beta3 ~ normal(0, 1000);
  for(i in 1:N)
    Y[i] ~ normal(beta0 + beta1 * X1[i] + beta2 * X2[i] + beta3 * X3[i], sigma);
}

The parameters to be estimated are \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\beta_3\), and \(\sigma\); their priors are set as follows.

\[ \beta_0 \sim Normal(0, 1000) \\ \beta_1 \sim Normal(0, 1000) \\ \beta_2 \sim Normal(0, 1000) \\ \beta_3 \sim Normal(0, 1000) \\ \sigma \sim Uniform(0, 100) \]

Next, generate the MCMC samples and check the results; the samples are also extracted. The code is as follows.

# bayesian regression
data <- list(N = nrow(df), X1 = df$X1, X2 = df$X2, X3 = df$X3, Y = df$Y)
fit <- rstan::stan(file = 'ols regression model.stan', data = data, seed = 1234)
## Warning in readLines(file, warn = TRUE): incomplete final line found on
## '/Users/YONE/Desktop/ols regression model.stan'
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000407 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.07 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 6.15097 seconds (Warm-up)
## Chain 1:                4.68157 seconds (Sampling)
## Chain 1:                10.8325 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0.000153 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.53 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 5.41259 seconds (Warm-up)
## Chain 2:                4.53654 seconds (Sampling)
## Chain 2:                9.94912 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0.000238 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 2.38 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 7.48254 seconds (Warm-up)
## Chain 3:                4.32575 seconds (Sampling)
## Chain 3:                11.8083 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'ols regression model' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 0.000152 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 1.52 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 6.75802 seconds (Warm-up)
## Chain 4:                4.1117 seconds (Sampling)
## Chain 4:                10.8697 seconds (Total)
## Chain 4:
print(fit)
## Inference for Stan model: ols regression model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean   sd     2.5%      25%      50%      75%    97.5%
## sigma     2.03    0.00 0.05     1.94     1.99     2.03     2.06     2.12
## beta0     0.21    0.01 0.37    -0.50    -0.04     0.19     0.46     0.93
## beta1     0.10    0.00 0.01     0.09     0.10     0.10     0.11     0.12
## beta2     0.34    0.00 0.13     0.10     0.26     0.34     0.43     0.58
## beta3     0.09    0.00 0.07    -0.05     0.04     0.09     0.13     0.22
## lp__  -1205.15    0.04 1.57 -1208.87 -1205.97 -1204.84 -1203.97 -1203.04
##       n_eff Rhat
## sigma  3579    1
## beta0  2146    1
## beta1  2484    1
## beta2  3690    1
## beta3  3011    1
## lp__   1689    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Dec 17 17:00:39 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
sample <- rstan::extract(fit)
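
As a quick optional sanity check, the posterior means computed from the extracted draws should be close to the OLS estimates above, since the priors are essentially flat:

# posterior means from the extracted MCMC draws;
# with these flat priors they should be close to the OLS estimates
c(beta0 = mean(sample$beta0), beta1 = mean(sample$beta1),
  beta2 = mean(sample$beta2), beta3 = mean(sample$beta3),
  sigma = mean(sample$sigma))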

As a check, look at the traceplots to confirm that the chains have converged.

# check convergence
g_mcmc <- fit %>% 
  ggs() %>% 
  ggs_traceplot()
print(g_mcmc)
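
In addition to the visual check, convergence can be assessed numerically. A minimal sketch using the Rhat column of the fit summary (all values should be close to 1, as they are in the print(fit) output above):

# numerical convergence check: every Rhat should be close to 1
all(summary(fit)$summary[, "Rhat"] < 1.1)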

Finally, examine the posterior distributions of the regression coefficients \(\beta_1\), \(\beta_2\), and \(\beta_3\), which are usually the most substantively interpretable, using their MCMC samples. The red vertical lines mark the 2.5th and 97.5th percentiles, the blue vertical lines the 0.5th and 99.5th percentiles, and the black dashed line marks 0. The interpretation is the same as in the previous t test post.

# visualize
df <- tibble(beta1 = sample$beta1, beta2 = sample$beta2, beta3 = sample$beta3)
quantile(df$beta1, c(0.025, 0.975, 0.005, 0.995))
##       2.5%      97.5%       0.5%      99.5% 
## 0.09077166 0.11577081 0.08698145 0.11936291
quantile(df$beta2, c(0.025, 0.975, 0.005, 0.995))
##       2.5%      97.5%       0.5%      99.5% 
## 0.09605344 0.58442292 0.02403559 0.67078405
quantile(df$beta3, c(0.025, 0.975, 0.005, 0.995))
##        2.5%       97.5%        0.5%       99.5% 
## -0.05047175  0.21704949 -0.09046274  0.26873224
# helper: posterior density with 95% (red) and 99% (blue) interval lines
get_posterior_distribution <- function(data, x, xlab = "", colour = "blue"){
  data %>% 
    ggplot() +
    geom_density(aes(x = x), fill = colour, alpha = 0.3) + 
    geom_vline(xintercept = 0, colour = 'black', linetype = 'dashed') + 
    geom_vline(xintercept = quantile(x, 0.975), colour = 'red') + 
    geom_vline(xintercept = quantile(x, 0.025), colour = 'red') + 
    geom_vline(xintercept = quantile(x, 0.995), colour = 'blue') +
    geom_vline(xintercept = quantile(x, 0.005), colour = 'blue') +
    xlab(xlab)
}

g_beta1 <- df %>% 
  get_posterior_distribution(df$beta1, xlab = "beta1", colour = "blue")
g_beta2 <- df %>% 
  get_posterior_distribution(df$beta2, xlab = "beta2", colour = "red")
g_beta3 <- df %>% 
  get_posterior_distribution(df$beta3, xlab = "beta3", colour = "green")

gridExtra::grid.arrange(g_beta1, g_beta2, g_beta3)
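
For reference, because the priors are essentially flat, the 95% credible intervals above should be close to the frequentist 95% confidence intervals from the OLS fit. An optional comparison:

# 95% confidence intervals from the OLS fit, for comparison with
# the 2.5% / 97.5% posterior quantiles above
confint(res, level = 0.95)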