Bayes Training Error と Gibbs Training Error は本当にマイナスになるのか？

はじめに

Watanabe本

Watanabe, S., “Algebraic Geometry and Statistical Learning Theory”, 2009

の6章 (p.181) に気になる図が乗っている。

\(B_g\) と \(G_g\) はいわゆる汎化誤差なんだけど、その経験近似(?)である \(B_t\) (Bayes Training Error) と \(G_t\) (Gibbs Trainign Error) がマイナス値を取っている！

にわかには信じられないので具体的な学習モデルで確認してみたいと思う。

学習モデル

確認に使用する学習モデルは例7.1に乗っている3層ニューラルネットワークとする。

詳しくは以前の記事を参照のこと。

Watanabe本例7.1 ニューラルネットの具体例における学習係数の算出

定義

\(B_t\) (Bayes Training Error) と \(G_t\) (Gibbs Trainign Error) の定義は p.177 に載っている。

\[ \begin{align} B_t &= \frac{1}{n}\sum_{i=1}^n \log \frac{q(X_i)}{E_w \left[p(X_i | w)\right]} \\ G_t &= E_w \left[\frac{1}{n} \sum_{i=1}^n \log \frac{q(X_i)}{p(X_i | w)}\right] \end{align} \]

ただし、\(E_w[\cdot]\) は \(p(w | D_n)\) に対する期待値である。

Stan によるモデルの記述

パラメータ推定には Stan を用いる。

data {
  int N;
  vector[N] x;
  vector[N] y;
}
parameters {
  ordered[2] a;
  ordered[2] b;
}
model {
  y ~ normal(a[1] * (exp(b[1] * x) - 1) + a[2] * (exp(b[2] * x) - 1), 1);
  a ~ uniform(-10, 10);
  b ~ uniform(-10, 10);
}

学習

rstan パッケージを使って Stan を実行する。

library(rstan)

set.seed(123)

Ns <- 10:20
M <- 10

result <- data.frame()
for (n in Ns) {
  for (m in seq_len(M)) {
    x <- rnorm(n)
    y <- 0 + rnorm(n)
    
    stan_data <- list(N = n, x = x, y = y)
    stan_fit <- sampling(stan_model, data = stan_data, chains = 3)
    params <- extract(stan_fit)
    
    a <- params$a[,1]
    b <- params$b[,1]
    c <- params$a[,2]
    d <- params$b[,2]
    
    p <- function(x, a, b, c, d) a * (exp(b * x) - 1) + c * (exp(d * x) - 1)
    
    Bt <- mean(
      mapply(function(x, y) {
        dnorm(y, 0, 1, log=TRUE) - log(mean(dnorm(y, p(x, a, b, c, d), 1)))
      }, x, y)
    )
    
    Gt <- mean(
      mapply(function(a, b, c, d){
        mean(mapply(function(x, y) {
          dnorm(y, 0, 1, log=TRUE) - dnorm(y, p(x, a, b, c, d), 1, log=TRUE)
        }, x, y))
      }, a, b, c, d)
    )
    
    res <- data.frame(n = n, error = c(Bt, Gt), 
                      type = c("Bayes Training Error", "Gibbs Training Error"))
    result <- rbind(result, res)
  }
}

結果はこんな感じ。

head(result)

##    n       error                 type
## 1 10 -0.15630762 Bayes Training Error
## 2 10 -0.11315263 Gibbs Training Error
## 3 10 -0.13434266 Bayes Training Error
## 4 10 -0.10756627 Gibbs Training Error
## 5 10  0.01053031 Bayes Training Error
## 6 10  0.05037206 Gibbs Training Error

プロット

最初の図と同じ図を描いてみる。

library(dplyr)
df <- result %>% group_by(type, n) %>% summarise(mean_error = mean(error))

library(ggplot2)
theme_set(theme_bw())
ggplot(df, aes(as.factor(n), mean_error)) + 
  geom_point() + 
  geom_hline(yintercept = 0, color = "red") + 
  facet_wrap(~type) + ylab("Mean of Error") + xlab("n")

本当にマイナスになりました！

ちなみに、\(B_t\) と \(G_t\) のバラツキはこんな感じです。

ggplot(result, aes(as.factor(n), error)) + geom_point() + 
  geom_hline(yintercept = 0, color = "red") + 
  facet_wrap(~type) + ylab("Error") + xlab("n")

なので、プラスにもなるんですが、期待値はマイナスになるということのようです。