Fix \(X\) at the training inputs and take the expectation over \(Y\) to estimate the generalization error (this is the in-sample error).
Suppose the training data are
\(X\) and
\(Y_0\); then the training error is
\(\overline{err} = \frac{1}{N}\sum{L(\hat{y_i}, y_{0i})}\)
If the new (previously unseen) responses at the same
\(X\) are
\(Y_1\), then
\(Err = \frac{1}{N}\sum{L(\hat{y_i}, y_{1i})}\)
The relation between \(Err\) and
\(\overline{err}\) defines the optimism:
\[\begin{eqnarray*}
Err - \overline{err} \equiv op
\end{eqnarray*}\]
Taking the expectation over \(Y\) of the identity above:
\(E_y(Err) - E_y(\overline{err}) = E_y(op)\)
It has been shown (Efron, 1986) that
\[\begin{eqnarray*}
E_y(op) &=& \frac{2}{N} \cdot \sum{Cov(\hat{y_i}, y_i)}
\end{eqnarray*}\]
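As a sanity check of this identity, here is a minimal Monte Carlo sketch (all names and constants are illustrative) using the simplest possible estimator, \(\hat{y_i} = \bar{y}\), for which both sides are easy to estimate by simulation:

```python
import numpy as np

# Monte Carlo check of E_y(op) = (2/N) * sum_i Cov(yhat_i, y_i).
# Estimator: yhat_i = mean(y0) for every i, so the covariance is easy to estimate.
rng = np.random.default_rng(0)
N, sigma, reps = 20, 1.0, 20000
mu = np.linspace(0.0, 1.0, N)                   # fixed mean vector (plays the role of X)

Y0 = mu + sigma * rng.normal(size=(reps, N))    # training responses, redrawn each rep
Y1 = mu + sigma * rng.normal(size=(reps, N))    # fresh responses at the same design
Yhat = np.tile(Y0.mean(axis=1, keepdims=True), (1, N))  # yhat_i = training mean

err_bar = ((Y0 - Yhat) ** 2).mean(axis=1)       # training error per rep
Err = ((Y1 - Yhat) ** 2).mean(axis=1)           # error on fresh responses per rep
lhs = (Err - err_bar).mean()                    # E_y(op) estimated by simulation

# Right-hand side: empirical covariance of yhat_i and y_i across reps, summed over i.
cov_sum = sum(np.cov(Yhat[:, i], Y0[:, i])[0, 1] for i in range(N))
rhs = 2.0 / N * cov_sum
print(lhs, rhs)                                 # both ~ 2*sigma^2/N = 0.1
```

Both sides come out near \(2\sigma^2/N\), which is the \(d = 1\) case of the linear-model result derived next.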
In particular, for a linear model,
\[\begin{eqnarray*}
\hat{\beta} &=& (X^TX)^{-1}X^TY \\
H &=& X(X^TX)^{-1}X^T \\
\hat{Y} &=& HY
\end{eqnarray*}\]
\[\begin{eqnarray*}
E_y(op) &=& \frac{2}{N} \cdot \sum{Cov(\hat{y_i}, y_i)} \\
&=& \frac{2}{N} \cdot tr(Cov(HY, Y)) \\
&=& \frac{2}{N} \cdot tr(H \cdot Cov(Y, Y)) \\
&=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(H) \\
&=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(X(X^TX)^{-1}X^T) \\
&=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(X^TX(X^TX)^{-1}) \\
&=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(I_d) \\
&=& \frac{2}{N} \cdot d\sigma_{\varepsilon}^{2}
\end{eqnarray*}\]

(The step \(\hat{Y} = HY\) turns the sum of covariances into a trace, and the rearrangement of \(tr(X(X^TX)^{-1}X^T)\) uses the cyclic property of the trace.)
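A Monte Carlo sketch of this \(2d\sigma_{\varepsilon}^2/N\) result, with illustrative constants: fix \(X\), redraw \(Y_0\) and \(Y_1\) many times, fit OLS on \(Y_0\), and average \(Err - \overline{err}\):

```python
import numpy as np

# Fix X, redraw (Y0, Y1) pairs, fit OLS on Y0, and average Err - err_bar.
rng = np.random.default_rng(1)
N, d, sigma, reps = 50, 5, 1.0, 5000
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)              # true coefficients (illustrative)

ops = []
for _ in range(reps):
    y0 = X @ beta + sigma * rng.normal(size=N)   # training responses
    y1 = X @ beta + sigma * rng.normal(size=N)   # new responses at the same X
    beta_hat = np.linalg.lstsq(X, y0, rcond=None)[0]
    y_hat = X @ beta_hat
    ops.append(np.mean((y1 - y_hat) ** 2) - np.mean((y0 - y_hat) ** 2))

print(np.mean(ops))                    # Monte Carlo estimate of E_y(op)
print(2 * d * sigma**2 / N)            # theory: 2*d*sigma_eps^2 / N = 0.2
```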
Efron (1986) proved that the result above holds exactly for a linear model with squared loss;
it holds approximately for a linear model with log-loss;
it does not hold for 0-1 loss, although some people still use it in that case anyway.
When the error is measured by the negative log-likelihood, and \(N \to \infty\):
\[\begin{eqnarray*}
E_y(Err) &=& E_y(\overline{err}) + E_y(op) \\
&=& E_y(\overline{err}) + \frac{2}{N} \cdot \sum{Cov(\hat{y_i}, y_i)} \\
&\approx& -\frac{2}{N} \cdot loglik + 2\frac{d}{N}
\end{eqnarray*}\]

Here \(\overline{err} = -\frac{2}{N} \cdot loglik\), and in the last step the penalty \(\frac{2}{N} \cdot d\sigma_{\varepsilon}^{2}\) derived above becomes \(2\frac{d}{N}\) once the error is measured in log-likelihood units (the \(\sigma_{\varepsilon}^{2}\) is absorbed into the likelihood) and \(N \to \infty\). With this, we have derived the AIC.
## AIC

Definition:
\[\begin{eqnarray*}
-\frac{2}{N} \cdot loglik + 2\frac{d}{N}
\end{eqnarray*}\]
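A minimal sketch of this definition for a Gaussian linear model, with \(\sigma^2\) estimated by maximum likelihood (`gaussian_aic` is a made-up helper name, not a library API):

```python
import numpy as np

# gaussian_aic is a made-up helper, not a library function.
def gaussian_aic(X, y):
    """AIC = -(2/N)*loglik + 2*d/N for a Gaussian linear model (MLE sigma^2)."""
    N, d = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / N          # MLE of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)  # maximized log-likelihood
    # d counts only the regression coefficients, matching the derivation above.
    return -2.0 / N * loglik + 2.0 * d / N

# Usage sketch: a d=3 truth versus the same model padded to d=6.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
print(gaussian_aic(X[:, :3], y), gaussian_aic(X, y))      # smaller is better
```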
### Summary

AIC is an approximation of \(E_y(Err)\) for a linear model with log-loss.
## BIC

Definition:
\[\begin{eqnarray*}
-2 \cdot loglik + \log{N} \cdot d
\end{eqnarray*}\]
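The same sketch for BIC, on the unscaled \(-2 \cdot loglik\) scale of the definition above (again an illustrative helper, mirroring `gaussian_aic`):

```python
import numpy as np

# gaussian_bic is a made-up helper mirroring gaussian_aic above,
# but on the unscaled -2*loglik + log(N)*d scale of the definition.
def gaussian_bic(X, y):
    N, d = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / N          # MLE of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)  # maximized log-likelihood
    return -2.0 * loglik + np.log(N) * d
```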
Derivation:
\[ \text{objective}: \quad \max_k P(M_k|D) \]
\[\begin{eqnarray*}
P(M_k|D) &\varpropto& P(M_k, D) \\
&=& P(M_k) \cdot P(D|M_k) \\
&=& P(M_k) \cdot \int P(D|\theta, M_k)P(\theta|M_k) \, d\theta
\end{eqnarray*}\]
The integral above is hard to evaluate, so we use the Laplace approximation:
\[\begin{eqnarray*}
\log{P(D|M_k)} = \log{P(D|\hat{\theta},M_k)} - \frac{d_m}{2} \cdot \log{N} + O(1)
\end{eqnarray*}\]
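A small sketch of this approximation on a model whose marginal likelihood is available in closed form: \(y_i \sim N(\theta, 1)\) with prior \(\theta \sim N(0, 1)\), so \(d_m = 1\). All choices here are illustrative; the point is that the gap between the exact \(\log{P(D|M_k)}\) and the approximation stays bounded (the \(O(1)\) term) as \(N\) grows:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Check log P(D|M) ~= log P(D|theta_hat, M) - (d_m/2)*log(N) + O(1)
# for y_i ~ N(theta, 1) with prior theta ~ N(0, 1), where d_m = 1.
rng = np.random.default_rng(2)
theta_true = 0.7

for N in [10, 50, 200, 1000]:
    y = theta_true + rng.normal(size=N)
    # Exact log marginal likelihood: under the prior, y ~ N(0, I + 11^T).
    cov = np.eye(N) + np.ones((N, N))
    exact = multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(y)
    # Laplace/BIC-style approximation: max log-likelihood minus (1/2)*log(N),
    # with theta_hat = MLE = sample mean.
    approx = norm(loc=y.mean(), scale=1.0).logpdf(y).sum() - 0.5 * np.log(N)
    print(N, exact - approx)   # gap stays bounded as N grows (the O(1) term)
```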
### Summary

BIC amounts to comparing the models' posterior probabilities, under the assumption that all models have equal prior probability.

### Relation to AIC
BIC penalizes model complexity more heavily than AIC: on the \(-2 \cdot loglik\) scale, each extra parameter costs \(\log{N}\) instead of 2, and \(\log{N} > 2\) once \(N \geq 8\).