Estimating the generalization error by fixing \(X\) at the training inputs and taking the expectation over \(Y\)

Suppose the training data are \(X\), \(Y_0\). The training error is
\(\overline{err} = \frac{1}{N}\sum{L(\hat{y_i}, y_{0i})}\)
If we draw new (previously unseen) responses \(Y_1\) at the same inputs \(X\), the in-sample test error is
\(Err = \frac{1}{N}\sum{L(\hat{y_i}, y_{1i})}\)
The relationship between \(Err\) and \(\overline{err}\) defines the optimism:
\[\begin{eqnarray*} Err - \overline{err} \equiv op \end{eqnarray*}\] Taking the expectation over \(Y\) on both sides:
\(E_y(Err) - E_y(\overline{err}) = E_y(op)\)
It has been shown that
\[\begin{eqnarray*} E_y(op) &=& \frac{2}{N} \cdot \sum{Cov(\hat{y_i}, y_i)} \end{eqnarray*}\] In particular, for a linear model, \[\begin{eqnarray*} \hat{\beta} &=& (X^TX)^{-1}X^TY \\ H &=& X(X^TX)^{-1}X^T \\ \hat{Y} &=& HY \end{eqnarray*}\] so \[\begin{eqnarray*} E_y(op) &=& \frac{2}{N} \cdot \sum{Cov(\hat{y_i}, y_i)} \\ &=& \frac{2}{N} \cdot tr(Cov(HY, Y)) \\ &=& \frac{2}{N} \cdot tr(H \, Cov(Y, Y)) \\ &=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(H) \\ &=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(X(X^TX)^{-1}X^T) \\ &=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(X^TX(X^TX)^{-1}) \qquad (tr(AB) = tr(BA)) \\ &=& \frac{2}{N} \cdot \sigma_{\varepsilon}^{2} \cdot tr(I_d) \\ &=& \frac{2}{N} \cdot d\sigma_{\varepsilon}^{2} \\ \end{eqnarray*}\]
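As a sanity check on this closed form, here is a minimal Monte Carlo sketch (all names and constants are illustrative assumptions, not from the text): hold \(X\) fixed, repeatedly resample \(Y\), fit OLS under squared loss, and compare the average optimism with \(\frac{2}{N} \cdot d\sigma_{\varepsilon}^{2}\).

```python
# Monte Carlo check of E_y(op) = (2/N) * d * sigma^2 for OLS + squared loss.
# X is held fixed and only Y is resampled, matching the in-sample setup above.
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 2.0

X = rng.normal(size=(N, d))            # fixed design
mu = X @ rng.normal(size=d)            # fixed conditional mean E[Y|X]
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix

ops = []
for _ in range(20_000):
    y0 = mu + sigma * rng.normal(size=N)  # training responses Y_0
    y1 = mu + sigma * rng.normal(size=N)  # fresh responses Y_1 at the same X
    y_hat = H @ y0                        # OLS fit on the training draw
    err_bar = np.mean((y_hat - y0) ** 2)  # training error
    Err = np.mean((y_hat - y1) ** 2)      # in-sample test error
    ops.append(Err - err_bar)

print("simulated E_y(op):      ", np.mean(ops))
print("theory 2*d*sigma^2 / N: ", 2 * d * sigma**2 / N)
```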

Efron (1986) proved that the identity above holds exactly for a linear model with squared-error loss;
for a linear model with log-loss it holds approximately;
for 0-1 loss it does not hold, although it is sometimes still used there anyway.
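The "approximately holds" case can be probed numerically. Below is a hedged sketch for logistic regression with log-loss, with the error put on the deviance scale \(-\frac{2}{N} \cdot loglik\) (the same scale AIC uses below), where the expected optimism should land near \(2\frac{d}{N}\), but only approximately. The constants and the hand-rolled IRLS fitter are illustrative assumptions.

```python
# Rough check for logistic regression + log-loss: estimate the average
# optimism of the per-observation deviance -(2/N)*loglik by resampling Y
# at fixed X, and compare it with the approximate value 2*d/N.
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -0.5, 0.5]))))  # fixed E[Y|X]

def fit_logistic(X, y, iters=25):
    """Plain Newton/IRLS for unpenalized logistic regression."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        b = b + np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
    return b

def deviance(p_hat, p):
    """Per-observation deviance; p may be 0/1 labels or true probabilities."""
    return -2.0 * np.mean(p * np.log(p_hat) + (1 - p) * np.log(1 - p_hat))

ops = []
for _ in range(2_000):
    y = (rng.uniform(size=N) < p_true).astype(float)   # resample Y at fixed X
    p_hat = 1.0 / (1.0 + np.exp(-(X @ fit_logistic(X, y))))
    p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)
    ops.append(deviance(p_hat, p_true) - deviance(p_hat, y))  # Err - err_bar

print("simulated E_y(op):", np.mean(ops))
print("approx 2*d/N:     ", 2 * d / N)
```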

When the error is measured by the negative log-likelihood, and \(N \to \infty\):

\[\begin{eqnarray*} E_y(Err) &=& E_y(\overline{err}) + E_y(op) \\ \text{(plug in the observed training error)} &\approx& -\frac{2}{N} \cdot loglik + \frac{2}{N} \cdot \sum{Cov(\hat{y_i}, y_i)} \\ \text{(under the conditions above)} &=& -\frac{2}{N} \cdot loglik + \frac{2}{N} \cdot d\sigma_{\varepsilon}^{2} \\ \text{(as $N \to \infty$)} &\approx& -\frac{2}{N} \cdot loglik + 2\frac{d}{N} \end{eqnarray*}\] The last line is exactly the AIC defined below.

AIC

Definition:

\[\begin{eqnarray*} -\frac{2}{N} \cdot loglik + 2\frac{d}{N} \end{eqnarray*}\]
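A minimal sketch of evaluating this definition for a Gaussian linear model with known noise variance; the data-generating choices below are illustrative assumptions.

```python
# Evaluating the AIC definition for a Gaussian linear model with known sigma.
import numpy as np

def aic(loglik, d, N):
    """AIC on the -(2/N)*loglik + 2*d/N scale defined above."""
    return -2.0 * loglik / N + 2.0 * d / N

rng = np.random.default_rng(2)
N, d, sigma = 100, 4, 1.0
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + sigma * rng.normal(size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]       # OLS = Gaussian MLE
rss = np.sum((y - X @ beta_hat) ** 2)
loglik = -0.5 * N * np.log(2 * np.pi * sigma**2) - 0.5 * rss / sigma**2
print("AIC:", aic(loglik, d, N))
```

When several candidate models are fit to the same data, the one with the smallest AIC is preferred.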

Summary

AIC is an approximation to \(E_y(Err)\) for a linear model with log-loss.

BIC

Definition:

\[\begin{eqnarray*} -2 \cdot loglik + \log{N} \cdot d \end{eqnarray*}\]

Derivation:

\[ obj : \max_k P(M_k|D) \] \[\begin{eqnarray*} P(M_k|D) &\varpropto& P(M_k, D) \\ &=& P(M_k) \cdot P(D|M_k) \\ &=& P(M_k) \cdot \int P(D|\theta, M_k)P(\theta|M_k) \, d\theta \end{eqnarray*}\] The integral at the end is hard to compute, so we apply a Laplace approximation: \[\begin{eqnarray*} \log{P(D|M_k)} = \log{P(D|\hat{\theta}_k, M_k)} - \frac{d_k}{2} \cdot \log{N} + O(1) \end{eqnarray*}\] where \(d_k\) is the number of free parameters of \(M_k\); multiplying by \(-2\) yields the BIC above.
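To see the Laplace approximation at work, the sketch below compares the exact log marginal likelihood \(\log{P(D|M_k)}\), computed by brute-force integration, against \(\log{P(D|\hat{\theta}_k, M_k)} - \frac{1}{2}\log{N}\); the gap should stay \(O(1)\) as \(N\) grows. The model and prior are illustrative assumptions: \(y_i \sim N(\theta, 1)\) with prior \(\theta \sim N(0, \tau^2)\), so \(d_k = 1\).

```python
# Numeric illustration of the Laplace/BIC approximation for y_i ~ N(theta, 1)
# with prior theta ~ N(0, tau^2): the gap between the exact log marginal
# likelihood and loglik(theta_hat) - (1/2)*log(N) stays O(1) as N grows.
import numpy as np

rng = np.random.default_rng(3)
tau = 3.0

for N in (20, 200, 2000):
    y = rng.normal(loc=0.7, scale=1.0, size=N)
    theta_hat = y.mean()               # MLE; the model has d_k = 1 parameter

    def loglik(theta):                 # Gaussian log-likelihood, vectorized in theta
        return (-0.5 * N * np.log(2 * np.pi)
                - 0.5 * (np.sum(y**2) - 2 * theta * y.sum() + N * theta**2))

    # brute-force the integral over theta on a fine grid around the MLE
    grid = np.linspace(theta_hat - 1.5, theta_hat + 1.5, 60_001)
    log_prior = -0.5 * np.log(2 * np.pi * tau**2) - 0.5 * grid**2 / tau**2
    log_integrand = loglik(grid) + log_prior
    m = log_integrand.max()            # stabilize the exponentials
    log_marginal = m + np.log(np.exp(log_integrand - m).sum() * (grid[1] - grid[0]))

    approx = loglik(theta_hat) - 0.5 * np.log(N)
    print(f"N={N:>5}  exact={log_marginal:10.2f}  approx={approx:10.2f}  "
          f"gap={log_marginal - approx:6.2f}")
```

The roughly constant gap in the printout is exactly the \(O(1)\) term the BIC derivation drops.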

Summary

BIC amounts to comparing the models' posterior probabilities (under the assumption that all models have equal prior probability).

### Relationship with AIC
BIC penalizes model complexity more heavily: on the common \(-2 \cdot loglik\) scale, AIC adds \(2\) per parameter while BIC adds \(\log{N}\), which exceeds \(2\) once \(N > e^2 \approx 7.4\); see the quick check below.
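A quick numeric look at the two penalties (the sample sizes are arbitrary):

```python
# Penalty per extra parameter on the -2*loglik scale: AIC adds 2, BIC adds log(N).
import numpy as np

for N in (5, 8, 100, 10_000):
    print(f"N={N:>6}   AIC penalty per parameter: 2.00   BIC: {np.log(N):.2f}")
# BIC becomes the stricter criterion once log(N) > 2, i.e. N > e^2 ≈ 7.39.
```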