Update on 2025-04-21

Decision tree

树型方法将特征空间划分为一组矩形(递归二元划分), 然后在每个矩形内拟合一个简单的模型(通常是均值或众数).

CART(classification and regression tree)

决策树的一些优点包括: 易于理解和解释; 可以对结果进行可视化; 计算效率高.

决策树的缺点包括: 预测性能不佳.

Regression Trees: data

Major League Baseball Data from the 1986 and 1987 seasons.
Years Hits Salary
-Andy Allanson 1 66 NA
-Alan Ashby 14 81 475.0
-Alvin Davis 3 130 480.0
-Andre Dawson 11 141 500.0
-Andres Galarraga 2 87 91.5
-Alfredo Griffin 11 169 750.0
-Al Newman 2 37 70.0
-Argenis Salazar 3 73 100.0

Regression Trees: Baseball data

Features splitting:

Regression Trees: Baseball data

Tree result:

How to split the features?

考虑分裂变量 \(j\) 和分裂点 \(s\), 并定义半平面对:

\[ R_1(j, s) = \{\mathbf x_j | \mathbf x_j \leq s\}\ \mbox{and}\ R_2(j, s) = \{\mathbf x_j | \mathbf x_j > s\}. \]

For all samples \(\{x_i, y_i\}\) solved by:

\[ \mbox{min}_{j, s} \bigg[\mbox{min}_{c1} \sum_{x_i \in R_1(j,\ s)} (y_i - c_1)^2 + \mbox{min}_{c_2} \sum_{x_i \in R_2(j,\ s)} (y_i - c_2)^2 \bigg] \]

对于每个分裂变量 \(j\), 分裂点 \(s\) 可以很快地完成, 因此通过扫描所有输入数据, 确定最佳对 \((j, s)\) 是可行的. 每个分裂区间的预测结果将会是:

\[ \hat c_1 = \mbox{Mean}\big(y_i | x_i \in R_1(j, s) \big) \mbox{ and } \hat c_2 = \mbox{Mean}\big(y_i | x_i \in R_2(j, s) \big). \]

Pruning the tree

We define a subtree \(T \subset T0\),

\[ \begin{aligned} N_m =& \#\{x_i \in R_m\}, \\ \hat c_m =& \frac{1}{N_m} \sum_{x_i \in R_m} y_i, \\ Q_m(T) =& \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat c_m)^2. \end{aligned} \]

we define the cost complexity criterion:

\[ C_{\alpha} = \sum_{m = 1}^{|T|} N_m Q_m(T) + \alpha|T|. \]

How large should the tree be?

我们定义子树 \(T \subseteq T_0\) 为可以通过修剪 \(T_0\) 得到的任何树, 即将其任意数量的内部(非终端)节点合并. 该方法的思想是针对每个 \(\alpha\) 值, 在 \(T_0\) 中找到子树 \(T_{\alpha}\subseteq T_0\), 使 \(C_{\alpha}(T)\) 最小化.

调整参数 \(\alpha \geq 0\) 可以平衡树的大小和对数据的拟合程度. 较大的 \(\alpha\) 值会导致较小的子树 \(T_{\alpha}\), 反之亦然. 当 \(\alpha=0\) 时, 可以得到完整的树 \(T_0\)(每一个叶子结点的样本值都相等). 随着 \(\alpha\) 的逐步增大, 对每个\(\alpha\), 可以证明存在一个唯一的最小子树 \(T_{\alpha}\), 它能够最小化 \(C_{\alpha}(T)\). 为了找到\(T_{\alpha}\), 我们使用最弱连接剪枝: 我们依次合并会导致节点增加最小的内部节点, 直到产生单节点(根)树. 这给出了一系列(有限的)子树,可以证明该序列必须包含\(T_{\alpha}\).

通过五重或十重交叉验证,我们估计\(\alpha\)值, 选择最小化交叉验证平方和的值\(\hat{\alpha}\). 我们的最终树是\(T_{\hat{\alpha}}\).

具体的:

\[ R_\alpha = R(T) + \alpha|T|, \] \(R\) is MSE or impurity. \(T_t\) 定义为以节点\(t\)为根节点的一棵树.

graphviz

Classfication tree

In the regression problem, for a sample \(\{x_i, y_i\}\) satisfies:

\[ \mbox{min}_{j, s} \bigg[\mbox{min}_{c_1} \sum_{x_i \in R_1(j,\ s)} (y_i - c_1)^2 + \mbox{min}_{c_2} \sum_{x_i \in R_2(j,\ s)} (y_i - c_2)^2 \bigg], \]

In the classification problem, for a sample \(\{x_i, y_i\}\), we define:

\[ \hat p_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m(j, s)} I(y_i = k). \]

We classify the observations belongs to \(R_m\) to class \(k_m\), where

\[ k_m = \mbox{argmax}_k\ \hat p_{mk}. \] Our problem is how to measure a node impurity to find the split point \(s\).

lmpurity

Node impurity:

Different measures of node impurity

For samples in \(R_m\), the measures of node impurity have 3 methods (general condition and binary condition):

Misclassification error:

\[ 1 - \mbox{max}_k\ \hat p_{mk};\ 1 - \mbox{max} (\hat p_{m0},\ \hat p_{m1}). \]

Gini index:

\[ \sum_{k = 1}^K \hat p_{mk} (1 - \hat p_{mk});\ \hat p_{m1}(1 - \hat p_{m1}) + \hat p_{m0}(1 - \hat p_{m0}). \]

Cross-entropy:

\[ -\sum_{k = 1}^K \hat p_{mk} \mbox{log } \hat p_{mk};\ -\hat p_{m1} \mbox{log } \hat p_{m1} - \hat p_{m0} \mbox{log } \hat p_{m0}. \]

Impurity measures for two-class classification

Build a classification tree

Misclassification error:

\[ \mbox{min}_{j, s} \bigg[ N_1 \big(1 - \mbox{max}_m\ \hat p_{1k_1} \big) + N_2 \big(1 - \mbox{max}_m\ \hat p_{2k_2} \big) \bigg]. \]

Gini index:

\[ \mbox{min}_{j, s} \bigg[ N_1 \sum_{k_1} \hat p_{1k_1}(1 - \hat p_{1k_1}) + N_2 \sum_{k_2} \hat p_{2k_2}(1 - \hat p_{2k_2}) \bigg]. \]

Cross-entropy:

\[ \mbox{min}_{j, s} \bigg[ - N_1 \sum_{k_1} \hat p_{1k_1} \mbox{log } \hat p_{1k_1} - N_2 \sum_{k_2} \hat p_{2k_2} \mbox{log } \hat p_{2k_2} \bigg]. \]

why we prefer gini and entropy?

在一个二分类问题中,每个类别分别有400个观察值(分别表示为(400,0)和(0,400)),假设一个分割创建了节点(300,100)和(100,300),而另一个分割创建了节点(400,200)和(0,200).这两个分割具有相同的误分类错误率.

split1:

\(\hat p_{mk}\) 0 1
\(R_1=0\)
\(\frac{3}{4}\) \(\frac{1}{4}\)
\(R_2=1\) \(\frac{1}{4}\) \(\frac{3}{4}\)

split2:

\(\hat p_{mk}\) 0 1
\(R_1=0\)
\(\frac{2}{3}\) \(\frac{1}{3}\)
\(R_2=1\) \(0\) \(1\)

Which split is prefered by three methods?

Misclassification error:

split1:

\[ (1 - \frac{3}{4}) * 400 + (1 - \frac{3}{4}) * 400 = 200. \]

split2:

\[ (1 - \frac{2}{3}) * 600 + (1 - 1) * 200 = 200. \]

Which split is prefered by three methods?

Gini index:

split1:

\[ [\frac{1}{4} (1 - \frac{1}{4}) + \frac{3}{4} (1 - \frac{3}{4})] * 400 + [\frac{1}{4} (1 - \frac{3}{4}) + \frac{3}{4} (1 - \frac{1}{4})] * 400 = 300. \]

split2:

\[ [\frac{1}{3} (1 - \frac{1}{3}) + \frac{2}{3} (1 - \frac{2}{3})] * 600 + [1(1 - 0) + 0(1 - 0)] * 200 = \frac{800}{3}. \]

Concave impurity measures

For misclassifications error satisfies(here \(k_f\) means the number of misclassification samples and \(k_t\) means the correct samples):

\[ (1 - \frac{k_{1t}}{N_1}) \frac{N_1}{N} + (1 - \frac{k_{2t}}{N_2}) \frac{N_2}{N} = \frac{k_{1f} + k_{2f}}{N}. \] For a concave impurity measures:

\[ \begin{aligned} \frac{N_1}{N} f(\frac{k_{1f}}{N_1}) + \frac{N_2}{N}f(\frac{k_{2f}}{N_2}) \leq & f(\frac{N_1}{N} \cdot \frac{k_{1f}}{N_1} + \frac{N_2}{N} \cdot \frac{k_{2f}}{N_2}) \\ = & f(\frac{k_{1f} + k_{2f}}{N}). \end{aligned} \] Only when \(\frac{k_{1f}}{N_1} = \frac{k_{2f}}{N_2}\), the \(=\) of \(\leq\) can be got.

Bootstrap:a procedure for reducing the variance

一组由随机变量 \(a\) 的100个独立观测值组成的样本,记为 \(a_1,\ a_2,\ \cdots,\ a_{100}\),它们共享相同的方差 \(\sigma^2\).我们希望估计变量 \(a\) 的真实值.

M1: 随机选择 \(\hat a = a_i\) 作为估计值,\(\hat a\) 的方差为 \(\sigma^2\).

M2: 使用样本的均值 \(\hat a = \bar a\) 来估计变量 \(a\).均值 \(\bar a\) 的方差满足:

\[ \mbox{var}(\hat a) = \mbox{var}(\bar a) = \frac{1}{100^2}\sum_{100} \mbox{var}(a_i) = \frac{1}{100} \sigma^2. \]

Bootstrap:a procedure for reducing the variance

M3: 我们可以使用 bootstrap,通过从训练集中重复抽取样本.在这种方法中,我们生成10个不同的 bootstrapp 训练数据集,并最后对所有预测结果求平均.

\[ \bar a_j = \frac{1}{100} \sum_{i = 1}^{100} a_{ij} \] \[ \hat a = \frac{1}{10} \sum_{j = 1}^{10} \bar a_j. \] The variance of \(\hat a\) ?

A simple example

Train set is \(t_0 = \{a_1, a_2\}\) and bootstrap train sets: \[ t_1 = \{a_1, \tilde a_1\},\ t_2 = \{a_1, a_2\}. \] So \[ \bar a_1 = \frac{a_1 + \tilde a_1}{2},\ \bar a_2 = \frac{a_1 + a_2}{2}, \] and \[ \hat a = \frac{\bar a_1 + \bar a_2}{2}. \]

The variance: \[ \begin{aligned} \mbox{var}(\hat a) =& \frac{1}{4} \mbox{var}(\frac{a_1 + \tilde a_1}{2}) + \frac{1}{4} \mbox{var}(\frac{a_1 + a_2}{2}) \color{red}{+ 2\times \frac{1}{4} \mbox{cov}(\bar a_1,\ \bar a_2)} \\ =& \frac{1}{8}\sigma^2 + \frac{1}{8}\sigma^2 \color{red}{+ \frac{1}{2}\mbox{cov}(\frac{a_1 + \tilde a_1}{2},\ \frac{a_1 + a_2}{2})} \\ =& \frac{1}{4}\sigma^2 \color{red}{+ \frac{1}{8}\sigma^2} \in [\frac{1}{4}\sigma^2, \frac{1}{2}\sigma^2 ]. \end{aligned} \]

General situation

There are totally number of \(b\) bootstraps and \(\mbox{var}(\bar a_j) = \sigma^2\).

\[ \begin{aligned} \mbox{var}(\hat a) =& \frac{1}{b^2} \sum_{j = 1}^b \mbox{var}(\bar a_j) + \frac{2 b (b - 1)}{2b^2}\mbox{cov}(\bar a_p, \bar a_q) \\ =& \frac{1}{b} \sigma^2 + \frac{b - 1}{b} \mbox{cov}(\frac{a_{p1} + a_{p2} + \cdots a_{pn}}{n}, \frac{a_{q1} + a_{q2} + \cdots a_{qn}}{n}) = \frac{1}{b} \sigma^2 + \frac{b - 1}{b} \frac{\overbrace{n \rho}^{\rm 非独立组数}}{n^2} (n\sigma^2) \end{aligned} \] where \(\rho\) is the percentage of same samples in set \(p\) and set \(q\).

If we change mean estimator as any other function \(f\): \[ \begin{aligned} \mbox{var}(\hat f) =& \frac{1}{b^2} \sum_{j = 1}^b \mbox{var}(f_j) + \frac{2 b (b - 1)}{2b^2}\mbox{cov}(f_p, f_q) \\ =& \frac{1}{b} \sigma_f^2 + \frac{b - 1}{b} \sigma_f^2 \times \tilde\rho, \end{aligned} \] here \(\tilde \rho \approx \rho\) because of non-linear estimation. When \(f\) is a CART, we call the bootstrap Bagging, which is a special case of random forest.

Bootstrap

Bagging(CART pro), some inportant attension

Prediction

然而, 当我们对大量树进行 bagging 时, 无法再使用单一树来表示生成的统计学习过程, 也不再清楚哪些变量对该过程最为重要. 因此, bagging 通过牺牲可解释性来提高预测准确性.

对于给定的测试观测数据, 为将 bagging 应用于回归树, 我们只需使用 \(b\) 个 bootstrap 训练集构建 \(b\) 个回归树, 得出预测结果并平均. 对于分类树, 我们可以记录每棵树预测的类别, 并进行多数投票.

Variable Importance Measures

在对回归树进行 bagging 的情况下, 我们可以记录在给定预测变量上进行分裂所导致的 MSE 总降低量, 然后对所有 \(b\) 棵树进行平均.

同样地, 我们可以累加在给定预测变量上进行分裂所导致的 Gini 指数总降低量. 具体来说, 对于某个特征变量 \(j\), 我们可以计算在 \(b\) 棵树中所有可能分裂点上的 Gini 指数降低量, 然后将其累加起来, 即得到总降低量.

Random forests, CART pro max!

How

bagging 和随机森林之间的主要区别是选择预测变量子集大小 \(m = \sqrt p\), 每次分裂都会取一个新的包含 \(m\) 个预测变量的样本.

Why

假设在数据集中有一个非常强的预测变量, 以及一些其他适度强的预测变量. 那么在 bagging 树的集合中, 大部分或所有的树都将使用该强预测变量作为顶层分裂. 特别地, 在这种情况下, 这意味着 bagging 不会比单棵树在方差方面有显著的减少.

So

随机森林通过强制每个分裂只考虑一部分预测变量来解决这个问题. 因此, 平均而言, 有 \((p-m)/p\) 的分裂节点甚至不会考虑该强预测变量, 因此其他预测变量将更有机会被使用.

随机森林通过组合不同的树来实现方差降低, 有时会以略微增加偏差为代价. 实际上, 方差的降低通常是显著的, 因此可以得到一个整体上更好的模型.