1. Summary

1.1 Introduction to the Decision Tree Model

1.1.1 Decision Tree:

A decision tree is a typical single classifier. In essence, its classification idea is a data-mining process that generates a set of rules and then analyzes data with those rules. Building and applying the classifier involves three steps:

  1. First, recursively analyze the training set to grow a structure shaped like an inverted tree;
  2. Then, analyze the paths from the root node to the leaf nodes of this tree to derive a set of rules;
  3. Finally, classify and predict new data according to these rules.

A decision tree can be viewed as a tree-shaped model containing three kinds of nodes: the root node, (intermediate) internal nodes, and leaf nodes:

  • Each node of the tree represents an attribute of the sample objects
  • Each branch leaving a node represents one possible value of that attribute
  • The root node contains the full set of samples
  • A path from the root node, through several internal nodes, to a leaf node represents a rule; the whole tree represents the set of rules determined by the training samples.
  • Each leaf node corresponds to the value of the objects described by the path from the root node to that leaf. In other words, leaf nodes correspond to decision outcomes, while every other node corresponds to an attribute test (the choice of splitting variable and split point);
  • The samples contained in a node are partitioned into its child nodes according to the outcome of the attribute test;

Building a decision tree model essentially boils down to two problems:

  • Growing the tree
    • which variable to choose as the splitting variable (splitting node);
    • how to split, i.e., the splitting rule;
  • Pruning the tree (prune)

1.1.2 Basic decision tree learning algorithm

—————————————————-

  • Input:
    • Training set: \(D=\{(x_{1}, y_{1}),(x_{2}, y_{2}),\ldots,(x_{n}, y_{n})\};\)
    • Variable set: \(A=\{a_{1}, a_{2},\ldots,a_{p}\}\)
  • Procedure:
    • Function: \(TreeGenerate(D, A)\)
    1. generate node;
    2. if all samples in \(D\) belong to the same class \(C\) then
    3. mark node as a class-\(C\) leaf node; return
    4. end if
    5. if \(A=\emptyset\) or all samples in \(D\) take identical values on \(A\) then
    6. mark node as a leaf node whose class label is the majority class in \(D\); return
    7. end if
    8. select the optimal splitting variable \(a_{*}\) from \(A\) (how to select it is discussed below);
    9. for each value \(a_{*}^{v}\) of \(a_{*}\) do
    10. generate a branch for node; let \(D_{v}\) denote the subset of samples in \(D\) taking value \(a_{*}^{v}\) on \(a_{*}\);
    11. if \(D_{v}=\emptyset\) then
    12. mark the branch node as a leaf node whose class label is the majority class in \(D\);
    13. else
    14. use \(TreeGenerate(D_{v}, A \setminus \{a_{*}\})\) as the branch node
    15. end if
    16. end for
  • Output:
    • a decision tree rooted at node

—————————————————-
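The pseudocode above maps directly onto a recursive function. Below is a minimal illustrative R sketch of TreeGenerate for discrete predictors only; it is not taken from any package: tree_generate and best_split are hypothetical names, and best_split is a placeholder for the variable-selection criteria introduced in the next subsection (by default it simply takes the first remaining variable).

# Minimal sketch of TreeGenerate(D, A) for discrete predictors.
# D: data.frame; y: name of the class column; A: character vector of predictor names.
tree_generate <- function(D, y, A, best_split = function(D, y, A) A[1]) {
  classes <- D[[y]]
  # lines (2)-(4): all samples belong to the same class -> class-C leaf
  if (length(unique(classes)) == 1) return(list(leaf = as.character(classes[1])))
  # lines (5)-(7): no variables left, or all samples identical on A -> majority-class leaf
  if (length(A) == 0 || nrow(unique(D[A])) == 1)
    return(list(leaf = names(which.max(table(classes)))))
  # line (8): choose the splitting variable (criterion supplied by the caller)
  a <- best_split(D, y, A)
  node <- list(split = a, children = list())
  # lines (9)-(16): one branch for each observed value of a
  for (v in unique(D[[a]])) {
    Dv <- D[D[[a]] == v, , drop = FALSE]
    node$children[[as.character(v)]] <- if (nrow(Dv) == 0) {
      # line (12); cannot occur when iterating over observed values, kept to mirror the pseudocode
      list(leaf = names(which.max(table(classes))))
    } else {
      tree_generate(Dv, y, setdiff(A, a), best_split)
    }
  }
  node
}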

1.1.3 Choosing the optimal splitting variable

As the splitting process proceeds, we want the samples contained in each branch node of the tree to belong to the same class as much as possible, i.e., the "purity" of the nodes should become higher and higher.

(1) Information gain
  • Information entropy
    • the most commonly used measure of the purity of a sample set
    • the smaller the information entropy, the higher the purity of the sample set
  • Information gain
    • information gain is the variable-selection (splitting) criterion of the ID3 decision tree [Quinlan, 1986]; note that the splits here are multi-way rather than binary
    • the larger the information gain, the larger the "gain in purity" obtained by splitting on the corresponding variable

Suppose the proportion of class-\(k\) samples in the sample set \(D\) (the root node) is \(p_{k}\) \((k=1, 2, \ldots, |\mathcal{Y}|)\); then the information entropy of \(D\) is:

\[Ent(D)=-\sum^{|\mathcal{Y}|}_{k=1}p_{k}log_{2}p_{k}\]

Suppose a discrete variable \(a\) has \(V\) possible values \(\{a^{1}, a^{2}, \ldots, a^{V}\}\). If \(a\) is used to split the sample set \(D\), it produces \(V\) branch nodes; the \(v\)-th branch node contains all samples in \(D\) whose value on \(a\) is \(a^{v}\), denoted \(D^{v}\), whose information entropy is \(Ent(D^{v})\). Since different branch nodes contain different numbers of samples, each branch node is given the weight \(\frac{|D^{v}|}{|D|}\), so that branch nodes with more samples have more influence. The "information gain" obtained by splitting on variable \(a\) is then:

\[Gain(D, a)=Ent(D)-\sum^{V}_{v=1}\frac{|D^{v}|}{|D|}Ent(D^{v})\]

Finally, the variable with the largest information gain is chosen:

\[a_{*}=\underset{a\in A}{argmax}Gain(D, a)\]
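As a concrete illustration, here is a small R sketch of these two quantities (entropy_D and info_gain are assumed helper names, not functions from any package), together with a tiny hypothetical data set:

# Information entropy Ent(D) of a class vector y.
entropy_D <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]                   # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain Gain(D, a) of splitting D on the discrete variable a.
info_gain <- function(D, y, a) {
  av <- droplevels(as.factor(D[[a]]))
  w  <- prop.table(table(av))                    # weights |D^v| / |D|
  ent_children <- tapply(D[[y]], av, entropy_D)  # Ent(D^v)
  entropy_D(D[[y]]) - sum(w * ent_children)
}

# Tiny hypothetical example:
toy <- data.frame(outlook = c("sunny", "sunny", "rain", "rain", "overcast"),
                  play    = c("no", "no", "yes", "yes", "yes"))
info_gain(toy, "play", "outlook")   # equals Ent(D) here, since every branch is pure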

(2) Gain ratio
  • The information-gain criterion favors variables with many possible values. To reduce the adverse effect of this bias, the C4.5 decision tree algorithm [Quinlan, 1993] does not use information gain directly but selects the optimal splitting variable by the gain ratio
  • The gain-ratio criterion, in turn, favors variables with fewer possible values, so C4.5 does not simply choose the variable with the largest gain ratio; it uses a heuristic: first pick, from the variable set, the variables whose information gain is above average, and then choose among them the one with the highest gain ratio
  • C4.5 still builds multi-way trees

\[Gain\_ratio(D, a)=\frac{Gain(D, a)}{IV(a)}\]

where \(IV(a)\) is called the "intrinsic value" of variable \(a\); the more possible values \(a\) has (i.e., the larger \(V\)), the larger \(IV(a)\) usually is:

\[IV(a)=-\sum^{V}_{v=1}\frac{|D^{v}|}{|D|}log_{2}\frac{|D^{v}|}{|D|}\]
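Continuing the sketch above, the gain ratio simply divides the information gain by the intrinsic value; gain_ratio below is again an assumed helper name that reuses entropy_D() and info_gain():

# Gain ratio Gain_ratio(D, a) = Gain(D, a) / IV(a).
gain_ratio <- function(D, y, a) {
  w  <- prop.table(table(droplevels(as.factor(D[[a]]))))  # |D^v| / |D|
  iv <- -sum(w * log2(w))                                  # intrinsic value IV(a)
  info_gain(D, y, a) / iv
}

gain_ratio(toy, "play", "outlook")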

(3) Gini index
  • Gini value
    • the purity of a data set can be measured by its Gini value
    • the smaller the Gini value, the higher the purity of the data set
  • Gini index
    • the CART decision tree [Breiman et al., 1984] uses the Gini index to select the splitting variable; CART builds binary trees
    • the smaller the Gini index, the larger the "gain in purity" obtained by splitting on the corresponding variable

The purity of a data set \(D\) can be measured by its Gini value:

\[Gini(D)=\sum^{|\mathcal{Y}|}_{k=1}\sum_{k' \neq k}p_{k}p_{k'}=1-\sum^{|\mathcal{Y}|}_{k=1}p_{k}^{2}\]

Suppose variable \(a\) has \(V\) possible values \(\{a^{1}, a^{2}, \ldots, a^{V}\}\). If variable \(a\) is used to split the sample set \(D\):

  • \(a\) is a discrete variable
    • For each value \(a^{i}\) \((i=1, 2,\ldots,V)\) of variable \(a\), two branch nodes are produced. The first branch node (the left branch) contains all samples in \(D\) whose value on \(a\) equals \(a^{i}\), denoted \(D^{l}\); the second branch node (the right branch) contains all samples in \(D\) whose value on \(a\) is not \(a^{i}\), denoted \(D^{r}\)
  • \(a\) is a continuous variable
    • For the values \(a^{i}\) \((i=1, 2,\ldots,n)\) of variable \(a\) (sorted in increasing order: \(a^{1} \leqslant a^{2} \leqslant \ldots \leqslant a^{n}\)), a split point \(t\) divides the sample set \(D\) into two subsets, i.e., it produces two branch nodes. The first branch node (the left branch) contains all samples in \(D\) with \(a \leqslant t\), denoted \(D^{l}_{t}\); the second branch node (the right branch) contains all samples in \(D\) with \(a > t\), denoted \(D^{r}_{t}\). Clearly, for two adjacent values \(a^{i}\) and \(a^{i+1}\), any \(t\) in the interval \([a^{i},a^{i+1})\) produces the same partition. Therefore, for a continuous variable we can examine the candidate set of \(n-1\) split points \[T_{a}=\bigg\{\frac{a^{i}+a^{i+1}}{2}|1 \leqslant i \leqslant n-1 \bigg\}\] i.e., take the midpoint of each interval \([a^{i},a^{i+1})\) as a candidate split point, and then treat these split points just like the values of a discrete variable, choosing the best one to partition the sample set.

The Gini value of the set \(D^{l}\) is \(Gini(D^{l})\) and that of \(D^{r}\) is \(Gini(D^{r})\). Since the two branch nodes contain different numbers of samples, the left and right branch nodes are given the weights \(\frac{|D^{l}|}{|D|}\) and \(\frac{|D^{r}|}{|D|}\) respectively, so that the branch node with more samples has more influence; the Gini index of splitting on variable \(a\) is then:

\[Gini\_index(D,a)=\frac{|D^{l}|}{|D|}Gini(D^{l})+\frac{|D^{r}|}{|D|}Gini(D^{r})\]

The variable whose split yields the smallest Gini index is chosen as the optimal splitting variable:

\[a_{*}=\underset{a\in A}{argmin}Gini\_index(D, a)\]
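These quantities can also be sketched in a few lines of R; gini, gini_index_split and candidate_splits below are assumed helper names (not functions of rpart): the first computes the Gini value, the second evaluates one binary split given a logical vector selecting \(D^{l}\), and the third returns the interval midpoints \(T_{a}\) for a continuous variable.

# Gini value Gini(D) of a class vector y.
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# Gini index of one binary split; `left` is a logical vector selecting D^l.
gini_index_split <- function(y, left) {
  wl <- mean(left)                                  # |D^l| / |D|
  wl * gini(y[left]) + (1 - wl) * gini(y[!left])
}

# Candidate split points T_a for a continuous variable: midpoints of adjacent sorted values.
candidate_splits <- function(x) {
  xs <- sort(unique(x))
  (head(xs, -1) + tail(xs, -1)) / 2
}

# Hypothetical usage: evaluate every candidate split of a numeric predictor x against class y.
# sapply(candidate_splits(x), function(t) gini_index_split(y, x <= t))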

1.1.4 Decision tree pruning

Pruning is the main way a decision tree guards against overfitting. During decision tree learning, node splitting is repeated in order to classify the training samples as correctly as possible, which can produce too many branches; the tree may then fit the training samples "too well", mistaking peculiarities of the training samples for general properties of all data, which leads to overfitting. Removing some branches can therefore reduce the risk of overfitting.

The basic pruning strategies for decision trees are:
  • Prepruning
    • During tree growing, each node is evaluated before it is split; if the current split cannot improve the generalization performance of the tree, the split is stopped and the current node is marked as a leaf node
    • With prepruning, many branches of the tree are never expanded, which not only reduces the risk of overfitting but also greatly reduces training and prediction time; however, even if the current split of a branch does not improve generalization performance, or even degrades it, subsequent splits built on top of it may improve performance markedly, so prepruning can easily lead to underfitting
  • Postpruning
    • First grow a complete, maximally sized decision tree from the training set, then examine the non-leaf nodes bottom-up; if replacing the subtree rooted at a node with a leaf node improves the generalization performance of the tree, that subtree is replaced by a leaf node
    • A postpruned tree keeps more branches than a prepruned one; the risk of underfitting is small and its generalization performance is often better than that of prepruning, but postpruning is carried out after the complete tree has been grown and must examine all non-leaf nodes bottom-up one by one, so its training cost is much higher than that of an unpruned or prepruned tree
Whether a split improves generalization performance is judged with the usual performance evaluation methods and metrics (a small hold-out sketch with rpart follows this list):
  • Evaluation methods
    • hold-out
    • cross-validation
    • bootstrap
  • Performance metrics
    • classification
      • classification accuracy (error rate)
    • regression
      • MSE
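As a small hold-out illustration of post-pruning, the sketch below uses the built-in iris data and rpart's cost-complexity pruning (rpart does not perform the validation-based prepruning described above; its cp / minsplit stopping rules play a roughly similar role). The object names are arbitrary.

library(rpart)
set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Grow the full tree (cp = 0), then post-prune at the cp with the smallest
# cross-validated error from the internally computed CP table.
full    <- rpart(Species ~ ., data = train, method = "class",
                 control = rpart.control(cp = 0))
best_cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
pruned  <- prune(full, cp = best_cp)

# Hold-out accuracy of the unpruned and the pruned tree.
acc <- function(m) mean(predict(m, test, type = "class") == test$Species)
c(full = acc(full), pruned = acc(pruned))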

1.2 Strengths and weaknesses of decision tree models

The decision tree model is a simple, easy-to-use non-parametric classifier. The strengths and weaknesses of the algorithm are:

  • Strengths of the decision tree model:
    • no prior assumptions about the data are required
    • fast to compute
    • results are easy to interpret
    • robust
    • insensitive to noisy data and missing values
  • Weaknesses of the decision tree model:
    • Its decision boundaries are rectangular (axis-aligned), which limits accuracy and makes plain trees less suitable for regression problems. For regression it is advisable to use model trees (model tree), i.e., first partition the data and then fit a linear regression within each group; this is available via mob() in the party package and M5P() in the RWeka package
    • A single decision tree is not very stable: small changes in the data can change the model structure. Tree models also have a variable-selection bias, favoring variables with more distinct values. One remedy is conditional inference trees, i.e., the ctree() function in the party package; ensemble learning methods can also be used

1.3 Introduction to the CART (Classification And Regression Tree) Model

  • CART (Classification And Regression Tree):
    • CART is one of the tree-model algorithms. It first searches the set of predictors for the best splitting variable and the best split point (using the Gini index) and divides the data into two groups; the same step is then repeated on each group until some stopping condition is met. Repeatedly splitting the data in this way makes the resulting groups increasingly homogeneous, i.e., of higher purity. At the same time CART can automatically uncover the latent structure, important patterns and relationships in complex data, and the discovered knowledge can in turn be used to build accurate and reliable predictive models. For a thorough treatment of classification and regression trees, see the book by Breiman et al. (CART).
    • Building a CART model can be viewed as a process of recursive partitioning with variable selection. Recursive partitioning, as the name suggests, splits on the variables level by level until the result of the split satisfies some stopping condition; the "result of the split" may be class values, or descriptive statistics or predicted values. A CART model is either a classification tree or a regression tree:
      • a classification tree is used when the response is categorical; the terminal nodes of the tree give the predicted classes;
      • a regression tree is used when the response is continuous; the terminal nodes give descriptions or predictions of the response within each group.
  • Concrete steps for building a CART model:
    • 1. Evaluate all predictors and all split points, choose the optimal splitting variable, and grow a fairly large CART tree. The best choice is the one that makes the groups after the split as pure as possible, i.e., with the smallest within-group variation of the target variable. Purity can be measured by the Gini index or by entropy (see (3) Gini index in Section 1.1.3);
    • 2. Prune the tree. The size of the tree and its error must be balanced, so the complexity parameter (CP) is usually used to control the complexity of the tree, keeping both the prediction error and the size of the tree as small as possible. The usual practice is to first grow a fine-grained, fairly complex tree, then use cross-validation to estimate the error of the models under different pruning levels and pick the tree with the smallest error;
    • 3. Output the final result, predict and interpret.
  • The rpart library of Therneau and Atkinson (1997) (originally for S-Plus) was later ported by Ripley to the R add-on package of the same name, rpart, which provides fast computation and a ready-made set of R functions. A regression-tree example is sketched below.
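Section 2 below works through a classification tree in detail; as a complement, here is a minimal regression-tree sketch on the built-in mtcars data (method = "anova" is rpart's setting for a numeric response; the object names are arbitrary):

library(rpart)
# Regression tree: splits are chosen to reduce the within-node sum of squares of mpg.
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(minsplit = 10, cp = 0.01))
printcp(reg_tree)                # CP table: rel error / xerror at each tree size
predict(reg_tree, head(mtcars))  # predictions are the mean mpg of each terminal node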

2. Implementing CART in R

2.1 Packages and Functions

  • rpart package
    • rpart()
    • plot()
    • text()
    • summary()
    • printcp()
    • plotcp()
    • prune()
    • snip.rpart()

2.2 Classifying data with CART

The example in this part mainly follows the one in the book 《数据科学中的R语言》.

(1) Data

The data are PimaIndiansDiabetes2 from the mlbench package, with 9 variables and 768 observations. The target variable is whether the subject has diabetes, a binary variable; the other explanatory variables are characteristics of the individuals, such as age and several medical measurements, all of them numeric.

First, preprocess the data:

library(mlbench)
set.seed(1)
data(PimaIndiansDiabetes2, package = "mlbench")
data <- PimaIndiansDiabetes2
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Standardize (center and scale)
preProcValue <- preProcess(data[, -9], method = c("center", "scale"))
scaleddata <- predict(preProcValue, data[, -9])
# Yeo-Johnson transform: bring the data closer to a normal distribution and reduce the influence of outliers
preProcbox <- preProcess(scaleddata, method = c("YeoJohnson"))
boxdata <- predict(preProcbox, scaleddata)
# Impute missing values using the bagging algorithm
library(ipred)
preProcimp <- preProcess(boxdata, method = "bagImpute")
procdata <- predict(preProcimp, boxdata)
procdata$class <- data[, 9]
head(procdata)
##     pregnant    glucose    pressure     triceps     insulin       mass
## 1  0.5284016  0.7595155 -0.03275471  0.52823166  0.31794703  0.1613449
## 2 -1.0902050 -1.4250227 -0.52420088 -0.01466832 -1.22004921 -0.9327976
## 3  0.8956985  1.5845945 -0.69028150 -1.01637841  0.02420507 -1.5205042
## 4 -1.0902050 -1.2509263 -0.52420088 -0.62258605 -0.67266546 -0.6791257
## 5 -1.5823833  0.4627602 -2.74133178  0.52823166  0.09906180  1.3218244
## 6  0.3065886 -0.1924995  0.12832774 -0.77660624 -0.41566226 -1.1067781
##     pedigree         age class
## 1  0.3807017  0.90154325   pos
## 2 -0.4347451 -0.20794297   neg
## 3  0.4674736 -0.11085497   pos
## 4 -1.3678689 -1.55542743   neg
## 5  1.7918753 -0.02068449   pos
## 6 -1.1705718 -0.31192824   neg
str(procdata)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  0.528 -1.09 0.896 -1.09 -1.582 ...
##  $ glucose : num  0.76 -1.425 1.585 -1.251 0.463 ...
##  $ pressure: num  -0.0328 -0.5242 -0.6903 -0.5242 -2.7413 ...
##  $ triceps : num  0.5282 -0.0147 -1.0164 -0.6226 0.5282 ...
##  $ insulin : num  0.3179 -1.22 0.0242 -0.6727 0.0991 ...
##  $ mass    : num  0.161 -0.933 -1.521 -0.679 1.322 ...
##  $ pedigree: num  0.381 -0.435 0.467 -1.368 1.792 ...
##  $ age     : num  0.9015 -0.2079 -0.1109 -1.5554 -0.0207 ...
##  $ class   : Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

(2) Use the rpart() function to grow a tree of maximal complexity (cp = 0)

library(magrittr)
library(rpart)
rpartModel <- rpart(class ~.,
                    data = procdata, 
                    # na.action
                    method = "class", 
                    # parms = list(prior = c(0.5, 0.5), loss = matrix(c(0, 5, 1, 0), 2), split = gini)
                    control = rpart.control(cp = 0)
                    )

(3) Output the results of the high-complexity tree

objects(rpartModel)
##  [1] "call"                "control"             "cptable"            
##  [4] "frame"               "functions"           "method"             
##  [7] "numresp"             "ordered"             "parms"              
## [10] "splits"              "terms"               "variable.importance"
## [13] "where"               "y"
print(rpartModel)
## n= 768 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 768 268 neg (0.65104167 0.34895833)  
##      2) glucose< 0.184076 483  92 neg (0.80952381 0.19047619)  
##        4) insulin< -0.512991 254  19 neg (0.92519685 0.07480315)  
##          8) pedigree< 0.6050545 228  11 neg (0.95175439 0.04824561) *
##          9) pedigree>=0.6050545 26   8 neg (0.69230769 0.30769231)  
##           18) pregnant< -0.463193 17   2 neg (0.88235294 0.11764706) *
##           19) pregnant>=-0.463193 9   3 pos (0.33333333 0.66666667) *
##        5) insulin>=-0.512991 229  73 neg (0.68122271 0.31877729)  
##         10) age< -0.4816543 95  13 neg (0.86315789 0.13684211)  
##           20) triceps< -0.09488671 51   2 neg (0.96078431 0.03921569) *
##           21) triceps>=-0.09488671 44  11 neg (0.75000000 0.25000000)  
##             42) insulin>=-0.17934 27   2 neg (0.92592593 0.07407407) *
##             43) insulin< -0.17934 17   8 pos (0.47058824 0.52941176) *
##         11) age>=-0.4816543 134  60 neg (0.55223881 0.44776119)  
##           22) mass< -0.9759824 20   0 neg (1.00000000 0.00000000) *
##           23) mass>=-0.9759824 114  54 pos (0.47368421 0.52631579)  
##             46) pedigree< -1.176257 18   3 neg (0.83333333 0.16666667) *
##             47) pedigree>=-1.176257 96  39 pos (0.40625000 0.59375000)  
##               94) age>=1.114937 8   1 neg (0.87500000 0.12500000) *
##               95) age< 1.114937 88  32 pos (0.36363636 0.63636364)  
##                190) triceps>=-0.06454431 61  28 pos (0.45901639 0.54098361)  
##                  380) pregnant< 0.6254129 35  15 neg (0.57142857 0.42857143)  
##                    760) insulin< -0.3082836 9   1 neg (0.88888889 0.11111111) *
##                    761) insulin>=-0.3082836 26  12 pos (0.46153846 0.53846154)  
##                     1522) triceps< 0.6087562 15   6 neg (0.60000000 0.40000000) *
##                     1523) triceps>=0.6087562 11   3 pos (0.27272727 0.72727273) *
##                  381) pregnant>=0.6254129 26   8 pos (0.30769231 0.69230769) *
##                191) triceps< -0.06454431 27   4 pos (0.14814815 0.85185185) *
##      3) glucose>=0.184076 285 109 pos (0.38245614 0.61754386)  
##        6) mass< -0.3794478 75  24 neg (0.68000000 0.32000000)  
##         12) glucose< 0.6940911 40   6 neg (0.85000000 0.15000000) *
##         13) glucose>=0.6940911 35  17 pos (0.48571429 0.51428571)  
##           26) insulin>=0.1152267 8   2 neg (0.75000000 0.25000000) *
##           27) insulin< 0.1152267 27  11 pos (0.40740741 0.59259259)  
##             54) pressure>=0.2081851 11   4 neg (0.63636364 0.36363636) *
##             55) pressure< 0.2081851 16   4 pos (0.25000000 0.75000000) *
##        7) mass>=-0.3794478 210  58 pos (0.27619048 0.72380952)  
##         14) glucose< 0.9987702 118  46 pos (0.38983051 0.61016949)  
##           28) age< -0.2599356 50  23 neg (0.54000000 0.46000000)  
##             56) pressure>=0.0594595 25   6 neg (0.76000000 0.24000000)  
##              112) mass< 1.176336 17   1 neg (0.94117647 0.05882353) *
##              113) mass>=1.176336 8   3 pos (0.37500000 0.62500000) *
##             57) pressure< 0.0594595 25   8 pos (0.32000000 0.68000000)  
##              114) pedigree< -0.8401298 7   3 neg (0.57142857 0.42857143) *
##              115) pedigree>=-0.8401298 18   4 pos (0.22222222 0.77777778) *
##           29) age>=-0.2599356 68  19 pos (0.27941176 0.72058824)  
##             58) pedigree< -0.1368963 29  13 pos (0.44827586 0.55172414)  
##              116) mass< 0.8450892 21   9 neg (0.57142857 0.42857143)  
##                232) age< 0.5616431 11   3 neg (0.72727273 0.27272727) *
##                233) age>=0.5616431 10   4 pos (0.40000000 0.60000000) *
##              117) mass>=0.8450892 8   1 pos (0.12500000 0.87500000) *
##             59) pedigree>=-0.1368963 39   6 pos (0.15384615 0.84615385) *
##         15) glucose>=0.9987702 92  12 pos (0.13043478 0.86956522) *
summary(rpartModel)
## Call:
## rpart(formula = class ~ ., data = procdata, method = "class", 
##     control = rpart.control(cp = 0))
##   n= 768 
## 
##             CP nsplit rel error    xerror       xstd
## 1  0.250000000      0 1.0000000 1.0000000 0.04928752
## 2  0.100746269      1 0.7500000 0.8208955 0.04675050
## 3  0.016791045      2 0.6492537 0.8097015 0.04655757
## 4  0.016169154      7 0.5597015 0.7425373 0.04530720
## 5  0.009328358     10 0.5111940 0.7388060 0.04523292
## 6  0.007462687     17 0.4440299 0.7201493 0.04485359
## 7  0.005597015     18 0.4365672 0.7350746 0.04515811
## 8  0.003731343     23 0.4067164 0.7089552 0.04461956
## 9  0.001865672     24 0.4029851 0.7164179 0.04477612
## 10 0.000000000     26 0.3992537 0.7276119 0.04500691
## 
## Variable importance
##  glucose  insulin     mass      age  triceps pressure pregnant pedigree 
##       28       21       15       10        8        6        6        5 
## 
## Node number 1: 768 observations,    complexity param=0.25
##   predicted class=neg  expected loss=0.3489583  P(node) =1
##     class counts:   500   268
##    probabilities: 0.651 0.349 
##   left son=2 (483 obs) right son=3 (285 obs)
##   Primary splits:
##       glucose < 0.184076     to the left,  improve=65.38139, (0 missing)
##       insulin < -0.3030813   to the left,  improve=61.01892, (0 missing)
##       age     < -0.4816543   to the left,  improve=33.99082, (0 missing)
##       mass    < -0.3952471   to the left,  improve=31.59702, (0 missing)
##       triceps < -0.528681    to the left,  improve=25.39395, (0 missing)
##   Surrogate splits:
##       insulin  < -0.1038691   to the left,  agree=0.809, adj=0.484, (0 split)
##       age      < 0.845479     to the left,  agree=0.663, adj=0.091, (0 split)
##       pressure < 0.6246349    to the left,  agree=0.660, adj=0.084, (0 split)
##       mass     < 0.9401524    to the left,  agree=0.659, adj=0.081, (0 split)
##       triceps  < 0.9550761    to the left,  agree=0.647, adj=0.049, (0 split)
## 
## Node number 2: 483 observations,    complexity param=0.01679104
##   predicted class=neg  expected loss=0.1904762  P(node) =0.6289062
##     class counts:   391    92
##    probabilities: 0.810 0.190 
##   left son=4 (254 obs) right son=5 (229 obs)
##   Primary splits:
##       insulin < -0.512991    to the left,  improve=14.336390, (0 missing)
##       age     < -0.4816543   to the left,  improve=13.771530, (0 missing)
##       mass    < -0.8728089   to the left,  improve=10.906270, (0 missing)
##       triceps < -0.6029218   to the left,  improve= 9.739729, (0 missing)
##       glucose < -0.8138309   to the left,  improve= 8.942090, (0 missing)
##   Surrogate splits:
##       glucose  < -0.6810465   to the left,  agree=0.828, adj=0.638, (0 split)
##       age      < -0.4816543   to the left,  agree=0.642, adj=0.245, (0 split)
##       triceps  < -0.5626903   to the left,  agree=0.619, adj=0.197, (0 split)
##       pregnant < 0.1759085    to the left,  agree=0.605, adj=0.166, (0 split)
##       mass     < -0.03009352  to the left,  agree=0.588, adj=0.131, (0 split)
## 
## Node number 3: 285 observations,    complexity param=0.1007463
##   predicted class=pos  expected loss=0.3824561  P(node) =0.3710938
##     class counts:   109   176
##    probabilities: 0.382 0.618 
##   left son=6 (75 obs) right son=7 (210 obs)
##   Primary splits:
##       mass    < -0.3794478   to the left,  improve=18.022660, (0 missing)
##       glucose < 0.9246864    to the left,  improve=14.717490, (0 missing)
##       triceps < -0.3007869   to the left,  improve= 8.177419, (0 missing)
##       age     < -1.007204    to the left,  improve= 7.573969, (0 missing)
##       insulin < -0.01940693  to the left,  improve= 6.648419, (0 missing)
##   Surrogate splits:
##       triceps  < -0.6567442   to the left,  agree=0.821, adj=0.320, (0 split)
##       insulin  < -0.09424273  to the left,  agree=0.772, adj=0.133, (0 split)
##       age      < -1.472758    to the left,  agree=0.747, adj=0.040, (0 split)
##       pressure < -0.8992403   to the left,  agree=0.740, adj=0.013, (0 split)
##       pedigree < -1.624087    to the left,  agree=0.740, adj=0.013, (0 split)
## 
## Node number 4: 254 observations,    complexity param=0.005597015
##   predicted class=neg  expected loss=0.07480315  P(node) =0.3307292
##     class counts:   235    19
##    probabilities: 0.925 0.075 
##   left son=8 (228 obs) right son=9 (26 obs)
##   Primary splits:
##       pedigree < 0.6050545    to the left,  improve=3.1419610, (0 missing)
##       mass     < -0.5395378   to the left,  improve=1.7069670, (0 missing)
##       triceps  < -0.5740215   to the left,  improve=1.3003370, (0 missing)
##       age      < -0.8655906   to the left,  improve=1.1170900, (0 missing)
##       pregnant < 1.124862     to the left,  improve=0.9008398, (0 missing)
## 
## Node number 5: 229 observations,    complexity param=0.01679104
##   predicted class=neg  expected loss=0.3187773  P(node) =0.2981771
##     class counts:   156    73
##    probabilities: 0.681 0.319 
##   left son=10 (95 obs) right son=11 (134 obs)
##   Primary splits:
##       age      < -0.4816543   to the left,  improve=10.747750, (0 missing)
##       mass     < -0.958659    to the left,  improve= 8.968932, (0 missing)
##       pregnant < 0.6254129    to the left,  improve= 5.101790, (0 missing)
##       triceps  < -0.6116801   to the left,  improve= 4.588160, (0 missing)
##       pedigree < -1.164945    to the left,  improve= 2.777680, (0 missing)
##   Surrogate splits:
##       pregnant < -0.1140821   to the left,  agree=0.777, adj=0.463, (0 split)
##       pressure < -0.648699    to the left,  agree=0.664, adj=0.189, (0 split)
##       pedigree < -1.308669    to the left,  agree=0.620, adj=0.084, (0 split)
##       triceps  < -1.271052    to the left,  agree=0.607, adj=0.053, (0 split)
##       insulin  < 0.008187541  to the right, agree=0.607, adj=0.053, (0 split)
## 
## Node number 6: 75 observations,    complexity param=0.009328358
##   predicted class=neg  expected loss=0.32  P(node) =0.09765625
##     class counts:    51    24
##    probabilities: 0.680 0.320 
##   left son=12 (40 obs) right son=13 (35 obs)
##   Primary splits:
##       glucose  < 0.6940911    to the left,  improve=4.954286, (0 missing)
##       age      < -0.7307851   to the left,  improve=2.946878, (0 missing)
##       pregnant < -0.8715991   to the left,  improve=2.346767, (0 missing)
##       pressure < -0.8992403   to the left,  improve=2.127119, (0 missing)
##       insulin  < 0.1675863    to the right, improve=2.127119, (0 missing)
##   Surrogate splits:
##       insulin  < -0.1120576   to the left,  agree=0.680, adj=0.314, (0 split)
##       age      < -1.155611    to the left,  agree=0.640, adj=0.229, (0 split)
##       mass     < -0.555797    to the left,  agree=0.627, adj=0.200, (0 split)
##       pregnant < -0.1140821   to the left,  agree=0.600, adj=0.143, (0 split)
##       pedigree < -1.128585    to the right, agree=0.587, adj=0.114, (0 split)
## 
## Node number 7: 210 observations,    complexity param=0.01616915
##   predicted class=pos  expected loss=0.2761905  P(node) =0.2734375
##     class counts:    58   152
##    probabilities: 0.276 0.724 
##   left son=14 (118 obs) right son=15 (92 obs)
##   Primary splits:
##       glucose  < 0.9987702    to the left,  improve=6.956746, (0 missing)
##       pedigree < -0.6188634   to the left,  improve=3.647846, (0 missing)
##       age      < -1.007204    to the left,  improve=2.750476, (0 missing)
##       pregnant < 0.8090614    to the left,  improve=2.176582, (0 missing)
##       pressure < -0.9412698   to the right, improve=2.114189, (0 missing)
##   Surrogate splits:
##       insulin  < 0.3671545    to the left,  agree=0.681, adj=0.272, (0 split)
##       pedigree < 0.591473     to the left,  agree=0.590, adj=0.065, (0 split)
##       age      < 0.7660988    to the left,  agree=0.590, adj=0.065, (0 split)
##       triceps  < -0.6352969   to the right, agree=0.586, adj=0.054, (0 split)
##       mass     < -0.2394526   to the right, agree=0.586, adj=0.054, (0 split)
## 
## Node number 8: 228 observations
##   predicted class=neg  expected loss=0.04824561  P(node) =0.296875
##     class counts:   217    11
##    probabilities: 0.952 0.048 
## 
## Node number 9: 26 observations,    complexity param=0.005597015
##   predicted class=neg  expected loss=0.3076923  P(node) =0.03385417
##     class counts:    18     8
##    probabilities: 0.692 0.308 
##   left son=18 (17 obs) right son=19 (9 obs)
##   Primary splits:
##       pregnant < -0.463193    to the left,  improve=3.547511, (0 missing)
##       mass     < -0.004590906 to the left,  improve=2.243590, (0 missing)
##       age      < -1.007204    to the left,  improve=1.813765, (0 missing)
##       insulin  < -1.044182    to the left,  improve=1.401923, (0 missing)
##       pressure < 0.2081851    to the left,  improve=1.332562, (0 missing)
##   Surrogate splits:
##       pressure < 0.04778651   to the left,  agree=0.808, adj=0.444, (0 split)
##       age      < -0.2668235   to the left,  agree=0.769, adj=0.333, (0 split)
##       insulin  < -0.6491159   to the left,  agree=0.731, adj=0.222, (0 split)
## 
## Node number 10: 95 observations,    complexity param=0.001865672
##   predicted class=neg  expected loss=0.1368421  P(node) =0.1236979
##     class counts:    82    13
##    probabilities: 0.863 0.137 
##   left son=20 (51 obs) right son=21 (44 obs)
##   Primary splits:
##       triceps  < -0.09488671  to the left,  improve=2.0989680, (0 missing)
##       mass     < -0.1712321   to the left,  improve=1.6362440, (0 missing)
##       pedigree < -1.523283    to the right, improve=1.2862610, (0 missing)
##       insulin  < 0.1027434    to the right, improve=0.9487719, (0 missing)
##       pressure < -0.168235    to the left,  improve=0.7636337, (0 missing)
##   Surrogate splits:
##       mass     < 0.09760599   to the left,  agree=0.800, adj=0.568, (0 split)
##       insulin  < -0.3009875   to the left,  agree=0.716, adj=0.386, (0 split)
##       pressure < -0.08696023  to the left,  agree=0.684, adj=0.318, (0 split)
##       pedigree < 0.05315371   to the left,  agree=0.653, adj=0.250, (0 split)
##       glucose  < -0.3532218   to the left,  agree=0.621, adj=0.182, (0 split)
## 
## Node number 11: 134 observations,    complexity param=0.01679104
##   predicted class=neg  expected loss=0.4477612  P(node) =0.1744792
##     class counts:    74    60
##    probabilities: 0.552 0.448 
##   left son=22 (20 obs) right son=23 (114 obs)
##   Primary splits:
##       mass     < -0.9759824   to the left,  improve=9.426551, (0 missing)
##       pedigree < -1.164945    to the left,  improve=4.630224, (0 missing)
##       age      < 1.114937     to the right, improve=4.428181, (0 missing)
##       insulin  < -0.1830623   to the left,  improve=2.690844, (0 missing)
##       triceps  < -0.6116801   to the left,  improve=1.858953, (0 missing)
##   Surrogate splits:
##       triceps < -0.6116801   to the left,  agree=0.866, adj=0.1, (0 split)
## 
## Node number 12: 40 observations
##   predicted class=neg  expected loss=0.15  P(node) =0.05208333
##     class counts:    34     6
##    probabilities: 0.850 0.150 
## 
## Node number 13: 35 observations,    complexity param=0.009328358
##   predicted class=pos  expected loss=0.4857143  P(node) =0.04557292
##     class counts:    17    18
##    probabilities: 0.486 0.514 
##   left son=26 (8 obs) right son=27 (27 obs)
##   Primary splits:
##       insulin  < 0.1152267    to the right, improve=1.4486770, (0 missing)
##       age      < -0.4816543   to the left,  improve=1.4486770, (0 missing)
##       pressure < 0.1683357    to the right, improve=1.1958590, (0 missing)
##       mass     < -1.115579    to the right, improve=0.9657143, (0 missing)
##       pregnant < -0.8715991   to the left,  improve=0.9142857, (0 missing)
##   Surrogate splits:
##       pressure < -1.194316    to the left,  agree=0.8, adj=0.125, (0 split)
##       mass     < -0.411094    to the right, agree=0.8, adj=0.125, (0 split)
## 
## Node number 14: 118 observations,    complexity param=0.01616915
##   predicted class=pos  expected loss=0.3898305  P(node) =0.1536458
##     class counts:    46    72
##    probabilities: 0.390 0.610 
##   left son=28 (50 obs) right son=29 (68 obs)
##   Primary splits:
##       age      < -0.2599356   to the left,  improve=3.913240, (0 missing)
##       pedigree < -0.1659089   to the left,  improve=3.119775, (0 missing)
##       mass     < 1.159416     to the left,  improve=2.663740, (0 missing)
##       pregnant < 0.8090614    to the left,  improve=2.602148, (0 missing)
##       insulin  < 0.5847789    to the right, improve=2.127430, (0 missing)
##   Surrogate splits:
##       pregnant < 0.1759085    to the left,  agree=0.805, adj=0.54, (0 split)
##       pressure < 0.1683357    to the left,  agree=0.695, adj=0.28, (0 split)
##       triceps  < -0.1174975   to the left,  agree=0.669, adj=0.22, (0 split)
##       insulin  < 0.5409827    to the right, agree=0.627, adj=0.12, (0 split)
##       mass     < -0.1785516   to the left,  agree=0.602, adj=0.06, (0 split)
## 
## Node number 15: 92 observations
##   predicted class=pos  expected loss=0.1304348  P(node) =0.1197917
##     class counts:    12    80
##    probabilities: 0.130 0.870 
## 
## Node number 18: 17 observations
##   predicted class=neg  expected loss=0.1176471  P(node) =0.02213542
##     class counts:    15     2
##    probabilities: 0.882 0.118 
## 
## Node number 19: 9 observations
##   predicted class=pos  expected loss=0.3333333  P(node) =0.01171875
##     class counts:     3     6
##    probabilities: 0.333 0.667 
## 
## Node number 20: 51 observations
##   predicted class=neg  expected loss=0.03921569  P(node) =0.06640625
##     class counts:    49     2
##    probabilities: 0.961 0.039 
## 
## Node number 21: 44 observations,    complexity param=0.001865672
##   predicted class=neg  expected loss=0.25  P(node) =0.05729167
##     class counts:    33    11
##    probabilities: 0.750 0.250 
##   left son=42 (27 obs) right son=43 (17 obs)
##   Primary splits:
##       insulin  < -0.17934     to the right, improve=4.3257080, (0 missing)
##       pregnant < -0.8715991   to the left,  improve=1.1363640, (0 missing)
##       pressure < -0.6072412   to the right, improve=0.9166667, (0 missing)
##       mass     < 0.2023862    to the right, improve=0.9166667, (0 missing)
##       age      < -0.8655906   to the right, improve=0.6195402, (0 missing)
##   Surrogate splits:
##       pedigree < -0.7389428   to the right, agree=0.727, adj=0.294, (0 split)
##       mass     < 0.4869275    to the right, agree=0.705, adj=0.235, (0 split)
##       pregnant < 0.1759085    to the left,  agree=0.682, adj=0.176, (0 split)
##       pressure < 0.04778651   to the left,  agree=0.659, adj=0.118, (0 split)
## 
## Node number 22: 20 observations
##   predicted class=neg  expected loss=0  P(node) =0.02604167
##     class counts:    20     0
##    probabilities: 1.000 0.000 
## 
## Node number 23: 114 observations,    complexity param=0.01679104
##   predicted class=pos  expected loss=0.4736842  P(node) =0.1484375
##     class counts:    54    60
##    probabilities: 0.474 0.526 
##   left son=46 (18 obs) right son=47 (96 obs)
##   Primary splits:
##       pedigree < -1.176257    to the left,  improve=5.529605, (0 missing)
##       age      < 1.114937     to the right, improve=3.369089, (0 missing)
##       triceps  < -0.09148137  to the right, improve=2.842105, (0 missing)
##       pressure < 0.9937638    to the right, improve=2.361868, (0 missing)
##       mass     < 0.1820187    to the right, improve=2.068031, (0 missing)
## 
## Node number 26: 8 observations
##   predicted class=neg  expected loss=0.25  P(node) =0.01041667
##     class counts:     6     2
##    probabilities: 0.750 0.250 
## 
## Node number 27: 27 observations,    complexity param=0.009328358
##   predicted class=pos  expected loss=0.4074074  P(node) =0.03515625
##     class counts:    11    16
##    probabilities: 0.407 0.593 
##   left son=54 (11 obs) right son=55 (16 obs)
##   Primary splits:
##       pressure < 0.2081851    to the right, improve=1.9461280, (0 missing)
##       pregnant < 0.286815     to the left,  improve=1.0703700, (0 missing)
##       pedigree < -0.4926652   to the left,  improve=0.9259259, (0 missing)
##       glucose  < 1.213834     to the left,  improve=0.6734007, (0 missing)
##       age      < 0.9534364    to the right, improve=0.5084656, (0 missing)
##   Surrogate splits:
##       triceps  < -0.5011203   to the right, agree=0.778, adj=0.455, (0 split)
##       insulin  < -0.2967346   to the left,  agree=0.741, adj=0.364, (0 split)
##       pregnant < 0.974253     to the right, agree=0.667, adj=0.182, (0 split)
##       glucose  < 0.8995504    to the left,  agree=0.667, adj=0.182, (0 split)
##       mass     < -1.55739     to the left,  agree=0.630, adj=0.091, (0 split)
## 
## Node number 28: 50 observations,    complexity param=0.01616915
##   predicted class=neg  expected loss=0.46  P(node) =0.06510417
##     class counts:    27    23
##    probabilities: 0.540 0.460 
##   left son=56 (25 obs) right son=57 (25 obs)
##   Primary splits:
##       pressure < 0.0594595    to the right, improve=4.840000, (0 missing)
##       insulin  < 0.5847789    to the right, improve=2.717193, (0 missing)
##       mass     < 1.176336     to the left,  improve=2.014825, (0 missing)
##       pregnant < -1.336294    to the right, improve=1.284444, (0 missing)
##       glucose  < 0.6140985    to the right, improve=1.274635, (0 missing)
##   Surrogate splits:
##       insulin  < 0.2970999    to the right, agree=0.66, adj=0.32, (0 split)
##       mass     < 0.587874     to the right, agree=0.66, adj=0.32, (0 split)
##       glucose  < 0.5597943    to the right, agree=0.64, adj=0.28, (0 split)
##       triceps  < 0.4408404    to the right, agree=0.62, adj=0.24, (0 split)
##       pregnant < -0.8715991   to the left,  agree=0.60, adj=0.20, (0 split)
## 
## Node number 29: 68 observations,    complexity param=0.005597015
##   predicted class=pos  expected loss=0.2794118  P(node) =0.08854167
##     class counts:    19    49
##    probabilities: 0.279 0.721 
##   left son=58 (29 obs) right son=59 (39 obs)
##   Primary splits:
##       pedigree < -0.1368963   to the left,  improve=2.8836790, (0 missing)
##       mass     < 1.062265     to the left,  improve=1.5252100, (0 missing)
##       age      < 0.02097219   to the right, improve=1.4156860, (0 missing)
##       pressure < 1.455499     to the left,  improve=1.2184190, (0 missing)
##       triceps  < 0.234774     to the left,  improve=0.6778075, (0 missing)
##   Surrogate splits:
##       triceps  < 0.4077414    to the left,  agree=0.647, adj=0.172, (0 split)
##       mass     < 1.485691     to the right, agree=0.647, adj=0.172, (0 split)
##       glucose  < 0.6500342    to the left,  agree=0.603, adj=0.069, (0 split)
##       pressure < 2.138253     to the right, agree=0.603, adj=0.069, (0 split)
##       insulin  < -0.3114232   to the left,  agree=0.603, adj=0.069, (0 split)
## 
## Node number 42: 27 observations
##   predicted class=neg  expected loss=0.07407407  P(node) =0.03515625
##     class counts:    25     2
##    probabilities: 0.926 0.074 
## 
## Node number 43: 17 observations
##   predicted class=pos  expected loss=0.4705882  P(node) =0.02213542
##     class counts:     8     9
##    probabilities: 0.471 0.529 
## 
## Node number 46: 18 observations
##   predicted class=neg  expected loss=0.1666667  P(node) =0.0234375
##     class counts:    15     3
##    probabilities: 0.833 0.167 
## 
## Node number 47: 96 observations,    complexity param=0.01679104
##   predicted class=pos  expected loss=0.40625  P(node) =0.125
##     class counts:    39    57
##    probabilities: 0.406 0.594 
##   left son=94 (8 obs) right son=95 (88 obs)
##   Primary splits:
##       age      < 1.114937     to the right, improve=3.835227, (0 missing)
##       triceps  < -0.09148137  to the right, improve=2.981483, (0 missing)
##       insulin  < -0.3188784   to the left,  improve=2.077335, (0 missing)
##       mass     < 0.1820187    to the right, improve=1.862827, (0 missing)
##       pedigree < 0.2368907    to the left,  improve=1.600556, (0 missing)
## 
## Node number 54: 11 observations
##   predicted class=neg  expected loss=0.3636364  P(node) =0.01432292
##     class counts:     7     4
##    probabilities: 0.636 0.364 
## 
## Node number 55: 16 observations
##   predicted class=pos  expected loss=0.25  P(node) =0.02083333
##     class counts:     4    12
##    probabilities: 0.250 0.750 
## 
## Node number 56: 25 observations,    complexity param=0.007462687
##   predicted class=neg  expected loss=0.24  P(node) =0.03255208
##     class counts:    19     6
##    probabilities: 0.760 0.240 
##   left son=112 (17 obs) right son=113 (8 obs)
##   Primary splits:
##       mass     < 1.176336     to the left,  improve=3.487647, (0 missing)
##       pressure < 1.032542     to the left,  improve=2.135873, (0 missing)
##       triceps  < 0.3041504    to the left,  improve=1.620000, (0 missing)
##       pedigree < -0.3451885   to the right, improve=1.440513, (0 missing)
##       insulin  < 0.5847789    to the right, improve=1.355294, (0 missing)
##   Surrogate splits:
##       pressure < 0.9550774    to the left,  agree=0.76, adj=0.250, (0 split)
##       triceps  < 0.9852931    to the left,  agree=0.76, adj=0.250, (0 split)
##       insulin  < -0.1169469   to the right, agree=0.76, adj=0.250, (0 split)
##       age      < -1.3108      to the right, agree=0.72, adj=0.125, (0 split)
## 
## Node number 57: 25 observations,    complexity param=0.003731343
##   predicted class=pos  expected loss=0.32  P(node) =0.03255208
##     class counts:     8    17
##    probabilities: 0.320 0.680 
##   left son=114 (7 obs) right son=115 (18 obs)
##   Primary splits:
##       pedigree < -0.8401298   to the left,  improve=1.2292060, (0 missing)
##       pregnant < -1.336294    to the right, improve=1.2272220, (0 missing)
##       age      < -1.007204    to the left,  improve=1.0851280, (0 missing)
##       triceps  < -0.1668723   to the right, improve=0.7501299, (0 missing)
##       mass     < 0.02762902   to the right, improve=0.6101587, (0 missing)
##   Surrogate splits:
##       glucose < 0.7333992    to the right, agree=0.84, adj=0.429, (0 split)
##       insulin < 0.6317337    to the right, agree=0.80, adj=0.286, (0 split)
## 
## Node number 58: 29 observations,    complexity param=0.005597015
##   predicted class=pos  expected loss=0.4482759  P(node) =0.03776042
##     class counts:    13    16
##    probabilities: 0.448 0.552 
##   left son=116 (21 obs) right son=117 (8 obs)
##   Primary splits:
##       mass     < 0.8450892    to the left,  improve=2.3091130, (0 missing)
##       triceps  < 0.2403171    to the left,  improve=2.0495890, (0 missing)
##       pressure < 0.5498793    to the left,  improve=1.3159810, (0 missing)
##       pedigree < -1.051749    to the left,  improve=0.6900657, (0 missing)
##       insulin  < 0.3133556    to the left,  improve=0.4876847, (0 missing)
##   Surrogate splits:
##       triceps < 0.5360359    to the left,  agree=0.862, adj=0.50, (0 split)
##       insulin < 0.4690746    to the left,  agree=0.793, adj=0.25, (0 split)
## 
## Node number 59: 39 observations
##   predicted class=pos  expected loss=0.1538462  P(node) =0.05078125
##     class counts:     6    33
##    probabilities: 0.154 0.846 
## 
## Node number 94: 8 observations
##   predicted class=neg  expected loss=0.125  P(node) =0.01041667
##     class counts:     7     1
##    probabilities: 0.875 0.125 
## 
## Node number 95: 88 observations,    complexity param=0.009328358
##   predicted class=pos  expected loss=0.3636364  P(node) =0.1145833
##     class counts:    32    56
##    probabilities: 0.364 0.636 
##   left son=190 (61 obs) right son=191 (27 obs)
##   Primary splits:
##       triceps  < -0.06454431  to the right, improve=3.617376, (0 missing)
##       mass     < -0.3794478   to the right, improve=2.362567, (0 missing)
##       pregnant < 0.6254129    to the left,  improve=1.825312, (0 missing)
##       pedigree < -0.8348698   to the right, improve=1.654067, (0 missing)
##       insulin  < -0.3225317   to the left,  improve=1.556704, (0 missing)
##   Surrogate splits:
##       mass     < -0.3794478   to the right, agree=0.830, adj=0.444, (0 split)
##       glucose  < -0.9768363   to the right, agree=0.716, adj=0.074, (0 split)
##       insulin  < -0.5023809   to the right, agree=0.716, adj=0.074, (0 split)
##       pressure < -0.7737986   to the right, agree=0.705, adj=0.037, (0 split)
## 
## Node number 112: 17 observations
##   predicted class=neg  expected loss=0.05882353  P(node) =0.02213542
##     class counts:    16     1
##    probabilities: 0.941 0.059 
## 
## Node number 113: 8 observations
##   predicted class=pos  expected loss=0.375  P(node) =0.01041667
##     class counts:     3     5
##    probabilities: 0.375 0.625 
## 
## Node number 114: 7 observations
##   predicted class=neg  expected loss=0.4285714  P(node) =0.009114583
##     class counts:     4     3
##    probabilities: 0.571 0.429 
## 
## Node number 115: 18 observations
##   predicted class=pos  expected loss=0.2222222  P(node) =0.0234375
##     class counts:     4    14
##    probabilities: 0.222 0.778 
## 
## Node number 116: 21 observations,    complexity param=0.005597015
##   predicted class=neg  expected loss=0.4285714  P(node) =0.02734375
##     class counts:    12     9
##    probabilities: 0.571 0.429 
##   left son=232 (11 obs) right son=233 (10 obs)
##   Primary splits:
##       age      < 0.5616431    to the left,  improve=1.1220780, (0 missing)
##       pressure < 0.6823251    to the left,  improve=0.9972527, (0 missing)
##       pedigree < -0.8830097   to the right, improve=0.6311688, (0 missing)
##       glucose  < 0.4486419    to the right, improve=0.5079365, (0 missing)
##       triceps  < 0.1249368    to the left,  improve=0.5079365, (0 missing)
##   Surrogate splits:
##       pedigree < -0.960914    to the right, agree=0.857, adj=0.7, (0 split)
##       pressure < 0.3673065    to the left,  agree=0.714, adj=0.4, (0 split)
##       triceps  < 0.1249368    to the right, agree=0.714, adj=0.4, (0 split)
##       pregnant < 0.4174951    to the right, agree=0.667, adj=0.3, (0 split)
##       glucose  < 0.5733696    to the left,  agree=0.667, adj=0.3, (0 split)
## 
## Node number 117: 8 observations
##   predicted class=pos  expected loss=0.125  P(node) =0.01041667
##     class counts:     1     7
##    probabilities: 0.125 0.875 
## 
## Node number 190: 61 observations,    complexity param=0.009328358
##   predicted class=pos  expected loss=0.4590164  P(node) =0.07942708
##     class counts:    28    33
##    probabilities: 0.459 0.541 
##   left son=380 (35 obs) right son=381 (26 obs)
##   Primary splits:
##       pregnant < 0.6254129    to the left,  improve=2.075302, (0 missing)
##       pedigree < 0.2368907    to the left,  improve=1.766478, (0 missing)
##       insulin  < -0.3225317   to the left,  improve=1.666740, (0 missing)
##       mass     < 1.321788     to the left,  improve=1.580796, (0 missing)
##       glucose  < -0.5017753   to the left,  improve=1.181387, (0 missing)
##   Surrogate splits:
##       age     < 0.5372991    to the left,  agree=0.672, adj=0.231, (0 split)
##       insulin < -0.3992971   to the right, agree=0.656, adj=0.192, (0 split)
##       mass    < 0.1449565    to the right, agree=0.656, adj=0.192, (0 split)
##       triceps < -0.01614765  to the right, agree=0.639, adj=0.154, (0 split)
##       glucose < 0.05869105   to the left,  agree=0.607, adj=0.077, (0 split)
## 
## Node number 191: 27 observations
##   predicted class=pos  expected loss=0.1481481  P(node) =0.03515625
##     class counts:     4    23
##    probabilities: 0.148 0.852 
## 
## Node number 232: 11 observations
##   predicted class=neg  expected loss=0.2727273  P(node) =0.01432292
##     class counts:     8     3
##    probabilities: 0.727 0.273 
## 
## Node number 233: 10 observations
##   predicted class=pos  expected loss=0.4  P(node) =0.01302083
##     class counts:     4     6
##    probabilities: 0.400 0.600 
## 
## Node number 380: 35 observations,    complexity param=0.009328358
##   predicted class=neg  expected loss=0.4285714  P(node) =0.04557292
##     class counts:    20    15
##    probabilities: 0.571 0.429 
##   left son=760 (9 obs) right son=761 (26 obs)
##   Primary splits:
##       insulin  < -0.3082836   to the left,  improve=2.4420020, (0 missing)
##       glucose  < -0.4830425   to the left,  improve=1.4628570, (0 missing)
##       pregnant < -0.8715991   to the right, improve=1.4435560, (0 missing)
##       pedigree < -0.8399107   to the right, improve=1.4285710, (0 missing)
##       triceps  < 0.6195738    to the left,  improve=0.5761905, (0 missing)
##   Surrogate splits:
##       mass    < -0.1800426   to the left,  agree=0.857, adj=0.444, (0 split)
##       glucose < 0.0905579    to the right, agree=0.800, adj=0.222, (0 split)
## 
## Node number 381: 26 observations
##   predicted class=pos  expected loss=0.3076923  P(node) =0.03385417
##     class counts:     8    18
##    probabilities: 0.308 0.692 
## 
## Node number 760: 9 observations
##   predicted class=neg  expected loss=0.1111111  P(node) =0.01171875
##     class counts:     8     1
##    probabilities: 0.889 0.111 
## 
## Node number 761: 26 observations,    complexity param=0.009328358
##   predicted class=pos  expected loss=0.4615385  P(node) =0.03385417
##     class counts:    12    14
##    probabilities: 0.462 0.538 
##   left son=1522 (15 obs) right son=1523 (11 obs)
##   Primary splits:
##       triceps  < 0.6087562    to the left,  improve=1.3594410, (0 missing)
##       age      < 0.5871544    to the right, improve=1.2238290, (0 missing)
##       pedigree < -0.6222004   to the right, improve=1.0341880, (0 missing)
##       pregnant < -0.8715991   to the right, improve=0.7326007, (0 missing)
##       glucose  < -0.3532218   to the left,  improve=0.6230769, (0 missing)
##   Surrogate splits:
##       insulin  < 0.253726     to the left,  agree=0.808, adj=0.545, (0 split)
##       mass     < 0.6008499    to the left,  agree=0.808, adj=0.545, (0 split)
##       pedigree < 0.520139     to the left,  agree=0.692, adj=0.273, (0 split)
##       pressure < 0.3661899    to the left,  agree=0.654, adj=0.182, (0 split)
##       age      < -0.3673601   to the right, agree=0.615, adj=0.091, (0 split)
## 
## Node number 1522: 15 observations
##   predicted class=neg  expected loss=0.4  P(node) =0.01953125
##     class counts:     9     6
##    probabilities: 0.600 0.400 
## 
## Node number 1523: 11 observations
##   predicted class=pos  expected loss=0.2727273  P(node) =0.01432292
##     class counts:     3     8
##    probabilities: 0.273 0.727

(4) Plot the high-complexity tree

# method 1
plot(rpartModel)
text(rpartModel)

# method 2
rpart.plot::rpart.plot(rpartModel)

# method 3
rattle::fancyRpartPlot(rpartModel)

# method 4
partykit::as.party(rpartModel) %>% plot()

(5) Analyze the tree structure, find the CP value that minimizes prediction error, and prune

# Displays CP table for Fitted Rpart Object
printcp(rpartModel)
## 
## Classification tree:
## rpart(formula = class ~ ., data = procdata, method = "class", 
##     control = rpart.control(cp = 0))
## 
## Variables actually used in tree construction:
## [1] age      glucose  insulin  mass     pedigree pregnant pressure triceps 
## 
## Root node error: 268/768 = 0.34896
## 
## n= 768 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.2500000      0   1.00000 1.00000 0.049288
## 2  0.1007463      1   0.75000 0.82090 0.046750
## 3  0.0167910      2   0.64925 0.80970 0.046558
## 4  0.0161692      7   0.55970 0.74254 0.045307
## 5  0.0093284     10   0.51119 0.73881 0.045233
## 6  0.0074627     17   0.44403 0.72015 0.044854
## 7  0.0055970     18   0.43657 0.73507 0.045158
## 8  0.0037313     23   0.40672 0.70896 0.044620
## 9  0.0018657     24   0.40299 0.71642 0.044776
## 10 0.0000000     26   0.39925 0.72761 0.045007
# Plot a Complexity Parameter Table for an Rpart Fit
plotcp(rpartModel)

cptable <- as.data.frame(rpartModel$cptable)
# choose the CP with the smallest cross-validated error plus one standard error
cptable$errsd <- cptable$xerror + cptable$xstd
cpvalue <- cptable[which.min(cptable$errsd), "CP"]
pruneModel <- prune(rpartModel, cpvalue)
rattle::fancyRpartPlot(pruneModel)

(6) Measure the classification performance of the model
pre <- predict(pruneModel, newdata = procdata, type = "class")
## Confusion matrix
preTable <- table(pre, procdata$class)
preTable
##      
## pre   neg pos
##   neg 444  53
##   pos  56 215
## Classification accuracy
accuracy = sum(diag(preTable)) / sum(preTable)
accuracy
## [1] 0.8580729
## Sensitivity (recall of the "neg" class, the first factor level)
preTable[1, 1] / sum(preTable[, 1])
## [1] 0.888
## Specificity (recall of the "pos" class)
preTable[2, 2] / sum(preTable[, 2])
## [1] 0.8022388

Repeated cross-validation

  • CV without model parameter tuning
    • the standard CV workflow
  • CV with model parameter tuning
    • First split the data into a training set and a test set; use CV within the training set to tune the model parameters and settle on their values; once the parameters are fixed, fit the final model on the whole training set, and then use the test set to judge the performance of the final model (a sketch of this workflow is given at the end of this subsection)

Method 1: hand-rolled 10-fold CV

index <- sample(1:10, nrow(procdata), replace = TRUE)
res <- array(0, dim = c(2, 2, 10))
n <- ncol(procdata)
for(i in 1:10) {
    train <- procdata[index != i, ]
    test <- procdata[index == i, ]
    model <- rpart(class ~., data = train, control = rpart.control(cp = 0.1))
    pre <- predict(model, test[, -n], type = "class")
    res[, , i] <- as.matrix(table(pre, test[, n]))
}
table <- apply(res, MARGIN = c(1, 2), sum)
sum(diag(table)) / sum(table)
## [1] 0.7265625

Method 2: 10-fold CV with the caret package

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tunedf <- data.frame(.cp = seq(0.001, 0.1, length = 10))
treemodel <- train(x = procdata[, -9], y = procdata[, 9], method = 'rpart', trControl = fitControl, tuneGrid = tunedf)
plot(treemodel)

pre <- predict(treemodel, newdata = procdata[, -9])
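Method 2 above tunes cp but fits and assesses on the same data. Below is a minimal sketch of the split-then-tune workflow described in the bullet list above, reusing procdata and the caret functions already loaded (createDataPartition is caret's stratified splitting helper; the object names are arbitrary):

# Split once into training and test sets, stratified on the class.
set.seed(1)
inTrain  <- createDataPartition(procdata$class, p = 0.7, list = FALSE)
trainSet <- procdata[inTrain, ]
testSet  <- procdata[-inTrain, ]

# Tune cp by repeated 10-fold CV on the training set only;
# train() refits the final model on the whole training set with the chosen cp.
ctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
grid  <- data.frame(cp = seq(0.001, 0.1, length = 10))
tuned <- train(class ~ ., data = trainSet, method = "rpart",
               trControl = ctrl, tuneGrid = grid)

# Assess the final model once on the untouched test set.
testPred <- predict(tuned, newdata = testSet)
confusionMatrix(testPred, testSet$class)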

2.3 Applying CART to the Titanic survival data

# Assumes the Kaggle Titanic data are already loaded: `train`/`test` are the raw
# data frames and `training`/`testing` are a train/validation split of `train`
training_two$family_size <- train$SibSp + train$Parch + 1
library(rpart)
rpart.model = rpart(Survived ~ .,
                    data = training,
                    method = "class",
                    control = rpart.control(minsplit = 50, cp = 0.01))

summary(rpart.model)
## Visualizing the tree
### method 1
library(rpart)
plot(rpart.model, compress = TRUE)
text(rpart.model, use.n = TRUE)

### method 2
library(rattle)
fancyRpartPlot(rpart.model)

### method 3
library(partykit)
rpart1a = as.party(rpart.model)
plot(rpart1a)

rpart.predict = predict(rpart.model, testing, type = "class")

## table(rpart.predict, testing$Survived)
## rpart.accuracy = mean(rpart.predict == testing$Survived)
## rpart.accuracy

confusionMatrix(rpart.predict, testing$Survived)

# Create the ROC curve
library(pROC)
# predicted probability of the positive class (column 2 of the class-probability matrix)
rpart.response = predict(rpart.model, testing, type = "prob")[, 2]
rpartROC = roc(testing$Survived, rpart.response,
               levels = levels(as.factor(testing$Survived)))
plot(rpartROC, type = "S", print.thres = 0.5)


## Solution
my.prediction = predict(rpart.model, newdata = test, type = "class")
my_solution = data.frame(PassengerId = test$PassengerId, Survived = my.prediction)
nrow(my_solution)
write.csv(my_solution, file = "F:/Rworkd/my_solution/my_solution.csv",
          row.names = FALSE)

###########################################################################

cvCtrl = trainControl(method = "repeatedcv", repeats = 3, 
                      summaryFunction = twoClassSummary, 
                      classProbs = TRUE)  # class probabilities are required by twoClassSummary / metric = "ROC"
set.seed(1)
rpartTune = train(Survived ~ ., 
                  data = training, 
                  method = "rpart", 
                  tuneLength = 30, 
                  metric = "ROC", 
                  trControl = cvCtrl)
rpartTune
plot(rpartTune, scales = list(x = list(log = 10)))
rpartPred = predict(rpartTune, testing)
confusionMatrix(rpartPred, testing$Survived)
rpartProbs = predict(rpartTune, testing, type = "prob")
head(rpartProbs)
## Create the ROC curve
library(pROC)
# column 2 of the probability matrix: probability of the positive class
rpartROC = roc(testing$Survived, rpartProbs[, 2],
               levels = levels(as.factor(testing$Survived)))
rpartROC
plot(rpartROC, type = "S", print.thres = 0.5)