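A decision tree is a typical single classifier. In essence, it derives a series of rules from the data and then applies those rules to analyze data, which makes it a data mining procedure. Generating the classifier and making decisions with it involves three parts.
A decision tree can be viewed as a tree-shaped model containing three kinds of nodes: the root node, internal (intermediate) nodes, and leaf nodes.
Building a decision tree model essentially comes down to two questions.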
—————————————————-
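- Input:
  - Training set: \(D=\{(x_{1}, y_{1}),(x_{2}, y_{2}),\ldots,(x_{n}, y_{n})\};\)
  - Variable set: \(A=\{a_{1}, a_{2},\ldots,a_{p}\}\)
- Procedure:
  - Function: \(TreeGenerate(D, A)\)
    - Create the root node node;
    - if all samples in \(D\) belong to the same class \(C\) then
      - Mark node as a leaf node of class \(C\); return
    - end if
    - if \(A=\emptyset\) or all samples in \(D\) take the same values on \(A\) then
      - Mark node as a leaf node whose class is the majority class in \(D\); return
    - end if
    - Select the optimal splitting variable \(a_{*}\) from \(A\); (how to select it is described below)
    - for each value \(a_{*}^{v}\) of \(a_{*}\) do
      - Create a branch for node; let \(D_{v}\) denote the subset of samples in \(D\) whose value on \(a_{*}\) is \(a_{*}^{v}\);
      - if \(D_{v}=\emptyset\) then
        - Mark the branch node as a leaf node whose class is the majority class in \(D\);
      - else
        - Use \(TreeGenerate(D_{v}, A \setminus \{a_{*}\})\) as the branch node
      - end if
    - end for
- Output:
  - A decision tree rooted at node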
—————————————————-
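The recursion above can be sketched in R. This is a minimal illustration, not the implementation used by any package: the function names (`tree_generate`, `choose_by_gain`, `tree_predict`), the nested-list representation of the tree, and the toy data are all assumptions made for this example, and information gain is plugged in as the splitting criterion only because it is the first one discussed afterwards.

```r
# Entropy of a label vector (used only by the splitting criterion below).
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical criterion: pick the variable in A with the largest information gain.
choose_by_gain <- function(D, A, target) {
  gains <- vapply(A, function(a) {
    branches <- split(D[[target]], D[[a]])
    entropy(D[[target]]) -
      sum(lengths(branches) / nrow(D) * vapply(branches, entropy, numeric(1)))
  }, numeric(1))
  A[which.max(gains)]
}

majority_class <- function(y) names(which.max(table(y)))

# D: data frame, A: names of candidate variables, target: name of the class column,
# levels_of: all possible values of each variable (taken from the full training set).
tree_generate <- function(D, A, target, levels_of) {
  y <- D[[target]]
  # All samples share one class: leaf of that class.
  if (length(unique(y)) == 1) {
    return(list(leaf = TRUE, class = as.character(y[1])))
  }
  # No variables left, or samples identical on A: majority-class leaf.
  if (length(A) == 0 || nrow(unique(D[, A, drop = FALSE])) == 1) {
    return(list(leaf = TRUE, class = majority_class(y)))
  }
  a_star <- choose_by_gain(D, A, target)
  node <- list(leaf = FALSE, split = a_star, children = list())
  for (v in levels_of[[a_star]]) {
    D_v <- D[D[[a_star]] == v, , drop = FALSE]
    if (nrow(D_v) == 0) {
      # Empty branch: leaf labelled with the majority class of the parent set D.
      node$children[[v]] <- list(leaf = TRUE, class = majority_class(y))
    } else {
      # Recurse on the subset D_v with a_star removed from A.
      node$children[[v]] <- tree_generate(D_v, setdiff(A, a_star), target, levels_of)
    }
  }
  node
}

# Follow the branches matching one sample until a leaf is reached.
tree_predict <- function(node, x) {
  while (!node$leaf) {
    node <- node$children[[as.character(x[[node$split]])]]
  }
  node$class
}

# Hypothetical toy data, not taken from the text.
toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "yes", "no"),
  play    = c("no", "no", "no", "yes", "yes", "yes"),
  stringsAsFactors = FALSE
)
A <- c("outlook", "windy")
tree <- tree_generate(toy, A, "play", levels_of = lapply(toy[A], unique))
tree$split
tree_predict(tree, list(outlook = "rain", windy = "no"))
```

Passing `levels_of` separately is a design choice of this sketch: it lets a branch exist for a value that never occurs in the current subset, which is the situation the \(D_{v}=\emptyset\) case handles.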
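As splitting proceeds, we want the samples contained in each branch node of the tree to belong to the same class as far as possible, that is, we want the "purity" of the nodes to keep increasing.
Suppose the proportion of class-\(k\) samples in the sample set \(D\) (the root node) is \(p_{k}(k=1, 2, \ldots, |\mathcal{Y}|)\). The information entropy of \(D\) (the root node) is then: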
\[Ent(D)=-\sum^{|\mathcal{Y}|}_{k=1}p_{k}\log_{2}p_{k}\]
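Suppose a discrete variable \(a\) has \(V\) possible values \(\{a^{1}, a^{2}, \ldots, a^{V}\}\). Splitting the sample set \(D\) on \(a\) produces \(V\) branch nodes, where the \(v\)-th branch node contains all samples in \(D\) whose value on \(a\) is \(a^{v}\), denoted \(D^{v}\). The information entropy of \(D^{v}\) is \(Ent(D^{v})\). Because branch nodes contain different numbers of samples, each branch node is given the weight \(\frac{|D^{v}|}{|D|}\), so that branch nodes with more samples have more influence. The "information gain" obtained by splitting on variable \(a\) is therefore: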
\[Gain(D, a)=Ent(D)-\sum^{V}_{v=1}\frac{|D^{v}|}{|D|}Ent(D^{v})\]
Finally, the variable with the largest information gain is selected:
\[a_{*}=\underset{a\in A}{argmax}Gain(D, a)\]
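To make the entropy and information gain formulas concrete, here is a small R sketch that evaluates them on a toy table. The helper names (`entropy`, `info_gain`) and the toy data are assumptions made for illustration; they are not part of any package used later.

```r
# Ent(D): entropy (base 2) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                      # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Gain(D, a): entropy of the parent minus the weighted entropy of the branches.
info_gain <- function(y, a) {
  branches <- split(y, as.character(a))        # labels in each D^v
  w <- lengths(branches) / length(y)           # weights |D^v| / |D|
  entropy(y) - sum(w * vapply(branches, entropy, numeric(1)))
}

# Hypothetical toy data, not taken from the text.
toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "yes", "no"),
  play    = c("no", "no", "no", "yes", "yes", "yes"),
  stringsAsFactors = FALSE
)

gains <- sapply(toy[c("outlook", "windy")], function(a) info_gain(toy$play, a))
best  <- names(which.max(gains))     # the variable a_* with the largest gain
gains
best
```

In this toy table the class labels are evenly split, so \(Ent(D)=1\); splitting on `outlook` leaves only one impure branch, which is why it receives the larger gain.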
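The gain ratio normalizes the information gain by a term that depends only on the variable itself:
\[Gain\_ratio(D, a)=\frac{Gain(D, a)}{IV(a)}\]
Here \(IV(a)\) is called the "intrinsic value" of variable \(a\). The more possible values \(a\) has (that is, the larger \(V\) is), the larger \(IV(a)\) usually is:
\[IV(a)=-\sum^{V}_{v=1}\frac{|D^{v}|}{|D|}\log_{2}\frac{|D^{v}|}{|D|}\]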
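The effect of dividing by \(IV(a)\) can be seen in a short R sketch. It compares an ID-like variable that takes a different value for every sample with a three-valued variable. The function names and the toy vectors are hypothetical and chosen only for this illustration.

```r
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Returns Gain(D, a), IV(a) and Gain_ratio(D, a) for labels y and variable a.
gain_ratio_parts <- function(y, a) {
  branches <- split(y, as.character(a))
  w <- lengths(branches) / length(y)                     # |D^v| / |D|
  gain <- entropy(y) - sum(w * vapply(branches, entropy, numeric(1)))
  iv <- -sum(w * log2(w))                                # intrinsic value IV(a)
  c(gain = gain, iv = iv, ratio = gain / iv)
}

# Hypothetical toy data: `id` is unique per sample, `outlook` has three values.
play    <- c("no", "no", "no", "yes", "yes", "yes")
id      <- 1:6
outlook <- c("sunny", "sunny", "rain", "rain", "overcast", "overcast")

by_id      <- gain_ratio_parts(play, id)
by_outlook <- gain_ratio_parts(play, outlook)
rbind(id = by_id, outlook = by_outlook)
```

On this data the ID-like variable has the larger information gain, because every branch contains a single sample and is trivially pure, but its intrinsic value is also larger, so `outlook` ends up with the higher gain ratio.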
The purity of a dataset \(D\) can be measured by its Gini value:
\[Gini(D)=\sum^{|\mathcal{Y}|}_{k=1}\sum_{k' \neq k}p_{k}p_{k'}=1-\sum^{|\mathcal{Y}|}_{k=1}p_{k}^{2}\]
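Suppose variable \(a\) has \(V\) possible values \(\{a^{1}, a^{2}, \ldots, a^{V}\}\). If variable \(a\) is used to split the sample set \(D\) into two branch nodes, denote the samples falling into the left and right branches by \(D^{l}\) and \(D^{r}\).
The Gini value of \(D^{l}\) is \(Gini(D^{l})\), and the Gini value of \(D^{r}\) is \(Gini(D^{r})\). Because the two branch nodes contain different numbers of samples, the left and right branch nodes are given the weights \(\frac{|D^{l}|}{|D|}\) and \(\frac{|D^{r}|}{|D|}\) respectively, so that the branch node with more samples has more influence. The Gini index of splitting on variable \(a\) is therefore: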
\[Gini\_index(D,a)=\frac{|D^{l}|}{|D|}Gini(D^{l})+\frac{|D^{r}|}{|D|}Gini(D^{r})\]
The variable that gives the smallest Gini index after splitting is chosen as the optimal splitting variable:
\[a_{*}=\underset{a\in A}{argmin}Gini\_index(D, a)\]
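As an illustration of the Gini index for a binary split, the R sketch below scans candidate cut points of one numeric variable and keeps the one with the smallest Gini index. Using midpoints between sorted values as candidates is an assumption of this sketch, as are the function names and the toy data.

```r
# Gini(D) = 1 - sum_k p_k^2 for a vector of class labels.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Gini_index for a binary split; go_left is a logical vector marking D^l.
gini_index <- function(y, go_left) {
  n   <- length(y)
  y_l <- y[go_left]
  y_r <- y[!go_left]
  length(y_l) / n * gini(y_l) + length(y_r) / n * gini(y_r)
}

# Hypothetical toy data: one numeric variable and a binary class.
x <- c(85, 90, 100, 120, 140, 160)
y <- c("neg", "neg", "neg", "pos", "neg", "pos")

# Candidate cut points: midpoints between consecutive sorted values.
xs     <- sort(unique(x))
cuts   <- (head(xs, -1) + tail(xs, -1)) / 2
scores <- vapply(cuts, function(cut) gini_index(y, x < cut), numeric(1))

best_cut <- cuts[which.min(scores)]
data.frame(cut = cuts, gini_index = scores)
best_cut
```

The same comparison extends across variables: compute the best Gini index for each variable and keep the variable (and cut point) with the overall minimum.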
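Pruning is the main technique decision trees use to mitigate overfitting. During decision tree learning, node splitting is repeated in order to classify the training samples as correctly as possible, which can produce too many branches. The tree may then fit the training samples so closely that it treats peculiarities of the training set as general properties of all data, which leads to overfitting. Removing some branches therefore reduces the risk of overfitting.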
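One common way to prune in R is cost-complexity pruning with the rpart package: grow a large tree, read the cross-validated error from its `cptable`, and cut the tree back with `prune()`. The sketch below uses the `kyphosis` data that ships with rpart only to stay self-contained; choosing the `cp` with the smallest `xerror` is one reasonable rule among several, not a prescription from the text, and `n_leaves` is a helper defined here.

```r
library(rpart)

set.seed(1)  # the cross-validated error in cptable depends on random folds

# Grow a deliberately large tree: cp = 0 disables the complexity penalty.
full_tree <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis,
                   method = "class",
                   control = rpart.control(cp = 0, minsplit = 5))

# Pick the complexity parameter with the smallest cross-validated error.
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Remove the branches that are not worth their complexity at that cp.
pruned_tree <- prune(full_tree, cp = best_cp)

# Helper (defined here): number of terminal nodes of an rpart tree.
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(full = n_leaves(full_tree), pruned = n_leaves(pruned_tree))
```

The pruned tree never has more leaves than the full tree. How much it shrinks depends on the data and on the cross-validation folds.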
The decision tree model is a simple, easy-to-use nonparametric classifier, with its own advantages and disadvantages.
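The example in this section is mainly based on an example from the book 《数据科学中的R语言》.
The data are the PimaIndiansDiabetes2 dataset from the mlbench package, which contains 9 variables and 768 samples. The target variable is whether the individual has diabetes; it is a binary variable. The other explanatory variables are characteristics of the individual, such as age and other medical measurements, and all of them are numeric.
First, preprocess the data: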
library(mlbench)
set.seed(1)
data(PimaIndiansDiabetes2, package = "mlbench")
data <- PimaIndiansDiabetes2
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Standardize (center and scale)
preProcValue <- preProcess(data[, -9], method = c("center", "scale"))
scaleddata <- predict(preProcValue, data[, -9])
# Yeo-Johnson transformation: brings the data closer to a normal distribution and dampens the effect of outliers
preProcbox <- preProcess(scaleddata, method = c("YeoJohnson"))
boxdata <- predict(preProcbox, scaleddata)
# Impute missing values using bagged trees
library(ipred)
preProcimp <- preProcess(boxdata, method = "bagImpute")
procdata <- predict(preProcimp, boxdata)
procdata$class <- data[, 9]
head(procdata)
## pregnant glucose pressure triceps insulin mass
## 1 0.5284016 0.7595155 -0.03275471 0.52823166 0.31794703 0.1613449
## 2 -1.0902050 -1.4250227 -0.52420088 -0.01466832 -1.22004921 -0.9327976
## 3 0.8956985 1.5845945 -0.69028150 -1.01637841 0.02420507 -1.5205042
## 4 -1.0902050 -1.2509263 -0.52420088 -0.62258605 -0.67266546 -0.6791257
## 5 -1.5823833 0.4627602 -2.74133178 0.52823166 0.09906180 1.3218244
## 6 0.3065886 -0.1924995 0.12832774 -0.77660624 -0.41566226 -1.1067781
## pedigree age class
## 1 0.3807017 0.90154325 pos
## 2 -0.4347451 -0.20794297 neg
## 3 0.4674736 -0.11085497 pos
## 4 -1.3678689 -1.55542743 neg
## 5 1.7918753 -0.02068449 pos
## 6 -1.1705718 -0.31192824 neg
str(procdata)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 0.528 -1.09 0.896 -1.09 -1.582 ...
## $ glucose : num 0.76 -1.425 1.585 -1.251 0.463 ...
## $ pressure: num -0.0328 -0.5242 -0.6903 -0.5242 -2.7413 ...
## $ triceps : num 0.5282 -0.0147 -1.0164 -0.6226 0.5282 ...
## $ insulin : num 0.3179 -1.22 0.0242 -0.6727 0.0991 ...
## $ mass : num 0.161 -0.933 -1.521 -0.679 1.322 ...
## $ pedigree: num 0.381 -0.435 0.467 -1.368 1.792 ...
## $ age : num 0.9015 -0.2079 -0.1109 -1.5554 -0.0207 ...
## $ class : Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
library(magrittr)
library(rpart)
rpartModel <- rpart(class ~.,
data = procdata,
# na.action
method = "class",
# parms = list(prior = c(0.5, 0.5), loss = matrix(c(0, 5, 1, 0), 2), split = "gini")
control = rpart.control(cp = 0)
)
objects(rpartModel)
## [1] "call" "control" "cptable"
## [4] "frame" "functions" "method"
## [7] "numresp" "ordered" "parms"
## [10] "splits" "terms" "variable.importance"
## [13] "where" "y"
print(rpartModel)
## n= 768
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 768 268 neg (0.65104167 0.34895833)
## 2) glucose< 0.184076 483 92 neg (0.80952381 0.19047619)
## 4) insulin< -0.512991 254 19 neg (0.92519685 0.07480315)
## 8) pedigree< 0.6050545 228 11 neg (0.95175439 0.04824561) *
## 9) pedigree>=0.6050545 26 8 neg (0.69230769 0.30769231)
## 18) pregnant< -0.463193 17 2 neg (0.88235294 0.11764706) *
## 19) pregnant>=-0.463193 9 3 pos (0.33333333 0.66666667) *
## 5) insulin>=-0.512991 229 73 neg (0.68122271 0.31877729)
## 10) age< -0.4816543 95 13 neg (0.86315789 0.13684211)
## 20) triceps< -0.09488671 51 2 neg (0.96078431 0.03921569) *
## 21) triceps>=-0.09488671 44 11 neg (0.75000000 0.25000000)
## 42) insulin>=-0.17934 27 2 neg (0.92592593 0.07407407) *
## 43) insulin< -0.17934 17 8 pos (0.47058824 0.52941176) *
## 11) age>=-0.4816543 134 60 neg (0.55223881 0.44776119)
## 22) mass< -0.9759824 20 0 neg (1.00000000 0.00000000) *
## 23) mass>=-0.9759824 114 54 pos (0.47368421 0.52631579)
## 46) pedigree< -1.176257 18 3 neg (0.83333333 0.16666667) *
## 47) pedigree>=-1.176257 96 39 pos (0.40625000 0.59375000)
## 94) age>=1.114937 8 1 neg (0.87500000 0.12500000) *
## 95) age< 1.114937 88 32 pos (0.36363636 0.63636364)
## 190) triceps>=-0.06454431 61 28 pos (0.45901639 0.54098361)
## 380) pregnant< 0.6254129 35 15 neg (0.57142857 0.42857143)
## 760) insulin< -0.3082836 9 1 neg (0.88888889 0.11111111) *
## 761) insulin>=-0.3082836 26 12 pos (0.46153846 0.53846154)
## 1522) triceps< 0.6087562 15 6 neg (0.60000000 0.40000000) *
## 1523) triceps>=0.6087562 11 3 pos (0.27272727 0.72727273) *
## 381) pregnant>=0.6254129 26 8 pos (0.30769231 0.69230769) *
## 191) triceps< -0.06454431 27 4 pos (0.14814815 0.85185185) *
## 3) glucose>=0.184076 285 109 pos (0.38245614 0.61754386)
## 6) mass< -0.3794478 75 24 neg (0.68000000 0.32000000)
## 12) glucose< 0.6940911 40 6 neg (0.85000000 0.15000000) *
## 13) glucose>=0.6940911 35 17 pos (0.48571429 0.51428571)
## 26) insulin>=0.1152267 8 2 neg (0.75000000 0.25000000) *
## 27) insulin< 0.1152267 27 11 pos (0.40740741 0.59259259)
## 54) pressure>=0.2081851 11 4 neg (0.63636364 0.36363636) *
## 55) pressure< 0.2081851 16 4 pos (0.25000000 0.75000000) *
## 7) mass>=-0.3794478 210 58 pos (0.27619048 0.72380952)
## 14) glucose< 0.9987702 118 46 pos (0.38983051 0.61016949)
## 28) age< -0.2599356 50 23 neg (0.54000000 0.46000000)
## 56) pressure>=0.0594595 25 6 neg (0.76000000 0.24000000)
## 112) mass< 1.176336 17 1 neg (0.94117647 0.05882353) *
## 113) mass>=1.176336 8 3 pos (0.37500000 0.62500000) *
## 57) pressure< 0.0594595 25 8 pos (0.32000000 0.68000000)
## 114) pedigree< -0.8401298 7 3 neg (0.57142857 0.42857143) *
## 115) pedigree>=-0.8401298 18 4 pos (0.22222222 0.77777778) *
## 29) age>=-0.2599356 68 19 pos (0.27941176 0.72058824)
## 58) pedigree< -0.1368963 29 13 pos (0.44827586 0.55172414)
## 116) mass< 0.8450892 21 9 neg (0.57142857 0.42857143)
## 232) age< 0.5616431 11 3 neg (0.72727273 0.27272727) *
## 233) age>=0.5616431 10 4 pos (0.40000000 0.60000000) *
## 117) mass>=0.8450892 8 1 pos (0.12500000 0.87500000) *
## 59) pedigree>=-0.1368963 39 6 pos (0.15384615 0.84615385) *
## 15) glucose>=0.9987702 92 12 pos (0.13043478 0.86956522) *
summary(rpartModel)
## Call:
## rpart(formula = class ~ ., data = procdata, method = "class",
## control = rpart.control(cp = 0))
## n= 768
##
## CP nsplit rel error xerror xstd
## 1 0.250000000 0 1.0000000 1.0000000 0.04928752
## 2 0.100746269 1 0.7500000 0.8208955 0.04675050
## 3 0.016791045 2 0.6492537 0.8097015 0.04655757
## 4 0.016169154 7 0.5597015 0.7425373 0.04530720
## 5 0.009328358 10 0.5111940 0.7388060 0.04523292
## 6 0.007462687 17 0.4440299 0.7201493 0.04485359
## 7 0.005597015 18 0.4365672 0.7350746 0.04515811
## 8 0.003731343 23 0.4067164 0.7089552 0.04461956
## 9 0.001865672 24 0.4029851 0.7164179 0.04477612
## 10 0.000000000 26 0.3992537 0.7276119 0.04500691
##
## Variable importance
## glucose insulin mass age triceps pressure pregnant pedigree
## 28 21 15 10 8 6 6 5
##
## Node number 1: 768 observations, complexity param=0.25
## predicted class=neg expected loss=0.3489583 P(node) =1
## class counts: 500 268
## probabilities: 0.651 0.349
## left son=2 (483 obs) right son=3 (285 obs)
## Primary splits:
## glucose < 0.184076 to the left, improve=65.38139, (0 missing)
## insulin < -0.3030813 to the left, improve=61.01892, (0 missing)
## age < -0.4816543 to the left, improve=33.99082, (0 missing)
## mass < -0.3952471 to the left, improve=31.59702, (0 missing)
## triceps < -0.528681 to the left, improve=25.39395, (0 missing)
## Surrogate splits:
## insulin < -0.1038691 to the left, agree=0.809, adj=0.484, (0 split)
## age < 0.845479 to the left, agree=0.663, adj=0.091, (0 split)
## pressure < 0.6246349 to the left, agree=0.660, adj=0.084, (0 split)
## mass < 0.9401524 to the left, agree=0.659, adj=0.081, (0 split)
## triceps < 0.9550761 to the left, agree=0.647, adj=0.049, (0 split)
##
## Node number 2: 483 observations, complexity param=0.01679104
## predicted class=neg expected loss=0.1904762 P(node) =0.6289062
## class counts: 391 92
## probabilities: 0.810 0.190
## left son=4 (254 obs) right son=5 (229 obs)
## Primary splits:
## insulin < -0.512991 to the left, improve=14.336390, (0 missing)
## age < -0.4816543 to the left, improve=13.771530, (0 missing)
## mass < -0.8728089 to the left, improve=10.906270, (0 missing)
## triceps < -0.6029218 to the left, improve= 9.739729, (0 missing)
## glucose < -0.8138309 to the left, improve= 8.942090, (0 missing)
## Surrogate splits:
## glucose < -0.6810465 to the left, agree=0.828, adj=0.638, (0 split)
## age < -0.4816543 to the left, agree=0.642, adj=0.245, (0 split)
## triceps < -0.5626903 to the left, agree=0.619, adj=0.197, (0 split)
## pregnant < 0.1759085 to the left, agree=0.605, adj=0.166, (0 split)
## mass < -0.03009352 to the left, agree=0.588, adj=0.131, (0 split)
##
## Node number 3: 285 observations, complexity param=0.1007463
## predicted class=pos expected loss=0.3824561 P(node) =0.3710938
## class counts: 109 176
## probabilities: 0.382 0.618
## left son=6 (75 obs) right son=7 (210 obs)
## Primary splits:
## mass < -0.3794478 to the left, improve=18.022660, (0 missing)
## glucose < 0.9246864 to the left, improve=14.717490, (0 missing)
## triceps < -0.3007869 to the left, improve= 8.177419, (0 missing)
## age < -1.007204 to the left, improve= 7.573969, (0 missing)
## insulin < -0.01940693 to the left, improve= 6.648419, (0 missing)
## Surrogate splits:
## triceps < -0.6567442 to the left, agree=0.821, adj=0.320, (0 split)
## insulin < -0.09424273 to the left, agree=0.772, adj=0.133, (0 split)
## age < -1.472758 to the left, agree=0.747, adj=0.040, (0 split)
## pressure < -0.8992403 to the left, agree=0.740, adj=0.013, (0 split)
## pedigree < -1.624087 to the left, agree=0.740, adj=0.013, (0 split)
##
## Node number 4: 254 observations, complexity param=0.005597015
## predicted class=neg expected loss=0.07480315 P(node) =0.3307292
## class counts: 235 19
## probabilities: 0.925 0.075
## left son=8 (228 obs) right son=9 (26 obs)
## Primary splits:
## pedigree < 0.6050545 to the left, improve=3.1419610, (0 missing)
## mass < -0.5395378 to the left, improve=1.7069670, (0 missing)
## triceps < -0.5740215 to the left, improve=1.3003370, (0 missing)
## age < -0.8655906 to the left, improve=1.1170900, (0 missing)
## pregnant < 1.124862 to the left, improve=0.9008398, (0 missing)
##
## Node number 5: 229 observations, complexity param=0.01679104
## predicted class=neg expected loss=0.3187773 P(node) =0.2981771
## class counts: 156 73
## probabilities: 0.681 0.319
## left son=10 (95 obs) right son=11 (134 obs)
## Primary splits:
## age < -0.4816543 to the left, improve=10.747750, (0 missing)
## mass < -0.958659 to the left, improve= 8.968932, (0 missing)
## pregnant < 0.6254129 to the left, improve= 5.101790, (0 missing)
## triceps < -0.6116801 to the left, improve= 4.588160, (0 missing)
## pedigree < -1.164945 to the left, improve= 2.777680, (0 missing)
## Surrogate splits:
## pregnant < -0.1140821 to the left, agree=0.777, adj=0.463, (0 split)
## pressure < -0.648699 to the left, agree=0.664, adj=0.189, (0 split)
## pedigree < -1.308669 to the left, agree=0.620, adj=0.084, (0 split)
## triceps < -1.271052 to the left, agree=0.607, adj=0.053, (0 split)
## insulin < 0.008187541 to the right, agree=0.607, adj=0.053, (0 split)
##
## Node number 6: 75 observations, complexity param=0.009328358
## predicted class=neg expected loss=0.32 P(node) =0.09765625
## class counts: 51 24
## probabilities: 0.680 0.320
## left son=12 (40 obs) right son=13 (35 obs)
## Primary splits:
## glucose < 0.6940911 to the left, improve=4.954286, (0 missing)
## age < -0.7307851 to the left, improve=2.946878, (0 missing)
## pregnant < -0.8715991 to the left, improve=2.346767, (0 missing)
## pressure < -0.8992403 to the left, improve=2.127119, (0 missing)
## insulin < 0.1675863 to the right, improve=2.127119, (0 missing)
## Surrogate splits:
## insulin < -0.1120576 to the left, agree=0.680, adj=0.314, (0 split)
## age < -1.155611 to the left, agree=0.640, adj=0.229, (0 split)
## mass < -0.555797 to the left, agree=0.627, adj=0.200, (0 split)
## pregnant < -0.1140821 to the left, agree=0.600, adj=0.143, (0 split)
## pedigree < -1.128585 to the right, agree=0.587, adj=0.114, (0 split)
##
## Node number 7: 210 observations, complexity param=0.01616915
## predicted class=pos expected loss=0.2761905 P(node) =0.2734375
## class counts: 58 152
## probabilities: 0.276 0.724
## left son=14 (118 obs) right son=15 (92 obs)
## Primary splits:
## glucose < 0.9987702 to the left, improve=6.956746, (0 missing)
## pedigree < -0.6188634 to the left, improve=3.647846, (0 missing)
## age < -1.007204 to the left, improve=2.750476, (0 missing)
## pregnant < 0.8090614 to the left, improve=2.176582, (0 missing)
## pressure < -0.9412698 to the right, improve=2.114189, (0 missing)
## Surrogate splits:
## insulin < 0.3671545 to the left, agree=0.681, adj=0.272, (0 split)
## pedigree < 0.591473 to the left, agree=0.590, adj=0.065, (0 split)
## age < 0.7660988 to the left, agree=0.590, adj=0.065, (0 split)
## triceps < -0.6352969 to the right, agree=0.586, adj=0.054, (0 split)
## mass < -0.2394526 to the right, agree=0.586, adj=0.054, (0 split)
##
## Node number 8: 228 observations
## predicted class=neg expected loss=0.04824561 P(node) =0.296875
## class counts: 217 11
## probabilities: 0.952 0.048
##
## Node number 9: 26 observations, complexity param=0.005597015
## predicted class=neg expected loss=0.3076923 P(node) =0.03385417
## class counts: 18 8
## probabilities: 0.692 0.308
## left son=18 (17 obs) right son=19 (9 obs)
## Primary splits:
## pregnant < -0.463193 to the left, improve=3.547511, (0 missing)
## mass < -0.004590906 to the left, improve=2.243590, (0 missing)
## age < -1.007204 to the left, improve=1.813765, (0 missing)
## insulin < -1.044182 to the left, improve=1.401923, (0 missing)
## pressure < 0.2081851 to the left, improve=1.332562, (0 missing)
## Surrogate splits:
## pressure < 0.04778651 to the left, agree=0.808, adj=0.444, (0 split)
## age < -0.2668235 to the left, agree=0.769, adj=0.333, (0 split)
## insulin < -0.6491159 to the left, agree=0.731, adj=0.222, (0 split)
##
## Node number 10: 95 observations, complexity param=0.001865672
## predicted class=neg expected loss=0.1368421 P(node) =0.1236979
## class counts: 82 13
## probabilities: 0.863 0.137
## left son=20 (51 obs) right son=21 (44 obs)
## Primary splits:
## triceps < -0.09488671 to the left, improve=2.0989680, (0 missing)
## mass < -0.1712321 to the left, improve=1.6362440, (0 missing)
## pedigree < -1.523283 to the right, improve=1.2862610, (0 missing)
## insulin < 0.1027434 to the right, improve=0.9487719, (0 missing)
## pressure < -0.168235 to the left, improve=0.7636337, (0 missing)
## Surrogate splits:
## mass < 0.09760599 to the left, agree=0.800, adj=0.568, (0 split)
## insulin < -0.3009875 to the left, agree=0.716, adj=0.386, (0 split)
## pressure < -0.08696023 to the left, agree=0.684, adj=0.318, (0 split)
## pedigree < 0.05315371 to the left, agree=0.653, adj=0.250, (0 split)
## glucose < -0.3532218 to the left, agree=0.621, adj=0.182, (0 split)
##
## Node number 11: 134 observations, complexity param=0.01679104
## predicted class=neg expected loss=0.4477612 P(node) =0.1744792
## class counts: 74 60
## probabilities: 0.552 0.448
## left son=22 (20 obs) right son=23 (114 obs)
## Primary splits:
## mass < -0.9759824 to the left, improve=9.426551, (0 missing)
## pedigree < -1.164945 to the left, improve=4.630224, (0 missing)
## age < 1.114937 to the right, improve=4.428181, (0 missing)
## insulin < -0.1830623 to the left, improve=2.690844, (0 missing)
## triceps < -0.6116801 to the left, improve=1.858953, (0 missing)
## Surrogate splits:
## triceps < -0.6116801 to the left, agree=0.866, adj=0.1, (0 split)
##
## Node number 12: 40 observations
## predicted class=neg expected loss=0.15 P(node) =0.05208333
## class counts: 34 6
## probabilities: 0.850 0.150
##
## Node number 13: 35 observations, complexity param=0.009328358
## predicted class=pos expected loss=0.4857143 P(node) =0.04557292
## class counts: 17 18
## probabilities: 0.486 0.514
## left son=26 (8 obs) right son=27 (27 obs)
## Primary splits:
## insulin < 0.1152267 to the right, improve=1.4486770, (0 missing)
## age < -0.4816543 to the left, improve=1.4486770, (0 missing)
## pressure < 0.1683357 to the right, improve=1.1958590, (0 missing)
## mass < -1.115579 to the right, improve=0.9657143, (0 missing)
## pregnant < -0.8715991 to the left, improve=0.9142857, (0 missing)
## Surrogate splits:
## pressure < -1.194316 to the left, agree=0.8, adj=0.125, (0 split)
## mass < -0.411094 to the right, agree=0.8, adj=0.125, (0 split)
##
## Node number 14: 118 observations, complexity param=0.01616915
## predicted class=pos expected loss=0.3898305 P(node) =0.1536458
## class counts: 46 72
## probabilities: 0.390 0.610
## left son=28 (50 obs) right son=29 (68 obs)
## Primary splits:
## age < -0.2599356 to the left, improve=3.913240, (0 missing)
## pedigree < -0.1659089 to the left, improve=3.119775, (0 missing)
## mass < 1.159416 to the left, improve=2.663740, (0 missing)
## pregnant < 0.8090614 to the left, improve=2.602148, (0 missing)
## insulin < 0.5847789 to the right, improve=2.127430, (0 missing)
## Surrogate splits:
## pregnant < 0.1759085 to the left, agree=0.805, adj=0.54, (0 split)
## pressure < 0.1683357 to the left, agree=0.695, adj=0.28, (0 split)
## triceps < -0.1174975 to the left, agree=0.669, adj=0.22, (0 split)
## insulin < 0.5409827 to the right, agree=0.627, adj=0.12, (0 split)
## mass < -0.1785516 to the left, agree=0.602, adj=0.06, (0 split)
##
## Node number 15: 92 observations
## predicted class=pos expected loss=0.1304348 P(node) =0.1197917
## class counts: 12 80
## probabilities: 0.130 0.870
##
## Node number 18: 17 observations
## predicted class=neg expected loss=0.1176471 P(node) =0.02213542
## class counts: 15 2
## probabilities: 0.882 0.118
##
## Node number 19: 9 observations
## predicted class=pos expected loss=0.3333333 P(node) =0.01171875
## class counts: 3 6
## probabilities: 0.333 0.667
##
## Node number 20: 51 observations
## predicted class=neg expected loss=0.03921569 P(node) =0.06640625
## class counts: 49 2
## probabilities: 0.961 0.039
##
## Node number 21: 44 observations, complexity param=0.001865672
## predicted class=neg expected loss=0.25 P(node) =0.05729167
## class counts: 33 11
## probabilities: 0.750 0.250
## left son=42 (27 obs) right son=43 (17 obs)
## Primary splits:
## insulin < -0.17934 to the right, improve=4.3257080, (0 missing)
## pregnant < -0.8715991 to the left, improve=1.1363640, (0 missing)
## pressure < -0.6072412 to the right, improve=0.9166667, (0 missing)
## mass < 0.2023862 to the right, improve=0.9166667, (0 missing)
## age < -0.8655906 to the right, improve=0.6195402, (0 missing)
## Surrogate splits:
## pedigree < -0.7389428 to the right, agree=0.727, adj=0.294, (0 split)
## mass < 0.4869275 to the right, agree=0.705, adj=0.235, (0 split)
## pregnant < 0.1759085 to the left, agree=0.682, adj=0.176, (0 split)
## pressure < 0.04778651 to the left, agree=0.659, adj=0.118, (0 split)
##
## Node number 22: 20 observations
## predicted class=neg expected loss=0 P(node) =0.02604167
## class counts: 20 0
## probabilities: 1.000 0.000
##
## Node number 23: 114 observations, complexity param=0.01679104
## predicted class=pos expected loss=0.4736842 P(node) =0.1484375
## class counts: 54 60
## probabilities: 0.474 0.526
## left son=46 (18 obs) right son=47 (96 obs)
## Primary splits:
## pedigree < -1.176257 to the left, improve=5.529605, (0 missing)
## age < 1.114937 to the right, improve=3.369089, (0 missing)
## triceps < -0.09148137 to the right, improve=2.842105, (0 missing)
## pressure < 0.9937638 to the right, improve=2.361868, (0 missing)
## mass < 0.1820187 to the right, improve=2.068031, (0 missing)
##
## Node number 26: 8 observations
## predicted class=neg expected loss=0.25 P(node) =0.01041667
## class counts: 6 2
## probabilities: 0.750 0.250
##
## Node number 27: 27 observations, complexity param=0.009328358
## predicted class=pos expected loss=0.4074074 P(node) =0.03515625
## class counts: 11 16
## probabilities: 0.407 0.593
## left son=54 (11 obs) right son=55 (16 obs)
## Primary splits:
## pressure < 0.2081851 to the right, improve=1.9461280, (0 missing)
## pregnant < 0.286815 to the left, improve=1.0703700, (0 missing)
## pedigree < -0.4926652 to the left, improve=0.9259259, (0 missing)
## glucose < 1.213834 to the left, improve=0.6734007, (0 missing)
## age < 0.9534364 to the right, improve=0.5084656, (0 missing)
## Surrogate splits:
## triceps < -0.5011203 to the right, agree=0.778, adj=0.455, (0 split)
## insulin < -0.2967346 to the left, agree=0.741, adj=0.364, (0 split)
## pregnant < 0.974253 to the right, agree=0.667, adj=0.182, (0 split)
## glucose < 0.8995504 to the left, agree=0.667, adj=0.182, (0 split)
## mass < -1.55739 to the left, agree=0.630, adj=0.091, (0 split)
##
## Node number 28: 50 observations, complexity param=0.01616915
## predicted class=neg expected loss=0.46 P(node) =0.06510417
## class counts: 27 23
## probabilities: 0.540 0.460
## left son=56 (25 obs) right son=57 (25 obs)
## Primary splits:
## pressure < 0.0594595 to the right, improve=4.840000, (0 missing)
## insulin < 0.5847789 to the right, improve=2.717193, (0 missing)
## mass < 1.176336 to the left, improve=2.014825, (0 missing)
## pregnant < -1.336294 to the right, improve=1.284444, (0 missing)
## glucose < 0.6140985 to the right, improve=1.274635, (0 missing)
## Surrogate splits:
## insulin < 0.2970999 to the right, agree=0.66, adj=0.32, (0 split)
## mass < 0.587874 to the right, agree=0.66, adj=0.32, (0 split)
## glucose < 0.5597943 to the right, agree=0.64, adj=0.28, (0 split)
## triceps < 0.4408404 to the right, agree=0.62, adj=0.24, (0 split)
## pregnant < -0.8715991 to the left, agree=0.60, adj=0.20, (0 split)
##
## Node number 29: 68 observations, complexity param=0.005597015
## predicted class=pos expected loss=0.2794118 P(node) =0.08854167
## class counts: 19 49
## probabilities: 0.279 0.721
## left son=58 (29 obs) right son=59 (39 obs)
## Primary splits:
## pedigree < -0.1368963 to the left, improve=2.8836790, (0 missing)
## mass < 1.062265 to the left, improve=1.5252100, (0 missing)
## age < 0.02097219 to the right, improve=1.4156860, (0 missing)
## pressure < 1.455499 to the left, improve=1.2184190, (0 missing)
## triceps < 0.234774 to the left, improve=0.6778075, (0 missing)
## Surrogate splits:
## triceps < 0.4077414 to the left, agree=0.647, adj=0.172, (0 split)
## mass < 1.485691 to the right, agree=0.647, adj=0.172, (0 split)
## glucose < 0.6500342 to the left, agree=0.603, adj=0.069, (0 split)
## pressure < 2.138253 to the right, agree=0.603, adj=0.069, (0 split)
## insulin < -0.3114232 to the left, agree=0.603, adj=0.069, (0 split)
##
## Node number 42: 27 observations
## predicted class=neg expected loss=0.07407407 P(node) =0.03515625
## class counts: 25 2
## probabilities: 0.926 0.074
##
## Node number 43: 17 observations
## predicted class=pos expected loss=0.4705882 P(node) =0.02213542
## class counts: 8 9
## probabilities: 0.471 0.529
##
## Node number 46: 18 observations
## predicted class=neg expected loss=0.1666667 P(node) =0.0234375
## class counts: 15 3
## probabilities: 0.833 0.167
##
## Node number 47: 96 observations, complexity param=0.01679104
## predicted class=pos expected loss=0.40625 P(node) =0.125
## class counts: 39 57
## probabilities: 0.406 0.594
## left son=94 (8 obs) right son=95 (88 obs)
## Primary splits:
## age < 1.114937 to the right, improve=3.835227, (0 missing)
## triceps < -0.09148137 to the right, improve=2.981483, (0 missing)
## insulin < -0.3188784 to the left, improve=2.077335, (0 missing)
## mass < 0.1820187 to the right, improve=1.862827, (0 missing)
## pedigree < 0.2368907 to the left, improve=1.600556, (0 missing)
##
## Node number 54: 11 observations
## predicted class=neg expected loss=0.3636364 P(node) =0.01432292
## class counts: 7 4
## probabilities: 0.636 0.364
##
## Node number 55: 16 observations
## predicted class=pos expected loss=0.25 P(node) =0.02083333
## class counts: 4 12
## probabilities: 0.250 0.750
##
## Node number 56: 25 observations, complexity param=0.007462687
## predicted class=neg expected loss=0.24 P(node) =0.03255208
## class counts: 19 6
## probabilities: 0.760 0.240
## left son=112 (17 obs) right son=113 (8 obs)
## Primary splits:
## mass < 1.176336 to the left, improve=3.487647, (0 missing)
## pressure < 1.032542 to the left, improve=2.135873, (0 missing)
## triceps < 0.3041504 to the left, improve=1.620000, (0 missing)
## pedigree < -0.3451885 to the right, improve=1.440513, (0 missing)
## insulin < 0.5847789 to the right, improve=1.355294, (0 missing)
## Surrogate splits:
## pressure < 0.9550774 to the left, agree=0.76, adj=0.250, (0 split)
## triceps < 0.9852931 to the left, agree=0.76, adj=0.250, (0 split)
## insulin < -0.1169469 to the right, agree=0.76, adj=0.250, (0 split)
## age < -1.3108 to the right, agree=0.72, adj=0.125, (0 split)
##
## Node number 57: 25 observations, complexity param=0.003731343
## predicted class=pos expected loss=0.32 P(node) =0.03255208
## class counts: 8 17
## probabilities: 0.320 0.680
## left son=114 (7 obs) right son=115 (18 obs)
## Primary splits:
## pedigree < -0.8401298 to the left, improve=1.2292060, (0 missing)
## pregnant < -1.336294 to the right, improve=1.2272220, (0 missing)
## age < -1.007204 to the left, improve=1.0851280, (0 missing)
## triceps < -0.1668723 to the right, improve=0.7501299, (0 missing)
## mass < 0.02762902 to the right, improve=0.6101587, (0 missing)
## Surrogate splits:
## glucose < 0.7333992 to the right, agree=0.84, adj=0.429, (0 split)
## insulin < 0.6317337 to the right, agree=0.80, adj=0.286, (0 split)
##
## Node number 58: 29 observations, complexity param=0.005597015
## predicted class=pos expected loss=0.4482759 P(node) =0.03776042
## class counts: 13 16
## probabilities: 0.448 0.552
## left son=116 (21 obs) right son=117 (8 obs)
## Primary splits:
## mass < 0.8450892 to the left, improve=2.3091130, (0 missing)
## triceps < 0.2403171 to the left, improve=2.0495890, (0 missing)
## pressure < 0.5498793 to the left, improve=1.3159810, (0 missing)
## pedigree < -1.051749 to the left, improve=0.6900657, (0 missing)
## insulin < 0.3133556 to the left, improve=0.4876847, (0 missing)
## Surrogate splits:
## triceps < 0.5360359 to the left, agree=0.862, adj=0.50, (0 split)
## insulin < 0.4690746 to the left, agree=0.793, adj=0.25, (0 split)
##
## Node number 59: 39 observations
## predicted class=pos expected loss=0.1538462 P(node) =0.05078125
## class counts: 6 33
## probabilities: 0.154 0.846
##
## Node number 94: 8 observations
## predicted class=neg expected loss=0.125 P(node) =0.01041667
## class counts: 7 1
## probabilities: 0.875 0.125
##
## Node number 95: 88 observations, complexity param=0.009328358
## predicted class=pos expected loss=0.3636364 P(node) =0.1145833
## class counts: 32 56
## probabilities: 0.364 0.636
## left son=190 (61 obs) right son=191 (27 obs)
## Primary splits:
## triceps < -0.06454431 to the right, improve=3.617376, (0 missing)
## mass < -0.3794478 to the right, improve=2.362567, (0 missing)
## pregnant < 0.6254129 to the left, improve=1.825312, (0 missing)
## pedigree < -0.8348698 to the right, improve=1.654067, (0 missing)
## insulin < -0.3225317 to the left, improve=1.556704, (0 missing)
## Surrogate splits:
## mass < -0.3794478 to the right, agree=0.830, adj=0.444, (0 split)
## glucose < -0.9768363 to the right, agree=0.716, adj=0.074, (0 split)
## insulin < -0.5023809 to the right, agree=0.716, adj=0.074, (0 split)
## pressure < -0.7737986 to the right, agree=0.705, adj=0.037, (0 split)
##
## Node number 112: 17 observations
## predicted class=neg expected loss=0.05882353 P(node) =0.02213542
## class counts: 16 1
## probabilities: 0.941 0.059
##
## Node number 113: 8 observations
## predicted class=pos expected loss=0.375 P(node) =0.01041667
## class counts: 3 5
## probabilities: 0.375 0.625
##
## Node number 114: 7 observations
## predicted class=neg expected loss=0.4285714 P(node) =0.009114583
## class counts: 4 3
## probabilities: 0.571 0.429
##
## Node number 115: 18 observations
## predicted class=pos expected loss=0.2222222 P(node) =0.0234375
## class counts: 4 14
## probabilities: 0.222 0.778
##
## Node number 116: 21 observations, complexity param=0.005597015
## predicted class=neg expected loss=0.4285714 P(node) =0.02734375
## class counts: 12 9
## probabilities: 0.571 0.429
## left son=232 (11 obs) right son=233 (10 obs)
## Primary splits:
## age < 0.5616431 to the left, improve=1.1220780, (0 missing)
## pressure < 0.6823251 to the left, improve=0.9972527, (0 missing)
## pedigree < -0.8830097 to the right, improve=0.6311688, (0 missing)
## glucose < 0.4486419 to the right, improve=0.5079365, (0 missing)
## triceps < 0.1249368 to the left, improve=0.5079365, (0 missing)
## Surrogate splits:
## pedigree < -0.960914 to the right, agree=0.857, adj=0.7, (0 split)
## pressure < 0.3673065 to the left, agree=0.714, adj=0.4, (0 split)
## triceps < 0.1249368 to the right, agree=0.714, adj=0.4, (0 split)
## pregnant < 0.4174951 to the right, agree=0.667, adj=0.3, (0 split)
## glucose < 0.5733696 to the left, agree=0.667, adj=0.3, (0 split)
##
## Node number 117: 8 observations
## predicted class=pos expected loss=0.125 P(node) =0.01041667
## class counts: 1 7
## probabilities: 0.125 0.875
##
## Node number 190: 61 observations, complexity param=0.009328358
## predicted class=pos expected loss=0.4590164 P(node) =0.07942708
## class counts: 28 33
## probabilities: 0.459 0.541
## left son=380 (35 obs) right son=381 (26 obs)
## Primary splits:
## pregnant < 0.6254129 to the left, improve=2.075302, (0 missing)
## pedigree < 0.2368907 to the left, improve=1.766478, (0 missing)
## insulin < -0.3225317 to the left, improve=1.666740, (0 missing)
## mass < 1.321788 to the left, improve=1.580796, (0 missing)
## glucose < -0.5017753 to the left, improve=1.181387, (0 missing)
## Surrogate splits:
## age < 0.5372991 to the left, agree=0.672, adj=0.231, (0 split)
## insulin < -0.3992971 to the right, agree=0.656, adj=0.192, (0 split)
## mass < 0.1449565 to the right, agree=0.656, adj=0.192, (0 split)
## triceps < -0.01614765 to the right, agree=0.639, adj=0.154, (0 split)
## glucose < 0.05869105 to the left, agree=0.607, adj=0.077, (0 split)
##
## Node number 191: 27 observations
## predicted class=pos expected loss=0.1481481 P(node) =0.03515625
## class counts: 4 23
## probabilities: 0.148 0.852
##
## Node number 232: 11 observations
## predicted class=neg expected loss=0.2727273 P(node) =0.01432292
## class counts: 8 3
## probabilities: 0.727 0.273
##
## Node number 233: 10 observations
## predicted class=pos expected loss=0.4 P(node) =0.01302083
## class counts: 4 6
## probabilities: 0.400 0.600
##
## Node number 380: 35 observations, complexity param=0.009328358
## predicted class=neg expected loss=0.4285714 P(node) =0.04557292
## class counts: 20 15
## probabilities: 0.571 0.429
## left son=760 (9 obs) right son=761 (26 obs)
## Primary splits:
## insulin < -0.3082836 to the left, improve=2.4420020, (0 missing)
## glucose < -0.4830425 to the left, improve=1.4628570, (0 missing)
## pregnant < -0.8715991 to the right, improve=1.4435560, (0 missing)
## pedigree < -0.8399107 to the right, improve=1.4285710, (0 missing)
## triceps < 0.6195738 to the left, improve=0.5761905, (0 missing)
## Surrogate splits:
## mass < -0.1800426 to the left, agree=0.857, adj=0.444, (0 split)
## glucose < 0.0905579 to the right, agree=0.800, adj=0.222, (0 split)
##
## Node number 381: 26 observations
## predicted class=pos expected loss=0.3076923 P(node) =0.03385417
## class counts: 8 18
## probabilities: 0.308 0.692
##
## Node number 760: 9 observations
## predicted class=neg expected loss=0.1111111 P(node) =0.01171875
## class counts: 8 1
## probabilities: 0.889 0.111
##
## Node number 761: 26 observations, complexity param=0.009328358
## predicted class=pos expected loss=0.4615385 P(node) =0.03385417
## class counts: 12 14
## probabilities: 0.462 0.538
## left son=1522 (15 obs) right son=1523 (11 obs)
## Primary splits:
## triceps < 0.6087562 to the left, improve=1.3594410, (0 missing)
## age < 0.5871544 to the right, improve=1.2238290, (0 missing)
## pedigree < -0.6222004 to the right, improve=1.0341880, (0 missing)
## pregnant < -0.8715991 to the right, improve=0.7326007, (0 missing)
## glucose < -0.3532218 to the left, improve=0.6230769, (0 missing)
## Surrogate splits:
## insulin < 0.253726 to the left, agree=0.808, adj=0.545, (0 split)
## mass < 0.6008499 to the left, agree=0.808, adj=0.545, (0 split)
## pedigree < 0.520139 to the left, agree=0.692, adj=0.273, (0 split)
## pressure < 0.3661899 to the left, agree=0.654, adj=0.182, (0 split)
## age < -0.3673601 to the right, agree=0.615, adj=0.091, (0 split)
##
## Node number 1522: 15 observations
## predicted class=neg expected loss=0.4 P(node) =0.01953125
## class counts: 9 6
## probabilities: 0.600 0.400
##
## Node number 1523: 11 observations
## predicted class=pos expected loss=0.2727273 P(node) =0.01432292
## class counts: 3 8
## probabilities: 0.273 0.727
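The per-node fields in the summary above can be recomputed from the class counts alone. The sketch below does this for node 113 (3 neg, 5 pos), assuming the default 0/1 loss and no custom priors, which matches the `rpart()` call shown in the output. `node_stats` is a hypothetical helper written for illustration, not part of rpart.

```r
# Hypothetical helper: recompute the summary fields of one node from its class counts.
# Assumes 0/1 loss and default priors; n_total is the training-set size (768 here).
node_stats <- function(counts, n_total) {
  n_node <- sum(counts)
  probs <- counts / n_node
  list(
    predicted = names(counts)[which.max(counts)],  # majority class
    expected_loss = 1 - max(probs),                # share of minority-class cases
    p_node = n_node / n_total,                     # share of all observations in the node
    probabilities = round(probs, 3)
  )
}

node113 <- node_stats(c(neg = 3, pos = 5), n_total = 768)
node113$predicted      # "pos"
node113$expected_loss  # 0.375
node113$p_node         # 0.01041667
```

Under these assumptions, "expected loss" is simply the misclassification rate inside the node, and "P(node)" is the fraction of the training set that reaches it.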
# method 1
plot(rpartModel)
text(rpartModel)
# method 2
rpart.plot::rpart.plot(rpartModel)
# method 3
rattle::fancyRpartPlot(rpartModel)
# method 4
plot(partykit::as.party(rpartModel))
# Displays CP table for Fitted Rpart Object
printcp(rpartModel)
##
## Classification tree:
## rpart(formula = class ~ ., data = procdata, method = "class",
## control = rpart.control(cp = 0))
##
## Variables actually used in tree construction:
## [1] age glucose insulin mass pedigree pregnant pressure triceps
##
## Root node error: 268/768 = 0.34896
##
## n= 768
##
## CP nsplit rel error xerror xstd
## 1 0.2500000 0 1.00000 1.00000 0.049288
## 2 0.1007463 1 0.75000 0.82090 0.046750
## 3 0.0167910 2 0.64925 0.80970 0.046558
## 4 0.0161692 7 0.55970 0.74254 0.045307
## 5 0.0093284 10 0.51119 0.73881 0.045233
## 6 0.0074627 17 0.44403 0.72015 0.044854
## 7 0.0055970 18 0.43657 0.73507 0.045158
## 8 0.0037313 23 0.40672 0.70896 0.044620
## 9 0.0018657 24 0.40299 0.71642 0.044776
## 10 0.0000000 26 0.39925 0.72761 0.045007
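The `rel error` and `xerror` columns are scaled so that the root node has error 1. Multiplying by the root node error converts them back to absolute misclassification rates. A small worked example using the last row of the table above (values copied by hand from the output):

```r
# Values copied from the printcp() output above
root_error <- 268 / 768   # root node error
rel_error  <- 0.39925     # full tree (26 splits), error on the training data
xerror     <- 0.72761     # full tree, cross-validated error

abs_train_error <- rel_error * root_error
abs_cv_error    <- xerror * root_error
round(c(train = abs_train_error, cv = abs_cv_error), 4)
# train 0.1393, cv 0.2539
```

The gap between the two figures is a sign that the unpruned tree (cp = 0) fits the training data much better than it generalizes.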
# Plot a Complexity Parameter Table for an Rpart Fit
plotcp(rpartModel)
cptable <- as.data.frame(rpartModel$cptable)
cptable$errsd <- cptable$xerror + cptable$xstd
cpvalue <- cptable[which.min(cptable$errsd), "CP"]
pruneModel <- prune(rpartModel, cpvalue)
rattle::fancyRpartPlot(pruneModel)
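The pruning code above picks the row that minimizes `xerror + xstd`. A commonly used alternative is the one-standard-error (1-SE) rule, which corresponds to the horizontal line drawn by `plotcp()`: take the smallest tree whose `xerror` is within one `xstd` of the minimum. The sketch below compares the two choices on the CP table printed earlier. The numbers are copied by hand from that output, and the variable names are illustrative.

```r
# CP table copied from the printcp() output above
cptable <- data.frame(
  CP     = c(0.2500000, 0.1007463, 0.0167910, 0.0161692, 0.0093284,
             0.0074627, 0.0055970, 0.0037313, 0.0018657, 0.0000000),
  nsplit = c(0, 1, 2, 7, 10, 17, 18, 23, 24, 26),
  xerror = c(1.00000, 0.82090, 0.80970, 0.74254, 0.73881,
             0.72015, 0.73507, 0.70896, 0.71642, 0.72761),
  xstd   = c(0.049288, 0.046750, 0.046558, 0.045307, 0.045233,
             0.044854, 0.045158, 0.044620, 0.044776, 0.045007)
)

# Choice made in the text: minimize xerror + xstd
author_row <- which.min(cptable$xerror + cptable$xstd)

# 1-SE rule: first (simplest) row whose xerror is below min(xerror) + its xstd
best_row  <- which.min(cptable$xerror)
threshold <- cptable$xerror[best_row] + cptable$xstd[best_row]
one_se_row <- which(cptable$xerror <= threshold)[1]

cptable[c(author_row, one_se_row), c("CP", "nsplit", "xerror")]
```

On this table the first choice selects row 8 (23 splits), while the 1-SE rule selects row 4 (7 splits), a much smaller tree whose cross-validated error is statistically indistinguishable from the minimum.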
pre <- predict(pruneModel, newdata = procdata, type = "class")
## Confusion matrix
preTable <- table(pre, procdata$class)
preTable
##
## pre neg pos
## neg 444 53
## pos 56 215
## Classification accuracy
accuracy <- sum(diag(preTable)) / sum(preTable)
accuracy
## [1] 0.8580729
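## Sensitivity, taking neg (the first factor level) as the positive class
preTable[1, 1] / sum(preTable[, 1])
## [1] 0.888
## Specificity under that convention: share of actual pos cases predicted as pos
preTable[2, 2] / sum(preTable[, 2])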
## [1] 0.8022388
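If `pos` is treated as the class of interest instead, the same confusion matrix yields the metrics below. This is a sketch that rebuilds the matrix by hand from the printed output; the variable names are illustrative. Note that these figures are computed on the training data itself, so they are optimistic estimates.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual)
cm <- matrix(c(444, 56, 53, 215), nrow = 2,
             dimnames = list(pre = c("neg", "pos"), actual = c("neg", "pos")))

tp <- cm["pos", "pos"]  # predicted pos, actually pos
fn <- cm["neg", "pos"]  # predicted neg, actually pos
fp <- cm["pos", "neg"]  # predicted pos, actually neg
tn <- cm["neg", "neg"]  # predicted neg, actually neg

sensitivity <- tp / (tp + fn)   # recall for pos
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, f1 = f1), 4)
```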
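Method 1: hand-rolled 10-fold CV
# Randomly assign each row to one of 10 folds (fold sizes are only roughly equal)
index <- sample(1:10, nrow(procdata), replace = TRUE)
# One 2 x 2 confusion matrix per fold
res <- array(0, dim = c(2, 2, 10))
n <- ncol(procdata)  # the response (class) is assumed to be the last column
for(i in 1:10) {
  train <- procdata[index != i, ]
  test <- procdata[index == i, ]
  model <- rpart(class ~., data = train, method = "class", control = rpart.control(cp = 0.1))
  pre <- predict(model, test[, -n], type = "class")
  res[, , i] <- as.matrix(table(pre, test[, n]))
}
# Pool the fold-wise confusion matrices and compute the overall accuracy
cvTable <- apply(res, MARGIN = c(1, 2), sum)
sum(diag(cvTable)) / sum(cvTable)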
## [1] 0.7265625
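Drawing fold labels with `sample(1:10, ..., replace = TRUE)` gives folds of unequal size. One possible variation, shown here as a sketch, is to shuffle a repeated sequence of fold labels so that every fold has nearly the same number of rows. The seed and the names `n_obs` and `balanced_index` are illustrative; 768 is the sample size reported by `printcp()`.

```r
set.seed(1)    # arbitrary seed, only to make the sketch reproducible
n_obs <- 768   # number of rows in procdata, per the printcp() output
k <- 10

# Repeat the labels 1..k until n_obs labels exist, then shuffle them
balanced_index <- sample(rep(seq_len(k), length.out = n_obs))
fold_sizes <- table(balanced_index)
fold_sizes     # every fold has 76 or 77 rows
```

`balanced_index` can be used in place of `index` in the loop without any other change.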
Method 2: repeated 10-fold CV with the caret package
library(caret)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Candidate values of the complexity parameter cp
tunedf <- data.frame(cp = seq(0.001, 0.1, length.out = 10))
treemodel <- train(x = procdata[, -9], y = procdata[, 9], method = 'rpart', trControl = fitControl, tuneGrid = tunedf)
plot(treemodel)
pre <- predict(treemodel, newdata = procdata[, -9])
training_two$family_size <- train$SibSp + train$Parch + 1
library(rpart)
rpart.model = rpart(Survived ~ .,
                    data = training,
                    method = "class",
                    control = rpart.control(minsplit = 50, cp = 0.01))
summary(rpart.model)
## Visualizing the tree
### method 1
library(rpart)
plot(rpart.model, compress = TRUE)
text(rpart.model, use.n = TRUE)
### method 2
library(rattle)
fancyRpartPlot(rpart.model)
### method 3
library(partykit)
rpart1a = as.party(rpart.model)
plot(rpart1a)
rpart.predict = predict(rpart.model, testing, type = "class")
## table(rpart.predict, testing$Survived)
## rpart.accuracy = mean(rpart.predict == testing$Survived)
## rpart.accuracy
confusionMatrix(rpart.predict, testing$Survived)
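# Create the ROC curve
library(pROC)
# Assumed definition of the score: predicted probability of the second class level
rpart.response = predict(rpart.model, testing, type = "prob")[, 2]
rpartROC = roc(testing$Survived, rpart.response,
               levels = levels(as.factor(testing$Survived)))
plot(rpartROC, type = "S", print.thres = 0.5)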
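The area under an ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it with the rank-based (Mann-Whitney) formula in base R, on made-up toy scores rather than the Titanic data. `auc_rank` is a hypothetical helper, not a pROC function.

```r
# Hypothetical helper: AUC from 0/1 labels and numeric scores via the rank-sum formula
auc_rank <- function(labels, scores) {
  pos <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores)  # average ranks, so tied scores count as half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(toy_labels, toy_scores)  # 0.75
```

Of the four positive-negative pairs in the toy data, three are ordered correctly, hence 0.75.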
## Solution
my.prediction = predict(rpart.model, newdata = test, type = "class")
my_solution = data.frame(PassengerId = test$PassengerId, Survived = my.prediction)
nrow(my_solution)
write.csv(my_solution, file = "F:/Rworkd/my_solution/my_solution.csv",
row.names = FALSE)
###########################################################################
# twoClassSummary computes ROC from class probabilities, so classProbs must be TRUE
cvCtrl = trainControl(method = "repeatedcv", repeats = 3,
                      summaryFunction = twoClassSummary,
                      classProbs = TRUE)
set.seed(1)
rpartTune = train(Survived ~ .,
data = training,
method = "rpart",
tuneLength = 30,
metric = "ROC",
trControl = cvCtrl)
rpartTune
plot(rpartTune, scales = list(x = list(log = 10)))
rpartPred = predict(rpartTune, testing)
confusionMatrix(rpartPred, testing$Survived)
rpartProbs = predict(rpartTune, testing, type = "prob")
head(rpartProbs)
## Create the ROC curve
library(pROC)
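# Score: predicted probability of the second class level (assumed to be the positive class)
rpartROC = roc(testing$Survived, rpartProbs[, 2],
               levels = levels(as.factor(testing$Survived)))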
rpartROC
plot(rpartROC, type = "S", print.thres = 0.5)