The Wisconsin Breast Cancer dataset from the previous lab is reused here to evaluate Decision Trees as a predictive algorithm. The data loading steps are exactly the same as in the previous lab session and were skipped for brevity.
To conduct stratified sampling, the sample.split function from the caTools package was used. Note that this functionality is similar to that of caret’s createDataPartition function.
library(caTools)
# Stratified split using caTools' sample.split function
set.seed(123)
split = sample.split(breastdata$Class, SplitRatio = 0.7)
# Create training and testing sets
dataTrain = subset(breastdata, split == TRUE)
dataTest = subset(breastdata, split == FALSE)
prop.table(table(dataTrain$Class))
##
## benign malignant
## 0.655102 0.344898
prop.table(table(dataTest$Class))
##
## benign malignant
## 0.6555024 0.3444976
Both sets preserve the class balance of the target variable: roughly 65.5% benign and 34.5% malignant in each.
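To make the idea of stratified sampling concrete, here is a minimal base-R sketch that draws about 70% of the rows within each class separately. It only illustrates the concept and is not how caTools implements sample.split. The helper name stratified_split and the toy label vector are assumptions made for this example; the class counts (458 benign, 241 malignant) are the totals implied by the proportions printed above.

```r
# Hypothetical helper (not part of caTools): TRUE marks a training row.
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (idx in split(seq_along(y), y)) {        # row indices grouped by class
    n_take <- round(ratio * length(idx))       # rows to draw from this class
    chosen <- idx[sample.int(length(idx), n_take)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(123)
# Toy labels standing in for breastdata$Class (assumed, not the real data)
y <- factor(rep(c("benign", "malignant"), times = c(458, 241)))
in_train <- stratified_split(y, ratio = 0.7)

prop.table(table(y[in_train]))    # class proportions in the training part
prop.table(table(y[!in_train]))   # class proportions in the testing part
```

Because the draw happens per class, both parts end up with nearly the same class proportions, which is the property checked with prop.table above.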
The training data was then fitted with rpart Decision Tree models of progressively increasing complexity.
As per the rpart docs, the data is first split on the single variable that ‘best’ separates it into two groups. The process is then repeated on each resulting subgroup until the subgroups reach a minimum size or no further improvement can be made.
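What ‘best’ means depends on the splitting criterion; for classification trees rpart uses the Gini index unless told otherwise. As a rough illustration (not rpart's internal code), the sketch below scores the root split reported by print(tree1) further down, using the class counts from that output. The function names gini and gini_gain are made up for the example.

```r
# Gini impurity of a node, given its vector of class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Impurity reduction achieved by splitting `parent` into `left` and `right`
gini_gain <- function(parent, left, right) {
  n <- sum(parent)
  gini(parent) -
    (sum(left) / n) * gini(left) -
    (sum(right) / n) * gini(right)
}

# Class counts (benign, malignant) read off the print(tree1) output
root  <- c(benign = 321, malignant = 169)
left  <- c(benign = 307, malignant = 20)    # UniformityCellShape <  3.5
right <- c(benign = 14,  malignant = 149)   # UniformityCellShape >= 3.5

gini(root)                     # impurity before the split
gini_gain(root, left, right)   # reduction in impurity from the split
```

A tree builder evaluates a score of this kind for many candidate splits and keeps the one with the largest reduction.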
For tree1, the arguments were:
library(rpart)
tree1 <- rpart(Class ~ ., data = dataTrain, method = "class")
print(tree1)
## n= 490
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 490 169 benign (0.65510204 0.34489796)
##    2) UniformityCellShape< 3.5 327 20 benign (0.93883792 0.06116208)
##      4) BareNuclei< 6 313 8 benign (0.97444089 0.02555911)
##        8) NormalNucleoli< 3.5 303 2 benign (0.99339934 0.00660066) *
##        9) NormalNucleoli>=3.5 10 4 malignant (0.40000000 0.60000000) *
##      5) BareNuclei>=6 14 2 malignant (0.14285714 0.85714286) *
##    3) UniformityCellShape>=3.5 163 14 malignant (0.08588957 0.91411043)
##      6) UniformityCellSize< 4.5 40 11 malignant (0.27500000 0.72500000)
##       12) BareNuclei< 7.5 22 11 benign (0.50000000 0.50000000)
##         24) UniformityCellShape< 4.5 11 2 benign (0.81818182 0.18181818) *
##         25) UniformityCellShape>=4.5 11 2 malignant (0.18181818 0.81818182) *
##       13) BareNuclei>=7.5 18 0 malignant (0.00000000 1.00000000) *
##      7) UniformityCellSize>=4.5 123 3 malignant (0.02439024 0.97560976) *
library(rpart.plot)
prp(tree1)
From the outputs of the print and basic prp functions, it was observed that the Decision Tree used only 4 of the 9 possible predictor variables and that UniformityCellShape was the first variable to be split on. Trying out other variations:
prp(tree1, type = 4)
rpart.plot(tree1, extra = 104, nn = TRUE)
The second plot is quite self-explanatory:
Passing extra arguments to the prp function adds further detail to the plotted tree. Here, type = 4 draws separate split labels for the left and right branches below each node and labels all nodes, not just the leaves.
The rpart.plot output is more illustrative. extra = 104 shows the per-class probabilities in each node (which sum to 1) together with the percentage of all observations that fall in that node. nn = TRUE displays the node numbers.
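The numbers that extra = 104 places in each node can also be derived from the fitted object itself. The sketch below is only an illustration under stated assumptions: it fits a tree to the kyphosis data that ships with rpart (a stand-in, not the lab's breastdata) and computes the per-class probabilities and the percentage of observations per node from the tree's frame.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart (not the lab's breastdata)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

classes <- attr(fit, "ylevels")                  # class labels of the response
k       <- length(classes)

# In yval2, column 1 is the fitted class and the next k columns are class counts
counts <- fit$frame$yval2[, 1 + seq_len(k), drop = FALSE]
colnames(counts) <- classes

probs <- counts / rowSums(counts)                # per-class probability in node
pct   <- 100 * fit$frame$n / fit$frame$n[1]      # % of all observations in node

data.frame(node = rownames(fit$frame), round(probs, 2), pct = round(pct))
```

Each row corresponds to one node number, so the table can be compared against the labels shown when nn = TRUE is set.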
rpart.plot(tree1, extra = 106, nn = TRUE, fallen.leaves = TRUE, shadow.col = "darkgray")
Other arguments change the look of the plot. Here, extra = 106 displays only the probability of the second class (malignant) in each node, together with the percentage of observations in that node. Note the very ’90s WordArt-style shadows on each node.
plotcp(tree1)
printcp(tree1)
##
## Classification tree:
## rpart(formula = Class ~ ., data = dataTrain, method = "class")
##
## Variables actually used in tree construction:
## [1] BareNuclei NormalNucleoli UniformityCellShape
## [4] UniformityCellSize
##
## Root node error: 169/490 = 0.3449
##
## n= 490
##
## CP nsplit rel error xerror xstd
## 1 0.798817 0 1.000000 1.00000 0.062260
## 2 0.059172 1 0.201183 0.26627 0.037827
## 3 0.013807 2 0.142012 0.19527 0.032827
## 4 0.011834 5 0.100592 0.21302 0.034174
## 5 0.010000 6 0.088757 0.21302 0.034174
Per rpart’s docs:
“if any split does not increase the overall R² of the model by at least cp (where R² is the usual linear-models definition) then that split is decreed to be, a priori, not worth pursuing.”
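One common way to act on the cp table is to prune the tree back to the cp value with the lowest cross-validated error (xerror); in the table above that is row 3 (xerror 0.19527, 2 splits). The following is a sketch of that convention rather than a step taken in this lab, and it uses rpart's bundled kyphosis data as a stand-in for breastdata.

```r
library(rpart)

set.seed(123)  # xerror comes from cross-validation, so fix the seed
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

cp_table <- fit$cptable
best_row <- which.min(cp_table[, "xerror"])   # row with the lowest CV error
best_cp  <- cp_table[best_row, "CP"]

pruned <- prune(fit, cp = best_cp)            # cut the tree back to that cp
printcp(pruned)
```

Since xerror depends on the random cross-validation folds, the selected row can change between runs unless the seed is fixed.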
A work by Arnold Cheong
tp057228@mail.apu.edu.my