Introduction

The Wisconsin Breast Cancer dataset from the previous lab is reused to evaluate Decision Trees as a predictive algorithm. The data loading steps are exactly the same as the previous lab session’s and are skipped for brevity.
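
For reference, a minimal loading sketch is included below. The UCI source file and the column names are assumptions inferred from the variables that appear later in this lab; the authoritative steps are in the previous write-up.

# A sketch of the loading step -- column names assumed from the previous lab
dataURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
breastdata <- read.csv(dataURL, header = FALSE, na.strings = "?")
names(breastdata) <- c("Id", "ClumpThickness", "UniformityCellSize",
                       "UniformityCellShape", "MarginalAdhesion",
                       "EpithelialCellSize", "BareNuclei", "BlandChromatin",
                       "NormalNucleoli", "Mitoses", "Class")
# Recode the 2/4 class codes into labelled factor levels
breastdata$Class <- factor(breastdata$Class, levels = c(2, 4),
                           labels = c("benign", "malignant"))
breastdata$Id <- NULL  # the sample id carries no predictive information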


Splitting the data

To conduct stratified sampling, the sample.split method from the caTools package was used. Note that this functionality is similar to that of caret’s createDataPartition function (a comparison sketch follows the output below).

# Using caTools' sample.split method
library(caTools)
set.seed(123)
split <- sample.split(breastdata$Class, SplitRatio = 0.7)
# Create training and testing sets
dataTrain <- subset(breastdata, split == TRUE)
dataTest <- subset(breastdata, split == FALSE)

prop.table(table(dataTrain$Class))
## 
##    benign malignant 
##  0.655102  0.344898
prop.table(table(dataTest$Class))
## 
##    benign malignant 
## 0.6555024 0.3444976

Both datasets have similarly distributed target variables.
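
For comparison, the same stratified 70/30 split can be drawn with caret’s createDataPartition. This is a sketch rather than part of the lab: the selected rows will differ from sample.split’s, and the dataTrainCaret/dataTestCaret names are hypothetical.

library(caret)
set.seed(123)
# createDataPartition samples within each level of the target factor,
# so the class proportions are preserved, as with sample.split
trainIndex <- createDataPartition(breastdata$Class, p = 0.7, list = FALSE)
dataTrainCaret <- breastdata[trainIndex, ]
dataTestCaret  <- breastdata[-trainIndex, ]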


Running the trees

The training data was fed into rpart Decision Tree models of progressively increasing complexity.
As per the rpart docs, the data is first split on the single variable found to ‘best’ separate the two classes. The process is then repeated recursively on each resulting subgroup until no further improvement can be made.

For tree1, the arguments were:

  1. The formula Class ~ ., i.e. the target variable Class evaluated against all the other variables in the dataTrain dataset
  2. method = "class", which makes the model a classifier; note that the response Class is a factor
library(rpart)
tree1 <- rpart(Class ~ ., data = dataTrain, method = "class")
print(tree1)
## n= 490 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 490 169 benign (0.65510204 0.34489796)  
##    2) UniformityCellShape< 3.5 327  20 benign (0.93883792 0.06116208)  
##      4) BareNuclei< 6 313   8 benign (0.97444089 0.02555911)  
##        8) NormalNucleoli< 3.5 303   2 benign (0.99339934 0.00660066) *
##        9) NormalNucleoli>=3.5 10   4 malignant (0.40000000 0.60000000) *
##      5) BareNuclei>=6 14   2 malignant (0.14285714 0.85714286) *
##    3) UniformityCellShape>=3.5 163  14 malignant (0.08588957 0.91411043)  
##      6) UniformityCellSize< 4.5 40  11 malignant (0.27500000 0.72500000)  
##       12) BareNuclei< 7.5 22  11 benign (0.50000000 0.50000000)  
##         24) UniformityCellShape< 4.5 11   2 benign (0.81818182 0.18181818) *
##         25) UniformityCellShape>=4.5 11   2 malignant (0.18181818 0.81818182) *
##       13) BareNuclei>=7.5 18   0 malignant (0.00000000 1.00000000) *
##      7) UniformityCellSize>=4.5 123   3 malignant (0.02439024 0.97560976) *
library(rpart.plot)  # provides prp and rpart.plot
prp(tree1)

The outputs of the print and basic prp functions are either dense (the raw node listing) or bare-bones (the unlabelled plot).

Nonetheless, it was observed that the Decision Tree used only 4 of the 9 available predictors and that UniformityCellShape was the first variable to split on. Trying out other variations:

prp(tree1, type = 4)

rpart.plot(tree1, extra = 104, nn = TRUE)

The second plot is quite self-explanatory:

  1. Cell Shape Uniformity has the most discriminative power, splitting the data in a roughly 2:1 ratio. Low values for this variable indicate that a sample has high odds of being benign.
  2. Further splits toward the left show that cells with low Cell Shape Uniformity (< 4), low Bare Nuclei (< 6) and low Normal Nucleoli (< 4) are overwhelmingly benign. This leaf node holds 62% of all observations in the training set.
  3. At the other end, cells with high Cell Shape Uniformity (>= 4) and high Cell Size Uniformity (>= 5) are overwhelmingly malignant. This leaf node holds 25% of the training set.

Passing extra parameters to the prp function adds further detail to the resulting tree. Here, type = 4 draws the split labels on both the left and right branches below each node.
The rpart.plot call is more illustrative still. extra = 104 outputs the percentage of observations reaching each node and the class probabilities within it (which sum to 1), while nn = TRUE displays the node numbers.

rpart.plot(tree1, extra = 106, nn = TRUE, fallen.leaves = TRUE, shadow.col = "darkgray")

Other modifications change the aesthetics of the plot. Here, extra = 106 displays only the probability of the second class (malignant) in each node, along with the percentage of observations; fallen.leaves = TRUE lines the leaves up along the bottom. Note the very 90s WordArt-style shadows on each node.
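
Before looking at complexity pruning, it is worth checking how tree1 generalises. The following sketch is not part of the original lab; it simply scores the fitted tree on the held-out dataTest set.

# Predict classes for the held-out test set
predTest <- predict(tree1, newdata = dataTest, type = "class")
# Cross-tabulate predictions against the true labels
table(Predicted = predTest, Actual = dataTest$Class)
# Overall accuracy on unseen data
mean(predTest == dataTest$Class)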

Plotting the cp values

plotcp(tree1)

printcp(tree1)
## 
## Classification tree:
## rpart(formula = Class ~ ., data = dataTrain, method = "class")
## 
## Variables actually used in tree construction:
## [1] BareNuclei          NormalNucleoli      UniformityCellShape
## [4] UniformityCellSize 
## 
## Root node error: 169/490 = 0.3449
## 
## n= 490 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.798817      0  1.000000 1.00000 0.062260
## 2 0.059172      1  0.201183 0.26627 0.037827
## 3 0.013807      2  0.142012 0.19527 0.032827
## 4 0.011834      5  0.100592 0.21302 0.034174
## 5 0.010000      6  0.088757 0.21302 0.034174

Per rpart’s docs:

if any split does not increase the overall R2 of the model by at least cp (where R2 is the usual linear-models definition) then that split is decreed to be, a priori, not worth pursuing.
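
Acting on this, a natural follow-up (not shown in the original lab) is to prune tree1 at the cp value with the lowest cross-validated error (xerror) from the table above:

# Pick the cp with the smallest cross-validated error
bestcp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
# Prune the tree at that complexity parameter and re-plot it
prunedTree <- prune(tree1, cp = bestcp)
prp(prunedTree)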


 

A work by Arnold Cheong

tp057228@mail.apu.edu.my