rpartXse() combines two steps: 1) growing a tree and 2) post-pruning it. Trees in rpart are grown until one of three conditions is satisfied: 1) the decrease in the error of the current node is less than a threshold (cp), 2) the number of samples in a node is less than a specified threshold (minsplit), or 3) the tree depth exceeds a specified value (maxdepth).
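For reference, these three stopping parameters can be set explicitly through rpart.control(); a minimal sketch using plain rpart() (not rpartXse()), with rpart's own default values:

library(rpart)
fit <- rpart(Species ~ ., iris,
             control = rpart.control(cp = 0.01,      # minimum error decrease to keep splitting
                                     minsplit = 20,  # minimum samples in a node to attempt a split
                                     maxdepth = 30)) # maximum tree depth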

library(DMwR2)       # for rpartXse()
library(rpart.plot)  # for prp()

set.seed(1234)
data(iris)

ct1 <- rpartXse(Species ~ ., iris)
ct2 <- rpartXse(Species ~ ., iris, se = 0)
par(mfrow = c(1, 2))
prp(ct1, type = 0, extra = 101, roundint = FALSE) # left tree
prp(ct2, type = 0, extra = 101, roundint = FALSE) # right tree

ct3 <- rpartXse(Species ~ ., iris, se = 0.05)
par(mfrow = c(1, 3))
prp(ct1, type = 0, extra = 101, roundint = FALSE) # left tree
prp(ct2, type = 0, extra = 101, roundint = FALSE) # middle tree
prp(ct3, type = 0, extra = 101, roundint = FALSE) # right tree

The code above generates two trees, ct1 and ct2. ct1 is grown using the default values of cp=0.01, minsplit=20 and maxdepth=30, and is then post-pruned with the 1-SE rule (se=1), which uses the cross-validated error of each subtree. The convention is to use either the best tree (the one with the lowest cross-validated relative error) or the smallest (simplest) tree whose error is within one standard error of that best tree. ct2 is generated by the same procedure, except that se is set to 0, which selects the best tree directly. Although the book says otherwise, I obtained the same tree for se=1 and se=0. So I tried se=0.05 and again got the same result.
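To make the X-SE rule concrete, here is a sketch of the selection logic applied to rpart's cross-validation table; this is my own illustration of the standard rule, not the package's actual code:

library(rpart)
big <- rpart(Species ~ ., iris, cp = 0)               # grow an overly large tree
tab <- big$cptable                                    # cross-validated error per subtree
best <- which.min(tab[, "xerror"])                    # subtree with the lowest CV error
thresh <- tab[best, "xerror"] + 1 * tab[best, "xstd"] # threshold for se = 1
chosen <- min(which(tab[, "xerror"] <= thresh))       # smallest subtree within 1 SE
pruned <- prune(big, cp = tab[chosen, "CP"])

With se = 0 the threshold is the lowest cross-validated error itself, so the best tree is selected directly.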

set.seed(1234)
rndSample <- sample(1:nrow(iris), 100)
tr <- iris[rndSample, ]
ts <- iris[-rndSample, ]
ct <- rpartXse(Species ~ ., tr, se = 0.5)
prp(ct, type = 0, extra = 101, roundint = FALSE)

psi <- predict(ct, ts)
head(psi)
##    setosa versicolor virginica
## 1       1          0         0
## 7       1          0         0
## 11      1          0         0
## 12      1          0         0
## 15      1          0         0
## 16      1          0         0
ps2 <- predict(ct, ts, type="class")
head(ps2)
##      1      7     11     12     15     16 
## setosa setosa setosa setosa setosa setosa 
## Levels: setosa versicolor virginica
(cm <- table(ps2, ts$Species))
##             
## ps2          setosa versicolor virginica
##   setosa         18          0         0
##   versicolor      0         15         1
##   virginica       0          3        13
100*(1-sum(diag(cm))/sum(cm))
## [1] 8

The code above splits the iris dataset into a training sample tr with 100 rows, picked at random with the function sample(), and a test set ts containing the 50 rows that are not in the training set.
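A quick sanity check of the split (the negative index in iris[-rndSample, ] selects the complement of the sample):

nrow(tr)
## [1] 100
nrow(ts)
## [1] 50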

A model ct is built with the rpartXse() function on the training set tr, with se set to 0.5. Note that this time the tree generated is different from the ones above, because it is trained on a 100-row sample of iris rather than the full dataset. The function predict() is then applied, using the model ct, to the test set ts, and the predictions are stored in the object psi. predict() has two return-value options. The default is to return, for each row, the probability of it being classified with each label. For example, for row 11 we get a probability of 1 for setosa and 0 for versicolor and virginica. psi is thus a matrix with one row per row of ts, holding the probabilities of belonging to each of the three species.
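To see how the two return types relate, the class predictions can be rebuilt from the probability matrix: the predicted class is simply the column with the highest probability. This is my own check, not code from the book:

ps_manual <- factor(colnames(psi)[max.col(psi, ties.method = "first")],
                    levels = colnames(psi))
all(ps_manual == ps2)  # should be TRUE when no row has tied probabilities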

ps2 <- predict(ct, ts, type="class") creates a new prediction object ps2 in which each row of ts is assigned a label (class), because the type="class" parameter was added to the predict() call. table() then cross-tabulates the predictions in ps2 against the real labels of each row in ts, producing the confusion matrix above. The last statement returns the classification error rate (as a %). Unlike the book, my classification error rate was 8%.
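The same error rate can also be computed directly, without building the confusion matrix first:

100 * mean(ps2 != ts$Species)
## [1] 8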