Basic Algorithm

  1. Begin with all variables in one group

  2. Find the variable/split that best separates the outcomes

  3. Divide the data into two groups (“leaves”) on that split (the “node”)

  4. Within each group represented by a leaf, find the best variable/split that separates the outcomes

  5. Continue until the groups are too small or sufficiently “pure” (an impurity measure is sketched just below)
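
The “best” split in step 2 and the “purity” in step 5 are scored with an impurity measure; rpart, used below, defaults to the Gini index for classification. A minimal sketch of that measure, written here only for illustration:

gini<-function(y){            # y: a vector/factor of class labels
  p<-table(y)/length(y)       # class proportions in the node
  1-sum(p^2)                  # 0 = perfectly pure; ~0.667 for three equal classes
}
gini(iris$Species)                            # all three species mixed: about 0.667
gini(iris$Species[iris$Petal.Length<2.45])    # only setosa: 0

A split is chosen so that the size-weighted impurity of the two child nodes drops as much as possible relative to the parent.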

Simple Example: Iris Data

data(iris);library(ggplot2);library(caret)
## Loading required package: lattice
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

We will predict species based on the other variables.

inTrain<-createDataPartition(y=iris$Species,p=0.70,list=FALSE)
training<-iris[inTrain,]
testing<-iris[-inTrain,]
dim(training);dim(testing)
## [1] 105   5
## [1] 45  5
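
Note that createDataPartition() samples the 70% training rows at random, so the partition, the fitted tree, and any test results will vary from run to run. Calling set.seed() first makes the walkthrough reproducible; the seed value below is arbitrary.

set.seed(125)   # any fixed number, used only for reproducibility
inTrain<-createDataPartition(y=iris$Species,p=0.70,list=FALSE)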

Plot petal width against sepal width to see how the species separate.

qplot(Petal.Width,Sepal.Width,colour=Species,data=training)

Build the model with caret's rpart method.

modFit<-train(Species~.,method="rpart",data=training)
## Loading required package: rpart
print(modFit$finalModel)  
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 35  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.65 37  3 versicolor (0.00000000 0.91891892 0.08108108) *
##     7) Petal.Width>=1.65 33  1 virginica (0.00000000 0.03030303 0.96969697) *
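
In this printout each node shows the number of observations (n), how many of them the node's predicted class gets wrong (loss), the predicted class (yval), and the class proportions (yprob); a * marks a terminal node. To see the splits in action, we can drop a made-up flower through the tree (the measurements below are invented for illustration):

newFlower<-data.frame(Sepal.Length=6.0,Sepal.Width=3.0,
                      Petal.Length=5.0,Petal.Width=2.0)
predict(modFit,newdata=newFlower)   # Petal.Length>=2.45 and Petal.Width>=1.65, so: virginica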

Visualize the result.

plot(modFit$finalModel,uniform=TRUE,main="Classification Tree")
text(modFit$finalModel,use.n=TRUE,all=TRUE,cex=.8)

library(rattle)
## Loading required package: RGtk2
## Rattle: A free graphical interface for data mining with R.
## Version 3.5.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(modFit$finalModel)
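
The testing set has not been used yet. A natural last step, not shown above, is to predict on it and compare against the true species; the exact accuracy will depend on the random partition.

pred<-predict(modFit,newdata=testing)   # predicted species for the 45 held-out flowers
confusionMatrix(pred,testing$Species)   # accuracy plus per-class sensitivity/specificity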