How does a prediction tree work?

  1. Start with all variables in one group.
  2. Find a variable/split that best separates the outcomes.
  3. Divide the data into 2 groups (leaves) on that split (node).
  4. Within each group, find the variable/split that best separates the outcomes.
  5. Continue until the groups are too small or sufficiently homogeneous.

The lecture gives an overview of a few measures of impurity. There is some interesting math behind these, but it is not essential for this note set.
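For reference, one common impurity measure is the Gini index, 1 - sum(p_k^2) over the class proportions p_k in a node. A minimal sketch (the gini helper is mine, not from the lecture):

gini<-function(y){
    p<-table(y)/length(y)    # class proportions within the node
    1-sum(p^2)               # 0 = perfectly pure, larger = more mixed
}
gini(rep("setosa",10))                        # 0: a pure node
gini(c(rep("setosa",5),rep("virginica",5)))   # 0.5: a maximally mixed 2-class node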

An example from the iris dataset:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart.plot)
## Loading required package: rpart
data(iris)
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
ggplot(data=iris,aes(x=Petal.Width,y=Sepal.Width,col=Species)) + geom_point()

We can see that there are pretty distinct clusters of species when petal width is plotted against sepal width. Let’s try to predict species using the other variables.

set.seed(22222)
inTrain<-createDataPartition(y=iris$Species,p=.7,list=FALSE)
training<-iris[inTrain,]
testing<-iris[-inTrain,]
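As a quick sanity check (createDataPartition samples within each species, so the class balance is preserved):

dim(training)    # 105 rows: 70% of 150, i.e. 35 per species
dim(testing)     # the remaining 45 rows, held out for evaluation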

Use caret’s train function with the method="rpart" argument to create a model that predicts species from all the other variables, using the decision tree algorithm described above.

modFit<-train(Species~.,data=training,method="rpart")
modFit$finalModel
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 35  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.75 37  3 versicolor (0.00000000 0.91891892 0.08108108) *
##     7) Petal.Width>=1.75 33  1 virginica (0.00000000 0.03030303 0.96969697) *
plot(modFit$finalModel,uniform=TRUE)
text(modFit$finalModel,use.n=TRUE,all=TRUE,cex=.8)

Note that the standard plot system butchers the decision tree. I looked into it (my thanks to Google and the wonderful R community), and it seems rpart.plot is a package better suited for this task. There is also talk of the rattle package, but I was unable to fully install it due to the unavailability of one of its dependencies. Here is the plot with rpart.plot.

rpart.plot(modFit$finalModel)

WOW! So much better!
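Since the model was fit on the training set only, a natural next step (not shown in the lecture, so just a sketch) is to evaluate it on the held-out testing set:

pred<-predict(modFit,newdata=testing)    # predicted species for the held-out rows
confusionMatrix(pred,testing$Species)    # caret's accuracy and per-class statistics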

Bagging

Or, bootstrap aggregating

Basic Idea:

  1. Resample cases and recalculate predictions
  2. Average the results or take a majority vote, depending on the type of the outcome variable (a voting sketch follows this list)
  3. Useful for non-linear models
  4. Has similar bias to a single model, but reduced variance
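For a categorical outcome, step 2 means majority voting. A toy sketch on iris (mine, not from the lecture), fitting a classification tree to each bootstrap resample:

votes<-replicate(10,{
    ss<-sample(1:nrow(iris),replace=TRUE)            # bootstrap resample of the rows
    fit<-rpart::rpart(Species~.,data=iris[ss,])      # tree fit to this resample
    as.character(predict(fit,newdata=iris,type="class"))
})
bagged<-apply(votes,1,function(v) names(which.max(table(v))))   # majority vote per row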

Let’s do an example with the ozone data from the ElemStatLearn package.

require(ElemStatLearn)
## Loading required package: ElemStatLearn
data(ozone, package="ElemStatLearn")
ozone<-ozone[order(ozone$ozone),]
head(ozone)
##     ozone radiation temperature wind
## 17      1         8          59  9.7
## 19      4        25          61  9.7
## 14      6        78          57 18.4
## 45      7        48          80 14.3
## 106     7        49          69 10.3
## 7       8        19          61 20.1

Do the sampling manually. I bet $5 there’s a function that will do this automatically in later slides… but for now:

ll<-matrix(NA,nrow=10,ncol=155)                 # one row of predictions per resample
for(i in 1:10){
    ss<-sample(1:dim(ozone)[1],replace=TRUE)    # bootstrap: resample rows with replacement
    ozone0<-ozone[ss,];ozone0<-ozone0[order(ozone0$ozone),]
    loess0<-loess(temperature~ozone,data=ozone0,span=.2)      # smooth fit to this resample
    ll[i,]<-predict(loess0,newdata=data.frame(ozone=1:155))   # predict on a common grid
}

Plot them. Use base plot because why not practice?

Bagged Loess Curve

with(ozone, plot(ozone,temperature,pch=19,cex=.5))
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
lines(1:155,apply(ll,2,mean),col="red",lwd=2)

We see that this bagged loess curve, which is an average of the curves from models built on individual samples, does a good job of tracking the trends in the data without the variability found in any individual sample curve.

In the caret package:

  • bagEarth
  • treebag
  • bagFDA

These are methods in the train function in the caret package. Perhaps worth investigating in the future.
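For instance, treebag (bagged trees via the ipred package) plugs straight into train. A sketch I have not run here:

treebagFit<-train(temperature~ozone,data=ozone,method="treebag")
predict(treebagFit,newdata=data.frame(ozone=1:155))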

Alternatively, you can create your own bagged model using caret’s bag function. Please use caution and read the documentation thoroughly.
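A sketch of that custom route, using caret's built-in ctreeBag fit/predict/aggregate helpers (this mirrors the example in ?bag, needs the party package, and is a template rather than tested code):

predictors<-data.frame(ozone=ozone$ozone)
temperature<-ozone$temperature
ctreeBagFit<-bag(predictors,temperature,B=10,
                 bagControl=bagControl(fit=ctreeBag$fit,
                                       predict=ctreeBag$pred,
                                       aggregate=ctreeBag$aggregate))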

Thanks for playing!
