The lecture gives an overview of a few measures of impurity. There is some interesting math behind these, but it is not central to this note set.
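For the record, the measures usually listed are misclassification error, the Gini index, and deviance/information. A quick numeric sketch (my own addition, with made-up class proportions for a single node):
p<-c(setosa=0.1,versicolor=0.7,virginica=0.2) # hypothetical class proportions in one node
1-max(p)        # misclassification error
1-sum(p^2)      # Gini index
-sum(p*log2(p)) # deviance/information (entropy, in bits)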
An example from the iris dataset:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart.plot)
## Loading required package: rpart
data(iris)
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
ggplot(data=iris,aes(x=Petal.Width,y=Sepal.Width,col=Species)) + geom_point()
We can see that there are pretty distinct clusters of species when petal width is plotted against sepal width. Let’s try to predict species using the other variables.
set.seed(22222)
inTrain<-createDataPartition(y=iris$Species,p=.7,list=FALSE)
training<-iris[inTrain,]
testing<-iris[-inTrain,]
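A quick sanity check (my addition, not from the lecture): createDataPartition samples within each class, so both sets should keep the 50/50/50 balance roughly intact.
table(training$Species) # should be about 35 of each species with p=.7
table(testing$Species)  # the remaining ~15 of each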
Use caret’s train function with method="rpart" to create a model that predicts the species from all the other variables, using the decision tree classification approach described above.
modFit<-train(Species~.,data=training,method="rpart")
modFit$finalModel
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)
## 2) Petal.Length< 2.45 35 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)
## 6) Petal.Width< 1.75 37 3 versicolor (0.00000000 0.91891892 0.08108108) *
## 7) Petal.Width>=1.75 33 1 virginica (0.00000000 0.03030303 0.96969697) *
plot(modFit$finalModel,uniform=TRUE)
text(modFit$finalModel,use.n=TRUE,all=TRUE,cex=.8)
Note that the standard plot system butchers the decision tree. I looked into it (my thanks to Google and the wonderful R community), and it seems that rpart.plot is a package better suited for this task. There is also talk of the rattle package, but I was unable to fully install it because one of its dependencies was unavailable. Here is the plot with rpart.plot.
rpart.plot(modFit$finalModel)
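The lecture stops at plotting the tree, but since we set aside a testing set, a natural next step (my addition) is to check the model’s predictions against it:
pred<-predict(modFit,newdata=testing) # predicted species for the held-out rows
confusionMatrix(pred,testing$Species) # accuracy and per-class breakdown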
Bagging, or bootstrap aggregating
Let’s do an example with the ozone data from the ElemStatLearn library.
require(ElemStatLearn)
## Loading required package: ElemStatLearn
data(ozone, package="ElemStatLearn")
ozone<-ozone[order(ozone$ozone),]
head(ozone)
## ozone radiation temperature wind
## 17 1 8 59 9.7
## 19 4 25 61 9.7
## 14 6 78 57 18.4
## 45 7 48 80 14.3
## 106 7 49 69 10.3
## 7 8 19 61 20.1
Do the sampling manually. I bet $5 there’s a function that will do this automatically in later slides… but for now:
ll<-matrix(NA,nrow=10,ncol=155) # one row per resample, one column per ozone value 1:155
for(i in 1:10){
  ss<-sample(1:dim(ozone)[1],replace=TRUE) # bootstrap sample of row indices, with replacement
  ozone0<-ozone[ss,];ozone0<-ozone0[order(ozone0$ozone),] # resampled data, reordered by ozone
  loess0<-loess(temperature~ozone,data=ozone0,span=.2) # fit a loess curve to this resample
  ll[i,]<-predict(loess0,newdata=data.frame(ozone=1:155)) # predict over a common grid
}
Plot them. Use base plot, because why not get some practice?
with(ozone, plot(ozone,temperature,pch=19,cex=.5))
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
lines(1:155,apply(ll,2,mean),col="red",lwd=2)
We see that this bagged loess curve, which is the average of the curves from models built on individual bootstrap samples, does a good job of tracking the trend in the data without the variability found in any individual sample’s curve.
Bagging methods like this are also available directly in caret’s train function (for example the bagEarth, treebag, and bagFDA methods). Perhaps worth investigating in the future.
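For instance, a bagged regression tree could presumably be fit through train with something like this (just a sketch; the "treebag" method relies on the ipred package, and I haven’t run it here):
bagFit<-train(temperature~ozone,data=ozone,method="treebag") # bagged CART via ipred
predict(bagFit,newdata=data.frame(ozone=1:155))              # predictions over the same grid as above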
Alternatively, you can build your own bagged model using caret’s bag function. Please use caution and read the documentation thoroughly.
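A sketch of what that might look like, using caret’s built-in ctreeBag fit/predict/aggregate helpers (this assumes the party package is installed; treat it as illustrative rather than verified output):
predictors<-data.frame(ozone=ozone$ozone) # predictor data frame expected by bag()
temperature<-ozone$temperature            # outcome vector
treebag<-bag(predictors,temperature,B=10,
             bagControl=bagControl(fit=ctreeBag$fit,
                                   predict=ctreeBag$pred,
                                   aggregate=ctreeBag$aggregate))
plot(ozone$ozone,temperature,pch=19,cex=.5)                       # raw data
points(ozone$ozone,predict(treebag,predictors),pch=19,col="blue") # bagged predictions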
Thanks for playing!