The basic idea behind predicting with trees is that if we have a lot of variables that we want to use to predict an outcome, we can take each of those variables and use it to split the outcome into different groups. As we split the outcomes into groups, we evaluate the homogeneity of the outcome within each group, and we continue to split until the groups are homogeneous enough or small enough.
PROS
* It is easy to interpret
* We tend to get better performance in non-linear settings than with linear regression models
CONS
* Without pruning or some kind of cross-validation, this method can lead to overfitting
* Uncertainty is harder to estimate than in the linear regression setting
* In general, the results may vary depending on the exact values of the parameters or the variables we collected
The basic algorithm behind building these trees is as follows:
1. Start with all the variables in one group.
2. Find the variable/split that best separates the outcomes.
3. Divide the data into two groups ("leaves") on that split ("node").
4. Within each new group, again find the best variable/split that separates the outcomes.
5. Continue until the groups are too small or sufficiently "pure" (homogeneous).
There are different measures of impurity, and they are all based on the class probabilities we can estimate within a particular group (leaf). The misclassification error is 1 minus the probability of the most common class in that leaf. Zero (0) means perfect purity and 0.5 means no purity: if the leaf is perfectly balanced between two different outcomes, we have no purity.
Similarly, there is another measure called the Gini index (not to be confused with the Gini coefficient). It is 1 minus the sum of the squared probabilities of belonging to each of the classes. Again, 0 means perfect purity and 0.5 means no purity (for two classes).
Likewise, there is another measure called deviance/information gain. It is called deviance if we use the logarithm with base e in the formula, and information gain if we use the logarithm with base 2. It is minus the sum, over classes k, of the probability of being assigned to class k in leaf m times the log (base e or base 2) of that same probability. Here as well, 0 means perfect purity, but 1 means no purity at all (for two classes with base 2).
Let’s take an example. Suppose we are trying to put 16 blue and 16 red balls into two different boxes, all red in one and all blue in the other. Say the algorithm puts 15 of the 16 balls in a box correctly (15 red, 1 blue); then for that box the misclassification error is 1/16 ≈ 0.06, the Gini index is 1 − [(1/16)² + (15/16)²] ≈ 0.12, and the information is −[1/16·log2(1/16) + 15/16·log2(15/16)] ≈ 0.34.
Likewise, if an algorithm correctly identifies only 8 out of 16 balls (8 red and 8 blue in a box), then the misclassification error is 8/16 = 0.5, the Gini index is 1 − [(8/16)² + (8/16)²] = 0.5, and the information is −[0.5·log2(0.5) + 0.5·log2(0.5)] = 1, i.e., no purity at all.
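Here is a quick sketch in base R (not part of the original notes) that computes the three impurity measures for these two cases; the helper function impurity is hypothetical.
# Sketch: impurity measures for a leaf with the given class counts
impurity <- function(counts) {
  p <- counts / sum(counts)                 # class probabilities in the leaf
  p <- p[p > 0]                             # drop zero counts so log2() is defined
  c(misclassification = 1 - max(p),
    gini              = 1 - sum(p^2),
    information       = -sum(p * log2(p)))  # use log() instead of log2() for deviance
}
impurity(c(15, 1))   # 15 of 16 correct: ~0.06, ~0.12, ~0.34
impurity(c(8, 8))    # 8 of 16 correct:   0.50,  0.50,  1.00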
Let’s go ahead and see an example using the iris dataset.
library(caret)
data(iris)
library(ggplot2)
names(iris) #Checking data for the variables
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
The variables in the dataset are “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”. We are trying to predict Species. There are 50 each of the setosa, versicolor, and virginica species.
We are now going to split the dataset into training and testing sets for prediction, using a 70%/30% split.
inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training)
## [1] 105 5
dim(testing)
## [1] 45 5
Altogether, we have 105 examples to build the prediction model and 45 examples to test and validate it.
The first thing we are going to do is plot Sepal.Width against Petal.Width and color-code the points by species.
qplot(Petal.Width, Sepal.Width, colour=Species, data=training)
We can see that there are three distinct clusters in the plot. Because the clusters are relatively well separated, this is a comparatively easy classification problem. It might be a little challenging for a linear model, but not for classification trees.
We are going to fit the model using the ‘caret’ library, training with the ‘train’ function. The outcome is ‘Species’ and we use all the other variables in the data as predictors (the tilde followed by a period). We use the ‘rpart’ method on the training dataset.
modFit<-train(Species ~ ., method="rpart", data=training)
print(modFit$finalModel)
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)
## 2) Petal.Length< 2.45 35 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)
## 6) Petal.Length< 4.75 32 0 versicolor (0.00000000 1.00000000 0.00000000) *
## 7) Petal.Length>=4.75 38 3 virginica (0.00000000 0.07894737 0.92105263) *
We printed the final model, which shows the splits, the number of observations at each node, and the class probabilities within each node (leaf).
For example, there is a split that says Petal.Length < 2.45, and observations that satisfy it belong to setosa.
We can plot the model fit to make these results easier to see. Here we plot the final model, which produces a dendrogram that we can follow left and right at each split.
plot(modFit$finalModel, uniform=TRUE, main="Classification Tree")
text(modFit$finalModel, use.n=TRUE, all=TRUE, cex=0.8)
Based on the dendrogram, we can say that if Petal.Length is less than 2.45 the species is predicted to be setosa; in our training set there are 35 such cases. Otherwise, if Petal.Length is less than 4.75 the species is predicted to be versicolor, and if it is larger it is predicted to be virginica.
If we want to create a fancy tree, then we can use the ‘rattle’ package. It looks much easier to read and understand.
library(rattle)
fancyRpartPlot(modFit$finalModel)
At the root, an observation has a 33% chance of being each of the three species. If Petal.Length is less than 2.45, it is classified as setosa (100% pure in the training set). Otherwise, if Petal.Length is less than 4.75, it is classified as versicolor (again 100% pure here); if not, it is classified as virginica (about 92% pure). Overall, 33% of the training observations end up in the setosa leaf, 30% in the versicolor leaf, and 36% in the virginica leaf.
Now we can apply the final model to the testing set. As usual, we use the ‘predict’ function from the caret package: we take the model fit built on the training set and use it to predict the species in the testing set.
predict(modFit, newdata=testing)
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor virginica versicolor virginica versicolor
## [25] versicolor virginica versicolor versicolor versicolor versicolor
## [31] virginica versicolor virginica virginica virginica virginica
## [37] virginica virginica virginica virginica virginica virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
The prediction shows the particular class label for each of the 45 data points in the testing set, because the classification tree was built to predict a particular class.
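To quantify how well the tree does out of sample, one option is caret’s confusionMatrix function; a minimal sketch:
# Sketch: compare the tree's predictions against the true testing labels
pred <- predict(modFit, newdata = testing)
confusionMatrix(pred, testing$Species)   # accuracy plus per-class sensitivity/specificity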
Bagging (Bootstrap Aggregating)
Bootstrap aggregating (bagging) is a very simple idea: take the data and resample it, just as in bootstrapping. After resampling the cases with replacement, we recalculate our prediction function on each resampled dataset. We then either average the predictions from all these repeated predictors, or take a majority vote when we are doing classification.
From this description, it is clear that the method is most useful when we fit complicated, non-linear models. When we average all of those models together, we get a smoother fit with a better balance between bias and variance: the bias is similar to what we would get from fitting any one of those models individually, but the variability is reduced because we have averaged a number of different predictors together.
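To make the classification case concrete, here is a minimal sketch of bagging by majority vote; the helpers fit_fun and pred_fun are hypothetical placeholders for any fitting and prediction routine, not part of the original notes.
# Sketch: bag a classifier by resampling, refitting, and taking a majority vote
bag_classify <- function(data, newdata, B, fit_fun, pred_fun) {
  votes <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]    # resample cases with replacement
    as.character(pred_fun(fit_fun(boot), newdata))        # predict classes for the new data
  })
  # majority vote across the B resampled models for each new observation
  apply(votes, 1, function(v) names(which.max(table(v))))
}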
Now we are going to see an example with smoothing, but bagging is also very useful for things like predicting with trees. Let’s use the ozone dataset and work with it:
#library(ElemStatLearn)
data(ozone, package="ElemStatLearn")
ozone<-ozone[order(ozone$ozone), ]#ordering the data by the outcome
head(ozone)
## ozone radiation temperature wind
## 17 1 8 59 9.7
## 19 4 25 61 9.7
## 14 6 78 57 18.4
## 45 7 48 80 14.3
## 106 7 49 69 10.3
## 7 8 19 61 20.1
The ElemStatLearn package has been archived from CRAN, so it is no longer available there; to run this code the package needs to be installed from the CRAN archive (or the ozone data loaded from another source).
We can see that there are four variables: ozone, radiation, temperature, and wind. We are going to predict temperature as a function of ozone.
The first thing we do is bag the loess fit by hand: we resample the dataset ten different times, fit a smooth curve through each of those ten resamples, and then average the fitted values.
ll <- matrix(NA, nrow = 10, ncol = 155)   # storage for 10 sets of predictions over a grid of ozone values
for (i in 1:10) {
  ss <- sample(1:dim(ozone)[1], replace = TRUE)                      # bootstrap resample of row indices
  ozone0 <- ozone[ss, ]                                              # the resampled dataset
  ozone0 <- ozone0[order(ozone0$ozone), ]                            # reorder by ozone
  loess0 <- loess(temperature ~ ozone, data = ozone0, span = 0.2)    # fit a smooth loess curve
  ll[i, ] <- predict(loess0, newdata = data.frame(ozone = 1:155))    # predict on the common grid 1:155
}
We haven’t asked for any output yet, so we don’t see anything. Now let’s plot ozone against temperature.
plot(ozone$ozone, ozone$temperature, pch = 19, cex = 0.5)   # observed data points
for (i in 1:10) {
  lines(1:155, ll[i, ], col = "grey", lwd = 2)               # each resampled loess fit
}
lines(1:155, apply(ll, 2, mean), col = "red", lwd = 2)       # the bagged (averaged) fit
Each black dot represents an observation; each gray line represents a fit to one resampled dataset. As can be seen, the gray curves capture a lot of the variability in the data, but they also over-capture some of it: they look a little too curvy. When we average them together, we get something smoother that stays close to the middle of the dataset (the red line). The red line is called the bagged loess curve: the average of multiple loess curves fitted to resampled versions of the same dataset.
There are proofs that bagged estimates always have lower variability but similar bias compared to the individual model fits. In the caret package, some models perform bagging for us: in the train function we can set the method to ‘bagEarth’, ‘treebag’, or ‘bagFDA’.
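For example, here is a minimal sketch (assuming the ipred package, which backs caret’s ‘treebag’ method, is installed) that lets train do the bagging for us on the ozone data:
# Sketch: caret's built-in bagged-tree method applied to the ozone data
library(caret)
set.seed(123)
treebagFit <- train(temperature ~ ozone, data = ozone, method = "treebag")
predict(treebagFit, newdata = data.frame(ozone = c(25, 75, 125)))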
Alternatively, we can create our own bag function with the caret package. If we are going to do this ourselves, it is important to read the documentation carefully (http://www.inside-r.org/packages/cran/caret/docs/nbBag) before we do so.
Let’s see an example. The idea is to pass the predictors as a data frame, the outcome as a vector, the number of bootstrap replicates B, and a bagControl object that specifies how each model is fit, how predictions are made, and how the predictions are aggregated:
predictors <- data.frame(ozone = ozone$ozone)    # predictors must be a data frame
temperature <- ozone$temperature                 # outcome as a vector
treebag <- bag(predictors, temperature, B = 10,  # 10 bootstrap replicates
               bagControl = bagControl(fit = ctreeBag$fit,              # how to fit each model
                                       predict = ctreeBag$pred,         # how to predict from a fit
                                       aggregate = ctreeBag$aggregate)) # how to combine predictions
Above, we created a custom bagged version of conditional inference trees. Now let’s plot the results: 1. temperature goes on the y-axis and ozone on the x-axis, with light grey dots for the observed values. 2. The red points represent the fit from a single conditional inference tree; they look fairly flat even though there is a clear upward trend in the data. 3. The blue points represent the average over the 10 bagged models; this fit captures the upward trend much better.
plot(ozone$ozone, temperature, col='lightgrey', pch=19)
points(ozone$ozone, predict(treebag$fits[[1]]$fit, predictors), pch=5, col="red")
points(ozone$ozone, predict(treebag, predictors), pch=2,col="blue")
Now we are going to look at some of the parts of this bagging setup. In this particular case, we are using the ‘ctreeBag’ object from the ‘caret’ package.
ctreeBag$fit
## function (x, y, ...)
## {
## loadNamespace("party")
## data <- as.data.frame(x, stringsAsFactors = TRUE)
## data$y <- y
## party::ctree(y ~ ., data = data)
## }
## <bytecode: 0x000000002064a088>
## <environment: namespace:caret>
Based on this output, we can see that the fit function uses the ctree function from the party package to fit a conditional inference tree to the bootstrapped dataset.
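For reference, here is a small sketch of fitting a single conditional inference tree directly with party::ctree, which is essentially what this fit function does for each bootstrap sample:
# Sketch: one conditional inference tree fit to the full ozone data
library(party)
singleTree <- ctree(temperature ~ ozone, data = ozone)
head(predict(singleTree, newdata = data.frame(ozone = 1:10)))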
ctreeBag$pred
## function (object, x)
## {
## if (!is.data.frame(x))
## x <- as.data.frame(x, stringsAsFactors = TRUE)
## obsLevels <- levels(object@data@get("response")[, 1])
## if (!is.null(obsLevels)) {
## rawProbs <- party::treeresponse(object, x)
## probMatrix <- matrix(unlist(rawProbs), ncol = length(obsLevels),
## byrow = TRUE)
## out <- data.frame(probMatrix)
## colnames(out) <- obsLevels
## rownames(out) <- NULL
## }
## else out <- unlist(party::treeresponse(object, x))
## out
## }
## <bytecode: 0x000000002064aad0>
## <environment: namespace:caret>
The pred function takes a fitted object and a new dataset x and returns predictions. It calls treeresponse on the object and the new data each time; it then either builds a matrix of class probabilities (when the outcome has observed levels, i.e., classification) or simply returns the predicted response (regression).
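As a small sketch, we can call this prediction function by hand on the first of our ten bagged fits (treebag$fits[[1]]$fit):
# Sketch: predictions from the first bagged conditional inference tree only
firstFit <- treebag$fits[[1]]$fit          # the first of the B = 10 fitted models
head(ctreeBag$pred(firstFit, predictors))  # same call the bagging machinery makes internally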
ctreeBag$aggregate
## function (x, type = "class")
## {
## if (is.matrix(x[[1]]) | is.data.frame(x[[1]])) {
## pooled <- x[[1]] & NA
## classes <- colnames(pooled)
## for (i in 1:ncol(pooled)) {
## tmp <- lapply(x, function(y, col) y[, col], col = i)
## tmp <- do.call("rbind", tmp)
## pooled[, i] <- apply(tmp, 2, median)
## }
## if (type == "class") {
## out <- factor(classes[apply(pooled, 1, which.max)],
## levels = classes)
## }
## else out <- as.data.frame(pooled, stringsAsFactors = TRUE)
## }
## else {
## x <- matrix(unlist(x), ncol = length(x))
## out <- apply(x, 1, median)
## }
## out
## }
## <bytecode: 0x0000000020648a60>
## <environment: namespace:caret>
The aggregate function then takes those predictions and combines them. As seen above, it collects the prediction from every one of the model fits across all of the observations, binds them together into one matrix with one column per model fit, and then takes the median at every row. In other words, it takes the median prediction across the different model fits from all the bootstrap samples.
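To make this concrete, here is a small sketch that reproduces the regression branch of the aggregation by hand:
# Sketch: manual median aggregation over the ten bagged fits
allPreds <- lapply(treebag$fits, function(f) ctreeBag$pred(f$fit, predictors))
predMat  <- matrix(unlist(allPreds), ncol = length(allPreds))  # one column per model fit
head(apply(predMat, 1, median))                                # median prediction per observation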
In summary, bagging is a very useful method for non-linear models, and it is widely used, often with trees; random forests can be thought of as an extension of this idea. Several bagged models are available directly through caret’s train function, and we can also build our own bagging functions for any classification or prediction algorithm we would like.
To reiterate, the basic idea involves: resampling the cases with replacement, recalculating the prediction function on each resampled dataset, and then averaging the predictions (or taking a majority vote for classification).
Random forests are an extension of bagging for classification and regression trees. The basic idea is very similar to bagging: we take bootstrap resamples of our training data and rebuild classification or regression trees on each of those bootstrap samples. The one difference is that at each split we also bootstrap the variables; in other words, only a random subset of the variables is considered at each potential split. This makes for a diverse set of potential trees. The idea is to grow a large number of trees and then vote or average across those trees to get the prediction for a new observation.
PROS
* Accuracy: random forests are usually among the most accurate off-the-shelf prediction methods
CONS
* Speed: building many trees can be slow
* Interpretability: a forest is much harder to interpret than a single tree
* Overfitting: it is hard to tell which trees are driving an overfit, so cross-validation is important
Let’s see an example:
data(iris)
library(ggplot2)
set.seed(123)
inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training)
## [1] 105 5
dim(testing)
## [1] 45 5
We invoked the required libraries and split the data into training and testing sets. The dimensions show that we did it correctly.
Now, let's fit a random forest.
library(caret)
library(randomForest)
modFit <- train(Species ~ ., data = training, method = "rf", prox = TRUE)  # prox=TRUE stores the proximity matrix
modFit
## Random Forest
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9488042 0.9221447
## 3 0.9468995 0.9192369
## 4 0.9427645 0.9129347
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
We can see that there are 105 observations, 4 predictors, and 3 classes. The data were resampled 25 times using the bootstrap, with 105 data points included in each resample.
Based on these statistics, the final model used mtry = 2 (the number of variables randomly sampled at each split), because it had the highest accuracy (about 0.949).
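As a sketch, the same kind of forest could also be fit directly with the randomForest package, fixing mtry at the value caret selected:
# Sketch: a random forest fit directly, with 2 variables tried at each split
library(randomForest)
set.seed(123)
rfDirect <- randomForest(Species ~ ., data = training, mtry = 2, ntree = 500)
rfDirect   # prints the out-of-bag error estimate and confusion matrix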
Now we can look at specific trees in the final model. We will get the 2nd tree (k = 2) as well as the 15th (k = 15), to compare them:
getTree(modFit$finalModel, k=2)
## left daughter right daughter split var split point status prediction
## 1 2 3 4 0.70 1 0
## 2 0 0 0 0.00 -1 1
## 3 4 5 4 1.65 1 0
## 4 0 0 0 0.00 -1 2
## 5 6 7 3 4.85 1 0
## 6 8 9 2 3.10 1 0
## 7 0 0 0 0.00 -1 3
## 8 0 0 0 0.00 -1 3
## 9 0 0 0 0.00 -1 2
getTree(modFit$finalModel, k=15)
## left daughter right daughter split var split point status prediction
## 1 2 3 4 0.70 1 0
## 2 0 0 0 0.00 -1 1
## 3 4 5 4 1.70 1 0
## 4 6 7 3 5.35 1 0
## 5 8 9 3 4.90 1 0
## 6 0 0 0 0.00 -1 2
## 7 0 0 0 0.00 -1 3
## 8 10 11 1 5.95 1 0
## 9 0 0 0 0.00 -1 3
## 10 0 0 0 0.00 -1 2
## 11 0 0 0 0.00 -1 3
We got two tables that show, for the 2nd and 15th trees, the left and right daughter nodes, which variable each node splits on, the value at which it splits, and the prediction at each terminal node. The two trees differ, as expected, since each tree is grown on a different bootstrap sample with a random subset of variables considered at each split.
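Since individual trees in a forest are hard to read, a more useful summary (a small sketch using caret’s varImp function) is which variables the forest relies on most:
varImp(modFit)   # variable importance for the random forest fit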
Now let’s use the class “centers” information to see what the predictions look like. 1. We plot Petal.Width against Petal.Length, colored by species. 2. We then use the classCenter function to find the center of each class in that two-dimensional space and overlay those centers on the plot.
irisP<-classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
irisP<-as.data.frame(irisP)
irisP$Species<-rownames(irisP)
p<-qplot(Petal.Width, Petal.Length, col=Species, data=training)
p+geom_point(aes(x=Petal.Width, y=Petal.Length, col=Species), size=7, shape=4, data=irisP)
Each dot represents a training observation, and each X marks a class center; the center for each species sits roughly in the middle of the cloud of points for that species.
We can then predict new values using the predict function.
pred<-predict(modFit, testing)
testing$predRight<-pred==testing$Species
table(pred, testing$Species)
##
## pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 2
## virginica 0 1 13
We passed our model fit and the testing dataset to predict. We also created a variable, predRight, indicating whether the prediction matches the true species in the testing dataset. We then made a table of our predictions versus the true species. As we can see, we missed three values (two plus one) with our random forest model, but overall it was highly accurate.
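A quick sketch of the corresponding overall accuracy:
mean(pred == testing$Species)   # fraction predicted correctly (42 of 45 here)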
We can then look to see which three we missed.
qplot(Petal.Width, Petal.Length, colour=predRight, data=testing, main="Newdata Predictions")
And perhaps unsurprisingly, we can see that the three points we missed, marked in red in the plot, are the ones that lie in between two classes.
We can use this tool to explore, and see where our prediction is doing well and where it is doing poorly.