Predicting with Trees

The basic idea behind predicting with trees is that if we have a lot of variables that we want to use to predict an outcome, we can take each of those variables and use it to split the outcome into different groups. As we split the outcomes into groups, we evaluate the homogeneity of the outcome within each group, and we continue splitting until the outcomes are separated into groups that are homogeneous enough, or small enough.

PROS

  1. Easy to interpret
  2. Tends to give better performance in non-linear settings than linear regression models

CONS

  1. Without pruning or some form of cross-validation, this method can lead to overfitting
  2. Uncertainty is harder to estimate than in the linear regression setting
  3. In general, the results may vary depending on the exact values of the parameters, or the variables that we collected

The basic algorithm behind building these trees is as follows:

  1. Start with all variables in one group
  2. Find the variable/split that best separates the outcomes
  3. Divide the data into two groups (“leaves”) on that split (“node”)
  4. Within each split, find the best variable/split that separates the outcomes
  5. Continue until the groups are too small or sufficiently “pure” or sufficiently homogeneous

There are different measures of impurity, and they are all based on the class probabilities that we can estimate within a particular group. The misclassification error is 1 minus the probability of the most common class in that particular leaf. Zero (0) refers to perfect purity, and 0.5 means no purity: if the leaf is perfectly balanced between two different outcomes, then we have no purity.

Similarly, there is another measure called the Gini index (not to be confused with the Gini coefficient). It is 1 minus the sum of the squared probabilities of belonging to each of the different classes. Again, zero means perfect purity, and 0.5 means no purity.

Likewise, there is another measure called deviance/information. It is the deviance if we use the logarithm with base e in the formula, and information if we use the logarithm with base 2. It is minus the sum, over classes, of the probability of being assigned to class k in leaf m times the log (base e or base 2) of that same probability. Here as well, 0 means perfect purity, but 1 means no purity at all.
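
In symbols (notation introduced here for convenience), writing $\hat{p}_{mk}$ for the estimated proportion of class $k$ among the observations in leaf $m$, the three measures described above are:

  1. Misclassification error: $1 - \hat{p}_{m\,k(m)}$, where $k(m)$ is the most common class in leaf $m$
  2. Gini index: $1 - \sum_{k} \hat{p}_{mk}^{2}$
  3. Deviance/information: $-\sum_{k} \hat{p}_{mk} \log_{e} \hat{p}_{mk}$ (deviance) or $-\sum_{k} \hat{p}_{mk} \log_{2} \hat{p}_{mk}$ (information)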

Let’s take an example. Suppose we are trying to put blue and red balls into two different boxes (16 of each), all red in one and all blue in the other, and one algorithm classifies 15 out of 16 correctly. Then:

  1. What is the misclassification error?
  2. What is the Gini index?
  3. What is the information?

Likewise, if an algorithm correctly classifies only 8 out of 16 balls, then:

  1. What is the misclassification error?
  2. What is the Gini index?
  3. What is the information?
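
As a quick check, these quantities can be computed directly from the definitions above (a small sketch; the two-element vector p holds the class proportions within a box):

p <- c(15, 1)/16                  # a box with 15 of the 16 balls classified correctly
1 - max(p)                        # misclassification error: 1/16 = 0.0625
1 - sum(p^2)                      # Gini index: about 0.117
-sum(p * log2(p))                 # information (log base 2): about 0.337

p <- c(8, 8)/16                   # a box with only 8 of the 16 balls correct
1 - max(p)                        # misclassification error: 0.5
1 - sum(p^2)                      # Gini index: 0.5
-sum(p * log2(p))                 # information: 1, i.e., no purity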

Let’s go ahead and see an example using the IRIS dataset.

library(caret)
data(iris)
library(ggplot2)
names(iris) #Checking data for the variables
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
table(iris$Species) 
## 
##     setosa versicolor  virginica 
##         50         50         50

“Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width” and “Species” are the variables in the dataset. We are trying to predict Species. There are 50 of each of the setosa, versicolor, and virginica species.

We are now going to break the dataset into a training set and a testing set for prediction, using a 70%/30% split.

inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training)
## [1] 105   5
dim(testing)
## [1] 45  5

Altogether, we have 105 examples that we can use to build the prediction model, and 45 examples to test and validate the model.

The first thing we are going to do is to plot the Petal.Width against Sepal.Width and color code them according to the species.

qplot(Petal.Width, Sepal.Width, colour=Species, data=training)

We can see that there are three distinct clusters in the plot. As the clusters are relatively well separated from each other, it is a comparatively easy classification problem. It might be a little challenging for a linear model, but not very challenging for classification trees.

We are going to fit the model using the ‘caret’ library, training it with the ‘train’ function. The outcome is ‘Species’ and we are using all the other variables in the data as predictors (tilde followed by a period). We are using the ‘rpart’ method on the training dataset.

modFit<-train(Species ~ ., method="rpart", data=training)
print(modFit$finalModel)
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 35  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Length< 4.75 32  0 versicolor (0.00000000 1.00000000 0.00000000) *
##     7) Petal.Length>=4.75 38  3 virginica (0.00000000 0.07894737 0.92105263) *

We printed the final model, which shows each node and its split, the number of observations in it, and the class probabilities within each node (terminal nodes, i.e., leaves, are marked with an asterisk).

For example, there is a split that says Petal.Length is less than 2.45, and every observation that satisfies it belongs to setosa.

We can plot the model fit to make these results more obvious. Here, we are plotting the final model that we produced. It will produce a dendrogram, and we can follow the left and right branches based on the split conditions.

plot(modFit$finalModel, uniform=TRUE, main="Classification Tree")
text(modFit$finalModel, use.n=TRUE, all=TRUE, cex=0.8)

Based on the dendrogram, we can say that if Petal.Length is less than 2.45, the species is going to be setosa; in our training set, there are 35 such cases. Otherwise, if Petal.Length is less than 4.75, the species is going to be versicolor, and if it is larger than that, it is going to be virginica.

If we want to create a fancier tree plot, we can use the ‘rattle’ package. The result is much easier to read and understand.

library(rattle)
fancyRpartPlot(modFit$finalModel)

At the root, an observation has a 33% chance of being in any of the three species. If Petal.Length is less than 2.45, the observation is classified as setosa (that node is perfectly pure in the training data). Otherwise, if Petal.Length is less than 4.75, it is classified as versicolor (again a pure node in the training data); if not, it is classified as virginica, with about 92% purity. The three terminal nodes contain roughly 33%, 30%, and 36% of the training observations, respectively.

Now, we can apply the final model to the testing set. As usual, we use the ‘predict’ function from the caret package, passing the model fit from the training set to predict the species in the testing set.

predict(modFit, newdata=testing)
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor virginica  versicolor virginica  versicolor
## [25] versicolor virginica  versicolor versicolor versicolor versicolor
## [31] virginica  versicolor virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica

The prediction shows the particular class label for each of the 45 data points in the testing set, because the classification tree was built to predict a particular class.
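
To quantify how well the tree does on the held-out data, one option is caret’s confusionMatrix function. A minimal sketch (the exact accuracy depends on the random partition, so no output is shown here):

confusionMatrix(predict(modFit, newdata=testing), testing$Species)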

Notes and Further Resources

  1. Classification trees are not linear models
  2. They use interaction between variables
  3. Data transformations may be less important
  4. Trees can also be used for regression problems (continuous outcome); a minimal sketch follows this list
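
As a sketch of that last point, the same rpart method can fit a regression tree to a continuous outcome. Here Sepal.Length is used as the outcome purely for illustration, reusing the training set from above:

regTree<-train(Sepal.Length ~ ., method="rpart", data=training) # splits now reduce RSS rather than impurity
print(regTree$finalModel)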

Bootstrap Aggregating (Bagging)

Bootstrap aggregating is based on a very simple idea: take the data and resample them, similar to bootstrapping. After resampling the cases with replacement, we recalculate our prediction function on each resampled dataset. We then either average the predictions from all these repeated predictors, or take a majority vote (or something similar) in the case of classification.

This method is useful when we fit complicated models. When we average all of those models together, we get a smoother model fit that gives a better balance between potential bias and variance. We get a bias similar to what we would get from fitting any one of those models individually, but reduced variability, because we have averaged a bunch of different predictors together. This is most useful for non-linear functions.

We are going to see an example with smoothing, but bagging is also very useful for things like predicting with trees. Let’s use the ozone dataset and work on it:

#library(ElemStatLearn)
data(ozone, package="ElemStatLearn")
ozone<-ozone[order(ozone$ozone), ]#ordering the data by the outcome
head(ozone)
##     ozone radiation temperature wind
## 17      1         8          59  9.7
## 19      4        25          61  9.7
## 14      6        78          57 18.4
## 45      7        48          80 14.3
## 106     7        49          69 10.3
## 7       8        19          61 20.1

Note that the ElemStatLearn package has been archived from CRAN, so it may need to be installed from the CRAN archive for this code to run.

We can see that there are four variables: ozone, radiation, temperature, and wind. We are going to predict temperature as a function of ozone.

The first thing we can do is to:

  1. create a matrix with 10 rows and 155 columns
  2. run a loop that resamples the data 10 different times
  3. within the loop, create a new dataset ‘ozone0’, which is just the resampled subset of the original data
  4. reorder ‘ozone0’ by the ozone variable each time
  5. fit a loess curve each time. This is a smooth curve that we can fit through the data, so we are fitting a smooth curve relating temperature to ozone. Each time we use the resampled dataset, and we use a common span, which controls how smooth the fit will be.
  6. predict, for every loess curve, the outcome on a new dataset with the exact same ozone values. The ith row of this object is then the prediction from the loess curve fit to the ith resample of the data.

All in all, we’ve resampled the dataset ten different times, fit a smooth curve through it each of those ten times, and then averaged those values.

ll<-matrix(NA, nrow=10, ncol=155)
for(i in 1:10){
  ss<-sample(1:dim(ozone)[1], replace=T)
  ozone0<-ozone[ss,]
  ozone0<-ozone0[order(ozone0$ozone), ]
  loess0<-loess(temperature ~ ozone, data=ozone0, span = 0.2)
  ll[i,]<-predict(loess0,newdata=data.frame(ozone=1:155))
}

We haven’t asked for any output yet, so we don’t see anything. Now, let’s plot ozone against temperature.

plot(ozone$ozone, ozone$temperature, pch=19, cex=0.5)
for(i in 1:10){
  lines(1:155, ll[i,], col="grey", lwd=2)}
lines(1:155, apply(ll, 2, mean), col="red", lwd=2)

Each black dot represents an observation; each gray line represents a fit to one resampled dataset. As can be seen, the gray lines capture a lot of the variability in the dataset, but they also over-fit some of it: they look a little too curvy. When we average them together, we get something smoother and closer to the middle of the data (the red line). The red line is called the bagged loess curve; it is the average of multiple loess curves fit to the same dataset, resampled every time.

There are proofs that bagged estimates always have lower variability but similar bias compared to the individual model fits. In the caret package there are some models that perform bagging for us: with the train function we can set the method to bagEarth, treebag, or bagFDA.
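
For example, a minimal sketch of bagged trees via train’s built-in treebag method (this assumes the ipred package, which treebag relies on, is installed; output omitted):

treebagFit<-train(temperature ~ ozone, data=ozone, method="treebag")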

Alternatively, we can build our own bag function with the caret package. If we are going to do this ourselves, it is important that we read the documentation carefully (http://www.inside-r.org/packages/cran/caret/docs/nbBag) before we do so.

Let’s see an example. The idea is:

  1. we take the predictor variables from the dataset and put them in a separate data frame. In the example below, our predictor is the ozone variable in the ozone dataset.
  2. The outcome is the temperature variable, and
  3. We pass these through the bag function available in the caret package, specifying the predictors from the data frame above, the outcome variable, the number of replications (e.g., B=10, meaning 10 bootstrap replications), and a bagControl that says how to fit, predict, and aggregate.

predictors=data.frame(ozone=ozone$ozone)
temperature=ozone$temperature
treebag<-bag(predictors, temperature, B=10,
             bagControl=bagControl(fit=ctreeBag$fit,
                                   predict=ctreeBag$pred,
                                   aggregate=ctreeBag$aggregate))

Let’s create a custom bagged version of conditional regression trees and plot the results:

  1. Place temperature on the y-axis and ozone on the x-axis. The little gray dots represent the actual values.
  2. The red dots represent the fit from a single conditional regression tree. The red dots seem flat even though we can see a clear upward trend in the data points.
  3. When we average over the 10 different models, the new fit (represented by the blue dots) seems to capture the upward trend.

plot(ozone$ozone, temperature, col='lightgrey', pch=19)
points(ozone$ozone, predict(treebag$fits[[1]]$fit, predictors), pch=5, col="red")
points(ozone$ozone, predict(treebag, predictors), pch=2,col="blue")

Now we are going to look at some of the parts of the bagging functions. In this particular case, we are using the ‘ctreeBag’ functions from the ‘caret’ package.

ctreeBag$fit
## function (x, y, ...) 
## {
##     loadNamespace("party")
##     data <- as.data.frame(x, stringsAsFactors = TRUE)
##     data$y <- y
##     party::ctree(y ~ ., data = data)
## }
## <bytecode: 0x000000002064a088>
## <environment: namespace:caret>

Based on the output, we can say that it basically uses the ctree function from the party package to train a conditional inference tree on the dataset.
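
To see what a single conditional inference tree looks like on these data, we can also call ctree directly; a small sketch, assuming the party package is installed:

library(party)
singleTree<-ctree(temperature ~ ozone, data=ozone) # one conditional inference tree, no bagging
plot(singleTree)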

ctreeBag$pred
## function (object, x) 
## {
##     if (!is.data.frame(x)) 
##         x <- as.data.frame(x, stringsAsFactors = TRUE)
##     obsLevels <- levels(object@data@get("response")[, 1])
##     if (!is.null(obsLevels)) {
##         rawProbs <- party::treeresponse(object, x)
##         probMatrix <- matrix(unlist(rawProbs), ncol = length(obsLevels), 
##             byrow = TRUE)
##         out <- data.frame(probMatrix)
##         colnames(out) <- obsLevels
##         rownames(out) <- NULL
##     }
##     else out <- unlist(party::treeresponse(object, x))
##     out
## }
## <bytecode: 0x000000002064aad0>
## <environment: namespace:caret>

The prediction function takes in an object and a new dataset x and gives us a new prediction. It calculates the tree response (the outcome) from the object and the new data each time. It then calculates a probability matrix and returns either the probabilities for the observed levels that it predicts, or just the predicted response.

ctreeBag$aggregate
## function (x, type = "class") 
## {
##     if (is.matrix(x[[1]]) | is.data.frame(x[[1]])) {
##         pooled <- x[[1]] & NA
##         classes <- colnames(pooled)
##         for (i in 1:ncol(pooled)) {
##             tmp <- lapply(x, function(y, col) y[, col], col = i)
##             tmp <- do.call("rbind", tmp)
##             pooled[, i] <- apply(tmp, 2, median)
##         }
##         if (type == "class") {
##             out <- factor(classes[apply(pooled, 1, which.max)], 
##                 levels = classes)
##         }
##         else out <- as.data.frame(pooled, stringsAsFactors = TRUE)
##     }
##     else {
##         x <- matrix(unlist(x), ncol = length(x))
##         out <- apply(x, 1, median)
##     }
##     out
## }
## <bytecode: 0x0000000020648a60>
## <environment: namespace:caret>

The aggregation then takes those values and combines them. As seen in the result above, the aggregate function collects the prediction from every single one of the model fits, across a large number of observations, binds them together into one matrix with each row equal to the prediction from one model, and then takes the median at every value. In other words, it takes the median prediction from the different model fits across all the bootstrap samples.
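
To connect this back to the earlier loess example, the same aggregation idea could be applied to the ll matrix of resampled predictions, taking the column-wise median instead of the mean used before (a sketch, not part of the original code):

medianFit<-apply(ll, 2, median, na.rm=TRUE) # median prediction across the 10 resampled loess fits
lines(1:155, medianFit, col="blue", lwd=2)  # adds to the plot above if it is still open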

In summary, bagging is a very useful method for nonlinear models, and it is widely used, often with trees; random forests can be thought of as an extension of this idea. Several methods in caret’s train function perform bagging for us, and we can also build our own bagging functions for any classification or prediction algorithm we would like to use.

To reiterate, the basic idea is to:

  1. resample our data,
  2. refit our nonlinear model, and
  3. average those model fits together over resamples to get a smoother model fit than we would have gotten from any individual fit on its own.

Predicting through Random Forests

Random forests are an extension of bagging for classification and regression trees. The basic idea is very similar to bagging: we bootstrap samples, i.e., take resamples of our observed training data, and rebuild classification or regression trees on each of those bootstrap samples. The one difference is that at each split, we also bootstrap the variables; in other words, only a random subset of the variables is considered at each potential split. This makes for a diverse set of potential trees. The idea is to grow a large number of trees and then either vote or average across them to get the prediction for a new outcome.

PROS

  1. This approach is often among the most accurate prediction methods

CONS

  1. It is quite slow because it has to build a large number of trees
  2. It can be hard to interpret, because we often have a large number of trees averaged together; each tree is built from a bootstrap sample with bootstrapped variables at each node, which makes the averaged model a lot harder to understand.
  3. It can also lead to overfitting, which further complicates interpretation. Thus, it is important that we use cross-validation when building random forests (a minimal sketch follows this list).
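
A minimal sketch of that last point: telling train to use k-fold cross-validation instead of the default bootstrap resampling (the choice of 5 folds here is arbitrary and just for illustration):

cvCtrl<-trainControl(method="cv", number=5)
modFitCV<-train(Species ~ ., data=training, method="rf", trControl=cvCtrl)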

Let’s see an example:

data(iris)
library(ggplot2)
set.seed(123)
inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training)
## [1] 105   5
dim(testing)
## [1] 45  5

We invoked the required libraries and split the data into training and testing sets. The dimensions show that we did it correctly.

Now, let’s create a random forest model.

library(caret)
library(randomForest)
modFit<-train(Species~., data=training, method="rf", prox=TRUE)
modFit
## Random Forest 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9488042  0.9221447
##   3     0.9468995  0.9192369
##   4     0.9427645  0.9129347
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

We can see that there are 105 observations, 4 predictors, and 3 classes. The data were resampled 25 times using the bootstrap method, with a total of 105 data points included in each resample.

Based on these statistics, the final model used mtry = 2 because it had the highest accuracy (0.9488).
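
If we instead wanted to skip the tuning search and fit with a fixed mtry, train accepts a one-row tuning grid; a sketch, with mtry = 2 simply mirroring the value selected above:

modFit2<-train(Species ~ ., data=training, method="rf", tuneGrid=data.frame(mtry=2), prox=TRUE)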

Now, we can look at specific trees in the final model. We are going to extract the 2nd tree (k=2), as well as the 15th (k=15), just to compare:

getTree(modFit$finalModel, k=2)
##   left daughter right daughter split var split point status prediction
## 1             2              3         4        0.70      1          0
## 2             0              0         0        0.00     -1          1
## 3             4              5         4        1.65      1          0
## 4             0              0         0        0.00     -1          2
## 5             6              7         3        4.85      1          0
## 6             8              9         2        3.10      1          0
## 7             0              0         0        0.00     -1          3
## 8             0              0         0        0.00     -1          3
## 9             0              0         0        0.00     -1          2
getTree(modFit$finalModel, k=15)
##    left daughter right daughter split var split point status prediction
## 1              2              3         4        0.70      1          0
## 2              0              0         0        0.00     -1          1
## 3              4              5         4        1.70      1          0
## 4              6              7         3        5.35      1          0
## 5              8              9         3        4.90      1          0
## 6              0              0         0        0.00     -1          2
## 7              0              0         0        0.00     -1          3
## 8             10             11         1        5.95      1          0
## 9              0              0         0        0.00     -1          3
## 10             0              0         0        0.00     -1          2
## 11             0              0         0        0.00     -1          3

We got two tables that show, for each node of the 2nd and 15th trees, the left daughter, the right daughter, which variable is split on, where it is split, and what the prediction will be at that node.

Now let’s use the class “centers” information to see what the prediction would be:

  1. We plot Petal.Length (y-axis) against Petal.Width (x-axis), colour coded by species.
  2. We then use the classCenter function, with the proximity information from the final model, to identify the centers of the predicted values for each class and overlay them on the plot.

irisP<-classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
irisP<-as.data.frame(irisP)
irisP$Species<-rownames(irisP)
p<-qplot(Petal.Width, Petal.Length, col=Species, data=training)
p+geom_point(aes(x=Petal.Width, y=Petal.Length, col=Species), size=7, shape=4, data=irisP)

Each small dot represents a data point, and the large X symbols mark the class centers: for each species, the center of the predicted values sits right in the middle of the cloud of points corresponding to that particular species.

We can then predict new values using the predict function.

pred<-predict(modFit, testing)
testing$predRight<-pred==testing$Species
table(pred, testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         2
##   virginica       0          1        13

We passed our model fit and the testing dataset to predict. We also created a variable, predRight, which records whether we got the prediction right, i.e., whether our prediction equals the Species in the testing dataset. We then made a table of our predictions versus the true species. As we can see, we missed three (two plus one) values with our random forest model, but overall it was highly accurate.

We can then look at which three observations we missed.

qplot(Petal.Width, Petal.Length, colour=predRight, data=testing, main="Newdata Predictions")

And perhaps unsurprisingly, we can see that the three points we missed, marked in red in the plot, are the ones lying in between two separate classes.

We can use this tool to explore, and see where our prediction is doing well and where it is doing poorly.