Spam Filter with Random Forests


Overview

For this project we shall attempt to build a spam filter. Spam filters all share the same basic objective: to keep unwanted emails out of users' inboxes. However, there are several different types of spam filters, each using a different filtering method to home in on spam. We shall implement our own version of spam filtering.

Constructing a spam filter usually entails a lot of text processing on a huge number of emails. For our use case, we shall use a feature matrix that was taken from an example resource online. The rows of the feature matrix correspond to individual emails, and the columns give the number of times each word or character appears in that email, along with three numerical measures of capitalisation (average length of a run of consecutive capitals, longest run of consecutive capitals, and total number of capital letters). The response \(Y\) is the user-supplied label marking each email as spam (\(Y = 1\)) or not (\(Y = 0\)).

Setup

We read in the R data set, which is a list containing the following items:

load("spam.Rdata")   # loads a list named `spam`
names(spam)
## [1] "train"            "Xmat"             "XdataF"           "Y"               
## [5] "covariate_labels" "column_labels"
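
Before splitting, it is worth a quick look at the pieces themselves. The following is a small sketch (the exact dimensions depend on this particular spam.Rdata file):

dim(spam$XdataF)   # one row per email, one column per word/character/capitalisation feature
table(spam$Y)      # 0 = not spam, 1 = spam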

We now construct training and testing data sets so that we can assess the accuracy of our model: after the model has been fit using the training set, we evaluate it by making predictions on the test set. We then inspect the newly created test and training sets, checking the dimensions and that all is in order.

train <-  spam$train
test  <-  !train

# Predictors (data)
X <-  spam$XdataF[train,]
X_test <-  spam$XdataF[test,]

# Response (label)
Y <-  factor(spam$Y[train])
Y_test  <-  factor(spam$Y[test])

Our test set contains 246 observations/data points and our training set has 4355, which is roughly a \(5\%\) test data split.
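
As a quick sanity check of these counts, the small sketch below tallies the split (it assumes train is a logical indicator vector, as its use with ! above implies):

# Verify the split sizes quoted above
c(n_train = sum(train), n_test = sum(test), test_fraction = round(mean(test), 3))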

We write a function that we shall use to compute the misclassification error, i.e., the percentage of classifications that were incorrect.

# Misclassification Function
misClass = function(pred.class, true.class, produceOutput = FALSE){
  confusion.mat = table(pred.class, true.class)
  if(produceOutput){
    # Return the error rate: 1 - (correct classifications / total classifications)
    return(1 - sum(diag(confusion.mat))/sum(confusion.mat))
  }
  else{
    print('miss-class')
    print(1 - sum(diag(confusion.mat))/sum(confusion.mat))
    print('confusion mat')
    print(confusion.mat)
  }
}
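
As a quick illustration of its output, here is a toy call on made-up labels (not the spam data); one of the four predictions is wrong, so the reported error is 0.25:

misClass(pred.class = c(0, 1, 1, 0), true.class = c(0, 1, 0, 0))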

The packages for this project are shown and loaded below.

#Load Packages
if(!require('randomForest')){install.packages('randomForest');require('randomForest')}
library(randomForest)

Random Forests Background

TL;DR: a random forest is a form of ensemble algorithm in which many decision trees (a forest) have their results aggregated into one final prediction. It uses bagging and feature randomness (at each split) when building each individual tree, in order to create an uncorrelated forest of trees.

Bagging

A downfall of decision trees is that they suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results we get could be quite different. In contrast, a procedure with low variance will yield similar results when applied repeatedly to distinct data sets. We can overcome this by introducing bagging.

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.

  • The key idea behind bagging is that averaging a set of observations reduces variance.

So a natural way to reduce the variance and increase the test set accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. Since we do not have multiple training sets available to us, we bootstrap instead, taking repeated samples from the (single) training data set. This whole process is bagging: construct \(B\) trees using \(B\) bootstrapped training sets and aggregate the resulting predictions (averaging for regression, majority vote for classification).

These trees are grown deep and are not pruned. Hence each individual tree has high variance but low bias, and averaging the \(B\) trees reduces the variance. The number of trees \(B\) is not a critical parameter with bagging; using a very large value of \(B\) will not lead to overfitting (in practice we use a value of \(B\) sufficiently large that the error has settled down).
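
As a quick check of this "error settles down" behaviour, the sketch below fits a bagged forest on the X and Y created in the setup (mtry = ncol(X) makes every predictor available at each split, i.e. pure bagging; ntree = 500 and the seed are just illustrative choices) and plots the out-of-bag error, introduced in the next subsection, against the number of trees:

# Sketch: the OOB error curve typically flattens out as more trees are added
set.seed(1)
bag_check <- randomForest(X, Y, mtry = ncol(X), ntree = 500)
plot(bag_check, main = "OOB error vs number of bagged trees")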

  • Bagged trees can suffer from poor results, because at each split in every tree all the predictors are considered. If one very strong predictor is present, most or all of the trees will use this strong predictor in the top split, and so all of the bagged trees will look quite similar to each other (correlated with one another).

  • Correlated trees, when aggregated or averaged, give little variance reduction and so lead to poor results.

Out of Bag Error Estimation

We can estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach. Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. On average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the \(i\)th observation using each of the trees in which that observation was OOB. In order to obtain a single prediction for the \(i\)th observation, we can average these predicted responses (if regression is the goal) or can take a majority vote (if classification is the goal).
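
To see this machinery concretely, the sketch below fits a forest on the X and Y from the setup (ntree = 100 and the seed are arbitrary) and inspects the OOB bookkeeping that the randomForest object carries:

# Sketch: OOB bookkeeping of a fitted forest
set.seed(1)
rf_oob <- randomForest(X, Y, ntree = 100)
head(rf_oob$oob.times)          # how many times each observation was out-of-bag
head(rf_oob$votes)              # OOB vote fractions for class 0 and class 1
rf_oob$err.rate[100, "OOB"]     # cumulative OOB error estimate after all 100 trees
mean(rf_oob$predicted != Y)     # essentially the same estimate, via the OOB majority-vote predictions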

Random Forests

An ensemble method is an approach that combines many simple “building block” models in order to obtain a single and potentially very powerful model. These simple building block models are sometimes known as weak learners, since they may lead to mediocre predictions on their own. We have already touched on bagging. We shall now look at Random Forests. These (two) are ensemble methods for which the simple building block is a regression or a classification tree.

Random forests provide an improvement over bagged trees by way of a small tweak that de-correlates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors. The split is allowed to use only one of those \(m\) predictors. A fresh sample of \(m\) predictors is taken at each split, and typically we choose \(m \approx \sqrt{p}\) - that is, the number of predictors considered at each split (\(m\)) is approximately equal to the square root of the total number of predictors.

In other words, in building a random forest, at each split in the tree, the algorithm is not even allowed to consider a majority of the available predictors. The rationale for this is simple. Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors. Then in the collection of bagged trees, most or all of the trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar to each other. Hence the predictions from the bagged trees will be highly correlated. Averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting. Random forests overcome this problem by forcing each split to consider only a subset of the predictors. We can think of this process as de-correlating the trees, thereby making the average of the resulting trees less variable and hence more reliable.

Random Forests and Bagging

The main difference between bagging and random forests is the choice of predictor subset size \(m\). For instance, if a random forest is built using \(m = p\), then this amounts simply to bagging.

Note that using a small value of \(m\) in building a random forest will typically be helpful when we have a large number of correlated predictors.
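
A quick side-by-side comparison makes the \(m \approx \sqrt{p}\) versus \(m = p\) distinction concrete. This is a sketch using the X and Y from the setup; ntree = 200 and the seed are arbitrary but sufficient choices:

# Sketch: OOB error with m = sqrt(p) (the classification default) versus m = p (bagging)
set.seed(1)
p       <- ncol(X)
rf_sqrt <- randomForest(X, Y, mtry = floor(sqrt(p)), ntree = 200)
rf_bag  <- randomForest(X, Y, mtry = p,              ntree = 200)
c(random_forest = rf_sqrt$err.rate[200, "OOB"],
  bagging       = rf_bag$err.rate[200, "OOB"])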

Deciding on Number of Iterations

When using random forests, it is common to iteratively compute batches of random trees until the OOB error rate stabilizes. The following function implements this.

checkNumberItersRF = function(ntrees = 10, tolParm = 1, maxIter = 10, verbose = 0){
  # initialize empty lists
  misClass_out   = list()
  totalTrees_out = list()
  
  # initialize other variables
  n              = nrow(X)
  votes          = matrix(0,nrow=n,ncol=2)
  totalTrees     = 0
  iterations     = 0
  misClass_old   = 1
  
  while(iterations < maxIter){
    votes[is.nan(votes)] = 0
    iterations    = iterations + 1
    totalTrees    = totalTrees + ntrees
    if(verbose >= 2){cat('Total trees: ',totalTrees,'\n')}
    out.rf        = randomForest(X, Y,ntree = ntrees)
    
    oob.times        = out.rf$oob.times
    votes_iterations = out.rf$votes*oob.times
    votes[oob.times>0,] = matrix(votes + votes_iterations,nrow=n)[oob.times>0,]
    if(min(apply(votes,1,sum)) == 0){next}        # Some observation has no OOB votes yet, so skip to the next batch
    
    Yhat          = apply(votes,1,which.max) - 1
    misClass_new  = misClass(Yhat,Y,produceOutput = TRUE)
    misClass_out[[iterations]]   = misClass_new
    totalTrees_out[[iterations]] = totalTrees
    percentChange = 100*(misClass_new - misClass_old)/misClass_old  # percent change relative to the previous error
    if(verbose >= 1){cat('% change: ',percentChange,'\n')}
    if(percentChange > -tolParm){break}
    misClass_old = misClass_new
  }
  if(iterations == maxIter){
    stop("too many iterations, try a larger ntrees or maxIter value") #Stops the function and produces an error message
  }
  return(list('misClass' = unlist(misClass_out),
              'totalTree' = unlist(totalTrees_out)))
}

Note that tolParm is the tolerance: iterations continue until the percent decrease in OOB misclassification error is less than tolParm. The parameter maxIter is the maximum number of iterations the function will attempt; this is important whenever a while loop is used, to prevent an infinite loop. The parameter verbose is a conventional parameter specifying how much printed output should be produced, with larger values corresponding to more output.

out.checkNumberItersRF = checkNumberItersRF(maxIter = 10, verbose = 0)
plot(out.checkNumberItersRF$totalTree,out.checkNumberItersRF$misClass,
     type='l', xlab = "Total Trees", ylab = "Misclassification Error")

Since the reported \(B\) will vary from run to run (the forest-building procedure is random), for this particular run we get an ensemble size of:

print(max(out.checkNumberItersRF$totalTree))
## [1] 70

Accuracy and Precision Measures for the Random Forest

For the random forest that we have chosen, let's have a look at the test misclassification rate, sensitivity, specificity, precision, recall, and confusion matrix.

ntrees = max(out.checkNumberItersRF$totalTree)
out.rf = randomForest(X, Y, ntree = ntrees)
class.rf = predict(out.rf,X_test,type='class')
confusionMat = table(class.rf,Y_test)

sensitivity = confusionMat[2,2]/sum(confusionMat[,2])
paste("Sensitivity: ", sensitivity)
## [1] "Sensitivity:  0.902654867256637"
specificity = confusionMat[1,1]/sum(confusionMat[,1])
paste("Specificity: ", specificity)
## [1] "Specificity:  0.992481203007519"
recall = sensitivity
paste("Recall: ", recall)
## [1] "Recall:  0.902654867256637"
precision = confusionMat[2,2]/sum(confusionMat[2,])
paste("Precision: ", precision)
## [1] "Precision:  0.990291262135922"

These measures are popular and commonly used in data science to evaluate the models you build. Here we calculated them manually using the formulas below.

  • Precision: Out of all the examples that predicted as positive, how many are really positive?

    • Precision \(= \frac{TP}{TP+FP}\)
  • Recall: Out of all the positive examples, how many are predicted as positive?

    • Recall \(= \frac{TP}{TP+FN}\)
  • Specificity: Out of all the truly negative examples, how many are predicted as negative?

    • Specificity \(= \frac{TN}{TN+FP}\)
  • Sensitivity: Out of all the truly positive examples, how many are predicted as positive?

    • Sensitivity \(= \frac{TP}{TP + FN}\)

For example, the worked sketch below plugs the test-set counts into these formulas and recovers the reported values.
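The counts come from the test-set confusion matrix for our random forest (TN = 132, FP = 1, FN = 11, TP = 102; the full table is printed in the comparison section further below):

# Worked example: recover the four measures from the raw test-set counts
TP <- 102; FP <- 1; FN <- 11; TN <- 132
c(precision   = TP / (TP + FP),   # 102/103 ~ 0.990
  recall      = TP / (TP + FN),   # 102/113 ~ 0.903
  sensitivity = TP / (TP + FN),   # identical to recall
  specificity = TN / (TN + FP))   # 132/133 ~ 0.992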

Variable Importance

One drawback of bagging is that it can be difficult to interpret the resulting model. Recall that one of the advantages of decision trees is the attractive and easily interpreted diagram that results. When we bag a large number of trees, however, it is no longer possible to represent the resulting statistical learning procedure using a single tree, and it is no longer clear which variables are most important to the procedure. Thus, bagging improves prediction accuracy at the expense of interpretability.

Although the collection of bagged trees is much more difficult to interpret than a single tree, one can obtain an overall summary of the importance of each predictor using the RSS (for regression trees) or the Gini index (for classification trees). In the case of regression trees, we can record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all \(B\) trees. A large value indicates an important predictor. Similarly, in the context of classification trees, we can add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all \(B\) trees.

out.rf = randomForest(X, Y, importance = TRUE, 
                      proximity=TRUE, ntree = ntrees)

varImpPlot(out.rf,type=1, main= "Variable Importance Plot")

Specifically, the varImpPlot() function creates a plot that displays the importance of each predictor variable in the final model (type = 1 plots the mean decrease in accuracy).
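
The numbers behind the plot can also be pulled out directly. The following is a sketch using the out.rf fitted just above (the top-10 cutoff is arbitrary):

# Sketch: the ten predictors with the largest mean decrease in accuracy
imp <- importance(out.rf)     # includes MeanDecreaseAccuracy and MeanDecreaseGini columns
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)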

Random Forest vs Pruned Tree

Here we shall fit an un-pruned classification tree to the training data and then prune it. Pruning generally reduces the complexity of the final classifier (tree), and hence improves predictive accuracy and generalisability by reducing over-fitting. We prune the tree via weakest-link pruning (i.e. using the cv.tree and prune.misclass pair of functions). We shall then plot both trees, the original and the pruned tree (shown below).

#Package Needed
require(tree)

#Original Tree
out.tree.orig   = tree(Y~.,data=X)

#Cut Tree
out.tree.cv     = cv.tree(out.tree.orig,FUN=prune.misclass)
# Among the sizes achieving the minimum CV deviance, choose the smallest tree
best.size       = out.tree.cv$size[max(which(out.tree.cv$dev == min(out.tree.cv$dev)))]
out.tree        = prune.misclass(out.tree.orig,best=best.size)
class.tree      = predict(out.tree,X_test,type='class')


par(mfrow=c(1,2))
#Original Tree
plot(out.tree.orig)
text(out.tree.orig)

#Pruned Tree
plot(out.tree)
text(out.tree)

The pruned tree is on the right and the original tree is on the left.

We compare the test and training misclassification rates for the un-pruned tree, the CV-pruned tree, and the random forest. We construct two random forests: one with mtry set to the default and another with mtry \(= p\).

# Bagged Tree (mtry = p)
out.bag = randomForest(X, Y, mtry = ncol(X))
class.bag = predict(out.bag,X_test,type='class')
plot(out.bag)                      

Note that, by default, the randomForest() function grows 500 trees and, for classification, considers \(\sqrt{p}\) randomly selected predictors as potential candidates at each split (\(p/3\) for regression). Here we set mtry to \(p\) (i.e., all predictors in the data set are considered at each split). We keep the number of trees, ntree, at its default value (500) for this bagged tree.
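
To confirm which values were actually used, here is a small check reading them off the fitted objects above:

# Sketch: mtry values stored in the fitted forests
c(random_forest_mtry = out.rf$mtry, bagged_mtry = out.bag$mtry, p = ncol(X))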

Let's compare the misclassification rates for the three different models, using the test data.

# Test Set, Prediction (Misclass Err)
misClass(class.rf,Y_test)
## [1] "miss-class"
## [1] 0.04878049
## [1] "confusion mat"
##           true.class
## pred.class   0   1
##          0 132  11
##          1   1 102
misClass(class.bag,Y_test)
## [1] "miss-class"
## [1] 0.05691057
## [1] "confusion mat"
##           true.class
## pred.class   0   1
##          0 129  10
##          1   4 103
misClass(class.tree,Y_test)
## [1] "miss-class"
## [1] 0.07317073
## [1] "confusion mat"
##           true.class
## pred.class   0   1
##          0 127  12
##          1   6 101

We can apply similar commands to retrieve the misclassification errors using the training data.

# Training Set, Prediction (Misclass Err)
class.tree.train   = predict(out.tree.orig,X,type='class')
class.treeCV.train = predict(out.tree,X,type='class')
class.bag.train    = predict(out.bag,X,type='class')
class.rf.train     = predict(out.rf,X,type='class')

# Miss Class Errors and Conf Matrices
misClass(class.tree.train,Y)
## [1] "miss-class"
## [1] 0.08633754
## [1] "confusion mat"
##           true.class
## pred.class    0    1
##          0 2497  218
##          1  158 1482
misClass(class.treeCV.train,Y)
## [1] "miss-class"
## [1] 0.08633754
## [1] "confusion mat"
##           true.class
## pred.class    0    1
##          0 2497  218
##          1  158 1482
misClass(class.rf.train,Y)
## [1] "miss-class"
## [1] 0.004592423
## [1] "confusion mat"
##           true.class
## pred.class    0    1
##          0 2655   20
##          1    0 1680
misClass(class.bag.train,Y)
## [1] "miss-class"
## [1] 0.0006888634
## [1] "confusion mat"
##           true.class
## pred.class    0    1
##          0 2654    2
##          1    1 1698