Random Forest Classifier for Spam Detection

Detecting Spam using a Random Forest Classifier with Parameter Tuning using Grid Search Determination

Random Forest Classifier

Random forest classifiers fall under the umbrella of decision trees. Decision trees are models that can be applied to both regression and classification problems.

With regression trees the predictor space, the set of all possible values for $X_1, X_2, ..., X_p$, is divided into $J$ distinct and non-overlapping regions, $R_1, R_2,...,R_j$ b y applying some decision rules. for every observation that falls into the region $R_j$, we make the same prediction, which is the mean of all the response values $Y$ for the training observations in $R_j$.

We choose to divide the predictor space into high-dimension rectangles with the goal being to find the boxes $R_1, R_2,...,R_j$ that minimize the RSS (residual sum of squares), given by

\[\sum_{j=1}^J\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2\]

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Because it is computationally infeasible to consider every partition we take a top-down, greedy approach called recursive binary splitting.

The approach begins at the top of the tree where all observations are in one region $R$ then successive splits of the predictor space that create two neww branches further down the tree. It is called greedy because at each step the best split is made at that step without considering future steps that may lead to a better tree.

To perform recursive binary splitting the predictor $X_j$ is selected along with cutoff point $s$ such that splitting the predictor space into regions $[x\vert X_j<s]$ and $[X\vert X_j\geq s]$ leads to the largest reduction in RSS.

In other words, for any $j$ and $s$ the half-planes \[R_1(j,s)=[X\vert X_j<s] \text{ and } R_2(j, s) = [X\vert X_j \geq s]\] are defined and we want the value of $s$ and $j$ that minimize \[\sum_{i: x_i\in R_1(j, s)}(j_i-\hat{y}_{R_1})^2 + \sum_{i: x_i\in R_2(j, s)}(j_i-\hat{y}_{R_1})^2\]

where $\hat{y}_{R_1}$ is the mean response for the training observations in $R_1(j, s)$ and $\hat{y}_{R_2}$ is the mean response for the training observations in $R_2(j, s)$. the process of splitting regions continues until a stopping criterion is met.

The process may lead to overfitting which can be resolved by first intentionally overgrowing a tree and then pruning it back to obtain an optimal subtree.

Classification trees differ from regression trees in that they are used to predict a qualitative response. The process is similar but RSS is not used. Instead the classification error rate is used. The classification error rate is the fraction of training observations in that region that do not belong to the most common class.
\[E = 1-\max_k(\hat{p}_{mk})\] where $\hat{p}_{mk}$ is the proportion of training observations in the $m$th region that are from the $k$th class.

Sometimes this is insufficient so either the Gini index or entropy are used where the Gini index is defined by \[G=\sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk})\] and entropy is defined \[D = -\sum_{k=1}^K\hat{p}_{mk}\cdot log (\hat{p}_{mk})\]

since $0\leq\hat{p}_{mk}\leq 1$, it follows that $0\leq \hat{p}_{mk}log\hat{p}_{mk}$.

Without detailing bootstrapping and bootstrap aggregating, or bagging, random forests produce an improvement over bagging by decorrelating the trees produced.

A large number of decision trees are built on bootstrapped random training samples. When a split is considered, a random sample of the $m$ predictors is chosen as the split candidate from the full set of $p$ predictors and only those $m$ predictors is used. At each split a new sample of $m$ predictors is taken and typically $m\approx\sqrt{p}$ is considered at each split.

Random forests eliminate the influence of particularly strong predictors by forcing each split to consider a random subset of the predictors. On average, $(p-m)/p$ of the splits will not even consider the strong predictor, so other predictors will have more opportunity. This decorrelates the tree making the average of the resulting trees less variable thus more reliable.

Building the Random Forest Model

This project uses grid search to tune a random forest model using the Spam7 data in library(DAAG).

Tasks:

Separate the data into a training set (70%) and a testing set (30%). Tune the model by the grid search method and report the testing error.
Plot for comparison the different settings of the hyperparameters

Set Up Your Project and Load Libraries

We need the ISLR, rpart, and rpart.plot. If you cannot install rpart.plot inside Rstudio, you can download the release file and install if from R console.

Create the testing and training sets for this project.

Using the spam7 data set from the DAAG library, use the set.seed() function for repeatability of results and build a randomized training set using 70% of the data and a test set with the remaining 30%.

library(DAAG)

## Loading required package: lattice

library(caret)

## Loading required package: ggplot2

library(ggplot2)
set.seed(123)
data(spam7)
dataset <- spam7
train = sample(4601, 3220)
training <- dataset[train, ]
testing <- dataset[-train, ]
sum(training$yesno == 'y')/sum(training$yesno == 'n')

## [1] 0.6445352

sum(testing$yesno == 'y')/sum(testing$yesno == 'n')

## [1] 0.6638554

str(dataset)

## 'data.frame':    4601 obs. of  7 variables:
##  $ crl.tot: num  278 1028 2259 191 191 ...
##  $ dollar : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ bang   : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ money  : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ n000   : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ make   : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ yesno  : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...

This dataframe contains the following columns: 1. crl.tot is the total length of words in capitals. 2. dollar number of occurrences of the dollar $ symbol. 3. bang is the total number of occurrences of the ! symbol. 4. money is the total number of occurrences of the word “money”. 5. n00 is the total number of occurrences of the the string ‘000’. 6. make is the number of occurrences of the word “make”. 7. yesno is the outcome variable, a factor with levels ‘n’, not spam, ‘y’ is spam.

Create the random forest model with tuning parameters

Run the model on the training set with set parameters

Produce a plot to compare different settings of the hyper-parameters.

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

bagged_spam7 <- randomForest(yesno ~., 
                             data = training, 
                             mtry = 6,
                             ntree = 2000,)
plot(bagged_spam7)

One choice when using the train() function is to set ‘method’ equal to ‘treebag’ but ‘treebag’ does not accept a tuning variable. Note that the ‘rf’ method using the train() function has only one tuning variable, ‘mtry’, which is the number of predictors used at each split.

Now, load the caret package.

library(caret)

Set up the variables for the grid search method and then run train() using the ‘rf’ (for random forest) method.

# Set up trainControl variable 
trControl <- trainControl(method = 'cv',
                          number = 10,
                          verbose = FALSE,
                          search = 'grid')

# Set up tuneGrid using the expand.grid variable
mtryGrid <- expand.grid(.mtry = c(2,3,4,5,6))

# Set up model using train
rf_Spam7 <- train(yesno ~. , 
                  data = training, 
                  method = "rf", 
                  metric = "Accuracy", 
                  trControl = trControl,
                  tuneGrid = mtryGrid)

# Plot the model output
plot(rf_Spam7)

Plot model using “Kappa” as the metric

plot(rf_Spam7, metric = "Kappa")

Make predictions with the Test set and show the error.

Expand upon the hyperparameter grid search model.

optimal_trees <- list()

hyperpa_grid <- expand.grid(nodesize = c(NULL, 5, 10),
                            maxnodes = c(9, 99),
                            ntree = c(100, 500, 1000))

tuneGrid <- expand.grid(.mtry = c(3, 6))

for ( i in 1:nrow(hyperpa_grid)) {
  set.seed(123)
  rf_maxtrees <- train(yesno ~. ,
                       data = training, 
                       method = "rf",
                       metric = "Accuracy", 
                       tuneGrid = tuneGrid,
                       trControl = trControl,
                       importance = TRUE,
                       nodesize = hyperpa_grid[i, 1],
                       maxnodes = hyperpa_grid[i, 2],
                       ntree = hyperpa_grid[i, 3])
  key <- paste(hyperpa_grid[i,1], 
               hyperpa_grid[i, 2],
               hyperpa_grid[i, 3],
               (rf_maxtrees$finalModel)$mtry, 
               sep = '-' )
  optimal_trees[[key]] <- rf_maxtrees
}
# (rf_maxtrees$finalModel)$mtry retrieves the best value of a number of variables

Generate a boxplot to display the cross-validation accuracy of the model across different parameter values.

results_mtry <- resamples(optimal_trees)

mtx_cv <- results_mtry$values[, seq(from = 2, to = 24, length.out = 12)]

colnames(mtx_cv) <- gsub(" ~.* ", "", colnames(mtx_cv))

library(ggplot2)
library(reshape2)
M2 <- melt(mtx_cv, measure.vars = colnames(mtx_cv))
M2 <- melt(mtx_cv, measure.vars = colnames(mtx_cv))

ggplot(M2, aes(x = variable,  y = value))+
  geom_boxplot()+
  xlab("Model")+
  ylab("Accuracy")+
  stat_summary(fun.y = mean, shape = 1, col = 'red', geom = 'point')+
  ggtitle("Boxplot of 10-fold Cross-Validation Accuracy for Random Forests")+
  theme(plot.title = element_text(hjust = 0.5))+
  coord_flip()

## Warning: `fun.y` is deprecated. Use `fun` instead.

Next, repeat the process without controlling maxnodes.

optimal_trees <- list()

hyperpa_grid <- expand.grid(nodesize = c(10, 20),
                            ntree = c(300, 500, 1000))

tuneGrid <- expand.grid(.mtry = c(3)) #the best value from the last step is 3

for (i in 1:nrow(hyperpa_grid)){
  set.seed(123)
  rf_maxtrees <- train(yesno ~. ,
                     data = training, 
                     method = 'rf', 
                     metric = 'Accuracy', 
                     tuneGrid = tuneGrid, 
                     trControl = trControl,
                     importance = TRUE, 
                     nodesize = hyperpa_grid[i, 1], 
                     ntree = hyperpa_grid[i, 2])
  key <- paste(hyperpa_grid[i, 1], hyperpa_grid[i, 2], sep = '_')
  optimal_trees[[key]] <- rf_maxtrees
}

results_mtry <- resamples(optimal_trees)
mtx_cv <- results_mtry$values[, seq(from = 2, to = 12, length.out = 6)]
colnames(mtx_cv) <- gsub(" ~.* ", "", colnames(mtx_cv))

library(ggplot2)
library(reshape2)
M2 <- melt(mtx_cv, measure.vars = colnames(mtx_cv))
M2 <- melt(mtx_cv, measure.vars = colnames(mtx_cv))

ggplot(M2, aes(x = variable, y = value))+
  geom_boxplot()+ xlab("Model")+ 
  ylab("Accuracy")+
  stat_summary(fun.y = mean, shape = 1, col = 'red', geom = 'point')+
  ggtitle("Boxplot of 10-fold Cross-Validation for Random Forests")+
  theme(plot.title = element_text(hjust = 0.05))+
  coord_flip()

## Warning: `fun.y` is deprecated. Use `fun` instead.

Use the receive operating characteristics curve (“ROC”) to select the optimal model.

optimal_trees = list()

hyperparameter.grid <- expand.grid(
maxnodes = c(5, 9, 50, nrow(training)),
ntree = c(100, 250, 500)
)

nr = nrow(hyperparameter.grid)
tuneGrid <- expand.grid(.mtry = c(2, 5)) # best value from last step is 5

for (i in 1:nrow(hyperparameter.grid)) {

    set.seed(999)
    rf_maxtrees <- train(yesno ~.,
                    data = training,
                    method = "rf",
                    metric = "Accuracy",
                    tuneGrid = tuneGrid,
                    trControl = trainControl(method = "cv",
                                                   number = 10),
                    importance = TRUE,
                    nodesize = 5,
                    maxnodes = hyperparameter.grid[i,1],
                    ntree = hyperparameter.grid[i,2])
    key <- paste(hyperparameter.grid[i,1], hyperparameter.grid[i,2], sep = '-')
    optimal_trees[[key]] <- rf_maxtrees
}

results_mtry <- resamples(optimal_trees)
summary(results_mtry)

## 
## Call:
## summary.resamples(object = results_mtry)
## 
## Models: 5-100, 9-100, 50-100, 3220-100, 5-250, 9-250, 50-250, 3220-250, 5-500, 9-500, 50-500, 3220-500 
## Number of resamples: 10 
## 
## Accuracy 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5-100    0.8136646 0.8364655 0.8602567 0.8540473 0.8680124 0.8913043    0
## 9-100    0.8447205 0.8582510 0.8740301 0.8705128 0.8819876 0.8944099    0
## 50-100   0.8633540 0.8766465 0.8833614 0.8838457 0.8928571 0.9037267    0
## 3220-100 0.8633540 0.8789757 0.8819864 0.8841610 0.8905280 0.9037267    0
## 5-250    0.8416149 0.8664596 0.8666663 0.8661572 0.8715969 0.8785047    0
## 9-250    0.8509317 0.8560372 0.8740253 0.8689552 0.8787877 0.8850932    0
## 50-250   0.8633540 0.8773292 0.8847352 0.8835447 0.8875113 0.9068323    0
## 3220-250 0.8571429 0.8786927 0.8818038 0.8819871 0.8890665 0.8975155    0
## 5-500    0.8540373 0.8664596 0.8697671 0.8701974 0.8770462 0.8881988    0
## 9-500    0.8482972 0.8582510 0.8724725 0.8680264 0.8770462 0.8819876    0
## 50-500   0.8633540 0.8803453 0.8850921 0.8838553 0.8881988 0.9006211    0
## 3220-500 0.8633540 0.8828529 0.8866497 0.8854033 0.8905280 0.8944099    0
## 
## Kappa 
##               Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5-100    0.6175166 0.6536682 0.7035802 0.6904347 0.7193636 0.7688774    0
## 9-100    0.6655032 0.6906223 0.7282020 0.7205603 0.7497463 0.7712112    0
## 50-100   0.7039204 0.7350993 0.7494560 0.7497011 0.7676661 0.7923099    0
## 3220-100 0.7047841 0.7398741 0.7469710 0.7508391 0.7651172 0.7923099    0
## 5-250    0.6698565 0.7095550 0.7148159 0.7141362 0.7257688 0.7419774    0
## 9-250    0.6779463 0.6866576 0.7278354 0.7164671 0.7374186 0.7563699    0
## 50-250   0.7039204 0.7365139 0.7515013 0.7490464 0.7576474 0.7993019    0
## 3220-250 0.6913652 0.7360685 0.7458832 0.7458230 0.7617820 0.7795527    0
## 5-500    0.6868153 0.7135752 0.7205886 0.7229777 0.7354373 0.7639681    0
## 9-500    0.6668491 0.6897220 0.7242577 0.7140859 0.7340964 0.7472527    0
## 50-500   0.7039204 0.7422561 0.7527474 0.7496641 0.7591623 0.7865429    0
## 3220-500 0.7047841 0.7466584 0.7559496 0.7531640 0.7646741 0.7751581    0

# modified code from the following website: https://www.guru99.com/r-random-forest-tutorial.html

Results

The best model in the output above for Accuracy shows an error of (1 - 0.9006211) = 0.0993789 or 9.94% (rounded).

randomForest.spam7 <- train(yesno ~.,
                    data = training,
                    method = "rf",
                    metric = "Accuracy",
                    tuneGrid = tuneGrid,
                    trControl = trainControl(method = "cv",
                                                   number = 10),
                    importance = TRUE
                    )
print(randomForest.spam7)

## Random Forest 
## 
## 3220 samples
##    6 predictor
##    2 classes: 'n', 'y' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2898, 2899, 2898, 2898, 2897, 2897, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8876023  0.7581148
##   5     0.8788979  0.7426942
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Store the Accuracy outputs in a data.frame()

seq(from = 3, to = 25, length.out = 25)

##  [1]  3.000000  3.916667  4.833333  5.750000  6.666667  7.583333  8.500000
##  [8]  9.416667 10.333333 11.250000 12.166667 13.083333 14.000000 14.916667
## [15] 15.833333 16.750000 17.666667 18.583333 19.500000 20.416667 21.333333
## [22] 22.250000 23.166667 24.083333 25.000000

mtx_cv = results_mtry$values[, seq(from = 3, to = 25, length.out = 25)]
                                                                    
colnames(mtx_cv) = c(paste(hyperparameter.grid[,1], hyperparameter.grid[,2], sep = '-'), 'default')

Create a box-plot to display the results of the different tuning parameters.

library(ggplot2)
library(reshape2)

M2 <- melt(mtx_cv, measure.vars = colnames(mtx_cv))

M2 <- melt(mtx_cv, measure.vars = colnames(mtx_cv))

ggplot(M2,aes(x = variable, y = value)) +
  geom_boxplot() + xlab("Model") +
   ylab("Accuracy & Kappa") +
     stat_summary(fun.y = mean,shape = 1,col = 'red',geom = 'point')+
       ggtitle("Boxplot of Cross Validation Accuracy over 10 Runs for random forests")+
         theme(plot.title = element_text(hjust = 0.5))

## Warning: `fun.y` is deprecated. Use `fun` instead.

Random Forest Classifier for Spam Detection

Chris Schmidt

11/18/2019

Detecting Spam using a Random Forest Classifier with Parameter Tuning using Grid Search Determination