Abstract

This research article covers building spam filters using several different machine learning models for training and testing. We will see that most of the challenges come from preparing the data and from optimizing training performance. Finally, we will assess the prediction results of each model, judging the highest performers on accuracy and time efficiency.

Data Wrangling

We will begin by importing the data, analyzing it, and then performing any data munging needed before applying spam filters.

Import with Text Mining Package

We will begin by importing the data from our local directory. The package we will be using is the tm library (short for Text Mining Package). The two functions we will use are:

  1. DirSource, which accepts a character vector of full path names corresponding to the directories of documents to read (a good time to use setwd beforehand to prep the working directory)

  2. VCorpus, which stores a collection of documents in memory (hence the V, for volatile)

We can see below that there are roughly 500-plus documents in each corpus.
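
A minimal sketch of the import step, assuming the ham and spam messages sit in local ham and spam directories (the actual directory names are not shown in this report):

    library(tm)

    # Assumed local directories containing the raw ham and spam emails
    ham_source  <- DirSource("ham")
    spam_source <- DirSource("spam")

    # Load each collection into a volatile (in-memory) corpus
    ham_corpus  <- VCorpus(ham_source)
    spam_corpus <- VCorpus(spam_source)

    # Printing each corpus gives the summaries shown below
    ham_corpus
    spam_corpus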

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 587
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 501

Data Analysis

A Document Term Matrix is simply a matrix that describes the frequency of the terms occurring in a collection of documents: each row represents a document and each column a term.

Using the DocumentTermMatrix function from the tm package, we will build a matrix from the first 10 documents of each corpus.

We can use the inspect function to take a look at the document term matrix created for each corpus. Below we can see a summary for each.

In the summaries below, Sparsity is the percentage of zero entries in the matrix, i.e., document-term pairs where the term does not appear in that document. (Later we will also use a sparsity threshold on the relative document frequency of a term, above which the term will be removed.)
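
A minimal sketch of this step, using the hypothetical ham_corpus and spam_corpus objects from the import sketch above:

    # Build a document term matrix from the first 10 documents of each corpus
    ham_dtm_sample  <- DocumentTermMatrix(ham_corpus[1:10])
    spam_dtm_sample <- DocumentTermMatrix(spam_corpus[1:10])

    # Summarize each matrix (output shown below)
    inspect(ham_dtm_sample)
    inspect(spam_dtm_sample)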

## <<DocumentTermMatrix (documents: 10, terms: 879)>>
## Non-/sparse entries: 2244/6546
## Sparsity           : 74%
## Maximal term length: 84
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     2002 and aug esmtp for from received:
##   00001.7c53336b37003a9286aba55d2945844c   13   2  13     6   9   13        10
##   00002.9c4069e25e1ef370c078db7ee85ff9ac   12   3  12     2   6   11        10
##   00003.860e3c3cee1b42ead714c5c874fe25f7   12  10  11     2   5   11         9
##   00004.864220c5b6930b209cc287c361c99af1   11   4   9     6   5    9         7
##   00005.bf27cdeaf0b8c4647ecd61b1d09da613   11   2  11     2   6   10         9
##   00006.253ea2f9a9cc36fa0b1129b04b806608   11   3  13     2   5   12        11
##   0001.ea7e79d3153e7469e7a9c3e0af6a357e    13   2  13     6   9   13        10
##   0002.b3120c4bcbf3101e661161ee7efcb8bf    12   3  12     2   6   11        10
##   0003.acfc5ad94bbd27118a0d8685d18c89dd    12  10  11     2   5   11         9
##   0004.e8d5727378ddde5c3be181df593f1712    11   4   9     6   5    9         7
##                                         Terms
## Docs                                     the thu, with
##   00001.7c53336b37003a9286aba55d2945844c  15   11    9
##   00002.9c4069e25e1ef370c078db7ee85ff9ac   5    6    8
##   00003.860e3c3cee1b42ead714c5c874fe25f7  16    5   11
##   00004.864220c5b6930b209cc287c361c99af1  11    8    8
##   00005.bf27cdeaf0b8c4647ecd61b1d09da613   4    5    7
##   00006.253ea2f9a9cc36fa0b1129b04b806608   6    5    7
##   0001.ea7e79d3153e7469e7a9c3e0af6a357e   15   11    9
##   0002.b3120c4bcbf3101e661161ee7efcb8bf    5    6    8
##   0003.acfc5ad94bbd27118a0d8685d18c89dd   16    5   11
##   0004.e8d5727378ddde5c3be181df593f1712   11    8    8
## <<DocumentTermMatrix (documents: 10, terms: 2394)>>
## Non-/sparse entries: 3248/20692
## Sparsity           : 86%
## Maximal term length: 89
## Weighting          : term frequency (tf)
## Sample             :
##                                        Terms
## Docs                                    <option 2002 and aug for from the thu,
##   0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1       0    0   0   0   0    0   0    0
##   0001.bfc8d64d12b325ff385cca8d07b84288       0    6   4   6   5    7   5    3
##   0002.24b47bb3ce90708ae29d0aec1da08610       0    9   0   9   6    7   4    7
##   0003.4b3d943b8df71af248d12f8b2e7a224a       0    7   0   7   4    5   4    5
##   0004.1874ab60c71f0b31b580f313a3f6e777       0   10  11   6  14   10  12    5
##   0005.1f42bb885de0ef7fc5cd09d34dc2ba54       0    9   0   9   6    7   3    7
##   0006.7a32642f8c22bbeb85d6c3b5f3890a2c       0    6  11   6   8    8  20    5
##   0007.859c901719011d56f8b652ea071c1f8b       0    6   3   6   4    8   2    5
##   0008.9562918b57e044abfbce260cc875acde     226    6   2   6   4    6   7    5
##   0009.c05e264fbf18783099b53dbc9a9aacda       0    6  13   6   9    8  23    5
##                                        Terms
## Docs                                    with you
##   0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1    0   0
##   0001.bfc8d64d12b325ff385cca8d07b84288    5   7
##   0002.24b47bb3ce90708ae29d0aec1da08610    6   2
##   0003.4b3d943b8df71af248d12f8b2e7a224a    4   2
##   0004.1874ab60c71f0b31b580f313a3f6e777   10   9
##   0005.1f42bb885de0ef7fc5cd09d34dc2ba54    6   3
##   0006.7a32642f8c22bbeb85d6c3b5f3890a2c    7  10
##   0007.859c901719011d56f8b652ea071c1f8b    5   2
##   0008.9562918b57e044abfbce260cc875acde    7   6
##   0009.c05e264fbf18783099b53dbc9a9aacda    7  11

Cleaning Up the Data

From the previous section we saw that, upon inspection, the data was quite dirty: it still contained punctuation and other terms we are probably not interested in, such as numbers.

The tm text mining package comes with a handy tool called tm_map, which applies transformations to the documents in a corpus. The transformations themselves are functions, such as removeNumbers, all conveniently found in the tm library.

We can now re-inspect the document term matrices to verify that each corpus has been cleaned up.
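
A minimal sketch of the cleaning step, assuming a typical tm pipeline; the report only names removeNumbers explicitly, while the lower-cased, stop-word-free, stemmed terms in the output below suggest the other transformations:

    # Apply a series of cleaning transformations to a corpus
    clean_corpus <- function(corpus) {
      corpus <- tm_map(corpus, content_transformer(tolower))
      corpus <- tm_map(corpus, removeNumbers)
      corpus <- tm_map(corpus, removePunctuation)
      corpus <- tm_map(corpus, removeWords, stopwords("english"))
      corpus <- tm_map(corpus, stripWhitespace)
      corpus <- tm_map(corpus, stemDocument)   # stemming via the SnowballC package
      corpus
    }

    ham_corpus  <- clean_corpus(ham_corpus)
    spam_corpus <- clean_corpus(spam_corpus)

    # Re-inspect the first 10 documents of each cleaned corpus
    inspect(DocumentTermMatrix(ham_corpus[1:10]))
    inspect(DocumentTermMatrix(spam_corpus[1:10]))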

## <<DocumentTermMatrix (documents: 10, terms: 564)>>
## Non-/sparse entries: 1516/4124
## Sparsity           : 73%
## Maximal term length: 55
## Weighting          : term frequency (tf)
## Sample             :
##                                         Terms
## Docs                                     aug esmtp list localhost mail receiv
##   00001.7c53336b37003a9286aba55d2945844c  13     6    6         4    1     10
##   00002.9c4069e25e1ef370c078db7ee85ff9ac  12     2    2         3    4     10
##   00003.860e3c3cee1b42ead714c5c874fe25f7  11     2    2         3    2      9
##   00004.864220c5b6930b209cc287c361c99af1   9     6    2         3    2      7
##   00005.bf27cdeaf0b8c4647ecd61b1d09da613  11     2    2         3    2      9
##   00006.253ea2f9a9cc36fa0b1129b04b806608  13     2    2         3    4     11
##   0001.ea7e79d3153e7469e7a9c3e0af6a357e   13     6    6         4    1     10
##   0002.b3120c4bcbf3101e661161ee7efcb8bf   12     2    2         3    4     10
##   0003.acfc5ad94bbd27118a0d8685d18c89dd   11     2    2         3    2      9
##   0004.e8d5727378ddde5c3be181df593f1712    9     6    2         3    2      7
##                                         Terms
## Docs                                     subject thu zzzzlocalhost
##   00001.7c53336b37003a9286aba55d2945844c       4  12             2
##   00002.9c4069e25e1ef370c078db7ee85ff9ac       2   7             2
##   00003.860e3c3cee1b42ead714c5c874fe25f7       2   6             2
##   00004.864220c5b6930b209cc287c361c99af1       1   9             2
##   00005.bf27cdeaf0b8c4647ecd61b1d09da613       2   6             2
##   00006.253ea2f9a9cc36fa0b1129b04b806608       2   6             2
##   0001.ea7e79d3153e7469e7a9c3e0af6a357e        4  12             2
##   0002.b3120c4bcbf3101e661161ee7efcb8bf        2   7             2
##   0003.acfc5ad94bbd27118a0d8685d18c89dd        2   6             2
##   0004.e8d5727378ddde5c3be181df593f1712        1   9             2
##                                         Terms
## Docs                                     zzzzteanayahoogroupscom
##   00001.7c53336b37003a9286aba55d2945844c                       0
##   00002.9c4069e25e1ef370c078db7ee85ff9ac                       6
##   00003.860e3c3cee1b42ead714c5c874fe25f7                       5
##   00004.864220c5b6930b209cc287c361c99af1                       0
##   00005.bf27cdeaf0b8c4647ecd61b1d09da613                       5
##   00006.253ea2f9a9cc36fa0b1129b04b806608                       5
##   0001.ea7e79d3153e7469e7a9c3e0af6a357e                        0
##   0002.b3120c4bcbf3101e661161ee7efcb8bf                        6
##   0003.acfc5ad94bbd27118a0d8685d18c89dd                        5
##   0004.e8d5727378ddde5c3be181df593f1712                        0
## <<DocumentTermMatrix (documents: 10, terms: 1349)>>
## Non-/sparse entries: 1961/11529
## Sparsity           : 85%
## Maximal term length: 58
## Weighting          : term frequency (tf)
## Sample             :
##                                        Terms
## Docs                                    aug esmtp localhost option receiv size
##   0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1   0     0         0      0      0    0
##   0001.bfc8d64d12b325ff385cca8d07b84288   6     2         3      0      4    0
##   0002.24b47bb3ce90708ae29d0aec1da08610   9     4         3      0      6    0
##   0003.4b3d943b8df71af248d12f8b2e7a224a   7     2         3      0      4    0
##   0004.1874ab60c71f0b31b580f313a3f6e777  10     2         3      0     12    0
##   0005.1f42bb885de0ef7fc5cd09d34dc2ba54   9     4         3      0      6    0
##   0006.7a32642f8c22bbeb85d6c3b5f3890a2c   6     3         3      0      8    0
##   0007.859c901719011d56f8b652ea071c1f8b   6     2         3      0      4    0
##   0008.9562918b57e044abfbce260cc875acde   6     2         3    226      5   25
##   0009.c05e264fbf18783099b53dbc9a9aacda   6     3         3      0     10    0
##                                        Terms
## Docs                                    tabl thu valueopt width
##   0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1    0   0        0     0
##   0001.bfc8d64d12b325ff385cca8d07b84288    4   4        0     0
##   0002.24b47bb3ce90708ae29d0aec1da08610    0   8        0     0
##   0003.4b3d943b8df71af248d12f8b2e7a224a    0   6        0     0
##   0004.1874ab60c71f0b31b580f313a3f6e777    0   6        0     0
##   0005.1f42bb885de0ef7fc5cd09d34dc2ba54    0   8        0     0
##   0006.7a32642f8c22bbeb85d6c3b5f3890a2c    0   6        0     0
##   0007.859c901719011d56f8b652ea071c1f8b    0   6        0     0
##   0008.9562918b57e044abfbce260cc875acde   28   6      163    36
##   0009.c05e264fbf18783099b53dbc9a9aacda    0   6        0     0

Final Transformations

Now that we verified the data has been cleaned up, we will perform some final transformations to prepare the data for spam filters.

In order to not waste memory, we overwrite the existing variables with the entire corpus of each dataset. To optimize performance, we also remove sparse terms, dropping any term that does not appear in at least 5% of the documents.

We classify emails using 0 for ham and 1 for spam, since by convention 1 is usually used for raising alarms or flags.

The classification should be done over a factor type, so we use as.factor to ensure the correct data type. Lastly, we combine both data sets using rbind.fill.matrix from the plyr package and put the result in its final container, a data frame, using as.data.frame. We then handle any NA coercions introduced by the merge by setting the NA values to 0.
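
A minimal sketch of these final transformations, using the hypothetical corpus objects from the earlier sketches and a label column named label (the actual variable names are not shown in this report):

    library(plyr)

    # Rebuild the document term matrices over the full corpora, dropping
    # terms that do not appear in at least 5% of the documents
    ham_dtm  <- removeSparseTerms(DocumentTermMatrix(ham_corpus),  sparse = 0.95)
    spam_dtm <- removeSparseTerms(DocumentTermMatrix(spam_corpus), sparse = 0.95)

    # Combine the two matrices by term name; terms present in only one
    # corpus produce NA entries, which we set to 0
    email_matrix <- rbind.fill.matrix(as.matrix(ham_dtm), as.matrix(spam_dtm))
    email_matrix[is.na(email_matrix)] <- 0
    email_df <- as.data.frame(email_matrix)

    # Classify ham as 0 and spam as 1, stored as a factor
    email_df$label <- as.factor(c(rep(0, nrow(ham_dtm)), rep(1, nrow(spam_dtm))))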

Spam Filter Approaches

To develop the Spam Filters we start by creating the models and training them. In the final part we will fit the models to testing data to see how well they perform.

System Prep for Model Creation

R by default uses only a single thread, which can make training models extremely slow. In order to speed up R’s performance we can use the parallel and doParallel libraries to take advantage of parallel processing.

When using the train function of the caret package, we need to set the trControl argument by passing a trainControl object. This object specifies the number of folds used for k-fold cross-validation, and setting its allowParallel argument to TRUE tells caret to use the cluster registered with registerDoParallel.

This setup step is quite tedious, but the amount of time it saves is invaluable, as we will see from our timers.
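
A minimal sketch of the setup, assuming 5-fold cross-validation (the actual number of folds is not shown in this report):

    library(parallel)
    library(doParallel)
    library(caret)

    # Register a parallel backend using all but one CPU core
    cluster <- makeCluster(detectCores() - 1)
    registerDoParallel(cluster)

    # k-fold cross-validation (5 folds assumed here), with parallel
    # processing enabled so caret uses the registered cluster
    train_ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

    # Once all models have been trained, release the cluster:
    # stopCluster(cluster); registerDoSEQ()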

Next we want to randomize the sampling and divide the sample into a set for training and a set for testing.

Using the set.seed function, we can make the randomization reproducible. All we need to do is initialize (set) the seed to any integer value; in other words, we tell whichever random number generator is used to start its algorithm from that value. This lets us reproduce the randomization (which admittedly sounds like an oxymoron).

Now that the seed has been set, we will divide the data set into two groups, one for training and one for testing the model. Since we have a pretty good volume of data, an 80-20 split will work well.

Lastly, we create a vector of indexes from a “random” (remember, we are using a set seed) sample of 80% of all the emails. Using that index we subset our training and testing data.
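
A minimal sketch of the split, using the hypothetical email_df data frame from the previous sketch and an arbitrary seed value:

    # Make the "random" sampling reproducible
    set.seed(123)

    # Sample 80% of the row indexes for training; the remainder is for testing
    train_index <- sample(seq_len(nrow(email_df)), size = floor(0.8 * nrow(email_df)))
    train_data  <- email_df[train_index, ]
    test_data   <- email_df[-train_index, ]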

Creating the Machine Learning Models

Generalized Linear Model Creation

Using the caret library we can leverage the train function for reusability. We must set the method argument to specify which type of model we will be using. It is also very important to set the trControl argument, as mentioned previously, so that training is optimized with parallel processing.
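
A minimal sketch of the GLM training step, using the hypothetical train_data and train_ctrl objects from the earlier sketches; the elapsed time below comes from wrapping the call with Sys.time():

    # Train a generalized linear model with parallelized cross-validation
    start_time <- Sys.time()
    glm_model  <- train(label ~ ., data = train_data,
                        method = "glm", trControl = train_ctrl)
    Sys.time() - start_time   # prints the time difference shown below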

## Time difference of 22.10079 secs

Model Fitting

From the stats library we will use the predict function, passing each of our models to the object argument along with the testing data. Finally, we will use the confusionMatrix function from the caret library to assess the actual performance of each model.
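
A minimal sketch for the GLM model, using the hypothetical glm_model and test_data objects from the earlier sketches; the same two calls are repeated for the SVM, Bayes GLM, and random forest models:

    # Fit the trained model to the held-out test set
    glm_pred <- predict(glm_model, newdata = test_data)

    # Compare predictions against the true labels
    confusionMatrix(glm_pred, test_data$label)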

GLM Prediction

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 119   9
##          1   5  85
##                                           
##                Accuracy : 0.9358          
##                  95% CI : (0.8946, 0.9644)
##     No Information Rate : 0.5688          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8684          
##                                           
##  Mcnemar's Test P-Value : 0.4227          
##                                           
##             Sensitivity : 0.9597          
##             Specificity : 0.9043          
##          Pos Pred Value : 0.9297          
##          Neg Pred Value : 0.9444          
##              Prevalence : 0.5688          
##          Detection Rate : 0.5459          
##    Detection Prevalence : 0.5872          
##       Balanced Accuracy : 0.9320          
##                                           
##        'Positive' Class : 0               
## 

SVM Prediction

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124   0
##          1   0  94
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9832, 1)
##     No Information Rate : 0.5688     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5688     
##          Detection Rate : 0.5688     
##    Detection Prevalence : 0.5688     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Bayes GLM Prediction

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124   0
##          1   0  94
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9832, 1)
##     No Information Rate : 0.5688     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5688     
##          Detection Rate : 0.5688     
##    Detection Prevalence : 0.5688     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Random Forest Prediction

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124   0
##          1   0  94
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9832, 1)
##     No Information Rate : 0.5688     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5688     
##          Detection Rate : 0.5688     
##    Detection Prevalence : 0.5688     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Conclusion

We saw that the majority of the work came from the data munging process and the training of the models. We were generous with the volume of data because we wanted to gain as much accuracy as possible; early in this research, however, that volume sometimes meant over 10 minutes to train an individual model. The big turning point was enabling parallel processing, which brought the training time for most models down to an impressive sub-minute each. Overall the Generalized Linear Model had the poorest accuracy, while the remaining models performed perfectly with varying completion times. The Support Vector Machine with Linear Kernel was the quickest of all 4 models, finishing in about 1/20th of the longest training time.