This research article is about building spam filters using different machine learning models for training and testing. Most of the challenges come from preparing the data and optimizing the performance of training. Finally, we assess the prediction results of each model based on accuracy and time efficiency.
We will begin by importing the data, analyzing it, and then performing any data munging needed before applying spam filters.
We begin by importing the data from our local directory using the tm package (short for Text Mining Package). The two functions we will use are listed below, followed by a short loading sketch:
DirSource, which accepts a character vector of full path names corresponding to directories on disk (a good time to use setwd beforehand to set the working directory)
VCorpus, which stores a collection of documents in memory (hence the V for volatile)
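A minimal loading sketch follows; the folder names easy_ham and spam are assumptions about the local layout, so substitute whatever directories hold your ham and spam emails.
library(tm)

# Assumed local folders holding the raw emails; adjust to your own paths (or setwd first)
ds_ham  <- VCorpus(DirSource("easy_ham"))
ds_spam <- VCorpus(DirSource("spam"))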
We can see below that there are roughly 500 documents in each corpus (587 ham and 501 spam).
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 587
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 501
A Document Term Matrix is simply a matrix describing the frequency of terms occurring in a collection of documents: each row is a document and each column is a term. Using the DocumentTermMatrix function from the tm package we will build a matrix from the first 10 documents of each corpus.
We can use the inspect function to take a look at the document term matrix created for each corpus. Below we can see a summary for each.
Sparsity here refers to the proportion of entries in the matrix that are zero, i.e. how often a term does not appear in a given document. (Later, when we call removeSparseTerms, we will pass a sparsity threshold above which a term is removed.)
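Below is a sketch of the step that produces these summaries; the object names dtm_ham and dtm_spam match those used after cleaning, though the exact original chunk is assumed.
dtm_ham <- DocumentTermMatrix(ds_ham[1:10])
dtm_spam <- DocumentTermMatrix(ds_spam[1:10])
inspect(dtm_ham)
inspect(dtm_spam)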
## <<DocumentTermMatrix (documents: 10, terms: 879)>>
## Non-/sparse entries: 2244/6546
## Sparsity : 74%
## Maximal term length: 84
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs 2002 and aug esmtp for from received:
## 00001.7c53336b37003a9286aba55d2945844c 13 2 13 6 9 13 10
## 00002.9c4069e25e1ef370c078db7ee85ff9ac 12 3 12 2 6 11 10
## 00003.860e3c3cee1b42ead714c5c874fe25f7 12 10 11 2 5 11 9
## 00004.864220c5b6930b209cc287c361c99af1 11 4 9 6 5 9 7
## 00005.bf27cdeaf0b8c4647ecd61b1d09da613 11 2 11 2 6 10 9
## 00006.253ea2f9a9cc36fa0b1129b04b806608 11 3 13 2 5 12 11
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 13 2 13 6 9 13 10
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 12 3 12 2 6 11 10
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 12 10 11 2 5 11 9
## 0004.e8d5727378ddde5c3be181df593f1712 11 4 9 6 5 9 7
## Terms
## Docs the thu, with
## 00001.7c53336b37003a9286aba55d2945844c 15 11 9
## 00002.9c4069e25e1ef370c078db7ee85ff9ac 5 6 8
## 00003.860e3c3cee1b42ead714c5c874fe25f7 16 5 11
## 00004.864220c5b6930b209cc287c361c99af1 11 8 8
## 00005.bf27cdeaf0b8c4647ecd61b1d09da613 4 5 7
## 00006.253ea2f9a9cc36fa0b1129b04b806608 6 5 7
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 15 11 9
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 5 6 8
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 16 5 11
## 0004.e8d5727378ddde5c3be181df593f1712 11 8 8
## <<DocumentTermMatrix (documents: 10, terms: 2394)>>
## Non-/sparse entries: 3248/20692
## Sparsity : 86%
## Maximal term length: 89
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs <option 2002 and aug for from the thu,
## 0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1 0 0 0 0 0 0 0 0
## 0001.bfc8d64d12b325ff385cca8d07b84288 0 6 4 6 5 7 5 3
## 0002.24b47bb3ce90708ae29d0aec1da08610 0 9 0 9 6 7 4 7
## 0003.4b3d943b8df71af248d12f8b2e7a224a 0 7 0 7 4 5 4 5
## 0004.1874ab60c71f0b31b580f313a3f6e777 0 10 11 6 14 10 12 5
## 0005.1f42bb885de0ef7fc5cd09d34dc2ba54 0 9 0 9 6 7 3 7
## 0006.7a32642f8c22bbeb85d6c3b5f3890a2c 0 6 11 6 8 8 20 5
## 0007.859c901719011d56f8b652ea071c1f8b 0 6 3 6 4 8 2 5
## 0008.9562918b57e044abfbce260cc875acde 226 6 2 6 4 6 7 5
## 0009.c05e264fbf18783099b53dbc9a9aacda 0 6 13 6 9 8 23 5
## Terms
## Docs with you
## 0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1 0 0
## 0001.bfc8d64d12b325ff385cca8d07b84288 5 7
## 0002.24b47bb3ce90708ae29d0aec1da08610 6 2
## 0003.4b3d943b8df71af248d12f8b2e7a224a 4 2
## 0004.1874ab60c71f0b31b580f313a3f6e777 10 9
## 0005.1f42bb885de0ef7fc5cd09d34dc2ba54 6 3
## 0006.7a32642f8c22bbeb85d6c3b5f3890a2c 7 10
## 0007.859c901719011d56f8b652ea071c1f8b 5 2
## 0008.9562918b57e044abfbce260cc875acde 7 6
## 0009.c05e264fbf18783099b53dbc9a9aacda 7 11
From the previous section we saw, upon inspecting the data, that it was quite dirty: full of punctuation, numbers, and other terms we are probably not interested in.
The tm text mining package comes with a handy tool called tm_map, which applies transformations to documents. The transformations are functions such as removeNumbers, all conveniently found in the tm library as well.
ds_ham %<>% tm_map(content_transformer(tolower)) %>%
tm_map(removePunctuation) %>% tm_map(stripWhitespace) %>%
tm_map(removeWords, stopwords()) %>% tm_map(stemDocument) %>%
tm_map(removeNumbers)
ds_spam %<>% tm_map(content_transformer(tolower)) %>%
tm_map(removePunctuation) %>% tm_map(stripWhitespace) %>%
tm_map(removeWords, stopwords()) %>% tm_map(stemDocument) %>%
tm_map(removeNumbers)
We can now re-inspect to verify that each corpus has been cleaned up.
dtm_ham <- DocumentTermMatrix(ds_ham[1:10])
dtm_spam <- DocumentTermMatrix(ds_spam[1:10])
inspect(dtm_ham)
inspect(dtm_spam)
## <<DocumentTermMatrix (documents: 10, terms: 564)>>
## Non-/sparse entries: 1516/4124
## Sparsity : 73%
## Maximal term length: 55
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp list localhost mail receiv
## 00001.7c53336b37003a9286aba55d2945844c 13 6 6 4 1 10
## 00002.9c4069e25e1ef370c078db7ee85ff9ac 12 2 2 3 4 10
## 00003.860e3c3cee1b42ead714c5c874fe25f7 11 2 2 3 2 9
## 00004.864220c5b6930b209cc287c361c99af1 9 6 2 3 2 7
## 00005.bf27cdeaf0b8c4647ecd61b1d09da613 11 2 2 3 2 9
## 00006.253ea2f9a9cc36fa0b1129b04b806608 13 2 2 3 4 11
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 13 6 6 4 1 10
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 12 2 2 3 4 10
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 11 2 2 3 2 9
## 0004.e8d5727378ddde5c3be181df593f1712 9 6 2 3 2 7
## Terms
## Docs subject thu zzzzlocalhost
## 00001.7c53336b37003a9286aba55d2945844c 4 12 2
## 00002.9c4069e25e1ef370c078db7ee85ff9ac 2 7 2
## 00003.860e3c3cee1b42ead714c5c874fe25f7 2 6 2
## 00004.864220c5b6930b209cc287c361c99af1 1 9 2
## 00005.bf27cdeaf0b8c4647ecd61b1d09da613 2 6 2
## 00006.253ea2f9a9cc36fa0b1129b04b806608 2 6 2
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 4 12 2
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 2 7 2
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 2 6 2
## 0004.e8d5727378ddde5c3be181df593f1712 1 9 2
## Terms
## Docs zzzzteanayahoogroupscom
## 00001.7c53336b37003a9286aba55d2945844c 0
## 00002.9c4069e25e1ef370c078db7ee85ff9ac 6
## 00003.860e3c3cee1b42ead714c5c874fe25f7 5
## 00004.864220c5b6930b209cc287c361c99af1 0
## 00005.bf27cdeaf0b8c4647ecd61b1d09da613 5
## 00006.253ea2f9a9cc36fa0b1129b04b806608 5
## 0001.ea7e79d3153e7469e7a9c3e0af6a357e 0
## 0002.b3120c4bcbf3101e661161ee7efcb8bf 6
## 0003.acfc5ad94bbd27118a0d8685d18c89dd 5
## 0004.e8d5727378ddde5c3be181df593f1712 0
## <<DocumentTermMatrix (documents: 10, terms: 1349)>>
## Non-/sparse entries: 1961/11529
## Sparsity : 85%
## Maximal term length: 58
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp localhost option receiv size
## 0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1 0 0 0 0 0 0
## 0001.bfc8d64d12b325ff385cca8d07b84288 6 2 3 0 4 0
## 0002.24b47bb3ce90708ae29d0aec1da08610 9 4 3 0 6 0
## 0003.4b3d943b8df71af248d12f8b2e7a224a 7 2 3 0 4 0
## 0004.1874ab60c71f0b31b580f313a3f6e777 10 2 3 0 12 0
## 0005.1f42bb885de0ef7fc5cd09d34dc2ba54 9 4 3 0 6 0
## 0006.7a32642f8c22bbeb85d6c3b5f3890a2c 6 3 3 0 8 0
## 0007.859c901719011d56f8b652ea071c1f8b 6 2 3 0 4 0
## 0008.9562918b57e044abfbce260cc875acde 6 2 3 226 5 25
## 0009.c05e264fbf18783099b53dbc9a9aacda 6 3 3 0 10 0
## Terms
## Docs tabl thu valueopt width
## 0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1 0 0 0 0
## 0001.bfc8d64d12b325ff385cca8d07b84288 4 4 0 0
## 0002.24b47bb3ce90708ae29d0aec1da08610 0 8 0 0
## 0003.4b3d943b8df71af248d12f8b2e7a224a 0 6 0 0
## 0004.1874ab60c71f0b31b580f313a3f6e777 0 6 0 0
## 0005.1f42bb885de0ef7fc5cd09d34dc2ba54 0 8 0 0
## 0006.7a32642f8c22bbeb85d6c3b5f3890a2c 0 6 0 0
## 0007.859c901719011d56f8b652ea071c1f8b 0 6 0 0
## 0008.9562918b57e044abfbce260cc875acde 28 6 163 36
## 0009.c05e264fbf18783099b53dbc9a9aacda 0 6 0 0
Now that we have verified the data has been cleaned up, we will perform some final transformations to prepare it for the spam filters.
To avoid wasting memory we overwrite the existing variables, this time building the document-term matrices over the entire corpus of each dataset. To keep the matrices manageable we apply removeSparseTerms with a 0.95 threshold, which drops terms that do not appear in at least 5% of the documents.
We classify ham as 0 and spam as 1, since by convention 1 is used for raising an alarm or a flag.
ham <- DocumentTermMatrix(ds_ham) %>% removeSparseTerms(.95) %>% as.matrix() %>%
cbind("IsSpam" = 0)
spam <- DocumentTermMatrix(ds_spam) %>% removeSparseTerms(.95) %>% as.matrix() %>%
cbind("IsSpam" = 1)The classification should be done over a factor type, so we use as.factor to ensure the correct data type. Lastly we combine both data sets using the rbind.fill.matrix from the plyr package and put it in it’s final container as a data frame using as.data.frame. We validate any NA cohersions by setting NA values to 0.
To develop the spam filters we start by creating the models and training them. In the final part we will apply the models to the testing data to see how well they perform.
R by default only uses a single thread, which can make training models extremely slow. To speed things up we can use the parallel and doParallel libraries to take advantage of parallel processing.
When using the train function of the caret package, we need to set the trControl argument by passing a trainControl object. This object specifies the number of folds used for k-fold cross-validation, and setting its allowParallel argument to TRUE tells caret to use the cluster registered with registerDoParallel.
The setup below is a bit tedious, but the amount of time it saves is invaluable, as we will see from our timers.
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv",
number = 5,
allowParallel = TRUE)
Next we want to randomize the sampling and divide the data into a training set and a testing set.
Using the set.seed function we can make the randomization reproducible. All we need to do is initialize (seed) the random number generator with an arbitrary integer; in other words, we tell whatever random generator is used to start its algorithm from that value. The result is reproducible randomization (which at first sounds like an oxymoron), as the small example below shows.
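A tiny illustration; the seed value 42 is arbitrary.
set.seed(42)
sample(10, 3)   # three "random" values
set.seed(42)
sample(10, 3)   # resetting the seed reproduces exactly the same three values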
Now that the seed has been set we will divide the data set into two groups, one for training and one for testing the model. Since we have a pretty good volume, an 80-20 split will work well.
We create a vector of indexes from a "random" (remember, the seed is fixed) sampling of 80% of all the emails, and use that index to subset our training and testing data, as sketched below.
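A sketch of the split; the seed value and the all_emails object carry over as assumptions from the combination step.
set.seed(2018)
train_index <- sample(nrow(all_emails), size = floor(0.8 * nrow(all_emails)))
all_emails_training <- all_emails[train_index, ]
all_emails_testing  <- all_emails[-train_index, ]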
Using the caret library we can leverage the train function for reusability. We must set the method argument to specify what type of model we are using. It is also very important to set the trControl argument as mentioned previously, so that training is optimized with parallel processing.
start_time <- Sys.time()
gl_model <- train(IsSpam ~ ., data = all_emails_training, method = 'glm',
trControl = fitControl)
Sys.time() - start_time
## Time difference of 22.10079 secs
start_time <- Sys.time()
svm_model <- train(IsSpam ~ ., data = all_emails_training, method = 'svmLinear2',
trControl = fitControl)
Sys.time() - start_time
## Time difference of 3.515811 secs
start_time <- Sys.time()
bayes_model <- train(IsSpam ~ ., data = all_emails_training, method = 'bayesglm',
trControl = fitControl)
Sys.time() - start_time
## Time difference of 1.047032 mins
start_time <- Sys.time()
ranger_model <- train(IsSpam ~ ., data = all_emails_training, method = 'ranger',
trControl = fitControl)
Sys.time() - start_time
## Time difference of 15.77024 secs
From the stats library we use the predict function, passing each of our trained models to the object argument. Then we use the confusionMatrix function from the caret library to assess the actual performance of each model on the test set, as sketched below.
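The evaluation pattern, shown here for the GLM model; repeating it for the other three models produces the remaining confusion matrices below.
gl_predictions <- predict(gl_model, newdata = all_emails_testing)
confusionMatrix(gl_predictions, all_emails_testing$IsSpam)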
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 9
## 1 5 85
##
## Accuracy : 0.9358
## 95% CI : (0.8946, 0.9644)
## No Information Rate : 0.5688
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8684
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.9597
## Specificity : 0.9043
## Pos Pred Value : 0.9297
## Neg Pred Value : 0.9444
## Prevalence : 0.5688
## Detection Rate : 0.5459
## Detection Prevalence : 0.5872
## Balanced Accuracy : 0.9320
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 124 0
## 1 0 94
##
## Accuracy : 1
## 95% CI : (0.9832, 1)
## No Information Rate : 0.5688
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5688
## Detection Rate : 0.5688
## Detection Prevalence : 0.5688
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 124 0
## 1 0 94
##
## Accuracy : 1
## 95% CI : (0.9832, 1)
## No Information Rate : 0.5688
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5688
## Detection Rate : 0.5688
## Detection Prevalence : 0.5688
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 124 0
## 1 0 94
##
## Accuracy : 1
## 95% CI : (0.9832, 1)
## No Information Rate : 0.5688
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5688
## Detection Rate : 0.5688
## Detection Prevalence : 0.5688
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
We saw that the majority of the work came from the data munging process and the training of the models. We were generous with the volume of data because we wanted to gain as much accuracy as possible; early in this research, training an individual model on that volume sometimes took over 10 minutes. The big turning point was enabling parallel processing, which brought the training time for each model down to, mostly, an impressive sub-minute. Overall the Generalized Linear Model had the poorest accuracy while the rest predicted perfectly, with varying completion times. The Support Vector Machine with a linear kernel was the quickest of all four models, finishing in roughly one-twentieth of the longest training time.