The packages tm and RTextTools are required for this code.
suppressMessages(library(knitr))
suppressMessages(library(tm))
library(RTextTools)
## Warning: package 'RTextTools' was built under R version 3.2.4
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
library(stringr)
Today we’ll be creating a spam filter. We will use the machine learning technique of supervised learning, where we have two initial sets of data, known as training sets; these are classified as spam email and non-spam aka ham email. We will then use a text mining package to identify patterns in each, to create an algorithm that can classify emails as spam or not.
The idea is that emails have certain textual patterns that suggest they are either spam or not spam - our text mining would try and identify those patterns, and then use them to classify emails as spam or not.
Then, we’ll test out our algorithm on emails we know to be either spam or non-spam, and see how accurately our algorithm classifies these emails.
First, let’s identify the data we’ll be using.
Non-spam training set: 20021010_easy_ham.tar.bz2 Spam training set: 20021010_spam.tar.bz2 Non-spam test set: 20030228_easy_ham_2.tar.bz2 Spam test set: 20030228_spam.tar.bz2 Source: https://spamassassin.apache.org/publiccorpus/
First, we’ll encode the corpus locations:
spamdir <- "C:\\Users\\asher\\Documents\\Classes\\CUNY\\DATA 607\\Week 11\\"
eh1Loc <- paste0(spamdir, "easy_ham")
eh2Loc <- paste0(spamdir, "easy_ham_2")
es1Loc <- paste0(spamdir, "easy_spam")
es2Loc <- paste0(spamdir, "easy_spam_2")
Since we’ll be processing several corpora, I’ve built a function to whittle down the steps to one.
The function proceeds like so:
dtm_create <- function(corpusLoc) {
corpusDir <- DirSource(corpusLoc)
corpus <- Corpus(corpusDir)
corpus <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, removeNumbers = TRUE, stopwords = TRUE, stemming = TRUE, stripWhitespace = TRUE, wordLengths = c(3, 15)))
corpus <- removeSparseTerms(corpus, 0.99)#removes words that don't appear in 99% of emails
corpus
}
corpora <- c(dtm_create(eh1Loc),
dtm_create(es1Loc),
dtm_create(eh2Loc),
dtm_create(es2Loc)) #Combines the different corpora into one corpus
For each document term matrix, we’ll have to add a field that classifies it as spam or not. This field is a column whose length is the same as the number of files for each directory. We’ll use the list.files function to call up a vector containing the files in each directory, and a length function to count the number of filenames
trainLength <- length(list.files(eh1Loc)) + length(list.files(es1Loc)) #Training set length
totalLength <- trainLength + length(list.files(eh2Loc)) + length(list.files(es2Loc)) #Training+Testing set length
emailLabels<- c(rep("non-spam", length(list.files(eh1Loc))), #Training set non-spam
rep("spam", length(list.files(es1Loc))), #Training set spam
rep("non-spam", length(list.files(eh2Loc))), #Test set non-spam
rep("spam", length(list.files(es2Loc)))) #Test set spam
The bounds of our training set will be the number of files in each of the directories it includes
Now, we’ll create a container for our corpora, which we’ll use to create our model.
The create container parameters are as follows: corpora (a document term matrix): the corpora we are using as our training and testing sets labels (a character vector): a column that identifies whether each document in the assembled corpora, i.e. each email, is spam or not. trainSize: the size of our training set, in this case 2551 non-spam + 501 spam = 3052 testSize: the size of our testing set, in this case 501 non-spam + 501 spam = 1002 virgin: this parameter refers to whether the learning is supervised or unsupervised, i.e. whether we know the true classification of the documents under consideration, in this case emails. In this case, the learning is supervised, which corresponds to a setting of FALSE. See here for background.
corporaContainer <- create_container(corpora, labels = emailLabels, trainSize = 1:trainLength, testSize = (trainLength + 1):totalLength, virgin = FALSE)
The RTextTools package has nine algorithms to choose from: “BAGGING”, “BOOSTING”, “GLMNET”, “MAXENT”, “NNET”, “RF”, “SLDA”, “SVM”, “TREE”
We create a model using the train_models function, which calls for the corpora container, and a selection of algorithms.
We will choose to use the MAXENT, SVM and TREE algorithms. The other algorithms, for reasons that are unclear, failed to process results successfully.
corporaModels <- train_models(corporaContainer, algorithms = c("MAXENT", "SVM","TREE")) #Create models for classifying emails as spam or not.
results <- classify_models(corporaContainer, corporaModels) #Apply the models to the testing sets of email, to see whether they'd be classified as spam or not.
comparison <- data.frame(Label = emailLabels[(trainLength + 1):totalLength], MAXENT = results$MAXENTROPY_LABEL, SVM = results$SVM_LABEL, TREE = results$TREE_LABEL)
Below, we can see the results for each algorithm
MAXENT <- table(comparison[,c(2,1)]) #MAXENT results
SVM <- table(comparison[,c(3,1)]) #SVM results
TREE <- table(comparison[,c(4,1)]) #TREE results
MAXENT
## Label
## MAXENT non-spam spam
## non-spam 1384 1
## spam 17 500
SVM
## Label
## SVM non-spam spam
## non-spam 1378 0
## spam 23 501
TREE
## Label
## TREE non-spam spam
## non-spam 1383 35
## spam 18 466
We see that of our three algorithms, MAXENT categorizes the most emails correctly, with 1884/1902, vs 1879 for SVM and 1849 for TREE.
Looking further, we see that MAXENT has 17 false positives - that is, falsely categorizes 17 non-spam emails as spam, and 1 false negative, a non-spam email categorized as spam. SVM has 23 false positives, and no false negatives; TREE has 18 false positives, and 35 false negatives.
However, these results are suspiciously high, and I’m afraid my results are somehow tainted, but I have not identified how.
Below, I’ve converted my results to what’s known as sensitivity and specificity, the likelihood that an email is correctly classified, given that it is spam and given that it isn’t, respectively.
prob_table<- function(table) {
rns <- c("True", "False")
cns <- c("Positive","Negative")
TP <- round(table[2,2]/(table[2,2] + table[1,2]),2)
TN <- round(table[1,1]/(table[1,1] + table[2,1]),2)
print(paste0("Sensitivity: ", TP,"; Specificity: ", TN))
data.frame(Positive = c(TP, 1-TP), Negative = c(TN,1-TN), row.names = rns)
}
prob_table(MAXENT)
## [1] "Sensitivity: 1; Specificity: 0.99"
## Positive Negative
## True 1 0.99
## False 0 0.01
prob_table(SVM)
## [1] "Sensitivity: 1; Specificity: 0.98"
## Positive Negative
## True 1 0.98
## False 0 0.02
prob_table(TREE)
## [1] "Sensitivity: 0.93; Specificity: 0.99"
## Positive Negative
## True 0.93 0.99
## False 0.07 0.01