Email: one of the most useful communication tools available to anyone with an internet connection. Unfortunately, anyone with an internet connection can use it, including robots. Our goal in this exercise is to separate the “good” emails (ham) from the “bad” (spam–which is actually quite tasty sometimes) using sets of already identified emails (our “training” documents), and then predict whether or not a new document is spam (our “testing” documents).
So let’s start by loading our necessary packages. I’m using a new (for me) package called pacman to load the libraries. pacman is an easier way to install and load multiple packages at the same time, and anything that saves a Data Scientist time is worth investigating.
#install.packages('pacman')
pacman::p_load(knitr, tm, RTextTools, stringr, wordcloud)
For the data, I’m using the corpus from Spam Assassin dated February 28, 2003. I’ve downloaded all 5 of these sets locally (to the folder ‘C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11’): 2 Easy Ham sets, 1 Hard Ham set, and 2 Spam sets. These local files were also uploaded to my GitHub page for reproducibility. For now, I’ll ignore the Hard Ham set, which I can come back to when time allows, and focus on the 2 Easy Ham sets and the 2 Spam sets. The first Ham and Spam sets will be our training documents, and the 2nd Ham and Spam sets will be our test documents to see how well our filter works. So let’s start by loading them all.
We’ll start by setting some values for our working directory. This will make things easier going forward (trust me, you don’t want to type C:/Users/XXXXX every time you need to grab something from a local file).
#set wd to our appropriate working directory
wd <- "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11"
#set a name to each folder
easy_ham <- '/easy_ham'
easy_ham_2 <- '/easy_ham_2'
spam <- '/spam'
spam_2 <- '/spam_2'
Now we need to find the names of all of the files in each set:
ham_names <- list.files(sprintf("%s%s", wd, easy_ham))
ham2_names <- list.files(sprintf("%s%s", wd, easy_ham_2))
spam_names <- list.files(sprintf("%s%s", wd, spam))
spam2_names <- list.files(sprintf("%s%s", wd, spam_2))
head(ham_names)
## [1] "00001.7c53336b37003a9286aba55d2945844c"
## [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
## [4] "00004.864220c5b6930b209cc287c361c99af1"
## [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
## [6] "00006.253ea2f9a9cc36fa0b1129b04b806608"
Then, using these names, we’ll build the full path to each file in each set. These paths form the basis of our corpus.
ham_files <- sprintf("%s%s/%s", wd, easy_ham, ham_names)
ham2_files <- sprintf("%s%s/%s", wd, easy_ham_2, ham2_names)
spam_files <- sprintf("%s%s/%s", wd, spam, spam_names)
spam2_files <- sprintf("%s%s/%s", wd, spam_2, spam2_names)
head(ham_files)
## [1] "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/easy_ham/00001.7c53336b37003a9286aba55d2945844c"
## [2] "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/easy_ham/00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/easy_ham/00003.860e3c3cee1b42ead714c5c874fe25f7"
## [4] "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/easy_ham/00004.864220c5b6930b209cc287c361c99af1"
## [5] "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/easy_ham/00005.bf27cdeaf0b8c4647ecd61b1d09da613"
## [6] "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/easy_ham/00006.253ea2f9a9cc36fa0b1129b04b806608"
As a set-up for our corpus, we’ll use the first file of the first Ham folder and make sure everything gets established correctly. To do this, we’ll [1] pull in all of the text of the file as one line, [2] set up the corpus, and [3] add the appropriate metadata.
First we’ll pull in all of the text from the document. Since we don’t care about the extra white space (we’ll remove it later), we can simply use the stringr package to collapse all of the lines together and save the text as a temporary object.
tmp <- readLines(ham_files[1])
tmp <- str_c(tmp, collapse = " ")
Using this temporary object, we set up the corpus: main_corpus.
main_corpus <- Corpus(VectorSource(tmp))
With the corpus ready, we’ll add metadata that could help us in the analysis later on. Namely, we want the [1] file name, [2] type (ham or spam), [3] From (who the email is from), [4] To (who the email is addressed to), [5] Date of the email, and [6] Subject of the email. We may not dig into all of those, but they could be useful to have.
meta(main_corpus[[1]], "filename") <- ham_names[1]
meta(main_corpus[[1]], "type") <- "Ham"
meta(main_corpus[[1]], "From") <- na.omit(str_extract(readLines(ham_files[1]), "^(From: )[[:alnum:][:digit:]: ,.@+<>-]+"))
meta(main_corpus[[1]], "To") <- na.omit(str_extract(readLines(ham_files[1]), "^(To: )[[:alnum:][:digit:]: ,.@+<>-]+"))
meta(main_corpus[[1]], "Date") <- na.omit(str_extract(readLines(ham_files[1]), "^(Date: )[[:alnum:][:digit:]: ,.@+<>-]+"))
meta(main_corpus[[1]], "Subject") <- na.omit(str_extract(readLines(ham_files[1]), "^(Subject: )[[:alnum:][:digit:]: ,.@+<>-]+"))
It’s usually a good idea to check the corpus to make sure it’s all running correctly; this includes looking at the metadata.
ham_files[2]
## [1] "C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/easy_ham/00002.9c4069e25e1ef370c078db7ee85ff9ac"
meta(main_corpus[[1]])
## author : character(0)
## datetimestamp: 2016-04-11 03:48:37
## description : character(0)
## heading : character(0)
## id : 1
## language : en
## origin : character(0)
## filename : 00001.7c53336b37003a9286aba55d2945844c
## type : Ham
## From : From: Robert Elz <kre@munnari.OZ.AU>
## To : To: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
## Date : Date: Thu, 22 Aug 2002 18:26:25 +0700
## Subject : Subject: Re: New Sequences Window
With the set-up for our first document working, we’ll need to do the same for every document in both the ham and spam sets. We’ll [1] finish the first Ham set, then [2] add the first Spam set, followed by [3] the second Ham set and [4] the second Spam set.
Combining all of the corpus steps together, we have our first loop ready (starting at document 2, since document 1 is already in place):
for(i in 2:length(ham_names)){
  # read the raw email and collapse it into a single string
  tmp <- readLines(ham_files[i])
  tmp <- str_c(tmp, collapse = " ")
  # build a one-document corpus and append it to the main corpus
  tmp_corpus <- Corpus(VectorSource(tmp))
  main_corpus <- c(main_corpus, tmp_corpus)
  # tag the new document with its metadata
  meta(main_corpus[[i]], "filename") <- ham_names[i]
  meta(main_corpus[[i]], "type") <- "Ham"
  meta(main_corpus[[i]], "From") <- na.omit(str_extract(readLines(ham_files[i]), "^(From: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[i]], "To") <- na.omit(str_extract(readLines(ham_files[i]), "^(To: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[i]], "Date") <- na.omit(str_extract(readLines(ham_files[i]), "^(Date: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[i]], "Subject") <- na.omit(str_extract(readLines(ham_files[i]), "^(Subject: )[[:alnum:][:digit:]: ,.@+<>-]+"))
}
main_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2501
meta(main_corpus[[3]])
## author : character(0)
## datetimestamp: 2016-04-11 03:48:37
## description : character(0)
## heading : character(0)
## id : 1
## language : en
## origin : character(0)
## filename : 00003.860e3c3cee1b42ead714c5c874fe25f7
## type : Ham
## From : character(0)
## To : To: zzzzteana <zzzzteana@yahoogroups.com>
## Date : Date: Thu, 22 Aug 2002 13:52:38 +0100
## Subject : character(0)
The first Ham folder had 2501 documents, so this checks out.
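A quick aside before the next three (nearly identical) loops: the whole read-append-tag pattern could be factored into a single helper that takes a folder’s files and a label. Here’s a minimal sketch; append_emails is a hypothetical name of my own, and the helper simply bundles the same calls used above:

# Hypothetical helper: read every file in one folder, append each email to an
# existing corpus, and tag it with its metadata. The offset is taken from the
# current corpus length, so the function can be called once per folder.
append_emails <- function(corpus, files, names, label) {
  k <- length(corpus)
  for (i in seq_along(files)) {
    lines <- readLines(files[i])
    corpus <- c(corpus, Corpus(VectorSource(str_c(lines, collapse = " "))))
    meta(corpus[[k + i]], "filename") <- names[i]
    meta(corpus[[k + i]], "type") <- label
    meta(corpus[[k + i]], "From") <- na.omit(str_extract(lines, "^(From: )[[:alnum:][:digit:]: ,.@+<>-]+"))
    meta(corpus[[k + i]], "To") <- na.omit(str_extract(lines, "^(To: )[[:alnum:][:digit:]: ,.@+<>-]+"))
    meta(corpus[[k + i]], "Date") <- na.omit(str_extract(lines, "^(Date: )[[:alnum:][:digit:]: ,.@+<>-]+"))
    meta(corpus[[k + i]], "Subject") <- na.omit(str_extract(lines, "^(Subject: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  }
  corpus
}

# usage would look like: main_corpus <- append_emails(main_corpus, spam_files, spam_names, "Spam")

For now I’ll keep writing the loops out explicitly so each step is easy to follow.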
It looks like our corpus and metadata are gathered correctly, so we now need to add our spam data. We need an offset k so that we don’t overwrite our ham documents; setting it is the first important step, followed by the loop for the spam set:
k <- length(ham_files)
for(i in 1:length(spam_names)){
  tmp <- readLines(spam_files[i])
  tmp <- str_c(tmp, collapse = " ")
  tmp_corpus <- Corpus(VectorSource(tmp))
  main_corpus <- c(main_corpus, tmp_corpus)
  meta(main_corpus[[k+i]], "filename") <- spam_names[i]
  meta(main_corpus[[k+i]], "type") <- "Spam"
  meta(main_corpus[[k+i]], "From") <- na.omit(str_extract(readLines(spam_files[i]), "^(From: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "To") <- na.omit(str_extract(readLines(spam_files[i]), "^(To: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "Date") <- na.omit(str_extract(readLines(spam_files[i]), "^(Date: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "Subject") <- na.omit(str_extract(readLines(spam_files[i]), "^(Subject: )[[:alnum:][:digit:]: ,.@+<>-]+"))
}
## Warning in readLines(spam_files[i]): incomplete final line found
## on 'C:/Users/itsal/Documents/GitHub/DATA607/Assignment 11/spam/
## 00136.faa39d8e816c70f23b4bb8758d8a74f0'
main_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3002
meta(main_corpus[[k+3]])
## author : character(0)
## datetimestamp: 2016-04-11 03:48:58
## description : character(0)
## heading : character(0)
## id : 1
## language : en
## origin : character(0)
## filename : 00003.2ee33bc6eacdb11f38d052c44819ba6c
## type : Spam
## From : character(0)
## To : To: <zzzz@spamassassin.taint.org>
## Date : Date: Thu, 22 Aug 2002 07:36:19 -0600
## Subject : Subject: Guaranteed to lose 10-12 lbs in 30 days 11.150
The corpus now has 3002 documents: 2501 from the first ham set and 501 from the first spam set. This number checks out, and it’s worth remembering since it’s the size of our training set. Remember: 3002 documents. 3002 documents. 3002 documents. 3002 documents.
Our training documents are established; now we need the first part of our testing set, which is our second ham set. k is now set to 3002, the number of documents in our first ham and first spam sets combined, which is the size of our training set. 3002. 3002 documents. 3002 documents. 3002 documents.
k <- length(ham_files) + length(spam_files)
for(i in 1:length(ham2_names)){
  tmp <- readLines(ham2_files[i])
  tmp <- str_c(tmp, collapse = " ")
  tmp_corpus <- Corpus(VectorSource(tmp))
  main_corpus <- c(main_corpus, tmp_corpus)
  meta(main_corpus[[k+i]], "filename") <- ham2_names[i]
  meta(main_corpus[[k+i]], "type") <- "Ham"
  meta(main_corpus[[k+i]], "From") <- na.omit(str_extract(readLines(ham2_files[i]), "^(From: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "To") <- na.omit(str_extract(readLines(ham2_files[i]), "^(To: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "Date") <- na.omit(str_extract(readLines(ham2_files[i]), "^(Date: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "Subject") <- na.omit(str_extract(readLines(ham2_files[i]), "^(Subject: )[[:alnum:][:digit:]: ,.@+<>-]+"))
}
main_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4403
meta(main_corpus[[k+3]])
## author : character(0)
## datetimestamp: 2016-04-11 03:49:03
## description : character(0)
## heading : character(0)
## id : 1
## language : en
## origin : character(0)
## filename : 00003.19be8acd739ad589cd00d8425bac7115
## type : Ham
## From : From: Chris Garrigues <cwg-exmh@DeepEddy.Com>
## To : To: Robert Elz <kre@munnari.OZ.AU>
## Date : Date: Wed, 21 Aug 2002 10:17:14 -0500
## Subject : Subject: Re: New Sequences Window
We now have 4403 documents, of which 1401 form the first part of our testing set and 3002 are the training set. 3002 documents. 3002 docum… Ok, I’ll stop for now.
Last but not least the second spam set is added to the corpus:
k <- length(ham_files) + length(spam_files) + length(ham2_files)
for(i in 1:length(spam2_names)){
  tmp <- readLines(spam2_files[i])
  tmp <- str_c(tmp, collapse = " ")
  tmp_corpus <- Corpus(VectorSource(tmp))
  main_corpus <- c(main_corpus, tmp_corpus)
  meta(main_corpus[[k+i]], "filename") <- spam2_names[i]
  meta(main_corpus[[k+i]], "type") <- "Spam"
  meta(main_corpus[[k+i]], "From") <- na.omit(str_extract(readLines(spam2_files[i]), "^(From: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "To") <- na.omit(str_extract(readLines(spam2_files[i]), "^(To: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "Date") <- na.omit(str_extract(readLines(spam2_files[i]), "^(Date: )[[:alnum:][:digit:]: ,.@+<>-]+"))
  meta(main_corpus[[k+i]], "Subject") <- na.omit(str_extract(readLines(spam2_files[i]), "^(Subject: )[[:alnum:][:digit:]: ,.@+<>-]+"))
}
main_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5801
meta(main_corpus[[k+3]])
## author : character(0)
## datetimestamp: 2016-04-11 03:49:17
## description : character(0)
## heading : character(0)
## id : 1
## language : en
## origin : character(0)
## filename : 00003.590eff932f8704d8b0fcbe69d023b54d
## type : Spam
## From : From: amknight@mailexcite.com
## To : To: cbmark@cbmark.com
## Date : Date: Wed, 30 Jul 1980 18:25:49
## Subject : Subject: New Improved Fat Burners, Now With TV Fat Absorbers
Our final set makes the corpus 5801 documents large. Time to clean and analyze.
With our corpus in hand, we’ll start by cleaning the corpus, then creating the TermDocumentMatrix (TDM) and the DocumentTermMatrix (DTM).
Since the texts contain a lot of challenging items such as punctuation, numbers, stop words (who, that, etc.), and white space (unnecessary blank lines and extra spaces), we’ll need to remove those and perform some formatting changes: reducing words to their stems and converting upper case to lower case:
clean_files <- function(corpus){
  corpus <- tm_map(corpus, removeNumbers)                      # drop numbers
  corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
  corpus <- tm_map(corpus, removeWords, stopwords('english'))  # drop common stop words
  corpus <- tm_map(corpus, stemDocument)                       # reduce words to their stems
  corpus <- tm_map(corpus, PlainTextDocument)                  # keep documents as plain text
  corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
  corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra white space
  corpus
}
main_corpus <- clean_files(main_corpus)
main_corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5801
Super fantastic: the data has been collected, a corpus created, metadata added, and the documents cleaned. We now need to connect the data using a term document matrix or document term matrix before we can move on to the testing phase, and remove sparse terms along the way.
The Term Document Matrix (TDM) analyzes the data (the remaining word terms) and gives a count for each time a term is used. The TDM passes through each document, recording a frequency count for every term, and places the results in a single matrix. This matrix can then be used for analysis.
tdm_emails <- TermDocumentMatrix(main_corpus)
tdm_emails
## <<TermDocumentMatrix (terms: 268124, documents: 5801)>>
## Non-/sparse entries: 1427529/1553959795
## Sparsity : 100%
## Maximal term length: 990
## Weighting : term frequency (tf)
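Before trimming anything, it can be handy to peek at which terms dominate. tm’s findFreqTerms() lists every term above a frequency threshold; the cutoff of 1000 below is an arbitrary choice for illustration (output not shown):

# terms that appear at least 1000 times across the whole corpus
findFreqTerms(tdm_emails, lowfreq = 1000)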
Basically just a transposition of the TDM, the Document Term Matrix (DTM) can also be created simply:
dtm_emails <- DocumentTermMatrix(main_corpus)
dtm_emails
## <<DocumentTermMatrix (documents: 5801, terms: 268124)>>
## Non-/sparse entries: 1427529/1553959795
## Sparsity : 100%
## Maximal term length: 990
## Weighting : term frequency (tf)
If we were to view the full TDM or DTM, we would see that there are many sparse terms used only once or twice. These terms can be a bottleneck in processing and analysis, so it shouldn’t be much of an issue to remove them:
tdm_emails <- removeSparseTerms(tdm_emails, 0.998)
tdm_emails
## <<TermDocumentMatrix (terms: 10337, documents: 5801)>>
## Non-/sparse entries: 1001006/58963931
## Sparsity : 98%
## Maximal term length: 79
## Weighting : term frequency (tf)
dtm_emails <- removeSparseTerms(dtm_emails, 0.998)
dtm_emails
## <<DocumentTermMatrix (documents: 5801, terms: 10337)>>
## Non-/sparse entries: 1001006/58963931
## Sparsity : 98%
## Maximal term length: 79
## Weighting : term frequency (tf)
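Since we loaded the wordcloud package at the start, we may as well put it to use on the trimmed TDM. A quick sketch (plot not shown): slam is a dependency of tm, and its row_sums() sums term frequencies without converting the sparse matrix into a dense one.

# total frequency of each remaining term across all documents
term_freq <- sort(slam::row_sums(tdm_emails), decreasing = TRUE)
# plot the 50 most common terms, sized by frequency
wordcloud(names(term_freq), term_freq, max.words = 50, random.order = FALSE)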
To recap: [1] we gathered all of our data into a corpus and [2] added metadata. In the cleaning phase we [1] removed numbers, punctuation, stop words, and white space, [2] converted words to their easier-to-use stems, and [3] lower-cased all terms. In the connecting phase we [1] created a TDM and DTM of the corpora and [2] removed sparse terms. We can now move on to the analysis and see what the data tells us, but first we need to put the data in a container, fit a model for each algorithm, and then test those models.
We need to save the metadata now, before it becomes harder to reach. Since we no longer need the file names or other characteristics, we only need to save the type (Ham or Spam).
meta_type <- meta(main_corpus, tag = "type")
ft_labels <- unlist(meta_type)
head(ft_labels)
## 1 1 1 1 1 1
## "Ham" "Ham" "Ham" "Ham" "Ham" "Ham"
With everything nearly ready, we’ll place the data in a container for testing purposes, specifying which documents are our training documents and which are our testing documents. Thank goodness we remembered the number of training documents. Did you forget? It’s 3002 documents.
container <- create_container(
dtm_emails,
labels = ft_labels,
trainSize = 1:3002,
testSize = 3003:5801,
virgin = FALSE
)
slotNames(container)
## [1] "training_matrix" "classification_matrix" "training_codes"
## [4] "testing_codes" "column_names" "virgin"
We can simply train the models now using the preset functions for the three algorithms at hand: [1] SVM, [2] a decision tree (TREE), and [3] Maximum Entropy.
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
The test data is now classified by each of the models:
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
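As a side note, RTextTools can also do this in batches: train_models() and classify_models() accept a vector of algorithm names, so the six lines above could be collapsed into something like the sketch below (same three algorithms, same container):

# train and classify all three algorithms in one pass
models <- train_models(container, algorithms = c("SVM", "TREE", "MAXENT"))
results <- classify_models(container, models)

I’ll stick with the individual calls so each model’s output stays in its own object.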
We now add the correct test labels back.
labels_out <- data.frame(
correct_label = ft_labels[3003:5801],
svm = as.character(svm_out[,1]),
tree = as.character(tree_out[,1]),
maxent = as.character(maxent_out[,1]),
stringsAsFactors = F
)
Finally, the bulk of our work comes down to testing the performance of each classifier. We have three at hand: [1] the SVM, [2] the decision tree, and [3] Maximum Entropy. Starting with the SVM:
table(labels_out[,1] == labels_out[,2])
##
## FALSE TRUE
## 101 2698
prop.table(table(labels_out[,1] == labels_out[,2]))
##
## FALSE TRUE
## 0.03608432 0.96391568
We can see the SVM makes 101 mis-categorizations and 2698 correct ones. Next up, the decision tree:
table(labels_out[,1] == labels_out[,3])
##
## FALSE TRUE
## 1164 1635
prop.table(table(labels_out[,1] == labels_out[,3]))
##
## FALSE TRUE
## 0.4158628 0.5841372
The decision tree performs far worse, with 1164 mis-categorizations and only 1635 correct ones. Finally, Maximum Entropy:
table(labels_out[,1] == labels_out[,4])
##
## FALSE TRUE
## 111 2688
prop.table(table(labels_out[,1] == labels_out[,4]))
##
## FALSE TRUE
## 0.03965702 0.96034298
The Maximum Entropy model does much better than the decision tree, but slightly worse than the SVM model: 111 mis-categorizations and 2688 correct ones.
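One caveat on these TRUE/FALSE counts: they hide the direction of the errors, which matters for a spam filter (flagging real mail as spam is usually worse than letting the occasional spam message through). A full confusion matrix makes the direction visible; for example, for the SVM model (output not shown):

# rows are the true labels, columns are what the SVM predicted
table(actual = labels_out$correct_label, predicted = labels_out$svm)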
Our data set was very large. Had we stuck to using one set of documents for training and testing (i.e., using the first ham and first spam sets and limiting the training set to 1000 documents) we may have had a different outcome, for better or worse. In the end, the SVM model performed the best with 96.4% correct categorizations. That still amounts to (conservatively) about 4 mis-categorized emails for every 100, which is significantly better than what I see in my personal email. Gmail, get your mess together.