The objective of this assignment is to build a binary classifier model to determine whether an email is “spam” or “not spam”.
This exercise also incorporates many of the data cleansing techniques we've been covering throughout the semester, including regex and HTML scraping methods.
The raw data for this assignment can be found at spamassassin.apache.org.
I chose the following spam and ham data sets from the apache.org site for my analysis: 20021010_spam.tar.bz2 (spam) and 20030228_easy_ham_2.tar.bz2 (ham).
The raw data is available on my Github repo, along with the R Markdown document.
The data wrangling methods employed in this exercise are an amalgamation of techniques already covered in previous readings and exercises. However, the building of the classifier models draws significantly on the procedures outlined in Chapter 10 of Automated Data Collection with R, with particular emphasis on pages 310-312. Here is the full citation:
Munzert, Simon et al. “Chapter 10: Statistical Text Processing.” Automated Data Collection with R: a Practical Guide to Web Scraping and Text Mining, 1st ed., John Wiley & Sons Ltd., Chichester, UK, 2015, pp. 295-321.
if (!require(stringr)) install.packages('stringr')
if (!require(tm.plugin.webmining)) install.packages('tm.plugin.webmining')
if (!require(tm)) install.packages('tm')
if (!require(SnowballC)) install.packages('SnowballC')
if (!require(RTextTools)) install.packages('RTextTools')
if (!require(R.utils)) install.packages('R.utils')
if (!require(utils)) install.packages('utils')
Download and Unzip
Download the raw zip files from Github. Then unzip the .bz2 and .tar files using functions from the R.utils and utils packages, respectively.
# download and unzip spam document from github
download.file('https://raw.githubusercontent.com/spitakiss/Data607_HW10/master/20021010_spam.tar.bz2', destfile="spam_zip.tar.bz2")
bunzip2("spam_zip.tar.bz2", remove = F, overwrite = T)
untar("spam_zip.tar") #creates spam folder
# download and unzip ham document from github
download.file('https://raw.githubusercontent.com/spitakiss/Data607_HW10/master/20030228_easy_ham_2.tar.bz2', destfile="ham_zip.tar.bz2")
bunzip2("ham_zip.tar.bz2", remove = F, overwrite = T)
untar("ham_zip.tar") #creates easy_ham_2 folderRemove Unnecessary Files
Now we have two folders with ham and spam emails. However, each folder contains an extraneous file that provides a content listing of the other emails in that folder. Let's delete these files:
# identify extraneous ham file and delete
remove_ham <- list.files(path="easy_ham_2/", full.names=T, recursive=FALSE, pattern="cmds")
file.remove(remove_ham)
# identify extraneous spam file and delete
remove_spam <-list.files(path="spam/", full.names=T, recursive=FALSE, pattern="0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1")
file.remove(remove_spam)
File List and Name Shuffle
We’ll now create an object, ham_spam, that lists all email file names, regardless of whether the email is actually spam or ham.
Then we will shuffle the order of the files. This last step is important later in the analysis, when we allocate the emails between the training and test data sets.
# list of spam files
spam_files <-list.files(path="spam/", full.names=T, recursive=FALSE)
# list of ham files
ham_files <- list.files(path="easy_ham_2/",full.names=T, recursive=FALSE)
# concatenate ham and spam file lists
ham_spam <- c(ham_files,spam_files)
#shuffle file names
set.seed(2020)
ham_spam <- sample(ham_spam,length(ham_spam))
head(ham_spam,15)
[1] "easy_ham_2/01230.9b29026ab85c0a0bfdba617de748c186"
[2] "easy_ham_2/00749.3500b619df0119e64fc177b3b6eff006"
[3] "easy_ham_2/01174.4542031a0483b9b0cc8cb37de5e57422"
[4] "easy_ham_2/00905.defebe39d659693316e71ad1cd70b127"
[5] "easy_ham_2/00259.977d4930f990d0c5b68b362a33b14d5f"
[6] "easy_ham_2/00128.2d0445f396770a673681019d0fbbf4c7"
[7] "easy_ham_2/00245.1a6c31f4aa59dc224123471dd267a63f"
[8] "easy_ham_2/00745.17068ba58d3abff5214b3cac4af05ef6"
[9] "easy_ham_2/00005.07b9d4aa9e6c596440295a5170111392"
[10] "easy_ham_2/01173.0402e85910220618eac922403222d0ec"
[11] "spam/0045.75baa6797e2a65053a8373d5aa96f594"
[12] "spam/0006.7a32642f8c22bbeb85d6c3b5f3890a2c"
[13] "spam/0160.b6b241d37fa9d5f772afca9ef30034c3"
[14] "easy_ham_2/00798.f0b6d4915a856bc13e789d766b13fcb9"
[15] "easy_ham_2/00772.c3b24116536159a3568324b3310a48df"
Raw Email Example
Here is an example of an email, before any data scrubbing has taken place:
# head of 1st email
head(readLines(ham_spam[1]),10)
[1] "From rpm-list-admin@freshrpms.net Fri Aug 16 10:57:44 2002"
[2] "Return-Path: <rpm-zzzlist-admin@freshrpms.net>"
[3] "Delivered-To: yyyy@localhost.netnoteinc.com"
[4] "Received: from localhost (localhost [127.0.0.1])"
[5] "\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 201B143C46"
[6] "\tfor <jm@localhost>; Fri, 16 Aug 2002 05:56:31 -0400 (EDT)"
[7] "Received: from phobos [127.0.0.1]"
[8] "\tby localhost with IMAP (fetchmail-5.9.0)"
[9] "\tfor jm@localhost (single-drop); Fri, 16 Aug 2002 10:56:31 +0100 (IST)"
[10] "Received: from egwn.net (auth02.nl.egwn.net [193.172.5.4]) by"
# tail of 1st email
tail(readLines(ham_spam[1]),15)
[1] ""
[2] "> They're just suggesting people use \"./configure\" instead."
[3] ""
[4] "No, they do not (what would be the use of that, anyway?). They say that"
[5] "they will _try_ to eval macros even if the user forgot to pass the"
[6] "leading '%', but that this feature should not be relied upon."
[7] ""
[8] "-- "
[9] "On the first day of Christmas my true love sent to me"
[10] " A badly configured newsreader"
[11] ""
[12] "_______________________________________________"
[13] "RPM-List mailing list <RPM-List@freshrpms.net>"
[14] "http://lists.freshrpms.net/mailman/listinfo/rpm-list"
[15] ""
Cleaning Scripts
Now, let’s do some preliminary scrubbing:
# function to find the first blank line in an email.
# we use this to estimate where the email body begins (headers end at the first blank line).
find_blank_line <- function(x){
  for (i in 1:length(x)){
    # match lines that are empty or contain only whitespace
    if (str_detect(x[i], "^\\s*$")){
      return(i)
    }
  }
}
# set up variables for loop
n <- 0
if(exists('email_corpus')){rm(email_corpus)}
# loop through each email
for (i in 1:length(ham_spam)){
tmp <- readLines(ham_spam[i])
# remove email header
beg <- find_blank_line(tmp)+1
end <- length(tmp)
tmp <- tmp[beg:end]
# remove HTML tags
if(extractHTMLStrip(tmp)!=""){
tmp <- extractHTMLStrip(tmp)
}
# remove URL links, punctuation, numbers, newlines, and misc symbols
tmp <- unlist(str_replace_all(tmp,"[[:punct:]]|[[:digit:]]|http\\S+\\s*|\\n|<|>|=|_|-|#|\\$|\\|"," "))
# remove extra whitespace
tmp <- str_trim(unlist(str_replace_all(tmp,"\\s+"," ")))
tmp <- str_c(tmp,collapse="")
# Add emails to corpus, and include spam/ham category information
if (length(tmp)!=0){
n <- n + 1
tmp_corpus <- Corpus(VectorSource(tmp))
ifelse(!exists('email_corpus'), email_corpus <- tmp_corpus, email_corpus <- c(email_corpus,tmp_corpus))
meta(email_corpus[[n]], "spam_ham") <- ifelse(str_detect(ham_spam[i],"spam"),1,0)
}
}
Scrubbed Email Example
Let’s take a look at the email example from earlier in this section, but in post-scrub form:
# example scrubbed 1st email
email_corpus[[1]][1]$content
[1] "Hi Dave Cridland wrote recommends against using configure We will try to support users who accidentally type the leading but this should not be relied upon and yet snip They re just suggesting people use configure instead No they do not what would be the use of that anyway They say that they will try to eval macros even if the user forgot to pass the leading but that this feature should not be relied upon On the first day of Christmas my true love sent to me A badly configured newsreader RPM List mailing list"
Initial DTM
Let’s take a first look at the document term matrix, based on the scrubbing work performed so far:
dtm <- DocumentTermMatrix(email_corpus)
dtm
<<DocumentTermMatrix (documents: 1900, terms: 92546)>>
Non-/sparse entries: 300977/175536423
Sparsity : 100%
Maximal term length: 161
Weighting : term frequency (tf)
Intermediate DTM
We see that the resulting matrix is extremely sparse, and we have at least one term with a length of 161. We’ll now perform additional scrubbing work:
# transform all words in corpus to lower case
email_corpus_mod <- tm_map(email_corpus, content_transformer(tolower))
# remove all stop words (e.g. "i", "me", "she", etc.)
email_corpus_mod <- tm_map(email_corpus_mod,removeWords, words = stopwords("en"))
# stem words: cut certain terms down to word root
email_corpus_mod <- tm_map(email_corpus_mod, stemDocument)
Let's look at the dtm statistics in light of the previous transformations:
dtm <- DocumentTermMatrix(email_corpus_mod)
dtm
<<DocumentTermMatrix (documents: 1900, terms: 85274)>>
Non-/sparse entries: 252122/161768478
Sparsity : 100%
Maximal term length: 161
Weighting : term frequency (tf)
We cut down on the number of sparse terms, but total sparsity is still rounding to 100%. We also note that the maximum term length has not changed.
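As a quick check of these two observations, we can query the matrix directly; this is a minimal sketch using the tm helper functions nTerms() and Terms(), and is not part of the original output:
# sketch: count the remaining terms and find the longest term length
nTerms(dtm)
max(nchar(Terms(dtm)))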
Final DTM
We now remove any terms that are not present in at least 10 documents:
dtm <- removeSparseTerms(dtm,1-(10/length(email_corpus_mod)))
dtm
<<DocumentTermMatrix (documents: 1900, terms: 2547)>>
Non-/sparse entries: 138846/4700454
Sparsity : 97%
Maximal term length: 15
Weighting : term frequency (tf)
This last step had a significant impact: sparsity dropped from roughly 100% to 97%, and the maximum term length fell from 161 to 15.
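As an optional sanity check, tm's findFreqTerms() can list the terms that survived the sparsity filter; the frequency threshold of 200 below is an arbitrary choice for illustration:
# sketch: show terms that appear at least 200 times across the corpus
findFreqTerms(dtm, lowfreq = 200)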
Spam Labels
We'll first create a vector of labels to indicate whether each email is spam or not. We're using a value of 1 to indicate spam and a value of 0 to indicate not spam (i.e. ham).
# create spam label vector for each email which indicates actual status of "spam" or "not spam"
spam_labels_prelim <- unlist(meta(email_corpus_mod,"spam_ham"))
spam_labels <- c(rep(NA,length(email_corpus_mod)))
for (i in 1:length(email_corpus_mod)){
spam_labels[i] <- spam_labels_prelim[[i]]
}
Set Up Models
Here we set up three supervised classifier models for our spam/ham problem: a support vector machine (SVM), a decision tree (TREE), and a maximum entropy (MAXENT) model, all trained via the RTextTools package.
In keeping with common practice, we allocate 80% of the corpus to the training data set and 20% to the test set. Because we randomly shuffled the combined spam/ham file names in an earlier step, we can simply allocate the first 1,520 emails to the training set and the remaining 380 to the test set.
# number of emails in corpus
N <- length(spam_labels)
# set up model container; 80/20 split between train and test data
container <- create_container(
dtm,
labels = spam_labels,
trainSize = 1:(0.8*N),
testSize = (0.8*N+1):N,
virgin = FALSE
)
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
Model Output
Below is example output from the three models.
The first column shows the given model's predicted classification, that is, spam or not spam. The second column is the estimated probability of that classification.
head(svm_out,5)
  SVM_LABEL  SVM_PROB
1 0 0.9997975
2 1 1.0000000
3 1 0.9893355
4 1 0.9893273
5 0 0.8403547
head(tree_out,5)
  TREE_LABEL TREE_PROB
1 0 0.9929825
2 1 0.8735632
3 0 0.9049296
4 1 0.8735632
5 0 0.9929825
head(maxent_out,5)
  MAXENTROPY_LABEL MAXENTROPY_PROB
1 0 1.0000000
2 1 1.0000000
3 1 1.0000000
4 1 0.9999973
5 0 0.9999882
Model Performance
Finally, let's examine the accuracy of the three models. We'll calculate the percentage of emails correctly categorized by each model, using the smaller test data set.
# create labels: actual classification, then model classification
# for three models on test data
labels_out <- data.frame(
correct_label = spam_labels[(0.8*N+1):N],
svm = as.character(svm_out[,1]),
tree = as.character(tree_out[,1]),
maxent = as.character(maxent_out[,1]),
stringsAsFactors = F)
#SVM Performance
svm_table <- table(labels_out[,1] == labels_out[,2])
addmargins(svm_table)
FALSE TRUE Sum
5 375 380
svm_table
FALSE TRUE
5 375
round(prop.table(svm_table),3)
FALSE TRUE
0.013 0.987
#Tree Performance
tree_table <- table(labels_out[,1] == labels_out[,3])
addmargins(tree_table)
FALSE TRUE Sum
33 347 380
round(prop.table(tree_table),3)
FALSE TRUE
0.087 0.913
#ME Performance
me_table <- table(labels_out[,1] == labels_out[,4])
addmargins(me_table)
FALSE TRUE Sum
5 375 380
round(prop.table(me_table),3)
FALSE TRUE
0.013 0.987
Final Thoughts
We see that both the SVM and ME models performed equally well: each classified 375 of the 380 test emails correctly (98.7% accuracy).
The decision tree model, unfortunately, did not perform as well. It correctly classified 347 of the 380 test emails (91.3% accuracy).
Based on this admittedly limited analysis, I recommend using either the SVM or ME models on future email test data sets.
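As a possible next step, a full confusion matrix (actual vs. predicted labels) would separate false positives from false negatives rather than lumping all errors together. Here is a minimal sketch for the SVM output, reusing the labels_out data frame built above:
# sketch: cross-tabulate actual labels (rows) against SVM predictions (columns)
table(actual = labels_out$correct_label, predicted = labels_out$svm)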