In this project I will be using the [Naive] [Bayes] [classifier] machine learning algorithm to make predictions on whether an email is spam or not. IBM, the company I work at, has a lot of education around ML algorithms and I leveraged that here in the project as well as other sources. I’ve also previously worked with a team on something similar (statistical classification) using a TensorFlow model to predict credit card fraud, the public version of which is hosted [here].
First, we’ll need to ingest our data and prepare it; preparing it will involve creating the ‘metadata’ around each email by creating our corpus of terms used in the email, and create our email term frequency matrix. The important part here is that the data we are using is already marked as fraudulent, spam, or not, ham. We then split this data into two groups, one for our model training and the other to test how accurate our trained model is. I usually split these into groups of sized 90/10 or 95/5.
I’m going to directly pull the tar files containing our ham and spam emails, parse the contents into data frames, then break them up into our training and testing data.
tmpdir <- tempdir()
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
spam_url <- paste0(base_url,"20021010_spam.tar.bz2")
ham_url <- paste0(base_url,"20021010_easy_ham.tar.bz2")
spamTar <- basename(spam_url)
hamTar <- basename(ham_url)
if(!file.exists("20021010_spam.tar.bz2"))
download.file(spam_url, spamTar)
if(!file.exists("20021010_easy_ham.tar.bz2"))
download.file(ham_url, hamTar)
untar(spamTar, exdir = tmpdir)
untar(hamTar, exdir = tmpdir)
getFileText = function(uri, output){
text <-readLines(uri)
# return email text
return(paste(text, collapse="\n"))
}
getFrom = function(email, output){
return(str_extract(email,"(?<=From: ).*?(?=\\n)"))
}
getTo = function(email, output){
return(str_extract(email,"(?<=To: ).*?(?=\\s)"))
}
getCType = function(email, output){
return(str_extract(email,"(?<=Content-Type: ).*?(?=\\s)"))
}
# Need to convert body text to UTF-8
getBody = function(email, output){
return(substr(email,str_locate(email,"\\n\\n")[,2],nchar(iconv(email, from = "", to = "UTF8"))))
}
hamFiles <- list.files(path = paste0(tmpdir,"/easy_ham/"), full.names = TRUE)
hamText <- apply(array(hamFiles), 1, getFileText)
spamFiles <- list.files(path = paste0(tmpdir,"/spam/"), full.names = TRUE)
spamText <- apply(array(spamFiles), 1, getFileText)
## Warning in readLines(uri): incomplete final line found on
## 'C:\Users\Nick\AppData\Local\Temp\RtmpaKPeLo/spam/0143.260a940290dcb61f9327b224a368d4af'
from <- apply(array(hamText), 1, getFrom)
to <- apply(array(hamText), 1, getTo)
cType <- apply(array(hamText), 1, getCType)
body <- apply(array(hamText), 1, getBody)
ham <- data.frame(from,
to,
cType,
body)
from <- apply(array(spamText), 1, getFrom)
to <- apply(array(spamText), 1, getTo)
cType <- apply(array(spamText), 1, getCType)
body <- apply(array(spamText), 1, getBody)
spam <- data.frame(from,
to,
cType,
body)
unshuffled <- rbind(ham %>% mutate(type = "ham"),
spam %>% mutate(type = "spam"))
# Shuffle emails to allow us to pull random ham vs spam during test/train separation
emails <- unshuffled[sample(nrow(unshuffled), nrow(unshuffled)),]
head(emails)
## from
## 2794 "IQ - AFM" <afm@insiq.us>
## 2022 scripting <rssfeeds@example.com>
## 2407 guardian <rssfeeds@example.com>
## 2209 fark <rssfeeds@example.com>
## 1131 Ville =?ISO-8859-1?Q?Skytt=E4?= <ville.skytta@iki.fi>
## 1526 yyyy@example.com (Justin Mason)
## to cType
## 2794 zzzz@localhost.jmason.org multipart/alternative;
## 2022 yyyy@localhost.example.com text/plain;
## 2407 yyyy@localhost.example.com text/plain;
## 2209 yyyy@localhost.example.com text/plain;
## 1131 yyyy@localhost.netnoteinc.com text/plain;
## 1526 yyyy@example.com <NA>
## body
## 2794 \nThis is a multi-part message in MIME format.\n\n------=_NextPart_000_2505E5_01C254FA.FC480250\nContent-Type: text/plain;\n\tcharset="Windows-1252"\nContent-Transfer-Encoding: quoted-printable\n\n Behind every elite producer...is an elite seminar system!=09\n \t =09\n =09\n =09\n Attention all independent registered representatives:\n Our reps' business has grown dramatically in the last two years.\n What's their hook?\n Our exclusive seminar marketing system for seniors, combined with\nequity indexed and fixed annuities.=09\n \tAdvanced equity indexed annuities and how registered reps\nposition these products in their practice. (The EIA story is out =97 =\nwith\n8 years of great results).\t=20\n \tProven seminar marketing system (Hundreds of our reps are using\nthis system to build their senior client base).\t=20\n \tA simpler approach to increasing sales with our generic\nPowerPoint=AE EIA presentation, and "Should I Stay or Should I Go,"\nIncomizer and Inflationizer sales ideas.\t=20\n \tReal world scenarios that you can use to solve real world\nproblems in today's environment.\t=20\n The bottom line:=0A=\nYour senior clients and prospects want you to call us\ntoday.\n =09\nCall us today for more information!\n 800-880-3072\n- or -\n\nPlease fill out the form below for more information\t=20\nName:\t \t =09\nE-mail:\t \t=20\nPhone:\t \t=20\nCity:\t \t State:\t \t=20\n \t =09\n=20\n\n American Financial Marketing, Inc.\nWe don't want anyone to receive our mailings who does not wish to\nreceive them. This is a professional communication sent to insurance\nprofessionals. To be removed from this mailing list, DO NOT REPLY to\nthis message. Instead, go here: http://www.insuranceiq.com/optout\n<http://www.insuranceiq.com/optout/>=20\n\nLegal Notice <http://www.insiq.com/legal.htm>=20\n\n------=_NextPart_000_2505E5_01C254FA.FC480250\nContent-Type: text/html;\n\tcharset="iso-8859-1"\nContent-Transfer-Encoding: quoted-printable\n\n<html>\n<head>\n<title>Behind Every Elite Producer...</title>\n<meta http-equiv=3D"Content-Type" content=3D"text/html; =\ncharset=3Diso-8859-1">\n</head>\n<body bgcolor=3D"#666666" text=3D"#000000">\n<font face=3D"Arial, Helvetica, sans-serif">\n<table width=3D"550" border=3D"1" align=3D"center" cellpadding=3D"0" =\ncellspacing=3D"0" bordercolor=3D"#000000" bgcolor=3D"#FFFFFF">\n <tr>=20\n <td>\n\t\t\t<table width=3D"100%" cellpadding=3D"0" cellspacing=3D"0" =\nborder=3D"0" bgcolor=3D"#ffffff">\n <tr>=20\n <td><img src=3D"http://iiq.us/images/amfin/20020823-2.gif" =\nalt=3D"Behind every elite producer...is an elite seminar system!" =\nwidth=3D"550" height=3D"134"></td>\n </tr>\n <tr>=20\n <td>=20\n <table width=3D"100%" border=3D"0" cellspacing=3D"0" =\ncellpadding=3D"0">\n <tr>=20\n <td rowspan=3D"3" width=3D"375"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-101.jpg" width=3D"375" =\nheight=3D"270"></td>\n <td><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-102.gif" width=3D"175" =\nheight=3D"49"></td>\n </tr>\n <tr>=20\n <td><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-103.jpg" width=3D"175" =\nheight=3D"169"></td>\n </tr>\n <tr>=20\n <td><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-104.gif" width=3D"175" =\nheight=3D"52"></td>\n </tr>\n </table></td>\n </tr>\n <tr>=20\n <td>\n\t\t\t\t\t\t<table width=3D"100%" border=3D"0" cellspacing=3D"2" =\ncellpadding=3D"2">\n <tr>=20\n <td align=3D"center"> <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-301.gif" alt=3D"Attention all =\nindependent registered representatives:" width=3D"540" =\nheight=3D"36"><br>\n <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-302.gif" alt=3D"Our reps' =\nbusiness has grown dramatically in the last two years." width=3D"540" =\nheight=3D"26"><br>\n <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-303.gif" alt=3D"What's their =\nhook?" width=3D"540" height=3D"63"><br>\n <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-304.gif" alt=3D"Our exclusive =\nseminar marketing system for seniors, combined with equity indexed and =\nfixed annuities." width=3D"540" height=3D"53">=20\n </td>\n </tr>\n <tr>=20\n <td align=3D"center">=20\n <table width=3D"100%" border=3D"0" cellspacing=3D"2" =\ncellpadding=3D"2">\n <tr>=20\n <td width=3D"4%" valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n <td><font size=3D"2"><b>Advanced equity indexed =\nannuities</b>=20\n and how registered reps position these products =\nin their=20\n practice. (The EIA story is out — with 8 =\nyears of=20\n great results).</font></td>\n </tr>\n <tr>=20\n <td valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n <td><font size=3D"2"><b>Proven seminar marketing =\nsystem </b>(Hundreds=20\n of our reps are using this system to build their =\nsenior=20\n client base).</font></td>\n </tr>\n <tr>=20\n <td valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n <td><font size=3D"2"><b>A simpler approach to =\nincreasing sales</b>=20\n with our generic PowerPoint® EIA =\npresentation, and=20\n "Should I Stay or Should I Go," Incomizer=20\n and Inflationizer sales ideas.</font></td>\n </tr>\n <tr>=20\n <td valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n <td><font size=3D"2"><b>Real world scenarios that =\nyou can=20\n use to solve real world problems in today's =\nenvironment.</b></font></td>\n </tr>\n </table>\n </td>\n </tr>\n <tr>=20\n <td align=3D"center"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-5.gif" alt=3D"The bottom =\nline: Your senior clients and prospects want you to call us today." =\nwidth=3D"540" height=3D"59"><br>\n <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-99.gif" width=3D"8" =\nheight=3D"8"></td>\n </tr>\n <tr>=20\n <td align=3D"center">=20\n <b><font size=3D"2">Call us today for more =\ninformation!<br>\n <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-6.gif" alt=3D"800-880-3072" =\nwidth=3D"529" height=3D"90"><br>\n - or -</font></b><br>\n <table width=3D"95%" border=3D"1" cellpadding=3D"0" =\ncellspacing=3D"0" bordercolor=3D"#000000">\n <tr>=20\n <form method=3D"post" =\naction=3D"http://65.217.159.103/response/response.asp">\n <td>=20\n <table width=3D"100%" border=3D"0" =\ncellspacing=3D"0" cellpadding=3D"2" align=3D"center" =\nbgcolor=3D"#cccccc">\n <tr>=20\n <td colspan=3D"5" align=3D"center" =\nbgcolor=3D"#000066"><b><font color=3D"#FFFFFF" size=3D"2">Please fill =\nout the form below for more information</font></b></td>\n </tr>\n <tr>=20\n <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">Name:</font></b></td>\n <td colspan=3D"3"><input type=3D"text" =\nname=3D"contactname" size=3D"50"></td>\n <td rowspan=3D"5" valign=3D"middle" =\nalign=3D"center"><img src=3D"http://iiq.us/images/q3.gif" width=3D"43" =\nheight=3D"50"></td>\n </tr>\n <tr>=20\n <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">E-mail:</font></b></td>\n <td colspan=3D"3"><input type=3D"text" =\nname=3D"email" size=3D"50"></td>\n </tr>\n <tr>=20\n <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">Phone:</font></b></td>\n <td colspan=3D"3"><input type=3D"text" =\nname=3D"phone" size=3D"50"></td>\n </tr>\n <tr>=20\n <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">City:</font></b></td>\n <td width=3D"30%"><input type=3D"text" =\nname=3D"city" size=3D"20"></td>\n <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">State:</font></b></td>\n <td width=3D"20%"><input type=3D"text" =\nname=3D"state" size=3D"2"></td>\n </tr>\n <tr>=20\n <td width=3D"15%" =\nalign=3D"right"> </td>\n <td width=3D"30%" colspan=3D'3'>=20\n <input type=3D"submit" =\nname=3D"btnsubmit" value=3D"Submit">\n <input type=3D"hidden" =\nname=3D"hdnRecipientTxt" value=3D"brianb@americanfinancialbrkg.com">\n <input type=3D"hidden" =\nname=3D"hdnSubjectTxt" value=3D"AFM Ad Inquiry">\n </td>\n </tr>\n </table>\n </td>\n <input type=3D'hidden' name=3D'SentTo' =\nvalue=3D'304704'>\n</form>\n </tr>\n </table><br>\n <img src=3D"http://iiq.us/images/amfin/20020823-7.gif" =\nalt=3D"American Financial Marketing, Inc." width=3D"400" height=3D"160">\n\t\t\t\t\t\t\t\t</td>\n </tr>\n </table>\n\t\t\t\t\t</td>\n </tr>\n </table>\n\t\t</td>\n </tr>\n <tr>=20\n <td width=3D"100%" bgcolor=3D"#cccccc" align=3D"center">\n\t\t\t<p><font face=3D"Arial, Helvetica, sans-serif" size=3D"1">We don't =\nwant anyone to receive our mailings who does not=20\n\t\t\twish to receive them. This is a professional communication=20\n\t\t\tsent to insurance professionals. To be removed from this mailing=20\n\t\t\tlist, <b>DO NOT REPLY</b> to this message. Instead, go here: <a =\nhref=3D"http://www.insuranceiq.com/optout/">=20\n\t\t\thttp://www.insuranceiq.com/optout</a></font></p>\n </td>\n </tr>\n</table>\n<center>\n <a href=3D"http://www.insiq.com/legal.htm"><font size=3D"1" =\nface=3D"Arial, Helvetica, sans-serif">Legal Notice </font></a>=20\n</center>\n</font>=20\n</body>\n</html>\n\n------=_NextPart_000_2505E5_01C254FA.FC480250--\n\n
## 2022 \nURL: http://scriptingnews.userland.com/backissues/2002/09/25#When:6:25:23AM\nDate: Wed, 25 Sep 2002 13:25:23 GMT\n\nJon Hanna, on the RSS-DEV list, says[1] that RSS, was "not designed to be of \nany particular use to bloggers, aggregators, or metadata providers." This is \nnot true. Half of RSS 0.91 was scriptingNews format[2], which was totally \ndesigned to model a weblog in XML. \n\n[1] http://groups.yahoo.com/group/rss-dev/message/4023\n[2] http://my.userland.com/stories/storyReader$11\n\n
## 2407 <NA>
## 2209 \nURL: http://www.newsisfree.com/click/-2,8423194,1717/\nDate: 2002-10-01T08:34:11+01:00\n\n(St. Petersburg Times)\n\n
## 1131 <NA>
## 1526 \n\n"Craig R.Hughes" said:\n\n> Seems like a good idea. We might get one of two other issues \n> raised tomorrow too once US people get back to work tomorrow and \n> start downloading 2.40 in earnest.\n\nyep, I reckon that's likely.\n\nBTW I'm hearing reports about problems resolving spamassassin.org.\nAnyone else noticing this? if it's serious I'll see if I can get\nMark Reynolds to add a 2ndary in the US, to go with the primaries\nin Oz.\n\n> > - looks like there may be a razor2 issue\n\nI think this is a Razor bug/glitch triggered when file permissions\ndon't allow its own log system to work. At least that's the report\nI heard on the Razor list in the past...\n\nTheo, does it work now that you /dev/null'd the logfile?\n\n> > - version number (says "cvs")\n> > - tag tree as "Rel" this time too\n\nI won't bother tagging with Rel, IMO; I don't think we should\nrely on the version control system inside our code, so I've just\nput a line in Mail/SpamAssassin.pm instead. I will of course\ntag with a release *label* though.\n\n--j.\n
## type
## 2794 spam
## 2022 ham
## 2407 ham
## 2209 ham
## 1131 ham
## 1526 ham
Now that we have our emails loaded and tagged as ham or spam, we can build our corpus and use it to build our term frequency matrix (document term matrix). Using this frequency matrix, we can ‘train’ our Naive Bayes model to predict the ham or spam classification based on specific words that are or are not used. We’re able to do this as we already know which emails are ham or spam based on the tag. We can then test our model using our remaining 10% of emails that we did not pass in to our model for training.
corpus <- VCorpus(x = VectorSource(emails$body))
# Clean corpuses
corpus <- tm_map(x = corpus, FUN = removeNumbers)
corpus <- tm_map(x = corpus, content_transformer(tolower))
corpus <- tm_map(x = corpus, FUN = removePunctuation)
corpus <- tm_map(x = corpus, FUN = removeWords, stopwords())
corpus <- tm_map(x = corpus, FUN = stripWhitespace)
corpus <- tm_map(x = corpus, FUN = stemDocument)
# Create our word frequency per doc matrix
docTermMatrix <- DocumentTermMatrix(x = corpus)
# Split matrix by getting first 90% of matrix for train and last 10% for test
trainData <- docTermMatrix[1:round(nrow(docTermMatrix)*0.9, 0), ]
testData <- docTermMatrix[(round(nrow(docTermMatrix)*0.9, 0)+1):nrow(docTermMatrix), ]
trainLabels <- emails[1:round(nrow(emails)*0.9, 0), ]$type
testLabels <- emails[(round(nrow(emails)*0.9, 0)+1):nrow(docTermMatrix), ]$type
# Test proportions of
prop.table(table(trainLabels))
## trainLabels
## ham spam
## 0.8343648 0.1656352
prop.table(table(testLabels))
## testLabels
## ham spam
## 0.8491803 0.1508197
Our ratios are fairly consistent. Good to go.
emailFreqWords <- findFreqTerms(docTermMatrix, 10)
str(emailFreqWords)
## chr [1:4046] "abandon" "abil" "abl" "ablock" "absolut" "abstract" "abus" ...
trainDataFreq <- trainData[, emailFreqWords]
testDataFreq <- testData[, emailFreqWords]
Now that we have our training and test term matrices, we convert the frequency to a factor of simply whether the doc contains the term or not.
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
trainDataFreq <- apply(trainDataFreq, 2, convert_counts)
testDataFreq <- apply(testDataFreq, 2, convert_counts)
Here we train our model using our training data matrix then test it for accuracy using our test data matrix.
classifier = naiveBayes(trainDataFreq, trainLabels)
sample(classifier$tables,5)
## $fox
## fox
## trainLabels No Yes
## ham 0.993455497 0.006544503
## spam 1.000000000 0.000000000
##
## $hasnt
## hasnt
## trainLabels No Yes
## ham 0.9890925 0.0109075
## spam 1.0000000 0.0000000
##
## $borderatd
## borderatd
## trainLabels No Yes
## ham 1.00000000 0.00000000
## spam 0.96923077 0.03076923
##
## $commit
## commit
## trainLabels No Yes
## ham 0.992582897 0.007417103
## spam 0.984615385 0.015384615
##
## $employ
## employ
## trainLabels No Yes
## ham 0.98778360 0.01221640
## spam 0.96263736 0.03736264
testPredictions <- predict(classifier, testDataFreq)
confusionMatrix(data = testPredictions,
reference = factor(testLabels),
positive = "spam",
dnn = c("Predicted", "Observed"))
## Confusion Matrix and Statistics
##
## Observed
## Predicted ham spam
## ham 259 12
## spam 0 34
##
## Accuracy : 0.9607
## 95% CI : (0.9323, 0.9795)
## No Information Rate : 0.8492
## P-Value [Acc > NIR] : 3.054e-10
##
## Kappa : 0.8279
##
## Mcnemar's Test P-Value : 0.001496
##
## Sensitivity : 0.7391
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9557
## Prevalence : 0.1508
## Detection Rate : 0.1115
## Detection Prevalence : 0.1115
## Balanced Accuracy : 0.8696
##
## 'Positive' Class : spam
##
Our model has an estimated 96.4% accuracy rating in its ability to determine if an email is ham or spam based on term usage.