Overview

In this project I will be using the [Naive] [Bayes] [classifier] machine learning algorithm to make predictions on whether an email is spam or not. IBM, the company I work at, has a lot of education around ML algorithms and I leveraged that here in the project as well as other sources. I’ve also previously worked with a team on something similar (statistical classification) using a TensorFlow model to predict credit card fraud, the public version of which is hosted [here].

First, we’ll need to ingest our data and prepare it; preparing it will involve creating the ‘metadata’ around each email by creating our corpus of terms used in the email, and create our email term frequency matrix. The important part here is that the data we are using is already marked as fraudulent, spam, or not, ham. We then split this data into two groups, one for our model training and the other to test how accurate our trained model is. I usually split these into groups of sized 90/10 or 95/5.

Loading Email Data

I’m going to directly pull the tar files containing our ham and spam emails, parse the contents into data frames, then break them up into our training and testing data.

tmpdir <- tempdir()

base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
spam_url <- paste0(base_url,"20021010_spam.tar.bz2")
ham_url <- paste0(base_url,"20021010_easy_ham.tar.bz2")

spamTar <- basename(spam_url)
hamTar <- basename(ham_url)

if(!file.exists("20021010_spam.tar.bz2"))
  download.file(spam_url, spamTar)
if(!file.exists("20021010_easy_ham.tar.bz2"))
  download.file(ham_url, hamTar)

untar(spamTar, exdir = tmpdir)
untar(hamTar, exdir = tmpdir)
getFileText = function(uri, output){
  text <-readLines(uri)
  # return email text 
  return(paste(text, collapse="\n"))
}

getFrom = function(email, output){
  return(str_extract(email,"(?<=From: ).*?(?=\\n)"))
}

getTo = function(email, output){
  return(str_extract(email,"(?<=To: ).*?(?=\\s)"))
}

getCType = function(email, output){
  return(str_extract(email,"(?<=Content-Type: ).*?(?=\\s)"))
}
  
# Need to convert body text to UTF-8
getBody = function(email, output){
  return(substr(email,str_locate(email,"\\n\\n")[,2],nchar(iconv(email, from = "", to = "UTF8"))))
}
hamFiles <- list.files(path = paste0(tmpdir,"/easy_ham/"), full.names = TRUE)
hamText <- apply(array(hamFiles), 1, getFileText)
spamFiles <- list.files(path = paste0(tmpdir,"/spam/"), full.names = TRUE)
spamText <- apply(array(spamFiles), 1, getFileText)
## Warning in readLines(uri): incomplete final line found on
## 'C:\Users\Nick\AppData\Local\Temp\RtmpaKPeLo/spam/0143.260a940290dcb61f9327b224a368d4af'
from <- apply(array(hamText), 1, getFrom)
to <- apply(array(hamText), 1, getTo)
cType <- apply(array(hamText), 1, getCType)
body <- apply(array(hamText), 1, getBody)

ham <- data.frame(from,
                  to,
                  cType,
                  body) 

from <- apply(array(spamText), 1, getFrom)
to <- apply(array(spamText), 1, getTo)
cType <- apply(array(spamText), 1, getCType)
body <- apply(array(spamText), 1, getBody)

spam <- data.frame(from,
                  to,
                  cType,
                  body) 


unshuffled <- rbind(ham %>% mutate(type = "ham"),
  spam %>% mutate(type = "spam"))

# Shuffle emails to allow us to pull random ham vs spam during test/train separation 
emails <- unshuffled[sample(nrow(unshuffled), nrow(unshuffled)),]

head(emails)
##                                                       from
## 2794                            "IQ  - AFM" <afm@insiq.us>
## 2022                      scripting <rssfeeds@example.com>
## 2407                       guardian <rssfeeds@example.com>
## 2209                           fark <rssfeeds@example.com>
## 1131 Ville =?ISO-8859-1?Q?Skytt=E4?= <ville.skytta@iki.fi>
## 1526                       yyyy@example.com (Justin Mason)
##                                 to                  cType
## 2794     zzzz@localhost.jmason.org multipart/alternative;
## 2022    yyyy@localhost.example.com            text/plain;
## 2407    yyyy@localhost.example.com            text/plain;
## 2209    yyyy@localhost.example.com            text/plain;
## 1131 yyyy@localhost.netnoteinc.com            text/plain;
## 1526              yyyy@example.com                   <NA>
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             body
## 2794 \nThis is a multi-part message in MIME format.\n\n------=_NextPart_000_2505E5_01C254FA.FC480250\nContent-Type: text/plain;\n\tcharset="Windows-1252"\nContent-Transfer-Encoding: quoted-printable\n\n Behind every elite producer...is an elite seminar system!=09\n \t =09\n =09\n =09\n Attention all independent registered representatives:\n Our reps' business has grown dramatically in the last two years.\n What's their hook?\n Our exclusive seminar marketing system for seniors, combined with\nequity indexed and fixed annuities.=09\n \tAdvanced equity indexed annuities and how registered reps\nposition these products in their practice. (The EIA story is out =97 =\nwith\n8 years of great results).\t=20\n \tProven seminar marketing system (Hundreds of our reps are using\nthis system to build their senior client base).\t=20\n \tA simpler approach to increasing sales with our generic\nPowerPoint=AE EIA presentation, and "Should I Stay or Should I Go,"\nIncomizer and Inflationizer sales ideas.\t=20\n \tReal world scenarios that you can use to solve real world\nproblems in today's environment.\t=20\n The bottom line:=0A=\nYour senior clients and prospects want you to call us\ntoday.\n =09\nCall us today for more information!\n 800-880-3072\n- or -\n\nPlease fill out the form below for more information\t=20\nName:\t \t  =09\nE-mail:\t \t=20\nPhone:\t \t=20\nCity:\t \t State:\t \t=20\n \t    =09\n=20\n\n American Financial Marketing, Inc.\nWe don't want anyone to receive our mailings who does not wish to\nreceive them. This is a professional communication sent to insurance\nprofessionals. To be removed from this mailing list, DO NOT REPLY to\nthis message. Instead, go here: http://www.insuranceiq.com/optout\n<http://www.insuranceiq.com/optout/>=20\n\nLegal Notice <http://www.insiq.com/legal.htm>=20\n\n------=_NextPart_000_2505E5_01C254FA.FC480250\nContent-Type: text/html;\n\tcharset="iso-8859-1"\nContent-Transfer-Encoding: quoted-printable\n\n<html>\n<head>\n<title>Behind Every Elite Producer...</title>\n<meta http-equiv=3D"Content-Type" content=3D"text/html; =\ncharset=3Diso-8859-1">\n</head>\n<body bgcolor=3D"#666666" text=3D"#000000">\n<font face=3D"Arial, Helvetica, sans-serif">\n<table width=3D"550" border=3D"1" align=3D"center" cellpadding=3D"0" =\ncellspacing=3D"0" bordercolor=3D"#000000" bgcolor=3D"#FFFFFF">\n  <tr>=20\n    <td>\n\t\t\t<table width=3D"100%" cellpadding=3D"0" cellspacing=3D"0" =\nborder=3D"0" bgcolor=3D"#ffffff">\n        <tr>=20\n          <td><img src=3D"http://iiq.us/images/amfin/20020823-2.gif" =\nalt=3D"Behind every elite producer...is an elite seminar system!" =\nwidth=3D"550" height=3D"134"></td>\n        </tr>\n        <tr>=20\n          <td>=20\n            <table width=3D"100%" border=3D"0" cellspacing=3D"0" =\ncellpadding=3D"0">\n              <tr>=20\n                <td rowspan=3D"3" width=3D"375"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-101.jpg" width=3D"375" =\nheight=3D"270"></td>\n                <td><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-102.gif" width=3D"175" =\nheight=3D"49"></td>\n              </tr>\n              <tr>=20\n                <td><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-103.jpg" width=3D"175" =\nheight=3D"169"></td>\n              </tr>\n              <tr>=20\n                <td><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-104.gif" width=3D"175" =\nheight=3D"52"></td>\n              </tr>\n            </table></td>\n        </tr>\n        <tr>=20\n          <td>\n\t\t\t\t\t\t<table width=3D"100%" border=3D"0" cellspacing=3D"2" =\ncellpadding=3D"2">\n              <tr>=20\n                <td align=3D"center"> <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-301.gif" alt=3D"Attention all =\nindependent registered representatives:" width=3D"540" =\nheight=3D"36"><br>\n                  <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-302.gif" alt=3D"Our reps' =\nbusiness has grown dramatically in the last two years." width=3D"540" =\nheight=3D"26"><br>\n                  <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-303.gif" alt=3D"What's their =\nhook?" width=3D"540" height=3D"63"><br>\n                  <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-304.gif" alt=3D"Our exclusive =\nseminar marketing system for seniors, combined with equity indexed and =\nfixed annuities." width=3D"540" height=3D"53">=20\n                </td>\n              </tr>\n              <tr>=20\n                <td align=3D"center">=20\n                  <table width=3D"100%" border=3D"0" cellspacing=3D"2" =\ncellpadding=3D"2">\n                    <tr>=20\n                      <td width=3D"4%" valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n                      <td><font size=3D"2"><b>Advanced equity indexed =\nannuities</b>=20\n                        and how registered reps position these products =\nin their=20\n                        practice. (The EIA story is out &#8212; with 8 =\nyears of=20\n                        great results).</font></td>\n                    </tr>\n                    <tr>=20\n                      <td valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n                      <td><font size=3D"2"><b>Proven seminar marketing =\nsystem </b>(Hundreds=20\n                        of our reps are using this system to build their =\nsenior=20\n                        client base).</font></td>\n                    </tr>\n                    <tr>=20\n                      <td valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n                      <td><font size=3D"2"><b>A simpler approach to =\nincreasing sales</b>=20\n                        with our generic PowerPoint&reg; EIA =\npresentation, and=20\n                        "Should I Stay or Should I Go," Incomizer=20\n                        and Inflationizer sales ideas.</font></td>\n                    </tr>\n                    <tr>=20\n                      <td valign=3D"top"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-4.gif" width=3D"16" =\nheight=3D"16"></td>\n                      <td><font size=3D"2"><b>Real world scenarios that =\nyou can=20\n                        use to solve real world problems in today's =\nenvironment.</b></font></td>\n                    </tr>\n                  </table>\n                </td>\n              </tr>\n              <tr>=20\n                <td align=3D"center"><img =\nsrc=3D"http://iiq.us/images/amfin/20020823-5.gif" alt=3D"The bottom =\nline:&#10;Your senior clients and prospects want you to call us today." =\nwidth=3D"540" height=3D"59"><br>\n                  <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-99.gif" width=3D"8" =\nheight=3D"8"></td>\n              </tr>\n              <tr>=20\n                <td align=3D"center">=20\n                  <b><font size=3D"2">Call us today for more =\ninformation!<br>\n                    <img =\nsrc=3D"http://iiq.us/images/amfin/20020823-6.gif" alt=3D"800-880-3072" =\nwidth=3D"529" height=3D"90"><br>\n                    - or -</font></b><br>\n                  <table width=3D"95%" border=3D"1" cellpadding=3D"0" =\ncellspacing=3D"0" bordercolor=3D"#000000">\n                    <tr>=20\n                      <form method=3D"post" =\naction=3D"http://65.217.159.103/response/response.asp">\n                        <td>=20\n                          <table width=3D"100%" border=3D"0" =\ncellspacing=3D"0" cellpadding=3D"2" align=3D"center" =\nbgcolor=3D"#cccccc">\n                            <tr>=20\n                              <td colspan=3D"5" align=3D"center" =\nbgcolor=3D"#000066"><b><font color=3D"#FFFFFF" size=3D"2">Please fill =\nout the form below for more information</font></b></td>\n                            </tr>\n                            <tr>=20\n                              <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">Name:</font></b></td>\n                              <td colspan=3D"3"><input type=3D"text" =\nname=3D"contactname" size=3D"50"></td>\n                              <td rowspan=3D"5" valign=3D"middle" =\nalign=3D"center"><img src=3D"http://iiq.us/images/q3.gif" width=3D"43" =\nheight=3D"50"></td>\n                            </tr>\n                            <tr>=20\n                              <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">E-mail:</font></b></td>\n                              <td colspan=3D"3"><input type=3D"text" =\nname=3D"email" size=3D"50"></td>\n                            </tr>\n                            <tr>=20\n                              <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">Phone:</font></b></td>\n                              <td colspan=3D"3"><input type=3D"text" =\nname=3D"phone" size=3D"50"></td>\n                            </tr>\n                            <tr>=20\n                              <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">City:</font></b></td>\n                              <td width=3D"30%"><input type=3D"text" =\nname=3D"city" size=3D"20"></td>\n                              <td width=3D"15%" align=3D"right"><b><font =\nsize=3D"2">State:</font></b></td>\n                              <td width=3D"20%"><input type=3D"text" =\nname=3D"state" size=3D"2"></td>\n                            </tr>\n                            <tr>=20\n                              <td width=3D"15%" =\nalign=3D"right">&nbsp;</td>\n                              <td width=3D"30%" colspan=3D'3'>=20\n                                <input type=3D"submit" =\nname=3D"btnsubmit" value=3D"Submit">\n                                <input type=3D"hidden" =\nname=3D"hdnRecipientTxt" value=3D"brianb@americanfinancialbrkg.com">\n                                <input type=3D"hidden" =\nname=3D"hdnSubjectTxt" value=3D"AFM Ad Inquiry">\n                              </td>\n                            </tr>\n                          </table>\n                        </td>\n                      <input type=3D'hidden' name=3D'SentTo' =\nvalue=3D'304704'>\n</form>\n                    </tr>\n                  </table><br>\n                  <img src=3D"http://iiq.us/images/amfin/20020823-7.gif" =\nalt=3D"American Financial Marketing, Inc." width=3D"400" height=3D"160">\n\t\t\t\t\t\t\t\t</td>\n              </tr>\n            </table>\n\t\t\t\t\t</td>\n        </tr>\n      </table>\n\t\t</td>\n  </tr>\n  <tr>=20\n    <td width=3D"100%" bgcolor=3D"#cccccc" align=3D"center">\n\t\t\t<p><font face=3D"Arial, Helvetica, sans-serif" size=3D"1">We don't =\nwant anyone to receive our mailings who does not=20\n\t\t\twish to receive them. This is a professional communication=20\n\t\t\tsent to insurance professionals. To be removed from this mailing=20\n\t\t\tlist, <b>DO NOT REPLY</b> to this message. Instead, go here: <a =\nhref=3D"http://www.insuranceiq.com/optout/">=20\n\t\t\thttp://www.insuranceiq.com/optout</a></font></p>\n    </td>\n  </tr>\n</table>\n<center>\n  <a href=3D"http://www.insiq.com/legal.htm"><font size=3D"1" =\nface=3D"Arial, Helvetica, sans-serif">Legal Notice </font></a>=20\n</center>\n</font>=20\n</body>\n</html>\n\n------=_NextPart_000_2505E5_01C254FA.FC480250--\n\n
## 2022                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        \nURL: http://scriptingnews.userland.com/backissues/2002/09/25#When:6:25:23AM\nDate: Wed, 25 Sep 2002 13:25:23 GMT\n\nJon Hanna, on the RSS-DEV list, says[1] that RSS, was "not designed to be of \nany particular use to bloggers, aggregators, or metadata providers." This is \nnot true. Half of RSS 0.91 was scriptingNews format[2], which was totally \ndesigned to model a weblog in XML. \n\n[1] http://groups.yahoo.com/group/rss-dev/message/4023\n[2] http://my.userland.com/stories/storyReader$11\n\n
## 2407                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <NA>
## 2209                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      \nURL: http://www.newsisfree.com/click/-2,8423194,1717/\nDate: 2002-10-01T08:34:11+01:00\n\n(St. Petersburg Times)\n\n
## 1131                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <NA>
## 1526                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        \n\n"Craig R.Hughes" said:\n\n> Seems like a good idea.  We might get one of two other issues \n> raised tomorrow too once US people get back to work tomorrow and \n> start downloading 2.40 in earnest.\n\nyep, I reckon that's likely.\n\nBTW I'm hearing reports about problems resolving spamassassin.org.\nAnyone else noticing this?  if it's serious I'll see if I can get\nMark Reynolds to add a 2ndary in the US, to go with the primaries\nin Oz.\n\n> >    - looks like there may be a razor2 issue\n\nI think this is a Razor bug/glitch triggered when file permissions\ndon't allow its own log system to work.  At least that's the report\nI heard on the Razor list in the past...\n\nTheo, does it work now that you /dev/null'd the logfile?\n\n> >    - version number (says "cvs")\n> >    - tag tree as "Rel" this time too\n\nI won't bother tagging with Rel, IMO; I don't think we should\nrely on the version control system inside our code, so I've just\nput a line in Mail/SpamAssassin.pm instead.   I will of course\ntag with a release *label* though.\n\n--j.\n
##      type
## 2794 spam
## 2022  ham
## 2407  ham
## 2209  ham
## 1131  ham
## 1526  ham

Model Input Prep

Now that we have our emails loaded and tagged as ham or spam, we can build our corpus and use it to build our term frequency matrix (document term matrix). Using this frequency matrix, we can ‘train’ our Naive Bayes model to predict the ham or spam classification based on specific words that are or are not used. We’re able to do this as we already know which emails are ham or spam based on the tag. We can then test our model using our remaining 10% of emails that we did not pass in to our model for training.

corpus <- VCorpus(x = VectorSource(emails$body))

# Clean corpuses
corpus <- tm_map(x = corpus, FUN = removeNumbers)
corpus <- tm_map(x = corpus, content_transformer(tolower))
corpus <- tm_map(x = corpus, FUN = removePunctuation)
corpus <- tm_map(x = corpus, FUN = removeWords, stopwords())
corpus <- tm_map(x = corpus, FUN = stripWhitespace)
corpus <- tm_map(x = corpus, FUN = stemDocument)
# Create our word frequency per doc matrix
docTermMatrix <- DocumentTermMatrix(x = corpus)
# Split matrix by getting first 90% of matrix for train and last 10% for test
trainData <- docTermMatrix[1:round(nrow(docTermMatrix)*0.9, 0), ]
testData <- docTermMatrix[(round(nrow(docTermMatrix)*0.9, 0)+1):nrow(docTermMatrix), ]

trainLabels <- emails[1:round(nrow(emails)*0.9, 0), ]$type
testLabels <- emails[(round(nrow(emails)*0.9, 0)+1):nrow(docTermMatrix), ]$type

# Test proportions of 
prop.table(table(trainLabels))
## trainLabels
##       ham      spam 
## 0.8343648 0.1656352
prop.table(table(testLabels))
## testLabels
##       ham      spam 
## 0.8491803 0.1508197

Our ratios are fairly consistent. Good to go.

emailFreqWords <- findFreqTerms(docTermMatrix, 10)
str(emailFreqWords)
##  chr [1:4046] "abandon" "abil" "abl" "ablock" "absolut" "abstract" "abus" ...
trainDataFreq <- trainData[, emailFreqWords]
testDataFreq <- testData[, emailFreqWords]

Now that we have our training and test term matrices, we convert the frequency to a factor of simply whether the doc contains the term or not.

convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
}
trainDataFreq <- apply(trainDataFreq, 2, convert_counts)
testDataFreq <- apply(testDataFreq, 2, convert_counts)

Train Model and Test Accuracy

Here we train our model using our training data matrix then test it for accuracy using our test data matrix.

classifier = naiveBayes(trainDataFreq, trainLabels)
sample(classifier$tables,5)
## $fox
##            fox
## trainLabels          No         Yes
##        ham  0.993455497 0.006544503
##        spam 1.000000000 0.000000000
## 
## $hasnt
##            hasnt
## trainLabels        No       Yes
##        ham  0.9890925 0.0109075
##        spam 1.0000000 0.0000000
## 
## $borderatd
##            borderatd
## trainLabels         No        Yes
##        ham  1.00000000 0.00000000
##        spam 0.96923077 0.03076923
## 
## $commit
##            commit
## trainLabels          No         Yes
##        ham  0.992582897 0.007417103
##        spam 0.984615385 0.015384615
## 
## $employ
##            employ
## trainLabels         No        Yes
##        ham  0.98778360 0.01221640
##        spam 0.96263736 0.03736264
testPredictions <- predict(classifier, testDataFreq)
confusionMatrix(data = testPredictions, 
 reference = factor(testLabels),
 positive = "spam",
 dnn = c("Predicted", "Observed"))
## Confusion Matrix and Statistics
## 
##          Observed
## Predicted ham spam
##      ham  259   12
##      spam   0   34
##                                           
##                Accuracy : 0.9607          
##                  95% CI : (0.9323, 0.9795)
##     No Information Rate : 0.8492          
##     P-Value [Acc > NIR] : 3.054e-10       
##                                           
##                   Kappa : 0.8279          
##                                           
##  Mcnemar's Test P-Value : 0.001496        
##                                           
##             Sensitivity : 0.7391          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9557          
##              Prevalence : 0.1508          
##          Detection Rate : 0.1115          
##    Detection Prevalence : 0.1115          
##       Balanced Accuracy : 0.8696          
##                                           
##        'Positive' Class : spam            
## 

Our model has an estimated 96.4% accuracy rating in its ability to determine if an email is ham or spam based on term usage.