No one likes spam email. That is why our email services have built-in spam filters that detect spam and send it into a deep, dark folder where no one can hear it scream. Let’s make a spam detector.

The first task will be to load sets of spam and ham (non-spam) messages into a dataframe.

We will begin with two separate folders, one containing spam and one containing ham (spamdir and hamdir). For practicality’s sake, the data will be loaded from a local directory, so the paths will differ depending on where you have downloaded the data. The loading step reads all of the messages into a dataframe and labels each one as spam or not; the first row of the result is shown below.
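A minimal sketch of this loading step, assuming each message is stored as its own plain-text file. `load_messages` is a hypothetical helper, and the demo writes two tiny made-up files to temporary folders that stand in for `spamdir` and `hamdir`; point those two variables at your own download instead.

```r
# Hypothetical helper: read every file in `dir` into one row of a data frame.
load_messages <- function(dir, is_spam) {
  files <- list.files(dir, full.names = TRUE)
  body <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(body = body, is_spam = rep(is_spam, length(body)),
             stringsAsFactors = FALSE)
}

# Throwaway folders standing in for the real spamdir and hamdir.
spamdir <- file.path(tempdir(), "spam_demo")
hamdir  <- file.path(tempdir(), "ham_demo")
dir.create(spamdir, showWarnings = FALSE)
dir.create(hamdir, showWarnings = FALSE)
writeLines("Subject: Save up to 70% on Life Insurance", file.path(spamdir, "0001.txt"))
writeLines("Subject: Notes from today's meeting", file.path(hamdir, "0001.txt"))

# One data frame with the message text and a logical spam label.
emails <- rbind(load_messages(spamdir, TRUE), load_messages(hamdir, FALSE))
```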

##   body
## 1 From 12a1mailbot1@web.de  Thu Aug 22 13:17:22 2002\nReturn-Path: <12a1mailbot1@web.de>\nDelivered-To: zzzz@localhost.example.com\nReceived: from localhost (localhost [127.0.0.1])\n\tby phobos.labs.example.com (Postfix) with ESMTP id 136B943C32\n\tfor <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)\nReceived: from mail.webnote.net [193.120.211.219]\n\tby localhost with POP3 (fetchmail-5.9.0)\n\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)\nReceived: from dd_it7 ([210.97.77.167])\n\tby webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623\n\tfor <zzzz@example.com>; Thu, 22 Aug 2002 13:09:41 +0100\nFrom: 12a1mailbot1@web.de\nReceived: from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);\n\t Sat, 24 Aug 2002 09:42:10 +0900\nTo: <dcek1a1@netsgo.com>\nSubject: Life Insurance - Why Pay More?\nDate: Wed, 21 Aug 2002 20:31:57 -1600\nMIME-Version: 1.0\nMessage-ID: <0103c1042001882DD_IT7@dd_it7>\nContent-Type: text/html; charset="iso-8859-1"\nContent-Transfer-Encoding: quoted-printable\n\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<HTML><HEAD>\n<META content=3D"text/html; charset=3Dwindows-1252" http-equiv=3DContent-T=\nype>\n<META content=3D"MSHTML 5.00.2314.1000" name=3DGENERATOR></HEAD>\n<BODY><!-- Inserted by Calypso -->\n<TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r=\nules=3Dnone \nstyle=3D"COLOR: black; DISPLAY: none" width=3D"100%">\n  <TBODY>\n  <TR>\n    <TD colSpan=3D3>\n      <HR color=3Dblack noShade SIZE=3D1>\n    </TD></TR></TD></TR>\n  <TR>\n    <TD colSpan=3D3>\n      <HR color=3Dblack noShade SIZE=3D1>\n    </TD></TR></TBODY></TABLE><!-- End Calypso --><!-- Inserted by Calypso=\n --><FONT \ncolor=3D#000000 face=3DVERDANA,ARIAL,HELVETICA size=3D-2><BR></FONT></TD><=\n/TR></TABLE><!-- End Calypso --><FONT color=3D#ff0000 \nface=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">\n<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT 
color=3D#ff=\n0000 \nface=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">\n<CENTER>Why Spend More Than You Have To?\n<CENTER><FONT color=3D#ff0000 face=3D"Copperplate Gothic Bold" size=3D5 PT=\nSIZE=3D"10">\n<CENTER>Life Quote Savings\n<CENTER>\n<P align=3Dleft></P>\n<P align=3Dleft></P></FONT></U></I></B><BR></FONT></U></B></U></I>\n<P></P>\n<CENTER>\n<TABLE border=3D0 borderColor=3D#111111 cellPadding=3D0 cellSpacing=3D0 wi=\ndth=3D650>\n  <TBODY></TBODY></TABLE>\n<TABLE border=3D0 borderColor=3D#111111 cellPadding=3D5 cellSpacing=3D0 wi=\ndth=3D650>\n  <TBODY>\n  <TR>\n    <TD colSpan=3D2 width=3D"35%"><B><FONT face=3DVerdana size=3D4>Ensurin=\ng your \n      family's financial security is very important. Life Quote Savings ma=\nkes \n      buying life insurance simple and affordable. We Provide FREE Access =\nto The \n      Very Best Companies and The Lowest Rates.</FONT></B></TD></TR>\n  <TR>\n    <TD align=3Dmiddle vAlign=3Dtop width=3D"18%">\n      <TABLE borderColor=3D#111111 width=3D"100%">\n        <TBODY>\n        <TR>\n          <TD style=3D"PADDING-LEFT: 5px; PADDING-RIGHT: 5px" width=3D"100=\n%"><FONT \n            face=3DVerdana size=3D4><B>Life Quote Savings</B> is FAST, EAS=\nY and \n            SAVES you money! Let us help you get started with the best val=\nues in \n            the country on new coverage. You can SAVE hundreds or even tho=\nusands \n            of dollars by requesting a FREE quote from Lifequote Savings. =\nOur \n            service will take you less than 5 minutes to complete. Shop an=\nd \n            compare. SAVE up to 70% on all types of Life insurance! 
\n</FONT></TD></TR>\n        <TR><BR><BR>\n          <TD height=3D50 style=3D"PADDING-LEFT: 5px; PADDING-RIGHT: 5px" \n          width=3D"100%">\n            <P align=3Dcenter><B><FONT face=3DVerdana size=3D5><A \n            href=3D"http://website.e365.cc/savequote/">Click Here For Your=\n \n            Free Quote!</A></FONT></B></P></TD>\n          <P><FONT face=3DVerdana size=3D4><STRONG>\n          <CENTER>Protecting your family is the best investment you'll eve=\nr \n          make!<BR></B></TD></TR>\n        <TR><BR><BR></STRONG></FONT></TD></TR></TD></TR>\n        <TR></TR></TBODY></TABLE>\n      <P align=3Dleft><FONT face=3D"Arial, Helvetica, sans-serif" size=3D2=\n></FONT></P>\n      <P></P>\n      <CENTER><BR><BR><BR>\n      <P></P>\n      <P align=3Dleft><BR></B><BR><BR><BR><BR></P>\n      <P align=3Dcenter><BR></P>\n      <P align=3Dleft><BR></B><BR><BR></FONT>If you are in receipt of this=\n email \n      in error and/or wish to be removed from our list, <A \n      href=3D"mailto:coins@btamail.net.cn">PLEASE CLICK HERE</A> AND TYPE =\nREMOVE. If you \n      reside in any state which prohibits e-mail solicitations for insuran=\nce, \n      please disregard this \n      email.<BR></FONT><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR=\n><BR><BR><BR></FONT></P></CENTER></CENTER></TR></TBODY></TABLE></CENTER></=\nCENTER></CENTER></CENTER></CENTER></BODY></HTML>\n\n\n
##   is_spam
## 1    TRUE

Now that we’ve got our data loaded, we need to separate out the words. For this, we’ll create a document-term matrix.

This is basically a count of which words occur in each email and how many times each one occurs.
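A sketch of how such a matrix might be built with the `tm` package, applying the same transformations, in the same order, that are named in the warnings that follow. The two messages are made up for illustration. The warnings mention `tm_map.SimpleCorpus`, so the original presumably built its corpus differently; this sketch uses `VCorpus`, which is an assumption on my part.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

# Toy messages standing in for the `body` column of the email data frame.
bodies <- c("Free money now!!! Claim your FREE quote today",
            "Meeting moved to 3pm, see the agenda attached")

corpus <- VCorpus(VectorSource(bodies))
corpus <- tm_map(corpus, content_transformer(tolower))         # lower-case everything
corpus <- tm_map(corpus, removeWords, stopwords(kind = "en"))  # drop common English words
corpus <- tm_map(corpus, stemDocument)                         # reduce words to their stems
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

# Rows are documents, columns are terms, cells are counts.
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)
m[, "free"]
```

The last line shows the count of the stem "free" in each message, which is the kind of signal the model can later use.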

## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords(kind = "en")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., stemDocument): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents

Going forward, we’ll need two different datasets: one for training our model, and one for testing it out afterwards.

So the data we have is divided into a training set and a test set.
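One simple way to make such a split is a random 80/20 partition, which roughly matches the set sizes that appear in the model output later on (2,403 training and 598 test messages). The sketch below uses made-up data and an arbitrary seed; the label is placed in the first column because the model call later refers to `train[, 1]` and `train[, -1]`.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy stand-in for the real data: label first, then term-count columns.
emails <- data.frame(
  is_spam = factor(rep(c(TRUE, FALSE), times = c(20, 80))),
  free    = rpois(100, 1),
  meet    = rpois(100, 1)
)

# Draw 80% of the row indices for training; the rest become the test set.
train_idx <- sample(nrow(emails), size = floor(0.8 * nrow(emails)))
train <- emails[train_idx, ]
test  <- emails[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

A plain random split does not guarantee the same share of spam in both sets; a stratified split is an alternative if that matters.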

Now that we know which terms occur in each document, let’s see what some common terms are.
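Finding the common terms amounts to summing each term's column of the document-term matrix, separately for spam and ham. A small sketch with made-up counts, which only computes the totals and leaves plotting aside:

```r
# Toy document-term counts: rows are emails, columns are terms (made-up values).
counts <- matrix(
  c(3, 0, 1,
    2, 1, 0,
    0, 4, 1,
    0, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("free", "list", "email"))
)
is_spam <- c(TRUE, TRUE, FALSE, FALSE)

# Total occurrences of each term within each class, most frequent first.
spam_totals <- sort(colSums(counts[is_spam, , drop = FALSE]), decreasing = TRUE)
ham_totals  <- sort(colSums(counts[!is_spam, , drop = FALSE]), decreasing = TRUE)

head(spam_totals, 2)
head(ham_totals, 2)
```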

## Warning: Use of `df$terms` is discouraged. Use `terms` instead.
## Warning: Use of `df$value` is discouraged. Use `value` instead.

## Warning: Use of `df$value` is discouraged. Use `value` instead.

It does seem like many of the most common terms are present in both types of email, although at different frequencies. Let’s see if that’s enough for the model to distinguish between the two.

We’re going to use a random forest for the model. In short, a random forest combines many decision trees, each trained on a random sample of the data, into a model that is usually more accurate than any single tree.
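A sketch of the fitting step on made-up data, using the same `randomForest()` call that appears in the output that follows (`x = train[, -1]`, `y = train[, 1]`, `ntree = 200`). The toy columns `free` and `meet` and the seed are assumptions for illustration only.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for reproducibility

# Toy training data: the label sits in column 1, the term counts follow.
n <- 200
is_spam <- factor(rep(c(TRUE, FALSE), each = n / 2))
train <- data.frame(
  is_spam = is_spam,
  free    = rpois(n, ifelse(is_spam == "TRUE", 3, 0.2)),
  meet    = rpois(n, ifelse(is_spam == "TRUE", 0.2, 2))
)

# y must be a factor for randomForest to do classification, not regression.
model <- randomForest(x = train[, -1], y = train[, 1], ntree = 200)
print(model)

# Predict labels for new messages that have the same term columns.
new_msgs <- data.frame(free = c(5, 0), meet = c(0, 3))
predict(model, new_msgs)
```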

## 
## Call:
##  randomForest(x = train[, -1], y = train[, 1], ntree = 200) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 44
## 
##         OOB estimate of  error rate: 0.12%
## Confusion matrix:
##       FALSE TRUE  class.error
## FALSE  2001    2 0.0009985022
## TRUE      1  399 0.0025000000

Getting a model this way is easy enough. The fitted model reports an out-of-bag (OOB) error estimate of 0.12% (3 of the 2,403 training messages misclassified), but let’s see what the error actually is when we use the test data we set aside earlier.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   496    0
##      TRUE      2  100
##                                          
##                Accuracy : 0.9967         
##                  95% CI : (0.988, 0.9996)
##     No Information Rate : 0.8328         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9881         
##                                          
##  Mcnemar's Test P-Value : 0.4795         
##                                          
##             Sensitivity : 0.9960         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9804         
##              Prevalence : 0.8328         
##          Detection Rate : 0.8294         
##    Detection Prevalence : 0.8294         
##       Balanced Accuracy : 0.9980         
##                                          
##        'Positive' Class : FALSE          
## 

It looks like the model works pretty well. On the test dataset, the model was 99.67% accurate, mislabeling only two pieces of ham as spam out of 598 messages.
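The headline figures can be checked by hand from the four counts in the test-set confusion matrix. Note that the output lists `FALSE` (ham) as the "positive" class, so sensitivity here describes how much ham was correctly kept and specificity how much spam was caught.

```r
# Counts copied from the test-set confusion matrix above.
# Rows are predictions, columns are the true (reference) labels.
cm <- matrix(
  c(496,   0,
      2, 100),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("FALSE", "TRUE"),
                  Reference  = c("FALSE", "TRUE"))
)

accuracy    <- sum(diag(cm)) / sum(cm)                    # (496 + 100) / 598
sensitivity <- cm["FALSE", "FALSE"] / sum(cm[, "FALSE"])  # ham kept: 496 / 498
specificity <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])     # spam caught: 100 / 100

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```

These round to 0.9967, 0.9960 and 1.0000, matching the reported statistics.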