Data 607 – Project 4

Ben Horvath

November 4, 2018

Load libraries:

Our purpose is to take two directories of e-mails, one containing spam, the other containing ham, and develop a model to predict whether e-mails are spam or ham.

After attempting to parse the e-mails to get rid of the header data, I will use TF-IDF scores to create a feature set, split the data into train and test sets (75/25), train a Naive Bayes model, and then use accuracy, precision, recall, and F1 score to evaluate the model.
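The four evaluation measures can be computed directly from the counts of true and false positives and negatives. The following is a minimal sketch, not the code used in this project: `classification_metrics()` is a hypothetical helper, the label vectors are made-up toy data, and treating "spam" as the positive class is an assumption.

```r
# Hypothetical helper: classification metrics with "spam" as the positive class.
classification_metrics <- function(actual, predicted, positive = "spam") {
  tp <- sum(predicted == positive & actual == positive)  # spam caught
  fp <- sum(predicted == positive & actual != positive)  # ham flagged as spam
  fn <- sum(predicted != positive & actual == positive)  # spam missed
  tn <- sum(predicted != positive & actual != positive)  # ham passed through

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)

  c(accuracy  = (tp + tn) / length(actual),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

# Toy labels, invented purely for illustration
actual    <- c("spam", "spam", "spam", "ham", "ham", "ham", "ham",  "ham")
predicted <- c("spam", "spam", "ham",  "ham", "ham", "ham", "spam", "ham")

metrics <- classification_metrics(actual, predicted)
print(round(metrics, 3))
```

Precision and recall are reported alongside accuracy because the two kinds of error are not equally costly in a spam filter: a ham message flagged as spam is usually worse than a spam message that slips through.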

Parsing Raw E-mails

The goal of this section is to develop a function to parse individual e-mails.

First, let’s look at one of them:

##   [1] "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002"                                                        
##   [2] "Return-Path: <exmh-workers-admin@spamassassin.taint.org>"                                                            
##   [3] "Delivered-To: zzzz@localhost.netnoteinc.com"                                                                         
##   [4] "Received: from localhost (localhost [127.0.0.1])"                                                                    
##   [5] "\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id D03E543C36"                                                  
##   [6] "\tfor <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)"                                                       
##   [7] "Received: from phobos [127.0.0.1]"                                                                                   
##   [8] "\tby localhost with IMAP (fetchmail-5.9.0)"                                                                          
##   [9] "\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 12:36:16 +0100 (IST)"                                           
##  [10] "Received: from listman.spamassassin.taint.org (listman.spamassassin.taint.org [66.187.233.211]) by"                  
##  [11] "    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MBYrZ04811 for"                                              
##  [12] "    <zzzz-exmh@spamassassin.taint.org>; Thu, 22 Aug 2002 12:34:53 +0100"                                             
##  [13] "Received: from listman.spamassassin.taint.org (localhost.localdomain [127.0.0.1]) by"                                
##  [14] "    listman.redhat.com (Postfix) with ESMTP id 8386540858; Thu, 22 Aug 2002"                                         
##  [15] "    07:35:02 -0400 (EDT)"                                                                                            
##  [16] "Delivered-To: exmh-workers@listman.spamassassin.taint.org"                                                           
##  [17] "Received: from int-mx1.corp.spamassassin.taint.org (int-mx1.corp.spamassassin.taint.org"                             
##  [18] "    [172.16.52.254]) by listman.redhat.com (Postfix) with ESMTP id 10CF8406D7"                                       
##  [19] "    for <exmh-workers@listman.redhat.com>; Thu, 22 Aug 2002 07:34:10 -0400"                                          
##  [20] "    (EDT)"                                                                                                           
##  [21] "Received: (from mail@localhost) by int-mx1.corp.spamassassin.taint.org (8.11.6/8.11.6)"                              
##  [22] "    id g7MBY7g11259 for exmh-workers@listman.redhat.com; Thu, 22 Aug 2002"                                           
##  [23] "    07:34:07 -0400"                                                                                                  
##  [24] "Received: from mx1.spamassassin.taint.org (mx1.spamassassin.taint.org [172.16.48.31]) by"                            
##  [25] "    int-mx1.corp.redhat.com (8.11.6/8.11.6) with SMTP id g7MBY7Y11255 for"                                           
##  [26] "    <exmh-workers@redhat.com>; Thu, 22 Aug 2002 07:34:07 -0400"                                                      
##  [27] "Received: from ratree.psu.ac.th ([202.28.97.6]) by mx1.spamassassin.taint.org"                                       
##  [28] "    (8.11.6/8.11.6) with SMTP id g7MBIhl25223 for <exmh-workers@redhat.com>;"                                        
##  [29] "    Thu, 22 Aug 2002 07:18:55 -0400"                                                                                 
##  [30] "Received: from delta.cs.mu.OZ.AU (delta.coe.psu.ac.th [172.30.0.98]) by"                                             
##  [31] "    ratree.psu.ac.th (8.11.6/8.11.6) with ESMTP id g7MBWel29762;"                                                    
##  [32] "    Thu, 22 Aug 2002 18:32:40 +0700 (ICT)"                                                                           
##  [33] "Received: from munnari.OZ.AU (localhost [127.0.0.1]) by delta.cs.mu.OZ.AU"                                           
##  [34] "    (8.11.6/8.11.6) with ESMTP id g7MBQPW13260; Thu, 22 Aug 2002 18:26:25"                                           
##  [35] "    +0700 (ICT)"                                                                                                     
##  [36] "From: Robert Elz <kre@munnari.OZ.AU>"                                                                                
##  [37] "To: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>"                                                      
##  [38] "Cc: exmh-workers@spamassassin.taint.org"                                                                             
##  [39] "Subject: Re: New Sequences Window"                                                                                   
##  [40] "In-Reply-To: <1029945287.4797.TMDA@deepeddy.vircio.com>"                                                             
##  [41] "References: <1029945287.4797.TMDA@deepeddy.vircio.com>"                                                              
##  [42] "    <1029882468.3116.TMDA@deepeddy.vircio.com> <9627.1029933001@munnari.OZ.AU>"                                      
##  [43] "    <1029943066.26919.TMDA@deepeddy.vircio.com>"                                                                     
##  [44] "    <1029944441.398.TMDA@deepeddy.vircio.com>"                                                                       
##  [45] "MIME-Version: 1.0"                                                                                                   
##  [46] "Content-Type: text/plain; charset=us-ascii"                                                                          
##  [47] "Message-Id: <13258.1030015585@munnari.OZ.AU>"                                                                        
##  [48] "X-Loop: exmh-workers@spamassassin.taint.org"                                                                         
##  [49] "Sender: exmh-workers-admin@spamassassin.taint.org"                                                                   
##  [50] "Errors-To: exmh-workers-admin@spamassassin.taint.org"                                                                
##  [51] "X-Beenthere: exmh-workers@spamassassin.taint.org"                                                                    
##  [52] "X-Mailman-Version: 2.0.1"                                                                                            
##  [53] "Precedence: bulk"                                                                                                    
##  [54] "List-Help: <mailto:exmh-workers-request@spamassassin.taint.org?subject=help>"                                        
##  [55] "List-Post: <mailto:exmh-workers@spamassassin.taint.org>"                                                             
##  [56] "List-Subscribe: <https://listman.spamassassin.taint.org/mailman/listinfo/exmh-workers>,"                             
##  [57] "    <mailto:exmh-workers-request@redhat.com?subject=subscribe>"                                                      
##  [58] "List-Id: Discussion list for EXMH developers <exmh-workers.spamassassin.taint.org>"                                  
##  [59] "List-Unsubscribe: <https://listman.spamassassin.taint.org/mailman/listinfo/exmh-workers>,"                           
##  [60] "    <mailto:exmh-workers-request@redhat.com?subject=unsubscribe>"                                                    
##  [61] "List-Archive: <https://listman.spamassassin.taint.org/mailman/private/exmh-workers/>"                                
##  [62] "Date: Thu, 22 Aug 2002 18:26:25 +0700"                                                                               
##  [63] ""                                                                                                                    
##  [64] "    Date:        Wed, 21 Aug 2002 10:54:46 -0500"                                                                    
##  [65] "    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>"                                         
##  [66] "    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>"                                                         
##  [67] ""                                                                                                                    
##  [68] ""                                                                                                                    
##  [69] "  | I can't reproduce this error."                                                                                   
##  [70] ""                                                                                                                    
##  [71] "For me it is very repeatable... (like every time, without fail)."                                                    
##  [72] ""                                                                                                                    
##  [73] "This is the debug log of the pick happening ..."                                                                     
##  [74] ""                                                                                                                    
##  [75] "18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}"
##  [76] "18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury"            
##  [77] "18:19:04 Ftoc_PickMsgs {{1 hit}}"                                                                                    
##  [78] "18:19:04 Marking 1 hits"                                                                                             
##  [79] "18:19:04 tkerror: syntax error in expression \"int ..."                                                              
##  [80] ""                                                                                                                    
##  [81] "Note, if I run the pick command by hand ..."                                                                         
##  [82] ""                                                                                                                    
##  [83] "delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury"                  
##  [84] "1 hit"                                                                                                               
##  [85] ""                                                                                                                    
##  [86] "That's where the \"1 hit\" comes from (obviously).  The version of nmh I'm"                                          
##  [87] "using is ..."                                                                                                        
##  [88] ""                                                                                                                    
##  [89] "delta$ pick -version"                                                                                                
##  [90] "pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 ICT 2002]"                                 
##  [91] ""                                                                                                                    
##  [92] "And the relevant part of my .mh_profile ..."                                                                         
##  [93] ""                                                                                                                    
##  [94] "delta$ mhparam pick"                                                                                                 
##  [95] "-seq sel -list"                                                                                                      
##  [96] ""                                                                                                                    
##  [97] ""                                                                                                                    
##  [98] "Since the pick command works, the sequence (actually, both of them, the"                                             
##  [99] "one that's explicit on the command line, from the search popup, and the"                                             
## [100] "one that comes from .mh_profile) do get created."                                                                    
## [101] ""                                                                                                                    
## [102] "kre"                                                                                                                 
## [103] ""                                                                                                                    
## [104] "ps: this is still using the version of the code form a day ago, I haven't"                                           
## [105] "been able to reach the cvs repository today (local routing issue I think)."                                          
## [106] ""                                                                                                                    
## [107] ""                                                                                                                    
## [108] ""                                                                                                                    
## [109] "_______________________________________________"                                                                     
## [110] "Exmh-workers mailing list"                                                                                           
## [111] "Exmh-workers@redhat.com"                                                                                             
## [112] "https://listman.redhat.com/mailman/listinfo/exmh-workers"                                                            
## [113] ""

There are many fields containing metadata about the e-mail, known as header data. Some of these fields might be useful for a classifier, but I will not explore them here. Rather, I will try to eliminate them all so they don’t confuse the classifier. We can use regular expressions and the stringr package to filter most of them out.

Ideally, each header field would correspond to a single line. However, some of these lines are broken by ‘\n\t’ as well as ‘\n +’ sequences. Convert those into single spaces so that each header field sits entirely on its own single line:
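A minimal sketch of this unfolding step, written in base R so it has no dependencies (stringr's `str_replace_all()` would work the same way). The `raw` string is a toy message assembled for illustration, and the exact pattern is an assumption based on the two sequences described above.

```r
# Toy raw e-mail text with two folded header fields, invented for illustration
raw <- paste0(
  "Received: from localhost (localhost [127.0.0.1])\n",
  "\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id D03E543C36\n",
  "References: <a@example.com>\n",
  "    <b@example.com>\n",
  "Subject: Re: New Sequences Window\n"
)

# A newline followed by a tab, or by one or more spaces, marks a continuation
# of the previous line; replace the whole sequence with a single space.
unfolded <- gsub("\n(\t| +)", " ", raw)

cat(unfolded)
```

Note that the pattern cannot tell headers from body text, so any body line that begins with whitespace is joined to the line before it as well.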

Next, enumerate the group of header fields to remove:
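One way to do this is to keep the field names in a character vector and collapse them into a single regular expression. This is only a sketch under stated assumptions: the names in `drop_fields` are an illustrative subset taken from the sample message above, not the actual list used in this project, and base R's `grepl()` stands in for the stringr equivalent.

```r
# Illustrative subset of header field names seen in the sample message
# (an assumption; the full list used in the project is not reproduced here)
drop_fields <- c("Return-Path", "Delivered-To", "Received", "In-Reply-To",
                 "References", "MIME-Version", "Content-Type", "X-Loop",
                 "Sender", "Errors-To", "X-Beenthere", "X-Mailman-Version",
                 "Precedence", "List-[A-Za-z]+")

# One pattern matching any line that starts with one of those names and a colon
drop_pattern <- paste0("^(", paste(drop_fields, collapse = "|"), "):")

# Toy message lines (already unfolded), invented for illustration
lines <- c("Return-Path: <exmh-workers-admin@spamassassin.taint.org>",
           "Received: from localhost (localhost [127.0.0.1])",
           "From: Robert Elz <kre@munnari.OZ.AU>",
           "Subject: Re: New Sequences Window",
           "Precedence: bulk",
           "List-Help: <mailto:exmh-workers-request@spamassassin.taint.org?subject=help>",
           "",
           "For me it is very repeatable...")

# Header names vary in capitalisation across mailers, so match case-insensitively
kept <- lines[!grepl(drop_pattern, lines, ignore.case = TRUE)]
print(kept)
```

Anchoring the pattern with `^` and requiring the trailing colon keeps it from deleting body lines that merely mention a word such as "Received".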

We might want to use some of the more familiar header data, so let’s extract them and then remove them from the message text:
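The extraction can be sketched as follows. `get_field()` is a hypothetical helper written in base R, not the code behind `parse_email()`, and the sample lines are a shortened version of the message shown earlier.

```r
# Hypothetical helper: value of the first "<Name>: value" line, or NA if absent
get_field <- function(lines, name) {
  pattern <- paste0("^", name, ":[[:space:]]*")
  hit <- grep(pattern, lines)
  if (length(hit) == 0) return(NA_character_)
  sub(pattern, "", lines[hit[1]])
}

# Shortened, already-unfolded version of the sample message
lines <- c("From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002",
           "From: Robert Elz <kre@munnari.OZ.AU>",
           "To: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>",
           "Subject: Re: New Sequences Window",
           "Date: Thu, 22 Aug 2002 18:26:25 +0700",
           "",
           "For me it is very repeatable... (like every time, without fail).")

# Map the list names to the header field names they come from
wanted <- c(from = "From", to = "To", cc = "Cc", subj = "Subject", date = "Date")
fields <- lapply(wanted, function(name) get_field(lines, name))

# Drop the extracted header lines so that only the remaining text forms the body
is_header <- grepl(paste0("^(", paste(wanted, collapse = "|"), "):"), lines)
fields$body <- paste(lines[!is_header], collapse = "\n")

str(fields)
```

Because the pattern requires a colon straight after the field name, the envelope line beginning with `From ` (no colon) is not treated as a header and stays in the body. A body line that happens to start with `Date:` or `From:` would be dropped, which is one reason an approach like this is only approximate.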

Here is the final product, the parse_email() function:

## $id
## [1] "<1029945287.4797.TMDA@deepeddy.vircio.com>"
## 
## $from
## [1] "Robert Elz <kre@munnari.OZ.AU>"
## 
## $to
## [1] "Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>"
## 
## $cc
## [1] "exmh-workers@spamassassin.taint.org"
## 
## $subj
## [1] "Re: New Sequences Window"
## 
## $date
## [1] "Thu, 22 Aug 2002 18:26:25 +0700"
## 
## $body
## [1] "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nFrom: Robert Elz <kre@munnari.OZ.AU>\nTo: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>\nCc: exmh-workers@spamassassin.taint.org\nSubject: Re: New Sequences Window\nDate: Thu, 22 Aug 2002 18:26:25 +0700\n Date:        Wed, 21 Aug 2002 10:54:46 -0500 From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com> Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>\n\n | I can't reproduce this error.\n\nFor me it is very repeatable... (like every time, without fail).\n\nThis is the debug log of the pick happening ...\n\n18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}\n18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury\n18:19:04 Ftoc_PickMsgs {{1 hit}}\n18:19:04 Marking 1 hits\n18:19:04 tkerror: syntax error in expression \"int ...\n\nNote, if I run the pick command by hand ...\n\ndelta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury\n1 hit\n\nThat's where the \"1 hit\" comes from (obviously).  The version of nmh I'm\nusing is ...\n\ndelta$ pick -version\npick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 ICT 2002]\n\nAnd the relevant part of my .mh_profile ...\n\ndelta$ mhparam pick\n-seq sel -list\n\n\nSince the pick command works, the sequence (actually, both of them, the\none that's explicit on the command line, from the search popup, and the\none that comes from .mh_profile) do get created.\n\nkre\n\nps: this is still using the version of the code form a day ago, I haven't\nbeen able to reach the cvs repository today (local routing issue I think).\n\n\n\n_______________________________________________\nExmh-workers mailing list\nExmh-workers@redhat.com\nhttps://listman.redhat.com/mailman/listinfo/exmh-workers\n"

The function isn’t perfect, but it will do for our purposes.

Assemble the Data

Read in the spam dataset:
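Loading a directory of messages amounts to parsing each file and stacking the results into a data frame. The sketch below illustrates the idea on two throwaway files in a temporary directory. `read_email_dir()` and `parse_stub()` are hypothetical names, and `parse_stub()` is only a stand-in for `parse_email()` that returns just a subject and body.

```r
# Hypothetical stand-in for parse_email(): returns only the subject and raw text
parse_stub <- function(text) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  subj  <- sub("^Subject:[[:space:]]*", "",
               grep("^Subject:", lines, value = TRUE)[1])
  list(subj = subj, body = text)
}

# Hypothetical loader: parse every file in `dir` and bind the rows together
read_email_dir <- function(dir, parser = parse_stub) {
  paths <- list.files(dir, full.names = TRUE)
  rows <- lapply(paths, function(p) {
    text <- paste(readLines(p, warn = FALSE), collapse = "\n")
    as.data.frame(parser(text), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Two toy messages written to a temporary directory, purely for illustration
demo_dir <- tempfile("spam_demo")
dir.create(demo_dir)
writeLines(c("Subject: Life Insurance - Why Pay More?", "", "Save up to 70%"),
           file.path(demo_dir, "0001"))
writeLines(c("Subject: RE: Your Bank Account Information", "", "Click here"),
           file.path(demo_dir, "0002"))

spam_demo <- read_email_dir(demo_dir)
print(spam_demo$subj)
```

A message with no subject line yields `NA` in the `subj` column, the same way missing fields appear as `<NA>` in the real data shown next.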

##                                                    id
## 1                      <0103c1042001882DD_IT7@dd_it7>
## 2                                                <NA>
## 3 <9a63c01c249e0$e5a9d610$1106fea9@freeyankeedom.com>
## 4                                                <NA>
## 5                                                <NA>
## 6                 <413-220028422154219900@freesource>
##                                          from
## 1                         12a1mailbot1@web.de
## 2      "Slim Down" <taylor@s3.serveimage.com>
## 3       "Slim Down" <sabrina@mx3.1premio.com>
## 4         Account Services <wsup@playful.com>
## 5      "Slim n Trim" <yenene@mx2.1premio.com>
## 6 "TheCashSystem" <Thecashsystem@firemail.de>
##                                to   cc
## 1            <dcek1a1@netsgo.com> <NA>
## 2                 <ilug@linux.ie> <NA>
## 3   <zzzz@spamassassin.taint.org> <NA>
## 4     zzzz@spamassassin.taint.org     
## 5               <social@linux.ie> <NA>
## 6 "1" <thecashsystem@firemail.de> <NA>
##                                                                                      subj
## 1                                                          Life Insurance - Why Pay More?
## 2                                   [ILUG] Guaranteed to lose 10-12 lbs in 30 days 10.206
## 3                 Guaranteed to lose 10-12 lbs in 30 days                          11.150
## 4 Re: Fw: User Name & Password to Membership To 5 Sites zzzz@spamassassin.taint.org pviqg
## 5                        [ILUG-Social] re: Guaranteed to lose 10-12 lbs in 30 days 10.148
## 6                                                       RE: Your Bank Account Information
##                              date
## 1 Wed, 21 Aug 2002 20:31:57 -1600
## 2 Thu, 22 Aug 2002 06:18:18 -0600
## 3 Thu, 22 Aug 2002 07:36:19 -0600
## 4 Thu, 22 Aug 2002 08:13:35 -0700
## 5 Thu, 22 Aug 2002 09:33:07 -0600
## 6 Thu, 22 Aug 2002 10:42:19 -0500
##   body
## 1 From 12a1mailbot1@web.de  Thu Aug 22 13:17:22 2002\nFrom: 12a1mailbot1@web.de\nTo: <dcek1a1@netsgo.com>\nSubject: Life Insurance - Why Pay More?\nDate: Wed, 21 Aug 2002 20:31:57 -1600\nMessage-ID: <0103c1042001882DD_IT7@dd_it7>\nContent-Transfer-Encoding: quoted-printable\n\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n<HTML><HEAD>\n<META content=3D"text/html; charset=3Dwindows-1252" http-equiv=3DContent-T=\nype>\n<META content=3D"MSHTML 5.00.2314.1000" name=3DGENERATOR></HEAD>\n<BODY><!-- Inserted by Calypso -->\n<TABLE border=3D0 cellPadding=3D0 cellSpacing=3D2 id=3D_CalyPrintHeader_ r=\nules=3Dnone \nstyle=3D"COLOR: black; DISPLAY: none" width=3D"100%"> <TBODY> <TR> <TD colSpan=3D3> <HR color=3Dblack noShade SIZE=3D1> </TD></TR></TD></TR> <TR> <TD colSpan=3D3> <HR color=3Dblack noShade SIZE=3D1> </TD></TR></TBODY></TABLE><!-- End Calypso --><!-- Inserted by Calypso=\n --><FONT \ncolor=3D#000000 face=3DVERDANA,ARIAL,HELVETICA size=3D-2><BR></FONT></TD><=\n/TR></TABLE><!-- End Calypso --><FONT color=3D#ff0000 \nface=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">\n<CENTER>Save up to 70% on Life Insurance.</CENTER></FONT><FONT color=3D#ff=\n0000 \nface=3D"Copperplate Gothic Bold" size=3D5 PTSIZE=3D"10">\n<CENTER>Why Spend More Than You Have To?\n<CENTER><FONT color=3D#ff0000 face=3D"Copperplate Gothic Bold" size=3D5 PT=\nSIZE=3D"10">\n<CENTER>Life Quote Savings\n<CENTER>\n<P align=3Dleft></P>\n<P align=3Dleft></P></FONT></U></I></B><BR></FONT></U></B></U></I>\n<P></P>\n<CENTER>\n<TABLE border=3D0 borderColor=3D#111111 cellPadding=3D0 cellSpacing=3D0 wi=\ndth=3D650> <TBODY></TBODY></TABLE>\n<TABLE border=3D0 borderColor=3D#111111 cellPadding=3D5 cellSpacing=3D0 wi=\ndth=3D650> <TBODY> <TR> <TD colSpan=3D2 width=3D"35%"><B><FONT face=3DVerdana size=3D4>Ensurin=\ng your  family's financial security is very important. Life Quote Savings ma=\nkes  buying life insurance simple and affordable. 
We Provide FREE Access =\nto The  Very Best Companies and The Lowest Rates.</FONT></B></TD></TR> <TR> <TD align=3Dmiddle vAlign=3Dtop width=3D"18%"> <TABLE borderColor=3D#111111 width=3D"100%"> <TBODY> <TR> <TD style=3D"PADDING-LEFT: 5px; PADDING-RIGHT: 5px" width=3D"100=\n%"><FONT  face=3DVerdana size=3D4><B>Life Quote Savings</B> is FAST, EAS=\nY and  SAVES you money! Let us help you get started with the best val=\nues in  the country on new coverage. You can SAVE hundreds or even tho=\nusands  of dollars by requesting a FREE quote from Lifequote Savings. =\nOur  service will take you less than 5 minutes to complete. Shop an=\nd  compare. SAVE up to 70% on all types of Life insurance! \n</FONT></TD></TR> <TR><BR><BR> <TD height=3D50 style=3D"PADDING-LEFT: 5px; PADDING-RIGHT: 5px"  width=3D"100%"> <P align=3Dcenter><B><FONT face=3DVerdana size=3D5><A  href=3D"http://website.e365.cc/savequote/">Click Here For Your=\n  Free Quote!</A></FONT></B></P></TD> <P><FONT face=3DVerdana size=3D4><STRONG> <CENTER>Protecting your family is the best investment you'll eve=\nr  make!<BR></B></TD></TR> <TR><BR><BR></STRONG></FONT></TD></TR></TD></TR> <TR></TR></TBODY></TABLE> <P align=3Dleft><FONT face=3D"Arial, Helvetica, sans-serif" size=3D2=\n></FONT></P> <P></P> <CENTER><BR><BR><BR> <P></P> <P align=3Dleft><BR></B><BR><BR><BR><BR></P> <P align=3Dcenter><BR></P> <P align=3Dleft><BR></B><BR><BR></FONT>If you are in receipt of this=\n email  in error and/or wish to be removed from our list, <A  href=3D"mailto:coins@btamail.net.cn">PLEASE CLICK HERE</A> AND TYPE =\nREMOVE. If you  reside in any state which prohibits e-mail solicitations for insuran=\nce,  please disregard this  email.<BR></FONT><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR><BR=\n><BR><BR><BR></FONT></P></CENTER></CENTER></TR></TBODY></TABLE></CENTER></=\nCENTER></CENTER></CENTER></CENTER></BODY></HTML>\n\n\n
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                          From ilug-admin@linux.ie  Thu Aug 22 13:27:39 2002\nX-Authentication-Warning: lugh.tuatha.org: Host [67.104.83.251] claimed to be email.qves.com\nFrom: "Slim Down" <taylor@s3.serveimage.com>\nTo: <ilug@linux.ie>\nDate: Thu, 22 Aug 2002 06:18:18 -0600\nContent-Transfer-Encoding: 7bit\nX-Mailer: Microsoft CDO for Windows 2000\nThread-Index: AcJJ1f+3FWdz11AmR6uWbmQN5gGxxw==\nContent-Class: urn:content-classes:message\nX-Mimeole: Produced By Microsoft MimeOLE V6.00.2462.0000\nX-Originalarrivaltime: 22 Aug 2002 12:18:18.0699 (UTC) FILETIME=[FFB949B0:01C249D5]\nSubject: [ILUG] Guaranteed to lose 10-12 lbs in 30 days 10.206\n\n1) Fight The Risk of Cancer!\nhttp://www.adclick.ws/p.cfm?o=315&s=pk007\n\n2) Slim Down - Guaranteed to lose 10-12 lbs in 30 days\nhttp://www.adclick.ws/p.cfm?o=249&s=pk007\n\n3) Get the Child Support You Deserve - Free Legal Advice\nhttp://www.adclick.ws/p.cfm?o=245&s=pk002\n\n4) Join the Web's Fastest Growing Singles Community\nhttp://www.adclick.ws/p.cfm?o=259&s=pk007\n\n5) Start Your Private Photo Album Online!\nhttp://www.adclick.ws/p.cfm?o=283&s=pk007\n\nHave a Wonderful Day,\nOffer Manager\nPrizeMama\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf you wish to leave this list please use the link below.\nhttp://www.qves.com/trim/?ilug@linux.ie%7C17%7C114258\n\n\n-- \nIrish Linux Users' Group: ilug@linux.ie\nhttp://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.\nList maintainer: listmaster@linux.ie\n
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           From sabrina@mx3.1premio.com  Thu Aug 22 14:44:07 2002\nFrom: "Slim Down" <sabrina@mx3.1premio.com>\nTo: <zzzz@spamassassin.taint.org>\nSubject: Guaranteed to lose 10-12 lbs in 30 days                          11.150\nDate: Thu, 22 Aug 2002 07:36:19 -0600\nMessage-ID: <9a63c01c249e0$e5a9d610$1106fea9@freeyankeedom.com>\nContent-Transfer-Encoding: 7bit\nX-Mailer: Microsoft CDO for Windows 2000\nThread-Index: AcJJ4OWpowGq7rdNSwCz5HE3x9ZZDQ==\nContent-Class: urn:content-classes:message\nX-MimeOLE: Produced By Microsoft MimeOLE V6.00.2462.0000\nX-OriginalArrivalTime: 22 Aug 2002 13:36:20.0969 (UTC) FILETIME=[E692FD90:01C249E0]\n\n1) Fight The Risk of Cancer!\nhttp://www.adclick.ws/p.cfm?o=315&s=pk007\n\n2) Slim Down - Guaranteed to lose 10-12 lbs in 30 days\nhttp://www.adclick.ws/p.cfm?o=249&s=pk007\n\n3) Get the Child Support You Deserve - Free Legal Advice\nhttp://www.adclick.ws/p.cfm?o=245&s=pk002\n\n4) Join the Web's Fastest Growing Singles Community\nhttp://www.adclick.ws/p.cfm?o=259&s=pk007\n\n5) Start Your Private Photo Album Online!\nhttp://www.adclick.ws/p.cfm?o=283&s=pk007\n\nHave a Wonderful Day,\nOffer Manager\nPrizeMama\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf you wish to leave this list please use the link below.\nhttp://www.qves.com/trim/?zzzz@spamassassin.taint.org%7C17%7C308417\n
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               From wsup@playful.com  Thu Aug 22 16:17:00 2002\nFrom: Account Services <wsup@playful.com>\nTo: zzzz@spamassassin.taint.org\nCc: \nSubject: Re: Fw: User Name & Password to Membership To 5 Sites zzzz@spamassassin.taint.org pviqg\nMime-Version: 1.0\nDate: Thu, 22 Aug 2002 08:13:35 -0700\nX-Mailer: Microsoft Outlook Express 5.00.2615.200\nX-Priority: 1\n\n##################################################\n#                                                #\n#                 Adult Club                     #\n#           Offers FREE Membership               #\n#                                                #\n##################################################\n\n>>>>>  INSTANT ACCESS TO ALL SITES NOW\n>>>>>  Your User Name And Password is.\n>>>>>  User Name: zzzz@spamassassin.taint.org\n>>>>>  Password: 760382\n\n5 of the Best Adult Sites on the Internet for FREE!\n---------------------------------------\nNEWS 08/18/02\nWith just over 2.9 Million Members that signed up for FREE, Last month there were 721,184 New\nMembers. Are you one of them yet???\n---------------------------------------\nOur Membership FAQ\n\nQ. Why are you offering free access to 5 adult membership sites for free?\nA. I have advertisers that pay me for ad space so you don't have to pay for membership.\n\nQ. Is it true my membership is for life?\nA. Absolutely you'll never have to pay a cent the advertisers do.\n\nQ. Can I give my account to my friends and family?\nA. Yes, as long they are over the age of 18.\n\nQ. 
Do I have to sign up for all 5 membership sites?\nA. No just one to get access to all of them.\n\nQ. How do I get started?\nA. Click on one of the following links below to become a member.\n\n- These are multi million dollar operations with policies and rules.\n- Fill in the required info and they won't charge you for the Free pass!\n- If you don't believe us, just read their terms and conditions.\n\n---------------------------\n\n# 5. > Adults Farm\nhttp://80.71.66.8/farm/?aid=760382\nGirls and Animals Getting Freaky....FREE Lifetime Membership!!\n\n# 4. > Sexy Celebes\nhttp://80.71.66.8/celebst/?aid=760382\nThousands Of XXX Celebes doing it...FREE Lifetime Membership!!\n\n# 3. > Play House Porn\nhttp://80.71.66.8/mega/?aid=760382\nLive Feeds From 60 Sites And Web Cams...FREE Lifetime Membership!!\n\n# 2. > Asian Sex Fantasies\nhttp://80.71.66.8/asian/?aid=760382\nJapanese Schoolgirls, Live Sex Shows ...FREE Lifetime Membership!!\n\n# 1. > Lesbian Lace\nhttp://80.71.66.8/lesbian/?aid=760382\nGirls and Girls Getting Freaky! ...FREE Lifetime Membership!!\n\n--------------------------\n\nJennifer Simpson, Miami, FL\nYour FREE lifetime membership has entertained my boyffriend and I for\nthe last two years!  Your Adult Sites are the best on the net!\n\nJoe Morgan Manhattan, NY\nYour live sex shows and live sex cams are unbelievable. The best part\nabout your porn sites, is that they're absolutely FREE!\n\n--------------------------\n\n\n\n\n\n\n\n\n\nRemoval Instructions:\n\nYou have received this advertisement because you have opted in to receive free adult internet\noffers and specials through our affiliated websites. If you do not wish to receive further emails\nor have received the email in error you may opt-out of our database here\nhttp://80.71.66.8/optout/ . Please allow 24 hours for removal.\n\nvonolmosatkirekpups\n
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                From social-admin@linux.ie  Thu Aug 22 16:37:34 2002\nX-Authentication-Warning: lugh.tuatha.org: Host [67.104.83.251] claimed to be email.qves.com\nFrom: "Slim n Trim" <yenene@mx2.1premio.com>\nTo: <social@linux.ie>\nDate: Thu, 22 Aug 2002 09:33:07 -0600\nContent-Transfer-Encoding: 7bit\nX-Mailer: Microsoft CDO for Windows 2000\nThread-Index: AcJJ8TbZoOKEj0AtTsKxJ7ZmOA0e/w==\nContent-Class: urn:content-classes:message\nX-Mimeole: Produced By Microsoft MimeOLE V6.00.2462.0000\nX-Originalarrivaltime: 22 Aug 2002 15:33:08.0313 (UTC) FILETIME=[3746D490:01C249F1]\nSubject: [ILUG-Social] re: Guaranteed to lose 10-12 lbs in 30 days 10.148\n\nI thought you might like these:\n1) Slim Down - Guaranteed to lose 10-12 lbs in 30 days\nhttp://www.freeyankee.com/cgi/fy2/to.cgi?l=822slim1\n\n2) Fight The Risk of Cancer! \nhttp://www.freeyankee.com/cgi/fy2/to.cgi?l=822nic1 \n\n3) Get the Child Support You Deserve - Free Legal Advice \nhttp://www.freeyankee.com/cgi/fy2/to.cgi?l=822ppl1\n\nOffer Manager\nDaily-Deals\n\n\n\n\n\n\n\n\nIf you wish to leave this list please use the link below.\nhttp://www.qves.com/trim/?social@linux.ie%7C29%7C134077\n\n\n-- \nIrish Linux Users' Group Social Events: social@linux.ie\nhttp://www.linux.ie/mailman/listinfo/social for (un)subscription information.\nList maintainer: listmaster@linux.ie\n
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    From Thecashsystem@firemail.de  Thu Aug 22 16:58:24 2002\nMessage-ID: <413-220028422154219900@freesource>\nX-Priority: 1\nTo: "1" <thecashsystem@firemail.de>\nFrom: "TheCashSystem" <Thecashsystem@firemail.de>\nSubject: RE: Your Bank Account Information \nDate: Thu, 22 Aug 2002 10:42:19 -0500\nContent-type: text/plain; charset=US-ASCII\nX-MIME-Autoconverted: from quoted-printable to 8bit by webnote.net id QAA05573\nContent-Transfer-Encoding: 8bit\n\nA POWERHOUSE GIFTING PROGRAM You Don't Want To Miss! \n  GET IN WITH THE FOUNDERS! 
\nThe MAJOR PLAYERS are on This ONE\nFor ONCE be where the PlayerS are\nThis is YOUR Private Invitation\n\nEXPERTS ARE CALLING THIS THE FASTEST WAY \nTO HUGE CASH FLOW EVER CONCEIVED\nLeverage $1,000 into $50,000 Over and Over Again\n\nTHE QUESTION HERE IS:\nYOU EITHER WANT TO BE WEALTHY \nOR YOU DON'T!!!\nWHICH ONE ARE YOU?\nI am tossing you a financial lifeline and for your sake I \nHope you GRAB onto it and hold on tight For the Ride of youR life!\n\nTestimonials\n\nHear what average people are doing their first few days:\n�We've received 8,000 in 1 day and we are doing that over and over again!' Q.S. in AL\n �I'm a single mother in FL and I've received 12,000 in the last 4 days.� D. S. in FL\n�I was not sure about this when I sent off my $1,000 pledge, but I got back $2,000 the very next day!� L.L. in KY\n�I didn't have the money, so I found myself a partner to work this with. We have received $4,000 over the last 2 days. \nI think I made the right decision; don't you?� K. C. in FL\n�I pick up $3,000 my first day and I  they gave me free leads and all the training, you can too!� J.W. in CA\n\nANNOUNCING: We will CLOSE your sales for YOU! And Help you get a Fax Blast IMMEDIATELY Upon Your Entry!!!    YOU Make the MONEY!!!\nFREE LEADS!!! TRAINING!!!\n\n$$DON'T WAIT!!! CALL NOW $$\nFAX BACK TO: 1-800-421-6318 OR Call 1-800-896-6568 \n\nName__________________________________Phone___________________________________________\n\nFax_____________________________________Email____________________________________________\n\nBest Time To Call_________________________Time Zone________________________________________\n\nThis message is sent in compliance of the new e-mail bill. "Per Section 301, Paragraph (a)(2)(C) of S. 1618, further transmissions by the sender of this email may be stopped, at no cost to you, by sending a reply to this email address with the word "REMOVE" in the subject line. Errors, omissions, and exceptions excluded.\n \nThis is NOT spam! 
I have compiled this list from our Replicate Database, relative to Seattle Marketing Group, The Gigt, or Turbo Team for the sole purpose of these communications. Your continued inclusion is ONLY by your gracious permission. If you wish to not receive this mail from me, please send an email to tesrewinter@yahoo.com with "Remove" in the subject and you will be deleted immediately.\n\n\n
##      y
## 1 spam
## 2 spam
## 3 spam
## 4 spam
## 5 spam
## 6 spam

And the same thing with the ham dataset:

##                                           id
## 1 <1029945287.4797.TMDA@deepeddy.vircio.com>
## 2                                       <NA>
## 3                                       <NA>
## 4                                       <NA>
## 5                                       <NA>
## 6                                       <NA>
##                                          from
## 1              Robert Elz <kre@munnari.OZ.AU>
## 2   Steve Burt <steve.burt@cursor-system.com>
## 3               "Tim Chapman" <timc@2ubh.com>
## 4            Monty Solomon <monty@roscom.com>
## 5   Stewart Smith <Stewart.Smith@ee.ed.ac.uk>
## 6 "Martin Adamson" <martin@srv0.ems.ed.ac.uk>
##                                                           to
## 1 Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
## 2                                  zzzzteana@yahoogroups.com
## 3                                  zzzzteana@yahoogroups.com
## 4                                   undisclosed-recipient: ;
## 5                                  zzzzteana@yahoogroups.com
## 6                                  zzzzteana@yahoogroups.com
##                                    cc
## 1 exmh-workers@spamassassin.taint.org
## 2                                <NA>
## 3                                <NA>
## 4                                <NA>
## 5                                <NA>
## 6                                <NA>
##                                             subj
## 1                       Re: New Sequences Window
## 2                      [zzzzteana] RE: Alexander
## 3                      [zzzzteana] Moscow bomber
## 4          [IRR] Klez: The Virus That  Won't Die
## 5 Re: [zzzzteana] Nothing like mama used to make
## 6 Re: [zzzzteana] Nothing like mama used to make
##                              date
## 1 Thu, 22 Aug 2002 18:26:25 +0700
## 2 Thu, 22 Aug 2002 12:46:18 +0100
## 3 Thu, 22 Aug 2002 13:52:38 +0100
## 4 Thu, 22 Aug 2002 09:15:25 -0400
## 5 Thu, 22 Aug 2002 14:38:22 +0100
## 6 Thu, 22 Aug 2002 14:50:31 +0100
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                      body
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002\nFrom: Robert Elz <kre@munnari.OZ.AU>\nTo: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>\nCc: exmh-workers@spamassassin.taint.org\nSubject: Re: New Sequences Window\nDate: Thu, 22 Aug 2002 18:26:25 +0700\n Date:        Wed, 21 Aug 2002 10:54:46 -0500 From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com> Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>\n\n | I can't reproduce this error.\n\nFor me it is very repeatable... (like every time, without fail).\n\nThis is the debug log of the pick happening ...\n\n18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}\n18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury\n18:19:04 Ftoc_PickMsgs {{1 hit}}\n18:19:04 Marking 1 hits\n18:19:04 tkerror: syntax error in expression "int ...\n\nNote, if I run the pick command by hand ...\n\ndelta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury\n1 hit\n\nThat's where the "1 hit" comes from (obviously).  
The version of nmh I'm\nusing is ...\n\ndelta$ pick -version\npick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 ICT 2002]\n\nAnd the relevant part of my .mh_profile ...\n\ndelta$ mhparam pick\n-seq sel -list\n\n\nSince the pick command works, the sequence (actually, both of them, the\none that's explicit on the command line, from the search popup, and the\none that comes from .mh_profile) do get created.\n\nkre\n\nps: this is still using the version of the code form a day ago, I haven't\nbeen able to reach the cvs repository today (local routing issue I think).\n\n\n\n_______________________________________________\nExmh-workers mailing list\nExmh-workers@redhat.com\nhttps://listman.redhat.com/mailman/listinfo/exmh-workers\n
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            From Steve_Burt@cursor-system.com  Thu Aug 22 12:46:39 2002\nX-Egroups-Return: sentto-2242572-52726-1030016790-zzzz=spamassassin.taint.org@returns.groups.yahoo.com\nX-X-Apparently-To: zzzzteana@yahoogroups.com\nTo: "'zzzzteana@yahoogroups.com'" <zzzzteana@yahoogroups.com>\nX-Mailer: Internet Mail Service (5.5.2653.19)\nX-Egroups-From: Steve Burt <steve.burt@cursor-system.com>\nFrom: Steve Burt <Steve_Burt@cursor-system.com>\nX-Yahoo-Profile: pyruse\nMailing-List: list zzzzteana@yahoogroups.com; contact forteana-owner@yahoogroups.com\nDate: Thu, 22 Aug 2002 12:46:18 +0100\nSubject: [zzzzteana] RE: Alexander\nReply-To: zzzzteana@yahoogroups.com\nContent-Transfer-Encoding: 7bit\n\nMartin A posted:\nTassos Papadopoulos, the Greek sculptor behind the plan, judged that the\n limestone of Mount Kerdylio, 70 miles east of Salonika and not far from the\n Mount Athos monastic community, was ideal for the patriotic sculpture. 
\n \n As well as Alexander's granite features, 240 ft high and 170 ft wide, a\n museum, a restored amphitheatre and car park for admiring crowds are\nplanned\n---------------------\nSo is this mountain limestone or granite?\nIf it's limestone, it'll weather pretty fast.\n\n------------------------ Yahoo! Groups Sponsor ---------------------~-->\n4 DVDs Free +s&p Join Now\nhttp://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM\n---------------------------------------------------------------------~->\n\nTo unsubscribe from this group, send an email to:\nforteana-unsubscribe@egroups.com\n\n \n\nYour use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ \n\n\n
## 3 From timc@2ubh.com  Thu Aug 22 13:52:59 2002\nX-Egroups-Return: sentto-2242572-52733-1030020820-zzzz=spamassassin.taint.org@returns.groups.yahoo.com\nX-X-Apparently-To: zzzzteana@yahoogroups.com\nX-Mailer: Microsoft Outlook Express Macintosh Edition - 4.5 (0410)\nTo: zzzzteana <zzzzteana@yahoogroups.com>\nX-Priority: 3\nFrom: "Tim Chapman" <timc@2ubh.com>\nX-Yahoo-Profile: tim2ubh\nMailing-List: list zzzzteana@yahoogroups.com; contact forteana-owner@yahoogroups.com\nDate: Thu, 22 Aug 2002 13:52:38 +0100\nSubject: [zzzzteana] Moscow bomber\nReply-To: zzzzteana@yahoogroups.com\nContent-Transfer-Encoding: 7bit\n\nMan Threatens Explosion In Moscow \n\nThursday August 22, 2002 1:40 PM\nMOSCOW (AP) - Security officers on Thursday seized an unidentified man who\nsaid he was armed with explosives and threatened to blow up his truck in\nfront of Russia's Federal Security Services headquarters in Moscow, NTV\ntelevision reported.\nThe officers seized an automatic rifle the man was carrying, then the man\ngot out of the truck and was taken into custody, NTV said. No other details\nwere immediately available.\nThe man had demanded talks with high government officials, the Interfax and\nITAR-Tass news agencies said. Ekho Moskvy radio reported that he wanted to\ntalk with Russian President Vladimir Putin.\nPolice and security forces rushed to the Security Service building, within\nblocks of the Kremlin, Red Square and the Bolshoi Ballet, and surrounded the\nman, who claimed to have one and a half tons of explosives, the news\nagencies said. Negotiations continued for about one and a half hours outside\nthe building, ITAR-Tass and Interfax reported, citing witnesses.\nThe man later drove away from the building, under police escort, and drove\nto a street near Moscow's Olympic Penta Hotel, where authorities held\nfurther negotiations with him, the Moscow police press service said. The\nmove appeared to be an attempt by security services to get him to a more\nsecure location. 
\n\n------------------------ Yahoo! Groups Sponsor ---------------------~-->\n4 DVDs Free +s&p Join Now\nhttp://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM\n---------------------------------------------------------------------~->\n\nTo unsubscribe from this group, send an email to:\nforteana-unsubscribe@egroups.com\n\n \n\nYour use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ \n\n\n
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         From irregulars-admin@tb.tf  Thu Aug 22 14:23:39 2002\nX-To: undisclosed-recipient: ;\nFrom: Monty Solomon <monty@roscom.com>\nSubject: [IRR] Klez: The Virus That  Won't Die\nDate: Thu, 22 Aug 2002 09:15:25 -0400\n\nKlez: The Virus That Won't Die\n \nAlready the most prolific virus ever, Klez continues to wreak havoc.\n\nAndrew Brandt\n>>From the September 2002 issue of PC World magazine\nPosted Thursday, August 01, 2002\n\n\nThe Klez worm is approaching its seventh month of wriggling across \nthe Web, making it one of the most persistent viruses ever. And \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from PC to PC.\n\nAntivirus software makers Symantec and McAfee both report more than \n2000 new infections daily, with no sign of letup at press time. 
The \nBritish security firm MessageLabs estimates that 1 in every 300 \ne-mail messages holds a variation of the Klez virus, and says that \nKlez has already surpassed last summer's SirCam as the most prolific \nvirus ever.\n\nAnd some newer Klez variants aren't merely nuisances--they can carry \nother viruses in them that corrupt your data.\n\n...\n\nhttp://www.pcworld.com/news/article/0,aid,103259,00.asp\n_______________________________________________\nIrregulars mailing list\nIrregulars@tb.tf\nhttp://tb.tf/mailman/listinfo/irregulars\n
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    From Stewart.Smith@ee.ed.ac.uk  Thu Aug 22 14:44:26 2002\nX-Egroups-Return: sentto-2242572-52736-1030023506-zzzz=spamassassin.taint.org@returns.groups.yahoo.com\nX-X-Apparently-To: zzzzteana@yahoogroups.com\nOrganization: Scottish Microelectronics Centre\nUser-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.1b) Gecko/20020628\nX-Accept-Language: en, en-us\nTo: zzzzteana@yahoogroups.com\nFrom: Stewart Smith <Stewart.Smith@ee.ed.ac.uk>\nX-Yahoo-Profile: stochasticus\nMailing-List: list zzzzteana@yahoogroups.com; contact forteana-owner@yahoogroups.com\nDate: Thu, 22 Aug 2002 14:38:22 +0100\nSubject: Re: [zzzzteana] Nothing like mama used to make\nReply-To: zzzzteana@yahoogroups.com\nContent-Transfer-Encoding: 7bit\n\n>  in adding cream to spaghetti carbonara, which has the same effect on pasta as\n>  making a pizza a deep-pie; \n\nI just had to jump in here as Carbonara is one of my favourites to make and ask \nwhat the hell are you supposed to use instead of cream?  I've never seen a \nrecipe that hasn't used this.  
Personally I use low fat creme fraiche because it \nworks quite nicely but the only time I've seen an supposedly authentic recipe \nfor carbonara  it was identical to mine (cream, eggs and lots of fresh parmesan) \nexcept for the creme fraiche.\n\nStew\n-- \nStewart Smith\nScottish Microelectronics Centre, University of Edinburgh.\nhttp://www.ee.ed.ac.uk/~sxs/\n\n\n------------------------ Yahoo! Groups Sponsor ---------------------~-->\n4 DVDs Free +s&p Join Now\nhttp://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM\n---------------------------------------------------------------------~->\n\nTo unsubscribe from this group, send an email to:\nforteana-unsubscribe@egroups.com\n\n \n\nYour use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ \n\n\n
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  From martin@srv0.ems.ed.ac.uk  Thu Aug 22 14:54:39 2002\nX-Egroups-Return: sentto-2242572-52737-1030024261-zzzz=spamassassin.taint.org@returns.groups.yahoo.com\nX-X-Apparently-To: zzzzteana@yahoogroups.com\nOrganization: Management School\nTo: zzzzteana@yahoogroups.com\nPriority: normal\nX-Mailer: Pegasus Mail for Windows (v4.01)\nContent-Description: Mail message body\nFrom: "Martin Adamson" <martin@srv0.ems.ed.ac.uk>\nMailing-List: list zzzzteana@yahoogroups.com; contact forteana-owner@yahoogroups.com\nDate: Thu, 22 Aug 2002 14:50:31 +0100\nSubject: Re: [zzzzteana] Nothing like mama used to make\nReply-To: zzzzteana@yahoogroups.com\nContent-Transfer-Encoding: 7bit\n\n\n> I just had to jump in here as Carbonara is one of my favourites to make and \n> ask \n> what the hell are you supposed to use instead of cream? \n\nIsn't it just basically a mixture of beaten egg and bacon (or pancetta, \nreally)? You mix in the raw egg to the cooked pasta and the heat of the pasta \ncooks the egg. 
That's my understanding.\n\nMartin\n\n------------------------ Yahoo! Groups Sponsor ---------------------~-->\n4 DVDs Free +s&p Join Now\nhttp://us.click.yahoo.com/pt6YBB/NXiEAA/mG3HAA/7gSolB/TM\n---------------------------------------------------------------------~->\n\nTo unsubscribe from this group, send an email to:\nforteana-unsubscribe@egroups.com\n\n \n\nYour use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ \n\n\n
##     y
## 1 ham
## 2 ham
## 3 ham
## 4 ham
## 5 ham
## 6 ham

Final dataframe:

Feature Engineering

The simplest set of features would be a count of the words (or tokens) in each e-mail. I will go one step further and use their term frequency-inverse document frequency (TF-IDF) scores, which weight a term by how often it appears in an e-mail and discount terms that appear in many e-mails. Other features are easy to imagine, as a quick look at the literature on this topic suggests.
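
To make the TF-IDF idea concrete, here is a minimal base-R sketch on a made-up three-document corpus. The documents, the whitespace tokenizer, and the plain `log(N / df)` weighting are assumptions for illustration; packages such as tm may use different weighting and normalization by default.

```r
# Toy corpus: three tiny "e-mails" (made up for illustration)
docs <- c(spam1 = "free money free offer",
          ham1  = "meeting notes attached",
          ham2  = "free lunch meeting")

# Tokenize on whitespace and build the vocabulary
tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Term frequency: share of each document's tokens taken up by each term
tf <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab)) / length(tok)
}))
dimnames(tf) <- list(names(docs), vocab)

# Inverse document frequency: log(N / number of documents containing the term)
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# TF-IDF: scale each term's column by its IDF weight
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

In `spam1`, "free" occurs twice and "money" once, but "money" gets the higher score because "free" also shows up in a ham e-mail. That is the discounting effect that separates TF-IDF from raw counts.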

First, let’s remove some non-informative words from the e-mails, e.g., a, of, another. I’m using a custom list of stopwords I’ve used before, plus some troublesome HTML tags.

## [1] "a"         "a's"       "able"      "about"     "above"     "according"

Next I’m going to use the tm package for various transformations to help the classifier ignore superfluous details: lowercase all words, remove numbers, remove punctuation, remove stopwords, and eliminate excess white space:
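
As a rough illustration of what these transformations do to a single string, here is a base-R sketch. It is not the tm pipeline itself: `clean_text` and `stopwords_demo` are hypothetical names, the stopword list is a tiny stand-in for the custom list, and tm's own transformations may treat edge cases differently.

```r
# Hypothetical helper that mimics the cleaning steps in base R
clean_text <- function(x, stopwords) {
  x <- tolower(x)                        # lowercase all words
  x <- gsub("[[:digit:]]+", "", x)       # remove numbers
  x <- gsub("[[:punct:]]+", "", x)       # remove punctuation
  words <- strsplit(x, "\\s+")           # split on runs of white space
  # Drop empty strings and stopwords
  words <- lapply(words, function(w) w[nzchar(w) & !(w %in% stopwords)])
  # Re-join with single spaces, which also eliminates excess white space
  vapply(words, paste, character(1), collapse = " ")
}

# Tiny stand-in stopword list (an assumption, not the list used above)
stopwords_demo <- c("a", "of", "to", "the", "in")

clean_text("Guaranteed to lose 10-12 lbs in 30 days!", stopwords_demo)
```

The subject line from one of the spam e-mails above is reduced to the four tokens that carry most of the signal.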

Now, generate the TF-IDF matrix. I’d like to keep roughly the top 1,000 terms; experimentation suggests that a sparsity threshold of 0.985 gets close to that.

## [1] 1017
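
For readers unfamiliar with sparsity thresholds, the sketch below shows the idea on a made-up document-term matrix: a term's sparsity is the share of documents in which it does not occur, and terms above the threshold are dropped. The matrix and the `keep_terms` helper are hypothetical, and the exact boundary handling in tm may differ from this version.

```r
# Toy document-term matrix: rows are documents, columns are terms (counts are made up)
dtm <- matrix(c(1, 0, 0, 2,
                1, 0, 3, 0,
                2, 0, 0, 0,
                1, 1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("free", "rare", "money", "lbs")))

# Sparsity of a term: share of documents in which it does not occur
term_sparsity <- colMeans(dtm == 0)

# Hypothetical helper: keep terms whose sparsity is below the threshold
keep_terms <- function(m, max_sparsity) {
  m[, colMeans(m == 0) < max_sparsity, drop = FALSE]
}

ncol(keep_terms(dtm, 0.985))  # lenient threshold: all 4 terms survive
ncol(keep_terms(dtm, 0.70))   # strict threshold: only "free" survives
```

Lowering the threshold shrinks the feature set, which is how the 0.985 value can be tuned to land near a target number of terms.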

Evaluation

Take a look at how the model performs on the test set:

## Warning in data.matrix(newdata): NAs introduced by coercion
##       
## pred   ham spam
##   ham  485    4
##   spam 139  122

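Calculate a few performance metrics. Assuming the columns of the table above are the actual classes and spam is the positive class, the four values are accuracy (607/750), the share of actual spam that was caught (122/126, i.e., recall), the share of predicted spam that really was spam (122/261, i.e., precision), and the F1 score: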

## [1] 0.8093333
## [1] 0.968254
## [1] 0.467433
## [1] 0.630491
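
The four numbers can be reproduced directly from the confusion matrix. The sketch below assumes the rows are predictions, the columns are actual classes, and spam is the positive class; the variable names are mine. Under those assumptions, 0.968 is recall and 0.467 is precision.

```r
# Confusion matrix from the output above: rows = predicted, columns = actual (assumed)
cm <- matrix(c(485, 139, 4, 122), nrow = 2,
             dimnames = list(pred = c("ham", "spam"),
                             actual = c("ham", "spam")))

tp <- cm["spam", "spam"]  # spam correctly flagged
fp <- cm["spam", "ham"]   # ham wrongly flagged as spam
fn <- cm["ham", "spam"]   # spam that slipped through
tn <- cm["ham", "ham"]    # ham correctly passed

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 6)
```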

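This model looks pretty good for a couple hours’ work: it catches 122 of the 126 spam e-mails in the test set, though it also flags 139 ham e-mails as spam.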

In my opinion, optimizing for high precision is probably the route to go. For e-mail spam classification tasks, I would imagine that false positives are considered more costly than false negatives; that is, we would rather let a spam e-mail into the inbox than exclude a real e-mail as spam. Precision is a good metric when the cost of a false positive is high, as in this case.
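
One common way to favor precision, sketched here under assumptions, is to flag an e-mail as spam only when the model's posterior probability of spam clears a higher threshold. The probabilities and labels below are invented for illustration and do not come from the fitted model; `precision_recall` is a hypothetical helper.

```r
# Hypothetical posterior P(spam) for eight test e-mails, with their true labels
p_spam <- c(0.99, 0.95, 0.80, 0.65, 0.60, 0.55, 0.30, 0.10)
actual <- c("spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham")

# Precision and recall (spam = positive class) at a given decision threshold
precision_recall <- function(threshold) {
  pred <- ifelse(p_spam >= threshold, "spam", "ham")
  tp <- sum(pred == "spam" & actual == "spam")
  fp <- sum(pred == "spam" & actual == "ham")
  fn <- sum(pred == "ham"  & actual == "spam")
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# Compare the default 0.5 cutoff with a stricter 0.9 cutoff
t(sapply(c(0.5, 0.9), precision_recall))
```

On this toy data the stricter cutoff removes every false positive at the price of missing one spam e-mail, which is the trade-off to weigh when false positives are the costlier error.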