Section : Ham files

Downloading the Dataset for Ham

Creating Ham Data Frame

##  chr [1:2501] "00001.7c53336b37003a9286aba55d2945844c" ...
##  [1] "00001.7c53336b37003a9286aba55d2945844c"
##  [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
##  [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
##  [4] "00004.864220c5b6930b209cc287c361c99af1"
##  [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
##  [6] "00006.253ea2f9a9cc36fa0b1129b04b806608"
##  [7] "00007.37a8af848caae585af4fe35779656d55"
##  [8] "00008.5891548d921601906337dcf1ed8543cb"
##  [9] "00009.371eca25b0169ce5cb4f71d3e07b9e2d"
## [10] "00010.145d22c053c1a0c410242e46c01635b3"
## [11] "00011.fbcde1b4833bdbaaf0ced723edd6e355"
## [12] "00012.48a387bc38d1316a6f6b49e8c2e43a03"
## [13] "00013.81c34741dbed59c6dde50777e27e7ea3"
## [14] "00014.cb20e10b2bfcb8210a1c310798532a57"
## [15] "00015.4d7026347ba7478c9db04c70913e68fd"
## [1] "There are 2501 spam files in the easy_ham folder."

Creating a Ham Sender’ Email Data Frame

## # A tibble: 2 x 2
##   email                        length
##   <chr>                         <int>
## 1 Steve_Burt@cursor-system.com     28
## 2 timc@2ubh.com                    13

vizualizing the Length of Different Senders’ Emails

Example of a Ham File

## [[1]]
## [1] "From irregulars-admin@tb.tf  Thu Aug 22 14:23:39 2002\nReturn-Path: <irregulars-admin@tb.tf>\nDelivered-To: zzzz@localhost.netnoteinc.com\nReceived: from localhost (localhost [127.0.0.1])\n\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9DAE147C66\n\tfor <zzzz@localhost>; Thu, 22 Aug 2002 09:23:38 -0400 (EDT)\nReceived: from phobos [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 14:23:38 +0100 (IST)\nReceived: from web.tb.tf (route-64-131-126-36.telocity.com\n    [64.131.126.36]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id\n    g7MDGOZ07922 for <zzzz-irr@spamassassin.taint.org>; Thu, 22 Aug 2002 14:16:24 +0100\nReceived: from web.tb.tf (localhost.localdomain [127.0.0.1]) by web.tb.tf\n    (8.11.6/8.11.6) with ESMTP id g7MDP9I16418; Thu, 22 Aug 2002 09:25:09\n    -0400\nReceived: from red.harvee.home (red [192.168.25.1] (may be forged)) by\n    web.tb.tf (8.11.6/8.11.6) with ESMTP id g7MDO4I16408 for\n    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:24:04 -0400\nReceived: from prserv.net (out4.prserv.net [32.97.166.34]) by\n    red.harvee.home (8.11.6/8.11.6) with ESMTP id g7MDFBD29237 for\n    <irregulars@tb.tf>; Thu, 22 Aug 2002 09:15:12 -0400\nReceived: from [209.202.248.109]\n    (slip-32-103-249-10.ma.us.prserv.net[32.103.249.10]) by prserv.net (out4)\n    with ESMTP id <2002082213150220405qu8jce>; Thu, 22 Aug 2002 13:15:07 +0000\nMIME-Version: 1.0\nX-Sender: @ (Unverified)\nMessage-Id: <p04330137b98a941c58a8@[209.202.248.109]>\nTo: undisclosed-recipient: ;\nFrom: Monty Solomon <monty@roscom.com>\nContent-Type: text/plain; charset=\"us-ascii\"\nSubject: [IRR] Klez: The Virus That  Won't Die\nSender: irregulars-admin@tb.tf\nErrors-To: irregulars-admin@tb.tf\nX-Beenthere: irregulars@tb.tf\nX-Mailman-Version: 2.0.6\nPrecedence: bulk\nList-Help: <mailto:irregulars-request@tb.tf?subject=help>\nList-Post: <mailto:irregulars@tb.tf>\nList-Subscribe: <http://tb.tf/mailman/listinfo/irregulars>,\n    <mailto:irregulars-request@tb.tf?subject=subscribe>\nList-Id: New home of the TBTF Irregulars mailing list <irregulars.tb.tf>\nList-Unsubscribe: <http://tb.tf/mailman/listinfo/irregulars>,\n    <mailto:irregulars-request@tb.tf?subject=unsubscribe>\nList-Archive: <http://tb.tf/mailman/private/irregulars/>\nDate: Thu, 22 Aug 2002 09:15:25 -0400\n\nKlez: The Virus That Won't Die\n \nAlready the most prolific virus ever, Klez continues to wreak havoc.\n\nAndrew Brandt\n>>From the September 2002 issue of PC World magazine\nPosted Thursday, August 01, 2002\n\n\nThe Klez worm is approaching its seventh month of wriggling across \nthe Web, making it one of the most persistent viruses ever. And \nexperts warn that it may be a harbinger of new viruses that use a \ncombination of pernicious approaches to go from PC to PC.\n\nAntivirus software makers Symantec and McAfee both report more than \n2000 new infections daily, with no sign of letup at press time. The \nBritish security firm MessageLabs estimates that 1 in every 300 \ne-mail messages holds a variation of the Klez virus, and says that \nKlez has already surpassed last summer's SirCam as the most prolific \nvirus ever.\n\nAnd some newer Klez variants aren't merely nuisances--they can carry \nother viruses in them that corrupt your data.\n\n...\n\nhttp://www.pcworld.com/news/article/0,aid,103259,00.asp\n_______________________________________________\nIrregulars mailing list\nIrregulars@tb.tf\nhttp://tb.tf/mailman/listinfo/irregulars\n"

Turning These Ham Emails to a Data Frame

## # A tibble: 2 x 3
##    mail emails                         len
##   <int> <chr>                        <int>
## 1     1 Steve_Burt@cursor-system.com    28
## 2     2 Steve_Burt@cursor-system.com    28

visualizing the length of all Emails

Body of the Email

Creating a Data Frame containing the words

Adding the Frequency of Words to the Data frame

## # A tibble: 2 x 3
##   files word                                       n
##   <int> <chr>                                  <int>
## 1     1 00001.7c53336b37003a9286aba55d2945844c     1
## 2     2 from                                     442

Organizing the Data frame and adding the Term Frequency(tf), Inverse Document Frequency of a term(idf), and the combining of two term(tf_idf)

## Warning in bind_tf_idf.data.frame(., word, files, n): A value for tf_idf is negative:
## Input should have exactly one row per document-term combination.
## # A tibble: 2 x 6
##   files word                                       n     tf   idf tf_idf
##   <int> <chr>                                  <int>  <dbl> <dbl>  <dbl>
## 1     1 00001.7c53336b37003a9286aba55d2945844c     1 1       6.73  6.73 
## 2  1692 15737.33929.716821.779152                 60 0.0167  7.82  0.130

Cleaning the Data Frame,

We select only words with IDF greater than 0 and we remove words containing numbers

## # A tibble: 2 x 6
##   files word         n      tf   idf tf_idf
##   <int> <chr>    <int>   <dbl> <dbl>  <dbl>
## 1  1692 laptop's    60 0.0167   6.44 0.107 
## 2  1689 neale's    108 0.00926  7.13 0.0660

Example of the sparcity of a word

## # A tibble: 4 x 6
##   files word         n      tf   idf  tf_idf
##   <int> <chr>    <int>   <dbl> <dbl>   <dbl>
## 1  1692 laptop's    60 0.0167   6.44 0.107  
## 2  1214 laptop's   603 0.00166  6.44 0.0107 
## 3  1250 laptop's   627 0.00159  6.44 0.0103 
## 4  1215 laptop's   805 0.00124  6.44 0.00800

Section 2: Spam Files

Example of a Spam Document

## [[1]]
## [1] "From sales@outsrc-em.com  Mon Jun 24 17:53:15 2002\nReturn-Path: sales@outsrc-em.com\nDelivery-Date: Thu Jun 20 20:08:33 2002\nReceived: from outsrc-em.com ([166.70.149.104]) by dogma.slashnull.org\n    (8.11.6/8.11.6) with SMTP id g5KJ8WI08701 for <jm@jmason.org>;\n    Thu, 20 Jun 2002 20:08:32 +0100\nMessage-Id: <200206201908.g5KJ8WI08701@dogma.slashnull.org>\nFrom: \"Outsource Sales\" <sales@outsrc-em.com>\nTo: \"yyyy@spamassassin.taint.org\" <yyyy@spamassassin.taint.org>\nSubject: New Product Announcement\nSender: \"Outsource Sales\" <sales@outsrc-em.com>\nMIME-Version: 1.0\nContent-Type: text/plain; charset=\"ISO-8859-1\"\nDate: Fri, 3 Jan 1997 17:24:47 -0700\nReply-To: \"Outsource Sales\" <sales@outsrc-em.com>\nContent-Transfer-Encoding: 8bit\nX-Keywords: \n\nNEW PRODUCT ANNOUNCEMENT\n\nFrom: OUTSOURCE ENG.& MFG. INC.\n\n\nSir/Madam;\n\nThis note is to inform you of new watchdog board technology for maintaining\ncontinuous unattended operation of PC/Servers etc. that we have released for\ndistribution.\n  \nWe are proud to announce Watchdog Control Center featuring MAM (Multiple\nApplications Monitor) capability.\nThe key feature of this application enables you to monitor as many\napplications as you\nhave resident on any computer as well as the operating system for\ncontinuous unattended operation.  The Watchdog Control Center featuring\nMAM capability expands third party application \"control\" of a Watchdog as\naccess to the application's\nsource code is no longer needed.\n\nHere is how it all works:\nUpon installation of the application and Watchdog, the user may select\nmany configuration options, based on their model of Watchdog, to fit their\noperational needs.  If the MAM feature is enabled, the user may select any\nexecutable program that they wish for monitoring.\n\nA lock up of the operating system or if any one of the selected\napplications is not running, the MAM feature, in\nconjunction with the Watchdog, will reset the system allowing for\ncontinuous operation.\n\nIt's that simple!\n\nWatchdog Control Center is supported on most Microsoft Windows platforms\n(Win9x/WinNT/Win2k) and includes a Linux version for PCI Programmable\nWatchdogs.\n\nWatchdog Control Center Features:\n- Automated installation\n- Controls all Outsource Engineering Watchdogs\n- User selectable Watchdog timeout period\n- User selectable Watchdog stroke interval\n- Multiple Application Monitoring\n\nIncluded on the Installation CD:\n- Watchdog Control Center\n- Watchdog Drivers\n- Documentation\n\nFor more information, please visit out website at\nhttp://www.outsrc-em.com/ or send an e-mail to sales@outsrc-em.com\n"

Selecting the Most Frequent Words with TF_IDF

## Warning in bind_tf_idf.data.frame(., word, block, n): A value for tf_idf is negative:
## Input should have exactly one row per document-term combination.
## # A tibble: 6 x 6
##   block word                                       n      tf   idf tf_idf
##   <int> <chr>                                  <int>   <dbl> <dbl>  <dbl>
## 1     1 00001.317e78fa8ee2f54cd4890fdc09ba8176     1 1        6.14 6.14  
## 2   805 4.21.157.32                              109 0.00917  7.24 0.0664
## 3   805 g6l6w9415993                             109 0.00917  7.24 0.0664
## 4   805 1027225826.1122                          109 0.00917  7.24 0.0664
## 5   805 winnereritmugu                           109 0.00917  7.24 0.0664
## 6   805 winnergkrsvyyyyl                         109 0.00917  7.24 0.0664

Cleaning The Spam List of Words

## # A tibble: 6 x 6
##   block word                    n      tf   idf tf_idf
##   <int> <chr>               <int>   <dbl> <dbl>  <dbl>
## 1   743 luke's                127 0.00787  7.24 0.0570
## 2    58 mailto:angie_pepi     192 0.00521  7.24 0.0377
## 3   382 car's                 195 0.00513  7.24 0.0371
## 4   996 mailto:remove_me123   196 0.00510  7.24 0.0369
## 5   536 ident:nobody          125 0.008    4.53 0.0363
## 6   362 mailto:bluejo         202 0.00495  7.24 0.0359

Creating a Spam Sender’ Email Data Frame

##    Length     Class      Mode 
##      1396 character character
## [1] "lmrn@mailexcite.com"               "amknight@mailexcite.com"          
## [3] "jordan23@mailexcite.com"           "merchantsworld2001@juno.com"      
## [5] "cypherpunks-forward@ds.pro-ns.net" "sales@outsrc-em.com"

Creating a Spam Senders’ Email Data Frame

## # A tibble: 6 x 2
##   email                               len
##   <chr>                             <int>
## 1 lmrn@mailexcite.com                  19
## 2 amknight@mailexcite.com              23
## 3 jordan23@mailexcite.com              23
## 4 merchantsworld2001@juno.com          27
## 5 cypherpunks-forward@ds.pro-ns.net    33
## 6 sales@outsrc-em.com                  19

visualizing the Length of Different Spam Senders’ Emails

Example of Spam Document

## [[1]]
## [1] "From lmrn@mailexcite.com  Mon Jun 24 17:03:24 2002\nReturn-Path: merchantsworld2001@juno.com\nDelivery-Date: Mon May 13 04:46:13 2002\nReceived: from mandark.labs.netnoteinc.com ([213.105.180.140]) by\n    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g4D3kCe15097 for\n    <jm@jmason.org>; Mon, 13 May 2002 04:46:12 +0100\nReceived: from 203.129.205.5.205.129.203.in-addr.arpa ([203.129.205.5]) by\n    mandark.labs.netnoteinc.com (8.11.2/8.11.2) with SMTP id g4D3k2D12605 for\n    <jm@netnoteinc.com>; Mon, 13 May 2002 04:46:04 +0100\nReceived: from html (unverified [207.95.174.49]) by\n    203.129.205.5.205.129.203.in-addr.arpa (EMWAC SMTPRS 0.83) with SMTP id\n    <B0000178595@203.129.205.5.205.129.203.in-addr.arpa>; Mon, 13 May 2002\n    09:04:46 +0530\nMessage-Id: <B0000178595@203.129.205.5.205.129.203.in-addr.arpa>\nFrom: lmrn@mailexcite.com\nTo: ranmoore@cybertime.net\nSubject: Real Protection, Stun Guns!  Free Shipping! Time:2:01:35 PM\nDate: Mon, 28 Jul 1980 14:01:35\nMIME-Version: 1.0\nX-Keywords: \nContent-Type: text/html; charset=\"DEFAULT\"\n\n<html>\n<body>\n<center>\n<h3>\n<font color=\"blue\">\n<b>\nThe Need For Safety Is Real In 2002, You Might Only Get One Chance - Be Ready!\n<p>\nFree Shipping & Handling Within The (USA) If You Order Before May 25, 2002! \n<p>\n3 Day Super Sale, Now Until May 7, 2002!  Save Up To $30.00 On Some Items!\n\n</b>\n</font>\n</h3>\n</center>\n<p>\nIT'S GETTING TO BE SPRING AGAIN, PROTECT YOURSELF AS YOU WALK,<br>\nJOG AND EXERCISE OUTSIDE.  ALSO PROTECT YOUR LOVED ONES AS<br>\nTHEY RETURN HOME FROM COLLEGE!<br>\n<p>\n*     LEGAL PROTECTION FOR COLLEGE STUDENTS!<br>\n*     GREAT UP'COMING OUTDOOR PROTECTION GIFTS!<br>\n*     THERE IS NOTHING WORTH MORE PROTECTING THAN LIFE!<br>\n*     OUR STUN DEVICES & PEPPER PRODUCTS ARE LEGAL PROTECTION!\n<p>\n<b>\n<font color=\"red\">\nJOIN THE WAR ON CRIME!\n</b>\n</font>\n<p>\n\nSTUN GUNS AND BATONS \n<p>\nEFFECTIVE - SAFE - NONLETHAL\n<p>\nPROTECT YOUR LOVED ONES AND YOURSELF\n<p>\nNo matter who you are, no matter what City or Town you live in,<br>\nif you live in America, you will be touched by crime.\n<p>\nYou hear about it on TV.  You read about it in the newspaper.<br>\nIt's no secret that crime is a major problem in the U.S. today.<br>\nCriminals are finding it easier to commit crimes all the time.\n<p>\nWeapons are readily available.  Our cities' police forces have<br>\nmore work than they can handle.  Even if these criminal are<br>\ncaught, they won't be spending long in our nation's overcrowded<br>\njails.  And while lawmakers are well aware of the crime problem,<br>\nthey don't seem to have any effective answers.\n<p>\nOur Email Address:  <a\nhref=\"mailto:Merchants4all@aol.com\">Merchants4all@aol.com</a>\n<p>\nINTERESTED:\n<p>\nYou will be protecting yourself within 7 days!  Don't Wait,<br>\nvisit our web page below, and join The War On Crime!\n<p>\n*****************<br>\n<a\nhref=\"http://www.geocities.com/realprotection_20022003/\">http://www.geocities.com/realprotection_20022003/</a><br>\n*****************\n<p>\nWell, there is an effective answer.  Take responsibility for<br>\nyour own security.  Our site has a variety of quality personal<br>\nsecurity products.  Visit our site, choose the personal security<br>\nproducts that are right for you.  Use them, and join the war on\ncrime!\n<p>\nFREE PEPPER SPRAY WITH ANY STUN UNIT PURCHASE.<br>\n(A Value of $15.95)\n<p>\nWe Ship Orders Within 5 To 7 Days, To Every State In The U.S.A.<br>\nby UPS, FEDEX, or U.S. POSTAL SERVICE.  Visa, MasterCard, American<br>\nExpress & Debt Card Gladly Accepted.\n<p>\nAsk yourself this question, if you don't help your loved ones,\nwho will?\n<p>\nINTERESTED:\n<p>\n*****************<br>\n<a\nhref=\"http://www.geocities.com/realprotection_20022003/\">http://www.geocities.com/realprotection_20022003/</a><br>\n*****************\n<p>\n___The Stun Monster 625,000 Volts ($86.95)<br>\n___The Z-Force Slim Style 300,000 Volts ($64.95)<br>\n___The StunMaster 300,000 Volts Straight ($59.95)<br>\n___The StunMaster 300,000 Volts Curb ($59.95)<br>\n___The StunMaster 200,000 Volts Straight ($49.95)<br>\n___The StunMaster 200,000 Volts Curb ($49.95)<br>\n___The StunBaton 500,000 Volts ($89.95)<br>\n___The StunBaton 300,000 Volts ($79.95)<br>\n___Pen Knife (One $12.50, Two Or More $9.00)<br>\n___Wildfire Pepper Spray  (One $15.95, Two Or More $11.75)\n<p>\n___Add $5.75 For Shipping & Handling Charge.\n<p>\n\nTo Order by postal mail, please send to the below address.<br>\nMake payable to Mega Safety Technology.\n<p>\nMega Safety Technology<br>\n3215 Merrimac Ave.<br>\nDayton, Ohio  45405<br>\nOur Email Address:  <a\nhref=\"mailto:Merchants4all@aol.com\">Merchants4all@aol.com</a>\n<p>\nOrder by 24 Hour Fax!!!  775-257-6657.\n<p>\n*****<br>\n<b><font color=\"red\">Important Credit Card Information! Please Read Below!</b></font>\n <br><br>\n*     Credit Card Address, City, State and Zip Code, must match\n      billing address to be processed. \n<br><br>\n\nCHECK____  MONEYORDER____  VISA____ MASTERCARD____ AmericanExpress___\nDebt Card___\n<br><br>\nName_______________________________________________________<br>\n(As it appears on Check or Credit Card)\n<br><br>\nAddress____________________________________________________<br>\n(As it appears on Check or Credit Card)\n<br><br>\n___________________________________________________<br>\nCity,State,Zip(As it appears on Check or Credit Card)\n<br><br>\n___________________________________________________<br>\nCountry\n<br><br>\n___________________________________________________<br>\n(Credit Card Number)\n<br><br>\nExpiration Month_____  Year_____\n<br><br>\n___________________________________________________<br>\nAuthorized Signature\n<br><br>\n<b>\n*****IMPORTANT NOTE*****\n</b>\n<br><br>\nIf Shipping Address Is Different From The Billing Address Above,\nPlease Fill Out Information Below.\n<br><br>\nShipping Name______________________________________________\n<br><br>\nShipping Address___________________________________________\n<br><br>\n___________________________________________________________<br>\nShipping City,State,Zip\n<br><br>\n___________________________________________________________<br>\nCountry\n<br><br>\n___________________________________________________________<br>\nEmail Address & Phone Number(Please Write Neat)\n</body>\n</html>"
##    Length     Class      Mode 
##     22103 character character

visualizing the Length of Different Senders’ Emails

## # A tibble: 6 x 3
##    mail spam.emails                                          len
##   <int> <chr>                                              <int>
## 1     1 lmrn@mailexcite.com                                   19
## 2     2 merchantsworld2001@juno.com                           27
## 3     3 jm@jmason.org                                         13
## 4     4 jm@netnoteinc.com                                     17
## 5     5 B0000178595@203.129.205.5.205.129.203.in-addr.arpa    50
## 6     6 B0000178595@203.129.205.5.205.129.203.in-addr.arpa    50

## Selecting by n

Spam Ham classification using Naivebayes Classifier

We create an object/model that can loop through any list of documents and create a corpus for each. This way we avoid duplicating this code for each and every set of documents that we need to loop through.

Create a corpus for each of the two email classification using the object model above

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 3048
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 1945
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 2234
## 
## [[4]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 1914
## 
## [[5]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 1902
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 6
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 2334
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 2926
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 3602
## 
## [[4]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 3675
## 
## [[5]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 2183
## steveburtcursorsystemcom thu aug
## returnpath steveburtcursorsystemcom
## deliveredto zzzzlocalhostnetnoteinccom
## receiv localhost localhost
## phoboslabsnetnoteinccom postfix esmtp id beec
## zzzzlocalhost thu aug edt
## receiv phobo
## localhost imap fetchmail
## zzzzlocalhost singledrop thu aug ist
## receiv ngrpscdyahoocom ngrpscdyahoocom
## dogmaslashnullorg smtp id
## gmbktz zzzzspamassassintaintorg thu aug
## xegroupsreturn senttozzzzspamassassintaintorgreturnsgroupsyahoocom
## receiv ngrpscdyahoocom nnfmp
## aug
## xsender steveburtcursorsystemcom
## xapparentlyto zzzzteanayahoogroupscom
## receiv egp mail aug
## receiv qmail invok network aug
## receiv unknown mgrpscdyahoocom qmqp
## aug
## receiv unknown helo mailgatewaycursorsystemcom
## mtagrpscdyahoocom smtp aug
## receiv exchangecpsloc unverifi
## mailgatewaycursorsystemcom content technolog smtprs
## esmtp id tcdefacddmailgatewaycursorsystemcom
## forteanayahoogroupscom thu aug
## receiv exchangecpsloc internet mail servic
## id pxxat thu aug
## messageid ecadddfbbdaddefbfexchangecpsloc
## zzzzteanayahoogroupscom zzzzteanayahoogroupscom
## xmailer internet mail servic
## xegroupsfrom steve burt steveburtcursorsystemcom
## steve burt steveburtcursorsystemcom
## xyahooprofil pyrus
## mimevers
## mailinglist list zzzzteanayahoogroupscom contact
## forteanaowneryahoogroupscom
## deliveredto mail list zzzzteanayahoogroupscom
## preced bulk
## listunsubscrib mailtozzzzteanaunsubscribeyahoogroupscom
## date thu aug
## subject zzzzteana re alexand
## replyto zzzzteanayahoogroupscom
## contenttyp textplain charsetusascii
## contenttransferencod bit
## 
## martin post
## tasso papadopoulo greek sculptor behind plan judg
## limeston mount kerdylio mile east salonika far
## mount atho monast communiti ideal patriot sculptur
## 
## well alexand granit featur ft high ft wide
## museum restor amphitheatr car park admir crowd
## plan
## 
## mountain limeston granit
## limeston itll weather pretti fast
## 
## yahoo group sponsor
## dvds free sp join now
## httpusclickyahoocomptybbnxieaamghaagsolbtm
## 
## 
## unsubscrib group send email
## forteanaunsubscribeegroupscom
## 
## 
## 
## use yahoo group subject httpdocsyahoocominfoterm
## stewartsmitheeedacuk thu aug
## returnpath stewartsmitheeedacuk
## deliveredto zzzzlocalhostnetnoteinccom
## receiv localhost localhost
## phoboslabsnetnoteinccom postfix esmtp id efc
## zzzzlocalhost thu aug edt
## receiv phobo
## localhost imap fetchmail
## zzzzlocalhost singledrop thu aug ist
## receiv ngrpscdyahoocom ngrpscdyahoocom
## dogmaslashnullorg smtp id
## gmemz zzzzspamassassintaintorg thu aug
## xegroupsreturn senttozzzzspamassassintaintorgreturnsgroupsyahoocom
## receiv ngrpscdyahoocom nnfmp
## aug
## xsender stewartsmitheeedacuk
## xapparentlyto zzzzteanayahoogroupscom
## receiv egp mail aug
## receiv qmail invok network aug
## receiv unknown mgrpscdyahoocom qmqp
## aug
## receiv unknown helo postboxeeedacuk
## mtagrpscdyahoocom smtp aug
## receiv eeedacuk sxsdunblan
## postboxeeedacuk esmtp id gmeli
## forteanayahoogroupscom thu aug bst
## messageid deebeeedacuk
## organ scottish microelectron centr
## userag mozilla x u suno sunu enus rvb gecko
## xacceptlanguag en enus
## zzzzteanayahoogroupscom
## refer dfacalocalhost
## stewart smith stewartsmitheeedacuk
## xyahooprofil stochasticus
## mimevers
## mailinglist list zzzzteanayahoogroupscom contact
## forteanaowneryahoogroupscom
## deliveredto mail list zzzzteanayahoogroupscom
## preced bulk
## listunsubscrib mailtozzzzteanaunsubscribeyahoogroupscom
## date thu aug
## subject re zzzzteana noth like mama use make
## replyto zzzzteanayahoogroupscom
## contenttyp textplain charsetusascii
## contenttransferencod bit
## 
## martin adamson wrote
## 
## isnt just basic mixtur beaten egg bacon pancetta
## realli mix raw egg cook pasta heat pasta
## cook egg that understand
## 
## 
## your probabl right mine just cream ad
## egg guess tri without actual look internet
## recip found one possibl one scariest peopl ive ever seen
## hes us congressman
## httpwwwvirtualcitiescomonsmegovmegvjbhtm
## 
## that one worst nonsmil ever
## 
## stew
## ps apolog list main resid vote man wont
## youv seen pic
## 
## 
## stewart smith
## scottish microelectron centr univers edinburgh
## httpwwweeedacuksx
## 
## 
## yahoo group sponsor
## dvds free sp join now
## httpusclickyahoocomptybbnxieaamghaagsolbtm
## 
## 
## unsubscrib group send email
## forteanaunsubscribeegroupscom
## 
## 
## 
## use yahoo group subject httpdocsyahoocominfoterm

Create wordcloud for Ham and Spam corpus before cleanup using bing lexicon

##                        word  freq
## "",                     "", 32815
## character(0), character(0), 15006
## "receiv             "receiv 14044
## esmtp                 esmtp  6330
## mon                     mon  5873
## sep",                 sep",  5215
## ist",                 ist",  4217
## "jmlocalhost   "jmlocalhost  4144
## sep                     sep  4125
## localhost         localhost  3736
## Classes 'tbl_df', 'tbl' and 'data.frame':    53955 obs. of  3 variables:
##  $ term    : chr  "\"\")," "\"\"," "\"\\023c\\024" "\"aa" ...
##  $ document: chr  "1" "1" "1" "1" ...
##  $ count   : num  2440 32815 1 10 1 ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    908 obs. of  4 variables:
##  $ term     : chr  "abolish" "abort" "abound" "absurd" ...
##  $ document : chr  "1" "1" "1" "1" ...
##  $ count    : num  2 6 3 16 12 11 2 2 3 11 ...
##  $ sentiment: chr  "negative" "negative" "positive" "negative" ...
##                        word  freq
## "",                     "", 28837
## character(0), character(0),  8382
## "tr",                 "tr",  6841
## "receiv             "receiv  6116
## "td                     "td  5496
## mon                     mon  3230
## size                   size  3049
## "br",                 "br",  2833
## esmtp                 esmtp  2605
## jul",                 jul",  2572
## Classes 'tbl_df', 'tbl' and 'data.frame':    53955 obs. of  3 variables:
##  $ term    : chr  "\"\")," "\"\"," "\"\\023c\\024" "\"aa" ...
##  $ document: chr  "1" "1" "1" "1" ...
##  $ count   : num  2440 32815 1 10 1 ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    908 obs. of  4 variables:
##  $ term     : chr  "abolish" "abort" "abound" "absurd" ...
##  $ document : chr  "1" "1" "1" "1" ...
##  $ count    : num  2 6 3 16 12 11 2 2 3 11 ...
##  $ sentiment: chr  "negative" "negative" "positive" "negative" ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    77273 obs. of  3 variables:
##  $ term    : chr  "\"\")," "\"\"," "\"aa" "\"aa\"," ...
##  $ document: chr  "1" "1" "1" "1" ...
##  $ count   : num  1186 28837 5 4 2 ...
## Classes 'tbl_df', 'tbl' and 'data.frame':    541 obs. of  4 variables:
##  $ term     : chr  "abort" "abscond" "acclaim" "accomplish" ...
##  $ document : chr  "1" "1" "1" "1" ...
##  $ count    : num  1 1 3 3 1 2 7 27 4 1 ...
##  $ sentiment: chr  "negative" "negative" "positive" "positive" ...

Combine the two cleaned up corpus data into a single data frame

##                                                              text type
## 1                               exmhworkersadminredhatcom thu aug  ham
## 2                 returnpath exmhworkersadminspamassassintaintorg  ham
## 3                          deliveredto zzzzlocalhostnetnoteinccom  ham
## 4                                      receiv localhost localhost  ham
## 5                    phoboslabsnetnoteinccom postfix esmtp id dec  ham
## 6                                       zzzzlocalhost thu aug edt  ham
## 7                                                    receiv phobo  ham
## 8                                        localhost imap fetchmail  ham
## 9                            zzzzlocalhost singledrop thu aug ist  ham
## 10 receiv listmanspamassassintaintorg listmanspamassassintaintorg  ham
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 3048
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 1945
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 2234
## 
## [[4]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 1914
## 
## [[5]]
## <<PlainTextDocument>>
## Metadata:  8
## Content:  chars: 1902

Partition data into training set and test set in ratio of 70:30

##                                              text type
## 1               exmhworkersadminredhatcom thu aug  ham
## 2 returnpath exmhworkersadminspamassassintaintorg  ham
## 4                      receiv localhost localhost  ham
## 6                       zzzzlocalhost thu aug edt  ham
## 7                                    receiv phobo  ham
## 9            zzzzlocalhost singledrop thu aug ist  ham
##                                                                  text type
## 3                              deliveredto zzzzlocalhostnetnoteinccom  ham
## 5                        phoboslabsnetnoteinccom postfix esmtp id dec  ham
## 8                                            localhost imap fetchmail  ham
## 17 receiv intmxcorpspamassassintaintorg intmxcorpspamassassintaintorg  ham
## 20                                                                edt  ham
## 22                       id gmbyg exmhworkerslistmanredhatcom thu aug  ham
## 25                                   intmxcorpredhatcom smtp id gmbyy  ham
## 29                                                            thu aug  ham
## 33                         receiv munnariozau localhost deltacsmuozau  ham
## 35                                                                ict  ham

Create a Document Term Matrix

## Warning in tm_map.SimpleCorpus(trainCorpus, removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(testCorpus, removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(train_clean_corpus, removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(test_clean_corpus, removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(train_clean_corpus, removeWords,
## stopwords()): transformation drops documents
## Warning in tm_map.SimpleCorpus(test_clean_corpus, removeWords,
## stopwords()): transformation drops documents
## Warning in tm_map.SimpleCorpus(train_clean_corpus, stripWhitespace):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(test_clean_corpus, stripWhitespace):
## transformation drops documents

Create Term Document Matrix and Plot wordcloud and sentiment

##            word freq
## "nan",   "nan",  197
## "",         "",  100
## "receiv "receiv   90
## "brbr", "brbr",   63
## "br",     "br",   59
## aug",     aug",   54
## thu         thu   46
## esmtp     esmtp   26
## br",       br",   24
## mail       mail   21
## Classes 'tbl_df', 'tbl' and 'data.frame':    1622 obs. of  3 variables:
##  $ term    : chr  "\"\"," "\"abandon" "\"absorb" "\"act" ...
##  $ document: chr  "1" "1" "1" "1" ...
##  $ count   : num  100 1 1 2 3 1 1 1 3 2 ...
## # A tibble: 100 x 3
##    term             document count
##    <chr>            <chr>    <dbl>
##  1 "\"\","          1          100
##  2 "\"abandon"      1            1
##  3 "\"absorb"       1            1
##  4 "\"act"          1            2
##  5 "\"addressbr\"," 1            3
##  6 "\"age"          1            1
##  7 "\"agenc"        1            1
##  8 "\"agre"         1            1
##  9 "\"aid"          1            3
## 10 "\"alreadi"      1            2
## # ... with 90 more rows
## # A tibble: 72 x 4
##    term    document count sentiment
##    <chr>   <chr>    <dbl> <chr>    
##  1 bad     1            4 negative 
##  2 bent    1            1 negative 
##  3 betray  1            1 negative 
##  4 better  1            1 positive 
##  5 blow    1            1 negative 
##  6 bonus   1            4 positive 
##  7 boost   1            2 positive 
##  8 burn    1            1 negative 
##  9 cold    1            1 negative 
## 10 corrupt 1            1 negative 
## # ... with 62 more rows
## Classes 'tbl_df', 'tbl' and 'data.frame':    72 obs. of  4 variables:
##  $ term     : chr  "bad" "bent" "betray" "better" ...
##  $ document : chr  "1" "1" "1" "1" ...
##  $ count    : num  4 1 1 1 1 4 2 1 1 1 ...
##  $ sentiment: chr  "negative" "negative" "negative" "positive" ...
## Selecting by n

##            word freq
## "nan",   "nan",   97
## "receiv "receiv   35
## "br",     "br",   29
## thu         thu   28
## aug",     aug",   27
## "",         "",   26
## "brbr", "brbr",   24
## esmtp     esmtp   13
## aug         aug   11
## "p",       "p",   10
## Classes 'tbl_df', 'tbl' and 'data.frame':    926 obs. of  3 variables:
##  $ term    : chr  "\"\"," "\"aabaabhaceadbdc\"," "\"abl" "\"absorb" ...
##  $ document: chr  "1" "1" "1" "1" ...
##  $ count   : num  26 1 1 1 1 1 1 1 1 1 ...
## # A tibble: 100 x 3
##    term                   document count
##    <chr>                  <chr>    <dbl>
##  1 "\"\","                1           26
##  2 "\"aabaabhaceadbdc\"," 1            1
##  3 "\"abl"                1            1
##  4 "\"absorb"             1            1
##  5 "\"ad"                 1            1
##  6 "\"add"                1            1
##  7 "\"addressbr\","       1            1
##  8 "\"agre"               1            1
##  9 "\"aid"                1            1
## 10 "\"altern"             1            1
## # ... with 90 more rows
## # A tibble: 28 x 4
##    term   document count sentiment
##    <chr>  <chr>    <dbl> <chr>    
##  1 attack 1            3 negative 
##  2 boost  1            2 positive 
##  3 crime  1            2 negative 
##  4 debt   1            1 negative 
##  5 easier 1            1 positive 
##  6 enjoy  1            1 positive 
##  7 fail   1            1 negative 
##  8 famous 1            1 positive 
##  9 fat    1            7 negative 
## 10 free   1            5 positive 
## # ... with 18 more rows
## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  4 variables:
##  $ term     : chr  "attack" "boost" "crime" "debt" ...
##  $ document : chr  "1" "1" "1" "1" ...
##  $ count    : num  3 2 2 1 1 1 1 1 7 5 ...
##  $ sentiment: chr  "negative" "positive" "negative" "negative" ...

Train the model and predict the outcome

##  chr [1:1400, 1:1282] "1" "0" "0" "1" "0" "1" "0" "0" "1" "0" "1" "0" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:1400] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:1282] "aug" "exmhworkersadminredhatcom" "thu" "exmhworkersadminspamassassintaintorg" ...
##  chr [1:600, 1:765] "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ Docs : chr [1:600] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:765] "deliveredto" "zzzzlocalhostnetnoteinccom" "dec" "esmtp" ...

Use NaiveBayes Model to train and test/predict the test data set

## factor(corpus_train$type)
##  ham spam 
##  700  700
## $aug
##                          aug
## factor(corpus_train$type)          0          1
##                      ham  0.88857143 0.11142857
##                      spam 0.98857143 0.01142857
## 
## $exmhworkersadminredhatcom
##                          exmhworkersadminredhatcom
## factor(corpus_train$type)           0           1
##                      ham  0.998571429 0.001428571
##                      spam 1.000000000 0.000000000
## 
## $thu
##                          thu
## factor(corpus_train$type)           0           1
##                      ham  0.931428571 0.068571429
##                      spam 0.997142857 0.002857143
## 
## $exmhworkersadminspamassassintaintorg
##                          exmhworkersadminspamassassintaintorg
## factor(corpus_train$type)           0           1
##                      ham  0.997142857 0.002857143
##                      spam 1.000000000 0.000000000
## 
## $returnpath
##                          returnpath
## factor(corpus_train$type)           0           1
##                      ham  0.992857143 0.007142857
##                      spam 0.994285714 0.005714286
## 
## $localhost
##                          localhost
## factor(corpus_train$type)           0           1
##                      ham  0.977142857 0.022857143
##                      spam 0.994285714 0.005714286
## 
## $receiv
##                          receiv
## factor(corpus_train$type)          0          1
##                      ham  0.90285714 0.09714286
##                      spam 0.96571429 0.03428571
## 
## $edt
##                          edt
## factor(corpus_train$type)           0           1
##                      ham  0.984285714 0.015714286
##                      spam 0.998571429 0.001428571
## 
## $zzzzlocalhost
##                          zzzzlocalhost
## factor(corpus_train$type)          0          1
##                      ham  0.97857143 0.02142857
##                      spam 1.00000000 0.00000000
## 
## $phobo
##                          phobo
## factor(corpus_train$type)          0          1
##                      ham  0.98571429 0.01428571
##                      spam 1.00000000 0.00000000
## 
## $ist
##                          ist
## factor(corpus_train$type)           0           1
##                      ham  0.991428571 0.008571429
##                      spam 0.997142857 0.002857143
## 
## $singledrop
##                          singledrop
## factor(corpus_train$type)           0           1
##                      ham  0.991428571 0.008571429
##                      spam 0.997142857 0.002857143
## 
## $listmanspamassassintaintorg
##                          listmanspamassassintaintorg
## factor(corpus_train$type)           0           1
##                      ham  0.997142857 0.002857143
##                      spam 1.000000000 0.000000000
## 
## $dogmaslashnullorg
##                          dogmaslashnullorg
## factor(corpus_train$type)           0           1
##                      ham  0.985714286 0.014285714
##                      spam 0.997142857 0.002857143
## 
## $esmtp
##                          esmtp
## factor(corpus_train$type)          0          1
##                      ham  0.97142857 0.02857143
##                      spam 0.98714286 0.01285714
## [1] "ham"  "spam"
## naiveBayes.default(x = train, y = factor(corpus_train$type))

Output in the form of a confusion matrix

##       
## pred   ham spam
##   ham  199   37
##   spam 101  263
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ham spam
##       ham  199   37
##       spam 101  263
##                                           
##                Accuracy : 0.77            
##                  95% CI : (0.7342, 0.8031)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.54            
##                                           
##  Mcnemar's Test P-Value : 8.189e-08       
##                                           
##             Sensitivity : 0.6633          
##             Specificity : 0.8767          
##          Pos Pred Value : 0.8432          
##          Neg Pred Value : 0.7225          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3317          
##    Detection Prevalence : 0.3933          
##       Balanced Accuracy : 0.7700          
##                                           
##        'Positive' Class : ham             
##