Assignment

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

Solution

Libraries

library(RCurl)
library(XML)
library(stringr)
library(tm)

Spam files

Take a look at the files

length(list.files("spam_2"))
## [1] 1397
list.files("spam_2")[1:3]
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00002.9438920e9a55591b18e60d1ed37d992b"
## [3] "00003.590eff932f8704d8b0fcbe69d023b54d"

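Tried to rename the spam files. This did not work, most likely because list.files() is called without the spam_2 path, so it searches the working directory, and because 1:1396 supplies one name fewer than the 1397 files.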

file.rename(list.files(pattern="0*."), paste0("", 1:1396))
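
One way the rename could be made to work is sketched below. It assumes a throwaway directory created with tempfile() and three dummy files, and the new names (spam_1.txt and so on) are purely illustrative. Nothing here touches the real spam_2 folder, whose original file names the rest of this document relies on.

```r
# Throwaway demo directory with three dummy files (hypothetical names,
# not the real spam_2 folder).
demo_dir <- tempfile("rename_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("00001.aaa", "00002.bbb", "00003.ccc")))

# full.names = TRUE returns paths that include the directory, so the
# rename does not depend on the current working directory.
from <- list.files(demo_dir, full.names = TRUE)

# Build exactly one new name per file so 'from' and 'to' have equal length.
to <- file.path(demo_dir, paste0("spam_", seq_along(from), ".txt"))

renamed <- file.rename(from, to)
new_names <- list.files(demo_dir)
```

file.rename() returns one logical per file, so all(renamed) is a quick check that every rename succeeded.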

Look at the file metadata and contents of one spam email

file.info("spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176")
##                                               size isdir mode
## spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176 4721 FALSE  644
##                                                             mtime
## spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176 2003-02-28 05:58:07
##                                                             ctime
## spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176 2017-11-04 19:43:02
##                                                             atime uid gid
## spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176 2017-11-07 08:15:14 501  20
##                                                       uname grname
## spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176 emiliembolduc  staff
spam1 <- readLines("spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176")
spam1 <- str_c(spam1, collapse = "")
head(spam1)
## [1] "From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002Return-Path: <ilug-admin@linux.ie>Delivered-To: yyyy@localhost.netnoteinc.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD\tfor <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT)Received: from phobos [127.0.0.1]\tby localhost with IMAP (fetchmail-5.9.0)\tfor jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)Received: from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for    <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100Received: from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org    (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100Received: from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net    [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for    <ilug@linux.ie>; Fri, 2 Aug 2002 22:50:11 +0100X-Authentication-Warning: lugh.tuatha.org: Host w142.z064000057.nyc-ny.dsl.cnc.net    [64.0.57.142] claimed to be bettyjagessar.comReceived: from 64.0.57.142 [202.63.165.34] by bettyjagessar.com    (SMTPD32-7.06 EVAL) id A42A7FC01F2; Fri, 02 Aug 2002 02:18:18 -0400Message-Id: <1028311679.886@0.57.142>Date: Fri, 02 Aug 2002 23:37:59 0530To: ilug@linux.ieFrom: \"Start Now\" <startnow2002@hotmail.com>MIME-Version: 1.0Content-Type: text/plain; charset=\"US-ASCII\"; format=flowedSubject: [ILUG] STOP THE MLM INSANITYSender: ilug-admin@linux.ieErrors-To: ilug-admin@linux.ieX-Mailman-Version: 1.1Precedence: bulkList-Id: Irish Linux Users' Group <ilug.linux.ie>X-Beenthere: ilug@linux.ieGreetings!You are receiving this letter because you have expressed an interest in receiving information about online business opportunities. If this is erroneous then please accept my most sincere apology. 
This is a one-time mailing, so no removal is necessary.If you've been burned, betrayed, and back-stabbed by multi-level marketing, MLM, then please read this letter. It could be the most important one that has ever landed in your Inbox.MULTI-LEVEL MARKETING IS A HUGE MISTAKE FOR MOST PEOPLEMLM has failed to deliver on its promises for the past 50 years. The pursuit of the \"MLM Dream\" has cost hundreds of thousands of people their friends, their fortunes and their sacred honor. The fact is that MLM is fatally flawed, meaning that it CANNOT work for most people.The companies and the few who earn the big money in MLM are NOT going to tell you the real story. FINALLY, there is someone who has the courage to cut through the hype and lies and tell the TRUTH about MLM.HERE'S GOOD NEWSThere IS an alternative to MLM that WORKS, and works BIG! If you haven't yet abandoned your dreams, then you need to see this. Earning the kind of income you've dreamed about is easier than you think!With your permission, I'd like to send you a brief letter that will tell you WHY MLM doesn't work for most people and will then introduce you to something so new and refreshing that you'll wonder why you haven't heard of this before.I promise that there will be NO unwanted follow up, NO sales pitch, no one will call you, and your email address will only be used to send you the information. Period.To receive this free, life-changing information, simply click Reply, type \"Send Info\" in the Subject box and hit Send. I'll get the information to you within 24 hours. Just look for the words MLM WALL OF SHAME in your Inbox.Cordially,SiddhiP.S. Someone recently sent the letter to me and it has been the most eye-opening, financially beneficial information I have ever received. I honestly believe that you will feel the same way once you've read it. And it's FREE!------------------------------------------------------------This email is NEVER sent unsolicited.  THIS IS NOT \"SPAM\". 
You are receiving this email because you EXPLICITLY signed yourself up to our list with our online signup form or through use of our FFA Links Page and E-MailDOM systems, which have EXPLICIT terms of use which state that through its use you agree to receive our emailings.  You may also be a member of a Altra Computer Systems list or one of many numerous FREE Marketing Services and as such you agreed when you signed up for such list that you would also be receiving this emailing.Due to the above, this email message cannot be considered unsolicitated, or spam.------------------------------------------------------------- Irish Linux Users' Group: ilug@linux.iehttp://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.List maintainer: listmaster@linux.ie"
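
Collapsing with an empty string is why line ends run together in the output above (for example “2002Return-Path”). A small sketch of the difference, using base paste() in place of str_c() and two made-up lines:

```r
# Two made-up lines standing in for consecutive lines of an email.
email_lines <- c("Tue Aug  6 11:51:02 2002", "Return-Path: <someone@example.com>")

# collapse = "" glues the last word of one line to the first word of the next.
fused <- paste(email_lines, collapse = "")

# collapse = " " keeps a word boundary between lines.
spaced <- paste(email_lines, collapse = " ")

# Count whitespace-separated tokens in each version.
n_fused <- length(strsplit(fused, "\\s+")[[1]])
n_spaced <- length(strsplit(spaced, "\\s+")[[1]])
```

The fused version has one token fewer, because “2002” and “Return-Path:” have merged into a single token that no stop word list or stemmer will recognize.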

Create a corpus for one spam file

spam1_corpus <- Corpus(VectorSource(spam1))
spam1_corpus[[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 4613
meta(spam1_corpus[[1]])
##   author       : character(0)
##   datetimestamp: 2017-11-07 13:37:18
##   description  : character(0)
##   heading      : character(0)
##   id           : 1
##   language     : en
##   origin       : character(0)

Combine all files into one big list

file.list <- list.files("spam_2")
head(file.list)
## [1] "00001.317e78fa8ee2f54cd4890fdc09ba8176"
## [2] "00002.9438920e9a55591b18e60d1ed37d992b"
## [3] "00003.590eff932f8704d8b0fcbe69d023b54d"
## [4] "00004.bdcc075fa4beb5157b5dd6cd41d8887b"
## [5] "00005.ed0aba4d386c5e62bc737cf3f0ed9589"
## [6] "00006.3ca1f399ccda5d897fecb8c57669a283"
length(file.list)
## [1] 1397
setwd("/Users/emiliembolduc/Week 10 - Text Mining/Project 4/spam_2")
spam.list <- sapply(file.list, readLines)
class(spam.list)
## [1] "list"
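
Each element of spam.list is a character vector with one entry per line of the email. Coercing such a list to plain character, which a corpus constructor may do internally, deparses each element into text of the form c("...", "..."), with tabs written out as a backslash followed by t. That may be where terms such as “tfor” and “tbi” in the tables below come from. A sketch that instead produces one string per file, assuming two dummy files in a temporary directory:

```r
# Two dummy email files in a throwaway directory (hypothetical content).
demo_dir <- tempfile("mail_demo")
dir.create(demo_dir)
writeLines(c("Received: from localhost", "\tby example.org"),
           file.path(demo_dir, "00001.aaa"))
writeLines(c("Subject: hi", "See you soon"),
           file.path(demo_dir, "00002.bbb"))

paths <- list.files(demo_dir, full.names = TRUE)

# A list of character vectors, similar to what sapply(file.list, readLines) returns.
as_lines <- lapply(paths, readLines)

# Coercing the list to character deparses the multi-line elements.
deparsed <- as.character(as_lines)

# One string per file: join the lines with a newline instead.
as_text <- vapply(paths, function(p) paste(readLines(p), collapse = "\n"),
                  character(1))
names(as_text) <- basename(paths)
```

Using full paths from list.files(..., full.names = TRUE) also removes the need for setwd().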

Create corpus for all Spam and prepare data

Convert to lower case, remove numbers, stop words, and punctuation characters, reduce terms to their stems, and collapse extra whitespace

SpamAll_corpus <- Corpus(VectorSource(spam.list)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removePunctuation) %>% 
  tm_map(stemDocument) %>% 
  tm_map(stripWhitespace)
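
To see what each step contributes, the same transformations can be applied one at a time to a toy corpus. This is a sketch on two made-up sentences. It uses VCorpus and plain assignments instead of the pipe, the helper show_text() is a hypothetical name, and stemDocument needs the SnowballC package to be installed.

```r
library(tm) # stemDocument() additionally needs SnowballC installed

# Two made-up sentences standing in for emails.
toy <- c("You have RECEIVED 3 free offers!",
         "Receiving the list of 2002 updates.")
toy_corpus <- VCorpus(VectorSource(toy))

# Hypothetical helper that returns the text of every document.
show_text <- function(corpus) {
  vapply(corpus, function(d) paste(as.character(d), collapse = " "),
         character(1))
}

toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
after_lower <- show_text(toy_corpus)

toy_corpus <- tm_map(toy_corpus, removeNumbers)
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))
toy_corpus <- tm_map(toy_corpus, removePunctuation)
after_punct <- show_text(toy_corpus)

toy_corpus <- tm_map(toy_corpus, stemDocument)
toy_corpus <- tm_map(toy_corpus, stripWhitespace)
cleaned <- show_text(toy_corpus)
print(cleaned)
```

Comparing after_lower, after_punct, and cleaned shows which step removed which part of the text.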

Create a Term Document Matrix for all Spam

Spam_tdm <- TermDocumentMatrix(SpamAll_corpus)
Spam_tdm
## <<TermDocumentMatrix (terms: 58946, documents: 1397)>>
## Non-/sparse entries: 273699/82073863
## Sparsity           : 100%
## Maximal term length: 868
## Weighting          : term frequency (tf)
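
The figures in this summary can be checked by hand. The matrix has 58946 × 1397 cells, of which 273699 are non-zero, so the reported 100% sparsity is a rounded value. The arithmetic, using only the numbers printed above:

```r
# Figures copied from the Spam_tdm summary above.
n_terms <- 58946
n_docs <- 1397
non_sparse <- 273699
sparse <- 82073863

# Every cell is either non-zero or zero, so the two counts add up to the size.
total_cells <- n_terms * n_docs
stopifnot(total_cells == non_sparse + sparse)

# Share of zero cells, as a percentage.
sparsity_pct <- 100 * sparse / total_cells
round(sparsity_pct, 2)
```

Fewer than 1 cell in 100 is non-zero, which is typical when most terms occur in only a handful of documents.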

Take a peek at the data…

Spam_matrix <- as.matrix(Spam_tdm)
Spam_matrix <- sort(rowSums(Spam_matrix), decreasing = TRUE)
Spam_df <- data.frame(word = names(Spam_matrix),freq=Spam_matrix)
head(Spam_df, 50)
##                                word freq
## receiv                       receiv 7196
## size                           size 5960
## jul                             jul 4382
## font                           font 3986
## widthd                       widthd 3556
## email                         email 3226
## esmtp                         esmtp 3139
## tabl                           tabl 2994
## width                         width 2899
## will                           will 2626
## tbi                             tbi 2577
## helvetica                 helvetica 2531
## may                             may 2419
## tfor                           tfor 2406
## mon                             mon 1961
## localhost                 localhost 1899
## facedari                   facedari 1882
## subject                     subject 1811
## free                           free 1783
## can                             can 1774
## sansserif                 sansserif 1773
## div                             div 1681
## mail                           mail 1661
## contenttyp               contenttyp 1598
## color                         color 1566
## date                           date 1524
## faceari                     faceari 1518
## tue                             tue 1508
## height                       height 1463
## list                           list 1458
## jun                             jun 1451
## arial                         arial 1424
## get                             get 1399
## messageid                 messageid 1399
## wed                             wed 1380
## html                           html 1375
## thu                             thu 1336
## busi                           busi 1313
## aug                             aug 1288
## smtp                           smtp 1287
## bodi                           bodi 1269
## faceverdana             faceverdana 1264
## heightd                     heightd 1251
## borderd                     borderd 1239
## new                             new 1216
## remov                         remov 1215
## pleas                         pleas 1213
## order                         order 1212
## dogmaslashnullorg dogmaslashnullorg 1200
## colord                       colord 1197
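
as.matrix() expands the term-document matrix into a dense matrix, here 58946 × 1397 cells, most of them zero. A sketch of the same kind of frequency table computed on the sparse object with row_sums() from the slam package (which tm builds on), shown on three made-up documents:

```r
library(tm)

# Three made-up documents.
docs <- c("free offer free", "list offer", "free list list list")
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# row_sums() adds up each term's counts without building a dense matrix.
freq <- sort(slam::row_sums(toy_tdm), decreasing = TRUE)
freq_df <- data.frame(word = names(freq), freq = freq)
freq_df
```

On the real data this may save a noticeable amount of memory, since the dense matrix holds more than 82 million cells.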

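The truncated words in the table, such as “receiv” for “receive”, are not a clean-up error. They are the stems produced by stemDocument, which reduces related forms such as “receive”, “received”, and “receiving” to one term so that they are counted together.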

Add a Spam_Ham column set to 1 to mark these terms as coming from the spam emails

spam_tdm1 <- Spam_tdm
spam_tdm1$Spam_Ham <- rep(1,nrow(Spam_tdm))

And make sure it worked

Spam_matrix <- as.matrix(spam_tdm1)
Spam_matrix <- sort(rowSums(Spam_matrix), decreasing = TRUE)
Spam_df <- data.frame(Word = names(Spam_matrix), Frequency = Spam_matrix, Spam_Ham = spam_tdm1$Spam_Ham)
head(Spam_df, 30)
##                  Word Frequency Spam_Ham
## receiv         receiv      7196        1
## size             size      5960        1
## jul               jul      4382        1
## font             font      3986        1
## widthd         widthd      3556        1
## email           email      3226        1
## esmtp           esmtp      3139        1
## tabl             tabl      2994        1
## width           width      2899        1
## will             will      2626        1
## tbi               tbi      2577        1
## helvetica   helvetica      2531        1
## may               may      2419        1
## tfor             tfor      2406        1
## mon               mon      1961        1
## localhost   localhost      1899        1
## facedari     facedari      1882        1
## subject       subject      1811        1
## free             free      1783        1
## can               can      1774        1
## sansserif   sansserif      1773        1
## div               div      1681        1
## mail             mail      1661        1
## contenttyp contenttyp      1598        1
## color           color      1566        1
## date             date      1524        1
## faceari       faceari      1518        1
## tue               tue      1508        1
## height         height      1463        1
## list             list      1458        1
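
A TermDocumentMatrix is stored as a list in sparse triplet form, so assigning with $ attaches an extra list element rather than a new matrix column. A sketch on a toy matrix that checks this (the object names are illustrative):

```r
library(tm)

# Two made-up documents with three distinct terms.
docs <- c("free offer", "free list")
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

labeled_tdm <- toy_tdm
labeled_tdm$Spam_Ham <- rep(1, nrow(toy_tdm))

# The dimensions are unchanged: the label is a separate list element.
dim(toy_tdm)
dim(labeled_tdm)
names(labeled_tdm)
```

This is why the label has to be passed to data.frame() separately, as in the code above, instead of appearing in as.matrix() output.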

Ham files

Take a look at the Ham files

length(list.files("/Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham"))
## [1] 2501
list.files("/Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham")[1:3]
## [1] "00001.7c53336b37003a9286aba55d2945844c"
## [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"

Look at the file metadata and contents of one Ham email

file.info("/Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c")
##                                                                                                      size
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c 5216
##                                                                                                      isdir
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c FALSE
##                                                                                                      mode
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c  644
##                                                                                                                    mtime
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c 2003-02-28 05:53:40
##                                                                                                                    ctime
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c 2017-11-04 19:41:49
##                                                                                                                    atime
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c 2017-11-07 08:24:27
##                                                                                                      uid
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c 501
##                                                                                                      gid
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c  20
##                                                                                                              uname
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c emiliembolduc
##                                                                                                      grname
## /Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c  staff
ham1 <- readLines("/Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham/00001.7c53336b37003a9286aba55d2945844c")
ham1 <- str_c(ham1, collapse = "")
head(ham1)
## [1] "From exmh-workers-admin@redhat.com  Thu Aug 22 12:36:23 2002Return-Path: <exmh-workers-admin@spamassassin.taint.org>Delivered-To: zzzz@localhost.netnoteinc.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id D03E543C36\tfor <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)Received: from phobos [127.0.0.1]\tby localhost with IMAP (fetchmail-5.9.0)\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 12:36:16 +0100 (IST)Received: from listman.spamassassin.taint.org (listman.spamassassin.taint.org [66.187.233.211]) by    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MBYrZ04811 for    <zzzz-exmh@spamassassin.taint.org>; Thu, 22 Aug 2002 12:34:53 +0100Received: from listman.spamassassin.taint.org (localhost.localdomain [127.0.0.1]) by    listman.redhat.com (Postfix) with ESMTP id 8386540858; Thu, 22 Aug 2002    07:35:02 -0400 (EDT)Delivered-To: exmh-workers@listman.spamassassin.taint.orgReceived: from int-mx1.corp.spamassassin.taint.org (int-mx1.corp.spamassassin.taint.org    [172.16.52.254]) by listman.redhat.com (Postfix) with ESMTP id 10CF8406D7    for <exmh-workers@listman.redhat.com>; Thu, 22 Aug 2002 07:34:10 -0400    (EDT)Received: (from mail@localhost) by int-mx1.corp.spamassassin.taint.org (8.11.6/8.11.6)    id g7MBY7g11259 for exmh-workers@listman.redhat.com; Thu, 22 Aug 2002    07:34:07 -0400Received: from mx1.spamassassin.taint.org (mx1.spamassassin.taint.org [172.16.48.31]) by    int-mx1.corp.redhat.com (8.11.6/8.11.6) with SMTP id g7MBY7Y11255 for    <exmh-workers@redhat.com>; Thu, 22 Aug 2002 07:34:07 -0400Received: from ratree.psu.ac.th ([202.28.97.6]) by mx1.spamassassin.taint.org    (8.11.6/8.11.6) with SMTP id g7MBIhl25223 for <exmh-workers@redhat.com>;    Thu, 22 Aug 2002 07:18:55 -0400Received: from delta.cs.mu.OZ.AU (delta.coe.psu.ac.th [172.30.0.98]) by    ratree.psu.ac.th (8.11.6/8.11.6) with ESMTP id g7MBWel29762;    Thu, 22 Aug 2002 18:32:40 +0700 (ICT)Received: from 
munnari.OZ.AU (localhost [127.0.0.1]) by delta.cs.mu.OZ.AU    (8.11.6/8.11.6) with ESMTP id g7MBQPW13260; Thu, 22 Aug 2002 18:26:25    +0700 (ICT)From: Robert Elz <kre@munnari.OZ.AU>To: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>Cc: exmh-workers@spamassassin.taint.orgSubject: Re: New Sequences WindowIn-Reply-To: <1029945287.4797.TMDA@deepeddy.vircio.com>References: <1029945287.4797.TMDA@deepeddy.vircio.com>    <1029882468.3116.TMDA@deepeddy.vircio.com> <9627.1029933001@munnari.OZ.AU>    <1029943066.26919.TMDA@deepeddy.vircio.com>    <1029944441.398.TMDA@deepeddy.vircio.com>MIME-Version: 1.0Content-Type: text/plain; charset=us-asciiMessage-Id: <13258.1030015585@munnari.OZ.AU>X-Loop: exmh-workers@spamassassin.taint.orgSender: exmh-workers-admin@spamassassin.taint.orgErrors-To: exmh-workers-admin@spamassassin.taint.orgX-Beenthere: exmh-workers@spamassassin.taint.orgX-Mailman-Version: 2.0.1Precedence: bulkList-Help: <mailto:exmh-workers-request@spamassassin.taint.org?subject=help>List-Post: <mailto:exmh-workers@spamassassin.taint.org>List-Subscribe: <https://listman.spamassassin.taint.org/mailman/listinfo/exmh-workers>,    <mailto:exmh-workers-request@redhat.com?subject=subscribe>List-Id: Discussion list for EXMH developers <exmh-workers.spamassassin.taint.org>List-Unsubscribe: <https://listman.spamassassin.taint.org/mailman/listinfo/exmh-workers>,    <mailto:exmh-workers-request@redhat.com?subject=unsubscribe>List-Archive: <https://listman.spamassassin.taint.org/mailman/private/exmh-workers/>Date: Thu, 22 Aug 2002 18:26:25 +0700    Date:        Wed, 21 Aug 2002 10:54:46 -0500    From:        Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>    Message-ID:  <1029945287.4797.TMDA@deepeddy.vircio.com>  | I can't reproduce this error.For me it is very repeatable... 
(like every time, without fail).This is the debug log of the pick happening ...18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury18:19:04 Ftoc_PickMsgs {{1 hit}}18:19:04 Marking 1 hits18:19:04 tkerror: syntax error in expression \"int ...Note, if I run the pick command by hand ...delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace  4852-4852 -sequence mercury1 hitThat's where the \"1 hit\" comes from (obviously).  The version of nmh I'musing is ...delta$ pick -versionpick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 ICT 2002]And the relevant part of my .mh_profile ...delta$ mhparam pick-seq sel -listSince the pick command works, the sequence (actually, both of them, theone that's explicit on the command line, from the search popup, and theone that comes from .mh_profile) do get created.kreps: this is still using the version of the code form a day ago, I haven'tbeen able to reach the cvs repository today (local routing issue I think)._______________________________________________Exmh-workers mailing listExmh-workers@redhat.comhttps://listman.redhat.com/mailman/listinfo/exmh-workers"

Create a corpus for one Ham file

ham1_corpus <- Corpus(VectorSource(ham1))
ham1_corpus[[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 5103
meta(ham1_corpus[[1]])
##   author       : character(0)
##   datetimestamp: 2017-11-07 13:37:35
##   description  : character(0)
##   heading      : character(0)
##   id           : 1
##   language     : en
##   origin       : character(0)

Combine all Ham files into one big list

hamfile.list <- list.files("/Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham")
head(hamfile.list)
## [1] "00001.7c53336b37003a9286aba55d2945844c"
## [2] "00002.9c4069e25e1ef370c078db7ee85ff9ac"
## [3] "00003.860e3c3cee1b42ead714c5c874fe25f7"
## [4] "00004.864220c5b6930b209cc287c361c99af1"
## [5] "00005.bf27cdeaf0b8c4647ecd61b1d09da613"
## [6] "00006.253ea2f9a9cc36fa0b1129b04b806608"
length(hamfile.list)
## [1] 2501
setwd("/Users/emiliembolduc/Week 10 - Text Mining/Project 4/easy_ham")
ham.list <- sapply(hamfile.list, readLines)
class(ham.list)
## [1] "list"

Create corpus for all Ham files and prepare data

Convert to lower case, remove numbers, stop words, and punctuation characters, reduce terms to their stems, and collapse extra whitespace

HamAll_corpus <- Corpus(VectorSource(ham.list)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removePunctuation) %>% 
  tm_map(stemDocument) %>% 
  tm_map(stripWhitespace)
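
The spam and ham pipelines are identical, so they could share one function. A sketch, assuming a hypothetical helper named clean_corpus() and made-up input texts; it builds a VCorpus from a character vector with one string per email:

```r
library(tm) # stemDocument() additionally needs SnowballC installed

# Hypothetical helper: applies the cleaning steps used in this document.
clean_corpus <- function(texts) {
  corpus <- VCorpus(VectorSource(texts))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stemDocument)
  tm_map(corpus, stripWhitespace)
}

# The same function serves both classes.
spam_toy <- clean_corpus(c("WIN 100 free prizes!!!"))
ham_toy <- clean_corpus(c("The meeting is moved to room 3."))
```

Keeping the steps in one place means a change to the cleaning, such as dropping the stemming step, applies to spam and ham alike.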

Create a Term Document Matrix for all Ham files

Ham_tdm <- TermDocumentMatrix(HamAll_corpus)
Ham_tdm
## <<TermDocumentMatrix (terms: 37752, documents: 2501)>>
## Non-/sparse entries: 353793/94063959
## Sparsity           : 100%
## Maximal term length: 265
## Weighting          : term frequency (tf)

Take a peek at the data…

Ham_matrix <- as.matrix(Ham_tdm)
Ham_matrix <- sort(rowSums(Ham_matrix), decreasing = TRUE)
Ham_df <- data.frame(word = names(Ham_matrix), freq = Ham_matrix)
head(Ham_df, 30)
##                                word  freq
## receiv                       receiv 14230
## sep                             sep  9788
## esmtp                         esmtp  8406
## localhost                 localhost  7347
## oct                             oct  5251
## tbi                             tbi  4728
## tfor                           tfor  4723
## postfix                     postfix  4661
## aug                             aug  4476
## ist                             ist  4224
## jmlocalhost             jmlocalhost  4144
## mon                             mon  4035
## wed                             wed  3840
## thu                             thu  3837
## jalapeno                   jalapeno  3705
## deliv                         deliv  3536
## date                           date  3410
## dogmaslashnullorg dogmaslashnullorg  3048
## tue                             tue  3002
## subject                     subject  2898
## forkadminxentcom   forkadminxentcom  2743
## messageid                 messageid  2540
## use                             use  2454
## imap                           imap  2378
## fetchmail                 fetchmail  2375
## returnpath               returnpath  2369
## singledrop               singledrop  2358
## contenttyp               contenttyp  2341
## fri                             fri  2336
## list                           list  2267

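As with the spam data, the truncated words such as “receiv” are the stems produced by stemDocument, not a clean-up error. To keep whole words, the stemDocument step can be dropped from the pipeline; tm also provides stemCompletion for mapping stems back to dictionary words.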
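
The effect can be reproduced directly with wordStem() from SnowballC, the stemmer that stemDocument relies on. A small sketch using four words whose stems appear in the frequency tables:

```r
library(SnowballC)

words <- c("receive", "table", "body", "business")

# The English Snowball stemmer strips or rewrites common suffixes.
stems <- wordStem(words, language = "english")
setNames(stems, words)
```

The stems are not meant to be dictionary words; they only need to be the same for every inflected form of a word.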

Add a Spam_Ham column set to 0 to mark these terms as coming from the Ham emails

Ham_tdm1 <- Ham_tdm
Ham_tdm1$Spam_Ham <- rep(0,nrow(Ham_tdm))

And make sure it worked

Ham_matrix <- as.matrix(Ham_tdm1)
Ham_matrix <- sort(rowSums(Ham_matrix), decreasing = TRUE)
Ham_df <- data.frame(Word = names(Ham_matrix), Frequency = Ham_matrix, Spam_Ham = Ham_tdm1$Spam_Ham)
head(Ham_df, 30)
##                                Word Frequency Spam_Ham
## receiv                       receiv     14230        0
## sep                             sep      9788        0
## esmtp                         esmtp      8406        0
## localhost                 localhost      7347        0
## oct                             oct      5251        0
## tbi                             tbi      4728        0
## tfor                           tfor      4723        0
## postfix                     postfix      4661        0
## aug                             aug      4476        0
## ist                             ist      4224        0
## jmlocalhost             jmlocalhost      4144        0
## mon                             mon      4035        0
## wed                             wed      3840        0
## thu                             thu      3837        0
## jalapeno                   jalapeno      3705        0
## deliv                         deliv      3536        0
## date                           date      3410        0
## dogmaslashnullorg dogmaslashnullorg      3048        0
## tue                             tue      3002        0
## subject                     subject      2898        0
## forkadminxentcom   forkadminxentcom      2743        0
## messageid                 messageid      2540        0
## use                             use      2454        0
## imap                           imap      2378        0
## fetchmail                 fetchmail      2375        0
## returnpath               returnpath      2369        0
## singledrop               singledrop      2358        0
## contenttyp               contenttyp      2341        0
## fri                             fri      2336        0
## list                           list      2267        0
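
The Spam_Ham columns above label terms. A classifier is usually trained on one row per document, so a common alternative is to build a single document-term matrix over both classes and keep one label per document. A sketch on made-up texts, assuming 1 for spam and 0 for ham as above:

```r
library(tm)

# Made-up stand-ins for the cleaned spam and ham texts.
spam_docs <- c("free offer click now", "win free money")
ham_docs <- c("meeting notes attached", "lunch with the list maintainers")

# Stack both classes and record one label per document.
all_docs <- c(spam_docs, ham_docs)
labels <- c(rep(1, length(spam_docs)), rep(0, length(ham_docs)))

# Documents in rows, terms in columns.
dtm <- DocumentTermMatrix(VCorpus(VectorSource(all_docs)))

# Dense conversion is fine for a toy example; the label becomes one more column.
labeled <- data.frame(as.matrix(dtm), Spam_Ham = labels)
dim(labeled)
```

Because both classes share one matrix, every document is described by the same set of term columns, which most modelling functions expect.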

Combine Spam and Ham term document matrices