The goal of this project is to classify email documents as spam (unsolicited) or ham (legitimate) using labeled training documents made available by SpamAssassin.
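The chunks below call functions from several packages without showing the library() calls; a minimal setup block, listing only the packages the later code actually uses, would be:
# Packages assumed throughout (library() calls were not shown in the original chunks)
library(dplyr)        # glimpse(), filter(), group_by(), %>%
library(purrr)        # map_dfr()
library(tibble)       # tibble()
library(ggplot2)      # ggplot(), geom_bar(), coord_flip()
library(tm)           # Corpus(), TermDocumentMatrix(), stopwords(), removeWords(), ...
library(tidytext)     # tidy(), stop_words
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal()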
# File directories
ham_directory <-"/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/easy_ham"
spam_directory <-"/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/spam_2"
# Create vectors of document names
spam_files <- list.files(spam_directory)
ham_files <- list.files(ham_directory)
glimpse(spam_files)
## chr [1:1397] "00001.317e78fa8ee2f54cd4890fdc09ba8176" ...
length(spam_files)
## [1] 1397
glimpse(ham_files)
## chr [1:2501] "00001.7c53336b37003a9286aba55d2945844c" ...
length(ham_files)
## [1] 2501
# Define the spam directory
spam_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/spam_2"
# List of file names in the spam directory
spam_file_names <- list.files(spam_directory, full.names = FALSE)
# Choose one file to extract content
selected_file <- spam_file_names[1] # Change the index as needed
# Construct the full path to the file
file_path <- file.path(spam_directory, selected_file)
# Read the content of the file
content_spam <- readLines(file_path)
# Print or process the content as needed
cat("Content of", selected_file, ":\n")
## Content of 00001.317e78fa8ee2f54cd4890fdc09ba8176 :
cat(content_spam , sep = "\n")
## From ilug-admin@linux.ie Tue Aug 6 11:51:02 2002
## Return-Path: <ilug-admin@linux.ie>
## Delivered-To: yyyy@localhost.netnoteinc.com
## Received: from localhost (localhost [127.0.0.1])
## by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD
## for <jm@localhost>; Tue, 6 Aug 2002 06:48:09 -0400 (EDT)
## Received: from phobos [127.0.0.1]
## by localhost with IMAP (fetchmail-5.9.0)
## for jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)
## Received: from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by
## dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for
## <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100
## Received: from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org
## (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100
## Received: from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net
## [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for
## <ilug@linux.ie>; Fri, 2 Aug 2002 22:50:11 +0100
## X-Authentication-Warning: lugh.tuatha.org: Host w142.z064000057.nyc-ny.dsl.cnc.net
## [64.0.57.142] claimed to be bettyjagessar.com
## Received: from 64.0.57.142 [202.63.165.34] by bettyjagessar.com
## (SMTPD32-7.06 EVAL) id A42A7FC01F2; Fri, 02 Aug 2002 02:18:18 -0400
## Message-Id: <1028311679.886@0.57.142>
## Date: Fri, 02 Aug 2002 23:37:59 0530
## To: ilug@linux.ie
## From: "Start Now" <startnow2002@hotmail.com>
## MIME-Version: 1.0
## Content-Type: text/plain; charset="US-ASCII"; format=flowed
## Subject: [ILUG] STOP THE MLM INSANITY
## Sender: ilug-admin@linux.ie
## Errors-To: ilug-admin@linux.ie
## X-Mailman-Version: 1.1
## Precedence: bulk
## List-Id: Irish Linux Users' Group <ilug.linux.ie>
## X-Beenthere: ilug@linux.ie
##
## Greetings!
##
## You are receiving this letter because you have expressed an interest in
## receiving information about online business opportunities. If this is
## erroneous then please accept my most sincere apology. This is a one-time
## mailing, so no removal is necessary.
##
## If you've been burned, betrayed, and back-stabbed by multi-level marketing,
## MLM, then please read this letter. It could be the most important one that
## has ever landed in your Inbox.
##
## MULTI-LEVEL MARKETING IS A HUGE MISTAKE FOR MOST PEOPLE
##
## MLM has failed to deliver on its promises for the past 50 years. The pursuit
## of the "MLM Dream" has cost hundreds of thousands of people their friends,
## their fortunes and their sacred honor. The fact is that MLM is fatally
## flawed, meaning that it CANNOT work for most people.
##
## The companies and the few who earn the big money in MLM are NOT going to
## tell you the real story. FINALLY, there is someone who has the courage to
## cut through the hype and lies and tell the TRUTH about MLM.
##
## HERE'S GOOD NEWS
##
## There IS an alternative to MLM that WORKS, and works BIG! If you haven't yet
## abandoned your dreams, then you need to see this. Earning the kind of income
## you've dreamed about is easier than you think!
##
## With your permission, I'd like to send you a brief letter that will tell you
## WHY MLM doesn't work for most people and will then introduce you to
## something so new and refreshing that you'll wonder why you haven't heard of
## this before.
##
## I promise that there will be NO unwanted follow up, NO sales pitch, no one
## will call you, and your email address will only be used to send you the
## information. Period.
##
## To receive this free, life-changing information, simply click Reply, type
## "Send Info" in the Subject box and hit Send. I'll get the information to you
## within 24 hours. Just look for the words MLM WALL OF SHAME in your Inbox.
##
## Cordially,
##
## Siddhi
##
## P.S. Someone recently sent the letter to me and it has been the most
## eye-opening, financially beneficial information I have ever received. I
## honestly believe that you will feel the same way once you've read it. And
## it's FREE!
##
##
## ------------------------------------------------------------
## This email is NEVER sent unsolicited. THIS IS NOT "SPAM". You are receiving
## this email because you EXPLICITLY signed yourself up to our list with our
## online signup form or through use of our FFA Links Page and E-MailDOM
## systems, which have EXPLICIT terms of use which state that through its use
## you agree to receive our emailings. You may also be a member of a Altra
## Computer Systems list or one of many numerous FREE Marketing Services and as
## such you agreed when you signed up for such list that you would also be
## receiving this emailing.
## Due to the above, this email message cannot be considered unsolicitated, or
## spam.
## -----------------------------------------------------------
##
##
##
##
## --
## Irish Linux Users' Group: ilug@linux.ie
## http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
## List maintainer: listmaster@linux.ie
cat("\n")
# Define the ham directory
ham_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/easy_ham"
# List of file names in the ham directory
ham_file_names <- list.files(ham_directory, full.names = FALSE)
# Choose one file to extract content
selected_file <- ham_file_names[1] # Change the index as needed
# Construct the full path to the file
file_path <- file.path(ham_directory, selected_file)
# Read the content of the file
content_ham <- readLines(file_path)
# Print or process the content as needed
cat("Content of", selected_file, ":\n")
## Content of 00001.7c53336b37003a9286aba55d2945844c :
cat(content_ham , sep = "\n")
## From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 2002
## Return-Path: <exmh-workers-admin@spamassassin.taint.org>
## Delivered-To: zzzz@localhost.netnoteinc.com
## Received: from localhost (localhost [127.0.0.1])
## by phobos.labs.netnoteinc.com (Postfix) with ESMTP id D03E543C36
## for <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)
## Received: from phobos [127.0.0.1]
## by localhost with IMAP (fetchmail-5.9.0)
## for zzzz@localhost (single-drop); Thu, 22 Aug 2002 12:36:16 +0100 (IST)
## Received: from listman.spamassassin.taint.org (listman.spamassassin.taint.org [66.187.233.211]) by
## dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MBYrZ04811 for
## <zzzz-exmh@spamassassin.taint.org>; Thu, 22 Aug 2002 12:34:53 +0100
## Received: from listman.spamassassin.taint.org (localhost.localdomain [127.0.0.1]) by
## listman.redhat.com (Postfix) with ESMTP id 8386540858; Thu, 22 Aug 2002
## 07:35:02 -0400 (EDT)
## Delivered-To: exmh-workers@listman.spamassassin.taint.org
## Received: from int-mx1.corp.spamassassin.taint.org (int-mx1.corp.spamassassin.taint.org
## [172.16.52.254]) by listman.redhat.com (Postfix) with ESMTP id 10CF8406D7
## for <exmh-workers@listman.redhat.com>; Thu, 22 Aug 2002 07:34:10 -0400
## (EDT)
## Received: (from mail@localhost) by int-mx1.corp.spamassassin.taint.org (8.11.6/8.11.6)
## id g7MBY7g11259 for exmh-workers@listman.redhat.com; Thu, 22 Aug 2002
## 07:34:07 -0400
## Received: from mx1.spamassassin.taint.org (mx1.spamassassin.taint.org [172.16.48.31]) by
## int-mx1.corp.redhat.com (8.11.6/8.11.6) with SMTP id g7MBY7Y11255 for
## <exmh-workers@redhat.com>; Thu, 22 Aug 2002 07:34:07 -0400
## Received: from ratree.psu.ac.th ([202.28.97.6]) by mx1.spamassassin.taint.org
## (8.11.6/8.11.6) with SMTP id g7MBIhl25223 for <exmh-workers@redhat.com>;
## Thu, 22 Aug 2002 07:18:55 -0400
## Received: from delta.cs.mu.OZ.AU (delta.coe.psu.ac.th [172.30.0.98]) by
## ratree.psu.ac.th (8.11.6/8.11.6) with ESMTP id g7MBWel29762;
## Thu, 22 Aug 2002 18:32:40 +0700 (ICT)
## Received: from munnari.OZ.AU (localhost [127.0.0.1]) by delta.cs.mu.OZ.AU
## (8.11.6/8.11.6) with ESMTP id g7MBQPW13260; Thu, 22 Aug 2002 18:26:25
## +0700 (ICT)
## From: Robert Elz <kre@munnari.OZ.AU>
## To: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
## Cc: exmh-workers@spamassassin.taint.org
## Subject: Re: New Sequences Window
## In-Reply-To: <1029945287.4797.TMDA@deepeddy.vircio.com>
## References: <1029945287.4797.TMDA@deepeddy.vircio.com>
## <1029882468.3116.TMDA@deepeddy.vircio.com> <9627.1029933001@munnari.OZ.AU>
## <1029943066.26919.TMDA@deepeddy.vircio.com>
## <1029944441.398.TMDA@deepeddy.vircio.com>
## MIME-Version: 1.0
## Content-Type: text/plain; charset=us-ascii
## Message-Id: <13258.1030015585@munnari.OZ.AU>
## X-Loop: exmh-workers@spamassassin.taint.org
## Sender: exmh-workers-admin@spamassassin.taint.org
## Errors-To: exmh-workers-admin@spamassassin.taint.org
## X-Beenthere: exmh-workers@spamassassin.taint.org
## X-Mailman-Version: 2.0.1
## Precedence: bulk
## List-Help: <mailto:exmh-workers-request@spamassassin.taint.org?subject=help>
## List-Post: <mailto:exmh-workers@spamassassin.taint.org>
## List-Subscribe: <https://listman.spamassassin.taint.org/mailman/listinfo/exmh-workers>,
## <mailto:exmh-workers-request@redhat.com?subject=subscribe>
## List-Id: Discussion list for EXMH developers <exmh-workers.spamassassin.taint.org>
## List-Unsubscribe: <https://listman.spamassassin.taint.org/mailman/listinfo/exmh-workers>,
## <mailto:exmh-workers-request@redhat.com?subject=unsubscribe>
## List-Archive: <https://listman.spamassassin.taint.org/mailman/private/exmh-workers/>
## Date: Thu, 22 Aug 2002 18:26:25 +0700
##
## Date: Wed, 21 Aug 2002 10:54:46 -0500
## From: Chris Garrigues <cwg-dated-1030377287.06fa6d@DeepEddy.Com>
## Message-ID: <1029945287.4797.TMDA@deepeddy.vircio.com>
##
##
## | I can't reproduce this error.
##
## For me it is very repeatable... (like every time, without fail).
##
## This is the debug log of the pick happening ...
##
## 18:19:03 Pick_It {exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace} {4852-4852 -sequence mercury}
## 18:19:03 exec pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
## 18:19:04 Ftoc_PickMsgs {{1 hit}}
## 18:19:04 Marking 1 hits
## 18:19:04 tkerror: syntax error in expression "int ...
##
## Note, if I run the pick command by hand ...
##
## delta$ pick +inbox -list -lbrace -lbrace -subject ftp -rbrace -rbrace 4852-4852 -sequence mercury
## 1 hit
##
## That's where the "1 hit" comes from (obviously). The version of nmh I'm
## using is ...
##
## delta$ pick -version
## pick -- nmh-1.0.4 [compiled on fuchsia.cs.mu.OZ.AU at Sun Mar 17 14:55:56 ICT 2002]
##
## And the relevant part of my .mh_profile ...
##
## delta$ mhparam pick
## -seq sel -list
##
##
## Since the pick command works, the sequence (actually, both of them, the
## one that's explicit on the command line, from the search popup, and the
## one that comes from .mh_profile) do get created.
##
## kre
##
## ps: this is still using the version of the code form a day ago, I haven't
## been able to reach the cvs repository today (local routing issue I think).
##
##
##
## _______________________________________________
## Exmh-workers mailing list
## Exmh-workers@redhat.com
## https://listman.redhat.com/mailman/listinfo/exmh-workers
cat("\n")
# Convert the ham messages into a corpus and clean them with help from the tm package
# Read each file's full text; a corpus built from ham_file_names alone would index
# only the file names, not the message contents
ham_texts <- sapply(file.path(ham_directory, ham_file_names),
                    function(f) paste(readLines(f, warn = FALSE), collapse = " "))
ham_2 <- Corpus(VectorSource(ham_texts))
# Preprocessing function
preprocess_text <- function(text) {
text <- tolower(text)
text <- gsub("\\d+", "", text) # removeNumbers
text <- gsub("[[:punct:]]", "", text) # removePunctuation
text <- stripWhitespace(text)
text <- removeWords(text, stopwords("english"))
text <- removeWords(text, c("will", "the"))
return(text)
}
# Apply preprocessing to each document in the corpus
ham_2 <- sapply(ham_2$content, preprocess_text)
# Now let's build a matrix and dataframe to show the number of words to make wordcloud
tdm_ham_2 <- TermDocumentMatrix(Corpus(VectorSource(ham_2)))
m_h <- as.matrix(tdm_ham_2)
v_h <- sort(rowSums(m_h), decreasing=TRUE)
d_h <- data.frame(ham2 = names(v_h), freq = v_h)
head(d_h, 40)
# Note: no variable named `content` has been created yet, so str() here
# inspects tm's content() accessor rather than any email text
str(content)
## function (x)
# Define a function to clean the email content
cleanEmailContent <- function(content) {
# Remove non-alphanumeric characters
# Caveat: with R's default (TRE) regex engine, \\s inside [] is not a whitespace
# class, so this pattern also deletes spaces and fuses words together, as the
# output below shows; perl = TRUE would treat \\s as whitespace
cleaned_content <- gsub("[^a-zA-Z0-9\\s]", "", content)
# Remove extra whitespaces
cleaned_content <- gsub("\\s+", " ", cleaned_content)
return(cleaned_content)
}
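A quick illustration of the caveat above, on a hypothetical string that is not part of the original pipeline:
x <- "Hello, world! 123"
gsub("[^a-zA-Z0-9\\s]", "", x)              # "Helloworld123" - spaces removed too
gsub("[^a-zA-Z0-9\\s]", "", x, perl = TRUE) # "Hello world 123" - spaces preserved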
# Define the spam directory
spam_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/spam_2"
# List of file names in the spam directory
spam_file_names <- list.files(spam_directory, full.names = FALSE)
# Choose one file to extract content
selected_file <- spam_file_names[1] # Change the index as needed
# Construct the full path to the file
file_path <- file.path(spam_directory, selected_file)
# Read the content of the file
content <- readLines(file_path)
# Clean the content
cleaned_content <- cleanEmailContent(content)
# Print or process the cleaned content as needed
cat("Cleaned Content of", selected_file, ":\n")
## Cleaned Content of 00001.317e78fa8ee2f54cd4890fdc09ba8176 :
cat(cleaned_content, sep = "\n")
## FromilugadminlinuxieTueAug61151022002
## ReturnPathilugadminlinuxie
## DeliveredToyyyylocalhostnetnoteinccom
## Receivedfromlocalhostlocalhost127001
## byphoboslabsnetnoteinccomPostfixwithESMTPid9E1F5441DD
## forjmlocalhostTue6Aug20020648090400EDT
## Receivedfromphobos127001
## bylocalhostwithIMAPfetchmail590
## forjmlocalhostsingledropTue06Aug20021148090100IST
## Receivedfromlughtuathaorgrootlughtuathaorg19412514545by
## dogmaslashnullorg81168116withESMTPidg72LqWv13294for
## jmilugjmasonorgFri2Aug20022252320100
## Receivedfromlughrootlocalhost127001bylughtuathaorg
## 893893withESMTPidWAA31224Fri2Aug20022250170100
## Receivedfrombettyjagessarcomw142z064000057nycnydslcncnet
## 64057142bylughtuathaorg893893withESMTPidWAA31201for
## iluglinuxieFri2Aug20022250110100
## XAuthenticationWarninglughtuathaorgHostw142z064000057nycnydslcncnet
## 64057142claimedtobebettyjagessarcom
## Receivedfrom640571422026316534bybettyjagessarcom
## SMTPD32706EVALidA42A7FC01F2Fri02Aug20020218180400
## MessageId1028311679886057142
## DateFri02Aug20022337590530
## Toiluglinuxie
## FromStartNowstartnow2002hotmailcom
## MIMEVersion10
## ContentTypetextplaincharsetUSASCIIformatflowed
## SubjectILUGSTOPTHEMLMINSANITY
## Senderilugadminlinuxie
## ErrorsToilugadminlinuxie
## XMailmanVersion11
## Precedencebulk
## ListIdIrishLinuxUsersGroupiluglinuxie
## XBeenthereiluglinuxie
##
## Greetings
##
## Youarereceivingthisletterbecauseyouhaveexpressedaninterestin
## receivinginformationaboutonlinebusinessopportunitiesIfthisis
## erroneousthenpleaseacceptmymostsincereapologyThisisaonetime
## mailingsonoremovalisnecessary
##
## Ifyouvebeenburnedbetrayedandbackstabbedbymultilevelmarketing
## MLMthenpleasereadthisletterItcouldbethemostimportantonethat
## haseverlandedinyourInbox
##
## MULTILEVELMARKETINGISAHUGEMISTAKEFORMOSTPEOPLE
##
## MLMhasfailedtodeliveronitspromisesforthepast50yearsThepursuit
## oftheMLMDreamhascosthundredsofthousandsofpeopletheirfriends
## theirfortunesandtheirsacredhonorThefactisthatMLMisfatally
## flawedmeaningthatitCANNOTworkformostpeople
##
## ThecompaniesandthefewwhoearnthebigmoneyinMLMareNOTgoingto
## tellyoutherealstoryFINALLYthereissomeonewhohasthecourageto
## cutthroughthehypeandliesandtelltheTRUTHaboutMLM
##
## HERESGOODNEWS
##
## ThereISanalternativetoMLMthatWORKSandworksBIGIfyouhaventyet
## abandonedyourdreamsthenyouneedtoseethisEarningthekindofincome
## youvedreamedaboutiseasierthanyouthink
##
## WithyourpermissionIdliketosendyouabriefletterthatwilltellyou
## WHYMLMdoesntworkformostpeopleandwillthenintroduceyouto
## somethingsonewandrefreshingthatyoullwonderwhyyouhaventheardof
## thisbefore
##
## IpromisethattherewillbeNOunwantedfollowupNOsalespitchnoone
## willcallyouandyouremailaddresswillonlybeusedtosendyouthe
## informationPeriod
##
## ToreceivethisfreelifechanginginformationsimplyclickReplytype
## SendInfointheSubjectboxandhitSendIllgettheinformationtoyou
## within24hoursJustlookforthewordsMLMWALLOFSHAMEinyourInbox
##
## Cordially
##
## Siddhi
##
## PSSomeonerecentlysentthelettertomeandithasbeenthemost
## eyeopeningfinanciallybeneficialinformationIhaveeverreceivedI
## honestlybelievethatyouwillfeelthesamewayonceyouvereaditAnd
## itsFREE
##
##
##
## ThisemailisNEVERsentunsolicitedTHISISNOTSPAMYouarereceiving
## thisemailbecauseyouEXPLICITLYsignedyourselfuptoourlistwithour
## onlinesignupformorthroughuseofourFFALinksPageandEMailDOM
## systemswhichhaveEXPLICITtermsofusewhichstatethatthroughitsuse
## youagreetoreceiveouremailingsYoumayalsobeamemberofaAltra
## ComputerSystemslistoroneofmanynumerousFREEMarketingServicesandas
## suchyouagreedwhenyousignedupforsuchlistthatyouwouldalsobe
## receivingthisemailing
## Duetotheabovethisemailmessagecannotbeconsideredunsolicitatedor
## spam
##
##
##
##
##
##
## IrishLinuxUsersGroupiluglinuxie
## httpwwwlinuxiemailmanlistinfoilugforunsubscriptioninformation
## Listmaintainerlistmasterlinuxie
cat("\n")
# Define the ham directory
ham_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/easy_ham"
# List of file names in the ham directory
ham_file_names <- list.files(ham_directory, full.names = FALSE)
# Choose one file to extract content
selected_file <- ham_file_names[1]
# Construct the full path to the file
file_path <- file.path(ham_directory, selected_file)
# Read the content of the file
content <- readLines(file_path)
# Clean the content
cleaned_content<-cleanEmailContent(content)
# Print or process the cleaned content as needed
cat("Cleaned Content of", selected_file, ":\n")
## Cleaned Content of 00001.7c53336b37003a9286aba55d2945844c :
cat(cleaned_content, sep = "\n")
## FromexmhworkersadminredhatcomThuAug221236232002
## ReturnPathexmhworkersadminspamassassintaintorg
## DeliveredTozzzzlocalhostnetnoteinccom
## Receivedfromlocalhostlocalhost127001
## byphoboslabsnetnoteinccomPostfixwithESMTPidD03E543C36
## forzzzzlocalhostThu22Aug20020736160400EDT
## Receivedfromphobos127001
## bylocalhostwithIMAPfetchmail590
## forzzzzlocalhostsingledropThu22Aug20021236160100IST
## Receivedfromlistmanspamassassintaintorglistmanspamassassintaintorg66187233211by
## dogmaslashnullorg81168116withESMTPidg7MBYrZ04811for
## zzzzexmhspamassassintaintorgThu22Aug20021234530100
## Receivedfromlistmanspamassassintaintorglocalhostlocaldomain127001by
## listmanredhatcomPostfixwithESMTPid8386540858Thu22Aug2002
## 0735020400EDT
## DeliveredToexmhworkerslistmanspamassassintaintorg
## Receivedfromintmx1corpspamassassintaintorgintmx1corpspamassassintaintorg
## 1721652254bylistmanredhatcomPostfixwithESMTPid10CF8406D7
## forexmhworkerslistmanredhatcomThu22Aug20020734100400
## EDT
## Receivedfrommaillocalhostbyintmx1corpspamassassintaintorg81168116
## idg7MBY7g11259forexmhworkerslistmanredhatcomThu22Aug2002
## 0734070400
## Receivedfrommx1spamassassintaintorgmx1spamassassintaintorg172164831by
## intmx1corpredhatcom81168116withSMTPidg7MBY7Y11255for
## exmhworkersredhatcomThu22Aug20020734070400
## Receivedfromratreepsuacth20228976bymx1spamassassintaintorg
## 81168116withSMTPidg7MBIhl25223forexmhworkersredhatcom
## Thu22Aug20020718550400
## ReceivedfromdeltacsmuOZAUdeltacoepsuacth17230098by
## ratreepsuacth81168116withESMTPidg7MBWel29762
## Thu22Aug20021832400700ICT
## ReceivedfrommunnariOZAUlocalhost127001bydeltacsmuOZAU
## 81168116withESMTPidg7MBQPW13260Thu22Aug2002182625
## 0700ICT
## FromRobertElzkremunnariOZAU
## ToChrisGarriguescwgdated103037728706fa6dDeepEddyCom
## Ccexmhworkersspamassassintaintorg
## SubjectReNewSequencesWindow
## InReplyTo10299452874797TMDAdeepeddyvirciocom
## References10299452874797TMDAdeepeddyvirciocom
## 10298824683116TMDAdeepeddyvirciocom96271029933001munnariOZAU
## 102994306626919TMDAdeepeddyvirciocom
## 1029944441398TMDAdeepeddyvirciocom
## MIMEVersion10
## ContentTypetextplaincharsetusascii
## MessageId132581030015585munnariOZAU
## XLoopexmhworkersspamassassintaintorg
## Senderexmhworkersadminspamassassintaintorg
## ErrorsToexmhworkersadminspamassassintaintorg
## XBeenthereexmhworkersspamassassintaintorg
## XMailmanVersion201
## Precedencebulk
## ListHelpmailtoexmhworkersrequestspamassassintaintorgsubjecthelp
## ListPostmailtoexmhworkersspamassassintaintorg
## ListSubscribehttpslistmanspamassassintaintorgmailmanlistinfoexmhworkers
## mailtoexmhworkersrequestredhatcomsubjectsubscribe
## ListIdDiscussionlistforEXMHdevelopersexmhworkersspamassassintaintorg
## ListUnsubscribehttpslistmanspamassassintaintorgmailmanlistinfoexmhworkers
## mailtoexmhworkersrequestredhatcomsubjectunsubscribe
## ListArchivehttpslistmanspamassassintaintorgmailmanprivateexmhworkers
## DateThu22Aug20021826250700
##
## DateWed21Aug20021054460500
## FromChrisGarriguescwgdated103037728706fa6dDeepEddyCom
## MessageID10299452874797TMDAdeepeddyvirciocom
##
##
## Icantreproducethiserror
##
## Formeitisveryrepeatablelikeeverytimewithoutfail
##
## Thisisthedebuglogofthepickhappening
##
## 181903PickItexecpickinboxlistlbracelbracesubjectftprbracerbrace48524852sequencemercury
## 181903execpickinboxlistlbracelbracesubjectftprbracerbrace48524852sequencemercury
## 181904FtocPickMsgs1hit
## 181904Marking1hits
## 181904tkerrorsyntaxerrorinexpressionint
##
## NoteifIrunthepickcommandbyhand
##
## deltapickinboxlistlbracelbracesubjectftprbracerbrace48524852sequencemercury
## 1hit
##
## Thatswherethe1hitcomesfromobviouslyTheversionofnmhIm
## usingis
##
## deltapickversion
## picknmh104compiledonfuchsiacsmuOZAUatSunMar17145556ICT2002
##
## Andtherelevantpartofmymhprofile
##
## deltamhparampick
## seqsellist
##
##
## Sincethepickcommandworksthesequenceactuallybothofthemthe
## onethatsexplicitonthecommandlinefromthesearchpopupandthe
## onethatcomesfrommhprofiledogetcreated
##
## kre
##
## psthisisstillusingtheversionofthecodeformadayagoIhavent
## beenabletoreachthecvsrepositorytodaylocalroutingissueIthink
##
##
##
##
## Exmhworkersmailinglist
## Exmhworkersredhatcom
## httpslistmanredhatcommailmanlistinfoexmhworkers
cat("\n")
# Function to preprocess and clean the text
preprocessText <- function(text) {
# Convert to lowercase
text <- tolower(text)
# Remove numbers
text <- removeNumbers(text)
# Remove punctuation
text <- removePunctuation(text)
# Remove stopwords
text <- removeWords(text, stopwords("en"))
# Strip unnecessary whitespaces
text <- stripWhitespace(text)
return(text)
}
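A quick sanity check of the cleaner on a made-up string (hypothetical input, not from the corpus):
preprocessText("The 3 QUICK brown emails!!")
# expected: " quick brown emails" (digit, punctuation, and the stopword "the" stripped; case lowered)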
# Define the spam directory
spam_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/spam_2"
# List of file names in the spam directory
spam_file_names <- list.files(spam_directory, full.names = FALSE)
# Choose one file to extract content
selected_file <- spam_file_names[1] # Change the index as needed
# Construct the full path to the file
file_path <- file.path(spam_directory, selected_file)
# Read the content of the file
content <- readLines(file_path)
# Clean the content
cleaned_content <- cleanEmailContent(content)
# Preprocess the cleaned content
preprocessed_content <- preprocessText(cleaned_content)
# Create a Corpus
corpus1 <- Corpus(VectorSource(preprocessed_content))
# Create a Document-Term Matrix (DTM)
dtm1 <- DocumentTermMatrix(corpus1)
# Convert DTM to a matrix
matrix <- as.matrix(dtm1)
tdm_s <- TermDocumentMatrix((corpus1))
m_s <- as.matrix(tdm_s)
v_s <- sort(rowSums(m_s), decreasing=TRUE)
d_s <- data.frame(spam_2= names(v_s), freq=v_s)
head(d_s,5)
We can use the wordcloud() function to visualize the most common words in either the spam or the ham corpus.
### Let’s visualize the spam corpus as a wordcloud
# spam corpus-wordcloud
set.seed(131017)
wordcloud(corpus1, max.words = 500, random.order = FALSE)
# Function to preprocess and clean the text
preprocessText <- function(text) {
# Convert to lowercase
text <- tolower(text)
# Remove numbers
text <- removeNumbers(text)
# Remove punctuation
text <- removePunctuation(text)
# Remove stopwords
text <- removeWords(text, stopwords("en"))
# Strip unnecessary whitespaces
text <- stripWhitespace(text)
return(text)
}
# Define the ham directory
ham_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/easy_ham"
# List of file names in the ham directory
ham_file_names <- list.files(ham_directory, full.names = FALSE)
# Choose one file to extract content
selected_file <- ham_file_names[1] # Change the index as needed
# Construct the full path to the file
file_path <- file.path(ham_directory, selected_file)
# Read the content of the file
content <- readLines(file_path)
# Clean the content
cleaned_content <- cleanEmailContent(content)
# Preprocess the cleaned content
preprocessed_content <- preprocessText(cleaned_content)
# Install and load the tm package if not already installed
if (!require("tm")) install.packages("tm", dependencies=TRUE)
library(tm)
# Function to tokenize text, handling specific cases
custom_tokenizer <- function(x) {
# Split by non-word characters
tokens <- unlist(strsplit(x, "\\W+"))
# Handle specific cases: insert spaces around a few frequent words that the earlier cleaning step fused together
tokens <- gsub("spamassassin", "spam assassin", tokens, ignore.case = TRUE)
tokens <- gsub("received", "receive ed", tokens, ignore.case = TRUE)
tokens <- gsub("delivered", "deliver ed", tokens, ignore.case = TRUE)
tokens <- gsub("\\b(for|with)\\b", " \\1 ", tokens, ignore.case = TRUE)
# Remove empty strings
tokens <- tokens[tokens != ""]
return(tokens)
}
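To see what the tokenizer produces, a quick check on a hypothetical header fragment (not part of the original run):
custom_tokenizer("Received: from spamassassin.taint.org")
# expected: "receive ed" "from" "spam assassin" "taint" "org"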
# Create a Corpus
corpus2 <- Corpus(VectorSource(preprocessed_content))
# Create a Document-Term Matrix (DTM) with custom tokenizer
dtm2 <- DocumentTermMatrix(corpus2, control = list(tokenize = custom_tokenizer))
# Convert DTM to a matrix
matrix <- as.matrix(dtm2)
# Convert the matrix to a data frame for better visualization
df <- as.data.frame(matrix)
#Visualizing ham as a wordcloud
# Set Seed
set.seed(3300)
wordcloud(corpus2, max.words = 2000, random.order = FALSE, min.freq=60,colors=brewer.pal(8,"Dark2"))
# Reduce sparsity - ham
dtm2_filtered <- removeSparseTerms(dtm2, 0.99)
inspect(dtm2_filtered)
## <<DocumentTermMatrix (documents: 113, terms: 2)>>
## Non-/sparse entries: 4/222
## Sparsity : 98%
## Maximal term length: 21
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs edt tmdadeepeddyvirciocom
## 1 0 0
## 15 1 0
## 2 0 0
## 20 1 0
## 3 0 0
## 4 0 0
## 43 0 1
## 44 0 1
## 5 0 0
## 6 0 0
# Reduce sparsity - spam
dtm1_filtered <- removeSparseTerms(dtm1, 0.99)
# Function to extract each message's body (everything after the first blank line, which separates headers from body in an email file)
process_file <- function(file_path) {
input_file <- readLines(file_path, warn = FALSE)
# Find the index of the first blank row
first_blank_row <- which(input_file == "")
if (length(first_blank_row) == 0) {
# If no blank row is found, use the length of the vector
first_blank_row <- length(input_file)
} else {
# Use the first occurrence of a blank row
first_blank_row <- first_blank_row[1] - 1
}
# Check if first_blank_row is valid
if (first_blank_row > 0) {
body <- input_file[-(1:first_blank_row)]
body <- paste(body, collapse = " ")
data.frame(content = body, doc_num = basename(file_path), stringsAsFactors = FALSE)
} else {
# If first_blank_row is not valid, return a data frame with NA
data.frame(content = NA, doc_num = basename(file_path), stringsAsFactors = FALSE)
}
}
# Directory containing ham files
ham_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/easy_ham"
# List of ham files
ham_files <- list.files(ham_directory, full.names = TRUE)
# Process each file and bind the results into a data frame
ham_df <- map_dfr(ham_files, process_file)
# Remove rows with NA content
ham_df <- ham_df %>% filter(!is.na(content))
set.seed(131017)
wordcloud(ham_df$content, max.words = 1000, random.order = FALSE, min.freq = 250, colors = brewer.pal(8, "Dark2"))
# Directory containing spam files
spam_directory <- "/Users/Heleine/Library/Mobile Documents/com~apple~CloudDocs/spamham/spam_2"
# List of spam files
spam_files <- list.files(spam_directory, full.names = TRUE)
# Process each file and bind the results into a data frame
spam_df <- map_dfr(spam_files, process_file)
# Remove rows with NA content
spam_df <- spam_df %>% filter(!is.na(content))
set.seed(131017)
wordcloud(spam_df$content, max.words = 1000, random.order = FALSE, min.freq = 250, colors = brewer.pal(8, "Dark2"))
ham_corpus <- VCorpus(VectorSource(ham_df$content))
spam_corpus <- VCorpus(VectorSource(spam_df$content))
corpus <- c(ham_corpus,spam_corpus)
# Basic cleanup with the tm package (the documents are already PlainTextDocuments,
# so no conversion is needed before removing punctuation)
corpus <- corpus %>% tm_map(removePunctuation)
# Function to preprocess text
preprocess_text <- function(text) {
# Convert text to UTF-8 encoding
text <- iconv(text, to = "UTF-8", sub = "byte")
# Perform text preprocessing
text <- tolower(text)
text <- removeNumbers(text)
# Split the text into a character vector of words
words <- unlist(strsplit(text, "\\s+"))
# Drop common English stopwords (keep duplicates so term frequencies survive;
# setdiff() would deduplicate the words and distort the counts)
words <- words[!words %in% stopwords("en")]
# Combine the words back into a preprocessed text
preprocessed_text <- paste(words, collapse = " ")
return(preprocessed_text)
}
# Apply text preprocessing to each document in the corpus
# (tm_map keeps the result a corpus; lapply would return a bare list,
# which DocumentTermMatrix cannot consume)
corpus <- tm_map(corpus, content_transformer(preprocess_text))
# Create Document-Term Matrix (DTM) and remove sparse terms
dtm <- DocumentTermMatrix(corpus) %>% removeSparseTerms(0.95)
# Convert DTM to a tidy format
library(tidytext)
tidy_dtm <- tidy(dtm)
# Add a classification column based on the document index
# (document is stored as character, so convert before comparing)
tidy_dtm$classification <- ifelse(as.integer(tidy_dtm$document) <= length(ham_files), "ham", "spam")
# Display the head and structure of the tidy DTM
head(tidy_dtm)
glimpse(tidy_dtm)
## Rows: 204,659
## Columns: 4
## $ document <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", …
## $ term <chr> "\"en\",", "0),", "10,", "123,", "15,", "19,", "322,", …
## $ count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1…
## $ classification <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham",…
data("stop_words")
tidy_dtm <- tidy_dtm %>%
anti_join(stop_words, by= c("term"="word"))
# find common words
tidy_dtm %>% group_by(term, classification) %>%
count() %>% arrange(desc(n)) %>%
filter(n>300) %>%
ggplot(aes(x=term,y=n,fill=classification)) +
geom_bar(stat='identity') +
coord_flip()
# More stop words, added iteratively by spotting words common to both spam and ham in the barplot
more_stops <- tibble(term = c("wrote", "time", "people", "email", "free", "sponsored", "message", "mail", "dont", "supplied", "httpwwwnewsisfreecomclick", "rpmlistfreshrpmsnet", "rpmlist", "sfnet", "httplistsfreshrpmsnetmailmanlistinforpmlist", "information"))
tidy_dtm <- tidy_dtm %>%
anti_join(more_stops, by= "term")
# Set Seed
set.seed(3300)
word_freqs <- tidy_dtm %>% group_by(term) %>% summarise(freq = sum(count))
wordcloud(word_freqs$term, word_freqs$freq, max.words = 2000, random.order = FALSE, min.freq = 300, colors = brewer.pal(8, "Dark2"))
#top ham words
tidy_dtm %>% filter(classification=='ham') %>% group_by(term) %>% count() %>% arrange(desc(n))
#top spam words
tidy_dtm %>% filter(classification=='spam') %>% group_by(term) %>% count() %>% arrange(desc(n))
Now that the data is prepared, a model can be trained to predict each document's classification. A 75%/25% train/test split is customary; the code below instead samples 10% of the rows for training and holds out the remaining 90% for testing.
# Set Seed
set.seed(3300)
n <- nrow(tidy_dtm)
sample_size <- floor(0.10 * n)
# Generate random indices for the sample
sample_indices <- sample(1:n, size = sample_size, replace = FALSE)
# Create training and test sets
train <- tidy_dtm[sample_indices, ]
test <- tidy_dtm[-sample_indices, ]
# Convert to data frame and factor
train <- as.data.frame(train)
train$classification <- as.factor(train$classification)
(summary(train$classification))
## ham spam
## 5398 8751
(summary(train))
## document term count classification
## Length:14149 Length:14149 Min. :1.000 ham :5398
## Class :character Class :character 1st Qu.:1.000 spam:8751
## Mode :character Mode :character Median :1.000
## Mean :1.057
## 3rd Qu.:1.000
## Max. :3.000
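One way to finish the pipeline is a naive Bayes baseline, a common choice for bag-of-words spam filtering. The sketch below is an illustration rather than part of the original analysis: it assumes the e1071 package is installed, uses tidytext::cast_dtm() to turn the tidy terms back into a document-term matrix, and applies the customary 75%/25% split.
# Hedged sketch: naive Bayes on per-document term counts (assumes e1071)
library(e1071)
# One row per document, one column per term
doc_dtm <- tidy_dtm %>% cast_dtm(document, term, count)
features <- as.data.frame(as.matrix(doc_dtm))
# Recover each document's label and attach it to its feature row
labels <- tidy_dtm %>% distinct(document, classification)
features$classification <- as.factor(labels$classification[match(rownames(features), labels$document)])
# Customary 75%/25% split
set.seed(3300)
idx <- sample(seq_len(nrow(features)), size = floor(0.75 * nrow(features)))
nb_model <- naiveBayes(classification ~ ., data = features[idx, ])
preds <- predict(nb_model, features[-idx, ])
# Confusion matrix on the held-out 25%
table(predicted = preds, actual = features$classification[-idx])
With thousands of sparse term columns this fit can be slow; applying removeSparseTerms() to the cast matrix first would keep the feature set manageable.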