For this assignment we are given two sets of email messages: one set
is known to be spam and the other is known to be “ham”, that is,
legitimate mail. Both come from the example corpus at https://spamassassin.apache.org/old/publiccorpus/ [1]. I chose
the files named 20021010_easy_ham.tar.bz2 and
20021010_spam.tar.bz2, which contain sample ham and spam
messages respectively. I have also made my chosen files available on GitHub.
In this section I download both archives from their URLs and extract them.
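hamURL <- 'https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2'
spamURL <- 'https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2'
# download.file() does not create the target directory itself
dir.create("corpus", showWarnings = FALSE)
# mode = "wb" keeps the binary archives intact on Windows
download.file(hamURL, "corpus/20021010_easy_ham.tar.bz2", mode = "wb")
download.file(spamURL, "corpus/20021010_spam.tar.bz2", mode = "wb")
# Extract both archives under corpus/
untar("corpus/20021010_easy_ham.tar.bz2", exdir = "corpus/")
untar("corpus/20021010_spam.tar.bz2", exdir = "corpus/")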
I will use the text mining package tm in R to import the spam and
ham data sets as Corpus objects and to do some analysis. I then
convert these Corpus objects to data frames in order to manipulate
the data.
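library(tm)
library(tidyverse)  # dplyr, tibble, tidyr and stringr functions used below
library(reactable)  # interactive table at the end of the section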
spamCorpus <- Corpus(DirSource(directory = "corpus/spam/", encoding = "ASCII"))
hamCorpus <- Corpus(DirSource(directory = "corpus/easy_ham/",encoding = "ASCII"))
spam <- data.frame(text = sapply(spamCorpus, as.character), stringsAsFactors = FALSE)
ham <- data.frame(text = sapply(hamCorpus, as.character), stringsAsFactors = FALSE)
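Before combining the data sets, I move the row names of each data frame into a message-id column, rename the text column to message, and add an isSpam label (1 for spam, 0 for ham).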
spam <- spam %>%
  rownames_to_column("message-id") %>%
  rename(message = text) %>%
  mutate(isSpam = 1)
ham <- ham %>%
  rownames_to_column("message-id") %>%
  rename(message = text) %>%
  mutate(isSpam = 0)
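As a small illustration of this labelling step, the sketch below applies the same verbs to two made-up one-row data frames. The names toySpam, toyHam and label_messages, as well as the row names and message texts, are invented for the example and are not part of the assignment data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the real spam and ham data frames
toySpam <- data.frame(text = "Win a prize now", row.names = "0001.spam",
                      stringsAsFactors = FALSE)
toyHam <- data.frame(text = "Meeting moved to 3pm", row.names = "0001.ham",
                     stringsAsFactors = FALSE)

# Hypothetical helper wrapping the three verbs used above
label_messages <- function(df, flag) {
  df %>%
    rownames_to_column("message-id") %>%  # row names become a regular column
    rename(message = text) %>%
    mutate(isSpam = flag)
}

toyCombined <- rbind(label_messages(toySpam, 1), label_messages(toyHam, 0))
print(toyCombined)
```

Wrapping the verbs in one function is only a convenience for the toy data; it should give the same columns as the two separate pipelines.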
The body of a mail message comes after the headers and consists of everything that follows the first blank line [2]. I therefore split the message column at the first occurrence of two consecutive newline characters.
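combinedDataSet <- rbind(spam, ham) %>%
  # One alternative per line-ending convention (CRLF, LF, CR), so that a single
  # CRLF line ending is never mistaken for a blank line; extra = "merge" splits
  # only at the first blank line
  separate(message, sep = "\r\n\r\n|\n\n|\r\r", into = c("headers", "body"), extra = "merge")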
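To make the split concrete, here is a minimal sketch on a made-up message with two header lines and a two-paragraph body. The toy data frame and the blankLine variable are assumptions of this example; the separator is written with one alternative per line-ending convention.

```r
library(tidyr)

# Made-up message: two header lines, a blank line, then two body paragraphs
toy <- data.frame(
  message = "From: a@example.com\nSubject: Hello\n\nFirst paragraph.\n\nSecond paragraph.",
  stringsAsFactors = FALSE
)

# One alternative per line-ending convention (CRLF, LF, CR)
blankLine <- "\r\n\r\n|\n\n|\r\r"

# extra = "merge" keeps everything after the first blank line in body
toySplit <- separate(toy, message, sep = blankLine,
                     into = c("headers", "body"), extra = "merge")

cat(toySplit$headers, "\n---\n", toySplit$body, "\n", sep = "")
```

The second blank line stays inside body, which is the behaviour wanted here: only the first blank line separates headers from content.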
I will now strip any HTML tags from the message body, using a regular expression.
combinedDataSet <- combinedDataSet %>%
  mutate(body_plaintext = str_replace_all(body, "</?[^>]+>", ""))
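The effect of this pattern can be checked on a short made-up HTML fragment. The htmlBody string below is invented for illustration.

```r
library(stringr)

# Made-up HTML body with nested tags and an attribute
htmlBody <- "<html><body><p>Click <a href=\"http://example.com\">here</a> now</p></body></html>"

# Same pattern as above: an opening or closing tag up to the next ">"
plain <- str_replace_all(htmlBody, "</?[^>]+>", "")
print(plain)
```

This removes the tags only. Character entities such as &nbsp; are left in place, and the pattern may also remove text between a literal "<" and a later ">" in a plain-text message.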
I will also extract an originating IPv4 address from the headers.
The regex below was adapted from the following website [3]. It
skips private (10.x.x.x, 172.16-31.x.x, 192.168.x.x) and loopback
(127.x.x.x) addresses, so applying the str_extract function to the
headers with this regex should give us the first public IP address
that appears in them.
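# The negative lookahead skips 10.x, 127.x, 192.168.x and 172.16-31.x; each prefix
# ends in a dot so that public ranges such as 104.x or 172.32.x are not skipped
regex <- "\\b(?!10\\.|127\\.|192\\.168\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.)[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}"
combinedDataSet <- combinedDataSet %>%
  mutate(originatingIP = str_extract(headers, regex))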
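The following sketch runs the extraction on a made-up pair of Received lines. The header text, host names and the ipPattern variable are assumptions of the example; the pattern is written with each private prefix anchored on its trailing dot.

```r
library(stringr)

# Lookahead skips 10.x, 127.x, 192.168.x and 172.16-31.x before matching four octets
ipPattern <- "\\b(?!10\\.|127\\.|192\\.168\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.)[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}"

# Made-up headers: loopback and private addresses come before the public one
toyHeaders <- paste(
  "Received: from localhost (127.0.0.1) by internal.example (10.1.2.3)",
  "Received: from mx.example.org ([203.0.113.7]) by relay (192.168.0.9)",
  sep = "\n"
)

firstPublic <- str_extract(toyHeaders, ipPattern)
print(firstPublic)

# A public 104.x address is kept, a private 172.16.x address yields NA
edgeCases <- str_extract(c("104.16.0.1", "172.16.5.4"), ipPattern)
print(edgeCases)
```

Note that str_extract returns whichever public address appears earliest in the header text, which may belong to a relay rather than to the original sender.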
The data frame now looks like this:
combinedDataSet %>%
  head(30) %>%
  reactable(wrap = FALSE)