Elina Azrilyan

Project 4

11/1/18

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder).

I began by downloading the files from the locations below:

https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2

https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2
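
For reference, the same download and extraction could be scripted in base R. The sketch below is one possible way to do it; the data/ destination folder is an assumption, not the path used in this project.

#Sketch only: download both archives and extract them with base R.
#The "data" folder name is an assumption - adjust paths as needed.
urls <- c("https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2",
          "https://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2")
dir.create("data", showWarnings = FALSE)
for (u in urls) {
  dest <- file.path("data", basename(u))
  download.file(u, destfile = dest)   #save the .tar.bz2 archive locally
  untar(dest, exdir = "data")         #extract the messages into data/
}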

The next step was to load required packages:

suppressPackageStartupMessages(require(tidyverse))
## Warning: package 'dplyr' was built under R version 3.5.1
suppressPackageStartupMessages(require(tm))

Reading in the data.

I read the data in by creating 2 separate corpora, one for each type of data: ham and spam.

#Used the following help page to figure out how to read in corpus data: https://stackoverflow.com/questions/27340008/loading-the-data-to-corpus-from-2-directories-in-r

spam <- DirSource("/Users/elinaazrilyan/Documents/Fall 2018/Data 607/Project4/spam_2 2")
spam_corpus <- Corpus(spam, readerControl=list(reader=readPlain))
length(spam_corpus)
## [1] 397
ham <- DirSource("/Users/elinaazrilyan/Documents/Fall 2018/Data 607/Project4/easy_ham_2")
ham_corpus <- Corpus(ham, readerControl=list(reader=readPlain))
length(ham_corpus)
## [1] 1401

Analysis.

The next and final step was to identify the 10 words most commonly associated with ham and spam e-mail. This will be useful for future analysis that searches for these words in an example data set; I will not be doing that analysis in this project.

docs <- ham_corpus

#This code makes everything lowercase
docs <- tm_map(docs, content_transformer(tolower))

#This code removes common English stop words (not relevant for analysis)
docs <- tm_map(docs, removeWords, stopwords("english"))

#This code removes numbers
docs <- tm_map(docs, removeNumbers)

#The code below removes header information - I did this manually (by listing the header words) since it was the only approach I came up with for this corpus; an alternative sketch appears after the ham frequency table below
docs <- tm_map(docs, removeWords, c("com", "net", "org", "for", "with", "localhost", "received", "com", "net", "org", "for", "with", "localhost", "received", "font", "size", "nbsp", "color", "http", "width", "face", "align", "arial", "www", "center", "height", "table", "netnoteinc", "href", "aug", "border", "html", "content", "mail", "verdana", "helvetica", "style", "bgcolor", "type", "text", "esmtp", "may", "div", "name", "sans", "subject", "img", "src", "serif", "email", "ffffff", "smtp", "tue", "list", "message", "will", "can", "date", "mon", "valign", "xent", "fork", "gif", "span", "cellpadding", "cellspacing", "version", "body", "jul", "return", "yyyy", "yahoo", "mailto", "charset", "images", "path", "thu", "left", "linux", "ilug", "admin", "users", "sourceforge", "spamassassin", "mailman", "rpm", "taint", "razor", "lugh", "listinfo", "freshrpms", "lists", "postfix", "tuatha", "phobos", "wed", "exmh", "fri", "slashnull", "edt", "zzzlist", "jmason", "ist", "sender", "delivered", "dogma", "help", "labs", "mime", "drop", "imap", "fetchmail", "beenthere", "redhat", "root", "bulk", "plain", "reply", "egwn", "usw", "pdt")) 


#Now that we have finished cleaning up our corpus, let's create a table with the 10 most common words.
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
##                    word freq
## request         request 2564
## example         example 2096
## unsubscribe unsubscribe 1710
## subscribe     subscribe 1692
## single           single 1468
## precedence   precedence 1370
## errors           errors 1353
## group             group 1221
## irish             irish 1200
## workers         workers 1084
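
The header words above had to be listed by hand. An alternative I did not use here would be to strip the entire header block from each message before building the corpus, since the header ends at the first blank line of an e-mail file. The sketch below shows that idea; strip_headers is a helper written just for this sketch, and the path reuses the easy_ham_2 directory defined earlier.

#Sketch only: drop the header block (everything up to the first blank line)
#from each file, then build a corpus of message bodies.
strip_headers <- function(path) {
  lines <- readLines(path, warn = FALSE)
  blank <- which(lines == "")[1]                  #first blank line ends the header
  if (is.na(blank)) return(paste(lines, collapse = "\n"))
  paste(lines[-seq_len(blank)], collapse = "\n")  #keep only the message body
}
ham_files <- list.files("/Users/elinaazrilyan/Documents/Fall 2018/Data 607/Project4/easy_ham_2", full.names = TRUE)
ham_bodies <- vapply(ham_files, strip_headers, character(1))
ham_corpus_nohdr <- Corpus(VectorSource(ham_bodies))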

I ran into an issue with the spam data, where I kept getting a gsub error. I had to manually remove a large chunk of my spam data to fix it. I believe my computer was having a memory issue, or perhaps there were a few corrupt files in the data that were causing the error. The dataset was reduced to the 397 messages shown above.
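
In hindsight, a gsub error from tm_map is often an encoding problem ("invalid input in 'utf8towcs'") triggered by a few messages containing non-UTF-8 characters; that is an assumption I have not verified here, not a confirmed cause. If that is what happened, re-encoding the corpus text before the other transformations might avoid having to drop any files:

#Possible fix (sketch only): force the text to valid UTF-8, replacing any bytes
#that cannot be converted, before running the other tm_map transformations.
spam_corpus <- tm_map(spam_corpus,
                      content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte")))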

docs2 <- spam_corpus

#This code makes everything lowercase
docs2 <- tm_map(docs2, content_transformer(tolower))

#This code removes common English stop words (not relevant for analysis)
docs2 <- tm_map(docs2, removeWords, stopwords("english"))

#This code removes numbers
docs2 <- tm_map(docs2, removeNumbers)

#The code below removes header information - I did this manually since it was the only approach I came up with for this corpus
docs2 <- tm_map(docs2, removeWords, c("com", "net", "org", "for", "with", "localhost", "received", "font", "size", "nbsp", "color", "http", "width", "face", "align", "arial", "www", "center", "height", "table", "netnoteinc", "href", "aug", "border", "html", "content", "mail", "verdana", "helvetica", "style", "bgcolor", "type", "text", "esmtp", "may", "div", "name", "sans", "subject", "img", "src", "serif", "email", "ffffff", "smtp", "tue", "list", "message", "will", "can", "date", "mon", "valign", "xent", "fork", "gif", "span", "cellpadding", "cellspacing", "version", "body", "jul", "return", "yyyy", "yahoo", "mailto", "charset", "images", "path", "thu", "left")) 

#This code creates a table with the 10 most common words.
dtm2 <- TermDocumentMatrix(docs2)
m2 <- as.matrix(dtm2)
v2 <- sort(rowSums(m2),decreasing=TRUE)
d2 <- data.frame(word = names(v2),freq=v2)
head(d2, 10)
##              word freq
## free         free  676
## top           top  535
## business business  512
## click       click  440
## please     please  425
## one           one  404
## labs         labs  397
## value       value  384
## get           get  364
## mime         mime  362
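
Since the tidyverse (including ggplot2) is already loaded, the spam terms can also be compared visually. The quick sketch below plots the ten most frequent spam terms; it is optional and was not part of the analysis above.

#Sketch only: bar chart of the 10 most frequent terms in the spam corpus.
head(d2, 10) %>%
  ggplot(aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 10 terms in the spam corpus")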

Conclusion

I have obtained the lists of the most common terms associated with the ham and spam data. In the future, these terms could be fed to a model to predict whether a message is spam or legitimate, but that is out of scope for this project.
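
As a rough illustration of what that next step could look like, the sketch below pools the two cleaned corpora, builds a single document-term matrix of presence/absence features, and fits a naive Bayes classifier on a 75% training split. The e1071 package, the split ratio, the sparsity cutoff, and all new object names are assumptions made for this sketch; none of this was run for the project.

#Sketch only: a possible naive Bayes spam filter built from the cleaned corpora.
library(e1071)

#Collapse each cleaned message to one string and pool ham and spam with labels.
ham_text <- sapply(seq_along(docs), function(i) paste(as.character(docs[[i]]), collapse = " "))
spam_text <- sapply(seq_along(docs2), function(i) paste(as.character(docs2[[i]]), collapse = " "))
labels <- factor(c(rep("ham", length(ham_text)), rep("spam", length(spam_text))))

#One document-term matrix over both classes, keeping only reasonably common terms.
dtm_all <- DocumentTermMatrix(Corpus(VectorSource(c(ham_text, spam_text))))
dtm_all <- removeSparseTerms(dtm_all, 0.95)

#Presence/absence features as two-level factors so naive Bayes treats them as categorical.
X <- as.data.frame(lapply(as.data.frame(as.matrix(dtm_all) > 0),
                          function(col) factor(col, levels = c("FALSE", "TRUE"))))

#Hold out 25% of the messages as a test set and check the confusion matrix.
set.seed(607)
train <- sample(seq_along(labels), size = round(0.75 * length(labels)))
model <- naiveBayes(X[train, ], labels[train], laplace = 1)
preds <- predict(model, X[-train, ])
table(predicted = preds, actual = labels[-train])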