DATA 607 - Project 4

Background

The focus of this project is document classification.

For this project, we start with a labeled email corpus, download and extract the data, train a model that we then use to predict the class of new documents (those withheld from the training set or taken from another source), and finally analyze the accuracy of our classifier.

Download Data

We lean on the R.utils library to download, bunzip, and untar each archive into our “emails” directory, and then create a corresponding list of file names from the spam and ham emails available in the SpamAssassin public corpus:

library(R.utils) #provides bunzip2()
#Download, bunzip, and untar spam_2 files into "emails" directory
#download.file("http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", destfile= "20050311_spam_2.tar.bz2")
#bunzip2("20050311_spam_2.tar.bz2")
#untar("20050311_spam_2.tar", exdir="emails")

#Create corresponding list of file names for spam_2 and exclude cmds file
if (file.exists("emails\\spam_2\\cmds")) file.remove("emails\\spam_2\\cmds")
spam_list = list.files("emails\\spam_2\\") 

#Download, bunzip, and untar easy_ham files into "emails" directory
#download.file("http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2", destfile="20030228_easy_ham.tar.bz2")
#bunzip2("20030228_easy_ham.tar.bz2")
#untar("20030228_easy_ham.tar", exdir = "emails")

#Create corresponding list of file names for easy_ham and exclude cmds file
if (file.exists("emails\\easy_ham\\cmds")) file.remove("emails\\easy_ham\\cmds")
ham_list = list.files("emails\\easy_ham\\")

As shown above, we remove the cmds file from each folder before building our lists, and we build each list with list.files(), which returns a character vector of the names of the files in the named directory.

Note: if it’s your first time running the code please UNCOMMENT the download.file(), bunzip2(), and untar() portions of the code. Otherwise, comment them out to avoid “file already exists” error messages.
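One way to avoid toggling comments between runs is to guard each download with file.exists(). The helper below is only a minimal sketch of that idea: the name fetch_if_missing and its downloader argument are hypothetical and not part of the original code, and the example uses a stand-in downloader so that it runs without network access.

```r
# Hypothetical helper: download a file only when it is not already on disk.
# `downloader` defaults to base R's download.file(); it is a parameter here
# only so the function can be exercised without network access.
fetch_if_missing <- function(url, destfile, downloader = utils::download.file) {
  if (file.exists(destfile)) {
    return(FALSE) # nothing to do, the file is already present
  }
  downloader(url, destfile = destfile, mode = "wb")
  TRUE
}

# Example with a stand-in downloader that just writes a placeholder file
fake_download <- function(url, destfile, mode) writeLines("placeholder", destfile)
tmp <- tempfile(fileext = ".tar.bz2")
first_call <- fetch_if_missing("http://example.invalid/archive.tar.bz2", tmp, fake_download)
second_call <- fetch_if_missing("http://example.invalid/archive.tar.bz2", tmp, fake_download)
print(c(first_call, second_call)) # the second call skips the download
```

The same kind of guard could wrap bunzip2() and untar(), checking for the .tar file and the extracted folder respectively.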

Check the length of corresponding spam and ham lists:

length(spam_list)

## [1] 1396

length(ham_list)

## [1] 2500

The ham_list contains 2500 emails (64.17% of total) and the spam_list contains 1396 emails (35.83% of total) for a total of 3896 emails that we’ll be processing. These values are worth noting as we proceed through the code that follows and especially when we consider naive Bayes later.
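The class shares can be checked with a couple of lines of base R. This is just the arithmetic behind the percentages, using the two counts reported above; the variable names are illustrative. If the training split mirrors the full data, these proportions are roughly what a naive Bayes model would use as its class priors.

```r
# Email counts reported by length(ham_list) and length(spam_list)
counts <- c(ham = 2500, spam = 1396)

# Share of each class as a percentage of all emails
shares <- round(100 * counts / sum(counts), 2)

print(sum(counts)) # total number of emails
print(shares)      # percentage of ham and spam
```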

Once we have lists of spam and ham emails, we set out to build a data frame of all emails, df_mails, in which each line of each email is stored with its file name (id) and its type (spam or ham):

# Build data frame
library(tidyverse) #tibble(), %>%, map(), read_lines(), unnest()
df_mails <- tibble()
df_mails_folders <- c("emails\\spam_2\\", "emails\\easy_ham")
df_mails_types <- c("spam", "ham")

#Extract type (spam vs. ham) and message of corresponding file to populate df_mails 
for (i in seq_along(df_mails_folders)) {
  type <- df_mails_types[i] #spam or ham

  #access files
  l <- tibble(file = dir(df_mails_folders[i], full.names = TRUE)) %>%
    #read in email messages (one element per line of the file)
    mutate(messages = map(file, read_lines)) %>%
    #use file name as id, type as spam / ham, and message as content
    transmute(id = basename(file), type = type, messages) %>%
    unnest(messages) #make each element of messages its own row

  df_mails <- bind_rows(df_mails, l)
}

Once we’ve built out our data frame, we notice that it’s HUGE: df_mails contains 389362 observations (one per line of each email), which was a real pain to process later on. So we merged the lines that share an id (that is, those that came from the same email file) back into a single message:

#Merge messages based on shared ids:
new_df <- df_mails[!duplicated(df_mails$id), ]
new_df[, 'messages'] <- aggregate (messages~id, data = df_mails, toString) [,2]
head(new_df)

#Subset data frame to only contain type (ham / spam) and message (email content): 
df_final <- new_df %>%
  select(type, messages)
dim(df_final)

## [1] 3896    2

#Randomize our elements for better representation of proportions
set.seed(1228)
df_final<- df_final[sample(nrow(df_final)),]
str(df_final) #observe output

## tibble [3,896 x 2] (S3: tbl_df/tbl/data.frame)
##  $ type    : chr [1:3896] "ham" "ham" "ham" "spam" ...
##  $ messages: chr [1:3896] "Return-Path: guido@python.org, Delivery-Date: Mon Sep  9 16:22:52 2002, From: guido@python.org (Guido van Rossu"| __truncated__ "From taxfree7550288@yahoo.com  Fri Jul 26 08:16:45 2002, Return-Path: <taxfree7550288@yahoo.com>, Delivered-To:"| __truncated__ "From rssfeeds@jmason.org  Mon Sep 30 13:37:03 2002, Return-Path: <rssfeeds@spamassassin.taint.org>, Delivered-T"| __truncated__ "From greenman@swbell.net  Mon Jun 24 17:05:13 2002, Return-Path: greenman@swbell.net, Delivery-Date: Mon May 20"| __truncated__ ...

From the above output, we note that our data frame has the proper dimensions: 2 columns (type, messages) and 3896 rows (one per email).
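Matching dimensions do not by themselves confirm that each merged message is still paired with its own type. aggregate() returns its groups sorted by id, while df_mails[!duplicated(df_mails$id), ] keeps ids in order of first appearance, so assigning one to the other by position relies on the two orders agreeing. The sketch below uses a hypothetical toy data frame (toy, with made-up ids) to show how the positional assignment can go wrong, along with one possible alternative that groups on id and type together. It is an illustration under those assumptions, not a drop-in replacement for the code above.

```r
# Toy stand-in for df_mails: one row per line of each email (ids are made up)
toy <- data.frame(
  id = c("b_spam", "b_spam", "a_ham", "a_ham"),
  type = c("spam", "spam", "ham", "ham"),
  messages = c("buy now", "click here", "hi team", "see notes"),
  stringsAsFactors = FALSE
)

# Positional assignment, mirroring the merge above
by_position <- toy[!duplicated(toy$id), ]
by_position[, "messages"] <- aggregate(messages ~ id, data = toy, toString)[, 2]
print(by_position) # the spam row now carries the ham text

# Alternative: group on id and type so each label travels with its text
by_group <- aggregate(messages ~ id + type, data = toy, FUN = toString)
print(by_group)
```

A quick spot check on the real data, such as comparing a few rows of the merged data frame against the original files, would show whether the labels line up.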

Now that our data frame contains only type and messages, we can convert it into a corpus, clean the contents of this corpus, and create a document term matrix so that later we can visualize the highest-frequency words, train a naive Bayes model, and observe the test results of this model.

Create and Clean Corpus

We start by creating a corpus of all of our email messages and then we clean the corpus by replacing non-word characters with spaces, removing URLs, homogenizing to lower case, removing numbers, removing punctuation, removing stopwords, and then stripping white space:

#Create corpus:
library(tm) #Corpus(), tm_map(), DocumentTermMatrix()
text_corpus <- Corpus(VectorSource(df_final$messages))

#Initialize clean_corpus by replacing non-word characters with spaces:
clean_corpus <- text_corpus %>%
  tm_map(content_transformer(gsub), pattern="\\W", replacement=" ")

## Warning in tm_map.SimpleCorpus(., content_transformer(gsub), pattern = "\\W", :
## transformation drops documents

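#Remove URLs (non-word characters are already spaces at this point,
#so this strips the leftover "http"/"https" tokens):
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
clean_corpus <- tm_map(clean_corpus, content_transformer(removeURL))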

#Clean corpus:
clean_corpus <- clean_corpus %>%
  tm_map(tolower) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords()) %>% 
  tm_map(stripWhitespace)

## Warning in tm_map.SimpleCorpus(., tolower): transformation drops documents

## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents

## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(., removeWords, stopwords()): transformation
## drops documents

## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents

inspect(clean_corpus[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] return path guido python org delivery date mon sep guido python org guido van rossum date mon sep subject spambayes spambayes package reply message mon sep cdt client attbi com references client attbi com glnt pcppcs reston va comcast net ga cthulhu gerg ca client attbi com message id gfmql pcppcs reston va comcast net nasty side effect placing py files package obvious executable scripts like timtest hammie can keep package care installing extra files long re inside package guido van rossum home page http www python org guido                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## [2]  taxfree yahoo com fri jul return path taxfree yahoo com delivered yyyy localhost netnoteinc com received localhost localhost phobos labs netnoteinc com postfix esmtp id jm localhost fri jul edt received phobos localhost imap fetchmail jm localhost single drop fri jul ist received mandark labs netnoteinc com dogma slashnull org esmtp id gqed jm jmason org fri jul received www dpn com cn mandark labs netnoteinc com esmtp id gqdrp jm netnoteinc com fri jul received www dpn com cn sun smtp id paa fri jul cst taxfree yahoo com message id paa www dpn com cn atna aol com cc khegji hotmail com potter usask ca royqian hotmail com hzppby yahoo com larrycline address com chungyin hkstar com u bestnet net bdevenow hotmail com suprempurg aol com jpkielty nameplanet com date fri jul subject sickest bitches mime version x mailer microsoft outlook express x mimeole produced microsoft mimeole v x precedence ref lzxcvbnmlkjhgfq content transfer encoding bit content type text html charset us ascii doctype html public wc dtd html transitional en html head meta http equiv content type content text html charset windows meta content mshtml name generator head body p beaton abused humiliated begging br sluts won t disappoint br br b font size href http f freehost net site click font b p p nbsp p p nbsp p p nbsp p p b font size must least enter font b p p font size br removed house mailing list href http f freehost net site takemeoff html click br will automatically removed future mailings br br received email either requesting information br one sites someone may used email address br received email error please accept apologies br font p body html 
## [3]  rssfeeds jmason org mon sep return path rssfeeds spamassassin taint org delivered yyyy localhost spamassassin taint org received localhost jalapeno jmason org postfix esmtp id baf jm localhost mon sep ist received jalapeno localhost imap fetchmail jm localhost single drop mon sep ist received dogma slashnull org localhost dogma slashnull org esmtp id gtkg jm jmason org sun sep message id gtkg dogma slashnull org yyyy spamassassin taint org gamasutra rssfeeds spamassassin taint org subject windsor soup date sun sep content type text plain encoding utf url http www newsisfree com click date t comment prince charles s attempts meddle politics show detached real life says nick cohen

Again, we note the proper number of documents in the inspected subset (3) and then observe the sheer size of the email messages we’re dealing with, much of it header text. Fortunately there’s a means of making sense of these emails.
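The order of the cleaning steps matters: once non-word characters have been replaced with spaces, a URL is already split into separate tokens (note “http www python org” in the first document above). The base R sketch below uses a hypothetical helper, clean_text(), rather than the tm pipeline. It strips URLs while they are still intact and only then removes the remaining non-alphanumeric characters. Stop-word removal is left out to keep the example short, and the character classes are an assumption that differs slightly from "\\W" (which keeps underscores).

```r
# Hypothetical base R cleaner, applied to a single string
clean_text <- function(x) {
  x <- gsub("http[^[:space:]]*", " ", x) # drop URLs while they are still intact
  x <- gsub("[^[:alnum:]]+", " ", x)     # non-alphanumeric characters -> spaces
  x <- tolower(x)                        # homogenize case
  x <- gsub("[[:digit:]]+", " ", x)      # remove numbers
  x <- gsub("[[:space:]]+", " ", x)      # collapse repeated white space
  trimws(x)
}

cleaned <- clean_text("Visit http://www.python.org/~guido NOW, 100% free!")
print(cleaned) # "visit now free"
```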

As a next step, we can create our document term matrix, which is a way of representing the words in clean_corpus as a table (with rows representing the documents to be analyzed and columns representing the terms found in those documents):

#Create DTM using function:
dtm <- DocumentTermMatrix(clean_corpus)
inspect(dtm) #inspect

## <<DocumentTermMatrix (documents: 3896, terms: 71653)>>
## Non-/sparse entries: 672033/278488055
## Sparsity           : 100%
## Maximal term length: 121
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   com font fork http list localhost net org received spamassassin
##   1170  21    0   18    3    7         6   0   9        7            6
##   1545 224   16    0   98    4         0  16   0        2            0
##   1803  15    0   15    3    7         8   0   9        6            5
##   1922 167   41    0   83   12         0  14   0        2            0
##   2120 204 1102    0  516    2         6  43   4        7            1
##   2792 204 1102    0  516    5         6  43   4        7            1
##   3135  19    0    0    9    5         5  25   2        8            3
##   3232  17    0   17    3    7         7   0   8        7            5
##   527  175  198    0  141    2         5  11  16        4           11
##   969  146 1627    0   80    4         0   8   5        2            0

Note the 100% sparsity (rounded up from about 99.8%): almost every cell in the matrix is zero. Taking this into account, we then filter our DTM (document term matrix) to reduce sparsity by keeping only the terms that appear in at least ~1% of the documents before inspecting the output:

#Reduce sparsity 
dtm_filtered <- removeSparseTerms(dtm, 0.99) 
inspect(dtm_filtered)

## <<DocumentTermMatrix (documents: 3896, terms: 2251)>>
## Non-/sparse entries: 477379/8292517
## Sparsity           : 95%
## Maximal term length: 33
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   com font fork http list localhost net org received spamassassin
##   1170  21    0   18    3    7         6   0   9        7            6
##   1922 167   41    0   83   12         0  14   0        2            0
##   2120 204 1102    0  516    2         6  43   4        7            1
##   2792 204 1102    0  516    5         6  43   4        7            1
##   2858 165   41    0   93    0         0  24   4        2            0
##   3135  19    0    0    9    5         5  25   2        8            3
##   3232  17    0   17    3    7         7   0   8        7            5
##   527  175  198    0  141    2         5  11  16        4           11
##   68    22  542    0    4    6         5  30   0       16            3
##   969  146 1627    0   80    4         0   8   5        2            0

We now have ~95% sparsity and account for 2251 terms rather than the 71653 terms in the unfiltered DTM.
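To make the table structure and the sparsity figure concrete, the sketch below builds a tiny document term matrix by hand in base R from two made-up documents, then computes sparsity as the share of zero cells. It finishes by applying the same ratio to the non-/sparse entry counts reported by inspect() above, which come out at roughly 99.8% and 94.6% before rounding. The names docs, dtm_toy, and sparsity() are illustrative only.

```r
# Toy documents (made up) and their vocabulary
docs <- c(d1 = "free offer click", d2 = "meeting notes click click")
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# Rows are documents, columns are terms, cells are term counts
dtm_toy <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
print(dtm_toy)

# Sparsity = share of cells that are zero
toy_sparsity <- mean(dtm_toy == 0)
print(toy_sparsity)

# Same ratio from the non-/sparse entry counts reported by inspect()
sparsity <- function(non_sparse, sparse) sparse / (non_sparse + sparse)
print(round(100 * sparsity(672033, 278488055), 2)) # unfiltered DTM
print(round(100 * sparsity(477379, 8292517), 2))   # filtered DTM
```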

From this point, we can revisit clean_corpus via visualization. We can make use of the wordcloud() function to visualize the most common 50 words with a minimum count of 1000. Those words are:

#Visualize (ham and spam together) as a wordcloud
library(wordcloud) #provides wordcloud()
wordcloud(clean_corpus, max.words = 50, random.order = FALSE, min.freq=1000)

We can note the prominence of words like “com” and “received” and jumbled words like “esmtp” and “xent” …

At this point we’re ready to enter the final phase. We’re ready to …

Train and Test Data

We’re going to apply the naive Bayes classifier.

To steal from UC Business Analytics’ site:

Although it is often outperformed by other techniques, and despite the naïve design and oversimplified assumptions, this classifier can perform well in many complex real-world problems. And since it is a resource efficient algorithm that is fast and scales well, it is definitely a machine learning algorithm to have in your toolkit.
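To show what the “naive” assumption amounts to, the sketch below works through a hand calculation on a made-up training set of six emails and two hypothetical words. Each class score is the class prior multiplied by the per-word probabilities, treated as if the words were independent given the class. This is a bare-bones illustration with no smoothing, not the implementation used later in this project.

```r
# Toy training set: presence of two hypothetical words in six emails
train <- data.frame(
  free    = c("Yes", "Yes", "No",  "Yes", "No",  "No"),
  meeting = c("No",  "No",  "Yes", "Yes", "Yes", "No"),
  type    = c("spam", "spam", "spam", "ham", "ham", "ham"),
  stringsAsFactors = FALSE
)

# Class priors: share of each type in the training data
prior <- sapply(split(train, train$type), nrow) / nrow(train)

# P(word present | class), estimated by simple relative frequency
p_yes <- sapply(c("free", "meeting"), function(w) {
  sapply(split(train[[w]] == "Yes", train$type), mean)
})

# Score a new email that contains "free" but not "meeting"
score <- prior * p_yes[, "free"] * (1 - p_yes[, "meeting"])
posterior <- score / sum(score)
print(round(posterior, 2)) # ham 0.2, spam 0.8
```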

First, we divide the data into training and test sets using a proportion of 80:20 (80% training data, 20% test data). We do so for our “raw” emails (from our data frame df_final), our DTM prior to accounting for sparsity, and our cleaned corpus of documents.

Because we shuffled the rows of df_final before building the corpus, we can simply take the first 80% of our entries as our training set and the remainder as our test set. The initial calculation of the “divider row” has been commented out (at the top of the code):

#Determine 80% of row number:
##nrow(df_final) #3896
##round(0.8 * 3896,0) #3117

df_train <- df_final[1:3117,]
df_test <- df_final[3118:3896,]

#print(df_train$type) #verify randomization :)
#print(df_test$type) #verify randomization

dtm_train <- dtm[1:3117,] 
dtm_test <- dtm[3118:3896,]

#Since corpus is stored as documents:
corpus_train <- clean_corpus[1:3117]
corpus_test <- clean_corpus[3118:3896]
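The hard-coded row numbers above could also be derived from the row count, which keeps the split correct if the number of emails changes. The sketch below shows the arithmetic with the reported total of 3896; the names n, n_train, train_idx, and test_idx are hypothetical.

```r
n <- 3896                  # number of emails, nrow(df_final) in the text
n_train <- round(0.8 * n)  # 80% of the rows, rounded to a whole row

train_idx <- seq_len(n_train)              # first 80% of the shuffled rows
test_idx <- setdiff(seq_len(n), train_idx) # the remaining rows

print(c(train = length(train_idx), test = length(test_idx)))
```

With these indices, df_final[train_idx, ] and df_final[test_idx, ] would select the same rows as the fixed ranges used above.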

Once our data’s been divided, we identify words that appear at least five times in the training DTM, rebuild our training and test DTMs using only those words, and make use of a simple function, convert_count(), to record the presence or absence of each word in a message as “Yes” or “No”. We then apply this function to our training and test data:

#Identify words that appear at least 5 times:
five_words <- findFreqTerms(dtm_train, 5)
length(five_words) #how many words are there?

## [1] 13109

five_words[1:5] #what are the 1st 5?

## [1] "attbi"  "can"    "care"   "cdt"    "client"

#Create DTMs using frequent words:
email_train <- DocumentTermMatrix(corpus_train, control=list(dictionary = five_words))
email_test <- DocumentTermMatrix(corpus_test, control=list(dictionary = five_words))

#Convert count info to "Yes" or "No"
convert_count <- function(x) {
  ifelse(x > 0, "Yes", "No") #return the labels directly
}

#Convert document-term matrices:
email_train <- apply(email_train, 2, convert_count) 
email_test <- apply(email_test, 2, convert_count)

From the outputs above, we see the number of words that appeared at least 5 times and we see the first five words in our corresponding list.
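The effect of convert_count() is easiest to see on a small matrix. The sketch below applies the same ifelse() rule column by column to a made-up count matrix (the document and term names are hypothetical), turning term counts into “Yes”/“No” presence flags.

```r
# Same rule as convert_count() above: any positive count becomes "Yes"
convert_count <- function(x) ifelse(x > 0, "Yes", "No")

# Made-up term counts: 2 documents x 3 terms
counts <- matrix(
  c(0, 2, 1, 0, 0, 5),
  nrow = 2,
  dimnames = list(c("doc1", "doc2"), c("free", "list", "org"))
)

# Apply the conversion to every column (MARGIN = 2)
flags <- apply(counts, 2, convert_count)
print(flags)
```

The result is a character matrix of the same shape, so each term becomes a two-level categorical feature rather than a count.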

At this point we’re ready to apply and evaluate the performance of naive Bayes. First we train the model with our training data (email_train), we verify the class of the assigned variable (email_classifier) and then we evaluate the performance of our model with the associated test data (email_test). The result can be seen below:

#Create naive Bayes classifier object
library(e1071) #provides naiveBayes()
email_classifier <- naiveBayes(email_train, factor(df_train$type))
class(email_classifier) #verify the class of this classifier

## [1] "naiveBayes"

#Evaluate performance on test data
email_pred <- predict(email_classifier, newdata=email_test)
table(email_pred, df_test$type)

##           
## email_pred ham spam
##       ham  394   66
##       spam  75  244

As you can see above, of our 779 test documents (emails), 84% of the actual ham emails (394/469) were accurately labeled “ham” (the ham-ham, 1st row, 1st column entry) and 79% of the actual spam emails (244/310) were accurately labeled “spam” (the spam-spam, 2nd row, 2nd column entry), for an overall accuracy of about 82% (638/779).
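The percentages can be recomputed directly from the confusion matrix. The sketch below re-enters the four counts from the table above (rows are predictions, columns are actual types) and derives overall accuracy, the share of each actual class that was labeled correctly, and the share of each predicted label that was correct. The variable names are illustrative.

```r
# Confusion matrix from the output above: rows = predicted, columns = actual
cm <- matrix(
  c(394, 75, 66, 244),
  nrow = 2,
  dimnames = list(predicted = c("ham", "spam"), actual = c("ham", "spam"))
)

accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions over all test emails
per_class <- diag(cm) / colSums(cm)   # share of actual ham / spam labeled correctly
precision <- diag(cm) / rowSums(cm)   # share of predicted ham / spam that was right

print(round(100 * accuracy, 1))
print(round(100 * per_class, 1))
print(round(100 * precision, 1))
```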

Upon our first implementation of naive Bayes, our output was only reading ham files. After going back through our code, we realized that we had not randomized our input data, so our training data was reading in primarily spam messages while our test data was only testing ham messages. We corrected course by shuffling the rows.

After doing so, the output format was correct but the accuracy was still low. So we tried varying sparsity (alternating between dtm and dtm_filtered), as well as varying the order of our corpus cleaning functions. Ultimately neither of these alterations improved our accuracy.

We then altered the order of our corpus cleaning section, removed the factor() function we’d input to the table() function on the last line, and simplified the convert_count() function to a simple ifelse() statement and voila! The accuracy of our ham filter went from ~13% to 84% while the accuracy of our spam filter dropped from ~90% to 79%. While we aren’t sure exactly which of these “fixes” helped improve our accuracy (the processing time was quite slow, so testing one change at a time would have been too time consuming), we do know that the combination improved the overall accuracy of our model.

References

In completing Project 4, we found the following resources useful and applicable:

Notre Dame. (2014). Text mining example: spam filtering [slide set]. Retrieved from https://www3.nd.edu/~steve/computing_with_data/20_text_mining/text_mining_example.html#/
Gorakala, Suresh Kumar. (2013). Document classification using R [document]. Retrieved from https://www.r-bloggers.com/2013/07/document-classification-using-r/
Kharshit. (2017). Email spam filtering: Text analysis in R [sample code]. Retrieved from https://kharshit.github.io/blog/2017/08/25/email-spam-filtering-text-analysis-in-r