Assignment

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham data-set, then predict the class of new documents (either withheld from the training data-set or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/old/publiccorpus/

Solution

Overview

Executive Summary

The tm package will be used to create a corpus of data which will serve as the source of features and observations for the analysis. This will then be converted into a document-term matrix. Finally, the caret package will be used for model fitting, validation, and testing.

The process of building a ham/spam filter is an oft-used pedagogical tool when teaching predictive modeling. Therefore, there is a multitude of information available on-line and in texts, of which we availed ourselves.

It should be noted that RTextTools, one of the more commonly used text-mining packages in recent years, was recently removed from CRAN, and personal communication by one of us with its author (who is now building the news feed at LinkedIn) confirmed that the package is abandonware.

Lastly, we understand that the object of this exercise is not to build an excellent predictor but to demonstrate the necessary knowledge required to build classification algorithms.

Document-Term Matrix

A document-term matrix (DTM) is the model matrix used in natural language processing (NLP). Its rows represent the documents in the corpus and its columns represent the selected terms or tokens, which are treated as features. The value in each cell depends on the weighting scheme selected. The simplest is term frequency (tf): the number of times the word is found in that document. A more sophisticated weighting scheme is term frequency–inverse document frequency (tf-idf). This measure increases with the frequency of the term but offsets it by the number of documents in which the term appears. This lowers the weight of words that naturally appear very often in all kinds of documents and so do not shed much light on the type of document. The same problem is also addressed by removing words so common as to have no predictive power at all, like “and” or “the”. These are often called stop words.
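To make the weighting options concrete, here is a minimal sketch using a toy three-document corpus (not the project data) that builds a DTM with the default tf weights and again with tf-idf weights.

# Toy example only: compare tf and tf-idf weighting on a tiny corpus
library(tm)
toy_docs <- VCorpus(VectorSource(c("free money now now",
                                   "meeting agenda for monday",
                                   "free meeting notes")))
dtm_tf    <- DocumentTermMatrix(toy_docs)                                          # term frequency
dtm_tfidf <- DocumentTermMatrix(toy_docs, control = list(weighting = weightTfIdf)) # tf-idf
inspect(dtm_tf)
inspect(dtm_tfidf)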

Code and Process

Style

In the following document, all user-created variables will be in snake_case and all user-created functions will be in CamelCase. Unfortunately, the tm package uses camelCase for its functions. wE aPoLoGIze fOr anY IncoNVenIence.

Load Libraries and Set Seed

# allows us to repeat analysis with same outcomes
set.seed(12)

# Enable parallel processing to speed up code
library(doParallel)    # enables parallel processing to leverage multiple CPUs and cores
num_cores <- detectCores() - 1

#registerDoParallel(cores=num_cores)  
cl <- makeCluster(num_cores, type="FORK")  # note: FORK clusters are only available on Unix-like systems

#cl <- makePSOCKcluster(6L)
registerDoParallel(cl)

library(tm)            # tool to facilitate building corpus of data
library(SnowballC)     # tools to find word stems
library(caret)         # tools to run machine learning
library(wordcloud)     # tool to help build visual wordclouds
library(tidyverse)     # includes stringr and ggplot2, used below for string cleaning and plots

List files

The files were downloaded from the link above, and the spam_2 and easy_ham sets were selected for analysis. These were unzipped so that each email is its own file in the directory.

# Get a list of all the spam file names (each file is a single email message)
s_files <- list.files("./Data/spam_2", full.names = TRUE)
s_len <- length(s_files)

# Get a list of all the ham file names (each file is a single email message)
h_files <- list.files("./Data/easy_ham", full.names = TRUE)
h_len <- length(h_files)

We loaded `r s_len` spam email messages and `r h_len` ham (non-spam) email messages. The first thing to note is that we have an unbalanced data set with more good email messages (ham) than spam. This may affect our choice of models and/or force us to take extra steps to accommodate the difference in set sizes.
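If the imbalance proves problematic, one simple option is caret's resampling helpers. The sketch below is illustrative only; train_features and train_labels are hypothetical objects (a predictor data frame and a ham/spam factor) that are not defined in this report.

# Hedged sketch: rebalance the classes by down- or up-sampling (hypothetical objects)
balanced <- caret::downSample(x = train_features, y = train_labels, yname = "class")
# or: caret::upSample(x = train_features, y = train_labels, yname = "class")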

Building the Corpus

Email Headers

We will be focusing on email content, not the meta information, and we will not be doing reverse DNS lookups. Therefore, it makes sense to remove the email headers. According to the most recent RFC about email, RFC 5322 (Section 2.2), the header section may not contain blank lines; the body is separated from the header by an empty line. Therefore, it is a very reasonable approach to look for the first blank line and only start ingesting the email from the next line. That is what the regex pattern "^$" searches for in the function below.

In the headers, some information that could be used to enhance a model might include: the Subject line, the sender’s email address domain name (e.g. @gmail.com, @companyname.com, etc.), whether the sender’s email domain matches the sending SMTP server’s domain name, the hour (UTC) when the email was sent, the origin country (based on SMTP server name or IP address lookup), and potentially information about the originating domain name (e.g. when the domain was registered). If this were a critical project, we could also download RBLs (real-time blackhole lists) and use that information to provide additional pattern matching.

Raw Corpus

The readLines function returns each line of the file as a separate element of a character vector. To turn this into a single character string, the paste function is used with the appropriate sep and collapse values. The class of the document is passed as a parameter to the BuildCorpus function.

#' Build a corpus from a list of file names
#' 
#' @param files List of documents to load.
#' @param class The class to be applied to the loaded documents
#' @return A VCorpus of the loaded documents, tagged with the supplied class
BuildCorpus <- function(files, class) {
  # loop thru files and process each one as we go
  for (i in seq_along(files)) { 
    raw_text <- readLines(files[i])
    em_length <- length(raw_text)
    
    # Let's extract the Subject line (if present) and clean it
    subject_line <- str_extract(raw_text, "^Subject: (.*)$")
    subject_line <- subject_line[!is.na(subject_line)]
    subject_line <- iconv(subject_line, to="UTF-8")
    
    # let's scrub / clean up the subject line text
    subject_line <- gsub("[^0-9A-Za-z///' ]","" , subject_line, ignore.case = TRUE, useBytes = TRUE)
    subject_line <- tolower(subject_line)
    subject_line <- str_replace_all(subject_line, "(\\[)|(\\])|(re )|(subject )", "")

    # Let's extract the email body content
    body_start <- min(grep("^$", raw_text, fixed = FALSE, useBytes = TRUE)) + 1L
    em_body <- paste(raw_text[body_start:em_length], sep="", collapse=" ")
    em_body <- iconv(em_body, to="UTF-8")
    
    # make the text lower case
    em_body <- tolower(em_body)
        
    # remove HTML tags
    em_body <- str_replace_all(em_body, "(<[^>]*>)", "")
    em_body <- str_replace_all(em_body, "(&.*;)", "")

    # remove any URL's
    em_body <- str_replace_all(em_body, "http(s)?:(.*) ", " ")

    # remove non alpha (leave lower case and apostrophe for contractions)
    em_body <- str_replace_all(em_body, "[^a-z///' ]", "")
    em_body <- str_replace_all(em_body, "''|' ", "")

    # Since the subject line might have important info, let's concatenate it to the top of the email body
    em_body <- paste(c(subject_line, em_body), sep="", collapse=" ")
    
    if (i == 1L) {
      ret_Corpus <- VCorpus(VectorSource(em_body))
    } else {
      tmp_Corpus <- VCorpus(VectorSource(em_body))
      ret_Corpus <- c(ret_Corpus, tmp_Corpus)
    }
  }
  
  meta(ret_Corpus, tag = "class", type = "indexed") <- class
  
  return(ret_Corpus)
}

h_corp_raw <- BuildCorpus(h_files, "ham")
s_corp_raw <- BuildCorpus(s_files, "spam")

Cleaning the Corpus

We used many of the default cleaning tools in the tm package to perform standard adjustments like lower-casing, removing numbers, etc. We also made a few non-native adjustments: we stripped out anything that looked like an HTML tag or a URL (which needed to be done prior to removing punctuation, of course), and we added a few words to the removal list which we think have little predictive power due to their overuse. We considered removing all punctuation, but decided to preserve intra-word contractions and intra-word dashes.

Lastly, we used the SnowballC package to stem the document. This process tries to identify common roots shared by similar words and then treat them as one. For example:

wordStem(c('run', 'running', 'ran', 'runt'), language = 'porter')
## [1] "run"  "run"  "ran"  "runt"

The complete cleaning rules are in the CleanCorpus function.

# https://stackoverflow.com/questions/47410866/r-inspect-document-term-matrix-results-in-error-repeated-indices-currently-not
#' Scrub the text in a corpus
#' @param corpus A text corpus prepared by tm
#' @return A sanitized corpus
CleanCorpus <- function(corpus){
  overused_words <- c("ok", 'okay', 'day', "might", "bye", "hello", "hi",
                      "dear", "thank", "you", "please", "sorry")

  # lower case everything
  corpus <- tm_map(corpus, content_transformer(tolower))
  
  # remove any HTML markup
  removeHTMLTags <- function(x) {gsub("(<[^>]*>)", "", x)}
  corpus <- tm_map(corpus, content_transformer(removeHTMLTags))

  # remove any URL's
  StripURL <- function(x) {gsub("(http[^ ]*)|(www\\.[^ ]*)", "", x)}
  corpus <- tm_map(corpus, content_transformer(StripURL))
  
  # remove anything not a simple letter
  KeepAlpha <- function(x) {gsub("[^a-z///-///' ]", "", x, ignore.case = TRUE, useBytes = TRUE)}
  corpus <- tm_map(corpus, content_transformer(KeepAlpha))

  # remove any numbers
  corpus <- tm_map(corpus, removeNumbers)
  
  # remove punctuation
  corpus <- tm_map(corpus, removePunctuation,
                   preserve_intra_word_contractions = TRUE,
                   preserve_intra_word_dashes = TRUE)
  
  # remove any stop words
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, overused_words)
  
  # remove extra white space
  corpus <- tm_map(corpus, stripWhitespace)
  
  # use the SnowballC stem algorithm to find the root stem of similar words
  corpus <- tm_map(corpus, stemDocument)
  
  return(corpus)
}

Removing Very Sparse Terms

Even with a cleaned corpus, the overwhelming majority of the terms are rare. There are two ways to address sparsity of terms in the tm package. The first is to generate a list of words that appear at least \(k\) times in the corpus. This is done using the findFreqTerms command. Then the document-term matrix (DTM) can be built using only those words.

The second way is to build the DTM with all words, and then remove the words that don’t appear in at least \(p\%\) of documents. This is done using the removeSparseTerms function in tm. Both methods make manual inspection of more than one row of the matrix difficult: the matrix is stored sparsely in triplet form, and once terms are removed, R can no longer print it properly.
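To make the two options concrete, here is a hedged sketch; clean_corpus is a hypothetical cleaned corpus and the thresholds are illustrative only.

# Illustrative sketch of both sparsity-reduction approaches (clean_corpus is hypothetical)
dtm_all <- DocumentTermMatrix(clean_corpus)

# Build-up: keep terms appearing at least 100 times, then rebuild with that dictionary
keep_terms <- findFreqTerms(dtm_all, lowfreq = 100)
dtm_built  <- DocumentTermMatrix(clean_corpus, control = list(dictionary = keep_terms))

# Prune-down: drop terms absent from more than 99% of documents
dtm_pruned <- removeSparseTerms(dtm_all, sparse = 0.99)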

The removeSparseTerms approach is intuitively more appealing, as it measures frequency by document rather than across documents. However, applying it to three separate corpora would result in the validation and testing sets not having the same words as the training set. Therefore, the build-up method will be used, with the dictionary taken from the terms that remain after applying a minimum-frequency bound.

However, before we do that, we need to discuss…

Training, Validation, and Testing

Hastie, Tibshirani, and Friedman, in their seminal work The Elements of Statistical Learning (ESL), suggest breaking one’s data into three parts: 50% training, 25% validation, and 25% testing. Confusingly, some literature uses “test” for the validation set and “holdout” for the test set. Regardless, the idea is that you train your model on 50% of the data and use 25% of the data (the validation set) to refine any hyper-parameters of the model. You do this for each model, and then, once all the models are tuned as well as possible, they are compared with each other by their performance on the heretofore unused testing/holdout set. The SplitSample function was used to split the data at the start.

#' Split a sample into training, validation, and test groups. Returns a vector assigning each
#' observation to a group using the provided probabilities. Note: training, validation, and test
#' should be non-negative and not all zero.
#' @param n The total number of observations in the set
#' @param training Desired training set proportion
#' @param validation Desired validation set proportion
#' @param test Desired test set proportion
#' @return A character vector of group labels ("train", "validate", "test"), or FALSE if the proportions are invalid
SplitSample <- function(n, training=0.5, validation=0.25, test=0.25) {
  if((training >= 0 && validation >= 0 && test >= 0) && 
     ((training + validation + test) > 0) && 
     ((training + validation + test) <= 1.0 )) {
    n_split <- sample(x = c("train", "validate", "test"), size = n,
                      replace = TRUE, prob = c(training, validation, test))
  } else {
    n_split <- FALSE
  }
  
  return(n_split)
}

# build vectors that identify the group in which each sample will be placed (training, validation, or test)
h_split <- SplitSample(h_len)
s_split <- SplitSample(s_len)

Note that with machine learning, another popular approach is to set up K-fold Cross Validation. Instead of a single Training/Testing split as shown above, the data are partitioned into K folds; each fold in turn is held out for Testing while the model is trained on the remaining folds. By iterating (typically 5-10 folds), we ensure that every observation is used for both Training and Testing. We then average the performance metrics across folds and use that to evaluate the model. This helps reduce the bias that might be introduced by random chance with just a single Training/Testing split.

If there is a limited number of samples to work with, thus limiting the information available during the training phase, it is common to compromise and use a 70%/30% or 80%/20% Training-to-Testing split and skip the third Validation set. If there are limited observations, bootstrapping is one method for generating additional resamples and works well if the known samples provide sufficient representation of the expected distribution of possible values or data points.
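As a minimal sketch, here is what k-fold cross-validation or bootstrap resampling might look like with caret; train_df and the choice of model are hypothetical placeholders, not objects defined in this report.

# Minimal sketch only; train_df and the choice of model are hypothetical
ctrl_cv   <- trainControl(method = "cv", number = 5)      # 5-fold cross-validation
ctrl_boot <- trainControl(method = "boot", number = 25)   # or 25 bootstrap resamples
# fit <- train(class ~ ., data = train_df, method = "svmLinear", trControl = ctrl_cv)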

When we have the possibility of multiple rows from the same source, there is the possibility of leakage between the training and test/validation sets, such that the model performs better on the validation and/or test sets than it would in practice. We are not going to consider this now, but a more rigorous model would tag each row with the sender’s email address and/or IP address and use groupKFold() or some other similar technique to ensure all rows from a given sender are kept together in the same data set (training, validation, or test). See https://topepo.github.io/caret/data-splitting.html for more information. Note that this approach can lead to complexity … for further discussion, see https://towardsdatascience.com/the-story-of-a-bad-train-test-split-3343fcc33d2c.
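For illustration, here is a hedged sketch of group-aware folds with caret's groupKFold(); sender_id is a hypothetical vector holding one sender address per document.

# Hedged sketch: keep all messages from a given sender in the same fold
# sender_id is a hypothetical character vector, one entry per document
folds <- groupKFold(sender_id, k = 5)
ctrl_grouped <- trainControl(method = "cv", index = folds)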

Building the Term List

As both training and validation are part of the model construction, we feel that the term list can be built from the combination of the two. The terms in the testing/holdout set will not be seen prior to testing. We will restrict the word list to words that appear in at least 100 of the combined 2922 documents. In a real-world scenario, email messages may contain new terms not seen during the training steps. By excluding the holdout set's terms, we better simulate a real-world implementation in which new words appear that were not available during model training.

# pull all terms from the training sets (both ham and spam)
raw_train <- c(h_corp_raw[h_split == "train"],
               s_corp_raw[s_split == "train"])

# pull all terms from the validation sets (both ham and spam)
raw_val <- c(h_corp_raw[h_split == "validate"],
             s_corp_raw[s_split == "validate"])

# pull all terms from the test sets (both ham and spam)
raw_test <- c(h_corp_raw[h_split == "test"],
              s_corp_raw[s_split == "test"])

# combine the training and validation corpora into a master term source
raw_term_corp <- c(raw_train, raw_val)
clean_term_corp <- CleanCorpus(raw_term_corp)

dtm_terms <- DocumentTermMatrix(clean_term_corp, control = list(bounds = list(global = c(100L, Inf))))

freq_terms <- Terms(dtm_terms)
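Later, the same term list can be applied as a fixed dictionary when building the DTM for the holdout set. The lines below are a sketch of that step (not evaluated here).

# Sketch only: build the holdout DTM restricted to the dictionary learned above
dtm_test <- DocumentTermMatrix(CleanCorpus(raw_test),
                               control = list(dictionary = freq_terms))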

Here are the top 20 stemmed terms out of the 273 terms we will use in the dictionary:

ft <- colSums(as.matrix(dtm_terms))
ft_df <- data.frame(term = names(ft), count = as.integer(ft))
knitr::kable(head(ft_df[order(ft, decreasing = TRUE), ], n = 20L),
             row.names = FALSE)
|term   | count|
|:------|-----:|
|email  |  1802|
|will   |  1624|
|use    |  1406|
|can    |  1351|
|get    |  1340|
|just   |   996|
|mail   |   986|
|one    |   970|
|list   |   957|
|messag |   928|
|time   |   924|
|work   |   921|
|free   |   889|
|make   |   842|
|like   |   835|
|now    |   789|
|peopl  |   781|
|new    |   740|
|receiv |   716|
|click  |   628|

Here is a histogram of word frequency using the Freedman-Diaconis rule for binwidth.

# Freedman-Diaconis binwidth: 2 * IQR / n^(1/3)
bw_fd <- 2 * IQR(ft_df$count) / nrow(ft_df) ^ (1/3)
ggplot(ft_df, aes(x = count)) + geom_histogram(binwidth = bw_fd) + xlab("Term count")

Finally, a wordcloud of the stemmed terms appearing at least 250 times:

wordcloud(ft_df$term,ft_df$count, scale = c(3, 0.6), min.freq = 250L,
          colors = brewer.pal(5, "Dark2"), random.color = TRUE,
          random.order = TRUE, rot.per = 0, fixed.asp = FALSE)