library(tidyverse)
library(httr2)
library(R.utils)
library(tm)
library(quanteda)
library(e1071)
library(caret)

Introduction

When a classification problem comes with a large set of publicly available labeled data, a natural approach is to train a classifier on that data. In this project we use a public collection of spam and ham (non-spam) emails from https://spamassassin.apache.org/ to train a model that distinguishes spam from ham.

Loading the Data for Classification

To begin, we load the emails into our environment. We will download a collection of 2,500 ham emails and 1,396 spam emails from our source. Each archive needs to be decompressed twice (bzip2, then tar) to reach the text files containing the messages.

# Initialize our URLs we will download from
ham_url <- r"(https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham.tar.bz2)"
spam_url <- r"(https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2)"

# Initialize the file names we will be working with
ham_file <- str_extract(ham_url, "(?<=/)[^/]+$")
spam_file <- str_extract(spam_url, "(?<=/)[^/]+$")

# Download the initial compressed files
download.file(ham_url, ham_file)
download.file(spam_url, spam_file)

# Decompress from bz2
bunzip2(ham_file, overwrite = TRUE)
bunzip2(spam_file, overwrite = TRUE)

# Modify the file names to match the updated file names
ham_file <- gsub("[.]bz2$", "", ham_file)
spam_file <- gsub("[.]bz2$", "", spam_file)

# Decompress from tar
untar(ham_file, exdir = gsub("[.]tar$", "", ham_file))
untar(spam_file, exdir = gsub("[.]tar$", "", spam_file))

# Modify the file names to match the updated file names
ham_file <- gsub("[.]tar$", "", ham_file)
spam_file <- gsub("[.]tar$", "", spam_file)

# Get the list of all extracted text files
ham_files <- list.files(ham_file, recursive = TRUE, full.names = TRUE)
spam_files <- list.files(spam_file, recursive = TRUE, full.names = TRUE)

# Remove the cmds index file, which is not message data
ham_files <- ham_files[!str_detect(ham_files,pattern="cmd")]
spam_files <- spam_files[!str_detect(spam_files,pattern="cmd")]

# Process the text files into a tibble for both Spam and Ham
ham_list <- lapply(ham_files, read_lines)
ham_list <- lapply(ham_list, paste, collapse = "")
spam_list <- lapply(spam_files, read_lines)
spam_list <- lapply(spam_list, paste, collapse = "")
ham_df <- tibble(text = unlist(ham_list), class = "ham")
spam_df <- tibble(text = unlist(spam_list), class = "spam")

# Combine the two tibbles into a single data frame with a class column, then
# shuffle the rows so ham and spam are interleaved before the train/test split
set.seed(1337)
email_df <- rbind(ham_df, spam_df)[sample(nrow(ham_df)+nrow(spam_df)),]

head(email_df)
## # A tibble: 6 × 2
##   text                                                                     class
##   <chr>                                                                    <chr>
## 1 "From fork-admin@xent.com  Mon Oct  7 20:37:04 2002Return-Path: <fork-a… ham  
## 2 "From rssfeeds@spamassassin.taint.org  Tue Oct  8 10:56:13 2002Return-P… ham  
## 3 "From fork-admin@xent.com  Tue Sep 24 10:49:30 2002Return-Path: <fork-a… ham  
## 4 "From rpm-list-admin@freshrpms.net  Mon Sep  9 18:00:12 2002Return-Path… ham  
## 5 "From rpm-list-admin@freshrpms.net  Thu Oct  3 19:28:33 2002Return-Path… ham  
## 6 "From fork-admin@xent.com  Fri Sep 20 11:32:49 2002Return-Path: <fork-a… ham

Processing a Corpus

Now that the text is loaded into a data frame, we can create a corpus, which we will use to clean the data and later convert into a document-term matrix (DTM). Cleaning consists of converting everything to a single character encoding, stripping extra whitespace, removing numbers, punctuation, and stop words, lower-casing the text, and stemming words to their root form.

email_corpus <- Corpus(VectorSource(email_df$text))
processed_corpus <- email_corpus %>%
  tm_map(content_transformer(iconv), to = "UTF-8") %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(tolower) %>%
  tm_map(removeWords, stopwords()) %>% 
  tm_map(stemDocument)

inspect(processed_corpus[500])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] rssfeedsjmasonorg mon sep returnpath rssfeedsspamassassintaintorgdeliveredto yyyylocalhostspamassassintaintorgreceiv localhost jalapeno jmasonorg postfix esmtp id af jmlocalhost mon sep istreceiv jalapeno localhost imap fetchmail jmlocalhost singledrop mon sep istreceiv dogmaslashnullorg localhost dogmaslashnullorg esmtp id gtng jmjmasonorg sun sep messageid gtngdogmaslashnullorgto yyyyspamassassintaintorgfrom ask rssfeedsspamassassintaintorgsubject ufo skydat sun sep contenttyp textplain encodingutfurl httpwwwaskbjoernhansencomarchiveshtmld tyesterday viridiana came place drag look sky beauti odd color light obvious wonder today jim explain rocket test vandenburg air forc base neat bit blur saw much street light make good photo happi found photo nasa

Taking a look at a random message, we can see that some extra processing might help filter out nonsensical tokens, such as the fused remnants of URLs and email addresses; this should not be too hard to deal with once the data is in DTM form. One possible extra cleaning step is sketched below.
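
For example, URLs and email addresses could be removed from the raw corpus with a pair of custom content_transformer steps, applied before the other transformations. This is only a sketch and is not applied in the rest of the analysis; strip_urls, strip_emails, and alt_corpus are illustrative names.

# Possible extra cleaning step (not used in this analysis): remove URLs and
# email addresses so they do not survive as fused, meaningless tokens
strip_urls   <- content_transformer(function(x) gsub("(http|www)\\S*", " ", x))
strip_emails <- content_transformer(function(x) gsub("\\S+@\\S+", " ", x))

alt_corpus <- email_corpus %>%
  tm_map(strip_urls) %>%
  tm_map(strip_emails)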

DTM Conversion

To tokenize our data for analysis, we convert the corpus into a document-term matrix (DTM), in which each row is a document, each column is a term, and each cell records how often that term appears in that document.

email_dtm <- DocumentTermMatrix(processed_corpus)
inspect(email_dtm)
## <<DocumentTermMatrix (documents: 3896, terms: 96531)>>
## Non-/sparse entries: 526469/375558307
## Sparsity           : 100%
## Maximal term length: 74230
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   aug esmtp jmlocalhost localhost mon postfix receiv sep thu wed
##   1558   0     2           0         0   0       2      6   0   0   0
##   2152   0     6           0         1   0       0     13   0  11   0
##   2347   0     5           2         4   0       3      3   8   0   0
##   2547   0     4           0         1   0       0      5   0   9   0
##   2598   0     4           2         3   0       3      3   9   0   0
##   2661   0     4           0         1   0       0     30   0   0   1
##   2760   0     1           2         3   0       1      8   0   0   0
##   344    0     1           2         3   0       1      8   0   1   0
##   599    0     4           0         1   8       0     29   0   0   0
##   615    0     2           0         0   0       2      4   0   0   0

Observing our DTM, we can see that there are 96,531 terms in total, but with sparsity rounding to 100% most of those terms appear in only a handful of documents. To filter out gibberish terms produced by our transformations, it helps to cap the sparsity at 95%. This means only terms appearing in at least 5% of the documents in our corpus (roughly 195 of the 3,896 messages) are kept.

email_dtm <- removeSparseTerms(email_dtm, 0.95)
inspect(email_dtm)
## <<DocumentTermMatrix (documents: 3896, terms: 380)>>
## Non-/sparse entries: 196140/1284340
## Sparsity           : 87%
## Maximal term length: 62
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   aug esmtp jmlocalhost localhost mon postfix receiv sep thu wed
##   1558   0     2           0         0   0       2      6   0   0   0
##   2010   0     1           0         0   0       1     14   0   3   0
##   2347   0     5           2         4   0       3      3   8   0   0
##   2547   0     4           0         1   0       0      5   0   9   0
##   2598   0     4           2         3   0       3      3   9   0   0
##   2661   0     4           0         1   0       0     30   0   0   1
##   3632   0     4           0         1   0       0     19   0   0   0
##   472    5     1           2         3   0       1     12   0   0   0
##   599    0     4           0         1   8       0     29   0   0   0
##   615    0     2           0         0   0       2      4   0   0   0

Data Splitting

Next we split our data into training and test sets. This lets us check whether the model generalizes well without looking at external email data. We will use 65% of the data to train the model and the remaining 35% to test it.

We calculate the training-set size and use it to slice both the data frame and the DTM. Because the rows were shuffled earlier, taking the first 65% and the last 35% gives a random split.

After splitting, we recode the DTM so that each cell records whether a term is present ("y") or absent ("n"); this categorical representation is what the Naive Bayes classifier will be trained on.

sample_size <- floor(0.65 * nrow(email_df))

email_df_train <- email_df[1:sample_size,]
email_df_test <- email_df[(1+sample_size):nrow(email_df),]

email_dtm_train <- email_dtm[1:sample_size,]
email_dtm_test <- email_dtm[(1+sample_size):nrow(email_df),]

factorize <- function(x) ifelse(x > 0, "y", "n")
email_dtm_train <- apply(email_dtm_train, MARGIN = 2, factorize)
email_dtm_test <- apply(email_dtm_test, MARGIN = 2, factorize)
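
As a quick sanity check (not part of the original write-up), we can look at how a single term is coded after this transformation; "receiv" is used here only because it appears in the DTM sample shown above.

# Count presence/absence codes for one term that survived the sparsity filter
table(email_dtm_train[, "receiv"])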

Model Training

Finally, we begin training the model. We will use a Naive Bayes classifier, which pairs naturally with the presence/absence features created above.

email_model <- naiveBayes(email_dtm_train,factor(email_df_train$class))
test_results <- predict(email_model, email_dtm_test)

confusionMatrix(test_results, factor(email_df_test$class), positive = "spam",
                dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction ham spam
##       ham  801   23
##       spam  73  467
##                                           
##                Accuracy : 0.9296          
##                  95% CI : (0.9147, 0.9426)
##     No Information Rate : 0.6408          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8505          
##                                           
##  Mcnemar's Test P-Value : 5.702e-07       
##                                           
##             Sensitivity : 0.9531          
##             Specificity : 0.9165          
##          Pos Pred Value : 0.8648          
##          Neg Pred Value : 0.9721          
##              Prevalence : 0.3592          
##          Detection Rate : 0.3424          
##    Detection Prevalence : 0.3959          
##       Balanced Accuracy : 0.9348          
##                                           
##        'Positive' Class : spam            
## 

Displaying the results in a confusion matrix, we can see that our model correctly classified an email as spam or ham about 93% of the time (accuracy 0.9296). The main weakness is false positives: 73 ham emails were flagged as spam, so the positive predictive value for the spam class is only about 86%.
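
As an illustration of how the trained model might be used on new mail, below is a minimal sketch (not part of the original run) that scores a single made-up message. The new text is pushed through the same cleaning pipeline and restricted to the training vocabulary via the dictionary control option; new_text, new_corpus, and new_mat are illustrative names.

# Hypothetical new message (made up for illustration)
new_text <- "Congratulations! You have won a free prize. Click here now."

# Apply the same preprocessing used for the training corpus
new_corpus <- Corpus(VectorSource(new_text)) %>%
  tm_map(content_transformer(iconv), to = "UTF-8") %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(tolower) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(stemDocument)

# Build a DTM restricted to the terms the model was trained on
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(email_dtm)))

# Recode to the same presence/absence representation and predict
new_mat <- ifelse(as.matrix(new_dtm) > 0, "y", "n")
predict(email_model, new_mat)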

Conclusion

We have walked through implementing a Naive Bayes classifier on text data harvested from the web. We loaded the data into a corpus, cleaned it, converted it into a document-term matrix, and split it into training and test sets to train and evaluate the model. The end result is a reasonably solid model with an accuracy of about 93%.

To extend this project, we might increase the amount of data used for training. In particular, it would be useful to pull spam emails from sources beyond the single corpus used here. That would diversify what the model trains on and make it more generally applicable.