Document Classification

Introduction

This project seeks to build a classifier that can predict whether an email is spam or ham. It will use the Apache SpamAssassin public corpus as the dataset.

Data Acquisition

The first step is to download the files. I am selecting the most recent releases of spam and easy ham.

if (!file.exists("data/20050311_spam_2.tar.bz2") && !file.exists("data/20050311_spam_2.tar")){
  download.file("https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", "data/20050311_spam_2.tar.bz2")
}

if (!file.exists("data/20030228_easy_ham_2.tar.bz2") & !file.exists("data/20030228_easy_ham_2.tar")){
  download.file("https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2", "data/20030228_easy_ham_2.tar.bz2")
}

Now that the files have been acquired, they need to be decompressed.

if (!dir.exists("data/spam_2")){
  R.utils::bunzip2("data/20050311_spam_2.tar.bz2", "data/20050311_spam_2.tar")
  untar("data/20050311_spam_2.tar", exdir = "data")
  # Delete the cmds file
  unlink("data/spam_2/cmds")
  # Delete these messages because they break the processing pipeline
  unlink("data/spam_2/00706.5116018237368c3633823b2d24f8ac86")
  unlink("data/spam_2/00708.89f1f9108884517148fdbd744e18ec1e")
  unlink("data/spam_2/00737.af5f503fe444ae773bfeb4652d122349")
  unlink("data/spam_2/01125.46ca779f86e1dd0a03c3ffc67b57f55e")
  unlink("data/spam_2/01217.d5a1734ec521c1bd55270eca3ab4acd8")
}

if (!dir.exists("data/easy_ham_2")){
  R.utils::bunzip2("data/20030228_easy_ham_2.tar.bz2", "data/20030228_easy_ham_2.tar")
  untar("data/20030228_easy_ham_2.tar", exdir = "data")
  # Delete the cmds file
  unlink("data/easy_ham_2/cmds")
}
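
As a quick sanity check, we can count the extracted message files (the exact counts depend on the corpus release and on the files deleted above):

# Count the extracted message files in each class directory
length(list.files("data/spam_2"))
length(list.files("data/easy_ham_2"))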

Build a Corpus

Now that we have a dataset, we need to construct a corpus. I will use the tm and dplyr packages to get the job done.

library(dplyr)
library(tm)

get_corpus <- function(dir, label){
  corpus <- VCorpus(DirSource(dir)) %>%
    tm_map(PlainTextDocument) %>% # Convert to plain text documents
    tm_map(content_transformer(tolower)) %>% # Standardize case
    tm_map(removeWords, stopwords("SMART")) %>% # Remove stopwords
    tm_map(removePunctuation) %>% # Remove punctuation marks
    tm_map(removeNumbers) %>% # Remove numbers
    tm_map(stripWhitespace) %>% # Remove extra whitespace
    tm_map(stemDocument) # Stem the documents
  meta(corpus, "LABEL") <- label
  return(corpus)
}

corpus <- c(get_corpus("data/spam_2", "Spam"), get_corpus("data/easy_ham_2", "Ham"))
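
Before building the document-term matrix, it is worth confirming that both labels survived the merge; since the corpus metadata is read as a data frame later on, a quick tally works:

# Tally documents per label in the combined corpus
table(meta(corpus)$LABEL)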

Build a Document-Term Matrix

The next step is to construct a document term matrix.

dtm <- DocumentTermMatrix(corpus)
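
The matrix dimensions can be read off directly:

# Rows are documents, columns are terms
dim(dtm)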

There are 2791 documents and 71521 terms in the document-term matrix.

Remove Infrequent Terms

There are a lot of terms, resulting in a very sparse matrix. I am going to prune the less frequent terms from the matrix: a term must appear in at least 10 documents to be kept.

min_docs <- 10
dtm <- removeSparseTerms(dtm, 1 - (min_docs / length(corpus)))

There are still 2791 documents, but there are 4476 terms in the pruned document-term matrix.

Normalizing the Data

Some messages have more words than others, so raw term counts are not directly comparable. We are going to normalize each document's counts by its total word count and pull everything together into a model dataset.

model_data <- as.matrix(dtm)
# Total count of retained terms in each document
words <- rowSums(model_data)
# Convert raw counts to within-document proportions
model_data <- model_data / words
model_data <- data.frame(model_data)
# Attach the Spam/Ham labels stored in the corpus metadata
model_data <- cbind(meta(corpus), model_data) %>%
  mutate(LABEL = as.factor(LABEL))
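
One caveat: a document that lost all of its terms during pruning would have a row sum of zero, and the division above would fill that row with NaN. A quick check flags any such documents:

# Number of documents with no retained terms (these rows would be NaN)
sum(words == 0)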

Split Data into Training/Test Data Sets

We are now ready to split the data into training and test sets. We are going to use a 75%/25% split; this choice was arbitrary.

library(caret)
set.seed(12345)
in_training_set <- createDataPartition(model_data$LABEL, p = 0.75,  list = FALSE)
training_data <- model_data[in_training_set, ]
testing_data <- model_data[-in_training_set, ]
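
Because createDataPartition performs a stratified split on the outcome, the spam/ham mix should be nearly identical in both sets, which is easy to verify:

# Class proportions should match across the two sets
prop.table(table(training_data$LABEL))
prop.table(table(testing_data$LABEL))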

Build an SVM Model

We will take our training data and produce an SVM model that predicts whether a message is spam or ham.

library(e1071)
model <- svm(LABEL ~ ., data = training_data)
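
With a factor response, svm() defaults to C-classification with a radial basis kernel. If we wanted to tune the hyperparameters rather than accept the defaults, e1071's tune() can grid-search them with cross-validation; a minimal sketch (the cost and gamma grids here are arbitrary illustrative choices, not values used in this project):

# Grid-search cost and gamma using tune()'s default 10-fold cross-validation
tuned <- tune(svm, LABEL ~ ., data = training_data,
              ranges = list(cost = 2^(0:4), gamma = 2^(-6:-2)))
summary(tuned)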

Test the Model

Now that we have trained a model, we can test it and see how it performs.

# Predict labels for the test set, withholding the true LABEL column
predictions <- testing_data %>%
  select(-LABEL) %>%
  predict(model, .)
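
The confusion matrix below can be reproduced by cross-tabulating the predictions against the true test labels; something along these lines also computes the overall accuracy:

# Cross-tabulate predicted vs. actual labels
table(Predicted = predictions, Actual = testing_data$LABEL)
# Proportion of test messages classified correctly
mean(predictions == testing_data$LABEL)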

Table 1. Confusion Matrix

          Ham   Spam
  Ham     349     21
  Spam      1    326

There were 22 misclassified messages out of 697, for an accuracy rate of about 97%. Not bad!

29 October 2018