Abstract

For this project we will ingest and prepare a spam/ham dataset then build and evaluate classifer models that, given a new email, will determine whether it is spam or ham.

Environment Prep

if (!require('stringr')) install.packages('stringr')
if (!require('tm')) install.packages('tm')
if (!require('SnowballC')) install.packages('SnowballC')
if (!require('RTextTools')) install.packages('RTextTools')
if (!require('caret')) install.packages('caret')

Data Source

We are will use the Apache SpamAssassin public email corpus.
More information on the dataset can be found at: https://spamassassin.apache.org/publiccorpus/readme.html
Note - we manually removed ‘cmd’ files that were found in some of these directories.

Data Ingestion

Compressed files from the public corpus have been downloaded and uncompressed in an “SData” directory. We will use the all the most current hard and easy spam/ham files to train and test our models.

Review: Clean 1

tmp <- readLines("SData/spam/0001.bfc8d64d12b325ff385cca8d07b84288")
tmp <- str_c(tmp, collapse = "")
tmp <- str_replace_all(tmp, pattern="<.*?>", replacement = " ")
tmp <- str_replace_all(tmp, pattern="\\=", replacement = "")
tmp

## [1] "From 12a1mailbot1@web.de  Thu Aug 22 13:17:22 2002Return-Path:  Delivered-To: zzzz@localhost.example.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.example.com (Postfix) with ESMTP id 136B943C32\tfor  ; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)Received: from mail.webnote.net [193.120.211.219]\tby localhost with POP3 (fetchmail-5.9.0)\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)Received: from dd_it7 ([210.97.77.167])\tby webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623\tfor  ; Thu, 22 Aug 2002 13:09:41 +0100From: 12a1mailbot1@web.deReceived: from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);\t Sat, 24 Aug 2002 09:42:10 +0900To:  Subject: Life Insurance - Why Pay More?Date: Wed, 21 Aug 2002 20:31:57 -1600MIME-Version: 1.0Message-ID:  Content-Type: text/html; charset\"iso-8859-1\"Content-Transfer-Encoding: quoted-printable                                                                     Save up to 70% on Life Insurance.    Why Spend More Than You Have To?   Life Quote Savings                                      Ensuring your       family's financial security is very important. Life Quote Savings makes       buying life insurance simple and affordable. We Provide FREE Access to The       Very Best Companies and The Lowest Rates.                                                  Life Quote Savings  is FAST, EASY and             SAVES you money! Let us help you get started with the best values in             the country on new coverage. You can SAVE hundreds or even thousands             of dollars by requesting a FREE quote from Lifequote Savings. Our             service will take you less than 5 minutes to complete. Shop and             compare. SAVE up to 70% on all types of Life insurance!                                          Click Here For Your             Free Quote!                             Protecting your family is the best investment you'll ever           make!                                                                                                        If you are in receipt of this email       in error and/or wish to be removed from our list,  PLEASE CLICK HERE  AND TYPE REMOVE. If you       reside in any state which prohibits e-mail solicitations for insurance,       please disregard this       email.                                "

tmpcorp <- Corpus(VectorSource(tmp))
meta(tmpcorp, "class") <- "spam"

# recall meta
meta(tmpcorp, "class")

##   class
## 1  spam

Function: Build Corpus

buildcorpus <- function(dir, class) {
    filelist <- list.files(dir)
    for (i in 1:length(filelist)){
        path <- paste0(dir, filelist[i])
        tmp <- readLines(path)
        tmp <- str_c(tmp, collapse = "")
        tmp <- str_replace_all(tmp, pattern="<.*?>", replacement = " ")
        tmp <- str_replace_all(tmp, pattern="\\=", replacement = "")
        
        if (!exists("corp")) {
            corp <- Corpus(VectorSource(tmp))
        } else {
            corp <- c(corp, Corpus(VectorSource(tmp)))
        }
        meta(corp[[i]], "class") <- class
    }
    corp
}

allcorp <- c(buildcorpus("SData/spam/", "spam"),
             buildcorpus("SData/spam_2/", "spam"),
             buildcorpus("SData/easy_ham/", "ham"),
             buildcorpus("SData/easy_ham_2/", "ham"),
             buildcorpus("SData/hard_ham/", "ham"))

Corpus Clean

# clean and prep for tf
cleancorp <- tm_map(allcorp, removeWords, stopwords("english")) 
cleancorp <- tm_map(cleancorp, stripWhitespace)
cleancorp <- tm_map(cleancorp, stemDocument, language="en")

Review: Clean 1 + Clean 2

cleancorp[[1]][[1]]

## [1] "From 12a1mailbot1@web.d Thu Aug 22 13:17:22 2002Return-Path: Delivered-To: zzzz@localhost.example.comReceived: localhost (localhost [127.0.0.1]) phobos.labs.example.com (Postfix) ESMTP id 136B943C32 ; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)Received: mail.webnote.net [193.120.211.219] localhost POP3 (fetchmail-5.9.0) zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)Received: dd_it7 ([210.97.77.167]) webnote.net (8.9.3/8.9.3) ESMTP id NAA04623 ; Thu, 22 Aug 2002 13:09:41 +0100From: 12a1mailbot1@web.deReceived: r-smtp.korea.com - 203.122.2.197 dd_it7 Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900To: Subject: Life Insuranc - Whi Pay More?Date: Wed, 21 Aug 2002 20:31:57 -1600MIME-Version: 1.0Message-ID: Content-Type: text/html; charset\"iso-8859-1\"Content-Transfer-Encoding: quoted-print Save 70% Life Insurance. Whi Spend More Than You Have To? Life Quot Save Ensure famili financi secur important. Life Quot Save make buy life insur simpl affordable. We Provid FREE Access The Veri Best Compani The Lowest Rates. Life Quot Save FAST, EASi SAVES money! Let us help get start best valu countri new coverage. You can SAVE hundr even thousand dollar request FREE quot Lifequot Savings. Our servic will take less 5 minut complete. Shop compare. SAVE 70% type Life insurance! Click Here For Your Free Quote! Protect famili best invest ever make! If receipt email error / wish remov list, PLEASE CLICK HERE AND TYPE REMOVE. If resid state prohibit e-mail solicit insurance, pleas disregard email."

# randomize for training / testing
cleancorp <- sample(cleancorp)

Review: Spam / Ham

meta_list <- factor(unlist(meta(cleancorp, "class")))
table(meta_list)

## meta_list
##  ham spam 
## 4150 1896

spamtotal <- length(which(meta_list == "spam")) 
hamtotal <- length(which(meta_list == "ham")) 
sratio <- round(spamtotal / (spamtotal + hamtotal), 4)

We calculate the ratio of spam to ham in the entire corpus to be 0.3136.

Term Document Matrix

dtm <- DocumentTermMatrix(cleancorp)
# only look at terms that appear in at least 20 of the emails
dtm <- removeSparseTerms(dtm, 1-(20/length(cleancorp)))

# preview sparsity
dtm

## <<DocumentTermMatrix (documents: 6046, terms: 5517)>>
## Non-/sparse entries: 756865/32598917
## Sparsity           : 98%
## Maximal term length: 133
## Weighting          : term frequency (tf)

Modeling w/ RTextTools

For this classification task, we will train, test and compare 4 models built into the RTextTools package - Support Vector Machines, Tree, Lasso and Elastic-Net Generalized Linear Model, and Maximum Entropy.

Because of our approach to data organization, it is not trivial to implment the Naive Bayes method, which is not offered in the RTextTools package.

Make Container

# split data for train/test
N <- length(meta_list)
trainpartition <- round(.75 * N)

# make container for RTextTools
container <- create_container(
    dtm,
    labels=meta_list,
    trainSize=1:trainpartition,
    testSize=trainpartition:N,
    virgin=FALSE
)

Train Models

svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
glm_model <- train_model(container, "GLMNET")
maxent_model <- train_model(container, "MAXENT")

Test Models

svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
glm_out <- classify_model(container, glm_model)
maxent_out <- classify_model(container, maxent_model)

Evaluate Performance

labels_out <- data.frame(
    correct_label = meta_list[trainpartition:N],
    svm = as.character(svm_out[,1]),
    tree = as.character(tree_out[,1]),
    glm = as.character(glm_out[,1]),
    maxent = as.character(maxent_out[,1]),
    stringsAsFactors = F)

#func to call confusionMatrix 
makematrix <- function(col) {
    confusionMatrix(table(labels_out[[col]], labels_out$correct_label), positive="ham")
}

# create all 
svmconf <- makematrix('svm')
treeconf <- makematrix('tree')
glmconf <- makematrix('glm')
maxentconf <- makematrix('maxent')

Plot: Confusion Matrix

par(mfrow=c(2,2))
fourfoldplot(svmconf$table, color = c("#B22222", "#2E8B57"), main="SVM")
fourfoldplot(treeconf$table, color = c("#B22222", "#2E8B57"), main="Tree")
fourfoldplot(glmconf$table, color = c("#B22222", "#2E8B57"), main="GLM")
fourfoldplot(maxentconf$table, color = c("#B22222", "#2E8B57"), main="MaxEnt")

Comparison Table: F1 Score

#setup df
eval <- data.frame(treeconf$byClass, 
                   svmconf$byClass,
                   glmconf$byClass,
                   maxentconf$byClass)

eval <- data.frame(t(eval))

#calc FScore
precision <- eval$Pos.Pred.Value
recall <- eval$Sensitivity
eval$Fscore <- 2 * ((precision * recall) / (precision + recall))  

# manipulate results DF
eval <- eval[,c(1:3, 9)]
row.names(eval) <- c("Tree", "SVM", "GLM", "MaxEnt")
eval <- eval[order(eval$Fscore, decreasing=TRUE),]

knitr::kable(eval)

	Sensitivity	Specificity	Pos.Pred.Value	Fscore
MaxEnt	0.9910803	0.9880952	0.9940358	0.9925558
SVM	0.9920714	0.9821429	0.9910891	0.9915800
GLM	0.9920714	0.9464286	0.9737354	0.9828179
Tree	0.9653122	0.9523810	0.9759519	0.9706029

Conclusion

In summary, we ingested and cleaned our spam/ham data, then trained 4 different models using 75% of the data and tested on the withheld 25%. Our results were uniformly excellent, and we compared the models by using a confusion matrix and calculating the F1 score for binary classification using the precision and recall metrics. Note that in the confusionMatrix function in caret, these values precision = positive predictive value and recall = sensitivity.

As noted, all of the models performed very well. The Maximum Entropy and Support Vector Machines were the top performers with Fscores > 0.99. Combining models into an ensemble has the potential to improve performance even further.

The RTextTools package provides excellent one-stop shopping for document preparation and modelling, however the algorithm choices are limited and a bit too ‘black box’ for a noob learner like me. I look forward to learning more about the inner workings of individual modeling algorithms in future projects.

DATA 607: Document Classification; Week 10

Walt Wells, Fall 2016