For this project we will ingest and prepare a spam/ham dataset then build and evaluate classifer models that, given a new email, will determine whether it is spam or ham.
if (!require('stringr')) install.packages('stringr')
if (!require('tm')) install.packages('tm')
if (!require('SnowballC')) install.packages('SnowballC')
if (!require('RTextTools')) install.packages('RTextTools')
if (!require('caret')) install.packages('caret')Compressed files from the public corpus have been downloaded and uncompressed in an “SData” directory. We will use the all the most current hard and easy spam/ham files to train and test our models.
tmp <- readLines("SData/spam/0001.bfc8d64d12b325ff385cca8d07b84288")
tmp <- str_c(tmp, collapse = "")
tmp <- str_replace_all(tmp, pattern="<.*?>", replacement = " ")
tmp <- str_replace_all(tmp, pattern="\\=", replacement = "")
tmp## [1] "From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002Return-Path: Delivered-To: zzzz@localhost.example.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.example.com (Postfix) with ESMTP id 136B943C32\tfor ; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)Received: from mail.webnote.net [193.120.211.219]\tby localhost with POP3 (fetchmail-5.9.0)\tfor zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)Received: from dd_it7 ([210.97.77.167])\tby webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623\tfor ; Thu, 22 Aug 2002 13:09:41 +0100From: 12a1mailbot1@web.deReceived: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6);\t Sat, 24 Aug 2002 09:42:10 +0900To: Subject: Life Insurance - Why Pay More?Date: Wed, 21 Aug 2002 20:31:57 -1600MIME-Version: 1.0Message-ID: Content-Type: text/html; charset\"iso-8859-1\"Content-Transfer-Encoding: quoted-printable Save up to 70% on Life Insurance. Why Spend More Than You Have To? Life Quote Savings Ensuring your family's financial security is very important. Life Quote Savings makes buying life insurance simple and affordable. We Provide FREE Access to The Very Best Companies and The Lowest Rates. Life Quote Savings is FAST, EASY and SAVES you money! Let us help you get started with the best values in the country on new coverage. You can SAVE hundreds or even thousands of dollars by requesting a FREE quote from Lifequote Savings. Our service will take you less than 5 minutes to complete. Shop and compare. SAVE up to 70% on all types of Life insurance! Click Here For Your Free Quote! Protecting your family is the best investment you'll ever make! If you are in receipt of this email in error and/or wish to be removed from our list, PLEASE CLICK HERE AND TYPE REMOVE. If you reside in any state which prohibits e-mail solicitations for insurance, please disregard this email. "
tmpcorp <- Corpus(VectorSource(tmp))
meta(tmpcorp, "class") <- "spam"
# recall meta
meta(tmpcorp, "class")## class
## 1 spam
buildcorpus <- function(dir, class) {
filelist <- list.files(dir)
for (i in 1:length(filelist)){
path <- paste0(dir, filelist[i])
tmp <- readLines(path)
tmp <- str_c(tmp, collapse = "")
tmp <- str_replace_all(tmp, pattern="<.*?>", replacement = " ")
tmp <- str_replace_all(tmp, pattern="\\=", replacement = "")
if (!exists("corp")) {
corp <- Corpus(VectorSource(tmp))
} else {
corp <- c(corp, Corpus(VectorSource(tmp)))
}
meta(corp[[i]], "class") <- class
}
corp
}
allcorp <- c(buildcorpus("SData/spam/", "spam"),
buildcorpus("SData/spam_2/", "spam"),
buildcorpus("SData/easy_ham/", "ham"),
buildcorpus("SData/easy_ham_2/", "ham"),
buildcorpus("SData/hard_ham/", "ham"))# clean and prep for tf
cleancorp <- tm_map(allcorp, removeWords, stopwords("english"))
cleancorp <- tm_map(cleancorp, stripWhitespace)
cleancorp <- tm_map(cleancorp, stemDocument, language="en")cleancorp[[1]][[1]]## [1] "From 12a1mailbot1@web.d Thu Aug 22 13:17:22 2002Return-Path: Delivered-To: zzzz@localhost.example.comReceived: localhost (localhost [127.0.0.1]) phobos.labs.example.com (Postfix) ESMTP id 136B943C32 ; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)Received: mail.webnote.net [193.120.211.219] localhost POP3 (fetchmail-5.9.0) zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)Received: dd_it7 ([210.97.77.167]) webnote.net (8.9.3/8.9.3) ESMTP id NAA04623 ; Thu, 22 Aug 2002 13:09:41 +0100From: 12a1mailbot1@web.deReceived: r-smtp.korea.com - 203.122.2.197 dd_it7 Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900To: Subject: Life Insuranc - Whi Pay More?Date: Wed, 21 Aug 2002 20:31:57 -1600MIME-Version: 1.0Message-ID: Content-Type: text/html; charset\"iso-8859-1\"Content-Transfer-Encoding: quoted-print Save 70% Life Insurance. Whi Spend More Than You Have To? Life Quot Save Ensure famili financi secur important. Life Quot Save make buy life insur simpl affordable. We Provid FREE Access The Veri Best Compani The Lowest Rates. Life Quot Save FAST, EASi SAVES money! Let us help get start best valu countri new coverage. You can SAVE hundr even thousand dollar request FREE quot Lifequot Savings. Our servic will take less 5 minut complete. Shop compare. SAVE 70% type Life insurance! Click Here For Your Free Quote! Protect famili best invest ever make! If receipt email error / wish remov list, PLEASE CLICK HERE AND TYPE REMOVE. If resid state prohibit e-mail solicit insurance, pleas disregard email."
# randomize for training / testing
cleancorp <- sample(cleancorp)meta_list <- factor(unlist(meta(cleancorp, "class")))
table(meta_list)## meta_list
## ham spam
## 4150 1896
spamtotal <- length(which(meta_list == "spam"))
hamtotal <- length(which(meta_list == "ham"))
sratio <- round(spamtotal / (spamtotal + hamtotal), 4)We calculate the ratio of spam to ham in the entire corpus to be 0.3136.
dtm <- DocumentTermMatrix(cleancorp)
# only look at terms that appear in at least 20 of the emails
dtm <- removeSparseTerms(dtm, 1-(20/length(cleancorp)))
# preview sparsity
dtm## <<DocumentTermMatrix (documents: 6046, terms: 5517)>>
## Non-/sparse entries: 756865/32598917
## Sparsity : 98%
## Maximal term length: 133
## Weighting : term frequency (tf)
For this classification task, we will train, test and compare 4 models built into the RTextTools package - Support Vector Machines, Tree, Lasso and Elastic-Net Generalized Linear Model, and Maximum Entropy.
Because of our approach to data organization, it is not trivial to implment the Naive Bayes method, which is not offered in the RTextTools package.
# split data for train/test
N <- length(meta_list)
trainpartition <- round(.75 * N)
# make container for RTextTools
container <- create_container(
dtm,
labels=meta_list,
trainSize=1:trainpartition,
testSize=trainpartition:N,
virgin=FALSE
)svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
glm_model <- train_model(container, "GLMNET")
maxent_model <- train_model(container, "MAXENT")svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
glm_out <- classify_model(container, glm_model)
maxent_out <- classify_model(container, maxent_model)labels_out <- data.frame(
correct_label = meta_list[trainpartition:N],
svm = as.character(svm_out[,1]),
tree = as.character(tree_out[,1]),
glm = as.character(glm_out[,1]),
maxent = as.character(maxent_out[,1]),
stringsAsFactors = F)
#func to call confusionMatrix
makematrix <- function(col) {
confusionMatrix(table(labels_out[[col]], labels_out$correct_label), positive="ham")
}
# create all
svmconf <- makematrix('svm')
treeconf <- makematrix('tree')
glmconf <- makematrix('glm')
maxentconf <- makematrix('maxent')par(mfrow=c(2,2))
fourfoldplot(svmconf$table, color = c("#B22222", "#2E8B57"), main="SVM")
fourfoldplot(treeconf$table, color = c("#B22222", "#2E8B57"), main="Tree")
fourfoldplot(glmconf$table, color = c("#B22222", "#2E8B57"), main="GLM")
fourfoldplot(maxentconf$table, color = c("#B22222", "#2E8B57"), main="MaxEnt")#setup df
eval <- data.frame(treeconf$byClass,
svmconf$byClass,
glmconf$byClass,
maxentconf$byClass)
eval <- data.frame(t(eval))
#calc FScore
precision <- eval$Pos.Pred.Value
recall <- eval$Sensitivity
eval$Fscore <- 2 * ((precision * recall) / (precision + recall))
# manipulate results DF
eval <- eval[,c(1:3, 9)]
row.names(eval) <- c("Tree", "SVM", "GLM", "MaxEnt")
eval <- eval[order(eval$Fscore, decreasing=TRUE),]
knitr::kable(eval)| Sensitivity | Specificity | Pos.Pred.Value | Fscore | |
|---|---|---|---|---|
| MaxEnt | 0.9910803 | 0.9880952 | 0.9940358 | 0.9925558 |
| SVM | 0.9920714 | 0.9821429 | 0.9910891 | 0.9915800 |
| GLM | 0.9920714 | 0.9464286 | 0.9737354 | 0.9828179 |
| Tree | 0.9653122 | 0.9523810 | 0.9759519 | 0.9706029 |
In summary, we ingested and cleaned our spam/ham data, then trained 4 different models using 75% of the data and tested on the withheld 25%. Our results were uniformly excellent, and we compared the models by using a confusion matrix and calculating the F1 score for binary classification using the precision and recall metrics. Note that in the confusionMatrix function in caret, these values precision = positive predictive value and recall = sensitivity.
As noted, all of the models performed very well. The Maximum Entropy and Support Vector Machines were the top performers with Fscores > 0.99. Combining models into an ensemble has the potential to improve performance even further.
The RTextTools package provides excellent one-stop shopping for document preparation and modelling, however the algorithm choices are limited and a bit too ‘black box’ for a noob learner like me. I look forward to learning more about the inner workings of individual modeling algorithms in future projects.