As part of this assignment, I downloaded the following files from https://spamassassin.apache.org: easy_ham and spam. I chose to use the library (RTextTools) referenced by “Automated Data Collection with R.” I had originally planned to use both RTextTools and caret, but due to time constraints, I went with a single package for this project.
For the data loading process, I extracted the tarball files into two folders and loaded them into two volatile corpora using the tm package.
The ham folder provided 2551 documents, and the spam folder provided 500 documents.
This is the data I cleaned and used to train my models for spam/ham classification.
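# Packages assumed from the functions called below: tm (corpora, DTMs),
# qdap (bracketX, clean, and rm_angle via qdapRegex), stringr (str_detect,
# str_remove_all), rlist (list.sample), RTextTools (containers and models)
library(tm)
library(qdap)
library(stringr)
library(rlist)
library(RTextTools)
ham.emails <- file.path("/Users/davidapolinar/Dropbox/CUNYProjects/Srping2019/Data607/Project4/easy_ham")
spam.emails <- file.path("/Users/davidapolinar/Dropbox/CUNYProjects/Srping2019/Data607/Project4/spam")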
head(dir(ham.emails))
## [1] "0001.ea7e79d3153e7469e7a9c3e0af6a357e"
## [2] "0002.b3120c4bcbf3101e661161ee7efcb8bf"
## [3] "0003.acfc5ad94bbd27118a0d8685d18c89dd"
## [4] "0004.e8d5727378ddde5c3be181df593f1712"
## [5] "0005.8c3b9e9c0f3f183ddaf7592a11b99957"
## [6] "0006.ee8b0dba12856155222be180ba122058"
head(dir(spam.emails))
## [1] "0001.bfc8d64d12b325ff385cca8d07b84288"
## [2] "0002.24b47bb3ce90708ae29d0aec1da08610"
## [3] "0003.4b3d943b8df71af248d12f8b2e7a224a"
## [4] "0004.1874ab60c71f0b31b580f313a3f6e777"
## [5] "0005.1f42bb885de0ef7fc5cd09d34dc2ba54"
## [6] "0006.7a32642f8c22bbeb85d6c3b5f3890a2c"
ham.email.corps <- VCorpus(DirSource(ham.emails))
spam.email.corps <- VCorpus(DirSource(spam.emails))
When I first looked at the emails, I wasn’t sure what the best approach was for deciding what the predictor models (e.g. SVM, Random Forest, MaxEnt) should look at. Many of the emails had several sections that I did not feel were useful for analysis. For example, the email below contains header sections that may not be very useful for classification:
From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002
Return-Path: 12a1mailbot1@web.de
Delivered-To: zzzz@localhost.example.com
Received: from localhost (localhost [127.0.0.1]) by phobos.labs.example.com (Postfix) with ESMTP id 136B943C32 for zzzz@localhost; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for zzzz@example.com; Thu, 22 Aug 2002 13:09:41 +0100
From: 12a1mailbot1@web.de
Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900
To: dcek1a1@netsgo.com
Subject: Life Insurance - Why Pay More?
Date: Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version: 1.0
Message-ID: 0103c1042001882DD_IT7@dd_it7
While the subject, ID, and from fields could help filter emails easily, spammers can readily find ways around that kind of filtering, so I chose to focus mostly on the body of the email and eliminated the majority of these header fields from my training data. I created a custom function to drop any line that contains a colon or an @ sign. There is a possibility that I lost some data in the body, but I felt that it would be minimal. I relied on both tm and qdap to clean out HTML tags and other characters that were not very helpful. I also removed non-UTF-8 characters, as some of the libraries would fail when running through the corpus.
clean_corpus <- function(corpus)
{
corpus <- tm_map(corpus, content_transformer(bracketX))
}
# Full cleaning pipeline: strip escapes, invalid bytes, header-like lines,
# markup, punctuation, numbers, stop words, and extra whitespace.
clean.corpus.lines <- function(corpus)
{
corpus <- tm_map(corpus, content_transformer(clean))
corpus <- tm_map(corpus, content_transformer(remove.non.utf))
corpus <- tm_map(corpus, content_transformer(removeDoubleSlash))
corpus <- tm_map(corpus, content_transformer(removeLinesWithColons))
corpus <- tm_map(corpus, content_transformer(rm_angle))
corpus <- tm_map(corpus, content_transformer(bracketX))
# corpus <- tm_map(corpus, replace_symbol)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
return(corpus)
}
remove.non.utf <- function(x)
{
x <- unlist(x)
Encoding(x) <- "UTF-8"
y <- iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by ''
return(unname(y))
}
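# Despite its name, this helper removes hyphens and any span that runs from
# the first "<" to the last "/>" in each string (the ".*" match is greedy).
removeDoubleSlash <- function(x)
{
# x<- str_remove_all(x, "\\")
x <- str_remove_all(x, "-")
x <- gsub("<.*/>", "", x)
return(x)
}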
removeLinesWithColons <- function(x)
{
content <- unlist(x)
y <- content[!str_detect(content, ":")]
z <- y[!str_detect(y, "@")]
return(unname(z))
}
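To make the effect of the header filter concrete, here is a small sketch on a made-up message. It uses base R `grepl()` with `fixed = TRUE` as a stand-in for `str_detect()`, and both the sample lines and the helper name `drop_header_like_lines` are assumptions for illustration only, not part of the project code.

```r
# Hypothetical toy message: three header lines, a blank line, three body lines.
toy_email <- c(
  "From: someone@example.com",
  "Subject: Life Insurance - Why Pay More?",
  "MIME-Version: 1.0",
  "",
  "Save up to 70 percent on your policy",
  "Offer ends at 5:00 pm today",
  "Reply to offers@example.com"
)

# Illustrative helper (not from the original code): drop every line that
# contains a colon or an "@", mirroring the two filters in removeLinesWithColons.
drop_header_like_lines <- function(lines) {
  keep <- !grepl(":", lines, fixed = TRUE) & !grepl("@", lines, fixed = TRUE)
  lines[keep]
}

kept <- drop_header_like_lines(toy_email)
print(kept)
```

In this toy case only the blank line and the first body line survive. The body lines that mention a time or an email address are dropped along with the headers, which is the kind of body data loss described above.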
Using RTextTools, I chose to tag the ham and spam emails ahead of time so that I could measure how far “off” the predictor models were. I also chose to reduce the sparseness as documented in Automated Data Collection with R, Chapter 10. I was able to use only the MAXENT and SVM models in my analysis because Random Forest would fail with the R error “protect(): protection stack overflow”.
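As a rough illustration of the sparseness threshold `1 - (10 / n)` used in the code that follows, the sketch below works through the arithmetic for the two corpus sizes reported earlier (2551 and 500) and applies the same kind of cutoff to a tiny made-up count matrix in base R. The toy matrix and the helper `keep_terms` are assumptions; the real `removeSparseTerms()` works on a `DocumentTermMatrix`, and this only mimics its documented rule of keeping terms whose share of empty documents is below the threshold.

```r
# Thresholds implied by 1 - (10 / n) for the two corpora.
n_ham  <- 2551
n_spam <- 500
sparse_ham  <- 1 - (10 / n_ham)
sparse_spam <- 1 - (10 / n_spam)
print(round(c(ham = sparse_ham, spam = sparse_spam), 4))

# Hypothetical document-term counts: 5 documents (rows) x 3 terms (columns).
toy <- matrix(
  c(1, 0, 2,
    1, 0, 0,
    3, 0, 0,
    1, 1, 0,
    2, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("free", "insurance", "meeting"))
)

# Illustrative helper: keep terms whose fraction of empty documents
# is strictly below `sparse`.
keep_terms <- function(m, sparse) {
  empty_share <- colMeans(m == 0)
  m[, empty_share < sparse, drop = FALSE]
}

# "meeting" is empty in 4 of 5 documents (0.8), so a 0.7 cutoff drops it.
reduced <- keep_terms(toy, sparse = 0.7)
print(colnames(reduced))
```

Read this way, a threshold of `1 - (10 / n)` keeps a term only if it appears in more than roughly 10 documents of its corpus.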
#
cleaned.spam <- clean.corpus.lines(spam.email.corps)
meta(cleaned.spam, tag = "type", type ="indexed") <- c("spam")
dtm.spam <- DocumentTermMatrix(cleaned.spam)
labels <- c("spam", "ham")
#cleaned.spam.matrix <- as.matrix(cleaned.spam)
cleaned.ham <- clean.corpus.lines(ham.email.corps)
meta(cleaned.ham, tag = "type", type ="indexed") <- c("ham")
dtm.ham <- DocumentTermMatrix(cleaned.ham)
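# Reduce sparseness of the per-class matrices. Note that dtm.ham and dtm.spam
# are not used again below; combined.email.dtm is built from the sampled corpus.
dtm.ham <- removeSparseTerms(dtm.ham, 1-(10/length(cleaned.ham)))
dtm.spam <- removeSparseTerms(dtm.spam, 1-(10/length(cleaned.spam)))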
#meta(cleaned.ham, tag = "type", type ="indexed") <- c("ham")
combined.email <- c(cleaned.ham, cleaned.spam)
sampled_corpus <- list.sample(combined.email, size = 2000)
combined.email.dtm <- DocumentTermMatrix(sampled_corpus)
email_types <- unlist(meta(sampled_corpus, "type"))
container <- create_container(combined.email.dtm,
labels = email_types,
trainSize = 1:1000,
testSize = 1001:2000,
virgin = FALSE
)
svm_model <- train_model(container = container, "SVM")
svm_out <- classify_model(container, svm_model)
head(svm_out, n=15)
## SVM_LABEL SVM_PROB
## 1 ham 0.9879749
## 2 ham 0.9536875
## 3 ham 0.9929224
## 4 ham 0.9856811
## 5 ham 0.9909459
## 6 ham 0.9832534
## 7 ham 0.9914328
## 8 ham 0.9906568
## 9 spam 0.9999999
## 10 spam 0.6279801
## 11 ham 0.9966116
## 12 spam 0.9950072
## 13 ham 0.7847045
## 14 ham 0.9879979
## 15 ham 0.9848829
#tree_model <- train_model(container = container, "TREE")
maxent_model <- train_model(container, "MAXENT")
maxent_out <- classify_model(container, maxent_model)
head(maxent_out, n=15)
## MAXENTROPY_LABEL MAXENTROPY_PROB
## 1 ham 0.9999981
## 2 ham 1.0000000
## 3 ham 0.9999906
## 4 ham 1.0000000
## 5 ham 0.9999990
## 6 ham 0.9999980
## 7 ham 0.9999987
## 8 ham 1.0000000
## 9 spam 1.0000000
## 10 spam 0.9982343
## 11 ham 1.0000000
## 12 spam 1.0000000
## 13 ham 1.0000000
## 14 ham 0.9999891
## 15 ham 0.9998939
Comparing the predicted labels with the true labels on the 1,000 test documents, the SVM model classified 97.9% of them correctly, whereas MaxEnt classified 99.7% correctly. MaxEnt was the more accurate of the two, but both models performed well and their results were fairly similar. Providing more documents should help train the models better.
# Compare the true labels of the test documents with each model's predictions
labels_out <- data.frame(
correct_label = email_types[1001:2000],
smv = as.character(svm_out[,1]),
maxent = as.character(maxent_out[,1]),
stringsAsFactors = FALSE)
table(labels_out[,1] == labels_out[,2])
##
## FALSE TRUE
## 21 979
# First few rows
head(labels_out)
## correct_label smv maxent
## type1001 ham ham ham
## type1002 ham ham ham
## type1003 ham ham ham
## type1004 ham ham ham
## type1005 ham ham ham
## type1006 ham ham ham
# SVM prediction accuracy
prop.table(table(labels_out[,1] == labels_out[,2]))
##
## FALSE TRUE
## 0.021 0.979
# MaxEnt prediction accuracy
prop.table(table(labels_out[,1] == labels_out[,3]))
##
## FALSE TRUE
## 0.003 0.997
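The agreement tables above report overall accuracy only. As a sketch of how the same comparison could be broken down by class, the example below builds a confusion matrix with base R `table()` on made-up label vectors. The vectors `truth` and `pred` are hypothetical and are not the project's results.

```r
# Hypothetical true and predicted labels for 10 test documents (not real results).
truth <- c("ham", "ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam")
pred  <- c("ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham")

# Overall accuracy: share of documents where the prediction matches the truth.
accuracy <- mean(truth == pred)

# Confusion matrix: rows are the true class, columns are the predicted class.
confusion <- table(truth = truth, predicted = pred)

# Per-class recall: correct predictions for a class divided by its true count.
recall <- diag(confusion) / rowSums(confusion)

print(accuracy)
print(confusion)
print(round(recall, 2))
```

Applying the same `table()` call to the `correct_label` column and either the `smv` or `maxent` column of `labels_out` would show whether the misclassified documents are mostly ham or mostly spam.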