Project 4

Introduction

As part of this assignment, I downloaded two data sets from https://spamassassin.apache.org: easy_ham and spam. I chose to use the RTextTools library referenced by “Automated Data Collection with R.” I had originally planned to use both RTextTools and caret, but due to time constraints, I went with a single method for this project.

Data Loading

For the data loading process, I extracted the tarball files into two folders and loaded them into two volatile corpora using the tm library.

The ham emails provided 2551 documents, as indicated below:

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 2551

The spam emails provided 500 documents:

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 500

This is the data I used to clean and train my model for spam/ham classification.

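library(tm)          # VCorpus, DirSource, tm_map, DocumentTermMatrix
library(qdap)        # bracketX, clean
library(qdapRegex)   # rm_angle
library(stringr)     # str_detect, str_remove_all
library(rlist)       # list.sample
library(RTextTools)  # create_container, train_model, classify_model

ham.emails <- file.path("/Users/davidapolinar/Dropbox/CUNYProjects/Srping2019/Data607/Project4/easy_ham")
spam.emails <- file.path("/Users/davidapolinar/Dropbox/CUNYProjects/Srping2019/Data607/Project4/spam")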
head(dir(ham.emails))
## [1] "0001.ea7e79d3153e7469e7a9c3e0af6a357e"
## [2] "0002.b3120c4bcbf3101e661161ee7efcb8bf"
## [3] "0003.acfc5ad94bbd27118a0d8685d18c89dd"
## [4] "0004.e8d5727378ddde5c3be181df593f1712"
## [5] "0005.8c3b9e9c0f3f183ddaf7592a11b99957"
## [6] "0006.ee8b0dba12856155222be180ba122058"
head(dir(spam.emails))
## [1] "0001.bfc8d64d12b325ff385cca8d07b84288"
## [2] "0002.24b47bb3ce90708ae29d0aec1da08610"
## [3] "0003.4b3d943b8df71af248d12f8b2e7a224a"
## [4] "0004.1874ab60c71f0b31b580f313a3f6e777"
## [5] "0005.1f42bb885de0ef7fc5cd09d34dc2ba54"
## [6] "0006.7a32642f8c22bbeb85d6c3b5f3890a2c"
ham.email.corps <- VCorpus(DirSource(ham.emails))
spam.email.corps <- VCorpus(DirSource(spam.emails))

Data Cleansing Process

When I first looked at the emails, I wasn’t sure how best to prepare them for the predictor models (e.g., SVM, Random Forest, MaxEnt). Many of the emails had several sections that I did not feel were useful for analysis. For example, the header of the email below contains fields that are unlikely to be useful in a document-term matrix:

From 12a1mailbot1@web.de Thu Aug 22 13:17:22 2002
Return-Path: 12a1mailbot1@web.de
Delivered-To: zzzz@localhost.example.com
Received: from localhost (localhost [127.0.0.1]) by phobos.labs.example.com (Postfix) with ESMTP id 136B943C32 for zzzz@localhost; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received: from mail.webnote.net [193.120.211.219] by localhost with POP3 (fetchmail-5.9.0) for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received: from dd_it7 ([210.97.77.167]) by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623 for zzzz@example.com; Thu, 22 Aug 2002 13:09:41 +0100
From: 12a1mailbot1@web.de
Received: from r-smtp.korea.com - 203.122.2.197 by dd_it7 with Microsoft SMTPSVC(5.5.1775.675.6); Sat, 24 Aug 2002 09:42:10 +0900
To: dcek1a1@netsgo.com
Subject: Life Insurance - Why Pay More?
Date: Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version: 1.0
Message-ID: 0103c1042001882DD_IT7@dd_it7

While the Subject and From fields could help filter emails easily, spammers can just as easily find ways around this kind of filtering. I chose to focus mostly on the body of the email, so I eliminated the majority of these fields from my training data. I created a custom function that drops any line containing a colon or an @ sign. This may also have removed some lines from the message body, but I felt the loss would be minimal. I relied on both tm and qdap to clean out HTML tags and other characters that were not helpful. I also removed non-UTF-8 characters, as some of the libraries would fail when running through the corpus.

# Strip bracketed text only (not called in the code shown here)
clean_corpus <- function(corpus)
{
  corpus <- tm_map(corpus, content_transformer(bracketX))
  return(corpus)
}

# Full cleaning pipeline applied to each corpus
clean.corpus.lines <- function(corpus)
{
  corpus <- tm_map(corpus, content_transformer(clean))                 # qdap: escaped characters
  corpus <- tm_map(corpus, content_transformer(remove.non.utf))        # invalid UTF-8 bytes
  corpus <- tm_map(corpus, content_transformer(removeDoubleSlash))     # hyphens, self-closing tags
  corpus <- tm_map(corpus, content_transformer(removeLinesWithColons)) # header-like lines
  corpus <- tm_map(corpus, content_transformer(rm_angle))              # <...> spans
  corpus <- tm_map(corpus, content_transformer(bracketX))              # bracketed text
  # corpus <- tm_map(corpus, replace_symbol)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
# Drop any bytes that are not valid UTF-8
remove.non.utf <- function(x)
{
  x <- unlist(x)
  Encoding(x) <- "UTF-8"
  y <- iconv(x, "UTF-8", "UTF-8", sub = '') ## replace any non UTF-8 by ''
  return(unname(y))
}

# Despite its name, this removes hyphens and anything between "<" and "/>"
removeDoubleSlash <- function(x)
{
  # x <- str_remove_all(x, "\\")
  x <- str_remove_all(x, "-")
  x <- gsub("<.*/>", "", x)
  return(x)
}

# Drop every line that contains a colon or an @ sign (mostly header fields)
removeLinesWithColons <- function(x)
{
  content <- unlist(x)
  y <- content[!str_detect(content, ":")]
  z <- y[!str_detect(y, "@")]
  return(unname(z))
}
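
To make the trade-off of the colon filter concrete, here is a minimal sketch of the same rule applied to a toy message. The message text and the helper name drop_colon_and_at_lines are made up for illustration, and the helper uses base R's grepl() rather than stringr so that it runs on its own.

```r
# Toy message as a character vector of lines (hypothetical, not from the corpus)
msg <- c(
  "From: someone@example.com",
  "Subject: Hello",
  "Date: Thu, 22 Aug 2002 13:17:22",
  "",
  "Hi there,",
  "Meeting time: 3pm tomorrow",
  "Reply to me@example.com if that works",
  "Thanks"
)

# Same rule as removeLinesWithColons, written with base R only
drop_colon_and_at_lines <- function(lines) {
  lines <- lines[!grepl(":", lines, fixed = TRUE)]
  lines[!grepl("@", lines, fixed = TRUE)]
}

kept <- drop_colon_and_at_lines(msg)
print(kept)
```

The three header lines are removed as intended, but the two body lines that happen to contain a colon or an @ sign are dropped as well, which is the kind of loss mentioned above.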

Data Classification Process

Using RTextTools, I chose to tag the ham and spam emails ahead of time so I could measure how far off the predictor models were. I also chose to reduce the sparseness as documented in Automated Data Collection with R, Chapter 10. I was able to use only the MAXENT and SVM models in my analysis because Random Forest would fail with the R error "protect(): protection stack overflow".
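
The sparseness threshold used in this project is 1 - 10/n, where n is the number of documents, so a term has to appear in more than roughly 10 documents to survive. The sketch below works through that arithmetic for the two corpus sizes and then applies the same idea to a toy count matrix. The toy data and the helper keep_frequent_terms are hypothetical stand-ins, not tm's implementation, and the exact boundary handling in removeSparseTerms may differ.

```r
# Document counts reported in the Data Loading section
n_ham  <- 2551
n_spam <- 500

# Sparsity thresholds of the form 1 - 10/n
sparse_ham  <- 1 - (10 / n_ham)
sparse_spam <- 1 - (10 / n_spam)
print(round(c(ham = sparse_ham, spam = sparse_spam), 4))

# Toy document-term counts: 20 documents x 3 terms (hypothetical data)
toy <- cbind(
  common = rep(1, 20),                # appears in all 20 documents
  medium = c(rep(1, 5), rep(0, 15)),  # appears in 5 documents
  rare   = c(1, rep(0, 19))           # appears in 1 document
)

# Hypothetical helper: keep terms found in at least min_docs documents
keep_frequent_terms <- function(m, min_docs) {
  doc_freq <- colSums(m > 0)
  m[, doc_freq >= min_docs, drop = FALSE]
}

reduced <- keep_frequent_terms(toy, min_docs = 3)
print(colnames(reduced))
```

Because the cutoff is a fixed number of documents, it is proportionally stricter for the smaller spam corpus than for the ham corpus.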

# Clean each corpus and tag every document with its class
cleaned.spam <- clean.corpus.lines(spam.email.corps)
meta(cleaned.spam, tag = "type", type = "indexed") <- "spam"
dtm.spam <- DocumentTermMatrix(cleaned.spam)

cleaned.ham <- clean.corpus.lines(ham.email.corps)
meta(cleaned.ham, tag = "type", type = "indexed") <- "ham"
dtm.ham <- DocumentTermMatrix(cleaned.ham)

# Reduce sparseness: drop terms that appear in only about 10 documents or fewer.
# Note: these reduced matrices are not used below; the container is built
# from a fresh DocumentTermMatrix of the sampled corpus.
dtm.ham <- removeSparseTerms(dtm.ham, 1 - (10 / length(cleaned.ham)))
dtm.spam <- removeSparseTerms(dtm.spam, 1 - (10 / length(cleaned.spam)))

# Combine both corpora and draw 2000 documents at random
# (no seed is set, so the sample differs between runs)
combined.email <- c(cleaned.ham, cleaned.spam)
sampled_corpus <- list.sample(combined.email, size = 2000)
combined.email.dtm <- DocumentTermMatrix(sampled_corpus)
email_types <- unlist(meta(sampled_corpus, "type"))

# The first 1000 sampled documents train the models; the remaining 1000 test them
container <- create_container(combined.email.dtm,
                              labels = email_types,
                              trainSize = 1:1000,
                              testSize = 1001:2000,
                              virgin = FALSE
)

svm_model <- train_model(container = container, "SVM")
svm_out <- classify_model(container, svm_model)
head(svm_out, n=15) 
##    SVM_LABEL  SVM_PROB
## 1        ham 0.9879749
## 2        ham 0.9536875
## 3        ham 0.9929224
## 4        ham 0.9856811
## 5        ham 0.9909459
## 6        ham 0.9832534
## 7        ham 0.9914328
## 8        ham 0.9906568
## 9       spam 0.9999999
## 10      spam 0.6279801
## 11       ham 0.9966116
## 12      spam 0.9950072
## 13       ham 0.7847045
## 14       ham 0.9879979
## 15       ham 0.9848829
#tree_model <- train_model(container = container, "TREE")

maxent_model <- train_model(container, "MAXENT")
maxent_out <- classify_model(container, maxent_model)
head(maxent_out, n=15)
##    MAXENTROPY_LABEL MAXENTROPY_PROB
## 1               ham       0.9999981
## 2               ham       1.0000000
## 3               ham       0.9999906
## 4               ham       1.0000000
## 5               ham       0.9999990
## 6               ham       0.9999980
## 7               ham       0.9999987
## 8               ham       1.0000000
## 9              spam       1.0000000
## 10             spam       0.9982343
## 11              ham       1.0000000
## 12             spam       1.0000000
## 13              ham       1.0000000
## 14              ham       0.9999891
## 15              ham       0.9998939

Conclusion and Analysis

The SVM model labeled 97.9% of the 1,000 test documents correctly (979 of 1,000), while MaxEnt labeled 99.7% correctly. MaxEnt was more accurate on this split, although both models performed well. Training on more documents would likely improve both models.

labels_out <- data.frame(
  correct_label = email_types[1001:2000],
  svm = as.character(svm_out[,1]),
  maxent = as.character(maxent_out[,1]),
  stringsAsFactors = FALSE)

table(labels_out[,1] == labels_out[,2])
## 
## FALSE  TRUE 
##    21   979
# First few rows

head(labels_out)
##          correct_label svm maxent
## type1001           ham ham    ham
## type1002           ham ham    ham
## type1003           ham ham    ham
## type1004           ham ham    ham
## type1005           ham ham    ham
## type1006           ham ham    ham
# SVM prediction accuracy

prop.table(table(labels_out[,1] == labels_out[,2]))
## 
## FALSE  TRUE 
## 0.021 0.979
# MaxEnt prediction accuracy
prop.table(table(labels_out[,1] == labels_out[,3]))
## 
## FALSE  TRUE 
## 0.003 0.997
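
Overall accuracy does not show which class the errors fall in, and with ham making up about 84% of the full corpus (2551 of 3051 documents), a model could score well while still missing spam. A confusion matrix separates the two cases. The sketch below uses made-up labels for ten documents, not the project's results, and the variable names are chosen for illustration only.

```r
# Made-up labels for ten test documents (not the project's results)
actual    <- c("ham", "ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam")
predicted <- c("ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham")

# Rows are the true class, columns are the predicted class
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Overall accuracy and per-class recall
accuracy    <- sum(diag(confusion)) / sum(confusion)
spam_recall <- as.numeric(confusion["spam", "spam"]) / sum(confusion["spam", ])
ham_recall  <- as.numeric(confusion["ham", "ham"]) / sum(confusion["ham", ])
print(c(accuracy = accuracy, spam_recall = spam_recall, ham_recall = ham_recall))
```

The same table could be built from the project's own data by passing labels_out[,1] and one of the prediction columns to table().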