Project Four Question

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

Load Relevant Packages

library(tm)
## Loading required package: NLP
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(RTextTools)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
library(wordcloud)
## Loading required package: RColorBrewer

Set working directory

The files used in this project were downloaded from the following site: https://spamassassin.apache.org/old/publiccorpus/

The specific files that were downloaded are:

1. 20021010_easy_ham.tar.bz2
2. 20021010_hard_ham.tar.bz2
3. 20021010_spam.tar.bz2

These files were saved in a folder on my desktop. The working directory was set to the location of the files.

getwd()
## [1] "/Users/juanelle/Desktop/MSDS/Data607/week 11"
#setwd("/Users/juanelle/Desktop/MSDS/Data607/week 11")
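
For reproducibility, the archives could also be fetched and unpacked directly from R. A minimal sketch, assuming the working directory above (untar can read .tar.bz2 archives directly):

base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
archives <- c("20021010_easy_ham.tar.bz2", "20021010_hard_ham.tar.bz2", "20021010_spam.tar.bz2")
for (f in archives) {
  download.file(paste0(base_url, f), destfile = f, mode = "wb")
  untar(f) # creates the easy_ham/, hard_ham/ and spam/ folders used below
}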

Document classification falls within the realm of text mining. This is a relatively new area for me as a beginner data scientist. After much research, I learned that there are six main steps to document classification:

1. Create corpus
2. Preprocess corpus
3. Prepare document-term matrix
4. Prepare features and labels for the model
5. Create "train" and "test" data
6. Run and test the model

These were the steps followed in this project.

Step One:

Create Individual and Combined Corpus

spam <- VCorpus(DirSource("spam", encoding = "UTF-8"), readerControl = list(language="en"))

easy_ham <- VCorpus(DirSource("easy_ham",encoding = "UTF-8"), readerControl = list(language="en"))

hard_ham <- VCorpus(DirSource("hard_ham", encoding = "UTF-8"), readerControl = list(language="en"))

#A peek at the content of a spam and easy_ham document
#writeLines(as.character(spam[[30]]))
#writeLines(as.character(easy_ham[[30]]))


#Add meta labels
meta(spam, tag = "type") <- "spam"
meta(easy_ham, tag = "type") <- "ham"
meta(hard_ham, tag = "type") <- "ham"


# Combine all corpora into a single corpus
combined_emails <- c(spam, easy_ham, hard_ham)


#meta(spam[[1]])
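
As a quick sanity check, the metadata tags can be tallied to confirm each corpus was labelled as intended (illustrative):

table(unlist(meta(spam, tag = "type")))            # should be all "spam"
table(unlist(meta(combined_emails, tag = "type"))) # spam vs. ham counts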

Step Two:

Preprocessing: Clean and randomise the combined email corpus

This is the step in which the text was cleaned, since dirty data can negatively affect the results: the messages were lower-cased, and numbers, stop words, punctuation and extra whitespace were removed.

combined_emails <- tm_map(combined_emails, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))
combined_emails <- tm_map(combined_emails, content_transformer(tolower))
combined_emails <- tm_map(combined_emails, removeNumbers) # remove numbers as these are not of interest
combined_emails <- tm_map(combined_emails, removeWords, words = stopwords("en")) # remove common stop words such as "a", "an" and "the"
combined_emails <- tm_map(combined_emails, content_transformer(function(x) str_replace_all(x, "[[:punct:]]|<|>", " "))) # replace punctuation (and < >) with spaces

combined_emails <- tm_map(combined_emails, stripWhitespace) # collapse repeated whitespace
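
Note that sample() draws a different subset on each run, so the results below will vary between runs. Setting a seed first makes the 750-email sample reproducible; an illustrative one-liner (the seed value is arbitrary):

set.seed(607) # hypothetical seed; any fixed value makes the sample below reproducible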

combined_emails <- sample(combined_emails, 750) # reduce and randomise the corpus

#combined_emails #randomised version

writeLines(as.character(combined_emails[[30]])) # a peek at the corpus
##  fork admin xent com thu sep 
## return path fork admin xent com 
## delivered yyyy localhost example com
## received localhost jalapeno 
##  jmason org postfix esmtp id bcff
##  jm localhost thu sep + ist 
## received jalapeno 
##  localhost imap fetchmail 
##  jm localhost single drop thu sep + ist 
## received xent com dogma slashnull org
##  esmtp id gdz jm jmason org 
##  thu sep +
## received lair xent com localhost xent com postfix 
##  esmtp id ece wed sep pdt 
## delivered fork example com
## received crank slack net slack net xent com
##  postfix esmtp id caa fork xent com wed 
##  sep pdt 
## received crank slack net postfix userid id adcedf 
##  thu sep edt 
## received localhost localhost crank slack net
##  postfix esmtp id aedad thu sep edt 
##  tom tomwhore slack net 
##  mr fork fork list hotmail com 
## cc fork fork xent com 
## subject re cd player ui toddlers
##  reply davxoivcnomkrwdvtbce hotmail com 
## message id pine bso crank slack net 
## mime version 
## content type text plain charset=us ascii
## sender fork admin xent com
## errors fork admin xent com
## x beenthere fork example com
## x mailman version 
## precedence bulk
## list help mailto fork request xent com subject=help 
## list post mailto fork example com 
## list subscribe http xent com mailman listinfo fork mailto fork request xent com subject=subscribe 
## list id friends rohit khare fork xent com 
## list unsubscribe http xent com mailman listinfo fork 
##  mailto fork request xent com subject=unsubscribe 
## list archive http xent com pipermail fork 
## date thu sep edt 
## x spam status hits= required= 
##  tests=awl in rep to known mailing list spam phrase 
##  user agent pine
##  version= cvs
## x spam level 
## 
##  wed sep mr fork wrote 
##  d mp player solid state storage instant 
## 
## 
## getting new media bit reach kindala cd
## solution hand em disc goes 
## 
## tradeoffs abound 
## 
## heather got cd player even though crappy
## handmedown worked great batterys poping bad bad ui
##  next one store bought audio player mp
## decoders yet wanted bottom line volt momala put
##  kabash anything costing bucks heck scrounge ebay
##  get palm m bucks 
## 
##  hitch new music upshot spend time going usenet
## listing togther 
## 
##  happy family 
## 
## now benjamin yea id love something like amazingly cool
## fisher price first cd casset vasectomy dirtybomb products perhaps
##  first cd might work time let ebay walking

750 emails were randomly selected from the combined corpus. A word cloud gives a quick look at the most frequent terms in the sample:

wordcloud(combined_emails, min.freq = 5, max.words = 300)
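
The call above uses the package defaults. A slightly fancier variant (illustrative) plots the most frequent words first and colours them with an RColorBrewer palette, RColorBrewer having already been loaded alongside wordcloud:

wordcloud(combined_emails, min.freq = 5, max.words = 300,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))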

Check Corpus for spam/ham proportions

spam_ham_prop <- combined_emails %>%
  meta(tag = "type") %>%
  unlist() %>%
  table() 
spam_ham_prop
## .
##  ham spam 
##  647  103

Of the 750 randomly selected emails, 647 were labelled ham and 103 were labelled spam.
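
Expressed as proportions, that is roughly 86% ham (647/750) and 14% spam (103/750). The table can be normalised directly:

round(prop.table(spam_ham_prop), 2) # ham ~0.86, spam ~0.14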


Step Three:

Create Document Term Matrix

#email_dtm <- DocumentTermMatrix(combined_emails)
#email_dtm

email_dtm <- combined_emails %>% 
  DocumentTermMatrix() %>% 
  removeSparseTerms(1-(10/length(combined_emails))) # keep only terms appearing in at least ~10 of the 750 documents
email_labels <- unlist(meta(combined_emails, "type"))
#email_dtm
#table(email_labels)
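
Before modelling, it is worth confirming that the matrix and the label vector line up. A quick check (the exact term count depends on the random sample):

dim(email_dtm)       # documents x retained terms
length(email_labels) # should equal nrow(email_dtm), i.e. 750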

A glimpse of the most frequent terms in the DTM (those occurring at least 1,000 times in total):

findFreqTerms(email_dtm, 1000)
##  [1] "admin"       "aug"         "border="     "click"       "cnet"       
##  [6] "com"         "content"     "date"        "dogma"       "esmtp"      
## [11] "example"     "exmh"        "font"        "fork"        "freshrpms"  
## [16] "gif"         "height="     "href="       "http"        "img"        
## [21] "ist"         "jmason"      "list"        "lists"       "localhost"  
## [26] "mailman"     "mailto"      "message"     "mon"         "nbsp"       
## [31] "net"         "oct"         "online"      "org"         "postfix"    
## [36] "received"    "request"     "rpm"         "sep"         "size="      
## [41] "slashnull"   "sourceforge" "spam"        "src="        "table"      
## [46] "text"        "thu"         "users"       "version"     "width="     
## [51] "width=d"     "www"         "xent"

Steps Four and Five:

Create container and designate training vs testing

The dataset was split 80/20: the first 600 documents (80%) form the training set and the remaining 150 (20%) the test set. Setting virgin = FALSE tells RTextTools that the true labels of the test documents are known, so the classification can be scored.

emails_container <- create_container(email_dtm, labels = email_labels, trainSize = 1:600, testSize = 601:length(email_labels), virgin = FALSE)

Step Six:

Run and Test models

Train model

Three of the algorithms available in RTextTools were selected: Support Vector Machines (SVM), Tree-based (TREE) and Maximum Entropy (MAXENT).

svm_model <- train_model(emails_container, "SVM")
tree_model <- train_model(emails_container, "TREE")
maxent_model <- train_model(emails_container, "MAXENT")

Classification using the three models

svm_classified <- classify_model(emails_container, svm_model)
tree_classified <- classify_model(emails_container, tree_model)
maxent_classified <- classify_model(emails_container, maxent_model)

Results of classification

classified_DF <- data.frame(
  label = email_labels[601:length(email_labels)],
  svm = svm_classified[,1],
  tree = tree_classified[,1],
  maxent = maxent_classified[,1],
  stringsAsFactors = F)
#head(classified_DF)

A look at how each model performed

SVM

# How did the SVM model perform?
prop.table(table(classified_DF[,1] == classified_DF[,2]))
## 
##      FALSE       TRUE 
## 0.05960265 0.94039735

Tree-Based

# How did the tree-based model perform?
prop.table(table(classified_DF[,1] == classified_DF[,3]))
## 
##      FALSE       TRUE 
## 0.01324503 0.98675497

MAXENT

# How did the MAXENT model perform?
prop.table(table(classified_DF[,1] == classified_DF[,4]))
## 
##     FALSE      TRUE 
## 0.0397351 0.9602649
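
Because ham heavily outnumbers spam in this sample, overall accuracy can mask poor performance on the minority class. RTextTools' create_analytics() gives a per-label precision/recall breakdown; a sketch:

analytics <- create_analytics(emails_container,
                              cbind(svm_classified, tree_classified, maxent_classified))
summary(analytics) # precision, recall and F-scores per algorithm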

Discussion

All three models classified the documents with roughly 94% accuracy or better. Of the three, the tree-based model performed best at classifying the documents as spam or ham (98.7% of test documents correct), followed by MAXENT (96.0%) and SVM (94.0%). It will be interesting to research these models' individual characteristics further to better understand how they influenced the results in this project.
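
Finally, the trained models could also score genuinely new messages (for example, from one's own spam folder), as the project prompt suggests. A rough sketch, assuming a hypothetical folder "new_emails" of raw messages that receives the same cleaning as Step Two:

new_corpus <- VCorpus(DirSource("new_emails", encoding = "UTF-8"))
# ...apply the same tm_map cleaning steps used in Step Two...
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(email_dtm))) # align vocabulary with the training DTM
new_container <- create_container(new_dtm, labels = rep(0, nrow(new_dtm)),
                                  testSize = 1:nrow(new_dtm), virgin = TRUE) # virgin = TRUE: true labels unknown
classify_model(new_container, svm_model)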