It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
library(tm)
## Loading required package: NLP
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
library(wordcloud)
## Loading required package: RColorBrewer
The files used in this project were downloaded from the following site: https://spamassassin.apache.org/old/publiccorpus/
The specific files that were downloaded are:
1. 20021010_easy_ham.tar.bz2
2. 20021010_hard_ham.tar.bz2
3. 20021010_spam.tar.bz2
These files were saved in a folder on my desktop. The working directory was set to the location of the files.
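The download and extraction can also be scripted. The sketch below is only an illustration: it assumes the archives are still served under the URL above with `.tar.bz2` names, and that each archive unpacks into a folder named after its class (`easy_ham`, `hard_ham`, `spam`). The helper names `corpus_urls` and `fetch_corpus` are hypothetical and were not part of the original project.

```r
# Hypothetical helpers for fetching the SpamAssassin archives (names are illustrative).
base_url <- "https://spamassassin.apache.org/old/publiccorpus/"
archives <- c("20021010_easy_ham.tar.bz2",
              "20021010_hard_ham.tar.bz2",
              "20021010_spam.tar.bz2")

# Build the full download URL for each archive.
corpus_urls <- function(base = base_url, files = archives) {
  paste0(base, files)
}

# Download any archive that is not already present, then unpack it into dest_dir.
fetch_corpus <- function(dest_dir = ".") {
  urls <- corpus_urls()
  for (i in seq_along(urls)) {
    target <- file.path(dest_dir, archives[i])
    if (!file.exists(target)) {
      download.file(urls[i], destfile = target, mode = "wb")
    }
    untar(target, exdir = dest_dir)
  }
  invisible(list.dirs(dest_dir, recursive = FALSE))
}

# fetch_corpus()  # uncomment to download (requires network access)
```

Calling `fetch_corpus()` from the working directory should leave the unpacked folders next to the archives, which is the layout the `DirSource()` calls later in this write-up expect.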
getwd()
## [1] "/Users/juanelle/Desktop/MSDS/Data607/week 11"
#setwd("/Users/juanelle/Desktop/MSDS/Data607/week 11")
Document classification falls within the realm of text mining. This is a relatively new area for me as a beginner data scientist. After much research, I learnt that there are six main steps to document classification:
1. Create the corpus
2. Preprocess the corpus
3. Prepare the document-term matrix
4. Prepare features and labels for the model
5. Create “train” and “test” data
6. Run and test the model
These were the steps followed in this project.
spam <- VCorpus(DirSource("spam", encoding = "UTF-8"), readerControl = list(language="en"))
easy_ham <- VCorpus(DirSource("easy_ham",encoding = "UTF-8"), readerControl = list(language="en"))
hard_ham <- VCorpus(DirSource("hard_ham", encoding = "UTF-8"), readerControl = list(language="en"))
#A peek at the content of a spam and easy_ham document
#writeLines(as.character(spam[[30]]))
#writeLines(as.character(easy_ham[[30]]))
#Add meta labels
meta(spam, tag = "type") <- "spam"
meta(easy_ham, tag = "type") <- "ham"
meta(hard_ham, tag = "type") <- "ham"
# Combine all three corpora into one
combined_emails <- c(spam, easy_ham, hard_ham)
#meta(spam[[1]])
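To check that the `type` labels survive the merge, the same labelling and combining calls can be tried on a tiny in-memory corpus. This is a minimal sketch: the three example messages and the names `toy_spam`, `toy_ham` and `toy` are made up for illustration.

```r
library(tm)

# Two tiny corpora built from character vectors instead of folders of files.
toy_spam <- VCorpus(VectorSource(c("win cash now", "cheap pills online")))
toy_ham  <- VCorpus(VectorSource(c("meeting moved to noon")))

# Attach a class label to every document, as done for the real corpora.
meta(toy_spam, tag = "type") <- "spam"
meta(toy_ham,  tag = "type") <- "ham"

# Combine the corpora and read the labels back as a plain vector.
toy <- c(toy_spam, toy_ham)
toy_labels <- unlist(meta(toy, tag = "type"))
table(toy_labels)
```

The label vector comes back in document order, which is what later allows it to be lined up with the rows of the document-term matrix.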
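In this step the corpus was cleaned: the text was converted to a consistent encoding and to lowercase, and numbers, common stop words, punctuation and extra whitespace were removed, since this kind of noise can negatively affect the results.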
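To see what this kind of cleaning does to a single message, here is a small sketch on a made-up sentence. It uses the same `tm` and `stringr` calls as the pipeline that follows, but the example text and the names `toy` and `cleaned` are assumptions for illustration only.

```r
library(tm)
library(stringr)

# A one-document corpus containing an invented spam-like sentence.
toy <- VCorpus(VectorSource("Click HERE to claim your 1000 free   prizes, today!"))

toy <- tm_map(toy, content_transformer(tolower))          # lowercase everything
toy <- tm_map(toy, removeNumbers)                         # drop digits
toy <- tm_map(toy, removeWords, words = stopwords("en"))  # drop common English words
toy <- tm_map(toy, content_transformer(function(x) str_replace_all(x, "[[:punct:]]|<|>", " ")))
toy <- tm_map(toy, stripWhitespace)                       # collapse repeated whitespace

cleaned <- as.character(toy[[1]])
cleaned
```

Printing `cleaned` next to the original sentence makes it easy to judge whether a step removes too much or too little before running it over the whole corpus.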
combined_emails <- tm_map(combined_emails, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))
combined_emails <- tm_map(combined_emails, content_transformer(tolower))
combined_emails <- tm_map(combined_emails, removeNumbers) # remove numbers as these are not of interest
combined_emails <- tm_map(combined_emails, removeWords, words = stopwords("en")) # remove common words such as "a", "an" and "the"
combined_emails <- tm_map(combined_emails, content_transformer(function(x) str_replace_all(x, "[[:punct:]]|<|>", " ")))
combined_emails <- tm_map(combined_emails, stripWhitespace) # collapse runs of whitespace into a single space
#reduce and randomise corpus
combined_emails <- sample(combined_emails, 750) #reduce and randomise corpus
#combined_emails #randomised version
writeLines(as.character(combined_emails[[30]])) # a peek at the corpus
## fork admin xent com thu sep
## return path fork admin xent com
## delivered yyyy localhost example com
## received localhost jalapeno
## jmason org postfix esmtp id bcff
## jm localhost thu sep + ist
## received jalapeno
## localhost imap fetchmail
## jm localhost single drop thu sep + ist
## received xent com dogma slashnull org
## esmtp id gdz jm jmason org
## thu sep +
## received lair xent com localhost xent com postfix
## esmtp id ece wed sep pdt
## delivered fork example com
## received crank slack net slack net xent com
## postfix esmtp id caa fork xent com wed
## sep pdt
## received crank slack net postfix userid id adcedf
## thu sep edt
## received localhost localhost crank slack net
## postfix esmtp id aedad thu sep edt
## tom tomwhore slack net
## mr fork fork list hotmail com
## cc fork fork xent com
## subject re cd player ui toddlers
## reply davxoivcnomkrwdvtbce hotmail com
## message id pine bso crank slack net
## mime version
## content type text plain charset=us ascii
## sender fork admin xent com
## errors fork admin xent com
## x beenthere fork example com
## x mailman version
## precedence bulk
## list help mailto fork request xent com subject=help
## list post mailto fork example com
## list subscribe http xent com mailman listinfo fork mailto fork request xent com subject=subscribe
## list id friends rohit khare fork xent com
## list unsubscribe http xent com mailman listinfo fork
## mailto fork request xent com subject=unsubscribe
## list archive http xent com pipermail fork
## date thu sep edt
## x spam status hits= required=
## tests=awl in rep to known mailing list spam phrase
## user agent pine
## version= cvs
## x spam level
##
## wed sep mr fork wrote
## d mp player solid state storage instant
##
##
## getting new media bit reach kindala cd
## solution hand em disc goes
##
## tradeoffs abound
##
## heather got cd player even though crappy
## handmedown worked great batterys poping bad bad ui
## next one store bought audio player mp
## decoders yet wanted bottom line volt momala put
## kabash anything costing bucks heck scrounge ebay
## get palm m bucks
##
## hitch new music upshot spend time going usenet
## listing togther
##
## happy family
##
## now benjamin yea id love something like amazingly cool
## fisher price first cd casset vasectomy dirtybomb products perhaps
## first cd might work time let ebay walking
A random sample of 750 emails was drawn from the combined corpus.
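Because the sample is random, the counts and accuracies reported here will change from run to run unless a seed is fixed first. The sketch below shows the idea on a stand-in vector of IDs; the seed value `607` and the vector `email_ids` are arbitrary assumptions, not values used in the original run.

```r
# Stand-in for the corpus: a vector of made-up document IDs (size is arbitrary).
email_ids <- sprintf("email_%04d", 1:3000)

set.seed(607)  # arbitrary seed, chosen only for illustration
first_draw <- sample(email_ids, 750)

set.seed(607)  # resetting the same seed reproduces the same draw
second_draw <- sample(email_ids, 750)

identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

In the project itself, the equivalent would be a single `set.seed()` call placed immediately before `sample(combined_emails, 750)`.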
wordcloud(combined_emails, min.freq = 5, max.words = 300)
Check the corpus for spam/ham proportions
spam_ham_prop <- combined_emails %>%
meta(tag = "type") %>%
unlist() %>%
table()
spam_ham_prop
## .
## ham spam
## 647 103
Of the 750 randomly selected emails, 647 were labelled ham and 103 were labelled spam.
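These counts also give a useful baseline for the models later on. As a rough sketch, using the counts printed in the table above, a classifier that always answers "ham" would already be right about 86% of the time. The variable names are illustrative.

```r
# Class counts taken from the table printed above.
counts <- c(ham = 647, spam = 103)

# Share of each class in the sample.
proportions <- prop.table(counts)
round(proportions, 3)

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy <- max(proportions)
round(baseline_accuracy, 3)  # 0.863
```

Any model accuracy reported below is best read against this majority-class baseline rather than against 50%.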
#email_dtm <- DocumentTermMatrix(combined_emails)
#email_dtm
email_dtm <- combined_emails %>%
DocumentTermMatrix() %>%
removeSparseTerms(1-(10/length(combined_emails)))
email_labels <- unlist(meta(combined_emails, "type"))
#email_dtm
#table(email_labels)
findFreqTerms(email_dtm, 1000)
## [1] "admin" "aug" "border=" "click" "cnet"
## [6] "com" "content" "date" "dogma" "esmtp"
## [11] "example" "exmh" "font" "fork" "freshrpms"
## [16] "gif" "height=" "href=" "http" "img"
## [21] "ist" "jmason" "list" "lists" "localhost"
## [26] "mailman" "mailto" "message" "mon" "nbsp"
## [31] "net" "oct" "online" "org" "postfix"
## [36] "received" "request" "rpm" "sep" "size="
## [41] "slashnull" "sourceforge" "spam" "src=" "table"
## [46] "text" "thu" "users" "version" "width="
## [51] "width=d" "www" "xent"
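The sparsity threshold passed to `removeSparseTerms()` above, `1 - 10/750` (about 0.987), keeps only terms that occur in more than roughly 10 of the 750 emails. The effect is easier to see on a toy example; the four documents below are invented, and a much lower threshold of 0.5 is used so that the pruning is visible.

```r
library(tm)

# Four invented documents; "free" occurs in three of them, "meeting" in two.
docs <- c("free money now", "free offer", "free lunch meeting", "meeting notes")
toy_dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# With sparse = 0.5, a term is kept only if it appears in more than
# 4 * (1 - 0.5) = 2 documents.
small_dtm <- removeSparseTerms(toy_dtm, 0.5)

Terms(toy_dtm)    # all terms before pruning
Terms(small_dtm)  # with these toy documents only "free" should remain
```

A higher threshold keeps more rare terms and a larger matrix; a lower one keeps only the most widespread terms.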
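The dataset was split roughly 80/20: the first 600 of the 750 emails were used as the training set and the remainder as the test set. Note that the test range in the code starts at 600, so email 600 falls in both sets and the test set contains 151 emails; starting the test range at 601 would remove this overlap.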
emails_container <- create_container(email_dtm, labels = email_labels, trainSize = 1:600, testSize = 600:length(email_labels), virgin = FALSE)
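The index ranges are worth a quick sanity check, because `1:600` and `600:length(email_labels)` share one element. The sketch below uses plain integer vectors with the sample size of 750; the variable names are illustrative.

```r
n_docs <- 750

train_idx       <- 1:600
test_as_written <- 600:n_docs  # range used in the call above
test_disjoint   <- 601:n_docs  # alternative with no shared documents

intersect(train_idx, test_as_written)        # 600
length(test_as_written)                      # 151
length(intersect(train_idx, test_disjoint))  # 0
length(test_disjoint)                        # 150
```

If the test range is changed to start at 601, the slice `email_labels[600:length(email_labels)]` used later to build the results data frame would need the same change so that labels and predictions stay aligned.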
Three randomly selected models were used: Support Vector Machines (SVM), a tree-based classifier (TREE) and Maximum Entropy (MAXENT).
svm_model <- train_model(emails_container, "SVM")
tree_model <- train_model(emails_container, "TREE")
maxent_model <- train_model(emails_container, "MAXENT")
svm_classified <- classify_model(emails_container, svm_model)
tree_classified <- classify_model(emails_container, tree_model)
maxent_classified <- classify_model(emails_container, maxent_model)
# Collect the true labels and each model's predicted label side by side
classified_DF <- data.frame(
  label = email_labels[600:length(email_labels)],
  svm = svm_classified[,1],
  tree = tree_classified[,1],
  maxent = maxent_classified[,1],
  stringsAsFactors = FALSE)
#head(classified_DF)
# How did the SVM model perform?
prop.table(table(classified_DF[,1] == classified_DF[,2]))
##
## FALSE TRUE
## 0.05960265 0.94039735
# How did the tree-based model perform?
prop.table(table(classified_DF[,1] == classified_DF[,3]))
##
## FALSE TRUE
## 0.01324503 0.98675497
# How did MAXENT model perform?
prop.table(table(classified_DF[,1] == classified_DF[,4]))
##
## FALSE TRUE
## 0.0397351 0.9602649
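Overall accuracy does not show which kind of mistake a model makes, which matters when ham heavily outnumbers spam. A confusion matrix separates the two. The sketch below uses six invented labels rather than the project's results; with the real data the same idea would apply to columns such as `classified_DF$label` and `classified_DF$svm`.

```r
# Invented true and predicted labels, for illustration only.
lv        <- c("ham", "spam")
actual    <- factor(c("ham", "ham", "ham", "spam", "spam", "ham"), levels = lv)
predicted <- factor(c("ham", "ham", "spam", "spam", "ham", "ham"), levels = lv)

# Rows are the true class, columns are the predicted class.
cm <- table(actual = actual, predicted = predicted)
cm

accuracy       <- sum(diag(cm)) / sum(cm)                  # share of correct predictions
spam_recall    <- cm["spam", "spam"] / sum(cm["spam", ])   # share of real spam that was caught
spam_precision <- cm["spam", "spam"] / sum(cm[, "spam"])   # share of spam calls that were right
```

For a spam filter, low precision means legitimate mail is being flagged, while low recall means spam is getting through, so the two are usually reported together.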
All three models classified the test documents with at least 94% accuracy: about 94.0% for SVM, 96.0% for MAXENT and 98.7% for the tree-based model. Of the three models used in this project, the tree-based model therefore performed best at classifying the documents as spam or ham. It would be interesting to conduct more research on these models (their individual characteristics) to better understand how they influenced the results in this project.