It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

library(tm)
## Loading required package: NLP
library(knitr)
library(plyr)
library(wordcloud)
## Loading required package: RColorBrewer
library(SnowballC)
library(RTextTools)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## 
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
## 
##     getStemLanguages, wordStem
library(stringr)

The first thing we want to do is load the data. Using this data, we want to construct a corpus and come up with a way to classify the documents as spam or not spam.

Let's define the paths where the data will be stored on our local machine

easy_ham<-"/Users/vinicioharo/Desktop/DATA Science SPS/DATA 607/Week 10/corpus/easy_ham_2"
hard_ham<-"/Users/vinicioharo/Desktop/DATA Science SPS/DATA 607/Week 10/corpus/hard_ham"
spam<-"/Users/vinicioharo/Desktop/DATA Science SPS/DATA 607/Week 10/corpus/spam_2"

These are the URLs where the spam/ham archives are located. We are going to select the most recent spam set, which is spam_2, along with the easy and hard ham sets:

spam: http://spamassassin.apache.org/old/publiccorpus/20030228_spam_2.tar.bz2
easy_ham: http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2
hard_ham: http://spamassassin.apache.org/old/publiccorpus/20030228_hard_ham.tar.bz2
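As a convenience, the archives can also be fetched and extracted from within R. This is a minimal sketch, not the workflow used in this write-up; the destination folder "corpus" and the use of tempdir() are assumptions, so adjust paths to your own machine.

# Sketch only: download and extract the three archives.
base_url <- "http://spamassassin.apache.org/old/publiccorpus/"
archives <- c("20030228_spam_2.tar.bz2",
              "20030228_easy_ham_2.tar.bz2",
              "20030228_hard_ham.tar.bz2")
for (a in archives) {
  dest <- file.path(tempdir(), a)
  download.file(paste0(base_url, a), destfile = dest)
  untar(dest, exdir = "corpus")  # creates spam_2/, easy_ham_2/, hard_ham/
}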

We can establish a directory source for each folder

easy_ham_1<-DirSource(easy_ham)
#Encoding(easy_ham_1) <- "latin1"
hard_ham_1<-DirSource(hard_ham)
#Encoding(hard_ham_1) <- "latin1"
spam_1<-DirSource(spam)
#Encoding(spam_1) <- "latin1"

This part proved to be more difficult for me. After several iterations, I ran into problems with the encoding of the documents, which stopped me from building a document-term matrix or even using the tm_map function. Reading and processing the corpus with the following sequence of steps gets me past that issue. We can also turn the data into a proper data frame, which prevents errors involving type list or type character.
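Before building the corpus, it can help to see how many files actually contain invalid bytes. A small sketch using base R only (it assumes the `spam` directory path defined above):

# Sketch: count files in the spam folder whose text is not valid UTF-8.
bad <- Filter(function(f) any(!validUTF8(readLines(f, warn = FALSE))),
              list.files(spam, full.names = TRUE))
length(bad)  # files that would need re-encoding (e.g. via iconv)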

Get the spam

spam_corpus <- VCorpus(spam_1, readerControl=list(reader=readPlain))
#spam_corpus <- sapply(spam_corpus,function(row) iconv(row, "latin1", "ASCII", sub=""))
length(spam_1)
## [1] 1397

Get the easy ham

easyham_corpus <- VCorpus(easy_ham_1, readerControl=list(reader=readPlain))
#easyham_corpus <- sapply(easyham_corpus,function(row) iconv(row, "latin1", "ASCII", sub=""))
length(easy_ham_1)
## [1] 1401

Get the hard ham

hardham_corpus <- VCorpus(hard_ham_1, readerControl=list(reader=readPlain))
#hardham_corpus <- sapply(hardham_corpus,function(row) iconv(row, "latin1", "ASCII", sub=""))
length(hard_ham_1)
## [1] 251

We now need to attach metadata labels in order to identify each document as spam, easy ham, or hard ham

meta(spam_corpus, "filter") <- "spam"
meta(easyham_corpus, "filter") <- "easy ham"
meta(hardham_corpus, "filter") <- "hard ham"

We will partition the classification process into two parts:

Part 1) easy ham vs. spam
Part 2) hard ham vs. spam

We will create the corpora for both parts in the same sequence of steps; anything labeled with a "B" pertains to Part 2. Let's proceed to Part 1.

corpusA<-c(spam_corpus, easyham_corpus)
#summary(corpusA)
corpusB<-c(spam_corpus, hardham_corpus)

We can take a random sample of our corpus, which shuffles the spam and ham documents together

set.seed(1)
tdf_corpus = sample(corpusA)
head(meta(tdf_corpus, "filter"))
##        filter
## 743      spam
## 1041     spam
## 1602 easy ham
## 2539 easy ham
## 564      spam
## 2510 easy ham
set.seed(1)
tdf_corpusB = sample(corpusB)
head(meta(tdf_corpusB, "filter"))
##        filter
## 438      spam
## 613      spam
## 943      spam
## 1495 hard ham
## 332      spam
## 1477 hard ham

Let's now prepare the shuffled corpora for a document-term matrix, first normalizing the encoding and stripping special characters

tdf_corpus2<-tm_map(tdf_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
tdf_corpusB2<-tm_map(tdf_corpusB, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
tdf_corpus3 <- tm_map(tdf_corpus2, removeSpecialChars)
tdf_corpusB3 <- tm_map(tdf_corpusB2, removeSpecialChars)

Applying plain functions through tm_map strips the corpus down to plain character data, so this code restores it to a proper corpus, allowing us to build the document-term matrix

tdf_corpus4 <- tm_map(tdf_corpus3, PlainTextDocument) #This action restores the corpus.
tdf_corpusB4 <- tm_map(tdf_corpusB3, PlainTextDocument) #This action restores the corpus.
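An alternative I did not pursue here: wrapping plain string functions in content_transformer() keeps tm_map returning proper PlainTextDocuments, so no restore step is needed. A one-line sketch:

# Sketch: content_transformer() preserves the corpus structure.
tdf_alt <- tm_map(tdf_corpus2, content_transformer(removeSpecialChars))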

With the corpora restored, we clean them for both parts as follows:

-remove numbers
-replace punctuation with spaces
-remove stop words
-convert to lower case
-stem each word

and then build the term-document matrix.

#Easy ham vs spam
  tdf_corpus4 <- tm_map(tdf_corpus4, removeNumbers)                                                # drop digits
  tdf_corpus4 <- tm_map(tdf_corpus4, str_replace_all, pattern = "[[:punct:]]", replacement = " ")  # punctuation -> spaces
  tdf_corpus4 <- tm_map(tdf_corpus4, removeWords, words = stopwords("en"))                         # drop English stop words
  tdf_corpus4 <- tm_map(tdf_corpus4, tolower)                                                      # lower-case
  tdf_corpus4 <- tm_map(tdf_corpus4, stemDocument)                                                 # reduce words to stems
  tdf_corpus4 <- tm_map(tdf_corpus4, PlainTextDocument)                                            # restore corpus structure
#hard ham vs spam
  tdf_corpusB4 <- tm_map(tdf_corpusB4, removeNumbers)
  tdf_corpusB4 <- tm_map(tdf_corpusB4, str_replace_all, pattern = "[[:punct:]]", replacement = " ")
  tdf_corpusB4 <- tm_map(tdf_corpusB4, removeWords, words = stopwords("en"))
  tdf_corpusB4 <- tm_map(tdf_corpusB4, tolower)
  tdf_corpusB4 <- tm_map(tdf_corpusB4, stemDocument)
  tdf_corpusB4 <- tm_map(tdf_corpusB4, PlainTextDocument)
tdm <- TermDocumentMatrix(tdf_corpus4)
tdm
## <<TermDocumentMatrix (terms: 82461, documents: 2798)>>
## Non-/sparse entries: 524267/230201611
## Sparsity           : 100%
## Maximal term length: 868
## Weighting          : term frequency (tf)
tdmB <- TermDocumentMatrix(tdf_corpusB4)
tdmB
## <<TermDocumentMatrix (terms: 92317, documents: 1648)>>
## Non-/sparse entries: 429065/151709351
## Sparsity           : 100%
## Maximal term length: 868
## Weighting          : term frequency (tf)

We can reduce the dimension of the matrix by removing sparse terms. The threshold 1 - (10/N) keeps only terms that appear in roughly 10 or more documents; with N = 2798 documents that works out to a sparsity cutoff of about 0.996.

dtm <- DocumentTermMatrix(tdf_corpus4)
dtm <- removeSparseTerms(dtm, 1 - (10/length(tdf_corpus4)))
dtm
## <<DocumentTermMatrix (documents: 2798, terms: 4925)>>
## Non-/sparse entries: 398192/13381958
## Sparsity           : 97%
## Maximal term length: 73
## Weighting          : term frequency (tf)
dtmB <- DocumentTermMatrix(tdf_corpusB4)
dtmB <- removeSparseTerms(dtmB, 1 - (10/length(tdf_corpusB4)))
dtmB
## <<DocumentTermMatrix (documents: 1648, terms: 4420)>>
## Non-/sparse entries: 289050/6995110
## Sparsity           : 96%
## Maximal term length: 95
## Weighting          : term frequency (tf)

We now proceed to divide the data into a training set and a testing set. Before getting to this step, we collect the class labels from each corpus's metadata using unlist.

labels <- as.factor(unlist(meta(tdf_corpus4, "filter")[,1]))
class(labels)
## [1] "factor"
labelsB <- as.factor(unlist(meta(tdf_corpusB4, "filter")[,1]))
class(labelsB)
## [1] "factor"

Using the abilities of RTextTools, we can create a container with specific parameters for dividing the data into training and test sets. A common partition is 70% vs. 30%; with 2,798 documents in Part 1 the first 1,958 go to training, and with 1,648 documents in Part 2 the first 1,153 do.

#Easy ham vs spam
N <- length(labels)
container <- create_container(dtm,
              labels = labels,
              trainSize = 1:1958,
              testSize = 1959:N,
              virgin = F)
#hard ham vs spam
NB <- length(labelsB)
containerB <- create_container(dtmB,
              labels = labelsB,
              trainSize = 1:1153,
              testSize = 1154:NB,
              virgin = F)
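To avoid hard-coding the cut points, the same 70/30 split can be derived from N. A sketch (floor(0.7 * 2798) = 1958, matching the container above):

# Sketch: compute the 70/30 split point instead of hard-coding it.
train_end <- floor(0.7 * N)   # 1958 when N = 2798
container_alt <- create_container(dtm,
                  labels = labels,
                  trainSize = 1:train_end,
                  testSize = (train_end + 1):N,
                  virgin = FALSE)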

Now we can proceed to building some models to see if we can predict whether a document is spam. The first model type is called a support vector machine (SVM). It is a supervised learning model used for classification and regression tasks. Mathematically, it constructs a maximum-margin hyperplane between the classes, which yields the classification function.

svm_model <- train_model(container, "SVM")
svm_modelB <- train_model(containerB, "SVM")

Let's evaluate the SVM

svm_out <- classify_model(container, svm_model)
svm_outB <- classify_model(containerB, svm_modelB)
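RTextTools also bundles an analytics helper that reports per-label precision, recall, and F-scores. A sketch of applying it to the SVM results (output omitted here):

# Sketch: built-in precision/recall/F-score summary from RTextTools.
svm_analytics <- create_analytics(container, svm_out)
summary(svm_analytics)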

Output the labels and compare the predictions against the true classes of the test documents

labels_out <- data.frame(correct_label = labels[1959:N], svm = as.character(svm_out[,1]), stringsAsFactors = FALSE)
table(labels_out[,1] == labels_out[,2])
## 
## FALSE  TRUE 
##   399   441
prop.table(table(labels_out[,1] == labels_out[,2]))
## 
## FALSE  TRUE 
## 0.475 0.525

The hard ham vs. spam comparison has a poor outcome

labels_outB <- data.frame(correct_label = labelsB[1154:NB], svm = as.character(svm_outB[,1]), stringsAsFactors = FALSE)
table(labels_outB[,1] == labels_outB[,2])
## 
## FALSE  TRUE 
##   416    79
prop.table(table(labels_outB[,1] == labels_outB[,2]))
## 
##    FALSE     TRUE 
## 0.840404 0.159596

Let's try a different model: a decision tree. It performs the same classification task as the support vector machine.

tree_model <- train_model(container, "TREE")
tree_modelB <- train_model(containerB, "TREE")

Let's evaluate the tree model

tree_out <- classify_model(container, tree_model)
tree_outB <- classify_model(containerB, tree_modelB)

Output the labels

labels_out_tree <- data.frame(correct_label = labels[1959:N], tree = as.character(tree_out[,1]), stringsAsFactors = FALSE)
table(labels_out_tree[,1] == labels_out_tree[,2])
## 
## FALSE  TRUE 
##   405   435
prop.table(table(labels_out_tree[,1] == labels_out_tree[,2]))
## 
##     FALSE      TRUE 
## 0.4821429 0.5178571

On hard ham vs. spam, I am still getting the same poor result

labels_out_treeB <- data.frame(correct_label = labelsB[1154:NB], tree = as.character(tree_outB[,1]), stringsAsFactors = FALSE)
table(labels_out_treeB[,1] == labels_out_treeB[,2])
## 
## FALSE  TRUE 
##   431    64
prop.table(table(labels_out_treeB[,1] == labels_out_treeB[,2]))
## 
##     FALSE      TRUE 
## 0.8707071 0.1292929

The last model we can build is the maximum entropy model.

max_model <- train_model(container, "MAXENT")
max_modelB <- train_model(containerB, "MAXENT")

Evaluate the model

max_out <- classify_model(container, max_model)
max_outB <- classify_model(containerB, max_modelB)

Output the labels

max_out_ent <- data.frame(correct_label = labels[1959:N], max_entropy = as.character(max_out[,1]), stringsAsFactors = FALSE)
table(max_out_ent[,1] == max_out_ent[,2])
## 
## FALSE  TRUE 
##   403   437
prop.table(table(max_out_ent[,1] == max_out_ent[,2]))
## 
##     FALSE      TRUE 
## 0.4797619 0.5202381

The maximum entropy model was slightly better than the tree, though still behind the SVM. On hard ham vs. spam:

max_out_entB <- data.frame(correct_label = labelsB[1154:NB], max_entropy = as.character(max_outB[,1]), stringsAsFactors = FALSE)
table(max_out_entB[,1] == max_out_entB[,2])
## 
## FALSE  TRUE 
##   418    77
prop.table(table(max_out_entB[,1] == max_out_entB[,2]))
## 
##     FALSE      TRUE 
## 0.8444444 0.1555556

Can we draw a conclusion about which model classified best? For easy ham vs. spam, the SVM performed best, correctly labeling 52.5% of the test documents, with maximum entropy (52.0%) and the tree (51.8%) close behind. For hard ham vs. spam, all three models did poorly, and the SVM again edged out maximum entropy (16.0% vs. 15.6%). Accuracy this close to or below chance suggests something upstream may be off, for example how the labels line up with the shuffled documents, rather than a failure of the algorithms themselves.
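For a tidier comparison, RTextTools can train and score several algorithms in one pass. A sketch, with the assumption that the result columns come back as label/probability pairs in algorithm order:

# Sketch: train and classify all three algorithms together.
models  <- train_models(container, algorithms = c("SVM", "TREE", "MAXENT"))
results <- classify_models(container, models)
truth   <- as.character(labels[1959:N])
# Accuracy per algorithm; label columns assumed at positions 1, 3, 5.
sapply(c(SVM = 1, TREE = 3, MAXENT = 5),
       function(i) mean(as.character(results[, i]) == truth))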

We can also do some top-level analysis using word clouds. These give a visual cue to the type of content present in each corpus.

#Easy ham vs Spam corpus 
wordcloud(tdf_corpus4, max.words = 200, random.order = FALSE, colors=c('red'))

#hard ham vs Spam corpus 
wordcloud(tdf_corpusB4, max.words = 200, random.order = FALSE, colors=c('red'))
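The same view can be built from term frequencies directly, which also lets us inspect the top terms numerically. A sketch using the reduced dtm from Part 1 (the full tdm is too large to densify comfortably):

# Sketch: top terms by overall frequency from the reduced matrix.
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 20)  # most frequent stems across both classes
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)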