Introduction

In this project we use a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. We start with a spam/ham dataset, then predict the class of new documents, either withheld from the training dataset or drawn from another source such as your own spam folder.

Loading packages

library(usethis)
library(devtools)
# maxent and RTextTools are installed from the 'cran' GitHub mirror, since they may not be available directly via install.packages()
install_github("cran/maxent")
## Skipping install of 'maxent' from a github remote, the SHA1 (9d46c6aa) has not changed since last install.
##   Use `force = TRUE` to force installation
install_github("cran/RTextTools")
## Skipping install of 'RTextTools' from a github remote, the SHA1 (dc584154) has not changed since last install.
##   Use `force = TRUE` to force installation
library(tm)
## Loading required package: NLP
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x purrr::lift()       masks caret::lift()
library(stringr)
library(RTextTools)
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli

Reading and Prepping data

I retrieved the ham and spam emails from the SpamAssassin public corpus (https://spamassassin.apache.org/old/publiccorpus/), choosing the two archives 20021010_spam and 20021010_easy_ham, which contain spam and ham emails respectively.

spam_dir <- 'C:/Users/Udaya/Documents/Geeth/DATA607/Project4/20021010_spam/spam/'
ham_dir <- 'C:/Users/Udaya/Documents/Geeth/DATA607/Project4/20021010_easy_ham/easy_ham/'
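
If the archives are not already on disk, they can be fetched and unpacked from R directly. This is a minimal sketch, assuming the 20021010_spam.tar.bz2 and 20021010_easy_ham.tar.bz2 file names at the corpus URL above; dest_dir is a destination directory of your choosing:

# Hypothetical download-and-extract step; adjust dest_dir to your own machine
corpus_url <- 'https://spamassassin.apache.org/old/publiccorpus/'
dest_dir   <- 'C:/Users/Udaya/Documents/Geeth/DATA607/Project4'
for (archive in c('20021010_spam.tar.bz2', '20021010_easy_ham.tar.bz2')) {
  tarball <- file.path(dest_dir, archive)
  download.file(paste0(corpus_url, archive), tarball, mode = 'wb')
  # Extract into a folder named after the archive, matching spam_dir and ham_dir above
  untar(tarball, exdir = file.path(dest_dir, sub('\\.tar\\.bz2$', '', archive)))
}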

We then need to create text corpora from the files. The tidying procedures are similar to those found in the course text. For this project I use the VCorpus() function from tm, wrapping the base transformations passed to tm_map() in content_transformer().

spam <- spam_dir %>% DirSource() %>% VCorpus()
ham <- ham_dir %>% DirSource() %>% VCorpus()
meta(spam[[1]])
##   author       : character(0)
##   datetimestamp: 2019-11-19 04:33:21
##   description  : character(0)
##   heading      : character(0)
##   id           : 0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1
##   language     : en
##   origin       : character(0)
meta(ham[[1]])
##   author       : character(0)
##   datetimestamp: 2019-11-19 04:33:53
##   description  : character(0)
##   heading      : character(0)
##   id           : 0001.ea7e79d3153e7469e7a9c3e0af6a357e
##   language     : en
##   origin       : character(0)
#Now we can tidy our Corpuses
spam <- spam %>% tm_map(content_transformer(PlainTextDocument))
spam <- spam %>% tm_map(content_transformer(removePunctuation))
spam <- spam %>% tm_map(content_transformer(tolower))
spam <- spam %>% tm_map(content_transformer(removeNumbers))
spam <- spam %>% tm_map(content_transformer(stemDocument), language = 'english') # Stemming truncates words to their stems (e.g. 'received' becomes 'receiv')
spam <- spam %>% tm_map(removeWords, c('receiv', stopwords('english'))) # Drop English stopwords plus the stemmed header word 'receiv', which appears in nearly every message
ham <- ham %>% tm_map(content_transformer(PlainTextDocument))
ham <- ham %>% tm_map(content_transformer(removePunctuation))
ham <- ham %>% tm_map(content_transformer(tolower))
ham <- ham %>% tm_map(content_transformer(removeNumbers))
ham <- ham %>% tm_map(content_transformer(stemDocument), language = 'english') # Stemming truncates words to their stems, as above
ham <- ham %>% tm_map(removeWords, c('receiv', 'spamassassin', stopwords('english'))) # Also drop 'spamassassin', which is very frequent in this ham collection
ham_spam <- c(ham,spam)
#These loops attach a metadata label of Ham or Spam to every document; c() concatenates the two corpora back to back,
#so we can use their lengths to index the loops.
for(i in 1:length(ham)){
  meta(ham_spam[[i]],"classification") <- "Ham"
}
for(i in (length(ham)+1):(length(spam)+length(ham))){
  meta(ham_spam[[i]],"classification") <- "Spam"
}
for(i in 1:5){
  ham_spam <- sample(ham_spam)
}# This scrambles the corpus so it is not all ham followed by all spam
meta(ham_spam[[127]])
##   author        : character(0)
##   datetimestamp : 2019-11-19 04:33:53
##   description   : character(0)
##   heading       : character(0)
##   id            : 1731.5dc6289341b8bccf7a1b7b976f96e005
##   language      : en
##   origin        : character(0)
##   classification: Ham
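
The same cleaning pipeline is applied to both corpora, so it could equally be wrapped in a small helper. This is just a sketch of an optional refactor using the tm transformations shown above; clean_corpus and extra_stopwords are names introduced here for illustration:

# Optional helper applying the cleaning steps above to any corpus;
# extra_stopwords lets us drop corpus-specific tokens such as 'spamassassin'.
clean_corpus <- function(corpus, extra_stopwords = character(0)) {
  corpus %>%
    tm_map(content_transformer(PlainTextDocument)) %>%
    tm_map(content_transformer(removePunctuation)) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(content_transformer(removeNumbers)) %>%
    tm_map(content_transformer(stemDocument), language = 'english') %>%
    tm_map(removeWords, c('receiv', extra_stopwords, stopwords('english')))
}
# Equivalent to the explicit pipelines above:
# spam <- clean_corpus(spam)
# ham  <- clean_corpus(ham, extra_stopwords = 'spamassassin')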

Document Term Matrices

A document-term matrix lets us analyze the statistical structure of the text: each row is a document, each column is a term, and each entry counts how often that term appears in that document. After building the matrix we remove very sparse terms, keeping only those that appear in roughly ten or more documents.

spam_dtm <- spam %>% DocumentTermMatrix()
spam_dtm <- spam_dtm %>% removeSparseTerms(1-(10/length(spam)))
spam_dtm
## <<DocumentTermMatrix (documents: 501, terms: 1437)>>
## Non-/sparse entries: 58844/661093
## Sparsity           : 92%
## Maximal term length: 56
## Weighting          : term frequency (tf)

The summary of the document-term matrix is informative. We have 1437 distinct terms across 501 documents, with 58844 non-zero and 661093 sparse entries.

ham_dtm <- ham %>% DocumentTermMatrix()
ham_dtm <- ham_dtm %>% removeSparseTerms(1-(10/length(ham)))
ham_dtm
## <<DocumentTermMatrix (documents: 2551, terms: 3862)>>
## Non-/sparse entries: 314878/9537084
## Sparsity           : 97%
## Maximal term length: 68
## Weighting          : term frequency (tf)

In the ham corpus we have 3862 distinct terms across 2551 documents, with 314878 non-zero and 9537084 sparse entries.

ham_spam_dtm <- ham_spam %>% DocumentTermMatrix()
ham_spam_dtm <- ham_spam_dtm %>% removeSparseTerms(1-(10/length(ham_spam)))
ham_spam_dtm
## <<DocumentTermMatrix (documents: 3052, terms: 4556)>>
## Non-/sparse entries: 383287/13521625
## Sparsity           : 97%
## Maximal term length: 68
## Weighting          : term frequency (tf)

This matrix covers the combined corpus: 4556 retained terms across all 3052 documents (2551 ham plus 501 spam).
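
To see the count structure described above, we can inspect a small slice of the combined matrix (the exact terms shown will depend on the shuffled document order):

# Peek at the counts for the first few documents and terms
inspect(ham_spam_dtm[1:5, 1:8])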

Summary Statistics

First we will look at the spam emails and plot the ten most frequently used words in the spam corpus.

spam_freq <-  spam_dtm %>% as.matrix() %>% colSums()
length(spam_freq) #Should be the same as term count, not document count.
## [1] 1437
spam_freq_ord <- spam_freq %>% order(decreasing = TRUE)
#spam_freq_ord is a vector of the indices of spam_freq, ordered from highest word count to lowest.
par(las=1)
#This will create a bar plot of the top 10 words in the spam Corpus
barplot(spam_freq[spam_freq_ord[1:10]], horiz = TRUE, col=terrain.colors(10), cex.names=0.7)

#Spam Cloud
wordcloud(spam, max.words = 75, random.order = FALSE, random.color = TRUE,colors=palette())
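
Alongside the plots, the same top-ten list can be read off directly as a named vector, which is handy as a quick check of the bar plot:

# Ten most frequent spam terms with their counts
head(sort(spam_freq, decreasing = TRUE), 10)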

Next we look at the ham corpus, where we see more calendar references, with day and month terms among the most frequent words.

ham_freq <-  ham_dtm %>% as.matrix() %>% colSums()
length(ham_freq) #Should be the same as term count, not document count.
## [1] 3862
ham_freq_ord <- ham_freq %>% order(decreasing = TRUE)
#ham_freq_ord is a vector of the indices of ham_freq, ordered from highest word count to lowest.
par(las=1)
#This will create a bar plot of the top 10 words in the ham Corpus
barplot(ham_freq[ham_freq_ord[1:10]], horiz = TRUE,col=terrain.colors(10),cex.names=0.7)

#Ham Cloud
wordcloud(ham, max.words = 75, random.order = FALSE, random.color = TRUE,colors=palette())

Further Analysis

We begin by creating a container of the data to be fed into the models. We will then use a Support Vector Machine (SVM), a supervised learning model, to classify emails in the test set as ham or spam. This technique represents each document as a position vector and looks for a plane through the feature space that creates the largest separation between the two classes in the training set, here 'spam' emails and 'ham' emails. It then classifies each document in the test set by its position relative to that separating plane.

lbls <- as.vector(unlist(meta(ham_spam, type="local", tag = "classification")))
head(lbls)
## [1] "Ham"  "Ham"  "Spam" "Ham"  "Ham"  "Ham"
N <- length(lbls)
# The first 501 shuffled documents form the training set and the remaining documents the test set;
# virgin = TRUE tells RTextTools to treat the test documents as unlabeled.
container <- create_container(ham_spam_dtm, labels = lbls, trainSize = 1:501, testSize = 502:N, virgin = TRUE)
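
Because the split point is simply a cut through the shuffled corpus, it is worth confirming that both classes appear on each side of it:

# Class balance of the training and test portions of the shuffled labels
table(lbls[1:501])
table(lbls[502:N])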

Support Vector Machine

SVM uses a supervised learning approach: it learns to classify unseen data from a set of labeled training data. The training data are used to build a model that can then be applied to any other data outside the training set. In this scenario we use SVM to classify the emails in the test set as spam or ham.

suppressMessages(suppressWarnings(library("RTextTools")))
svm_model <- train_model(container, "SVM")
svm_result <- classify_model(container,svm_model)
head(svm_result)
##   SVM_LABEL  SVM_PROB
## 1       Ham 0.9956135
## 2       Ham 0.9974236
## 3       Ham 0.9987647
## 4       Ham 0.9961652
## 5       Ham 0.9995255
## 6      Spam 0.9947051
prop.table(table(svm_result[,1] == lbls[502:N]))
## 
##      FALSE       TRUE 
## 0.01724814 0.98275186

According to these calculations, the SVM model classifies about 98.3% of the test documents correctly.
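
Accuracy alone can be misleading when the classes are imbalanced (there are far more ham than spam documents). Since caret is already loaded, its confusionMatrix() function gives a fuller breakdown; a sketch reusing the prediction and label objects from above:

# Sensitivity/specificity breakdown for the SVM predictions
confusionMatrix(factor(svm_result$SVM_LABEL, levels = c("Ham", "Spam")),
                factor(lbls[502:N], levels = c("Ham", "Spam")))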

Decision Tree

The next step is to fit a decision tree to the training set. The "TREE" algorithm builds a single classification tree (via the tree package loaded earlier) that splits on term counts to separate spam from ham.

tree_model <- train_model(container, "TREE")
tree_result <- classify_model(container, tree_model)
head(tree_result)
##   TREE_LABEL TREE_PROB
## 1        Ham         1
## 2        Ham         1
## 3        Ham         1
## 4        Ham         1
## 5        Ham         1
## 6       Spam         1
prop.table(table(tree_result[,1] == lbls[502:N]))
## 
##       FALSE        TRUE 
## 0.005488044 0.994511956

Like the SVM model, the decision tree performs very well, classifying about 99.5% of the test documents correctly.
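
RTextTools also offers a true random forest through its "RF" algorithm. A sketch reusing the same container (and assuming the randomForest package it wraps is installed) would be:

# Hypothetical random forest run on the same container
rf_model  <- train_model(container, "RF")
rf_result <- classify_model(container, rf_model)
prop.table(table(rf_result[,1] == lbls[502:N]))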

Maximum Entropy

The maximum entropy principle (MaxEnt) states that the most appropriate distribution for modelling a given set of data is the one with the highest entropy among all those that satisfy the constraints of our prior knowledge. (For example, among all distributions with a known mean and standard deviation, the normal distribution has the highest entropy.) The maximum entropy classifier applies this idea to text: it fits the least-committal model consistent with the term statistics observed in the training set, which is equivalent to multinomial logistic regression.

maxent_model <- train_model(container, "MAXENT")
max_result <- classify_model(container, maxent_model)
head(max_result)
##   MAXENTROPY_LABEL MAXENTROPY_PROB
## 1              Ham       0.9999935
## 2              Ham       0.9999974
## 3              Ham       0.9999996
## 4              Ham       0.9999857
## 5              Ham       0.9999999
## 6             Spam       0.9999998
prop.table(table(max_result[,1] == lbls[502:N]))
## 
##      FALSE       TRUE 
## 0.01411211 0.98588789

Maximum entropy also performs well, classifying about 98.6% of the test documents correctly.

Conclusion

All three machine learning models classify the held-out emails with roughly 98-99% accuracy, which is an excellent outcome.
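
As a side-by-side check, the three accuracy figures can be gathered into one small table from the result objects already in memory:

# Summary of test-set accuracy for each classifier
data.frame(
  model    = c("SVM", "Decision tree", "Maximum entropy"),
  accuracy = c(mean(svm_result[,1]  == lbls[502:N]),
               mean(tree_result[,1] == lbls[502:N]),
               mean(max_result[,1]  == lbls[502:N]))
)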