Project 4

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
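If you want to script the download step, a minimal sketch is below; the archive names are assumptions based on the public corpus listing and may change over time.

# Hypothetical download/extract step for the SpamAssassin public corpus
# (archive names are assumptions; check the listing at the URL above)
base_url <- "https://spamassassin.apache.org/publiccorpus"
archives <- c("20030228_easy_ham.tar.bz2", "20050311_spam_2.tar.bz2")
for (a in archives) {
  download.file(file.path(base_url, a), destfile = a, mode = "wb")
  untar(a)   # unpacks the messages into a subdirectory, e.g. easy_ham/
}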

Libraries

library(tm)
library(tidyverse)
library(stringr)
library(wordcloud)
library(RTextTools)
library(knitr)
library(kableExtra)

First I used VCorpus() (a volatile corpus) together with DirSource() to pull the data in from my directories. This reads the entire directory into RStudio so we can begin our spam-vs-ham analysis.

easy_ham <- VCorpus(DirSource("C:/Users/manda/OneDrive/Documents/easy_ham"))
easy_spam <- VCorpus(DirSource("C:/Users/manda/OneDrive/Documents/easy_spam"))

Here we add metadata to tag each message as spam or ham; we then combine the two corpora into one.

meta(easy_spam, tag = "type") <- "spam"
meta(easy_ham, tag = "type") <- "ham"

easy_comb <- c(easy_spam, easy_ham)

Cleaning and tidying up the data

In this step we clean the data of inconsistencies: we repair invalid characters, convert everything to lowercase, and remove numbers, stopwords, punctuation, and extra whitespace.

easy_comb <- tm_map(easy_comb, content_transformer(function(x) iconv(x, "UTF-8", sub = "byte")))   # repair invalid UTF-8 bytes
easy_comb <- tm_map(easy_comb, content_transformer(tolower))   # lowercase everything
easy_comb <- tm_map(easy_comb, removeNumbers)
easy_comb <- tm_map(easy_comb, removeWords, stopwords("english"))
easy_comb <- tm_map(easy_comb, removePunctuation)
easy_comb <- tm_map(easy_comb, stripWhitespace)
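As a quick sanity check (not part of the original pipeline), you can peek at one cleaned message to confirm the transformations took effect:

# Spot check: print the first few lines of the first cleaned document
writeLines(head(content(easy_comb[[1]]), 5))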

We then arrange the data into a document-term matrix.

Building the document-term matrix and inspecting it

Note that rowSums() below totals tokens within each document, so the "word" column in the output actually lists document IDs ranked by size; the true per-term frequencies are computed further down with colSums().

dtm <- DocumentTermMatrix(easy_comb)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)   # rowSums totals tokens per document, so names(v) are document IDs
d <- data.frame(word = names(v), freq = v)
head(d, 10)
##                                      word   freq
## 1                     20050311_spam_2.tar 491214
## 2  00677.b957e34b4dd0d9263b56bf71b1168d8a   7752
## 3  00670.be029e37187b8615a231865e3663dcf9   7717
## 4  01083.a6b3c50be5abf782b585995d2c11176b   6436
## 5                                    cmds   4999
## 6  00570.d98ca90ac201b5d881f2397c95838eb2   3734
## 7  00942.727cb1619115cdee240fa418da19dd1f   3226
## 8  00765.ea01c46568902b1338c9685b55d77f6c   3155
## 9  00265.d0ebd6ba8f3e2b8d71e9cdaa2ec6fd91   3054
## 10 01094.91779ec04e5e6b27e84297c28fc7369f   2974

At this point, the data is cleaned further by eliminating sparse terms, i.e. terms that appear in only a handful of documents. The threshold 1 - (10/length(easy_comb)) ≈ 0.997 keeps only terms that occur in roughly ten or more of the 3470 documents.

dtm <- removeSparseTerms(dtm, 1-(10/length(easy_comb)))
dtm
## <<DocumentTermMatrix (documents: 3470, terms: 6104)>>
## Non-/sparse entries: 429232/20751648
## Sparsity           : 98%
## Maximal term length: 73
## Weighting          : term frequency (tf)

Next, we take a peek at which terms were used most frequently.

dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)   # colSums totals each term across all documents
frequency <- sort(frequency, decreasing = TRUE)
table_freq <- head(frequency, 15)
kable(table_freq, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T)
term          freq
received     23947
esmtp        13590
localhost    10121
sep           9436
jul           8112
font          7649
widthd        6870
mon           6732
jmlocalhost   6166
email         6084
postfix       5981
table         5980
thu           5922
will          5902
date          5876

wordfreq <- data.frame(word=names(frequency), frequency=frequency)

p <- ggplot(subset(wordfreq, frequency>2000), aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill='#35a2c4') +
  theme(axis.text.x=element_text(angle=90, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#adc8d1'))
p
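Since the wordcloud package is loaded above but not otherwise used here, a word cloud of the same frequencies is a natural companion plot; a minimal sketch with illustrative (assumed) parameter values:

# Sketch: word cloud of the most frequent terms (parameter values are assumptions)
# brewer.pal() comes from RColorBrewer, which the wordcloud package loads
wordcloud(words = wordfreq$word, freq = wordfreq$frequency,
          min.freq = 1000, max.words = 75, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))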

Analysis: Predictions and Models

The metadata was then analyzed: the combined corpus contains 2245 emails classified as ham and 1225 classified as spam.

meta_type <- as.vector(unlist(meta(easy_comb)))
meta_data <- data.frame(type = unlist(meta_type))

table(meta_data)
## meta_data
##  ham spam 
## 2245 1225

Next, we create a container using the create_container() function from RTextTools. The first 2727 documents form the training set and the remaining 743 the test set.

N <- length(meta_type)
container <- create_container(dtm,
                              labels = meta_type,
                              trainSize = 1:2727,
                              testSize = 2728:N,
                              virgin = F)   # virgin = FALSE: the test documents have known labels

The container holds a set of objects that are used by the estimation procedures of the supervised learning methods:

slotNames(container)
## [1] "training_matrix"       "classification_matrix" "training_codes"       
## [4] "testing_codes"         "column_names"          "virgin"

In this step we use the train_model() function to fit three classifiers to the training data: a support vector machine (SVM), a decision tree (TREE), and a maximum entropy model (MAXENT).

svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")

We then use each model to estimate whether the emails in our test dataset are spam or ham.

svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)

The outputs of the three models are combined into a single data frame holding each model's predicted label and its estimated classification probability (first six test documents shown):

model_results <- data.frame(head(svm_out), head(tree_out), head(maxent_out) )
kable(model_results, "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)
SVM_LABEL  SVM_PROB   TREE_LABEL  TREE_PROB  MAXENTROPY_LABEL  MAXENTROPY_PROB
ham        0.9990225  ham         1          ham               0.9999987
ham        0.9974185  ham         1          ham               0.9999718
ham        0.7451459  ham         1          ham               0.9625643
ham        0.9735851  ham         1          ham               0.9995415
ham        0.9998221  ham         1          ham               1.0000000
ham        0.9711250  ham         1          ham               0.9992097

Because this is supervised learning, we know the correct label for every test document, so we can measure exactly how often each algorithm classified the documents correctly.

labels_out <- data.frame(
  correct_label = meta_type[2728:N],
  svm = as.character(svm_out[,1]),
  tree = as.character(tree_out[,1]),
  maxent = as.character(maxent_out[,1]),
  stringsAsFactors = F)
table(labels_out[,1] == labels_out[,2])
## 
## FALSE  TRUE 
##    31   712
prop.table(table(labels_out[,1] == labels_out[,2]))
## 
##      FALSE       TRUE 
## 0.04172275 0.95827725
table(labels_out[,1] == labels_out[,3])
## 
## FALSE  TRUE 
##    40   703
prop.table(table(labels_out[,1] == labels_out[,3]))
## 
##     FALSE      TRUE 
## 0.0538358 0.9461642
table(labels_out[,1] == labels_out[,4])
## 
## FALSE  TRUE 
##    31   712
prop.table(table(labels_out[,1] == labels_out[,4]))
## 
##      FALSE       TRUE 
## 0.04172275 0.95827725
dfdata <- data.frame(table(labels_out[,1] == labels_out[,2]),
                     table(labels_out[,1] == labels_out[,3]),
                     table(labels_out[,1] == labels_out[,4])
                     )

colnames(dfdata) <- c("SVM", "Freq", "Decision Tree", "Freq", "Max Entropy", "Freq")
kable(dfdata, "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)
SVM    Freq   Decision Tree  Freq   Max Entropy  Freq
FALSE    31   FALSE            40   FALSE          31
TRUE    712   TRUE            703   TRUE          712
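To see where each model errs rather than just how often, a per-model confusion matrix is a natural follow-up; a minimal sketch using base R:

# Follow-up sketch: cross-tabulate true labels against each model's predictions
# (rows = correct labels, columns = predicted labels)
table(actual = labels_out$correct_label, predicted = labels_out$svm)
table(actual = labels_out$correct_label, predicted = labels_out$tree)
table(actual = labels_out$correct_label, predicted = labels_out$maxent)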

Conclusions:

Looking at the results, the SVM and maximum entropy models tied as the best classifiers, each labeling 712 of the 743 test documents correctly (about 95.8%), while the decision tree was the weakest at 703 of 743 (about 94.6%).
