Project 4

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

Libraries

library(tm)
library(tidyverse)
library(stringr)
library(wordcloud)
library(RTextTools)
library(knitr)
library(kableExtra)

First, I used VCorpus() (volatile corpus) together with DirSource() to read the data from my directory. This pulls every file in each directory into R so we can begin our analysis of spam vs. ham.

easy_ham <- VCorpus(DirSource("C:/Users/manda/OneDrive/Documents/easy_ham"))
easy_spam <- VCorpus(DirSource("C:/Users/manda/OneDrive/Documents/easy_spam"))

Here we add meta information that labels each document as spam or ham, then combine the two corpora.

meta(easy_spam, tag = "type") <- "spam"
meta(easy_ham, tag = "type") <- "ham"

easy_comb <- c(easy_spam, easy_ham)

Cleaning/ tidying up the data

In this step we begin to clean the data of any inconsistencies. Our goal is to remove numbers, stopwords, punctuation, and white space.

easy_comb <- tm_map(easy_comb, content_transformer(function(x) iconv(x, "UTF-8", sub="byte")))
easy_comb <- tm_map(easy_comb, content_transformer(tolower))
easy_comb <- tm_map(easy_comb, removeNumbers)
easy_comb <- tm_map(easy_comb, removeWords, stopwords("english"))
easy_comb <- tm_map(easy_comb, removePunctuation)
easy_comb <- tm_map(easy_comb, stripWhitespace)
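
To make the effect of each cleaning step concrete, here is a minimal base-R sketch that applies the same sequence to one made-up string. The function clean_text(), the tiny stop_words vector, and the sample message are assumptions for illustration only; the stopwords("english") list in tm is much longer, and the trimws() call is an extra convenience that stripWhitespace does not perform.

```r
# Hypothetical mini stopword list (tm's stopwords("english") is much longer)
stop_words <- c("the", "is", "at", "you", "a")

# Hypothetical helper mirroring the order of the tm_map() calls above
clean_text <- function(x) {
  x <- tolower(x)                                   # lower-case everything
  x <- gsub("[0-9]+", "", x)                        # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove stopwords
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- gsub("\\s+", " ", x)                         # collapse whitespace
  trimws(x)                                         # extra: trim the ends
}

cleaned <- clean_text("You WON 1000 dollars!!! Claim the prize at http://example.com")
print(cleaned)
```

Because punctuation is deleted rather than replaced by a space, the URL collapses into the single token httpexamplecom. This is probably how fused tokens such as jmlocalhost end up among the frequent terms later on.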

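We then arrange the data into a document-term matrix, in which rows are documents and columns are terms. Note that rowSums() totals each document, so the table below ranks files by their total term counts rather than ranking words; its top entries include 20050311_spam_2.tar and cmds, which are not e-mails.

Building a document-term matrix and inspecting it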

dtm <- DocumentTermMatrix(easy_comb)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

##                                      word   freq
## 1                     20050311_spam_2.tar 491214
## 2  00677.b957e34b4dd0d9263b56bf71b1168d8a   7752
## 3  00670.be029e37187b8615a231865e3663dcf9   7717
## 4  01083.a6b3c50be5abf782b585995d2c11176b   6436
## 5                                    cmds   4999
## 6  00570.d98ca90ac201b5d881f2397c95838eb2   3734
## 7  00942.727cb1619115cdee240fa418da19dd1f   3226
## 8  00765.ea01c46568902b1338c9685b55d77f6c   3155
## 9  00265.d0ebd6ba8f3e2b8d71e9cdaa2ec6fd91   3054
## 10 01094.91779ec04e5e6b27e84297c28fc7369f   2974
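
In a document-term matrix, rowSums() gives one total per document and colSums() gives one total per term. A minimal sketch on a made-up matrix (m_toy and its document and term names are hypothetical):

```r
# Toy document-term matrix: 2 documents (rows) x 3 terms (columns)
m_toy <- matrix(c(2, 0, 1,
                  0, 4, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(Docs = c("doc1", "doc2"),
                                Terms = c("free", "meeting", "email")))

doc_totals  <- rowSums(m_toy)                           # tokens per document
term_totals <- sort(colSums(m_toy), decreasing = TRUE)  # occurrences per term

print(doc_totals)
print(term_totals)
```

To rank words rather than files, the column totals are the ones to sort.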

At this point, the data is cleaned further by eliminating sparse terms, that is, terms that occur in very few documents (here, roughly 10 or fewer of the 3470).

dtm <- removeSparseTerms(dtm, 1-(10/length(easy_comb)))
dtm

## <<DocumentTermMatrix (documents: 3470, terms: 6104)>>
## Non-/sparse entries: 429232/20751648
## Sparsity           : 98%
## Maximal term length: 73
## Weighting          : term frequency (tf)

Next, we take a peek at which terms are used most frequently.

dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=T)
table_freq <- head(frequency, 15)
kable(table_freq, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T)

	x
received	23947
esmtp	13590
localhost	10121
sep	9436
jul	8112
font	7649
widthd	6870
mon	6732
jmlocalhost	6166
email	6084
postfix	5981
table	5980
thu	5922
will	5902
date	5876

wordfreq <- data.frame(word=names(frequency), frequency=frequency)

p <- ggplot(subset(wordfreq, frequency>2000), aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity", fill='#35a2c4') +
  theme(axis.text.x=element_text(angle=90, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#adc8d1'))
p

Analysis: Predictions and Models

The metadata was then tabulated: 2245 documents are labeled ham and 1225 are labeled spam.

meta_type <- as.vector(unlist(meta(easy_comb)))
meta_data <- data.frame(type = unlist(meta_type))

table(meta_data)

## meta_data
##  ham spam 
## 2245 1225

Furthermore, we create a container using the create_container() function from RTextTools. The first 2727 documents are used for training and the remainder for testing.

N <- length(meta_type)
container <- create_container(dtm,
                              labels = meta_type,
                              trainSize = 1:2727,
                              testSize = 2728:N,
                              virgin = F)

The result is a matrix_container object. It contains the set of objects used by the estimation procedures of the supervised learning methods.

slotNames(container)

## [1] "training_matrix"       "classification_matrix" "training_codes"       
## [4] "testing_codes"         "column_names"          "virgin"
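
One thing to note about the split above: easy_comb was built as c(easy_spam, easy_ham), so the 1225 spam documents come first and rows 2728 to N are all ham. A minimal sketch of checking this, and of a shuffled alternative, follows. The seed and the variable names are assumptions, and the labels vector is rebuilt from the counts reported above rather than read from the corpus.

```r
# Rebuild the label order from the reported counts: spam first, then ham
labels    <- c(rep("spam", 1225), rep("ham", 2245))
n         <- length(labels)
train_end <- 2727

# Sequential split, as used in create_container() above
sequential_test <- labels[(train_end + 1):n]
print(table(sequential_test))

# Shuffled split: permute the document order before cutting
set.seed(607)                       # hypothetical seed, for reproducibility
shuffled      <- sample(n)
shuffled_test <- labels[shuffled][(train_end + 1):n]
print(table(shuffled_test))
```

Under this assumption, the same permutation would be applied to both the matrix rows and the labels (for example dtm[shuffled, ] and meta_type[shuffled]) before calling create_container(), so that both classes appear in the test set.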

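Next, we use the train_model() function to fit three classifiers on the training data: a support vector machine (SVM), a classification tree (TREE), and maximum entropy (MAXENT).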

svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")

We then use each model to predict whether each e-mail in the test set is spam or ham.

svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)

The outputs of the three models were then combined into a single data frame showing each model's predicted label and estimated probability for the first six test documents.

model_results <- data.frame(head(svm_out), head(tree_out), head(maxent_out) )
kable(model_results, "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)

SVM_LABEL	SVM_PROB	TREE_LABEL	TREE_PROB	MAXENTROPY_LABEL	MAXENTROPY_PROB
ham	0.9990225	ham	1	ham	0.9999987
ham	0.9974185	ham	1	ham	0.9999718
ham	0.7451459	ham	1	ham	0.9625643
ham	0.9735851	ham	1	ham	0.9995415
ham	0.9998221	ham	1	ham	1.0000000
ham	0.9711250	ham	1	ham	0.9992097

Because the test documents also carry known labels, we can compare them with the predictions to see how often each algorithm classified the documents correctly.

labels_out <- data.frame(
  correct_label = meta_type[2728:N],
  svm = as.character(svm_out[,1]),
  tree = as.character(tree_out[,1]),
  maxent = as.character(maxent_out[,1]),
  stringsAsFactors = F)

table(labels_out[,1] == labels_out[,2])

## 
## FALSE  TRUE 
##    31   712

prop.table(table(labels_out[,1] == labels_out[,2]))

## 
##      FALSE       TRUE 
## 0.04172275 0.95827725

table(labels_out[,1] == labels_out[,3])

## 
## FALSE  TRUE 
##    40   703

prop.table(table(labels_out[,1] == labels_out[,3]))

## 
##     FALSE      TRUE 
## 0.0538358 0.9461642

table(labels_out[,1] == labels_out[,4])

## 
## FALSE  TRUE 
##    31   712

prop.table(table(labels_out[,1] == labels_out[,4]))

## 
##      FALSE       TRUE 
## 0.04172275 0.95827725

dfdata <- data.frame(table(labels_out[,1] == labels_out[,2]),
                     table(labels_out[,1] == labels_out[,3]),
                     table(labels_out[,1] == labels_out[,4])
                     )

colnames(dfdata) <- c("SVM","Freq", "Tree", "Freq", "Max Entropy", "Freq")
kable(dfdata, "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)

SVM	Freq	Tree	Freq	Max Entropy	Freq
FALSE	31	FALSE	40	FALSE	31
TRUE	712	TRUE	703	TRUE	712
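
The TRUE/FALSE counts above give overall accuracy only. As a sketch, accuracy is the share of matching labels, and a cross-tabulation of actual against predicted labels shows which class the errors fall in. The truth and pred vectors here are made up for illustration; the last lines recompute the reported accuracies from the counts in the table.

```r
# Made-up labels for illustration only
truth <- c("ham", "ham", "spam", "spam", "ham", "spam")
pred  <- c("ham", "spam", "spam", "spam", "ham", "ham")

accuracy  <- mean(truth == pred)                      # share of correct labels
confusion <- table(actual = truth, predicted = pred)  # errors by class
print(accuracy)
print(confusion)

# Accuracies implied by the counts reported above (743 test documents)
correct           <- c(svm = 712, tree = 703, maxent = 712)
reported_accuracy <- correct / 743
print(round(reported_accuracy, 4))
```

With the real results, the same cross-tabulation could be built from labels_out, for example table(labels_out$correct_label, labels_out$svm).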

Conclusions:

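From the results, the SVM and Maximum Entropy classifiers tied for best, each labeling 712 of the 743 test documents correctly (95.8%). The tree model (algorithm "TREE") was slightly behind at 703 of 743 (94.6%). Because the corpus was combined spam-first and split without shuffling, all 743 test documents are ham, so these figures show how reliably each model recognizes ham rather than how well it separates spam from ham.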

References:

Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

CUNY MSDS DATA 607 Project 4