In this assignment, we are using a corpus of labeled spam and ham (non-spam) emails to predict whether or not a new document is spam. The dataset can be obtained using the link below.

http://spamassassin.apache.org/old/publiccorpus/

Importing Data into a Corpus

A corpus is a large collection of texts typically used for natural language processing, and it is a very useful structure for managing documents. In this instance, we are using “VCorpus”, short for Volatile Corpus. It is volatile because the corpus is held only in memory: once the R object is destroyed, the whole corpus is gone. It is created with tm, an R package that provides a text mining framework.

Each folder we have downloaded contains multiple documents, and each folder is imported into its own object using tm’s “VCorpus”. We specify the language and then add meta data, a tag that becomes useful when bringing the two corpora together for analysis. The same meta() function can be used both to write meta data and to view it.

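library(tm)          # corpora, transformations, document-term matrices
library(tidyr)
library(dplyr)       # provides the %>% pipe used below
library(RTextTools)  # create_container, train_model, classify_model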

spam <- VCorpus(DirSource("/Users/Michele/Desktop/spam"), readerControl = list(language="english"))
ham <- VCorpus(DirSource("/Users/Michele/Desktop/easy_ham"), readerControl = list(language="english"))
meta(spam, tag = "type", type="corpus") <- "spam"
meta(ham, tag = "type", type="corpus") <- "ham"
meta(spam, type="corpus")
## $type
## [1] "spam"
## 
## attr(,"class")
## [1] "CorpusMeta"
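
To make the write/read symmetry of meta() concrete, here is a minimal sketch on a tiny in-memory corpus. The object names `toy` and `toy_type` and the two example strings are made up for illustration; only VCorpus, VectorSource and meta come from tm.

```r
library(tm)

# Hypothetical two-document corpus standing in for a downloaded folder
toy <- VCorpus(VectorSource(c("win cash now", "cheap pills online")))

# Write a corpus-level tag ...
meta(toy, tag = "type", type = "corpus") <- "spam"

# ... and read it back with the same function
toy_type <- meta(toy, type = "corpus")$type
print(toy_type)
```

The tag here is attached to the corpus as a whole, exactly as in the code above, rather than to each individual document.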

Corpus Transformations

Corpus documents such as emails can be very messy. We want to clean these documents by transforming the corpora with the tm_map function. First, the text is re-encoded so my Mac can read it; then the pipeline strips extra white space, removes numbers and punctuation, removes common stopwords without much meaning (like and, the, or, is), and stems the words (so similar words come together for later analysis). I also included a step meant to remove non-English text by converting everything to ASCII, since much of the resulting output was without meaning, such as “abbcbabebadpduxx” and “abbcbaffdaddedd”. Then, the corpus was converted to a DocumentTermMatrix, a common approach in text mining. In a DocumentTermMatrix, each document is a row and each term is a column, and the cells hold term counts. We use the inspect() function to view a small slice of it. Do NOT use inspect to view the entire DTM; with tens of thousands of terms it is far too large to read.

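spam <- spam %>%
  # Re-encode the text so macOS can read it (UTF-8-MAC is a macOS encoding)
  tm_map(content_transformer(function(x) iconv(x, to= 'UTF-8-MAC', sub = 'byte'))) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  # Drop common English stopwords, then reduce the remaining words to stems
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument, language = "english")
# Convert to ASCII, dropping characters that cannot be represented,
# then rebuild a corpus from the cleaned text
spam <- sapply(spam, function(row) iconv(row, "latin1", "ASCII", sub=""))
spam <- VCorpus(VectorSource(spam))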
spam_dtm <- DocumentTermMatrix(spam)
spam_dtm
## <<DocumentTermMatrix (documents: 502, terms: 31837)>>
## Non-/sparse entries: 107654/15874520
## Sparsity           : 99%
## Maximal term length: 298
## Weighting          : term frequency (tf)
inspect(spam_dtm[1:2, 200:202])
## <<DocumentTermMatrix (documents: 2, terms: 3)>>
## Non-/sparse entries: 0/6
## Sparsity           : 100%
## Maximal term length: 64
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs abbadebonylustfreecom
##    1                     0
##    2                     0
##     Terms
## Docs abbazbqaababeabtabababhayadnaxafbaababfabaaabfbaabzaabaacdbfefaa
##    1                                                                0
##    2                                                                0
##     Terms
## Docs abbbabea
##    1        0
##    2        0
# Apply the same cleaning steps to the ham corpus
ham <- ham %>%
  tm_map(content_transformer(function(x) iconv(x, to= 'UTF-8-MAC', sub = 'byte'))) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument, language = "english")
ham <- sapply(ham, function(row) iconv(row, "latin1", "ASCII", sub=""))
ham <- VCorpus(VectorSource(ham))
ham_dtm <- DocumentTermMatrix(ham)
ham_dtm
## <<DocumentTermMatrix (documents: 2551, terms: 33495)>>
## Non-/sparse entries: 382575/85063170
## Sparsity           : 100%
## Maximal term length: 265
## Weighting          : term frequency (tf)
inspect(ham_dtm[1:2, 200:202])
## <<DocumentTermMatrix (documents: 2, terms: 3)>>
## Non-/sparse entries: 0/6
## Sparsity           : 100%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs abvoltairebb abw aca
##    1            0   0   0
##    2            0   0   0
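
The layout of a document-term matrix is easier to see on a toy example. The following is a base-R sketch that does not use tm at all; the three one-line "emails" and the names `docs`, `tokens`, `vocab` and `toy_dtm` are invented for illustration, and the tokenization is much cruder than what DocumentTermMatrix does.

```r
# Three hypothetical one-line "emails"
docs <- c("free money now", "meeting notes attached", "free free meeting")

# Split each document into lower-case word tokens
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: every distinct term, sorted so the columns have a stable order
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document (one count vector per document)
counts <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)

# Transpose so that rows are documents and columns are terms
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)
print(toy_dtm)
```

Most cells are zero even in this tiny case, which is why the real matrices above report a sparsity close to 100%.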

Common Words

Let’s look at some common words! We find the column sums of the DTMs and display them in decreasing order.

freq_spam <- colSums(as.matrix(spam_dtm))
ord_spam <- order(freq_spam, decreasing = TRUE)
freq_spam[head(ord_spam)]
##    receiv       sep    widthd localhost       aug     width 
##      2881      2128      1523      1178      1173      1157
freq_ham <- colSums(as.matrix(ham_dtm))
ord_ham <- order(freq_ham, decreasing = TRUE)
freq_ham[head(ord_ham)]
##    receiv       sep     esmtp localhost       oct      from 
##     14477     10008      8709      7602      5456      5403
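
Because the ham corpus has about five times as many documents as the spam corpus, raw counts are hard to compare directly. A small sketch, using only the counts and document totals printed above for the three terms that appear in both top lists, converts them to occurrences per document. The names `counts`, `n_docs` and `rates` are illustrative.

```r
# Counts copied from the output above for terms in both top-six lists
counts <- rbind(
  spam = c(receiv = 2881,  sep = 2128,  localhost = 1178),
  ham  = c(receiv = 14477, sep = 10008, localhost = 7602)
)

# Number of documents in each corpus, from the DTM summaries
n_docs <- c(spam = 502, ham = 2551)

# Average occurrences per document; the division recycles n_docs down the rows
rates <- counts / n_docs
print(round(rates, 2))
```

On a per-document basis these terms occur at broadly similar rates in both corpora, so they are unlikely to help much in telling spam from ham.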

Combine and Transform into Matrix

spamham <- c(spam, ham, recursive=FALSE)
spamham_dtm <- DocumentTermMatrix(spamham)
spamham_dtm <- removeSparseTerms(spamham_dtm, .99)
spamham_dtm
## <<DocumentTermMatrix (documents: 3053, terms: 2103)>>
## Non-/sparse entries: 351660/6068799
## Sparsity           : 95%
## Maximal term length: 66
## Weighting          : term frequency (tf)
n <- length(spamham)
spamhamsample <- sample(spamham, n)
spam_ham_list <- unlist(meta(spamhamsample, "type")[,1])
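
The labels are later matched to DTM rows purely by position, so any shuffle has to reorder the feature rows and the labels in the same way. The sketch below shows one way to do that in base R with a single permutation; `toy_matrix`, `toy_labels` and `perm` are hypothetical stand-ins and are not part of the code above.

```r
set.seed(42)  # arbitrary seed so the shuffle is reproducible

# Hypothetical stand-ins: 5 documents (rows), 2 term columns, and their labels
toy_matrix <- matrix(1:10, nrow = 5,
                     dimnames = list(paste0("doc", 1:5), c("t1", "t2")))
toy_labels <- c("spam", "spam", "ham", "ham", "ham")
names(toy_labels) <- rownames(toy_matrix)

# Draw ONE permutation and apply it to both objects
perm <- sample(nrow(toy_matrix))
shuffled_matrix <- toy_matrix[perm, , drop = FALSE]
shuffled_labels <- toy_labels[perm]

# Row i of the matrix still belongs to label i
stopifnot(identical(rownames(shuffled_matrix), names(shuffled_labels)))
```

Shuffling before the split also means that both the training and the test rows contain a mix of spam and ham.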

Create Container for Training Estimation

container <- create_container(
  spamham_dtm, 
  labels = spam_ham_list,
  trainSize = 1:2900,
  testSize = 2901:n,
  virgin = FALSE
)
str(container)
## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
##   ..@ training_matrix      :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:342131] 1 1 1 6 3 1 1 3 3 1 ...
##   .. .. ..@ ja       : int [1:342131] 7 64 65 120 169 185 192 197 198 229 ...
##   .. .. ..@ ia       : int [1:2901] 1 132 234 311 446 535 682 756 865 1036 ...
##   .. .. ..@ dimension: int [1:2] 2900 2103
##   ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
##   .. .. ..@ ra       : num [1:9529] 1 1 1 1 1 2 1 2 1 2 ...
##   .. .. ..@ ja       : int [1:9529] 69 70 227 365 380 413 437 477 534 558 ...
##   .. .. ..@ ia       : int [1:154] 1 52 100 217 270 319 375 423 478 534 ...
##   .. .. ..@ dimension: int [1:2] 153 2103
##   ..@ training_codes       : Factor w/ 2 levels "ham","spam": 2 1 1 2 1 1 1 1 1 1 ...
##   ..@ testing_codes        : Factor w/ 2 levels "ham","spam": 1 1 2 2 1 1 1 1 1 1 ...
##   ..@ column_names         : chr [1:2103] "abil" "abl" "about" "absolut" ...
##   ..@ virgin               : logi FALSE

Support Vector Machines (SVM)

SVM is a supervised machine learning technique used for classification and regression analysis. It employs a spatial representation of the data and attempts to fit a boundary (a hyperplane) between the document features that best separates the documents into groups. The boundary is chosen so that it maximizes the space, called the margin, between the groups.

svm_model <- train_model(container, "SVM")
svm_out <- classify_model(container, svm_model)
head(svm_out)
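
The idea of preferring the boundary with the most space around it can be illustrated without fitting a model. The sketch below compares two hand-picked separating lines on four made-up 2-D points and computes the margin of each; an actual SVM would search for the best boundary by solving an optimization problem instead. All names and numbers here are assumptions for illustration.

```r
# Hypothetical 2-D points: two "ham" and two "spam" documents
X <- rbind(c(1, 1), c(2, 1), c(4, 4), c(5, 4))
y <- c(-1, -1, 1, 1)  # -1 = ham, +1 = spam

# Geometric margin of the boundary w . x + b = 0:
# the smallest signed distance from any point to the boundary
margin <- function(w, b) {
  min(y * (X %*% w + b) / sqrt(sum(w^2)))
}

m_a <- margin(c(1, 0), -3)    # vertical line x1 = 3
m_b <- margin(c(1, 1), -5.5)  # diagonal line x1 + x2 = 5.5

print(c(vertical = m_a, diagonal = m_b))
```

Both lines separate the two groups, but the diagonal one leaves more room on either side, so it is the one a maximum-margin criterion would prefer.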

Random Forest

Random forests, or random decision forests, are a supervised machine learning technique in which we build multiple decision trees and take the category predicted most often across the trees as the classification most likely to be accurate. Individual decision trees tend to learn highly irregular patterns and overfit their training sets; random forests counter this by training each tree with bootstrap aggregating. Note that the code below uses RTextTools’ “TREE” algorithm, which fits a single classification tree; the package’s random forest option is “RF”.

tree_model <- train_model(container, "TREE")
tree_out <- classify_model(container, tree_model)
head(tree_out)
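
The voting step of a forest is simple enough to show directly. In this sketch the individual tree predictions are made up rather than produced by fitted trees; `votes` and `majority` are illustrative names.

```r
# Hypothetical predictions from five trees (rows) for three emails (columns)
votes <- rbind(
  tree1 = c("spam", "ham",  "ham"),
  tree2 = c("spam", "ham",  "spam"),
  tree3 = c("ham",  "ham",  "spam"),
  tree4 = c("spam", "ham",  "ham"),
  tree5 = c("spam", "spam", "spam")
)
colnames(votes) <- c("email1", "email2", "email3")

# For each email, keep the label predicted by the most trees
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
print(majority)
```

An odd number of trees is used here so that a two-class vote can never tie.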

Maximum Entropy

This is another supervised machine learning technique. The maximum entropy (MaxEnt) classifier is a probabilistic classifier that does not assume that features are conditionally independent of each other. It is based on the principle of maximum entropy: among the models consistent with the training data, it selects the one with the largest entropy. Using contextual evidence from the standard bag-of-words framework, it assigns documents to classes (here, spam or ham). A stochastic model then attempts to represent the behavior of the random process, and is constructed from the observed information and its class.

maxent_model <- train_model(container, "MAXENT")
maxent_out <- classify_model(container, maxent_model)
head(maxent_out)
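
A maximum entropy classifier scores each class with a weighted sum of the features and turns the scores into probabilities with an exponential (softmax) form. The sketch below shows only that prediction step; the weights in `W` are invented, whereas a real model would learn them from the training data.

```r
# Hypothetical per-class weights (rows) for three terms (columns)
W <- rbind(
  spam = c(free = 1.2,  meeting = -0.8, click = 0.9),
  ham  = c(free = -0.5, meeting = 1.0,  click = -0.4)
)

# Bag-of-words counts for one new email
x <- c(free = 2, meeting = 0, click = 1)

# Score each class, then normalize the exponentiated scores to probabilities
scores <- drop(W %*% x)
probs <- exp(scores) / sum(exp(scores))

predicted <- names(which.max(probs))
print(round(probs, 3))
print(predicted)
```

Because each class has its own weight for every term, the model does not need to treat the terms as independent of one another.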

Bootstrap Aggregating (Bagging)

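Bootstrap aggregating, or bagging, is a supervised machine learning technique designed to improve the stability and accuracy of machine learning algorithms, particularly “unstable procedures” such as decision trees. It trains many copies of a model on bootstrap resamples of the training data and combines their predictions.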

bag_model <- train_model(container, "BAGGING")
bag_out <- classify_model(container, bag_model)
head(bag_out)
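
Here is a minimal base-R sketch of bagging, assuming a very simple base learner (a one-feature decision stump) and made-up data. It is not how RTextTools implements “BAGGING”; the names `fit_stump`, `thresholds` and `bag_predict` are hypothetical.

```r
set.seed(1)  # arbitrary seed for reproducible resampling

# Hypothetical data: count of the word "free" per email, and the true label
x <- c(0, 0, 1, 1, 2, 3, 4, 5, 5, 6)
y <- c("ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam", "spam")

# Base learner: a stump that picks the cut-off with the fewest training errors
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  errs <- sapply(cuts, function(k) mean(ifelse(x > k, "spam", "ham") != y))
  cuts[which.min(errs)]
}

# Bootstrap: fit one stump per resample drawn with replacement
n_models <- 25
thresholds <- replicate(n_models, {
  idx <- sample(length(x), replace = TRUE)
  fit_stump(x[idx], y[idx])
})

# Aggregate: every stump votes and the majority label wins
bag_predict <- function(new_x) {
  spam_share <- sapply(new_x, function(v) mean(v > thresholds))
  ifelse(spam_share > 0.5, "spam", "ham")
}

pred <- bag_predict(c(0, 6))
print(pred)
```

Each resample yields a slightly different cut-off, and averaging the votes smooths out that variation.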

Analysis

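We put the actual labels and the machine learning results into the same data frame, then see how accurate the different models were. Currently this isn’t working very well, since no model accurately filters out spam emails. The one model that does filter out any spam, the maxent model, unfortunately performs even worse, since it determines that actual spam emails are not spam. One likely contributor is that the labels are taken from the shuffled corpus (spamhamsample), while spamham_dtm was built from the unshuffled corpus, so the rows and labels passed to the container do not line up.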

correct_spam <- data.frame(
    correct_label = spam_ham_list[2901:length(spamham)],
    svm = as.character(svm_out[,1]),
    svm_prob = as.character(svm_out[,2]),
    tree = as.character(tree_out[,1]),
    tree_prob = as.character(tree_out[,2]),
    maxent = as.character(maxent_out[,1]),
    maxent_prob = as.character(maxent_out[,2]),
    bag = as.character(bag_out[,1]),
    bag_prob = as.character(bag_out[,2]),
    stringsAsFactors = FALSE)

correct_spam

Support Vector Machines

correct_spam %>%
  group_by(correct_label) %>%
  count(svm)
prop.table(table(correct_spam[,1] == correct_spam[,2]))
## 
##     FALSE      TRUE 
## 0.1503268 0.8496732

Random Forest

correct_spam %>%
  group_by(correct_label) %>%
  count(tree)
prop.table(table(correct_spam[,1] == correct_spam[,4]))
## 
##     FALSE      TRUE 
## 0.1503268 0.8496732

Maximum Entropy

correct_spam %>%
  group_by(correct_label) %>%
  count(maxent)
prop.table(table(correct_spam[,1] == correct_spam[,6]))
## 
##     FALSE      TRUE 
## 0.3398693 0.6601307

Bootstrap Aggregating

correct_spam %>%
  group_by(correct_label) %>%
  count(bag)
prop.table(table(correct_spam[,1] == correct_spam[,8]))
## 
##     FALSE      TRUE 
## 0.1503268 0.8496732
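
Accuracy alone can hide a model that never predicts spam. The sketch below uses a hypothetical test set of 153 labels, 130 ham and 23 spam, with every email predicted as ham. These counts are an assumption chosen to be consistent with the 0.8496732 figure above, not values read from the actual results.

```r
# Hypothetical true labels and an "always ham" prediction
actual <- factor(c(rep("ham", 130), rep("spam", 23)), levels = c("ham", "spam"))
predicted <- factor(rep("ham", 153), levels = c("ham", "spam"))

# Confusion matrix: rows are true labels, columns are predictions
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Share of all emails labelled correctly
accuracy <- mean(actual == predicted)

# Share of true spam that was caught (recall for the spam class)
spam_recall <- confusion["spam", "spam"] / sum(confusion["spam", ])

print(c(accuracy = accuracy, spam_recall = spam_recall))
```

A cross-tabulation like this, or the group_by() and count() output above, is therefore a better guide than the overall proportion correct.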