In this assignment, we are using a corpus of labeled spam and ham (non-spam) emails to predict whether or not a new document is spam. The dataset can be obtained using the link below.
http://spamassassin.apache.org/old/publiccorpus/
A corpus is a large collection of texts, typically used for natural language processing, and is a very useful structure for managing documents. In this instance, we are using “VCorpus”, short for Volatile Corpus. It is volatile because it is held in memory: once the R object is destroyed, the whole corpus is gone. It is generated using tm, an R package that provides a text mining framework.
Each folder we have downloaded contains multiple documents, and each folder is imported into an object using tm’s “VCorpus”. We specify the language and then add metadata, a tag that becomes useful when bringing the two corpora together for analysis. The same meta() function can be used either to write or to view metadata.
library(tm)
library(tidyr)
library(dplyr)
library(RTextTools)
spam <- VCorpus(DirSource("/Users/Michele/Desktop/spam"), readerControl = list(language="english"))
ham <- VCorpus(DirSource("/Users/Michele/Desktop/easy_ham"), readerControl = list(language="english"))
meta(spam, tag = "type", type="corpus") <- "spam"
meta(ham, tag = "type", type="corpus") <- "ham"
meta(spam, type="corpus")
## $type
## [1] "spam"
##
## attr(,"class")
## [1] "CorpusMeta"
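As a minimal sketch of the two kinds of metadata used in this write-up, the toy example below (made-up documents, not the SpamAssassin data) sets a corpus-level tag, as above, and a document-level tag, which stores one value per document. It assumes the tm package is installed; the name toy and the sample strings are hypothetical.

```r
library(tm)

# Toy corpus built from an in-memory character vector (hypothetical data)
toy <- VCorpus(VectorSource(c("free money now", "meeting at noon")))

# Corpus-level tag: a single value describing the whole collection
meta(toy, tag = "type", type = "corpus") <- "spam"
corpus_type <- meta(toy, type = "corpus")$type

# Document-level (indexed) tag: one value per document
meta(toy, tag = "type") <- rep("spam", length(toy))
doc_types <- meta(toy, tag = "type")[, 1]

print(corpus_type)
print(doc_types)
```

The document-level form stores one value per document, so it can be read back as a vector of labels.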
Corpus documents such as emails can be very messy. We want to clean these documents by transforming the corpora with the tm_map function. First, it converts the text to an encoding my Mac can read, then it strips white space, removes numbers and punctuation, removes common stopwords without much meaning (like and, the, or, is), and stems the words (so similar words come together for later analysis). I also included a step that strips non-ASCII characters, intended to remove non-English words, since much of the resulting output was without meaning, such as “abbcbabebadpduxx” and “abbcbaffdaddedd”. Then, the corpus was converted to a DocumentTermMatrix, a common approach in text mining. In a DocumentTermMatrix, each document is a row and each term (word) is a column. We use the inspect() function to view them. Do NOT use inspect to view the entire DTM…
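# re-encode the raw text ('UTF-8-MAC' is an encoding name available on macOS), then clean it
spam <- spam %>%
  tm_map(content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte'))) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument, language = "english")
# drop any characters that cannot be represented in ASCII
spam <- sapply(spam, function(row) iconv(row, "latin1", "ASCII", sub = ""))
# rebuild the corpus from the cleaned text
spam <- VCorpus(VectorSource(spam))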
spam_dtm <- DocumentTermMatrix(spam)
spam_dtm
## <<DocumentTermMatrix (documents: 502, terms: 31837)>>
## Non-/sparse entries: 107654/15874520
## Sparsity : 99%
## Maximal term length: 298
## Weighting : term frequency (tf)
inspect(spam_dtm[1:2, 200:202])
## <<DocumentTermMatrix (documents: 2, terms: 3)>>
## Non-/sparse entries: 0/6
## Sparsity : 100%
## Maximal term length: 64
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs abbadebonylustfreecom
## 1 0
## 2 0
## Terms
## Docs abbazbqaababeabtabababhayadnaxafbaababfabaaabfbaabzaabaacdbfefaa
## 1 0
## 2 0
## Terms
## Docs abbbabea
## 1 0
## 2 0
# same cleaning steps as for the spam corpus
ham <- ham %>%
  tm_map(content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte'))) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument, language = "english")
# drop any characters that cannot be represented in ASCII
ham <- sapply(ham, function(row) iconv(row, "latin1", "ASCII", sub = ""))
# rebuild the corpus from the cleaned text
ham <- VCorpus(VectorSource(ham))
ham_dtm <- DocumentTermMatrix(ham)
ham_dtm
## <<DocumentTermMatrix (documents: 2551, terms: 33495)>>
## Non-/sparse entries: 382575/85063170
## Sparsity : 100%
## Maximal term length: 265
## Weighting : term frequency (tf)
inspect(ham_dtm[1:2, 200:202])
## <<DocumentTermMatrix (documents: 2, terms: 3)>>
## Non-/sparse entries: 0/6
## Sparsity : 100%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs abvoltairebb abw aca
## 1 0 0 0
## 2 0 0 0
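To make the row and column layout of a DocumentTermMatrix concrete, here is a small sketch on three made-up documents (not the email data). It assumes tm is installed and relies on the DocumentTermMatrix defaults, which lowercase the text and drop words shorter than three characters; the toy_docs name is hypothetical.

```r
library(tm)

toy_docs <- c("free money now", "meeting at noon", "free lunch meeting")
toy_dtm  <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# Safe to densify here because the toy matrix is tiny
m <- as.matrix(toy_dtm)

dim(m)        # rows are documents, columns are terms
colnames(m)   # the vocabulary
m[, "free"]   # count of "free" in each document
```

Each row holds the word counts for one document, which is why column sums give corpus-wide term frequencies.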
Let’s look at some common words! We find the column sums of the DTMs and display them in decreasing order.
freq_spam <- colSums(as.matrix(spam_dtm))
ord_spam <- order(freq_spam, decreasing = TRUE)
freq_spam[head(ord_spam)]
## receiv sep widthd localhost aug width
## 2881 2128 1523 1178 1173 1157
freq_ham <- colSums(as.matrix(ham_dtm))
ord_ham <- order(freq_ham, decreasing = TRUE)
freq_ham[head(ord_ham)]
## receiv sep esmtp localhost oct from
## 14477 10008 8709 7602 5456 5403
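as.matrix() converts the sparse DTM into a dense matrix, which for the ham matrix above means roughly 85 million cells. A possible alternative, sketched here on toy data, is slam::col_sums(), which sums columns directly on the sparse representation. This assumes the slam package (a dependency of tm) is available; it is not what the code above uses.

```r
library(tm)

toy_docs <- c("free money now", "meeting at noon", "free lunch meeting")
toy_dtm  <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# Column sums computed on the sparse matrix, without as.matrix()
freq_sparse <- slam::col_sums(toy_dtm)

# Same totals via the dense route used in the text
freq_dense <- colSums(as.matrix(toy_dtm))

head(sort(freq_sparse, decreasing = TRUE))
```

Both routes give the same totals; the sparse one simply avoids allocating the full matrix.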
spamham <- c(spam, ham, recursive=FALSE)
spamham_dtm <- DocumentTermMatrix(spamham)
spamham_dtm <- removeSparseTerms(spamham_dtm, .99)
spamham_dtm
## <<DocumentTermMatrix (documents: 3053, terms: 2103)>>
## Non-/sparse entries: 351660/6068799
## Sparsity : 95%
## Maximal term length: 66
## Weighting : term frequency (tf)
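The sparse argument of removeSparseTerms() is a threshold on how empty a term's column may be: with 0.99, terms that appear in about 1% of the documents or fewer are dropped. A toy sketch of the same idea with a threshold of 0.5 (made-up documents; tm assumed installed):

```r
library(tm)

toy_docs <- c("free money now", "meeting at noon", "free lunch meeting")
toy_dtm  <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# Keep only terms that occur in more than half of the three documents
trimmed <- removeSparseTerms(toy_dtm, 0.5)

Terms(toy_dtm)   # full vocabulary
Terms(trimmed)   # terms that survive the threshold
```

Only terms shared by at least two of the three toy documents remain, which mirrors how the combined matrix above shrinks to 2103 terms.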
n <- length(spamham)
spamhamsample <- sample(spamham, n)
spam_ham_list <- unlist(meta(spamhamsample, "type")[,1])
container <- create_container(
  spamham_dtm,
  labels = spam_ham_list,
  trainSize = 1:2900,
  testSize = 2901:n,
  virgin = FALSE
)
str(container)
## Formal class 'matrix_container' [package "RTextTools"] with 6 slots
## ..@ training_matrix :Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:342131] 1 1 1 6 3 1 1 3 3 1 ...
## .. .. ..@ ja : int [1:342131] 7 64 65 120 169 185 192 197 198 229 ...
## .. .. ..@ ia : int [1:2901] 1 132 234 311 446 535 682 756 865 1036 ...
## .. .. ..@ dimension: int [1:2] 2900 2103
## ..@ classification_matrix:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
## .. .. ..@ ra : num [1:9529] 1 1 1 1 1 2 1 2 1 2 ...
## .. .. ..@ ja : int [1:9529] 69 70 227 365 380 413 437 477 534 558 ...
## .. .. ..@ ia : int [1:154] 1 52 100 217 270 319 375 423 478 534 ...
## .. .. ..@ dimension: int [1:2] 153 2103
## ..@ training_codes : Factor w/ 2 levels "ham","spam": 2 1 1 2 1 1 1 1 1 1 ...
## ..@ testing_codes : Factor w/ 2 levels "ham","spam": 1 1 2 2 1 1 1 1 1 1 ...
## ..@ column_names : chr [1:2103] "abil" "abl" "about" "absolut" ...
## ..@ virgin : logi FALSE
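create_container() pairs row i of the matrix with element i of labels, so any shuffling has to be applied to both with the same permutation. The sketch below illustrates that idea on a toy matrix in base R; the object names are hypothetical and this is not the code used above.

```r
set.seed(1)

# Toy stand-ins for a document-term matrix and its labels (hypothetical data)
toy_matrix <- matrix(1:12, nrow = 6,
                     dimnames = list(paste0("doc", 1:6), c("term_a", "term_b")))
toy_labels <- c("spam", "spam", "ham", "ham", "ham", "ham")
names(toy_labels) <- rownames(toy_matrix)

# One permutation, applied to rows and labels alike
perm <- sample(nrow(toy_matrix))
shuffled_matrix <- toy_matrix[perm, ]
shuffled_labels <- toy_labels[perm]

# Row names and label names still agree after shuffling
identical(rownames(shuffled_matrix), names(shuffled_labels))
```

Reusing a single index vector for both objects is what keeps each label attached to its own document.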
SVM (support vector machine) is a supervised machine learning technique used for classification and regression analysis. It employs a spatial representation of the data: each document is a point in the space defined by its features. The algorithm attempts to fit a boundary (a hyperplane) that best separates the documents into groups, choosing it so as to maximize the space (the margin) between the groups.
svm_model <- train_model(container, "SVM")
svm_out <- classify_model(container, svm_model)
head(svm_out)
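The geometric idea behind SVM can be sketched in base R. The example below does not train anything: it takes one hypothetical hyperplane and made-up two-dimensional points, and computes which side each point falls on and how wide the margin is. An SVM would search for the hyperplane that makes this margin as large as possible.

```r
# Hypothetical 2-D points (two features per document) and their classes
pts <- rbind(c(1, 1), c(2, 1), c(4, 4), c(5, 3))
cls <- c(-1, -1, 1, 1)

# A hypothetical separating hyperplane: w . x + b = 0
w <- c(1, 1)
b <- -5.5

# Signed distance of each point to the hyperplane
signed_dist <- as.vector(pts %*% w + b) / sqrt(sum(w^2))

# Predicted side of the boundary, and the margin of this hyperplane
pred   <- sign(signed_dist)
margin <- min(abs(signed_dist))

pred
margin
```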
Random forests, or random decision forests, are a supervised machine learning technique in which we create multiple decision trees and take the category predicted most often across the trees as the classification most likely to be accurate. A single decision tree tends to learn highly irregular patterns and overfit its training set; random forests counter this by training each tree with bootstrap aggregating. Note that the code below passes "TREE" to train_model, which in RTextTools fits a single classification tree; the random forest algorithm is selected with "RF".
tree_model <- train_model(container, "TREE")
tree_out <- classify_model(container, tree_model)
head(tree_out)
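The voting step described above can be sketched in base R. The predictions below are made up for illustration; in a real forest each column would come from a tree trained on a different bootstrap sample.

```r
# Hypothetical predictions from three trees for four emails
votes <- data.frame(
  tree1 = c("spam", "ham", "ham",  "spam"),
  tree2 = c("spam", "ham", "spam", "ham"),
  tree3 = c("ham",  "ham", "spam", "spam"),
  stringsAsFactors = FALSE
)

# For each email (row), return the label predicted most often
majority_vote <- function(row) names(which.max(table(row)))
forest_pred <- apply(votes, 1, majority_vote)

forest_pred
```

Using an odd number of trees avoids ties when there are only two classes.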
Maximum entropy (MaxEnt) is another supervised machine learning technique. The MaxEnt classifier is a probabilistic classifier that does not assume that features are conditionally independent of each other. It is based on the principle of maximum entropy: of all the models consistent with the training data, it selects the one with the largest entropy. Using contextual evidence from the standard bag-of-words representation, it assigns each document to a class (here, spam or ham). A stochastic model then attempts to represent the behavior of the random process, and is constructed from the training information and its class labels.
maxent_model <- train_model(container, "MAXENT")
maxent_out <- classify_model(container, maxent_model)
head(maxent_out)
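A maxent model scores each class with a weighted sum of the document's features and turns the scores into probabilities by exponentiating and normalizing. The sketch below shows only that scoring step in base R, with made-up weights and counts; fitting the weights is not shown.

```r
# Hypothetical term counts for one email
x <- c(free = 2, meeting = 0, click = 1)

# Hypothetical per-class weights (one row per class)
w <- rbind(
  ham  = c(free = -0.5, meeting =  1.0, click = -0.2),
  spam = c(free =  0.8, meeting = -0.6, click =  0.9)
)

# One score per class: the weighted sum of the features
scores <- as.vector(w %*% x)
names(scores) <- rownames(w)

# Exponentiate and normalize so the class probabilities sum to 1
probs <- exp(scores) / sum(exp(scores))

probs
names(which.max(probs))
```

Because every feature contributes to every class score jointly, no independence assumption between features is needed.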
Bootstrap aggregating, or bagging, is a supervised machine learning technique designed to improve the stability and accuracy of machine learning algorithms, particularly “unstable procedures”.
bag_model <- train_model(container, "BAGGING")
bag_out <- classify_model(container, bag_model)
head(bag_out)
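The resampling half of bagging can be sketched in a few lines of base R: each model is trained on a sample of the training rows drawn with replacement, so some rows repeat and others are left out. The sizes and names below are arbitrary and for illustration only.

```r
set.seed(42)

n_docs   <- 10   # hypothetical number of training documents
n_models <- 3    # hypothetical number of bagged models

# Each list element holds the row indices one model would be trained on
boot_idx <- lapply(seq_len(n_models), function(i) {
  sample(n_docs, size = n_docs, replace = TRUE)
})

# Rows never drawn for a given model ("out-of-bag" rows)
oob <- lapply(boot_idx, function(idx) setdiff(seq_len(n_docs), idx))

boot_idx[[1]]
oob[[1]]
```

The predictions of the models trained on these samples are then combined, typically by majority vote for classification.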
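Next, we put the actual labels and the machine learning predictions into the same data frame, then see how accurate the different models were. Currently this isn’t working very well, since none of the models accurately filters out spam emails. The maxent model is the only one that attempts it, and unfortunately it performs even worse, since it decides that actual spam emails are not spam. A likely cause is that spam_ham_list is taken from the shuffled spamhamsample, while spamham_dtm was built from the unshuffled spamham, so the labels may not line up with the rows of the matrix.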
correct_spam <- data.frame(
  correct_label = spam_ham_list[2901:length(spamham)],
  svm = as.character(svm_out[,1]),
  svm_prob = as.character(svm_out[,2]),
  tree = as.character(tree_out[,1]),
  tree_prob = as.character(tree_out[,2]),
  maxent = as.character(maxent_out[,1]),
  maxent_prob = as.character(maxent_out[,2]),
  bag = as.character(bag_out[,1]),
  bag_prob = as.character(bag_out[,2]),
  stringsAsFactors = FALSE)
correct_spam
correct_spam %>%
  group_by(correct_label) %>%
  count(svm)
prop.table(table(correct_spam[,1] == correct_spam[,2]))
##
## FALSE TRUE
## 0.1503268 0.8496732
correct_spam %>%
  group_by(correct_label) %>%
  count(tree)
prop.table(table(correct_spam[,1] == correct_spam[,4]))
##
## FALSE TRUE
## 0.1503268 0.8496732
correct_spam %>%
  group_by(correct_label) %>%
  count(maxent)
prop.table(table(correct_spam[,1] == correct_spam[,6]))
##
## FALSE TRUE
## 0.3398693 0.6601307
correct_spam %>%
  group_by(correct_label) %>%
  count(bag)
prop.table(table(correct_spam[,1] == correct_spam[,8]))
##
## FALSE TRUE
## 0.1503268 0.8496732
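Overall accuracy can hide poor spam detection when most emails are ham. A confusion matrix and per-class recall make this visible. The sketch below uses made-up labels, not the results above, and the object names are hypothetical.

```r
# Hypothetical true labels and predictions for ten emails
actual    <- factor(c(rep("ham", 8), rep("spam", 2)), levels = c("ham", "spam"))
predicted <- factor(rep("ham", 10), levels = c("ham", "spam"))

# Rows are true classes, columns are predicted classes
confusion <- table(actual = actual, predicted = predicted)

accuracy    <- sum(diag(confusion)) / sum(confusion)
spam_recall <- confusion["spam", "spam"] / sum(confusion["spam", ])

confusion
accuracy      # share of all emails classified correctly
spam_recall   # share of true spam that was caught
```

In this toy case a classifier that labels everything as ham reaches 80% accuracy while catching no spam at all, so accuracy should be read against the share of the majority class.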
https://stackoverflow.com/questions/7927367/r-text-file-and-text-mining-how-to-load-data
https://stackoverflow.com/questions/30435054/how-to-show-corpus-text-in-r-tm-package
https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
https://stackoverflow.com/questions/18153504/removing-non-english-text-from-corpus-in-r-using-tm
http://blog.datumbox.com/machine-learning-tutorial-the-max-entropy-text-classifier/