It is often useful to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
library(tm)          # text-mining framework (corpora, term matrices)
library(RCurl)       # getURL() and url.exists()
library(dplyr)       # pipe operator
library(stringr)     # string extraction and replacement
library(SnowballC)   # stemming
library(wordcloud)   # word clouds
library(RTextTools)  # supervised classification
The spam and ham datasets from the Apache SpamAssassin Project had to be extracted manually in two steps (the archives are compressed tarballs). The files were then manually uploaded to GitHub before being read into R.
base_url <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/spamham/"
cmds_spam <- read.table(paste0(base_url, "spam_2/cmds"),
                        quote = "\"", comment.char = "", stringsAsFactors = FALSE)[ , 3]
cmds_ham  <- read.table(paste0(base_url, "easy_ham/cmds"),
                        quote = "\"", comment.char = "", stringsAsFactors = FALSE)[ , 3]
spam <- ham <- character()
for (i in 1:max(length(cmds_spam), length(cmds_ham))) {
  url_ham  <- paste0(base_url, "easy_ham/", cmds_ham[i])
  url_spam <- paste0(base_url, "spam_2/", cmds_spam[i])
  if (url.exists(url_ham))  { ham  <- append(getURL(url_ham), ham) }
  if (url.exists(url_spam)) { spam <- append(getURL(url_spam), spam) }
}
The file names did not follow a simple pattern that could be looped over. Each unzipped folder, however, contained a cmds file that lists the file names in that folder; the names appear in the third column. There are gaps in the files' sequence numbers, which requires the url.exists() check.
ham <- ham %>%
  str_replace_all("<.*?>", " html_tag ") %>%                # generalize HTML tags
  str_replace_all("([^[:alnum:]]){5,}", " ") %>%            # collapse long runs of symbols
  str_replace_all("[[:alnum:]._-]+@[[:alnum:]._-]+", " email_address ") %>%
  str_replace_all("(https?://)?[[:alnum:]._-]+\\.[[:alnum:]._/-]+", " clickable_link ")
spam <- spam %>%
  str_replace_all("<.*?>", " html_tag ") %>%
  str_replace_all("([^[:alnum:]]){5,}", " ") %>%
  str_replace_all("[[:alnum:]._-]+@[[:alnum:]._-]+", " email_address ") %>%
  str_replace_all("(https?://)?[[:alnum:]._-]+\\.[[:alnum:]._/-]+", " clickable_link ")
ham <- paste(
  ham %>% str_extract("Subject: .*") %>% str_replace("Subject: ", ""),   # subject line
  ham %>% str_extract("(\n\n)(.*\n)+") %>% str_replace_all("\n", " "))   # message body
spam <- paste(
  spam %>% str_extract("Subject: .*") %>% str_replace("Subject: ", ""),
  spam %>% str_extract("(\n\n)(.*\n)+") %>% str_replace_all("\n", " "))
ham  <- iconv(ham, from = "latin1", to = "UTF-8")   # normalize encoding to UTF-8
spam <- iconv(spam, from = "latin1", to = "UTF-8")
HTML tags, email addresses, and hyperlinks appear frequently in varying forms throughout every document. In order to count how many times they appear in a message, they have been generalized to html_tag, email_address, and clickable_link, respectively. Last, the subject and body of each message are extracted, and the text is converted from Latin-1 to UTF-8.
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
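A tiny made-up example makes the structure concrete (the two documents below are invented for illustration):
toy_corpus <- Corpus(VectorSource(c("buy cheap pills now", "meeting notes attached")))
# Each row is a document, each column a term, each cell a term frequency:
inspect(DocumentTermMatrix(toy_corpus))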
corpus_ham <- ham %>% VectorSource() %>% Corpus()
meta(corpus_ham, "Spam") <- 0
corpus_spam <- spam %>% VectorSource() %>% Corpus()
meta(corpus_spam, "Spam") <- 1
(corpus <- c(corpus_spam, corpus_ham))
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 1
## Content: documents: 3895
A corpus is the central element for text operations in the tm package. The text is first wrapped in a VectorSource() call, which specifies that the corpus is created from text stored in a character vector. The corpus itself is then created with Corpus(). The meta() function attaches metadata, here the spam label, to each document.
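As a quick sanity check (not part of the original output), the attached labels can be inspected:
# First few document-level "Spam" labels stored in the corpus:
head(meta(corpus, "Spam"))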
par(mfrow = c(1, 2))
wordcloud(corpus_ham, scale = c(4, 0.5), max.words = 100, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
title(main = "Ham Messages")
wordcloud(corpus_spam, scale = c(4, 0.5), max.words = 100, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
title(main = "Spam Messages")
(tdm <- corpus %>%
  tm_map(content_transformer(tolower)) %>%   # lower-case first so stop words match
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>%
  tm_map(removeWords, words = stopwords("en")) %>%
  tm_map(stemDocument) %>% DocumentTermMatrix() %>%
  removeSparseTerms(1 - (10 / length(corpus))))
## <<DocumentTermMatrix (documents: 3895, terms: 3893)>>
## Non-/sparse entries: 295538/14867697
## Sparsity : 98%
## Maximal term length: 45
## Weighting : term frequency (tf)
The following are removed from the corpus in preparation for conversion into a document-term matrix: punctuation, numbers, and stop words (the most common words in a language, which appear frequently in all texts). Stop words are removed more to improve computational performance than to improve the estimation itself. All text is also converted to lower case, and terms are reduced to their stems so that words sharing a root are combined; many statistical analyses of text stem terms prior to estimation. After the document-term matrix is built, sparse terms that appear in ten or fewer documents are removed.
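For example, stemming maps inflected forms onto a common stem (output as expected from the Snowball stemmer used by tm):
stemDocument(c("running", "runs", "messages", "clicked"))
## [1] "run"    "run"    "messag" "click"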
The "supervised" in supervised classification reflects that these classifiers use pre-coded data to estimate the membership of unclassified documents. The pre-coded data are called the training dataset. The major advantage of supervised classifiers is that they let researchers specify a classification scheme of their choosing.
classes <- unlist(meta(corpus, "Spam"))
a <- length(corpus_spam) * 2; b <- length(corpus)  # train = all spam + an equal number of ham
container <- create_container(tdm, labels = classes,
                              trainSize = 1:a, testSize = (a + 1):b, virgin = FALSE)
svm <- classify_model(container, train_model(container, "SVM"))
tree <- classify_model(container, train_model(container, "TREE"))
forest <- classify_model(container, train_model(container, "RF"))
maxent <- classify_model(container, train_model(container, "MAXENT"))
The Support Vector Machine (SVM) is currently one of the most well-known and most commonly applied classifiers in supervised learning. The SVM employs a spatial representation of the data: it fits vectors between the document features that best separate the documents into groups, selecting the vectors so as to maximize the space between the groups. After estimation, new documents are classified by checking on which side of the vectors their features lie.
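A minimal sketch of the idea on synthetic two-dimensional data, assuming the e1071 package (the backend RTextTools uses for its SVM models); the clusters and test point are invented for illustration:
library(e1071)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # one cluster of "ham"-like points
           matrix(rnorm(40, mean = 3), ncol = 2))   # one cluster of "spam"-like points
y <- factor(rep(c("ham", "spam"), each = 20))
fit <- svm(x, y, kernel = "linear")      # fit vectors that maximize the margin between groups
predict(fit, matrix(c(3, 3), ncol = 2))  # the new point falls on the "spam" side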
A single decision tree consists of several layers that consecutively ask whether a particular feature is present or absent in a document. The Random Forest classifier extends the decision tree by generating many trees and taking the most frequent prediction across them as the classification most likely to be accurate.
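A similar sketch for the random forest, assuming the randomForest package (the backend RTextTools uses for its "RF" models); the data are again synthetic:
library(randomForest)
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- factor(ifelse(df$x1 + df$x2 > 0, "spam", "ham"))
rf <- randomForest(y ~ x1 + x2, data = df, ntree = 100)  # grow 100 decision trees
predict(rf, data.frame(x1 = 1, x2 = 1))                  # majority vote across the trees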
The Maximum Entropy classifier is analogous to the multinomial logit model, a generalization of the logit model. The logit model predicts the probability of belonging to one of two categories; the multinomial logit generalizes this to dependent variables with more than two categories.
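Since this project has only two classes, the analogy reduces to an ordinary logit; a sketch in base R with invented token counts per message:
# Made-up counts of the "clickable_link" token, with spam labels:
df <- data.frame(clickable_link = c(0, 2, 3, 1, 5, 0),
                 spam           = c(0, 0, 1, 0, 1, 1))
logit <- glm(spam ~ clickable_link, data = df, family = binomial)
predict(logit, data.frame(clickable_link = 4), type = "response")  # estimated P(spam)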
The RTextTools package in R includes nine algorithms for ensemble classification; four of them are trained above. The first few test-set probabilities from each:
cbind("LABEL" = 0,
"SVM_PROB" = head(svm)[ , 2],
"TREE_PROB" = head(tree)[ , 2],
"RANDFOREST_PROB" = head(forest)[ , 2],
"MAXENTROPY_PROB" =head(maxent)[ , 2])
## LABEL SVM_PROB TREE_PROB RANDFOREST_PROB MAXENTROPY_PROB
## [1,] 0 0.9999990 0.8676471 0.905 1
## [2,] 0 1.0000000 0.9923372 0.900 1
## [3,] 0 1.0000000 0.8676471 0.875 1
## [4,] 0 1.0000000 0.9923372 0.930 1
## [5,] 0 0.9999996 0.8676471 0.970 1
## [6,] 0 1.0000000 0.9923372 0.995 1
labels <- data.frame(
  correct_label = classes[(a + 1):b],
  svm = as.character(svm[ , 1]),
  tree = as.character(tree[ , 1]),
  forest = as.character(forest[ , 1]),
  maxent = as.character(maxent[ , 1]),
  stringsAsFactors = FALSE)
svm_perf <- table(labels[ , 1] == labels[ , 2])
tree_perf <- table(labels[ , 1] == labels[ , 3])
forest_perf <- table(labels[ , 1] == labels[ , 4])
maxent_perf <- table(labels[ , 1] == labels[ , 5])
prop.table(svm_perf)
##
## FALSE TRUE
## 0.2624434 0.7375566
prop.table(tree_perf)
##
## FALSE TRUE
## 0.3701357 0.6298643
prop.table(forest_perf)
##
## FALSE TRUE
## 0.1809955 0.8190045
prop.table(maxent_perf)
##
## FALSE TRUE
## 0.1158371 0.8841629
The Maximum Entropy classifier classified 977 out of 1105 documents correctly, about 88.42%. The Random Forest fared a little worse, getting 905 of 1105 right, about 81.9%. Third was the SVM, with 815 of 1105 correct, about 73.76%. The worst classifier in this application is the decision tree, which correctly classifies merely 696, or 62.99%, of the 1105 cases.
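RTextTools also provides built-in diagnostics; the following sketch (not run above) would report precision, recall, and F-scores per algorithm:
analytics <- create_analytics(container, cbind(svm, tree, forest, maxent))
summary(analytics)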
par(mfrow = c(2, 2))
pie(maxent_perf, main = "Maximum Entropy", col = c("red", "blue"))
pie(forest_perf, main = "Random Forest", col = c("red", "blue"))
pie(svm_perf, main = "SVM", col = c("red", "blue"))
pie(tree_perf, main = "Decision Tree", col = c("red", "blue"))
The simplest way to score the sentiment of a text is to count the positively and negatively charged terms in a document. The dictionary that is provided by Hu and Liu (2004) and Liu et al. (2005) consists of two lists of several thousand terms that reveal the sentiment orientation of a text.
pos <- readLines("https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/positive-words.txt")
pos <- pos[36:length(pos)]                      # drop the file's introductory header
pos <- stemDocument(pos, language = "english")
pos <- pos[!duplicated(pos)]                    # stemming creates duplicate entries
neg <- readLines("https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/negative-words.txt")
neg <- neg[36:length(neg)]
neg <- stemDocument(neg, language = "english")
neg <- neg[!duplicated(neg)]
The files are loaded and the irrelevant introductory lines are discarded. The lists are then stemmed and duplicates discarded.
In an ordinary term-document matrix, the cells hold the frequency of the terms in the texts. Here each term is counted at most once per document, regardless of how often it appears, by setting the weighting control option to weightBin. The textbook argues that the simple presence or absence of terms is a more robust summary indicator of a text's sentiment orientation.
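A toy one-document comparison of the two weightings (document invented for illustration):
toy_doc <- Corpus(VectorSource("good good good bad"))
inspect(DocumentTermMatrix(toy_doc))                                        # "good" counted 3 times
inspect(DocumentTermMatrix(toy_doc, control = list(weighting = weightBin))) # "good" counted once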
tdm_ham <- corpus_ham %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>%
  tm_map(removeWords, words = stopwords("en")) %>%
  tm_map(stemDocument) %>%
  TermDocumentMatrix(control = list(weighting = weightBin)) %>%
  removeSparseTerms(1 - (10 / length(corpus_ham)))
pos_ham <- apply(tdm_ham[rownames(tdm_ham) %in% pos, ], 2, sum)
neg_ham <- apply(tdm_ham[rownames(tdm_ham) %in% neg, ], 2, sum)
sentiment_diff_ham <- pos_ham - neg_ham
sentiment_diff_ham[sentiment_diff_ham == 0] <- NA
(count_ham <- data.frame(pos = sum(pos_ham), neg = sum(neg_ham)))
## pos neg
## 1 19111 9631
(sentiment_ham <- summary(sentiment_diff_ham))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -7.000 1.000 3.000 4.367 6.000 75.000 329
The ham contains 19111 positive and 9631 negative terms. The mean message is positive, with a net of 4.367 positive terms on average. The most positive text contains a net of 75 positive terms; the least positive, a net of -7. Such variance highlights the obstacle posed by extreme variation in message length.
range(nchar(ham))
## [1] 6 4061636
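One way to mitigate the length problem, not pursued here, would be to normalize each document's net sentiment by its number of retained terms; a sketch:
# Net sentiment per distinct term, so long messages do not dominate
# (documents with no retained terms would yield NaN):
terms_per_doc <- apply(tdm_ham, 2, sum)  # column sums = distinct terms per document
summary((pos_ham - neg_ham) / terms_per_doc)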
tdm_spam <- corpus_spam %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>%
  tm_map(removeWords, words = stopwords("en")) %>%
  tm_map(stemDocument) %>%
  TermDocumentMatrix(control = list(weighting = weightBin)) %>%
  removeSparseTerms(1 - (10 / length(corpus_spam)))
pos_spam <- apply(tdm_spam[rownames(tdm_spam) %in% pos, ], 2, sum)
neg_spam <- apply(tdm_spam[rownames(tdm_spam) %in% neg, ], 2, sum)
sentiment_diff_spam <- pos_spam - neg_spam
sentiment_diff_spam[sentiment_diff_spam == 0] <- NA
(count_spam <- data.frame(pos = sum(pos_spam), neg = sum(neg_spam)))
## pos neg
## 1 18648 4661
(sentiment_spam <- summary(sentiment_diff_spam))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3.00 4.00 8.00 10.76 13.00 80.00 95
The spam contains 18648 positive and 4661 negative terms. The mean message is positive, with a net of 10.76 positive terms on average. The most positive text contains a net of 80 positive terms; the least positive, a net of -3. Again, the extreme variation in message length is an obstacle.
range(nchar(spam))
## [1] 6 6033645
Spam is 80% positive while ham is 66.49% positive. Given that spam tends to be marketing, this finding accords with intuition. It is surprising, however, that an overall positive sentiment prevails in ham, since ham is generally just standard communication.
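For reference, the quoted shares follow directly from the counts above:
count_spam$pos / (count_spam$pos + count_spam$neg)  # 18648 / 23309 = 0.80
count_ham$pos / (count_ham$pos + count_ham$neg)     # 19111 / 28742 = 0.6649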
https://spamassassin.apache.org/publiccorpus/
https://www.youtube.com/watch?v=6IzhRaSePKU
https://en.wikipedia.org/wiki/Document-term_matrix
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
http://stackoverflow.com/questions/25551514/termdocumentmatrix-errors-in-r
http://www.exegetic.biz/blog/2013/09/text-mining-the-complete-works-of-william-shakespeare/
https://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/
Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2015). Automated Data Collection with R. Wiley.