Assignment Instructions

It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/

library(tm)         # text mining framework
library(RCurl)      # url.exists() and getURL()
library(dplyr)      # %>% pipes
library(stringr)    # regular-expression helpers
library(SnowballC)  # stemming
library(wordcloud)  # word cloud plots
library(RTextTools) # supervised text classification

Obtain Dataset

The spam and ham datasets from the Apache SpamAssassin Project had to be extracted manually in two steps, since the archives are bzip2-compressed tarballs. The extracted files were then uploaded to GitHub and read into R from there.

base_url <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/spamham/"
# The third column of each cmds file lists the message file names
cmds_spam <- read.table(paste0(base_url, "spam_2/cmds"), 
            quote="\"", comment.char="", stringsAsFactors = F)[ , 3]
cmds_ham <- read.table(paste0(base_url, "easy_ham/cmds"), 
            quote="\"", comment.char="", stringsAsFactors = F)[ , 3]

spam <- ham <- character()
for (i in 1:max(length(cmds_spam), length(cmds_ham))) {
  url_ham <- paste0(base_url, "easy_ham/", cmds_ham[i])
  url_spam <- paste0(base_url, "spam_2/", cmds_spam[i])
  # The two file lists differ in length and have gaps, so verify each URL
  if (url.exists(url_ham))  { ham  <- append(getURL(url_ham), ham)   }
  if (url.exists(url_spam)) { spam <- append(getURL(url_spam), spam) }
}

The file names do not follow a simple pattern that could be looped over directly. Each unzipped folder, however, contains a cmds file that lists the file names in that folder; the names appear in the third column. There are gaps in the sequence numbers of the files, which is why each URL is verified with url.exists() before downloading.

Clean Data Obtained

# Generalize markup, addresses, and links to single tokens (note the
# hyphen placed last in the character classes so that it is literal
# rather than forming a range)
clean_text <- function(x) {
  x %>%
    str_replace_all("<.*?>", " html_tag ") %>%
    str_replace_all("([^[:alnum:]]){5,}", " ") %>%
    str_replace_all("[[:alnum:]._-]+@([[:alnum:]._-]+){2,5}", " email_address ") %>%
    str_replace_all("(https?:\\/\\/)?([[:alnum:]._-])+\\.([[:alnum:]._/-])+", " clickable_link ")
}
ham <- clean_text(ham)
spam <- clean_text(spam)
# Keep only the subject line and the message body (which begins after the
# first blank line), then flatten newlines
extract_text <- function(x) {
  paste(x %>% str_extract_all("(Subject: ).*") %>% str_replace_all("Subject: ", ""),
        x %>% str_extract("(\n\n)(.*\n)+") %>% str_replace_all("\n", " "))
}
ham <- extract_text(ham)
spam <- extract_text(spam)
# Re-encode from Latin-1 to UTF-8 so tm can process the text
ham <- iconv(ham, from = "latin1", to = "UTF-8")
spam <- iconv(spam, from = "latin1", to = "UTF-8")

HTML tags, email addresses, and hyperlinks appear frequently, in varying forms, throughout the documents. In order to count how many times they appear in a message, they are generalized to the tokens html_tag, email_address, and clickable_link, respectively. Finally, the subject and body of each message are extracted, and the text is re-encoded from Latin-1 to UTF-8.
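
To make the substitutions concrete, here is their effect on one made-up message fragment (stringr and dplyr are assumed to be loaded, as above):

sample <- "<b>Buy now</b> at http://example.com/deal or write to sales@example.com"
sample %>%
  str_replace_all("<.*?>", " html_tag ") %>%
  str_replace_all("[[:alnum:]._-]+@([[:alnum:]._-]+){2,5}", " email_address ") %>%
  str_replace_all("(https?:\\/\\/)?([[:alnum:]._-])+\\.([[:alnum:]._/-])+", " clickable_link ")
# yields roughly: " html_tag Buy now html_tag  at  clickable_link  or write to  email_address "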

Read Data into Term Document Matrix

A document-term matrix (or, transposed, a term-document matrix) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
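
A toy example makes the structure concrete; the two documents are made up:

docs <- c("cheap spam offer", "lunch offer tomorrow")
toy <- DocumentTermMatrix(Corpus(VectorSource(docs)))
inspect(toy)
# One row per document, one column per term ("cheap", "lunch", "offer",
# "spam", "tomorrow"); each cell holds how often the term occurs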

Create Corpus and Classify

corpus_ham <- ham %>% VectorSource() %>% Corpus()
meta(corpus_ham, "Spam") <- 0
corpus_spam <- spam %>% VectorSource() %>% Corpus()
meta(corpus_spam, "Spam") <- 1
(corpus <- c(corpus_spam, corpus_ham))
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 1
## Content:  documents: 3895

A corpus is the central data structure for text operations in the tm package. The text is first wrapped in a VectorSource() call, which specifies that the corpus is being created from text stored in a character vector. The corpus itself is then created by calling Corpus(). The meta() function attaches metadata, here the class label, to the documents.
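
As a quick sanity check (output omitted), the labels can be read back off the combined corpus:

# "Spam" is stored as indexed, document-level metadata; the first
# documents should carry 1 (spam) and the last 0 (ham)
head(unlist(meta(corpus, "Spam")))
tail(unlist(meta(corpus, "Spam")))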

par(mfrow = c(1, 2))
wordcloud(corpus_ham, scale=c(4,0.5), 
          max.words=100, 
          random.order=FALSE, 
          rot.per=0.35, 
          use.r.layout=FALSE, 
          colors=brewer.pal(8, "Dark2"))
title(main="Ham Messages")
wordcloud(corpus_spam, scale=c(4,0.5), 
          max.words=100, 
          random.order=FALSE, 
          rot.per=0.35, 
          use.r.layout=FALSE, 
          colors=brewer.pal(8, "Dark2"))
title(main="Spam Messages")

Create Document-Term Matrix

(tdm <- corpus %>% 
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>%
  tm_map(removeWords, words = stopwords("en")) %>% 
  tm_map(content_transformer(tolower)) %>%
  tm_map(stemDocument) %>% DocumentTermMatrix() %>% 
  removeSparseTerms(1 - (10 / length(corpus))))
## <<DocumentTermMatrix (documents: 3895, terms: 3893)>>
## Non-/sparse entries: 295538/14867697
## Sparsity           : 98%
## Maximal term length: 45
## Weighting          : term frequency (tf)

The following are removed from the corpus in preparation for conversion into a document-term matrix: punctuation, numbers, and stop words (the most common words in a language, which appear in virtually all text). Stop words are removed more to improve computational performance than to improve the estimation procedure. All text is also converted to lower case, and every term is reduced to its stem so that words sharing the same root are combined; many statistical analyses of text stem terms prior to estimation. After the document-term matrix is built, sparse terms, meaning those that appear in fewer than ten documents, are removed.
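
The sparsity threshold deserves a brief unpacking: removeSparseTerms(x, s) drops terms whose share of zero entries exceeds s, so passing 1 - 10/N keeps only terms that appear in at least ten of the N documents.

N <- length(corpus)   # 3895 documents
s <- 1 - (10 / N)     # the sparsity threshold used above
# A term appearing in k documents has sparsity 1 - k/N; it survives when
#   1 - k/N <= s,  i.e.  k >= 10,
# so terms found in fewer than ten documents are dropped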

Supervised Learning Techniques

The "supervised" in the term reflects what the classifiers in this class have in common: pre-coded (labeled) data are used to estimate the membership of unclassified documents. The pre-coded data are called the training dataset. The major advantage of supervised classifiers is that they let researchers specify a classification scheme of their own choosing.

Estimation Using Different Supervised Classifiers

classes <- unlist(meta(corpus, "Spam"))
# Train on the first 2 * n_spam documents, i.e. all of the spam plus an
# equal number of ham; the held-out test set is therefore entirely ham
a <- length(corpus_spam) * 2; b <- length(corpus)
container <- create_container(tdm, labels = classes,
  trainSize = 1:a, testSize = (a + 1):b, virgin = F)
svm <- classify_model(container, train_model(container, "SVM"))
tree <- classify_model(container, train_model(container, "TREE"))
forest <- classify_model(container, train_model(container, "RF"))
maxent <- classify_model(container, train_model(container, "MAXENT"))

Support Vector Machines (SVMs) are currently among the best-known and most commonly applied classifiers in supervised learning. An SVM employs a spatial representation of the data: it fits vectors between the document features that best separate the documents into the various groups, and the vectors are selected so that they maximize the space between the groups. After estimation, new documents are classified by checking on which side of the vectors their features come to lie.
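
A minimal, self-contained sketch of the idea using the e1071 package (one of the libraries RTextTools builds on); the two-dimensional data are synthetic stand-ins for document features:

library(e1071)

set.seed(1)
# Two synthetic, roughly separable clusters of "documents"
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
y <- factor(rep(c("ham", "spam"), each = 20))

fit <- svm(x, y, kernel = "linear")    # fit a maximum-margin separator
predict(fit, rbind(c(0, 0), c(3, 3)))  # new points classified by side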

A single decision tree consists of several layers that consecutively ask whether a particular feature is present or absent in a document. The Random Forest classifier extends this by generating many decision trees and taking the most frequently predicted membership category across them as the classification that is most likely to be accurate.
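
The voting idea in miniature, using the randomForest package on a two-class subset of the built-in iris data as a stand-in for spam/ham features:

library(randomForest)

# Keep two of the three species to mirror the binary spam/ham setup
binary <- droplevels(iris[iris$Species != "setosa", ])

set.seed(1)
fit <- randomForest(Species ~ ., data = binary, ntree = 100)
# Each of the 100 trees votes; the majority vote is the prediction
predict(fit, binary[c(1, 51), ])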

The Maximum Entropy classifier is analogous to the multinomial logit model, which is a generalization of the logit model: the logit model predicts the probability of belonging to one of two categories, while the multinomial logit extends this to a dependent variable with more than two categories.
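
Since spam detection has only two categories, the model reduces to an ordinary logit; a sketch on synthetic data (the link-count feature is invented for illustration):

set.seed(1)
# Invented feature: number of clickable_link tokens per message
links <- c(rpois(50, 1), rpois(50, 4))
is_spam <- rep(c(0, 1), each = 50)

fit <- glm(is_spam ~ links, family = binomial)
# Predicted probability of spam for messages with 0 and 5 links
predict(fit, data.frame(links = c(0, 5)), type = "response")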

Other Supervised Classifiers

The RTextTools package in R includes the following nine algorithms for ensemble classification (a batch-training sketch follows the list):

  • SVM: Support Vector Machine
  • SLDA: Supervised Latent Dirichlet Allocation
  • BOOSTING: Boosting (Machine Learning)
  • BAGGING: Bagging (Bootstrap Aggregating)
  • RF: Random Forests
  • GLMNET: Elastic-Net Regularized Generalized Linear Models
  • TREE: Decision Trees
  • NNET: Neural Networks
  • MAXENT: Maximum Entropy
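
Rather than calling train_model() once per algorithm as above, several can be trained in one pass; a sketch using the batch functions train_models() and classify_models() on the container built earlier (results not shown):

models <- train_models(container, algorithms = c("SVM", "RF", "MAXENT", "TREE"))
results <- classify_models(container, models)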

Evaluation of Supervised Classifiers Applied

cbind("LABEL" = 0, 
      "SVM_PROB" = head(svm)[ , 2], 
      "TREE_PROB" = head(tree)[ , 2], 
      "RANDFOREST_PROB" = head(forest)[ , 2], 
      "MAXENTROPY_PROB" =head(maxent)[ , 2])
##      LABEL  SVM_PROB TREE_PROB RANDFOREST_PROB MAXENTROPY_PROB
## [1,]     0 0.9999990 0.8676471           0.905               1
## [2,]     0 1.0000000 0.9923372           0.900               1
## [3,]     0 1.0000000 0.8676471           0.875               1
## [4,]     0 1.0000000 0.9923372           0.930               1
## [5,]     0 0.9999996 0.8676471           0.970               1
## [6,]     0 1.0000000 0.9923372           0.995               1
labels <- data.frame(
  correct_label = classes[(a + 1):b],
  svm = as.character(svm[ , 1]),
  tree = as.character(tree[ , 1]),
  forest = as.character(forest[ , 1]),
  maxent = as.character(maxent[ , 1]),
  stringsAsFactors = F)
svm_perf <- table(labels[ , 1] == labels[ , 2])
tree_perf <- table(labels[ , 1] == labels[ , 3])
forest_perf <- table(labels[ , 1] == labels[ , 4])
maxent_perf <- table(labels[ , 1] == labels[ , 5])
prop.table(svm_perf)
## 
##     FALSE      TRUE 
## 0.2624434 0.7375566
prop.table(tree_perf)
## 
##     FALSE      TRUE 
## 0.3701357 0.6298643
prop.table(forest_perf)
## 
##     FALSE      TRUE 
## 0.1809955 0.8190045
prop.table(maxent_perf)
## 
##     FALSE      TRUE 
## 0.1158371 0.8841629

The Maximum Entropy classifier classified 977 out of the 1105 test documents, or about 88.42%, correctly. The Random Forest fared a little worse, getting 905 out of 1105, or about 81.9%, right. Third was the SVM, with 815 out of 1105, or about 73.76%, correct. The worst classifier in this application is the decision tree, which correctly classifies merely 696, or 62.99%, of the 1105 cases.
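
RTextTools can also summarize precision, recall, and cross-algorithm agreement in a single object; a sketch (results not shown here):

# Combined analytics over the four classifiers fit above
analytics <- create_analytics(container, cbind(svm, tree, forest, maxent))
summary(analytics)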

par(mfrow = c(1,2))
pie(maxent_perf, main = "Maximum Entropy", col = c("red", "blue"))
pie(forest_perf, main = "Random Forest", col = c("red", "blue"))

pie(svm_perf, main = "SVM", col = c("red", "blue"))
pie(tree_perf, main = "Decision Tree", col = c("red", "blue"))

Dictionary-Based Sentiment Analysis

The simplest way to score the sentiment of a text is to count the positively and negatively charged terms in a document. The dictionary that is provided by Hu and Liu (2004) and Liu et al. (2005) consists of two lists of several thousand terms that reveal the sentiment orientation of a text.
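
The counting idea on one made-up sentence, with tiny stand-in dictionaries (the real Hu and Liu lists contain several thousand terms):

pos_toy <- c("great", "good", "love")
neg_toy <- c("bad", "awful", "hate")

tokens <- tolower(unlist(strsplit("I love this great product, not bad at all", "\\W+")))
sum(tokens %in% pos_toy) - sum(tokens %in% neg_toy)
# 2 positive hits - 1 negative hit = net score of 1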

Load and Clean Dictionaries

pos <- readLines("https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/positive-words.txt")
pos <- pos[36:length(pos)]                      # drop the introductory header lines
pos <- stemDocument(pos, language = "english")  # stem to match the stemmed corpus
pos <- pos[!duplicated(pos)]                    # stemming creates duplicates
neg <- readLines("https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/negative-words.txt")
neg <- neg[36:length(neg)]
neg <- stemDocument(neg, language = "english")
neg <- neg[!duplicated(neg)]

The files are loaded and the introductory header lines are discarded. The lists are then stemmed, so that they match the stemmed corpus terms, and duplicates are discarded.

Compute Sentiments

In an ordinary term-document matrix, the cells hold the frequency of each term in each text. Here, by adding the control option weighting = weightBin, each term is counted at most once per text, regardless of how often it appears. The textbook argues that the simple presence or absence of terms in a text is a more robust summary indicator of its sentiment orientation.
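
The difference between the two weightings on a toy document:

doc <- Corpus(VectorSource("good good good bad"))
inspect(TermDocumentMatrix(doc))  # term frequency: good = 3, bad = 1
inspect(TermDocumentMatrix(doc, control = list(weighting = weightBin)))
# binary weighting: good = 1, bad = 1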

Sentiments in Ham

tdm_ham <- corpus_ham %>% 
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>%
  tm_map(removeWords, words = stopwords("en")) %>% 
  tm_map(content_transformer(tolower)) %>%
  tm_map(stemDocument) %>% 
  TermDocumentMatrix(control = list(weighting = weightBin)) %>% 
  removeSparseTerms(1 - (10 / length(corpus_ham)))
pos_ham <- apply(tdm_ham[rownames(tdm_ham) %in% pos, ], 2, sum)
neg_ham <- apply(tdm_ham[rownames(tdm_ham) %in% neg, ], 2, sum)
sentiment_diff_ham <- pos_ham - neg_ham
sentiment_diff_ham[sentiment_diff_ham == 0] <- NA
(count_ham <- data.frame(pos = sum(pos_ham), neg = sum(neg_ham)))
##     pos  neg
## 1 19111 9631
(sentiment_ham <- summary(sentiment_diff_ham))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -7.000   1.000   3.000   4.367   6.000  75.000     329

The ham contains 19111 positive and 9631 negative terms. The mean message is positive, with a net of 4.367 more positive than negative terms on average. The most positive text contains a net of 75 positive terms, and the least positive a net of -7. Such variance highlights the obstacle posed by the extreme variation in the length of the messages.

range(nchar(corpus_ham))
## [1]       6 4061636
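
One way to blunt this length effect, though it is not part of the original analysis, is to scale each message's net sentiment by its total number of dictionary hits:

# Net sentiment as a share of all dictionary hits per message, so long
# messages cannot dominate simply by containing more words
hits_ham <- pos_ham + neg_ham
summary((pos_ham - neg_ham) / ifelse(hits_ham == 0, NA, hits_ham))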

Sentiments in Spam

tdm_spam <- corpus_spam %>% 
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>%
  tm_map(removeWords, words = stopwords("en")) %>% 
  tm_map(content_transformer(tolower)) %>%
  tm_map(stemDocument) %>% 
  TermDocumentMatrix(control = list(weighting = weightBin)) %>% 
  removeSparseTerms(1 - (10 / length(corpus_spam)))
pos_spam <- apply(tdm_spam[rownames(tdm_spam) %in% pos, ], 2, sum)
neg_spam <- apply(tdm_spam[rownames(tdm_spam) %in% neg, ], 2, sum)
sentiment_diff_spam <- pos_spam - neg_spam
sentiment_diff_spam[sentiment_diff_spam == 0] <- NA
(count_spam <- data.frame(pos = sum(pos_spam), neg = sum(neg_spam)))
##     pos  neg
## 1 18648 4661
(sentiment_spam <- summary(sentiment_diff_spam))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -3.00    4.00    8.00   10.76   13.00   80.00      95

The spam contains 18648 positive and 4661 negative terms. The mean message is positive, with a net of 10.76 more positive than negative terms on average. The most positive text contains a net of 80 positive terms, and the least positive a net of -3. The same caveat about extreme variation in message length applies.

range(nchar(corpus_spam))
## [1]       6 6033645

Conclusion

Spam is 80% positive, while ham is 66.49% positive. Given that spam tends to be marketing, the finding for spam matches intuition rather well. It is surprising, however, that an overall positive sentiment prevails in ham as well, since ham is generally just routine communication.
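
Both percentages follow directly from the term counts computed above:

# Share of positively charged terms among all dictionary hits
count_spam$pos / (count_spam$pos + count_spam$neg)  # 18648 / 23309 = 0.80
count_ham$pos / (count_ham$pos + count_ham$neg)     # 19111 / 28742 = 0.6649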

References

http://online.b1.org/online

http://www.convertfiles.com/

https://spamassassin.apache.org/publiccorpus/

https://www.youtube.com/watch?v=6IzhRaSePKU

https://en.wikipedia.org/wiki/Document-term_matrix

https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

http://stackoverflow.com/questions/25551514/termdocumentmatrix-errors-in-r

http://www.exegetic.biz/blog/2013/09/text-mining-the-complete-works-of-william-shakespeare/

https://www.r-bloggers.com/text-mining-the-complete-works-of-william-shakespeare/

Munzert, S., Rubba, C., Meißner, P., and Nyhuis, D. (2015). Automated Data Collection with R. Wiley.