It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
I referenced chapter 10 from Automated Data Collection with R and the website https://www.r-bloggers.com/classifying-emails-as-spam-or-ham-using-rtexttools/ to build out my solution. So any code that mimics what is found in these resources is due to my direct use of them. Consequently, I note them above to give them proper credit.
I downloaded the 20021010_easy_ham.tar.bz2 and 20021010_spam.tar.bz2 files from the above website. After unzipping, I had 2551 “ham” emails and 500 “spam” emails (I removed the file names file) to build my corpus. After examining the code for the website mentioned above and examining the emails directly, it appears that the message for the email is always after the first blank line in the file. So this is the part of the file that I want to capture to build the term document matrix on. Because the “spam” emails contain some non-english words, the initial attempt to pull the data fails. After adding a try statement, it succeeds.
library(tm)
library(RTextTools)
library(stringr)
library(SnowballC)
#get directories
ham<-"C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Assignment10/easy_ham/"
spam<-"C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Assignment10/spam/"
#get file names
ham_names<-dir(ham)
spam_names<-dir(spam)
#get email messages
#ham
ham_message<-c()
for(i in 1:length(ham_names)) {
file<-paste0(ham,ham_names[i])
connection <- file(file, open="rt", encoding="latin1")
text <- readLines(connection)
msg <- text[seq(which(text=="")[1]+1,length(text),1)]
close(connection)
vector<-c(i,paste(msg, collapse=" "))
ham_message<-rbind(ham_message,vector)
}
ham_df<-data.frame(ham_message, stringsAsFactors = FALSE, row.names = NULL)
#spam
spam_message<-c()
for(i in 1:length(spam_names)) {
file<-paste0(spam,spam_names[i])
connection <- file(file, open="rt", encoding="latin1")
text <- readLines(connection)
msg <- try(text[seq(which(text=="")[1]+1,length(text),1)], silent = TRUE)
close(connection)
vector<-c(i,paste(msg, collapse=" "))
spam_message<-rbind(spam_message,vector)
}
spam_df<-data.frame(spam_message, stringsAsFactors = FALSE, row.names = NULL)
spam_df<-spam_df[which(str_detect(str_sub(spam_df$X2,1,5), "Error") == FALSE),]The “spam” emails had 5 messages that couldn’t be processed due to non-ASCII characters, leaving me with 496 “spam” emails to work with. That means we have 496/2551 = 19.4% spam messages in our data. I create corpuses for the “ham” and “spam” messages, add a label spam in the metadata for each (0 for ham, 1 for spam), and then I create a final corpus of both data sets.
#create corpuses
ham_corpus<-Corpus(VectorSource(ham_df$X2))
spam_corpus<-Corpus(VectorSource(spam_df$X2))
#add label in metadata
for(i in 1:length(ham_corpus)){
meta(ham_corpus[[i]], "spam")<-0
}
for(i in 1:length(spam_corpus)){
meta(spam_corpus[[i]], "spam")<-1
}
#create final corpus
final<-c(ham_corpus,spam_corpus)Now I can do cleanup of the data. I remove numbers, punctuation (adding whitespace using the code from Chapter 10), and stopwords. I make the text all lower case and do a stemming step.
#remove numbers
final<-tm_map(final, removeNumbers)
#remove punctuation
# use content_transformer to deal with error
#note: http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
final<-tm_map(final, content_transformer(str_replace_all), pattern = "[[:punct:]]", replacement = " ")
#remove stopwords
final<-tm_map(final, removeWords, words = stopwords("en"))
#make lowercase
final<-tm_map(final, content_transformer(tolower))
#stemming
final<-tm_map(final, stemDocument)Now I can create a term-document matrix of the final corpus of the combined “spam” and “ham” messages. Since there is 100% sparsity, I will remove the sparse terms. I will only keep words that have less than .99 sparsity. This brings us down to 96% sparsity.
#create term document matrix
tdm<-TermDocumentMatrix(final)
tdm## <<TermDocumentMatrix (terms: 44212, documents: 3047)>>
## Non-/sparse entries: 301142/134412822
## Sparsity : 100%
## Maximal term length: 123
## Weighting : term frequency (tf)
#remove sparse terms
tdm<-removeSparseTerms(tdm, .99)
tdm## <<TermDocumentMatrix (terms: 1688, documents: 3047)>>
## Non-/sparse entries: 197883/4945453
## Sparsity : 96%
## Maximal term length: 70
## Weighting : term frequency (tf)
Now we can use the RTextTools package to do some predictions of ham or spam. We first have to create a DocumentTermMatrix (the needed input for RTextTools). Then I remove the sparse terms again and get the labels for the emails (spam or ham) that was stored in the meta data. I do a 70/30 split of the data in order to have separate training and test datasets that have both ham and spam in them.
#create document term matrix
#create term document matrix
dtm<-DocumentTermMatrix(final)
dtm## <<DocumentTermMatrix (documents: 3047, terms: 44212)>>
## Non-/sparse entries: 301142/134412822
## Sparsity : 100%
## Maximal term length: 123
## Weighting : term frequency (tf)
#remove sparse terms
dtm<-removeSparseTerms(dtm, .99)
dtm## <<DocumentTermMatrix (documents: 3047, terms: 1688)>>
## Non-/sparse entries: 197883/4945453
## Sparsity : 96%
## Maximal term length: 70
## Weighting : term frequency (tf)
#get labels
label<-c()
for(i in 1:length(final)){
label<-c(label,final[[i]]$meta$spam)
}
#create 70/30 split of data
set.seed(10)
probs<-runif(length(final),0,1)
train<-which(probs<=.70)
test<-which(probs>.70)Once we have created the split, we can create a container for the data, which provides the needed structure to perform the prediction. Finally, I use the MAXENT classifier to make the predictions of the test data.
#create container
container <-create_container(dtm
,labels = label
,trainSize = train
,testSize = test
,virgin = FALSE
)
#predictions
maxent_model<-train_model(container, "MAXENT")
maxent_out<-classify_model(container, maxent_model)
head(maxent_out)## MAXENTROPY_LABEL MAXENTROPY_PROB
## 1 0 1.0000000
## 2 0 1.0000000
## 3 0 0.9999999
## 4 0 1.0000000
## 5 0 1.0000000
## 6 0 1.0000000
How well did the classifier do? We can look at several measures. First, we see the confusion matrix. When I ran this, only 14 of the 955 emails in the test set was misclassified for an accuracy of 98%. The AUC was 0.97, so the model did a pretty good job of classifying the spam and ham.
library(SDMTools)
library(pROC)
#create data frame for confusion matrix
labels_out<-data.frame(correct_label = label[test],maxent = maxent_out[,1])
#output confusion matrix
confusion.matrix(labels_out$correct_label, labels_out$maxent)## obs
## pred 0 1
## 0 781 9
## 1 5 160
## attr(,"class")
## [1] "confusion.matrix"
#view accuaracy, auc, etc.
accuracy(labels_out$correct_label, labels_out$maxent)## threshold AUC omission.rate sensitivity specificity prop.correct
## 1 0.5 0.9701921 0.05325444 0.9467456 0.9936387 0.9853403
## Kappa
## 1 0.9492021
#roc curve
roc_curve<-roc(labels_out$correct_label, as.numeric(as.character(maxent_out[,1])))
plot(roc_curve)##
## Call:
## roc.default(response = labels_out$correct_label, predictor = as.numeric(as.character(maxent_out[, 1])))
##
## Data: as.numeric(as.character(maxent_out[, 1])) in 786 controls (labels_out$correct_label 0) < 169 cases (labels_out$correct_label 1).
## Area under the curve: 0.9702
Which words signal spam? While I couldn’t find any documenation to explain Weight in the maxent_model@weights, it appears that the extreme values (positive or negative) are the most important words. I match the terms from dtm$dimnames$Terms to the output of maxent_model@weights using the Feature column, and then arrange by the Weight to get the most important words. I show the top 15 terms by Weight.
library(tidyr)
library(dplyr)
#get model weights
maxent_df<-maxent_model@weights
maxent_df$Weight_Adj<-as.numeric(as.character(maxent_df$Weight))
maxent_df$Feature_Adj<-as.numeric(as.character(maxent_df$Feature))
#get terms
maxent_df$Terms<-dtm$dimnames$Terms[maxent_df$Feature_Adj]
#view top terms by weight
head(arrange(maxent_df, desc(Weight_Adj)), n =15)## Weight Label Feature Weight_Adj Feature_Adj Terms
## 1 5.86 0 1671 5.86 1671 wrote
## 2 4.96 0 1572 4.96 1572 url
## 3 3.57 0 404 3.57 404 date
## 4 3.52 0 1017 3.52 1017 newsisfre
## 5 2.89 0 1386 2.89 1386 spamassassin
## 6 2.87 1 809 2.87 809 internet
## 7 2.52 1 1062 2.52 1062 order
## 8 2.51 0 951 2.51 951 messag
## 9 2.37 0 1574 2.37 1574 use
## 10 2.32 1 1242 2.32 1242 remov
## 11 2.15 0 657 2.15 657 geeg
## 12 2.14 0 1063 2.14 1063 org
## 13 2.09 1 748 2.09 748 href=
## 14 2.06 1 980 2.06 980 money
## 15 2.02 1 7 2.02 7 <br>
If I have interpreted this correctly, on top we see that “wrote”, “url”, and “date” appear to signal ham while “internet”, “order”, and “remove” signal spam. For spam, “order” could signal money order (as “money” appears lower in the list), and “remov” could be part of “if you wish to be removed from this mailing list…”. So there is some intuitive plausibility to this result.
I have shown that emails can be classified as spam and ham using the tm and RTextTools packages, and that such classification is very effective on the data being used.