It can be useful to be able to classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.
For this project, you can start with a spam/ham dataset, then predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). One example corpus: https://spamassassin.apache.org/publiccorpus/
library(tm)
library(tidyverse)
library(stringr)
library(wordcloud)
library(RTextTools)
library(knitr)
library(kableExtra)
First, I used VCorpus() (a volatile corpus) with DirSource() to read every file in each directory into R, so that the spam and ham messages can be analyzed.
easy_ham <- VCorpus(DirSource("C:/Users/manda/OneDrive/Documents/easy_ham"))
easy_spam <- VCorpus(DirSource("C:/Users/manda/OneDrive/Documents/easy_spam"))
Here we add a "type" metadata tag to every document, marking it as spam or ham, and then combine the two corpora.
meta(easy_spam, tag = "type") <- "spam"
meta(easy_ham, tag = "type") <- "ham"
easy_comb <- c(easy_spam, easy_ham)
In this step we clean the text. Our goal is to normalize the encoding, convert everything to lowercase, and remove numbers, stopwords, punctuation, and extra white space.
easy_comb <- tm_map(easy_comb, content_transformer(function(x) iconv(x, "UTF-8", sub="byte")))
easy_comb <- tm_map(easy_comb, content_transformer(tolower))
easy_comb <- tm_map(easy_comb, removeNumbers)
easy_comb <- tm_map(easy_comb, removeWords, stopwords("english"))
easy_comb <- tm_map(easy_comb, removePunctuation)
easy_comb <- tm_map(easy_comb, stripWhitespace)
We then arrange the data into a document-term matrix.
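Before moving on, the effect of the cleaning steps above can be illustrated on a single made-up line of text. The sketch below is a base-R approximation, not what tm does internally: the helper `clean_text` and its tiny stopword list are hypothetical, and tm's stopwords("english") list is much longer.

```r
# Base-R approximation of the cleaning steps (hypothetical helper, not tm's code).
clean_text <- function(x, stop_words = c("the", "a", "is", "to", "of", "and")) {
  x <- tolower(x)                                        # lowercase
  x <- gsub("[0-9]+", "", x, perl = TRUE)                # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)                 # remove whole-word stopwords
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)          # remove punctuation
  x <- gsub("\\s+", " ", x, perl = TRUE)                 # collapse runs of whitespace
  trimws(x)                                              # extra step: trim the ends
}

cleaned <- clean_text("WIN $1000 today!!! This is the BEST offer of 2002.")
cleaned
```

The order mirrors the tm_map() calls: stopwords are removed before punctuation, so a stopword glued to punctuation is still matched on its word boundary.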
Building a document-term matrix and inspecting it
dtm <- DocumentTermMatrix(easy_comb)
m <- as.matrix(dtm)
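# Note: rowSums() totals the terms in each document, so the table below lists
# documents (file names) rather than words; per-term totals come from colSums(), used further below.
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
## word freq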
## 1 20050311_spam_2.tar 491214
## 2 00677.b957e34b4dd0d9263b56bf71b1168d8a 7752
## 3 00670.be029e37187b8615a231865e3663dcf9 7717
## 4 01083.a6b3c50be5abf782b585995d2c11176b 6436
## 5 cmds 4999
## 6 00570.d98ca90ac201b5d881f2397c95838eb2 3734
## 7 00942.727cb1619115cdee240fa418da19dd1f 3226
## 8 00765.ea01c46568902b1338c9685b55d77f6c 3155
## 9 00265.d0ebd6ba8f3e2b8d71e9cdaa2ec6fd91 3054
## 10 01094.91779ec04e5e6b27e84297c28fc7369f 2974
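The entries in the table above are file names because, in a document-term matrix, rows are documents and columns are terms. A small sketch on a made-up matrix (the document names, terms, and counts are invented for illustration) shows the difference between the two sums:

```r
# Toy document-term matrix: rows are documents, columns are terms.
m_toy <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    1, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs  = c("doc_a", "doc_b", "doc_c"),
                  Terms = c("free", "meeting", "received"))
)

doc_totals  <- sort(rowSums(m_toy), decreasing = TRUE)  # total term count per document
term_totals <- sort(colSums(m_toy), decreasing = TRUE)  # total count per term

doc_totals
term_totals
```

Here `doc_totals` ranks documents by length, while `term_totals` ranks words by how often they occur across the corpus, which is the quantity a word-frequency table needs.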
At this point, the data is cleaned further by eliminating sparse terms, that is, terms that appear in only a handful of documents (here, roughly 10 or fewer).
dtm <- removeSparseTerms(dtm, 1-(10/length(easy_comb)))
dtm
## <<DocumentTermMatrix (documents: 3470, terms: 6104)>>
## Non-/sparse entries: 429232/20751648
## Sparsity : 98%
## Maximal term length: 73
## Weighting : term frequency (tf)
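The sparsity argument is easier to read once it is converted back into a document count. The sketch below does that arithmetic and then mimics the idea on a made-up matrix. It assumes, as I understand tm's removeSparseTerms(), that the filter looks at the share of documents in which a term is absent; the exact handling of the boundary case is an assumption, and the toy threshold `min_docs_toy` is hypothetical.

```r
n_docs   <- 3470                     # documents in the corpus (from the output above)
sparse   <- 1 - (10 / n_docs)        # value passed to removeSparseTerms()
min_docs <- n_docs * (1 - sparse)    # about 10 documents

# Toy example: keep terms that occur in more than `min_docs_toy` documents.
m_toy <- matrix(
  c(1, 0, 2, 0,
    3, 0, 1, 0,
    1, 1, 0, 0,
    2, 0, 1, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("received", "viagra", "date", "unsubscribe"))
)
doc_freq     <- colSums(m_toy > 0)   # number of documents containing each term
min_docs_toy <- 1
kept <- names(doc_freq)[doc_freq > min_docs_toy]
kept
```

In the toy data, "unsubscribe" occurs five times but in a single document, so it is dropped: the filter works on how many documents contain a term, not on the term's total count.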
In this case, we take a peek at which terms are used most frequently.
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=T)
table_freq <- head(frequency, 15)
kable(table_freq, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T)

| | x |
|---|---|
| received | 23947 |
| esmtp | 13590 |
| localhost | 10121 |
| sep | 9436 |
| jul | 8112 |
| font | 7649 |
| widthd | 6870 |
| mon | 6732 |
| jmlocalhost | 6166 |
| | 6084 |
| postfix | 5981 |
| table | 5980 |
| thu | 5922 |
| will | 5902 |
| date | 5876 |
wordfreq <- data.frame(word=names(frequency), frequency=frequency)
p <- ggplot(subset(wordfreq, frequency>2000), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill='#35a2c4') +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(panel.background = element_rect(fill = '#adc8d1'))
p
The metadata was then tabulated: 2245 emails are classified as ham and 1225 as spam.
meta_type <- as.vector(unlist(meta(easy_comb)))
meta_data <- data.frame(type = unlist(meta_type))
table(meta_data)
## meta_data
## ham spam
## 2245 1225
Furthermore, we create a container using the create_container() function from RTextTools. The first 2727 documents are used for training and the remaining 743 for testing.
N <- length(meta_type)
container <- create_container(dtm,
labels = meta_type,
trainSize = 1:2727,
testSize = 2728:N,
virgin = F)
The result is a matrix_container object. It contains a set of objects that are used by the estimation procedures of the supervised learning methods.
slotNames(container)
## [1] "training_matrix" "classification_matrix" "training_codes"
## [4] "testing_codes" "column_names" "virgin"
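One thing worth checking with a sequential split is which classes end up in the test rows. The sketch below uses only the class counts reported earlier and assumes the documents keep the order of c(easy_spam, easy_ham), with spam first. Under that assumption, rows 2728 to N hold ham only. The shuffled alternative is a suggestion, not part of the original analysis, and the seed is arbitrary.

```r
n_spam <- 1225
n_ham  <- 2245
labels <- c(rep("spam", n_spam), rep("ham", n_ham))  # assumed order: spam, then ham
N <- length(labels)

table(labels[2728:N])       # sequential split: class mix of the test rows

set.seed(42)                # arbitrary seed, for reproducibility only
shuffled  <- sample(N)      # random permutation of the document indices
train_idx <- shuffled[1:2727]
test_idx  <- shuffled[2728:N]
table(labels[test_idx])     # shuffled split: class mix of the test rows
```

One way to apply this would be to reorder both the matrix and the labels by `shuffled` (for example dtm[shuffled, ] and meta_type[shuffled]) before calling create_container(), keeping trainSize and testSize as they are. I have not run this against the original corpus, so treat it as a sketch.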
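For this portion, we use the train_model() function on the training data to fit three models: a support vector machine (SVM), a decision tree (TREE), and maximum entropy (MAXENT).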
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
We then use each model to estimate whether an email in our test dataset is spam or ham.
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
To look at the outcome, the first rows of the three models' output are combined into a single data frame that holds each model's predicted label and the estimated probability of that classification.
model_results <- data.frame(head(svm_out), head(tree_out), head(maxent_out) )
kable(model_results, "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)

| SVM_LABEL | SVM_PROB | TREE_LABEL | TREE_PROB | MAXENTROPY_LABEL | MAXENTROPY_PROB |
|---|---|---|---|---|---|
| ham | 0.9990225 | ham | 1 | ham | 0.9999987 |
| ham | 0.9974185 | ham | 1 | ham | 0.9999718 |
| ham | 0.7451459 | ham | 1 | ham | 0.9625643 |
| ham | 0.9735851 | ham | 1 | ham | 0.9995415 |
| ham | 0.9998221 | ham | 1 | ham | 1.0000000 |
| ham | 0.9711250 | ham | 1 | ham | 0.9992097 |
Because this is supervised learning, we know the correct labels for the test documents. We can use them to see how accurately each algorithm classified the documents.
labels_out <- data.frame(
correct_label = meta_type[2728:N],
svm = as.character(svm_out[,1]),
tree = as.character(tree_out[,1]),
maxent = as.character(maxent_out[,1]),
  stringsAsFactors = F)
# SVM: number and proportion of correct (TRUE) and incorrect (FALSE) predictions
table(labels_out[,1] == labels_out[,2])
##
## FALSE TRUE
## 31 712
prop.table(table(labels_out[,1] == labels_out[,2]))
##
## FALSE TRUE
## 0.04172275 0.95827725
# Decision tree
table(labels_out[,1] == labels_out[,3])
##
## FALSE TRUE
## 40 703
prop.table(table(labels_out[,1] == labels_out[,3]))
##
## FALSE TRUE
## 0.0538358 0.9461642
# Maximum entropy
table(labels_out[,1] == labels_out[,4])
##
## FALSE TRUE
## 31 712
prop.table(table(labels_out[,1] == labels_out[,4]))
##
## FALSE TRUE
## 0.04172275 0.95827725
dfdata <- data.frame(table(labels_out[,1] == labels_out[,2]),
table(labels_out[,1] == labels_out[,3]),
table(labels_out[,1] == labels_out[,4])
)
# The second model is RTextTools' "TREE" (a single decision tree), so it is labeled Tree here.
colnames(dfdata) <- c("SVM","Freq", "Tree", "Freq", "Max Entropy", "Freq")
kable(dfdata, "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T)

| SVM | Freq | Tree | Freq | Max Entropy | Freq |
|---|---|---|---|---|---|
| FALSE | 31 | FALSE | 40 | FALSE | 31 |
| TRUE | 712 | TRUE | 703 | TRUE | 712 |
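The TRUE/FALSE counts above translate into accuracy as correct predictions divided by total test documents. The sketch below shows that calculation with a hypothetical helper, `accuracy`, on made-up labels, adds a confusion matrix (which separates the two kinds of error), and then recomputes the accuracies implied by the reported counts.

```r
# Hypothetical helper: share of predictions that match the true labels.
accuracy <- function(actual, predicted) mean(actual == predicted)

actual    <- c("ham", "ham",  "spam", "spam", "ham", "spam")  # made-up labels
predicted <- c("ham", "spam", "spam", "spam", "ham", "ham")

acc  <- accuracy(actual, predicted)
conf <- table(actual = actual, predicted = predicted)  # confusion matrix
acc
conf

# Accuracies implied by the counts reported above (correct / total test documents)
reported <- c(SVM = 712, TREE = 703, MAXENT = 712) / 743
round(reported, 4)
```

A confusion matrix is more informative than a single accuracy figure when the two classes are unevenly represented, because it shows ham flagged as spam separately from spam that slipped through.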
From the results, we can tell that the SVM and Maximum Entropy classifiers performed equally well, each labeling 712 of the 743 test documents correctly (about 95.8%). The decision tree model (TREE) was slightly worse, with 703 correct (about 94.6%).
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining