The assigned task is to classify new “test” documents using already classified “training” documents. We could either use the spam/ham files suggested in the assignment or select our own documents (for example, from our own spam folders or from text scraped off the web). I chose the spam/ham files suggested in the assignment.
I downloaded the “20030228_easy_ham_2.tar.bz2” and “20050311_spam_2.tar.bz2” files from https://spamassassin.apache.org/publiccorpus. After unzipping the files, I saved the first file into a folder called “easy_ham_2” and the second file into a folder called “spam_2”, which were both located in my working directory. Next, I installed the “tm” package (https://cran.r-project.org/web/packages/tm/index.html) so I could input the documents into a text mining Corpus:
install.packages("tm", repos='http://cran.wustl.edu/')
library("tm")
I then created two Corpora, one for the spam documents and one for the ham documents, and added document-level metadata to identify each document as spam (ind = 1) or ham (ind = 0). Finally, I created a combined Corpus, ordered so that the training data and the test data would each contain roughly the same proportion of spam and ham:
# Read documents into Corpora
spam_corpus <- Corpus(DirSource("./spam_2", pattern = "[[:digit:]]"))
ham_corpus <- Corpus(DirSource("./easy_ham_2", pattern = "[[:digit:]]"))
# Add metadata indicator for spam or not spam
meta(spam_corpus, "ind") <- 1
meta(ham_corpus, "ind") <- 0
# Create combined corpus ordered for RTextTools
spamham_corpus <- c(ham_corpus[1:1120], spam_corpus[1:1117],
                    ham_corpus[1121:1400], spam_corpus[1118:1396])
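As a quick sanity check, the size of the combined Corpus and its spam/ham split can be verified against the indices used above (a sketch, assuming the "ind" tag is stored as document-level indexed metadata as shown):
# Confirm the combined corpus contains all 2796 documents
length(spamham_corpus)
# Tally the metadata: expect 1400 ham (0) and 1396 spam (1)
table(unlist(meta(spamham_corpus, "ind")))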
I used the “RTextTools” package (https://cran.r-project.org/web/packages/RTextTools/index.html) to perform supervised classification:
install.packages("RTextTools",repos='http://cran.wustl.edu/')
library(RTextTools)
First I created a Document-Term Matrix from the Corpus and eliminated sparse terms (terms appearing in fewer than 10 documents):
# Create Document-Term Matrix
dtm <- DocumentTermMatrix(spamham_corpus)
# Remove sparse terms
dtm <- removeSparseTerms(dtm, 1-(10/length(spamham_corpus)))
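As a quick check of how much pruning this does, the dimensions of the matrix (documents by remaining terms) can be inspected; the exact term count will depend on the corpus:
# Rows are documents, columns are the terms that survived pruning
dim(dtm)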
Then I created spam vs. ham labels using the metadata from the combined Corpus; these labels are referenced when building the container object that RTextTools uses for classification. Next I created the container, trained three models (support vector machines, a classification tree, and maximum entropy), and used those models to classify the test data:
# Create labels to use when creating container
spam_labels <- unlist(meta(spamham_corpus))
# Create Container used by RTextTools package to execute estimation procedures
container <- create_container(dtm, labels = spam_labels, trainSize = 1:2237, testSize = 2238:2796, virgin = FALSE)
# Train models
svm_model <- train_model(container, "SVM")
tree_model <- train_model(container, "TREE")
maxent_model <- train_model(container, "MAXENT")
# Run classifications
svm_out <- classify_model(container, svm_model)
tree_out <- classify_model(container, tree_model)
maxent_out <- classify_model(container, maxent_model)
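Each classify_model() call returns a data frame with two columns per document: the predicted label and the probability attached to it (the columns are named for the algorithm, e.g. SVM_LABEL and SVM_PROB). A quick peek confirms the structure:
# Inspect the first few SVM predictions: predicted label and probability
head(svm_out)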
I created a data frame containing the true classification labels for the test data set as well as the classifications estimated by each model. I then compared each model's results to the true labels to see how accurate they were:
# Data frame containing true classifications and classification
# estimated by each model for all of the test documents
labels_out <- data.frame(correct_label = spam_labels[2238:2796],
                         svm = svm_out[, 1],
                         tree = tree_out[, 1],
                         maxent = maxent_out[, 1],
                         stringsAsFactors = FALSE)
# See how support vector machines performed using counts and proportions
table(labels_out[, 1] == labels_out[, 2])
##
## FALSE TRUE
## 42 517
prop.table(table(labels_out[, 1] == labels_out[, 2]))
##
## FALSE TRUE
## 0.07513417 0.92486583
# See how the classification tree performed using counts and proportions
table(labels_out[, 1] == labels_out[, 3])
##
## FALSE TRUE
## 149 410
prop.table(table(labels_out[, 1] == labels_out[, 3]))
##
## FALSE TRUE
## 0.2665474 0.7334526
# See how maximum entropy performed using counts and proportions
table(labels_out[, 1] == labels_out[, 4])
##
## FALSE TRUE
## 41 518
prop.table(table(labels_out[, 1] == labels_out[, 4]))
##
## FALSE TRUE
## 0.07334526 0.92665474
The support vector machine and maximum entropy models performed almost identically (92.5% and 92.7% accuracy, respectively), and both were far more accurate than the classification tree model (73.3%).
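Raw accuracy also hides the direction of the errors (ham flagged as spam versus spam let through). RTextTools provides create_analytics() for per-algorithm precision, recall, and F-scores; a hedged sketch, reusing the container and classification results from above:
# Summarize precision, recall, and F-scores for all three models
analytics <- create_analytics(container, cbind(svm_out, tree_out, maxent_out))
summary(analytics)
# Confusion matrix for the SVM, showing which way its errors go
table(true = labels_out$correct_label, predicted = labels_out$svm)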