Document Classification
Introduction
This project builds a classifier that predicts whether an email is spam or ham, using the Apache SpamAssassin public corpus as the dataset.
Data Acquisition
The first step is to download the files. I am selecting the most recent releases of the spam and easy ham datasets.
# Make sure the data directory exists before downloading
if (!dir.exists("data")) dir.create("data")
if (!file.exists("data/20050311_spam_2.tar.bz2") && !file.exists("data/20050311_spam_2.tar")){
  download.file("https://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2", "data/20050311_spam_2.tar.bz2")
}
if (!file.exists("data/20030228_easy_ham_2.tar.bz2") && !file.exists("data/20030228_easy_ham_2.tar")){
  download.file("https://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2", "data/20030228_easy_ham_2.tar.bz2")
}
Now that the files have been acquired, they need to be decompressed.
if (!dir.exists("data/spam_2")){
  R.utils::bunzip2("data/20050311_spam_2.tar.bz2", "data/20050311_spam_2.tar")
  untar("data/20050311_spam_2.tar", exdir = "data")
  # Delete the cmds file
  unlink("data/spam_2/cmds")
  # Delete these messages because they break the processing pipeline
  unlink("data/spam_2/00706.5116018237368c3633823b2d24f8ac86")
  unlink("data/spam_2/00708.89f1f9108884517148fdbd744e18ec1e")
  unlink("data/spam_2/00737.af5f503fe444ae773bfeb4652d122349")
  unlink("data/spam_2/01125.46ca779f86e1dd0a03c3ffc67b57f55e")
  unlink("data/spam_2/01217.d5a1734ec521c1bd55270eca3ab4acd8")
}
if (!dir.exists("data/easy_ham_2")){
  R.utils::bunzip2("data/20030228_easy_ham_2.tar.bz2", "data/20030228_easy_ham_2.tar")
  untar("data/20030228_easy_ham_2.tar", exdir = "data")
  # Delete the cmds file
  unlink("data/easy_ham_2/cmds")
}
Build a Corpus
Now that we have a dataset, we need to construct a corpus. I will use the tm and dplyr packages to get the job done.
library(dplyr)
library(tm)
get_corpus <- function(dir, label){
  corpus <- VCorpus(DirSource(dir)) %>%
    tm_map(PlainTextDocument) %>%               # Create plain text documents
    tm_map(content_transformer(tolower)) %>%    # Standardize case
    tm_map(removeWords, stopwords("SMART")) %>% # Remove stopwords
    tm_map(removePunctuation) %>%               # Remove punctuation marks
    tm_map(removeNumbers) %>%                   # Remove numbers
    tm_map(stripWhitespace) %>%                 # Remove extra whitespace
    tm_map(stemDocument)                        # Stem the documents
  meta(corpus, "LABEL") <- label
  return(corpus)
}
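As a quick sanity check, we can build a corpus from a single directory and confirm the label metadata was attached to every document. This snippet is illustrative and not required for the pipeline:
spam_corpus <- get_corpus("data/spam_2", "Spam")
# meta() on a corpus returns the indexed metadata as a data frame;
# every row should show LABEL == "Spam"
head(meta(spam_corpus))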
corpus <- c(get_corpus("data/spam_2", "Spam"), get_corpus("data/easy_ham_2", "Ham"))
Build a Document-Term Matrix
The next step is to construct a document-term matrix.
dtm <- DocumentTermMatrix(corpus)
There are 2791 documents and 71521 terms in the document-term matrix.
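These counts can be read straight off the matrix object; a minimal check:
dim(dtm)    # rows = documents, columns = terms
nDocs(dtm)  # 2791 documents
nTerms(dtm) # 71521 terms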
Remove Infrequent Terms
There are a lot of terms, resulting in a very sparse matrix. I am going to prune the less frequent terms from the matrix: a term must appear in at least 10 documents to be kept. removeSparseTerms() takes a maximum allowed sparsity, so the threshold is computed as 1 minus the minimum document fraction.
min_docs <- 10
dtm <- removeSparseTerms(dtm, 1 - (min_docs / length(corpus)))
There are still 2791 documents, but there are 4476 terms in the pruned document-term matrix.
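With the pruned matrix in hand, tm's findFreqTerms() offers a quick look at which terms survived. The cutoff of 500 below is an arbitrary illustrative choice:
# Terms occurring at least 500 times across the whole corpus
findFreqTerms(dtm, lowfreq = 500)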
Normalizing the Data
Some messages have more words than others, so raw counts are not directly comparable. We are going to normalize each document by its word count and pull everything together into a model dataset.
model_data <- as.matrix(dtm)
# Total word count per document
words <- rowSums(model_data)
# Divide each row by its word count to get term frequencies
# (assumes every document retains at least one term after pruning)
model_data <- model_data / words
model_data <- data.frame(model_data)
model_data <- cbind(meta(corpus), model_data) %>%
  mutate(LABEL = as.factor(LABEL))
Split Data into Training/Test Data Sets
We are now ready to break up the data into training and test sets. We are going to use a 75%/25% split. This decision was arbitrary.
library(caret)
set.seed(12345)
in_training_set <- createDataPartition(model_data$LABEL, p = 0.75, list = FALSE)
training_data <- model_data[in_training_set, ]
testing_data <- model_data[-in_training_set, ]
Build an SVM Model
We will take our training data and produce an SVM model that predicts whether a message is spam or ham.
library(e1071)
model <- svm(LABEL ~ ., data = training_data)
Test the Model
Now that we have trained a model, we can test it and see how it performs.
predictions <- testing_data %>%
  select(-LABEL) %>%
  predict(model, .)
Table 1. Confusion Matrix
| | Ham | Spam |
|---|---|---|
| Ham | 349 | 21 |
| Spam | 1 | 326 |
There were 22 misclassified messages out of 697, for an accuracy rate of about 97%. Not bad!
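The table and accuracy figure above can be reproduced with caret's confusionMatrix(); a minimal sketch, assuming the predictions and testing_data objects from the previous step:
cm <- confusionMatrix(predictions, testing_data$LABEL)
cm$table               # the counts shown in Table 1
cm$overall["Accuracy"] # overall accuracy, roughly 0.97
From here, tuning the SVM's cost and gamma parameters with e1071's tune.svm() would be a natural next step.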