In this project, we implement a spam email classifier. We use emails from the SpamAssassin public corpus to train and test our model.
Here is a brief summary of the upcoming steps:
Load required libraries. We will not be using the tm package in this project.
library(tidyverse)
library(tidytext)
library(kableExtra)
library(gridExtra)
library(caret)
library(e1071)
library(klaR)

A helper function for displaying tables
To integrate the data described above into the project, download the .tar files and place the extracted folders in the working directory. If you inspect the extracted folders, you will notice a cmds file that contains commands rather than an email. We will exclude this file in R. There are two main steps in this stage:
Given a path to a folder and a file name, this function will extract the email body and ignore the header content.
get_email_body <- function(path, email) {
  # read the lines of the email file
  content <- read_lines(str_c(path, "/", email), locale = default_locale())
  # the first empty line marks the end of the header; keep everything from there to the end
  body <- content[which(content == "")[1]:length(content)]
  # combine the lines into a single string
  body <- paste(body, collapse = "\n")
  return(body)
}

Given a folder path and a label, this function returns a data frame containing the email file names, the label (“ham” or “spam”), and the body of each email.
get_emails <- function(path, label) {
  # list the email files in the given folder
  email_list <- list.files(path = path)
  # exclude the cmds file by name rather than by its position in the listing
  email_list <- email_list[email_list != "cmds"]
  # initialize a data frame with one email per row
  df <- tibble("email" = email_list, "label" = label)
  # extract each email body; vapply returns a character vector rather than a list column
  df$text <- vapply(df$email, function(x) get_email_body(path, x),
                    character(1), USE.NAMES = FALSE)
  return(df)
}

Call the functions above and load the different email types into R.
# call get_emails on each folder; each call returns a data frame of file names, labels and email bodies
easy_ham <- get_emails("./easy_ham", "ham")
easy_ham_2 <- get_emails("./easy_ham_2", "ham")
hard_ham <- get_emails("./hard_ham", "ham")
spam <- get_emails("./spam", "spam")
spam_2 <- get_emails("./spam_2", "spam")

At this stage, the data looks like this:
We set up a filter to get the most frequent words in a corpus. We arbitrarily choose to return the top 100 words to limit the size of the data we are working with. The filter removes words that contain digits or the _ character, and it can be extended with additional conditions to remove other words.
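get_freq_words <- function(corpus) {
  freq_words <- corpus %>%
    # one row per word
    unnest_tokens(word, text) %>%
    # drop words containing digits or underscores
    filter(str_detect(word, "[:digit:]|_", negate = TRUE)) %>%
    # drop common stop words
    anti_join(stop_words, by = "word") %>%
    count(word, sort = TRUE) %>%
    # keep the 100 most frequent words, ranked by the count column n
    top_n(100, n)
  return(freq_words)
}

Get the most frequent words from both corpora that will be used for training.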
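A minimal sketch of this step, under stated assumptions: the toy corpora `toy_ham` and `toy_spam` are made up, and `toy_freq_words` is a hypothetical wrapper that repeats the `get_freq_words` pipeline so the example runs on its own. The names `easy_ham.freqwords` and `spam.freqwords` in the closing comment are the ones the plotting code expects.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy stand-ins for the easy_ham and spam data frames (contents are made up)
toy_ham <- tibble(
  email = c("ham_1", "ham_2"),
  label = "ham",
  text  = c("Meeting date moved, see message 42",
            "Re: message about the mailing list")
)
toy_spam <- tibble(
  email = c("spam_1", "spam_2"),
  label = "spam",
  text  = c("<td><font>free offer</font></td>",
            "<table><tr><td>free money</td></tr></table>")
)

# Same steps as get_freq_words, repeated so the sketch is runnable alone
toy_freq_words <- function(corpus, n_top = 100) {
  corpus %>%
    unnest_tokens(word, text) %>%
    filter(str_detect(word, "[:digit:]|_", negate = TRUE)) %>%
    anti_join(stop_words, by = "word") %>%
    count(word, sort = TRUE) %>%
    slice_max(n, n = n_top)
}

toy_ham.freqwords  <- toy_freq_words(toy_ham)
toy_spam.freqwords <- toy_freq_words(toy_spam)

# With the real data, the equivalent calls would presumably be:
# easy_ham.freqwords <- get_freq_words(easy_ham)
# spam.freqwords     <- get_freq_words(spam)
```

Only the training corpora (easy_ham and spam) feed the vocabulary here, so the test sets do not influence which words become features.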
Transform corpus into long format and filter out words that do not appear often.
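The helper for this step, called later as `get_filtered_words(corpus, freq_words)`, is not defined in the text. The sketch below is a hypothetical reconstruction, not the author's code: `filtered_words_sketch`, `toy_corpus` and `toy_vocab` are illustrative names, and the output column order (email, label, word, n) is an assumption, so positional indexing such as `[,-1]` downstream may need adjusting.

```r
library(dplyr)
library(tidytext)

# Hypothetical reconstruction: tokenize each email, keep only the frequent
# words, and count occurrences per email (long format: one row per email/word)
filtered_words_sketch <- function(corpus, freq_words) {
  corpus %>%
    unnest_tokens(word, text) %>%
    semi_join(freq_words, by = "word") %>%
    count(email, label, word)
}

# Toy inputs (made up): two emails and a two-word vocabulary
toy_corpus <- tibble(
  email = c("ham_1", "ham_2"),
  label = "ham",
  text  = c("message message date", "date unrelated words")
)
toy_vocab <- tibble(word = c("message", "date"))

toy_long <- filtered_words_sketch(toy_corpus, toy_vocab)
toy_long
```

Words outside the vocabulary ("unrelated", "words") are dropped, which keeps the later wide table limited to the chosen feature columns.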
Below are the top 25 words for both the ham and spam categories. At first glance, there is no obvious way to distinguish between the two classes, but we can make a few observations. The majority of the top spam words are HTML tags and attributes such as href, tr, td and table, or font names. Among the ham words, we can spot words such as message and date, which are typical of email replies.
p1 <- easy_ham.freqwords %>%
  top_n(25, n) %>%
  ggplot(aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("word") + ylab("occurrences") + ggtitle("Top 25 Ham Words")
p2 <- spam.freqwords %>%
  top_n(25, n) %>%
  ggplot(aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("word") + ylab("occurrences") + ggtitle("Top 25 Spam Words")
# show the two plots side by side
grid.arrange(p1, p2, nrow = 1)

The function below takes a ham corpus and a spam corpus and returns a combined data frame with emails as rows and words as columns. This function is used for both the training and the testing corpora.
get_model_data <- function(corpus.ham, corpus.spam) {
  # get ham and spam filtered words
  model.ham <- get_filtered_words(corpus.ham, easy_ham.freqwords)
  model.spam <- get_filtered_words(corpus.spam, spam.freqwords)
  # combine the filtered ham and spam long-format data frames
  model.data <- rbind(model.ham, model.spam)
  # spread the word column into column headers and fill missing values with 0
  model.data <- model.data %>% spread(word, n, fill = 0)
  return(model.data)
}

Set up the training and testing data, and record the labels of both sets so that predictions can be compared against the known classes. The training and testing data are sparsely populated data frames with documents as rows, individual words as columns and numbers of occurrences as values.
training.data <- get_model_data(easy_ham, spam)
training.labels <- training.data$label
training.data <- training.data[,-1]
testing.data <- get_model_data(easy_ham_2, spam_2)
testing.labels <- testing.data$label
testing.data <- testing.data[,-1]

The training data set now resembles a document-term matrix in which each row represents a document (document name not shown). See the subset below:
| address | align | alsa | alt | apt | arial | background | bgcolor | blockquote | body | border |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 3 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 28 | 0 | 1 | 0 | 24 | 0 | 6 | 0 | 3 | 18 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
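As a toy illustration of how long-format counts become the wide layout above, the sketch below uses `tidyr::pivot_wider`, the current replacement for the `spread` call used in `get_model_data`. The data frame `toy_long` and its values are made up for the example.

```r
library(dplyr)
library(tidyr)

# Long format: one row per (email, word) pair with its count
toy_long <- tibble(
  email = c("ham_1", "ham_1", "spam_1"),
  label = c("ham", "ham", "spam"),
  word  = c("date", "message", "href"),
  n     = c(1, 2, 5)
)

# Wide format: one row per email, one column per word, 0 where a word is absent
toy_wide <- toy_long %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)

toy_wide
```

Because most emails contain only a few of the vocabulary words, most cells in the resulting table are zero, which is why the subset above looks so sparse.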
Call the naiveBayes function from the e1071 package to train the model. We pass it the data and the known spam or ham labels.
Make predictions based on the trained model and test data.
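A minimal sketch of these two steps on made-up word counts. The toy objects (`toy_train`, `toy_train_labels`, `toy_test`) are assumptions for illustration; the calls in the closing comments are the ones that appear inside the `model` function later in this document. Note that `naiveBayes` from e1071 treats numeric columns as Gaussian within each class.

```r
library(e1071)

# Toy word-count features (made up): rows are emails, columns are words
toy_train <- data.frame(
  href    = c(0, 1, 0, 1, 5, 6, 7, 8),
  message = c(3, 4, 2, 3, 0, 1, 0, 1)
)
toy_train_labels <- factor(c(rep("ham", 4), rep("spam", 4)))

# Two unseen emails: one HTML-heavy, one reply-like
toy_test <- data.frame(href = c(6, 0), message = c(0, 3))

# Train on the counts and labels, then predict a class for each test row
toy_classifier <- naiveBayes(toy_train, toy_train_labels)
toy_preds <- predict(toy_classifier, newdata = toy_test)
toy_preds

# On the project data, the corresponding calls are:
# email_classifier <- naiveBayes(training.data, factor(training.labels))
# preds <- predict(email_classifier, newdata = testing.data)
```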
The confusion matrix provides us with a lot of information. Let’s dissect it:
Evaluating our predictions against these definitions, we can say that while our model correctly classifies about 75% of the emails in the easy_ham_2 and spam_2 corpora, its ability to detect spam (sensitivity, with spam as the positive class) is only 0.5240. The model does well at correctly identifying ham (specificity), at 0.9792.
confusionMatrix(data = preds, reference = factor(testing.labels),
                positive = "spam", dnn = c("Prediction", "Actual"))

## Confusion Matrix and Statistics
##
##           Actual
## Prediction  ham spam
##       ham  1367  663
##       spam   29  730
##
## Accuracy : 0.7519
## 95% CI : (0.7354, 0.7678)
## No Information Rate : 0.5005
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5035
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5240
## Specificity : 0.9792
## Pos Pred Value : 0.9618
## Neg Pred Value : 0.6734
## Prevalence : 0.4995
## Detection Rate : 0.2617
## Detection Prevalence : 0.2721
## Balanced Accuracy : 0.7516
##
## 'Positive' Class : spam
##
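To make the link between the four cell counts and the reported statistics explicit, the base R sketch below recomputes the main metrics by hand. The counts are copied from the confusion matrix above; the variable names (`tp`, `fn`, `fp`, `tn`, `cm_metrics`) are illustrative only.

```r
# Counts copied from the confusion matrix above ("spam" is the positive class)
tp <- 730   # spam predicted as spam
fn <- 663   # spam predicted as ham
fp <- 29    # ham predicted as spam
tn <- 1367  # ham predicted as ham

cm_metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),
  sensitivity = tp / (tp + fn),   # share of spam that is caught
  specificity = tn / (tn + fp),   # share of ham that is kept as ham
  ppv         = tp / (tp + fp),   # Pos Pred Value
  npv         = tn / (tn + fn)    # Neg Pred Value
)

round(cm_metrics, 4)
```

The rounded values should agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value and Neg Pred Value lines in the output above. The low sensitivity comes from the 663 spam emails predicted as ham, compared with only 29 ham emails predicted as spam.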
All the functionality above is encapsulated in the model function which will be applied to the different datasets.
model <- function(train.ham, train.spam, test.ham, test.spam) {
  # get the training data and slice out the labels column
  training.data <- get_model_data(train.ham, train.spam)
  training.labels <- training.data$label
  training.data <- training.data[,-1]
  # get the testing data and slice out the labels column
  testing.data <- get_model_data(test.ham, test.spam)
  testing.labels <- testing.data$label
  testing.data <- testing.data[,-1]
  # train the classifier and predict on the testing data
  email_classifier <- naiveBayes(training.data, factor(training.labels))
  preds <- predict(email_classifier, newdata = testing.data)
  # extract results by name rather than by position
  cm <- confusionMatrix(data = preds, reference = factor(testing.labels),
                        positive = "spam", dnn = c("Prediction", "Actual"))
  accuracy <- cm$overall["Accuracy"]
  sensitivity <- cm$byClass["Sensitivity"]
  specificity <- cm$byClass["Specificity"]
  results <- c(accuracy, sensitivity, specificity)
  return(results)
}

We assemble the different testing data and summarize the results of the various models.
# set up data frame columns with the different training and testing data
train.ham <- c("easy_ham", "easy_ham", "easy_ham")
train.spam <- c("spam", "spam", "spam")
test.ham <- c("easy_ham", "easy_ham_2", "hard_ham")
test.spam <- c("spam", "spam_2", "spam_2")
# call the model function to predict on different testing data
m0 <- model(easy_ham, spam, easy_ham, spam)
m1 <- model(easy_ham, spam, easy_ham_2, spam_2)
m2 <- model(easy_ham, spam, hard_ham, spam_2)
# extract results
accuracy <- c(m0[1], m1[1], m2[1])
sensitivity <- c(m0[2], m1[2], m2[2])
specificity <- c(m0[3], m1[3], m2[3])
# create a data frame with the names of the training and testing sets, then add the prediction results
results <- tibble(train.ham, train.spam, test.ham, test.spam)
results <- cbind(results, accuracy, sensitivity, specificity)

The in-sample accuracy (first row) is about 90%. Across all three test sets, only about half of the spam emails are correctly classified (sensitivity between 0.48 and 0.52). This is not a great success rate and suggests room for improvement. We also notice that the model’s ability to correctly identify ham emails was heavily impacted by the hard_ham set, which brought specificity down from above 0.97 to 0.43. As a result, accuracy also degrades.
| train.ham | train.spam | test.ham | test.spam | accuracy | sensitivity | specificity |
|---|---|---|---|---|---|---|
| easy_ham | spam | easy_ham | spam | 0.9021849 | 0.4829659 | 0.9866721 |
| easy_ham | spam | easy_ham_2 | spam_2 | 0.7518824 | 0.5240488 | 0.9792264 |
| easy_ham | spam | hard_ham | spam_2 | 0.4729154 | 0.4802584 | 0.4320000 |
This simple Naive Bayes model does a decent job of identifying non-spam emails correctly but its performance is affected when the complexity of the ham emails is increased. The model’s ability to identify spam emails does not suffer much but it remains low. In order to improve this model and this project overall, we can consider the following: