Project 4 - Email Classification

In this project, we implement a spam email classifier. We use emails from the SpamAssassin public corpus to train and test our model.

Here is a brief summary of the upcoming steps:

Load and process the data in R in order to only extract the body of each email
Process the data by removing words of no interest like stop words or numbers
Reduce the size of the data by filtering out words that do not occur often
Assemble training and testing data
Train a Naive Bayes model to classify the testing data as either ham or spam
Review performance

Setup

Load required libraries. We will not be using the tm package in this project.

library(tidyverse)
library(tidytext)
library(kableExtra)
library(gridExtra)
library(caret)
library(e1071)
library(klaR)

A helper function for displaying tables

showtable <- function(data, title="") {
  kable(data, caption = title) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), latex_options = "scale_down") 
}

Data Import and Pre-Processing

To integrate the data described above into the project, download the .tar files and place the extracted folders in the working directory. If you inspect the extracted folders, you will notice a cmds file that contains commands rather than an email. We exclude this file in R. There are two main steps in this stage:

Construct a list of email file names and load the email content into R.
For each email, identify the email body and ignore the header content

Given a path to a folder and a file name, this function will extract the email body and ignore the header content.

get_email_body <- function(path, email) {
  # read the lines of the email file
  content <- read_lines(file.path(path, email), locale = default_locale())
  # nothing to extract from an empty file
  if (length(content) == 0) return("")
  # the first empty line separates the header from the body
  start <- which(content == "")[1]
  # if no empty line is found, keep the whole message rather than fail
  if (is.na(start)) start <- 1
  # keep everything from the separator to the end and collapse it into a single string
  body <- paste(content[start:length(content)], collapse = "\n")
  return(body)
}
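To make the header/body split concrete, here is a minimal sketch on an in-memory message. The message lines are made up for illustration and do not come from the corpus; the logic mirrors the "first empty line" rule described above, using base R only.

```r
# A made-up message: header lines, a blank separator line, then the body.
raw_lines <- c(
  "From: alice@example.com",
  "Subject: Lunch?",
  "",
  "Are you free at noon?",
  "",
  "Alice"
)

# Index of the first empty line, which marks the end of the header.
first_blank <- which(raw_lines == "")[1]

# Keep everything from that separator to the end and collapse it into one string.
body_lines <- raw_lines[first_blank:length(raw_lines)]
body_text <- paste(body_lines, collapse = "\n")

cat(body_text)
```

Only the first empty line is treated as the separator, so later blank lines inside the body are preserved.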

Given a folder path and a label, this function returns a data frame containing the email file names, the label “ham” or “spam”, and the body of each email.

get_emails <- function(path, label) {
  # list the email files in the given folder
  email_list <- list.files(path = path)
  # exclude the cmds file by name, since it holds commands rather than an email
  email_list <- setdiff(email_list, "cmds")
  # initialize a data frame with one row per email: file name and class label
  df <- tibble("email" = email_list, "label" = label)
  # extract the body of each email as a character string
  df$text <- vapply(df$email, function(x) get_email_body(path, x),
                    character(1), USE.NAMES = FALSE)
  return(df)
}

Call the functions above and load the different email types into R.

# call get_emails on each folder to build a data frame of file names, labels and email bodies
easy_ham <- get_emails("./easy_ham", "ham")
easy_ham_2 <- get_emails("./easy_ham_2", "ham")
hard_ham <- get_emails("./hard_ham", "ham")
spam <- get_emails("./spam", "spam")
spam_2 <- get_emails("./spam_2", "spam")

At this stage, the data looks like this: easy_ham

Data Processing

We set up a filter to get the most frequent words in a corpus. We arbitrarily choose to return the top 100 words to limit the size of the data we are working with. The filter eliminates words that contain digits or the _ character, and it can be extended with additional conditions to exclude other words.

# Import `stop_words` to remove common words that usually are not significant.
data(stop_words)

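get_freq_words <- function(corpus) {
  freq_words <- corpus %>%
    unnest_tokens(word, text) %>%
    # drop tokens that contain a digit or an underscore
    filter(str_detect(word, "[:digit:]|_", negate = TRUE)) %>%
    # drop stop words, naming the join column explicitly
    anti_join(stop_words, by = "word") %>%
    count(word, sort = TRUE) %>%
    # keep the 100 most frequent words (ties are retained)
    top_n(100, n)
  return(freq_words)
}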
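The digit and underscore filter can be illustrated on a handful of tokens. This is a sketch that uses base R's grepl as a stand-in for str_detect so that it runs without any packages; the token list is hypothetical.

```r
# Hypothetical tokens, as they might come out of the tokenization step.
tokens <- c("free", "2002", "click", "content_type", "b2b", "money")

# TRUE for tokens that contain a digit or an underscore.
has_digit_or_underscore <- grepl("[[:digit:]]|_", tokens)

# Keep only the tokens that pass the filter.
kept <- tokens[!has_digit_or_underscore]
print(kept)
```

Tokens such as "2002", "content_type" and "b2b" are dropped, while plain words are kept.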

Get the most frequent words from both corpora that will be used for training.

easy_ham.freqwords <- get_freq_words(easy_ham)
spam.freqwords <- get_freq_words(spam)

Transform corpus into long format and filter out words that do not appear often.

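get_filtered_words <- function(corpus, freq) {
  data <- corpus %>% 
    unnest_tokens(word, text) %>%
    filter(str_detect(word, "[:digit:]|_", negate = TRUE)) %>%
    anti_join(stop_words, by = "word") %>%
    # keep only the words found in the list of frequent words
    filter(word %in% freq$word) %>%
    # one row per email and word, with the number of occurrences
    count(email, label, word, sort = TRUE)
  return(data)
}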

Exploratory Analysis

Below are the top 25 words for the ham and spam categories. At first glance, no single word cleanly separates the two classes, but we can make a few observations. The majority of the top spam words are HTML elements or attributes such as href, tr, td and table, or fonts. Among the ham words, we can spot words like message and date, which are typical of email replies.

p1 <- easy_ham.freqwords %>% 
  top_n(25) %>%
  ggplot(aes(x = reorder(word, -n), y = n)) + 
  geom_bar(stat = "identity") + 
  coord_flip() +
  xlab("word") + ylab("occurrences") + ggtitle("Top 25 Ham Words")

p2 <- spam.freqwords %>% 
  top_n(25) %>%
  ggplot(aes(x = reorder(word, -n), y = n)) + 
  geom_bar(stat = "identity") + 
  coord_flip() +
  xlab("word") + ylab("occurrences") + ggtitle("Top 25 Spam Words")

grid.arrange(p1, p2, nrow = 1)

Data Transformation & Modeling

The function below is passed a ham and spam corpus and returns a combined data frame with emails as rows and words as columns. This function is used for both training and testing corpora.

get_model_data <- function(corpus.ham, corpus.spam) {
  # get ham and spam filtered words
  model.ham <- get_filtered_words(corpus.ham, easy_ham.freqwords)
  model.spam <- get_filtered_words(corpus.spam, spam.freqwords)
  # combine filtered ham and spam long format data frames
  model.data <- rbind(model.ham, model.spam)
  # spread the word column into column headers and fill missing values with 0
  model.data <- model.data %>% spread(word, n, fill = 0)
  return(model.data)
}
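The long-to-wide step can be sketched on a toy example. The version below uses base R matrix indexing rather than spread so that it runs without tidyr; the emails, words and counts are made up for illustration.

```r
# Hypothetical long-format counts: one row per (email, word) pair.
long <- data.frame(
  email = c("a.txt", "a.txt", "b.txt", "c.txt"),
  label = c("ham", "ham", "ham", "spam"),
  word  = c("date", "list", "date", "click"),
  n     = c(2, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Rows are emails, columns are words, and every cell starts at 0.
emails <- unique(long$email)
words  <- sort(unique(long$word))
dtm <- matrix(0, nrow = length(emails), ncol = length(words),
              dimnames = list(emails, words))

# Fill in the observed counts by (row name, column name) pairs.
dtm[cbind(long$email, long$word)] <- long$n
print(dtm)
```

Any email and word pair that never occurs keeps its initial value of 0, which plays the same role as fill = 0 in the spread call.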

Set up the training and testing data, and record the labels of both sets so that the predictions can later be compared against the known labels. The training and testing data will be sparsely populated data frames with documents as rows, individual words as columns and number of occurrences as values.

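training.data <- get_model_data(easy_ham, spam)
training.labels <- training.data$label
# drop the first two columns (email file name and label) so that only word counts remain as predictors
training.data <- training.data[, -(1:2)]

testing.data <- get_model_data(easy_ham_2, spam_2)
testing.labels <- testing.data$label
# drop the email file name and label columns here as well
testing.data <- testing.data[, -(1:2)]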

The training data set now resembles a document-term matrix where each row represents a document (document name not shown). See subset below:

showtable(training.data[1:17, 1:11])


address	align	alt	arial	bgcolor	body	border
0	8	0	2	0	2	3
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
1	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	0	0	0	0	0	0
0	28	1	24	6	3	18
1	0	0	0	0	0	0

Call the naiveBayes function from the e1071 package to train the model. We pass it the data and the known spam or ham labels.

email_classifier <- naiveBayes(training.data, factor(training.labels))
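For numeric predictors such as word counts, the e1071 documentation states that naiveBayes assumes a Gaussian distribution of each predictor given the class. The sketch below reproduces that idea by hand in base R on made-up counts. It is a simplified illustration and not the package's exact implementation (for example, it ignores the thresholding that e1071 applies to very small probabilities); the data and the score_class helper are hypothetical.

```r
# Toy word counts (hypothetical): two predictors, three emails per class.
x <- data.frame(
  click = c(0, 1, 0, 4, 6, 5),
  date  = c(3, 2, 4, 0, 1, 0)
)
y <- factor(c("ham", "ham", "ham", "spam", "spam", "spam"))

# Log-posterior (up to an additive constant) of one new email for a given class.
score_class <- function(new_row, cls) {
  rows <- x[y == cls, , drop = FALSE]
  # class prior: share of training emails in this class
  log_prior <- log(mean(y == cls))
  # per-word Gaussian log-densities, summed under the independence assumption
  log_lik <- sum(dnorm(unlist(new_row),
                       mean = colMeans(rows),
                       sd = apply(rows, 2, sd),
                       log = TRUE))
  log_prior + log_lik
}

# Score a new email against both classes and pick the larger score.
new_email <- data.frame(click = 5, date = 0)
scores <- sapply(levels(y), function(cls) score_class(new_email, cls))
predicted <- names(which.max(scores))
print(predicted)
```

Working with log-densities avoids numerical underflow when many word columns are multiplied together.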

Make predictions based on the trained model and test data.

preds <- predict(email_classifier, newdata=testing.data)

The confusion matrix provides us with a lot of information. Let’s dissect it:

The accuracy is about 75%. This number is the count of correctly classified emails (true positives plus true negatives) divided by the total number of emails.
The sensitivity measures a test’s ability to identify positive results. It is also referred to as power, true positive rate, recall, or probability of detection, and is computed as 1 - false negative rate (beta).
The specificity measures a test’s ability to identify negative results. It is also called the true negative rate and is computed as 1 - false positive rate (alpha).
In this case, the positive class is spam as those are the emails we are trying to detect.

Evaluating our predictions based on the descriptions above, we can say that while our model will correctly classify 75% of the easy_ham_2 and spam_2 corpora, its ability to detect positive results (sensitivity) is only 0.5240. The model does well in correctly identifying negative results (specificity) at 0.9792.

confusionMatrix(data = preds, reference = factor(testing.labels),
                positive = "spam", dnn = c("Prediction", "Actual"))

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  ham spam
##       ham  1367  663
##       spam   29  730
##                                           
##                Accuracy : 0.7519          
##                  95% CI : (0.7354, 0.7678)
##     No Information Rate : 0.5005          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5035          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5240          
##             Specificity : 0.9792          
##          Pos Pred Value : 0.9618          
##          Neg Pred Value : 0.6734          
##              Prevalence : 0.4995          
##          Detection Rate : 0.2617          
##    Detection Prevalence : 0.2721          
##       Balanced Accuracy : 0.7516          
##                                           
##        'Positive' Class : spam            
##
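As a check on the definitions, the headline metrics can be recomputed by hand from the four counts in the confusion matrix above, with spam as the positive class. The variable names are chosen for illustration only.

```r
# Counts read off the confusion matrix (positive class: spam).
tp <- 730   # spam predicted as spam
fn <- 663   # spam predicted as ham
fp <- 29    # ham predicted as spam
tn <- 1367  # ham predicted as ham

# Share of all emails that were classified correctly.
accuracy <- (tp + tn) / (tp + tn + fp + fn)

# True positive rate: share of spam emails that were detected.
sensitivity <- tp / (tp + fn)

# True negative rate: share of ham emails that were left alone.
specificity <- tn / (tn + fp)

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

The rounded values agree with the Accuracy, Sensitivity and Specificity lines reported by confusionMatrix.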

All the functionality above is encapsulated in the model function which will be applied to the different datasets.

model <- function(train.ham, train.spam, test.ham, test.spam) {
  # get the training data, record the labels, and drop the email and label columns
  training.data <- get_model_data(train.ham, train.spam)
  training.labels <- training.data$label
  training.data <- training.data[, -(1:2)]
  # get the testing data, record the labels, and drop the email and label columns
  testing.data <- get_model_data(test.ham, test.spam)
  testing.labels <- testing.data$label
  testing.data <- testing.data[, -(1:2)]
  # train the model and predict on the testing data
  email_classifier <- naiveBayes(training.data, factor(training.labels))
  preds <- predict(email_classifier, newdata = testing.data)
  # build the confusion matrix and extract the results by name
  cm <- confusionMatrix(data = preds, reference = factor(testing.labels),
                        positive = "spam", dnn = c("Prediction", "Actual"))
  accuracy <- cm$overall["Accuracy"]
  sensitivity <- cm$byClass["Sensitivity"]
  specificity <- cm$byClass["Specificity"]

  results <- c(accuracy, sensitivity, specificity)
  return(results)
}

Results and Analysis

We assemble the different testing data and summarize the results of the various models.

# set up data frame columns with the different training and testing data
train.ham <- c("easy_ham", "easy_ham", "easy_ham")
train.spam <- c("spam", "spam", "spam")
test.ham <- c("easy_ham", "easy_ham_2", "hard_ham")
test.spam <- c("spam", "spam_2", "spam_2")
# call the model function to predict on different testing data
m0 <- model(easy_ham, spam, easy_ham, spam)
m1 <- model(easy_ham, spam, easy_ham_2, spam_2)
m2 <- model(easy_ham, spam, hard_ham, spam_2)
# extract results
accuracy <- c(m0[1], m1[1], m2[1])
sensitivity <- c(m0[2], m1[2], m2[2])
specificity <- c(m0[3], m1[3], m2[3])
# create data frame add the names of the training and testing sets as well as the prediction results 
results <- tibble(train.ham, train.spam, test.ham, test.spam)
results <- cbind(results, accuracy, sensitivity, specificity)

The in-sample accuracy (first row) was about 90%. These results show that spam emails are correctly classified only about 50% of the time across datasets. This is not a great success rate and suggests room for improvement. We also notice that the model’s ability to correctly identify ham emails was heavily impacted by the hard_ham set, which brought the specificity down from greater than 0.97 to 0.43. As a result, accuracy also degrades, to about 0.47.

showtable(results, "Testing Results")

Testing Results
train.ham	train.spam	test.ham	test.spam	accuracy	sensitivity	specificity
easy_ham	spam	easy_ham	spam	0.9021849	0.4829659	0.9866721
easy_ham	spam	easy_ham_2	spam_2	0.7518824	0.5240488	0.9792264
easy_ham	spam	hard_ham	spam_2	0.4729154	0.4802584	0.4320000

Conclusion

This simple Naive Bayes model does a decent job of identifying non-spam emails correctly but its performance is affected when the complexity of the ham emails is increased. The model’s ability to identify spam emails does not suffer much but it remains low. In order to improve this model and this project overall, we can consider the following:

Additional filtering of training data (stemming, remove punctuation, etc.)
Equalizing the number of training instances by sampling so that spam and ham emails are represented equally.
Changing the bound on the most frequent words from 100 to more or fewer words. This will impact the complexity of the model and the run time.
Cross-validation of training data.
Using different models for performance comparisons (logistic regression, SVM, random forest, etc.).
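As one illustration of the class-balancing idea listed above, the sketch below downsamples the larger class so that both classes contribute the same number of training instances. The label vector and the seed are hypothetical, and this is only one of several possible balancing strategies.

```r
# Arbitrary seed so that the sampled rows are reproducible.
set.seed(607)

# Hypothetical, imbalanced training labels: 10 ham and 4 spam.
labels <- c(rep("ham", 10), rep("spam", 4))

# Size of the smallest class, used as the per-class sample size.
n_min <- min(table(labels))

# For each class, draw n_min row indices without replacement.
keep <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

# The same indices would be used to subset the rows of the training data.
balanced <- labels[keep]
print(table(balanced))
```

Downsampling discards some ham emails, so it trades training data for balance; sampling the smaller class with replacement would be the alternative.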

data607-illien-project4

Mael Illien

10/23/2019