Project 4 - Email Classification

In this project, we implement a spam email classifier. We use emails from the SpamAssassin public corpus to train and test our model.

Here is a brief summary of the upcoming steps:

  • Load the data in R and extract only the body of each email
  • Clean the data by removing words of no interest, such as stop words or numbers
  • Reduce the size of the data by filtering out words that do not occur often
  • Assemble training and testing data
  • Train a Naive Bayes model to classify the testing data as either ham or spam
  • Review performance

Data Import and Pre-Processing

To integrate the data described above into the project, download the .tar files and place the extracted files in the working directory. If you inspect the extracted folders, you will notice a cmds file that contains commands rather than an email. We will exclude this file in R. There are two main steps in this stage:

  1. Construct a list of email file names and load the email content into R.
  2. For each email, identify the email body and ignore the header content.

Given a path to a folder and a file name, this function will extract the email body and ignore the header content.
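As a minimal sketch of the body extraction described above, the helper below assumes that the header ends at the first blank line, which is the usual email convention. The name extract_body and the choice to return an empty string when no blank line is found are assumptions for illustration, not the original implementation.

```r
# Hypothetical helper: read one email file and return only its body.
# Assumption: the header block ends at the first empty line of the file.
extract_body <- function(folder, file_name) {
  lines <- readLines(file.path(folder, file_name), warn = FALSE)

  # Positions of empty lines; the first one separates header from body
  blank <- which(!nzchar(lines))

  # No separator, or nothing after it: treat the body as empty
  if (length(blank) == 0 || blank[1] == length(lines)) {
    return("")
  }

  # Collapse the remaining lines into a single string
  paste(lines[(blank[1] + 1):length(lines)], collapse = "\n")
}
```

Returning a single string per email keeps the later steps simple, since each email can then occupy one cell of a data frame.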

Given a folder path and a label, this function returns a data frame containing the email file names, the label “ham” or “spam”, and the body of each email.
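A sketch of such a loader in base R could look like the following. The function name load_emails, the column names file, label and body, and the inline body reader are assumptions made for this example.

```r
# Hypothetical helper: one row per email with file name, label and body.
load_emails <- function(folder, label) {
  files <- list.files(folder)
  files <- files[files != "cmds"]  # the cmds file is not an email

  # Body = everything after the first blank line (assumed header separator)
  read_body <- function(f) {
    lines <- readLines(file.path(folder, f), warn = FALSE)
    blank <- which(!nzchar(lines))
    if (length(blank) == 0 || blank[1] == length(lines)) return("")
    paste(lines[(blank[1] + 1):length(lines)], collapse = "\n")
  }

  data.frame(
    file  = files,
    label = rep(label, length(files)),
    body  = vapply(files, read_body, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}
```

Under these assumptions, a call such as load_emails("easy_ham", "ham") would return one row for every file in the easy_ham folder except cmds.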

Call the functions above and load the different email types into R.

At this stage, the data looks like this: easy_ham

Data Processing

We set up a filter to get the most frequent words in a corpus. We arbitrarily choose to return the top 100 words to limit the size of the data we are working with. The filter eliminates words that contain digits or the _ character. It can be extended with additional conditions to eliminate other words.
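The filter can be sketched in base R as follows. This is an illustration rather than the original code: the function name top_words, the tokenisation rule, and the optional stop_words argument are assumptions, and the text is assumed to be cleanly encoded.

```r
# Hypothetical helper: return the n most frequent words in a set of bodies.
top_words <- function(bodies, n = 100, stop_words = character(0)) {
  # Lower-case, then split on anything that is not a letter, digit or underscore
  tokens <- unlist(strsplit(tolower(bodies), "[^a-z0-9_]+"))
  tokens <- tokens[nzchar(tokens)]

  # Eliminate words containing a digit or the _ character
  tokens <- tokens[!grepl("[0-9_]", tokens)]

  # Optional extra condition: drop caller-supplied stop words
  tokens <- tokens[!(tokens %in% stop_words)]

  # Count occurrences and keep the n most frequent words
  counts <- sort(table(tokens), decreasing = TRUE)
  head(names(counts), n)
}
```

Further conditions, such as a minimum word length, could be added as extra filtering lines before the counting step.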

Get the most frequent words from both corpora that will be used for training.

Transform corpus into long format and filter out words that do not appear often.

Exploratory Analysis

Below are the top 25 words for both the ham and spam categories. At first glance, there is no obvious way to distinguish between the two classes, but we can make a few observations. The majority of the top spam words are HTML elements such as href, tr, td, and table, or font names. Among the ham words, we can spot words like message and date, which are typical of email replies.

Data Transformation & Modeling

The function below is passed a ham and spam corpus and returns a combined data frame with emails as rows and words as columns. This function is used for both training and testing corpora.
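One way to build such a data frame in base R is sketched below. The function name build_dtm, the vocab argument (the vector of frequent words to keep as columns), and the returned list structure are assumptions for this example; the ham and spam inputs are assumed to be data frames with label and body columns.

```r
# Hypothetical helper: combine ham and spam emails into a word-count table.
build_dtm <- function(ham, spam, vocab) {
  emails <- rbind(ham, spam)

  # Count how often each vocabulary word occurs in one email body
  count_words <- function(body) {
    tokens <- unlist(strsplit(tolower(body), "[^a-z0-9_]+"))
    # factor() with fixed levels drops other words and fixes the column order
    as.integer(table(factor(tokens, levels = vocab)))
  }

  # One row per email, one column per vocabulary word
  counts <- do.call(rbind, lapply(emails$body, count_words))
  dtm <- as.data.frame(counts)
  names(dtm) <- vocab

  list(
    data   = dtm,
    labels = factor(emails$label, levels = c("ham", "spam"))
  )
}
```

Keeping the labels separate from the counts makes it easy to pass both to a classifier and to compare predictions against the known labels afterwards.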

Set up the training and testing data, and record the labels of both so that prediction results can be compared against them. The training and testing data will be sparsely populated data frames with documents as rows, individual words as columns, and the number of occurrences as values.

The training data set now resembles a document-term matrix in which each row represents a document (document name not shown). See the subset below:

address  align  alsa  alt  apt  arial  background  bgcolor  blockquote  body  border
      0      8     0    0    0      2           0        0           0     2       3
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      1      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0      0     0    0    0      0           0        0           0     0       0
      0     28     0    1    0     24           0        6           0     3      18
      1      0     0    0    0      0           0        0           0     0       0

Call the naiveBayes function from the e1071 package to train the model. We pass it the data and the known spam or ham labels.

Make predictions based on the trained model and test data.
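The training and prediction calls can be sketched on a toy word-count table as follows. The data values and variable names are invented for illustration and are not taken from the project data. Note that naiveBayes treats numeric columns as Gaussian-distributed within each class.

```r
library(e1071)

# Toy word counts (invented for illustration): rows are emails, columns are words
train <- data.frame(
  free    = c(0, 0, 1, 3, 4, 2),
  meeting = c(2, 1, 3, 0, 0, 1)
)
train_labels <- factor(c("ham", "ham", "ham", "spam", "spam", "spam"))

# Train the model from the predictors and the known ham or spam labels
model <- naiveBayes(train, train_labels)

# Predict the class of unseen emails with the same word columns
test <- data.frame(free = c(0, 5), meeting = c(2, 0))
pred <- predict(model, test)
print(pred)
```

The predicted factor can then be tabulated against the known test labels to produce a confusion matrix.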

The confusion matrix provides us with a lot of information. Let’s dissect it:

  • The accuracy is about 75%. This number represents the sum of correctly classified emails (true positives and true negatives) divided by the total number of emails.
  • The sensitivity measures a test’s ability to identify positive results. It is also referred to as power, true positive rate, recall, or probability of detection. It is computed as true positives divided by all actual positives, which equals 1 - false negative rate (beta).
  • The specificity measures a test’s ability to identify negative results. It is also called the true negative rate. It is computed as true negatives divided by all actual negatives, which equals 1 - false positive rate (alpha).
  • In this case, the positive class is spam as those are the emails we are trying to detect.

Evaluating our predictions based on the descriptions above, we can say that while our model will correctly classify 75% of the easy_ham_2 and spam_2 corpora, its ability to detect positive results (sensitivity) is only 0.5240. The model does well in correctly identifying negative results (specificity) at 0.9792.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  ham spam
##       ham  1367  663
##       spam   29  730
##                                           
##                Accuracy : 0.7519          
##                  95% CI : (0.7354, 0.7678)
##     No Information Rate : 0.5005          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5035          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5240          
##             Specificity : 0.9792          
##          Pos Pred Value : 0.9618          
##          Neg Pred Value : 0.6734          
##              Prevalence : 0.4995          
##          Detection Rate : 0.2617          
##    Detection Prevalence : 0.2721          
##       Balanced Accuracy : 0.7516          
##                                           
##        'Positive' Class : spam            
## 
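As a check, the headline statistics can be recomputed by hand from the four counts in the confusion matrix above, with spam as the positive class. The variable names in this sketch are our own.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: actual)
cm <- matrix(
  c(1367, 29, 663, 730),
  nrow = 2,
  dimnames = list(Prediction = c("ham", "spam"), Actual = c("ham", "spam"))
)

tp <- cm["spam", "spam"]  # spam predicted as spam
tn <- cm["ham", "ham"]    # ham predicted as ham
fp <- cm["spam", "ham"]   # ham predicted as spam
fn <- cm["ham", "spam"]   # spam predicted as ham

accuracy    <- (tp + tn) / sum(cm)  # share of all emails classified correctly
sensitivity <- tp / (tp + fn)       # share of actual spam detected
specificity <- tn / (tn + fp)       # share of actual ham kept as ham

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.7519, sensitivity 0.5240, specificity 0.9792
```

The 663 spam emails predicted as ham are what pull the sensitivity down to roughly one half.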

All the functionality above is encapsulated in the model function which will be applied to the different datasets.

Results and Analysis

We assemble the different testing data and summarize the results of the various models.

The in-sample accuracy (first row) was about 90%. These results show that spam emails are correctly classified about 50% of the time across datasets. This is not a great success rate and suggests room for improvement. We also notice that the model’s ability to correctly identify ham emails was heavily impacted by the hard_ham set, which brought the specificity down from above 0.97 to 0.43. As a result, accuracy also degrades.

Testing Results
train.ham  train.spam  test.ham    test.spam   accuracy  sensitivity  specificity
easy_ham   spam        easy_ham    spam       0.9021849    0.4829659    0.9866721
easy_ham   spam        easy_ham_2  spam_2     0.7518824    0.5240488    0.9792264
easy_ham   spam        hard_ham    spam_2     0.4729154    0.4802584    0.4320000

Conclusion

This simple Naive Bayes model does a decent job of identifying non-spam emails correctly, but its performance degrades when the complexity of the ham emails increases. The model’s ability to identify spam emails does not suffer much, but it remains low at about 50%. To improve this model and the project overall, we can consider the following:

  • Additional filtering of training data (stemming, removing punctuation, etc.).
  • Equalizing the number of training instances by sampling so that spam and ham emails are represented equally.
  • Raising or lowering the number of most frequent words from 100. This will affect the complexity of the model and the run time.
  • Cross-validation of training data.
  • Using different models for performance comparisons (logistic regression, SVM, random forest, etc.).
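As an illustration of the class-balancing suggestion above, the following sketch downsamples every class to the size of the smallest one. The function name balance_classes, the seed argument, and the assumption that the emails sit in a data frame with a label column are hypothetical; this step was not part of the project.

```r
# Hypothetical helper: keep the same number of emails from every class.
balance_classes <- function(emails, seed = 1) {
  set.seed(seed)  # fixes the random draw (note: this resets the global RNG state)

  # Size of the smallest class determines how many rows each class keeps
  n_min <- min(table(emails$label))

  # Draw n_min row indices from each class without replacement
  rows_by_class <- split(seq_len(nrow(emails)), emails$label)
  keep <- unlist(lapply(rows_by_class, function(i) {
    i[sample.int(length(i), n_min)]
  }), use.names = FALSE)

  # Preserve the original row order
  emails[sort(keep), , drop = FALSE]
}
```

Downsampling discards some emails from the larger class, so it trades training data for balance; sampling the smaller class with replacement would be the alternative.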