I used the tm package to create document-term matrices
and the caret package to perform supervised machine
learning (SML) for text classification. If needed, you can install them
using the commands below.
install.packages("caret")
install.packages("tm")
Document classification is the process of assigning documents to one or more classes or categories based on shared characteristics (eg, subjects) of their content. Here, I compare the performance of four SML algorithms (decision trees, random forest, k-nearest neighbor, and support vector machines) to classify emails as spam or ham (ie, not spam). I also evaluate whether computer-interpretable content in emails (eg, headers) provides predictive value to spam vs ham classification.
I downloaded “ham” and spam data files from https://spamassassin.apache.org/old/publiccorpus/ and decompressed them on the command line.
I used both “easy” and “hard” ham emails because I was curious to see the difference in classification performance.
# Easy ham
bzip2 -d 20030228_easy_ham.tar.bz2
tar -xvf 20030228_easy_ham.tar
# Hard ham
bzip2 -d 20030228_hard_ham.tar.bz2
tar -xvf 20030228_hard_ham.tar
# Spam
bzip2 -d 20050311_spam_2.tar.bz2
tar -xvf 20050311_spam_2.tar
I thought it would be a little cumbersome to load thousands of individual emails to my GitHub repository, so I read them into tibbles and then saved and uploaded the R data files.
Create spam and ham tibbles.
# Note filepath is specific to my computer
working_dir <- getwd()
filepath <- paste(working_dir, "/Data/easy_ham", sep = "")
easy_ham_files <- readtext(filepath, encoding='UTF-8')
filepath <- paste(working_dir, "/Data/hard_ham", sep = "")
hard_ham_files <- readtext(filepath, encoding='UTF-8')
filepath <- paste(working_dir, "/Data/spam_2", sep = "")
spam_files <- readtext(filepath, encoding='UTF-8')
Save the tibbles as R data files.
saveRDS(easy_ham_files, "easy_ham_files.rds")
saveRDS(hard_ham_files, "hard_ham_files.rds")
saveRDS(spam_files, "spam_files.rds")
These data files can be read from my GitHub repository to recreate the spam and ham tibbles.
# Method from https://forum.posit.co/t/how-to-read-rds-files-hosted-at-github-repository/128561
easy_ham_emails <- readRDS(gzcon(url("https://github.com/alexandersimon1/Data607/raw/main/Project4/easy_ham_files.rds")))
hard_ham_emails <- readRDS(gzcon(url("https://github.com/alexandersimon1/Data607/raw/main/Project4/hard_ham_files.rds")))
spam_emails <- readRDS(gzcon(url("https://github.com/alexandersimon1/Data607/raw/main/Project4/spam_files.rds")))
First, I labeled the ham and spam emails as “ham” and “spam” factors, respectively, and then combined the ham and spam emails into a single tibble for each dataset (“easy” and “hard”). I also omitted the doc_id column since it isn’t needed.
# This function labels all emails in a tibble with the specified label (string),
# and then returns the updated tibble of emails
label_emails <- function (emails_df, label_text) {
emails_df <- emails_df %>%
select(text) %>%
mutate(
label = as.factor(label_text)
)
return(emails_df)
}
easy_ham_emails <- label_emails(easy_ham_emails, "ham")
hard_ham_emails <- label_emails(hard_ham_emails, "ham")
spam_emails <- label_emails(spam_emails, "spam")
all_emails_easy <- rbind(easy_ham_emails, spam_emails)
all_emails_hard <- rbind(hard_ham_emails, spam_emails)
Both datasets are imbalanced. The “easy” dataset has a higher proportion of ham emails than spam emails.
table(all_emails_easy$label) %>% prop.table()
##
## ham spam
## 0.6416111 0.3583889
In contrast, the “hard” dataset has a much higher proportion of spam emails vs ham emails.
table(all_emails_hard$label) %>% prop.table()
##
## ham spam
## 0.1523058 0.8476942
Ideally, these datasets should be balanced because imbalance can bias the prediction model toward the more common class (ie, the model can achieve high accuracy simply by predicting the most common class).1 Methods to balance data include undersampling (randomly selecting samples from the overrepresented class) and oversampling (randomly duplicating samples from the underrepresented class), both of which are implemented in the caret package. However, these procedures are out of scope for this assignment.
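For completeness, below is a minimal sketch (not used in this analysis) of how caret’s balancing helpers could be applied to the “easy” dataset; downSample() and upSample() take the predictors and the class labels and return a class-balanced data frame.
# Sketch only (not run for this analysis): balance the classes with caret helpers
balanced_down <- downSample(x = all_emails_easy["text"], y = all_emails_easy$label,
                            yname = "label")   # undersample the majority class
balanced_up <- upSample(x = all_emails_easy["text"], y = all_emails_easy$label,
                        yname = "label")       # oversample the minority class
table(balanced_down$label)                     # equal counts of ham and spam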
Below, I construct models using the datasets as is (ie, unadjusted for balance). This starts with creating corpuses of the email text and document-term matrices. The code in sections 3.3 and 3.4 is adapted from a tutorial about the tm package.
Create corpuses of email text
easy_email_corpus <- VCorpus(VectorSource(all_emails_easy$text))
hard_email_corpus <- VCorpus(VectorSource(all_emails_hard$text))
As an example, the first email in the “easy” corpus looks like this:
# Print first 10 lines
writeLines(head(strwrap(easy_email_corpus[[1]]), 10))
## From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 2002
## Return-Path: <exmh-workers-admin@spamassassin.taint.org> Delivered-To:
## zzzz@localhost.netnoteinc.com Received: from localhost (localhost
## [127.0.0.1]) by phobos.labs.netnoteinc.com (Postfix) with ESMTP id
## D03E543C36 for <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)
## Received: from phobos [127.0.0.1] by localhost with IMAP
## (fetchmail-5.9.0) for zzzz@localhost (single-drop); Thu, 22 Aug 2002
## 12:36:16 +0100 (IST) Received: from listman.spamassassin.taint.org
## (listman.spamassassin.taint.org [66.187.233.211]) by
## dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MBYrZ04811 for
Remove punctuation and numbers, remove English stop words, change case to lowercase, and strip extra white space.
# This function cleans up a specified corpus by removing punctuation, numbers, English stop
# words, and extra white space, and converts all text to lowercase. It returns the tidied corpus.
# Note that stop words are removed before lowercasing, so capitalized stop words (eg, "From") are kept.
tidy_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
easy_email_corpus <- tidy_corpus(easy_email_corpus)
hard_email_corpus <- tidy_corpus(hard_email_corpus)
After tidying, the first email in the “easy” corpus looks like this:
writeLines(head(strwrap(easy_email_corpus[[1]]), 10))
## from exmhworkersadminredhatcom thu aug returnpath
## exmhworkersadminspamassassintaintorg deliveredto
## zzzzlocalhostnetnoteinccom received localhost localhost
## phoboslabsnetnoteinccom postfix esmtp id dec zzzzlocalhost thu aug edt
## received phobos localhost imap fetchmail zzzzlocalhost singledrop thu
## aug ist received listmanspamassassintaintorg
## listmanspamassassintaintorg dogmaslashnullorg esmtp id gmbyrz
## zzzzexmhspamassassintaintorg thu aug received
## listmanspamassassintaintorg localhostlocaldomain listmanredhatcom
## postfix esmtp id thu aug edt deliveredto
Both the “easy” and “hard” email document-term matrices are extremely sparse (>99.5%).
easy_email_dtm <- DocumentTermMatrix(easy_email_corpus)
tm::inspect(easy_email_dtm)
## <<DocumentTermMatrix (documents: 3898, terms: 95865)>>
## Non-/sparse entries: 636174/373045596
## Sparsity : 100%
## Maximal term length: 868
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp from jmlocalhost localhost mon oct postfix received sep
## 2501 0 0 0 0 0 0 0 0 0 0
## 2552 0 2 4 0 0 0 0 2 2 0
## 2578 0 2 2 0 0 0 0 2 2 0
## 3404 0 4 9 0 1 8 0 0 20 0
## 3580 0 4 3 0 1 0 0 0 8 0
## 3591 0 1 2 2 3 0 0 1 7 0
## 3592 0 1 2 2 3 0 0 1 7 0
## 3898 0 0 0 0 0 0 0 0 0 0
## 670 0 5 2 2 4 0 0 3 7 8
## 677 0 4 3 2 3 0 0 3 7 9
hard_email_dtm <- DocumentTermMatrix(hard_email_corpus)
tm::inspect(hard_email_dtm)
## <<DocumentTermMatrix (documents: 1648, terms: 88952)>>
## Non-/sparse entries: 413696/146179200
## Sparsity : 100%
## Maximal term length: 868
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs border font height helvetica jul received size table width widthd
## 1154 0 0 0 0 8 20 1 1 0 0
## 1330 0 0 0 0 10 8 0 1 0 0
## 1341 0 8 0 0 5 7 2 0 0 0
## 1342 0 8 0 0 5 7 2 0 0 0
## 158 44 55 52 0 6 5 28 30 61 0
## 1648 0 0 0 0 0 0 0 0 0 0
## 198 0 5 0 29 0 9 0 36 0 303
## 302 9 9 4 0 0 2 24 24 13 0
## 328 9 9 4 0 4 2 24 22 13 0
## 39 0 0 0 0 4 3 0 0 0 0
To prevent R from crashing due to lack of memory when constructing the models (I learned this the hard way), I simplified the document-term matrices by removing terms with >95% sparsity (ie, terms that occur in <5% of emails).
The resulting, less sparse document-term matrices look like this:
easy_email_dtm <- removeSparseTerms(easy_email_dtm, 0.95)
tm::inspect(easy_email_dtm)
## <<DocumentTermMatrix (documents: 3898, terms: 438)>>
## Non-/sparse entries: 264484/1442840
## Sparsity : 85%
## Maximal term length: 50
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp from jmlocalhost localhost mon oct postfix received sep
## 2552 0 2 4 0 0 0 0 2 2 0
## 2578 0 2 2 0 0 0 0 2 2 0
## 3404 0 4 9 0 1 8 0 0 20 0
## 3490 0 4 8 0 1 0 0 0 20 0
## 3580 0 4 3 0 1 0 0 0 8 0
## 3591 0 1 2 2 3 0 0 1 7 0
## 3592 0 1 2 2 3 0 0 1 7 0
## 3805 5 1 4 2 3 0 0 1 12 0
## 670 0 5 2 2 4 0 0 3 7 8
## 677 0 4 3 2 3 0 0 3 7 9
hard_email_dtm <- removeSparseTerms(hard_email_dtm, 0.95)
tm::inspect(hard_email_dtm)
## <<DocumentTermMatrix (documents: 1648, terms: 737)>>
## Non-/sparse entries: 161198/1053378
## Sparsity : 87%
## Maximal term length: 58
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs border font height helvetica jul received size table width widthd
## 1154 0 0 0 0 8 20 1 1 0 0
## 1240 0 0 0 0 8 20 1 1 0 0
## 126 6 59 13 0 4 2 28 4 15 0
## 1330 0 0 0 0 10 8 0 1 0 0
## 1342 0 8 0 0 5 7 2 0 0 0
## 158 44 55 52 0 6 5 28 30 61 0
## 18 34 37 65 59 3 2 59 46 112 0
## 302 9 9 4 0 0 2 24 24 13 0
## 33 63 19 136 4 3 2 19 39 228 0
## 93 34 35 65 59 3 2 58 46 112 0
I did a lot of background reading to understand how to create supervised machine learning models for text classification tasks such as spam/ham email classification. The code below is adapted from what I learned.2
I partitioned the labeled emails (“easy” and “hard”) into training and test sets using a 70-30 split.
trainIndex_easy <- createDataPartition(y = all_emails_easy$label, p = 0.7, list = FALSE)
trainIndex_hard <- createDataPartition(y = all_emails_hard$label, p = 0.7, list = FALSE)
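Note that createDataPartition() samples rows at random and no seed is set in the code above, so the exact split (and the downstream numbers) will vary between runs. A seed could be fixed beforehand to make the analysis reproducible, for example:
# Optional (not done above): fix the RNG seed before partitioning and training
# so the train/test split and the bootstrap resamples are reproducible
set.seed(607)  # the value is arbitrary; 607 is used here only for illustration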
Next, split up the document-term matrices using the corresponding partition.
# "easy" dataset
training_set_easy <- easy_email_dtm[trainIndex_easy, ] %>% as.matrix() %>% as.data.frame()
test_set_easy <- easy_email_dtm[-trainIndex_easy, ] %>% as.matrix() %>% as.data.frame()
# "hard" dataset
training_set_hard <- hard_email_dtm[trainIndex_hard, ] %>% as.matrix() %>% as.data.frame()
test_set_hard <- hard_email_dtm[-trainIndex_hard, ] %>% as.matrix() %>% as.data.frame()
Similarly, split up the email labels using the corresponding partition.
# "easy" dataset
training_labels_easy <- all_emails_easy$label[trainIndex_easy]
test_labels_easy <- all_emails_easy$label[-trainIndex_easy]
# "hard" dataset
training_labels_hard <- all_emails_hard$label[trainIndex_hard]
test_labels_hard <- all_emails_hard$label[-trainIndex_hard]
I used a bootstrap resampling method.
resampling_method <- trainControl(method = "boot")
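With method = "boot", trainControl() uses caret’s default of 25 bootstrap resamples, which matches the “Bootstrapped (25 reps)” lines in the output below. If desired, the number of resamples or the resampling scheme could be changed, for example:
# Alternatives (not used here): more bootstrap replicates, or k-fold cross-validation
resampling_method_boot50 <- trainControl(method = "boot", number = 50)
resampling_method_cv10 <- trainControl(method = "cv", number = 10)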
First, train the model for the “easy” dataset and the model for the “hard” dataset.
dt_model_easy <- caret::train(x = training_set_easy, y = training_labels_easy, method = "rpart",
trControl = resampling_method)
print(dt_model_easy)
## CART
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.1124744 0.9440208 0.8805996
## 0.2689162 0.8703450 0.7255566
## 0.5245399 0.7592002 0.3960552
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1124744.
dt_model_hard <- caret::train(x = training_set_hard, y = training_labels_hard, method = "rpart",
trControl = resampling_method)
print(dt_model_hard)
## CART
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.06818182 0.9310748 0.7085640
## 0.09659091 0.9254577 0.6734704
## 0.50568182 0.8744114 0.2711564
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.06818182.
Then evaluate the models on the corresponding test dataset.
The summary statistics below show that the accuracy (95% CI) of the decision tree model with the “easy” dataset is 92.3% (90.62% - 93.76%). Two indicators show that the model performs better than would be expected by chance alone.
First, the accuracy is greater than the “no information rate” (NIR; 64.16%) and the p-value of accuracy vs NIR is much less than 0.05, so the decision tree model performs significantly better than a naive classifier that assigns everything to the most common class.
Second, the kappa value is high (\(𝜅 > 0.8\)). Kappa measures the agreement between two raters that classify items into mutually exclusive categories, beyond the agreement expected by chance: \(𝜅 = 0\) means no agreement beyond chance and \(𝜅 = 1\) means complete agreement. \(𝜅 > 0.8\) indicates that the agreement between the predicted and reference classes is strong and unlikely to be due to chance.
dt_predict_easy <- predict(dt_model_easy, newdata = test_set_easy)
dt_easy_confusion_matrix <- caret::confusionMatrix(dt_predict_easy, test_labels_easy,
mode = "prec_recall")
dt_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 668 8
## spam 82 411
##
## Accuracy : 0.923
## 95% CI : (0.9062, 0.9376)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8389
##
## Mcnemar's Test P-Value : 1.416e-14
##
## Precision : 0.9882
## Recall : 0.8907
## F1 : 0.9369
## Prevalence : 0.6416
## Detection Rate : 0.5714
## Detection Prevalence : 0.5783
## Balanced Accuracy : 0.9358
##
## 'Positive' Class : ham
##
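As a check on these metrics, the reported \(𝜅 = 0.8389\) can be reproduced by hand from the confusion matrix counts above (668, 8, 82, 411) using Cohen’s formula \(𝜅 = (p_o - p_e) / (1 - p_e)\), where \(p_o\) is the observed accuracy and \(p_e\) is the agreement expected by chance from the row and column totals. A minimal sketch:
# Recompute Cohen's kappa from the confusion matrix counts shown above
cm <- matrix(c(668, 82, 8, 411), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference = c("ham", "spam")))
n <- sum(cm)
p_o <- sum(diag(cm)) / n                      # observed accuracy, ~0.923
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement, ~0.522
(p_o - p_e) / (1 - p_e)                       # ~0.839, matching the reported kappa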
The confusion matrix can also be visualized by a “four-fold plot”. This plot shows that the decision tree model does fairly well predicting true positives (upper left quadrant) and true negatives (lower right quadrant) in the “easy” dataset, which agrees with the 92.3% accuracy ((668 + 411)/(668 + 8 + 82 + 411)). Among the incorrect predictions, there are more false negatives (lower left quadrant) than false positives (upper right quadrant).
# Reference: https://www.geeksforgeeks.org/visualize-confusion-matrix-using-caret-package-in-r/
fourfoldplot(as.table(dt_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
For the “hard” dataset, the accuracy (95% CI) of the decision tree model is 92.51% (89.82%-94.67%), which is about the same as the accuracy for the “easy” dataset. The accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the decision tree model for the hard dataset is also significantly better than a naive classifier.
However, the difference between the accuracy and NIR (0.9251 - 0.8482 = 0.0769) is less than that for the easy dataset (0.9230 - 0.6416 = 0.2814), which suggests that the decision tree model is a better model for the easy dataset than the hard dataset. This is supported by the lower kappa value (\(0.6 \leq 𝜅 \leq 0.8\)), which indicates only moderate agreement between the prediction and reference categories.
dt_predict_hard <- predict(dt_model_hard, newdata = test_set_hard)
dt_hard_confusion_matrix <- caret::confusionMatrix(dt_predict_hard, test_labels_hard,
mode = "prec_recall")
dt_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 50 12
## spam 25 407
##
## Accuracy : 0.9251
## 95% CI : (0.8982, 0.9467)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 1.62e-07
##
## Kappa : 0.6869
##
## Mcnemar's Test P-Value : 0.04852
##
## Precision : 0.8065
## Recall : 0.6667
## F1 : 0.7299
## Prevalence : 0.1518
## Detection Rate : 0.1012
## Detection Prevalence : 0.1255
## Balanced Accuracy : 0.8190
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset looks like this:
fourfoldplot(as.table(dt_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
First, train the models for the easy and hard datasets.
rf_model_easy <- train(x = training_set_easy, y = training_labels_easy, method = "ranger",
                       trControl = resampling_method,
                       # hyperparameters
                       tuneGrid = data.frame(
                         # number of predictors randomly sampled as split candidates at each node
                         mtry = floor(sqrt(dim(training_set_easy)[2])),
                         # rule used to choose how each node is split
                         splitrule = "extratrees",
                         # minimum node size; nodes keep splitting until they reach this size
                         min.node.size = 5))
print(rf_model_easy)
## Random Forest
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9939238 0.9867075
##
## Tuning parameter 'mtry' was held constant at a value of 20
## Tuning
## parameter 'splitrule' was held constant at a value of extratrees
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
rf_model_hard <- train(x = training_set_hard, y = training_labels_hard, method = "ranger",
                       trControl = resampling_method,
                       # hyperparameters
                       tuneGrid = data.frame(
                         # number of predictors randomly sampled as split candidates at each node
                         mtry = floor(sqrt(dim(training_set_hard)[2])),
                         # rule used to choose how each node is split
                         splitrule = "extratrees",
                         # minimum node size; nodes keep splitting until they reach this size
                         min.node.size = 5))
print(rf_model_hard)
## Random Forest
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9565281 0.809856
##
## Tuning parameter 'mtry' was held constant at a value of 27
## Tuning
## parameter 'splitrule' was held constant at a value of extratrees
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
Then evaluate the models on the corresponding test dataset.
The summary statistics below show that the accuracy (95% CI) of the random forest model for the “easy” dataset is 99.49% (98.89%-99.81%). This accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the RF model performs significantly better than a naive classifier. In addition, 𝜅 is very close to 1, which means that the agreement between the prediction and reference categories is strong and highly unlikely to be due to chance.
rf_predict_easy <- predict(rf_model_easy, newdata = test_set_easy)
rf_easy_confusion_matrix <- caret::confusionMatrix(rf_predict_easy, test_labels_easy,
mode = "prec_recall")
rf_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 749 5
## spam 1 414
##
## Accuracy : 0.9949
## 95% CI : (0.9889, 0.9981)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9888
##
## Mcnemar's Test P-Value : 0.2207
##
## Precision : 0.9934
## Recall : 0.9987
## F1 : 0.9960
## Prevalence : 0.6416
## Detection Rate : 0.6407
## Detection Prevalence : 0.6450
## Balanced Accuracy : 0.9934
##
## 'Positive' Class : ham
##
The four-fold plot shows that the random forest model for the “easy” dataset does extremely well predicting true positives (upper left quadrant) and true negatives (lower right quadrant), which agrees with the 99.49% accuracy ((749 + 414)/(749 + 5 + 1 + 414)). There are far fewer false negatives (lower left quadrant) and false positives (upper right quadrant) than with the decision tree model, which accounts for the higher accuracy of the RF model.
fourfoldplot(as.table(rf_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The accuracy (95% CI) of the random forest model for the “hard” dataset is slightly lower, 95.75% (93.58%-97.35%). Although the accuracy is lower than for the “easy” dataset, the RF model still classifies the imbalanced “hard” dataset better than the decision tree model did (higher accuracy and 𝜅), which suggests that the RF model is less susceptible to the imbalanced data than the decision tree model.
Similar to the analysis of the decision tree models, the difference between the accuracy and NIR (0.9575 - 0.8482 = 0.1093) with the RF model for the hard dataset is less than that for the easy dataset (0.9949 - 0.6416 = 0.3533), which suggests that the RF model is a better model for the easy dataset than the hard dataset. This is supported by the lower kappa value (\(𝜅 = 0.8135\), vs 0.9888 for the easy dataset).
rf_predict_hard <- predict(rf_model_hard, newdata = test_set_hard)
rf_hard_confusion_matrix <- caret::confusionMatrix(rf_predict_hard, test_labels_hard,
mode = "prec_recall")
rf_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 54 0
## spam 21 419
##
## Accuracy : 0.9575
## 95% CI : (0.9358, 0.9735)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 5.975e-15
##
## Kappa : 0.8135
##
## Mcnemar's Test P-Value : 1.275e-05
##
## Precision : 1.0000
## Recall : 0.7200
## F1 : 0.8372
## Prevalence : 0.1518
## Detection Rate : 0.1093
## Detection Prevalence : 0.1093
## Balanced Accuracy : 0.8600
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset is shown below. Surprisingly, there were no false positives (upper right quadrant).
fourfoldplot(as.table(rf_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
First, train the models for the “easy” and “hard” datasets.
knn_model_easy <- train(x = training_set_easy, y = training_labels_easy, method = "knn",
trControl = resampling_method,
# hyperparameter
tuneGrid = data.frame(k = 2))
print(knn_model_easy)
## k-Nearest Neighbors
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9824494 0.9614311
##
## Tuning parameter 'k' was held constant at a value of 2
knn_model_hard <- train(x = training_set_hard, y = training_labels_hard, method = "knn",
trControl = resampling_method,
# hyperparameter
tuneGrid = data.frame(k = 2))
print(knn_model_hard)
## k-Nearest Neighbors
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9378703 0.7256909
##
## Tuning parameter 'k' was held constant at a value of 2
Then evaluate the models on the corresponding test set.
The summary statistics below show that the accuracy (95% CI) of the k-nearest neighbor model for the “easy” dataset is 98.55% (97.68%-99.15%). This accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the kNN model performs significantly better than a naive classifier. In addition, 𝜅 is very close to 1, which means that the agreement between the prediction and reference categories is strong and unlikely to be due to chance.
knn_predict_easy <- predict(knn_model_easy, newdata = test_set_easy)
knn_easy_confusion_matrix <- caret::confusionMatrix(knn_predict_easy, test_labels_easy,
mode = "prec_recall")
knn_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 748 15
## spam 2 404
##
## Accuracy : 0.9855
## 95% CI : (0.9768, 0.9915)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9682
##
## Mcnemar's Test P-Value : 0.003609
##
## Precision : 0.9803
## Recall : 0.9973
## F1 : 0.9888
## Prevalence : 0.6416
## Detection Rate : 0.6399
## Detection Prevalence : 0.6527
## Balanced Accuracy : 0.9808
##
## 'Positive' Class : ham
##
The four-fold plot of the kNN model confusion matrix for the “easy” dataset is similar to that of the RF model. The kNN model does very well predicting true positives (upper left quadrant) and true negatives (lower right quadrant), which agrees with the 98.55% accuracy ((748 + 404)/(748 + 15 + 2 + 404)). There were relatively few false negatives (lower left quadrant).
fourfoldplot(as.table(knn_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The accuracy (95% CI) of the kNN model for the “hard” dataset is a little lower, 94.13% (91.68%-96.03%). As with the RF model, the accuracy drops for the “hard” dataset, but the kNN model still classifies the imbalanced data better than the decision tree model (higher accuracy and 𝜅), suggesting that it too is less susceptible to the imbalance.
The difference between the accuracy and NIR (0.9413 - 0.8482 = 0.0931) with the kNN model for the hard dataset is less than that for the easy dataset (0.9855 - 0.6416 = 0.3439). This suggests that, like the RF model, the kNN model is a better model for the easy dataset than the hard dataset. This is supported by \(0.6 \leq 𝜅 \leq 0.8\), which indicates only moderate agreement between the prediction and reference categories.
knn_predict_hard <- predict(knn_model_hard, newdata = test_set_hard)
knn_hard_confusion_matrix <- caret::confusionMatrix(knn_predict_hard, test_labels_hard,
mode = "prec_recall")
knn_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 51 5
## spam 24 414
##
## Accuracy : 0.9413
## 95% CI : (0.9168, 0.9603)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 9.863e-11
##
## Kappa : 0.7456
##
## Mcnemar's Test P-Value : 0.0008302
##
## Precision : 0.9107
## Recall : 0.6800
## F1 : 0.7786
## Prevalence : 0.1518
## Detection Rate : 0.1032
## Detection Prevalence : 0.1134
## Balanced Accuracy : 0.8340
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset looks like this:
fourfoldplot(as.table(knn_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
First, train the models for the “easy” and “hard” datasets.
svm_model_easy <- train(x = training_set_easy, y = training_labels_easy, method = "svmLinear3",
                        trControl = resampling_method,
                        # hyperparameters
                        tuneGrid = data.frame(
                          # cost of constraint violation (how heavily misclassifications are penalized)
                          cost = 1,
                          # loss function used to penalize misclassifications
                          Loss = 2))
print(svm_model_easy)
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9942328 0.9874286
##
## Tuning parameter 'cost' was held constant at a value of 1
## Tuning
## parameter 'Loss' was held constant at a value of 2
svm_model_hard <- train(x = training_set_hard, y = training_labels_hard, method = "svmLinear3",
                        trControl = resampling_method,
                        # hyperparameters
                        tuneGrid = data.frame(
                          # cost of constraint violation (how heavily misclassifications are penalized)
                          cost = 1,
                          # loss function used to penalize misclassifications
                          Loss = 2))
print(svm_model_hard)
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9657199 0.8648259
##
## Tuning parameter 'cost' was held constant at a value of 1
## Tuning
## parameter 'Loss' was held constant at a value of 2
Then evaluate the models on the corresponding test set.
The summary statistics below show that the accuracy (95% CI) of the SVM model for the “easy” dataset is 99.49% (98.89%-99.81%). This accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the SVM model performs significantly better than a naive classifier. In addition, 𝜅 is very close to 1, which means that the agreement between the prediction and reference categories is strong and highly unlikely to be due to chance.
svm_predict_easy <- predict(svm_model_easy, newdata = test_set_easy)
svm_easy_confusion_matrix <- caret::confusionMatrix(svm_predict_easy, test_labels_easy,
mode = "prec_recall")
svm_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 748 4
## spam 2 415
##
## Accuracy : 0.9949
## 95% CI : (0.9889, 0.9981)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9888
##
## Mcnemar's Test P-Value : 0.6831
##
## Precision : 0.9947
## Recall : 0.9973
## F1 : 0.9960
## Prevalence : 0.6416
## Detection Rate : 0.6399
## Detection Prevalence : 0.6433
## Balanced Accuracy : 0.9939
##
## 'Positive' Class : ham
##
The four-fold plot looks very similar to the four-fold plot of the random forest confusion matrix. Like the RF model, the SVM model predicts true positives (upper left quadrant) and true negatives (lower right quadrant) extremely well, with few false negatives (lower left quadrant) or false positives (upper right quadrant).
fourfoldplot(as.table(svm_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The accuracy (95% CI) of the SVM model for the “hard” dataset is a little lower, 95.95% (93.82%-97.51%), but is still very good. The difference in accuracy between the easy and hard datasets was smaller for the SVM model (0.9949 - 0.9595 = 0.0354) than for the kNN model (0.9855 - 0.9413 = 0.0442), which suggests that the SVM model is less susceptible to imbalanced data.
The difference between the accuracy and NIR (0.9595 - 0.8482 = 0.1113) with the SVM model for the hard dataset is less than that for the easy dataset (0.9949 - 0.6416 = 0.3533). This suggests that, like the RF and kNN models, the SVM model is a better model for the easy dataset than the hard dataset. Nevertheless, \(𝜅>0.8\), which indicates strong agreement between the prediction and reference categories.
svm_predict_hard <- predict(svm_model_hard, newdata = test_set_hard)
svm_hard_confusion_matrix <- caret::confusionMatrix(svm_predict_hard, test_labels_hard,
mode = "prec_recall")
svm_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 59 4
## spam 16 415
##
## Accuracy : 0.9595
## 95% CI : (0.9382, 0.9751)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 1.456e-15
##
## Kappa : 0.8317
##
## Mcnemar's Test P-Value : 0.01391
##
## Precision : 0.9365
## Recall : 0.7867
## F1 : 0.8551
## Prevalence : 0.1518
## Detection Rate : 0.1194
## Detection Prevalence : 0.1275
## Balanced Accuracy : 0.8886
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset looks like this:
fourfoldplot(as.table(svm_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The performance of the four SML models can be compared using resampling to estimate the distribution of the performance metrics (eg, accuracy).
models <- list(DT_easy = dt_model_easy, DT_hard = dt_model_hard,
RF_easy = rf_model_easy, RF_hard = rf_model_hard,
KNN_easy = knn_model_easy, KNN_hard = knn_model_hard,
SVM_easy = svm_model_easy, SVM_hard = svm_model_hard)
resampling <- resamples(models)
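caret also provides built-in summaries of a resamples object; summary() tabulates the distribution of each metric per model, and bwplot() (a lattice plot supplied by caret) gives a quick boxplot view without any reshaping. These are optional checks alongside the custom plot below:
# Optional built-in views of the resampling results
summary(resampling)                      # min/quartiles/mean/max of Accuracy and Kappa per model
bwplot(resampling, metric = "Accuracy")  # lattice boxplots of accuracy by model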
The accuracy and kappa values of the resamples look like this:
resampling_metrics_df <- resampling$values
resampling_metrics_df
## Resample DT_easy~Accuracy DT_easy~Kappa DT_hard~Accuracy DT_hard~Kappa
## 1 Resample01 0.9393638 0.8700066 0.9186047 0.6635669
## 2 Resample02 0.9267068 0.8454626 0.9124088 0.6443953
## 3 Resample03 0.9159919 0.8252140 0.9370460 0.7570478
## 4 Resample04 0.9302789 0.8544301 0.9322430 0.6710833
## 5 Resample05 0.9396637 0.8620530 0.9302885 0.6691531
## 6 Resample06 0.9772952 0.9495633 0.9448441 0.7653922
## 7 Resample07 0.9230019 0.8383143 0.9097561 0.6016491
## 8 Resample08 0.9221154 0.8352796 0.9317647 0.7421170
## 9 Resample09 0.9749750 0.9452880 0.9186603 0.6470647
## 10 Resample10 0.9309309 0.8540219 0.9170616 0.6427362
## 11 Resample11 0.9272727 0.8461465 0.9275701 0.7144947
## 12 Resample12 0.9722772 0.9396536 0.9390244 0.7703339
## 13 Resample13 0.9687185 0.9324241 0.9367397 0.7001515
## 14 Resample14 0.9422492 0.8763297 0.9218009 0.6559613
## 15 Resample15 0.9629630 0.9201858 0.9458239 0.7547179
## 16 Resample16 0.9279279 0.8505386 0.9287411 0.6617749
## 17 Resample17 0.9768145 0.9506194 0.9252747 0.6992963
## 18 Resample18 0.9274510 0.8467722 0.9265403 0.7316072
## 19 Resample19 0.9351944 0.8607424 0.9375000 0.7363495
## 20 Resample20 0.9763547 0.9489092 0.9413146 0.7384192
## 21 Resample21 0.9322362 0.8546842 0.9220183 0.6856658
## 22 Resample22 0.9412955 0.8731427 0.9367946 0.7310611
## 23 Resample23 0.9276986 0.8476628 0.9553571 0.7930812
## 24 Resample24 0.9343284 0.8582535 0.9465116 0.7623168
## 25 Resample25 0.9674134 0.9292908 0.9331797 0.7746625
## RF_easy~Accuracy RF_easy~Kappa RF_hard~Accuracy RF_hard~Kappa
## 1 0.9891304 0.9764733 0.9690476 0.8486528
## 2 0.9894535 0.9770332 0.9565217 0.8056237
## 3 0.9910090 0.9798259 0.9656751 0.8584203
## 4 0.9949239 0.9889273 0.9740566 0.8925346
## 5 0.9949444 0.9889662 0.9370629 0.7384324
## 6 0.9939516 0.9867034 0.9694836 0.8635221
## 7 0.9950249 0.9892634 0.9495413 0.7850002
## 8 0.9949187 0.9887886 0.9683973 0.8634222
## 9 0.9940653 0.9871824 0.9479905 0.7841536
## 10 0.9941176 0.9870042 0.9467593 0.7826582
## 11 0.9941176 0.9872009 0.9425287 0.7735083
## 12 0.9960396 0.9912903 0.9619048 0.8155366
## 13 0.9920080 0.9826067 0.9490291 0.7553998
## 14 0.9970238 0.9934478 0.9561201 0.7978177
## 15 0.9939940 0.9868234 0.9481481 0.7749464
## 16 0.9950593 0.9892145 0.9602978 0.8393942
## 17 0.9960474 0.9915271 0.9649533 0.8376328
## 18 0.9951877 0.9895982 0.9559165 0.8247705
## 19 0.9940000 0.9867605 0.9491525 0.7367111
## 20 0.9970356 0.9935287 0.9520548 0.8095416
## 21 0.9911591 0.9806001 0.9331683 0.7065533
## 22 0.9919598 0.9822527 0.9673660 0.8541242
## 23 0.9959839 0.9911783 0.9708029 0.8636514
## 24 0.9920949 0.9826453 0.9604938 0.8114635
## 25 0.9948454 0.9888457 0.9567308 0.8229285
## KNN_easy~Accuracy KNN_easy~Kappa KNN_hard~Accuracy KNN_hard~Kappa
## 1 0.9817629 0.9608773 0.9376499 0.7445575
## 2 0.9744094 0.9429536 0.9462103 0.7942091
## 3 0.9829146 0.9619964 0.9243499 0.6771300
## 4 0.9790210 0.9544808 0.9439024 0.7456164
## 5 0.9814815 0.9588937 0.9230769 0.6739812
## 6 0.9843902 0.9661259 0.9498807 0.7315823
## 7 0.9873909 0.9723141 0.9363208 0.7213514
## 8 0.9807497 0.9572898 0.9437939 0.7611300
## 9 0.9798793 0.9566070 0.9304556 0.6764934
## 10 0.9781312 0.9516811 0.9349776 0.7277855
## 11 0.9869215 0.9715373 0.9287356 0.7260872
## 12 0.9774127 0.9512515 0.9391101 0.6844230
## 13 0.9868288 0.9710154 0.9417476 0.7488315
## 14 0.9901768 0.9783177 0.9279070 0.6760632
## 15 0.9766971 0.9466981 0.9463869 0.7660573
## 16 0.9820896 0.9598236 0.9318735 0.7157396
## 17 0.9809619 0.9580682 0.9260143 0.6693986
## 18 0.9782823 0.9520521 0.9553991 0.7796352
## 19 0.9843597 0.9655803 0.9303944 0.6730225
## 20 0.9834146 0.9640288 0.9304556 0.7033193
## 21 0.9768844 0.9502196 0.9338061 0.6891828
## 22 0.9857868 0.9690953 0.9447005 0.7775215
## 23 0.9894737 0.9767400 0.9463869 0.7337058
## 24 0.9838872 0.9651451 0.9356322 0.7219432
## 25 0.9879276 0.9729850 0.9575893 0.8235050
## SVM_easy~Accuracy SVM_easy~Kappa SVM_hard~Accuracy SVM_hard~Kappa
## 1 0.9980020 0.9956068 0.9781553 0.9128268
## 2 0.9960239 0.9914206 0.9727273 0.8790932
## 3 0.9890220 0.9753930 0.9559902 0.8166193
## 4 0.9930279 0.9851578 0.9600939 0.8447189
## 5 0.9960591 0.9914017 0.9666667 0.8675437
## 6 0.9939577 0.9871185 0.9447115 0.8013619
## 7 0.9969819 0.9933633 0.9734300 0.9014755
## 8 0.9884837 0.9749166 0.9708029 0.8829560
## 9 0.9941003 0.9871326 0.9602804 0.8505341
## 10 0.9930830 0.9852445 0.9683258 0.8825584
## 11 0.9970000 0.9933575 0.9623529 0.8599959
## 12 0.9920239 0.9826892 0.9562212 0.8181217
## 13 0.9918864 0.9817544 0.9485981 0.7896994
## 14 0.9939271 0.9867437 0.9774266 0.9109941
## 15 0.9929577 0.9846543 0.9786730 0.9242430
## 16 0.9940000 0.9868986 0.9600000 0.8465997
## 17 0.9950932 0.9892416 0.9744780 0.9056949
## 18 0.9960317 0.9913473 0.9671362 0.8745002
## 19 0.9939516 0.9868400 0.9639423 0.8464869
## 20 0.9951076 0.9895658 0.9760766 0.9053142
## 21 0.9960435 0.9911243 0.9533170 0.7653609
## 22 0.9969325 0.9933045 0.9610706 0.8573474
## 23 0.9929648 0.9846431 0.9668246 0.8726998
## 24 0.9951172 0.9895560 0.9688995 0.8837647
## 25 0.9940417 0.9872393 0.9767981 0.9201379
To plot these data, I first reshaped them into long format and extracted the performance metric, model, and dataset difficulty (easy/hard) for each resample.
resampling_metrics_df <- resampling_metrics_df %>%
melt() %>%
rowwise() %>%
mutate(
metric = if_else(str_detect(variable, "Accuracy", negate = FALSE), "Accuracy", "Kappa"),
model = str_extract(variable, ".*(?=_)"),
# remove "~Accuracy" and "~Kappa" from variable names
variable = str_replace(variable, "~.*", ""),
difficulty = str_extract(variable, "(?<=_).*")
)
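An equivalent reshape with tidyr (loaded with the tidyverse) is sketched below; it assumes the same “model_difficulty~metric” naming convention for the resample columns and avoids the regular expressions:
# Alternative reshape (sketch): pivot to long format and split the column names
resampling_metrics_long <- resampling$values %>%
  pivot_longer(-Resample, names_to = "variable", values_to = "value") %>%
  separate(variable, into = c("model_difficulty", "metric"), sep = "~") %>%
  separate(model_difficulty, into = c("model", "difficulty"), sep = "_")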
The boxplots below show that, for the “easy” dataset, the SVM and RF models performed the best in terms of accuracy and 𝜅 value. For the “hard” dataset, the SVM model’s advantage over the RF model was more pronounced. Since real-world spam emails are likely to be “hard”, these results suggest that the SVM model is the best classifier of ham vs spam emails.
ggplot(resampling_metrics_df, aes(x = model, y = value, color = model)) +
geom_boxplot() +
coord_flip() +
facet_grid(difficulty ~ metric, scales = "free_x") +
ylab("value") + xlab("model") +
theme(
strip.text = element_text(face = "bold"),
axis.title = element_text(face = "bold"),
legend.position = "none"
)
In this section, I only focus on the best-performing model (SVM) and the “hard” dataset.
I took a slightly broader approach than comparing emails with or without headers because the subject line is meaningful to email recipients and usually (unless it’s missing) gives a clue about the body of an email. So I call the subject + body “human-interpretable content”, in contrast to the entire email, which is “computer-interpretable content”.
Removing the email headers with regular expressions proved to be challenging (dead ends not shown), so I made a simplifying assumption that email headers are separated from the main body by a blank line. After dividing these two parts, I extracted the subject line from the header part and stripped HTML markup from the body part. Finally, I concatenated the subject and cleaned the message body to form the human-interpretable content.
human_content <- all_emails_hard %>%
rowwise() %>%
mutate(
# Add a dummy newline character to end of email text
# The purpose of this is to enable extraction of the subject line of blank emails
text = str_c(text, "\n", sep = ""),
# Capture subject line
subject = str_extract(text, "Subject: (.*)\n", group = 1),
# Email headers are separated from the body by a blank line, so the "body" is everything after
body = str_sub(text, str_locate(text, "[\n]{2,}")[2] + 1, str_length(text)),
# Remove URLs
body = str_replace_all(body, "http.*(\\n|\")", ""),
# Remove HTML tags
body = str_replace_all(body, "<[^>]*>", ""),
# Remove special characters, eg = non-breaking whitespace
body = str_replace_all(body, "&#?[\\w|\\d]+;", ""),
# Remove excess whitespace
body = str_replace_all(body, "[\\s]+", " "),
# Concatenate email subject and body
body = if_else(is.na(body) | body == " ",
subject, # if no body, use subject as body
str_c(subject, body, sep = " ")) # otherwise concatenate subject and body
) %>%
select(body, label)
A small fraction of emails did not contain human-interpretable content (as defined above). Because these messages are not useful for classification, I omitted them.
n_emails <- nrow(human_content)
n_no_content <- sum(is.na(human_content$body))
sprintf("%s of %s emails (%.2f%%) do not have human-interpretable content", n_no_content, n_emails, 100 * n_no_content / n_emails)
## [1] "11 of 1648 emails (0.67%) do not have human-interpretable content"
human_content <- human_content %>%
drop_na(body)
human_corpus <- VCorpus(VectorSource(human_content$body)) %>%
tidy_corpus()
human_dtm <- DocumentTermMatrix(human_corpus) %>%
removeSparseTerms(., 0.95)
tm::inspect(human_dtm)
## <<DocumentTermMatrix (documents: 1637, terms: 458)>>
## Non-/sparse entries: 81153/668593
## Sparsity : 89%
## Maximal term length: 40
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs business can email free get please the this will you
## 1147 24 14 25 10 20 5 33 28 39 31
## 1223 49 14 1 36 25 0 19 12 23 16
## 1231 20 12 26 10 20 6 33 28 40 31
## 1297 49 14 2 36 25 0 20 13 23 17
## 1320 4 9 1 6 1 2 96 10 29 1
## 1324 49 14 2 36 25 0 20 13 23 17
## 1545 6 9 12 6 8 5 26 15 24 16
## 158 8 19 1 3 13 1 23 7 7 10
## 300 10 40 14 28 22 6 12 7 76 26
## 326 16 10 8 16 20 4 10 5 18 24
As before, I partitioned the emails into training and test sets using a 70-30 split.
trainIndex_human <- createDataPartition(y = human_content$label, p = 0.7, list = FALSE)
training_set_human <- human_dtm[trainIndex_human, ] %>% as.matrix() %>% as.data.frame()
test_set_human <- human_dtm[-trainIndex_human, ] %>% as.matrix() %>% as.data.frame()
training_labels_human <- human_content$label[trainIndex_human]
test_labels_human <- human_content$label[-trainIndex_human]
Train the model
svm_model_human <- train(x = training_set_human, y = training_labels_human, method = "svmLinear3",
trControl = resampling_method, tuneGrid = data.frame(cost = 1, Loss = 2))
print(svm_model_human)
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 1147 samples
## 458 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1147, 1147, 1147, 1147, 1147, 1147, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9430329 0.7699226
##
## Tuning parameter 'cost' was held constant at a value of 1
## Tuning
## parameter 'Loss' was held constant at a value of 2
Evaluate the model on the test set
The accuracy (95% CI) of the SVM model for the “human-interpretable” dataset is 93.67% (91.14%-95.56%) and \(𝜅 = 0.7309\), which indicates moderate agreement between prediction and reference categories.
svm_predict_human <- predict(svm_model_human, newdata = test_set_human)
svm_human_confusion_matrix <- caret::confusionMatrix(svm_predict_human, test_labels_human, mode = "prec_recall")
svm_human_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 51 8
## spam 23 408
##
## Accuracy : 0.9367
## 95% CI : (0.9114, 0.9566)
## No Information Rate : 0.849
## P-Value [Acc > NIR] : 1.499e-09
##
## Kappa : 0.7309
##
## Mcnemar's Test P-Value : 0.01192
##
## Precision : 0.8644
## Recall : 0.6892
## F1 : 0.7669
## Prevalence : 0.1510
## Detection Rate : 0.1041
## Detection Prevalence : 0.1204
## Balanced Accuracy : 0.8350
##
## 'Positive' Class : ham
##
As before, resample to estimate the distribution of the performance metrics (eg, accuracy)
models <- list(computer_content = svm_model_hard, human_content = svm_model_human)
resampling <- resamples(models)
Then reshape the data
resampling_metrics_df <- resampling$values
resampling_metrics_df <- resampling_metrics_df %>%
melt() %>%
rowwise() %>%
mutate(
metric = if_else(str_detect(variable, "Accuracy", negate = FALSE), "Accuracy", "Kappa"),
content_type = str_extract(variable, ".*(?=_)"),
# remove "~Accuracy" and "~Kappa" from variable names
variable = str_replace(variable, "~.*", ""),
)
The boxplots below show that the accuracy and kappa values of the SVM model for the “computer-interpretable” dataset are greater than those for the “human-interpretable” dataset. Together, these findings indicate that computer-interpretable content in emails (eg, headers) provides predictive value to the SVM model for classifying ham vs spam.
ggplot(resampling_metrics_df, aes(x = content_type, y = value, color = content_type)) +
geom_boxplot() +
coord_flip() +
facet_grid(~ metric, scales = "free_x") +
ylab("Value") + xlab("Content Type") +
theme(
strip.text = element_text(face = "bold"),
axis.title = element_text(face = "bold"),
legend.position = "none"
)
These analyses show that supervised machine learning (SML) algorithms perform spam vs ham classification well—the four methods I compared (decision trees, random forest, k-nearest neighbor, and support vector machine) were significantly better than a naive classifier, had accuracy >90%, and most had \(𝜅 > 0.8\). In general, the algorithms performed better for the “easy” emails than the “hard” emails. Overall, the SVM model performed best for both types, which suggests that it would have the best performance in the “real world”. Of note, the SVM performance was dependent on information from the entire email as shown by the reduced performance when email headers were excluded.
Additional improvements in classification performance may be possible by balancing the spam and ham emails, fine-tuning the hyperparameters of the SML algorithms, and using more advanced methods such as neural networks or large language models.
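As one concrete example of such tuning, caret could search a grid of cost values for the linear SVM on the “hard” dataset rather than fixing cost = 1; the grid below is illustrative only (a sketch, reusing the objects defined above).
# Illustrative sketch: compare several cost values for the linear SVM
svm_tuned_hard <- train(x = training_set_hard, y = training_labels_hard,
                        method = "svmLinear3", trControl = resampling_method,
                        tuneGrid = expand.grid(cost = c(0.25, 0.5, 1, 2, 4), Loss = 2))
print(svm_tuned_hard)  # reports Accuracy and Kappa for each cost and selects the best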