I used the tm package to create document-term matrices
and the caret package to perform supervised machine
learning (SML) for text classification. If needed, you can install them
using the commands below.
install.packages("caret")
install.packages("tm")
Document classification is the process of assigning documents to one or more classes or categories based on shared characteristics (eg, subjects) of their content. Here, I compare the performance of four SML algorithms (decision trees, random forest, k-nearest neighbor, and support vector machines) to classify emails as spam or ham (ie, not spam). I also evaluate whether computer-interpretable content in emails (eg, headers) provides predictive value to spam vs ham classification.
I downloaded “ham” and spam data files from https://spamassassin.apache.org/old/publiccorpus/ and decompressed them on the command line.
I used both “easy” and “hard” ham emails because I was curious to see the difference in classification performance.
# Easy ham
bzip2 -d 20030228_easy_ham.tar.bz2
tar -xvf 20030228_easy_ham.tar
# Hard ham
bzip2 -d 20030228_hard_ham.tar.bz2
tar -xvf 20030228_hard_ham.tar
# Spam
bzip2 -d 20050311_spam_2.tar.bz2
tar -xvf 20050311_spam_2.tar
I thought it would be a little cumbersome to load thousands of individual emails to my GitHub repository, so I read them into tibbles and then saved and uploaded the R data files.
Create spam and ham tibbles.
# Note filepath is specific to my computer
working_dir <- getwd()
filepath <- paste(working_dir, "/Data/easy_ham", sep = "")
easy_ham_files <- readtext(filepath, encoding='UTF-8')
filepath <- paste(working_dir, "/Data/hard_ham", sep = "")
hard_ham_files <- readtext(filepath, encoding='UTF-8')
filepath <- paste(working_dir, "/Data/spam_2", sep = "")
spam_files <- readtext(filepath, encoding='UTF-8')
Save the tibbles as R data files.
saveRDS(easy_ham_files, "easy_ham_files.rds")
saveRDS(hard_ham_files, "hard_ham_files.rds")
saveRDS(spam_files, "spam_files.rds")
These data files can be read from my GitHub repository to recreate the spam and ham tibbles.
# Method from https://forum.posit.co/t/how-to-read-rds-files-hosted-at-github-repository/128561
easy_ham_emails <- readRDS(gzcon(url("https://github.com/alexandersimon1/Data607/raw/main/Project4/easy_ham_files.rds")))
hard_ham_emails <- readRDS(gzcon(url("https://github.com/alexandersimon1/Data607/raw/main/Project4/hard_ham_files.rds")))
spam_emails <- readRDS(gzcon(url("https://github.com/alexandersimon1/Data607/raw/main/Project4/spam_files.rds")))
First, I labeled the ham and spam emails as “ham” and “spam” factors, respectively, and then combined the ham and spam emails into a single tibble for each dataset (“easy” and “hard”). I also omitted the doc_id column since it isn’t needed.
# This function labels all emails in a tibble with the specified label (string),
# and then returns the updated tibble of emails
label_emails <- function (emails_df, label_text) {
emails_df <- emails_df %>%
select(text) %>%
mutate(
label = as.factor(label_text)
)
return(emails_df)
}
easy_ham_emails <- label_emails(easy_ham_emails, "ham")
hard_ham_emails <- label_emails(hard_ham_emails, "ham")
spam_emails <- label_emails(spam_emails, "spam")
all_emails_easy <- rbind(easy_ham_emails, spam_emails)
all_emails_hard <- rbind(hard_ham_emails, spam_emails)
Both datasets are imbalanced. The “easy” dataset has a higher proportion of ham emails than spam emails.
table(all_emails_easy$label) %>% prop.table()
##
## ham spam
## 0.6416111 0.3583889
In contrast, the “hard” dataset has a much higher proportion of spam emails vs ham emails.
table(all_emails_hard$label) %>% prop.table()
##
## ham spam
## 0.1523058 0.8476942
Ideally, these datasets should be balanced because imbalance can bias the prediction model toward the more common class (ie, the model can achieve high accuracy simply by predicting the most common class).1 Methods to balance data include undersampling (randomly selecting samples from the overrepresented class) and oversampling (randomly duplicating samples from the underrepresented class), both of which are implemented in the caret package. However, these procedures are out of scope for this assignment.
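For completeness, below is a minimal sketch (not used in this analysis) of how caret’s balancing helpers could be applied to the “easy” dataset; downSample() and upSample() take the predictors and the class labels and return a class-balanced data frame.
# Sketch only (not run for this analysis): balance the classes with caret helpers
balanced_down <- downSample(x = all_emails_easy["text"], y = all_emails_easy$label,
                            yname = "label")   # undersample the majority class
balanced_up <- upSample(x = all_emails_easy["text"], y = all_emails_easy$label,
                        yname = "label")       # oversample the minority class
table(balanced_down$label)                     # equal counts of ham and spam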
Below, I construct models using the datasets as is (ie, unadjusted for balance). This starts with creating corpuses of the email text and document-term matrices. The code in sections 3.3 and 3.4 is adapted from a tutorial about the tm package.
Create corpuses of email text
easy_email_corpus <- VCorpus(VectorSource(all_emails_easy$text))
hard_email_corpus <- VCorpus(VectorSource(all_emails_hard$text))
As an example, the first email in the “easy” corpus looks like this:
# Print first 10 lines
writeLines(head(strwrap(easy_email_corpus[[1]]), 10))
## From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 2002
## Return-Path: <exmh-workers-admin@spamassassin.taint.org> Delivered-To:
## zzzz@localhost.netnoteinc.com Received: from localhost (localhost
## [127.0.0.1]) by phobos.labs.netnoteinc.com (Postfix) with ESMTP id
## D03E543C36 for <zzzz@localhost>; Thu, 22 Aug 2002 07:36:16 -0400 (EDT)
## Received: from phobos [127.0.0.1] by localhost with IMAP
## (fetchmail-5.9.0) for zzzz@localhost (single-drop); Thu, 22 Aug 2002
## 12:36:16 +0100 (IST) Received: from listman.spamassassin.taint.org
## (listman.spamassassin.taint.org [66.187.233.211]) by
## dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MBYrZ04811 for
Remove punctuation and numbers, remove English stop words, change case to lowercase, and strip extra white space.
# This function cleans up a specified corpus by removing punctuation, numbers, English stop
# words, and extra white space, and converts all text to lowercase. It returns the tidied corpus.
# Note that stop words are removed before lowercasing, so capitalized stop words (eg, "From") are kept.
tidy_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
easy_email_corpus <- tidy_corpus(easy_email_corpus)
hard_email_corpus <- tidy_corpus(hard_email_corpus)
After tidying, the first email in the “easy” corpus looks like this:
writeLines(head(strwrap(easy_email_corpus[[1]]), 10))
## from exmhworkersadminredhatcom thu aug returnpath
## exmhworkersadminspamassassintaintorg deliveredto
## zzzzlocalhostnetnoteinccom received localhost localhost
## phoboslabsnetnoteinccom postfix esmtp id dec zzzzlocalhost thu aug edt
## received phobos localhost imap fetchmail zzzzlocalhost singledrop thu
## aug ist received listmanspamassassintaintorg
## listmanspamassassintaintorg dogmaslashnullorg esmtp id gmbyrz
## zzzzexmhspamassassintaintorg thu aug received
## listmanspamassassintaintorg localhostlocaldomain listmanredhatcom
## postfix esmtp id thu aug edt deliveredto
Both the “easy” and “hard” email document-term matrices are extremely sparse (>99.5%).
easy_email_dtm <- DocumentTermMatrix(easy_email_corpus)
tm::inspect(easy_email_dtm)
## <<DocumentTermMatrix (documents: 3898, terms: 95865)>>
## Non-/sparse entries: 636174/373045596
## Sparsity : 100%
## Maximal term length: 868
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp from jmlocalhost localhost mon oct postfix received sep
## 2501 0 0 0 0 0 0 0 0 0 0
## 2552 0 2 4 0 0 0 0 2 2 0
## 2578 0 2 2 0 0 0 0 2 2 0
## 3404 0 4 9 0 1 8 0 0 20 0
## 3580 0 4 3 0 1 0 0 0 8 0
## 3591 0 1 2 2 3 0 0 1 7 0
## 3592 0 1 2 2 3 0 0 1 7 0
## 3898 0 0 0 0 0 0 0 0 0 0
## 670 0 5 2 2 4 0 0 3 7 8
## 677 0 4 3 2 3 0 0 3 7 9
hard_email_dtm <- DocumentTermMatrix(hard_email_corpus)
tm::inspect(hard_email_dtm)
## <<DocumentTermMatrix (documents: 1648, terms: 88952)>>
## Non-/sparse entries: 413696/146179200
## Sparsity : 100%
## Maximal term length: 868
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs border font height helvetica jul received size table width widthd
## 1154 0 0 0 0 8 20 1 1 0 0
## 1330 0 0 0 0 10 8 0 1 0 0
## 1341 0 8 0 0 5 7 2 0 0 0
## 1342 0 8 0 0 5 7 2 0 0 0
## 158 44 55 52 0 6 5 28 30 61 0
## 1648 0 0 0 0 0 0 0 0 0 0
## 198 0 5 0 29 0 9 0 36 0 303
## 302 9 9 4 0 0 2 24 24 13 0
## 328 9 9 4 0 4 2 24 22 13 0
## 39 0 0 0 0 4 3 0 0 0 0
To prevent R from crashing due to lack of memory when constructing the models (I learned this the hard way), I simplified the document-term matrices by removing terms with >95% sparsity (ie, terms that occur in <5% of emails).
The resulting, less sparse document-term matrices look like this:
easy_email_dtm <- removeSparseTerms(easy_email_dtm, 0.95)
tm::inspect(easy_email_dtm)
## <<DocumentTermMatrix (documents: 3898, terms: 438)>>
## Non-/sparse entries: 264484/1442840
## Sparsity : 85%
## Maximal term length: 50
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aug esmtp from jmlocalhost localhost mon oct postfix received sep
## 2552 0 2 4 0 0 0 0 2 2 0
## 2578 0 2 2 0 0 0 0 2 2 0
## 3404 0 4 9 0 1 8 0 0 20 0
## 3490 0 4 8 0 1 0 0 0 20 0
## 3580 0 4 3 0 1 0 0 0 8 0
## 3591 0 1 2 2 3 0 0 1 7 0
## 3592 0 1 2 2 3 0 0 1 7 0
## 3805 5 1 4 2 3 0 0 1 12 0
## 670 0 5 2 2 4 0 0 3 7 8
## 677 0 4 3 2 3 0 0 3 7 9
hard_email_dtm <- removeSparseTerms(hard_email_dtm, 0.95)
tm::inspect(hard_email_dtm)
## <<DocumentTermMatrix (documents: 1648, terms: 737)>>
## Non-/sparse entries: 161198/1053378
## Sparsity : 87%
## Maximal term length: 58
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs border font height helvetica jul received size table width widthd
## 1154 0 0 0 0 8 20 1 1 0 0
## 1240 0 0 0 0 8 20 1 1 0 0
## 126 6 59 13 0 4 2 28 4 15 0
## 1330 0 0 0 0 10 8 0 1 0 0
## 1342 0 8 0 0 5 7 2 0 0 0
## 158 44 55 52 0 6 5 28 30 61 0
## 18 34 37 65 59 3 2 59 46 112 0
## 302 9 9 4 0 0 2 24 24 13 0
## 33 63 19 136 4 3 2 19 39 228 0
## 93 34 35 65 59 3 2 58 46 112 0
I did a lot of background reading to understand how to create supervised machine learning models for text classification tasks such as spam/ham email classification. The code below is adapted from what I learned.2
I partitioned the labeled emails (“easy” and “hard”) into training and test sets using a 70-30 split.
trainIndex_easy <- createDataPartition(y = all_emails_easy$label, p = 0.7, list = FALSE)
trainIndex_hard <- createDataPartition(y = all_emails_hard$label, p = 0.7, list = FALSE)
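Note that createDataPartition() samples rows at random and no seed is set in the code above, so the exact split (and the downstream numbers) will vary between runs. A seed could be fixed beforehand to make the analysis reproducible, for example:
# Optional (not done above): fix the RNG seed before partitioning and training
# so the train/test split and the bootstrap resamples are reproducible
set.seed(607)  # the value is arbitrary; 607 is used here only for illustration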
Next, split up the document-term matrices using the corresponding partition.
# "easy" dataset
training_set_easy <- easy_email_dtm[trainIndex_easy, ] %>% as.matrix() %>% as.data.frame()
test_set_easy <- easy_email_dtm[-trainIndex_easy, ] %>% as.matrix() %>% as.data.frame()
# "hard" dataset
training_set_hard <- hard_email_dtm[trainIndex_hard, ] %>% as.matrix() %>% as.data.frame()
test_set_hard <- hard_email_dtm[-trainIndex_hard, ] %>% as.matrix() %>% as.data.frame()
Similarly, split up the email labels using the corresponding partition.
# "easy" dataset
training_labels_easy <- all_emails_easy$label[trainIndex_easy]
test_labels_easy <- all_emails_easy$label[-trainIndex_easy]
# "hard" dataset
training_labels_hard <- all_emails_hard$label[trainIndex_hard]
test_labels_hard <- all_emails_hard$label[-trainIndex_hard]
I used a bootstrap resampling method.
resampling_method <- trainControl(method = "boot")
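With method = "boot", trainControl() uses caret’s default of 25 bootstrap resamples, which matches the “Bootstrapped (25 reps)” lines in the output below. If desired, the number of resamples or the resampling scheme could be changed, for example:
# Alternatives (not used here): more bootstrap replicates, or k-fold cross-validation
resampling_method_boot50 <- trainControl(method = "boot", number = 50)
resampling_method_cv10 <- trainControl(method = "cv", number = 10)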
First, train the model for the “easy” dataset and the model for the “hard” dataset.
dt_model_easy <- caret::train(x = training_set_easy, y = training_labels_easy, method = "rpart",
trControl = resampling_method)
print(dt_model_easy)
## CART
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.1124744 0.9440208 0.8805996
## 0.2689162 0.8703450 0.7255566
## 0.5245399 0.7592002 0.3960552
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.1124744.
dt_model_hard <- caret::train(x = training_set_hard, y = training_labels_hard, method = "rpart",
trControl = resampling_method)
print(dt_model_hard)
## CART
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.06818182 0.9310748 0.7085640
## 0.09659091 0.9254577 0.6734704
## 0.50568182 0.8744114 0.2711564
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.06818182.
Then evaluate the models on the corresponding test dataset.
The summary statistics below show that the accuracy (95% CI) of the decision tree model with the “easy” dataset is 92.3% (90.62% - 93.76%). Two indicators show that the model performs better than would be expected by chance alone.
First, the accuracy is greater than the “no information rate” (NIR; 64.16%) and the p-value of accuracy vs NIR is much less than 0.05, so the decision tree model performs significantly better than a naive classifier that assigns everything to the most common class.
Second, the kappa value is high (\(𝜅 > 0.8\)). Kappa measures the agreement between two raters that classify items into mutually exclusive categories, beyond the agreement expected by chance: \(𝜅 = 0\) means no agreement beyond chance and \(𝜅 = 1\) means complete agreement. \(𝜅 > 0.8\) indicates that the agreement between the predicted and reference classes is strong and unlikely to be due to chance.
dt_predict_easy <- predict(dt_model_easy, newdata = test_set_easy)
dt_easy_confusion_matrix <- caret::confusionMatrix(dt_predict_easy, test_labels_easy,
mode = "prec_recall")
dt_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 668 8
## spam 82 411
##
## Accuracy : 0.923
## 95% CI : (0.9062, 0.9376)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8389
##
## Mcnemar's Test P-Value : 1.416e-14
##
## Precision : 0.9882
## Recall : 0.8907
## F1 : 0.9369
## Prevalence : 0.6416
## Detection Rate : 0.5714
## Detection Prevalence : 0.5783
## Balanced Accuracy : 0.9358
##
## 'Positive' Class : ham
##
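As a check on these metrics, the reported \(𝜅 = 0.8389\) can be reproduced by hand from the confusion matrix counts above (668, 8, 82, 411) using Cohen’s formula \(𝜅 = (p_o - p_e) / (1 - p_e)\), where \(p_o\) is the observed accuracy and \(p_e\) is the agreement expected by chance from the row and column totals. A minimal sketch:
# Recompute Cohen's kappa from the confusion matrix counts shown above
cm <- matrix(c(668, 82, 8, 411), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference = c("ham", "spam")))
n <- sum(cm)
p_o <- sum(diag(cm)) / n                      # observed accuracy, ~0.923
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement, ~0.522
(p_o - p_e) / (1 - p_e)                       # ~0.839, matching the reported kappa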
The confusion matrix can also be visualized by a “four-fold plot”. This plot shows that the decision tree model does fairly well predicting true positives (upper left quadrant) and true negatives (lower right quadrant) in the “easy” dataset, which agrees with the 92.3% accuracy ((668 + 411)/(668 + 8 + 82 + 411)). Among the incorrect predictions, there are more false negatives (lower left quadrant) than false positives (upper right quadrant).
# Reference: https://www.geeksforgeeks.org/visualize-confusion-matrix-using-caret-package-in-r/
fourfoldplot(as.table(dt_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
For the “hard” dataset, the accuracy (95% CI) of the decision tree model is 92.51% (89.82%-94.67%), which is about the same as the accuracy for the “easy” dataset. The accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the decision tree model for the hard dataset is also significantly better than a naive classifier.
However, the difference between the accuracy and NIR (0.9251 - 0.8482 = 0.0769) is less than that for the easy dataset (0.9230 - 0.6416 = 0.2814), which suggests that the decision tree model is a better model for the easy dataset than the hard dataset. This is supported by the lower kappa value (\(0.6 \leq 𝜅 \leq 0.8\)), which indicates only moderate agreement between the prediction and reference categories.
dt_predict_hard <- predict(dt_model_hard, newdata = test_set_hard)
dt_hard_confusion_matrix <- caret::confusionMatrix(dt_predict_hard, test_labels_hard,
mode = "prec_recall")
dt_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 50 12
## spam 25 407
##
## Accuracy : 0.9251
## 95% CI : (0.8982, 0.9467)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 1.62e-07
##
## Kappa : 0.6869
##
## Mcnemar's Test P-Value : 0.04852
##
## Precision : 0.8065
## Recall : 0.6667
## F1 : 0.7299
## Prevalence : 0.1518
## Detection Rate : 0.1012
## Detection Prevalence : 0.1255
## Balanced Accuracy : 0.8190
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset looks like this:
fourfoldplot(as.table(dt_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
First, train the models for the easy and hard datasets.
rf_model_easy <- train(x = training_set_easy, y = training_labels_easy, method = "ranger",
                       trControl = resampling_method,
                       # hyperparameters
                       tuneGrid = data.frame(
                         # number of predictors randomly sampled as split candidates at each node
                         mtry = floor(sqrt(dim(training_set_easy)[2])),
                         # rule used to choose how each node is split
                         splitrule = "extratrees",
                         # minimum node size; nodes keep splitting until they reach this size
                         min.node.size = 5))
print(rf_model_easy)
## Random Forest
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9939238 0.9867075
##
## Tuning parameter 'mtry' was held constant at a value of 20
## Tuning
## parameter 'splitrule' was held constant at a value of extratrees
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
rf_model_hard <- train(x = training_set_hard, y = training_labels_hard, method = "ranger",
                       trControl = resampling_method,
                       # hyperparameters
                       tuneGrid = data.frame(
                         # number of predictors randomly sampled as split candidates at each node
                         mtry = floor(sqrt(dim(training_set_hard)[2])),
                         # rule used to choose how each node is split
                         splitrule = "extratrees",
                         # minimum node size; nodes keep splitting until they reach this size
                         min.node.size = 5))
print(rf_model_hard)
## Random Forest
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9565281 0.809856
##
## Tuning parameter 'mtry' was held constant at a value of 27
## Tuning
## parameter 'splitrule' was held constant at a value of extratrees
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
Then evaluate the models on the corresponding test dataset.
The summary statistics below show that the accuracy (95% CI) of the random forest model for the “easy” dataset is 99.49% (98.89%-99.81%). This accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the RF model performs significantly better than a naive classifier. In addition, 𝜅 is very close to 1, which means that the agreement between the prediction and reference categories is strong and highly unlikely to be due to chance.
rf_predict_easy <- predict(rf_model_easy, newdata = test_set_easy)
rf_easy_confusion_matrix <- caret::confusionMatrix(rf_predict_easy, test_labels_easy,
mode = "prec_recall")
rf_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 749 5
## spam 1 414
##
## Accuracy : 0.9949
## 95% CI : (0.9889, 0.9981)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9888
##
## Mcnemar's Test P-Value : 0.2207
##
## Precision : 0.9934
## Recall : 0.9987
## F1 : 0.9960
## Prevalence : 0.6416
## Detection Rate : 0.6407
## Detection Prevalence : 0.6450
## Balanced Accuracy : 0.9934
##
## 'Positive' Class : ham
##
The four-fold plot shows that the random forest model for the “easy” dataset does extremely well predicting true positives (upper left quadrant) and true negatives (lower right quadrant), which agrees with the 99.49% accuracy ((749 + 414)/(749 + 5 + 1 + 414)). There are far fewer false negatives (lower left quadrant) and false positives (upper right quadrant) than with the decision tree model, which accounts for the higher accuracy of the RF model.
fourfoldplot(as.table(rf_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The accuracy (95% CI) of the random forest model for the “hard” dataset is slightly lower, 95.75% (93.58%-97.35%). Although the accuracy is lower than for the “easy” dataset, the RF model still classifies the imbalanced “hard” dataset better than the decision tree model did (higher accuracy and 𝜅), which suggests that the RF model is less susceptible to the imbalanced data than the decision tree model.
Similar to the analysis of the decision tree models, the difference between the accuracy and NIR (0.9575 - 0.8482 = 0.1093) with the RF model for the hard dataset is less than that for the easy dataset (0.9949 - 0.6416 = 0.3533), which suggests that the RF model is a better model for the easy dataset than the hard dataset. This is supported by the lower kappa value (\(𝜅 = 0.8135\), vs 0.9888 for the easy dataset).
rf_predict_hard <- predict(rf_model_hard, newdata = test_set_hard)
rf_hard_confusion_matrix <- caret::confusionMatrix(rf_predict_hard, test_labels_hard,
mode = "prec_recall")
rf_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 54 0
## spam 21 419
##
## Accuracy : 0.9575
## 95% CI : (0.9358, 0.9735)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 5.975e-15
##
## Kappa : 0.8135
##
## Mcnemar's Test P-Value : 1.275e-05
##
## Precision : 1.0000
## Recall : 0.7200
## F1 : 0.8372
## Prevalence : 0.1518
## Detection Rate : 0.1093
## Detection Prevalence : 0.1093
## Balanced Accuracy : 0.8600
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset is shown below. Surprisingly, there were no false positives (upper right quadrant).
fourfoldplot(as.table(rf_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
First, train the models for the “easy” and “hard” datasets.
knn_model_easy <- train(x = training_set_easy, y = training_labels_easy, method = "knn",
trControl = resampling_method,
# hyperparameter
tuneGrid = data.frame(k = 2))
print(knn_model_easy)
## k-Nearest Neighbors
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9824494 0.9614311
##
## Tuning parameter 'k' was held constant at a value of 2
knn_model_hard <- train(x = training_set_hard, y = training_labels_hard, method = "knn",
trControl = resampling_method,
# hyperparameter
tuneGrid = data.frame(k = 2))
print(knn_model_hard)
## k-Nearest Neighbors
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9378703 0.7256909
##
## Tuning parameter 'k' was held constant at a value of 2
Then evaluate the models on the corresponding test set.
The summary statistics below show that the accuracy (95% CI) of the k-nearest neighbor model for the “easy” dataset is 98.55% (97.68%-99.15%). This accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the kNN model performs significantly better than a naive classifier. In addition, 𝜅 is very close to 1, which means that the agreement between the prediction and reference categories is strong and unlikely to be due to chance.
knn_predict_easy <- predict(knn_model_easy, newdata = test_set_easy)
knn_easy_confusion_matrix <- caret::confusionMatrix(knn_predict_easy, test_labels_easy,
mode = "prec_recall")
knn_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 748 15
## spam 2 404
##
## Accuracy : 0.9855
## 95% CI : (0.9768, 0.9915)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9682
##
## Mcnemar's Test P-Value : 0.003609
##
## Precision : 0.9803
## Recall : 0.9973
## F1 : 0.9888
## Prevalence : 0.6416
## Detection Rate : 0.6399
## Detection Prevalence : 0.6527
## Balanced Accuracy : 0.9808
##
## 'Positive' Class : ham
##
The four-fold plot of the kNN model confusion matrix for the “easy” dataset is similar to that of the RF model. The kNN model does very well predicting true positives (upper left quadrant) and true negatives (lower right quadrant), which agrees with the 98.55% accuracy ((748 + 404)/(748 + 15 + 2 + 404)). There were relatively few false negatives (lower left quadrant).
fourfoldplot(as.table(knn_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The accuracy (95% CI) of the kNN model for the “hard” dataset is a little lower, 94.13% (91.68%-96.03%). As with the RF model, the accuracy drops for the “hard” dataset, but the kNN model still classifies the imbalanced data better than the decision tree model (higher accuracy and 𝜅), suggesting that it too is less susceptible to the imbalance.
The difference between the accuracy and NIR (0.9413 - 0.8482 = 0.0931) with the kNN model for the hard dataset is less than that for the easy dataset (0.9855 - 0.6416 = 0.3439). This suggests that, like the RF model, the kNN model is a better model for the easy dataset than the hard dataset. This is supported by \(0.6 \leq 𝜅 \leq 0.8\), which indicates only moderate agreement between the prediction and reference categories.
knn_predict_hard <- predict(knn_model_hard, newdata = test_set_hard)
knn_hard_confusion_matrix <- caret::confusionMatrix(knn_predict_hard, test_labels_hard,
mode = "prec_recall")
knn_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 51 5
## spam 24 414
##
## Accuracy : 0.9413
## 95% CI : (0.9168, 0.9603)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 9.863e-11
##
## Kappa : 0.7456
##
## Mcnemar's Test P-Value : 0.0008302
##
## Precision : 0.9107
## Recall : 0.6800
## F1 : 0.7786
## Prevalence : 0.1518
## Detection Rate : 0.1032
## Detection Prevalence : 0.1134
## Balanced Accuracy : 0.8340
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset looks like this:
fourfoldplot(as.table(knn_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
First, train the models for the “easy” and “hard” datasets.
svm_model_easy <- train(x = training_set_easy, y = training_labels_easy, method = "svmLinear3",
                        trControl = resampling_method,
                        # hyperparameters
                        tuneGrid = data.frame(
                          # cost of constraint violation (how heavily misclassifications are penalized)
                          cost = 1,
                          # loss function used to penalize misclassifications
                          Loss = 2))
print(svm_model_easy)
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 2729 samples
## 438 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2729, 2729, 2729, 2729, 2729, 2729, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9942328 0.9874286
##
## Tuning parameter 'cost' was held constant at a value of 1
## Tuning
## parameter 'Loss' was held constant at a value of 2
svm_model_hard <- train(x = training_set_hard, y = training_labels_hard, method = "svmLinear3",
                        trControl = resampling_method,
                        # hyperparameters
                        tuneGrid = data.frame(
                          # cost of constraint violation (how heavily misclassifications are penalized)
                          cost = 1,
                          # loss function used to penalize misclassifications
                          Loss = 2))
print(svm_model_hard)
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 1154 samples
## 737 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1154, 1154, 1154, 1154, 1154, 1154, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9657199 0.8648259
##
## Tuning parameter 'cost' was held constant at a value of 1
## Tuning
## parameter 'Loss' was held constant at a value of 2
Then evaluate the models on the corresponding test set.
The summary statistics below show that the accuracy (95% CI) of the SVM model for the “easy” dataset is 99.49% (98.89%-99.81%). This accuracy is greater than the NIR and the p-value of accuracy vs NIR is much less than 0.05, so the SVM model performs significantly better than a naive classifier. In addition, 𝜅 is very close to 1, which means that the agreement between the prediction and reference categories is strong and highly unlikely to be due to chance.
svm_predict_easy <- predict(svm_model_easy, newdata = test_set_easy)
svm_easy_confusion_matrix <- caret::confusionMatrix(svm_predict_easy, test_labels_easy,
mode = "prec_recall")
svm_easy_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 748 4
## spam 2 415
##
## Accuracy : 0.9949
## 95% CI : (0.9889, 0.9981)
## No Information Rate : 0.6416
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9888
##
## Mcnemar's Test P-Value : 0.6831
##
## Precision : 0.9947
## Recall : 0.9973
## F1 : 0.9960
## Prevalence : 0.6416
## Detection Rate : 0.6399
## Detection Prevalence : 0.6433
## Balanced Accuracy : 0.9939
##
## 'Positive' Class : ham
##
The four-fold plot looks very similar to the four-fold plot of the random forest confusion matrix. Like the RF model, the SVM model predicts true positives (upper left quadrant) and true negatives (lower right quadrant) extremely well, with few false negatives (lower left quadrant) or false positives (upper right quadrant).
fourfoldplot(as.table(svm_easy_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The accuracy (95% CI) of the SVM model for the “hard” dataset is a little lower, 95.95% (93.82%-97.51%), but is still very good. The difference in accuracy between the easy and hard datasets was smaller for the SVM model (0.9949 - 0.9595 = 0.0354) than for the kNN model (0.9855 - 0.9413 = 0.0442), which suggests that the SVM model is less susceptible to imbalanced data.
The difference between the accuracy and NIR (0.9595 - 0.8482 = 0.1113) with the SVM model for the hard dataset is less than that for the easy dataset (0.9949 - 0.6416 = 0.3533). This suggests that, like the RF and kNN models, the SVM model is a better model for the easy dataset than the hard dataset. Nevertheless, \(𝜅>0.8\), which indicates strong agreement between the prediction and reference categories.
svm_predict_hard <- predict(svm_model_hard, newdata = test_set_hard)
svm_hard_confusion_matrix <- caret::confusionMatrix(svm_predict_hard, test_labels_hard,
mode = "prec_recall")
svm_hard_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 59 4
## spam 16 415
##
## Accuracy : 0.9595
## 95% CI : (0.9382, 0.9751)
## No Information Rate : 0.8482
## P-Value [Acc > NIR] : 1.456e-15
##
## Kappa : 0.8317
##
## Mcnemar's Test P-Value : 0.01391
##
## Precision : 0.9365
## Recall : 0.7867
## F1 : 0.8551
## Prevalence : 0.1518
## Detection Rate : 0.1194
## Detection Prevalence : 0.1275
## Balanced Accuracy : 0.8886
##
## 'Positive' Class : ham
##
The four-fold plot of the confusion matrix for the hard dataset looks like this:
fourfoldplot(as.table(svm_hard_confusion_matrix), color = c("#00BFC4", "#F8766D"))
The performance of the four SML models can be compared using resampling to estimate the distribution of the performance metrics (eg, accuracy).
models <- list(DT_easy = dt_model_easy, DT_hard = dt_model_hard,
RF_easy = rf_model_easy, RF_hard = rf_model_hard,
KNN_easy = knn_model_easy, KNN_hard = knn_model_hard,
SVM_easy = svm_model_easy, SVM_hard = svm_model_hard)
resampling <- resamples(models)
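caret also provides built-in summaries of a resamples object; summary() tabulates the distribution of each metric per model, and bwplot() (a lattice plot supplied by caret) gives a quick boxplot view without any reshaping. These are optional checks alongside the custom plot below:
# Optional built-in views of the resampling results
summary(resampling)                      # min/quartiles/mean/max of Accuracy and Kappa per model
bwplot(resampling, metric = "Accuracy")  # lattice boxplots of accuracy by model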
The accuracy and kappa values of the resamples look like this:
resampling_metrics_df <- resampling$values
resampling_metrics_df
## Resample DT_easy~Accuracy DT_easy~Kappa DT_hard~Accuracy DT_hard~Kappa
## 1 Resample01 0.9393638 0.8700066 0.9186047 0.6635669
## 2 Resample02 0.9267068 0.8454626 0.9124088 0.6443953
## 3 Resample03 0.9159919 0.8252140 0.9370460 0.7570478
## 4 Resample04 0.9302789 0.8544301 0.9322430 0.6710833
## 5 Resample05 0.9396637 0.8620530 0.9302885 0.6691531
## 6 Resample06 0.9772952 0.9495633 0.9448441 0.7653922
## 7 Resample07 0.9230019 0.8383143 0.9097561 0.6016491
## 8 Resample08 0.9221154 0.8352796 0.9317647 0.7421170
## 9 Resample09 0.9749750 0.9452880 0.9186603 0.6470647
## 10 Resample10 0.9309309 0.8540219 0.9170616 0.6427362
## 11 Resample11 0.9272727 0.8461465 0.9275701 0.7144947
## 12 Resample12 0.9722772 0.9396536 0.9390244 0.7703339
## 13 Resample13 0.9687185 0.9324241 0.9367397 0.7001515
## 14 Resample14 0.9422492 0.8763297 0.9218009 0.6559613
## 15 Resample15 0.9629630 0.9201858 0.9458239 0.7547179
## 16 Resample16 0.9279279 0.8505386 0.9287411 0.6617749
## 17 Resample17 0.9768145 0.9506194 0.9252747 0.6992963
## 18 Resample18 0.9274510 0.8467722 0.9265403 0.7316072
## 19 Resample19 0.9351944 0.8607424 0.9375000 0.7363495
## 20 Resample20 0.9763547 0.9489092 0.9413146 0.7384192
## 21 Resample21 0.9322362 0.8546842 0.9220183 0.6856658
## 22 Resample22 0.9412955 0.8731427 0.9367946 0.7310611
## 23 Resample23 0.9276986 0.8476628 0.9553571 0.7930812
## 24 Resample24 0.9343284 0.8582535 0.9465116 0.7623168
## 25 Resample25 0.9674134 0.9292908 0.9331797 0.7746625
## RF_easy~Accuracy RF_easy~Kappa RF_hard~Accuracy RF_hard~Kappa
## 1 0.9891304 0.9764733 0.9690476 0.8486528
## 2 0.9894535 0.9770332 0.9565217 0.8056237
## 3 0.9910090 0.9798259 0.9656751 0.8584203
## 4 0.9949239 0.9889273 0.9740566 0.8925346
## 5 0.9949444 0.9889662 0.9370629 0.7384324
## 6 0.9939516 0.9867034 0.9694836 0.8635221
## 7 0.9950249 0.9892634 0.9495413 0.7850002
## 8 0.9949187 0.9887886 0.9683973 0.8634222
## 9 0.9940653 0.9871824 0.9479905 0.7841536
## 10 0.9941176 0.9870042 0.9467593 0.7826582
## 11 0.9941176 0.9872009 0.9425287 0.7735083
## 12 0.9960396 0.9912903 0.9619048 0.8155366
## 13 0.9920080 0.9826067 0.9490291 0.7553998
## 14 0.9970238 0.9934478 0.9561201 0.7978177
## 15 0.9939940 0.9868234 0.9481481 0.7749464
## 16 0.9950593 0.9892145 0.9602978 0.8393942
## 17 0.9960474 0.9915271 0.9649533 0.8376328
## 18 0.9951877 0.9895982 0.9559165 0.8247705
## 19 0.9940000 0.9867605 0.9491525 0.7367111
## 20 0.9970356 0.9935287 0.9520548 0.8095416
## 21 0.9911591 0.9806001 0.9331683 0.7065533
## 22 0.9919598 0.9822527 0.9673660 0.8541242
## 23 0.9959839 0.9911783 0.9708029 0.8636514
## 24 0.9920949 0.9826453 0.9604938 0.8114635
## 25 0.9948454 0.9888457 0.9567308 0.8229285
## KNN_easy~Accuracy KNN_easy~Kappa KNN_hard~Accuracy KNN_hard~Kappa
## 1 0.9817629 0.9608773 0.9376499 0.7445575
## 2 0.9744094 0.9429536 0.9462103 0.7942091
## 3 0.9829146 0.9619964 0.9243499 0.6771300
## 4 0.9790210 0.9544808 0.9439024 0.7456164
## 5 0.9814815 0.9588937 0.9230769 0.6739812
## 6 0.9843902 0.9661259 0.9498807 0.7315823
## 7 0.9873909 0.9723141 0.9363208 0.7213514
## 8 0.9807497 0.9572898 0.9437939 0.7611300
## 9 0.9798793 0.9566070 0.9304556 0.6764934
## 10 0.9781312 0.9516811 0.9349776 0.7277855
## 11 0.9869215 0.9715373 0.9287356 0.7260872
## 12 0.9774127 0.9512515 0.9391101 0.6844230
## 13 0.9868288 0.9710154 0.9417476 0.7488315
## 14 0.9901768 0.9783177 0.9279070 0.6760632
## 15 0.9766971 0.9466981 0.9463869 0.7660573
## 16 0.9820896 0.9598236 0.9318735 0.7157396
## 17 0.9809619 0.9580682 0.9260143 0.6693986
## 18 0.9782823 0.9520521 0.9553991 0.7796352
## 19 0.9843597 0.9655803 0.9303944 0.6730225
## 20 0.9834146 0.9640288 0.9304556 0.7033193
## 21 0.9768844 0.9502196 0.9338061 0.6891828
## 22 0.9857868 0.9690953 0.9447005 0.7775215
## 23 0.9894737 0.9767400 0.9463869 0.7337058
## 24 0.9838872 0.9651451 0.9356322 0.7219432
## 25 0.9879276 0.9729850 0.9575893 0.8235050
## SVM_easy~Accuracy SVM_easy~Kappa SVM_hard~Accuracy SVM_hard~Kappa
## 1 0.9980020 0.9956068 0.9781553 0.9128268
## 2 0.9960239 0.9914206 0.9727273 0.8790932
## 3 0.9890220 0.9753930 0.9559902 0.8166193
## 4 0.9930279 0.9851578 0.9600939 0.8447189
## 5 0.9960591 0.9914017 0.9666667 0.8675437
## 6 0.9939577 0.9871185 0.9447115 0.8013619
## 7 0.9969819 0.9933633 0.9734300 0.9014755
## 8 0.9884837 0.9749166 0.9708029 0.8829560
## 9 0.9941003 0.9871326 0.9602804 0.8505341
## 10 0.9930830 0.9852445 0.9683258 0.8825584
## 11 0.9970000 0.9933575 0.9623529 0.8599959
## 12 0.9920239 0.9826892 0.9562212 0.8181217
## 13 0.9918864 0.9817544 0.9485981 0.7896994
## 14 0.9939271 0.9867437 0.9774266 0.9109941
## 15 0.9929577 0.9846543 0.9786730 0.9242430
## 16 0.9940000 0.9868986 0.9600000 0.8465997
## 17 0.9950932 0.9892416 0.9744780 0.9056949
## 18 0.9960317 0.9913473 0.9671362 0.8745002
## 19 0.9939516 0.9868400 0.9639423 0.8464869
## 20 0.9951076 0.9895658 0.9760766 0.9053142
## 21 0.9960435 0.9911243 0.9533170 0.7653609
## 22 0.9969325 0.9933045 0.9610706 0.8573474
## 23 0.9929648 0.9846431 0.9668246 0.8726998
## 24 0.9951172 0.9895560 0.9688995 0.8837647
## 25 0.9940417 0.9872393 0.9767981 0.9201379
To plot these data, I first reshaped them into long format and extracted the performance metric, model, and dataset difficulty (easy/hard) for each resample.
resampling_metrics_df <- resampling_metrics_df %>%
melt() %>%
rowwise() %>%
mutate(
metric = if_else(str_detect(variable, "Accuracy", negate = FALSE), "Accuracy", "Kappa"),
model = str_extract(variable, ".*(?=_)"),
# remove "~Accuracy" and "~Kappa" from variable names
variable = str_replace(variable, "~.*", ""),
difficulty = str_extract(variable, "(?<=_).*")
)
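An equivalent reshape with tidyr (loaded with the tidyverse) is sketched below; it assumes the same “model_difficulty~metric” naming convention for the resample columns and avoids the regular expressions:
# Alternative reshape (sketch): pivot to long format and split the column names
resampling_metrics_long <- resampling$values %>%
  pivot_longer(-Resample, names_to = "variable", values_to = "value") %>%
  separate(variable, into = c("model_difficulty", "metric"), sep = "~") %>%
  separate(model_difficulty, into = c("model", "difficulty"), sep = "_")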
The boxplots below show that, for the “easy” dataset, the SVM and RF models performed the best in terms of accuracy and 𝜅 value. For the “hard” dataset, the SVM model’s advantage over the RF model was more pronounced. Since real-world spam emails are likely to be “hard”, these results suggest that the SVM model is the best classifier of ham vs spam emails.
ggplot(resampling_metrics_df, aes(x = model, y = value, color = model)) +
geom_boxplot() +
coord_flip() +
facet_grid(difficulty ~ metric, scales = "free_x") +
ylab("value") + xlab("model") +
theme(
strip.text = element_text(face = "bold"),
axis.title = element_text(face = "bold"),
legend.position = "none"
)
In this section, I only focus on the best-performing model (SVM) and the “hard” dataset.
I took a slightly broader approach than comparing emails with or without headers because the subject line is meaningful to email recipients and usually (unless it’s missing) gives a clue about the body of an email. So I call the subject + body “human-interpretable content”, in contrast to the entire email, which is “computer-interpretable content”.
Removing the email headers with regular expressions proved to be challenging (dead ends not shown), so I made a simplifying assumption that email headers are separated from the main body by a blank line. After dividing these two parts, I extracted the subject line from the header part and stripped HTML markup from the body part. Finally, I concatenated the subject and cleaned the message body to form the human-interpretable content.
human_content <- all_emails_hard %>%
rowwise() %>%
mutate(
# Add a dummy newline character to end of email text
# The purpose of this is to enable extraction of the subject line of blank emails
text = str_c(text, "\n", sep = ""),
# Capture subject line
subject = str_extract(text, "Subject: (.*)\n", group = 1),
# Email headers are separated from the body by a blank line, so the "body" is everything after
body = str_sub(text, str_locate(text, "[\n]{2,}")[2] + 1, str_length(text)),
# Remove URLs
body = str_replace_all(body, "http.*(\\n|\")", ""),
# Remove HTML tags
body = str_replace_all(body, "<[^>]*>", ""),
# Remove special characters, eg = non-breaking whitespace
body = str_replace_all(body, "&#?[\\w|\\d]+;", ""),
# Remove excess whitespace
body = str_replace_all(body, "[\\s]+", " "),
# Concatenate email subject and body
body = if_else(is.na(body) | body == " ",
subject, # if no body, use subject as body
str_c(subject, body, sep = " ")) # otherwise concatenate subject and body
) %>%
select(body, label)
A small fraction of emails did not contain human-interpretable content (as defined above). Because these messages are not useful for classification, I omitted them.
n_emails <- nrow(human_content)
n_no_content <- sum(is.na(human_content$body))
sprintf("%s of %s emails (%.2f%%) do not have human-interpretable content", n_no_content, n_emails, 100 * n_no_content / n_emails)
## [1] "11 of 1648 emails (0.67%) do not have human-interpretable content"
human_content <- human_content %>%
drop_na(body)
human_corpus <- VCorpus(VectorSource(human_content$body)) %>%
tidy_corpus()
human_dtm <- DocumentTermMatrix(human_corpus) %>%
removeSparseTerms(., 0.95)
tm::inspect(human_dtm)
## <<DocumentTermMatrix (documents: 1637, terms: 458)>>
## Non-/sparse entries: 81153/668593
## Sparsity : 89%
## Maximal term length: 40
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs business can email free get please the this will you
## 1147 24 14 25 10 20 5 33 28 39 31
## 1223 49 14 1 36 25 0 19 12 23 16
## 1231 20 12 26 10 20 6 33 28 40 31
## 1297 49 14 2 36 25 0 20 13 23 17
## 1320 4 9 1 6 1 2 96 10 29 1
## 1324 49 14 2 36 25 0 20 13 23 17
## 1545 6 9 12 6 8 5 26 15 24 16
## 158 8 19 1 3 13 1 23 7 7 10
## 300 10 40 14 28 22 6 12 7 76 26
## 326 16 10 8 16 20 4 10 5 18 24
As before, I partitioned the emails into training and test sets using a 70-30 split.
trainIndex_human <- createDataPartition(y = human_content$label, p = 0.7, list = FALSE)
training_set_human <- human_dtm[trainIndex_human, ] %>% as.matrix() %>% as.data.frame()
test_set_human <- human_dtm[-trainIndex_human, ] %>% as.matrix() %>% as.data.frame()
training_labels_human <- human_content$label[trainIndex_human]
test_labels_human <- human_content$label[-trainIndex_human]
Train the model
svm_model_human <- train(x = training_set_human, y = training_labels_human, method = "svmLinear3",
trControl = resampling_method, tuneGrid = data.frame(cost = 1, Loss = 2))
print(svm_model_human)
## L2 Regularized Support Vector Machine (dual) with Linear Kernel
##
## 1147 samples
## 458 predictor
## 2 classes: 'ham', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1147, 1147, 1147, 1147, 1147, 1147, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9430329 0.7699226
##
## Tuning parameter 'cost' was held constant at a value of 1
## Tuning
## parameter 'Loss' was held constant at a value of 2
Evaluate the model on the test set
The accuracy (95% CI) of the SVM model for the “human-interpretable” dataset is 93.67% (91.14%-95.56%) and \(𝜅 = 0.7309\), which indicates moderate agreement between prediction and reference categories.
svm_predict_human <- predict(svm_model_human, newdata = test_set_human)
svm_human_confusion_matrix <- caret::confusionMatrix(svm_predict_human, test_labels_human, mode = "prec_recall")
svm_human_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction ham spam
## ham 51 8
## spam 23 408
##
## Accuracy : 0.9367
## 95% CI : (0.9114, 0.9566)
## No Information Rate : 0.849
## P-Value [Acc > NIR] : 1.499e-09
##
## Kappa : 0.7309
##
## Mcnemar's Test P-Value : 0.01192
##
## Precision : 0.8644
## Recall : 0.6892
## F1 : 0.7669
## Prevalence : 0.1510
## Detection Rate : 0.1041
## Detection Prevalence : 0.1204
## Balanced Accuracy : 0.8350
##
## 'Positive' Class : ham
##
As before, resample to estimate the distribution of the performance metrics (eg, accuracy)
models <- list(computer_content = svm_model_hard, human_content = svm_model_human)
resampling <- resamples(models)
Then reshape the data
resampling_metrics_df <- resampling$values
resampling_metrics_df <- resampling_metrics_df %>%
melt() %>%
rowwise() %>%
mutate(
metric = if_else(str_detect(variable, "Accuracy", negate = FALSE), "Accuracy", "Kappa"),
content_type = str_extract(variable, ".*(?=_)"),
# remove "~Accuracy" and "~Kappa" from variable names
variable = str_replace(variable, "~.*", ""),
)
The boxplots below show that the accuracy and kappa values of the SVM model for the “computer-interpretable” dataset are greater than those for the “human-interpretable” dataset. Together, these findings indicate that computer-interpretable content in emails (eg, headers) provides predictive value to the SVM model for classifying ham vs spam.
ggplot(resampling_metrics_df, aes(x = content_type, y = value, color = content_type)) +
geom_boxplot() +
coord_flip() +
facet_grid(~ metric, scales = "free_x") +
ylab("Value") + xlab("Content Type") +
theme(
strip.text = element_text(face = "bold"),
axis.title = element_text(face = "bold"),
legend.position = "none"
)
These analyses show that supervised machine learning (SML) algorithms perform spam vs ham classification well—the four methods I compared (decision trees, random forest, k-nearest neighbor, and support vector machine) were significantly better than a naive classifier, had accuracy >90%, and most had \(𝜅 > 0.8\). In general, the algorithms performed better for the “easy” emails than the “hard” emails. Overall, the SVM model performed best for both types, which suggests that it would have the best performance in the “real world”. Of note, the SVM performance was dependent on information from the entire email as shown by the reduced performance when email headers were excluded.
Additional improvements in classification performance may be possible by balancing the spam and ham emails, fine-tuning the hyperparameters of the SML algorithms, and using more advanced methods such as neural networks or large language models.
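As one concrete example of such tuning, caret could search a grid of cost values for the linear SVM on the “hard” dataset rather than fixing cost = 1; the grid below is illustrative only (a sketch, reusing the objects defined above).
# Illustrative sketch: compare several cost values for the linear SVM
svm_tuned_hard <- train(x = training_set_hard, y = training_labels_hard,
                        method = "svmLinear3", trControl = resampling_method,
                        tuneGrid = expand.grid(cost = c(0.25, 0.5, 1, 2, 4), Loss = 2))
print(svm_tuned_hard)  # reports Accuracy and Kappa for each cost and selects the best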