1 Overview

In this project, I build a document classification model using R text-mining tools: tm for preprocessing and the Naive Bayes classifier from e1071. I read labeled emails (SpamAssassin corpus) from local folders, clean the text, convert it to a TF-IDF weighted matrix, train Naive Bayes, evaluate accuracy, precision, recall, and F1 on a held-out test set, and then try the model on a few new example messages.

2 Libraries

library(tm)
library(dplyr)
library(SnowballC)
library(e1071)
library(stringi)
ham_path  <- "~/Desktop/data/easy_ham_2"
spam_path <- "~/Desktop/data/spam_2"

stopifnot(dir.exists(ham_path), dir.exists(spam_path))

3 Data

ham_corp  <- VCorpus(DirSource(ham_path,  encoding = "UTF-8"))
spam_corp <- VCorpus(DirSource(spam_path, encoding = "UTF-8"))

# smaller sample to keep things fast (seeded so the same emails are drawn on every run)
set.seed(1234)

sample_corpus <- function(corp, n_max = 600) {
  n <- length(corp)
  if (n > n_max) corp[sort(sample(seq_len(n), n_max))] else corp
}
ham_corp  <- sample_corpus(ham_corp,  600)
spam_corp <- sample_corpus(spam_corp, 600)

length(ham_corp); length(spam_corp)
## [1] 600
## [1] 600
clean_corpus <- function(corp) {
  corp <- tm_map(corp, content_transformer(function(x) {
    x <- iconv(x, from = "", to = "UTF-8", sub = " ")
    x <- stri_enc_toutf8(x, is_unknown_8bit = TRUE, validate = TRUE)
    x <- gsub("[\\x00-\\x1F\\x7F]", " ", x, perl = TRUE)
    x
  }))
  
  # text cleaning
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeWords, stopwords("en"))
  corp <- tm_map(corp, stemDocument, language = "en")
  corp <- tm_map(corp, stripWhitespace)
  corp
}


ham_corp_cln  <- clean_corpus(ham_corp)
spam_corp_cln <- clean_corpus(spam_corp)

all_corp <- c(ham_corp_cln, spam_corp_cln)
labels   <- factor(c(rep("ham",  length(ham_corp_cln)),
                     rep("spam", length(spam_corp_cln))),
                   levels = c("ham", "spam"))
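
One side effect of the cleaning order in clean_corpus() is worth illustrating: punctuation is removed before stopwords, so contractions such as "couldn't" lose their apostrophe and no longer match the stopword list. The sketch below is a minimal illustration on a single made-up string, assuming tm is installed and that its English stopword list contains "couldn't"; it is not a change to the pipeline above.

```r
library(tm)

txt <- "we couldn't help ourselves"

# Same order as clean_corpus(): punctuation first, then stopwords
punct_first <- removeWords(removePunctuation(txt), stopwords("en"))

# Alternative order: stopwords first, then punctuation
stop_first <- removePunctuation(removeWords(txt, stopwords("en")))

# Does the stripped contraction survive in each variant?
grepl("couldnt", punct_first)
grepl("couldnt", stop_first)
```

If the stripped contractions turn out to be noise, swapping the two tm_map() steps is one option to try; whether it changes the classifier's results was not tested here.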

4 Split data into training and testing data (80/20)

set.seed(1234)
idx <- sample(seq_along(all_corp), size = floor(0.8 * length(all_corp)))
train_corp <- all_corp[idx]
test_corp  <- all_corp[-idx]
y_train    <- labels[idx]
y_test     <- labels[-idx]

length(train_corp); length(test_corp)
## [1] 960
## [1] 240
table(y_train); table(y_test)
## y_train
##  ham spam 
##  484  476
## y_test
##  ham spam 
##  116  124

5 DTM -> TF-IDF

# train DTM
dtm_train <- DocumentTermMatrix(train_corp)

# trim very sparse terms
dtm_train <- removeSparseTerms(dtm_train, 0.99)

# TF-IDF weighting
dtm_train_tfidf <- weightTfIdf(dtm_train)

# test DTM using the training dict
dtm_test  <- DocumentTermMatrix(test_corp, control = list(dictionary = Terms(dtm_train)))
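# note: weightTfIdf() derives IDF from the matrix it is given, so these weights
# use test-set document frequencies rather than the training ones
dtm_test_tfidf <- weightTfIdf(dtm_test)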

# numeric matrices
x_train <- as.matrix(dtm_train_tfidf)
x_test  <- as.matrix(dtm_test_tfidf)

# replace NA/inf with 0
x_train[!is.finite(x_train)] <- 0
x_test[!is.finite(x_test)]   <- 0

dim(x_train); dim(x_test)
## [1]  960 2278
## [1]  240 2278
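
Because weightTfIdf() computes IDF from whatever matrix it receives, the test weights above are not on the same scale as the training weights. A possible alternative is to learn the IDF vector once on the training counts and reuse it. The sketch below shows the idea in base R on made-up counts; the helper names idf_from() and apply_tfidf() are hypothetical, and the formula (length-normalised TF times log2(N / df)) is meant to mirror what weightTfIdf() does by default.

```r
# Toy term counts (rows = documents, columns = terms); all numbers are made up
terms <- c("free", "meeting", "offer")
train_counts <- matrix(c(2, 0, 1,
                         0, 3, 1,
                         0, 1, 2,
                         0, 0, 1),
                       ncol = 3, byrow = TRUE, dimnames = list(NULL, terms))
new_counts <- matrix(c(1, 1, 2), nrow = 1, dimnames = list(NULL, terms))

# IDF = log2(N / document frequency)
idf_from <- function(counts) {
  df  <- colSums(counts > 0)
  idf <- log2(nrow(counts) / df)
  idf[!is.finite(idf)] <- 0   # terms that never occur get weight 0
  idf
}

# Length-normalised term frequency times a supplied IDF vector
apply_tfidf <- function(counts, idf) {
  tf <- counts / pmax(rowSums(counts), 1)
  sweep(tf, 2, idf, "*")
}

train_idf <- idf_from(train_counts)   # free = 2, meeting = 1, offer = 0

new_with_train_idf <- apply_tfidf(new_counts, train_idf)
new_with_own_idf   <- apply_tfidf(new_counts, idf_from(new_counts))

new_with_train_idf   # 0.5, 0.25, 0
new_with_own_idf     # all zeros: every term occurs in the only document
```

The second result shows the failure mode in its extreme form: when IDF is recomputed on a very small batch, terms shared by every document in that batch are weighted to zero regardless of how informative they were in training.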

6 Naive Bayes Model

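# e1071 models numeric predictors as per-class Gaussians (mean and sd);
# laplace only smooths categorical predictors, so it has no effect here
nb_fit <- naiveBayes(x = x_train, y = y_train, laplace = 1)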
glimpse(nb_fit)
## List of 5
##  $ apriori  : 'table' int [1:2(1d)] 484 476
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ y_train: chr [1:2] "ham" "spam"
##  $ tables   :List of 2278
##   ..$ aaa                                                              : num [1:2, 1:2] 0.000525 0.000515 0.006085 0.005381
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ abil                                                             : num [1:2, 1:2] 0.000357 0.000791 0.002811 0.005555
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ abl                                                              : num [1:2, 1:2] 0.00141 0.00127 0.00547 0.00586
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ absolut                                                          : num [1:2, 1:2] 0.000128 0.001877 0.001635 0.00801
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ abus                                                             : num [1:2, 1:2] 0.000456 0.000339 0.004814 0.003608
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ acc                                                              : num [1:2, 1:2] 6.12e-04 4.85e-05 4.52e-03 1.06e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ accept                                                           : num [1:2, 1:2] 0.000579 0.002394 0.003784 0.011696
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ access                                                           : num [1:2, 1:2] 0.00126 0.00163 0.00712 0.00647
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ accord                                                           : num [1:2, 1:2] 0.00038 0.000388 0.002635 0.002747
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ account                                                          : num [1:2, 1:2] 0.000891 0.002871 0.004866 0.012862
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ achiev                                                           : num [1:2, 1:2] 9.71e-05 6.40e-04 1.45e-03 4.49e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ acquir                                                           : num [1:2, 1:2] 7.33e-05 4.20e-04 1.02e-03 3.25e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ across                                                           : num [1:2, 1:2] 0.000745 0.000425 0.004695 0.003772
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ act                                                              : num [1:2, 1:2] 0.000305 0.002128 0.002019 0.009733
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ action                                                           : num [1:2, 1:2] 0.000793 0.001199 0.006254 0.006978
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ actiondhttpresponseresponseasp                                   : num [1:2, 1:2] 0 0.000459 0 0.002522
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ activ                                                            : num [1:2, 1:2] 0.000722 0.000907 0.005327 0.005893
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ actual                                                           : num [1:2, 1:2] 0.00219 0.00126 0.00805 0.00564
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ adam                                                             : num [1:2, 1:2] 0.00199 0 0.01259 0
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ add                                                              : num [1:2, 1:2] 0.002098 0.000609 0.008306 0.003619
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ addit                                                            : num [1:2, 1:2] 0.000761 0.001251 0.003766 0.005845
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ address                                                          : num [1:2, 1:2] 0.00221 0.00634 0.00831 0.0134
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ admin                                                            : num [1:2, 1:2] 1.03e-03 9.49e-05 7.44e-03 2.07e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ administr                                                        : num [1:2, 1:2] 0.000715 0.000548 0.005568 0.00491
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ adsldslsnfcpacbellnet                                            : num [1:2, 1:2] 0.000517 0.000244 0.004339 0.003125
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ adult                                                            : num [1:2, 1:2] 3.02e-05 2.51e-03 4.70e-04 1.77e-02
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ adv                                                              : num [1:2, 1:2] 1.95e-05 2.44e-03 4.29e-04 1.12e-02
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ advanc                                                           : num [1:2, 1:2] 0.00063 0.00109 0.0051 0.00826
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ advantag                                                         : num [1:2, 1:2] 0.000415 0.00086 0.003071 0.005189
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ advertis                                                         : num [1:2, 1:2] 0.000275 0.002921 0.002473 0.015289
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ advic                                                            : num [1:2, 1:2] 0.000907 0.000601 0.006213 0.005736
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ affili                                                           : num [1:2, 1:2] 0 0.000801 0 0.005384
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ afford                                                           : num [1:2, 1:2] 0.00018 0.000805 0.002448 0.005237
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ africa                                                           : num [1:2, 1:2] 6.13e-05 6.82e-04 1.35e-03 6.60e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ age                                                              : num [1:2, 1:2] 0.000353 0.003905 0.00261 0.014005
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ agenc                                                            : num [1:2, 1:2] 0.000204 0.000309 0.001965 0.002563
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ agent                                                            : num [1:2, 1:2] 0.00126 0.00145 0.00812 0.01042
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ ago                                                              : num [1:2, 1:2] 0.001393 0.000258 0.005349 0.002147
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ agre                                                             : num [1:2, 1:2] 0.000468 0.000559 0.003426 0.004418
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aid                                                              : num [1:2, 1:2] 0 0.000265 0 0.002143
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aiesubericnet                                                    : num [1:2, 1:2] 0.000986 0 0.006797 0
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alert                                                            : num [1:2, 1:2] 0.000222 0.00047 0.003506 0.004459
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ algorithm                                                        : num [1:2, 1:2] 7.54e-04 3.05e-05 6.04e-03 6.65e-04
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligncent                                                        : num [1:2, 1:2] 3.16e-05 5.70e-03 6.95e-04 1.93e-02
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligncentera                                                     : num [1:2, 1:2] 0 0.00139 0 0.01035
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligncenterbfont                                                 : num [1:2, 1:2] 0 0.00242 0 0.01212
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligncenterfont                                                  : num [1:2, 1:2] 0 0.00332 0 0.01457
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligncenternbspp                                                 : num [1:2, 1:2] 0 0.00231 0 0.01812
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndcent                                                       : num [1:2, 1:2] 0 0.0101 0 0.0421
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndcentera                                                    : num [1:2, 1:2] 0 0.00139 0 0.01084
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndcenterbfont                                                : num [1:2, 1:2] 0 0.00151 0 0.00867
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndcenterfont                                                 : num [1:2, 1:2] 0 0.00414 0 0.02002
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndcenterimg                                                  : num [1:2, 1:2] 0 0.000733 0 0.004599
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndleft                                                       : num [1:2, 1:2] 0 0.00256 0 0.01517
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndleftfont                                                   : num [1:2, 1:2] 0 0.00207 0 0.01908
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndmiddl                                                      : num [1:2, 1:2] 0 0.000944 0 0.00589
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndright                                                      : num [1:2, 1:2] 3.68e-05 3.50e-03 8.09e-04 1.99e-02
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndrightbfont                                                 : num [1:2, 1:2] 0 0.00236 0 0.01431
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aligndrightnbsptd                                                : num [1:2, 1:2] 0 0.000853 0 0.004911
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alignleft                                                        : num [1:2, 1:2] 0 0.0032 0 0.0169
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alignleftfont                                                    : num [1:2, 1:2] 0 0.00155 0 0.01063
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alignright                                                       : num [1:2, 1:2] 2.19e-05 1.62e-03 4.81e-04 1.03e-02
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alink                                                            : num [1:2, 1:2] 0 0.00056 0 0.00378
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ allow                                                            : num [1:2, 1:2] 0.00101 0.00147 0.00406 0.00597
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ almost                                                           : num [1:2, 1:2] 0.000789 0.000698 0.004058 0.003776
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alon                                                             : num [1:2, 1:2] 0.000465 0.000168 0.003656 0.001839
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ along                                                            : num [1:2, 1:2] 0.00109 0.00056 0.00508 0.00349
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alreadi                                                          : num [1:2, 1:2] 0.00132 0.00101 0.00537 0.0044
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ also                                                             : num [1:2, 1:2] 0.00269 0.0021 0.006 0.00582
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ altd                                                             : num [1:2, 1:2] 7.54e-05 1.19e-03 1.66e-03 6.78e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alter                                                            : num [1:2, 1:2] 0.000482 0.000234 0.004473 0.002875
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ altern                                                           : num [1:2, 1:2] 0.000993 0.000612 0.005629 0.003943
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ although                                                         : num [1:2, 1:2] 0.000952 0.000135 0.005131 0.001312
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ alway                                                            : num [1:2, 1:2] 0.00147 0.00111 0.00545 0.00465
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ amaz                                                             : num [1:2, 1:2] 0.00033 0.00156 0.00298 0.00805
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ america                                                          : num [1:2, 1:2] 0.000724 0.001799 0.005664 0.008581
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ american                                                         : num [1:2, 1:2] 0.00128 0.00114 0.00999 0.00817
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ among                                                            : num [1:2, 1:2] 0.000521 0.000129 0.003706 0.001516
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ amount                                                           : num [1:2, 1:2] 0.000542 0.000899 0.003273 0.003946
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ amp                                                              : num [1:2, 1:2] 0 0.00199 0 0.00877
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ analysi                                                          : num [1:2, 1:2] 0.000158 0.000283 0.001782 0.00249
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ ander                                                            : num [1:2, 1:2] 0.00125 0 0.00948 0
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ andor                                                            : num [1:2, 1:2] 0.000666 0.000498 0.004137 0.003057
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ angl                                                             : num [1:2, 1:2] 0.000984 0 0.007219 0
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anim                                                             : num [1:2, 1:2] 0.000895 0.00034 0.011502 0.004146
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ announc                                                          : num [1:2, 1:2] 0.000932 0.000671 0.006278 0.006437
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ annoy                                                            : num [1:2, 1:2] 6.86e-04 9.68e-05 5.00e-03 2.11e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ annual                                                           : num [1:2, 1:2] 2.19e-05 1.13e-03 4.81e-04 8.95e-03
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anoth                                                            : num [1:2, 1:2] 0.00215 0.00124 0.00604 0.00493
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ answer                                                           : num [1:2, 1:2] 0.001071 0.000688 0.005229 0.004056
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ answerspablo                                                     : num [1:2, 1:2] 0.00104 0 0.00559 0
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anybodi                                                          : num [1:2, 1:2] 0.00078 0.000346 0.005407 0.002993
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anymor                                                           : num [1:2, 1:2] 0.000689 0.000179 0.004707 0.001597
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anyon                                                            : num [1:2, 1:2] 0.00262 0.00192 0.00706 0.00732
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anyth                                                            : num [1:2, 1:2] 0.001033 0.000572 0.004826 0.003045
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anytim                                                           : num [1:2, 1:2] 0.000127 0.000625 0.001663 0.004312
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anyway                                                           : num [1:2, 1:2] 1.24e-03 1.71e-05 5.88e-03 2.72e-04
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ anywher                                                          : num [1:2, 1:2] 0.000261 0.001408 0.002534 0.008028
##   .. ..- attr(*, "dimnames")=List of 2
##   ..$ aol                                                              : num [1:2, 1:2] 0.000239 0.000848 0.003168 0.007277
##   .. ..- attr(*, "dimnames")=List of 2
##   .. [list output truncated]
##  $ levels   : chr [1:2] "ham" "spam"
##  $ isnumeric: Named logi [1:2278] TRUE TRUE TRUE TRUE TRUE TRUE ...
##   ..- attr(*, "names")= chr [1:2278] "aaa" "abil" "abl" "absolut" ...
##  $ call     : language naiveBayes.default(x = x_train, y = y_train, laplace = 1)
##  - attr(*, "class")= chr "naiveBayes"
# nb_fit
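
The 2 x 2 tables in the output above hold a mean and a standard deviation per class, which is how e1071's naiveBayes() treats numeric columns. The laplace argument applies only to categorical predictors. A minimal check on random toy data (feature names and values are made up; assumes e1071 is installed):

```r
library(e1071)

set.seed(1)
# Toy numeric features, for illustration only
x <- matrix(runif(40), ncol = 2, dimnames = list(NULL, c("term_a", "term_b")))
y <- factor(rep(c("ham", "spam"), each = 10))

fit_no_smooth <- naiveBayes(x, y, laplace = 0)
fit_smooth    <- naiveBayes(x, y, laplace = 1)

# Per-class mean (column 1) and sd (column 2) for one feature
fit_smooth$tables$term_a

# Compare the fitted tables with and without smoothing
same_tables <- identical(fit_no_smooth$tables, fit_smooth$tables)
same_tables
```

If smoothing is wanted, one common alternative is to convert each term column to a presence/absence factor before fitting, so that the predictors are categorical; that variant was not evaluated in this report.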

7 Evaluation: Accuracy, precision, recall, F1

pred_class <- predict(nb_fit, x_test)
tab <- table(Actual = y_test, Pred = pred_class)
tab
##       Pred
## Actual ham spam
##   ham  114    2
##   spam   4  120
# "spam" is positive
TP <- tab["spam","spam"]; FP <- tab["ham","spam"]
FN <- tab["spam","ham"];  TN <- tab["ham","ham"]

accuracy  <- (TP + TN) / sum(tab)
precision <- ifelse((TP + FP) == 0, NA, TP / (TP + FP))
recall    <- ifelse((TP + FN) == 0, NA, TP / (TP + FN))
f1        <- ifelse(is.na(precision) || is.na(recall) || (precision + recall) == 0,
                    NA, 2 * precision * recall / (precision + recall))

knitr::kable(
  data.frame(Metric = c("Accuracy", "Precision", "Recall", "F1"),
             Value  = round(c(accuracy, precision, recall, f1), 3)),
  caption = "Evaluation metrics on the test dataset."
)
Evaluation metrics on the test dataset.
Metric | Value
Accuracy | 0.975
Precision | 0.984
Recall | 0.968
F1 | 0.976
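
As an arithmetic check, the confusion matrix above (with spam as the positive class: TP = 120, FP = 2, FN = 4, TN = 114) reproduces the reported values:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{120 + 114}{240} = 0.975 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{120}{122} \approx 0.984 \\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{120}{124} \approx 0.968 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN} = \frac{240}{246} \approx 0.976
\end{align*}
```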

8 Prediction on my personal emails

new_messages <- c(
  "Hello Joao, a few days ago I opened an email from my client Amy, now thriving as Director of Payments and Checkout at PlayStation with a high salary. She used an updated resume and strategy to move from frustrated and overworked to a much better role. The email links to her resume and says this is what happens when strategy meets execution.",
  "We couldn't help ourselves. By popular demand, we are adding Mac and Cheese to our Thanksgiving feast with Colonia Verde flair. Our three-cheese mac with guajillo and aji amarillo is now part of the menu with turkey, stuffing, mashed potatoes, glazed squash, cranberry sauce, and tres leches. Pickups will be available on Wednesday and Thursday in the afternoon and orders close soon.",
  "YOUR FEEDBACK: Help us shape the future of Amplify Classroom. As part of our community, we value your insights as educators and invite you to complete a short survey about your experience. Your feedback helps us improve our tools for teachers and students, and everyone who completes the survey is entered into a drawing for a $25 gift card.",
  "Your IRS tax refund is pending acceptance. You must accept within 24 hours using this link: http://bit.ly/sdfsdf"
)

new_corp <- VCorpus(VectorSource(new_messages))
new_corp <- clean_corpus(new_corp)

# check what the model actually sees after cleaning to understand misclassification
cat("---- Cleaned documents ----\n")
## ---- Cleaned documents ----
for (i in seq_along(new_corp)) {
  cat("Doc", i, ":\n")
  cat(as.character(content(new_corp[[i]])), "\n\n")
}
## Doc 1 :
## hello joao day ago open email client ami now thrive director payment checkout playstat high salari use updat resum strategi move frustrat overwork much better role email link resum say happen strategi meet execut 
## 
## Doc 2 :
## couldnt help popular demand ad mac chees thanksgiv feast colonia verd flair threechees mac guajillo aji amarillo now part menu turkey stuf mash potato glaze squash cranberri sauc tres lech pickup will avail wednesday thursday afternoon order close soon 
## 
## Doc 3 :
## feedback help us shape futur amplifi classroom part communiti valu insight educ invit complet short survey experi feedback help us improv tool teacher student everyon complet survey enter draw gift card 
## 
## Doc 4 :
## ir tax refund pend accept must accept within hour use link httpbitlysdfsdf
new_dtm   <- DocumentTermMatrix(new_corp, control = list(dictionary = Terms(dtm_train)))
# note: IDF is recomputed from these four messages only, not taken from training
new_tfidf <- weightTfIdf(new_dtm)
new_x <- as.matrix(new_tfidf)
new_x[!is.finite(new_x)] <- 0

nz_terms <- rowSums(new_x != 0)
nz_terms
##  1  2  3  4 
## 23 14 17  8
new_pred <- predict(nb_fit, new_x, type = "class")

knitr::kable(
  data.frame(
    message = new_messages,
    predicted_class = as.character(new_pred)
  ),
  caption = "Predictions for example messages."
)
Predictions for example messages.
message | predicted_class
Hello Joao, a few days ago I opened an email from my client Amy, now thriving as Director of Payments and Checkout at PlayStation with a high salary. She used an updated resume and strategy to move from frustrated and overworked to a much better role. The email links to her resume and says this is what happens when strategy meets execution. | ham
We couldn’t help ourselves. By popular demand, we are adding Mac and Cheese to our Thanksgiving feast with Colonia Verde flair. Our three-cheese mac with guajillo and aji amarillo is now part of the menu with turkey, stuffing, mashed potatoes, glazed squash, cranberry sauce, and tres leches. Pickups will be available on Wednesday and Thursday in the afternoon and orders close soon. | spam
YOUR FEEDBACK: Help us shape the future of Amplify Classroom. As part of our community, we value your insights as educators and invite you to complete a short survey about your experience. Your feedback helps us improve our tools for teachers and students, and everyone who completes the survey is entered into a drawing for a $25 gift card. | spam
Your IRS tax refund is pending acceptance. You must accept within 24 hours using this link: http://bit.ly/sdfsdf | ham
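
One way to quantify how much of a new message the model can actually see is to measure the share of its cleaned tokens that appear in the training dictionary. The sketch below uses the cleaned text of message 4 from the output above; train_vocab is a small hypothetical stand-in for Terms(dtm_train), and vocab_coverage() is an illustrative helper, not part of tm.

```r
# Hypothetical stand-in for Terms(dtm_train); the real dictionary has 2278 terms
train_vocab <- c("accept", "use", "link", "hour", "must")

# Cleaned text of message 4, as printed in the cleaned-documents output
doc4 <- "ir tax refund pend accept must accept within hour use link httpbitlysdfsdf"

# Share of distinct tokens that are present in the vocabulary
vocab_coverage <- function(text, vocab) {
  tokens <- unique(strsplit(trimws(text), "\\s+")[[1]])
  mean(tokens %in% vocab)
}

coverage <- vocab_coverage(doc4, train_vocab)
coverage   # 5 of 11 distinct tokens with this toy vocabulary
```

Run against the real dictionary, a low value for a message would support the explanation that the training vocabulary has little overlap with modern email.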

9 Conclusion

This report built a complete spam vs. ham classifier in R. After cleaning the emails and transforming them into TF-IDF features, a Naive Bayes model reached 97.5% accuracy, 98.4% precision, 96.8% recall, and an F1 of 0.976 on a held-out test set of 240 messages. That performance did not carry over to my own emails, which I did not expect: for example, the phishing-style IRS message was labeled ham, while a restaurant newsletter and a survey invitation were labeled spam. This might have happened because the vocabulary in the SpamAssassin corpus is outdated and overlaps very little with modern newsletters, corporate emails, and phishing attempts. Another possible factor is that the TF-IDF weights for the four new messages were computed from those four messages alone rather than from the training corpus. So, I think that a modern spam filter must be trained on recent, representative data.