In this project, I build a document classifier to distinguish between spam and non-spam (ham) emails using the SpamAssassin public corpus.
# Core packages
library(tidyverse) # includes dplyr, readr, stringr, purrr, ggplot2, etc.
# Your folder paths (ham = non-spam, spam = spam)
ham_dir <- "/Users/kidd/Desktop/CUNY/CUNY_SPS/2025_2_Fall/DATA_607/Projects/Project_4_Document_classification/easy_ham"
spam_dir <- "/Users/kidd/Desktop/CUNY/CUNY_SPS/2025_2_Fall/DATA_607/Projects/Project_4_Document_classification/spam"
# List all files in each folder
ham_files <- list.files(ham_dir, full.names = TRUE)
spam_files <- list.files(spam_dir, full.names = TRUE)
length(ham_files)
## [1] 2551
length(spam_files)
## [1] 501
# ---- read-email-text ----
read_email <- function(file){
paste(read_lines(file, progress = FALSE), collapse = "\n")
}
ham_texts <- tibble(
text = map_chr(ham_files, read_email),
label = "ham"
)
spam_texts <- tibble(
text = map_chr(spam_files, read_email),
label = "spam"
)
emails <- bind_rows(ham_texts, spam_texts)
dplyr::glimpse(emails)
## Rows: 3,052
## Columns: 2
## $ text <chr> "From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 2002\nR…
## $ label <chr> "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "ham", "…
# ---- clean-text ----
library(tidytext)
emails_clean <- emails %>%
mutate(text = str_to_lower(text)) %>%
mutate(text = str_replace_all(text, "[^a-z']", " ")) %>%
mutate(text = str_squish(text))
head(emails_clean$text, 3)
## [1] "from exmh workers admin redhat com thu aug return path exmh workers admin example com delivered to zzzz localhost netnoteinc com received from localhost localhost by phobos labs netnoteinc com postfix with esmtp id d e c for zzzz localhost thu aug edt received from phobos by localhost with imap fetchmail for zzzz localhost single drop thu aug ist received from listman example com listman example com by dogma slashnull org with esmtp id g mbyrz for zzzz exmh example com thu aug received from listman example com localhost localdomain by listman redhat com postfix with esmtp id thu aug edt delivered to exmh workers listman example com received from int mx corp example com int mx corp example com by listman redhat com postfix with esmtp id cf d for exmh workers listman redhat com thu aug edt received from mail localhost by int mx corp example com id g mby g for exmh workers listman redhat com thu aug received from mx example com mx example com by int mx corp redhat com with smtp id g mby y for exmh workers redhat com thu aug received from ratree psu ac th by mx example com with smtp id g mbihl for exmh workers redhat com thu aug received from delta cs mu oz au delta coe psu ac th by ratree psu ac th with esmtp id g mbwel thu aug ict received from munnari oz au localhost by delta cs mu oz au with esmtp id g mbqpw thu aug ict from robert elz kre munnari oz au to chris garrigues cwg dated fa d deepeddy com cc exmh workers example com subject re new sequences window in reply to tmda deepeddy vircio com references tmda deepeddy vircio com tmda deepeddy vircio com munnari oz au tmda deepeddy vircio com tmda deepeddy vircio com mime version content type text plain charset us ascii message id munnari oz au x loop exmh workers example com sender exmh workers admin example com errors to exmh workers admin example com x beenthere exmh workers example com x mailman version precedence bulk list help mailto exmh workers request example com subject help list post mailto exmh 
workers example com list subscribe https listman example com mailman listinfo exmh workers mailto exmh workers request redhat com subject subscribe list id discussion list for exmh developers exmh workers example com list unsubscribe https listman example com mailman listinfo exmh workers mailto exmh workers request redhat com subject unsubscribe list archive https listman example com mailman private exmh workers date thu aug date wed aug from chris garrigues cwg dated fa d deepeddy com message id tmda deepeddy vircio com i can't reproduce this error for me it is very repeatable like every time without fail this is the debug log of the pick happening pick it exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury exec pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury ftoc pickmsgs hit marking hits tkerror syntax error in expression int note if i run the pick command by hand delta pick inbox list lbrace lbrace subject ftp rbrace rbrace sequence mercury hit that's where the hit comes from obviously the version of nmh i'm using is delta pick version pick nmh compiled on fuchsia cs mu oz au at sun mar ict and the relevant part of my mh profile delta mhparam pick seq sel list since the pick command works the sequence actually both of them the one that's explicit on the command line from the search popup and the one that comes from mh profile do get created kre ps this is still using the version of the code form a day ago i haven't been able to reach the cvs repository today local routing issue i think exmh workers mailing list exmh workers redhat com https listman redhat com mailman listinfo exmh workers"
## [2] "from steve burt cursor system com thu aug return path steve burt cursor system com delivered to zzzz localhost netnoteinc com received from localhost localhost by phobos labs netnoteinc com postfix with esmtp id be e c for zzzz localhost thu aug edt received from phobos by localhost with imap fetchmail for zzzz localhost single drop thu aug ist received from n grp scd yahoo com n grp scd yahoo com by dogma slashnull org with smtp id g mbktz for zzzz example com thu aug x egroups return sentto zzzz example com returns groups yahoo com received from by n grp scd yahoo com with nnfmp aug x sender steve burt cursor system com x apparently to zzzzteana yahoogroups com received egp mail aug received qmail invoked from network aug received from unknown by m grp scd yahoo com with qmqp aug received from unknown helo mailgateway cursor system com by mta grp scd yahoo com with smtp aug received from exchange cps local unverified by mailgateway cursor system com content technologies smtprs with esmtp id t cde f ac d d mailgateway cursor system com for forteana yahoogroups com thu aug received by exchange cps local with internet mail service id pxx at thu aug message id ec ad d d fb bda d d ef b f exchange cps local to 'zzzzteana yahoogroups com' zzzzteana yahoogroups com x mailer internet mail service x egroups from steve burt steve burt cursor system com from steve burt steve burt cursor system com x yahoo profile pyruse mime version mailing list list zzzzteana yahoogroups com contact forteana owner yahoogroups com delivered to mailing list zzzzteana yahoogroups com precedence bulk list unsubscribe mailto zzzzteana unsubscribe yahoogroups com date thu aug subject zzzzteana re alexander reply to zzzzteana yahoogroups com content type text plain charset us ascii content transfer encoding bit martin a posted tassos papadopoulos the greek sculptor behind the plan judged that the limestone of mount kerdylio miles east of salonika and not far from the mount athos monastic 
community was ideal for the patriotic sculpture as well as alexander's granite features ft high and ft wide a museum a restored amphitheatre and car park for admiring crowds are planned so is this mountain limestone or granite if it's limestone it'll weather pretty fast yahoo groups sponsor dvds free s p join now http us click yahoo com pt ybb nxieaa mg haa gsolb tm to unsubscribe from this group send an email to forteana unsubscribe egroups com your use of yahoo groups is subject to http docs yahoo com info terms"
## [3] "from timc ubh com thu aug return path timc ubh com delivered to zzzz localhost netnoteinc com received from localhost localhost by phobos labs netnoteinc com postfix with esmtp id c for zzzz localhost thu aug edt received from phobos by localhost with imap fetchmail for zzzz localhost single drop thu aug ist received from n grp scd yahoo com n grp scd yahoo com by dogma slashnull org with smtp id g mcrdz for zzzz example com thu aug x egroups return sentto zzzz example com returns groups yahoo com received from by n grp scd yahoo com with nnfmp aug x sender timc ubh com x apparently to zzzzteana yahoogroups com received egp mail aug received qmail invoked from network aug received from unknown by m grp scd yahoo com with qmqp aug received from unknown helo rhenium btinternet com by mta grp scd yahoo com with smtp aug received from host in addr btopenworld com by rhenium btinternet com with esmtp exim id hrt gj for forteana yahoogroups com thu aug x mailer microsoft outlook express macintosh edition to zzzzteana zzzzteana yahoogroups com x priority message id e hrt gj rhenium btinternet com from tim chapman timc ubh com x yahoo profile tim ubh mime version mailing list list zzzzteana yahoogroups com contact forteana owner yahoogroups com delivered to mailing list zzzzteana yahoogroups com precedence bulk list unsubscribe mailto zzzzteana unsubscribe yahoogroups com date thu aug subject zzzzteana moscow bomber reply to zzzzteana yahoogroups com content type text plain charset us ascii content transfer encoding bit man threatens explosion in moscow thursday august pm moscow ap security officers on thursday seized an unidentified man who said he was armed with explosives and threatened to blow up his truck in front of russia's federal security services headquarters in moscow ntv television reported the officers seized an automatic rifle the man was carrying then the man got out of the truck and was taken into custody ntv said no other details were immediately 
available the man had demanded talks with high government officials the interfax and itar tass news agencies said ekho moskvy radio reported that he wanted to talk with russian president vladimir putin police and security forces rushed to the security service building within blocks of the kremlin red square and the bolshoi ballet and surrounded the man who claimed to have one and a half tons of explosives the news agencies said negotiations continued for about one and a half hours outside the building itar tass and interfax reported citing witnesses the man later drove away from the building under police escort and drove to a street near moscow's olympic penta hotel where authorities held further negotiations with him the moscow police press service said the move appeared to be an attempt by security services to get him to a more secure location yahoo groups sponsor dvds free s p join now http us click yahoo com pt ybb nxieaa mg haa gsolb tm to unsubscribe from this group send an email to forteana unsubscribe egroups com your use of yahoo groups is subject to http docs yahoo com info terms"
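The effect of the cleaning step is easiest to see on a single line. The sketch below uses base-R equivalents of the stringr calls (`tolower`, `gsub`, `trimws`) on a made-up subject line; the sample string and variable names are assumptions for illustration and are not taken from the corpus.

```r
# Base-R stand-ins for str_to_lower / str_replace_all / str_squish,
# applied to one invented header line (not from the corpus).
raw <- "Subject: Re: New Sequences Window (v2.5)  --  can't reproduce!"

lower        <- tolower(raw)
letters_only <- gsub("[^a-z']", " ", lower)          # digits and punctuation become spaces
squished     <- trimws(gsub("\\s+", " ", letters_only)) # collapse runs of whitespace

print(squished)
```

Note that "v2.5" collapses to a lone "v": because digits are dropped, version numbers and hexadecimal message IDs break into stray letter fragments, which is where tokens such as "d e c" in the output above come from.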
# ---- dtm ----
email_words <- emails_clean %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word")
word_counts <- email_words %>%
count(label, word) %>%
bind_tf_idf(word, label, n)
head(word_counts)
## # A tibble: 6 × 6
## label word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 ham a'x 2 0.00000255 0.693 0.00000176
## 2 ham aa 159 0.000202 0 0
## 3 ham aaa 16 0.0000204 0 0
## 4 ham aaaa 2 0.00000255 0 0
## 5 ham aaaaaabaadkaohxzserqwgeead 1 0.00000127 0.693 0.000000882
## 6 ham aaaaaaib 1 0.00000127 0.693 0.000000882
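The `idf` column above only ever takes the values 0 and 0.693. That follows from calling `bind_tf_idf(word, label, n)` with `label` as the document: there are just two documents, so IDF is ln(2/2) = 0 for a word found in both classes and ln(2/1) ≈ 0.693 for a word found in one. The base-R sketch below reproduces the arithmetic on invented counts; the toy words and numbers are assumptions, not values from the corpus.

```r
# Toy per-class counts, invented for illustration: two "documents", one per label.
counts <- data.frame(
  label = c("ham", "ham", "ham", "spam", "spam"),
  word  = c("list", "patch", "free", "free", "insurance"),
  n     = c(6, 3, 1, 4, 4),
  stringsAsFactors = FALSE
)

doc_total <- tapply(counts$n, counts$label, sum)  # total words per document
docs_with <- table(counts$word)                   # number of documents containing each word
n_docs    <- length(doc_total)                    # 2 documents: ham and spam

counts$tf     <- counts$n / as.numeric(doc_total[counts$label])
counts$idf    <- log(n_docs / as.numeric(docs_with[counts$word]))
counts$tf_idf <- counts$tf * counts$idf

print(counts)
```

Here "free" appears in both classes and scores zero, while "list", "patch", and "insurance" each appear in one class and keep a positive score.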
# ---- train-test ----
library(e1071)
# Create training/testing split
set.seed(123)
train_idx <- sample(seq_len(nrow(emails_clean)), size = floor(0.8 * nrow(emails_clean)))
train <- emails_clean[train_idx, ]
test <- emails_clean[-train_idx, ]
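# Baseline: Naive Bayes on the raw text column. Each email is one long string,
# so the model gets no word-level features; it predicts ham for every test email.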
model <- naiveBayes(label ~ text, data = train)
pred <- predict(model, test)
table(pred, test$label)
##
## pred ham spam
## ham 527 84
## spam 0 0
# ---- train-test-split ----
set.seed(123) # for reproducibility
n <- nrow(emails_clean)
# 70% of rows for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- emails_clean[train_idx, ]
test <- emails_clean[-train_idx, ]
# Quick sanity checks
table(train$label)
##
## ham spam
## 1777 359
table(test$label)
##
## ham spam
## 774 142
# ---- model ----
library(e1071)
# 1. Train/test split over rows of word_counts (one row per label-word pair, not per email)
set.seed(123)
train_idx <- sample(seq_len(nrow(word_counts)), size = 0.8 * nrow(word_counts))
train <- word_counts[train_idx, ]
test <- word_counts[-train_idx, ]
# 2. Train Naive Bayes model
model <- naiveBayes(label ~ word + tf_idf, data = train)
# 3. Predict on test set
pred <- predict(model, newdata = test)
# 4. Confusion matrix
table(
truth = test$label,
pred = pred
)
## pred
## truth ham spam
## ham 1433 5120
## spam 58 15066
# ---- metrics ----
# Rebuild the confusion matrix just to be safe
cm <- table(truth = test$label, pred = pred)
cm
## pred
## truth ham spam
## ham 1433 5120
## spam 58 15066
# Overall accuracy: how often we were correct
accuracy <- sum(diag(cm)) / sum(cm)
# Treat "spam" as the positive class
tp <- cm["spam", "spam"] # correctly predicted spam
fn <- cm["spam", "ham"] # spam that we called ham
fp <- cm["ham", "spam"] # ham that we called spam
tn <- cm["ham", "ham"] # correctly predicted ham
# Key metrics
recall_spam <- tp / (tp + fn) # of all spam, how many we caught?
precision_spam <- tp / (tp + fp) # of what we called spam, how many were really spam?
metrics <- tibble::tibble(
metric = c("Accuracy", "Recall (spam)", "Precision (spam)"),
value = c(accuracy, recall_spam, precision_spam)
)
metrics
## # A tibble: 3 × 2
## metric value
## <chr> <dbl>
## 1 Accuracy 0.761
## 2 Recall (spam) 0.996
## 3 Precision (spam) 0.746
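The three metrics above leave out the ham side. As a small sketch, the same confusion-matrix counts can be typed into a base-R matrix to get ham recall (specificity), the F1 score for spam, and the accuracy of a trivial rule that labels every row spam. The variable names are illustrative and not part of the original analysis.

```r
# Confusion matrix copied from the output above (rows = truth, columns = prediction).
cm_check <- matrix(
  c(1433, 58, 5120, 15066), nrow = 2,
  dimnames = list(truth = c("ham", "spam"), pred = c("ham", "spam"))
)

# Ham recall (specificity): of all ham rows, how many were labeled ham?
ham_recall <- cm_check["ham", "ham"] / sum(cm_check["ham", ])   # about 0.219

# F1 for the spam class: harmonic mean of precision and recall.
precision <- cm_check["spam", "spam"] / sum(cm_check[, "spam"])
recall    <- cm_check["spam", "spam"] / sum(cm_check["spam", ])
f1_spam   <- 2 * precision * recall / (precision + recall)      # about 0.853

# Accuracy of always predicting "spam" on this test set.
baseline_acc <- sum(cm_check["spam", ]) / sum(cm_check)         # about 0.698

print(round(c(ham_recall = ham_recall, f1_spam = f1_spam, baseline = baseline_acc), 3))
```

Read together, these numbers put the 76.1% accuracy in context: about 70% of the test rows are spam, and the model labels only about 22% of ham rows correctly.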
# ---- evaluation ----
library(yardstick)
library(ggplot2)
# Convert to factors
test$label <- factor(test$label, levels = c("ham", "spam"))
pred <- factor(pred, levels = c("ham", "spam"))
# Accuracy
accuracy_value <- mean(pred == test$label)
accuracy_value
## [1] 0.7611293
# Confusion matrix
conf_mat <- table(
truth = test$label,
pred = pred
)
conf_mat
## pred
## truth ham spam
## ham 1433 5120
## spam 58 15066
# ---- top predictive words (visual) ----
top_words <- word_counts %>%
arrange(desc(tf_idf)) %>%
group_by(label) %>%
slice_head(n = 15)
ggplot(top_words, aes(x = reorder_within(word, tf_idf, label),
y = tf_idf,
fill = label)) +
geom_col(show.legend = FALSE) +
facet_wrap(~label, scales = "free") +
scale_x_reordered() +
coord_flip() +
labs(title = "Top TF-IDF Words by Class (Spam vs Ham)",
x = "Words",
y = "TF-IDF Score")
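This project used a simple document classification workflow to separate spam and non-spam (ham) emails from the SpamAssassin public corpus. I loaded 2,551 ham and 501 spam messages from two folders (easy_ham and spam), cleaned the text (lowercasing, replacing every character other than a-z and apostrophes with a space, which drops punctuation and digits, and collapsing extra spaces), and then turned the emails into word-level features by tokenizing, removing stop words, and computing TF-IDF with each class treated as one document.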
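Two Naive Bayes models were fit. The first was trained on the raw text column with an 80/20 split of the emails; it predicted ham for all 611 test emails and so caught none of the 84 spam messages. The second was trained on the per-class word-count table, with word and tf_idf as predictors and an 80/20 split of that table's rows, so its confusion matrix counts label-word rows rather than emails. On those rows it reached 76.1% accuracy, with spam recall of 99.6% and spam precision of 74.6%. Most of the error is on the ham side, where 5,120 of 6,553 ham rows were labeled spam. There is clear room for improvement, starting with features and a train/test split defined per email.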
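One way to evaluate on whole emails is to split by message first and build word features from the training messages only. The sketch below is a hand-rolled multinomial Naive Bayes with add-one smoothing in base R. It is not the `e1071::naiveBayes` call used in this report, and the six toy emails, the fixed split, and all function names are assumptions made for illustration.

```r
# Toy corpus, invented for illustration; a real run would use emails_clean instead.
toy_emails <- data.frame(
  text = c("meeting notes attached for the patch review",
           "patch applied to the mailing list archive",
           "please review the list of meeting notes",
           "free insurance quote click now",
           "click now for free money",
           "cheap insurance free quote today"),
  label = c("ham", "ham", "ham", "spam", "spam", "spam"),
  stringsAsFactors = FALSE
)

# Split by EMAIL (row), so no message contributes to both sets.
toy_train <- toy_emails[c(1, 2, 4, 5), ]
toy_test  <- toy_emails[c(3, 6), ]

tokenize <- function(x) unlist(strsplit(x, " ", fixed = TRUE))

# Vocabulary and per-class word counts come from the training emails only.
vocab   <- sort(unique(tokenize(toy_train$text)))
classes <- sort(unique(toy_train$label))
class_counts <- matrix(0, nrow = length(vocab), ncol = length(classes),
                       dimnames = list(vocab, classes))
for (k in classes) {
  words_k <- tokenize(toy_train$text[toy_train$label == k])
  class_counts[, k] <- as.numeric(table(factor(words_k, levels = vocab)))
}

# Multinomial Naive Bayes with add-one (Laplace) smoothing.
log_lik   <- log(sweep(class_counts + 1, 2,
                       colSums(class_counts) + length(vocab), "/"))
log_prior <- log(as.numeric(table(toy_train$label)[classes]) / nrow(toy_train))

classify_email <- function(text) {
  words  <- tokenize(text)
  words  <- words[words %in% vocab]            # ignore words unseen in training
  scores <- log_prior + colSums(log_lik[words, , drop = FALSE])
  classes[which.max(scores)]
}

toy_pred <- vapply(toy_test$text, classify_email, character(1), USE.NAMES = FALSE)
print(table(truth = toy_test$label, pred = toy_pred))
```

With this layout each row of the confusion matrix is one email, so accuracy, recall, and precision describe messages rather than label-word pairs.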
The TF-IDF plot helps explain what the model picks up on. Because TF-IDF was computed with each class as a single document, a word that occurs in both classes gets an IDF of zero, so the top-ranked words are ones that appear in only one class. Certain words show up only in ham (for example, technical mailing-list words), while others show up only in spam (font names like “serif” and “helvetica”, and marketing or insurance terms). These patterns make sense: spam messages often share similar layout and sales language, while ham messages contain more regular conversational or technical vocabulary. In a real-world system, we could extend this approach by adding more features, trying more advanced models, and retraining regularly as spam tactics change.
Working through this assignment made me stop and think about how spam works and why it still slips through filters. It’s crazy that after all these years, spammers still try to make their messages look as “ham-like” as possible, almost like they’re trying to blend in at a party they weren’t invited to. When you hover over an email that claims to be from Home Depot and the sender address is something wild, you realize how much effort goes into pretending to be legitimate. Doing this project helped me understand why the line between spam and ham is always shifting: filters get smarter, spammers keep mimicking real emails, and that back-and-forth is why this type of classification is still such a big deal in the real world.