Project 4 – Document Classification of Spam and Ham Emails: Codebase
Introduction
This codebase implements a supervised document classification workflow using the SpamAssassin Public Mail Corpus. The goal is to classify email documents as either spam or ham using labeled training documents and then evaluate the model on withheld test documents.
The workflow includes reading raw email files, creating a structured document-level dataset, cleaning and tokenizing text, building text features, training a classifier, evaluating predictions, and visualizing important patterns in the data.
Load Required Packages
library(tidyverse)
library(tidytext)
library(stringr)
library(readr)
library(purrr)
library(fs)
library(e1071)
library(caret)
Define Project Paths
The raw email files were downloaded from the SpamAssassin Public Mail Corpus and extracted locally into the data/raw/ folder.
ham_dir <- "data/raw/20030228_easy_ham/easy_ham"
spam_dir <- "data/raw/20030228_spam/spam"
processed_dir <- "data/processed"
Validate Folder Structure
Before reading the email files, I first check that the expected raw data folders exist. This helps confirm that the project is being run from the correct root directory and that the extracted corpus files are stored in the correct location.
dir.exists(ham_dir)
[1] TRUE
dir.exists(spam_dir)
[1] TRUE
if (!dir.exists(ham_dir)) {
stop("The ham folder was not found. Please check the path: ", ham_dir)
}
if (!dir.exists(spam_dir)) {
stop("The spam folder was not found. Please check the path: ", spam_dir)
}
if (!dir.exists(processed_dir)) {
dir.create(processed_dir, recursive = TRUE)
}
Read Raw Email File Paths
Each individual file represents one email document. The class label is inferred from the folder where the file is stored.
ham_files <- dir_ls(ham_dir, type = "file")
spam_files <- dir_ls(spam_dir, type = "file")
length(ham_files)
[1] 2501
length(spam_files)
[1] 501
Create a Safe Email Reading Function
Some raw emails contain unusual characters or encoding issues. This helper function reads each file with a Latin-1 locale, which tolerates non-UTF-8 bytes, collapses the lines into one character string, and returns NA if the file cannot be read.
read_email_safe <- function(path) {
tryCatch(
{
read_lines(path, locale = locale(encoding = "Latin1"), progress = FALSE) |>
paste(collapse = " ")
},
error = function(e) {
NA_character_
}
)
}
Build Document-Level Dataset
The ham and spam files are converted into two labeled datasets and then combined into one document-level table. Each row represents one email document.
ham_emails <- tibble(
doc_id = paste0("ham_", seq_along(ham_files)),
label = "ham",
file_path = as.character(ham_files),
text = map_chr(ham_files, read_email_safe)
)
spam_emails <- tibble(
doc_id = paste0("spam_", seq_along(spam_files)),
label = "spam",
file_path = as.character(spam_files),
text = map_chr(spam_files, read_email_safe)
)
emails <- bind_rows(ham_emails, spam_emails) |>
mutate(
label = factor(label, levels = c("ham", "spam")),
text = str_squish(text)
) |>
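# drop unreadable or empty documents and the corpus's non-email "cmds" index files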
filter(
!is.na(text),
text != "",
!str_detect(file_path, "cmds$")
)
glimpse(emails)
Rows: 3,000
Columns: 4
$ doc_id <chr> "ham_1", "ham_2", "ham_3", "ham_4", "ham_5", "ham_6", "ham_7…
$ label <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, …
$ file_path <chr> "data/raw/20030228_easy_ham/easy_ham/00001.7c53336b37003a928…
$ text <chr> "From exmh-workers-admin@redhat.com Thu Aug 22 12:36:23 2002…
Validate Document-Level Dataset
These checks confirm that the dataset contains both classes, that each document has a label, and that there are no missing or empty email texts.
emails |>
count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 2500
2 spam 500
validation_summary <- emails |>
summarize(
total_documents = n(),
missing_labels = sum(is.na(label)),
missing_text = sum(is.na(text)),
empty_text = sum(str_trim(text) == "", na.rm = TRUE)
)
validation_summary
# A tibble: 1 × 4
total_documents missing_labels missing_text empty_text
<int> <int> <int> <int>
1 3000 0 0 0
if (validation_summary$missing_text > 0) {
warning("Some email files could not be read and produced missing text.")
}
if (validation_summary$empty_text > 0) {
warning("Some email files appear to have empty text.")
}
Save Processed Document-Level Dataset
The raw email files are converted into a structured CSV file for transparency and reproducibility. This processed file contains one row per email document.
emails_clean_export <- emails |>
mutate(text = str_squish(text))
write_csv(emails_clean_export, file.path(processed_dir, "spam_ham_emails.csv"))
Initial Class Distribution
This plot shows the number of ham and spam documents in the dataset and makes the class imbalance visible before modeling: the corpus contains roughly five ham documents for every spam document.
emails |>
count(label) |>
ggplot(aes(x = label, y = n)) +
geom_col() +
labs(
title = "Class Distribution of Spam and Ham Emails",
x = "Email Class",
y = "Number of Documents"
)
Text Cleaning and Tokenization
The next step is to convert the document-level dataset into a word-level dataset. Each email is tokenized into individual words, common stop words are removed, and only purely alphabetic words of at least three characters are kept.
data("stop_words")
email_tokens <- emails |>
select(doc_id, label, text) |>
unnest_tokens(word, text) |>
anti_join(stop_words, by = "word") |>
filter(
str_detect(word, "^[a-z]+$"),
str_length(word) >= 3
)
glimpse(email_tokens)
Rows: 705,559
Columns: 3
$ doc_id <chr> "ham_1", "ham_1", "ham_1", "ham_1", "ham_1", "ham_1", "ham_1", …
$ label <fct> ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham, ham…
$ word <chr> "exmh", "workers", "admin", "thu", "aug", "return", "path", "ex…
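For intuition, here is a minimal standalone sketch of the tokenization step on an invented string (the toy text and doc_id are hypothetical):
# Toy illustration: unnest_tokens lowercases the text, strips punctuation,
# and returns one row per word.
tibble(doc_id = "toy_1", text = "Win a FREE prize now!!!") |>
  unnest_tokens(word, text)
# yields the words "win", "a", "free", "prize", "now"; the three-character
# filter above would then remove "a", and the stop-word anti_join removes
# other common function words.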
Validate Tokenized Dataset
These checks confirm that tokenization produced a usable word-level dataset and that both classes are still represented.
email_tokens |>
count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 528901
2 spam 176658
email_tokens |>
summarize(
total_tokens = n(),
unique_words = n_distinct(word),
documents_with_tokens = n_distinct(doc_id)
)
# A tibble: 1 × 3
total_tokens unique_words documents_with_tokens
<int> <int> <int>
1 705559 31893 3000
Most Common Words by Class
This table shows the most common remaining words in ham and spam emails after stop word removal and basic filtering.
top_words <- email_tokens |>
count(label, word, sort = TRUE) |>
group_by(label) |>
slice_max(n, n = 15) |>
ungroup()
top_words
# A tibble: 30 × 3
label word n
<fct> <chr> <int>
1 ham received 14086
2 ham list 13388
3 ham localhost 12624
4 ham fork 10877
5 ham sep 9790
6 ham esmtp 8407
7 ham http 7518
8 ham subject 7151
9 ham mailto 6170
10 ham admin 6058
# ℹ 20 more rows
Top Word Frequency Visualization
This plot compares the most frequent words in spam and ham emails. It provides an initial look at vocabulary differences between the two classes.
top_words |>
mutate(word = reorder_within(word, n, label)) |>
ggplot(aes(x = word, y = n)) +
geom_col() +
coord_flip() +
facet_wrap(~ label, scales = "free_y") +
scale_x_reordered() +
labs(
title = "Most Common Words in Spam and Ham Emails",
x = "Word",
y = "Frequency"
)
TF-IDF Feature Exploration
TF-IDF gives more weight to words that are frequent within one class but rare or absent in the other, which helps identify the terms most distinctive for spam or ham. Here bind_tf_idf treats each class label as a single document, so tf is a word's share of all tokens in its class.
tfidf_by_class <- email_tokens |>
count(label, word, sort = TRUE) |>
bind_tf_idf(word, label, n) |>
arrange(desc(tf_idf))
tfidf_by_class |>
group_by(label) |>
slice_max(tf_idf, n = 15) |>
ungroup()
# A tibble: 30 × 6
label word n tf idf tf_idf
<fct> <chr> <int> <dbl> <dbl> <dbl>
1 ham fork 10877 0.0206 0.693 0.0143
2 ham rpm 5869 0.0111 0.693 0.00769
3 ham exmh 4972 0.00940 0.693 0.00652
4 ham zzzlist 2726 0.00515 0.693 0.00357
5 ham razor 2076 0.00393 0.693 0.00272
6 ham rssfeeds 1869 0.00353 0.693 0.00245
7 ham zzzzteana 926 0.00175 0.693 0.00121
8 ham wrote 925 0.00175 0.693 0.00121
9 ham devel 907 0.00171 0.693 0.00119
10 ham khare 740 0.00140 0.693 0.000970
# ℹ 20 more rows
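With only two class-level "documents", idf can take just two values: log(2/2) = 0 for words that appear in both classes and log(2/1) ≈ 0.693 for words unique to one class, which is why every idf in the table reads 0.693. A quick hand check against the "fork" row above:
# Hand check of the tf-idf values reported for "fork", which occurs only in ham:
log(2 / 1)            # idf ≈ 0.693
0.0206 * log(2 / 1)   # tf * idf ≈ 0.0143, matching the table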
TF-IDF Visualization
This plot shows the most distinctive words for each email class based on TF-IDF.
tfidf_by_class |>
group_by(label) |>
slice_max(tf_idf, n = 15) |>
ungroup() |>
mutate(word = reorder_within(word, tf_idf, label)) |>
ggplot(aes(x = word, y = tf_idf)) +
geom_col() +
coord_flip() +
facet_wrap(~ label, scales = "free_y") +
scale_x_reordered() +
labs(
title = "Distinctive Terms by Email Class Using TF-IDF",
x = "Word",
y = "TF-IDF"
)
Because the corpus contains raw email files, some frequent and distinctive terms come from email headers, mailing-list metadata, and HTML formatting. These terms are still useful for document classification because they are part of the raw documents the classifier sees, but they should be interpreted as document-level signals rather than purely message-body language.
Save Token-Level Feature Output
The token-level feature table is saved as a processed output. This file supports transparency because it shows the cleaned words and their class-level TF-IDF values.
write_csv(tfidf_by_class, file.path(processed_dir, "spam_ham_features.csv"))
Train/Test Split
The dataset is split into training and testing sets using a stratified split. This keeps the class proportions similar in both sets while making sure the testing data remains unseen during model training.
set.seed(607)
train_index <- createDataPartition(emails$label, p = 0.80, list = FALSE)
email_train <- emails[train_index, ]
email_test <- emails[-train_index, ]
email_train |> count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 2000
2 spam 400
email_test |> count(label)
# A tibble: 2 × 2
label n
<fct> <int>
1 ham 500
2 spam 100
Create Document-Term Matrix Features
For modeling, I create document-level word-count features. Only the most frequent terms are kept, up to 750 per class, which leaves 1,163 unique words after the two lists are deduplicated; this keeps the model manageable and avoids an extremely sparse feature table.
top_model_terms <- email_tokens |>
count(label, word, sort = TRUE) |>
group_by(label) |>
slice_max(n, n = 750) |>
ungroup() |>
distinct(word) |>
pull(word)
model_tokens <- email_tokens |>
filter(word %in% top_model_terms) |>
count(doc_id, word, name = "count")
dtm <- model_tokens |>
pivot_wider(
names_from = word,
values_from = count,
values_fill = 0
)
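# Re-attach labels via emails so all 3,000 documents are kept, then fill the
# word-count columns with zeros for documents that lost every token to filtering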
dtm <- emails |>
select(doc_id, label) |>
left_join(dtm, by = "doc_id") |>
mutate(across(where(is.numeric), ~replace_na(.x, 0)))
dtm_summary <- tibble(
documents = nrow(dtm),
total_columns = ncol(dtm),
feature_columns = ncol(dtm) - 2
)
dtm_summary
# A tibble: 1 × 3
documents total_columns feature_columns
<int> <int> <dbl>
1 3000 1165 1163
Create Training and Testing Matrices
For the final model, I use binary word-presence features instead of raw word counts. Each selected word is represented as either present or absent in a document. This approach works well for Naive Bayes because it focuses on whether important terms appear in an email rather than how many times they appear.
train_dtm <- dtm |>
filter(doc_id %in% email_train$doc_id) |>
select(-doc_id)
test_dtm <- dtm |>
filter(doc_id %in% email_test$doc_id) |>
select(-doc_id)
train_labels <- train_dtm$label
test_labels <- test_dtm$label
train_x <- train_dtm |>
select(-label) |>
mutate(across(everything(), ~factor(if_else(.x > 0, "yes", "no"),
levels = c("no", "yes"))))
test_x <- test_dtm |>
select(-label) |>
mutate(across(everything(), ~factor(if_else(.x > 0, "yes", "no"),
levels = c("no", "yes"))))
Train Naive Bayes Classifier
A Naive Bayes classifier is trained using binary word-presence features from the training set. Laplace smoothing is included to help handle words that may appear in one class but not the other.
nb_model <- naiveBayes(
x = train_x,
y = train_labels,
laplace = 1
)
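To see what the smoothing does, consider a hypothetical word that appears in none of the 400 spam training emails (the counts are illustrative):
# Without smoothing, P(word = "yes" | spam) = 0/400 would veto any email
# containing the word. With laplace = 1 added to each cell of the count
# table (two factor levels, so the denominator grows by 2):
(0 + 1) / (400 + 2 * 1)   # ≈ 0.0025: small, but no longer zero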
model_summary <- tibble(
model_type = "Naive Bayes",
feature_type = "Binary word-presence",
training_documents = nrow(train_x),
testing_documents = nrow(test_x),
laplace_smoothing = 1
)
model_summary
# A tibble: 1 × 5
model_type feature_type training_documents testing_documents laplace_smoothing
<chr> <chr> <int> <int> <dbl>
1 Naive Bay… Binary word… 2400 600 1
Generate Predictions
The trained model is used to predict labels for the withheld testing documents.
nb_pred <- predict(nb_model, newdata = test_x)
prediction_results <- tibble(
doc_id = email_test$doc_id,
label = test_labels,
predicted_label = nb_pred
)
prediction_results |> count(label, predicted_label)
# A tibble: 4 × 3
label predicted_label n
<fct> <fct> <int>
1 ham ham 496
2 ham spam 4
3 spam ham 7
4 spam spam 93
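If posterior probabilities are needed rather than hard labels (for example, to adjust the spam decision threshold), e1071's predict method for naiveBayes also accepts type = "raw"; a brief sketch, not used further in this workflow:
# Posterior class probabilities: one row per test document,
# one column per class ("ham" and "spam").
nb_prob <- predict(nb_model, newdata = test_x, type = "raw")
head(nb_prob)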
Evaluate Model Performance
The confusion matrix compares the predicted labels with the true labels, treating spam as the positive class.
confusion_output <- confusionMatrix(
data = prediction_results$predicted_label,
reference = prediction_results$label,
positive = "spam"
)
confusion_output
Confusion Matrix and Statistics
Reference
Prediction ham spam
ham 496 7
spam 4 93
Accuracy : 0.9817
95% CI : (0.9674, 0.9908)
No Information Rate : 0.8333
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9332
Mcnemar's Test P-Value : 0.5465
Sensitivity : 0.9300
Specificity : 0.9920
Pos Pred Value : 0.9588
Neg Pred Value : 0.9861
Prevalence : 0.1667
Detection Rate : 0.1550
Detection Prevalence : 0.1617
Balanced Accuracy : 0.9610
'Positive' Class : spam
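These headline statistics follow directly from the four cells of the table; a quick hand check:
# Recomputing the reported metrics from the confusion matrix counts:
93 / (93 + 7)      # sensitivity (spam recall): 0.93
496 / (496 + 4)    # specificity (ham recall): 0.992
93 / (93 + 4)      # positive predictive value (precision): ~0.9588
(496 + 93) / 600   # accuracy: ~0.9817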
Confusion Matrix Table
confusion_table <- as.data.frame(confusion_output$table)
confusion_table
  Prediction Reference Freq
1 ham ham 496
2 spam ham 4
3 ham spam 7
4 spam spam 93
Confusion Matrix Visualization
confusion_table |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), size = 5) +
labs(
title = "Confusion Matrix for Spam and Ham Classification",
x = "True Class",
y = "Predicted Class"
)
Metric Summary
metric_summary <- tibble(
metric = c("Accuracy", "Sensitivity / Spam Recall", "Specificity", "Precision"),
estimate = c(
confusion_output$overall["Accuracy"],
confusion_output$byClass["Sensitivity"],
confusion_output$byClass["Specificity"],
confusion_output$byClass["Precision"]
)
)
metric_summary
# A tibble: 4 × 2
metric estimate
<chr> <dbl>
1 Accuracy 0.982
2 Sensitivity / Spam Recall 0.93
3 Specificity 0.992
4 Precision 0.959
Metric Summary Visualization
metric_summary |>
ggplot(aes(x = metric, y = estimate)) +
geom_col() +
ylim(0, 1) +
labs(
title = "Naive Bayes Classification Metric Summary",
x = "Metric",
y = "Estimate"
)
Save Model Predictions
The final prediction results are saved as a processed output for transparency.
write_csv(prediction_results, file.path(processed_dir, "spam_ham_predictions.csv"))
Interpretation of Model Results
The final Naive Bayes model performed strongly on the withheld testing data. Out of 600 test emails, the model correctly classified 496 ham emails and 93 spam emails. Only 4 ham emails were incorrectly predicted as spam, and only 7 spam emails were incorrectly predicted as ham.
The overall accuracy was 98.17%, but accuracy alone is not a sufficient measure here because the dataset is imbalanced: a classifier that always predicted ham would already achieve the No Information Rate of 83.33%. The spam recall (sensitivity) was 93.00%, meaning the model identified most of the true spam messages. The precision was 95.88%, meaning that most emails predicted as spam were actually spam. The specificity was 99.20%, showing that the model was also very strong at correctly identifying legitimate ham emails.
The model therefore clearly outperforms a majority-class baseline while performing well on both classes. The small number of false positives and false negatives shows that the word-presence Naive Bayes approach is effective for this spam/ham classification task.